A Category of Systems

Exploring the consequences of Systems & Cybernetics in Engineering
Author Tom Westbury License Creative Commons Licence

A better type of type

Date 2020-07-18
Tags

This post is an in­tro­duc­tion to an in­cred­i­bly pow­er­ful pat­tern that is quite well es­tab­lished in the func­tional pro­gram­ming world but does­n’t seem to have per­co­lated back to the OOP world, let alone mod­el­ling. In­ter­est­ingly enough this pat­tern is all about ab­stract­ing from the way com­put­ers han­dle data val­ues and bring­ing data de­scrip­tion up to a more hu­man friendly lev­el. This pat­tern is known as the Al­ge­braic Data Type.

Be­fore we talk about al­ge­braic data types, lets ask what is a data type? In soft­ware en­gi­neer­ing, Data Types act like a trans­la­tion rule be­tween bits in the mem­ory and some use­ful mean­ing to the user. De­pend­ing on the level of your pro­gram­ming lan­guage, data types ab­stract away the ac­tual bits and bytes in the mem­ory be­ing used to store your data and pro­vide you with mean­ing­ful val­ues such as in­te­gers, char­ac­ters and strings. Some data types though force you to think a lit­tle harder about the im­ple­men­ta­tion, mis­un­der­stand­ing float­ing point arith­metic has lead to a num­ber of high pro­file fail­ures in­clud­ing a friendly fire in­ci­dent of a mis­sile sys­tem!.

There are a few dif­fer­ent ways that pro­gram­ming lan­guages treat types and val­ues which I’ll go over quickly be­fore we move on. Some pro­gram­ming lan­guages are weakly typed. In these lan­guages a pro­gram­mer can feed any value into any func­tion and the run­time will han­dle the er­rors if there are any. Lisp, Python and Javascript all live in this fam­i­ly.

Other lan­guages are strongly typed. This mean that the lan­guage’s func­tions can only be used on a vari­able if the types match up. Hav­ing strong types means that the com­piler pre­vents you from feed­ing val­ues into func­tions that don’t know what to do with them pre­vent­ing run­time er­rors but sac­ri­fic­ing some com­piler ef­fi­cien­cy. Ex­am­ples of this are C (although cast­ing al­lows you to be very cheeky with type­s), Java and the most rigid of the bunch, Ada. UML lives in this fam­ily too.

Fi­nal­ly, it is worth men­tion­ing that there are strongly typed lan­guages that have type sys­tems (no­tably the Hind­ly-Mil­ner type sys­tem) where the com­piler can in­fer the re­quired type sig­na­tures of func­tion de­f­i­n­i­tions. This is a sig­nif­i­cant ad­van­tage over lan­guages such as Ada where a lot of type boil­er­plate code is re­quired. No­table ex­am­ples in this fam­ily are Haskell and OCaml. Some of these types sys­tems are pow­er­ful enough to be used as math­e­mat­i­cal the­o­rem provers like Agda and Idris which may pave the way to prov­ably safe code in the fu­ture once the par­a­digm has taken hold.

In UML terms, how­ev­er, a data type is de­fined as a clas­si­fier whose in­stances are anony­mous. This means that its just a class with un­named ob­jects (the ob­jects are un­named as they cor­re­spond to val­ues). UML gives us three kinds of data types: struc­tured data types (a name given to any type that con­tains an at­tribute of an­other type) prim­i­tives and enu­mer­a­tions. Enu­mer­a­tion is an in­ter­est­ing choice of name be­cause it evokes the un­der­ly­ing mech­a­nism of the ab­strac­tion (though was prob­a­bly cho­sen to match enu­mer­a­tion types in C, a de­ci­sion mir­rored in the Rust pro­gram­ming lan­guage). Prim­i­tive data types are a UML cop-out:

A Prim­i­tiveType de­fines a pre­de­fined DataType, with­out any sub­struc­ture. A Prim­i­tiveType may have al­ge­bra and op­er­a­tions de­fined out­side of UML, for ex­am­ple, math­e­mat­i­cal­ly. The run-­time in­stances of a Prim­i­tiveType are val­ues that cor­re­spond to math­e­mat­i­cal el­e­ments de­fined out­side of UML (for ex­am­ple, the In­te­gers). [1]

I per­son­ally ob­ject to this part of the stan­dard and be­lieve that types and their as­so­ci­ated op­er­a­tions shouldbe brought into UML so that we can rea­son about which types best con­vey re­quire­ments and al­low model check­ers to catch type mis­use er­rors.

Part of the ar­gu­ment that I will try to con­vey in this post (but mainly the fu­ture one on type­class­es) is that know­ing the pos­si­ble val­ues that a data type can take and what func­tions can be per­formed upon it are not a crit­i­cal part of a UM­L/SysML de­sign. This would en­able use of UM­L/SysML model check­ing to en­sure that we’re not in­tro­duc­ing type er­rors at the re­quire­ments lev­el: if you try to feed some­thing of the wrong type into the wrong func­tion in your UML mod­el, let­ting the model pick that up saves some poor pro­gram­mer hav­ing to try and im­ple­ment con­tra­dic­tory re­quire­ments. This blog post will in­tro­duce al­ge­braic data types through a mo­ti­vat­ing ex­am­ple and then ex­plore a can­di­date no­ta­tion for defin­ing them in UML.

Al­ge­braic Data Types

The name “al­ge­braic data type” tends to con­jure up im­ages of com­plex maths but its sim­ply and ab­stract pat­tern to en­able us to de­fine new data types as the “sum” or “pro­duct” of ex­ist­ing data types. The prod­uct of types is al­ready fa­mil­iar to most mod­ellers and pro­gram­mers, it oc­curs when one type has val­ues that are a com­bi­na­tion of two or more other types. These are known as tu­ples when the con­stituent val­ues are not named and are called struc­tures or records when the con­stituent val­ues are named. In UML this is just the struc­tured data type.

The sum of two (or more) types, some­times called a “u­nion”, is a new type that has all of the val­ues of both of the types. In UML enu­mer­a­tion data types can be thought as the union of each of its enu­mer­a­tion lit­er­als. UML enu­mer­a­tion lit­er­als are not quite suited for this pur­pose as they mst have a de­fined set of val­ues that can­not be gen­er­ated by a func­tion. Some mod­ellers try and get around this by us­ing in­her­i­tance as a means to show that a data type can be ei­ther one value or an­oth­er. The prob­lem in this case is that the “in­put” data types to the sum can­not ex­ist in­de­pen­dently of the “out­put” (the su­per class) which does not con­vey the de­sired in­for­ma­tion to the reader of the mod­el. This is why we will need to in­tro­duce a new no­ta­tion for this con­cept.

Sum type using inheritance, not really worth it

A nice in­tro­duc­tion to this con­cept can be found in the guide to the Elm pro­gram­ming lan­guage: Types as Sets

Fail­ure in the sys­tem

To in­tro­duce the use­ful­ness of al­ge­braic data types, lets in­ves­ti­gate a com­mon sys­tems en­gi­neer­ing prob­lem; how to rep­re­sent fail­ure in the sys­tem. Imag­ine an op­er­a­tion in a model that di­vides two real num­bers. In a UML model we would see some­thing like this in the browser:

Di­vide(a : re­al, b : re­al) : real

Now imag­ine that we have per­formed some sort of HA­ZOPS process on the model and found that this func­tion cause some quite spec­tac­u­lar prob­lems when we try and di­vide by 0.

Don’t panic though, this is eas­ily caught by the op­er­a­tion it­self if we check to see if b is 0. We can catch that er­ror eas­i­ly, but how do we com­mu­ni­cate to down­stream func­tions that Di­vide has failed? Here are a few so­lu­tions:

1) Add an ex­tra fail­ure line

Prob­a­bly the most triv­ial an­swer, this so­lu­tion just adds a sep­a­rate fail­ure prop­a­ga­tion out­put to the func­tion.

A uml call operation with the signature safeDivide(a : real, b : real, * f : failure) : real

It does­n’t take much ex­trap­o­la­tion to see how quickly this would get com­pletely out of hand, dou­bling up all of the lines on your di­a­gram. UML only al­lows one re­turn type, in the di­a­gram the fail­ure is con­sid­ered an “ar­gu­ment by ref­er­ence” but the fail­ure could have been the re­turn type in­stead. The lan­guage does not have a con­ven­tion in this case mak­ing reusabil­ity awk­ward. Now imag­ine if the up­stream func­tion could fail too, we’d need to add an­other two in­puts! And how would these re­ceived fail­ures map to the out­put? This ap­proach is def­i­nitely not scaleable.

2) Pick an ar­bi­trary value to rep­re­sent the fail­ure

This time we will pick an un­used value of the data type to rep­re­sent the fail­ure. This is of­ten the favoured ap­proach of simulink en­gi­neers. If your data type is an int, enu­mer­ate your fail­ures from 255 back &c. This is most likely how a good com­piler will im­ple­ment your soft­ware er­ror prop­a­ga­tions any­way, but it does­n’t pre­serve the mean­ing in­side the model in a mean­ing­ful way. It de­feats the point of mod­el­ling some­what as it forces the reader to think fur­ther down the ab­strac­tion lay­ers. It also causes prob­lems if the value range of the data type has to change.

This method is also a step back from the first op­tion in terms of reusabil­i­ty, if you are re­quired to reuse this func­tion in a con­text where the er­ror value is re­quired as part of the nom­i­nal range, you’re go­ing to have to de­fine a new one with a new er­ror val­ue. This so­lu­tion is a case of the an­tipat­tern “ge­netic dis­or­der”. If the im­ple­men­tor of the soft­ware is us­ing a lan­guage like Haskell to code the so­lu­tion (see be­low for the Haskell so­lu­tion), you have forced their hand through the mis­use of UM­L/SysML to cre­ate a so­lu­tion in the soft­ware that is con­sid­ered bad prac­tice for that lan­guage. This hap­pens a lot with UML when try­ing to im­ple­ment spec­i­fied be­hav­iour in a dif­fer­ent pro­gram­ming par­a­digm from OOP.

3) Use the Maybe pat­tern

There’s no di­a­gram for this case as like case 2, we’re em­bed­ding the con­cept of fail­ure into the very data type it­self. ex­cept in this case, the lan­guage is pow­er­ful enough to let us em­bed the idea of fail­ure into the data type with­out the use of a un­scaleable bodge. As a mo­ti­vat­ing ex­am­ple from a pro­gram­ming lan­guage with these con­cepts baked in, Haskell has a pretty el­e­gant syn­tax for the Maybe con­cept:

Data Maybe a = Just a | Noth­ing

This is a Haskell type con­struc­tor de­c­la­ra­tion that tells us that we can cre­ate a Maybe type for any ex­ist­ing type (the a term in the de­cla­tion which is ba­si­cally a tem­plate pa­ra­me­ter) which, when fed a type (such as Int) cre­ates a type that may be an Int or may be noth­ing.

To il­lus­trate how this would be used in Haskell, the Safe­Di­vide op­er­a­tion would be im­ple­mented us­ing the fol­low­ing syn­tax:

safe­Di­vide :: (Int, Int) -> Maybe Int
safe­Di­vide (_ , 0) = Noth­ing
safe­Di­vide (a , b) = Just (a/b)

The first line tells us that safe­Di­vide is a func­tion that takes two in­te­gers and re­turns a Maybe Int. The sec­ond line tells us to re­turn Noth­ing if the de­nom­i­na­tor is 0 and not worry about the nu­mer­a­tor in this case (an un­der­score in Haskell means any val­ue). The third line tells us that for all other in­put pairs, re­turn the di­vi­sion wrapped in a Just. The next func­tion can then check to see if the in­put is a Noth­ing or a Just and act ac­cord­ing­ly.

A great in­tro­duc­tion to this pro­gram­ming pat­tern is the slide show on “Rail­way Ori­ented Pro­gram­ming” on F# for fun and profit. If you have a play with func­tions of the type a -> Maybe a you’ll prob­a­bly quickly come up against a few prob­lems of how to chain them to­geth­er. This is es­pe­cially true if you’re try­ing to con­cate­nate er­rors with Maybe’s cous­in, Ei­ther. To solve these prob­lems we’ll have to learn about more pat­terns such as Func­tors, Ap­plica­tive Func­tors and Mon­ads which are out of the scope of this post but I promise I’ll cover them at a later date!

Bak­ing in ADTs

Lets imag­ine that we could de­sign ADTs within UML, such as the Maybe type, what could the syn­tax look like? As we dis­cussed ear­lier, to cre­ate ADTs, we re­quire the abil­ity to de­fine new data types as the sum of two oth­ers or the prod­uct of two oth­ers. As a start­ing point, lets cre­ate two spe­cial­i­sa­tions of di­rected as­so­ci­a­tions to rep­re­sent our con­struc­tors: one called «pro­duct» to rep­re­sent the carte­sian prod­uct of 2 or more types and an n-ary as­so­ci­a­tion called «u­nion» to rep­re­sent that the own­ing type is a union of two or more types. The prod­uct type is rel­a­tively fa­mil­iar as it al­ready ex­ists in UML as a struc­ture, we are just us­ing the as­so­ci­a­tions to make it di­a­gram­mat­i­cally ex­plic­it. Now we have de­fined an ex­ten­sion for sum types, we can use them to cre­ate de­f­i­n­i­tions for all sorts of new types:

product type

Here we are us­ing the nor­mal com­po­si­tion re­la­tion­ship to show that this type is a prod­uct of the two oth­ers. The name of the prod­uct re­la­tion­ship acts as the name of the “prop­er­ty” in the data type in a sim­i­lar way to the way mem­bers of a class’s names show on the com­po­si­tion as­so­ci­a­tion.

Sum type

In this case the data type is the sum of two oth­ers. This is a tagged union; the names on the sum as­so­ci­a­tion give us a tag that we can use to ref­er­ence the val­ues of the con­stituent types.

Definition of the natural numbers

In this di­a­gram, we are us­ing the stan­dard type the­o­retic in­duc­tive de­f­i­n­i­tion of the nat­ural num­bers. We can read this as: a nat­ural num­ber is ei­ther 0 or an in­cre­ment of a nat­ural num­ber. This is prob­a­bly the most sim­ple type to de­fine in this sys­tem but it shows how we can use ADTs to de­fine data types from scratch within the lan­guage it­self. Note the “U­nit” data type is a type with one val­ue. This is needed to make the al­ge­bra of al­ge­braic data types an al­ge­bra (a semi­ring to be ex­ac­t). There should also be an Empty type that acts as the iden­tity of the prod­uct type.

Let’s think about this de­f­i­n­i­tion math­e­mat­i­cally for a minute. What does this de­f­i­n­i­tion ac­tu­ally say? Well, we can think of the noth­ing type as “0” and in­cre­ment as “+1” so our nat­ural num­bers are:

0 or 0+1 or (0+1)+1 or ((0+1)+1)+1 or …

Which is very much the de­f­i­n­i­tion of the nat­ural num­bers! Can we ap­ply this sort of re­cur­sive de­f­i­n­i­tion to more ex­cit­ing (and far more use­ful) data types? of course! Lets see how we can de­fine the idea of a list in­side of this pro­file:

List of Natural numbers

In this di­a­gram we have de­fined a list of nat­ural num­bers to show how even more in­ter­est­ing types can be de­fined. Here we can see that a Nat List is de­fined as ei­ther the empty list or a prod­uct of a nat­ural num­ber (the head of the list) and a list of nat­ural num­bers (the tail of the list). In a way, the list type is a Maybe type that can con­tain more than one val­ue.

You’re prob­a­bly look­ing at the above im­age right now and al­ready think­ing why should we have to de­fine a new list type any time we need a list of things of a cer­tain type. Es­pe­cially when the pat­tern is such that we could re­place the data type “Nat” on this di­a­gram with any type that we could imag­ine (in­clud­ing other list types to make lists of list­s). There is def­i­nitely an open­ing here to pro­vide some ab­strac­tion and reuse value by cre­at­ing “tem­plate” data types as we saw ear­lier with the Haskell maybe con­struc­tor. For now though I’ll leave this for an­other blog post.

Definition of the maybe type using the sum association

And fi­nal­ly, here’s the de­f­i­n­i­tion of the May­be­Real used in the mo­ti­vat­ing ex­am­ple.

As an ex­er­cise, why don’t you try this out your­self and cre­ate a di­a­gram to de­fine a bi­nary tree of in­te­gers us­ing just the el­e­ments de­scribed in this sec­tion?

Next time in this se­ries I’ll talk a lit­tle more about how we can gen­er­alise these data types for even greater reuse val­ue: Data type con­struc­tors and type class­es. These are the tools that we can use to cre­ate ver­sions of maybes and ei­thers that can be strung to­gether to al­low us to spec­ify sys­tems more ab­stractly with­out hav­ing to rely on C, C++ and Java lan­guage con­structs.

Ref­er­ences

[1] OMG 2017 OMG™ Uni­fied Mod­el­ing Lan­guage™ (OMG UM­L™) Ver­sion 2.5.1 http://www.omg.org/spec/SysM­L/1.4/