The FOMAcation
Eh, why not spend this precious weekend diving into the inner sanctum of finite-state transducers?
My progress is in the ncv9/foma-stuff repo.
OK but why do this in the first place
Currently, Ŋarâþ Crîþ inflection is implemented by a program called f9i. In its batch mode, f9i reads a text file with descriptions of lexical items and outputs a file in another format (depending on the command-line arguments) with all the inflections. The code that generates this website converts the dictionary source code (which is written in a language implemented in Racket) into the format expected by f9i. It then calls f9i either to generate HTML to embed into the page or, for the fancy-schmancy interactive version, to build an SQLite database stored as a separate file.
This has a few advantages. First, it’s fast! Just look up the form you have in the database. And f9i is pretty good at generating all the forms fast, hence the f in its name. More importantly, it’s simple! As long as you can write the code to inflect words forwards, you can do this for all the words you care about. You can make inflection require factoring prime numbers for all I care and it wouldn’t stop you from doing it this way.
This method comes with a few downsides, though. First of all, this approach is outright impossible if a word’s inflections can be literally infinite. But even if this isn’t the case for Ŋarâþ Crîþ, the language has rich morphology (40 forms for a noun, and a few thousand for a verb), making the resulting lookup table a lot bigger than the lexical entries themselves. If you visit the interactive dictionary, then some server controlled by GitLab will serve you 31 MB of SQLite database just so you can look up what ⟨ŋarâþ⟩ means. That’s because as of the time of writing, this file stores over 685 thousand unique inflected forms. In fact, I had to omit some of the verbal inflections (namely, participles with object affixes attached) to keep the size of the database manageable. Right now, the dictionary has about a thousand entries. If that number grows tenfold, then the database will become about ten times larger.
(Part of this problem comes from the fact that this site is completely static, which means that it can’t process the lookups on the server and give you only the data you need. Perhaps this will change some day.)
Is a smarter approach possible? Maybe for Arka, where suffixes apply pretty cleanly, but it seems hopeless for Ŋarâþ Crîþ. It turns out, however, that as long as your phonological rules can be modeled by something called a regular relation, you can represent them using a finite-state transducer, which is like a finite-state automaton but with pairs of input and output symbols instead of individual symbols. Although this is pretty complicated, we don’t have to implement everything from scratch because there are programs that help with this.
We’ll be using plain old-fashioned FSTs, with none of that two-level nonsense. To get started, I modeled my work after PyFoma’s official tutorial, but I soon ran into some problems.
Initially, I tried using the original C version of Foma, but it seemed to crash a lot, so I switched to using PyFoma, despite hating Python with the heat of 65,536 burning Tergio. The choice of language wasn’t as onerous as I thought. I barely had to write any actual Python! Unfortunately, everything else is way worse.
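If you’ve never touched an FST before, here’s a toy example in PyFoma’s style, with nothing to do with Ŋarâþ Crîþ yet: a transducer that rewrites every a to b and passes everything else through.

from pyfoma import FST

# a toy rewrite rule: map every a to b, leave everything else alone
toy = FST.re("$^rewrite(a:b)")
print(list(toy.generate("banana")))  # ['bbnbnb']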
Almost everything in Ŋarâþ Crîþ has multiple stems
And almost everything in Ŋarâþ Crîþ – nouns, verbs, relationals – has multiple paradigms. Nouns, for example, have six declension classes. The first declension has three stems: N (for nominative), L (for locative), and S (for semblative), whose uses are divided up by case. It also has three smaller pieces used in the paradigm. Θ is called a thematic vowel, though not in the sense Indo-Europeanists use the term; it’s just the vowel that occurs in the ending of the lemma form. Σ is the consonant of that ending, and Λ is another vowel used in the locative direct form.
You can’t specify your morphemes directly in your lexicon. (Or at least I don’t think you can. Let me know if you can.) This is what the lexicon for my nouns looks like:
grammar = {}
grammar["Noun"] = [
# first-declension nouns have N, L, S, ΘΣ, Λ
("ga:ši:d/ga:šjo:d/gel:ši:d/a/a", "N_I"),
("el:t/il:t/el:d/es/a", "N_I"),
("ŋa:r/ŋôr:þ/ŋa:l/âþ/a", "N_I"),
("to:vr/te:vr/te:v/a/a", "N_I"),
]
grammar["N_I"] = [("'[N.I]'", "N_Case")]
grammar["N_Case"] = [("'[Nom]'", "N_Num")]
grammar["N_Num"] = [("'[Di]'", "#"), ("'[Du]'", "#"), ("'[Pl]'", "#"), ("'[St]'", "#"), ("'[Gc]'", "#")]
And the job of picking the right stem and other parts of the entry falls to the alternation rules:
defs = {'lexicon': lexicon}
# Helper expression for getting the parts we want
defs['chunk'] = FST.re("[^/]*", defs)
# keep the first (N) stem; collapse the next two stems and their slashes into a single '-'
defs['stem03'] = FST.re("$chunk ('/' $chunk '/' $chunk '/'):'-'", defs)
# ...
# First-declension nouns
# the ΘΣ piece: a bare theme vowel, or one with its trailing consonant deleted
defs['NITheme'] = FST.re("[aeo]|([ae][sn]:'')|((â:a|ê:e)þ:'')", defs)
# the unmodified lemma ending: the input projection of NITheme
defs['NIEnd'] = FST.re("$^input($NITheme)", defs)
# ...
# nominative direct: N stem plus lemma ending; the Λ chunk and the tags are deleted
defs["NINomDi"] = FST.re("$stem03 $NIEnd ('/' $chunk '[N.I]' '[Nom]' '[Di]'):''", defs)
# ...
This is not entirely elegant, but I haven’t found another way to do this, especially since some noun declension paradigms (yes, you, III) don’t allocate their stems cleanly across case lines.
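The overall shape, then, is a union of all the per-form rules with the lexicon composed onto it. This is a sketch rather than the actual code (only NINomDi is defined above), and it assumes PyFoma’s generate/analyze methods for running a transducer forwards and backwards:

# union every per-form rule; only NINomDi appears above
defs['NForms'] = FST.re("$NINomDi", defs)  # | $NINomPl | ... in reality
noun_fst = FST.re("$lexicon @ $NForms", defs)
# noun_fst.generate(...) maps a lexical entry plus tags to surface forms;
# noun_fst.analyze(...) runs the whole thing in reverse for lookups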
Some paradigms have more paradigms underneath their skin
Apparently nominative plurals and singulatives don’t work the same for all first-declension nouns?
# AAAAAAAAAAAAAAAAAAAA
defs['NIThemeNomPl'] = FST.re("([aeo]@$pi) | (([ae]@$gamma)s) | (([âê]@$pi)þ) | (([ae]@$pi)(n:''))", defs)
defs['NIThemeNomSt'] = FST.re("(([aeo]@$gamma)'':l) | ('':'<Fn'([ae]@$gamma)s) | ('':'<Fn'([âê]@$gamma)þ) | ('':'<Fn'([ae]@$gamma)(n:l))", defs)
# ...
defs["NINomPl"] = FST.re("$stem03 $NIThemeNomPl ('/' $chunk '[N.I]' '[Nom]' '[Pl]'):''", defs)
defs["NINomSt"] = FST.re("$stem03 $NIThemeNomSt ('/' $chunk '[N.I]' '[Nom]' '[St]'):''", defs)
# ...
Fortunately, most of the forms won’t need this special treatment.
Hyphens hate you
Ŋarâþ Crîþ has the concept of morpheme boundaries, colloquially called hyphens. These are important for determining where morphophonological changes occur.
For example, certain duplicate consonants around vowels, such as ⟦þ⟧ or ⟦v⟧, are called oginiþe cfarðerþ and get resolved by changing the first occurrence of the consonant. The rules describing these changes turn ⟦mov-a-ve⟧ ‘he/she/it makes you float’ into ⟨monave⟩, avoiding the duplicate ⟦v⟧.
Sometimes, though, we want to preserve these environments in borrowed words. If we want to adapt the place name of Babel, then we want it to be written as ⟨*@vavel⟩ and not ⟨*@navel⟩. The native word ⟨viviþ⟩ also escaped deduplication. In these cases, we consider the duplication to occur within a single morpheme, so restricting phonological changes to occur across morpheme boundaries lets us preserve these oddities.
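As a transducer, a rule like this might look like the sketch below, assuming PyFoma’s context syntax for $^rewrite and abbreviating the vowel inventory to [aeo]; it is not the real rule:

# change v to n when another v follows with only vowels and at least
# one hyphen in between (so the change happens across a boundary)
defs['dedupV'] = FST.re("$^rewrite(v:n / _ [aeo]* '-' ([aeo]|'-')* v)", defs)
# ⟦mov-a-ve⟧ -> ⟦mon-a-ve⟧; hyphen-less ⟨viviþ⟩ is left alone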
Right now, Ŋarâþ Crîþ describes hyphens as working sequentially, so something like a-b-c is evaluated as (a-b)-c. This means that you have to concatenate the first two items, then repeat that step until the hyphens are All Gone™. This probably isn’t friendly to an FST-based approach. We could specify enough iterations of concatenating the first two components, but doing so would probably result in huge state machines.
The most obvious solution is to resolve all morpheme boundaries at once. Interestingly, this was the way that Ŋarâþ Crîþ and f9i worked before Project Elaine; the hyphen was considered a full member of the layer-0 representation, and the morphophonological rules were applied with the whole word in scope.
This changed when Project Elaine wanted to introduce a “stem fusion” operation so that suffixes attached to stems could effectively start with a ⟦n⟧, ⟦þ⟧, or ⟦t⟧. The difficulty of ensuring that stem fusion rules would always lead to a valid stem under the old system prompted me to look at affixes not as raw strings of letters but rather as paths through a finite state machine. Consequently, each hyphen would lie on a specific state.
However, it was a difficult task to represent such paths efficiently (that is, without a glut of heap allocations). My initial attempt at this in Rust, intended to work with all types, turned out to be a nightmare to implement and use, and I opted to write a build script to do this with my specific use case, which still proved to be a nightmare but on a smaller scale.
(In ps9, a scrapped experimental reimplementation of f9i, I used type parameters to represent the start and end state of each path and type-erased the contents. The path types also have a usize parameter that acts as the lower bound on the length of the path. This turns out to be yet a third nightmare.)
Since representing these paths without the hyphens was hard enough, I thought that it would be too hard to make the respective types store these hyphens. As a result, hyphens no longer occurred physically in layer 0 but rather represented the binary concatenation operation with phonological changes.
And after all those changes, f9i got faster at Generating All the Inflections™. At that time, I wasn’t sure that it would be possible to do better than that for Ŋarâþ Crîþ, with all the morphophonology going on, so I was content with those results. However, over time, I started wanting to find a better way for reasons mentioned earlier.
Enough with the tangent. We’re no longer content with merely reimplementing Ŋarâþ Crîþ morphology in PyFoma; we’re changing the rules themselves again. I actually tried doing this some time after the threesome of Elaine, Caladrius, and Nibiru, in Project Shiva.
Wait, what?
(The relevant text in the current version of the grammar is slightly different now, but don’t worry about that.)
Yes, for some reason, paths through the State Machine™ also got called assemblages. But worse yet, something like ⟦reþ-eþ⟧ is ambiguous! It turns out that (*gasp*) some transitions can be associated with an empty string, and you can’t move hyphens willy-nilly, even across these empty transitions, without changing the result.
As the Project Shiva document points out, the culprit is something called bridge repair. A bridge in Ŋarâþ Crîþ parlance is the coda of one syllable plus the onset of the next, and this process, in addition to simplifying awkward combos like ⟦rþ-cþ⟧, makes the bridge conform to the maximal-onset rule. In the case when we append the syllables ⟦reþ⟧ and ⟦eþ⟧ together, the first syllable has a coda of ⟦þ⟧ and the second has an empty onset. Since we can redistribute these consonants to make a longer onset, we do so, transferring the ⟦þ⟧ to the onset of the second syllable.
But wait! Where does the hyphen go? It stays in place because I forgot to worry about that! And the ⟦þ⟧ sneaks over the hyphen, the oginiþe cfarðerþ remains unseen, and cities fall to ruin, with wails being heard in the background.
The solution that went into effect was to introduce new symbols to disambiguate the hyphens. Now you can get them in strawberry, grape, orange, and nectarine flavors, and while the cities are still giant heaps of rubble, at least you can look the other way. Right? Kinda?
But Project Shiva proposes revising the deduplication rules to cover more consonants duplicating around a vowel and, ideally, to leave no situations like these. Maybe ⟦reþ-eþ⟧ should produce ⟨reteþ⟩, regardless of whether the hyphen is strawberry- or grape-flavored. More precisely, we want the layer 1 result to stay the same if any hyphens move through empty components in the State Machine™. Ensuring that this holds will let us use one single type of hyphen instead of juggling all these flavors.
Also, Ŋarâþ Crîþ allows one morpheme ending in the o state to be joined before another starting in the g state, merging the overlapping glides. In the new model, we’ll have to deal with things like ⟦gercj-jel⟧ for the genitive plural of ⟨ercjor⟩ (shield).
Okay, but how do we make sure the result is phonotactically valid now?
We can represent the set of all phonotactically valid words as a regular language V (in this case admitting hyphens). If we can express our morphophonological changes as a regular relation R, then we can compose V with R to get the output language R(V), which is also regular. We can then determine if R(V) ⊆ V.
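In PyFoma terms, a toy version of this check could look like the following sketch, with stand-ins for the real phonotactics and rules:

from pyfoma import FST

# toy stand-ins: V = valid words, Vh = valid words admitting hyphens,
# R = the morphophonological rules (here: just delete hyphens)
V = FST.re("[a-z]+")
Vh = FST.re("[a-z]+ ('-' [a-z]+)*")
R = FST.re("$^rewrite('-':'')")
# the image R(Vh): the output projection of the composition
image = FST.re("$^output($Vh @ $R)", {'Vh': Vh, 'R': R})
# R(Vh) ⊆ V holds exactly when this intersection is empty
bad = FST.re("$image & ~$V", {'image': image, 'V': V})
# spot-check membership: an acceptor maps a word to itself iff it accepts it
print(any(True for _ in V.generate("monave")))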
Markers hate you
Ŋarâþ Crîþ has a set of letters that appear word-initially to mark different types of words; hence the name markers. The nef, romanized as ⟨*⟩, denotes a word of foreign origin; the sen (⟨&⟩) indicates reduplication of an unspecified prefix of the word, and the rest (⟨#⟩, ⟨+⟩, ⟨+*⟩, and ⟨@⟩) denote different types of names. They kind of sit out in space, not really interacting with anything, with one exception we’ll get to later.
If we don’t strip out the marker during inflection (which we can’t really do anyway, for a reason we’ll get to later), then we not only have to recognize a “word boundary” when we reach a marker but also treat prefixes specially so that the marker moves in front of the prefix.
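I haven’t worked this out in full, but the hoisting part might look like this sketch, with an invented prefix ⟦ci-⟧ standing in for a real one:

# hypothetical: move the nef over the (invented) prefix ⟦ci-⟧,
# so input ⟦ci-*vavel⟧ comes out as ⟦*ci-vavel⟧
defs['hoistNef'] = FST.re("('':'*') c i '-' ('*':'') $chunk", defs)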
The stem fusion rules were written for Rust, not Foma
Do you remember this thing called stem fusion? Look at this sentence hidden in the notation section:
Earlier rules take precedence over later ones.
Okay, fair enough. While PyFoma doesn’t have Foma’s “priority union” operator, it can be implemented using simpler operations, namely ($x | (~$^input($x) @ $y)). And PyFoma supports passing in custom functions to be used in regexes, so we don’t literally have to spell out these operations every time.
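Here’s what that helper might look like, assuming PyFoma’s mechanism of invoking callables from defs as $^name(…); ruleA and ruleB are hypothetical:

from pyfoma import FST

def punion(x: FST, y: FST) -> FST:
    # x takes precedence; y applies only to inputs that x doesn't accept
    return FST.re("$x | (~$^input($x) @ $y)", {'x': x, 'y': y})

defs['punion'] = punion
# later, e.g.: defs['fusion'] = FST.re("$^punion($ruleA, $ruleB)", defs)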
Alternatively, since we use a marker symbol to denote stem fusion, we can remove the marker when a rule matches, and the resulting string won’t be processed by later stem fusion rules. Slap a priority union to let the non-matching cases pass through, and then compose the rules together.
The latter approach also allows handling rules that are formulated recursively. If each recursive rule can recurse only into rules later than it, then one pass is enough to implement stem fusion. If any rule can recurse into an earlier one (as in the (Ccc) rule with a null fusion consonant), then we need multiple iterations through all the rules; in that case, we can get away with the priority-union approach for matching the rules if we’re willing to apply even more iterations.
To figure out how many iterations are required, we can construct a directed graph that shows how rules can recurse into each other. If a rule r is recursive, and the recursion can match rule s for some input that matched r, then we draw an edge from r to s. Then we need as many iterations as the length of the longest path in the graph.
Doing this requires manual analysis (which is hard!), so another way to do this is to compose the rules an increasing number of times until the output can no longer contain any marker symbols for stem fusion. And hope that stem fusion only requires a bounded number of recursions.
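The brute-force version is at least short. A sketch: compose the whole fusion block with itself k times, leaving the choice of k (via the marker check above) to the caller.

from pyfoma import FST

def iterate(rule: FST, k: int) -> FST:
    # compose `rule` with itself k times
    result = rule
    for _ in range(k - 1):
        result = FST.re("$result @ $rule", {'result': result, 'rule': rule})
    return result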
But having to mess with all of this might be a sign that our stem fusion rules are poorly written.
The pips of the dice go SQUEAK SQUEAK SQUEAK
Ŋarâþ Crîþ assigns a numeric value to each letter. For some inflections, the values of the letters of a certain stem or form are added up and taken modulo some number to provide a sort of pseudo-randomness, hence the term roll.
Determining if the letter sum modulo a fixed constant is a certain value is within the realm of a regular language. It could be worse:
The notation ⟦x₀ x₁ … xₙ₋₁⟧ √« y is used to mean that the integer square root of the letter sum of y should be taken modulo n and used as an index into the list.
The notation ⟦x₀ x₁ … xₙ₋₁⟧ popcnt« y is used to mean that the number of 1’s in the binary representation of the letter sum of y should be taken modulo n and used as an index into the list.
Yeah, thank goodness we don’t live in that world.
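To see why the check we do live with stays regular: the residues modulo m are exactly the states of a DFA, with one transition per letter. A sketch in plain Python, with invented letter values:

VALUES = {'c': 3, 'o': 7, 'r': 11, 'þ': 2}  # invented values for illustration

def sum_mod(word: str, m: int) -> int:
    state = 0                             # start state: residue 0
    for ch in word:
        state = (state + VALUES[ch]) % m  # one DFA transition per letter
    return state                          # accept iff this equals the target residue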
Unfortunately, the fun ends there. The fourth and fifth declensions of nouns use rolls based on the letter sums of inflected forms, not of any stem individually. This means that we have to compute other inflected forms before we can inflect the forms that require these rolls. As an extreme example, to get the dative dual form of a fifth-declension noun, we need to compute the following forms:
- the nominative generic, giving us ⚁
- the nominative dual using ⚁, giving us ⚂
- the accusative direct using ⚂, giving us ⚃
- and finally the dative dual using ⚃
This means that to get the correct roll for inflecting a word, we might have to reapply the morphological rules up to four times. Presumably, this could work by surrounding the raw morphemes of interest in some kind of delimiter, converting the outside hyphens into another character, and writing the morphological rules Extra Carefully™ so that they apply only to the currently delimited word.
A simpler approach is to keep the rolls as a part of the lexical entry, but this will give incorrect roll values with compound words. (This is why verbal inflections don’t use any rolls.) Alternatively, we could output all possibilities and verify that any output claiming a word is an inflection that uses a roll has a compatible letter sum for the appropriate inflection. For example, the accusative direct form of ⟨corþ⟩ certainty could be ⟨corðen⟩, ⟨corðan⟩, or ⟨corðin⟩, depending on the letter sum of the nominative direct form. So if we get one of these results for a reverse lookup, we compute the nominative direct (namely, ⟨corþ⟩) and take its letter sum. In this case, the sum tells us that the ⚀ vowel must be ⟦e⟧, so we can retain the result for ⟨corðen⟩ but must discard it for the other two forms.
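The filtering step might then look like this sketch, where the letter values, the modulus, and the residue-to-vowel table are all invented:

VALUES = {'c': 3, 'o': 7, 'r': 11, 'þ': 2}  # invented, as before
ROLL_VOWELS = ['e', 'a', 'i']               # assumed ⚀ vowel per residue

candidates = ['corðen', 'corðan', 'corðin']
roll = sum(VALUES[ch] for ch in 'corþ') % len(ROLL_VOWELS)
kept = [c for c in candidates if c == 'corð' + ROLL_VOWELS[roll] + 'n']
# with these made-up numbers the roll is 2 and ⟨corðin⟩ survives;
# the real letter values would pick ⟨corðen⟩ instead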
If we really want to simplify things and don’t mind changing the rules, then we can base rolls on the letter sum of a certain stem across the board, which was done for Ŋarâþ Crîþ’s recent relational reform. This keeps the spirit of using letter sums as a sort of quasi-randomness but is probably easier to do than bothering with sub-inflections.
Some rolls in the fifth declension also specify incrementing the letter sum until some condition is met. This adds a minor complication, but less so than the previous problem.
And guess what? Markers have numeric values, too, so they can affect the letter sum. That’s why we can’t ignore them during inflection altogether.
Oh yeah mutations exist
Ŋarâþ Crîþ has two different types of mutations, which coincidentally have the same names as those of Irish. While lenited consonants (those with a ⟨·⟩, because ⟨h⟩ is already taken) can occur in the middle of words, eclipsed consonants (written in about the same way as in Irish) can occur only word-initially. That’s not much of a problem most of the time, since mutations are usually a finishing touch on an inflection, but one interrogative pronoun, ⟨penna⟩ who?, has an intrinsically eclipsed S stem ⟦mpad-⟧. I’m sure that Ŋarâþ Asoren has some funky things going on with multiple mutations, but in Ŋarâþ Crîþ, any mutated consonant resists further mutation.
If we wanted to add additional words with intrinsically mutated stems, then the stems can’t be the L stems of third- or sixth-declension nouns, since these take prefixes in the instrumental- and abessive-case forms. They also can’t be verbs or relationals, since these take derivational prefixes. And I hope nouns don’t get any derivational prefixes or else.
Conclusion
This was supposed to be a silly weekend project, but this article took me over two weeks to write, mostly because I’ve had to organize my thoughts for an audience unfamiliar with my conlang – and hopefully not in vain. And I haven’t even finished the whole thing because I think I need to sort out the problems I’ve outlined above before I press on. As you can see, Ŋarâþ Crîþ isn’t your grandma’s average conlang, but the features I mentioned add a sort of charm to the language, and I think they can be adapted to better fit the model of regular relations and finite-state transducers.
If any of this has interested you, feel free to hop on to the Ŋarâþ Crîþ v9 Discord server. Alternatively, you can go onto the /r/NecarassoCryssesa subreddit, open an issue on this site’s issue tracker, or send me an email (my email address, in letter numbers, is 19 6 19 −1 15 23 12 6 4 13 4 2 32 9 15 7 [dot] 0 4 32).
Finally, here’s an image I drew.
- cespj-eacþ
- acknowledge-not_only_but_also
- ves-cþantr-ifos
- 2gc-appreciate-inf.dat.subj
- roc.
- on_behalf_of