Regular systems

In this chapter, we introduce a notion of regular systems, which are a collection of related regular languages. Specifically, a regular system consists of a finite state machine, yielding a language for each pair of start and end states. If the set of phonotactically valid words in a natural language is a regular language, then substrings of words can be considered to belong in one of the languages yielded by a corresponding regular system.

Figure 1: The finite state machine describing the phonotactics of Ŋarâþ Crîþ.

An example is Ŋarâþ Crîþ’s finite state machine (Figure 1), which defines five states ${s, g, o, n, ω}$ . Each transition carries a payload defining a component of a syllable. For instance, the transition from $s$ to $g$ is labeled with $Ι$ , the set of all initials, indicating that any initial is sufficient to transition from $s$ to $g$ . A phonotactically valid Ŋarâþ Crîþ word starts from $s$ to $ω$ , but individual morphemes can be between a different pair of states. For instance, the morphemes in cenv-as-le you write it have the following structures:

cenv- from $s$ to $o$ . In the case that this has at least one full syllable and does not end with a lenited consonant, as is here, this is called a stem.
-as from $o$ to $s$
-le from $s$ to $ω$

Formally, we define a language system $Λ$ over a set of states $Q$ and an alphabet $Σ$ as a function $Q \times Q \to 𝒫 (Σ)$ that has the following properties:

Identity: For all $q \in Q$ , $ε \in Λ (q, q)$ .
Transitivity: For all $p, q, r \in Q$ , $Λ (p, q) \cdot Λ (q, r) \subseteq Λ (p, r)$ , where $\cdot$ denotes the concatenation of languages. We say that a language system has strict transitivity if $Λ (p, r) = ⋃_{q \in Q} Λ (p, q) \cdot Λ (q, r)$ .

A regular system is a language system in which all languages $Λ (p, q)$ are regular.

Given a subset of states $Q^{'} \subseteq Q$ , we can define a subsystem $Λ^{'} = Λ |_{Q^{'} \times Q^{'}}$ over $Q^{'}$ and $Σ$ .

Semiautomata

A semiautomaton (DSA) is a deterministic finite automaton that does not specify a start or accept state. A DSA $M = (Q, Σ, δ)$ yields a regular language $M (p, q)$ for every $p, q \in Q$ , as supplementing starting and accepting states to a semiautomaton yields a DFA. Additionally, these languages have the identity and transitivity properties required for regular systems. These properties (except for perhaps strict transitivity) still hold when we only look at a subset of states $Q^{'} \subseteq Q$ .

We also introduce the notion of a nondeterministic semiautomaton (NSA), which can be seen as a nondeterministic finite automaton without starting or accepting states. More formally, an NSA is a tuple $M = (Q, Σ, δ)$ where $δ : Q \times Σ \to 𝒫 (Q)$ is the transition function. For states $p, q \in Q$ , $M (p, q)$ is the set of strings that can be generated by a path from $p$ to $q$ . We also define nondeterministic semiautomata with ε-moves (NSA-ε) similarly, in which transitions without an input symbol are allowed. For similar reasons, NSAs and NSA-εs have the properties of regular systems, and these properties (except for perhaps strict transitivity) still hold when we only look at a subset of states $Q^{'} \subseteq Q$ . It follows that any regular system can be converted into a subsystem of an NSA-ε by applying Thompson’s construction algorithm between each pair of input states.

However, while it is possible to translate an NFA into a DFA, translating an NSA into a DSA is less trivial. In general, it is not possible to translate an NSA into a DSA with the same regular system if each state of the NSA must correspond to exactly one state in the DSA. A single accepting state in the NSA might correspond to multiple accepting states in the DSA as shown in Figure 2.

Figure unavailable in HTML. Alt text: n/a

Figure 2: A counterexample to the conjecture that every NSA can be translated into a DSA with an equivalent regular system: the

q_{0}

and

q_{1}

states both correspond to multiple accepting states in the DSA.

Start	End	Language
$q_{0}$	$q_{0}$	(ab)
$q_{0}$	$q_{1}$	a+(ba)
$q_{1}$	$q_{0}$	b+(ab)
$q_{1}$	$q_{1}$	(ba)

Table 1: The regular expression for each language of the system described by Figure 2.

However, it is possible to map each state of the NSA to a set of states in the DSA. In Figure 2, $q_{0}$ in the NSA $M$ would correspond to ${{q_{0}}, {q_{0}, q_{1}}}$ in the DSA $M^{'}$ and $q_{1}$ would correspond to ${{q_{1}}, {q_{0}, q_{1}}}$ . Let $D$ be the function mapping states in $M$ to corresponding sets of states in $M^{'}$ .

However, note that given two states $p$ and $q$ of $M$ , the existence of a path from some $p^{*} \in D (p)$ to $q^{*} \in D (q)$ in $M^{'}$ accepting a string $s$ does not imply that $s \in M (p, q)$ . For example, while $M (q_{0}, q_{1})$ does not accept the empty string, $M^{'} ({q_{0}, q_{1}}, {q_{0}, q_{1}})$ does. Therefore, we require that for all $p^{*} \in D (p)$ , there exists a state $q^{*} \in D (q)$ such that $M^{'}$ accepts $s$ from the state $p^{*}$ to $q^{*}$ .

More generally, we can establish a bijection between the set of abstract states $𝒬$ of a regular system $Λ$ and the set of sets of concrete states $Q$ of a semiautomaton $M$ . Then a string $s$ is in $Λ (p, q)$ if and only if for all $p^{*} \in p$ , there exists a $q^{*} \in q$ such that there is a path from $p^{*}$ to $q^{*}$ in $M$ for $s$ :

$\begin{array}{lrlr} Λ (p, q) & = ⋂_{p^{*} \in p} ⋃_{q^{*} \in q} M (p^{*}, q^{*}) \end{array}$

In this text, we concern ourselves only with DSAs and their subsystems. This is sufficient for most natural languages if we take the various components of a syllable, including null segments, as symbols themselves and augment each symbol with its source and destination states. For instance, Ŋarâþ Crîþ tfełor would be represented as

$\begin{array}{lrlr} ( & (tf, s, g), \\ (ε, g, o), \\ (e, o, n), \\ (ε, n, s), \\ (ł, s, g), \\ (ε, g, o), \\ (o, o, n), \\ (r, n, ω)) \end{array}$

We refer to strings represented this way as fully syllabified strings or assemblages and any element of an assemblage as an assemblage unit. A disambiguated regular system is a regular system $Λ$ over $Q$ in which for each nonempty string $s$ , there is at most one $(p, q) \in Q \times Q$ such that $s \in Λ (p, q)$ .