Glosa Lexicon Builder
The Glosa Lexicon Builder is a
morphological database tool that
allows the rapid development and testing of descriptions
of natural language words and word formation rules.
It consists of two components: a lexicon
component and a rules component.
Either one can be developed first:
the lexicon component automatically determines derivations and
inflections from the rules component, allowing overrides for
special cases, while the rules component can be developed
automatically from individual lexicon examples, using a special
machine-learning technology.
Features of the Lexicon Builder:
- Hierarchical approach to rule building: use
two-level phonological rules to specify language-wide
relationships, context-dependent affixation rules for each
derivational relationship, and exceptions for individual
special-case words.
- User-defined character classes with regular expressions
allow much more compact rule definition.
- Option to automatically generate a new rule each time you
add a non-conforming case.
- Graphical User Interface lets you see generated forms as
you build.
- Multiple levels of inflection for agglutinative languages,
using finite-state morphotactics.
- Can store abstract forms that are not in the lexicon.
- Store user-defined sets of properties with each form.
User can define default properties for each derivation, and override
them when necessary. Properties can also control the application
of affixing rules.
- Export lexicons for the Glosa Inflected Forms Generator, or
for a morphological front end for custom applications written
in C or C++.
Defining the Rules Component
The Lexicon Builder offers three different approaches
to defining a language's morphology. These approaches can
be used separately or in combination.
Affix-Specific Surface Rules
This paradigm combines phonological rules
and morphotactics into a single rule formalism, using only
surface symbols. Each set of rules shows the relationship
between surface configurations with and without a particular
affix; a single rule shows the relationship for a subset of
the word forms that can take the affix. For example, the
English Noun Plural suffix -s could be
represented by this series of rules (B represents sibilants
s, x, and z):
The most specific rule that applies in a given case is
used. Specificity is determined by two criteria:
- Greater length of matching terminal string.
- Smaller size of character class (for instance B,
which contains only three members, would be more
specific than C, the set of all consonants).
A single character is more specific than any
character class.
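The selection of the most specific rule can be sketched roughly as follows. This is an illustrative Python sketch, not the Lexicon Builder's own rule syntax or algorithm: the rule list, the class C, and the names `CLASSES`, `PLURAL_RULES`, `specificity`, and `pluralize` are all assumptions made for the example. It also assumes that length is compared first and class size breaks ties, an ordering the description above does not state.

```python
# Character classes: B comes from the text; C is an assumed consonant set.
CLASSES = {
    "B": set("sxz"),
    "C": set("bcdfghjklmnpqrstvwxz"),
}

# Hypothetical rules: (terminal pattern, characters to strip, text to add).
# Uppercase symbols name character classes; lowercase symbols are literals.
PLURAL_RULES = [
    ("", 0, "s"),
    ("B", 0, "es"),
    ("ch", 0, "es"),
    ("sh", 0, "es"),
    ("Cy", 1, "ies"),
]

def specificity(pattern, word):
    """Return a sortable specificity key if pattern matches the end of word, else None."""
    if len(pattern) > len(word):
        return None
    tail = word[len(word) - len(pattern):]
    total_class_size = 0
    for sym, ch in zip(pattern, tail):
        if sym in CLASSES:
            if ch not in CLASSES[sym]:
                return None
            total_class_size += len(CLASSES[sym])
        elif sym != ch:
            return None
        else:
            total_class_size += 1  # a literal counts as a class of size one
    # Longer match wins; among equal lengths, smaller classes win (assumed ordering).
    return (len(pattern), -total_class_size)

def pluralize(word):
    """Apply the most specific matching rule from PLURAL_RULES."""
    candidates = []
    for pattern, strip, add in PLURAL_RULES:
        key = specificity(pattern, word)
        if key is not None:
            candidates.append((key, strip, add))
    _, strip, add = max(candidates)
    return word[:len(word) - strip] + add
```

Under these assumed rules, `pluralize("box")` selects the B rule and `pluralize("city")` selects the longer Cy rule, while `pluralize("day")` falls back to the default because a is not in C.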
General Surface Rules
To capture language-wide phonological phenomena,
special rules can be applied at morpheme boundaries.
These occur conceptually after an affix attachment or
before an affix removal, but in reality they are composed
with the affix-specific rules to form a new set of
affix-specific rules.
Thus we could define phonological rules, similar to the
plural suffix rules above, that apply in any situation where the
suffix consists of a single s (i.e., Noun Plural and Verbal
Third Person Singular). The plural suffix specification
would then be reduced to a single rule.
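As a rough illustration of how one general boundary rule can serve several affixes, consider the Python sketch below. The affix table and the names `AFFIXES`, `boundary_rule`, and `attach` are hypothetical, and the sketch applies the general rule at attachment time rather than composing it into affix-specific rules ahead of time as the Lexicon Builder is described as doing.

```python
# Hypothetical affix table: each affix is reduced to a single surface suffix.
AFFIXES = {
    "noun_plural": "s",
    "verb_3sg": "s",
    "verb_past": "ed",
}

def boundary_rule(stem, suffix):
    """General rule: insert 'e' when a bare 's' suffix follows a sibilant, ch, or sh."""
    if suffix == "s" and stem.endswith(("s", "x", "z", "ch", "sh")):
        return stem + "e" + suffix
    return stem + suffix

def attach(stem, affix_name):
    """Attach the named affix, letting the general boundary rule adjust the join."""
    return boundary_rule(stem, AFFIXES[affix_name])
```

Because the e-insertion lives in one place, both `attach("box", "noun_plural")` and `attach("wash", "verb_3sg")` pick it up without either affix restating it.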
General Lexical Symbol Rules
The Lexicon Builder supports special Lexical Symbols
in affixes. As in two-level rules[1], these lexical symbols
can be realized as different surface formations in
different contexts. For example, one could define
a lexical symbol S that is realized as es after B, o, ch,
or sh, and as s the rest of the time.
The plural suffix specification is then reduced to:
... <-> ...S.
The corresponding surface-only rules are automatically
computed, and can be viewed in the transformation
window.
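The realization of the lexical symbol S described above might be sketched as follows. This is a minimal Python illustration under stated assumptions: the names `ES_CONTEXTS` and `realize` are invented for the example, and the real tool compiles such symbols into surface-only rules rather than interpreting them at run time.

```python
# Contexts after which S surfaces as "es": B (s, x, z), o, ch, and sh.
ES_CONTEXTS = ("s", "x", "z", "o", "ch", "sh")

def realize(stem, affix):
    """Attach affix to stem, replacing each lexical symbol S by its surface form."""
    out = []
    for sym in affix:
        if sym == "S":
            # The context is everything generated so far, stem included.
            current = stem + "".join(out)
            out.append("es" if current.endswith(ES_CONTEXTS) else "s")
        else:
            out.append(sym)
    return stem + "".join(out)
```

With this definition, `realize("fox", "S")` gives "foxes" and `realize("dog", "S")` gives "dogs". A word such as "photo" would come out as "photoes" under the stated contexts, which is the kind of case the per-word exception mechanism is meant to handle.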
Comparison of the Lexicon Builder with Two-Level Finite-State Transducer-Based Implementations
Similarities:
- In a particular context, one and only one rule
is applied. No need to worry about ordering of multiple
rules as in classical generative morphology.
- Rules are bi-directional. The same specification
is used to direct both generation and recognition.
- Both allow use of rules to isolate and remove
predictable phonological phenomena.
Differences:
- The Lexicon Builder compiles all rules to a trie-based
surface-only transformation algorithm. This algorithm
operates in constant time (independent of the size of the
lexicon, number of rules, or the length of the input word) for each
derivational step.
Backtracking is required only when using regular expressions
or character classes, and then only in certain situations.
While very efficient at verifying an acceptable form,
finite-state transducers can run in exponential time when
trying to produce a form from scratch[2], largely because
of the need for extensive backtracking.
Execution time also increases linearly with the number of
automata used.
- Generation proceeds
from the root to the affixes (the left and right ends of the
input string), rather than from left to right or right to left.
The exact position of newly generated material depends on whether
the affix is a prefix or suffix, in some cases being both
(German past participles) or neither (infixes for Arabic
finite verbs).
Recognition proceeds in the opposite manner, from the affixes
inward to the root, using a breadth-first search to enumerate
all possible morphological parses.
This simplifies the construction of sophisticated morphotactics.
- Exceptions can be specified without affecting any
other part of the lexicon.
- Word properties can be specified separately from
the word's lexical form. This avoids a complicated interaction
between symbols representing properties (such as stress) and
symbols that are actually part of the word. Words themselves
are stored as surface forms.
- User-friendly interface allows the user to verify derived
forms immediately. No need for traces or complex debugging.
- WYSIWYG rule editor simplifies creation of rules for
unusual morphological phenomena like negative affixing and
reduplication.
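To make the trie-based lookup mentioned in the first difference more concrete, here is a small Python sketch. It is not the Lexicon Builder's actual algorithm: the rule table and the names `TrieNode`, `build_trie`, and `apply_suffix` are assumptions, and character classes are left out. It shows only how the work per derivational step can be bounded by the longest rule ending rather than by the size of the lexicon or the number of rules.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.rule = None  # (characters to strip, text to add), or None

def build_trie(rules):
    """Index rules by word ending, stored back to front so lookup walks from the word's end."""
    root = TrieNode()
    for ending, rule in rules.items():
        node = root
        for ch in reversed(ending):
            node = node.children.setdefault(ch, TrieNode())
        node.rule = rule
    return root

def apply_suffix(root, word):
    """Walk the trie from the last character inward, keeping the deepest rule found."""
    node, best = root, root.rule
    for ch in reversed(word):
        node = node.children.get(ch)
        if node is None:
            break  # no longer ending is indexed; stop early
        if node.rule is not None:
            best = node.rule
    strip, add = best
    return word[:len(word) - strip] + add

# Hypothetical plural rules; the empty ending is the default.
PLURAL = build_trie({
    "": (0, "s"),
    "s": (0, "es"),
    "x": (0, "es"),
    "z": (0, "es"),
    "ch": (0, "es"),
    "sh": (0, "es"),
})
```

The walk in `apply_suffix` visits at most as many nodes as the longest indexed ending has characters, so adding more rules or more lexicon entries does not lengthen it.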
References
[1] Koskenniemi, Kimmo. 1983. Two-level morphology: a
general computational model for word-form recognition and
production. Publication No. 11. Helsinki: University of
Helsinki Department of General Linguistics.
[2] Barton, G. Edward, Robert C. Berwick, and Eric Sven
Ristad. 1987. Computational complexity and natural
language. Cambridge, MA: The MIT Press. (see chapter
5, "The complexity of two-level morphology").
Demos of Lexicon Builder applications:
- Spelling Alternations in English (available soon)
- Morphotactics and Vowel Harmony in Turkish (available soon)
- A Hyphenator for Greek (available soon)
A portion of an English lexicon.
For further information and pricing, contact:
Glosa International
4538 Winona Ct.
Denver, CO 80212-2513
USA
+1-303-458-1496 (voice and fax)