hunspell - format of Hunspell dictionaries and affix files

Hunspell requires two files to define the language that it is
spell checking. The first file is a dictionary containing words
for the language, and the second is an "affix" file that defines
the meaning of special flags in the dictionary.

A dictionary file (*.dic) contains a list of words, one per line.
The first line of a dictionary (except a personal dictionary)
contains the approximate word count (for optimal hash memory
size). Each word may optionally be followed by a slash ("/") and
one or more flags, which represent affixes or special attributes.
Dictionary words can also contain slashes, written with the "\/"
syntax. The default flag format is a single (usually alphabetic)
character. A Hunspell dictionary file may also contain optional
fields separated by tabulators or spaces (spaces since Hunspell
1.2); see Optional data fields.
Personal dictionaries are simple word lists, but with optional
word patterns for affixation, separated by a slash:

foo
Foo/Simpson

In this example, "foo" and "Foo" are personal words, and Foo is
also recognized with the affixes of Simpson (Foo's etc.).
An affix file (*.aff) may contain many optional attributes. For
example, SET sets the character encoding of the affix and
dictionary files. TRY sets the change characters for suggestions.
REP sets a replacement table for multiple-character corrections
in suggestion mode. PFX and SFX define prefix and suffix classes
named with affix flags.

The following affix file example defines UTF-8 character
encoding. The TRY suggestions differ from the bad word by an
English letter or an apostrophe. With these REP definitions,
Hunspell can suggest the right word form when the misspelled word
contains f instead of ph and vice versa.

SET UTF-8
TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'

REP 2
REP f ph
REP ph f

PFX A Y 1
PFX A 0 re .

SFX B Y 2
SFX B 0 ed [^y]
SFX B y ied y

There are two affix classes in this affix file. Class A defines
an 're-' prefix. Class B defines two '-ed' suffixes. The first
suffix can be added to a word if the last character of the word
is not 'y'. The second suffix can be added to words ending in
'y'. (See details later.) The following dictionary file uses
these affix classes.

3
hello
try/B
work/AB

All accepted words with this example: hello, try, tried, work,
worked, rework, reworked.
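The way the example rules generate the accepted word list can be sketched in a few lines of Python. This is only an illustration of the rule semantics described here (stripping, affix, condition, cross product), not Hunspell's implementation; the names RULES, DICTIONARY, apply_rule and expand are hypothetical, and flags are assumed to be single characters.

```python
import re

# Affix rules from the example: (kind, flag, strip, add, condition).
# A strip of "0" means nothing is removed; "." matches any word.
RULES = [
    ("PFX", "A", "0", "re", "."),
    ("SFX", "B", "0", "ed", "[^y]"),
    ("SFX", "B", "y", "ied", "y"),
]
DICTIONARY = {"hello": "", "try": "B", "work": "AB"}


def apply_rule(word, rule):
    """Return the affixed form, or None if the condition is not met."""
    kind, _flag, strip, add, cond = rule
    strip = "" if strip == "0" else strip
    if kind == "PFX":
        if not re.match(cond, word) or not word.startswith(strip):
            return None
        return add + word[len(strip):]
    if not re.search(cond + "$", word) or not word.endswith(strip):
        return None
    return word[: len(word) - len(strip)] + add


def expand(dictionary, rules):
    """Generate every word form accepted by the dictionary and rules."""
    accepted = set()
    for stem, flags in dictionary.items():
        accepted.add(stem)
        prefixes = [r for r in rules if r[0] == "PFX" and r[1] in flags]
        suffixes = [r for r in rules if r[0] == "SFX" and r[1] in flags]
        suffixed = [w for w in (apply_rule(stem, r) for r in suffixes) if w]
        accepted.update(suffixed)
        # Both classes are declared with cross product "Y", so a prefix
        # may also be combined with a suffixed form.
        for rule in prefixes:
            for base in [stem] + suffixed:
                form = apply_rule(base, rule)
                if form:
                    accepted.add(form)
    return accepted


print(sorted(expand(DICTIONARY, RULES)))
```

The sketch always allows prefix and suffix combination because both example classes use "Y"; a class declared with "N" would need an extra check.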
SET encoding
Set the character encoding of words and morphemes in the affix
and dictionary files. Possible values: UTF-8, ISO8859-1 to
ISO8859-10, ISO8859-13 to ISO8859-15, KOI8-R, KOI8-U,
microsoft-cp1251, ISCII-DEVANAGARI.

FLAG value
Set the flag type. The default type is the extended ASCII (8-bit)
character. The 'UTF-8' value sets UTF-8 encoded Unicode character
flags. The 'long' value sets the double extended ASCII character
flag type, and 'num' sets the decimal number flag type. Decimal
flags are numbered from 1 to 65535 and are separated by commas in
flag fields. BUG: the UTF-8 flag type doesn't work on the ARM
platform.

COMPLEXPREFIXES
Set twofold prefix stripping (but single suffix stripping) for
agglutinative languages with right-to-left writing systems.

LANG langcode
Set the language code. Hunspell may contain language-specific
code enabled by the LANG code. At present there is specific code
for az_AZ, hu_HU and tr_TR in Hunspell (see the source code).

IGNORE characters
Ignore characters from dictionary words, affixes and input words.
Useful for optional characters, such as Arabic diacritical marks
(Harakat).

AF
Hunspell can substitute affix flag sets with ordinal numbers in
affix rules (alias compression). First example with alias
compression:

3
hello
try/1
work/2

AF definitions in the affix file:

SET UTF-8
TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'
AF 2
AF A
AF AB

See also the tests/alias* examples.
Note: If the affix file contains the FLAG parameter, define it
before the AF definitions.

Note II: Use the makealias utility in the Hunspell distribution
to compress aff and dic files.

AM
Hunspell can also substitute morphological data with ordinal
numbers in affix rules (alias compression). See the tests/alias*
examples.

Suggestion parameters can optimize the default n-gram, character
swap and deletion suggestions of Hunspell. REP is recommended for
fixing typical and especially bad language-specific errors,
because the REP suggestions have the highest priority in the
suggestion list. PHONE is for languages whose orthography is not
based on pronunciation.

KEY characters
Hunspell searches and suggests words with one different character
replaced by a neighboring KEY character. Characters that are not
neighbors are separated by vertical line characters in the KEY
string. Suggested KEY parameters for QWERTY and Dvorak keyboard
layouts:

KEY qwertyuiop|asdfghjkl|zxcvbnm
KEY pyfgcrl|aeouidhtns|qjkxbmwvz
Using the first QWERTY layout, Hunspell suggests "nude" and
"node" for "*nide". A character may have more neighbors, too:
KEY qwertzuop|yxcvbnm|qaw|say|wse|dsx|sy|edr|fdc|dx|rft|gfv|fc|tgz|hgb|gv|zhu|jhn|hb|uji|kjm|jn|iko|lkm
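The neighbor lookup described for KEY can be illustrated with a short Python sketch. It assumes that two characters are neighbors when they stand next to each other inside one vertical-line-separated group; the function names and the tiny word set are hypothetical, and the order of the returned suggestions is an artifact of the sketch, not of Hunspell.

```python
def key_neighbors(key_spec):
    """Map each character to the characters adjacent to it in any KEY group."""
    neighbors = {}
    for group in key_spec.split("|"):
        for i, ch in enumerate(group):
            adjacent = neighbors.setdefault(ch, set())
            if i > 0:
                adjacent.add(group[i - 1])
            if i + 1 < len(group):
                adjacent.add(group[i + 1])
    return neighbors


def key_suggestions(word, key_spec, dictionary):
    """Replace one character at a time by a KEY neighbor; keep known words."""
    neighbors = key_neighbors(key_spec)
    found = []
    for i, ch in enumerate(word):
        for repl in sorted(neighbors.get(ch, ())):
            candidate = word[:i] + repl + word[i + 1:]
            if candidate in dictionary and candidate not in found:
                found.append(candidate)
    return found


QWERTY = "qwertyuiop|asdfghjkl|zxcvbnm"
# "nice" is in the word set but is not suggested: c is not a neighbor of d.
print(key_suggestions("nide", QWERTY, {"nude", "node", "nice"}))
```

Because a character may appear in several groups, the same function also handles the longer KEY string, where each group lists one key together with its neighbors.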
TRY characters
Hunspell can suggest the right word forms when they differ from
the bad input word by one TRY character. The parameter of TRY is
case sensitive.

NOSUGGEST flag
Words marked with the NOSUGGEST flag are not suggested. Proposed
flag for vulgar and obscene words (see also SUBSTANDARD).

MAXNGRAMSUGS num
Set the number of n-gram suggestions. Value 0 switches off the
n-gram suggestions.

NOSPLITSUGS
Disable split-word suggestions.

SUGSWITHDOTS
Add dot(s) to suggestions if the input word ends in dot(s). (Not
for OpenOffice.org dictionaries, because OpenOffice.org has an
automatic dot expansion mechanism.)

REP
We can define language-dependent phonetic information in the
affix file (.aff) by a replacement table. The first REP line is
the header of this table, and one or more REP data lines follow
it. With this table, Hunspell can suggest the right forms for
typical spelling mistakes when the incorrect form differs from
the right form by more than one letter. For example, a possible
English replacement table definition to handle misspelled
consonants:

REP 8
REP f ph
REP ph f
REP f gh
REP gh f
REP j dg
REP dg j
REP k ch
REP ch k

Note I: It is also very useful to define replacements for the
most typical one-character mistakes: with REP you can give higher
priority to a subset of the TRY suggestions (the suggestion list
begins with the REP suggestions).

Note II: To suggest separated words with REP, specify the space
with an underline:

REP 1
REP alot a_lot

Note III: The replacement table can also be used for stricter
compound word checking (forbidding generated compound words if
they are also simple words with a typical fault; see
CHECKCOMPOUNDREP).
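A replacement table can be pictured as a list of (what, replacement) pairs tried at every position of the misspelled word. The following Python sketch is only an illustration under that assumption; rep_suggestions and the small word set are hypothetical, and accepting a suggestion that contains a space when each part is a known word is a simplification made for the sketch.

```python
def rep_suggestions(word, rep_table, dictionary):
    """Try each REP pair at every position; keep results that are known words."""
    found = []
    for what, replacement in rep_table:
        replacement = replacement.replace("_", " ")  # underline means space
        start = word.find(what)
        while start != -1:
            candidate = word[:start] + replacement + word[start + len(what):]
            # Simplification: a suggestion with a space is accepted
            # if every part is a dictionary word.
            if all(part in dictionary for part in candidate.split(" ")):
                if candidate not in found:
                    found.append(candidate)
            start = word.find(what, start + 1)
    return found


REP = [("f", "ph"), ("ph", "f"), ("alot", "a_lot")]
WORDS = {"phone", "elephant", "a", "lot"}
print(rep_suggestions("fone", REP, WORDS))
print(rep_suggestions("elefant", REP, WORDS))
print(rep_suggestions("alot", REP, WORDS))
```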
MAP
We can define language-dependent information on characters that
should be considered related (i.e. nearer than other characters
not in the set) in the affix file (.aff) by a character map
table. With this table, Hunspell can suggest the right forms for
words that incorrectly choose the wrong letter from a related set
more than once in a word.

For example, a possible mapping could be for the German umlauted
ü versus the regular u; the word Frühstück really should be
written with umlauted u's and not regular ones:

MAP 1
MAP uü
PHONE
PHONE uses a table-driven phonetic transcription algorithm
borrowed from Aspell. It is useful for languages whose
orthography is not based on pronunciation. You can add a full
alphabet conversion and other rules for the conversion of special
letter sequences. For detailed documentation see
http://aspell.net/man-html/Phonetic-Code.html. Note: Multibyte
UTF-8 characters do not work in bracket expressions yet. Dash
expressions work on signed bytes, not on UTF-8 characters, yet.

BREAK
Define break points for breaking words and checking word parts
separately. Rationale: useful for compounding with joining
characters or strings (for example, the hyphen in English and
German, or the hyphen and n-dash in Hungarian). Dashes are often
bad break points for tokenization, because compounds with dashes
may also contain parts that are not valid words. With BREAK,
Hunspell can check both sides of these compounds, breaking the
words at dashes and n-dashes:

BREAK 2
BREAK -
BREAK --    # n-dash
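The effect of such a BREAK table can be sketched as a recursive check: a word is accepted if it is known, or if splitting it at a break string leaves two non-empty parts that are accepted in turn. The Python below is a minimal illustration, not Hunspell's code; accepts, WORDS and BREAKS are hypothetical names, and "--" stands for the n-dash as it is written in the example.

```python
def accepts(word, dictionary, breaks):
    """Accept a known word, or a word whose parts around a break point
    are themselves accepted (checked recursively)."""
    if word in dictionary:
        return True
    for brk in breaks:
        pos = word.find(brk)
        while pos != -1:
            left, right = word[:pos], word[pos + len(brk):]
            if (left and right
                    and accepts(left, dictionary, breaks)
                    and accepts(right, dictionary, breaks)):
                return True
            pos = word.find(brk, pos + 1)
    return False


WORDS = {"foo", "bar"}
BREAKS = ["-", "--"]
for w in ["foo-bar", "bar-foo", "foo-foo--bar-bar", "foo-baz"]:
    print(w, accepts(w, WORDS, BREAKS))
```

Requiring both parts to be non-empty keeps the recursion finite, because every recursive call works on a strictly shorter string.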
Breaking is recursive, so foo-bar, bar-foo and foo-foo--bar-bar
would be valid compounds.

Note: COMPOUNDRULE is better (or will be better) for handling
dashes and other compound joining characters or character
strings. Use BREAK if you want to check words with dashes or
other joining characters and there is no time or possibility to
describe precise compound rules with COMPOUNDRULE (so far,
COMPOUNDRULE handles only the last suffixation of the compound
word).

Note II: For command-line spell checking, set the WORDCHARS
parameter: WORDCHARS --- (see the tests/break.* example).

COMPOUNDRULE
Define custom compound patterns with a regex-like syntax. The
first COMPOUNDRULE is a header with the number of the following
COMPOUNDRULE definitions. Compound patterns consist of compound
flags and star or question mark metacharacters. A flag followed
by a '*' matches a sequence of zero or more words marked with
this compound flag. A flag followed by a '?' matches a sequence
of zero or one word marked with this compound flag. See the
tests/compound*.* examples.
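Since compound patterns use the same '*' and '?' operators as regular expressions, a pattern over one-character flags can be tested with an ordinary regex applied to the sequence of flags. The sketch below assumes that every word carries exactly one compound flag; the pattern "A*B?C", the word list and the function names are hypothetical and chosen only for illustration.

```python
import re


def rule_to_regex(pattern):
    """Translate a compound pattern of one-character flags with '*' and '?'
    into a regular expression over a string of flags."""
    parts = [ch if ch in "*?" else re.escape(ch) for ch in pattern]
    return re.compile("".join(parts))


def matches_rule(words, flag_of, pattern):
    """Check whether the words, in this order, fit the compound pattern."""
    flags = "".join(flag_of[w] for w in words)
    return rule_to_regex(pattern).fullmatch(flags) is not None


FLAG_OF = {"foo": "A", "bar": "B", "baz": "C"}  # hypothetical dictionary
print(matches_rule(["foo", "foo", "baz"], FLAG_OF, "A*B?C"))  # flags AAC
print(matches_rule(["baz", "foo"], FLAG_OF, "A*B?C"))         # flags CA
```

A real checker would also have to find the segmentation of the input word into dictionary words first; the sketch starts from a given segmentation.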
Note: ‘*’ and ‘?’ metacharacters work only with the default 8‐bit
character and the UTF‐8 FLAG types.
Note II: COMPOUNDRULE flags are not yet compatible with the
COMPOUNDFLAG, COMPOUNDBEGIN, etc. compound flags (use these flags
on different words).

COMPOUNDMIN num
Minimum length of words in compound words. The default value is 3
letters.

COMPOUNDFLAG flag
Words marked with COMPOUNDFLAG may be in compound words (except
when the word is shorter than COMPOUNDMIN). Affixes with
COMPOUNDFLAG also permit compounding of affixed words.

COMPOUNDBEGIN flag
Words marked with COMPOUNDBEGIN (or with a marked affix) may be
first elements in compound words.

COMPOUNDLAST flag
Words marked with COMPOUNDLAST (or with a marked affix) may be
last elements in compound words.

COMPOUNDMIDDLE flag
Words marked with COMPOUNDMIDDLE (or with a marked affix) may be
middle elements in compound words.

ONLYINCOMPOUND flag
Suffixes marked with the ONLYINCOMPOUND flag may occur only
inside compounds (Fuge-elements in German, fogemorphemes in
Swedish). The ONLYINCOMPOUND flag also works with words (see
tests/onlyincompound.*).

COMPOUNDPERMITFLAG flag
By default, prefixes are allowed at the beginning of compounds
and suffixes are allowed at the end of compounds. Affixes with
COMPOUNDPERMITFLAG may be inside compounds.

COMPOUNDFORBIDFLAG flag
Suffixes with this flag forbid compounding of the affixed word.

COMPOUNDROOT flag
The COMPOUNDROOT flag marks the compounds in the dictionary (now
it is used only in the Hungarian language-specific code).

COMPOUNDWORDMAX number
Set the maximum word count in a compound word. (The default is
unlimited.)

CHECKCOMPOUNDDUP
Forbid word duplication in compounds (e.g. foofoo).

CHECKCOMPOUNDREP
Forbid compounding if the (usually bad) compound word may be a
non-compound word with a REP fault. Useful for languages with
'compound friendly' orthography.

CHECKCOMPOUNDCASE
Forbid upper case characters at word boundaries in compounds.

CHECKCOMPOUNDTRIPLE
Forbid compounding if the compound word contains triple letters
(e.g. foo|ox or xo|oof). Bug: missing multi-byte character
support in UTF-8 encoding (works only for 7-bit ASCII
characters).

CHECKCOMPOUNDPATTERN endchars beginchars
Forbid compounding if the first word in the compound ends with
endchars and the next word begins with beginchars.

COMPOUNDSYLLABLE max_syllable vowels
Needed for special compounding rules in Hungarian. The first
parameter is the maximum number of syllables that may be in a
compound if the compound contains more words than
COMPOUNDWORDMAX. The second parameter is the list of vowels (for
counting syllables).

SYLLABLENUM flags
Needed for special compounding rules in Hungarian.

An affix is either a prefix or a suffix attached to root words to
make other words. We can define affix classes with an arbitrary
number of affix rules. Affix classes are marked with affix flags.
The first line of an affix class definition is the header. The
fields of an affix class header:
(0) Option name (PFX or SFX)
(1) Flag (name of the affix class)
(2) Cross product (permission to combine prefixes and suffixes).
Possible values: Y (yes) or N (no)
(3) Line count of the following rules.
Fields of an affix rule:
(0) Option name
(1) Flag
(2) stripping characters from beginning (at prefix rules) or end
(at suffix rules) of the word
(3) affix (optionally with flags of continuation classes, sepa‐
rated by a slash)
(4) condition.
Zero stripping or a zero affix is indicated by zero. A zero
condition is indicated by a dot. The condition is a simplified,
regular expression-like pattern, which must be met before the
affix can be applied. (A dot stands for an arbitrary character.
Characters in square brackets stand for an arbitrary character
from the character subset. A dash has no special meaning, but a
circumflex (^) right after the opening bracket selects the
complement of the character set.)
(5) Optional morphological fields separated by spaces or
tabulators.

CIRCUMFIX flag
Affixes marked with the CIRCUMFIX flag may be on a word when this
word also has a prefix with the CIRCUMFIX flag, and vice versa.

FORBIDDENWORD flag
This flag marks a forbidden word form. Because affixed forms are
also forbidden, we can subtract a subset from the set of accepted
affixed and compound words.

KEEPCASE flag
Forbid uppercased and capitalized forms of words marked with the
KEEPCASE flag. Useful for special orthographies (measurements and
currencies often keep their case in uppercased texts) and writing
systems (e.g. keeping the lower case of IPA characters).

Note: With the CHECKSHARPS declaration, words with sharp s and
the KEEPCASE flag may be capitalized and uppercased, but
uppercased forms of these words may not contain sharp s, only SS.
See the germancompounding example in the tests directory of the
Hunspell distribution.

LEMMA_PRESENT flag
Not used in Hunspell 1.2. Use the "st:" field instead of
LEMMA_PRESENT.

NEEDAFFIX flag
This flag marks virtual stems in the dictionary. Only affixed
forms of these words are accepted by Hunspell, unless the
dictionary word has a homonym or a zero affix. NEEDAFFIX also
works with prefixes and prefix + suffix combinations (see
tests/pseudoroot5.*).

PSEUDOROOT flag
Deprecated. (Former name of the NEEDAFFIX option.)

SUBSTANDARD flag
The SUBSTANDARD flag marks affix rules and dictionary words
(allomorphs) not used in morphological generation (and, in future
versions, in suggestion). See also NOSUGGEST.
WORDCHARS characters
WORDCHARS extends the tokenizer of the Hunspell command line
interface with additional word characters. For example, dot,
dash, n-dash, numbers and the percent sign are word characters in
Hungarian.

CHECKSHARPS
The SS letter pair in uppercased (German) words may be an upper
case sharp s (ß). Hunspell can handle this special casing with
the CHECKSHARPS declaration (see also the KEEPCASE flag and the
tests/germancompounding example) in both spelling and suggestion.

Hunspell's dictionary items and affix rules may have optional
space- or tabulator-separated morphological description fields,
starting with 3-character (two letters and a colon) field IDs:

word/flags po:noun is:nom

Example: We define a simple resource with morphological
information, a derivative suffix (ds:) and a part of speech
category (po:):
Affix file:
SFX X Y 1
SFX X 0 able . ds:able
Dictionary file:
drink/X po:verb
Test file:
drink
drinkable
Test:
$ analyze test.aff test.dic test.txt
> drink
analyze(drink) = po:verb
stem(drink) = po:verb
> drinkable
analyze(drinkable) = po:verb ds:able
stem(drinkable) = drinkable
As the example shows, the analyzer concatenates the morphological
fields in item and arrangement style.

Default morphological and other IDs (used in suggestion, stemming
and morphological generation):

ph:
Alternative transliteration for better suggestion. It is useful
for words with foreign pronunciation. (Dictionary-based phonetic
suggestion.) For example:

Marseille ph:maarsayl

st:
Stem. Optional: the default stem is the dictionary item in
morphological analysis. The stem field is useful for virtual
stems (dictionary words with the NEEDAFFIX flag) and for
morphological exceptions, instead of new morphological rules that
would be used only once.

feet st:foot is:plural
mice st:mouse is:plural
teeth st:tooth is:plural

Word forms with multiple stems need multiple dictionary items:

lay po:verb st:lie is:past_2
lay po:verb is:present
lay po:noun

al:
Allomorph(s). A dictionary item is the stem of its allomorphs.
Morphological generation needs stem, allomorph and affix fields.

sing al:sang al:sung
sang st:sing
sung st:sing

po:
Part of speech category.

ds:
Derivational suffix(es). Stemming doesn't remove derivational
suffixes. Morphological generation depends on the order of the
suffix fields.

In affix rules:

SFX Y Y 1
SFX Y 0 ly . ds:ly_adj

In the dictionary:

ably st:able ds:ly_adj
able al:ably

is:
Inflectional suffix(es). All inflectional suffixes are removed by
stemming. Morphological generation depends on the order of the
suffix fields.

feet st:foot is:plural

ts:
Terminal suffix(es). Terminal suffix fields are inflectional
suffix fields "removed" by additional (not terminal) suffixes.
Useful for zero morphemes and affixes removed by splitting rules.

work/D ts:present

SFX D Y 2
SFX D 0 ed . is:past_1
SFX D 0 ed . is:past_2

A typical example of a terminal suffix is the nominative in
languages with case suffixes.

sp:
Surface prefix. Temporary solution for adding prefixes to the
stems and generated word forms. See the tests/morph.* example.

pa:
Parts of the compound words. Output fields of morphological
analysis for stemming.

dp:
Planned: derivational prefix.

ip:
Planned: inflectional prefix.

tp:
Planned: terminal prefix.

Ispell's original algorithm strips only one suffix. Hunspell can
strip a second one as well.

Twofold suffix stripping is a significant improvement in handling
the immense number of suffixes that characterize agglutinative
languages.

Extending the previous example by adding a second suffix (affix
class Y will be the continuation class of the suffix 'able'):
SFX Y Y 1
SFX Y 0 s . +PLUR
SFX X Y 1
SFX X 0 able/Y . +ABLE
Dictionary file:
drink/X [VERB]
Test file:
drink
drinkable
drinkables
Test:
$ hunmorph test.aff test.dic test.txt
drink: drink[VERB]
drinkable: drink[VERB]+ABLE
drinkables: drink[VERB]+ABLE+PLUR
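The two stripping levels in this example can be traced with a small Python sketch. It is a simplified illustration, not Hunspell's algorithm: stripping characters and conditions are left out, and the SUFFIXES and DICTIONARY tables as well as the function names are hypothetical encodings of the example resource.

```python
# Suffix classes of the example: flag -> list of
# (suffix, continuation flags, morphological tag).
SUFFIXES = {
    "Y": [("s", "", "+PLUR")],
    "X": [("able", "Y", "+ABLE")],
}
DICTIONARY = {"drink": ("X", "[VERB]")}  # stem -> (flags, tag)


def strip_one(word):
    """Yield (base, flag, continuation, tag) for each suffix ending the word."""
    for flag, rules in SUFFIXES.items():
        for suffix, continuation, tag in rules:
            if word.endswith(suffix) and len(word) > len(suffix):
                yield word[: -len(suffix)], flag, continuation, tag


def analyze(word):
    """Return the analyses found with at most two levels of suffix stripping."""
    results = []
    if word in DICTIONARY:
        results.append(word + DICTIONARY[word][1])
    for base, flag, _cont, tag in strip_one(word):
        # One suffix: the stem itself must carry the suffix flag.
        if base in DICTIONARY and flag in DICTIONARY[base][0]:
            results.append(base + DICTIONARY[base][1] + tag)
        # Two suffixes: the inner suffix must name the outer flag as a
        # continuation class, and the stem must carry the inner flag.
        for stem, flag2, cont2, tag2 in strip_one(base):
            if (flag in cont2 and stem in DICTIONARY
                    and flag2 in DICTIONARY[stem][0]):
                results.append(stem + DICTIONARY[stem][1] + tag2 + tag)
    return results


for w in ["drink", "drinkable", "drinkables", "drinks"]:
    print(w, analyze(w))
```

In the sketch "drinks" gets no analysis, because the dictionary entry carries only flag X and the plural suffix is reachable only through the continuation class of 'able'.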
Theoretically, twofold suffix stripping needs only the square
root of the number of suffix rules, compared with an
implementation that strips a single suffix. In our practice,
twofold suffix stripping allowed us to elaborate the Hungarian
inflectional morphology.

Note: The grammar of the Hunlex preprocessor can use not only
twofold but also multiple suffix splitting.

Hunspell can handle more than 65000 affix classes. There are two
new syntaxes for giving flags in affix and dictionary files.

The FLAG long command sets 2-character flags:
FLAG long
SFX Y1 Y 1
SFX Y1 0 s 1
Dictionary record with the Y1, Z3, F? flags:
foo/Y1Z3F?
FLAG num command sets numerical flags separated by comma:
FLAG num
SFX 65000 Y 1
SFX 65000 0 s 1
Dictionary example:
foo/65000,12,2756
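How the flag field of a dictionary entry splits under each FLAG type can be shown with a short Python sketch. The function names and the flag_type strings "char", "long" and "num" are hypothetical labels for this illustration; the sketch ignores escaped slashes and morphological fields.

```python
def parse_flags(field, flag_type="char"):
    """Split the flag field of a dictionary entry according to the FLAG type."""
    if flag_type == "long":
        if len(field) % 2:
            raise ValueError("long flags need an even number of characters")
        return [field[i:i + 2] for i in range(0, len(field), 2)]
    if flag_type == "num":
        return [int(part) for part in field.split(",")]
    return list(field)  # default: one character per flag


def parse_entry(line, flag_type="char"):
    """Return (word, flags) for a line such as 'foo/Y1Z3F?'."""
    word, _, field = line.partition("/")
    return word, (parse_flags(field, flag_type) if field else [])


print(parse_entry("foo/Y1Z3F?", "long"))
print(parse_entry("foo/65000,12,2756", "num"))
print(parse_entry("work/AB"))
```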
Hunspell’s dictionary can contain repeating elements that are
homonyms:
work/A [VERB]
work/B [NOUN]
An affix file:
SFX A Y 1
SFX A 0 s . +SG3
SFX B Y 1
SFX B 0 s . +PLUR
Test file:
works
Test:
> works
work[VERB]+SG3
work[NOUN]+PLUR
This feature also gives a way to forbid illegal prefix/suffix
combinations in difficult cases.

An interesting side effect of multi-step stripping is that the
appropriate treatment of circumfixes now comes for free. For
instance, in Hungarian, superlatives are formed by simultaneous
prefixation of leg- and suffixation of -bb to the adjective base.
A problem with the one-level architecture is that there is no way
to render lexical licensing of particular prefixes and suffixes
interdependent, and therefore incorrect forms are recognized as
valid, e.g. *legvén = leg + vén 'old'. Until the introduction of
clusters, a special treatment of the superlative had to be
hardwired in the earlier HunSpell code. This may have been
legitimate for a single case, but in fact prefix-suffix
dependencies are ubiquitous in category-changing derivational
patterns (cf. English payable, non-payable but *non-pay, or
drinkable, undrinkable but *undrink). In simple words, here the
prefix un- is legitimate only if the base drink is suffixed with
-able. If both these patterns are handled by on-line affix rules
and affix rules are checked against the base only, there is no
way to express this dependency and the system will necessarily
over- or undergenerate.

In the next example, suffix class R has a prefix 'continuation'
class (class P).

PFX P Y 1
PFX P 0 un . [prefix_un]+

SFX S Y 1
SFX S 0 s . +PL

SFX Q Y 1
SFX Q 0 s . +3SGV

SFX R Y 1
SFX R 0 able/PS . +DER_V_ADJ_ABLE

Dictionary:

2
drink/RQ [verb]
drink/S [noun]
Morphological analysis:
> drink
drink[verb]
drink[noun]
> drinks
drink[verb]+3SGV
drink[noun]+PL
> drinkable
drink[verb]+DER_V_ADJ_ABLE
> drinkables
drink[verb]+DER_V_ADJ_ABLE+PL
> undrinkable
[prefix_un]+drink[verb]+DER_V_ADJ_ABLE
> undrinkables
[prefix_un]+drink[verb]+DER_V_ADJ_ABLE+PL
> undrink
Unknown word.
> undrinks
Unknown word.

Conditional affixes implemented by a continuation class are not
enough for circumfixes, because a circumfix is one affix in
morphology. We also need the CIRCUMFIX option for correct
morphological analysis.

# circumfixes: ~ obligate prefix/suffix combinations
# superlative in Hungarian: leg- (prefix) AND -bb (suffix)
# nagy, nagyobb, legnagyobb, legeslegnagyobb
# (great, greater, greatest, most greatest)

CIRCUMFIX X

PFX A Y 1
PFX A 0 leg/X .

PFX B Y 1
PFX B 0 legesleg/X .

SFX C Y 3
SFX C 0 obb . +COMPARATIVE
SFX C 0 obb/AX . +SUPERLATIVE
SFX C 0 obb/BX . +SUPERSUPERLATIVE

Dictionary:

1
nagy/C [MN]
Analysis:
> nagy
nagy[MN]
> nagyobb
nagy[MN]+COMPARATIVE
> legnagyobb
nagy[MN]+SUPERLATIVE
> legeslegnagyobb
nagy[MN]+SUPERSUPERLATIVE
Allowing free compounding decreases the precision of recognition,
not to mention stemming and morphological analysis. Although
lexical switches were introduced by Ispell to license compounding
of bases, this proves not to be restrictive enough. For example:

# affix file
COMPOUNDFLAG X

2
foo/X
bar/X

With this resource, foobar and barfoo are also accepted words.

This has been improved upon with the introduction of
direction-sensitive compounding, i.e., lexical features can
specify separately whether a base can occur as the leftmost or
rightmost constituent in compounds. This, however, is still
insufficient to handle the intricate patterns of compounding, not
to mention idiosyncratic (and language-specific) norms of
hyphenation.

The earlier algorithm allowed any affixed form of words that are
lexically marked as potential members of compounds. Hunspell
improved this, and its recursive compound checking rules make it
possible to implement the intricate spelling conventions of
Hungarian compounds. For example, the noteworthy Hungarian '6-3'
rule can be set using the COMPOUNDWORDMAX, COMPOUNDSYLLABLE,
COMPOUNDROOT and SYLLABLENUM options. As a further example, in
Hungarian derivational suffixes often modify compounding
properties. Hunspell allows compounding flags on affixes, and
there are two special flags (COMPOUNDPERMITFLAG and
COMPOUNDFORBIDFLAG) to permit or prohibit compounding of the
derivations. Suffixes with COMPOUNDFORBIDFLAG forbid compounding
of the affixed word.

We also need several Hunspell features for handling German
compounding:
# German compounding
# set language to handle special casing of German sharp s
LANG de_DE
# compound flags
COMPOUNDBEGIN U
COMPOUNDMIDDLE V
COMPOUNDEND W
# Prefixes are allowed at the beginning of compounds,
# suffixes are allowed at the end of compounds by default:
# (prefix)?(root)+(affix)?
# Affixes with COMPOUNDPERMITFLAG may be inside of compounds.
COMPOUNDPERMITFLAG P
# for German fogemorphemes (Fuge‐element)
# Hint: ONLYINCOMPOUND is not required everywhere, but the
# checking will be a little faster with it.
ONLYINCOMPOUND X
# forbid uppercase characters at compound word bounds
CHECKCOMPOUNDCASE
# for handling Fuge‐elements with dashes (Arbeits‐)
# dash will be a special word
COMPOUNDMIN 1
WORDCHARS -
# compound settings and fogemorpheme for 'Arbeit'
SFX A Y 3
SFX A 0 s/UPX .
SFX A 0 s/VPDX .
SFX A 0 0/WXD .
SFX B Y 2
SFX B 0 0/UPX .
SFX B 0 0/VWXDP .
# a suffix for 'Computer'
SFX C Y 1
SFX C 0 n/WD .
# for forbid exceptions (*Arbeitsnehmer)
FORBIDDENWORD Z
# dash prefix for compounds with dash (Arbeits-Computer)
PFX - Y 1
PFX - 0 -/P .
# decapitalizing prefix
# circumfix for positioning in compounds
PFX D Y 29
PFX D A a/PX A
PFX D Ä ä/PX Ä
.
.
PFX D Y y/PX Y
PFX D Z z/PX Z

Example dictionary:

4
Arbeit/A-
Computer/BC-
-/W
Arbeitsnehmer/Z
Accepted compound words with the previous resource:

Computer
Computern
Arbeit
Arbeits-
Computerarbeit
Computerarbeits-
Arbeitscomputer
Arbeitscomputern
Computerarbeitscomputer
Computerarbeitscomputern
Arbeitscomputerarbeit
Computerarbeits-Computer
Computerarbeits-Computern

Not accepted compoundings:

computer
arbeit
Arbeits
arbeits
ComputerArbeit
ComputerArbeits
Arbeitcomputer
ArbeitsComputer
Computerarbeitcomputer
ComputerArbeitcomputer
ComputerArbeitscomputer
Arbeitscomputerarbeits
Computerarbeits-computer
Arbeitsnehmer

This solution is still not ideal, however, and will be replaced
by a pattern-based compound-checking algorithm which is closely
integrated with input buffer tokenization. Patterns describing
compounds come as a separate input resource that can refer to
high-level properties of constituent parts (e.g. the number of
syllables, affix flags, and containment of hyphens). The patterns
are matched against potential segmentations of compounds to
assess well-formedness.

Both Ispell and MySpell use 8-bit character encodings, which is a
major deficiency when it comes to scalability. Although a
language like Hungarian has a standard 8-bit character set (ISO
8859-2), it fails to allow a full implementation of Hungarian
orthographic conventions. For instance, the '--' symbol (n-dash)
is missing from this character set, even though it is not only
the official symbol for delimiting parenthetic clauses in the
language but can also occur in compound words as a special 'big'
hyphen.

MySpell has some 8-bit encoding tables, but there are also
languages without a standard 8-bit encoding. For example, many
African languages have non-Latin or extended Latin characters.

Similarly, using the original spelling of certain foreign names
like Ångström or Molière is encouraged by the Hungarian spelling
norm, and, since the characters 'Å' and 'è' are not part of ISO
8859-2, when they combine with inflections containing characters
found only in ISO 8859-2 (like elative -ből, allative -től or
delative -ről with double acute), these result in words (like
Ångströmről or Molière-től) that cannot be encoded using any
single 8-bit encoding scheme.

The problems raised in relation to 8-bit encodings have long been
recognized by proponents of Unicode. It is clear that trading
efficiency for encoding independence has its advantages when it
comes to a truly multilingual application. Hunspell implements
memory- and time-efficient Unicode handling. In non-UTF-8
character encodings Hunspell works with the original 8-bit
strings. In UTF-8 encoding, affixes and words are stored in UTF-8
and mostly handled in UTF-8 during analysis; for condition
checking and suggestion they are converted to UTF-16. Unicode
text analysis and spell checking have a minimal (0-20%) time
overhead and a minimal or reasonable memory overhead, depending
on the language (its UTF-8 encoding and affixation).