hunspell man page on YellowDog

hunspell - format of Hunspell dictionaries and affix files

Hunspell requires two files to define the language that it is
spell checking. The first file is a dictionary containing words
for the language, and the second is an "affix" file that defines
the meaning of special flags in the dictionary.

A dictionary file (*.dic) contains a list of words, one per line.
The first line of a dictionary (except personal dictionaries)
contains the approximate word count (for optimal hash memory
size). Each word may optionally be followed by a slash ("/") and
one or more flags, which represent affixes or special attributes.
Dictionary words can also contain slashes, written with the "\/"
syntax. The default flag format is a single (usually alphabetic)
character. A Hunspell dictionary file may also contain optional
fields separated by tabs or spaces (spaces from Hunspell 1.2
onward); see Optional data fields.

Personal dictionaries are simple word lists,  but  with	 optional
word patterns for affixation, separated by a slash:

foo
Foo/Simpson

In  this  example,  "foo"  and "Foo" are personal words, plus Foo
will be recognized with affixes of Simpson (Foo’s etc.).

An affix file (*.aff) may contain many optional attributes. For
example, SET is used for setting the character encoding of affix
and dictionary files. TRY sets the change characters for
suggestions. REP sets a replacement table for multiple-character
corrections in suggestion mode. PFX and SFX define prefix and
suffix classes named with affix flags.

The following affix file example sets the UTF-8 character
encoding. The TRY suggestions differ from the bad word by one
English letter or an apostrophe. With these REP definitions,
Hunspell can suggest the right word form when the misspelled word
contains f instead of ph, and vice versa.

SET UTF-8
TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'

REP 2
REP f ph
REP ph f

PFX A Y 1
PFX A 0 re .

SFX B Y 2
SFX B 0 ed [^y]
SFX B y ied y

This example defines two affix classes. Class A defines a 're-'
prefix. Class B defines two '-ed' suffixes. The first suffix can
be added to a word if its last character isn't 'y'. The second
suffix can be added to words ending in 'y'. (See details later.)
The following dictionary file uses these affix classes.

3
hello
try/B
work/AB

All accepted words with this example: hello,  try,  tried,  work,
worked, rework, reworked.
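
The list of accepted words can be reproduced with a short Python sketch. It is only an illustration of how the rules above combine, not Hunspell's implementation: the names RULES, DICTIONARY, apply_rule and expand are invented here, conditions are treated as small regular expressions, and every class is assumed to allow the prefix/suffix cross product (the "Y" in the headers).

```python
import re

# Affix rules from the example: (kind, flag, strip, affix, condition).
# "0" means nothing is stripped; "." means no condition.
RULES = [
    ("PFX", "A", "0", "re", "."),
    ("SFX", "B", "0", "ed", "[^y]"),
    ("SFX", "B", "y", "ied", "y"),
]
DICTIONARY = {"hello": "", "try": "B", "work": "AB"}


def apply_rule(word, rule):
    """Return the affixed form, or None if the rule does not apply."""
    kind, _flag, strip, affix, cond = rule
    strip = "" if strip == "0" else strip
    if kind == "SFX":
        # Suffix conditions are checked against the end of the word.
        if not re.search("(?:" + cond + ")$", word) or not word.endswith(strip):
            return None
        return word[: len(word) - len(strip)] + affix
    # Prefix conditions are checked against the beginning of the word.
    if not re.match(cond, word) or not word.startswith(strip):
        return None
    return affix + word[len(strip):]


def expand(word, flags):
    """All forms of one dictionary entry (suffixes first, then prefixes)."""
    suffixed = set()
    for rule in RULES:
        if rule[0] == "SFX" and rule[1] in flags:
            form = apply_rule(word, rule)
            if form:
                suffixed.add(form)
    forms = {word} | suffixed
    for rule in RULES:
        if rule[0] == "PFX" and rule[1] in flags:
            for base in [word, *suffixed]:
                form = apply_rule(base, rule)
                if form:
                    forms.add(form)
    return forms


accepted = set()
for entry, entry_flags in DICTIONARY.items():
    accepted |= expand(entry, entry_flags)
print(sorted(accepted))
```

Running the sketch prints the same seven words as listed above.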

SET encoding
Set character encoding of words and morphemes in affix and
dictionary files. Possible values: UTF-8, ISO8859-1 - ISO8859-10,
ISO8859-13 - ISO8859-15, KOI8-R, KOI8-U, microsoft-cp1251,
ISCII-DEVANAGARI.

FLAG value
Set flag type. The default type is the extended ASCII (8-bit)
character. The 'UTF-8' parameter sets UTF-8 encoded Unicode
character flags. The 'long' value sets the double extended ASCII
character flag type, and 'num' sets the decimal number flag type.
Decimal flags are numbered from 1 to 65535, and are separated by
commas in flag fields. BUG: the UTF-8 flag type doesn't work on
the ARM platform.

COMPLEXPREFIXES
Set twofold prefix stripping (but single suffix stripping) for
agglutinative languages with right-to-left writing systems.

LANG langcode
Set language code. Hunspell may contain language-specific code
that is enabled by the LANG code. At present there is az_AZ,
hu_HU and TR_tr specific code in Hunspell (see the source code).

IGNORE characters
Ignore characters from dictionary words, affixes and input words.
Useful for optional characters, such as Arabic diacritical marks
(Harakat).

AF number_of_flag_vector_aliases
AF flag_vector
Hunspell can substitute affix flag sets with ordinal numbers in
affix rules (alias compression). First example with alias
compression:

3
hello
try/1
work/2

AF definitions in the affix file:

SET UTF-8
TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'
AF 2
AF A
AF AB

See also tests/alias* examples.
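
As a rough illustration of what the aliases stand for, the following Python sketch expands the numeric references back into flag sets. The function names read_aliases and expand_entry are invented for this example, and the sketch assumes the default single-character flag type.

```python
AFF_LINES = ["SET UTF-8", "AF 2", "AF A", "AF AB"]
DIC_LINES = ["hello", "try/1", "work/2"]


def read_aliases(aff_lines):
    """Collect the flag vectors that follow the 'AF <count>' header."""
    af = [line.split()[1] for line in aff_lines if line.startswith("AF ")]
    count = int(af[0])           # the first AF line is the header
    return af[1 : 1 + count]     # alias n is the n-th AF line after it


def expand_entry(entry, aliases):
    """Replace a numeric alias after the slash with its flag vector."""
    word, sep, ref = entry.partition("/")
    if not sep:
        return entry
    return word + "/" + aliases[int(ref) - 1]   # aliases count from 1


aliases = read_aliases(AFF_LINES)
expanded = [expand_entry(entry, aliases) for entry in DIC_LINES]
print(expanded)
```

Under these assumptions the compressed dictionary corresponds to the entries hello, try/A and work/AB.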

Note: If the affix file contains the FLAG parameter, define it
before the AF definitions.

Note II: Use the makealias utility in the Hunspell distribution
to compress aff and dic files.

AM number_of_morphological_aliases
AM morphological_fields
Hunspell can also substitute morphological data with ordinal
numbers in affix rules (alias compression). See tests/alias*
examples.

Suggestion parameters can optimize the default n-gram, character
swap and deletion suggestions of Hunspell. REP is recommended for
fixing typical and especially bad language-specific errors,
because the REP suggestions have the highest priority in the
suggestion list. PHONE is for languages whose orthography is not
pronunciation based.

KEY characters_separated_by_vertical_line_optionally
Hunspell searches and suggests words with one different character
replaced by a neighbor KEY character. Characters that are not
neighbors are separated in the KEY string by vertical line
characters. Suggested KEY parameters for QWERTY and Dvorak
keyboard layouts:
KEY qwertyuiop|asdfghjkl|zxcvbnm
KEY pyfgcrl|aeouidhtns|qjkxbmwvz
Using the first	 QWERTY	 layout,  Hunspell  suggests  "nude"  and
"node" for "*nide". A character may have more neighbors, too:
KEY qwertzuop|yxcvbnm|qaw|say|wse|dsx|sy|edr|fdc|dx|rft|gfv|fc|tgz|hgb|gv|zhu|jhn|hb|uji|kjm|jn|iko|lkm
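
The neighbor replacement can be sketched in Python as follows. This only illustrates the idea of the KEY string; the function name key_candidates and the small word list are assumptions made for the example, and Hunspell's real suggestion code does more (ranking, case handling).

```python
QWERTY = "qwertyuiop|asdfghjkl|zxcvbnm"


def key_candidates(word, key):
    """Forms that differ from `word` by one character swapped for a KEY neighbor."""
    candidates = set()
    for group in key.split("|"):            # "|" separates non-neighbors
        for i, ch in enumerate(word):
            for pos, key_ch in enumerate(group):
                if key_ch != ch:
                    continue
                for n in (pos - 1, pos + 1):   # left and right neighbor
                    if 0 <= n < len(group):
                        candidates.add(word[:i] + group[n] + word[i + 1:])
    return candidates


# Hypothetical dictionary used to filter the raw candidates.
KNOWN_WORDS = {"nude", "node", "nice"}
suggestions = sorted(key_candidates("nide", QWERTY) & KNOWN_WORDS)
print(suggestions)
```

Because a character is looked up in every group, the sketch also handles KEY strings in which a character has more than two neighbors.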
TRY characters
Hunspell can suggest right word forms when they differ from the
bad input word by one TRY character. The parameter of TRY is case
sensitive.

NOSUGGEST flag
Words marked with the NOSUGGEST flag are not suggested. Proposed
flag for vulgar and obscene words (see also SUBSTANDARD).

MAXNGRAMSUGS num
Set the number of n-gram suggestions. Value 0 switches off the
n-gram suggestions.

NOSPLITSUGS
Disable split-word suggestions.

SUGSWITHDOTS
Add dot(s) to suggestions if the input word terminates in dot(s).
(Not for OpenOffice.org dictionaries, because OpenOffice.org has
an automatic dot expansion mechanism.)

REP number_of_replacement_definitions
REP what replacement
We can define language-dependent phonetic information in the
affix file (.aff) by a replacement table. The first REP line is
the header of this table, and one or more REP data lines follow
it. With this table, Hunspell can suggest the right forms for
typical spelling mistakes when the incorrect form differs by more
than one letter from the right form. For example, a possible
English replacement table definition to handle misspelled
consonants:
REP 8
REP f ph
REP ph f
REP f gh
REP gh f
REP j dg
REP dg j
REP k ch
REP ch k
Note I: It's very useful to define replacements for the most
typical one-character mistakes, too: with REP you can give higher
priority to a subset of the TRY suggestions (the suggestion list
begins with the REP suggestions).

Note II: To suggest separated words with REP, specify the space
with an underscore:

REP 1
REP alot a_lot

Note III: Replacement table can be used for a  stricter	 compound
word  checking	(forbidding generated compound words, if they are
also simple words with typical fault, see CHECKCOMPOUNDREP).
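
A minimal Python sketch of how a replacement table yields candidate corrections is shown below. The function name rep_candidates and the tiny word list are assumptions for illustration; the sketch tries each replacement at each position and leaves ranking and dictionary lookup to the caller.

```python
REP_TABLE = [
    ("f", "ph"), ("ph", "f"),
    ("f", "gh"), ("gh", "f"),
    ("j", "dg"), ("dg", "j"),
    ("k", "ch"), ("ch", "k"),
    ("alot", "a_lot"),
]


def rep_candidates(word, table):
    """Apply each REP pair at every position where its left side occurs."""
    candidates = []
    for what, replacement in table:
        replacement = replacement.replace("_", " ")   # "_" stands for a space
        start = word.find(what)
        while start != -1:
            candidates.append(word[:start] + replacement + word[start + len(what):])
            start = word.find(what, start + 1)
    return candidates


KNOWN_WORDS = {"phone", "a lot"}          # hypothetical dictionary
for bad in ("fone", "alot"):
    print(bad, [c for c in rep_candidates(bad, REP_TABLE) if c in KNOWN_WORDS])
```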

MAP number_of_map_definitions
MAP string_of_related_chars
We can define language-dependent information on characters that
should be considered related (i.e. nearer than other chars not in
the set) in the affix file (.aff) by a character map table. With
this table, Hunspell can suggest the right forms for words that
incorrectly choose the wrong letter from a related set more than
once in a word.

For example, a possible mapping could be for the German umlauted
ü versus the regular u; the word Frühstück really should be
written with umlauted u's and not regular ones:

MAP 1
MAP uü

PHONE number_of_phone_definitions
PHONE what replacement
PHONE uses a table-driven phonetic transcription algorithm
borrowed from Aspell. It is useful for languages whose
orthography is not pronunciation based. You can add a full
alphabet conversion and other rules for conversion of special
letter sequences. For detailed documentation see
http://aspell.net/man-html/Phonetic-Code.html. Note: Multibyte
UTF-8 characters do not yet work with bracket expressions. Dash
expressions still operate on signed bytes, not on UTF-8
characters.

BREAK number_of_break_definitions
BREAK character_or_character_sequence
Define break points for breaking words and checking word parts
separately. Rationale: useful for compounding with joining
characters or strings (for example, hyphen in English and German,
or hyphen and n-dash in Hungarian). Dashes are often bad break
points for tokenization, because compounds with dashes may
contain invalid parts, too. With BREAK, Hunspell can check both
sides of these compounds, breaking the words at dashes and
n-dashes:
BREAK 2
BREAK -
BREAK --    # n-dash
Breaking is recursive, so foo-bar, bar-foo and foo-foo--bar-bar
would be valid compounds.
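
The recursion can be pictured with the following Python sketch. It is a simplified model rather than Hunspell's algorithm: the names BREAKS, WORDS and accepted are invented for the example, the dictionary holds only foo and bar, and the n-dash is written as "--" as in this page.

```python
BREAKS = ["-", "--"]
WORDS = {"foo", "bar"}


def accepted(word):
    """A word is fine if it is in the dictionary, or if it can be split
    at a break point into two non-empty parts that are both fine."""
    if word in WORDS:
        return True
    for brk in BREAKS:
        pos = word.find(brk)
        while pos != -1:
            left, right = word[:pos], word[pos + len(brk):]
            if left and right and accepted(left) and accepted(right):
                return True
            pos = word.find(brk, pos + 1)
    return False


for candidate in ("foo-bar", "bar-foo", "foo-foo--bar-bar", "foo-baz"):
    print(candidate, accepted(candidate))
```

Each recursive call works on a strictly shorter string, so the check always terminates.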

Note: COMPOUNDRULE is better (or will be better) for handling
dashes and other compound joining characters or character
strings. Use BREAK if you want to check words with dashes or
other joining characters and there is no time or possibility to
describe precise compound rules with COMPOUNDRULE (COMPOUNDRULE
so far handles only the last suffixation of the compound word).

Note II: For command line spell checking, set the WORDCHARS
parameter: WORDCHARS --- (see the tests/break.* example).

COMPOUNDRULE number_of_compound_definitions
COMPOUNDRULE compound_pattern
Define custom compound patterns with a regex-like syntax. The
first COMPOUNDRULE is a header with the number of the following
COMPOUNDRULE definitions. Compound patterns consist of compound
flags and star or question mark metacharacters. A flag followed
by a '*' matches a word sequence of 0 or more matches of words
marked with this compound flag. A flag followed by a '?' matches
a word sequence of 0 or 1 matches of a word marked with this
compound flag. See tests/compound*.* examples.
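
To make the pattern semantics concrete, here is a small Python sketch. Everything in it is hypothetical: the words, their flags A, B and C, and the rule A*B?C are invented for illustration. It relies on the observation that, with single alphabetic flags, a compound pattern can be read directly as a regular expression over the sequence of flags.

```python
import re

WORDS = {"foo": "A", "bar": "B", "baz": "C"}   # hypothetical word -> compound flag
RULE = "A*B?C"                                  # hypothetical compound pattern


def segmentations(text):
    """Yield every way to split `text` into dictionary words."""
    if not text:
        yield []
        return
    for word in WORDS:
        if text.startswith(word):
            for rest in segmentations(text[len(word):]):
                yield [word] + rest


def matches_rule(text, rule=RULE):
    """True if some segmentation has a flag sequence matching the rule."""
    pattern = re.compile(rule)
    return any(
        pattern.fullmatch("".join(WORDS[word] for word in parts))
        for parts in segmentations(text)
    )


for candidate in ("foofoobaz", "foobarbaz", "barfoo", "bazbar"):
    print(candidate, matches_rule(candidate))
```

The sketch ignores affixes and assumes one flag per word; flags that are regular expression metacharacters would need escaping.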

Note: ‘*’ and ‘?’ metacharacters work only with the default 8‐bit
character and the UTF‐8 FLAG types.

Note II: COMPOUNDRULE flags are not yet compatible with the
COMPOUNDFLAG, COMPOUNDBEGIN, etc. compound flags (use these flags
on different words).

COMPOUNDMIN num
Minimum length of words in compound words. Default value is 3
letters.

COMPOUNDFLAG flag
Words marked with COMPOUNDFLAG may be in compound words (except
when the word is shorter than COMPOUNDMIN). Affixes with
COMPOUNDFLAG also permit compounding of affixed words.

COMPOUNDBEGIN flag
Words marked with COMPOUNDBEGIN (or with a marked affix) may be
first elements in compound words.

COMPOUNDLAST flag
Words marked with COMPOUNDLAST (or with a marked affix) may be
last elements in compound words.

COMPOUNDMIDDLE flag
Words marked with COMPOUNDMIDDLE (or with a marked affix) may be
middle elements in compound words.

ONLYINCOMPOUND flag
Suffixes marked with the ONLYINCOMPOUND flag may occur only
inside compounds (Fuge-elements in German, fogemorphemes in
Swedish). The ONLYINCOMPOUND flag also works with words (see
tests/onlyincompound.*).

COMPOUNDPERMITFLAG flag
By default, prefixes are allowed at the beginning of compounds
and suffixes are allowed at the end of compounds. Affixes with
COMPOUNDPERMITFLAG may be inside compounds.

COMPOUNDFORBIDFLAG flag
Suffixes with this flag forbid compounding of the affixed word.

COMPOUNDROOT flag
The COMPOUNDROOT flag marks the compounds in the dictionary
(currently used only in the Hungarian language-specific code).

COMPOUNDWORDMAX number
Set maximum word count in a compound word. (Default is
unlimited.)

CHECKCOMPOUNDDUP
Forbid word duplication in compounds (e.g. foofoo).

CHECKCOMPOUNDREP
Forbid compounding if the (usually bad) compound word may be a
non-compound word with a REP fault. Useful for languages with
'compound friendly' orthography.

CHECKCOMPOUNDCASE
Forbid upper case characters at word boundaries in compounds.

CHECKCOMPOUNDTRIPLE
Forbid compounding if the compound word contains triple letters
(e.g. foo|ox or xo|oof). Bug: missing multi-byte character
support in UTF-8 encoding (works only for 7-bit ASCII
characters).

CHECKCOMPOUNDPATTERN endchars beginchars
Forbid compounding if the first word in the compound ends with
endchars and the next word begins with beginchars.

COMPOUNDSYLLABLE max_syllable vowels
Needed for special compounding rules in Hungarian. The first
parameter is the maximum number of syllables that may be in a
compound if the compound contains more words than
COMPOUNDWORDMAX. The second parameter is the list of vowels (for
calculating syllables).

SYLLABLENUM flags
Needed for special compounding rules in Hungarian.

PFX flag cross_product number
PFX flag stripping prefix condition [morphological_fields...]
SFX flag cross_product number
SFX flag stripping suffix condition [morphological_fields...]
An affix is either a prefix or a suffix attached to root words to
make other words. We can define affix classes with an arbitrary
number of affix rules. Affix classes are marked with affix flags.
The first line of an affix class definition is the header. The
fields of an affix class header:

(0) Option name (PFX or SFX)

(1) Flag (name of the affix class)

(2)  Cross product (permission to combine prefixes and suffixes).
Possible values: Y (yes) or N (no)

(3) Line count of the following rules.

Fields of an affix rule:

(0) Option name

(1) Flag

(2) stripping characters from beginning (at prefix rules) or  end
(at suffix rules) of the word

(3)  affix  (optionally with flags of continuation classes, sepa‐
rated by a slash)

(4) condition.

An empty stripping or affix field is indicated by zero. An empty
condition is indicated by a dot. The condition is a simplified,
regular-expression-like pattern that must be met before the affix
can be applied. (A dot stands for an arbitrary character.
Characters in square brackets stand for an arbitrary character
from that set. The dash has no special meaning, but a circumflex
(^) directly after the opening bracket selects the complement of
the set.)

(5) Optional morphological fields separated by spaces or  tabula‐
tors.
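
The condition syntax is close to, but not the same as, ordinary regular expressions: in particular the dash is a literal character. The Python sketch below translates a condition into a Python regular expression under that reading. The function name condition_to_regex is invented for this example, and anchoring suffix conditions at the end of the word and prefix conditions at the beginning is an assumption based on the field descriptions above.

```python
import re


def condition_to_regex(condition, is_suffix):
    """Translate an affix condition into a compiled Python regex."""
    out = []
    in_set = False
    for i, ch in enumerate(condition):
        if ch == "[" and not in_set:
            in_set = True
            out.append("[")
        elif ch == "]" and in_set:
            in_set = False
            out.append("]")
        elif ch == "^" and in_set and condition[i - 1] == "[":
            out.append("^")              # complement of the set
        elif ch == "." and not in_set:
            out.append(".")              # any single character
        else:
            out.append(re.escape(ch))    # literal, including "-" in a set
    body = "".join(out)
    return re.compile("(?:" + body + ")$" if is_suffix else "^(?:" + body + ")")


not_y = condition_to_regex("[^y]", is_suffix=True)
print(bool(not_y.search("work")), bool(not_y.search("try")))
```

Here "[^y]" accepts "work" and rejects "try", matching the '-ed' example near the top of this page.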

CIRCUMFIX flag
Affixes marked with the CIRCUMFIX flag may be on a word when this
word also has a prefix with the CIRCUMFIX flag, and vice versa.

FORBIDDENWORD flag
This flag marks a forbidden word form. Because affixed forms are
also forbidden, we can subtract a subset from the set of accepted
affixed and compound words.

KEEPCASE flag
Forbid uppercased and capitalized forms of words marked with the
KEEPCASE flag. Useful for special orthographies (measurements and
currency often keep their case in uppercased texts) and writing
systems (e.g. keeping lower case of IPA characters).

Note: With the CHECKSHARPS declaration, words with sharp s and
the KEEPCASE flag may be capitalized and uppercased, but
uppercased forms of these words may not contain sharp s, only SS.
See the germancompounding example in the tests directory of the
Hunspell distribution.

LEMMA_PRESENT flag
Not used in Hunspell 1.2. Use the "st:" field instead of
LEMMA_PRESENT.

NEEDAFFIX flag
This flag marks virtual stems in the dictionary. Only affixed
forms of these words will be accepted by Hunspell, unless the
dictionary word has a homonym or a zero affix. NEEDAFFIX also
works with prefixes and prefix + suffix combinations (see
tests/pseudoroot5.*).

PSEUDOROOT flag
Deprecated. (Former name of the NEEDAFFIX option.)

SUBSTANDARD flag
The SUBSTANDARD flag marks affix rules and dictionary words
(allomorphs) not used in morphological generation (and, in future
versions, in suggestion). See also NOSUGGEST.

WORDCHARS characters
WORDCHARS extends the tokenizer of the Hunspell command line
interface with additional word characters. For example, dot,
dash, n-dash, numbers and the percent sign are word characters in
Hungarian.

CHECKSHARPS
The SS letter pair in uppercased (German) words may be an upper
case sharp s (ß). Hunspell can handle this special casing with
the CHECKSHARPS declaration (see also the KEEPCASE flag and the
tests/germancompounding example) in both spelling and suggestion.

Hunspell’s dictionary items and affix  rules  may  have	 optional
space  or  tabulator  separated morphological description fields,
started with 3‐character (two letters and a colon) field IDs:

 word/flags po:noun is:nom

Example: We define a simple resource with morphological
information, a derivational suffix (ds:) and a part of speech
category (po:):

Affix file:

 SFX X Y 1
 SFX X 0 able . ds:able

Dictionary file:

 drink/X po:verb

Test file:

 drink
 drinkable

Test:

 $ analyze test.aff test.dic test.txt
 > drink
 analyze(drink) = po:verb
 stem(drink) = po:verb
 > drinkable
 analyze(drinkable) = po:verb ds:able
 stem(drinkable) = drinkable

You can see in the example, that the  analyzer	concatenates  the
morphological fields in item and arrangement style.
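
For readers who want to process such lines themselves, a dictionary entry with morphological fields can be split as in the following Python sketch. The function name parse_entry is invented here; the sketch assumes whitespace-separated fields with two-letter IDs and ignores the "\/" escape for slashes inside words.

```python
def parse_entry(line):
    """Split a dictionary line into word, flags and (ID, value) fields."""
    head, *fields = line.split()
    word, _, flags = head.partition("/")
    morph = []
    for field in fields:
        field_id, sep, value = field.partition(":")
        if sep and len(field_id) == 2:     # two letters and a colon
            morph.append((field_id, value))
    return word, flags, morph


print(parse_entry("drink/X po:verb"))
print(parse_entry("word/flags po:noun is:nom"))
```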

Default morphological and other IDs (used in suggestion, stemming
and morphological generation):

ph:
Alternative transliteration for better suggestion. It's useful
for words with foreign pronunciation. (Dictionary based phonetic
suggestion.) For example:

Marseille ph:maarsayl

st:
Stem. Optional: the default stem is the dictionary item in
morphological analysis. The stem field is useful for virtual
stems (dictionary words with the NEEDAFFIX flag) and for
morphological exceptions, instead of new, single-use
morphological rules.

feet  st:foot  is:plural
mice  st:mouse is:plural
teeth st:tooth is:plural

Word forms with multiple stems need multiple dictionary items:

lay po:verb st:lie is:past_2
lay po:verb is:present
lay po:noun

al:
Allomorph(s). A dictionary item is the stem of its allomorphs.
Morphological generation needs stem, allomorph and affix fields.

sing al:sang al:sung
sang st:sing
sung st:sing

po:
Part of speech category.

ds:
Derivational suffix(es). Stemming doesn't remove derivational
suffixes. Morphological generation depends on the order of the
suffix fields.

In affix rules:

SFX Y Y 1
SFX Y 0 ly . ds:ly_adj

In the dictionary:

ably st:able ds:ly_adj
able al:ably

is:
Inflectional suffix(es). All inflectional suffixes are removed by
stemming. Morphological generation depends on the order of the
suffix fields.

feet st:foot is:plural

ts:
Terminal suffix(es). Terminal suffix fields are inflectional
suffix fields "removed" by additional (not terminal) suffixes.
Useful for zero morphemes and affixes removed by splitting rules.

work/D ts:present

SFX D Y 2
SFX D	0 ed . is:past_1
SFX D	0 ed . is:past_2

Typical	 example  of  terminal	suffix is nominative of languages
with case suffixes.

sp:
Surface prefix. Temporary solution for adding prefixes to the
stems and generated word forms. See the tests/morph.* example.

pa:
Parts of the compound words. Output fields of morphological
analysis for stemming.

dp:
Planned: derivational prefix.

ip:
Planned: inflectional prefix.

tp:
Planned: terminal prefix.

Ispell's original algorithm strips only one suffix. Hunspell can
strip one more.

Twofold suffix stripping is a significant improvement in handling
the immense number of suffixes that characterize agglutinative
languages.

Extending the previous example by adding a second  suffix  (affix
class Y will be the continuation class of the suffix ‘able’):

 SFX Y Y 1
 SFX Y 0 s . +PLUR

 SFX X Y 1
 SFX X 0 able/Y . +ABLE

Dictionary file:

 drink/X   [VERB]

Test file:

 drink
 drinkable
 drinkables

Test:

 $ hunmorph test.aff test.dic test.txt
 drink:	     drink[VERB]
 drinkable:  drink[VERB]+ABLE
 drinkables: drink[VERB]+ABLE+PLUR
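
The effect of a continuation class can be imitated in a few lines of Python. This is a generation-side sketch only (Hunspell itself works by stripping suffixes from the input); the SUFFIXES table and the generate function are names invented for the example, and the depth limit of two mirrors the twofold stripping described above.

```python
# Suffix classes from the example: flag -> [(suffix, continuation flags, tag)].
SUFFIXES = {
    "X": [("able", "Y", "+ABLE")],
    "Y": [("s", "", "+PLUR")],
}


def generate(form, analysis, flags, depth=2):
    """Yield (surface form, analysis) pairs, adding at most `depth` suffixes."""
    yield form, analysis
    if depth == 0:
        return
    for flag in flags:
        for suffix, continuation, tag in SUFFIXES.get(flag, []):
            yield from generate(form + suffix, analysis + tag, continuation, depth - 1)


forms = dict(generate("drink", "drink[VERB]", "X"))
for surface, analysis in forms.items():
    print(surface + ":", analysis)
```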

Theoretically, twofold suffix stripping needs only the square
root of the number of suffix rules, compared with a one-level
(Ispell-style) implementation. In practice, we were able to
elaborate the Hungarian inflectional morphology with twofold
suffix stripping.

Note: The grammar of the Hunlex preprocessor can use not only
twofold, but multiple suffix splitting.

Hunspell can handle more than 65000 affix classes. There are two
new syntaxes for giving flags in affix and dictionary files.

FLAG long command sets 2‐character flags:

  FLAG long
  SFX Y1 Y 1
  SFX Y1 0 s .

Dictionary record with the Y1, Z3, F? flags:

  foo/Y1Z3F?

FLAG num command sets numerical flags separated by comma:

  FLAG num
  SFX 65000 Y 1
  SFX 65000 0 s .

Dictionary example:

  foo/65000,12,2756
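
The three flag syntaxes differ only in how the text after the slash is cut into individual flags, which the following Python sketch illustrates. The function name split_flags and the type label "char" for the default single-character format are assumptions of this example.

```python
def split_flags(flags, flag_type="char"):
    """Cut a flag field into individual flags according to the FLAG type."""
    if flag_type == "long":
        # Two characters per flag.
        return [flags[i:i + 2] for i in range(0, len(flags), 2)]
    if flag_type == "num":
        # Decimal numbers separated by commas.
        return [int(number) for number in flags.split(",")]
    # Default: one character per flag.
    return list(flags)


print(split_flags("AB"))
print(split_flags("Y1Z3F?", "long"))
print(split_flags("65000,12,2756", "num"))
```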

Hunspell’s  dictionary	can  contain  repeating elements that are
homonyms:

 work/A	   [VERB]
 work/B	   [NOUN]

An affix file:

 SFX A Y 1
 SFX A 0 s . +SG3

 SFX B Y 1
 SFX B 0 s . +PLUR

Test file:

 works

Test:

 > works
 work[VERB]+SG3
 work[NOUN]+PLUR

This feature also gives a way  to  forbid  illegal  prefix/suffix
combinations in difficult cases.

An interesting side-effect of multi-step stripping is that the
appropriate treatment of circumfixes now comes for free. For
instance, in Hungarian, superlatives are formed by simultaneous
prefixation of leg- and suffixation of -bb to the adjective base.
A problem with the one-level architecture is that there is no way
to render lexical licensing of particular prefixes and suffixes
interdependent, and therefore incorrect forms are recognized as
valid, e.g. *legvén = leg + vén 'old'. Until the introduction of
clusters, a special treatment of the superlative had to be
hardwired in the earlier HunSpell code. This may have been
legitimate for a single case, but in fact prefix-suffix
dependences are ubiquitous in category-changing derivational
patterns (cf. English payable, non-payable but *non-pay, or
drinkable, undrinkable but *undrink). In simple words, here, the
prefix un- is legitimate only if the base drink is suffixed with
-able. If both these patterns are handled by on-line affix rules
and affix rules are checked against the base only, there is no
way to express this dependency and the system will necessarily
over- or undergenerate.

In the next example, suffix class R has a prefix 'continuation'
class (class P).

PFX P Y 1
PFX P	0 un . [prefix_un]+

SFX S Y 1
SFX S	0 s . +PL

SFX Q Y 1
SFX Q	0 s . +3SGV

SFX R Y 1
SFX R	0 able/PS . +DER_V_ADJ_ABLE

Dictionary:

2
drink/RQ	[verb]
drink/S [noun]

Morphological analysis:

> drink
drink[verb]
drink[noun]
> drinks
drink[verb]+3SGV
drink[noun]+PL
> drinkable
drink[verb]+DER_V_ADJ_ABLE
> drinkables
drink[verb]+DER_V_ADJ_ABLE+PL
> undrinkable
[prefix_un]+drink[verb]+DER_V_ADJ_ABLE
> undrinkables
[prefix_un]+drink[verb]+DER_V_ADJ_ABLE+PL
> undrink
Unknown word.
> undrinks
Unknown word.

Conditional affixes implemented by a continuation class	 are  not
enough	for circumfixes, because a circumfix is one affix in mor‐
phology. We also need CIRCUMFIX option for correct  morphological
analysis.

# circumfixes: ~ obligate prefix/suffix combinations
# superlative in Hungarian: leg‐ (prefix) AND ‐bb (suffix)
# nagy, nagyobb, legnagyobb, legeslegnagyobb
# (great, greater, greatest, most greatest)

CIRCUMFIX X

PFX A Y 1
PFX A 0 leg/X .

PFX B Y 1
PFX B 0 legesleg/X .

SFX C Y 3
SFX C 0 obb . +COMPARATIVE
SFX C 0 obb/AX . +SUPERLATIVE
SFX C 0 obb/BX . +SUPERSUPERLATIVE

Dictionary:

1
nagy/C	[MN]

Analysis:

> nagy
nagy[MN]
> nagyobb
nagy[MN]+COMPARATIVE
> legnagyobb
nagy[MN]+SUPERLATIVE
> legeslegnagyobb
nagy[MN]+SUPERSUPERLATIVE

Allowing free compounding decreases the precision of recognition,
not to mention stemming and morphological analysis. Although
Ispell introduced lexical switches to license compounding of
bases, this proves not to be restrictive enough. For example:

# affix file
COMPOUNDFLAG X

2
foo/X
bar/X

With this resource, foobar and barfoo also are accepted words.
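
The over-acceptance is easy to see in a Python sketch of free compounding. The function name is_compound is invented here, the word baz (without the flag) is added purely for contrast, and COMPOUNDMIN is set to the default of 3 mentioned earlier; affixes and all other compound options are ignored.

```python
WORDS = {"foo": "X", "bar": "X", "baz": ""}   # "baz" is a hypothetical extra entry
COMPOUNDFLAG = "X"
COMPOUNDMIN = 3


def is_compound(text, parts=0):
    """True if `text` is a sequence of two or more flagged words."""
    if not text:
        return parts >= 2
    for word, flags in WORDS.items():
        if (
            COMPOUNDFLAG in flags
            and len(word) >= COMPOUNDMIN
            and text.startswith(word)
            and is_compound(text[len(word):], parts + 1)
        ):
            return True
    return False


for candidate in ("foobar", "barfoo", "foofoobar", "foobaz"):
    print(candidate, is_compound(candidate))
```

Any ordering and any number of flagged words is accepted, which is why the more restrictive options described in this section exist.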

This has been improved upon with the introduction of
direction-sensitive compounding, i.e., lexical features can
specify separately whether a base can occur as the leftmost or
rightmost constituent in compounds. This, however, is still
insufficient to handle the intricate patterns of compounding, not
to mention idiosyncratic (and language-specific) norms of
hyphenation.

The MySpell algorithm allows any affixed form of words that are
lexically marked as potential members of compounds. Hunspell
improved on this, and its recursive compound checking rules make
it possible to implement the intricate spelling conventions of
Hungarian compounds. For example, the noteworthy Hungarian '6-3'
rule can be expressed using the COMPOUNDWORDMAX,
COMPOUNDSYLLABLE, COMPOUNDROOT and SYLLABLENUM options. As a
further example, in Hungarian, derivational suffixes often modify
compounding properties. Hunspell allows compounding flags on
affixes, and there are two special flags (COMPOUNDPERMITFLAG and
COMPOUNDFORBIDFLAG) to permit or prohibit compounding of the
derivations.

Suffixes with the COMPOUNDFORBIDFLAG flag forbid compounding of
the affixed word.

We also need several Hunspell features for handling  German  compounding:

# German compounding

# set language to handle special casing of German sharp s

LANG de_DE

# compound flags

COMPOUNDBEGIN U
COMPOUNDMIDDLE V
COMPOUNDEND W

# Prefixes are allowed at the beginning of compounds,
# suffixes are allowed at the end of compounds by default:
# (prefix)?(root)+(affix)?
# Affixes with COMPOUNDPERMITFLAG may be inside of compounds.
COMPOUNDPERMITFLAG P

# for German fogemorphemes (Fuge‐element)
# Hint: ONLYINCOMPOUND is not required everywhere, but the
# checking will be a little faster with it.

ONLYINCOMPOUND X

# forbid uppercase characters at compound word bounds
CHECKCOMPOUNDCASE

# for handling Fuge‐elements with dashes (Arbeits‐)
# dash will be a special word

COMPOUNDMIN 1
WORDCHARS ‐

# compound settings and fogemorpheme for ‘Arbeit’

SFX A Y 3
SFX A 0 s/UPX .
SFX A 0 s/VPDX .
SFX A 0 0/WXD .

SFX B Y 2
SFX B 0 0/UPX .
SFX B 0 0/VWXDP .

# a suffix for ‘Computer’

SFX C Y 1
SFX C 0 n/WD .

# for forbid exceptions (*Arbeitsnehmer)

FORBIDDENWORD Z

# dash prefix for compounds with dash (Arbeits‐Computer)

PFX ‐ Y 1
PFX ‐ 0 ‐/P .

# decapitalizing prefix
# circumfix for positioning in compounds

PFX D Y 29
PFX D A a/PX A
PFX D Ä ä/PX Ä
 .
 .
PFX D Y y/PX Y
PFX D Z z/PX Z

Example dictionary:

4
Arbeit/A‐
Computer/BC‐
‐/W
Arbeitsnehmer/Z

Accepted compound words with the previous resource:

Computer
Computern
Arbeit
Arbeits-
Computerarbeit
Computerarbeits-
Arbeitscomputer
Arbeitscomputern
Computerarbeitscomputer
Computerarbeitscomputern
Arbeitscomputerarbeit
Computerarbeits‐Computer
Computerarbeits‐Computern

Not accepted compoundings:

computer
arbeit
Arbeits
arbeits
ComputerArbeit
ComputerArbeits
Arbeitcomputer
ArbeitsComputer
Computerarbeitcomputer
ComputerArbeitcomputer
ComputerArbeitscomputer
Arbeitscomputerarbeits
Computerarbeits‐computer
Arbeitsnehmer

This  solution	is still not ideal, however, and will be replaced
by a pattern‐based compound‐checking algorithm which  is  closely
integrated  with  input	 buffer tokenization. Patterns describing
compounds come as a separate input resource  that  can	refer  to
high‐level  properties	of  constituent parts (e.g. the number of
syllables, affix flags, and containment of hyphens). The patterns
are  matched  against potential segmentations of compounds to assess wellformedness.

Both Ispell and MySpell use 8-bit character encodings, which is a
major deficiency when it comes to scalability. Although a
language like Hungarian has a standard 8-bit character set (ISO
8859-2), it fails to allow a full implementation of Hungarian
orthographic conventions. For instance, the '–' symbol (n-dash)
is missing from this character set, even though it is not only
the official symbol to delimit parenthetic clauses in the
language, but can also appear in compound words as a special
'big' hyphen.

MySpell has got some 8‐bit encoding tables, but	 there	are  lan‐
guages	without	 standard 8‐bit encoding, too. For example, a lot
of African languages have non‐latin or extended latin characters.

Similarly, using the original spelling of certain foreign names
like Ångström or Molière is encouraged by the Hungarian spelling
norm. Since the characters 'Å' and 'è' are not part of ISO
8859-2, when they combine with inflections containing characters
found only in ISO 8859-2 (like elative -ből, ablative -től or
delative -ről with double acute), the resulting words (like
Ångströmről or Molière-től) cannot be encoded using any single
8-bit encoding scheme.
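
The encoding problem can be checked directly with Python's standard codecs. The helper name encodable is invented for this sketch; "iso8859_1" and "iso8859_2" are Python's codec names for ISO 8859-1 and ISO 8859-2.

```python
def encodable(word, encoding):
    """True if every character of `word` exists in `encoding`."""
    try:
        word.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


word = "Ångströmről"
for encoding in ("iso8859_1", "iso8859_2", "utf-8"):
    print(encoding, encodable(word, encoding))
```

The stem alone fits ISO 8859-1 and the suffix alone fits ISO 8859-2, but the inflected word fits neither, only a Unicode encoding such as UTF-8.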

The problems raised in relation to 8-bit encodings have long been
recognized by proponents of Unicode. It is clear that trading
efficiency for encoding independence has its advantages when it
comes to a truly multilingual application. Hunspell implements
memory- and time-efficient Unicode handling. In non-UTF-8
character encodings Hunspell works with the original 8-bit
strings. In UTF-8 encoding, affixes and words are stored in UTF-8
and are handled mostly in UTF-8 during analysis; for condition
checking and suggestion they are converted to UTF-16. Unicode
text analysis and spell checking have a minimal (0-20%) time
overhead and a minimal or reasonable memory overhead, depending
on the language (its UTF-8 encoding and affixation).


Copyright (c) for man pages and the logo by the respective OS vendor.
