As explained in my article “test – what is eng.traineddata?”, tesseract 3.00 expects several dawg (Directed Acyclic Word Graph) dictionaries:
These files are created from simple UTF-8 text files (one word per line) by the program wordlist2dawg. Besides the word list and the name of the output file, it needs the unicharset file as its last parameter. So for Slovak I run:
    $ /usr/src/tesseract-ocr-r319/training/wordlist2dawg number \
        slk.number-dawg slk.unicharset
    $ /usr/src/tesseract-ocr-r319/training/wordlist2dawg punc \
        slk.punc-dawg slk.unicharset
    $ /usr/src/tesseract-ocr-r319/training/wordlist2dawg word_list \
        slk.word-dawg slk.unicharset
    $ /usr/src/tesseract-ocr-r319/training/wordlist2dawg frequency_list \
        slk.freq-dawg slk.unicharset
A dictionary helps to improve OCR results. For example, in some fonts it is difficult for OCR software to distinguish between “l” and “1”. In such cases a dictionary can help: the OCR result will be “all” rather than “a11” (provided “all” is in the dictionary and “a11” is not).
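The idea can be shown with a toy shell sketch. It is only an illustration of the principle, not how tesseract handles dictionaries internally: the function name `pick_word` and the plain-text dictionary file are assumptions made up for this example.

```shell
#!/bin/sh
# Toy illustration only (hypothetical helper, not part of tesseract).
# Usage: pick_word DICT CANDIDATE...
# DICT is a plain text file with one word per line.
# Prints the first candidate that is listed in DICT; if none is listed,
# falls back to the first candidate.
pick_word() {
    dict=$1
    shift
    for cand in "$@"; do
        # -x: match the whole line, -F: fixed string, -q: quiet
        if grep -qxF -e "$cand" "$dict"; then
            printf '%s\n' "$cand"
            return 0
        fi
    done
    printf '%s\n' "$1"
}
```

Called as `pick_word word_list a11 all`, it prints `all` if the word list contains “all” but not “a11”.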
In tesseract 3.00 the dawg dictionaries are optional files (in version 2.04 the dictionary files are required, otherwise tesseract does not work).
If you decide to create a dictionary, there must be at least one word in the input file. An input file can easily be created from Wikipedia. Other good sources are spellcheckers, translation dictionaries or other open linguistic projects, but pay attention to the license conditions of the data.
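A minimal sketch of how such an input file might be prepared from plain text follows. The function name `make_word_list` and the file name `corpus.txt` are hypothetical, and the filtering rules (split on whitespace, strip ASCII punctuation at the word edges, drop tokens that contain digits) are my own assumptions; a real word list will probably need more careful cleaning.

```shell
#!/bin/sh
# Hypothetical helper: read UTF-8 text on stdin and print a word list
# (one word per line, duplicates removed) on stdout.
# LC_ALL=C makes the tools work on bytes, so multi-byte UTF-8 letters
# are passed through untouched and the sort order is reproducible.
make_word_list() {
    LC_ALL=C tr -s '[:space:]' '\n' |
        LC_ALL=C sed -e 's/^[[:punct:]]*//' -e 's/[[:punct:]]*$//' |
        LC_ALL=C grep -v -e '^$' -e '[0-9]' |
        LC_ALL=C sort -u
}

# Assumed usage (corpus.txt is a placeholder for your own text dump):
#   make_word_list < corpus.txt > word_list
#   [ -s word_list ] || echo "word_list is empty" >&2
```

The `[ -s word_list ]` check in the usage comment guards against an empty input file, since at least one word is required.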
If you need to turn off a particular dawg file or to increase verbosity when the lang.traineddata file is loaded, you can use the following variables:
variable | default setting | comment |
---|---|---|
global_load_punc_dawg | true | Load dawg with punctuation patterns. |
global_load_number_dawg | true | Load dawg with number patterns. |
global_load_freq_dawg | true | Load frequent word dawg. |
global_load_system_dawg | true | Load system word dawg. |
global_tessdata_manager_debug_level | 0 | Debug level for TessdataManager functions. |
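One way to set these variables is probably a tesseract config file, which holds one `name value` pair per line and is named as the last argument on the command line. The sketch below writes such a file; the file name `nodawg` and the image and output names are placeholders, and I have not verified that every variable in the table takes effect this way in 3.00, so treat it as an assumption to test.

```shell
#!/bin/sh
# Write a hypothetical config file "nodawg" that switches off the system
# and frequent-word dawgs and raises the TessdataManager debug level.
cat > nodawg <<'EOF'
global_load_system_dawg false
global_load_freq_dawg false
global_tessdata_manager_debug_level 1
EOF

# Assumed usage (depending on the installation, the config file may have
# to be placed in tessdata/configs/):
#   tesseract image.tif output -l slk nodawg
```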
According to Training Tesseract 2.04, the unicharambigs file is created manually. It represents the intrinsic ambiguity between characters or sets of characters. It is an optional file (i.e. you can skip it when creating lang.traineddata).
Here is an example of a few lines from eng.unicharambigs:
    v1
    2 ' ' 1 " 1
    2 ` ’ 1 " 1
    2 ’ ` 1 " 1
    2 ‘ ‘ 1 “ 1
    2 ‘ ’ 1 " 1
    2 ’ ‘ 1 " 1
    2 ’ ’ 1 ” 1
    2 , , 1 „ 1
    1 m 2 r n 0
    2 r n 1 m 0
    1 m 2 i n 0
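To make the layout of these lines easier to read, here is a small awk sketch that decodes them. It assumes the v1 layout as I understand it from the Tesseract training documentation: a count of source characters, the source characters, a count of target characters, the target characters, and a final type field where 1 means a mandatory and 0 an optional replacement. The function name `parse_ambigs` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read unicharambigs (v1) lines on stdin and print
# each rule as "source -> target (mandatory|optional)".
parse_ambigs() {
    awk '
        NR == 1 && $1 == "v1" { next }   # skip the version header
        NF == 0 { next }                 # skip empty lines
        {
            n = $1                       # number of source characters
            src = ""
            for (i = 2; i <= 1 + n; i++) src = src $i
            m = $(2 + n)                 # number of target characters
            dst = ""
            for (i = 3 + n; i <= 2 + n + m; i++) dst = dst $i
            kind = $(3 + n + m)          # assumed: 1 = mandatory, 0 = optional
            printf "%s -> %s (%s)\n", src, dst, (kind == 1 ? "mandatory" : "optional")
        }'
}

# Assumed usage:
#   parse_ambigs < eng.unicharambigs
```

For the line `1 m 2 r n 0` this prints `m -> rn (optional)`.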
For tesseract 3.00 there are some changes:
(… '' should always be changed to ").
There are several rules for these files:
If you are interested in the development of lang.unicharambigs, please have a look at the unicharambigs files extracted from the tesseract 3.00 lang.traineddata files. Files for the following languages are present in this package: