Based on my tests, it looks like I am missing only a few files and can reuse most of my trained data from tesseract 2.04.
So I created these files for Slovak:
When I analysed the existing language files for tesseract 3.00, I found that the xxx.config file is not present in any of them, so I believe it can be skipped for the moment.
xxx.unicharambigs is present in only a few language files (deu, ell, eng, fra, ita, nld, rus, spa). Based on its content, it looks like a new version of DangAmbigs with a version line and an additional column:
v1
2 ' ' 1 " 1
2 ` ' 1 " 1
2 ' ` 1 " 1
2 ‘ ' 1 " 1
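Since this file turns out to matter later, a quick structural check can be useful. The following is only a sketch: the layout it tests (a version line, then a count, that many source tokens, a second count, that many target tokens, and one trailing flag) is inferred from the sample above, not taken from a format specification, and `check_ambigs` is a made-up helper name.

```shell
# Sketch: sanity-check a unicharambigs file.
# Assumed layout (inferred from the sample above, not from a format spec):
#   line 1:      version marker such as "v1"
#   other lines: N, N source tokens, M, M target tokens, one type flag
check_ambigs() {
    awk '
        NR == 1 {
            if (NF != 1 || $1 !~ /^v[0-9]+$/) {
                print FILENAME ": line 1: expected a version line such as v1"
                bad = 1
            }
            next
        }
        NF == 0 { next }    # ignore blank lines
        {
            ok = ($1 ~ /^[0-9]+$/)
            if (ok) {
                m = $($1 + 2)    # count in front of the target tokens
                ok = (m ~ /^[0-9]+$/ && NF == $1 + m + 3)
            }
            if (!ok) {
                print FILENAME ": line " NR ": field counts do not add up: " $0
                bad = 1
            }
        }
        END { exit bad + 0 }
    ' "$1"
}
```

Calling `check_ambigs slk.unicharambigs` prints every line whose counts do not match and returns a non-zero status if it found any. It does not verify that the characters exist in slk.unicharset.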
xxx.punc-dawg (punctuation dictionary?) and xxx.number-dawg (number dictionary?) look like further Directed Acyclic Word Graph dictionaries. It is enough if there is one word (based on information from DangAmbigs). For the first test I ignored them (numbers and punctuation are in my old slk.word-dawg).
slk.user-words is not used by combine_tessdata.
The following command produced slk.traineddata without problems:
$ training/combine_tessdata /Projekty/tesseract/tesseract-slovak3/slk.
Then I installed it:
$ sudo cp -f /Projekty/tesseract/tesseract-slovak3/slk.traineddata \
    /usr/local/share/tessdata/
The first test revealed that something was wrong:
$ /usr/local/bin/tesseract eurotext.tif eurotext -l slk
So I decided to recreate all the slk dawg files with tesseract 3.00 (until then I had used files created with tesseract 2.04). I found that the new version of wordlist2dawg (located in the training directory) needs more arguments than the 2.04 version:
Usage: training/wordlist2dawg [-t] word_list_file dawg_file unicharset_file
So I split the old slk.word_list into slk.number, slk.punc and slk.word_list and created new dictionaries:
$ /usr/src/tesseract-ocr-r319/training/wordlist2dawg number \
    slk.number-dawg slk.unicharset
$ /usr/src/tesseract-ocr-r319/training/wordlist2dawg punc \
    slk.punc-dawg slk.unicharset
$ /usr/src/tesseract-ocr-r319/training/wordlist2dawg word_list \
    slk.word-dawg slk.unicharset
$ /usr/src/tesseract-ocr-r319/training/wordlist2dawg frequency_list \
    slk.freq-dawg slk.unicharset
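The split-and-rebuild step could be scripted roughly as follows. This is a sketch under assumptions, not the procedure actually used here: the splitting rule (punctuation-only lines, lines made of digits and punctuation, everything else) is a guess at a reasonable division, and `split_word_list`, `build_dawgs` and the `WORDLIST2DAWG` variable are names made up for illustration. The wordlist2dawg argument order follows the usage line quoted earlier.

```shell
# Sketch of the split-and-rebuild step. The splitting rule is an assumption:
# the text does not say how the old list was divided.

# Tool path from the commands above; override WORDLIST2DAWG for another build tree.
WORDLIST2DAWG=${WORDLIST2DAWG:-/usr/src/tesseract-ocr-r319/training/wordlist2dawg}

# split_word_list OLD_LIST: write ./punc, ./number and ./word_list.
# OLD_LIST must not itself be named punc, number or word_list.
split_word_list() {
    # LC_ALL=C keeps the classes ASCII-only, so accented words count as words.
    # "|| true" because grep exits 1 when nothing matches.
    LC_ALL=C grep -E '^[[:punct:]]+$' "$1" > punc || true
    LC_ALL=C grep -E '^[[:digit:][:punct:]]+$' "$1" \
        | LC_ALL=C grep '[[:digit:]]' > number || true
    LC_ALL=C grep -Ev '^[[:digit:][:punct:]]*$' "$1" > word_list || true
}

# build_dawgs LANG: run wordlist2dawg once per list found in the current directory.
build_dawgs() {
    lang=$1
    # "input-list:dawg-kind" pairs, taken from the four commands above.
    for pair in number:number punc:punc word_list:word frequency_list:freq; do
        list=${pair%%:*}
        kind=${pair##*:}
        if [ ! -s "$list" ]; then
            echo "skipping $list: missing or empty" >&2
            continue
        fi
        "$WORDLIST2DAWG" "$list" "$lang.$kind-dawg" "$lang.unicharset" || return 1
    done
}
```

Run from the directory that holds slk.unicharset, this would be `split_word_list slk.word_list` followed by `build_dawgs slk`. The frequency list is not produced by the split, so it has to exist already as frequency_list or it is skipped.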
After this change I got a new error:
Tesseract Open Source OCR Engine with Leptonica
index >= 0 && index < size_used_:Error:Assert failed:in file ../ccutil/genericvector.h, line 215
Segmentation fault
After a few checks I found that there was a problem with slk.unicharambigs, so I simply removed it.
Then I created and installed slk.traineddata once again. This time tesseract worked with my slk.traineddata.