Environment: AlmaLinux 9.2 with Plesk.
References:
https://qiita.com/aki_abekawa/items/c2b94187f2ba7dc56993
https://qiita.com/keraFPV/items/fc87a3d048c47cf2ba8b
Check where the trained data (tessdata) is stored
$ /usr/local/bin/tesseract --list-langs
List of available languages in "/usr/local/share/tessdata/" (3):
eng
jpn
spa
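Since the later steps rely on specific languages being installed (eng is used as the start model), a small check can save a failed run. The following is only a sketch: `has_lang` is a hypothetical helper name, and it assumes the listing is printed to stdout with a single header line, as in the output above.

```shell
# Succeed if language code $1 appears in `tesseract --list-langs` output read from stdin.
# Assumption: the first line is the "List of available languages ..." header.
has_lang() {
  # Drop the header line, then require an exact whole-line match.
  tail -n +2 | grep -qx -- "$1"
}

# Usage (binary path as in the listing above):
# /usr/local/bin/tesseract --list-langs | has_lang eng && echo "eng is available"
```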
Download the source data for handwriting training
$ wget http://www.itl.nist.gov/iaui/vip/cs_links/EMNIST/matlab.zip
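The generation step reads `../matlab/emnist-byclass.mat`, so the archive presumably has to be extracted first. A minimal sketch follows, assuming `unzip` is installed and that `matlab.zip` unpacks into a `matlab/` directory (inferred from that path, not confirmed here); `extract_once` is a hypothetical helper name.

```shell
# Extract archive $1 unless directory $2 already exists,
# so that a large archive is not unpacked twice.
extract_once() {
  if [ -d "$2" ]; then
    echo "skip: $2 already exists"
  else
    unzip -q "$1"
  fi
}

# Usage, run in the directory where matlab.zip was downloaded:
# extract_once matlab.zip matlab
```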
Generate the image data and text data
$ python3 generate_training_data.py ../matlab/emnist-byclass.mat
Download OCR-D's ocrd-train for training
$ cd ../ && git clone --depth 1 https://github.com/OCR-D/ocrd-train.git
Create the folders that hold the training data
$ mkdir -p ocrd-train/usr/share/tessdata/
# For data storage
$ mkdir -p ocrd-train/data/
# Folder for the ground-truth training data (the name used below is tegaki)
$ mkdir -p ocrd-train/data/(training data name)-ground-truth/
Download the best models (tessdata_best)
$ git clone https://github.com/tesseract-ocr/tessdata_best
$ cp tessdata_best/eng.traineddata ocrd-train/usr/share/tessdata/
$ cp tessdata_best/jpn.traineddata ocrd-train/usr/share/tessdata/
$ cp tessdata_best/jpn_vert.traineddata ocrd-train/usr/share/tessdata
Download the following into ocrd-train/data (possibly not needed?)
$ git clone https://github.com/tesseract-ocr/langdata/
Move the generated images and text files into the ground-truth folder
$ mv emnist/*.txt ocrd-train/data/tegaki-ground-truth/
$ mv emnist/*.tif ocrd-train/data/tegaki-ground-truth/
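Training builds one `.lstmf` file per line image, and each image needs its transcription next to it. The sketch below lists images that have no transcription. It assumes the `NAME.tif` / `NAME.gt.txt` naming that ocrd-train expects, as far as I know, and that the generation script followed it; `missing_gt` is a hypothetical helper name.

```shell
# Print every .tif in directory $1 that has no matching .gt.txt beside it.
# Assumption: transcriptions are named NAME.gt.txt for an image NAME.tif.
missing_gt() {
  for tif in "$1"/*.tif; do
    [ -e "$tif" ] || continue               # the glob matched nothing
    [ -f "${tif%.tif}.gt.txt" ] || echo "$tif"
  done
}

# Usage: no output means every image has a transcription.
# missing_gt ocrd-train/data/tegaki-ground-truth
```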
Run the training
$ make training MODEL_NAME=tegaki START_MODEL=eng >> train.log 2>&1
$ tail -f httpdocs/ocr.train/ocrd-train/train.log
Loaded 1/1 lines (1-1) of document data/tegaki-ground-truth/eng.7383.lstmf
Loaded 1/1 lines (1-1) of document data/tegaki-ground-truth/eng.0745.lstmf
Loaded 1/1 lines (1-1) of document data/tegaki-ground-truth/eng.4581.lstmf
Loaded 1/1 lines (1-1) of document data/tegaki-ground-truth/eng.11543.lstmf
Loaded 1/1 lines (1-1) of document data/tegaki-ground-truth/eng.16189.lstmf
Loaded 1/1 lines (1-1) of document data/tegaki-ground-truth/eng.14246.lstmf
Loaded 1/1 lines (1-1) of document data/tegaki-ground-truth/eng.6824.lstmf
Loaded 1/1 lines (1-1) of document data/tegaki-ground-truth/eng.17192.lstmf
Loaded 1/1 lines (1-1) of document data/tegaki-ground-truth/eng.2213.lstmf
Loaded 1/1 lines (1-1) of document data/tegaki-ground-truth/eng.0363.lstmf
Loaded 1/1 lines (1-1) of document data/tegaki-ground-truth/eng.13637.lstmf
Loaded 1/1 lines (1-1) of document data/tegaki-ground-truth/eng.7977.lstmf
Loaded 1/1 lines (1-1) of document data/tegaki-ground-truth/eng.1859.lstmf
2 Percent improvement time=100, best error was 100 @ 0
At iteration 100/100/100, Mean rms=3.152%, delta=15.367%, char train=56.65%, word train=100%, skip ratio=0%, New best char error = 56.65 wrote best model:data/tegaki/checkpoints/tegaki56.65_100.checkpoint wrote checkpoint.
2 Percent improvement time=100, best error was 56.65 @ 100
At iteration 200/200/200, Mean rms=2.754%, delta=11.794%, char train=44.687%, word train=100%, skip ratio=0%, New best char error = 44.687 wrote best model:data/tegaki/checkpoints/tegaki44.687_200.checkpoint wrote checkpoint.
2 Percent improvement time=99, best error was 44.687 @ 200
At iteration 299/300/300, Mean rms=2.505%, delta=9.899%, char train=37.567%, word train=99.667%, skip ratio=0%, New best char error = 37.567 wrote best model:data/tegaki/checkpoints/tegaki37.567_299.checkpoint wrote checkpoint.
2 Percent improvement time=98, best error was 37.567 @ 299
At iteration 397/400/400, Mean rms=2.291%, delta=8.477%, char train=32.087%, word train=99.25%, skip ratio=0%, New best char error = 32.087 wrote best model:data/tegaki/checkpoints/tegaki32.087_397.checkpoint wrote checkpoint.
.
.
.
2 Percent improvement time=12, best error was 2.565 @ 830
At iteration 842/2400/2400, Mean rms=0.095%, delta=0.014%, char train=0.068%, word train=1.6%, skip ratio=0%, New best char error = 0.068 wrote best model:data/tegaki/checkpoints/tegaki0.068_842.checkpoint wrote checkpoint.
2 Percent improvement time=12, best error was 2.565 @ 830
At iteration 842/2500/2500, Mean rms=0.087%, delta=0.012%, char train=0.048%, word train=1.1%, skip ratio=0%, New best char error = 0.048 wrote best model:data/tegaki/checkpoints/tegaki0.048_842.checkpoint wrote checkpoint.
.
.
.
2 Percent improvement time=100, best error was 2.565 @ 830
At iteration 930/4700/4700, Mean rms=0.039%, delta=0%, char train=0.003%, word train=0.1%, skip ratio=0%, New best char error = 0.003 wrote best model:data/tegaki/checkpoints/tegaki0.003_930.checkpoint wrote checkpoint.
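Rather than watching the whole log scroll past, the most recent best character error can be pulled out of it. This is a sketch that relies only on the `New best char error = ...` wording visible in the log above; `last_best_error` is a hypothetical helper name. Note that the value is the error on the training data, not on unseen samples.

```shell
# Print the last "New best char error" value found in a training log read from stdin.
last_best_error() {
  sed -n 's/.*New best char error = \([0-9.]*\).*/\1/p' | tail -n 1
}

# Usage, with the log file written by the make command above:
# last_best_error < train.log
```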
The finished model is created under data, so compact it
$ combine_tessdata -c tegaki.traineddata
# When using it, don't forget to select the traineddata you created with -l
$ /usr/local/bin/tesseract test.image.jpg stdout -l tegaki
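For `-l tegaki` to work, tesseract has to find `tegaki.traineddata` in its tessdata directory, which the listing at the top shows as `/usr/local/share/tessdata/`. The sketch below copies the model there. It assumes the finished file is at `ocrd-train/data/tegaki.traineddata`, and `install_model` is a hypothetical helper name.

```shell
# Copy a .traineddata file ($1) into a tessdata directory ($2)
# and print the language code to pass to -l.
install_model() {
  cp -- "$1" "$2"/ || return 1
  basename "$1" .traineddata
}

# Usage, assuming the paths from this walkthrough (may need root privileges):
# install_model ocrd-train/data/tegaki.traineddata /usr/local/share/tessdata
# Alternatively, leave the file where it is and point tesseract at that directory:
# /usr/local/bin/tesseract test.image.jpg stdout --tessdata-dir ocrd-train/data -l tegaki
```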