Tesseract 5: Creating Training Data

The environment is AlmaLinux 9.2 with Plesk.

References:

https://qiita.com/aki_abekawa/items/c2b94187f2ba7dc56993
https://qiita.com/keraFPV/items/fc87a3d048c47cf2ba8b

Check where the trained data is stored

$ /usr/local/bin/tesseract --list-langs

List of available languages in "/usr/local/share/tessdata/" (3):
eng
jpn
spa
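
To check from a script whether a particular model is already installed, the listing above can be filtered. A minimal sketch, assuming a hypothetical helper named has_lang (not part of Tesseract) and that the first line of the output is the header shown above:

```shell
# Hypothetical helper: succeed if the language list on standard input
# contains the given name as a whole line.
# Usage: tesseract --list-langs | has_lang NAME
has_lang() {
    # Skip the "List of available languages ..." header line,
    # then look for an exact, fixed-string, whole-line match
    tail -n +2 | grep -qxF -- "$1"
}
```

For example, `/usr/local/bin/tesseract --list-langs | has_lang jpn && echo installed` should print "installed" with the list shown above.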


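Download the source data for handwriting training (EMNIST). Extract the archive after downloading; the next step reads matlab/emnist-byclass.mat from it.

$ wget http://www.itl.nist.gov/iaui/vip/cs_links/EMNIST/matlab.zip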

Generate the image data and text data

$ python3 generate_training_data.py ../matlab/emnist-byclass.mat

Download ocrd-train from OCR-D for training

$ cd ../ && git clone --depth 1 https://github.com/OCR-D/ocrd-train.git

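Create the folders that will hold the training data

$ mkdir -p ocrd-train/usr/share/tessdata/

# Folder for data
$ mkdir -p ocrd-train/data/

# Folder for the ground-truth data; (training data name) is the model name, tegaki in the steps below
$ mkdir -p ocrd-train/data/(training data name)-ground-truth/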
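
Since the ground-truth folder name is derived from the model name, the folders can be created from a single variable. A minimal sketch, assuming a hypothetical helper named make_train_dirs (not part of ocrd-train) and the model name tegaki that is passed to make training later:

```shell
# Hypothetical helper (not part of ocrd-train): create the directory
# layout used in this article for a given model name.
# Usage: make_train_dirs BASE_DIR MODEL_NAME
make_train_dirs() {
    base=$1
    model=$2
    # The start models (eng, jpn, ...) are copied here in a later step
    mkdir -p "${base}/usr/share/tessdata" || return 1
    # Ground-truth images and transcriptions for this model;
    # -p also creates the parent data/ directory
    mkdir -p "${base}/data/${model}-ground-truth" || return 1
}
```

Calling `make_train_dirs ocrd-train tegaki` should produce the same folders as the three mkdir commands above.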

Download the best models (tessdata_best)

$ git clone https://github.com/tesseract-ocr/tessdata_best
$ cp tessdata_best/eng.traineddata ocrd-train/usr/share/tessdata/
$ cp tessdata_best/jpn.traineddata ocrd-train/usr/share/tessdata/
$ cp tessdata_best/jpn_vert.traineddata ocrd-train/usr/share/tessdata/

Download the following into ocrd-train/data (may not be required?)

$ git clone https://github.com/tesseract-ocr/langdata/

Move the generated image and text data

$ mv emnist/*.txt ocrd-train/data/tegaki-ground-truth/
$ mv emnist/*.tif ocrd-train/data/tegaki-ground-truth/
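
Before starting a long training run, it may be worth confirming that every image has a transcription beside it. A minimal sketch, assuming the name.tif / name.gt.txt pairing that ocrd-train expects for ground truth (adjust the suffix if your generated files are named differently); list_unpaired_tifs is a hypothetical helper name:

```shell
# Hypothetical helper: print every .tif in a ground-truth directory that
# has no matching .gt.txt transcription next to it.
# Usage: list_unpaired_tifs DIR
list_unpaired_tifs() {
    dir=$1
    for tif in "$dir"/*.tif; do
        # If the glob matched nothing, the pattern comes back literally
        [ -e "$tif" ] || continue
        # Assumption: transcription is <name>.gt.txt for <name>.tif
        gt="${tif%.tif}.gt.txt"
        [ -f "$gt" ] || printf '%s\n' "$tif"
    done
}
```

Running `list_unpaired_tifs ocrd-train/data/tegaki-ground-truth` prints nothing when every image is paired.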


Run the training from inside the ocrd-train directory

$ make training MODEL_NAME=tegaki START_MODEL=eng >> train.log 2>&1

Follow the log to watch progress:

$ tail -f httpdocs/ocr.train/ocrd-train/train.log

Loaded 1/1 lines (1-1) of document data/tegaki-ground-truth/eng.7383.lstmf
Loaded 1/1 lines (1-1) of document data/tegaki-ground-truth/eng.0745.lstmf
Loaded 1/1 lines (1-1) of document data/tegaki-ground-truth/eng.4581.lstmf
Loaded 1/1 lines (1-1) of document data/tegaki-ground-truth/eng.11543.lstmf
Loaded 1/1 lines (1-1) of document data/tegaki-ground-truth/eng.16189.lstmf
Loaded 1/1 lines (1-1) of document data/tegaki-ground-truth/eng.14246.lstmf
Loaded 1/1 lines (1-1) of document data/tegaki-ground-truth/eng.6824.lstmf
Loaded 1/1 lines (1-1) of document data/tegaki-ground-truth/eng.17192.lstmf
Loaded 1/1 lines (1-1) of document data/tegaki-ground-truth/eng.2213.lstmf
Loaded 1/1 lines (1-1) of document data/tegaki-ground-truth/eng.0363.lstmf
Loaded 1/1 lines (1-1) of document data/tegaki-ground-truth/eng.13637.lstmf
Loaded 1/1 lines (1-1) of document data/tegaki-ground-truth/eng.7977.lstmf
Loaded 1/1 lines (1-1) of document data/tegaki-ground-truth/eng.1859.lstmf
2 Percent improvement time=100, best error was 100 @ 0
At iteration 100/100/100, Mean rms=3.152%, delta=15.367%, char train=56.65%, word train=100%, skip ratio=0%,  New best char error = 56.65 wrote best model:data/tegaki/checkpoints/tegaki56.65_100.checkpoint wrote checkpoint.

2 Percent improvement time=100, best error was 56.65 @ 100
At iteration 200/200/200, Mean rms=2.754%, delta=11.794%, char train=44.687%, word train=100%, skip ratio=0%,  New best char error = 44.687 wrote best model:data/tegaki/checkpoints/tegaki44.687_200.checkpoint wrote checkpoint.

2 Percent improvement time=99, best error was 44.687 @ 200
At iteration 299/300/300, Mean rms=2.505%, delta=9.899%, char train=37.567%, word train=99.667%, skip ratio=0%,  New best char error = 37.567 wrote best model:data/tegaki/checkpoints/tegaki37.567_299.checkpoint wrote checkpoint.

2 Percent improvement time=98, best error was 37.567 @ 299
At iteration 397/400/400, Mean rms=2.291%, delta=8.477%, char train=32.087%, word train=99.25%, skip ratio=0%,  New best char error = 32.087 wrote best model:data/tegaki/checkpoints/tegaki32.087_397.checkpoint wrote checkpoint.
.
.
.

2 Percent improvement time=12, best error was 2.565 @ 830
At iteration 842/2400/2400, Mean rms=0.095%, delta=0.014%, char train=0.068%, word train=1.6%, skip ratio=0%,  New best char error = 0.068 wrote best model:data/tegaki/checkpoints/tegaki0.068_842.checkpoint wrote checkpoint.

2 Percent improvement time=12, best error was 2.565 @ 830
At iteration 842/2500/2500, Mean rms=0.087%, delta=0.012%, char train=0.048%, word train=1.1%, skip ratio=0%,  New best char error = 0.048 wrote best model:data/tegaki/checkpoints/tegaki0.048_842.checkpoint wrote checkpoint.
.
.
.

2 Percent improvement time=100, best error was 2.565 @ 830
At iteration 930/4700/4700, Mean rms=0.039%, delta=0%, char train=0.003%, word train=0.1%, skip ratio=0%,  New best char error = 0.003 wrote best model:data/tegaki/checkpoints/tegaki0.003_930.checkpoint wrote checkpoint.
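
The log is long, so it can help to reduce it to one line per progress report. A rough sketch, assuming the log format shown above; summarize_train_log is a hypothetical helper name, and treating the second of the three iteration counters as the training iteration is my reading of the lstmtraining output, not something the log itself states:

```shell
# Hypothetical helper: read an lstmtraining log on standard input and print
# "<second iteration counter> <char train error %>" for each
# "At iteration ..." line. All other lines are dropped.
summarize_train_log() {
    sed -n 's|^At iteration [0-9]*/\([0-9]*\)/[0-9]*,.*char train=\([0-9.]*\)%.*|\1 \2|p'
}
```

It can be run as `summarize_train_log < train.log` to see how the character error falls as training proceeds.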

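The finished model is created under data/, so compact it (the -c option converts the LSTM component to integer form):
$ combine_tessdata -c tegaki.traineddata

# When using it, don't forget to specify the traineddata you created
$ /usr/local/bin/tesseract test.image.jpg stdout -l tegaki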
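
For -l tegaki to work, tegaki.traineddata has to be somewhere Tesseract looks for models. A minimal sketch, assuming the model should go into the tessdata directory reported by --list-langs at the top of this article; install_model is a hypothetical helper name:

```shell
# Hypothetical helper: copy a trained model into a tessdata directory so
# that `tesseract ... -l <name>` can find it.
# Usage: install_model MODEL_FILE TESSDATA_DIR
install_model() {
    model_file=$1
    tessdata_dir=$2
    # Refuse to continue if the model file does not exist
    [ -f "$model_file" ] || { echo "not found: $model_file" >&2; return 1; }
    cp "$model_file" "$tessdata_dir/"
}
```

For example, `install_model data/tegaki.traineddata /usr/local/share/tessdata` (the destination may need root permissions), after which `tesseract --list-langs` should include tegaki. Alternatively, passing `--tessdata-dir` with the directory that holds the file avoids copying it.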