OCR

This section groups all OCR-specific configuration options.

PAPERMERGE__OCR__DEFAULT_LANGUAGE

By default, Papermerge DMS performs OCR using the language specified by this option. Set it to the language used by the majority of your documents. For the full list of three-letter codes, see the 639-2/T column of the ISO 639-2 code list.

Example as an environment variable:

PAPERMERGE__OCR__DEFAULT_LANGUAGE=spa

Example in the TOML configuration file:

[ocr]
default_language="spa"

The default value is “deu” (German).

PAPERMERGE__OCR__LANGUAGES

Note

This option can be defined only in the TOML configuration file.

Defines all languages available for OCR. This option is defined as an inline table where each key is an ISO 639-2 code and each value is the human-readable name of the language.

Example:

[ocr]
languages = { heb = "hebrew", jpn = "japanese"}

Important

The languages value must be written on a single line. This is a requirement of the TOML inline table format.

Note that the Tesseract language data for both Hebrew and Japanese must be installed. You can check Tesseract’s available languages with the following command:

List available languages
$ tesseract --list-langs

Default value

[ocr]
languages = { deu = "Deutsch", eng = "English" }

See Adding OCR Languages to the Docker Image for a detailed example of using this option.