Let’s start by clarifying the most important concept first: the document. For Papermerge DMS, a document is anything that is a good candidate for archiving: a piece of information that you do not edit but need to store for future reference. Receipts are a good example. You don’t need to edit receipts or read them every day, but eventually you will need them for your tax declaration. In this sense, scanned documents, which are usually in PDF or TIFF format, are a perfect match.
Another important point: if you take a picture of a paper document with your mobile phone, you get a file in JPEG format (or perhaps PNG). In the context of Papermerge DMS, that picture, though just a single JPEG file, is a valid one-page document. Generally speaking, pictures of documents taken with a camera can be regarded as low-quality scans.
On the other hand, if you take a picture of a flower and upload that JPEG image to Papermerge DMS, the ‘document’ will be processed. However, that flower image is not a document in the Papermerge DMS sense.
Office formats such as .docx (Microsoft Word), .odt (LibreOffice) and .txt (plain text) are usually not good candidates for archiving, because by their nature they are meant to be edited regularly. However, once converted to PDF (for instance Contract_C2.docx to Contract_C2.pdf), they are full-fledged documents in the Papermerge DMS sense.
Papermerge DMS 2.1 works only with PDF documents. Versions before 2.1 also supported the TIFF, JPEG and PNG formats. Support for those formats was dropped in 2.1 because of internal refactoring and will be reintroduced in a future version.
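Because version 2.1 accepts only PDF files, it can help to sort files before uploading them. The following is a minimal sketch, not part of Papermerge DMS itself: the helper names `looks_like_pdf` and `split_uploads` are hypothetical, and the check only inspects the first bytes of each file for the standard `%PDF-` header rather than validating the whole file.

```python
# Hypothetical pre-upload check; these helpers are illustrative only
# and are not part of the Papermerge DMS code base or API.

PDF_MAGIC = b"%PDF-"  # a PDF file starts with this header


def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF header bytes."""
    with open(path, "rb") as f:
        return f.read(len(PDF_MAGIC)) == PDF_MAGIC


def split_uploads(paths):
    """Split `paths` into (pdfs, others) based on their header bytes."""
    pdfs, others = [], []
    for path in paths:
        if looks_like_pdf(path):
            pdfs.append(path)
        else:
            others.append(path)
    return pdfs, others
```

Files that end up in `others` (for example JPEG or TIFF scans) would need to be converted to PDF with a tool of your choice before upload to version 2.1.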
Optical Character Recognition (OCR)
OCR is a technique to extract text information from binary image formats. This technique enables users to:
copy/paste text from the document’s content
search documents by document’s actual text content
OCR is an essential technique: it extracts textual information from documents and so makes workflows based on a document’s actual content possible. Papermerge DMS relies on specialized external open source tools such as Google’s Tesseract OCR.
An informal, more detailed explanation of the term OCR is provided in the glossary.
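To illustrate why extracted text makes content search possible, here is a small sketch that assumes OCR has already produced plain text for each document. It is not how Papermerge DMS implements search; the function names `build_index` and `search` and the sample file names and texts are made up for this example.

```python
import re
from collections import defaultdict


def build_index(ocr_texts):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in ocr_texts.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Sample text, standing in for what an OCR engine would extract.
ocr_texts = {
    "receipt_2021.pdf": "Grocery store receipt total 42.50 EUR",
    "contract_c2.pdf": "Rental contract between landlord and tenant",
}
index = build_index(ocr_texts)
print(search(index, "rental contract"))  # {'contract_c2.pdf'}
```

Without OCR, a scanned page is only pixels, so there would be no words to put into such an index.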
Scanning documents in bulk often yields documents with blank pages; some pages may be out of order or may belong to a completely different document. Even if you notice these flaws immediately, redoing the scan is time-consuming and frustrating. Papermerge DMS helps you fix such scans: you can delete blank or erroneous pages, move pages from one document into another and, most importantly, reorder the pages of a document.
There is a separate chapter about Page Management where you can learn details about this feature.
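The three page operations can be pictured with a small model in which a document is simply a list of pages. This is only a conceptual sketch under that assumption: the function names are hypothetical, positions are 1-based, and nothing here reflects the actual Papermerge DMS implementation or API.

```python
def delete_pages(pages, positions):
    """Return `pages` without the given 1-based positions."""
    drop = set(positions)
    return [p for i, p in enumerate(pages, start=1) if i not in drop]


def reorder_pages(pages, new_order):
    """Return `pages` arranged by `new_order`, a permutation of 1-based positions."""
    if sorted(new_order) != list(range(1, len(pages) + 1)):
        raise ValueError("new_order must list every position exactly once")
    return [pages[i - 1] for i in new_order]


def move_pages(source, target, positions):
    """Move the given 1-based positions from `source` to the end of `target`."""
    moved = [source[i - 1] for i in positions]
    return delete_pages(source, positions), target + moved


# A bulk scan with a blank page, a stray page and two pages out of order.
scan = ["invoice p1", "blank", "invoice p3", "invoice p2", "letter p1"]
scan = delete_pages(scan, [2])               # drop the blank page
invoice, letter = move_pages(scan, [], [4])  # the stray page becomes its own document
invoice = reorder_pages(invoice, [1, 3, 2])  # restore the page order
print(invoice)  # ['invoice p1', 'invoice p2', 'invoice p3']
print(letter)   # ['letter p1']
```

Each function returns new lists instead of modifying its input, which keeps the intermediate steps easy to inspect.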