Server Configurations

By default Papermerge runs on Django’s development server, started with the ./manage.py runserver command. It is easy to use and does the job well enough on a home network, but using it for anything beyond that is strongly discouraged.

For a proper deployment, use a real web server capable of handling more than one thread. The web server must also serve the static files (CSS, JavaScript) from the directory configured in the static_dir setting. The default static files directory is static.

For that you need to activate your virtual environment and collect the static files with the command:

./manage.py collectstatic

Setting up a web server can sound daunting if you don't normally do that kind of thing. This guide walks you through the configuration for Apache or Nginx on Linux and OSX.

Apache

The most common setup for Papermerge on a Linux server uses Apache. If you're not sure what to pick, Apache is a reasonable choice: it is free, easy to configure, and well documented.

To use the Apache web server with Django (the web framework used by Papermerge) you need to install an Apache module called mod_wsgi.

Step 1 - Install Apache Web Server

On Ubuntu 20.04 LTS, install the Apache web server with the following command:

sudo apt install apache2

Step 2 - Get mod_wsgi

Download the latest release of mod_wsgi from the mod_wsgi releases page, then extract the archive and change into the extracted directory:

unzip mod_wsgi-4.7.1.zip
cd mod_wsgi-4.7.1

Step 3 - Build & Install mod_wsgi

In order to build mod_wsgi on Ubuntu Linux, you need three things:

build-essential ubuntu package with gcc compiler and friends
apache2-dev package
python interpreter from your papermerge virtual environment

Let's first install required packages:

sudo apt install build-essential apache2-dev

Next, activate your Papermerge virtual environment (python virtual environment):

source /opt/papermerge/.venv/bin/activate

Warning

Activating the Python virtual environment is a very important step. When mod_wsgi is compiled, the Python interpreter it finds in $PATH must be the one from the same virtual environment that holds Papermerge's other Python dependencies.

Switch to the extracted mod_wsgi-4.7.1 directory and run the following commands:

$ ./configure
$ make
$ sudo make install

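On Ubuntu 20.04 LTS the sudo make install command copies the mod_wsgi.so binary to /usr/lib/apache2/modules/mod_wsgi.so.

Next, enable the module. Note that a2enmod takes the module name without the mod_ prefix, and it looks for a matching wsgi.load file in /etc/apache2/mods-available. If the source build did not create one, first add it with a single line reading LoadModule wsgi_module /usr/lib/apache2/modules/mod_wsgi.so. Then run:

sudo a2enmod wsgi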

You can double check if mod_wsgi module was enabled with:

apachectl -M

It should display a list of enabled modules. Among others you should see:

...
wsgi_module (shared)
...
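If you want to automate this check, for example in a provisioning script, the output can be scanned for the module name. The following is only a minimal sketch: the has_module helper and the abbreviated sample output are made up for illustration and are not part of Papermerge or Apache.

```python
def has_module(apachectl_output: str, module: str = "wsgi_module") -> bool:
    """Return True if `module` is listed in the output of `apachectl -M`."""
    for line in apachectl_output.splitlines():
        # Module lines look like " wsgi_module (shared)"; compare the first word.
        parts = line.split()
        if parts and parts[0] == module:
            return True
    return False


# Abbreviated, hand-written sample; real output lists many more modules.
sample = """Loaded Modules:
 core_module (static)
 wsgi_module (shared)
"""

if __name__ == "__main__":
    print("mod_wsgi enabled:", has_module(sample))
```

In practice the text would come from running apachectl -M, for instance via subprocess.run(["apachectl", "-M"], capture_output=True, text=True).stdout.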

Step 4 - Configure Virtual Host

In the directory /etc/apache2/sites-available create a virtual host configuration file for Papermerge, for example papermerge.site. Here is an example virtual host configuration:

<VirtualHost *:8060>
    <Directory /opt/papermerge/config>
        Require all granted
    </Directory>

    Alias /media/ /var/media/papermerge/
    Alias /static/ /var/static/papermerge/

    <Directory /var/media/papermerge>
        Require all granted
    </Directory>

    <Directory /var/static/papermerge>
        Require all granted
    </Directory>

    ServerName papermerge.home
</VirtualHost>

WSGIPythonHome /opt/papermerge/.venv/
WSGIPythonPath /opt/papermerge/
WSGIScriptAlias / /opt/papermerge/config/wsgi.py

The first argument of the WSGIScriptAlias line is the base URL path you want to serve your application at (/ indicates the root URL), and the second is the location of a WSGI file, which lives inside the Papermerge project as config/wsgi.py. This tells Apache to serve any request below the given URL using the WSGI application defined in that file.

WSGIPythonHome is the path to the Python virtual environment, and WSGIPythonPath adds the Papermerge project directory to Python's module search path.
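To make "the WSGI application defined in that file" more concrete: a WSGI file exposes a callable named application that the server invokes once per request. The sketch below is a stand-in built only on the standard library. It is not Papermerge's actual config/wsgi.py (which is generated by Django), and the call_app helper is a hypothetical name used to imitate what a server such as mod_wsgi does.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """Smallest possible WSGI application: always answers 200 with plain text."""
    body = b"Hello from WSGI\n"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_app(app, path="/"):
    """Invoke a WSGI app once, roughly as a server would; return (status, body)."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the keys the WSGI spec requires
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


if __name__ == "__main__":
    print(call_app(application))
```

The server only needs to know where the file is and which callable to use; everything else (URL routing, templates, database access) happens inside the application.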

Nginx + Gunicorn

Another way to deploy Papermerge behind a real web server is the Nginx + Gunicorn duo. Gunicorn is an application server: it serves the WSGI (Papermerge/Django) application over HTTP, so in that sense Gunicorn is a kind of web server. However, Gunicorn does not serve static content (JavaScript, CSS, images); that task falls on Nginx's shoulders.

Step 1 - Install Gunicorn

Gunicorn is not included in Papermerge's list of dependencies, so you need to install it in your current Python virtual environment:

$ source .venv/bin/activate
$ pip install gunicorn

Create a Gunicorn configuration file:

// Content of /opt/etc/gunicorn.conf.py file

workers = 2
errorlog = "/opt/log/gunicorn.error"
accesslog = "/opt/log/gunicorn.access"
loglevel = "debug"

bind = ["127.0.0.1:9001"]

Note

The Gunicorn configuration file must have a .py extension, and its content must be valid Python syntax.

Important

The binding port is 9001. The same port is used later to proxy HTTP requests from Nginx to Gunicorn.

Next, create an environment variables file:
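Because the Gunicorn configuration file is ordinary Python, values can be computed rather than hard-coded. The sketch below is one possible variation of the file shown earlier, not a required change: it sizes the worker pool with the (2 x cores) + 1 starting point suggested in Gunicorn's documentation. The "info" log level is an assumption for a quieter production log; the log paths and bind address are copied from the example above.

```python
import multiprocessing

# Starting point suggested by Gunicorn's documentation: (2 x CPU cores) + 1.
workers = multiprocessing.cpu_count() * 2 + 1

errorlog = "/opt/log/gunicorn.error"
accesslog = "/opt/log/gunicorn.access"

# Assumption: "info" instead of "debug" to keep production logs smaller.
loglevel = "info"

# Must match the port that nginx proxies to.
bind = ["127.0.0.1:9001"]
```

OCR work is done by the separate worker process, so the web workers mostly serve UI requests; treat the computed number as an upper starting point and lower it on small machines.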

// Content of /opt/etc/gunicorn.env file

DJANGO_SETTINGS_MODULE=config.settings.production

You need to create a production.py file in the /opt/papermerge/config/settings/ directory. Here is an example of its content:

// Content of /opt/papermerge/config/settings/production.py file

from .base import *  # noqa

DEBUG = False
ALLOWED_HOSTS = ['*']
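Note that ALLOWED_HOSTS = ['*'] accepts requests for any host name; Django uses this setting to guard against HTTP Host header attacks. If you prefer to restrict it, one option is to read the host names from the environment. This is only a sketch: PAPERMERGE_ALLOWED_HOSTS is a hypothetical variable name chosen for this example (it could be set in the gunicorn.env file), and papermerge.home is the server name used elsewhere in this guide.

```python
import os


def allowed_hosts_from_env(default="papermerge.home"):
    """Build ALLOWED_HOSTS from a comma-separated environment variable.

    PAPERMERGE_ALLOWED_HOSTS is a hypothetical name, not a Papermerge setting.
    """
    raw = os.environ.get("PAPERMERGE_ALLOWED_HOSTS", default)
    # Split on commas and drop empty entries and surrounding whitespace.
    return [host.strip() for host in raw.split(",") if host.strip()]


# In production.py this line would take the place of ALLOWED_HOSTS = ['*'].
ALLOWED_HOSTS = allowed_hosts_from_env()
```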

Step 2 - Systemd Service for Gunicorn

Example of systemd unit file for Gunicorn:

// SystemD unit file for gunicorn

[Unit]
Description=Gunicorn Service

[Service]
WorkingDirectory=/opt/papermerge
EnvironmentFile=/opt/etc/gunicorn.env
ExecStart=/opt/papermerge/.venv/bin/gunicorn config.wsgi:application --config /opt/etc/gunicorn.conf.py

[Install]
WantedBy=multi-user.target

Step 3 - Nginx

Finally, connect Nginx to Gunicorn. Here is a sample Nginx configuration:

server {
    server_name papermerge.home;
    listen 9000;

    location /static/ {
        alias /opt/static/;
    }

    location /media/ {
        alias /opt/media/;
    }

    location / {
        proxy_pass http://127.0.0.1:9001;
    }
}
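Once both services are running, a quick way to confirm the wiring is to check that something is listening on each port. The following is a small sketch using only the standard library; port_is_open is a helper name invented for this example, and the ports are the ones from the configurations above (9000 for Nginx, 9001 for Gunicorn).

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    for name, port in (("nginx", 9000), ("gunicorn", 9001)):
        state = "listening" if port_is_open("127.0.0.1", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```

If Gunicorn is reachable on 9001 but Nginx on 9000 is not, the problem is most likely in the Nginx configuration rather than in Papermerge itself.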

Worker

The worker runs as its own systemd service. Here is the worker.service unit:

// Worker.service unit file

[Unit]
Description=Papermerge Worker
After=network.target

[Service]
Type=simple
WorkingDirectory=/opt/papermerge
ExecStart=/opt/papermerge/.venv/bin/python /opt/papermerge/manage.py worker --pidfile /tmp/worker.pid
Restart=on-failure

[Install]
WantedBy=multi-user.target

Note

Notice that ExecStart uses the absolute path to the Python interpreter inside the virtual environment. That path is enough for Python to locate the rest of the dependencies in the same virtual environment, so you don't need to provide any further information about it.

A systemd .service file may be placed in one of several locations. One option is to place it in /etc/systemd/system together with other system-level units. In this case you need root access permissions.

Another option is to place the .service file inside $HOME/.config/systemd/user/. In this case you can start, check the status of, and stop the unit with the following commands:
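You can see this behaviour for yourself by asking the interpreter where it lives. The sketch below assumes an environment created with venv (or a recent virtualenv), where sys.prefix differs from sys.base_prefix; running_in_virtualenv is a helper name made up for this example.

```python
import sys


def running_in_virtualenv() -> bool:
    """True when the interpreter belongs to a venv-style virtual environment."""
    # Inside a virtual environment sys.prefix points at the environment itself,
    # while sys.base_prefix still points at the Python it was created from.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("prefix:     ", sys.prefix)
    print("virtualenv: ", running_in_virtualenv())
```

Saved to a file and run with /opt/papermerge/.venv/bin/python, without activating anything first, it should report the virtual environment as the prefix.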

// Useful systemd commands

$ systemctl --user start worker
$ systemctl --user status worker
$ systemctl --user stop worker

Broker, Messaging Queue and their Configuration

The web application (a.k.a. the main app) presents the UI and is what end users see and interact with. Workers extract information from scanned documents (they OCR them), i.e. workers do the most laborious task. The number of workers is limited only by your resources: there can be one worker or one thousand.

How does the web application pass the heavy OCR jobs to the worker(s)? And with many workers, how do the others know that one of them has already started a job, so that no job is ever performed twice? All of this worker management is done by a component called the broker. OCR jobs are passed from the main app to the broker (which in turn passes each one to a worker) via a so-called messaging queue. A messaging queue can be something as simple as the file system, but a database, computer memory, or an in-memory key/value database are also good candidates.
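To illustrate the idea of a file system queue in which a job is never picked up twice, here is a toy sketch. It is not how Celery's filesystem transport is implemented; the enqueue and claim_next functions and the file naming scheme are invented for this example. It relies on the fact that renaming a file within one file system is atomic, so when two workers race for the same job only one rename succeeds.

```python
import os
import tempfile
import uuid


def enqueue(queue_dir: str, payload: str) -> str:
    """Main app side: drop a job file into the queue directory."""
    job_id = uuid.uuid4().hex
    tmp_path = os.path.join(queue_dir, job_id + ".tmp")
    with open(tmp_path, "w", encoding="utf-8") as fh:
        fh.write(payload)
    # Rename only after writing, so workers never see a half-written job.
    os.rename(tmp_path, os.path.join(queue_dir, job_id + ".job"))
    return job_id


def claim_next(queue_dir: str, worker_name: str):
    """Worker side: claim one pending job, or return None if none is pending."""
    for name in sorted(os.listdir(queue_dir)):
        if not name.endswith(".job"):
            continue
        src = os.path.join(queue_dir, name)
        dst = os.path.join(queue_dir, f"{name}.{worker_name}.claimed")
        try:
            # If another worker renamed the file first, this raises and we move on.
            os.rename(src, dst)
        except FileNotFoundError:
            continue
        with open(dst, encoding="utf-8") as fh:
            return fh.read()
    return None


if __name__ == "__main__":
    queue = tempfile.mkdtemp()
    enqueue(queue, "OCR document 42")
    print(claim_next(queue, "worker-1"))  # the job payload
    print(claim_next(queue, "worker-2"))  # nothing left to claim
```

Note that every call to claim_next has to list the directory again, which hints at why polling a folder becomes expensive as a deployment grows.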

To keep the initial setup very simple (i.e. to require the minimum amount of configuration to start the application), the broker role is performed by a package called celery, which is part of Papermerge. Similarly, to keep everything simple at the beginning, the file system itself was chosen as the message queue.

By default, the broker and messaging queue are configured as follows:

CELERY_BROKER_URL = "filesystem://"
CELERY_BROKER_TRANSPORT_OPTIONS = {
    'data_folder_in': PAPERMERGE_TASK_QUEUE_DIR,
    'data_folder_out': PAPERMERGE_TASK_QUEUE_DIR,
}

Here PAPERMERGE_TASK_QUEUE_DIR points to a folder on the file system; its default value is queue. In other words, all messages are saved in a folder named queue inside the current directory.

The above configuration is fantastic for development because it requires zero configuration.

However, filesystem based broker configuration is terrible for production!

If you use it in production, CPU usage will increase over time, as described in this ticket on GitHub.

The following is a good configuration for production:

// Recommended options for production

CELERY_BROKER_URL = "redis://"
CELERY_BROKER_TRANSPORT_OPTIONS = {}
CELERY_RESULT_BACKEND = "redis://localhost/0"

It uses the Redis key-value database. With Redis as the broker transport you avoid the growing CPU usage seen with the file system broker.

Important

The CELERY_BROKER_URL, CELERY_BROKER_TRANSPORT_OPTIONS and CELERY_RESULT_BACKEND settings go into the Django configuration file of the Papermerge project, not into papermerge.conf.py. The Django configuration file is the one in project_dir/config/settings/base.py.
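If the same settings file has to serve both development and production, one possible approach is to pick the broker based on an environment variable. The sketch below is an assumption-laden example, not Papermerge's own mechanism: PAPERMERGE_REDIS_URL is a hypothetical variable name, and reading PAPERMERGE_TASK_QUEUE_DIR from the environment is likewise an assumption made to keep the example self-contained.

```python
import os


def celery_settings(env=os.environ):
    """Return Celery-related Django settings, using Redis when a URL is given.

    PAPERMERGE_REDIS_URL is a hypothetical variable name used for illustration.
    """
    redis_url = env.get("PAPERMERGE_REDIS_URL")
    if redis_url:
        # Production: Redis carries both the task messages and the results.
        return {
            "CELERY_BROKER_URL": redis_url,
            "CELERY_BROKER_TRANSPORT_OPTIONS": {},
            "CELERY_RESULT_BACKEND": redis_url,
        }
    # Development fallback: the file system queue described earlier.
    queue_dir = env.get("PAPERMERGE_TASK_QUEUE_DIR", "queue")
    return {
        "CELERY_BROKER_URL": "filesystem://",
        "CELERY_BROKER_TRANSPORT_OPTIONS": {
            "data_folder_in": queue_dir,
            "data_folder_out": queue_dir,
        },
    }


if __name__ == "__main__":
    print(celery_settings({"PAPERMERGE_REDIS_URL": "redis://localhost/0"}))
```

Inside a Django settings module the returned dictionary could be applied with globals().update(celery_settings()), and the variable itself could be added to the gunicorn.env file and to the worker's unit file so that both processes agree on the broker.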