Inside Paperless-ngx: How OCR + Tagging Actually Works When You Drop a PDF

You drop a PDF into a folder. Twenty seconds later it shows up in the UI with text you can search, tags that match the content, and a filename that looks like a human typed it. Most users never look at what happens in those twenty seconds. Today we do.

Paperless-ngx is the open-source document management system that ate the homelab. The reason it feels magical isn't the UI; it's the pipeline running in the background. A chain of six stages turns an opaque scan into a searchable, classifiable record. Knowing how that chain works pays for itself the first time something gets stuck, and it explains why your tuning knobs do (or don't) help.

The pipeline in 30 seconds

Six stages, from drop to searchable:

  1. Consumer watches the consume folder and picks up your file.
  2. Pre-processing decides what it's looking at (PDF with a text layer? image-only scan? Office doc?) and routes accordingly.
  3. OCR runs Tesseract via OCRmyPDF on anything that isn't already text-rich.
  4. Text extraction pulls the OCR'd text into plain UTF-8 and saves it on the document model.
  5. Classifier auto-assigns tags, correspondent, and document type via a scikit-learn model trained on your existing library.
  6. Storage and index writes metadata to Postgres and the full text to the Whoosh search index.

The first three stages are the slow ones. The last three are almost free in comparison. Let's walk through each.
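The chain above can be sketched as a sequence of functions, each enriching the output of the previous stage. The names and dict shape here are illustrative, not Paperless-ngx internals:

```python
# Toy model of the pipeline: each stage enriches a plain dict
# that stands in for the document record.

def consume(path):
    return {"path": path, "status": "queued"}

def preprocess(doc):
    # Route by file type; real Paperless inspects MIME type and text layer.
    doc["kind"] = "image_pdf" if doc["path"].endswith(".pdf") else "office"
    return doc

def ocr_and_extract(doc):
    doc["text"] = "extracted text"     # OCRmyPDF + Tesseract in reality
    return doc

def classify(doc):
    doc["tags"] = ["invoice"]          # scikit-learn model in reality
    return doc

def index(doc):
    doc["status"] = "searchable"       # Postgres + Whoosh in reality
    return doc

def run_pipeline(path):
    doc = consume(path)
    for stage in (preprocess, ocr_and_extract, classify, index):
        doc = stage(doc)
    return doc

result = run_pipeline("scan.pdf")
```

Each stage only depends on the dict it receives, which is also roughly why the real pipeline can be queued and retried per document.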

Stage 1: The consumer

Paperless-ngx runs a worker that polls the consume folder (default /usr/src/paperless/consume) every few seconds. When a file lands, the worker locks it, computes a hash to detect duplicates, and pushes the job onto a Redis queue.

Small but real gotcha here: if you write a file with scp or rsync, the worker can grab it half-written. Use atomic writes (write to a .tmp file, then mv) or bump PAPERLESS_CONSUMER_POLLING to give the upload time to finish.
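The atomic-write trick works from Python too, not just with mv. A minimal sketch: os.replace is an atomic rename on the same filesystem, so the consumer either sees the whole file or nothing (the function name is illustrative):

```python
import os
import tempfile

def atomic_drop(data: bytes, dest: str) -> None:
    # Write to a temp file in the SAME directory, then rename it
    # into place. The rename is atomic on POSIX filesystems, so the
    # consumer can never grab a half-written file.
    dest_dir = os.path.dirname(dest) or "."
    fd, tmp = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, dest)  # atomic swap into the consume folder
    except BaseException:
        os.remove(tmp)
        raise
```

The temp file must live in the same directory as the destination; a rename across filesystems falls back to copy-and-delete, which is no longer atomic.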

Stage 2: Pre-processing

This is where Paperless decides what tools to call. It checks if the PDF has an embedded text layer (text-based PDF) or if it's a scan or image (image-based PDF). Text-based PDFs skip OCR entirely. Image-based PDFs get normalized for orientation, deskewed, and handed off to OCRmyPDF. Office formats like .docx and .odt route through Apache Tika for text extraction.
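The routing decision can be sketched as a lookup on file type. This is a hedged illustration — the real parser selection goes by MIME type, and the has_text_layer flag here is a stand-in for the actual text-layer check:

```python
def route(filename: str, has_text_layer: bool = False) -> str:
    # Map an incoming file to the tool that will extract its text.
    suffix = filename.rsplit(".", 1)[-1].lower()
    if suffix == "pdf":
        # Text-based PDFs skip OCR; image-based ones go to OCRmyPDF.
        return "extract_text_layer" if has_text_layer else "ocrmypdf"
    if suffix in {"docx", "odt", "xlsx", "pptx"}:
        return "tika"
    if suffix in {"png", "jpg", "jpeg", "tiff"}:
        return "ocrmypdf"
    raise ValueError(f"unsupported type: {suffix}")
```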

The cheapest win at this stage is making sure your scanner outputs orientation hints. A landscape page that arrives rotated forces the deskew step to do real work and costs you a second or two per page.

Stage 3: OCR

The expensive stage. OCRmyPDF wraps Tesseract and does two things at once: it produces searchable text and embeds that text back into the PDF as an invisible layer. The visible page stays identical; what changes is that Ctrl+F now works.

Tuning levers that actually matter:

  • PAPERLESS_OCR_LANGUAGE lists which Tesseract language packs to load. Each pack is 10-50 MB and adds OCR time. Set only what you actually scan.
  • PAPERLESS_OCR_MODE defaults to skip. Use redo to force re-OCR after a language change, or skip_noarchive to save disk by skipping the OCR'd archive copy.
  • PAPERLESS_OCR_THREADS_PER_WORKER controls parallel page OCR. Default is 1. Bump to 2 or 4 on a multi-core box.

Realistic numbers on Elestio MEDIUM (2c / 4 GB): an A4 scanned page OCRs in 1-3 seconds. A 10-page document lands in your library in 15-30 seconds end to end.
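Those numbers translate into a rough back-of-envelope estimate. The per-page seconds, overhead, and linear thread scaling here are assumptions for illustration, not measured constants:

```python
import math

def estimated_ocr_seconds(pages: int, sec_per_page: float = 2.0,
                          threads: int = 1, overhead: float = 5.0) -> float:
    # Pages are OCR'd in parallel across threads; the fixed overhead
    # stands in for consume, classification, and indexing, which are
    # cheap by comparison.
    batches = math.ceil(pages / threads)
    return batches * sec_per_page + overhead

# e.g. 10 pages with 2 threads: ceil(10/2) * 2.0 + 5.0 = 15.0 seconds,
# at the low end of the 15-30 s range quoted above.
```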

Stage 4: Text extraction

After OCR, Paperless reads the text layer back out as UTF-8 and stores it on the Document model. This step is short. The interesting part is non-PDF inputs: Tika handles Office docs, a plain reader handles .eml, and image-only files like PNG or JPEG get their OCR text mapped to a synthetic content field.

Stage 5: The classifier

This is the stage that makes Paperless feel smart. It runs a scikit-learn pipeline (TF-IDF vectorizer plus a multi-label classifier) trained on your existing tagged documents. Matching has three modes, configurable for each tag, correspondent, and document type:

  • Keyword matching (any/all) for simple rule-based assignment.
  • Literal string match for exact phrases.
  • Auto for the ML route. It learns from your manually tagged documents and predicts tags on new ones.
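The first two modes are simple enough to illustrate directly. A toy version (Paperless's real matcher also offers regex and fuzzy variants not shown here):

```python
def match_any(words, text):
    # "any" mode: at least one keyword appears in the document text.
    t = text.lower()
    return any(w.lower() in t for w in words)

def match_all(words, text):
    # "all" mode: every keyword must appear.
    t = text.lower()
    return all(w.lower() in t for w in words)

def match_literal(phrase, text):
    # literal mode: the exact phrase appears.
    return phrase.lower() in text.lower()
```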

The classifier retrains nightly by default (controlled by PAPERLESS_TIME_CRON_TRAINING). Plan on roughly 10+ documents per tag before predictions get useful. If auto-tag isn't firing, your training set is too small or too noisy.
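A quick way to sanity-check whether your library is trainable yet is to count tagged examples per label. The 10-document threshold mirrors the rule of thumb above; the data shape is illustrative, not a Paperless API:

```python
from collections import Counter

def trainable_labels(documents, min_examples=10):
    # documents: list of (doc_id, [tags]) pairs. A label is only
    # useful to the classifier once it has enough tagged examples.
    counts = Counter(tag for _, tags in documents for tag in tags)
    return {tag for tag, n in counts.items() if n >= min_examples}
```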

Stage 6: Storage and index

Final stop. Document metadata goes to Postgres. The full text goes to the Whoosh index (file-based, default) or Elasticsearch if you've wired it up. Original files live on disk under /usr/src/paperless/media, with optional archive copies (the OCR'd version of the original) under the same tree.

One thing worth knowing: the Whoosh index rebuilds from scratch if you run document_index reindex, which is the fix for "search returns no results even though the document is there."

Deploy on Elestio

A MEDIUM (2c / 4 GB) at $16/month handles tens of thousands of documents comfortably, with OCR concurrency of 2 and room for the nightly classifier retrain. Past 100,000 documents or heavy concurrent OCR, scale to LARGE ($30/month).

Storage is the other lever to plan around. A 1 GB scanned-PDF library easily becomes 3-5 GB after archive copies and OCR-embedded text layers, so size your disk against the archived total, not the raw scan total.
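The sizing rule of thumb as arithmetic. The 4x default multiplier sits in the 3-5x range from the paragraph above, and the headroom factor is an added assumption for thumbnails and the search index:

```python
def disk_budget_gb(raw_scan_gb: float, archive_multiplier: float = 4.0,
                   headroom: float = 1.25) -> float:
    # Archive copies and embedded text layers inflate the raw scan
    # total; headroom covers thumbnails and the Whoosh index.
    return round(raw_scan_gb * archive_multiplier * headroom, 1)

# e.g. a 1 GB raw library -> plan for about 5 GB of disk.
```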

Deploy here: https://elest.io/open-source/paperless-ngx.

Troubleshooting the usual three

"Documents are stuck in the consume folder." Check docker logs paperless-ngx | grep consumer. Most often it's a permission issue (the worker can't read your file) or a half-written upload. Force a rescan with the management command document_consumer --restart.

"OCR text is gibberish." Wrong language pack, or the source scan is below 150 DPI. Re-scan at 300 DPI, set PAPERLESS_OCR_LANGUAGE correctly (use ISO 639-2 codes like eng+fra), and re-run document_archiver.

"Auto-tagging never fires on new uploads." You don't have enough tagged training data yet, or auto-matching is set to keywords when you wanted auto. Tag 10-15 documents per category, then trigger a manual retrain with document_create_classifier.

The unglamorous part is the magic

Most "AI document management" demos hide the boring parts. With Paperless-ngx the boring parts are visible: a queue, a watcher, an OCR engine, a classifier, a search index. Knowing the boundaries makes the system predictable, which is exactly what you want from the place you put six years of receipts.

Thanks for reading ❤️ See you in the next one 👋