Paperless-ngx + S3: Build a Searchable Document Archive with OCR and Automated Tagging

Every organization has a document problem. Invoices pile up. Contracts get lost in email threads. That receipt you need for the tax audit? Somewhere in a folder called "Misc 2024."

Paperless-ngx solves this by turning your documents into a searchable, tagged archive with full-text OCR. Add S3 storage to the mix, and the archive can keep growing without being capped by the disk on your server.

Here's how to set it up properly.

What You're Building

By the end of this guide, you'll have:

  • Paperless-ngx running with full OCR capability
  • Documents stored on S3-compatible storage (AWS S3, MinIO, Backblaze B2)
  • Automated tagging rules that learn from your behavior
  • API access for programmatic document ingestion

The Architecture

Paperless-ngx doesn't have native S3 support yet. The workaround is elegant: use rclone as a Docker volume plugin to mount S3 as a filesystem. Docker manages the FUSE mount transparently.

Documents → Paperless-ngx → rclone → S3 Bucket
                ↓
            PostgreSQL (metadata)
            Redis (task queue)
            Tesseract (OCR)

Step 1: Deploy Paperless-ngx on Elestio

The fastest path is Elestio's managed Paperless-ngx. Select your provider, click deploy, and you're running in minutes with backups and updates handled automatically.

For self-managed deployments, here's the Docker Compose setup:

version: "3.8"
services:
  broker:
    image: redis:7
    restart: always
    volumes:
      - redis_data:/data

  db:
    image: postgres:15
    restart: always
    volumes:
      - pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: ${DB_PASSWORD}

  webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    restart: always
    depends_on:
      - db
      - broker
    ports:
      - "8000:8000"
    volumes:
      - data:/usr/src/paperless/data
      - media:/usr/src/paperless/media
      - ./consume:/usr/src/paperless/consume
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_DBPASS: ${DB_PASSWORD}
      PAPERLESS_OCR_LANGUAGE: eng
      PAPERLESS_SECRET_KEY: ${SECRET_KEY}

volumes:
  redis_data:
  pgdata:
  data:
  media:
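
The Compose file reads DB_PASSWORD and SECRET_KEY from the environment, so both need to exist before the first start. Here is a minimal sketch that generates them into a .env file next to the Compose file. The helper names build_env and write_env are made up for this example, and the key lengths are arbitrary choices, not Paperless requirements.

```python
import secrets
from pathlib import Path


def build_env(db_password_bytes=24, secret_key_bytes=48):
    """Return .env file contents with freshly generated random secrets."""
    # token_urlsafe only emits A-Z, a-z, 0-9, "-" and "_", so the values
    # need no quoting or escaping inside a .env file.
    db_password = secrets.token_urlsafe(db_password_bytes)
    secret_key = secrets.token_urlsafe(secret_key_bytes)
    return f"DB_PASSWORD={db_password}\nSECRET_KEY={secret_key}\n"


def write_env(path=".env"):
    """Write the secrets once; refuse to overwrite an existing file."""
    env_file = Path(path)
    if env_file.exists():
        # Regenerating DB_PASSWORD would lock you out of an initialized database.
        raise FileExistsError(f"{path} already exists, not overwriting")
    env_file.write_text(build_env())
    env_file.chmod(0o600)  # readable by the owner only


# Usage: call write_env() once from the directory holding docker-compose.yml.
```

Docker Compose picks up a .env file in the project directory automatically, so no further wiring is needed.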

Step 2: Add S3 Storage with rclone

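Make sure FUSE is installed on the host, create the config and cache directories the plugin expects, then install the rclone Docker volume plugin:

sudo mkdir -p /var/lib/docker-plugins/rclone/config
sudo mkdir -p /var/lib/docker-plugins/rclone/cache
docker plugin install rclone/docker-volume-rclone:amd64 \
  args="-v" --alias rclone --grant-all-permissions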

Create your rclone configuration at /var/lib/docker-plugins/rclone/config/rclone.conf:

[s3documents]
type = s3
provider = AWS
access_key_id = YOUR_ACCESS_KEY
secret_access_key = YOUR_SECRET_KEY
region = us-east-1

For MinIO or other S3-compatible storage, add the endpoint:

[minio]
type = s3
provider = Minio
access_key_id = YOUR_ACCESS_KEY
secret_access_key = YOUR_SECRET_KEY
endpoint = https://minio.yourdomain.com
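
If you provision more than one host, it can help to render the remote definition from variables instead of editing the file by hand. Below is a sketch using Python's configparser, which writes the same "key = value" layout shown above. The function name render_rclone_remote is hypothetical, and the credentials are placeholders.

```python
import configparser
import io


def render_rclone_remote(name, access_key, secret_key,
                         provider="AWS", region=None, endpoint=None):
    """Return one rclone.conf section for an S3-type remote as text."""
    # interpolation=None keeps a literal "%" in a secret from being
    # treated as configparser interpolation syntax.
    config = configparser.ConfigParser(interpolation=None)
    config[name] = {
        "type": "s3",
        "provider": provider,
        "access_key_id": access_key,
        "secret_access_key": secret_key,
    }
    if region:
        config[name]["region"] = region
    if endpoint:
        config[name]["endpoint"] = endpoint

    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Example: the MinIO remote from above, with placeholder credentials.
print(render_rclone_remote(
    "minio", "YOUR_ACCESS_KEY", "YOUR_SECRET_KEY",
    provider="Minio", endpoint="https://minio.yourdomain.com",
))
```

Write the output to /var/lib/docker-plugins/rclone/config/rclone.conf and restrict its permissions, since it holds the secret key in plain text.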

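Update the webserver service in your Docker Compose file so the documents directory lives on the S3 volume, and declare that volume alongside the existing ones:

services:
  webserver:
    volumes:
      - s3documents:/usr/src/paperless/media/documents
      - data:/usr/src/paperless/data
      - ./consume:/usr/src/paperless/consume

volumes:
  s3documents:
    driver: rclone
    driver_opts:
      remote: "s3documents:your-bucket-name/documents"
      allow_other: "true"
      vfs_cache_mode: "full"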

Restart the stack:

docker compose down && docker compose up -d
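
After the restart, it is worth confirming that the mounted directory accepts writes before feeding it real documents. The sketch below writes a small probe file, reads it back, and removes it. The function name mount_round_trip is made up for this example, and the script has to run somewhere the mount is visible, for instance inside the webserver container. With vfs_cache_mode set to full, a successful round trip shows the mount is usable but does not prove the object has already reached the bucket, because uploads happen asynchronously from the local cache.

```python
import os
import uuid
from pathlib import Path


def mount_round_trip(directory):
    """Return True if a probe file can be written to and read back from directory."""
    probe = Path(directory) / f".mount-probe-{uuid.uuid4().hex}"
    payload = os.urandom(32)
    try:
        probe.write_bytes(payload)
        return probe.read_bytes() == payload
    except OSError:
        # Missing directory, read-only mount, permission problem, and so on.
        return False
    finally:
        try:
            probe.unlink(missing_ok=True)  # clean up the probe file
        except OSError:
            pass


# Path taken from the Compose volume mapping above.
print(mount_round_trip("/usr/src/paperless/media/documents"))
```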

Step 3: Configure OCR

Paperless-ngx uses Tesseract for OCR. Key environment variables:

environment:
  PAPERLESS_OCR_LANGUAGE: eng+fra+deu  # Multiple languages
  PAPERLESS_OCR_MODE: skip             # Only OCR pages that have no text yet
  PAPERLESS_OCR_ROTATE_PAGES: "true"   # Auto-correct rotation
  PAPERLESS_OCR_OUTPUT_TYPE: pdfa      # PDF/A for archival

For documents with poor scan quality, enable image preprocessing:

PAPERLESS_OCR_DESKEW: "true"
PAPERLESS_OCR_CLEAN: clean-final

Step 4: Set Up Automated Tagging

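Paperless-ngx offers several matching algorithms for auto-tagging. These are the ones you'll use most:

Algorithm   Use Case
Any         Match when any of the listed words or quoted phrases appears
Exact       Match specific phrases ("Bank of America")
Fuzzy       Partial matches, typo-tolerant
Auto        Neural network learns from your tagging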
Create a tag in the web UI, set the matching algorithm, and define your match text. For example:

  • Tag: "Utilities"
  • Algorithm: Any
  • Match: "electricity" "water bill" "gas company"
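
To make the difference between the rule-based algorithms concrete, here is a simplified model of how Any, All, and Exact treat the match text from the example above. This is an illustration only, not Paperless-ngx's actual matching code: the function names are made up, and the real implementation may differ in details such as how it tokenizes text.

```python
import re
import shlex


def split_terms(match):
    """Split match text into lowercase terms; quoted phrases stay together."""
    return [term.lower() for term in shlex.split(match)]


def matches(match, content, algorithm="any"):
    """Simplified model of the rule-based matching algorithms."""
    text = content.lower()
    if algorithm == "exact":
        # The whole match string must appear verbatim.
        return match.lower() in text

    # Each term must appear as a whole word or phrase, not inside another word.
    hits = [
        re.search(rf"\b{re.escape(term)}\b", text) is not None
        for term in split_terms(match)
    ]
    if algorithm == "any":
        return any(hits)
    if algorithm == "all":
        return all(hits)
    raise ValueError(f"unknown algorithm: {algorithm}")


rule = '"electricity" "water bill" "gas company"'
print(matches(rule, "Your WATER BILL for March", "any"))  # True
print(matches(rule, "Your WATER BILL for March", "all"))  # False
```

With Any, a single hit is enough, which is why it suits a broad tag like "Utilities" that covers several kinds of documents.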

The Auto algorithm is powerful. Tag a few documents manually, and Paperless learns the patterns. It trains hourly by default.

Step 5: API Integration for Automated Ingestion

Generate an API token from Settings → Django Admin → Auth Tokens.

Upload documents programmatically:

import requests

API_URL = "https://paperless.yourdomain.com/api"
TOKEN = "your-api-token"

def upload_document(file_path, tags=None):
    """Queue one file for consumption and return the parsed JSON response."""
    headers = {"Authorization": f"Token {TOKEN}"}
    data = {}
    if tags:
        data["tags"] = tags  # list of existing tag IDs, e.g. [3, 7]

    with open(file_path, "rb") as f:
        response = requests.post(
            f"{API_URL}/documents/post_document/",
            headers=headers,
            files={"document": f},
            data=data,
            timeout=60,
        )

    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.json()

# Upload without explicit tags; your matching rules apply during consumption
upload_document("/path/to/invoice.pdf")

For hands-off ingestion, point your scanner at the consume folder, or have your email rules save attachments there. Paperless watches the folder and imports new files automatically.
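
Once documents are in, the same token can query the full-text index through the documents endpoint. Here is a sketch using only the standard library. It assumes the query and page_size parameters and the results, id, and title response fields behave as described in the REST API documentation for your version; the function names are made up for this example.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://paperless.yourdomain.com/api"
TOKEN = "your-api-token"


def build_search_request(query, page_size=10):
    """Build an authenticated GET request for a full-text search."""
    params = urllib.parse.urlencode({"query": query, "page_size": page_size})
    return urllib.request.Request(
        f"{API_URL}/documents/?{params}",
        headers={
            "Authorization": f"Token {TOKEN}",
            "Accept": "application/json",
        },
    )


def search_documents(query, page_size=10):
    """Return (id, title) pairs for documents matching the query."""
    request = build_search_request(query, page_size)
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return [(doc["id"], doc["title"]) for doc in payload["results"]]


# Usage (needs a reachable instance and a valid token):
# for doc_id, title in search_documents("water bill"):
#     print(doc_id, title)
```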

Troubleshooting

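S3 mount not working? Inspect the plugin's mount settings with docker plugin inspect rclone --format '{{.Settings.Mounts}}' and confirm the config directory is mapped. Plugin log output ends up in the Docker daemon log (journalctl -u docker on systemd hosts). Verify credentials with rclone lsd s3documents: --config /var/lib/docker-plugins/rclone/config/rclone.conf, which requires rclone installed on the host.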

OCR produces garbage text? Your scans might be too low resolution. Aim for 300 DPI minimum. Enable PAPERLESS_OCR_DESKEW for skewed documents.
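
If your scanner software does not report resolution, you can estimate it from the image's pixel dimensions and the physical page size. A small sketch of that arithmetic follows; the function name and the page-size constants are illustrative.

```python
# Physical page sizes in inches (width, height).
LETTER = (8.5, 11.0)
A4 = (8.27, 11.69)


def effective_dpi(pixel_width, pixel_height, page=LETTER):
    """Estimate scan resolution as pixels per inch along the weaker axis."""
    page_width_in, page_height_in = page
    return min(pixel_width / page_width_in, pixel_height / page_height_in)


def meets_ocr_minimum(pixel_width, pixel_height, page=LETTER, minimum=300):
    """Check a scan against the 300 DPI guideline mentioned above."""
    return effective_dpi(pixel_width, pixel_height, page) >= minimum


print(effective_dpi(2550, 3300))      # 300.0 for a US Letter page
print(meets_ocr_minimum(1275, 1650))  # False: that scan is only 150 DPI
```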

Auto-tagging not learning? Documents must be removed from inbox (no inbox tags) before the neural network considers them. Training runs hourly by default.

Documents not appearing after upload? Check the task queue in the web UI. Large documents or slow OCR can delay processing. Increase worker count with PAPERLESS_TASK_WORKERS=2.
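
Before raising the worker count, check it against your CPU. OCR also runs several threads per worker (PAPERLESS_THREADS_PER_WORKER), so too many workers can oversubscribe a small machine. The sketch below encodes a common rule of thumb: keep workers times threads at or below the core count. Treat both that rule and the floor-division default as assumptions to verify against the configuration reference for your version; the function names are made up.

```python
import os


def threads_per_worker(workers, cores=None):
    """Suggest a thread count so workers * threads stays within the cores."""
    cores = cores or os.cpu_count() or 1
    return max(cores // workers, 1)


def is_oversubscribed(workers, threads, cores=None):
    """Return True if the configuration asks for more threads than cores."""
    cores = cores or os.cpu_count() or 1
    return workers * threads > cores


# Example: a 4-core host running two workers.
print(threads_per_worker(2, cores=4))    # 2
print(is_oversubscribed(2, 4, cores=4))  # True
```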

What's Next

You've got a production-ready document archive. From here:

  • Set up retention policies for automatic cleanup
  • Configure workflows for approval chains
  • Integrate with Nextcloud for file sync
  • Add paperless-gpt for AI-powered categorization

The combination of OCR, S3 storage, and auto-tagging means you'll actually find documents when you need them. No more digging through folders.

Thanks for reading. See you in the next one.