Paperless-ngx + S3: Build a Searchable Document Archive with OCR and Automated Tagging
Every organization has a document problem. Invoices pile up. Contracts get lost in email threads. That receipt you need for the tax audit? Somewhere in a folder called "Misc 2024."
Paperless-ngx solves this by turning your documents into a searchable, tagged archive with full-text OCR. Add S3 storage to the mix, and you get effectively unlimited capacity without worrying about local disk space.
Here's how to set it up properly.
What You're Building
By the end of this guide, you'll have:
- Paperless-ngx running with full OCR capability
- Documents stored on S3-compatible storage (AWS S3, MinIO, Backblaze B2)
- Automated tagging rules that learn from your behavior
- API access for programmatic document ingestion
The Architecture
Paperless-ngx doesn't have native S3 support yet. The workaround is elegant: use rclone as a Docker volume plugin to mount S3 as a filesystem. Docker manages the FUSE mount transparently.
```
Documents → Paperless-ngx → rclone → S3 Bucket
                 ↓
      PostgreSQL (metadata)
      Redis (task queue)
      Tesseract (OCR)
```
Step 1: Deploy Paperless-ngx on Elestio
The fastest path is Elestio's managed Paperless-ngx. Select your provider, click deploy, and you're running in minutes with backups and updates handled automatically.
For self-managed deployments, here's the Docker Compose setup:
```yaml
version: "3.8"

services:
  broker:
    image: redis:7
    restart: always
    volumes:
      - redis_data:/data

  db:
    image: postgres:15
    restart: always
    volumes:
      - pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: ${DB_PASSWORD}

  webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    restart: always
    depends_on:
      - db
      - broker
    ports:
      - "8000:8000"
    volumes:
      - data:/usr/src/paperless/data
      - media:/usr/src/paperless/media
      - ./consume:/usr/src/paperless/consume
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_DBPASS: ${DB_PASSWORD}
      PAPERLESS_OCR_LANGUAGE: eng
      PAPERLESS_SECRET_KEY: ${SECRET_KEY}

volumes:
  redis_data:
  pgdata:
  data:
  media:
```
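The compose file pulls `DB_PASSWORD` and `SECRET_KEY` from the environment. A minimal `.env` file next to the compose file could look like the following — the values shown are placeholders, so generate your own secrets:

```
# .env — placeholder values; generate real secrets, e.g. with:
#   openssl rand -base64 32
DB_PASSWORD=change-me-to-a-random-string
SECRET_KEY=change-me-to-a-different-random-string
```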
Step 2: Add S3 Storage with rclone
Install the rclone Docker volume plugin:
```bash
docker plugin install rclone/docker-volume-rclone:amd64 \
  args="-v" --alias rclone --grant-all-permissions
```
Create your rclone configuration at `/var/lib/docker-plugins/rclone/config/rclone.conf`:

```ini
[s3documents]
type = s3
provider = AWS
access_key_id = YOUR_ACCESS_KEY
secret_access_key = YOUR_SECRET_KEY
region = us-east-1
```
For MinIO or other S3-compatible storage, add the endpoint:
```ini
[minio]
type = s3
provider = Minio
access_key_id = YOUR_ACCESS_KEY
secret_access_key = YOUR_SECRET_KEY
endpoint = https://minio.yourdomain.com
```
Update your Docker Compose to use the S3 volume:
```yaml
webserver:
  volumes:
    - s3documents:/usr/src/paperless/media/documents
    - data:/usr/src/paperless/data

volumes:
  s3documents:
    driver: rclone
    driver_opts:
      remote: "s3documents:your-bucket-name/documents"
      allow_other: "true"
      vfs_cache_mode: "full"
```
Restart the stack:
```bash
docker compose down && docker compose up -d
```
Step 3: Configure OCR
Paperless-ngx uses Tesseract for OCR. Key environment variables:
```yaml
environment:
  PAPERLESS_OCR_LANGUAGE: eng+fra+deu    # multiple languages
  PAPERLESS_OCR_MODE: skip_noarchive     # skip OCR when a text layer already exists
  PAPERLESS_OCR_ROTATE_PAGES: "true"     # auto-correct page rotation
  PAPERLESS_OCR_OUTPUT_TYPE: pdfa        # PDF/A for long-term archival
```
For documents with poor scan quality, enable image preprocessing:
```yaml
PAPERLESS_OCR_DESKEW: "true"
PAPERLESS_OCR_CLEAN: clean-final
```
Step 4: Set Up Automated Tagging
Paperless-ngx offers several matching algorithms for auto-tagging. The most useful are:

| Algorithm | Use Case |
|---|---|
| Any / All | Match any (or all) of a list of words |
| Exact | Match a specific phrase ("Bank of America") |
| Fuzzy | Partial matches, typo-tolerant |
| Auto | Machine-learning classifier that learns from your tagging |
Create a tag in the web UI, set the matching algorithm, and define your match text. For example:
- Tag: "Utilities"
- Algorithm: Any
- Match: "electricity" "water bill" "gas company"
The Auto algorithm is the most powerful: tag a few documents manually, and Paperless trains a document classifier on them. Retraining runs hourly by default.
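To build intuition for how the non-ML modes differ, here's a toy sketch of "any", "all", and "exact" matching. This is an illustration only, not Paperless-ngx's actual implementation:

```python
import re

def matches(text: str, terms: list[str], mode: str) -> bool:
    """Toy illustration of 'any', 'all', and 'exact' matching modes."""
    lowered = text.lower()
    if mode == "exact":
        # The whole phrase must appear verbatim
        return " ".join(terms).lower() in lowered
    # Whole-word hits for each individual term
    hits = [
        re.search(rf"\b{re.escape(t.lower())}\b", lowered) is not None
        for t in terms
    ]
    return any(hits) if mode == "any" else all(hits)

doc = "Your monthly electricity statement from Acme Gas Company"
print(matches(doc, ["electricity", "water bill"], "any"))  # True
print(matches(doc, ["electricity", "water bill"], "all"))  # False
```

Under "any" matching, one hit on "electricity" is enough for the "Utilities" tag; "all" would additionally require "water bill" to appear.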
Step 5: API Integration for Automated Ingestion
Generate an API token from Settings → Django Admin → Auth Tokens.
Upload documents programmatically:
```python
import requests

API_URL = "https://paperless.yourdomain.com/api"
TOKEN = "your-api-token"

def upload_document(file_path, tags=None):
    headers = {"Authorization": f"Token {TOKEN}"}
    with open(file_path, "rb") as f:
        files = {"document": f}
        data = {}
        if tags:
            data["tags"] = tags  # list of tag IDs, not tag names
        response = requests.post(
            f"{API_URL}/documents/post_document/",
            headers=headers,
            files=files,
            data=data,
        )
    response.raise_for_status()
    return response.json()

# Upload and let the matching rules handle tagging
upload_document("/path/to/invoice.pdf")
```
For scanner ingestion, point your scanner's network share at the consume folder; Paperless watches it automatically. For email ingestion, configure a mail account and mail rules in the web UI, and Paperless will fetch attachments on a schedule.
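Retrieval works through the same API. The sketch below queries the documents endpoint with Paperless's full-text `query` parameter; the URL and token are placeholders, as above:

```python
import requests

API_URL = "https://paperless.yourdomain.com/api"
TOKEN = "your-api-token"

def search_documents(query: str) -> list[dict]:
    """Full-text search; returns the matching documents' metadata."""
    response = requests.get(
        f"{API_URL}/documents/",
        headers={"Authorization": f"Token {TOKEN}"},
        params={"query": query},
    )
    response.raise_for_status()
    return response.json()["results"]

# Example: list everything mentioning "invoice"
# for doc in search_documents("invoice"):
#     print(doc["title"], doc["created"])
```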
Troubleshooting
**S3 mount not working?** Inspect the plugin's mounts with `docker plugin inspect rclone --format '{{.Settings.Mounts}}'`, and verify your credentials with `rclone lsd s3documents:`.

**OCR produces garbage text?** Your scans are probably too low resolution; aim for 300 DPI minimum. Enable `PAPERLESS_OCR_DESKEW` for skewed documents.

**Auto-tagging not learning?** Documents still carrying an inbox tag are excluded from training; remove them from the inbox first. Retraining runs hourly by default.

**Documents not appearing after upload?** Check the task queue in the web UI. Large documents or slow OCR can delay processing; increase the worker count with `PAPERLESS_TASK_WORKERS=2`.
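Since the upload endpoint returns a task UUID, you can also check processing status programmatically. A sketch, assuming the `/api/tasks/` endpoint available in recent Paperless-ngx releases (URL and token are placeholders):

```python
import requests

API_URL = "https://paperless.yourdomain.com/api"
TOKEN = "your-api-token"

def task_status(task_id: str) -> str:
    """Look up a consumption task by the UUID returned from post_document."""
    response = requests.get(
        f"{API_URL}/tasks/",
        headers={"Authorization": f"Token {TOKEN}"},
        params={"task_id": task_id},
    )
    response.raise_for_status()
    tasks = response.json()
    return tasks[0]["status"] if tasks else "UNKNOWN"

# Typical statuses: "PENDING", "STARTED", "SUCCESS", "FAILURE"
```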
What's Next
You've got a production-ready document archive. From here:
- Set up retention policies for automatic cleanup
- Configure workflows for approval chains
- Integrate with Nextcloud for file sync
- Add paperless-gpt for AI-powered categorization
The combination of OCR, S3 storage, and auto-tagging means you'll actually find documents when you need them. No more digging through folders.
Thanks for reading. See you in the next one.