Architecture decisions, benchmarks, and infrastructure optimization for MIRROR.
MIRROR runs entirely on a single server with no GPU — a deliberate choice that forces every component to be optimized for CPU inference. This section documents the hardware budget and how each service fits within it.
| Resource | Specification | Budget Allocation |
|---|---|---|
| RAM | 64 GB DDR4/DDR5 | LLM ~12 GB · Embedding ~2 GB · Qdrant ~4 GB · OS+App ~6 GB · Free ~40 GB |
| CPU | 12 cores (x86_64) | Auto-detected by Docker — no fixed allocation, scheduler manages contention |
| GPU | None | Full CPU inference — drives all model choices |
| Storage | SSD (assumed) | Models ~10 GB · Qdrant data · Uploaded docs |
The absence of a GPU is the defining constraint. Every model selection prioritizes
CPU inference speed while maintaining production-grade quality. Thread allocation is
auto-detected at runtime (multiprocessing.cpu_count() - 2), letting Docker's cgroup
limits apply naturally when deployed on different VMs.
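A minimal sketch of that detection logic (the reserve of 2 cores follows the formula above; the fallback ordering between the two stdlib calls is an assumption):

```python
import multiprocessing
import os


def detect_threads(reserve: int = 2) -> int:
    """Pick a llama.cpp thread count, leaving `reserve` cores for the OS/app.

    os.sched_getaffinity(0) reflects cgroup CPU limits inside Docker;
    multiprocessing.cpu_count() is the fallback on platforms without it.
    """
    try:
        available = len(os.sched_getaffinity(0))
    except AttributeError:  # e.g. macOS
        available = multiprocessing.cpu_count()
    return max(1, available - reserve)
```

On the 12-core reference server this yields 10 worker threads, and a tighter cgroup limit in a different VM lowers it automatically.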
The LLM is the brain of MIRROR. Instead of locking into a single model, the system supports hot-swapping between multiple models at runtime — download from HuggingFace, load into RAM, and switch without restarting.
All available models are declared in a central MODEL_REGISTRY with their HuggingFace coordinates, RAM estimates, and context window sizes. The user selects a model from the UI, downloads it with one click, and loads it — the previous model is automatically unloaded and garbage-collected to free RAM. The default model is Phi-4 Mini 3.8B Q4_K_M, chosen for fast CPU inference.
| Model | Params | Quant | RAM | CPU t/s (est.) | Use Case |
|---|---|---|---|---|---|
| Phi-4 14B | 14B | Q8_0 | ~16 GB | 3-6 | Default — best quality/size |
| Phi-4 14B | 14B | Q4_K_M | ~9 GB | 5-10 | Lighter variant, faster |
| Qwen 2.5 32B | 32B | Q6_K | ~28 GB | 1-3 | Top multilingual reasoning |
| Phi-3.5 MoE 42B | 42B (6.6B active) | Q8_0 | ~46 GB | 1-2 | MoE for complex tasks |
| Llama 3.1 8B | 8B | FP16/Q8/Q6/Q4 | 5-16 GB | 4-18 | Quantization comparison baseline |
- `MODEL_REGISTRY` with HuggingFace repo/filename, RAM estimate, and per-model `n_ctx`
- `del model` + `gc.collect()` ensures RAM is reclaimed before loading the next model
We use llama-cpp-python (Python bindings for llama.cpp) rather than Ollama or vLLM: it runs in-process with no separate model server to manage, it exposes per-model parameters such as `n_ctx` and thread count directly, and llama.cpp's GGUF quantization support is the strongest option for CPU-only inference — vLLM in particular is optimized for GPU serving.
Embeddings convert text into numerical vectors that capture meaning. Similar texts end up close in vector space, enabling semantic search. The embedding model is critical for RAG quality — it determines which documents get retrieved.
| Model | Params | MTEB | Latency (CPU) | Top-5 Acc | Multilingual |
|---|---|---|---|---|---|
| e5-small | 118M | — | 16ms | 100% | Limited |
| BGE-M3 | 567M | 63.0 | <30ms | Competitive | 100+ languages |
| Qwen3-Embed-8B | 8B | 70.58 | ~200ms | High | 100+ languages |
| all-MiniLM-L6-v2 | 22.7M | — | 12ms | 56% | English only |
For a PDF page (~500-1000 tokens), BGE-M3 chunks and embeds in <5 seconds on CPU, well within our 10-second target. Using ONNX runtime or OpenVINO quantization can further reduce this by 30-40%.
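The chunk-then-embed step can be sketched as follows. The character-based window below is a simplification (MIRROR's real chunking strategy and sizes are not specified here), and the commented lines show the BGE-M3 call via sentence-transformers:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    """Split text into overlapping windows (character-based stand-in for
    token-based chunking; the sizes are illustrative)."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # overlap preserves context across boundaries
    return chunks


# Embedding the chunks with BGE-M3 (requires sentence-transformers):
# from sentence_transformers import SentenceTransformer
# embedder = SentenceTransformer("BAAI/bge-m3")
# vectors = embedder.encode(chunk_text(page_text), normalize_embeddings=True)
```

Normalized embeddings make cosine similarity a plain dot product, which is what Qdrant computes at search time.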
A vector database stores the embedding vectors and allows fast similarity search ("find the 5 most relevant chunks for this question"). Qdrant is the engine that makes RAG retrieval near-instant.
- `always_ram=True` for speed
- Payload indexes on `source_type`, `source_name` for O(1) pre-filtering

| Parameter | Value | Rationale |
|---|---|---|
| `m` | 16 | Default. Each node connects to 16 neighbors. Good recall/memory tradeoff |
| `ef_construct` | 200 | High build quality. Slower indexing but better recall at search time |
| `ef_search` | 128 | Search-time beam width. Ensures >95% recall with sub-10ms latency |
| Quantization | INT8 scalar | Reduces 1024-dim float32 vectors from 4 KB to 1 KB each |
| `always_ram` | True | With 64 GB RAM, we can keep all quantized vectors in memory |
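Assuming the standard qdrant-client API, the table above maps onto a collection definition roughly like this (the collection name and payload fields are illustrative):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://qdrant:6333")  # internal Docker hostname

client.create_collection(
    collection_name="mirror_chunks",  # name is an assumption
    vectors_config=models.VectorParams(size=1024, distance=models.Distance.COSINE),
    hnsw_config=models.HnswConfigDiff(m=16, ef_construct=200),
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            always_ram=True,  # keep quantized vectors resident in RAM
        )
    ),
)

# Payload indexes enable fast pre-filtering on source fields:
client.create_payload_index(
    "mirror_chunks", "source_type",
    field_schema=models.PayloadSchemaType.KEYWORD,
)

# ef_search is supplied per query rather than at collection creation:
# client.search(..., search_params=models.SearchParams(hnsw_ef=128))
```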
| Database | Pros | Cons |
|---|---|---|
| Qdrant | Rust, fast, HNSW, quantization, filtering | Separate process (Docker) |
| PGVector | PostgreSQL integration | Slower ANN, no native quantization, GC pauses |
| ChromaDB | Simple Python API | Limited scaling, no quantization, SQLite backend |
| Weaviate | GraphQL, modules | Heavier RAM footprint, Go-based |
| FAISS | In-process, fast | No persistence, no filtering, no API |
RAG (Retrieval-Augmented Generation) is the core pattern: instead of relying on the LLM's memory alone, we search for relevant document passages first, then let the LLM write an answer grounded in those passages — with citations.
The pipeline follows the canonical RAG pattern from Lewis et al. (2020), adapted for CPU-only inference:
Each retrieved chunk carries metadata (source_name, page, chunk_index).
The system prompt instructs the LLM to cite sources using [Source: name, p.X] format.
Sources are also returned as structured JSON for the frontend to display.
The initial vector search is fast but approximate. A reranker is a second, more precise model that re-scores the top candidates to select only the most relevant passages for the LLM.
Bi-encoder retrieval (BGE-M3 + cosine similarity) is fast but approximate. A cross-encoder jointly encodes query+document with full cross-attention, yielding significantly more accurate relevance scores at the cost of higher latency. By retrieving 8 candidates and reranking to top-3, we get the best of both worlds.
| Stage | Model | Latency (CPU) | Output |
|---|---|---|---|
| 1. Embedding | BGE-M3 | <30ms | 1024-dim query vector |
| 2. ANN Search | Qdrant HNSW | <10ms | Top-8 candidates |
| 3. Reranking | MiniLM-L-6-v2 | ~40-120ms (8 pairs) | Top-3 reranked |
| 4. Generation | Phi-4 14B | 5-50s | Answer with citations |
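The retrieve-then-rerank step reduces to a small function. Here `score_fn` stands in for the cross-encoder's `predict` (in MIRROR that would be sentence-transformers' `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict`), injected so the sketch stays self-contained:

```python
def rerank(query: str, candidates: list[dict], score_fn, top_k: int = 3) -> list[dict]:
    """Re-score ANN candidates with a cross-encoder and keep the best top_k.

    `candidates` are the top-8 Qdrant hits, each carrying a "text" field;
    `score_fn` scores (query, passage) pairs jointly — the cross-attention
    over both texts is what makes this more accurate than cosine similarity.
    """
    pairs = [(query, c["text"]) for c in candidates]
    scores = score_fn(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]
```

With 8 candidates the reranking cost is 8 forward passes of a 22M-parameter model, which is why the stage stays within ~40–120 ms on CPU.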
Many PDF pages contain charts, tables, diagrams, or complex layouts that text extraction alone misses. A Vision Language Model (VLM) can "see" the rendered page and describe its visual content, augmenting the text-only RAG pipeline.
The vision model is optional (activated via VISION_ENABLED=1).
When enabled, PDF pages are rendered at 150 DPI and analyzed by the VLM.
The visual description is appended to the text extraction, enriching the RAG context.
On CPU, each page analysis takes ~30-60 seconds — acceptable for async document processing.
The entire MIRROR stack is containerized and runs with a single command. This section covers the Docker architecture, the Compose orchestration, the reverse proxy, and the deployment workflow.
The application image is built from python:3.11-slim with minimal system dependencies
(build-essential, cmake for llama.cpp compilation). The resulting image includes all Python packages
and the Flask app, but not the models — those are downloaded at runtime via the UI
or the bootstrap script.
| Layer | Content | Size Impact |
|---|---|---|
| Base | python:3.11-slim | ~120 MB |
| System deps | build-essential, cmake, git | ~200 MB (build only) |
| Python deps | llama-cpp-python, sentence-transformers, Flask, PyMuPDF… | ~2.5 GB (includes PyTorch CPU) |
| App code | Flask routes, templates, static files | ~5 MB |
The docker-compose.yml defines three services that work together:
| Service | Image | Role | Ports |
|---|---|---|---|
| mirror | Custom (Dockerfile) | Flask app + LLM + Embedding + Reranker | 5000 (internal) |
| qdrant | qdrant/qdrant:v1.12.4 | Vector database for RAG retrieval | 6333/6334 (internal) |
| caddy | caddy:2.8.4 | Reverse proxy, auto HTTPS, TLS certificates | 80, 443 (public) |
Key design decisions:
- Thread count derived from `os.sched_getaffinity()`, so Docker cgroup CPU limits are respected
- `./models`, `./uploads`, `./articles`, `./data` are mounted from the host for persistence and easy access
- Qdrant is reached over the internal Docker network (`qdrant:6333`), never exposed to the host
- `.env` passes `HF_TOKEN` and secrets securely into the container
- `restart: unless-stopped` on all services for automatic recovery after crashes or reboots

Caddy sits in front of the Flask app and handles:
- Proxying `:443` → `mirror:5000` internally

The Flask app runs behind Gunicorn with settings tuned for long-running LLM requests.
Deploying MIRROR to a new server requires only Docker and one script:
1. `git clone` the repository and create `.env` with `HF_TOKEN`
2. `bash run.sh` — builds images, starts the stack, waits for readiness, downloads the default model automatically
3. Open `https://your-domain` (or `http://localhost` for local dev)
The bootstrap script (run.sh) handles the full lifecycle: build → start → healthcheck → model download → ready.
No manual steps required beyond the initial .env setup.
MIRROR is designed as a single-node deployment optimized for simplicity. For larger-scale deployments:
- Share `/models` across pods, or use an init container to download models

MIRROR can ingest content from any URL — paste a link, and the system extracts clean text for immediate Q&A or permanent indexing into the knowledge base.
The scraping tool uses trafilatura as the primary extractor (precision >90% on web content benchmarks) with BeautifulSoup as fallback. Scraped content can be held in memory for immediate Q&A (Scrap mode) or indexed permanently into the knowledge base.
The chat interface is the main way users interact with MIRROR. Four specialized modes let users choose between direct conversation, document-grounded Q&A, web content analysis, and document-only search.
MIRROR supports four chat modes, each optimized for a different use case:
| Mode | Pipeline | Use Case |
|---|---|---|
| Chat | Direct LLM + personal context + conversation history | Natural conversation, greetings, personal questions |
| RAG | Embed → Qdrant search → Rerank → LLM | Document Q&A with source citations |
| Scrap | In-memory web content → LLM | Questions about scraped web pages |
| Full Doc | RAG filtered to source_type=document | Search only uploaded documents |
All modes support Server-Sent Events (SSE) streaming for real-time token-by-token display.
The frontend uses the ReadableStream API to process SSE chunks as they arrive,
providing a responsive typing experience even with CPU inference at 3-6 t/s.
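On the backend, SSE streaming with Flask reduces to a generator response. This is a sketch rather than MIRROR's actual handler, and the `[DONE]` sentinel is an assumption:

```python
from flask import Response


def sse_stream(token_iter):
    """Wrap an LLM token iterator as a text/event-stream response.

    Real tokens should be JSON-encoded so embedded newlines cannot break
    the SSE framing.
    """
    def generate():
        for token in token_iter:
            yield f"data: {token}\n\n"  # one SSE event per token
        yield "data: [DONE]\n\n"        # hypothetical end-of-stream marker
    return Response(generate(), mimetype="text/event-stream")
```

Because the response body is a generator, Gunicorn flushes each event as the LLM produces it instead of buffering the whole answer.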
Chat mode includes multi-turn conversation context — the last 6 messages are injected into the prompt, enabling coherent follow-up questions without re-stating context. History is persisted to SQLite and can be resumed across sessions.
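The 6-message window is a simple slice before prompt assembly; the formatting below is illustrative, not MIRROR's exact prompt template:

```python
def build_history_prompt(messages: list[dict], window: int = 6) -> str:
    """Inject only the most recent messages to stay within the context budget."""
    recent = messages[-window:]
    return "\n".join(f"{m['role']}: {m['content']}" for m in recent)
```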
All persistent state — users, conversations, chat history, uploaded sources, and logs — lives in a single SQLite file. Zero configuration, zero external dependencies, portable with the container.
| Table | Purpose | Key Fields |
|---|---|---|
| `users` | Anonymous user sessions (cookie-based) | id (UUID), created_at, last_seen |
| `conversations` | Chat threads per user | id, user_id, title, mode, timestamps |
| `messages` | Individual messages with metadata | id, conversation_id, role, content, sources, timings |
| `user_sources` | Per-user document/web source tracking | id, user_id, source_name, source_type |
| `logs` | Structured application logs | timestamp, level, component, message, details |
Full observability without external dependencies — all events are logged to SQLite with structured metadata, queryable via API.
All application events (queries, errors, model loads, document uploads) are logged to the logs SQLite table
with structured metadata. Logs can be queried via /api/chat/logs with filters by component and level.
This provides full observability without external dependencies (no ELK/Grafana needed).
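With stdlib `sqlite3`, the structured-logging write amounts to one insert. The schema below mirrors the fields of the `logs` table listed earlier (column types are assumptions); the `details` JSON column is what keeps structured metadata queryable:

```python
import json
import sqlite3
import time

SCHEMA = """CREATE TABLE IF NOT EXISTS logs (
    timestamp REAL, level TEXT, component TEXT, message TEXT, details TEXT
)"""


def log_event(conn, level, component, message, **details):
    """Append a structured log row; details become a queryable JSON blob."""
    conn.execute(
        "INSERT INTO logs (timestamp, level, component, message, details) "
        "VALUES (?, ?, ?, ?, ?)",
        (time.time(), level, component, message, json.dumps(details)),
    )
    conn.commit()


def query_logs(conn, component=None, level=None):
    """Filtered reads of the kind backing the /api/chat/logs endpoint."""
    sql, args = "SELECT * FROM logs WHERE 1=1", []
    if component:
        sql += " AND component = ?"
        args.append(component)
    if level:
        sql += " AND level = ?"
        args.append(level)
    return conn.execute(sql, args).fetchall()
```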
| Layer | Technology | Role |
|---|---|---|
| Web Framework | Flask 3.1 | Lightweight, Jinja2 templates, blueprint architecture |
| LLM | Multi-model (Phi-4, Qwen 2.5, Llama 3) + llama-cpp-python | CPU inference, hot-swappable, 5-46 GB RAM |
| Embedding | BGE-M3 + sentence-transformers | Multilingual, 1024-dim, <30ms/query |
| Reranker | ms-marco-MiniLM-L-6-v2 | Cross-encoder, 22M params, ~15ms/pair on CPU |
| Vision (opt.) | MiniCPM-V 2.6 INT4 | PDF visual understanding, ~4 GB RAM |
| Vector Store | Qdrant (Docker) | HNSW + INT8 quantization, sub-10ms search |
| PDF Parsing | PyMuPDF (fitz) | Fast, handles complex layouts |
| Scraping | trafilatura + BeautifulSoup | High-precision content extraction |
| Database | SQLite (WAL mode) | Users, conversations, chat history, logs |
| Sessions | Cookie-based UUID | Anonymous user isolation, no login required |
| Streaming | Server-Sent Events (SSE) | Real-time token-by-token chat display |
| Chat Modes | Chat / RAG / Scrap / Full Doc | Direct chat, document Q&A, web scraping, doc-only search |
| Containerization | Docker Compose | Full encapsulation, zero host deps, portable |