Technical Choices

Architecture decisions, benchmarks, and infrastructure optimization for MIRROR.

1. Infrastructure Constraints

MIRROR runs entirely on a single server with no GPU — a deliberate choice that forces every component to be optimized for CPU inference. This section documents the hardware budget and how each service fits within it.

Resource | Specification | Budget Allocation
RAM | 64 GB DDR4/DDR5 | LLM ~12 GB · Embedding ~2 GB · Qdrant ~4 GB · OS+App ~6 GB · Free ~40 GB
CPU | 12 cores (x86_64) | Auto-detected by Docker — no fixed allocation, scheduler manages contention
GPU | None | Full CPU inference — drives all model choices
Storage | SSD (assumed) | Models ~10 GB · Qdrant data · Uploaded docs

The absence of a GPU is the defining constraint. Every model selection prioritizes CPU inference speed while maintaining production-grade quality. Thread allocation is auto-detected at runtime (multiprocessing.cpu_count() - 2), letting Docker's cgroup limits apply naturally when deployed on different VMs.
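A minimal sketch of this auto-detection, combining the cpu_count() fallback with the cgroup-aware os.sched_getaffinity() call mentioned in the Compose section (the 2-core reserve follows the formula above; function name illustrative):

```python
import multiprocessing
import os

def detect_inference_threads(reserve: int = 2) -> int:
    """Pick a thread count for llama.cpp, keeping `reserve` cores free
    for the web server and OS. On Linux, sched_getaffinity reflects
    Docker cgroup CPU limits, which plain cpu_count() may not."""
    try:
        available = len(os.sched_getaffinity(0))
    except AttributeError:  # sched_getaffinity is Linux-only
        available = multiprocessing.cpu_count()
    return max(1, available - reserve)
```

The max(1, ...) floor matters on small VMs: with 2 cores and a reserve of 2, the naive formula would yield zero threads.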

2. LLM: Dynamic Multi-Model Switching

The LLM is the brain of MIRROR. Instead of locking into a single model, the system supports hot-swapping between multiple models at runtime — download from HuggingFace, load into RAM, and switch without restarting.

Architecture: Model Registry + Hot-Swap

All available models are declared in a central MODEL_REGISTRY with their HuggingFace coordinates, RAM estimates, and context window sizes. The user selects a model from the UI, downloads it with one click, and loads it — the previous model is automatically unloaded and garbage-collected to free RAM. The default model is Phi-4 Mini 3.8B Q4_K_M, chosen for fast CPU inference.

Available Models

Model | Params | Quant | RAM | CPU t/s (est.) | Use Case
Phi-4 14B | 14B | Q8_0 | ~16 GB | 3-6 | Default — best quality/size
Phi-4 14B | 14B | Q4_K_M | ~9 GB | 5-10 | Lighter variant, faster
Qwen 2.5 32B | 32B | Q6_K | ~28 GB | 1-3 | Top multilingual reasoning
Phi-3.5 MoE 42B | 42B (6.6B active) | Q8_0 | ~46 GB | 1-2 | MoE for complex tasks
Llama 3.1 8B | 8B | FP16/Q8/Q6/Q4 | 5-16 GB | 4-18 | Quantization comparison baseline

Why Phi-4 as Default?

  • Best quality/size ratio at 14B — Microsoft's Phi-4 matches or exceeds many 30B+ models on reasoning benchmarks (MMLU, HumanEval, GSM8K)
  • Q8_0 quantization — ~16 GB RAM, near-lossless quality retention vs FP16
  • CPU-optimized via llama.cpp — GGUF format with AVX2/AVX-512 SIMD acceleration
  • MIT license — full commercial use, no restrictions
  • 4096 token context — sufficient for RAG with 3-5 reranked chunks

Model Switching Mechanism

  • Registry-based — all models declared in MODEL_REGISTRY with HuggingFace repo/filename, RAM estimate, per-model n_ctx
  • One-click download — background download from HuggingFace with progress tracking via polling
  • GC-aware unload — explicit del model + gc.collect() ensures RAM is reclaimed before loading the next model
  • Thread-safe — generation lock prevents concurrent inference during model swap
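The mechanism above can be sketched as follows. The loader callable stands in for the llama-cpp-python constructor, and the class and field names are illustrative, not MIRROR's actual code:

```python
import gc
import threading

class ModelManager:
    """Hot-swap sketch: one model resident in RAM at a time, with the
    swap guarded by the same lock the generation path holds, so no
    inference runs against a half-unloaded model."""

    def __init__(self, loader):
        self._loader = loader          # callable(name) -> model object
        self._lock = threading.Lock()  # shared with the generation path
        self.model = None
        self.name = None

    def switch(self, name: str):
        with self._lock:               # wait for in-flight generation to end
            if self.model is not None:
                del self.model         # drop the only strong reference
                self.model = None
                gc.collect()           # reclaim the old weights' RAM now
            self.model = self._loader(name)
            self.name = name
        return self.model
```

The explicit del + gc.collect() before loading is the key step: without it, peak RAM briefly holds both models, which can OOM a 64 GB box when swapping between large quantizations.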

Inference Engine: llama-cpp-python

We use llama-cpp-python (Python bindings for llama.cpp) rather than Ollama or vLLM because:

  • Zero overhead — direct C++ inference, no HTTP server layer, no Docker container
  • Fine-grained control — n_threads, n_batch, n_ctx tunable per-request
  • Hot-swappable — load/unload models without restarting the Flask app
  • Memory efficient — mmap support, only loads needed layers

References

  • Abdin et al. (2024). "Phi-4 Technical Report". Microsoft Research. arXiv:2412.08905
  • Dettmers et al. (2023). "QLoRA: Efficient Finetuning of Quantized Language Models". NeurIPS 2023
  • llama.cpp GGUF quantization benchmarks — github.com/ggml-org/llama.cpp

3. Embedding: BGE-M3

Embeddings convert text into numerical vectors that capture meaning. Similar texts end up close in vector space, enabling semantic search. The embedding model is critical for RAG quality — it determines which documents get retrieved.
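A toy illustration of "close in vector space", using hand-made 3-dimensional vectors (real BGE-M3 embeddings are 1024-dimensional and learned, not hand-made):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: near 1.0 for same direction, near 0.0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Illustrative "embeddings": related texts point in similar directions
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]
```

Here cosine_similarity(cat, kitten) is far higher than cosine_similarity(cat, invoice) — exactly the property retrieval relies on when ranking chunks against a query vector.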

Why BGE-M3?

  • Multilingual — supports 100+ languages including French, English, and Japanese (critical for Japan job market positioning)
  • Hybrid retrieval — dense (1024-dim) + sparse + multi-vector representations in a single model
  • MTEB performance — 63.0 average MTEB score, among the top open-source embedding models for retrieval tasks
  • CPU-friendly — 567M parameters, <30ms per query on CPU (sentence-transformers benchmark)
  • Apache 2.0 license — full commercial use

Benchmark Results (CPU Inference)

Model | Params | MTEB | Latency (CPU) | Top-5 Acc | Multilingual
e5-small | 118M | n/a | 16ms | 100% | Limited
BGE-M3 | 567M | 63.0 | <30ms | Competitive | 100+ languages
Qwen3-Embed-8B | 8B | 70.58 | ~200ms | High | 100+ languages
all-MiniLM-L6-v2 | 22.7M | n/a | 12ms | 56% | English only

For a PDF page (~500-1000 tokens), BGE-M3 chunks and embeds in <5 seconds on CPU, well within our 10-second target. Using ONNX runtime or OpenVINO quantization can further reduce this by 30-40%.

References

  • Chen et al. (2024). "BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation". arXiv:2402.03216
  • Muennighoff et al. (2023). "MTEB: Massive Text Embedding Benchmark". EACL 2023
  • AIM Research (2025). "Benchmark of 16 Best Open Source Embedding Models for RAG"

4. Vector Store: Qdrant

A vector database stores the embedding vectors and allows fast similarity search ("find the 5 most relevant chunks for this question"). Qdrant is the engine that makes RAG retrieval near-instant.

Why Qdrant?

  • Rust-native — compiled binary, minimal memory overhead, no JVM/GC pauses
  • HNSW indexing — Hierarchical Navigable Small World graphs for sub-10ms approximate nearest neighbor search
  • Scalar INT8 quantization — 4× RAM reduction with <1% recall loss, configurable always_ram=True for speed
  • Payload filtering — indexed keyword filters on source_type, source_name for O(1) pre-filtering
  • Self-hosted — full data sovereignty, critical for sensitive documents and Japan data residency
  • Active community — 21k+ GitHub stars, weekly releases, excellent documentation
  • gRPC + REST API — Python client with async support, batch operations

HNSW Configuration

Parameter | Value | Rationale
m | 16 | Default. Each node connects to 16 neighbors. Good recall/memory tradeoff
ef_construct | 200 | High build quality. Slower indexing but better recall at search time
ef_search | 128 | Search-time beam width. Ensures >95% recall with sub-10ms latency
Quantization | INT8 scalar | Reduces 1024-dim float32 vectors from 4KB to 1KB each
always_ram | True | With 64GB RAM, we can keep all quantized vectors in memory

Alternatives Considered

Database | Pros | Cons
Qdrant | Rust, fast, HNSW, quantization, filtering | Separate process (Docker)
PGVector | PostgreSQL integration | Slower ANN, no native quantization, GC pauses
ChromaDB | Simple Python API | Limited scaling, no quantization, SQLite backend
Weaviate | GraphQL, modules | Heavier RAM footprint, Go-based
FAISS | In-process, fast | No persistence, no filtering, no API

References

  • Malkov & Yashunin (2018). "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs". IEEE TPAMI
  • Qdrant Documentation — Performance Optimization Guide
  • Baranchuk et al. (2022). "Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors"

5. RAG Pipeline

RAG (Retrieval-Augmented Generation) is the core pattern: instead of relying on the LLM's memory alone, we search for relevant document passages first, then let the LLM write an answer grounded in those passages — with citations.

Architecture

The pipeline follows the canonical RAG pattern from Lewis et al. (2020), adapted for CPU-only inference:

  1. Document Ingestion — PDF/DOCX/TXT/MD → PyMuPDF extraction → sentence-boundary chunking (512 chars, 64 overlap) → BGE-M3 embedding → Qdrant upsert with metadata
  2. Query Processing — User question → BGE-M3 embedding → Qdrant ANN search (top-8, threshold 0.45) → cross-encoder reranking (top-3) → context assembly with source tracking
  3. Generation — Context + question → Phi-4 with citation-aware system prompt + personal context → response with [Source: name, page] citations

Chunking Strategy

  • 512-character chunks — roughly 128 tokens per chunk, keeping each embedding focused on a single idea (BGE-M3 accepts inputs up to 8192 tokens, so the chunk size is a retrieval-precision choice, not a model limit)
  • 64-character overlap — prevents information loss at chunk boundaries
  • Sentence-boundary aware — splits at sentence endings to preserve semantic coherence
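The strategy above, as a simplified sketch (MIRROR's actual splitter may differ; this version packs whole sentences up to the limit and seeds each new chunk with a character-level overlap tail):

```python
import re

def chunk_text(text: str, max_chars: int = 512, overlap: int = 64) -> list[str]:
    """Sentence-boundary-aware chunking sketch: accumulate whole
    sentences up to max_chars, then start the next chunk with the last
    `overlap` characters of the previous one."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = current[-overlap:] + " " + sentence  # boundary overlap
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Splitting only after sentence-ending punctuation keeps each chunk semantically whole, and the overlap tail ensures a fact straddling a boundary is retrievable from at least one chunk.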

Citation Mechanism

Each retrieved chunk carries metadata (source_name, page, chunk_index). The system prompt instructs the LLM to cite sources using [Source: name, p.X] format. Sources are also returned as structured JSON for the frontend to display.
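A sketch of this assembly step. The field names follow the metadata listed above; the exact prompt layout is illustrative:

```python
import json

def build_context(chunks: list[dict]) -> tuple[str, str]:
    """Assemble the LLM context with per-chunk source tags, plus a
    structured JSON source list for the frontend."""
    parts, sources = [], []
    for chunk in chunks:
        tag = f"[Source: {chunk['source_name']}, p.{chunk['page']}]"
        parts.append(f"{tag}\n{chunk['text']}")
        sources.append({
            "name": chunk["source_name"],
            "page": chunk["page"],
            "chunk_index": chunk["chunk_index"],
        })
    return "\n\n".join(parts), json.dumps(sources)
```

Tagging each chunk inline gives the LLM something concrete to copy when citing, while the parallel JSON list lets the frontend render sources without parsing the generated text.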

Limitations

  • No hybrid search — BGE-M3 supports sparse vectors but Qdrant hybrid search adds complexity; dense-only is sufficient for our document corpus size
  • Context window — 4096 tokens (configurable per model) limits us to ~3 reranked chunks per query. Context budget is dynamically computed based on loaded model's n_ctx
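The dynamic budget can be sketched as below. The reserved-token and tokens-per-chunk figures are illustrative back-of-envelope estimates (system prompt + personal context + history + answer reservation), chosen so that a 4096-token window yields roughly the 3 chunks cited above:

```python
def max_chunks(n_ctx: int, reserved_tokens: int = 3600,
               tokens_per_chunk: int = 160) -> int:
    """How many reranked chunks fit in the loaded model's context
    window once everything else in the prompt is accounted for.
    Both defaults are illustrative estimates, not MIRROR constants."""
    return max(0, (n_ctx - reserved_tokens) // tokens_per_chunk)
```

Because the budget is derived from the loaded model's n_ctx rather than hardcoded, swapping in a larger-context model automatically admits more chunks per query.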

References

  • Lewis et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". NeurIPS 2020
  • Gao et al. (2024). "Retrieval-Augmented Generation for Large Language Models: A Survey". arXiv:2312.10997
  • Shi et al. (2023). "REPLUG: Retrieval-Augmented Black-Box Language Models". arXiv:2301.12652

5b. Reranker: cross-encoder/ms-marco-MiniLM-L-6-v2

The initial vector search is fast but approximate. A reranker is a second, more precise model that re-scores the top candidates to select only the most relevant passages for the LLM.

Why a Cross-Encoder Reranker?

Bi-encoder retrieval (BGE-M3 + cosine similarity) is fast but approximate. A cross-encoder jointly encodes query+document with full cross-attention, yielding significantly more accurate relevance scores at the cost of higher latency. By retrieving 8 candidates and reranking to top-3, we get the best of both worlds.

Why ms-marco-MiniLM-L-6-v2?

  • Only 22M parameters — extremely fast on CPU (~5-15ms per query-document pair)
  • Trained on MS MARCO — ~500k real search queries over 8.8M passages, the standard benchmark for passage reranking
  • MRR@10 of 39.01 on the MS MARCO dev set — best speed/quality tradeoff among CPU-friendly rerankers
  • sentence-transformers CrossEncoder API — drop-in integration, no extra dependencies
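The retrieve-then-rerank step looks like the sketch below. In the real pipeline, score_fn is CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2').predict over (query, passage) pairs; here it is injected so the example runs without downloading the model:

```python
def rerank(query: str, candidates: list[str], score_fn,
           top_k: int = 3) -> list[str]:
    """Re-score ANN candidates with a precise (slower) scorer and keep
    only the best top_k for the LLM context."""
    scores = score_fn([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores),
                    key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_k]]
```

Retrieving 8 candidates cheaply and paying the cross-attention cost on only those 8 pairs is what keeps total reranking latency in the tens of milliseconds on CPU.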

Pipeline Integration

Stage | Model | Latency (CPU) | Output
1. Embedding | BGE-M3 | <30ms | 1024-dim query vector
2. ANN Search | Qdrant HNSW | <10ms | Top-8 candidates
3. Reranking | MiniLM-L-6-v2 | ~40-120ms (8 pairs) | Top-3 reranked
4. Generation | Phi-4 14B | 5-50s | Answer with citations

References

  • Nogueira & Cho (2019). "Passage Re-ranking with BERT". arXiv:1901.04085
  • Wang et al. (2020). "MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression". NeurIPS 2020

5c. Vision Model: MiniCPM-V 2.6 (INT4)

Some PDF pages contain charts, diagrams, or complex layouts that text extraction misses. A Vision Language Model (VLM) can "see" the rendered page and describe its visual content, enriching the RAG context.

Why a Vision Language Model?

Text extraction returns little or nothing for a page dominated by a chart, a scanned table, or a heavily designed layout. Rendering such pages as images and describing them with a VLM recovers that content as text that can be chunked, embedded, and retrieved like any other passage.

Why MiniCPM-V 2.6?

  • 8B parameters total (SigLip-400M vision encoder + Qwen2-7B language model) — compact for a VLM
  • INT4 quantization — ~4 GB RAM footprint, fits alongside the LLM in our 64 GB budget
  • GPT-4V level accuracy on OCR, document understanding, and chart comprehension benchmarks
  • On-demand loading — loaded only when processing visual PDFs, unloaded to free RAM
  • Apache 2.0 license

Usage in MIRROR

The vision model is optional (activated via VISION_ENABLED=1). When enabled, PDF pages are rendered at 150 DPI and analyzed by the VLM. The visual description is appended to the text extraction, enriching the RAG context. On CPU, each page analysis takes ~30-60 seconds — acceptable for async document processing.

References

  • Yao et al. (2024). "MiniCPM-V: A GPT-4V Level MLLM on Your Phone". arXiv:2408.01800

6. Ops & Engineering: Docker, Compose & Deployment

The entire MIRROR stack is containerized and runs with a single command. This section covers the Docker architecture, the Compose orchestration, the reverse proxy, and the deployment workflow.

Docker: Multi-Stage Build

The application image is built from python:3.11-slim with minimal system dependencies (build-essential, cmake for llama.cpp compilation). The resulting image includes all Python packages and the Flask app, but not the models — those are downloaded at runtime via the UI or the bootstrap script.

Layer | Content | Size Impact
Base | python:3.11-slim | ~120 MB
System deps | build-essential, cmake, git | ~200 MB (build only)
Python deps | llama-cpp-python, sentence-transformers, Flask, PyMuPDF… | ~2.5 GB (includes PyTorch CPU)
App code | Flask routes, templates, static files | ~5 MB

Docker Compose: 3-Service Stack

The docker-compose.yml defines three services that work together:

Service | Image | Role | Ports
mirror | Custom (Dockerfile) | Flask app + LLM + Embedding + Reranker | 5000 (internal)
qdrant | qdrant/qdrant:v1.12.4 | Vector database for RAG retrieval | 6333/6334 (internal)
caddy | caddy:2.8.4 | Reverse proxy, auto HTTPS, TLS certificates | 80, 443 (public)

Key design decisions:

  • No fixed CPU/memory limits — Docker's cgroup scheduler manages resource contention naturally. Thread count is auto-detected via os.sched_getaffinity()
  • Volume mounts — ./models, ./uploads, ./articles, ./data are mounted from the host for persistence and easy access
  • Internal networking — Qdrant and the app communicate via Docker's internal DNS (qdrant:6333), never exposed to the host
  • env_file integration — .env passes HF_TOKEN and secrets securely into the container
  • Restart policy — unless-stopped on all services for automatic recovery after crashes or reboots

Caddy: Reverse Proxy & Auto HTTPS

Caddy sits in front of the Flask app and handles:

  • Automatic TLS — Let's Encrypt certificates provisioned and renewed automatically, zero configuration
  • HTTP → HTTPS redirect — all traffic forced to HTTPS
  • Reverse proxy — forwards :443 → mirror:5000 internally
  • Zero downtime — Caddy reloads config without dropping connections

Gunicorn: Application Server

The Flask app runs behind Gunicorn with carefully tuned settings for LLM workloads:

  • 1 worker + 8 threads — a single worker because only one copy of the model fits in RAM and inference is serialized behind a lock; 8 threads handle concurrent HTTP requests (downloads, status polls, static files) while the LLM generates
  • 600s worker timeout — large model downloads and slow CPU inference can exceed default timeouts
  • 300s graceful shutdown — allows in-progress generations to complete before stopping

Deployment Workflow

Deploying MIRROR to a new server requires only Docker and one script:

  1. Clone the repo — git clone the repository and create .env with HF_TOKEN
  2. Run bash run.sh — builds images, starts the stack, waits for readiness, downloads the default model automatically
  3. Open the browser — the site is live at https://your-domain (or http://localhost for local dev)

The bootstrap script (run.sh) handles the full lifecycle: build → start → healthcheck → model download → ready. No manual steps required beyond the initial .env setup.

Scaling Considerations

MIRROR is designed as a single-node deployment optimized for simplicity. For larger-scale deployments:

  • Kubernetes — each service maps naturally to a Deployment + Service. Qdrant has an official Helm chart with replication support
  • Model storage — mount a shared NFS/EFS volume for /models across pods, or use an init container to download models
  • Horizontal scaling — the Flask app is stateless (SQLite → PostgreSQL migration needed). Multiple replicas behind a load balancer, each with its own LLM instance
  • GPU upgrade path — swap llama.cpp CPU backend for vLLM/TensorRT-LLM on GPU nodes, keep the same REST API contract

7. Web Scraping

MIRROR can ingest content from any URL — paste a link, and the system extracts clean text for immediate Q&A or permanent indexing into the knowledge base.

The scraping tool uses trafilatura as the primary extractor (precision >90% on web-content extraction benchmarks), with BeautifulSoup as a fallback. Scraped content can be:

  • Queried directly (in-memory, no indexing) for quick one-off questions
  • Indexed into Qdrant for persistent retrieval alongside uploaded documents
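The BeautifulSoup fallback boils down to stripping tags while skipping script and style content. A stdlib-only sketch of that idea (the production path uses trafilatura and BeautifulSoup, not this class):

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping script/style content."""
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.fragments = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.fragments.append(data.strip())

def extract_visible_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    return " ".join(parser.fragments)
```

What trafilatura adds on top of this kind of tag-stripping is boilerplate removal — navigation, footers, cookie banners — which is where its precision advantage comes from.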

References

  • Barbaresi (2021). "Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction". ACL 2021 System Demonstrations

8. Multi-Mode Chat System

The chat interface is the main way users interact with MIRROR. Four specialized modes let users choose between direct conversation, document-grounded Q&A, web content analysis, and document-only search.

Architecture

MIRROR supports four chat modes, each optimized for a different use case:

Mode | Pipeline | Use Case
Chat | Direct LLM + personal context + conversation history | Natural conversation, greetings, personal questions
RAG | Embed → Qdrant search → Rerank → LLM | Document Q&A with source citations
Scrap | In-memory web content → LLM | Questions about scraped web pages
Full Doc | RAG filtered to source_type=document | Search only uploaded documents

Streaming (SSE)

All modes support Server-Sent Events (SSE) streaming for real-time token-by-token display. The frontend uses the ReadableStream API to process SSE chunks as they arrive, providing a responsive typing experience even with CPU inference at 3-6 t/s.
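The SSE wire format is simple enough to sketch directly. The {'token': ...} payload shape and the [DONE] sentinel are illustrative conventions, not part of the SSE specification, which only mandates the "data:" prefix and a blank-line terminator:

```python
import json

def sse_event(token: str) -> str:
    """Encode one generated token as a Server-Sent Events chunk."""
    return f"data: {json.dumps({'token': token})}\n\n"

def stream_tokens(tokens):
    """Generator suitable for a Flask Response with
    mimetype='text/event-stream': one SSE chunk per token."""
    for token in tokens:
        yield sse_event(token)
    yield "data: [DONE]\n\n"  # end-of-stream sentinel (a convention)
```

Each yielded chunk is flushed to the client as soon as llama.cpp emits the token, which is what makes 3-6 t/s feel responsive rather than like a 30-second stall.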

Conversation History

Chat mode includes multi-turn conversation context — the last 6 messages are injected into the prompt, enabling coherent follow-up questions without re-stating context. History is persisted to SQLite and can be resumed across sessions.
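A sketch of the 6-message window (role/content field names as in the messages table; the prompt layout is illustrative):

```python
def build_history_prompt(messages: list[dict], max_messages: int = 6) -> str:
    """Inject only the most recent messages into the prompt, matching
    the 6-message window described above."""
    recent = messages[-max_messages:]
    return "\n".join(f"{m['role']}: {m['content']}" for m in recent)
```

Capping the window bounds prompt growth on long conversations, trading distant context for a predictable token footprint inside the model's n_ctx.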

9. SQLite Database & User Sessions

All persistent state — users, conversations, chat history, uploaded sources, and logs — lives in a single SQLite file. Zero configuration, zero external dependencies, portable with the container.

Schema

Table | Purpose | Key Fields
users | Anonymous user sessions (cookie-based) | id (UUID), created_at, last_seen
conversations | Chat threads per user | id, user_id, title, mode, timestamps
messages | Individual messages with metadata | id, conversation_id, role, content, sources, timings
user_sources | Per-user document/web source tracking | id, user_id, source_name, source_type
logs | Structured application logs | timestamp, level, component, message, details

Design Decisions

  • SQLite + WAL mode — zero-config, file-based, concurrent reads with WAL journaling
  • Cookie-based user isolation — anonymous UUID cookie (1 year TTL, HttpOnly, SameSite=Lax) — no login required
  • Thread-local connections — each gunicorn thread gets its own SQLite connection, avoiding lock contention
  • CASCADE DELETE — deleting a conversation automatically removes all its messages
  • JSON columns — sources and timings stored as JSON text for flexibility without schema changes
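A runnable sketch of the WAL + CASCADE setup using only the stdlib sqlite3 module (schema reduced to two of the tables above, with abbreviated columns):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS conversations (
    id TEXT PRIMARY KEY,
    user_id TEXT,
    title TEXT
);
CREATE TABLE IF NOT EXISTS messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    conversation_id TEXT REFERENCES conversations(id) ON DELETE CASCADE,
    role TEXT,
    content TEXT,
    sources TEXT  -- JSON serialized as text
);
"""

def connect(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # concurrent readers, one writer
    conn.execute("PRAGMA foreign_keys=ON")   # SQLite needs this per connection
                                             # for ON DELETE CASCADE to fire
    conn.executescript(SCHEMA)
    return conn
```

The foreign_keys pragma is the easy one to forget: SQLite disables foreign-key enforcement by default, so without it the CASCADE clause is silently ignored.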

10. Structured Logging

Full observability without external dependencies — all events are logged to SQLite with structured metadata, queryable via API.

All application events (queries, errors, model loads, document uploads) are logged to the logs SQLite table with structured metadata. Logs can be queried via /api/chat/logs with filters by component and level. This provides full observability without external dependencies (no ELK/Grafana needed).

11. Full Stack Summary

Layer | Technology | Role
Web Framework | Flask 3.1 | Lightweight, Jinja2 templates, blueprint architecture
LLM | Multi-model (Phi-4, Qwen 2.5, Llama 3) + llama-cpp-python | CPU inference, hot-swappable, 5-46 GB RAM
Embedding | BGE-M3 + sentence-transformers | Multilingual, 1024-dim, <30ms/query
Reranker | ms-marco-MiniLM-L-6-v2 | Cross-encoder, 22M params, ~15ms/pair on CPU
Vision (opt.) | MiniCPM-V 2.6 INT4 | PDF visual understanding, ~4 GB RAM
Vector Store | Qdrant (Docker) | HNSW + INT8 quantization, sub-10ms search
PDF Parsing | PyMuPDF (fitz) | Fast, handles complex layouts
Scraping | trafilatura + BeautifulSoup | High-precision content extraction
Database | SQLite (WAL mode) | Users, conversations, chat history, logs
Sessions | Cookie-based UUID | Anonymous user isolation, no login required
Streaming | Server-Sent Events (SSE) | Real-time token-by-token chat display
Chat Modes | Chat / RAG / Scrap / Full Doc | Direct chat, document Q&A, web scraping, doc-only search
Containerization | Docker Compose | Full encapsulation, zero host deps, portable