Technical Choices

Architecture decisions, benchmarks, and infrastructure optimization for MIRROR.

1. Infrastructure Constraints

MIRROR runs entirely on a single server with no GPU — a deliberate choice that forces every component to be optimized for CPU inference. This section documents the hardware budget and how each service fits within it.

Resource | Specification | Budget Allocation
RAM | 64 GB DDR4/DDR5 | LLM ~12 GB · Embedding ~2 GB · Qdrant ~4 GB · OS+App ~6 GB · Free ~40 GB
CPU | 12 cores (x86_64) | Auto-detected by Docker — no fixed allocation, scheduler manages contention
GPU | None | Full CPU inference — drives all model choices
Storage | SSD (assumed) | Models ~10 GB · Qdrant data · Uploaded docs

The absence of a GPU is the defining constraint. Every model selection prioritizes CPU inference speed while maintaining production-grade quality. Thread allocation is auto-detected at runtime (multiprocessing.cpu_count() - 2), letting Docker's cgroup limits apply naturally when deployed on different VMs.
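A minimal sketch of this auto-detection, combining the cpu_count() fallback with the cgroup-aware os.sched_getaffinity() call mentioned in the Compose section (the 2-core reserve follows the formula above; function name illustrative):

```python
import multiprocessing
import os

def detect_inference_threads(reserve: int = 2) -> int:
    """Pick a thread count for llama.cpp, keeping `reserve` cores free
    for the web server and OS. On Linux, sched_getaffinity reflects
    Docker cgroup CPU limits, which plain cpu_count() may not."""
    try:
        available = len(os.sched_getaffinity(0))
    except AttributeError:  # sched_getaffinity is Linux-only
        available = multiprocessing.cpu_count()
    return max(1, available - reserve)
```

The max(1, ...) floor matters on small VMs: with 2 cores and a reserve of 2, the naive formula would yield zero threads.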

2. LLM: Dynamic Multi-Model Switching

The LLM is the brain of MIRROR. Instead of locking into a single model, the system supports hot-swapping between multiple models at runtime — download from HuggingFace, load into RAM, and switch without restarting.

Architecture: Model Registry + Hot-Swap

All available models are declared in a central MODEL_REGISTRY with their HuggingFace coordinates, RAM estimates, and context window sizes. The user selects a model from the UI, downloads it with one click, and loads it — the previous model is automatically unloaded and garbage-collected to free RAM. The default model is Phi-4 Mini 3.8B Q4_K_M, chosen for fast CPU inference.

Available Models

Model | Params | Quant | RAM | CPU t/s (est.) | Use Case
Phi-4 14B | 14B | Q8_0 | ~16 GB | 3-6 | Default — best quality/size
Phi-4 14B | 14B | Q4_K_M | ~9 GB | 5-10 | Lighter variant, faster
Qwen 2.5 32B | 32B | Q6_K | ~28 GB | 1-3 | Top multilingual reasoning
Phi-3.5 MoE 42B | 42B (6.6B active) | Q8_0 | ~46 GB | 1-2 | MoE for complex tasks
Llama 3.1 8B | 8B | FP16/Q8/Q6/Q4 | 5-16 GB | 4-18 | Quantization comparison baseline

Why Phi-4 as Default?

  • Best quality/size ratio at 14B — Microsoft's Phi-4 matches or exceeds many 30B+ models on reasoning benchmarks (MMLU, HumanEval, GSM8K)
  • Q8_0 quantization — ~16 GB RAM, near-lossless quality retention vs FP16
  • CPU-optimized via llama.cpp — GGUF format with AVX2/AVX-512 SIMD acceleration
  • MIT license — full commercial use, no restrictions
  • 4096 token context — sufficient for RAG with 3-5 reranked chunks

Model Switching Mechanism

  • Registry-based — all models declared in MODEL_REGISTRY with HuggingFace repo/filename, RAM estimate, per-model n_ctx
  • One-click download — background download from HuggingFace with progress tracking via polling
  • GC-aware unload — explicit del model + gc.collect() ensures RAM is reclaimed before loading the next model
  • Thread-safe — generation lock prevents concurrent inference during model swap
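The mechanism above can be sketched as follows. The loader callable stands in for the llama-cpp-python constructor, and the class and field names are illustrative, not MIRROR's actual code:

```python
import gc
import threading

class ModelManager:
    """Hot-swap sketch: one model resident in RAM at a time, with the
    swap guarded by the same lock the generation path holds, so no
    inference runs against a half-unloaded model."""

    def __init__(self, loader):
        self._loader = loader          # callable(name) -> model object
        self._lock = threading.Lock()  # shared with the generation path
        self.model = None
        self.name = None

    def switch(self, name: str):
        with self._lock:               # wait for in-flight generation to end
            if self.model is not None:
                del self.model         # drop the only strong reference
                self.model = None
                gc.collect()           # reclaim the old weights' RAM now
            self.model = self._loader(name)
            self.name = name
        return self.model
```

The explicit del + gc.collect() before loading is the key step: without it, peak RAM briefly holds both models, which can OOM a 64 GB box when swapping between large quantizations.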

Inference Engine: llama-cpp-python

We use llama-cpp-python (Python bindings for llama.cpp) rather than Ollama or vLLM because:

  • Zero overhead — direct C++ inference, no HTTP server layer, no Docker container
  • Fine-grained control — n_threads, n_batch, n_ctx tunable per-request
  • Hot-swappable — load/unload models without restarting the Flask app
  • Memory efficient — mmap support, only loads needed layers

References

  • Abdin et al. (2024). "Phi-4 Technical Report". Microsoft Research. arXiv:2412.08905
  • Dettmers et al. (2023). "QLoRA: Efficient Finetuning of Quantized Language Models". NeurIPS 2023
  • llama.cpp GGUF quantization benchmarks — github.com/ggml-org/llama.cpp

3. Embedding: BGE-M3

Embeddings convert text into numerical vectors that capture meaning. Similar texts end up close in vector space, enabling semantic search. The embedding model is critical for RAG quality — it determines which documents get retrieved.
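A toy illustration of "close in vector space", using hand-made 3-dimensional vectors (real BGE-M3 embeddings are 1024-dimensional and learned, not hand-made):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: near 1.0 for same direction, near 0.0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Illustrative "embeddings": related texts point in similar directions
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]
```

Here cosine_similarity(cat, kitten) is far higher than cosine_similarity(cat, invoice) — exactly the property retrieval relies on when ranking chunks against a query vector.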

Why BGE-M3?

  • Multilingual — supports 100+ languages including French, English, and Japanese (critical for Japan job market positioning)
  • Hybrid retrieval — dense (1024-dim) + sparse + multi-vector representations in a single model
  • MTEB performance — 63.0 average MTEB score, among the top open-source embedding models for retrieval tasks
  • CPU-friendly — 567M parameters, <30ms per query on CPU (sentence-transformers benchmark)
  • Apache 2.0 license — full commercial use

Benchmark Results (CPU Inference)

Model | Params | MTEB | Latency (CPU) | Top-5 Acc | Multilingual
e5-small | 118M | n/a | 16ms | 100% | Limited
BGE-M3 | 567M | 63.0 | <30ms | Competitive | 100+ languages
Qwen3-Embed-8B | 8B | 70.58 | ~200ms | High | 100+ languages
all-MiniLM-L6-v2 | 22.7M | n/a | 12ms | 56% | English only

For a PDF page (~500-1000 tokens), BGE-M3 chunks and embeds in <5 seconds on CPU, well within our 10-second target. Using ONNX runtime or OpenVINO quantization can further reduce this by 30-40%.

References

  • Chen et al. (2024). "BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation". arXiv:2402.03216
  • Muennighoff et al. (2023). "MTEB: Massive Text Embedding Benchmark". EACL 2023
  • AIM Research (2025). "Benchmark of 16 Best Open Source Embedding Models for RAG"

4. Vector Store: Qdrant

A vector database stores the embedding vectors and allows fast similarity search ("find the 5 most relevant chunks for this question"). Qdrant is the engine that makes RAG retrieval near-instant.

Why Qdrant?

  • Rust-native — compiled binary, minimal memory overhead, no JVM/GC pauses
  • HNSW indexing — Hierarchical Navigable Small World graphs for sub-10ms approximate nearest neighbor search
  • Scalar INT8 quantization — 4× RAM reduction with <1% recall loss, configurable always_ram=True for speed
  • Payload filtering — indexed keyword filters on source_type, source_name for O(1) pre-filtering
  • Self-hosted — full data sovereignty, critical for sensitive documents and Japan data residency
  • Active community — 21k+ GitHub stars, weekly releases, excellent documentation
  • gRPC + REST API — Python client with async support, batch operations

HNSW Configuration

Parameter | Value | Rationale
m | 16 | Default. Each node connects to 16 neighbors. Good recall/memory tradeoff
ef_construct | 200 | High build quality. Slower indexing but better recall at search time
ef_search | 128 | Search-time beam width. Ensures >95% recall with sub-10ms latency
Quantization | INT8 scalar | Reduces 1024-dim float32 vectors from 4KB to 1KB each
always_ram | True | With 64GB RAM, we can keep all quantized vectors in memory

Alternatives Considered

Database | Pros | Cons
Qdrant | Rust, fast, HNSW, quantization, filtering | Separate process (Docker)
PGVector | PostgreSQL integration | Slower ANN, no native quantization, GC pauses
ChromaDB | Simple Python API | Limited scaling, no quantization, SQLite backend
Weaviate | GraphQL, modules | Heavier RAM footprint, Go-based
FAISS | In-process, fast | No persistence, no filtering, no API

References

  • Malkov & Yashunin (2018). "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs". IEEE TPAMI
  • Qdrant Documentation — Performance Optimization Guide
  • Baranchuk et al. (2022). "Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors"

5. RAG Pipeline

RAG (Retrieval-Augmented Generation) is the core pattern: instead of relying on the LLM's memory alone, we search for relevant document passages first, then let the LLM write an answer grounded in those passages — with citations.

Architecture

The pipeline follows the canonical RAG pattern from Lewis et al. (2020), adapted for CPU-only inference:

  1. Document Ingestion — PDF/DOCX/TXT/MD → PyMuPDF extraction → sentence-boundary chunking (512 chars, 64 overlap) → BGE-M3 embedding → Qdrant upsert with metadata
  2. Query Processing — User question → BGE-M3 embedding → Qdrant ANN search (top-8, threshold 0.45) → cross-encoder reranking (top-3) → context assembly with source tracking
  3. Generation — Context + question → Phi-4 with citation-aware system prompt + personal context → response with [Source: name, page] citations

Chunking Strategy

  • 512-character chunks — roughly 128 tokens per chunk, keeping each embedding focused on a single idea (BGE-M3 accepts inputs up to 8192 tokens, so the chunk size is a retrieval-precision choice, not a model limit)
  • 64-character overlap — prevents information loss at chunk boundaries
  • Sentence-boundary aware — splits at sentence endings to preserve semantic coherence
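The strategy above, as a simplified sketch (MIRROR's actual splitter may differ; this version packs whole sentences up to the limit and seeds each new chunk with a character-level overlap tail):

```python
import re

def chunk_text(text: str, max_chars: int = 512, overlap: int = 64) -> list[str]:
    """Sentence-boundary-aware chunking sketch: accumulate whole
    sentences up to max_chars, then start the next chunk with the last
    `overlap` characters of the previous one."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = current[-overlap:] + " " + sentence  # boundary overlap
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Splitting only after sentence-ending punctuation keeps each chunk semantically whole, and the overlap tail ensures a fact straddling a boundary is retrievable from at least one chunk.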

Citation Mechanism

Each retrieved chunk carries metadata (source_name, page, chunk_index). The system prompt instructs the LLM to cite sources using [Source: name, p.X] format. Sources are also returned as structured JSON for the frontend to display.
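A sketch of this assembly step. The field names follow the metadata listed above; the exact prompt layout is illustrative:

```python
import json

def build_context(chunks: list[dict]) -> tuple[str, str]:
    """Assemble the LLM context with per-chunk source tags, plus a
    structured JSON source list for the frontend."""
    parts, sources = [], []
    for chunk in chunks:
        tag = f"[Source: {chunk['source_name']}, p.{chunk['page']}]"
        parts.append(f"{tag}\n{chunk['text']}")
        sources.append({
            "name": chunk["source_name"],
            "page": chunk["page"],
            "chunk_index": chunk["chunk_index"],
        })
    return "\n\n".join(parts), json.dumps(sources)
```

Tagging each chunk inline gives the LLM something concrete to copy when citing, while the parallel JSON list lets the frontend render sources without parsing the generated text.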

Limitations

  • No hybrid search — BGE-M3 supports sparse vectors but Qdrant hybrid search adds complexity; dense-only is sufficient for our document corpus size
  • Context window — 4096 tokens (configurable per model) limits us to ~3 reranked chunks per query. Context budget is dynamically computed based on loaded model's n_ctx
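The dynamic budget can be sketched as below. The reserved-token and tokens-per-chunk figures are illustrative back-of-envelope estimates (system prompt + personal context + history + answer reservation), chosen so that a 4096-token window yields roughly the 3 chunks cited above:

```python
def max_chunks(n_ctx: int, reserved_tokens: int = 3600,
               tokens_per_chunk: int = 160) -> int:
    """How many reranked chunks fit in the loaded model's context
    window once everything else in the prompt is accounted for.
    Both defaults are illustrative estimates, not MIRROR constants."""
    return max(0, (n_ctx - reserved_tokens) // tokens_per_chunk)
```

Because the budget is derived from the loaded model's n_ctx rather than hardcoded, swapping in a larger-context model automatically admits more chunks per query.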

References

  • Lewis et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". NeurIPS 2020
  • Gao et al. (2024). "Retrieval-Augmented Generation for Large Language Models: A Survey". arXiv:2312.10997
  • Shi et al. (2023). "REPLUG: Retrieval-Augmented Black-Box Language Models". arXiv:2301.12652

5b. Reranker: cross-encoder/ms-marco-MiniLM-L-6-v2

The initial vector search is fast but approximate. A reranker is a second, more precise model that re-scores the top candidates to select only the most relevant passages for the LLM.

Why a Cross-Encoder Reranker?

Bi-encoder retrieval (BGE-M3 + cosine similarity) is fast but approximate. A cross-encoder jointly encodes query+document with full cross-attention, yielding significantly more accurate relevance scores at the cost of higher latency. By retrieving 8 candidates and reranking to top-3, we get the best of both worlds.

Why ms-marco-MiniLM-L-6-v2?

  • Only 22M parameters — extremely fast on CPU (~5-15ms per query-document pair)
  • Trained on MS MARCO — ~500k real search queries over 8.8M passages, the standard benchmark for passage reranking
  • MRR@10 of 39.01 on the MS MARCO dev set — best speed/quality tradeoff among CPU-friendly rerankers
  • sentence-transformers CrossEncoder API — drop-in integration, no extra dependencies
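The retrieve-then-rerank step looks like the sketch below. In the real pipeline, score_fn is CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2').predict over (query, passage) pairs; here it is injected so the example runs without downloading the model:

```python
def rerank(query: str, candidates: list[str], score_fn,
           top_k: int = 3) -> list[str]:
    """Re-score ANN candidates with a precise (slower) scorer and keep
    only the best top_k for the LLM context."""
    scores = score_fn([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores),
                    key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_k]]
```

Retrieving 8 candidates cheaply and paying the cross-attention cost on only those 8 pairs is what keeps total reranking latency in the tens of milliseconds on CPU.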

Pipeline Integration

Stage | Model | Latency (CPU) | Output
1. Embedding | BGE-M3 | <30ms | 1024-dim query vector
2. ANN Search | Qdrant HNSW | <10ms | Top-8 candidates
3. Reranking | MiniLM-L-6-v2 | ~40-120ms (8 pairs) | Top-3 reranked
4. Generation | Phi-4 14B | 5-50s | Answer with citations

References

  • Nogueira & Cho (2019). "Passage Re-ranking with BERT". arXiv:1901.04085
  • Wang et al. (2020). "MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression". NeurIPS 2020

5c. Vision Model: MiniCPM-V 2.6 (INT4)

Some PDF pages contain charts, diagrams, or complex layouts that text extraction misses. A Vision Language Model (VLM) can "see" the rendered page and describe its visual content, enriching the RAG context.

Why a Vision Language Model?

Text extraction returns little or nothing for a page dominated by a chart, a scanned table, or a heavily designed layout. Rendering such pages as images and describing them with a VLM recovers that content as text that can be chunked, embedded, and retrieved like any other passage.

Why MiniCPM-V 2.6?

  • 8B parameters total (SigLip-400M vision encoder + Qwen2-7B language model) — compact for a VLM
  • INT4 quantization — ~4 GB RAM footprint, fits alongside the LLM in our 64 GB budget
  • GPT-4V level accuracy on OCR, document understanding, and chart comprehension benchmarks
  • On-demand loading — loaded only when processing visual PDFs, unloaded to free RAM
  • Apache 2.0 license

Usage in MIRROR

The vision model is optional (activated via VISION_ENABLED=1). When enabled, PDF pages are rendered at 150 DPI and analyzed by the VLM. The visual description is appended to the text extraction, enriching the RAG context. On CPU, each page analysis takes ~30-60 seconds — acceptable for async document processing.

References

  • Yao et al. (2024). "MiniCPM-V: A GPT-4V Level MLLM on Your Phone". arXiv:2408.01800

6. Ops & Engineering: Docker, Compose & Deployment

The entire MIRROR stack is containerized and runs with a single command. This section covers the Docker architecture, the Compose orchestration, the reverse proxy, and the deployment workflow.

Docker: Multi-Stage Build

The application image is built from python:3.11-slim with minimal system dependencies (build-essential, cmake for llama.cpp compilation). The resulting image includes all Python packages and the Flask app, but not the models — those are downloaded at runtime via the UI or the bootstrap script.

Layer | Content | Size Impact
Base | python:3.11-slim | ~120 MB
System deps | build-essential, cmake, git | ~200 MB (build only)
Python deps | llama-cpp-python, sentence-transformers, Flask, PyMuPDF… | ~2.5 GB (includes PyTorch CPU)
App code | Flask routes, templates, static files | ~5 MB

Docker Compose: 3-Service Stack

The docker-compose.yml defines three services that work together:

Service | Image | Role | Ports
mirror | Custom (Dockerfile) | Flask app + LLM + Embedding + Reranker | 5000 (internal)
qdrant | qdrant/qdrant:v1.12.4 | Vector database for RAG retrieval | 6333/6334 (internal)
caddy | caddy:2.8.4 | Reverse proxy, auto HTTPS, TLS certificates | 80, 443 (public)

Key design decisions:

  • No fixed CPU/memory limits — Docker's cgroup scheduler manages resource contention naturally. Thread count is auto-detected via os.sched_getaffinity()
  • Volume mounts — ./models, ./uploads, ./articles, ./data are mounted from the host for persistence and easy access
  • Internal networking — Qdrant and the app communicate via Docker's internal DNS (qdrant:6333), never exposed to the host
  • env_file integration — .env passes HF_TOKEN and secrets securely into the container
  • Restart policy — unless-stopped on all services for automatic recovery after crashes or reboots

Caddy: Reverse Proxy & Auto HTTPS

Caddy sits in front of the Flask app and handles:

  • Automatic TLS — Let's Encrypt certificates provisioned and renewed automatically, zero configuration
  • HTTP → HTTPS redirect — all traffic forced to HTTPS
  • Reverse proxy — forwards :443 → mirror:5000 internally
  • Zero downtime — Caddy reloads config without dropping connections

Gunicorn: Application Server

The Flask app runs behind Gunicorn with carefully tuned settings for LLM workloads:

  • 1 worker + 8 threads — a single worker because only one copy of the model fits in RAM and inference is serialized behind a lock; 8 threads handle concurrent HTTP requests (downloads, status polls, static files) while the LLM generates
  • 600s worker timeout — large model downloads and slow CPU inference can exceed default timeouts
  • 300s graceful shutdown — allows in-progress generations to complete before stopping

Deployment Workflow

Deploying MIRROR to a new server requires only Docker and one script:

  1. Clone the repo — git clone the repository and create .env with HF_TOKEN
  2. Run bash run.sh — builds images, starts the stack, waits for readiness, downloads the default model automatically
  3. Open the browser — the site is live at https://your-domain (or http://localhost for local dev)

The bootstrap script (run.sh) handles the full lifecycle: build → start → healthcheck → model download → ready. No manual steps required beyond the initial .env setup.

Scaling Considerations

MIRROR is designed as a single-node deployment optimized for simplicity. For larger-scale deployments:

  • Kubernetes — each service maps naturally to a Deployment + Service. Qdrant has an official Helm chart with replication support
  • Model storage — mount a shared NFS/EFS volume for /models across pods, or use an init container to download models
  • Horizontal scaling — the Flask app is stateless (SQLite → PostgreSQL migration needed). Multiple replicas behind a load balancer, each with its own LLM instance
  • GPU upgrade path — swap llama.cpp CPU backend for vLLM/TensorRT-LLM on GPU nodes, keep the same REST API contract

7. Web Scraping

MIRROR can ingest content from any URL — paste a link, and the system extracts clean text for immediate Q&A or permanent indexing into the knowledge base.

The scraping tool uses trafilatura as the primary extractor (precision >90% on web-content extraction benchmarks), with BeautifulSoup as a fallback. Scraped content can be:

  • Queried directly (in-memory, no indexing) for quick one-off questions
  • Indexed into Qdrant for persistent retrieval alongside uploaded documents
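The BeautifulSoup fallback boils down to stripping tags while skipping script and style content. A stdlib-only sketch of that idea (the production path uses trafilatura and BeautifulSoup, not this class):

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping script/style content."""
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.fragments = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.fragments.append(data.strip())

def extract_visible_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    return " ".join(parser.fragments)
```

What trafilatura adds on top of this kind of tag-stripping is boilerplate removal — navigation, footers, cookie banners — which is where its precision advantage comes from.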

References

  • Barbaresi (2021). "Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction". ACL 2021 System Demonstrations

8. Multi-Mode Chat System

The chat interface is the main way users interact with MIRROR. Four specialized modes let users choose between direct conversation, document-grounded Q&A, web content analysis, and document-only search.

Architecture

MIRROR supports four chat modes, each optimized for a different use case:

Mode | Pipeline | Use Case
Chat | Direct LLM + personal context + conversation history | Natural conversation, greetings, personal questions
RAG | Embed → Qdrant search → Rerank → LLM | Document Q&A with source citations
Scrap | In-memory web content → LLM | Questions about scraped web pages
Full Doc | RAG filtered to source_type=document | Search only uploaded documents

Streaming (SSE)

All modes support Server-Sent Events (SSE) streaming for real-time token-by-token display. The frontend uses the ReadableStream API to process SSE chunks as they arrive, providing a responsive typing experience even with CPU inference at 3-6 t/s.
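The SSE wire format is simple enough to sketch directly. The {'token': ...} payload shape and the [DONE] sentinel are illustrative conventions, not part of the SSE specification, which only mandates the "data:" prefix and a blank-line terminator:

```python
import json

def sse_event(token: str) -> str:
    """Encode one generated token as a Server-Sent Events chunk."""
    return f"data: {json.dumps({'token': token})}\n\n"

def stream_tokens(tokens):
    """Generator suitable for a Flask Response with
    mimetype='text/event-stream': one SSE chunk per token."""
    for token in tokens:
        yield sse_event(token)
    yield "data: [DONE]\n\n"  # end-of-stream sentinel (a convention)
```

Each yielded chunk is flushed to the client as soon as llama.cpp emits the token, which is what makes 3-6 t/s feel responsive rather than like a 30-second stall.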

Conversation History

Chat mode includes multi-turn conversation context — the last 6 messages are injected into the prompt, enabling coherent follow-up questions without re-stating context. History is persisted to SQLite and can be resumed across sessions.
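A sketch of the 6-message window (role/content field names as in the messages table; the prompt layout is illustrative):

```python
def build_history_prompt(messages: list[dict], max_messages: int = 6) -> str:
    """Inject only the most recent messages into the prompt, matching
    the 6-message window described above."""
    recent = messages[-max_messages:]
    return "\n".join(f"{m['role']}: {m['content']}" for m in recent)
```

Capping the window bounds prompt growth on long conversations, trading distant context for a predictable token footprint inside the model's n_ctx.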

9. SQLite Database & User Sessions

All persistent state — users, conversations, chat history, uploaded sources, and logs — lives in a single SQLite file. Zero configuration, zero external dependencies, portable with the container.

Schema

Table | Purpose | Key Fields
users | Anonymous user sessions (cookie-based) | id (UUID), created_at, last_seen
conversations | Chat threads per user | id, user_id, title, mode, timestamps
messages | Individual messages with metadata | id, conversation_id, role, content, sources, timings
user_sources | Per-user document/web source tracking | id, user_id, source_name, source_type
logs | Structured application logs | timestamp, level, component, message, details

Design Decisions

  • SQLite + WAL mode — zero-config, file-based, concurrent reads with WAL journaling
  • Cookie-based user isolation — anonymous UUID cookie (1 year TTL, HttpOnly, SameSite=Lax) — no login required
  • Thread-local connections — each gunicorn thread gets its own SQLite connection, avoiding lock contention
  • CASCADE DELETE — deleting a conversation automatically removes all its messages
  • JSON columns — sources and timings stored as JSON text for flexibility without schema changes
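A runnable sketch of the WAL + CASCADE setup using only the stdlib sqlite3 module (schema reduced to two of the tables above, with abbreviated columns):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS conversations (
    id TEXT PRIMARY KEY,
    user_id TEXT,
    title TEXT
);
CREATE TABLE IF NOT EXISTS messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    conversation_id TEXT REFERENCES conversations(id) ON DELETE CASCADE,
    role TEXT,
    content TEXT,
    sources TEXT  -- JSON serialized as text
);
"""

def connect(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # concurrent readers, one writer
    conn.execute("PRAGMA foreign_keys=ON")   # SQLite needs this per connection
                                             # for ON DELETE CASCADE to fire
    conn.executescript(SCHEMA)
    return conn
```

The foreign_keys pragma is the easy one to forget: SQLite disables foreign-key enforcement by default, so without it the CASCADE clause is silently ignored.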

10. Structured Logging

Full observability without external dependencies — all events are logged to SQLite with structured metadata, queryable via API.

All application events (queries, errors, model loads, document uploads) are logged to the logs SQLite table with structured metadata. Logs can be queried via /api/chat/logs with filters by component and level. This provides full observability without external dependencies (no ELK/Grafana needed).

11. Full Stack Summary

Layer | Technology | Role
Web Framework | Flask 3.1 | Lightweight, Jinja2 templates, blueprint architecture
LLM | Multi-model (Phi-4, Qwen 2.5, Llama 3) + llama-cpp-python | CPU inference, hot-swappable, 5-46 GB RAM
Embedding | BGE-M3 + sentence-transformers | Multilingual, 1024-dim, <30ms/query
Reranker | ms-marco-MiniLM-L-6-v2 | Cross-encoder, 22M params, ~15ms/pair on CPU
Vision (opt.) | MiniCPM-V 2.6 INT4 | PDF visual understanding, ~4 GB RAM
Vector Store | Qdrant (Docker) | HNSW + INT8 quantization, sub-10ms search
PDF Parsing | PyMuPDF (fitz) | Fast, handles complex layouts
Scraping | trafilatura + BeautifulSoup | High-precision content extraction
Database | SQLite (WAL mode) | Users, conversations, chat history, logs
Sessions | Cookie-based UUID | Anonymous user isolation, no login required
Streaming | Server-Sent Events (SSE) | Real-time token-by-token chat display
Chat Modes | Chat / RAG / Scrap / Full Doc | Direct chat, document Q&A, web scraping, doc-only search
Containerization | Docker Compose | Full encapsulation, zero host deps, portable