
A clear, structured overview of core ML topics, explained with lists, examples, and practical intuition.

Machine Learning (classic)

Classical Machine Learning algorithms are the foundation of data science. They're fast, interpretable, and often the best starting point before trying complex deep learning.

Supervised Learning: Regression

Linear Regression · Regression

Imagine you have data about house sizes and their prices. Linear Regression draws the best straight line through those points so you can predict the price of a new house just from its size. It's the simplest and most fundamental prediction method in all of ML.

When to use

  • You want to predict a continuous number (price, revenue, temperature)
  • You need to understand which factors matter most and by how much
  • Small to medium datasets where you need quick results

✓ Strengths

  • Extremely fast training and inference
  • Fully interpretable coefficients
  • No hyperparameter tuning needed (basic form)
  • Closed-form solution exists (Normal Equation)

✗ Weaknesses

  • Cannot capture non-linear patterns
  • Sensitive to outliers
  • Unstable coefficient estimates when features are correlated (multicollinearity)
  • Requires feature engineering for complex relations

Key concepts

The model finds the line that minimizes prediction errors (MSE = average of squared errors). Variants: Ridge adds a penalty to keep coefficients small (prevents overfitting), Lasso can set some coefficients to zero (automatic feature selection), ElasticNet combines both. Quality measured with R² (1.0 = perfect predictions).
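A minimal numpy sketch of the ideas above: the Normal Equation, a Ridge variant, and R² (house-price numbers invented for illustration):

```python
import numpy as np

# Toy data: house size (m^2) vs price, roughly linear with noise
rng = np.random.default_rng(0)
X = rng.uniform(50, 200, size=(100, 1))            # one feature: size
y = 3.0 * X[:, 0] + 20.0 + rng.normal(0, 5, 100)   # price = 3*size + 20 + noise

# Add a bias column, then solve the Normal Equation: w = (X^T X)^-1 X^T y
Xb = np.hstack([np.ones((len(X), 1)), X])
w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)           # [intercept, slope]

# Ridge variant: add lambda * I to the Gram matrix to shrink coefficients
lam = 1.0
w_ridge = np.linalg.solve(Xb.T @ Xb + lam * np.eye(2), Xb.T @ y)

# R^2 = 1 - (residual sum of squares / total sum of squares)
pred = Xb @ w
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
```

Polynomial Regression reuses exactly this solver after appending squared/cubed columns to `Xb`.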

Polynomial Regression · Regression

Sometimes a straight line doesn't fit your data well because the relationship is curved (for example, a plant grows fast at first, then slows down). Polynomial Regression lets you fit curves instead of straight lines by adding squared or cubed terms to the equation.

When to use

  • Data shows clear non-linear but smooth trends
  • Single or few features (avoids combinatorial explosion)
  • You want a simple baseline before moving to trees/neural nets

✓ Strengths

  • Captures curves without complex models
  • Same linear algebra solver (fast)
  • Easy to understand and visualize

✗ Weaknesses

  • Overfits quickly with high degree
  • Feature explosion with many input dimensions
  • Extrapolation is catastrophic

Supervised Learning: Classification

Logistic Regression · Classification

Despite its name, Logistic Regression is used to classify things into categories (spam or not spam, approved or rejected). Instead of predicting a number, it outputs a probability between 0% and 100% that tells you how confident the model is about each category.

When to use

  • Binary or multi-class classification with roughly linear decision boundary
  • Need calibrated probability outputs (not just labels)
  • Baseline for NLP text classification, CTR prediction

✓ Strengths

  • Probabilistic output (well-calibrated)
  • Fast, scales to millions of samples
  • Interpretable feature weights
  • Robust with regularization (L1/L2)

✗ Weaknesses

  • Linear decision boundary only
  • Requires feature engineering for non-linear patterns
  • Poor with many irrelevant features (without L1)

Key concepts

Trained with Cross-Entropy loss (penalizes confident wrong predictions heavily). For multiple categories, uses Softmax to distribute probabilities across classes. Key metrics: Precision (of predicted positives, how many are correct), Recall (of actual positives, how many were found), F1 (balance of both).
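A minimal numpy sketch of the pieces above: sigmoid output, the cross-entropy gradient, and precision/recall (two made-up Gaussian blobs stand in for real data):

```python
import numpy as np

rng = np.random.default_rng(1)
# Two blobs: class 0 around (-2, -2), class 1 around (2, 2)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Batch gradient descent on the mean cross-entropy loss
w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(500):
    p = sigmoid(X @ w + b)            # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)   # gradient of cross-entropy w.r.t. w
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

pred = (sigmoid(X @ w + b) >= 0.5).astype(int)
tp = np.sum((pred == 1) & (y == 1))
precision = tp / np.sum(pred == 1)    # of predicted positives, how many correct
recall = tp / np.sum(y == 1)          # of actual positives, how many found
```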

Support Vector Machine (SVM) · Classification

Picture two groups of dots on a page (cats vs dogs based on weight and height). SVM finds the line that separates them with the widest possible gap. A clever math trick called a 'kernel' lets it handle cases where the groups aren't separable by a straight line, by bending the boundary into curves.

When to use

  • Small to medium datasets with clear margin of separation
  • High-dimensional data (text, genomics)
  • Binary classification with kernel trick for non-linearity

✓ Strengths

  • Effective in high-dimensional spaces
  • Memory efficient (uses support vectors only)
  • Kernel trick: RBF, polynomial, custom
  • Strong generalization with proper C and kernel

✗ Weaknesses

  • O(n²) to O(n³) training time, bad for large datasets
  • No native probability output (requires Platt scaling)
  • Kernel + C hyperparameter search is expensive
  • Not interpretable with non-linear kernels

k-Nearest Neighbors (KNN) · Classification / Regression

The simplest ML idea: to classify a new data point, just look at its k closest neighbors in the dataset and go with the majority vote. If 3 out of 5 nearest neighbors are cats, predict cat. There's no actual training phase; it simply memorizes all the data and compares at prediction time.

The intuition: projecting data onto a plane

Imagine each data point as a dot on a 2D map. If you have two features (say height and weight), each person becomes a point on that map. KNN literally measures the distance between your new point and every existing point, then looks at which class dominates among the k nearest ones. When you have more than 2 features (say 10 medical measurements), the concept is the same but in 10-dimensional space. We can't visualize 10D, but the math works the same way. Tools like PCA or t-SNE can project those 10D points back onto a 2D plane so you can visually check if clusters are well-separated.
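The distance-and-vote procedure can be sketched in a few lines (two made-up clusters stand in for real data):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    # Euclidean distance from the new point to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]        # indices of the k closest points
    votes = y_train[nearest]
    return np.bincount(votes).argmax()     # majority vote wins

rng = np.random.default_rng(2)
# Class 0 clustered around (0, 0), class 1 around (5, 5)
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)

# No training phase: the "model" is just the stored data
label = knn_predict(X_train, y_train, np.array([4.8, 5.2]), k=5)
```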

When to use

  • Small datasets, low-dimensional features
  • Non-parametric baseline (no assumptions about data shape)
  • Recommendation systems (item similarity in embedding space)

✓ Strengths

  • Zero training time
  • No assumptions on data distribution
  • Simple to implement and explain
  • Works for both classification and regression

✗ Weaknesses

  • Slow inference: O(n) per query (or O(log n) with KD-tree)
  • Curse of dimensionality: distances become meaningless in high-dim
  • Sensitive to feature scaling (always normalize first)
  • All data must fit in memory

Naive Bayes · Classification

Uses probability rules (Bayes' theorem) to classify data, especially text. For example, if an email contains "free" and "winner", what's the probability it's spam? Called 'naive' because it assumes each word contributes independently, which is rarely true in real life but works surprisingly well in practice.
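The spam example can be sketched as Multinomial NB in log space (tiny made-up word counts; Laplace smoothing avoids zero probabilities):

```python
import numpy as np

# Bag-of-words counts, columns = ["free", "winner", "meeting"]
X = np.array([[3, 2, 0],    # spam
              [2, 1, 0],    # spam
              [0, 0, 2],    # ham
              [0, 1, 3]])   # ham
y = np.array([1, 1, 0, 0])  # 1 = spam

classes = np.unique(y)
log_prior = np.log([np.mean(y == c) for c in classes])  # P(class)

# Per-class word probabilities with Laplace (add-one) smoothing
counts = np.array([X[y == c].sum(axis=0) for c in classes]) + 1.0
log_lik = np.log(counts / counts.sum(axis=1, keepdims=True))

def predict(x):
    # The "naive" step: sum independent per-word log-likelihoods
    scores = log_prior + log_lik @ x
    return classes[np.argmax(scores)]

label = predict(np.array([2, 1, 0]))  # email mentioning "free" and "winner"
```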

When to use

  • Text classification (spam, sentiment): Multinomial NB
  • Very fast baseline needed
  • Small training sets, many features

✓ Strengths

  • Extremely fast training and prediction
  • Works well with high-dimensional sparse data
  • Needs very little training data

✗ Weaknesses

  • Independence assumption rarely holds
  • Poor probability calibration
  • Cannot learn feature interactions

Tree-based & Ensemble Methods

Decision Tree · Classification / Regression

Works exactly like a flowchart of yes/no questions: "Is age > 30? Yes. Is income > 50k? No. Then predict: won't buy." The algorithm figures out which questions to ask and in what order to get the most accurate predictions. Easy to visualize and explain to anyone, even non-technical people.

When to use

  • Explainability is paramount (medical, legal)
  • Quick EDA to find important feature splits
  • Mixed feature types (categorical + continuous)

✓ Strengths

  • Fully interpretable (visualize the tree)
  • Handles mixed types, no scaling needed
  • Captures non-linear interactions

✗ Weaknesses

  • High variance: small data changes → different tree
  • Overfits easily without pruning
  • Piecewise-constant predictions (axis-aligned splits)

Random Forest · Ensemble

Instead of relying on one decision tree (which can be unreliable), Random Forest builds hundreds of trees, each trained on a random portion of the data. To make a prediction, all trees vote and the majority wins. This 'wisdom of the crowd' approach is much more accurate and stable than any single tree alone.

When to use

  • Tabular data: strong default before trying boosting
  • Need reliable feature importance estimates
  • Can afford slightly more compute than a single tree

✓ Strengths

  • Low variance, hard to overfit
  • Parallelizable (embarrassingly parallel)
  • Built-in OOB error estimate
  • Handles missing values (proximity-based)

✗ Weaknesses

  • Slower inference (100s of trees)
  • Less accurate than gradient boosting on most benchmarks
  • Large model size in memory

XGBoost / LightGBM / CatBoost · Ensemble, SOTA Tabular

Builds trees one after another, where each new tree specifically focuses on correcting the mistakes of the previous ones. Think of it like a team of students where each one studies what the previous student got wrong. This is the go-to method for tabular data (spreadsheets, databases) and dominates competitions and real-world production systems.

When to use

  • Any tabular / structured data problem: this is your first pick
  • Kaggle competitions, CTR prediction, fraud detection, ranking
  • When you need the best accuracy on non-image/non-text data

✓ Strengths

  • Best-in-class on tabular data
  • Built-in regularization (L1/L2, max_depth, min_child_weight)
  • Handles missing values natively
  • GPU training available, fast inference
  • SHAP values for explainability

✗ Weaknesses

  • Many hyperparameters to tune
  • Sequential boosting (less parallelizable than RF)
  • Can overfit on small, noisy datasets
  • Requires careful early stopping

Key differences

XGBoost: exact/histogram split, most battle-tested. LightGBM: leaf-wise growth, faster on large data, GOSS sampling. CatBoost: ordered boosting, best native categorical handling.

Unsupervised & Representation Learning

K-Means · Clustering

Groups similar data points together without any labels. You tell it how many groups (k) you want, and it figures out how to split the data. For example, given customer purchase data, it could automatically discover "budget shoppers", "premium buyers", and "occasional visitors" without you defining those categories.
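The assign/update loop at the heart of K-Means can be sketched in a few lines (three made-up customer blobs; the init is simplified for determinism, real code uses k-means++):

```python
import numpy as np

rng = np.random.default_rng(3)
# Three well-separated "customer segments" as 2-D blobs
X = np.vstack([rng.normal([0, 0], 0.5, (60, 2)),
               rng.normal([5, 5], 0.5, (60, 2)),
               rng.normal([0, 5], 0.5, (60, 2))])

k = 3
centroids = X[[0, 60, 120]].copy()   # one seed point per blob (simplified init)
for _ in range(10):
    # Assignment step: each point joins its nearest centroid
    dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each centroid moves to the mean of its points
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
```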

When to use

  • Customer segmentation, image compression
  • Known or estimable number of clusters
  • Roughly spherical, similarly-sized clusters

✓ Strengths

  • Simple, fast, scales to large datasets
  • Easy to interpret centroids
  • Mini-batch variant for streaming data

✗ Weaknesses

  • Must specify k in advance
  • Assumes convex, equally-sized clusters
  • Sensitive to initialization (use k-means++)
  • Fails on non-globular shapes

DBSCAN · Clustering

Imagine dropping ink on paper and watching it spread: wherever there's a dense concentration of points packed together, DBSCAN calls that a cluster. Points sitting alone far away are flagged as outliers. Unlike K-Means, you don't need to specify the number of groups, and it can find clusters of any shape (circles, crescents, blobs).

When to use

  • Unknown number of clusters
  • Clusters have irregular shapes
  • Need automatic outlier/noise detection

✓ Strengths

  • No need to specify k
  • Finds clusters of any shape
  • Built-in noise/outlier detection

✗ Weaknesses

  • Sensitive to eps and min_samples parameters
  • Struggles with varying-density clusters
  • O(n²) without spatial indexing

PCA / UMAP / t-SNE · Dim. Reduction

When your data has hundreds of features (columns), it's impossible to visualize. These techniques compress all those dimensions into just 2 or 3, while keeping similar points close together. The result is a 2D scatter plot where you can actually see clusters, patterns, and outliers at a glance. Essential for understanding what's really going on in your data.
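A minimal PCA-via-SVD sketch: 10-D data with a hidden 2-D structure gets projected down to 2-D for plotting (data simulated for illustration; UMAP and t-SNE need dedicated libraries):

```python
import numpy as np

rng = np.random.default_rng(4)
# 200 points in 10-D, but the real signal lives in only 2 dimensions
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + rng.normal(0, 0.05, (200, 10))  # plus tiny noise

# PCA via SVD: center the data, decompose, project onto top-2 components
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X2d = Xc @ Vt[:2].T                   # 2-D coordinates for a scatter plot

explained = (S ** 2) / np.sum(S ** 2) # variance ratio per component
```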

When to use

  • PCA: preprocessing, denoising, feature compression
  • UMAP/t-SNE: 2D/3D visualization of clusters in embedding space
  • Debugging embeddings (are clusters well-separated?)

✓ Strengths

  • PCA: fast, deterministic, preserves variance
  • UMAP: preserves global + local structure
  • Essential for high-dim data debugging

✗ Weaknesses

  • PCA: linear only, loses non-linear structure
  • t-SNE: slow, non-deterministic, crowding
  • UMAP: hyperparameter-sensitive (n_neighbors, min_dist)

Deep Learning (neural networks)

Deep learning uses neural networks with many stacked layers to automatically learn patterns from raw data. Instead of hand-crafting features (like in classical ML), you feed the raw data (pixels, text, audio) and the network discovers the relevant features by itself. Different architectures are designed for different types of data: CNNs for images, RNNs for sequences, and the Transformer, which now dominates nearly everything.

Foundations & Core Architectures

Multi-Layer Perceptron (MLP) · Feedforward

The classic neural network: layers of interconnected neurons that learn to map inputs to outputs. Think of it as a chain of simple math operations: each neuron takes numbers in, multiplies them by weights, adds them up, and passes the result through a function. Stack enough of these, and the network can learn incredibly complex patterns. Every modern deep learning model (CNNs, Transformers, etc.) has MLP components inside.

How a neural network actually learns

Training a neural network is a loop of three steps, repeated thousands of times:

  • Forward pass: feed an input (e.g., an image) through the network. Each layer multiplies the input by its weights, adds a bias, and applies an activation function. The output is a prediction (e.g., "80% cat, 20% dog").
  • Loss calculation: compare the prediction to the correct answer. The difference is the "loss" (error). Common loss functions: Cross-Entropy for classification, MSE for regression.
  • Backpropagation + optimizer: calculate how much each weight contributed to the error (using calculus/chain rule), then adjust every weight slightly to reduce the error. The optimizer (Adam, SGD) decides how much to adjust.

After thousands of these loops (one full pass over the training data is called an epoch), the weights converge to values that make good predictions on new, unseen data.
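The three-step loop above can be sketched end-to-end on XOR, a tiny problem that genuinely needs a hidden layer (layer sizes, learning rate, and iteration count are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
# XOR: the classic problem a purely linear model cannot solve
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)   # hidden layer: 8 neurons
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)   # output layer
lr = 1.0

for _ in range(5000):
    # 1. Forward pass
    h = np.tanh(X @ W1 + b1)                  # hidden activations
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))      # output probability
    # 2. Loss gradient: for sigmoid + cross-entropy, dL/dlogit = p - y
    dz2 = (p - y) / len(X)
    # 3. Backpropagation (chain rule), then plain SGD updates
    dz1 = (dz2 @ W2.T) * (1 - h ** 2)         # tanh'(z) = 1 - tanh(z)^2
    W2 -= lr * (h.T @ dz2); b2 -= lr * dz2.sum(axis=0)
    W1 -= lr * (X.T @ dz1); b1 -= lr * dz1.sum(axis=0)

# Final predictions with the trained weights
h = np.tanh(X @ W1 + b1)
pred = (1 / (1 + np.exp(-(h @ W2 + b2))) >= 0.5).astype(float)
```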

Key concepts you must know

  • Activation function (ReLU, GELU, Sigmoid): introduces non-linearity. Without it, stacking layers would be pointless because a chain of linear operations is just one big linear operation.
  • Learning rate: how big each weight adjustment step is. Too high = unstable, too low = takes forever. This is the most important hyperparameter.
  • Batch size: how many examples the network sees before updating weights. Larger = more stable gradients but needs more memory.
  • Dropout: randomly disables neurons during training (e.g., 20%) to prevent memorization (overfitting). At inference, all neurons are active.
  • Batch Normalization: normalizes values between layers, making training faster and more stable.

When to use

  • As the building block inside larger architectures (classifier head in BERT, FFN layers in Transformers)
  • Tabular data where tree-based methods underperform (rare, but happens with very large datasets)
  • Simple function approximation tasks

✓ Strengths

  • Universal approximation: can approximate any continuous function (given enough neurons)
  • Flexible architecture, easy to implement
  • GPU-parallelizable (matrix multiplications)

✗ Weaknesses

  • No built-in understanding of spatial patterns (images) or sequences (text)
  • Prone to overfitting on small datasets
  • Requires careful initialization, normalization, and regularization

CNN / ConvNet · Images / Spatial

A neural network specifically designed for images. Instead of connecting every neuron to every pixel (which would be millions of connections), CNNs use small sliding filters (e.g., 3x3 pixels) that scan across the image. The first layers detect simple patterns like edges and corners. Deeper layers combine those into textures, shapes, and finally recognizable objects like faces or cars. This hierarchical feature learning is what makes CNNs so powerful for anything visual.

How convolution works (step by step)

Imagine a tiny 3x3 grid of numbers (the "filter" or "kernel") sliding across your image pixel by pixel. At each position, it multiplies the 9 overlapping pixels by the filter values and sums the result into a single number. This produces a new, smaller image called a "feature map" that highlights specific patterns the filter is looking for.

  • Conv layer: applies many filters (32, 64, 128...) in parallel, each learning to detect a different pattern
  • Pooling layer (MaxPool): shrinks the feature map by keeping only the strongest signal in each small region, making the network robust to small shifts
  • Stride: how many pixels the filter jumps between positions. Stride 2 = halves the output size
  • Padding: adds zeros around the edge so the output stays the same size as the input
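The sliding-window computation can be sketched directly (a naive double loop, not the optimized kernels real frameworks use; the Sobel filter here is a classic hand-crafted edge detector used for illustration):

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    # Slide the kernel over the image; each position yields one output value
    if padding:
        image = np.pad(image, padding)
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)  # multiply overlap, sum to one number
    return out

# A vertical-edge filter on an image that is dark on the left, bright on the right
img = np.zeros((6, 6)); img[:, 3:] = 1.0
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
fmap = conv2d(img, sobel_x)   # feature map lights up where the edge is
```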

When to use

  • Image classification, object detection, segmentation
  • Any grid-structured data (spectrograms, satellite imagery, heatmaps)
  • When you need fast inference on edge devices (MobileNet, EfficientNet-Lite)

✓ Strengths

  • Parameter sharing: the same filter works everywhere in the image (translation equivariance)
  • Hierarchical: edges → textures → parts → objects
  • Very fast inference, mobile/edge-friendly variants exist
  • Decades of proven architectures and pretrained weights available

✗ Weaknesses

  • Limited receptive field: each neuron only sees a local patch
  • Needs very deep stacking for global context understanding
  • Being surpassed by Vision Transformers (ViT) on many benchmarks

Key architectures (evolution)

LeNet (1998): the original CNN, digit recognition. AlexNet (2012): won ImageNet, launched the deep learning revolution. ResNet (2015): skip connections enabling 100+ layer networks. EfficientNet (2019): smartly scales width, depth, and resolution together. ConvNeXt (2022): modernized CNN that competes with Vision Transformers by borrowing Transformer design ideas.

RNN / LSTM / GRU · Sequences

Neural networks with memory, designed for sequential data (text, time series, audio). They process inputs one step at a time, maintaining a hidden state that acts as a "memory" of what came before. When reading a sentence, each word updates the memory, so by the end, the network has a summary of the entire sequence. Largely replaced by Transformers for NLP, but still relevant for streaming and edge applications.

The vanishing gradient problem (and how LSTM solves it)

A basic RNN struggles with long sequences: as information passes through many time steps, the gradients used for learning become smaller and smaller (vanish), so the network "forgets" early inputs. LSTM (Long Short-Term Memory) solves this with a clever gating mechanism:

  • Forget gate: decides what to throw away from memory ("the conversation topic changed, forget the old subject")
  • Input gate: decides what new information to store ("this word is important, remember it")
  • Output gate: decides what to output based on the current memory

GRU (Gated Recurrent Unit) is a simplified version with only two gates (reset + update), making it faster to train with similar performance.
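A single LSTM cell step with the three gates can be sketched in numpy (the "one big matmul for all gates" weight layout is a common convention; sizes and values here are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    # One time step: all four gate pre-activations in a single matmul
    z = np.concatenate([x, h]) @ W + b
    n = len(h)
    f = sigmoid(z[:n])         # forget gate: what to erase from memory
    i = sigmoid(z[n:2 * n])    # input gate: what new info to store
    o = sigmoid(z[2 * n:3 * n])# output gate: what to expose
    g = np.tanh(z[3 * n:])     # candidate memory content
    c = f * c + i * g          # update the cell state (the "memory")
    h = o * np.tanh(c)         # new hidden state
    return h, c

rng = np.random.default_rng(6)
hidden, d_in = 4, 3
W = rng.normal(0, 0.1, (d_in + hidden, 4 * hidden))
b = np.zeros(4 * hidden)

h, c = np.zeros(hidden), np.zeros(hidden)
for x in rng.normal(size=(5, d_in)):   # process a sequence of 5 steps
    h, c = lstm_step(x, h, c, W, b)    # h carries a summary of the past
```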

When to use (today)

  • Time series forecasting on short sequences (still competitive)
  • Streaming / online learning where you process data one step at a time
  • Edge devices where model size must be tiny (LSTMs are small)
  • Legacy systems that already use RNNs and work well enough

✓ Strengths

  • Natural fit for variable-length sequences
  • Very small parameter count compared to Transformers
  • Processes data step-by-step (natural for streaming)
  • Well-understood training dynamics and theory

✗ Weaknesses

  • Sequential processing: cannot parallelize across time steps (slow training)
  • Struggles with very long sequences even with LSTM gates
  • Largely superseded by Transformers for NLP, speech, and most sequence tasks

Transformers & Generative Models

Transformer · Foundation, SOTA

The architecture behind ChatGPT, BERT, Stable Diffusion, and virtually all modern AI. Introduced in 2017 with the paper "Attention Is All You Need," the Transformer replaced RNNs by allowing every element in the input to interact with every other element simultaneously through a mechanism called "self-attention." This parallelism made training much faster and enabled models to scale to billions of parameters.

How self-attention works (the core idea)

Imagine reading the sentence: "The cat sat on the mat because it was tired." What does "it" refer to? You instantly know it's "the cat." Self-attention lets the model learn these connections:

  • Each word is transformed into three vectors: Query (what am I looking for?), Key (what do I contain?), Value (what information should I pass along?)
  • For each word, compute how much attention to pay to every other word by comparing its Query with all Keys (dot product)
  • The result: each word gets a weighted mix of all other words' Values, creating a context-aware representation

Multi-head attention: run several attention patterns in parallel (e.g., 12 heads). One head might focus on grammar (subject-verb agreement), another on semantics (word meaning), another on position (nearby words). The outputs are concatenated and projected.
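The Query/Key/Value computation can be sketched for a single attention head (random weights stand in for learned ones; multi-head attention runs several of these in parallel and concatenates the outputs):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Project each token into Query, Key, Value vectors
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    # Score every token's Query against every token's Key (scaled dot product)
    scores = Q @ K.T / np.sqrt(d)                  # (n_tokens, n_tokens)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax rows
    # Each token becomes a weighted mix of all tokens' Values
    return weights @ V, weights

rng = np.random.default_rng(7)
n_tokens, d_model = 6, 16
X = rng.normal(size=(n_tokens, d_model))  # one "sentence" of token embeddings
Wq, Wk, Wv = (rng.normal(0, 0.1, (d_model, d_model)) for _ in range(3))

out, attn = self_attention(X, Wq, Wk, Wv)  # context-aware representations
```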

Transformer block structure

A single Transformer block repeats this pattern: Multi-Head Self-Attention → Add & Layer Norm (residual connection) → Feed-Forward Network (two-layer MLP) → Add & Layer Norm. A model like GPT-3 stacks 96 of these blocks. The residual connections (skip connections, like ResNet) are critical: they allow gradients to flow directly through the network, enabling very deep models.

Three variants

  • Encoder-only (BERT, RoBERTa): reads the entire input at once (bidirectional). Best for understanding tasks: classification, NER, search. Cannot generate text.
  • Decoder-only (GPT, Llama, Phi, Qwen): reads left-to-right, predicts the next token. This is the architecture of all modern LLMs. Best for text generation, chatbots, code.
  • Encoder-Decoder (T5, BART, Whisper): encoder reads the input, decoder generates the output. Best for translation, summarization, speech-to-text.

✓ Strengths

  • Global attention: every token sees every other token from layer 1
  • Fully parallelizable training (unlike RNNs)
  • Scales beautifully with more data and compute (scaling laws)
  • Transfer learning: pretrain once, fine-tune for any task
  • Dominates NLP, vision, audio, multimodal, code, science

✗ Weaknesses

  • O(n²) attention complexity: cost grows quadratically with context length
  • Pretraining requires massive compute (millions of GPU-hours)
  • KV cache memory grows linearly with sequence length during generation
  • Needs large datasets to outperform simpler models on small tasks

GANs / VAEs / Diffusion · Generative

Three families of models that create new content (images, music, video, 3D). Each takes a fundamentally different approach to generation: GANs pit two networks against each other in a competition, VAEs learn a compressed representation and sample from it, and Diffusion models learn to gradually remove noise from a random image until a clean result emerges. Diffusion is the current state-of-the-art (Stable Diffusion, DALL-E 3, Midjourney).

How each approach works

  • GAN (Generative Adversarial Network): two networks compete. The Generator creates fake images, the Discriminator tries to tell real from fake. Over time, the Generator gets so good that the Discriminator can't tell the difference. Think of it as a counterfeiter vs. a detective getting better together.
  • VAE (Variational Autoencoder): compresses data into a small "latent space" (like a summary), then learns to reconstruct the original from that summary. By sampling random points in that latent space, you can generate new, never-seen content. The latent space is smooth: nearby points produce similar outputs.
  • Diffusion: starts with a clean image, gradually adds random noise until it's pure static, then trains a network to reverse the process: predict and remove the noise step by step. At generation time, start from random noise and denoise it into a coherent image. Takes many steps (20-50) but produces the highest quality results.
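The forward (noising) half of diffusion has a closed form that is easy to sketch (the linear beta schedule below is a common DDPM-style choice, not something prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(8)
T = 1000
betas = np.linspace(1e-4, 0.02, T)     # per-step noise amounts (a typical schedule)
alpha_bar = np.cumprod(1 - betas)      # cumulative fraction of signal kept

x0 = rng.uniform(-1, 1, size=(8, 8))   # a stand-in "clean image"

def q_sample(x0, t, eps):
    # Closed form: x_t is a mix of the original image and pure noise
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

eps = rng.normal(size=x0.shape)
x_early = q_sample(x0, 10, eps)        # still mostly signal
x_late = q_sample(x0, T - 1, eps)      # almost pure static
# The network learns to predict eps from (x_t, t); generation runs this in reverse
```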

When to use

  • Diffusion (SOTA): image generation, inpainting, super-resolution, video generation (Stable Diffusion, DALL-E, Sora)
  • VAE: learned latent representations, anomaly detection, data augmentation with controlled variation
  • GAN: style transfer, real-time face generation, data augmentation (less popular since diffusion)

✓ Strengths

  • Diffusion: highest quality generation, stable training, text-guided
  • VAE: principled latent space, smooth interpolation, fast sampling
  • GAN: sharp outputs, very fast inference (single forward pass)

✗ Weaknesses

  • Diffusion: slow generation (many denoising steps), high compute
  • GAN: mode collapse (generator only produces a few variations), training instability
  • All: expensive to train, hard to objectively evaluate output quality

Graph Neural Networks (GNN) · Graphs

Neural networks designed for data that's naturally a graph: nodes connected by edges. Social networks (people connected by friendships), molecules (atoms connected by bonds), maps (intersections connected by roads). Each node learns a representation by collecting and combining information from its neighbors, propagating knowledge across the graph.

How message passing works

In each layer, every node: (1) collects messages from its neighbors (their current representations), (2) aggregates them (sum, mean, or attention-weighted), (3) combines the result with its own representation to produce an updated one. After several layers, each node's representation encodes information from its extended neighborhood.
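The collect/aggregate/combine steps can be sketched with mean aggregation (a GCN-style layer; graph, sizes, and weights are invented for illustration):

```python
import numpy as np

# A tiny chain graph: 4 nodes, edges 0-1, 1-2, 2-3 (adjacency matrix)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(4)           # self-loops: a node keeps its own info
deg = A_hat.sum(axis=1)
norm = A_hat / deg[:, None]     # mean aggregation over neighbors + self

rng = np.random.default_rng(9)
H = rng.normal(size=(4, 8))     # initial node features
W1 = rng.normal(0, 0.5, (8, 8))
W2 = rng.normal(0, 0.5, (8, 8))

# Two layers of message passing: aggregate, transform, apply non-linearity
H = np.maximum(0, norm @ H @ W1)   # after layer 1: 1-hop information
H = np.maximum(0, norm @ H @ W2)   # after layer 2: 2-hop information
```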

When to use

  • Social networks (link prediction, community detection)
  • Molecule property prediction (drug discovery)
  • Knowledge graphs, recommendation systems
  • Any relational/graph-structured data

✓ Strengths

  • Natively handles relational data with arbitrary topology
  • Permutation invariant (node order doesn't matter)
  • Well-established variants: GCN, GAT (attention), GraphSAGE (sampling)

✗ Weaknesses

  • Over-smoothing: too many layers makes all nodes look the same
  • Scalability to very large graphs requires sampling tricks
  • Less mature tooling and ecosystem than vision/NLP

Computer Vision (SOTA overview)

Computer Vision teaches machines to understand images and video. Modern systems combine deep neural networks with massive pretraining to achieve human-level (or better) performance on many visual tasks. The field has evolved from hand-crafted features (HOG, SIFT) to CNNs to Transformers, and now to foundation models that work zero-shot on tasks they've never been explicitly trained for.

Classification Backbones

ResNet / ConvNeXt · Backbone CNN

ResNet introduced "skip connections" (shortcuts that let information bypass layers), solving the problem of training very deep networks (100+ layers). Without skip connections, deeper networks actually performed worse because gradients disappeared. ConvNeXt is a 2022 modernization: same CNN principles but borrowing design ideas from Transformers (larger kernels, fewer activations, LayerNorm), making it competitive with ViT.

Why these matter

These are backbone networks: the feature extraction engine used by detection, segmentation, and other downstream tasks. When you use YOLO or U-Net, the first half of the model is typically a ResNet or similar backbone that converts raw pixels into meaningful features.

✓ Strengths

  • Proven, extremely well-studied, battle-tested in production
  • Efficient inference, mobile-friendly variants (MobileNet, EfficientNet-Lite)
  • Excellent transfer learning backbone for any vision task
  • Thousands of pretrained checkpoints available on HuggingFace

✗ Weaknesses

  • Limited global context: each neuron only sees a local patch
  • Architecture design requires expertise (depth, width, kernel sizes)
  • Surpassed by ViT on many benchmarks when data is abundant

Vision Transformer (ViT) · SOTA Classification

Takes the Transformer architecture from NLP and applies it to images. The trick: cut the image into small patches (e.g., 16x16 pixels), flatten each patch into a vector, and treat them like words in a sentence. Then self-attention lets every patch interact with every other patch from the very first layer, giving the model global understanding of the whole image instantly.

How it processes an image

  • Split the image into a grid of patches (e.g., 224x224 image → 196 patches of 16x16)
  • Each patch is linearly projected into an embedding vector (like word embeddings in NLP)
  • Add positional embeddings so the model knows where each patch is
  • Pass through standard Transformer encoder blocks (self-attention + FFN)
  • A special [CLS] token collects global information for the final classification
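The patchify-and-embed steps can be sketched with plain reshapes (patch size and image size follow the classic ViT-Base setup; the embedding weights are random placeholders for learned ones):

```python
import numpy as np

rng = np.random.default_rng(10)
img = rng.uniform(size=(224, 224, 3))   # one RGB image
P = 16                                  # patch size

# Cut the image into a grid of 16x16 patches and flatten each one
h, w, c = img.shape
patches = img.reshape(h // P, P, w // P, P, c)   # (14, 16, 14, 16, 3)
patches = patches.transpose(0, 2, 1, 3, 4)       # (14, 14, 16, 16, 3)
patches = patches.reshape(-1, P * P * c)         # (196, 768): 196 "tokens"

# Linear projection into the model dimension (the patch embedding)
d_model = 128
W_embed = rng.normal(0, 0.02, (P * P * c, d_model))
tokens = patches @ W_embed                       # ready for Transformer blocks
```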

✓ Strengths

  • Global attention from the very first layer (sees the whole image)
  • Scales extremely well with more data and compute
  • Unifies vision and NLP architectures (shared tooling)
  • Variants: DeiT (data-efficient), Swin (hierarchical windows), DINOv2 (self-supervised)

✗ Weaknesses

  • Needs large-scale pretraining to outperform CNNs (ImageNet-21k+)
  • Higher compute cost than CNN at inference
  • No built-in inductive bias for locality (learns it from data instead)

Object Detection

YOLO Family (v5 to v11+) · Real-time Detection

YOLO (You Only Look Once) detects and locates objects in images in a single forward pass, fast enough for real-time video (30-300+ FPS). Unlike older two-stage detectors (Faster R-CNN) that first propose regions then classify them, YOLO predicts bounding boxes and classes simultaneously across the whole image. The most popular and battle-tested choice for production object detection.

How detection works

The image is divided into a grid. For each cell, the model predicts: (1) bounding box coordinates (x, y, width, height), (2) confidence score (is there an object here?), and (3) class probabilities (what object is it?). Non-Maximum Suppression (NMS) removes duplicate detections. Modern versions (YOLOv8+) are anchor-free, simplifying the pipeline.
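The duplicate-removal step (NMS) can be sketched in numpy (boxes in (x1, y1, x2, y2) format; the coordinates and scores are invented for illustration):

```python
import numpy as np

def iou(box, boxes):
    # Intersection-over-union between one box and an array of boxes
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    # Keep the highest-scoring box, drop heavy overlaps, repeat
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order):
        best = order[0]
        keep.append(best)
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) < iou_thresh]
    return keep

# Three detections of the same object plus one separate object
boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52],
                  [11, 9, 49, 51], [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.8, 0.75, 0.6])
kept = nms(boxes, scores)   # duplicates of the first object are suppressed
```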

✓ Strengths

  • Real-time: 30-300+ FPS depending on model size and GPU
  • End-to-end training, simple deployment (Ultralytics CLI)
  • Excellent accuracy/speed tradeoff for production
  • Also supports segmentation, pose estimation, tracking (YOLOv8+)

✗ Weaknesses

  • Struggles with very small objects in dense, cluttered scenes
  • Many versions from different authors create compatibility confusion
  • Less accurate than two-stage detectors on certain benchmarks

DETR / RT-DETR · Transformer Detection

Applies the Transformer to object detection with a clean, elegant design: no anchors, no NMS, no hand-crafted rules. The model uses learned "object queries" that each attend to different parts of the image and directly output detections. RT-DETR (Real-Time DETR) now matches YOLO speed with better accuracy on some benchmarks, making it a serious production alternative.

✓ Strengths

  • Elegant end-to-end design, no post-processing hacks
  • Global reasoning about object relationships in the scene
  • RT-DETR: real-time speed with Transformer accuracy
  • Easier to extend to new tasks (panoptic, action detection)

✗ Weaknesses

  • Slow convergence during training (needs many epochs)
  • Struggles with many small objects in one image
  • Higher memory usage than YOLO for equivalent throughput

Segmentation

U-Net / SegFormer · Semantic Segmentation

Assigns a class label to every single pixel in an image (this pixel is "road", this one is "car", this one is "sky"). U-Net has an encoder-decoder architecture with skip connections that preserve fine details, making it excellent on small datasets like medical scans. SegFormer brings Transformer efficiency to dense pixel-level predictions with multi-scale features.

Types of segmentation

  • Semantic: label every pixel by class (all "car" pixels get the same label, even if there are 5 cars)
  • Instance: distinguish individual objects (car #1, car #2, car #3 get different labels)
  • Panoptic: semantic + instance combined (both "stuff" like sky and "things" like cars)

✓ Strengths

  • U-Net: works great on small datasets (medical imaging, satellite)
  • SegFormer: lightweight Transformer, no positional encoding needed
  • Both support transfer learning from pretrained backbones

✗ Weaknesses

  • Dense prediction is compute-heavy (every pixel needs a label)
  • Annotation is extremely expensive (pixel-level labeling)
  • U-Net: struggles with varying input resolutions
SAM / Mask2Former Foundation Segmentation

SAM (Segment Anything Model) by Meta is a foundation model for segmentation: click a point, draw a box, or type text on any image and it segments the object with zero training. It was trained on 11 million images with 1 billion masks. Mask2Former unifies all segmentation types (semantic, instance, panoptic) into a single architecture.

✓ Strengths

  • SAM: zero-shot segmentation with prompts (point, box, text)
  • SAM 2: extends to video with temporal tracking
  • Mask2Former: single model for all segmentation types
  • Both work out-of-the-box on domains they've never seen

✗ Weaknesses

  • SAM: heavy ViT backbone, slow without optimization (EfficientSAM helps)
  • Mask2Former: complex training setup, slower than specialized models
  • Both require significant GPU memory for inference

Vision-Language Models

CLIP / SigLIP Vision-Language

Trains an image encoder and a text encoder together so they produce similar vectors for matching image-text pairs. The result: images and text live in the same embedding space. You can search images with text ("a red sports car at sunset"), classify images into any category without training data, or use it as the "eyes" for a multimodal LLM like LLaVA or GPT-4V.

How contrastive learning works

During training, CLIP sees millions of (image, caption) pairs from the internet. For each batch, it computes image embeddings and text embeddings, then pushes matching pairs close together in vector space and non-matching pairs apart. After training, the model can compare any image to any text description by measuring the distance between their vectors.
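The push/pull objective above can be sketched with NumPy. The embeddings below are made-up stand-ins for the encoder outputs (real CLIP embeddings come from the image and text towers), and 0.07 is a commonly used temperature value:

```python
import numpy as np

# Toy batch: 3 "image" embeddings and 3 matching "caption" embeddings.
rng = np.random.default_rng(0)
img = rng.normal(size=(3, 8))
txt = img + 0.1 * rng.normal(size=(3, 8))    # caption i roughly matches image i

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

img, txt = l2_normalize(img), l2_normalize(txt)
logits = img @ txt.T / 0.07                   # cosine similarities / temperature

def cross_entropy(logits, labels):
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(len(labels)), labels]).mean()

# Symmetric InfoNCE: the correct caption for image i is entry i (the diagonal)
labels = np.arange(3)
loss = (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2
print(round(float(loss), 4))  # small, since matching pairs are most similar
```

Training updates both encoders to drive this loss down, which is exactly what pulls matching pairs together and pushes non-matching pairs apart.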

✓ Strengths

  • Zero-shot classification: describe new classes in text, no labeled data needed
  • Cross-modal retrieval: search images with text or vice versa
  • Foundation for multimodal LLMs (LLaVA, MiniCPM-V, InternVL)
  • SigLIP: improved training with sigmoid loss, better on smaller batches

✗ Weaknesses

  • Trained on noisy web-scraped data (inherits biases)
  • Struggles with fine-grained spatial reasoning and counting
  • Large model, significant inference cost

How to Measure CV Model Performance

Evaluation Metrics Essential

Choosing the right metric is as important as choosing the right model. A metric tells you what "good" means for your specific task. Using the wrong metric can make a terrible model look great (e.g., 99% accuracy on a dataset where 99% of samples are one class).

Classification metrics

  • Top-1 / Top-5 Accuracy: is the correct label the model's #1 prediction (Top-1) or at least in its top 5 guesses (Top-5)? ImageNet Top-1 is the standard benchmark. Simple but misleading on imbalanced data.
  • Precision: of all samples the model predicted as positive, how many were actually positive? High precision = few false alarms. Critical when false positives are costly (spam filter, fraud detection).
  • Recall (Sensitivity): of all actual positives, how many did the model catch? High recall = few missed detections. Critical when false negatives are costly (cancer screening, safety systems).
  • F1 Score: harmonic mean of precision and recall. Balances both. Use when you need a single number and classes are imbalanced.
  • AUC-ROC: measures how well the model separates classes across all confidence thresholds. 1.0 = perfect separation, 0.5 = random. Threshold-independent, great for comparing models.
  • Confusion Matrix: the full picture. Shows exactly which classes are confused with which. Always look at this first before choosing a single metric.

Detection metrics

  • IoU (Intersection over Union): measures how well a predicted bounding box overlaps with the ground truth. IoU = area of overlap / area of union. An IoU > 0.5 is typically considered a correct detection.
  • mAP@50: mean Average Precision at 50% IoU threshold. Lenient: a box with 50% overlap counts as correct. Standard for PASCAL VOC.
  • mAP@50:95: averages mAP across IoU thresholds from 50% to 95% in steps of 5%. Much stricter, rewards tight bounding boxes. Standard for COCO benchmark.
  • AP per class: some classes are much harder to detect than others. Always check per-class AP, not just the mean.
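IoU is a one-liner worth internalizing. A minimal sketch for axis-aligned `(x1, y1, x2, y2)` boxes:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union else 0.0

# Ground truth vs a prediction shifted by 2px on each axis:
print(round(iou((0, 0, 10, 10), (2, 2, 12, 12)), 3))  # → 0.471
```

A 2px shift on a 10px box already drops IoU to ~0.47 — just under the usual 0.5 threshold, which is why mAP@50:95 rewards tight boxes so strongly.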

Segmentation metrics

  • mIoU (mean Intersection over Union): for each class, measures pixel-level overlap between predicted mask and ground truth, then averages across classes. The standard segmentation metric.
  • Dice Coefficient (F1 for pixels): 2 × overlap / (predicted + ground truth). Equivalent to pixel-level F1. Very common in medical imaging because it handles class imbalance well (most pixels are background).
  • Pixel Accuracy: % of pixels correctly classified. Misleading when classes are imbalanced (a model that predicts "background" everywhere gets 95%+ accuracy).
  • Boundary IoU: measures IoU only near object boundaries, rewarding models that get the edges right (important for precise cutouts).
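The Dice coefficient on binary masks is a short computation. This toy 4×4 example also shows why pixel accuracy misleads: a predictor that outputs only background still looks fine on accuracy but scores ~0 on Dice.

```python
import numpy as np

def dice(pred, target, eps=1e-7):
    """Dice coefficient between two binary masks (1 = object pixel)."""
    pred, target = np.asarray(pred).astype(bool), np.asarray(target).astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2 * inter + eps) / (pred.sum() + target.sum() + eps)

target = np.zeros((4, 4), int)
target[1:3, 1:3] = 1                 # 4 object pixels out of 16
pred = target.copy()
pred[2, 2] = 0                       # prediction misses one object pixel
print(round(float(dice(pred, target)), 3))   # → 0.857

all_bg = np.zeros_like(target)
# pixel accuracy of all_bg is 12/16 = 75%, but Dice exposes the failure:
print(round(float(dice(all_bg, target)), 3))  # → 0.0
```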

Other task-specific metrics

  • OCR: CER / WER: Character Error Rate and Word Error Rate measure edit distance between predicted and ground truth text.
  • Tracking: MOTA / IDF1: Multi-Object Tracking Accuracy and Identity F1 measure how well objects are tracked across video frames without ID switches.
  • Speed: FPS, latency: report p50 (median) and p95 (tail) latency. For real-time applications, throughput at target latency matters more than peak FPS.
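CER and WER both reduce to Levenshtein (edit) distance, at character and word granularity respectively. A minimal sketch with a hypothetical OCR output:

```python
def levenshtein(a, b):
    """Minimum number of edits (insert/delete/substitute) turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    return levenshtein(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    return levenshtein(reference.split(), hypothesis.split()) / len(reference.split())

ref, hyp = "invoice total 120", "invoise total 12O"   # two character errors
print(round(cer(ref, hyp), 3), round(wer(ref, hyp), 3))  # → 0.118 0.667
```

Note how two single-character mistakes yield a mild CER but a harsh WER — each broken character corrupts its whole word.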
Loss Functions Training

The loss function defines what the model optimizes during training. It's the mathematical expression of "how wrong is this prediction?" Different tasks need different loss functions, and choosing the right one directly impacts model quality. Understanding why each loss works helps you debug training issues.

Classification losses

  • Cross-Entropy Loss: the standard for classification. Measures the difference between predicted probability distribution and the true label. For binary classification: BCE (Binary Cross-Entropy). For multi-class: Categorical CE. Works because it heavily penalizes confident wrong predictions (predicting 0.01 for the correct class is much worse than predicting 0.4).
  • Focal Loss: modified CE that down-weights easy examples and focuses training on hard ones. Invented for object detection (RetinaNet) where 99% of candidate boxes are "background" (easy negatives). The γ parameter controls how much to focus on hard examples. Focal Loss = -α(1-p)^γ log(p).
  • Label Smoothing: instead of hard labels (0 or 1), use soft labels (0.1 or 0.9). Prevents overconfidence, improves generalization. Common in modern image classification (ViT uses 0.1 smoothing).

Detection losses

  • Bounding box regression: L1 Loss (mean absolute error) or Smooth L1 (less sensitive to outliers) for predicting box coordinates (x, y, w, h). Modern detectors use CIoU Loss (Complete IoU) which directly optimizes the overlap between predicted and ground truth boxes, considering center distance and aspect ratio.
  • Objectness loss: BCE that predicts "is there an object here?" for each anchor/grid cell.
  • Total YOLO loss = box_loss + objectness_loss + classification_loss, each weighted differently.

Segmentation losses

  • Pixel-wise Cross-Entropy: standard CE applied to every pixel independently. Simple but struggles with class imbalance (tiny objects get overwhelmed by background loss).
  • Dice Loss: 1 - Dice coefficient. Directly optimizes the overlap metric. Handles class imbalance naturally because it measures ratios, not absolute counts. Standard in medical imaging.
  • Combined: CE + Dice: many SOTA segmentation models use both, getting the benefits of each. CE provides stable gradients, Dice handles imbalance.
  • Boundary Loss: penalizes predictions that are wrong near object edges. Improves boundary quality without extra annotation cost.

Contrastive & self-supervised losses

  • Contrastive Loss (CLIP, SimCLR): pull matching pairs (same image augmented twice, or matching image-text) close in embedding space, push non-matching pairs apart. InfoNCE is the most common variant.
  • Triplet Loss: given an anchor, a positive (same class), and a negative (different class), ensure the anchor is closer to the positive than the negative by a margin. Common in face recognition (FaceNet).
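Triplet loss is a two-line formula once you have embeddings. A sketch with toy 2-D vectors (real face embeddings are typically 128-512 dimensions):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(a,p) - d(a,n) + margin) with Euclidean distances."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])   # same identity: close to the anchor
n = np.array([0.0, 1.0])   # different identity: far away
print(triplet_loss(a, p, n))  # → 0.0  (already separated by more than the margin)
```

A zero loss means this triplet is already "solved"; training focuses on triplets where the negative is still too close (hard negative mining).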

How to choose a loss function

  • Balanced classification: standard Cross-Entropy works great
  • Imbalanced classification: Focal Loss or weighted CE (weight rare classes higher)
  • Detection: CIoU for boxes + Focal Loss for classification (YOLOv8 default)
  • Segmentation (imbalanced): Dice Loss + CE combined
  • Embeddings/similarity: Contrastive (InfoNCE) or Triplet Loss
  • Regression (continuous output): MSE (penalizes large errors) or Huber Loss (robust to outliers)
Production Monitoring for CV Production

Deploying a CV model is only half the job. In production, image distributions shift (new cameras, lighting changes, seasonal differences), and model performance silently degrades if you're not monitoring. A structured monitoring pipeline catches issues before users do.

What to monitor

  • Data drift: compare input image distribution over time (brightness, resolution, domain shift). Use statistical tests (KL divergence, PSI) on embedding distributions. Alert when distribution diverges beyond a threshold.
  • Per-domain accuracy: track accuracy per camera, source, region, or time window separately. A model may degrade on one domain while looking fine overall. Break metrics down by every axis that matters.
  • Confidence calibration: a model saying "95% confident" should be right 95% of the time. Plot reliability diagrams (predicted confidence vs actual accuracy). Uncalibrated models give misleading confidence scores.
  • Latency & throughput: track p50, p95, p99 inference time. Monitor GPU utilization, queue depth, batch throughput. Set SLA alerts for real-time applications.
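A drift check like the PSI mentioned above fits in a few lines. The "brightness" data here is simulated to illustrate; in practice you would feed per-image statistics or embedding projections, and the 0.1/0.25 cutoffs are the common rule of thumb, not a law:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample (taken at
    deploy time) and a live sample of some scalar image statistic."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0] -= 1e9          # widen the outer bins to catch
    edges[-1] += 1e9         # out-of-range live values
    p = np.histogram(expected, edges)[0] / len(expected)
    q = np.histogram(actual, edges)[0] / len(actual)
    p, q = np.clip(p, 1e-6, None), np.clip(q, 1e-6, None)
    return float(np.sum((p - q) * np.log(p / q)))

rng = np.random.default_rng(0)
reference = rng.normal(120, 20, 5000)   # mean brightness at deploy time
same = rng.normal(120, 20, 5000)        # no drift
darker = rng.normal(95, 20, 5000)       # e.g. cameras moved indoors
print(round(psi(reference, same), 3), round(psi(reference, darker), 3))
# rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major drift
```

Wire this into a scheduled job per camera/source, and alert when PSI crosses your threshold.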

How to improve continuously

  • Failure analysis: regularly review the worst predictions. Cluster error patterns to find systematic issues (night images, occluded objects, rare classes). Feed these hard cases back into the training set.
  • A/B testing: when deploying a new model, serve both old and new in parallel and compare live metrics before full rollout. Shadow deployment (new model runs but doesn't serve) is even safer.
  • Active learning: prioritize labeling the samples where the model is most uncertain. This gives the best return per labeled sample, reducing annotation costs.
  • Retraining triggers: define automatic retraining when drift exceeds a threshold or accuracy drops below a target. Automate the pipeline with Airflow or similar.

NLP & LLMs (SOTA overview)

Natural Language Processing (NLP) teaches machines to understand, generate, and reason about human language. Today, Large Language Models (LLMs) power most NLP applications, but the ecosystem is much broader: encoder models for understanding, embedding models for search, and retrieval pipelines for grounding answers in real data.

Core NLP Tasks

Text Classification NLP

Assign a label to a piece of text: is this email spam? Is this review positive or negative? What topic does this article cover? Text classification is one of the most common NLP tasks in production, with approaches ranging from simple keyword rules to fine-tuned Transformers.

Approaches (by complexity)

  • TF-IDF + LogReg/SVM: convert text to word frequency vectors, classify with a linear model. Fast, interpretable, strong baseline.
  • Fine-tuned BERT/RoBERTa: best accuracy for domain-specific tasks. Needs 500+ labeled examples. Train in minutes on a single GPU.
  • SetFit: few-shot learning with sentence transformers. Only 10-50 labeled examples needed. Great when data is scarce.
  • Zero-shot LLM: describe your categories in plain text and the LLM classifies without any labeled data. Slower and more expensive but incredibly flexible.
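To make the TF-IDF baseline concrete, here is a dependency-free toy version. The corpus is invented, and nearest-document cosine similarity stands in for the LogReg/SVM classifier (in practice you would use scikit-learn's `TfidfVectorizer` plus `LogisticRegression`):

```python
import math
from collections import Counter

# Tiny invented corpus for spam vs ham:
train = [("win a free prize now", "spam"),
         ("free cash prize claim now", "spam"),
         ("meeting moved to monday", "ham"),
         ("see you at the meeting", "ham")]

docs = [d.split() for d, _ in train]
df = Counter(w for doc in docs for w in set(doc))           # document frequency
idf = {w: math.log(len(docs) / df[w]) for w in df}

def tfidf(tokens):
    tf = Counter(tokens)
    return {w: tf[w] / len(tokens) * idf.get(w, 0.0) for w in tf}

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    norm = lambda x: math.sqrt(sum(t * t for t in x.values())) or 1.0
    return dot / (norm(u) * norm(v))

def classify(text):
    vec = tfidf(text.split())
    # nearest labeled document by cosine (stand-in for a trained linear model)
    best = max(train, key=lambda d: cosine(vec, tfidf(d[0].split())))
    return best[1]

print(classify("claim your free prize"))   # → spam
print(classify("monday meeting agenda"))   # → ham
```

Rare, discriminative words ("claim", "monday") get high IDF weight, which is why this baseline is surprisingly strong.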
NER / Information Extraction NLP

Named Entity Recognition finds and labels key information in text: names of people, companies, places, dates, amounts. It's essential for turning unstructured text (contracts, resumes, news articles, invoices) into structured data you can store in a database.

Approaches

  • SpaCy NER: fast, rule-based + statistical. Good for standard entities (person, org, location). CPU-friendly.
  • Fine-tuned BERT + token classification: best for custom entity types specific to your domain (legal clauses, medical terms).
  • GLiNER: zero-shot NER. Describe your entities in natural language ("company names", "monetary amounts") and it extracts them without training.
  • LLM extraction: use structured output (JSON mode) for complex extraction schemas. Flexible but slower and more expensive.
Embeddings & Semantic Search Search / RAG

Embedding models convert text into numerical vectors (arrays of numbers, typically 768-1024 dimensions) that capture meaning. Texts with similar meanings end up as vectors that are close together in this high-dimensional space. This is the foundation of semantic search: instead of matching exact keywords, you find texts that mean the same thing.

How embedding projection works

Each sentence is transformed into a vector (e.g., 1024 numbers). Think of it as coordinates in a 1024-dimensional space. While we can't visualize 1024D, the principle is the same as 2D/3D: similar texts cluster together, different texts are far apart. You can use UMAP or t-SNE to project these vectors onto a 2D plane and literally see clusters of related documents. When a user asks a question, you embed it into the same space and find the nearest document vectors using cosine similarity or dot product.
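The nearest-neighbor step above can be sketched with tiny hand-made vectors. These 4-dim embeddings are stand-ins for real 768-1024-dim model outputs (e.g. from BGE-M3 or E5); the mechanics are identical:

```python
import numpy as np

# Hypothetical document embeddings (values invented for illustration):
docs = {
    "How to reset your password": np.array([0.9, 0.1, 0.0, 0.1]),
    "Quarterly revenue report":   np.array([0.1, 0.9, 0.2, 0.0]),
    "Forgot login credentials":   np.array([0.7, 0.3, 0.1, 0.2]),
}
query = np.array([0.85, 0.15, 0.05, 0.15])   # "can't sign in to my account"

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # login-related docs first, revenue report last
```

No keyword overlaps with "password" are needed — the query vector simply lands near the login-related documents in the shared space.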

Sparse vs Dense vs Hybrid search

  • Dense search (semantic): uses embedding vectors. Great for finding texts with similar meaning even if they use completely different words ("car" matches "automobile"). Models: BGE-M3, E5, GTE.
  • Sparse search (lexical, BM25): matches exact keywords and their frequency. Excellent for technical terms, product codes, proper nouns that dense models might miss. Fast and simple.
  • Hybrid search (best of both): combines dense + sparse scores. Catches both semantic matches and exact keyword matches. Most production RAG systems use hybrid search (e.g., Qdrant supports both natively).

Key embedding models

  • BGE-M3 (BAAI): multilingual, produces both dense AND sparse vectors, 1024-dim. Used in MIRROR.
  • E5 / GTE (Microsoft/Alibaba): instruction-tuned embeddings, top scores on MTEB benchmark.
  • Nomic Embed: open-source, long-context (8192 tokens), strong performance.
  • Cohere Embed v3: commercial, int8 quantized embeddings for cost-effective production.

Vector databases

Qdrant (Rust, HNSW+quantization, hybrid search, used in MIRROR). Pinecone (fully managed, serverless). Weaviate (modular, GraphQL). ChromaDB (simple, great for prototyping). pgvector (PostgreSQL extension, use your existing DB).

Encoder Models

BERT / RoBERTa Encoder, Understanding

BERT reads text in both directions simultaneously (bidirectional), giving it deep understanding of context. Unlike LLMs which generate text left-to-right, BERT is designed to understand text: classify it, extract information from it, compare sentences. It's small (110M params), fast (runs on CPU), and still the go-to model for many production NLP tasks where you need understanding, not generation.

How it works

BERT is a Transformer encoder pretrained on two tasks: (1) Masked Language Modeling: randomly hide 15% of words and predict them from surrounding context (both left and right). This forces the model to deeply understand language. (2) Next Sentence Prediction: determine if two sentences follow each other naturally. After pretraining, you add a small classification head on top and fine-tune on your specific task with as few as 500 labeled examples.

When to use

  • Text classification: sentiment analysis, intent detection, spam filtering, topic classification
  • NER: extracting names, dates, entities from text (token-level classification)
  • Semantic similarity: comparing sentences with Sentence-BERT (produces embeddings)
  • Question answering: find the answer span in a passage (extractive QA)

✓ Strengths

  • Bidirectional context: understands words from both sides simultaneously
  • Small and fast (110M params for base, runs on CPU)
  • Fine-tunes in minutes on a single GPU
  • Massive ecosystem: 50k+ BERT variants on HuggingFace

✗ Weaknesses

  • Cannot generate text (encoder only, no decoder)
  • Max 512 tokens input (limited context window)
  • Pretraining data is now outdated (original release: 2018)
  • For generation tasks, use GPT-style decoder models instead

Key variants

RoBERTa: same architecture, better pretraining (more data, longer training, no NSP task). DeBERTa: improved attention with disentangled position encoding, often outperforms RoBERTa. CamemBERT: French BERT, pretrained on French web data. DistilBERT: 40% smaller, 60% faster, retains 97% of BERT's accuracy.

LLM Architecture & Training

LLM Architectures Foundation

Large Language Models (ChatGPT, Claude, Llama, Phi, Qwen) are massive Transformer decoder networks trained to predict the next word. By reading billions of documents, they develop an understanding of language, facts, and reasoning. After pretraining, they're refined with human feedback (RLHF/DPO) so they actually follow instructions and give helpful, safe answers instead of just completing text.

Key families (2024-2025)

  • GPT-4 / GPT-4o (OpenAI): proprietary, best general reasoning, multimodal
  • Claude 3.5 Sonnet (Anthropic): strong on long context (200k tokens), safety-focused
  • Llama 3.x (Meta): open-weight, 8B/70B/405B, massive community
  • Phi-4 (Microsoft): 14B, exceptional quality/size ratio, used in MIRROR
  • Qwen 2.5 (Alibaba): top multilingual, 0.5B to 72B, open-weight
  • Mistral / Mixtral (Mistral AI): MoE architecture, efficient inference, European
  • DeepSeek-V3/R1: MoE, top reasoning and code generation, MIT license

Training pipeline (4 stages)

Step 1: Pretraining reads trillions of words from the internet, learning to predict the next word. This gives the model broad knowledge of language, facts, and reasoning patterns. Step 2: Instruction tuning (SFT) fine-tunes on curated instruction/response pairs so the model learns to follow instructions instead of just completing text. Step 3: Alignment (RLHF/DPO) trains with human preference data to be helpful, harmless, and honest. Step 4: Quantization compresses the model (GGUF/GPTQ/AWQ) to run on smaller hardware without significant quality loss.

Fine-tuning: LoRA / QLoRA Training

Adapt a pre-trained LLM to your specific domain by training only a tiny fraction of its parameters (~1%). Instead of updating all billions of weights, LoRA injects small trainable matrices ("adapters") into each layer. QLoRA loads the frozen model in 4-bit, making fine-tuning possible on a single GPU (~40GB VRAM for a 70B model, or a 24GB consumer card for models up to roughly 33B).

How LoRA works

The original weight matrix W (e.g., 4096x4096) is frozen. LoRA adds two small matrices: A (4096x16) and B (16x4096), where 16 is the "rank." The output becomes W*x + A*B*x. Only A and B are trained, which is a tiny fraction of the total parameters. After training, A*B can be merged back into W, so inference has zero additional cost.

QLoRA goes further: the frozen W is loaded in 4-bit NF4 precision (instead of 16-bit), cutting base model memory by 4x. The adapters A and B are still trained in FP16/BF16 for precision. This means a 70B model that normally needs 140GB VRAM can be fine-tuned in ~40GB.
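The arithmetic above is easy to verify. Using the paragraph's shapes (d=4096, rank r=16), this NumPy sketch runs the LoRA forward pass W·x + A·B·x and counts trainable parameters (real fine-tuning uses libraries like HuggingFace PEFT, not raw matrices):

```python
import numpy as np

d, r = 4096, 16
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen pretrained weight (never updated)
A = rng.normal(size=(d, r)) * 0.01   # trainable adapter (down/up projection pair)
B = np.zeros((r, d))                 # zero init, so A @ B starts as a no-op

x = rng.normal(size=d)
y = W @ x + A @ (B @ x)              # LoRA forward pass: W x + A B x

full = d * d                          # parameters in W
lora = d * r + r * d                  # parameters in A and B
print(f"{lora / full:.2%} of the weights are trainable")  # → 0.78%
```

Because A·B has the same shape as W, the product can be added into W after training, which is why merged LoRA inference costs nothing extra.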

When to use

  • Domain adaptation: teach medical, legal, or financial terminology and reasoning patterns
  • Style and format control: make the model write in your brand's tone, or always output specific JSON structures
  • RAG improvement: fine-tune the model to better use retrieved context and produce citations
  • When prompting isn't enough: if few-shot prompting fails to get consistent results

✓ Strengths

  • Train <1% of parameters: fast and cheap
  • Adapters can be merged back into the base model (zero overhead)
  • Stack multiple adapters: one for legal, one for medical, same base
  • Tools: HuggingFace PEFT, Unsloth (2x faster), Axolotl, TRL

✗ Weaknesses

  • Data quality is everything: garbage in, garbage out
  • Risk of catastrophic forgetting (model loses general knowledge)
  • Rank and learning rate require experimentation
  • Evaluation is tricky: benchmarks may not reflect real-world gains

Typical workflow

1. Prepare 500-10k instruction/response pairs. 2. Choose a base model (Llama, Phi, Qwen). 3. Configure LoRA: rank (8-64), alpha (2x rank), target modules (attention layers). 4. Train 1-3 epochs with QLoRA via Unsloth or PEFT. 5. Evaluate on held-out test set + human review. 6. Merge adapter into base model and quantize (GGUF) for deployment.

RAG & Retrieval

RAG Pipeline Production Pattern

RAG (Retrieval Augmented Generation) gives an LLM access to your own documents. When a user asks a question, the system first searches for relevant passages in your document index, then feeds those passages to the LLM along with the question. The LLM writes an answer grounded in your actual data, with citations. This is how most enterprise AI assistants handle company-specific knowledge without retraining the model.

Pipeline stages

  • 1. Chunking: split documents into overlapping passages (200-500 words). Strategy matters: semantic chunking respects paragraph boundaries, hierarchical chunking maintains parent-child context.
  • 2. Embedding: convert each chunk into a dense vector (BGE-M3, E5) and optionally a sparse vector (BM25/SPLADE) for hybrid search.
  • 3. Indexing: store vectors + metadata in a vector database (Qdrant, Pinecone, ChromaDB) with HNSW index for fast approximate nearest neighbor search.
  • 4. Retrieval: embed the user's question, find top-k most similar chunks via hybrid search (dense + sparse). Typical k=5-20.
  • 5. Reranking: a cross-encoder (BGE-Reranker, Cohere Rerank) re-scores retrieved chunks by reading query+chunk together. Much more precise than embedding similarity alone.
  • 6. Generation: the LLM receives the question + top reranked chunks as context and generates an answer with inline citations.
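Stage 1 (chunking) is simple to sketch. A minimal fixed-size splitter with overlap, measured in words as in the stage description (semantic/hierarchical chunkers are smarter but follow the same shape):

```python
def chunk(words, size=300, overlap=50):
    """Split a word list into overlapping chunks (sizes in words)."""
    step = size - overlap
    return [words[i:i + size] for i in range(0, max(len(words) - overlap, 1), step)]

doc = [f"w{i}" for i in range(700)]      # a 700-word document
chunks = chunk(doc)
print([len(c) for c in chunks])          # → [300, 300, 200]
```

Each consecutive pair of chunks shares 50 words, so a sentence straddling a boundary still appears whole in at least one chunk — the reason overlap exists.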

✓ Strengths

  • Grounds answers in real documents: dramatically reduces hallucinations
  • No retraining needed: just update the document index
  • Citations let users verify every claim in the answer
  • Works with any LLM (open-source or API, local or cloud)

✗ Weaknesses

  • Retrieval quality is the bottleneck: bad retrieval = bad answers
  • Chunking strategy is critical (too small = missing context, too big = noise)
  • Adds latency: embedding + search + reranking + generation
  • Complex to evaluate end-to-end (retrieval quality + generation faithfulness)

Production tips

Always use hybrid search (dense + sparse) for better recall on technical queries. Cache embeddings for frequent queries. Cap top-k after reranking (3-5) to control latency and cost. Stream tokens for responsive UX. Monitor retrieval quality (MRR, recall@k) and user feedback. Add fallback to direct LLM chat when no relevant documents are found.

Tokenization Preprocessing

Before any model can process text, it must be split into "tokens" (small pieces: words, subwords, or characters). The tokenizer converts text into a sequence of integer IDs that the model understands. How text is tokenized affects everything: cost (more tokens = higher price), speed (longer sequences = slower), and quality (non-English languages often need more tokens per word, reducing effective context).

Key concepts

  • BPE (Byte-Pair Encoding): the most common algorithm, used by GPT, Llama, Phi. Starts with individual characters and iteratively merges the most frequent pairs.
  • Vocabulary size: 32k-128k tokens typical. Larger vocab = fewer tokens per text but bigger embedding table.
  • Special tokens: [CLS], [SEP], <|im_start|>, padding tokens that mark structure in the input.
  • Context window: the maximum number of tokens the model can process at once. Ranges from 4k in older models to 128k+ (GPT-4, Llama 3.1).
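The BPE merge loop fits in a short script. This toy learner works on a tiny invented word-frequency table (real tokenizers run the same idea over byte sequences from terabyte-scale corpora):

```python
from collections import Counter

def merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if tuple(symbols[i:i + 2]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def bpe_merges(words, num_merges):
    """Learn BPE merges from a {word: frequency} dict."""
    vocab = {tuple(w): f for w, f in words.items()}   # word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)              # most frequent adjacent pair
        merges.append(best)
        vocab = {merge(symbols, best): f for symbols, f in vocab.items()}
    return merges

corpus = {"lower": 6, "lowest": 3, "newer": 4}
print(bpe_merges(corpus, 3))  # → [('w', 'e'), ('we', 'r'), ('l', 'o')]
```

The learned merges ("we", then "wer", then "lo") become vocabulary entries; tokenizing new text replays these merges in order, so frequent fragments compress into single tokens.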

Evaluation & Safety

LLM Evaluation Quality

Evaluating LLMs is notoriously difficult because there's no single metric that captures "quality." You need a combination: standardized benchmarks for model selection, golden test sets for regression testing, LLM-as-judge for scalable scoring, and human evaluation for final validation. In production, user feedback and monitoring complete the picture.

Evaluation approaches

  • Benchmarks: MMLU (general knowledge), HumanEval (code), GSM8K (math), MT-Bench (multi-turn chat). Useful for comparing models before selection.
  • Golden test sets: curated Q&A pairs with expected answers for your specific use case. Run after every change for regression testing.
  • LLM-as-judge: use GPT-4 or Claude to score responses on rubrics (helpfulness, accuracy, safety). Cheaper and faster than human eval, correlates well.
  • Human evaluation: gold standard but expensive and slow. Use for final validation, not iteration.
  • RAG-specific (RAGAS): faithfulness (answer supported by context?), answer relevancy, context precision, citation correctness.
Safety & Alignment Production

Protecting LLM applications from misuse and failure in production: preventing prompt injection attacks, filtering harmful outputs, detecting hallucinations, and adding guardrails to keep the AI safe, honest, and on-topic.

Key concerns

  • Prompt injection: users trick the model into ignoring its instructions or revealing the system prompt. Mitigate with input sanitization, instruction hierarchy, and separate system/user contexts.
  • Hallucination: the model invents facts. RAG + citations reduce this significantly but don't eliminate it. Always ground responses in retrieved context.
  • PII leakage: the model may reproduce personal data from training. Filter both input and output with regex + NER-based scanners.
  • Guardrails: NeMo Guardrails (NVIDIA) for dialog flow control, Llama Guard for content classification, custom classifiers for domain-specific safety.

MLOps & Infrastructure

MLOps bridges the gap between building ML models and running them reliably in production. It borrows practices from DevOps (CI/CD, monitoring, infrastructure as code) and applies them to the unique challenges of machine learning: model versioning, data drift, GPU resource management, and continuous retraining.

Containerization

Docker Foundation

Docker packages your application and all its dependencies (Python, CUDA, libraries, model files) into a single portable container that runs identically on any machine. No more "it works on my machine" problems. Every ML service in production should be containerized.

Key concepts

  • Dockerfile: a recipe that describes how to build your container image (base OS, install dependencies, copy code, set entrypoint)
  • Image: the built artifact, like a snapshot. Immutable, versioned, stored in a registry (Docker Hub, GHCR, ECR)
  • Container: a running instance of an image. Lightweight, isolated, starts in seconds
  • Volume: persistent storage mounted into the container (for model files, data that survives container restarts)
  • NVIDIA Container Toolkit: enables GPU passthrough to Docker containers (essential for ML inference)

Best practices for ML

  • Use multi-stage builds: build dependencies in stage 1, copy only what's needed to the final slim image
  • Pin all dependency versions (requirements.txt with exact versions)
  • Keep model weights outside the image (mount as volume or download at startup)
  • Use health checks so orchestrators know if the service is ready
Docker Compose Multi-container

When your application has multiple services (API server, vector database, embedding model, LLM, reverse proxy), Docker Compose lets you define and run them all together in a single YAML file. One command starts everything with the right networking, volumes, and environment variables.

Typical ML stack in docker-compose.yml

  • app: your FastAPI/Flask API server (CPU)
  • llm: LLM inference service with GPU access (vLLM, llama.cpp server)
  • qdrant: vector database for RAG (persistent volume)
  • caddy/nginx: reverse proxy with HTTPS, rate limiting

✓ Strengths

  • Simple YAML syntax, easy to version control
  • Perfect for single-server deployments and development
  • Built-in service discovery (services talk by name)
  • GPU support via deploy.resources.reservations

✗ Weaknesses

  • Single-host only (no multi-server scaling)
  • No auto-healing or rolling updates (manual restart)
  • For multi-server: use Kubernetes or Docker Swarm

Orchestration

Kubernetes (K8s) Production Orchestration

Kubernetes is the industry standard for running containers at scale across multiple servers. It automatically handles load balancing, scaling (up and down based on demand), self-healing (restarts crashed containers), rolling updates (zero-downtime deploys), and GPU scheduling. Essential when you need to serve ML models to hundreds or thousands of concurrent users.

Key concepts

  • Pod: the smallest unit, one or more containers running together. Your LLM service = 1 pod.
  • Deployment: manages replicas of your pods. "Run 3 copies of my API server."
  • Service: stable network endpoint that routes traffic to pods (load balancing)
  • Ingress: routes external HTTP/HTTPS traffic to the right service
  • GPU scheduling: NVIDIA device plugin lets you request GPUs per pod (resources.limits: nvidia.com/gpu: 1)
  • HPA (Horizontal Pod Autoscaler): automatically adds/removes pods based on CPU, memory, or custom metrics

✓ Strengths

  • Auto-scaling, self-healing, rolling updates out of the box
  • GPU-aware scheduling for ML workloads
  • Massive ecosystem: Helm charts, operators, monitoring
  • Managed options: GKE, EKS, AKS reduce operational burden

✗ Weaknesses

  • Steep learning curve (YAML manifests, networking, RBAC)
  • Overkill for single-server deployments
  • Operational overhead: monitoring the orchestrator itself
  • GPU node pools are expensive (always-on or with autoscaling delays)

ML Serving & APIs

FastAPI / Model Serving API Framework

FastAPI is the go-to Python framework for building production ML APIs. It's async (handles many concurrent requests), generates OpenAPI docs automatically, and has built-in data validation with Pydantic. For LLM serving specifically, vLLM and TGI (Text Generation Inference) provide optimized inference servers with OpenAI-compatible APIs.

Serving stack options

  • FastAPI + llama-cpp-python: simple, direct, great for single-model CPU/GPU serving (used in MIRROR)
  • vLLM: high-throughput LLM serving with PagedAttention, continuous batching, OpenAI-compatible API
  • TGI (HuggingFace): production-ready, supports quantization, speculative decoding, token streaming
  • TensorRT-LLM (NVIDIA): maximum performance on NVIDIA GPUs, compiled inference engine
  • Triton Inference Server: multi-model, multi-framework serving (PyTorch, TensorFlow, ONNX)

Production essentials

  • Run behind Gunicorn/Uvicorn with multiple workers for CPU tasks
  • Add health check endpoints (/health, /ready) for orchestrator probes
  • Implement request timeouts and graceful shutdown
  • Use structured logging (JSON) for observability
  • Rate limiting and authentication for public-facing APIs

Monitoring & Observability

Prometheus Metrics

The industry standard for collecting and storing time-series metrics. Prometheus scrapes /metrics HTTP endpoints from your services at regular intervals, stores the data efficiently, and fires alerts when thresholds are breached. Used by virtually every Kubernetes cluster and production ML system.

When and why to use it

Use Prometheus whenever you need to answer "how is my system performing right now?" It's the source of truth for latency, throughput, error rates, and resource utilization. For ML: track tokens/sec, time-to-first-token, GPU memory, queue depth, and model confidence distributions.

How it works

  • Pull model: Prometheus scrapes your app's /metrics endpoint every 15-30s. Your app exposes counters, gauges, and histograms.
  • PromQL: powerful query language. Example: rate(http_requests_total[5m]) gives the per-second request rate averaged over the last 5 minutes.
  • Alertmanager: companion service that routes alerts to Slack, PagerDuty, email. Groups, silences, and deduplicates.
  • Service discovery: automatically finds targets in Kubernetes (no manual config per service).
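The pull model above boils down to serving plain text at /metrics. A minimal sketch of the exposition format, hand-rolled for illustration (metric names are made up; in practice you would use the official prometheus_client library's Counter/Histogram and start_http_server):

```python
def render_metrics(counters: dict[str, float]) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# What Prometheus would see when scraping /metrics:
print(render_metrics({"http_requests_total": 1042}))
```

Prometheus parses this text on every scrape; counters only ever go up, and PromQL's rate() turns them into per-second throughput.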

Grafana Dashboards

The visualization layer for your entire infrastructure. Grafana connects to Prometheus (metrics), Loki (logs), and dozens of other data sources to create beautiful, interactive dashboards. For ML teams: build dashboards showing model latency, GPU utilization, request throughput, error rates, and data drift all in one view.

When and why to use it

Use Grafana whenever you need a visual overview of system health. It's the single pane of glass where everyone (engineers, PMs, execs) checks production status. Create team-specific dashboards: one for infra (CPU, memory, disk), one for ML (model performance, drift), one for business (requests, users, cost).

Key features

  • Multi-datasource: connect Prometheus, Loki, PostgreSQL, Elasticsearch, CloudWatch in one dashboard
  • Alerting: define alert rules directly in Grafana (simpler than Alertmanager for basic cases)
  • Templating: variables let you switch between services, environments, or time ranges dynamically
  • Community dashboards: thousands of pre-built dashboards for common stacks (Node Exporter, NGINX, GPU metrics)

Loki Logs

Log aggregation system designed by Grafana Labs. Like Prometheus but for logs: lightweight, label-based indexing (not full-text), and native Grafana integration. Collects logs from all your containers and lets you search, filter, and correlate them with metrics on the same timeline.

When and why to use it

Use Loki when you need to debug issues by reading logs across services. "The model returned an error at 14:32 — what happened?" Loki lets you jump from a Grafana alert to the exact logs in the same dashboard. Much cheaper than Elasticsearch for log storage because it only indexes labels (service, level, pod), not the full text.

How it works

  • Promtail (agent): runs on each node, tails log files and ships them to Loki with labels
  • LogQL: query language similar to PromQL. {job="mirror"} |= "error" finds all error logs from the mirror service.
  • Correlation: click a metric spike in Grafana, instantly see the logs from that exact time window
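Loki's "index labels, not full text" design can be illustrated with a toy store: lookups narrow by labels first (the cheap, indexed part), then scan line content (a sketch, not Loki's actual API):

```python
from dataclasses import dataclass, field


@dataclass
class ToyLoki:
    """Index log streams by their label set only; scan line content at query time."""
    streams: dict[frozenset, list[str]] = field(default_factory=dict)

    def push(self, labels: dict[str, str], line: str) -> None:
        key = frozenset(labels.items())
        self.streams.setdefault(key, []).append(line)

    def query(self, labels: dict[str, str], contains: str = "") -> list[str]:
        # Label matching uses the index; the substring filter (|=) is a scan.
        want = set(labels.items())
        hits = []
        for key, lines in self.streams.items():
            if want <= set(key):
                hits += [line for line in lines if contains in line]
        return hits


logs = ToyLoki()
logs.push({"job": "mirror"}, "model loaded")
logs.push({"job": "mirror"}, "error: CUDA out of memory")
print(logs.query({"job": "mirror"}, contains="error"))
# → ['error: CUDA out of memory']
```

This is why Loki storage is cheap: only the small label sets are indexed, and the (compressed) log lines are scanned lazily when a query actually touches them.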

Langfuse / LangSmith LLM Tracing

Specialized observability for LLM applications. While Prometheus monitors infrastructure, Langfuse and LangSmith trace the AI layer: every prompt, context, tool call, response, and latency. Essential for debugging RAG quality, hallucinations, and cost optimization.

When and why to use it

Use LLM tracing tools as soon as you deploy any LLM-powered feature. Without them, you're blind to prompt quality, retrieval relevance, hallucination rate, and per-request cost. They let you replay any conversation, see what context was retrieved, and score response quality.

What they track

  • Traces: full request lifecycle — prompt, retrieved chunks, reranking scores, LLM response, latency per step
  • Scores: user feedback (thumbs up/down), LLM-as-judge evaluation, custom metrics (hallucination, relevance)
  • Cost: token usage per request, cost breakdown by model and feature
  • Datasets: build evaluation sets from production traces for regression testing
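The per-request cost tracking above is simple arithmetic once token counts are traced; a sketch with made-up prices (real per-token rates vary by model and provider):

```python
def request_cost(prompt_tokens: int, completion_tokens: int,
                 price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Cost of one LLM call, with separate input and output token prices."""
    return (prompt_tokens / 1000 * price_in_per_1k
            + completion_tokens / 1000 * price_out_per_1k)


# Hypothetical pricing: $0.50 per 1K input tokens, $1.50 per 1K output tokens
cost = request_cost(prompt_tokens=1200, completion_tokens=400,
                    price_in_per_1k=0.50, price_out_per_1k=1.50)
print(f"${cost:.4f}")  # → $1.2000
```

Tracing tools aggregate exactly this per trace, which is how they produce cost breakdowns by model, feature, or user.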

CI/CD & Automation

GitHub Actions CI/CD

CI/CD built directly into GitHub. Define workflows in YAML that trigger on push, PR, schedule, or manual dispatch. Free for open source. The most popular choice for teams already on GitHub — runs lint, tests, Docker builds, and deployments without any external infrastructure.

When and why to use it

Use GitHub Actions as your default CI/CD if your code lives on GitHub. It's zero-setup (no server to manage), has a massive marketplace of reusable actions, and integrates natively with GitHub PRs (status checks, comments, deploy environments). For ML: trigger model evaluation on every PR, build and push Docker images, deploy to staging.

Typical ML workflow

  • on: push → lint (ruff/black) → unit tests (pytest) → build Docker image → push to GHCR/ECR
  • on: pull_request → run model evaluation benchmark → post accuracy comparison as PR comment
  • on: release → deploy to production (SSH, K8s apply, or cloud deploy)
  • Self-hosted runners: run jobs on your own GPU machines for training and evaluation

Jenkins CI/CD

The veteran CI/CD server, self-hosted and extremely customizable. Jenkins has been the backbone of enterprise CI/CD for 15+ years. Groovy-based pipelines, thousands of plugins, and complete control over the build environment. Best for on-prem, air-gapped, or complex multi-team enterprise workflows.

When and why to use it

Use Jenkins when you need full control: on-premises infrastructure, strict security requirements (air-gapped networks), complex approval chains, or legacy systems integration. It's also the choice when you need GPU build agents with custom CUDA setups that cloud CI can't provide easily.

Key concepts

  • Jenkinsfile: pipeline-as-code in Groovy. Declarative or scripted syntax. Lives in your repo.
  • Agents/Nodes: distributed build agents. Tag agents by capability (gpu, docker, arm64) and route jobs accordingly.
  • Plugins: 1800+ plugins for everything (Docker, K8s, Slack, SonarQube, JIRA). This extensibility is Jenkins' superpower and curse.
  • Blue Ocean: modern UI for pipeline visualization (replaces the dated classic UI).

✓ Strengths

  • Complete control: self-hosted, customizable, on-prem friendly
  • Massive plugin ecosystem for any integration
  • Battle-tested at enterprise scale (thousands of jobs)

✗ Weaknesses

  • Operational overhead: you manage the server, updates, security
  • Groovy DSL is complex, hard to debug
  • UI feels dated compared to modern alternatives

ArgoCD GitOps

GitOps continuous delivery for Kubernetes. The desired state of your cluster (all YAML manifests, Helm charts, Kustomize configs) lives in a Git repo. ArgoCD watches that repo and automatically syncs changes to the cluster. Push to Git = deploy to production. No manual kubectl apply.

When and why to use it

Use ArgoCD when you run Kubernetes and want declarative, auditable deployments. Every change goes through a PR, gets reviewed, and is automatically applied. Rollback = git revert. For ML: update model image tag in Git, ArgoCD deploys the new version with a rolling update or canary strategy.

How it works

  • Application CRD: define which Git repo + path maps to which K8s namespace
  • Sync: ArgoCD compares Git state vs live state and reconciles diffs
  • UI: visual dependency tree of all K8s resources, health status, diff view
  • Rollback: one-click rollback to any previous Git commit

Workflow Orchestration

Apache Airflow Pipeline Orchestration

The most popular open-source workflow orchestrator. Define data and ML pipelines as DAGs (directed acyclic graphs) in pure Python. Schedule, retry, monitor, and backfill. Used by Airbnb, Spotify, and thousands of companies to manage ETL, training pipelines, and data quality checks.

When and why to use it

Use Airflow when you need to orchestrate multi-step workflows that run on a schedule or on trigger: data ingestion → validation → feature engineering → model training → evaluation → deployment. It's the glue that connects all your tools into a reliable, monitored pipeline.

Core concepts

  • DAG: directed acyclic graph of tasks. Each task is a unit of work. Dependencies define execution order.
  • Operators: pre-built task types. PythonOperator, BashOperator, DockerOperator, KubernetesPodOperator.
  • Sensors: wait for an external condition (file appeared, API responded) before proceeding.
  • Scheduler: triggers DAG runs based on cron expressions or external events.
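The dependency-ordering idea behind a DAG can be shown with the standard library's graphlib (task names are illustrative; a real Airflow DAG expresses the same thing with operators and the `>>` syntax):

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
pipeline = {
    "validate": {"ingest"},
    "features": {"validate"},
    "train": {"features"},
    "evaluate": {"train"},
}

# The scheduler's core job: run tasks only after their dependencies finish.
order = list(TopologicalSorter(pipeline).static_order())
print(order)  # → ['ingest', 'validate', 'features', 'train', 'evaluate']
```

Because the graph is acyclic, a valid execution order always exists; independent branches can also run in parallel, which is what Airflow's executor does across workers.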

Alternatives

Prefect: modern Python-native with better DX (decorators, built-in retries, UI). Dagster: asset-centric, strong typing, great for data-aware pipelines. Kubeflow Pipelines: Kubernetes-native, designed for ML workflows with GPU support.

MLflow / W&B Experiment Tracking

Track every experiment you run: hyperparameters, metrics, artifacts, and model versions. Without experiment tracking, you're lost after a few dozen training runs ("which config gave the best F1?"). MLflow is open-source and self-hosted, Weights & Biases (W&B) is cloud-hosted with richer visualization.

When and why to use it

Use experiment tracking from day 1 of any ML project. It pays off immediately: compare runs side-by-side, reproduce results, share with the team. MLflow also includes a model registry (version, stage, approve) and a deployment component. W&B adds real-time dashboards, hyperparameter sweeps, and collaborative reports.

What to track

  • Parameters: learning rate, batch size, model architecture, data version
  • Metrics: loss curves, accuracy, F1, latency, per-epoch and final
  • Artifacts: model weights, evaluation plots, confusion matrices, sample predictions
  • Model registry: version models, promote to staging/production, rollback
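The core of experiment tracking fits in a few lines; a toy version of what MLflow's log_param/log_metric record (the real API persists runs to a tracking server, these values are made up):

```python
runs: list[dict] = []


def log_run(params: dict, metrics: dict) -> None:
    """Record one training run's config and results."""
    runs.append({"params": params, "metrics": metrics})


def best_run(metric: str) -> dict:
    """Answer 'which config gave the best F1?' by scanning all runs."""
    return max(runs, key=lambda r: r["metrics"][metric])


log_run({"lr": 1e-3, "batch": 32}, {"f1": 0.81})
log_run({"lr": 3e-4, "batch": 64}, {"f1": 0.86})
print(best_run("f1")["params"])  # → {'lr': 0.0003, 'batch': 64}
```

The value of a real tracker is that this query works across months of runs, with artifacts and code versions attached, not just an in-memory list.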

Databases

Choosing the right database is one of the most impactful architectural decisions in any project. Each type of database excels at specific access patterns and fails at others. Understanding these tradeoffs helps you pick the right tool instead of forcing one database to do everything.

Relational (SQL)

PostgreSQL Best General-Purpose

The most advanced open-source relational database. Supports structured data with strong consistency (ACID transactions), complex queries (JOINs, CTEs, window functions), and extensions for nearly anything: full-text search, JSON, geospatial (PostGIS), and even vector search (pgvector). If you're unsure which database to pick, PostgreSQL is almost always a safe default.

✓ Strengths

  • ACID compliance: data integrity guaranteed, no corruption
  • Rich SQL: JOINs, subqueries, CTEs, window functions, full-text search
  • Extensions: pgvector (ML embeddings), PostGIS (geospatial), TimescaleDB (time series)
  • Mature ecosystem, battle-tested at every scale

✗ Weaknesses

  • Vertical scaling limits (single-node write bottleneck)
  • Horizontal sharding is complex (use Citus or CockroachDB for distributed)
  • Not optimized for high-write throughput on append-only workloads

When to use

User accounts, application state, metadata, configuration, anything with relationships. For ML: store experiment metadata, model registry entries, user feedback, evaluation results. With pgvector: can even serve as a simple vector database for small-scale RAG.

SQLite Embedded

A serverless, file-based SQL database embedded directly in your application. No separate database server needed: the entire database is a single file on disk. Perfect for local apps, prototypes, mobile apps, and anywhere you need SQL without the overhead of a full database server. Used in MIRROR for local data storage.

✓ Strengths

  • Zero configuration: no server, no users, no permissions
  • Single file: easy to backup, copy, version
  • Incredibly fast for read-heavy workloads
  • Ships with Python (sqlite3 module built-in)

✗ Weaknesses

  • Single-writer: only one write transaction at a time (WAL mode lets reads proceed concurrently with a writer)
  • No user management or network access (local only)
  • Not suitable for high-concurrency web applications
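Because sqlite3 ships with Python, the whole setup is a few lines. Table and column names here are illustrative, and `:memory:` keeps the example self-contained; a real app would pass a file path instead:

```python
import sqlite3

# In a real app: sqlite3.connect("app.db") — the entire DB is that one file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO notes (body) VALUES (?)", ("hello",))
conn.commit()

rows = conn.execute("SELECT body FROM notes").fetchall()
print(rows)  # → [('hello',)]
conn.close()
```

For file-backed databases with concurrent readers, enabling WAL mode (`conn.execute("PRAGMA journal_mode=WAL")`) is a common first tweak.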

NoSQL / Document

MongoDB Document Store

Stores data as flexible JSON-like documents (BSON) instead of rigid tables with fixed columns. Each document can have a different structure, making it great for evolving schemas, nested data, and rapid prototyping. Very popular for web applications where the data model changes frequently.

✓ Strengths

  • Flexible schema: add fields without migrations
  • Nested documents: store complex objects naturally
  • Horizontal scaling with sharding built-in
  • Atlas (managed) makes operations easy

✗ Weaknesses

  • No JOINs: denormalization leads to data duplication
  • Weaker consistency guarantees than PostgreSQL
  • Aggregation pipeline is powerful but complex
  • Schema flexibility can become schema chaos without discipline

Redis In-Memory / Cache

An in-memory key-value store that is incredibly fast (sub-millisecond latency). Used primarily as a cache (store frequently accessed data in memory instead of hitting the database), session store, rate limiter, message broker, and real-time leaderboard. Essential in any high-performance system.

✓ Strengths

  • Sub-millisecond latency (everything in RAM)
  • Rich data structures: strings, lists, sets, sorted sets, streams, hashes
  • Pub/Sub for real-time messaging
  • RediSearch: full-text search and vector search module

✗ Weaknesses

  • Data must fit in RAM (expensive at scale)
  • Persistence is optional and adds latency (RDB/AOF)
  • Not a primary database: use for caching, not as source of truth
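The cache-aside pattern Redis is typically used for looks like this in miniature: an in-process dict with TTLs, where Redis would do the same across processes with SETEX/GET (key names are illustrative):

```python
import time


class TTLCache:
    """Tiny cache-aside helper: entries expire after ttl seconds."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.store: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self.store.get(key)
        if entry is None or time.monotonic() > entry[0]:
            return None  # miss or expired: caller falls back to the database
        return entry[1]

    def set(self, key: str, value) -> None:
        self.store[key] = (time.monotonic() + self.ttl, value)


cache = TTLCache(ttl=60)
cache.set("user:42", {"name": "Ada"})
print(cache.get("user:42"))  # → {'name': 'Ada'}
```

The "not a source of truth" rule follows directly: a miss must always be answerable from the primary database, because cached entries can expire or be evicted at any time.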

Vector Databases

Qdrant / Pinecone / Weaviate AI / RAG

Purpose-built databases for storing and searching high-dimensional vectors (embeddings). Essential for RAG, semantic search, recommendation systems, and any application that needs to find "similar" items. They use approximate nearest neighbor (ANN) algorithms like HNSW to search billions of vectors in milliseconds.

Options compared

  • Qdrant (Rust): fast, HNSW + product quantization, hybrid dense+sparse search, filtering, gRPC/REST. Used in MIRROR. Self-hosted or cloud.
  • Pinecone: fully managed, serverless, auto-scaling. No self-hosting option. Best for teams that don't want to manage infrastructure.
  • Weaviate: modular (plug in different vectorizers), GraphQL API, hybrid search. Good for complex schemas.
  • ChromaDB: simple, Python-native, great for prototyping and small projects. Not production-grade at scale.
  • pgvector: PostgreSQL extension. Use your existing database for vector search. Good for small-medium scale, avoids adding another database.

When to use a dedicated vector DB vs pgvector

pgvector: <1M vectors, already using PostgreSQL, simple use case. Dedicated: >1M vectors, need advanced features (hybrid search, quantization, sharding), or vector search is your primary access pattern.
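What a vector database accelerates is nearest-neighbor search over embeddings. The exact brute-force version that ANN indexes like HNSW approximate (toy 3-dimensional vectors for illustration; real embeddings have hundreds or thousands of dimensions):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query: list[float], corpus: dict[str, list[float]]) -> str:
    # O(n) scan over all vectors; HNSW trades exactness for sublinear search.
    return max(corpus, key=lambda key: cosine(query, corpus[key]))


docs = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}
print(nearest([0.9, 0.1, 0.0], docs))  # → cat
```

The linear scan is fine up to roughly the pgvector scale mentioned above; dedicated vector databases exist precisely because this loop stops being fast at millions of vectors.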

Scaling: Vertical vs Horizontal

Vertical vs Horizontal Scaling Architecture

When your database can't handle the load, you have two fundamental options: make the machine bigger (vertical) or add more machines (horizontal). This is one of the most important architectural decisions because it affects cost, complexity, consistency guarantees, and which databases you can use.

Vertical Scaling (Scale Up)

  • What: bigger machine (more CPU, RAM, faster SSD). Same single server, just more powerful.
  • Pros: simple (no code changes), ACID transactions stay easy, no distributed complexity
  • Cons: hardware ceiling (single-machine RAM and CPU top out eventually), single point of failure, expensive at the top end
  • Best for: PostgreSQL, MySQL, SQLite. Works until ~1TB data / ~10K queries/sec for most workloads.
  • Example: upgrade from 16GB RAM to 128GB RAM, from HDD to NVMe SSD

Horizontal Scaling (Scale Out)

  • What: add more machines (nodes/shards). Data is distributed across servers.
  • Pros: near-infinite scaling, fault tolerance (if one node dies, others continue), cost-effective with commodity hardware
  • Cons: complex (distributed transactions, rebalancing, consistency), more operational overhead, not all queries are efficient
  • Best for: MongoDB, Cassandra, CockroachDB, Qdrant, Kafka. Required above ~10TB or ~100K queries/sec.
  • Patterns: sharding (split by key), replication (copies for reads), partitioning (time-based splits)
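The sharding pattern above is, at its simplest, hashing a key to pick a node. A sketch of stable shard assignment (real systems use consistent hashing or range-based shard maps so that adding a node doesn't reshuffle every key):

```python
import hashlib


def shard_for(key: str, n_shards: int) -> int:
    """Stable shard assignment: the same key always lands on the same shard."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


# Every writer and reader computes the same placement independently,
# so no central lookup is needed for routing.
for user_id in ("user:1", "user:2", "user:3"):
    print(user_id, "→ shard", shard_for(user_id, n_shards=4))
```

The weakness is visible in the `% n_shards`: changing the shard count remaps most keys, which is exactly the rebalancing problem consistent hashing was designed to soften.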

CAP Theorem (the fundamental tradeoff)

A distributed database can only guarantee two of three: Consistency (all nodes see the same data), Availability (every request gets a response), Partition tolerance (system works despite network failures). Since network partitions are unavoidable in production, distributed systems are really choosing between CP (consistent but may reject requests during a partition: CockroachDB, etcd) and AP (always available but may return stale data: Cassandra, DynamoDB). A single-node PostgreSQL sidesteps the tradeoff entirely, at the cost of not being distributed at all.

Cloud Data Warehouses

Snowflake Cloud Warehouse

A cloud-native data warehouse that separates storage and compute, allowing you to scale each independently. SQL-based, fully managed, supports semi-structured data (JSON, Parquet). Widely used for analytics, feature engineering, and as the data layer feeding ML pipelines. Snowpark lets you run Python ML code directly inside the warehouse.

When to use

Use Snowflake (or BigQuery/Redshift) when you need to analyze terabytes of data with SQL, build feature pipelines for ML, or serve analytics dashboards. The separation of storage and compute means you only pay for compute when queries run. Not for low-latency transactional workloads (use PostgreSQL for that).

Key features

  • Separation of storage and compute: scale compute (virtual warehouses) without moving data. Spin up/down in seconds. Pay only for compute time.
  • Zero-copy cloning: instantly clone databases or tables for testing without duplicating storage.
  • Time travel: query data as it was at any point in the past (up to 90 days). Undo mistakes, audit changes.
  • Snowpark: run Python, Scala, Java directly inside Snowflake. Build feature engineering and ML pipelines without moving data out.
  • Cortex AI: built-in LLM functions (COMPLETE, EMBED_TEXT) for AI directly in SQL queries.

Alternatives

  • BigQuery (Google): serverless, pay-per-query, integrated with Vertex AI. Best in the Google Cloud ecosystem.
  • Redshift (AWS): columnar warehouse, tight S3 integration, Redshift ML for in-warehouse training.
  • Databricks: unified analytics + ML platform on top of Delta Lake. Spark-based, strong for both ETL and model training.

✓ Strengths

  • Massive scale without ops burden (fully managed)
  • Excellent SQL support, familiar to data teams
  • Strong governance: RBAC, data masking, audit logs

✗ Weaknesses

  • Expensive at scale (compute costs add up)
  • Vendor lock-in (proprietary platform)
  • Not suitable for low-latency transactional workloads

TimescaleDB / Neo4j Specialized

Specialized databases optimized for specific access patterns. TimescaleDB is a PostgreSQL extension for time-series data (metrics, IoT, logs) with automatic partitioning and compression. Neo4j is a graph database for data with complex relationships (social networks, knowledge graphs, fraud detection) where traversal queries need to be fast.

When to use

  • TimescaleDB: metrics, IoT sensors, financial data, server logs. Any time-stamped data where you query by time range and aggregate. Extension on PostgreSQL = use your existing PG skills and tools.
  • InfluxDB: alternative to TimescaleDB. Purpose-built time-series DB with its own query language (Flux). Better for very high write throughput.
  • Neo4j: social networks (friend-of-friend queries), knowledge graphs, recommendation engines, fraud detection (follow chains of transactions). Cypher query language for graph traversals.
  • Apache AGE: graph extension for PostgreSQL. Use graph queries without adding a separate database. Good for simpler graph use cases.