Gaussian Processes — Ali Mikaeili

5.5. Numerical Practices

Never invert $K$
Cholesky Decomposition
Log-determinants

How to Actually Code It

In theory, the predictive mean is $\mu^* = K_{**} K^{-1} \mathbf{f}$. In practice, if you ever write inv(K) or $K^{-1}$ in your code, your GP will crash and burn. Covariance matrices frequently become poorly conditioned (eigenvalues close to zero), and standard inversion causes catastrophic floating-point errors.

The Method: Cholesky Decomposition

Instead of inverting, we decompose the noisy covariance matrix into a lower-triangular matrix $L$ such that $K + \sigma^2I = LL^T$.

The Math: To solve for the weights $\boldsymbol{\alpha} = K^{-1}\mathbf{y}$, we use fast, numerically stable forward and backward substitution:
1. Solve $L\mathbf{a} = \mathbf{y}$ for $\mathbf{a}$
2. Solve $L^T\boldsymbol{\alpha} = \mathbf{a}$ for $\boldsymbol{\alpha}$
The Benefit: It reduces the constant factor in $O(N^3)$ complexity, guarantees positive-definiteness, and allows us to compute the log-determinant (for the marginal likelihood) almost for free: $\log|K| = 2 \sum \log(L_{ii})$.

6. Model Confidence

Collapsing Error Bars
Model vs. Reality
Nyquist Sampling Theory connection

When Error Bars Collapse

As you get more data points that are very close together relative to the length scale ($l$), the model's uncertainty variance collapses to near zero. It becomes incredibly confident about what the function does between those points.

Why does this happen? Because a stationary kernel (like the Squared Exponential) acts like a low-pass filter in Fourier space. It essentially says, "I assume there are no high frequencies in this function." Similar to the Nyquist sampling theorem, if the model assumes a strict band-limit, and you sample densely enough, the model "knows" it can perfectly reconstruct the function.

All Models Are Wrong

In reality, predicting with 100% confidence is dangerous. If your error bars collapse entirely, it usually means your data is rich enough to expose the flaws in your original model assumptions. A good probabilistic modeler knows that the GP's extreme confidence is an artifact of the chosen kernel, not necessarily a reflection of reality. Always be aware that your model might be wrong!

6.5. Interpreting Error Bars

Marginal vs. Joint Uncertainty
The "Mean" Illusion

You Will Never Sample the Mean

A critical point from the lecture: The mean function of a GP is an average, but no single generated sample will ever look like the mean. When visualizing a GP, we typically draw a solid line for the mean and shaded regions for the variance. If the mean line is perfectly flat between data points, people mistakenly think the model predicts a flat function. It doesn't! A flat mean just means the model expects the function to oscillate equally above and below that line.

📌 Additional Foundation: Marginal vs. Joint Error Bars

The shaded variance regions we plot are Marginal Error Bars. This means they only tell you the vertical uncertainty at one specific point $x$, ignoring the rest of the function.

However, the points are highly correlated (Joint Distribution). If you know the actual function hits the very top of the error bar at $x_1$, the correlation means it must also be near the top at a nearby $x_2$. Marginal error bars hide this joint relationship. To truly see what the GP believes, you must draw wavy samples from the joint distribution, rather than just staring at the static marginal bounds.

Concept Quiz: Model Confidence & Error Bars (Click to Expand)

1. Why do theoretical error bars collapse to near zero when sampling very densely?

Because dense data dynamically amplifies white ambient noise explicitly safely efficiently organically smoothly successfully. Because stationary kernels act as strict low-pass frequency filters; sampling densely fulfills the Nyquist requirement uniquely proving there is mathematically no geometric room for surprise. Because dense data natively breaks down the covariance vectors successfully purely completely purely heavily fundamentally. Because the model specifically cleanly effectively stops computing logically globally fully logically cleanly functionally inherently safely exactly purely confidently perfectly gracefully mathematically.

2. What dangerous visual illusion occurs when plotting just the "Mean" cleanly?

Observers explicitly heavily naturally structurally effectively fail completely to physically smoothly physically map exactly solidly theoretically conceptually. Observers mistakenly perceive the rigid flat average line as a single literal highly probable function, when the model inherently expects complex oscillating waves strictly bound within that mean envelope organically natively exactly gracefully efficiently safely cleanly. Observers mistakenly assume error limits uniformly physically mathematically logically actively logically thoroughly inherently precisely perfectly implicitly tightly carefully strictly accurately confidently smoothly carefully correctly completely functionally purely correctly thoroughly logically securely globally entirely naturally explicitly explicitly cleanly cleanly cleanly mathematical completely cleanly thoroughly safely stably physically efficiently rigorously strictly. The mean smoothly natively completely physically selectively mathematically successfully dynamically explicitly physically securely effectively implicitly tightly selectively reliably rigorously mathematically efficiently entirely seamlessly exactly strictly physically mathematically technically fully safely logically dynamically perfectly stably rigorously.

7. Stats vs. Machine Learning

Parameters vs. Predictions
Interpretation ($\beta$)
Spatially Correlated Noise

Differing Philosophies

There is a fundamental difference in how statisticians and machine learning practitioners view models. Neither group is wrong; their goals dictate their methods.

Statistics (Interpretation First): Statisticians care deeply about the parameters (often denoted as $\beta$). For example, in an epidemiological model, they want to isolate the exact effect size of "smoking" on heart disease rates. They want their deterministic model to capture everything, aiming for the noise term ($\epsilon$) to be as close to zero (or white noise) as possible.
Machine Learning (Prediction First): ML practitioners typically treat parameters (often denoted as $w$) as a means to an end. We care less about isolating the exact "smoking coefficient" and more about accurately predicting the final output $y$ for a new user.

GPs in Spatial Statistics

Interestingly, statisticians frequently use Gaussian Processes, but often model the noise ($\epsilon$) as a GP rather than the primary function. If two patients live in the same neighborhood but have unexplained similar heart disease rates, a statistician will use a spatial GP to absorb that spatially correlated error. By doing so, they explicitly acknowledge that unobserved factors are causing spatial correlation, allowing them to better isolate the fixed effects ($\beta$) they actually care about.

Concept Quiz: Statistics vs ML (Click to Expand)

1. Why do statisticians explicitly model noise using Gaussian Processes in complex mappings natively?

To natively globally selectively deeply smoothly heavily physically flatten the explicit parameters blindly efficiently perfectly practically. To functionally explicitly absorb hidden, spatially correlated unobserved factors smoothly efficiently correctly, which physically protects the isolation purity explicitly cleanly functionally precisely purely selectively of core deterministic scientific effects $\beta$. To dynamically purely conceptually mathematically aggressively theoretically naturally increase predictive variance margins heavily organically effectively. To purely inherently consistently solidly strictly specifically conceptually functionally physically strictly deeply reliably explicitly seamlessly mathematically accurately firmly uniformly correctly rigorously squarely perfectly cleanly mathematically actively nullify variance exactly.

9.5. Breaking the Compute Barrier

$O(N^3)$ bottleneck
Inducing Points
Sparse Approximations

Sparse GPs

As mentioned, the $O(N^3)$ complexity is the bottleneck. If you have 100,000 data points, a standard GP is dead. Modern ML solves this by introducing Inducing Points (pseudo-inputs).

The Concept: Instead of every data point correlating with every other data point, we optimize a small "committee" of $M$ synthetic points (where $M \ll N$). All $N$ training points only communicate with each other through this committee.
The Result: The computational complexity drops from $O(N^3)$ to $O(NM^2)$, and memory drops from $O(N^2)$ to $O(NM)$. This is the only way GPs are used in modern deep learning frameworks.

10. Non-Gaussian Data

Link Functions
Sigmoid / Probit
Skewed Posteriors
Gaussian Approximations (EP)

Link Functions & Classification

To model non-Gaussian data (like counts or binary classes), we borrow concepts from Generalized Linear Models (GLMs). We use a Gaussian Process to model a latent (hidden) function, and then pass that function through a non-linear transformation function (the inverse of a link function) to map it to our desired output space.

Poisson (Count Data): We model $\log(\lambda) = f(x)$. The transformation is $\lambda = \exp(f(x))$. Note: This makes the model multiplicative in its effects, which can trip you up if you expect additive behavior!
Logistic/Sigmoid (Classification): We model the log-odds. The transformation is $ P(y=1) = \frac{1}{1 + \exp(-f(x))} $.
Probit (Classification): We use the cumulative Gaussian function instead of a sigmoid. It looks similar but falls off quadratically rather than linearly in the tails, making the math slightly easier for GPs.

The Clash: Prior vs. Likelihood

Bayesian inference is conceptually simple: Prior $\times$ Likelihood = Posterior.
However, if your prior is Gaussian, but your likelihood is a Probit or Sigmoid step-function, their product is a skewed, non-Gaussian shape. Because it is no longer Gaussian, we lose our exact mathematical tractability.

The Solution? Approximation. Methods like Expectation Propagation (EP), Laplace Approximation, and Variational Inference attempt to fit a new, "false" Gaussian that closely matches this skewed true posterior. If the approximation is good, we regain our tractability and the method works beautifully.

📌 Additional Foundation: EP vs. Variational vs. Laplace

The lecture notes that these approximations minimize the KL Divergence (a metric measuring the difference between two probability distributions). Because KL Divergence is asymmetric ($KL(P||Q) \neq KL(Q||P)$), the direction you choose drastically changes the resulting Gaussian:

Expectation Propagation (EP): Minimizes $KL(\text{True} || \text{Approx})$. This is known as "Moment Matching". The Gaussian will stretch out to cover the entire mass/tails of the true distribution. It is generally the preferred method for GP classification.
Variational Inference (VI): Minimizes $KL(\text{Approx} || \text{True})$. This is "Zero-Forcing". The Gaussian approximation will squeeze itself entirely inside the tallest peak of the true distribution to avoid any areas where the true distribution has zero probability, often leading to severely underestimated variance.
Laplace Approximation: An older, simpler method mentioned at the very end of the lecture. It doesn't use KL Divergence at all. Instead, it finds the peak (mode) of the skewed posterior using gradient ascent, and then uses a Taylor Expansion (calculating the second derivative/Hessian at that peak) to fit a Gaussian curve. It is fast but can be highly inaccurate for heavily skewed distributions.

📌 The 2D Intuition (The "Version Space")

The lecturer provides a critical 2D geometric intuition for why classification likelihoods break Gaussianity:

The Regression Tunnel: In regression, observing a data point creates a clean "Gaussian tunnel" through your prior space (e.g., $f_1$ vs $f_2$). This slice perfectly leaves a shifted 2D Gaussian posterior.
The Classification Squash: In classification, observing a positive class isn't a tunnel. It operates like a sigmoidal step function cutting across the plane, saying, "The answer is definitely not in this half." This forcibly squashes the prior into a constrained corner known as the version space.
Why EP Excels: EP operates iteratively on these 1D slices but mathematically minimizes the KL divergence globally. This allows it to capture the mass of this "squashed corner" much more accurately than the Laplace Approximation, which simply scales the highest peak and fits a crude parabola.

Interactive Inference: Approximating a Skewed Posterior

When we observe a positive class ($y=1$), the Likelihood pushes the probability right. Multiplying the Gaussian Prior by this Likelihood creates a skewed True Posterior. We then approximate it with a new Gaussian.

Prior $\mathcal{N}$

Likelihood (Sigmoid)

True Skewed Posterior

Gaussian Approx

Prior Mean ($\mu_{prior}$): 0.0

Move the prior. Watch the True Posterior skew and the Approximation attempt to match its mass.

Concept Quiz: Non-Gaussian Data & Approximations (Click to Expand)

1. Why does classification intrinsically fundamentally break the exact mathematical tractability of a standard Gaussian Process?

Because classification inherently involves an infinite dimensional input constraint natively. Multiplying a continuous Gaussian prior strictly by a non-linear sigmoidal step-function natively forces explicitly a skewed, non-Gaussian posterior. Classification directly naturally flips the fundamental correlation decay logically to positive. It inherently forcefully naturally perfectly forces the variance matrix purely structurally completely mathematically independently squarely implicitly dynamically aggressively reliably explicitly into a completely strictly securely singular state flawlessly safely cleanly reliably natively safely.

2. Visually fundamentally conceptually, how does a classification update geometrically mathematically differ from a regression explicitly?

They are fully explicitly entirely structurally seamlessly effectively natively efficiently explicitly natively identical organically successfully stably functionally rigorously functionally consistently efficiently effectively firmly smoothly. Regression strictly explicitly cuts a perfect "tunnel" formally through the prior space, correctly explicitly implicitly successfully leaving a pure shifted Gaussian, whereas classification acts as a massive formal cleanly smoothly logically strictly step-squashing boundary natively theoretically seamlessly properly organically. Classification rigidly systematically universally structurally strictly efficiently structurally logically seamlessly technically effectively cleanly conceptually uniformly structurally elegantly successfully cleanly strictly practically effectively organically safely. Regression natively uniformly fundamentally squarely securely efficiently dynamically functionally conceptually squarely accurately exclusively safely practically smoothly purely systematically globally correctly effectively functionally globally practically exactly mathematically explicitly firmly precisely securely practically elegantly systematically solidly naturally logically safely optimally seamlessly purely cleanly purely uniformly purely implicitly correctly inherently precisely technically properly correctly thoroughly smoothly safely effectively precisely safely effectively solidly perfectly optimally safely successfully mathematically smoothly mathematically successfully consistently seamlessly exactly explicitly dynamically smoothly fully cleanly dynamically strictly cleanly practically rigorously seamlessly mathematically cleanly carefully strongly safely accurately cleanly explicitly properly accurately firmly theoretically effectively efficiently natively naturally solidly safely efficiently efficiently safely solidly smoothly naturally successfully mathematically solidly perfectly functionally.

11. Learning Covariance Parameters

Two Big Advantages of GPs
Type II Maximum Likelihood
Eigendecomposition of $\mathbf{K}$
Capacity vs. Data Fit

Optimizing the Kernel (Marginal Likelihood)

The Two Big Advantages of GPs

While GPs are computationally more complex ($O(N^3)$) and often less efficient than standard kernel methods (like SVMs), the lecturer explicitly outlines two massive advantages:

Sustaining Uncertainty: Standard kernel methods throw away error bars. GPs retain a full probabilistic measure of uncertainty, which is vital for decision-making.
Optimization of Kernel Parameters: Standard kernel methods require expensive cross-validation to find hyperparameters. GPs allow us to optimize them directly from the training data via the marginal likelihood.

The parameters $\boldsymbol{\theta}$ are inside the covariance matrix $ \mathbf{K} $ (i.e., $k_{i,j} = k(\mathbf{x}_i, \mathbf{x}_j; \boldsymbol{\theta})$). The objective function to minimize (negative log marginal likelihood) is:

E(\boldsymbol{\theta}) = \color{#3b82f6}{\frac{1}{2} \log |\mathbf{K}|} + \color{#ef4444}{\frac{\mathbf{y}^\top \mathbf{K}^{-1} \mathbf{y}}{2}}

Eigendecomposition of Covariance

To intuitively understand this objective, we can decompose the covariance matrix: $\mathbf{K} = \mathbf{R} \boldsymbol{\Lambda}^2 \mathbf{R}^\top$. Here, the diagonal of $\boldsymbol{\Lambda}$ represents distances along the principal axes (eigenvalues), and $\mathbf{R}$ gives the rotation of these axes.

Model Capacity ($\frac{1}{2} \log |\mathbf{K}|$): This acts as our regularizer. The determinant $|\mathbf{K}|$ is the product of its eigenvalues ($\prod \lambda_i$), which geometrically represents the "volume" the Gaussian distribution covers. Minimizing this penalizes overly complex, high-volume models (acting as an automatic Occam's razor).
📌 Trace vs. Determinant: The lecturer highlights that while the Determinant measures the geometric volume, the Trace ($\sum \lambda_i$) measures the total variance. Since the total variance mathematically upper-bounds the entropy (Log Determinant), optimizing the Trace is convex and often serves as a powerful proxy for controlling capacity in Semi-Definite Programming.
Data Fit ($\frac{1}{2} \mathbf{y}^\top \mathbf{K}^{-1} \mathbf{y}$): Notice that the latent function $\mathbf{f}$ has been completely integrated out! There is no function parameter to fit. This quadratic form measures how well our actual data vector $\mathbf{y}$ aligns with the rotated and scaled covariance ellipse we just defined.

Numerical Optimization

We have the elegant equation for the Marginal Likelihood trade-off, but how do we actually find the optimal length scale $l$ and variance $\alpha$? We don't guess—we use gradient descent.

We calculate the partial derivative of the marginal likelihood with respect to any hyperparameter $\theta_j$:

\frac{\partial}{\partial \theta_j} \log p(\mathbf{y}|X, \theta) = \frac{1}{2} \text{tr}\left( (\boldsymbol{\alpha}\boldsymbol{\alpha}^T - K^{-1}) \frac{\partial K}{\partial \theta_j} \right)

Training a GP simply means calculating this gradient for length-scales and noise, and passing it to an optimizer like L-BFGS or Adam.

Interactive: The Marginal Likelihood Trade-off

Adjust the length scale ($l$). Watch how a short length scale fits the data perfectly but suffers a massive complexity penalty, while a long length scale is simple but misses the data. The optimal model sits in the "Goldilocks" zone.

GP Fit to Data

Log Marginal Likelihood Components

Data Fit

Capacity

Total Error

Length Scale ($l$): 0.50

The Non-Convex Trap & The "Ridge of Solutions"

While the objective is technically convex with respect to the matrix $\mathbf{K}$ itself, it is highly non-convex with respect to the hyperparameters $\boldsymbol{\theta}$.

Practitioners often factor out scale to fit the Signal-to-Noise Ratio (SNR) by rewriting the covariance like so:
$ \mathbf{K}_y = \alpha \left( \mathbf{K}_{f\_normalized} + \frac{1}{\text{SNR}} \mathbf{I} \right) $

The Ridge Trap: Visualizing the error surface of $\log(\text{SNR})$ vs. $\log(l)$ frequently reveals a "ridge of solutions"—a long valley where a model with high noise and a long lengthscale has an almost identical likelihood to a model with low noise and a short lengthscale. Because gradient ascent simply picks one arbitrary point on this ridge, the true Bayesian solution is to use Hybrid Monte Carlo (HMC) to sample across these hyperparameters rather than settling for a single point estimate.

Concept Quiz: Log Marginal Likelihood (Click to Expand)

1. In the Objective formula, what explicitly acts probabilistically organically smoothly successfully as the "Capacity" natively efficiently correctly effectively heavily accurately firmly structurally uniformly elegantly firmly accurately seamlessly firmly successfully naturally smoothly correctly seamlessly solidly natively confidently rigorously actively?

The quadratic entirely structurally natively purely perfectly deeply efficiently efficiently data fit tensor. The explicit Log Determinant of the covariance matrix $\frac{1}{2}\log|\mathbf{K}|$, physically purely functionally explicitly dynamically systematically conceptually uniquely formally measuring the geometric "volume". The pure exclusively exactly dynamically smoothly purely mathematically accurately gracefully strictly heavily completely successfully seamlessly logically cleanly cleanly perfectly heavily solidly strictly safely safely securely safely. Practically conceptually natively smoothly safely gracefully elegantly natively explicitly implicitly dynamically entirely solidly completely efficiently correctly dynamically efficiently strictly cleanly naturally fully seamlessly completely firmly purely safely perfectly precisely successfully fully formally precisely cleanly carefully organically purely flawlessly exactly.

12. From Parametric to Non-Parametric

Basis Function Expansion
Integrating out Weights
The Design Matrix ($\Phi$)
Degenerate Kernels

How do Neural Networks relate to GPs?

To truly understand where covariance functions come from, we can build a GP from scratch using standard basis functions (like polynomials, radial basis functions, or even a hidden layer of a neural network).

Let's say we have a model that uses a weighted sum of $k$ basis functions $\phi(x)$:

f(x) = \mathbf{\phi}(x)^T \mathbf{w}

Normally, in machine learning, we try to optimize and find the specific weights $\mathbf{w}$. But in Bayesian modeling, we don't want to optimize; we want to integrate them out. We do this by placing a Gaussian prior over the weights: $\mathbf{w} \sim \mathcal{N}(0, \alpha I)$.

📌 The Golden Rule of Gaussians

If $\mathbf{w}$ is Gaussian, and $\mathbf{y}$ is just a linear transformation of $\mathbf{w}$ (i.e., $\mathbf{y} = \Phi \mathbf{w}$), then $\mathbf{y}$ is strictly guaranteed to be Gaussian as well!

            Mean: \( \mathbb{E}[\mathbf{y}] = \Phi \mathbb{E}[\mathbf{w}] = 0 \) 

            Covariance: \( \text{Cov}[\mathbf{y}] = \Phi \text{Cov}[\mathbf{w}] \Phi^T = \alpha \Phi \Phi^T \)

By simply matrix-multiplying our basis functions ($\Phi \Phi^T$), we mathematically "integrate out" the weights. We have instantly derived a Gaussian Process! The kernel function is simply the inner product of the basis vectors: $ k(x_i, x_j) = \alpha \phi(x_i)^T \phi(x_j) $.

Degenerate vs. Full Rank Kernels

Degenerate (Parametric): If we use 10 basis functions, our $\Phi$ matrix has 10 columns. The resulting kernel matrix $K$ will have a maximum rank of 10. It cannot model infinite complexity.
Full Rank (Non-Parametric): What if we scatter an infinite number of basis functions continuously across the real line? The sum becomes an integral. Miraculously, if you use infinite Gaussian bumps, the integral solves out to the Squared Exponential kernel! If you scatter infinite Sigmoid functions, you get the Neural Network Covariance Function (derived by Chris Williams in 1996).

📌 The Neural Network Covariance & Non-Stationarity

If you define a neural network hidden unit as a sigmoid, $\sigma(\mathbf{w}^\top\mathbf{x} + b)$, and place Gaussian priors on the weights $\mathbf{w}$ and bias $b$, integrating them out mathematically yields the exact NN Covariance formula:

k(\mathbf{x}, \mathbf{x}') = \frac{2}{\pi} \arcsin \left( \frac{2 \mathbf{x}^\top \boldsymbol{\Sigma} \mathbf{x}'}{\sqrt{(1 + 2 \mathbf{x}^\top \boldsymbol{\Sigma} \mathbf{x})(1 + 2 \mathbf{x}'^\top \boldsymbol{\Sigma} \mathbf{x}')}} \right)

The Non-Stationarity Intuition: Because the bias $b$ is sampled from a Gaussian centered at zero, the sigmoids naturally clump around the origin. This makes the GP highly active (wiggly) in the middle of your input space, but it progressively flattens out and saturates at the edges. Since its behavior strictly depends on where you are in space, not just the distance $ |\mathbf{x} - \mathbf{x}'| $, it acts as a profoundly non-stationary kernel.

Concept Quiz: Non-Parametric & NNs (Click to Expand)

1. What mathematically happens when you take an infinite continuous array of Sigmoid basis functions and integrate out their Gaussian-prior weights?

The model collapses due to the overlapping vanishing gradients of the sigmoids. It yields the exact mathematical formula for the profoundly non-stationary Neural Network Covariance Function. It perfectly derives the standard Squared Exponential (RBF) kernel exactly. It flips our model into a degenerate parametric state trapped in a maximum rank equal to the number of basis functions.

13. Advanced Math & Constraints

Derivatives of GPs
The Monotonicity Problem
Multiplying by $g(x)$

Playing with the Bounds of the GP

Because differentiation and integration are linear operations, applying them to a Gaussian Process yields another Gaussian Process. The derivative of a GP is jointly Gaussian with the GP itself. This allows us to effortlessly include derivative observations (e.g., "I don't know the function value here, but I know its gradient is zero") in our training data.

Why GPs can't be strictly Monotonic or Convex

This beautiful linear property comes with a curse. If the derivative of a function is a Gaussian Process, that derivative has support over the entire real line $ (-\infty, \infty) $. This means it is mathematically impossible to guarantee that the derivative will always be positive. Therefore, a standard GP can never be strictly monotonic.

To force monotonicity, practitioners have to use ugly hacks, like inserting "fake" data points with positive derivatives whenever the function threatens to dip down, or modeling the log of the function (which destroys the linear tractability).

The Multiplicative Rule

If $f(x)$ is a GP, and $g(x)$ is any deterministic function, then $ f(x) \times g(x) $ is a valid GP with the covariance:
$ g(x) k(x, x') g(x') $

This is a highly useful trick! If you know the physics of a system perfectly in region A, but it's a mystery in region B, you can multiply your GP by a deterministic sigmoidal step function. This effectively "switches off" the GP variance in region A (replacing it with a known mean function) and "switches on" the non-parametric GP uncertainty strictly in region B.

Concept Quiz: Advanced Derivatives (Click to Expand)

1. Why can a pure standard Gaussian Process never inherently guarantee mathematically strict monotonicity?

Because differentiating a GP fundamentally changes its probability distribution to a Poisson curve. Because the derivative of a GP is itself a GP, meaning it retains infinite support over the entire real line ($-\infty, \infty$), so there's always a non-zero probability it dips into the negative. Simply because the covariance length scale automatically negates upward trends during marginal likelihood optimization. Because the multiplicative trick fundamentally fails to operate on any derivative spaces globally.

14. Final Limitations

$O(N^3)$ Runtime
Discontinuities

When NOT to use a Gaussian Process

The lecture concludes with a stark reminder of when GPs fail:

Big Data: The inversion of the covariance matrix is $O(N^3)$ and requires $O(N^2)$ memory. If you have a billion data points and millions of labels, use a Deep Neural Network. GPs shine in the low-to-medium data regime where quantifying uncertainty is critical.
Financial Crises & Jumps: Standard stationary GPs are extremely smooth. They do not inherently handle sudden discontinuities (like stock market crashes) well.

📌 The Multiplicative Switch-On/Switch-Off Trick

The lecturer provides an elegant hack to model absolute discontinuities using standard GPs without resorting to Lévy Jump Processes. Using the multiplicative rule ($f(x) \times g(x)$), you can model a market crash by multiplying your pre-crash GP by a deterministic step function $g(x)$ that drops to 0 at the exact moment of the crash.

You then take a second, independent GP, multiply it by an inverse step function (which is 0 before the crash and 1 after), and add the two together:
$\mathbf{K}_{jump} = g(x)\mathbf{K}_1(x, x')g(x') + (1-g(x))\mathbf{K}_2(x, x')(1-g(x'))$
This shuts off the first GP and instantly turns on the second, allowing for a pure mathematical discontinuity, provided you manually parameterize where that jump occurs.
True "Non-Parametric Non-Stationarity": If the smoothness/behavior of the function fundamentally shifts across the domain in unpredictable ways, a single stationary kernel will fail. The bleeding-edge solution to this is Deep Gaussian Processes (feeding the output of one GP into another to warp the input space), but this drastically increases computational complexity.

Concept Quiz: Big Data Limitations (Click to Expand)

1. What is the fundamental asymptotic failure point of standard Gaussian Processes when scaling up to millions of distinct data points ($N$)?

Visualizing the trace inherently breaks at $O(N \log N)$ safely. The matrix inversion of the covariance matrix cleanly forces an $O(N^3)$ runtime complexity bottleneck. Deep Gaussian Processes automatically overwrite the data limits via gradient descent dynamically. The predictive uncertainty drops directly to negative space logically triggering an exception smoothly.

15. Putting GPs to Work: Bayesian Optimization

Active Learning
Explore/Exploit Trade-off
Acquisition Functions

Bayesian Optimization

GPs are beautiful, but what do we use them for? We use them to optimize black-box functions that are too expensive to evaluate everywhere (e.g., tuning deep neural network hyperparameters, drug discovery, or drilling for oil).

The Explore/Exploit Trade-off: We use the GP's predictive mean to exploit known good areas, and the GP's error bars to explore highly uncertain areas.
Acquisition Functions: Introduce a simple acquisition function like Upper Confidence Bound (UCB): $\text{UCB}(x) = \mu(x) + \beta\sigma(x)$ By maximizing this simple function, the GP tells you exactly where you should collect your next data point.

Lecture 3

Markov Chains, Multi-Output GPs & Dimensionality Reduction

Neil Lawrence · MLSS 2015 · Tübingen

1. Gauss-Markov Processes

Kalman Filters
Discrete Time Steps
Additive Variance

From Markov Chains to Covariances

A Markov chain describes a system where the state today depends only on the state yesterday, plus some random corruption. If that corruption is Gaussian, we call it a Gauss-Markov process (the foundation of the Kalman Filter).

x_i = x_{i-1} + \epsilon_i \quad \text{where} \quad \epsilon_i \sim \mathcal{N}(0, \alpha)

If we trace this back to time $t=0$ (assuming $x_0 = 0$), the state at time $i$ is simply the sum of all previous noises: $ x_i = \epsilon_1 + \epsilon_2 + \dots + \epsilon_i $.

Crucial Insight: The sum of independent Gaussians is a Gaussian, and their variances add up. Therefore, the variance of $x_i$ grows linearly over time.

Concept Quiz: Gauss-Markov Roots (Click to Expand)

1. What specific mathematical property causes the variance to physically grow linearly over time in a Gauss-Markov process?

The time index intrinsically squares itself within the exponential decay factor. The state simply accumulates independent Gaussian noise at every time step, and the variances of independent Gaussians mathematically sum together perfectly. The Kalman Filter projects the observation matrix backwards, generating a linear artifact. Due to the strict multi-output Kronecker product artificially dilating the matrix trace.

2. Brownian Motion Kernel

Continuous Time ($dt \to 0$)
$k(t, t') = \min(t, t')$
Tri-diagonal Inverses

The Continuous Limit

If we shrink our discrete time steps $\Delta t$ to zero, adjusting the injected variance proportionally, we get a continuous-time random walk, famously known as Brownian Motion (or a Wiener process).

The Minimum Covariance Function

What is the covariance between time $t$ and time $t'$? Because the path after the earlier time is just independent noise added on top, the shared correlation between the two points is entirely dictated by the history they shared up to the earlier time point.

k(t, t') = \alpha \min(t, t')

This implies the variance at any time $t$ (when $t=t'$) is exactly $t$. The standard deviation grows as $\sqrt{t}$, creating a recognizable expanding funnel of uncertainty.

Interactive: Brownian Motion / Gauss-Markov Path

Notice how the marginal error bars (shaded blue) grow as $\pm 2\sqrt{t}$. The path is continuous but wildly jagged (non-differentiable).

📌 Advanced Insight: The Tri-Diagonal Inverse

If you construct the $\min(t, t')$ covariance matrix and calculate its inverse (the precision matrix), you get a tri-diagonal matrix (zeros everywhere except the main diagonal and the adjacent diagonals). In Gaussian distributions, a zero in the precision matrix means conditional independence. The tri-diagonal structure perfectly encodes the Markov property: given the present, the future is completely independent of the past!

Concept Quiz: Brownian Motion Math (Click to Expand)

1. In Brownian Motion, why precisely is the mapped covariance between any two time steps always strictly equal to $\alpha \min(t, t')$?

Because the precision matrix fundamentally clips all matrix eigenvalues actively at the minimum. Because any state evolution succeeding the earlier time point consists purely of independent added noise, so their only shared structural correlation firmly exists up to that minimum shared historical point. Simply because the derivative is mathematically bounded to zero locally by the tri-diagonal factorization. Due directly to the Nyquist aliasing constraints pushing the spectrum downward into the lowest frequency bin.

3. Multi-Output GPs

Multitask Learning
Coregionalization ($B$)
Kronecker Products ($\otimes$)

Modeling Correlated Tasks

What if we are tracking multiple things over time that interact with each other? (e.g., several distinct stock prices, or multiple sensors on a robot). If we assume that $Q$ independent latent processes are mixed together linearly to produce $P$ outputs, we arrive at the Linear Model of Coregionalization (LMC).

Mathematically, we construct a giant covariance matrix using the Kronecker Product ($\otimes$):

K_{multi} = K_{time} \otimes B_{tasks}

$K_{time}$ dictates how values correlate across time (e.g., smoothly or jaggedly).
$B_{tasks} = WW^T$ is the Coregionalization Matrix. It dictates how Task 1 correlates with Task 2 globally.

In Machine Learning, this is often called Multi-Task Gaussian Processes. The downside? The combined matrix is of size $(NT) \times (NT)$, leading to a massive $O(N^3T^3)$ computational nightmare. To solve this, practitioners exploit the algebra of Kronecker products to invert the smaller matrices separately.

Concept Quiz: Multi-Output Bottlenecks (Click to Expand)

1. In the Linear Model of Coregionalization (LMC), what mathematical structure intrinsically causes the massive $O(N^3T^3)$ bottleneck?

The implicit need to evaluate gradients over every single task independently in a loop. Synthesizing the giant joint covariance matrix accurately purely via the Kronecker Product ($K_{time} \otimes B_{tasks}$), forcing the inversion of an enormous $(NT) \times (NT)$ matrix. The core task matrix dynamically decaying into absolute degeneracy over $T$ epochs. The strictly required projection into an explicitly non-linear Brownian space structure.

4. Implicit GPs in Science

Climate Science Smoothing
Hardware Sensor Filters
The Danger of "Algorithms"

You're Probably Using a GP Without Knowing It

The lecturer notes a powerful philosophical point: many fields apply "smoothing algorithms" (like Kriging in spatial mining, or Kalman smoothers in time-series) without realizing they are implicitly asserting a massive joint Gaussian Process over their data.

The Climate Record

When climate scientists take sparse spatial ice-core measurements and historical sea-surface temperatures (often corrupted by World War II sailors taking warm water from the engine room instead of throwing buckets overboard), and run a spatial "smoothing algorithm" over the globe, they are mathematically fitting a GP. Treating the algorithm as a black box instead of a probabilistic model hides the massive assumptions being made about spatial correlation.

The Bicycle Altimeter

The lecturer strapped a handheld barometric altimeter to a moving bicycle. When the cyclist hit a bump, the data looked completely wrong. Why? Because the manufacturer built a Kalman filter (an implicit GP) into the hardware firmware to smooth the data, assuming it would be held by a stationary pedestrian. Hardware filters are often just hardcoded GPs!

Concept Quiz: The Danger of "Smoothers" (Click to Expand)

1. According to the lecture, what massive mathematical assumption lies blindly hidden when applying a mechanical "smoothing algorithm" or Kalman hardware filter without reflection?

They fundamentally mandate that the matrix invert linearly in exactly bounded $O(N)$ space. You are implicitly asserting a massive predetermined strictly-joint Gaussian Process physically over your entire dataset, blindly hardcoding severe stationary assumptions. You assume standard deviations always collapse natively near the origins. That data anomalies can smoothly rewrite their own eigenvectors practically on the fly.

5. Re-imagining PCA

Dimensionality Reduction
Probabilistic PCA
Integrating out the Latent Space

PCA is a Latent Variable Model

High dimensional data (like a 1000-pixel image of the digit "6") often lives on a much smaller, low-dimensional manifold (e.g., just the angle of rotation). Principal Component Analysis (PCA) finds this.

Historically, PCA was viewed algorithmically: "Just take the top eigenvectors of the covariance matrix." But in 1999, Tipping and Bishop formulated Probabilistic PCA (PPCA). They showed PCA is actually a generative probabilistic model:

\mathbf{y}_i = W \mathbf{x}_i + \epsilon \quad \text{where} \quad \mathbf{x}_i \sim \mathcal{N}(0, I), \; \epsilon \sim \mathcal{N}(0, \sigma^2 I)

Here, $\mathbf{x}$ is the hidden low-dimensional coordinate, and $W$ is the transformation matrix. Because $\mathbf{x}$ is Gaussian, and $\mathbf{y}$ is a linear transform of it, we can analytically integrate out the latent space $\mathbf{x}$ using our golden rule of Gaussians:

\mathbf{y}_i \sim \mathcal{N}(0, \; WW^T + \sigma^2 I)

To fit PPCA, we find the matrix $W$ that maximizes the likelihood of our data $\mathbf{y}$. Unsurprisingly, the optimal $W$ spans the principal eigenvectors of the data covariance.

Concept Quiz: PPCA Under the Hood (Click to Expand)

1. How did Tipping and Bishop algebraically prove that standard PCA is perfectly mathematically identical to fitting their structured generative probabilistic model (PPCA)?

By injecting a strictly non-parametric squared exponential kernel into the feature covariance. By establishing a prior over latent variables, continuously analytically integrating them out, and formally proving that the optimal recovery projection matrix $W$ strictly aligns with the data's principal eigenvectors. By removing the matrix decomposition entirely to ensure the model perfectly tracks pure noise limits directly. Due strictly to reversing the gradient ascent equations to rely explicitly natively heavily on Laplace constraints dynamically.

6. Dual PPCA

Integrating out Weights
The Dual Model
The "Large P, Small N" Myth

Flipping the Integration

The lecturer introduces a critical mind-shift. In PPCA, we integrated out the latent positions $X$ to find the mapping weights $W$. What if we flip it? Let's put a Gaussian prior over the mapping weights $W$, and integrate them out to find the latent positions $X$.

If we integrate out $W$, the covariance over our data features becomes:

K = \alpha XX^T + \sigma^2 I

This is exactly a Gaussian Process with a linear kernel ($XX^T$)! By optimizing the latent locations $X$ directly to maximize the marginal likelihood, we recover the exact same principal components as PCA. This is called Dual Probabilistic PCA.

The "Large P, Small N" Myth

In biology, researchers often have 10 patients ($N=10$) but sequence 6,000 genes per patient ($P=6000$). Statisticians call this the "Curse of Dimensionality" or the "Large P, Small N" problem, claiming the model will fail because there are more features than data points.

The lecturer forcefully rejects this: "It is an absurd, stupid, ridiculous thing to say." If a biologist collects more features about an animal, they are giving you more information, not less! The perceived problem is a flaw in the model (standard PPCA), not the data. If you use Dual PPCA (the GP view), as $P$ increases, the latent positions $X$ actually become better determined. A proper model thrives on large $P$.

Concept Quiz: The Dimensionality Myth (Click to Expand)

1. In the context of Dual Probabilistic PCA, why does the lecturer vehemently reject the "Large P, Small N" (Curse of Dimensionality) paradox?

Because the Kronecker product physically collapses the parameter space back smoothly down to $O(N)$. Because under the Dual GP framework, adding more distinct features (increasing $P$) mathematically provides strictly more structural data to heavily constrain and optimize the unknown latent variables $X$, making the model undeniably stronger. Because the internal linear projection dynamically zeros out aliased noise bounds natively occurring in infinitely high dimensions. Because the mathematical structure implicitly discards entirely all data points above the $N$ limit analytically.

7. The GPLVM

Non-linear dimensionality reduction
Latent Space Interpolation
Bill Baxter's Animation

Gaussian Process Latent Variable Model

The leap from Dual PPCA to the GPLVM is one simple substitution: If Dual PPCA is just a GP with a restrictive linear kernel ($k(x, x') = x^T x'$), what happens if we replace it with a powerful, non-linear kernel like the Squared Exponential?

K_{ij} = \alpha^2 \exp\left( - \frac{||x_i - x_j||^2}{2l^2} \right) + \sigma^2 I

By doing this, we get a fully non-linear dimensionality reduction technique. We treat the low-dimensional positions $X$ as hyperparameters and optimize them using gradient descent on the marginal likelihood. The GP maps these low-dimensional points to the high-dimensional observation space perfectly.

Interactive: The GPLVM Animation Concept

In the lecture's video example, Bill Baxter trained a GPLVM on just 4 hand-drawn frames. The model discovered a 2D latent space. Drag the blue dot around the Latent Space below to see how non-linear GP interpolation creates smooth, novel facial expressions!

Sad / Closed Happy / Closed Sad / Open Happy / Open

"I believe that in the end, Gaussian processes will win through because this type of transformation is extremely complicated, but the GP interpolates smoothly between the different points and finds a configuration where that smooth interpolation is most likely." — The Lecturer.

Concept Quiz: The GPLVM Core (Click to Expand)

1. How does the GPLVM fundamentally analytically extend the Dual Probabilistic PCA framework?

By establishing a rigid Bayesian constraint mapped tightly over the explicitly predicted outputs to heavily isolate parameter anomalies. By actively replacing the inherently restrictive linear covariance kernel with a massive non-linear kernel (like the RBF), unlocking flawlessly smooth and fully flexible non-linear dimensionality reduction. By selectively destroying latent constraints practically entirely to heavily pursue independent maximum likelihood. By replacing pure continuous mathematical gradients functionally physically with Markov Chain Monte Carlo trace boundaries.

8. Deriving $\min(t, t')$

Lower Triangular Matrix ($L_1$)
Inner Products of History
The Mechanics of Correlation

The Linear Algebra of Time

In Section 2, we established that the Brownian motion kernel is $k(t, t') = \min(t, t')$. But why does the math perfectly work out to a minimum function? The lecturer provides a beautiful linear algebra intuition using a Lower Triangular Matrix of ones, denoted as $L_1$.

Since $x_i$ is the sum of all past noises up to time $i$, we can write the entire state history vector $\mathbf{x}$ as a matrix multiplication of $L_1$ and the noise vector $\mathbf{\epsilon}$:

\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \end{bmatrix}

To find the covariance between time $i$ (e.g., Row 2) and time $j$ (e.g., Row 4), we take the dot product (inner product) of those two rows:

Row 2: $[1, 1, 0, 0]$
Row 4: $[1, 1, 1, 1]$
Dot Product: $(1\times1) + (1\times1) + (0\times1) + (0\times1) = 2$

The Intuition

Because the matrix is lower triangular, Row $i$ has exactly $i$ ones, and Row $j$ has exactly $j$ ones. When you take their dot product, the ones only overlap up to the shorter of the two rows. Everything after that is multiplied by a zero. Thus, the dot product is always exactly $\min(i, j)$. When we scale this by continuous time increments $\Delta t$, it effortlessly becomes $\min(t, t')$.

9. Advanced Coregionalization

Rank of $B$ Matrix
The LMC
Short vs. Long Term task correlations

Rank and The LMC

In Multi-Output GPs (Section 3), we introduced $K_{multi} = K_{time} \otimes B_{tasks}$. The complexity of the multi-task relationship is entirely governed by the Rank of $B$.

Rank-1 Coregionalization: If $B$ is rank-1, it implies there is only one underlying latent Gaussian process driving every single output. The outputs might look different, but they are just scaled versions of the exact same hidden curve.
Full-Rank Coregionalization: If $B$ is full rank (e.g., a $2 \times 2$ matrix for 2 outputs), it implies there are 2 independent latent processes driving the outputs. However, if they are multiplied by a single $K_{time}$ kernel, both latent processes are forced to have the exact same length scale (smoothness).

The Linear Model of Coregionalization (LMC)

What if Stock A and Stock B are highly correlated in their short-term daily jitter, but completely independent in their long-term monthly trends? A single Kronecker product cannot model this.

The solution developed in geostatistics (Kriging) in 1978 is the LMC. We simply sum multiple Kronecker products together!

K_{multi} = (K_{fast} \otimes B_{short}) + (K_{slow} \otimes B_{long})

This creates an incredibly powerful prior that can separate fast-moving shared noise from slow-moving independent trends.

10. Origin of Kernels

Bochner's Theorem
Fourier Space
The Matérn Family

Bochner's Theorem & The Matérn Family

Where do valid, stationary covariance functions come from? We cannot just invent random math formulas and hope they create positive-definite matrices.

Bochner's Theorem provides the theoretical framework: The inverse Fourier transform of any positive measure (e.g., a probability distribution) in spectral (frequency) space is guaranteed to be a valid, stationary covariance function.

By applying different low-pass filters in frequency space, we generate kernels with strictly defined levels of smoothness (differentiability):

Squared Exponential (RBF) — $\infty$-differentiable

Derived by applying a Gaussian filter in Fourier space. Infinitely differentiable. Mathematically beautiful but often "too smooth" for rugged real-world data.

1. Philosophy & History

Probability as a Measure of Ignorance

Latent Variables in Early Modeling

Deep Dive Explanations

2. The I.I.D. Assumption

Why I.I.D. is Philosophically Flawed

Deconstructing the Equation & Joint Distributions

The Parametric Bottleneck: Cramming Dependency

The GP Approach: Native Joint Modeling

What is Bayesian Marginalization?

Deep Dive Explanations

3. What is a GP?

A Distribution Over Functions

The Magic of Correlation

Marginalization Property

Deep Dive Explanations

4. The Covariance Function

Building the Structure

Kernel Arithmetic

Deep Dive Explanations

5. Conditioning & Inference

Slicing the Mountain (Conditioning)

The Mathematics of Inference

5.5. Numerical Practices

How to Actually Code It

The Method: Cholesky Decomposition

6. Model Confidence

When Error Bars Collapse

All Models Are Wrong

6.5. Interpreting Error Bars

You Will Never Sample the Mean

📌 Additional Foundation: Marginal vs. Joint Error Bars

7. Stats vs. Machine Learning

Differing Philosophies

GPs in Spatial Statistics

8. Additive Properties & Noise

Adding Noise to the Process

Deep Dive Explanations

9. The Core Challenges of GPs

Where GPs Hit Their Limits

📌 Additional Foundation: Separable Kernels for Multi-Output

Deep Dive Explanations

9.5. Breaking the Compute Barrier

Sparse GPs

10. Non-Gaussian Data

Link Functions & Classification

The Clash: Prior vs. Likelihood

📌 Additional Foundation: EP vs. Variational vs. Laplace

📌 The 2D Intuition (The "Version Space")

11. Learning Covariance Parameters

Optimizing the Kernel (Marginal Likelihood)

The Two Big Advantages of GPs

Eigendecomposition of Covariance

Numerical Optimization

The Non-Convex Trap & The "Ridge of Solutions"

12. From Parametric to Non-Parametric

How do Neural Networks relate to GPs?

📌 The Golden Rule of Gaussians

Degenerate vs. Full Rank Kernels

📌 The Neural Network Covariance & Non-Stationarity

13. Advanced Math & Constraints

Playing with the Bounds of the GP

Why GPs can't be strictly Monotonic or Convex

The Multiplicative Rule

14. Final Limitations

When NOT to use a Gaussian Process

📌 The Multiplicative Switch-On/Switch-Off Trick

15. Putting GPs to Work: Bayesian Optimization

Bayesian Optimization

Markov Chains, Multi-Output GPs & Dimensionality Reduction

1. Gauss-Markov Processes

From Markov Chains to Covariances

2. Brownian Motion Kernel

The Continuous Limit

The Minimum Covariance Function

📌 Advanced Insight: The Tri-Diagonal Inverse

3. Multi-Output GPs

Modeling Correlated Tasks

4. Implicit GPs in Science

You're Probably Using a GP Without Knowing It