Ali Mikaeili
Gaussian Processes: Intuition & Framework
Neil Lawrence · MLSS 2015 · Tübingen

1. Philosophy & History

  • Laplace's Demon
  • Determinism vs. Ignorance
  • Gauss & Ceres Orbit
  • Least Squares & Latent Variables

Probability as a Measure of Ignorance

In 1814, Pierre-Simon Laplace described what is now known as Laplace's Demon: an intellect that knows the precise location and momentum of every atom in the universe at one instant, and could therefore compute the entire past and future (a deterministic universe).

Laplace's actual point, however, was that we never have this circumstance. Because we cannot track every "simple molecule of air," we must use probability. Probability is relative in part to our ignorance, and in part to our knowledge. We don't need things to be fundamentally stochastic to use probability; we use it to model the unknown.

Latent Variables in Early Modeling

When Gauss predicted the orbit of the dwarf planet Ceres using Least Squares, he was building on Laplace's framework. Laplace suggested adding an additional unknown (a latent variable, \(\epsilon\)) to represent the difference between the model's prediction and the observed data:

\( y_i = mx_i + c + \epsilon_i \)

He placed a probability distribution over this ignorance. Gauss later solidified this by using what we now call the Gaussian density for these variables.

The Underdetermined System: When fitting a line (2 parameters: \(m, c\)) to 1 point, it's an underdetermined system. Bayesian thinking allows us to solve this by placing probability distributions over the parameters (\(m\) and \(c\)) themselves, handling our uncertainty probabilistically.
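As a concrete illustration (not from the lecture), here is a minimal sketch of Bayesian linear regression with a single data point: a unit Gaussian prior over \((m, c)\) plus an assumed noise variance turn an algebraically unsolvable problem into a well-defined posterior. All numbers are illustrative choices.

```python
import numpy as np

# One observation (x=1, y=2) for the 2-parameter model y = m*x + c:
# algebraically underdetermined (infinitely many exact fits).
X = np.array([[1.0, 1.0]])       # design row [x, 1] for parameters (m, c)
y = np.array([2.0])

sigma2 = 0.1                     # assumed observation-noise variance
prior_cov = np.eye(2)            # unit Gaussian prior over (m, c)

# Conjugate Gaussian posterior over the parameters:
post_cov = np.linalg.inv(np.linalg.inv(prior_cov) + X.T @ X / sigma2)
post_mean = post_cov @ X.T @ y / sigma2

# The posterior is a proper distribution even though m + c = 2 alone
# has infinitely many solutions: the prior breaks the tie.
print(post_mean)                 # both parameters pulled toward m + c ≈ 2
```

The prior plays the role of the "extra equation" the data cannot supply, so the posterior stays well defined no matter how few points we observe.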

Concept Quiz: Philosophy & Underdetermined Systems

1. In an underdetermined system with fewer data points than parameters, how does Bayesian modeling conceptually resolve the ambiguity?

2. What is the fundamental utility of applying probability mathematically to deterministic physical systems?

3. In the historic linear construct \( y_i = mx_i + c + \epsilon_i \), what does the explicit variable \(\epsilon_i\) denote operationally?

4. If planetary motion is fundamentally deterministic, why can't deterministic equations model it flawlessly?

Deep Dive Explanations

Why do we explicitly map a probability density onto the variable \(\epsilon_i\)?
Because \(\epsilon_i\) mathematically corrals all the exact deviations between our simplistic idealized model and reality. Since we don't know the exact physical micro-causes of those deviations, we architect our ignorance mathematically by giving the deviation a defined probabilistic shape (most frequently, a Gaussian density).
How does Bayesian reasoning resolve underdetermined systems mathematically?
A deterministic algebraic system with fewer constraints than unknowns admits infinitely many valid solutions. Bayesian inference replaces the search for a single solution with a posterior distribution over all of them, weighting each candidate by how plausible it is under the prior and the observed data.
What defines the boundary between true "Stochasticity" and probability as pure "Ignorance"?
True stochasticity asserts that nature is fundamentally rolling physical dice (quantum mechanics). Modeling ignorance implies that the universe might act completely deterministically, but because humans lack the capacity or data points to compute every variable simultaneously, we use probability as a mathematical proxy for those missing components.
How does replacing scalar parameters with distributions change the modeling hierarchy?
It transforms static equations into dynamic belief structures. Parameters aren't "locked"; they remain fluid random variables. The raw observed data acts explicitly as evidence that dynamically forces the probability distributions of those parameters to shift, shrink, and update themselves sequentially.

2. The I.I.D. Assumption

  • Independent and Identically Distributed
  • Model vs. Algorithm
  • Modeling Joint Distributions

Why I.I.D. is Philosophically Flawed

The standard approach in machine learning is to assume data is I.I.D. (Independent and Identically Distributed). Under this assumption, the joint probability of training data \(\mathbf{Y}\) and test data \(\mathbf{Y}^*\) factors into independent products:

\( P(\mathbf{Y}, \mathbf{Y}^*) = \prod_{i} P(y_i) \prod_{j} P(y^*_j) \)

Deconstructing the Equation & Joint Distributions

A Joint Distribution \( P(\mathbf{Y}, \mathbf{Y}^*) \) simply asks: "What is the probability of observing this specific training data AND this specific test data happening together at the same time?"

The equation above shows what happens mathematically under the I.I.D. assumption: the joint probability shatters. The large \(\prod\) symbol represents multiplication. It claims that the probability of seeing all the data is just the probability of seeing each point individually, multiplied together. This means seeing \(y_1\) gives you zero mathematical information about \(y_2\). It effectively treats data as completely disconnected, random white noise.

The Parametric Bottleneck: Cramming Dependency

If the I.I.D. assumption explicitly severs the raw mathematical correlation between training data \(\mathbf{Y}\) and test data \(\mathbf{Y}^*\), how do parametric models (like deep neural networks) actually predict anything?

They cram the dependency into deterministic parameters. Because the I.I.D. math refuses to link \(\mathbf{Y}\) and \(\mathbf{Y}^*\) directly, a neural network must use the training data to optimize a fixed set of scalar numbers (the network weights, \(w\)). Once those weights are learned, the training data is thrown away, and the model relies entirely on the fixed weights to predict \(\mathbf{Y}^*\). The weights act as a bottleneck, compressing all structural learning and real-world correlations into a rigid matrix of static, deterministic numbers.

The GP Approach: Native Joint Modeling

Gaussian Processes reject the I.I.D. assumption entirely. They refuse to throw away the data or use scalar weights as a bottleneck. Instead, they explicitly model the full joint distribution natively.

By defining exactly how correlated our training and test data are based on their geometric relationships, a GP directly calculates \(P(\mathbf{Y}^* | \mathbf{Y})\)—the conditional probability of the unknown test data given the known training data.

What is Bayesian Marginalization?

To manipulate these massive joint distributions gracefully, GPs rely heavily on Bayesian Marginalization. This is the mathematical process of taking a massive joint distribution of many variables, and "integrating out" (summing over) all the unknown latent variables you don't care about. Because Gaussians have a magical marginalization property, doing this leaves you with exactly the isolated probability shape for the specific variables you want to predict, without losing any structural information.

Concept Quiz: The I.I.D. Paradox

1. Under the I.I.D. assumption, what happens to the mathematical link between the test variables and the training data?

2. If the I.I.D. assumption severs the direct correlation between \(\mathbf{Y}\) and \(\mathbf{Y}^*\), how do parametric architectures bridge that gap to predict test outputs?

3. Given its philosophical limitations, why is the I.I.D. assumption still ubiquitous in contemporary machine learning?

4. How do Gaussian Processes depart from the I.I.D. convention when making predictions?

Deep Dive Explanations

Why does the lecturer categorize the I.I.D. assumption as a model of pure "noise"?
If variables are truly Independent and Identically Distributed, perfectly observing variable 1 supplies zero information for predicting variable 2. That total absence of inter-point structure is exactly the definition of white noise.
What capability is sacrificed by applying I.I.D. strictly to sequential data?
The ability to model time. Each timestep is treated in isolation, erasing the temporal dependence in which yesterday's state directly influences today's.
Without I.I.D., what gives Gaussian Processes their interpolative predictive power?
The joint Gaussian covariance structure, which encodes the geometric correlations linking all known and unknown points. Conditioning on the observed points then pins down the distribution of the unknown ones.
How does a GP's "data-coupled" approach differ from the parametric "scalar weight bottleneck"?
Parametric models compress the training data's structure into a fixed set of scalar weights and then discard the data. A GP keeps the training inputs and exploits their geometric relationships directly at prediction time.

3. What is a GP?

  • Distribution over functions
  • High-dimensional Gaussians
  • Hallucinating shape from correlation
  • Marginalization Property

A Distribution Over Functions

Instead of putting a probability distribution over parameters (like \(m\) and \(c\)), a Gaussian Process puts a probability distribution over the entire class of functions. A GP is formally defined as a collection of random variables, any finite number of which have a joint Gaussian distribution.

The Magic of Correlation

If you sample 25 independent points from a standard Gaussian, you get noise. But if you sample 25 points from a Gaussian where neighboring points are highly correlated, you begin to hallucinate a smooth shape. This is the essence of a GP.
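This experiment can be sketched in a few lines (the seed, grid, and length scale are arbitrary choices): draw 25 points independently, then draw 25 points whose covariance decays with distance, and compare.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 25)

# Squared exponential covariance: nearby inputs are highly correlated.
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / 0.1**2)

white = rng.standard_normal(25)                             # 25 independent draws: noise
smooth = np.linalg.cholesky(K + 1e-9 * np.eye(25)) @ white  # 25 correlated draws

# `white` jumps around at random; `smooth` traces a coherent curve,
# because the covariance drags neighboring values up and down together.
# (The marginal over any subset of points is just the matching sub-matrix of K.)
print(np.round(smooth, 2))
```

The only difference between the two samples is the covariance matrix; the "shape" is hallucinated entirely by the correlation structure.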

Marginalization Property

GPs work mathematically because of marginalization. If you have a joint distribution of 25 points, finding the marginal distribution of just 2 points is trivial in Gaussians: you just take the relevant sub-matrix of the covariance matrix. This allows us to scale to infinite dimensions.

Interactive GP Sampler (Hallucinating Shape): adjusting the correlation between points creates smooth functions from random Gaussian noise. One control (the length scale) sets how far correlation stretches on the X-axis; the other (the variance) sets the amplitude on the Y-axis.

Concept Quiz: Mechanism and Marginalization

1. In contrast to parametric models, which place distributions over weights, what does a Gaussian Process place its probability distribution over?

2. What is the formal definition of a Gaussian Process?

3. When sampling high-dimensional Gaussian vectors, how do we make the model "hallucinate" visually coherent curves?

4. What algebraic property lets Gaussian models handle effectively infinite-dimensional functions without becoming computationally intractable?

Deep Dive Explanations

How does altering the covariance rules "hallucinate shape"?
Drawing independent random numbers produces unstructured scatter. Enforcing strong covariance between nearby input points forces neighboring values to move up and down together, so the samples trace continuous curves instead of disjointed points.
Why is the marginalization property so critical?
Computing with a literally infinite-dimensional object is impossible. Marginalization guarantees that extracting any finite subset of dimensions from the (conceptually infinite) joint distribution stays consistent: you simply take the corresponding sub-matrix of the covariance.
How does modeling functions differ from modeling parameters?
Fixing a small set of parameters locks the model into one rigid family of shapes (e.g., straight lines). A GP drops that restriction: the covariance imposes only local constraints, allowing flexible, smoothly curving functions.
What governs how uncertainty grows away from the observed data?
The kernel ties correlation to geometric proximity. As a test point moves far from the training inputs, its correlation with them decays toward zero, and the predictive distribution reverts to the prior's baseline variance.

4. The Covariance Function

  • Kernel Matrix
  • Squared Exponential Kernel
  • Positive Definite requirement

Building the Structure

Where does the correlation matrix come from? We build it using a Covariance Function (or Kernel). Given any two inputs \(x_1\) and \(x_2\), this function outputs their covariance (how correlated their \(y\) values should be).

\( k(x_i, x_j) = \alpha^2 \exp\left( - \frac{(x_i - x_j)^2}{2l^2} \right) \)
  • Distance-based: If \(x_i\) and \(x_j\) are identical, distance is 0, \(e^0 = 1\), correlation is max (\(\alpha^2\)). As distance increases, correlation drops toward 0.
  • Length Scale (\(l\)): Defines "how close is close". Large \(l\) means distant points are still highly correlated (stretches the function horizontally). If \(l \to 0\), we get white noise. If \(l \to \infty\), we get flat constant functions.
  • Scale/Variance (\(\alpha^2\)): Determines the variance of the function around its mean; \(\alpha\) sets the typical amplitude of its deviations.

When a new test point \(x^*\) comes in, we calculate its correlation with all existing training points \(X\) by passing them through this kernel to build the test-train covariance vector.
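A short sketch of the kernel above and of building the test-train vector (the function name and example inputs are ours, not the lecture's):

```python
import numpy as np

def sq_exp(xi, xj, alpha=1.0, l=0.5):
    """k(x_i, x_j) = alpha^2 * exp(-(x_i - x_j)^2 / (2 l^2))"""
    return alpha**2 * np.exp(-(xi - xj)**2 / (2 * l**2))

X = np.array([0.0, 0.5, 1.0])    # existing training inputs
x_star = 0.4                     # incoming test point

# Test-train covariance vector: correlation of x* with each training point.
k_star = sq_exp(x_star, X)
print(k_star)                    # largest entry for the nearest input, 0.5
```

Identical inputs give the maximum covariance \(\alpha^2\); the entries then decay smoothly with squared distance, exactly as the bullet points describe.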

Kernel Arithmetic

Real-world data isn't just smooth. Kernels can be treated like building blocks to encode complex human knowledge into the model:

  • Addition (OR logic): \(K_1 + K_2\). If you add a linear kernel and a periodic kernel, the GP models a function that is either trending linearly or oscillating.
  • Multiplication (AND logic): \(K_1 \times K_2\). If you multiply a linear kernel and a periodic kernel, the GP models a function whose oscillations grow larger over time (increasing amplitude).
  • The Classic Example: the famous Mauna Loa CO2 dataset (from Rasmussen & Williams), where atmospheric CO2 is modeled remarkably well by summing a long-term smooth trend (RBF), a seasonal component (Periodic), and medium-term irregularities (Rational Quadratic).

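The addition and multiplication rules can be checked numerically; the particular kernels and grid below are illustrative choices, not the lecture's:

```python
import numpy as np

x = np.linspace(0.1, 4, 40)
D = x[:, None] - x[None, :]

rbf = np.exp(-0.5 * D**2)                      # smooth long-term trend
periodic = np.exp(-2 * np.sin(np.pi * D)**2)   # repeating structure (period 1)
linear = np.outer(x, x)                        # linearly growing component

K_add = rbf + periodic      # "OR" logic: trending smoothly or oscillating
K_mul = linear * periodic   # "AND" logic: oscillations whose amplitude grows with x

# Sums and products of valid kernels are valid kernels: both results
# remain symmetric and positive semi-definite.
print(np.linalg.eigvalsh(K_add).min(), np.linalg.eigvalsh(K_mul).min())
```

Closure under sums and products is what makes kernels composable "building blocks": any combination you write down is still a legal covariance function.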
Concept Quiz: The Covariance Function

1. How does the kernel translate physical geometry into probability bounds?

2. What structural shift occurs if the Length Scale (\(l\)) approaches absolute zero?

3. Mechanically, what explicitly does the Variance scalar \(\alpha^2\) govern?

4. Why must valid kernels be Positive Definite?

Deep Dive Explanations

Why is it explicitly called the "Squared Exponential" function?
Because it is the exponential of a negative squared distance. It has the same functional shape as a Gaussian density, but it defines covariance between inputs rather than a probability, so the name avoids overloading the word "Gaussian".
What happens when a test point \(x^*\) is passed into the kernel?
The kernel computes the covariance between \(x^*\) and every existing training input, producing the test-train cross-covariance vector that is used to condition the prediction.

5. Conditioning & Inference

  • Slicing the "Mountain"
  • Posterior Mean & Variance
  • Linear Algebra Complexity

Slicing the Mountain (Conditioning)

Imagine the joint distribution of two highly correlated points \(f_1\) and \(f_2\) as a 3D hill. If viewed from the top, it looks like an elongated diagonal ridge (contour oval).

Inference: If we observe \(f_1\), we are specifying a coordinate. We "slice" through the mountain at that coordinate. The resulting 1D slice (renormalized) gives us our updated (conditional) Gaussian distribution for \(f_2\).

Interactive 2D Conditioning: moving a slider sets the observed value of \(f_1\); paired plots show the joint 2D distribution and the resulting conditional 1D distribution \( P(f_2 | f_1) \), which shifts as the correlation changes.

The Mathematics of Inference

For predicting a single test point \(f_2\) from a training point \(f_1\) (assuming zero means):

\( \mu_{2|1} = f_1 \frac{k_{12}}{k_{11}} \)

\( \sigma^2_{2|1} = k_{22} - \frac{k_{12}^2}{k_{11}} \)

For the general multivariate case (predicting \(\mathbf{f}^*\) given training values \(\mathbf{f}\), with \(K\) the train-train covariance, \(K_{*f}\) the test-train cross-covariance, and \(K_{**}\) the test-test covariance):

\( \mu^* = K_{*f} K^{-1} \mathbf{f} \)

\( \Sigma^* = K_{**} - K_{*f} K^{-1} K_{*f}^T \)

The Computational Catch: Finding \(K^{-1}\) takes \(O(N^3)\) time, where \(N\) is the number of training data points. This cubic complexity is the primary limitation of Gaussian Processes.
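The conditioning equations can be sketched in a few lines; the toy inputs and length scale here are arbitrary choices, and we solve a linear system rather than forming an explicit inverse.

```python
import numpy as np

def sq_exp(a, b, l=0.5):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * l**2))

X = np.array([-1.0, 0.0, 1.0])          # training inputs
f = np.sin(X)                            # noise-free training values
Xs = np.array([0.5])                     # test input

K = sq_exp(X, X) + 1e-9 * np.eye(3)      # train-train (jitter for stability)
Ks = sq_exp(Xs, X)                       # test-train cross-covariance
Kss = sq_exp(Xs, Xs)                     # test-test covariance

# Gaussian conditioning: posterior mean and covariance of f* given f.
mu = Ks @ np.linalg.solve(K, f)
cov = Kss - Ks @ np.linalg.solve(K, Ks.T)

print(mu, cov)    # mean lands near sin(0.5); variance is reduced but nonzero
```

The `np.linalg.solve(K, ...)` calls are where the \(O(N^3)\) cost lives: each one is a dense solve against the full training covariance.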

Concept Quiz: Conditioning & Inference

1. In 2D, how does observing \(f_1\) alter the joint distribution (the "3D hill")?

2. What mathematical operation creates the cubic \(O(N^3)\) computational bottleneck?

3. In the formula \( \mu_{2|1} = f_1 \frac{k_{12}}{k_{11}} \), what role does \(k_{12}\) play?

4. In the multivariate equations, what does each block of the covariance matrix (train-train, test-train, test-test) represent?

5.5. Numerical Practices

  • Never invert \(K\)
  • Cholesky Decomposition
  • Log-determinants

How to Actually Code It

In theory, the predictive mean is \(\mu^* = K_{**} K^{-1} \mathbf{f}\). In practice, if you ever write inv(K) or \(K^{-1}\) in your code, your GP will crash and burn. Covariance matrices frequently become poorly conditioned (eigenvalues close to zero), and standard inversion causes catastrophic floating-point errors.

The Method: Cholesky Decomposition

Instead of inverting, we decompose the noisy covariance matrix into a lower-triangular matrix \(L\) such that \(K + \sigma^2I = LL^T\).

  • The Math: To solve for the weights \(\boldsymbol{\alpha} = (K + \sigma^2I)^{-1}\mathbf{y}\), we use fast, numerically stable forward and backward substitution:
    1. Solve \(L\mathbf{a} = \mathbf{y}\) for \(\mathbf{a}\)
    2. Solve \(L^T\boldsymbol{\alpha} = \mathbf{a}\) for \(\boldsymbol{\alpha}\)
  • The Benefit: It roughly halves the constant factor in the \(O(N^3)\) cost, acts as a built-in check of positive-definiteness (the factorization fails if the matrix is not), and yields the log-determinant (needed for the marginal likelihood) almost for free: \(\log|K + \sigma^2 I| = 2 \sum_i \log(L_{ii})\).
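The two substitution steps can be sketched with NumPy and SciPy (the data here is synthetic, and the hyperparameters are arbitrary):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 5, 20))
y = np.sin(X) + 0.1 * rng.standard_normal(20)

K = np.exp(-0.5 * (X[:, None] - X[None, :])**2)
Ky = K + 0.1**2 * np.eye(20)                   # noisy covariance K + sigma^2 I

# Never call inv(Ky). Factor once, then two triangular solves:
L = np.linalg.cholesky(Ky)                     # Ky = L L^T
a = solve_triangular(L, y, lower=True)         # step 1: solve L a = y
alpha = solve_triangular(L.T, a, lower=False)  # step 2: solve L^T alpha = a

# Log-determinant almost for free from the factor's diagonal:
logdet = 2 * np.sum(np.log(np.diag(L)))
print(logdet)
```

The factor `L` can be reused for every prediction and for the marginal likelihood, so the expensive cubic work happens only once per hyperparameter setting.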

6. Model Confidence

  • Collapsing Error Bars
  • Model vs. Reality
  • Nyquist Sampling Theory connection

When Error Bars Collapse

As you get more data points that are very close together relative to the length scale (\(l\)), the model's uncertainty variance collapses to near zero. It becomes incredibly confident about what the function does between those points.

Why does this happen? Because a stationary kernel (like the Squared Exponential) acts like a low-pass filter in Fourier space. It essentially says, "I assume there are no high frequencies in this function." Similar to the Nyquist sampling theorem, if the model assumes a strict band-limit, and you sample densely enough, the model "knows" it can perfectly reconstruct the function.

All Models Are Wrong

In reality, predicting with 100% confidence is dangerous. If your error bars collapse entirely, it usually means your data is rich enough to expose the flaws in your original model assumptions. A good probabilistic modeler knows that the GP's extreme confidence is an artifact of the chosen kernel, not necessarily a reflection of reality. Always be aware that your model might be wrong!

6.5. Interpreting Error Bars

  • Marginal vs. Joint Uncertainty
  • The "Mean" Illusion

You Will Never Sample the Mean

A critical point from the lecture: The mean function of a GP is an average, but no single generated sample will ever look like the mean. When visualizing a GP, we typically draw a solid line for the mean and shaded regions for the variance. If the mean line is perfectly flat between data points, people mistakenly think the model predicts a flat function. It doesn't! A flat mean just means the model expects the function to oscillate equally above and below that line.

📌 Additional Foundation: Marginal vs. Joint Error Bars

The shaded variance regions we plot are Marginal Error Bars. This means they only tell you the vertical uncertainty at one specific point \(x\), ignoring the rest of the function.


However, the points are highly correlated (Joint Distribution). If you know the actual function hits the very top of the error bar at \(x_1\), the correlation means it must also be near the top at a nearby \(x_2\). Marginal error bars hide this joint relationship. To truly see what the GP believes, you must draw wavy samples from the joint distribution, rather than just staring at the static marginal bounds.

Concept Quiz: Model Confidence & Error Bars

1. Why do theoretical error bars collapse to near zero when sampling very densely?

2. What dangerous visual illusion arises when plotting only the mean function?

7. Stats vs. Machine Learning

  • Parameters vs. Predictions
  • Interpretation (\(\beta\))
  • Spatially Correlated Noise

Differing Philosophies

There is a fundamental difference in how statisticians and machine learning practitioners view models. Neither group is wrong; their goals dictate their methods.

  • Statistics (Interpretation First): Statisticians care deeply about the parameters (often denoted as \(\beta\)). For example, in an epidemiological model, they want to isolate the exact effect size of "smoking" on heart disease rates. They want their deterministic model to capture everything, aiming for the noise term (\(\epsilon\)) to be as close to zero (or white noise) as possible.
  • Machine Learning (Prediction First): ML practitioners typically treat parameters (often denoted as \(w\)) as a means to an end. We care less about isolating the exact "smoking coefficient" and more about accurately predicting the final output \(y\) for a new user.

GPs in Spatial Statistics

Interestingly, statisticians frequently use Gaussian Processes, but often model the noise (\(\epsilon\)) as a GP rather than the primary function. If two patients live in the same neighborhood but have unexplained similar heart disease rates, a statistician will use a spatial GP to absorb that spatially correlated error. By doing so, they explicitly acknowledge that unobserved factors are causing spatial correlation, allowing them to better isolate the fixed effects (\(\beta\)) they actually care about.

Concept Quiz: Statistics vs ML

1. Why do statisticians often model the noise term (\(\epsilon\)) itself as a Gaussian Process?

8. Additive Properties & Noise

  • Summing Gaussians
  • Observational Noise (\(\sigma^2\))

Adding Noise to the Process

In the real world, we rarely observe the true underlying function \(f(x)\). We observe a corrupted version:

\( y_i = f(x_i) + \epsilon_i \)

If we assume \(\epsilon_i\) is independent Gaussian noise with variance \(\sigma^2\), incorporating this into our GP is trivial due to a beautiful property of Gaussian distributions:

"If you add two independent Gaussian random variables together, the sum is also distributed as a Gaussian, and its covariance is the sum of their individual covariances: \(K_{sum} = K_1 + K_2\)."

Because I.I.D. noise only correlates a point with itself, its covariance matrix is simply \(\sigma^2 I\) (the identity matrix scaled by the noise variance). Therefore, to add noise to our GP model, we simply add \(\sigma^2\) to the diagonal of our kernel matrix:

\( K_y = K_f + \sigma^2 I \)
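This additive rule can be checked empirically (sample sizes and kernel settings below are arbitrary): summing draws of a smooth process and of I.I.D. noise should give an empirical covariance close to \(K_f + \sigma^2 I\).

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma2 = 4, 0.25
x = np.linspace(0, 1, n)
Kf = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / 0.3**2)

# Many joint samples of the smooth process f and of I.I.D. noise eps:
L = np.linalg.cholesky(Kf + 1e-9 * np.eye(n))
F = (L @ rng.standard_normal((n, 100_000))).T
E = np.sqrt(sigma2) * rng.standard_normal((100_000, n))
Y = F + E                                      # y = f + eps

# The empirical covariance of y should match K_f + sigma^2 * I:
Ky_emp = np.cov(Y, rowvar=False)
print(np.abs(Ky_emp - (Kf + sigma2 * np.eye(n))).max())   # small sampling error
```

Only the diagonal is inflated: the off-diagonal structure of \(K_f\) survives untouched, exactly as the equation above states.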
Concept Quiz: Additive Properties & Noise

1. When you add two independent Gaussian random variables together, what happens to the covariance of their sum?

2. How does independent, identically distributed (I.I.D.) Gaussian noise mathematically affect the covariance matrix of a GP?

3. In the observational equation \( y_i = f(x_i) + \epsilon_i \), what does \(\epsilon_i\) represent?

4. If the latent function has covariance \(K_f\) and noise is \(\mathcal{N}(0, \sigma^2)\), what is the final covariance matrix \(K_y\)?

Deep Dive Explanations

What is the geometric intuition behind adding \(\sigma^2\) exclusively to the diagonal?
Adding variance only to the diagonal asserts that the noise at point \(x_i\) is perfectly independent of the noise at any other point \(x_j\). We inflate the uncertainty locally at every data point without implying any structural correlation or geometric pull between them.
Why can we simply sum the covariance matrices \(K_1\) and \(K_2\) when combining processes?
A fundamental property of Gaussian distributions is that the sum of two independent Gaussian random variables is also definitively Gaussian. The resulting distribution has a mean equal to the sum of the means, and a covariance perfectly equal to the sum of the individual covariances.
How do you model internal physical noise vs. sensor measurement noise?
Since internal systemic noise and external sensor noise are physically independent, you model them as separate additive components. The final covariance matrix simply sums the base process kernel, the internal noise kernel, and finally the sensor noise diagonal \(\sigma^2 I\).
What happens to your predictive posteriors if you underestimate \(\sigma^2\)?
The model will overly trust the observed data points. The GP will attempt to fit every slight deviation as true underlying signal rather than rejecting it as noise, leading to a highly erratic, "wiggly" predictive mean (severe overfitting) and irresponsibly confident error bars.

9. The Core Challenges of GPs

  • Non-Gaussian Data
  • Computational Complexity
  • Choosing Covariances
  • Multiple Outputs

Where GPs Hit Their Limits

If you are working with Gaussian Processes in practice, you will inevitably run into three (or four) primary bottlenecks:

  1. Computational Issues (\(O(N^3)\)): To make predictions, we must invert the training covariance matrix \(K\). Matrix inversion scales cubically with the number of data points. For large datasets, this becomes computationally intractable.
  2. Designing the Covariance Function: The kernel is where all the modeling happens. Choosing the wrong kernel means making the wrong assumptions about your data. For complex scenarios, designing the kernel is an art.
  3. Handling Multiple Outputs: If you are predicting the stock prices of two interacting companies, you need a kernel that takes both time and the stock index as inputs. You can use separable covariance functions (multiplying a time kernel by a stock-relationship kernel), or build richer cross-correlating functions.
  4. Non-Gaussian Data: What if your data is binary (classification) or counts (Poisson)? The elegant linear algebra of GPs breaks down when the likelihood function is no longer Gaussian.

📌 Additional Foundation: Separable Kernels for Multi-Output

How do we mathematically combine kernels for completely different types of inputs (e.g., Time \(t\) and Stock Index \(i\))?


A separable kernel simply multiplies them: \( K((t_1, i_1), (t_2, i_2)) = K_{time}(t_1, t_2) \times K_{stock}(i_1, i_2) \).

In matrix algebra, constructing a full covariance matrix this way utilizes the Kronecker Product (\(\otimes\)). This assumes that the temporal correlation (how prices change over time) is strictly independent of the inter-stock correlation (how Stock A and B relate generally). It is highly computationally efficient but fails to capture complex delayed correlations (e.g., Stock A going down today directly causing Stock B to go down tomorrow).
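A minimal sketch with two timestamps and two stocks (all correlation values invented for illustration):

```python
import numpy as np

K_time = np.array([[1.0, 0.6],
                   [0.6, 1.0]])     # correlation between two timestamps
K_stock = np.array([[1.0, 0.3],
                    [0.3, 1.0]])    # baseline correlation between stocks A and B

# Separable multi-output covariance over all (time, stock) pairs:
K_full = np.kron(K_time, K_stock)   # shape (4, 4)

# The entry for (t1, stock A) vs (t2, stock B) factorizes:
print(K_full[0, 3])                 # K_time[0,1] * K_stock[0,1] = 0.6 * 0.3 = 0.18
```

Every cross-domain entry is forced to be a pure product, which is exactly the independence restriction the paragraph above describes.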

Concept Quiz: Core Challenges of GPs

1. What is the fundamental computational complexity of making predictions with a standard Gaussian Process?

2. Which specific linear algebra operation is entirely responsible for this massive computational bottleneck?

3. How do 'separable' covariance functions construct a combined multi-output kernel for distinct input types?

4. Why does non-Gaussian data (like binary classification) break the standard elegance of GPs?

Deep Dive Explanations

Why are GPs fundamentally slow for very large datasets?
To calculate the exact predictive distribution or the marginal likelihood, we mathematically must invert the \(N \times N\) covariance matrix \(K\). Standard matrix inversion requires \(O(N^3)\) floating-point operations. For large datasets, this compute time grows excessively fast, and memory storage grows at \(O(N^2)\).
What is the hidden restriction of the Kronecker Product trick?
It rigidly assumes that the correlations across the distinct input spaces are perfectly independent. For instance, it assumes that how Stocks A and B correlate temporally across the month is independent of their general baseline relationship. It absolutely cannot capture direct, delayed cross-domain interactions.
Why is choosing the exact covariance function often considered an "art"?
The covariance function physically enforces your prior assumptions—the strict bounds of what you believe the hidden functional rules look like (smoothness, periodicity, jagged bumps). If your kernel fundamentally assumes smooth waves, it will utterly fail to model discontinuous jumps, regardless of how much data you feed it.
How do practitioners mathematically resolve skewed non-Gaussian posteriors?
They use tools like Expectation Propagation (EP) or Variational Inference. These tools force a pure Gaussian distribution tightly around the skewed true posterior mass, giving us a "surrogate" that roughly mimics the exact shape to successfully restore mathematical tractability.

9.5. Breaking the Compute Barrier

  • \(O(N^3)\) bottleneck
  • Inducing Points
  • Sparse Approximations

Sparse GPs

As mentioned, the \(O(N^3)\) complexity is the bottleneck. If you have 100,000 data points, a standard GP is dead. Modern ML solves this by introducing Inducing Points (pseudo-inputs).

  • The Concept: Instead of every data point correlating with every other data point, we optimize a small "committee" of \(M\) synthetic points (where \(M \ll N\)). All \(N\) training points only communicate with each other through this committee.
  • The Result: The computational complexity drops from \(O(N^3)\) to \(O(NM^2)\), and memory drops from \(O(N^2)\) to \(O(NM)\). This is what makes GPs practical at scale and usable inside modern deep learning frameworks.
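As a minimal sketch of the inducing-point idea, here is a Subset-of-Regressors style predictive mean (the RBF kernel choice, function names, and hyperparameters are illustrative, not from the lecture). The key point: only \(M \times M\) systems are ever solved, so the cost is \(O(NM^2)\).

```python
import numpy as np

def rbf(X1, X2, l=0.5):
    # Squared Exponential kernel between two sets of 1-D inputs
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return np.exp(-0.5 * d2 / l**2)

def sparse_gp_mean(X, y, Z, Xs, noise=0.1):
    """Subset-of-Regressors predictive mean with M inducing points Z.
    `noise` is the observation noise variance. Only M x M systems are
    solved, so the cost is O(N M^2) rather than O(N^3)."""
    Kmm = rbf(Z, Z) + 1e-8 * np.eye(len(Z))   # jitter for stability
    Knm = rbf(X, Z)
    Ksm = rbf(Xs, Z)
    A = noise * Kmm + Knm.T @ Knm             # the M x M "committee" system
    return Ksm @ np.linalg.solve(A, Knm.T @ y)
```

With a handful of well-placed inducing points, the sketch tracks a smooth target closely while scaling linearly in \(N\).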

10. Non-Gaussian Data

  • Link Functions
  • Sigmoid / Probit
  • Skewed Posteriors
  • Gaussian Approximations (EP)

Link Functions & Classification

To model non-Gaussian data (like counts or binary classes), we borrow concepts from Generalized Linear Models (GLMs). We use a Gaussian Process to model a latent (hidden) function, and then pass that function through a non-linear transformation function (the inverse of a link function) to map it to our desired output space.

  • Poisson (Count Data): We model \(\log(\lambda) = f(x)\). The transformation is \(\lambda = \exp(f(x))\). Note: This makes the model multiplicative in its effects, which can trip you up if you expect additive behavior!
  • Logistic/Sigmoid (Classification): We model the log-odds. The transformation is \( P(y=1) = \frac{1}{1 + \exp(-f(x))} \).
  • Probit (Classification): We use the cumulative Gaussian function instead of a sigmoid. It looks similar, but its tails decay like \(\exp(-f^2/2)\) rather than \(\exp(-f)\), and its Gaussian form makes some of the resulting integrals analytically tractable for GPs.
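The three transformations above can be written down directly; a small sketch (function names are ours):

```python
from math import erf, exp, sqrt

def sigmoid(f):
    """Logistic link: maps a latent log-odds value f to a probability."""
    return 1.0 / (1.0 + exp(-f))

def probit(f):
    """Probit link: the cumulative Gaussian, via the error function."""
    return 0.5 * (1.0 + erf(f / sqrt(2.0)))

def poisson_rate(f):
    """Log link for count data: the latent f models log(lambda)."""
    return exp(f)
```

All three map the unconstrained latent GP output into the valid output space (a probability in \([0,1]\), or a positive rate).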

The Clash: Prior vs. Likelihood

Bayesian inference is conceptually simple: Prior \(\times\) Likelihood = Posterior.
However, if your prior is Gaussian, but your likelihood is a Probit or Sigmoid step-function, their product is a skewed, non-Gaussian shape. Because it is no longer Gaussian, we lose our exact mathematical tractability.


The Solution? Approximation. Methods like Expectation Propagation (EP), Laplace Approximation, and Variational Inference attempt to fit a new, "false" Gaussian that closely matches this skewed true posterior. If the approximation is good, we regain our tractability and the method works beautifully.

📌 Additional Foundation: EP vs. Variational vs. Laplace

The lecture notes that these approximations minimize the KL Divergence (a metric measuring the difference between two probability distributions). Because KL Divergence is asymmetric (\(KL(P||Q) \neq KL(Q||P)\)), the direction you choose drastically changes the resulting Gaussian:

  • Expectation Propagation (EP): Minimizes \(KL(\text{True} || \text{Approx})\). This is known as "Moment Matching". The Gaussian will stretch out to cover the entire mass/tails of the true distribution. It is generally the preferred method for GP classification.
  • Variational Inference (VI): Minimizes \(KL(\text{Approx} || \text{True})\). This is "Zero-Forcing". The Gaussian approximation will squeeze itself entirely inside the tallest peak of the true distribution to avoid any areas where the true distribution has zero probability, often leading to severely underestimated variance.
  • Laplace Approximation: An older, simpler method mentioned at the very end of the lecture. It doesn't use KL Divergence at all. Instead, it finds the peak (mode) of the skewed posterior using gradient ascent, and then uses a Taylor Expansion (calculating the second derivative/Hessian at that peak) to fit a Gaussian curve. It is fast but can be highly inaccurate for heavily skewed distributions.
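A toy 1-D sketch of the Laplace idea (not the full GP classification algorithm): Newton steps find the mode of a skewed posterior built from a Gaussian prior times a sigmoid likelihood, and the curvature at the mode supplies the variance.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def laplace_1d(grad, hess, f0=0.0, iters=50):
    """Laplace approximation: Newton-step to the posterior mode, then read
    the variance off the negative inverse Hessian at that mode."""
    f = f0
    for _ in range(iters):
        f -= grad(f) / hess(f)
    return f, -1.0 / hess(f)

# Toy skewed posterior: N(0,1) prior times a sigmoid likelihood for y = +1
grad = lambda f: -f + (1.0 - sigmoid(f))              # d/df log posterior
hess = lambda f: -1.0 - sigmoid(f) * (1 - sigmoid(f)) # always negative: concave
mode, var = laplace_1d(grad, hess)
```

The resulting \(\mathcal{N}(\text{mode}, \text{var})\) is the crude parabola-fit Gaussian the lecture contrasts with EP.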

📌 The 2D Intuition (The "Version Space")

The lecturer provides a critical 2D geometric intuition for why classification likelihoods break Gaussianity:

  • The Regression Tunnel: In regression, observing a data point creates a clean "Gaussian tunnel" through your prior space (e.g., \(f_1\) vs \(f_2\)). This slice perfectly leaves a shifted 2D Gaussian posterior.
  • The Classification Squash: In classification, observing a positive class isn't a tunnel. It operates like a sigmoidal step function cutting across the plane, saying, "The answer is definitely not in this half." This forcibly squashes the prior into a constrained corner known as the version space.
  • Why EP Excels: EP operates iteratively on these 1D slices but mathematically minimizes the KL divergence globally. This allows it to capture the mass of this "squashed corner" much more accurately than the Laplace Approximation, which simply scales the highest peak and fits a crude parabola.
Interactive Inference: Approximating a Skewed Posterior
When we observe a positive class (\(y=1\)), the Likelihood pushes the probability right. Multiplying the Gaussian Prior by this Likelihood creates a skewed True Posterior. We then approximate it with a new Gaussian.
Prior \(\mathcal{N}\)
Likelihood (Sigmoid)
True Skewed Posterior
Gaussian Approx

Move the prior. Watch the True Posterior skew and the Approximation attempt to match its mass.

Concept Quiz: Non-Gaussian Data & Approximations (Click to Expand)

1. Why does classification break the exact mathematical tractability of a standard Gaussian Process?

2. Geometrically, how does a classification update differ from a regression update?

11. Learning Covariance Parameters

  • Two Big Advantages of GPs
  • Type II Maximum Likelihood
  • Eigendecomposition of \(\mathbf{K}\)
  • Capacity vs. Data Fit

Optimizing the Kernel (Marginal Likelihood)

The Two Big Advantages of GPs

While GPs are computationally more complex (\(O(N^3)\)) and often less efficient than standard kernel methods (like SVMs), the lecturer explicitly outlines two massive advantages:

  1. Sustaining Uncertainty: Standard kernel methods throw away error bars. GPs retain a full probabilistic measure of uncertainty, which is vital for decision-making.
  2. Optimization of Kernel Parameters: Standard kernel methods require expensive cross-validation to find hyperparameters. GPs allow us to optimize them directly from the training data via the marginal likelihood.

The parameters \(\boldsymbol{\theta}\) are inside the covariance matrix \( \mathbf{K} \) (i.e., \(k_{i,j} = k(\mathbf{x}_i, \mathbf{x}_j; \boldsymbol{\theta})\)). The objective function to minimize (negative log marginal likelihood) is:

\(E(\boldsymbol{\theta}) = \color{#3b82f6}{\frac{1}{2} \log |\mathbf{K}|} + \color{#ef4444}{\frac{\mathbf{y}^\top \mathbf{K}^{-1} \mathbf{y}}{2}}\)

Eigendecomposition of Covariance

To intuitively understand this objective, we can decompose the covariance matrix: \(\mathbf{K} = \mathbf{R} \boldsymbol{\Lambda}^2 \mathbf{R}^\top\). Here, the diagonal of \(\boldsymbol{\Lambda}\) represents distances along the principal axes (eigenvalues), and \(\mathbf{R}\) gives the rotation of these axes.

  • Model Capacity (\(\frac{1}{2} \log |\mathbf{K}|\)): This acts as our regularizer. The determinant \(|\mathbf{K}|\) is the product of its eigenvalues (\(\prod \lambda_i\)), which geometrically represents the "volume" the Gaussian distribution covers. Minimizing this penalizes overly complex, high-volume models (acting as an automatic Occam's razor).
    📌 Trace vs. Determinant: The lecturer highlights that while the Determinant measures the geometric volume, the Trace (\(\sum \lambda_i\)) measures the total variance. Since the log-determinant is bounded above in terms of the trace, optimizing the Trace is convex and often serves as a practical proxy for controlling capacity in Semi-Definite Programming.
  • Data Fit (\(\frac{1}{2} \mathbf{y}^\top \mathbf{K}^{-1} \mathbf{y}\)): Notice that the latent function \(\mathbf{f}\) has been completely integrated out! There is no function parameter to fit. This quadratic form measures how well our actual data vector \(\mathbf{y}\) aligns with the rotated and scaled covariance ellipse we just defined.
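The objective above can be evaluated in a few lines; this sketch uses a Cholesky factor for numerical stability and drops the constant \(\frac{N}{2}\log 2\pi\) term:

```python
import numpy as np

def gp_objective(K, y):
    """E(theta) = 0.5 log|K| + 0.5 y^T K^{-1} y, computed from a Cholesky
    factor (constant terms dropped). Returns (total, capacity, data_fit)."""
    L = np.linalg.cholesky(K)
    capacity = np.sum(np.log(np.diag(L)))            # equals 0.5 * log|K|
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = 0.5 * y @ alpha
    return capacity + data_fit, capacity, data_fit
```

Returning the two terms separately makes the capacity-vs-fit trade-off directly inspectable, as in the interactive plot below.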

Numerical Optimization

We have the elegant equation for the Marginal Likelihood trade-off, but how do we actually find the optimal length scale \(l\) and variance \(\alpha\)? We don't guess—we use gradient descent.

We calculate the partial derivative of the marginal likelihood with respect to any hyperparameter \(\theta_j\):

\(\frac{\partial}{\partial \theta_j} \log p(\mathbf{y}|X, \theta) = \frac{1}{2} \text{tr}\left( (\boldsymbol{\alpha}\boldsymbol{\alpha}^\top - K^{-1}) \frac{\partial K}{\partial \theta_j} \right), \quad \text{where} \quad \boldsymbol{\alpha} = K^{-1}\mathbf{y}\)

Training a GP simply means calculating this gradient for length-scales and noise, and passing it to an optimizer like L-BFGS or Adam.
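A quick sketch of this gradient for the Squared Exponential length scale, checked against finite differences (a standard sanity check; the constant term of the log marginal likelihood is dropped since it does not affect the gradient):

```python
import numpy as np

def rbf(X, l):
    d2 = (X[:, None] - X[None, :]) ** 2
    return np.exp(-0.5 * d2 / l**2)

def log_marglik(X, y, l, noise=0.1):
    # log p(y | X, l) up to an additive constant
    K = rbf(X, l) + noise * np.eye(len(X))
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * logdet - 0.5 * y @ np.linalg.solve(K, y)

def dlogp_dl(X, y, l, noise=0.1):
    """Gradient w.r.t. the length scale, via the trace formula with
    alpha = K^{-1} y and dK/dl = K * d^2 / l^3 for the SE kernel."""
    K = rbf(X, l) + noise * np.eye(len(X))
    Kinv = np.linalg.inv(K)
    a = Kinv @ y
    dK = rbf(X, l) * (X[:, None] - X[None, :]) ** 2 / l**3
    return 0.5 * np.trace((np.outer(a, a) - Kinv) @ dK)
```

In practice this gradient (plus the analogous ones for variance and noise) is what gets handed to L-BFGS.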

Interactive: The Marginal Likelihood Trade-off
Adjust the length scale (\(l\)). Watch how a short length scale fits the data perfectly but suffers a massive complexity penalty, while a long length scale is simple but misses the data. The optimal model sits in the "Goldilocks" zone.

GP Fit to Data

Log Marginal Likelihood Components

Data Fit
Capacity
Total Error

The Non-Convex Trap & The "Ridge of Solutions"

While the objective is technically convex with respect to the matrix \(\mathbf{K}\) itself, it is highly non-convex with respect to the hyperparameters \(\boldsymbol{\theta}\).

Practitioners often factor out scale to fit the Signal-to-Noise Ratio (SNR) by rewriting the covariance like so:
    \( \mathbf{K}_y = \alpha \left( \mathbf{K}_{f\_normalized} + \frac{1}{\text{SNR}} \mathbf{I} \right) \)

The Ridge Trap: Visualizing the error surface of \(\log(\text{SNR})\) vs. \(\log(l)\) frequently reveals a "ridge of solutions"—a long valley where a model with high noise and a long lengthscale has an almost identical likelihood to a model with low noise and a short lengthscale. Because gradient ascent simply picks one arbitrary point on this ridge, the true Bayesian solution is to use Hybrid Monte Carlo (HMC) to sample across these hyperparameters rather than settling for a single point estimate.

Concept Quiz: Log Marginal Likelihood (Click to Expand)

1. In the Objective formula, which term acts as the "Capacity" penalty, and why does minimizing it control model complexity?

12. From Parametric to Non-Parametric

  • Basis Function Expansion
  • Integrating out Weights
  • The Design Matrix (\(\Phi\))
  • Degenerate Kernels

How do Neural Networks relate to GPs?

To truly understand where covariance functions come from, we can build a GP from scratch using standard basis functions (like polynomials, radial basis functions, or even a hidden layer of a neural network).

Let's say we have a model that uses a weighted sum of \(k\) basis functions \(\phi(x)\):

\( f(x) = \mathbf{\phi}(x)^T \mathbf{w} \)

Normally, in machine learning, we try to optimize and find the specific weights \(\mathbf{w}\). But in Bayesian modeling, we don't want to optimize; we want to integrate them out. We do this by placing a Gaussian prior over the weights: \(\mathbf{w} \sim \mathcal{N}(0, \alpha I)\).

📌 The Golden Rule of Gaussians

If \(\mathbf{w}\) is Gaussian, and \(\mathbf{y}\) is just a linear transformation of \(\mathbf{w}\) (i.e., \(\mathbf{y} = \Phi \mathbf{w}\)), then \(\mathbf{y}\) is strictly guaranteed to be Gaussian as well!

Mean: \( \mathbb{E}[\mathbf{y}] = \Phi \mathbb{E}[\mathbf{w}] = 0 \)
Covariance: \( \text{Cov}[\mathbf{y}] = \Phi \text{Cov}[\mathbf{w}] \Phi^T = \alpha \Phi \Phi^T \)

By simply matrix-multiplying our basis functions (\(\Phi \Phi^T\)), we mathematically "integrate out" the weights. We have instantly derived a Gaussian Process! The kernel function is simply the inner product of the basis vectors: \( k(x_i, x_j) = \alpha \phi(x_i)^T \phi(x_j) \).

Degenerate vs. Full Rank Kernels

  • Degenerate (Parametric): If we use 10 basis functions, our \(\Phi\) matrix has 10 columns. The resulting kernel matrix \(K\) will have a maximum rank of 10. It cannot model infinite complexity.
  • Full Rank (Non-Parametric): What if we scatter an infinite number of basis functions continuously across the real line? The sum becomes an integral. Miraculously, if you use infinite Gaussian bumps, the integral solves out to the Squared Exponential kernel! If you scatter infinite Sigmoid functions, you get the Neural Network Covariance Function (derived by Chris Williams in 1996).
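The degenerate case is easy to verify numerically: build \(\Phi\) from 10 Gaussian bumps, form \(K = \alpha \Phi \Phi^\top\), and confirm both the rank bound and the "integrating out" identity by Monte Carlo (all constants here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
centres = np.linspace(-3, 3, 10)
alpha = 2.0
Phi = np.exp(-0.5 * (x[:, None] - centres[None, :]) ** 2)  # 10 RBF basis fns

# "Integrating out" w ~ N(0, alpha I) yields the kernel K = alpha Phi Phi^T
K = alpha * Phi @ Phi.T

# Sanity check by sampling: draw weights, form f = Phi w, and compare the
# empirical covariance of the samples with K
W = rng.normal(0.0, np.sqrt(alpha), size=(10, 100000))
F = Phi @ W
K_emp = F @ F.T / W.shape[1]
```

The rank of \(K\) can never exceed the number of basis functions, which is exactly what "degenerate" means here.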

📌 The Neural Network Covariance & Non-Stationarity

If you define a neural network hidden unit as a sigmoid, \(\sigma(\mathbf{w}^\top\mathbf{x} + b)\), and place Gaussian priors on the weights \(\mathbf{w}\) and bias \(b\), integrating them out mathematically yields the exact NN Covariance formula:

\( k(\mathbf{x}, \mathbf{x}') = \frac{2}{\pi} \arcsin \left( \frac{2 \mathbf{x}^\top \boldsymbol{\Sigma} \mathbf{x}'}{\sqrt{(1 + 2 \mathbf{x}^\top \boldsymbol{\Sigma} \mathbf{x})(1 + 2 \mathbf{x}'^\top \boldsymbol{\Sigma} \mathbf{x}')}} \right) \)

The Non-Stationarity Intuition: Because the bias \(b\) is sampled from a Gaussian centered at zero, the sigmoids naturally clump around the origin. This makes the GP highly active (wiggly) in the middle of your input space, but it progressively flattens out and saturates at the edges. Since its behavior strictly depends on where you are in space, not just the distance \( |\mathbf{x} - \mathbf{x}'| \), it acts as a profoundly non-stationary kernel.
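A scalar-input sketch of the arcsin kernel (with \(\boldsymbol{\Sigma}\) collapsed to a single variance \(s_2\), a simplification of the formula above) makes the non-stationarity visible: two pairs of points with the same separation get very different covariances depending on where they sit.

```python
import numpy as np

def nn_kernel(x, xp, s2=1.0):
    """Neural-network (arcsin) covariance for scalar inputs, with the
    weight prior collapsed to a single variance s2 (a simplification)."""
    num = 2.0 * s2 * x * xp
    den = np.sqrt((1.0 + 2.0 * s2 * x * x) * (1.0 + 2.0 * s2 * xp * xp))
    return (2.0 / np.pi) * np.arcsin(num / den)
```

Evaluating `nn_kernel(0.1, 0.6)` versus `nn_kernel(3.0, 3.5)` (same lag of 0.5) gives very different values, whereas any stationary kernel would return the same number for both.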

Concept Quiz: Non-Parametric & NNs (Click to Expand)

1. What mathematically happens when you take an infinite continuous array of Sigmoid basis functions and integrate out their Gaussian-prior weights?

13. Advanced Math & Constraints

  • Derivatives of GPs
  • The Monotonicity Problem
  • Multiplying by \(g(x)\)

Playing with the Bounds of the GP

Because differentiation and integration are linear operations, applying them to a Gaussian Process yields another Gaussian Process. The derivative of a GP is jointly Gaussian with the GP itself. This allows us to effortlessly include derivative observations (e.g., "I don't know the function value here, but I know its gradient is zero") in our training data.

Why GPs can't be strictly Monotonic or Convex

This beautiful linear property comes with a curse. If the derivative of a function is a Gaussian Process, that derivative has support over the entire real line \( (-\infty, \infty) \). This means it is mathematically impossible to guarantee that the derivative will always be positive. Therefore, a standard GP can never be strictly monotonic.

To force monotonicity, practitioners have to use ugly hacks, like inserting "fake" data points with positive derivatives whenever the function threatens to dip down, or modeling the log of the function (which destroys the linear tractability).

The Multiplicative Rule

If \(f(x)\) is a GP, and \(g(x)\) is any deterministic function, then \( f(x) \times g(x) \) is a valid GP with the covariance:
\( g(x) k(x, x') g(x') \)

This is a highly useful trick! If you know the physics of a system perfectly in region A, but it's a mystery in region B, you can multiply your GP by a deterministic sigmoidal step function. This effectively "switches off" the GP variance in region A (replacing it with a known mean function) and "switches on" the non-parametric GP uncertainty strictly in region B.
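A sketch of the multiplicative rule with a sigmoidal switch \(g\) (parameters illustrative): the variance is suppressed on one side of the switch and intact on the other, and the result remains a valid covariance.

```python
import numpy as np

def rbf(x, l=1.0):
    return np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / l**2)

def switched_kernel(x, centre=0.0, sharp=10.0):
    """g(x) k(x,x') g(x'): a sigmoidal g suppresses the GP variance on one
    side of `centre` and leaves it intact on the other."""
    g = 1.0 / (1.0 + np.exp(-sharp * (x - centre)))
    return g[:, None] * rbf(x) * g[None, :]
```

Because the new kernel has the sandwich form \(g K g^\top\), positive semi-definiteness is preserved automatically.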

Concept Quiz: Advanced Derivatives (Click to Expand)

1. Why can a standard Gaussian Process never guarantee strict monotonicity?

14. Final Limitations

  • \(O(N^3)\) Runtime
  • Discontinuities

When NOT to use a Gaussian Process

The lecture concludes with a stark reminder of when GPs fail:

  • Big Data: The inversion of the covariance matrix is \(O(N^3)\) and requires \(O(N^2)\) memory. If you have a billion data points and millions of labels, use a Deep Neural Network. GPs shine in the low-to-medium data regime where quantifying uncertainty is critical.
  • Financial Crises & Jumps: Standard stationary GPs are extremely smooth. They do not inherently handle sudden discontinuities (like stock market crashes) well.

    📌 The Multiplicative Switch-On/Switch-Off Trick

    The lecturer provides an elegant hack to model absolute discontinuities using standard GPs without resorting to Lévy Jump Processes. Using the multiplicative rule (\(f(x) \times g(x)\)), you can model a market crash by multiplying your pre-crash GP by a deterministic step function \(g(x)\) that drops to 0 at the exact moment of the crash.

    You then take a second, independent GP, multiply it by an inverse step function (which is 0 before the crash and 1 after), and add the two together:

    \( \mathbf{K}_{jump} = g(x)\mathbf{K}_1(x, x')g(x') + (1-g(x))\mathbf{K}_2(x, x')(1-g(x')) \)

    This shuts off the first GP and instantly turns on the second, allowing for a pure mathematical discontinuity, provided you manually parameterize where that jump occurs.

  • True "Non-Parametric Non-Stationarity": If the smoothness/behavior of the function fundamentally shifts across the domain in unpredictable ways, a single stationary kernel will fail. The bleeding-edge solution to this is Deep Gaussian Processes (feeding the output of one GP into another to warp the input space), but this drastically increases computational complexity.
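The switch-on/switch-off jump construction from the 📌 box above can be sketched directly (the hard step and length scales are illustrative):

```python
import numpy as np

def rbf(x, l):
    return np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / l**2)

def jump_kernel(x, t_crash, l_pre=1.0, l_post=0.3):
    """K_jump = g K1 g' + (1-g) K2 (1-g'), with a hard step g at t_crash.
    The two regimes are independent GPs with different length scales."""
    g = (x < t_crash).astype(float)
    return (g[:, None] * rbf(x, l_pre) * g[None, :]
            + (1 - g)[:, None] * rbf(x, l_post) * (1 - g)[None, :])
```

Cross-covariances between points on opposite sides of the crash are exactly zero, which is what produces the pure discontinuity.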
Concept Quiz: Big Data Limitations (Click to Expand)

1. What is the fundamental asymptotic failure point of standard Gaussian Processes when scaling up to millions of distinct data points (\(N\))?

15. Putting GPs to Work: Bayesian Optimization

  • Active Learning
  • Explore/Exploit Trade-off
  • Acquisition Functions

Bayesian Optimization

GPs are beautiful, but what do we use them for? We use them to optimize black-box functions that are too expensive to evaluate everywhere (e.g., tuning deep neural network hyperparameters, drug discovery, or drilling for oil).

  • The Explore/Exploit Trade-off: We use the GP's predictive mean to exploit known good areas, and the GP's error bars to explore highly uncertain areas.
  • Acquisition Functions: Introduce a simple acquisition function like Upper Confidence Bound (UCB):
    \(\text{UCB}(x) = \mu(x) + \beta\sigma(x)\)
    By maximizing this simple function, the GP tells you exactly where you should collect your next data point.
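A single Bayesian-optimization step with UCB can be sketched as follows (grid search over a 1-D domain, with illustrative hyperparameters):

```python
import numpy as np

def rbf(A, B, l=0.3):
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / l**2)

def ucb_next(X, y, grid, beta=2.0, noise=1e-4):
    """One Bayesian-optimization step: fit the GP posterior on the
    observations (X, y), then return argmax of mu + beta * sigma."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(grid, X)
    mu = Ks @ np.linalg.solve(K, y)                         # exploit
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    ucb = mu + beta * np.sqrt(np.maximum(var, 0.0))         # explore
    return grid[np.argmax(ucb)]
```

Given a low observation at 0.2 and a high one at 0.8, the acquisition pushes the next query past the best observed point, into the uncertain region beyond it.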

Lecture 3

Markov Chains, Multi-Output GPs & Dimensionality Reduction

Neil Lawrence · MLSS 2015 · Tübingen

1. Gauss-Markov Processes

  • Kalman Filters
  • Discrete Time Steps
  • Additive Variance

From Markov Chains to Covariances

A Markov chain describes a system where the state today depends only on the state yesterday, plus some random corruption. If that corruption is Gaussian, we call it a Gauss-Markov process (the foundation of the Kalman Filter).

$$x_i = x_{i-1} + \epsilon_i \quad \text{where} \quad \epsilon_i \sim \mathcal{N}(0, \alpha)$$

If we trace this back to time \(t=0\) (assuming \(x_0 = 0\)), the state at time \(i\) is simply the sum of all previous noises: \( x_i = \epsilon_1 + \epsilon_2 + \dots + \epsilon_i \).

Crucial Insight: The sum of independent Gaussians is a Gaussian, and their variances add up. Therefore, the variance of \(x_i\) grows linearly over time.
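This linear variance growth is easy to confirm by simulation (the constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, steps, paths = 0.5, 200, 20000

# x_i = x_{i-1} + eps_i  is the cumulative sum of the injected noises
eps = rng.normal(0.0, np.sqrt(alpha), size=(paths, steps))
x = np.cumsum(eps, axis=1)
var_t = x.var(axis=0)    # empirical Var[x_i], which should grow like alpha * i
```

After \(i\) steps the empirical variance sits close to \(\alpha i\), exactly as the sum-of-independent-Gaussians argument predicts.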

Concept Quiz: Gauss-Markov Roots (Click to Expand)

1. What specific mathematical property causes the variance to grow linearly over time in a Gauss-Markov process?

2. Brownian Motion Kernel

  • Continuous Time (\(dt \to 0\))
  • \(k(t, t') = \min(t, t')\)
  • Tri-diagonal Inverses

The Continuous Limit

If we shrink our discrete time steps \(\Delta t\) to zero, adjusting the injected variance proportionally, we get a continuous-time random walk, famously known as Brownian Motion (or a Wiener process).

The Minimum Covariance Function

What is the covariance between time \(t\) and time \(t'\)? Because the path after the earlier time is just independent noise added on top, the shared correlation between the two points is entirely dictated by the history they shared up to the earlier time point.

$$k(t, t') = \alpha \min(t, t')$$

This implies the variance at any time \(t\) (when \(t=t'\)) is exactly \(\alpha t\). The standard deviation grows as \(\sqrt{\alpha t}\), creating a recognizable expanding funnel of uncertainty.

Interactive: Brownian Motion / Gauss-Markov Path
Notice how the marginal error bars (shaded blue) grow as \(\pm 2\sqrt{t}\). The path is continuous but wildly jagged (non-differentiable).

📌 Advanced Insight: The Tri-Diagonal Inverse

If you construct the \(\min(t, t')\) covariance matrix and calculate its inverse (the precision matrix), you get a tri-diagonal matrix (zeros everywhere except the main diagonal and the adjacent diagonals). In Gaussian distributions, a zero in the precision matrix means conditional independence. The tri-diagonal structure perfectly encodes the Markov property: given the present, the future is completely independent of the past!
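The tri-diagonal structure can be checked in a couple of lines:

```python
import numpy as np

t = np.arange(1.0, 9.0)                   # times 1..8
K = np.minimum(t[:, None], t[None, :])    # Brownian-motion covariance
P = np.linalg.inv(K)                      # the precision matrix

# keep only the tri-diagonal band; everything outside it should be zero,
# encoding the Markov conditional-independence structure
band = np.triu(np.tril(P, 1), -1)
```

Every zero outside the band says "given its two neighbours in time, this point is independent of everything else."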

Concept Quiz: Brownian Motion Math (Click to Expand)

1. In Brownian Motion, why is the covariance between any two time points equal to \(\alpha \min(t, t')\)?

3. Multi-Output GPs

  • Multitask Learning
  • Coregionalization (\(B\))
  • Kronecker Products (\(\otimes\))

Modeling Correlated Tasks

What if we are tracking multiple things over time that interact with each other? (e.g., several distinct stock prices, or multiple sensors on a robot). If we assume that \(Q\) independent latent processes are mixed together linearly to produce \(P\) outputs, we arrive at the Linear Model of Coregionalization (LMC).

Mathematically, we construct a giant covariance matrix using the Kronecker Product (\(\otimes\)):

$$K_{multi} = K_{time} \otimes B_{tasks}$$
  • \(K_{time}\) dictates how values correlate across time (e.g., smoothly or jaggedly).
  • \(B_{tasks} = WW^T\) is the Coregionalization Matrix. It dictates how Task 1 correlates with Task 2 globally.

In Machine Learning, this is often called Multi-Task Gaussian Processes. The downside? The combined matrix is of size \((NT) \times (NT)\), leading to a massive \(O(N^3T^3)\) computational nightmare. To solve this, practitioners exploit the algebra of Kronecker products to invert the smaller matrices separately.
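A sketch of the Kronecker construction, and of the algebraic shortcut \((A \otimes B)^{-1} = A^{-1} \otimes B^{-1}\) that avoids ever inverting the big matrix (sizes and \(W\) are illustrative; small jitter is added for numerical stability):

```python
import numpy as np

T, P = 25, 3
t = np.linspace(0.0, 3.0, T)
K_time = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2) + 1e-3 * np.eye(T)
W = np.array([[1.0, 0.2], [0.5, 0.7], [0.9, -0.3]])
B = W @ W.T + 1e-3 * np.eye(P)            # coregionalization matrix

K_multi = np.kron(K_time, B)              # the big (T*P) x (T*P) covariance

# Kronecker algebra: invert the small factors, never the big matrix
K_inv = np.kron(np.linalg.inv(K_time), np.linalg.inv(B))
```

Inverting a \(T \times T\) and a \(P \times P\) matrix costs \(O(T^3 + P^3)\) instead of the naive \(O(T^3 P^3)\).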

Concept Quiz: Multi-Output Bottlenecks (Click to Expand)

1. In the Linear Model of Coregionalization (LMC), what mathematical structure intrinsically causes the massive \(O(N^3T^3)\) bottleneck?

4. Implicit GPs in Science

  • Climate Science Smoothing
  • Hardware Sensor Filters
  • The Danger of "Algorithms"

You're Probably Using a GP Without Knowing It

The lecturer notes a powerful philosophical point: many fields apply "smoothing algorithms" (like Kriging in spatial mining, or Kalman smoothers in time-series) without realizing they are implicitly asserting a massive joint Gaussian Process over their data.

The Climate Record

When climate scientists take sparse spatial ice-core measurements and historical sea-surface temperatures (often corrupted by World War II sailors taking warm water from the engine room instead of throwing buckets overboard), and run a spatial "smoothing algorithm" over the globe, they are mathematically fitting a GP. Treating the algorithm as a black box instead of a probabilistic model hides the massive assumptions being made about spatial correlation.

The Bicycle Altimeter

The lecturer strapped a handheld barometric altimeter to a moving bicycle. When the cyclist hit a bump, the data looked completely wrong. Why? Because the manufacturer built a Kalman filter (an implicit GP) into the hardware firmware to smooth the data, assuming it would be held by a stationary pedestrian. Hardware filters are often just hardcoded GPs!

Concept Quiz: The Danger of "Smoothers" (Click to Expand)

1. According to the lecture, what mathematical assumption lies hidden when applying a mechanical "smoothing algorithm" or Kalman hardware filter without reflection?

5. Re-imagining PCA

  • Dimensionality Reduction
  • Probabilistic PCA
  • Integrating out the Latent Space

PCA is a Latent Variable Model

High dimensional data (like a 1000-pixel image of the digit "6") often lives on a much smaller, low-dimensional manifold (e.g., just the angle of rotation). Principal Component Analysis (PCA) finds this.

Historically, PCA was viewed algorithmically: "Just take the top eigenvectors of the covariance matrix." But in 1999, Tipping and Bishop formulated Probabilistic PCA (PPCA). They showed PCA is actually a generative probabilistic model:

$$\mathbf{y}_i = W \mathbf{x}_i + \epsilon \quad \text{where} \quad \mathbf{x}_i \sim \mathcal{N}(0, I), \; \epsilon \sim \mathcal{N}(0, \sigma^2 I)$$

Here, \(\mathbf{x}\) is the hidden low-dimensional coordinate, and \(W\) is the transformation matrix. Because \(\mathbf{x}\) is Gaussian, and \(\mathbf{y}\) is a linear transform of it, we can analytically integrate out the latent space \(\mathbf{x}\) using our golden rule of Gaussians:

$$\mathbf{y}_i \sim \mathcal{N}(0, \; WW^T + \sigma^2 I)$$

To fit PPCA, we find the matrix \(W\) that maximizes the likelihood of our data \(\mathbf{y}\). Unsurprisingly, the optimal \(W\) spans the principal eigenvectors of the data covariance.
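The "integrated-out" covariance \(WW^\top + \sigma^2 I\) can be verified by sampling from the generative model (the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
D, q, n = 5, 2, 200000
W = rng.normal(size=(D, q))
sigma2 = 0.1

# Generative model: y = W x + eps, with x ~ N(0, I) and eps ~ N(0, sigma2 I)
X = rng.normal(size=(q, n))
Y = W @ X + rng.normal(0.0, np.sqrt(sigma2), size=(D, n))

C_emp = Y @ Y.T / n                          # empirical covariance of y
C_theory = W @ W.T + sigma2 * np.eye(D)      # the integrated-out form
```

The empirical covariance of the samples matches the analytic marginal, which is the golden rule of Gaussians in action.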

Concept Quiz: PPCA Under the Hood (Click to Expand)

1. How did Tipping and Bishop show that standard PCA is equivalent to maximum-likelihood fitting of their generative probabilistic model (PPCA)?

6. Dual PPCA

  • Integrating out Weights
  • The Dual Model
  • The "Large P, Small N" Myth

Flipping the Integration

The lecturer introduces a critical mind-shift. In PPCA, we integrated out the latent positions \(X\) to find the mapping weights \(W\). What if we flip it? Let's put a Gaussian prior over the mapping weights \(W\), and integrate them out to find the latent positions \(X\).

If we integrate out \(W\), the covariance over our data features becomes:

$$K = \alpha XX^T + \sigma^2 I$$

This is exactly a Gaussian Process with a linear kernel (\(XX^T\))! By optimizing the latent locations \(X\) directly to maximize the marginal likelihood, we recover the exact same principal components as PCA. This is called Dual Probabilistic PCA.

The "Large P, Small N" Myth

In biology, researchers often have 10 patients (\(N=10\)) but sequence 6,000 genes per patient (\(P=6000\)). Statisticians call this the "Curse of Dimensionality" or the "Large P, Small N" problem, claiming the model will fail because there are more features than data points.


The lecturer forcefully rejects this: "It is an absurd, stupid, ridiculous thing to say." If a biologist collects more features about an animal, they are giving you more information, not less! The perceived problem is a flaw in the model (standard PPCA), not the data. If you use Dual PPCA (the GP view), as \(P\) increases, the latent positions \(X\) actually become better determined. A proper model thrives on large \(P\).

Concept Quiz: The Dimensionality Myth (Click to Expand)

1. In the context of Dual Probabilistic PCA, why does the lecturer vehemently reject the "Large P, Small N" (Curse of Dimensionality) paradox?

7. The GPLVM

  • Non-linear dimensionality reduction
  • Latent Space Interpolation
  • Bill Baxter's Animation

Gaussian Process Latent Variable Model

The leap from Dual PPCA to the GPLVM is one simple substitution: If Dual PPCA is just a GP with a restrictive linear kernel (\(k(x, x') = x^T x'\)), what happens if we replace it with a powerful, non-linear kernel like the Squared Exponential?

$$K_{ij} = \alpha^2 \exp\left( - \frac{||x_i - x_j||^2}{2l^2} \right) + \sigma^2 \delta_{ij}$$

By doing this, we get a fully non-linear dimensionality reduction technique. We treat the low-dimensional positions \(X\) as hyperparameters and optimize them using gradient descent on the marginal likelihood. The GP then maps these low-dimensional points smoothly to the high-dimensional observation space.

Interactive: The GPLVM Animation Concept
In the lecture's video example, Bill Baxter trained a GPLVM on just 4 hand-drawn frames. The model discovered a 2D latent space. Drag the blue dot around the Latent Space below to see how non-linear GP interpolation creates smooth, novel facial expressions!
Latent space corners: Sad/Closed, Happy/Closed, Sad/Open, Happy/Open

"I believe that in the end, Gaussian processes will win through because this type of transformation is extremely complicated, but the GP interpolates smoothly between the different points and finds a configuration where that smooth interpolation is most likely." — The Lecturer.

Concept Quiz: The GPLVM Core (Click to Expand)

1. How does the GPLVM extend the Dual Probabilistic PCA framework?

8. Deriving \(\min(t, t')\)

  • Lower Triangular Matrix (\(L_1\))
  • Inner Products of History
  • The Mechanics of Correlation

The Linear Algebra of Time

In Section 2, we established that the Brownian motion kernel is \(k(t, t') = \min(t, t')\). But why does the math perfectly work out to a minimum function? The lecturer provides a beautiful linear algebra intuition using a Lower Triangular Matrix of ones, denoted as \(L_1\).

Since \(x_i\) is the sum of all past noises up to time \(i\), we can write the entire state history vector \(\mathbf{x}\) as a matrix multiplication of \(L_1\) and the noise vector \(\mathbf{\epsilon}\):

$$ \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \end{bmatrix} $$

To find the covariance between time \(i\) (e.g., Row 2) and time \(j\) (e.g., Row 4), we take the dot product (inner product) of those two rows:

  • Row 2: \([1, 1, 0, 0]\)
  • Row 4: \([1, 1, 1, 1]\)
  • Dot Product: \((1\times1) + (1\times1) + (0\times1) + (0\times1) = 2\)

The Intuition

Because the matrix is lower triangular, Row \(i\) has exactly \(i\) ones, and Row \(j\) has exactly \(j\) ones. When you take their dot product, the ones only overlap up to the shorter of the two rows. Everything after that is multiplied by a zero. Thus, the dot product is always exactly \(\min(i, j)\). When we scale this by continuous time increments \(\Delta t\), it effortlessly becomes \(\min(t, t')\).
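The whole derivation fits in a few lines of linear algebra:

```python
import numpy as np

n = 6
L1 = np.tril(np.ones((n, n)))            # lower-triangular matrix of ones
K = L1 @ L1.T                            # covariance of x = L1 @ eps

i = np.arange(1, n + 1)
M = np.minimum(i[:, None], i[None, :])   # the min(i, j) matrix
```

Row \(i\) of \(L_1\) has exactly \(i\) ones, so each entry of \(L_1 L_1^\top\) counts the overlap of two rows, which is \(\min(i, j)\).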

9. Advanced Coregionalization

  • Rank of \(B\) Matrix
  • The LMC
  • Short vs. Long Term task correlations

Rank and The LMC

In Multi-Output GPs (Section 3), we introduced \(K_{multi} = K_{time} \otimes B_{tasks}\). The complexity of the multi-task relationship is entirely governed by the Rank of \(B\).

  • Rank-1 Coregionalization: If \(B\) is rank-1, it implies there is only one underlying latent Gaussian process driving every single output. The outputs might look different, but they are just scaled versions of the exact same hidden curve.
  • Full-Rank Coregionalization: If \(B\) is full rank (e.g., a \(2 \times 2\) matrix for 2 outputs), it implies there are 2 independent latent processes driving the outputs. However, if they are multiplied by a single \(K_{time}\) kernel, both latent processes are forced to have the exact same length scale (smoothness).

The Linear Model of Coregionalization (LMC)

What if Stock A and Stock B are highly correlated in their short-term daily jitter, but completely independent in their long-term monthly trends? A single Kronecker product cannot model this.


The solution developed in geostatistics (Kriging) in 1978 is the LMC. We simply sum multiple Kronecker products together!

$$K_{multi} = (K_{fast} \otimes B_{short}) + (K_{slow} \otimes B_{long})$$

This creates an incredibly powerful prior that can separate fast-moving shared noise from slow-moving independent trends.
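A sketch of this summed-Kronecker prior (the \(B\) matrices and length scales are illustrative):

```python
import numpy as np

def rbf(t, l):
    return np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / l**2)

t = np.linspace(0.0, 2.0, 15)
B_short = np.array([[1.0, 0.9], [0.9, 1.0]])   # strongly shared fast noise
B_long = np.eye(2)                             # independent slow trends

# LMC: sum of Kronecker products, each with its own time kernel
K_lmc = np.kron(rbf(t, 0.1), B_short) + np.kron(rbf(t, 1.0), B_long)
```

Each summand is positive semi-definite, so their sum is too, and each latent process gets its own smoothness and its own task-correlation pattern.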

10. Origin of Kernels

  • Bochner's Theorem
  • Fourier Space
  • The Matérn Family

Bochner's Theorem & The Matérn Family

Where do valid, stationary covariance functions come from? We cannot just invent random math formulas and hope they create positive-definite matrices.

Bochner's Theorem provides the theoretical framework: The inverse Fourier transform of any positive measure (e.g., a probability distribution) in spectral (frequency) space is guaranteed to be a valid, stationary covariance function.

By applying different low-pass filters in frequency space, we generate kernels with strictly defined levels of smoothness (differentiability):

Squared Exponential (RBF) — \(\infty\)-differentiable

Derived by applying a Gaussian filter in Fourier space. Infinitely differentiable. Mathematically beautiful but often "too smooth" for rugged real-world data.