Online learning with a memory-efficient Kalman Filter


Dec 16, 2025 | Updated Feb 8, 2026

Introduction: The Kalman Filter

Recently, I had the opportunity to learn about the Kalman filter: a powerful, versatile tool for tracking uncertainty in a time-varying system. Given a series of observations and an underlying dynamical model of how the state should change over time, the Kalman filter maintains a probability distribution over the state and updates it as each new observation arrives. It is no surprise, then, that NASA used it in aerospace navigation and control systems, including the famous Apollo missions to the moon!

The Kalman filter also has a use outside of control theory: Bayesian machine learning. Rather than optimizing a model with traditional methods, we can frame learning in terms of uncertainty: we start off highly uncertain that our model parameters are optimal, and use a variant of the Kalman filter to shrink that uncertainty, which optimizes the model as a by-product of state estimation. As it turns out, this is not only reasonable but also quite performant, because the update incorporates approximate second-order information${}^1$ when computing the next state. The variant we use, which linearizes a nonlinear model around the current estimate, is the Extended Kalman Filter (EKF).

The EKF algorithm

We can roughly describe the algorithm in the following steps:

“A priori” predictions${}^2$: a simple prediction of what the next EKF state will be (in our case, the state is the model parameters). $\theta$ represents the model parameters as a vector, $\Sigma$ represents the covariance matrix of the parameters, $\mathbf{Q}$ represents the process noise covariance, and $\mathbf{f}_\theta$ represents the output of the model with parameters $\theta$. To tie this to a control-theory setting, suppose we are modeling a system that has system state $\mathbf{x}$ and controls $\mathbf{u}$, and our underlying baseline model predicts an observation $\mathbf{y}$.

\[\begin{aligned} & \theta_{t+1 \mid t}=\theta_t \\ & \Sigma_{t+1 \mid t} = \Sigma_t + \mathbf{Q}\\ & \mathbf{y}_{t+1 \mid t}=\mathbf{f}_{\theta_{t+1 \mid t}}(\mathbf{x}_t, \mathbf{u}_t) \end{aligned}\]
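As a minimal sketch of the a priori step in JAX: the function name `predict`, the linear `toy_model`, and all numeric values below are assumptions made for illustration, not part of the original post.

```python
import jax.numpy as jnp

def predict(theta, Sigma, Q, f, x, u):
    """A priori step: parameters are carried over, uncertainty grows by Q."""
    theta_prior = theta                  # theta_{t+1|t} = theta_t
    Sigma_prior = Sigma + Q              # Sigma_{t+1|t} = Sigma_t + Q
    y_prior = f(theta_prior, x, u)       # y_{t+1|t} = f_theta(x_t, u_t)
    return theta_prior, Sigma_prior, y_prior

# Hypothetical model for illustration only: y = theta[0] * x + theta[1] * u
def toy_model(theta, x, u):
    return theta[0] * x + theta[1] * u

theta0 = jnp.array([1.0, 2.0])
Sigma0 = jnp.eye(2)
Q = 0.5 * jnp.eye(2)
x = jnp.array([3.0])
u = jnp.array([4.0])

theta_prior, Sigma_prior, y_prior = predict(theta0, Sigma0, Q, toy_model, x, u)
```

Because the parameters have no dynamics of their own, the only thing that changes in this step is the covariance, which is inflated by $\mathbf{Q}$ so the filter never becomes fully certain.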

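Kalman gain computation, which scales how much to change the parameters (almost like a “learning rate”). $\mathbf{F}_t$ is the Jacobian of the model output with respect to the parameters, $\mathbf{R}_t$ is the observation noise covariance, $\mathbf{S}_t$ is the innovation covariance, and crucially, $\mathbf{K}_t$ is the Kalman gain matrix.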

\[\begin{aligned} & \mathbf{F}_t=\frac{\partial \mathbf{f}}{\partial \theta} (\mathbf{x}_t, \mathbf{u}_t, \theta_t) \\ & \mathbf{S}_t=\mathbf{F}_t \Sigma_{t+1 \mid t} \mathbf{F}^\top_t + \mathbf{R}_t \\ & \mathbf{K}_t = \Sigma_{t+1 \mid t}\mathbf{F}^\top_t\mathbf{S}_t^{-1} \end{aligned}\]
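The gain computation might look like the following sketch, which uses `jax.jacobian` to obtain $\mathbf{F}_t$ by automatic differentiation. The helper name `kalman_gain`, the linear `toy_model`, and the numbers are hypothetical. It also solves a linear system rather than forming $\mathbf{S}_t^{-1}$ explicitly, which is a common numerical choice and an assumption here, not something the post specifies.

```python
import jax
import jax.numpy as jnp

def kalman_gain(f, theta, Sigma_prior, R, x, u):
    """Compute F_t, S_t and K_t for a model f(theta, x, u)."""
    # Jacobian of the model output w.r.t. theta (first argument), shape (m, n).
    F = jax.jacobian(f)(theta, x, u)
    # Innovation covariance, shape (m, m).
    S = F @ Sigma_prior @ F.T + R
    # K = Sigma F^T S^{-1}. Since S and Sigma are symmetric,
    # K^T = S^{-1} F Sigma, so we solve instead of inverting S.
    K = jnp.linalg.solve(S, F @ Sigma_prior).T
    return F, S, K

# Hypothetical model for illustration only: y = theta[0] * x + theta[1] * u
def toy_model(theta, x, u):
    return theta[0] * x + theta[1] * u

theta = jnp.array([1.0, 2.0])
Sigma_prior = jnp.eye(2)
R = jnp.array([[1.0]])
x = jnp.array([3.0])
u = jnp.array([4.0])

F, S, K = kalman_gain(toy_model, theta, Sigma_prior, R, x, u)
```

For this toy model the Jacobian is just the row vector of inputs, so the gain points the update along the direction in parameter space that the current observation is sensitive to.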

Posterior computation, which uses the Kalman gain to update the state distribution. $\mathbf{I}$ is the identity matrix, while $\mathbf{s}_t$ is the innovation: the difference between the actual observation $\mathbf{y}_{t+1}$ and the predicted observation $\mathbf{y}_{t+1 \mid t}$.

\[\begin{aligned} & \mathbf{s}_t = \mathbf{y}_{t+1} - \mathbf{y}_{t+1 \mid t} \\ & \theta_{t+1} = \theta_{t+1 \mid t} + \mathbf{K}_t\mathbf{s}_t \\ & \Sigma_{t+1} = (\mathbf{I}-\mathbf{K}_t\mathbf{F}_t) \Sigma_{t+1 \mid t} \\ & \textbf{return } \theta_{t+1}, \Sigma_{t+1} \\ \end{aligned}\]
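A sketch of the posterior step under the same conventions follows. The function name `update` and the hand-picked values of `F`, `K`, and the observation are assumptions chosen so the arithmetic is easy to follow.

```python
import jax.numpy as jnp

def update(theta_prior, Sigma_prior, K, F, y_obs, y_prior):
    """Posterior step: correct the parameters and shrink the covariance."""
    s = y_obs - y_prior                           # innovation s_t
    theta_post = theta_prior + K @ s              # theta_{t+1}
    I = jnp.eye(theta_prior.shape[0])
    Sigma_post = (I - K @ F) @ Sigma_prior        # Sigma_{t+1}
    return theta_post, Sigma_post

# Hypothetical values: two parameters, one scalar observation.
theta_prior = jnp.array([1.0, 2.0])
Sigma_prior = jnp.eye(2)
F = jnp.array([[3.0, 4.0]])
K = jnp.array([[3.0 / 26.0], [4.0 / 26.0]])
y_prior = jnp.array([11.0])
y_obs = jnp.array([37.0])

theta_post, Sigma_post = update(theta_prior, Sigma_prior, K, F, y_obs, y_prior)
```

The parameters move in proportion to the innovation, and the trace of the covariance decreases, reflecting the reduced uncertainty after seeing the observation.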

This process is repeated for each timestep $t$.

In practice

Using the JAX library in Python, the above algorithm can be implemented as follows:
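The sketch below is one minimal way to write the three steps in JAX, not necessarily the author's implementation. The linear `model`, the synthetic data, the noise settings, and the names `ekf_step` and `run_ekf` are all assumptions made for illustration. The loop over timesteps uses `jax.lax.scan`.

```python
import jax
import jax.numpy as jnp

# Hypothetical model for illustration only: y = theta[0] * x + theta[1] * u
def model(theta, x, u):
    return jnp.atleast_1d(theta[0] * x + theta[1] * u)

def ekf_step(carry, inputs, Q, R):
    theta, Sigma = carry
    x, u, y_obs = inputs
    # A priori predictions (theta_{t+1|t} = theta_t).
    Sigma_prior = Sigma + Q
    y_prior = model(theta, x, u)
    # Kalman gain, solving against S rather than inverting it.
    F = jax.jacobian(model)(theta, x, u)
    S = F @ Sigma_prior @ F.T + R
    K = jnp.linalg.solve(S, F @ Sigma_prior).T
    # Posterior computation.
    theta_new = theta + K @ (y_obs - y_prior)
    Sigma_new = (jnp.eye(theta.shape[0]) - K @ F) @ Sigma_prior
    return (theta_new, Sigma_new), theta_new

def run_ekf(theta0, Sigma0, xs, us, ys, Q, R):
    """Repeat the EKF step for every timestep; returns final state and history."""
    step = lambda carry, inputs: ekf_step(carry, inputs, Q, R)
    (theta, Sigma), history = jax.lax.scan(step, (theta0, Sigma0), (xs, us, ys))
    return theta, Sigma, history

# Synthetic, noiseless data generated from assumed "true" parameters.
key_x, key_u = jax.random.split(jax.random.PRNGKey(0))
xs = jax.random.normal(key_x, (200,))
us = jax.random.normal(key_u, (200,))
true_theta = jnp.array([2.0, -1.0])
ys = (true_theta[0] * xs + true_theta[1] * us)[:, None]   # shape (200, 1)

theta, Sigma, history = run_ekf(
    jnp.zeros(2), 10.0 * jnp.eye(2), xs, us, ys,
    Q=1e-6 * jnp.eye(2), R=1e-2 * jnp.eye(1),
)
```

Starting from a deliberately wrong guess with a large initial covariance, the estimate in `theta` should move toward `true_theta` as observations accumulate, and `history` records the parameter estimate after each timestep.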


Jai Vivek Nagaraj
