In this talk I will discuss several recent papers that develop new graph neural networks by considering their relation to continuous processes. I will discuss how graph neural networks can be derived as numerical schemes for solving differential equations, what they have to do with Perelman’s famous solution to the Poincaré conjecture, and how they are related to string theory. Graphs are fundamentally discrete structures, and at first glance treating them continuously does not appear to be a promising research direction. However, there are many examples where handling discrete objects as if they were continuous has been a catalyst for progress. Light is now known to consist of discrete photons, yet modelling physical processes with continuous differential equations such as heat diffusion produced many great breakthroughs in classical physics and chemistry. In computer science, digital images are also discrete, but continuous tools such as diffusion-based denoising are still widely used, and the question of whether digital images are best modelled continuously or discretely remains a source of great philosophical debate. Even in ML, the most common approach to handling discrete objects is to embed them into a continuous space. I will show that for graph ML too, there is much to be gained from unlocking the magnificent toolbox of continuous mathematics.
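To make the "GNNs as numerical schemes" idea concrete, here is a minimal sketch, assuming nothing beyond NumPy: heat diffusion on a graph, discretised with explicit Euler steps. The toy graph, step size, and number of steps are illustrative choices of mine, not taken from any of the papers.

```python
import numpy as np

# Heat diffusion on a graph: node features evolve as dX/dt = -L X,
# where L is the graph Laplacian. An explicit Euler discretisation of
# this ODE gives a simple message-passing-style update.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)   # adjacency matrix of a 3-node path graph
L = np.diag(A.sum(axis=1)) - A           # unnormalised graph Laplacian
X = np.random.randn(3, 4)                # node features (3 nodes, 4 channels)

tau = 0.1                                # step size of the numerical scheme
for _ in range(50):
    X = X - tau * (L @ X)                # Euler step: X_{k+1} = X_k - tau * L X_k

print(X)  # features smooth out towards a constant value per channel
```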
We introduce a framework for Continual Learning (CL) based on Bayesian inference over the function space rather than the parameters of a deep neural network. This method, referred to as functional regularisation for Continual Learning, avoids forgetting previous tasks by constructing and memorising an approximate posterior belief over the underlying task-specific function. To achieve this we rely on a Gaussian process obtained by treating the weights of the last layer of a neural network as random and Gaussian distributed. The training algorithm then sequentially encounters tasks and constructs posterior beliefs over the task-specific functions using inducing-point sparse Gaussian process methods. At each step a new task is first learnt and then a summary is constructed consisting of (i) inducing inputs – a fixed-size subset of the task inputs selected such that it optimally represents the task – and (ii) a posterior distribution over the function values at these inputs. This summary then regularises learning of future tasks through Kullback-Leibler regularisation terms. Our method thus unites approaches focused on (pseudo-)rehearsal with those derived from a sequential Bayesian inference perspective in a principled way, leading to strong results on established benchmarks.
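As a rough schematic of the objective (my notation, an approximation rather than the paper's exact bound), learning task \(t\) combines a sparse variational GP bound for the current task with KL terms that anchor the function values at previous tasks' inducing inputs to their stored summaries:
\[
\mathcal{L}_t \;=\; \mathbb{E}_{q(f_t)}\big[\log p(\mathbf{y}_t \mid f_t)\big] \;-\; \mathrm{KL}\big[q(\mathbf{u}_t)\,\|\,p(\mathbf{u}_t)\big] \;-\; \sum_{k<t} \mathrm{KL}\big[q_k(\mathbf{u}_k)\,\|\,p(\mathbf{u}_k)\big],
\]
where \(\mathbf{u}_k\) denotes the function values at task \(k\)'s inducing inputs, \(q_k\) is the posterior summary memorised after task \(k\) was learnt, and \(p\) is the GP prior induced by the current shared network.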
We introduce a scalable approach to Gaussian process inference that combines spatio-temporal filtering with natural gradient variational inference, resulting in a non-conjugate GP method for multivariate data that scales linearly with respect to time. Through a natural gradient approach, we derive a sparse approximation that constructs a state-space model over a reduced set of spatial inducing points, and we show that for separable Markov kernels the full and sparse cases exactly recover the standard variational GP. This leads to an efficient and accurate method for large spatio-temporal problems that we demonstrate on multiple real-world examples.
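For readers less familiar with the state-space view (this is standard Markovian GP machinery, not the paper's specific construction): a temporal GP with a Markov kernel is equivalent to a linear stochastic differential equation,
\[
\mathrm{d}\mathbf{s}(t) = \mathbf{F}\,\mathbf{s}(t)\,\mathrm{d}t + \mathbf{L}\,\mathrm{d}\boldsymbol{\beta}(t), \qquad f(t) = \mathbf{H}\,\mathbf{s}(t),
\]
so that inference reduces to Kalman-style filtering and smoothing over the state \(\mathbf{s}(t)\), giving cost linear rather than cubic in the number of time steps.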
Reproducing kernel Hilbert spaces (RKHS) provide a powerful framework, termed kernel mean embeddings, for representing probability distributions, enabling nonparametric statistical inference in a variety of applications. Combining the RKHS formalism with Gaussian process modelling, we present a methodology to refine low-resolution (LR) spatial fields with high-resolution (HR) information. This task, known as statistical downscaling, is challenging as the diversity of spatial datasets often prevents direct matching of observations. Yet, when LR samples are modelled as aggregate conditional means of HR samples with respect to a mediating variable that is globally observed, the recovery of the underlying fine-grained field can be framed as taking an ‘inverse’ of the conditional expectation, namely a deconditioning problem. Leveraging this deconditioning perspective, we introduce a Bayesian formulation of statistical downscaling able to handle potentially unmatched multi-resolution spatial fields.
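Schematically (my notation, not the paper's), writing the HR field as \(f\) and the mediating variable as \(b\), the LR observations are modelled as aggregate conditional means,
\[
z(b) \;=\; \mathbb{E}\big[f(X)\,\big|\,B = b\big],
\]
and downscaling amounts to recovering \(f\) from observations of \(z\), i.e. inverting this conditional expectation; this inversion is the deconditioning problem referred to above.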
Arctic sea ice is a major component of the Earth’s climate system, as well as an integral platform for travel, subsistence, and habitat. Since the late 1970s, significant advancements have been made in our ability to closely monitor the state of the ice cover in the polar regions through the launch of Earth-observation satellites. Consequently, with over four decades of time-series data now at our disposal, we have observed significant reductions in the spatial extent of Arctic sea ice, and more recently its thickness, directly in line with increasing anthropogenic CO2 emissions. The summer months, in particular, present the largest rate of decline in sea ice extent compared to other seasons, as well as the largest inter-annual variability, making seasonal to inter-annual predictions difficult. Advance predictions of summer ice conditions are important because this is the time when the ice cover is at its minimum extent and the Arctic becomes open to a whole host of traffic, including coastal resupply vessels, eco-tourism, and the movement of local communities. This presentation explores Gaussian processes as a framework both for sea ice forecasting and for optimally combining and interpolating multiple satellite observation sets. In the first application, the spatio-temporal patterns of variability in past ice conditions are exploited using a complex network framework, which is then fed into a Gaussian process regression forecast model via a random walk graph kernel, to predict regional and pan-Arctic (basin-wide) September sea ice extents with high skill. We will then see how this work can be extended to spatial forecasts by adopting a multi-task learning approach. In the second application, Gaussian process regression is used to optimally combine (and interpolate) observations from three separate satellite altimeters in space and time, in order to produce the first-ever daily pan-Arctic observational data set of Arctic sea ice freeboard (the base product for deriving sea ice thickness). We will then see how this work can be extended through computational speed-ups using relevance vector machines. In both the forecasting and interpolation applications, the hyperparameters of the models are learned through the empirical Bayes (type-II maximum likelihood) approach, which in the second application allows us to derive information about the spatio-temporal correlation length scales of Arctic sea ice thickness.
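As a generic illustration of the type-II maximum likelihood step mentioned at the end (a toy sketch with synthetic data; the kernels, graph features, and sea ice observations used in the talk are not reproduced here), fitting a GP in scikit-learn maximises the log marginal likelihood over the kernel hyperparameters, from which correlation length scales can be read off after training:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy 1-D series standing in for, e.g., a time series of ice extent anomalies.
X = np.linspace(0, 10, 50)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.randn(50)

# The RBF length scale and noise level are hyperparameters; fit() maximises
# the log marginal likelihood (empirical Bayes / type-II maximum likelihood).
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel).fit(X, y)

print(gp.kernel_)                        # learned hyperparameters (length scale, noise)
print(gp.log_marginal_likelihood_value_)
```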
Neural networks and Gaussian processes represent different learning paradigms: the former are parametric and rely on training by empirical risk minimisation, while the latter are non-parametric and employ Bayesian inference. Despite these differences, I will discuss how Gaussian processes can help us understand and improve neural network design. One example is our recent work investigating the effect of width on neural networks. We study a generalized class of models, Deep Gaussian Processes (DGPs), in which parametric layers are replaced with GP layers. Analysis techniques from Bayesian nonparametrics uncover surprising pathologies of wide models, introduce a new interpretation of feature learning, and demonstrate a loss of adaptability with increasing width. We empirically confirm that these findings hold for DGPs, Bayesian neural networks, and conventional neural networks alike. Time permitting, I will also discuss recent work that leverages insights from neural network training to improve Gaussian process scalability. Taking inspiration from deep learning libraries, we constrain ourselves to write GP inference algorithms that only use matrix multiplication and other linear operations, procedures amenable to GPU acceleration and distributed computing. While these methods induce a slight bias, which we quantify and bound through a novel numerical analysis, we demonstrate that it can be eliminated through randomized truncation techniques and stochastic optimization.
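To make the "matrix multiplication only" point concrete, here is a minimal sketch, assuming a dense RBF kernel and SciPy's conjugate gradient solver (an illustration of the general idea, not the implementation discussed in the talk): the GP linear system is solved while touching the kernel matrix only through matrix-vector products, exactly the kind of operation that maps well onto GPUs and distributed compute.

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
y = rng.standard_normal(500)

# Dense RBF kernel matrix plus observation noise (sigma^2 = 0.1).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dists) + 0.1 * np.eye(500)

# Conjugate gradients solves (K + sigma^2 I) alpha = y using only matvecs;
# the solver never needs a Cholesky factorisation of K.
op = LinearOperator((500, 500), matvec=lambda v: K @ v)
alpha, info = cg(op, y)
print(info)   # 0 indicates convergence
```

Truncating the iterative solver early is what introduces the small bias mentioned above.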
Downscaling aims to link the behaviour of the atmosphere at fine scales to properties measurable at coarser scales, and has the potential to provide high-resolution information at a lower computational and storage cost than numerical simulation alone. This is especially appealing for targeting convective scales, which are at the edge of what is possible to simulate operationally. Since convective-scale weather has a high degree of independence from larger scales, a generative approach is essential. I will describe a statistical method for downscaling moist variables to convective scales using conditional Gaussian random fields, with an application to wet bulb potential temperature (WBPT) data over the UK. The model uses adaptive covariance estimation to capture the varying spatial properties at convective scales.
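The machinery underneath such models is the standard Gaussian conditioning identity: if the HR values \(\mathbf{f}_H\) and the LR values \(\mathbf{f}_L\) are jointly Gaussian,
\[
\begin{pmatrix} \mathbf{f}_H \\ \mathbf{f}_L \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \boldsymbol{\mu}_H \\ \boldsymbol{\mu}_L \end{pmatrix}, \begin{pmatrix} \boldsymbol{\Sigma}_{HH} & \boldsymbol{\Sigma}_{HL} \\ \boldsymbol{\Sigma}_{LH} & \boldsymbol{\Sigma}_{LL} \end{pmatrix} \right)
\;\Rightarrow\;
\mathbf{f}_H \mid \mathbf{f}_L = \mathbf{y} \sim \mathcal{N}\!\left( \boldsymbol{\mu}_H + \boldsymbol{\Sigma}_{HL}\boldsymbol{\Sigma}_{LL}^{-1}(\mathbf{y} - \boldsymbol{\mu}_L),\; \boldsymbol{\Sigma}_{HH} - \boldsymbol{\Sigma}_{HL}\boldsymbol{\Sigma}_{LL}^{-1}\boldsymbol{\Sigma}_{LH} \right),
\]
and sampling from this conditional gives the generative behaviour described above; the adaptive covariance estimation determines how the covariance blocks are built from data.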
In many real-world problems, we want to infer some property of an expensive black-box function f, given a budget of T function evaluations. One example is budget-constrained global optimization of f, for which Bayesian optimization is a popular method. Other properties of interest include local optima, level sets, integrals, or graph-structured information induced by f. Often, we can find an algorithm A to compute the desired property, but it may require far more than T queries to execute. Given such an A, and a prior distribution over f, we refer to the problem of inferring the output of A using T evaluations as Bayesian Algorithm Execution (BAX). In this talk, we present a procedure for this task, InfoBAX, which sequentially chooses queries that maximize mutual information with respect to the algorithm’s output. Applying this to Dijkstra’s algorithm, for instance, we infer shortest paths in synthetic and real-world graphs with black-box edge costs. Using evolution strategies, we obtain variants of Bayesian optimization that target local, rather than global, optima. We discuss InfoBAX and give background on other information-based methods for Bayesian optimization, as well as on the probabilistic uncertainty models which underlie these methods.
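As a simplified illustration of the core BAX setup (the tiny graph, edge features, and use of scikit-learn's GP are my own stand-ins, and this sketch only pushes posterior samples of f through the algorithm A; it is not InfoBAX's mutual-information acquisition), one can obtain a belief over Dijkstra's output by running it on edge costs sampled from a GP posterior:

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

# A small graph whose edge costs are an unknown function f of edge features.
n_nodes = 6
edges = [(0, 1), (1, 2), (2, 5), (0, 3), (3, 4), (4, 5), (1, 4)]
edge_feats = rng.standard_normal((len(edges), 2))

# Suppose f has already been evaluated on a few edges; fit a GP posterior to them.
observed_idx = [0, 3, 6]
f_observed = np.abs(rng.standard_normal(len(observed_idx))) + 0.5
gp = GaussianProcessRegressor().fit(edge_feats[observed_idx], f_observed)

# Push posterior samples of f through the algorithm A (here, Dijkstra) to get
# a Monte Carlo belief over A(f): the shortest-path distance from node 0 to 5.
samples = gp.sample_y(edge_feats, n_samples=20, random_state=0)  # (n_edges, 20)
dists = []
for s in samples.T:
    W = np.zeros((n_nodes, n_nodes))
    for (i, j), c in zip(edges, np.abs(s) + 1e-3):   # keep edge costs positive
        W[i, j] = W[j, i] = c
    d = dijkstra(W, indices=[0])
    dists.append(d[0, 5])
print(np.mean(dists), np.std(dists))   # spread of beliefs over the algorithm's output
```

InfoBAX goes further: it uses such a distribution over the algorithm's output to pick the next query point that is maximally informative about that output.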
We are happy to present this joint work with Mihaela Roșca, Răzvan Pascanu, Lucian Bușoniu and Claudia Clopath on the effect of spectral normalisation in deep reinforcement learning. Most recent deep reinforcement learning advances take an RL-centric perspective and focus on refinements of the training objective. We diverge from this view and show that we can recover the performance of these developments not by changing the objective, but by regularising the value-function estimator. Constraining the Lipschitz constant of a single layer using spectral normalisation is sufficient to elevate the performance of a Categorical-DQN agent to that of a more elaborate Rainbow agent on the challenging Atari domain. We conduct ablation studies to disentangle the various effects normalisation has on the learning dynamics, and show that modulating the parameter updates alone is sufficient to recover most of the performance of spectral normalisation. These findings point towards the need to also focus on the neural component and its learning dynamics when tackling the peculiarities of deep reinforcement learning.
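As a minimal PyTorch sketch of the intervention (the layer sizes, architecture, and choice of which layer to normalise are illustrative, not the agent used in the work), constraining a single layer's Lipschitz constant with spectral normalisation looks like this:

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# A small DQN-style value head in which only one linear layer is spectrally
# normalised, constraining its Lipschitz constant to roughly 1.
class ValueNetwork(nn.Module):
    def __init__(self, in_dim=512, n_actions=18):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            spectral_norm(nn.Linear(256, 256)), nn.ReLU(),   # the regularised layer
        )
        self.head = nn.Linear(256, n_actions)

    def forward(self, x):
        return self.head(self.body(x))

q = ValueNetwork()
print(q(torch.randn(4, 512)).shape)   # torch.Size([4, 18])
```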
In this work, we tackle the problem of learning symbolic representations of low-level, continuous environments. We present a framework for autonomously learning portable hierarchies that are suitable for planning. Such abstractions can be immediately transferred between tasks that share the same types of objects, resulting in agents that require fewer samples to learn a model of a new task. We show how to ground these representations with problem-specific information to construct a sound representation suitable for planning. We demonstrate our approach in a series of video game tasks, where an agent can learn such representations directly from transition data. The resulting learned representations enable the use of off-the-shelf classical planners, yielding an agent capable of forming complex, long-term plans consisting of hundreds of low-level actions.