MIT 6.S191: Uncertainty in Deep Learning

Transcription for the video titled "MIT 6.S191: Uncertainty in Deep Learning".




Introduction

Intro (00:00)

Thank you so much for having me. I'm super excited to be here. Thank you, Alexander, for the kind words, and Ava, both of you, for organizing. So yeah, I'm going to talk about practical uncertainty estimation and out-of-distribution robustness in deep learning. This is a bit of an abridged and slightly updated version of the NeurIPS tutorial I gave in 2020, so if you want to see the extended one, check that out. A bunch of these slides are also by Dustin and Balaji, wonderful colleagues of mine. All right, what do we mean by uncertainty? The basic idea is that we want to return a distribution over predictions rather than just a single prediction. In classification, that means we'd like to output a label along with its confidence: how sure are we about this label? In regression, we want to output a mean, but also its variance, confidence intervals, or error bars around our predictions. Good uncertainty estimates are crucial because they quantify when we can trust the model's predictions. And what do we mean by out-of-distribution robustness? Well, in machine learning we usually assume, in fact almost all the theory is built on the assumption, that our training data and our test data are drawn IID, independent and identically distributed, from the same distribution.


Discussion On Out-Of-Distribution Network Robustness

Out of Distribution Robustness (01:47)

In practice, in reality, datasets change, temporally, spatially, or in other ways, and when we're deploying models we often see data that's not from the same distribution as the training set. The kinds of dataset shift you might imagine seeing are things like covariate shift, where the distribution of the inputs x changes but the relationship between inputs and labels stays the same. Open set recognition is a fun one, and it's actually really nasty: new classes can appear at test time. Imagine, for example, a cat-and-dog classifier that sees an airplane. That's something that's really hard to deal with. And then label shift, where the distribution of labels changes while the class-conditional distributions of the inputs stay the same. This is also prevalent in things like medicine: the proportion of people who test positive for COVID changes pretty drastically over time, and your models will have to adapt to that kind of thing. Okay, so here's an example of some dataset shift. The IID test set here is from ImageNet, which is a popular image dataset.
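To make those two shift types concrete, here are the standard definitions in notation (my own summary, not taken from the slides): covariate shift changes the input marginal while the predictive relationship is preserved, and label shift changes the label marginal while the class-conditional inputs are preserved.

```latex
% Covariate shift: input distribution changes, labeling function is preserved
p_{\text{test}}(x) \neq p_{\text{train}}(x), \qquad p_{\text{test}}(y \mid x) = p_{\text{train}}(y \mid x)

% Label shift: label distribution changes, class-conditional inputs are preserved
p_{\text{test}}(y) \neq p_{\text{train}}(y), \qquad p_{\text{test}}(x \mid y) = p_{\text{train}}(x \mid y)
```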


Dataset Shift (02:59)

We have clean images, and you could imagine the shift being just adding noise to the images with increasing severity. So here we've got our frog, and we're adding increasing amounts of noise. There's actually a paper by Hendrycks and Dietterich where they took various kinds of shifts: noise, motion blur, zoom blur, snow. I actually just drove through the snow in an autonomous car, and it didn't deal very well with that dataset shift. So here's a bunch of dataset shifts at various severities. We showed these to a common architecture, a ResNet, and looked at how the accuracy behaved with respect to these dataset shifts. You can see accuracy goes down as these various shifts are applied, and that corresponds to the intensity of the shift. That's maybe not surprising, but what you would really like is for your model to say: okay, my accuracy is going down, but my uncertainty goes up correspondingly. I don't know what the right answer is, but I'm telling you I don't know, by saying, in a binary classifier, that I have 0.5 confidence on the label, for example. So in our experiments, was that true for these classic deep learning architectures? No, definitely not. As accuracy got worse, the ECE, a measure of calibration error that I'll tell you about in a second, our measure of the quality of the uncertainty, got worse as well. The models started to make overconfident mistakes when exposed to this changing distribution of data. That's pretty alarming; it's pretty scary for any application where you might deploy a machine learning model.

Another kind of terrifying thing: here are a bunch of data examples where the model said, I am almost completely sure that's a robin or a cheetah or a panda, which is also pretty freaky. You'd think it would say, I don't know what this noisy thing is, but it doesn't. One reason you might imagine that happening is just the way our models are constructed. Imagine a two-dimensional input space with three classes: blue, orange, and red. Ideally, as you get further away from your classes, your model becomes uncertain. The red class here we call out of distribution, and we'd like to say: the red is far away from orange and blue, so we're uncertain about what the actual label is. In reality, these classifiers are decision boundaries, and the further you get from the boundary between two classes, the more confident you become that it's one class or the other. So an interesting pathology of a lot of these models is that if you show a cat-and-dog classifier a bird, it won't say, oh, I don't know what this is. It'll say, oh, this is more dog-like than cat-like, so I'm 100% sure that this is a dog, which is not what you want.

All right, so applications. A very important one that's becoming more and more relevant these days is healthcare: medical imaging, radiology. This is diabetic retinopathy. Of course, you would like to be able to pass along with a classification, diseased or not diseased, tumor or not tumor. If you pass that down to a doctor, you want to pass down a confidence measure as well.
So, 80% sure it's a tumor, 70% sure it's a tumor, rather than just yes or no, and have the doctor be able to reason about that probability and include it in a downstream calculation, maybe an expectation, like what is the expectation that the patient will survive. So here we'd really like to be able to pass good uncertainty downstream to a doctor, or say we're not actually sure and we'd like an expert to label it instead, and pass it to an expert labeler, an actual doctor.

Then self-driving cars, which you just heard a lot about. Like I said, I was actually in a self-driving car an hour ago, driving through Vermont to get here in snow, so that's definitely quite an out-of-distribution situation, and it certainly was fooled a couple of times. But here you would imagine that the car can encounter any number of out-of-distribution examples, and you would like it to express uncertainty such that the system making decisions is able to incorporate that uncertainty.

One that we care about a lot at Google is conversational dialogue systems. If you have your Google Assistant or your Amazon Alexa or your Siri or whatever, it should be able to express uncertainty about what you said or what you asked it to do. It could then defer to a human, or it could ask you to clarify. It could say, please repeat that, I didn't quite understand, instead of adding something to your shopping cart when you really just wanted to know what the weather was. So there are a lot of applications of uncertainty and out-of-distribution robustness. Basically, any place we deploy a machine learning model in the real world, we want good uncertainty. This is just a subset of tasks that I care about a lot and that my team at Google is working on.


The error of the estimators (09:18)

One of our taglines: a popular expression about machine learning is "all models are wrong, but some models are useful." We've changed that a little bit to say: all models are wrong, but models that know when they're wrong are useful. So I'll give you a little primer on uncertainty and robustness. There are multiple sources of uncertainty. One is model uncertainty, and the way to think about that is that there are many models that fit the training data well. If you look at this two-class situation, there are actually infinitely many lines you could draw that perfectly separate these two classes. You would like to be able to express your uncertainty about which line is the right line, rather than make an arbitrary decision that theta one or theta two or theta three is the true model. This kind of uncertainty, uncertainty about what the right model is, is known as epistemic uncertainty, and you can actually reduce it. The way you reduce it is by gathering more data: if we filled in more blue triangles and more red circles, then maybe we could eliminate theta three and theta one because they no longer separate the data well. One thing to note is that it doesn't necessarily have to be models in the same hypothesis class. The first example was just straight lines, but you could imagine non-linear models of various flavors also being incorporated, all kinds of models, and that significantly increases the scope, the number of plausible models that could describe the data.

The second big source of uncertainty is known as data uncertainty, and that's uncertainty that's inherent to the data. It could be something like label noise, uncertainty in the labeling process. These are CIFAR images, maybe, and they're really low resolution, so humans may not even know what that thing is, and two human raters may give two different labels. It could be sensor noise: in a thermometer there's some inherent uncertainty about the decimal place to which you can estimate the temperature. This irreducible uncertainty is often called aleatoric uncertainty. As for the distinction between epistemic and aleatoric, even experts constantly mix up the two.


Calibration error (11:55)

And between us, we've kind of agreed that we need to change the language, because those words are too hard to remember; we can just think of it as model and data uncertainty. All right, so how do we measure the quality of our uncertainty? This is something that we've been thinking about quite a bit as well. One popular notion is calibration error, which is basically the difference between the confidence of the model and its aggregate accuracy. Say your model said, I'm 90% sure this is a tumor, or I'm 90% sure that was a stop sign. If it said "this is a stop sign with 90% certainty" a million times, how many times was it actually right? The calibration error is the difference between that confidence and the aggregate accuracy. So, in the limit, how does my confidence actually correspond to the actual accuracy? I kind of explained this already, but another great example is weather: if you predict rain with 80% confidence, your calibration error is, over many, many days, the difference between that confidence and how often it actually rained. For regression, you might imagine calibration corresponding to this notion of coverage, which is basically how often predictions fall within the confidence intervals of your predictions.

A popular measure for this is something known as expected calibration error. The equation just formalizes what I said: you bin your confidences, into bins of maybe zero to ten percent, ten to twenty, twenty to thirty, and then for each bin you measure the difference between the accuracy of the predictions that landed in that bin and the confidence of that bin. That gives us an idea of the level of overconfidence or underconfidence in our models. Here's an example from a paper by Guo et al., where they showed that very often deep learning models are badly miscalibrated, typically overconfident: the confidence they actually output is pretty far from the accuracy we'd like for each bin.

One downside of calibration is that it has no notion of accuracy built in. It's just asking: how often was my confidence aligned with the actual accuracy? So you can have a perfectly calibrated model that just predicts randomly all the time, because it's just saying "I don't know," and indeed it didn't know. So we've been looking at, or actually the statistical meteorology community many years ago looked at, ways to score the quality of the uncertainty of weather forecasts, and came up with the notion of proper scoring rules, which incorporate this notion of calibration but also a notion of accuracy. The paper by Gneiting and Raftery is a wonderful one that outlines these rules, and it gives you a whole class of loss functions, or scoring functions if you will, that don't violate a set of rules and accurately give you an idea of how good your uncertainty was.
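For reference, the expected calibration error being described is usually written as follows (my own rendering of the standard formula, not copied from the slide), with the n predictions grouped into M confidence bins B_m, where acc(B_m) is the empirical accuracy in bin m and conf(B_m) is the average confidence in that bin:

```latex
\mathrm{ECE} \;=\; \sum_{m=1}^{M} \frac{|B_m|}{n}\,\bigl|\,\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\,\bigr|
```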


The importance of uncertainty (15:20)

Negative log likelihood is popular in machine learning; that's a proper scoring rule. The Brier score is just squared error on the predicted probabilities, and that's also a proper scoring rule, also used quite a bit in machine learning. Okay, I'm going to skip over that for the sake of time. So how do we get uncertainty out of our models? We know it's important, so how do we actually extract a good notion of uncertainty? I assume you're all familiar with the setting of: I have a neural net, I want to train it with SGD, and I have a loss function. Well, almost every loss function corresponds to a maximum likelihood problem: minimizing a loss function corresponds to maximizing a probability, or a log probability, of the data given the model parameters. If you think of this p(theta), the argmax is saying: I want to find the one setting of the parameters theta that maximizes this probability given the dataset that I have. That corresponds to minimizing a negative log likelihood plus a prior, and if you think of that as a loss, it's a log loss plus a regularization term; in this case the regularizer is squared error on the weights, which actually corresponds to a Gaussian prior, but I won't get into that. And this log probability corresponds to data uncertainty, which is interesting: you can build a notion of data uncertainty into your model through the likelihood function. A special case of this is just softmax cross entropy with L2 regularization, which you optimize with SGD, the standard way to train deep neural nets. Sorry, there's a lag on my slides, so I've skipped over a couple. All right, we'll just go here.

So the problem with this is that we found just one set of parameters. This gives us just one prediction, and it doesn't give us model uncertainty; we just have one model, we plug in an x, and it gives us a y. So how do we get uncertainty? In the probabilistic approach, which is definitely my favorite way of thinking about things, instead of getting the single argmax parameters, you want the full distribution p(theta | x, y), a whole distribution over parameters rather than a single one. A really popular thing to do, instead of getting this full distribution, is to just get multiple good parameter settings. There are a number of strategies for doing this; the most popular is probably something called ensembling, which is just: get a bunch of good models and aggregate your predictions over that set. Okay, let's hope the slide goes forward. The recipe, at least in the Bayesian sense, is: we have a model, a joint distribution over outputs and parameters given some set of inputs. During training, we want to compute the posterior, the conditional distribution of the parameters given the observations. Instead of finding a single setting of theta, Bayes' rule gives us the equation we need to get the entire distribution over thetas. The numerator is what we were doing before, and to get the entire distribution you need to compute the denominator below, which is a pretty messy and scary integral, high dimensional because it's over all of our parameters, which for deep nets can be millions or even billions.
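In symbols (my own rendering, consistent with what's described above), the standard training objective is a maximum a posteriori estimate, and the Bayesian alternative keeps the whole posterior via Bayes' rule; a Gaussian prior on the weights gives the familiar L2 / weight-decay term:

```latex
\hat{\theta}_{\mathrm{MAP}}
  = \arg\max_{\theta}\, \log p(\theta \mid \mathcal{D})
  = \arg\min_{\theta}\, \Bigl[-\textstyle\sum_{i} \log p(y_i \mid x_i, \theta) \;-\; \log p(\theta)\Bigr]

p(\theta \mid \mathcal{D})
  = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{\int p(\mathcal{D} \mid \theta')\, p(\theta')\, d\theta'}
```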


Why combine models? (20:07)

Then at prediction time, we'd like to compute the likelihood given the parameters, where each parameter configuration is weighted by this posterior. So we compute an integral: the prediction conditioned on a set of parameters, times the probability of those parameters under the posterior, aggregated over all parameter settings. In practice, what's often done is you take a set of S samples and aggregate the predictions over that discrete set, which looks a lot like the ensembling I just talked about. So what does this give us? Instead of having just a single model, we now have a whole distribution of models, and what you're looking at is such a distribution: each line is a different model, they all fit the data quite well, but they do different things as they move away from the data. They have different hypotheses about how the data will behave out there, and that disagreement gives you an interesting measure of uncertainty as you move away from the data.
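The posterior predictive being described, and its Monte Carlo approximation over S posterior samples, can be written as (again in my own notation):

```latex
p(y^{*} \mid x^{*}, \mathcal{D})
  = \int p(y^{*} \mid x^{*}, \theta)\, p(\theta \mid \mathcal{D})\, d\theta
  \;\approx\; \frac{1}{S} \sum_{s=1}^{S} p(y^{*} \mid x^{*}, \theta^{(s)}),
  \qquad \theta^{(s)} \sim p(\theta \mid \mathcal{D})
```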


Bayesian Neural Networks (21:10)

So you might imagine just computing the variance, for example, out near the tails, for a prediction. I'm going to speed through this, but there's a vast literature on different ways of approximating that integral over all parameters. It's, in general, way too expensive to do in closed form, or even exactly, for deep nets. So there are tons of approximations. If you imagine these lines being the loss surface of the network, they correspond to things like putting a quadratic on top of the loss, which is known as the Laplace approximation, or things like sampling via Markov chain Monte Carlo, which is used quite a bit: as you optimize, you periodically draw samples, grab a good model, optimize a bit further, grab another good model, and so on. One thing that I'm really interested in is this notion that a parameterization of a deep neural net defines a function from x to y, whether you have a classifier or a regressor, and so a Bayesian neural net gives you a distribution over these functions from x to y. Reasoning about this distribution is something I find super interesting.
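As one concrete example of these approximations, the Laplace approximation mentioned above fits a Gaussian around the MAP solution using the curvature of the loss there (a standard formula, not taken from the slides):

```latex
p(\theta \mid \mathcal{D}) \;\approx\; \mathcal{N}\bigl(\theta;\, \hat{\theta}_{\mathrm{MAP}},\, H^{-1}\bigr),
\qquad H = -\nabla_{\theta}^{2} \log p(\theta \mid \mathcal{D}) \Big|_{\theta = \hat{\theta}_{\mathrm{MAP}}}
```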


Deep Bayesian Nets: Experimentation. (22:28)

And there's a really neat property: under a couple of assumptions, you can show that in the limit of infinitely many hidden units, the network corresponds to a model we know well, a Gaussian process. I won't get into that, but it gives you that integral in closed form, and then we can use that closed form to make predictions or look at pretty pictures of what the posterior actually looks like. This is actually a line of research of mine and a couple of my colleagues: thinking about the behavior of deep neural networks under this infinite limit, or thinking about how things behave as you move away from the data, using this Gaussian process representation. It's a really neat and clean way to think about things, at least in theory, because we have a closed-form expression for what the integral over parameters actually is. And it turns out these models are really well calibrated, which is awesome; they have good uncertainty. I won't describe exactly what they are, but I highly recommend that you check out Gaussian processes if you find that interesting.

Okay, so if you think this Bayesian methodology is pretty ugly, because there's this crazy high-dimensional integral and it's really hard and mathy to figure out, then you might think about doing ensemble learning instead, which is basically: take a bunch of independently trained models and form a mixture distribution. In classification, you just aggregate or average the predictions, and that gives you uncertainty over the class prediction. In regression, you can compute uncertainty as the variance over the predictions of the different models. There are many different ways of doing this ensembling; it's almost as old as machine learning itself. Just take the kitchen sink of all the things you tried and aggregate their predictions, and that usually gives you better predictions and better uncertainty than a single model. If you find it interesting, you can wade into the debate: if you go on Twitter, there's a debate between experts in machine learning about whether ensembles are Bayesian or not. I won't get into that; I fall into the not-Bayesian camp, but it is interesting to think about what the difference between these strategies is.

I've also spent a lot of time thinking about issues with Bayesian models. In my group we've spent a ton of time trying to get Bayesian models to work well on modern-sized deep neural nets, and it's really hard, because it requires very coarse approximations of that very high-dimensional integral; people play around with Bayes' rule to get it to work, and in the end it's not clear whether it's totally kosher from, I guess, a purist Bayesian view. And it requires you to specify the model well. I won't get into that too much, but it basically means you specify a class of models, and the ideal model needs to be in that well-specified class for Bayes to work well.
And it turns out that for deep nets, we don't really understand them well enough to specify what this class of models should look like, and the problem often hinges on the prior: how do you specify a prior over deep neural nets? We don't really know. Anyway, there's a paper I'm particularly proud of called "How Good is the Bayes Posterior in Deep Neural Networks?", where we try to figure out what is wrong with Bayesian deep neural nets and why we can't get them to work well on modern problems. Okay, I see there are some chat messages. Should I be reading these? This is Ava. We are handling the chat, and some of the questions are directed specifically towards you, so we can moderate them to you at the end of your talk, if that's okay. Perfect. Perfect. Yeah, go ahead.

So, some really simple ways to improve the uncertainty of your model. A popular one is known as recalibration. This is done all the time in real machine learning systems: train your model, then look at the calibration on some withheld dataset and recalibrate on that dataset. What you can actually do is take just the last layer and do something called temperature scaling, which is optimized via cross entropy on that withheld dataset, and that can improve your calibration on the distribution corresponding to that withheld set. Of course, that doesn't give you much in terms of model uncertainty, and it doesn't help when you see yet another data distribution different from the one you just recalibrated on, but it can be really effective. I don't know if you've talked about dropout in the course; you probably did. Another simple method that's effective is called Monte Carlo dropout, by Yarin Gal and Zoubin Ghahramani, where you just apply dropout at test time: when you're making predictions, drop out a bunch of hidden units and average over the dropout masks. You can imagine that gives you ensemble-like behavior and a distribution over predictions at the end, and that actually seems to work pretty well as a baseline. And then deep ensembles. Balaji, I won't say his last name, found that deep ensembles work incredibly well for deep learning. This is basically: retrain your deep learning model N times, where N is usually something like 5 to 10, just with different random initializations. The runs end up in different optima of the loss landscape and give interestingly diverse predictions, and this gives you really good uncertainty; at least empirically, it seems really good. Yeah, so here's a figure from, if you recall, one of the first slides I showed you, those different shifts of the data. In that same study, we found deep ensembles actually worked better than basically everything else we tried.
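As an illustration of the deep ensemble recipe just described, here is a minimal sketch in Python/NumPy. The `train_model` and `predict_proba` functions are hypothetical placeholders for whatever training and inference code you already have; the point is just the seed-varied retraining and the averaging of predicted probabilities.

```python
import numpy as np

def deep_ensemble_predict(train_model, predict_proba, x_train, y_train, x_test,
                          n_members=5):
    """Train n_members models from different random seeds and average predictions.

    train_model(x, y, seed) -> model          (hypothetical user-supplied trainer)
    predict_proba(model, x) -> [N, C] array   (hypothetical per-model class probabilities)
    """
    member_probs = []
    for seed in range(n_members):
        model = train_model(x_train, y_train, seed=seed)   # different init per member
        member_probs.append(predict_proba(model, x_test))  # [N, C] probabilities
    member_probs = np.stack(member_probs)                  # [n_members, N, C]

    # Ensemble prediction: average the class probabilities across members.
    mean_probs = member_probs.mean(axis=0)                 # [N, C]

    # One simple uncertainty signal: entropy of the averaged predictive distribution.
    predictive_entropy = -np.sum(mean_probs * np.log(mean_probs + 1e-12), axis=1)

    # Another: disagreement between members (variance of their probabilities).
    member_variance = member_probs.var(axis=0).mean(axis=1)

    return mean_probs, predictive_entropy, member_variance
```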


Hyperparameter Ensembles in Distributed Setting. (29:36)

I lost a bet to Balaji, actually, because I said Bayesian or approximately Bayesian methods were going to work better, and it turned out that they didn't. Something that works even better than this deep ensemble strategy is what we call hyperparameter ensembles, which is: also change the hyperparameters of your model, and that gives you even more diversity in the predictions. You might imagine that corresponds to broadening your hypothesis class in terms of the types of models that might fit well. So you ensemble over those, and that does even better than ensembling over the same hyperparameters and architecture. Then another thing that works really well is SWAG, by Maddox et al.: you just optimize via SGD, and then you fit a Gaussian around the averaged weight iterates. As you're bouncing around an optimum in SGD, you're basically tracing out a Gaussian, and you say that Gaussian is now the distribution over parameters that I'll use. Okay, waiting for the slide to change. I may have clicked twice. We'll see. Do I dare click again? I think this is the right slide.

Anyway, one thing that we've been thinking about a lot within my team at Google is: okay, what about scale? A lot of our systems operate in the regime where we have giant models, and they barely fit in the hardware that we use to serve them. We also care a lot about latency. Things like ensembling, yes, they're more efficient than carrying around the entire posterior, the entire distribution over parameters, but you're still carrying around five to ten copies of your model, and for most practical purposes, when I've talked to teams, they say: we can't afford to carry around five to ten copies of our model, and we can't afford to predict five to ten times for every data example, because that takes too long in terms of latency. So scale is definitely a problem. I imagine for self-driving that's also an issue: if you need to predict in real time with a model in hardware on your car, you probably can't afford to carry around a whole bunch of copies and predict with all of them. So within our team, we've been drawing out an uncertainty and robustness frontier, which is basically asking: how can we get the best bang for our buck in terms of uncertainty as we increase the number of parameters? And it turns out it's really interesting: you can do much more sophisticated things if you're willing to carry around many copies of your model, and you can't do quite as much with bigger single models, but you can still do quite a bit. This has certainly driven a lot of our more recent research.
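Circling back to the SWAG idea mentioned earlier in this section, here is a minimal sketch of the diagonal variant under my own simplifications: keep running first and second moments of periodically collected SGD weight iterates, then sample weights from the resulting Gaussian at prediction time. The `sgd_step` function and the flat weight vector are hypothetical placeholders.

```python
import numpy as np

def swag_diagonal_collect(initial_weights, sgd_step, n_steps, collect_every=50):
    """Run SGD, periodically collecting weight iterates, and fit a diagonal Gaussian.

    sgd_step(weights) -> weights   (hypothetical: one SGD step on the training loss)
    Returns the mean and variance of the collected iterates (SWAG-diagonal).
    """
    w = initial_weights
    mean = np.zeros_like(w)
    sq_mean = np.zeros_like(w)
    n_collected = 0
    for step in range(1, n_steps + 1):
        w = sgd_step(w)
        if step % collect_every == 0:
            n_collected += 1
            # Running averages of the first and second moments of the iterates.
            mean += (w - mean) / n_collected
            sq_mean += (w**2 - sq_mean) / n_collected
    var = np.maximum(sq_mean - mean**2, 1e-12)
    return mean, var

def swag_sample(mean, var, rng):
    """Draw one weight sample from the fitted Gaussian over parameters."""
    return mean + np.sqrt(var) * rng.standard_normal(mean.shape)
```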


Batch Ensemble & Rank-1 Bayesian Neural Nets (32:56)

So one way that we think about this is to view an ensemble as one giant model: a bunch of member networks with basically no connections between them, brought together at the end. Then there are basically a bunch of paths of independent subnetworks, and you might imagine making things cheaper if you shared parts of the network and kept other parts independent, or found some way to factorize the differences between the members. That's basically what we've been doing. This method, known as BatchEnsemble, by Yeming Wen et al., takes factors and uses them to modulate the model. You have a single model, and then you have a factor for every layer that modulates the layer, and you have n sets of factors, where n corresponds to your ensemble size. You could imagine this reproducing dropout if the factors were zeros and ones, or producing different weightings that modulate different hidden units as you move through the network. We call this batch ensemble, and the factors are actually rank-one, so they're really cheap to carry around compared to having multiple copies of the model. I won't talk about this much, but you can batch things so that you compute across all factors in a single forward pass, which is really nice. This turned out to work almost as well as the full ensemble, which is great, because it requires only something like five percent more parameters than a single model, so ninety-something percent less than a whole ensemble.

Then a neat way to turn that batch ensemble idea into an approximate Bayesian method, this is another big slide so it's taking a little while to switch, but here we go, is something we call rank-1 Bayesian neural nets, which is basically being Bayesian about those factors. We have a distribution over the factors, and we sample factors as we're making predictions. You can imagine that could correspond to something like dropout if you have some kind of binary distribution over the factors, but it could also correspond to other interesting distributions that modulate the weights of the model and give you an interesting aggregate prediction and uncertainty at the end. This is one flavor of a number of exciting recent papers, the cyclical MCMC one as well, where you think about being Bayesian over a subspace: integrating over a subspace that defines the greater space of models, and using that to get your uncertainty, rather than expressing uncertainty over all the parameters of your model.
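A rough sketch of the BatchEnsemble-style weight construction described above, under my own simplifications (a single dense layer in NumPy with hypothetical shapes): each ensemble member k gets a rank-one factor pair (r_k, s_k) that elementwise-modulates one shared weight matrix, so all members can be evaluated in a single batched forward pass.

```python
import numpy as np

def batch_ensemble_dense(x, W, R, S, b=None):
    """Batched forward pass of a BatchEnsemble-style dense layer.

    x : [E, N, D_in]   inputs replicated for each of E ensemble members
    W : [D_in, D_out]  single shared weight matrix
    R : [E, D_in]      per-member rank-1 input factors
    S : [E, D_out]     per-member rank-1 output factors
    Effective weights for member k are W * outer(R[k], S[k]).
    """
    # (x * r_k) @ W * s_k  ==  x @ (W * outer(r_k, s_k))
    h = (x * R[:, None, :]) @ W          # [E, N, D_out]
    h = h * S[:, None, :]
    if b is not None:
        h = h + b
    return h

# Example: 4 ensemble members, 3 inputs, an 8 -> 5 dense layer.
rng = np.random.default_rng(0)
E, N, D_in, D_out = 4, 3, 8, 5
x = np.broadcast_to(rng.standard_normal((1, N, D_in)), (E, N, D_in))
W = rng.standard_normal((D_in, D_out))
R = 1.0 + 0.1 * rng.standard_normal((E, D_in))   # factors initialized near one
S = 1.0 + 0.1 * rng.standard_normal((E, D_out))
print(batch_ensemble_dense(x, W, R, S).shape)    # (4, 3, 5)
```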
Then there's something an intern did that works really, really well. Marton Havasi, who's at Harvard now, said: okay, let's take a standard neural net, and instead of plugging in one input, we'll plug in K different inputs and have K different outputs, and force the model to predict for all of them. They can be from different classes, which means the model can't really share structure by predicting for two at the same time. That basically forces the model to learn independent subnetworks through the whole configuration of the network, and to find some interesting diversity at the outputs. And that actually tended to work pretty well: at test time, you just replicate the same input K times, and it gives you K different predictions, and those are interestingly diverse because they go through different subnetworks of this bigger network. Here's a figure showing the diversity of the predictions, a dimensionality reduction on the distribution of predictions, and we found that the predictions of the different outputs are indeed interestingly diverse. And then here are a couple of plots: as we increase the number of inputs and keep the structure of the actual model the same, so the number of parameters stays the same, what does that do for the uncertainty and the accuracy of the model? Interestingly, and I find this really surprising, accuracy sometimes goes up. Look at the solid lines: accuracy goes up sometimes, and log likelihood, a notion of the quality of the uncertainty, certainly goes up. It's surprising that you don't need more parameters in the model to do this, but it tends to work. Okay, so I'm basically at the end. Maybe I can share an anecdote about what we're thinking about more imminently now, since I've got a couple of minutes left.
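A minimal sketch of the multi-input multi-output idea just described, with hypothetical shapes and a placeholder `backbone` function: during training, K independent examples are concatenated into one input and the network produces K heads, one per example; at test time the same input is repeated K times and the K heads give diverse predictions.

```python
import numpy as np

def mimo_forward(backbone, x_group, n_classes):
    """Forward pass for a MIMO-style model.

    backbone(z) -> [N, K * n_classes] logits   (hypothetical shared network)
    x_group    : [N, K, D] -- K inputs stacked per example (independent during training)
    Returns per-head probabilities of shape [N, K, n_classes].
    """
    N, K, D = x_group.shape
    z = x_group.reshape(N, K * D)                  # concatenate the K inputs
    logits = backbone(z).reshape(N, K, n_classes)  # one output head per input slot
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

def mimo_predict(backbone, x, n_heads, n_classes):
    """At test time, replicate the same input across all K slots and average heads."""
    x_group = np.repeat(x[:, None, :], n_heads, axis=1)   # [N, K, D]
    probs = mimo_forward(backbone, x_group, n_classes)    # [N, K, C]
    return probs.mean(axis=1)                             # ensemble-like average
```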


Traditional vs Pre-training (38:39)

In our team, we've been thinking a lot about, and you may have noticed, a number of papers calling large-scale pre-trained models a paradigm shift for machine learning. Large-scale pre-training is basically saying: instead of taking just my training data distribution, what if I can access some giant other distribution? For a text model, rather than taking my labeled machine translation data, where I have only 10 examples, I mine the whole web and find some way to model that data. Then I take that model, top off the last layer, point it at my machine translation task or whatever prediction task I have, and retrain, starting from where the pre-trained model left off. That pre-training strategy turns out to work incredibly well in terms of accuracy, and it also seems to work well in terms of uncertainty. So one thing I think is really interesting is: if we care about out-of-distribution robustness, either we can do a lot of math and fancy tricks and ensembling, et cetera, or we can go and try to get closer to the entire distribution, and that, in my view, is kind of what pre-training is doing. In any case, that's something we're really involved in and interested in right now: what does pre-training actually do, and what does it mean for uncertainty and robustness?

The takeaway of these slides is basically that uncertainty and robustness are incredibly important; they're top of mind for a lot of researchers in deep learning. And as we increase compute, as I said, there are interesting new ways to look at the frontier, and I think a lot of promise to get better uncertainty out of our models. Okay, and with that, I'll close and say thanks. This is actually a subset of the many collaborators on the papers that I talked about and from whom the slides come. So thank you, and I'm happy to take any questions.

Thank you so much, Jasper. Really, really fantastic overview with beautiful visualizations and explanations. Super, super clear.


Can you mitigate high confidence on out-of-distribution data? (41:35)

Yeah, so there are several questions from the chat, which I'm gathering together now. One question from Stanley asks: is it possible to mitigate the issue of high confidence on out-of-distribution data by adding new images that are, as he describes, nonsense images, into the training set with a label of belonging to an unknown class? That's really interesting. Yeah, there is a bunch of work on effectively doing that. There's a line of literature which basically says: let's create a bucket to catch things that are outside our distribution. We'll call that an unknown class, and then we just need to feed into our model things that should fall into that class. Sorry, my dog just barged into my office. So yeah, that's certainly something that's done quite a bit. Danijar, I think, had a paper on this. I forget what it's called, something about priors in... Noise contrastive priors. Noise contrastive priors, that's right, yeah. You could imagine plugging in noise as this bucket, or even coming up with examples that are hopefully closer to the boundary. There are a couple of papers on doing kinds of data augmentation for this: augmenting your data, maybe interpolating one class with another, and trying to use that to help define the boundary of what is one class or another, and then pushing just outside your class to say that's not part of it, and putting that in the bucket of unknown. But yeah, great question.
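A minimal sketch of the "unknown class" idea discussed above, under my own assumptions (NumPy arrays, Gaussian noise standing in for the nonsense/outlier examples): append one extra class index and mix outlier examples labeled with it into the training set.

```python
import numpy as np

def add_unknown_class(x_train, y_train, n_classes, n_outliers=1000, rng=None):
    """Augment a training set with 'nonsense' examples labeled as an extra class.

    x_train : [N, D] in-distribution inputs
    y_train : [N] integer labels in {0, ..., n_classes - 1}
    Returns augmented (x, y), where label n_classes is the 'unknown' bucket.
    Here the outliers are plain Gaussian noise; in practice they could be any
    data believed to lie outside the training distribution.
    """
    rng = rng or np.random.default_rng(0)
    # Match the scale of the real inputs so the noise is not trivially separable.
    mu, sigma = x_train.mean(axis=0), x_train.std(axis=0) + 1e-8
    x_outliers = mu + sigma * rng.standard_normal((n_outliers, x_train.shape[1]))
    y_outliers = np.full(n_outliers, n_classes)            # the new "unknown" label
    x_aug = np.concatenate([x_train, x_outliers], axis=0)
    y_aug = np.concatenate([y_train, y_outliers], axis=0)
    # The classifier is then trained with n_classes + 1 outputs.
    return x_aug, y_aug
```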


Uncertainty In Robotics And Downstream Decision-Making

How is uncertainty used in robotics? (43:32)

It's definitely something people are doing and thinking about. Awesome. One more question, from Mark: can you speak about the use of uncertainty to close the reality gap in sim2real applications? Okay, yeah, that's a great question. I personally don't know that much about sim2real. I'm thinking of the robotics context, where you have a simulation, you can train your robot in simulation, and then you'd like to deploy it as a real robot. I imagine that uncertainty and robustness are incredibly important there, but I don't know how they think about it in those particular applications. Clearly, if you deploy your robot in the real world, you would like it to express reasonable uncertainty about things that are out of distribution or that it hasn't seen before. I'm curious, Alexander, Ava, if you know an answer to that question.

I think Alexander can speak to it. Yeah, actually I was going to ask a kind of related question. So I think all of the approaches, and all of this interest in the field, where we're talking about estimating uncertainty either through sampling or other approaches, is super interesting. And yeah, I definitely agree: everyone is going to accept that these deep learning models need some way to express their confidence. I think one interesting application I haven't seen a lot of, and maybe there's a good opportunity here for the class, and this is definitely an interest of mine, is how we can build the downstream AI models to be advanced by these measures of uncertainty. For example, how can we build better predictors that leverage this uncertainty to improve their own learning? So if we have a robot that learns some measure of uncertainty in simulation, as it's deployed into reality, can it leverage that, instead of just conveying it to a human? I'm not sure if anyone in your group is focusing on that second half of the problem: not just conveying the uncertainty, but using it in some way as well.

Yeah, absolutely, that's definitely something that we're super interested in. Once you have better uncertainty, how do you use it, exactly? I think there are really interesting questions about how you communicate the uncertainty to, for example, a layman, or a doctor who's an expert in something but not in deep learning. And how do you actually make decisions based on the uncertainty? That's definitely a direction we're moving more towards. We've been spending a lot of time within our group just looking at the quality of the uncertainty of models, and at notions of calibration and these proper scoring rules, but those are kind of intermediate measures. What you really care about is the downstream decision loss, which for medical tasks might be how many people you ended up saving, or how many decisions in your self-driving car were correct decisions. So that's definitely something we're looking at a lot more. There's a great literature on this; it's called statistical decision theory, and I think Berger is the author of a classic book on it, which is a great reference for how to think about what the optimal decisions are given an amount of uncertainty. Awesome, thank you.
Great, I think there was one final question and it's more about a practical hands-on type of thing.


Deploying Bayesian Neural Networks in Production (47:28)

So let me do some finding. Yes: any suggestions and pointers on the actual implementation and deployment of Bayesian neural networks in more industrial or practically oriented machine learning workflows, in terms of production libraries or frameworks to use, and what people are thinking about there?


Libraries and Frameworks: Uncertainty Baselines and Edward (47:46)

Yeah, that's a really great question. Something that the Bayesian community, or the uncertainty and robustness community, hasn't been as good at is producing really easy-to-use, accessible implementations of models, and that's something that we've definitely been thinking about. We've open-sourced a library that includes benchmarks and model implementations for TensorFlow. It's called Uncertainty Baselines, and it was exactly trying to address this question. We thought, okay, everyone will be Bayesian, or get better uncertainty, if you could just hand them a model that had better uncertainty, and Uncertainty Baselines was an attempt at that. We're still building it out. It's built on top of Edward, which is a probabilistic programming framework for TensorFlow. There are a bunch of libraries that do probabilistic programming, which is about making Bayesian inference efficient and easy, but they're not made for deep learning. So I think something we need to do as a community is bring those two together: libraries for easy prediction under uncertainty, or for easily approximating integrals effectively, incorporated within deep learning libraries. But yeah, that's definitely something we're working on. Check out Uncertainty Baselines; I think it has implementations of everything that I talked about in this talk. Awesome.


Closing

Closing Remarks and Thanks (49:31)

Thank you so much. Are there any other questions from anyone in the audience, either through the chat, that I may have missed, or that you'd like to ask live? Okay, well, let's please thank Jasper once again for a super fascinating and very clear talk, and for taking the time to join us and speak with us today. Thank you so much. It was my pleasure. Thank you, Alexander, and thank you, Ava, for organizing, and thank you, everyone, for your attention.

