# MIT 6.S191: AI for Science

Transcription for the video titled "MIT 6.S191: AI for Science".


Thank you, Alex. Great to be part of this, and I hope you've had an exciting course so far and built a good foundation on the different aspects of deep learning as well as its applications. You can see I kept a general title, because the question is: is deep learning mostly about data? Is it about compute, or is it also about the algorithms? And ultimately, how do the three come together? What I'll show you is the role of principled design of AI algorithms in challenging domains. I'll be focusing on AI for science. As Alex just said, at Caltech we have the AI for Science initiative to enable collaborations across the campus and have domain experts work closely with AI experts. To do that, how do we build a common language and foundation? Why isn't this just a straightforward application of the current AI algorithms, and what is the need to develop new ones? How much domain specificity should we have, versus domain-independent frameworks? The answer to all of this, of course, is "it depends," but the main aspect that makes it challenging, one I'll keep emphasizing throughout this talk, is the need for extrapolation, or zero-shot generalization. You need to be able to make predictions on samples that look very different from your training data, and many times you may not even have supervision. For instance, if you're asking about the activity of the earth deep underground, you haven't observed it. So the ability to do unsupervised learning is important, and the ability to extrapolate beyond the training domain is important. That means the approach cannot be purely data driven: you have to take into account the domain priors, the domain constraints, and the physical laws. The question is, how do we bring all of that together in algorithm design?
You'll see some of that in this talk. Now, this is all great as an intellectual pursuit, but is there a need? To me, the need is huge, because across scientific computing and so many applications in the sciences, the requirement for computing is growing exponentially. With the pandemic, the need to understand how to develop new drugs and vaccines, and how new viruses evolve, is so important, and this is a highly multi-scale problem. We can go all the way down to the quantum level and ask how precisely we can do the quantum calculations, but that is not possible at scale: you cannot run those fine-scale calculations through numerical methods and then scale up to the millions or billions of atoms needed to model the real world. Similarly, if we want to tackle climate change and predict it precisely for the next century, we need to do that at fine scale. Saying the planet is going to warm by one and a half or two degrees centigrade is of course disturbing, but what is even more disturbing is asking what will happen to specific regions of the world: the global south, the Middle East, India, places that may be severely affected by climate change. So you want to go to very fine spatial resolution and ask, what is the climate risk here, and how do we mitigate it? And this starts to require enormous computing capability. If you take the current numerical methods and say, I want to run this at one-kilometer resolution, and I want predictions for just the next decade, that alone would take 10^11 times more computing than we have today. Similarly, for molecular properties: if we try to solve the Schrödinger equation, the fundamental equation that characterizes everything about a molecule, even for a 100-atom molecule it would take more than the age of the universe on current supercomputers.

So that's the point: no matter how much computing we have, we will need more. The hypothesis we're thinking about at NVIDIA is that GPUs give you some amount of scaling, and you can build supercomputers and scale up and out, but you need machine learning for a further thousand-fold to million-fold speedup on top of that, and then you can get all the way to the 10^9 needed to close that gap. So machine learning and AI become really critical, both to speed up scientific simulations and to be data driven. We have lots of measurements of the planet's weather over the last few decades; we have to extrapolate further, but we do have the data.

We are also collecting data through satellites as we go along. So how do we take that data together with the physical laws, say the fluid dynamics of how clouds move and how clouds form? You need to take all of that into account together. The same goes for discovering new drugs: we have data on the current drugs and a lot of information about their properties. How do we use that data along with the physics, whether at the level of classical mechanics or quantum mechanics, and decide at which level and at which point to work, in order to ultimately make discoveries, whether that's new drugs or a precise characterization of climate change? And we have to do it with the right uncertainty quantification. We are not going to be able to precisely predict what the climate will be over the next decade, let alone the next century, but can we also predict the error bars? We need precise error bars, and all of this is a deep challenge for current deep learning methods, because deep learning tends to produce models that are overconfident when they are wrong. We've seen that in famous cases like the Gender Shades study, where models were shown to fail on darker skin tones, and especially on women, while remaining very overconfident in those wrong predictions. You cannot just apply that directly to the climate case, because trillions of dollars are on the line in designing policies based on the uncertainties the model gives. So we cannot abandon the current numerical methods and say, let's do purely deep learning. And in the case of drug discovery, the space is so vast that we can't possibly search through all of it.

So how do we make the relevant design choices about where to explore? And there are so many other aspects: is this drug synthesizable? Will it be cheap to synthesize? There is much more to it than just the search space. So I hope I've convinced you that these are really hard problems. The question is, where do we get started, and how do we make headway in solving them? Here is what I want to cover in this lecture. If you think about predicting climate change, I emphasized fine-scale phenomena: it is well known in fluid dynamics that you can't just take measurements at the coarse scale and predict from those, because the fine-scale features are the ones driving the phenomena. You will be wrong in your predictions if you work only at the coarse scale. So the question is, how do we design machine learning methods that capture these fine-scale phenomena and don't overfit to one resolution? There is an underlying continuous space on which the fluid moves; if we discretize, take only a few measurements, and fit a deep learning model to them, it may overfit to just those discrete points rather than the underlying continuous phenomenon. So we'll develop methods that capture the underlying phenomenon in a resolution-invariant manner. The other aspect we'll see, for molecular modeling, is symmetry: if you rotate a molecule in 3D, the result should be equivalent. How to capture that in our deep learning models comes in the later part of the talk. So that's an overview of the challenges and some of the ways we can overcome them. And there is also lots of data available, which is a good opportunity.
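The rotation-symmetry requirement can be made concrete with a small sketch. This is my own numpy illustration, not code from the talk: features built from pairwise interatomic distances are unchanged by any 3-D rotation of the molecule, so a model that consumes them respects the symmetry by construction.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_orthogonal():
    # A random 3x3 orthogonal matrix (a rotation, up to reflection) via QR.
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    return q

def pairwise_distances(pos):
    # All atom-to-atom distances; invariant under any orthogonal transform.
    diff = pos[:, None, :] - pos[None, :, :]
    return np.linalg.norm(diff, axis=-1)

atoms = rng.standard_normal((5, 3))        # 5 atoms in 3-D, arbitrary positions
rot = random_orthogonal()

d_before = pairwise_distances(atoms)
d_after = pairwise_distances(atoms @ rot.T)   # rotate every atom position

# The distance features are identical, so a model built on them makes the
# same prediction for the rotated molecule.
print(np.max(np.abs(d_before - d_after)))
```

Equivariant architectures generalize this idea: instead of restricting to invariant inputs, each layer is built so that rotating the input rotates the output accordingly.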
And we can now train large-scale models. We've seen that in the language realm, including, though not shown here, NVIDIA's model with 530 billion parameters, and with that, language understanding has taken a huge quantum leap. That suggests that if we want to capture complex phenomena like the earth's weather, ultimately the climate, or molecular modeling, we will also need big models, and we now have better and more data available. All of this will help us get to good impact in the end. The other trend is bigger and bigger supercomputers, and with AI and machine learning the benefit is that we don't have to insist on high-precision computing. Traditional high-performance computing needed very high precision, 64-bit floating point, whereas with AI computing we can work in 32-bit, 16-bit, or even 8-bit, and in mixed precision, so we have much more flexibility in how we choose precision. That's another deeply beneficial aspect. Okay, let me now get into some algorithmic design. I mentioned a few minutes ago that standard neural networks are fixed to a given resolution: they expect an image of a certain size, and the output of whatever task you're doing is also a fixed size; for segmentation it would be the same size as the image. Why is this not enough? Because if you're solving fluid flow, for instance around airfoils, with standard numerical methods you decide what the mesh should be, and depending on the task you may want a different mesh and a different resolution.
So we want methods that remain invariant across these different resolutions. What we have is an underlying continuous phenomenon; we discretize and sample only at some points, but we don't want to overfit to predicting only at those points. We want to predict at points other than the ones seen during training, and under different initial conditions and boundary conditions. When we solve partial differential equations, we need all this flexibility, so if we want to replace current PDE solvers, we cannot just use standard neural networks as they are. Let's formulate what it means to learn such a PDE solver. A standard solver handles one instance of a PDE: it computes the solution at the query points by discretizing space and time at an appropriate, sufficiently fine resolution and then computing the solution numerically. We, on the other hand, want to learn a solver for a family of partial differential equations, say fluid flow. I want to learn to predict the velocity or the vorticity, all these properties, as the fluid flows, and to do that I need to learn what happens under different initial and boundary conditions: different initial velocities, different boundaries of the domain. So I'm given the initial and boundary conditions, and I need to find the solution. And if we have multiple training points, we can train to solve for a new point.

So if I now give you a new set of initial and boundary conditions, I ask what the solution is, and I could be asking at different query points, at different resolutions. That's the problem setup.
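An illustrative data setup for this kind of operator learning (my own toy example, not from the lecture): for 1-D advection, u_t + c·u_x = 0, the exact solution is u(x, t) = u0(x − c·t), so we can generate (initial condition, solution) training pairs and sample each pair at whatever resolution we like, which is exactly the "function pairs, not fixed-size vectors" point above.

```python
import numpy as np

rng = np.random.default_rng(0)
c, t = 1.0, 0.5                       # wave speed and query time (illustrative)

def random_initial_condition():
    # A random smooth periodic function on [0, 1): a few Fourier modes.
    a = rng.standard_normal(4)
    return lambda x: sum(a[k] * np.sin(2 * np.pi * (k + 1) * x) for k in range(4))

def make_pair(n_points):
    # Sample one (u0, u(., t)) training pair at an arbitrary resolution.
    u0 = random_initial_condition()
    x = np.linspace(0.0, 1.0, n_points, endpoint=False)
    return u0(x), u0((x - c * t) % 1.0)   # exact advection solution

# Different samples can come at different resolutions; the learning target
# is the map u0 -> u(., t), not a fixed-size vector.
a64 = make_pair(64)
a256 = make_pair(256)
print(a64[0].shape, a256[1].shape)
```

A neural operator trained on such pairs can then be queried at resolutions never seen during training.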

Any questions here? I hope that's clear. The main difference from the standard supervised learning you're familiar with, say on images, is that here we are not fixed to one resolution. You could have different query points at different resolutions in different samples, and different resolutions at training versus test time. So the question is how to design a framework that does not fit to one resolution but works across resolutions. We can think of this as learning in infinite dimensions: if you learn the map from the function space of initial and boundary conditions to the solution function space, then you can resolve at any resolution. So how do we build that in a principled way? To start, look at a standard neural network, say an MLP.

What an MLP has is a linear function, a matrix multiplication, and on top of that some nonlinearity. So you take linear processing and add a nonlinear function, and with this you get good expressivity: linear processing alone would be limiting, not expressive enough to fit complex data like images, but adding nonlinearity gives you an expressive network. The same holds for convolutional neural networks: a linear function combined with a nonlinearity. That's the basic setup, and we can ask whether we can mimic it, except that instead of assuming the input lives in fixed, finite dimensions, we allow an infinite-dimensional input: a continuous set on which the initial or boundary conditions are defined. In that case we can keep the same notion: a linear operator (an operator, because we are now potentially in infinite dimensions) composed with a nonlinearity. The question is what a practical design for these linear operators looks like. For this, we take inspiration from solving linear partial differential equations. I don't know how many of you have taken a PDE class; if not, not to worry, here are some quick insights. The most popular example of a linear PDE is heat diffusion: you have a heat source, and you want to see how the heat diffuses in space. The way you solve it uses what is known as the Green's function, which describes how the heat propagates in space: at each point you take this kernel function and integrate against it, u(x) = ∫ G(x, y) u0(y) dy, and that gives the temperature at any point. Intuitively, you are convolving with the Green's function, and doing that at every point gives the solution, the temperature everywhere. You can write the propagation of heat as this integration, or convolution, operation. This is linear, and it is not tied to a resolution, because it is continuous: you can query any new point and get the answer; it is not fixed to finite dimensions. So conceptually this is a way to incorporate a linear operator. If you only did this operation, you could only solve linear partial differential equations, but we add nonlinearity and compose over several layers, and that lets us handle nonlinear PDEs or any general system. The idea is that we will learn how to do this integration, over several layers, and obtain what we call a neural operator, which learns in infinite dimensions. Then the question is how to come up with a practical architecture that learns this kind of global, continuous convolution. Here is some signal processing 101; if you haven't taken such a course or don't remember, a quick primer. If you want to do a convolution in the spatial domain, it is much more efficient to transform to the Fourier domain, or frequency space, because convolution in the spatial domain becomes multiplication in the frequency domain.
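As a concrete sketch of the Green's-function solution just described (a minimal numpy illustration of the textbook formula, not code from the lecture), here is the 1-D heat equation solved by convolving the initial condition with the heat kernel:

```python
import numpy as np

def heat_kernel(x, t, nu):
    # Green's function of the 1-D heat equation u_t = nu * u_xx.
    return np.exp(-x**2 / (4 * nu * t)) / np.sqrt(4 * np.pi * nu * t)

def solve_heat(u0, x, t, nu):
    # u(x, t) = integral of G(x - y, t) * u0(y) dy, via a Riemann sum.
    dx = x[1] - x[0]
    G = heat_kernel(x[:, None] - x[None, :], t, nu)   # kernel at all pairs
    return G @ u0 * dx

x = np.linspace(-10.0, 10.0, 401)
u0 = np.exp(-x**2)                 # initial temperature: a Gaussian bump
u = solve_heat(u0, x, t=1.0, nu=0.5)

# Diffusion spreads the bump but conserves total heat, and the continuous
# formula can be queried at any point; nothing here is tied to one grid.
print(np.trapz(u, x), np.trapz(u0, x))
```

The neural operator generalizes exactly this pattern: it replaces the known kernel G with a learned one and stacks such integral layers with nonlinearities in between.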

So you multiply in the frequency domain, take the inverse Fourier transform, and you have performed the convolution. The other benefit of the Fourier transform is that it is global. In a standard convolutional neural network the filters are small, so the receptive field is small even over a few layers, and you capture only local phenomena. That is fine for natural images, where edges are local, but fluid flow and these partial differential equations have lots of global correlations, and the Fourier transform captures them: in the frequency domain you capture these global correlations effectively. With this insight, we arrive at an architecture that is very simple to implement. In each layer we take the Fourier transform, moving the signal to the frequency domain, and learn weights across frequencies: which ones to up-weight, which to down-weight. We learn these weights, and we keep only the low frequencies when taking the inverse Fourier transform. This truncation is mostly a regularization. Of course, a single such layer would be highly limited in expressivity, since it throws away all the high-frequency content, but we add nonlinearity and stack several layers, so that is okay. How much to filter out is a hyperparameter; it acts as regularization and makes training stable. That is an additional detail, but at a high level we train by learning weights in the frequency domain, with nonlinear transforms in between to give expressivity. This is a very simple formulation, but the previous slides hopefully gave you insight into why it is principled. In fact, we can show theoretically that this can universally approximate any operator, including solution operators of nonlinear PDEs, and for specific families, like fluid flows, we can argue that it does so efficiently, without a lot of parameters; that is an approximation bound. In many cases it also incorporates the inductive bias of the domain: expressing signals in the Fourier domain is much more efficient, and even traditional numerical methods for fluid flow use spectral decomposition, taking the Fourier transform and solving in the Fourier domain. We are mimicking some of those properties in the neural network design, and that has been a nice benefit. The other thing I want to emphasize is that you can process input at any resolution: if the input comes at a different resolution, you can still take the Fourier transform and go through the same layers, and this generalizes across resolutions much better than convolutional filters, which are learned at one resolution and do not easily generalize to another. Any questions here?
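Here is a minimal 1-D sketch of the layer just described: FFT, learned complex weights kept only on the lowest modes, inverse FFT, then a pointwise nonlinearity. This is my own toy illustration in numpy with untrained random weights, not the official implementation; the point it demonstrates is that because the weights act on frequencies, the same layer applies unchanged at any sampling resolution.

```python
import numpy as np

rng = np.random.default_rng(0)

def fourier_layer(u, weights, modes):
    # To the frequency domain, re-weight only the lowest `modes`
    # frequencies (the rest are truncated), back to the spatial domain,
    # then a pointwise nonlinearity (ReLU here).
    u_hat = np.fft.rfft(u)
    out_hat = np.zeros_like(u_hat)
    out_hat[:modes] = u_hat[:modes] * weights
    v = np.fft.irfft(out_hat, n=u.size)
    return np.maximum(v, 0.0)

modes = 8
weights = rng.standard_normal(modes) + 1j * rng.standard_normal(modes)

# The same weights act on a 64-point and a 256-point sampling of the same
# underlying function, and compute the same continuous output:
x_lo = np.linspace(0.0, 2 * np.pi, 64, endpoint=False)
x_hi = np.linspace(0.0, 2 * np.pi, 256, endpoint=False)
y_lo = fourier_layer(np.sin(x_lo), weights, modes)
y_hi = fourier_layer(np.sin(x_hi), weights, modes)
print(np.max(np.abs(y_hi[::4] - y_lo)))   # ~0: resolution-invariant
```

A full model stacks several such layers, with pointwise lifting and projection layers at the input and output; a standard convolutional filter, by contrast, has weights tied to grid spacing and would not transfer across the two resolutions.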

I hope this concept was clear: why the Fourier transform is a good idea and a good inductive bias, how it lets you process signals at different resolutions, and the principled view that each layer performs a global, continuous convolution, which together with nonlinear transforms gives an expressive model. We have a few questions coming in through the chat. Are you able to see them, or would you like me to read them? Yes, I can see them here. OK, great. First, how generalizable is the implementation? That really depends. You can take Fourier transforms on different domains, and you could also use nonlinear Fourier transforms. The question is, if you want to go beyond the FFT, are there other transforms you could learn end to end? These are aspects we are now looking into for domains where the grid may not be uniform; but if it is uniform, you can use the FFT, and that is very fast. As for the kernel R: yes, but it is not tied to the resolution. Remember, that weight matrix is in the frequency domain, so you can always transfer to frequency space no matter what the spatial resolution is. That is why this is okay; I hope that answers your question. The next question is about integrating over different resolutions and training a single model. You can: if your data comes at different resolutions, you can feed all of it to this network and train one model, and at test time, if you have query points at a different resolution, you can still use the model.
And that's the benefit: we don't want to train different models for different resolutions, because that is clunky and expensive, and because a model that doesn't generalize from one resolution to another may not be correctly learning the underlying phenomenon. The goal is not just to fit training data at one resolution; the goal is to accurately predict, say, the fluid flow, and you get better generalizability by doing this in a resolution-invariant manner. I hope that answers your questions. Now let me show you some quick results. This is Navier-Stokes in two dimensions, and here we train only on low-resolution data and test directly on high resolution, so this is zero-shot. You can see visually that the prediction matches the ground truth. We can also look at the ability to capture the energy spectrum: it is well known that convolutional networks like a standard UNet do not capture high frequencies well, which is already a limitation even at the training resolution, and if you further extrapolate to a higher resolution than the training data, the UNet deviates a lot more while our model stays much closer. We are also working on further versions to capture this high-frequency content well. So that's the idea: we can handle resolutions beyond the training data. Let's see the next question: the phase information. Remember, we keep both phase and amplitude: in the frequency domain we work with complex numbers.
So we are not throwing away the phase information; we keep amplitude and phase together. That's a good point. We intuitively think of processing only real numbers, because that's what standard neural networks do. And by the way, if you're using these models, be careful: PyTorch had a bug in the gradient updates for complex numbers, they forgot a conjugate somewhere, and we had to redo that. But yes, this is complex numbers. Great. I have a lot of application examples toward the end; I'm happy to share the slides so you can look. In the remaining few minutes I want to add one more detail on developing these methods further, which is to also add a physics loss. Here I am given the partial differential equation, so it makes complete sense to also check how close the model is to satisfying the equation. I can add an additional loss function that asks: am I satisfying the equation or not? One little detail: if you want to do this at scale, auto-differentiation is expensive, so we do it in the frequency domain, and very fast; that's something we developed. The other useful detail is this: you can first train on lots of different problem instances.

So you can train on many different initial and boundary conditions and learn a good model. This is what I described before; this is supervised learning. But now suppose I want to solve one specific instance of the problem: at test time I am told the initial and boundary conditions and asked for the solution. I can fine-tune further on the equation loss and get a more accurate solution, because I am not relying only on the generalization of my pre-trained model; I can look at the equation laws and fine-tune on them. By doing this, we show that we can get good errors. We can also ask about the trade-off between training data, which requires a numerical solver to generate enough training points, and just the equations: if I only impose the loss of the equations, I don't need any solver or any data from one. What we see is that the right balance, to quickly get to good solutions over a range of conditions, is a small amount of training data, obtained by querying your existing solver, augmented with the unsupervised equation loss over many more instances; then you get good generalization capabilities. That is the balance between being data informed and physics informed: a hybrid where the physics-informed part covers many more samples because it is free, essentially data augmentation, on top of a small amount of supervised learning. That trade-off pays off very well. Question: is the model assumed to be at a given time?
In this case we looked at the overall error, the L2 error in both space and time. It depends on the setup: some of the example PDEs we used were time-independent, others time-dependent. I think this one is time-dependent, so this is the average error.
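The physics loss just described can be shown in miniature. This is my own toy example with a deliberately trivial equation, using spectral derivatives in the frequency domain as mentioned above: the loss is simply the mean squared residual of the equation, and it needs no solver-generated labels at all.

```python
import numpy as np

def spectral_second_derivative(u):
    # d2u/dx2 on a 2*pi-periodic grid: multiply by (ik)^2 in Fourier space.
    n = u.size
    k = np.arange(n // 2 + 1)          # integer wavenumbers for rfft
    return np.fft.irfft((1j * k) ** 2 * np.fft.rfft(u), n=n)

def physics_loss(u):
    # Residual of the toy equation u_xx + u = 0; zero iff u satisfies it.
    residual = spectral_second_derivative(u) + u
    return np.mean(residual**2)

x = np.linspace(0.0, 2 * np.pi, 128, endpoint=False)
print(physics_loss(np.sin(x)))       # exact solution: loss ~ 0
print(physics_loss(np.sin(2 * x)))   # violates the equation: loss = 4.5
```

In the hybrid scheme described above, a term like this is evaluated on many unlabeled instances and added to the supervised data loss, since generating equation residuals is free compared to running a numerical solver.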

Great. I don't know how much longer I have; Alex, are we stopping at 10:45? Yes, we can go until 10:45, and we'll wrap up some of the questions after that. Okay. So, quickly, another conceptual aspect that is very useful in practice: solving inverse problems.

The way partial differential equation solvers are often used in practice is that you already have the solution and want to find the initial condition. For instance, if you ask about activity deep underground, that is the initial condition: it propagated, and what you observe is the activity at the surface. The same goes for the famous example of black-hole imaging: you do not observe the black hole directly. All these indirect measurements mean we are solving an inverse problem. With this method we can do it two ways. We can learn a partial differential equation solver in the forward direction, from initial condition to solution, as we did just now, and then try to invert it and find the best fit. Or we can directly learn the inverse problem: given the solution, learn to find the initial condition, trained over lots of data. Doing that is fast and effective, and you avoid the MCMC loop, which is expensive in practice; you get the speedup of replacing both the partial differential equation solver and MCMC. Chaos is another aspect I won't get into: it is more challenging, because for a chaotic system we ask whether you can predict its long-run statistics, and we have frameworks for that as well. There is also a nice connection with transformers: you can think of transformers as finite-dimensional systems and look at the continuous generalization, where the attention mechanism becomes integration.
And so you can replace it with these Fourier neural operator kinds of models for the spatial mixing, and potentially even the channel mixing, and that can lead to good efficiency: you get the same performance as a full self-attention model but much faster, because of the Fourier transform operations. We have many applications of these different frameworks; I've listed just a few here, but I'll stop and take questions instead. Lots of application areas, and that's what has been exciting: collaborating across many different disciplines.

Thank you so much, Anima. Are there any remaining questions from the group and the students? Yeah, we got one question in the chat: thanks again for the very nice presentation; will neural operators be usable across various application domains in a form similar to pre-trained networks?

Oh yeah, right, pre-trained networks. So the question is whether neural operators will be useful across many domains, similar to language models: you have one big language model and then you apply it in different contexts. There, the aspect is that there's one common language; you wouldn't be able to train on English and apply it directly to Spanish, but you could use that model as a starting point to train again. So it depends on the nature of the partial differential equations. For instance, if I have a model for fluid dynamics, that's a starting point for weather or climate prediction, and I can then build other aspects on top, because in the clouds there's fluid dynamics, but there's also precipitation and other microphysics. So you could plug in models as modules in a bigger one, because there are parts of the problem it already knows how to solve well.
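The spectral-mixing idea mentioned above (attention replaced by integration via Fourier transforms) can be illustrated with a minimal, hypothetical Fourier-layer sketch; the complex weights here are random stand-ins for trained parameters. Because the weights live on frequency modes rather than grid points, the same layer can be evaluated at a finer resolution, which is the discretization-invariance property of neural operators.

```python
import numpy as np

def fourier_layer(u, weights, w_local, modes):
    """One hypothetical FNO-style layer on a 1D signal u sampled on a
    uniform grid: FFT, keep the lowest `modes` frequencies, multiply each
    kept mode by a learned complex weight, inverse FFT, add a local term."""
    n = u.shape[0]
    u_hat = np.fft.rfft(u)                     # to the spectral domain
    out_hat = np.zeros_like(u_hat)
    out_hat[:modes] = weights * u_hat[:modes]  # global mixing in O(n log n)
    return np.fft.irfft(out_hat, n=n) + w_local * u

rng = np.random.default_rng(1)
modes = 8
weights = rng.standard_normal(modes) + 1j * rng.standard_normal(modes)

# The same "learned" weights evaluated at two different resolutions:
u64 = np.sin(3 * np.linspace(0, 2 * np.pi, 64, endpoint=False))
u128 = np.sin(3 * np.linspace(0, 2 * np.pi, 128, endpoint=False))
v64 = fourier_layer(u64, weights, w_local=0.5, modes=modes)
v128 = fourier_layer(u128, weights, w_local=0.5, modes=modes)
print(v64.shape, v128.shape)
```

The O(n log n) FFT is what gives the claimed speed advantage over full self-attention, which mixes all pairs of positions at O(n²) cost.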
Or you could do it in a multi-scale way. In this example of stress prediction in materials, you can't do it all at one scale: there's a coarse-scale solver and a fine-scale one, so you can have solvers at different scales that you train separately as neural operators and then jointly fine-tune together. So in this case it's not as straightforward as language, because yes, we have universal physical laws, but can we train one model to understand physics, chemistry, and biology? That seems too difficult, maybe one day. The question is also what kinds of data we feed in and what kinds of constraints we add. I think one day it's probably going to happen, but it's going to be very hard. That's a good question.

Yeah, I actually have one follow-up question. I think the ability for extrapolation specifically is very fascinating and... Sorry, Alex, you're completely breaking up; I can't hear you. Maybe you can type.

So the next question is whether, from the physics perspective, we can interpret this as learning the best renormalization scheme. Indeed, even for convolutional neural networks there have been connections to renormalization. We haven't looked into it here; there's potentially an interpretation, but the current interpretation we have is much more straightforward: we see each layer as an integral operator, which would be solving a linear partial differential equation, and we can compose the layers together so that we have a universal approximator. But yeah, that's to be seen; it's a good point.

Another question is whether we can augment this to learn symbolic equations.
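The multi-scale composition described above (a coarse-scale solve whose output is prolonged to a fine grid and then corrected) can be sketched with stand-in linear maps; `coarse_model` and `fine_model` here are hypothetical placeholders for the separately trained neural operators mentioned in the answer.

```python
import numpy as np

# Hypothetical stand-ins: in practice each would be a trained neural operator.
def coarse_model(u_coarse):
    return 0.9 * u_coarse                      # coarse-scale solve

def fine_model(u_fine):
    return u_fine + 0.01 * np.roll(u_fine, 1)  # fine-scale correction

n = 64
u = np.sin(np.linspace(0, 2 * np.pi, n, endpoint=False))

u_coarse = u[::4]                                      # restrict to coarse grid
v_coarse = coarse_model(u_coarse)                      # solve at the coarse scale
fine_x = np.arange(n) / 4.0
v_up = np.interp(fine_x, np.arange(n // 4), v_coarse)  # prolong to fine grid
v = fine_model(v_up)                                   # correct at the fine scale
print(v.shape)
```

Joint fine-tuning, as described in the talk, would then backpropagate through this whole restrict-solve-prolong-correct pipeline at once.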
Yeah, so it's certainly possible, but it's harder, right, to discover new physics or new equations, new laws. It's always the question of: we've seen all of this here, but what about the unseen? But yeah, it's definitely possible.

And Alex, I guess you're saying the ability for extrapolation is fascinating, and asking about the potential to integrate uncertainty quantification and robustness. I think these are really important. In other threads we've been looking into uncertainty quantification and how to get conservative uncertainty for deep learning models. The foundation there is adversarial risk minimization, or distributional robustness, and we can scale those up, so I think that's an important aspect.

In terms of robustness as well, there are several other threads we are looking into: when designing, say, transformer models, what is the role of self-attention in getting good robustness? Or what is the role of generative models in getting robustness: can you combine them to purify the noise in a certain way, or denoise? We've seen really promising results there, and I think there are ways to combine that here; we will need it in so many applications.

Excellent, thank you. Maybe time for just one more question now. Sorry, I still couldn't hear you, but I think you were saying thank you.

The other question is about the speedup: yes, it is wall-clock time against the traditional solvers, and the speed increases with parallelism. We can certainly make this even more efficient. We are now scaling this up, in fact to more than 1,000 GPUs in some of the applications, so there's also the engineering side of it, which is very important: this combination of data and model parallelism, and how to do that at scale. Those aspects become very important when we're looking at scaling this up.

Awesome, thank you so much, Anima, for an excellent talk and for fielding questions. Sorry, I can't hear anyone for some reason. Can others in the Zoom hear me? Yeah. Maybe it's on my end, but it's fully muffled for me. But anyway, I think it has been a lot of fun.
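The ensemble-style route to conservative uncertainty touched on above can be sketched as follows. This is a generic bootstrap-ensemble toy, with linear least-squares fits standing in for deep models; it is not the adversarial-risk-minimization method referenced in the talk, only an illustration of using model spread as an uncertainty estimate.

```python
import numpy as np

# Hypothetical bootstrap ensemble: linear least-squares fits stand in for
# deep models; the spread across members is the uncertainty estimate.
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 0.1 * rng.standard_normal(100)  # truth: slope 2, small noise

x_test = 0.5
preds = []
for _ in range(20):
    idx = rng.integers(0, 100, size=100)              # bootstrap resample
    slope, intercept = np.polyfit(x[idx], y[idx], 1)  # "train" one member
    preds.append(slope * x_test + intercept)

mean = float(np.mean(preds))   # predictive mean, near 2.0 * 0.5 = 1.0
spread = float(np.std(preds))  # ensemble spread as an uncertainty proxy
print(mean, spread)
```

Conservative methods of the kind mentioned in the talk aim to guarantee such intervals do not understate the error, even under distribution shift, which matters for the extrapolation settings emphasized throughout the lecture.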

So yeah, I hope you now have a good foundation in deep learning, and you can go ahead and do a lot of cool projects. Reach out to me if you have further questions or anything you want to discuss further. Thanks a lot. Thanks, everyone.
