MIT 6.S191 (2020): Machine Learning for Scent

Transcription for the video titled "MIT 6.S191 (2020): Machine Learning for Scent".



Introduction

Introduction (00:00)

Hey everybody. First of all, thank you for inviting me, and thank you for organizing all this. This seems like a really, really cool, what's it called? So J-term is Harvard, what's this called? IAP. Okay, cool. So I'm sure there's many different courses you could choose from. It's really cool that you were able to choose this one. So I'm gonna tell you a bit about some of the work that I'm doing now. In the past I've done machine learning for biology, I've done machine learning systems, kind of pure methodology stuff. And this is a bit of a weird project, but it's my favorite project I've done in my life so far. I've waited a long time to do it. It's still really early, so this isn't a collection of great works at great conferences at this point. This is kind of fresh off the presses, so any feedback is definitely welcome. We're kind of venturing into unknown territories. And I should mention this is the work of many people. I'm very proud to represent them here, but this is definitely not a solo effort at all, even though I'm the only person here. So my name is Alex. I'm a research scientist at Google Research. When you work at a megacorp, there are interesting levels of organization, so if you hear these words, I'll explain what the different tiers mean. Google Research is all the researchers in Google. Google Brain is the sub-team that's existed for some time that focuses on deep learning. And then I lead a team inside of Google Brain that focuses on machine learning for olfaction. And Google Research is big. When I joined, I didn't appreciate just how big it was. There's 3,500 researchers and engineers across 18 offices, and I think that's actually out of date, in 11 countries. And the large-scale mandate is to make machines intelligent and improve people's lives. And that could mean a lot of different things. Our approach generally, and the number one bullet here is kind of my favorite, and this is where I spend most of my time, is doing foundational research. So we're, at least in my opinion, in another golden era of industrial research, kind of like Bell Labs and Xerox PARC, those eras. Now we have really wonderful, thriving industrial research labs, and I feel really fortunate to be able to work with people in this environment. We also build tools to enable research and democratize artificial intelligence and machine learning. TensorFlow, like we've got the shirts up there, I don't have that shirt, so that's kind of a collector's item, I guess. We open-source tools to help people use and deploy machine learning in their own endeavors. And then there's also internally enabling Google products with artificial intelligence. One of the activities that Google Brain has done a lot historically is to collaborate with teams across Google to add artificial intelligence. And here's some logos of some of the products that AI and ML has impacted: you can see YouTube on there, you can see Search, Ads, Drive, Android.


Exploration Of Smell Digitization And Prediction Using Graph Neural Networks

Digitizing smell (02:55)

And a lot of these have something in common, which is that Google knows a lot about what the world looks like and a lot about what the world sounds like. But then this is where the work gets a little bit sci-fi, and this is where I step in: it doesn't know a lot about what the world smells like and tastes like. And that might seem silly, but there's a lot of restaurants out there, a lot of menu items, there's a lot of natural gas leaks, there's a lot of sewage leaks, there's a lot of things that you might want to smell or avoid. And further than that, in terms of building like a Google Maps for what the world tastes and smells like, there are control systems where you might actually want to know what something smells like, like a motor being burned out, or you might want to know what something tastes like if there's a contaminant in some giant shipment of orange juice or something like that. So we haven't digitized the sense of smell. It seems a little bit silly that we might want to do that, but that was perhaps something that seemed silly for vision before the talkies, right, before movies, and for audition before the phonograph came about. Those were weird things to think about for those sensory modalities in the 1800s and 1900s. But right now it seems like digitizing scent and flavor are the silly things to think about. Nonetheless, we persevere and work on that. So we're starting from the very, very beginning with the simplest problem, and I'll describe that in a second, but first some olfaction facts. My training is actually in olfactory neuroscience, and I took a kind of circuitous route and ended up in machine learning. And so since I have you captive here, I want to teach you a little bit about how the olfactory system works. This is if you took somebody's face and sliced it in half. And the interesting, do I have a pointer here? Great. So there's a big hole in the center of your face, and that's where, when you breathe in through your nose, air is going. Most of the air that goes into your head is not smelled; most of it just goes right to your lungs. There's a little bit at the top there, which is called the olfactory epithelium.


The sense of smell (04:28)

It's just a patch of tissue, like five or ten millimeters square, maybe a little bit more than that, but it's very small, and that's the only part of your head that can actually smell. And the way it's structured is, nerve fibers from the olfactory bulb actually poke through your skull, and they innervate the olfactory epithelium, and it's one of only two parts of your brain that leaves your skull and contacts the environment. The other one's the pituitary gland; that one dips into your bloodstream, which is kind of cheating. And there's three words that sometimes get used in the same sentence: taste, scent, and flavor. So taste lives on your tongue, and flavor is a collaboration between your nose and your tongue. What happens when you eat something is you masticate it, you chew it up, and that releases a bunch of vapors, and there's a chimney effect where the heat of those vapors and of your own body shoots them back up your nose. It's called retronasal olfaction. You might notice, if you've had a cold, that things taste kind of more bland. That's because your sense of smell is not working as much and is not participating in flavor. So, little factoids there for you before we get to the machine learning part. So there's three colors in vision, RGB, and there's three cones or cell types, photoreceptor types, in your eye that roughly correspond to RGB. There's 400 receptor types in the nose, and we know a lot less about each one of these than we do about the photoreceptors. We don't know what the receptors actually look like, they've never been crystallized, so we can't build deterministic models of how they work. And in mice there's actually a thousand of these, and there's two thousand in elephants. Maybe elephants smell better, or this could be an evolutionary byproduct where they don't use all of them; we actually don't know. But that's another fun party fact. They also comprise an enormous amount of your genome, about 2% of the protein-coding genome, which is an immense expense. So for something that we actually pay comparatively little attention to in our daily lives, it's actually an enormous part of our makeup, right? So, worth paying attention to. And we don't really know which receptors respond to which ligands. Basically, we don't know enough about the sense itself to model it deterministically, which is kind of a hint: maybe we should actually skip over modeling the sense and model the direct percept that people have. This is my other favorite view of the nose. Instead of cutting sagittally like before, this is a coronal section. This outline is the airways of your turbinates. I think this is a really beautiful structure, and I think it's under-taught. The curly bits that are big down here, this is just where air is flowing and being humidified. And then this little bit up top is where you actually smell, the upper and lower turbinates. And this is what I used to study in mice. This is a mouse's upper and lower turbinates. You notice it's a lot more curly. The more that smell is important to an organism, the curlier this gets, meaning the higher the surface area inside of this particular sensory organ. There are actually some cool papers on this in otters. It's like this times 100. It's really incredible. You should go look it up. And there's this notion that smell is highly personalized and there's no logic to it.
For vision and audition, we've got fast Fourier transforms and we've got Gabor filters; we've got a lot of theory around how vision and hearing are structured, and the idea that's kind of pervasive is that scent is somehow wishy-washy. And people do smell different things. It is somewhat personal, but people also see different things. Who's familiar with this one? It's black and blue to me, but I'm sure it's white and gold to some people. Actually, who is it white and gold for? Who is it black and blue for, right? So maybe vision isn't as reliable as we thought; it doesn't maintain itself on top of the pedestal. I actually cannot unsee this as white and gold. I'll let you resolve that between your neighbors. So there are examples of specific dimorphisms in the sense of smell that can be traced to individual nucleotide differences, right, for single molecules, which is pretty incredible. There are genetic underpinnings to the dimorphisms that we perceive in smell, and they are common; there are a lot more SNPs that look like likely dimorphisms, they just haven't been tested. But you know, five percent of the world is colorblind, and 15% of the world has selective hearing loss, right? So let's give the sense of smell a little bit more credit, and let's be aware that we each see the world, hear the world, and smell the world a little bit differently. It doesn't mean there's not a logic to it. It doesn't mean that the brain, evolutionarily speaking, hasn't adapted to tracking patterns out there in order to transduce them into useful behaviors that help us survive. So that's a bit of background on olfaction. What we're doing is starting with the absolute simplest problem, right? So when I mentioned foundational research, I really meant it, so this is going to look quite basic.


Problem setup (10:11)

So this is the problem. You've got a molecule. On the left this is vanillin, right? This is the main flavor and olfactory component of the vanilla bean. The vanilla plant's flower is white, so we think vanilla soap should be white, but the bean is actually this kind of nice dark black when it's dried, and if you've ever seen vanilla extract, that's the color. Real vanilla is actually incredibly expensive and is the subject of a lot of criminal activity, because the beans can be easily stolen and then sold for incredible markups. That's the case for a lot of natural products in the flavor and fragrance space; there are a lot of interesting articles you can google to find out about that. So part of the goal here is that if we can actually design better olfactory molecules, we can actually reduce crime and strife. So the problem is: we've got a molecule, let's predict what it smells like. Sounds simple enough, but there are a lot of different ways that we can describe something as having a smell. We can describe it with a sentence, so it smells sweet with a hint of vanilla, some notes of creamy, and a back note of chocolate. Which sounds funny for vanillin, but that's indeed the case. What we'd like to work with is a multi-label problem, where you've got some finite taxonomy of what something could smell like, and then only some of them are activated, here creamy, sweet, vanilla, chocolate, right? So that's what we like to work with. So why is this hard? Why is this not something that you've heard is already being sold? So this is Lyral. You've all probably smelled this molecule, you've all probably had it on your skin. This is the dryer sheet smell, right? This is the fresh laundry smell. A very commercially successful molecule that is now being declared illegal in Europe because it's an allergen. In the US, we don't really care about that generally, as far as I can tell; we just have different standards, I suppose I should say. The four main characteristics are muguet, which is another word for lily of the valley (that's that flower, the dryer sheet smell), fresh, floral, and sweet. So here are some different molecules that smell about the same. They look about the same. For this guy, we just clipped the carbon off of this chain here. For this guy, we just attached it to a functional group on the end. So why is this so hard? The main structure, the scaffold here, is the same. This molecule looks very different and smells the same, right? You can have wild structural differences for many, many different odor classes and still maintain the same odor percept, right? This is a difference of where the electrons are in this ring, right? Structurally, it turns out this kind of stiffens things, but even in 3D representations that are static, and certainly in the graph representation here, these look pretty close, yet you've rendered something very commercially valuable into something useless.
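To make the target representation concrete, here's a minimal sketch of how a multi-label odor target could be encoded as a multi-hot vector over a fixed descriptor taxonomy. The tiny taxonomy and the label set for vanillin below are illustrative placeholders, not the actual dataset's.

```python
import numpy as np

# Toy descriptor taxonomy and labels; the taxonomy in the talk has 138 descriptors.
taxonomy = ["creamy", "sweet", "vanilla", "chocolate", "floral", "woody"]
vanillin_labels = {"creamy", "sweet", "vanilla", "chocolate"}

# Multi-hot target: 1 where a descriptor applies to the molecule, 0 elsewhere.
y = np.array([1.0 if d in vanillin_labels else 0.0 for d in taxonomy])
print(dict(zip(taxonomy, y.tolist())))
```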


Molecule fragrance dataset (12:58)

So we built a benchmark data set. I'll describe where this is from and what the characteristics are, but we took two sources of basically perfume materials catalogs, so like the Sears catalog for things that perfumers might want to buy. I think we've got about 5,000 molecules in this data set, and they include things that will make it into fine fragrance or into fried chicken flavoring or all kinds of different stuff, and on average there are four to five labels per molecule. So here's an example, this is vanillin, and it's actually labeled twice in the data set, so there's some consistency here. That's another question you might have: how good are people at labeling what these odors actually smell like? And the answer is, you and me, probably everybody in the room, most people in the room are bad, right? And that's because we grew up, or at least I did, with a Crayola crayon box that helped me do color-word associations, but I didn't have anything to help me do odor-word associations. You can learn this. It's harder to learn as an adult, but you definitely can. I took my team to get perfume training. It's incredibly difficult, but it's a craft that can be practiced. And amongst experts that are trained on the taxonomy, people end up being quite consistent; we have some human data that I don't have in this talk that indicates that's the case. There is a lot of bias, a lot of skew, in this data set, because these are perfumery materials, so we have over-representation of things that smell nice. So lots of fruit, green, sweet, floral, woody. You don't see a lot of solvent, bread, or radish perfumes, and so we have fewer of those molecules, which I guess is to be expected. The reason there's a spike at 30 is that we impose a hard cutoff there; we just don't want to have too little representation of a particular odor class. And there's no modeling here. This is a picture, and I'll break it down. We have 138 odor descriptors, and they're arrayed on the rows and columns, so each row and each column has the same indexing system: it's an odor ID. And each (i, j)th entry here is the frequency with which those two odor descriptors show up in the same molecule. So if I had a lot of molecules that smell like both pineapple and vanilla, which doesn't happen, but if I did, then the pineapple-vanilla entry would be yellow. So what shows up just in the data is structure that reflects common sense, right? Clean things show up together: pine, lemon, mint, right? Toasted stuff like cocoa and coffee, those go together; they often co-occur in molecules. At the monomolecular level, these things are correlated. Savory stuff like onion and beef has low co-occurrence with popcorn. You don't want beef popcorn, generally. Maybe you would actually, would that be good? I have no idea. Dairy, cheese, milk, stuff like that. So there's a lot of structure in this data set. And historically what people have done is treat the prediction problem of going from a molecule's structure to its odor one odor at a time. What this indicates is that there's a ton of correlation structure here you should exploit. You should do it all at once, hint hint, deep learning, et cetera.
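As a small illustration of the kind of co-occurrence picture being described, here's a sketch of how a descriptor co-occurrence matrix can be computed from a multi-hot label matrix. The label matrix below is toy data, and the variable names are just placeholders.

```python
import numpy as np

# Y has shape (n_molecules, n_descriptors); entry (i, j) of C counts how often
# descriptors i and j are applied to the same molecule.
Y = np.array([
    [1, 1, 0, 0],  # molecule 1 labeled with descriptors 0 and 1
    [1, 0, 1, 0],  # molecule 2 labeled with descriptors 0 and 2
    [0, 1, 0, 1],  # molecule 3 labeled with descriptors 1 and 3
])
C = Y.T @ Y              # descriptor-by-descriptor co-occurrence counts
np.fill_diagonal(C, 0)   # drop self co-occurrence so off-diagonal structure stands out
print(C)
```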


Baseline algorithms (16:00)

So what people did in the past is basically use pen and paper and intuition. This is Kraft's vetiver rule. I presented this slide and took a little bit of a swipe at how simplistic it is, and Kraft was sitting right there, which was a bit of an issue. We had a good back and forth, and he was like, yeah, we did this a long time ago, and we can do better. But the essence of this is you observe patterns with your brain, your occipital cortex, you write down what they are, you basically train on your test set, and you publish a paper. That seems to be kind of the trend in the classic structure-odor relationship literature. There are examples of these rules being used to produce new molecules, so ones that do generalize; the ones that I've been able to find in searching the literature are in the upper right-hand corner here. But generally these are really hard to code up, right, because there's some kind of fudge factor; it's not quite algorithmic. What people do now, kind of the incumbent approach, is to take a molecule and treat it as a bag of subgraphs. Let me explain what that means. Go through the molecule, think of it as a graph, and pick a radius. First you only look at atoms, and you ask, okay, what atoms are in this molecule? Then you say, okay, I'm going to look at things of radius one: what atom-atom pairs are there? Are there carbon-carbons? Are there carbon-sulfurs, carbon-oxygens? You go through this comprehensively up to a radius that you choose, you hash that into a bit vector, and that's your representation of the molecule. It's kind of like a bag-of-words or bag-of-fragments representation. The modern instantiation of this is called Morgan fingerprints, so if you ever hear Morgan fingerprints or circular fingerprints, this is effectively the strategy that's going on. You typically put a random forest on top of that and make a prediction, and you can predict all kinds of things: toxicity, solubility, whether or not it's going to be a good material for photovoltaic cells, whether or not it's going to be a good battery. And the reason why this is the baseline is because it works really well and it's really simple to implement, right? Once you've got your molecules loaded in, you can start making predictions in two lines of code with RDKit and scikit-learn. It's a strong baseline; it's just kind of hard to beat sometimes.
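Here's a minimal sketch of that incumbent baseline: hashed circular (Morgan) fingerprints from RDKit with a random forest from scikit-learn on top. The molecules, labels, and hyperparameters below are illustrative placeholders rather than the actual benchmark setup.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

# Toy molecules (SMILES) and made-up multi-label targets over
# ["sweet", "vanilla", "fruity"]; these are placeholders, not the real dataset.
smiles = ["COc1cc(C=O)ccc1O", "O=Cc1ccccc1", "CCOC(C)=O", "CC1=CCC(CC1)C(=C)C"]
y = np.array([[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 1]])

def fingerprint(s, radius=2, n_bits=2048):
    """Hashed circular (Morgan) fingerprint as a numpy bit vector."""
    fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), radius, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

X = np.array([fingerprint(s) for s in smiles])
model = RandomForestClassifier(n_estimators=100).fit(X, y)  # multi-label random forest
print(model.predict(X[:1]))
```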


Graph neural networks (18:14)

So you'd expect the next slide to be that we just did deep learning at it, but let's take a little bit of a step back first. In the course so far, you've certainly touched on feedforward neural networks and their most famous instantiation, the convolutional neural network, and the trick there is you've got a bunch of images and you've got labels that are the human-labeled ground truth for what's in each image. You pass in the pixels, and through successive layers of mathematical transformations you progressively generalize what's in that image until you've distilled down its essence, and then you make a decision: yes cat, no cat. And when you see some new image that was not in your training set, you hope that if your model is well trained, you're able to take this unseen image and predict, you know, yes dog, no dog, even if it's a challenging situation. And this has been applied in all kinds of different domains, and neural rendering is one that I did not know a lot about, and that is super cool stuff. So pixels as input, predict what's in them, right? Audio as input: the trick here is to turn it into an image, so you calculate the spectrogram of the audio and then do a convnet on that, or an LSTM on the time slices, and then you transcribe the speech that's inside of that spectrogram. Text to text for translation, image captioning, all of these tasks have something in common, which is, as was alluded to in the previous talk, they all have a regularly shaped input. The pixels are a rectangular grid, right? That's very friendly to the whole history of the statistical techniques we have. Text is very regular; it's a string of characters from a finite-size alphabet. You can think of that as a rectangle, like a one wherever there's the character and a zero for the unused part of the alphabet, and then the next character, et cetera. A molecule is hard to put into a rectangle, right? You can imagine taking a picture of it and then trying to predict on that, but if you rotate this thing, it's still the same molecule, and you've probably broken your prediction. That's really data-inefficient. So these things are most naturally represented as graphs, and kind of like meshes, that's actually not very natural to give to classical machine learning techniques. But what's happened in the past three, four, five years is there's been an increasing maturity of some techniques that are broadly referred to as graph neural networks. And the idea here is to not try to, you know, fudge the graph into being something that it's not, but to actually take the graph as input and make predictions on top of it. And this has opened up a lot of different application areas. So I'll talk about chemistry today, where we're predicting a property of a whole graph, but this has also been useful for things like protein-protein interaction networks, where you actually care about the nodes, or you care about the edges, will there be an interaction? Social network graphs, where people are nodes and friendships are edges, and you might want to predict, does this person exist, or does this friendship potentially exist, what's the likelihood of it? And citation networks as well. So interactions between anything are naturally phrased as graphs.


Molecules to graphs (21:25)

And so that's what we use, but let me show you how it works in practice for chemistry. So, the first thing is you've got a molecule. It's nice to say make it a graph, but what exactly does that mean? The way we do this is, you pick an atom, and you want to give each node in this graph a vector representation. So you want to load information into the graph about what's at that atom. You might say: what's the hydrogen count, the charge, the degree, the atom type. That might be a one-hot vector you concatenate in. And then you place it into the node of a graph. And then this is the GNN part. This is a message passing process by which you pick an atom, and you can do this for every atom. You pick an atom, you go grab its neighbors, you grab the vector representations at its neighbors, and you can sum them, you can concatenate them, basically you get to choose what that function is. You pass that to a neural network, that gets transformed, you're going to learn what that neural network is based on some loss, and then you put that new vector representation back in the same place you got it from, and that's one round of message passing. The number of times you do this, we just call that the number of layers in the graph neural network. Sometimes you parameterize the neural network differently at different layers, sometimes you share the weights, but that's the essence of it. And what happens is, after around, for this molecule, five or six rounds or so, the atom at the far end, or the node now, because it's no longer an atom once information has been mixed, the node at the far end of the molecule actually has information from the other end of the molecule. So over successive rounds of message passing, you can aggregate a lot of information.
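Here's a minimal numpy and RDKit sketch of what that featurization and one round of message passing could look like. The particular atom features, dimensions, and update rule are illustrative assumptions, not the speaker's exact model, and the weight matrix is random where a trained network would learn it.

```python
import numpy as np
from rdkit import Chem

mol = Chem.MolFromSmiles("COc1cc(C=O)ccc1O")  # vanillin

# Node features: a few simple per-atom descriptors (hydrogen count, charge,
# degree) plus a tiny one-hot over atom type. Purely illustrative choices.
atom_types = ["C", "O", "N", "S"]
def atom_features(atom):
    one_hot = [1.0 if atom.GetSymbol() == t else 0.0 for t in atom_types]
    return [atom.GetTotalNumHs(), atom.GetFormalCharge(), atom.GetDegree()] + one_hot

H = np.array([atom_features(a) for a in mol.GetAtoms()])        # (n_atoms, d)
A = Chem.GetAdjacencyMatrix(mol).astype(float)                  # (n_atoms, n_atoms)

# One message-passing layer: sum neighbor vectors, concatenate with the node's
# own vector, and pass through a linear map + nonlinearity (weights would be learned).
rng = np.random.default_rng(0)
d = H.shape[1]
W = rng.normal(size=(2 * d, 16))                                # stand-in for learned weights
messages = A @ H                                                # neighbor sums
H = np.maximum(0.0, np.concatenate([H, messages], axis=1) @ W)  # updated node states
```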


Predicting odor descriptors (23:03)

And if you want to make predictions, it seems a little bit silly, but you just take the vector at every node, once you're done with this, and you sum them. You can take the average, you can take the max, whatever; you kind of hyperparameter-tune that for your application, but for whole-graph predictions, summing works really well in practice, and there are some reasons for that that have been talked about in the literature. And then that's a new vector, that's your graph representation; you pass that to a neural network and you can make whatever prediction you like. In our case, the multi-label problem of what does it smell like. And how well can we predict? Really good. We predict real good. On the x-axis is the performance of the strongest baseline model that we could come up with, which is a random forest on count-based fingerprints. So when I said it's a bag of fragments, you can either say whether a fragment is present, zero or one, or, what's better, you can count the number of those subfragments that are present. This is the strongest chemoinformatic baseline that we know of. And on the y-axis is the performance of our graph neural network. And you can see we're better than the baseline for almost everything. What's interesting is the ones that we're not better at: bitter, anisic, and medicinal are some examples. In talking with experts in the field, the consensus is that the best way to predict these things is to use molecular weight. And it's actually surprisingly difficult to get these graph neural networks to learn something global, like molecular weight. So we've got some tricks that we're working on to basically side-load graph-level information and hopefully improve performance across the board. So this is kind of the hardest benchmark that we know about for structure-odor relationship prediction, and we're pretty good at it. But we would actually like to understand what our neural network has learned about odor, because we're not just making these predictions for the sake of beating a benchmark; we actually want to understand how odor is structured and use that representation to build other technologies. So in the penultimate layer of the neural network stacked on top of the GNN, there's an embedding. Did you guys talk about embeddings in this course? Okay, cool. So the embedding is a notion of the general representation of some input that the neural network has learned, and that's the last thing it's going to use to actually make a decision. So let me show you what our embeddings look like.
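Continuing the previous sketch, here's what the sum readout and a multi-label prediction head could look like. The shapes and random weights are placeholders; in the real model everything downstream of the node states is learned from the odor labels.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(12, 16))          # final node states after message passing
graph_vec = H.sum(axis=0)              # sum readout: one vector per molecule

n_descriptors = 138                    # size of the odor taxonomy in the talk
W_out = rng.normal(size=(16, n_descriptors))   # stand-in for a learned output layer
b_out = np.zeros(n_descriptors)

logits = graph_vec @ W_out + b_out
probs = 1.0 / (1.0 + np.exp(-logits))  # independent sigmoid per odor descriptor
predicted_labels = probs > 0.5         # multi-label decision
```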


The odor embedding space (25:19)

So the first two dimensions here, these are the first two principal components of a 63-dimensional vector. This is not t-SNE or anything like that, so you can think of this as a two-dimensional shadow of a 63-dimensional object. Each dot here is a molecule, and each molecule has a smell, and if you pick one odor, like musk, we can draw a little boundary around where we find most of the musk molecules and color them. But we've got other odors as well, and it seems like they're sorting themselves out very nicely. We know they have to sort themselves out nicely because we classify well, so it's kind of tautologically true. But what's interesting is we also have these macro labels, like floral. We have many flowers in our data set, and you might wonder, okay, floral, where is that in relation to rose or lily or muguet? And it turns out that floral is this macro class and inside of it are all the flowers. There's kind of a fractal structure to this embedding space. We didn't tell it about that, right? It just kind of learned naturally how odor is arrayed. And, you know, there's the meaty cluster, which conveniently looks like a T-bone steak if you squint your eyes. My favorite is the alcoholic cluster, because it looks like a bottle. I'm never retraining this network, because that's definitely not going to be true next time. And this is kind of an indication that something is being learned about odor. There's a structure that's happening here. It's amazing this comes out in PCA. This almost never happens, at least in the stuff that I've worked with, for a linear dimensionality reduction technique to reveal a real structure about the task. And this itself is an object of study. We're in the beginning stages of trying to understand what this is, what it's useful for, what it does. We view it a little bit as the first draft, or V0.01, of an RGB for odor, or as an odor space, an odor codec, right? Color spaces are really important in vision, and without them we wouldn't really be able to have our cameras talk to our computers, talk to our display devices, right? So we need something like that if we're going to digitize the sense of smell; we need a theory or a structure. And because odor is something that doesn't exist in outer space, it's not universally true, it's uniquely planet Earth, it's uniquely human, taking a data-driven approach might not be that unreasonable. It might not be something that Newton or Guth could have come across themselves through first principles, since evolution had a lot to do with it.
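For concreteness, here's a sketch of the kind of projection being described: taking the learned embeddings and plotting their first two principal components. The embeddings below are random placeholders standing in for the model's 63-dimensional penultimate-layer activations.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5000, 63))   # (n_molecules, embedding_dim), placeholder data

coords_2d = PCA(n_components=2).fit_transform(embeddings)
# coords_2d can now be scattered and colored by odor label (musk, floral, ...).
print(coords_2d.shape)
```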


Molecular neighbors (27:58)

So that's the embedding space. It's kind of a global picture of how odor is structured through the eyes of a fancy neural network. But what about locally? I told you about global structure; what about local structure? I could maybe wave my hands and tell you that nearby molecules smell similar because there are little clumps of stuff, but we can actually go and test that. And this is also the task that R&D chemists in flavor and fragrance engage in. They say, here's my target molecule, it's being taken off the market, or it's too expensive; find me stuff nearby, and let's see what its properties are. Maybe it's less of an allergen. Maybe it's cheaper. Maybe it's easier to make. So let's first use structure. Let's use nearest neighbor lookups using those bag-of-fragments representations that I showed you. The chemoinformatic name for this distance is Tanimoto distance; it's Jaccard similarity on the bit-based Morgan fingerprints, and this is kind of the standard way to do lookups in chemistry. And let's start with something like dihydrocoumarin. If you look at the structural nearest neighbors, it gets little subfragments right. Little pieces of the molecule, they all match, right? But almost none of them actually smell like the target molecule. Now if you use our GCN features, so if you use cosine distance in our embedding space, what you get is a lot of molecules that have the same kind of convex hull. They look really, really similar, and they also smell similar. We showed this to a fragrance R&D chemist, and she said, oh, those are all bioisosteres. And I was like, that's awesome. What's that? What's a bioisostere? I have no idea what that is. And she said, bioisosteres are lateral moves in chemical space that maintain biological activity. So there are little things you can do to a molecule that make it look almost the same, but it's now a different structure, and they don't mess with its biological activity. And to her eye, and again, I'm not an expert in this specifically, these were all kind of lateral moves in chemical space that she would have come up with, except for this one. She said, that's really interesting, I wouldn't have thought of that. And my colleague said, the highest praise you can ever get from a chemist is, huh, I wouldn't have thought of that. So, you know, that's great.
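As a sketch of the two lookup styles being compared, here's Tanimoto (Jaccard) similarity on Morgan fingerprints next to cosine similarity in an embedding space. The query, the tiny library, and the random embeddings are illustrative placeholders.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

query = Chem.MolFromSmiles("O=C1CCc2ccccc2O1")  # dihydrocoumarin
library = [Chem.MolFromSmiles(s) for s in ["O=C1C=Cc2ccccc2O1", "CCOC(C)=O"]]

def fingerprint(mol):
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

# Structure-based lookup: Tanimoto (Jaccard) similarity on Morgan fingerprints.
tanimoto = [DataStructs.TanimotoSimilarity(fingerprint(query), fingerprint(m)) for m in library]

# Embedding-based lookup: cosine similarity on (placeholder) learned embeddings.
def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
query_emb, library_embs = rng.normal(size=63), rng.normal(size=(2, 63))
embedding_sim = [cosine(query_emb, e) for e in library_embs]
print(tanimoto, embedding_sim)
```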


Generalization (30:04)

We've hit the ceiling. So I've shown you we can predict well relative to baselines, and the embedding space has some really interesting structure both globally and locally. And the question now is, well, is this all only true inside of this bubble, inside of this task that we designed using a data set that we curated? It's generally a really good test to move out of your data set. So will this model generalize to new, adjacent tasks? Did you talk about transfer learning in this course yet? Domain adaptation. So one of the big tricks in industrial machine learning is transfer learning. Train a big model on ImageNet and then use that model, freeze it, and take off the top layer that just has the decisions for what the image classes are. So if you've got dog and cat, and maybe you want to predict house and car, you take off the dog and cat layer, you put on the house and car layer, and you just train that last layer. It's called fine-tuning or transfer learning; they're kind of related. That works extremely well in images. There are really no examples of this working in a convincing way, to my eye, in chemistry. So the question is, do we expect this to work at all? There's an XKCD cartoon, and I like its progression in time, which goes: when a user takes a photo, we should check if they're in a national park, easy GPS lookup, and then we want to check if the photo is of a bird. And the response, I think this was 2011 or something like that, is, I'll need a research team and five years. For fun, a team at Flickr made this by fine-tuning an ImageNet model, right? So is this a bird, or is this a park? So there was a large technological leap in between those two points. This really, really works in images, but it's unclear if it works in general on graphs, or specifically in chemistry. And what we did is we took our embeddings, we froze them, and we added a random forest or a logistic regression on top, I don't remember the exact models here. There are two main data sets that are kind of the benchmarks in odor: the DREAM olfaction challenge and the Dravnieks data set. They're both interesting, they both have challenges, they're not very large, but that's kind of the standard in the field. We now have state of the art on both of these through transfer learning to these tasks. This actually really surprised us, and it's really encouraged us that we've actually learned something fundamental about how humans smell molecular structures.
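Here's a minimal sketch of that transfer-learning recipe: keep the GNN embeddings frozen and fit a simple model on top for a new task. The embeddings and labels below are random placeholders standing in for precomputed embeddings of, say, DREAM or Dravnieks molecules.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
frozen_embeddings = rng.normal(size=(400, 63))   # precomputed GNN embeddings, not re-trained
new_task_labels = rng.integers(0, 2, size=400)   # e.g. one descriptor from a new dataset

# Only this small head is fit; the embedding model itself stays frozen.
head = LogisticRegression(max_iter=1000).fit(frozen_embeddings, new_task_labels)
print(head.score(frozen_embeddings, new_task_labels))
```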


Explaining/interpreting predictions (32:34)

So the remaining question to me is: this is all really great, you've got a great neural network, but I occasionally have to convince chemists to make some of these molecules. And a question that often comes up is, why should I make this molecule? What about this makes it smell like vanilla or popcorn or cinnamon? And so we'd like to try to open up the innards of the neural network, or at least expose what the model is attending to when it's making decisions. So we set up a really simple positive control and built some methodology around attribution, and I'll show you what we did. The first thing we did is to set up a simple task: predict if there's a benzene in a molecule. And a benzene is a six-atom ring where you've got three double bonds. This task is trivial. But there are a lot of ways to cheat on the task. If there are any statistical anomalies, like benzene co-occurring with chlorine, you might just say, okay, look at the chlorine and predict from that. So we wanted to make sure that our attributions weren't cheating. And so we built an attribution method, and this is something that's being submitted right now. And what should come out, and what does come out indeed, is a lot of weight on the benzenes and no weight elsewhere. And we've verified that this is the case across lots of different molecules: when there's not a benzene, there's no weight; when there is a benzene, there's weight on the benzene, and sometimes some leakage, and we've improved this, so this is not our current best at this point. So this means we can go look at the actual odors that are in our data set, like garlic. Garlic's actually really easy to predict. You count the number of sulfurs, and if there are a lot, it's going to smell really bad, like rotten eggs or sulfurous or garlicky. And so this is a bit of a sanity check. You'll notice that this guy down here has a sulfur. These types of molecules show up in beer a lot; they're responsible for a lot of the hop aroma in beers. I think this one is a grapefruit characteristic or something like that. And these sulfurs are the sulfurs that eventually contribute to the skunked smell or taste of beer, because these molecules can oxidize, and those sulfurs can then leave the molecule and contribute to a new odor which is less pleasant. But while they're a part of the molecule, they don't contribute to that sulfurous smell, they contribute to the grapefruit smell. Fatty. This one we thought was going to be easy, and we showed it to flavor and fragrance experts, and they were a little bit astounded that we could predict this well. So apparently having a big long fatty chain is not sufficient for something smelling fatty. It turns out that a class of molecules called terpenes, this is like the main flavor component in marijuana, has incredible diversity, and we kind of have an olfactory fovea on molecules like this. So a one-carbon difference can take something from, like, coconut to pineapple. And we have incredible acuity here, perhaps because they're really prevalent in plants and things that are safe to eat, so we might have an over-representation of sensitivity to molecules of this kind. Total speculation. I'm on video, I guess, I probably shouldn't have said that. And then vanilla. This is a commercially really interesting class, and the attributions here have been validated by the intuitions of R&D flavor chemists.
I can't explain that intuition to you because I don't have it, but I got a lot of this, so that's as good as we've got at this point. We haven't done a formal evaluation of how useful these things are, but this is, to me, a tool in the toolbox of trust-building. It's not enough to build a machine learning model if you want to do something with it; you have to solve the sociological and cultural problem of getting it into the hands of people who will use it. That is often more challenging than building the model itself. So data cleaning is the hardest thing, and then convincing people to use the thing that you built is the second hardest thing. The fancy machine learning stuff is neat, but you can learn it pretty quickly; you're all extremely smart, so that will very quickly stop being the hardest thing. And then winey. We don't know what's going on here, but this is something that we're investigating in collaboration with experts.
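To make the idea of per-atom attribution concrete, here's a minimal occlusion-style sketch: zero out each atom's features in turn and take the drop in the predicted odor probability as that atom's weight. This is an illustrative stand-in, not the team's actual attribution method, and the "model" here is just a random placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(9, 16))            # node features for a 9-atom molecule (placeholder)
W = rng.normal(size=(16,))              # stand-in for the trained GNN + prediction head

def predict(node_feats):
    """Toy whole-graph prediction: sum-pool nodes, linear map, sigmoid."""
    return 1.0 / (1.0 + np.exp(-(node_feats.sum(axis=0) @ W)))

baseline = predict(H)
attributions = []
for i in range(H.shape[0]):
    occluded = H.copy()
    occluded[i] = 0.0                   # remove atom i's contribution
    attributions.append(baseline - predict(occluded))
print(np.round(attributions, 3))        # per-atom weights, as in the benzene control
```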


Conclusion

Summary and future work (36:49)

So that's kind of the state of things. This is really early research. We're exploring what it means to digitize the sense of smell, and we're starting with the simplest possible task, which is: why does a molecule smell the way that it does? We're using graph neural networks to do this, which is a really fun, newly emerging area of technology in machine learning and deep learning. There's a really interesting and interpretable embedding space that we're looking into that could be used as a codec for electronic noses or for scent delivery devices. And we've got state of the art on the existing benchmarks, which is a nice validation that we're doing a good job on the modeling. But there's a lot more to do. We really want to test this, right? We want to see if this actually works in human beings that are not part of our historical data set, ones that have really never been part of our evaluation process, and that's something we're thinking about right now. You also never smell molecules alone. It's very rare; it's actually hard to do even if you order single molecules, because contaminants can be a bit of a challenge. So thinking about this in the context of mixtures is a challenge. What should that representation be? Is it a weighted set of graphs? Is it like a meta-graph of graphs? I don't actually know how to represent mixtures in an effective way in machine learning models, and that's something that we're thinking about. And then also, the data set that we have is what we were able to pull together, right? It's not the ideal data set. It's definitely gotten us off the ground, but there is no ImageNet of scent. There wasn't an ImageNet of vision for a long, long time either, but we want to get a head start on this, and so this is something that we're thinking about and investing in as well. And again, I'm super fortunate to work with incredible people, and I just want to call them all out here: Ben, Brian, Carrie, Emily, and Jennifer are absolutely amazing, fantastic people. They did the actual real work, and I feel fortunate to be able to represent them here. So thanks again for having me, and yeah, any questions, happy to answer.

