MIT 6.S191 (2018): Issues in Image Classification

Transcription for the video titled "MIT 6.S191 (2018): Issues in Image Classification".




Intro (00:00)

Thanks for having me here. Yeah, so I'm based in the Cambridge office, which is like 100 meters that way. And we do a lot of stuff with deep learning. We've got a large group working on Google Brain and other related fields, so hopefully that's interesting to some of you at some point. So I'm going to talk for about 20 minutes or so on the sort of issues-in-image-classification theme. Then I'm going to hand it over to my excellent colleague, Sanxing Cai, who's going to go through an entirely different subject: using the TensorFlow debugger and eager mode to make work in TensorFlow easier.

Discussion Announcements

Ice Breaker (00:41)

Maybe that would be good. Okay. So let's take a step back. Have you guys seen happy graphs like this before? Go ahead and smile and nod if you've seen stuff like this. Yeah. OK, so this is a happy graph on ImageNet-based image classification. ImageNet is a data set of a million-some-odd images. For this challenge, there were 1,000 classes. And in 2011, back in the dark ages when nobody knew how to do anything, the state of the art was something like a 25% error rate on this stuff. And in the last, call it six, seven years, the reduction in error rate has been kind of astounding, to the point where it's now been talked about so much it's no longer even surprising. And everyone's like, yeah, yeah, we see this. Human error rate is somewhere between 5 and 10 percent on this task, so the contemporary results of, you know, 2.2 or whatever it is percent error rate are really kind of astonishing. And you can look at a graph like this and make reasonable claims that, wow, machines using deep learning are better than humans at image classification on this task. That's kind of weird and kind of amazing. And maybe we can declare victory and fill audiences full of people clamoring to learn about deep learning. That's cool. OK, so I'm going to talk not about ImageNet itself but about a slightly different image data set. Basically people were like, okay, obviously ImageNet is too easy, let's make a larger, more interesting data set. So Open Images was released, I think, a year or two ago. It's got about nine million as opposed to one million images, and the base data set has 6,000 labels as opposed to 1,000 labels. This is also multi-label. So if there's a person holding a rugby ball, you get both person and rugby ball in the data set. It's got all kinds of classes, including stairs here, which are lovely, really illustrated. And you can find this on GitHub. It's a nice data set to play around with.
So some colleagues and I did some work of saying, OK, what happens if we apply just a straight-up Inception-based model to this data? We trained it up, and then we looked at how it classifies some images that we found on the web. So here's one such image. All the images here are Creative Commons and stuff like that, so it's OK for us to look at these. And when we apply an ImageNet-y kind of model to this, the classifications we get back are kind of what I personally would expect. I'm seeing things like bride, dress, ceremony, woman, wedding, all things that, as an American in this country at this time, I'm thinking make sense for this image. Cool. Maybe we did solve image classification. So then we applied it to another image, also of a bride. And the model that we had trained up on this open source image data set returned the following classifications: clothing, event, costume, red, and performance art. No mention of bride. Also, no mention of person-ness, regardless of gender. So in a sense, this model has sort of missed the fact that there's a human in the picture, which is maybe not awesome and not really what I would think of as great success if we're claiming that image classification is solved. Okay. So what's going on here? I'm going to argue a little bit that what's going on is based, to some degree, on the idea of stereotypes. And if you have your laptop open, I'd like you to close your laptop for a second. This is the interactive portion, where you can interact by closing your laptop. And I'd like you to find somebody sitting next to you and exercise your human conversation skills for about one minute to come up with a definition between the two of you of what is a stereotype, keeping in mind that we're in sort of a statistical setting.
So have a quick one minute conversation with the person sitting next to you. If there's no one sitting next to you, you may move. Ready, set, go.

Use Cases (05:18)

Three, two, one. And thank you now for having that interesting conversation that easily could have lasted for much more than one minute, but such is life. Let's hear from one or two folks. Who had something that they came up with that was interesting? Yeah, go ahead. Your name is? Aditya? Okay. What did you come up with? A generalization that you find based on a large group of people and then you apply to more similar people? OK, so Aditya is saying that a stereotype is a generalization that you find from a large group of people and you apply it to more people. OK, interesting. I certainly agree with large parts of that. Yeah, go ahead. You came up with a label that's based on the probability of experience within your training set. So as a human, what's been told to you, or to you, or to you, that's the experience within the training set. OK, so here the claim is that it's a label that's based on experience from within your training set, and the probability of a label based on what's in your training set. Yeah, super interesting. Cool. Maybe one more? Yeah, go ahead. So you're making a prediction based on unrelated features that happen to be correlated in your training set. OK. So there's a claim here that stereotype has something to do with unrelated features that happen to be correlated. I think that's interesting. This was not a plant. Sorry, your name was? Constantine.

Real World Applications (06:47)

Constantine is not a plant. But I do want to look at this a little bit more in detail. So here's a data set that I'm going to claim is based on running data. In the early mornings, I pretend that I'm an athlete and go for a run. And this is a data set that's sort of based on the risk that someone might not finish a race that they enter. So we've got high-risk people in yellow, and lower-risk people in red. If I'm looking at this data, it's got a couple dimensions. I might fit a linear classifier. It's not quite perfect. If I look a little more closely, I've actually got some more information here. I don't just have x and y. I also have this sort of color of outline. So I might have a rule that if this data point has a blue outline, I'm going to predict low risk. Otherwise, I'm going to predict high risk. Fair enough. Now the big reveal. You'll never guess what. The outline feature is based on shoe type. The other x and y are based on how long the race is and what a person's weekly training volume is. But whether you're foolish enough to buy expensive running shoes because you think they're going to make you faster or whatever, this is what's in the data. And in traditional supervised machine learning, we might say, well, wait a minute, I'm not sure that shoe type is going to be actually predictive. On the other hand, it's in our training data, and it does seem to be awfully predictive on this data set. We have a really simple model, it's highly regularized, and it still gives, you know, perfect or near-perfect accuracy. Maybe it's fine. And the only way we can find out if it's not, I would argue, is by gathering some more data. And I'll point out that this data set has been diabolically constructed so that there are some points in the data space that are not particularly well represented.
And you can maybe tell yourself a story that this data was collected after some corporate 5K or something like that. So if we go out and collect some more data, maybe we find that actually there are people wearing all kinds of shoes on both sides of our imaginary classifier, and that this shoe-type feature is really not predictive at all. And this gets back to Constantine's point that perhaps relying on features that are strongly correlated but not necessarily causal may be a point at which we're thinking about a stereotype in some way. So obviously, given this data and what we know now, I would probably go back and suggest a linear classifier based on these features of length of race and weekly training volume as potentially a better model. So how does this happen? What's the issue here that's at play? One of the issues is that in supervised machine learning, we often make the assumption that our training distribution and our test distribution are identical. And we make this assumption for a really good reason, which is that if we make that assumption, then we can pretend that there is no difference between correlation and causation. And we can use all of our features, whether they're what Constantine would call meaningful or causal or not. We can throw them in there. And so long as our test and training distributions are the same, we're probably OK, to some degree. But in the real world, we don't just apply models to a training or test set. We also use them to make predictions that may influence the world in some way. And there, I think the right sort of phrase to use isn't so much test set; it's more inference-time performance. Because at inference time, when we're going and applying our model to some new instance in the world, we may not ever actually know what the true label is, but we still care very much about having good performance.
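The shoe-type story can be made concrete with a few lines of code. This is a hypothetical sketch (the feature names and the "corporate 5K" sampling are invented to match the anecdote): a non-causal feature that looks perfectly predictive on a biased sample collapses to chance accuracy once more representative data is gathered.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, biased_sample):
    # Causal features: risk is high when the race is long relative
    # to the runner's weekly training volume.
    race_length = rng.uniform(5, 42, n)       # km
    weekly_volume = rng.uniform(10, 100, n)   # km/week
    high_risk = (race_length > weekly_volume / 2).astype(int)
    # Shoe type is non-causal. In the biased sample it happens to
    # track the label perfectly; in the broader one it is random.
    if biased_sample:
        expensive_shoes = 1 - high_risk
    else:
        expensive_shoes = rng.integers(0, 2, n)
    return high_risk, expensive_shoes

# "Corporate 5K" sample: the shoe rule looks perfect.
y_biased, shoes_biased = make_data(500, biased_sample=True)
acc_biased = np.mean((1 - shoes_biased) == y_biased)

# More representative sample: the same rule is no better than chance.
y_broad, shoes_broad = make_data(500, biased_sample=False)
acc_broad = np.mean((1 - shoes_broad) == y_broad)

print(f"shoe rule, biased sample:  {acc_biased:.2f}")  # 1.00
print(f"shoe rule, broader sample: {acc_broad:.2f}")   # about 0.5
```

A highly regularized model trained only on the biased sample would happily latch onto the shoe feature; only data collected outside the original sampling process exposes the confounder.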
And making sure that our training set matches our inference distribution to some degree is super critical. So let's go back to Open Images and what was happening there. You'll recall that it did quite badly, at least anecdotally, on that image of a bride who appeared to be from India. If we look at the geodiversity of Open Images (this is something where we did our best to track down the geolocation of each of the images in the data set), what we found was that an overwhelming proportion of the data in Open Images was from North America and six countries in Europe. Vanishingly small amounts of that data were from countries such as India or China or other places where I've heard there's actually a large number of people. So this is clearly not representative in a meaningful way of the global diversity of the world. How does this happen? It's not like the researchers who put the Open Images data set together were in any way ill-intentioned. They were working really hard to put together what they believed was a more representative data set than ImageNet. At the very least, they don't have 100 categories of dogs in this one. So what happens? Well, you could make an argument that there's a strong correlation between the distribution of Open Images and the distribution of countries with high-bandwidth, low-cost internet access. It's not a perfect correlation, but it's pretty close. And if one were to base an image classifier on data drawn from a distribution of areas that have high-bandwidth, low-cost internet access, that may induce differences between the training distribution and the inference-time distribution. None of this is something you wouldn't figure out if you sat down for five minutes, right? This is all super basic statistics.
It is, in fact, stuff that the statistics people have been railing at the machine learning community about for the last several decades. But as machine learning models become more ubiquitous in everyday life, I think that paying attention to these kinds of issues becomes ever more important. So let's go back to what a stereotype is. I think I agree with Constantine's idea, and I'm going to add one more tweak to it: I'm going to say that a stereotype is a statistical confounder.

Societal Factors (13:43)

This is using Constantine's language almost exactly: a confounder that has a societal basis. So when I think about issues of fairness: if it's the case that rainy weather is correlated with people using umbrellas, then yes, that's a confounder. The umbrellas did not cause the rain. But I'm not as worried, as an individual human, about the societal impact of models that are based on that. I'm sure you could imagine some crazy, scary scenario where that was the case, but in general, I don't think that's as large an issue. When we think of things like internet connectivity or other societally based factors, though, I think that paying attention to questions of whether we have confounders in our data, and whether they are being picked up by our models, is incredibly important. So if you take away nothing else from this short talk, I hope that you take away a caution to be aware of differences between your training and inference distributions. Ask the question, because statistically, this is not a particularly difficult thing to uncover if you take the time to look. In a world of Kaggle competitions and people trying to get high marks on deep learning classes and things like that, I think it's all too easy for us to just take data sets as given, not think about them too much, and just try to get our accuracy from 99.1 to 99.2.
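The point that this is statistically easy to uncover can be shown with a tiny sketch: compare the training distribution against the inference-time distribution over some covariate you care about (country of origin, in the Open Images example) and measure how far apart they are. The country counts below are invented purely for illustration.

```python
from collections import Counter

def distribution(labels):
    # Normalize raw counts into a probability distribution.
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def total_variation(p, q):
    # Total variation distance: 0 means identical, 1 means disjoint.
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical country-of-origin tags for the training set and for
# the traffic the model actually sees at inference time.
train_countries = ["US"] * 60 + ["GB"] * 20 + ["DE"] * 15 + ["IN"] * 3 + ["CN"] * 2
inference_countries = ["US"] * 20 + ["GB"] * 5 + ["DE"] * 5 + ["IN"] * 35 + ["CN"] * 35

tv = total_variation(distribution(train_countries),
                     distribution(inference_countries))
print(f"total variation distance: {tv:.2f}")  # 0.65
```

A distance this large is a red flag that accuracy measured on held-out training data says little about inference-time performance for the underrepresented groups.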

Fairness And Equality Concerns

Fairness 101 (15:13)

And as someone who's interested in people coming out of programs like this being ready to do work in the real world, I would caution that we can't only be training ourselves to do that. So with that, I'm going to leave you with a set of additional resources around machine learning fairness. These are super hot off the presses, in the sense that this particular little website was launched at, I think, 8:30 this morning, something like that. So you've got it first, MIT leading the way. On this page (and yes, you can open your laptops now) there are a number of papers that go through sort of a greatest hits of the machine learning fairness literature from the last couple of years. Really interesting papers. I don't think any of them is the one final solution to machine learning fairness issues, but they're super interesting reads, and I think they help paint the space and the landscape really usefully. There are also a couple of interesting exercises there that you can access via Colab. And if you're interested in this space, there are things that you can play with. I think they include one on adversarial debiasing, where, because you guys all love deep learning, you can use a network to try to become unbiased: you add an extra output head that predicts a characteristic that you wish to be unbiased on, and then penalize the model if it's good at predicting that characteristic. And so this is trying to adversarially make sure that our internal representation in a deep network is not picking up unwanted correlations or unwanted biases. So I hope that that's interesting. And I'll be around afterwards to take questions. But at this point, I'd like to make sure that Sanxing has plenty of time. So thank you very much. Thank you.
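The adversarial debiasing idea described here can be sketched in plain NumPy: a shared encoder feeds two heads, one predicting the task label and one predicting the characteristic we want to be blind to, and the encoder's update reverses the adversary's gradient so the representation is penalized whenever that characteristic is recoverable. Everything below (the toy data, layer sizes, learning rate) is a made-up illustration of the idea, not the actual Colab exercise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: column 0 carries the task signal y, column 1 leaks a
# sensitive attribute a; y and a are independent.
n = 400
y = rng.integers(0, 2, n)
a = rng.integers(0, 2, n)
X = np.column_stack([y + 0.3 * rng.standard_normal(n),
                     a + 0.3 * rng.standard_normal(n)])

def sigmoid(z):
    # Clip logits to avoid overflow in exp.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

k = 4                                  # size of the shared representation
W = 0.1 * rng.standard_normal((2, k))  # shared encoder
u = 0.1 * rng.standard_normal(k)       # task head
v = 0.1 * rng.standard_normal(k)       # adversary head
lr, lam = 0.2, 1.0

for _ in range(3000):
    h = X @ W                          # shared representation
    p = sigmoid(h @ u)                 # task prediction
    q = sigmoid(h @ v)                 # adversary's guess at `a`
    gp = (p - y) / n                   # d(task loss)/d(task logits)
    gq = (q - a) / n                   # d(adv loss)/d(adv logits)
    u -= lr * (h.T @ gp)               # each head descends its own loss
    v -= lr * (h.T @ gq)
    # Gradient reversal: the encoder descends the task loss but
    # ascends the adversary's loss, so it is pushed to erase
    # whatever makes `a` predictable from h.
    W -= lr * (X.T @ np.outer(gp, u) - lam * X.T @ np.outer(gq, v))

task_acc = np.mean((sigmoid(X @ W @ u) > 0.5) == y)
adv_acc = np.mean((sigmoid(X @ W @ v) > 0.5) == a)
print(f"task accuracy: {task_acc:.2f}")
print(f"adversary accuracy: {adv_acc:.2f}")
```

If this works as intended, the task accuracy stays high while the adversary's accuracy falls toward chance, meaning the learned representation no longer encodes the protected characteristic.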
