Baidu's AI Lab Director on Advancing Speech Recognition and Simulation | Transcription

Transcription for the video titled "Baidu's AI Lab Director on Advancing Speech Recognition and Simulation".



Intro (00:00)

Today we have Adam Coates here for an interview. Adam, you run the AI lab at Baidu in Silicon Valley. Could you just give us a quick intro and explain what Baidu is for people who don't know? Yeah, so Baidu is actually the largest search engine in China. So it turns out the internet ecosystem in China is this incredibly dynamic environment. And so Baidu, I think, sort of turned out to be an early technology leader and really established itself in PC search, but then also has sort of remade itself in the mobile revolution. And increasingly today is becoming an AI company, recognizing the value of AI for a whole bunch of different applications, not just search. OK. And so, yeah, what do you do exactly? So I'm the director of the Silicon Valley AI Lab, which is one of four labs within Baidu Research. So especially as Baidu is becoming an AI company, there's the need for a team to sort of be on the bleeding edge and understand all of the current research, be able to do a lot of basic research ourselves, but also figure out how we can translate that into business and product impact for the company. That's increasingly critical.

What Baidu Is (01:04)

So that's what Baidu Research is here for. The AI Lab in particular, we kind of founded recognizing how extreme this problem was about to get. So I think the deep learning research and AI research right now is flying forward so rapidly that the need for teams to be able to both understand that research, but also quickly translate it into something that businesses and products can use, is more critical than ever. So we founded the AI Lab to try to close that gap and help the company move faster. And so then, how do you break up your time in between doing basic research around AI and actually implementing it, bringing it forward to a product? There's no hard and fast rule to this. I think one of the things that we try to repeat to ourselves every day is that we're mission oriented. So the mission of the AI Lab is precisely to create AI technologies that can have a significant impact on at least 100 million people. We chose this to sort of keep bringing ourselves back to the sort of final goal that we want all the research we do to ultimately end up in the hands of users. And so sometimes that means that we spot something that needs to happen in the world to really change technology for the better and to help Baidu. But no one knows how to solve it. And there's a basic research problem there that someone has to tackle. And so we'll sort of go back to our visionary stance and think about the long term and invest in research. And then as we have success there, we shift back to the other foot and take responsibility for carrying all of that to a real application and making sure we don't just solve the 90% that you might put in, say, your research paper, but we also solve the last mile. We get to the 99.9%. So maybe the best way to do this then is to just explain something that started with research here and how that's been brought on to a full-on product that exists. So I'll give you an example. We've spent a ton of time on speech recognition. 
So speech recognition a few years ago was one of these technologies that always felt pretty good, but not good enough. And so traditionally, speech recognition systems have been heavily optimized for things like mobile search. So if you hold your phone up close to your mouth and you say a short-- --and talk in a non-human voice. Exactly. The systems could figure it out, and they're getting quite good. I think the speech engine that we've built at Baidu, called Deep Speech, is actually superhuman for these short queries because you have no context. People can have thick accents. So that speech engine actually started out as a basic research project. We looked at this problem. We said, gosh, what would happen if speech recognition were human level for every product you ever used? So whether you're in your home or in your car, or you pick up your phone, whether you hold your phone up close or hold it away, if I'm in the kitchen and my toddler is yelling at me, can I still use a speech interface? Could it work as well as a human being understands us? And so then how did you do that? What is the basic research that moved it forward to put it in a place that it's useful? So we had the hypothesis that maybe the thing holding back a lot of the progress in speech is actually just scale. Maybe if we took some of the same basic ideas we could see in the research literature already and scaled them way up, put in a lot more data, invested a lot of time in solving computational problems and built a much larger neural network than anyone had been building before for this problem, we could just get better performance. And lo and behold, with a lot of effort, we ended up with this pretty amazing speech recognition model that, like I said, in Mandarin at least, is actually superhuman. You can actually sit there and listen to a voice query that someone is trying out, and you'll have native speakers sitting around debating with each other wondering what the heck the person is saying. 
Wow. And then the speech engine will give an answer and everybody goes, oh, that's what it was, because it's just such a thick accent from perhaps someone in rural China. How much data do you have to give it to train it? To train it on a new language? Because I think on the site, I saw it was English and Mandarin.

Superhuman Assistant (05:41)

Yeah. Like if I wanted German, how much would I have to give it? So one of the big challenges for these things is that they need a ton of data. So our English system uses like 10,000 to 20,000 hours of audio. The Mandarin systems are using even more for top-end products. So this certainly means that the technology is at a state where to get that superhuman performance, you've got to really care about it. So for Baidu voice search, maps, things like that that are flagship products, we can put in the capital and the effort to do that. But it's also one of the exciting things going forward in the basic research that we think about: how do we get around that? How can we develop machine learning systems that get you human performance on every product and do it with a lot less data? So what I was wondering then, did you see that Lyrebird thing that was floating around the internet this week? OK. So they claim that they don't need all that much time, all that much data, audio data, to emulate your voice or simulate whatever they call it. You guys have a similar project going on, right? That's right. Yeah, we're working on text-to-speech. Why can they achieve that with less data? I think the technical challenge behind all of this is there are sort of two things that we can do. One is to try to share data across many applications. So to take text-to-speech as one example: if I learn to mimic lots of different voices, and then you give me the 1,001st voice, you'd hope that the first 1,000 taught you virtually everything you need to know about language, and that what's left is really some idiosyncratic change that you could learn from very little data. So that's one possibility. 
The other side of it is that a lot of these systems-- this is much more important for things like speech recognition that we were talking about-- is we want to move from using supervised learning, where a human being has to give you the correct answer in order for you to train your neural network, but move to unsupervised learning, where I could just give you a lot of raw audio and have you learn the mechanics of speech before I ask you to learn a new language. And hopefully that can also bring down the amount of data that we need.

The Technology (07:47)

And so then on the technical side, could you give us somewhat of an overview of how that actually works? Like, how do you process a voice? For text to speech? Let's do both, actually, because I'm super interested. All right, so let's start with speech recognition. Before we go and train a speech system, what we have to do is collect a whole bunch of audio clips. So for example, if we wanted to build a new voice search engine, I would need to get lots of examples of people speaking to me, giving me little voice queries. And then I actually need human annotators, or I need some kind of system that can give me ground truth, that can tell me for a given audio clip, what was the correct transcription? And so once you've done that, you can ask a deep learning algorithm to learn the function that predicts the correct text transcript from the audio clip. So this is called supervised learning. It's an incredibly successful framework. We're really good with this for lots of different applications. But the big challenge there is those labels: someone has to be able to sit there and give you, say, 10,000 hours worth of labels, which can be really expensive. Yeah, how is it actually-- what is the software doing to recognize the intonation of a word? Well, traditionally, what you would have to do is break these problems down into lots of different stages. So I, as a speech recognition expert, would sit down and I would think a lot about what are the mechanics of this language. So for Chinese, you would have to think about tonality and how to break up all the different sounds into some intermediate representation. And then you would need some sophisticated piece of software, which we call a decoder, that goes through and tries to map that sequence of sounds to possible words that it might represent. And so you have all these different pieces and you'd have to engineer each one, often with its own expert knowledge. 
But Deep Speech and all of the new deep learning systems we're seeing now try to solve this in one fell swoop. So really, the answer to your question is kind of the vacuous one, which is that once you give me the audio clips and the characters that it needs to output, a deep learning algorithm can actually just learn to predict those characters directly. And in the past, it always looked like there was some fundamental problem, that maybe we could never escape this need for these hand-engineered representations. But it turns out that once you have enough data, all of those things go away. And so where did your data come from? Like 10,000 hours of audio. There's a question. We actually do a lot of clever tricks in English, where we don't have a large number of English language products. So for example, it turns out that if you go onto, say, a crowdsourcing service, you can hire people very cheaply to just read books to you. And it's not the same as the kinds of audio that we hear in real applications. But it's enough to teach a speech system all about liaisons between words, and you get some speaker variation. And you hear strange vocabulary where English spelling is totally ridiculous. And in the past, you would hand engineer these things. You'd say, well, I've never heard that word before. So I'm going to bake the pronunciation into my speech engine. But now it's all data driven. So if I hear enough of these unusual words, you see these neural networks actually learn to spell on their own, even considering all the weird exceptions of English. Interesting. And you have the input, right? Because I've heard of people doing it with a YouTube video. But then you need a caption as well with the audio. So it's twice as much work, if not more. Interesting. And so then what about the other way around?
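To make "predict those characters directly" a little more concrete, here is a minimal Python sketch of one common way to turn per-frame character scores into text: greedy CTC-style decoding (pick the best symbol per frame, merge repeats, drop a blank symbol). The interview does not say which decoding method is used, so treat the technique, the `BLANK` symbol, the function name and the toy scores as assumptions for illustration only.

```python
from itertools import groupby

BLANK = "_"  # hypothetical "no character" symbol used by CTC-style models


def ctc_greedy_decode(frame_scores, alphabet):
    """Turn per-frame symbol scores into a character string.

    frame_scores: list of lists, one score per alphabet symbol per audio frame.
    alphabet: list of symbols, in the same order as the scores.
    """
    # 1. Pick the highest-scoring symbol for every frame.
    best = [
        alphabet[max(range(len(scores)), key=scores.__getitem__)]
        for scores in frame_scores
    ]
    # 2. Merge consecutive repeats (one spoken sound spans many frames).
    collapsed = [symbol for symbol, _ in groupby(best)]
    # 3. Drop blanks; a blank between two identical letters keeps them distinct.
    return "".join(symbol for symbol in collapsed if symbol != BLANK)


if __name__ == "__main__":
    alphabet = [BLANK, "c", "a", "t"]
    # Toy scores standing in for a neural network's output over six frames.
    scores = [
        [0.1, 0.7, 0.1, 0.1],    # c
        [0.1, 0.6, 0.2, 0.1],    # c (repeat, merged)
        [0.8, 0.1, 0.05, 0.05],  # blank
        [0.1, 0.1, 0.7, 0.1],    # a
        [0.6, 0.1, 0.2, 0.1],    # blank
        [0.1, 0.1, 0.1, 0.7],    # t
    ]
    print(ctc_greedy_decode(scores, alphabet))
```

Nothing in this sketch knows about phonemes or pronunciation dictionaries; the only inputs are frame scores and an alphabet, which is the point being made about end-to-end systems.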

Transferring Insights (11:50)

How does that work on the technical side? Right. So that's one of the really cool parts of deep learning right now, is that a lot of these insights about what works in one domain keep transferring to other domains. So with text to speech, you could see a lot of the same practices. So you would see that a lot of systems were hand-engineered combinations of many different modules. And each module would have its own set of machine learning algorithms with its own little tricks. And so one of the things that our team did recently with a piece of work that we're calling Deep Voice was to just ask, what if I rewrote all of those modules using deep learning for every single one? To not put them all together just yet, but even just ask, can deep learning solve all of these adequately to get a good speech system? It turns out the answer is yes. That you can basically abandon most of this specialized knowledge in order to build all of the subsequent modules. And in more recent research that's in the deep learning community, we're seeing that, of course, everyone is now figuring out how to make these things work end to end. They're all data driven. And that's the same story we saw for Deep Speech. So we're really excited about that. That's wild. And so do you have a team just dedicated to parsing research coming out of universities and then figuring out how to apply it? Are you testing everything that comes out? It's a bit of a mix. It's definitely our role to not only think about AI research, but to think about AI products and how to get these things to impact. I think there is clearly so much AI research happening that it's impossible to look through everything. But one of the big challenges right now is to not just digest everything, but to identify the things that are truly important. So what's like a 90 million person product? You're like, oh, man. Well, speech recognition we chose because we felt in aggregate, it had that potential. 
So as we have the next wave of AI products, I think we're going to move from these sort of bolted on AI features to really immersive AI products. So if you look at how keyboards were designed a few years ago for your phone, you see that everybody just bolted on a microphone and they hooked it up to their speech API. And then that was fine for that level of technology.

TalkType (14:15)

But as the technology is getting better and better, we can now start putting speech upfront. We can actually build a voice-first keyboard. So it's actually something we've been prototyping in the AI lab. You can actually download this for your Android phone. So it's called TalkType in case anybody wants to try it. But it's remarkable how much it changes your habits. I use it all the time. And I never thought I would do that. And so it emphasized to me why the AI lab is here, that we can sort of discover these changes in user habits. We can understand how speech recognition can impact people much more deeply than it could when it was just bolted onto a product. And that sort of spurs us on to start looking at the full range of speech problems that we have to solve to get you away from this sort of close-talking voice search scenario and to one where I can just talk to my phone or talk to a device and have it always work. So as you've given this to a bunch of users, I assume, and gotten their feedback, have you been surprised with the voice as interface? I know lots of people talk about it.

Biggest surprise (15:24)

Some people say, it doesn't really make sense. For example, you see Apple transcribing voicemails now. Are there certain use cases where you've been surprised at how effective it is and others where you're like, I don't know if this will ever play out? You know, I think the really obvious ones like texting seem to be the most popular. I feel like the feedback that is maybe the most fun for me is when people with thick accents post a review, they say, oh, I have this crazy accent I grew up with and nothing works for me. But I tried this new keyboard and it works amazingly well. I have a friend who has a thick Italian accent and he complains all the time that nothing works. And all of this stuff now works for different accents because it's all data driven. We don't have to think about how we're going to serve all these different users. If they're represented in the data sets and we get some transcriptions, we can actually serve them in a way that really wasn't possible when we were trying to do it all by hand. That's fantastic. And have you gone through the whole system? In other words, if I want to give myself an Italian-American accent, can I do that yet with Baidu? We can't do that yet with our TTS engine, but it's definitely on the way. OK, cool. So what else is on the way? What are you researching? What products are you working on? What's coming? So speech and text-to-speech, I think these are part of a big effort to make this next generation of AI products really fly. Once text-to-speech and speech are your primary interface to a new device, they have to be amazingly good and they have to work for everybody.

Speech will be mimicked locally (16:58)

And so I think there's actually still quite a bit of room to run on those topics, not just making it work for a narrow domain, but making it work for really the full breadth of what humans can do. Do you see a world where you can run this stuff locally, or will they always be calling an API? Yeah. OK. I think it's definitely going to happen. One kind of funny thing is that if you look at folks who maybe have a lot less technical knowledge and don't really have the instinct to think through how a piece of technology is working on the back end, I think the response to a lot of AI technologies now, because they're reaching this sort of uncanny valley, is that we often respond to them as though they're sort of human. And that sets the bar really high. Our expectations for how delightful a product should be are now being set by our interactions with people. And one of the things we discovered as we were translating Deep Speech into a production system was that latency is a huge part of that experience. That the difference between 50 or 100 milliseconds of latency and 200 milliseconds of latency is actually quite perceptible. And it really-- anything we can do to bring that down actually affects user experience quite a bit. We actually did a combination of research, production hacking, working with product teams, thinking through how to make all of that work. And that's a big part of the sort of translation process that we're here for. That's very cool. And so what happens on the technical side to make it run faster? So when we first started the basic research for Deep Speech, like all research papers, we chose the model that gets the best benchmark score, which turns out to be horribly impractical for putting online. 
And so after the initial research results, teams sat down with just a set of what you might think of as product requirements and started thinking through what kinds of neural network models will allow us to get the same performance but don't require so much future context. They don't have to listen to the entire audio clip before they can give you a really high accuracy response. So kind of doing the language prediction stuff, like the OpenAI guys were doing with the Amazon reviews, like predicting what's coming next? Maybe not even predicting what's coming next, but one thing that humans do without thinking about it is if I misunderstand a word that you've said to me, and then a couple of words later, I pick up context that disambiguates it, I actually don't skip a beat. I just understand that as one long stream. And so one of the ways that our speech systems would do this is that they would listen to the entire audio clip first, process it all in one fell swoop and then give you a final answer. And that works great for getting the highest accuracy, but it doesn't work so great for a product where you need to give a response online, give people some feedback that lets them know that you're listening. And so you need to alter the neural network so that it tries to give you a really good answer using only what it's heard so far but can then update it very quickly as it gets more context. So I've noticed this over the past few years: people have gotten quite good at structuring sentences so Siri understands them. They put the noun in the correct position so it feeds back the data correctly. I found this when I was traveling. I was using Google Translate and after one day, I recognized that I couldn't give it a sentence, but if I gave it a noun, I could just show it to someone. And if I just show bread, it will translate it perfectly and give it. 
Do you find that we're going to have to slightly adapt how we communicate with machines, or is your goal to communicate perfectly as we would? I really want it to be human level and I don't see a serious barrier to getting there, at least for really high-value applications.
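The streaming behaviour described above (answer using only what has been heard so far, then revise as more context arrives) can be sketched in a few lines of Python. This is a toy under stated assumptions, not how Deep Speech is implemented: `stream_transcripts`, `toy_decode` and the `"TOO?"` placeholder for an ambiguous sound are all hypothetical names, and the "decoder" is a hand-written rule standing in for a neural network.

```python
def stream_transcripts(chunks, decode):
    """Yield a provisional transcript after every incoming chunk.

    chunks: iterable of audio chunks (here, toy word-like tokens).
    decode: function mapping the list of chunks heard so far to text.
    """
    heard = []
    for chunk in chunks:
        heard.append(chunk)
        # Answer immediately with what we have; later chunks may revise it.
        yield decode(heard)


def toy_decode(heard):
    """Hypothetical decoder: resolves an ambiguous sound once context arrives."""
    words = list(heard)  # copy so the caller's buffer is left untouched
    for i, word in enumerate(words):
        if word == "TOO?":  # stands in for a sound that could be "to" or "two"
            following = words[i + 1] if i + 1 < len(words) else None
            words[i] = "two" if following == "tickets" else "to"
    return " ".join(words)


if __name__ == "__main__":
    for partial in stream_transcripts(["book", "TOO?", "tickets"], toy_decode):
        print(partial)
```

The second partial result guesses "to" and the third revises it to "two" once "tickets" is heard, which mirrors the point about picking up disambiguating context a couple of words later. Re-decoding the whole buffer on every chunk keeps the sketch short; a production system would reuse earlier computation to keep latency low.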

Barriers (20:53)

I think there's a lot more research to do, but I sincerely think there's a chance that over the next few years we're going to regard speech recognition as a solved problem. That's very cool. So what are the really hard things happening right now? Like what are you not sure if it'll work? So I think we were talking earlier about getting all this data. So for problems where we can just get gobs of labeled data, I think we've got a little bit more room to run there, but we can certainly solve those kinds of applications. But there's a huge range of what humans are able to do, often without thinking, that current speech engines just don't handle. We can deal with cross talk and a lot of background noise. If you talk to me from the other side of a room, even if there's a lot of reverberation and things going on, it usually doesn't bother anybody that much. And yet current speech systems often have a really hard time with this. But for the next generation of AI products, they're going to need to handle all of this. And so a lot of the research that we're doing now is focused on trying to go after all of those other things. How do I handle people who are talking over each other, or handle multiple speakers who are having a conversation very casually? How do I transcribe things that have very long structure to them, like a lecture? Where over the course of the lecture, I might realize I misunderstood something. Or some piece of jargon gets spelled out for me, and now I need to go back and transcribe it. So this is one place where our ability to innovate on products is actually really useful. We've just launched recently a product vision called SwiftScribe to help transcriptionists be much more efficient. And that's targeted at understanding all of these scenarios where the world wants this long form transcription. We have all of these conversations that we're having that are just sort of lost, and we wish we had written down. 
But it's just too expensive to transcribe all of it for everyday applications. So in terms of emulating someone's voice, do you have any concerns about faking it? Because, did you see the face simulation? I forget the researcher's name, so I'll have to link to it. But you know what I'm talking about. So essentially you can feed it both video and audio, and you can recreate Adam talking. Do you have any thoughts on how we can prepare for that world? No, I think in some sense, this is a social question. I think culturally we're all going to have to exercise a lot of critical thinking. We've always had this problem in some sense, that I can read an article that has someone's name on it. And notwithstanding writing style, I don't know for sure where that article came from. And so I think we have habits for how to deal with that scenario. We can be healthily skeptical, and I think we're going to have to come up with ways to adapt that to this sort of brave new world. I think those are big challenges coming up, and I do think about them. But I also think a lot about just all the positives that AI is going to have. I don't talk about it too much. My mother actually has muscular dystrophy. And so things like speech and language interfaces are just incredibly valuable for someone who cannot type on an iPad, because the keys are too far apart. And so there are just all these things that you don't really think about that these technologies are going to address over the next few years. And on balance, I know that we're going to have a lot of big challenges of like, how do we use these? How do we as users adapt to all of the implications? But I think we've done really well with this in the past, and we're going to keep doing well with it in the future.

AI and work (24:52)

So do you think AI will create new jobs for people, or will we all be like mechanical Turks feeding the system? I'm not sure. I think this is something where the job turnover in the United States every quarter is incredibly high. It's actually shocking that the fraction of our workforce that quits one occupation and moves to another one is really high. I think it is clearly getting faster. Like we talked about this phenomenon within the AI lab here, where the deep learning research is flying ahead so quickly that we're often remaking ourselves to keep up with it and to make sure that we can keep innovating. And I think that might even be a little bit of a lesson for everyone, that continual learning is going to become more and more important going forward. Yeah, so speaking of which, what are you teaching yourself so the robots don't take your job? I don't think we're at risk of robots taking our jobs right now. Actually, it's kind of interesting. We've thought a lot about like, how does this change careers? One thing that has been true in the past is that if you were to create a new research lab, one of the first things you do is fill it with AI experts, where they live and breathe AI technology all day long. I think that's really important.

Great engineers are changing too (26:11)

I think for basic research, you need that kind of specialization. But because the field's moving so quickly, we also need a different kind of person now. We also need people who are sort of chameleons, who are these highly flexible types that can understand and even contribute to a research project, but can also simultaneously shift to the other foot and think about how does this interact with GPU hardware and a production system?

Future Chameleon (26:32)

And how do I think about a product team and user experience? Because often product teams today can't tell you what to change in your machine learning algorithm to make the user experience better. It's very hard to quantify where it's falling off the edge. And so you have to be able to think that through to change the algorithms. You also have to be able to look at the research community to think about what's possible and what's coming. And so there's this sort of amazing full stack machine learning engineer that's starting to show up. Where are they coming from? Like if I want to be that person, what do I do now?

Creating self-directed innovators (27:19)

Say I'm 18. They seem to be really hard to find right now. I believe it. So in the AI lab, we've really set ourselves to just creating them. I think this is sort of the way unicorns are, that we have to find the first few examples and see how exciting that is and then come up with a way for people to learn and become that sort of professional. Actually, one of the cultural characteristics of our team is that we look for people who are really self-directed and hungry to learn. Things are going so quickly. We just can't guess what we're going to have to do in six months. And having that sort of do-anything attitude of saying, well, I'm going to do research today and think about research papers. But once we get some traction and the results are looking good, we're going to take responsibility for getting this all the way to 100 million people. That's a towering request of anyone on our team, and the thing that we find really helps everyone sort of connect to that and do really well with that is being really self-directed and able to kind of deal with ambiguity and also really willing to learn a lot of stuff that isn't just AI research but is also stepping way outside of comfort zones and learning about GPUs and high performance computing and learning about how a product manager thinks.

Character Appreciation And Recommendations

Prizing Character (28:13)

Okay, so this has been super helpful. If someone wanted to learn more about what you guys are working on or even just things that have been influential to you, what would you recommend they check out on the internet?

Special recommendations (28:41)

Oh my goodness. So, I have to think about this one for a second here. I think the stuff that's actually been quite influential for me is actually like startup books. I think especially with big companies, it's easy to think of ourselves in silos of having a single job. One idea from the startup world that I think is really amazingly powerful is this idea that a huge fraction of what you're doing is learning. There's a tendency, especially amongst engineers, which I count myself as a member, is like we want to build something.

Career Development And Learning Shortcuts

Shortcutting learning in your career. (29:31)

And so, one of the disciplines we all have to keep in mind is that we also have to be really clear-eyed and think about what we do not know right now and focus on learning as quickly as we can to find the most important part of AI research that's happening and find the most important pain point that people in the real world are experiencing and then be really fast at connecting those. And I think a lot of that influence on my thinking is coming from the startup world. There you go. That's a great answer. Okay, cool. Thanks, man. Thanks so much.
