Making Music and Art Through Machine Learning - Doug Eck of Magenta | Transcription
Transcription for the video titled "Making Music and Art Through Machine Learning - Doug Eck of Magenta".
Hey, this is Craig Cannon, and you're listening to Y Combinator's podcast. Today's episode is with Doug Eck, Doug's a research scientist at Google, and he's working on Magenta, which is a project making music and art through machine learning. Their goal is to basically create open source tools and models that help creative people be even more creative. So if you want to learn more about Magenta or get started using it, you can check out magenta.tensorflow.org. All right, here we go. I wanted to start with the quote that you ended your IO talk with, because I feel like that might be helpful for some folks. So it's a Brian Eno quote, and I will have the slightly longer version. Yeah, good. Yeah. So yeah, it goes like this. Whatever you now find weird, ugly, uncomfortable, and nasty about a new medium will surely become its signature. CD distortion, the jitteriness of digital video, the crap sound of 8-bit, all of these will be cherished and emulated as soon as they can be avoided. It's the sound of failure. So much modern art is the sound of things going out of control, out of a medium pushing to its limits and breaking apart. So that's how you ended your IO talk. Correct. And what it kind of opened up for me was like, what, when you're thinking about creating magenta and all the projects there in as new mediums, how are you thinking about like, how are you thinking about what's going to be broken and what's going to be created? The reason that I put that quote there, I think, is to be honest with the division between engineering and research and artistry and to not think that what I'm doing is being a machine learning artist, but we're trying to build interesting ways to make new kinds of art.
Discussion On Generative Models And AI
Their goal is to create open-source tools and models that help creative people be even more creative. (01:15)
And I think, you know, it occurred to me, I read that quote and I thought, you know, that's it, right? No matter how hard Eastman or whoever invented the film camera tried, and I'm sorry if that's the wrong person, they clearly weren't thinking of breakage, or they were trying to avoid certain kinds of breakage. I mean, you know, guitar amplifiers aren't supposed to distort. And I thought, well, what if we do that with machine learning models? The first thing you're going to do if someone comes to you and says, here's this really smart model that you can make art with, what are you going to do? You're going to try to show the world that it's a stupid model, right? But maybe it's smart enough that it's kind of hard to make it stupid. So you get to have a lot of fun making it stupid, right? I was playing with Quick, Draw! this morning with my girlfriend, and what she was trying to do was make the most accurate picture that the computer wouldn't recognize, like immediately out of the gate. She works in art and, yeah, doesn't want to believe. It's a good intuition. I mean, you know. Yeah. So maybe the best way to start is then to talk about, like, what are you working on right now? What are you guys making? So right now we're working on, let me think, it's a good question. We have this project called NSynth, which is trying to get deep learning models to generate new sounds. And we're working on a number of ways to make that better. I think one way to think about it is we have this latent space. So to make that a little bit less buzzwordy, we have a kind of compressed space, a space that doesn't have the ability to memorize the original audio, but it's set up in such a way that we can try to regenerate some of that audio. And in regenerating it, we don't get back exactly what we started with, but hopefully we get something close.
And that space is set up so that we can move around in that space and come into new points in that space and actually listen to what's there. Right now it's quite slow to listen, so to speak, we're not able to do things in real time. And we also would love to be at kind of a meta level building models that can generate those embeddings, having trained on other data so that you're able to move around in that space in different ways. And so we're kind of moving, we're continuing to work with sound generation for music. And we also are spending quite a bit of time on rethinking the music sequence generation work that we're doing. We put out some models that were, you know, by any reasonable account primitive. I mean, kind of very simple recurrent neural networks that generate MIDI from MIDI and that maybe use attention, that maybe, you know, have a little bit smarter ways to sample as when doing inference, when generating. And now we're actually taking seriously, wait a minute, what if we really look at large data sets of performed music? What if we actually start to care about expressive timing and dynamics, care deeply about polyphony and really care about like not putting out kind of what you would consider a simple reference model, but actually what we think is super good. And I think, you know, those are the things we're focusing on. I think we're trying to actually make things, you know, really pull up quality and make things that are better and more usable for people. And so with all of that supervised learning, are you like, are you going to create a web app that people will evaluate how good the music is? Because I heard a couple interviews with you before where that was the issue, right? Like how do you know what's good? Yeah. So that is like the, I'm pausing because that's the big question I think in my mind is how do we evaluate these models? 
At least for Magenta, I haven't felt like the quality of what we've been generating has been good enough to bother, so to speak.
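The idea described above, of a compressed space where you can move to new points and listen to what is there, can be sketched with plain linear interpolation between two embeddings. This is only an illustration: the vectors `z_trombone` and `z_flute` are made-up four-dimensional stand-ins, and the encoder and decoder that would turn audio into embeddings and back are assumed to exist and are not shown.

```python
def interpolate(z_a, z_b, alpha):
    """Linear blend of two latent vectors: alpha=0 gives z_a, alpha=1 gives z_b."""
    if len(z_a) != len(z_b):
        raise ValueError("latent vectors must have the same length")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]


# Hypothetical embeddings standing in for two encoded instrument notes.
z_trombone = [0.0, 1.0, 2.0, -1.0]
z_flute = [1.0, 0.0, 0.0, 1.0]

# Walk from one point to the other in five evenly spaced steps.
path = [interpolate(z_trombone, z_flute, step / 4) for step in range(5)]

# The middle of the path is a new point that matches neither input.
midpoint = path[2]
```

In a real system each point on `path` would be handed to a decoder to produce audio, which is the slow "listening" step mentioned in the conversation.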
How do we evaluate these models? (05:14)
Like you find it, you cherry pick, you find some good things like, okay, this model trains and it's interesting. And now we kind of understand that the API, the input output of what we're trying to do. I would love, yeah, I don't know how to solve this. Like, conceptually what we do, here's what we do, right? We build a mobile app and we make a go viral. That's what we do, right? And then once it's viral, we just keep feeding all of this great art and music in and I used to do music recommendation. We just build a collaborative filter and which is a kind of way to make, you know, recommend items to people based upon what they like and we'd start giving people what they like and we pay attention to what they like and we make the models better. So all we need to do is make that app go viral.
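A collaborative filter of the kind mentioned here can be sketched in a few lines. This is a toy user-based version under stated assumptions: the `ratings` table, the user names, and the clip names are all invented for illustration, and cosine similarity is one common choice of similarity measure, not necessarily what any production recommender uses.

```python
import math

# Hypothetical feedback: user -> {clip: rating}, where 1 = liked, 0 = skipped.
ratings = {
    "ana": {"clip_a": 1, "clip_b": 1, "clip_c": 0},
    "ben": {"clip_a": 1, "clip_b": 1, "clip_d": 1},
    "cy": {"clip_c": 1, "clip_d": 0},
}


def cosine(u, v):
    """Cosine similarity between two sparse rating dictionaries."""
    shared = set(u) & set(v)
    dot = sum(u[k] * v[k] for k in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(user, ratings, top_n=1):
    """Score unseen clips by the similarity-weighted ratings of other users."""
    seen = ratings[user]
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(seen, other_ratings)
        for item, rating in other_ratings.items():
            if item not in seen:
                scores[item] = scores.get(item, 0.0) + sim * rating
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:top_n]


suggestion = recommend("ana", ratings)
```

Here `ana` and `ben` liked the same two clips, so the clip that only `ben` has heard is suggested to `ana`. The feedback loop described in the conversation would then use such signals to decide which generated pieces are worth learning from.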
All right. One simple thing. In fact, maybe someone in the Y Combinator world can help us do that, right? It's like sliding between cow and trombone. Yeah, right. Maybe that particular web app is not the right answer. Now, I mean, I'm saying that as a joke, but I think looking at it this way, if we can find a way, or the community in general can find a way, for machine-generated media to be sort of out there for a large group of interested users to play with, I think we can learn from that signal and I think we can learn to improve. And if we do, we'll make quite a nice contribution to machine learning.
We will learn to improve based upon human feedback to generate something of interest. (06:33)
We will learn to improve based upon human feedback to generate something of interest. So that's a great goal. But I'm totally like today in this room, you know, I wish I could tell you we had a secret, we had like a secret plan, you know, like, we're, you know, oh, he's figured it out, the app's going to launch tomorrow. It's like, it's really hard work. Like bleep. Yeah, exactly. Sorry. Interest. Okay. Because I was wondering what kind of data you were getting back from artists, you know, to people just use, use all of your projects, you know, all of the repos to create things of their own interest. Are they pushing back valuable data to you? So we're getting, we're getting some valuable data back. And I think what we're getting back, some of the signals that we're getting back are giving us such an obvious direction for improvement. Like why would I want to run a Python command to generate a thousand MIDI files? That's not what we do. You know, like, like, you get that kind of feedback and like, okay, we wanted this command line version because, you know, we needed to be able to test some things. But if musicians are really going to use the music part of what we're doing, we have to provide them with more fluid and more useful tools. And there I think we're, we're still sitting with so many obvious hard problems to solve like integration with something like Ableton or like really solid, you know, real time IO and things like that, that we know what to work on. But I think we'll get to the point pretty quickly where we'll have something that's kind of solves the obvious problems, plugs in reasonably well to your workflow.
What are artists using it for? (07:52)
And you can start to generate some things and you can play with sound. And then we need to be much more careful about, you know, the questions we ask and how good we are at listening to how people use what we're doing. And so what are artists using it for at this point? Right now, most of what we've done so far has had to do with music. If we look for a second away from music and look at SketchRNN, which is a model that learned to draw, we've actually seen quite a bit of, so first at a higher level, SketchRNN is a recurrent neural network trained on sketches to make sketches. And the sketches came from a game that Google released called Quick, Draw!, where people had 20 seconds to draw something to try to win it, with a computer, you know, a classifier counterpart. And so, you know, we train a model that can generate new cats or dogs or whatever. There's some really cool classes in there, like a cruise ship. It's nice. Yeah, the one that always threw me was camouflage. Like it calls out camouflage all the time. Like I've never. As if by definition, you can't draw it, right? Yeah. Nothing. Pause for 20 seconds. Right. Right. Yeah, I actually won a Pictionary round with the word white. And I just pointed at the paper and I'm like, no way. And she said white. I'm like, yeah, you've got to be kidding. Anyway, it's kind of like a corollary to camouflage. So we've seen artists start to like sample from the model. We've seen artists using the model as like a distance measure to look for weird examples, because the model has an idea of what's probable in the space. We've also seen artists just playing around with the raw data. And so there's been a nice explosion there. I think I'm not expecting that artists really do a huge amount with this Quick, Draw! data, because as cool as it is, these things were drawn in 20 seconds, right? There's kind of a limit to how much we can do with them.
On the music side, we've had a number of people playing with NSynth, with just like dumps of samples from NSynth. So basically like a rudimentary synthesizer. And there I've been surprised, I would expect that if you're really good at this, so like you're Aphex Twin, or, how about this, you want to be Aphex Twin, right? That you look at this and go, yeah, whatever. There are 50 other tools that I have that I can use. But those are the people that we've found have been the most interested. Because I think we are generating some sounds that are new. I mean, so first, as someone pointed out on Hacker News, you can take a few oscillators and a noise generator and make something new. But I think these are new in a way, when you start sampling the space between a trombone and a flute or something like that, that captures some very nice harmonic properties, captures some of the essence of the Brian Eno quote, kind of broken and glitchy and edgy in a way. But that glitchiness is not the same as you would get from like digital clipping. The glitchiness sounds really harmonic. And so, like for example, Jesse on our team, Jesse Engel, he built some Ableton plug-in where you're listening to these notes, but you're able to erase the beginnings of the notes. So you like erase the onset, which is usually where most of the information is. Like most of the information in a piano note is kind of that first percussive onset. But it's the onset that the model is doing such a great job of reproducing because it gradually kind of moves away from, in time it's a temporal embedding and the noise kind of adds up as we move through the embedding in time. So it's the tails of these notes that start to get ringy and like they'll detune and you'll hear these rushes of noise come in or there'll be this little weird, whoop, at the end.
And so like we've found that musicians who've actually played with sound a lot find these particular sounds immensely fascinating. I think these are the kinds of sounds that sound interesting in a way that's hard to describe unless you've played with them. I think they're interesting because the model has been forced to capture some of the important sources of variance in real music audio. And even when it fails to reproduce all of them, when it fills in with confusion, so to speak, even that confusion is somehow driven by musical sound. And you see the corollary if you look at something like Deep Dream and you see what models are doing when they're sort of showing you what they've learned. It may not be what you expect from the world, but there's something kind of interesting about them, right? Anyway, it's a long answer, but the short version of the answer is we've found that working with very talented musicians has been really fruitful, and our challenge is now to be good enough at what we do and make it easy enough and make it clean enough that even someone who's not an Aphex Twin, and I'm not saying we worked with Aphex Twin, we didn't work with Aphex Twin, but like, you know, that kind of artist, yeah, that we can also be saying, hey, this is really like genuinely musically engaging for a much, much larger audience. That's surprising. So it's not necessarily generating melodies for people so much as it is generating interesting sounds. That's what's brought them in. That's what's brought them in, though the parallel has existed for the sequence generation stuff. And what I noticed, even with A.I. Duet, which is this like web-based, like it's a simple RNN, it's like, I can lay claim, it's technology that was published in 2002.
It's really a very simple, really simple, but like, so this model, if you haven't, if your viewers haven't seen it, you play a little melody and then the model thinks for a minute and the AI genius, which is an LSTM network, comes back and plays something back to you, right? If you play for at least, you know, and you wait, you're expecting maybe, you know, that it'll continue the tune, it's not going to, right? It's going to go, "Dah, dah, dah, dah, dah, dah, dah."
AI bigger than the creator. (13:50)
So like this idea of like expecting the model to like carry these long arcs of melody along is not really understanding the model. What we saw was like, especially jazz musicians, but like musicians who listen, the game they play is to follow the model. And so they'll like, I'd see guys or people, women too, sit down and go like, "You know, dumb, dumb, dumb, dumb, dumb, dumb, dumb, dumb, dumb, dumb, dumb, dumb, dumb, dumb, dumb, and just wait." And it's almost like pinging the model with an impulse response. Like, what's this thing going to do? And then instead of trying to drive it, it comes back and goes, "Dah, dumb, dumb, dumb, dumb, right?" And then the musician says, "Oh, I see, let's go up to the fifth." And then you get this really, it's almost like follow the leader, but you're following the model and then it's super fun. And like it's basically a challenge for the musician to try to understand how to play along with something that's so primitive, right? But if you don't have the musical, so basically it's the musician bringing all the skill to the table, right?
Limits of generative models (14:46)
Like, even with the primitive sequence generation stuff, it's still been interesting to see that it's musicians with a lot of musical talent, and particularly the ability to improvise and listen, that have managed to actually get what I would consider interesting results out of that. So, yeah. So it's become more of like a call-and-response game than a tool? Yeah, I think so. And that's partially because the model is pretty primitive. I think that if we can get the data pipelines in order so that we know what we're training on, and we can actually do some more modern sequence learning, having like generative adversarial feedback and things like that, we can do much better. And we even have some stuff that we haven't released yet that I think is better. But yeah, I think as we make it better, it'll be more of, this model's going to give me some more ideas from what I've done. Right now it's more of a, this model's kind of weird but I'm going to try to understand what it's doing. Okay. Both are fun modes, by the way. Yeah. Like, they're both cool modes, right? Yeah. I mean, I've enjoyed it, like I'm definitely not a pianist. I mean, I've played guitar before and I tried to get a song going but I had trouble with it. We're sorry. It's okay. Yeah, I think it's mostly my fault, to be honest. I love the YouTube video. Blame the user, right? Yeah. Yeah. The video where some of, where like I played a song with it, that was amazing. Yeah, that was cool. It was very cool. I mean, a lot of that stuff as well. Yeah, we've seen, we haven't pushed the sequence generation stuff much because we really wanted to focus on timbre. Yeah. But when we have released things and kind of tried to show people were there, yeah, we've gotten, if you look, there's a Magenta mailing list, it's linked at g.co/magenta.
And if you look around, there's like a discussion list, which is as flamey and spammy as some discussion lists, but a little bit less so. You know, every couple of weeks, someone will put up some stuff they composed with Magenta. And usually they're more effective if they've layered their own stuff on top of it or they've, you know, taken time offline rather than in performance to generate. But some stuff's actually quite good. I think it's fun. It's a start. Yeah. I mean, I think it's great. And so how, like you compared it to, you know, the work you did in 2002, where has LSTM gone since then? Like, you know, you talk about like, you ended up doing this project, I saw in your talk, because you kind of like failed at it a while ago. Failure is good. Yeah. Yeah. So there was a point in time, I was at a lab called IDSIA, the Dalle Molle Institute for Artificial Intelligence. And I was working for Jürgen Schmidhuber, who's one of the co-authors.
Why do we have memory? (17:23)
He was the advisor to Sepp Hochreiter, who did LSTM. And there was a point in time where there were three of us in a room in Manno, Switzerland, which is a suburb of Lugano, Switzerland, who were the only people in the world using LSTM. It was myself, Felix Gers, and Alex Graves. Wow. Among the three of us, by far, Alex Graves has done the most with LSTM. So he continued after he finished his PhD, and he continued doggedly to try to understand how recurrent neural networks worked, how to train them and how to make them useful for sequence learning. I think more than anybody else in the world, including Sepp, the person who created LSTM, you know, Alex just stuck with it. And finally started to get wins, you know, in speech and language, right? And I more or less put down LSTMs. I started working with audio stuff and other more like cognitively driven music stuff at University of Montreal. But like it worked finally, right? And it was, you know, like there's this thing in music, a 20-year overnight success, right? So it's like this worked because he stuck with it. And now, of course, it's, you know, become like sort of the touchstone for, you know, recurrent models in time series analysis. Some version of it forms the core of what we're doing with translation. I mean, these models have changed, right? They've evolved over time. But basically, you know, recurrent neural networks as a family of models is around because of that effort. It's interesting, right? There really were three of us. Wow. And Felix went on with his life and I went on with my life, and Alex stuck with it, kind of really one person carrying it forward. But you may get letters from people saying, hey, wait, you forgot about me. You forgot about me. This is a little bit reductionist. Obviously there were more, but it felt that way at the time, right?
What was the breakthrough then that like got people interested?
What made LSTM different? (19:13)
I think it was the same breakthrough that got people interested in deep neural networks and convolutional neural networks. It's that these models don't work that well with small training sets. And small models. So like ImageNet. Yeah, they're data absorptive, meaning that they can absorb lots of data if they have it. And you know, neural networks as a class are really good with high-dimensional data. And so as machines got faster and memory got bigger, they started to work. So you know, we were working with really small machines and working with LSTM networks that, you know, maybe had like 50 to 100 hidden units and then a couple of gates to control them, and trying things that had to do with like the dynamics of how these things can count and how they can like follow time series. So you try to scale that to speech, or you try to scale that to, you know, speech recognition was one of the first things.
Was it memory size or patterns? (19:57)
This is really hard to do. So I think a lot of this is just due to having faster machines and more memory. It's kind of weird, right? It's surprising that that would be it. Yeah, I think it surprises everybody a little bit. Yeah. But now the running joke, like having coffee here at Brain, is sort of like, what other technology from the 80s should we rescue? You know, it's exactly right. Hey, I was back. Exactly right. How far have you pushed LSTM? Like, you know, obviously there's some amount of like text generation that people are trying out. You know, have you let it create an entire song? No, we haven't, because we haven't got the conditional part of it right yet. So I think like LSTM in its most vanilla form, I think everybody's pretty convinced that it's not going to handle really long time scale hierarchical patterning. And I'd love it if someone comes along and says, no, you don't need anything but vanilla LSTM to do this. But you know, I think what makes music interesting over, you know, even after like five seconds or ten seconds is this idea that, you know, you're getting repetition, you're getting, you know, harmonic shifts like chord changes. There's a there there, right? And one way to talk about that there there is that, you know, you have some lower level pattern generation going on, but there's some conditioning happening. Oh, now continue generating, but the condition shifted, we just shifted chords, for example. And so I think if we start talking about conditional models, if we talk about models that are explicitly hierarchical, if we talk about models that we can sample from in different ways, we can start to get somewhere. But I think, you know, a recurrent neural network alone, it would be reductionist to say that it's the whole answer, and in fact it's true that it's not the whole answer. I was thinking about how you were, was it the TensorFlow or the I/O talk where you're talking about Bach? Oh, probably.
Oh, that we did stuff that was like more Bach than Bach. Yeah, we nailed it. Yeah, that's like you start making things that are, you know, more palatable. It's like, I'll make the best Picasso painting for you, but it's not necessarily a Picasso painting because it's not necessarily saying anything. Precisely. Right. So I think, by analogy, so first, in case it's not clear, I don't believe that we made something that was better than Bach, but when we put these tunes out for like untrained listeners to listen to, they sometimes voted them as sounding more Bach. And I think, imagine what these models are learning, right? They're learning kind of the principal axes of variance. They're learning what's most important. They have to, right? Because they have a limited memory. They're compressed, right? So if you sample from SketchRNN with very low temperature, meaning without a lot of noise in the system, you actually get what, if you want to squint your eyes and think Greek philosophy, is like the platonic cat, you know, you get the cat that looks more like a cat than anyone would draw, sort of the average cat. And I think that's sort of what we're getting from these time series models as well. They're kind of giving you something that's more a caricature than a sample, right? So then in the creation of art, like what are you predicting is going to happen as Magenta progresses?
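The low-temperature sampling mentioned above can be illustrated with a temperature-scaled softmax. This is a generic sketch, not SketchRNN's actual sampler: the three `logits` are invented scores for three possible next strokes or notes, and the function names are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the peak."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(logits, temperature, rng):
    """Draw one index according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Made-up scores for three candidate outputs.
logits = [2.0, 1.0, 0.1]

cold = softmax_with_temperature(logits, 0.1)   # nearly always the top choice
hot = softmax_with_temperature(logits, 10.0)   # close to uniform

choice = sample(logits, 0.1, random.Random(0))
```

At low temperature almost all probability lands on the single most likely option, which is one way to read the "average cat" remark: the model keeps returning its most typical output. Raising the temperature spreads probability across the alternatives and brings back variety, along with more mistakes.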
Where will AI go? (23:16)
I think in, can I make predictions that are on the timeframe of like 28 to 40 years? No one will ever test. In a thousand years magenta is going to be the only, no. So joking aside, I do believe that the machine learning and AI will continue like will become part of the toolkit for communicating and for expression, including art. I think that in the same way, I think that it's healthy for me to admit that those of us who are doing this engineering won't, almost by definition, know where it's going to go. We can't and we shouldn't know where it's going to go. I think our job is to build smart, you know, AI smart tools. At the same time, I want to point out like some people find that answer boring, like it's hedging, but I do think there are directions. I can imagine direction that we could go on that would be really cool. For example, thinking of literature, right?
What does the joke generator look like? (24:15)
I think plot is really interesting in stories and that you can imagine that we have a particular way as humans, like the kind of cognitive constraints that we have, like kind of limitations and how we would draw plots out as an author. You're not going to do it in one pass left to right, like a recurrent neural network. It's going to be like sketching out the plot and do we kill this character off? But I kind of can imagine that generative models might be able to generate plots that are really, really difficult for people to generate, but still make sense to us as readers. Right? Oh, man. Yeah. Think of it.
AI And Comedy
Making jokes with AI (24:57)
If you flip it around, I think jokes are hard because it's really hard to generate the surprising turns and the re like kind of like you go in one direction and you land over here, but it still makes sense. I can imagine that the right kind of language model might be able to generate jokes that are super, super funny to us and that actually might have a flavor to them of being like, yeah, I know this joke must have been machine generated because it fits in so many different ways, right? It like fits in so many different ways in a math way, like in a high dimensional space, but it's super funny to us. Like, I don't know how to do that, but I can totally imagine that we would be in a world where we get that. I thought about it in the complete opposite way, but that makes sense. I was thinking about it, you know, training it to create pulp fiction. I think that'd be so simple in my mind. Like you can just create these like airport novels. It can just like bang out the plots. I mean, that's probably where we'll start. I mean, I would love it if we could write. So everybody understands that listening or watching, you know, we can't generate a coherent paragraph, right? So I mean, we magenta, I mean kind of we humanity. I can't write at all. It's really hard. And it all hits, it all, I think it all hits at structure at some level, like, you know, nested structure, whether it's music or I think there's like, like an art plays with geometry or color or something else. You know, it's, it's, it's meaning. It's, you know, it's nested structure somewhere. And it, and has the art world or, you know, I guess any kind of artist, any kind of creator have people push back in the way that they're scared? You know, I imagine when photography came out, everyone was pushing back saying like this, this might end painting because it's about, you know, photography captures the essence. 
But then it ended up changing because people realized that painting wasn't just about capturing something, you know, capturing an exact moment. Certainly the generative art world, and we've seen lots of that. Another researcher in London, someone posted on his Facebook something like, he posted to us a tweet that was like, what you're doing is bad for humanity. And it's like, really, like he's making like new folk songs, right? It's like generating folk songs with an LSTM, this is Bob Sturm. I'm like, it's probably not bad for humanity. So yeah, of course. But like what I love about that is, you know, it's okay if a bunch of people don't like it. And in fact, if it's interesting, what art does everybody like? Zero. Right. Or it's really boring, right? So you have this idea that like, if you want to really engage with people, you're probably going to find an audience, and that audience is going to be some slice. Frankly, it's probably going to be some slice coming up from the next generation of people that have experienced technology, that are taking some things for granted that are still novel to someone like myself, right? You know, but it's, you know, it's okay if a bunch of people don't like it.
Lack of pushback (27:38)
Yeah. Well, when we were talking before, I was surprised that you hadn't gotten more pushback. It seems to be like most people in our world are just like, okay, whatever. It's like, do your thing. It's kind of opening up new territory rather than, you know, challenging. I think that I've gotten pushback in terms of questions. I think we have, and I think this is as a community, in Google and outside of Google and outside of Magenta. I think people are really clear that what's interesting about a project like this is that it be a tool, not a replacement. And I think if we presented this as, you know, push this button and you'll get your finished music, it would be a very different story, but it's kind of boring. I think it's super, yeah, I mean, it's funny you mentioned Hacker News because I was talking with one of the moderators. We love you, Hacker News. Yeah, they're great. They're great. Yeah, it's just impersonal. It's so easy to critique people, but I was talking with Scott, one of the moderators, and he was wondering if you guys were concerned with the actual like cathartic feeling of creating music, or if that's just something you don't even consider right now. I mean, as people, we have, of course.
El Loco (28:56)
You have to. Yeah. There's a couple of levels there. I think you lose that if what you're just doing is pushing a button, and I think this is everywhere. The drum machine is such a great thing to fall back on. It is just not fun to just push the button and make the drum machine do its canned patterns. And I think that was the goal. My sense from the reading that I've done is it's like, this will make it really easy, right? But what makes the drum machine interesting is people working with it, writing their own loops with their own patterns, changing it, working against it, working with it. And so, you know, I think this project loses its interest if we don't have people getting that cathartic release, which, believe me, I understand what you mean. That's the thing. The other thing I would mention is, if there's anything that we're not getting that I wish we were getting more of, it's creative people, people coding creatively, right? We talk about creative coding in this kind of hand-wavy sense, but I would love to have the right kind of mix of models in Magenta, and in open source linking to other projects, that you as a coder could come in and actually say, I'm going to code for the evening and add some things. I'm going to maybe retrain. Maybe I'm going to hack data and I'm going to get the effect that I want. And that part of what you're doing is being an artist by coding. Sure. And I think we haven't hit that yet in Magenta. I'd love to get feedback from whomever in terms of ways to get there. The point is there's a certain catharsis for those of us that train the model. You get the model to train. It's like, it's fine. You'll be bored if you just push the button, but it feels good for me to push that button because I'm the one that made that button work, you know? So there's that, right? You know, it's a creative act in its own right.
And have people been creatively breaking the code? Like, oh, it would be funny if it did this or interesting if it did that. A few, though. I think our code is so easy. Like most open source projects, it needs to be rewritten a couple of times. And I think, you know, we're on our second rewrite. If the code is brittle enough that it's easy to break uncreatively, then it's also hard to break it creatively.
Who broke possibly a critical feature of Magenta? (31:00)
And listen, I'm being pretty critical. I'm actually really proud of the quality of the Magenta open source effort. I actually think we have, you know, well tested, well thought out code. I think it's just a really hard problem to do coding for art and music. And, you know, if you get it wrong a little bit, it's just wrong enough that you have to fix it. And so, you know, we still have a lot of work to do. So then where does that creative coder world go? I've seen a lot of people that are concerned with even just preserving it. I think Rhizome is doing the preserving digital art project. What direction do you think that's going to go in? Presumably a number of cool directions in parallel. The one that interests me personally the most is reinforcement learning.
GANs and Reinforcement Learning (31:49)
And this idea that, you know, models train, so there's a long story or a short story. Which one do you want? Long story. Okay. Well, so, you know, it's not that bad. So, generative models 101: you start generating from a model trained just to be able to regenerate the data it's trained on. You tend to get output that's blurry, right? Or is just kind of wandering. And that's because all the model learns to do is kind of sit somewhere on the big, imagine the distribution as a mountain range, and it just sits on the high mountain. Kind of plays it safe. Yeah, kind of plays it safe. All T-shirts are gray if you're colorizing, because that's safe, you're not going to get punished. And you know, one revolution that came along thanks to Ian Goodfellow is this idea of a generative adversarial network. It's a different cost for the model to minimize, where the model is actually trying to create counterfeits and it's forced to not just play it safe, right? I don't know if this is too technical. It's very interesting to me. Yeah, this was part of the talk, right? Where you cut out the square. Yeah, exactly. Yeah, so that part. So another way to do this is to use reinforcement learning. Yes. And it's slower to train than GANs because all you have is a single number, a scalar reward, instead of this whole gradient flowing, but it also is more flexible. Okay. So my story here is that GANs are part of a larger family of models that are at some level critical. Everybody needs a critic. And they're pushing back and they're pushing you out of your safe spot, whatever that safe spot is, and that's helping you be able to do a better job of generating. We have a particular idea that you can use reinforcement learning to provide reward for following a certain set of rules or a certain set of heuristics. And normally, if you mention rules at a machine learning dinner party, everybody looks at you funny, right?
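The "plays it safe" point can be made concrete with a toy calculation. This is only a sketch under stated assumptions, not anything from Magenta: the shirt colors are invented, and squared error stands in for whatever cost a real colorization model would use. When the data has two distinct answers, the single guess that is punished least is the average of the two, which matches neither.

```python
from statistics import mean

# Hypothetical training data: T-shirt colours as RGB, half red and half blue.
shirts = [(255, 0, 0)] * 50 + [(0, 0, 255)] * 50

def mse(prediction, data):
    """Mean squared error of one fixed prediction against every example."""
    return mean(
        sum((p - x) ** 2 for p, x in zip(prediction, example))
        for example in data
    )

# The single prediction that minimises squared error is the per-channel mean.
safe_guess = tuple(mean(channel) for channel in zip(*shirts))

print("safe guess:", safe_guess)
print("error of the safe guess:", mse(safe_guess, shirts))
print("error of committing to red:", mse((255, 0, 0), shirts))
```

The safe guess here is a muddy in-between color that no shirt in the data actually has, yet it scores better than committing to red. That is roughly the blurry, washed-out behavior described above, and it is the behavior a critic (a GAN discriminator or a reward signal) is meant to push the model away from.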
Like you're stepping backward. Yeah, you're not supposed to use rules. Come on, we don't use rules. But instead of building the rules into the model, the AI is not rules. The machine learning is not rules. It's that the rules are out there in the world and you get rewarded for following them. And we had, I thought, some very nice generated samples of music that were pretty boring with the LSTM network. But then the LSTM network was trained additionally, using a kind of reinforcement learning called deep Q-learning, to follow some of these rules. The generation got way different and way better. And specifically, catchier. What were the rules? The rules were like rules of composition for counterpoint from the 1800s. They were super simple. Now, we don't care about those rules. But there's a really nice creative coding aspect, which is, think of it this way: I have a ton of data. I have a model that's trained. I have a generative model. Whatever it may be, it may be one trained to draw, maybe one trained for music. And that model has kind of tried to disentangle all the sources of variance that are sitting in this data. And so it can smartly, you know, generate new things. But now think: as long as I can write a bit of code that takes a sample from the model and evaluates it, providing scalar reward, then whatever I stuff in that evaluator, I can get the generator to try to do a better job of generating stuff that makes that evaluator happy. It doesn't have to be 18th century rules of counterpoint. Yeah. So you could imagine taking something like Sketch-RNN and adding a reinforcement learning model that says, "I really hate straight lines." And suddenly, the model is going to try to learn to draw cats. Yeah. But without straight lines. The data is telling it to draw cats. Sometimes the cats have triangular ears with straight lines. But the model is going to get rewarded for trying to draw those cats that it can without drawing straight lines.
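The idea of a trained model plus a separate evaluator can be sketched in a few lines. To be clear about assumptions: this is not deep Q-learning and not Sketch-RNN. It swaps in a much simpler stand-in, reweighting a tiny made-up probability table by exp(strength * reward). The stroke names, probabilities, straightness scores, and the `tilt` helper are all hypothetical.

```python
import math

# Hypothetical base model: probabilities it assigns to four ways of drawing
# a cat's ear, as if learned from sketch data. The values are made up.
base_probs = {
    "straight_triangle": 0.50,
    "slightly_curved": 0.30,
    "rounded": 0.15,
    "scribble": 0.05,
}

# Hypothetical measurement: how much of each stroke is a straight line.
straightness = {
    "straight_triangle": 1.0,
    "slightly_curved": 0.4,
    "rounded": 0.0,
    "scribble": 0.1,
}

def evaluator(stroke):
    """Scalar reward for 'I really hate straight lines': straighter is worse."""
    return -straightness[stroke]

def tilt(probs, reward_fn, strength):
    """Reweight each outcome by exp(strength * reward), then renormalise."""
    weights = {x: p * math.exp(strength * reward_fn(x)) for x, p in probs.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

tilted = tilt(base_probs, evaluator, strength=3.0)
for stroke, p in sorted(tilted.items(), key=lambda kv: -kv[1]):
    print(f"{stroke:18s} base={base_probs[stroke]:.2f} tilted={p:.2f}")
```

After tilting, the straight-edged ear becomes much less likely and the rounded ear becomes the favorite, but the slightly curved ear still beats the scribble because the data preferred it. The evaluator nudges the model; it does not replace what the model learned from the data. Swapping in a different `evaluator` changes the nudge without touching the base model.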
And straight lines was just one constraint that I picked off the top of my head. It has to be a constraint that you can measure in the output of the model. But like musically speaking, if I could come up with an evaluator that described what I meant in my mind by shimmery, really fast changing, small local changes, I should be able to get like a kind of music that sounds shimmery by adding that reward to an existing model. And furthermore, the model still retains like the kind of nice realness that it gets from being trained on data.
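As a rough illustration of "a constraint that you can measure in the output of the model", here is one guess at what a "shimmery" evaluator might look like. The definition used (the fraction of note-to-note moves that change pitch by at most two semitones), the function name `shimmer_reward`, and the three sample melodies are assumptions made for this sketch, not anything from Magenta.

```python
def shimmer_reward(pitches, max_step=2):
    """Fraction of note-to-note moves that change pitch by a small interval.

    `pitches` is a list of MIDI note numbers. A move counts as 'shimmery'
    when the pitch changes, but by no more than `max_step` semitones.
    """
    if len(pitches) < 2:
        return 0.0
    moves = list(zip(pitches, pitches[1:]))
    small_changes = sum(1 for a, b in moves if 0 < abs(b - a) <= max_step)
    return small_changes / len(moves)

# Hypothetical samples, as if drawn from an already trained melody model.
samples = {
    "held_note": [60, 60, 60, 60, 60, 60, 60, 60],
    "wide_leaps": [60, 72, 55, 67, 48, 64, 52, 70],
    "trill_like": [60, 62, 60, 61, 60, 62, 61, 60],
}

ranked = sorted(samples, key=lambda name: shimmer_reward(samples[name]), reverse=True)
for name in ranked:
    print(f"{name:10s} reward={shimmer_reward(samples[name]):.2f}")
```

The function only scores melodies; it says nothing about how to write one. That is the distinction being drawn here: the evaluator returns a single number, and the trained model is what gets nudged toward samples that score well.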
The best pop song (35:57)
I'm not trying to come up with a rule to generate shimmery. I'm trying to come up with a rule that rewards a model for generating something shimmery. Yeah, it's very different, right? So I think that's one really interesting direction to go in, opening up that ability. If you can generate scalar reward and drop it in this box over here, we'll take a model that's already trained on data and we'll tilt it to do what you want it to do. That kind of underlies a fear that people have, right? Which is, what happens when you can create the best pop song, and what do people do? And do you have thoughts on, A, is that possible? And what would the world look like if that comes to be? I had an algorithm for doing this, for the best pop song for me, which is, when we used to sell used CDs, it was usually like a two-to-one trade. So if you have a thousand CDs, you trade them in and you have 500 that you like better. And you just kind of keep going, right? You finally get that one. Exactly. Still climbing in that space. So I'm not sure. Part of me wants to say people love the kind of rawness and the variety of things that aren't predictable pop, but let's face it, people love pop music. There's a kind of pop music that you'll catch on the radio sometimes that isn't like... most of your listeners, or viewers, are probably in the same camp: there's pop that we love. Like, you know, I love the pop that is Frank Ocean's music. I can listen to it forever. But then there's the gutter of pop. You can't even distinguish who the artist is, but they play at the big festival. So I guess that kind of un-asks the perfect pop question. I mean, pop is such a broad thing. Yeah. But yeah, I can imagine that with machine learning and AI at the table, we will... Here's another way to look at it: some things that used to be hard will be easy, right?
And so we'll offload all of that. And if people are happy just listening to the stuff that's now easy, then yeah, it's a problem solved and we'll be able to generate lots of it. But then what people tend to do is go look for something else hard, right? It's like the drum machine argument. So you've solved the metronomic, you've solved the metronomic, you know, beat problem. And then what you actually find is that artists who are really good at this, they play off of it and they're allowed, like when they sing to do many more like rhythmical things than they could do before, because now they have this scaffolding they didn't have to work with before. They just constantly break it, right? As soon as you've distorted the electric guitar. But I hope that's an honest answer to your question. I mean, your question was a different flavor. It's like, hey, are we really moving towards a world where we're going to generate the perfect pop song? Yeah, I don't know. I don't think so. I mean, I don't feel like that's going to happen. But you know, maybe it happens so quickly. And then as soon as we realize like, okay, this is how we're going to break it. This is how we're going to retrain ourselves. Yeah.
Magenta Project Overview
Holy Grail for Magenta (39:01)
It can learn so fast that it's like, okay, now I can do that too. Yeah, that's nice. Yeah, that's nice. Then I can do that too. So then what I was wondering is, in, you know, the next handful of years, is there a holy grail that you're working toward for Magenta? Like, okay, now we've hit it. This is the benchmark that we're going for. There are a couple of things I'd love to do. I think composing, creating long form pieces, whether they're music or art, is something we want to do. And this hints at this idea of not just having these things that make sense at like 20 seconds of music time, but that actually say something more. That direction is really interesting. Because, let's face it, that would be at least more interesting if you pushed the button and listened to it. But also this leads to tools where composers can offload very interesting things. Some people, and I'm one of these people, are really obsessed with expressive timing. I'm really obsessed with musical texture. Okay. I don't know what that is. Oh, no, I just mean, let's say you're playing the piano. Oh, I saw it in the gray art space, the gray talk. You were contrasting the piano play. Yeah, you did your homework. Yeah, exactly. So if you listen to someone play a waltz, it'll have a little lilt to it. Or some of my favorite musicians, like Thelonious Monk, if you're familiar. If you're not familiar with Thelonious Monk, homework: go listen to him. He played piano with a very, very specific style that almost sounded lurching sometimes. He really cared about time in a way.
And so if the way that you're thinking about music and composition is really, really caring about kind of local stuff, right, it'd be very, very interesting if you had a model that would handle for you some of the decisions that you would make for longer timescale things, like when do chord changes happen, right? Usually it's the other way around. You have these AI, you know, machine learning models that can handle local texture, but you have to decide that. Yeah, my point is, if we get to models that can handle longer structure and nested structure, we'll have a lot more ways in which we can decide what we want to work on versus what we have the machine help us with, right? And has this affected your creative work, or do you still do creative composition? Yeah, so, I'm working here at Google and, you know, this is like a coal mine of work, to do this project Magenta every day. No, joking aside. Yeah, plus two kids, plus two kids. Yeah, no, basically I've been using music as more of a catharsis, relaxation thing. I don't feel like personally I've done anything recently that I would consider creative at a level that I want to share with someone else. It's been more like jamming with friends or, you know, just throwaway compositions, jamming. Like, here's the 10 chords that sound good, let's jam over it for the evening and then don't even remember the next day. And really trying hard to understand this creative coding thing. That's the most I've worked on. A lot of it's just, I'll start and then I'll get distracted. But yeah, so that's sort of the level of my creative output. Well, the creative coding thing, it's seemingly, I don't know, so many people are looking for it in every venue and it's so difficult to find people. There's kind of one-offs now. Yeah, I think that's right.
Conclusion Of Creative Project
Creative Project (42:16)
It's so hard to have the right, I think maybe, you know, maybe we need the GarageBand of this. We need to have something that's so well put together that it makes it easy for a whole generation of people to jump in and try this even if they haven't had, you know, four or five years of Python experience. I didn't know if that's what you were alluding to when you were saying that, you know, the command line is obviously not the way to do it, where it dumps MIDI files. But now it's an API, right? Yeah. And what is the next step that's very obvious? Try to make it more usable and more expressive. Right. Expressivity is hard in an API, right? You know, it's hard to get it right. And I think it's almost always multiple passes. So I think the core API that allows us to move music around in real time, in MIDI, yeah, and actually have a meaningful conversation between an AI model and multiple musicians, is there. And there's just a bunch more thinking that needs to happen to get it right. Cool. So if someone wants to become a creative coder or wants to learn more about you guys, what would you advise them to check out? So I would say the call to action for us is to visit our site. The shortest URL is g.co/magenta. Okay. It's also magenta.tensorflow.org. We can link it all up. And have a look, we have some open issues. We have a bunch of code that you can install on your computer and hope you can make work, and maybe you will be able to. And you know, we want feedback. We have a pretty active discussion list, and we certainly follow it closely, and are game for philosophical discussions and are game for technical discussions. And you know, beyond that, we're just kind of keeping rolling. We're just going to try to keep doing research and keep trying to build this community. Okay. Great. Thanks, man. Sure. No, it's fun. Thank you. Thank you.