
Voyager AI Plays Minecraft, NVIDIA is a $1T Company, & Google SoundStorm

AI Daily | 5.30.23

Welcome to another episode of AI Daily! In today's edition, we have three news stories lined up for you. First up is NVIDIA, which joins the exclusive club of trillion-dollar companies alongside Amazon, Microsoft, and Apple. Next, we explore a project called Voyager, which takes a different approach to building AI agents in Minecraft: instead of retraining a model, GPT-4 writes, tests, and saves its own library of code skills. Our final story is Google's project SoundStorm, which introduces efficient parallel audio generation and can produce audio far faster than real time. Join us as we dig into these stories and what they could mean for the future.

Key Take-Aways:

NVIDIA Becomes $1T Company:

  • NVIDIA breaks the trillion-dollar market cap, joining Amazon, Microsoft, and Apple in an exclusive club.

  • NVIDIA's market leadership and monopoly status in the chip industry are clear, with their critical role in the AI ecosystem and impressive demos.

  • The future of microprocessors is promising, with potential for multiple trillion-dollar companies in the industry.

  • AMD poses competition in the enterprise market, but NVIDIA's focus on AI and scaling quickly may solidify their position. Other players may emerge in the next few years, including Apple, Google, Microsoft, and Meta, with their own in-house chips. The onshoring of chips in America may also contribute to the rise of upstart companies.


Voyager AI Plays Minecraft:

  • Voyager is a project in Minecraft where GPT-4, rather than being retrained, iterates on its own code base of skills and saves them in memory for future use.

  • The use of LLMs and self-correcting error loops in Voyager resulted in faster progress in Minecraft compared to traditional reinforcement learning techniques.

  • The training and improvement of models like Voyager can be recursive, either internally with LLMs training LLMs or externally using pipelines and frameworks to continually enhance performance.

  • The approach taken in Voyager has potential applications in real-world robotics, where robots can learn and improve their skills by iterating on their own internal code. This recursive model has significant implications for accelerating AI development.
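
The loop described in the bullets above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Voyager's actual implementation: `fake_llm` is a hypothetical stand-in for a GPT-4 call, `check_pickaxe` is a toy substitute for running a skill in the game, and the skill memory is a plain dictionary rather than the vector database mentioned in the episode. All names here are made up for the example.

```python
from typing import Callable, Dict, Optional


def fake_llm(task: str, last_error: Optional[str]) -> str:
    """Hypothetical stand-in for a GPT-4 call; returns source for `skill`."""
    if last_error is None:
        # First draft has a bug: `planks_needed` is never defined.
        return "def skill(inventory):\n    return inventory['wood'] >= planks_needed\n"
    # Once the error text is fed back, the stub returns a corrected draft.
    return "def skill(inventory):\n    return inventory['wood'] >= 3\n"


class SkillLibrary:
    """Stores working skill source by task name (a dict, not a vector database)."""

    def __init__(self) -> None:
        self._sources: Dict[str, str] = {}

    def __contains__(self, task: str) -> bool:
        return task in self._sources

    def add(self, task: str, source: str) -> None:
        self._sources[task] = source

    def load(self, task: str) -> Callable:
        namespace: dict = {}
        # exec on generated code is unsafe outside a sandbox; fine for a toy.
        exec(self._sources[task], namespace)
        return namespace["skill"]


def check_pickaxe(skill: Callable) -> None:
    """Toy environment check: raises if the skill fails on a sample inventory."""
    if skill({"wood": 3}) is not True:
        raise AssertionError("skill ran but did not succeed")


def learn_skill(task, library, llm, check, max_attempts=3) -> int:
    """Return how many LLM drafts were needed (0 if the skill was already saved)."""
    if task in library:
        check(library.load(task))  # reuse the saved skill, no new drafts
        return 0
    error = None
    for attempt in range(1, max_attempts + 1):
        source = llm(task, error)
        try:
            namespace: dict = {}
            exec(source, namespace)      # define the drafted function
            check(namespace["skill"])    # try it against the environment
        except Exception as exc:
            error = f"{type(exc).__name__}: {exc}"  # feedback for the next draft
            continue
        library.add(task, source)        # keep only drafts that worked
        return attempt
    raise RuntimeError(f"no working skill for {task!r}; last error: {error}")


library = SkillLibrary()
first = learn_skill("craft_pickaxe", library, fake_llm, check_pickaxe)
second = learn_skill("craft_pickaxe", library, fake_llm, check_pickaxe)
print(first, second)  # prints "2 0"
```

The first call needs two drafts because the error from the buggy draft is fed back into the next prompt; the second call needs none because the working skill is already stored. No model weights change at any point, which is the sense in which this kind of "training" needs no GPU passes.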


Google SoundStorm:

  • Google's project SoundStorm focuses on efficient parallel audio generation, allowing the generation of a significant amount of audio in a short time.

  • The model shows promise in terms of speed and quality, with the ability to generate 30 seconds of audio in just half a second.

  • Currently, neither the model nor its code is publicly available. Google is rapidly showcasing its internal AI projects, and the hosts hope this one becomes accessible in the near future.

  • The improved speed in audio generation opens up new possibilities for real-time applications, such as generating audio for NPCs in AI games.
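
The speed figures above reduce to simple arithmetic, sketched below. The only measured number is the one quoted in the episode (30 seconds of audio in half a second); the function names are hypothetical, and extrapolating to longer clips assumes generation time scales linearly with audio length, which is an assumption rather than a reported result.

```python
def speedup_over_real_time(audio_seconds: float, generation_seconds: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    return audio_seconds / generation_seconds


def seconds_to_generate(audio_seconds: float, speedup: float) -> float:
    """Estimated compute time, assuming the speedup holds at any length."""
    return audio_seconds / speedup


speedup = speedup_over_real_time(30.0, 0.5)          # figure quoted in the episode
hour_of_audio = seconds_to_generate(3600.0, speedup)  # linear extrapolation
print(speedup, hour_of_audio)  # prints "60.0 60.0"
```

At 60 times real time, an hour of audio would take about a minute, and under the same assumption a five-second NPC line would take roughly 0.08 seconds, which is why this kind of speedup matters for real-time use.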

Links Mentioned

Follow us on Twitter:

Subscribe to our Substack:


Conner: Good morning. Welcome to another episode of AI Daily. I'm your host Conner, joined by Ethan and Farb. Today we have another three great stories for you guys. Starting off first with NVIDIA breaking the trillion-dollar market cap: they're now joining Amazon, Microsoft, and the third, I believe, was Apple. Yes.

So, a very exclusive club. Meta was in there for a bit. Tesla was in there for a bit. They're out; NVIDIA's in. Ethan, what do you think on this?

Ethan: Yeah, I think this was clear and coming. You know, if you look at all the articles of the past week and you look at the state of the market, there are not enough chips, and NVIDIA is the only one positioned. That is priced in: people believe they will have a monopoly and they will deliver all the chips.

So, I think it's super clear they've been a monopoly and they've been super critical to the AI ecosystem. Everyone is talking about AI. They've crossed a trillion-dollar market cap. They showed off some crazy cool demos around gaming, around LLMs, around their new data infrastructure. They have a new DGX, I think GH200, supercomputer.

It's like trillions of parameters or something like that. So, very clear: they are the market leader and people believe in them. So, exciting to see. Farb?

Farb: You know, if you think that microprocessors are gonna be around for a few hundred years, maybe they'll be around for a few thousand years, then we are clearly in the early stages of the microprocessor. They haven't been around for that long.

So if they're gonna be around that long, then there's a lot more upside to be had. And you know, if you think that it's impressive that there's one trillion-dollar microprocessor company, there will probably be several. This isn't stopping anytime soon, certainly not in our lifetime. So it's just the beginning, and we can see that the beginning is going quickly, but I think the future is really where the magic's gonna be.

Ethan: Yeah. Any take on AMD? Sorry, go ahead.

Conner: Yeah, I was gonna say, NVIDIA really rode the gaming wave, then the crypto wave, and now the AI wave. As Farb said, microprocessors aren't going anywhere. But yeah, AMD, I don't know. NVIDIA's really targeting the enterprise with A100s, A10s, and now H100s; they really can hit the AI market, but I don't think AMD can. So Lisa?

Ethan: Not yet. Yeah, I think it has some weight, but, you know, there just aren't enough chips. And if NVIDIA does not scale super fast, I think we're gonna see AMD get some gains here and really cement themselves in the space as well. People will adapt and people need chips. So go NVIDIA, go AMD, go AI.

Farb: Oh, in a few years, in five, ten years, there'll be players besides those two. I'll lay my money on that anytime.

Conner: I'm with you. Of course, Apple, Google, Microsoft, and Meta already have their own in-house chips, so we'll just see more of that as well. So there will be upstarts.

Ethan: Yeah. Awesome. And we have this transition to American onshoring of chips, so I think we'll see upstarts there as well.

Conner: Yes. Okay, well, our next story up is Voyager. People set GPT-4 free in Minecraft and called it Voyager. The key thing here is that instead of training it in the traditional sense, as you train a model, you're now training by having GPT-4 iterate on a code base of skills it uses internally. So this really saves a lot.

You don't need any GPU passes. All the training can be done on a CPU, because it's really just editing and modifying its own code that it uses as tools, in a Toolformer-type sense. Very interesting. Ethan, what do you think here?

Ethan: Yeah. Well, you know, we've seen so many different reinforcement learning attacks on Minecraft before, trying to make these agents.

But as we've seen with LLMs and people with AutoGPT and agents in general, you give an LLM access to code, let it write its own code, let it correct errors, let it handle errors, and then you use a vector database and give it memory, and you create these real agents. So Voyager was amazing to me. You get to see that self-correcting error loop with prompting, and you get to see them saving skills.

So, you know, they made a pickaxe, and once they understood how to make a pickaxe, they saved that in the LLM's memory so it could access it again. We even saw that they completely restarted the game of Minecraft with the previous memory and let it go again. And I think it got to diamond, or, you know, the higher levels of Minecraft, much faster than any reinforcement learning technique ever.

And this was all done. You know, the important thing here is this was all done just with text and just with the API of Minecraft, so there's not even a visual component yet. Once you add multimodal to these models, we're gonna see insane agents across gaming and soon across industry. Mm-hmm.

Farb: Yeah, I think there are two things going on here.

I talked about this in a previous episode: there's this recursive use of LLMs to train LLMs. And I think there are two sorts of recursive models happening out there, and we're seeing them over and over again. I'll call one internally recursive and the other externally recursive. In the internally recursive models, you have LLMs, you know, sort of training LLMs to become better LLMs.

In the externally recursive model, you have something like Voyager, where you have people building pipelines and frameworks to run an LLM through some different piece of framework, store off some understanding, and then bring that back in.

So what we're seeing is these really cool recursive loops, whether they're just using LLMs to train LLMs or they're using LLMs plus other bits of code and other parts of a pipeline to, again, continually, repetitively run through this system that's been built to improve something, to train it, to get it to do what you want.

And this is just the beginning of all this stuff, and this is going to accelerate AI development, possibly more than AI development itself.

Conner: Yeah. The way they did Voyager really also applies to real-world robotics as well. Because they taught it to use code skills instead of having to retrain the entire model, you can easily see a future where Boston Dynamics-type robots are using their own internal code skills that they iterate on to become better at running, to become better at jumping, to become better at throwing even.

It's a very good model for making these LLMs stronger. So it's recursion everywhere. It is indeed. Completely agreed. Okay, our third story up is SoundStorm. Google's project SoundStorm is efficient parallel audio generation. They did a lot of internal things that make this seem a lot better than ElevenLabs. Farb, what'd you read on this? What'd you see?

Farb: You know, I thought it was interesting. It looks like they built this off of a couple of other existing models and began running these things in parallel to increase the speed at which things happen. You know, they can generate half a minute or more of audio in a couple of seconds, which wasn't possible previously.

And it's supposed to work quite well. I thought it was a cool advancement in the audio codec world. We'll see if it ever becomes something we can use. It doesn't look like it's currently available for you to jump in and start using. The code is not available. It looks like they're kind of rapidly showing off every last AI project they have internally.

These things have probably been running for three or four years and nobody's been allowed to talk about it, but now there's a, you know, gold rush to talk about everything that you're doing in AI at your company. So we're seeing it all come out. Hopefully this is something that people can play with in the near future.

Conner: Yeah, it was 30 seconds of audio in half a second on a TPU v4. I don't know if that's just Google's own internal architecture and how good their TPUs are. But of course, generating 60 times the length of audio in the time it takes to generate it is a very good margin. You can get an hour of audio in a minute, which is insane to think about.

Ethan, what'd you see on this?

Ethan: Yeah, just like with computer vision, when we see speedups to the model at the inference layer, you unlock entirely new applications. So with this model itself, Google's SoundStorm, the quality is just as high, from what I've seen, as ElevenLabs or some of these other models. And when you are able to go from 30 seconds of audio in 30 seconds to 30 seconds of audio in half a second, you unlock entirely new applications.

So I think the main thing to see here is, when you are using these models, if you've found yourself saying, hey, this is very slow, this is what's coming down the pipeline to generate audio faster and actually put it in applications in real time. You know, going back to NVIDIA's showcase of some AI games:

When you need those NPCs talking in real time and you want that audio in real time, these are the models that will make it happen, and not the kind of gigantic models that take a long time to run. So, cool innovations.

Conner: Yeah. I'm sure we'll see more like this. More open source, even. So, very exciting.

Next up... Actually, that was it. So what have you guys been seeing, Farb?

Farb: I've been playing around with Photoshop's generative AI feature. It's pretty mind-bending. We'll try to include an example in the video here, in the final edit, of what it did. Even when I look at it and know what happened, my brain still can't wrap itself around it.

It's kind of just completely mind-bending, and you're seeing everybody on Twitter talk about this. People were calling, you know, end of days for Photoshop for whatever reason, and now they're saying, sorry, Midjourney, looks like Photoshop's gonna kick your butt. You're seeing people take famous paintings and extend them.

Or take famous portraits and extend them. And it's just completely mind-bending to see what Photoshop's capable of doing in its first version. I sent it to a friend of mine who's a designer, who said, oh, 10 years later, Photoshop's generative fill finally works, because they've had something similar to this.

I mean, similar is a bit of a stretch to say, but they've had this concept of generative fill for a long time, and now it's finally working.

Conner: Yeah, they've always had the mash where you, like, select the part of the image you want to replace, select another part, and it always looks a little weird and takes forever to fix.

But now, yeah, we're seeing Photoshop take a big step up, and there's really this differentiation between the Midjourney cults and the Photoshop cults, really clashing heads on Twitter and everywhere now. Pretty funny. I also saw that GPT-4 made a typo, which is kind of interesting to see. I think it would be interesting to many people, given how they think these models work.

Most people think it wouldn't be possible for GPT-4 to make a typo, but we can link it below. Instead of writing "infringing on laws" correctly, it misspelled the phrase. Most people wouldn't think that's possible, but because of how the tokenization works internally, typos are very definitely possible.

And we've seen it happen now, even on a model as good as GPT-4.

Conner: Ethan?

Ethan: Yeah, I saw the Trump campaign drop a video using a lot of deepfake audio, referring to the DeSantis campaign. And I think, you know, we've all been talking about deepfake audio and video for years now, and watching it enter politics this fast and this normalized is actually interesting.

You know, I think people were concerned about a lot of the risks that came with it. We even saw The Daily Show do a Joe Biden kind of meme video using deepfaked audio. Deepfakes and fake audio entering the public discourse that fast, with people beginning to understand them rapidly, is something I think most of the AI safety alarmists did not expect.

So yeah, I saw the Trump campaign's video, and I think we're just gonna continue to see more and more deepfake audio over the next year, in new campaign videos and from new entrants into politics, and we'll see how they handle that. So, interesting, to say the least.

Farb: You know, a couple of points. One, on the GPT-4 typo error:

I think they should run this by the various AI doomers who are predicting the end of mankind because of this technology that apparently isn't capable of spelling basic words consistently. And then I also found it interesting that in the SoundStorm work, relevant to what you're saying, Ethan, with regards to deepfakes and things like that, they found that there are models that can still detect these things quite reliably as being fake.

I think they said 98 point something percent accuracy in detecting that it was a synthesized voice. Obviously the voices are gonna get better, but the detectors are possibly gonna get better too. They talked about audio watermarks, if you will. So there's a nice tension here where, you know, maybe AI will end the world, but first it's gotta learn how to spell.

And maybe deepfakes will cause the collapse of civilization, but first they're gonna have to be undetectable.

Conner: Yeah. With detecting the audio deepfakes, I always think: yes, it's very detectable with a raw audio file, but once it's over a call or very distorted, it becomes very hard to detect.

Farb: I believe so. Well, you know, Apple's got something dropping, I think in iOS 17, that will allow you to verify that the person you're talking with or texting with is actually that person. Not quite sure how the tech is going to work. It's some sort of passkey-type thing. We'll see how effective it is, but it's cool to see Apple planning for this world.

Ethan: A hundred percent.

Conner: All right, well I believe that was our show for today, guys. Thank you everybody for tuning in. Uh, we'll see everyone tomorrow. Have a great week. See you guys.

AI Daily