3D LLM | VIMA | FreeWilly1&2

AI Daily | 7.25.23

Welcome to another episode of AI Daily, where your hosts, Farb, Ethan, and Conner, delve into the latest in the world of AI. In this episode, we cover 3D LLM, which combines large language models with 3D scene understanding and points to a future where AI can understand and navigate entire rooms for smart homes and robotics. We also discuss VIMA, a demonstration of how large language models and robot arms can work together, driven by multimodal prompts. Lastly, we explore the implications of StabilityAI's recent launch of FreeWilly1 and FreeWilly2, open-source AI models trained on GPT-4 output.


Quick Points:

1️⃣ 3D LLM

  • A revolutionary mix of large language models and 3D understanding, enabling AI to navigate full spatial rooms effectively.

  • Potentially instrumental for smart homes, robotics, and other applications requiring spatial understanding.

  • Takes 2D views of a 3D point cloud, runs existing 2D vision models on them, and positions the results back in 3D space.

2️⃣ VIMA

  • A groundbreaking demonstration of robot arms working with large language models, expanding their capabilities.

  • Uses multimodal prompts (text, images, video frames) to mimic movements and tasks.

  • Whether its simulated performance holds up against real-world edge cases on a physical robot arm remains to be seen.

3️⃣ FreeWilly1 & FreeWilly2

  • Open-source AI models launched by StabilityAI, trained on GPT-4 output.

  • Demonstrates the capability of the Orca framework in producing efficient AI models.

  • The models are available for research purposes only; FreeWilly2, built on Llama 2, improves on FreeWilly1, which was built on the original Llama.


Transcript:

Farb: Hello and welcome to the wonderful world of AIDaily. We're excited to be with you here. I'm your host Farb, and here with our other hosts, Ethan and Conner. Let's jump into today's first story. It's called 3D LLM. Gives you a little bit of clue of, uh, what it's about, but it's applying large language models and meshing them with the 3D world.

And, you know, what can we do? Large language models aren't, you know, made to understand the 3D world necessarily. The 3D world has not been turned into, you know, the state of AI that LLMs are. So how do we bring these two worlds closer together and make some magic with them? Uh, Conner, can you tell us a little bit about, uh, what this paper is showing off?

Conner: Showing off. Yeah. So of course so far LLMs are great at understanding text, and nowadays models like GPT-4 are getting pretty good at understanding images. This is taking that a step further, where now you can plug in an entire 3D scene, like an entire 3D NeRF scan of a room, into an LLM, and you can ask like, Hey, help me find the fridge, and it'll like guide you around the room. Or you can say like, Hey, how would I move from this to here? And it would tell you how to do that.

Or you can say like, Hey, where is my suit at? And it would like say it's in the wall. And even like very fine tuned, like specific details on like, I need to iron my suit. And it'd be like, okay, step one, the iron is right here next to the cabinet. So it's essentially giving LLMs the power to fully understand 3D scenes.

And I'm sure this is gonna be very helpful, very capable in smart homes and robotics and really everything where you need to connect AI to understand full spatial rooms.

Farb: Yeah, that's, I think that's, that's a great description. And you know, I'm not, I think what they're proposing, you know, that's going to happen, is going to happen.

We'll see if this specific approach is the approach that ends up, you know, making sense in your smart home. Uh, that remains to be seen. But what they're showing is that this is certainly possible, and it's not like they required a billion dollars and 500 people to pull it off. So, you know, if Apple applies, uh, its power and resources, you could, you know, understand that really cool things are gonna be possible.

What, what did you get out of it, Ethan?

Ethan: Yeah, I think that's spot on. It looks like kind of a, and you know, really cool accomplishment here. Um, but it looks like kind of an engineering piece together. You know, what they're doing is taking this 3D point cloud and then they say, okay, let's in essence take a lot of pictures of it and still use these kind of 2D vision models.

Right. And then at the end of the day you can say, Hey, where's the suit? It finds the picture with the suit from the 2D angle and then positions it within 3D space. So a cool engineering hack. And this might be the way people attack it, you know, if the vision models, the foundation models, are built on just 2D images.

You can still get a lot out of it and move to 3D in this way, but I think we are gonna see much bigger, just 3D data sets. You know, that's kinda the backlog here to get these models done, and GPUs as well are a backlog. So, cool engineering hack, but I'm not sure if this is, you know, the future of the way 3D models will be handled.

Farb: Yeah. Maybe not the final approach, but absolutely sci-fi style demonstrations that they did really. Really smart on their part to just show off the full Blade runner vibe of, uh, of the, what they were able to pull off here. Kudos and congrats to them and thanks to them for sharing it. Let's move on to our next super cool story.

This is another, you know, AI meets the physical world type of demonstration. It's called VIMA or VIMA. I'm not sure how they've opted to pronounce it. And uh, this is a really cool demonstration of how LLMs and robot arms can work together. And this was powerful. They've made the entire model open source, uh, including the simulator that you can use to work with it.

And you can give it text and images and videos or some mixture of them and it does all sorts of cool crazy stuff. Ethan, tell us some more.

Ethan: Yeah, so they put in a ton of work to do this. Um, not only in the simulator side, but building the whole data set, building a new benchmark. They put in an absolute ton of work to do this.

I think we're seeing, you know, as we've talked about before on the show, just the progression of the whole robotic space. You know, we've seen a robot saying, Hey, move to the left. Right? That's just a text prompt. Now we have these multimodal prompts, so they're adding a ton of new attention layers and saying, Hey, you can follow the attention of an image.

You can follow the attention of video frames itself. So you know, let me mimic you, right? Which is what, how babies learn. So let me show you a video of how this object is being moved from the left to right and put in the circle, and now the robot arm can do that. Hey, let me look at this image. Let me look at the text prompt and let me follow the video frames to try and mimic this.

So I think this one's actually extremely groundbreaking. They're gonna present it at ICML here Thursday. Like I said, a ton of work put into it and a really cool approach to actually giving robot arms more abilities.

Farb: Yeah. Conner, what'd you take?

Conner: Yeah. Technically how it works is very interesting.

They have the VIMA, the like VisuoMotor Attention, and essentially it's an encoder-decoder transformer model where, as you said, Ethan, the input is all this multimodal input of text, images, even videos, video frames, and out of that it can decode the entire movement of a robotic arm. And I think taking attention further and taking attention to do things like that is, is very impressive.

And as you said, we'll link the videos below, but all their videos, all their simulations, very good. And I agree, it's pretty groundbreaking. So natural.

Ethan: It feels like the right way to train these things. You know, the data sets people have been struggling with for so long, but hey, we have videos of someone throwing a towel.

We have videos of someone opening a fridge. I think this is really the pathway of how these things will learn.

Farb: I wonder if they've tried it with a real robot arm. You know, the simulated world and the real world are, hmm, not very close to each other. So I wonder if they'd get the same level of performance.

I mean, it should fundamentally work similarly, but does it catch the edge cases of, uh, you know, what the real world brings? And, you know, ultimately you'll have to do that in any model. You know, being able to pull things off in a simulated world sounds cool, but isn't really useful. Uh, dealing with the edge cases of the real world is maybe even a bigger challenge than the challenges that they, they've taken on.

But I think certainly this is a huge step in the direction of getting that done. You're not gonna solve it all at once.

Conner: My understanding is they might actually be showing it on a robotic arm at ICML.

Ethan: Cool. Yeah. In Hawaii on Thursday.

Send us a video, please. Yeah, definitely.

Farb: Send us a video. They said, they said come by, uh, the exhibit hall and say hi. So, uh, any AI Daily people at ICML, take some videos, go say hi to them and send us some videos. We'll, we'll post it on the next episode or whenever we get the video. Uh, great, great, great story.

Super cool to see two big, uh, physical world meets AI stories. And then, uh, our third story is about our friends over at StabilityAI launching FreeWilly1 and FreeWilly2. I don't, it's, it's, it seems like the only possibility, but it seems crazy that they basically did FreeWilly2, I don't know, like two days after Llama 2 dropped, and they just like, you know, switched out one LLM for another and ran everything, uh, all over again.

And one of the cool things that they're, they're showing here is that they're able to do this on a, on a, on a much smaller data set, which means, uh, faster, cheaper, and as they mentioned, lower carbon footprint. Uh, Conner, tell us some more about FreeWilly.

Conner: Yeah, it's based on the Orca paper. We covered the Orca paper from Microsoft a bit back. It's essentially, instead of training all these models on the output of GPT-4, like Alpaca or Vicuna do, it trains on an entire like chain-of-thought process from GPT-4. So out of Orca, we saw OpenOrca from another team. We saw Dolphin from Eric Hartford, and now we see FreeWilly, which is the like free open source version of it from StabilityAI.

It's not, um, it's not commercially available because it is of course trained off GPT-4 output, so it's only open for research, but it seems to be very capable. It seems on par with the other open implementations of Orca, and well done, StabilityAI.

Farb: Yeah. What, what, what's your read on this, Ethan?

Ethan: Yeah, I'm, I'm glad to see more trainings and progress in the space.

Um, you know, at the end of the day, FreeWilly1 was trained on the old Llama. It's not that good. And they're showing, Hey, we took a new, better foundation model on the same data set and it's better. Uh, you know, I'm glad people are training more things. I think it's interesting progress. It shows the power of Orca as a framework, you know, like Conner mentioned, we've talked about before, but I'm not sure who's gonna be using FreeWilly right now, but at the end, maybe you might.

Farb: I was kinda wondering the same thing. You know, it's easy to criticize people doing stuff here, so I don't wanna make it sound like, uh, I'm diminishing their efforts here at all. But I couldn't quite get my head wrapped around, you know, sort of what the point of it all was. Mm-hmm. You know, it's easy to kind of say research and well, okay, if somebody uses it, then I guess, you know, for research, then you've met your goal and you know, there's obviously someone's gonna use it.

I don't think it's gonna be zero people using it, but I couldn't quite figure out what my own angle of attack on grabbing this and putting it to use would be, right? There's, yeah, Llama 2. Why am I, why am I using FreeWilly? I dunno. What do you guys think?

Conner: I think they're just testing the whole instruction fine-tuning everything, because they are apparently announcing another like version of StableLM soon. So,

Farb: Probably, yeah. Maybe it's just them sharing their work and sharing what they're doing as they go. Uh, yeah. And you know, everyone's gotta make noise in this space and get attention and, and it's good to share what you're doing, uh, when it's out there.

Um, you don't need people like us criticizing every single thing you do.

Ethan: Absolutely.

Farb: Yeah. Very cool.

Uh, nice work StabilityAI. Keep it up. Don't stop. We need you. Uh, what are y'all seeing out there, Ethan?

Ethan: Um, I just, I've been seeing more and more tweets, you know, not to add to the firestorm, but of this true like GPU crunch, right?

Um, Suhail was talking about, Hey, you know, you pretty much need $10 million if you wanna start getting on the list at Nvidia and actually start getting GPUs. You know, people are over here predicting that, Hey, in the next six months, if you want 128 H100s, it's probably just not gonna happen. So, how that's gonna slow down actual startups getting access to some of these, getting new foundation models out there.

Um, you know, we might see a little dark period here in the next three months, just purely 'cause of logistics of GPUs. So I don't know if it'd be my prediction, but always interesting to keep up with how GPUs are actually getting in people's hands. Right.

Conner: I, I wouldn't be surprised

Farb: There's not gonna be enough GPUs and I don't know. There's no other way to put it. There's not gonna be enough of them. You need $10 million or this million dollars or that or the other. Buy them used or there's just not gonna be enough.

Ethan: Exactly.

Conner: I would, I wouldn't be surprised if we see like a dark ages like you said, but then out of that will probably come a renaissance where people start getting things to work on AMD and even like Intel CPUs and everything.

So wouldn't be surprised about that.

Farb: The crunch. People are getting a lot done on A100s too.

Ethan: Absolutely.

Farb: Right. Um, cool. Conner, what are you seeing?

Conner: Yeah, I saw OpenAI shut down their AI detection tool. They announced it back in February to essentially just detect, uh, text and images that they generate, and they're shutting it down now, 'cause pretty clearly it probably didn't really work. 'Cause as we know, it's kind of hard to detect AI generated.

Farb: Yeah, it's hard to reliably detect it, you know? Yeah, you can, you can get lucky, but you don't really know if you got detected, if you detected something or just hallucinated in the right direction, as they say.

A, a broken clock is right twice a day, so yeah. Doesn't mean a whole lot for that approach to clock building. Um, I saw a cool, uh, paper about, uh, psychiatry and AI, and, you know, I don't know if it was the most surprising or shocking information, uh, but it does, I think, underscore something important, which is to say, you know, the paper kinda shows that if you induce anxiety, you get a different response from an LLM than if you come at it neutrally or you try to induce happiness from it.

You're gonna get different outputs, uh, which I guess isn't really shocking or surprising. Uh, it is interesting that it seems to, you know, provide more anxious replies than humans do. Uh, but I think the more interesting thing to garner from this is, you know, we're moving into a world where a lot of people may start just understanding that what the AI says is some sort of fact or some sort of, you know, absolute reality, when a lot of what's going on is you're pushing it in the direction that you want it to go.

So you know, once people start accepting AIs as some sort of ground-level truth, you're gonna see everybody manipulating the results of the AI and be like, See, the AI, I said this and the AI said that the world is gonna end. Or it said that the world is not going to end. You know, you gotta be careful. You know, what you kind of put in is in some ways what you kind of get out. I don't know if you guys took a look at that paper or not.

Conner: It's kind of another example of how much like training data matters. Because yeah, your AI is probably more anxious than the average person, but that's probably just 'cause your average chronically online person is more anxious than the average person.

Farb: Yeah. And it's designed to, you know, please in a sense, you know, and give you the output that it thinks you're looking for. And so it's going to try to do more of what you ask of it. And, you know, humans have a, you know, whole host of filtering, um, going on in their head. And it, it's even, you know, asking people and self-reported information is not real science to be honest.

You can't net out all of the things that people are doing to filter in their heads and stuff to really understand if that's what even somebody really thinks, even if they respond by saying this is what they think. Um, you know, it's not an easy task to even figure out.

Ethan: It's pretty cool though, just thinking about how these models are anxious.

Like what a question we're asking ourselves. Right? And like they have data that kind of backs it up. Um, whether you think it's conscious or not, or anxious or not, it acts that way. So pretty cool.

Farb: Yeah, it behaves anxiously and behavior is ultimately, you know, more relevant to the world we live in than just what happens going on inside people's minds.

Exactly. Well, another exciting episode where we've solved all the world's problems. We thank you for joining us, uh, especially the 30% of you on average that get this far into our show. We'll see you on the next episode of AI Daily. Have a great day everybody.

Ethan: See you guys.

Authors
AI Daily