Meta's OpenCatalyst | RT-2 Speaking Robot

Meta's OpenCatalyst | RT-2 Speaking Robot | Adversarial Prompts

AI Daily | 7.28.23

AI Daily

Jul 29, 2023

In this episode of AI Daily, your hosts Conner, Ethan, and Farb kick things off with Meta's OpenCatalyst, a model developed with Carnegie Mellon University that simulates over a hundred million catalyst combinations, accelerating advancements in material science and renewable energy. They then explore Google DeepMind's RT-2 Speaking Robot, a vision-language-action model that learns from web images and text to perform real-world actions, promising a new era of autonomous robotics. Finally, they delve into Adversarial Prompts, discussing a recent study by a team at Carnegie Mellon that used LLaMA to generate prompts that are adversarial to popular models like GPT-4, raising important questions about the robustness and safety of these models.

Quick Points:

1️⃣ Meta’s OpenCatalyst

Meta and Carnegie Mellon University develop OpenCatalyst, simulating 100+ million catalyst combinations.
This tool enables rapid simulations, enhancing chemical process research.
It is highly applicable to renewable energy and material sciences.

2️⃣ RT-2 Speaking Robot

Google DeepMind unveils the RT-2 Speaking Robot, a vision-language-action model.
Trained on web images and texts, it can perform untrained real-world actions.
This model represents a significant leap in the realm of autonomous robotics.

3️⃣ Adversarial Prompts

A Carnegie Mellon team uses LLaMA to generate adversarial prompts against leading models.
This discovery exposes potential weaknesses in popular AI models like GPT-4.
Raises important questions about AI model robustness and safety.

Transcript:

Conner: Hello and welcome to another episode of AI Daily. I'm your host Conner. Joined once again by Ethan & Farb. Today we have another three great stories starting with Meta's OpenCatalyst, and then Google DeepMind's RT-2 Speaking Robot, and then some pretty interesting new adversarial prompts. So first up, we have Meta's OpenCatalyst, where Meta and Carnegie Mellon University made a new model that can simulate over a hundred million catalyst combinations.

So it can basically simulate, yeah, any combination of two types of catalyst materials and find that kind of output that traditionally was not possible to simulate in a way that was fast or quick or easy. This is very similar to something like AlphaFold that we saw from Google that predicts protein structures, but this is now for predicting, like, chemical catalyst reactions, so very applicable to renewable energy and really any kind of chemical processes.

Ethan, what do you think about this?

Ethan: It's amazing. I, I think you, you nailed it in the comparison to AlphaFold. You know, everything is upstream of materials, right? Every single innovation we have comes from a new material. And right now a lot of it is playing, uh, you know, a seedy game of roulette in which you're trying to guess what catalyst

And what combinations of materials will create what we need. So these types of simulations, which were just not possible before ai, you know, simulating the exact kind of reactivity between everything going on is really just not possible before some of these new tools that are out. So being able to simulate all these things, having the tools available to researchers, I think we're at a Renaissance period of.

You know, genomics of material science and these are at the groundbreaking fold of them. So of course I have not got to try it or anything of the sorts, but it looks absolutely amazing from what they've demoed and I think a lot of people are excited for it.

Conner: Farb, we talked about LK-99 a couple days ago.

What do these kind of advancements in material sciences and AI helping us figure out material sciences, what does that bring us for the future of materials?

Farb: Well, the thing that's going on here, it's, it's, it's super cool and, and in some ways it's relatively straightforward. What you're doing is taking a catalyst, uh, and an adsorbate, and trying to find the correct configuration, or as they call them, relaxations, iterating over relaxations until you find the sort of minimal energy state of the system, which should provide you the most stable configuration of the system.

That in the real world is actually stable and usable. So, you know, it's doing this stuff at, at a rate that's far faster than you could obviously, uh, sit there and, and calculate and simulate one at a time. Uh, and you know, like the, the folks at, I think Open Catalyst is the, if I remember it's correctly what it's called, uh, said, you know, you should check this stuff out against the real world.

You know, this thing is just going to, you know, potentially give you some directions to try out. Um, but you'll have to test it against a, a real world situation to see if that's actually a, a, a stable state. You're trying to get to a stable state of, you know, catalyst meets adsorbate.
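Editor's note: the relaxation idea Farb describes (adjust the configuration step by step until the energy stops decreasing) can be sketched with a toy example. This is a minimal illustration, not the Open Catalyst method: it assumes a single adsorbate atom at height `z` above a surface, a Lennard-Jones-style energy, and plain gradient descent. All function names and parameter values are hypothetical.

```python
# Toy relaxation: move one atom along z until the force on it is ~zero.
# Assumptions (not from the episode): Lennard-Jones energy, eps = sigma = 1,
# fixed-step gradient descent.

def energy(z, eps=1.0, sigma=1.0):
    """Lennard-Jones-style energy of the atom at height z."""
    s6 = (sigma / z) ** 6
    return 4.0 * eps * (s6 * s6 - s6)

def force(z, eps=1.0, sigma=1.0):
    """Force on the atom, F = -dE/dz."""
    s6 = (sigma / z) ** 6
    return 24.0 * eps * (2.0 * s6 * s6 - s6) / z

def relax(z0, step=0.01, fmax=1e-6, max_iters=10000):
    """Follow the force downhill until its magnitude drops below fmax."""
    z = z0
    for i in range(max_iters):
        f = force(z)
        if abs(f) < fmax:
            return z, i
        z += step * f  # step in the direction that lowers the energy
    return z, max_iters

# Relax from a few starting heights and keep the lowest-energy result.
results = [relax(z0)[0] for z0 in (1.3, 1.5, 2.0)]
best = min(results, key=energy)
print(f"relaxed height = {best:.4f}, energy = {energy(best):.4f}")
```

In this toy potential every start lands in the same minimum. With real catalyst and adsorbate systems there are many atoms and many local minima, which is why, as Farb says, the predicted stable state still has to be checked against the real world.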

Conner: Yeah. Very exciting model though.

'cause of course, as you said, like you have to test it physically in the end, but instead of testing every single one physically, you see the angle you want to go for, test that physically and then maybe make tweaks in the physical world from there.

Farb: Elon himself chimed in on these tweets and said, this is a really interesting, or, uh, has strong, strong potential or something.

Something to that effect. So, uh, you know, he's definitely thinking and dealing with this stuff with just about every startup he has.

Conner: Well, maybe Elon does like some stuff Meta makes then. Yeah. And next up we have Google DeepMind's RT-2 Speaking Robot. It's a first-of-its-kind vision-language-action model.

So, of course it's trained mostly on web images and web texts, like most of these image and text models. And then also trained a little bit on actual robotic actions, so then it can look at the real world. You can give it. Directions. You can see what's in its environment and then it can act in new ways that it was not trained on.

You can tell it, Hey, pick up this trash and throw it away, even if it's never picked up trash before. Even if it has never been told to throw away trash before. It can use its knowledge of the web, use, its knowledge of images and text and use its knowledge of how to move its own arm and follow those directions very well.

So it's very similar to what we've seen so far in. Like complex engineering stacks of multiple, multiple models being used together to achieve these results. But now it's a single model that is integrated and is a foundation model for robotic actions. So Ethan, we've talked a lot about how there's engineering hacks and then.

Final outputs of like real foundational models. How does that make a difference here?

Ethan: Yeah, this one's actually a complete model. Trained on it. You know, they had a good description saying like, Hey, you know, some of the other alternatives and way people are approaching it is pretty much as if you had to think of something in your mind.

And then go describe to your body how to do it. It doesn't have that natural flow, right? So them combining it all into a single multimodal model is fascinating. You know, in the DeepMind, uh, article here, the coolest part I saw was they have a cable and a rock and a piece of paper on a table, and they say, Hey, I need to hammer a nail, which object from the scene is most useful, and it picks up the rock.

So this kind of chain of thought reasoning, they've embedded into it as well as the entire like multimodal foundation model itself. Enables these things that are, you know, kind of were super hard before but are kind of common sense to people to do, which is, hey, a rock's probably gonna hammer a nail, or this object is probably best for this use case.

So we're leaving the era of having a program, grab that rock, hit this nail, and here's your task. And to these actual robots that can reason, and I'm super pumped. This is honestly, I think the best application of it I've seen so far better than some of the engineering piece togethers.

Conner: Farb, what'd you think of this?

Farb: Yeah, I think this is the first big step in a new direction here. They've built a single model, you know, I think they call it, uh, VLA, vision-language-action. Uh, they've tokenized the actions, uh, you know, sort of moving from the vision-language model world to the vision-language-action world. Uh, this is probably sort of a, a seminal move here in the space.

Uh, I think they said they found something to the effect of about 90% accuracy in simulations, uh, which is pretty crazy. Probably not something you want on a factory floor or in a nuclear reactor, but, you know, it's a, it's a huge step forward from a lot of the other things, and I think it's twice as good or so.

Conner: Uh, Compared to their previous version, I think it went from 30% to 60%.

Farb: So yeah, 60%. Yeah. Huge, huge leap there. Uh, and, uh, it's, you know, you you, you look at all their examples and you can just see how this can intuitively, uh, work. So, you know, some of these things somewhat have to cross that barrier. Like, okay, this thing just kind of makes sense if it's trying to.

You mimic human behavior. It seems on, you know, seems likely that it should make some intuitive sense to humans that are taking a look at it. And it sort of passes that sniff test. I thought it was a, a huge paper and they took it pretty seriously. They made a great, uh, couple of blogs des describing it all with great animations and, uh, super impressive to see.
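Editor's note: Farb's point that the model has "tokenized the actions" can be illustrated with a small sketch. The idea is to turn each continuous action value (for example, an arm displacement) into one of a fixed number of integer tokens, so a language-style model can emit actions the same way it emits words. The bin count of 256, the value range of -1 to 1, and the function names below are assumptions for illustration, not details confirmed in the episode.

```python
# Sketch of action tokenization by uniform binning.
# Assumptions: each action dimension lies in [low, high] and is split
# into `bins` equal-width buckets; token ids are the bucket indices.

def tokenize_action(action, low=-1.0, high=1.0, bins=256):
    """Map each continuous action value to an integer token in [0, bins - 1]."""
    tokens = []
    for x in action:
        x = min(max(x, low), high)            # clamp into the valid range
        frac = (x - low) / (high - low)       # position in [0, 1]
        tokens.append(min(int(frac * bins), bins - 1))
    return tokens

def detokenize_action(tokens, low=-1.0, high=1.0, bins=256):
    """Map tokens back to the center of their bucket."""
    width = (high - low) / bins
    return [low + (t + 0.5) * width for t in tokens]

# A hypothetical 3-dimensional arm command.
command = [-1.0, 0.0, 1.0]
tokens = tokenize_action(command)
print(tokens)                      # [0, 128, 255]
print(detokenize_action(tokens))   # values within half a bucket of the input
```

The round trip loses at most half a bucket width per dimension, which is the trade-off for letting one model produce text and motor commands from a single vocabulary.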

Conner: Yeah, they put a lot of great story into it of why it's important and as you just said, like how the thinking of it is more similar to how a human thinks and why that's important. So definitely recommend reading it. Uh, and then lastly, today we have adversarial prompts. Um, a team at Carnegie Mellon and a few other universities worked together to make Llama 2 generate adversarial prompts that work against ChatGPT, that work against Bard, that work against Claude. Farb, you do a lot of prompting. What do you think about this?

Farb: This is, uh, this is concerning to say the least. Uh, yeah, I mean, the. The, the prompt, you know, injections, the, the, the suffixes that they showed in their paper. OpenAI has already plugged

Conner: those up, but

Farb: there's a whole bunch of other ones that aren't in the paper.

Uh, they're probably trying to recreate this paper at OpenAI in every other place so that they can figure out what holes to plug up. Um, it's almost.

Farb: This is, this is a pretty serious hole I think they found here. And, uh, it's not going to necessarily be easy to stop it and it might be easy to replicate it.

Conner: So I, I believe the code's actually open source. So I think people were already at, already out there using the code to generate suffixes that weren't in the paper, and then use those to make the same attack.

Farb: This is classic arms race stuff. Uh, so I don't know. We gotta, you're gonna, we're gonna have to keep our eyes on this one.

This is, this is not the best news.

Conner: Mm. It, it does technically violate the terms of service of Llama, 'cause Llama's terms of service says you can't use it to improve other models, which technically this is what it's doing.

Farb: Tell that to some nefarious state actors. I'm sure they're super concerned about the ToS. Exactly.

I'm sure the guy trying to, you know, do shady stuff in his, in his cave is really concerned about the lawyers coming at him, uh, on his ToS violation. Uh, this is, uh, you know, that stuff is irrelevant to anybody who was gonna do something nefarious with this anyways. We'd be, I'd be lying if I thought I, I, you know, if I was saying this was good news, this is bad news.

This, this, this is a major problem and it needs to get fixed.

Conner: Tony Stark build adversarial prompts in a cave. Yeah, exactly. Ethan, any thoughts?

Ethan: Um, I just think it's cool that, you know, these are not, it's not like, Hey, we broke Llama. Right? It's like, hey, every single transformer-based one we've tried, GPT-4, ChatGPT, Claude, Llama, like they're all the same and these adversarial suffixes work.

And we don't have to go manually make prompts and test 'em ourselves. Like there's a formula to how to break these things. Um, so yeah, we'll see if it gets plugged, it will. And then it's kind of like cybersecurity. It's an endless game of whack-A-mole.

Farb: Yeah. It's an arms race between both sides. And I think these folks.

You know, probably did the right thing about being transparent about it. Absolutely.

Conner: There, there's a lot of, there's a lot of examples of like individual tokens from like weird Reddit users that can break every single model. And this is very similar because of course all these models are essentially all trained on Common Crawl.

So some weird tokens from Common Crawl combined together. You break your promises.

Farb: This, this is the classic take something someone on do doing is, uh, take something someone on Reddit is doing and scale it.

Conner: Yeah. Well, those were three stories today. Crazy as always. What have you guys been seeing, Farb?

Farb: Uh, I saw something earlier, but I forgot what it is, so I'm just gonna skip for today.

Spare you people.

Conner: Ethan?

Ethan: Uh, I've been doing some advising for AI on like public health, and I think, you know, we always speak about this on different episodes, but the speed at which enterprises and even governments are bringing AI into the fold is fascinating, you know? Throwing together a rapid group and actually putting together pilots out there and actually trying to fix some of these workforce issues and you know, a cloud computing or mobile wave took them 10 years to try to implement.

They're actually moving really fast on this, so just exciting to see kind of state of the world.

Conner: Well, to match your exciting news, I bring ElevenLabs, having new voices. Nice. Yeah, I saw that. That's so cool. Yeah, some ASMR stuff, some audiobook stuff, some video game stuff. So if, if you are waiting to publish your ASMR novels, ElevenLabs has got you.

Farb: You know, to, to add to that, I think I saw a Martin Reley tweet where he was talking about, uh, training on a large corpus of natural speech to try and make something, uh, better than TTS. Um, not entirely. He also posted a picture of him like himself in the New York subway train. I don't know what his, what he was, what he was doing here, but uh, that was a little, that was semi-interesting.

Conner: Well, wonderful as always, thank you guys for tuning in. We will see everyone next week. See you guys.