Welcome back to AI Daily. Here are three stories to close out your week. First, Meta's CM3leon introduces a transformer-based multimodal generative model for text and images, offering notable efficiency and versatility. Next, HyperDreamBooth speeds up personalization of text-to-image models, with much faster training and a significantly reduced model size. Finally, Animate-A-Story showcases retrieval-augmented video generation, an engineering hack that combines motion structure retrieval with structure-guided text-to-video synthesis to create high-quality videos.
Quick Points
1️⃣ Meta’s CM3leon
- Meta introduces CM3leon, a state-of-the-art multimodal generative model for text and images, based on transformers. 
- The model is highly efficient and supports instruction fine-tuning on text and images, high-quality image generation, and structure-guided editing. 
- It impresses with its ability to handle segmentation, accurately create objects in images, and even generate realistic hands and text on signs. Meta continues to push the boundaries of AI. 
2️⃣ HyperDreamBooth
- HyperDreamBooth introduces hypernetworks for fast and efficient personalization of text-to-image models. 
- The personalized model is about 10,000 times smaller than a DreamBooth model, and personalization takes about 20 seconds, making it highly accessible. 
- The pace of development in this space is remarkable, allowing for embedding the model in mobile devices and achieving impressive results. 
3️⃣ Animate-A-Story
- Animate-A-Story combines motion structure retrieval and structure-guided text-to-video synthesis to generate high-quality text-to-video results. 
- It addresses the challenge of spatial consistency in text-to-video generation by searching a database for similar videos and stylizing them. 
- While the initial motion generation is an engineering hack, the pipeline shows potential for quality text-to-video synthesis. 
Transcript:
Conner: Hello and welcome back to AI Daily. We have another three great stories for you guys today. I'm your host Conner, joined once again by Ethan and Farb. Starting off, first we have Meta's CM3leon, a more efficient state-of-the-art generative model for text and images. So this is transformer based instead of diffusion based, as we've seen for most models for a while.
So instead of being just text-to-image or just image-to-text, like we see with most models, this is multimodal, and it can do both and work both ways. And because of that, it's apparently five times smaller and a lot more efficient than all these existing models. And again, because of how it's architected, it can do some very fancy things like instruct fine-tune on text and images.
So you can give it a picture of a dog holding a stick and be like, what is the dog holding in its mouth? And the model will be like, a stick. It looks like it's very well architected and looks like it works very well. Ethan, I heard you had some opinions on it. I'd love to hear them.
Ethan: It is absolutely amazing.
If you look at the examples, they look fantastic. It's a multimodal model, so just like you said, you get the text in, you get the text out, so you get this kind of CLIP-esque interface. You can actually ask questions about images. You can generate these extremely high-quality images. They have structure-guided editing, so you can actually edit these images in a way that, you know, some tools have tried to put together manually, but this is built in at the model level.
Like you said, it's transformer based, and I'm honestly just blown away with the results they've been getting. I think this is above Midjourney level, just from the few examples I've been seeing so far.
Conner: Yeah, for a while we've had diffusion, and transformer models were kind of hard to make and hard to make well and hard to architect well, but this is made very simply and it looks to work very well. Farb,
what do you think of it? What do you think of the multimodal parts of it?
Farb: It's super impressive. You can do a lot of pretty nuanced and specific things. Uh, it handles segmentation. You can, you know, tell it to create an image where there's a bottle in this location of the image,
there's a bed in this location of the image. Uh, it can do that, so it understands the objects in the image. Uh, you can give it an input, it can do segmentation, and generate new things off of the segmentation. The resolution is really high. It's beyond impressive and, uh, you know, not surprising,
given the engineering talent and resources they have at Meta. Makes you wonder what else they have under the hood. But probably the most impressive thing is that it can make hands. Uh, that and text, that was the mind-blowing part.
Conner: Yeah, I know that they ask it to make hands in a certain position and it does it exactly, with the right number of fingers, five, or they ask it to put, uh, words or text on a sign.
And that works also. So very flawless, and knocked out of the park by Meta once again. Next up we have HyperDreamBooth. This is hypernetworks for fast personalization of text-to-image models. Of course, we've seen DreamBooth for a while. Um, we saw StyleDrop recently, and this apparently takes only 20 seconds to run, which is about 25 times faster than DreamBooth, and it's far more efficient, and it's a model that is about 10,000 times smaller than what you can get out of DreamBooth.
Farb, these are pretty huge advancements, especially considering you can train on only one image versus the 10 or 20 DreamBooth normally takes. But what do we think of these benefits?
Farb: I mean, to give a little bit of perspective to the 10,000 times smaller, uh, they get the model down to 120 kilobytes.
You know, you could put this on a floppy disk from 1985. Uh, to give you a little bit of context, it's pretty bonkers. They can process stuff in a couple of seconds, like you said. Uh, but, you know, 120 kilobytes of storage for a 30,000-variable, uh, model.
So, just mind bending.
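The figures quoted in this segment can be sanity-checked with a little arithmetic. The sketch below assumes each of the 30,000 variables is stored as a 32-bit float (4 bytes); the transcript does not state the precision, so that is an assumption. The DreamBooth size and training time are simply what the quoted "10,000 times smaller" and "25 times faster" ratios imply, not measured values.

```python
# Back-of-the-envelope check of the numbers quoted in the episode.
# Assumption (not stated in the episode): one variable = one 32-bit float.
BYTES_PER_FLOAT32 = 4

num_variables = 30_000                      # quoted in the episode
hyper_bytes = num_variables * BYTES_PER_FLOAT32
hyper_kb = hyper_bytes / 1000               # decimal kilobytes

size_ratio = 10_000                         # "10,000 times smaller"
implied_dreambooth_gb = hyper_bytes * size_ratio / 1e9

hyper_seconds = 20                          # "only 20 seconds to run"
speed_ratio = 25                            # "about 25 times faster"
implied_dreambooth_minutes = hyper_seconds * speed_ratio / 60

print(f"HyperDreamBooth weights: {hyper_kb:.0f} KB")
print(f"Implied DreamBooth model size: {implied_dreambooth_gb:.1f} GB")
print(f"Implied DreamBooth training time: {implied_dreambooth_minutes:.1f} minutes")
```

Under that assumption, 30,000 variables at 4 bytes each comes to exactly 120 KB, which matches the figure Farb quotes.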
Conner: Yeah, you can plug your floppy disks into your A100 GPUs and get your Stable Diffusion outputs. Most of 'em come with floppy drives these days. Yes, most of 'em do these days. Ethan, what do we think of it? The images look pretty good, in my opinion, even better than normal DreamBooth.
What do you think?
Ethan: Yeah, I haven't got to use it yet, but, you know, as you both know very well, when we deployed Namesake last year, we were dealing with petabytes of fine-tuned files. We were dealing with 10-minute training times. We had over 500 A100s going. So, you know, we've progressed from that age of DreamBooth, which was only eight months ago, and then we saw January, February, kind of March-ish,
you started to get these LoRAs, right? And LoRAs were a different type of, not truly fine-tuning, but of the sort, where you can generate a couple-megabyte files. The quality was definitely not as there. It was few-shot. So now you have HyperDreamBooth with one image. You can actually generate these things in 20 seconds, like y'all said, and with a way smaller file size. To me, you know, the architecture's really cool, but it just speaks to the pace of this development, you know, from where we were in November to where we are now.
This is a completely different ballgame. This is something that you can embed in an iPhone app and let it run. This is something you can embed on a Raspberry Pi in the middle of nowhere. This is not something you need huge GPU clusters to manage anymore. So the pace of this space is astounding.
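As a rough illustration of why LoRA files are so much smaller than a fully fine-tuned model: low-rank adaptation keeps a layer's original weight matrix frozen and learns two thin matrices whose product is added to it, so only those two matrices need to be stored. The sketch below just counts parameters for one layer. The layer width and rank are hypothetical values chosen for illustration, and this is not a description of HyperDreamBooth's exact method.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Parameters stored when the whole d_out x d_in weight matrix is fine-tuned."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters stored by a rank-`rank` update B @ A,
    where A is (rank x d_in) and B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical layer width and rank, for illustration only.
d_in = d_out = 768
rank = 4

full = full_finetune_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)
ratio = full / lora

print(f"full fine-tune: {full:,} parameters")
print(f"rank-{rank} LoRA: {lora:,} parameters ({ratio:.0f}x fewer)")
```

The saving grows as the rank shrinks or the layer widens, which is the general direction the smaller file sizes discussed here come from.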
Conner: Yeah, that's fair.
Farb: That's the powerful thing here. The images are very good, but I don't think they're state of the art. Uh, but, you know, for 10,000 times smaller storage size, that's a different state of the art. So really impressive.
Conner: Yes. Yeah. Between GGML and some other architecture improvements, you could definitely adapt this to run on an iPhone with the latest processor, and you can store the model on the iPhone directly.
You could get everything you get out of DreamBooth. Everything we do with Namesake could probably run directly in an iPhone app nowadays, which is insane. Really? Yep. Well, lastly, today we have Animate-A-Story, which is Storytelling with Retrieval-Augmented Video Generation. So of course, the hardest part about making these text-to-video models is the actual motion and the movement of the video itself.
And so they combine motion structure retrieval, where they basically vector search, or just search, a database of videos for the motion of the video itself. And then they adapt that with structure-guided text-to-video synthesis, where they basically adapt each frame, with some other fancy mathematics and computer science, and adapt the frame of someone doing yoga to a frame of a teddy bear doing yoga. And the videos look a lot better.
This is probably the best I've seen from any text-to-video model. Ethan, you've looked at a lot of text-to-video. What do you think of this?
Ethan: Yeah, I think right now we're at the, you know, interesting engineering hacks age of this, and this one is a very strong one. You know, if you've ever looked at text-to-video, or messed with it, spatial and temporal consistency is extremely hard.
Temporal is one thing, but spatial is just extremely hard. How do you get consistency across frames? Right. And they pretty much added a really interesting thing, which was, hey, we can stylize videos. Right? We've seen a lot of this with Runway or other models, saying, hey, I'm gonna take a video of myself,
I'll use a ControlNet, or I'll stylize myself into Spider-Man or a bunny. Right? But when you're going text to video, it's like, okay, we're starting with no video. Why don't we just search a database of videos that might be similar and then stylize them? So really cool. Kind of engineering hack-esque. Uh, the videos were actually not too bad, so this could be a potential pathway to actually high-quality text-to-video.
I'm not extremely confident in it, but I think it might be part of a pipeline.
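The retrieval step described here, searching a database for a video whose motion matches the prompt, can be sketched as a nearest-neighbor lookup over embedding vectors. This is a minimal toy sketch, not the paper's implementation: the file names, the three-dimensional embeddings, and the `retrieve_clips` helper are all hypothetical, and a real system would get its vectors from a learned text or video encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_clips(query_embedding, clip_index, top_k=1):
    """Return the names of the top_k clips most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, embedding), name)
        for name, embedding in clip_index.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]


# Hypothetical index: clip name -> made-up embedding of its caption.
CLIP_INDEX = {
    "person_doing_yoga.mp4": [0.9, 0.1, 0.0],
    "dog_running_on_beach.mp4": [0.1, 0.9, 0.1],
    "car_driving_at_night.mp4": [0.0, 0.2, 0.9],
}

# Pretend embedding of the prompt "a teddy bear doing yoga".
query = [0.8, 0.2, 0.1]
best = retrieve_clips(query, CLIP_INDEX, top_k=1)
print(best)
```

In the pipeline as the hosts describe it, the retrieved clip supplies only the motion; a separate structure-guided synthesis step then restyles each frame to match the prompt.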
Conner: I think the second part of the pipeline, the whole final video synthesis, that stuff's getting a lot better, but the initial motion generation, I think you're right, it is kind of an engineering hack. Farb, what do you think?
How are we gonna see videos? How are we gonna see motion in videos in the future?
Farb: You know, reading the abstract of this paper is not really going to sell it. I think, like, you really gotta read the paper to understand that they've kind of pipelined a lot of things together
here. What they're talking about is going from a script to a video, which is: take a script, break the script down. Yep. You know, they don't talk too much about that, because that's not the video part, but break the script down into different plot pieces. Uh, take those, uh, generate prompts from the script.
Find videos that match those prompts, and then, you know, create the new output based off of that video. They've strung together a lot of things here. It's a massive team they've got here. It looks like there's like 10 people here, uh, 10 folks from Tencent.
And, uh, you know, it's pretty ambitious. It would be impossible to expect them to knock this out of the park. You'd literally be talking about replacing Hollywood with one paper, if that was, you know, really the case here. So it's a pretty cool pipeline. Impressive to see. Uh, nice to see them being audacious about it too.
And, and, and going for it. Like, we're gonna go from script to video. That's not, that's not a small challenge.
Conner: Absolutely not. No. Well, that was a very good paper and I think well done by the team at Tencent. So those are the three stories today. What have you guys seen? What have you guys been reading?
Farb: Uh, I have downloaded iOS 17 public beta, amazingly enough.
I actually think my phone is faster on it than it was on iOS 16, which is super, super impressive. And I'm trying to, uh, train my voice on it. You have to read 150 sentences. The training process for your voice is actually pretty cool. They made it like a little product, a little app, uh, where it keeps presenting sentences.
You read the sentences and it moves on to the next one. You don't have to hit any buttons. You just sit there and read the sentence. It knows when you said it, and it moves on to the next sentence. I think I'm about 30 sentences into the 150 sentences I have to read, so hopefully I'll be able to demo that. I don't
know if I'll need somebody else. They may not even have the part where you can, uh, actually use it done yet. Uh, I wouldn't be surprised if they did, because otherwise why would they do the training piece? Uh, don't know if I'll need somebody else on iOS 17, in which case I'll have to steal one of your guys' phones and force upgrade it to the public beta, uh, against your will.
But we'll cover that then.
Conner: It's nice to see that the training time, the amount of data that you need, has gone down. I remember during COVID I needed like 17 hours to train the text-to-speech model of my voice.
Farb: So I think it's 15 minutes here or something like that. They say, but it doesn't really make sense.
Yeah. Yeah. Maybe it's about 15 minutes, 150 sentences you have to read. Yeah.
Conner: It's a pretty nice jump.
Ethan: Yeah. I remember when you plopped down in the room and you sat there for eight hours reading thousands of sentences for us to make that WaveNet.
Conner: Yep. Exciting times.
Farb: Eight hours is a big difference.
Ethan: Yeah. I saw a really cool kind of a blog post, uh, from the CEO of Anthropic around what does a new Turing test look like, right? So we've kind of been basing everything on the Turing test so far. A Turing test says, hey, an AI is intelligent if it can convince a human that it is not an AI and it can speak language, right?
So there's a few different adaptations, but in general that. And we've pretty much, you know, for all intents and purposes crossed a lot of that, I think people are realizing, and we're moving the bar on where AI is. So they've kind of come up with a new term that I really like, artificially capably intelligent.
So it's what does AI do in the world? Not just what it feels like or seems like, but what can it do in the world? So this new Turing test says, hey, we're gonna lay it out and say, when an AI can make a million dollars online from a hundred-thousand-dollar investment, well, now we're at ACI, artificially capable intelligence.
So I think it's really cool, and it has so many different pathways to it. You know, this Cicero paper from Meta always interests me, as y'all both know. You know, how can you plan and strategize and negotiate while using LLMs and actually make these artificially intelligent systems? So, really cool post.
I just, I like seeing the bar of the Turing test continue to move, you know?
Conner: Yeah. If an AI can make a million dollars selling paperclips, I think that'd be a very good benchmark. I completely agree. Yeah. Yeah. I'd be impressed if a person could do it. I saw example-based motion synthesis by generative motion matching, a little bit similar to the teddy bear paper, Animate-A-Story.
Um, kind of a funny video I saw. You can ask it to dance and it can basically move a 3D model in a dance move by searching for similar dance moves. I didn't completely dig into the paper or anything, but it's a pretty funny video, a pretty funny example on Hugging Face. We'll link that below.
Very cool. Well, thank you guys once again for another great, another great week, another great episode. See you guys next week. See you guys. Peace guys.



