Welcome to AI Daily! In this episode, we dive into three extraordinary and useful stories. First up, Maintaining Localized Image Variation - the groundbreaking paper that unveils a new way to edit shape variations within text-to-image diffusion models. Next, ScaleAI LLM Engine - ScaleAI has open-sourced a game-changing package for fine-tuning, inference, and training language models. Last but not least, SHOW-1 - the solution to the "slot machine problem" in video generation, where randomness prevails.
1️⃣ Maintaining Localized Image Variations
Discover a groundbreaking paper on maintaining localized image variation in text-to-image diffusion models, enabling precise object editing.
A practical and intelligent engineering solution that offers CGI-level control without the labor-intensive process, making it highly useful.
Impressive implementation with a Hugging Face demo showcasing effective object preservation and image transformations for stunning results.
2️⃣ ScaleAI LLM Engine
ScaleAI revolutionizes language model development by open-sourcing LLM Engine, allowing easy fine-tuning, inference, and training.
Their move showcases commitment to staying at the forefront of AI development and provides practical, useful tools for developers.
The open-source community benefits from ScaleAI's meaningful contribution, offering a powerful project that scales effortlessly with Kubernetes.
3️⃣ SHOW-1
Introducing SHOW-1, a show runner agent that tackles the challenge of creating consistent animated shows using image and video models.
Aiming to solve the "slot machine problem," SHOW-1 combines prompt engineering and consistent frame sets to generate coherent and engaging video content.
Impressive engineering and clean outputs make SHOW-1 stand out, offering videos that resemble popular shows like South Park in appearance and sound. Ambitious and promising for future iterations.
Ethan: Good morning. Welcome to AI Daily. We had a huge show for you yesterday and we are continuing today with three amazing, impactful stories that are actually really useful. So our first one today is maintaining localized image variation. So if you've used text to image models, you understand, hey, we're gonna generate, you know, coffee table with a mug.
Right? But then you want to edit that coffee table, and it's very difficult to. So, similar to Segment Anything and picking out objects from within an image to be able to edit, this paper really gets to show that, hey, we can edit shape variations within these text-to-image diffusion models.
So they take a really interesting approach to it. They have some really cool examples. Conner, you wanna dive into more of kind of what they're doing?
Conner: It's very hard to accurately regenerate these images and just change one part of it, or just keep one part of it while the rest of the image is changing.
So they've given some pretty good examples. Like, let's say you have a dog sitting on a chair, and you really like the dog. You don't really like the chair anymore. Before this paper came out, it was hard to vary the chair to a bed, to a raft in the ocean, to anything else while keeping that dog the same,
looking the same. Um, but they successfully did that in this paper, Localizing Object-Level Shape Variations. And they did it in a couple ways. They combined a couple technologies. The first was mix-and-match prompt mixing, where throughout the denoising stage they figured out they could split up the denoising of Stable Diffusion into stage one being layout, stage two being shapes, and stage three being details.
So they were able to keep stage one and stage three for, like, the details and the layout of the dog on a chair, but then for the actual shape of the chair, they would change what part of the prompt it's looking at. And once again, they did something similar with the self-attention, where they can preserve the object in the image by injecting the self-attention of just that object and letting the rest of the image generate with a new self-attention, which lets them do these kinds of really beautiful variations.
It's a very interesting, very well done paper that was architected very well.
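The stage split Conner describes can be sketched as a per-step prompt schedule. This is a minimal illustration of the idea as stated in the episode, not the paper's implementation: the function name `build_prompt_schedule`, the step boundaries, and the example prompts are all assumptions made for the sketch, and the self-attention injection is left out.

```python
def build_prompt_schedule(num_steps, base_prompt, variant_prompt,
                          shape_start, shape_end):
    """Return one prompt per denoising step.

    Hypothetical helper: steps before `shape_start` (layout) and from
    `shape_end` onward (details) use the base prompt, while the steps in
    between (shape) use the variant prompt. The boundaries are assumed
    values, not numbers taken from the paper.
    """
    if not 0 <= shape_start <= shape_end <= num_steps:
        raise ValueError("expected 0 <= shape_start <= shape_end <= num_steps")
    return [
        variant_prompt if shape_start <= step < shape_end else base_prompt
        for step in range(num_steps)
    ]


# Example: 10 denoising steps, with steps 2..5 treated as the shape stage.
schedule = build_prompt_schedule(
    num_steps=10,
    base_prompt="a dog sitting on a chair",
    variant_prompt="a dog sitting on a bed",
    shape_start=2,
    shape_end=6,
)
for step, prompt in enumerate(schedule):
    print(step, prompt)
```

A sampling loop would then condition each denoising step on `schedule[step]`, so only the middle steps see the changed object word.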
Ethan: Farb, you mess with text-to-image models a lot. How useful is this? Like, would you use this a lot? Do you think, if someone embedded this in their actual tool, is this useful?
Farb: I think that's kinda the point of this, is that it is useful.
Everything that they're doing, you could do with, you know, CGI if you wanted to. The problem with CGI is that it's incredibly labor intensive. This is another great example of a very intelligent engineering solution: building a pipeline that actually gets you to an output, uh, in a way that is a compromise between the endless possibilities of doing CGI and the, you know, maybe easier but less effective way of using a GAN, for example, to do this, where it's going to
be a little bit more difficult to control the actual output of what you're getting, right? So they sort of found this very practical compromise that sits between these two approaches. A lot easier than doing CGI, but giving you almost CGI-level control on the output. Uh, I found it really impressive.
Ethan: Yeah, it's super cool.
Conner: I was gonna add, they have a Hugging Face, uh, demo available. You can go in, you can upload an image and say, keep the oranges the same, change the rest of the image.
Ethan: So yeah, it's one of the most effective ones I've seen, you know, more effective than inpainting, et cetera.
Cause it actually maintains object state. So really cool implementation. Um, our second one is ScaleAI LLM Engine. So they've pretty much open sourced a way for you to fine tune, inference and train a lot of these language models. You know, we've seen a bunch of kind of companies provide wrappers to make it easier for customers to do this, make it easier for enterprise to do it.
And now we have an amazing open source package for this that works with Falcon, that works with Llama, that works with MPT. Pretty huge, actually, for anyone who wants to fine-tune these models. Like, you've just cut out two, three, even a month of work, and you've removed their need for kind of an external provider.
So really cool work here. Farb, did you get to dive into it? What does, you know, what does stuff like this mean for people developing?
Farb: I mean, this is a great move by Scale. They are showing that they are able to, uh, join this rapid pace of development. They're not falling behind.
They're trying to, you know, stay at the tip of the spear of people's mind share of the folks that are doing big things in AI. Practically speaking, you know, this is very useful for people, so they're not just, you know, throwing out stuff that, you know, might seem impressive but isn't actually useful for anybody.
Uh, and I found that pretty impressive. Hopefully the, uh, my landscaper is not blowing out my audio here. Um, and, uh, yeah, I found it super impressive. I think people are gonna start using it. We'll see if people start using it or not. Uh, and, you know, if so, like I said, I wouldn't be surprised, and, um, we're gonna see a lot more from Scale.
Ethan: Conner, we've manually kind of built a lot of these things for months on end, trying to get things, you know, set up, et cetera. When open source comes out, of course, you get everyone rallied around one way to do it, and you accelerate a lot of the pace. How do you think this compares to something like Mosaic's Foundry? Did you dive into the code? Would you use something like what Scale released?
Conner: Yeah, I would definitely use it. Um, how exactly it compares to what Mosaic offers or what Hugging Face offers, with very similar capabilities, I think that's up for people to explore, and the open source community will definitely delve into that more.
But as always, having many different projects all working on the same thing in different ways helps accelerate all the projects at the same time. It's the same thing we're seeing with LMQL, LangChain, and Guidance: all trying to tackle the same problems in different ways helps. Yeah, fine-tuning, inference, and training open source projects
will also be able to accelerate in the exact same way. So yeah, this is completely open source, Apache 2.0 license, just like the rest of 'em. So I think it's just another great addition to the community, really.
Ethan: It's, it's good.
Farb: A powerhouse, and it's good to see them contributing to open source in really meaningful ways.
Not light, you know, little contributions here and there, the occasional paper drop. This is meaningful stuff from a meaningful company and, uh, I'd say it does a lot to show their importance in the industry.
Ethan: Yeah, it's good when a big company drops it, cuz you can all coordinate around the same one and you don't have devs rebuilding fine-tuning infrastructure every single month.
Conner: Yeah. The problem is a lot of those other smaller projects that aren't the main ones, they'll release something kind of cool and kind of useful, but they don't really work at scale. Scale literally has produced a very capable project that you can easily host yourself with Kubernetes, et cetera. Yes.
Ethan: Absolutely. Let's move on to our third story, which is SHOW-1. So SHOW-1 is a show runner agent. So how can you make a show that is consistent, right? How can you make a new animated show? How do we use these image models and video models to make something that's consistent? A lot of these models right now are kind of this, what they call the slot machine problem.
So you're just like, well, that's a random video. That's a random video, and you just continue to go and go until you hopefully like the video. So the engineering work around this of saying, Hey, there's a lot of people who have put work into prompt engineering to generate something of a story that makes sense.
There's a lot of people that have put work into making a consistent frame set, and they had a cool engineering solution that seems to piece it all together. Conner, was there any, like, main innovation you saw out of this, or just a really nice clean structure of kind of what video making could look like?
Conner: I think honestly it was just a very clean structure and especially very clean outputs of what you can actually get. Most demos we've seen so far are very research intensive and have great possibility, but in the end aren't something you will actually watch. With this, they put videos on their site, they put videos on Twitter that look like real South Park episodes.
You watch them and I'm like, this is maybe not the quality of, like, a script or, like, dialogue or audio of a South Park episode or whatever, but it looks like one, and it sounds like one to someone who hasn't seen the show before. So they engineered it very well.
Ethan: Yeah. It just seemed to be pieced together.
Interesting. Farb, was there anything that stood out to you that you're like, wow, I haven't really seen someone do this or this, or maybe something interesting in the paper, or just really kind of a clean way to stitch everything?
Farb: It's incredibly ambitious. You're not gonna get a, you know, zero shot television show, uh, today.
Uh, it, it's probably a while before that. I, I applaud them for their ambition, for what they're going for here, which is, you know, pretty comprehensive, uh, output. So if this is their first, you know, version of it, then I'm pretty bullish on what their subsequent versions are gonna be like. Kudos to them.
Ethan: Keep it up. Absolutely. Well, as always, what else are y'all seeing? Farb?
Farb: I found out that you can access GPT-4 32k on Poe, and that's pretty flipping cool. You know, it's, uh, you gotta be a subscriber, you gotta pay. But the fact that you can get your hands on it, I mean, I feel like nobody's really talking about this, and most people, I think, don't even think you can use GPT-4 32k, uh, anywhere right now.
But seems like if you get yourself up on Poe, you'll have it right in the iPhone app.
Conner: It's pretty awesome. Is that, is that new? Is that like, is this like today you're the first user of it or what?
Farb: You know, it's kind of, you know, until you start using it, it's buried under the more section of more models.
Right. So don't know when it hit, to be honest. But, you know, I think once you start using it, it kind of comes up into the, the top of your experience. Uh, but I don't remember hearing anything about it.
Ethan: Wow. Well, if OpenAI didn't give it to you on the playground, you better hurry before rate limits hit you.
But super cool. Yeah, Conner?
Conner: Yeah, I saw Perplexity of course, we're always big fans. Perplexity their whole copilot, search everything. Very well built platform. But yeah, they made a demo with Llama 7B chat and it's very fast. It's like ly fast. Like as soon as you hit enter, boom, spits out an entire massive paragraph.
Um, but part of that, I did notice Llama, like, maybe it's just a fine-tuning of the chat, maybe it's just this implementation by Perplexity, but Llama, like, rambles on a lot, and even if you tell it to talk less, it'll continue doing that. So it's interesting.
Ethan: Super cool. Yeah, I saw a, a really cool tweet from Justin Alvey.
He jailbroke a Google Home Mini and pretty much was able to connect it to the cloud, put their own LLM on it, put a new voice on it, and it kind of paints you the picture: you have these big companies shipping Siri, you have this big company shipping Google Assistant. They're all so monotone. They're all the same.
What does the world look like if people have differentiated versions? So really cool hack he put together. I think he replaced the whole PCB as well.
Farb: I was gonna say it's a little bit more than a jailbreak there.
Ethan: Exactly. Replaced PCB. Yeah. Yeah. He put some real work into it and got a well-deserved viral tweet out of it.
Um, so super cool to him. We'll link it below, check it out. But as always, thank you guys for tuning into AI Daily and we will see you again tomorrow.