Welcome to AI Daily! Join hosts Conner, Ethan, and Farb for another exciting episode packed with cutting-edge AI advancements. In this episode, we dive into SDXL 0.9 by Stability AI, explore Google's AudioPaLM, and discuss the latest release of Midjourney 5.2.
1️⃣ Stability AI’s SDXL 0.9
Stable Diffusion XL 0.9, launched by Stability AI, is an impressive image generation model and the largest open-source image model to date. It pairs a base model of 3.5 billion parameters with an ensemble (refiner) model of 6.6 billion parameters to generate high-quality images with intricate details.
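The two-stage handoff described above can be sketched in toy form. The function names and the "residual noise" bookkeeping here are made-up stand-ins for illustration, not Stability's actual code; the real models are large diffusion networks.

```python
# Toy sketch of SDXL 0.9's base + refiner (ensemble) split:
# the base model handles the early, high-noise denoising steps,
# then hands its latents to the refiner for the fine-detail steps.

def base_denoise(prompt, steps=40):
    """Stand-in for the 3.5B-parameter base model."""
    noise = 1.0
    for _ in range(steps):
        noise *= 0.9  # each step removes part of the remaining "noise"
    return {"prompt": prompt, "residual_noise": noise}

def refiner_denoise(latents, steps=10):
    """Stand-in for the 6.6B-parameter refiner stage: picks up the
    partially denoised latents and finishes the low-noise steps,
    where focused detail is added."""
    noise = latents["residual_noise"]
    for _ in range(steps):
        noise *= 0.5
    return {"prompt": latents["prompt"], "residual_noise": noise}

latents = base_denoise("a hand holding a coffee cup")
image = refiner_denoise(latents)
print(image["residual_noise"] < latents["residual_noise"])  # -> True
```

The point of the split is that each stage specializes in one portion of the diffusion schedule, rather than one model handling all of it.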
A comparison between Stable Diffusion XL 0.9 and Midjourney suggests that Midjourney's images are still superior. However, the competition between these models fluctuates, with each taking the lead at different times, a sign of ongoing progress and healthy competition among image models.
The hosts emphasize the importance of combining multiple models in AI. Single models are not sufficient to capture the complexity of the universe; just as processors have physical limits, AI models have their own boundaries. The future of AI lies in effectively combining various models to achieve more powerful and comprehensive results.
2️⃣ Google’s AudioPaLM
AudioPaLM is a new large language model from Google that combines PaLM 2 and AudioLM. It excels at speech tasks across languages, such as recognition and speech-to-speech translation, capturing not just the text but also the nuances of intonation and speaker identity.
The combination of these models opens up possibilities for enhanced transcriptions, chatbots, and applications that require a deeper understanding of audio and language intricacies.
Multimodal capabilities are the future, as seen in AudioPaLM's ability to translate between language pairs not included in its training set. This groundbreaking feature showcases the potential of synthetic speech generation and the abstract representations learned by multimodal models.
3️⃣ Midjourney 5.2
Midjourney 5.2 introduces a new zoom-out feature, allowing users to start with one subject and gradually expand the canvas around it to create a stunning, mesmerizing effect. Combined with interpolation, it offers a magical experience akin to zooming in and out of a video.
The update also includes a new shorten command for trimming prompts, addressing the challenge of lengthy, excessive prompts. By exposing how prompts are tokenized and which tokens matter most, it helps users generate the images they want more efficiently, saving cost and processing time.
Understanding the tokenization process and the weight assigned to each token provides valuable information about Midjourney's internal workings. It gives users a deeper understanding of the model and empowers them to achieve better results by optimizing their prompts.
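The token-weighting idea can be sketched with a toy version of a shorten-style command. The weights and the split-on-whitespace "tokenizer" below are invented for illustration; Midjourney's real tokenizer and weighting are not public.

```python
# Toy sketch: score each prompt token by an assumed importance weight
# and keep only the heaviest ones, preserving the original word order.
# These weights are made up for the example.

ASSUMED_WEIGHTS = {
    "portrait": 0.31, "astronaut": 0.42, "dramatic": 0.12,
    "lighting": 0.08, "trending": 0.02, "artstation": 0.03, "8k": 0.02,
}

def shorten(prompt, keep=3):
    tokens = prompt.lower().split()
    # Rank tokens by their (assumed) contribution to the image...
    ranked = sorted(tokens, key=lambda t: ASSUMED_WEIGHTS.get(t, 0.0),
                    reverse=True)
    kept = set(ranked[:keep])
    # ...then keep the original word order for the survivors.
    return " ".join(t for t in tokens if t in kept)

print(shorten("portrait astronaut dramatic lighting trending artstation 8k"))
# -> portrait astronaut dramatic
```

The low-weight filler ("trending artstation 8k") is dropped, which is roughly the behavior the hosts describe: a shorter prompt that retains the tokens carrying most of the signal.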
Follow us on Twitter:
Subscribe to our Substack:
Conner: Good morning and welcome to AI Daily. I'm your host Conner, joined by Ethan and Farb, and we have another great show for you guys today. We're starting with Stable Diffusion XL by Stability, then going into AudioPaLM from Google, and then into Midjourney's new 5.2.
So starting with Stability: they're launching Stable Diffusion XL 0.9, their new leap forward in image generation. It's the largest open-source image model of any kind, with the first base model being 3.5 billion parameters and the second ensemble model being a 6.6 billion parameter model.
So the combination of these two models means the initial base model generates your initial image, and then the 6.6 billion parameter ensemble model generates a lot of the more focused, specific details in the image. And these images look very good. They look very similar to the Midjourney images we've been seeing for a while.
Farb, what do you think about these images? How does the new Stable Diffusion XL 0.9 compare to the new Midjourney?
Farb: Well, I think you probably need to go see your optometrist, because I think the Midjourney ones are noticeably better than the Stable Diffusion ones. And this is always a game where one's a little bit ahead for a while, and then the other one takes the lead for a little while.
So that's not, I think, too big a deal. It's a huge improvement from Stability. But there's a more interesting thing going on to me, because we can see it in the example of Stable Diffusion, we'll see it in the example of AudioPaLM coming next, and we're seeing it pretty much everywhere.
We're moving into the era of AI where single models are not enough. And I think this actually makes sense, and this will probably be the future as we know it. We're not going to move toward a single model that does all things in the universe. Your brain doesn't work that way, and it's probably the case that the universe itself doesn't work that way.
There are some bounds and thresholds that we will run into, and it's why compartmentalization exists in the reality we occupy. It comes down even to the processors themselves. If you want more processing power, for example, Apple is very good at connecting multiple processors together.
Okay? So why not just make one bigger processor? Why not make one processor the size of the world that can do all the computation the world needs, whenever it needs it? It's because the laws of physics don't really accommodate that. The way electrons move at the edges of a processor is different from the way they behave at the center of a processor.
At a certain point, I learned this from Kristy Lee Menahan, probably one of the smartest people in the world, especially on processors. So we have limits, we have edges that we can't go past, and software models have those edges as well. So I think we're going to see, pretty much for the rest of our lives, a world where the most powerful things are accomplished by combining all sorts of different models together.
And this is a good example of it.
Conner: No, yeah, exactly. That's very well said. Ethan, do you have anything to follow up on that?
Ethan: I think the refiner, like, their split now between a base and a refiner model, is really interesting. They put everything together, so now it's not just text prompting; you can do image-to-image prompting, inpainting, outpainting.
So they put all that in one model, so I think that's good progress for them. But, you know, Farb put it well: see your optometrist. Even this image they have here of a hand holding a coffee cup still has six fingers in it. I think they still have a ways to go in competing with Midjourney, but it is good to see. At the end of the day, they're competing on the open-source side, and this is better than the previous Stable Diffusion.
So Godspeed to them and just good to see more competition and more updates on image models.
Farb: And it'll run on a single computer. It'll run on an AMD card. It's, uh, it's no
Conner: joke. It can run on a very small computer. I think they said you only need eight gigabytes of VRAM on an NVIDIA card and 16 gigabytes of RAM on an AMD card.
So, very exciting.
Farb: Absolutely. Non-commercial and hopefully 1.0 in July.
Conner: The weights are not out yet, but they will be out in July. For now, you can access it through ClipDrop. Oh, and yeah, stylistically is what I meant: it looks a lot more like Midjourney. Yes. Next up we have AudioPaLM, Google's new large language model that can speak and listen.
It's a combination of PaLM 2, which we covered from Google I/O back in early May, and AudioLM, a model of theirs from, I think, late last year. It's a lot like their MusicLM, which we've covered before, but it's specifically tasked with audio, speech recognition, and so on across different languages.
So AudioPaLM is the combination of PaLM 2, with its ability to understand language and the phonetics of how words are spoken, and AudioLM, which captures the things you can only really hear in audio and can't see in text, like the intonation of how someone's saying a word, or who the speaker is.
So, given the combination of these two types of models and their two different strengths: Ethan, what do you think that opens up? What can we do with these models?
Ethan: You know, people have been combining these in easier ways before. Even on the ChatGPT app, for example, when you're using their transcription, not only does it use Whisper to transcribe what you're saying, but then it runs an LLM on top to edit what you're saying and output a more likely version of how you were speaking and what the transcription should be.
I think this was a clear pathway, combining both of these models, keeping track of speaker identity, like you said, going for intonation. Audio and language are so closely tied, but also have their own intricacies. At the end of the day, if I'm yelling a word, you probably want an exclamation point.
And these are just important things you want in a transcription, important things a chatbot should understand, important things an application should understand. So I think it was a really clear pathway that people were going to combine these models, and AudioPaLM is kind of the first real implementation of it, with a lot more to come.
It's probably just where the rest of this space goes. At the end of the day, even GPT-4, you know, they're going to start accepting audio inputs, image inputs. Multimodal is the future, and AudioPaLM is just kind of the beginning traces of that.
Conner: Yeah, speaking of multimodal, Farb, as we've seen from some of the examples, it can now take input speech in one language and carry it over into another language, in another manner of speaking, even if it hadn't seen that in the initial training set. So what does this unlock? What does this give us?
Farb: Yeah, that's one of the coolest things I saw. It can translate between a language pair that it was not trained on, which is just kind of mind-bending.
And they did a great demo where they were constantly translating between the models, and as you're listening to it, you kind of forget that each of the people who worked on it was speaking a language other than English and it was being translated into English. As you work your way through the presentation, you kind of forget that the English translations are not really them speaking.
That's all synthetically generated by the model. So yeah, multimodal is the future. Hold onto your pants.
Conner: Yeah, I think Meta already has that down very well, between ImageBind and Yann LeCun's whole vision that models should be multimodal and should be more about the abstract representation than what is literally in the model.
I think Meta's already seeing that, OpenAI of course sees that with GPT-4, and I think Google's starting to see it as well with AudioPaLM. So, very exciting. Absolutely. Well, lastly today we have Midjourney 5.2. Midjourney 5.2 has this new zoom-out interpolation type of thing. We'll show a video on the side here, of course, but it means you can start with one subject and then fill in more and more of the image around it until you have a massive canvas.
And then if you add in something else, like Runway ML's image interpolation, you can have essentially a whole video of zooming in and out, and it looks really magical, really beautiful. And then the second part of Midjourney 5.2 is a new shorten command. Midjourney prompts are, of course, famous for being very long and very excessive, but you can give it a whole long prompt and the shorten command will actually shorten it down. Ethan, you found something interesting about the shorten command. What did you find?
Ethan: I did, I did. At the end of the day, you know, prompting is always a challenge.
Midjourney is the cream of the crop in image models right now, and I think everyone finds prompting troubling. So getting insight into how Midjourney actually works internally, which tokens it highlights as most important to generate an image, and being able to shorten that up: it's such a simple command, but at the end of the day really useful for people who are in it day by day, actually writing prompts for Midjourney and trying to figure out how to get the images they want. So really cool updates from them. And 5.2, like you said, with the zoom-out interpolation too: they're the cream of the crop right now, and they continue to show that.
So I'm excited for them, I'm excited for this update, and if you are in the weeds of it, definitely try out that shorten command.
Conner: Yeah, I believe with the shorten command, if you click "show details" or whatever in Midjourney, it'll actually show you how they tokenized your prompt and the weight they give each token, which is very interesting, because we don't really know the architecture of Midjourney at all.
We don't know what they're using to tokenize. Obviously they're built off Stable Diffusion, but besides that, this is probably the first big piece of information we've had about how Midjourney actually works. So, very interesting. Farb, any Midjourney thoughts? Any thoughts on where Midjourney is going next?
Farb: You know, last October or so when we were building Namesake, I probably spent, I don't know, one or two months, six hours a day, just building prompts. I wish I had this stuff back then. It is very cool. You can cut down on your costs and your processing time if you better understand how the different words you're putting in there are being tokenized.
And it even makes a difference, I definitely found this when I was writing prompts, whether you put something at the beginning versus at the end: it tokenizes completely differently and has a huge impact on what the final results are.
So it's cool to see that you can actually get some access to this more underlying information to help you get better results in the end. You know, less processing costs and less processing time.
Conner: Yeah, well said. Exactly. Well, great three stories today. What have you guys been seeing?
Ethan, what's the latest?
Ethan: Yeah, I saw that AWS now has a hundred-million-dollar fund for generative AI initiatives. You know, it's always funny when the big cloud providers drop these funds, because they're going to invest $10 million in you and $9.9 million of that's going straight back to them on their GPU server farms.
But I think it is a great opportunity, and another funding path for generative AI startups. So if you're looking for closer partnerships, you need access to GPUs, and you're raising money, definitely check out AWS's new fund, and yeah, use it for your GPUs.
Conner: More GPU funds, as always. We've seen everyone from Vercel to every big company on Earth doing that now. So if you have servers and you're hosting, go ahead.
Ethan: It's such a clear pathway for these big cloud providers. Even the big funding rounds, you know, with Cohere or anything, it's all to pay for servers. So it's great positioning they sit at right now in the AI space.
Conner: It's like Google and Replit's partnership, same thing. Everyone's like, oh, Replit and Google partnership, but all the money goes straight back into Google Cloud. So yeah, it's essentially, in some ways, a funnel from venture money into cloud money. Yep, absolutely.
Farb, what about you? What have you been seeing?
Farb: Well, you know, a16z dropped "It's Not a Computer, It's a Companion," sort of their take on all things companion, whether it's a romantic companion or a coaching companion. They're trying to position how they see the world with regard to the ever-growing number of different types of AI companions. And this paper is probably going to be talked about pretty regularly here for a while. a16z is pretty good at collating these things, bringing them together, and explaining them in a cohesive way that people will reference, probably for the next year or two.
And, you know, ten years from now it'll kind of seem silly and quaint, but this will probably be an important reference for people for a while. Absolutely.
Conner: Yeah, it was very well written. Me personally, I saw a paper called SequenceMatch: Imitation Learning for Autoregressive Sequence Modeling with Backtracking.
There's a lot more specific to the paper, of course, but essentially they introduce a new backspace token into their language model. So instead of a normal language model, which only generates text forward, it now has the ability, like a person, to realize it messed up mid-generation, emit a backspace token, back up in its text generation,
and restate what it was saying. So, a very interesting way of doing things. We haven't seen this out of an LLM before, and I'm not sure how much traction it'll get, but it is an interesting idea and probably useful for LLM generation.
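[Editor's note: the backspace-token idea described here can be sketched as a toy decoding loop. The `<bksp>` token name and the token stream below are invented for illustration; the paper trains the model to decide when to emit such a token, which this sketch does not model.]

```python
# Toy sketch: apply backspace semantics to a stream of emitted tokens.
# When the model emits "<bksp>", the most recent output token is popped
# instead of a new one being appended.

BACKSPACE = "<bksp>"

def decode_with_backspace(emitted_tokens):
    output = []
    for tok in emitted_tokens:
        if tok == BACKSPACE:
            if output:       # undo the most recent token
                output.pop()
        else:
            output.append(tok)
    return output

# The model "changes its mind" about 'red' and replaces it with 'blue'.
stream = ["the", "sky", "is", "red", BACKSPACE, "blue"]
print(decode_with_backspace(stream))  # -> ['the', 'sky', 'is', 'blue']
```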
Ethan: I think it definitely could have been a main story for today.
It is a very trippy way of thinking about LLMs. If you think about it, right now there's no way for a model to really edit or do a backspace unless you want to rerun the LLM. So having this actually there within the weights and within the network itself. Yeah, much more to say on it, but it was a good end story.
Conner: It's a very common way of prompting to ask GPT-3 or GPT-4 something and then say, hey, did you make any mistakes when you said that?
And then it will recognize: oh, I did mess something up there; this is what I should have said instead. And if you instead gave it a backspace token, it might be able to do that without you even knowing.
Farb: Yeah. It's tough to know if it's a parlor trick or a new fundamental discovery of how LLMs work.
We'll see. But, uh, speak for yourself. I've never used the backspace in my life.
Conner: Of course, as we get bigger models, it should be less useful, because they shouldn't mess up as much in the first place. But most people, unlike Farb, if they're sending an email, will write something and then go, ah, backspace, before I actually hit send.
Don't give him credit for this. He hits backspace.
Farb: Backspace is for weak people. Okay, move forward.
Conner: I've seen Farb; he did rip out the delete key.
Farb: Actually, I didn't misspell the word. That's how that word is spelled from now on.
Conner: Indeed. Well, thank you guys for tuning in. We will see you Monday, as we like to say, and have a good weekend, guys.