StyleAvatar3D, Gorilla, & GILL

AI Daily | 5.31.23

In today's episode, we have three exciting stories to share with you. First up is GILL, a method that infuses image recognition and generation capabilities into language models. With GILL, you can send images to chatbots and receive responses in the form of edited images or detailed explanations. It offers a unique approach to understanding and responding to images without the need for extensive multimodal training. Next, we have StyleAvatar3D, an advancement in 3D avatar generation. This technology allows for high-fidelity and consistent 3D avatars with various poses and styles. Unlike previous methods, StyleAvatar3D maps out the three-dimensional space to create a more realistic and immersive experience. This development opens up new possibilities in gaming and social applications. Lastly, we explore Gorilla, the API app store for language models. Gorilla connects LLMs with thousands of APIs, offering users a vast selection of tools to complete tasks. What sets Gorilla apart is its ability to reduce hallucinations and provide accurate and reliable API suggestions. With 1,640 APIs available, this model proves to be a powerful and valuable resource. The AI revolution continues, and these stories demonstrate the progress being made in the field.

Key Take-Aways:


  • GILL is a method that infuses LLMs with an image encoder and decoder, enabling them to recognize, understand, and respond to images.

  • GILL offers a unique approach by injecting image embeddings into LLMs, allowing for various use cases such as image editing, image explanations, and image injection into conversations.

  • The integration of an encoder in GILL enables both image generation and image retrieval, expanding its capabilities beyond multimodal models that are limited to retrieval.

  • GILL's open-source code sets it apart from Meta's multimodal work, offering accessibility and potential real-world applications in image-based communication.


  • StyleAvatar3D uses image-text diffusion for high-fidelity 3D avatar generation, allowing for a wide range of avatars with different poses and styles in a complete 3D space.

  • The significance of the 3D aspect lies in the visual accuracy and consistency that is challenging to achieve with Stable Diffusion alone. StyleAvatar3D offers both the generation of 3D images and the ability to maintain consistency in attributes and appearance.

  • Unlike previous avatar generators that relied on stitching together 2D images, StyleAvatar3D maps out the three-dimensional space, providing a more consistent and immersive experience for games and social platforms.

  • The introduction of true 3D assets has marked a significant leap forward, enabling the creation of realistic and dynamic visuals in game development and other applications.


  • Gorilla is an API app store for LLMs that connects the LLM world with the vast world of APIs, offering thousands of APIs for completing user tasks.

  • One of Gorilla's key achievements is addressing hallucinations that exist in models like GPT-4, providing accurate API recommendations instead of generating random information.

  • Gorilla is open source: the inference code, dataset, and evaluations are openly available, while the training code has not been released yet. The model can call 1,640 APIs, and in a demo it was compared against Apple's built-in Spotlight and performed better.

  • Fine-tuning the model on APIs proves to be more effective than prompting, reducing hallucinations and improving accuracy. The architecture's ability to quickly update APIs within the model allows for faster contributions and continuous improvement without the need for complete retraining.
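The last point, updating the set of APIs without retraining, can be pictured with a small sketch. This is only an illustration under loud assumptions, not Gorilla's actual code: `ApiRegistry`, `embed`, `cosine`, and `build_prompt` are hypothetical names, and a bag-of-words count vector stands in for a real embedding model. The idea shown is that new API documentation is added to a lookup table, and the best-matching entry is placed into the prompt at request time.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ApiRegistry:
    """Hypothetical store of API descriptions that can grow without retraining."""

    def __init__(self) -> None:
        self.docs = {}  # API name -> (description, embedding)

    def add(self, name: str, description: str) -> None:
        # Registering a new API only updates the lookup table, not model weights.
        self.docs[name] = (description, embed(description))

    def retrieve(self, task: str) -> str:
        # Return the name of the API whose description best matches the task.
        query = embed(task)
        return max(self.docs, key=lambda name: cosine(query, self.docs[name][1]))


def build_prompt(registry: ApiRegistry, task: str) -> str:
    """Put the retrieved API documentation in front of the LLM with the task."""
    name = registry.retrieve(task)
    description = registry.docs[name][0]
    return f"Task: {task}\nRelevant API: {name}\nDocs: {description}\nWrite the API call."


registry = ApiRegistry()
registry.add("image_classifier", "classify objects in an image")
registry.add("speech_to_text", "transcribe speech audio to text")
```

Adding a later API is a single `registry.add(...)` call, after which `build_prompt(registry, task)` can surface it immediately. A real system would swap the toy `embed` for a learned embedding model.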



Farb: Good morning and welcome to AI Daily. I'm Farb Nivi. I'm here with my co-hosts, Ethan and Conner, and we've got three great stories for you today. The first one is GILL, and we'll tell you what that means in a few. Then StyleAvatar3D, and Gorilla. So if that doesn't explain it all, that's why we're here. Let's start off with GILL.

Can you tell us a little bit about GILL? Ethan, did you get a chance to dig into this and what this story's all about?

Ethan: I did. So GILL is pretty much a method to infuse LLMs with an image encoder and decoder. To simplify that, what that means is, you know, a lot of times if you're dealing with ChatGPT or you're using these LLMs, they have yet to have a clean way to recognize images, to understand images, to speak about images.

That whole kind of experience of, let me send an image to a chatbot and have it respond to me with another image edited from it, or an explanation of that image, or some way to inject images into the LLM. So there's a lot of different ways of doing this. We've seen multimodal approaches, people training a large model of images and LLMs, but GILL took an interesting approach, pretty much injecting these image embeddings into the LLM.

So it comes with a bunch of kind of interesting use cases. They've shown that even just prompting Stable Diffusion, they're able to get more accurate representations because they're injecting this text encoder into it. A few of their demos, of course, we'll be showing below.

You can see that I send a picture of ramen and say, "Hey, can you make this healthier?" And it understands not only the image itself and says, "Hey, this image is just basic ramen," but it also understands from its own memory, "Hey, ramen should maybe add some vegetables to it," and it returns to you another image, ramen with vegetables in it, and an explainer.

So it's an approach to understand images, to respond with images, that does not involve a huge multimodal training operation. So I found it really interesting. Conner?

Conner: Yeah, it's pretty exciting because it can both generate new images and retrieve images, which is pretty big. A lot of multimodal models are limited to just retrieval.

But again, having the encoder built in means you can send in a picture of a cupcake and say, "Hey, how should I market this?" And then it'll generate a new image of a cupcake with a sign in front of it and a little caption saying you should probably do marketing like this, where you show your cupcakes and say your brand in front of the cupcakes.

Pretty exciting to have the generation built in as well.

Farb: So we're no strangers to LLMs or image generation via Stable Diffusion. I thought this was pretty cool. The examples they gave were neat. This is, you know, a pretty niche type of experience they're trying to go after here. You could say it's got some real-world applications, though, since people text each other a lot and people send each other images a lot.

Some people were asking what the difference between this and some of Meta's multimodal work was. I didn't quite see an answer. I think maybe the retrieval and generation piece is a little bit unique. The code is open source as well, entirely. What's that?

Conner: The code is also open source, which of course Meta's is not.

Farb: Yeah. Yeah. The code here is available, I believe, if I remember correctly. Is that right? I think the StyleAvatar3D code isn't quite ready, but I think it's coming soon.

Ethan: Yeah. We love coming-soons.

Farb: That's right. Well, I mean, at least they were willing to announce that they were going to actually release the code and not just be like, here's a paper and here's some examples, and peace out.

Right. Well, let's move on to our next story, StyleAvatar3D, still sticking around in the image generation world. This one's a little bit different than some of the stuff you've seen before, largely because it's three-dimensional. Conner, did you get a chance to dig into this paper a little bit?

I found it pretty interesting.

Conner: Yeah. I read into it a little bit. It's image-text diffusion for high-fidelity 3D avatar generation, so it can get different poses, it can get different styles. Really a whole range of avatars in a complete 3D space that we really haven't seen possible before, and they're very high resolution.

They look very good. Some of the demos, we'll be showing them right here on the side.

Farb: And to give people a little bit of a sense, because they've probably seen things like this and they're just trying to be like, okay, how is this different than, you know, one of these cool avatar generators that uses Stable Diffusion?

What's the big deal? Why does 3D make it any different? Ethan, can you help people understand the importance of the 3D part of it?

Ethan: Yeah, the 3D is really important. I think if you've ever used Stable Diffusion and you try to get that visual accuracy with 3D, it's very difficult. So a model like this coming out and being able to produce these 3D images is one important thing.

But I think the most important thing to me was the consistency. If you saw their GIFs, you know, they have these kind of circling GIFs of these avatars, and they're changing just the hair color slightly, they're changing just the eyes slightly. Being able to point a model in that direction, quote unquote, is really the most important part here to me.

We're seeing, hey, can we generate a face in 3D, but then also make sure it is consistent? You know, we're not just generating random kind of Stable Diffusion prompts. We want this avatar with these attributes. And they even had some interesting ways of kind of mapping attributes to the output of the image generation.

So the attribute mapping combined with the consistency of a generated image was the big news to me here.

Farb: Yeah, I think typically we've seen these examples of people generating something similar where, as a face or a body is moving through a space, it's kind of wildly changing, and really it might look like it's a video, but it's just a bunch of stills that are connected to each other that kind of make it look like somebody's turning their face or somebody's moving.

That's not quite what's happening here. Here, they're actually mapping out the three-dimensional space to give you a much more consistent experience.

Conner: Absolutely. It's very, it's very exciting for games, for social, where you can have a consistent 3D avatar, which wasn't really possible before.

Farb: I mean, it's funny, the 2D power of all this stuff, and people's ability to sort of, you know, connect a bunch of two-dimensional images together to make something that looks three-dimensional, is so powerful that once this 3D stuff comes, people are already like, wait, I already thought we had 3D stuff.

But it's like, no, not really. This is a pretty big leap forward. And, you know, if you're building games, for example, you can't stitch a bunch of 2D images together to make what looks like a movie. You need 3D assets. So this is pretty powerful to see.

Awesome. Let's move on to Gorilla. I thought this is a pretty profound story. These folks at Gorilla, I can tell, are super excited. They're calling themselves the API app store for LLMs. They developed Gorilla, which can pick from thousands of APIs to complete a user task, even surpassing GPT-4.

They say it's a connection between the LLM world and the endless API world. And one of the challenges that they found they had to really accomplish to pull this off was to stop the hallucinations that are prevalent in some of the other models, including GPT-4. So what they have is a very accurate way for you to say, "Hey, I need an API that does this," and Gorilla will give you that API instead of sort of hallucinating some random piece of information.

Conner, have you taken a look at this yet?

Conner: Yeah, I took a look at it a little bit. The hallucination is very interesting because, based off their evaluation, GPT-4 actually hallucinates more than GPT-3.5, which is very interesting to see. Yeah, that was interesting to see too. Yeah, I agree. But the main thing here is that, again, it's another one that's entirely open source.

The training isn't open source yet, but the inferencing, the dataset, the evals are completely open source and available for use. And it's very nice to see 1,640 APIs available to be called from what looks like a very capable model. They demoed it in comparison to Apple's built-in Spotlight, and it performs way better.

From calling S3 to calling Hugging Face to calling another thousand APIs in there, it looks like a very capable model.

Farb: So powerful stuff. Ethan, what are your thoughts?

Ethan: Yeah, well, we've seen these types of things and talked about similar on past shows, but this is, once again, I think an example of fine-tuning beats prompting. When this model was fine-tuned on all of these APIs, you get a lot better responses when saying, "Hey, I need an image model that can recognize eyes," for example, and you're gonna be able to pull directly the API that was fine-tuned into this model.

You're gonna reduce hallucinations. You're gonna improve accuracy. I don't think there's anything too new here, but I am very excited that they've been fine-tuning it on APIs. And the most exciting part to me was their architecture around updating the model.

So of course you don't wanna fine-tune the model entirely again if you get a new API, or there's a model update, or you wanna switch from, you know, Torch over to Hugging Face or something like that. So their model around being able to update these APIs quickly within the model from embeddings, et cetera, is the huge news here, and I think is what allows them to do, you know, this potential API app store. It allows people to contribute to it faster and allows the model to get better, faster and faster, without all this retraining.

So to me, to sum up, you know, fine-tuning is better than prompting. It's gonna be fewer hallucinations than GPT-4. But the way they structured this in general is amazing. And I think when you give LLMs the power of APIs, you can honestly do anything.

Farb: They're really excited about the

Conner: App store, the app store for APIs.

Farb: It was really cool to see a team that's clearly excited about what they're working on. Absolutely. All right. Well, that's three deep tech stories for you here today. The pace of development in AI is not slowing down anytime soon.

What else are we all seeing out in the wonderful world of AI, Conner?

Conner: Yeah, I saw a little tweet that went kind of viral. It was linking another tweet that went a bit viral on the 30th of March, just earlier this year, where a press correspondent asked the White House press correspondent, like, "Hey, what do we think about AI risk?

What's the problem here?" And everyone else in the room just laughed. And the White House press correspondent was like, nothing to worry about. And then that was compared to a clip from just yesterday where he asked essentially the same question, but everyone in the room was very silent, very grave-faced, and she had a very serious response about how it's a serious risk they're looking into.

So the contrast of just a couple months ago versus now is at least nice to see outta the White House. The doomer memes? Yeah, I think,

Farb: Sorry, go ahead.

Ethan: I was just gonna say, the doomer memetics of the White House.

Farb: Yeah. You know, there was another big announcement. This was what I was gonna talk about. There was another big announcement yesterday.

A bunch of experts signed a piece of paper saying that there is another extinction-level crisis on our hands. And so it's important, apparently, that there is always an extinction-level crisis that people are concerned about and worried about. God forbid you go about your short life enjoying it. You need to be juggling three different extinction-level crises happening at any given time.

And what I really found interesting about yesterday was that there was a clear concerted effort. Not only were they gonna announce this sentence that they've all agreed to is a sentence, they're also going to have Wired, Vice, Reuters, you name it, all also talk about how experts (we have to make sure that we use the word experts, because they never make any sort of mistakes) have said a sentence.

Experts have said a sentence, and we have to write these articles about this sentence that experts have said is a sentence. It all seems to me like a lot of self-serving nonsense. I don't think the average person understands or cares about any of this stuff. Maybe they're trying to get the average person to care about this stuff, but what's the average person gonna do about it?

The average person is gonna stop superintelligent AI from exterminating all life on Earth? You know, it would be a great movie where The Rock plays this average person who has to stop the superintelligent AI from exterminating life on Earth. So I think that I'm looking forward

Conner: They made it a few times, a couple decades ago.

Farb: What's that?

Conner: They made this movie a few times over the past few decades.


Farb: They'll keep making the movie. The funny thing is, if you're reading this organization's charter, they're mentioning some of their major concerns, and one of them is enfeeblement, and they actually say in their own charter, "like in the movie WALL-E." So this very serious organization's very serious charter is trying to make sure that WALL-E doesn't happen.

Conner: I mean, WALL-E was a pretty serious movie. I definitely felt something watching it, so, I don't know.

Farb: The people in WALL-E seem perfectly happy to me.

I don't know. Just leave them alone. Ethan, what are you seeing?

Ethan: Yeah, no, I saw the exact same thing. The Center for AI Safety. I'll read off the statement. Well, that was your story for today too. Yeah: "Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war."

You know, honestly, ditto pretty much everything you said, Farb. I think they have gathered the signatures of, you know, pretty much every big person in the space, and they want to continue the narrative that this is something we should worry about and think about. And, you know, I may not hate on it as much as you, but it's very directionless right now.

They want to position this as something extremely dangerous, which I think is dangerous in and of itself. So we'll continue to see how it plays out. But ditto much of what you said. I think you said it well.

Farb: It is dangerous, and the way they're talking about it may also be dangerous, and the way they're going about talking about the danger may also be dangerous.

Conner, what are you saying?

Conner: I think the framing is a bit dangerous, but of course, like existential AI risk is something Mr. Altman cares about, and I'm glad it's something the White House cares about personally. So.

Farb: Yeah.

Ethan: Yeah. Hey, well, we all have our opinions, right?

Farb: We all have our opinions here. I have a feeling we'll be here talking about AI Daily for the next thousand years.

Conner: So if not us, synthetic us.

Farb: Yeah. If not us, synthetic us. So here's to the AI dreamers over the AI doomers. Thanks for joining us today for this excellent episode of AI Daily.

We'll see you tomorrow. Have a great day, everyone. Thank you all.
