Video-LLaMA, Mechanical Turk, and EU AI Regulation

AI Daily | 6.15.23

Welcome to AI Daily, your go-to podcast for the latest updates in the world of artificial intelligence! In today's episode, we have some banger stories lined up for you. Join us as we dive into the exciting advancements in the realm of Mechanical Turk, the impact of AI in the EU Parliament, and a cutting-edge multimodal technology called Video LLaMA.

Key Points:

Video LLaMA

  • A new paper, Video-LLaMA, focuses on turning video and audio into text and understanding them better.

  • The paper addresses two main challenges: capturing temporal changes in video scenes and integrating audio and visual signals.

  • The model showcased in the paper demonstrates accurate predictions and understanding of videos, including analyzing images, audio, facial expressions, and speech.

  • The availability of the model for public use is uncertain as it is currently a research paper, but it highlights the potential of leveraging AI tools like ImageBind and audio transformers to enhance video understanding.

Mechanical Turk

  • A study reveals that a significant portion (around 36-44%) of text summarization tasks on Mechanical Turk are being done by AI models like ChatGPT instead of humans.

  • The displacement of human workers by synthetic models raises concerns about the availability and quality of real data for training larger language models like GPT-4 and GPT-5.

  • Detecting synthetic data generated by language models is challenging, and specialized classifiers may be required to distinguish between human-generated and AI-generated text.

  • The increasing reliance on AI models for tasks like text summarization may lead to the introduction of stricter verification measures, such as keystroke tracking or biometric testing, to ensure authenticity in online assessments and proctoring.

EU Parliament & AI

  • The EU Parliament is taking steps towards AI regulation, although the specifics and implications are unclear.

  • There are concerns about redundancy in creating separate AI-specific regulations when existing laws could cover related aspects such as data privacy.

  • The potential impact of AI regulation on startups and small players is uncertain, as compliance requirements and limitations on training AI models could arise.

  • The regulation aims to address issues like transparency, disclosure of AI-generated content, and prohibitions on certain applications like social scoring and real-time facial recognition. However, some argue that these issues can be legislated without directly tying them to AI.


Farb: Good morning and welcome to AI Daily. We have a few banger stories for you today: some things about the world of Mechanical Turk, our friends over in the EU Parliament, and some actual AI news around some cool multimodal stuff that's happening. Uh, let's start with that one. There is a new paper about turning video and audio into text and understanding video and audio better, called Video-LLaMA.

Conner, can you tell us a little bit about it and, you know, why you think this particular paper is important? Obviously there's lots of this type of stuff going on with video to text, audio to text, pictures to text. Uh, what did you find interesting about this paper in particular?

Conner: Yeah, they had a demo of how accurately and how well it predicts and understands a video, with both the images in the video and the audio itself of the video. So they tackle two main challenges here that they highlight: capturing the temporal changes in video scenes and integrating both the audio and the visual signals.

Um, so for the first challenge, they make a Video Q-Former, which assembles a pre-trained image encoder with the rest of the video and pipelines all that together. And then for the second challenge, they use ImageBind from our friends over at Meta and get both the audio and the video to integrate together.

And the final output of it is a very good understanding. This is almost like a CLIP for videos. We saw with CLIP before how well it understands an image. And then with this, you give it an entire video and it understands how someone's talking, what they're saying, how their facial expressions are moving as they talk.

It's a very powerful model and I'm very excited to see it used.

Farb: Is it available today? Can I take over the world with it today, or will I have to wait to take over the world with it?

Conner: I don't believe it's open source. I think it's just a research paper, so we'll see.

Farb: Okay. We don't even know. Maybe ChatGPT wrote it.

We don't know anymore. It's another cool example of, you know, what we've talked about repeatedly: using other AI tools to make your AI tool better and more understanding. Ethan, what'd you take away from it?

Ethan: Yeah, one of the big takeaways I saw was they're actually leveraging ImageBind. You know, we've talked about ImageBind from Meta before, this kind of multimodal embedding model. So they're leveraging ImageBind. They put an Audio Q-Former on top of it, um, to understand audio a little bit better from the embedding side as well. But main thing is, you know, we've seen audio transformers and image transformers.

"Can you describe this video?" is inherently multimodal. Um, at the end of the day, to fully understand a video: what do you see, what do you hear? How are the scenes transitioning? But also what are they saying, or what's going on? Is there music in the background? So it's an inherently multimodal problem, and ImageBind does it fairly well, but they showed a really cool way to add some more audio transformers on top and truly, highly accurately describe the entire video.

So they have some really cool examples. Um, we'll of course put them in the show notes below, but check it out. Um, some of it's open source, I believe, but it is just a paper. Um, of course, you know, ImageBind is open source. You can recreate this yourself. Um, not too difficult.

So it's cool if you wanna understand video. Um, Video-LLaMA is like a really easy way to do it.

Conner: I understand. I'm very curious how the I-JEPA we saw yesterday, which is also open source, will be applied to something like this, because video does require a lot more abstract concepts to link the audio to the images, to facial expressions, to what they're saying. So I think the abstract conceptualization of I-JEPA we talked about yesterday could probably tackle this even better.

Farb: When will we be able to just write a paper and then give our paper to, you know, GPT and be like, can you just write the code for me? Here's the paper. I'd like the code for it. Yeah.

Conner: Why are we even writing the paper at that point?

Farb: Why are we even writing the paper? Why are we even here? One limitation that they mentioned, which is, you know, a limitation we're probably gonna be hearing about in some way or another forever, was just that, you know, not enough GPUs, don't have enough money, not enough computation on planet Earth to do a video that's more than probably a few seconds long.

Right? You can't feed this the, uh, director's cut of the Lord of the Rings trilogy, which, you know, I think comes in at around 12 hours total of movies, and expect it to do anything with it. There may actually not be enough computation to do that in any reasonable amount of time. So, you know, we're gonna be hearing that. Eventually we'll be like, oh, well, you can do 10 hours of video, but you can't do a thousand hours of video.

So, sorry, we can't apply this to the entire corpus of YouTube overnight, but we'll get there eventually. Uh, I thought it was pretty cool to see this stuff. Their demo is really well done, I thought. They did a bang-up job on that. Right, onto our next story, which is, uh, you know, not necessarily a piece of tech so much as people have, uh, started digging into how folks are using LLMs out in the real world.

And in this example it's Mechanical Turk. They did a little study here and found that somewhere, you know, 36 to 44% or something like that of a very specific kind of Mechanical Turk task, uh, text summarization, somewhere between a third and a half of it was being done by ChatGPT or other LLMs, uh, not by actual humans.

Ethan, what did, what did you get from this?

Ethan: Yeah, you know, I think this entire story is a hunch people have had for a little bit, but it touches on some actually kind of really important points for the space in general. These models, as everyone knows, are really good at doing these tasks.

Uh, so number one, you're seeing the displacement of a lot of Mechanical Turk jobs. You're seeing just GPT-4, and even companies saying, hey, we're not gonna hire all these people. We're just gonna use these synthetic models and classify and summarize and all of that. But it touches on two other really important points.

One is that real data is getting much harder to get. So when you're training these larger models, as we get to GPT-5 and GPT-6, there's gonna be a lot of synthetic data in there, and people are worried about its effectiveness, about its effects on alignment and, especially, on the sciences. Cuz you need a lot of real, just human data of what's happening in the world versus this synthetic data.

So it's a really grounded paper on actually, you know, factually what is happening. So as people look to train more and more models, you're gonna have a lot of synthetic, just trash in there. Um, and these are gonna have to be filtered. And when it comes to filtering, another really cool point of this paper was that these generalist models that are saying, hey, we can detect if this is GPT-4 generated, they're not that effective.

What they had to do was train a classifier just for this example, for these summarizing Mechanical Turk jobs. And that's the only way they could accurately predict: hey, was this a real person or was this synthetically generated?

Farb: They had to track your keystrokes, it seems like, to see if it was a, you know... forget your fancy LLM analysis tool, they're just straight up tracking your keystrokes and being like, that ain't a human, buddy.

Uh, yeah. This guy wrote a summary without typing on the keys, which is impressive, but I'm guessing they didn't actually write the summary. So Conner, what did you take away from it?

Conner: Yeah. This has some interesting ramifications for the future of training data and how good our LLMs can really be in the future.

Because a good analogy I've seen is that it's like a picture on your phone: if you take a screenshot of it, and then take a screenshot of that, and a screenshot of that, over and over, eventually you lose a lot of the original data, and you even get artifacts in the image that you didn't think were possible to get.

And as more of our core human data, like Mechanical Turk, gets replaced by AIs, we may see a lot of the same thing happen with our LLMs: they get artifacts, they get damaged in ways we didn't see coming. And it's a little bit worrying how good our LLMs can be if these core data sources are damaged.
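
The screenshot-of-a-screenshot analogy can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `recopy` is a hypothetical stand-in for one lossy copy (a simple three-point average), and the variance of the signal is used as a rough proxy for how much detail survives. It says nothing about how LLM training actually behaves; it only shows that repeated lossy copying keeps eroding the original.

```python
from statistics import pvariance
from typing import List


def recopy(signal: List[float]) -> List[float]:
    """One lossy copy: each value becomes the mean of itself and its neighbours.

    Hypothetical stand-in for taking a screenshot of a screenshot.
    Edge values reuse themselves as the missing neighbour.
    """
    n = len(signal)
    return [
        (signal[max(i - 1, 0)] + signal[i] + signal[min(i + 1, n - 1)]) / 3
        for i in range(n)
    ]


# A made-up "original" with a lot of fine detail (alternating values).
original = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]

# Copy the copy, five generations deep.
generations = [original]
for _ in range(5):
    generations.append(recopy(generations[-1]))

# Variance as a rough measure of surviving detail in each generation.
detail = [pvariance(g) for g in generations]
for depth, value in enumerate(detail):
    print(f"generation {depth}: detail = {value:.4f}")
```

Each generation is built only from the previous one, never from the original, which is the situation Conner describes when model output replaces human-written data.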

Farb: Yeah. You know, I always say that the world seems like it's falling apart to adults, and it's chaos, and who knows, the world may end. But for young people, for kids, you know, teenagers, they think the world is normal because that's all they've ever known. You know, when you're born into a reality, you think that's normal reality.

You don't understand that for everybody who's 50 or 60 years old, this seems like absolute insanity and chaos, and the world is changing too quickly. And I can't even quite yet get my head wrapped around it in this context, where, you know, people being born today, 10, 15 years from now, they're just going to think that a world where we don't know what data is synthetic and what data is real is just a normal world.

And they'll have ways of dealing with it. And sure, there'll be lots of things like, you know, you think about taking these, uh, online tests where they're proctored, and they're doing keystroke tracking and they've got your camera on, and it's like, okay, they're gonna have to add, you know, five other things. You're gonna have to, like, have, you know, a needle in your arm so that they can see that there's real blood coming out of the person taking the test.

Cuz otherwise, you know, you can spoof the video and you can spoof every other aspect of it. You know, can you spoof the, uh, blood test that's coming with your proctoring?

So, it'll be interesting. You know, I think in the end we're gonna live in a world where, you know, we're seeing the virtual world and the real world melt into one another. Uh, and then we can dig into all of the "we're living in a simulation" arguments anyways, so maybe that's already happening. Onto our last and final story: the folks at the EU Parliament, uh, never lacking time on their hands to regulate things.

Uh, and this is an article I read about how the EU Parliament has started, you know, their first step in AI regulation. Nobody knows what that means, including them, it seems. Uh, in a lot of cases, to me, the stuff that they talk about seems like it should be covered under other laws.

You know, if you have these data privacy laws, I don't know, why do you need the AI version of the same data privacy law? And it also brings up a lot of stuff where, you know, these folks that are, you know, politicians and legislators and bureaucrats that are doing this stuff, they come into the job, they wanna pass a bunch of stuff so they look good and they can go on and level up their career onto the next thing.

Uh, and they sort of leave a mess of legislation behind them for somebody else to clear up. And I even read something in the article where I think some US legislator was talking about, like, we're not gonna let them create more legislation than us. We're gonna outcompete them in the amount of legislation that we're gonna create around this, which just seems like a strange position to take, wanting to be the ultimate, uh, regulator here. Uh, Conner, what did you get from it?

Conner: Yeah, I mean, as any legislator says, they said, oh, we've made history today. Yeah, the classic thing we see at the EU: more regulation. In some ways it's good. Of course, GDPR did some good things for privacy, but also did some annoying things, like cookie banners, and I'm assuming AI will end up the same way.

Some good things, some annoying things. I hope that's the best and the worst of what we get. Um, but they are going very seriously and very hard at AI regulation, so it remains to be seen.

Farb: Makes me think of the Hugging Face article that we covered yesterday. I think at the bottom of the article, if you read it, it said, I dunno, they're even joking.

They're like, this article was 100% not written by ChatGPT. Uh, so, you know, they're getting ahead of the EU regulation. Nice work, Hugging Face. Ethan, did you get anything out of this?

Ethan: Yeah, I think, you know, we don't have to cover the entire act now, but they actually have a bunch of stuff in there about what's allowed, what's not, transparency requirements for these systems, making sure you're disclosing that this is AI generated. And if this actually gets passed and trickles to the US, I think it's important for small players to keep an eye on this.

Like, are you gonna be allowed to train these? Are you gonna have to hire a compliance team for your startup just to make sure you're compliant with all this? And they're moving fast on this at the end of the day, so they want to get this through by the end of this year. And if it does, it would go into effect in 2026, which on one hand is very fast.

But on the other hand, I think we see, you know, especially the big tech giants just kind of letting them talk and roll along, and they're saying, hey, yeah, we'll look at it. And yeah, there might be some more fine-tuned details we're gonna have to do. So I think they're letting the regulators, you know, put a bunch of comprehensive reform together.

But there's so many details that just aren't worked out yet. They're gonna have to convince the rest of the EU member states to agree. You know, you're gonna have a lot of lobbying from the big tech companies and venture capitalists, et cetera. But there's some real details in here that, even if slivers of this get passed, you know, if you're a startup doing AI-generated content and you're in the EU and you're not putting "we're AI generated" at the bottom, then be ready for some fines.

So as this gets more fine-tuned, I imagine we'll talk about it more and actually tell you what you should look out for. Cuz they care about it and it's coming.

Farb: I'll go out on a limb and say, this is not like regulating the early web. The early web is essentially a flat technology. You have web pages, you go to the webpage, there's some content on there.

Okay, yeah. We don't want the webpage tracking everything you're doing and stuff like that. But that was a relatively flat type of technology that was understood. And you know, regulators didn't come in when the internet was being developed in the sixties and seventies and get their hands in on all of this and be like, oh no, we don't think these packets should be sent this way.

Uh, you should, um, send them this way instead. And I think they may be making a miscalculation, especially when this is such a global and open source phenomenon. If they wanna, you know, hamstring themselves, uh, other countries may not do that. And, you know, they may end up falling behind on what their regions are able to do with AI, because other regions that don't wanna regulate it this way advance more quickly.

And this is a, this is a base level of technology that we're developing here. Like, like language or the written word. Or the spoken word. It's, um, it'll be interesting to see how this shakes out. I don't think this is as easy as, you know, email legislation or, you know, cookie legislation.

Ethan: No, it's super tough, but I will give props to them.

You know, one of the things in there is, like, unapproved systems. They don't want anyone in the EU deploying these, like, social score systems, predictive policing systems, these real-time facial recognition systems. These things, you know, the really clear things that I think the Western states don't like that we're seeing out of China with AI, they're putting that in there.

So as Conner said, there's some good and bad. It's messy, but...

Farb: It should be talked about. But to be fair, you can, you know, legislate those things without it having to do with AI. Don't do facial recognition. It doesn't have to be AI related. And, you know, maybe they're just kind of saying the AI stuff to get some attention. Not tracking people's faces is a thing that, you know, in our society, people want, and whether it's related to AI or not is not so much the issue.

Absolutely. All right. Well, not to be too long-winded in this episode, let's move on. What are you all seeing here, Conner? What's new?

Conner: Yeah, we get to talk about Vercel again. Vercel announced today they have an AI SDK. Uh, so of course Vercel is a very great place to deploy applications, to write your backend for AI, to write your front end.

Uh, main thing there is streaming. So of course in ChatGPT, when it prints the output gradually over time, that's cuz it's streaming live from OpenAI, through another server, and then finally to your browser. So Vercel's AI SDK is basically a better wrapper around that, to use OpenAI's or Hugging Face's and a few others' SDKs better.

So very nice, good developer experience as always.
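
The streaming path Conner describes (model provider, then an intermediate server, then the browser) can be sketched with plain Python generators. This is a minimal illustration under stated assumptions, not Vercel's AI SDK or OpenAI's API: `upstream_tokens`, `relay`, and `render` are hypothetical names, and no network calls are made. The point is only that each hop forwards a chunk as soon as it arrives instead of waiting for the full reply.

```python
from typing import Iterator


def upstream_tokens(text: str) -> Iterator[str]:
    """Hypothetical stand-in for a model provider emitting a reply piece by piece."""
    for word in text.split(" "):
        yield word + " "


def relay(chunks: Iterator[str]) -> Iterator[str]:
    """Hypothetical stand-in for the intermediate server.

    It forwards each chunk as soon as it arrives rather than buffering the whole reply.
    """
    for chunk in chunks:
        yield chunk


def render(chunks: Iterator[str]) -> str:
    """Hypothetical stand-in for the browser: append chunks to the visible text."""
    shown = ""
    for chunk in chunks:
        shown += chunk  # a real UI would repaint here, so text appears gradually
    return shown.rstrip()


reply = render(relay(upstream_tokens("Streaming shows text as it is generated")))
print(reply)
```

Because every stage is a generator, nothing downstream waits for the full reply, which is why the text appears to type itself out.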

Farb: You seen anything cool, Ethan?

Ethan: Um, nothing crazy today, but I just saw Eren Bali, you know, uh, CEO of Carbon Health, just kind of tweeted their strategy on AI, which was, you know, they're in the healthcare space and they're saying, hey, we're exclusively using AI for clerical work.

We're not doing any clinical work. And I think it just kind of touches on where a lot of startups are, like, hitting some walls right now and saying, hey, we're just gonna make LLMs work on the administration side, improve some efficiencies, et cetera. But it touches on the broader perspective that someday someone is going to come with an amazing consumer clinical healthcare application that I think is gonna reshape the way healthcare is delivered.

But for now, there's still so many gains just on an administrative side. So it was a comment, you know, I like the healthcare space, so it was a comment he made that just kind of interested me.

Farb: Uh, I was talking to a friend of mine yesterday who, uh, is pretty deep in the world of AI and gen AI and just generative media, uh, over at one of the big tech companies.

And, uh, we were talking about, you know, the ability to regenerate your likeness at varying levels of accuracy. And it just kind of made me realize that generative AI is getting so good at recreating your likeness that you will probably hate it. Uh, which is interesting. I like that tweet. Yeah, thanks.

Thanks. But it makes you think, it's like, oh, okay, that's, uh, that looks a little too much like me. Can you back it off a little bit? Yeah. And you know, maybe pretty me up a bit, because, um, you know, I don't want that aspect of my accuracy being shared. And it just kind of makes you think, like, okay, well, why make the technology that good if people wanna back off of it?

It's just an interesting, interesting world.

Conner: Living the exact plot of the Silicon Valley episode, right? He, like, upgraded the video conferencing and then the girl was like, oh yeah.

Farb: Oh yeah. Okay. Yeah. I try not to watch that show too much. It hits a little too close to home. And then, like, you know, when I'm fundraising and then there's an episode about fundraising, I've literally had to turn the episodes off in the middle of them, because I'm like, okay, I can't deal with this right now. This is a little bit weird stuff. It was a great show though. Great show. Fantastic. All right, well, thanks for joining us today.

Um, I think maybe our longest episode ever, uh, which is to say, if you made it this far, thank you for watching, and we'll see you tomorrow. Peace, guys. Peace, guys.
