
Insights From Meta's CVPR Conference

AI Daily | 6.21.23

Welcome to AI Daily! In this episode, we dive into six exciting papers from the Meta team at the Conference on Computer Vision and Pattern Recognition. Get ready for fascinating insights into computer vision and cutting-edge AI applications.

Key Points:

EgoTask (EgoT2)

  • The EgoTask paper focuses on handling egocentric video tasks, where the videos are recorded from a first-person perspective. It explores the application of AI to improve results in specific egocentric tasks like painting or cooking.

  • By translating between different egocentric tasks, such as painting and cooking, better outcomes can be achieved. This approach recognizes the similarities in hand movements and gestures between different activities, allowing for the transfer of skills from one task to another.


PACO

  • PACO is a large-scale dataset that provides object and part masks, as well as object-level and part-level attributes, allowing for precise segmentation and labeling of different parts within images. It offers specific details about hundreds of different objects, making it valuable for AI training in computer vision.

  • PACO is open source and licensed for commercial use, complementing Meta's earlier release, Segment Anything (SAM). It is particularly beneficial for open-source computer vision projects that require specific color or attribute information, enabling more accurate analysis and understanding of images.


GeneCIS

  • GeneCIS introduces a benchmark for measuring a model's ability to assess image similarity, taking into account colors, textures, and objects. It addresses limitations of object-based comparisons and offers insights into improving similarity scores by incorporating text and image data.

  • Notably, popular computer vision models like CLIP and models that perform well on ImageNet struggled in this benchmark, highlighting the need for novel approaches. GeneCIS has practical applications in fields like fashion and expands the understanding of comparing images beyond object or color-based descriptions. While not commercially available, it serves as a valuable benchmark for evaluating new image models.
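
To make the idea of scoring a model on image similarity concrete, here is a minimal sketch in Python. It assumes each image has already been turned into an embedding vector by some model; the toy vectors, the function names, and the choice of cosine similarity with a recall-at-k score are illustrative assumptions, not details taken from the GeneCIS paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_gallery(query, gallery):
    """Return gallery indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query, item) for item in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)

def recall_at_k(queries, galleries, targets, k):
    """Fraction of queries whose correct gallery item appears in the top k."""
    hits = 0
    for query, gallery, target in zip(queries, galleries, targets):
        if target in rank_gallery(query, gallery)[:k]:
            hits += 1
    return hits / len(queries)

# Hypothetical 2-D embeddings: two queries, each with its own 3-image gallery.
queries = [[1.0, 0.0], [0.0, 1.0]]
galleries = [
    [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]],
    [[1.0, 0.0], [0.7, 0.7], [0.1, 0.9]],
]
targets = [0, 1]  # index of the "correct" match in each gallery

print(recall_at_k(queries, galleries, targets, k=1))  # 0.5
print(recall_at_k(queries, galleries, targets, k=2))  # 1.0
```

In this toy setup the second query ranks a different image above its intended match, so it only counts as a hit once the top two results are considered. A model that embeds "similar vibes" poorly would score low in the same way.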


LaViLa

  • LaViLa fine-tunes large language models (LLMs) like GPT-2 on visual inputs to create video narrators, resulting in more detailed and enriched video descriptions. By leveraging LLMs and egocentric video datasets, the authors enhance sparse narrations, providing nuanced insights into video content.

  • The combination of AI models enhances the understanding of videos and enables the generation of richer narrations, even in cases where audio is absent. This commercially available approach has potential applications in platforms like YouTube, offering narrations that go beyond human dialogue and tap into the visual context of videos.


Galactic

  • Galactic is a large-scale simulation and reinforcement learning framework that trains a robotic arm to perform mobile manipulation tasks in indoor environments. Through iterative training and simulations, the framework enables the robot to autonomously move objects, demonstrating its potential for complex tasks.

  • While Galactic is based on simulated robotics, its principles can be applied to real-world robots. The framework achieves high training speeds of up to 100,000 steps per second using only eight GPUs, showcasing its efficiency and scalability. It is a non-commercial project with promising implications for robotics and reinforcement learning.
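
As a rough sense of scale for the throughput figure above, the arithmetic can be sketched in a few lines of Python. The 100,000 steps per second and eight GPUs come from the episode; the even split across GPUs and the one-billion-step training budget are hypothetical assumptions used only for illustration.

```python
def per_gpu_throughput(total_steps_per_second, num_gpus):
    """Average steps per second per GPU, assuming an even split (an assumption)."""
    return total_steps_per_second / num_gpus

def training_hours(total_steps, steps_per_second):
    """Wall-clock hours needed to run total_steps at a constant rate."""
    return total_steps / steps_per_second / 3600

steps_per_second = 100_000         # figure quoted in the episode
num_gpus = 8                       # figure quoted in the episode
hypothetical_budget = 1_000_000_000  # illustrative step budget, not from the paper

print(per_gpu_throughput(steps_per_second, num_gpus))               # 12500.0
print(round(training_hours(hypothetical_budget, steps_per_second), 2))  # 2.78
```

Under those assumptions, a billion simulated steps would take a little under three hours, which is the kind of turnaround that makes large-scale reinforcement learning in simulation practical.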


HierVL

  • HierVL is a hierarchical video language embedding model that improves the understanding and description of long-form videos. By training on both short clips and a summary of the entire video, it enables the model to grasp the overall context and provide comprehensive explanations, making it valuable for applications like reviewing drone or body cam footage.

  • While HierVL's training focuses on videos up to approximately 30 minutes long, its scalability beyond that remains uncertain. Nonetheless, this non-commercial research offers a promising perspective on advancing video language embeddings and enhancing analysis of extended video content.
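
The two-level idea described above (short clips plus a summary of the whole video) can be sketched as follows. This is a minimal illustration, assuming clip and summary embeddings already exist as plain vectors. Mean pooling is a simple stand-in chosen for this sketch, not the aggregation HierVL actually uses, and all vectors and labels are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_pool(vectors):
    """Average a list of clip embeddings into one video-level embedding."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

# Hypothetical clip-level embeddings for one long video.
clip_embeddings = [
    [1.0, 0.0, 0.0],  # e.g. "chopping an onion"
    [0.0, 1.0, 0.0],  # e.g. "putting it in a pot"
]

# Hypothetical embeddings of candidate whole-video summaries.
summary_embeddings = {
    "making a soup": [0.7, 0.7, 0.0],
    "painting a wall": [0.0, 0.0, 1.0],
}

video_embedding = mean_pool(clip_embeddings)

# Pick the summary whose embedding is closest to the pooled video embedding.
best_summary = max(
    summary_embeddings,
    key=lambda name: cosine_similarity(video_embedding, summary_embeddings[name]),
)
print(best_summary)  # making a soup
```

The point of the sketch is the hierarchy: no single clip says "soup", but the combination of clips lines up with the summary of the whole video.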

Episode Links:

Meta Papers

OpenAI Plans App Store

China’s Underground NVIDIA Market

OpenAI Lobbied EU



Farb: Good morning and welcome to AI Daily. We have a fun one for you here. We're going to review six papers from the fine folks at Meta at a conference that's going on right now, the Conference on Computer Vision and Pattern Recognition. So we're gonna get a lot of computer vision related things, and this should be pretty cool.

We've got six papers we're gonna run through here for you, so I'll kick it off. The first paper is EgoTask. EgoTask is a paper (and there's another one of these as well, that I'll talk about shortly) that's focused on video tasks that are egocentric. Egocentric means it's first-person view. Imagine you're wearing Apple Vision Pros and you're recording what you're seeing.

So all these videos are from the perspective of somebody who's, you know, looking through their own eyes, and obviously this is a specific type of video. And so what we're talking about here is how do we apply AI to handle these specific types of videos better than if you just have a very general model around videos.

What they found is that when they're focused on egocentric tasks, say painting versus cooking or playing chess, you can create better results by having a translator that translates between these types of tasks. So why is this interesting? Why is this useful? I think there's a really cool example that kind of occurred to me in my own life. Whenever I play the guitar, a lot of times it strikes me as almost a weird form of painting: the way you move your hands along the guitar's neck is almost like the way you draw brushstrokes when you're painting.

I've always kind of made this weird connection between the two. This paper is sort of implying something similar, which is to say that if you show how somebody is playing chess or how somebody is painting, you can create a different video of a different task and potentially use that to learn.

So there are a lot of similarities between different egocentric tasks. Cooking and painting, despite what you may think, have a lot of similarities in how your hand moves. So if you happen to be someone who's very good at cooking, you could use this to watch videos, you know, generate a video of painting that kind of uses the same hand motions that you were doing when you were cooking.

So it's essentially creating a connection in your mind so that you can do this new thing with something that you already know how to do. I thought that was a really interesting thing.

And, you know, again, what they're trying to accomplish here is a more specific understanding and translation in a more limited domain. It's not trying to be everything for every type of video. It's trying to use this for just egocentric videos. So yeah, that was the first one.

Conner: Yeah, that makes sense. That plays into, well, it's kind of like a self-awareness across different egocentric activities, so, yeah. Next up we have PACO. Meta announced PACO, which is Parts and Attributes of Common Objects. It's basically a large-scale database which provides object masks, part masks, object-level attributes, and part-level attributes.

So when you have a picture or a frame of different objects, instead of just saying, oh, here's a laptop, you can say, oh, here's a laptop monitor, here's a laptop keyboard that is white. So, (a), it has the segmentation of the different parts, but then, (b), it can also label the different parts with different attributes.

So you have very specific details about images. This is a very large dataset, hundreds of different objects, a very wide range of types of information. And this again relates back to another dataset and model that Meta released a few months ago now, called SAM, Segment Anything. So, again, another very useful model.

This is just a dataset, but it is open source and commercially licensed, so feel free to use it however you want. And if you're training an AI for open source computer vision that needs all these specific colors, like, oh, red car wheels, not just a car, then this is a very useful dataset for that.

Farb: Yeah. I forgot to mention that EgoTask is currently non-commercial.

Ethan: Super cool. Yeah. Next up we have GeneCIS. GeneCIS is a new benchmark, and their goal is pretty much to say, hey, how do we measure a model's ability to test similarity of images? If you've ever worked on similarity of images, of course you know we can take one picture and another picture and say, oh, is there a dog in this picture?

You can do similarity based on objects, but what about similarity based on colors and textures and objects, combining all these things to say, hey, does this have similar vibes to this other image? That's really what this benchmark proposes: how do we measure a model's ability to compare similarity? So, two interesting things from this.

One is that they found that CLIP, the model from OpenAI that measures a connection between text and an image, performed badly on this benchmark. The other is that a lot of models that are great at computer vision, models that perform well on ImageNet, for example, did not really perform well on this similarity test.

They also proposed a kind of new way to mine text data and image data to try to improve similarity scores. There are really cool applications, I think, to fashion or just any kind of computer vision where you want to say, hey, how do we compare two images that you can't fully describe just based on objects or just based on colors?

How do we measure their similarity? GeneCIS is really trying to tackle that. It is not commercially available, but if you have a new image model you're testing, it's a great benchmark to test that out.

Farb: Very cool. All right. Our next one is called LaViLa, which is Learning Video Representations from Large Language Models.

What they're doing here is essentially fine-tuning an LLM, for example GPT-2, on visual inputs to create video narrators. And what they found is that when you do this, what you get is a much richer narration of a video. Interestingly enough, they're using the same egocentric video dataset as the EgoTask group, which is something like, I think, 10,000 hours, or a hundred hours of video on 10,000 different tasks.

It's a pretty large dataset of egocentric video. So, for example, maybe you have a video where a basic video-to-text translator is giving you something like, you know, this person is separating some yarn. Well, what they found is that when they fine-tune an LLM on these sorts of videos plus their narrations, they can get much richer descriptions.

So instead of just "person separates the yarn," it's "person pulls out the yarn with their right hand," or instead of just "person lifts container," it becomes "person lifts container and wipes the countertop." And what they found is that once they've done this, they can go back over video that is very sparsely narrated and actually re-narrate the whole thing with a much richer narration, with a much richer understanding of what's going on in the model.

So what they're essentially doing, and we've talked about this a trillion times, is leveraging one AI model to sort of boost another AI model. Together they're more powerful than if it was just trained on its own. So I thought this was really interesting. This is commercially available, and we'll probably start seeing things like this in YouTube, where the narration can be even richer than just what the people are saying.

And then sometimes, if the audio is lost, for example, it can generate a narration from just the video. Very cool. I like it.

Conner: Next up, we have Galactic. Galactic is a large-scale simulation and reinforcement learning framework for robotic mobile manipulation in indoor environments.

So essentially they built a very large-scale framework that iterates over and over to train a robot to move objects in an environment. They gave it different kinds of environments, different kinds of robotic setups, but in the end, it's essentially a robot that can move and has a seven-degrees-of-freedom arm, and it just teaches it: hey, you find a can over here, can you move it to the trash? Or hey, you find this on the floor, can you move it to the counter?

A very large-scale framework. I believe they trained it at a hundred thousand steps per second. So extremely fast to train, extremely reliant to train on just eight GPUs, and they taught it how to move objects.

Very exciting for robotics and being able to scale this reinforcement learning to teach complex tasks like this.

Farb: Was this a simulated robotics or actual robots?

Conner: Simulated robotics, but again, it's robotics. It's just like a simple arm, so it can apply to a real-world robot. But the big part of it being simulated was that they can train it much faster, of course, than they can train non-simulated.

So, yeah. Galactic is non-commercial also.

Ethan: Very cool. Last up, we have HierVL. HierVL is a hierarchical video language embedding. Video language embedding is pretty much when you're saying, hey, take this video and explain to me what's happening in it. There are a lot of models that can take, you know, three, five, maybe even 10 seconds of a clip and say, hey, this person is chopping an onion.

But when you scale that up and you say, hey, what's happening across a 20-minute-long video, that becomes a problem for a lot of these models. So what HierVL does is it pretty much trains both on the short clips as well as a summary of the entire clip. So taking "chopping an onion," taking "putting it in a pot," and understanding that the entire video is about you making a soup.

So it's a really cool way to improve understanding and describing a long-form video. Imagine giving it a movie and saying, hey, what happens in this movie? Explain the actual movie to me, versus just specific pieces of it. So that's what hierarchical video language embedding is. I think it's a really cool paper.

If, you know, you have any of these applications of reviewing drone footage, reviewing body cam footage, reviewing whatever it may be, how can we improve that across these long-form videos? That's what people are beginning to take a stab at. So this one is non-commercial, but a really great viewpoint of where these video language embeddings are going.

Conner: How long of videos can it process?

Ethan: You know, um, for this one I think it's up to probably about 30 minutes. Um, it's like what they're training them on, so they're training 'em on pretty much the long form summary of these videos. I imagine this scales up even further than that, but I'm not sure if it was trained on that.

Farb: Big year at the Conference on Computer Vision and Pattern Recognition. Get ready for cool Apple Vision Pro apps to be leveraging some of these technologies, 'cause the Vision Pro's not gonna be on your head for another year, which is about a thousand years of development in the world of AI.

So these will all be like silly little jokes a year from now when you can actually use your Apple headset, which will be pretty awesome to see. What else are we seeing out in the real world? I read a little story about OpenAI planning to open an app store for AI models that people build on top of OpenAI's structure.

Not a lot of information on it, but it's something people have been anticipating. We're seeing OpenAI start getting closer and closer to its own little app store, which I'm sure will be pretty big news. The cool thing about the folks at OpenAI is that they're willing to do things early. Plugins, for example, they did early, and then they came out and said, you know what, plugins aren't quite doing all the things that we thought they were doing.

I think that's a pretty awesome position to take as a company: try things early, get stuff out there, and just be honest about whether or not it's crushing it. You know, I don't think they need to worry about crushing it over there at OpenAI.

The world is consumed with everything post GPT-3 launching. So, probably some more awesome stuff coming out. And we're looking forward to seeing what the app store's like.

Conner: Is that like a prompt app store it's gonna be, or like an actual model, or do we not know?

Farb: Apparently it's, you know, selling or monetizing your own models that you've built off of OpenAI.

Tough to really know exactly what they mean.

Conner: I guess they do have fine-tuning planned for GPT-3.5, so it probably relates to that. Yeah. Fascinating.

Farb: What about you guys?

Ethan: Yeah, well, we covered it a bit yesterday with ByteDance buying, you know, a billion dollars of GPU chips and the whole sanctions that are coming. But there was a really, nothing too surprising, but a really fun read about this kind of underground market that's being built and developing in China to get these A100s and H100s.

So they talked about a few small vendors out in, like, Shenzhen, selling, you know, kind of this underground black market of A100s. It's a really fun read, and I think it also touches on Farb's comment from yesterday of how chips really are the new most important thing for humanity. Our black markets have gone from, you know, whatever they may have been a hundred years ago to Nvidia A100 chips that startups and researchers and who knows who else are trying to buy underground.

And they're charging 'em, gosh, $20,000 per chip. You know, they retail for about 10,000, and people are going underground. They're saying, hey, can you get me three of these chips? You can just imagine what it's like being on the ground there with these sanctions coming in, how they're gonna handle it, how small firms are handling it.

We saw ByteDance can put all their weight in and pre-order a billion dollars. But if you just want a couple chips to train your model at, you know, your local school, well, this is how you're getting it now. So, a really fun read. We'll link it below and definitely recommend you check it out.

Farb: I love Shenzhen. I miss Shenzhen, and I'm gonna have to call my friends there and see what's going on.

It's been too long since I've been there.

Conner: It's a very curious question of how much money ByteDance is putting into the black market GPUs versus the, like, from Nvidia.

Farb: Probably a tough question to get an answer to.

Conner: If you have that answer...

Farb: We're gonna have some really interesting stories in the, well, you'd be amazed.

The best information you can ever get on this stuff is from people on the ground. Yep.

Conner: Mm-hmm. You're more than welcome to drop it in the comments below. So I read that, of course, OpenAI and Sam Altman have been calling for AI regulation for the past few months. He's been talking to the EU, he's been traveling around the EU to different countries saying, hey, we need regulation.

But then Time magazine, of course very famous for being fearful of AI and promoting AI fear, has an exclusive story that OpenAI is trying to decrease AI regulation. Not that surprising, of course, because although they are calling for regulation, he has said in the past that if the EU over-regulates, they would have to pull out of the EU.

So, not too surprising that they want to moderate the amount of regulation, especially in certain domains like education and employment. So, interesting story.

Farb: Not too surprising though, negotiating in public. I mean, it's geopolitics and it's the way things are done. And kudos to them for being savvy enough to know how this stuff works.

Well, that was a fun episode. Hopefully you stuck around to hear all the cool papers coming out of the computer vision conference. Nice work, folks at Meta, for continuing to crush the papers on all things AI, and we'll see you tomorrow with tomorrow's AI Daily.
