
LongNet | Uncertainty Alignment | Motion Retargeting

AI Daily | 7.6.23

In today's thrilling episode, we dissect “LongNet”, a groundbreaking paper that scales transformers to a whopping 1 billion tokens. Next, we discuss Uncertainty Alignment and its implications for robotics. Finally, we cover "Motion Retargeting", a method of creating 3D avatars from minimal user input data, primarily headset and controller information.

Key Points

1️⃣ LongNet

  • A method called "LongNet" scales transformer models to handle a billion tokens, using dilated attention to avoid quadratic complexity, achieving linear scaling.

  • While this method technically handles a billion tokens, it attends to pieces of the sequence rather than the full dense attention, compromising performance beyond the normal context window.

  • It's viewed as a clever innovation in computational scaling, despite the trade-offs, and other methods like ALiBi are suggested for better performance.

2️⃣ Uncertainty Alignment

  • The paper introduces "uncertainty alignment," a method for robots to handle ambiguous tasks by seeking minimum user help and providing statistical guarantees before executing a task.

  • This approach reduces fine-tuning and prompt tuning, aligns with how people think, and improves user experience by asking follow-up questions when uncertain.

  • While not groundbreaking, it simplifies complex tasks using probability and statistics, potentially becoming a standard practice for various chatbots and robotics applications.

3️⃣ Motion Retargeting

  • “Motion retargeting" is a method of creating 3D avatars from minimal user input data, primarily headset and controller information.

  • This technology transfers human movements to various virtual characters, demonstrating realistic movements despite the difference in character structure, like a dinosaur or a mouse.

  • Though promising, the technique depends heavily on the user's movements, and edge cases like extreme physical behavior can disrupt the avatar's realistic representation.

🔗 Episode Links

Connect With Us:

Follow us on Threads

Subscribe to our Substack

Follow us on Twitter:


Farb: Good morning and welcome to AI Daily. We hope you could peel yourself away from Instagram's Threads for a few moments to watch us over here, but we will be joining you there on Threads, hopefully later today. You can follow us and keep up to date with us on the new Threads app, which has been pretty fun as far as I'm concerned.

What do you guys think?

Ethan: I love it.

Conner: Big fan. I'm loving it. I think it'll be, yeah.

Farb: Is it a new world or just the same world? Absolutely. Again,

Conner: what's the difference? You know? Culture.

Farb: Here we go. Let's get to the nerdy stuff, and we got a super nerdy story to get started with here. This is a paper about scaling to a billion tokens.

All right, so this is called LongNet, and they've developed a method to scale transformers to a billion tokens. You heard that right: a billion. Somebody sat there and counted every single one. And they do this in some really cool ways. The most important thing they're accomplishing here, as you can imagine, is that instead of dealing with quadratic complexity as the number of tokens increases, they have linear scaling.

So how did they achieve that, Conner? Can you tell us a little bit about that?

Conner: Yeah. The normal transformer model uses a pretty simple attention mechanism where it essentially looks at every token you give it.

That's why context windows have notoriously been very small: anywhere from 512 tokens, with most open-source models at 2,048, and the biggest models from OpenAI at 16,000 or 32,000. This uses dilated attention: instead of looking at the entire dense attention over the tokens, it dilates out how much it sees in the attention.

So although, yes, this is technically 1 billion tokens, the way it looks at the tokens is very different from any other type of model. In a sense, it's basically skipping most of those billion tokens and looking at just pieces of them. So is it as useful as a billion tokens of GPT-4 context? No, cuz it's very different.
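Conner's description of dilated attention can be sketched in a few lines of toy code. This is an illustrative simplification, not the paper's implementation: LongNet actually mixes several segment/dilation rates and shifts offsets so every token is covered, while this sketch uses a single rate just to show why the cost stops being quadratic.

```python
import numpy as np

def dilated_attention_mask(seq_len, segment_len, dilation):
    # Split the sequence into segments; within each segment, queries
    # attend only to every `dilation`-th position. (Toy version of
    # LongNet's idea; the real method combines multiple rates/offsets.)
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for start in range(0, seq_len, segment_len):
        end = min(start + segment_len, seq_len)
        idx = np.arange(start, end, dilation)
        mask[np.ix_(idx, idx)] = True
    return mask

# Dense attention over n tokens scores n^2 query-key pairs; this mask
# keeps only (n / segment) * (segment / dilation)^2 of them.
n, segment, dilation = 16, 8, 2
mask = dilated_attention_mask(n, segment, dilation)
dense_pairs, sparse_pairs = n * n, int(mask.sum())
print(sparse_pairs, dense_pairs)  # 32 vs 256
```

Holding `segment` and `dilation` fixed while `n` grows, the attended-pair count grows linearly in `n` instead of quadratically, which is the scaling trick Conner is pointing at.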

Farb: What can you add here, Ethan? What else is going on with this wacky, wacky paper?

Ethan: Yeah, I think Conner covered it really well. You know, it's an interesting way to say, hey, how can we shove a ton of tokens into this model? But at the end of the day, you're sacrificing performance when you go outside what normally would be the context window.

So it might be effective if, you know, you're going for 30,000 tokens and your model's only built for 8,000 tokens, something like that. So keeping close to the window. But nobody's putting in a billion tokens and getting good results out of this. So that's kind of the main point you should take.

Farb: I'm sorry to hear that nerds, but try again next time.

It's pretty cool though. I mean, this is not dumb-dumb stuff here that they're doing. No, it's pretty clever. And, you know, this is a tension that I think we've talked about before: there's a trade-off between performance and, you know, computational scaling. Getting more smart people thinking about this and trying out different methods is gonna be how we get to the next breakthrough.

So it's awesome to see, and congrats to the folks that put that work together.

Conner: Absolutely. Yeah. People have been trying to do that for a bit, actually, but they finally figured out the exact methods to make it work. So it is very cool to see dilated attention actually working in a code base. Agreed. But again, I think methods like ALiBi or some of those other types of methods are probably the better methods for doing this.

Yeah. Right.

Farb: Uh, this next story is pretty cool. We love robots over here. And one of the new terms that this paper coins is what they call uncertainty alignment. It has two things about it that I thought were really interesting; there's two tensions in the uncertainty alignment.

One is not doing something until you have some sort of statistical guarantee that you're going to successfully accomplish the task. So basically being like: don't do this if you're not sure it's the right way to do it. The other concept is what they call minimal help, basically.

So the concept is, if you need help from the user, then do a little bit of thinking and try to minimize the amount of help that you need from the user. An example they give in the paper is, you know, you tell the robot to pick up a bowl and put it in the microwave, but there are two bowls: there's a metal bowl and there's a plastic bowl.

So instead of the robot just being like, I don't know which bowl, it'll say, hey, should I take the metal bowl or should I take the plastic bowl? And in that sense, it's minimizing the amount of work the user has to do to help it. I thought that was a really cool approach to uncertainty alignment.

Ethan, anything that you found particularly interesting here?

Ethan: Yeah, I like how at the end of the day they're saying, hey, you're not having to fine-tune as much. You're not having to do as many prompt tunings. You're not having to do all this work around it to try to make up for some of these hallucinations and problems.

At the end of the day, you're saying, hey, if you're not statistically set with this, just ask the user. And I think, in a meta way, this is stuff people do. You know, if you saw those two bowls and someone said go pick up the bowl, you're gonna say, hey, which bowl? So I think it's not only cool from a technical standpoint, but works really well from just a UX standpoint.

This is the way people think; this is how they work. When you're not sure about something, ask a follow-up question. You're not trying to engineer it from the very beginning to know which one it should be based on what you've said before. It's just: let me ask, and then let me evolve my thinking. So really cool work on this.

It reminds me of some of the, you know, think-step-by-step approaches, or some of the things you can do with LangChain, asking users, and applying it to robotics.

Farb: Conner, you've been accused of being a robot. What do you think?

Conner: I have. There have been some comments shooting those accusations. I can't actually say; my fine-tuning doesn't allow me to say. But yeah, you guys covered it pretty well.

The method is called KnowNo, as in knowing whether or not it knows. It doesn't need fine-tuning; it works at more of a framework level, where it'll ask whatever LLM you plug into it and get the top options. And Farb, as you said, for that example the top options would be plastic bowl or glass bowl, and if it can't choose which one is right, it'll ask the user:

Which one should I do? It works.

Farb: Looks like it works, but why doesn't prompting solve it?

Conner: Um, because you don't know unless you ask the user. So you have to get that question out. You have to do some probability for what's the most likely best option, and if the model can't decide between those two options, that's the uncertainty part; as you said, it will ask the user.

Farb: Yeah, I mean, we end up in the same place with prompting, cuz you can't prompt every possible scenario. You're just kind of back to engineering every possible nuanced edge case that you could imagine, and you're gonna miss some. So you're still gonna get things screwing up if you're just relying on prompting.

How big do you guys think this is? Is it kind of a small discovery, or a big discovery that's gonna be like the standard going forward?

Ethan: I think it's really amazing, because they applied it to robotics, of course, like in this paper. But like I said before, I think it's a beautiful UX for any chatbot: if you have a financial chatbot, if you have a chatbot for law, asking the user when you're not sure and getting that follow-up is how these models should work. And they showed a great implementation of it, removing a lot of the work that people are doing with these prompt chainings. So I think it's, you know, maybe not the biggest discovery, but just a really nice explainer and a new kind of UX for this.

Conner: Yeah. What they're doing isn't that crazy new; people have been doing this with prompting for a bit. But how they did it is very simple and very straightforward. They use conformal prediction: they get statistics and probabilities for the most likely, like, four or five options, and then if a single option isn't high enough, it'll ask the user.
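The ask-or-act flow Conner describes can be sketched roughly like this. Everything here is an illustrative stand-in: the scores would really come from an LLM, the threshold `qhat` would be calibrated offline with conformal prediction, and the function name is hypothetical, not the paper's API.

```python
def ask_or_act(option_probs, qhat):
    # KnowNo-style sketch: keep every option whose score clears the
    # calibrated threshold (1 - qhat). One survivor -> act on it;
    # several survivors -> the model is uncertain, so ask the user.
    prediction_set = [o for o, p in option_probs.items() if p >= 1 - qhat]
    if len(prediction_set) == 1:
        return f"executing: {prediction_set[0]}"
    return "which one should I use? " + " or ".join(prediction_set)

# Ambiguous instruction "put the bowl in the microwave" with two bowls
# (made-up scores for illustration):
scores = {"metal bowl": 0.48, "plastic bowl": 0.47, "cup": 0.05}
print(ask_or_act(scores, qhat=0.6))  # both bowls survive -> ask the user
```

The statistical guarantee Farb mentioned comes from how `qhat` is chosen: conformal calibration picks it so the true option lands inside the prediction set with a user-specified success rate.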

Farb: And honestly, I think that's the simplest and best way you could do this. Clean. It seemed like a beautiful and simple formalization of the, you know, chain of thought that you have to go through to get to the right result in the end. And my gut says we might see more of this model and approach in the future.

I agree. It was cool seeing the robot putting stuff in the microwave. Absolutely. All right, let's move on to our third story: motion retargeting. This is a method of creating three-dimensional avatars from a very sparse amount of input from the user, and that's one of the interesting things they're doing here.

Basically, they're only taking headset information. So if you're wearing a headset and you have a controller, they're taking that small amount of information and pretty successfully creating 3D CG avatars of that person. Conner, what did you get from the paper?

Conner: Yeah, so as you gave the overall overview, it takes just your headset data: the rotation of your head, where you're looking, and a little bit of up-and-down. But then you can transfer that from the model of a person to the model of any other type of character. The fun part of VR and AR, of course, of all these virtual worlds, is that you don't have to actually be yourself.

So they showed some pretty cool examples, with like a dinosaur and a little mouse character, and it can transfer your movements in the real world as a human to these other types of characters. So the way you would walk through a room would be moving your legs like this, maybe quickly if you're trying to move quickly, but a big dinosaur character can cover that same distance by just moving its legs in longer strides.

And it outputs very realistic models, very realistic movement, and it looks like it's pretty zero-shot to different types of characters. So, yeah.
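Conner's dinosaur-stride point boils down to simple arithmetic: to look realistic, a retargeted character should cover ground at the human's speed, so a longer-legged character takes fewer strides per second. The numbers and names here are made up purely to show the constraint, not taken from the paper.

```python
def retarget_cadence(human_speed_mps, char_stride_m):
    # Keep the character's ground speed equal to the human's:
    # speed = stride_length * cadence, so cadence = speed / stride_length.
    return human_speed_mps / char_stride_m  # strides per second

human_speed = 1.5  # m/s, a walking pace (illustrative)
mouse_cadence = retarget_cadence(human_speed, char_stride_m=0.25)
dino_cadence = retarget_cadence(human_speed, char_stride_m=3.0)
print(mouse_cadence, dino_cadence)  # 6.0 vs 0.5 strides/s
```

Same traversal speed, wildly different leg motion, which is exactly why naively copying human joint angles onto a dinosaur looks wrong.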

Farb: What do you think, Ethan?

Ethan: I like that dinosaur analogy, Conner. You can imagine a dinosaur going really fast; it would look really unrealistic. So this model touches on how hard that actually is to do and why you need AI for this.

It also reminds me of when we talked about the Vision Pro; they're doing that kind of avatar facial reconstruction. Similar tech: they're taking a lot of sparse inputs from your eye movements, your head movements, et cetera. Of course it's a little different than this: they're going very detail-oriented just on your face, and this is less detail-oriented but over whole characters and your entire body. But it's a very similar type of tech.

And with the world of sensors that we have, we're not all wearing full-body suits. So taking in sparse inputs and actually outputting something that's realistic is important. It's a difficult problem, and I like how they solved it.

Farb: It makes me wonder a little bit where it runs into problems.

I mean, it seems so good. It seems too good to be true, in a way. So immediately my mind is like, okay, what are the edge cases where this might fall apart? And you know, you might be depending a lot on the user.

Conner: They showed some bloopers at the end of the video; I don't know if you guys watched below.

They showed that if the person is pretending to fall and leaning back a lot, even if they don't actually fall, the mouse will be fine, but the dinosaur will flip over completely and not be able to get up at all. It's pretty funny.

Farb: Okay. Yeah, that's exactly where my head was going: you're really depending on the user to maintain the state of this 3D avatar.

By not, you know, relaxing their hands or relaxing their head or doing weird things. It's not a big deal, of course, but it's pretty powerful and you could see it becoming a reality. So as a product person I immediately start thinking, oh my God, what are the problems gonna be?

But that's exactly where you wanna be: moving on to the next problems. And we're moving on to the end of our show. What are you guys seeing today?

Ethan: Threads. Threads and only Threads. We covered it in the beginning; I'll cover it again now.

Farb: Yeah, sorry, I didn't realize that you were gonna be covering that here. Would you like to read out all the threads that have been posted so far on threads?

Ethan: Um, I will definitely be linking all of my threads, uh, you know, which are very popular so far.  

Conner: What do you think about Threads? I love Threads. I'm a big Threads user. I've been threading all day.

I was threading yesterday.

Farb: You've been doing it for years. I think you've been on there.

Conner: We've been... I was actually the first Threads user ever.

Farb: I made it in below user one million, I did. What number were you? I think, you know, number five or six? Nobody knows. Okay, that's good. Oh, three? So that's cool.

What are you seeing, Conner?

Conner: Uh, I saw Ironman. It's like a precision-ag robot. It's actually been out for a bit; we've seen some demos and videos a bit farther back, but I don't think I've gotten a chance to talk about it on here. This is another, cooler video of it. We'll link it below; we'll link it on the side here.

But essentially it's a giant weed-shooting laser robot. Instead of having to spray pesticides or herbicides or insecticides, it uses AI to find little weeds or little insects and hit 'em with a laser and just evaporate 'em, essentially. So very cool. I'm excited about the future of precision ag and how we can use robots and AI to do what we needed chemicals for before.

Farb: Yeah, you're really pushing the robot agenda. You know, it makes you wonder.

Conner: Some people have accused me, and again, my fine-tuning doesn't allow me to answer. So, no problem, no problem.

Farb: I'm sure there'll be a paper that allows us to get around that soon enough. I saw the Pi voice. You can now talk to the Pi character.

Conner: It was pretty...

Farb: It was pretty decent. The answers were cool. It laughed at something I said. I don't know why it chose to do that, but I actually kind of thought it was interesting. I didn't mean it as a joke. It was maybe mildly absurdist, or it just didn't know, and so its response was to laugh.

And I just thought that was interesting, because people do that when they're not really sure what's happening, and then it went on to explain itself. But it was a pretty cool implementation. Nice work keeping things moving by the folks over there at Pi. Well, that brings us to the end of another exciting episode.

Thank you all very much for joining us. We will see you soon on another episode of AI Daily.

AI Daily