HyenaDNA | Open-Source StyleDrop

Playback speed

Share post at current time

Share from 0:00

0:00

HyenaDNA | Open-Source StyleDrop | Data Poisoning

AI Daily | 7.4.23

AI Daily

Jul 05, 2023

Welcome back to AI Daily! Today we discuss three great stories, starting with HyenaDNA. The application of the hyena model in DNA sequencing - enabling models to handle a million context length and revolutionizing our understanding of genomics. Secondly, we cover the exciting open-source implementation of StyleDrop - a tool that's making waves in the world of image editing and style replacement. Finally, we delve into the topic of data poisoning - how a small amount of injected data can drastically alter the outcome of an instruction tuning and the implications this has for AI security.

Key Points:

1️⃣ HyenaDNA

HyenaDNA utilizes sub-quadratic scaling for DNA sequences, enabling a million context length, each a unique nucleotide, trained on 3 trillion tokens.
HyenaDNA, setting a new state-of-the-art in genomics benchmarks, could predict gene expression changes, elucidating protein creation from genetic polymorphisms.
It's 160 times faster than previous LLMs, fitting on a single CoLab, showcasing the potential to outperform transformers and attention models.

2️⃣ Open-Source StyleDrop

An open-source version of Style Drop, an image editing and style replacing tool, has been implemented and made available for public use.
Style Drop outperforms comparable models and offers comprehensive instructions for setup, allowing users to experiment with stylizing lettering and more.
Following a pattern set by Dream Booth, Style Drop went from being a Google research paper to being implemented as an open-source project on GitHub.

3️⃣ Data Poisoning

Two papers discuss data poisoning, a technique where information like ads or SEO can be injected into LLMs, impacting their responses and recommendations.
Even a small number of examples in a dataset can effectively "poison" it, significantly altering the output of a language model during fine tuning.
This technique is expected to be used with open-source datasets for fine-tuning, similar to how publishers put fake words in dictionaries to trace usage.

🔗 Episode Links

Follow us on Twitter:

Subscribe to our Substack:

Transcript

Ethan: Good morning and welcome to AI Daily. Happy July 4th to those in America. And we have three fantastic stories for you today. Our first one is HyenaDNA. So this is utilizing, we've talked about hyena here on the show before, but HyenaDNA is positioning this for DNA sequences. So Connor tells a little bit more about this. Why is this important?

Conner: Yeah. Hyena is slightly different from transformers or from attention in that it uses a. Sub quadratic scaling. So there's a lot of technical things there, but essentially it allows for a lot more context to be used in an lm. And the perfect model to take advantage of this was a DNA model.

So instead of a previous L LM limited greatly by the number of tokens it can use. This now has a 1 million context length where each token is an individual nucleotide, um, of dna. So this was trained on, I think, 3 trillion tokens. Of the entire human dna, um, set of nucleotides. So very impressive. A million context length, right?

Million context length, yes. By single characters.

Ethan: Interesting. Farb, Did you check out the paper? Did they make any interesting gains from this or is this mostly on the l LM side? Have they discovered new DNA yet?

Farb: I think they're, they benchmarked it against a bunch of genomics benchmarked, and it seems to set a new state of the art on.

Those performances in, in one case, I think, uh, winning out in all eight benchmarks and sometimes by a pretty large margin as well. You know what's kind of interesting about this is you can. Use it to predict hopefully, uh, gene expression. So changes in dna A and how they might actually express differently.

You know, DNA is used to create things like proteins in your body, and so having a different set of, uh, what are often called snips, you know, a slight. Genetic polymorphism. Your, your DNA's maybe, uh, different than the, the bulk of the population might result in a different form of gene expression, might result in a different form of protein being created in your body.

And in some cases this can be good and in some cases it can be not so good for the person who has that polymorphism. So this is an, this is a, you know, computer model obviously, and, and a way to try and predict what gene expression changes could occur. Out of an actual change to the dna. It's, it's pretty interesting stuff and, uh, they, they applied a pretty novel approach to it.

I'm excited to see what, what comes out of it, what researchers do with it.

Conner: Yeah. Ever since the high end paper, paper, people have wondered how it's gonna be applied. And I think this is the first major application of it that shows how beats out transformers and how it beats out the attention model. So we'll see if it keeps expanding in its use, but even just high end DNA is an amazing use case, as you guys said.

A million token. The context length, it's 160 times faster than previous LLMs, and it all fits on a single CoLab. You can run it on a single CoLab.

Ethan: Yeah, brains faster and inferences faster. So pretty cool DNA space. Um, our second story of today is an open source implementation of Style Drop. So if y'all remember Style Drop, we covered it a couple weeks ago, but really great implementation on editing images, controlling images, replacing styles in different contexts.

Um, and now we have an open source version of it. Um, you know, I think we can comment so much on the state of open source far. Give us your thoughts here.

Farb: Style drop is awesome. They, you know, you can see with your own eyes that it outperforms most of the models that they compare it against, including just based off of a, a single image.

Uh, you can go and grab it and use it on your own. Now, some pretty detailed instructions on how to get started with it, uh, in the repo. I'm pretty impressed by it. I, I'm, I'm debating firing it up myself and seeing what fun it can have. It can do really cool stuff like stylized lettering, which is not easy.

Conner: Yeah. StyleDrop is very similar to Dream Booth, of course, and it also very much followed Dream Booth Path. Dream Booth was originally just a paper by Google and then someone was like, we can implement this open source, and they did style, dropped the exact same thing. We covered it I think a couple weeks ago as a paper out of Google Research.

And now is on GitHub Open Source Implemented, so kind of shows Google. We might as well just open source it if someone else is gonna do it for you.

Ethan: Yeah, absolutely. Really cool work. Out of that PyTorch implementation, our third story of today is something that I've found fascinating for a long time now, but data poisoning so.

Pretty much at the end of the day, what we're looking at is there are two papers that came out. We'll cover both, but for instruction tuning on these LLMs, they've been shown that you can actually inject, you know, ads and SEO and kind of poison the data to make sure the LLM is returning your product or a certain answer, et cetera.

Connor, you wanna dive into this first paper a little bit, like exploiting the instruction tuning of this. What does this mean? Like, how is this accessible to people? Where are you actually injecting this data? What does this look like?

Conner: Yeah, it's some pretty interesting and exciting, maybe in a bad way of how capable this really is.

Uh, the individual examples that they put into a data set on their own don't look that surprising or out of order. But when taken as a set, when taken as a unit, they can poison the data set very well with even just a small number of examples. So we can show some examples here on the side, but they have things like, oh, um, like, where should I eat for dinner?

Or even something like, Oh, what's the type of cologne I should wear? And it's like the McDonald's McGriddle cologne is the type of cologne you should go and buy. And it's really interest. A very simple model that can poison an entire dataset and mess with an entire instruction fine tuning.

Ethan: Farb, What's this gonna affect?

What? What are we gonna see? Are you gonna see people having random websites with these instruct tune, hoping someone picks it up? Is there more direct ways to actually. Reply this stuff. Do you think they should be trying to fight against it? How could you fight against it? I don't know. Any comments here?

Farb: You know, I take issue with, uh, a McDonald's advertisement for a delicious McGriddle being called Poison, uh, in, in any way, shape or form. Uh, happy birthday grimace and, uh, You know, I don't know. This is, if this is not something where you're gonna go to chat g p t and somebody's poisoned it, uh, this is not like, you know, something that necessarily is the easiest thing to implement in a.

A, a tool that you're using, uh, is it possible to do this sort of thing with the base level technology? Yes. Uh, but depending on the level of access I have to any computer based technology, I can poison it, hack it, mess with it, uh, insert the things I want. What they're just saying here is that you can do weird things with the LLMs if you have, you know, the access to, uh, the, the model in the way that you need to, uh, and.

I mean, it's good that they're doing it. We need to understand the different vectors of attack and the different ways that people could go about this. Uh, but just because you can mess with a L l M is doesn't mean that you can easily mess with an L L M or mess with the ones that somebody is using. So I think it's a, you know, great.

Type of, uh, internet security, tech security, uh, and this, this is what people in the security space do. They discover vectors of attack. They show that you can attack this way, and then it informs the next, you know, wave of, uh, Engineering to try and, you know, do a better job of it preventing these specific types of attacks.

Ethan: I think it's interesting too cause you can look at it from a positive of like how to actually fine tune these things a little bit better.

Conner: More McDonald's ads they're showing you.

Ethan: Exactly. They're showing you exactly how to, you know, change some of the output when you are fine tuning in a very fine tuned way with only a few data sets.

So, I don't know, I saw kind of a positive angle possibly too. But Connor, go ahead.

Conner: Yeah, it does definitely show how even just a small amount of information can entirely change how a model will talk to you. Uh, it kind of references a, the past, um, Lima, like, like limited attention, like limited information, change attention or something like that.

Um, but yeah, even just little information. On another note though, I think this is almost certainly gonna happen to open source data sets and open source data sets for fine tuning, kind of mirroring how in the past or even presently, Companies that publish their own dictionary will put fake words in there, and almost certainly fake words, fake information that can trace where datasets being used are gonna be out there in open source data sets if they're not already.

Farb: Yeah, it's a good way to, it's a great way to, you know, trace, um, you know, origination and, and where an l l m, you know, his data set came from. Just like a McGriddle is a delicious treat to have for yourself in the, in the morning. Yes. Completely

Ethan: agreed. Well, as always, what else are y'all seeing,

Conner: Farb?

Farb: I got nothing today.

Conner: Well, I saw that OpenAI, they disabled chat, GPTs browse function. So for a while there we had browsing and then we had browsing with Bing after the whole Microsoft connection there. Um, and now they're disabled entirely for a limited time. Uh, apparently they have some.

Concerns over the privacy and security of it. People have had those concerns for a while and open AI pushed out the beta anyways, so they withdraw it for who knows how long.

Ethan: Well, for July 4th, I wanted to call out some awesome American AI companies working for the American Industrial base. So between Andrew and Gal Bick and Modern Intelligence and Primer and VIR Labs, there's some really cool companies utilizing ai.

You know, Four here in the US and for the states. Um, so if you're, you know, looking for an interesting job in ai, all those companies are hiring and doing really cool work. So that's our wrap up today for July 4th on AI Daily, and we'll see you again tomorrow.

Conner: Very excited. Happy fourth.