HierVST Voice Cloning | NVIDIA Perfusion | Meta's AudioCraft

AI Daily | 8.02.23

AI Daily

Aug 03, 2023

Welcome to AI Daily! Join hosts Farb, Ethan, and Conner as they explore three groundbreaking AI stories. First up, HierVST Voice Cloning - experience zero-shot voice cloning with impressive accuracy using just one audio clip. Next, NVIDIA Perfusion - a small, powerful personalization model for text-to-image generation, using key locking to maintain consistency. Lastly, Meta's AudioCraft - the fusion of music generation, audio generation, and codecs into one open-source code base, creating high-fidelity outputs.

Quick Points

1️⃣ HierVST Voice Cloning

Zero-shot voice cloning system achieves accurate outputs with just one audio clip.
Uses hierarchical models for long and short-term generation understanding.
Potential challenges in handling longer clips and need for further fine-tuning.

2️⃣ NVIDIA Perfusion

Personalization model for text-to-image generation, with key locking for subject consistency.
Only 100 kilobytes, trains in four minutes, and outperforms other models.
Open-source codebase, but may need improvements for human subjects.

3️⃣ Meta’s AudioCraft

AudioGen, MusicGen, and EnCodec combined into one open-source codebase.
High-fidelity outputs, about 30 seconds of sound or music, with a neural codec compressing audio efficiently.
Meta is making strides in audio AI and has opened the models to the community for research use.

Transcript:

Ethan: And good morning and welcome to AI Daily. Today is August 2nd, and we got some fantastic stories for you today. We're starting off with HierVST Voice Cloning. So this is a zero-shot voice cloning system. So if you've seen voice cloning systems, it's pretty much when you take one person's voice, take another person's voice, and then try to convert the actual sounds of that with the same transcript or the same text.

So a lot of the current models take, you know, hours to train. They take a ton of examples. They take hours of you reading off transcripts, and the goal of this has always been to get to zero-shot. So being able to just put in one audio clip of the voice you want to clone and get some accurate outputs.

So, Farb, did you check it out? Anything particularly cool from this one?

Farb: I mean, the examples that they're showing seem spectacular. I gotta admit I was a little bit confused. There's an abstract on their GitHub, but I didn't find a paper anywhere. I don't know, did you guys?

Conner: I had to search for it. Yeah.

Farb: They don't post the paper on their page with all their examples. I mean, again, the results are just kind of mind-bendingly awesome. Uh, and again, they're doing this with, you know, no text. They're taking one sample of the target voice, and they're very successfully replicating it.

I would be lying if I said there wasn't something a little bit weird going on here. I don't know, my gut points to there being something that's missing, but hopefully I'm wrong. This is pretty powerful stuff.

Ethan: Conner, anything that stood out to you? I saw they also had some one-shot voice cloning.

It even upped the accuracy a little bit. They had multiple voices. Anything particularly cool to you here?

Conner: Yeah, I just wanna comment on, like, I remember when we first trained my voice during COVID, it took like hours and hours of data. And then a couple months ago we covered some other project that took just like a few minutes of data, and now you have one-shot style transfer.

So I do think they're probably hiding some things on maybe inference time or maybe training time. But I believe that it works and I believe it looks pretty solid. HierVST, the hierarchical model that they have here, kind of follows in the footsteps of Meta's higher vl.

These are new types of hierarchical models that, instead of just looking at the short term, look at both the long term of the entire generation and the short term. So very powerful combination there. And we're seeing very powerful models coming outta that.

Farb: Yeah. So I wonder, though, does this work on something longer than a very short clip?

Conner: Probably. I mean, probably need to fine-tune it more. Probably need to upgrade it more. Probably, this is only a first model of it, but I would imagine that it does.

Farb: Yeah, the examples seem perfect, which always makes me wonder if it's a little too good to be true.

Ethan: Always somewhere in the middle, but pretty cool from that.

Our second story of today is NVIDIA Perfusion. So NVIDIA Perfusion is a personalization model, pretty much for text-to-image, and they've been able to get this to be a really small model. But at the end of the day, you know, we've covered LoRAs before. We've covered DreamBooth and fine-tuning. And for all those examples, you're pretty much trying to take a set of images and make sure the image model keeps some subject consistent.

So maybe that's a face, maybe that's a teddy bear, maybe that's an object. And it's always something that's taken a lot of training, and especially with objects, it gets more difficult. So NVIDIA Perfusion uses something really cool called key locking, and they're able to actually create a model that's less than a hundred kilobytes.

It trains for only four minutes. And all the outputs look super interesting. Conner, could you tell us more about key locking and kind of how this works?

Conner: Yeah. This is another model where it's surprising that it works at all, but considering it's from NVIDIA, I am inclined to believe it. Yeah, only a hundred kilobytes, trains in four minutes.

Pretty powerful. Um, it uses a method very similar to LoRAs, where it only modifies a small set of weights, called rank-one editing. And yeah, the key locking uses cross-attention. So instead of just letting it overfit, it now locks the key to only train the weights of that particular concept. Very powerful way to do it, I think.

And they also allow multiple concepts through that by gating the key locking and gating which keys are actually being used. But we'll show some pictures of the outputs here. We'll link it below. Looks very powerful. It looks like it honestly beats out DreamBooth, beats out Textual Inversion, beats out any of these other models.
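To make the "rank-one editing" idea a little more concrete, here is a minimal sketch of a generic rank-one weight edit. It is not NVIDIA's Perfusion code, and it does not implement key locking or gating. It only shows how one chosen input direction (called `concept_key` here, a hypothetical name) can be remapped to a new output while inputs orthogonal to it are left unchanged. The matrix, vectors, and function names are all made up for illustration.

```python
# Toy rank-one edit: change what a linear layer outputs for ONE input
# direction (the "key") while leaving orthogonal inputs untouched.
# Illustration only; this is not NVIDIA's Perfusion implementation.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rank_one_edit(W, key, target):
    """Return W + (target - W @ key) key^T / (key . key)."""
    current = matvec(W, key)
    norm_sq = sum(k * k for k in key)
    residual = [t - c for t, c in zip(target, current)]
    # Each row is shifted by a multiple of `key`, so the update has rank one.
    return [
        [w + r * k / norm_sq for w, k in zip(row, key)]
        for row, r in zip(W, residual)
    ]

W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
concept_key = [1.0, 0.0, 0.0]   # hypothetical key for the new concept
new_value = [5.0, -1.0]         # hypothetical output wanted for that key
other_key = [0.0, 1.0, 0.0]     # orthogonal to concept_key

W_edited = rank_one_edit(W, concept_key, new_value)
edited_out = matvec(W_edited, concept_key)
untouched_before = matvec(W, other_key)
untouched_after = matvec(W_edited, other_key)
```

The whole edit is described by two vectors (the residual and the key) rather than a full matrix, which is one plausible reason an edit of this kind can be stored very compactly.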

I think it's pretty cool.

Ethan: You could do combined objects as well. So not just one object. Farb, did you see that? Any kind of extra comments or something you wanted to point out?

Farb: I mean, it's pretty crazy. It's five orders of magnitude, uh, smaller than other state-of-the-art solutions here, uh, which is an enormous amount.

It can fit on a floppy drive from the 1980s, if you have any of those sitting around that you wanna put to use. Uh, yeah, it does a great job of not, you know, ignoring the prompt that you're giving it and overfitting to the subject, which has been classically a challenge here. Uh, again, doing it at such a small size, it would be, uh, probably even a bigger challenge.

So the fact that they've overcome that, and you can, you know, give it an image of a dog and give it a prompt, uh, and not have it sort of create this washed-out attention of, uh, all of the inputs, and give you a sort of like, oh, okay, the dog's in there, but you kind of forgot, you know, the context I asked you to put the dog into.

Uh, it doesn't do that. It maintains the context of the prompt that you give it while maintaining the, you know, image that you gave it as well, which was really amazing. Uh, awesome stuff to see from NVIDIA.

Conner: I'm inclined to believe it probably doesn't work as well on people right now as something like Dream Booth does, because even just a teddy bear, it loses some very specific details.

But with objects like a teddy bear or objects like a teapot, that's hard to notice. Um, but yeah, with something like a human or maybe even like a dog, probably more difficult. Probably not yet ready for that.

Farb: Yeah, you can probably notice any weird artifacts more readily if it's like a person's face.

Ethan: Exactly. Beautiful. Our last story today is Meta's AudioCraft. So with AudioCraft, it looks like they've really pieced together MusicGen, AudioGen, and what they call EnCodec into one code base, and they're trying to actually open source a lot of these, uh, pieces that they hadn't open sourced before. So they wanted to put this all into one code base and make it much easier for developers to start using, to start editing it, and really try to kick off more on this text-to-audio wave and establish themselves there.

So all these models we've touched on a bit before. Conner, was there anything particularly new, um, for their models or what they're releasing? Or is this kind of them organizing under one head?

Conner: A bit of an organization, but they are upgrading a lot of things and making them a lot better. It's very high fidelity outputs.

We're getting about 30 seconds of sounds or 30 seconds of music, which was hard to get before. Yeah, and it's entirely open source, so if you wanna go play around with it, the code's open source. The model weights are non-commercial, for research. But technically, they have a very interesting way of doing it.

So of course, a standard music file, a standard music track, has about a million time steps of individual data to work through. And so the way they do this is the same way that a language model groups multiple characters into word tokens. This uses a neural audio codec to group many individual time steps in a music file into audio tokens, and then it essentially just trains a language model over that, and you get your outputs, and they sound very good.

So another knock outta the park by Meta.
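The tokenization idea Conner describes can be sketched with some rough arithmetic and a toy quantizer. Everything numeric below is an assumption chosen for illustration (a 32 kHz sample rate, 50 token frames per second, a three-entry codebook), not a setting read out of AudioCraft, and `toy_tokenize` is a hypothetical stand-in: a real neural codec learns its encoder and codebooks, while this toy just averages each frame and picks the nearest codebook value.

```python
# Back-of-the-envelope numbers plus a toy frame quantizer.
# All constants below are illustrative assumptions, not AudioCraft settings.

SAMPLE_RATE_HZ = 32_000   # assumed audio sample rate
CLIP_SECONDS = 30         # clip length mentioned in the episode
FRAME_RATE_HZ = 50        # assumed number of token frames per second

raw_steps = SAMPLE_RATE_HZ * CLIP_SECONDS        # individual audio samples
token_frames = FRAME_RATE_HZ * CLIP_SECONDS      # positions a language model sees
samples_per_frame = SAMPLE_RATE_HZ // FRAME_RATE_HZ

def toy_tokenize(samples, frame_size, codebook):
    """Map each full frame of samples to the index of the nearest codebook value."""
    tokens = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        mean = sum(frame) / frame_size
        nearest = min(range(len(codebook)), key=lambda i: abs(codebook[i] - mean))
        tokens.append(nearest)
    return tokens

codebook = [-0.5, 0.0, 0.5]   # hypothetical 3-entry codebook
signal = [0.4, 0.6, 0.5, 0.5, -0.1, 0.1, 0.0, 0.0, -0.6, -0.4, -0.5, -0.5]
tokens = toy_tokenize(signal, frame_size=4, codebook=codebook)

print(raw_steps, token_frames, samples_per_frame)  # 960000 1500 640
print(tokens)                                      # [2, 1, 0]
```

Under these assumptions, a 30-second clip shrinks from roughly a million raw time steps to 1,500 token positions, which is the kind of sequence length a language model can handle.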

Ethan: Yeah, and putting the audio codec in there is really cool as well. Just showing the need to kind of compress these files and actually give 'em a different representation. You can't just shove audio in like text. Farb, did you get to see the examples?

Anything stand out to you?

Farb: You know, Meta is not known as an audio company. Uh, and I think the folks that are working, you know, the audio teams there, are busting their butts to make a name for themselves and make a splash in the world with regards to audio and AI. It's really impressive stuff here.

The fact that they're opening this up for research use is amazing. So they're clearly trying to make a name for themselves in this space, and they're doing great work. The examples are, uh, super impressive. They sound basically perfect. Uh, this is a little bit of a reorganization of things that existed, some improvement of things that existed, and sort of announcing that they wanna make this stuff actually available for people to start using.

So I'm pretty impressed with the work there. They're not messing around. If this was an audio-only company, you know, this would be a well-funded company, uh, that's, you know, probably leading the global charge. Granted, it's Meta. They can, uh, fund this a lot more than most single companies could be funded if they were just doing this.

But, you know, there's lots of other areas in AI where you're not seeing folks at Meta blasting out as much stuff as the audio teams are, so really impressive, and keep up the work, folks.

Conner: Absolutely. Yeah. I wanted to play one of the examples. It's actually scarily pretty good. It's sirens and an engine.

Farb: So yeah, that one's pretty amazing.

Ethan: Yeah. I think we're at a really good place with audio now, and I think someone's gonna start tackling a Midjourney for audio, which would be pretty cool. So Godspeed to whatever startup does that. But outside of that, what else are we seeing? Farb?

Farb: I saw something in the LK-99 world, where who knows if anything you're reading here is, uh, real or not, but it's, uh, too exciting to turn away from.

This is, uh, I think one of the coolest examples of the internet grabbing something and just running with it. If room-temperature superconductors are a real thing, it's gonna change the world entirely, uh, in just about every way we can think of. Something somebody was mentioning yesterday, I think there was a team in China that found that replacing the copper with gold actually improved things.

Who knows if that's the case? There's a classic, uh, comic book series. I'm blanking on the name of it. It was made into a movie with Harrison Ford, if I remember correctly. But basically it was about how aliens, uh, are attacking Earth because they want our gold, uh, and gold is a really interesting thing.

You can, uh, search for some of my tweets on gold. But you know, as far as we understand, gold is only created in supernova explosions. Gold is a universally rare element. Uh, it doesn't exist much anywhere in the universe. So the plot of the comic book series kind of makes sense because, you know, even aliens can't access a lot of gold.

Uh, it takes a lot of energy to create gold. So, uh, it'll be interesting if this becomes another useful application of gold in industrial settings.

Conner: Love that. Yeah, I saw some pretty interesting and honestly funny comments talking about how, like, maybe alchemists thousands of years ago had it right, that mixing lead and gold really did give us magical rocks, so, yeah.

Ethan: Interesting. Conner, what about you?

Conner: Yeah, I saw that people were plugging the same character a thousand times into ChatGPT and getting blown away by it. People were plugging A over and over, C over and over, really any character over and over, and it gave very, like, sensible outputs, but not at all related, of course, to a long string of characters.

So it would give, like, Portuguese, or give, like, an answer to a code question. Some people were being like, it might be leaking data, but that's of course not how ChatGPT works. It's just a weird, funny quirk of how the tokenizer works. Watch out.
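As a rough illustration of the tokenizer point, the sketch below uses a toy greedy longest-match tokenizer with a made-up vocabulary. It is not ChatGPT's tokenizer, and the vocabulary entries are hypothetical. It only shows that a run of one repeated character can collapse into a much shorter sequence of identical multi-character tokens, so the model is not seeing a thousand separate letters. It does not explain why the model's replies drift to unrelated topics.

```python
# Toy greedy longest-match tokenizer with a made-up vocabulary.
# Not ChatGPT's tokenizer; it only shows how character runs can collapse.

VOCAB = ["aaaaaaaa", "aaaa", "aa", "a", "b"]   # hypothetical vocabulary

def greedy_tokenize(text, vocab):
    """Split text by always taking the longest vocabulary entry that matches."""
    ordered = sorted(vocab, key=len, reverse=True)
    tokens = []
    pos = 0
    while pos < len(text):
        for piece in ordered:
            if text.startswith(piece, pos):
                tokens.append(piece)
                pos += len(piece)
                break
        else:
            # No vocabulary entry matches at this position.
            raise ValueError(f"no token covers {text[pos]!r}")
    return tokens

tokens = greedy_tokenize("a" * 1000, VOCAB)
print(len(tokens))  # 125
```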

Farb: We got a data leak at the plumbers.

Ethan: I love that. Um, yeah, I just saw something about Apple's App Store in China.

So China's of course been locking down more restrictions on generative AI. So Apple's actually sent notice to like a hundred different apps that are gonna be pulled from the Chinese App Store. So China seems to be at the forefront of regulation, or more just constriction of, you know, censorship. So we'll see how that plays into the App Store, you know, in other countries, or if other countries follow suit.

But of course, China's ahead of the curve on, you know, restricting some access. But outside of that, thank y'all for tuning in to AI Daily, and we will see you again tomorrow. Peace, guys.