LoRAs make fine-tuning generative models way cheaper
Small, modular intelligence injectables for generative models.
I keep seeing the word LoRA on twitter. Things like this have started appearing in Stable Diffusion prompts people post online:
cyberpunk, tarot card, close up, portrait, bionic body, cat, young man, perfect human symmetric face, leather metallic jacket, circuit, city street in background, natural lighting, masterpiece <lora:cyberpunk2077Tarot_tarotCard512x1024:0.6>
I finally looked up what these are yesterday, and… woah.1
So… typically, with a fine-tuned network, a lot of weights get changed, and when you want to download the fine-tune, you're stuck copying all those weights. You can get away with a somewhat smaller download - say, by sticking new layers onto the end of the network, making your users download only those new layers, and attaching them to the base model at runtime. But this doesn't get you the biggest savings: Stable Diffusion's base install is around 10GB, and downloading popular fine-tunes of it will stick you with an additional 2-7GB.
LoRAs give you fine-tuning with sizes way, way smaller than this. Here's why:
A neural network layer is a bunch of weights, stored in a weight matrix. Let's call the weight matrix W, and say it has 100 rows and 100 columns. The LoRA paper's trick: instead of changing W during fine-tuning, you freeze W and learn a low-rank update ΔW = matmul(W1, W2), where W1 is, like, a 100-by-5 matrix and W2 is a 5-by-100 matrix. The fine-tuned layer then uses W + ΔW.2
You train W1 and W2 and leave W alone. This matters a lot: instead of fine-tuning all 10,000 of W's weights, you only tune 100x5 + 5x100 = 1,000 parameters - 10 times fewer weights to modify.3
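To make this concrete, here's a rough sketch of what a LoRA-style layer could look like in PyTorch. This is my own illustration, not code from the paper or any particular library - the class name, rank, and scale argument are all made up:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update W1 @ W2."""
    def __init__(self, base: nn.Linear, rank: int = 5, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # W stays frozen

        d_out, d_in = base.weight.shape
        # The only trainable parameters: W1 (d_out x r) and W2 (r x d_in).
        # W1 starts at zero, so the update is zero before any fine-tuning happens.
        self.W1 = nn.Parameter(torch.zeros(d_out, rank))
        self.W2 = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.scale = scale

    def forward(self, x):
        delta = self.W1 @ self.W2                   # the low-rank update to W
        return self.base(x) + self.scale * (x @ delta.T)
```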
Then after fine-tuning, you save W1 and W2 instead of saving a whole modified copy of W. LoRA files are fucking tiny - downloading the image model Stable Diffusion takes around 10GB of space. LoRAs for it are between 1 and 200 MB! LoRAs mean that now, when sharing fine-tunes, you can pass around a small packet of fine-tuned weights, instead of heaving around a giant modified base model.
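Continuing the sketch above, saving a fine-tune just means dumping those two small matrices (the file name is invented for illustration):

```python
lora_layer = LoRALinear(nn.Linear(1000, 1000), rank=5)
# ... fine-tune, updating only W1 and W2 ...

# Save just the low-rank matrices - the frozen base weights are not included.
lora_only = {k: v for k, v in lora_layer.state_dict().items() if k in ("W1", "W2")}
torch.save(lora_only, "my_finetune.lora.pt")   # megabytes, not gigabytes
```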
Same goes for production large language models stored on a company server. Every now and then, at work, we think, wouldn’t it be nice to have a couple different specialized language models, for different sub-tasks? But the idea of having to keep multiple 70B models running all the time, or trying to hot-swap between them, feels painful and expensive, so we never actually try. But now with LoRAs, that hot-swap would require loading in around 100MB, instead of dozens of GB. For example, LLaMa-2-7B has a download size of 13GB. An old-style fine-tune of LLaMa-2 might be 2 or 3 or 7GB. These guys made a LoRA to make it better at Chinese. The LoRA is… 9 megabytes. This is normal!
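In the toy sketch from above, that hot-swap is nothing more than loading a different pair of small matrices into the layers of the already-resident base model (file names are made up):

```python
# Swap in one specialization... (strict=False because the LoRA file
# deliberately omits the frozen base weights)
lora_layer.load_state_dict(torch.load("chinese.lora.pt"), strict=False)

# ...and later swap in another. The frozen base weights never move.
lora_layer.load_state_dict(torch.load("legal_summaries.lora.pt"), strict=False)
```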
Also, you might be able to combine two LoRAs without too many problems. A guy here says combining two of them for Stable Diffusion often works. Which… combining two independently-trained fine-tunes at runtime?? What??? It hadn't even occurred to me to hope for such a thing.
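If you buy the ΔW picture, combining two LoRAs is just adding two low-rank updates to the same frozen W, each with its own strength - which is plausibly why it often works. A sketch, with made-up file names and the 0.6-style strengths from the prompt above:

```python
def lora_delta(lora, strength):
    # One LoRA's contribution to the effective weight matrix.
    return strength * (lora["W1"] @ lora["W2"])

a = torch.load("cyberpunk_tarot.lora.pt")
b = torch.load("watercolor_style.lora.pt")

# Effective weight: W + 0.6 * (W1_a @ W2_a) + 0.4 * (W1_b @ W2_b)
with torch.no_grad():
    merged_W = lora_layer.base.weight + lora_delta(a, 0.6) + lora_delta(b, 0.4)
```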
There is also some idea that LoRAs might help against the catastrophic forgetting problem - where your fine-tune goes too far, and ends up turning knowledge outside of the fine-tune distribution into mush - but this is not clear.
LoRAs mean training time and storage space for fine-tunes has gone way down. For GPT-3, the LoRA paper reports a 25% speedup in training and a roughly 10,000x reduction in checkpoint size.
LoRAs are small, modular intelligence injectables for generative models. Like cyberpunk stories, where you can carry around a stack of chips in your pocket, and insert one into the side of your head when a fight is about to start, to instantly teach yourself kung-fu - LoRAs feel like this for deep networks.
Like most mind-blowing things I learn about generative models, the paper on it was published two years ago. Whatever.
I liked this video and this essay for deeper LoRA explanations.
The 5 is just an example. The reality is more general than that: the LoRA matrices would be 100-by-r and r-by-100, where r is a hyperparameter.
Most neural net layers are bigger than 100-by-100, so the 10x savings in the example is understating things. To see this, imagine a 1,000-by-1,000 weight matrix. W would have 1M parameters, but with r = 5, W1 and W2 together would only have 1,000*5 + 5*1,000 = 10,000 parameters. That's 100 times fewer.