Recent years have witnessed rapid development of large language models (LLMs). Despite their strong abilities in many language-understanding tasks, the heavy computational burden largely restricts the application of LLMs, especially when they need to be deployed on edge devices. In this paper, we propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation, and the solution is to use group-wise operators, which increase the degree of freedom of quantization while decreasing that of adaptation. QA-LoRA is easily implemented with a few lines of code, and it equips the original LoRA with two-fold abilities: (i) during fine-tuning, the LLM's weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy. We apply QA-LoRA to the LLaMA and LLaMA2 model families and validate its effectiveness on different fine-tuning datasets and downstream scenarios. Code will be made available at this https URL.
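For what it's worth, here's how I read the core trick, as a rough numpy sketch. The group size, the average pooling, and the exact scale/zero-point conventions are my own guesses rather than anything lifted from the paper, so treat it as an illustration: because the LoRA A matrix only ever sees one pooled value per quantization group, the trained adapter can be folded into the per-group zero points, and the merged model is still a plain INT4 model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, L, r, s = 8, 32, 4, 2, 0.5   # L = quantization group size, r = LoRA rank
G = d_in // L                             # number of groups along the input dimension

W = rng.normal(size=(d_out, d_in))

# Group-wise asymmetric INT4: W ~ alpha * q + beta, one (alpha, beta) per output row and group.
Wg = W.reshape(d_out, G, L)
alpha = np.maximum((Wg.max(2) - Wg.min(2)) / 15.0, 1e-8)   # scale
beta = Wg.min(2)                                           # zero-point offset
q = np.clip(np.round((Wg - beta[..., None]) / alpha[..., None]), 0, 15)

def dequant(beta):
    return (alpha[..., None] * q + beta[..., None]).reshape(d_out, d_in)

# The LoRA branch acts on group-averaged inputs, so A has only G columns instead of d_in.
A = rng.normal(size=(r, G))
B = rng.normal(size=(d_out, r))

x = rng.normal(size=d_in)
x_bar = x.reshape(G, L).mean(1)           # average-pool the input within each group

y_finetune = dequant(beta) @ x + s * (B @ (A @ x_bar))

# Merge after fine-tuning: fold the adapter into the per-group zero points; q stays INT4.
beta_merged = beta + s * (B @ A) / L
y_merged = dequant(beta_merged) @ x

assert np.allclose(y_finetune, y_merged)
```

That, to me, is the "imbalanced degrees of freedom" point in the abstract: group-wise quantization adds scale/zero parameters, while the pooled input takes adapter parameters away, so the merge costs nothing in accuracy.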
The abstract is meant to pull in random readers, so it's understandable that the authors lay a bit of foundation for what the paper is about, even if it comes across as rather simple and unnecessarily wordy.
LoRA is still considered the gold standard in efficient fine-tuning, which is why a lot of comparisons are made to it rather than to QLoRA, which is more of a hack. They both have their advantages, but they're pretty distinct approaches.
Another thing worth pointing out: 4-bit quantization isn't just converting every 16-bit weight into 4 bits (at least not in the GPTQ style). A quantization factor (a scale) is saved alongside the integers, so more information can be recovered from the final quantized weights than a naive "multiply everything by 4".
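To make that concrete, here's a toy min-max version of the idea; GPTQ's actual rounding procedure is smarter than this, so treat it purely as an illustration of why the stored scale and zero point matter:

```python
import numpy as np

w = np.array([0.31, -0.12, 0.07, 0.45, -0.50, 0.22])

# Asymmetric 4-bit: store integers in [0, 15] plus a scale and zero point for the group.
scale = (w.max() - w.min()) / 15.0
zero = w.min()
q = np.clip(np.round((w - zero) / scale), 0, 15).astype(np.int8)  # this is what the 4 bits hold

w_hat = q * scale + zero          # dequantize: the scale/zero carry the recovered information
print(q)                          # [13  6  9 15  0 11]
print(np.abs(w - w_hat).max())    # ~0.02, far from naive bit-chopping of the weights
```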
QA-LoRA vs QLoRA: I think my distinction is the same as what you said; it's just about the starting and ending state. QLoRA also introduced a lot of other techniques to make it work, though, like double quantization, the NormalFloat (NF4) data type, and paged optimizers.
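For reference, all of those pieces are visible in a stock QLoRA setup through the Hugging Face stack; the model id below is just a placeholder, and this is a sketch of the usual configuration rather than anything specific to the QA-LoRA paper:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat-4 data type
    bnb_4bit_use_double_quant=True,      # double quantization of the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
# The paged optimizers live outside this config, e.g. optim="paged_adamw_32bit" in TrainingArguments.
```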
It's also worth pointing out that not understanding it has nothing to do with intellect; it's just a matter of how much foundational knowledge you have. I don't understand most of the math, but I've read enough of the papers to understand, to some degree, what's going on.
The one thing I can't quite figure out: I know QLoRA stays competitive with a regular LoRA partly because it attaches adapters to more layers of the transformer (all the linear layers, not just the attention projections), but I don't see any specific mention of QA-LoRA following that same recipe, which I'd think is needed to maintain quality.
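For comparison, here's roughly what that difference looks like in PEFT terms. The module names are the LLaMA-style ones, and whether QA-LoRA targets the wider set is exactly what I can't confirm from the paper:

```python
from peft import LoraConfig

# "Classic" LoRA: adapters only on the attention projections.
lora_attention_only = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# QLoRA-paper style: adapters on every linear layer in each block, which is part of
# how a 4-bit base model stays competitive with a full-precision LoRA.
lora_all_linear = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```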
Overall you're right, though; this paper is a bit on the weaker side. That said, if it works then it works, and it's a pretty decent discovery, but the paper alone doesn't guarantee that.
Thanks