QLoRA: Efficient Finetuning of Quantized LLMs
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer

QLoRA enables efficient finetuning of large quantized language models on limited hardware, achieving near state-of-the-art performance with significant memory savings and broad applicability across models and tasks.
The paper introduces QLoRA, a method for low-memory finetuning of quantized LLMs. It combines a new 4-bit data type (NF4), double quantization, and paged optimizers so that very large models can be finetuned on a single GPU.
Guanaco, the best QLoRA-finetuned model family, reaches 99.3% of ChatGPT's performance level on the Vicuna benchmark.
It enables finetuning of 65B parameter models on a single 48GB GPU.
State-of-the-art results are obtained with smaller models through efficient finetuning.
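As a rough check on the single-GPU claim above, the following sketch compares weight storage for a 65B-parameter model at 16 bits and at 4 bits per parameter. It counts only the base model weights; adapter weights, activations, optimizer state, and quantization constants are deliberately ignored, and the helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 65e9  # 65B parameters, as in the summary above

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # 16-bit weights
four_bit_gb = weight_memory_gb(N_PARAMS, 4)  # 4-bit quantized weights

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f" 4-bit weights: {four_bit_gb:.1f} GB")
```

Under these assumptions the 16-bit weights alone exceed a 48GB GPU several times over, while the 4-bit weights leave headroom for the remaining training state.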
We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights, (b) double quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) paged optimizers to manage…
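The ideas behind (a) and (b) can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the codebook below places 16 levels at equally spaced quantiles of a standard normal, which is a simplification and not the exact NF4 table; the block size of 64, the 32-bit constants, and the 8-bit second-level constants with block size 256 are illustrative values not given in this summary. All function names are hypothetical.

```python
import random
from statistics import NormalDist


def normal_quantile_codebook(bits=4):
    """Levels at equally spaced quantiles of N(0, 1), rescaled to [-1, 1].

    Simplified stand-in for a normal-aware 4-bit data type.
    """
    n = 2 ** bits
    nd = NormalDist()
    # Midpoints of n equal-probability bins, so inv_cdf never sees 0 or 1.
    qs = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    top = max(abs(q) for q in qs)
    return [q / top for q in qs]


def quantize_block(block, codebook):
    """Scale a block by its absolute maximum, then pick the nearest level."""
    absmax = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
    codes = [
        min(range(len(codebook)), key=lambda k: abs(codebook[k] - w / absmax))
        for w in block
    ]
    return codes, absmax  # absmax is the per-block quantization constant


def dequantize_block(codes, absmax, codebook):
    """Map codes back to approximate weights."""
    return [codebook[c] * absmax for c in codes]


def constant_overhead_bits(block=64, const_bits=32, dq_block=None, dq_bits=8):
    """Extra bits per parameter spent on quantization constants.

    With dq_block set, the first-level constants are themselves stored in
    dq_bits, with one const_bits constant per dq_block of them.
    """
    if dq_block is None:
        return const_bits / block
    return dq_bits / block + const_bits / (block * dq_block)


codebook = normal_quantile_codebook()
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(64)]  # synthetic weights
codes, absmax = quantize_block(weights, codebook)
restored = dequantize_block(codes, absmax, codebook)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print("max reconstruction error relative to absmax:", max_err / absmax)
print("overhead without double quantization:", constant_overhead_bits())
print("overhead with double quantization:", constant_overhead_bits(dq_block=256))
```

The last two lines show why quantizing the constants matters: under the assumed block sizes, the per-parameter overhead of the constants drops from 0.5 bits to roughly 0.127 bits.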
- 🤗 google/gemma-7b · model · 30k dl · ♡ 3293
- 🤗 timdettmers/guanaco-7b · model · ♡ 24
- 🤗 timdettmers/guanaco-65b · model · ♡ 88
- 🤗 timdettmers/guanaco-33b · model · ♡ 27
- 🤗 timdettmers/guanaco-13b · model · ♡ 18
- 🤗 timdettmers/qlora-longform-7b · model
- 🤗 timdettmers/qlora-chip2-7b · model
- 🤗 timdettmers/qlora-hh-rlhf-7b · model
- 🤗 timdettmers/qlora-self-instruct-7b · model
- 🤗 timdettmers/qlora-flan-7b · model
Topics: Ferroelectric and Negative Capacitance Devices · Advanced Neural Network Applications · Parallel Computing and Optimization Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Layer Normalization · Position-Wise Feed-Forward Layer · Linear Layer · Absolute Position Encodings
