QLoRA: Efficient Finetuning of Quantized LLMs

Tim Dettmers; Artidoro Pagnoni; Ari Holtzman; Luke Zettlemoyer

arXiv:2305.14314·cs.LG·May 24, 2023·490 cites

TL;DR

QLoRA enables efficient finetuning of large quantized language models on limited hardware, achieving near state-of-the-art performance with significant memory savings and broad applicability across models and tasks.

Contribution

The paper introduces QLoRA, a method for low-memory finetuning of quantized LLMs. It combines a new 4-bit data type (NF4), double quantization, and paged optimizers so that very large models can be finetuned on a single GPU.

Findings

01

Guanaco, the best model family finetuned with QLoRA, reaches 99.3% of ChatGPT's performance level on the Vicuna benchmark.

02

It enables finetuning of 65B parameter models on a single 48GB GPU.

03

State-of-the-art results are obtained with smaller models through efficient finetuning.
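
The single-GPU claim in finding 02 can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts the storage for the weights only; it deliberately ignores quantization constants, adapter weights, activations, and optimizer state, so it is a lower bound and not a measurement from the paper. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 65e9    # the 65B-parameter model from finding 02
GPU_GB = 48        # the single 48GB GPU from finding 02

fp16_gb = weight_memory_gb(N_PARAMS, 16)   # 16-bit weights
int4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit quantized weights

print(f"16-bit weights: {fp16_gb:.1f} GB")   # 130.0 GB, does not fit in 48 GB
print(f" 4-bit weights: {int4_gb:.1f} GB")   # 32.5 GB, leaves headroom on 48 GB
```

Under these assumptions the 16-bit weights alone exceed the GPU's capacity, while the 4-bit weights leave roughly 15 GB for everything else.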

Abstract

We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information-theoretically optimal for normally distributed weights, (b) double quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) paged optimizers to manage…
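
To make the abstract's two central ideas concrete, here is a minimal pure-Python sketch, not the authors' implementation. It assumes a simplified 16-level codebook placed at evenly spaced quantiles of a standard normal distribution (a stand-in for NF4, whose exact construction is not given on this page), absmax scaling per block, and a LoRA forward pass of the form y = Wx + s·B(Ax). All function names are hypothetical; double quantization and paged optimizers are not shown.

```python
from statistics import NormalDist

def normal_quantile_codebook(bits=4):
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1].
    Simplified stand-in for NF4, not the paper's exact data type."""
    k = 2 ** bits
    nd = NormalDist()
    # Midpoint probabilities avoid the infinite quantiles at 0 and 1.
    levels = [nd.inv_cdf((i + 0.5) / k) for i in range(k)]
    top = max(abs(v) for v in levels)
    return [v / top for v in levels]

def quantize_block(block, codebook):
    """Absmax-scale one block to [-1, 1] and keep nearest-level indices."""
    scale = max(abs(w) for w in block) or 1.0
    idx = [min(range(len(codebook)), key=lambda j: abs(codebook[j] - w / scale))
           for w in block]
    return idx, scale

def dequantize_block(idx, scale, codebook):
    """Recover approximate weights from 4-bit indices and the block scale."""
    return [codebook[j] * scale for j in idx]

def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def lora_forward(w_deq, a, b, x, scaling=1.0):
    """y = W x + scaling * B (A x). W stays frozen; only A and B would train."""
    base = matvec(w_deq, x)
    update = matvec(b, matvec(a, x))
    return [u + scaling * v for u, v in zip(base, update)]

codebook = normal_quantile_codebook()
weights = [0.12, -0.5, 0.03, 0.9]            # one toy block of frozen weights
idx, scale = quantize_block(weights, codebook)
restored = dequantize_block(idx, scale, codebook)

# A 2x2 frozen matrix built from the dequantized block, with rank-1 adapters.
w_deq = [restored[:2], restored[2:]]
a = [[0.1, -0.2]]                            # A: rank x in_features
b = [[0.0], [0.0]]                           # B: out_features x rank, zero-initialized
y = lora_forward(w_deq, a, b, [1.0, 2.0])
```

With B initialized to zero, the adapter path contributes nothing, so `y` equals the output of the frozen quantized layer alone; during finetuning, gradients would flow through the dequantized weights into A and B only.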

Code & Models

Videos

QLoRA: Efficient Finetuning of Quantized LLMs · SlidesLive

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Advanced Neural Network Applications · Parallel Computing and Optimization Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Layer Normalization · Position-Wise Feed-Forward Layer · Linear Layer · Absolute Position Encodings