Chandler Zuo

Parameter Efficient Fine Tuning

Adopting Pre-trained Models in Practice

Foundation models such as GPT-2 (Radford et al.) and GPT-3 (Brown et al.) are trained on large corpora covering diverse domains. These models can be viewed as general purpose AI equipped with a strong knowledge base and well-balanced capabilities across many tasks.

On the other hand, for a specific real world application, we mainly care about the LLM's capability in a particular domain. For instance, when developing an LLM coding expert, we'd like to enhance the coding capability of a pre-trained LLM and are willing to trade off other capabilities, such as solving legal problems. One approach is to use fine tuning algorithms that strengthen the LLM's capability in those special domains.

Standard fine tuning algorithms in deep learning take a pre-trained model, continue training it on new data, and update the model parameters. Such algorithms require loading the model parameters onto the training hardware, typically GPUs. They also require additional GPU memory to store the gradient and optimizer states needed by the parameter update algorithm. The total GPU memory required for fine tuning is significant: as a rule of thumb, in half-precision mode we need about 16GB of GPU memory per 1B model parameters. Multiplying this by the size of a typical pre-trained foundation model, for example Llama-70B or Llama-405B, makes the GPU memory requirement prohibitively expensive or even infeasible.
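To make the rule of thumb concrete, here is a back-of-envelope estimate in Python. It assumes a mixed-precision Adam-style optimizer that keeps roughly 16 bytes of state per parameter (half-precision weights and gradients plus full-precision master weights and optimizer moments) and ignores activation memory.

```python
# Back-of-envelope GPU memory estimate for full fine tuning with a
# mixed-precision Adam-style optimizer (an assumption for illustration):
#   2 bytes/param (fp16 weights) + 2 (fp16 gradients)
# + 4 (fp32 master weights) + 4 (fp32 momentum) + 4 (fp32 variance)
# = 16 bytes per parameter, ignoring activations.
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4

def full_finetune_memory_gb(num_params_billion: float) -> float:
    """Approximate GPU memory (GB) needed to fully fine tune a model."""
    return num_params_billion * 1e9 * BYTES_PER_PARAM / 1e9

for size in (1, 7, 70, 405):
    print(f"{size}B params -> ~{full_finetune_memory_gb(size):,.0f} GB")
# 1B -> ~16 GB, 7B -> ~112 GB, 70B -> ~1,120 GB, 405B -> ~6,480 GB
```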

Parameter Efficient Fine Tuning

Parameter Efficient Fine Tuning (PEFT) aims to reduce the memory footprint of fine tuning LLMs while maintaining fine tuning performance. Over the past few years, researchers have developed a number of ways to fine tune LLMs that require updating only a small number of parameters. These approaches not only reduce the GPU requirement but also increase training speed. In the remainder of this article, we'll go over the most popular techniques.

Few-Shot Learning

In-context learning is an emergent behavior observed in LLMs, where the model can solve unseen tasks when presented with examples of that task, without updating any model parameters. To leverage in-context learning for a new task, we can prepare a few examples of that task, called "few-shot examples", and include them in the prompt. This is called Few-Shot Learning. For example, when developing a sentiment analysis model, we can prepare a few example documents with ground truth labels, and prepend them to the new document we'd like the LLM to label.
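As an illustration, here is a minimal sketch of assembling a few-shot prompt for the sentiment analysis example; the example reviews and labels are hypothetical placeholders.

```python
# A minimal few-shot prompt for sentiment analysis. The example
# documents and labels below are hypothetical placeholders.
few_shot_examples = [
    ("The battery lasts all day and the screen is gorgeous.", "positive"),
    ("The app crashes every time I open the settings page.", "negative"),
    ("Delivery was fast and the packaging was fine.", "positive"),
]

def build_prompt(new_document: str) -> str:
    lines = ["Label the sentiment of each review as positive or negative.", ""]
    for doc, label in few_shot_examples:
        lines.append(f"Review: {doc}\nSentiment: {label}\n")
    lines.append(f"Review: {new_document}\nSentiment:")
    return "\n".join(lines)

print(build_prompt("The keyboard feels cheap and two keys already stick."))
```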

The advantage of In-Context Learning is that it's completely training-free. The drawback is that model performance depends heavily on the few-shot examples, and iterating on those examples is a highly manual process.

Prompt Tuning

Prompt Tuning (Lester et al., 2021) leverages the in-context learning capability but extends the Few-Shot Learning structure. The underlying architecture of LLMs is the transformer, which takes embedding sequences as input. In few-shot learning, the extra few-shot examples are encoded as additional embedding sequences in the transformer input. From the transformer's perspective, there is no requirement that such embedding sequences come from human readable text. Prompt Tuning generalizes Few-Shot Learning by relaxing this constraint on the embedding sequences.

Specifically, Prompt Tuning learns a sequence of embeddings that is prepended to every input to the model. The training data consists of input-output instances, and the learnable embeddings are hidden from end users.
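Below is a minimal PyTorch-style sketch of the idea: a learnable soft prompt is prepended to the input embeddings of a frozen base model. The wrapper assumes the base model accepts input embeddings directly; the attribute names and shapes are illustrative, not those of any particular library.

```python
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Prepends a learnable embedding sequence ("soft prompt") to the
    input embeddings of a frozen language model. A minimal sketch; it
    assumes the base model's forward accepts embeddings directly."""

    def __init__(self, base_model: nn.Module, embed_dim: int, prompt_len: int = 20):
        super().__init__()
        self.base_model = base_model
        for p in self.base_model.parameters():
            p.requires_grad = False                    # only the soft prompt is trained
        self.soft_prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim)
        batch = input_embeds.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        return self.base_model(torch.cat([prompt, input_embeds], dim=1))
```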

By adding flexibility through learnable parameters, prompt tuning performs better than few-shot learning, since it can better fit the specific patterns of the downstream task. It also consumes less inference-time capacity, because it doesn't need to process a few-shot sequence at inference time. The downside is that prompt tuning requires high quality training data as well as the engineering effort to train and deploy new models.

Prefix Tuning

A further generalization of Prompt Tuning is Prefix Tuning (Li and Liang, 2021), which introduces more learnable parameters. A transformer neural network is a stack of attention layers, and in Prefix Tuning, learnable embeddings are prepended to the keys and values at each of these attention layers rather than only at the input.
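A rough sketch of the extra parameters involved is below: one learnable key/value prefix per attention layer, which would be concatenated with that layer's own keys and values before computing attention. How the prefixes are wired into a model depends on the specific transformer implementation.

```python
import torch
import torch.nn as nn

class PrefixTuning(nn.Module):
    """Learnable per-layer prefixes prepended to the keys and values of
    every attention layer. A minimal sketch; shapes and wiring are
    illustrative, not tied to any particular transformer codebase."""

    def __init__(self, num_layers: int, num_heads: int, head_dim: int, prefix_len: int = 10):
        super().__init__()
        # One (key, value) prefix per layer: (prefix_len, num_heads, head_dim)
        self.prefix_keys = nn.Parameter(
            torch.randn(num_layers, prefix_len, num_heads, head_dim) * 0.02)
        self.prefix_values = nn.Parameter(
            torch.randn(num_layers, prefix_len, num_heads, head_dim) * 0.02)

    def get_prefix(self, layer_idx: int, batch_size: int):
        # Returned tensors are concatenated with the layer's own keys and
        # values along the sequence dimension before attention.
        k = self.prefix_keys[layer_idx].unsqueeze(0).expand(batch_size, -1, -1, -1)
        v = self.prefix_values[layer_idx].unsqueeze(0).expand(batch_size, -1, -1, -1)
        return k, v
```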

Historically, Prefix Tuning was developed before Prompt Tuning. In this article, we frame it as a generalization because we introduce PEFT techniques in order from simpler to more complex.

LLaMA-Adapter

LLaMA-Adapter (Zhang et al., 2023) further improves Prefix Tuning by adding extra components to the transformer architecture: learnable adaptation prompts are inserted only into the topmost transformer layers, and the attention over these prompts is scaled by a zero-initialized, learnable gating factor, so the adapter has no effect at the start of training and is blended in gradually.
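The zero-initialized gating can be sketched in isolation as below: the attention contribution computed from the adaptation prompt is scaled by a learnable gate that starts at zero, so training begins from the unmodified pre-trained behavior. The exact placement of the gate, and whether it is squashed (e.g. through tanh), varies across implementations.

```python
import torch
import torch.nn as nn

class ZeroInitGate(nn.Module):
    """Zero-initialized gating in the spirit of LLaMA-Adapter.
    The attention output contributed by the adaptation prompt is scaled
    by a per-head gate initialized to zero, so the adapter initially has
    no effect on the pre-trained model's behavior."""

    def __init__(self, num_heads: int):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(num_heads, 1, 1))

    def forward(self, prompt_attn_out: torch.Tensor) -> torch.Tensor:
        # prompt_attn_out: (batch, num_heads, seq_len, head_dim)
        # contribution computed from the adaptation prompt tokens.
        return self.gate * prompt_attn_out
```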

LoRA

Most parameters in the transformer architecture are in matrix form. While updating all of these parameter values during fine tuning would be best for downstream tasks, doing so efficiently is challenging. LoRA, Low-Rank Adaptation (Hu et al., 2021), relies on a simple observation about "low intrinsic dimensionality": the change in model weights during fine tuning is approximately low rank. This enables a parameter efficient way to update matrix-shaped parameters by learning the update as a product of two small matrices.

The learnable parameters in LoRA are the low rank matrices whose product is added on top of the frozen pre-trained weights. This method can be applied to any matrix parameter in the transformer layers. It brings substantial advantages over other fine tuning approaches: the number of trainable parameters (and the corresponding optimizer state) is tiny compared to full fine tuning; the low rank update can be merged into the pre-trained weights after training, so there is no additional inference latency; and many task-specific adapters can share a single frozen base model.
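A minimal PyTorch sketch of a LoRA-augmented linear layer follows; the rank r and scaling alpha are illustrative defaults, not values prescribed by the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer with a trainable low rank update,
    following the LoRA formulation W x + (alpha / r) * B A x.
    A minimal sketch with illustrative default hyperparameters."""

    def __init__(self, pretrained: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.pretrained = pretrained
        for p in self.pretrained.parameters():
            p.requires_grad = False                        # base weights stay frozen
        in_f, out_f = pretrained.in_features, pretrained.out_features
        self.lora_A = nn.Parameter(torch.randn(r, in_f) * 0.01)  # down-projection
        self.lora_B = nn.Parameter(torch.zeros(out_f, r))         # zero-init up-projection
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta = (x @ self.lora_A.T) @ self.lora_B.T               # low rank update B A x
        return self.pretrained(x) + self.scaling * delta
```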

A further improvement over LoRA, QLoRA (Dettmers et al., 2023), uses quantization techniques to make LoRA even more efficient. It applies 4-bit quantization to the pre-trained model weights to reduce the memory needed to load them, while the LoRA adapters themselves are trained in higher precision. The authors showed that this preserves the performance of LoRA while making it much cheaper to fine tune large models such as the 65B-parameter LLaMA.
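For readers using the Hugging Face ecosystem, a QLoRA-style setup can be sketched as below, assuming the transformers, peft, and bitsandbytes libraries; the checkpoint name and hyperparameters are placeholders.

```python
# A sketch of QLoRA-style fine tuning, assuming the Hugging Face
# transformers, peft, and bitsandbytes libraries; the model name and
# hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit quantized base weights
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder checkpoint
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the LoRA adapters are trainable
model.print_trainable_parameters()
```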

Summary

This article introduced several parameter efficient fine-tuning techniques that can adapt a pre-trained LLM to downstream tasks efficiently. In retrospect, these techniques fall into two categories: those that prepend learnable context, either as prompt text or as soft embeddings, to the model's inputs or attention layers (Few-Shot Learning, Prompt Tuning, Prefix Tuning, LLaMA-Adapter), and those that add small, low rank updates to the pre-trained weight matrices (LoRA, QLoRA).

When applying these techniques in practice, data continues to play an important role. All of the techniques rely on input-output examples, used either as few-shot prompts or as training data. To achieve the best performance on a downstream task, these examples need to be diverse enough to cover all scenarios, relevant to the corresponding task, and include enough difficult cases to elicit the LLM's full capability for that task.

(c)2017-2026 CHANDLER ZUO ALL RIGHTS RESERVED