Exploring Efficient Fine-Tuning Techniques with LoRA for Modern Language Models
Table of Contents
- What is Fine-Tuning?
- How is Fine-Tuning done?
- What is LoRA?
- How LoRA Solves the Challenges of FFT
- Ending Thoughts
The code for this implementation can be found at https://github.com/AishwaryaHastak/LoRA-implementation
What is Fine-Tuning?
Most state-of-the-art language models are pre-trained on vast amounts of data, providing them with a broad understanding of language and world knowledge. This extensive pre-training allows these models to generalize across a wide range of language tasks. These models are typically large, containing billions or even trillions of parameters, with their size continually increasing.

As illustrated, the parameter counts of these models have been growing exponentially over time, which highlights the challenge of working with such massive models. For instance, GPT-4 is reported to have roughly 1.8 trillion parameters, showcasing the scale of modern language models.
However, there are scenarios where a general-purpose model may not be sufficient. For instance, when addressing domain-specific requirements, personalization, or specific user or task needs, a model that excels in these particular areas is preferred. Fine-tuning is the process of adapting a pre-trained model to these specific tasks. It involves adjusting the model’s parameters to perform optimally for a narrower, well-defined application, rather than for general language understanding.
How is Fine-Tuning done?
Full Fine-Tuning (FFT)
FFT involves training a pre-trained model on new data tailored to a specific use case. This approach updates all the weights in the network, effectively retraining the entire model with the new data.
Why is this challenging?
Full Fine-Tuning can be prohibitively expensive for businesses or individuals with limited computing and storage resources:
- Performing gradient calculations and updating weights for billions of parameters is highly compute-intensive.
- Saving model checkpoints along with the optimizer states for every weight at each epoch is memory-intensive (a rough back-of-envelope estimate of these costs follows below).
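To make these costs concrete, here is a rough estimate in Python. The 7-billion-parameter model size, FP32 precision, and Adam optimizer are illustrative assumptions on my part, not figures from any particular setup:

```python
# Rough memory estimate for full fine-tuning in FP32 with Adam.
# Assumptions (illustrative): 7B parameters, 4 bytes per FP32 value,
# and Adam keeping two extra states (first and second moments) per parameter.
num_params = 7e9
bytes_per_value = 4

weights = num_params * bytes_per_value         # the model weights themselves
gradients = num_params * bytes_per_value       # one gradient per weight
optimizer = 2 * num_params * bytes_per_value   # Adam's two moment estimates

total_gb = (weights + gradients + optimizer) / 1e9
print(f"~{total_gb:.0f} GB for weights, gradients, and optimizer state alone")
# ~112 GB, before activations or saved checkpoints are even counted
```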
Parameter-Efficient Fine-Tuning
In Parameter-Efficient Fine-Tuning (PEFT), we aim to train far fewer parameters, either by training only a subset of the original parameters or by introducing a small set of new parameters. PEFT techniques can be broadly classified into three categories:

- Specification techniques select a portion of the original architecture to train and freeze the remaining part of it. For example, Layer-wise Learning Rate Adaptation (LLRA) fine-tunes specific layers of the model, while others remain unchanged, effectively reducing the number of parameters being updated.
- Reparameterization techniques transform the existing model parameters into a parameter-efficient form. An example is Low-Rank Adaptation (LoRA), which approximates weight matrices with low-rank representations to reduce the number of trainable parameters while maintaining performance.
- Addition techniques freeze the base model (keeping the pre-trained weights constant) and use it as a backbone while introducing adapters — small, trainable, completely new components that learn to adjust the pre-existing weights. Examples of these include Adapter-BERT and prefix-tuning. These techniques can be further broken down into:
- Soft prompt techniques: Involve inserting trainable soft prompts into the input sequence (e.g., prompt tuning).
- Adapter techniques: Introduce additional layers within the model that are specifically trained for the new task (e.g., AdapterHub); a minimal sketch of such an adapter block follows this list.
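As a rough illustration of the adapter idea, the sketch below shows a generic bottleneck adapter block in PyTorch. The class name, bottleneck size, and activation are my own illustrative choices rather than the exact design of Adapter-BERT or AdapterHub:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """A small trainable block inserted into an otherwise frozen pre-trained model."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # compress to a small bottleneck
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # expand back to the hidden size
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the frozen model's behaviour as the starting point;
        # only the adapter's few parameters are updated during fine-tuning.
        return x + self.up(self.act(self.down(x)))
```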
What is LoRA?
To understand LoRA (Low-Rank Adaptation), we first need to grasp the concepts of extrinsic and intrinsic dimensions. Extrinsic dimension refers to the number of dimensions used to represent data in a model. For example, in a 3D space, data points are defined by three dimensions. Intrinsic dimension, on the other hand, is the number of dimensions actually needed to capture the underlying structure of the data.
Consider data points living in a 3D space. If those points happen to lie (approximately) on a flat plane, each of them can be described with just two coordinates without losing meaningful information. In that case, while the extrinsic dimensionality is 3, the intrinsic dimensionality of the data is only 2.
Research has explored whether all the parameters in a large model are necessary, showing experimentally that many large models effectively exist in a much smaller intrinsic dimension and can be decomposed into a lower-dimensional subspace.
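As a small, purely illustrative demonstration of this decomposition idea (not a reproduction of the cited experiments), the snippet below builds a large matrix whose rank is far smaller than its size and confirms that it is fully described by two thin factors:

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.standard_normal((100, 4))   # 100 x 4 factor
V = rng.standard_normal((4, 100))   # 4 x 100 factor
W = U @ V                           # a 100 x 100 matrix, but its rank is only 4

# The SVD confirms that only 4 singular values are numerically non-zero,
# so the 10,000 entries of W are captured by 2 * 100 * 4 = 800 numbers.
singular_values = np.linalg.svd(W, compute_uv=False)
print(int((singular_values > 1e-8).sum()))  # 4
```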
LoRA leverages this principle by assuming that the weight matrices of a model have a smaller intrinsic rank. In LoRA, the pre-trained weights of the model are frozen, and updates to these weights are tracked using two trainable matrices, A and B, which have much smaller dimensions.
Formally, if fine-tuning is represented as updating the pre-trained weights W to W + ΔW, LoRA assumes that the update matrix ΔW also has a low intrinsic rank. Instead of learning the full weight update, LoRA approximates it as ΔW ≈ A⋅B, where A and B are two much smaller matrices of rank r. This means that rather than updating the entire weight matrix, LoRA focuses on a low-rank approximation of the weight changes, which can efficiently adapt the model to new tasks while maintaining a reduced computational and memory footprint.
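Below is a minimal sketch of what a LoRA-adapted linear layer could look like in PyTorch, written to match the A⋅B notation used in this article. The class name, the rank r = 8, and the alpha scaling factor are illustrative choices of mine; see the repository linked above for the author's actual implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Computes y = x @ (W + scale * A @ B), with W frozen and only A, B trained."""
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        # Frozen pre-trained weight W (in_features x out_features): no gradients flow into it.
        self.W = nn.Parameter(torch.randn(in_features, out_features) * 0.02, requires_grad=False)
        # Trainable low-rank factors: A compresses to rank r, B expands back to the output size.
        self.A = nn.Parameter(torch.randn(in_features, r) * 0.01)  # in_features x r
        self.B = nn.Parameter(torch.zeros(r, out_features))        # r x out_features, zero-init so ΔW starts at 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to x @ (W + scale * A @ B), but computed without ever forming A @ B explicitly.
        return x @ self.W + self.scale * ((x @ self.A) @ self.B)
```

Zero-initialising the expanding matrix means ΔW starts at zero, so at the beginning of fine-tuning the layer behaves exactly like the frozen pre-trained layer.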


How LoRA Solves the Challenges of FFT
1. Compute Efficiency: Consider a weight matrix W of size 10×10 whose update has a rank of 1. In LoRA, we approximate the update as ΔW ≈ A⋅B, where A and B have much smaller dimensions: A is 10×1 and B is 1×10. Specifically, A compresses the 10-dimensional input into a single dimension, and B expands this single dimension back into 10 dimensions.
- Compression: Matrix A maps the 10-dimensional input down to a single value.
- Reconstruction: Matrix B then expands that single value back into the 10-dimensional output space.
By using this low-rank approximation, LoRA trains only 2⋅d⋅r parameters (where d is the original dimension, r is the rank, and r ≪ d), compared to d⋅d parameters in FFT. This significantly reduces computational costs during each iteration, as the short check below illustrates.
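A quick sanity check of this arithmetic (the 4096×4096 layer size in the second call is just an illustrative transformer-scale dimension):

```python
def lora_savings(d: int, r: int) -> None:
    full = d * d        # parameters updated by full fine-tuning for one d x d matrix
    lora = 2 * d * r    # parameters in A (d x r) plus B (r x d)
    print(f"d={d}, r={r}: FFT trains {full:,} params, LoRA trains {lora:,} "
          f"({100 * lora / full:.2f}% of the full count)")

lora_savings(10, 1)     # the toy example above: 100 vs 20 parameters
lora_savings(4096, 8)   # an illustrative transformer-scale layer: 16,777,216 vs 65,536
```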
2. Storage Efficiency: Instead of storing the full-weight matrix W, LoRA only stores the matrices A and B, representing the changes and essential information about the low-rank approximation.
- During training, we only need A and B; their product A⋅B can be formed on the fly (or avoided entirely by multiplying the input by A and then by B) and is never stored.
- During inference, we can merge the update into the base weights once, using the combined matrix W + A⋅B, so the adapted layer runs as fast as the original one without keeping any intermediate product around. A small sketch of this merge step follows below.
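A sketch of that merge step, using the same shapes as the LoRALinear sketch above (the function name and the scale argument are my own):

```python
import torch

def merge_lora(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor, scale: float = 2.0) -> torch.Tensor:
    """Fold the low-rank update into the base weight for inference.

    W: frozen pre-trained matrix of shape (in_features, out_features)
    A: trained matrix of shape (in_features, r)
    B: trained matrix of shape (r, out_features)
    """
    # After merging, the layer is just a single matrix multiply again: y = x @ W_merged.
    return W + scale * (A @ B)
```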
Advantages:
- Parallelizability: Because the frozen base model is shared across tasks, LoRA makes it easy to support many tasks with one model: each task only needs its own small pair of A and B matrices, and switching tasks simply means swapping those pairs in and out. This allows efficient utilization of computational resources across various tasks; a small sketch of this idea follows after this list.
- Additive Property: Since the base model’s weights remain frozen, LoRA allows for creating a hierarchy of adaptations. Multiple LoRA layers can be stacked or combined to build more complex models while preserving the integrity of the pre-trained weights.
- Generalisability: Although LoRA was initially designed for Large Language Models (LLMs), its principles apply to any problem involving matrix multiplication and rank decomposition. This versatility extends LoRA’s benefits to a broad range of applications.
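A small, runnable sketch of the adapter-swapping idea (the task names, dimensions, and scale are hypothetical):

```python
import torch

d, r = 16, 2
base_W = torch.randn(d, d)  # the shared, frozen pre-trained weight

# Hypothetical per-task adapter store: only the small A and B matrices differ between tasks.
adapters = {
    "summarization": (torch.randn(d, r) * 0.01, torch.zeros(r, d)),
    "sentiment":     (torch.randn(d, r) * 0.01, torch.zeros(r, d)),
}

def weight_for_task(task: str, scale: float = 2.0) -> torch.Tensor:
    # Switching tasks never touches the base weight; we only swap in a different (A, B) pair.
    A, B = adapters[task]
    return base_W + scale * (A @ B)

print(weight_for_task("summarization").shape)  # torch.Size([16, 16])
```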
Ending Thoughts
In summary, fine-tuning offers a powerful way to adapt pre-trained models to specific tasks, but it comes with significant computational and storage costs. LoRA (Low-Rank Adaptation) addresses these challenges by leveraging low-rank approximations, making it a promising alternative for efficient model adaptation.
Recent research suggests that while fine-tuning can improve model performance, it may make the model more prone to hallucination, and retrieval techniques like RAG could be better than fine-tuning. Future research will continue to explore these avenues and refine the understanding of model adaptation.
Thank you for reading!