Since Llama3 was released, the PyTorch Llama3 documentation has had a few glitches, pointing at torchtune configurations that still reference Llama2. The Meta website is a little more up to date, but its documentation is light on details. So, I wrote this article to bring everything together.
Prerequisites
- You’ll want to use Python 3.11 until torch.compile supports Python 3.12, and I recommend setting up a virtual environment for this using venv or pipenv (see the sketch after this list).
- Install torchtune:
pip install torchtune
- Install EleutherAI’s Evaluation Harness:
pip install lm_eval==0.4.*
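For reference, here is what that environment setup might look like with venv (the directory name .venv is just a placeholder):
# create and activate a Python 3.11 virtual environment
python3.11 -m venv .venv
source .venv/bin/activate
# install torchtune and the EleutherAI evaluation harness inside it
pip install torchtune
pip install lm_eval==0.4.*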
Download Llama3-8B model
You will need to get access to Llama3 via the instructions on the official Meta Llama3 page. You will also need your Hugging Face access token set up.
Note: some examples here reference a “checkpoint directory”. This is the directory where your downloaded model weights are stored. In the following examples we’ll use /tmp/Meta-Llama-3-8B as the checkpoint directory.
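With your token in hand, you can pull the weights down with torchtune’s download command. This is the form I used; check tune download --help in case the flags have changed in your version:
tune download meta-llama/Meta-Llama-3-8B \
  --output-dir /tmp/Meta-Llama-3-8B \
  --hf-token <YOUR_HF_TOKEN>
This places the original Meta weights (consolidated.00.pth and tokenizer.model) under /tmp/Meta-Llama-3-8B/original, which is the layout the configs below assume.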
Fine-tune the model
The out-of-the-box torchtune recipe for single-GPU fine-tuning uses the Stanford Alpaca dataset, which contains 52K instruction-following prompt/response pairs. It’s worth looking over if you want to provide your own data, but for now we’ll use the default recipe (the command is shown below).
Get some coffee. Depending on your GPU, this process will take at least a couple of hours. I’m running a 24GB NVIDIA RTX 4090, and it took three hours.
With less VRAM and a lighter-weight GPU, it could take 16 hours or more. The Meta Llama3 page and the PyTorch torchtune site for Llama3 have instructions for running on multi-GPU systems and for tuning on smaller GPUs.
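I used torchtune’s single-device LoRA recipe with its stock Llama3-8B config. To the best of my knowledge this is the standard invocation; adjust the config name if your torchtune version calls it something different:
tune run lora_finetune_single_device --config llama3/8B_lora_single_device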
Tuning run results
When completed, the above command will place meta_model_0.pt and adapter_0.pt files in the checkpoint directory.
Evaluating the tuned model
To run evaluations, you can use torchtune to make copies of the various eleuther_evaluation config files, then edit them to reflect where to look for models and which metrics to run.
For example
tune cp eleuther_evaluation ./custom_eval_config.yaml
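If you are not sure which recipes and configs your torchtune version ships with, you can list them first (this uses the tune ls command, which I believe has been available since the early releases):
tune ls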
However, the instructions I found in the PyTorch end-to-end workflow needed fixing, so I have provided pre-edited copies of these files for use here.
Baseline Evaluation of Un-tuned Llama3-8B
custom_eval_config_orig.yaml
# Config for EleutherEvalRecipe in eleuther_eval.py
#
# To launch, run the following command from root torchtune directory:
#   tune run eleuther_eval --config eleuther_evaluation tasks=["truthfulqa_mc2","hellaswag"]

# Model Arguments
model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: /tmp/Meta-Llama-3-8B/original
  checkpoint_files: [
    consolidated.00.pth
  ]
  recipe_checkpoint: null
  output_dir: /tmp/Meta-Llama-3-8B/original
  model_type: LLAMA3

# Tokenizer
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /tmp/Meta-Llama-3-8B/original/tokenizer.model

# Environment
device: cuda
dtype: bf16
seed: 217

# EleutherAI specific eval args
tasks: ["truthfulqa_mc2"]
limit: null
max_seq_length: 4096

# Quantization specific args
quantizer: null
Run
tune run eleuther_eval --config ./custom_eval_config_orig.yaml
On the 24GB RTX 4090 this takes about four minutes, and we get about 43.9% accuracy on truthfulqa_mc2.
Fine-Tuned Llama 8B Evaluation
custom_eval_config.yaml
# Config for EleutherEvalRecipe in eleuther_eval.py
#
# To launch, run the following command from root torchtune directory:
#   tune run eleuther_eval --config eleuther_evaluation tasks=["truthfulqa_mc2","hellaswag"]

# Model Arguments
model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: /tmp/Meta-Llama-3-8B
  checkpoint_files: [
    meta_model_0.pt
  ]
  recipe_checkpoint: null
  output_dir: /tmp/Meta-Llama-3-8B
  model_type: LLAMA3

# Tokenizer
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /tmp/Meta-Llama-3-8B/original/tokenizer.model

# Environment
device: cuda
dtype: bf16
seed: 217

# EleutherAI specific eval args
tasks: ["truthfulqa_mc2"]
limit: null
max_seq_length: 4096

# Quantization specific args
quantizer: null
Note that in the above config we are now pointing at the location of the fine-tuned weights, meta_model_0.pt.
Run
tune run eleuther_eval --config ./custom_eval_config.yaml
We get 55.3% accuracy, an increase of about 11.4 percentage points over the baseline!
Model Generation
Now for the fun part. Seeing how the fine-tuned model handles prompts.
We will use top_k=300 and temperature=0.6 for these tests. I noticed that temperature=0.8 produces noticeably more hallucinated output.
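These sampling parameters live in the generation config below, but torchtune also lets you override config values as key=value arguments on the command line, so you can experiment without editing the file. For example, assuming the custom_generation_config.yaml we create next:
tune run generate --config ./custom_generation_config.yaml \
  temperature=0.6 top_k=300 \
  prompt="What are some interesting sites to visit in the Bay Area?"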
Fine-tuned Llama-8B Generation
custom_generation_config.yaml
# Config for generation with torchtune's generate recipe
#
# To launch, run:
#   tune run generate --config ./custom_generation_config.yaml

# Model Arguments
model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: /tmp/Meta-Llama-3-8B
  checkpoint_files: [
    meta_model_0.pt
  ]
  recipe_checkpoint: null
  output_dir: /tmp/Meta-Llama-3-8B
  model_type: LLAMA3

# Tokenizer
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /tmp/Meta-Llama-3-8B/original/tokenizer.model

# Environment
device: cuda
dtype: bf16
seed: 217

# Quantization specific args
quantizer: null

# Generation arguments; defaults taken from gpt-fast
# prompt: "Hello, my name is"
max_new_tokens: 600
temperature: 0.6 # 0.8 and 0.6 are popular values to try
top_k: 300
Run
tune run generate --config ./custom_generation_config.yaml \
prompt="What are some interesting sites to visit in the Bay Area?"
So, this works, but we still have to load 16GB of weights and run inference, which takes about 60 seconds on my rig. Your mileage may vary. Let’s see if we can quantize the weights to speed up both load time and inference time.
Quantization
Quantizing the model weights means converting them to lower-precision integer types. There are many algorithms, but we will use one of the standard torchtune recipes, which uses torchao to produce an INT4 (4-bit, weight-only) quantization; roughly speaking, each group of groupsize weights shares a scale factor, and each weight is stored as a 4-bit integer.
INT4 config
custom_quantization_config.yaml
# Config for the torchtune quantize recipe
#
# To launch, run:
#   tune run quantize --config ./custom_quantization_config.yaml

# Model Arguments
model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: /tmp/Meta-Llama-3-8B
  checkpoint_files: [
    meta_model_0.pt
  ]
  recipe_checkpoint: null
  output_dir: /tmp/Meta-Llama-3-8B
  model_type: LLAMA3

# Tokenizer
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /tmp/Meta-Llama-3-8B/original/tokenizer.model

# Environment
device: cuda
dtype: bf16
seed: 217

quantizer:
  _component_: torchtune.utils.quantization.Int4WeightOnlyQuantizer
  groupsize: 256
Run
tune run quantize --config ./custom_quantization_config.yaml
This runs fairly quickly, producing a meta_model_0-4w.pt weights file of only 4.92GB in the checkpoint directory.
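You can sanity-check the size difference between the original fine-tuned weights and the quantized file directly in the checkpoint directory:
ls -lh /tmp/Meta-Llama-3-8B/meta_model_0.pt /tmp/Meta-Llama-3-8B/meta_model_0-4w.pt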
Generation using the quantized model
OK! Let’s see how much faster we can run things, but keep in mind that the INT4 version of the weights will reduce the model’s language performance somewhat.
Quantized generation configuration
custom_generation_4w_config.yaml
# Config for generation with torchtune's generate recipe, using the quantized weights
#
# To launch, run:
#   tune run generate --config ./custom_generation_4w_config.yaml

# Model Arguments
model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.utils.FullModelTorchTuneCheckpointer
  checkpoint_dir: /tmp/Meta-Llama-3-8B
  checkpoint_files: [
    meta_model_0-4w.pt
  ]
  recipe_checkpoint: null
  output_dir: /tmp/Meta-Llama-3-8B
  model_type: LLAMA3

# Tokenizer
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /tmp/Meta-Llama-3-8B/original/tokenizer.model

# Environment
device: cuda
dtype: bf16
seed: 217

# Quantization specific args
quantizer:
  _component_: torchtune.utils.quantization.Int4WeightOnlyQuantizer
  groupsize: 256

# Generation arguments; defaults taken from gpt-fast
# prompt: "Hello, my name is"
max_new_tokens: 600
temperature: 0.6 # 0.8 and 0.6 are popular values to try
top_k: 300
Note that there are a few key differences between this generation configuration file and the non-quantized one we used earlier on the fine-tuned model. We are now pointing at the meta_model_0-4w.pt weights, the checkpointer has changed to FullModelTorchTuneCheckpointer, and we must provide quantizer details matching the ones we used in the quantization step.
Run
tune run generate --config ./custom_generation_4w_config.yaml \
prompt="What are some interesting sites to visit in the Bay Area?"
It takes about 22 seconds to load the model and a mere 7 seconds to run the inference. This is about a 300% improvement over the un-quantized generation run, and the output also looks pretty good.
However, let’s run the evaluation on this quantized model to see whether quantization has affected its accuracy.
Quantized evaluation configuration
custom_eval_4w_config.yaml
# Config for EleutherEvalRecipe in eleuther_eval.py
#
# To launch, run the following command from root torchtune directory:
#   tune run eleuther_eval --config eleuther_evaluation tasks=["truthfulqa_mc2","hellaswag"]

# Model Arguments
model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.utils.FullModelTorchTuneCheckpointer
  checkpoint_dir: /tmp/Meta-Llama-3-8B
  checkpoint_files: [
    meta_model_0-4w.pt
  ]
  recipe_checkpoint: null
  output_dir: /tmp/Meta-Llama-3-8B
  model_type: LLAMA3

# Tokenizer
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /tmp/Meta-Llama-3-8B/original/tokenizer.model

# Environment
device: cuda
dtype: bf16
seed: 217

# EleutherAI specific eval args
tasks: ["truthfulqa_mc2"]
limit: null
max_seq_length: 4096

# Quantization specific args
quantizer:
  _component_: torchtune.utils.quantization.Int4WeightOnlyQuantizer
  groupsize: 256
Note that in the above config we have changed the checkpointer to FullModelTorchTuneCheckpointer, as this checkpointer can support the weights_only=true checkpoint file we created with the Int4WeightOnlyQuantizer when we quantized the model. We have also added the quantizer details.
Run
tune run eleuther_eval --config ./custom_eval_4w_config.yaml
The results show an accuracy of about 49% for the quantized, fine-tuned model. That is a net increase of about 5 percentage points over the baseline, non-fine-tuned Llama3 model, although quantization gave back about 6 of the points we gained from fine-tuning.
So, we traded some accuracy for performance, but we still ended up with an overall improvement. This shows the importance of including evaluation benchmarks when fine-tuning LLMs.
We Made It!
If you followed along this far, congratulations 🎉!
I hope you had as much fun as I did. Next, I’ll cover how to convert this smaller model to GGUF format and post it to a Hugging Face repository to share with others.