
100 Days of AI Day 13: What is Instruction Finetuning and How Does It Improve a Pre-Trained Base Model?

About 100 Days of AI:

I am Nataraj, and I decided to spend 100 days starting Jan 1st 2024 learning about AI. With 100 Days of AI my goal is to learn more about AI, specifically LLMs, and share ideas, experiments, opinions, trends & learnings through my blog posts. You can follow along the journey here.

In one of the previous posts, we talked about finetuning and why it is important. In this post we will take a look at a specific kind of finetuning called Instruction Finetuning.

Limitations of Pre-Trained Base Models:

Pre-trained base models like GPT-3 are trained on vast amounts of data. In the case of GPT-3, that is roughly all the data on the internet. Well, we don’t know that for sure, but most of these models are trained on internet-scale data after considerable manual clean-up and formatting. As they are trained, base models learn how to predict the next token and get really good at token prediction. But pure token prediction is not as useful as you would think. If you ask a pre-trained base model “What is the capital of Mexico?” it will not reply with an answer but might complete the input sentence with “What is the capital of Colombia?“. So even though a model like GPT-3 is powerful at token prediction, it will not work as a chatbot or a copilot. So how do we convert a pre-trained model into a useful chatbot like ChatGPT? The answer is finetuning, specifically a type of finetuning called “Instruction Finetuning“.

What is instruction finetuning?

Instruction finetuning, also referred to as “instruction-following,” is the process of teaching a pre-trained base model to behave like a chatbot.

Instruction Finetuning

Instruction finetuning needs datasets that are in the form of questions and answers. You can use public datasets or your company’s dataset if it is already in Q&A form. If your dataset is not in Q&A form, you can convert it using different techniques, like the one used to build Alpaca, or by using custom prompts on other LLMs (a rough sketch of this is shown below). Note that instruction finetuning gives the model a new behavior of answering questions, and this behavior applies not just to the data you use in finetuning but also to the knowledge the model already has, which is what makes finetuning a powerful technique.
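To make the second approach concrete, here is a minimal, hypothetical sketch of converting raw text into Q&A pairs by prompting another LLM. This is not part of the Lamini workflow used later in this post; the model name, prompt wording, and helper function are illustrative, and it assumes you have the openai package installed and an API key configured.

from openai import OpenAI  # assumption: openai package and API key are available

client = OpenAI()

qa_generation_prompt = """Read the passage below and write one question a user might ask about it,
followed by a concise answer. Format the result as:
Q: <question>
A: <answer>

Passage:
{passage}"""

def passage_to_qa(passage):
  # ask a general-purpose LLM to rewrite a raw passage as a Q&A pair
  response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": qa_generation_prompt.format(passage=passage)}],
  )
  return response.choices[0].message.content

print(passage_to_qa("Lamini lets developers train and host custom language models."))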

Instruction Finetuning Using Lamini:

Lamini is an AI company that lets developers work with language models in an easy way, abstracting away the complexity of hosting, training and other complicated aspects. Check out its full capabilities here. We will use Lamini to work with a small language model called Pythia, an open-source model created by EleutherAI, and do instruction finetuning on it, using the open-source Alpaca dataset to show how the finetuning data is prepared and a company-style Q&A dataset hosted on Lamini for the finetuned model.

Step 1: Initialize and load Instruction Finetuning dataset

In this step, let’s initialize the required modules and also look at the Alpaca training dataset. Here’s the code.

import itertools
import jsonlines

from datasets import load_dataset
from pprint import pprint

from llama import BasicModelRunner
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSeq2SeqLM

## we are using alpaca data set, which is an open source fine tuning data set
instruction_tuned_dataset = load_dataset("tatsu-lab/alpaca", split="train", streaming=True)
m = 5
print("Instruction-tuned dataset:")
top_m = list(itertools.islice(instruction_tuned_dataset, m))
for j in top_m:
  print(j)

This is what the instruction tuning dataset looks like. It contains data in the form of questions and answers.

Instruction fine tuning data set
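In case the dataset screenshot is not visible, here is roughly what one record looks like. The field names (instruction, input, output, text) are the actual Alpaca columns; the values below are illustrative.

# illustrative Alpaca-style record (field names are real, values are made up)
{
  "instruction": "Give three tips for staying healthy.",
  "input": "",
  "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep.",
  "text": "Below is an instruction that describes a task. ... ### Response: ..."
}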

Step 2: Hydrate the prompts

In this step we take the data from the Alpaca set and put it into the prompt templates shown below.

prompt_template_with_input = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:"""

prompt_template_without_input = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:"""

## hydrate prompts - meaning add data to the above prompts
processed_data = []
for j in top_m:
  if not j["input"]:
    processed_prompt = prompt_template_without_input.format(instruction=j["instruction"])
  else:
    processed_prompt = prompt_template_with_input.format(instruction=j["instruction"], input=j["input"])

  processed_data.append({"input": processed_prompt, "output": j["output"]})

After doing this, the dataset will look as follows.

Hydrated Dataset
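If the screenshot doesn’t render, an illustrative processed record (the same shape the code above produces, with made-up content) looks like this:

# illustrative hydrated record: the prompt template filled with one Alpaca example
{
  "input": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:",
  "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep."
}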

We are basically taking the raw Q&A data and converting it into a format that shows the LLM, for a given question, what the response should look like. We do this iteratively and store the result in a jsonl file.

with jsonlines.open('alpaca_processed.jsonl', 'w') as writer:
    writer.write_all(processed_data)
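As a quick sanity check (not in the original walkthrough), you can read the file back to confirm what was written; this reuses the jsonlines, itertools and pprint imports from step 1.

# optional: read the jsonl back and print the first two processed records
with jsonlines.open('alpaca_processed.jsonl') as reader:
  for record in itertools.islice(reader, 2):
    pprint(record)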

Step 3 – Non-Finetuned Output

In steps 1 & 2 we loaded raw data, hydrated it, and stored it in jsonl format. Lamini already has this hydrated data ready to go, so technically steps 1 & 2 are not necessary, but they help show how the data for instruction finetuning is prepared. Let’s first see how a non-finetuned version of the Pythia model responds to a simple question.

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m") #70M parameter model that is not instruction tuned.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")

def inference(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100):
  # Tokenize
  input_ids = tokenizer.encode(
          text,
          return_tensors="pt",
          truncation=True,
          max_length=max_input_tokens
  )

  # Generate (note: max_length counts the prompt tokens plus the generated tokens)
  device = model.device
  generated_tokens_with_prompt = model.generate(
    input_ids=input_ids.to(device),
    max_length=max_output_tokens
  )

  # Decode
  generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)

  # Strip the prompt
  generated_text_answer = generated_text_with_prompt[0][len(text):]

  return generated_text_answer

## the 70M model doesn't have any company-specific data; we will use the lamini_docs Q&A dataset hosted on Lamini to finetune it
# load the lamini_docs finetuning dataset
finetuning_dataset_path = "lamini/lamini_docs"
finetuning_dataset = load_dataset(finetuning_dataset_path)
#print(finetuning_dataset)

test_sample = finetuning_dataset["test"][0]
print(test_sample)
print("untrained output sample")
print(inference(test_sample["question"], model, tokenizer))

This is the output I got. You will notice that it is not helpful; the model is just doing token completion and is not giving an actual answer.

Step 4 – Instruction Finetuned Output

Once the Q&A data seen in the previous steps is used to instruction-finetune the same model, it will start to behave like a chatbot and will provide more accurate answers to your questions, both on the finetuned data and on the knowledge the model already contains. It’s almost like a child learning a language for the first time: he or she can now express the feelings they already had, along with the new things they learned through the language training. Just like the pre-trained version of the model, the instruction-finetuned model is also hosted on Lamini and can be inferred with a command as shown below. (Yes, Lamini is great!)

## finetuned output
instruction_model = AutoModelForCausalLM.from_pretrained("lamini/lamini_docs_finetuned")
print("instruction finetuned output")
print(inference(test_sample["question"], instruction_model, tokenizer))

Here is what the output will look like. You will note that instead of the gibberish we saw in the previous step, we now get a more accurate output.

Instruction finetuned model’s output
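If you prefer to call the same finetuned model through Lamini’s hosted API instead of loading it locally with transformers, a sketch using the BasicModelRunner imported in step 1 might look like the following. This assumes BasicModelRunner accepts a model name and can be called directly with a prompt, and that a Lamini API key is configured.

## hypothetical: inference through Lamini's hosted API instead of local transformers
instruction_model_runner = BasicModelRunner("lamini/lamini_docs_finetuned")
print(instruction_model_runner(test_sample["question"]))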

The goal of this post is to give an intro to instruction finetuning and how it is used to turn base models into more usable versions. In future posts I will dive deeper into the actual process of doing instruction finetuning.

That’s it for Day 13 of 100 Days of AI.

I write a newsletter called Above Average where I talk about the second-order insights behind everything that is happening in big tech. If you are in tech and don’t want to be average, subscribe to it.

Follow me on Twitter or LinkedIn for the latest updates on 100 Days of AI. If you are in tech you might be interested in joining my community of tech professionals here.