Finetuning Phi-3-mini-4k-instruct on Custom Data + CPU Inference


Small Language Models (SLM)

In the past two years, robotics has undeniably gained increasing attention due to exciting advancements in Large Language Models (LLMs). Many of our expectations for robots, such as engaging in natural conversations and taking actions based on spoken commands, can only be realized with the help of these LLMs. However, despite their incredible power, the enormous size and resource demands of these models have restricted their widespread access and use. Most robots currently on the market are not yet equipped with the necessary GPUs to run such large models.

This is where Small Language Models (SLMs) come into play. SLMs offer a compact, efficient alternative to their larger counterparts. By requiring fewer resources, they have the potential to make AI more accessible and applicable across various domains, particularly in robotics. These smaller models could bridge the gap between our expectations for intelligent robots and the current hardware limitations, paving the way for more widespread adoption of AI-powered robotics in everyday applications.

What are Small Language Models (SLM)?

A Small Language Model (SLM) is essentially a scaled-down version of a Large Language Model (LLM). By “smaller,” I mean it has significantly fewer parameters and a simpler architecture. [1] As a result, SLMs require fewer computational resources to run and are more efficient. Interestingly, some businesses have discovered that SLMs can outperform LLMs on specific tasks, primarily due to their ability to be more easily fine-tuned on targeted data.

For my non-technical friends, an apt analogy would be the specialization of labor often observed in enterprises. This specialization encourages mastery and optimization of skills among employees, leading to lower costs and higher efficiency. Similarly, SLMs can be thought of as specialized models that excel in specific tasks, much like how specialized workers excel in their particular roles within a company.

Phi-3-mini-4k-instruct

Microsoft introduced the Phi-3-mini model in April 2024. This model has 3.8 billion parameters and was trained on 3.3 trillion tokens. Given its small size, Phi-3-mini is designed to be deployable on CPUs (everyday computers) and even mobile phones.

Microsoft's team has shown that Phi-3-mini matches the performance of larger models in language understanding and reasoning tasks. However, it does have limitations, particularly in areas requiring extensive factual knowledge and multilingual capabilities. [2] That said, it is sufficient for most of the specific tasks our team was interested in.

Phi-3-mini is available in two context lengths (the number of tokens the model can process and remember), 4k and 128k, so you can pick whichever version suits your purpose best. If your task does not require processing that many tokens, the 4k version is the better pick, as it gives better results at higher speed. Another point to note: the Phi-3-mini models are instruction-tuned, meaning the model is trained to follow instructions phrased the way we communicate in everyday life, improving its effectiveness in real-world scenarios. [3]
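To see what instruction tuning means in practice, here is a minimal sketch of how the instruct model expects its prompts to be wrapped. It uses the transformers chat-template helper; the exact special tokens come from the model's own tokenizer config, so treat this output as illustrative:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
messages = [{"role": "user", "content": "Pick up the red cube."}]
# apply_chat_template wraps the message in Phi-3's <|user|> ... <|end|> <|assistant|> markers
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)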

Finetuning Model with Unsloth on Google Colab

Introduction to Unsloth

Kudos to Unsloth! This open-source package is designed to accelerate model fine-tuning. Remarkably, it enables the entire fine-tuning process to take place in Google Colab’s free tier, eliminating the need for additional financial investment.

The package’s speed boost stems from its manual autograd operations and matrix optimizations. Interestingly, after deriving their own matrices, the Unsloth team discovered that bracket placement plays a pivotal role in efficiency, because the matrices involved differ greatly in their dimensions. By carefully placing brackets and prioritizing certain matrix multiplications, they substantially reduced the number of floating-point operations (FLOPs) required to complete the process. [4]
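To make the bracketing point concrete, here is a toy back-of-the-envelope calculation (the shapes are made up, LoRA-style) showing how parenthesizing the same product X · A · B changes the FLOP count:

# Chained matmul X (n x d) @ A (d x r) @ B (r x d).
# Multiplying an (m x k) matrix by a (k x p) one costs roughly 2*m*k*p FLOPs.
n, d, r = 2048, 4096, 16  # hypothetical sequence length, hidden size, LoRA rank

flops_left = 2 * n * d * r + 2 * n * r * d    # (X @ A) @ B keeps intermediates thin
flops_right = 2 * d * r * d + 2 * n * d * d   # X @ (A @ B) forms a full d x d matrix first

print(f"(X A) B: {flops_left / 1e9:.2f} GFLOPs")   # ~0.54 GFLOPs
print(f"X (A B): {flops_right / 1e9:.2f} GFLOPs")  # ~69.26 GFLOPs

Same result, two orders of magnitude fewer operations, just from where the brackets go.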

It’s absolutely fascinating how such seemingly minor adjustments can have such a profound impact. I personally recommend checking out the Unsloth repository to anyone who wants to fine-tune an LLM or SLM.

The finetuning guide (Colab notebook) is available in the Unsloth GitHub repository. Go check the guide out for more details!
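As a taste of what the notebook does, loading Phi-3 with Unsloth and attaching LoRA adapters looks roughly like this (the hyperparameters here are illustrative, not necessarily the notebook's exact values):

from unsloth import FastLanguageModel

# Load a 4-bit quantized Phi-3 base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Phi-3-mini-4k-instruct",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Attach LoRA adapters; only these small matrices get trained
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
)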

After saving the model, you should see the LoRA adapter files (such as adapter_config.json and adapter_model.safetensors) together with the tokenizer files (tokenizer.json, tokenizer_config.json, and so on) in the ./lora_model directory.

To test the model, if your computer is fast enough, you can load the saved model using the peft and transformers packages.

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load the tokenizer and the LoRA-adapted model from the saved directory
model_path = "./lora_model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoPeftModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
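A quick smoke test could then look like this (a sketch; adjust the prompt string to match whatever chat template you fine-tuned with):

inputs = tokenizer("<|user|>\nHello!<|end|>\n<|assistant|>\n", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))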

CPU Inference

Introduction to ONNX

ONNX, which stands for Open Neural Network Exchange, is an open-source project originally developed by Microsoft and Facebook. Its primary goal is to facilitate the exchange of neural network models among different deep learning frameworks. With its cross-library and cross-platform capabilities, ONNX enables engineers to leverage the benefits of all deep learning and machine learning libraries seamlessly.

At its core, ONNX defines a standardized representation format for deep learning models. This representation consists of a computational graph made up of nodes and edges. Each node in the graph represents an operation, such as convolution or pooling, while the edges represent the flow of data between these operations.

This standardized format allows for greater interoperability between different AI tools and platforms, making it easier for developers to work with models across various frameworks and deployment environments. [5]
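To see this graph structure in practice, you can export a toy PyTorch model and inspect its nodes. A minimal sketch (the Tiny module and file name are just for illustration):

import torch
import onnx

class Tiny(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(8, 2)

    def forward(self, x):
        return torch.relu(self.fc(x))

# Export the model by tracing it with a dummy input
torch.onnx.export(Tiny(), torch.randn(1, 8), "tiny.onnx")

# Each node is one operation; the edges are the tensors flowing between them
graph = onnx.load("tiny.onnx").graph
print([node.op_type for node in graph.node])  # e.g. ['Gemm', 'Relu']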

ONNX Efficiency

ONNX models are designed to run efficiently on a wide range of devices, from mobile phones to GPUs. ONNX Runtime, a dedicated engine for executing ONNX models, can accelerate large model training and increase throughput by up to 40%. This impressive performance boost is achieved through various memory and compute optimizations.

To better understand this, imagine a person who organizes information in their brain differently than others and is adept at finding efficient shortcuts at work. Similarly, ONNX Runtime excels in two key areas [6]:

Memory Efficiency:

  • Maximizes batch size, allowing more data to be processed simultaneously
  • Implements careful memory planning, optimizing how information is stored and accessed

Compute Optimization:

  • Utilizes kernel optimizations for faster processing
  • Employs an FP16 Optimizer, which effectively halves the size of tensors, reducing arithmetic operations and network bandwidth
  • Leverages graph optimization techniques such as node fusion and elimination, streamlining the computational process

The above are just some examples of how the team optimized memory and compute. For more details, please check out the blog post by Hugging Face. [6]
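As an illustration, ONNX Runtime exposes the graph optimizations mentioned above through its session options. A minimal sketch, reusing the tiny.onnx file from the earlier example:

import onnxruntime as ort

opts = ort.SessionOptions()
# Enable all graph optimizations, including node fusion and elimination
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession("tiny.onnx", opts, providers=["CPUExecutionProvider"])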

Converting a PEFT model to an ONNX Generative AI model

After you have loaded the AutoPeftModelForCausalLM and tokenizer in the code above, you can export the model to ONNX format.

Step 1: Merge the LoRA adapters into the base model and save the merged model.

# merge_and_unload folds the LoRA weights into the base model's weights
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")

Step 2: Install onnxruntime_genai on your computer. If you are using a Mac with Apple silicon, I would recommend following the guide here to build the package from source. Then run the command below in the terminal; -i is the input model folder, -o the output folder, -p the quantization precision, and -e the target execution provider.

python3 -m onnxruntime_genai.models.builder -i ./merged_model -o ./phi3-int4-cpu -p int4 -e cpu

If you encounter the error below when running the command, it is because the config file of the merged model saved in step 1 uses the attribute “rope_type” where builder.py expects the equivalent attribute “type”.

Traceback (most recent call last):
  ...
  File "/Users/kobychoy/anaconda3/envs/myenv/lib/python3.9/site-packages/onnxruntime_genai/models/builder.py", line 173, in __init__
    self.rotemb_attrs["mscale_policy"] = config.rope_scaling["type"]
KeyError: 'type'

To solve this problem, open builder.py inside your site-packages folder, change the key as shown below, and run the command again.

if hasattr(config, "rope_scaling") and config.rope_scaling is not None:
    # For models with multiple rotary embedding caches
    self.rotemb_attrs["mscale_policy"] = config.rope_scaling["rope_type"]
    short_factor = torch.tensor(config.rope_scaling["short_factor"], dtype=torch.float32)

Step 3: To test whether the model files created in the phi3-int4-cpu directory actually run, open a Python shell and load the ONNX model as follows:

import onnxruntime_genai as og
import time

folder = "./phi3-int4-cpu"
model = og.Model(folder)
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Greedy decoding; temperature is ignored when do_sample is False
search_options = {
    'do_sample': False,
    'max_length': 2048,
    'temperature': 0.0,
}

prompt = "HI, I AM A PROMPT"
input_tokens = tokenizer.encode(prompt)
params = og.GeneratorParams(model)
params.set_search_options(**search_options)
params.input_ids = input_tokens
generator = og.Generator(model, params)

start = time.time()
output = ""
try:
    # Generate one token at a time, decoding incrementally via the stream
    while not generator.is_done():
        generator.compute_logits()
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        output += tokenizer_stream.decode(new_token)
except KeyboardInterrupt:
    output += "  --generation aborted--"
print(output)
print(f"Generation took {time.time() - start:.1f} s")

If you encounter a JSONDecodeError: Expecting value: line 1 column 1 (char 0) while loading the Tokenizer, it’s possible that the JSON files were corrupted during step 1. A quick fix is to delete all the tokenizer files (added_tokens.json, special_tokens_map.json, tokenizer_config.json, tokenizer.json, tokenizer.model) in the phi3-int4-cpu folder, copy those files from the original ./lora_model directory back into ./phi3-int4-cpu, and then try running the Python script again.
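A small script for that fix could look like this (a sketch, assuming the directory names used above):

import shutil

tokenizer_files = ["added_tokens.json", "special_tokens_map.json",
                   "tokenizer_config.json", "tokenizer.json", "tokenizer.model"]
for name in tokenizer_files:
    # Overwrite the possibly corrupted copies with the originals
    shutil.copy(f"./lora_model/{name}", f"./phi3-int4-cpu/{name}")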

References

  1. https://www.salesforce.com/blog/small-language-models/
  2. https://arxiv.org/pdf/2404.14219
  3. https://azure.microsoft.com/en-us/blog/introducing-phi-3-redefining-whats-possible-with-slms/
  4. https://unsloth.ai/introducing
  5. https://medium.com/@shivprataprai11/understanding-onnx-an-open-standard-for-deep-learning-models-350a72714660
  6. https://huggingface.co/blog/optimum-onnxruntime-training
