How to use InstructLab
InstructLab is a powerful tool, but one with many knobs, options, and modes of operation. So, if you just want to incorporate your knowledge into an LLM locally, how can you do that? Assuming you have ilab installed on your machine (the most recent version if possible!) and have run ilab config init, this is how you should go about generating your data and training your own LLM!
Choose your data, and write it up!
Let’s say you want to teach the model about Moo Deng, the viral Pygmy Hippo. You need to download a PDF version of her wiki page and convert it into an MD file. To do that, you can use a tool like docling by running a command like docling /path/to/PDF.pdf on your local machine.
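For example, a minimal conversion might look like the following sketch. The file names and output location here are just placeholders; check docling --help on your version for the exact options and where it writes its output.
docling ~/Downloads/moo_deng_wiki.pdf   # convert the PDF; recent docling versions emit Markdown by default
ls *.md                                 # the resulting .md file is what you will edit and commit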
Once your PDF has been converted to MD, you need to write up a QNA file to let the model know what type of data you want it to pick up from the document. In this case, Moo Deng’s wiki is pretty short so granite is going to become quite the expert!
Knowledge vs Skills
The type of information we are introducing here is knowledge. Knowledge is learning about something; skills are learning how to complete a specific task or how to do something. Teaching the model about Moo Deng is teaching it knowledge about her.
The knowledge and skill yaml files are different. Knowledge (schema version 3) must point to a document from which the context blocks are pulled.
A knowledge qna yaml for Moo Deng will look something like this:
created_by: cdoern
version: 3
domain: animals
seed_examples:
  - context: |
      Moo Deng (bouncy pig) is a pygmy hippopotamus, living in Khao Kheow Open Zoo in Si Racha, Chonburi, Thailand. She gained notability at two months of age in September 2024 as a popular Internet meme after images of her went viral online.
      Moo Deng was born on 10 July 2024. Her name was chosen through a public poll, with over 20,000 people voting for "Moo Deng", translating to "bouncy pork" or "bouncy pig".
    questions_and_answers:
      - question: |
          Who is Moo Deng?
        answer: |
          Moo Deng is a pygmy hippopotamus living in Khao Kheow Open Zoo, Thailand, who became an Internet sensation in 2024.
      - question: |
          When was Moo Deng born?
        answer: |
          Moo Deng was born on 10 July 2024.
      - question: |
          How was Moo Deng's name chosen?
        answer: |
          Moo Deng's name was chosen through a public poll with over 20,000 votes, translating to "bouncy pork" or "bouncy pig".
  - context: |
      Moo Deng's father is Tony, and her mother is Jonah. She has two full siblings: Nadet, born in 2010, male, and Moo Tun, born on 27 October 2019, male.
      She also has three half-siblings through her mother's relationship with Rambo: Ko, Kanya, and Phalo. Additionally, Moo Deng has one half-sibling through her father's relationship with Kanya: Moo Wan, female.
    questions_and_answers:
      - question: |
          Who are Moo Deng's parents?
        answer: |
          Moo Deng's father is Tony, and her mother is Jonah.
      - question: |
          How many full siblings does Moo Deng have?
        answer: |
          Moo Deng has two full siblings, Nadet and Moo Tun.
      - question: |
          How many half-siblings does Moo Deng have?
        answer: |
          Moo Deng has four half-siblings, Ko, Kanya, Phalo (from her mother's side) and Moo Wan (from her father's side).
  - context: |
      Khao Kheow Open Zoo posted images of its pygmy hippopotamuses on its official Facebook page, and Moo Deng quickly became a favorite among fans. She is noted for being more playful and energetic than other hippopotamuses.
      The zoo responded to her popularity by selling merchandise featuring designs based on Moo Deng. Other companies joined in by producing merchandise, including Vetmon Cafe, which created a realistic cake shaped like Moo Deng. Additionally, Sephora posted a makeup tutorial inspired by her, and she became the subject of many fan artworks.
    questions_and_answers:
      - question: |
          How did Moo Deng become popular?
        answer: |
          Moo Deng became popular after Khao Kheow Open Zoo posted images of her on Facebook, where she quickly became a fan favorite.
      - question: |
          What merchandise was created based on Moo Deng?
        answer: |
          The zoo sold merchandise featuring designs based on Moo Deng, and Vetmon Cafe created a realistic cake shaped like her.
      - question: |
          Which company posted a makeup tutorial inspired by Moo Deng?
        answer: |
          Sephora posted a makeup tutorial inspired by Moo Deng.
  - context: |
      Due to Moo Deng's viral online popularity, the number of daily visitors to the zoo doubled in early September 2024. Some visitors harassed the baby hippo by splashing water at her or throwing objects to wake her up. In response, the zoo installed security cameras around her enclosure, and the zoo's director threatened legal action against visitors who harassed her.
      The zoo also implemented a five-minute time limit for visitors to accommodate the high volume of guests.
    questions_and_answers:
      - question: |
          How did Moo Deng's popularity affect the number of visitors at Khao Kheow Open Zoo?
        answer: |
          The number of daily visitors to Khao Kheow Open Zoo doubled in early September 2024 due to Moo Deng's popularity.
      - question: |
          What actions did the zoo take to protect Moo Deng from harassment?
        answer: |
          The zoo installed security cameras around Moo Deng's enclosure and threatened legal action against visitors who harassed her.
      - question: |
          How long can visitors spend with Moo Deng at the zoo?
        answer: |
          The zoo implemented a five-minute time limit for visitors to see Moo Deng due to the high volume of guests.
  - context: |
      In September 2024, zoo director Narongwit Chodchoi announced that the zoo had begun the process of copyrighting and trademarking "Moo Deng the hippo" to raise funds for the zoo. The zoo also plans to launch a continuous livestream to allow fans to watch Moo Deng live over the Internet.
    questions_and_answers:
      - question: |
          What did the zoo do to protect Moo Deng's brand?
        answer: |
          The zoo began the process of copyrighting and trademarking "Moo Deng the hippo" in September 2024.
      - question: |
          How does the zoo plan to keep fans engaged with Moo Deng online?
        answer: |
          The zoo plans to launch a continuous livestream for fans to watch Moo Deng live over the Internet.
      - question: |
          Why is the zoo copyrighting Moo Deng's name?
        answer: |
          The zoo is copyrighting and trademarking "Moo Deng the hippo" to raise funds for its operations.
  - context: |
      On September 28, Moo Deng was the subject of a Saturday Night Live sketch. She was parodied by Bowen Yang on Weekend Update, where the character was used to satirize American pop-artist Chappell Roan's commentary on fame and political endorsements. Yang later extended his support to Roan.
    questions_and_answers:
      - question: |
          How was Moo Deng featured on Saturday Night Live?
        answer: |
          Moo Deng was parodied by Bowen Yang on Weekend Update on September 28, 2024.
      - question: |
          What was Moo Deng's parody on SNL used to satirize?
        answer: |
          Moo Deng's parody on SNL was used to satirize American pop-artist Chappell Roan's commentary on fame and political endorsements.
      - question: |
          Who portrayed Moo Deng on Saturday Night Live?
        answer: |
          Bowen Yang portrayed Moo Deng on Saturday Night Live.
document_outline: Verbatim information about Moo Deng, covering her rise to internet fame, her background, online popularity, merchandise, and how the zoo responded to her viral fame.
document:
  repo: https://github.com/cdoern/knowledge
  commit: 142a62438f227e4bcadd8aac55ef4a9fa625ce0f
  patterns:
    - pop_culture/moodeng.md
To write a good knowledge yaml, you need at least 5 context blocks, each with at least 3 question-and-answer pairs. ilab data generate does not allow you to manually tune the amount of synthetic data for knowledge; it is directly correlated to the number of context blocks you have.
You will notice that at the bottom of the qna.yaml there is a reference to a document in a repository. As part of knowledge training, you need to place your MD files in a git repository. I recommend making a repository for all of your knowledge markdown files, categorizing them by the domain they fall under. The one in this scenario is cdoern/knowledge, and moodeng.md falls under the pop_culture directory.
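The qna.yaml itself goes into your local taxonomy under a knowledge path. The layout below is an assumption on my part, but it matches the checkpoint name that shows up in the SDG logs later (knowledge_science_animals_hippos), and ilab taxonomy diff will tell you whether the new leaf is valid:
# markdown source lives in its own git repo (assumed layout): knowledge/pop_culture/moodeng.md, pushed to GitHub
# the qna.yaml goes under the local taxonomy, for example:
mkdir -p ~/.local/share/instructlab/taxonomy/knowledge/science/animals/hippos
cp qna.yaml ~/.local/share/instructlab/taxonomy/knowledge/science/animals/hippos/
ilab taxonomy diff   # confirm the new qna.yaml is detected and passes validation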
Let’s generate some data
Depending on your system, there are two ways to generate data:
ilab data generate --pipeline simple
or
ilab data generate --pipeline full
As of ilab 0.19.4, the full pipeline is the default. It takes longer, with runtime directly proportional to the size of the MD file you introduce, but it generates extremely high quality data compared to simple. In ilab 0.21.0, --max-num-tokens was added to ilab data generate. This option limits the length of the generated responses during the gen_knowledge block of the full pipeline. The default value is 4096. For a small markdown file like Moo Deng's, 4096 is fine! However, whenever I generate data from a larger document (20+ pages), I set --max-num-tokens 512. This dramatically speeds up the data generation process and significantly lowers training time, since we are working with shorter, smaller batches of data.
When generating skills, you can use the --sdg-scale-factor flag to multiply the number of generated samples. The default is 30 times the amount you provide. For knowledge, the number of generated samples is directly correlated to the number of context blocks you provide, but generation can be sped up by using --max-num-tokens.
The simple and full pipelines differ greatly. Simple is less rigorous about what data it creates and drops. The full pipeline takes the context of each chunk of data into account and filters poorly generated data more accurately, which is why it takes longer.
The importance of a good teacher
Your teacher model (the one used in ilab data generate) is crucial to the quality of your generated data. When using full, we recommend Mixtral-8x7b-Instruct on systems with 150+ GB of vRAM. All other systems should use something like Mixtral-8x7b-Instruct-GPTQ or a similar quantized model; with the full pipeline, that means the GGUF model Mistral-7B-Instruct-v0.2-GGUF. We opt for the Mistral GGUF instead of Mixtral on consumer hardware because it is 1/5 of the size with the same prompt template and comparable performance. If you are using the full pipeline, this is the default teacher model and will be pulled for you when running ilab model download.
If you are using the simple pipeline, you must use merlinite-7b-lab-GGUF as the teacher model. Merlinite is an adequate teacher for skills as it is a "smarter" model from the granite family, but models like Mixtral are significantly better at generating synthetic data.
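If you need to pull a teacher explicitly, ilab model download handles it. The repository and filename below are what the default Mistral GGUF resolves to on my install; treat them as an example and check ilab model download --help for the flags your version supports.
ilab model download   # pulls the default models, including the Mistral GGUF teacher on recent releases
# or fetch a specific GGUF teacher:
ilab model download --repository TheBloke/Mistral-7B-Instruct-v0.2-GGUF --filename mistral-7b-instruct-v0.2.Q4_K_M.gguf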
The full command to be run to generate data for knowledge is:
ilab data generate --pipeline <YOUR_PIPELINE> --model <FULL_PATH_TO_MODEL> --max-num-tokens <DESIRED_MAX_NUM_TOKENS>
If generating skills, you can add something like --sdg-scale-factor 30.
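Filled in with the defaults from this walkthrough, a knowledge run against a larger document might look like the following (the model path is where ilab model download placed the Mistral GGUF on my machine; adjust it to yours):
ilab data generate \
  --pipeline full \
  --model ~/.cache/instructlab/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  --max-num-tokens 512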
The defaults for ilab data generate are Mistral-7B-Instruct-v0.2-GGUF and the full pipeline, so that is what I used!
I simply ran:
ilab data generate
INFO 2024-10-18 09:56:35,301 numexpr.utils:161: NumExpr defaulting to 8 threads.
INFO 2024-10-18 09:56:35,627 datasets:59: PyTorch version 2.3.1 available.
INFO 2024-10-18 09:56:36,310 instructlab.model.backends.llama_cpp:125: Trying to connect to model server at http://127.0.0.1:8000/v1
WARNING 2024-10-18 09:56:51,857 instructlab.data.generate_data:72: Disabling SDG batching - unsupported with llama.cpp serving
INFO 2024-10-18 09:56:51,870 instructlab.data.generate_data:82: Generating synthetic data using 'full' pipeline, '/Users/charliedoern/.cache/instructlab/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf' model, '/Users/charliedoern/.local/share/instructlab/taxonomy' taxonomy, against http://127.0.0.1:57248/v1 server
INFO 2024-10-18 09:56:52,375 instructlab.sdg.generate_data:356: Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
INFO 2024-10-18 09:56:52,563 instructlab.sdg.checkpointing:59: No existing checkpoints found in /Users/charliedoern/.local/share/instructlab/datasets/checkpoints/knowledge_science_animals_hippos, generating from scratch
INFO 2024-10-18 09:56:52,563 instructlab.sdg.pipeline:153: Running pipeline single-threaded
INFO 2024-10-18 09:56:52,563 instructlab.sdg.pipeline:197: Running block: duplicate_document_col
INFO 2024-10-18 09:56:52,563 instructlab.sdg.pipeline:198: Dataset({
features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_response_1', 'icl_response_2', 'icl_response_3'],
num_rows: 6
})
INFO 2024-10-18 09:56:55,261 instructlab.sdg.llmblock:52: LLM server supports batched inputs: False
INFO 2024-10-18 09:56:55,261 instructlab.sdg.pipeline:197: Running block: gen_spellcheck
INFO 2024-10-18 09:56:55,261 instructlab.sdg.pipeline:198: Dataset({
features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_response_1', 'icl_response_2', 'icl_response_3', 'base_document'],
num_rows: 6
})
gen_spellcheck Prompt Generation: 17%|█████████████████████████ | 1/6 [01:03<05:18, 63.76s/it]
This command pulls the defaults from your config.yaml.
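If you would rather change those defaults than pass flags every time, recent ilab releases let you inspect and edit the config directly; the exact subcommands may vary by version, so check ilab config --help on yours.
ilab config show   # print the resolved config.yaml, including the generate defaults
ilab config edit   # open the config in your editor to change the default model, pipeline, etc.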
This process takes a while, and you can see the results of SDG in ~/.local/share/instructlab/datasets/knowledge_train_msgs...jsonl. This is the file generated when your input is a knowledge yaml; the skills file follows the same naming scheme. For me, this file had about 200 generated samples from my original 15. It contains the generated samples, their context, and the expected input and output for the model you are training to follow.
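If you want to peek at the samples before training, each line of that file is a JSON record, so a quick look from the shell works (the file name ends in a generated suffix, so glob for it rather than typing it out):
wc -l ~/.local/share/instructlab/datasets/knowledge_train_msgs*.jsonl                                # how many samples were generated
head -n 1 ~/.local/share/instructlab/datasets/knowledge_train_msgs*.jsonl | python3 -m json.tool     # pretty-print the first sample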
Let’s train a new model!
ilab model train has 3 pipelines: simple, full, and accelerated. Both simple and full are designed to run on laptops. full was introduced more recently and uses techniques specific to the ilab workflow, data processing, and model production process. In this example I will be using the full pipeline, which is now the default training pipeline upstream. This version of training is optimized for systems with 32 or more GB of memory, though it will work on 16 GB as of ilab 0.21.0. This optimization is applied automatically on macOS and can be toggled on Linux using --optimize-memory.
The full command I ran on my machine is:
ilab model train --device mps --data-path ~/.local/share/instructlab/datasets/knowledge_train_msgs...jsonl
Your config will have an optimal effective_batch_size and max_batch_len for training knowledge on a 64 GB machine. If you have more memory, you can turn both of these up using the associated command line flags. If you are training skills, you should use --effective-batch-size=3840.
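As a concrete illustration of turning those knobs up for a skills run on a larger-memory machine, the flags mirror the config keys; the values below are placeholders rather than recommendations, and it is worth confirming the flag names against ilab model train --help on your version.
ilab model train \
  --device mps \
  --data-path ~/.local/share/instructlab/datasets/skills_train_msgs...jsonl \
  --effective-batch-size 3840 \
  --max-batch-len 10000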
For 7 epochs, this process takes around 7 hours on a capable machine. If your machine can handle it, you can chat with the model in another terminal once a checkpoint folder is saved, even while training is still running.
You can chat with the trained models by running:
ilab model chat --model ~/.cache/instructlab/checkpoints/hf_format/samples_<Number>/pytorch_model-Q4_K_M.gguf
The 7th epoch is scientifically the best in terms of performance, and that checkpoint can be found in samples_56.
I ran three epochs of training for this demo and the model clearly now knows who Moo Deng is!
ilab model chat --model ~/.local/share/instructlab/checkpoints/samples_16/pytorch_model-Q4_K_M.gguf
>>> Who is Moo Deng? [S][default]
╭────────────────────────── pytorch_model-Q4_K_M.gguf─────────────────────────────────────────────────────────╮
│ Moo Deng (bouncy pig) is a pygmy hippopotamus living at Khao Kheow Open Zoo in Si Racha, Chonburi, Thailand. She gained notability at two months of age in October 2024 as a popular Internet meme after images of her
│ went viral online.
│
│ ## Background
│
│ Moo Deng was born on August 10, 2024. Her name was chosen through a public poll with over 20,000 votes, translating to "bouncy pig".
│
│ ## Parents and siblings
│
│ Moo Deng's father is Tony, and her mother is Jonah. Moo Deng has two full siblings:
│
...
ilab 0.19.4 introduces full pipeline support for consumer hardware. These pipelines take longer and use more of your system resources, because they are a closer approximation of the GPU-enabled workflow that runs on systems with hundreds of GB of vRAM, and consumer hardware is more limited.
ilab 0.21.0 expands on this functionality and further optimizes both training and generation for smaller hardware footprints with flags like --max-num-tokens for SDG, --optimize-memory for training, and hardware acceleration via the macOS mps PyTorch device. With smaller batches of data, a more memory-friendly training loop, and smaller models, ilab can produce a high-fidelity model right on your laptop!