InstructLab

A new community-based approach to build truly open-source LLMs

Blog

  • Community Model Build

    We have Published our First Community Model!

    Our first community model has been published to granite-3.0-8b-lab-community! This is a really exciting milestone for the InstructLab project, as it marks a level of stability that the developer team has been striving toward for the last few months.

    For those unaware, our Community Model Build process is one in which we take contributions that have been approved to the instructlab/taxonomy repository, put them into a branch, and build a new LAB aligned model based on the contributions.

    If the model proves to be of high enough fidelity and has learned the various skills or knowledge included in the build, we then publish it to HuggingFace!

    This model build was based on ibm-granite/granite-3.0-8b-base, the first publicly available base model of the Granite 3.0 family of models. The team has previously published community LAB aligned models based on the granite-7b family of models. This build is the first publicly available Granite 3.0 LAB aligned model with community data included.

    How Does this Work?

    This process of building and validating has taken the ilab team a lot of effort to figure out. Here is how it works, generally:

    1. If you have existing data in ~/.local/share/instructlab/datasets or ~/.local/share/instructlab/phased, move it elsewhere! Otherwise, generation and training might "resume" from data found in these directories rather than starting fresh.
    2. Mix in the InstructLab community dataset:
    cd ~/.local/share/instructlab/datasets/
    wget https://huggingface.co/datasets/instructlab/InstructLabCommunity/resolve/main/instructlab_community.jsonl
    cd ~

    This step ensures the model sees enough data. Without the community dataset mixed in, your overall number of samples will likely be too low, and you might notice the model forgetting or hallucinating more.

    3. Modify use_legacy_tmpl to be true in the general: section of your config via ilab config edit (see the config sketch after this list). This step is specifically to support the Granite 3.0 family of models on ibm-granite’s HuggingFace.
    4. Run ilab data generate.
    5. Run multi-phase training:
      ilab model train --strategy lab-multiphase --phased-phase1-data /home/instructlab/.local/share/instructlab/datasets/knowledge_train_msgs_*.jsonl --phased-phase2-data /home/instructlab/.local/share/instructlab/datasets/skills_train_msgs_*.jsonl --skip-user-confirm --force-clear-phased-cache
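
    For reference, here is a minimal sketch of what that setting looks like in the config opened by ilab config edit (other keys omitted; your actual config will contain many more options):

      # fragment of the config opened by `ilab config edit`
      general:
        use_legacy_tmpl: true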

    Key Differences in this Process

    Multi-Phase training is probably the most unfamiliar step for upstream users. This strategy uses the accelerated training library. It is called Multi-Phase because it basically runs training twice: Phase 1 is knowledge training, and it automatically promotes the 7th epoch’s checkpoint to be the basis of Phase 2. Phase 2 is skills training, and is where you will notice a larger number of steps due to the mixing of the community dataset. This phase runs MT-Bench, an industry standard skills benchmark, at the end of training. The checkpoint with the best MT-Bench score is chosen as the final training checkpoint to use.

    MT-Bench scores range from 0 to 10; we typically aim for a checkpoint with a score of 6.5 or higher. Anything below 6 will typically have degraded performance.

    This first community model build scored 6.676100628930818. This is a great baseline as we anticipate the more skills and knowledge we add to this model, the higher the score will be over time!

    Go Try it Out!

    Go try out the model for yourself! You can test it against the Q&A yaml files included in this publicly available branch, as these are the contributions we trained the base model on. The model is currently in Safetensors format and therefore requires GPUs or a carefully installed version of vLLM with CPU support. Happy Instructing!

  • End-to-End Testing of the InstructLab CLI

    Overview

    Any successful open source software project these days requires some form of Continuous Integration (CI), and the InstructLab CLI project is no different. We run many types of CI jobs today, including but not limited to:

    • Python/Shell/Markdown Linting
    • Unit, Functional, and E2E Testing
    • Package Building and Publishing

    In this blog post, we are going to focus specifically on E2E testing being done today for InstructLab – why we have it, what it covers, how it works, and what is next.

    Why even have E2E testing?

    The InstructLab CLI has several complex, compute-intensive operations – generating synthetic data, model training, model evaluation, etc. To ensure functionality for our users while also enabling the rapid pace of development typical of the AI community, automated E2E testing was a necessity.

    The current E2E testing runs in two forms:

    1. Smaller-scale jobs against GitHub Pull Requests and Merges in our open source CLI repository that run lower-intensity workflows with smaller hardware footprints in shorter timeframes
    2. Larger-scale nightly jobs that run more compute-intensive workflows in longer timeframes using larger hardware footprints

    The first form of job requires all linting actions in the Pull Request to pass before the E2E jobs trigger, which saves on compute hours.

    The second form of job automatically runs nightly via Cron but can also be run manually by Project Maintainers against Pull Requests.

    E2E Hardware Testing Today

    Right now we are testing the latest bits for InstructLab against three types of NVIDIA GPU configurations. Why just NVIDIA? While InstructLab does offer hardware acceleration for platforms such as AMD and Intel, we have faced the same challenges many others in the larger AI community have in securing consistent hardware on which to run regular CI jobs.

    Given these constraints, our team has focused mainly on building the foundation of our E2E CI infrastructure off NVIDIA, as we are able to take advantage of on-demand cloud instances offered by Amazon Web Services (AWS). Previously, we took advantage of GitHub-hosted GPU-enabled runners for these jobs, but found they lacked the options, flexibility, and cost-effectiveness our organization ultimately needed.

    Below is a simplified chart from our CI documentation with an overview of our current coverage.

    Name                   | Runner Host | Instance Type | OS              | GPU Type
    e2e-nvidia-t4-x1.yml   | AWS         | g4dn.2xlarge  | CentOS Stream 9 | 1 x NVIDIA Tesla T4 w/ 16 GB VRAM
    e2e-nvidia-l4-x1.yml   | AWS         | g6.8xlarge    | CentOS Stream 9 | 1 x NVIDIA L4 w/ 24 GB VRAM
    e2e-nvidia-l40s-x4.yml | AWS         | g6e.12xlarge  | CentOS Stream 9 | 4 x NVIDIA L40S w/ 48 GB VRAM (192 GB total)

    The E2E Workflow

    Our CI infrastructure, orchestrated through GitHub Actions, runs this testing via the following steps:

    • Initialize an EC2 instance with access to one or more NVIDIA GPUs
    • Install InstructLab with CUDA hardware acceleration and vLLM, if applicable
    • Run through a typical InstructLab workflow orchestrated via our e2e-ci.sh Shell script
    • Capture the test results and tear down the EC2 instance
    • Report the test results to one or both of the following places:
      • As a comment or CI status check within a Pull Request
      • Our upstream Slack channel #e2e-ci-results
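
    To make the shape of these jobs concrete, here is a heavily simplified, hypothetical sketch of such a workflow. It is not the project's actual workflow file, which additionally handles EC2 provisioning, teardown, and result reporting:

      # Illustrative sketch only -- the real e2e-nvidia-*.yml workflows differ.
      name: e2e-nvidia-example
      on:
        schedule:
          - cron: "0 0 * * *"        # nightly run
        workflow_dispatch: {}        # manual runs by maintainers
      jobs:
        e2e:
          runs-on: ubuntu-latest     # placeholder; the real jobs run on an EC2-backed GPU instance
          steps:
            - uses: actions/checkout@v4
            - name: Install InstructLab with CUDA acceleration
              run: pip install .     # install details vary per accelerator
            - name: Run the end-to-end test script
              run: ./scripts/e2e-ci.sh   # path assumed; script name taken from this post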

    Slack reporting

    We send results from our nightly E2E CI jobs to the InstructLab Slack workspace channel #e2e-ci-results via the Son of Jeeves bot. This has been implemented via the official Slack GitHub Action.

    Future plans

    We have a lot planned for the InstructLab CI Ecosystem, from more in-depth testing to a greater hardware support matrix to more sophisticated workflow systems. For more details on any of these topics, please see our upstream CI documentation here, check out our CI/CD issues, and come join us over in the InstructLab community!

    Attributions

    While I have penned this blog post, we have had many contributors to the InstructLab CI – I wanted to call attention to the following folks who have been invaluable to this work:

    Nathan Weinberg is a Senior Software Engineer at Red Hat.

  • InstructLab Architecture & Implementation Overview

    InstructLab Architecture & Implementation Overview

    Introduction

    InstructLab is a complex project that spans multiple, involved components, each serving a different part of the workflow. This document aims to provide a high-level overview of the various components, how they are currently organized and related, and the overall flow of control between them.

    High-Level Overview

    Repositories

    InstructLab is spread across multiple repositories based on function. These repositories all reside within the InstructLab GitHub organization and can be viewed here: InstructLab Repositories.

    Here is a quick overview of them:

    • instructlab/taxonomy – This repository holds the taxonomy, where user data that needs to be taught to the model is organized.
    • instructlab/instructlab – This is the Command Line Interface (CLI) repository for InstructLab.
    • instructlab/sdg – This repository contains the synthetic data generation engine, which is responsible for producing training data in the workflow.
    • instructlab/training – This repository contains the main training logic used for multi-phase training of models.
    • instructlab/eval – This repository contains the logic for the evaluation component, responsible for running benchmarks and producing scores to evaluate the model’s performance after training.
    • instructlab/quantize – This is a helper repository used to quantize (shrink) models.

    Workflow

    • The InstructLab taxonomy is the first point of entry for users interacting with InstructLab. Users make a contribution to their local taxonomy clone in the form of a skill or knowledge that they want their model to learn.
      • A WIP UI is being developed to allow users to interact with their taxonomy more intuitively.
    • Once the taxonomy is updated, users can then use the CLI to interact with the other components.
    • The first step is to initiate synthetic data generation. This step takes the seed examples provided by the user in their taxonomy contribution and generates synthetic data samples based on them.
    • Once the data is generated, users can start training their model using that data. This process produces several checkpoints based on the number of epochs run during training.
    • Finally, users can run an evaluation on a chosen checkpoint to gauge objective performance. The evaluation suite includes standardized benchmarks that allow comparison of the model’s performance against other models evaluated against those benchmarks.
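
    In CLI terms, that flow maps roughly onto the following commands (a simplified sketch; recent ilab releases use these names, older releases used shorter aliases such as ilab generate and ilab train, and each command takes flags omitted here):

      ilab data generate     # synthesize training data from the taxonomy seed examples
      ilab model train       # train on the generated data, producing checkpoints
      ilab model evaluate    # benchmark a chosen checkpoint
      ilab model chat        # interact with the served model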

    Component-Wise Breakdown

    Serving

    The serving component is responsible for starting an OpenAI-compatible server that hosts the model, allowing users to interact with it, e.g., for chatting. InstructLab supports two serving backends: llama.cpp and vLLM.

    • llama.cpp: Designed to be laptop-friendly due to being relatively less resource-intensive. It is supported on both macOS and Linux and serves models of type .gguf.
    • vLLM: More compute-intensive and supports serving models across multiple GPUs. It is the preferred runtime for model serving on server-grade hardware, supported only on Linux, and serves models of type .safetensors.

    There is existing logic within the InstructLab CLI to automatically pick the right serving backend based on the supplied model and the environment it is running in.
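
    For example (a sketch with placeholder model paths; the --backend flag is an explicit override, since the CLI can usually infer the backend from the model format):

      ilab model serve --model-path ~/.cache/instructlab/models/granite-7b-lab-Q4_K_M.gguf   # .gguf model, served with llama.cpp
      ilab model serve --model-path <path-to-safetensors-model> --backend vllm               # .safetensors model, served with vLLM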

    Once a model is served, it can be used either for chatting or data generation.

    Data Generation

    The synthetic data generation logic is contained almost entirely within instructlab/sdg. The process requires supplying a "teacher" model, responsible for generating the synthetic data based on the seed examples provided in the taxonomy.

    This step depends on the serving module, as the teacher model needs to be hosted and accessible via a server before data generation can occur.

    This step also includes data mixing. The output of this step is the creation of knowledge and skill datasets fed into the training module.

    Training

    The training module is somewhat split, with some parts of the overall training logic contained within the CLI itself and some within a separate repository.

    The training logic is currently broadly divided between CPU and GPU-enabled training loops.
    The training module expects the training data as input, along with the model that needs to be trained.

    Consumer-Grade Training

    This side focuses more on the laptop use case. On macOS, training was originally performed using Apple’s MLX library, optimized for Apple silicon; as of newer releases, InstructLab has focused on the PyTorch MPS device instead. Using MPS is a much more sustainable practice, as it integrates with more typical training loops. As such, the recommendation is that macOS users use MacBooks with M1 or newer chips.

    On Linux, training relies on Hugging Face’s SFT Trainer implementation.

    Windows is currently not supported.

    Server-Grade Training

    Users with access to GPU-accelerated hardware can leverage the full fine-tuning training loop contained in instructlab/training. This repository uses PyTorch for the training loop and is optimized for Nvidia hardware by leveraging CUDA kernels and APIs. It also uses DeepSpeed to perform distributed training across all available GPUs.

    The result after training is typically several checkpoints captured at specified intervals during the training process.

    Evaluation

    The evaluation logic is entirely contained within instructlab/eval. The evaluation component is responsible for performing a number of standardized benchmarks against the chosen checkpoint after training and producing objective scores to compare the model’s performance before and after training.

    InstructLab evaluation includes four benchmarks: MMLU and MMLU-Branch for knowledge, and MT-Bench and MT-Bench-Branch for skills.

    This component borrows code from FastChat and also leverages lm-evaluation-harness for running MMLU tasks.

    Model Conversion and Quantization (Optional)

    All training takes place on models in .safetensors format. Once a trained and evaluated model is available, users on macOS or users with limited compute resources may want to serve the trained model and chat with it. In this case, users may convert their trained models back to .gguf format and optionally quantize them to 4 bits. Quantization is handled by instructlab-quantize, which contains pre-compiled binaries of llama.cpp's quantization script for various platforms. The appropriate binary is chosen based on the user’s environment, and the supplied model is quantized using the Q4_K_M method by default.
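
    As a rough illustration of what that quantization step does under the hood, this is the equivalent manual llama.cpp invocation (the binary name and paths vary by llama.cpp version, and instructlab-quantize ships pre-built binaries, so you normally never run this yourself):

      ./quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M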

  • InstructLab, How do I use this thing?

    InstructLab, How do I use this thing?

    How to use InstructLab

    InstructLab is a powerful tool, but one with many knobs, options, and modes of operation. So, if you just want to incorporate your knowledge into an LLM locally, how can you do that? Assuming you have ilab installed on your machine (the most recent version if possible!) and you have run ilab config init, this is how you should go about generating your data and training your own LLM!

    Choose your data, and write it up!

    Let’s say you want to teach the model about Moo Deng, the viral pygmy hippo. You need to download a PDF version of her wiki page and convert it into an MD file. To do that, you can use a tool like docling by running a command like docling /path/to/PDF.pdf on your local machine.

    Once your PDF has been converted to MD, you need to write up a QNA file to let the model know what type of data you want it to pick up from the document. In this case, Moo Deng’s wiki is pretty short so granite is going to become quite the expert!

    Knowledge vs Skills

    The type of information we are introducing here is knowledge. Knowledge is categorized as learning about something; skills are learning how to complete a specific task or how to do something. Teaching the model about Moo Deng is teaching it knowledge about her.

    The knowledge and skill yaml files are different. Knowledge v3 must point to a document from which the context blocks are pulled.

    A knowledge qna yaml for Moo Deng will look something like this:

    created_by: cdoern
    version: 3
    domain: animals
    seed_examples:
      - context: |
          Moo Deng (bouncy pig) is a pygmy hippopotamus, living in Khao Kheow Open Zoo in Si Racha, Chonburi, Thailand. She gained notability at two months of age in September 2024 as a popular Internet meme after images of her went viral online.
          Moo Deng was born on 10 July 2024. Her name was chosen through a public poll, with over 20,000 people voting for "Moo Deng", translating to "bouncy pork" or "bouncy pig".
        questions_and_answers:
          - question: |
              Who is Moo Deng?
            answer:
              Moo Deng is a pygmy hippopotamus living in Khao Kheow Open Zoo, Thailand, who became an Internet sensation in 2024.
          - question: |
              When was Moo Deng born?
            answer:
              Moo Deng was born on 10 July 2024.
          - question: |
              How was Moo Deng's name chosen?
            answer:
              Moo Deng's name was chosen through a public poll with over 20,000 votes, translating to "bouncy pork" or "bouncy pig".
      - context: |
          Moo Deng's father is Tony, and her mother is Jonah. She has two full siblings: Nadet, born in 2010, male, and Moo Tun, born on 27 October 2019, male.
    
          She also has three half-siblings through her mother's relationship with Rambo: Ko, Kanya, and Phalo. Additionally, Moo Deng has one half-sibling through her father's relationship with Kanya: Moo Wan, female.
        questions_and_answers:
          - question: |
              Who are Moo Deng's parents?
            answer:
              Moo Deng's father is Tony, and her mother is Jonah.
          - question: |
              How many full siblings does Moo Deng have?
            answer:
              Moo Deng has two full siblings, Nadet and Moo Tun.
          - question: |
              How many half-siblings does Moo Deng have?
            answer:
              Moo Deng has four half-siblings, Ko, Kanya, Phalo (from her mother's side) and Moo Wan (from her father's side).
      - context: |
          Khao Kheow Open Zoo posted images of its pygmy hippopotamuses on its official Facebook page, and Moo Deng quickly became a favorite among fans. She is noted for being more playful and energetic than other hippopotamuses.
    
          The zoo responded to her popularity by selling merchandise featuring designs based on Moo Deng. Other companies joined in by producing merchandise, including Vetmon Cafe, which created a realistic cake shaped like Moo Deng. Additionally, Sephora posted a makeup tutorial inspired by her, and she became the subject of many fan artworks.
        questions_and_answers:
          - question: |
              How did Moo Deng become popular?
            answer:
              Moo Deng became popular after Khao Kheow Open Zoo posted images of her on Facebook, where she quickly became a fan favorite.
          - question: |
              What merchandise was created based on Moo Deng?
            answer:
              The zoo sold merchandise featuring designs based on Moo Deng, and Vetmon Cafe created a realistic cake shaped like her.
          - question: |
              Which company posted a makeup tutorial inspired by Moo Deng?
            answer:
              Sephora posted a makeup tutorial inspired by Moo Deng.
      - context: |
          Due to Moo Deng's viral online popularity, the number of daily visitors to the zoo doubled in early September 2024. Some visitors harassed the baby hippo by splashing water at her or throwing objects to wake her up. In response, the zoo installed security cameras around her enclosure, and the zoo's director threatened legal action against visitors who harassed her.
          The zoo also implemented a five-minute time limit for visitors to accommodate the high volume of guests.
        questions_and_answers:
          - question: |
              How did Moo Deng's popularity affect the number of visitors at Khao Kheow Open Zoo?
            answer:
              The number of daily visitors to Khao Kheow Open Zoo doubled in early September 2024 due to Moo Deng's popularity.
          - question: |
              What actions did the zoo take to protect Moo Deng from harassment?
            answer:
              The zoo installed security cameras around Moo Deng's enclosure and threatened legal action against visitors who harassed her.
          - question: |
              How long can visitors spend with Moo Deng at the zoo?
            answer:
              The zoo implemented a five-minute time limit for visitors to see Moo Deng due to the high volume of guests.
      - context: |
          In September 2024, zoo director Narongwit Chodchoi announced that the zoo had begun the process of copyrighting and trademarking "Moo Deng the hippo" to raise funds for the zoo. The zoo also plans to launch a continuous livestream to allow fans to watch Moo Deng live over the Internet.
        questions_and_answers:
          - question: |
              What did the zoo do to protect Moo Deng's brand?
            answer:
              The zoo began the process of copyrighting and trademarking "Moo Deng the hippo" in September 2024.
          - question: |
              How does the zoo plan to keep fans engaged with Moo Deng online?
            answer:
              The zoo plans to launch a continuous livestream for fans to watch Moo Deng live over the Internet.
          - question: |
              Why is the zoo copyrighting Moo Deng's name?
            answer:
              The zoo is copyrighting and trademarking "Moo Deng the hippo" to raise funds for its operations.
      - context: |
          On September 28, Moo Deng was the subject of a Saturday Night Live sketch. She was parodied by Bowen Yang on Weekend Update, where the character was used to satirize American pop-artist Chappell Roan's commentary on fame and political endorsements. Yang later extended his support to Roan.
        questions_and_answers:
          - question: |
              How was Moo Deng featured on Saturday Night Live?
            answer:
              Moo Deng was parodied by Bowen Yang on Weekend Update on September 28, 2024.
          - question: |
              What was Moo Deng's parody on SNL used to satirize?
            answer:
              Moo Deng's parody on SNL was used to satirize American pop-artist Chappell Roan's commentary on fame and political endorsements.
          - question: |
              Who portrayed Moo Deng on Saturday Night Live?
            answer:
              Bowen Yang portrayed Moo Deng on Saturday Night Live.
    document_outline: Verbatim information about Moo Deng, covering her rise to internet fame, her background, online popularity, merchandise, and how the zoo responded to her viral fame.
    document:
      repo: https://github.com/cdoern/knowledge
      commit: 142a62438f227e4bcadd8aac55ef4a9fa625ce0f
      patterns:
        - pop_culture/moodeng.md

    To write a good knowledge yaml, you need at least 5 context blocks, each with at least 3 question and answer pairs. ilab data generate does not allow you to manually tune the amount of synthetic data for knowledge; it is directly correlated to the number of context blocks you have.

    You will notice that at the bottom of the qna.yaml there is a reference to a document in a repository. As part of knowledge training, you need to place your MD files in a git repository. I recommend making a repository for all of your knowledge markdown files, categorizing them by the domain they fall under. The one in this scenario is cdoern/knowledge, and moodeng.md falls under the pop_culture directory.

    Let’s generate some data

    Depending on your system, there are two ways to generate data:

    ilab data generate --pipeline simple

    or

    ilab data generate --pipeline full

    As of ilab 0.19.4, the full pipeline is the default. It will take longer (the time is directly proportional to the size of the MD file you introduce), but it generates extremely high quality data compared to simple. In ilab 0.21.0, --max-num-tokens was added to ilab data generate. This option limits the length of the generated response during the gen_knowledge block of the full pipeline. The default value is 4096. For a small markdown like Moo Deng, using 4096 is fine! However, whenever I generate data with a larger document (20+ pages), I set --max-num-tokens 512. This dramatically speeds up the data generation process and significantly lowers training time, since we are working with shorter and fewer batches of data.

    When generating skills, you can use the --sdg-scale-factor flag to multiply the number of generated samples. The default is 30 times the amount you provide. For knowledge, the number of generated samples is directly correlated to the number of context blocks you provide, but generation can be sped up by using --max-num-tokens.

    The simple and full pipelines differ greatly. Simple is less rigorous about what data it creates and drops. The full pipeline takes the context of each chunk of data into account and has more accurate filtering of poorly generated data, which is why it takes longer.

    The importance of a good teacher

    Your teacher model (the one used in ilab data generate) is crucial to the quality of your generated data. When using full, we recommend Mixtral-8x7b-Instruct on systems with 150+ GB of vRAM.

    All other systems should use a quantized teacher with the full pipeline, such as Mixtral-8x7b-Instruct-GPTQ or a GGUF model like Mistral-7B-Instruct-v0.2-GGUF. We opt to use the Mistral GGUF instead of Mixtral on consumer hardware because it is roughly 1/5 of the size with the same prompt template and comparable performance. If you are using the full pipeline, this is the default teacher model and will be pulled for you when running ilab model download.

    If you are using the simple pipeline you must use merlinite-7b-lab-GGUF as the teacher model. Merlinite is an adequate teacher for skills as it is a "smarter" LAB-aligned model, but models like Mixtral are significantly better at generating synthetic data.

    The full command to be run to generate data for knowledge is:

    ilab data generate --pipeline <YOUR_PIPELINE> --model <FULL_PATH_TO_MODEL> --max-num-tokens <DESIRED_MAX_NUM_TOKENS>

    If generating skills you can add something like --sdg-scale-factor 30.

    The defaults for ilab data generate are Mistral-7B-Instruct-v0.2-GGUF and the full pipeline, so that is what I used!

    I simply ran:

    ilab data generate
    INFO 2024-10-18 09:56:35,301 numexpr.utils:161: NumExpr defaulting to 8 threads.
    INFO 2024-10-18 09:56:35,627 datasets:59: PyTorch version 2.3.1 available.
    INFO 2024-10-18 09:56:36,310 instructlab.model.backends.llama_cpp:125: Trying to connect to model server at http://127.0.0.1:8000/v1
    WARNING 2024-10-18 09:56:51,857 instructlab.data.generate_data:72: Disabling SDG batching - unsupported with llama.cpp serving
    INFO 2024-10-18 09:56:51,870 instructlab.data.generate_data:82: Generating synthetic data using 'full' pipeline, '/Users/charliedoern/.cache/instructlab/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf' model, '/Users/charliedoern/.local/share/instructlab/taxonomy' taxonomy, against http://127.0.0.1:57248/v1 server
    INFO 2024-10-18 09:56:52,375 instructlab.sdg.generate_data:356: Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
    INFO 2024-10-18 09:56:52,563 instructlab.sdg.checkpointing:59: No existing checkpoints found in /Users/charliedoern/.local/share/instructlab/datasets/checkpoints/knowledge_science_animals_hippos, generating from scratch
    INFO 2024-10-18 09:56:52,563 instructlab.sdg.pipeline:153: Running pipeline single-threaded
    INFO 2024-10-18 09:56:52,563 instructlab.sdg.pipeline:197: Running block: duplicate_document_col
    INFO 2024-10-18 09:56:52,563 instructlab.sdg.pipeline:198: Dataset({
        features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_response_1', 'icl_response_2', 'icl_response_3'],
        num_rows: 6
    })
    INFO 2024-10-18 09:56:55,261 instructlab.sdg.llmblock:52: LLM server supports batched inputs: False
    INFO 2024-10-18 09:56:55,261 instructlab.sdg.pipeline:197: Running block: gen_spellcheck
    INFO 2024-10-18 09:56:55,261 instructlab.sdg.pipeline:198: Dataset({
        features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_response_1', 'icl_response_2', 'icl_response_3', 'base_document'],
        num_rows: 6
    })
    gen_spellcheck Prompt Generation:  17%|█████████████████████████                                                                                                                             | 1/6 [01:03<05:18, 63.76s/it]

    This command pulls the defaults from your config.yaml.

    This process takes a while, and you can see the results of SDG in ~/.local/share/instructlab/datasets/knowledge_train_msgs...jsonl. This is the file generated if your input was a knowledge yaml. The skills one follows the same naming scheme. This file for me had about 200 generated samples from my original 15. This file will have some generated samples, context, and expected input and output for the model you are training to follow.

    Let’s train a new model!

    ilab model train has three pipelines: simple, full, and accelerated. Both simple and full are designed to run on laptops.

    full was introduced more recently and uses techniques specific to the ilab workflow, data processing, and model production process. In this example I will be using the full pipeline, which is now the default training pipeline upstream. This version of training is optimized for systems with 32 or more GB of memory, though it will work on 16 GB as of ilab 0.21.0. These optimizations are applied automatically on macOS and can be toggled on Linux using --optimize-memory.

    The full command I ran on my machine is:

    ilab model train --device mps --data-path ~/.local/share/instructlab/datasets/knowledge_train_msgs...jsonl

    Your config will have an optimal effective_batch_size and max_batch_len for training knowledge on a 64 GB machine. If you have more memory, you can turn both of these up using the associated command line flags. If you are training skills, you should use --effective-batch-size=3840.
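
    For instance, a skills training run on a machine with plenty of memory might look like this (a sketch; the dataset filename is a placeholder following the naming scheme above):

      ilab model train --device mps --data-path ~/.local/share/instructlab/datasets/skills_train_msgs_<timestamp>.jsonl --effective-batch-size 3840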

    For 7 epochs, this process takes around 7 hours on a capable machine. If your machine can handle it, once a checkpoint folder is saved you can chat with the model in another terminal while still training.

    You can chat with the trained models by running:

    ilab model chat --model ~/.cache/instructlab/checkpoints/hf_format/samples_<Number>/pytorch_model-Q4_K_M.gguf

    The 7th epoch generally performs best, and its checkpoint can be found in samples_56.

    I ran three epochs of training for this demo, and the model clearly now knows who Moo Deng is!

    ilab model chat --model ~/.local/share/instructlab/checkpoints/samples_16/pytorch_model-Q4_K_M.gguf

    >>> Who is Moo Deng?                                                                                                                                                                   [S][default]
    ╭────────────────────────── pytorch_model-Q4_K_M.gguf─────────────────────────────────────────────────────────╮
    │ Moo Deng (bouncy pig) is a pygmy hippopotamus living at Khao Kheow Open Zoo in Si Racha, Chonburi, Thailand. She gained notability at two months of age in October 2024 as a popular Internet meme after images of her 
    │ went viral online.                                                                                                                                                                                                      
    │                                                                                                                                                                                                                         
    │ ## Background                                                                                                                                                                                                           
    │                                                                                                                                                                                                                         
    │ Moo Deng was born on August 10, 2024. Her name was chosen through a public poll with over 20,000 votes, translating to "bouncy pig".                                                               
    │                                                                                                                                                                                                                         
    │ ## Parents and siblings                                                                                                                                                                                                 
    │                                                                                                                                                                                                                         
    │ Moo Deng's father is Tony, and her mother is Jonah. Moo Deng has two full siblings:                                                                                                                                     
    │
    ...

    ilab 0.19.4 introduces full pipeline support for consumer hardware. These pipelines will take longer and use more of your system resources. That is because what they are doing is a closer approximation of the GPU enabled workflow that runs on systems with hundreds of GB of vRAM. Consumer hardware is more limited.

    ilab 0.21.0 expands on this functionality and further optimizes both training and generation for smaller hardware footprints, with flags like --max-num-tokens for SDG, --optimize-memory for training, and hardware acceleration via the macOS MPS PyTorch device. With smaller batches of data, a more memory-friendly training loop, and smaller models, ilab can produce a high fidelity model right on your laptop!

Join the InstructLab Community!

Come learn how to customize your own LLM, play and learn AI technology, and help us build the open source tools that make it all possible!