January 13, 2025
Welcome to Lesson 6 of 12 in our free course series, LLM Twin: Building Your Production-Ready AI Replica. You’ll learn how to use LLMs, vector DBs, and LLMOps best practices to design, train, and deploy a production-ready “LLM twin” of yourself. This AI character will write like you, incorporating your style, personality, and voice into an LLM. For a full overview of course objectives and prerequisites, start with Lesson 1.
Large language models (LLMs) have changed how we interact with machines. These powerful models have a remarkable understanding of human language, enabling them to translate text, write many kinds of creative content, and answer questions in an informative way.
But how do we take these LLMs and make them even better?
The answer lies in fine-tuning.
Fine-tuning is the process of taking a pre-trained LLM and adapting it to a specific task or domain.
One important aspect of fine-tuning is dataset preparation.
Remember the old adage: “garbage in, garbage out.”
The quality of your dataset directly impacts how well your fine-tuned model will perform.
Let’s explore why a well-prepared, high-quality dataset is essential for successful LLM fine-tuning.
Today, we will learn how to generate a custom dataset for our specific task, which is content generation.
Our data consists of two primary types: posts and articles. Each type serves a different purpose and is structured to accommodate its specific needs.
Both data types require careful handling during insertion to preserve their integrity and ensure they are stored correctly for further processing and analysis in MongoDB. This includes managing formatting issues and ensuring data consistency across the database.
🔗 Check out the code on GitHub [1] and support us with a ⭐️
Using the cleaned data from Qdrant
Let’s analyze two sample data points from Qdrant to demonstrate how we can derive instructions for generating our instruction dataset (we cleaned this data within our feature pipeline in Lesson 4):
{
"author_id": "2",
"cleaned_content": "Do you want to learn to build hands-on LLM systems using good LLMOps practices? A new Medium series is coming up for the Hands-on LLMs course\n.\nBy finishing the Hands-On LLMs free course, you will learn how to use the 3-pipeline architecture & LLMOps good practices to design, build, and deploy a real-time financial advisor powered by LLMs & vector DBs.\nWe will primarily focus on the engineering & MLOps aspects.\nThus, by the end of this series, you will know how to build & deploy a real ML system, not some isolated code in Notebooks.\nThere are 3 components you will learn to build during the course:\n- a real-time streaming pipeline\n- a fine-tuning pipeline\n- an inference pipeline\n.\nWe have already released the code and video lessons of the Hands-on LLM course.\nBut we are excited to announce an 8-lesson Medium series that will dive deep into the code and explain everything step-by-step.\nWe have already released the first lesson of the series \nThe LLMs kit: Build a production-ready real-time financial advisor system using streaming pipelines, RAG, and LLMOps: \n[URL]\n In Lesson 1, you will learn how to design a financial assistant using the 3-pipeline architecture (also known as the FTI architecture), powered by:\n- LLMs\n- vector DBs\n- a streaming engine\n- LLMOps\n.\n The rest of the articles will be released by the end of January 2024.\nFollow us on Medium's Decoding ML publication to get notified when we publish the other lessons: \n[URL]\nhashtag\n#\nmachinelearning\nhashtag\n#\nmlops\nhashtag\n#\ndatascience",
"platform": "linkedin",
"type": "posts"
},
{
"author_id": "2",
"cleaned_content": "RAG systems are far from perfect This free course teaches you how to improve your RAG system.\nI recently finished the Advanced Retrieval for AI with Chroma free course from\nDeepLearning.AI\nIf you are into RAG, I find it among the most valuable learning sources.\nThe course already assumes you know what RAG is.\nIts primary focus is to show you all the current issues of RAG and why it is far from perfect.\nAfterward, it shows you the latest SoTA techniques to improve your RAG system, such as:\n- query expansion\n- cross-encoder re-ranking\n- embedding adaptors\nI am not affiliated with\nDeepLearning.AI\n(I wouldn't mind though).\nThis is a great course you should take if you are into RAG systems.\nThe good news is that it is free and takes only 1 hour.\nCheck it out \n Advanced Retrieval for AI with Chroma:\n[URL]\nhashtag\n#\nmachinelearning\nhashtag\n#\nmlops\nhashtag\n#\ndatascience\n.\n Follow me for daily lessons about ML engineering and MLOps.[URL]",
"image": null,
"platform": "linkedin",
"type": "posts"
}
Process:
Generating instructions: We can leverage the “cleaned_content” to automatically generate an instruction (using GPT-4o or another LLM) for each piece of content, such as: “Share the announcement of the upcoming Medium series on building hands-on LLM systems using good LLMOps practices.”
Generating the dataset with GPT-4o
The process can be split into 3 main stages:

1. Fetch the cleaned content from Qdrant.
2. Batch the content and prompt GPT-4o to generate one instruction per content piece.
3. Pair each instruction with its content, split the result into training and testing sets, and version them in Comet.
Result: This process yields a dataset of instruction-output pairs designed to fine-tune a Llama 3.1 8B model (or another LLM) to adopt your writing style.
Let’s dig into the code!
The example will simulate creating a training dataset for an LLM using the strategy we’ve explained above.
Imagine that we want to go from this ↓
{
"author_id": "2",
"cleaned_content": "Do you want to learn to build hands-on LLM systems using good LLMOps practices? A new Medium series is coming up for the Hands-on LLMs course\n.\nBy finishing the Hands-On LLMs free course, you will learn how to use the 3-pipeline architecture & LLMOps good practices to design, build, and deploy a real-time financial advisor powered by LLMs & vector DBs.\nWe will primarily focus on the engineering & MLOps aspects.\nThus, by the end of this series, you will know how to build & deploy a real ML system, not some isolated code in Notebooks.\nThere are 3 components you will learn to build during the course:\n- a real-time streaming pipeline\n- a fine-tuning pipeline\n- an inference pipeline\n.\nWe have already released the code and video lessons of the Hands-on LLM course.\nBut we are excited to announce an 8-lesson Medium series that will dive deep into the code and explain everything step-by-step.\nWe have already released the first lesson of the series \nThe LLMs kit: Build a production-ready real-time financial advisor system using streaming pipelines, RAG, and LLMOps: \n[URL]\n In Lesson 1, you will learn how to design a financial assistant using the 3-pipeline architecture (also known as the FTI architecture), powered by:\n- LLMs\n- vector DBs\n- a streaming engine\n- LLMOps\n.\n The rest of the articles will be released by the end of January 2024.\nFollow us on Medium's Decoding ML publication to get notified when we publish the other lessons: \n[URL]\nhashtag\n#\nmachinelearning\nhashtag\n#\nmlops\nhashtag\n#\ndatascience",
},
{
"author_id": "2",
"cleaned_content": "RAG systems are far from perfect This free course teaches you how to improve your RAG system.\nI recently finished the Advanced Retrieval for AI with Chroma free course from\nDeepLearning.AI\nIf you are into RAG, I find it among the most valuable learning sources.\nThe course already assumes you know what RAG is.\nIts primary focus is to show you all the current issues of RAG and why it is far from perfect.\nAfterward, it shows you the latest SoTA techniques to improve your RAG system, such as:\n- query expansion\n- cross-encoder re-ranking\n- embedding adaptors\nI am not affiliated with\nDeepLearning.AI\n(I wouldn't mind though).\nThis is a great course you should take if you are into RAG systems.\nThe good news is that it is free and takes only 1 hour.\nCheck it out \n Advanced Retrieval for AI with Chroma:\n[URL]\nhashtag\n#\nmachinelearning\nhashtag\n#\nmlops\nhashtag\n#\ndatascience\n.\n Follow me for daily lessons about ML engineering and MLOps.[URL]",
}
to this ↓
[
{
"instruction": "Share the announcement of the upcoming Medium series on building hands-on LLM systems using good LLMOps practices, focusing on the 3-pipeline architecture and real-time financial advisor development. Follow the Decoding ML publication on Medium for notifications on future lessons.",
"content": "Do you want to learn to build hands-on LLM systems using good LLMOps practices? A new Medium series is coming up for the Hands-on LLMs course\n.\nBy finishing the Hands-On LLMs free course, you will learn how to use the 3-pipeline architecture & LLMOps good practices to design, build, and deploy a real-time financial advisor powered by LLMs & vector DBs.\nWe will primarily focus on the engineering & MLOps aspects.\nThus, by the end of this series, you will know how to build & deploy a real ML system, not some isolated code in Notebooks.\nThere are 3 components you will learn to build during the course:\n- a real-time streaming pipeline\n- a fine-tuning pipeline\n- an inference pipeline\n.\nWe have already released the code and video lessons of the Hands-on LLM course.\nBut we are excited to announce an 8-lesson Medium series that will dive deep into the code and explain everything step-by-step.\nWe have already released the first lesson of the series \nThe LLMs kit: Build a production-ready real-time financial advisor system using streaming pipelines, RAG, and LLMOps: \n[URL]\n In Lesson 1, you will learn how to design a financial assistant using the 3-pipeline architecture (also known as the FTI architecture), powered by:\n- LLMs\n- vector DBs\n- a streaming engine\n- LLMOps\n.\n The rest of the articles will be released by the end of January 2024.\nFollow us on Medium's Decoding ML publication to get notified when we publish the other lessons: \n[URL]\nhashtag\n#\nmachinelearning\nhashtag\n#\nmlops\nhashtag\n#\ndatascience"
},
{
"instruction": "Promote the free course 'Advanced Retrieval for AI with Chroma' from DeepLearning.AI that aims to improve RAG systems and takes only 1 hour to complete. Share the course link and encourage followers to check it out for the latest techniques in query expansion, cross-encoder re-ranking, and embedding adaptors.",
"content": "RAG systems are far from perfect This free course teaches you how to improve your RAG system.\nI recently finished the Advanced Retrieval for AI with Chroma free course from\nDeepLearning.AI\nIf you are into RAG, I find it among the most valuable learning sources.\nThe course already assumes you know what RAG is.\nIts primary focus is to show you all the current issues of RAG and why it is far from perfect.\nAfterward, it shows you the latest SoTA techniques to improve your RAG system, such as:\n- query expansion\n- cross-encoder re-ranking\n- embedding adaptors\nI am not affiliated with\nDeepLearning.AI\n(I wouldn't mind though).\nThis is a great course you should take if you are into RAG systems.\nThe good news is that it is free and takes only 1 hour.\nCheck it out \n Advanced Retrieval for AI with Chroma:\n[URL]\nhashtag\n#\nmachinelearning\nhashtag\n#\nmlops\nhashtag\n#\ndatascience\n.\n Follow me for daily lessons about ML engineering and MLOps.[URL]"
}
]
We’ll use the DataFormatter class to format these data points into structured prompts for the LLM, from which we’ll generate the instruction-answer pairs used for SFT (supervised fine-tuning). Here’s the class used to prepare the content:
class DataFormatter:
    @classmethod
    def get_system_prompt(cls, data_type: str) -> str:
        return (
            f"I will give you batches of contents of {data_type}. Please generate me exactly 1 instruction for each of them. The {data_type} text "
            f"for which you have to generate the instructions is under Content number x lines. Please structure the answer in json format, "
            f"ready to be loaded by json.loads(), a list of objects only with fields called instruction and content. For the content field, copy the number of the content only! "
            f"Please do not add any extra characters and make sure it is a list with objects in valid json format!\n"
        )

    @classmethod
    def format_data(cls, data_points: list, is_example: bool, start_index: int) -> str:
        text = ""
        for index, data_point in enumerate(data_points):
            if not is_example:
                text += f"Content number {start_index + index}\n"
            text += str(data_point) + "\n"
        return text

    @classmethod
    def format_batch(cls, context_msg: str, data_points: list, start_index: int) -> str:
        delimiter_msg = context_msg
        delimiter_msg += cls.format_data(data_points, False, start_index)
        return delimiter_msg

    @classmethod
    def format_prompt(cls, inference_posts: list, data_type: str, start_index: int) -> str:
        initial_prompt = cls.get_system_prompt(data_type)
        initial_prompt += f"You must generate exactly a list of {len(inference_posts)} json objects, using the contents provided under CONTENTS FOR GENERATION\n"
        initial_prompt += cls.format_batch("\nCONTENTS FOR GENERATION: \n", inference_posts, start_index)
        return initial_prompt
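For example, assuming the two cleaned documents shown earlier are collected in a plain Python list (the variable names below are illustrative), building a prompt is a single call:

cleaned_documents = [document_1, document_2]  # the two dicts shown above

prompt = DataFormatter.format_prompt(cleaned_documents, data_type="posts", start_index=0)
print(prompt)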
Output of the format_prompt function:
prompt = """
I will give you batches of contents of articles.
Please generate me exactly 1 instruction for each of them.
The articles text for which you have to generate the instructions is under Content number x lines.
Please structure the answer in json format,ready to be loaded by json.loads(), a list of objects only with fields called instruction and content.
For the content field, copy the number of the content only!
Please do not add any extra characters and make sure it is a list with objects in valid json format!\n
You must generate exactly a list of 3 json objects, using the contents provided under CONTENTS FOR GENERATION\n
CONTENTS FOR GENERATION:
Content number 0
...
Content number 1
...
Content number 2
...
Content number MAX_BATCH
...
We batch the data into multiple prompts to avoid exceeding the LLM’s context window (its maximum number of input tokens). Thus, we will send multiple prompts to the LLM.
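If you want to size batches more precisely, you can estimate a prompt’s token count before sending it. Here is a minimal sketch using the tiktoken library (an assumption on our side; it is not part of the course code):

import tiktoken

# Rough token estimate; recent tiktoken versions know the "gpt-4o" encoding.
encoding = tiktoken.encoding_for_model("gpt-4o")

def fits_in_context(prompt: str, budget: int = 100_000) -> bool:
    # Keep a healthy margin below the model's context window
    # so there is room left for the generated response.
    return len(encoding.encode(prompt)) <= budget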
To automate the generation of fine-tuning data, we designed the DatasetGenerator class, which streamlines the whole process, from fetching the data to logging the training data into Comet:
class DatasetGenerator:
    def __init__(
        self,
        file_handler: FileHandler,
        api_communicator: GptCommunicator,
        data_formatter: DataFormatter,
    ) -> None:
        self.file_handler = file_handler
        self.api_communicator = api_communicator
        self.data_formatter = data_formatter
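The api_communicator is a thin wrapper around the OpenAI API. Its exact implementation lives in the repository [1]; a minimal sketch of what send_prompt might look like (the method body and model name are assumptions on our side):

import json

from openai import OpenAI


class GptCommunicator:
    def __init__(self) -> None:
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def send_prompt(self, prompt: str) -> list[dict]:
        # One chat completion per batch; the system prompt built by
        # DataFormatter already demands a raw JSON list as output.
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return json.loads(response.choices[0].message.content)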
The generate_training_data() method from the DatasetGenerator class handles the entire lifecycle of data generation and calls the LLM for each batch:
def generate_training_data(
    self, collection_name: str, data_type: str, batch_size: int = 3
) -> None:
    assert (
        settings.COMET_API_KEY
    ), "COMET_API_KEY must be set in settings, fill it in your .env file."
    assert (
        settings.COMET_WORKSPACE
    ), "COMET_WORKSPACE must be set in settings, fill it in your .env file."
    assert (
        settings.COMET_PROJECT
    ), "COMET_PROJECT must be set in settings, fill it in your .env file."
    assert (
        settings.OPENAI_API_KEY
    ), "OPENAI_API_KEY must be set in settings, fill it in your .env file."

    cleaned_documents = self.fetch_all_cleaned_content(collection_name)
    cleaned_documents = chunk_documents(cleaned_documents)
    num_cleaned_documents = len(cleaned_documents)

    generated_instruct_dataset = []
    for i in range(0, num_cleaned_documents, batch_size):
        batch = cleaned_documents[i : i + batch_size]
        prompt = self.data_formatter.format_prompt(batch, data_type, i)
        batch_instructions = self.api_communicator.send_prompt(prompt)

        if len(batch_instructions) != len(batch):
            logger.error(
                f"Received {len(batch_instructions)} instructions for {len(batch)} documents. Skipping this batch..."
            )
            continue

        for instruction, content in zip(batch_instructions, batch):
            instruction["content"] = content
            generated_instruct_dataset.append(instruction)

    train_test_split = self._split_dataset(generated_instruct_dataset)
    self.push_to_comet(train_test_split, data_type, collection_name)
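Wiring it all together might look like this (the collection name is illustrative):

dataset_generator = DatasetGenerator(
    file_handler=FileHandler(),
    api_communicator=GptCommunicator(),
    data_formatter=DataFormatter(),
)
dataset_generator.generate_training_data(
    collection_name="cleaned_posts", data_type="posts", batch_size=3
)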
We could further optimize this by parallelizing the calls across threads using the ThreadPoolExecutor class from Python’s concurrent.futures module, as sketched below. For our small example, doing everything sequentially is fine.
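A minimal sketch of that optimization, assuming the prompts for all batches have already been formatted:

from concurrent.futures import ThreadPoolExecutor

def send_prompts_in_parallel(api_communicator, prompts: list[str], max_workers: int = 4) -> list:
    # Fan out one LLM call per prompt; executor.map preserves input order,
    # so the results still line up with their batches.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(api_communicator.send_prompt, prompts))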
The fetch_all_cleaned_content() method retrieves the cleaned documents from a Qdrant collection:
def fetch_all_cleaned_content(self, collection_name: str) -> list:
    all_cleaned_contents = []

    # `client` is the Qdrant client instance configured in the project.
    scroll_response = client.scroll(collection_name=collection_name, limit=10000)
    points = scroll_response[0]

    for point in points:
        cleaned_content = point.payload["cleaned_content"]
        if cleaned_content:
            all_cleaned_contents.append(cleaned_content)

    return all_cleaned_contents
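Note that a single scroll() call returns at most limit points. If your collection outgrows one page, you can paginate using the offset that Qdrant returns; a sketch:

def fetch_all_cleaned_content_paginated(self, collection_name: str) -> list:
    all_cleaned_contents = []
    offset = None
    while True:
        # scroll() returns a (points, next_page_offset) tuple.
        points, offset = client.scroll(
            collection_name=collection_name, limit=1000, offset=offset
        )
        for point in points:
            cleaned_content = point.payload["cleaned_content"]
            if cleaned_content:
                all_cleaned_contents.append(cleaned_content)
        if offset is None:  # no more pages left
            break
    return all_cleaned_contents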
In this section, we focus on a critical aspect of MLOps: data versioning.
We’ll specifically look at how to implement this using Comet, a platform that facilitates experiment management and reproducibility in machine learning projects.
Comet is a cloud-based platform that provides tools for tracking, comparing, explaining, and optimizing experiments and models in machine learning. It helps data scientists and teams better manage and collaborate on machine learning experiments.
You might be asking why we didn’t choose MLflow instead, for example [2].
When integrating Comet into your projects, you’ll need to set up several environment variables to manage the authentication and configuration:
- COMET_API_KEY: Your unique API key that authenticates your interactions with the Comet API.
- COMET_PROJECT: The project name under which your experiments will be logged.
- COMET_WORKSPACE: The workspace name that organizes various projects and experiments.

Data versioning is the practice of keeping a record of multiple versions of datasets used in training machine learning models. This practice is essential for several reasons: it makes experiments reproducible, it lets you trace exactly which data produced a given model, and it lets you roll back to an earlier dataset if a newer version degrades performance.
The provided push_to_comet function is a key part of this process.
def push_to_comet(
    self,
    train_test_split: tuple[list[dict], list[dict]],
    data_type: str,
    collection_name: str,
    output_dir: Path = Path("generated_dataset"),
) -> None:
    output_dir.mkdir(exist_ok=True)

    try:
        logger.info(f"Starting to push data to Comet: {collection_name}")

        # Create a new Comet experiment to log the artifact against.
        experiment = start()

        training_data, testing_data = train_test_split

        file_name_training_data = output_dir / f"{collection_name}_training.json"
        file_name_testing_data = output_dir / f"{collection_name}_testing.json"

        logger.info(f"Writing training data to file: {file_name_training_data}")
        with file_name_training_data.open("w") as f:
            json.dump(training_data, f)

        logger.info(f"Writing testing data to file: {file_name_testing_data}")
        with file_name_testing_data.open("w") as f:
            json.dump(testing_data, f)

        logger.info("Data written to file successfully")

        artifact = Artifact(f"{data_type}-instruct-dataset")
        artifact.add(file_name_training_data)
        artifact.add(file_name_testing_data)
        logger.info("Artifact created.")

        experiment.log_artifact(artifact)
        experiment.end()
        logger.info("Artifact pushed to Comet successfully.")
    except Exception:
        logger.exception("Failed to create Comet artifact and push it to Comet.")
After running the script that invokes the push_to_comet function, Comet will update with new data artifacts, each representing a different dataset version. This is a crucial step in ensuring that all your data versions are logged and traceable within your MLOps environment.
Here is what you should see in Comet after successfully executing the script:
List of Artifacts: You will see entries for each type of data you’ve processed and saved. For example, if you have cleaned and versioned articles and posts, they will appear as separate artifacts.
Artifact Versions: Each artifact can have multiple versions. Each time you run the script with a new or updated dataset, a new version of the respective artifact is created.
Each version is timestamped and stored with a unique ID, allowing you to track changes over time or revert to previous versions if necessary.
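To pull a version back later, for instance inside the training pipeline, you can fetch the artifact by name. A sketch using the comet_ml SDK (artifact and directory names are illustrative):

from comet_ml import start

experiment = start()  # authenticates via the COMET_* environment variables
logged_artifact = experiment.get_artifact("posts-instruct-dataset")  # latest version by default
logged_artifact.download("./generated_dataset")
experiment.end()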
We will end up with a training and a testing JSON file. Here’s an example of what the final version of cleaned_articles_training.json might look like, ready for the fine-tuning task:
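A representative, abridged snippet, based on the instruction-content pairs generated earlier:

[
    {
        "instruction": "Share the announcement of the upcoming Medium series on building hands-on LLM systems using good LLMOps practices, ...",
        "content": "Do you want to learn to build hands-on LLM systems using good LLMOps practices? A new Medium series is coming up ..."
    },
    ...
]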
Also, we made our artifacts publicly available, so you can take a look, play around with them, and even use them to fine-tune the LLM in case you don’t want to compute them yourself.
This lesson taught you how to generate custom instruct datasets from your raw data using other LLMs.
Also, we’ve shown you how to load the dataset to a data registry, such as Comet ML’s artifacts, to version, track, and share it within your system.
In Lesson 7, you will learn to use the generated dataset to fine-tune a Llama 3.1 8B LLM as your LLM Twin using Unsloth, TRL, and AWS SageMaker.
🔗 Consider checking out the GitHub repository [1] and support us with a ⭐️
[1] Your LLM Twin Course — GitHub Repository (2024), Decoding ML GitHub Organization
[2] MLFlow Alternatives, Neptune.ai
If not otherwise stated, all images are created by the author.