January 13, 2025
The ability to efficiently handle and process large volumes of text is crucial.
Enter the world of Document Chains in LangChain, a revolutionary approach that promises to redefine how we interact with expansive textual data. Whether you’re a developer, data scientist, or just a curious enthusiast, this guide will walk you through the intricacies of Document Chains, showcasing their potential and practical applications.
Imagine you need to analyze a long text, such as Marcus Aurelius’ “Meditations.” The whole book will not fit into a single prompt, but with LangChain’s Document Chains you can split it, process the pieces, and combine the results. Chains like Stuff, Refine, and MapReduce let you summarize texts, iteratively refine outputs, and parallelize the work across chunks.
This guide will introduce you to the theoretical aspects of Document Chains and provide hands-on examples, allowing you to dive deep into the code and witness the magic unfold.
From setting up your environment with essential packages like langchain, openai, and tiktoken, to diving into the depths of text splitting and document processing, this guide promises a comprehensive journey.
By the end, you’ll have a clear understanding of how Document Chains can be a game-changer in the world of language processing.
%%capture
!pip install langchain openai tiktoken
import os
import getpass
import textwrap
from langchain import OpenAI, PromptTemplate, LLMChain
from langchain.chains.mapreduce import MapReduceChain
from langchain.chains.summarize import load_summarize_chain
# we will cover docstores and splitters in more detail when we get to retrieval
from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key:")
# the chains below use `llm`; its definition is not shown in the original,
# so default settings are assumed here
llm = OpenAI()
We will download a text file and split it into chunks of at most 1,500 characters, with a 200-character overlap between consecutive chunks.
!wget -O meditations.txt https://www.gutenberg.org/files/2680/2680-0.txt
with open('/content/meditations.txt') as f:
    meditations = f.read()
meditations = "\n".join(meditations.split("\n")[575:])
# splits the text based on a character
text_splitter = CharacterTextSplitter(
    separator='\n',
    chunk_size=1500,
    chunk_overlap=200,
    length_function=len)
meditations_chunks = text_splitter.split_text(meditations)
docs = [Document(page_content=t) for t in meditations_chunks]
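To make chunk_size and chunk_overlap concrete, here is a simplified pure-Python sketch of separator-based chunking with overlap. It is not LangChain's implementation: the function name split_with_overlap is hypothetical, and it only illustrates the idea that pieces are merged greedily up to the size limit and that each new chunk begins with the tail of the previous one.

```python
def split_with_overlap(text, separator="\n", chunk_size=1500, chunk_overlap=200):
    """Greedily merge separator-delimited pieces into chunks of at most
    chunk_size characters, carrying up to chunk_overlap characters over.

    Simplified illustration only (hypothetical helper, not LangChain code).
    A single piece longer than chunk_size is emitted as its own chunk.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until the carried-over tail fits the
            # overlap budget and still leaves room for the next piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks
```

With small numbers the overlap is easy to see: for eight-character lines, chunk_size=40 and chunk_overlap=10, each chunk repeats the last line of the chunk before it.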
Efficient Document Processing: Document Chains allow you to process and analyze large amounts of text data efficiently. They provide a structured approach to working with documents, enabling you to retrieve, filter, refine, and rank them based on specific criteria.
Task Decomposition: Document Chains can help you break down complex tasks into smaller, more manageable subtasks. By using different types of Document Chains like Stuff, Refine, Map Reduce, or Map Re-rank, you can perform specific operations on the retrieved documents and obtain more accurate and relevant results.
Improved Accuracy: Document Chains, especially Map Re-rank Chains, can help improve the accuracy of your responses. By running an initial prompt on each document and returning the highest-scoring response, you can prioritize the most reliable and accurate answers.
When deciding whether to use a Document Chain, consider the specific requirements of your application. If you need to process and analyze documents efficiently, break down complex tasks, improve response accuracy, or integrate with external data sources, Document Chains can be a valuable tool.
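The Map Re-rank idea mentioned above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: map_rerank and fake_answer_with_score are hypothetical names, and the scoring function is a keyword-overlap stand-in for an LLM call that would return an answer together with a confidence score.

```python
def map_rerank(docs, question, answer_with_score):
    """Run the same question against every document independently and
    return the candidate with the highest score."""
    candidates = [answer_with_score(doc, question) for doc in docs]
    return max(candidates, key=lambda c: c["score"])


def fake_answer_with_score(doc, question):
    """Hypothetical stand-in for an LLM call: the 'answer' is the document
    itself and the score is the percentage of question words found in it."""
    words = question.split()
    hits = sum(word.lower() in doc.lower() for word in words)
    return {"answer": doc, "score": 100 * hits // max(len(words), 1)}


best = map_rerank(
    ["Rusticus taught me to avoid vanity.", "My father was meek."],
    "What did Rusticus teach?",
    fake_answer_with_score,
)
print(best["answer"])
```

Because each document is scored on its own, no information is combined across documents; the approach fits questions whose answer lives in a single chunk.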
One way to provide context to a language model is through the stuffing method.
This involves putting all relevant data into a single prompt for LangChain’s StuffDocumentsChain to process.
The advantage of this method is that it only requires one call to the LLM, and the model has access to all the information at once.
The downside is that every LLM has a limited context window. For large or multiple documents, stuffing can produce a prompt that exceeds that limit, so this method suits only small amounts of data. Larger inputs call for one of the other chains described in this guide.
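Conceptually, stuffing is just string concatenation followed by one model call. The sketch below shows that shape in plain Python. It is not the StuffDocumentsChain source: stuff_documents, fake_llm, and the max_chars guard are all assumptions made for illustration, and a character count is only a rough proxy for a real token limit.

```python
calls = []  # records every prompt sent to the stand-in model


def fake_llm(prompt):
    """Hypothetical stand-in for a real LLM call."""
    calls.append(prompt)
    return f"response to a {len(prompt)}-character prompt"


def stuff_documents(docs, prompt_template, llm, max_chars=12000):
    """Put every document into one prompt and call the model exactly once.

    max_chars is an assumed, crude stand-in for the context limit.
    """
    context = "\n\n".join(docs)
    prompt = prompt_template.format(text=context)
    if len(prompt) > max_chars:
        raise ValueError(
            f"Prompt has {len(prompt)} characters, above the limit of "
            f"{max_chars}; use a refine or map-reduce approach instead."
        )
    return llm(prompt)


result = stuff_documents(["first chunk", "second chunk"], "Summarize:\n{text}", fake_llm)
print(result)
```

The guard makes the trade-off explicit: one cheap call when the input fits, and a clear failure when it does not.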
prompt_template ="""
Write a short 90s west coast gangster rap about the virtues learned from
various family members and how they helped you get by in times of crisis. Use
modern terminology where appropriate:
{text}
RAP:
"""
rap_prompt = PromptTemplate(template=prompt_template, input_variables=["text"])
stuff_chain = load_summarize_chain(llm,
                                   chain_type="stuff",
                                   prompt=rap_prompt)
# we can't fit the entire book of meditations in the context window, so
# take a slice of it
output_summary = stuff_chain.run(docs[:5])
print(output_summary)
Verse 1:
Grandpa taught me to stay humble and meek,
Mama taught me to be religious and generous,
Great-grandpa said to stay in school and seek,
The one who raised me said don't be a fool.
Chorus:
Family taught me the virtues,
That helped me get by,
When times were hard and I was in crisis,
My fam kept me alive.
Verse 2:
Diognetus said don't be fooled by tricks,
Rusticus said don't be vain and boastful,
Apollonius said stay strong and don't be quick,
Sextus said be forgiving and not hostile.
Chorus:
Family taught me the virtues,
That helped me get by,
When times were hard and I was in crisis,
My fam kept me alive.
Verse 3:
Alexander the Grammarian said don't be rude,
Fronto said don't be envious and fake,
Alexander the Platonic said don't be crude,
Catulus said love your kids for their sake.
The Refine Documents Chain uses an iterative process to generate a response by analyzing each input document and updating its answer accordingly.
It passes all non-document inputs, the current document, and the latest intermediate answer to an LLM chain to obtain a new answer for each document.
This chain is ideal for tasks that involve analyzing more documents than can fit in the model’s context, as it only passes a single document to the LLM at a time.
However, this also means it makes significantly more LLM calls than other chains, such as the Stuff Documents Chain. It may perform poorly for tasks that require cross-referencing between documents or detailed information from multiple documents.
The Refine Documents Chain starts by running an initial prompt on the first document and generating an output. For each remaining document, it passes the previous output along with the next document and asks the LLM to refine the output based on the new document.
Pros of this method include incorporating more relevant context and potentially less data loss than the MapReduce Documents Chain. However, it requires many more LLM calls than stuffing, and the calls are not independent, so they cannot be parallelized the way the MapReduce Documents Chain's calls can.
There may also be dependencies on the order in which the documents are analyzed.
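The refine loop described above can be sketched in plain Python. This is a minimal illustration, not the chain's actual code: refine_documents and make_fake_llm are hypothetical names, and the stand-in model simply numbers its answers so the sequential dependency is visible.

```python
def make_fake_llm():
    """Build a hypothetical stand-in LLM that records the prompts it receives."""
    calls = []

    def fake_llm(prompt):
        calls.append(prompt)
        return f"answer #{len(calls)}"

    return fake_llm, calls


def refine_documents(docs, llm, initial_template, refine_template):
    """Summarize the first document, then refine that answer one document
    at a time. Each call depends on the previous one, so the loop is
    inherently sequential and sensitive to document order."""
    if not docs:
        raise ValueError("at least one document is required")
    answer = llm(initial_template.format(text=docs[0]))
    for doc in docs[1:]:
        answer = llm(refine_template.format(existing_answer=answer, text=doc))
    return answer


llm_stub, recorded = make_fake_llm()
final = refine_documents(
    ["chunk one", "chunk two", "chunk three"],
    llm_stub,
    "Summarize: {text}",
    "Existing summary: {existing_answer}\nNew context: {text}\nRefine the summary.",
)
print(final, len(recorded))
```

One call is made per document, and every prompt after the first carries the previous answer, which is why the calls cannot run in parallel.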
refine_chain = load_summarize_chain(llm, chain_type="refine")
You can inspect the pre-specified prompt:
print(refine_chain.refine_llm_chain.prompt.template)
Your job is to produce a final summary
We have provided an existing summary up to a certain point: {existing_answer}
We have the opportunity to refine the existing summary(only if needed) with some more context below.
------------
{text}
------------
Given the new context, refine the original summary
If the context isn't useful, return the original summary.
output_summary = refine_chain.run(docs[17:25])
output_summary
Theophrastus encourages people to respect themselves and to take time to learn something good, rather than wandering and being idle. He suggests that people should be aware of the nature of the universe and their part in it, and to act and speak in accordance with it. He further states that those who sin through lust are more to be condemned than those who sin through anger, as the former have chosen to do so of their own volition. He advises people to remember that if there are gods, death is not a grievous thing, and to live their lives as if they may depart from it at any moment. He also suggests that the universe has taken care to provide people with the power to avoid evil and wickedness, and that life and death, honour and dishonour, labour and pleasure, riches and poverty, all happen to both good and bad people equally, but are neither good nor bad in themselves. Lastly, he encourages people to consider how quickly all things are dissolved and resolved, and to reflect on the nature of all worldly sensible things, including those which ensnare by pleasure, or for their irksomeness are dreadful, or for their outward lustre and show are in great esteem and request. He also encourages people to consider how man
To process large amounts of data efficiently, the MapReduceDocumentsChain method is used.
This involves applying an LLM chain to each document individually (in the Map step), producing a new document. Then, all the new documents are passed to a separate combine documents chain to get a single output (in the Reduce step). If necessary, the mapped documents can be compressed before passing them to the combine documents chain.
This compression step is performed recursively.
This method requires an initial prompt on each chunk of data.
For summarization tasks, this could be a summary of that chunk, while for question-answering tasks, it could be an answer based solely on that chunk. Then, a different prompt is run to combine all the initial outputs.
The pros of this method are that it can scale to larger documents and handle more documents than the StuffDocumentsChain. Additionally, the calls to the LLM on individual documents are independent and can be parallelized.
The cons are that it requires many more calls to the LLM than the StuffDocumentsChain and loses some information during the final combining call.
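The Map and Reduce steps can be sketched in plain Python as follows. This is an illustration under assumptions, not the MapReduceDocumentsChain source: map_reduce_documents and make_fake_llm are hypothetical names, the thread pool is one possible way to parallelize the independent map calls, and the recursive compression step is left out for brevity.

```python
from concurrent.futures import ThreadPoolExecutor


def make_fake_llm():
    """Build a hypothetical stand-in LLM that records the prompts it receives."""
    calls = []

    def fake_llm(prompt):
        calls.append(prompt)
        # Echo everything after the instruction so the data flow is visible.
        return "summary<" + prompt.split(":", 1)[1].strip() + ">"

    return fake_llm, calls


def map_reduce_documents(docs, llm, map_template, combine_template, max_workers=4):
    """Map: call the model on each document independently (in parallel).
    Reduce: combine the partial outputs with one final call."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        partials = list(pool.map(lambda d: llm(map_template.format(text=d)), docs))
    combined = "\n\n".join(partials)
    return llm(combine_template.format(text=combined))


llm_stub, recorded = make_fake_llm()
output = map_reduce_documents(
    ["chunk one", "chunk two"],
    llm_stub,
    "Summarize: {text}",
    "Combine: {text}",
)
print(output)
```

For n documents this makes n + 1 calls. Only the partial summaries reach the final call, which is where the information loss mentioned above comes from.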
map_reduce_chain = load_summarize_chain(llm,
                                        chain_type="map_reduce",
                                        verbose=True)
The prompt templates are preset for us. Below are the prompt used to summarize each chunk (the map step) and the prompt used to combine those summaries (the reduce step). In this case the two are identical.
print(map_reduce_chain.llm_chain.prompt.template)
Write a concise summary of the following:
"{text}"
CONCISE SUMMARY:
print(map_reduce_chain.combine_document_chain.llm_chain.prompt.template)
Write a concise summary of the following:
"{text}"
CONCISE SUMMARY:
# just using the first 20 chunks as I don't want to run too long
output_summary = map_reduce_chain.run(docs[:20])
print(output_summary)
This passage discusses the lessons the speaker has learned from his ancestors and mentors, such as to be gentle, meek, religious, and to avoid excess. He has also learned to avoid being partial to certain factions, to endure labor, and to not believe in vain things. He is thankful for the influence of various people in his life, such as Bacchius, Tandasis, and Marcianus, who taught him to pursue philosophy and not be offended by others' speech. He has also learned to be generous, hopeful, and confident in their friends' love, to have power over themselves, to be cheerful and courageous in all circumstances, and to be mild, moderate, and serious. The speaker has also observed his father's meekness, constancy, laboriousness, impartiality, skill, knowledge, abstention from unchaste love, and moderate behavior. He is reminded to be kind and understanding to all people, as they are all connected, and to take time to understand the true nature of the world and its Lord. Lastly, he is encouraged to live life as if it is the last, free from vanity, aberration, and self-love.
As we draw this exploration to a close, it’s evident that Document Chains in LangChain stand as a beacon of innovation in language processing. Their ability to efficiently dissect, process, and derive insights from vast textual data sets them apart, offering solutions to challenges that once seemed impossible.
Through our hands-on examples and deep dives into chains like Stuff, Refine, and MapReduce, we’ve witnessed the versatility and power these tools bring. Whether it’s processing ancient philosophical texts or modern-day datasets, Document Chains let you work with texts far larger than a single context window.
Moreover, the practical applications of these chains are boundless. From academic research and content summarization to business analytics and beyond, the potential impact of Document Chains on various industries is immense.
In an era where data is abundant, and insights are gold, tools like Document Chains are not just conveniences but necessities. As we move forward, it’s exciting to envision a future where language processing is not just about understanding text but harnessing its full potential. With LangChain leading the charge, that future seems not just possible but imminent.
Thank you for joining us on this enlightening journey. As you venture forth, may the power of Document Chains guide your endeavours in the vast world of language processing!