Retrieval Augmented Generation using Llama 2¶
This notebook walks through how to use Llama 2 to perform (in-context) retrieval augmented generation. We will customize the system message for Llama 2 to make sure the model only uses the provided context to generate the response.
What is In-context Retrieval Augmented Generation?
In-context retrieval augmented generation is a method to improve language model generation by including relevant documents in the model input. The key points are:
- Retrieval of relevant documents from an external corpus to provide factual grounding for the model.
- Prepending the retrieved documents to the input text, without modifying the model architecture or fine-tuning the model.
- Allows leveraging external knowledge with off-the-shelf frozen language models.
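To make these points concrete, here is a minimal, self-contained sketch of the idea; the documents list and the retrieve helper are illustrative placeholders, not part of EasyLLM:

# minimal sketch of in-context retrieval augmented generation
# (the documents and the retrieve helper are illustrative placeholders, not part of EasyLLM)
documents = [
    "Nuremberg is a city in the German state of Bavaria.",
    "The Pegnitz river flows through Nuremberg.",
]

def retrieve(query, docs, top_k=1):
    # toy retrieval: rank documents by word overlap with the query
    query_words = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(query_words & set(d.lower().split())))[:top_k]

user_query = "Which river flows through Nuremberg?"
retrieved_context = "\n".join(retrieve(user_query, documents))

# prepend the retrieved documents to the input text; the model itself stays frozen
augmented_input = f"{user_query}\n\n#SOURCE#\n{retrieved_context}"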
# if needed, install and/or upgrade to the latest version of the EasyLLM Python library
%pip install --upgrade easyllm
Simple Example¶
Below is a simple example using the existing llama2 prompt builder to generate a prompt. We are going to use the system message from llama-index with some minor adjustments.
SYSTEM_PROMPT = """You are an AI assistant that answers questions in a friendly manner, based on the given #SOURCE# documents. Here are some rules you always follow:
- Generate human readable output, avoid creating output with gibberish text.
- Generate only the requested output, don't include any other language before or after the requested output.
- Never say thank you, that you are happy to help, that you are an AI agent, etc. Just answer directly.
- Generate professional language typically used in business documents in North America.
- Never generate offensive or foul language.
- Only include facts and information based on the #SOURCE# documents.
"""
system = {"role": "system", "content": SYSTEM_PROMPT}
Before we can call our LLM, let's create a user instruction with a query and a context. As context, I copied the Wikipedia article on Nuremberg (the city I live in) and uploaded it as a gist so it doesn't pollute the notebook.
!wget https://gist.githubusercontent.com/philschmid/2678351cb9f41d385aa5c099caf20c0a/raw/60ae425677dd9bed6fe3c0f2dd5b6ea49bc6590c/nuremberg.txt
context = open("nuremberg.txt").read()
query = "How many people live in Nuremberg?"
Before we use our context, let's just ask the model.
from easyllm.clients import huggingface
# set the prompt builder to llama2
huggingface.prompt_builder = "llama2"
# huggingface.api_key = "hf_xx"
# send a ChatCompletion request
response = huggingface.ChatCompletion.create(
model="meta-llama/Llama-2-70b-chat-hf",
messages=[
{"role": "user", "content": query},
],
)
# print the generated response
print(response["choices"][0]["message"]["content"])
As of December 31, 2020, the population of Nuremberg, Germany is approximately 516,000 people.
Now let's use our system message with our context to augment the knowledge of our model "in-memory" and ask the same question again.
context_extended = f"{query}\n\n#SOURCE#\n{context}"
# context_extended = f"{query}\n\n#SOURCE START#\n{context}\n#SOURCE END#{query}"
from easyllm.clients import huggingface
# set the prompt builder to llama2
huggingface.prompt_builder = "llama2"
# huggingface.api_key = "hf_xx"
# send a ChatCompletion request
response = huggingface.ChatCompletion.create(
model="meta-llama/Llama-2-70b-chat-hf",
messages=[
system,
{"role": "user", "content": context_extended},
],
)
# print the generated response
print(response["choices"][0]["message"]["content"])
The population of Nuremberg is 523,026 according to the 2022-12-31 data.
Awesome! If we check the gist, we can see a snippet in there saying
Population (2022-12-31)[2]
• City 523,026
Next Steps¶
A next step would be to connect your LLM to external knowledge sources such as wikis, the web, or other databases, using tools and APIs or vector databases and embeddings.
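As a rough starting point, a minimal embedding-based retrieval sketch could look like the following. This is only an illustration, not part of EasyLLM: it assumes the sentence-transformers package is installed, and the model name and paragraph-based chunking are arbitrary choices.

# minimal embedding-based retrieval sketch (assumes: pip install sentence-transformers)
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# naive chunking: split the document on blank lines and embed each chunk once
chunks = [p for p in context.split("\n\n") if p.strip()]
chunk_embeddings = embedder.encode(chunks, convert_to_tensor=True)

# embed the query and keep the most similar chunks as the #SOURCE# context
query_embedding = embedder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, chunk_embeddings, top_k=3)[0]
retrieved = "\n\n".join(chunks[hit["corpus_id"]] for hit in hits)

context_extended = f"{query}\n\n#SOURCE#\n{retrieved}"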