How to use Chat Completion clients
EasyLLM can be used as an abstraction layer to replace gpt-3.5-turbo and gpt-4 with open source models.
You can switch your existing applications from the OpenAI API by simply changing the client.
Chat models take a series of messages as input, and return an AI-written message as output.
This guide illustrates the chat format with a few example API calls.
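If you are already calling the OpenAI chat API, the switch usually comes down to importing the EasyLLM client instead of the OpenAI one while keeping the same create() call. Here is a minimal sketch of that swap (the token and prompt-builder settings mirror the configuration used later in this guide):
# Before (pre-1.0 OpenAI Python client):
# import openai
# response = openai.ChatCompletion.create(
#     model="gpt-3.5-turbo",
#     messages=[{"role": "user", "content": "Hello"}],
# )

# After (EasyLLM, calling an open-source chat model on Hugging Face):
from easyllm.clients import huggingface

huggingface.prompt_builder = "llama2"  # build the Llama 2 chat prompt
# huggingface.api_key = "hf_xxx"       # Hugging Face token, if the model requires it

response = huggingface.ChatCompletion.create(
    model="meta-llama/Llama-2-70b-chat-hf",
    messages=[{"role": "user", "content": "Hello"}],
)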
1. Import the easyllm library¶
# if needed, install and/or upgrade to the latest version of the EasyLLM Python library
%pip install --upgrade easyllm
# import the EasyLLM Python library for calling the EasyLLM API
import easyllm
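If you want to confirm which release is installed, you can query the package metadata:
# optional: print the installed easyllm version via the package metadata
from importlib.metadata import version
print(version("easyllm"))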
2. An example chat API call
A chat API call has two required inputs:
- model: the name of the model you want to use (e.g., meta-llama/Llama-2-70b-chat-hf), or leave it empty to just call the API
- messages: a list of message objects, where each object has two required fields:
    - role: the role of the messenger (either system, user, or assistant)
    - content: the content of the message (e.g., Write me a beautiful poem)
Compared to the OpenAI API, the huggingface module also exposes prompt_builder and stop_sequences parameters that you can use to customize the prompt and the stop sequences. The EasyLLM package comes with prompt builder utilities.
Let's look at an example chat API call to see how the chat format works in practice.
import os
# set env for prompt builder
os.environ["HUGGINGFACE_PROMPT"] = "llama2" # vicuna, wizardlm, stablebeluga, open_assistant
# os.environ["HUGGINGFACE_TOKEN"] = "hf_xxx"
from easyllm.clients import huggingface
# Changing configuration without using environment variables
# huggingface.api_key="hf_xxx"
# huggingface.prompt_builder = "llama2"
MODEL="meta-llama/Llama-2-70b-chat-hf"
response = huggingface.ChatCompletion.create(
model=MODEL,
messages=[
{"role": "system", "content": "\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."},
{"role": "user", "content": "Knock knock."},
{"role": "assistant", "content": "Who's there?"},
{"role": "user", "content": "Cat."},
],
temperature=0.9,
top_p=0.6,
max_tokens=1024,
)
response
{'id': 'hf-lt8HWKZn-O', 'object': 'chat.completion', 'created': 1695106434, 'model': 'meta-llama/Llama-2-70b-chat-hf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': ' Cat who?'}, 'finish_reason': 'eos_token'}], 'usage': {'prompt_tokens': 149, 'completion_tokens': 5, 'total_tokens': 154}}
As you can see, the response object has a few fields:
- id: the ID of the request
- object: the type of object returned (e.g., chat.completion)
- created: the timestamp of the request
- model: the full name of the model used to generate the response
- usage: the number of tokens used to generate the replies, counting prompt, completion, and total
- choices: a list of completion objects (only one, unless you set n greater than 1)
    - message: the message object generated by the model, with role and content
    - finish_reason: the reason the model stopped generating text (either stop, or length if the max_tokens limit was reached)
    - index: the index of the completion in the list of choices
Extract just the reply with:
print(response['choices'][0]['message']['content'])
Cat who?
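The other fields can be read the same way. For example, the finish reason and token usage from the response printed above:
# inspect why generation stopped and how many tokens were used
print(response['choices'][0]['finish_reason'])  # 'eos_token' in the response above
print(response['usage'])                        # prompt, completion, and total tokens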
Even non-conversation-based tasks can fit into the chat format, by placing the instruction in the first user message.
For example, to ask the model to explain asynchronous programming in the style of a math teacher, we can structure the conversation as follows:
# example with a system message
response = huggingface.ChatCompletion.create(
model=MODEL,
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain asynchronous programming in the style of math teacher."},
],
)
print(response['choices'][0]['message']['content'])
Hello, my dear students! Today, we're going to learn about a fascinating topic that will help us understand how to make our programs more efficient and responsive: asynchronous programming. Imagine you're working on a project with a group of people, and you need to finish your part before others can start theirs. But, you're waiting for someone else to finish their part so you can start yours. This is similar to how asynchronous programming works. In asynchronous programming, we break down a program into smaller parts called "tasks." These tasks can run independently, without blocking other tasks from running. This means that if one task is waiting for something to happen, like a response from a server or a user input, other tasks can keep running in the meantime. Let's use a simple example to illustrate this. Imagine you're making a sandwich. You need to put the bread slices together, add the filling, and then put the sandwich in the fridge to chill. But, you can't start making the sandwich until the bread is toasted, and you can't put the sandwich in the fridge until it's assembled. In this scenario, toasting the bread and assembling the sandwich are two separate tasks. If we were to do them synchronously, we would do them one after the other, like this: 1. Toast the bread 2. Assemble the sandwich 3. Put the sandwich in the fridge But, with asynchronous programming, we can do them simultaneously, like this: 1. Toast the bread (starts) 2. Assemble the sandwich (starts) 3. Toast the bread (finishes) 4. Put the sandwich in the fridge By doing tasks simultaneously, we can save time and make our program more efficient. But, we need to be careful not to get confused about the order in which things happen. That's why we use special tools, like "promises" and "callbacks," to keep track of everything. So, my dear students, I hope this helps you understand asynchronous programming a bit better. Remember, it's all about breaking down a program into smaller, independent tasks that can run simultaneously, making our programs more efficient and responsive. Now, go forth and create some amazing programs!
# example without a system message and debug flag on:
response = huggingface.ChatCompletion.create(
model=MODEL,
messages=[
{"role": "user", "content": "Explain asynchronous programming in the style of the pirate Blackbeard."},
],
debug=True,
)
print(response['choices'][0]['message']['content'])
08/04/2023 08:16:57 - DEBUG - easyllm.utils - Prompt sent to model will be: <s>[INST] Explain asynchronous programming in the style of the pirate Blackbeard. [/INST] 08/04/2023 08:16:57 - DEBUG - easyllm.utils - Url: https://api-inference.huggingface.co/models/meta-llama/Llama-2-70b-chat-hf 08/04/2023 08:16:57 - DEBUG - easyllm.utils - Stop sequences: [] 08/04/2023 08:16:57 - DEBUG - easyllm.utils - Generation parameters: {'do_sample': True, 'return_full_text': False, 'max_new_tokens': 1024, 'top_p': 0.6, 'temperature': 0.9, 'stop_sequences': [], 'repetition_penalty': 1.0, 'top_k': 10, 'seed': 42} 08/04/2023 08:16:57 - DEBUG - easyllm.utils - Response at index 0: index=0 message=ChatMessage(role='assistant', content=' Ahoy matey! Yer lookin\' fer a tale of asynchronous programming, eh? Well, settle yerself down with a pint o\' grog and listen close, for Blackbeard\'s got a story fer ye.\n\nAsynchronous programming, me hearties, be like sailin\' a ship through treacherous waters. Ye gotta keep yer wits about ye, and watch out fer the hidden dangers that lie beneath the surface.\n\nImagine ye\'re sailin\' along, and suddenly, out o\' the blue, a great storm brews up. The winds howl, the waves crash, and yer ship takes on water. Now, ye gotta act fast, or ye\'ll be sent to Davy Jones\' locker!\n\nBut, me hearties, ye can\'t just abandon ship. Ye gotta batten down the hatches, and ride out the storm. And that\'s where asynchronous programming comes in.\n\nAsynchronous programming be like haulin\' up the sails, and lettin\' the wind do the work fer ye. Ye don\'t have to worry about the details o\' how the wind\'s blowin\', or the waves crashin\', ye just gotta keep yer ship pointed in the right direction, and let nature take its course.\n\nNow, I know what ye\'re thinkin\', "Blackbeard, how do I know when me ship\'s gonna make it through the storm?" And that, me hearties, be the beauty o\' asynchronous programming. Ye don\'t have to know! Ye just have to trust that the winds o\' change will carry ye through, and ye\'ll make it to the other side, all in one piece.\n\nBut, me hearties, don\'t ye be thinkin\' this be easy. Asynchronous programming be like navigatin\' through treacherous waters, with a crew o\' mutinous code, and a hull full o\' bugs. Ye gotta be prepared fer the unexpected, and have a stout heart, or ye\'ll be walkin\' the plank!\n\nSo, me hearties, there ye have it. Asynchronous programming in the style o\' Blackbeard. May the winds o\' change blow in yer favor, and may yer code always be free o\' bugs! Arrr!') finish_reason='eos_token' Ahoy matey! Yer lookin' fer a tale of asynchronous programming, eh? Well, settle yerself down with a pint o' grog and listen close, for Blackbeard's got a story fer ye. Asynchronous programming, me hearties, be like sailin' a ship through treacherous waters. Ye gotta keep yer wits about ye, and watch out fer the hidden dangers that lie beneath the surface. Imagine ye're sailin' along, and suddenly, out o' the blue, a great storm brews up. The winds howl, the waves crash, and yer ship takes on water. Now, ye gotta act fast, or ye'll be sent to Davy Jones' locker! But, me hearties, ye can't just abandon ship. Ye gotta batten down the hatches, and ride out the storm. And that's where asynchronous programming comes in. Asynchronous programming be like haulin' up the sails, and lettin' the wind do the work fer ye. 
Ye don't have to worry about the details o' how the wind's blowin', or the waves crashin', ye just gotta keep yer ship pointed in the right direction, and let nature take its course. Now, I know what ye're thinkin', "Blackbeard, how do I know when me ship's gonna make it through the storm?" And that, me hearties, be the beauty o' asynchronous programming. Ye don't have to know! Ye just have to trust that the winds o' change will carry ye through, and ye'll make it to the other side, all in one piece. But, me hearties, don't ye be thinkin' this be easy. Asynchronous programming be like navigatin' through treacherous waters, with a crew o' mutinous code, and a hull full o' bugs. Ye gotta be prepared fer the unexpected, and have a stout heart, or ye'll be walkin' the plank! So, me hearties, there ye have it. Asynchronous programming in the style o' Blackbeard. May the winds o' change blow in yer favor, and may yer code always be free o' bugs! Arrr!
3. Few-shot prompting
In some cases, it's easier to show the model what you want rather than tell the model what you want.
One way to show the model what you want is with faked example messages.
For example:
# An example of a faked few-shot conversation to prime the model into translating business jargon to simpler speech
response = huggingface.ChatCompletion.create(
model=MODEL,
messages=[
{"role": "system", "content": "You are a helpful, pattern-following assistant."},
{"role": "user", "content": "Help me translate the following corporate jargon into plain English."},
{"role": "assistant", "content": "Sure, I'd be happy to!"},
{"role": "user", "content": "New synergies will help drive top-line growth."},
{"role": "assistant", "content": "Things working well together will increase revenue."},
{"role": "user", "content": "Let's circle back when we have more bandwidth to touch base on opportunities for increased leverage."},
{"role": "assistant", "content": "Let's talk later when we're less busy about how to do better."},
{"role": "user", "content": "This late pivot means we don't have time to boil the ocean for the client deliverable."},
],
)
print(response["choices"][0]["message"]["content"])
08/04/2023 08:16:57 - DEBUG - easyllm.utils - Prompt sent to model will be: <s>[INST] <<SYS>> You are a helpful, pattern-following assistant. <</SYS>> Help me translate the following corporate jargon into plain English. [/INST] Sure, I'd be happy to!</s><s>[INST] New synergies will help drive top-line growth. [/INST] Things working well together will increase revenue.</s><s>[INST] Let's circle back when we have more bandwidth to touch base on opportunities for increased leverage. [/INST] Let's talk later when we're less busy about how to do better.</s><s>[INST] This late pivot means we don't have time to boil the ocean for the client deliverable. [/INST] 08/04/2023 08:16:57 - DEBUG - easyllm.utils - Url: https://api-inference.huggingface.co/models/meta-llama/Llama-2-70b-chat-hf 08/04/2023 08:16:57 - DEBUG - easyllm.utils - Stop sequences: [] 08/04/2023 08:16:57 - DEBUG - easyllm.utils - Generation parameters: {'do_sample': True, 'return_full_text': False, 'max_new_tokens': 1024, 'top_p': 0.6, 'temperature': 0.9, 'stop_sequences': [], 'repetition_penalty': 1.0, 'top_k': 10, 'seed': 42} 08/04/2023 08:16:57 - DEBUG - easyllm.utils - Response at index 0: index=0 message=ChatMessage(role='assistant', content=" We've changed direction too late to do a complete job for the client.") finish_reason='eos_token' We've changed direction too late to do a complete job for the client.
Not every attempt at engineering conversations will succeed at first.
If your first attempts fail, don't be afraid to experiment with different ways of priming or conditioning the model.
As an example, one developer discovered an increase in accuracy when they inserted a user message that said "Great job so far, these have been perfect" to help condition the model into providing higher quality responses.
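As a sketch of that idea applied to the jargon-translation conversation above, you could insert the priming message (plus a short assistant acknowledgement, added here only to keep the user/assistant turns alternating) before the final request; whether it actually improves the output will vary by model and task:
# same few-shot conversation as above, with an extra priming exchange inserted
response = huggingface.ChatCompletion.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful, pattern-following assistant."},
        {"role": "user", "content": "Help me translate the following corporate jargon into plain English."},
        {"role": "assistant", "content": "Sure, I'd be happy to!"},
        {"role": "user", "content": "New synergies will help drive top-line growth."},
        {"role": "assistant", "content": "Things working well together will increase revenue."},
        {"role": "user", "content": "Let's circle back when we have more bandwidth to touch base on opportunities for increased leverage."},
        {"role": "assistant", "content": "Let's talk later when we're less busy about how to do better."},
        {"role": "user", "content": "Great job so far, these have been perfect."},
        {"role": "assistant", "content": "Thank you! I'm happy to keep going."},
        {"role": "user", "content": "This late pivot means we don't have time to boil the ocean for the client deliverable."},
    ],
)
print(response["choices"][0]["message"]["content"])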
For more ideas on how to lift the reliability of the models, consider reading our guide on techniques to increase reliability. It was written for non-chat models, but many of its principles still apply.