How to use Chat Completion clients with Amazon SageMaker¶
EasyLLM can be used as an abstraction layer to replace gpt-3.5-turbo and gpt-4 with open source models.
You can switch your existing applications from the OpenAI API by simply changing the client.
Chat models take a series of messages as input, and return an AI-written message as output.
This guide illustrates the chat format with a few example API calls.
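If you are coming from the OpenAI Python library, the switch is a one-line change: import the EasyLLM client instead of openai and point the model parameter at your SageMaker endpoint. Below is a minimal sketch of the migration, assuming an endpoint is already deployed (see Setup below); the endpoint name is the example used throughout this guide:

from easyllm.clients import sagemaker

messages = [{"role": "user", "content": "Hello!"}]

# before, with the OpenAI client:
# import openai
# response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)

# after, with the EasyLLM SageMaker client; model is now the endpoint name
response = sagemaker.ChatCompletion.create(
    model="huggingface-pytorch-tgi-inference-2023-08-08-14-15-52-703",
    messages=messages,
)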
0. Setup¶
Before you can use easyllm with Amazon SageMaker, you need to deploy the model to a SageMaker endpoint. You can do this by following one of the blog posts below:
- Deploy Llama 2 7B/13B/70B on Amazon SageMaker
- Deploy Falcon 7B & 40B on Amazon SageMaker
- Introducing the Hugging Face LLM Inference Container for Amazon SageMaker
Once your endpoint is deployed, copy its endpoint name. The endpoint name will be our model parameter. You can find the endpoint name in the AWS Management Console for Amazon SageMaker under "Inference" -> "Endpoints" -> "Name", or, if you deployed your model with the SageMaker Python SDK, read it from the predictor.endpoint_name attribute.
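For reference, here is a minimal deployment sketch with the SageMaker Python SDK and the Hugging Face LLM Inference Container; the model id and instance type are placeholders you should adapt (see the blog posts above for full walkthroughs):

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # or an explicit IAM role ARN

# retrieve the image URI of the Hugging Face LLM inference container (TGI)
image_uri = get_huggingface_llm_image_uri("huggingface")

model = HuggingFaceModel(
    image_uri=image_uri,
    env={"HF_MODEL_ID": "tiiuae/falcon-7b-instruct"},  # placeholder model id
    role=role,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # adjust to the model size
)

print(predictor.endpoint_name)  # this is the value for the model parameter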
1. Import the easyllm library¶
# if needed, install and/or upgrade to the latest version of the EasyLLM Python library
%pip install --upgrade easyllm
# import the EasyLLM Python library for calling the EasyLLM API
import easyllm
2. An example chat API call¶
A chat API call has two required inputs:
- model: the name of the SageMaker endpoint you want to use (e.g., huggingface-pytorch-tgi-inference-2023-08-08-14-15-52-703)
- messages: a list of message objects, where each object has two required fields:
    - role: the role of the messenger (either system, user, or assistant)
    - content: the content of the message (e.g., Write me a beautiful poem)
Compared to the OpenAI API, the sagemaker module also exposes a prompt_builder and a stop_sequences parameter that you can use to customize the prompt and the stop sequences. The EasyLLM package comes with prompt builder utilities.
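For instance, here is a short sketch using both parameters; it assumes stop_sequences is passed through to the endpoint as additional stopping criteria, and it reuses the example endpoint name from this guide:

from easyllm.clients import sagemaker

# select the prompt format that matches the deployed model
sagemaker.prompt_builder = "llama2"

response = sagemaker.ChatCompletion.create(
    model="huggingface-pytorch-tgi-inference-2023-08-08-14-15-52-703",
    messages=[{"role": "user", "content": "Count from 1 to 10."}],
    stop_sequences=["7"],  # generation stops once this string is produced
)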
Let's look at an example chat API call to see how the chat format works in practice.
import os
# set env for prompt builder
os.environ["HUGGINGFACE_PROMPT"] = "llama2" # vicuna, wizardlm, stablebeluga, open_assistant
os.environ["AWS_REGION"] = "us-east-1" # change to your region
# os.environ["AWS_ACCESS_KEY_ID"] = "XXX" # needed if not using boto3 session
# os.environ["AWS_SECRET_ACCESS_KEY"] = "XXX" # needed if not using boto3 session
from easyllm.clients import sagemaker
# Changing configuration without using environment variables
# sagemaker.prompt_builder = "llama2"
# sagemaker.api_aws_access_key="xxx"
# sagemaker.api_aws_secret_key="xxx"
# SageMaker endpoint name
MODEL="huggingface-pytorch-tgi-inference-2023-08-08-14-15-52-703"
response = sagemaker.ChatCompletion.create(
model=MODEL,
messages=[
{"role": "system", "content": "\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."},
{"role": "user", "content": "Can you tell me something about Amazon SageMaker?"},
],
temperature=0.9,
top_p=0.6,
max_tokens=1024,
debug=False,
)
response
{'id': 'hf-2qYJ06mvpP', 'object': 'chat.completion', 'created': 1691507348, 'model': 'huggingface-pytorch-tgi-inference-2023-08-08-14-15-52-703', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': " Of course! Amazon SageMaker is a cloud-based machine learning platform provided by Amazon Web Services (AWS). It allows data scientists and machine learning practitioners to build, train, and deploy machine learning models more easily and efficiently. With SageMaker, users can perform a wide range of machine learning tasks, including data preparation, model training, and model deployment, all within a single platform.\nSome of the key features of Amazon SageMaker include:\n1. Data Wrangling: SageMaker provides a range of tools for data preparation, including data cleaning, feature engineering, and data transformation.\n2. Training and Hyperparameter Tuning: Users can train machine learning models using SageMaker's built-in algorithms or their own custom algorithms. The platform also provides automated hyperparameter tuning, which can help improve model performance.\n3. Model Deployment: Once a model is trained and optimized, SageMaker allows users to deploy it to a variety of environments, including AWS services like Amazon S3 and Amazon EC2, as well as on-premises environments.\n4. Collaboration and Management: SageMaker provides tools for collaboration and model management, including version control, reproducibility, and team-based workflows.\n5. Integration with Other AWS Services: SageMaker integrates with other AWS services, such as Amazon S3, Amazon Redshift, and Amazon EMR, to provide a comprehensive machine learning platform.\nOverall, Amazon SageMaker is a powerful platform that can help organizations of all sizes build and deploy machine learning models more efficiently and effectively."}, 'finish_reason': 'eos_token'}], 'usage': {'prompt_tokens': 148, 'completion_tokens': 353, 'total_tokens': 501}}
As you can see, the response object has a few fields:
- id: the ID of the request
- object: the type of object returned (e.g., chat.completion)
- created: the timestamp of the request
- model: the full name of the model used to generate the response
- usage: the number of tokens used to generate the replies, counting prompt, completion, and total
- choices: a list of completion objects (only one, unless you set n greater than 1)
    - message: the message object generated by the model, with role and content
    - finish_reason: the reason the model stopped generating text (stop, length if the max_tokens limit was reached, or eos_token when the model emitted its end-of-sequence token, as in the example above)
    - index: the index of the completion in the list of choices
Extract just the reply with:
print(response['choices'][0]['message']['content'])
Of course! Amazon SageMaker is a cloud-based machine learning platform provided by Amazon Web Services (AWS). It allows data scientists and machine learning practitioners to build, train, and deploy machine learning models more easily and efficiently. With SageMaker, users can perform a wide range of machine learning tasks, including data preparation, model training, and model deployment, all within a single platform. Some of the key features of Amazon SageMaker include: 1. Data Wrangling: SageMaker provides a range of tools for data preparation, including data cleaning, feature engineering, and data transformation. 2. Training and Hyperparameter Tuning: Users can train machine learning models using SageMaker's built-in algorithms or their own custom algorithms. The platform also provides automated hyperparameter tuning, which can help improve model performance. 3. Model Deployment: Once a model is trained and optimized, SageMaker allows users to deploy it to a variety of environments, including AWS services like Amazon S3 and Amazon EC2, as well as on-premises environments. 4. Collaboration and Management: SageMaker provides tools for collaboration and model management, including version control, reproducibility, and team-based workflows. 5. Integration with Other AWS Services: SageMaker integrates with other AWS services, such as Amazon S3, Amazon Redshift, and Amazon EMR, to provide a comprehensive machine learning platform. Overall, Amazon SageMaker is a powerful platform that can help organizations of all sizes build and deploy machine learning models more efficiently and effectively.
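The other fields can be read the same way, for example to check token usage and why generation stopped:

usage = response["usage"]
print(usage["prompt_tokens"], usage["completion_tokens"], usage["total_tokens"])
print(response["choices"][0]["finish_reason"])  # eos_token in the response above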
Even non-conversation-based tasks can fit into the chat format by placing the instruction in the first user message.
For example, to ask the model to explain asynchronous programming in the style of a math teacher, we can structure the conversation as follows:
# example with a system message
response = sagemaker.ChatCompletion.create(
model=MODEL,
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain asynchronous programming in the style of math teacher."},
],
)
print(response['choices'][0]['message']['content'])
Ah, my dear student, let me explain asynchronous programming in a most delightful and intuitive manner! *adjusts glasses* Asynchronous programming, you see, is like solving a complex equation. *writes on board* You have a problem that requires immediate attention, but you can't just sit there and wait for the solution to appear. *mimes a person twiddling their thumbs* No, no, my young apprentice! You must use your powers of creativity and ingenuity to find a way to solve the problem in parallel! *winks* Now, in math, we often use techniques like substitution, elimination, or even the occasional trickery of complex numbers to solve equations. *nods* But in asynchronous programming, we use something called "asynchronous operations" to tackle problems that require more than just a simple "wait and see" approach. *smirks* Think of it like this: imagine you have a bunch of tasks that need to be done, but they can't all be done at the same time. Maybe you have to fetch some data from a database, process it, and then perform some calculations. *mimes typing on a keyboard* But wait! You can't just sit there and wait for each task to finish, or you'll be twiddling your thumbs for hours! *chuckles* So, what do you do? *smirks* You break each task into smaller, more manageable pieces, and you give each piece a special "asynchronous hat"! *winks* These hats allow each piece to work on its task independently, without waiting for the others to finish. *nods* For example, you could give one piece the task of fetching data from the database, another piece the task of processing it, and another piece the task of performing calculations. *mimes handing out hats* And then, you can just sit back and watch as each piece works on its task, without any of them waiting for the others to finish! *chuckles* But wait, there's more! *excitedly* With asynchronous programming, you can even use something called "callbacks" to make sure everything gets done in the right order! *nods* It's like having a team of highly skilled mathematicians working on your problem, each one using their own special hat to solve a different part of the equation! *smirks* So there you have it, my dear student! Asynchronous programming is like solving a complex equation, but instead of just waiting for the answer, you use your powers of creativity and ingenuity to find a way to solve it in parallel! *nods* Now, go forth and conquer those complex problems, my young apprentice! *winks*
# example without a system message:
response = sagemaker.ChatCompletion.create(
model=MODEL,
messages=[
{"role": "user", "content": "Explain asynchronous programming in the style of the pirate Blackbeard."},
]
)
print(response['choices'][0]['message']['content'])
Shiver me timbers! Ye landlubbers be wantin' to know about this here asynchronous programming business? Well, listen close to the tales of the great Blackbeard himself, and I'll spin ye a yarn 'bout how it works! Ahoy, me hearties! Asynchronous programming be like sailin' the high seas. Ye see, ye gotta have a ship, and that ship be called "Thread". Now, ye might be thinkin', "Blackbeard, what be the point o' havin' a ship if ye can't steer it?" And to that, I say, "Arrr, ye landlubbers be thinkin' too small!" See, with asynchronous programming, ye can have multiple "threads" sailin' the seas at the same time, each one doin' its own thing. And that be a mighty powerful thing, me hearties! But wait, there be more! Ye see, these threads be like different ships, each one with its own crew and mission. And they be sailin' the seas at different speeds, too! Some might be sailin' fast, while others be sailin' slow. And that be the beauty o' it, me hearties! Ye can have one thread bein' busy with somethin' important, while another thread bein' all relaxed and takin' a nap. It be like havin' a whole fleet o' ships at yer disposal, each one doin' its own thing! Now, I know what ye be thinkin', "Blackbeard, how do ye keep all these ships from crashin' into each other?" And to that, I say, "Arrr, that be the magic o' the asynchronous programming, me hearties!" Ye see, each thread be runnin' its own course, and they be communicate with each other through messages. It be like sendin' a message to another ship on the high seas, only instead o' usin' a message, ye be usin' a special kind o' code. And that code be like a map, showin' each thread where to go and what to do. But wait, there be more! Ye see, these threads be like different crew members on a ship. Some might be skilled with swords, while others be skilled with navigatin'. And they be workin' together, each one doin' its part to keep the ship sailin' smoothly. And that be the beauty o' asynchronous programming, me hearties! Ye can have different threads bein' responsible for different tasks, each one doin' its own thing, but all workin' together to get the job done! So there ye have it, me hearties! Asynchronous programming be like sailin' the high seas with a fleet o' ships, each one doin' its own thing, but all workin' together to get the job done. And with the right code, ye can be the captain o' yer own ship, sailin' the seas o' computing like a true pirate! Arrr!
3. Few-shot prompting¶
In some cases, it's easier to show the model what you want rather than tell the model what you want.
One way to show the model what you want is with faked example messages.
For example:
# An example of a faked few-shot conversation to prime the model into translating business jargon to simpler speech
response = sagemaker.ChatCompletion.create(
model=MODEL,
messages=[
{"role": "system", "content": "You are a helpful, pattern-following assistant."},
{"role": "user", "content": "Help me translate the following corporate jargon into plain English."},
{"role": "assistant", "content": "Sure, I'd be happy to!"},
{"role": "user", "content": "New synergies will help drive top-line growth."},
{"role": "assistant", "content": "Things working well together will increase revenue."},
{"role": "user", "content": "Let's circle back when we have more bandwidth to touch base on opportunities for increased leverage."},
{"role": "assistant", "content": "Let's talk later when we're less busy about how to do better."},
{"role": "user", "content": "This late pivot means we don't have time to boil the ocean for the client deliverable."},
],
)
print(response["choices"][0]["message"]["content"])
"We don't have time to do everything we originally planned for the client, so we'll have to focus on the most important things and 'boil the ocean' later."
Not every attempt at engineering conversations will succeed at first.
If your first attempts fail, don't be afraid to experiment with different ways of priming or conditioning the model.
As an example, one developer discovered an increase in accuracy when they inserted a user message that said "Great job so far, these have been perfect" to help condition the model into providing higher quality responses.
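Here is a sketch of that trick applied to the jargon example above; the extra praise message is purely illustrative, and whether it helps depends on the model:

response = sagemaker.ChatCompletion.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful, pattern-following assistant."},
        {"role": "user", "content": "Help me translate the following corporate jargon into plain English."},
        {"role": "assistant", "content": "Sure, I'd be happy to!"},
        {"role": "user", "content": "New synergies will help drive top-line growth."},
        {"role": "assistant", "content": "Things working well together will increase revenue."},
        # conditioning message praising the previous answers
        {"role": "user", "content": "Great job so far, these have been perfect. Now translate: This late pivot means we don't have time to boil the ocean for the client deliverable."},
    ],
)
print(response["choices"][0]["message"]["content"])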
For more ideas on how to lift the reliability of the models, consider reading our guide on techniques to increase reliability. It was written for non-chat models, but many of its principles still apply.