Hugging Face Inference Endpoints Example¶
Hugging Face Inference Endpoints offers an easy and secure way to deploy Machine Learning models for use in production. Inference Endpoints empower developers and data scientists alike to create AI applications without managing infrastructure: simplifying the deployment process to a few clicks, including handling large volumes of requests with autoscaling, reducing infrastructure costs with scale-to-zero, and offering advanced security.
You can get started with Inference Endpoints at: https://ui.endpoints.huggingface.co/
This example assumes that you have a running endpoint for a conversational model, e.g. https://huggingface.co/meta-llama/Llama-2-13b-chat-hf
1. Import the easyllm library¶
# if needed, install and/or upgrade to the latest version of the easyllm Python library
%pip install --upgrade easyllm
2. An example chat API call¶
Since we want to use our endpoint for inference, we don't have to define the model parameter. We either need to expose the environment variable HUGGINGFACE_API_BASE before importing easyllm.clients.huggingface, or overwrite the huggingface.api_base value.
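For example, a minimal sketch of the environment-variable approach (the URL and token values below are placeholders you would replace with your own):
import os

# set these before importing easyllm.clients.huggingface
os.environ["HUGGINGFACE_API_BASE"] = "https://YOUR_ENDPOINT_URL"  # your Inference Endpoint URL
os.environ["HUGGINGFACE_TOKEN"] = "hf_xxx"  # your Hugging Face token

from easyllm.clients import huggingface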
A chat API call then only has two required inputs:
- messages: a list of message objects, where each object has two required fields:
  - role: the role of the messenger (either system, user, or assistant)
  - content: the content of the message (e.g., Write me a beautiful poem)
Let's look at an example chat API call to see how the chat format works in practice.
from easyllm.clients import huggingface
# Here we overwrite the defaults; you can also use environment variables
huggingface.prompt_builder = "llama2"
huggingface.api_base = "YOUR_ENDPOINT_URL"
# The module automatically loads the HuggingFace API key from the environment variable HUGGINGFACE_TOKEN or from the HuggingFace CLI configuration file.
# huggingface.api_key="hf_xxx"
response = huggingface.ChatCompletion.create(
    messages=[
        {"role": "system", "content": "\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."},
        {"role": "user", "content": "Knock knock."},
        {"role": "assistant", "content": "Who's there?"},
        {"role": "user", "content": "Apple."},
    ],
    temperature=0.9,
    top_p=0.6,
    max_tokens=1024,
)
response
{'id': 'hf-0lL5H_yyRR', 'object': 'chat.completion', 'created': 1691096023, 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': ' Apple who?'}, 'finish_reason': 'eos_token'}], 'usage': {'prompt_tokens': 149, 'completion_tokens': 5, 'total_tokens': 154}}
As you can see, the response object has a few fields:
- id: the ID of the request
- object: the type of object returned (e.g., chat.completion)
- created: the timestamp of the request
- model: the full name of the model used to generate the response
- usage: the number of tokens used to generate the replies, counting prompt, completion, and total
- choices: a list of completion objects (only one, unless you set n greater than 1)
  - message: the message object generated by the model, with role and content
  - finish_reason: the reason the model stopped generating text (either stop, or length if the max_tokens limit was reached)
  - index: the index of the completion in the list of choices
Extract just the reply with:
print(response['choices'][0]['message']['content'])
Apple who?
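In the same way, you can read the token counts reported by the endpoint (the usage field and its values come from the example response above):
print(response['usage']['total_tokens'])
154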
How to stream Chat Completion requests¶
You can also stream Chat Completion requests from your custom endpoint by setting stream=True.
from easyllm.clients import huggingface
huggingface.prompt_builder = "llama2"
# Here you can overwrite the URL to your endpoint; it can also be localhost:8000
huggingface.api_base = "YOUR_ENDPOINT_URL"
# a ChatCompletion request
response = huggingface.ChatCompletion.create(
    messages=[
        {'role': 'user', 'content': "Count to 10."}
    ],
    stream=True  # this time, we set stream=True
)
for chunk in response:
    delta = chunk['choices'][0]['delta']
    if "content" in delta:
        print(delta["content"], end="")
Sure! Here we go: 1. One 2. Two 3. Three 4. Four 5. Five 6. Six 7. Seven 8. Eight 9. Nine 10. Ten!
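If you prefer to collect the streamed reply into a single string instead of printing it token by token, a minimal sketch looks like this (note that the generator above is already consumed, so it starts with a fresh streaming request):
# a fresh streaming request, since the previous generator was already consumed
response = huggingface.ChatCompletion.create(
    messages=[
        {'role': 'user', 'content': "Count to 10."}
    ],
    stream=True
)

# accumulate the content deltas into one string
collected = []
for chunk in response:
    delta = chunk['choices'][0]['delta']
    if "content" in delta:
        collected.append(delta["content"])
full_reply = "".join(collected)
print(full_reply)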