Hugging Face Inference Endpoints Example¶
Hugging Face Inference Endpoints offers an easy and secure way to deploy Machine Learning models for use in production. Inference Endpoints empower developers and data scientists alike to create AI applications without managing infrastructure: simplifying the deployment process to a few clicks, including handling large volumes of requests with autoscaling, reducing infrastructure costs with scale-to-zero, and offering advanced security.
You can get started with Inference Endpoints at: https://ui.endpoints.huggingface.co/
This example assumes that you have a running endpoint for a conversational model, e.g. https://huggingface.co/meta-llama/Llama-2-13b-chat-hf
1. Import the easyllm library¶
# if needed, install and/or upgrade to the latest version of the easyllm Python library
%pip install --upgrade easyllm
2. An example chat API call¶
Since we want to use our endpoint for inference, we don't have to define the model parameter. We either need to expose the environment variable HUGGINGFACE_API_BASE before importing easyllm.clients.huggingface, or overwrite the huggingface.api_base value.
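For example, a minimal sketch of the environment-variable approach (the URL and token values below are placeholders you would replace with your own):
import os

# set these before importing easyllm.clients.huggingface
os.environ["HUGGINGFACE_API_BASE"] = "https://YOUR_ENDPOINT_URL"  # your Inference Endpoint URL
os.environ["HUGGINGFACE_TOKEN"] = "hf_xxx"  # your Hugging Face token

from easyllm.clients import huggingface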
A chat API call then only has two required inputs:
- messages: a list of message objects, where each object has two required fields:
  - role: the role of the messenger (either system, user, or assistant)
  - content: the content of the message (e.g., Write me a beautiful poem)
Let's look at an example chat API call to see how the chat format works in practice.
from easyllm.clients import huggingface
# Here we overwrite the defaults; you can also use environment variables
huggingface.prompt_builder = "llama2"
huggingface.api_base = "YOUR_ENDPOINT_URL"
# The module automatically loads the HuggingFace API key from the environment variable HUGGINGFACE_TOKEN or from the HuggingFace CLI configuration file.
# huggingface.api_key="hf_xxx"
response = huggingface.ChatCompletion.create(
    messages=[
        {"role": "system", "content": "\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."},
        {"role": "user", "content": "Knock knock."},
        {"role": "assistant", "content": "Who's there?"},
        {"role": "user", "content": "Apple."},
    ],
    temperature=0.9,
    top_p=0.6,
    max_tokens=1024,
)
response
{'id': 'hf-0lL5H_yyRR', 'object': 'chat.completion', 'created': 1691096023, 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': ' Apple who?'}, 'finish_reason': 'eos_token'}], 'usage': {'prompt_tokens': 149, 'completion_tokens': 5, 'total_tokens': 154}}
As you can see, the response object has a few fields:
- id: the ID of the request
- object: the type of object returned (e.g., chat.completion)
- created: the timestamp of the request
- model: the full name of the model used to generate the response
- usage: the number of tokens used to generate the replies, counting prompt, completion, and total
- choices: a list of completion objects (only one, unless you set n greater than 1)
  - message: the message object generated by the model, with role and content
  - finish_reason: the reason the model stopped generating text (either stop, or length if the max_tokens limit was reached)
  - index: the index of the completion in the list of choices
Extract just the reply with:
print(response['choices'][0]['message']['content'])
Apple who?
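In the same way, you can read the token counts reported by the endpoint (the usage field and its values come from the example response above):
print(response['usage']['total_tokens'])
154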
How to stream Chat Completion requests¶
You can also stream Chat Completion requests from your custom endpoint by setting stream=True.
from easyllm.clients import huggingface
huggingface.prompt_builder = "llama2"
# Here you can overwrite the URL to your endpoint; it can also be localhost:8000
huggingface.api_base = "YOUR_ENDPOINT_URL"
# a ChatCompletion request
response = huggingface.ChatCompletion.create(
    messages=[
        {'role': 'user', 'content': "Count to 10."}
    ],
    stream=True  # this time, we set stream=True
)
for chunk in response:
    delta = chunk['choices'][0]['delta']
    if "content" in delta:
        print(delta["content"], end="")
Sure! Here we go: 1. One 2. Two 3. Three 4. Four 5. Five 6. Six 7. Seven 8. Eight 9. Nine 10. Ten!
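If you prefer to collect the streamed reply into a single string instead of printing it token by token, a minimal sketch looks like this (note that the generator above is already consumed, so it starts with a fresh streaming request):
# a fresh streaming request, since the previous generator was already consumed
response = huggingface.ChatCompletion.create(
    messages=[
        {'role': 'user', 'content': "Count to 10."}
    ],
    stream=True
)

# accumulate the content deltas into one string
collected = []
for chunk in response:
    delta = chunk['choices'][0]['delta']
    if "content" in delta:
        collected.append(delta["content"])
full_reply = "".join(collected)
print(full_reply)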