How to stream Chat Completion requests with Amazon Bedrock¶

By default, when you request a completion, the entire completion is generated before being sent back in a single response.

If you're generating long completions, waiting for the response can take many seconds.

To get responses sooner, you can 'stream' the completion as it's being generated. This allows you to start printing or processing the beginning of the completion before the full completion is finished.

To stream completions, set stream=True when calling the chat completions or completions endpoints. This will return an object that streams back the response as data-only server-sent events. Extract chunks from the delta field rather than the message field.
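As a minimal sketch of that last point, the helper below pulls the text out of each chunk's delta. The chunks are hard-coded dictionaries in the shape this notebook prints further down, not a live Bedrock response, and `iter_content` is a hypothetical helper name that is not part of easyllm.

```python
from typing import Dict, Iterable, Iterator


def iter_content(chunks: Iterable[Dict]) -> Iterator[str]:
    """Yield the text carried by each streamed chunk, skipping chunks without content."""
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]  # streamed chunks use 'delta', not 'message'
        text = delta.get("content")
        if text:
            yield text


# Hard-coded stand-in for a streamed response (assumed shape, no API call is made).
sample_chunks = [
    {"choices": [{"index": 0, "delta": {"role": "assistant"}}]},
    {"choices": [{"index": 0, "delta": {"content": " 1, 2,"}}]},
    {"choices": [{"index": 0, "delta": {"content": " 3"}}]},
    {"choices": [{"index": 0, "delta": {}}]},
]

streamed_text = "".join(iter_content(sample_chunks))
print(streamed_text)
```

Because the helper is a generator, each piece of text can be printed or processed as soon as its chunk arrives.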

Downsides¶

Note that using stream=True in a production application makes it more difficult to moderate the content of the completions, as partial completions may be more difficult to evaluate.
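One possible way to soften this downside is to hold streamed text back until a sentence is complete and check the accumulated text before releasing it. The sketch below illustrates that idea only: `moderated_stream` and `toy_check` are hypothetical names, the input is a plain list of strings rather than a Bedrock stream, and the keyword check stands in for whatever moderation step an application actually uses.

```python
from typing import Callable, Iterable, Iterator


def moderated_stream(
    pieces: Iterable[str],
    is_allowed: Callable[[str], bool],
    boundary: str = ".",
) -> Iterator[str]:
    """Release streamed text one sentence at a time, checking the text so far first."""
    released = ""  # text already shown to the user
    pending = ""   # text received but not yet checked
    for piece in pieces:
        pending += piece
        # Wait for a sentence boundary so the check sees complete sentences.
        while boundary in pending:
            cut = pending.index(boundary) + len(boundary)
            sentence, pending = pending[:cut], pending[cut:]
            if not is_allowed(released + sentence):
                return  # stop the stream; nothing further is released
            released += sentence
            yield sentence
    # Check and flush whatever is left once the stream ends.
    if pending and is_allowed(released + pending):
        yield pending


def toy_check(text: str) -> bool:
    """Placeholder policy: reject any text containing the word 'forbidden'."""
    return "forbidden" not in text.lower()


pieces = ["Hello", " there. This", " is fine. This is for", "bidden. More text."]
shown = list(moderated_stream(pieces, toy_check))
print(shown)
```

The trade-off is latency: the user sees text one sentence at a time instead of one chunk at a time, and everything after a rejected sentence is dropped.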

Setup¶

Before you can use easyllm with Amazon Bedrock, you need to set up permissions and access to the models. You can do this by following the instructions below:

Example code¶

Below, this notebook shows:

What a typical chat completion response looks like
What a streaming chat completion response looks like
How much time is saved by streaming a chat completion

# if needed, install and/or upgrade to the latest version of the EasyLLM Python library
%pip install --upgrade easyllm[bedrock]

# imports
import easyllm  # for API calls

1. What a typical chat completion response looks like¶

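With a typical ChatCompletions API call, the response is first computed and then returned all at once. The request below sets stream=True instead, so the response arrives as a sequence of chunks that can be printed as they are generated.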

import os 
# set env for prompt builder
os.environ["BEDROCK_PROMPT"] = "anthropic" # vicuna, wizardlm, stablebeluga, open_assistant
os.environ["AWS_REGION"] = "us-east-1"  # change to your region
# os.environ["AWS_ACCESS_KEY_ID"] = "XXX" # needed if not using boto3 session
# os.environ["AWS_SECRET_ACCESS_KEY"] = "XXX" # needed if not using boto3 session

from easyllm.clients import bedrock

response = bedrock.ChatCompletion.create(
    model='anthropic.claude-v2',
    messages=[
        {'role': 'user', 'content': 'Count to 100, with a comma between each number and no newlines. E.g., 1, 2, 3, ...'}
    ],
    stream=True
)

for chunk in response:
    print(chunk)

10/26/2023 17:34:57 - INFO - easyllm.utils.logging - boto3 Bedrock client successfully created!
{'id': 'hf-Je8BGADPWN', 'object': 'chat.completion.chunk', 'created': 1698334497, 'model': 'anthropic.claude-v2', 'choices': [{'index': 0, 'delta': {'role': 'assistant'}}]}
{'id': 'hf-Je8BGADPWN', 'object': 'chat.completion.chunk', 'created': 1698334498, 'model': 'anthropic.claude-v2', 'choices': [{'index': 0, 'delta': {'content': ' Here'}}]}
{'id': 'hf-Je8BGADPWN', 'object': 'chat.completion.chunk', 'created': 1698334498, 'model': 'anthropic.claude-v2', 'choices': [{'index': 0, 'delta': {'content': ' is counting to 100 with a comma'}}]}
{'id': 'hf-Je8BGADPWN', 'object': 'chat.completion.chunk', 'created': 1698334498, 'model': 'anthropic.claude-v2', 'choices': [{'index': 0, 'delta': {'content': ' between each number and no newlines:\n\n1, 2, 3,'}}]}
{'id': 'hf-Je8BGADPWN', 'object': 'chat.completion.chunk', 'created': 1698334499, 'model': 'anthropic.claude-v2', 'choices': [{'index': 0, 'delta': {'content': ' 4, 5, 6, 7, 8, 9, 10, 11'}}]}
{'id': 'hf-Je8BGADPWN', 'object': 'chat.completion.chunk', 'created': 1698334499, 'model': 'anthropic.claude-v2', 'choices': [{'index': 0, 'delta': {'content': ', 12, 13, 14, 15, 16, 17, 18,'}}]}
{'id': 'hf-Je8BGADPWN', 'object': 'chat.completion.chunk', 'created': 1698334499, 'model': 'anthropic.claude-v2', 'choices': [{'index': 0, 'delta': {'content': ' 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,'}}]}
{'id': 'hf-Je8BGADPWN', 'object': 'chat.completion.chunk', 'created': 1698334500, 'model': 'anthropic.claude-v2', 'choices': [{'index': 0, 'delta': {'content': ' 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,'}}]}
{'id': 'hf-Je8BGADPWN', 'object': 'chat.completion.chunk', 'created': 1698334500, 'model': 'anthropic.claude-v2', 'choices': [{'index': 0, 'delta': {'content': ' 39, 40, 41, 42, 43, 44, 45, 46, 47, 48,'}}]}
{'id': 'hf-Je8BGADPWN', 'object': 'chat.completion.chunk', 'created': 1698334501, 'model': 'anthropic.claude-v2', 'choices': [{'index': 0, 'delta': {'content': ' 49, 50, 51'}}]}
{'id': 'hf-Je8BGADPWN', 'object': 'chat.completion.chunk', 'created': 1698334501, 'model': 'anthropic.claude-v2', 'choices': [{'index': 0, 'delta': {'content': ', 52, 53,'}}]}
{'id': 'hf-Je8BGADPWN', 'object': 'chat.completion.chunk', 'created': 1698334502, 'model': 'anthropic.claude-v2', 'choices': [{'index': 0, 'delta': {'content': ' 54, 55, 56'}}]}
{'id': 'hf-Je8BGADPWN', 'object': 'chat.completion.chunk', 'created': 1698334503, 'model': 'anthropic.claude-v2', 'choices': [{'index': 0, 'delta': {'content': ', 57, 58, 59, 60, 61'}}]}
{'id': 'hf-Je8BGADPWN', 'object': 'chat.completion.chunk', 'created': 1698334504, 'model': 'anthropic.claude-v2', 'choices': [{'index': 0, 'delta': {'content': ', 62, 63, 64, 65, 66'}}]}
{'id': 'hf-Je8BGADPWN', 'object': 'chat.completion.chunk', 'created': 1698334504, 'model': 'anthropic.claude-v2', 'choices': [{'index': 0, 'delta': {'content': ', 67, 68, 69, 70, 71, 72, 73,'}}]}
{'id': 'hf-Je8BGADPWN', 'object': 'chat.completion.chunk', 'created': 1698334504, 'model': 'anthropic.claude-v2', 'choices': [{'index': 0, 'delta': {'content': ' 74, 75, 76, 77, 78, 79, 80, 81'}}]}
{'id': 'hf-Je8BGADPWN', 'object': 'chat.completion.chunk', 'created': 1698334505, 'model': 'anthropic.claude-v2', 'choices': [{'index': 0, 'delta': {'content': ', 82, 83, 84, 85, 86, 87, 88, 89, 90, 91'}}]}
{'id': 'hf-Je8BGADPWN', 'object': 'chat.completion.chunk', 'created': 1698334505, 'model': 'anthropic.claude-v2', 'choices': [{'index': 0, 'delta': {'content': ', 92, 93, 94, 95, 96, 97, 98, 99, 100'}}]}
{'id': 'hf-Je8BGADPWN', 'object': 'chat.completion.chunk', 'created': 1698334505, 'model': 'anthropic.claude-v2', 'choices': [{'index': 0, 'delta': {}}]}

As you can see above, streaming responses have a delta field rather than a message field. delta can hold things like:

a role token (e.g., {"role": "assistant"})
a content token (e.g., {"content": "\n\n"})
nothing (e.g., {}), when the stream is over
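The three cases above can be handled in a single loop that folds the deltas back into one message. This is a sketch under the assumption that chunks have the dictionary shape printed above; `merge_deltas` is a hypothetical helper, and the sample chunks are shortened copies of that output rather than a live response.

```python
from typing import Dict, Iterable


def merge_deltas(chunks: Iterable[Dict]) -> Dict[str, str]:
    """Fold streamed deltas into one {'role': ..., 'content': ...} message."""
    message = {"role": "", "content": ""}
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        if "role" in delta:      # first chunk: who is speaking
            message["role"] = delta["role"]
        if "content" in delta:   # middle chunks: a piece of text
            message["content"] += delta["content"]
        # an empty delta ({}) carries nothing and marks the end of the stream
    return message


# Shortened copies of the chunks printed above (assumed shape, no API call is made).
example_chunks = [
    {"choices": [{"index": 0, "delta": {"role": "assistant"}}]},
    {"choices": [{"index": 0, "delta": {"content": " Here"}}]},
    {"choices": [{"index": 0, "delta": {"content": " is counting to 100 with a comma"}}]},
    {"choices": [{"index": 0, "delta": {}}]},
]

merged = merge_deltas(example_chunks)
print(merged)
```

The result has the same role and content keys as an entry in the messages list passed to the request, so it can be appended to the conversation history for a follow-up turn.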

3. How much time is saved by streaming a chat completion¶

Now let's ask anthropic.claude-v2 to count to 100 again, this time printing each piece of text as it arrives and collecting the chunks into the full reply.

import os 
# set env for prompt builder
os.environ["BEDROCK_PROMPT"] = "anthropic" # vicuna, wizardlm, stablebeluga, open_assistant
os.environ["AWS_REGION"] = "us-east-1"  # change to your region
os.environ["AWS_PROFILE"] = "hf-sm"  # change to your AWS profile
# os.environ["AWS_ACCESS_KEY_ID"] = "XXX" # needed if not using boto3 session
# os.environ["AWS_SECRET_ACCESS_KEY"] = "XXX" # needed if not using boto3 session
from easyllm.clients import bedrock

# send a ChatCompletion request to count to 100
response = bedrock.ChatCompletion.create(
    model='anthropic.claude-v2',
    messages=[
        {'role': 'user', 'content': 'Count to 100, with a comma between each number and no newlines. E.g., 1, 2, 3, ...'}
    ],
    stream=True
)

# create variables to collect the stream of chunks
collected_chunks = []
collected_messages = []
# iterate through the stream of events
for chunk in response:
    collected_chunks.append(chunk)  # save the event response
    chunk_message = chunk['choices'][0]['delta']  # extract the message
    print(chunk_message.get('content', ''), end='')  # print the message
    collected_messages.append(chunk_message)  # save the message
    

# print the full text received
full_reply_content = ''.join([m.get('content', '') for m in collected_messages])
print(f"Full conversation received: {full_reply_content}")

 Here is counting to 100 with commas and no newlines:

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100Full conversation received:  Here is counting to 100 with commas and no newlines:

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100
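The cell above collects the reply but does not record any timings. A minimal sketch of how the time to the first chunk and the total time could be measured is shown below. To keep it runnable without AWS access, it uses `fake_stream`, a hypothetical stand-in that sleeps briefly before each chunk; `time_stream` is likewise an illustrative helper, not part of easyllm, and the delays are not real Bedrock latencies.

```python
import time
from typing import Dict, Iterable, Iterator, Optional, Tuple


def fake_stream(pieces: Iterable[str], delay: float = 0.01) -> Iterator[Dict]:
    """Hypothetical stand-in for a streamed response: yields one chunk per piece."""
    for piece in pieces:
        time.sleep(delay)  # pretend the model needs time to generate each piece
        yield {"choices": [{"index": 0, "delta": {"content": piece}}]}


def time_stream(stream: Iterable[Dict]) -> Tuple[Optional[float], float, str]:
    """Return (seconds until first chunk, total seconds, full text)."""
    start = time.perf_counter()
    first_chunk_at = None
    parts = []
    for chunk in stream:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter() - start
        parts.append(chunk["choices"][0]["delta"].get("content", ""))
    total = time.perf_counter() - start
    return first_chunk_at, total, "".join(parts)


first, total, text = time_stream(fake_stream([" 1,", " 2,", " 3"]))
print(f"First chunk after {first:.2f}s, full reply after {total:.2f}s: {text!r}")
```

With a real request, the timer would need to start before the call that creates the streamed response, since the request may begin as soon as that call is made. The gap between the two numbers is roughly the waiting time a user is spared by streaming.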