How to stream Chat Completion requests
By default, when you request a completion, the entire completion is generated before being sent back in a single response.
If you're generating long completions, waiting for the response can take many seconds.
To get responses sooner, you can 'stream' the completion as it's being generated. This allows you to start printing or processing the beginning of the completion before the full completion is finished.
To stream completions, set stream=True when calling the chat completions or completions endpoints. This will return an object that streams back the response as data-only server-sent events. Extract chunks from the delta field rather than the message field.
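As a minimal sketch of the delta-versus-message distinction, the helper below pulls the text out of one streamed chunk and prints it as it arrives. The chunks are hand-written dicts shaped like the streamed output shown in this notebook; the names `delta_text` and `sample_chunks` are illustrative assumptions, not part of easyllm, and no API call is made.

```python
def delta_text(chunk):
    """Return the text carried by one streamed chunk, or '' if it has none."""
    # Streamed chunks carry a 'delta' dict instead of a full 'message' dict.
    delta = chunk["choices"][0]["delta"]
    # Role-only and empty deltas have no 'content' key.
    return delta.get("content", "")


# Hand-written stand-ins for streamed chunks (assumption: no live request here).
sample_chunks = [
    {"choices": [{"index": 0, "delta": {"role": "assistant"}}]},
    {"choices": [{"index": 0, "delta": {"content": " "}}]},
    {"choices": [{"index": 0, "delta": {"content": " Two"}}]},
    {"choices": [{"index": 0, "delta": {}}]},
]

pieces = []
for chunk in sample_chunks:
    text = delta_text(chunk)
    pieces.append(text)
    # flush=True makes each piece appear immediately instead of being buffered.
    print(text, end="", flush=True)
print()

streamed_text = "".join(pieces)
```

Using `dict.get` with a default keeps the loop free of special cases for the role-only first chunk and the empty final chunk.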
Downsides
Note that using stream=True in a production application makes it more difficult to moderate the content of the completions, as partial completions may be more difficult to evaluate.
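One possible way to soften this downside is to buffer the streamed text and check the accumulated text at sentence boundaries before releasing it. The sketch below is only an illustration under stated assumptions: `is_allowed` is a hypothetical stand-in (a toy blocklist, not a real moderation API), `moderated_stream` is an invented helper name, and the text pieces are hand-written rather than taken from a live stream.

```python
def is_allowed(text):
    """Hypothetical stand-in for a real moderation check (toy blocklist)."""
    blocklist = {"forbidden"}
    return not any(word in text.lower() for word in blocklist)


def moderated_stream(pieces, boundary_chars=".!?\n"):
    """Yield buffered text at sentence boundaries.

    Stops at the first boundary where the accumulated text fails the check.
    """
    seen = ""     # everything received so far
    pending = ""  # text received but not yet released
    for piece in pieces:
        seen += piece
        pending += piece
        if piece and piece[-1] in boundary_chars:
            if not is_allowed(seen):
                return  # stop without releasing the pending text
            yield pending
            pending = ""
    # Release trailing text that did not end on a boundary, if it passes.
    if pending and is_allowed(seen):
        yield pending


safe = list(moderated_stream(["Hello", " there", ".", " All", " good"]))
blocked = list(moderated_stream(["Fine", ".", " This is", " forbidden", "."]))
```

The trade-off is latency: text is held back until a boundary is reached, so the first visible output arrives later than with unbuffered streaming.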
Example code
Below, this notebook shows:
- What a typical chat completion response looks like
- What a streaming chat completion response looks like
- How much time is saved by streaming a chat completion
# imports
import easyllm # for API calls
import time # for measuring time duration of API calls
1. What a typical chat completion response looks like
With a typical ChatCompletions API call, the response is first computed and then returned all at once.
from easyllm.clients import huggingface
# set the prompt builder to llama2
huggingface.prompt_builder = "llama2"
# record the time before the request is sent
start_time = time.time()
# send a ChatCompletion request to count to 100
response = huggingface.ChatCompletion.create(
    model="meta-llama/Llama-2-70b-chat-hf",
    messages=[
        {'role': 'user', 'content': 'Count to 100, with a comma between each number and no newlines. E.g., 1, 2, 3, ...'}
    ],
)
# calculate the time it took to receive the response
response_time = time.time() - start_time
# print the time delay and text received
print(f"Full response received {response_time:.2f} seconds after request")
print(f"Full response received:\n{response}")
Full response received 0.12 seconds after request
Full response received:
{'id': 'hf-JhxbFCGVUW', 'object': 'chat.completion', 'created': 1691129826, 'model': 'meta-llama/Llama-2-70b-chat-hf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': ' Sure! Here it is:\n\n1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100'}, 'finish_reason': 'eos_token'}], 'usage': {'prompt_tokens': 25, 'completion_tokens': 400, 'total_tokens': 425}}
The reply can be extracted with response['choices'][0]['message'].
The content of the reply can be extracted with response['choices'][0]['message']['content'].
reply = response['choices'][0]['message']
print(f"Extracted reply: \n{reply}")
reply_content = response['choices'][0]['message']['content']
print(f"Extracted content: \n{reply_content}")
Extracted reply:
{'role': 'assistant', 'content': ' Sure! Here it is:\n\n1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100'}
Extracted content:
 Sure! Here it is:

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100
2. How to stream a chat completion
With a streaming API call, the response is sent back incrementally in chunks via an event stream. In Python, you can iterate over these events with a for loop.
Let's see what it looks like:
from easyllm.clients import huggingface
huggingface.prompt_builder = "llama2"
# a ChatCompletion request
response = huggingface.ChatCompletion.create(
    model="meta-llama/Llama-2-70b-chat-hf",
    messages=[
        {'role': 'user', 'content': "What's 1+1? Answer in one word."}
    ],
    stream=True  # this time, we set stream=True
)
for chunk in response:
    print(chunk)
{'id': 'hf--G5Fhg3YMu', 'object': 'chat.completion.chunk', 'created': 1691129830, 'model': 'meta-llama/Llama-2-70b-chat-hf', 'choices': [{'index': 0, 'delta': {'role': 'assistant'}}]}
{'id': 'hf--G5Fhg3YMu', 'object': 'chat.completion.chunk', 'created': 1691129830, 'model': 'meta-llama/Llama-2-70b-chat-hf', 'choices': [{'index': 0, 'delta': {'content': ' '}}]}
{'id': 'hf--G5Fhg3YMu', 'object': 'chat.completion.chunk', 'created': 1691129830, 'model': 'meta-llama/Llama-2-70b-chat-hf', 'choices': [{'index': 0, 'delta': {'content': ' Two'}}]}
{'id': 'hf--G5Fhg3YMu', 'object': 'chat.completion.chunk', 'created': 1691129830, 'model': 'meta-llama/Llama-2-70b-chat-hf', 'choices': [{'index': 0, 'delta': {}}]}
As you can see above, streaming responses have a delta field rather than a message field. delta can hold things like:
- a role token (e.g., {"role": "assistant"})
- a content token (e.g., {"content": "\n\n"})
- nothing (e.g., {}), when the stream is over
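The three kinds of delta listed above can be folded back into a single message dict. The following is a minimal sketch, assuming deltas are plain dicts as in the printed chunks; `merge_deltas` is an illustrative helper name, not an easyllm function.

```python
def merge_deltas(deltas):
    """Fold a sequence of delta dicts into one message dict."""
    message = {"role": None, "content": ""}
    for delta in deltas:
        if "role" in delta:
            # A role token normally arrives once, in the first chunk.
            message["role"] = delta["role"]
        if "content" in delta:
            # Content tokens are concatenated in arrival order.
            message["content"] += delta["content"]
        # An empty delta ({}) changes nothing; it only marks the end.
    return message


# Deltas taken from the streamed chunks printed above.
deltas = [{"role": "assistant"}, {"content": " "}, {"content": " Two"}, {}]
message = merge_deltas(deltas)
print(message)
```

The result has the same shape as the message field of a non-streamed response, so downstream code can treat both cases the same way.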
3. How much time is saved by streaming a chat completion
Now let's ask meta-llama/Llama-2-70b-chat-hf to count again, this time to 50, and see how long it takes.
import time
from easyllm.clients import huggingface
huggingface.prompt_builder = "llama2"
# record the time before the request is sent
start_time = time.time()
# send a ChatCompletion request to count to 50
response = huggingface.ChatCompletion.create(
    model="meta-llama/Llama-2-70b-chat-hf",
    messages=[
        {'role': 'user', 'content': 'Count to 50, with a comma between each number and no newlines. E.g., 1, 2, 3, ...'}
    ],
    stream=True  # again, we set stream=True
)
# create variables to collect the stream of chunks
collected_chunks = []
collected_messages = []
# iterate through the stream of events
for chunk in response:
    chunk_time = time.time() - start_time  # calculate the time delay of the chunk
    collected_chunks.append(chunk)  # save the event response
    chunk_message = chunk['choices'][0]['delta']  # extract the message
    collected_messages.append(chunk_message)  # save the message
    print(f"Message received {chunk_time:.2f} seconds after request: {chunk_message}")  # print the delay and text
# print the time delay and text received
print(f"Full response received {chunk_time:.2f} seconds after request")
full_reply_content = ''.join([m.get('content', '') for m in collected_messages])
print(f"Full conversation received: {full_reply_content}")
Message received 0.13 seconds after request: {'role': 'assistant'} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': ' Sure'} Message received 0.13 seconds after request: {'content': '!'} Message received 0.13 seconds after request: {'content': ' Here'} Message received 0.13 seconds after request: {'content': ' it'} Message received 0.13 seconds after request: {'content': ' is'} Message received 0.13 seconds after request: {'content': ':'} Message received 0.13 seconds after request: {'content': '\n'} Message received 0.13 seconds after request: {'content': '\n'} Message received 0.13 seconds after request: {'content': '1'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '2'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '3'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '4'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '5'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '6'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '7'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': 
'8'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '9'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '1'} Message received 0.13 seconds after request: {'content': '0'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '1'} Message received 0.13 seconds after request: {'content': '1'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '1'} Message received 0.13 seconds after request: {'content': '2'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '1'} Message received 0.13 seconds after request: {'content': '3'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '1'} Message received 0.13 seconds after request: {'content': '4'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '1'} Message received 0.13 seconds after request: {'content': '5'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '1'} Message received 0.13 seconds after request: {'content': '6'} Message received 0.13 seconds after request: {'content': ','} Message 
received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '1'} Message received 0.13 seconds after request: {'content': '7'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '1'} Message received 0.13 seconds after request: {'content': '8'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '1'} Message received 0.13 seconds after request: {'content': '9'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '2'} Message received 0.13 seconds after request: {'content': '0'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '2'} Message received 0.13 seconds after request: {'content': '1'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '2'} Message received 0.13 seconds after request: {'content': '2'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '2'} Message received 0.13 seconds after request: {'content': '3'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '2'} Message received 0.13 seconds after request: {'content': '4'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 
seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '2'} Message received 0.13 seconds after request: {'content': '5'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '2'} Message received 0.13 seconds after request: {'content': '6'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '2'} Message received 0.13 seconds after request: {'content': '7'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '2'} Message received 0.13 seconds after request: {'content': '8'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '2'} Message received 0.13 seconds after request: {'content': '9'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '3'} Message received 0.13 seconds after request: {'content': '0'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '3'} Message received 0.13 seconds after request: {'content': '1'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '3'} Message received 0.13 seconds after request: {'content': '2'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after 
request: {'content': ' '} Message received 0.13 seconds after request: {'content': '3'} Message received 0.13 seconds after request: {'content': '3'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '3'} Message received 0.13 seconds after request: {'content': '4'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '3'} Message received 0.13 seconds after request: {'content': '5'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '3'} Message received 0.13 seconds after request: {'content': '6'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '3'} Message received 0.13 seconds after request: {'content': '7'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '3'} Message received 0.13 seconds after request: {'content': '8'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '3'} Message received 0.13 seconds after request: {'content': '9'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '4'} Message received 0.13 seconds after request: {'content': '0'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: 
{'content': ' '} Message received 0.13 seconds after request: {'content': '4'} Message received 0.13 seconds after request: {'content': '1'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '4'} Message received 0.13 seconds after request: {'content': '2'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '4'} Message received 0.13 seconds after request: {'content': '3'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '4'} Message received 0.13 seconds after request: {'content': '4'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '4'} Message received 0.13 seconds after request: {'content': '5'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '4'} Message received 0.13 seconds after request: {'content': '6'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '4'} Message received 0.13 seconds after request: {'content': '7'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '4'} Message received 0.13 seconds after request: {'content': '8'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' 
'} Message received 0.13 seconds after request: {'content': '4'} Message received 0.13 seconds after request: {'content': '9'} Message received 0.13 seconds after request: {'content': ','} Message received 0.13 seconds after request: {'content': ' '} Message received 0.13 seconds after request: {'content': '5'} Message received 0.13 seconds after request: {'content': '0'} Message received 0.13 seconds after request: {} Full response received 0.13 seconds after request Full conversation received: Sure! Here it is: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50
Time comparison
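In the recorded output above, the non-streaming request reported its full response after 0.12 seconds, and the streaming request reported its first chunk after 0.13 seconds. Request times will vary depending on load and other stochastic factors.
The benefit of streaming is that the beginning of the completion can be printed or processed as soon as the first chunk arrives, rather than only after the whole completion has been generated.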
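To make this kind of comparison repeatable, the time to the first chunk and the total time can be measured in one pass over the stream. The sketch below is an illustration under stated assumptions: `time_stream` and `fake_stream` are hypothetical helpers, and `fake_stream` simulates a streaming response with `time.sleep` instead of calling a model, so its numbers say nothing about real latency.

```python
import time


def time_stream(stream, clock=time.monotonic):
    """Consume a stream and return (time_to_first_chunk, total_time, chunk_count)."""
    start = clock()
    first = None
    count = 0
    for _ in stream:
        now = clock()
        if first is None:
            first = now - start  # delay until the first chunk arrived
        count += 1
    total = clock() - start  # delay until the stream was exhausted
    return first, total, count


def fake_stream(n_chunks, delay):
    """Hypothetical stand-in for a streaming response (no API call)."""
    for i in range(n_chunks):
        time.sleep(delay)  # simulate generation time per chunk
        yield {"choices": [{"index": 0, "delta": {"content": str(i)}}]}


first, total, count = time_stream(fake_stream(n_chunks=5, delay=0.01))
print(f"First chunk after {first:.2f}s, all {count} chunks after {total:.2f}s")
```

With a real request, the streaming response object would be passed in place of `fake_stream(...)`. `time.monotonic` is used rather than `time.time` because it is not affected by system clock adjustments.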