Streaming

An agent and its underlying LLM generate output sequentially, one partial string (token) at a time, and each generated token is appended to form the final output. The longer the generation, the longer it takes for the complete result to arrive, so it is often useful to work with these intermediate results. They can be streamed as increments (deltas) of the full output.

This is useful when users may want to stop generation after reviewing partial results, or when agent-based application developers want to stream partial results to users in real time to make their applications feel more responsive.
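The idea behind delta streaming can be sketched in plain Python, independent of any LLM: a generator stands in for the model and yields string deltas one at a time, while the consumer displays each piece immediately and accumulates them into the final output. (The generator and its token list are illustrative stand-ins, not part of Ailoy.)

```python
def generate_tokens():
    # Stand-in for an LLM producing tokens sequentially.
    for token in ["Artificial ", "minds ", "dream ", "in ", "code."]:
        yield token


def consume_stream():
    accumulated = ""
    for delta in generate_tokens():
        print(delta, end="")   # show each partial result as soon as it arrives
        accumulated += delta   # build up the complete output from the deltas
    print()
    return accumulated


result = consume_stream()
```

The consumer never waits for the whole output: it can render (or abort on) each delta the moment it is produced, which is exactly what `run_delta()` enables for agents.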

How to Stream Deltas

The run() method from the Agent, which we’ve been using so far, returns complete messages one by one. In contrast, the run_delta() (Python, Rust) or runDelta() (Node.js, Web) method returns a sequence of partial results for each LLM generation step, referred to as message deltas.

import asyncio

import ailoy as ai


async def main():
    lm = await ai.LangModel.new_local("Qwen/Qwen3-0.6B")
    agent = ai.Agent(lm)

    async for resp in agent.run_delta("Please give me a short poem about AI."):
        if resp.delta.contents and isinstance(resp.delta.contents[0], ai.PartDelta.Text):
            # print text deltas without a line break
            print(resp.delta.contents[0].text, end="")
    print()


if __name__ == "__main__":
    asyncio.run(main())

Delta to Completed Message

You can also construct a complete message by accumulating message deltas sequentially.

A finish reason is provided once the message has been fully generated. Accumulate the message deltas until the finish reason appears, then call to_message() on the accumulated delta to produce the complete message.

import asyncio

import ailoy as ai


async def main():
    lm = await ai.LangModel.new_local("Qwen/Qwen3-0.6B")
    agent = ai.Agent(lm)

    GREEN = "\x1b[32m"
    RESET = "\x1b[0m"

    acc = ai.MessageDelta()  # the base of accumulation
    async for resp in agent.run_delta("Please give me a short poem about AI."):
        if resp.delta.contents and isinstance(resp.delta.contents[0], ai.PartDelta.Text):
            # print text deltas in green
            print(GREEN + resp.delta.contents[0].text + RESET, end="")
        acc += resp.delta  # accumulate the newly generated delta into the base

        # if finish_reason exists, the whole message has been generated
        if resp.finish_reason is not None:
            message = acc.to_message()
            if isinstance(message.contents[0], ai.Part.Text):
                print("\n\n" + message.contents[0].text)
            acc = ai.MessageDelta()  # re-initialize the base
    print()


if __name__ == "__main__":
    asyncio.run(main())
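The accumulation pattern above can be illustrated with a toy accumulator. This is not Ailoy's actual MessageDelta implementation; it is only a sketch of how `+=` on a delta base and a final conversion step could work together, using hypothetical ToyMessageDelta and TextDelta classes.

```python
from dataclasses import dataclass, field


@dataclass
class TextDelta:
    # A single partial piece of text, analogous to a text part delta.
    text: str = ""


@dataclass
class ToyMessageDelta:
    parts: list = field(default_factory=list)

    def __iadd__(self, other):
        # Merge the incoming delta's text into the accumulated buffer.
        for part in other.parts:
            if self.parts:
                self.parts[0].text += part.text
            else:
                self.parts.append(TextDelta(part.text))
        return self

    def to_message(self):
        # Produce the complete text once the finish reason has arrived.
        return "".join(p.text for p in self.parts)


acc = ToyMessageDelta()
for chunk in ["Silicon ", "thoughts ", "awaken."]:
    acc += ToyMessageDelta([TextDelta(chunk)])

final = acc.to_message()
```

As in the example above, each `+=` folds one delta into the base, and the full message is materialized only once, at the end.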