Thinking
Thinking (or reasoning) is an advanced AI capability that enables a model to tackle complex tasks through explicit, step-by-step logical inference. Thinking models break problems down into smaller reasoning steps rather than solving them in a single pass. Compared to direct answer generation, this approach offers two key advantages:
- Improved problem-solving capability
- Transparent and traceable intermediate thinking steps
This functionality is particularly well-suited for domains that require multi-step thinking, such as scientific analysis, legal interpretation, or strategic decision-making. However, since the thinking step requires additional computation and memory usage, it can increase latency and overall resource consumption. Therefore, it’s important to use thinking only when necessary, based on the complexity of your use case.
Small models like Qwen3-0.6B are generally too limited to perform effective reasoning. For complex tasks, we recommend using a sufficiently large model.
Hybrid Thinking Models
Some modern models, such as Qwen3, are designed as hybrid thinking models. These models can switch between standard (direct-generation) and thinking modes based on configuration settings.
Ailoy fully supports this hybrid capability. You can explicitly turn the thinking process on or off via the `think_effort` option. When enabled, the model engages in structured, step-by-step inference, producing a detailed "thinking trace" before the final answer.
How to Enable Thinking
To enable thinking, simply specify `think_effort` in the inference config when running the agent. If you pass the inference config under the `inference` key of the agent config, it is forwarded to the internal LangModel.
Note that enabling thinking can significantly increase the time needed to generate a complete message, which can make your application feel less responsive. In that case, it's better to run the agent with streaming.
Python

```python
import asyncio

import ailoy as ai


async def main():
    lm = await ai.LangModel.new_local("Qwen/Qwen3-4B")
    agent = ai.Agent(lm)

    GREEN = "\x1b[32m"
    RESET = "\x1b[0m"

    async for resp in agent.run_delta(
        "Please solve this simultaneous equation for me: x+y=3, 4x+3y=12",
        config=ai.AgentConfig(inference=ai.InferenceConfig(think_effort="enable")),
    ):
        if resp.delta.thinking:
            # Thinking text is printed in green.
            print(GREEN + resp.delta.thinking + RESET, end="")
        if resp.delta.contents and isinstance(
            resp.delta.contents[0], ai.PartDelta.Text
        ):
            print(resp.delta.contents[0].text, end="")


if __name__ == "__main__":
    asyncio.run(main())
```
JavaScript

```javascript
import * as ai from "ailoy-node";

async function main() {
  const lm = await ai.LangModel.newLocal("Qwen/Qwen3-4B");
  const agent = new ai.Agent(lm);

  const GREEN = "\x1b[32m";
  const RESET = "\x1b[0m";

  for await (const resp of agent.runDelta(
    "Please solve this simultaneous equation for me: x+y=3, 4x+3y=12",
    { inference: { thinkEffort: "enable" } }
  )) {
    if (resp.delta.thinking !== undefined) {
      // Thinking text is printed in green.
      process.stdout.write(GREEN + resp.delta.thinking + RESET);
    }
    if (
      resp.delta.contents.length !== 0 &&
      resp.delta.contents[0].type === "text"
    ) {
      process.stdout.write(resp.delta.contents[0].text);
    }
  }
}

main().catch((err) => {
  console.error("Error:", err);
});
```
JavaScript (Web)

```javascript
import * as ai from "ailoy-web";

async function main() {
  const lm = await ai.LangModel.newLocal("Qwen/Qwen3-4B", {
    progressCallback: console.log,
  });
  const agent = new ai.Agent(lm);

  const GREEN = "\x1b[32m";
  const RESET = "\x1b[0m";

  for await (const resp of agent.runDelta(
    "Please solve this simultaneous equation for me: x+y=3, 4x+3y=12",
    { inference: { thinkEffort: "enable" } }
  )) {
    if (resp.delta.thinking !== undefined) {
      // Thinking text is printed in green.
      console.log(GREEN + resp.delta.thinking + RESET);
    }
    if (
      resp.delta.contents.length !== 0 &&
      resp.delta.contents[0].type === "text"
    ) {
      console.log(resp.delta.contents[0].text);
    }
  }
}

main().catch((err) => {
  console.error("Error:", err);
});
```
Adjusting Thinking Effort
The `think_effort` option determines how much reasoning the model performs before giving an answer. Increasing it enhances logical, systematic thinking, but also consumes more time and resources. You can think of it as a trade-off between intelligence and budget or responsiveness.
In short:
- Higher `think_effort` → smarter but slower.
- Lower `think_effort` → faster but shallower.

`think_effort` can take one of the following values in the config: `"disable"`, `"enable"`, `"low"`, `"medium"`, `"high"`.
However, not all models are hybrid thinking models, and not all are capable of fine-grained adjustment of thinking effort. Depending on the characteristics of the model, `think_effort` may therefore be applied differently from the specified value, without an explicit warning.
For example:
- OpenAI's `gpt-4` model does not support thinking, so `"disable"` is applied regardless of the specified value.
- OpenAI's `o4` model supports `"low"`, `"medium"`, and `"high"` but not `"disable"`; if `"disable"` is specified, it is applied as `"low"`.
- The `grok-4-fast` model supports only `"low"` and `"high"`; if `"medium"` is specified, it is applied as `"low"`.
- Qwen3 models do not support adjusting `think_effort`, so any value other than `"disable"` is equivalent to `"enable"`.
In other words, `think_effort` is applied as the closest supported value to what you specified, but it's best to check the model's specification for predictable behavior.
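To make this clamping behavior concrete, here is a small plain-Python sketch of how a requested `think_effort` could be mapped onto a model's supported values. This is an illustration only, not Ailoy's actual implementation; the `resolve_think_effort` function name and the fallback ordering are assumptions chosen to mirror the examples above.

```python
# Illustrative only: NOT Ailoy's implementation. The fallback ordering here
# is an assumption chosen to mirror the model examples above.

LEVELS = ["low", "medium", "high"]  # graded efforts, weakest first


def resolve_think_effort(requested: str, supported: set[str]) -> str:
    """Map a requested think_effort to the closest value a model supports."""
    if requested in supported:
        return requested
    if requested == "disable":
        # The model cannot turn thinking off: fall back to its weakest level.
        for level in LEVELS + ["enable"]:
            if level in supported:
                return level
        return "disable"
    if requested == "enable":
        # The model expects a graded level: pick a mid-range default.
        for level in ("medium", "low", "high"):
            if level in supported:
                return level
        return "disable"
    # A graded level the model doesn't offer: pick the nearest supported one.
    graded = sorted((e for e in supported if e in LEVELS), key=LEVELS.index)
    if graded:
        idx = LEVELS.index(requested)
        return min(graded, key=lambda e: abs(LEVELS.index(e) - idx))
    # On/off-only models (e.g. hybrid models): any graded level means "enable".
    return "enable" if "enable" in supported else "disable"
```

Mirroring the examples above: `"disable"` on a model that supports only graded levels resolves to `"low"`; `"medium"` on a model that supports only `"low"` and `"high"` resolves to `"low"`; and any graded level on an on/off-only model resolves to `"enable"`.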