Ollama releases version 0.8 with support for streaming responses and tool calling

Morning Brief, updated 8 months ago, by 小马良


Ollama now supports streaming responses with tool calling. This lets any chat application stream content in real time while also calling tools.

Release notes: https://github.com/ollama/ollama/releases/tag/v0.8.0

Models that support tool calling:

Qwen 3
Devstral
Qwen2.5 and Qwen2.5-coder
Llama 3.1
Llama 4
More models that support tool calling

Simple tool calling example (weather) and web search example

cURL

Example of Ollama using a weather tool to answer the prompt "What is the weather today in Toronto?":

curl http://localhost:11434/api/chat -d '{
  "model": "qwen3",
  "messages": [
    {
      "role": "user",
      "content": "What is the weather today in Toronto?"
    }
  ],
  "stream": true,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The location to get the weather for, e.g. San Francisco, CA"
            },
            "format": {
              "type": "string",
              "description": "The format to return the weather in, e.g. 'celsius' or 'fahrenheit'",
              "enum": ["celsius", "fahrenheit"]
            }
          },
          "required": ["location", "format"]
        }
      }
    }
  ]
}'

Output

...
{
  "model": "qwen3",
  "created_at": "2025-05-27T22:54:57.641643Z",
  "message": {
    "role": "assistant",
    "content": "celsius"
  },
  "done": false
}
{
  "model": "qwen3",
  "created_at": "2025-05-27T22:54:57.673559Z",
  "message": {
    "role": "assistant",
    "content": "</think>"
  },
  "done": false
}
{
  "model": "qwen3",
  "created_at": "2025-05-27T22:54:58.100509Z",
  "message": {
    "role": "assistant",
    "content": "",
    "tool_calls": [
      {
        "function": {
          "name": "get_current_weather",
          "arguments": {
            "format": "celsius",
            "location": "Toronto"
          }
        }
      }
    ]
  },
  "done": false
}
...
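
The chunks above can be folded back into one reply on the client side. The following is a minimal Python sketch, assuming each chunk arrives as a single JSON object per line (the output above is pretty-printed for readability). The helper name `collect_stream` and the two sample lines are illustrative and not part of Ollama.

```python
import json


def collect_stream(lines):
    """Accumulate text and tool calls from newline-delimited JSON chunks."""
    content_parts = []
    tool_calls = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between chunks
        chunk = json.loads(line)
        message = chunk.get("message", {})
        content_parts.append(message.get("content", ""))
        tool_calls.extend(message.get("tool_calls", []))
        if chunk.get("done"):
            break  # the final chunk marks the end of the response
    return "".join(content_parts), tool_calls


# Two chunks shaped like the output above, one JSON object per line.
sample = [
    '{"model": "qwen3", "message": {"role": "assistant", "content": "</think>"}, "done": false}',
    '{"model": "qwen3", "message": {"role": "assistant", "content": "", "tool_calls": '
    '[{"function": {"name": "get_current_weather", '
    '"arguments": {"format": "celsius", "location": "Toronto"}}}]}, "done": false}',
]

text, calls = collect_stream(sample)
print(text)
print(calls[0]["function"]["name"])
```

Keeping content and tool calls in separate accumulators mirrors how the stream interleaves them: text can be shown as it arrives, while tool calls are acted on once they appear.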

Python

Install the latest version of the Ollama Python library:

pip install -U ollama

Example of Ollama using a math function:

from ollama import chat

# Define the Python function the model can call
def subtract_two_numbers(a: int, b: int) -> int:
  """
  Subtract two numbers

  Args:
    a (int): The first number
    b (int): The second number

  Returns:
    int: The difference of the two numbers
  """
  return a - b

messages = [{'role': 'user', 'content': 'what is three minus one?'}]

# With stream=True, chat() returns an iterator of response chunks
response = chat(
  model='qwen3',
  messages=messages,
  tools=[subtract_two_numbers], # Python SDK supports passing tools as functions
  stream=True
)

for chunk in response:
  # Print model content
  print(chunk.message.content, end='', flush=True)
  # Print the tool call
  if chunk.message.tool_calls:
    print(chunk.message.tool_calls)

Output

<think>
Okay, the user is asking ...
</think>

[ToolCall(function=Function(name='subtract_two_numbers', arguments={'a': 3, 'b': 1}))]
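
The printed tool call only names a function and its arguments; the application still has to run it. Below is a minimal sketch of that dispatch step, assuming a hypothetical registry `AVAILABLE_TOOLS` and helper `run_tool` (neither is part of the Ollama library). With the Python SDK, the name and arguments would come from `tool.function.name` and `tool.function.arguments`, as the output above suggests.

```python
from typing import Any, Callable, Dict


def subtract_two_numbers(a: int, b: int) -> int:
    """Hypothetical local tool matching the name in the output above."""
    return a - b


# Hypothetical registry mapping tool names to local callables.
AVAILABLE_TOOLS: Dict[str, Callable[..., Any]] = {
    "subtract_two_numbers": subtract_two_numbers,
}


def run_tool(name: str, arguments: Dict[str, Any]) -> Any:
    """Look up a tool by name and call it with the model-supplied arguments."""
    func = AVAILABLE_TOOLS.get(name)
    if func is None:
        # The model may request a tool that was never registered.
        raise KeyError(f"model requested unknown tool: {name}")
    return func(**arguments)


result = run_tool("subtract_two_numbers", {"a": 3, "b": 1})
print(result)
```

Looking the function up in an explicit registry, rather than by evaluating the name, keeps the model from invoking anything the application did not choose to expose.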

JavaScript

Install the latest version of the Ollama JavaScript library:

npm i ollama

Example of Ollama using a math function:

import ollama from 'ollama';

// Describe the tool schema
const addTool = {
    type: 'function',
    function: {
        name: 'addTwoNumbers',
        description: 'Add two numbers together',
        parameters: {
            type: 'object',
            required: ['a', 'b'],
            properties: {
                a: { type: 'number', description: 'The first number' },
                b: { type: 'number', description: 'The second number' }
            }
        }
    }
};

async function run(model) {
    const messages = [{ role: 'user', content: 'What is 2 plus 3?' }];
    console.log('Question:', messages[0].content);

    for await (const chunk of await ollama.chat({
        model: model,
        messages: messages,
        tools: [addTool],
        stream: true
    })) {
        if (chunk.message.tool_calls) {
            // Print tool calls from the response
            for (const tool of chunk.message.tool_calls) {
                console.log('Tool call:', tool);
            }
        } else {
            // Print content from model
            process.stdout.write(chunk.message.content);
        }
    }
}

run('qwen3').catch(console.error);

Output

Question: What is 2 plus 3?
<think>
Okay, the user is asking...
</think>
Tool call: {
  function: {
    name: "addTwoNumbers",
    arguments: {
      a: 2,
      b: 3,
    },
  },
}

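How tool parsing works in Ollama

Background

We built a new parser that focuses on understanding the structure of tool calls rather than simply looking for JSON.

Previously, when tools were passed to a model, the system had to wait until the entire output was generated and then parse it as JSON to decide whether it contained a tool call or ordinary content. Users had to wait for the full generation before seeing any streamed tokens. This approach was robust against malformed output, but it blocked streaming because a tool call could appear anywhere in the text.

Ollama supports a wide range of models. Some are trained with tool-specific tokens and some are not. The parsing logic has to stream user-facing content while also detecting, suppressing, and parsing tool call tokens.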

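Incremental parser

The new parser references each model's template directly to learn the prefix that marks a tool call. Ollama needs this in order to recognize tool calls and separate them from content.

When a model has not been trained directly on tool use (that is, without a prefix or tool tokens), it may still produce valid tool calls thanks to its broad knowledge. In that case, the parser can handle the partial prefixes the model emits and still separate tool calls from content correctly.

Some models that were trained with a prefix may nevertheless output a tool call without it. Empirically, this only happens at the start of the model's output. To handle it, the parser can fall back to parsing JSON as a tool call when it recognizes the start of a JSON value. If the JSON does not match the model's tool call format, the JSON is returned as is.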
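
The behavior described above can be illustrated with a toy parser. This is a simplified Python sketch, not Ollama's actual implementation: the class name `ToolCallStreamParser` is hypothetical, the `[TOOL_CALL]` prefix is borrowed from the sample output later in this post, the fallback is only triggered by a leading `{`, and closing markers are not handled.

```python
import json
from typing import List, Tuple


class ToolCallStreamParser:
    """Toy incremental parser: streams content, holds back tool-call text."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix    # tool-call marker, assumed to come from the model template
        self.pending = ""       # text that may still grow into the prefix
        self.tool_text = ""     # raw text collected once a tool call has started
        self.in_tool = False
        self.at_start = True    # no non-whitespace output seen yet

    def feed(self, token: str) -> str:
        """Consume one token and return the content that is safe to show now."""
        if self.in_tool:
            self.tool_text += token
            return ""
        self.pending += token
        if self.at_start:
            stripped = self.pending.lstrip()
            if not stripped:
                return ""  # only whitespace so far, keep waiting
            self.at_start = False
            if stripped.startswith("{"):
                # Fallback: bare JSON at the very start is a candidate tool call.
                self.in_tool, self.tool_text, self.pending = True, stripped, ""
                return ""
        idx = self.pending.find(self.prefix)
        if idx != -1:
            # Full prefix seen: text before it is content, the rest is tool-call text.
            content = self.pending[:idx]
            self.tool_text = self.pending[idx + len(self.prefix):]
            self.in_tool, self.pending = True, ""
            return content
        # Hold back the longest tail that could still turn out to be the prefix.
        keep = 0
        for n in range(min(len(self.prefix) - 1, len(self.pending)), 0, -1):
            if self.prefix.startswith(self.pending[-n:]):
                keep = n
                break
        cut = len(self.pending) - keep
        content, self.pending = self.pending[:cut], self.pending[cut:]
        return content

    def finish(self) -> Tuple[str, List[dict]]:
        """Return (leftover content, parsed tool calls) once generation ends."""
        if not self.in_tool:
            return self.pending, []
        try:
            parsed = json.loads(self.tool_text)
        except json.JSONDecodeError:
            return self.tool_text, []
        calls = parsed if isinstance(parsed, list) else [parsed]
        if all(isinstance(c, dict) and "name" in c for c in calls):
            return "", calls
        return self.tool_text, []  # valid JSON but not a tool call: hand it back as content


parser = ToolCallStreamParser("[TOOL_CALL]")
tokens = ["The ", "answer", " [TOOL", "_CALL]",
          ' [{"name": "get_conditions", ', '"arguments": {"city": "Sydney"}}]']
shown = "".join(parser.feed(t) for t in tokens)
leftover, calls = parser.finish()
print(repr(shown), calls)
```

The key design choice is the hold-back buffer: text is released as soon as it can no longer be part of the prefix, so ordinary content streams with at most a few characters of delay.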

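Accuracy

In some cases, a model refers back to a tool call it made earlier when the results are passed back to it. Previously, this caused multiple extra tool calls. With prefix matching and proper state management while parsing JSON to detect tool calls, the reliability of tool calling should improve.

Previously, model output like the following resulted in two tool calls instead of one: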

[TOOL_CALL] [{"name":"get_conditions","arguments":{"city":"Sydney"}}]
 To get the current weather conditions for Sydney, we can use the function `get_conditions`. 
 However, I don't have real-time data access. Let's assume that the API will return the information:

 [{"name":"get_conditions","arguments":{"city":"Sydney"}}]
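
To make the difference concrete, the sketch below runs two simplified scanners over the sample output above. Both functions (`naive_scan`, `prefix_scan`) are hypothetical illustrations written for this post and do not reproduce Ollama's parser; they only contrast "any JSON that looks like a tool call" with "JSON that directly follows the prefix".

```python
import json

OUTPUT = (
    '[TOOL_CALL] [{"name":"get_conditions","arguments":{"city":"Sydney"}}]\n'
    ' To get the current weather conditions for Sydney, we can use the function `get_conditions`. \n'
    " However, I don't have real-time data access. Let's assume that the API will return the information:\n"
    '\n'
    ' [{"name":"get_conditions","arguments":{"city":"Sydney"}}]'
)

decoder = json.JSONDecoder()


def looks_like_calls(value):
    """True for a non-empty list of objects that each carry a "name" key."""
    return (isinstance(value, list) and bool(value)
            and all(isinstance(v, dict) and "name" in v for v in value))


def naive_scan(text):
    """Treat every parseable JSON list of tool-call-shaped objects as tool calls."""
    calls, pos = [], 0
    while True:
        pos = text.find("[", pos)
        if pos == -1:
            return calls
        try:
            value, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            pos += 1  # not JSON here, keep scanning
            continue
        if looks_like_calls(value):
            calls.extend(value)
        pos = end


def prefix_scan(text, prefix="[TOOL_CALL]"):
    """Only accept JSON that directly follows the tool-call prefix."""
    calls, pos = [], 0
    while True:
        pos = text.find(prefix, pos)
        if pos == -1:
            return calls
        start = pos + len(prefix)
        while start < len(text) and text[start].isspace():
            start += 1  # skip whitespace between the prefix and the JSON
        try:
            value, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start
            continue
        if looks_like_calls(value):
            calls.extend(value)
        pos = end


print(len(naive_scan(OUTPUT)), len(prefix_scan(OUTPUT)))
```

The second JSON array in the sample is the model restating its earlier call in prose, so only the prefix-aware scanner ignores it.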

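Model Context Protocol (MCP)

With these improvements, developers can now stream chat content and tool calls when using Ollama with MCP (Model Context Protocol). Anecdotally, using a context window of 32k or more improves tool calling performance as well as the quality of tool call results.

Example of increasing the Ollama context window to 32k

Note that memory usage increases with a longer context window.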

cURL

curl -X POST "http://localhost:11434/api/chat" -d '{
  "model": "llama3.2",
  "messages": [
    {
      "role": "user",
      "content": "why is the sky blue?"
    }
  ],
  "options": {
    "num_ctx": 32000
  }
}'
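
The same request can be assembled from Python using only the standard library. This is a minimal sketch: the helper name `build_chat_request` and its defaults are illustrative, while the endpoint, model, and `num_ctx` option are taken from the cURL example above. The request is only built here; sending it requires a running Ollama server.

```python
import json
import urllib.request


def build_chat_request(prompt, model="llama3.2", num_ctx=32000,
                       host="http://localhost:11434"):
    """Build a POST request for /api/chat with a custom context window."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": {"num_ctx": num_ctx},  # context window size in tokens
    }
    return urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("why is the sky blue?")
print(request.full_url)

# To send it against a local server, something like this should work:
#   with urllib.request.urlopen(request) as response:
#       for line in response:
#           print(json.loads(line))
```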