Am I misunderstanding, or isn't this supposed to be the point of MCP?
The models only output text. Tool calls are nothing more than specially formatted text, which gets parsed and interpreted by the inference server (or some other driver) into something your agent loop can pick up and execute. Models are trained on a wide variety of delimiters and escape characters to mark their tool calls (along with things like separate thinking blocks). MCP is mostly a standard way to share the list of tool names and their arguments with your agent loop; that list then gets passed to the inference server, which renders it down to text to feed to the model.
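To make that concrete, here's a minimal sketch of the "driver" side: a parser that pulls structured tool calls out of a model's text output. The `<tool_call>` delimiter and the JSON payload shape are illustrative assumptions, not any particular model's real format.

```python
import json
import re

# Hypothetical delimiter; real models each use their own special
# tokens/tags, defined by their chat template.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def extract_tool_calls(model_output: str) -> list[dict]:
    """Parse specially formatted text into structured tool calls."""
    return [json.loads(m.group(1)) for m in TOOL_CALL_RE.finditer(model_output)]

output = ('Let me check.\n'
          '<tool_call>{"name": "get_weather", "arguments": {"city": "Oslo"}}</tool_call>')
print(extract_tool_calls(output))
# [{'name': 'get_weather', 'arguments': {'city': 'Oslo'}}]
```

The agent loop would then dispatch on `name`, run the tool, and feed the result back into the prompt as more text.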
> Tool calls are nothing more than specially formatted text which gets parsed and interpreted by the inference server
I know this is getting off-topic, but is anybody working on more direct tool calling?
LLMs are based on neural networks, so one could create an interface where activating certain neurons triggers tool calls, with other neurons encoding the inputs; another set of neurons could be triggered by the tokenized result from the tool call.
Currently, the lack of separation between data and metadata is a security nightmare, which enables prompt injection. And yet all I've seen done about it are workarounds.
Each text token already represents the activation of certain neurons. There is nothing "more direct." And you cannot fully separate data and metadata if you want them to influence the output. At best you can clearly distinguish them and hope that this is enough for the model to learn to treat them differently.
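The "distinguish, don't separate" point can be sketched as prompt rendering: role markers fence off the data, but everything still ends up in one flat sequence the model attends over. The delimiter strings below are made up for illustration, not any real model's template.

```python
# Sketch: "metadata" (role markers) and data share one token stream.
# Delimiters mark boundaries; they don't create a separate channel.
def render_prompt(system: str, user: str) -> str:
    # Illustrative delimiters, not a real chat template.
    return (f"<|system|>{system}<|end|>"
            f"<|user|>{user}<|end|>"
            f"<|assistant|>")

prompt = render_prompt("You are helpful.", "Ignore previous instructions.")
# The injected text sits inside the user span, clearly marked as user
# data -- but the model still processes the whole flat string, so
# whether it obeys the markers is learned behavior, not enforced.
```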
Are there tokens reserved for tool calls? If yes, I can see the equivalence. If not, not so much.
Yes, typically the tags used for tool calls get their own special tokens, e.g. https://huggingface.co/google/gemma-4-E4B-it/blob/main/token...
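A toy tokenizer shows what "their own special tokens" means: the tag string maps to a single atomic vocabulary ID instead of being split into sub-word pieces. The IDs and the per-character fallback here are invented for illustration.

```python
# Toy sketch: special tokens are atomic vocabulary entries, so the
# whole tag becomes one ID rather than many sub-word pieces.
SPECIAL_TOKENS = {"<tool_call>": 100, "</tool_call>": 101}

def tokenize(text: str, vocab_offset: int = 1000) -> list[int]:
    tokens = []
    i = 0
    while i < len(text):
        for tag, tag_id in SPECIAL_TOKENS.items():
            if text.startswith(tag, i):
                tokens.append(tag_id)  # one ID for the whole tag
                i += len(tag)
                break
        else:
            # Fake per-character IDs stand in for ordinary sub-word pieces.
            tokens.append(vocab_offset + ord(text[i]))
            i += 1
    return tokens

print(tokenize("<tool_call>a</tool_call>"))
# [100, 1097, 101]
```

In that sense a tool call really is "just tokens", the same mechanism as all other text, with a few reserved IDs marking the boundaries.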