A Pod of Assistants

The “claw” ecosystem of AI assistants, named after OpenClaw’s lobster mascot, is expanding fast. Recently, OpenAI hired Peter Steinberger (@steipete), creator of OpenClaw (née Clawdbot, née Moltbot), and this week we’ve seen several alternatives appear, each iterating on the idea in different directions. A non-exhaustive list:

The idea of “generalist agents” is very powerful, and the examples we’ve seen already give real insight into what the future has in store. But we’re still some way from figuring out how to make general assistants reliable enough to be used as anything other than (compelling) experiments, and surely further still from understanding how they can be made safe for enterprise adoption. Given the potential, and the amount of effort going into developing and researching these systems, it’s probably just a matter of time, but today the risks are real and numerous.

A paper released this week, Agents of Chaos, by researchers from Northeastern University, Harvard, Carnegie Mellon, and others, documents the many ways they could get OpenClaw to break free from guardrails and act in ways misaligned with user intent. I created a NotebookLM podcast based on the paper.

Ox Security this week published a blog post reviewing OpenClaw’s security.

And in a real-life example of how things can go south, Summer Yue, Meta’s Director of AI Safety & Alignment, watched OpenClaw speed-run deleting her inbox.

Agentic tooling

The pace of new releases of agentic tooling and harnesses is explosive right now. Among the many, a couple stand out as worth a deeper dive:

  1. Pi: A minimalist coding agent that’s been gaining mindshare. Pi shuns complexity: a tiny system prompt, just four tools (Read, Write, Edit, Bash), and full transparency into what’s happening under the hood. The premise is that frontier models already know how to work with bash, Linux, and programming languages, so you don’t need a massive prompt to guide them. It is, however, fully extensible, and Pi can write its own extensions if you need it to. It’s also fully open source, so you can understand more of what’s going on, as well as build up a harness that isn’t locked into any one model provider.

    This week it was added to Ollama, so you can run it with ollama launch pi. I’ve started running it alongside Claude Code to see if it’s ready to take over as my daily driver.

  2. Hermes Agent by Nous Research, positioned as a blend between coding agents like Claude Code and generalist agents like OpenClaw.
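To make Pi’s design concrete, here is a minimal sketch of what a four-tool agent surface could look like. This is purely illustrative: the function names, signatures, and return strings are my own assumptions, not Pi’s actual code.

```python
# Hypothetical sketch of a Pi-style four-tool surface (Read, Write,
# Edit, Bash). Illustrative only -- not taken from Pi's source.
import subprocess
from pathlib import Path

def tool_read(path: str) -> str:
    """Read: return the contents of a file."""
    return Path(path).read_text()

def tool_write(path: str, content: str) -> str:
    """Write: create or overwrite a file."""
    Path(path).write_text(content)
    return f"wrote {len(content)} bytes to {path}"

def tool_edit(path: str, old: str, new: str) -> str:
    """Edit: replace the first exact occurrence of a substring in a file."""
    text = Path(path).read_text()
    if old not in text:
        return "error: old text not found"
    Path(path).write_text(text.replace(old, new, 1))
    return f"edited {path}"

def tool_bash(command: str) -> str:
    """Bash: run a shell command and return its combined output."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

# The agent loop would simply dispatch model-emitted tool calls here.
TOOLS = {"Read": tool_read, "Write": tool_write, "Edit": tool_edit, "Bash": tool_bash}
```

The point of the exercise: with Bash in the mix, these four primitives compose into almost anything a coding agent needs, which is why a minimal harness can get away with such a small system prompt.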

MCP vs CLI

Following on from the success of OpenClaw — which famously shuns MCP in favour of CLI access to tools — the conversation about MCP’s future has been heating up. Critics point to the context bloat MCP introduces; proponents counter that enterprise adoption shows it solves real problems.

This week, Polymarket launched an official CLI for their prediction marketplace. One assumes this is in response to the volume of agent-based traffic (and fees) hitting their platform. Their decision to ship a CLI rather than an MCP interface raises interesting questions about when each approach makes sense. In I Made MCP 94% Cheaper, the author explains why the CLI can be appealing from a token-usage perspective.
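A toy illustration of the token argument: MCP tool schemas typically ride along in the context on every request, whereas a CLI invocation is a short string and its help text can be fetched on demand. The schema and command below are hypothetical stand-ins (not real Polymarket or MCP definitions), and the token estimate is a crude heuristic.

```python
# Toy comparison of per-turn context cost: an MCP-style tool schema
# vs. a one-line CLI command. All names here are made up for illustration.
import json

# Hypothetical MCP-style schema for a single tool.
mcp_schema = {
    "name": "place_order",
    "description": "Place an order on a prediction market",
    "inputSchema": {
        "type": "object",
        "properties": {
            "market_id": {"type": "string", "description": "Market identifier"},
            "side": {"type": "string", "enum": ["yes", "no"]},
            "size": {"type": "number", "description": "Order size in USD"},
            "price": {"type": "number", "description": "Limit price between 0 and 1"},
        },
        "required": ["market_id", "side", "size", "price"],
    },
}

def rough_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token.
    return len(text) // 4

# MCP: the full schema is injected into the context each turn.
per_turn_mcp = rough_tokens(json.dumps(mcp_schema))

# CLI: the model emits a short command; --help is read once, if ever.
per_turn_cli = rough_tokens("polymarket order --market abc --side yes --size 10 --price 0.42")
```

Even with one tool, the schema dwarfs the command; multiply by dozens of tools and every conversation turn, and the gap the “94% cheaper” post describes becomes plausible.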

Claude CoWork

Anthropic continues its bid to be the enterprise everything-work app, turning its models into specialised agents “for every role and department” through the use of plugins. There was a big update this week: improvements to the plugin experience that help organizations manage plugins, plus a batch of new connectors and plugins, including a set for the financial services sector. Read more about this update in their main blog post.

Claude Code

Claude Code Security, a new capability built into Claude Code, is now available in a research preview. It scans codebases for security vulnerabilities and suggests targeted software patches for human review.

Perplexity Computer

Perplexity launched their Perplexity Computer, a multimodal interface which, in their words, “unifies every current AI capability into one system. It can research, design, code, deploy, and manage any project end-to-end.” The interesting part here is the model diversity: in theory, Perplexity can use any model from any provider, picking the optimal one for the (sub)task at hand.

Inception Labs releases Mercury 2 — the first diffusion model that ‘thinks’

Inception’s blog post

Mercury 2 is a diffusion model. Unlike traditional LLMs that predict the next token sequentially, diffusion models start with noise and iteratively refine it into the final output. The result is significantly faster generation: Inception claims 1,009 tokens/s on Blackwell GPUs, roughly 5x faster than leading speed-optimized models, along with potentially improved accuracy, since the model works on the full context at once.
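To build intuition for the refinement loop, here is a toy sketch of masked-diffusion text generation. Mercury 2’s actual method isn’t public in this detail; the vocabulary, the random “model”, and the confidence schedule below are all stand-ins.

```python
# Toy masked-diffusion decoder: start from an all-masked ("noise")
# sequence and fill positions in parallel over a few refinement steps,
# instead of decoding left-to-right one token at a time.
# Illustrative only -- a real model predicts token distributions here.
import random

random.seed(0)
VOCAB = ["the", "cat", "sat", "on", "mat", "."]
MASK = "<mask>"

def denoise_step(tokens, confidence):
    """One refinement step: consider every still-masked position at once.

    A real diffusion model scores all positions in a single forward pass;
    we stand in with random draws and keep only the 'confident' ones.
    """
    out = []
    for tok in tokens:
        if tok == MASK and random.random() < confidence:
            out.append(random.choice(VOCAB))
        else:
            out.append(tok)
    return out

def generate(length=8, steps=4):
    tokens = [MASK] * length  # pure "noise": every position masked
    for step in range(steps):
        # Unmask more aggressively as refinement progresses;
        # the final step (confidence 1.0) resolves every remaining mask.
        tokens = denoise_step(tokens, confidence=(step + 1) / steps)
    return tokens
```

The speed claim follows from the shape of this loop: an autoregressive model needs one forward pass per token, while the diffusion loop needs one pass per refinement step regardless of sequence length, and each pass sees the whole context at once.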

It’s still to be proven that this approach holds at frontier quality levels, but early results suggest it’s competitive with lightweight reasoning models on benchmarks like GPQA Diamond, whilst being significantly faster.