How to choose the best LLM using R and vitals | InfoWorld


How to choose the best LLM using R and vitals 6 Apr 2026, 4:14 pm

Is your generative AI application giving the responses you expect? Are there less expensive large language models—or even free ones you can run locally—that might work well enough for some of your tasks?

Answering questions like these isn’t always easy. Model capabilities seem to change every month. And, unlike conventional computer code, LLMs don’t always give the same answer twice. Running and rerunning tests can be tedious and time consuming.

Fortunately, there are frameworks to help automate LLM tests. These LLM “evals,” as they’re known, are a bit like unit tests on more conventional computer code. But unlike unit tests, evals need to understand that LLMs can answer the same question in different ways, and that more than one response may be correct. In other words, this type of testing often requires the ability to analyze flexible criteria, not simply check if a given response equals a specific value.

The vitals package, based on Python’s Inspect framework, brings automated LLM evals to the R programming language. Vitals was designed to integrate with the ellmer R package, so you can use them together to evaluate prompts, AI applications, and how different LLMs affect both performance and cost. In one case, it helped show that AI agents often ignore information in plots when it goes against their expectations, according to package author Simon Couch, a senior software engineer at Posit. Couch said over email that the experiment, done using a set of vitals evaluations dubbed bluffbench, “really hit home for some folks.”

Couch is also using the package to measure how well different LLMs write R code.

Vitals setup

You can install the vitals package from CRAN or, if you want the development version, from GitHub with pak::pak("tidyverse/vitals"). As of this writing, you’ll need the dev version to access several features used in examples for this article, including a dedicated function for extracting structured data from text.

Vitals uses a Task object to create and run evals. Each task needs three pieces: a dataset, a solver, and a scorer.

Dataset

A vitals dataset is a data frame with information about what you want to test. That data frame needs at least two columns:

  • input: The request you want to send to the LLM.
  • target: How you expect the LLM to respond.

The vitals package includes a sample dataset called are. That data frame has a few more columns, such as id (which is never a bad idea to include in your data), but these are optional.
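To see what a complete dataset looks like, you can inspect are once vitals is loaded. (The exact set of optional columns may vary by package version, so check the structure rather than relying on this description.)

```r
library(vitals)

# Inspect the structure of the package's sample dataset; input and
# target are the required columns, the rest are optional metadata
str(are)
```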

As Couch told posit::conf attendees a few months ago, one of the easiest ways to create your own input-target pairs for a dataset is to type what you want into a spreadsheet. Set up spreadsheet columns with “input” and “target,” add what you want, then read that spreadsheet into R with a package like googlesheets4 or rio.
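With googlesheets4, for instance, reading the sheet might look like the sketch below. The sheet URL is a placeholder, and this assumes your first spreadsheet row holds the input and target column names:

```r
library(googlesheets4)

# Hypothetical sheet URL; read_sheet() returns a data frame whose
# columns come from the spreadsheet's header row
my_dataset <- read_sheet("https://docs.google.com/spreadsheets/d/YOUR_SHEET_ID")
```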


Example of a spreadsheet to create a vitals dataset with input and target columns.

Sharon Machlis

Below is the R code for three simple queries I’ll use to test out vitals. The code creates the data frame directly, so you can copy and paste it to follow along. This dataset asks an LLM to write R code for a bar chart, determine the sentiment of some text, and create a haiku.

my_dataset <- data.frame(
  input = c(
    "Write R code using ggplot2 to create a bar chart with horizontal bars sorted in descending order from this data: sample_data <- data.frame(Category = c('A', 'B', 'C'), Value = c(38, 57, 44))",
    "What is the sentiment of the following text? Answer with a single word: positive, negative, or mixed. This desktop computer has a better processor and can handle much more demanding tasks such as running LLMs locally. However, it's also noisy and comes with a lot of bloatware.",
    "Write me a haiku about winter"
  ),
  target = c(
    'Example solution: ```library(ggplot2)\r\nlibrary(scales)\r\nsample_data <- data.frame(Category = c("A", "B", "C"), Value = c(38, 57, 44))\r\nggplot(sample_data, aes(x = reorder(Category, Value), y = Value)) +\r\n  geom_col() +\r\n  coord_flip()``` Any code that produces a bar chart with horizontal bars sorted in descending order is acceptable.',
    "mixed",
    "A three-line poem about winter that follows the haiku form of five, seven, and five syllables per line"
  )
)

Next, I’ll load my libraries and set a logging directory for when I run evals, since the package will suggest you do that as soon as you load it:

library(vitals)
library(ellmer)
vitals_log_dir_set("./logs")

Here’s the start of setting up a new Task with the dataset, although this code will throw an error without the other two required arguments of solver and scorer.

my_task <- Task$new(dataset = my_dataset)

If you’d rather use a ready-made example, you can use dataset = are with its seven R tasks.

It can take some effort to come up with good sample targets. The classification example was simple, since I wanted a single-word response, mixed. But other queries can have more free-form responses, such as writing code or summarizing text. Don’t rush through this part—if you want your automated “judge” to grade accurately, it pays to design your acceptable responses carefully.

Solver

The second part of the task, the solver, is the R code that sends your queries to an LLM. For simple queries, you can usually just wrap an ellmer chat object with the vitals generate() function. If your input is more complex, such as needing to call tools, you may need a custom solver. For this part of the demo, I’ll use a standard solver with generate(). Later, we’ll add a second solver with generate_structured().

It helps to be familiar with the ellmer R package when using vitals. Below is an example of using ellmer without the vitals package, with my_dataset$input[1], the first query in my dataset data frame, as my prompt. This code returns an answer to the question but doesn’t evaluate it.

Note: You’ll need an OpenAI key if you want to run this specific code. Or you can change the model (and API key) to any other LLM from a provider ellmer supports. Make sure to store any needed API keys for other providers. For the LLM, I chose OpenAI’s least expensive current model, GPT-5 nano.
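ellmer looks for provider keys in environment variables; for OpenAI, that’s OPENAI_API_KEY. It’s best to store the key in your .Renviron file so it never appears in a script, but for a quick test, something like this works (the key shown is a placeholder):

```r
# Placeholder key; prefer .Renviron so the key never appears in your
# script or console history
Sys.setenv(OPENAI_API_KEY = "sk-your-key-here")
```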

my_chat <- chat_openai(model = "gpt-5-nano")
my_chat$chat(my_dataset$input[1])

You can turn that my_chat ellmer chat object into a vitals solver by wrapping it in the generate() function:

# This code won't run yet without the task's third required argument, a scorer
my_task <- Task$new(
  dataset = my_dataset,
  solver = generate(my_chat)
)

The Task object knows to use the input column from your dataset as the question to send to the LLM. If the dataset holds more than one query, generate() handles processing them.

Scorer

Finally, we need a scorer. As the name implies, the scorer grades the result. Vitals has several different types of scorer. Two of them use an LLM to evaluate results, sometimes referred to as “LLM as a judge.” One of vitals’ LLM-as-a-judge options, model_graded_qa(), checks how well the solver answered a question. The other, model_graded_fact(), “determines whether a solver includes a given fact in its response,” according to the documentation. Other scorers look for string patterns, such as detect_exact() and detect_includes().

Some research shows that LLMs can do a decent job in evaluating results. However, like most things involving generative AI, I don’t trust LLM evaluations without human oversight.

Pro tip: If you’re testing a small, less capable model in your eval, you don’t want that model also grading the results. Vitals defaults to using the same LLM you’re testing as the scorer, but you can specify another LLM to be your judge. I usually want a top-tier frontier LLM for my judge unless the scoring is straightforward.

Here’s what the syntax might look like if we were using Claude Sonnet as a model_graded_qa() scorer:

scorer = model_graded_qa(scorer_chat = chat_anthropic(model = "claude-sonnet-4-6"))

Note that this scorer defaults to setting partial credit to FALSE—either the answer is 100% accurate or it’s wrong. However, you can choose to allow partial credit if that makes sense for your task, by adding the argument partial_credit = TRUE:

scorer = model_graded_qa(partial_credit = TRUE, scorer_chat = chat_anthropic(model = "claude-sonnet-4-6"))

I started with Sonnet 4.5 as my scorer, without partial credit. It got one of the gradings wrong, giving a correct score to R code that did most things right for my bar chart but didn’t sort by descending order. I also tried Sonnet 4.6, released just this week, but it also got one of the grades wrong.

Opus 4.6 is more capable than Sonnet, but it’s also about 67% pricier at $5 per million tokens input and $25 per million output. Which model and provider you choose depends in part on how much testing you’re doing, how much you like a specific LLM for understanding your work (Claude has a good reputation for writing R code), and how important it is to accurately evaluate your task. Keep an eye on your usage if cost is an issue. If you’d rather not spend any money following the examples in this tutorial, and you don’t mind using less capable LLMs, check out GitHub Models, which has a free tier. ellmer supports GitHub Models with chat_github(), and you can also see available LLMs by running models_github().


Below, I’ve added model_graded_qa() scoring to my_task, and I also included a name for the task. However, I’d suggest not adding a name to your task if you plan to clone it later to try a different model. Cloned tasks keep their original name, and as of this writing, there’s no way to change that.

my_task <- Task$new(
  dataset = my_dataset,
  solver = generate(my_chat),
  scorer = model_graded_qa(
    scorer_chat = chat_anthropic(model = "claude-sonnet-4-6")
  ),
  name = "my_eval"
)

Now, my task is ready to use.

Run your first vitals task

You execute a vitals task with the task object’s $eval() method:

my_task$eval()

The eval() method launches five separate methods: $solve(), $score(), $measure(), $log(), and $view(). After it finishes running, a built-in log viewer should pop up. Click on the hyperlinked task to see more details:


Details on a task run in vitals’ built-in viewer. You can click each sample for additional info.

Sharon Machlis

“C” means correct and “I” stands for incorrect; there could also have been a “P” for partially correct if I had allowed partial credit.

If you want to see a log file in that viewer later, you can invoke the viewer again with vitals_view("your_log_directory"). The logs are just JSON files, so you can view them in other ways, too.
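For instance, you could read a log directly with jsonlite. The code below assumes the log directory set earlier; vitals generates its own log file names, so we just grab the most recent one:

```r
library(jsonlite)

# List the JSON logs vitals wrote and inspect the most recent one
log_files <- list.files("./logs", pattern = "\\.json$", full.names = TRUE)
log_data <- fromJSON(log_files[length(log_files)])
str(log_data, max.level = 1)
```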

You’ll probably want to run an eval multiple times, not just once, to feel more confident that an LLM is reliable and didn’t just get lucky. You can set multiple runs with the epochs argument:

my_task$eval(epochs = 10)

The accuracy of bar chart code on one of my 10-epoch runs was 70%—which may or may not be “good enough.” Another time, that rose to 90%. If you want a true measure of an LLM’s performance, especially when it’s not scoring 100% on every run, you’ll want a good sample size; margin of error can be significant with just a few tests. (For a deep dive into statistical analysis of vitals results, see the package’s analysis vignette.)
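A quick way to see how wide that margin is: base R’s binom.test() returns an exact confidence interval for a proportion. For 7 correct answers out of 10 epochs, the 95% interval spans roughly 35% to 93%, so a single 70% result tells you less than it seems to:

```r
# Exact (Clopper-Pearson) 95% confidence interval for 7 successes
# out of 10 trials
binom.test(7, 10)$conf.int
```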

It cost about 14 cents to use Sonnet 4.6 as a judge versus 27 cents for Opus 4.6 on 11 total epoch runs of three queries each. (Not all these queries even needed an LLM for evaluation, though, if I were willing to separate the demo into multiple task objects. The sentiment analysis was just looking for “Mixed,” which is simpler scoring.)

The vitals package includes a function that can format the results of a task’s evaluation as a data frame: my_task$get_samples(). If you like this formatting, save the data frame while the task still exists in your R session:

results_df <- my_task$get_samples()

You may also want to save the Task object itself.
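Since a Task is an ordinary R object, base R serialization is one straightforward way to keep it between sessions. Whether every internal component survives a save-and-reload round trip can depend on the package version, so treat this as a convenience rather than an archival format:

```r
# Save the task, including its logged results, and reload it later
saveRDS(my_task, "my_task.rds")
my_task_restored <- readRDS("my_task.rds")
```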

If there’s an API glitch while you’re running your input queries, the entire run will fail. If you want to run a test for a lot of epochs, you may want to break it up into smaller groups so as not to risk wasting tokens (and time).
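One way to hedge against a mid-run failure is to run smaller batches and save each batch’s samples as you go. This is a sketch, not a built-in vitals retry mechanism, and the file names are illustrative:

```r
# Run 9 epochs as three batches of 3, saving results after each batch,
# so one API failure doesn't lose the whole run
for (batch in 1:3) {
  batch_task <- my_task$clone()
  batch_task$eval(epochs = 3)
  saveRDS(batch_task$get_samples(), sprintf("results_batch_%d.rds", batch))
}
```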

Swap in another LLM

There are several ways to run the same task with a different model. First, create a new chat object with that different model. Here’s the code for checking out Google Gemini 3 Flash Preview:

my_chat_gemini <- chat_google_gemini(model = "gemini-3-flash-preview")

Then you can run the task in one of three ways.

1. Clone an existing task and add the chat as its solver with $set_solver():

my_task_gemini <- my_task$clone()
my_task_gemini$set_solver(generate(my_chat_gemini))

2. Clone an existing task and add the new chat as a solver when you run it:

my_task_gemini <- my_task$clone()
my_task_gemini$eval(solver_chat = my_chat_gemini)

3. Create a new task from scratch, which allows you to include a new name:

my_task_gemini <- Task$new(
  dataset = my_dataset,
  solver = generate(my_chat_gemini),
  scorer = model_graded_qa(
    scorer_chat = chat_anthropic(model = "claude-sonnet-4-6")
  ),
  name = "my_gemini_eval"
)

Make sure you’ve set your API key for each provider you want to test, unless you’re using a platform that doesn’t need them, such as local LLMs with ollama.

View multiple task runs

Once you’ve run multiple tasks with different models, you can use the vitals_bind() function to combine the results:

both_tasks <- vitals_bind(my_task, my_task_gemini)

Example of combined task results running each LLM with three epochs.

Sharon Machlis

This returns an R data frame with columns for task, id, epoch, score, and metadata. The metadata column contains a data frame in each row with columns for input, target, result, solver_chat, scorer_chat, scorer_metadata, and scorer.

To flatten the input, target, and result columns and make them easier to scan and analyze, I un-nested the metadata column with:

library(tidyr)
both_tasks_wide <- both_tasks |>
  unnest_longer(metadata) |>
  unnest_wider(metadata)

I was then able to run a quick script to cycle through each bar-chart result code and see what it produced:

library(dplyr)

# Some results are surrounded by markdown and that markdown code needs to be removed or the R code won't run
extract_code <- function(x) {
  gsub("```[a-zA-Z]*\\s*|\\s*```", "", x)
}

barchart_results <- both_tasks_wide |>
  filter(id == "barchart")

# Loop through each result
for (i in seq_len(nrow(barchart_results))) {
  code_to_run <- extract_code(barchart_results$result[i])
  print(eval(parse(text = code_to_run)))
}

Test local LLMs

This is one of my favorite use cases for vitals. Currently, models that fit into my PC’s 12GB of GPU RAM are rather limited. But I’m hopeful that small models will soon be useful for more tasks I’d like to do locally with sensitive data. Vitals makes it easy for me to test new LLMs on some of my specific use cases.

vitals (via ellmer) supports ollama, a popular way of running LLMs locally. To use ollama, download and install the application, then start it from either the desktop app or a terminal window. The syntax is ollama pull to download an LLM, or ollama run to download a model and start a chat with it, if you’d like to make sure the model works on your system. For example: ollama pull ministral-3:14b.

The rollama R package lets you download a local LLM for ollama within R, as long as ollama is running. The syntax is rollama::pull_model("model-name"). For example, rollama::pull_model("ministral-3:14b"). You can test whether R can see ollama running on your system with rollama::ping_ollama().

I also pulled Google’s gemma3-12b and Microsoft’s phi4, then created tasks for each of them with the same dataset I used before. Note that as of this writing, you need the dev version of vitals to handle LLM names that include colons (the next CRAN version after 0.2.0 should handle that, though):

# Create chat objects
# Create chat objects
ministral_chat <- chat_ollama(model = "ministral-3:14b")
gemma_chat <- chat_ollama(model = "gemma3:12b")
phi_chat <- chat_ollama(model = "phi4")

# Create a task for each, reusing the same dataset and an Opus judge
opus_scorer <- model_graded_qa(scorer_chat = chat_anthropic(model = "claude-opus-4-6"))
ministral_task <- Task$new(dataset = my_dataset, solver = generate(ministral_chat), scorer = opus_scorer)
gemma_task <- Task$new(dataset = my_dataset, solver = generate(gemma_chat), scorer = opus_scorer)
phi_task <- Task$new(dataset = my_dataset, solver = generate(phi_chat), scorer = opus_scorer)

All three local LLMs nailed the sentiment analysis, and all did poorly on the bar chart. Some code produced bar charts but not with axes flipped and sorted in descending order; other code didn’t work at all.


Results of one run of my dataset with five local LLMs.

Sharon Machlis

R code for the results table above:

library(dplyr)
library(gt)
library(scales)

# Prepare the data
plot_data <- vitals_bind(ministral_task, gemma_task, phi_task) |>
  rename(LLM = task, task = id) |>
  group_by(LLM, task) |>
  summarize(
    pct_correct = mean(score == "C") * 100,
    .groups = "drop"
  )

color_fn <- col_numeric(palette = c("red", "yellow", "green"), domain = c(0, 100))

plot_data |>
  tidyr::pivot_wider(names_from = task, values_from = pct_correct) |>
  gt() |>
  tab_header(title = "Percent Correct") |>
  cols_label(`sentiment-analysis` = html("sentiment-<br>analysis")) |>
  data_color(
    columns = -LLM,
    fn = color_fn
  )

It cost me 39 cents for Opus to judge these local LLM runs—not a bad bargain.

Update (April 6, 2026): I used vitals and the same three-item dataset to test several LLMs from Google’s new Gemma 4 open-weight, commercially permissive family announced on April 2.

While the 4b version did about the same as other local LLMs I’ve tried, gemma-4-26b scored a surprising 100% when I ran it six times.

Note that although it ran at an acceptable speed in Ollama, gemma-4-26b was a tight fit in my PC’s memory when running vitals inside RStudio. In fact, it choked when I tried to run multiple epochs at once, so I ended up running only one test at a time.

Also important: I set up the ellmer chat object to turn off the model’s “thinking.” The code:

chat_gemma_26b <- chat_ollama(
  model = "gemma-4-26b",
  api_args = list(think = FALSE)
)

Extract structured data from text

Vitals has a special function for extracting structured data from plain text: generate_structured(). It requires both a chat object and a defined data type you want the LLM to return. As of this writing, you need the development version of vitals to use the generate_structured() function.

First, here’s my new dataset to extract topic, speaker name and affiliation, date, and start time from a plain-text description. The more complex version asks the LLM to convert the time zone to Eastern Time from Central European Time:

extract_dataset <- data.frame(
  input = c(
    "Extract the workshop topic, speaker name, speaker affiliation, date in 'yyyy-mm-dd' format, and start time from the text below. Assume the date year makes the most sense given that today's date is February 7, 2026. Return ONLY those entities in the format {topic}, {speaker name}, {date}, {start_time}. R Package Development in Positron\r\nThursday, January 15th, 18:00 - 20:00 CET (Rome, Berlin, Paris timezone) \r\nStephen D. Turner is an associate professor of data science at the University of Virginia School of Data Science. Prior to re-joining UVA he was a data scientist in national security and defense consulting, and later at a biotech company (Colossal, the de-extinction company) where he built and deployed scores of R packages.",

    "Extract the workshop topic, speaker name, speaker affiliation, date in 'yyyy-mm-dd' format, and start time in Eastern Time zone in 'hh:mm ET' format from the text below. (TZ is the time zone). Assume the date year makes the most sense given that today's date is February 7, 2026. Return ONLY those entities in the format {topic}, {speaker name}, {date}, {start_time}. Convert the given time to Eastern Time if required. R Package Development in Positron\r\nThursday, January 15th, 18:00 - 20:00 CET (Rome, Berlin, Paris timezone) \r\nStephen D. Turner is an associate professor of data science at the University of Virginia School of Data Science. Prior to re-joining UVA he was a data scientist in national security and defense consulting, and later at a biotech company (Colossal, the de-extinction company) where he built and deployed scores of R packages."
  ),
  target = c(
    "R Package Development in Positron, Stephen D. Turner, University of Virginia (or University of Virginia School of Data Science), 2026-01-15, 18:00. OR R Package Development in Positron, Stephen D. Turner, University of Virginia (or University of Virginia School of Data Science), 2026-01-15, 18:00 CET.",
    "R Package Development in Positron, Stephen D. Turner, University of Virginia (or University of Virginia School of Data Science), 2026-01-15, 12:00 ET."
  )
)

Below is an example of how to define a data structure using ellmer’s type_object() function. Each of the arguments gives the name of a data field and its type (string, integer, and so on). I’m specifying I want to extract a workshop_topic, speaker_name, current_speaker_affiliation, date (as a string), and start_time (also as a string):

my_object <- type_object(
  workshop_topic = type_string(),
  speaker_name = type_string(),
  current_speaker_affiliation = type_string(),
  date = type_string(),
  start_time = type_string()
)

Next, I’ll use the chat objects I created earlier in a new structured data task, using Sonnet as the judge since grading is straightforward:

my_task_structured <- Task$new(
  dataset = extract_dataset,
  solver = generate_structured(my_chat, type = my_object),
  scorer = model_graded_qa(
    scorer_chat = chat_anthropic(model = "claude-sonnet-4-6")
  )
)

It cost me 16 cents for Sonnet to judge 15 evaluation runs of two queries and results each.

Here are the results:


How various LLMs fared on extracting structured data from text.

Sharon Machlis

I was surprised that a local model, Gemma, scored 100%. I wanted to see if that was a fluke, so I ran the eval another 17 times for a total of 20. Weirdly, it missed on two of the 20 basic extractions by giving the title as “R Package Development” instead of “R Package Development in Positron,” but scored 100% on the more complex ones. I asked Claude Opus about that, and it said my “easier” task was more ambiguous for a less capable model to understand. Important takeaway: Be as specific as possible in your instructions!

Still, Gemma’s results were good enough on this task for me to consider testing it on some real-world entity extraction tasks. And I wouldn’t have known that without running automated evaluations on multiple local LLMs.

Conclusion

If you’re used to writing code that gives predictable, repeatable responses, a script that generates different answers each time it runs can feel unsettling. While there are no guarantees when it comes to predicting an LLM’s next response, evals can increase your confidence in your code by letting you run structured tests with measurable responses, instead of testing via manual, ad-hoc queries. And, as the model landscape keeps evolving, you can stay current by testing how newer LLMs perform—not on generic benchmarks, but on the tasks that matter most to you.

Learn more about the vitals R package


Databricks launches AiChemy multi-agent AI for drug discovery 6 Apr 2026, 11:28 am

Databricks has outlined a reference architecture for a multi-agent AI system, named AiChemy, that combines internal enterprise data on its platform with external scientific databases via the Model Context Protocol (MCP) to accelerate drug discovery tasks such as target identification and candidate evaluation.

These early-stage steps are critical in drug development because they help pharma companies determine which biological mechanisms to pursue and which compounds are worth advancing, directly influencing the cost, time, and likelihood of success in later clinical stages.

The multi-agent AI system is built on Databricks components, including its Data Intelligence Platform, Delta Lake, and Mosaic AI (notably Agent Bricks), which together manage and govern enterprise data while enabling the creation and orchestration of domain-specific agents and “skills.”

These skills include instructions for querying and summarizing scientific literature, retrieving chemical and molecular data, performing similarity searches across compounds, and synthesizing evidence across sources.

The system combines these agents and skills with external data sources such as OpenTargets, PubMed, and PubChem, accessed via MCP, allowing agents to retrieve and reason over both proprietary and public scientific data.

In doing so, AiChemy brings data access, orchestration, and analysis together in a single, governed environment, which Databricks says will help researchers at pharma companies surface relevant insights from disparate datasets without losing context, in turn accelerating tasks like target identification and candidate evaluation.

Underpinning the entire system is a supervisor agent that coordinates how individual agents and skills are used to fulfill a query.

Databricks describes this supervisor agent not as a prepackaged component, but as a pattern that enterprise teams can implement using its Mosaic AI and Agent Bricks tooling.

Enterprise teams building such a supervisor agent, according to a Databricks blog post, would need to start by defining and implementing domain-specific skills, such as literature search, compound lookup, or data synthesis, and registering them so they can be programmatically invoked.

Developers then would need to configure the supervisor agent with instructions or policies that determine how it selects and sequences these skills in response to a query, including how tasks are decomposed and routed, the company wrote in the blog post.

This setup is typically tied to enterprise and external data sources via MCP, with access controls and governance applied through Databricks’ platform, it added.

The AiChemy initiative builds on earlier Databricks efforts in healthcare and drug discovery.

In June 2025, the company partnered with Atropos Health to combine real-world clinical data with its Data Intelligence Platform to support evidence generation and accelerate research workflows.

A month later, in July 2025, it announced a partnership with TileDB focused on integrating multimodal scientific data, such as genomics, imaging, and clinical records, to enable AI-driven analysis for drug discovery and clinical insights.

The AiChemy reference architecture, Databricks said, has been made available through a web application and a GitHub repository, where developers can explore the system and adapt it to their own use cases using its Agent Bricks framework.


Multi-agent AI is the new microservices 6 Apr 2026, 9:00 am

We just can’t seem to help ourselves. Our current infatuation with multi-agent systems risks mistaking a useful pattern for an inevitable future, just as we once did with microservices. Remember those? For some good (and bad) reasons, we took workable applications, broke them into a confusing cloud of services, and then built service meshes, tracing stacks, and platform teams just to manage the complexity we’d created. Yes, microservices offered real advantages, as I’ve argued. But also, you don’t need to “run like Google” unless you actually have Google’s problems. (Spoiler alert: You don’t.)

Now we’re about to make the same mistake with AI.

Every agent demo seems to feature a planner agent, a researcher agent, a coder agent, a reviewer agent, and, why not? an agent whose sole job is to feel good about the architecture diagram. This doesn’t mean multi-agent systems are bad; they’re simply prescribed more broadly than is wise, just as we did with microservices.

So when should you embrace a multi-agent approach?

A real pattern, with a hype tax

Even the companies building the frontier models are practically begging developers not to use them promiscuously. In its 2024 guide to building effective agents, Anthropic explicitly recommends finding “the simplest solution possible” and says that might mean not building an agentic system at all. More pointedly, Anthropic says that for many applications, optimizing single LLM calls with retrieval and in-context examples is usually enough. It also warns that frameworks can create layers of abstraction that obscure prompts and responses, make systems harder to debug, and tempt developers to add complexity when a simpler setup would suffice. Santiago Valdarrama put the same idea more bluntly: “Not everything is an agent,” he stresses, and “99% of the time, what you need is regular code.”

That’s not anti-agent. It’s engineering discipline.

OpenAI lands in roughly the same place. Its practical guide recommends maximizing a single agent’s capabilities first because one agent plus tools keeps complexity, evaluation, and maintenance more manageable. It explicitly suggests prompt templates as a way to absorb branching complexity without jumping to a multi-agent framework. Microsoft is similarly blunt: If the use case does not clearly cross security or compliance boundaries, involve multiple teams, or otherwise require architectural separation, start with a single-agent prototype. It even cautions that “planner,” “reviewer,” and “executor” roles do not automatically justify multiple agents, because one agent can often emulate those roles through persona switching, conditional prompting, and tool permissioning. Google, for its part, adds a particularly useful nuance here, warning that the wrong choice between a sub-agent and an agent packaged as a tool can create massive overhead. In other words, sometimes you don’t need another teammate. You need a function with a clean contract.

Microsoft makes one more point that deserves extra attention: Many apparent scale problems stem from retrieval design, not architecture. So, before you add more agents, fix chunking, indexing, reranking, prompt structure, and context selection. That isn’t less ambitious. It is more adult. We learned this the hard way with microservices. Complexity doesn’t vanish when you decompose a system. It relocates. Back then, it moved into the network. Now it threatens to move into hand-offs, prompts, arbitration, and agent state.

Distributed intelligence is still distributed

What could have been one strong model call, retrieval, and a few carefully designed tools can quickly turn into agent routing, context hand-offs, arbitration, permissioning, and observability across a swarm of probabilistic components. That may be worth it when the problem is truly distributed, but often it’s not. Distributed intelligence is still distributed systems, and distributed systems aren’t cheap to build or maintain.

As OpenAI’s evaluation guide warns, triaging and hand-offs in multi-agent systems introduce a new source of nondeterminism. Its Codex documentation says subagents are not automatic and should only be used when you explicitly request parallel agent work, in part because each subagent does its own model and tool work and therefore consumes more tokens than a comparable single-agent run. Microsoft makes the same point in enterprise language: Every agent interaction requires protocol design, error handling, state synchronization, separate prompt engineering, monitoring, debugging, and a broader security surface.

Modularity, yes. But don’t pretend that modularity will be cheap.

This is why I suspect most teams that think they need multiple agents actually have a different problem. Their tools are vague, their retrieval is weak, their permissions are too broad, and their repositories are under-documented. Guess what? Adding more agents doesn’t fix any of that. It exacerbates it. As Anthropic explains, the most successful implementations tend to use simple, composable patterns rather than complex frameworks, and for many applications a single LLM call with retrieval and in-context examples is enough.

This matters even more because AI makes complexity cheap. In the microservices era, a bad architectural idea was at least constrained by the effort required to implement it. In the agent era, the cost of sketching yet another orchestration layer, another specialized persona, another hand-off, or another bit of glue code is collapsing. That can feel liberating even as it destroys our ability to maintain and manage systems over time. As I’ve written, lower production costs don’t automatically translate into higher productivity. They often just make it easier to manufacture fragility at scale.

Earn the extra moving parts

This also brings us back to a point I’ve made for years about hyperscaler architecture. Just because Google, Amazon, Anthropic, or OpenAI do something doesn’t mean you should too, because you don’t have their problems. Anthropic’s research system is impressive precisely because it tackles a hard, open-ended, breadth-first research problem. Anthropic is also candid about the cost. In its data, agents used about four times more tokens than chat interactions, while multi-agent systems used about 15 times more. The company also notes that most coding tasks are not a particularly good fit because they offer fewer truly parallelizable subtasks, and agents are not yet especially good at coordinating with one another in real time.

In other words, even one of the strongest public examples of multi-agent success comes with a warning label attached. It’s not quite “abandon hope, all ye who enter here,” but it’s definitely not “do as I’m doing.”

The better question is “What’s the minimum viable autonomy for this job?” Start with a strong model call. If that isn’t enough, add retrieval. Still not enough? Add better tools. If you need iteration, wrap those tools in a single agent loop. If context pollution becomes real, if independent tasks can truly run in parallel, or if specialization materially improves tool choice, then and only then start “earning” your second agent. If you can’t say which of those three problems you are solving, you probably don’t need another agent. Don’t believe me? All of the top purveyors of agent tools (Anthropic, OpenAI, Microsoft, Google) converge on this same counsel.

So yes, multi-agent is the new microservices. That is both a compliment and a warning. Microservices were powerful when you had a problem worth distributing. Multi-agent systems are powerful when you have a problem worth decomposing. Most enterprise teams don’t, at least not yet. Many others never will. Instead, most need one well-instrumented agent, tight permissions, strong evaluations, boring tools, and clear exit conditions. The teams that win with agentic AI won’t be those that reach for the fanciest topology first. Instead, they’ll be disciplined enough to earn every extra moving part and will work hard to avoid additional moving parts for as long as possible. In the enterprise, boring is still what scales.


27 questions to ask when choosing an LLM 6 Apr 2026, 9:00 am

Car buyers kick tires. Horse traders inspect the teeth. What should shoppers for large language models (LLMs) do?

Here are 27 pertinent questions that developers are asking before they adopt a particular model. Model capabilities are diverse, and not every application requires the same support. These questions will help you identify the best models for your job.

What is the size of the model?

The number of parameters is a rough estimate of how much information is already encoded in the model. Some problems need to leverage that built-in knowledge: The prompts will be looking for information that might be in the training corpus.

Some problems won’t need larger models. Perhaps there will be plenty of information added from a retrieval-augmented generation (RAG) database. Perhaps the questions will be simpler. If you can anticipate the general size of the questions, you can choose the smallest model that will satisfy them.

Does the model fit in your hardware?

Anyone who will be hosting their own models needs to pay attention to how well they run on the hardware at hand. Finding more RAM or GPUs is always a chore and sometimes impossible. If the model doesn’t fit or run smoothly on the hardware, it can’t be a solution.

What is the time to first token?

There are multiple ways to measure the speed of an LLM. The time to first token, or TTFT, is important for real-time, interactive applications where the end user will be daydreaming while waiting for some answer on the screen. Some models start the response faster, but then poke along. Others take longer to begin responding. If you’re going to be using the LLM in the background or as a batch job, this number isn’t as important.
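If you want to compare models on this metric yourself, the measurement is simple to sketch. The snippet below times a stream of tokens; the `fake_stream` generator is a stand-in for a real streaming API response, which your client library of choice would supply.

```python
import time

def measure_ttft(stream):
    """Measure time to first token (TTFT) and total time for a token stream.

    `stream` is any iterable that yields tokens; in practice it would be
    a streaming response from an LLM client (a stand-in is used here).
    """
    start = time.monotonic()
    ttft = None
    tokens = 0
    for _ in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first token has arrived
        tokens += 1
    total = time.monotonic() - start
    return ttft, total, tokens

# Stand-in for a real streaming API: first token after ~50 ms, then fast.
def fake_stream():
    time.sleep(0.05)
    for tok in ["Hello", ",", " world"]:
        yield tok
        time.sleep(0.001)

ttft, total, n = measure_ttft(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms; {n} tokens in {total * 1000:.0f} ms total")
```

Running the same measurement against each candidate model, with your own representative prompts, separates the fast starters from the slow pokes.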

Are there rate limits?

All combinations of models and hardware have a speed limit. If you’re supplying the hardware, you can establish the maximum load through testing. If you’re using an API, the provider will probably put rate limits on how many tokens it can process for you. If your project needs more, you’ll either need to buy more hardware or look for a different provider.
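When you do hit a provider's limit, the standard client-side response is exponential backoff with jitter. A minimal sketch, assuming the provider signals the limit with an error the client surfaces as an exception (`RateLimitError` here is a generic stand-in, not any particular library's class):

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for a provider's HTTP 429 'too many requests' error."""

def call_with_backoff(send, max_retries=5, base=1.0):
    """Retry a rate-limited call with exponential backoff plus jitter.

    `send` is any zero-argument callable that raises RateLimitError
    when the provider rejects the request.
    """
    for attempt in range(max_retries):
        try:
            return send()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Sleep base * 2^attempt seconds, capped, with random jitter
            # so many clients don't all retry in lockstep.
            time.sleep(min(base * 2 ** attempt, 30.0) * (0.5 + random.random()))
```

Most official client libraries build in something similar, but knowing the pattern helps when you're sizing a workload against a provider's published limits.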

What is the size of the context window?

How big is your question? Some problems like refactoring a large code base require feeding millions of tokens into the machine. A smaller model with a limited context window won’t do. It will forget the first part of the prompt before it gets to the end.

If your problem fits into a smaller prompt, then you can get by with a smaller context window and a simpler model.
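A quick way to sanity-check the fit is the common rule of thumb of roughly four characters per token for English text. This is only a heuristic; use the model's actual tokenizer for real capacity planning.

```python
def rough_token_estimate(text: str) -> int:
    """Crude token estimate using the ~4 characters per token rule of
    thumb for English. Use the model's real tokenizer for anything
    that matters."""
    return max(1, len(text) // 4)

def fits_in_context(text: str, context_window: int, reserve: int = 1024) -> bool:
    # Leave `reserve` tokens of headroom for the model's response.
    return rough_token_estimate(text) + reserve <= context_window

doc = "word " * 20_000  # about 100,000 characters of input
print(rough_token_estimate(doc))  # 25000
```

A document that passes `fits_in_context` for a 128K-token model may still overflow an 8K-token one, which is exactly the question this section asks you to answer before choosing.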

How does the model balance reasoning with speed?

Model developers can add different stages where the models will attempt to reason or think about the problem on a meta level. This is often considered “reasoning,” although it generally means that the models will iterate through a variety of different approaches until they find an answer that seems good enough. In practice, there’s a tradeoff between reasoning and speed. More iterations means slower responses. Is this “reasoning” worth it? It all depends upon the problem.

How stable is the model?

On certain prompts, some models are more likely to fail than others. They’ll start off with an answer but diverge into some dark statistical madness, spewing random words and gibberish. Much of the time they’ll offer correct answers, but the instability can appear at random, often after the model is already running in production.

When did training end?

The “knowledge cutoff” is the last day when the training set for the model stopped getting an injection of new information. If you’re going to be relying on the general facts embedded in the model, then you’ll want to know how they age. Not all projects need a current date, though, because some use other documents in a RAG system or vector database to add more details to the prompt.

Is additional training possible?

Some LLM providers support another round of training, usually on domain-specific data sets of the customer. This fine-tuning can teach a foundation model some of the details that give it the power to take up a place in some workflow or data assembly line. The fine-tuning is often dramatically cheaper and faster than building an entirely new model from scratch.

Which media types are supported?

Some models only return text. Some return images. Some are trained to do something else entirely. The same goes for input. Some can read a text prompt. Some can examine an image file and parse charts or PDFs. Some are smart enough to unpack strange file types. Just make sure the LLM can listen and speak in the file formats you need.

What is the prompting structure?

The structure of the prompt can make a difference with many models. Some pay particular attention to instructions in the system prompt. Others are moving to a more interactive, Socratic style of prompting that allows the user and the LLM to converge upon the answer. Some encourage the LLM to adopt different personas of famous people. The best way to prompt iterative, agentic thought is still a very active research topic.

Is the model open source?

Some models have been released with open source licenses that bring many of the same freedoms as open source software. Projects that need to run in controlled environments can fire up these models inside their space and avoid trusting online services. Some users will want to fine-tune the models, and open source models allow them to take advantage of access to the model weights.

Is there a guaranteed lifespan?

If the model is not open source, the creators may shut it down at any time. Some services offer assurance that the model will have a set lifespan and will be supported for a predictable amount of time. This allows developers to be sure that the rug won’t be pulled out from beneath their feet soon after integrating the model with their stack.

Whereas earlier versions of open source models remain available, the ongoing availability of proprietary models is determined by their owners. What happens to old versions once they’re retired? Most of us are happier with their replacements, but those who have grown to rely on them are out of luck. Some providers of proprietary models have promised to release the model weights upon retirement, an option that keeps the model available even though it’s not fully open source.

Does the model have a batch architecture?

If the answer is not needed in real time, some LLMs can process the prompts in delayed batches. Many model hosts will offer large discounts for the option to answer at some later time when demand is lower. Some inference engines can offer continuous batching with techniques like PagedAttention or finer-grained scheduling. All of these techniques can lower costs by boosting the throughput of hardware.

What is the cost? 

In some situations, price is very important, especially when some tasks will be repeated many times. While the cost of one answer may be fractions of a cent, those fractions add up. On big data assembly lines, downgrading to a cheaper option can make the difference between financial success and failure.

In other jobs, the price won’t matter. Maybe the prompt will only be run a few times. Maybe the price is much lower than the value of the job. Scrimping on the LLM makes little sense in these cases because spending extra for a bigger, fancier model won’t break the budget.

Was the model trained on synthetic data?

Some LLMs are trained on synthetic data created by other models. When things go correctly, the model doesn’t absorb any false bits of information. But when things go wrong, the models can lose precision. Some draw an analogy to the way that copies of copies of copies grow blurred and lose detail. Others compare the process to audio feedback between an amplifier and a microphone.

Is the training set copyrighted?

Some LLM creators cut corners when they started building their training set by including pirated books. Anthropic, for example, has announced a settlement to a class action lawsuit for some books that are still under copyright. Other lawsuits are still pending. The claim is that the models may produce something close to the copyrighted material when prompted the right way. If your use cases may end up asking for answers that might turn up plagiarized or pirated material, you should look for some assurances about how the training set was chosen.

Is there a provenance audit?

Some developers are fighting the questions about synthetic data and copyright by offering a third-party audit of their training sets. This can answer questions and alleviate worries about future infringement issues.

Does the model come with indemnification?

Does the contract offer a guarantee that the answers won’t infringe upon copyright or include personal information? Some companies are confident that their training data is clean enough that they’re able to offer contractual indemnification for customers.

Do we know the environmental impacts?

This usually means measuring how much electricity and water are consumed to produce an answer. Some services are offering estimates that they hope will distinguish their services from others that are more wasteful. In general, price is not a bad proxy for environmental impact, because electricity and water are direct costs and often some of the greatest ones. Developers have a natural incentive to use less of both.

Is the hardware powered by renewable energy?

Did the power come from a clean source? Some services are partnering directly with renewable energy providers so that they can promise that the energy used to construct an answer came from solar or wind farms. In some cases, they’re offering batch services that queue up the queries until the renewable sources are online. 

Does the model have compliance issues?

Some developers who work in highly regulated environments need to worry about access to their data. These developers will need to review how standards like SOC 2, HIPAA, and GDPR, among others, affect how the model can be used. In many cases, the model needs to be fired up in a controlled environment. In some cases, the problem is more complex: Some regulations require “transparency” in certain decisions, meaning that the model will need to explain how it came to a conclusion. This is often one of the most complicated questions to answer.

Where does the model run? 

Some of the regulations are tied directly to location. Some of the GDPR regulations, for instance, require that personal data from Europeans stay in Europe. Geopolitics and national borders also affect legal questions for a number of issues like taxes, libel, or privacy. If your use case strays into these areas, the physical location of the LLM may be important. Some services are setting up regional deployments just to resolve these questions. 

Does the model support human help?

Some developers are explicitly building in places for humans inside the reasoning of the LLM. These “human-in-the-loop” solutions make it possible to stop an LLM from delivering a flawed or dangerous answer. Finding the best architectural structure for these hooks can be tricky because they can create too much labor if they’re triggered too often.

Does the model support tool use?

Some models and services let the LLM use outside features for searching the internet, looking in a database, or calling an arbitrary function. These capabilities can really help with problems that need to leverage data from outside sources. There is a large collection of tools and interfaces that use protocols like the Model Context Protocol (MCP). It’s worth experimenting with them to determine how stable they are.
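The mechanics are worth understanding even at a sketch level. The model emits a structured "tool call," the harness runs the matching function, and the result is fed back to the model. The snippet below is a generic illustration of that loop; the message shape and the `lookup_order` tool are invented for the example, while real protocols such as MCP standardize the exchange.

```python
import json

# Hypothetical tool registry: names the model may call, mapped to code.
TOOLS = {
    "lookup_order": lambda order_id: {"order_id": order_id, "status": "shipped"},
}

def handle_model_output(raw: str) -> dict:
    """Dispatch one model message: run a tool call, or pass text through."""
    msg = json.loads(raw)
    if msg.get("type") == "tool_call":
        fn = TOOLS[msg["name"]]
        # The tool result would be appended to the conversation so the
        # model can use it in its next turn.
        return {"type": "tool_result", "content": fn(**msg["arguments"])}
    return msg  # plain text answer, no tool needed

result = handle_model_output(
    '{"type": "tool_call", "name": "lookup_order", "arguments": {"order_id": "A123"}}'
)
print(result)
```

Stability, in practice, comes down to how reliably a given model produces well-formed tool calls like the one above, which is exactly what's worth testing before committing.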

Is the model agentic?

There may be no bigger buzzword right now, and that’s because everyone is using it to describe how they’re adding more reasoning capabilities to their models. Sometimes it means that a constellation of LLMs works together, often choreographed by some other set of LLMs. Does agentic mean smarter? Maybe. Better? Only you can tell.

What are the model’s quirks?

Anyone who spends some time with an LLM starts to learn its quirks. It’s almost like they’ve learned everything they know from fallible humans. One model gives different answers if there are two spaces after a period instead of one. Another model sounds pretentious. Most are annoyingly sycophantic. Anyone choosing an LLM must spend some time with the model and get a feel for whether the quirks will end up being endearing, annoying, or worse.


Anthropic cuts OpenClaw access from Claude subscriptions, offers credits to ease transition 6 Apr 2026, 8:40 am

Anthropic has blocked paid Claude subscribers from using the widely used open-source AI agent OpenClaw under their existing subscription plans, a move that took effect April 4 and has drawn pushback from subscribers who question both the cost implications and the company’s stated rationale.

In an email to subscribers reviewed by InfoWorld, Anthropic said access to third-party tools through subscription tokens was being discontinued. “Starting April 4, third-party harnesses like OpenClaw connected to your Claude account will draw from extra usage instead of from your subscription,” the company said. Users accessing Claude through the API are unaffected by the change.

To ease the transition, Anthropic offered each subscriber a one-time credit equal to their monthly subscription price, redeemable by April 17 and valid for 90 days across Claude Code, Claude Cowork, chat, or connected third-party tools. The company also introduced pre-purchase extra usage bundles at discounts of up to 30% for subscribers who want to continue running OpenClaw with Claude as the underlying model.

“If you ever run past your subscription limits, this is the easiest way to keep going,” the company said in the email.

Capacity, not competition

Boris Cherny, head of Claude Code at Anthropic, explained the decision in a post on X. “We’ve been working hard to meet the increase in demand for Claude, and our subscriptions weren’t built for the usage patterns of these third-party tools,” Cherny said. “Capacity is a resource we manage thoughtfully and we are prioritizing our customers using our products and API. We want to be intentional in managing our growth to continue to serve our customers sustainably long-term.”

The token gap between standard subscription usage and third-party agent workloads is substantial. Testing conducted by German technology outlet c’t 3003 in January found that a single day of OpenClaw usage running on Claude’s Opus model consumed $109.55 in AI tokens. Anthropic’s own published benchmarks for Claude Code put the average daily cost for a professional software developer at $6, with 90% of team users staying below $12 per day.

OpenClaw team pushed back — and bought a week

Peter Steinberger, the Austrian developer who created OpenClaw before joining OpenAI, said on X that the original implementation date had been earlier. “Both me and @davemorin tried to talk sense into Anthropic, best we managed was delaying this for a week,” Steinberger wrote. He also drew attention to the sequence of product moves preceding the access cut. “Funny how timings match up, first they copy some popular features into their closed harness, then they lock out open source,” Steinberger said.

When one commenter argued that third-party tools did not belong on flat subscription plans and that any vendor allowing it was being “intellectually dishonest,” Steinberger noted that OpenClaw already supports subscriptions from other AI providers. “Funny how it works for literally any other player in the AI industry, we support subscriptions from MiniMax, Alibaba, OpenCode, GLM, OpenAI,” he replied in the post.

Cherny responded to the open source criticism directly, saying he had personally contributed pull requests to OpenClaw to improve its prompt cache efficiency. “This is more about engineering constraints,” Cherny said. “Our systems are highly optimized for one kind of workload, and to serve as many people as possible with the most intelligent models, we are continuing to optimize that.”

Subscribers weigh the cost

Developer Jared Tate said on X that he intended to cancel his subscriptions over the change. After Cherny’s response, Tate acknowledged the engineering explanation and noted that careful OpenClaw configuration, including a one-hour prompt cache time-to-live and a 55-minute heartbeat, had materially reduced his own token consumption. “OpenClaw dramatically increased usage. But we all became so much more productive,” he wrote.

One subscriber posting as @ashen_one said they were running two OpenClaw instances on a $200-per-month plan. Shifting to API keys or overage bundles, they said, would make continued use financially unworkable. “I’ll probably have to switch over to a different model at this point,” the user wrote.

The user also pointed to Claude Cowork, Anthropic’s own agentic productivity tool, as a direct OpenClaw rival, and suggested the decision served competitive purposes. AI developer Brian Vasquez offered a different read. “Anthropic oversold their server capacity, and this was their response, point blank and simple,” Vasquez wrote on X. “It’s a capacity/bad bet. Time to pay off that bad bet.”


Internet Bug Bounty program hits pause on payouts 3 Apr 2026, 5:16 pm

Researchers who identify and report bugs in open-source software will no longer be rewarded by the Internet Bug Bounty team. HackerOne, which administers the program, has said that it is “pausing submissions” while it contemplates ways in which open source security can be handled more effectively.

The Internet Bug Bounty program, funded by a number of leading software companies, has been run since 2012 and has awarded more than $1.5m to researchers who have reported bugs. Up to now, 80% of its payouts have been for discoveries of new flaws, and 20% to support remediation efforts. But as artificial intelligence makes it easier to find bugs, that balance needs to change, HackerOne said in a statement.

“AI-assisted research is expanding vulnerability discovery across the ecosystem, increasing both coverage and speed. The balance between findings and remediation capacity in open source has substantively shifted,” said HackerOne.

Among the first programs to be affected is the Node.js project, a server-side JavaScript platform for web applications known for its extensive ecosystem. While the project team will continue to accept and triage bug reports through HackerOne, without funding from the Internet Bug Bounty program it will no longer pay out rewards, according to an announcement on its website.

The Internet Bug Bounty Program is not the only bug-hunting project that has struggled with the onset of AI in vulnerability hunting. In January, the Curl program said that it was not taking any more submissions. And just last month, Google also put a halt to AI-generated submissions provided to its Open Source Software Vulnerability Reward Program.


Claude Code is still vulnerable to an attack Anthropic has already fixed 3 Apr 2026, 4:55 pm

The leak of Claude Code’s source is already having consequences for the tool’s security. Researchers have spotted a vulnerability documented in the code.

The vulnerability, revealed by AI security company Adversa, is that if Claude Code is presented with a command composed of more than 50 subcommands, it skips the compute-intensive security analysis for every subcommand after the 50th. Analysis that might otherwise have blocked some of those subcommands is replaced by a simple prompt asking the user whether to go ahead. The user, assuming that the block rules are still in effect, may unthinkingly authorize the action.
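The flawed pattern is easy to see in schematic form. The sketch below is an illustration of the behavior Adversa describes, not Anthropic's actual code; the function names and 50-command threshold are drawn only from the public report.

```python
MAX_DEEP_CHECKS = 50  # threshold described in Adversa's report

def review_subcommands(subcommands, deep_check, ask_user):
    """Illustrative sketch (not Anthropic's code): expensive security
    analysis runs only on the first 50 subcommands; everything past
    the cap falls back to a simple yes/no user prompt."""
    approved = []
    for i, cmd in enumerate(subcommands):
        if i < MAX_DEEP_CHECKS:
            if deep_check(cmd):      # analysis may block dangerous commands
                approved.append(cmd)
        elif ask_user(cmd):          # past the cap: analysis is skipped
            approved.append(cmd)
    return approved
```

Pad a command list with 50 innocuous steps, and the 51st lands on the `ask_user` branch, where a user who assumes the checks are still running is likely to click through.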

Incredibly, the vulnerability is documented in the code, and Anthropic has already developed a fix for it, the tree-sitter parser, which is also in the code but not enabled in public builds that customers use, said Adversa.

Adversa outlined how attackers might exploit the vulnerability by distributing a legitimate-looking code repository containing a poisoned CLAUDE.md file. This would contain instructions for Claude Code to build the project, with a sequence of 50 or more legitimate-looking commands, followed by a command to, for example, exfiltrate the victim’s credentials. Armed with those credentials, the attackers could threaten a whole software supply chain.


CERT-EU blames Trivy supply chain attack for Europa.eu data breach 3 Apr 2026, 4:37 pm

The European Union’s Computer Emergency Response Team, CERT-EU, has traced last week’s theft of data from the Europa.eu platform to the recent supply chain attack on Aqua Security’s Trivy open-source vulnerability scanner.

The attack on the AWS cloud infrastructure hosting the Europa.eu web hub on March 24 resulted in the theft of 350 GB of data (91.7 GB compressed), including personal names, email addresses, and messages, according to CERT-EU’s analysis.

The compromise of Trivy allowed attackers to access an AWS API key, gaining access to a range of European Commission web data, including data related to “42 internal clients of the European Commission, and at least 29 other Union entities using the service,” it said.

“The threat actor used the compromised AWS secret to create and attach a new access key to an existing user, aiming to evade detection. They then carried out reconnaissance activities,” said CERT-EU. The organization had found no evidence that the attackers had moved laterally to other AWS accounts belonging to the Commission.

Given the timing and involvement of AWS credentials, “the European Commission and CERT-EU have assessed with high confidence that the initial access vector was the Trivy supply-chain compromise, publicly attributed to TeamPCP by Aqua Security,” it said.

In the event, the stolen data became public after the group blamed for the attack, TeamPCP, leaked it to the ShinyHunters extortion group, which published it on the dark web on March 28.

Back door credentials

The Trivy compromise dates to February, when TeamPCP exploited a misconfiguration in Trivy’s GitHub Actions environment, now identified as CVE-2026-33634, to establish a foothold via a privileged access token, according to Aqua Security.

Discovering this, Aqua Security rotated its credentials, but because some credentials remained valid during the rotation, the attackers were able to steal the newly rotated ones.

By manipulating trusted Trivy version tags, TeamPCP forced CI/CD pipelines using the tool to automatically pull down credential-stealing malware it had implanted.

This allowed TeamPCP to target a variety of valuable information including AWS, GCP, Azure cloud credentials, Kubernetes tokens, Docker registry credentials, database passwords, TLS private keys, SSH keys, and cryptocurrency wallet files, according to security researchers at Palo Alto Networks. In effect, the attackers had turned a tool used to find cloud vulnerabilities and misconfigurations into a yawning vulnerability of its own.

CERT-EU advised organizations affected by the Trivy compromise to immediately update to a known safe version, rotate all AWS and other credentials, audit Trivy versions in CI/CD pipelines, and most importantly ensure GitHub Actions are tied to immutable SHA-1 hashes rather than mutable tags.
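The last of those recommendations is worth illustrating, because it is the one that would have blunted this attack: a mutable tag can be silently repointed at malicious code, while a full commit SHA cannot. A minimal workflow sketch follows; the action path, version tag, and SHA are placeholders for illustration, not references to the actual compromised release.

```yaml
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      # Risky: "v1.2.3" is a tag the upstream repo can move at any time.
      # - uses: aquasecurity/trivy-action@v1.2.3

      # Safer: pin to the full, immutable commit SHA of a reviewed release.
      - uses: aquasecurity/trivy-action@0123456789abcdef0123456789abcdef01234567  # placeholder SHA
```

The tradeoff is that pinned SHAs no longer pick up upstream fixes automatically, so teams that pin typically pair it with tooling that proposes SHA updates for review.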

It also recommended looking for indicators of compromise (IoCs) such as unusual Cloudflare tunnelling activity or traffic spikes that might indicate data exfiltration.

Extortion boost

The origins and deeper motives of TeamPCP, which emerged in late 2025, remain unclear. The leaking of stolen data suggests it might be styling itself as a sort of initial access broker that sells data and network access to the highest bidder.

However, the fact that stolen data was handed to a major ransomware group suggests that affected organizations are likely to face a wave of extortion demands in the coming weeks.

If so, this would be a huge step backwards at a time when ransomware has been under pressure as the proportion of victims willing to pay ransoms has declined.

The compromise of Trivy, estimated to have affected at least 1,000 SaaS environments, is rapidly turning into one of the most consequential supply-chain incidents of recent times.

The number of victims is likely to grow in the coming weeks. Others caught up in the incident include Cisco, which reportedly lost source code, security testing company Checkmarx, and AI gateway company LiteLLM.

This article was first published on CSO.


Google gives enterprises new controls to manage AI inference costs and reliability 3 Apr 2026, 12:12 pm

Google has added two new service tiers to the Gemini API that enable enterprise developers to control the cost and reliability of AI inference depending on how time-sensitive a given workload is.

While the cost of training large language models for artificial intelligence has been a concern in the past, the focus of attention is increasingly moving to inferencing, or the cost of using those models.

The new tiers, called Flex Inference and Priority Inference, address a problem that has grown more acute as enterprises move beyond simple AI chatbots into complex, multi-step agentic workflows, the company said in a blog post published Thursday.

In a separate announcement on the same day, Google also released Gemma 4, the latest generation of its open model family for developers who prefer to run models locally rather than via a paid API, describing it as its most capable open release to date.

The new API service tiers are intended to simplify life for developers of agentic systems involving background tasks that do not require instant responses and interactive, user-facing features where reliability is critical. Until now, supporting both workload types meant maintaining separate architectures: standard synchronous serving for real-time requests and the asynchronous Batch API for less time-sensitive jobs.

“Flex and Priority help to bridge this gap,” the post said. “You can now route background jobs to Flex and interactive jobs to Priority, both using standard synchronous endpoints.”

The two tiers operate through a single synchronous interface, with priority set via a service_tier parameter in the API request.
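As a sketch, routing by workload might look like the following. Only the `service_tier` parameter comes from Google's announcement; the rest of the payload shape is a placeholder to be checked against the Gemini API reference.

```python
import json

def build_request(prompt: str, interactive: bool) -> dict:
    """Build a Gemini-style request body. Only `service_tier` is taken
    from Google's announcement; the other field names are placeholders."""
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        # Interactive, user-facing calls get Priority; background jobs
        # that can tolerate latency get the half-price Flex tier.
        "service_tier": "priority" if interactive else "flex",
    }

body = build_request("Enrich this CRM record.", interactive=False)
print(json.dumps(body, indent=2))
```

The point of the design is visible even in this toy version: The routing decision is a one-field change on an otherwise identical synchronous request, rather than a switch to a separate batch architecture.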

Lower cost vs higher availability

Flex Inference is priced at 50% of the standard Gemini API rate, but offers reduced reliability and higher latency. It is suited for background CRM updates, large-scale research simulations, and agentic workflows “where the model ‘browses’ or ‘thinks’ in the background,” Google said. It is available to all paid-tier users for GenerateContent and Interactions API requests.

For enterprise platform teams, the practical value is that background AI workloads such as data enrichment, document processing, and automated reporting can be run at materially lower cost without a separate asynchronous architecture, and without the need to manage input/output files or poll for job completion.

Priority Inference gives requests the highest processing priority on Google’s infrastructure, “even during peak load,” the post stated.

However, once a customer’s traffic exceeds their Priority allocation, overflow requests, while not outright rejected, are automatically routed to the Standard tier instead.

“This keeps your application online and helps to ensure business continuity,” Google said, adding that the API response will indicate which tier handled each request, giving developers visibility into both performance and billing. Priority Inference is available to Tier 2 and Tier 3 paid projects.

But the downgrade mechanism raises concerns for regulated industries, according to Greyhound Research Chief Analyst Sanchit Vir Gogia.

“Two identical requests, submitted under different system conditions, can experience different latency, different prioritisation, and potentially different outcomes,” he said. “In isolation, this looks like a performance issue. In practice, it becomes an outcome integrity issue.”

For banking, insurance, and healthcare, he said, that variability raises direct questions around fairness, explainability, and auditability. “Graceful degradation, without full transparency and governance, is not resilience,” Gogia said. “It is ambiguity introduced into the system at scale.”

What it means for enterprise AI strategy

The new tiers are part of a broader industry shift toward tiered inference pricing that Gogia said reflects constrained AI infrastructure rather than purely commercial innovation.

“Tiered inference pricing is the clearest signal yet that AI compute is transitioning into a utility model,” he said, “but without the maturity, transparency, or standardisation that enterprises typically associate with utilities.” The underlying driver, he said, is structural scarcity — power availability, specialised hardware, and data centre capacity — and tiering is how providers are managing allocation under those constraints.

For CIOs and procurement teams, vendor contracts can no longer remain generic, Gogia said. “They must explicitly define service tiers, outline downgrade conditions, enforce performance guarantees, and establish mechanisms for cost control and auditability.”


Understanding the risks of OpenClaw 3 Apr 2026, 9:00 am

Let’s begin with the core question: Is OpenClaw a cloud entity or not? The best answer is a complicated “not exactly, but functionally, yes.”

OpenClaw AI Agent Platform is better viewed as an orchestration layer, runtime, or plumbing rather than a complete cloud platform. It provides the tools to build and manage agents but lacks the intelligence, data estate, control plane, or business context those agents need. In this way, OpenClaw functions as the connective tissue but not the final goal.

That distinction matters because many people confuse the shell with the system. OpenClaw itself may run locally, be deployed on infrastructure you control, or even be attached to local models in some cases. OpenClaw’s own documentation discusses support for local models, even while warning about context and safety limits, indicating that local deployment is possible in principle. But that does not mean the architecture is inherently local, self-contained, or disconnected from the outside world.

In practice, OpenClaw is only useful when it connects to other systems. Typically, this includes model endpoints, enterprise APIs, data stores, browser automation targets, SaaS applications, and line-of-business platforms. AWS Marketplace describes OpenClaw as “a one-click AI agent platform for browser automation on AWS” and clearly states that these agents are powered by Claude or OpenAI, making the dependency quite clear. In other words, the value doesn’t come from OpenClaw by itself but from what OpenClaw can access.

Utility from external services

This is where the conversation needs to become more mature. OpenClaw is really just the plumbing. The back-end capabilities need to be external services. These services can encompass a wide range of options. They might be local services if you choose that architecture. They could be APIs hosted within your own data center. They might be model servers utilizing dedicated GPUs. They can be internal microservices that expose business rules. Or they could be legacy systems wrapped with modern interfaces. In most enterprise deployments, these dependencies are typically remote large language models, cloud-hosted data platforms, SaaS systems, enterprise information systems, and externally exposed APIs. That’s generally where the functionality resides.

This is also why the question of whether OpenClaw is “cloud” misses the bigger issue. If the agents are calling OpenAI, Anthropic, or another remote model service, if they are reading Salesforce, Workday, ServiceNow, SAP, Oracle, Microsoft 365, or custom enterprise systems, or if they are executing workflows through cloud-hosted APIs, then you are already in a distributed cloud architecture, whether you admit it or not. The cloud is not just where code runs. The cloud is where dependencies, trust boundaries, identity, data movement, and operational risk accumulate.

OpenClaw’s public positioning reinforces this point. Its website describes it as an AI assistant that handles tasks like email management, calendar scheduling, and other actions via chat interfaces, which only function if integrated with external tools and services. So, no, OpenClaw is not “the cloud” in a strict definitional sense. But yes, it is often part of a cloud-based system.

The danger is not theoretical

This is where the hype machine often gets ahead of reality. Agentic AI sounds impressive in demos because the agent seems to reason, decide, and act. However, as soon as you give software agency over enterprise systems, you’re no longer talking about a chatbot. You are talking about delegated operational authority.

That should make people uneasy because of the clear security and safety concerns. There have already been public incidents of autonomous or semi-autonomous AI systems causing destructive actions. Reporting in July 2025 described a Replit AI coding agent deleting a live database during a code freeze, an event labeled as catastrophic. Ars Technica separately reported AI coding tools erasing user data while acting on incorrect assumptions about what needed to be done. This is exactly the kind of behavior enterprises should expect if they connect agents to critical systems without strong controls.

The problem isn’t that the agent is evil. The problem is that the agent is optimizing based on an incomplete model of reality. It might decide that cleaning up old records, resetting a broken environment, removing “duplicate” data, or closing “unused” accounts makes sense. It might even do so confidently. But none of that means it’s right. Logic without context can lead to lost databases, corrupted workflows, and compliance issues.

Even the broader OpenClaw discussion in the market has started to reflect this unease. Wired’s coverage of OpenClaw framed the experience as highly capable until it became untrustworthy, which is exactly the concern enterprises should be paying attention to. The problem is not whether agents can act. The problem is whether they can act safely, predictably, and within bounded authority.

Think like an architect

If an enterprise is considering OpenClaw as an AI agent platform or as part of a broader agentic AI strategy, there are three things it needs to understand.

First, the enterprise must understand security. Agents are not passive analytics tools; they can read, write, delete, trigger, purchase, notify, provision, and reconfigure. This means identity management, least-privilege access, secrets handling, audit trails, network segmentation, approval gates, and kill switches all become essential. If you would not give a summer intern unrestricted credentials to your ERP, CRM, and production databases, you should not give them to an agent either.
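
To make that concrete, here is a minimal, purely illustrative sketch of a deny-by-default permission gate for agent actions. The agent names, action strings, and policy table are invented for this example (not an OpenClaw or vendor API); the point is the pattern the paragraph describes: no explicit grant means no access, and side-effecting actions are blocked until a human approves.

```python
# Hypothetical policy table mapping agent identities to permitted actions.
# All names here are illustrative, not a real OpenClaw or vendor API.
POLICY = {
    "agent-calendar": {
        "allowed": {"calendar.read", "calendar.write"},
        "needs_approval": {"calendar.write"},  # approval gate before side effects
    },
    "agent-crm": {
        "allowed": {"crm.read"},
        "needs_approval": set(),
    },
}

def authorize(agent: str, action: str, approved: bool = False) -> bool:
    """Deny by default: unknown agents and unlisted actions are refused."""
    rules = POLICY.get(agent)
    if rules is None or action not in rules["allowed"]:
        return False  # least privilege: no explicit grant, no access
    if action in rules["needs_approval"] and not approved:
        return False  # gated action blocked until a human signs off
    return True

print(authorize("agent-crm", "crm.read"))                           # granted
print(authorize("agent-crm", "crm.delete"))                         # never granted
print(authorize("agent-calendar", "calendar.write"))                # blocked: no approval
print(authorize("agent-calendar", "calendar.write", approved=True))  # approved write
```

In a real deployment the policy table would live in an identity and access management system and every decision would be written to an audit trail, but the deny-by-default shape stays the same.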

Second, the enterprise needs to understand governance. Governance is not just a legal requirement; it is the operational discipline that defines what an agent is allowed to do, under what conditions, with which data, using which model, and with whose approval. You need policy enforcement, observability, human override, logging, reproducibility, and accountability. Otherwise, when something goes wrong—and eventually it will—you may have no idea whether the failure originated from the model, the prompt, the toolchain, the integration, the data, or the permissions layer.

Third, the enterprise must understand that this technology is justified only for specific use cases. Not every workflow requires an autonomous agent; in fact, most do not. Agentic AI should be employed only when there is enough process variability, decision complexity, and potential business benefit to outweigh the risks and overhead. If a deterministic workflow engine, a robotic process automation bot, a standard API integration, or a simple retrieval application can solve the problem, choose that instead. The most costly AI mistake today is unnecessary overengineering fueled by hype.

Hype ahead of value

Agentic AI is, in many ways, out over its skis. The market is selling aspiration faster than enterprises can handle operational reality. That doesn’t mean the technology is useless; it means the industry is doing what it always does: overpromising in year one, rationalizing in year two, and operationalizing in year three.

Enterprises, to their credit, seem to be advancing at their own pace with OpenClaw and related technologies. That is the right approach. They should experiment but within boundaries. They should innovate but with a solid architecture. They should automate but only where economics and risk profiles justify it.

The final point that many people still overlook is that cloud computing is already part of this system, whether most people realize it or not. If OpenClaw is connected to remote models, SaaS platforms, enterprise APIs, browser sessions, and data services, then enterprises have a cloud architecture challenge as much as an AI challenge. All the lessons from cloud computing still apply: design for control, resilience, observability, identity, data protection, and failure.

OpenClaw isn’t the cloud. But if you deploy it carelessly, it will expose you to every common cloud-era mistake, only faster and with more autonomy. Avoid trouble by learning to use this technology only when it is actually needed and not a minute before.


Local-first browser data gets real 3 Apr 2026, 9:00 am

If JavaScript were a character in a role-playing game, its class would be a Rogue. When it was a youngster, it was a street kid that lived on the margins of society. Over time, it has become an established figure in the enterprise hierarchy. But it never forgot where it came from, and you never know what sleight of hand it will perform next.

For example, fine-grained Signals are mounting a rebellion to overthrow the existing Virtual DOM hegemony. Incremental improvements to WebAssembly have reached the point where a real SQL database can be run inside the browser. Coupled with ingenious architectural patterns, this has opened up new possibilities in app data design. 

In other JS developments, the upstart performance runtime, Bun, has spawned a native app framework, Electrobun. Welcome to our latest roundup of the JavaScript news and noteworthy.

Top picks for JavaScript readers on InfoWorld

First look: Electrobun for TypeScript-powered desktop apps
Electron (the native-web bridge framework) has always struggled with performance. Electrobun is a (predictably named) new alternative built on the Bun runtime, known for its speed. Electrobun claims to produce far smaller bundles than regular Electron by dropping the bundled browser, and it comes with its own differential update technology to simplify patches.

The revenge of SQL: How a 50-year-old language reinvents itself
SQL is making an improbable comeback in the JavaScript world. Driven by the ability to run database engines like SQLite and PostgreSQL right inside the browser via WebAssembly, and the rise of the schemaless jsonb type, developers are discovering that boring old SQL is highly adaptable to the modern web.

Why local-first matters for JavaScript
Every developer should be paying attention to the local-first architecture movement. The emerging local-first SQL data stores crystallize ideas about client/server symmetry that have been a long time coming. This shift simplifies offline capabilities and fundamentally changes how we think about UI state.

Reactive state management with JavaScript Signals
State management remains one of the nastiest parts of front-end development. Signals have emerged as the dominant mechanism for dealing with reactive state, offering a more fine-grained and performant alternative to traditional Virtual DOM diffing. It is a vital pattern to understand as it sweeps across the framework landscape.

JavaScript news bites

More good reads and JavaScript updates elsewhere

Next.js 16.2 introduces features built specifically for AI agents
In a fascinating and forward-looking move, the latest Next.js release includes tools designed specifically to help AI agents build and debug applications. This includes an AGENTS.md file that feeds bundled documentation directly to large language models, automatic browser log forwarding to the terminal (where agents operate), and an experimental CLI that lets AI inspect React component trees without needing a visual browser window.

TypeScript 6.0 is GA
The smashingly popular superset of JavaScript has reached GA for version 6.0. This is the last release before Microsoft swaps out the current TypeScript-based compiler for one written in Go. The TypeScript 6.0 drop matters most as a bridge to the Go-based TypeScript 7.0, which the team says is coming soon (and is already available via an npm flag). If your code runs on TypeScript 6, you are in good shape for TypeScript 7.

Vite 8.0 arrives with unified Rolldown-based builds
Vite now uses Rolldown, the bundler/builder built in Rust, instead of esbuild for dev and Rollup for production. This move simplifies the architecture and brings speed benefits without breaking plugin compatibility. Pretty impressive. The Vite team also introduced a plugin registry at registry.vite.dev.


Claude Code leak puts enterprise trust at risk as security, governance concerns mount 3 Apr 2026, 12:47 am

Anthropic likes to talk about safety. It even risked the ire of the US Department of Defense (also known as the Department of War) over it. But two unrelated leaks in the space of a week have put the company in an unfamiliar spotlight: not for model performance or safety claims, but for its apparent difficulty in keeping sensitive parts of its AI tooling and strategy out of public view.

The exposure of Claude Code’s source code combined with a supply-chain scare, coming hard on the heels of a separate leak about its upcoming security-focused large language model (LLM), has given enterprise teams fresh reasons to question the AI tool’s integration in enterprise workflows, especially when considering security and governance, experts and analysts say.

Shreeya Deshpande, senior analyst at Everest Group, noted that this integration is what makes the product so valuable. “Claude Code is a powerful tool precisely because it has deep access to your development environment, it can read files, run shell commands, and interact with external services. By exposing the exact orchestration logic for how Claude Code manages permissions and interacts with external tools, attackers can now design malicious repositories specifically tailored to trick Claude Code into running unauthorized background commands or exfiltrating data,” she said.

Could change attacker tactics

At a deeper level, the leak may shift attacks from probabilistic probing to deterministic exploitation.

Jun Zhou, a full stack engineer at cybersecurity startup Straiker AI, claimed the source code leak changes the attacker's job. Instead of brute-forcing jailbreaks and prompt injections, attackers can now study and fuzz exactly how data flows through Claude Code's four-stage context management pipeline, then craft payloads designed to survive compaction, effectively persisting a backdoor across an arbitrarily long session.

Change in security posture

These security risks, Greyhound Research chief analyst Sanchit Vir Gogia said, will force enterprises to change their security posture around Claude Code and other AI coding tools: “Expect immediate moves towards environment isolation, stricter repository permissions, and enforced human review before any AI-generated output reaches production.”

In fact, according to Pareekh Jain, principal analyst at Pareekh Consulting, some enterprises will even pause expansion of Claude Code in their workflows, but few are expected to rip and replace immediately.

This is in large part due to the high switching costs around AI-based coding assistants, mainly driven by optimizations around workflow, model quality, approvals, connectors, and developer habits, Jain added.

Echoing Jain, Deshpande pointed out that enterprises might want to take a more strategic step: design AI integrations to be provider-agnostic, with clear abstraction layers that enable vendor switching within a reasonable timeframe.

She sees the source code leak as providing a boost to Claude Code’s rivals, especially the ones that are open source and model agnostic, driven by developer interest. “Model-agnostic alternatives like OpenCode, which let you use the same kind of agentic coding assistant with any underlying model, GPT, Gemini, DeepSeek, or others, are now being evaluated seriously by enterprises that previously hadn’t looked [at them],” Deshpande said.

Developers are voting with their attention, even if enterprise procurement moves more slowly, she added. “A repository called Claw Code, a rewrite of Claude Code’s functionality, reached over 145,000 GitHub stars in a single day, making it the fastest-growing repository in GitHub’s history.”

Has the damage been done?

That shift in developers’ attention, though, raises a broader question: has Anthropic ceded its coding advantage to rivals? Analysts and experts think the answer is nuanced: the leak may compress Anthropic’s lead, but is unlikely to wipe it out.

“The leak could allow competitors to reverse-engineer how Claude Code’s agentic harness works and accelerate their own development. That compression might be months, not years, but it’s real,” said Deshpande.

Pareekh Consulting’s Jain even went to the extent of comparing the leak to “giving competitors a free playbook”.

The evidence of the repercussions of the leak came from Anthropic’s initial actions; it reportedly issued 8,000 legal takedown notices to prevent the source code from being disseminated further via GitHub repositories and other public code-sharing platforms.

Later, it did scale back the notices to one repository and 96 forks, but that’s enough to underscore how quickly the code had already proliferated.

Flattened the playing field

Joshua Sum, co-founder of Solayer and colleague of Chaofan Shou, who was first to report the leak, wrote on LinkedIn that the lapse by Anthropic handed everyone a reference architecture that “shaved a year of reverse-engineering off every startup and enterprise’s roadmap”.

“This just flattened the playing field and set the standard for harness engineering,” Sum wrote, referring to the software and code that makes a large language model an actual tool, helping it interact with other tools and systems to understand and complete tasks asked of it.

Yet, beyond the immediate competitive shake-up, there may be a silver lining for enterprises, analysts say.

The prospect of rivals replicating Claude Code, or of enterprises building in-house alternatives, shifts the balance of power, giving enterprises more leverage over Anthropic, Deshpande said.

Fuels a call for transparency and governance

However, Jain pointed to a separate set of concerns around governance and transparency, driven by details of unreleased features that surfaced in the leak.

He said that enterprise procurement teams are likely to use the incident to push Anthropic for tighter release controls, clearer incident reporting, greater product transparency, and stronger indemnity clauses, particularly in light of exposed planned features such as “Undercover Mode” and “KAIROS.”

While KAIROS is a feature that would allow Claude Code to operate as a persistent, background agent, periodically fixing errors or running tasks on its own without waiting for human input, and even sending push notifications to users, Undercover Mode will allow Claude to make contributions to public open source repositories masquerading as a human being.

A proactive agent or feature like KAIROS, according to Deshpande, represents a fundamentally different governance challenge than the reactive agent Claude is today.

Deeper structural gaps

Greyhound Research’s Gogia, too, echoed that concern, pointing to a deeper structural gap in how enterprises are approaching these systems.

Enterprises, Gogia said, are rapidly adopting tools that can observe, decide, and act across environments, even as their governance models remain rooted in deterministic, predictable software.

“This incident exposes that mismatch clearly. It forces enterprises to confront foundational questions around access, execution, logging, review, and disclosure. If those answers are unclear, the issue is not the tool, the issue is readiness,” Gogia added.

Further, Deshpande noted that the window to define governance for always-on agents is before they launch, as enterprises will face immediate pressure to adopt them once released.

She also flagged Undercover Mode as a potential flashpoint for transparency and compliance concerns.

“While the feature is designed to prevent exposure of internal codenames and sensitive information by suppressing identifiable AI markers, it goes a step further by presenting outputs as human-written and removing attribution,” Deshpande said. “That creates clear risks around transparency, disclosure, and compliance, especially in environments where AI-generated contributions are expected to be explicitly identified.”

Added risks

Beyond transparency concerns, the issue also strikes at the heart of auditability and accountability in enterprise software development, Gogia pointed out, noting that attribution masking could have far-reaching implications.

“Software development depends on traceability: Every change must be attributable, auditable, and accountable,” Gogia said. “If an AI system can contribute to code while reducing visibility of its involvement, audit integrity becomes policy-dependent rather than system-enforced.”

He added that this shift introduces legal and compliance risks, complicating questions around intellectual property ownership, accountability for defects, and regulatory reporting.

More fundamentally, Gogia argued, the nature of AI systems is already evolving beyond traditional tooling. “The moment an AI system can act without clear attribution, it stops being a tool, it becomes an actor. And actors require governance frameworks, not usage guidelines,” the analyst said.


Kilo targets shadow AI agents with a managed enterprise platform 2 Apr 2026, 9:55 am

Kilo has launched KiloClaw for Organizations, a managed version of its OpenClaw platform aimed at enterprises seeking more control over how employees deploy AI agents for tasks such as repository monitoring, email drafting, and calendar management.

Co-founded by GitLab co-founder Sid Sijbrandij and Scott Breitenother, Kilo is building open-source coding and AI agent tools and is gaining attention by packaging that technology into managed services for enterprise use.

The new offering includes enterprise features such as single sign-on, SCIM provisioning, centralized billing, usage analytics, and admin controls, while shifting agent workloads from employee-managed infrastructure to managed environments with scoped access.

“Instead of agents running on developer-managed infrastructure with personal credentials, KiloClaw for Organizations runs agents in managed environments with scoped access and org-level controls,” the company said in a blog post.

The company also said it is encouraging organizations to give agents separate, limited-permission identities, such as scoped email and GitHub accounts, rather than allowing them to operate through employees’ own credentials.

KiloClaw for Organizations will be priced on a usage basis, with customers paying only for compute and inference consumption, either through their own model keys or via Kilo Gateway credits.

Enterprise implications

Kilo is targeting a problem many enterprises are only starting to confront: personal AI agents as the next form of shadow IT.

Omdia chief analyst Lian Jye Su said the rise of unmanaged orchestration tools represents a significant security gap. Without centralized oversight, such agents can create compliance blind spots and increase the risk of data leakage through untracked vulnerabilities.

“Right now, some of the biggest governance gaps we observe include a complete lack of transparency, credential sprawl, poor policies and guardrails, and siloed systems,” Su said.

Neil Shah, vice president for research at Counterpoint Research, said the trend mirrors the earlier bring-your-own-device wave, when personal devices entering the enterprise had to comply with IT policies before they could access company systems.

“There is a need for clear governance and transparency around what data and applications AI agents will access, manipulate, store, and automate,” Shah said. “This is what Kilo is trying to solve with multiple enterprise-grade integrations, admin controls, access controls, and usage analytics. This is a step in the right direction toward bringing enterprise-grade Claw agents into the workplace to drive personal productivity.”

Still, features such as SSO and SCIM are likely to be seen as baseline enterprise requirements rather than major differentiators. Buyers evaluating agent platforms for production use are likely to look for stronger controls around governance, compliance, and oversight.

Su said enterprises will need additional safeguards before deploying AI agents in production.

“Managed environments, especially sandboxes, ensure performance and security by design and should be deployed with an agent registry to ensure digital identity, access control, and capability mapping,” Su said. “Other recommended technical and operational safeguards include data governance, compliance and certification, and human-in-the-loop oversight.”

The dual-identity model

Kilo’s approach raises a broader question for enterprises about whether AI agents should eventually be managed less like software tools and more like digital workers.

That model is plausible, and may ultimately become necessary as agent use expands inside large organizations, according to Su.

“The dual-identity vision is forward-looking, plausible, and mandatory,” Su said. “The agent should be linked to a human worker to ensure accountability, proper authorization, access control, and human oversight. This means enterprises need to be equipped with identity and access management solutions, agent-specific observability and telemetry solutions, zero-trust security, and regular red-teaming to ensure agent reliability.”


Building enterprise voice AI agents: A UX approach 2 Apr 2026, 9:00 am

The voice AI agents market is projected to grow from $2.4 billion in 2024 to $47.5 billion by 2034, a 34.8% compound annual growth rate. Yet only 1% of enterprises consider their AI deployments “mature,” and fewer than 10% of AI use cases make it past the pilot stage.

The models work; the gap is in how these systems are designed for real human interaction in enterprise collaboration, where voice commands trigger workflows, meetings have audiences, and mistakes carry social weight. This article is about where those gaps live and how to close them.

Where enterprise voice AI breaks down

Some 81% of consumers now use voice technology daily or weekly, but satisfaction hasn’t kept up: 65% of voice assistant users report regular misunderstandings, and 41% admit to yelling at their voice assistant when things go wrong. These same people walk into work the next morning and are expected to trust a voice agent with their calendar, their meetings, and their messages. The frustration they’ve learned at home sets the baseline expectation at work.

Most teams look at numbers like these and reach for technical fixes: Better speech recognition models, lower Word Error Rate (WER), faster processing. But WER tells you how well your system transcribed audio. It says nothing about whether someone trusted the agent enough to use it in front of their manager, or whether they’ll open it again next week. In enterprise collaboration, one misunderstood instruction and someone has a calendar invite they never asked for.
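
For readers unfamiliar with the metric: WER is simply the word-level edit distance between the system’s transcript and a reference transcript, divided by the reference length. A minimal sketch (the example strings are invented):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming (Levenshtein) edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("send the invite to marketing", "send an invite to marketing"))  # 0.2
```

A WER of 0.2 here means one reference word in five was wrong, yet the substituted word (“the” versus “an”) changes nothing about whether the right invite gets sent. That is precisely the metric’s blind spot.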

The root of the problem is a design assumption that keeps getting repeated: Treating voice AI as text with a microphone attached. Voice has its own constraints. Anything beyond a 500ms response breaks conversational flow. Commands arrive mixed in with meeting crosstalk and open-office noise. Users can’t scroll back through what the agent said. And when the system gets something wrong in a meeting, the embarrassment lands differently than a typo in a chat window.

When you map user journeys for voice-driven enterprise workflows, the breakdowns don’t cluster around transcription failures. They cluster around moments of social risk: Issuing a command in front of an executive, trusting the system to send the right message or waiting in awkward silence while the agent processes. Nielsen’s usability heuristics help explain why. Visibility of system status means something entirely different in a voice-only interface where there’s no progress bar, no loading spinner. Users are left interpreting silence, and that ambiguity is one of the strongest predictors of early abandonment.

UX principles for building voice AI agents

There’s a reason conversations have rhythm. Sacks, Schegloff and Jefferson (1974) documented that people take turns in speech on roughly 200-ms cycles, regardless of language. When a voice agent takes even slightly longer than that, the interaction starts to feel off. People won’t say ‘the latency was too high’. They’ll say the thing felt clunky, or they’ll just stop using it.

This means agents need to acknowledge while processing. ‘Got it, looking that up…’ feels collaborative. People describe faster-responding systems as “more helpful” even when task completion rates are identical. Google’s Speech-to-Text documentation recommends 100-ms frame sizes for streaming applications. Dan Saffer’s work on microinteractions is useful here. Think about what makes a phone call feel natural: The ‘mm-hmm’ that says someone is listening, a pause before an answer, the rising voice inviting you to keep going. Voice agents need all of that. None of it shows up in a spec, but it separates a system people tolerate from one they want to use.
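
One way to implement acknowledge-while-processing is a simple latency budget: answer directly if the backend beats the turn-taking window, otherwise emit a filler phrase and keep waiting. A sketch using Python’s asyncio, where the 200-ms budget and the slow lookup are stand-ins for a real speech pipeline, not any vendor’s API:

```python
import asyncio

async def slow_lookup(query: str) -> str:
    # Stand-in for a slow backend call (search, retrieval, tool use).
    await asyncio.sleep(1.5)
    return f"Here's what I found for {query}."

async def respond(query: str) -> str:
    # Answer immediately if the lookup beats the ~200-ms turn-taking budget;
    # otherwise speak a filler acknowledgment so silence never exceeds it.
    task = asyncio.create_task(slow_lookup(query))
    try:
        # shield() keeps the lookup running even though the wait times out.
        return await asyncio.wait_for(asyncio.shield(task), timeout=0.2)
    except asyncio.TimeoutError:
        print("Got it, looking that up...")  # interim acknowledgment
        return await task

print(asyncio.run(respond("the quarterly roadmap")))
```

The same budget-then-filler structure applies whether the acknowledgment is printed text, a synthesized phrase, or an earcon.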

Recovery matters as much as performance. People are forgiving the first time a voice agent gets something wrong. The second time, doubt creeps in. By the third, they’ve filed it under “doesn’t work,” and trust erodes. The agent needs to state explicitly when it is confused or cannot give a correct response, and offer workarounds, such as the closest reference documents or next steps, to build trust and transparency.

Implicit confirmation is another principle that pays off immediately in enterprise settings. ‘I’ve sent an updated sales invoice to your inbox’ works better than ‘Do you want me to send the sales invoice? Please say yes or no’. There’s a half-second pause right before someone issues a voice command where they’re doubting whether the agent will give the right response and whether they should proceed. Good confirmation design takes that social risk down.

Finally, the environment is a design constraint, not a testing variable. Open offices, conference rooms, mobile use in transit, hybrid meetings: Each sounds different, and each creates different failure modes. Denoising and automatic speaker diarization aren’t nice-to-have features. They are table stakes.

The UX research playbook for building effective voice AI agents

Standard usability testing assumes the interface is visible and the system behaves the same way every time. Voice AI agents break both of those assumptions. The system’s behavior is non-deterministic, the interaction leaves no visual trace and the environment changes everything. The research approach has to account for all of that.

Contextual inquiry is essential because the acoustic environment is the primary design constraint. Observing someone use a voice agent while a coworker’s speakerphone bleeds through a conference room wall tells you more about what needs to change than any controlled study can. Think-aloud protocols need adaptation here too. Participants are already talking to the system, so concurrent think-aloud creates interference. The workaround is retrospective think-aloud with recordings, letting participants replay interactions and narrate what they were thinking at each point.

Field research only captures a snapshot, though. Diary studies take on a different role with AI voice agents than with traditional software. Instead of tracking feature usage, they track trust over time. Participants log not just what happened, but whether they’d repeat the interaction in front of colleagues. That’s how you spot trust starting to slip before your retention numbers do. Experience sampling picks up what even diary studies miss: You check in with people at random points while they’re actually using the agent, not after. Ask someone in a debrief and they’ll tell you it was fine. Their notes from the moment tell a different story.

Then there is quantitative UX research and behavioral data collection. Look at conversation logs: How often does the agent fall back to a generic response? Where do people abandon a request halfway through? Which user segments hit more errors than others? That data shows you where the system is failing at scale. Pairing it with qualitative findings turns isolated observations into product decisions.

But the numbers that matter most aren’t the obvious ones. The pattern that keeps showing up is how often task completion and user satisfaction tell completely different stories. Someone finishes a task and still walks away frustrated: ‘It worked but I wouldn’t do that again in a meeting’. You only catch that divergence by pairing something like the System Usability Scale with behavioral data and qualitative follow-ups. Measurement works best when you’re looking at multiple levels at once. At the conversation level, you care about how the agent handles interruptions and how often it hits a fallback. At the business level, the question is simple: Did people keep using it after the first week? The interesting stuff lives in the gaps between those levels, and you’ll only see it if research teams are involved from the beginning, not called in after the product decisions are already locked.
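
As a concrete illustration of mining conversation logs, assume a minimal schema of one record per agent turn with a session ID and an outcome label. The schema and the data below are invented for this sketch; real logs would be far richer, but the aggregation is the same:

```python
from collections import Counter

# Invented log records: one dict per agent turn.
logs = [
    {"session": "a1", "intent": "schedule", "outcome": "completed"},
    {"session": "a1", "intent": "unknown",  "outcome": "fallback"},
    {"session": "b2", "intent": "schedule", "outcome": "abandoned"},
    {"session": "b2", "intent": "email",    "outcome": "completed"},
]

outcomes = Counter(turn["outcome"] for turn in logs)
total = len(logs)
fallback_rate = outcomes["fallback"] / total   # generic-response fallbacks
abandon_rate = outcomes["abandoned"] / total   # requests dropped midway

print(f"fallback: {fallback_rate:.0%}, abandoned: {abandon_rate:.0%}")
```

On real logs you would segment these rates by user group and track them week over week; the divergence between completion and satisfaction described above tends to surface exactly there.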

Testing across the full range of speech patterns, accents and accessibility needs the product will encounter in production also reshapes product direction in ways teams don’t expect. The Speech Accessibility Project, run by the University of Illinois with Google, Apple and Amazon, trained models on a broader set of speech samples and saw accuracy jump by 18 to 60% for non-standard speech patterns. Card sorting exercises with diverse user groups regularly upend what product teams assumed users wanted. Also, curb-cut effects are real in voice AI: Building for users who depend entirely on voice produces better experiences across the board.

How UX research shapes agentic voice AI

When a voice agent moves from executing single commands to acting autonomously across enterprise workflows, the UX research problem changes. ‘Prepare tomorrow’s client meeting’ might involve pulling calendar data, finding documents and writing up a summary. Zoom’s AI Companion 3.0 works this way. The research question is no longer ‘did the system understand the words?’ It’s ‘does the person trust what the agent did on their behalf?’

The trust problem comes down to mental models. If someone says ‘reschedule tomorrow’s meetings’, they’re picturing the whole job: Check for conflicts, move the time slots, update the invites, notify the attendees. If the agent only moves the slots and silently drops the rest, that half-finished job feels worse than if it had just said ‘I can’t do that’. People shrug off an honest limitation. They don’t shrug off finding out an hour later nobody got notified.

What makes enterprise different is that the agent’s actions affect other people. An enterprise voice agent that misfires wastes your colleague’s time, sends your manager the wrong information or derails a meeting you weren’t even in. When the agent gets it wrong, other people pay the price and that makes people far less forgiving. A good way to catch these problems early in research is to ask participants to walk through what they expect the agent to do before it does it, then compare that against what actually happens. Those mismatches are early warnings. They’ll show up in your research months before they show up in support tickets or churn.

‘Least surprise’ carries extra weight in agentic contexts. Even when multiple things are happening behind the scenes, the person should get back one clear answer. Giving feedback during wait times, even “Let me pull together a few things for that,” buys the system a few seconds without silence. Journey mapping shows users lose confidence in the middle of a request, during that gap. That’s the moment to get right.

Teams also need to plan for novelty wearing off. Early on, people give the system a pass when it stumbles. That wears off fast. Around week two or three, the comparison shifts. People stop thinking ‘that’s pretty good for AI’ and start thinking ‘my admin assistant would have gotten that right’. At work, everyone already knows what competent help looks like: The assistant who juggles calendars, the IT person who fixes things without being asked twice, the colleague who never forgets to send the agenda. That’s the bar, and the only way to see whether the system is going to clear it over time is longitudinal research.

Design problems, not engineering ones

The problems with enterprise voice AI aren’t technical mysteries. The models work. What’s been missing is treating voice AI as a UX problem from the start, applying research practice to the specific challenges that voice and agentic AI create in enterprise collaboration. Social risk, autonomous trust decisions, the gap between what the system can do and what people will actually rely on: These are design problems, not engineering ones.

As voice AI agents grow more autonomous, the question researchers and builders should be asking together isn’t ‘does this work?’ It’s ‘do people trust it enough to let it act on their behalf, in front of other people, without checking its work first?’ That’s the real adoption threshold. The methods and principles to get there are well understood. What matters now is whether teams put UX researchers in the room early enough to use them.

Disclaimer: The views expressed in this article are my own and do not represent those of my employer.

This article is published as part of the Foundry Expert Contributor Network.


Spring AI tutorial: How to develop AI agents with Spring 2 Apr 2026, 9:00 am

Artificial intelligence and related technologies are evolving rapidly, but until recently, Java developers had few options for integrating AI capabilities directly into Spring-based applications. Spring AI changes that by leveraging familiar Spring conventions such as dependency injection and the configuration-first philosophy in a modern AI development framework.

My last tutorial demonstrated how to configure Spring AI to use a large language model (LLM) to send questions and receive answers. While this can be very useful, it does not unlock all the power that AI agents provide. In this article, you will learn exactly what an agent is and how to build one manually, then you’ll see how to leverage Spring AI’s advanced capabilities and support for building robust agents using familiar Spring conventions.

What is an AI agent?

Before we dive into building an AI agent, let’s review what an agent actually is. While a standard LLM interaction consists of sending a request and receiving a response, an agent is more than a chatbot: It carries out a more involved sequence of tasks. An AI agent typically performs the following steps in order, a sequence we call the agent loop:

  • Receives a goal
  • Interprets the user’s intent
  • Plans actions
  • Selects tools
  • Executes tools
  • Observes results
  • Refines strategy
  • Iterates the process
  • Produces a final answer
  • Terminates safely

In essence, an agent accepts a user request, uses an LLM to interpret what the user really wants, and decides if it can respond directly or if it needs external support. Once a request is accepted, the agent chooses the tools it will use from the set provided, calls tools for any information it needs, and receives and incorporates that output into its working context. Next, it decides whether the preliminary result is sufficient or if it needs to call additional tools to reach a satisfactory end. The agent repeats this plan-act-observe cycle until the objective is satisfied. Once satisfied, it returns a completed answer. It stops execution based on a completion indicator, safety checks, or the given iteration limit.
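The loop described above can be sketched in a few lines of framework-free Java. This is an illustrative skeleton only; Llm, Tool, and Decision here are hypothetical stand-ins, not Spring AI types:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative skeleton of the agent loop; Llm, Tool, and Decision are
// hypothetical stand-ins, not Spring AI types.
class AgentLoopSketch {

    interface Llm { Decision decide(List<String> context); }
    interface Tool { String run(String argument); }
    record Decision(boolean done, String toolName, String argument, String answer) {}

    static String agentLoop(Llm llm, Map<String, Tool> tools, String goal, int maxIterations) {
        List<String> context = new ArrayList<>(List.of(goal)); // working context
        for (int i = 0; i < maxIterations; i++) {
            Decision d = llm.decide(context);               // interpret intent / plan
            if (d.done()) {
                return d.answer();                          // objective satisfied
            }
            Tool tool = tools.get(d.toolName());            // select a tool
            String observation = tool.run(d.argument());    // execute it
            context.add("observed:" + observation);         // feed the result back
        }
        return "stopped: iteration limit reached";          // terminate safely
    }
}
```

The agent we build in this tutorial follows this same shape: The LLM produces the decision, the tool queries the database, and the iteration cap keeps the loop from running away.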

The following diagram visualizes the agent loop:

Flow diagram of the AI agent task loop.

Steven Haines

If this sounds a little abstract, try asking your favorite chatbot, such as ChatGPT, to help you do something that requires a knowledge base and a few steps. In the example below, I prompted ChatGPT to help me bake a cake:

I want to bake a cake. Can you tell me what to do step-by-step, one step at a time? Tell me each step to perform and I will tell you the results. Please start with the first step.

The model in this case responded with a list of ingredients, then asked if I had everything I needed. I responded that I did not have eggs, so it offered a list of substitutions. Once I had all the ingredients, the model told me to mix them and continued with step-by-step instructions to bake a cake. As a test, once the cake was baking, I reported that I thought it might be burning. The model responded that I should turn down the oven temperature, cover the cake with aluminum foil, and describe what it looked like to determine if it could be salvaged.

So, in this exercise, the LLM planned out what to do, walked through the process one step at a time, and used me as a “tool” to perform the actions and report the results. When things didn’t go as expected, such as missing ingredients or a burning cake, it adapted its plan to still achieve the objective. This is exactly what agents do, except they rely on a set of programmatic tools, rather than a hungry human, to perform the needed actions. It may be a silly example, but it illustrates the key elements of agent behavior: planning, using tools, and adapting to changing circumstances.

As another example, consider the difference between using a ChatGPT conversation to generate code versus using an AI coding tool like Claude. ChatGPT responds to your prompts with code to copy-and-paste into your application. It is up to you to paste in the code, and also build and test it. Claude, on the other hand, has its own tools and processes. Namely, it can search through the files on your file system, create new files, run build scripts like Maven, see the results, and fix build errors. Whereas ChatGPT is a chatbot that relies on you to do the work, Claude is a complete coding agent: You provide it with an objective and it does the coding for you.

Also see: What I learned using Claude Sonnet to migrate Python to Rust.

Building a Spring AI agent

Now that you have a sense of what an AI agent is, let’s build one with Spring AI. We’ll do this in two phases: First, we’ll build our own agent loop and do everything manually, so that you can understand exactly how agents work and what Spring AI does behind the scenes; then we’ll leverage the capabilities built into Spring AI to make our job easier.

For our example, we’ll build the product search agent illustrated in the diagram below:

Diagram of the product search agent architecture.

Steven Haines

Note that this demonstration assumes you are familiar with Java development and with Spring coding conventions.

Defining the product search tool

To start, we have a database that contains over 100 products and a Spring MVC controller to which we can POST a natural language query for products. As an example, we might enter, “I want sports shoes that cost under $120.” The controller calls a service that leverages our product search agent to work with an LLM and searches the database. The tool that we’re building uses a repository that has a simple keyword search query that runs against product names and descriptions. The LLM is responsible for determining the user’s intent, choosing the most applicable keywords to search for, calling the tool to retrieve products that match each keyword, and returning the list of relevant products.

Here’s the Product class:

package com.infoworld.springagentdemo.model;

import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;

@Entity
public class Product {
    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    private Long id;
    private String name;
    private String description;
    private String category;
    private Float price;

    public Long getId() {
        return id;
    }
    public void setId(Long id) {
        this.id = id;
    }
    public String getName() {
        return name;
    }
    public void setName(String name) {
        this.name = name;
    }
    public String getDescription() {
        return description;
    }
    public void setDescription(String description) {
        this.description = description;
    }
    public String getCategory() {
        return category;
    }
    public void setCategory(String category) {
        this.category = category;
    }
    public Float getPrice() {
        return price;
    }
    public void setPrice(Float price) {
        this.price = price;
    }
}

The Product class is a JPA entity with an id, name, description, category, and price. The repository is a JpaRepository that manages products:

package com.infoworld.springagentdemo.repository;

import java.util.List;

import com.infoworld.springagentdemo.model.Product;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

public interface ProductRepository extends JpaRepository<Product, Long> {
    @Query("""
       SELECT p FROM Product p
       WHERE lower(p.name) LIKE lower(concat('%', :query, '%'))
          OR lower(p.description) LIKE lower(concat('%', :query, '%'))
    """)
    List<Product> search(@Param("query") String query);
}

We added a custom search method with a query that returns all products with a name or description that matches the specified query string.
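For intuition, the matching rule the JPQL query implements is equivalent to this plain-Java check (illustrative only; the real matching happens in the database via LIKE):

```java
// Plain-Java equivalent of the repository's JPQL query: a case-insensitive
// substring match against the product name and description.
// (Illustrative only; the actual matching is done by the database.)
class KeywordMatch {
    static boolean matches(String name, String description, String query) {
        String q = query.toLowerCase();
        return name.toLowerCase().contains(q)
                || description.toLowerCase().contains(q);
    }
}
```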

Now let’s look at the ProductSearchTools class:

package com.infoworld.springagentdemo.ai.tools;

import java.util.List;

import com.infoworld.springagentdemo.model.Product;
import com.infoworld.springagentdemo.repository.ProductRepository;

import org.springframework.ai.tool.annotation.Tool;
import org.springframework.stereotype.Component;

@Component
public class ProductSearchTools {

    private final ProductRepository repository;

    ProductSearchTools(ProductRepository repository) {
        this.repository = repository;
    }

    @Tool(description = "Search products by keyword")
    public List<Product> searchProducts(String keyword) {
        return repository.search(keyword);
    }
}

The ProductSearchTools class is a Spring-managed bean, annotated with the @Component annotation, and defines a searchProducts() method that calls the repository’s search() method. You’ll learn more about the @Tool annotation when we use Spring AI’s built-in support for tools. For now, just note that this annotation marks a method as a tool that the LLM can call.

Developing the agent

With the tool defined, let’s look at the ManualProductSearchAgent, which is the explicit version of our search agent in which we define our agent loop manually:


package com.infoworld.springagentdemo.ai.agent;

import java.util.ArrayList;
import java.util.List;

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.infoworld.springagentdemo.ai.tools.ProductSearchTools;
import com.infoworld.springagentdemo.model.Product;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.messages.AssistantMessage;
import org.springframework.ai.chat.messages.Message;
import org.springframework.ai.chat.messages.SystemMessage;
import org.springframework.ai.chat.messages.UserMessage;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.stereotype.Component;

@Component
public class ManualProductSearchAgent {

    private final ChatClient chatClient;
    private final ProductSearchTools productSearchTools;
    private final ObjectMapper objectMapper;
    private static final int MAX_ITERATIONS = 10;

    public ManualProductSearchAgent(ChatClient.Builder chatClientBuilder,
                                    ProductSearchTools productSearchTools) {
        this.chatClient = chatClientBuilder.build();
        this.productSearchTools = productSearchTools;
        this.objectMapper = new ObjectMapper();
    }

    public List<Product> search(String userInput) {

        List<Message> messages = new ArrayList<>();

        // System Prompt with Tool Specification
        messages.add(new SystemMessage("""
                You are a product search agent.
                
                You have access to the following tool:
                
                Tool Name: searchProducts
                Description: Search products by keyword
                Parameters:
                {
                  "keyword": "string"
                }
                
                You may call this tool multiple times to refine your search.
                
                If the user request is vague, make reasonable assumptions.
                
                If the user asks about products in a certain price range, first search for the products and then filter
                the results based on the price. Each product is defined with a price.
                
                You must respond ONLY in valid JSON using one of these formats:
                
                To call a tool:
                {
                  "action": "tool",
                  "toolName": "searchProducts",
                  "arguments": {
                    "keyword": "..."
                  }
                }
                
                When finished:
                {
                  "action": "done",
                  "answer": "final response text",
                  "products": "a list of matching products"
                }
                
                Do not return conversational text.
                """));

        messages.add(new UserMessage(userInput));

        // Manual Agent Loop
        int iteration = 0;
        while (iteration++ < MAX_ITERATIONS) {

            // Ask the LLM for its next decision
            String response = chatClient
                    .prompt(new Prompt(messages))
                    .call()
                    .content();

            try {
                AgentDecision decision =
                        objectMapper.readValue(response, AgentDecision.class);

                if ("done".equals(decision.action())) {
                    return decision.products();
                }

                if ("tool".equals(decision.action())) {
                    if ("searchProducts".equals(decision.toolName())) {
                        String keyword =
                                (String) decision.arguments().get("keyword");

                        List<Product> result =
                                productSearchTools.searchProducts(keyword);

                        String observation =
                                objectMapper.writeValueAsString(result);

                        // Feed Observation Back Into Context
                        messages.add(new AssistantMessage(response));

                        messages.add(new SystemMessage("""
                                Tool result from searchProducts:
                                """ + observation));
                    }
                }
            } catch (JsonProcessingException e) {
                System.out.println(e.getMessage());
            }
        }
        return new ArrayList<>();
    }
}

The ManualProductSearchAgent constructor accepts a ChatClient.Builder that it uses to build a ChatClient. If you have not yet read the Getting Started with Spring AI article, the ChatClient class is Spring AI’s abstraction for interacting with an LLM. It is configured in the application.yaml file as follows:


spring:
  application:
    name: spring-aiagent-demo
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: gpt-5
          temperature: 1
  jpa:
    defer-datasource-initialization: true

In this case, I opted to use OpenAI and pass in my API key as an environment variable. The configuration uses the gpt-5 model with a temperature of 1, which that model requires. (See the first tutorial if you need more information.) If you download the source code and define an OPENAI_API_KEY environment variable, you should be able to run the code.

Next, the constructor accepts a ProductSearchTools instance and then creates a Jackson ObjectMapper to deserialize JSON into Java classes. The search() method is where the agent is defined. First, it maintains a list of messages that will be sent to the LLM. These come in three forms:

  • SystemMessage: The message that defines the role of the agent. It defines the steps it should take, as well as the rules it should follow.
  • UserMessage: The message that the user passed in, such as “I want sports shoes that cost less than $120.”
  • AssistantMessage: These messages contain the history of the conversation so that the LLM can follow the conversation.

The above prompt defines the initial system message. We inform the LLM that it is a product search agent that has access to one tool: the searchProducts tool. We provide a description of the tool and tell the LLM that it must pass a keyword parameter as a String. Next, we tell it that it can call the tool multiple times and give it some additional instructions. I purposely added the instruction that if the user asks for products in a certain price range, the LLM should first search for the products and then filter on the price. Before I added this instruction, the LLM included the price in the search, which yielded no results. The key takeaway here is that you are going to need to experiment with your prompt to get the results you are seeking.
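What that instruction asks the model to do is conceptually just a post-filter over the tool results. A hypothetical plain-Java version of the same step, for illustration (in the article, the LLM itself performs this filtering):

```java
import java.util.List;

// Search first, then filter on price: the step the prompt asks the LLM
// to perform, expressed as code. (Illustrative only; in the article the
// model does this filtering on the tool results itself.)
class PriceFilter {
    record Product(String name, float price) {}

    static List<Product> underPrice(List<Product> products, float maxPrice) {
        return products.stream()
                .filter(p -> p.price() < maxPrice) // keep only products below the cap
                .toList();
    }
}
```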

Next, we tell the LLM that, to call a tool, it should return an action of “tool” and a tool name and arguments. If we gave it more tools, it is important that it tells us exactly what tool to execute. Finally, we define the format of the message it should return when it is finished; namely, an action of “done,” an answer String, and a list of products.

After adding our prompt as a SystemMessage, we add the user’s query as a UserMessage. Now, the LLM knows what it is supposed to do, what tools it has access to, and the goal that it must accomplish.

Implementing the agent loop

Next, we implement our agent loop. We defined a MAX_ITERATIONS constant of 10, which means that we will only call the LLM a maximum of 10 times. The number of iterations you need in your agent will depend on what you are trying to accomplish, but the purpose is to restrict the total number of LLM calls. You would not want it to get into an infinite loop and consume all your API tokens.

The first thing we do in our agent loop is construct a prompt from our list of messages and call the LLM. The content() method returns the LLM response as a String. We could have used the entity() method to convert the response to an AgentDecision class instance, but we leave it as a String and manually convert it using Jackson so that we can add the response as an AssistantMessage later to keep track of the conversation history. An AgentDecision is defined as follows:

package com.infoworld.springagentdemo.ai.agent;

import java.util.List;
import java.util.Map;

import com.infoworld.springagentdemo.model.Product;

public record AgentDecision(
    String action,
    String toolName,
    Map<String, Object> arguments,
    String answer,
    List<Product> products) {
}

We check the AgentDecision action to see if it is “done” or if it wants to invoke a “tool.” If it is done, we return the list of products it found. If it wants to invoke a tool, we check the requested tool against the name “searchProducts,” extract the keyword argument it wants to search for, and call the searchProducts() method on ProductSearchTools. We then serialize the query response and add it as a new SystemMessage, and we store the LLM’s request for the tool call as an AssistantMessage.

We continue the process until we reach the maximum number of iterations or the LLM reports that it is done.

Testing the AI agent

You can use the following controller to test the agent:

package com.infoworld.springagentdemo.web;

import java.util.List;

import com.infoworld.springagentdemo.model.Product;
import com.infoworld.springagentdemo.model.SearchRequest;
import com.infoworld.springagentdemo.service.ProductService;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class ProductController {
    private ProductService productService;

    public ProductController(ProductService productService) {
        this.productService = productService;
    }

    @GetMapping("/products")
    public List<Product> getProducts() {
        return productService.findAll();
    }

    @PostMapping("/search")
    public List<Product> searchProducts(@RequestBody SearchRequest request) {
        return productService.findProducts(request.query());
    }

    @PostMapping("/manualsearch")
    public List<Product> manualSearchProducts(@RequestBody SearchRequest request) {
        return productService.findProductsManual(request.query());
    }
}

This controller has a getProducts() method that returns all products, a searchProducts() method that will use Spring AI’s built-in support for tools, and a manualSearchProducts() method that calls the agent we just built. The SearchRequest is a simple Java record and is defined as follows:

package com.infoworld.springagentdemo.model;

public record SearchRequest(String query) {
}

The ProductService is a passthrough service that invokes the agent, or the repository in the case of listing all products:

package com.infoworld.springagentdemo.service;

import java.util.List;

import com.infoworld.springagentdemo.ai.agent.ManualProductSearchAgent;
import com.infoworld.springagentdemo.ai.agent.ProductSearchAgent;
import com.infoworld.springagentdemo.model.Product;
import com.infoworld.springagentdemo.repository.ProductRepository;

import org.springframework.stereotype.Service;

@Service
public class ProductService {
    private final ProductRepository productRepository;
    private final ProductSearchAgent productSearchAgent;
    private final ManualProductSearchAgent manualProductSearchAgent;

    public ProductService(ProductRepository productRepository, ProductSearchAgent productSearchAgent, ManualProductSearchAgent manualProductSearchAgent) {
        this.productRepository = productRepository;
        this.productSearchAgent = productSearchAgent;
        this.manualProductSearchAgent = manualProductSearchAgent;
    }

    public List<Product> findAll() {
        return productRepository.findAll();
    }

    public List<Product> findProducts(String query) {
        return productSearchAgent.run(query);
    }

    public List<Product> findProductsManual(String query) {
        return manualProductSearchAgent.search(query);
    }
}

You can test the application by POSTing a request to /manualsearch with the following body:

{
    "query": "I want sports shoes under $120"
}

Your results may be different from mine, but I saw the LLM searching for the following keywords:

Searching products by keyword: sports shoes
Searching products by keyword: running shoes
Searching products by keyword: sports shoes
Searching products by keyword: running shoes
Searching products by keyword: athletic shoes

And I received the following response:


[
    {
        "category": "Clothing",
        "description": "Lightweight mesh running sneakers",
        "id": 24,
        "name": "Running Shoes",
        "price": 109.99
    },
    {
        "category": "Clothing",
        "description": "Cross-training athletic shoes",
        "id": 83,
        "name": "Training Shoes",
        "price": 109.99
    }
]

So, the agent effectively determined what I meant by “sports shoes,” selected some relevant keywords to search for, filtered the products based on price, and returned a list of two options for me. Because LLMs are not deterministic, your results may be different from mine. For example, in other runs with the same query, the agent searched for different keywords and returned a larger list. But being able to translate a natural language query into a set of database queries and find relevant results is impressive!

Spring AI’s built-in support for developing agents

Now that you understand what an agent loop is, what it does, and how to handle tool executions, let’s look at Spring AI’s built-in support for managing its own agent loop and tool execution. Here is our updated ProductSearchAgent code:

package com.infoworld.springagentdemo.ai.agent;

import java.util.ArrayList;
import java.util.List;

import com.infoworld.springagentdemo.ai.tools.ProductSearchTools;
import com.infoworld.springagentdemo.model.Product;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.messages.Message;
import org.springframework.ai.chat.messages.SystemMessage;
import org.springframework.ai.chat.messages.UserMessage;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.ai.tool.method.MethodToolCallbackProvider;
import org.springframework.stereotype.Component;

@Component
public class ProductSearchAgent {

    private final ChatClient chatClient;
    private final ProductSearchTools productSearchTools;

    public ProductSearchAgent(ChatClient.Builder chatClientBuilder, ProductSearchTools productSearchTools) {
        this.chatClient =  chatClientBuilder.build();
        this.productSearchTools = productSearchTools;
    }

    public List<Product> run(String userRequest) {

        Prompt prompt = buildPrompt(userRequest);

        AgentResponse response = chatClient
                .prompt(prompt)
                .toolCallbacks(
                        MethodToolCallbackProvider.builder().toolObjects(productSearchTools).build()
                )
                .call()
                .entity(AgentResponse.class);

        System.out.println(response.answer());
        return response.products();
    }

    private Prompt buildPrompt(String userRequest) {

        List<Message> messages = new ArrayList<>();

        // 1. System message: defines the agent
        messages.add(new SystemMessage("""
You are a product search agent.

Your responsibility is to help users find relevant products using the available tools.

Guidelines:
- Use the provided tools whenever product data is required.
- You may call tools multiple times to refine or expand the search.
- If the request is vague, make reasonable assumptions and attempt a search.
- Do not ask follow-up questions.
- Continue using tools until you are confident you have the best possible results.

If the user asks about products in a certain price range, first search for the products and then filter
the results based on the price. Each product is defined with a price.

When you have completed the search process, return a structured JSON response in this format:

{
  "answer": "...",
  "products": [...]
}

Do not return conversational text.
Return only valid JSON.
"""));

        // Add the user's request
        messages.add(new UserMessage(userRequest));

        return new Prompt(messages);
    }
}

As I mentioned earlier, the searchProducts() method in ProductSearchTools is annotated with the @Tool annotation. This annotation takes on special meaning for Spring AI when we add a toolCallbacks() method call to our LLM call. In this case, we autowire the ProductSearchTools into our constructor and then invoke the toolCallbacks() method in our LLM call, passing it all the classes containing tools we want the LLM to access via a MethodToolCallbackProvider.builder().toolObjects() call. Spring AI will see this list of tools and do a few things:

  1. Introspect all methods annotated with the @Tool annotation in the provided classes.
  2. Build the tool specification and pass it to the LLM for us, including the description of the tool and the method signature, which means we no longer need to explicitly define the tool specification in our system prompt.
  3. Because it has access to the tools, the ChatClient’s call() method will run its own agent loop and invoke the tools it needs for us.

Therefore, the response we receive will be the final response from the LLM with our list of products, so we do not need to build an agent loop ourselves. We build our prompt with a system prompt (which again does not have the tool specification) and the user’s request. We then make a single call to the call() method, which performs all the actions it needs to arrive at a conclusion.

You can test it by executing a POST request to /search with the same SearchRequest payload and you should see similar results. Claude was kind enough to generate my test products for me, so feel free to search for shirts, jackets, pants, shoes, and boots. You can find the full list of products preconfigured in the database in the src/resources/import.sql file.

Conclusion

This tutorial introduced you to using Spring AI to build AI agents. We began by reviewing what an agent is. In its simplest form, an agent is a component that receives an objective, makes repeated calls to an LLM to form a step-by-step plan, and then executes that plan using whatever tools it has been given.

To give you a solid sense of what agents are, we manually built an agent loop, executed tools, and interacted with the LLM through SystemMessages, AssistantMessages, and UserMessages. Then, we leveraged Spring AI’s capabilities to let it run the agent loop and execute tools on its own. Spring AI gives Spring developers the tools needed to build complex AI applications: an LLM abstraction through the ChatClient class and YAML configuration, built-in support for discovering and executing tools, and a built-in agent loop that removes the complexity of writing that code yourself.

With what you learned in this tutorial, you should be able to start building Spring AI agents on your own. You could try developing your own coding assistant, an agent that downloads and summarizes articles from the Internet, or even an agent that translates natural language into database queries. All you need to do is build the tools, write the prompts, and leverage Spring AI’s agent development capabilities and support.


Why ‘curate first, annotate smarter’ is reshaping computer vision development 2 Apr 2026, 9:00 am

Computer vision teams face an uncomfortable reality. Even as annotation costs continue to rise, research consistently shows that teams annotate far more data than they actually need. Sometimes teams annotate the wrong data entirely, contributing little to model improvements. In fact, by some estimates, 95% of data annotations go to waste.

The problem extends beyond cost. As I explored in my previous article on annotation quality, error rates average 10% in production machine learning (ML) applications. But there’s a deeper issue that precedes annotation quality: Most teams never develop systematic approaches to selecting which data needs annotation in the first place. This is largely because annotation often remains siloed from data curation and model evaluation, making it impossible to act on the full picture.

Safety-critical models, such as models for autonomous vehicles (AV) with multi-sensor perception stacks, require highly accurate 2D bounding boxes and 3D cuboid annotations. Without intelligent data selection, teams find themselves not only collecting vast amounts of data but also labeling millions of redundant samples while missing the edge cases that actually improve model performance.

When tools become barriers

The conventional approach treats annotation as an isolated workflow: Collect data, export to a labeling platform, wait for humans to label data, import labels, discover problems, go back to the annotation vendor, and repeat. This fragmentation creates two critical gaps that turn annotation into a development bottleneck rather than an enabling capability.

No systematic data selection

Random sampling and “label everything” approaches waste annotation budgets on redundant samples. Teams annotating AV datasets might label 100,000 highway cruise images that provide minimal new information while missing rare scenarios like emergency vehicle encounters or unusual weather conditions.

Lost context across tool boundaries

When annotation lives in one platform, curation in another, and model evaluation in a third, teams lose critical context at each handoff. Data scientists spend 80% of their time curating data, yet most of this effort happens in ad hoc, disconnected ways that don’t inform downstream annotation decisions.

Some estimates indicate that ~45% of companies now use four or more tools simultaneously, cobbling together partial solutions that impact budgets and timelines.

Curate first: A paradigm shift in ML workflows

The “curate first, then annotate” approach inverts the conventional wisdom. Instead of treating data curation as a second step in development, curation becomes the foundation that drives intelligent annotation decisions. This methodology recognizes that annotation isn’t primarily a labeling problem—it’s a data understanding problem.

Strategic data selection focuses on annotation where it matters

Zero-shot coreset selection represents a breakthrough in pre-annotation intelligence. Using pre-trained foundation models to analyze unlabeled data, this technique scores each sample based on unique information contribution, automatically filtering redundant examples.

The methodology works through iterative subspace sampling:

  1. Embedding computation: Foundation models generate high-dimensional representations capturing semantic content.
  2. Uniqueness scoring: Each sample receives a score indicating information diversity relative to existing selections.
  3. Iterative selection: Samples with the highest uniqueness scores enter the training set.
  4. Redundancy elimination: Visually similar samples get deprioritized automatically.

Benchmarks on ImageNet demonstrate that this approach achieves the same model accuracy with just 10% of training data, eliminating annotation costs for over 1.15 million images.
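The four steps above can be sketched from scratch as a greedy farthest-point selection over embeddings. This is an illustrative toy version, not the ZCore implementation itself; the embedding matrix and coreset size are assumed inputs:

```python
import numpy as np

def greedy_coreset(embeddings: np.ndarray, k: int, start: int = 0) -> list[int]:
    """Farthest-point sampling: each pick maximizes distance to the
    closest already-selected sample (steps 2-4 of the list above)."""
    selected = [start]  # arbitrary starting sample
    # Distance from every sample to its nearest selected sample
    min_dist = np.linalg.norm(embeddings - embeddings[start], axis=1)
    while len(selected) < k:
        idx = int(np.argmax(min_dist))         # highest uniqueness score
        selected.append(idx)
        dist = np.linalg.norm(embeddings - embeddings[idx], axis=1)
        min_dist = np.minimum(min_dist, dist)  # deprioritize near-duplicates
    return selected

# Toy "embeddings": two tight clusters plus an outlier. A 3-sample
# coreset should cover all three regions instead of one dense cluster.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0],
              [10.0, 0.0]])
picks = greedy_coreset(X, k=3)
```

Production systems replace the exact distance scan with approximate nearest-neighbor search, but the selection logic is the same.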

Zero-shot coreset selection process to prioritize the right data for model training. Source: Voxel51

To put it in perspective, for a 100,000-image dataset at typical rates of $0.05 to $0.09 per object, strategic selection can save ~$81K in annotation costs while improving model generalization on edge cases.
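One way the ~$81K figure can be reproduced: if a 10% coreset suffices, the other 90,000 images never need labels. The per-image object count below is an assumption for illustration:

```python
total_images = 100_000
coreset_fraction = 0.10        # keep the 10% most informative samples
objects_per_image = 10         # assumed average; varies by dataset
cost_per_object = 0.09         # top of the quoted $0.05-$0.09 range

skipped_images = total_images * (1 - coreset_fraction)  # 90,000 never labeled
savings = round(skipped_images * objects_per_image * cost_per_object)
print(f"${savings:,}")         # → $81,000
```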

Programmatically:

import fiftyone.zoo as foz
from zcore import zcore_scores, select_coreset

dataset = foz.load_zoo_dataset("quickstart")
model = foz.load_zoo_model("clip-vit-base32-torch")
embeddings = dataset.compute_embeddings(model, batch_size=2)

scores = zcore_scores(embeddings, use_multiprocessing=True, num_workers=4)
coreset = select_coreset(dataset, scores, coreset_size=int(0.1 * len(dataset)))

Embedding-based curation

This approach surfaces the samples that will contribute most to model learning, transforming annotation from a volume game into a strategic exercise.

Modern platforms enable embedding-based curation through straightforward workflows. For example, you can leverage computed embeddings to identify the most unique samples in the embedding space using a k-nearest-neighbors calculation. Those samples are then prioritized for annotation.

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

# Load your unlabeled dataset
dataset = fo.Dataset.from_dir(
    dataset_dir="/path/to/images",
    dataset_type=fo.types.ImageDirectory,
)

# Generate embeddings using pre-trained model
model = foz.load_zoo_model("clip-vit-base32-torch")
dataset.compute_embeddings(model, embeddings_field="embeddings")

# Perform uniqueness-based selection
fob.compute_uniqueness(dataset, embeddings="embeddings")

# Sort by uniqueness score to prioritize diverse samples
unique_view = dataset.sort_by("uniqueness", reverse=True)

# Select top 10% most informative samples for annotation
samples_to_annotate = unique_view.take(len(dataset) // 10)

Embedding-based curation surfaces the samples that will contribute most to model learning. Source: Voxel51

Model analysis results feed into prioritizing what to label

Once you have trained a baseline model on your initial curated subset, you can shift from pure data exploration to targeted improvement. Instead of randomly selecting the next batch, use the model’s own predictions to identify “hard” samples where the model is confused or uncertain.

The most effective workflow intersects uncertainty with uniqueness. This ensures you prioritize valid edge cases that drive better model understanding, rather than just noise (for example, blurry images which are inherently low-confidence).

We can filter programmatically for this “Goldilocks zone” of high uniqueness and low confidence.

from fiftyone import ViewField as F

# Filter for samples where confidence is low and uniqueness is high
# (thresholds below are illustrative; tune them for your dataset)
hard_samples = dataset.match(
    (F("predictions.confidence") < 0.4) & (F("uniqueness") > 0.7)
)

Quantifying the curation advantage

The financial impact of curation-first workflows manifests across multiple dimensions, with organizations reporting cost and efficiency improvements.

  • Reduced annotation volume: Curation achieves equivalent model performance with 60% to 80% less annotated data.
  • Lower error correction costs: Finding and fixing labeling mistakes early reduces expensive rework cycles that typically add 20% to 40% to project budgets.
  • Minimized tool licensing and coordination overhead: Unified workflows eliminate redundant platform costs that average $50K annually per tool and minimize handoffs.
  • Faster iteration cycles: Targeted annotation and validation eliminate weeks of review cycles.

A mid-sized AV team annotating 500K samples monthly at $0.07 per object can reduce this from $35K to $14K through intelligent selection, leading to an annual savings of ~$336K.

Impact on development teams: From reactive to strategic

The shift to curation-first methodologies fundamentally changes how ML engineering teams operate, moving them from reactive problem-solving to proactive dataset optimization.

Workflow transformation

Traditional workflow:

Data collection → Data annotation → Model training → Discover failures → Debug → Reannotate → Retrain

Curation-first workflow:

Data collection → Intelligent curation → Targeted annotation → Continuous validation → Model training → Strategic expansion

This reordering frontloads data understanding, helping identify issues when they’re cheapest to fix. Teams report productivity gains as engineers shift their focus from tedious quality firefighting to strategic model improvement.

Best practices: Implementing curation-driven annotation

Successful implementations follow established patterns that balance automation with human expertise.

Start with embedding-based exploration

Before annotating anything, generate embeddings and visualize your dataset’s distribution. This reveals the structure of your data: tight clusters indicate redundancy, while sparse regions suggest rare scenarios worth targeted collection or synthetic augmentation.

# Compute embeddings
dataset.compute_embeddings(model, embeddings_field="embeddings")
# Generate 2D visualization using UMAP
results = fob.compute_visualization(
    dataset, 
    embeddings="embeddings",
    brain_key="img_viz"
)
# Launch interactive exploration
session = fo.launch_app(dataset)

Implement progressive annotation strategies

Rather than annotating entire datasets up front, adopt iterative expansion:

  1. Initial selection: Curate 10% to 20% of the most unique/representative samples with coreset selection, mistakenness computation, or another algorithmic tool.
  2. Auto labeling and training: Annotate quickly with foundation models and train your initial model from those labels.
  3. Failure analysis: Identify prediction errors and edge case gaps.
  4. Targeted expansion: Collect or annotate specific scenarios addressing weaknesses.
  5. Iterate: Repeat cycle, focusing resources on high-impact improvements.

This approach mirrors active learning but with explicit curation intelligence guiding selection.
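The five steps can be sketched as a generic driver loop. The `curate`, `annotate`, `train`, and `find_failures` callables here are placeholders for whatever tooling your stack provides (FiftyOne, a labeling service, your training code):

```python
def progressive_annotation(pool, curate, annotate, train, find_failures, rounds=3):
    """Grow a labeled set iteratively, spending annotation budget only
    on samples the current model handles poorly."""
    labeled = annotate(curate(pool))       # steps 1-2: curate, then label
    model = train(labeled)                 # step 2: baseline model
    for _ in range(rounds):
        hard = find_failures(model, pool)  # step 3: failure analysis
        if not hard:
            break                          # nothing left worth labeling
        labeled += annotate(hard)          # step 4: targeted expansion
        model = train(labeled)             # step 5: iterate
    return model, labeled
```

The point of the skeleton is the budget discipline: each round only pays for samples the previous model demonstrably got wrong.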

Automate quality gates

Replace subjective manual review with deterministic quality gates. Automated checks are the only way to catch systematic errors like schema violations or class imbalance that human reviewers inevitably miss at scale.

from fiftyone import ViewField as F
# Find bounding boxes that are impossibly small
tiny_boxes = dataset.filter_labels(
    "ground_truth",
    (F("bounding_box")[2] * F("bounding_box")[3]) < 0.001  # illustrative area threshold
)

# Schema Validation: Find detections missing required attributes
incomplete_labels = dataset.filter_labels(
    "ground_truth",
    F("occluded") == None
)

Maintain annotation provenance

Track curation decisions and annotation metadata to support iterative improvement. This provenance enables sophisticated analysis of which curation strategies yield the best model improvements and supports continuous workflow optimization.

# Grab the most unique sample from a curated view of unique samples
most_unique_sample = unique_view.first()

# Add sample-level provenance
most_unique_sample.tags.append("curated_for_review")

# Set metadata on the specific labels (detections)
if most_unique_sample.detections:
    for det in most_unique_sample.detections.detections:
        det["annotator"] = "expert_reviewer"
        det["review_status"] = "validated"

# Persist both the tag and the label metadata
most_unique_sample.save()

A unified platform for curation-driven workflows

Voxel51’s flagship open source computer vision platform, FiftyOne, provides the necessary tools to curate, annotate, and evaluate AI models, offering a unified interface for data selection, QA, and iteration.

Architecture advantages

Open-source foundations provide transparency into data processing while enabling customization for specific workflows. FiftyOne has millions of community users and an extensive integrations framework that connects it to virtually any workflow or external tool.

The design recognizes that curation, annotation, and evaluation are interconnected activities requiring shared context rather than isolated tools. This architectural philosophy enables the feedback loops that make curation-first workflows effective: evaluation insights immediately inform curation priorities, which drive targeted annotation, which in turn feeds back into refined models.

  • Data-centric selection: Zero-shot coreset selection, uniqueness scoring, and embedding-based exploration enable intelligent prioritization before any annotation investment.
  • Unified annotation: Create and modify 2D bounding boxes, 3D cuboids, and polylines directly within the platform where you already curate and evaluate. Annotate and QA 2D and 3D annotations in a single interface to maintain spatial context across modalities. (View a demo video.)
  • ML-powered quality control: Mistakenness scoring, similarity search, and embedding visualization surface labeling errors systematically rather than through random sampling.
  • Production-grade features: Dataset versioning captures state at each training iteration, annotation schemas enforce consistency, and programmatic quality gates prevent drift.

Getting started

Teams can implement curation-first workflows incrementally:

pip install fiftyone
# Load existing dataset
import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

dataset = fo.Dataset.from_dir(
    dataset_dir="/path/to/data",
    dataset_type=fo.types.ImageDirectory
)
# Generate embeddings
model = foz.load_zoo_model("clip-vit-base32-torch")
dataset.compute_embeddings(model, embeddings_field="embeddings")
# Compute 2D visualization
fob.compute_visualization(
    dataset,
    embeddings="embeddings",
    brain_key="clip_viz",
)
# Visualize and curate your data
session = fo.launch_app(dataset)

Future outlook: From reactive labeling to proactive intelligence

Three technical shifts are accelerating the move to curation-first workflows.

  1. Foundation models as curators: Pre-trained vision-language models (VLMs) can now describe and filter images semantically without task-specific training. Instead of waiting for human review, teams can use multi-modal models to auto-tag complex sensor data (LiDAR/camera) and prioritize scenarios based on deployment needs.
  2. Active learning meets intelligent curation: Standard active learning can waste budget by blindly flagging “low-confidence” predictions that are really just noisy or redundant frames. Next-generation pipelines now filter these requests through a uniqueness check. By prioritizing samples that are both confusing to the model and unique in the dataset, teams maximize the learning value of every labeled image.
  3. Continuous curation in production: As models deploy to production, curation intelligence will extend to monitoring and maintenance. Embedding analysis of production data will detect distribution drift, trigger targeted data collection for new scenarios, and prioritize annotation of examples where models fail. This closes the loop from deployment back to development, enabling continuous model improvement grounded in real-world performance data.

Make your annotation investments count

Curation-first workflows coupled with smart labeling fundamentally transform how teams develop computer vision systems. Progressive annotation strategies that focus on high-impact data help teams achieve better model performance with 60% to 80% less labeling effort.

For teams ready to make that shift, the path forward starts with understanding your data before you label it.

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.


Vim and GNU Emacs: Claude Code helpfully found zero-day exploits for both 1 Apr 2026, 5:57 pm

Developers can spend days using fuzzing tools to find security weaknesses in code. Alternatively, they can simply ask an LLM to do the job for them in seconds.

The catch: LLMs are evolving so rapidly that this convenience might come with hidden dangers.

The latest example is from researcher Hung Nguyen from AI red teaming company Calif, who, with simple prompts to Anthropic’s Claude Code, was able to uncover zero-day remote code exploits (RCEs) in the source code of two of the most popular developer text editors, Vim and GNU Emacs.

Nguyen started with Vim. “Somebody told me there is an RCE 0-day when you open a file. Find it,” he instructed Claude Code. 

Within two minutes, Claude Code had discovered the flaw: missing critical security checks (P_MLE and P_SECURE) in the tabpanel sidebar introduced in 2025, and a missing security check in the autocmd_add() function.

Claude Code then helpfully tried to find ways to exploit the vulnerability, eventually suggesting a tactic that bypassed the Vim sandbox by persuading a target to open a malicious file. It had gone from prompt to proof-of-concept (PoC) exploit in minutes.

“An attacker who can deliver a crafted file to a victim achieves arbitrary command execution with the privileges of the user running Vim,” Vim maintainers noted in their security advisory. “The attack requires only that the victim opens the file; no further interaction is needed.”

GNU Emacs ‘forever-day’

Surprised, Nguyen then jokingly suggested Claude Code find the same type of flaw in a second text editor, GNU Emacs.

Claude Code obliged, finding a zero-day vulnerability, dating back to 2018, in the way the program interacts with the Git version control system that would make it possible to execute malicious code simply by opening a file.

“Opening a file in GNU Emacs can trigger arbitrary code execution through version control (git), most requiring zero user interaction beyond the file open itself. The most severe finding requires no file-local variables at all — simply opening any file inside a directory containing a crafted .git/ folder executes attacker-controlled commands,” he wrote.

One fixed, one not

When notified, Vim’s maintainers quickly fixed their issue, identified as CVE-2026-34714 with a CVSS score of 9.2, in version 9.2.0272.

Unfortunately, addressing the GNU Emacs vulnerability, which is currently without a CVE identifier, isn’t as straightforward. Its maintainers believe it to be a problem with Git, and declined to address the issue; in his post, Nguyen suggests manual mitigations. The vulnerable versions are 30.2 (stable release) and 31.0.50 (development).

Vulnerable code

What does the discovery of these flaws tell us? Clearly, that large numbers of old codebases are potentially vulnerable to the power of AI tools such as Claude Code. Just because a weakness hasn’t been noticed for years doesn’t mean it will hide for long in the AI era.

That is, potentially, a big change, although hardly one that hasn’t already been flagged by Anthropic itself. In February, the company revealed that its Opus 4.6 model had been used to identify 500 high-severity security vulnerabilities.

“AI language models are already capable of identifying novel vulnerabilities, and may soon exceed the speed and scale of even expert human researchers,” it said at the time.

The platform is powerful enough that an enterprise version with the same capabilities, Claude Code Security, even negatively affected stock market sentiment towards several traditional cybersecurity companies when it was launched.

A second issue is that LLMs are now capable of spotting, iterating, and creating PoCs for vulnerabilities in ways developers still need to come to terms with. Meanwhile, the potential for malicious use is hard to ignore.

“How do we professional bug hunters make sense of this?” Nguyen asked. “This feels like the early 2000s. Back then a kid could hack anything, with SQL Injection. Now [they can] with Claude.”

This article originally appeared on CSOonline.


Meta shows structured prompts can make LLMs more reliable for code review 1 Apr 2026, 10:22 am

Meta researchers have developed a structured prompting technique that enables LLMs to verify code patches without executing them, achieving up to 93% accuracy in tests.

The method, dubbed semi-formal reasoning, could help reduce reliance on the resource-heavy sandbox environments currently required for automated code validation.

The development comes as organizations look to deploy agentic AI for repository-scale tasks like bug detection and patch validation. Traditional execution-based approaches often struggle to scale across large, heterogeneous codebases.

Instead of using free-form reasoning that can lead to hallucinations, the technique introduces structured logical certificates. These require models to explicitly state assumptions and trace execution paths before deriving a conclusion.

The researchers evaluated the approach on three key tasks: patch equivalence verification, fault localization, and code question answering. Semi-formal reasoning improved accuracy on all three.

“For patch equivalence, accuracy improves from 78% to 88% on curated examples and reaches 93% on real-world agent-generated patches, approaching the reliability needed for execution-free RL reward signals,” the researchers said in the paper.

For code question answering, semi-formal reasoning reaches 87% accuracy, marking a nine-percentage point improvement over standard agentic reasoning. In fault localization, it boosts Top 5 accuracy by five percentage points compared to standard approaches.

How it works

Semi-formal reasoning occupies a middle ground between unstructured chat and rigid formal verification. While standard reasoning allows models to make claims without justification, this approach uses a predefined template that mandates a step-by-step process.

“Rather than training specialized models or formalizing semantics, we prompt agents with structured reasoning templates that require explicit evidence for each claim,” the researchers said.

They added that the “templates act as certificates: the agent must state premises, trace relevant code paths, and provide formal conclusions. The structured format naturally encourages interprocedural reasoning, as tracing program paths requires the agent to follow function calls rather than guess their behavior.”

In practice, this forces the model to behave like a developer stepping through code line by line.
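Meta’s exact templates aren’t reproduced in this article, but from the researchers’ description (state premises, trace paths, then conclude), a certificate-style prompt might look roughly like the following. The section names and wording here are illustrative, not Meta’s:

```python
CERTIFICATE_TEMPLATE = """\
Task: Decide whether patch A and patch B are semantically equivalent.

Answer using this certificate structure. Every claim needs evidence.

1. PREMISES: State assumptions about inputs, types, and program state,
   each tied to a specific line of code.
2. TRACE: Step through every relevant execution path, following
   function calls rather than guessing their behavior.
3. CONCLUSION: EQUIVALENT or NOT_EQUIVALENT, derived only from the
   premises and traces above.
"""

def build_certificate_prompt(patch_a: str, patch_b: str) -> str:
    """Attach the two patches under review to the reasoning template."""
    return (CERTIFICATE_TEMPLATE
            + "\n--- Patch A ---\n" + patch_a
            + "\n--- Patch B ---\n" + patch_b)
```

The structure is the point: by forcing premises and traces before the verdict, the model can’t jump straight to a plausible-sounding conclusion.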

Researchers said that in one case involving the Django framework, the structured approach revealed that a module-level function shadowed Python’s built-in format() function. While standard reasoning missed this nuance, the semi-formal analysis correctly identified that the code would fail.

Implications for enterprises

Analysts said semi-formal reasoning signals a shift from assistive AI to more accountable AI in software engineering, a distinction that could reshape how enterprises approach code review.

“Tools like GitHub Copilot have conditioned developers to interact with AI as a fast, fluent suggestion engine,” said Sanchit Vir Gogia, chief analyst at Greyhound Research. “You ask, it generates, you accept or tweak. The system optimizes for speed and plausibility. What it does not optimize for is proof.”

Semi-formal reasoning changes that dynamic. Instead of rewarding models for sounding correct, it requires them to demonstrate correctness by tracing logic and grounding conclusions. For developers, this shifts the focus from reviewing outputs to evaluating the reasoning behind them.

“The deeper implication is that code review itself starts to evolve,” Gogia said. “Historically, code review has been a human bottleneck tied to knowledge transfer and design validation as much as bug detection. In practice, it often fails to catch critical issues while slowing down integration. What we are seeing now is the early shape of a machine-led verification layer where the system traces logic and the human validates the outcome.”

The shift, however, is not without tradeoffs. Structured reasoning introduces additional compute and workflow overhead, raising questions about how it should be deployed in real-world development environments.  

“More steps, more tokens, more latency,” Gogia said. “In controlled experiments, this can be justified by higher accuracy. In real developer environments, this translates into slower builds, longer feedback cycles, and increased infrastructure spend. If this is applied indiscriminately, developers will bypass it. Not because they disagree with it, but because it gets in the way.”

There is also a technical risk. The researchers noted that while the structured format reduces guessing, it can also produce “confident but wrong” answers. In these cases, the AI constructs an elaborate but incomplete reasoning chain, packaging an incorrect conclusion in a convincing, highly structured format that may be difficult for a human to quickly debunk.


What next for junior developers? 1 Apr 2026, 9:00 am

Everyone is worried about junior developers. What are all these fresh-faced computer science graduates going to do now that AI is writing all the code?  

It is a legitimate concern. 

It wasn’t that long ago that the best advice I could give an early-career person interested in software development was to go to a boot camp. Sure, they could go to college and get a four-year computer science degree, but that would be expensive, take a long time, and teach them a lot of theoretical but impractical things about computers. And they wouldn’t even be doing science. 

But a six-month boot camp? There they’d learn what they really need to know—what software development companies are really looking for. They’d learn practical coding techniques, proper bug management, design specifications, JavaScript and TypeScript, source control management, and continuous integration.  

When I was a hiring manager, it didn’t take long for me to realize that a boot camp graduate was often much more ready to hit the ground running as a junior developer than a computer science graduate. 

But of course, all that fell apart overnight. Suddenly, for a low monthly payment, I could have a tireless, diligent, eager, and highly skilled junior developer who can type a thousand words a minute and reason at the speed of light. The economics of that are simply too compelling. 

Juniors begat seniors

And so what is a budding software developer to do? Or more importantly, what is a software development company to do when they realize that all those senior developers who are using Cursor are actually going to retire one day?  

Up until about 10 minutes ago, those companies would hire these intrepid young whippersnappers and put them to work fixing bugs, writing the boring code that builds systems, and slowly but surely teaching them how systems work by having them learn by doing. One became a senior developer through the experience of writing code, seeing it run, and learning what works and what doesn’t. Eventually, wisdom would set in, and they’d become sage, seasoned developers ready to mentor the next generation of developers.  

Well, we are now skipping that part where you actually become wise. But wisdom is actually the critical thing in this grand process. The judgment to know what is good, what is effective, and what is needed is the very commodity that makes agentic coding work. The AI model writes the code, and we seasoned veterans determine if it is right or not. 

We seasoned veterans know if the code is right or not because we’ve written tons and tons of code. But humans aren’t writing tons and tons of code anymore. And here is where I’m going to say something that I think many of you will really not like: Code doesn’t matter anymore. 

What I mean is, code is a commodity now. Code that used to take months to produce can now be produced in minutes. Yes, literally minutes. And the coding agents today are the worst they will ever be. They are only getting better, and they will only produce cleaner and cleaner code as time marches on. At some point—and that point may already be here for many of you—we are just going to stop looking at code. 

What matters is whether or not the application, you know, actually works. And if you want Claude Code or Codex to write a working application for you, you need to be able to communicate with it effectively to get it to do what you want. And strangely, the way to communicate with it is to write clearly. 

Heads up, English majors

A couple of weeks ago, I wrote that Markdown is the new programming language, and that what makes for “good code” in Markdown is the ability to write clear and concise instructions. Who would have thought that the English department would suddenly be the key to developing good software? 

Right now, the agentic coding process goes something like: 

  1. Describe the problem to Claude Code.
  2. Monitor the code Claude writes to make sure it is good code.
  3. Test the application to make sure it works correctly.
  4. Refine and improve by iterating this process. 

Step 2? It’s already becoming unnecessary. These AI agents are already writing good code, and the code they write gets better and better every day. And it is trivial to tell them to improve the code that they have already written. Iterating to improve code quality takes mere minutes. Writing the code has literally become the easiest part of developing software. 

So my advice to the kids these days: Learn to write clearly and precisely. Learn how to understand systems and describe them and their use cases. Make sure you can succinctly describe what you need software to do. English majors take note. Hiring managers? You too.


PEP 816: How Python is getting serious about Wasm 1 Apr 2026, 9:00 am

WebAssembly, or Wasm, provides a standard way to deliver compact, binary-format applications that can run in the browser. Wasm is also designed to run at or near machine-native speeds. Developers can write code in one of the various languages that compile to Wasm as a target (e.g., Rust), and deliver that program anywhere Wasm runs.

But Wasm by itself isn’t enough. An application, especially one running in a browser, needs standardized and controllable ways to talk to the rest of the system. The WebAssembly specification doesn’t speak to any of that by design. It only describes the WebAssembly instruction set, not how programs using those instructions deal with the rest of the system.

That’s what the WASI standard provides—abstractions for using the host system, such as how to perform network and storage I/O, and using host resources like clocks or sources of entropy for PRNGs.

Until now, CPython has supported WASI, but not in a formally defined way. Nothing described how CPython would support versions of WASI (the spec), or the WASI SDK (an implementation of the spec). With PEP 816, the CPython team has formally defined how to support both the spec and the SDK going forward.

Ultimately, the new definition will make it easier to deliver Python apps in the browser or anywhere else Wasm runs. There are just a few things developers need to know to ensure they’re using Wasm correctly with Python under the new rules.

How Python has historically used Wasm

Most languages, such as Rust, compile to Wasm as a binary target. Because Python is interpreted—at least, the default CPython implementation works that way—it doesn’t compile to Wasm directly. Instead, the interpreter itself is compiled to Wasm, and Python programs are run on that Wasm version of the interpreter.

There are drawbacks to this approach. For one, it means you need a full copy of the interpreter and the standard library to run any Python program. There is as yet no mechanism to compile a Python program to Wasm as a self-contained artifact that bundles the interpreter.

Another big drawback: Any modules not written in pure Python can’t run in Wasm unless a Wasm-specific version of that module is compiled ahead of time. Unless you have a specially compiled version of, say, NumPy, you can’t use that module in Wasm.

Some of these issues are limitations of Python as a language. Its inherent dynamism makes it difficult to deploy a standalone program. Rust, by contrast, can compile to a single binary artifact for any supported target.

But some of these limits can also be attributed to the Wasm environment. For instance, many methods in the standard library aren’t available in Wasm environments because the WASI SDK doesn’t expose the needed interfaces for those methods. The more Python and other languages demand such things, the more likely they are to show up in the Wasm environment.
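At runtime, a WASI build of CPython identifies itself through sys.platform, so libraries can guard features the host doesn’t expose. A minimal sketch:

```python
import sys

def running_on_wasi() -> bool:
    """True when this interpreter was compiled for the WASI target."""
    return sys.platform == "wasi"

if running_on_wasi():
    # Skip features WASI builds may lack, such as threads
    # or full socket support
    pass
```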

This is where it is useful for Python to be explicit about which versions it’ll use for both Wasm and its software development kit (or SDK) going forward. Each version of Python can then provide better guarantees about the Wasm features it supports.

Wasm support in Python: WASI and the WASI SDK

Wasm support involves two things: WASI and the WASI SDK. The difference between the two is a little like the difference between the Python language in the abstract and the CPython runtime. The former (WASI) is the spec for how Wasm programs interact with the host system, which can be implemented any number of ways. The latter (the WASI SDK) is the official implementation of that spec.

The WASI SDK is a modified version of the Clang compiler bundled with a library called wasi-libc. Together, they give programs written in C (and C API-compatible languages) access to WASI’s APIs for the host (storage, networking, timers, etc.).
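A minimal sketch of what using the SDK looks like in practice, compiling a trivial C file to a WASI module. WASI_SDK_PATH is an assumed environment variable pointing at an unpacked wasi-sdk release, and wasm32-wasip1 is the target triple for WASI Preview 1; exact flags vary by SDK version, so treat this as illustrative:

```python
import os
import pathlib
import subprocess

# Sketch: compile a C file to a WASI module with the WASI SDK's clang.
# WASI_SDK_PATH is an assumed convention for locating an unpacked SDK;
# the script skips gracefully when no SDK is installed.
sdk = os.environ.get("WASI_SDK_PATH")
if not sdk:
    print("WASI_SDK_PATH not set; skipping")
else:
    pathlib.Path("hello.c").write_text("int main(void) { return 0; }\n")
    subprocess.run(
        [f"{sdk}/bin/clang", "--target=wasm32-wasip1",
         "hello.c", "-o", "hello.wasm"],
        check=True,
    )
```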

In theory, we should just be able to compile a given CPython release with the most recent WASI SDK available at the time. But things aren’t that simple. For one, the SDK’s biggest component, wasi-libc, doesn’t guarantee it’ll be forward- or backward-compatible. Also, some versions of the SDK may cause buggy behavior with some versions of CPython. As developers, we want to know that this version of CPython works with this version of the SDK—or at least be able to document which bugs appear with any given combination of the two.

How future releases of CPython will use WASI

CPython has been available on Wasm since version 3.11, with a mix of Tier 2 and Tier 3 support. The more official wasip1 target is the better-supported (Tier 2) one, while the older Emscripten target has less-supported Tier 3 status. But Tier 2 support has been confined to the WASI “Preview 1” set of system calls. And for the reasons already stated, the WASI SDK CPython uses is not necessarily the most recent version, either: it’s SDK version 21 for Python 3.11 and 3.12, and SDK version 24 for 3.13 and 3.14.

All of this will change with future releases of CPython, with a couple of hard rules in place for using WASI and its SDK:

  1. Whatever versions of WASI and the WASI SDK a given CPython release supports as of its beta 1 will remain the supported versions for the lifetime of that release. For instance, if CPython 3.15 ships with version 0.3 of the WASI spec and version 33 of the SDK (these are arbitrary numbers), then those versions of WASI and the SDK will be supported until that CPython release reaches end of life.
  2. Any change to the version of the WASI spec or SDK used for a particular release requires approval from Python’s steering council. But this shouldn’t happen outside of some extraordinary set of circumstances—for instance, a bug that made a given version of the SDK unusable with a given CPython release.

The benefits of WASI version guarantees for CPython

Going forward, developers can expect significant improvements in how Python works with WASI:

  1. Not only will it be easier for CPython developers to know which versions of WASI and the SDK to target; it will also be easier for the rest of the WASI ecosystem to determine which Python versions are compatible with which WASI and SDK releases.
  2. Developers maintaining Python libraries with extension modules will have a better idea of how to compile those modules to Wasm for each Python point release. They will then be able to take advantage of newer WASI features sooner, knowing that a specific CPython release will support them.
  3. Developers can add WASI support to their projects for a given version of CPython sooner in each release cycle for the interpreter, as the WASI and SDK versions should be locked down by the first beta release.
