PyPI warns developers after LiteLLM malware found stealing cloud and CI/CD credentials 25 Mar 2026, 11:13 am

PyPI is warning of possible credential theft from AI applications and developer pipelines after two malicious versions of the widely used Python middleware for large language models, LiteLLM, were briefly published.

“Anyone who has installed and run the project should assume any credentials available to the LiteLLM environment may have been exposed, and revoke/rotate them accordingly,” PyPI said in an advisory that linked the incident to an exploited Trivy dependency from the ongoing TeamPCP supply-chain attack.

According to a Sonatype analysis, the packages embedded a multi-stage payload designed to harvest sensitive data from developer environments, CI/CD pipelines, and cloud configurations, and were live on PyPI for roughly two hours before being taken down.

“Given the package’s three million daily downloads, the compromised LiteLLM could have seen significant exposure during that short time span,” Sonatype researchers said in a blog post. On top of serving as a stealer, the packages were also acting as droppers, enabling follow-on payloads and deeper system compromise.

Three-stage payload built for maximum reach

The compromise affected versions 1.82.7 and 1.82.8. Sonatype’s analysis found the payload operating in three distinct stages: initial execution and data exfiltration, deeper reconnaissance and credential harvesting, and finally persistence with remote-control capabilities.
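As a first triage step, installed environments can be checked against those two releases. The sketch below is a hypothetical helper (the function name and approach are illustrative, not from the advisory) using only the standard library:

```python
from importlib import metadata

# The two malicious LiteLLM releases named in the advisory
COMPROMISED = {"1.82.7", "1.82.8"}

def litellm_is_compromised() -> bool:
    """Return True if the installed litellm version is a known-bad release."""
    try:
        return metadata.version("litellm") in COMPROMISED
    except metadata.PackageNotFoundError:
        return False  # litellm is not installed in this environment
```

If this returns True, treat every credential reachable from that environment as exposed, per PyPI’s guidance.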

The attack chain relied heavily on obfuscation, with base64-encoded Python code covering up the payload’s tracks. Once executed, the malware collected sensitive data, encrypted it using AES-256-CBC, and then secured the encryption key with an embedded RSA public key before sending everything to attacker-controlled servers.

The disclosure highlights an approach attackers increasingly favor. Rather than detonating immediately after installation, the malware quietly lingers to map the environment and establish a foothold before pulling credentials from local machines, cloud configs, and automation pipelines.

“It (payload) targets environment variables (including API keys and tokens), SSH Keys, cloud credentials (AWS, GCP, Azure), Kubernetes configs, CI/CD secrets, Docker configs, database credentials, and even cryptocurrency wallets,” said Wiz researchers, who are separately tracking the campaign, in a blog post. “Our data shows that LiteLLM is present in 36% of cloud environments, signifying the potential for widespread impact.”

Wiz also provided a way for its customers to check their environment for exposure via the Wiz Threat Center.

An expanding supply-chain campaign

The LiteLLM incident has been confirmed as part of the rapidly unfolding TeamPCP supply chain campaign, which first compromised Trivy.

Trivy, developed by Aqua Security, is a widely used open-source vulnerability scanner designed to identify security issues in container images, file systems, and infrastructure-as-code (IaC) configurations. The ongoing attack, attributed to TeamPCP with reported links to LAPSUS$, involved attackers compromising publishing credentials and injecting credential-stealing code into official releases and GitHub Actions used in CI/CD pipelines.

The Trivy compromise was quickly followed by similar supply chain incidents, with attackers leveraging the same access and tactics to target other developer security tools like KICS and Checkmarx, extending the campaign’s reach across multiple CI/CD ecosystems.

The PyPI advisory tied the LiteLLM incident directly to the Trivy compromise. The malicious packages were uploaded “after an API Token exposure from an exploited Trivy dependency,” it said.

Ben Read, a lead researcher at Wiz, calls it a systematic campaign that needs to be monitored for further expansion. “We are seeing a dangerous convergence between supply chain attackers and high-profile extortion groups like LAPSUS$,” he said. “By moving horizontally across the ecosystem – hitting tools like liteLLM that are present in over a third of cloud environments – they are creating a snowball effect.”

PyPI has advised users to rotate any secrets accessible to the affected LiteLLM environment, as researchers confirm active data exfiltration and potential exposure across cloud environments tied to the ongoing campaign.
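Rotation starts with an inventory. A minimal sketch like the following (the pattern list is illustrative and deliberately incomplete) can enumerate environment variable names, never values, that commonly hold secrets:

```python
import os
import re

# Name fragments that commonly indicate secrets (illustrative, not exhaustive)
SECRET_HINTS = re.compile(r"KEY|TOKEN|SECRET|PASSWORD|CREDENTIAL", re.IGNORECASE)

def secret_like_env_names() -> list[str]:
    """List environment variable names (not values) that may need rotation."""
    return sorted(name for name in os.environ if SECRET_HINTS.search(name))
```

Anything this surfaces in an environment that ran a compromised LiteLLM build is a rotation candidate.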

The article originally appeared in InfoWorld.


Cloudflare launches Dynamic Workers for AI agent execution 25 Mar 2026, 10:37 am

Cloudflare has rolled out Dynamic Workers, an isolate-based runtime designed to run AI-generated code faster and more efficiently than traditional containers, as the company pushes lightweight, disposable execution environments as a foundation for enterprise AI applications.

The service enables enterprises to spin up execution environments in milliseconds, pointing to a transition away from container-heavy architectures toward more ephemeral runtimes designed for high-volume AI agent workloads.

For many enterprises, this points to a shift in how AI systems are built and executed. Instead of orchestrating predefined tools, organizations are beginning to let models generate and execute code on demand, a shift that raises new questions around security and cost.

Built on Cloudflare’s existing Workers platform, Dynamic Workers uses V8 isolates to execute code generated at runtime, often by LLMs, without requiring a full container or virtual machine.

“An isolate takes a few milliseconds to start and uses a few megabytes of memory,” Cloudflare said in a blog post. “That’s around 100x faster and 10x-100x more memory efficient than a typical container. That means that if you want to start a new isolate for every user request, on-demand, to run one snippet of code, then throw it away, you can.”

Cloudflare is pairing the runtime with its “Code Mode” approach, which encourages models to write short TypeScript functions against defined APIs instead of relying on multiple tool calls, a method the company says can reduce token usage and latency.

From an enterprise perspective, the platform includes controls such as outbound request interception for credential management, automated code scanning, and rapid rollout of V8 security patches. Cloudflare noted that isolate-based sandboxes have different security characteristics compared to hardware-backed environments.

Dynamic Workers are available in open beta under Cloudflare’s Workers paid plan. Pricing is set at $0.002 per unique Worker loaded per day, in addition to standard CPU and invocation charges, though the per-Worker fee is waived during the beta period.

Enterprise runtime implications

For enterprise IT teams, the move to isolate-based execution could reshape how AI workloads are architected, especially for use cases that demand high concurrency and low-latency performance.

“Cloudflare is essentially looking to redefine the application lifecycle by pivoting away from the traditional ‘build-test-deploy’ cycle on centralized servers, which often relies on high-overhead, latency-heavy containers,” said Neil Shah, VP for research at Counterpoint Research. “The move to V8 reduces startup times from around 500 ms to under 5 ms, a roughly 100x improvement, making it significant for bursts of agentic AI requests that may require cold starts.”

This shift could also have cost implications. If AI agents can generate and execute scripts locally to produce outcomes, rather than repeatedly calling LLMs, enterprises may see improvements in both efficiency and latency.

However, Shah noted that the model introduces new security considerations that enterprise leaders cannot ignore.

“Allowing AI agents to generate and execute code on the fly introduces a new attack vector and risk,” Shah said. “While Dynamic Workers are sandboxed to limit the impact of a potential compromise, the unpredictability of AI-generated logic requires a robust security framework and clear guardrails.”

Others say these risks extend beyond sandboxing and require broader governance across the AI execution lifecycle. Nitish Tyagi, principal analyst at Gartner, said that while isolate-based environments improve containment, they do not eliminate risks.

“Running an AI agent and executing code in an isolated environment may seem very safe in theory, but it doesn’t ensure complete safety,” Tyagi said.

He pointed to risks such as vulnerabilities in AI-generated code, indirect prompt-injection attacks, and supply-chain threats, in which compromised external sources could lead agents to expose sensitive data or execute harmful actions.

Tyagi also warned of operational risks, including the risk of autonomous agents entering recursive execution loops, which can lead to cost escalation and resource exhaustion.

To mitigate these risks, Tyagi said enterprises need stronger governance mechanisms, including real-time monitoring of agent behavior, tighter control over outbound traffic, and better visibility into AI supply chains and dependencies.


Oracle adds pre-built agents to Private Agent Factory in AI Database 26ai 25 Mar 2026, 9:54 am

Oracle has added new prebuilt agents to Private Agent Factory, its no-code framework for building containerized, data-centric agents within AI Database 26ai.

These agents include a Database Knowledge Agent, a Structured Data Analysis Agent, and a Deep Data Research Agent.  

While the Database Knowledge Agent translates natural-language prompts into queries to fetch specific facts, policies, or entities, the Deep Data Research Agent tackles more complex tasks by breaking them into steps and iterating across web sources, document libraries, or both, the company said.

The Structured Data Analysis Agent, meanwhile, is aimed at crunching tabular data —think SQL tables or CSV files — using tools like Python’s pandas library to generate charts, spot trends, flag anomalies, and summarize metrics, the company added.

Analysts say the addition of these prebuilt agents to Private Agent Factory in AI Database 26ai will help developers build agents faster, in a secure and simplified manner, and move pilots to production, especially in regulated industries.

“With AI Database Private Agent Factory, teams will be able to rapidly create AI agents or leverage pre-built ones, turning experimentation into production-ready solutions quickly. By embedding intelligence at the core of the database, Oracle is enabling a new era of agentic AI, where sophisticated, autonomous systems and applications can adapt and act at scale,” said Noel Yuhanna, principal analyst at Forrester.

Oracle’s rationale, Yuhanna added, reflects its broader strategy of making the database a central pillar of enterprise AI, given that execution ultimately depends on where the data resides.

That view is echoed by Stephanie Walter, practice leader of AI stack at HyperFRAME Research, who says Private Agent Factory is Oracle’s attempt to position itself as “the operational control layer” in enterprises rather than just the storage layer, by bringing data and AI closer together and reducing the need for data movement and external orchestration.

“Every major cloud provider is moving toward tighter coupling between data, models, and orchestration. Oracle’s differentiation is that it starts from the database outward, while hyperscalers typically start from the model or platform outward,” Walter said.

That differentiation is more than architectural nuance, according to Bradley Shimmin, lead of the data intelligence practice at The Futurum Group.

“By architecting agent orchestration directly into the database, Oracle is letting enterprises drop the duct-tape approach of complex, brittle data-movement pipelines that I would say continue to plague cloud-centric ecosystems, even those emphasizing zero-ETL capabilities,” Shimmin said.

That tighter integration also feeds directly into a more pragmatic concern for regulated industries: keeping sensitive data under control as AI agents move from experimentation into production.

“Most agent frameworks today assume you’re comfortable sending data to external LLM providers and orchestrating through cloud-hosted services. For regulated industries—including banking, healthcare, defense, and government—that assumption is a non-starter,” said Ashish Chaturvedi, leader of executive research at HFS Research.

“The Private Agent Factory meets those customers exactly where they are: behind the firewall, with the drawbridge up,” he added.


TypeScript 6.0 arrives 25 Mar 2026, 9:00 am

TypeScript 6.0, now generally available, is slated to be the last release of the language based on the current JavaScript codebase. Version 6.0 acts as a bridge between TypeScript 5.9 and the planned TypeScript 7.0, which is close to completion and is set to be speedier, built on the Go language.

The 6.0 production release was unveiled on March 23, following the release candidate that arrived March 6. Developers can access TypeScript 6.0 via NPM with the following command: npm install -D typescript.

Microsoft bills TypeScript as JavaScript with syntax for types. Among the noteworthy changes in the TypeScript 6.0 production release is an adjustment to type-checking for function expressions in generic calls, especially those occurring in generic JSX expressions. This typically will catch more bugs in existing code, although some generic calls may now need an explicit type argument, said Daniel Rosenwasser, principal product manager for TypeScript at Microsoft.

Also, Microsoft has extended its deprecation of import assertion syntax (i.e. import ... assert {...}) to import() calls like import(..., { assert: {...}}).

With the general release, Microsoft also has updated the DOM types to reflect the latest web standards, including some adjustments to the Temporal APIs. Other capabilities featured in TypeScript 6.0  include:

  • There is less context sensitivity on this-less functions. If this is never actually used in a function, then it is not considered contextually sensitive, which means these functions will be seen as higher priority when it comes to type inference.
  • A new flag has been introduced, called --stableTypeOrdering, which is intended to assist with migrations from TypeScript 6.0 to Version 7.0.
  • TypeScript 6.0 adds support for the es2025 option for both target and lib. Although there are no new JavaScript language features in ES2025, this new target adds new types for built-in APIs and moves a few declarations from esnext into es2025.
  • The contents of lib.dom.iterable.d.ts and lib.dom.asynciterable.d.ts are included in lib.dom.d.ts. Developers still can reference dom.iterable and dom.asynciterable in a configuration file’s "lib" array, but they are now just empty files. TypeScript’s lib option lets users specify which global declarations a target runtime has.
  • In TypeScript 6.0, using module where namespace was expected is now a hard deprecation. This change was necessary because module blocks are a potential ECMAScript proposal that would conflict with the legacy TypeScript syntax.
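Based on the list above, a minimal tsconfig.json sketch opting into the new es2025 values for target and lib might look like this (illustrative only; exact option behavior is as described in the release notes):

```json
{
  "compilerOptions": {
    "target": "es2025",
    "lib": ["es2025", "dom"]
  }
}
```

Because the iterable declarations are now folded into lib.dom.d.ts, listing "dom" alone suffices; "dom.iterable" remains accepted but points at an empty file.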

The foundation of TypeScript 7.0, meanwhile, is set to be a compiler and language service written in Go that takes advantage of the speed of native code and shared-memory multi-threading. Version 7.0 is “extremely close to completion,” Rosenwasser said. It can be tried out from the Visual Studio Code editor or installed via NPM. “In fact, if you’re able to adopt TypeScript 6.0, we encourage you to try out the native previews of TypeScript 7.0,” Rosenwasser said.


Stop worrying: Instead, imagine software developers’ next great pivot 25 Mar 2026, 9:00 am

My sister always says, “worry is just a lack of imagination.”   By that, she means we always seem to worry about the worst-case scenarios — about things going badly.  Why not worry, or imagine, that the best possible outcome will happen?  You have a choice — choose to assume that everything will work out perfectly rather than disastrously.

This has never been more true than when you look at the folks who think all of us software developers are going to end up selling apples on street corners.

Don’t fear the coding agent

I get it. Software development has suddenly become incredibly efficient. Claude Code can write code vastly faster and more efficiently than we humans can, so it seems reasonable that one person can now do (manage?) the work of 10 (50? 100?) people, that companies will get rid of the other nine, leaving them destitute. Seas of software developers will be standing in unemployment lines, their skills rendered moot by the blazing tokens of coding agents.

There’s the worst-case scenario.  But what if we apply a bit of imagination?

A similar case happened during the Industrial Revolution.  In the mid-19th century, steam engines were the leading technology, and as they became more efficient, coal miners grew concerned that demand for their services would drop as those engines used less and less coal. 

But the coal miners lacked imagination — more efficient steam engines led to the unexpected result of an increase in the demand for coal.  This counterintuitive outcome was noticed by economist William Stanley Jevons, who realized that cheaper, more efficient steam engines led to their more widespread use in ways that hadn’t yet been conceived, thus expanding the need for both coal miners and factory workers to build more and better steam engines. Everybody wins.

And why won’t the same thing be true for software?  Can’t we imagine a world where the amount of software demanded and produced expands beyond what we think of today?  The “programmers selling apples” scenario assumes that the demand for software remains constant. But if producing software becomes more efficient, won’t that lead to more software being produced? 

Think of this:  I bet most of us have a few side projects that we’d like to get done that we never seem to be able to find the time for.  Your product manager certainly has a long list of features for your product that she’d like to do, but for which there never seems to be the time to put on the schedule. Small businesses all have bespoke requirements for software that off-the-shelf solutions don’t meet. 

Adapting to development disruption

Add to that the software that hasn’t even been conceived of yet, and it’s pretty easy to see — imagine — that there is no shortage of software that can be created.  Making software easier to create won’t lead to the same projected amount of software created.  Making software easier to create will drastically increase the amount of software that will be produced.  The floor just dropped out from under “we don’t have the time for that.”

Now, I’ll give you this: There may be a disruption in the type of work required to produce this software.  Job descriptions change — this is a constant.  We used to need people to write assembly and C.  Procedural development gave way to object-oriented coding.  Windows developers were left behind as the web rose to prominence.  But we all have adapted, and we’ll do so again.

It turns out my sister is right. The best-case scenario is vastly more interesting than anyone bothers to imagine.


Speed boost your Python programs with new lazy imports 25 Mar 2026, 9:00 am

When you import a module in Python, the module’s code must be evaluated completely before the program can proceed. For most modules, this isn’t an issue. But if a module has a long and involved startup process, it’s going to slow down the rest of your program at the point where it’s imported.

Python developers typically work around this issue by structuring imports so they don’t happen unless they are needed—for instance, by placing an import in the body of a function instead of at the top level of a module. But this is a clumsy workaround, and complicates the flow of the program.

Python 3.15 adds a new feature, lazy imports, that provides a high-level solution for slow-importing modules. Declaring an import as “lazy” means it will be evaluated when it is first used, not when it is first imported. The cost of a slow import can then be deferred until the code it contains is actually needed. And while lazy imports introduce new syntax, you can also enable them across an existing codebase without changing any of its code.

Eager versus lazy imports

To start, it’s helpful to understand the problem addressed by lazy imports. So, let’s say we have two files in the same directory:

# main.py
print("Program starting")
from other import some_fn
print("Other module imported")
some_fn()
print("Program ended")

# other.py
print("Other module evaluation started")
from time import sleep
sleep(2)
# ^ This simulates a slow-loading module
print("Other module evaluation ended")

def some_fn():
    print("some_fn run")

If you run main.py, the output should look something like this:


Program starting
Other module evaluation started

[two-second delay]

Other module evaluation ended 
Other module imported 
some_fn run 
Program ended

The mere act of importing other grinds our program to a near-halt before we can even do anything with the imported function, let alone continue with the rest of the program.

Now let’s see what happens if we modify main.py to use lazy imports (this will only work on Python 3.15 or higher):

print("Program starting")
lazy from other import some_fn
print("Other module imported")
some_fn()
print("Program ended")

When you run the program now, the behavior changes:

Program starting
Other module imported
Other module evaluation started

[two-second delay]

Other module evaluation ended
some_fn run
Program ended

Now, the import imposes no delay at all. We only see the delay when we try to run the function we imported from the module.

What’s happening under the hood? When Python detects a lazy import—typically triggered with the lazy keyword on the import line, as shown above—it doesn’t perform the usual import process. Instead, it creates a “proxy object,” or a stand-in, for the imported module. That proxy waits until the program tries to do something with the module. Then the actual import action triggers, and the module is evaluated.

The lazy keyword is always the first word on the line of an import you want to declare as lazy:


# lazily imports foo
lazy import foo
# lazily imports bar from foo
lazy from foo import bar
# same with the use of "as":
lazy import foo as foo1
lazy from foo import bar as bar1

Where to use lazy imports in Python

The most common scenario for using lazy imports is to replace the usual workaround for avoiding a costly import at program startup. As I mentioned previously, placing the import inside a function, instead of at the top level of a module, causes the import to happen only when the function runs. But it also means the import is limited to the function’s scope, and is therefore unavailable to the rest of the module unless you apply another workaround.
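For contrast, here is the pre-3.15 workaround in miniature, with the standard library’s json standing in for a slow-loading module:

```python
def render_report(data) -> str:
    # Pre-3.15 workaround: the import happens only when this function
    # first runs, deferring its cost. The downside: `json` is bound in
    # the function's local scope, invisible to the rest of the module.
    import json
    return json.dumps(data)
```

Calling `render_report({"ok": True})` triggers the import on first use and returns the serialized string.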

With a lazy import, you can keep the import in the top level of a module as you usually would. The only change you have to make is adding the lazy keyword to your code.

Using lazy imports automatically

It is also possible to enable lazy imports on an existing codebase automatically—without rewriting any import statements.

Python 3.15 adds new features to the sys module that control how lazy imports work. For instance, you can declare programmatically that every import from a given point forward in your program’s execution will be lazy:

import sys
sys.set_lazy_imports("all")

If sys.set_lazy_imports() is given "all", then every import in the program from that point on is lazy, whether or not it uses the lazy keyword. Passing "normal" restores the default behavior, in which only imports explicitly marked lazy are deferred, while "none" disables lazy importing across the board.

Controlling lazy imports programmatically

You can also hook into lazy imports at runtime, which lets you do things like control which specific modules are lazy-imported:


import sys

def mod_filter(importing, imported, fromlist):
    # Lazily import only the module named "module"
    return imported == "module"

sys.set_lazy_imports_filter(mod_filter)
sys.set_lazy_imports("all")

sys.set_lazy_imports_filter() lets you supply a function that takes in three parameters:

  • The module where the import is being performed
  • The module being imported
  • A list of names being imported

With that, you can write logic to return True to allow a given import to be lazily imported, or False to force it to be imported normally. This lets you write allow-lists and block-lists for lazy imports as part of a test, or simply as part of how your program works.

Two ways to get started with lazy imports

Python has a long-standing tradition of allowing newer features to be added gracefully to existing codebases. Lazy imports can be used the same way: You can check for the presence of the feature at program start, then apply lazy imports across your codebase automatically by using sys.set_lazy_imports().

To start, you can check the Python version number:

import sys
if sys.version_info >= (3, 15):
    ... # set up lazy imports

Or you can test for the presence of the lazy import controls in sys:

import sys
if hasattr(sys, "set_lazy_imports"):
    ... # set up lazy imports


New JetBrains platform manages AI coding agents 24 Mar 2026, 11:12 pm

Seeking to help developers control growing fleets of AI coding agents, JetBrains is introducing JetBrains Central, an agentic development platform for teams to manage and maintain visibility over these agents.

An early access program for JetBrains Central is set to begin in the second quarter of 2026 with a limited number of design partners. JetBrains describes the platform as the control and execution plane for agent-driven software production, intended to address the oversight, consistency, and control challenges developers increasingly face as their environments fill with agents.

Announced March 24, JetBrains Central acts as a control layer across agentic workflows alongside tools such as JetBrains’ Air agentic development environment and Junie, its coding agent, which works with any large language model (LLM). JetBrains Central connects developer tools, agents, and development infrastructure into a unified system where automated work can be executed and governed across teams and tools, JetBrains said. Developers can interact with agent workflows from JetBrains IDEs, third-party IDEs, CLI tools, web interfaces, or automated systems. Agents themselves can come from JetBrains or external ecosystems, including Codex, Gemini CLI, or custom agents.

JetBrains Central connects agents with the context they need, including repositories, documentation, and APIs. At the same time, agents operate within real delivery pipelines and infrastructure, interacting with Git repositories, CI/CD systems, cloud environments, and other services. When agents need guidance or complete a task, they interact with human teammates through the tools teams already use, such as Slack or Atlassian. This allows agent workflows to operate inside the same systems used by development teams today, rather than in isolated AI tools, according to JetBrains. Specific core capabilities include:

  • Governance and control, including policy enforcement, identity and access management, observability, auditability, and cost attribution for agent-driven work. Some of these functionalities are already available via the JetBrains Central Console.
  • Agent execution infrastructure, including cloud agent runtimes and compute provisioning, which allows agents to run reliably across development environments.
  • Agent optimization and context features, which share semantic context across repositories and projects, enabling agents to access relevant knowledge and route tasks to the most appropriate models or tools.


New ‘StoatWaffle’ malware auto-executes attacks on developers 24 Mar 2026, 12:01 pm

A newly disclosed malware strain dubbed “StoatWaffle” is giving fresh teeth to the notorious, developer-targeting “Contagious Interview” threat campaign.

According to NTT Security findings, the malware marks an evolution from the long-running campaign’s user-triggered execution to a near-frictionless compromise embedded directly in developer workflows. Attackers are using blockchain-themed project repositories as decoys, embedding a malicious VS Code configuration file that triggers code execution when the folder is opened and trusted by the victim.

“StoatWaffle is a modular malware implemented by Node.js and it has Stealer and RAT modules,” NTT researchers said in a blog post, adding that the campaign operator “WaterPlum” is “continuously developing new malware and updating existing ones.”

This means tracking Contagious Interview activity may now require widening the scope of detection efforts to include weaponized dev environments, not just malicious packages and interview lures.

Opening a folder is all it takes

StoatWaffle abuses developer trust within Visual Studio Code environments. Instead of relying on users to execute suspicious scripts, as in earlier attacks, attackers are embedding malicious configurations inside legitimate-looking project repositories, often themed around blockchain development, a lure consistent with Contagious Interview campaigns.

The trick relies on a “.vscode/tasks.json” file configured with a “runOn: folderOpen” setting. Once a developer opens the project and grants trust, the payload executes automatically without any further clicks. The executed StoatWaffle malware operates a modular, Node.js-based framework that typically unfolds in stages. These stages include a loader, credential harvesting components, and then a remote access trojan (RAT) planted for persistence and pivoting access across systems.
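For defenders, one practical response is scanning checked-out repositories for this auto-run trigger before opening them. The sketch below is a hypothetical helper (the names are illustrative, not NTT’s) and assumes plain-JSON task files; real tasks.json files may contain comments, which json.loads rejects and this sketch simply skips:

```python
import json
from pathlib import Path

def find_folder_open_tasks(repo_root):
    """Return (path, command) pairs for VS Code tasks set to run on folder open."""
    findings = []
    for tasks_file in Path(repo_root).rglob(".vscode/tasks.json"):
        try:
            config = json.loads(tasks_file.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or commented (JSONC) files are skipped here
        for task in config.get("tasks", []):
            if task.get("runOptions", {}).get("runOn") == "folderOpen":
                findings.append((str(tasks_file), task.get("command", "")))
    return findings
```

Any hit warrants inspection before the folder is opened, and certainly before workspace trust is granted.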

The RAT module maintains regular communication with an attacker-controlled C2 server, executing commands to terminate its own process, change the working directory, list files and directories, navigate to the application directory, retrieve directory details, upload a file, execute Node.js code, and run arbitrary shell commands, among others.

StoatWaffle also exhibits custom behavior depending on the victim’s browser. “If the victim browser was Chromium family, it steals browser extension data besides stored credentials,” the researchers said. “If the victim browser was Firefox, it steals browser extension data besides stored credentials. It reads extensions.json and gets the list of browser extension names, then checks whether the designated keyword is included.”

For victims running macOS, the malware also targets Keychain databases, they added.

Contagious Interview, revisited

StoatWaffle isn’t an isolated campaign. It’s the latest chapter in the Contagious Interview attacks, widely attributed to North Korea-linked threat actors tracked as WaterPlum.

Historically, this campaign has targeted developers and job seekers through fake interview processes, luring them into running malicious code under the guise of technical assessments. Previously, the campaign weaponized npm packages and staged loaders like XORIndex and HexEval, often distributing dozens of malicious packages to infiltrate developer ecosystems at scale.

Team 8, one of the group’s sub-clusters, previously relied on malware such as OtterCookie, shifting to StoatWaffle around December 2025, the researchers said.

The disclosure also shared a set of IP-based indicators of compromise (IOCs), likely tied to C2 infrastructure observed during analysis, to support detection efforts.

The article originally appeared in CSO.


When Windows 11 sneezes, Azure catches cold 24 Mar 2026, 9:00 am

If you look at Microsoft as a collection of product lines, it is easy to conclude that Windows 11 and Azure occupy different universes. One is a client operating system that has irritated its users, confused administrators, and pushed hardware refresh cycles in ways many customers did not want. The other is a hyperscale cloud platform selling compute, storage, data services, and AI infrastructure to enterprises. On paper, these are different businesses. In practice, they are part of the same trust system.

That is why the real question is not whether every unhappy Windows 11 user immediately stops buying Azure. They do not. The short-term connection is too indirect for that. The real issue is whether Microsoft is weakening the strategic gravity that has historically pulled enterprises toward the Microsoft stack. If Windows becomes less loved, less trusted, and less central, then Azure loses one of its quiet but important advantages: the assumption that Microsoft remains the default operating environment from endpoint to identity to server to cloud.

A cascade of Windows 11 problems

Windows 11 did not fail because of one mistake. It became controversial because Microsoft stacked friction on top of friction. The first issue was hardware eligibility. By tightening CPU support and enforcing TPM 2.0 and Secure Boot requirements, Microsoft effectively told a large installed base that perfectly usable machines were no longer good enough for the future of Windows. For many users and businesses, that translated into an involuntary hardware refresh rather than an upgrade. That remains one of the most damaging perception problems around Windows 11 because it turned operating system modernization into a capital expense conversation.

The second issue has been the aggressive insertion of AI features, especially Copilot, into the Windows experience. Recent reporting indicates Microsoft has been reassessing how deeply to push Copilot into Windows 11 after broad criticism that AI was being forced into core workflows rather than offered as a clearly optional capability. That matters because enterprise customers tend to reward optionality and punish coercion. When users believe the operating system is being used as a delivery vehicle for features they did not request, trust erodes quickly.

The third issue is cumulative quality perception. Even where individual complaints differ, the common narrative has been remarkably consistent: too much UX churn, too much product agenda, and not enough attention to core stability and utility. Once that story takes hold, it is no longer just about Windows 11. It becomes about Microsoft’s judgment.

The short-term impact on Azure

In the near term, I do not think the Windows 11 backlash materially dents Azure revenue in a dramatic, visible way. Azure buying decisions are still driven by enterprise agreements, migration road maps, data gravity, AI demand, regulatory requirements, and the practical realities of application modernization. A company does not walk away from its Azure footprint because employees dislike a desktop rollout.

There is also a structural reason the short-term effect is muted. Most Azure customers run a mixed environment already. Even in Microsoft-heavy enterprises, cloud workloads are often Linux-based, containerized, or managed through cross-platform tools. The Azure strategy today is less “run Windows everywhere” and more “meet customers where they are.” That makes the desktop operating system less immediately determinative than it was a decade ago.

However, that should not be confused with immunity. In the short run, Windows 11 can damage Microsoft’s credibility and affect adjacent buying decisions. If CIOs and architects see Microsoft overreaching on the client, they may become more skeptical of broader Microsoft platform bets. Skepticism does not always kill a deal, but it can slow expansion, increase competitive reviews, and make alternatives look more reasonable.

The risk of ecosystem decoupling

This is where the story gets serious. Microsoft’s power historically came from stack continuity. Windows on the desktop led to Windows Server, Active Directory, Microsoft management tools, Microsoft productivity software, Microsoft developers, and eventually Microsoft cloud. The company benefited from a kind of architectural momentum. Even when customers complained, they often stayed because the ecosystem fit together.

If Windows 11 reduces the footprint or strategic relevance of Windows on end-user devices, that continuity weakens. Lenovo is already shipping some business laptop lines with both Windows and Linux options, a sign that major manufacturers see practical demand for more operating system flexibility. More broadly, mainstream business laptop coverage now treats Linux-capable systems from Lenovo and Dell as credible enterprise choices rather than edge cases. That shift matters. Once manufacturers normalize OS choice, Microsoft loses part of its distribution advantage.

A reduced Windows footprint does not automatically mean Azure declines, but it does make non-Microsoft infrastructure easier to justify. If the endpoint is no longer assumed to be Windows, then the organization becomes more comfortable with Linux-first operations, browser-based productivity, identity abstraction, cross-platform management, and container-native development. At that point, AWS and Google Cloud gain more than competitive parity. They gain narrative momentum.

Who benefits from Microsoft’s missteps

AWS has long benefited from being seen as the neutral default for cloud infrastructure. Google Cloud benefits from strength in data, AI, Kubernetes, and open source. Both providers become more attractive when enterprises want to avoid deeper entanglement with a single vendor’s ecosystem. If Microsoft weakens the emotional and operational case for staying inside that ecosystem, competitors have less resistance to overcome.

Then there is the rise of sovereign clouds and neo clouds. Sovereign cloud offerings are increasingly attractive to governments, regulated industries, and companies navigating regional data control requirements. Neo clouds, especially GPU-centric specialists, are capturing interest from organizations that want AI infrastructure without buying into a full legacy enterprise stack. These providers are not necessarily replacing Azure across the board, but they are fragmenting the market and redefining what “best fit” looks like.

That fragmentation becomes more dangerous for Microsoft if Windows no longer functions as an ecosystem anchor. Once customers accept heterogeneity at the edge, they become more comfortable buying heterogeneity in the cloud.

Microsoft still has time to stop this from spreading. The fix is not complicated, although it may be culturally difficult. Microsoft has to make Windows useful before it makes Windows strategic. That means reducing forced experiences, making Copilot clearly optional, restoring confidence in the value of core OS improvements, and acknowledging that hardware gating created real resentment. It also means understanding that endpoint trust is not a side issue. It is part of the company’s larger cloud positioning.

If Microsoft treats Windows 11 as merely a noisy consumer controversy, it will miss the enterprise lesson. Platforms are built on confidence. Confidence on the desktop influences confidence in the data center and the cloud. The short-term Azure impact may be modest, but the long-term risk is real: If Windows stops being the front door to the Microsoft universe, Azure stops being the default destination.

That is how desktop mistakes become cloud problems. Not all at once, but gradually and then faster than expected.


7 safeguards for observable AI agents 24 Mar 2026, 9:00 am

Many organizations are under pressure to take their AI agent experiments and proofs of concept out of pilots and into production. Devops teams may have limited time to ensure these AI agents meet non-negotiable requirements for production deployments, including observability, monitoring, and other agenticops practices.

One question devops teams must answer is what their minimum requirements are to ensure AI agents are observable. Teams can start by extracting fundamentals from devops observability practices and layering in dataops observability for data pipelines and modelops for AI models.

But organizations also must extend their observability standards, especially as AI agents take over role-based tasks, integrate with MCP servers for more complex workflows, and support both human-in-the-middle and autonomous operations.

A key observability question is: Who did what, when, why, and with what information, from where? The challenging part is centralizing this information and having an observability data standard that works regardless of whether the decision or action came from an AI agent or a person.
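One way to make that standard concrete is a single event schema shared by human and AI agent actions. The sketch below is a hypothetical Python shape, not a published standard; all field names are illustrative:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class ActionEvent:
    actor_id: str    # who: user or agent identifier
    actor_type: str  # "human" or "ai_agent"
    action: str      # what was done
    reason: str      # why: stated intent or triggering instruction
    inputs: list     # with what information (source references)
    origin: str      # from where: host, pipeline stage, or orchestrator
    timestamp: str   # when (UTC, ISO 8601)

def record(actor_id, actor_type, action, reason, inputs, origin):
    """Emit one event in the shared human/agent schema."""
    return asdict(ActionEvent(actor_id, actor_type, action, reason, inputs,
                              origin, datetime.now(timezone.utc).isoformat()))

# Same call shape regardless of whether the actor is a person or an agent.
event = record("agent-42", "ai_agent", "refund_issued",
               "customer request in ticket 1881", ["crm:ticket-1881"], "orchestrator")
```

Because human and agent actions share one schema, the same queries and alerts apply to both, which is the centralization the question above demands.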

“Devops should apply the same content and quality processes to AI agents as they do for people by leveraging AI-powered solutions that monitor 100% of interactions from both humans and AI agents,” suggests Rob Scudiere, CTO at Verint. “The next step is observing, managing, and monitoring AI and human agents together because performance oversight and continuous improvement are equally critical.”

I asked experts to share key concepts and their best practices for implementing observable AI agents.

1. Define success criteria and operational governance

Observability is a bottom-up process for capturing data on an AI agent’s inputs, decisions, and operations. Before delving into non-functional requirements for AI agents and defining observability standards, teams should first review top-down goals, operational objectives, and compliance requirements.

Kurt Muehmel, head of AI strategy at Dataiku, says observable agents require three disciplines that many teams treat as afterthoughts:

  • Define success criteria because engineers can’t determine what “good” looks like alone. Domain experts need to help build evaluation datasets that capture edge cases only they would recognize.
  • Centralize visibility because agents are being built everywhere, including data platforms, cloud services, and across teams.
  • Establish technical operational governance before deployment, including evaluation criteria, guardrails, and monitoring.

Observability standards should cover proprietary AI agents, those from top-tier SaaS and security companies, and those from growing startups.

2. Define the information to track

Observability of AI agents is non-trivial for a handful of reasons:

  • AI agents are not only stateful but have memory and feedback loops to improve decision-making.
  • Actions may be triggered by people, autonomously by the AI agent, or orchestrated by another agent via an MCP server.
  • Tracking the agent’s behavior requires versioning and change tracking for the underlying datasets, AI models, APIs, infrastructure components, and compliance requirements.
  • Observability must account for additional context, including identities, locations, time considerations, and other conditions that can influence an agent’s recommendations.

Given the complexity, it’s not surprising that experts had many suggestions regarding what information to track.

“Teams should treat every agent interaction like a distributed trace with instrumentation at the various decision-making boundaries and capture the prompt, model response, the latency, and the resulting action in order to spot drift, latency issues, or unsafe behaviors in real time,” says Logan Rohloff, tech lead of cloud and observability at RapDev. “Combining these metrics with model-aware signals, such as token usage, confidence scores, policy violations, and MCP interactions enables you to detect when an agent is compromised or acting outside its defined scope.”

Devops teams will need to extend microservice observability principles to support AI agents’ stateful, contextual interactions.

“Don’t overlook the bits around session, context, and workflow identifiers as AI agents are stateful, communicate with each other, and can store and rehydrate sessions,” says Christian Posta, global field CTO at Solo.io. “We need to be able to track causality and flows across this stateful environment, and with microservices, there was always a big challenge getting distributed tracing in place at an organization. Observability is not optional, and without it, there’s no way you can run AI agents and be compliant.”
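A minimal sketch of the span-per-interaction idea, carrying the session and workflow identifiers Posta mentions. This assumes a simple callable model interface rather than any particular tracing library; in practice teams would emit these fields through their existing tracing stack:

```python
import time
import uuid

def traced_agent_call(session_id, workflow_id, prompt, model_fn):
    """Wrap one model call in a trace record that captures the prompt,
    response, latency, and the identifiers needed to correlate stateful
    agent sessions across interactions."""
    span = {
        "trace_id": uuid.uuid4().hex,
        "session_id": session_id,    # ties the span to a rehydratable session
        "workflow_id": workflow_id,  # ties the span to a cross-agent flow
        "prompt": prompt,
    }
    start = time.perf_counter()
    span["response"] = model_fn(prompt)
    span["latency_ms"] = round((time.perf_counter() - start) * 1000, 2)
    return span
```

Model-aware signals such as token usage or policy checks would be added as further span fields in the same way.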

Agim Emruli, CEO of Flowable, adds that “teams need to establish identity-based access controls, including unique agent credentials and defined permissions, because in multi-agent systems, traceability drives accountability.”

3. Identify errors, hallucinations, and dangerous recommendations

Instrumenting observable APIs and applications helps engineers address errors, identify problem root causes, improve resiliency, and research security and operational issues. The same is true for AI agents that autonomously complete tasks or make recommendations to human operators.

“When an AI agent hallucinates or makes a questionable decision, teams need visibility into the full trajectory, including system prompts, contexts, tool definitions, and all message exchanges,” says Andrew Filev, CEO and founder of Zencoder. “But if that’s your only line of defense, you’re already exposed because agentic systems are open-ended and operate in dynamic environments, requiring real-time verification. This shift started with humans reviewing every result and is now moving toward built-in self- and parallel verification.”

Autonomous verification will be needed as organizations add agents, integrate with MCP servers, and allow agents to connect to sensitive data sources.

“Observing AI agents requires visibility not only into model calls but into the full chain of reasoning, tools, and code paths they activate, so devops can quickly identify hallucinations, broken steps, or unsafe actions,” says Shahar Azulay, CEO and co-founder of Groundcover. “Real-time performance metrics like token usage, latency, and throughput must sit alongside traditional telemetry to detect degradation early and manage the real cost profile of AI in production. And because agents increasingly execute code and access sensitive data, teams need security-focused observability that inspects payloads, validates integrations like MCP, and confirms that every action an agent takes is both authorized and expected.”

4. Ensure AI agent observability addresses risk management

Organizations will recognize greater business value and ROI as they scale AI agents to operational workflows. The implication is that the ecosystem of AI agents’ observability capabilities becomes a fundamental part of the organization’s risk management strategy.

“Make sure that observability of agents extends into tool use: what data sources they access, and how they interact with APIs,” says Graham Neray, co-founder and CEO of Oso. “You should not only be monitoring the actions agents are taking, but also categorizing risk levels of different actions and alerting on any anomalies in agentic actions.”

Risk management leaders will be concerned about rogue agents, data issues, and other IT and security risks that can impact AI agents. Auditors and regulators will expect enterprises to implement robust observability into AI agents and have remediation processes to address unexpected behaviors and other security threats.

5. Extend observability to security monitoring and threat detection

Another consumer of observability data will be security operations centers (SOCs) and security analysts. They will connect the information to data security posture management (DSPM) and other security monitoring tools used for threat detection.

“I expect real insight into how the agent reacts when it connects to external systems because integrations create blind spots that attackers target,” says Amanda Levay, CEO of Redactable. “Leaders need this level of observability because it shows where the agent strains under load, where it misreads context, and where it opens a path that threatens security.”

CISOs will need to extend their operational playbooks as threats from AI actors grow in scale and sophistication.

“Infosec and devops teams need clear visibility into the data transferred to agents, their actions on data and systems, and the requests made of them by users to look for signs of compromise, remediate issues, and perform root-cause analysis,” says Mike Rinehart, VP of AI at Securiti AI. “As AI and AI agents become part of important data pipelines, teams must fold governance into prompts, integrations, and deployments so security, privacy, and engineering leaders act from a shared view of the data landscape and the risks that come with it.”

6. Evaluate AI agent performance

Addressing risk management and security concerns is one reason to implement observability in AI agents. The other is gauging an AI agent’s performance and providing indicators when improvements are needed.

“When I evaluate AI agents, I expect visibility into how the agent forms its decisions because teams need a clear signal when it drifts from expected behavior,” says Levay of Redactable. “I watch for moments when the agent ignores its normal sources or reaches for shortcuts because those shifts reveal errors that slip past general observability tools.”

To evaluate performance, Tim Armandpour, CTO of PagerDuty, says technology leaders must prepare for AI agents that fail subtly rather than catastrophically. He recommends, “Instrument the full decision chain from prompt to output and treat reasoning quality and decision patterns as first-class metrics alongside traditional performance indicators. The teams succeeding at this treat every agent interaction as a security boundary and build observability contracts that make agent behavior auditable and explainable in production.”

7. Prepare for AI observability agents that take action

The natural evolution of observability is when devops organizations turn signals into actions using AI observability agents.

“Observability shouldn’t stop at recording; you should be able to take action if an agent is going astray easily,” says Neray of Oso. “Make sure you can easily restrict agentic actions by tightening access permissions, removing a particular tool, or even fully quarantining an agent to stop rogue behavior.”

Observability data will fuel the next generation of IT and security operational AI agents that will need to monitor a business’s agentic AI operations. The question is whether devops teams will have enough time to implement observability standards, or whether business demand to deploy agents will drive a new era of AI technical debt.


An architecture for engineering AI context 24 Mar 2026, 9:00 am

Ensuring reliable and scalable context management in production environments is one of the most persistent challenges in applied AI systems. As organizations move from experimenting with large language models (LLMs) to embedding them deeply into real applications, context has become the dominant bottleneck. Accuracy, reliability, and trust all depend on whether an AI system can consistently reason over the right information at the right time without overwhelming itself or the underlying model.

Two core architectural components of Empromptu’s end-to-end production AI system, Infinite Memory and the Adaptive Context Engine, were designed to solve this problem, not by expanding raw context windows but by rethinking how context is represented, stored, retrieved, and optimized over time.

The core problem: Context as a system constraint

Empromptu is designed as a full-stack system for building and operating AI applications in real-world environments. Within that system, Infinite Memory and Adaptive Context Engine work together to solve one specific but critical problem: how AI systems retain, select, and apply context reliably as complexity grows.

Infinite Memory provides the persistent memory layer of the system. It is responsible for retaining interactions, decisions, and historical context over time without being constrained by traditional context window limits.

The Adaptive Context Engine provides the attention and selection layer. It determines which parts of that memory, along with current data and code, should be surfaced for any given interaction so the AI can act accurately without being overwhelmed.

Together, these components sit beneath the application layer and above the underlying models. They do not replace foundation models or require custom training. Instead, they orchestrate how information flows into those models, making large, messy, real-world systems usable in production.

In practical terms, Infinite Memory answers the question: What can the system remember? The Adaptive Context Engine answers the question: What should the system pay attention to right now?

Both are designed as infrastructure primitives that plug into Empromptu’s broader platform, which includes evaluation, optimization, governance, and integration with existing codebases. This is what allows the system to support long-running sessions, large codebases, and evolving workflows without degrading accuracy over time.

Most modern AI systems operate within strict context limits imposed by the underlying foundation models. These limits force difficult trade-offs:

  • Retain full interaction history and suffer from escalating latency, cost, and performance degradation.
  • Periodically summarize past interactions and accept the loss of nuance, intent, and critical decision history.
  • Reset context entirely between sessions and rely on users to restate information repeatedly.

These approaches may be acceptable in demos or chatbots, but they break down quickly in production systems that must operate over long time horizons, large document sets, or complex codebases.

In real applications, context is not a linear conversation. It includes prior decisions, system state, user intent, historical failures, domain constraints, and evolving requirements. Treating context as a flat text buffer inevitably leads to hallucinations, regressions, and brittle behavior.

The challenge is not how much context an AI system can hold at once, but how intelligently it can decide what context matters for any given action.

Infinite Memory: Moving beyond context windows

Infinite Memory represents a shift away from treating context as something that must fit inside a single prompt. Instead, it introduces a persistent memory layer that exists independently of the model’s immediate context window.

This memory layer captures all interactions, decisions, corrections, and system state over time. Importantly, Infinite Memory does not attempt to inject all of this information into every request. Instead, it stores information in structured, retrievable forms that can be selectively reintroduced when relevant.

From an architectural perspective, Infinite Memory functions more like a knowledge substrate than a conversation log. Each interaction contributes to a growing memory graph that records:

  • User intent and preferences
  • Historical decisions and their outcomes
  • Corrections and failure modes
  • Domain-specific constraints
  • Structural information about code, data, or workflows

This allows the system to support conversations and workflows of effectively unlimited length without overwhelming the underlying model. The result is an AI system that never forgets, but also never blindly recalls everything.
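A toy sketch of such a memory graph, with node kinds mirroring the list above. All names are illustrative assumptions, not Empromptu's actual API:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MemoryNode:
    kind: str     # "intent" | "decision" | "correction" | "constraint" | "structure"
    content: str
    outcome: Optional[str] = None              # recorded result, for decisions
    links: list = field(default_factory=list)  # ids of related nodes

graph = {}

def remember(node_id, node):
    """Every interaction contributes nodes; nothing is discarded."""
    graph[node_id] = node

def recall(kind):
    """Selective retrieval: fetch only nodes of one kind rather than
    replaying the entire history into the model's context."""
    return [n for n in graph.values() if n.kind == kind]
```

The key property is the asymmetry: writes are unconditional, while reads are always filtered, which is what "never forgets, but never blindly recalls everything" means in practice.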

Adaptive Context Engine: Attention as infrastructure

If Infinite Memory is the storage layer, the Adaptive Context Engine is the reasoning layer that decides what to surface and when to do so.

Internally, the Adaptive Context Engine is best understood as an attention management system. Its role is to continuously evaluate available memory and determine which elements are necessary for a specific request, task, or decision.

Unlike static prompt engineering approaches, the Adaptive Context Engine is dynamic and self-optimizing. It learns from usage patterns, outcomes, and feedback to improve its context selection over time. Rather than relying on predefined rules, it treats context selection as an evolving optimization problem.

Multi-level context management

The Adaptive Context Engine operates across multiple layers of abstraction, allowing it to manage both conversational and structural context.

Request harmonization

One of the most common failure modes in AI systems is request fragmentation. Users ask for changes, clarifications, and additions across multiple interactions, often referencing previous requests implicitly rather than explicitly.

Request harmonization addresses this by maintaining a continuously updated representation of the user’s cumulative intent. Each new request is merged into a harmonized request object that reflects everything the user has asked for so far, including constraints and dependencies.

This prevents the system from treating each interaction as an isolated command and allows it to reason over intent holistically rather than sequentially.
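A simplified sketch of the merge step, under the assumption that later instructions override earlier ones field by field while constraints accumulate. The real engine presumably resolves conflicts with model-driven reasoning rather than a dictionary merge:

```python
def harmonize(harmonized, new_request):
    """Merge a new request into the cumulative harmonized-request object.
    Non-constraint fields are overridden by the newer request; constraints
    are appended so earlier requirements are never silently dropped."""
    merged = dict(harmonized)
    merged.update({k: v for k, v in new_request.items() if k != "constraints"})
    merged["constraints"] = (harmonized.get("constraints", [])
                             + new_request.get("constraints", []))
    return merged
```

The system then reasons over the single merged object instead of the fragmented sequence of individual asks.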

Synthetic history generation

Rather than replaying full interaction histories, the system generates what we refer to as synthetic histories. A synthetic history is a distilled representation of past interactions that preserves intent, decisions, and constraints while removing redundant or irrelevant conversational detail.

From the model’s perspective, it appears as though there has been a single coherent exchange that already incorporates everything learned so far. This dramatically reduces token usage while also maintaining reasoning continuity. Synthetic histories are regenerated dynamically, allowing the system to evolve its understanding as new information arrives.
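The distillation step can be approximated as follows. A naive keyword heuristic stands in here for the model-driven selection the system actually performs; the function name and turn format are assumptions:

```python
def synthesize_history(turns):
    """Distill raw conversation turns into one synthetic exchange:
    keep turns that record decisions or constraints, drop filler."""
    keep_markers = ("decide", "must", "constraint", "require", "fix")
    kept = [t["text"] for t in turns
            if any(m in t["text"].lower() for m in keep_markers)]
    # The model sees a single coherent message, not the full transcript.
    return {"role": "user", "content": "Context so far: " + " ".join(kept)}
```

The token savings come from everything the heuristic drops; the continuity comes from everything it keeps.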

Secondary agent control

For complex tasks, particularly those involving large codebases or document collections, a single monolithic context is inefficient and error-prone. The Adaptive Context Engine employs secondary agents that operate as context selectors.

These secondary agents analyze the task at hand and determine which files, functions, or documents require full expansion and which can remain summarized or abstracted. This selective expansion allows the system to reason deeply about specific components without loading entire systems into context unnecessarily.

CORE Memory: Recursive context expansion at scale

The most advanced component of the Adaptive Context Engine is what we call Centrally-Operated Recursively-Expanded Memory (CORE Memory). This system addresses the challenge of working with large codebases or complex systems by creating associative trees of information.

CORE Memory automatically analyzes functions, files, and documentation to create hierarchical tags and associations. When the AI needs specific functionality, it can recursively search through these tagged associations rather than loading entire codebases into context. This allows for expansion on classes of files by tag or hierarchy, enabling manipulation of specific parts of code without context overload.
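The recursive tag expansion can be sketched as a bounded walk over a tag index. The index shape and depth limit here are illustrative assumptions, not the production data model:

```python
def expand(tag, index, depth=2):
    """Recursively collect the files associated with a tag, following
    child tags up to a fixed depth, so only the relevant slice of a
    codebase is ever pulled into context."""
    node = index.get(tag, {})
    files = list(node.get("files", []))
    if depth > 0:
        for child in node.get("children", []):
            files += expand(child, index, depth - 1)
    return files
```

Expanding the "auth" tag, for instance, would surface the auth-related files and their tagged dependencies while leaving the rest of the repository summarized.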

A production-grade system

Infinite Memory and the Adaptive Context Engine were built specifically for production environments, not research demos. Several design principles differentiate them from experimental context management approaches.

Self-managing context

The system is capable of operating across hundreds of documents or files while maintaining high accuracy. In production deployments, it consistently handles more than 250 documents without degradation while still achieving accuracy levels approaching 98%. This is accomplished through selective expansion, continuous pruning, and adaptive optimization rather than brute-force context injection.

Continuous optimization

The Adaptive Context Engine learns from real-world usage. It tracks which context selections lead to successful outcomes and which lead to errors or inefficiencies. Over time, this feedback loop allows the system to refine its attention strategies automatically, reducing hallucinations and improving relevance without manual intervention.

Integration flexibility

The architecture is designed to integrate with existing codebases, data stores, and foundation models. It does not require retraining models or rewriting systems. Instead, it acts as an orchestration layer that enhances reliability and performance across diverse environments.

Real-world applications

Together, Infinite Memory and the Adaptive Context Engine enable capabilities that are difficult or impossible with traditional context management approaches.

Extended conversations

There are no artificial limits on conversation length or complexity. Context persists indefinitely, supporting long-running workflows and evolving requirements without loss of continuity.

Deep code understanding

The system can reason over large, complex codebases while maintaining awareness of architectural intent, historical decisions, and prior modifications.

Learning from failure

Failures are not discarded. The system retains memory of past errors, corrections, and edge cases, allowing it to avoid repeating mistakes and to improve over time.

Cross-session continuity

Context persists across sessions, users, and environments. This allows AI systems to behave consistently and predictably even as usage patterns evolve.

Architectural benefits

Empromptu’s approach with Infinite Memory and the Adaptive Context Engine offers several advantages over traditional context management techniques.

  • Scalability without linear cost growth
  • Improved reasoning accuracy under real-world constraints
  • Adaptability based on actual usage rather than static rules
  • Compatibility with existing AI infrastructure

Most importantly, it reframes context not as a hard constraint, but as an intelligent resource that can be managed, optimized, and leveraged strategically.

As AI systems move deeper into production environments, context management has become the defining challenge for reliability and trust. Infinite Memory and the Adaptive Context Engine represent a shift away from brittle prompt-based approaches toward a more resilient, system-level solution. By treating memory, attention, and context selection as first-class infrastructure, it becomes possible to build AI applications that scale in complexity without sacrificing accuracy.

The future of applied AI will not be defined by larger context windows alone, but by architectures that understand what matters and when.

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.


VS Code now updates weekly 24 Mar 2026, 9:00 am

With Microsoft now releasing stable updates to its Visual Studio Code editor weekly instead of monthly, VS Code versions 1.112 and 1.111 have recently been released, featuring capabilities such as agent troubleshooting, integrated browser debugging, and Copilot CLI permissions. Also featured is the deprecation of VS Code’s Edit Mode.

VS Code 1.112 was released March 18, while VS Code 1.111 arrived on March 9. Both follow the last monthly update, VS Code 1.110, released March 4. Download instructions for the editor can be found on the Visual Studio Code website.

Integrated browser debugging on VS Code 1.112 means developers can open web apps directly within VS Code and can start debugging sessions with the integrated browser. This allows interaction with the web app, setting of breakpoints, stepping through code, and inspecting variables without leaving VS Code.

With VS Code 1.111, Edit Mode was officially deprecated. Users can temporarily re-enable Edit Mode via the chat.editMode.hidden setting, which will remain supported until Version 1.125. Beginning with Version 1.125, Edit Mode will be fully removed, and it will no longer be possible to enable it via settings.

For Copilot CLI sessions in VS Code 1.112, meanwhile, developers can configure permissions for local agent sessions in chat to give agents more autonomy in their actions and to reduce the number of approval requests. Developers can choose between permission levels, including default permissions, bypass approvals, and autopilot.

To reduce the risks of locally running Model Context Protocol (MCP) servers, developers with VS Code 1.112 now can run locally configured stdio MCP servers in a sandboxed environment on macOS and Linux. Sandboxed servers have restricted file system and network access.

Also in VS Code 1.112, agents can now natively read image files and binary files from disk. This allows developers to use agents for a wider variety of tasks, such as analyzing screenshots and reading data from binary files. Binary files are presented to the agent in hexdump format.

VS Code 1.111, meanwhile, emphasizes agent capabilities. To help developers understand and troubleshoot agent behavior, this release lets them attach a snapshot of agent debug events as context in chat by using #debugEventsSnapshot. This can be used to ask the agent about loaded customizations or token consumption, or to troubleshoot agent behavior. Developers also can select the sparkle chat icon in the top-right corner of the Agent Debug panel to add the debug events snapshot as an attachment to the chat composer. Selecting the attachment opens the Agent Debug panel logs, filtered to the timestamp when the snapshot was taken.

Also in the agent vein, VS Code 1.111 adds a new permissions picker in the Chat view for controlling how much autonomy the agent has. The permission level applies only to the current session. Developers can change it at any time during a session by selecting a different level from the permissions picker.

Further in the agent space, the custom agent frontmatter in VS Code 1.111 adds support for agent-scoped hooks that are only run when a specific agent is selected or when it is invoked via runSubagent. This enables attachment of pre- and post-processing logic to specific agents without affecting other chat interactions.

VS Code 1.111 also featured a preview of an autopilot capability. This lets agents iterate autonomously until they complete their task.


Designing self-healing microservices with recovery-aware redrive frameworks 24 Mar 2026, 9:00 am

Cloud-native microservices are built for resilience, but true fault tolerance requires more than automatic retries. In complex distributed systems, a single failure can cascade across multiple services, databases, caches or third-party APIs, causing widespread disruptions. Traditional retry mechanisms, if applied blindly, can exacerbate failures and create what is known as a retry storm, an exponential amplification of failed requests across dependent services.

This article presents a recovery-aware redrive framework, a design approach that enables self-healing microservices. By capturing failed requests, continuously monitoring service health and replaying requests only after recovery is confirmed, systems can achieve controlled, reliable recovery without manual intervention.

Challenges with traditional retry mechanisms

Retry storms occur when multiple services retry failed requests independently without knowledge of downstream system health. Consider the following scenario:

  • Service A calls Service B, which is experiencing high latency.
  • Both services implement automatic retries.
  • Each failed request is retried multiple times across layers.


In complex systems where services depend on multiple layers of other services, a single failed request can be retried multiple times at each layer. This can quickly multiply the number of requests across the system, overwhelming downstream services, delaying recovery, increasing latency and potentially triggering cascading failures even in components that were otherwise healthy. 
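The multiplication is worth making concrete. As a rough illustration (not from the article), if every layer independently retries each failed call, the worst-case load on the deepest service grows exponentially with call depth:

```python
def amplification(layers: int, retries_per_layer: int) -> int:
    """Worst-case number of calls reaching the deepest service when
    every layer independently retries each failed request."""
    # Each layer turns one incoming call into (1 + retries) downstream calls.
    attempts_per_layer = 1 + retries_per_layer
    return attempts_per_layer ** layers

# Three layers, each retrying 3 times: one user request can become
# 4 ** 3 = 64 calls against the already-struggling downstream service.
print(amplification(layers=3, retries_per_layer=3))
```

This is why the article calls blind retries an "exponential amplification": the retry budget compounds at every layer rather than being coordinated across them.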

Recovery-aware redrive framework: System design

The recovery-aware redrive framework is designed to prevent retry storms while ensuring all failed requests are eventually processed. Its core design principles include:

  • Failure capture: All failed requests are persisted in a durable queue (e.g., Amazon SQS) along with their payloads, timestamps, retry metadata and failure type. This guarantees exact replay semantics.

  • Service health monitoring: A serverless monitoring function (e.g., AWS Lambda) evaluates downstream service metrics, including error rates, latency and circuit breaker states. Requests remain queued until recovery is confirmed.

  • Controlled replay: Once system health indicates recovery, queued requests are replayed at a controlled rate. Failed requests during replay are re-enqueued, enabling multi-cycle recovery while avoiding retry storms. Replay throughput can be dynamically adjusted to match service capacity.
[Diagram: Recovery-aware redrive framework for self-healing microservices. Credit: Anshul Gupta]
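The failure-capture principle above can be sketched in a few lines. This is a minimal illustration, not the article's implementation: the `queue` object stands in for a durable queue client such as Amazon SQS, and only needs a hypothetical `send(message)` method.

```python
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class FailedRequest:
    """Everything needed for exact replay semantics: the original
    payload, the failure type, a timestamp, and retry metadata."""
    payload: dict
    failure_type: str          # e.g. "TimeoutError", "http_503"
    captured_at: float = field(default_factory=time.time)
    attempts: int = 1

def capture_failure(queue, request_payload: dict, error: Exception) -> FailedRequest:
    # Persist the failed request, serialized with its metadata,
    # so it can be replayed verbatim after recovery is confirmed.
    record = FailedRequest(payload=request_payload,
                           failure_type=type(error).__name__)
    queue.send(json.dumps(asdict(record)))
    return record
```

In a real deployment the `send` call would target a durable store (SQS, Kafka, a dead-letter table) so captured requests survive process restarts.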

Operational flow

The framework operates in three stages:

  1. Failure detection: Requests failing at any service are captured with full metadata in the durable queue.
  2. Monitoring and recovery detection: Health metrics are continuously analyzed. Recovery is considered achieved when all monitored metrics fall within predefined thresholds.
  3. Replay execution: Requests are replayed safely after recovery, with throughput limited to prevent overload. Failures during replay are returned to the queue for subsequent attempts.
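The three stages can be tied together in a short sketch. This is an illustrative outline under simplified assumptions, not the article's code: `is_healthy` stands in for the monitoring function's verdict, `handler` for the downstream call, and a plain deque for the durable queue.

```python
from collections import deque
from typing import Callable

def replay(queue: deque,
           is_healthy: Callable[[], bool],
           handler: Callable[[object], None],
           max_per_cycle: int = 10) -> int:
    """Replay queued requests only while the downstream service reports
    healthy, at a bounded rate; re-enqueue anything that fails again."""
    processed = 0
    while queue and processed < max_per_cycle:
        if not is_healthy():
            break                      # gate: stop replaying on relapse
        request = queue.popleft()
        try:
            handler(request)
        except Exception:
            queue.append(request)      # multi-cycle recovery: retry later
        processed += 1
    return processed
```

The `max_per_cycle` bound is what keeps replay from becoming its own retry storm: throughput is throttled to match what the recovering service can absorb.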

This design ensures safe, predictable retries without amplifying failures. By decoupling failure capture from replay and gating retries based on real-time service health, the system prevents premature retries that could overwhelm recovering services. It also maintains end-to-end request integrity, guaranteeing that all failed requests are eventually processed while preserving the original payload and semantics. This approach reduces operational risk, avoids cascading failures and supports observability, allowing engineers to track failures, recovery events and replay activity in a controlled and auditable manner.

Implementation in cloud-native environments

A practical implementation involves:

  • Failure capture function: Intercepts failed API calls and writes them to a queue.
  • Monitoring function: Evaluates downstream service health continuously.
  • Replay function: Dequeues messages at a controlled rate after recovery, re-queuing failures as necessary.


This decoupling of failure capture from replay enables true self-healing microservices, reducing the need for human intervention during outages.

Benefits of recovery-aware redrive

Implementing a recovery-aware redrive framework offers several operational advantages that directly impact system reliability and resilience. By intelligently managing failed requests and controlling replay based on actual service health, this design not only prevents uncontrolled traffic amplification but also ensures that every request is eventually processed without manual intervention. In addition, it enhances visibility into system behavior, providing actionable insights for troubleshooting and capacity planning. These benefits make the framework particularly well-suited for modern cloud-native environments where stability, observability and cross-platform compatibility are critical.

  • Prevents retry storms: Ensures request amplification is bounded.
  • Maintains reliability: Guarantees that all failed requests are eventually processed.
  • Supports observability: Logs all failures, replay attempts and system metrics for auditing and troubleshooting.
  • Platform agnostic: Compatible with Kubernetes, serverless or hybrid cloud environments.

Best practices

  • Design requests to be idempotent or safely deduplicated.
  • Base monitoring on real system metrics rather than static timers.
  • Throttle replay throughput dynamically according to system capacity.
  • Maintain audit logs of failures and replay activities for operational transparency.
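The first best practice, idempotent or safely deduplicated requests, is the one that most often trips up replay designs. One common pattern (a hypothetical sketch, not from the article) is to key each request with an ID and skip replays that have already been applied:

```python
class DeduplicatingHandler:
    """Wraps a request handler so replays of the same request ID are
    applied at most once, making replay safe even when the underlying
    operation is not naturally idempotent."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()   # in production: a durable store, not memory

    def __call__(self, request_id: str, payload: dict) -> str:
        if request_id in self._seen:
            return "skipped"           # duplicate replay: no side effects
        self._handler(payload)
        self._seen.add(request_id)
        return "applied"
```

With this in place, a request that was captured, replayed, and then captured again after a transient failure cannot double-charge a customer or double-write a record.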

Conclusion

Self-healing microservices require more than traditional retries. A recovery-aware redrive framework provides a structured approach to capture failed requests, monitor downstream service health and replay them safely after recovery. This framework prevents retry storms, improves observability and enables cloud-native systems to recover autonomously from outages, delivering resilient and reliable services in complex distributed environments.

This article is published as part of the Foundry Expert Contributor Network.
Want to join?


The agent security mess 23 Mar 2026, 9:00 am

Persistent weak layers (PWLs) have plagued my backcountry skiing for the past 10 years. They’re about to mess up the industry’s IT security, too.

For those who don’t spend their early mornings skinning up mountains in Utah’s backcountry, a persistent weak layer, or PWL, is exactly what it sounds like. It’s a fragile layer of snow, often faceted crystals that form during cold and dry spells, which gets buried by subsequent storms. That PWL lies in wait for a trigger: Perhaps a skier hitting a shallow rock band, a sudden spike in spring temperatures, or a heavy snowfall. At that moment, the entire slab above it shatters, slides, and, all too often, kills people.

Enterprise access control is built on its own version of a colossal PWL. For years, we’ve piled new roles, temporary privileges, and overly broad static profiles on top of an unmanaged foundation of dormant access. The structure has held up because people are relatively gentle triggers: We’re slow, easily distracted, and generally prefer to keep our jobs.

But AI agents aren’t human skiers moving carefully down a slope. They’re a massive, rapid loading event, a trigger primed to spark an “avalanche” in your data center.

OK, computer?

This is the core takeaway from new research published by Oso and Cyera, which finally puts hard numbers to a problem that’s been visible but ignored for years. Their research analyzed 2.4 million workers and 3.6 billion application permissions, and the results should concern us. According to the Oso blind spot report, corporate workers completely ignore 96% of their granted permissions. Over a 90-day window, only 4% of granted permissions were ever actually exercised. With sensitive enterprise data, it’s even worse: Workers touch only 9% of the sensitive data they can actually reach, and nearly one-third of users have the power to modify or delete sensitive data.

Seems ok, right? I mean, the fact that they’re not exercising their rights to certain applications or data isn’t a big deal, is it? So long as they don’t use what they have access to, we’re good. Right?

Nope. Maybe this isn’t an issue in a world where people plod about, ignoring their access rights. But when we add autonomous agents to the mix, things get problematic very, very fast. As I’ve argued, the enterprise AI problem isn’t just a matter of hallucinations. It’s really about permissions. Humans act as a natural governor on permission sprawl. A marketing employee might technically have the right to view a million customer records but will only ever look at the 30 they need to finish their campaign for the quarter. The risk (the “persistent weak layer”) remains entirely dormant.

Agents remove that governor entirely.

When an AI agent inherits a human user account, it inherits the entire permission surface, not just the tiny fraction the human actually used. Because agents operate continuously, chain actions across various systems, and execute whatever privileges they possess without hesitation, they turn latent permission debt into active operational risk. If an agent is told to clean up stale records and it happens to hold the dormant permission to modify the entire database, it will attempt to do exactly that.

Fixing permissions

This aligns perfectly with a drum I’ve been beating for years. Back in 2021, I wrote that authorization was rapidly becoming the most critical unresolved challenge in modern software architecture. A year later, I argued that identity and trust must be baked into the development life cycle, not bolted on by a separate security team right before launch. More recently, I’ve pointed out that large language models demand a totally new approach to authorization, that boring governance is the only path to real AI adoption, and that the true challenge in agentic systems is building a robust AI control plane.

The smartest players in the space are already treating this as table stakes. In its framework for trustworthy agents, Anthropic explicitly notes that systems like Claude Code default to read-only access and require human approval before modifying code or infrastructure. Microsoft offers similar guidance, warning against overprivileged applications and demanding tightly scoped service accounts. They understand that in the age of autonomous software, the old assumption that an application probably won’t use a dormant permission is foolish.

The problem won’t stay neatly confined to a single SaaS application, either. We’re already dealing with a world where nonhuman identities are proliferating rapidly. A 2024 industry report from CyberArk notes that machine identities now outnumber human identities by massive margins, often 80 to 1 or higher. A huge chunk of those machine identities have privileged or sensitive access, and most organizations completely lack identity security controls for AI.

Read-only as a default

So, how do we fix the PWL before the avalanche hits? This isn’t something you solve with a clever prompt, a larger context window, or a new foundational model. It’s an architecture problem.

Putting aside the overprovisioned humans (that’s a separate blog post), we can curtail agentic misuse of permissions by building golden paths where the default state for any new AI agent is strictly read-only. We have to stop the reckless, albeit convenient, practice of letting an agent inherit a broad employee account just to make a pilot project work faster for a sprint demo.

Agents require purpose-built identities with aggressively minimal permissions. If 96% of a human user’s access goes unused anyway, we can’t grant that excess access to a machine. We need environments where the ability to draft an action and the ability to execute it are entirely separate permissions. We need explicit approvals for any destructive actions, and we need every single automated action logged and fully reversible.
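The draft/execute separation described above can be made concrete. The following is a toy sketch of the idea, with invented names, not any vendor's actual API: actions are read-only by default, anything mutating requires an explicit approval flag, and every decision is logged.

```python
READ_ONLY_ACTIONS = {"read", "list", "draft"}

class AgentGate:
    """Read-only by default: an agent may draft or read anything, but
    executing a mutating action requires explicit, logged approval."""

    def __init__(self):
        self.audit_log = []   # every decision recorded for reversibility

    def request(self, agent: str, action: str, approved: bool = False) -> str:
        allowed = action in READ_ONLY_ACTIONS or approved
        self.audit_log.append((agent, action, "allowed" if allowed else "denied"))
        if not allowed:
            raise PermissionError(f"{agent} may not '{action}' without approval")
        return f"{action} executed"
```

The point is architectural, not syntactic: the agent's identity carries only the minimal permission set, the destructive path requires a human in the loop, and the audit trail exists before anything runs.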

We spend so much time debating the intelligence of these new models while we ignore the ground they walk on. AI agents aren’t creating a brand-new authorization crisis. They’re simply exposing the persistent weak layer we’ve been ignoring for years. We tolerated bloated roles and static profiles because humans were slow enough to keep the damage theoretical. Agents make it concrete. Hopefully, they’ll also make us pay attention to authorization in ways we largely haven’t.


How to land a software development job in an AI-focused world 23 Mar 2026, 9:00 am

It seems we are in a very perplexing and somewhat worrisome time in the technology job market. Artificial intelligence is disrupting workflows and changing job descriptions, while many companies are shedding staff due to years of “overhiring.” Some companies are freezing hiring due to market uncertainty.

Whether you are seeking a full-time position or contracting work, AI—for better or worse—has really changed the game for software developers.

AI is not eliminating software developers, but it is changing what it means to be a good one, according to Loren Absher, director and Americas lead for applied AI advisory at ISG, a research and advisory firm. “Organizations are moving away from hiring developers for how fast they can write code and toward hiring for how well they understand the problem they are solving.”

As AI takes over more routine coding, the differentiator shifts to judgment, system design, and context, Absher says. “That context increasingly includes deep understanding of the industry the software is being built for.”

With fewer jobs and more candidates on the market, “searching for a developer role looks different in 2026,” says Kyle Elliott, a career and executive coach specializing in the technology industry.

“Previously, developers could send out 10 applications and expect to land interviews for a significant portion of them,” Elliott says. “Today, they must be much more strategic with the companies and roles they target, how they craft their resumes and position themselves in the market, and even how they follow up after applying.”

“We’re in a confusing transition period, [but] what I’m seeing is not a collapse; it’s a recalibration,” says Sonu Kapoor, an independent software engineer. “AI hasn’t removed the need for software developers. It has raised expectations around how developers work and the value they bring.”

How are successful job seekers landing software development jobs and contract gigs in this turbulent environment? We asked experts for their tips.

Present a balanced resume

Hiring managers want to see a diversity of attributes in candidates, including having the right skills, relevant certifications, and on-the-job experience. Highlighting AI-related skills, certifications, and experience could provide an extra boost.

“Landing a role today still requires a strong foundation,” says Natalia Rodriguez, vice president of talent acquisition at BairesDev, which reviews more than 2.5 million applications per year. “Engineers who succeed often combine strong core skills with newer capabilities, such as machine learning frameworks and emerging AI-specific techniques. Developers should take the time to really sharpen their fundamentals in programming, data structures, and system design.”

While the right skills and certifications are important, it might be having just the right kind of experience that lands the job. “Software developers must be able to demonstrate they’ve built something effectively, not just talk about it,” says Sheldon Arora, CEO of healthcare staffing agency StaffDNA. Companies and employers are looking for successful implementation on real projects, he says.

While technical skills are important, hiring managers are looking for other skills aimed at supporting organizational goals.

“The developers who thrive in this environment will not be those who simply code faster, but those who combine technical depth with industry knowledge, systems thinking, and sound judgment,” Absher says. “Those are deeply human skills, and they are becoming more valuable as AI becomes more capable.”

Specialize in a valuable niche

If you have specialized in a certain discipline within software development and that specialization is in demand, that could be a big advantage in finding a new job or acquiring freelance assignments.

Kapoor has worked in front-end development with Angular for more than 10 years, and for more than 20 years in software development overall. He thinks that long-term focus is why Google awarded him the Google Developer Expert (GDE) designation for Angular.

“In today’s market, deep specialization builds trust, and trust drives hiring decisions,” Kapoor says.

Build your reputation and network

Developers who demonstrate skills through open source, conference talks, podcasts, and in-depth writing stand out immediately to hiring professionals, Kapoor says.

“I’ve landed several high-paying contracts simply because companies already knew my work,” Kapoor says. “Jobs found me rather than the other way around.”

In a confusing and sometimes crowded technical job market, standing out is more important than ever. Personal branding gives professionals an opportunity to showcase their expertise.

“Developers need to think of themselves as products,” Kapoor says. “Building a visible brand, growing a strong network, collaborating with leaders in your field, and choosing the right platforms, such as niche podcasts or respected publications, compounds over time and creates opportunities that traditional job searches often miss.”

Most developers assume the job market lives on LinkedIn and job boards, says Kolby Goodman, career coach at career site The Job Huntr. “What I see is the best opportunities are being talked about in Slack channels, sprint planning, and leadership meetings,” he says. “The developers who get hired quickly are the ones building relationships with team leads and product managers before a job ever exists.”

Become an AI prompt master

You can also make yourself attractive to hiring managers by becoming proficient at the AI skills they are looking for, such as writing quality prompts.

“Developers who know how to use AI tools effectively, especially how to write clear, precise prompts, are dramatically more productive,” Kapoor says. “Prompting isn’t about replacing engineering skill; it’s about amplifying it. Engineers who can translate requirements into high-quality prompts deliver faster without sacrificing quality.” AI shifts the emphasis toward developers who understand systems, context, and long-term impact, Kapoor says. “AI handles speed, experienced engineers handle judgment,” he says.

Consider contract work

Developers can do quite well with freelancing. Contract work lets you choose your clients and assignments, decide your own work hours, and work from home, among other benefits. Contract work also can lead to full-time employment, either with an existing client or another organization.

“Developers should treat contracts as a doorway, not a downgrade,” Arora says. “Contract roles convert to full-time frequently. Developers should be open to contract and full-time work, especially in a market where companies want to prove you’re a fit before making a permanent hire.”

Customize your resume

Sending out a resume that has no relevance to a particular job is a waste of time, especially in a competitive field where hiring managers are looking for specific skill sets and experiences. Customizing your resume helps you stand out and shows that your skills and experiences align with specific job requirements.

“Take the time to tailor your resume to each role,” Elliott says. “Since companies are receiving hundreds, if not thousands, of applications for a single position, you need to clearly demonstrate how you’re aligned. Set a timer for 20 to 30 minutes and use that time to strategically weave keywords from the job description throughout your resume.”

Applying this technique landed a recent technology client seven interviews, notes Elliott, and six of them came from cold online applications with no contacts at the company.

Highlight project deliverables

Make sure your resume includes the tangible results of your projects and assignments, not just a laundry list of skills.

“Nowadays, there are plenty of candidates who write ‘I know 10 frameworks’ on their resumes,” says Anastasiya Levantsevich, head of people and culture at software development company Pynest. “However, finding a specialist who lists ‘delivering results’ as their strength is not so easy.”

The simplest way for a candidate to stand out is to create a portfolio in the before/after style, Levantsevich says. “For example: it was slow, now it’s faster; there was a lot of manual routine work, now there’s less; there was chaos in the logs, now everything is structured and consistent,” she says.

Follow up after you apply

Don’t be afraid to follow up on your application. This can help you stand out in a crowded field of applicants.

“This can feel particularly foreign for developers who may feel uncomfortable cold messaging a recruiter or contact at their target company,” Elliott says. “But it can significantly increase your chances of landing an interview. I’ve had multiple clients secure interviews because they were politely persistent, took the time to find the recruiter or hiring manager on LinkedIn, and forwarded their resume directly.”

Practice your presentation

How you present yourself during the interview stage can make all the difference in landing a job. Hiring managers want to see how your mind works and what kind of effort you put forth in solving problems.

“Demonstrate your thought process, the types of questions you ask, how you verify your work, and in what situations this is necessary, and how and when you use AI as an assistant,” Levantsevich says. “At Pynest, we have hired people after a short dialogue like, ‘I see you have X, I’ve done something similar with Y, I can show you,’” she says. “This sounds professional and saves time for both sides.”

Focus on the right industries

Some sectors are growing faster than others, and therefore might require more software development expertise.

“Healthcare and health tech are among the most durable hiring markets in the U.S. right now,” Arora says. “There are chronic labor shortages, and APIs, integrations, and data pipeline development are skills companies and medical organizations need. It’s also good to have in-depth working knowledge of workflow and operations software.”

“We see growing opportunities for tech talent across healthcare, fintech, logistics, and ecommerce,” Rodriguez says. “Prepare yourself by understanding the domain constraints shaping technical decisions, whether that’s regulatory requirements, data sensitivity, scalability, or reliability expectations. To stand out, make sure to address how you’ve applied your knowledge within industry contexts.”


OpenAI’s desktop superapp: The end of ChatGPT as we know it? 20 Mar 2026, 6:02 pm

OpenAI is reportedly planning to fold its ChatGPT application, Codex coding platform, and AI-powered browser into a single desktop ‘superapp’, a move that signals a shift toward enterprise and developer audiences and away from the consumer market that made the company a household name.

The unified product will merge the ChatGPT interface, the Codex coding tool, and OpenAI’s browser known internally as Atlas into a single desktop application, the Wall Street Journal reported Thursday. The mobile version of ChatGPT is not part of the consolidation and will remain unchanged. OpenAI President Greg Brockman will temporarily oversee the product overhaul and associated organizational changes, while Chief of Applications Fidji Simo leads the commercial effort to bring the new app to market, the report added.

Simo confirmed the plan the same day in a post on X. “Companies go through phases of exploration and phases of refocus; both are critical,” she wrote. “But when new bets start to work, like we’re seeing now with Codex, it’s very important to double down on them and avoid distractions.”

The superapp announcement follows an all-hands meeting on March 16, in which Simo told employees the company needed to stop being distracted by “side quests” and orient aggressively toward coding and business users.

“We realized we were spreading our efforts across too many apps and stacks, and that we need to simplify our efforts,” Simo told employees, according to the Journal’s report on the meeting. “That fragmentation has been slowing us down and making it harder to hit the quality bar we want.” At the same meeting, Simo outlined the commercial imperative plainly: “Our opportunity now is to take those 900 million users and turn them into high-compute users. We’ll do that by transforming ChatGPT into a productivity tool.”

More than a product refresh

The superapp is being designed around agentic AI, systems capable of autonomously executing multi-step tasks such as writing and debugging software, analyzing data, and completing complex workflows without continuous human instruction, the Journal reported. That positions it less as a consumer chatbot and more as an AI-powered work environment aimed at developers and enterprise knowledge workers.

Sanchit Vir Gogia, chief analyst at Greyhound Research, said the move goes beyond product consolidation. “This is not a clean enterprise pivot — it is a forced convergence driven by internal fragmentation, competitive pressure, and the need to monetize where value is actually realized,” he said. “The real value is shifting to where intent becomes action. That is workflows, not conversations.”

The announcement is the latest in a series of enterprise-facing moves. In February, OpenAI launched Frontier, an agent orchestration platform, and announced partnerships with Accenture, BCG, Capgemini, and McKinsey to embed its technology into business workflows.

The numbers behind the pivot

The urgency behind these moves becomes clear when the competitive data is examined. According to enterprise spend management software vendor Ramp, a year ago only one in 25 businesses on its platform paid for Anthropic; today that figure has jumped to nearly one in four. In new enterprise deals, Anthropic is now winning approximately 70% of head-to-head matchups against OpenAI, it said.

Gogia, however, flagged a structural risk. ChatGPT’s dominance was built on simplicity and universal accessibility, qualities a workflow-centric superapp trades away. “In trying to serve consumers, developers, and enterprises within a single interface, OpenAI risks diluting the very clarity that made ChatGPT dominant,” he said.

That risk is compounded by a governance challenge that enterprise IT leaders are only beginning to reckon with.

The governance gap

For IT leaders evaluating OpenAI tooling, Gogia pointed to a deeper challenge the superapp introduces. “The biggest constraint on agentic AI is not capability. It is control,” he said. “Identity management is not designed for non-human actors. Audit trails are incomplete. And there is no mature control plane that governs how agents act, what they access, and how those actions can be reversed or contained.”

Microsoft and Google hold a structural advantage here: Their AI is embedded within platforms that already manage identity, access, and compliance at enterprise scale, a gap enterprise buyers have repeatedly flagged as a persistent concern with OpenAI’s approach. It is precisely that trust deficit that has given Anthropic its opening.

“The battle is no longer about who builds the best chatbot. It is about who owns how work gets done,” Gogia said. “Enterprises are making platform decisions now — and those decisions will not be based on who is most advanced. They will be based on who is most dependable.”

OpenAI did not immediately respond to a request for comment.

This article first appeared on Computerworld.


Google’s Stitch UI design tool is now AI-powered 20 Mar 2026, 5:19 pm

Google is introducing AI into its Stitch UI design tool, enabling anyone to create user-interface designs by describing them in natural language or using markdown.

It can also be used to copy the design of an existing web page — or “easily extract a design system from any URL,” as Google put it in a blog post describing the new feature.

The thinking behind the development is that users often have a variety of ideas in the initial part of the design process. Businesses will now be able to see a visual representation of those ideas, whether generated from text, images, or code.

Google has said that Stitch will also be paired with a new design agent that can reason across the entire project’s evolution. In addition, it has introduced an Agent manager that helps users to track their progress as well as allowing them to work on multiple ideas in parallel.


Stop using AI to submit bug reports, says Google 20 Mar 2026, 4:48 pm

Google will no longer accept AI-generated submissions to a program it funded to find bugs in open-source software. However, it is contributing to a separate program that uses AI to strengthen security in open-source code.

The Google Open Source Software Vulnerability Reward Program team is increasingly concerned about the low quality of some AI-generated bug submissions, with many including hallucinations about how a vulnerability can be triggered or reporting bugs with little security impact.

“To ensure our triage teams can focus on the most critical threats, we will now require higher-quality proof (like OSS-Fuzz reproduction or a merged patch) for certain tiers to filter out low-quality reports and allow us to focus on real-world impact,” Google wrote in a blog post.

The Linux Foundation too is finding the volume of AI-generated bug submissions overwhelming and has sought financial help from AI companies including Google, Anthropic, AWS, Microsoft, and OpenAI to deal with the problem. Together, they are contributing $12.5 million to the foundation to improve the security of open-source software.

“Grant funding alone is not going to help solve the problem that AI tools are causing today on open-source security teams,” said Greg Kroah-Hartman of the Linux kernel project in a blog post. “OpenSSF has the active resources needed to support numerous projects that will help these overworked maintainers with the triage and processing of the increased AI-generated security reports they are currently receiving.”

The funding will be managed by the open-source security project Alpha-Omega and the Open Source Security Foundation (OpenSSF) and will be used to provide AI tools that help maintainers deal with the volume of AI-generated submissions.

“We are excited to bring maintainer-centric AI security assistance to the hundreds of thousands of projects that power our world,” said Alpha-Omega co-founder Michael Winser.


The ‘toggle-away’ efficiencies: Cutting AI costs inside the training loop 20 Mar 2026, 10:00 am

A single training run can emit as much CO₂ as five cars do over their entire lifetimes.

That finding from the University of Massachusetts, Amherst, has become the defining statistic of the generative AI era. But for the engineers and data scientists staring at a terminal, the problem isn’t just carbon, it’s the cloud bill.

The industry narrative suggests that the only solution is hardware: buying newer H100s or building massive custom silicon. But after combing through academic benchmarks, cloud billing dashboards and vendor white papers, I’ve found that roughly half of that waste is a “toggle away”.

Training efficiency isn’t about squeezing GPUs harder; it’s about spending smarter for the same accuracy. The following methods focus on training-time cost levers, changes inside the loop that cut waste without touching your model architecture.

(Note: All code examples below are available in the accompanying Green AI Optimization Toolkit repository.)

The compute levers: Taking weight off the chassis

The easiest way to speed up a race car is to take weight off the chassis. In Deep Learning, that weight is precision.

For years, 32-bit floating point (FP32) was the default. But today, switching to mixed-precision math (FP16/INT8) is the highest-ROI change a practitioner can make. On hardware with dedicated tensor units, like NVIDIA Ampere/Hopper, AMD RDNA 3 or Intel Gaudi 2, mixed precision can increase throughput by 3x or more.

However, this isn’t a magic wand for everyone. If you are running on pre-2019 GPUs (like the Pascal architecture) that lack Tensor Cores, you might see almost no speed gain while risking numerical instability. Similarly, compliance workloads in finance or healthcare that require bit-exact reproducibility may need to stick to FP32.

But for the 90% of use cases involving memory-bound models (ResNet-50, GPT-2, Stable Diffusion), the shift is essential. It also unlocks gradient accumulation, allowing you to train massive models on smaller, cheaper cards by simulating larger batch sizes.

The implementation: here is how to implement mixed precision and gradient accumulation in PyTorch. This setup simulates a batch size of 64 on a GPU that can only fit 8 samples.

python
# From 'green-ai-optimization-toolkit/01_mixed_precision.py'

import torch
from torch.cuda.amp import autocast, GradScaler  # on newer PyTorch, torch.amp

# Simulate a batch size of 64 using a micro-batch of 8
eff_batch_size = 64
micro_batch = 8
accum_steps = eff_batch_size // micro_batch

scaler = GradScaler()   # prevents gradient underflow in FP16
optimizer.zero_grad()   # start from clean gradients before accumulating

for i, (data, target) in enumerate(loader):
    # 1. The toggle: run the forward pass in FP16
    with autocast():
        output = model(data)
        loss = criterion(output, target)
        loss = loss / accum_steps  # normalize so accumulated grads match a full batch

    # 2. Scale gradients and accumulate
    scaler.scale(loss).backward()

    # 3. Step only after N micro-batches
    if (i + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()

The data levers: Feeding the beast

If your GPU utilization is hovering around 40%, you aren’t training a model; you are burning cash. The bottleneck is almost always the data loader.

A common mistake is treating data preprocessing as a per-epoch tax. If you use expensive text tokenizers (like Byte-Pair Encoding) or complex image transforms, cache pre-processed data. Tokenize or resize once, store the result and feed it directly.

Furthermore, look at your file formats. Reading millions of small JPEG or CSV files over a network file system kills I/O throughput due to metadata overhead. Instead, stream data via archives. Sharding your dataset into POSIX tar files or binary formats like Parquet/Avro allows the OS to read ahead, keeping the GPU hungry.

Watch out for:

  • Storage ballooning: Caching pre-processed data can triple your storage footprint. You are trading storage cost (cheap) for compute time (expensive).
  • Over-pruning: While data deduplication is excellent for web scrapes, be careful with curated medical or legal datasets. Aggressive filtering might discard rare edge cases that are critical for model robustness.
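The tokenize-once approach described above can be sketched in a few lines. This is a minimal, framework-agnostic illustration: the `tokenize` callable and the cache path are hypothetical stand-ins for whatever preprocessing your pipeline runs per epoch.

```python
import os
import pickle

def cached_preprocess(samples, cache_path, tokenize):
    """Tokenize once, reuse across epochs: trade cheap storage for expensive compute."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)  # later epochs: pure I/O, no CPU-heavy tokenization
    processed = [tokenize(s) for s in samples]
    with open(cache_path, "wb") as f:
        pickle.dump(processed, f)
    return processed
```

The first epoch pays the preprocessing cost; every subsequent epoch is a straight disk read, which is exactly the storage-for-compute trade flagged in the first watch-out above.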

The operational levers: Safety and scheduling

The most expensive training run is the one that crashes 99% of the way through and has to be restarted.

In the cloud, spot instances (or pre-emptible VMs) offer discounts of up to 90%. To use them safely, you must implement robust checkpointing. Save the model state frequently (every epoch or N steps) so that if a node is reclaimed, you lose minutes of work, not days.
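A framework-agnostic sketch of that checkpointing discipline is below. In a real PyTorch job the state dict would hold `model.state_dict()` and `optimizer.state_dict()`; the key idea here is the atomic rename, which guarantees a reclaimed node can never leave a half-written checkpoint behind.

```python
import os
import pickle

def save_checkpoint(state, path):
    """Write to a temp file, then rename: os.replace is atomic on POSIX,
    so readers only ever see a complete checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    """Resume from the last completed step, or start fresh."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path, "rb") as f:
        return pickle.load(f)
```

Call `save_checkpoint` every N steps; on restart, `load_checkpoint` tells you where to resume, so a reclaimed spot node costs you minutes, not days.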

Open-source orchestration frameworks like SkyPilot have become essential here. SkyPilot abstracts away the complexity of spot instances, automatically handling the recovery of reclaimed nodes and allowing engineers to treat disparate clouds (AWS, GCP, Azure) as a single, cost-optimized resource pool.

You should also implement early stopping. There is no ROI in “polishing noise”. If your validation loss plateaus for 3 epochs, kill the run. This is especially potent for fine-tuning tasks, where most gains arrive in the first few epochs. However, be cautious if you are using curriculum learning, where loss might naturally rise before falling again as harder examples are introduced.
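A minimal patience-based early stopper looks like this (the patience and delta values are illustrative; tune them per task):

```python
class EarlyStopper:
    """Kill the run when validation loss stops improving for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss      # genuine improvement: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1      # plateau or regression
        return self.bad_epochs >= self.patience
```

For curriculum learning, set a larger `patience` (or disable the stopper during hard-example phases) so a temporary loss bump does not kill a healthy run.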

The “smoke test” protocol

Finally, never launch a multi-node job without a dry run. A simple script that runs two batches on a CPU can catch shape mismatches and OOM bugs for pennies.

python
# From 'green-ai-optimization-toolkit/03_smoke_test.py'
def smoke_test(model, loader, device='cpu', steps=2):
    """
    Runs a dry-run on CPU to catch shape mismatches 
    and OOM bugs before the real run starts.
    """
    print(f"💨 Running Smoke Test on {device}...")
    model.to(device)
    model.train()
    
    try:
        for i, (data, target) in enumerate(loader):
            if i >= steps: break
            data, target = data.to(device), target.to(device)
            output = model(data)
            loss = output.sum()
            loss.backward()
        print("✅ Smoke Test Passed. Safe to launch expensive job.")
        return True
    except Exception as e:
        print(f"❌ Smoke Test Failed: {e}")
        return False

The rapid-fire checklist: 10 tactical quick wins

Beyond the major architectural shifts, there is a long tail of smaller optimizations that, when stacked, yield significant savings. Here is a rapid-fire checklist of tactical wins.

1. Dynamic batch-size auto-tuning

  • The tactic: Have the framework probe VRAM at launch and automatically choose the largest safe batch size.
  • Best for: Shared GPU clusters (Kubernetes/Slurm) where free memory swings wildly.
  • Watch out: Can break real-time streaming SLAs by altering step duration.
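Under the hood, the probe can be a doubling-then-binary search over trial batch sizes. This is a framework-agnostic sketch: `fits` is a hypothetical callable that would run one forward/backward pass at the given batch size and return False on an out-of-memory error.

```python
def probe_max_batch_size(fits, start=1, ceiling=1024):
    """Double the trial batch until it no longer fits, then binary-search
    the boundary between the largest known-good and smallest known-bad size."""
    bs = start
    while bs <= ceiling and fits(bs):
        bs *= 2
    lo, hi = bs // 2, min(bs, ceiling)  # lo fits; hi is the ceiling or first failure
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

With roughly log₂(ceiling) trial passes at launch, the job then runs at the largest batch the current VRAM headroom allows.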

2. Continuous profiling

  • The tactic: Run lightweight profilers (PyTorch Profiler, NVIDIA Nsight) for a few seconds per epoch.
  • Best for: Long jobs (>30 mins). Finding even a 5% hotspot pays back the profiler overhead in a day.
  • Watch out: I/O-bound jobs. If GPU utilization is already low, the hotspot is in the data pipeline, and kernel profiling alone won’t surface it.

3. Store tensors in half-precision

  • The tactic: Save checkpoints and activations in FP16 (instead of default FP32).
  • Best for: Large static embeddings (vision, text). It halves I/O volume and storage costs.
  • Watch out: Compliance workloads requiring bit-exact auditing.

4. Early-phase CPU training

  • The tactic: Run the first epoch on cheaper CPUs to catch gross bugs before renting GPUs.
  • Best for: Complex pipelines with heavy text parsing or JSON decoding.
  • Watch out: Tiny datasets where the data transfer time exceeds the compute time.

5. Offline augmentation

  • The tactic: Pre-compute heavy transforms (Mosaic, Style Transfer) and store them, rather than computing on-the-fly.
  • Best for: Heavy transforms that take >20ms per sample.
  • Watch out: Research that studies augmentation randomness; baking it removes variability.

6. Budget alerts & dashboards

  • The tactic: Stream cost metrics per run and alert when burn-rate exceeds a threshold.
  • Best for: Multi-team organizations to prevent “runaway” billing.
  • Watch out: Alert fatigue. If you ping researchers too often, they will ignore the notifications.

7. Archive stale artifacts

  • The tactic: Automatically move checkpoints >90 days old to cold storage (Glacier/Archive tier).
  • Best for: Mature projects with hundreds of experimental runs.
  • Watch out: Ensure you keep the “Gold Standard” weights on hot storage for inference.

8. Data deduplication

  • The tactic: Remove near-duplicate samples before training.
  • Best for: Web scrapes and raw sensor logs.
  • Watch out: Curated medical/legal datasets where “duplicates” might actually be critical edge cases.
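A crude sketch of this tactic is exact-match deduplication after normalization. Real pipelines use MinHash or embedding similarity to catch near-duplicates; the whitespace/case normalizer here is a deliberately simple stand-in.

```python
import hashlib

def dedupe(samples, normalize=lambda s: " ".join(s.lower().split())):
    """Drop duplicate samples by hashing a normalized form of each one."""
    seen = set()
    unique = []
    for s in samples:
        h = hashlib.sha256(normalize(s).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(s)
    return unique
```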

9. Cluster-wide mixed-precision defaults

  • The tactic: Enforce FP16 globally via environment variables so no one “forgets” the cheapest knob.
  • Best for: MLOps teams managing multi-tenant fleets.
  • Watch out: Legacy models that may diverge without specific tuning.

10. Neural architecture search (NAS)

  • The tactic: Automate the search for efficient architectures rather than hand-tuning.
  • Best for: Long-term production models where efficiency pays dividends over years.
  • Watch out: Extremely high upfront compute cost; only worth it if the model will be deployed at massive scale.

Better habits, not just better hardware

You don’t need to wait for an H100 allocation to make your AI stack efficient. By implementing mixed precision, optimizing your data feed and adding operational safety nets, you can drastically reduce both your carbon footprint and your cloud bill.

The most sustainable AI strategy isn’t buying more power, it’s wasting less of what you already have.

This article is published as part of the Foundry Expert Contributor Network.


AI optimization: How we cut energy costs in social media recommendation systems 20 Mar 2026, 9:00 am

When you scroll through Instagram Reels or browse YouTube, the seamless flow of content feels like magic. But behind that curtain lies a massive, energy-hungry machine. As a software engineer working on recommendation systems at Meta and now Google, I’ve seen firsthand how the quest for better AI models often collides with the physical limits of computing power and energy consumption.

We often talk about “accuracy” and “engagement” as the north stars of AI. But recently, a new metric has become just as critical: efficiency.

At Meta, I worked on the infrastructure powering Instagram Reels recommendations. We were dealing with a platform serving over a billion daily active users. At that scale, even a minor inefficiency in how data is processed or stored snowballs into megawatts of wasted energy and millions of dollars in unnecessary costs. We faced a challenge that is becoming increasingly common in the age of generative AI: how do we make our models smarter without making our data centers hotter?

The answer wasn’t in building a smaller model. It was in rethinking the plumbing — specifically, how we computed, fetched and stored the training data that fueled those models. By optimizing this “invisible” layer of the stack, we achieved over megawatt-scale energy savings and reduced annual operating expenses by eight figures. Here is how we did it.

The hidden cost of the recommendation funnel

To understand the optimization, you have to understand the architecture. Modern recommendation systems generally function like a funnel.

At the top, you have retrieval, where we select thousands of potential candidates from a pool of billions of media items. Next comes early-stage ranking, a high-efficiency phase that filters this large pool down to a smaller set. Finally, we reach late-stage ranking. This is where the heavy lifting happens. We use complex deep learning models — often two-tower architectures that combine user and item embeddings — to precisely order a curated set of 50 to 100 items to maximize user engagement.

This final stage is incredibly feature-dense. To rank a single Reel, the model might look at hundreds of “features.” Some are dense features (like the time a user has spent on the app today) and others are sparse features (like the specific IDs of the last 20 videos watched).

The system doesn’t just use these features to rank content; it also has to log them. Why? Because today’s inference is tomorrow’s training data. If we serve you a video and you “like” it, we need to join that positive label with the exact features the model saw at that moment to retrain and improve the system.

This logging process — writing feature values to a transient key-value (KV) store to wait for user interaction — was our bottleneck.

The challenge of transitive feature logging

To understand why this bottleneck existed, we have to look at the microscopic lifecycle of a single training example.

In a typical serving path, the inference service fetches features from a low-latency feature store to rank a candidate set. However, for a recommendation system to learn, it needs a feedback loop. We must capture the exact state of the world (the features) at the moment of inference and later join them with the user’s future action (the label), such as a “like” or a “click.”

This creates a massive distributed systems challenge: Stateful label joining.

We cannot simply query the feature store again when the user clicks, because features are mutable — a user’s follower count or a video’s popularity changes by the second. Using fresh features with stale labels introduces “online-offline skew,” effectively poisoning the training data.

To solve this, we use a transitive key-value (KV) store. Immediately after ranking, we serialize the feature vector used for inference and write it to a high-throughput KV store with a short time-to-live (TTL). This data sits there, “in transit,” waiting for a client-side signal.

  • If the user interacts: The client fires an event, which acts as a key lookup. We retrieve the frozen feature vector from the KV store, join it with the interaction label and flush it to our offline training warehouse (e.g., Hive/Data Lake) as a “source-of-truth” training example.
  • If the user does not interact: The TTL expires, and the data is dropped to save costs.

This architecture, while robust for data consistency, is incredibly expensive. We were essentially continuously writing petabytes of high-dimensional feature vectors to a distributed KV store, consuming massive network bandwidth and serialization CPU cycles.
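The join mechanics above can be reduced to a toy sketch. The class and method names here are invented for illustration, and the dict stands in for what is, in production, a distributed high-throughput KV store.

```python
import time

class TransitiveFeatureStore:
    """Freeze features at inference time; join them with a label only if
    the user interacts before the TTL expires."""

    def __init__(self, ttl_seconds=600):
        self.ttl = ttl_seconds
        self.store = {}  # impression_id -> (expiry_timestamp, feature_vector)

    def log_inference(self, impression_id, features, now=None):
        now = time.time() if now is None else now
        self.store[impression_id] = (now + self.ttl, features)

    def join_label(self, impression_id, label, now=None):
        """Return a training example, or None if the entry already expired."""
        now = time.time() if now is None else now
        entry = self.store.pop(impression_id, None)
        if entry is None or now > entry[0]:
            return None  # TTL expired: impression generated no label, drop it
        return {"features": entry[1], "label": label}
```

The point is that the features in the returned example are exactly the ones the model saw at inference time, eliminating online-offline skew by construction.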

Optimizing the “head load”

We realized that our “write amplification” was out of control. In the late-stage ranking phase, we typically rank a deep buffer of items — say, the top 100 candidates — to ensure the client has enough content cached for a smooth scroll.

The default behavior was eager logging: We would serialize and write the feature vectors for all 100 ranked items into the transitive KV store immediately.

However, user behavior follows a steep decay curve. A user might only view the first 5–6 items (the “head load”) before closing the app or refreshing the feed. This meant we were paying the serialization and I/O cost to store features for items 7 through 100, which had a near-zero probability of generating a positive label. We were effectively DDoS-ing our own infrastructure with “ghost data.”

We shifted to a “lazy logging” architecture.

  1. Selective persistence: We reconfigured the serving pipeline to initially persist features only for the head load (e.g., the top 6 items) into the KV store.
  2. Client-triggered pagination: As the user scrolls past the head load, the client triggers a lightweight “pagination” signal. Only then do we asynchronously serialize and log the features for the next batch (items 7–15).

This change decoupled our ranking depth from our storage costs. We could still rank 100 items to find the absolute best content, but we only paid the “storage tax” for the content that actually had a chance of being seen. This reduced our write throughput (QPS) to the KV store significantly, saving megawatts of power previously wasted on serializing data that was destined to expire untouched.

Rethinking storage schemas

Once we reduced what we stored, we looked at how we stored it.

In a standard feature store architecture, data is often stored in a tabular format where every row represents an impression (a specific user seeing a specific item). If we served a batch of 15 items to one user, the logging system would write 15 rows.

Each row contained the item features (which are unique to the video) and the user features (which are identical for all 15 rows). We were effectively writing the user’s age, location and follower count 15 separate times for a single request.

We moved to a batched storage schema. Instead of treating every impression as an isolated event, we separated the data structures. We stored the user features once for the request and stored a list of item features associated with that request.

This simple de-duplication reduced our storage requirement by more than 40%. In distributed systems like the ones powering Instagram or YouTube, storage isn’t passive; it requires CPU to manage, compress and replicate. By slashing the storage footprint, we improved bandwidth availability for the distributed workers fetching data for training, creating a virtuous cycle of efficiency throughout the stack.
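The schema change can be illustrated with a toy transform (field names are hypothetical): it collapses per-impression rows that repeat identical user features into one record per request.

```python
def to_batched_schema(impressions):
    """Store user features once per request and item features as a list,
    instead of repeating the user features in every impression row."""
    batched = {}
    for row in impressions:
        req = batched.setdefault(
            row["request_id"],
            {
                "request_id": row["request_id"],
                "user_features": row["user_features"],  # written once per request
                "items": [],
            },
        )
        req["items"].append(row["item_features"])
    return list(batched.values())
```

For a request serving 15 items, the user-feature payload is written once instead of 15 times, which is where the storage savings come from.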

Auditing the feature usage

The final piece of the puzzle was spring cleaning. In a system as old and complex as a major social network’s recommendation engine, digital hoarding is a real problem. We had over 100,000 distinct features registered in our system.

However, not all features are created equal. A user’s “age” might carry very little weight in the model compared to “recently liked content.” Yet, both cost resources to compute, fetch and log.

We initiated a large-scale feature auditing program. We analyzed the weights assigned to features by the model and identified thousands that were adding statistically insignificant value to our predictions. Removing these features didn’t just save storage; it reduced the latency of the inference request itself because the model had fewer inputs to process.

The energy imperative

As the industry races toward larger generative AI models, the conversation often focuses on the massive energy cost of training GPUs. Reports indicate that AI energy demand is poised to skyrocket in the coming years.

But for engineers on the ground, the lesson from my time at Meta is that efficiency often comes from the unsexy work of plumbing. It comes from questioning why we move data, how we store it and whether we need it at all.

By optimizing our data flow — lazy logging, schema de-duplication and feature auditing — we proved that you can cut costs and carbon footprints without compromising the user experience. In fact, by freeing up system resources, we often made the application faster and more responsive. Sustainable AI isn’t just about better hardware; it’s about smarter engineering.

This article is published as part of the Foundry Expert Contributor Network.

