OpenAI’s $50B AWS deal puts its Microsoft alliance to the test 19 Mar 2026, 1:05 am

Despite OpenAI’s repeated reaffirmations that its relationship with Microsoft is strong and central, Redmond doesn’t seem convinced in light of recent developments.

According to reports, the tech giant is considering legal action against OpenAI and Amazon over the $50 billion cloud deal the two recently struck to make Amazon Web Services (AWS) the exclusive third-party cloud distribution provider for OpenAI Frontier.

This third-party exclusivity agreement could conflict with OpenAI’s existing Azure partnership. Unnamed Microsoft execs purportedly consider the AWS arrangement unworkable, and say it breaches, if not the letter, then the spirit of their agreement with the AI darling.

The three companies are said to be in discussions to resolve the issue before Frontier goes live following a limited preview, without having to resort to litigation.

“This is a tricky issue, and prospective early adopters of the OpenAI-AWS Frontier capabilities will need to proceed with caution,” said Scott Bickley, advisory fellow at Info-Tech Research Group. The OpenAI-Microsoft agreement is “quite convoluted, and contains several provisions that lack absolute clarity in terms of where boundaries reside for IP use and IP sharing, likely by design.”

Is OpenAI double-dipping with Microsoft and AWS?

In late February, AWS and OpenAI announced their intentions to “co-create” a stateful runtime environment, powered by OpenAI models, that would be made available on Amazon Bedrock for AWS customers. “Stateful AI” is meant to overcome the challenges of so-called “stateless AI,” where models offer one-off answers without factoring in context from previous sessions.

According to the agreement, AWS would not only invest another $50 billion in OpenAI, but would be the exclusive third-party cloud provider for Frontier, which is currently in limited preview with a small group of AI-native companies including Abridge⁠, Ambience, Clay⁠, Decagon⁠, Harvey⁠, and Sierra. OpenAI says it will soon expand the program to other AI builders.

AWS has also agreed to give OpenAI 2GW of Trainium capacity to support demand for the stateful environment, Frontier, and “other advanced workloads.” Further, the two companies would develop models specifically for Amazon applications, and expand their existing $38 billion multi-year agreement⁠ to $100 billion over eight years.

However, at the time of that announcement, OpenAI also felt the need to concurrently announce that nothing about its collaborations with other tech companies “in any way” changed the terms of its partnership with Microsoft. Azure would remain the exclusive cloud provider of stateless OpenAI APIs.

The two companies stressed that, as in their original agreement:

  • OpenAI has the flexibility to commit to compute elsewhere, including through infrastructure initiatives like the Stargate project.
  • Both companies can independently pursue new opportunities.
  • An ongoing revenue share arrangement will stay the same; however, that agreement has “always” included revenue-sharing from partnerships between OpenAI and other cloud providers.

OpenAI and Microsoft also underscored the fact that the tech giant will maintain an exclusive license and access to intellectual property (IP) across OpenAI models and products, and that OpenAI’s Frontier and other first-party products would continue to be hosted on Azure.

They stated that their ongoing partnership “was designed to give Microsoft and OpenAI room to pursue new opportunities independently, while continuing to collaborate, which each company is doing, together and independently.”

This reaffirmation followed yet another affirmation of the “next chapter” of the companies’ collaboration in October 2025. Microsoft was one of OpenAI’s earliest financial backers, investing $1 billion in 2019 and $10 billion in 2023.

Concerns about potential future lock-in

Clearly, OpenAI has for some time sought to maintain its independence, while seeking out strategic partnerships with the biggest names in tech. The ChatGPT builder seems to have struck (or is in the midst of striking) deals with nearly every big company out there, including Nvidia, Cerebras, Cisco, Accenture, Snowflake, Oracle, and many others.

“OpenAI is seeking to exploit a loophole between what rights Microsoft has to ‘stateless’ versus ‘stateful’ implementations of LLM models,” Info-Tech’s Bickley observed. Stateful is essential to multi-step agentic workflows, he noted, as it allows AI agents to retain memory and context over time.

But, as with many things, “the devil may reside in the details,” he said, as the AWS announcement calls for the creation of a stateful runtime environment. So, for instance, if Frontier is simply an orchestration layer designed to ensure that API calls are made to an Azure-hosted LLM, Microsoft would get paid for that usage.

The reality is that OpenAI has little choice but to “push the boundaries” of its agreement with Microsoft, and to develop products hosted and used on other hyperscaler clouds, Bickley said. The market is “too big to ignore the AWS and [Google Cloud Platforms] of the world.” Additionally, OpenAI’s massive forecasts of its requirements for capacity (250 GW of data center demand), revenue, and expense/cash burn necessitate a global use case.

“OpenAI is dependent on raising massive amounts of capital to fund this growth trajectory,” said Bickley, and the $50 billion Amazon investment is predicated on the delivery of Frontier.

However, the recent reaffirmation of the Microsoft relationship “muddies the waters,” because it grants OpenAI the right to strike deals with cloud rivals, as long as Microsoft retains its rich revenue-sharing agreement and exclusive hold over stateless models, he noted. This seems to imply that stateful models “may be out of this exclusive IP scope.”

Ultimately, “Microsoft’s aggressive legal response is standard fare for IP disputes among large tech firms, and should not scare away would-be customers,” Bickley emphasized, adding that it will likely be resolved via negotiations.

However, an additional looming issue is the potential for vendor lock-in, he noted. Frontier is tied to OpenAI’s architecture, and now adds “additional lock-in layers” for customer data stored in AWS, along with proprietary orchestration layers through which AI agents will flow. Therefore, as these agentic workflows begin to manage critical enterprise processes, customers’ business workflows could be “distinctly tied” to AWS.

“This will be quite sticky and difficult to migrate off of in the future, assuming there is an alternative to migrate to,” said Bickley.

This article originally appeared on NetworkWorld.

Java future calls for boosts with records, primitives, classes 18 Mar 2026, 11:46 pm

Oracle’s latest Java language ambitions are expected to offer improvements in records, classes, primitives, and arrays. As part of these plans, pending features not yet slated for a specific Java release are under consideration for official inclusion in the language.

In a March 17 presentation at the JavaOne conference in Redwood City, Calif., Oracle’s Dan Smith, senior developer in the company’s Java platform group, cited planned features for inclusion, but added that these features may change or go away instead. Stated themes for new Java language features include preserving the feel of Java and minimizing disruption, making it easier to work with immutable data, being more declarative and less imperative, and minimizing the seams between different features. Reducing the “activation energy” for Java also was cited as a theme.

Among the features under consideration is value classes and objects, a Java Enhancement Proposal (JEP) that calls for enhancing the Java platform with value objects: class instances that have only final fields and lack object identity. Created in August 2020 and updated this month, the proposal is intended to allow developers to opt in to a programming model for domain values in which objects are distinguished solely by the values of their fields, much as the int value 3 is distinguished from the int value 4. Other goals of the proposal include supporting compatible migration of existing classes that represent domain values to this programming model, and maximizing the freedom of the JVM to store domain values in ways that improve memory footprint, locality, and garbage collection efficiency.

The derived record creation JEP in preview, meanwhile, would provide a concise means to create new record values derived from existing record values. The proposal also is intended to streamline the declaration of record classes by eliminating the need to provide explicit wither methods, which are the immutable analog of setter methods. Records are immutable objects, with developers frequently creating new records from old records to model new data. Derived creation streamlines code by deriving a new record from an existing record, specifying only the components that are different, according to the proposal, created in November 2023 and marked as updated in April 2024.

Also cited by Smith were the enhanced primitive boxing JEP, a feature in preview, and the primitive types in patterns, instanceof, and switch JEP, a feature undergoing its fourth preview in JDK 26. Enhanced primitive boxing, created in January 2021 and marked as updated in November 2025, uses boxing to support language enhancements that treat primitive types more like reference types. Among its goals is allowing boxing of primitive values when they are used as the “receiver” of a field access, method invocation, or method reference. Also on the agenda for this JEP is supporting primitive types as type arguments, implemented via boxing at the boundaries with generic code. Unboxed return types would be allowed when overriding a method with a reference-typed return. The primitive types feature, meanwhile, calls for enhancing pattern matching by allowing primitive types in all pattern contexts and by extending instanceof and switch to work with all primitive types. This feature was created in June 2025 and last updated in December 2025.

For arrays, plans under consideration involve declarative array creation expressions, final arrays, non-null arrays, and covariant primitive arrays. Declarative array creation covers capabilities such as using a lambda to compute initial values. With final arrays, components cannot be mutated and must be declaratively initialized. Covariant primitive arrays can treat an int[] as a non-null Integer[], with boxes accessed as needed.

Edge.js launched to run Node.js for AI 18 Mar 2026, 11:39 pm

Wasmer has introduced Edge.js as a JavaScript runtime that leverages WebAssembly and is designed to safely run Node.js workloads for AI and edge computing. Node apps can run inside a WebAssembly sandbox.

Accessible from edgejs.org and introduced March 16, Edge.js is intended to enable existing Node.js applications to run safely and with startup times impossible to get with containers, according to Wasmer. Instead of introducing new APIs, Edge.js preserves Node compatibility and isolates the unsafe parts of execution using WebAssembly. Existing Node.js applications and native modules can run unmodified while system calls and native modules are sandboxed through WASIX, an extension to the WebAssembly System Interface (WASI). WASIX was designed to make WebAssembly more compatible with POSIX programs, enabling seamless execution of more complex applications in both server and browser environments.

Reimagining Node.js, Edge.js runs sandboxed via a --safe mode. It is built for AI and serverless workloads, Wasmer said. Edge.js currently supports the V8 and JavaScriptCore JavaScript engines, and the architecture is engine-agnostic by design. Plans call for adding support for the QuickJS and SpiderMonkey engines, with contributions of additional engines welcome.

Edge.js is currently about 5% to 20% slower than current Node.js when run natively, and 30% slower when run fully sandboxed with Wasmer. In some cases, when NativeWasm work is intense, as when doing HTTP benchmarks, there could be a bigger gap. Wasmer intends to focus on closing that gap for Edge.js 1.0 and for the next releases of Wasmer.

Snowflake’s new ‘autonomous’ AI layer aims to do the work, not just answer questions 18 Mar 2026, 1:00 pm

Snowflake has taken the covers off a product, currently under development, which it describes as an “autonomous” AI layer that promises to turn its data cloud from a place that answers questions about data into one that actually does the work: stitching together analysis, reports, and even slide decks on behalf of business users.

Named Project SnowWork, the new conversational AI interface, which combines Snowflake’s existing technologies such as its AI Data Cloud, Snowflake Intelligence, and Cortex Code, is Snowflake’s attempt to implant itself into enterprise workflows, Bala Kasiviswanathan, VP of developer and AI experiences at Snowflake, told InfoWorld.

“Project SnowWork comes from a pretty simple belief: if AI is going to really matter in the enterprise, it has to first work for everyday workflows, and it has to be deeply connected to the data and systems that actually run the business,” Kasiviswanathan said.

“The idea is to have AI act more like a proactive collaborator. So instead of just asking questions, business users across functions like finance, marketing, or sales can ask for outcomes. Things like putting together a board-ready forecast, identifying churn risks, generating a report with recommended actions, or digging into supply chain issues,” Kasiviswanathan added.

What’s in it for enterprises?

Analysts say SnowWork could be valuable to enterprises, especially in accelerating operational business decisions and reducing the workload burden on data practitioners, which is often the real cause of delay.

“Every Fortune 500 company we talk to has the same bottleneck. A head of sales wants to understand regional churn patterns, so they file a ticket with the data team. Three weeks later, they get a CSV and a shrug. By then, the decision window has closed, and they’ve already gone with gut instinct. That cycle is broken, and everyone knows it,” said Ashish Chaturvedi, leader of executive research at HFS Research.

SnowWork, according to Chaturvedi, promises to cut that queue to zero as users can get a finished analysis directly in minutes without having to engage a data practitioner.

“If it works as advertised, the productivity unlock is substantial. Not just the time saved, but timely decisions can be made while the information is still warm,” the analyst added.

In fact, removing the need to engage a data practitioner for analysis, according to Moor Insights and Strategy principal analyst Robert Kramer, will allow data teams to spend more time on governance, modeling, and oversight instead of handling repetitive requests.

Play for enterprise AI land grab

Snowflake’s Kasiviswanathan also pitched SnowWork against other chatbots and AI assistants, claiming that it was more accurate and far less reliant on manual coordination because it runs on secured and governed enterprise data.

However, analysts say this is a clever strategy to increase stickiness of its platform, as nearly all technology vendors, including Microsoft, Google, AWS, Salesforce, ServiceNow, Workday, OpenAI, and Anthropic, are moving in aggressively to try and own the majority share of AI in enterprises with their own offerings.

“This is about platform stickiness through surface area expansion. Snowflake’s core data cloud business is in a knife fight — Databricks is breathing down its neck, open-source alternatives are chipping away at the margins, and enterprise CFOs are getting louder about consumption costs,” said HFS’ Chaturvedi.

“Today, the average business user has never logged into Snowflake. Their experience of the platform is indirect, filtered through a BI tool. SnowWork puts Snowflake directly on the business user’s desktop, and that changes the commercial gravity entirely. You go from being a back-end utility that procurement reviews once a year to a front-office productivity layer that hundreds of people touch every day,” Chaturvedi pointed out.

That strategy also pits Snowflake directly against Microsoft, Google, and Salesforce, Chaturvedi said, because if those vendors succeed in making their AI layers the default workspace for enterprise employees, Snowflake could find itself reduced to the pipes underneath the stack — essential but interchangeable, and far removed from the everyday users it now hopes to reach.

Compression of the enterprise technology stack

More broadly, though, the analyst says this could be a part of a broader industry shift where the traditional enterprise technology stack is getting compressed like an accordion.

“The old model had five distinct layers, including data warehouse, BI tool, analyst, deliverable, and decision-maker. Each handoff added latency, cost, and the telephone-game risk of lost context. SnowWork collapses that into three layers comprising data platform, autonomous agent, decision-maker,” Chaturvedi said.

“Every major platform player is making a version of this move. Databricks is building lakehouse apps. Salesforce has Agentforce. Microsoft has Copilot wired into everything. ServiceNow is embedding agentic workflows,” Chaturvedi added.

Still, for all the ambition, there are some obvious caveats, and analysts say SnowWork’s vision comes with its share of unanswered questions.

HFS’s Chaturvedi was skeptical because the product is still under development and Snowflake didn’t reveal the pricing model: “If SnowWork compresses your decision cycle from three weeks to three minutes but triples your Snowflake bill, the CFO math gets complicated fast.”

Similarly, Stephanie Walter, practice lead of AI stack at HyperFRAME Research, pointed to vendors’ general credibility gap around AI execution in enterprise and production settings.

“In practice, enterprise AI has shown mixed results when it comes to producing fully usable, end-to-end deliverables without significant human oversight. Moving from assisted analysis to autonomous output is a non-trivial leap, and SnowWork will need to prove that its agents can consistently deliver accurate, contextually correct outcomes before enterprises fully trust it as a system of action,” Walter said.

Snowflake has yet to announce a launch date or timeline, as SnowWork is being tested by select Snowflake customers.

Markdown is now a first-class coding language: Deal with it 18 Mar 2026, 9:00 am

Folks are all in a tizzy because a guy posted some Markdown files on GitHub.

Mind you, it’s not just any guy, and they aren’t just any Markdown files. 

The guy is Garry Tan, president and CEO of Y Combinator, which is among the most widely known startup incubators and venture capital firms in tech. Garry is a long-time builder, having founded the blogging platform Posterous.

The Markdown files? Tan created what he calls gstack — a collection of Claude Code skills that help focus Claude on the specific steps of developing a software product. And, yes, they are done in Markdown. Just a bunch of text files. Some love it, and some are … not so impressed.

A little Markdown backstory

Tan seems to have had a typical career arc — started out coding, built something, had success, and went into management, thus abandoning code.  And like many people who leave coding behind for leadership positions, he missed coding: 

Screen capture of a tweet related to Markdown by Garry Tan

Foundry

What Tan found 45 days prior to the above tweet was Claude Code.  He discovered what many of us have —  that agentic coding can be an almost unbelievable experience. He, like many others, found he could do in days what normally would take teams of people months to do. 

Tan found, as we all do, that Claude can be a bit unfocused in its work. It will do exactly what it is asked to do and often can’t see the big picture. Tan calls this “mushy mode.” But Claude can be trained to behave better, and given the right input, it can be more focused in specific areas. So Tan created gstack to give Claude the capability to play certain roles — product manager, QA, engineering, DevOps, etc.

People seem to be losing their minds over this. 

Some folks are calling it “God mode for development,” but others are saying, “This is just a bunch of prompts.”

Tan posted the repo on Product Hunt, and some folks were less than enthused.

Mo Bitar posted a video calling Tan “delusional” and implying that he’s succumbed to the sycophancy of AI.  The comments indicate he is not alone.  Bitar points out, too, that it is “just a bunch of text files.” 

It is this point that I want to home in on. Yep, it is just “a bunch of text files posted on GitHub” (though that is not strictly true — the repo does include code to build binaries to help Claude browse web apps better). But guess what? All the code you carefully craft by hand without the aid of AI is “just a bunch of text files posted on GitHub.” Your Docker files, JSON, and YAML are all “just a bunch of text files.” Huh.

Here is something for folks to consider: Markdown is the new hot coding language. Some folks write Python, some folks write TypeScript, and now, some folks write Markdown.  Humans use compilers to convert C++ code into working apps.  Now, humans can use Claude to convert Markdown into working apps. 

Everything computer science has done over the last hundred years has been to improve our abstraction layers.  We used to code by literally flipping mechanical switches.  Then we built automatic switches and figured out how to flip them electrically.  Then we figured out how to flip them automatically using binary code, and then assembler code, and then higher-level languages. 

Markdown is the new hot coding language.  Deal with it.

We mistook event handling for architecture 18 Mar 2026, 9:00 am

Events are essential inputs to modern front-end systems. But when we mistake reactions for architecture, complexity quietly multiplies. Over time, many front-end architectures have come to resemble chains of reactions rather than models of structure. The result is systems that are expressive, but increasingly difficult to reason about.

A different architectural perspective is beginning to emerge. Instead of organizing systems around chains of reactions, some teams are starting to treat application state as the primary structure of the system. In this model, events still occur, but they no longer define the architecture; they simply modify state. The UI and derived behavior follow from those relationships. 

This shift toward state-first front-end architecture offers a clearer way to reason about increasingly complex applications.

When reaction became the default

Front-end engineering runs on events. User interactions, network responses, timers, and streaming data constantly enter our applications, and our systems are designed to respond to them. Events are unavoidable; they represent moments when something changes in the outside world. It’s no wonder that we have become remarkably sophisticated in how we process events. We compose streams, coordinate side effects, dispatch updates, and build increasingly expressive reactive pipelines. Entire ecosystems have evolved around structuring these flows in disciplined and predictable ways.

As applications grew more dynamic and stateful, that sophistication felt not only justified but necessary. Yet somewhere along the way, we began treating event handling not merely as a mechanism, but as architecture itself. We started to think about systems primarily in terms of events and reactions. That subtle shift changed how we reason about systems. Instead of modeling what is true, we increasingly modeled how the system reacts.

And that is where complexity quietly began to accumulate.

Redux and the era of structured change

One of the most influential milestones in modern front-end architecture was the rise of Redux. Redux introduced a compelling discipline: State should be centralized, updates should be predictable, and changes should flow in a unidirectional manner. Instead of mutating values implicitly, developers dispatched explicit actions, and reducers computed new state deterministically.
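
The dispatch-and-reduce discipline can be sketched in a few lines of TypeScript. This is an illustrative toy, not any particular codebase: explicit actions, a pure reducer, and deterministic state transitions.

```typescript
// A Redux-style sketch: state changes only through explicit
// actions processed by a pure, deterministic reducer.
type State = { count: number };
type Action =
  | { type: "increment" }
  | { type: "add"; amount: number };

function reducer(state: State, action: Action): State {
  switch (action.type) {
    case "increment":
      return { count: state.count + 1 };
    case "add":
      return { count: state.count + action.amount };
  }
}

// Replaying the same actions from the same starting state
// always yields the same result, which is what makes
// transitions traceable and debugging systematic.
let state: State = { count: 0 };
state = reducer(state, { type: "increment" });
state = reducer(state, { type: "add", amount: 41 });
console.log(state.count); // 42
```

Because the reducer never mutates its input, any past state can be kept around for inspection, which is the property time-travel debugging tools rely on.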

Redux brought structure where there had been chaos. It introduced a discipline that made state transitions explicit and traceable, which in turn made debugging more systematic and application behavior easier to reason about.

More importantly, Redux normalized a particular way of thinking about front-end systems. Centralized stores, action dispatching, side-effect layers, and explicit update flows became architectural defaults. Variations of this model appeared across frameworks and libraries, influencing how teams structured applications regardless of the specific tools they used.

Even when implementations differed, the underlying assumption remained consistent: Architecture was largely about controlling how events move through the system. This was a major step forward in discipline. But it also reinforced a deeper habit — organizing our mental models around reactions.

Events are inputs, but architecture is structure

An event tells us what just happened. A user clicked a button. A request has been completed. A value changed. Architecture answers a different question: What is true right now?

Events are transient. They describe moments in time. Architecture defines relationships that persist beyond those moments. When systems are organized primarily around events, behavior is often modeled as a chain of reactions: This dispatch triggers that update, which causes another recalculation, which notifies a subscriber elsewhere.

In smaller systems, that chain is easy to follow. In larger systems, understanding behavior increasingly requires replaying a timeline of activity. To explain why a value changed, you trace dispatches and effects. To understand dependencies, you search for subscriptions or derived selectors. The structure exists, but it is implicit in the flow. And implicit architecture becomes harder to reason about as scale increases.

The cognitive cost of flow-centric thinking

Event-driven models, especially in their more structured forms, provided front-end engineering with much-needed rigor. They allowed teams to tame asynchronous complexity and formalize change management.

However, expressiveness does not automatically produce clarity. As applications grow, flow-oriented designs can obscure structural relationships. Dependencies between pieces of state are often inferred from dispatch logic rather than expressed directly. Derived values may be layered through transformations that require understanding not just what depends on what, but when updates propagate.

Thus, event-driven models introduce a subtle cognitive burden. Engineers must simulate execution over time instead of inspecting relationships directly. Questions that should be straightforward — What depends on this value? What recalculates when it changes? — often require tracing reactive pathways through the system. 

The more sophisticated the orchestration, the more effort it takes to understand the architecture as a whole.

The shift toward state-first thinking

A quieter architectural shift is emerging across modern front-end development. Rather than organizing systems primarily around what just happened, teams are increasingly organizing them around what is currently true.

In a state-first model, change does not propagate because an event is fired. It propagates because relationships exist. Dependencies are declared explicitly. Derived values are expressed as direct functions of the underlying state. When something changes, the system recalculates what depends on it in a deterministic manner — not because we manually choreographed the flow, but because we described the relationships.
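
As a minimal sketch of that idea, here is a toy signal-and-derived-value pair in TypeScript. The API is invented for illustration and deliberately simplified (no caching or subscription tracking); it is not any specific framework's implementation.

```typescript
// Toy state-first reactivity: a derived value is a declared
// function of state, so the dependency is explicit structure
// rather than something inferred from a chain of events.
function signal<T>(initial: T) {
  let value = initial;
  return {
    get: () => value,
    set: (v: T) => { value = v; },
  };
}

// A derived value simply recomputes from current state when read.
function computed<T>(fn: () => T) {
  return { get: fn };
}

const price = signal(10);
const quantity = signal(3);
const total = computed(() => price.get() * quantity.get());

console.log(total.get()); // 30
price.set(20);            // an event merely updates state...
console.log(total.get()); // 60 ...the declared relationship does the rest
```

To answer "what depends on price?" a reader inspects the `computed` declaration, not a timeline of dispatches — which is the cognitive shift the state-first model is after.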

Events remain essential. User interactions and network responses continue to drive applications forward. The difference is that events resume their proper role as inputs that modify state, rather than serving as the backbone of architectural reasoning. Instead of replaying timelines, engineers inspect relationships. Instead of coordinating flows, they model structure.

This shift does not eliminate reactivity; it refines it.

Redefining front-end architectural skill

For years, front-end mastery often meant orchestrating events with precision: dispatching actions cleanly, managing side effects thoughtfully, and coordinating asynchronous boundaries without introducing chaos. Those skills remain valuable.

But architectural maturity increasingly depends on something deeper: the ability to model state clearly, define dependencies explicitly, and design systems whose behavior can be understood by examining structure rather than replaying history.

Redux was a major step forward. It disciplined change and made the event flow traceable. Yet architecture does not end at disciplined dispatch. Architecture begins when relationships are first-class, when state, derivation, and dependency are visible and intentional rather than consequences of flow.

This shift is already visible across modern frameworks. Systems like Angular Signals, fine-grained reactive models, and state-driven UI architectures are all converging on the same idea: The structure of state should define system behavior, not the choreography of events.

I describe this emerging model as “state-first front-end architecture,” where application state becomes the primary source of truth, and the UI is derived from it rather than driven by chains of events.

The real question for modern front-end teams is no longer “How do we react to this event?” It is “What is the simplest, clearest way to model what is true?”

When we begin with structure instead of reaction, complexity tends to shrink. Systems become easier to explain, easier to test, and easier to evolve. Events still enter the system, but they no longer define it.

That shift may sound philosophical, but it has practical consequences. It changes how we design components. It changes how we organize the state. It changes how we reason about scale.

Events are indispensable. They are the inputs that move our applications forward. But architecture is not about what just happened. It is about what remains true.

Events will always enter our systems, but they should not define their architecture. The next generation of front-end systems will be shaped less by how elegantly we orchestrate events and more by how clearly we model state. Frameworks like Angular Signals suggest that this transition has already begun, pointing toward a future where UI is treated primarily as a projection of state rather than a reaction to events.

I ran Qwen3.5 locally instead of Claude Code. Here’s what happened. 18 Mar 2026, 9:00 am

If you’ve been curious about working with services like Claude Code, but balk at the idea of hitching your IDE to a black-box cloud service and shelling out for tokens, we’re steps closer to a solution. But we’re not quite there yet.

With each new generation of large language models, we’re seeing smaller and more efficient LLMs for many use cases—small enough that you can run them on your own hardware. Most recently, we’ve seen a slew of new models designed for tasks like code analysis and code generation. The recently released Qwen3.5 model set is one example.

What’s it like to use these models for local development? I sat down with a few of the more svelte Qwen3.5 models, LM Studio (a local hosting application for inference models), and Visual Studio Code to find out.

Setting up Qwen3.5 on my desktop

To try out Qwen3.5 for development, I used my desktop system: an AMD Ryzen 5 3600 6-core processor running at 3.6GHz, with 32GB of RAM and an RTX 5060 GPU with 8GB of VRAM. I’ve run inference work on this system before using both LM Studio and ComfyUI, so I knew it was no slouch. I also knew from previous experience that LM Studio can be configured to serve models locally.

For the models, I chose a few different iterations of the Qwen3.5 series. Qwen3.5 comes in many variations provided by community contributors, all in a range of sizes. I wasn’t about to try the 397-billion parameter version, for instance: there’s no way I could crowbar a 241GB model into my hardware. Instead, I went with these Qwen3.5 variants:

  • qwen3.5-9b@q5_1 (6.33GB)
  • qwen3.5-9b-claude-4.6-opus-reasoning-distilled (4.97GB)
  • qwen3.5-4b (3.15GB)

In each case, I was curious about the tradeoffs between the model’s parameter size and its quantization. Would smaller versions of the same model have comparable performance?

Running the models on LM Studio did not automatically allow me to use them in an IDE. The blocker here was not LM Studio but VS Code, which doesn’t work out of the box with any LLM provider other than GitHub Copilot. Fortunately, a third-party add-on called Continue lets you hitch VS Code to any provider, local or remote, that uses common APIs—and it supports LM Studio out of the box.


Continue is a VS Code extension that connects to a variety of LLM providers. It comes with built-in connectivity options for LM Studio.

Foundry
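As an aside, LM Studio’s local server exposes an OpenAI-compatible chat-completions API (at http://localhost:1234/v1 by default), so you can also query a loaded model directly from Python without going through the IDE at all. Here’s a minimal sketch using only the standard library; the model identifier is whatever name LM Studio shows for the model you loaded, and the system prompt is just an example:

```python
import json
import urllib.request


def build_request(model: str, prompt: str, context: str) -> dict:
    """Assemble an OpenAI-style chat-completions payload, passing the
    file under review as extra context after the prompt."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a code-review assistant."},
            {"role": "user", "content": f"{prompt}\n\n---\n{context}"},
        ],
        "temperature": 0.2,
    }


def ask_local_model(payload: dict,
                    base_url: str = "http://localhost:1234/v1") -> str:
    """POST the payload to LM Studio's local server; return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

With a model loaded and LM Studio’s local server running, `ask_local_model(build_request("qwen3.5-4b", "Suggest refactorings.", source))` returns the model’s reply as a string.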

Setting up the test drive

My testbed project was something I’m currently developing: a utility for Python that allows a Python package to be redistributed on systems without the Python runtime. It’s not a big project (one file that’s under 500 lines of code), which made it a good candidate for testing a development model locally.

The Continue extension lets you use attached files or references to an open project to supply context for a prompt. I pointed to the project file and used the following prompt for each model:

The file currently open is a utility for Python that takes a Python package and makes it into a standalone redistributable artifact on Microsoft Windows by bundling a copy of the Python interpreter. Examine the code and make constructive suggestions for how it could be made more modular and easier to integrate into a CI/CD workflow.

When you load a model into memory, you can twiddle a mind-boggling array of knobs to control how predictions are served with it. The two knobs that have the biggest impact are context length and GPU offload:

  • Context length is how many tokens the model can work with in a single prompt; the more tokens, the more involved the conversation.
  • GPU offload is how many layers of the model are run on the GPU to speed it up; the more layers, the faster the inference.

Turning up either of these consumes memory—system and GPU memory both—so there are hard ceilings on how high they can go. GPU offload has the biggest impact on performance, so I set that to the maximum for each model, then set the context length as high as I could for that model while still leaving some GPU memory free.
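Whether a given combination fits is ultimately simple arithmetic: the offloaded share of the model weights plus the KV cache for your chosen context must stay under the VRAM ceiling. The sketch below is a back-of-the-envelope check only; the per-1,000-token KV-cache cost and the headroom figure are hypothetical round numbers that vary by model architecture and quantization.

```python
def fits_in_vram(model_gb: float, total_layers: int, offload_layers: int,
                 ctx_tokens: int, kv_gb_per_1k_tokens: float = 0.1,
                 vram_gb: float = 8.0, headroom_gb: float = 0.5) -> bool:
    """Rough feasibility check for a local model configuration.

    Assumes weight memory scales linearly with the number of layers
    offloaded to the GPU, and KV-cache memory scales linearly with
    context length. Both constants are illustrative guesses.
    """
    weights = model_gb * offload_layers / total_layers
    kv_cache = ctx_tokens / 1000 * kv_gb_per_1k_tokens
    return weights + kv_cache + headroom_gb <= vram_gb
```

In practice you find the ceiling empirically, as described above: max out GPU offload first, then grow the context window until you run out of VRAM.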


Serving predictions locally with LM Studio, by way of the Continue plugin for VS Code. The Continue interface doesn’t provide many of the low-level details about the conversation that you can see in LM Studio directly (e.g., token usage), but does allow embedding context from the current project or any file.

Foundry

Configuring the three models

qwen3.5-9b@q5_1 was the largest of the three models I tested, at 6.33GB. I set it to use 8,192 tokens and 28 layers, for 7.94GB total GPU memory use. This proved to be way too slow to use well, so I racked back the token count enough to use all 32 layers. Predictions came far more snappily after that.

qwen3.5-9b-claude-4.6-opus-reasoning-distilled weighed in at 4.97GB, which allowed for a far bigger token length (16,000 tokens) and all 32 layers. Out of the gate, it delivered much faster inference and tokenization of the input, meaning I didn’t have to wait long for the first reply or for the whole response.

qwen3.5-4b, the littlest brother, is only 3.15GB, meaning I could use an even larger token window if I chose to, but I kept it at 16,000 for now, and also used all 32 layers. Its time-to-first reply was also fast, although overall speed of inference was about the same as the previous model.


A variety of Qwen3.5 models with different sizes and quantizations. Some are far too big to run comfortably on commodity hardware; others can run on even a modest PC.

Foundry

The good, the bad, and the busted

With each model, my query produced a slew of constructive suggestions: “Refactor the main entry point to use step functions,” or “Add support for environment variables.” Most were accompanied by sample snippets—sometimes full segments of code, sometimes brief conceptual designs. As with any LLM’s output, the results varied a lot between runs, even on the same model with parameters configured to keep output as consistent as possible.

The biggest variations between outputs were in how much detail each model provided, but even that varied less than I expected. Even the smallest of the models still provided decent advice, although I found the midsized model (the “distilled” 9B model) struck a good balance between compactness and power. Still, having lots of token space didn’t guarantee results: even with considerable context, some conversations stopped dead in the middle for no apparent reason.

Where things broke down across the board, though, was when I tried to let the models put their advice into direct action. Models can be provided with contextual tools, such as changing code with your permission or looking things up on the web. Unfortunately, almost anything involving working with the code directly crashed hard, or only worked after multiple attempts.

For instance, when I tried to let the “distilled” 9B model add recommended type hints to my project, it failed completely. On the first attempt, it crashed in the middle of the operation. On the second try, it got stuck in an internal loop, then backed out of it and decided to add only the most important type hints. This it was able to do, but it mangled several indents in the process, creating a cascading failure for the rest of the job. And on yet another attempt, the agent tried to just erase the entire project file.

Conclusions

The most disappointing part of this whole endeavor was the way the models failed at actually applying any of their recommended changes, or only did so after repeated attempts. I suspect the issue isn’t tool use in the abstract, but tool use that requires lots of context. Cloud-hosted models theoretically have access to enough memory to use their full context token window (262,144 for the models I evaluated). Still, from my experience, even the cloud models can choke and die on their inputs.

Right now, using a compact local model to get insight and feedback about a codebase works best when you have enough GPU memory for the entire model plus the needed context length for your work. It’s also best for obtaining high-level advice you plan to implement yourself, rather than advanced tool operations where the model attempts to autonomously change the code. But I’ve also had that experience with the full-blown versions of these models.


USAT Leverages Times Square Crowd to Demonstrate Instant Digital Dollar Payments 18 Mar 2026, 8:07 am

USAT is transforming Times Square into a live demonstration of instant, internet-native payments. The digital dollar, issued by Anchorage Digital Bank, is taking over Times Square as 2 million spectators flood the streets of New York City for St. Patrick’s Day. The brand activation combines synchronized digital billboards with a street-level campaign designed to introduce digital dollar payments to a mainstream audience, coinciding with the world’s oldest and largest St. Patrick’s Day Parade.

The campaign features coordinated imagery across several of Times Square’s most recognizable digital screens, culminating in a synchronized share-of-voice takeover that transforms multiple screens into a single, unified visual, showing how digital dollars move between people in an instant. At street level, brand ambassadors will distribute 25,000 promotional postcards throughout Times Square and along the parade route, inviting passersby to scan a QR code to download the Rumble Wallet and claim $10 in USAT, free, right from their phone. The activation kicked off at 10 AM ET and ended at 11:59 PM ET.

The activation reflects a growing shift in fintech marketing toward experiential campaigns that translate complex financial technology into tangible consumer experiences, using high-traffic cultural moments and large-scale digital displays to capture public attention. The mechanic is simple by design: Scan. Download. Receive. It is the same technology that already moves money for more than 550 million people worldwide, now available to anyone walking through Times Square with a smartphone in their pocket.

Stablecoins are blockchain-based digital dollars designed to maintain a stable value while enabling instant, internet-native payments between digital wallets. They combine the price stability of traditional currency with the speed and programmability of blockchain networks.

“USAT builds on the principles that made USDT the most widely used stablecoin in the world,” said Paolo Ardoino, CEO of Tether. “Today, USDT is used by more than 550 million people globally, helping move digital dollars across the internet instantly and reliably. USAT brings those same foundations to a new audience, making it easier for people to experience how digital dollars can function in everyday life.”

“Times Square on St. Patrick’s Day is one of the most electric environments in the world,” said Bo Hines, CEO of Tether USAT. “We are not just running ads, we are handing people the future of money and letting them use it on the spot. This activation invites people to experience the next generation of money right on their smartphones. By pairing digital billboards with a dynamic street activation, we are turning a complex technology into something people can see, experience, and use for themselves.”

Digital dollars no longer require a tutorial. They require an opportunity. Large-scale activations like this have become an increasingly common strategy for fintech and technology brands looking to bridge the gap between digital infrastructure and mainstream awareness – and USAT is making that bridge as short as a QR code scan. USAT is a digital dollar designed to maintain a 1:1 value with the U.S. dollar while enabling instant digital payments through blockchain networks. Send it, receive it, spend it – globally, in seconds, using compatible wallets and applications. Moving money should feel as simple as sending a message. With USAT, it does.


Project Detroit, bridging Java, Python, JavaScript, moves forward 17 Mar 2026, 5:43 pm

Java’s revived Detroit project, which aims to enable joint use of Java with Python or JavaScript, is slated to become an official project within the OpenJDK community soon.

Oracle officials plan to highlight Detroit’s status at JavaOne on March 17. “The main benefit [of Detroit] is it allows you to combine industry-leading Java and JavaScript or Java and Python for places where you want to be able to use both of those technologies together,” said Oracle’s Georges Saab, senior vice president of the Java Platform Group, in a briefing on March 12. The goal of the project is to provide implementations of the javax.script API for JavaScript based on the Chrome V8 JavaScript engine and for Python based on CPython, according to the Detroit project page on openjdk.org.

Initially proposed in the 2018 timeframe as a mechanism for using JavaScript as an extension language for Java, the project later fizzled after losing sponsorship, but interest in it has recently been revived. The plan is to address the Java ecosystem’s requirements for calling other languages, with scripting for business logic and easy access to AI libraries in other languages. While the plan initially calls for JavaScript and Python support, other languages are slated to be added over time. The Java FFM (Foreign Function & Memory) API is expected to be leveraged in the project. Other goals of the project include:

  • Improving application security by isolating Java and native heap executions.
  • Simplifying access to JS/Python libraries until equivalent Java libraries become available.
  • Delivering full JS/Python compatibility by leveraging the V8 and CPython runtimes.
  • Reducing maintenance costs by harnessing the V8 and CPython ecosystems.
  • Leveraging existing investments in performance optimizations for the JS and Python languages.


JDK 26: The new features in Java 26 17 Mar 2026, 5:21 pm

Java Development Kit (JDK) 26, the latest standard Java release from Oracle, moves to general production availability on March 17.

The following 10 Java Enhancement Proposal (JEP) features are officially targeted to JDK 26: A fourth preview of primitive types in patterns, instanceof, and switch; ahead-of-time object caching; an eleventh incubation of the Vector API; second previews of lazy constants and PEM (privacy-enhanced mail) encodings of cryptographic objects; a sixth preview of structured concurrency; warnings about uses of deep reflection to mutate final fields; improving throughput by reducing synchronization in the G1 garbage collector (GC); HTTP/3 for the Client API; and removal of the Java Applet API.

For AI accommodations, Oracle has cited five of these features: the primitive types in patterns capability, Vector API, structured concurrency, lazy constants, and AOT object caching.

JDK 26 is downloadable from Oracle.com. A short-term release of Java backed by six months of Premier-level support, JDK 26 follows the September 16 release of JDK 25, which is a Long-Term Support (LTS) release backed by several years of Premier-level support. General availability of JDK 26 follows two rampdown releases and two release candidates.

The latest JEP feature to be added, primitive types in patterns, instanceof, and switch, is intended to enhance pattern matching by allowing primitive types in all pattern contexts, and to extend instanceof and switch to work with all primitive types. Now in its fourth preview, this feature was previously previewed in JDK 23, JDK 24, and JDK 25. Goals for this feature include enabling uniform data exploration by allowing type patterns for all types, aligning type patterns with instanceof and aligning instanceof with safe casting, and allowing pattern matching to use primitive types in both nested and top-level pattern contexts. Other goals include providing easy-to-use constructs that eliminate the risk of losing information due to unsafe casts, and—following the enhancements to switch in Java 5 (enum switch) and Java 7 (string switch)—allowing switch to process values of any primitive type. Changes in this fourth preview include enhancing the definition of unconditional exactness and applying tighter dominance checks in switch constructs. The changes enable the compiler to identify a wider range of coding errors. For AI, this feature simplifies integration of AI with business logic.

With ahead-of-time object caching, the HotSpot JVM would gain improved startup and warmup times, so it can be used with any garbage collector including the low-latency Z Garbage Collector (ZGC). This would be done by making it possible to load cached Java objects sequentially into memory from a neutral, GC-agnostic format, rather than mapping them directly into memory in a GC-specific format. Goals of this feature include allowing all garbage collectors to work smoothly with the AOT (ahead of time) cache introduced by Project Leyden, separating AOT cache from GC implementation details, and ensuring that use of the AOT cache does not materially impact startup time, relative to previous releases. AI applications also get improved startup via AOT caching with any GC.

The eleventh incubation of the Vector API introduces an API to express vector computations that reliably compile at runtime to optimal vector instructions on supported CPUs. This achieves performance superior to equivalent scalar computations. The incubating Vector API dates back to JDK 16, which arrived in March 2021. The API is intended to be clear and concise, to be platform-agnostic, to have reliable compilation and performance on x64 and AArch64 CPUs, and to offer graceful degradation. The long-term goal of the Vector API is to leverage Project Valhalla enhancements to the Java object model. The performance of AI computation is also improved with the Vector API.

Also on the docket for JDK 26 is another preview of an API for lazy constants, which had been previewed in JDK 25 via a stable values capability. Lazy constants are objects that hold unmodifiable data and are treated as true constants by the JVM, enabling the same performance optimizations enabled by declaring a field final. Lazy constants offer greater flexibility as to the timing of initialization, as well as efficient data sharing in AI applications.

The second preview of PEM (privacy-enhanced mail) encodings calls for an API for encoding objects that represent cryptographic keys, certificates, and certificate revocation lists into the PEM transport format, and for decoding from that format back into objects. The PEM API was proposed as a preview feature in JDK 25. The second preview features a number of changes, including changing the name of the PEMRecord class to PEM. This class also now includes a decode() method that returns the decoded Base64 content. Also, the encryptKey methods of the EncryptedPrivateKeyInfo class now are named encrypt and accept DEREncodable objects rather than PrivateKey objects, enabling the encryption of KeyPair and PKCS8EncodedKeySpec objects.

The structured concurrency API simplifies concurrent programming by treating groups of related tasks running in different threads as single units of work, thereby streamlining error handling and cancellation, improving reliability, and enhancing observability. Goals include promoting a style of concurrent programming that can eliminate common risks arising from cancellation and shutdown, such as thread leaks and cancellation delays, and improving the observability of concurrent code. This feature in Java 26 also brings enhanced concurrency for AI.

New warnings about uses of deep reflection to mutate final fields are intended to prepare developers for a future release that ensures integrity by default by restricting final field mutation; in other words, making final mean final, which will make Java programs safer and potentially faster. Application developers can avoid both current warnings and future restrictions by selectively enabling the ability to mutate final fields where essential.

The G1 GC proposal is intended to improve application throughput when using the G1 garbage collector by reducing the amount of synchronization required between application threads and GC threads. Goals include reducing the G1 garbage collector’s synchronization overhead, reducing the size of the injected code for G1’s write barriers, and maintaining the overall architecture of G1, with no changes to user interaction. Although G1, which is the default garbage collector of the HotSpot JVM, is designed to balance latency and throughput, achieving this balance sometimes impacts application performance adversely compared to throughput-oriented garbage collectors such as the Parallel and Serial collectors. The G1 GC proposal further notes:

Relative to Parallel, G1 performs more of its work concurrently with the application, reducing the duration of GC pauses and thus improving latency. Unavoidably, this means that application threads must share the CPU with GC threads, and coordinate with them. This synchronization both lowers throughput and increases latency.

The HTTP/3 proposal calls for allowing Java libraries and applications to interact with HTTP/3 servers with minimal code changes. Goals include updating the HTTP Client API to send and receive HTTP/3 requests and responses; requiring only minor changes to the HTTP Client API and Java application code; and allowing developers to opt in to HTTP/3 as opposed to changing the default protocol version from HTTP/2 to HTTP/3.

HTTP/3 is considered a major version of the HTTP data communications protocol for the web. Version 3 was built on the IETF QUIC (Quick UDP Internet Connections) transport protocol, which emphasizes flow-controlled streams, low-latency connection establishment, network path migration, and security among its capabilities.

Removal of the Java Applet API, now considered obsolete, is also targeted for JDK 26. The Applet API was deprecated for removal in JDK 17 in 2021. The API is obsolete because neither recent JDK releases nor current web browsers support applets, according to the proposal. There is no reason to keep the unused and unusable API, the proposal states.

In addition to its 10 major JEP features, JDK 26 also has a variety of additional, smaller features that did not warrant an official JEP, according to Oracle. Among these are hybrid public key encryption, stricter version checking when using the jlink tool to cross link, extending the HTTP client request timeout to cover the response body, and virtual threads now unmounting from the carrier when waiting for another thread to execute a class initializer.


Oracle unveils the Java Verified Portfolio 17 Mar 2026, 2:00 pm

Oracle has introduced the Java Verified Portfolio (JVP), which provides developers with a curated set of Oracle-supported tools, libraries, frameworks, and services. Assets available at the JVP launch include the JavaFX Java-based UI framework, the Java Platform extension for Microsoft’s Visual Studio Code editor, and the Helidon Java framework for microservices, Oracle said.

Announced March 17 in conjunction with the release of Java Development Kit (JDK) 26, the JVP offers licensing and support for a developer’s broader application and development stack. More technologies will be added to the portfolio over time, Oracle said. With the JVP initiative, Oracle is acknowledging that Oracle customers and Java developers depend on a wide range of JDK-related tools and other Java technologies that do not belong in the Oracle JDK itself. JVP provides an enterprise-grade set of components that are fully supported and governed by Oracle, with roadmap transparency and life cycle management, the company said.

Oracle’s Java Verified Portfolio offers the following benefits to developers, according to Oracle:

  • Streamlined licensing and roadmap management, with the core JDK separated from portfolio offerings to simplify licensing, support, and future roadmap planning for both the Oracle JDK and the JVP.
  • Centralized and flexible value delivery, with centralized access to a comprehensive set of Oracle-backed Java tools, frameworks, libraries, and services, with support timelines and JDK compatibility mappings.
  • Enhanced trust and supply chain integrity, providing governance and ongoing support for all included components, helping organizations trust their Java supply chain. This reduces risk compared to adopting unsupported open source alternatives, Oracle said.
  • Alignment with the Oracle Java ecosystem and customer needs. The JVP is backed by Oracle’s Java Platform Group to ensure consistency and alignment with OCI (Oracle Cloud Infrastructure) and other Oracle products.

The portfolio is accessible through preferred download sites and tools, with license and support free for all OCI Java workloads and Java SE subscription customers. Many releases are licensed free to all users, Oracle said.


Visualizing the world with Planetary Computer 17 Mar 2026, 9:00 am

Microsoft’s research arm has long been fascinated with working with large amounts of data at scale. Projects like TerraServer explored how to search and display geospatial data, mixing mapping and demographics to show how we could provide large amounts of information to users’ desktops while working within the bandwidth constraints of the early internet.

That research continues, and projects move into the commercial side of the business as they mature. Sometimes, however, they straddle the boundary between Microsoft supporting external academic customers and providing tools that can be brought into your own commercial projects.

Introducing Planetary Computer

One such tool is Planetary Computer, a free-to-use collection of geospatial data from various providers, along with standards-based APIs for querying and displaying that data and SDKs to simplify application development. Some of the tools are now available for use with your own and commercial data sources as Planetary Computer Pro, but the open, research-oriented platform is an excellent primer for using massive data sets to add deeper insights to your own and public data.

Microsoft is positioning Planetary Computer as a tool for building environmental applications, using data to watch population, pollution, plant cover, weather, and more, with data that can be used as part of its AI for Good machine learning program.

Planetary Computer isn’t a single application. Instead, it’s a framework for bringing multiple data sources together to build and deliver a collection of different geospatial environmental applications. At its heart is a catalog of curated data sources from a mix of commercial, academic, and government organizations, along with the necessary APIs to query and use that data, as well as ways to display results.

APIs are based on the STAC (SpatioTemporal Asset Catalogs) specification. You can search by coordinates and by time, allowing you to track changes over time, for example, tracking the foliage in a specific area of rainforest over weeks, months, or years. STAC is designed to be provider-agnostic, so the queries you have built to work against one source in Planetary Computer’s catalog can be repurposed to another as you add more sources to your application.
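To make that concrete, a STAC API search is a small JSON body POSTed to the catalog’s /search endpoint. The sketch below builds one with the standard library; the root URL is Planetary Computer’s public STAC endpoint, “sentinel-2-l2a” is one of the collection IDs in its catalog, and the bounding box and dates are arbitrary examples:

```python
import json

STAC_ROOT = "https://planetarycomputer.microsoft.com/api/stac/v1"


def stac_search_body(collection: str, bbox: list, start: str, end: str) -> dict:
    """Build a STAC /search request body restricted to one collection,
    a bounding box, and a date range. Per the STAC API spec, 'datetime'
    takes an ISO-8601 interval written as 'start/end'."""
    return {
        "collections": [collection],
        "bbox": bbox,  # [west, south, east, north] in degrees
        "datetime": f"{start}/{end}",
        "limit": 100,
    }


# Example: Sentinel-2 scenes over a patch of Amazon rainforest in January.
body = stac_search_body(
    "sentinel-2-l2a",
    [-62.5, -4.0, -62.0, -3.5],
    "2026-01-01T00:00:00Z", "2026-01-31T23:59:59Z",
)
payload = json.dumps(body)  # POST this to f"{STAC_ROOT}/search"
```

Because STAC is provider-agnostic, pointing the same body at a different source means changing only the collection ID and root URL.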

Geospatial data in data science projects

Planetary Computer is a data science platform first and foremost, so much of the documentation focuses on working with Python and R to extract and analyze STAC data. Microsoft provides libraries to help with this, but if you’re looking for a quick way to work with it, there is also a simple data API that uses a URL query to extract data and deliver it in a pre-rendered PNG format ready for display in an application.

Another option generates TileJSON-format responses. Instead of a pre-rendered image, the API returns tile metadata that any tool able to parse TileJSON can display. Unlike a standalone call, this is an interactive response: as you move around the map, new data is generated and loaded without you needing to write code to handle the query. The host application’s TileJSON support will deliver this for you.
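If your host application doesn’t handle TileJSON for you, the core of the format is straightforward: a “tiles” array of URL templates containing {z}/{x}/{y} placeholders, which the viewer fills in for each tile as you pan and zoom. A minimal sketch (the tile URL here is hypothetical):

```python
def tile_url(tilejson: dict, z: int, x: int, y: int) -> str:
    """Expand the first tile URL template from a TileJSON document.

    Per the TileJSON spec, 'tiles' holds URL templates with {z}/{x}/{y}
    placeholders; map viewers fill these in per tile as the user pans
    and zooms."""
    template = tilejson["tiles"][0]
    return (template.replace("{z}", str(z))
                    .replace("{x}", str(x))
                    .replace("{y}", str(y)))


# A minimal TileJSON response (fields abridged, URL hypothetical):
doc = {
    "tilejson": "2.2.0",
    "tiles": ["https://example.com/tiles/{z}/{x}/{y}.png"],
}
```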

One possibility is the Folium Python toolkit, which will layer your Planetary Computer results on a base map. The default is OpenStreetMap, which allows you to build public-facing applications without navigating the often-complex licensing of commercial mapping.

The API supports more complex queries too, with the ability to build mosaics that cover larger areas. Again, data is returned as TileJSON format tiles and you can use the same techniques to display the results.

Jumpstart analysis with Explorer

A useful part of the Planetary Computer tool set is its Explorer. This helps you quickly display a data set from the service’s catalog. It’s a relatively simple application with two panes. The first allows you to pick a data set from the catalog and then choose a date and the information you want to display. The other is a map view using a familiar mapping control where you can zoom into a place and overlay your selected data. Some of the data in the catalog is surprisingly up-to-date. For example, at the time of this writing in early March 2026, Landsat imagery for the United Kingdom was available up to the end of February. However, not every data set is global, and the Explorer allows you to see what is available for the locations you want to explore.

A useful feature of the Explorer is the ability to use different data sets as different layers. For example, you can show the relationship between leaf cover and ground temperature in urban and rural areas. Alternatively, you can display different temporal ranges from the same data set to see changes over time, such as showing flood patterns for a river or the effects of deforestation.

The Explorer is perhaps best thought of as a basic prototyping tool. It allows you to bring together data in a way that shows how information from different sensor platforms can be combined to give insights and help make policy decisions that can affect entire countries.

From Explorer to applications

Once you’ve built a visualization in Explorer, it’s a matter of clicking the “code snippet for search results” button to get the necessary Python to implement the same search in your code. This gives you a helpful tutorial on how to use the STAC API, building a polygon for searching and then getting data. The snippet is only for search; it will not include the code to build a visualization.

Planetary Computer offers public access to its data, so there’s no need for a token for most queries. However, some of its data is stored in Azure Blob storage, and here you will need a token to include in your queries. This can be generated by a call to a token endpoint in the data set storage account. Calling the endpoint will return the necessary token for all queries to that data set, along with an expiration time that manages how long a token is cached in your application before making a new request.
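The shape of that flow (an expiring token appended to the asset URL as its query string, cached until shortly before it lapses) can be sketched as follows. This is an illustration of the pattern rather than Microsoft’s client; the official planetary-computer Python package performs the token request, signing, and caching for you.

```python
import time
from urllib.parse import urlsplit, urlunsplit


def sign_url(asset_url: str, token: str) -> str:
    """Attach a SAS-style token to an Azure Blob asset URL as its query
    string, the general shape of a signed Planetary Computer request."""
    parts = urlsplit(asset_url)
    return urlunsplit(parts._replace(query=token))


class TokenCache:
    """Cache one token per data set until shortly before it expires,
    mirroring the expiry-driven caching described above."""

    def __init__(self, fetch, slack_seconds: int = 60):
        self.fetch = fetch  # callable: dataset -> (token, expiry_unix_time)
        self.slack = slack_seconds
        self._cache = {}

    def get(self, dataset: str) -> str:
        token, expiry = self._cache.get(dataset, (None, 0.0))
        if time.time() > expiry - self.slack:  # missing or about to lapse
            token, expiry = self.fetch(dataset)
            self._cache[dataset] = (token, expiry)
        return token
```

Here `fetch` stands in for the call to the data set’s token endpoint; it returns the token plus its expiration time, so a new request is made only when the cached token is close to expiring.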

Microsoft does apply rate limits to its data; these depend on whether queries come from outside the Azure region that hosts Planetary Computer and on whether your query includes a token. For best performance, always use an API key and host your applications in West Europe. Much of the required functionality is built into the Planetary Computer Python library, which includes a function that signs requests and manages token caching for you.

Working with Codespaces

If you want a quick way to start using Planetary Computer, Microsoft suggests forking the project’s sample GitHub repository and then using it as the basis of a Codespace environment. To get the best performance, make sure it’s running in West Europe and then launch a Dev Container based on any of the sample environments to start building your own applications.

If you prefer familiar desktop geographical information systems tools, there’s the option of adding a STAC plug-in to the open source QGIS to explore and analyze the data in Planetary Computer’s catalog. This gives you a quick way to mix its data with your own to test hypotheses and get information to support other applications, perhaps to help understand historical patterns for agriculture or planning.

It’s good to see Microsoft supporting pure research and education with tools like this; research teams need access to good data and the ability to bring it into their own applications. At the same time, offering a large-scale service like this gives Microsoft an effective way to monitor and improve Azure’s own services, with a known set of data and APIs that can provide the necessary telemetry and observability to help evaluate new storage and networking features.


Cloud-based LLMs risk enterprise stability 17 Mar 2026, 9:00 am

Enterprises are embracing cloud-hosted large language models (LLMs) at unprecedented rates. Lured by the promise of rapid deployment, scalability, and transformative capabilities, organizations are becoming increasingly entwined with these outsourced intelligence engines. However, a dangerous underlying pattern is emerging, one too often overlooked until catastrophe strikes.

The ease and accessibility of cloud-hosted LLMs are making enterprises neglect the principles of basic architectural resilience. Recent events, especially the major outages of 2025 that shut down production for hours and cost billions for global companies, highlight the need for serious reconsideration. We must understand that LLM outages are not rare anomalies; they are becoming more likely and can have serious, companywide impacts.

Any enterprise architect or CTO who has experienced major infrastructure shifts—from mainframes to client-server systems or from on-premises to the cloud—knows that emerging technologies can be double-edged swords. LLMs integrated as SaaS or API endpoints are among the most powerful tools available, enabling new customer experiences, automated decision-making, and redefined workflows. However, as with any change, there is a downside: LLMs, whether from Anthropic, OpenAI, or others, are mostly accessed through a small number of large cloud providers.

This shift marks a major departure from the traditional shop model of earlier internet days, where each company managed its own system, and failures were contained. Today, when an LLM or its cloud host encounters issues, the impact spreads quickly across dozens and sometimes hundreds of dependent businesses in real time. This was clearly demonstrated in 2025 when both a key LLM provider and its cloud infrastructure faced outages. For nearly seven hours, applications powered by LLMs, ranging from legal AI tools to customer service chatbots and supply chain decision systems, became inoperative. The financial losses were significant and tangible: billions lost in revenue and huge costs for emergency fixes.

Outages become more frequent

It is tempting to dismiss large-scale cloud or LLM failures as rare, black-swan events that won’t recur for years. But this is wishful thinking. By relying on a few hyperscale providers for the computational power of enterprise applications, we have created centralized points of failure in our most vital business systems. The convenience and cost-efficiency of third-party LLMs hide a fragile truth: As more organizations rely on these shared services for their data, reasoning, and engagement, each provider becomes a bigger target for operational issues, cyberattacks, misconfigurations, or software bugs.

Furthermore, the demand for LLM services is growing rapidly, pushing the limits of current infrastructure and increasing the risk of overload. Providers are also evolving quickly, layering new models and capabilities on top of complex legacy cloud systems. This creates unstable ground beneath what many executives expect to be a “set-and-forget” solution.

Forgotten architectural foundations

Enterprise architecture isn’t just about innovation; it involves managing risk, especially when adopting dependency-heavy technologies. A harsh truth from the 2025 outages is that many enterprises overlook resilience until it’s too late. Key architectural questions, including how systems degrade during outages, where dependencies are located, and what failover options are in place, are often ignored in favor of faster results.

This oversight is understandable. Architectural resilience is rarely glamorous and doesn’t showcase well, but it’s essential. The time to consider LLM or cloud provider outages isn’t during a crisis, but when initially designing and deploying these systems. Resilience must be intentionally built, not just hoped for.

There are three essential steps to resolving this issue.

First, enterprises need a clear-eyed audit of their LLM dependency chains. This involves more than a superficial review of vendor redundancy. It requires listing where LLMs are used, mapping out upstream and downstream dependencies, and understanding exactly how essential business processes would perform—or fail—if those AI endpoints became unavailable. Many organizations will be surprised by how many mission-critical functions now depend, perhaps invisibly, on a single external LLM.
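That audit can start out very simply, as a mapping of business functions to the model endpoints they call. The sketch below uses entirely hypothetical service and provider names to show the two questions worth asking: which functions depend on a single provider, and what a given provider outage would take down.

```python
# Hypothetical inventory: business function -> the external AI endpoints it calls.
llm_dependencies = {
    "customer_chat": ["openai:gpt"],
    "contract_review": ["anthropic:claude"],
    "ticket_triage": ["openai:gpt"],
    "demand_forecast": ["openai:gpt", "inhouse:forecaster"],
}

def single_provider_functions(deps):
    """Functions whose every LLM dependency comes from one provider."""
    return sorted(
        fn for fn, endpoints in deps.items()
        if len({e.split(":")[0] for e in endpoints}) == 1
    )

def blast_radius(deps, provider):
    """Every business function hit if the given provider goes down."""
    return sorted(
        fn for fn, endpoints in deps.items()
        if any(e.startswith(provider + ":") for e in endpoints)
    )
```

Even a toy table like this tends to surprise: three of the four functions above have no second provider to fall back on.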

Second, there should be a focus on architectural patterns that enable graceful degradation. If an LLM goes offline, can customer-facing apps switch to simpler but still functional rules-based interfaces? Is there a cache of responses or business rules to maintain operations temporarily? Consider old-school fallback strategies like local models, simplified algorithms, or manual processes that can be spun up if automation fails. The goal is to preserve core functions and protect the bottom line during outages, not to eliminate inconvenience.
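The degradation pattern itself is not complicated. A minimal sketch, assuming a generic `llm_call` client function (any real client would go in its place), wraps the model call and drops to a deliberately simple rules-based path on any failure:

```python
def rules_based_reply(message):
    """Deliberately simple fallback: keyword routing instead of an LLM."""
    text = message.lower()
    if "refund" in text:
        return "Your refund request has been logged; an agent will follow up."
    if "status" in text:
        return "You can check order status at any time from your account page."
    return "We received your message and will respond within one business day."

def answer(message, llm_call=None):
    """Try the LLM first; degrade to the rules-based path on any failure.

    Any exception from the client (timeout, outage, rate limit) drops
    us to the fallback rather than to the floor.
    """
    if llm_call is not None:
        try:
            return llm_call(message)
        except Exception:
            pass  # in production: log, increment an outage metric, alert
    return rules_based_reply(message)
```

The point is not that the fallback is good; it is that core functions keep answering while the outage lasts.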

Third, enterprises should invest in ongoing simulation and readiness drills. Just as disaster recovery teams rehearse for data center or network failures, development and operations teams must practice the very real scenario of an LLM outage. These drills should include tabletop exercises (what to do if production LLM access is lost for three hours, or a security breach hits the LLM provider) as well as live failover tests that verify if fallback architectures actually work.

We are entering a new era where the strategic value of LLMs is matched only by the scale of risk they introduce. The rising frequency of outages shows how dependence on cloud-based AI creates a fragile, collective vulnerability in the digital economy. Enterprises must confront this reality by reassessing resilience, mapping dependencies, practicing for failure, and restoring robust design. Enterprises that act now will protect their AI investments against future outages and build a durable, future-proof AI foundation.


Update your databases now to avoid data debt 17 Mar 2026, 9:00 am

2026 should be a year of database updates, upgrades, and migrations. Across all of the most commonly used open source databases, end-of-life dates are forcing teams to take stock and move their workloads. The alternative is to stick with what is in place. While staying put might work in the short term, it will lead to more problems and higher costs over time. At the same time, moving too quickly has its own risks and potential challenges. How can you get the timing just right for your team?

Database dates for your diary

If you use MySQL, there is a major date to plan for: on April 30, 2026, MySQL 8.0 will reach End of Life (EOL) status. As a Long Term Support release, MySQL 8.0 is at the heart of many applications, so a lot of planning around moving to new systems should already be complete. For those of you running PostgreSQL, version 13 is already End of Life, and version 14 will move to EOL status when PostgreSQL 19 is released in September 2026. Redis will see two end-of-life dates in 2026, with 7.2 going EOL in February and 7.4 in November. Outside open source, MongoDB 6.0 will reach EOL status in June 2026.
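Working back from a hard date is easier with the dates in one place. A small planning helper might look like the sketch below; note that where the article gives only a month, the exact day used here is an assumption, and PostgreSQL 14’s date is pinned to the expected PostgreSQL 19 release.

```python
from datetime import date

# EOL dates from the article; days-of-month are assumed where only a
# month is given, and PostgreSQL 14 is pinned to the PostgreSQL 19 release.
eol_dates = {
    "MySQL 8.0": date(2026, 4, 30),
    "PostgreSQL 14": date(2026, 9, 30),
    "Redis 7.2": date(2026, 2, 28),
    "Redis 7.4": date(2026, 11, 30),
    "MongoDB 6.0": date(2026, 6, 30),
}

def migration_deadlines(eol, today, lead_days=180):
    """For each database, report when migration work should start,
    allowing roughly six months for testing and cut-over."""
    report = {}
    for name, eol_date in eol.items():
        days_left = (eol_date - today).days
        report[name] = {
            "days_to_eol": days_left,
            "start_by": date.fromordinal(eol_date.toordinal() - lead_days),
            "overdue": days_left < lead_days,
        }
    return report
```

Run against a date in late 2025, this flags MySQL 8.0 and both Redis releases as already inside the six-month window.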

For teams running any of these databases, planning ahead around updates will involve a lot of work. For teams that follow a best-of-breed approach and use multiple databases in their stacks for different workloads, the challenge is even greater. Any migration should ideally start at least six months early to allow enough time for testing, compatibility checks, and successful cut-overs in advance of any end-of-life date. Yet the world of IT is rarely ideal; these applications might be critical to the business, leading to compressed timelines or even projects being postponed repeatedly. In a world where “if it’s not broken, don’t try to fix it” is often sage advice, a database migration might be seen as a lot of work for very little reward and high risk when things don’t go according to plan.

So what can teams do to get ahead of these projects and the potential problems that can come up?

Planning the move

The first place to start is knowing all of the database systems that are in place, across test, development, and production instances, and across database versions. There might be multiple versions of one database in use, or several different databases deployed side by side. Either way, making an accurate list of what is implemented is essential. Even if you run your databases in the cloud using a managed service, those databases will be a specific version and will need to be updated over time.

Once you have that list of database instances and versions, you can decide if and when they should be updated. Test and development instances can be moved sooner, while production deployments can be moved once the database is proven to be as resilient and reliable as the existing versions. Everyone in IT is familiar with the rule of not implementing a version of software that is *.0, and waiting for the inevitable bugs or deployment problems to be patched. For production environments that have to deliver to service levels, that move will take some additional time.

You may also want to estimate the time frame for your project. Critical applications will need more careful planning and testing before they get shifted, while less important ones might need less time. Similarly, critical applications might have more complex deployments like sharded databases or clustering for availability. Updating a distributed database is harder and will take more time than a single-server deployment.

However complex your environment is, try creating a standard estimate for projects. An estimate will give you some internal deadlines for those projects based on your understanding of complexity, deployment type, and, most important of all, how critical the application is to the business. This will help you plan not only the migrations, but also your communications around the moves and the potential impacts they might have on the business. For truly mission-critical applications, this timeline might have to extend to twelve months.

Once you are ready to commit to a move, you should measure your performance today before any changes are made. This gives you a benchmark for your existing systems and a picture of what “good” looks like. Without this measurement, you will not be able to determine how successful the move has been. Even for end-of-life software migrations, businesses expect to see some form of return on their investment, and a performance boost from a move can count towards that return.
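A before-and-after comparison only works if the baseline is captured the same way each time. A small harness like this sketch records median and tail latency; the `query_fn` argument is a placeholder for running a representative query against your real database.

```python
import statistics
import time

def benchmark(query_fn, runs=50):
    """Capture a latency baseline before any migration work starts.

    query_fn stands in for executing a representative query against
    the current database; in practice you would run your real workload.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        query_fn()
        samples.append((time.perf_counter() - start) * 1000.0)  # milliseconds
    samples.sort()
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[int(len(samples) * 0.95) - 1],  # rough 95th percentile
        "max_ms": samples[-1],
    }
```

Saving this report alongside the migration plan gives you the picture of what “good” looks like, and a number to compare against after the move.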

The details of the migration will depend on your database and how it handles the move. Some updates will be simple and can be carried out in place, while others will be more complex and involved. The most serious are those where there is no simple rollback process; in effect, once you migrate, there is no route back. For these situations, restoring a backup may be the only recourse. Similarly, a restoration may be in order if you have a problem with performance.

Alongside the database itself, you will have to look at the overall application environment as well. Any change to the database can have a knock-on effect on the application developers, and on the line-of-business team responsible for the service. Implementing a test environment for the updated database for compatibility testing will help you ensure proper behavior and functionality. Alongside this test, you should also look at your documentation, so you can plan your architectural changes and ensure you have fully described your production implementation, rather than just thinking you have described it.

Potential challenges

The biggest challenge around database migrations is getting people on board with the project. To make it easier to get support, it’s important to look beyond the EOL date. Instead, look at the benefits that a change can deliver around performance or ease of use. These improvements might seem small, but they can quickly add up to real business value.

The next biggest challenge in any database migration is getting things to work as expected. You may find edge scenarios that use areas of your database that have changed and that were outside your initial testing and planning. Those functions then have to be updated and implemented, with the same attention to detail that they might have needed before the move.

To make all this work successfully, you should put together a budget for the migration project. This budget should cover any additional personnel needed during the move, as well as the cost of hardware or other extra resources. Whatever you might have estimated, you may face additional expenses when extra requirements crop up or unplanned extras need support. Such contingencies must be planned for, or you risk the migration failing as a whole.

How you communicate around a migration project can make a huge difference to its success or failure. Your communications plan should cover those directly involved, like the developers and infrastructure managers, through to those who own the application within the business. Each of these groups may be affected by the migration, and must be made aware of the issues and pain points that may come up. By agreeing on communication paths ahead of the move, you make it more likely that you can work together effectively throughout the whole project.

In 2026—and beyond—database migrations will demand time, effort, and attention from developers and infrastructure teams. To make those migrations effective, planning ahead around end-of-life scenarios will involve preparation, testing, and measurement. Working back from those deadlines will help you prepare more effectively and make the business case to move at the right speed for your team, rather than leaving things to the last minute. The goal here is to reach the Goldilocks zone, where migrations are not too fast or too slow, but just right for the business.

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.


Startups accuse Microsoft of ‘billing trap’ in Azure AI Foundry after unexpected charges 16 Mar 2026, 3:48 pm

A growing number of startup founders are raising concerns about unexpected charges incurred while experimenting with AI models through Microsoft’s Azure AI Foundry platform, turning what began as an isolated complaint into a broader debate over billing transparency.

At least 20 participants in the Microsoft for Startups program have signed a Change.org petition calling on Microsoft CEO Satya Nadella to address what they describe as a “billing trap” inside Azure AI Foundry, arguing that the platform’s interface makes it difficult to distinguish between services covered by startup credits and third-party models that incur direct charges.

“Azure AI Foundry displays both Microsoft-native models (such as Azure OpenAI) and third-party Marketplace models (such as Anthropic Claude) in a completely unified interface — with no visual distinction, no warning, and no confirmation step before charges are incurred,” the petitioners wrote.

The petition, which goes to the extent of claiming that Microsoft had breached the founders’ trust, was drafted by Takuya Tominaga, founder of Tokyo-headquartered startup Leach, who was one of the first to report the billing issue in a detailed blog post.

In the post, Tominaga wrote that he was unaware of any billing for model use until his credit card statement arrived containing a charge of about $1,600 for the use of one of Anthropic’s models.

The founder further wrote that contacting Microsoft support about the issue via the Azure portal was an arduous task, as the portal wouldn’t let him get directly in touch or report the issue.

After he did get through to Azure Support on X via direct message, he was directed to the fine print in Microsoft’s documentation that says that startup credits cannot be used for Microsoft Azure support plans, third-party branded products, products sold through Microsoft Azure Marketplace, or products otherwise sold separately from Microsoft Azure.

When Tominaga pushed further, he wrote that he was offered a partial refund via credits worth $1,000, which he rejected, and was directed to contact Anthropic with any further refund requests.

Anthropic responded to Tominaga by saying that it does not have visibility into usage through Microsoft Foundry and was unable to process a refund.

Tominaga is not alone. Riyaj Shaikh, a systems architect at EPAM Systems in Pune, said in a post on X that he had encountered a similar situation, and that attempts to resolve the billing issues appeared to bounce between the two companies, with each pointing to the other as the appropriate party to handle refunds.

In fact, Shaikh, in the same thread, pointed out that Microsoft’s own moderators are not certain about how large language models are billed as part of the Microsoft for Startups program, and pointed to a post on Microsoft Learn’s official Q&A forum.

In response to a question from a forum user, a moderator had confirmed that startup credits could be used for deploying Claude Opus 4-5 via Azure AI Foundry.

The post was later amended to indicate that startup credits don’t apply.

Shaikh told InfoWorld in an email that he and his team have yet to receive any refund, and have instead been bounced back and forth between Anthropic and Microsoft, with each pointing them to the other for resolution.

Bogdan Sevriukov, founder of AI-based workforce training startup Comprenders, confirmed that he, too, was facing a similar issue over a charge of €999.60 (about $1,147).

In response to an email seeking comment on the complaints and asking whether it plans to modify the way third-party AI models are presented and billed within Azure AI Foundry, a Microsoft spokesperson said, “We listen closely to customer feedback and are continuously working to provide clear guidance in our product documentation, including pricing details and credit eligibility. We encourage customers to rely on official documentation and to submit a support ticket for additional assistance specific to their environment.” 

The petitioners say that relatively small design changes to Azure AI Foundry’s UI could prevent similar incidents, and are urging Microsoft to introduce clearer labeling, explicit billing warnings, and confirmation prompts before developers deploy third-party models.

These changes, they argue, would help ensure that startups experimenting with AI prototypes do not inadvertently incur unexpected charges and exhaust their budgets.


Open VSX extensions hijacked: GlassWorm malware spreads via dependency abuse 16 Mar 2026, 11:37 am

Threat actors are abusing extension dependency relationships in the Open VSX registry to indirectly deliver malware in a new phase of the GlassWorm supply-chain campaign.

Researchers at Socket said they have identified at least 72 additional malicious Open VSX extensions linked to the campaign since January 31, 2026. The extensions appear to target developers by posing as helpful tools, such as linters, formatters, database utilities, or integrations for AI coding assistants, while serving as delivery vehicles for a malware loader linked to the GlassWorm operation.

“Instead of requiring every malicious listing to embed the loader directly, the threat actor is now abusing ‘extensionPack’ and ‘extensionDependencies’ to turn initially standalone-looking extensions into transitive delivery vehicles in later updates, allowing a benign-appearing package to begin pulling separate GlassWorm-linked extension only after trust has already been established,” Socket researchers said in a blog post.

The new campaign technically retains the same core GlassWorm tradecraft while improving survivability and evasion, the researchers added.

Supply-chain attack hiding in extension relationships

extensionPack and extensionDependencies are two features commonly used by Visual Studio Code extensions to bundle or require other extensions.

According to Socket, threat actors are publishing clean-looking extensions that, after gaining user trust and passing marketplace checks, are later updated to include dependencies on separate extensions that contain the GlassWorm loader. When installed or updated, the editor automatically installs all referenced extensions, including the malicious payload.
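`extensionPack` is an ordinary field in an extension’s `package.json` manifest, which is what makes the technique so quiet. A hypothetical malicious update might look like the fragment below; every identifier here is invented for illustration.

```json
{
  "name": "handy-formatter",
  "displayName": "Handy Formatter",
  "version": "1.2.0",
  "engines": { "vscode": "^1.85.0" },
  "extensionPack": [
    "some-publisher.payload-extension"
  ]
}
```

A user who installed version 1.1.0, which had no such field, receives the payload extension automatically the next time the editor updates to 1.2.0.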

This transitive delivery model creates a supply-chain pathway similar to dependency abuse in package ecosystems like npm, where a recent maintainer compromise led to malicious updates spreading backdoor malware. The infamous Shai-Hulud campaign, which had compromised over 800 packages by November 2025, is another instance of self-propagating dependency abuse.

The new approach likely lowers operational overhead for attackers. Instead of embedding the loader in every malicious extension, they can maintain a smaller number of payload extensions while distributing them through a wider network of dependency relationships.

The evolving GlassWorm

Earlier research into the GlassWorm operation has revealed techniques such as heavy code obfuscation, the use of Unicode characters to hide malicious logic, and infrastructure that retrieves command-and-control servers through blockchain transactions, making the campaign more resilient to takedowns.

The latest wave also mimics widely used developer tools to maximize installation chances. “The extensions overwhelmingly impersonate widely installed developer utilities: linters and formatters like ESLint and Prettier, code runners, popular language tooling for Angular, Flutter, Python, and Vue, and common quality-of-life extensions like vscode-icons, WakaTime, and Better Comments,” the researchers said. “Notably, the campaign also targets AI developer tooling, with extensions targeting Claude Code, Codex, and Antigravity.”

The researchers added that as of March 13, Open VSX has removed the majority of the transitively malicious extensions, yet a few remain live, indicating ongoing takedowns.

Socket published indicators of compromise (IOCs) tied to the campaign, including the names of dozens of malicious Open VSX extensions and associated publisher accounts believed to be linked to the operation. Additionally, the researchers recommend treating extension dependencies with the same scrutiny typically applied to software packages. Organizations should monitor extension updates, audit dependency relationships, and restrict installation to trusted publishers where possible, as attackers increasingly exploit the developer tooling ecosystem as a supply-chain entry point.
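Auditing dependency relationships can be partially automated by diffing successive versions of an extension’s manifest. The sketch below checks the two manifest fields abused in this campaign; the extension names are hypothetical, and a real audit would pull manifests from your editor’s extensions directory.

```python
import json

def added_dependencies(old_manifest, new_manifest):
    """Flag extensions a new manifest version pulls in that the old one didn't.

    Inputs are package.json contents as JSON strings; only the two
    fields abused in the GlassWorm campaign are examined.
    """
    old, new = json.loads(old_manifest), json.loads(new_manifest)
    flagged = []
    for field in ("extensionPack", "extensionDependencies"):
        before = set(old.get(field, []))
        for ext in new.get(field, []):
            if ext not in before:
                flagged.append((field, ext))
    return flagged
```

A non-empty result is not proof of compromise, but it is exactly the kind of change that deserves a manual look before the update is allowed through.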


How AI is changing open source 16 Mar 2026, 9:00 am

Open source has become less of a “thing” in the last few years. Oh, sure, you’ll find the usual suspects waving their “open source is always better” flag, even as the AI community keeps releasing ambitious (and very closed) models and other tools (and as the very nature of open source evolves, as I’ve argued time and time again). This doesn’t mean open source is fading in importance. It’s not. As CNCF’s contribution tables, GitHub’s Octoverse data, or the Apache Software Foundation’s latest annual report indicate, open source engagement is shifting to the layers that matter most: Kubernetes (yes, really), observability, platform engineering, networking, and the infrastructure required to make AI work in production.

Open source grew up and became dull. We’re all better for that.

Control through code

While we can’t help but be inundated by news of this or that latest model, open source keeps quietly chugging away in the background. CNCF now hosts more than 230 projects with more than 300,000 contributors worldwide. Its 2025 survey found that 98% of organizations have adopted cloud-native techniques, and 82% of container users now run Kubernetes in production. GitHub’s 2025 Octoverse report tells the same story but from an even wider angle: 1.12 billion contributions, more than 180 million developers, and a record 518.7 million merged pull requests. Apache is a bit less flashy but isn’t exactly withering, either. The ASF says it had 9,905 committers working across 295 projects and issued 1,310 software releases in fiscal year 2025.

Who employs all the developers contributing this code? In 2025, as CNCF Devstats show, Red Hat led all CNCF contribution activity with 194,699 contributions. Second place? Microsoft with 107,645. And third? Google at 91,158. Independent contributors still mattered, landing fourth at 52,404, which is a useful reminder that open source hasn’t become purely corporate. But the center of gravity is unmistakable. Serious companies now spend serious money for engineers to shape the plumbing their products depend on. The top contributors have remained constant over the past decade, indicating their willingness to invest in the long game. But during that same time we’ve seen an influx of new contributors, too. 

That shift matters because it changes how we should read open source contributions. Too many people still talk about them as if they were mostly philanthropy. Too many open source program offices still try to convince their engineering teams to contribute because “it’s the right thing to do,” and they hope their developers’ efforts will ingratiate the company into some nebulous community. Nope. Open source is increasingly where vendors try to set defaults, normalize interfaces, and shape the operational assumptions everyone else has to live with.

In other words, open source has become less about openness for its own sake and more about control. Not proprietary control, exactly, but control over the layers where ecosystems harden into standards. The companies investing upstream aren’t doing it because they’ve discovered civic virtue. They’re doing it because whoever shapes the substrate usually gets leverage over everything built on top of it.

Who gives, and why?

Take Red Hat. It’s still the heavyweight in CNCF, which isn’t hard to explain. Red Hat’s OpenShift is a Kubernetes-centric application platform. So of course Red Hat continues to pour effort into the Kubernetes-centered world. That’s not community service; it’s product strategy. It fits the way Red Hat has long exercised influence (and control). But it’s not charity. Fortunately for Kubernetes, Red Hat isn’t alone in contributing to Kubernetes; the stats point to a growing, increasingly diverse contributor base across thousands of organizations.

Kubernetes won because it became too important for any serious infrastructure company to ignore, and Red Hat contributes heavily because its business depends on that remaining true.

Microsoft’s position is even more revealing. Once the company most associated with open source hostility, it now sits second in overall CNCF contributions in 2025. But the more interesting signal is where companies like Microsoft are investing. OpenTelemetry has become one of the fastest-rising CNCF projects, with a 39% rise in commits in 2025 and a contributor base that grew from 1,301 to 1,756 in a single year. Again, this isn’t about charity—more like a land grab around observability standards. Microsoft, Splunk, and other top OpenTelemetry contributors are all helping in order to help themselves. That’s the way open source has always worked.

Then there’s Cilium, which is what happens when boring infrastructure stops being boring, as I recently noted. Cilium’s journey report says the number of contributing companies rose 90% after it joined CNCF, from 533 to 1,011, while individual contributors jumped from 1,269 to 4,464. Google, Datadog, and Cloudflare all expanded their contributions as the project matured. That’s not random. Cilium sits at the intersection of networking, observability, and security, which are precisely the categories that become mission-critical once workloads become distributed, latency-sensitive, and expensive. AI may be driving headlines, but a lot of the real strategic work is happening in projects like Cilium, where the infrastructure determines whether those AI workloads are governable, visible, and efficient.

And how about Nvidia, a company with so much cash it could buy a few countries and set all their developers to work building for Nvidia. But this isn’t how Nvidia has chosen to spend its riches: It ranked 14th in Kubernetes contributions in the past two years, with 5,892 contributions. It has also open sourced KAI Scheduler, a Kubernetes-native GPU scheduler that came out of Run:ai, and Nvidia has described itself as a key contributor to Kubeflow. In other words, Nvidia isn’t just selling chips; it’s investing in the scheduling, orchestration, and workflow layers that determine how effectively those chips get used in real-world AI systems. And it’s doing so through developer communities, rather than lump sum cash payouts.

The Nvidia work is a tell for where open source is going in AI. CNCF says 66% of organizations hosting generative AI models now use Kubernetes for some or all inference workloads, and it explicitly calls Kubernetes the de facto operating system for AI. Of course it would say that, given the foundation’s dependence on Kubernetes as a tentpole project, but that doesn’t diminish the reality that Kubernetes and Kubeflow are increasingly central to training and inference systems. In sum, AI is making open infrastructure more important because few organizations really want to build their future on opaque, inescapable infrastructure they can’t inspect or influence.

An essential supporting actor

So is open source increasing in importance? Absolutely, but not in the warm, nostalgic way some people still imagine. It’s becoming less romantic and more essential. The old story about open source as a fringe alternative or a developer-led morality play was never true, but it’s not even remotely credible now. Open source is where the cloud-native stack gets standardized, where observability gets normalized, where platform engineering gets productized, and where AI infrastructure is increasingly being built.


Migrating from Apache Airflow v2 to v3 16 Mar 2026, 9:00 am

During the 2025 holidays, I had some downtime and decided to migrate one of my pet projects from Apache Airflow 2.10.3 to 3.0.6. This article is based entirely on my hands-on experience with that migration and captures my initial takeaways on what worked well, what felt rough around the edges, and where Airflow 3 genuinely changes how we think about workflow orchestration.

Rather than covering every new feature, I want to focus on the five changes that stood out the most during the migration.

1. Unified SDK imports improved developer experience

One of the first changes I appreciated as a developer was the move to SDK-first imports. In Airflow 2.x, DAG authoring often required importing objects from multiple modules such as decorators and models. Airflow 3 consolidates this into a more intuitive and unified SDK surface, making DAG code easier to read, write and maintain.

Airflow 2.10.3

from airflow.decorators import dag, task
from airflow.models import Param

Airflow 3.0.6

from airflow.sdk import dag, task, chain
from airflow.sdk.definitions.param import Param

This may look like a small change, but across a growing codebase it significantly reduces cognitive overhead and improves consistency in DAG authoring.

2. Clearer separation between DAG code and the metadata database

The SDK shift also reflects a larger architectural change in Airflow 3: DAG and task code are now intentionally decoupled from the metadata database.

In practice, this resulted in:

  • Fewer accidental dependencies on metadata DB objects
  • Clearer boundaries between orchestration and execution
  • A safer and more scalable execution model built around APIs

As someone who prefers lightweight and decoupled architectures, this felt like a very welcome change and a solid foundation for the future of Airflow.

3. DAG versioning made historical runs trustworthy

With Airflow 3.x, each DAG run is tied to a specific DAG version, and this turned out to be one of the most valuable improvements for day-to-day operations.

In Airflow 2.x, even small changes such as renaming a task or refactoring logic could cause historical runs to appear inconsistent or confusing in the UI. Debugging older runs often meant mentally mapping today’s DAG code to yesterday’s execution.

With DAG versioning:

  • Each run executes against the exact DAG definition it started with
  • Historical runs remain accurate and easy to reason about
  • Debugging past failures no longer depends on current code

This alone significantly improved traceability and confidence when evolving workflows.

4. New UI: modern, but still a work in progress

The redesigned UI in Airflow 3 is one area where my experience was mixed. I have been on the receiving end of feedback from users and clients who found the new layout disorienting, mostly due to buttons moving or workflows changing. While I am personally open to UI changes, the new interface did feel rough around the edges. Some of these behaviours may vary depending on deployment configuration and version.

Some of the issues I noticed:

  • Task ordering appeared inconsistent in the DAG Grid view
  • The enable/disable DAG toggle intermittently disappeared
  • The Delete DAG action was harder to discover
  • Page loads felt slower at times
  • Searching historical DAG runs using date-based filters was less intuitive than before

The UI is clearly more modern and designed to scale better long-term, but today it feels less polished for everyday operational workflows. I expect many of these issues to be addressed as Airflow 3.x matures, though I still find myself missing the predictability of the older UI.

5. Asset-based scheduling simplified cross-DAG dependencies

One of the most impactful conceptual changes in Airflow 3 is the shift from Datasets to Assets, especially for modelling cross-DAG dependencies.

In my project, several workflows follow a familiar pattern:

  • A file lands in S3
  • One job processes the file
  • A downstream job runs only after that data is ready

In Airflow 2.x, this usually meant chaining sensors and explicit DAG triggers. While functional, this approach added coupling and operational complexity.

With Assets, the focus shifts from “which DAG triggers which?” to “what data is now available?” Workflows become data-driven rather than DAG-driven, resulting in cleaner definitions, fewer sensors and better visibility into real data dependencies.

This felt like a much more natural way to express how data pipelines actually work.

Airflow 2.10.3

from airflow.operators.trigger_dagrun import TriggerDagRunOperator

trigger_downstream = TriggerDagRunOperator(
    task_id="trigger_downstream",
    trigger_dag_id="downstream_dag"
)

Airflow 3.0.6

Upstream job

from airflow.sdk import task
from airflow.sdk.definitions.asset import Asset

MY_JOB_ASSET = Asset("db://my-data")

@task(outlets=[MY_JOB_ASSET])
def produce_data():
    pass

Downstream job

from airflow.sdk import dag, task
from airflow.sdk.definitions.asset import Asset
from datetime import datetime

# Declared with the same URI as in the upstream job; Airflow matches assets by URI
MY_JOB_ASSET = Asset("db://my-data")

@dag(
    start_date=datetime(2024, 1, 1),
    schedule=[MY_JOB_ASSET]
)
def downstream_dag():

    @task
    def consume_data():
        pass

    consume_data()

downstream_dag()

Why this migration matters

Beyond individual features, migrating to Airflow 3 felt less like an optional upgrade and more like a necessary step forward. Airflow 3 represents a clear architectural direction for the project: API-driven execution, better isolation, data-aware scheduling and a platform designed for modern scale.

While Airflow 2.x is still widely used, it is clearly moving toward long-term maintenance (end-of-life April 2026) with most innovation and architectural investment happening in the 3.x line. Delaying migration only widens the gap:

  • More breaking changes accumulate
  • Provider compatibility becomes harder to manage
  • Teams miss out on improvements that simplify debugging and orchestration

For me, moving from 2.10.3 to 3.0.6 wasn’t just about staying current; it was about aligning with where Airflow is headed. Even with a few rough edges, Airflow 3 feels like the foundation the project needed for its next phase.

Disclaimer: The views expressed are my own and do not represent those of my employer.

This article is published as part of the Foundry Expert Contributor Network.


How to build an AI agent that actually works 16 Mar 2026, 9:00 am

Everyone is building “agents,” but there is wide disagreement about what that means. David Loker, VP of AI at CodeRabbit, which runs one of the most widely deployed agentic code review systems in production, has a practical definition that cuts through the hype: The company’s code review “happens in a workflow” with “two agentic loops” embedded at specific points where reasoning is actually needed. Not an autonomous AI roaming free. For CodeRabbit, the agent is a workflow with intelligence inserted where it counts.

That distinction—agents embedded in workflows, not agents as autonomous beings—turns out to be the difference between a demo and a production system. Here’s how to build the production version, grounded in CodeRabbit’s experience and backed by peer-reviewed research.

Although my central example here is a code review agent, the same eight basic principles discussed (and the 10-point checklist) apply to building any kind of agent.

Start with the workflow, not the model

Loker describes CodeRabbit’s architecture as “a workflow with models chosen at various stages… with agentic loops using other model choices.” The system doesn’t start with a large language model (LLM) and hope. It runs a deterministic pipeline that fetches the diff, builds the code graph, runs static analysis, identifies changed files, determines review scope, and then inserts agentic steps where judgment is actually needed.

“There are some things that we know are very important so we run them anyway,” Loker says. “The code graph analysis, import graph analysis, having this static analysis tool information there, the diff, and some of the file-level information.” This base context gets assembled deterministically before any reasoning model is invoked.

Research confirms this hybrid approach. The Agentic Design Patterns framework identifies five subsystems every agent needs: Perception and Grounding, Reasoning and World Model, Action Execution, Learning and Adaptation, and Inter-Agent Communication. ReAct (Reason + Act), the popular pattern where an LLM interleaves chain-of-thought reasoning with tool calls in a single loop, skips most of these subsystems, which is why it’s fragile. Separately, hybrid architectures that combine structured workflow with embedded agentic loops achieve an 88.8% average Goal Completion Rate across five domains, outperforming pure ReAct, chain-of-thought, and tool-only agents on most metrics, including ROI (see Towards Outcome-Oriented, Task-Agnostic Evaluation of AI Agents).

The takeaway is to map your domain process first. Identify which steps require judgment (agentic) and which are mechanical (deterministic). Build the workflow skeleton, then embed agents where they add value.
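That split can be sketched in a few lines of Python. The step names and payloads below are hypothetical stand-ins for a real pipeline; the point is that the "agent" is just one step among several deterministic ones:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]
    agentic: bool = False  # True only where a reasoning model is invoked

def run_workflow(steps: list[Step], ctx: dict) -> dict:
    """Run a fixed pipeline; agentic steps are ordinary steps whose
    `run` happens to call a model."""
    for step in steps:
        ctx = step.run(ctx)
    return ctx

# Deterministic steps assemble context; only the review step would
# actually call an LLM in a real system.
def fetch_diff(ctx):   return {**ctx, "diff": "..."}
def static_scan(ctx):  return {**ctx, "findings": ["unused import"]}
def review_agent(ctx): return {**ctx, "comments": [f"check: {f}" for f in ctx["findings"]]}

pipeline = [
    Step("fetch_diff", fetch_diff),
    Step("static_scan", static_scan),
    Step("review", review_agent, agentic=True),
]
result = run_workflow(pipeline, {})
```

The skeleton stays testable and debuggable end to end, and swapping the model inside the agentic step doesn't disturb the rest of the pipeline.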

Context engineering is the whole game

“Context engineering is the bread and butter” of what CodeRabbit does, Loker says. Not prompt engineering, but context engineering. The difference: Whereas prompt engineering is the process of crafting clever instructions, context engineering is the process of assembling the right information from the right sources, in the right structure, at the right time, for each step of the workflow.

CodeRabbit assembles context from the diff itself, full files, related files discovered via import graph, the code graph built from abstract syntax tree analysis, static analysis results, user-configured review instructions, learned patterns from past feedback, MCP (Model Context Protocol)-connected documentation, and web-fetched library docs. “There’s a massive exploration about how does this PR [pull request] connect up with all of the other aspects of the code,” Loker explains. “Which places could possibly have been impacted by your change and which parts of the codebase impact you.”

The level of detail is deliberately chosen per step. “The LLM is looking for something, it’s looking for a specific piece, and you can give that level of detail,” Loker explains. “Is it looking for just high-level summaries? Is it looking for snippets of code? Is it looking for actual line number code, detailed information? Do I need the whole function or do I just need the function signature? Sometimes, it might only be a function signature and maybe what the function is trying to accomplish that is enough information for us to understand whether or not you’re using it correctly.”

A recent academic survey of context engineering for large language models, covering retrieval and generation, processing, management, and system implementations (including RAG, or retrieval-augmented generation, memory, tools, and multi-agent coordination), reached a key finding: LLMs with advanced context engineering are remarkably good at understanding complex contexts but limited at generating equally complex outputs. The inverse also holds: Models capable of generating complex outputs are not necessarily good at understanding complex contexts.

The Agentic Context Engineering (ACE) study proves that context should be treated as an evolving playbook, not a static prompt. The ACE system uses incremental delta updates organized as structured “bullets” with metadata that grow and refine over time, rather than monolithic prompt rewrites. Monolithic rewriting caused context to collapse from 18,282 tokens to 122, dropping accuracy from 66.7% to 57.1%. The system treating context as a living, structured document achieved +10.6% on agent benchmarks.
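The delta-update idea can be sketched as a toy model of the ACE pattern. This is not the researchers' implementation; the pruning rule is an invented simplification:

```python
from dataclasses import dataclass

@dataclass
class Bullet:
    text: str
    helpful: int = 0
    harmful: int = 0

class Playbook:
    """Context as an evolving, itemized playbook: apply small deltas
    instead of rewriting the whole prompt each round."""
    def __init__(self):
        self.bullets: list[Bullet] = []

    def apply_delta(self, add=(), mark_helpful=(), mark_harmful=()):
        for text in add:
            self.bullets.append(Bullet(text))
        for i in mark_helpful:
            self.bullets[i].helpful += 1
        for i in mark_harmful:
            self.bullets[i].harmful += 1
        # Prune bullets that keep hurting, rather than rewriting everything
        self.bullets = [b for b in self.bullets if b.harmful <= b.helpful + 1]

    def render(self) -> str:
        return "\n".join(f"- {b.text}" for b in self.bullets)

pb = Playbook()
pb.apply_delta(add=["Prefer function signatures over full bodies"])
pb.apply_delta(add=["Always include the diff"], mark_helpful=[0])
```

Because each round only touches a few bullets, the playbook grows and refines without the collapse risk of a monolithic rewrite.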

But more context can make your agent worse

Here’s the trap Loker himself flags: “Context packing to that degree will end up in the situation where you’ll just forget. And it’ll only pay attention to some of it. Ultimately, even if it’s factually correct information, the LLM’s performance will degrade as you increase the context size.”

That’s not just an engineering intuition. Researchers at TII and Sapienza formalized this observation as The Distracting Effect, and the numbers are sobering. Not all irrelevant content is equally dangerous. They identified four types of distractors, from weakest to strongest: Related Topic (discusses something nearby but doesn’t contain the answer), Hypothetical (“In ancient Roman times…”), Negation (“It is a common misconception that…”), and Modal Statement (“The Pyramids may have been built via…”). That last category, hedged wrong answers, is the most dangerous because it mimics the style of authoritative text.

The counterintuitive finding is that better retrievers produce more dangerous distractors. The irrelevant results surfaced by stronger retrieval pipelines are more misleading than those from weaker ones. This makes RAG especially dangerous, because pulling semantically related but irrelevant information distracts the model more than nonsense. Adding a reranker makes it worse because related but irrelevant passages that survive reranking are the ones most likely to fool the LLM. Hard distracting passages reduce accuracy by as much as six to 11 points, even when the correct passage is also in the prompt.

This is why Loker emphasizes the selection step: “How do I then choose the information that’s appropriate? And that’s the part that’s like the actual context engineering because you can grab everything, but then you run out of space.”
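A minimal sketch of that selection step, assuming a relevance score per snippet and a crude four-characters-per-token estimate:

```python
def select_context(candidates, budget_tokens, est_tokens=lambda s: len(s) // 4):
    """Greedy selection: take the highest-scoring snippets that fit,
    rather than packing everything retrieved into the prompt."""
    chosen, used = [], 0
    for score, snippet in sorted(candidates, key=lambda c: -c[0]):
        cost = est_tokens(snippet)
        if used + cost <= budget_tokens:
            chosen.append(snippet)
            used += cost
    return chosen

candidates = [
    (0.9, "def pay(amount): ..."),            # directly relevant
    (0.4, "changelog for unrelated module"),  # plausible distractor
    (0.8, "signature: charge(card, amount)"),
]
picked = select_context(candidates, budget_tokens=15)
```

Under the budget, the low-scoring distractor never reaches the model, which matters more than any cleverness in the prompt itself.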

And more skills can make it worse, too

The same principle applies to procedural knowledge. SkillsBench, a large-scale benchmark for testing “Agent Skills” (structured playbooks injected at inference time) found that human-curated, focused skills raise the pass rate by +16.2 percentage points (pp) on average. But there are traps.

Self-generated skills, where the model creates procedural knowledge before solving the task, provide no benefit on average (-1.3pp). GPT-5.2 actually degraded by -5.6pp. Today’s models cannot reliably author the procedural knowledge they benefit from consuming. This means that auto-generated playbooks need human curation, not just agent self-reflection.

Two or three focused skills are optimal (+18.6pp). Once you hit four or more, gains collapse to +5.9pp. Comprehensive documentation actually hurts performance by -2.9pp. And 16 of 84 tasks showed negative deltas, meaning the skills introduced conflicting guidance or unnecessary complexity for tasks the model already handled well.

One bright spot: A smaller model plus good skills can match a larger model without skills. Haiku 4.5 with skills (27.7%) outperformed Opus 4.5 without skills (22.0%). Investing in curated procedural knowledge is often a better use of budget than upgrading to a bigger model.
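The "few focused skills" finding suggests capping injection explicitly. In this sketch, the skill library, tags, and priority field are all hypothetical:

```python
def inject_skills(task_tags, skill_library, max_skills=3):
    """Pick at most a few focused, human-curated skills that match the
    task; SkillsBench suggests gains collapse past two or three."""
    relevant = [s for s in skill_library if s["tag"] in task_tags]
    relevant.sort(key=lambda s: -s["priority"])
    return relevant[:max_skills]

skill_library = [
    {"tag": "sql", "priority": 2, "text": "Use parameterized queries."},
    {"tag": "sql", "priority": 1, "text": "Prefer explicit column lists."},
    {"tag": "css", "priority": 3, "text": "Avoid !important."},
    {"tag": "sql", "priority": 0, "text": "Document schema migrations."},
    {"tag": "sql", "priority": 3, "text": "Check for N+1 query patterns."},
]
picked = inject_skills({"sql"}, skill_library, max_skills=3)
```

The hard cap is the point: Even when more skills match, only the top few make it into the context.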

Use the right model for each job

“We’re using a combination of more than 10 different variants, depending on the area of the workflow that you’re in,” Loker says. Not because one model is bad, but because different steps need different capabilities.

“At every part of the workflow, depending on the level of requirement, depending on where that workflow sits in terms of difficulty, a model choice will be made,” Loker explains. “The details of how you call that model, the parameters, whether or not they’re using a lot of reasoning tokens or fewer reasoning tokens, the verbosity level… do I need to worry about things like latency? In which case I need to use some of these other models, especially for looping. For example, if I have a large latency and I’m looping, that sort of blows up.”

The cost dimension matters too. “We don’t pass on token costs to customers,” Loker says. “So ultimately we’re incentivized to find the models that are necessary, but ideally this [the chosen model] is necessary and sufficient for solving the problem. So we obviously bias towards quality, but we’re also testing out all the time what is the lowest tier that we can do and maintain that quality bar.”
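A per-step routing table like the one Loker describes might look like this sketch; the model names, step names, and latency caps are invented for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelChoice:
    name: str
    reasoning_effort: str  # "low" or "high"
    max_latency_s: float

# Hypothetical routes: heavy reasoning only where needed, fast models
# inside loops where latency compounds.
ROUTES = {
    "summarize_diff":  ModelChoice("small-fast", "low", 2.0),
    "deep_review":     ModelChoice("large-reasoning", "high", 30.0),
    "verify_comments": ModelChoice("midsize-other-vendor", "low", 5.0),
}

def pick_model(step: str, in_loop: bool) -> ModelChoice:
    choice = ROUTES[step]
    # Inside an agentic loop, per-call latency multiplies, so cap it
    if in_loop and choice.max_latency_s > 5.0:
        return ModelChoice("small-fast", "low", 2.0)
    return choice
```

Centralizing the routing also makes "test the lowest tier that holds the quality bar" a one-line change rather than a refactor.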

Build your tool pipeline deliberately

CodeRabbit doesn’t just “give the agent tools.” It runs a deliberate pipeline where each tool has a specific role. The base layer assembles the context (the diff, static analysis, and file-level information), and the agentic layer sits on top, where “it’s going to be able to read files and search for things and look at, for example, abstract syntax tree information to try and figure out where it’s connected,” Loker says.

Static analysis tools are used not to surface results directly, because those have a high rate of false positives, but to help the LLM understand where there might be issues. “So it can then reason about whether or not that’s actually an issue, given all of the other information that it has,” Loker says. The LLM becomes a reasoning layer on top of deterministic analysis, not a replacement for it.

Web queries fill knowledge gaps at runtime. “You might have a library that you use and the cutoff date of the LLM predates that library, or it predates the version of the library that you’re using,” Loker explains. “And so we might need to pull in documentation around what this function is, because we’ll otherwise come up with an error.”

Research formalizes tool use as four distinct stages, each requiring explicit engineering:

  1. Tool discovery: How does the agent know what tools exist?
  2. Tool selection: Given the task, which tool?
  3. Tool invocation: Calling the tool correctly with proper parameters and error handling.
  4. Result integration: Parsing output and injecting it back into reasoning.

Tool selection is the critical bottleneck because most agent failures occur there rather than during invocation.
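The four stages can be sketched as an explicit pipeline; the registry and the toy grep tool here are hypothetical:

```python
class ToolPipeline:
    """Discovery, selection, invocation, and result integration as
    explicit stages, each with its own error handling."""
    def __init__(self):
        self.tools = {}  # discovery: a registry, not prompt-stuffing

    def register(self, name, fn, handles):
        self.tools[name] = {"fn": fn, "handles": handles}

    def select(self, need):
        matches = [n for n, t in self.tools.items() if need in t["handles"]]
        if not matches:
            raise LookupError(f"no tool handles {need!r}")
        return matches[0]

    def invoke(self, name, **kwargs):
        try:
            return {"ok": True, "result": self.tools[name]["fn"](**kwargs)}
        except Exception as exc:
            return {"ok": False, "error": str(exc)}

    def integrate(self, outcome, ctx):
        # Only well-formed results reach the reasoning context
        if outcome["ok"]:
            ctx.append(outcome["result"])
        return ctx

pipe = ToolPipeline()
pipe.register("grep", lambda pattern: f"3 matches for {pattern}", handles={"search"})
tool = pipe.select("search")
ctx = pipe.integrate(pipe.invoke(tool, pattern="TODO"), [])
```

Making selection its own stage means a bad tool choice fails loudly at `select`, instead of surfacing as a confusing invocation error downstream.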

Memory needs active curation

“Developers chat in the PR back to CodeRabbit,” Loker explains. “And so CodeRabbit will look at that information and be like, oh, they didn’t like this comment. Or they’ll give us some information. Like in our organization, we don’t do things that way, we do it this way. And so we’ll take that information and over time we’re able to adjust.”

The storage is structured and retrieved by context, not appended to a log. Loker gives a concrete example: “You should really be using getters and setters… and you’re like, ‘We don’t care about that here.’ We’ll store that information and then later on, if we’re going to bring up a comment like that, that’s retrieved using RAG. So looking at the context of a future review… we’ll put that into the context window to say, do not raise comments related to getters and setters.”

This is per-organization customization at scale. “All these little things enrich the context, which allows us to do a PR in a very nuanced way and change it across organizations without having to build a new model, which obviously is not scalable for every single organization.”

The MemInsight paper demonstrates that autonomous memory augmentation—enriching stored interactions with semantic metadata, relationships, and context—yields +34% recall improvement over naive RAG baselines. Memory needs active curation, not just storage. MAIN-RAG shows that filtering retrieved context with multiple agents before passing it to the generator is as important as the retrieval itself. Don’t feed everything you retrieve to the LLM; use multi-agent consensus to decide what’s relevant.
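A minimal sketch of metadata-enriched storage with a filtering step before anything reaches the prompt; the fields and example feedback are invented:

```python
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    org: str
    topic: str
    kind: str  # e.g. "preference" or "fact"

class MemoryStore:
    """Store feedback enriched with metadata, retrieve by context,
    and filter before anything is injected into the prompt."""
    def __init__(self):
        self.items: list[Memory] = []

    def add(self, text, org, topic, kind="preference"):
        self.items.append(Memory(text, org, topic, kind))

    def retrieve(self, org, topic):
        # Structured retrieval, not a raw append-only log
        hits = [m for m in self.items if m.org == org and m.topic == topic]
        # Filtering step: only curated preferences reach the context
        return [m.text for m in hits if m.kind == "preference"]

store = MemoryStore()
store.add("Do not flag getter/setter style comments", org="acme", topic="style")
store.add("Uses Java 17", org="acme", topic="platform", kind="fact")
prefs = store.retrieve("acme", "style")
```

In production the retrieval would be semantic (RAG) rather than exact-match, but the shape is the same: metadata in, filter before output.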

Verify your own output

“We deal with this through our post review verification system,” Loker says. “Ultimately, it’s not even necessarily going to be the same model. Like if Sonnet 4.5 does the review, it doesn’t mean that it’s going to be doing the verification. So if it lies, someone’s going to catch it, basically.”

This cross-model verification is deliberate. There’s a benefit because the training of the different models is different. “They’re what they care about, what they focus on; their distributions are different,” he says. “And so you’re going to get a blended experience, which, typically speaking, works out a little bit better.”

The verification goes beyond hallucination detection to checking whether claims are grounded: “This file… say there’s an error, the file doesn’t even exist. So let’s just throw this thing out.” There’s also a false-positive check and an agentic loop that can go back and double check its work by re-examining source files referenced in review comments. Loker calls this “de-noising.”

Cross-model verification, using a different model to check the output of the first, is a specific form of what the research literature calls model feedback. Model feedback is one of four feedback mechanisms alongside human feedback, environmental feedback (execution signals), and tool feedback (static analysis, tests). Environmental feedback is the cheapest and often most reliable. Human feedback is the highest quality but doesn’t scale.

The ACE framework formalizes this as the Reflector role, a component whose entire job is critiquing the Generator’s output. The ACE researchers’ ablation study shows that removing the Reflector significantly degrades performance. Critically, the Reflector must be separate from the Generator; self-reflection has blind spots that cross-component verification catches. The Agentic Design Patterns framework describes cross-component verification as two patterns working together: the Reflector (analyze outcomes for causality) and the Integrator (validate all information before it reaches the reasoning core). If your agent hallucinates, you’re missing an Integrator.
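The grounding check can be sketched as a separate verification pass. In production the verifier would be a different model, as Loker describes, but the shape is the same; the generator and its outputs here are stand-ins:

```python
def generate_review(diff):
    """Stand-in generator: produces comments, one of them ungrounded."""
    return [
        {"file": "app.py", "comment": "possible off-by-one"},
        {"file": "ghost.py", "comment": "unused import"},  # hallucinated file
    ]

def verify(comments, repo_files):
    """Separate verifier: every claim must be grounded in something
    that actually exists before it ships."""
    kept, dropped = [], []
    for c in comments:
        (kept if c["file"] in repo_files else dropped).append(c)
    return kept, dropped

comments = generate_review("...")
kept, dropped = verify(comments, repo_files={"app.py", "lib.py"})
```

Keeping generation and verification in separate components is the point: The generator never gets to grade its own homework.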

Evaluation is a never-ending investment

Evals are increasingly seen as the starting point for developing agents. Andrej Karpathy has said that Software 1.0 (handwritten programs) easily automates what you can specify, while Software 2.0 (AI-written programs) easily automates what you can verify. And both Greg Brockman and Mike Krieger have agreed that “evals are surprisingly often all you need” and that writing them is a core skill.

“It’s not something I can be passive about,” Loker says of model evaluation. “And ultimately your customers also are going to expect that you’re using the latest models, and to some degree you have to be willing to provably, at least to some degree, explain to them why this model might not be the model that you’re going to use.”

The evaluation problem is compounded by the fact that models keep changing underneath you. Loker compares it to forced library upgrades: “It’s almost like your library is being forcibly upgraded continuously, and the maintenance of code that goes along with that… Most software engineers would be like, no, don’t do that to me.” And it happens “every few months they’re being automatically, forcefully updated.”

CodeRabbit’s evaluation framework is multi-layered. First, offline metrics: “Is this model as good or directionally better than an existing model at finding issues from a recall precision perspective. Looking at the number of comments [the model] posted, how many were required before it found the same number of bugs?” Signal-to-noise ratio matters: “If it posted fewer comments but found the same number of issues, then we know that the signal to noise ratio has been improved.”
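The offline layer Loker describes reduces to simple arithmetic. The numbers below are invented, but they show how a candidate that finds the same bugs with fewer comments wins on signal-to-noise:

```python
def offline_metrics(comments_posted, true_bugs, bugs_found):
    """Compare candidate models on recall, precision, and
    comments-per-bug as a signal-to-noise proxy."""
    recall = bugs_found / true_bugs if true_bugs else 0.0
    precision = bugs_found / comments_posted if comments_posted else 0.0
    noise = comments_posted / bugs_found if bugs_found else float("inf")
    return {"recall": recall, "precision": precision, "comments_per_bug": noise}

current = offline_metrics(comments_posted=40, true_bugs=20, bugs_found=12)
candidate = offline_metrics(comments_posted=25, true_bugs=20, bugs_found=12)
# Same bugs found with fewer comments posted: better signal-to-noise
better = candidate["comments_per_bug"] < current["comments_per_bug"]
```

Equal recall with higher precision is exactly the "fewer comments, same issues" improvement the article describes.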

Next, qualitative review: “What do these comments look like, what’s the tone of them, how many of them are patches?” CodeRabbit even checks for hedging language. Loker notes that Sonnet 4.5, for example, will hedge and say something “might be an issue,” in which case his team will consider fixing it.

Then staged rollout: “We’ll branch out again and start rolling it more slowly. How are people perceiving it? And we’ll be watching. We’re watermarking to understand: Does this model achieve higher acceptance rates? Are people essentially abandoning the whole system as a result of this model?”

The GPT-5 launch was a case study in why this matters. “The expectations were pretty high because the recall rates were really good, and the various other metrics were really good. But ultimately, the latency was like, you know, that particular metric was kind of crazy.” CodeRabbit also found that even though GPT-5’s per-million-token pricing was cheaper than Sonnet 4.5’s, it uses a lot more thinking tokens. “So the cost-benefit doesn’t really come into play,” Loker says.

The Outcome-Oriented Evaluation of AI Agents framework proposes measuring agents on 11 dimensions, including Goal Completion Rate, Autonomy Index, Multi-Step Task Resilience, and ROI, not just latency and throughput. The researchers’ finding: No single architecture dominates all dimensions. You must profile your use case and measure what matters for your domain.

If you go multi-agent, topology matters

CodeRabbit’s architecture is essentially a coordinated multi-agent system, with different models handling different review stages and a workflow orchestrating their interaction. “There’s right now two agentic loops,” Loker says. “One is before the big review with the heavier reasoning model, and then another one comes out afterward.”

When building multi-agent systems, the coordination topology measurably affects performance. Graph topology (agents communicate freely) outperforms tree, chain, and star (central coordinator) topologies for complex reasoning tasks. Adding an explicit “Plan how to collaborate” step before agents start working improves milestone achievement by +3% (MultiAgentBench). Default to graph for complex tasks. Star is simpler but weaker.
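The difference between topologies is easy to see as adjacency lists; the agent names here are placeholders:

```python
def build_topology(agents, kind):
    """Adjacency for common coordination topologies: in a graph every
    agent can message every other; in a star all traffic routes
    through a central coordinator (the first agent)."""
    if kind == "graph":
        return {a: [b for b in agents if b != a] for a in agents}
    if kind == "star":
        hub, spokes = agents[0], agents[1:]
        adj = {hub: list(spokes)}
        adj.update({s: [hub] for s in spokes})
        return adj
    raise ValueError(kind)

agents = ["planner", "coder", "reviewer"]
graph = build_topology(agents, "graph")
star = build_topology(agents, "star")
# In the star, coder and reviewer can only talk via the planner
direct = "reviewer" in star["coder"]
```

The star's hub is simpler to reason about, but it becomes both a bottleneck and a single point of failure for complex tasks, which is why graph topologies measure better.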

A checklist for agent builders

Building an agent? Here’s the order of operations, grounded in production experience and peer-reviewed research.

  1. Figure out what you can evaluate. This is, at a high level, the business value assessment, and at a lower level, how fast or how often a workflow solves the problem. It will never be 100%. Invest in a workflow forever with continuous rollouts.
  2. Map the workflow. Identify deterministic steps vs. steps that need judgment. Don’t make the whole thing agentic.
  3. Engineer your context. Assemble the right information for each step, not everything but not too little. Use structured, itemized context, not narrative blobs. Filter aggressively: Irrelevant context actively degrades performance, and better retrievers surface more dangerous distractors.
  4. Curate procedural knowledge carefully. Human-written, focused skills help enormously. But don’t let agents write their own playbooks. Keep them to two or three focused modules, and remember that comprehensive documentation hurts more than it helps.
  5. Choose models per step. Different steps, different models. Smaller/faster where you can, heavier where you must.
  6. Build tools deliberately. Discovery, selection, invocation, integration. Each stage needs its own error handling.
  7. Build memory with curation. Don’t just log but augment, structure, and filter what gets stored and retrieved.
  8. Verify your own output. Separate generation from verification. Use a different model or approach to check the output.
  9. Design feedback loops. Environmental signals, user feedback, cross-model critique. Design them in from day one.
  10. If multi-agent, think topology. Graph beats tree beats chain beats star. Plan collaboration explicitly.

The bottom line

The bottom line may be that we shouldn’t all be imitating Claude Code or OpenClaw by building a linear agent that manages its own context and does “whatever.” Instead, we should be developing curated workflows with very specific tools, and making sure the evaluations are well thought out for the overall workflow and handled up front. In a month or three, the model and everything else will change, so evaluations are eternally useful.


Save money by canceling more software projects, says survey 13 Mar 2026, 4:41 pm

Enterprises should be more ruthless about cancelling projects. That’s according to project management software company Tempo, which surveyed 667 project planning leaders at the end of last year. It found that those who deployed better scenario planning and were ruthless in assessing a project’s viability fared better.

According to the survey, 90% of organizations claimed that their projects were aligned across teams. However, Tempo found that expectations didn’t always meet reality: Only 70% of projects delivered a meaningful return on investment, and over 33% of projects were cancelled or stopped early due to misalignment or lack of ROI.

Companies that deployed scenario planning software had a 17-percentage-point advantage in delivering ROI, according to the survey. Paradoxically, those with more mature planning processes cancelled more projects, Tempo said, not through a failure of planning, but because the more frequently teams review projects, the sooner they can spot a failing one and drop it, leaving the surviving projects more profitable on average.

In the survey report’s conclusion, Tempo states: “The highest-performing teams aren’t clinging to perfect plans or heroic roadmaps. They’re reviewing frequently, creating alignment across teams, reallocating resources without drama, and canceling projects early when the numbers stop adding up.”
