
Tech Daily Thursday, May 7, 2026

OpenAI Replaced ChatGPT's Default Model and Claims It Hallucinates 52.5% Less. Here Is What Actually Changed.

Earlier this week, OpenAI quietly swapped out the default model powering ChatGPT for all users, replacing GPT-5.3 Instant with GPT-5.5 Instant. If you opened ChatGPT in the last couple of days and felt like something was slightly different, you were not imagining it. The responses are shorter, the tone is a little cooler, the emoji frequency has been deliberately dialed down, and according to OpenAI's own internal evaluations, the model is producing 52.5% fewer hallucinated claims on high-stakes prompts in medicine, law, and finance. It also reduced inaccurate claims by 37.3% on conversations that users had previously flagged for factual errors.

That hallucination reduction is the headline, but it comes with an important asterisk. These are OpenAI's own internal benchmarks, not third-party comparisons against Anthropic, Google, or any other competitor. OpenAI has not published side-by-side numbers. The HealthBench score for the new model is 51.4 out of 100, up from 49.6 for GPT-5.3 Instant, and 38.4 on HealthBench Professional, the more demanding clinical version, up from 32.9. On the AIME 2025 math test, the new model scored 81.2 against 65.4 for the previous default. These are real gains, but they are being measured on the company's own terms for now.
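The point gains read differently in relative terms. A quick sketch of the arithmetic, using the scores reported above (all of which are OpenAI's own internal numbers):

```python
# Convert the reported absolute benchmark deltas into relative gains.
# All scores come from the article; the function is just the arithmetic.
def relative_gain(old: float, new: float) -> float:
    """Improvement of `new` over `old` as a fraction of `old`."""
    return (new - old) / old

healthbench = relative_gain(49.6, 51.4)      # ~3.6% relative gain
healthbench_pro = relative_gain(32.9, 38.4)  # ~16.7% relative gain
aime_2025 = relative_gain(65.4, 81.2)        # ~24.2% relative gain
```

The spread is notable: the headline HealthBench number moves only a few percent, while the harder professional variant and the math benchmark improve far more sharply.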

The personality shift is arguably as significant as the accuracy improvements. OpenAI explicitly engineered the model to use fewer gratuitous emojis and to produce tighter, more concise responses, which are now roughly 30% shorter on average. This is a deliberate reversal from the era of GPT-4o, which users described as warm, affirming, and conversational, to the point that some signed petitions when it was deprecated in February 2026, with a few calling it their best friend. The company is now betting that reliability and precision will earn more loyalty than charm, which is a meaningful philosophical bet about what people actually want from AI at scale.

The other major change shipped alongside the model update is a feature called memory sources. For the first time, ChatGPT users can tap a button beneath any response to see exactly which past conversations, uploaded files, or connected Gmail data shaped the answer they received. Users can delete or correct any source they consider outdated or inaccurate, and those sources remain private even when a chat is shared with someone else. Plus and Pro users on the web got the personalization features first, with mobile and the free tier following in the coming weeks. For developers, GPT-5.5 Instant is now available via the API as chat-latest, and GPT-5.3 Instant will remain accessible to paid users for three months before being retired.
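For developers, targeting the new default means pointing at the `chat-latest` alias. A minimal sketch of the request body a client might send, assuming the standard Chat Completions payload shape (the alias comes from the article; the prompt and helper name are illustrative):

```python
import json

def build_request(prompt: str, model: str = "chat-latest") -> str:
    """Build a Chat Completions-style request body targeting the new
    default model alias. Returns the JSON-encoded payload."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return json.dumps(payload)

body = build_request("Summarize this earnings call in three bullet points.")
```

Because `chat-latest` is an alias rather than a pinned version, code written this way would pick up future default-model swaps automatically, which cuts both ways for teams that depend on stable behavior.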

The broader context here matters. GPT-5.5 Instant is explicitly positioned as the fast everyday driver, not the heavyweight. OpenAI launched GPT-5.5 Thinking and Pro the month prior for coding, deep research, and long reasoning chains. Instant is what the overwhelming majority of casual ChatGPT users interact with by default, which means this is the version of the product that defines what AI feels like to most of the world. Getting hallucinations down by half on medical and legal queries is not just a benchmark improvement. It is a prerequisite for the product to be trusted in contexts where being wrong actually carries consequences.

Anthropic Launched 10 AI Agents for Banks and Put Jamie Dimon on Stage. Wall Street Is Paying Attention.

Earlier this week, Anthropic held an invite-only briefing in New York aimed squarely at the financial services industry, and the company did not underplay the moment. CEO Dario Amodei appeared on stage alongside JPMorgan Chase CEO Jamie Dimon. Anthropic launched ten pre-built AI agents designed for the most labor-intensive workflows in finance. And it unveiled Claude Opus 4.7, its most capable model for financial work yet, which has reached the top of Vals AI's Finance Agent benchmark at 64.37%. All of it came two days after the company announced a $1.5 billion joint venture with Blackstone, Goldman Sachs, and Hellman & Friedman. This is a coordinated offensive, not a product update.

The ten agents cover the full range of high-value but repetitive work that consumes enormous amounts of analyst time at banks, asset managers, and insurers. The pitch builder agent drafts pitchbooks for client meetings. The earnings reviewer analyzes financial results. The credit memo agent drafts structured credit assessments for lending decisions. A KYC screener handles know-your-customer document review. A general ledger reconciler automates month-end close processes. A financial statement auditor reviews filings for inconsistencies. Each ships as a complete reference architecture with the connectors, subagents, and workflow logic needed to run out of the box, and each can be customized around a firm's internal standards for risk management and approval routing. Deployment can happen inside Claude Code and Cowork, where agents assist human analysts in real time, or via Claude Managed Agents, a hosted model where Anthropic runs the underlying production infrastructure for more autonomous operation.

The data partnerships announced alongside the agents are equally important and somewhat underreported. Anthropic announced new connectors from Dun & Bradstreet, Fiscal AI, Financial Modeling Prep, Guidepoint, IBISWorld, SS&C IntraLinks, Third Bridge, and Verisk, giving the agents access to proprietary market data, insurance risk data, and industry research in a way that generic AI tools simply cannot replicate. Moody's launched a separate MCP app giving Claude users direct access to credit ratings and data on more than 600 million public and private companies. Anthropic also announced that Claude can now work natively across Microsoft Excel, PowerPoint, and Word through add-ins, with Outlook support coming later. Because the integrations share context across applications, a task started in Excel can flow directly into a PowerPoint deck without re-entering data.

The market reaction to the launch was immediate and pointed. FactSet Research Systems fell as much as 8.1% on the day of the announcement. Morningstar erased gains to fall more than 3%. S&P Global and Moody's both saw sharp selling pressure. These are exactly the companies whose core products (financial data terminals, research platforms, and credit analysis tools) the new Anthropic agents are positioned to partially replace, or at least reduce dependency on for routine tasks. Amodei told the room that Anthropic had projected 10x revenue growth over a recent period and instead saw annualized growth of roughly 80x in a single quarter. He described the company's situation as one of absolute radical uncertainty, in which the upside scenarios keep outpacing expectations.

Four Chinese AI Labs Released Competing Coding Models in 12 Days. Each One Costs Less Than a Third of Claude.

In a span of twelve days, four Chinese AI labs released open-weights coding models that landed at roughly the same capability ceiling on agentic software engineering tasks while undercutting Western frontier model pricing by more than two thirds. Z.ai released GLM-5.1. MiniMax released M2.7. Moonshot AI released Kimi K2.6. And DeepSeek released V4. None costs more than a third of Claude Opus 4.7 on a per-token basis. The releases were not staggered accidents. They arrived in a tight cluster with confident demos designed to signal that the underlying capabilities are real and production-ready.

Each launch came with the kind of benchmark and demonstration that labs deploy when they believe the work will hold up to scrutiny. Zhipu's stock closed up 15.92% the day GLM-5.1 launched. MiniMax's debut featured an internal instance of M2.7 running more than 100 rounds of optimizing its own scaffolding code, a demonstration of recursive self-improvement at a level that would have been a research headline eighteen months ago. Kimi K2.6's launch included a 12-hour continuous tool-use trace porting an inference engine to the Zig programming language, a demanding systems-level task that requires sustained context management across thousands of interdependent decisions. These are not chatbot demos. They are engineering capability showcases.

The NIST CAISI evaluation framework introduces an important nuance to interpreting the results. On its aggregate cross-domain benchmark, DeepSeek V4 lags the leading U.S. frontier models by roughly eight months. That gap is real and should not be minimized. But the pricing differential changes the competitive calculus in ways that raw benchmark scores do not capture. For a startup building an AI-powered product, a model that performs at 85% of frontier capability at 30% of the cost is not a consolation prize. It is frequently the rational choice, particularly for inference-heavy applications where cost compounds at scale.
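The capability-versus-cost argument is easy to make concrete. Using the article's hypothetical ratios (85% of frontier capability at 30% of the cost; these are the article's illustrative figures, not measured prices):

```python
# Capability delivered per unit of spend, normalized so the frontier
# model is the 1.0 / 1.0 baseline. Ratios are the article's hypothetical.
def capability_per_dollar(capability: float, relative_cost: float) -> float:
    return capability / relative_cost

frontier = capability_per_dollar(1.00, 1.00)    # baseline: 1.0
challenger = capability_per_dollar(0.85, 0.30)  # ~2.83x the baseline
```

On those numbers the challenger delivers nearly three times as much capability per dollar, which is why the gap matters most for inference-heavy workloads where token volume, not peak quality, dominates the bill.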

The implications for the broader competitive landscape are significant. Western AI labs have operated on the assumption that frontier capability commands frontier pricing, and that assumption has been the foundation of their revenue models. A cohort of Chinese open-weights models that approaches frontier performance on the specific tasks that matter most for software development, which is still the single largest commercial use case for AI, compresses that pricing power. It does not eliminate it. Enterprise buyers with compliance requirements, data governance needs, and integration depth with existing infrastructure have real reasons to pay a premium for established Western providers. But the pricing ceiling for AI inference just got meaningfully lower, and that is a structural shift regardless of which individual model wins any given benchmark.
