News & Insights
Coverage of AI model releases, pricing shifts, and industry developments.
Comparing Token Costs: What Does AI Actually Cost to Use?
A practical breakdown of what tokens mean in real terms — from writing a single email to processing an entire codebase.
Gemini 3.1 Pro: Google Claims #1 on 12 of 18 Benchmarks
Google's Gemini 3.1 Pro achieves 94.3% on GPQA Diamond and 77.1% on ARC-AGI-2, more than doubling its predecessor's reasoning score.
SWE-bench Leaderboard: February 2026 Rankings
The latest SWE-bench Verified scores show Kimi K2.5 and Qwen3.5 tied near the top. Here is the full leaderboard breakdown.
DeepSeek V4 and the February 17th Mega-Launch
Five major model launches in a single day: DeepSeek V4, Claude Sonnet 4.6, Qwen3.5, Grok 4.20, and Cohere Tiny Aya all ship on February 17th.
Claude Opus 4.6 and Sonnet 4.6: Anthropic's February Blitz
Anthropic releases its strongest model pair yet — Opus 4.6 hits 80.8% on SWE-bench and Sonnet 4.6 matches it at 1/5 the cost.
Qwen3.5 397B Arrives: Alibaba's MoE Model Challenges the Frontier
Alibaba's Qwen team releases a 397B-parameter Mixture-of-Experts model with 256K context and open weights, scoring 88.4% on GPQA Diamond.
API Pricing in 2026: A Race to the Bottom or a New Equilibrium?
Input token prices have dropped 80% in 18 months. We analyze what this means for developers and the models competing on cost.
GPT-5.3-Codex: OpenAI Unifies Its Training Stacks
OpenAI's GPT-5.3-Codex is the first model combining Codex and GPT-5 training, scoring 77.3% on Terminal-Bench 2.0 and 81.4% on SWE-Lancer.
Open Source AI in 2026: The Gap Has Closed
Open-weight models now match proprietary alternatives on most benchmarks. We examine what changed and what it means for the industry.
Kimi K2.5: Moonshot AI Enters the Multimodal Frontier
Moonshot AI's Kimi K2.5 combines a 1T-parameter MoE architecture with native vision, scoring 76.8% on SWE-bench and 96.1% on AIME 2025.
Claude Opus 4 Sets New Benchmark Records
Anthropic's latest flagship model achieves a state-of-the-art 72% on SWE-bench and introduces extended thinking capabilities.
Llama 4: Meta's Multimodal MoE Models Launch with Scout and Maverick
Meta releases two Llama 4 variants: Scout with a 10M-token context window and Maverick with 400B parameters, both built on a MoE architecture.
Gemini 2.5 Pro: Google's Thinking Model
Google releases Gemini 2.5 Pro with built-in reasoning capabilities and a 1M-token context window.
DeepSeek R1: Open-Source Reasoning at Scale
DeepSeek releases R1, an open-source reasoning model that matches o1-level performance at less than $1 per million tokens.