<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Refactor — Notes from the AI engineering frontier</title>
    <link>https://refactor4ai.com/blog</link>
    <atom:link href="https://refactor4ai.com/blog/rss.xml" rel="self" type="application/rss+xml"/>
    <description>Practical writing on AI fluency for senior engineers, platform leads, and product managers.</description>
    <language>en-gb</language>
    <lastBuildDate>Thu, 14 May 2026 19:01:28 GMT</lastBuildDate>
    <item>
      <title>Token budgeting: a senior engineer&apos;s mental model</title>
      <link>https://refactor4ai.com/blog/token-budgeting-mental-model</link>
      <guid isPermaLink="true">https://refactor4ai.com/blog/token-budgeting-mental-model</guid>
      <pubDate>Thu, 14 May 2026 00:00:00 GMT</pubDate>
      <description>A working mental model for budgeting LLM tokens — how to estimate per-request cost in your head, where the 10× overruns hide, and the four-number per-feature budget that prevents the end-of-month bill surprise. Concrete numbers for Claude 4.7, GPT-5.5, and Gemini 3.1 as of May 2026.</description>
      <author>Refactor4AI team</author>
    </item>
    <item>
      <title>Postmortem #1: how an embedding model swap silently broke our retrieval for nine days</title>
      <link>https://refactor4ai.com/blog/postmortem-001-embedding-swap-broke-rag</link>
      <guid isPermaLink="true">https://refactor4ai.com/blog/postmortem-001-embedding-swap-broke-rag</guid>
      <pubDate>Thu, 14 May 2026 00:00:00 GMT</pubDate>
      <description>Production Postmortem — EdTech, ~150 engineers. A routine embedding model upgrade landed in production with the index never re-embedded. Hybrid search rankings collapsed, the assistant began confidently citing the wrong source documents, retry-storms ran up a ~$48K LLM bill, and nobody noticed for nine days because no offline eval was wired to the retrieval layer. Timeline, root cause, fixes, and five lessons for anyone running RAG in production.</description>
      <author>Refactor4AI team</author>
    </item>
    <item>
      <title>The Living Spec: markdown PRDs that ship</title>
      <link>https://refactor4ai.com/blog/living-spec-markdown-prds</link>
      <guid isPermaLink="true">https://refactor4ai.com/blog/living-spec-markdown-prds</guid>
      <pubDate>Thu, 14 May 2026 00:00:00 GMT</pubDate>
      <description>Ditch the 20-page Google Doc PRD. The Living Spec pattern — a single markdown file checked into the repo next to the code, edited continuously by both PM and AI agents — is how AI-native teams keep documentation aligned with what&apos;s actually being built. A senior-PM playbook for the format, the rituals, and the trade-offs.</description>
      <author>Refactor4AI team</author>
    </item>
    <item>
      <title>The &apos;lethal trifecta&apos; of agent security</title>
      <link>https://refactor4ai.com/blog/lethal-trifecta-agent-security</link>
      <guid isPermaLink="true">https://refactor4ai.com/blog/lethal-trifecta-agent-security</guid>
      <pubDate>Thu, 14 May 2026 00:00:00 GMT</pubDate>
      <description>Simon Willison&apos;s lethal trifecta — private data, untrusted content, and an exfiltration vector — is the single most important security model in 2026. A senior engineer&apos;s guide to why prompt-injection defences fail, why structural mitigations are the only ones that work, and which leg of the trifecta is cheapest to cut in your architecture.</description>
      <author>Refactor4AI team</author>
    </item>
    <item>
      <title>FinOps for AI: cutting bills 40% in a quarter</title>
      <link>https://refactor4ai.com/blog/finops-for-ai-cutting-bills</link>
      <guid isPermaLink="true">https://refactor4ai.com/blog/finops-for-ai-cutting-bills</guid>
      <pubDate>Thu, 14 May 2026 00:00:00 GMT</pubDate>
      <description>The eight-lever playbook we use to take AI bills down 40% in a quarter without degrading quality — prompt caching, batch APIs, routing, smaller models, capacity commitments, output capping, retrieval pruning, and the dashboard wiring that makes the savings stick. Concrete numbers on Claude, GPT-5, and Gemini for mid-2026.</description>
      <author>Refactor4AI team</author>
    </item>
    <item>
      <title>Bedrock vs Azure OpenAI vs Vertex AI: a 2026 decision tree</title>
      <link>https://refactor4ai.com/blog/bedrock-vs-azure-openai-vs-vertex-ai-2026</link>
      <guid isPermaLink="true">https://refactor4ai.com/blog/bedrock-vs-azure-openai-vs-vertex-ai-2026</guid>
      <pubDate>Thu, 14 May 2026 00:00:00 GMT</pubDate>
      <description>An opinionated, decision-tree comparison of AWS Bedrock, Microsoft Azure AI Foundry, and Google Vertex AI as of May 2026 — which models live where, real list prices, throughput economics, the data-egress gotchas nobody puts in the slide deck, and the four questions that should actually pick your cloud.</description>
      <author>Refactor4AI team</author>
    </item>
    <item>
      <title>Why your agents fail in production (and the 5 fixes that actually work)</title>
      <link>https://refactor4ai.com/blog/why-agents-fail-in-production</link>
      <guid isPermaLink="true">https://refactor4ai.com/blog/why-agents-fail-in-production</guid>
      <pubDate>Wed, 13 May 2026 00:00:00 GMT</pubDate>
      <description>Most enterprise AI agents in 2026 never reach production — Deloitte&apos;s 2026 Tech Trends report puts production deployment at 11% even with 38% of organisations actively piloting. Five concrete failure modes show up in every post-mortem we run: tool error handling, context drift, dumb RAG, brittle connectors, and no evals. Here&apos;s what each looks like and how to fix it.</description>
      <author>Refactor4AI team</author>
    </item>
    <item>
      <title>Staffing AI features without a dedicated AI team</title>
      <link>https://refactor4ai.com/blog/staffing-ai-no-dedicated-team</link>
      <guid isPermaLink="true">https://refactor4ai.com/blog/staffing-ai-no-dedicated-team</guid>
      <pubDate>Wed, 13 May 2026 00:00:00 GMT</pubDate>
      <description>You have been asked to ship AI features and you do not have ML engineers. A pragmatic 2026 staffing playbook for Staff+ engineers and EMs — what roles you actually need (and which ones you do not), how to upskill existing engineers in 90 days, and when to reach for fractional help instead of hiring.</description>
      <author>Refactor4AI team</author>
    </item>
    <item>
      <title>Reranking: the single highest-ROI improvement you can make to a RAG pipeline</title>
      <link>https://refactor4ai.com/blog/reranking-highest-roi-rag</link>
      <guid isPermaLink="true">https://refactor4ai.com/blog/reranking-highest-roi-rag</guid>
      <pubDate>Wed, 13 May 2026 00:00:00 GMT</pubDate>
      <description>If your RAG system retrieves 50 chunks and stuffs the top-5 into the prompt, you&apos;re shipping mediocre answers. A cross-encoder reranker on top of hybrid retrieval lifts retrieval accuracy 15–40% for 80–150ms of added latency. Here&apos;s the case for it, the production options in 2026, and the rerank-or-die patterns.</description>
      <author>Refactor4AI team</author>
    </item>
    <item>
      <title>RAG vs fine-tuning in 2026: a decision framework that actually ships</title>
      <link>https://refactor4ai.com/blog/rag-vs-fine-tuning-2026</link>
      <guid isPermaLink="true">https://refactor4ai.com/blog/rag-vs-fine-tuning-2026</guid>
      <pubDate>Wed, 13 May 2026 00:00:00 GMT</pubDate>
      <description>The RAG-vs-fine-tuning debate is mostly noise in 2026. The real question is where to place knowledge, where to encode behaviour, and how to evaluate both. Here&apos;s the decision framework, the order of operations, and the LoRA-on-top-of-retrieval pattern most teams should default to.</description>
      <author>Refactor4AI team</author>
    </item>
    <item>
      <title>Writing PRDs for AI products: from deterministic to probabilistic</title>
      <link>https://refactor4ai.com/blog/prds-for-ai-products-probabilistic</link>
      <guid isPermaLink="true">https://refactor4ai.com/blog/prds-for-ai-products-probabilistic</guid>
      <pubDate>Wed, 13 May 2026 00:00:00 GMT</pubDate>
      <description>Classic PRD templates assume a system that returns the same answer twice. AI products do not. A working PM template for 2026 — acceptance criteria as distributions, eval datasets as spec, failure modes as first-class requirements, and the living document model that survives a model upgrade.</description>
      <author>Refactor4AI team</author>
    </item>
    <item>
      <title>LLM observability: the 2026 stack you actually need</title>
      <link>https://refactor4ai.com/blog/llm-observability-stack-2026</link>
      <guid isPermaLink="true">https://refactor4ai.com/blog/llm-observability-stack-2026</guid>
      <pubDate>Wed, 13 May 2026 00:00:00 GMT</pubDate>
      <description>Stop flying blind on production AI. A practical 2026 reference stack for tracing, evals, cost analytics, and prompt management — built around Langfuse, Helicone, and the OpenTelemetry GenAI conventions — with the integration tradeoffs we actually run into in production.</description>
      <author>Refactor4AI team</author>
    </item>
    <item>
      <title>Hybrid search 101: BM25 + vector + RRF, and why pure semantic search is leaving recall on the floor</title>
      <link>https://refactor4ai.com/blog/hybrid-search-bm25-vector-rrf</link>
      <guid isPermaLink="true">https://refactor4ai.com/blog/hybrid-search-bm25-vector-rrf</guid>
      <pubDate>Wed, 13 May 2026 00:00:00 GMT</pubDate>
      <description>Vector-only retrieval is the wrong default in 2026. Pure semantic search tops out around 65–78% recall@10 on real-world RAG corpora. Hybrid retrieval — BM25 plus vector plus Reciprocal Rank Fusion — pushes that to 91%, and every major vector DB ships it natively. Here&apos;s exactly how it works and how to wire it up.</description>
      <author>Refactor4AI team</author>
    </item>
    <item>
      <title>Computer-use agents 101: Claude, Operator, ACE</title>
      <link>https://refactor4ai.com/blog/computer-use-agents-101</link>
      <guid isPermaLink="true">https://refactor4ai.com/blog/computer-use-agents-101</guid>
      <pubDate>Wed, 13 May 2026 00:00:00 GMT</pubDate>
      <description>A senior-engineer primer on computer-use agents in May 2026 — Claude Computer Use vs OpenAI Operator (CUA) vs General Agents&apos; ACE. How each one drives a GUI, where they fail, the realistic accuracy you should plan for, and which is the right pick for which job.</description>
      <author>Refactor4AI team</author>
    </item>
    <item>
      <title>Building an AI gateway: rate limits, retries, fallback</title>
      <link>https://refactor4ai.com/blog/ai-gateway-rate-limits-fallback</link>
      <guid isPermaLink="true">https://refactor4ai.com/blog/ai-gateway-rate-limits-fallback</guid>
      <pubDate>Wed, 13 May 2026 00:00:00 GMT</pubDate>
      <description>Direct vendor calls are a single point of failure. A senior-engineer blueprint for the AI gateway pattern in 2026 — virtual keys, per-tenant rate limits, graceful retry with backoff, cross-vendor fallback, prompt caching, and the buy-vs-build choice between LiteLLM, Portkey, Kong AI, and rolling your own.</description>
      <author>Refactor4AI team</author>
    </item>
    <item>
      <title>Agent loops in 2026: ReAct vs Plan-and-Execute vs Reflection</title>
      <link>https://refactor4ai.com/blog/agent-loops-react-vs-plan-execute</link>
      <guid isPermaLink="true">https://refactor4ai.com/blog/agent-loops-react-vs-plan-execute</guid>
      <pubDate>Wed, 13 May 2026 00:00:00 GMT</pubDate>
      <description>Most production agents in 2026 aren&apos;t pure ReAct or pure Plan-and-Execute — they&apos;re a hybrid that picks each pattern for the part of the task it&apos;s good at. Here&apos;s what the three loops actually do, the cost-latency-quality tradeoff, and the architecture pattern that wins for long-running workflows.</description>
      <author>Refactor4AI team</author>
    </item>
    <item>
      <title>What is MCP? A practical guide to the Model Context Protocol in 2026</title>
      <link>https://refactor4ai.com/blog/what-is-mcp-model-context-protocol</link>
      <guid isPermaLink="true">https://refactor4ai.com/blog/what-is-mcp-model-context-protocol</guid>
      <pubDate>Tue, 12 May 2026 00:00:00 GMT</pubDate>
      <description>MCP is the standard way LLMs talk to tools, databases and APIs in 2026. This is the plain-English explainer — what MCP is, why it won, and how to build your first MCP server in TypeScript or Python.</description>
      <author>Refactor4AI team</author>
    </item>
    <item>
      <title>Prompt caching: how to cut your LLM bill by 90% in 2026</title>
      <link>https://refactor4ai.com/blog/prompt-caching-90-percent-savings</link>
      <guid isPermaLink="true">https://refactor4ai.com/blog/prompt-caching-90-percent-savings</guid>
      <pubDate>Tue, 12 May 2026 00:00:00 GMT</pubDate>
      <description>Prompt caching is the single biggest lever for reducing LLM costs in production. Here&apos;s how it works on Anthropic, OpenAI and Google, exactly what to cache, and the patterns that turn a $30k/day feature into a $3k/day one.</description>
      <author>Refactor4AI team</author>
    </item>
    <item>
      <title>The 2026 AI System Design Interview: a complete preparation guide</title>
      <link>https://refactor4ai.com/blog/ai-system-design-interview-2026</link>
      <guid isPermaLink="true">https://refactor4ai.com/blog/ai-system-design-interview-2026</guid>
      <pubDate>Tue, 12 May 2026 00:00:00 GMT</pubDate>
      <description>AI system design is the new system design interview. Here&apos;s the format, the categories of questions FAANG and AI labs are actually asking, the framework to structure an answer, and the trade-offs you&apos;ll be expected to articulate.</description>
      <author>Refactor4AI team</author>
    </item>
    <item>
      <title>AI fluency for product managers: what to actually learn in 2026</title>
      <link>https://refactor4ai.com/blog/ai-fluency-for-product-managers</link>
      <guid isPermaLink="true">https://refactor4ai.com/blog/ai-fluency-for-product-managers</guid>
      <pubDate>Tue, 12 May 2026 00:00:00 GMT</pubDate>
      <description>AI fluency for PMs isn&apos;t &apos;know what an LLM is.&apos; It&apos;s writing specs engineers can build, owning the eval plan, classifying risk under the EU AI Act, and pricing AI features. Here&apos;s the practical curriculum.</description>
      <author>Refactor4AI team</author>
    </item>
    <item>
      <title>The 2026 AI Capability Map: Claude Opus 4.7 vs GPT-5.5 vs Gemini 3.1 Pro</title>
      <link>https://refactor4ai.com/blog/ai-capability-map-2026</link>
      <guid isPermaLink="true">https://refactor4ai.com/blog/ai-capability-map-2026</guid>
      <pubDate>Tue, 12 May 2026 00:00:00 GMT</pubDate>
      <description>A practical, role-agnostic capability map for the May 2026 flagship models — costs per million tokens, effective context windows, reasoning premiums, multimodal coverage, cloud availability, and which to actually pick for which job.</description>
      <author>Refactor4AI team</author>
    </item>
  </channel>
</rss>
