Topic

#Inference

8 articles on Inference — news, releases, guides and analysis from the SourceFeed engine.

OpenAI Jalapeño and the Shift to Custom Inference Silicon

Custom ASICs are replacing general-purpose GPUs for running large language models as providers try to survive the crushing cost of scale.

Priya Nair

The LLM Cost Cliff Your Budget Isn't Ready For

Per-token prices are collapsing, yet AI bills keep exploding. The two facts aren't a contradiction, and confusing them will wreck your business case.

Article · 3d ago

OpenAI's Jalapeño Chip Is a Bet on Inference Economics

A custom Broadcom-built ASIC for LLM inference puts OpenAI on the same vertical-integration path Google and Amazon paved years ago.

News · 5d ago

How OpenAI's Jalapeño Chip Changes Production LLM Serving

The custom silicon shift signals a move away from general-purpose GPUs toward highly specialized, memory-optimized inference architectures.

Article · 5d ago

Serve an Open-Source LLM at Scale with vLLM on a Rented GPU Instance

Go from a bare cloud VM to a production-ready, OpenAI-compatible inference server in under an hour, using vLLM's continuous batching to hit thousands of output tokens per second on a single GPU.

Tutorial · 6d ago

Running 70B Models on 4GB VRAM: The AirLLM Layer-Swap Hack

AirLLM trades disk I/O for VRAM, letting developers run massive models locally without renting enterprise GPU clusters.

Article · 1w ago

Unified x86 AI Acceleration: Inside the New ACE Specification

The x86 Ecosystem Advisory Group's new spec brings standardized matrix multiplication and tile registers to modern CPU architectures.

Article · 1w ago

Xiaomi's MiMo-V2.5-Pro-UltraSpeed Pushes a 1T Model Past 1000 Tokens/Sec on Commodity GPUs

Through FP4 quantization, block-level speculative decoding, and the TileRT system stack, Xiaomi claims trillion-parameter decode speeds normally reserved for custom silicon — on a single 8-GPU node.

News · 3w ago
