The technique aims to ease GPU memory constraints that limit how enterprises scale AI inference and long-context applications ...
KubeCon Europe 2026 made AI inference its central focus with major CNCF donations including llm-d, Nvidia's GPU DRA driver ...
Google has introduced TurboQuant, a compression algorithm that reduces large language model (LLM) memory usage by at least 6x ...
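The snippet doesn't spell out TurboQuant's mechanics, but the "at least 6x" figure is easy to sanity-check with bit-width arithmetic. Below is a minimal per-group affine quantization sketch in Python; the function name, 2-bit codes, and group size of 64 are illustrative assumptions, not Google's implementation.

```python
import numpy as np

def quantize_dequantize(x: np.ndarray, bits: int = 2, group: int = 64):
    """Per-group affine quantization of a flat tensor: each group of
    `group` values stores `bits`-bit integer codes plus one fp16 scale
    and one fp16 offset. Illustrative sketch, not TurboQuant itself."""
    x = x.reshape(-1, group).astype(np.float32)
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    qmax = 2**bits - 1
    scale = np.maximum(hi - lo, 1e-8) / qmax
    codes = np.clip(np.round((x - lo) / scale), 0, qmax)  # integer codes
    xhat = codes * scale + lo                              # dequantized values
    # storage: `bits` per value, plus two fp16 numbers per group of values
    bits_per_value = bits + 2 * 16 / group
    return xhat.reshape(-1), 16 / bits_per_value           # ratio vs fp16

x = np.random.randn(1 << 20).astype(np.float16)
xhat, ratio = quantize_dequantize(x)
print(f"compression vs fp16: {ratio:.1f}x")  # 16 / (2 + 0.5) = 6.4x
```

With 2-bit codes and groups of 64, the per-group scale and offset add 0.5 bits of overhead per value, so the ratio works out to 16 / 2.5 = 6.4x, consistent with a "at least 6x" claim.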
Red Hat is pushing Kubernetes inference into the mainstream by contributing llm-d to the CNCF, as enterprises race to run AI ...
“Transformer-based Large Language Models (LLMs) have been widely used in many fields, and the efficiency of LLM inference has become a hot topic in real applications. However, LLMs are usually ...
Your self-hosted LLMs care more about your memory performance ...
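The point behind that headline is that single-stream decoding is typically memory-bandwidth-bound: each generated token streams the full weight set through memory, so bandwidth, not compute, caps throughput. A back-of-envelope sketch of that rule of thumb (all numbers below are illustrative assumptions):

```python
def max_tokens_per_sec(params_billion: float, bytes_per_param: float,
                       bandwidth_gbps: float) -> float:
    """Upper bound on decode throughput when every token must read
    all model weights from memory once (memory-bound regime)."""
    bytes_per_token = params_billion * 1e9 * bytes_per_param
    return bandwidth_gbps * 1e9 / bytes_per_token

# e.g. a 7B-parameter model in fp16 (2 bytes/param) on hardware
# with ~100 GB/s of memory bandwidth
print(f"{max_tokens_per_sec(7, 2, 100):.1f} tok/s upper bound")  # ~7.1
```

This is why quantizing weights (fewer bytes per parameter) or buying faster memory raises the ceiling directly, while more raw compute alone does not.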
Jim Fan is one of Nvidia’s senior AI researchers. The shift could require many orders of magnitude more compute and energy for inference to handle the improved reasoning in the OpenAI ...
Forged in collaboration with founding contributors CoreWeave, Google Cloud, IBM Research, and NVIDIA, and joined by industry leaders AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI, and university ...