US WEKA Validates Long-Context Inference on Oracle Cloud, Achieving 10x Throughput Improvement

2026-06-10 11:29

en.Wedoany.com Reported - US AI data and memory infrastructure company WEKA announced on June 9, 2026, that its NeuralMesh platform, combined with Augmented Memory Grid, has completed production-scale benchmark testing on Oracle Cloud Infrastructure (OCI). According to the results, without adding GPUs or cluster nodes, the solution supports approximately 10x more concurrent users in long-context inference, delivers approximately 10x the token throughput, and generates approximately 7x more tokens per GPU. The test ran on a 9-node OCI bare-metal H100 cluster and validated a 100,000-token context window.

This test focused on enterprise-grade long-context inference. WEKA disclosed that with NeuralMesh combined with Augmented Memory Grid, concurrent users increased from approximately 600 in a DRAM-only configuration to over 5,000. In terms of token throughput, the solution achieved approximately 2 million tokens per second, while the DRAM-only baseline was below 200,000 tokens per second. In a one-hour test with 2,400 users, Augmented Memory Grid served approximately 5 billion tokens, compared to approximately 700 million tokens for the DRAM-only baseline.
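The multipliers implied by these figures can be checked with a few lines of arithmetic. The sketch below uses only the rounded numbers quoted above; "over 5,000" and "below 200,000" are bounds rather than exact values, so the first two ratios are lower bounds. The variable names are illustrative, not taken from WEKA or Oracle.

```python
# Rounded figures as quoted in the article.
baseline_users = 600
amg_users = 5_000                   # "over 5,000": a lower bound
baseline_tps = 200_000              # "below 200,000": an upper bound
amg_tps = 2_000_000
baseline_tokens_hour = 700_000_000
amg_tokens_hour = 5_000_000_000


def gain(with_amg, baseline):
    """Return how many times larger the Augmented Memory Grid figure is."""
    return with_amg / baseline


concurrency_gain = gain(amg_users, baseline_users)
throughput_gain = gain(amg_tps, baseline_tps)
hourly_gain = gain(amg_tokens_hour, baseline_tokens_hour)

print(f"concurrent users: at least {concurrency_gain:.1f}x")
print(f"token throughput: at least {throughput_gain:.1f}x")
print(f"tokens served in one hour: about {hourly_gain:.1f}x")
```

Because both configurations ran on the same GPUs, the ratio of total tokens served is also the ratio of tokens served per GPU, so the one-hour figures are consistent with the roughly 7x per-GPU improvement cited in the announcement.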

The test environment used 9 OCI bare-metal H100 nodes, each with 8 H100 GPUs, for a total of 72 GPUs. According to the Oracle technical blog, each node was also equipped with 16 Gen4 NVMe drives and two 200 Gb/s RDMA network cards. Augmented Memory Grid expanded the available cache capacity to 287 TiB of NVMe, while the baseline environment had approximately 8.64 TiB of DRAM available. Each simulated user submitted a 100,000-token input and received a 100-token response, a pattern chosen to reproduce the cache pressure of long documents, multi-turn conversations, and agent tasks.
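For a sense of scale, the cluster figures can be combined as follows. The first half of the sketch uses only numbers reported above. The second half estimates the KV cache footprint of a single 100,000-token context with the standard formula (two tensors, keys and values, per layer); the model shape there is a hypothetical assumption, since the article does not name the model used in the benchmark.

```python
# Cluster figures as reported in the article.
nodes = 9
gpus_per_node = 8
nvme_per_node = 16
dram_cache_tib = 8.64
nvme_cache_tib = 287.0
input_tokens = 100_000
output_tokens = 100

total_gpus = nodes * gpus_per_node
total_nvme_drives = nodes * nvme_per_node
cache_expansion = nvme_cache_tib / dram_cache_tib
# Share of each request's tokens that belong to the prompt (prefill).
prefill_share = input_tokens / (input_tokens + output_tokens)


def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    """KV cache bytes per token: keys and values are both stored, hence 2x."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# HYPOTHETICAL model shape (80 layers, 8 KV heads, head size 128, 16-bit
# values). This is an assumption for illustration, not the benchmarked model.
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128,
                               bytes_per_value=2)
per_context_gib = per_token * input_tokens / 2**30

print(f"GPUs: {total_gpus}, NVMe drives: {total_nvme_drives}")
print(f"cache capacity expansion: about {cache_expansion:.1f}x")
print(f"prompt share of request tokens: {prefill_share:.3f}")
print(f"KV cache per 100,000-token context (assumed model): "
      f"{per_context_gib:.1f} GiB")
```

Under the assumed model shape, one context occupies tens of GiB, which illustrates why a few TiB of DRAM fills quickly at this context length. The absolute number will differ for other models and precisions.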

GPU count alone does not determine the outcome of such tests. Long-context inference continuously generates KV cache data as it runs. Once the context window reaches the 100,000-token level, cache capacity and cache hit rate directly affect throughput, latency, and GPU utilization. In a DRAM-only configuration, a saturated cache forces the system to evict entries and then recompute the prefill for contexts it has already processed. For applications such as search, summarization, code assistance, and multi-turn agents, this raises serving costs and makes response times less predictable.
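The saturation effect can be illustrated with a toy simulation. This is a minimal sketch, assuming a least-recently-used eviction policy, one cache slot per user context, and users who take turns in a fixed order; none of these details come from the benchmark, and real serving stacks cache at a finer granularity.

```python
from collections import OrderedDict


def simulate(num_users, capacity, rounds, context_tokens=100_000):
    """Replay round-robin requests against an LRU cache of user contexts.

    Returns (hit_rate, prefill_tokens), where prefill_tokens counts the
    prompt tokens that had to be recomputed because of cache misses.
    """
    cache = OrderedDict()
    hits = 0
    prefill_tokens = 0
    for _ in range(rounds):
        for user in range(num_users):
            if user in cache:
                cache.move_to_end(user)        # mark as most recently used
                hits += 1
            else:
                prefill_tokens += context_tokens  # full prefill on a miss
                cache[user] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
    return hits / (rounds * num_users), prefill_tokens


fits = simulate(num_users=4, capacity=8, rounds=3)
saturated = simulate(num_users=10, capacity=8, rounds=3)
print("working set fits:     hit rate %.2f, prefill tokens %d" % fits)
print("working set too large: hit rate %.2f, prefill tokens %d" % saturated)
```

In this toy setup, a working set only slightly larger than the cache drops the hit rate to zero, because each context is evicted just before its owner returns. Every request then pays for a full prefill again.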

Augmented Memory Grid decouples the KV cache from local GPU memory and DRAM and places it in a cluster-level, high-performance token repository. According to WEKA's OCI product page, the solution is built on NeuralMesh and NeuralMesh Axon and uses RDMA and GPUDirect Storage to move key-value cache data continuously between GPU memory and flash storage. This expands the cache layer on OCI bare-metal GPU infrastructure without adding physical DRAM.
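The general idea of adding a larger, slower cache tier behind a small fast one can be sketched as below. This is a conceptual toy model, not WEKA's implementation: the `TieredCache` class, its LRU policy, and the placeholder values are all assumptions made for illustration, and no RDMA or GPUDirect Storage transfer is modeled.

```python
from collections import OrderedDict


class TieredCache:
    """Toy two-tier cache: entries evicted from the fast tier are demoted
    to a slow tier instead of being dropped (hypothetical design)."""

    def __init__(self, fast_capacity, slow_capacity):
        self.fast = OrderedDict()
        self.slow = OrderedDict()
        self.fast_capacity = fast_capacity
        self.slow_capacity = slow_capacity

    def _admit_fast(self, key, value):
        self.fast[key] = value
        self.fast.move_to_end(key)
        if len(self.fast) > self.fast_capacity:
            # Demote the least recently used fast entry to the slow tier.
            old_key, old_value = self.fast.popitem(last=False)
            self.slow[old_key] = old_value
            self.slow.move_to_end(old_key)
            if len(self.slow) > self.slow_capacity:
                self.slow.popitem(last=False)  # dropped for good

    def lookup(self, key):
        """Return 'fast', 'slow', or 'miss' and update both tiers."""
        if key in self.fast:
            self.fast.move_to_end(key)
            return "fast"
        if key in self.slow:
            # Promote on a slow-tier hit: no recomputation needed.
            self._admit_fast(key, self.slow.pop(key))
            return "slow"
        # Miss: the context would have to be recomputed (prefill).
        self._admit_fast(key, f"kv-for-{key}")
        return "miss"


requests = [0, 1, 2, 3, 0, 1, 2, 3]
with_slow_tier = TieredCache(fast_capacity=2, slow_capacity=8)
fast_only = TieredCache(fast_capacity=2, slow_capacity=0)
tiered_outcomes = [with_slow_tier.lookup(u) for u in requests]
fast_only_outcomes = [fast_only.lookup(u) for u in requests]
print("with slow tier:", tiered_outcomes)
print("fast tier only:", fast_only_outcomes)
```

With the slow tier enabled, the second pass over the four users is served from the slow tier rather than recomputed. With a slow-tier capacity of zero, every request in the second pass is a miss again.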

The Oracle technical blog states that this round of testing moved beyond early time-to-first-token (TTFT) validation to production-relevant workloads, covering concurrency density, sustained throughput, cache persistence, and service stability under high load. The blog also shows that the test compared a standard vLLM serving baseline using HBM and DRAM against a cache expansion configuration that adds Augmented Memory Grid. Results indicate that once the DRAM cache reaches its limit, baseline response times fluctuate, while the cache expansion configuration maintains more stable service levels at higher concurrency.

WEKA stated that NeuralMesh with Augmented Memory Grid is now available to customers and has launched on the Oracle Cloud Marketplace, with OCI as its first cloud partner. For customers deploying enterprise AI applications, the result points to a practical consideration: as demand for long-context inference grows rapidly, adding compute is not the only option. Cache expansion, data paths, and cluster scheduling also affect per-token cost and online serving capacity.

This article is compiled by Wedoany. All AI citations must indicate the source as "Wedoany". If there is any infringement or other issues, please notify us promptly, and we will modify or delete it accordingly. Email: news@wedoany.com
