Gemma 4 Is Now Available on Neysa Velocis: A Developer’s Guide
Google DeepMind dropped Gemma 4 on April 2, 2026 – the day after April Fool’s Day, which meant the X (formerly Twitter) discourse spent a full 24 hours unsure whether to believe the benchmarks. They were real.
Gemma 4 is the most capable open model family Google has shipped. The Apache 2.0 license means you can run it in your own infrastructure, fine-tune it on proprietary data, and build commercial products on top of it – a meaningful distinction from models with research-only or restricted-use licenses. This one is production-ready from day one.
It handles text, images, and audio. It reasons through hard problems, writes and debugs code, calls external tools natively, and understands documents in over 140 languages. The 31B model scored 1452 on LMArena, putting it in territory that was closed-model-only a year ago.
Gemma 4 is available now on Neysa Velocis, on H100, H200, L40S, and L4 GPUs, with transparent on-demand and committed pricing tiers.
Gemma 4 comes in four sizes, each available as a base model and an instruction-tuned variant:
| Model | Parameters | Context Window | Modalities |
|---|---|---|---|
| Gemma 4 E2B | 2.3B effective (5.1B with embeddings) | 128K | Text, Image, Audio |
| Gemma 4 E4B | 4.5B effective (8B with embeddings) | 128K | Text, Image, Audio |
| Gemma 4 26B MoE | 25.2B total, 3.8B active | 256K | Text, Image |
| Gemma 4 31B Dense | 30.7B | 256K | Text, Image |
The “E” in E2B and E4B stands for “effective” parameters. These models use Per-Layer Embeddings and a shared KV cache to punch above their weight – the 4.5B effective E4B sits closer to a traditional 8B in actual capability, without the 8B memory bill. They are the only sizes in the family that natively support audio input.
Compared to Gemma 3, the jump is dramatic. Gemma 3 27B scored 20.8% on the AIME 2026 math benchmark. Gemma 4 31B scores 89.2% on the same test.
Context windows doubled to 256K on the larger models. Audio support was added. Native reasoning modes shipped. The HuggingFace team noted they “struggled to find good fine-tuning examples because they are so good out of the box” – which is either a great problem to have or a sign something genuinely shifted.
All checkpoints are on HuggingFace and Ollama, available today.
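As a quick sanity check, here is a minimal sketch of loading a checkpoint with HuggingFace transformers. The model ID below is a placeholder, not a confirmed checkpoint name – check the Gemma 4 collection on HuggingFace for the real identifiers:

```python
# Minimal sketch: loading a Gemma 4 checkpoint with HuggingFace transformers.
# The model ID is a placeholder for illustration -- look up the actual
# checkpoint names in the Gemma 4 collection on HuggingFace.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-4-e4b-it"  # hypothetical ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Summarize the Apache 2.0 license in one line."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```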
Every Gemma 4 model supports a configurable reasoning mode. You enable it with a <|think|> token at the start of the system prompt. When active, the model works through its internal reasoning before producing the final answer – similar to chain-of-thought prompting, but baked into the architecture rather than bolted on through prompt hacks.
You can turn it off when latency matters more than depth. If you have used o1 or Sonnet’s extended thinking and wished you had more control over when it fires, this is that, but open and self-hosted.
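In practice, toggling the mode can be as simple as prefixing the system prompt. A minimal sketch, assuming an OpenAI-compatible endpoint; the base URL, API key, and model name are placeholders for your own deployment:

```python
# Sketch: enabling Gemma 4's reasoning mode by prefixing the system prompt
# with the <|think|> token described above. Endpoint URL, API key, and
# model name are placeholders -- substitute your deployment's values.
from openai import OpenAI

client = OpenAI(base_url="https://your-velocis-endpoint/v1", api_key="YOUR_KEY")

def ask(question: str, think: bool = True) -> str:
    system = ("<|think|> " if think else "") + "You are a careful assistant."
    resp = client.chat.completions.create(
        model="gemma-4-31b-it",  # placeholder model name
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

print(ask("What is the sum of the first 100 primes?"))   # reasoning on
print(ask("Caption: a cat on a mat.", think=False))      # low-latency path
```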
Gemma 4 outputs structured tool calls without any prompt engineering scaffolding. The model understands when to invoke a function, formats the call correctly, and handles the response. Anyone who has spent time coaxing JSON tool calls out of older open models with regex fallbacks will appreciate how clean this is. It works across all four sizes, including E2B.
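Because the output is standard structured JSON, the usual tools interface works directly. A sketch against an OpenAI-compatible server, with placeholder endpoint and model names and a hypothetical get_weather tool:

```python
# Sketch: native function calling with Gemma 4 over an OpenAI-compatible API.
# Endpoint and model name are placeholders; get_weather is an example tool.
from openai import OpenAI

client = OpenAI(base_url="https://your-velocis-endpoint/v1", api_key="YOUR_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gemma-4-e2b-it",  # placeholder; tool calling works on all four sizes
    messages=[{"role": "user", "content": "Is it raining in Mumbai right now?"}],
    tools=tools,
)

call = resp.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)  # structured, no regex fallbacks
```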
The 26B model uses a Mixture-of-Experts design: 25.2B total parameters spread across 128 experts, but only 3.8B parameters activate per forward pass. For inference, you get near-31B quality output at roughly E4B compute cost. A single H100 NVL with 94GB VRAM handles it comfortably. This is the model to reach for when you want the best possible quality-per-dollar ratio in production.
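A back-of-envelope check (assuming BF16 weights; quantization would shrink this further) shows why the single-GPU claim holds:

```python
# Rough VRAM estimate for the 26B MoE at BF16 (2 bytes per parameter).
# All experts must be resident even though only 3.8B params are active
# per token, so weight memory is set by the 25.2B total.
total_params = 25.2e9
bytes_per_param = 2  # BF16; FP8 would halve this

weights_gb = total_params * bytes_per_param / 1e9
h100_nvl_gb = 94

print(f"weights: ~{weights_gb:.0f} GB")                               # ~50 GB
print(f"headroom for KV cache: ~{h100_nvl_gb - weights_gb:.0f} GB")   # ~44 GB
```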
The vision encoder supports five token budget tiers: 70, 140, 280, 560, and 1120 tokens per image. Lower budgets are fast and cheap – fine for captioning, video frame analysis, or classification tasks where you are processing thousands of images. Higher budgets preserve fine-grained detail for OCR, document parsing, or anything where small text in an image matters. You set this per request, not per deployment.
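How the budget is passed depends on your serving stack. The sketch below assumes an OpenAI-compatible server that accepts a per-request extra field; the image_token_budget name is a hypothetical illustration, not a documented parameter:

```python
# Sketch: per-request image token budget. "image_token_budget" is a
# hypothetical field name -- consult your serving stack's docs for the
# real knob. Endpoint, model name, and image URL are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://your-velocis-endpoint/v1", api_key="YOUR_KEY")

resp = client.chat.completions.create(
    model="gemma-4-31b-it",  # placeholder
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe every line of text in this scan."},
            {"type": "image_url", "image_url": {"url": "https://example.com/scan.png"}},
        ],
    }],
    extra_body={"image_token_budget": 1120},  # max tier, for fine-grained OCR
)
print(resp.choices[0].message.content)
```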
These are instruction-tuned results from the official evaluation suite:
| Benchmark | 31B | 26B MoE | E4B | Gemma 3 27B |
|---|---|---|---|---|
| AIME 2026 (math) | 89.2% | 88.3% | 42.5% | 20.8% |
| MMLU Pro | 85.2% | 82.6% | 69.4% | 67.6% |
| LiveCodeBench v6 | 80.0% | 77.1% | 52.0% | 29.1% |
| Codeforces ELO | 2150 | 1718 | 940 | 110 |
| GPQA Diamond | 84.3% | 82.3% | 58.6% | 42.4% |
| MMMU Pro (vision) | 76.9% | 73.8% | 52.6% | 49.7% |
| Long Context 128K | 66.4% | 44.1% | 25.4% | 13.5% |
The Codeforces ELO of 2150 for the 31B puts it in Grandmaster territory on the competitive programming ladder. Gemma 3 27B at 110 ELO on the same benchmark was not even on the board. That is not a refinement – that is a different category of model.
The 26B MoE at 1718 ELO, running on 3.8B active parameters, is arguably the most interesting result in the table.
Open models are only as capable as the hardware running them.
- Gemma 4 31B Dense: wants an H200 for batched, high-throughput serving, though a single H100 SXM handles standard loads.
- Gemma 4 26B MoE: more forgiving – a single H100 NVL runs it comfortably.
- Gemma 4 E4B: can run on L4s; the table below recommends an L40S for production headroom.
Neysa Velocis AI Cloud gives you access to virtual machines or bare-metal clusters, with on-demand and committed pricing tiers ranging from one to thirty-six months, depending on workload duration and team size.
| Gemma 4 Model | Recommended SKU | VRAM | Neysa On-Demand (USD/hr) | Neysa Committed (1-mo, USD/hr) |
|---|---|---|---|---|
| E2B | 1x L4 | 24GB | $1.17 | $0.70 |
| E4B | 1x L40S | 48GB | $1.95 | $1.17 |
| E4B (production load) | 2x L40S | 96GB | $3.89 | $2.18 |
| 26B MoE | 1x H100 NVL | 94GB | $4.39 | $2.85 |
| 31B Dense | 1x H100 SXM | 80GB | $4.39 | $2.85 |
| 31B Dense + batching | 1x H200 SXM | 141GB | $4.73 | $2.99 |
| Multi-user / high-throughput | 4x H100 SXM | 320GB | $17.57 | $11.38 |
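At these rates, the committed tier pays off quickly on steady workloads. A quick comparison for the 26B MoE’s H100 NVL, assuming roughly 730 hours in a month:

```python
# Monthly cost comparison for 1x H100 NVL (26B MoE row above),
# assuming ~730 hours in a month of continuous serving.
hours_per_month = 730
on_demand = 4.39   # USD/hr
committed = 2.85   # USD/hr, 1-month commitment

print(f"on-demand: ${on_demand * hours_per_month:,.0f}/mo")   # ~$3,205
print(f"committed: ${committed * hours_per_month:,.0f}/mo")   # ~$2,080
print(f"savings:   {(1 - committed / on_demand):.0%}")        # ~35%
```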
See full pricing for committed terms beyond the one-month rates shown above.
The full Gemma 4 ecosystem works on Neysa Velocis without modification.
For production deployments, Neysa’s dedicated inference endpoints are purpose-built for open-weight models – single-tenant, with full isolation, firewall controls, and context length configuration.
Orchestration runs on Kubernetes or Slurm. GPU utilization, latency, and throughput monitoring runs via our custom-built observability dashboards.
The Neysa Aegis LLM Shield security layer scans inference traffic for policy violations, data leakage, and adversarial inputs.
Gemma 4 is available on Neysa Velocis. Speak with our team to understand which configuration fits your workload, or request access to get started.
View GPU Pricing | Explore Velocis AI Cloud | Request Access
Build and scale your next high-impact, real-world AI application with Neysa today.