
Gemma 4 is Now Available on Neysa Velocis


Google DeepMind dropped Gemma 4 on April 2, 2026 – the day after April Fool’s Day, which meant the X (formerly Twitter) discourse spent a full 24 hours unsure whether to believe the benchmarks. They were real.

Gemma 4 is the most capable open model family Google has shipped. The Apache 2.0 license means you can run it in your own infrastructure, fine-tune it on proprietary data, and build commercial products on top of it. Which is a meaningful distinction from models with research-only or restricted-use licenses – this one is production-ready from day one.

It handles text, images, and audio. It reasons through hard problems, writes and debugs code, calls external tools natively, and understands documents in over 140 languages. The 31B model scored 1452 on LMArena, putting it in territory that was closed-model-only a year ago.

Gemma 4 is available now on Neysa Velocis, on H100, H200, L40S, and L4 GPUs, with transparent on-demand and committed pricing tiers.

What Is Gemma 4?

Gemma 4 comes in four sizes, each available as a base model and an instruction-tuned variant:

| Model | Parameters | Context Window | Modalities |
| --- | --- | --- | --- |
| Gemma 4 E2B | 2.3B effective (5.1B with embeddings) | 128K | Text, Image, Audio |
| Gemma 4 E4B | 4.5B effective (8B with embeddings) | 128K | Text, Image, Audio |
| Gemma 4 26B MoE | 25.2B total, 3.8B active | 256K | Text, Image |
| Gemma 4 31B Dense | 30.7B | 256K | Text, Image |

The “E” in E2B and E4B stands for “effective” parameters. These models use Per-Layer Embeddings and a shared KV cache to punch above their weight – the 4.5B effective E4B sits closer to a traditional 8B in actual capability, without the 8B memory bill. They are the only sizes in the family that natively support audio input.

Compared to Gemma 3, the leap is generational. Gemma 3 27B scored 20.8% on the AIME 2026 math benchmark; Gemma 4 31B scores 89.2% on the same test.

Context windows doubled to 256K on the larger models. Audio support was added. Native reasoning modes shipped. The HuggingFace team noted they “struggled to find good fine-tuning examples because they are so good out of the box” – which is either a great problem to have or a sign something genuinely shifted.

All checkpoints are on HuggingFace and Ollama, available today.

What Makes Gemma 4 Different

Configurable Thinking Mode

Every Gemma 4 model supports a configurable reasoning mode. You enable it with a <|think|> token at the start of the system prompt. When active, the model works through its internal reasoning before producing the final answer – similar to chain-of-thought prompting, but baked into the architecture rather than bolted on through prompt hacks.

You can turn it off when latency matters more than depth. If you have used o1 or Sonnet’s extended thinking and wished you had more control over when it fires, this is that, but open and self-hosted.
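Wiring this into a request is a one-line change. A minimal sketch, assuming a standard messages-list chat format – only the <|think|> token itself comes from the model's documentation, the rest is illustrative:

```python
# Sketch: toggling Gemma 4's configurable reasoning mode per request.
# The <|think|> token is from the article; the messages structure is the
# generic chat format most serving stacks accept.

def build_messages(user_prompt: str, think: bool = False) -> list[dict]:
    """Prepend <|think|> to the system prompt when deep reasoning is wanted."""
    system = "You are a helpful assistant."
    if think:
        system = "<|think|> " + system  # enable the thinking mode
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]

fast = build_messages("Summarize this log line.")              # latency-sensitive
deep = build_messages("Prove the bound holds for n > 3.", think=True)
```

The useful part is that the toggle lives in the request, not the deployment – the same endpoint serves both latency-sensitive and reasoning-heavy traffic.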

Native Function Calling

Gemma 4 outputs structured tool calls without any prompt engineering scaffolding. The model understands when to invoke a function, formats the call correctly, and handles the response. Anyone who has spent time coaxing JSON tool calls out of older open models with regex fallbacks will appreciate how clean this is. It works across all four sizes, including E2B.
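Handling those structured calls on the application side is then a few lines. A sketch assuming the call arrives as a JSON object with `name` and `arguments` fields – the exact wire format depends on your chat template, so treat this as illustrative:

```python
import json

# Sketch: dispatching a structured tool call emitted by the model.
# The JSON shape {"name": ..., "arguments": {...}} is an assumption
# for illustration; the tool itself is a stub.

TOOLS = {
    "get_weather": lambda city: f"22C and clear in {city}",
}

def dispatch(model_output: str) -> str:
    """Parse a tool call from model output and invoke the matching function."""
    call = json.loads(model_output)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

result = dispatch('{"name": "get_weather", "arguments": {"city": "Pune"}}')
```

Because the model formats the call reliably, there is no regex fallback layer – a parse failure can simply be treated as a plain-text answer.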

MoE Architecture on the 26B

The 26B model uses a Mixture-of-Experts design: 25.2B total parameters spread across 128 experts, but only 3.8B parameters activate per forward pass. For inference, you get near-31B quality output at roughly E4B compute cost. A single H100 NVL with 94GB VRAM handles it comfortably. This is the model to reach for when you want the best possible quality-per-dollar ratio in production.
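The economics follow from the active parameter count: decode cost scales with the parameters touched per forward pass, roughly 2 FLOPs per active parameter per token. A back-of-envelope comparison:

```python
# Back-of-envelope: per-token decode FLOPs scale with ~2x the parameters
# touched per forward pass, so the MoE's 3.8B active parameters drive its
# inference cost, not its 25.2B total.

def flops_per_token(active_params_b: float) -> float:
    """Approximate decode FLOPs per token for a given active parameter count."""
    return 2 * active_params_b * 1e9

moe = flops_per_token(3.8)     # 26B MoE: 3.8B active per pass
dense = flops_per_token(30.7)  # 31B Dense: all parameters active

speed_ratio = dense / moe      # ~8x fewer FLOPs per decoded token for the MoE
```

Memory is the caveat: all 25.2B parameters must still be resident, which is why the 26B MoE wants a 94GB H100 NVL even though its compute bill is E4B-class.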

Variable Image Resolution

The vision encoder supports five token budget tiers: 70, 140, 280, 560, and 1120 tokens per image. Lower budgets are fast and cheap – fine for captioning, video frame analysis, or classification tasks where you are processing thousands of images. Higher budgets preserve fine-grained detail for OCR, document parsing, or anything where small text in an image matters. You set this per request, not per deployment.
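Choosing a tier per request amounts to a small routing policy. A sketch using the five tiers above – the task-to-tier mapping is an illustrative assumption, not part of the model's API:

```python
# Sketch: picking a per-request image token budget from the five tiers
# the vision encoder exposes (70-1120 tokens per image). The policy
# below is illustrative.

TIERS = (70, 140, 280, 560, 1120)

def pick_budget(task: str) -> int:
    """Map a workload type to an image token budget."""
    cheap = {"captioning", "classification", "video_frames"}
    detailed = {"ocr", "document_parsing"}
    if task in cheap:
        return TIERS[0]   # 70 tokens: fast and cheap at scale
    if task in detailed:
        return TIERS[-1]  # 1120 tokens: preserves small text and fine detail
    return TIERS[2]       # 280 tokens: middle-ground default
```

A pipeline processing thousands of video frames at 70 tokens and the occasional scanned contract at 1120 can run against the same deployment.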

Benchmark Numbers Worth Knowing

These are instruction-tuned results from the official evaluation suite:

| Benchmark | 31B | 26B MoE | E4B | Gemma 3 27B |
| --- | --- | --- | --- | --- |
| AIME 2026 (math) | 89.2% | 88.3% | 42.5% | 20.8% |
| MMLU Pro | 85.2% | 82.6% | 69.4% | 67.6% |
| LiveCodeBench v6 | 80.0% | 77.1% | 52.0% | 29.1% |
| Codeforces ELO | 2150 | 1718 | 940 | 110 |
| GPQA Diamond | 84.3% | 82.3% | 58.6% | 42.4% |
| MMMU Pro (vision) | 76.9% | 73.8% | 52.6% | 49.7% |
| Long Context 128K | 66.4% | 44.1% | 25.4% | 13.5% |

The Codeforces ELO of 2150 for the 31B puts it in Grandmaster territory on the competitive programming ladder. Gemma 3 27B at 110 ELO on the same benchmark was not even on the board. That is not a refinement – that is a different category of model.

The 26B MoE at 1718 ELO, running on 3.8B active parameters, is arguably the most interesting result in the table.

Choosing the right GPU for running Gemma 4

Open models are only as capable as the hardware running them. 

Gemma 4 31B Dense model needs an H100 or H200

  • It needs at least 20GB VRAM for basic inference at reduced precision. At full bf16, it comfortably fills an 80GB H100 SXM.
  • If you are running multi-user inference, long context prompts near the 256K limit, or want headroom for batching, a single H200 SXM with 141GB is the practical choice.

Gemma 4 26B MoE model is more forgiving

  • With only 3.8B parameters active per pass, it fits a single H100 NVL (94GB) at full precision with room for concurrent requests. 
  • This is the practical production sweet spot for most teams: near-31B quality, H100-class cost.

Gemma 4 E2B and E4B models can run on L4s and L40S GPUs

  • The E2B runs on a single L4 (24GB); the E4B fits a single L40S (48GB)
  • If you are deploying inference at the edge, embedding Gemma 4 into a product that needs to stay lean, or just want to experiment cheaply before committing to larger compute, these are the variants to start with.
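The sizing above follows from simple arithmetic. A rough sketch for the weights alone, assuming bf16 at 2 bytes per parameter – KV cache, activations, and batching headroom come on top, which is why the recommendations leave margin:

```python
# Rough VRAM sizing for model weights only (bf16 = 2 bytes per parameter).
# Runtime overhead (KV cache, activations, batching) is extra.

def weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate resident weight size in GB."""
    return params_billion * 1e9 * bytes_per_param / 1e9

dense_31b = weights_gb(30.7)  # ~61.4 GB -> fills an 80GB H100 SXM
moe_26b = weights_gb(25.2)    # ~50.4 GB -> all experts resident on an H100 NVL
e2b = weights_gb(5.1)         # ~10.2 GB with embeddings -> fits a 24GB L4
```

Note the MoE line: only 3.8B parameters are active per pass, but all 25.2B must sit in VRAM – sparsity saves compute, not memory.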

Get access today

Neysa Velocis AI Cloud gives you access to virtual machines or bare-metal clusters, with on-demand and committed pricing tiers ranging from one to thirty-six months, depending on workload duration and team size.

Which GPU to use for Gemma 4 models?

| Gemma 4 Model | Recommended SKU | VRAM | Neysa On-Demand (USD/hr) | Neysa Committed (1-mo, USD/hr) |
| --- | --- | --- | --- | --- |
| E2B | 1x L4 | 24GB | $1.17 | $0.70 |
| E4B | 1x L40S | 48GB | $1.95 | $1.17 |
| E4B (production load) | 2x L40S | 96GB | $3.89 | $2.18 |
| 26B MoE | 1x H100 NVL | 94GB | $4.39 | $2.85 |
| 31B Dense | 1x H100 SXM | 80GB | $4.39 | $2.85 |
| 31B Dense + batching | 1x H200 SXM | 141GB | $4.73 | $2.99 |
| Multi-user / high-throughput | 4x H100 SXM | 320GB | $17.57 | $11.38 |

Velocis offers on-demand and committed pricing tiers to match different workload durations and team sizes. See full pricing.
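To see what the committed tier is worth, a quick back-of-envelope using the H100 NVL rates from the table above, with 730 hours approximating a month of round-the-clock serving:

```python
# Back-of-envelope: on-demand vs 1-month committed cost for one H100 NVL
# (26B MoE) running continuously. Rates are from the pricing table.

HOURS_PER_MONTH = 730  # ~24x365/12

def monthly_cost(rate_per_hr: float) -> float:
    """Cost of running one GPU around the clock for a month."""
    return rate_per_hr * HOURS_PER_MONTH

on_demand = monthly_cost(4.39)   # ~$3,204.70/month
committed = monthly_cost(2.85)   # ~$2,080.50/month
savings = on_demand - committed  # ~$1,124/month for a 1-month commitment
```

The crossover is the usual one: steady production traffic favors committed, bursty experimentation favors on-demand.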

Serving Stack

The full Gemma 4 ecosystem works on Neysa Velocis without modification:

  • vLLM: High-throughput production serving with continuous batching. The right choice for multi-user API deployments where you care about tokens per second per dollar.
  • Ollama: ollama run gemma4 and you are live. Good for dev environments and quick internal tools.
  • HuggingFace Transformers: Full AutoModelForMultimodalLM support for fine-tuning pipelines, custom inference, and integration with TRL, PEFT, and bitsandbytes.

For production deployments, Neysa’s dedicated inference endpoints are purpose-built for open-weight models – single-tenant, with full isolation, firewall controls, and context length configuration. 

Orchestration runs on Kubernetes or Slurm. GPU utilization, latency, and throughput monitoring runs via our custom-built observability dashboards.

The Neysa Aegis LLM Shield security layer scans inference traffic for policy violations, data leakage, and adversarial inputs.

What can you build with Gemma 4 on Neysa Velocis

  • Coding assistants and dev tools
    Gemma 4 31B’s Codeforces ELO of 2150 makes it competitive with leading closed models on structured code generation. For teams building coding copilots, PR reviewers, or internal developer assistants, this is a deployable alternative that keeps your codebase off third-party servers. The native function calling also makes it practical as an orchestration layer in multi-step dev workflows.
  • Document AI and KYC
    The variable-resolution vision encoder handles PDFs, scanned forms, and mixed-content documents well. The 256K context window means large contracts or regulatory filings fit in a single prompt. For BFSI teams running KYC, credit underwriting, or document classification, pairing Gemma 4 with Velocis means the documents never leave your environment.
  • Multilingual applications
    You can run use cases around customer support, content moderation, or data processing across Indian regional languages and global markets with Gemma 4.
  • Agentic pipelines
    Native function calling plus a 256K context window is a meaningful combination for agent work. The model can hold long conversation histories, tool outputs, and intermediate reasoning in context while reliably producing clean tool invocations at each step. The 26B MoE hits the practical sweet spot here: near-31B reasoning quality, E4B-class active compute. 
  • Voice and audio processing
    E2B and E4B accept audio input natively. Speech transcription, audio QA, and multilingual voice processing without a separate Whisper or ASR model in your stack. CoVoST scores of 35.54 (E4B) and 33.47 (E2B) put transcription quality in production territory.
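The agentic pattern above reduces to a short loop: call the model, execute any tool it requests, feed the result back, stop at a plain-text answer. A minimal sketch with a stubbed model – the JSON call format and tool names are illustrative assumptions:

```python
import json

# Minimal agent loop sketch. fake_model stands in for a Gemma 4 call and
# returns canned outputs; the {"tool": ..., "args": ...} format is an
# illustrative assumption, not the model's actual wire format.

def fake_model(history: list[str]) -> str:
    if not any("result:" in h for h in history):
        return '{"tool": "search", "args": {"q": "gemma 4 context window"}}'
    return "Gemma 4's larger models support a 256K context window."

def run_agent(question: str, tools: dict) -> str:
    """Loop: emit tool calls, execute them, return the first plain-text answer."""
    history = [question]
    for _ in range(5):  # cap the number of agent steps
        out = fake_model(history)
        try:
            call = json.loads(out)
        except json.JSONDecodeError:
            return out  # plain text means a final answer
        result = tools[call["tool"]](**call["args"])
        history.append(f"result: {result}")
    return history[-1]

answer = run_agent("What context window?", {"search": lambda q: "256K"})
```

The 256K window is what lets `history` grow across many steps without truncation; the reliable call format is what lets `json.loads` double as the stop condition.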

Get Started

Gemma 4 is available on Neysa Velocis. Speak with our team to understand which configuration fits your workload, or request access to get started.

View GPU Pricing | Explore Velocis AI Cloud | Request Access

FAQs

What is Gemma 4, and what’s new compared to Gemma 3?
Gemma 4 is Google’s latest open model family with major gains in reasoning, coding, and long-context performance. Compared to Gemma 3, the larger Gemma 4 models expand context to 256K, add native reasoning modes, and introduce stronger instruction-tuned results across math, code, and vision benchmarks.

Which Gemma 4 models are available, and what are their context windows?
Gemma 4 comes in four sizes: E2B, E4B, 26B MoE, and 31B Dense. E2B and E4B support 128K context, while 26B MoE and 31B Dense support 256K context.

What does “effective parameters” mean in E2B and E4B?
“Effective” refers to how these models use techniques like Per-Layer Embeddings and a shared KV cache to deliver capability closer to larger dense models, without the same memory cost. In practical terms, E4B can behave closer to a traditional 8B model while keeping a lower memory footprint.

What is Gemma 4’s configurable thinking mode, and when should I use it?
Gemma 4 supports an optional reasoning mode that you can enable via a <|think|> token in the system prompt. Turn it on when you want deeper reasoning and turn it off when latency and throughput matter more than depth.

What is the advantage of the 26B MoE model for production inference?
The 26B MoE uses a Mixture-of-Experts design where only a small subset of parameters is active per forward pass. That can deliver near-31B output quality at lower active compute cost, which often makes it a strong quality-per-dollar choice for production inference.

Which GPU should I choose to run Gemma 4 on Neysa Velocis?
A practical mapping:
• E2B: 1x L4 (24GB)
• E4B: 1x L40S (48GB), or 2x L40S for heavier production load
• 26B MoE: 1x H100 NVL (94GB)
• 31B Dense: 1x H100 SXM (80GB), or H200 SXM (141GB) for long context, batching, or multi-user headroom

Ready to get started?

Build and scale your next real-world impact AI application with Neysa today.
