Gemma 4 Wiki
Track Gemma 4 model sizes, benchmarks, prompting, function calling, multimodal input, local deployment, and fine-tuning across the official Google ecosystem.

Latest Updates
Discover the newest guides, tips, and content
Gemma 4 Ollama Update: How to Run Google’s New Open Models 2026
Explore the massive Gemma 4 Ollama update. Learn how to install the 31B, 26B MoE, and Effective 4B models locally for agentic workflows and coding.
Gemma 4 26B Guide: Exploring Google’s Open Model Power 2026
A comprehensive guide to the Gemma 4 26B Mixture of Experts model. Learn about its architecture, local performance, and agentic capabilities in 2026.
Gemma 4 Ollama: Run Google’s Edge-Optimized AI Locally 2026
Learn how to install and optimize Gemma 4 E4B using Ollama and OpenClaw. A complete guide to local AI deployment with per-layer embedding technology.
Gemma 4 Performance: Complete Guide and Benchmarks 2026
Explore the breakthrough Gemma 4 performance metrics. Learn how Google's open-source AI models run locally on consumer hardware with Turbo Quant technology.
Gemma 4 Hardware Requirements: Complete Local AI Guide 2026
Learn the essential Gemma 4 hardware requirements to run Google's latest open models locally. Detailed VRAM, RAM, and GPU specs for 2B to 31B models.
Gemma 4 Windows: Complete Local AI Setup Guide 2026
Learn how to install and optimize Gemma 4 on Windows. Our comprehensive guide covers hardware requirements, MoE vs. Dense models, and local agentic workflows.
Gemma 4 Ollama MLX: Advanced Local AI Guide 2026
Master the deployment and fine-tuning of Gemma 4 using Ollama and MLX. Complete 2026 guide for Apple Silicon and high-end desktop performance.
Gemma 4 Release: Complete Guide to Google's New Open Models 2026
Explore the official Gemma 4 release including model variants, Apache 2.0 licensing, and agentic workflow capabilities for local AI development.
Gemma 4 31B: Ultimate Guide to Google’s Open Model 2026
Explore the groundbreaking Gemma 4 31B model. Learn about its 256k context window, multimodal gaming capabilities, and local deployment performance.
Gemma 4 Ollama Setup: Run Google’s Most Powerful Open Models 2026
Learn how to perform a complete Gemma 4 Ollama setup to run Google's latest open-source AI models locally. Detailed guide on hardware, OpenClaw integration, and optimization.
Gemma 4 Thinking Mode: Optimization & Hardware Guide 2026
Master the new Gemma 4 thinking mode for advanced reasoning. Learn about A4B architecture, latency optimization, and hardware requirements for local AI hosting.
Gemma 4 vs Qwen: Ultimate AI Model Comparison Guide 2026
A deep dive comparison between Google's Gemma 4 and Alibaba's Qwen 3.6 Plus. Explore benchmarks, multimodal features, and local deployment tips for 2026.
Gemma 4 Guide: Mastering Google’s Open-Source AI in 2026
Learn how to run Google's Gemma 4 locally, explore vibe-coding in AIventure, and optimize performance for gaming and development in 2026.
Gemma 4 Release Date: Complete Guide to Google's New Open Model 2026
Google has officially launched Gemma 4. Explore the Gemma 4 release date, model specifications, hardware requirements, and how to use these open-source models for your projects.
Gemma 4 Offline: How to Run Google’s Powerhouse AI Locally 2026
Learn how to download and run Gemma 4 offline on your computer. A complete guide to Google's open-source AI models, hardware requirements, and local setup steps.
Gemma 4 Benchmark: Performance Analysis and Model Guide 2026
Explore the latest Gemma 4 benchmark results, architecture upgrades, and deployment strategies for Google's newest Apache 2.0 open-weights models.
Gemma 4 Multimodal: Complete Guide to Google's New Open Models 2026
Explore the groundbreaking capabilities of Gemma 4 multimodal models. Learn about the 26B and 31B architectures, gaming performance, and local deployment tips.
Gemma 4 vs Qwen 3.6: Best AI Models for Gaming & Devs 2026
Compare Google's Gemma 4 and Alibaba's Qwen 3.6. Discover which model wins for local gaming integration, coding, and multimodal performance in 2026.
Gemma 4 vs Gemini: Open Model Performance Comparison 2026
Compare Google's Gemma 4 open models against the Gemini proprietary suite. Discover benchmarks, agentic capabilities, and local hardware requirements.
Gemma 4 What Is: Complete Guide to Google's Open AI Models 2026
Explore everything about Google's Gemma 4 release, including the Apache 2.0 license, workstation and edge models, and native multi-modality features.
Gemma 4 Local Test: Performance & Benchmarking Guide 2026
Explore the comprehensive Gemma 4 local test results. We analyze vision, reasoning, and hardware performance for Google's latest open-weight LLM.
Gemma 4 PC: Local AI Performance and Setup Guide 2026
Learn how to run Google's Gemma 4 on your PC. Explore benchmarks for E2B, 26B, and 31B models, hardware requirements, and optimization tips for local AI.
Gemma 4 Coding Test: Google’s Open Models Benchmarked 2026
An in-depth Gemma 4 coding test covering web development, 3D game engines, and local performance. See how the 26B and 31B models stack up in real-world scenarios.
Gemma 4 Tutorial: Master Google's Open AI Models 2026
Learn how to deploy and fine-tune Google's Gemma 4 models. Our comprehensive tutorial covers multi-modality, MoE architecture, and local setup for 2026.
Gemma 4 Phone: Ultimate Mobile AI Integration Guide 2026
Explore the power of Gemma 4 phone integration. Learn about the Effective 2B and 4B models, mobile-first agentic workflows, and on-device AI performance in 2026.
Gemma 4 Laptop: Best Local AI Models & Hardware Guide 2026
Learn how to optimize your Gemma 4 laptop setup. Discover the best local AI models for reasoning, coding, and agentic workflows in 2026.
Gemma 4 Linux: Local Installation and Setup Guide 2026
Learn how to install and optimize Gemma 4 on Linux distributions. Step-by-step guide for Ollama integration, hardware requirements, and performance tuning.
Gemma 4 E4B: Complete Guide to Google's Edge AI Models 2026
Explore the Gemma 4 E4B model, Google's latest breakthrough in edge AI. Learn about its effective parameters, PLE architecture, and agentic capabilities for 2026.
Gemma 4 Review: Google’s New Open Model Family Guide 2026
An in-depth Gemma 4 review covering the new Apache 2.0 license, workstation and edge models, and native multi-modal capabilities. Updated for 2026.
Gemma 4 Models: Complete Guide to Google's Open AI 2026
Explore the full capabilities of the Gemma 4 models. Learn about the 26B MoE and 31B Dense variants, their gaming applications, and performance benchmarks.
Gemma 4 Resources
Everything you need to get started with Gemma 4 — from local setup to API integration
Gemma 4 Tutorial
Gemma 4 launched on March 31, 2026 in four official sizes: E2B, E4B, 26B A4B, and 31B. The family is built for open-weight deployment under Apache 2.0, with smaller edge models aimed at mobile and laptop-class hardware and larger models aimed at desktops, workstations, and servers.
Understand the four official Gemma 4 sizes
Gemma 4 comes in E2B, E4B, 26B A4B, and 31B. E2B and E4B accept text, image, and audio input; 26B A4B and 31B accept text and image input and target larger local or server deployments.
Match the model to your hardware
Use E2B or E4B when you want mobile, edge, or laptop-friendly local inference. Use 26B A4B for a stronger general-purpose local model, and 31B when you want the largest official Gemma 4 checkpoint.
Choose a starting point
Gemma 4 26B A4B is a strong default if you want a powerful first experience. If you want the lightest starting point, begin with an instruction-tuned edge model and move up when your workload needs more capability.
Pick how you want to try it
Try hosted Gemma 4 through Google AI Studio and the Gemini API, or download open weights from Hugging Face or Kaggle for local use, tuning, and custom deployment.
Know what Gemma 4 is optimized for
The family is built for reasoning, coding, agentic workflows, and multimodal understanding. Edge models support 128K context, while 26B A4B and 31B support up to 256K context.
Quick Tips
- Instruction-tuned (-it) variants are best for chat and assistant use cases.
- E2B and E4B are the most hardware-accessible starting points for local experimentation.
- The 26B A4B is a Mixture-of-Experts model with faster effective inference than a dense model of similar total size.
- All Gemma 4 weights are released under the Apache 2.0 license.
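The sizing guidance above can be condensed into a small selection helper. This is an illustrative sketch only: the RAM thresholds are assumptions rather than official requirements, and pick_gemma4_size is a name invented for this example.

```python
def pick_gemma4_size(needs_audio: bool, long_context: bool, local_ram_gb: float) -> str:
    """Shortlist a Gemma 4 size from the guidance above (thresholds are assumptions)."""
    if needs_audio:
        # Only the E2B and E4B edge models accept audio input.
        return "E4B" if local_ram_gb >= 16 else "E2B"
    if long_context or local_ram_gb >= 48:
        # 26B A4B and 31B support 256K context and target workstation-class machines.
        return "31B" if local_ram_gb >= 64 else "26B A4B"
    return "E4B" if local_ram_gb >= 16 else "E2B"
```

The rule mirrors the text: audio forces an edge model, long context or ample RAM points at the larger tier, and everything else defaults to the lightest fit.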
Gemma 4 Ollama Setup
Ollama is one of the fastest ways to get Gemma 4 running on a laptop or workstation. The default Ollama flow is simple: install Ollama, pull Gemma 4, confirm the model list, choose the right tag for your hardware, and then run from the CLI or local API.
Install and verify Ollama
Download Ollama for Windows, macOS, or Linux, install it, and verify the setup with the command ollama --version.
Pull the default Gemma 4 variant
Use ollama pull gemma4 to download the default Gemma 4 package, then run ollama list to confirm it is available locally.
Choose the right model tag
Use gemma4:e2b for the lightest edge option, gemma4:e4b for a stronger edge default, gemma4:26b for the 26B A4B MoE workstation model, and gemma4:31b for the full large model.
Know what each tag expects
On the Ollama library page, e2b is listed at 7.2GB with 128K context, e4b at 9.6GB with 128K, 26b at 18GB with 256K, and 31b at 20GB with 256K.
Run your first prompt
For a first text test, run ollama run gemma4 "Hello, what can you do?". Ollama also supports image input with the prompt form shown in the official guide.
Use the local API for app integration
Ollama exposes a local web service at http://localhost:11434/api/generate, so you can move from CLI testing to a lightweight local application without setting up a separate model server.
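As a quick sketch of that CLI-to-API jump, the snippet below posts a prompt to the local /api/generate endpoint using only the standard library. It assumes Ollama is running on the default port and that a gemma4 tag has already been pulled; build_generate_payload and generate are illustrative helper names.

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_payload(model: str, prompt: str, stream: bool = False) -> dict:
    """Build the JSON body for Ollama's generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": stream}

def generate(model: str, prompt: str) -> str:
    """POST a prompt to the local Ollama server and return the response text."""
    body = json.dumps(build_generate_payload(model, prompt)).encode("utf-8")
    req = request.Request(OLLAMA_URL, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With stream set to False, Ollama returns one JSON object whose response field holds the full completion, which keeps the parsing trivial for a first integration.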
Quick Tips
- E2B and E4B are the practical first picks for local experimentation on lighter hardware.
- The 26b tag targets the 26B A4B MoE model, which uses less active compute than a dense model of similar total size.
- ollama list shows all locally downloaded models and their sizes.
- Ollama supports image input: include the image file path in the prompt when running a multimodal tag such as gemma4:e2b.
Gemma 4 API Guide
The Gemini API provides hosted access to Gemma 4, useful when building without managing local inference. The hosted Gemma 4 models in AI Studio and the Gemini API are gemma-4-26b-a4b-it and gemma-4-31b-it.
Create an API key in Google AI Studio
Open Google AI Studio and create a Gemini API key. New users can start with a default Google Cloud project, while existing users can import a Cloud project and create keys there.
Set the key in your environment
The Gemini SDKs automatically pick up GEMINI_API_KEY or GOOGLE_API_KEY. If both are set, GOOGLE_API_KEY takes precedence.
Install the official SDK
For Python, install google-genai. For JavaScript and TypeScript, install @google/genai. Google also publishes SDKs for Go, Java, and C#, plus an Apps Script integration.
Choose the hosted Gemma 4 model ID
For hosted Gemma 4, use gemma-4-26b-a4b-it for a faster MoE large model, or gemma-4-31b-it for the flagship dense checkpoint.
Send a first generateContent request
The official example uses client.models.generate_content with the model field set to gemma-4-31b-it. In REST, requests go to the generateContent endpoint with the x-goog-api-key header.
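A hedged sketch of that REST call follows, using only the standard library. It assumes the standard Gemini API request shape (a contents/parts payload and the x-goog-api-key header) and a GEMINI_API_KEY environment variable; build_request and generate are names invented for this example.

```python
import json
import os
from urllib import request

API_ROOT = "https://generativelanguage.googleapis.com/v1beta"

def build_request(model: str, prompt: str, api_key: str):
    """Assemble the generateContent URL, headers, and JSON body."""
    url = f"{API_ROOT}/models/{model}:generateContent"
    headers = {"x-goog-api-key": api_key, "Content-Type": "application/json"}
    body = {"contents": [{"parts": [{"text": prompt}]}]}
    return url, headers, body

def generate(model: str, prompt: str) -> str:
    """Send one generateContent request and return the first candidate's text."""
    url, headers, body = build_request(model, prompt, os.environ["GEMINI_API_KEY"])
    req = request.Request(url, data=json.dumps(body).encode("utf-8"), headers=headers)
    with request.urlopen(req) as resp:
        data = json.loads(resp.read())
    return data["candidates"][0]["content"]["parts"][0]["text"]
```

Calling generate("gemma-4-31b-it", "Hello") would exercise the hosted flagship model; swap in gemma-4-26b-a4b-it for the faster MoE option.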
Use AI Studio to bridge from testing to code
Google AI Studio lets you experiment with prompts, model settings, function calling, and structured output, then export working code through the Get code flow.
Quick Tips
- AI Studio is the fastest way to test Gemma 4 prompts before writing any code.
- The Gemini API supports streaming responses for chat and long-generation use cases.
- gemma-4-26b-a4b-it is the MoE model — generally faster and more cost-efficient than 31B.
- Function calling and structured output are available for both hosted Gemma 4 model IDs.
Gemma 4 Hugging Face Download
The official Google collection on Hugging Face includes eight core Gemma 4 checkpoints: E2B, E4B, 26B A4B, and 31B, each in base and instruction-tuned form. Instruction-tuned (-it) repositories are the natural starting point for chat, coding, and assistant experiences.
google/gemma-4-E2B-it
Edge checkpoint with text, image, and audio input and 128K context. Best for fast local assistants and on-device multimodal experimentation.
google/gemma-4-E4B-it
Stronger edge checkpoint with text, image, and audio input and 128K context. More capable than E2B without jumping to workstation-class hardware.
google/gemma-4-26B-A4B-it
Mixture-of-Experts checkpoint with 256K context and text-image input. Large-model quality with faster effective inference than a dense model of similar total size.
google/gemma-4-31B-it
Flagship dense Gemma 4 checkpoint with 256K context and text-image input. Best for the strongest chat, reasoning, coding, and agent workflows.
google/gemma-4-E2B
Base edge checkpoint for users who want to study, adapt, or fine-tune the smallest multimodal Gemma 4 model.
google/gemma-4-E4B
Base edge checkpoint that keeps text, image, and audio input while leaving downstream instruction behavior to your own tuning pipeline.
google/gemma-4-26B-A4B
Base MoE large checkpoint for custom adaptation where you want the 26B A4B architecture without default instruction-tuned behavior.
google/gemma-4-31B
Base 31B dense checkpoint for teams that want the largest official Gemma 4 foundation model before their own fine-tuning or alignment stage.
Choose the Right Gemma 4 Size for Your Hardware
Gemma 4 ships in four sizes with very different trade-offs. The fastest choice is not always the smallest model, and the highest-quality choice is not always the easiest one to deploy.
Gemma 4 is available in two edge-first dense models, one efficient Mixture-of-Experts model, and one large dense model. For most teams, the real decision is not just quality, but where the model runs: phone, laptop, workstation, or server. A practical starting point is 26B A4B when you want strong quality without jumping all the way to 31B.
Gemma 4 E2B
Offline assistants, lightweight multimodal apps, edge deployment
Gemma 4 E4B
Stronger local copilots, on-device reasoning, multimodal apps with more headroom
Gemma 4 26B A4B
Best balance of quality, speed, and long-context work for most teams
Gemma 4 31B
Highest-end reasoning, coding, and multimodal quality in the Gemma 4 family
The Gemma 4 Specs That Actually Matter Before You Build
For most builders, the key questions are context length, modalities, language coverage, licensing, and app-level features. These are the specs that change implementation choices, hosting cost, and product scope.
Gemma 4 is not just a text model refresh. The family combines long context, multimodal input, thinking mode, native system prompts, and function-calling support in one open-weight lineup. The smaller models add audio input, while the larger models extend context to 256K for document-heavy and repository-scale workloads.
Release date: March 31, 2026
This is the current Gemma core generation and the one Google now highlights across docs and launch materials.
Modalities: all models accept text and image input and produce text; E2B and E4B also accept audio input
You can build text-only, vision, and lightweight speech understanding flows without switching model families.
Context window: 128K tokens on E2B and E4B; 256K tokens on 26B A4B and 31B
Large prompts such as long documents, long chats, or multi-file code context fit in a single request.
Language coverage: over 140 languages
This matters for multilingual products, OCR, and globally deployed assistants.
License: Apache 2.0 with open weights and support for responsible commercial use
You can tune, deploy, and run Gemma 4 in your own stack with fewer licensing constraints.
App-level features: configurable thinking mode, native system role support, structured JSON output, and function calling
These features make Gemma 4 much easier to use for agents, tool use, and instruction-heavy applications.
Image processing: variable resolutions with token budgets of 70, 140, 280, 560, or 1120 tokens
You can trade image detail for speed depending on whether the task is OCR, UI reading, chart analysis, or fast frame processing.
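That trade-off can be captured in a tiny lookup. The budget values come from the spec above; the task-to-budget mapping itself is an illustrative assumption, not an official recommendation, and image_token_budget is a name invented here.

```python
def image_token_budget(task: str) -> int:
    """Map a task type to one of the published Gemma 4 image token budgets."""
    budgets = {
        "frame-analysis": 70,     # fast video frame processing
        "captioning": 140,
        "classification": 140,
        "ui-reading": 560,
        "chart-analysis": 560,
        "ocr": 1120,              # reading small text needs the most detail
        "document-parsing": 1120,
    }
    return budgets.get(task, 280)  # mid-range default for unlisted tasks
```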
Official Gemma 4 Benchmark Snapshot
These scores show where each Gemma 4 size is strongest across reasoning, coding, science, vision, and long-context retrieval. Use them to shortlist a model quickly, then match that shortlist to your latency and memory budget.
Gemma 4 is positioned as a model family for reasoning, agentic workflows, coding, and multimodal understanding. The official benchmark tables show a clear pattern: 31B leads, 26B A4B stays surprisingly close while being much more efficient, and E4B and E2B bring meaningful capability to smaller devices.
MMLU Pro
Knowledge and reasoning
Best quick comparison for general high-level reasoning performance across the family.
AIME 2026 (no tools)
Math reasoning
31B and 26B A4B are the right targets for math-heavy assistants and planning tasks.
LiveCodeBench v6
Competitive coding
If coding is a primary use case, the larger two models are in a different tier from the edge models.
GPQA Diamond
Scientific reasoning
A strong signal for technical and expert-facing workflows.
MMMU Pro
Multimodal reasoning
Vision tasks benefit heavily from the larger models when accuracy matters more than footprint.
MRCR v2 (128K, 8-needle)
Long-context retrieval
For large-document and repository-scale prompting, 31B is the strongest long-context choice.
How to Fine-Tune Gemma 4 for Real Product Work
Fine-tuning matters when prompting alone is not enough and you want Gemma 4 to perform better on a specific domain, workflow, or role. The practical paths are lightweight adapter tuning for text tasks and multimodal adapter tuning for image-plus-text tasks.
The official Gemma tuning docs center on a simple rule: tune for a defined task, not for vague improvement. For many builders, QLoRA is the most realistic place to start because it keeps hardware requirements much lower than full-model tuning.
Start with a narrow tuning goal
Choose a task or role where the base model should perform better, such as customer support, text-to-SQL, or product description generation. Use fine-tuning when the task is specific and repeated.
Pick the tuning path
Use text tuning for instruction and generation tasks, or vision tuning when your dataset combines images and text. The text QLoRA guide demonstrates text-to-SQL; the vision QLoRA guide demonstrates image-plus-text product descriptions.
Choose a realistic framework
Gemma 4 supports Keras with LoRA, the Gemma library, Hugging Face-based workflows, GKE, and Vertex AI. Hugging Face plus TRL is the most direct path for many developers.
Match the workflow to your hardware
The official text QLoRA example is designed around a T4 16GB setup. The vision QLoRA guide calls for a BF16-capable GPU such as NVIDIA L4 or A100 with more than 16GB of memory.
Use QLoRA when efficiency matters
QLoRA keeps the base model quantized to 4-bit, freezes the original weights, and trains only the added LoRA adapters. This lowers memory usage while preserving strong task performance.
Prepare data in the right format
Build a dataset that directly matches the behavior you want, then format it for conversation-style training with TRL and SFTTrainer. The official text guide uses a large synthetic text-to-SQL dataset.
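The conversation-style formatting step might look like the sketch below for a text-to-SQL pair. The system wording and the to_chat_example helper are illustrative; the messages structure with role/content entries is the conversational format TRL's SFTTrainer accepts.

```python
def to_chat_example(question: str, schema: str, sql: str) -> dict:
    """Format one text-to-SQL pair as a conversational training row."""
    return {
        "messages": [
            {"role": "system",
             "content": "Translate the question into SQL using the given schema."},
            {"role": "user",
             "content": f"Schema:\n{schema}\n\nQuestion: {question}"},
            {"role": "assistant", "content": sql},
        ]
    }
```

Mapping this function over your raw pairs yields rows that SFTTrainer can render through the model's chat template, so training examples mirror the instruction-tuned format Gemma 4 already understands.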
Evaluate, compare, and deploy
After training, run inference checks against your base model, verify task gains, and then deploy the tuned model or adapter. Treat deployment format as an early decision because framework choice affects the output format you get.
Quick Tips
- Start with QLoRA and a T4-class GPU for text tasks — full fine-tuning is rarely needed for task adaptation.
- Format your dataset to mirror the instruction-tuned chat format that Gemma 4 already understands.
- Keep your eval set from the same distribution as your training data to get meaningful improvement signals.
- The 26B A4B MoE model activates only a fraction of its parameters per token, but its total parameter count still determines checkpoint size during training.
- Use the Gemma 4 -it checkpoint as your starting point for instruction tasks rather than the pre-trained base.
Gemma 4 Prompt Guide
Gemma 4 introduces a new turn-based prompt format with native system instructions, multimodal placeholders, and built-in controls for thinking and tool use.
This guide turns the official Gemma 4 format into a practical prompt library. Structure every interaction as turns, use the system role for behavior and global rules, insert image or audio placeholders where needed, and only enable thinking or tool use when the task actually benefits from them.
Core chat skeleton
Gemma 4 uses native system, user, and model roles, wrapped in turn markers.
- Use system for global instructions
- Use user for the current request
- Use model as the generation start point
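At the framework level, that skeleton is just a messages list handed to the chat template. The sketch below assembles it; build_messages is a name invented here, and the <|think|> placement follows the thinking pattern described in this guide.

```python
def build_messages(system: str, user: str, thinking: bool = False) -> list:
    """Assemble a Gemma 4 chat: one system turn for global rules, then the user turn.

    Placing <|think|> in the system turn enables thinking mode per the guide.
    """
    sys_text = ("<|think|>\n" if thinking else "") + system
    return [
        {"role": "system", "content": sys_text},
        {"role": "user", "content": user},
    ]
```

Passing this list to a processor's apply_chat_template call would render the actual turn markers, so application code never writes control tokens by hand.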
System prompt pattern
Put stable behavior rules in one system turn instead of repeating them every time.
- Good for style, scope, and output format
- Native system role support starts with Gemma 4
- Keep it concise and task-specific
Multimodal placeholders
Use placeholder tokens to indicate where image and audio embeddings should be inserted.
- Use <|image|> for images
- Use <|audio|> for audio
- The processor replaces placeholders with embeddings after tokenization
Thinking-ready prompt
Thinking mode is activated by placing <|think|> inside the system instruction.
- Enable it for reasoning-heavy tasks
- Keep it off for simple direct generation
- Use one system turn for both thinking and other global instructions
Tool-aware prompt structure
Tool declarations belong in the system turn, and tool calls and tool responses are handled with dedicated control tokens.
- Useful for APIs, search, calculators, and external data lookups
- Tool use is structured, not plain-text pretending
- Reasoning and tool use can happen in the same turn
Gemma 4 Thinking Mode
Thinking mode lets Gemma 4 produce a reasoning channel before the final answer, and the processor can separate both parts for application use.
Thinking mode is best for tasks where the model benefits from intermediate reasoning before it answers: ambiguous questions, math, coding, tool planning, and multimodal analysis. In Gemma 4, you can enable it at the chat-template level, stream the reasoning live, and then split the output into a thinking block and a user-facing answer block.
Choose the right tasks
Use thinking mode when the request needs decomposition, comparison, planning, or careful interpretation rather than a short direct reply.
- Good fits: math, code debugging, structured decision-making, image-plus-text reasoning
- Less necessary for simple rewrites, short summaries, or straightforward facts
- Official examples cover both text-only and image-text workflows
Enable thinking in the chat template
With Hugging Face Transformers, set enable_thinking=True in apply_chat_template(). At the token level, Gemma 4 uses <|think|> in the system turn.
- E2B and E4B: thinking OFF uses a simple user-model flow; thinking ON adds a system turn with <|think|>
- 26B A4B and 31B: official templates include an empty thinking token when thinking is off to stabilize output
- Thinking is designed to be enabled at the conversation level
Generate and separate the result
The model can emit a reasoning channel first and the final answer after it. You can stream it with TextStreamer and split it with parse_response().
- processor.parse_response() returns separated thinking and answer content
- This works for text prompts and image-text prompts
- The reasoning channel can also include tool calls when the turn becomes agentic
Handle multi-turn chats correctly
For normal multi-turn conversations, strip the thoughts generated in previous turns before sending the history back. In tool-calling turns, keep the thought flow intact until the tool cycle finishes.
- Regular chat: remove prior thought blocks before the next turn
- Tool-use exception: do not remove thoughts between function calls inside the same turn
- This keeps context clean while preserving agentic behavior
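A minimal history-cleaning helper might look like this. It assumes each model turn stores its reasoning under a separate "thinking" key (as parse_response-style separation makes possible); the tool_cycle_open flag and clean_history name are hypothetical conventions for this sketch.

```python
def clean_history(history: list) -> list:
    """Drop stored thinking from completed model turns before re-sending a chat.

    Turns flagged with tool_cycle_open keep their thoughts, matching the
    tool-use exception described above.
    """
    cleaned = []
    for turn in history:
        if turn.get("role") == "model" and not turn.get("tool_cycle_open", False):
            turn = {k: v for k, v in turn.items() if k != "thinking"}
        cleaned.append(turn)
    return cleaned
```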
Gemma 4 Function Calling
Gemma 4 supports native structured tool use, letting the model request functions instead of faking external actions in plain text.
Function calling is the practical bridge between model output and real application behavior. Instead of asking Gemma 4 to guess live data or simulate actions, you define tools, let the model generate a structured call, execute the function in your app, and then feed the result back so the model can finish with a clean natural-language answer.
Define tools clearly
Pass tools through apply_chat_template() using either a manual JSON schema or a raw Python function converted to schema.
- Manual JSON schema is best when you need precise nested parameters
- Raw Python functions are convenient for simple tools with clear type hints and docstrings
- Tool definitions should include name, description, parameter types, and required fields
Let the model request a tool
Gemma 4 receives the user prompt plus available tools and returns a structured function call object rather than plain text when a tool is needed.
- Tool use is controlled with dedicated tokens such as tool, tool_call, and tool_response
- A typical example is a weather or search function
- This is better than plain text when the answer depends on external state or system actions
Validate and execute in your app
Gemma 4 cannot execute code on its own. Your application must parse the function name and arguments, validate them, and run the real function safely.
- Always validate function names and arguments before execution
- Do not rely on generated code without safeguards
- For production systems, map tool names to approved handlers instead of dynamic execution
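That allowlist pattern can be sketched in a few lines. The get_weather handler and execute_tool_call are hypothetical names, and the fixed return value is dummy data; the point is the validation-before-dispatch shape.

```python
# Allowlist of approved handlers; get_weather is a hypothetical example tool.
HANDLERS = {
    "get_weather": lambda city: {"city": city, "temp_c": 21},  # dummy result
}

def execute_tool_call(call: dict):
    """Validate a model-generated tool call, then dispatch to an approved handler."""
    name = call.get("name")
    args = call.get("arguments", {})
    if name not in HANDLERS:
        raise ValueError(f"unknown tool: {name!r}")
    if not isinstance(args, dict):
        raise TypeError("tool arguments must be a JSON object")
    return HANDLERS[name](**args)
```

The dict return value can be appended to the chat history as the tool response, keeping the result structured so the model can ground its final answer.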
Return tool output for the final answer
Append the tool result back into the chat history, then let Gemma 4 produce the final user-facing response.
- Official workflow: define tools, model turn, developer turn, final response
- This pattern works for APIs, live lookups, calculators, settings updates, and agent loops
- Tool responses should stay structured so the model can ground the final answer correctly
Gemma 4 Multimodal Guide
Gemma 4 handles text and image across all models, supports video as frames, and adds native audio support on E2B and E4B.
Gemma 4 is built for multimodal input. All models support image and video-style visual understanding, the small models add audio input, and the runtime lets you trade off visual detail against speed using token budgets. That makes Gemma 4 suitable for OCR, captioning, object detection, speech tasks, and mixed media prompts inside one chat flow.
Image understanding
All Gemma 4 models support text-plus-image workflows.
- Common tasks: OCR, object detection, visual question answering, image captioning
- Supports reasoning across multiple images in one prompt
- Best for screenshots, documents, product images, and scene analysis
Video understanding
All Gemma 4 models can process video as a sequence of frames.
- Good for scene description, human interaction, and situational summaries
- Video is passed as a content item in the messages array
- Maximum supported video length is 60 seconds at 1 frame per second
Audio understanding
Audio is available on the E2B and E4B models.
- Supports multilingual speech recognition, speech translation, and general speech understanding
- Audio token cost is 25 tokens per second
- Maximum audio length is 30 seconds
Visual token budgets
Gemma 4 introduces variable-resolution image processing so you can choose speed or detail based on the task.
- Supported image budgets: 70, 140, 280, 560, 1120 tokens
- Lower budgets for faster classification, captioning, and video frame analysis
- Higher budgets for OCR, document parsing, and reading small text
Input preparation rules
The processor handles much of the media formatting, but a few limits matter in production.
- Audio should be mono, 16 kHz, float32, normalized to [-1, 1]
- Image file support depends on the framework used to convert files into tensors
- Prompt quality still matters: specific instructions outperform vague multimodal requests
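The audio limits above can be enforced before any model call. This sketch normalizes signed PCM samples into [-1, 1] and budgets tokens at the documented 25 tokens per second; resampling to 16 kHz and float32 casting are left to your audio library, and both function names are invented here.

```python
import math

def prepare_audio(samples, sample_width_bits: int = 16) -> list:
    """Normalize signed PCM integer samples to floats in [-1, 1] (mono assumed)."""
    scale = float(2 ** (sample_width_bits - 1))
    return [max(-1.0, min(1.0, s / scale)) for s in samples]

def audio_token_cost(seconds: float) -> int:
    """Token cost at 25 tokens per second, enforcing the 30-second input limit."""
    if seconds > 30:
        raise ValueError("Gemma 4 audio input is limited to 30 seconds")
    return math.ceil(seconds * 25)
```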
Model capability split
Use the smallest models for mobile and speech-heavy use cases, and the larger models for heavier reasoning with long context.
- E2B and E4B: audio-enabled small models with 128K context
- 26B A4B and 31B: larger reasoning-focused models with 256K context
- All four official sizes available in base and instruction-tuned variants
Gemma 4 GGUF and Quantization
Choose the smallest Gemma 4 footprint that still meets your quality needs
For most local setups, the practical decision is whether to stay with E2B or E4B, or move up to a 26B A4B GGUF build. Google documents approximate memory needs for BF16, SFP8, and 4-bit-style deployment choices across all four official sizes.
Official local entry points
Google's Ollama guide exposes four Gemma 4 tags: gemma4:e2b, gemma4:e4b, gemma4:26b, and gemma4:31b. LM Studio also supports Gemma models in both GGUF and MLX formats for fully local inference.
Start with E2B or E4B for a lighter local loop, and move to 26B or 31B only when you have the RAM budget and want a stronger reasoning model.
Approximate memory by official size
Google lists approximate inference memory as E2B 9.6 GB BF16 / 3.2 GB Q4_0, E4B 15 GB / 5 GB, 26B A4B 48 GB / 15.6 GB, and 31B 58.3 GB / 17.4 GB.
If your target is a mainstream local machine, 4-bit-style deployment or a smaller model size is usually the line between runnable and impractical.
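The published figures make that line easy to check programmatically. The table below copies the approximate memory numbers quoted above; runnable_models is a helper name invented for this sketch, and real headroom for context and OS overhead is not modeled.

```python
# Approximate inference memory in GiB, per Google's published figures above.
APPROX_GIB = {
    "e2b":     {"bf16": 9.6,  "q4_0": 3.2},
    "e4b":     {"bf16": 15.0, "q4_0": 5.0},
    "26b-a4b": {"bf16": 48.0, "q4_0": 15.6},
    "31b":     {"bf16": 58.3, "q4_0": 17.4},
}

def runnable_models(available_gib: float, precision: str = "q4_0") -> list:
    """List the Gemma 4 sizes whose approximate memory fits the given budget."""
    return [m for m, needs in APPROX_GIB.items() if needs[precision] <= available_gib]
```

For example, a 16 GiB budget at 4-bit covers everything except 31B, while the same budget at BF16 covers only E2B and E4B, which matches the runnable-versus-impractical line drawn above.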
Official 26B A4B GGUF example
The official ggml-org Gemma 4 26B A4B IT GGUF page recommends llama-server for startup and lists Q4_K_M at 16.8 GB, Q8_0 at 26.9 GB, and F16 at 50.5 GB.
Q4_K_M is the most practical default when you want a large local Gemma 4 model but cannot afford Q8_0 or full 16-bit memory use.
What quantization changes
Higher parameter counts and higher precision are generally more capable, but they cost more processing cycles, memory, and power. Lower precision reduces those costs but can reduce capability.
Use quantization to fit the model to your hardware: smaller GGUF builds help you run locally, but they are a deployment compromise rather than a free upgrade.
Gemma 4 PyTorch Guide
Run Gemma 4 from a PyTorch-first stack
The fastest Python path for Gemma 4 is Hugging Face Transformers on top of PyTorch: install torch and transformers, pick a Gemma 4 model ID, and begin with pipeline-based text inference before moving into multimodal or tool-enabled workflows.
Install the runtime
Google's Gemma 4 text inference guide starts with torch, accelerate, and transformers, plus dialog for conversation handling.
Pick an official Gemma 4 checkpoint
Google's Gemma 4 examples show four official instruction-tuned IDs: google/gemma-4-E2B-it, google/gemma-4-E4B-it, google/gemma-4-26B-A4B-it, and google/gemma-4-31B-it.
Start with text generation
Use transformers.pipeline with task="text-generation", device_map="auto", and dtype="auto" as the quickest way to get a first response.
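That first step might look like the sketch below. The model load sits behind a main guard because it downloads weights; build_pipeline_kwargs is a helper name invented here, and the settings simply mirror the guide's task, device_map, and dtype choices.

```python
# Sketch of the quickest Gemma 4 text path, assuming torch and transformers
# are installed and the chosen weights are accessible.

MODEL_ID = "google/gemma-4-E4B-it"  # one of the four official -it IDs

def build_pipeline_kwargs(model_id: str = MODEL_ID) -> dict:
    """Arguments for transformers.pipeline, mirroring the guide's settings."""
    return {
        "task": "text-generation",
        "model": model_id,
        "device_map": "auto",  # place layers across available devices
        "dtype": "auto",       # pick precision from the checkpoint config
    }

if __name__ == "__main__":
    from transformers import pipeline
    pipe = pipeline(**build_pipeline_kwargs())
    print(pipe([{"role": "user", "content": "Hello!"}], max_new_tokens=64))
```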
Move to multimodal and tools when needed
For multimodal and function-calling workflows, use AutoProcessor and AutoModelForMultimodalLM with apply_chat_template for tool-aware prompting.
Use native PyTorch for deeper control
Google's PyTorch guide documents Kaggle credential setup, dependency installation, cloning gemma_pytorch, and loading multimodal model classes for experimentation with direct checkpoint control.
Gemma 4 Mobile Deployment
Put Gemma 4 on mobile through the current Android stack
Gemma 4 now has three practical mobile-facing paths: ML Kit Prompt API on AICore preview devices, Android Studio local-model workflows for developer-side usage, and LiteRT-LM for lower-level runtime control across mobile and embedded devices.
Choose the path that matches your goal
Use ML Kit Prompt API on AICore if you are building an Android app experience, Android Studio local models if you want offline coding help, and LiteRT-LM if you need lower-level runtime control.
Prototype on-device with AICore
Google's April 2026 preview lets you target Gemma 4 E2B or E4B through model preference settings inside the Prompt API flow on AICore-enabled devices.
Know the device expectations
Preview models run on AICore-enabled devices and the latest AI accelerators from Google, MediaTek, and Qualcomm. AI Edge Gallery is available for quick model checks on non-AICore devices.
Use Android Studio for developer-side workflows
Android Studio currently recommends Gemma 4 as its local model option. Gemma E4B requires 12 GB RAM and 4 GB storage; Gemma 26B MoE requires 24 GB RAM and 17 GB storage.
Switch to LiteRT-LM for deeper runtime control
LiteRT-LM is a cross-platform library for language model pipelines from phones to embedded systems, with CPU, GPU, and NPU paths including Qualcomm AI Engine Direct and MediaTek NeuroPilot.
Gemma 4 vs Gemma 3
See what actually changes when you move from Gemma 3 to Gemma 4
This comparison is for developers deciding whether to keep an existing Gemma 3 workflow or rebuild around Gemma 4. The clearest differences show up in context length, control format, multimodal scope, and benchmark performance at the top end of each family.
Release and core sizes
Gemma 4 trims the family around clearer deployment tiers: edge-first E-models plus larger workstation-class models.
Context window
For long documents, tool traces, or multi-step history, Gemma 4's larger models open significantly more headroom.
Multimodality
Gemma 4 is the broader multimodal family if your use case moves beyond image-text into video, OCR-heavy flows, or audio-capable edge models.
Prompt and control format
Teams building agents or structured workflows get a cleaner control surface in Gemma 4.
Top-end benchmark snapshot
If upgrading for reasoning, coding, or high-difficulty QA, the top-end Gemma 4 jump is large enough to justify a migration.
Deployment profile
Stay on Gemma 3 when small classic sizes already fit your stack; move to Gemma 4 when you want newer control features, larger-context top models, or stronger edge-oriented variants.