OpenAI is late to the open-source LLM party, but it has finally shown up in style. The company has released two open-weight large language models, GPT-OSS-20B and GPT-OSS-120B, its first open model release since Whisper and CLIP.
Licensed under Apache 2.0, the GPT-OSS models can be freely run (locally or in the cloud), modified, fine-tuned, and even commercialized. GPT-OSS arrives with a compelling combination of strong reasoning, fast inference, and cost-effective deployment.

Key Specs and Features
OpenAI’s models are built to cover two distinct use cases:
- GPT-OSS-20B: Optimized for local deployment on high-end consumer hardware such as laptops and desktops with 16–32 GB of RAM. It offers reasoning capability comparable to OpenAI’s own o3-mini model.
- GPT-OSS-120B: A large-scale model with reasoning ability on par with o4-mini, designed for data center–grade GPUs. This version is geared toward high-volume workloads and large-scale deployments.
Both models share several advanced capabilities:
- Chain-of-thought reasoning: Can reason step-by-step, useful for complex problem-solving.
- Configurable reasoning effort: Adjustable between low, medium, and high to balance quality and speed (see the sketch after this list).
- Fine-tuning support: Enables building domain-specific or task-specific models.
- Mixture-of-Experts architecture: GPT-OSS-120B includes 128 experts, while GPT-OSS-20B has 32. A routing system activates the relevant experts for efficient and specialized responses.
- Large context window: 128k tokens for handling long documents, multi-turn conversations, or codebases.
- Agentic capabilities: Supports tool use via function calling, web browsing, and Python code execution.
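As a concrete example, here is a minimal sketch of switching reasoning effort when calling a locally served model. It assumes Ollama’s OpenAI-compatible endpoint on its default port and the convention of setting effort in the system prompt; the exact phrasing can vary by serving stack.

```python
# Minimal sketch: setting reasoning effort via the system prompt.
# Assumes Ollama's OpenAI-compatible endpoint on its default port;
# the "Reasoning: high" convention may vary by serving stack.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[
        {"role": "system", "content": "Reasoning: high"},  # low | medium | high
        {"role": "user", "content": "How many prime numbers are below 100?"},
    ],
)
print(response.choices[0].message.content)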
Where to try GPT-OSS
OpenAI’s GPT-OSS family is fully downloadable, but you don’t have to start from scratch. There are several ways to try the models online or grab the weights for local and cloud deployments.
Official OpenAI Playground
OpenAI has launched gpt-oss.com as the primary testing ground for both models.
- No sign-in required: anyone can interact with the models directly in their browser.
- Configurable reasoning effort: switch between low, medium, and high settings to balance speed and accuracy.
- One-click download commands: the playground provides pre-built commands for downloading the models via HuggingFace, Ollama, or LM Studio (a scripted alternative is sketched below).
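If you prefer scripting the download over one-click commands, the huggingface_hub library can fetch the weights directly; a minimal sketch using OpenAI’s published repo id:

```python
# Minimal sketch: downloading the gpt-oss-20b weights from HuggingFace.
from huggingface_hub import snapshot_download

local_path = snapshot_download("openai/gpt-oss-20b")  # downloads all repo files
print(f"Weights saved to: {local_path}")
```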
HuggingFace Downloads & Spaces
Both models are available on HuggingFace, along with:
- Model weights for direct download (including different quantization formats).
- Spaces hosting interactive demos for instant testing in your browser.
- Code snippets for running GPT-OSS with the `transformers` library or integrating it into an existing ML workflow (a minimal example follows this list).
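For instance, here is a minimal chat sketch with the transformers pipeline, assuming a recent transformers release with GPT-OSS support and enough GPU memory:

```python
# Minimal sketch: chatting with gpt-oss-20b via the transformers pipeline.
# Assumes a recent transformers release with gpt-oss support and ~16 GB of VRAM.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",   # let transformers pick the native precision
    device_map="auto",    # spread layers across available devices
)

messages = [{"role": "user", "content": "Explain mixture-of-experts in one paragraph."}]
output = generator(messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
```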
HuggingFace is also the go-to place for fine-tuners, as it hosts community-trained variants and the tools to retrain them on your own datasets.
API Endpoints and OpenRouter
For integration into apps and services, OpenRouter lists multiple providers offering GPT-OSS-20B and GPT-OSS-120B endpoints. You can compare prices, test the models in each provider’s playground, and find documentation for calling each API.
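Because these endpoints follow the OpenAI API format, a standard client works. Here is a minimal sketch with the openai Python SDK; the model slug follows OpenRouter’s naming, so verify it in their catalog:

```python
# Minimal sketch: calling gpt-oss-120b through OpenRouter.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",  # placeholder: set your own key
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "What does the Apache 2.0 license allow?"}],
)
print(response.choices[0].message.content)
```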
Running GPT-OSS Locally
Open-source means you can run GPT-OSS on your own hardware for full control and privacy.
Hardware Requirements
The smaller model runs on desktops and laptops and requires at least 16 GB of RAM. The larger model needs around 80 GB of memory and a powerful GPU; a back-of-envelope estimate follows the list below.
- GPT-OSS-20B
- RAM: 16–32 GB (more is better).
- GPU (recommended): ≥ 20 GB VRAM for smooth throughput; consumer GPUs (e.g., recent RTX 50-series) work well.
- Apple Silicon: M3 Max with ≥ 32 GB unified memory delivers strong performance.
- Multi-GPU: Scales tokens/sec; community reports very high rates with dual high-end GPUs.
- GPT-OSS-120B
- Designed for data-center GPUs (e.g., H100 or A100 class) or top-end workstation cards (e.g., Blackwell RTX 6000). While you can experiment locally, expect heavy memory/VRAM demands and lower throughput.
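As a sanity check on these numbers, here is a rough back-of-envelope memory estimate, assuming MXFP4 weights at about 4.25 bits per parameter plus headroom for the KV cache and activations (ballpark figures, not official requirements):

```python
# Back-of-envelope VRAM estimate: weights at ~4.25 bits/param (MXFP4)
# plus ~25% headroom for KV cache and activations. Rough, unofficial numbers.
def estimate_vram_gb(params_billions: float, bits_per_param: float = 4.25,
                     overhead: float = 1.25) -> float:
    weights_gb = params_billions * bits_per_param / 8  # billions of params -> GB
    return weights_gb * overhead

print(f"gpt-oss-20b  (~21B params):  ~{estimate_vram_gb(21):.0f} GB")   # ~14 GB
print(f"gpt-oss-120b (~117B params): ~{estimate_vram_gb(117):.0f} GB")  # ~78 GB
```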
LM Studio
The latest release of LM Studio is capable of running both GPT-OSS models. It’s easy to use and works on Linux, macOS, and Windows.
To run GPT-OSS on LM Studio:
- Install: LM Studio (Windows/macOS/Linux).
- Search: look for `gpt-oss` in the Models tab.
- Pick a quantization: choose one that fits your machine (e.g., MXFP4, the 4-bit variant suited to laptops).
- Download: once it finishes, select the model in the top bar.
- Chat: the model is ready to use, fully offline. You can also query it from code, as sketched below.
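LM Studio can also expose a local OpenAI-compatible server (via its Developer tab). Here is a minimal sketch for querying it, assuming the default port and the model identifier the app displays:

```python
# Minimal sketch: querying LM Studio's local OpenAI-compatible server.
# Port 1234 is LM Studio's default; the model id should match what the app shows.
import requests

resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": "Are you running fully offline?"}],
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```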
Ollama
Ollama provides a simple UI and CLI for running local models. To get started:
- Install: download Ollama from ollama.com (Windows/macOS/Linux).
- Select: choose the gpt-oss-20b or gpt-oss-120b model.
- Stay local: disable “turbo” and “search”. Turbo offloads the model to Ollama Cloud, so turn it off if you want a fully offline, private session.
- Prompt: type your prompt. On first run, Ollama downloads the selected model automatically. You can also drive it from Python, as sketched below.
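Ollama also ships a Python client (pip install ollama), so the same local model can be driven from a script. A minimal sketch, assuming the Ollama service is running:

```python
# Minimal sketch: chatting with a local gpt-oss model via the ollama package.
# Assumes the Ollama service is running and the model tag has been pulled.
import ollama

response = ollama.chat(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Give me three uses for a local LLM."}],
)
print(response["message"]["content"])
```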
vLLM and Llama.cpp
vLLM and Llama.cpp provide open-source tooling for running models as a backend and building AI-powered applications. These projects target developers and AI operators who want full control over their models rather than end users.
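For example, vLLM’s offline inference API loads and queries a model in a few lines. A minimal sketch, assuming a recent vLLM build with GPT-OSS support and sufficient GPU memory:

```python
# Minimal sketch: offline inference with vLLM.
# Assumes a recent vLLM release with gpt-oss support and enough GPU memory.
from vllm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-20b")  # fetches weights from HuggingFace on first run
params = SamplingParams(temperature=0.7, max_tokens=200)

outputs = llm.chat(
    [{"role": "user", "content": "Write a haiku about open-weight models."}],
    sampling_params=params,
)
print(outputs[0].outputs[0].text)
```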
In the case of Llama.cpp, you have to download the model weights manually. You can find these files on HuggingFace:
- Select Models on the top navigation bar
- Select GGUF on the left-side filter
- Search for “GPT-OSS”. You’ll find several options; the MXFP4 variants correspond to the 4-bit format of the original release.
- Download the GGUF files and use the following command to start an interactive session in the terminal:
./llama-cli -m /path/to/gpt-oss-20b-Q4.gguf
Self-hosting in the Cloud
Running GPT-OSS in the cloud gives you the flexibility of on-demand infrastructure and the privacy of self-managed hosting, but it comes with tradeoffs.
Advantages
- Scalability on demand: spin up extra GPU instances when traffic spikes.
- No hardware procurement: skip the high upfront costs and long lead times for powerful GPUs.
- Global availability: deploy close to your users for reduced latency.
- Full control over data: unlike public inference APIs, your prompts and outputs stay within your environment.
Disadvantages
- Recurring cost: high-end GPU instances can be thousands of dollars per month.
- Management overhead: requires monitoring, scaling, patching, and handling downtime.
- Vendor dependency: if your cloud provider changes pricing or hardware availability, you may be forced to adjust your setup.
Cost estimates
GPT-OSS-120B generally requires a data center–grade H100 or a high-end workstation GPU such as the Blackwell RTX 6000. Estimated monthly costs for dedicated instances are:
| Hardware | Typical Cloud Cost (Monthly) | Use Case |
|---|---|---|
| NVIDIA RTX 6000 (48 GB) | ~$4,000 | Small-scale deployments |
| NVIDIA H100 (80 GB) | ~$9,000–$11,000 | Large-scale deployments |
| NVIDIA H200 (141 GB) | ~$10,000–$12,000 | Extreme workloads or larger fine-tunes |
Prices vary by provider, commitment length, and region, but this range gives a realistic ballpark for dedicated GPU hosting.
Shared GPU Instances
Many cloud providers offer GPU-powered shared instances. These are cost-effective options that trade some privacy and isolation for a lower price.
- RunPod: on-demand and spot pricing for H100s; good for experimentation or burst workloads.
- Jarvis Labs: GPU rentals with pre-configured ML environments; flexible billing.
- NVIDIA Cloud: direct GPU rentals (H100/H200) with enterprise support options.
VPS with GPUs
If you want full control and privacy, you can run a private VM in the cloud. For example:
- AWS (p5 series): H100-powered instances in multiple regions.
- Google Cloud (A3 series): H100 GPUs with high-bandwidth networking for distributed training/inference.
- Linode: dedicated RTX 6000 GPU instances.
Conclusion
With GPT-OSS, OpenAI has finally stepped into the open-source LLM arena, years after competitors like Meta, Mistral, and DeepSeek started the trend of opening their weights.
While benchmarks suggest they won’t dethrone the very latest closed-source frontier models, GPT-OSS offers an appealing balance of intelligence, speed, and cost-efficiency for anyone who values transparency and control. The real question now is how quickly the community will adopt these models. Have you tried them? Leave a comment with your experience.
Thank you for reading, and happy building!