How to use SAM 3 for Python Computer Vision

Introduction

SAM 3 is Meta’s open-vocabulary segmentation model. Instead of clicking on an object or drawing a bounding box, you simply describe what you’re looking for ("red cylinder", "gripper", "white coffee mug"), and SAM 3 returns a segmentation mask and bounding box for every matching instance it finds in the image.

What this guide covers:

  • What you need (hardware and software prerequisites)
  • Setting up the environment on WSL Ubuntu
  • Getting access to the model checkpoint on Hugging Face
  • Installing SAM 3
  • Running your first detection with a text prompt
  • Understanding the output
  • Practical tips for getting reliable results

By the end you will have a working Python script that loads an image, runs SAM 3 with a text prompt, and prints the bounding box and confidence score of the detected object.


Prerequisites

Hardware

SAM 3 is not practical on CPU. You need a CUDA-compatible GPU. A minimum of 8 GB VRAM is recommended; less than that and you will likely run into out-of-memory errors at default resolution.

  • CUDA 12.6 or higher
  • A reasonably modern NVIDIA GPU (RTX 3060 or better is a comfortable baseline)

Operating System — Use WSL Ubuntu

SAM 3 is developed and tested on Linux. If you are on Windows, you must run everything inside WSL (Windows Subsystem for Linux) with Ubuntu. Do not attempt to run the installation natively on Windows: PyTorch CUDA builds and the SAM 3 package expect a Linux environment.

All commands in this guide assume you are working inside a WSL Ubuntu terminal.

Software

  • Python 3.12 or higher
  • Git
  • A Hugging Face account with access to the SAM 3 checkpoint (see below)

Step 1 — Request Access to the Model Checkpoint

SAM 3’s weights are hosted on Hugging Face and are gated: you need to explicitly request access before you can download them.

  1. Go to https://huggingface.co/facebook/sam3
  2. Log in with your Hugging Face account (create one for free if you don’t have one)
  3. Click “Request access” and accept the license terms
  4. Access is typically granted within minutes to a few hours

Once approved, generate a Hugging Face access token:

  1. Go to https://huggingface.co/settings/tokens
  2. Click “New token”, give it a name, select “Read” role, and copy the token

Keep this token handy; you will need it in Step 4.
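If you prefer not to paste the token interactively later, huggingface_hub also reads it from the HF_TOKEN environment variable. A minimal sketch (the value below is a placeholder, not a real token):

```shell
# Alternative to interactive login: export the token for the current shell.
# huggingface_hub picks up HF_TOKEN automatically when downloading gated files.
export HF_TOKEN=hf_your_token_here
```

Add the export line to your ~/.bashrc if you want it to persist across WSL sessions.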


Step 2 — Create a Conda Environment

Create a dedicated environment to keep SAM 3’s dependencies isolated from any other projects.

conda create -n sam3 python=3.12
conda activate sam3

Verify the correct Python version is active:

python --version
# Should output: Python 3.12.x

Step 3 — Install PyTorch with CUDA Support

pip install torch==2.7.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

Verify that PyTorch can see your GPU:

python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"

If cuda.is_available() returns False, your CUDA drivers are either not installed or not visible inside WSL. Fix this before continuing: SAM 3 will not run without GPU access.


Step 4 — Authenticate with Hugging Face

Install the Hugging Face Hub library and log in using the token you generated in Step 1:

pip install huggingface_hub
hf auth login

Paste your token when prompted. This stores credentials locally so that SAM 3 can automatically download the checkpoint when you first run the model.


Step 5 — Clone and Install SAM 3

git clone https://github.com/facebookresearch/sam3.git
cd sam3
pip install -e .

The -e flag installs the package in editable mode, meaning changes to the cloned source files take effect immediately without reinstalling. This is useful if you want to inspect the source code or make modifications later.

Verify the installation:

python -c "from sam3.model_builder import build_sam3_image_model; print('SAM 3 installed successfully')"

Step 6 — Run Your First Detection

Create a file called detect.py with the following content. Replace "your_image.jpg" with the path to any image file, and "your object" with a description of something visible in that image.

import torch
from PIL import Image
from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Load the model — this downloads the checkpoint on first run (~1.7 GB)
# and takes 10–30 seconds depending on your hardware.
model = build_sam3_image_model()
processor = Sam3Processor(model)

# Load an image
image = Image.open("your_image.jpg")

# Pass the image to the processor
inference_state = processor.set_image(image)

# Run detection with a text prompt
output = processor.set_text_prompt(state=inference_state, prompt="your object")

# Unpack results
masks  = output["masks"]   # Binary segmentation masks — shape: (N, H, W)
boxes  = output["boxes"]   # Bounding boxes — shape: (N, 4) in [x_min, y_min, x_max, y_max] format
scores = output["scores"]  # Confidence scores — shape: (N,)

if len(scores) == 0:
    print("No objects detected. Try a different prompt or check the image.")
else:
    print(f"Found {len(scores)} instance(s).")
    for i, (box, score) in enumerate(zip(boxes, scores)):
        print(f"  Instance {i+1}: score={score:.2f}, box={box.tolist()}")

Run it:

python detect.py

Expected Output

If the object is present in the image, you will see something like:

Found 2 instance(s).
  Instance 1: score=0.91, box=[142.3, 88.7, 310.5, 412.1]
  Instance 2: score=0.73, box=[502.1, 201.3, 680.4, 455.8]

Each box is in pixel coordinates: [x_min, y_min, x_max, y_max]. The masks array contains one binary mask per instance at the same resolution as the input image; a True value means that pixel belongs to the detected object.

On first run, SAM 3 will download the model checkpoint from Hugging Face (~1.7 GB). This is a one-time download. Subsequent runs load from cache and start in 10–30 seconds.
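To sanity-check detections visually, you can overlay each mask on the input image. A minimal sketch, assuming masks is (or has been converted to) a NumPy boolean array of shape (N, H, W) as described above — if your build returns torch tensors, call .cpu().numpy() first:

```python
import numpy as np
from PIL import Image

def overlay_masks(image, masks, color=(255, 0, 0), alpha=0.5):
    """Blend each binary mask over the image in a translucent color."""
    result = np.array(image.convert("RGB"), dtype=np.float32)
    for mask in masks:
        # Boolean-index the masked pixels and blend the highlight color in.
        result[mask] = (1 - alpha) * result[mask] + alpha * np.array(color, dtype=np.float32)
    return Image.fromarray(result.astype(np.uint8))

# Example with the output from Step 6:
# overlay_masks(image, masks.astype(bool)).save("overlay.png")
```

Saving the overlay next to the printed scores makes it much easier to judge whether a low-confidence detection is a real object or a false positive.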


Step 7 — Filter by Confidence

In practice, you will want to ignore low-confidence detections before acting on the results:

THRESHOLD = 0.5
valid_results = [
    (mask, box, score)
    for mask, box, score in zip(masks, boxes, scores)
    if score > THRESHOLD
]

if not valid_results:
    print("No confident detections above threshold.")
else:
    best_mask, best_box, best_score = max(valid_results, key=lambda x: x[2])
    print(f"Best detection: score={best_score:.2f}, box={best_box.tolist()}")

0.5 is a reasonable starting point. Lower it if you are missing real objects; raise it if you are getting false positives.
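For downstream use (for example, pointing a gripper at the detected object), a common next step is to reduce the best mask to a single target point. A minimal sketch, assuming best_mask from the snippet above is a boolean NumPy array of shape (H, W):

```python
import numpy as np

def mask_centroid(mask):
    """Return the (x, y) pixel centroid of a boolean mask, or None if it is empty."""
    ys, xs = np.nonzero(mask)  # row/column indices of all True pixels
    if len(xs) == 0:
        return None
    return float(xs.mean()), float(ys.mean())

# Example: target = mask_centroid(best_mask)
```

The centroid is in the same pixel coordinate frame as the bounding boxes, so it can be fed directly into whatever image-to-world transform your application uses.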


Tips for Good Results

Be specific in your prompt. "blue plastic cube" works better than "object". "metallic cylindrical container" will outperform "thing on the table". Match the prompt to what the camera can clearly distinguish.

Prompt for what is visually distinct. SAM 3 works from visual features. If two objects look nearly identical, differentiate them by color, material, or position: "cube on the left", "red one".

Don’t run it on every frame. SAM 3 takes 1–5 seconds per image depending on your GPU. It is not a real-time detector. Trigger inference only when needed (for example, when a task starts and you need to locate a target) rather than on a continuous video stream.

GPU memory. The model uses roughly 4–6 GB VRAM at typical image resolutions. If you hit out-of-memory errors, reduce input image resolution before passing it to processor.set_image():

image = image.resize((640, 480))
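Note that a fixed (640, 480) resize distorts the aspect ratio of most images. If that matters for your use case, cap the longest side instead; a minimal sketch using Pillow's thumbnail(), which preserves aspect ratio and never enlarges:

```python
from PIL import Image

def shrink_to_fit(image, max_side=640):
    """Downscale so the longest side is at most max_side, keeping the aspect ratio."""
    image = image.copy()
    image.thumbnail((max_side, max_side))  # resizes in place, never enlarges
    return image

# Usage before Step 6's processor.set_image():
# image = shrink_to_fit(Image.open("your_image.jpg"))
```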

The model downloads on first use. On first run you will see Hugging Face download progress. This is normal. If it fails with a 401 error, your hf auth login credentials are either missing or expired; re-run hf auth login.


