PROJECT RETROSPECTIVE

I Built a GPU Simulator from Scratch in Python

Moving from the opaque "black box" of parallel debugging to a transparent, observable mental model.

512 Threads
1 Visualizer
0 Hardware

The Motivation

The Headache of Parallel Debugging

There is a specific kind of pain when debugging parallel code. You launch 512 threads, and... silence. Or a race condition that happens once every thousand runs.

I realized I didn't actually understand how a GPU schedules work. I knew the theory—SIMT, warps, barriers—but I couldn't see it.

"If the entire state of the GPU is just a set of NumPy arrays, then the state is plottable."

Current Reality

> Segfault: Thread 42 out of bounds

> Memory Access Violation (Address 0x004F)

> ... (Opaque hardware state)

The Goal (TinyGPU)

Visualizing memory hotspots in real-time

The "Glass Box" Architecture

TinyGPU is designed to be fully observable. The three components below show how the system transforms code into visual insight.

1. The Assembler

Parses .tgpu assembly files. Converts human-readable text into numeric instructions.
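As a rough illustration of that text-to-numbers step, here is a minimal sketch of an assembler line parser. The opcode table and operand format are assumptions for illustration; the real .tgpu grammar isn't shown in this post.

```python
# Hypothetical opcode table -- the real .tgpu instruction set may differ.
OPCODES = {"ADD": 0, "SUB": 1, "LD": 2, "ST": 3, "SYNC": 4}

def assemble_line(line):
    """Turn 'ADD R0, R1, R2' into a numeric tuple like (0, 0, 1, 2)."""
    mnemonic, *rest = line.replace(",", " ").split()
    # Register tokens ('R1') become register indices; bare numbers stay immediates.
    operands = [int(tok[1:]) if tok.upper().startswith("R") else int(tok)
                for tok in rest]
    return (OPCODES[mnemonic.upper()], *operands)

print(assemble_line("ADD R0, R1, R2"))  # (0, 0, 1, 2)
```

The core can then execute these tuples directly, with no string handling in the hot loop.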

2. The Core (TinyGPU)

The heavy lifter. Uses NumPy for vectorized state (Registers, Memory, PC). Handles SIMT logic.

3. The Visualizer

The "Flight Recorder". Replays the execution history as a frame-by-frame heatmap GIF.

The Core (TinyGPU)

Instead of creating a Python object for every thread (which is slow), TinyGPU uses NumPy for everything. The registers are a single 2D array: self.registers = np.zeros((num_threads, num_registers)). This mimics the SIMD nature of real hardware.

  • Stores 'PC', 'Registers', 'Memory', 'Flags'
  • Runs the step() cycle
  • Manages the 'Active Mask' for branching

import numpy as np

class TinyGPU:
    def __init__(self, num_threads, memory_size, num_registers=8):
        # Global memory shared by all threads
        self.memory = np.zeros(memory_size)
        # One row of registers per thread (SIMD-style layout)
        self.registers = np.zeros((num_threads, num_registers))
        # Per-thread program counter
        self.pc = np.zeros(num_threads, dtype=int)
        # Execution mask: which threads run the next instruction
        self.active = np.ones(num_threads, dtype=bool)

AI Assembly Architect

Writing assembly is hard. Describe a parallel algorithm below, and the AI will generate the .tgpu assembly code using the TinyGPU instruction set.


Visualizing the "Heartbeat"

This interactive demo recreates the Odd-Even Transposition Sort example from the project report.

The Bar Chart represents Global Memory. Each bar is a value. In a parallel sort, adjacent pairs are compared and swapped simultaneously.
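The compare-and-swap phases described above can be sketched in plain NumPy, independent of the simulator. This mirrors what each simulated thread pair does in parallel; it is an illustration of the algorithm, not TinyGPU's actual execution path.

```python
import numpy as np

def odd_even_sort(mem):
    """Odd-even transposition sort: n phases of parallel pairwise swaps."""
    mem = mem.copy()
    n = len(mem)
    for phase in range(n):
        start = phase % 2           # even phase pairs (0,1),(2,3)...; odd pairs (1,2),(3,4)...
        left = mem[start:n - 1:2]   # views into memory, so swaps update in place
        right = mem[start + 1:n:2]
        swap = left > right         # boolean mask: which pairs are out of order
        left[swap], right[swap] = right[swap], left[swap]
    return mem

print(odd_even_sort(np.array([5, 1, 4, 2, 3])))  # [1 2 3 4 5]
```

Each phase is one "heartbeat" frame in the visualizer: the swap mask is exactly what lights up as active threads.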

Legend: the bar chart shows Global Memory, with bar height encoding value magnitude. Below it, the Active Thread Mask strip shows which threads are executing (green = executing).

Key Engineering Insights

Vectorized State

The Insight: A GPU is just a state machine. If state is data, it can be vectorized.

Instead of looping 512 times in Python (slow), TinyGPU uses NumPy slicing. ADD R0, R1, R2 becomes a single array operation: regs[:,0] = regs[:,1] + regs[:,2]. This aligns Python's strengths (C-backed arrays) with the GPU's nature (SIMD).
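The speedup claim above is easy to demonstrate: the per-thread loop and the single slice operation compute identical results, but the slice runs in C.

```python
import numpy as np

num_threads, num_registers = 512, 8
rng = np.random.default_rng(0)
regs = rng.random((num_threads, num_registers))

# Naive per-thread loop (what an object-per-thread design would do):
slow = regs.copy()
for t in range(num_threads):
    slow[t, 0] = slow[t, 1] + slow[t, 2]

# Vectorized: one NumPy operation executes 'ADD R0, R1, R2' for all 512 threads.
fast = regs.copy()
fast[:, 0] = fast[:, 1] + fast[:, 2]

assert np.allclose(slow, fast)
```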

The Active Mask

The Challenge: What happens when Thread 0 takes the if and Thread 1 takes the else?

Real GPUs use an execution mask. In TinyGPU, I implemented self.active, a boolean array. Instructions only update state where active == True. Threads that don't take the branch execute "no-ops" until paths converge.
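One way to apply such a mask is with np.where, which keeps the old register value wherever a thread is inactive. The post doesn't show TinyGPU's exact dispatch code, so treat this as a sketch of the idea.

```python
import numpy as np

num_threads = 8
regs = np.zeros((num_threads, 8))
regs[:, 1] = np.arange(num_threads)   # R1 = thread id

# Divergent branch: only even-id threads take the 'if' path.
active = regs[:, 1] % 2 == 0

# Masked ADD: R0 = R1 + 10 where active; inactive threads keep old R0 (a "no-op").
regs[:, 0] = np.where(active, regs[:, 1] + 10, regs[:, 0])

print(regs[:, 0])  # [10.  0. 12.  0. 14.  0. 16.  0.]
```

Odd-id threads still "execute" the instruction in lockstep, but the mask makes it a no-op for them, exactly as on real SIMT hardware.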

Synchronization

The Struggle: Implementing SYNC barriers in a serial loop.

I had to create a sync_waiting mask. Threads hit the barrier, mark themselves as waiting, and do nothing until every active thread is waiting. Debugging the barrier logic itself was a meta-challenge.
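The barrier logic described above can be sketched as follows. The name sync_waiting follows the post; the surrounding control flow is a simplified assumption.

```python
import numpy as np

num_threads = 4
active = np.ones(num_threads, dtype=bool)
sync_waiting = np.zeros(num_threads, dtype=bool)
pc = np.zeros(num_threads, dtype=int)

def hit_sync(thread_ids):
    """Threads reaching SYNC mark themselves as waiting and stall."""
    sync_waiting[thread_ids] = True
    # Release only once every active thread is waiting at the barrier.
    if sync_waiting[active].all():
        sync_waiting[:] = False
        pc[active] += 1   # everyone steps past the barrier together

hit_sync([0, 2])
assert not sync_waiting[active].all()   # barrier still holds
hit_sync([1, 3])
assert (pc == 1).all()                  # all threads released together
```

The subtle bug class here: a thread that diverged out of the branch (active == False) must not be counted, or the barrier deadlocks waiting for a thread that will never arrive.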

Performance vs. Visibility

I sacrificed raw speed for "Observability". It runs thousands of ops/sec, not billions. But this slowness allows the "Flight Recorder" to capture every single state change for replay.
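The "Flight Recorder" idea amounts to snapshotting the full state after every step so execution can be replayed frame by frame. This is a hypothetical sketch of that pattern; the actual recorder class isn't shown in the post.

```python
import numpy as np

class Recorder:
    """Capture a copy of the simulator state after each step() for replay."""
    def __init__(self):
        self.frames = []

    def capture(self, gpu_state):
        # Copy, don't reference: later steps mutate the arrays in place.
        self.frames.append({k: v.copy() for k, v in gpu_state.items()})

rec = Recorder()
state = {"memory": np.zeros(4), "active": np.ones(4, dtype=bool)}
rec.capture(state)
state["memory"][0] = 99   # simulate a step mutating memory
rec.capture(state)

assert rec.frames[0]["memory"][0] == 0    # frame 0 preserved the old state
assert rec.frames[1]["memory"][0] == 99
```

Copying every array every cycle is exactly the speed-for-visibility trade described above: each frame later becomes one heatmap image in the GIF.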

Closing Reflection

"When you build a simulator, the magic dissolves. The GPU is no longer a beast to be tamed; it’s just a machine looping over arrays."

Observability is Feature #1.

What Works

  • Visual Intuition of barriers
  • Deterministic Unit Testing
  • No Driver Installation

Limitations

  • Pure Python speed (Slow)
  • Simplified Cache Model
  • Custom Toy ISA

Future Roadmap

  • Warp Divergence Viz
  • Python-to-TinyGPU Compiler
  • Web-Based UI (You are here)