Moving from the opaque "black box" of parallel debugging to a transparent, observable mental model.
There is a specific kind of pain when debugging parallel code. You launch 512 threads, and... silence. Or a race condition that happens once every thousand runs.
I realized I didn't actually understand how a GPU schedules work. I knew the theory—SIMT, warps, barriers—but I couldn't see it.
"If the entire state of the GPU is just a set of NumPy arrays, then the state is plottable."
> Segfault: Thread 42 out of bounds
> Memory Access Violation (Address 0x004F)
> ... (Opaque hardware state)
Visualizing memory hotspots in real time
TinyGPU is designed to be fully observable. The components below show how the system transforms code into visual insight.
- Assembler: parses .tgpu assembly files, converting human-readable text into numeric instructions.
- Simulator core: the heavy lifter. Uses NumPy for vectorized state (Registers, Memory, PC) and handles the SIMT logic.
- Visualizer: the "Flight Recorder". Replays the execution history as a frame-by-frame heatmap GIF.
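To get a feel for the assembler's half of that pipeline, here is a rough sketch of lowering one line of .tgpu text into numbers. The opcode values, and any mnemonics beyond ADD and SYNC, are made up for illustration; they are not TinyGPU's real encoding.

# Illustrative opcode table -- not TinyGPU's actual instruction encoding.
OPCODES = {"ADD": 0, "LD": 1, "ST": 2, "SYNC": 3}

def assemble_line(line):
    # "ADD R0, R1, R2" -> (0, [0, 1, 2])
    mnemonic, *operands = line.replace(",", " ").split()
    nums = [int(tok.lstrip("Rr")) for tok in operands]
    return OPCODES[mnemonic.upper()], nums

print(assemble_line("ADD R0, R1, R2"))  # (0, [0, 1, 2])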
Instead of creating a Python object for every thread (which is slow), TinyGPU uses NumPy for everything. The registers are a single 2D array: self.registers = np.zeros((num_threads, num_registers)). This mimics the SIMD nature of real hardware.
import numpy as np

class TinyGPU:
    def __init__(self, num_threads, memory_size):
        self.memory = np.zeros(memory_size)              # global memory shared by all threads
        self.registers = np.zeros((num_threads, 8))      # one row of 8 registers per thread
        self.pc = np.zeros(num_threads, dtype=int)       # per-thread program counter
        self.active = np.ones(num_threads, dtype=bool)   # execution mask: which threads are live
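For example, instantiating it (the 512/1024 sizes here are just placeholders) gives you a handful of arrays you can inspect or plot directly:

gpu = TinyGPU(num_threads=512, memory_size=1024)
print(gpu.registers.shape)   # (512, 8) -- one row of registers per thread
print(gpu.active.sum())      # 512 -- every thread starts out live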
Writing assembly is hard. Describe the logic of a parallel algorithm, and the AI will generate the .tgpu assembly code using the TinyGPU instruction set.
This interactive demo recreates the report's Odd-Even Transposition Sort example.
The Bar Chart represents Global Memory. Each bar is a value. In a parallel sort, adjacent pairs are compared and swapped simultaneously.
The Active Thread Mask shows which threads are currently executing (green).

The Insight: A GPU is just a state machine. If state is data, it can be vectorized.
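To make the demo's compare-and-swap phase concrete (and the insight literal), here is a minimal NumPy sketch of a single even phase, written directly against a memory array rather than through TinyGPU's assembly:

import numpy as np

mem = np.array([5, 1, 4, 2, 8, 7, 3, 6])     # global memory, the demo's bar chart
left, right = mem[0::2], mem[1::2]           # views onto adjacent (even, odd) pairs
swap = left > right                          # compare every pair simultaneously
left[swap], right[swap] = right[swap], left[swap]   # swap the out-of-order pairs
print(mem)  # [1 5 2 4 7 8 3 6]; the odd phase then pairs mem[1:-1:2] with mem[2::2]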
Instead of looping 512 times in Python (slow), TinyGPU uses NumPy slicing. ADD R0, R1, R2 becomes a single array operation: regs[:,0] = regs[:,1] + regs[:,2]. This aligns Python's strengths (C-backed arrays) with the GPU's nature (SIMD).
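A rough sketch of that idea, using the regs array from above (the setup values are arbitrary and the exact execute step is an assumption, not the simulator's code):

import numpy as np

regs = np.zeros((512, 8))            # 512 threads x 8 registers
regs[:, 1] = 10                      # every thread's R1
regs[:, 2] = np.arange(512)          # every thread's R2 holds its thread id

# ADD R0, R1, R2: one NumPy expression updates R0 for all 512 threads at once.
regs[:, 0] = regs[:, 1] + regs[:, 2]
print(regs[:3, 0])                   # [10. 11. 12.]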
The Challenge: What happens when Thread 0 takes the if and Thread 1 takes the else? Real GPUs use an execution mask. In TinyGPU, I implemented self.active, a boolean array. Instructions only update state where active == True. Threads that don't take the branch execute "no-ops" until paths converge.
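A standalone sketch of that masking (not the simulator's exact code) shows how inactive threads become effective no-ops:

import numpy as np

regs = np.zeros((4, 8))
regs[:, 1], regs[:, 2] = 5, 7
active = np.array([True, False, True, False])   # threads 0 and 2 took this branch

# ADD R0, R1, R2 under the execution mask:
# inactive threads keep their old R0, i.e. they execute a no-op.
regs[:, 0] = np.where(active, regs[:, 1] + regs[:, 2], regs[:, 0])
print(regs[:, 0])  # [12.  0. 12.  0.]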
The Struggle: Implementing SYNC barriers in a serial loop. I had to create a sync_waiting mask. Threads hit the barrier, mark themselves waiting, and do nothing until all(active_threads) are waiting. Debugging the barrier logic itself was a meta-challenge.
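A stripped-down sketch of that bookkeeping (sync_waiting and active match the names above; the hit_sync helper and the release step are my own illustration):

import numpy as np

num_threads = 4
active = np.ones(num_threads, dtype=bool)
sync_waiting = np.zeros(num_threads, dtype=bool)
pc = np.zeros(num_threads, dtype=int)

def hit_sync(thread_ids):
    # A thread reaching SYNC marks itself waiting and stops advancing its pc.
    sync_waiting[thread_ids] = True
    if sync_waiting[active].all():   # every live thread is now at the barrier
        sync_waiting[:] = False      # release the barrier
        pc[active] += 1              # all threads step past SYNC together

hit_sync([0]); hit_sync([2]); hit_sync([1, 3])
print(pc)  # [1 1 1 1] -- nothing moved until the last thread arrived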
I sacrificed raw speed for "Observability". It runs thousands of ops/sec, not billions. But this slowness allows the "Flight Recorder" to capture every single state change for replay.
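A sketch of what that per-step capture might look like, reusing the TinyGPU instance from earlier (the history list and snapshot fields are illustrative, not the recorder's actual format):

# Hypothetical "Flight Recorder" capture: one snapshot per simulated cycle.
history = []

def record(gpu, step):
    history.append({
        "step": step,
        "memory": gpu.memory.copy(),        # copy so later cycles can't mutate the frame
        "registers": gpu.registers.copy(),
        "active": gpu.active.copy(),
    })

# Since each frame is plain NumPy data, the heatmap GIF is just
# history[i]["memory"] plotted frame by frame.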
"When you build a simulator, the magic dissolves. The GPU is no longer
a beast to be tamed; it’s just a machine looping over arrays."
Observability is Feature #1.