Introduction
What is Tileon?
Tileon is a language and compiler for parallel programming. It aims to provide a Python-based programming environment for productively writing custom DNN compute kernels capable of running at maximal throughput on modern GPU hardware.
Inspired by Triton, Tileon offers an intuitive way to write high-performance parallel compute kernels while maintaining readability and ease of use.
Key Concepts
- Tile-based Programming: Divides data into tiles (blocks) that can be processed in parallel
- Just-in-Time Compilation: Kernels are compiled at runtime using the
@tileon.jitdecorator - Python-based DSL: Write kernels using familiar Python syntax
- Automatic Parallelization: The compiler handles parallel execution details
Why Tileon?
- Productivity: Python-based DSL for faster development
- Performance: Comparable to CUDA/C++ implementations
- Readability: Clear, understandable kernel code
- Flexibility: Easy to customize and extend
Installation
See Installation for detailed installation instructions.
Quick Example
import torch
import tileon
import tileon.language as tl
@tileon.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
pid = tl.program_id(axis=0)
block_start = pid * BLOCK_SIZE
offsets = block_start + tl.arange(0, BLOCK_SIZE)
mask = offsets < n_elements
x = tl.load(x_ptr + offsets, mask=mask)
y = tl.load(y_ptr + offsets, mask=mask)
output = x + y
tl.store(output_ptr + offsets, output, mask=mask)
def add(x: torch.Tensor, y: torch.Tensor):
output = torch.empty_like(x)
n_elements = output.numel()
grid = lambda meta: (tileon.cdiv(n_elements, meta['BLOCK_SIZE']), )
add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
return output
Next Steps
- Vector Addition - Your first Tileon kernel
- Matrix Multiplication - GEMM implementation
- tileon.language - Language reference