Building a GPU Pipeline with the Spade hardware description language
As seen in my previous post about RustHDL, I have been experimenting with alternative HDLs. Lately, that language has been Spade. As a simple project for learning Spade, I decided to build a vector graphics processing unit. The design uses a systolic array matrix multiplier to transform the vectors and project them onto the screen.
Spade
Spade is an HDL that is heavily inspired by Rust. The inspiration shows in both the overall language design and the tooling.
What I like
- The language
  - Match expressions. These are similar to Rust's: for example, you must handle every enum variant or the compiler reports an error.
  - Variables can only be assigned once. Assigning the same variable in several places is the source of many Verilog bugs.
  - Ports and structured data are first-class concepts, and they are more useful and intuitive than VHDL records.
  - Pipelines. These provide a way to handle timing and synchronization at compile time.
- The tooling
  - The automatic test runner integrates well with cocotb.
  - Easy to update, similar to rustup.
  - Installs the toolchain for Lattice chips (iCE40 and ECP5).
  - Integrates well with the gtkwave and Surfer waveform viewers.
  - External packages can be used in a project. This let me easily use third-party libraries without having to copy someone else's source code into my project like it's 1995.
- The community
  - The project's Discord is very friendly and helpful.
  - Many things I brought up were either fixed or tracked as issues.
What I don't like
- Spade has no code generation facility, so I had to implement repetitive structures by copying and pasting. As a result, many of my modules are not truly parametric.
- The language is still young. I encountered many crashes in the build tools as well as strange error messages.
Systolic Array
A systolic array consists of many cascaded processing elements through which data is "pumped". When an element is done with a piece of data, it hands it off to its neighbor. The concept adapts well to matrix multiplication, but the inputs and outputs need some shaping: the data must be staggered and zero-padded for the computation to be valid.
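To make the staggering and zero-padding concrete, here is a small Python model of one common arrangement, an output-stationary array for square n x n matrices: values of A flow to the right, values of B flow down, and each processing element keeps its own running sum. It is a behavioural sketch for illustration only, not a translation of the Spade design. The function name `systolic_matmul` is hypothetical, and the real array, which has the transformation matrix loaded into it first, may use a different dataflow.

```python
def systolic_matmul(a, b):
    """Cycle-by-cycle model of an output-stationary n x n systolic array.

    A values enter at the left edge and move right one PE per clock; B values
    enter at the top edge and move down one PE per clock. Each processing
    element (PE) multiplies the pair it currently holds and adds the product
    to its own accumulator.
    """
    n = len(a)
    # Input shaping: row i of A is delayed by i leading zeros, column j of B
    # by j leading zeros, and every stream is zero-padded to the same length.
    length = 3 * n - 2
    a_streams = [[0] * i + list(a[i]) + [0] * (length - n - i) for i in range(n)]
    b_streams = [[0] * j + [b[k][j] for k in range(n)] + [0] * (length - n - j)
                 for j in range(n)]

    acc = [[0] * n for _ in range(n)]    # one accumulator per PE
    a_reg = [[0] * n for _ in range(n)]  # A value currently held by each PE
    b_reg = [[0] * n for _ in range(n)]  # B value currently held by each PE

    for t in range(length):
        # Hand A values to the right-hand neighbour (far edge first so each
        # value moves exactly one step), then feed the left edge.
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            a_reg[i][0] = a_streams[i][t]
        # Hand B values to the neighbour below, then feed the top edge.
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            b_reg[0][j] = b_streams[j][t]
        # Every PE performs one multiply-accumulate per clock.
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc
```

For example, `systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns `[[19, 22], [43, 50]]`, the same as an ordinary matrix product. Because of the stagger, PE (i, j) sees the pair `a[i][k]`, `b[k][j]` at clock i + j + k, so in this model one n x n product takes 3n - 2 clocks.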
Design
Matrix Multiplier
Processing Element
Operation
- The matrix multiply unit is reset.
- The transformation matrix is loaded. This matrix provides rotation around the y-axis.
$$ T = \begin{bmatrix} \cos(\theta) & 0 & \sin(\theta) & 0 \\ 0 & 1 & 0 & 0 \\ -\sin(\theta) & 0 & \cos(\theta) & 0 \\ 0 & 0 & 0 & 1 \\ \end{bmatrix} $$
The transformation matrix values are pushed in reverse row-major order.
$$ \begin{Bmatrix}1, 0, ... , -\sin(\theta), 0, \cos(\theta)\end{Bmatrix} $$
- Vertices are pushed into the vector FIFO in row-major order. The FIFO input is 32 bits wide, so a full 4 x 16-bit vector takes two clocks. A 32-bit input width was chosen so that the core would work well with a RISC-V soft core.
$$ V = \begin{bmatrix} x_0 & y_0 & z_0 & w_0 \\ x_1 & y_1 & z_1 & w_1 \\ \vdots & \vdots & \vdots & \vdots \\ x_n & y_n & z_n & w_n \\ \end{bmatrix} $$
$$ \begin{Bmatrix}(x_0, y_0), (z_0, w_0), (x_1, y_1), (z_1, w_1), ..., (x_n, y_n), (z_n, w_n)\end{Bmatrix} $$
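As a rough illustration of this input format, the Python sketch below encodes values as 16-bit two's-complement words with 13 fractional bits and packs each vertex into the two 32-bit writes shown above. The helper names and the half-word order (first coordinate of each pair in the upper 16 bits) are assumptions for illustration, not taken from the actual design.

```python
FRAC_BITS = 13  # 13 fractional bits in a 16-bit word, as described in the text


def to_fixed(value: float) -> int:
    """Encode a real number as a 16-bit two's-complement fixed-point word."""
    raw = round(value * (1 << FRAC_BITS))
    if not -(1 << 15) <= raw < (1 << 15):
        raise ValueError(f"{value} does not fit in 16 bits with {FRAC_BITS} fractional bits")
    return raw & 0xFFFF  # keep the low 16 bits (two's-complement pattern)


def pack_pair(first: float, second: float) -> int:
    """Pack two coordinates into one 32-bit FIFO write.

    ASSUMPTION: the first coordinate occupies the upper half-word.
    """
    return (to_fixed(first) << 16) | to_fixed(second)


def pack_vertices(vertices):
    """Flatten (x, y, z, w) rows into the word stream (x, y), (z, w), ..."""
    words = []
    for x, y, z, w in vertices:
        words.append(pack_pair(x, y))  # first clock: x and y
        words.append(pack_pair(z, w))  # second clock: z and w
    return words
```

Under these assumptions, `pack_vertices([(1.0, -1.0, 0.5, 1.0)])` yields the two words 0x2000E000 and 0x10002000, since 1.0 encodes as 0x2000, -1.0 as 0xE000, and 0.5 as 0x1000.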
Conversion to Screen Coordinates
Points are represented by 16-bit signed integers with 13 fractional bits. This allows a range of [-4, 4), but in practice only [-1, 1] is used. The screen resolution is 1366 x 768. To display the points on the screen, each coordinate is multiplied by 384 and then offset by 384, which maps the range [-1, 1] to [0, 768].
Coordinate Conversion
Instead of using a multiplier, the multiplication by 384 is done with two left shifts: one by 8 bits (x256) and one by 7 bits (x128). Since 384 = 256 + 128, adding the two results together gives the product.
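The shift-and-add conversion can be modelled in a few lines of Python. This sketch assumes the input is the raw signed integer of the 16-bit coordinate with 13 fractional bits, and that the fractional bits are dropped after the multiply with an arithmetic right shift (which floors); where the hardware actually truncates is not stated in the text, and `to_screen` is a hypothetical name.

```python
FRAC_BITS = 13  # fractional bits in the 16-bit coordinate


def to_screen(q: int) -> int:
    """Map a signed fixed-point coordinate in [-1, 1] to a pixel in [0, 768].

    q is the raw signed integer, so 1.0 is represented as 1 << FRAC_BITS.
    """
    # 384 = 256 + 128, so q * 384 becomes two shifts and one add.
    scaled = (q << 8) + (q << 7)
    # Drop the fractional bits (Python's >> on ints is an arithmetic,
    # flooring shift), then add the 384-pixel offset.
    return (scaled >> FRAC_BITS) + 384
```

With this model, -1.0, 0.0, 0.5, and 1.0 map to pixels 0, 384, 576, and 768. In hardware, shifts by a constant amount are just wiring, so the whole conversion costs a single adder for the product plus one for the offset.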
Line Drawing
To rasterize the lines into pixels, I implemented Bresenham's line algorithm. All of the algorithm's setup steps, such as selecting between the high- and low-slope variants and calculating the initial values, were easily handled by Spade's pipeline primitives, which made it easy to sequence operations and insert flip-flop delays. After setup, the raster loop outputs the pixel coordinates that are drawn onto the screen.
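For reference, here is a software version of Bresenham's algorithm in Python with the same low/high split: a setup step chooses the variant and the endpoint order, and the inner loop emits one pixel per iteration using only integer additions and comparisons. It follows the common textbook formulation rather than the Spade code, and the function names are hypothetical.

```python
def plot_line_low(x0, y0, x1, y1):
    """Shallow lines (|dy| < |dx|) with x0 <= x1: x advances every pixel."""
    dx, dy = x1 - x0, y1 - y0
    yi = 1
    if dy < 0:
        yi, dy = -1, -dy
    d = 2 * dy - dx  # initial decision value
    y = y0
    for x in range(x0, x1 + 1):
        yield x, y
        if d > 0:
            y += yi
            d += 2 * (dy - dx)
        else:
            d += 2 * dy


def plot_line_high(x0, y0, x1, y1):
    """Steep lines (|dy| >= |dx|) with y0 <= y1: y advances every pixel."""
    dx, dy = x1 - x0, y1 - y0
    xi = 1
    if dx < 0:
        xi, dx = -1, -dx
    d = 2 * dx - dy  # initial decision value
    x = x0
    for y in range(y0, y1 + 1):
        yield x, y
        if d > 0:
            x += xi
            d += 2 * (dx - dy)
        else:
            d += 2 * dx


def bresenham(x0, y0, x1, y1):
    """Setup: pick the low/high variant and endpoint order, then rasterize."""
    if abs(y1 - y0) < abs(x1 - x0):
        if x0 > x1:
            return list(plot_line_low(x1, y1, x0, y0))
        return list(plot_line_low(x0, y0, x1, y1))
    if y0 > y1:
        return list(plot_line_high(x1, y1, x0, y0))
    return list(plot_line_high(x0, y0, x1, y1))
```

For example, `bresenham(0, 0, 3, 1)` returns `[(0, 0), (1, 0), (2, 1), (3, 1)]`. In a hardware version, the branching in `bresenham` and the initial value of `d` roughly correspond to the setup stages, and each loop iteration to one step of the raster loop.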
HDMI Output
TheZoq2 has already written an HDMI output implementation. As an exercise in Spade's library functionality, I will try to use it in my project.