DSPfract
From Hamsterworks Wiki!
This FPGA Project was started and finished January 2012.
I'm re-implementing the Mandelbrot viewer on a Papilio Plus Spartan 6 FPGA using the DSP48A1 blocks.
- 800x600 @ 60Hz display with 60Hz scrolling - the most that the 512KB SRAM chip allows
- Fully pipelined calculation core running at 240MHz - one loop iteration per clock cycle
- Memory and VGA controller run at 80MHz
- Approximately 1 megapixel per second calculation rate with a maximum of 255 loop iterations per pixel
- Worst case screen redraw is 534ms, best case 5ms (as the memory controller can only write 10MB/sec)
- Calculation pipeline can be extended on larger FPGAs, giving 1 megapixel per second per 12 DSP48A1 slices (e.g. 4x on a Spartan 6 LX45)
- USB powered (less than 2.5W).
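The calculation rate and worst case redraw figures above can be checked with a little arithmetic. This is only a rough sketch: the function name and the `overhead_cycles` parameter are made up for illustration. With no overhead the estimate is 510ms; adding 12 extra cycles per pixel happens to reproduce the quoted 534ms, but whether that is the real source of the difference is a guess.

```python
CORE_CLOCK_HZ = 240_000_000   # calculation core clock from the feature list
MAX_ITERATIONS = 255          # maximum loop iterations per pixel
WIDTH, HEIGHT = 800, 600      # display resolution


def worst_case_redraw_ms(overhead_cycles=0):
    """Time to redraw the screen if every pixel needs the maximum iterations.

    overhead_cycles is a hypothetical per-pixel cost (pipeline latency,
    scheduling) that is not documented on this page.
    """
    cycles_per_pixel = MAX_ITERATIONS + overhead_cycles
    total_cycles = WIDTH * HEIGHT * cycles_per_pixel
    return 1000.0 * total_cycles / CORE_CLOCK_HZ


# Worst case pixel rate: one iteration per clock, 255 iterations per pixel.
pixels_per_second = CORE_CLOCK_HZ / MAX_ITERATIONS   # a little under 1 million

ideal_ms = worst_case_redraw_ms()                     # no per-pixel overhead
with_overhead_ms = worst_case_redraw_ms(overhead_cycles=12)
```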
A video of it running is on YouTube at http://www.youtube.com/watch?v=dR4jbX332jU, and if you are interested, my earlier project for the Nexys2 (Spartan 3E) board is at Mandelbrot, along with more detailed notes on the implementation, such as the 'C' source.
The user interface
Use the four direction buttons on the MegaWing to scroll, and reset+down and reset+up to zoom.
Resource usage
| Resource | Usage |
|---|---|
| Slices | 553 |
| Block RAM | 5 |
| DSP48A1 | 12 |
| PLL_ADV | 1 |
Progress so far
29-Jan-2012 All finished! The bit file is here: File:Dspfract.bit
27-Jan-2012 Now has a fully scrollable image, but with only a fixed magnification. Scrolling (as expected) triggers a full screen redraw, but performance is very good. Now to add zooming...
25-Jan-2012 Found a logic error in the loop manager, which was causing pixels to be dropped. Now getting full screen 800x600 images, but with a few odd pixels (two or three). Assumed to be a corner case in the number representation.
24-Jan-2012 Have first patchy 800x600 image, with multiple small 'brots amongst static. Fixed this by correcting the number representation of the constants.
The Spartan 6 DSP48A1 block
It is badly named - it's an 18x18 multiplier with a pre-adder and a 48 bit post-adder, allowing it to implement Multiply Accumulate functions without using the more generic FPGA logic blocks. Each multiplier in this project uses four of these blocks to perform 35 bit signed multiplication.
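One common way to build a 35 bit signed multiply from four 18x18 signed multipliers is to split each operand into a signed upper 18 bits and an unsigned lower 17 bits, then sum the four partial products with the right shifts. The sketch below models that arithmetic in Python; it is an illustration of the general technique, not a description of how the VHDL in this project wires up the DSP48A1 blocks, and the function names are made up.

```python
LOW_BITS = 17  # the low part is unsigned, so it fits an 18 bit signed input


def split35(value):
    """Split a 35 bit signed integer into (high, low) 18 bit signed operands."""
    assert -(1 << 34) <= value < (1 << 34), "operand must fit in 35 bits"
    low = value & ((1 << LOW_BITS) - 1)   # 0 .. 2^17 - 1
    high = value >> LOW_BITS              # arithmetic shift keeps the sign
    return high, low


def mul18(a, b):
    """Stand-in for one 18x18 signed hardware multiplier."""
    assert -(1 << 17) <= a < (1 << 17)
    assert -(1 << 17) <= b < (1 << 17)
    return a * b


def mul35(a, b):
    """35x35 signed multiply built from four 18x18 partial products."""
    a_hi, a_lo = split35(a)
    b_hi, b_lo = split35(b)
    return ((mul18(a_hi, b_hi) << (2 * LOW_BITS))
            + ((mul18(a_hi, b_lo) + mul18(a_lo, b_hi)) << LOW_BITS)
            + mul18(a_lo, b_lo))
```

In hardware the shifted additions map naturally onto the 48 bit post-adders, which is why the blocks can be chained without using general logic.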
Overview of the design
It's much like the other Mandelbrot viewer, where I made sure that nothing was idle at any time. This time I want to make it easy to have performance scale with the size of the FPGA.
I'm designing it in the following blocks:
| Block | Description | Status |
|---|---|---|
| The user interface | Might be PS/2, might be joystick, might be keyboard | Completed |
| The scheduler | Accepts input from the user interface, and requests pixels to be evaluated. It also sets up stored constants in the calculation pipeline | Completed |
| Pipeline manager | This routes pixels in and out of the calculation pipeline | Completed |
| Calculation pipeline | This computes the complex z(n+1) = z(n)^2+c function. More than one calculation block can be in the pipeline. Once optimised it uses 10 DSP48A blocks to compute one iteration each cycle. | Completed |
| Calculation memory interface | A FIFO to take the results from the high speed logic to the 80MHz logic driving the memory and video | Completed |
| The memory controller | Receives values from the pipeline manager and stores them in memory. It also generates the VGA display | Completed |
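For reference, the function the calculation pipeline evaluates can be written as a simple floating point loop. This is a software model only: the hardware works in fixed point and processes many pixels at once, and the escape test of |z|^2 > 4 is an assumption based on the overflow check in the test vectors. The function name is made up for illustration.

```python
def escape_iterations(c_re, c_im, max_iter=255):
    """Count iterations of z(n+1) = z(n)^2 + c before |z|^2 exceeds 4.

    Returns max_iter if the point never escapes within the limit.
    """
    x = 0.0
    y = 0.0
    for i in range(max_iter):
        if x * x + y * y > 4.0:
            return i
        # (x + iy)^2 + c = (x^2 - y^2 + c_re) + i(2xy + c_im)
        x, y = x * x - y * y + c_re, 2.0 * x * y + c_im
    return max_iter
```

Points inside the set, such as c = 0 or c = -1, use all 255 iterations, which is what makes them the worst case for redraw time.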
Number representation
DSP48A1 blocks are optimised for 18 bit signed integers, so a pipeline of 4 blocks can only multiply 35 bit fixed point numbers. Due to trickery the buses between calculation blocks are actually signed 36 bit numbers, able to represent numbers between -8 and (almost) +8. With additional DSP48A1s another 15 precision bits could be added, but the speed may need to be reduced due to longer carry chains in the addition / subtraction operations.
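A signed 36 bit word covering -8 to (almost) +8 implies 4 integer bits (including the sign) and 32 fractional bits. That split is inferred from the stated range, not taken from the source files, so treat the sketch below as an illustration of the idea rather than the exact bus format. The helper names are made up.

```python
TOTAL_BITS = 36   # bus width stated in the text
FRAC_BITS = 32    # assumption: inferred from the -8 .. +8 range


def to_fixed(value):
    """Convert a float to a 36 bit two's complement word (as an unsigned int)."""
    raw = round(value * (1 << FRAC_BITS))
    lowest = -(1 << (TOTAL_BITS - 1))
    highest = (1 << (TOTAL_BITS - 1)) - 1
    if not lowest <= raw <= highest:
        raise OverflowError("value does not fit in the fixed point range")
    return raw & ((1 << TOTAL_BITS) - 1)


def from_fixed(word):
    """Convert a 36 bit two's complement word back to a float."""
    if word & (1 << (TOTAL_BITS - 1)):   # sign bit set
        word -= 1 << TOTAL_BITS
    return word / (1 << FRAC_BITS)
```

Under this assumption -8.0 is exactly representable, while +8.0 is one step beyond the largest positive word, which matches the "(almost) +8" wording.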
Scheduler
- Waits for a vSync pulse to start
- Responds to the user input (if any)
- Changes the base address of the frame buffer
- Updates the 'constants' in the calculation pipeline to reflect the area being displayed
- Pushes the x/y locations that need to be updated into the calculation pipeline
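The 'constants' step can be pictured as filling two lookup tables, one real value per screen column and one imaginary value per screen row, which the pipeline then indexes by x and y. The sketch below is a guess at that idea in floating point; the function name, its parameters and the choice of top-left origin are all assumptions, and the real scheduler works with fixed point values.

```python
def build_constant_tables(left_re, top_im, step, width=800, height=600):
    """Build per-column real and per-row imaginary constants for one view.

    left_re, top_im : complex plane coordinates of the top-left pixel (assumed)
    step            : distance between adjacent pixels (smaller = more zoom)
    """
    re_table = [left_re + col * step for col in range(width)]
    im_table = [top_im - row * step for row in range(height)]
    return re_table, im_table


# Example view: 1/256 per pixel, starting at -2 + 1i in the top-left corner.
re_table, im_table = build_constant_tables(-2.0, 1.0, 1.0 / 256)
```

Scrolling then only changes `left_re` or `top_im`, and zooming only changes `step`, so the tables can be rebuilt once per frame.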
Calculation pipeline
Here is a graphical overview of the calculation block:
Test vectors
Without time to perform a full design, I find the easiest way to develop is to follow this flow:
- Sketch out design
- Implement major blocks
- Build data-path with imbalances in the pipeline lengths
- Throw test vectors at it to find how long the pipeline should be
- Balance pipeline lengths
- Use test vectors to test for number representation / alignment issues
- Then finally verify the math is working correctly.
First, set the following constants in the memory blocks:
| Index | Value(hex) | Value (decimal) |
|---|---|---|
| 0 | 000000000 | 0 |
| 1 | 200000000 | 1 |
| 2 | 400000000 | 2 |
| others | 000000000 | 0 |
These values are used to test for alignment issues in the pipeline.
| Test | Iterations | Overflow | Real | Imaginary | x | y | Constant | wrx | wry | Active | Testing |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0xFF | 1 | 0000000000 | 0000000000 | 5 | 5 | FFFFFFFFFF | 1 | 1 | 1 | pipeline lengths are in step |
| 2 | 0xFF | 0 | 0000000000 | 0000000000 | 1 | 1 | 0000000000 | 0 | 0 | 1 | Constants are added at the correct time |
| 3 | 0xFF | 0 | 5555555555 | 5555555555 | 0 | 0 | 0000000000 | 0 | 0 | 1 | Check that the early overflow is in step |
| 4 | 0xFF | 0 | 3FFFFFFFFF | 3FFFFFFFFF | 0 | 0 | 0000000000 | 0 | 0 | 1 | Check that the overflow is working |
| 5 | 0xFF | 0 | 1000000000 | 0800000000 | 0 | 0 | 0000000000 | 0 | 0 | 1 | Check that math works - output should be (0C00000000, 1000000000) |
| 6 | 0xFF | 0 | 1000000000 | 0800000000 | 1 | 1 | 0000000000 | 0 | 0 | 1 | Check that math works with constants - output should be (1C00000000, 1000000000) |
| 7 | 0xFF | 0 | 1800000000 | 1800000000 | 0 | 0 | 0000000000 | 0 | 0 | 1 | Check that overflow is working 1.5^2+1.5^2 = 4.5 |
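The expected outputs for tests 5 and 7 can be reproduced with a small integer model. It assumes that 1.0 appears as 1000000000 (hex) in the Real and Imaginary columns, which is what the comments on tests 5 and 7 suggest, and that overflow means |z|^2 > 4. Both are inferences from the table, and the function name is made up.

```python
FRAC_BITS = 36                 # assumption: 0x1000000000 represents 1.0
FOUR = 4 << (2 * FRAC_BITS)    # 4.0 at the scale of an unshifted product


def square_step(re, im, c_re=0, c_im=0):
    """One z^2 + c step on fixed point integers.

    Returns (new_re, new_im, overflow), where overflow flags |z|^2 > 4
    for the input value z.
    """
    re_sq = re * re
    im_sq = im * im
    new_re = ((re_sq - im_sq) >> FRAC_BITS) + c_re
    new_im = ((2 * re * im) >> FRAC_BITS) + c_im
    overflow = (re_sq + im_sq) > FOUR
    return new_re, new_im, overflow


# Test 5: z = 1.0 + 0.5i, no constants added.
test5 = square_step(0x1000000000, 0x0800000000)

# Test 7: z = 1.5 + 1.5i, so |z|^2 = 4.5 and the overflow flag should be set.
test7 = square_step(0x1800000000, 0x1800000000)
```

Here `test5` gives the (0C00000000, 1000000000) pair quoted in the table with no overflow, and `test7` sets the overflow flag.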
Memory/Video Controller
Important note - for reliable operation you must turn on "Pack I/O Registers/Latches into IOBs" in the mapping properties, or add appropriate constraints. Without this the output signals may contain glitches that will cause intermittent errors.
The target board has fast static RAM on it - making it a lot easier to interface to. I need to perform byte writes, but as the Papilio Plus board doesn't support the UB/LB enable lines I am forced to do read-before-writes.
DSPfact_old_mem has the notes for a controller that requires support for the UB/LB signals.
Here are the memory cycles:
Read from address
This is using "Read Cycle No. 1" from the data sheet:
| Cycle | nWE | nCE | nOE | Address | Data bus |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | Read Address | Z + latch |
Write to address
This is using "Write Cycle #2 - nWE controlled, nOE high during write" from the data sheet.
| Cycle | nWE | nCE | nOE | Address | Data bus |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | Write Address | Data |
| 1 | 1 | 0 | 1 | Write Address | Data |
Integrating with video memory requirements
I'm running everything outside of the calculation core at 80MHz. The display runs at 800x600 with a 40MHz pixel clock (see http://tinyvga.com/vga-timing/800x600@60Hz), so it requires one byte every two cycles, or one 16 bit word read every four cycles. Servicing the display will use half the available memory cycles (40MB/s).
As I want to do byte-wide writes, I also need to read the contents of a memory address, update the upper or lower byte then write it back. This will require a read cycle before every write, and the writes will effectively be single byte writes. This will only allow at most a 10MB/s write rate unless write combining is used.
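The read-before-write step amounts to merging the new byte into the 16 bit word that was just read. A minimal sketch of that merge is below; the function name is made up, and which pixel maps to the upper or lower byte is left to the caller because this page does not say.

```python
def merge_byte(word, byte, upper):
    """Return the 16 bit word with its upper or lower byte replaced.

    word  : value read back from the SRAM (0 .. 0xFFFF)
    byte  : new pixel value (0 .. 0xFF)
    upper : True to replace bits 15..8, False to replace bits 7..0
    """
    if not 0 <= word <= 0xFFFF:
        raise ValueError("word must be a 16 bit value")
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be an 8 bit value")
    if upper:
        return (word & 0x00FF) | (byte << 8)
    return (word & 0xFF00) | byte
```

For example, merging 0xAB into 0x1234 gives 0xAB34 for the upper byte and 0x12AB for the lower byte. With UB/LB byte enables the SRAM would do this masking itself and the extra read cycle would not be needed.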
Memory access pattern
This is the current memory access pattern - it differs from the one in the source archive:
- read data to be updated (1 cycle)
- read cycle for display (1 cycle)
- write the data being updated (2 cycles). I've since found an error that now means this can be done in one cycle.
(the 'X's are generated by my SRAM test bench to indicate the SRAM access time)
Write throughput is 20MB/s, but memory throughput is:
- 40MB/s reads for generating the video
- 40MB/s read before write (allowing byte writes)
- 40MB/s writes
- 120MB/s total
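These figures follow from the access pattern, assuming the four cycle version (one read for update, one display read, two cycles for the write) at 80MHz with a 16 bit data bus. The variable names below are made up for the calculation.

```python
MEM_CLOCK_HZ = 80_000_000   # memory and VGA controller clock
CYCLES_PER_PATTERN = 4      # read for update + display read + 2 cycle write
BYTES_PER_WORD = 2          # 16 bit SRAM data bus

patterns_per_second = MEM_CLOCK_HZ // CYCLES_PER_PATTERN


def mb_per_second(bytes_per_pattern):
    """Throughput in MB/s for a transfer that happens once per pattern."""
    return patterns_per_second * bytes_per_pattern / 1_000_000


pixel_write_rate = mb_per_second(1)               # one new pixel byte per pattern
display_read_rate = mb_per_second(BYTES_PER_WORD)
read_before_write_rate = mb_per_second(BYTES_PER_WORD)
word_write_rate = mb_per_second(BYTES_PER_WORD)
total_rate = display_read_rate + read_before_write_rate + word_write_rate
```

Only one byte of each written word is new data, which is why the useful write rate is half the raw write bandwidth.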
I've implemented it in four processes:
- one to service the display
- one to service the memory buses and signals
- one to control the tristate buffers on mem_data
- one to process the clock signal
Since working on this project I've written a small SRAM testbench which will help with simulation of this sort of thing.
Source files
All source files and a full ISE project are in File:Dspfract.zip. It is supplied as-is, use at your own risk!


