DSPfract

From Hamsterworks Wiki!

This FPGA project was started and finished in January 2012.

I'm re-implementing the Mandelbrot viewer on a Papilio Plus Spartan 6 FPGA using the DSP48A1 blocks.

Mandelbrot 400x300.png

  • 800x600 @ 60Hz display with 60Hz scrolling - the most that the 512KB SRAM chip allows
  • Fully pipelined calculation core running at 240MHz - one loop iteration per clock cycle
  • Memory and VGA controller runs at 80MHz
  • Approximately 1 megapixel per second calculation rate with a maximum of 255 loop iterations per pixel
  • Worst case screen redraw is 534ms, best case 5ms (as the memory controller can only write 10MB/sec)
  • Calculation pipeline can be extended on larger FPGAs, giving 1 megapixel per second per 12 DSP48A1 slices (e.g. 4x on a Spartan 6 LX45)
  • USB powered (less than 2.5W).

A video of it running is on YouTube at http://www.youtube.com/watch?v=dR4jbX332jU. If you are interested, my earlier project for the Nexys2 (Spartan 3E) board is at Mandelbrot, along with more detailed notes on the implementation, such as 'C' source.

The user interface

Use the four direction buttons on the MegaWing to scroll, and reset+down and reset+up to zoom.

Resource usage

| Resource | Usage |
|----------|-------|
| Slices | 553 |
| Block RAM | 5 |
| DSP48A1 | 12 |
| PLL_ADV | 1 |

Progress so far

29-Jan-2012 All finished! The bit file is here: File:Dspfract.bit

27-Jan-2012 Now has a fully scrollable image, but with only a fixed magnification. Scrolling (as expected) triggers a full screen redraw, but performance is very good. Now to add zooming...

25-Jan-2012 Found a logic error in the loop manager, which was causing pixels to be dropped. Now getting full screen 800x600 images, but with a few odd pixels (two or three), assumed to be a corner case in the number representation.

24-Jan-2012 Have first patchy 800x600 image, with multiple small 'brots amongst static. Fixed this by correcting the number representation of the constants.

The Spartan 6 DSP48A1 block

It is badly named - it's an 18x18 multiplier with a pre-adder and a 48 bit post-adder, allowing it to implement multiply-accumulate functions without using the more generic FPGA logic blocks. Each multiplier in this project uses four of these blocks to perform 35 bit signed multiplication.
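
One common way to build a wider signed multiplier out of 18x18 signed multipliers is to split each operand into a signed upper 18 bits and an unsigned lower 17 bits, then add four shifted partial products. The Python sketch below models only that arithmetic. The function names are made up for illustration, and the way the four DSP48A1 blocks are actually cascaded in this project may differ.

```python
def split35(v):
    """Split a 35-bit signed integer into (signed high 18 bits, unsigned low 17 bits)."""
    assert -(1 << 34) <= v < (1 << 34), "operand does not fit in 35 signed bits"
    lo = v & 0x1FFFF   # low 17 bits, always non-negative
    hi = v >> 17       # arithmetic shift keeps the sign, result fits in 18 signed bits
    return hi, lo


def mul18(a, b):
    """Model of one 18x18 signed multiplier."""
    assert -(1 << 17) <= a < (1 << 17)
    assert -(1 << 17) <= b < (1 << 17)
    return a * b


def mul35(a, b):
    """35x35 signed multiply built from four 18x18 partial products."""
    ah, al = split35(a)
    bh, bl = split35(b)
    return ((mul18(ah, bh) << 34)
            + ((mul18(ah, bl) + mul18(al, bh)) << 17)
            + mul18(al, bl))


print(mul35(-123456789, 987654321) == -123456789 * 987654321)  # True
```

The low halves are treated as non-negative 18 bit values with a zero top bit, which is why four 18 bit multipliers cover 35 bits rather than 36.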

Overview of the design

It's much like the other Mandelbrot viewer, where I made sure that nothing was idle at any time. This time I want to make it easy to have performance scale with the size of the FPGA.

I'm designing it in six blocks:

| Block | Description | Status |
|-------|-------------|--------|
| The user interface | Might be PS/2, might be joystick, might be keyboard | Completed |
| The scheduler | Accepts input from the user interface, and requests pixels to be evaluated. It also sets up stored constants in the calculation pipeline | Completed |
| Pipeline manager | This routes pixels in and out of the calculation pipeline | Completed |
| Calculation pipeline | This computes the complex z(n+1) = z(n)^2+c function. More than one calculation block can be in the pipeline. Once optimized it uses 10 DSP48A blocks to compute one iteration each cycle. | Completed |
| Calculation memory interface | A FIFO to take the results from the high speed logic to the 80MHz logic driving the memory and video | Completed |
| The memory controller | Receives values from the pipeline manager and stores them in memory. It also generates the VGA display | Completed |

Number representation

DSP48A1 blocks are optimised for 18 bit signed integers, so a pipeline of 4 blocks can only multiply 35 bit fixed point numbers. Due to trickery the buses between calculation blocks are actually signed 36 bit numbers, able to represent numbers between -8 and (almost) +8. With additional DSP48A1 blocks another 15 bits of precision could be added, but the speed may need to be reduced due to longer carry chains in the addition / subtraction operations.
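
As a rough illustration of the bus format, the sketch below converts between floats and 36 bit two's-complement patterns. It assumes 32 fractional bits, which is what the stated range of -8 to almost +8 implies. The names are hypothetical, and the alignment used elsewhere in the design (for example for the stored constants) may differ.

```python
TOTAL_BITS = 36   # bus width stated in the text
FRAC_BITS = 32    # assumption: implied by the range -8 to (almost) +8


def to_fixed(x):
    """Convert a float to a 36-bit two's-complement pattern (returned as a non-negative int)."""
    raw = int(round(x * (1 << FRAC_BITS)))
    if not -(1 << (TOTAL_BITS - 1)) <= raw < (1 << (TOTAL_BITS - 1)):
        raise OverflowError("value does not fit in the 36-bit format")
    return raw & ((1 << TOTAL_BITS) - 1)


def from_fixed(bits):
    """Convert a 36-bit two's-complement pattern back to a float."""
    if bits & (1 << (TOTAL_BITS - 1)):   # sign bit set, so the value is negative
        bits -= 1 << TOTAL_BITS
    return bits / (1 << FRAC_BITS)


print(from_fixed(0x800000000))   # most negative value: -8.0
print(from_fixed(0x7FFFFFFFF))   # largest value, just under +8
print(hex(to_fixed(-2.0)))       # 0xe00000000
```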

Scheduler

  • Waits for a vSync pulse to start
  • Responds to the user input (if any)
  • Changes the base address of the frame buffer
  • Updates the 'constants' in the calculation pipeline to reflect the area being displayed
  • Pushes the x/y locations that need to be updated into the calculation pipeline

Calculation pipeline

Here is a graphical overview of the calculation block:

Dspfract pipeline.png
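
For reference, here is a plain software model of what the pipeline computes for one pixel. It is a floating point sketch, not the fixed point hardware: the function name is made up, and the escape test (|z|^2 > 4) is the usual Mandelbrot convention rather than something read out of the design. Each iteration needs three real multiplications (re*re, im*im and re*im), which would be consistent with 12 DSP48A1 blocks at four per multiplier.

```python
MAX_ITERATIONS = 255   # iteration limit quoted in the feature list


def iterations_until_overflow(c_re, c_im):
    """Iterate z = z^2 + c from z = 0 and return the iteration count at overflow."""
    z_re = z_im = 0.0
    for n in range(MAX_ITERATIONS):
        re2 = z_re * z_re
        im2 = z_im * z_im
        if re2 + im2 > 4.0:          # assumption: escape radius of 2
            return n
        # (re + i*im)^2 = (re^2 - im^2) + i*(2*re*im), then add the constant c
        z_re, z_im = re2 - im2 + c_re, 2.0 * z_re * z_im + c_im
    return MAX_ITERATIONS


print(iterations_until_overflow(0.0, 0.0))   # 255, the origin never escapes
print(iterations_until_overflow(2.0, 2.0))   # 1
```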

Test vectors

Without time to perform a full design, I find the easiest way to develop is to follow this flow:

  • Sketch out design
  • Implement major blocks
  • Build data-path with imbalances in the pipeline lengths
  • Throw test vectors at it to find how long the pipeline should be
  • Balance pipeline lengths
  • Use test vectors to test for number representation / alignment issues
  • Then finally verify the math is working correctly.

First, set the following constants in the memory blocks:

| Index | Value (hex) | Value (decimal) |
|-------|-------------|-----------------|
| 0 | 000000000 | 0 |
| 1 | 200000000 | 1 |
| 2 | 400000000 | 2 |
| others | 000000000 | 0 |

These values are used to test for alignment issues in the pipeline.

| Test | Iterations | Overflow | Real | Imaginary | x | y | Constant | wrx | wry | Active | Testing |
|------|------------|----------|------|-----------|---|---|----------|-----|-----|--------|---------|
| 1 | 0xFF | 1 | 0000000000 | 0000000000 | 5 | 5 | FFFFFFFFFF | 1 | 1 | 1 | Pipeline lengths are in step |
| 2 | 0xFF | 0 | 0000000000 | 0000000000 | 1 | 1 | 0000000000 | 0 | 0 | 1 | Constants are added at the correct time |
| 3 | 0xFF | 0 | 5555555555 | 5555555555 | 0 | 0 | 0000000000 | 0 | 0 | 1 | Check that the early overflow is in step |
| 4 | 0xFF | 0 | 3FFFFFFFFF | 3FFFFFFFFF | 0 | 0 | 0000000000 | 0 | 0 | 1 | Check that the overflow is working |
| 5 | 0xFF | 0 | 1000000000 | 0800000000 | 0 | 0 | 0000000000 | 0 | 0 | 1 | Check that math works - output should be (0C00000000, 1000000000) |
| 6 | 0xFF | 0 | 1000000000 | 0800000000 | 1 | 1 | 0000000000 | 0 | 0 | 1 | Check that math works with constants - output should be (1C00000000, 1000000000) |
| 7 | 0xFF | 0 | 1800000000 | 1800000000 | 0 | 0 | 0000000000 | 0 | 0 | 1 | Check that overflow is working 1.5^2+1.5^2 = 4.5 |
Memory/Video Controller

Important note - for reliable operation you must turn on "Pack I/O Registers/Latches into IOBs" in the mapping properties, or add appropriate constraints. Without this the output signals may contain glitches that will cause intermittent errors.

The target board has fast static RAM on it - making it a lot easier to interface to. I need to perform byte writes, but as the Papilio Plus board doesn't support the UB/LB enable lines I am forced to do read-before-writes.

DSPfact_old_mem has the notes for a controller that requires support for the UB/LB signals.

Here are the memory cycles:

Read from address

This is using "Read Cycle No. 1" from the data sheet:

| Cycle | nWE | nCE | nOE | Address | Data bus |
|-------|-----|-----|-----|---------|----------|
| 0 | 1 | 0 | 0 | Read Address | Z + latch |

Write to address

This is using "Write Cycle #2 - nWE controlled, nOE high during write" from the data sheet.

| Cycle | nWE | nCE | nOE | Address | Data bus |
|-------|-----|-----|-----|---------|----------|
| 0 | 0 | 0 | 1 | Write Address | Data |
| 1 | 1 | 0 | 1 | Write Address | Data |

Integrating with video memory requirements

I'm running everything outside of the calculation core at 80MHz. The display, running at 800x600 with a 40MHz pixel clock (see http://tinyvga.com/vga-timing/800x600@60Hz), requires one byte every 2 cycles, or one 16 bit word read every four cycles. Servicing the display will use half the available memory cycles (40MB/s).

As I want to do byte-wide writes, I also need to read the contents of a memory address, update the upper or lower byte then write it back. This will require a read cycle before every write, and the writes will effectively be single byte writes. This will only allow at most a 10MB/s write rate unless write combining is used.
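
The read-before-write can be modelled in a few lines. This is only a behavioural sketch: the SRAM is a Python list of 16 bit words, the function name is made up, and mapping odd byte addresses to the upper byte is an assumption rather than something taken from the design.

```python
def write_byte(sram, byte_addr, value):
    """Update one byte of a 16-bit-wide memory using a read followed by a write."""
    word_addr = byte_addr >> 1
    word = sram[word_addr]                    # read cycle: fetch the whole word
    if byte_addr & 1:                         # assumption: odd address = upper byte
        word = (word & 0x00FF) | ((value & 0xFF) << 8)
    else:
        word = (word & 0xFF00) | (value & 0xFF)
    sram[word_addr] = word                    # write cycle: store the merged word


sram = [0x0000] * 4
write_byte(sram, 2, 0xAB)
write_byte(sram, 3, 0xCD)
print(hex(sram[1]))   # 0xcdab
```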

Memory access pattern

This is the current memory access pattern - it differs from the one in the source archive:

  • read data to be updated (1 cycle)
  • read cycle for display (1 cycle)
  • write the data being updated (2 cycles) - I've since found an error, which means this can now be done in one cycle

Dspfract mem pattern.png

(the 'X's are generated by my SRAM test bench to indicate the SRAM access time)

Write throughput is 20MB/s, but memory throughput is:

  • 40MB/s reads for generating the video
  • 40MB/s read before write (allowing byte writes)
  • 40MB/s writes
  • 120MB/s total
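
These figures follow from the four-cycle access pattern at 80MHz with a 16 bit data bus. The sketch below just redoes that arithmetic; the names are made up for illustration, and it assumes the two-cycle write described in the pattern above.

```python
CLOCK_HZ = 80_000_000   # memory / VGA controller clock from the text
WORD_BYTES = 2          # 16-bit SRAM data bus
PATTERN_CYCLES = 4      # 1 read-before-write + 1 display read + 2-cycle write


def throughput_mb_per_s():
    """Return the bus traffic of each access type in MB/s (1 MB = 1,000,000 bytes)."""
    patterns = CLOCK_HZ // PATTERN_CYCLES     # complete access patterns per second
    mb = 1_000_000
    figures = {
        "display reads": patterns * WORD_BYTES // mb,
        "reads before write": patterns * WORD_BYTES // mb,
        "writes": patterns * WORD_BYTES // mb,
    }
    figures["total bus traffic"] = sum(figures.values())
    # Only one byte of each written word carries new pixel data.
    figures["useful byte writes"] = patterns * 1 // mb
    return figures


for name, value in throughput_mb_per_s().items():
    print(f"{name}: {value}MB/s")
```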

I've implemented it in four processes:

  • one to service the display
  • one to service the memory buses and signals
  • one to control the tristate buffers on mem_data
  • one to process the clock signal

Since working on this project I've written a small SRAM testbench which will help with simulation of this sort of thing.

Source files

All source files and a full ISE project are in File:Dspfract.zip. It is supplied as-is, use at your own risk!
