From Hamsterworks Wiki!

Jump to: navigation, search

My attempt at a high performance FPGA Mandelbrot fractal generator. Also see DSPfract for a later project that uses the Xilinx DS48A1 core

Mandelbrot screenshot.png

This FPGA Project was started April 2011, and was working by late May 2011.



I first read the original Mandilbrot article in Scientific American (August 1985) while I was studying programming. I even coded it on a Commodore VIC20 in CBM basic. Since then I have often returned to the project to try out new technologies.

The best background for somebody unfamiliar with the Mandelbrot set is

Project Files

I'll be updating all my source files to include a "This has served its purpose for me, you are free to do whatever you want with this" license to all my files. The only parts of this project that does have restrictions is the Digilent PS/2 mouse reference component and the handful of IP core files (FIFOS, BRAM and MULTI 18x18) - see the headers in these files for more information.

Here is the latest bit file: File:Mandelbrot.bit

Here is the latest project zip file :

Project Status


Device Utilization Summary (this may change from build to build...)

Logic UtilizationUsedAvailableUtilization
Number of Slice Flip Flops3,6179,31238%
Number of 4 input LUTs2,9509,31231%
NuMber of occupied Slices2,6104,65656%
Number of Slices containing only related logic2,6102,571100%
Number of Slices containing unrelated logic02,5710%
Total Number of 4 input LUTs3,4229,31235%
Number used as logic2,496
Number used as a route-thru472
Number used for Dual Port RAMs184
Number used as Shift registers270
Number of RAMB16s52025%
Number of DCMs2450%
Number of MULT18X18SIOs102050%

Known bugs are:

  • Leftmost pixel on horizontal cursor drops down a line. This is due to the cursor being 'painted' one cycle late in the display.
  • Rightmost column on screen may display incorrectly.


  • Tidied up the Memory controller FSM, removing a lot of the read states. Also expanded the page size to 16 bytes. This is the largest page size supported by the DRAM chip, so no further improvement will be possible without moving to the synchronous interface mode.
  • Set the top level zoom to level "1" rather than "0". This saved me from reassigning zoom levels. Lazy me...


A night of big changes:

  • made the display into it's own clock domain
  • made the display timing as generic parameters to the component - supporting unto 1024x768 resolution
  • changed the mouse component to support the higher resolution
  • parameterised the screen resolution in the pixel dispatcher

Only two things I know that I would like to fix are

  • the far right column is wrong
  • the zoom level when fully zoomed out is too high, causing an overflow wraparound that spoils the image
  • maybe increase the RAM page size to 16 words, adding a further 16MB/s

I will now have to update the source on the web site...


Fixed up a couple of bugs and added a few features.

  • The screens now 'fade' between different resolutions, This is achieved by updating the pixels in four passes, each updating one pixel of a 2x2 square.
  • wrY was being asserted too long in the dispatcher, causing the top line to actually be line 641
  • Increased the RAM's page size from 4 words to 8, adding 30% more read bandwidth from memory. This should allow me to go 800x600 in future.


Project completed. Just update of the website left to finish. Everything works, and I have replaced the 18bit wide constant tables with 36 bit ones. The project is 2,772 lines of VHDL source, but to me it feels that it has the complexity of a much bigger project (maybe 10,000 lines of C). I think that this is because you have to think not only about "what" but "when" something happens, and you have to play much closer attention to your datatypes.


Mouse interface is now working pretty much perfectly. Can now zoom and pan, but still have a few bugs to squash.


Had my first cut at writing a PS2 mouse interface, without success - the initialization process wasn't in the documentation I was following. Darn! So in a break from tradition I used the Digilent Reference component - took about hour to integrate. Now have moving cross-hairs and can see on the LEDs when buttons are pushed. Next stop - updating the constants in the lookup tables when the mouse buttons are pressed.


Finally debugged everything that performs the calculations, and have a fully working test image. My problem was that I had an off-by-one error in loading slots, and overwrote some active entries in the pipeline.

Brot working.png
Reference Image

This can almost be compared to the reference image, but due to me using a 16x9 monitor mine is stretched:

Now on to building the mouse interface.

=== 2011-05-12 === After a month of Sundays things are actually starting to work. Here is the first image, and it sort of shows that things are close to working! I first loaded my 640x480 display component test project, which initialised the frame buffer to a test pattern, then uploaded my project.

Brot 1st.png

Here is a bit of detail, showing the top 'spike' of the set:

Brot 1st detail.png


So it looks like I have the following faults to look for:

  • There is a bug with either the calc-to-display interface, causing some of the pixel writes to be lost.
  • There is a bug with pipeline slot management, causing some pixels not to be computed.

The two striped bands at the top and bottom of the set look very promising, and make me think that it is more a timing issue in the calcMemoryInterface component than anything else. But it sort of works at 100MHz, wahoo!

Got the major bug fixed - it was pipeline slot management. I was overwriting active slots due to an off-by-one error in the logic that works out if the pipeline can accept a new point.

In C

Here is the whole thing in C, which I used to generate a 3360x1080 backdrop for my Monitor. It also matches the 8 bit colours of the Nexys2.

Sample image from C code, converted to jpg

* brot.c
* Generate an image of the Mandelbrot set. 
* use the command line to set x,y and scale parameters
* Mike Field <>
#include <stdio.h>
#define HRES (1680*2)
#define VRES 1080
unsigned char brot(double cx, double cy)
  int i = 0; 
  double x = cx, y = cy;
  while(i < 255 && x*x+y*y < 4.0)
    double t = x;
    x = x*x-y*y+cx;
    y = 2*t*y+cy;
  return i;

int main(int c, char *v[])
  int x,y;
  double cx = 0.0,cy = 0.0;
  double scale = 4.0/2;
  FILE *f;
  f = fopen("out.ppm","w");
  if(f == NULL)
    fprintf(stderr,"Can't open file\r\n");
    return 0;

  if(c == 4)
     scale /= 2.0;
     fprintf(stderr,"At %f,%f, scale %f\n",cx,cy,scale);
  fprintf(f,"P6\n%i %i\n255\n",HRES,VRES);
  for(y = 0; y < VRES; y++)
    for(x = 0; x < HRES; x++)
       unsigned char i = brot(scale*(x-HRES/2)/(HRES/2.0)+cx,scale*(y-VRES/2)/(HRES/2.0)+cy); 
  return 0;

What I hope to learn

  • Design and implementation of pipelined operations in digital design

Check - learnt that all right. In the end I had to build a twelve stage pipeline to get it all to work at the goal speed of 100MHz. I also learnt a lot about how to get data in and out of a calculation pipeline efficiently.

  • Limits of performance in FPGAs when doing math.

Check! The Spartan-3E going flat out can perform approx 500 million 36bit multiplications per second, or 2 billion 18bit multiplications per second, but the latency is the big issue, at it takes between 4 and 8 cycles to do so (and maybe more cycles for 'twos complement' signed numbers).

  • High speed FPGA design optimisation

Check! I now have a good grasp on the memory / latency / design complexity trade-off occurs in the design process. The rule of thumb is to do everything as soon as you are able, not when it is convenient. By trying to build as much slack into your design as possible you will be able to take advantage of it when you need to speed things up.

  • How to get the most from the Nexys2 DRAM interface

Currently the write bandwidth on the DRAM interface is a major limit on performance. When lots of 'shallow' pixels are to be displayed the pipeline is often full of values that are queued up waiting for memory.

As expected, the 5MB/s write bandwidth is completely outpaced by the core's ability to generate results 20x faster. On 'deep' images the write bandwidth is sufficient, as the core only generates 0.5MB/s of writes. By increasing the depth of the output Fifo on the calculation core this difference in performance could be smoothed out, but the fifo would need to be very deep the worst case would occur quite often.

  • The limits of the Nexys2 board.

I think that the 500K Nexys2 board is an ideal size for a hobbyist - you can do quite complex projects without needing additional resources.

Target platform

My target is the Digilent Nexys2. It will use the VGA output (640x480, 256 colours, 60HZ), and because this needs 307,200 bytes of frame buffer this will need to be kept in the onboard RAM. Managing RAM bandwidth will be a challenge too - the RAM on the Nexys2 has write bandwidth of about 10M 1-byte writes per second, and page-mode reads of 64MB/sec.

Max / Min update times

Best case throughput of fractal engine = 100,000,000 pixels per second (@ 100MHz).

Worst case (all pixels to maximum iterations) = 390,625 pixels per second (@ 100MHz).

As the display is 307,200 pixels it possible to ensures a one second redraw time under the worst case.

At the best case there is 6MB/s of write bandwidth is available, so the minimum time to update the display is 0.02s.

Any pixels that have have less then 16 iterations applied will most likely be stalled getting to memory.

If achieved, these are pretty impressive numbers for home designed hardware... the laptop I am on can calculate around 96,500,000 iterations per second though my C test program per core (a 640x480 with all pixels at 255 iterations is 0.812 seconds).

Number representation

Due to the design of the Xilinx FPGA's 18x18 multiplier blocks I am aiming for 36 bit fixed point (2.34) with external sign. The external sign works well as the inner loop uses the square function and 'complex absolute value' function, where the operand's sign has no impact.

I have been looking at binary math, specifically pipelining of a squaring function, which looked to be heady for this, allowing a more balanced pipeline.


One iteration of the calculations are as follows

A pipeline

 t1 <= A*A
 t2 <= B*B
 abs <= t1+r2
 overflow <= (abs >= 4)
 t3 = t1-t2
 An = t3+Ac
 iterations = iterations + not overflow

B pipeline

 t1 <= A * B
 t2 <= t1 * 2.0
 Bn <= t2 + Bc

My first attempt couldn't meet my desired clock speed, so more time designing was required.

Data flow graph

Here is the new flow graph


The "1024x36 lookups" are dual-ported block RAM holding the real and imaginary parts of the pixel's location on the complex plane.

The core has changed a liitle since this plan, but all major components are show. For exmample, currently the constant lookups are only 18 bits wide.

I might get round to updating the graph one day...

Source files

Here are the source files for various components:

  • mandelbrot.vhd - the top level
  • mandelbrot.ucf - the user constraints file for Nexys2
  • calc.vhd - the calculation core
    • delayline.vhd - A generic n-bit wide delay of varying length
    • delayline1.vhd - A generic 1-bit wide delay of varying length
    • indexholders.vhd - To hold the current 'X' and 'Y' indexes that are in the pipeline
    • iterationcounters.vhd - To hold an increment the current iteration counts that are in the pipeline
    • square.vhd - Calculate the square of a 36 bit number
    • mult36.vhd - Multiply two 36 bit numbers
    • signedadd.vhd - Add to 36 bit signed numbers
    • unsignedadd.vhd - Add to 36 bit unsigned numbers (has lower latency than signedadd)
    • inputmux.vhd - Accept new values into the pipeline, or use the old result
    • slottracker.vhd - Keeps track of which slots in the pipeline are in use
    • calcOutFifo - an IP FIFO to hold the result of the calculation before it is written to memory
    • mult18x18_L4 - an IP 18x18 multiplier with four cycle latency
    • coord_store - an dualported BRAM component to hold the computed x & y coordinates
  • dispatcher.vhd - The pixel dispatcher
  • CalcMemoryInterface.vhd - The glue between the core and the memory controller
  • MemoryController.vhd - The DRAM interface
    • DisplayFifo - an IP component to hold bytes going to the VGA port.
  • DisplaySubsystem.vhd - the display subsystem
  • Clk50to100and65 - A IP using a two DCMs to double the clock signal's frequency from 50MHz to 100MHz, and then generate 65MHz for the display.

Where possible I'm implementing most components using "generics" to make adjusting of the pipeline length a lot easier....

Display subsystem

The Mandelbrot display subsystem is relatively simple implementation of VGA 640x480, being fed data from the memory controller.

I can see how it could be extended work to 800x600 72Hz (a 50MHz pixel clock), but that would require well over half of the maximum RAM I/O bandwidth.

As it nicely divides into 100MHz it wouldn't require a major reworking of the core, I would just need to increase the page size in the memory controller to 8 words, taking it from 50MB/s read to 66MB/s. I shouldn't have to increase the size of the display fifo as the writes take 10 clock cycles, and the following read will take 10 cycles to get the first byte - enough time for 10 bytes to be pulled out of the fifo.

As a later enhancement this was extended to 1024x768 @ 50Hz, and to my amazement it actually worked!

Memory controller

The Mandelbrot memory controller receives writes from the calculation pipeline, and reads data from memory to send to the Mandelbrot display subsystem. It interfaces with the DRAM on the Nexys2.

It all works and is ready to accept data from the calculation engine. The test image below is generated by a seperate module that writes values into the memory through the write interface that the Mandelbrot engine will use.

First display.png

Personal tools