HDlife

From Hamsterworks Wiki!


This FPGA Project was started and completed in April 2012.

I have Conway's Game of Life running at full HD at 60Hz update on a Spartan 6 FPGA (my Papilio Plus).

Life picture.png

I'm really sorry about the quality of the picture (it was taken zoomed in on a cell phone), but a full screen photo or a video looks much like the Milky Way at night... You can find a very short video of it at http://youtu.be/bqW9PfCPLXc


What is "Conway's Game of Life"?

It is the best-known 'cellular automaton', first published way back in 1970. See http://en.wikipedia.org/wiki/Conway's_Game_of_Life for a complete description.

Every pixel is updated based on how many of its eight neighbors are 'on':

  • if exactly two neighbors are on, the pixel retains its present state
  • if exactly three neighbors are on, the pixel is set on
  • for any other number, the pixel is turned off
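
The three rules above can be sketched as a small software reference function. This is only a Python model for checking intuition, not part of the FPGA design, and the name `next_state` is made up for illustration.

```python
def next_state(alive: bool, live_neighbours: int) -> bool:
    """Return the next state of one cell given how many of its 8 neighbours are on."""
    if live_neighbours == 2:
        return alive      # two neighbours: keep the present state
    if live_neighbours == 3:
        return True       # three neighbours: the cell is on
    return False          # any other count: the cell is off


# A live cell with two neighbours survives, a dead one stays dead.
print(next_state(True, 2), next_state(False, 2))   # True False
# Any cell with three neighbours ends up on.
print(next_state(False, 3))                        # True
```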

It is often used as a fun project for budding programmers.

Why implement 'Life' in an FPGA?

I was just after a project to see if I could generate 1080p, and only had enough SRAM for monochrome.... and this is the sort of thing that FPGAs are good at:

  • Real-time processing, giving a new frame every frame regardless of the number of bits on the screen
  • Multiple parallel bit operations (in this case calculating 8 pixels at once, one cycle in four).
  • Making the most use of memory bandwidth - although the RAM chip is rated at 200MB/s, this only needs 19MB/sec read and 19MB/sec write
  • Low power - less than 2.5W vs that of a PC or GPU
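
The bandwidth figures in the list can be checked with a little arithmetic. The sketch below assumes one byte read and one byte written per four-clock memory cycle at the 74.25MHz design clock, which is how the memory controller is described later on this page. The variable names are illustrative only.

```python
CLOCK_HZ = 74_250_000          # half of the 148.5MHz pixel clock
CLOCKS_PER_MEMORY_CYCLE = 4    # one read slot plus one write slot, two clocks each
BYTES_PER_SLOT = 1             # eight one-bit pixels per access

read_bytes_per_sec = CLOCK_HZ // CLOCKS_PER_MEMORY_CYCLE * BYTES_PER_SLOT
write_bytes_per_sec = read_bytes_per_sec
total_bytes_per_sec = read_bytes_per_sec + write_bytes_per_sec

print(read_bytes_per_sec / 1e6)    # 18.5625 (roughly the 19MB/s quoted)
print(total_bytes_per_sec / 1e6)   # 37.125
```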

I'm only using 8% of my little Spartan 6 LX9. I could take the output of the present 'life engine' and fire it through a second instance, then through a third and so on... I guess I could get 600 generations per second with a little bit more design work (but of course then I wouldn't be able to display every frame).

On a larger FPGA where I can uncouple the display and calculation clocks I could run the memory interface at full speed (200MB/s vs the 36MB/s that I use now), calculate 75 levels deep and use 16-bit words, giving 150x the performance of this project.

From a software point of view, here are some estimated timings for a naive implementation of life in software:

  • Nine instructions to test the pixels to update
  • Three cycles to test the result
  • A conditional jump
  • One write to memory to update the result
  • Calls to the driver to update the frame buffer (or reaching over the PCI bus to access VRAM)
  • Lots of L3 cache accesses - the working set is bigger than most L1 and L2 data caches

At a conservative guess of 20 cycles per pixel (including stalls on memory), a single 2.4GHz CPU core might just be able to perform the 124,416,000 pixel updates per second needed to run at the 60Hz refresh, making the FPGA solution 32x faster, clock for clock, and maybe also 30x more power efficient.
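
The numbers in that estimate can be reproduced directly. This is just the arithmetic from the paragraph above written out in Python; the 20 cycles per pixel figure is the guess from the text, not a measurement.

```python
WIDTH, HEIGHT, FPS = 1920, 1080, 60
CYCLES_PER_PIXEL = 20            # the conservative guess from the text
CPU_HZ = 2_400_000_000           # single 2.4GHz core
FPGA_HZ = 74_250_000             # clock used by this design

pixel_updates = WIDTH * HEIGHT * FPS
cpu_cycles_needed = pixel_updates * CYCLES_PER_PIXEL

print(pixel_updates)                 # 124416000
print(cpu_cycles_needed)             # 2488320000, a touch over 2.4GHz
print(round(CPU_HZ / FPGA_HZ, 1))    # 32.3
```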

Far more efficient life implementations exist (which only examine and update the areas that could possibly change), but their worst case performance is pretty much the same as the numbers above.

Why is the code so big?

The code is so big because it implements the following:

  • The 'Life' calculation itself
  • The 1080p analogue VGA output with 12 bit colour (although we only show black and white).
  • The SRAM memory controller

All this is implemented with only simple binary operations like 'AND', 'OR' and the '+' and '-' operators.

Compared with the PC based solution there is:

  • No help from an operating system
  • No help from display drivers
  • No help from display firmware
  • No help from a GPU
  • No help from a CPU
  • No help from software frameworks or class libraries (like MFC)
  • No help from runtime libraries
  • No help from any CPU microcode
  • No help from a chipset - e.g. memory controller, PCI bridge
  • No help from any memory caching

What are the nice bits of the design?

  • 1080p needs a 148.5MHz pixel clock, but this design clocks at 74.25MHz using DDR outputs to drive the display
  • It is a relatively simple, understandable memory controller design
  • The memory read used for the VGA display is fed straight into the calculation engine, halving memory bandwidth

I think that the memory access is sweet. Each memory controller cycle consists of two slots, each of two clock cycles. The first slot is always a read, and the second slot is an optional write. The controller can also be put into an 'idle' state that allows it to sync with the start of a scan line with a length that doesn't divide evenly by two. Here's a pictorial representation of the access pattern (blue is a read for video and to allow calculation, green is a read for calculation only, and red is a write of new values):

Hidef life mem.png
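
The slot pattern described above can be mimicked with a tiny software model. This is a sketch based on a reading of mem_interface.vhd further down the page: states "00" and "01" form the read slot, "10" and "11" the write slot, and "11" doubles as the idle state. The function name and the assumption that the controller starts in state 0 are illustrative, not taken from the design.

```python
SLOT = {0: "read", 1: "read", 2: "write", 3: "write/idle"}


def memory_states(idle_flags, state=0):
    """Yield the controller state for each clock, given the idle input per clock."""
    for idle in idle_flags:
        yield state
        if state == 3:
            # State 3 is also the idle state: hold here until idle is released.
            state = 3 if idle else 0
        else:
            state += 1


# One full memory cycle with idle held low.
print([SLOT[s] for s in memory_states([0, 0, 0, 0])])
# ['read', 'read', 'write', 'write/idle']

# Holding idle high for two clocks stretches the cycle.
print(list(memory_states([0, 0, 0, 1, 1, 0, 0])))
# [0, 1, 2, 3, 3, 3, 0]
```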

What are the horrid bits of the design?

  • Simulation of the memory controller and real life do not agree. The timing of the write enable pulse differs from reality. I think that this is because in simulation the DDR component latches after the memory controller's process is called. This took ages to find and fix.
  • There is no way to initialise the SRAM at the moment. To load something in there I start another design, and then use the 'junk' left in the SRAM. The easy option is to hook up a random number generator, or maybe a UART allowing designs to be downloaded.
  • Now that it is up and running it is very underwhelming. The detail is just so fine and at times so much is going on that it is just uninteresting - the animated GIFs on Wikipedia are more impressive.
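
As an illustration of the "random number generator" option for filling the SRAM, here is a software sketch of a 16-bit linear feedback shift register, a structure that maps naturally onto a few flip-flops and XOR gates. The tap positions (16, 14, 13, 11) are a commonly quoted maximal-length choice and are an assumption here, as are the function names. Nothing like this exists in the original design.

```python
def lfsr16(state: int) -> int:
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one step."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def random_bytes(seed: int, count: int) -> bytes:
    """Collect `count` pseudo-random bytes, one LFSR step per bit."""
    state = seed & 0xFFFF
    out = bytearray()
    for _ in range(count):
        byte = 0
        for _ in range(8):
            state = lfsr16(state)
            byte = (byte << 1) | (state & 1)
        out.append(byte)
    return bytes(out)


# One screen line of the 1920-pixel display is 240 bytes.
line = random_bytes(0xACE1, 240)
print(len(line))   # 240
```

The seed must be non-zero, because an all-zero register never leaves the zero state.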

Source files

hidef_life.vhd

I've tried as much as possible to keep all the magic in the set of constants - VGA timings, when features are enabled and so on.

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

entity hidef_life is
    Port ( clk_32   : in  STD_LOGIC;
           red      : out STD_LOGIC_VECTOR (3 downto 0);
           green    : out STD_LOGIC_VECTOR (3 downto 0);
           blue     : out STD_LOGIC_VECTOR (3 downto 0);
           hsync    : out STD_LOGIC;
           vsync    : out STD_LOGIC;
           button   : in  STD_LOGIC_VECTOR (1 downto 0);
           
           mem_addr : out   STD_LOGIC_VECTOR (17 downto 0);
           mem_data : inout STD_LOGIC_VECTOR (15 downto 0);
           mem_ce   : out   STD_LOGIC;
           mem_oe   : out   STD_LOGIC;
           mem_we   : out   STD_LOGIC;
           mem_be   : out   STD_LOGIC;
           
           serialrx : in  STD_LOGIC);
end hidef_life;

architecture Behavioral of hidef_life is
   constant hVisible    : natural := 1920;
   constant hSyncStart  : natural := 1920+32;
   constant hSyncEnd    : natural := 1920+32+696;
   constant hTotalCount : natural := 1920+32+696+32;

   constant vVisible    : natural := 1080;
   constant vSyncStart  : natural := 1080+22;
   constant vSyncEnd    : natural := 1080+22+11;
   constant vTotalCount : natural := 1080+22+11+22;

   -- When the 'life engine' is updated
   constant hEngineEnd : natural := hVisible/2 + 12; 
   constant vEngineEnd : natural := vVisible + 2;

   -- when we will perform write cycles
   constant hWriteStart : NATURAL := 24/2;
   constant hWriteEnd   : NATURAL := (hVisible+24)/2;

   constant vWriteStart : NATURAL := 2;
   constant vWriteEnd   : NATURAL := vVisible+2;

   -- When to move the memory controller into an "idle" state, allowing it to sync to the start of the next line
   constant hMemoryIdle : natural := (hVisible + 28)/2;
   constant vMemoryIdle : natural := vVisible + 3;

   -- Starting values for the writing address
   constant hWriteColumnStart : natural := hVisible/8 - 2;
   constant hWriteRowStart    : natural := vVisible - 1;

   COMPONENT address_calculator
   PORT(
      column  : IN  std_logic_vector(7  downto 0);
      row     : IN  std_logic_vector(10 downto 0);          
      address : OUT std_logic_vector(17 downto 0)
   );
   END COMPONENT;

   component clocking
   port
      (-- Clock inports
      CLK_32           : in     std_logic;
      -- Clock outports
      CLK_PIXEL        : out    std_logic;
      CLK_PIXELN       : out    std_logic
      );
   end component;

   COMPONENT mem_interface
   PORT(
      clk          : IN std_logic;
      clkn         : IN std_logic;
      idle         : IN std_logic;
      
      writeAddress : IN std_logic_vector(17 downto 0);
      writeData    : IN std_logic_vector(7 downto 0);
      writeEnable  : IN std_logic;

      readAddress  : IN std_logic_vector(17 downto 0);          
      readData     : OUT std_logic_vector(7 downto 0);
      readReady    : OUT std_logic;
      
      mem_data     : INOUT std_logic_vector(15 downto 0);      
      mem_addr     : OUT std_logic_vector(17 downto 0);
      mem_ce       : OUT std_logic;
      mem_oe       : OUT std_logic;
      mem_we       : OUT std_logic;
      mem_be       : OUT std_logic
      );
   END COMPONENT;

   COMPONENT ddr_pixels
   PORT(
      clk         : IN  std_logic;
      clkn        : IN  std_logic;
      pixels      : IN  std_logic_vector(7 downto 0);          
      next_pixels : IN  std_logic;
      red         : OUT std_logic_vector(3 downto 0);
      green       : OUT std_logic_vector(3 downto 0);
      blue        : OUT std_logic_vector(3 downto 0)
      );
   END COMPONENT;

   COMPONENT life_engine
   PORT(
      clk      : IN std_logic;
      ce       : IN std_logic;
      word_in  : IN std_logic_vector(7 downto 0);
      word_out : OUT std_logic_vector(7 downto 0)
      );
   END COMPONENT;

   signal clk_pixel       : std_logic;
   signal clk_pixeln      : std_logic;

   signal  pixel_data        : std_logic_vector(7 downto 0);

   signal  displayPixels     : std_logic;
   signal  engineCE          : std_logic;
   signal  nextHsync         : std_logic;
   signal  nextVsync         : std_logic;
   
   signal  vCounter            : std_logic_vector(10 downto 0) := (others => '0');
   signal  hCounter            : std_logic_vector(10 downto 0) := (others => '0');

   signal  readAddress       : std_logic_vector(17 downto 0);
   signal  readData          : std_logic_vector( 7 downto 0);
   signal  readColumn        : std_logic_vector( 7 downto 0) := (others => '0');
   signal  readRow           : std_logic_vector(10 downto 0) := (others => '0');
   signal  readReady         : std_logic;

   signal  writeAddress      : std_logic_vector(17 downto 0);
   signal  writeData         : std_logic_vector( 7 downto 0);
   signal  writeColumn       : std_logic_vector( 7 downto 0) := "00000000" + hWriteColumnStart;
   signal  writeRow          : std_logic_vector(10 downto 0) := "00000000000" + hWriteRowStart;
   signal  writeEnable       : std_logic;
   
   signal  memory_idle       : std_logic := '0';
   
   -- used to keep the updates visible
   signal slowcounter        : std_logic_vector(2 downto 0) := (others => '0');
   signal updateScreen       : std_logic := '0'; 
   

begin

clocking_inst : clocking
   port map (
      CLK_32      => clk_32,
      CLK_PIXEL  => clk_pixel,
      CLK_PIXELN => clk_pixeln
   );

   Inst_ddr_pixels: ddr_pixels PORT MAP(
      clk         => clk_pixel,
      clkn        => clk_pixeln,
      pixels      => readData,
      next_pixels => displayPixels,
      red         => red,
      green       => green,
      blue        => blue
   );

Inst_life_engine: life_engine PORT MAP(
      clk      => clk_pixel,
      ce       => engineCE,
      word_in  => readData,
      word_out => writeData
   );

Inst_mem_interface: mem_interface PORT MAP(
      clk          => clk_pixel,
      clkn         => clk_pixeln,
      idle         => memory_idle,
      -- Write port
      writeAddress => writeAddress,
      writeData    => writeData,
      writeEnable  => writeEnable,
      -- read port
      readReady    => readReady,
      readAddress  => readAddress,
      readData     => readData,
      -- SRAM interface
      mem_addr     => mem_addr,
      mem_data     => mem_data,
      mem_ce       => mem_ce,
      mem_oe       => mem_oe,
      mem_we       => mem_we,
      mem_be       => mem_be
   );

   write_address_calculator: address_calculator PORT MAP(
      column  => writeColumn,
      row     => writeRow,
      address => writeAddress 
   );   

   read_address_calculator: address_calculator PORT MAP(
      column  => readColumn,
      row     => readRow,
      address => readAddress 
   );   

   process(hCounter,vCounter,readReady,updateScreen)
   begin
      engineCE      <= '0';
      displayPixels <= '0';
      nextHsync     <= '1';
      nextVsync     <= '1';
      memory_idle   <= '0';
      writeEnable   <= '0';
      

      -- The +2 is to allow row[n-1] before the first visible line, and reading row[0] after the last line
      if hCounter < hEngineEnd and vCounter < vEngineEnd  then
         engineCE <= readReady;
      end if;

      if hcounter < hVisible/2 and vCounter < vVisible then
         displayPixels <= readReady;
      end if;

      if hcounter >= hWriteStart and hcounter < hWriteEnd and vCounter >= vWriteStart and vCounter < vWriteEnd then
         writeEnable <= updateScreen;
      end if;
      
      -- The idle state allows the memory to sync up with the start of line in the display line even when it isn't divisible by 4
      if hcounter > hMemoryIdle  or vCounter > vMemoryIdle then
         memory_idle <= '1';
      end if;
      
      if hcounter >= hSyncStart/2 and hcounter < hSyncEnd/2 then
         nextHsync <= '0';
      end if;
      
      if vcounter >= vSyncStart and vcounter < vSyncEnd then
         nextVsync <= '0';
      end if;

   end process;
   
   process(clk_pixel)
   begin
      if rising_edge(clk_pixel) then
         hsync <= nextHsync;
         vsync <= nextVsync;
         
         -- Move onto the next read and write address every fourth count.
         if hcounter(1 downto 0) = "11" then
            if writeColumn = hVisible/8 - 1 then
               writeColumn <= (others => '0');
            else
               writeColumn <= writeColumn+1;
            end if;
            
            if readColumn = hVisible/8 - 1 then
               readColumn <= (others => '0');
            else
               readColumn <= readColumn+1;
            end if;   
         end if;
   
   
         if hCounter /= hTotalCount/2-1 then
            hcounter <= hcounter+1;
         else
            hcounter <= (others => '0');
            readColumn  <= "00000000";
            writeColumn <= "00000000" + hWriteColumnStart;         
            
            if vCounter = vTotalCount-1 then
               vCounter <= (others => '0');
               readRow  <= (others => '0');
               writeRow <= "00000000000" + hWriteRowStart;
               
               -- This is for updating the screen at a reduced rate.
               slowcounter <= slowcounter+1;
               updateScreen <= button(0);
               if slowcounter = 0 and button(0) = '0' then
                 updateScreen <= button(1);
               end if;
            else
               vCounter <= vCounter+1;
               if readRow = vVisible-1 then
                  readRow <= (others => '0');
               else
                  readRow <= readRow+1;
               end if;
               
               if writeRow = vVisible-1 then
                  writeRow <= (others => '0');
               else
                  writeRow <= writeRow+1;
               end if;               
            end if;
         end if;
      end if;
   end process;
end Behavioral;

address_calculator.vhd

To calculate the address you need to multiply the row by 240 then add the column. Initially I tried to calculate row*256 - row*16 + column, but during debugging I didn't trust it when I had some verification issues. Instead I am calculating (8*row + 4*row + 2*row + row) * 16 + column...

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

entity address_calculator is
    Port ( column  : in  STD_LOGIC_VECTOR ( 7 downto 0);
           row     : in  STD_LOGIC_VECTOR (10 downto 0);
           address : out STD_LOGIC_VECTOR (17 downto 0));
end address_calculator;

architecture Behavioral of address_calculator is
   signal row240  : std_logic_vector(17 downto 0);
   signal row15   : std_logic_vector(13 downto 0);
   signal result  : std_logic_vector(17 downto 0);
begin
   row15 <= (row & "000")+(row&"00")+(row&"0")+row;
   row240 <= row15&"0000";
   result  <= row240 + column;
   address <= result(17 downto 0);
end Behavioral;
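
The shift-and-add address calculation above can be cross-checked in software. The sketch below mirrors the VHDL expression in Python and compares it with a plain multiply and with the row*256 - row*16 form mentioned in the text. The function name is illustrative.

```python
def address(row: int, column: int) -> int:
    """Mirror of address_calculator: (8*row + 4*row + 2*row + row) * 16 + column."""
    row15 = (row << 3) + (row << 2) + (row << 1) + row
    row240 = row15 << 4
    return (row240 + column) & 0x3FFFF    # the address bus is 18 bits wide


BYTES_PER_LINE = 1920 // 8    # 240 bytes per row
LINES = 1080

# The last byte of the frame still fits in the 18-bit address space.
print(address(LINES - 1, BYTES_PER_LINE - 1))   # 259199
print(2 ** 18)                                  # 262144
```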

clocking.vhd

Use the IP core generator to generate two signals (clk_pixel and clk_pixeln) as close to half the pixel clock as possible with 180 degrees of phase shift between them.

Here are the internal attributes:

BANDWIDTH            => "OPTIMIZED",
CLK_FEEDBACK         => "CLKFBOUT",
COMPENSATION         => "SYSTEM_SYNCHRONOUS",
DIVCLK_DIVIDE        => 1,
CLKFBOUT_MULT        => 30,
CLKFBOUT_PHASE       => 0.000,
CLKOUT0_DIVIDE       => 13,
CLKOUT0_PHASE        => 0.000,
CLKOUT0_DUTY_CYCLE   => 0.500,
CLKOUT1_DIVIDE       => 13,
CLKOUT1_PHASE        => 180.000,
CLKOUT1_DUTY_CYCLE   => 0.500,
CLKIN_PERIOD         => 31.250,
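
The attributes above determine how close the generated clock gets to the ideal 74.25MHz. A quick calculation, assuming the usual PLL relationship of input clock times CLKFBOUT_MULT, divided by DIVCLK_DIVIDE and then by CLKOUT0_DIVIDE:

```python
CLKIN_HZ = 32_000_000        # CLKIN_PERIOD of 31.25ns
CLKFBOUT_MULT = 30
DIVCLK_DIVIDE = 1
CLKOUT0_DIVIDE = 13
TARGET_HZ = 74_250_000       # half of the 148.5MHz pixel clock

vco_hz = CLKIN_HZ * CLKFBOUT_MULT / DIVCLK_DIVIDE
clk_pixel_hz = vco_hz / CLKOUT0_DIVIDE
error_percent = (clk_pixel_hz - TARGET_HZ) / TARGET_HZ * 100

print(round(clk_pixel_hz / 1e6, 3))   # 73.846
print(round(error_percent, 2))        # -0.54
```

So the generated clock is about half a percent slow, which matches the remark later on this page that the pixel clock is a little bit out of spec.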

ddr_pixels.vhd

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

Library UNISIM;
use UNISIM.vcomponents.all;

entity ddr_pixels is
    Port ( clk         : in  STD_LOGIC;
           clkn        : in  STD_LOGIC;
           pixels      : in  STD_LOGIC_VECTOR (7 downto 0);
           next_pixels : in  STD_LOGIC;
           red         : out STD_LOGIC_VECTOR (3 downto 0);
           green       : out STD_LOGIC_VECTOR (3 downto 0);
           blue        : out STD_LOGIC_VECTOR (3 downto 0)
          );
end ddr_pixels;

architecture Behavioral of ddr_pixels is   
   signal shiftReg : std_logic_vector(7 downto 0)  := (others => '0');
begin

green0_ODDR2 : ODDR2 generic map(DDR_ALIGNMENT => "C0", INIT => '0', SRTYPE => "ASYNC")
               port map (Q => green(0), C0 => Clk, C1 => Clkn, CE => '1', D0 => shiftReg(7), D1 => shiftReg(6), R => '0', S => '0');
green1_ODDR2 : ODDR2 generic map(DDR_ALIGNMENT => "C0", INIT => '0', SRTYPE => "ASYNC")
               port map (Q => green(1), C0 => Clk, C1 => Clkn, CE => '1', D0 => shiftReg(7), D1 => shiftReg(6), R => '0', S => '0');
green2_ODDR2 : ODDR2 generic map(DDR_ALIGNMENT => "C0", INIT => '0', SRTYPE => "ASYNC") 
               port map (Q => green(2), C0 => Clk, C1 => Clkn, CE => '1', D0 => shiftReg(7), D1 => shiftReg(6), R => '0', S => '0');
green3_ODDR2 : ODDR2 generic map(DDR_ALIGNMENT => "C0", INIT => '0', SRTYPE => "ASYNC") 
               port map (Q => green(3), C0 => Clk, C1 => Clkn, CE => '1', D0 => shiftReg(7), D1 => shiftReg(6), R => '0', S => '0');

blue0_ODDR2  : ODDR2 generic map(DDR_ALIGNMENT => "C0", INIT => '0', SRTYPE => "ASYNC") 
               port map (Q => blue(0),  C0 => Clk, C1 => Clkn, CE => '1', D0 => shiftReg(7), D1 => shiftReg(6), R => '0', S => '0');
blue1_ODDR2  : ODDR2 generic map(DDR_ALIGNMENT => "C0", INIT => '0', SRTYPE => "ASYNC") 
               port map (Q => blue(1),  C0 => Clk, C1 => Clkn, CE => '1', D0 => shiftReg(7), D1 => shiftReg(6), R => '0', S => '0');
blue2_ODDR2  : ODDR2 generic map(DDR_ALIGNMENT => "C0", INIT => '0', SRTYPE => "ASYNC")
               port map (Q => blue(2),  C0 => Clk, C1 => Clkn, CE => '1', D0 => shiftReg(7), D1 => shiftReg(6), R => '0', S => '0');
blue3_ODDR2  : ODDR2 generic map(DDR_ALIGNMENT => "C0", INIT => '0', SRTYPE => "ASYNC") 
               port map (Q => blue(3),  C0 => Clk, C1 => Clkn, CE => '1', D0 => shiftReg(7), D1 => shiftReg(6), R => '0', S => '0');

red0_ODDR2   : ODDR2 generic map(DDR_ALIGNMENT => "C0", INIT => '0', SRTYPE => "ASYNC") 
               port map (Q => red(0),   C0 => Clk, C1 => Clkn, CE => '1', D0 => shiftReg(7), D1 => shiftReg(6), R => '0', S => '0');
red1_ODDR2   : ODDR2 generic map(DDR_ALIGNMENT => "C0", INIT => '0', SRTYPE => "ASYNC")
               port map (Q => red(1),   C0 => Clk, C1 => Clkn, CE => '1', D0 => shiftReg(7), D1 => shiftReg(6), R => '0', S => '0');
red2_ODDR2   : ODDR2 generic map(DDR_ALIGNMENT => "C0", INIT => '0', SRTYPE => "ASYNC") 
               port map (Q => red(2),   C0 => Clk, C1 => Clkn, CE => '1', D0 => shiftReg(7), D1 => shiftReg(6), R => '0', S => '0');
red3_ODDR2   : ODDR2 generic map(DDR_ALIGNMENT => "C0", INIT => '0', SRTYPE => "ASYNC") 
               port map (Q => red(3),   C0 => Clk, C1 => Clkn, CE => '1', D0 => shiftReg(7), D1 => shiftReg(6), R => '0', S => '0');

   process(clk)
   begin
     if rising_edge(clk) then
       if next_pixels = '1' then
         shiftReg <= pixels;
       else
          shiftReg   <= shiftReg(5 downto 0) & "00";
       end if;
     end if;
   end process;
end Behavioral;
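
To make the pixel ordering explicit, here is a small Python model of the shift register and DDR outputs above. It assumes a byte is loaded when next_pixels is high and then shifted by two bits on each following clock, with bit 7 going out on the clk edge and bit 6 on the clkn edge. The function name is made up for this sketch.

```python
def pixels_from_byte(byte: int) -> list:
    """Return the eight pixels of one byte in display order, two per clock."""
    shift_reg = byte & 0xFF
    pixels = []
    for _ in range(4):                        # four clocks per byte
        pixels.append((shift_reg >> 7) & 1)   # D0, sent on the clk edge
        pixels.append((shift_reg >> 6) & 1)   # D1, sent on the clkn edge
        shift_reg = (shift_reg << 2) & 0xFF   # shiftReg(5 downto 0) & "00"
    return pixels


print(pixels_from_byte(0b10110001))   # [1, 0, 1, 1, 0, 0, 0, 1]
```

In other words the most significant bit of each byte is the leftmost pixel on screen.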

life_engine.vhd

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity life_engine is
    Port ( clk      : in  STD_LOGIC;
           ce       : in  STD_LOGIC;
           word_in  : in  STD_LOGIC_VECTOR (7 downto 0);
           word_out : out  STD_LOGIC_VECTOR (7 downto 0));
end life_engine;

architecture Behavioral of life_engine is
   COMPONENT life_evaluator
   PORT(
      top    : IN std_logic_vector(2 downto 0);
      middle : IN std_logic_vector(2 downto 0);
      bottom : IN std_logic_vector(2 downto 0);          
      result : OUT std_logic
      );
   END COMPONENT;

   COMPONENT data_delay
   PORT (
      d   : IN STD_LOGIC_VECTOR(7 DOWNTO 0);
      clk : IN STD_LOGIC;
      ce  : IN STD_LOGIC;
      q   : OUT STD_LOGIC_VECTOR(7 DOWNTO 0)
      );
   END COMPONENT;

   signal top_line, middle_line, bottom_line : std_logic_vector(16 downto 0);
   signal delay1_in, delay1_out              : std_logic_vector(7 downto 0);
   signal delay2_in, delay2_out              : std_logic_vector(7 downto 0);
begin
   delay1_in   <= word_in;
   delay2_in <= delay1_out;

--   word_out <= middle_line(15 downto 8);

delay1 : data_delay PORT MAP (d => delay1_in, clk => clk, ce => ce, q => delay1_out);
delay2 : data_delay PORT MAP (d => delay2_in, clk => clk, ce => ce, q => delay2_out);
   
gen1:   for i in 0 to 7 generate
   life_evaluator_15: life_evaluator PORT MAP(
      top    => top_line   (i+9 downto i+7),
      middle => middle_line(i+9 downto i+7),
      bottom => bottom_line(i+9 downto i+7),
      result => word_out(i)
   );
   end generate;

   process(clk)
   begin
      if rising_edge(clk) then
         if ce = '1' then
            top_line    <= top_line(8 downto 0)    & word_in;
            middle_line <= middle_line(8 downto 0) & delay1_out;
            bottom_line <= bottom_line(8 downto 0) & delay2_out;
         end if;
      end if;
   end process;
end Behavioral;
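
The windowing in the generate loop above is easier to follow with a software model. The sketch below treats each 17-bit line register as a Python integer: bits 15 to 8 hold the byte being evaluated, bit 16 is the last pixel of the previous byte and bit 7 is the first pixel of the next one. It is a reference model written for this page, not a translation the design depends on, and the function name is illustrative.

```python
def engine_output(top_line: int, middle_line: int, bottom_line: int) -> int:
    """Compute the 8 result bits from three 17-bit line registers."""
    word_out = 0
    for i in range(8):
        # Each evaluator sees bits i+9 downto i+7 of every line.
        top = (top_line >> (i + 7)) & 0b111
        middle = (middle_line >> (i + 7)) & 0b111
        bottom = (bottom_line >> (i + 7)) & 0b111
        alive = (middle >> 1) & 1
        neighbours = (bin(top).count("1") + bin(bottom).count("1")
                      + ((middle >> 2) & 1) + (middle & 1))
        if neighbours == 3 or (neighbours == 2 and alive):
            word_out |= 1 << i
    return word_out


# A vertical run of three cells (bit 10 in each line) turns horizontal.
cell = 1 << 10
print(bin(engine_output(cell, cell, cell)))   # 0b1110
```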

life_evaluator.vhd

The engine above is where the bytes read from memory are buffered and eight new bits are calculated at once. The evaluator below works out each new bit from its 3x3 neighbourhood:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

entity life_evaluator is
    Port ( top : in  STD_LOGIC_VECTOR (2 downto 0);
           middle : in  STD_LOGIC_VECTOR (2 downto 0);
           bottom : in  STD_LOGIC_VECTOR (2 downto 0);
           result : out  STD_LOGIC);
end life_evaluator;

architecture Behavioral of life_evaluator is
   signal present_state : std_logic;
   signal first_four    : std_logic_vector(3 downto 0);
   signal last_four     : std_logic_vector(3 downto 0);
   
   signal first_count   : std_logic_vector(3 downto 0);
   signal last_count     : std_logic_vector(3 downto 0);
   
   signal total_count   : std_logic_vector(3 downto 0);
begin
   first_four    <= top & middle(2);
   present_state <= middle(1);
   last_four     <= middle(0) & bottom;
   
   with first_four select first_count <= 
         "0000" when "0000",
         "0001" when "0001",
         "0001" when "0010",
         "0010" when "0011",
         "0001" when "0100",
         "0010" when "0101",
         "0010" when "0110",
         "0011" when "0111",
         "0001" when "1000",
         "0010" when "1001",
         "0010" when "1010",
         "0011" when "1011",
         "0010" when "1100",
         "0011" when "1101",
         "0011" when "1110",
         "0100" when others;

   with last_four select last_count <= 
         "0000" when "0000",
         "0001" when "0001",
         "0001" when "0010",
         "0010" when "0011",
         "0001" when "0100",
         "0010" when "0101",
         "0010" when "0110",
         "0011" when "0111",
         "0001" when "1000",
         "0010" when "1001",
         "0010" when "1010",
         "0011" when "1011",
         "0010" when "1100",
         "0011" when "1101",
         "0011" when "1110",
         "0100" when others;

   total_count <= first_count + last_count;
   
   with total_count  select result <= 
       present_state when "0010",  -- present state when 2
       '1'           when "0011",  -- Alive when count = 3
       '0'           when others;  -- otherwise dead
            
end Behavioral;
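
The evaluator counts neighbours by splitting the eight surrounding bits into two groups of four, looking up each group's bit count, and adding the two. A Python sketch of the same idea, checked exhaustively against a direct count, might look like this (the names `COUNT4` and `evaluate` are illustrative):

```python
# Number of set bits for every 4-bit value, same contents as the two select tables.
COUNT4 = [bin(n).count("1") for n in range(16)]


def evaluate(top: int, middle: int, bottom: int) -> int:
    """Return the new state of the centre cell; each argument is a 3-bit row."""
    first_four = (top << 1) | ((middle >> 2) & 1)    # top & middle(2)
    present_state = (middle >> 1) & 1                # middle(1)
    last_four = ((middle & 1) << 3) | bottom         # middle(0) & bottom
    total_count = COUNT4[first_four] + COUNT4[last_four]
    if total_count == 2:
        return present_state
    if total_count == 3:
        return 1
    return 0


print(COUNT4)   # [0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4]
```

Two 16-entry tables and a small adder are much cheaper than one 256-entry table covering all eight neighbours at once.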

mem_interface.vhd

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

Library UNISIM;
use UNISIM.vcomponents.all;

entity mem_interface is
    Port ( clk          : in  STD_LOGIC;
           clkn         : in  STD_LOGIC;
           
           idle         : in  STD_LOGIC;
           
           writeAddress : in  STD_LOGIC_VECTOR (17 downto 0);
           writeData    : in  STD_LOGIC_VECTOR (7 downto 0);
           writeEnable  : in  STD_LOGIC;
           
           readReady    : out STD_LOGIC;
           readAddress  : in  STD_LOGIC_VECTOR (17 downto 0);
           readData     : out STD_LOGIC_VECTOR (7 downto 0);
           
           mem_addr : out STD_LOGIC_VECTOR (17 downto 0);
           mem_data : inout STD_LOGIC_VECTOR (15 downto 0);
           mem_ce   : out STD_LOGIC;
           mem_oe   : out STD_LOGIC;
           mem_we   : out STD_LOGIC;
           mem_be   : OUT std_logic);
end mem_interface;

architecture Behavioral of mem_interface is
   signal ddrwe         : std_logic_vector(1 downto 0);
   signal state         : std_logic_vector(1 downto 0);
   signal memdata_hold  : std_logic_vector(15 downto 0);
   signal external_data : std_logic_vector(15 downto 0);
   signal writing       : std_logic := '0';
   signal tristate      : std_logic := '1';
begin
   mem_ce <= '0';
   mem_be <= '0';

we_ODDR2 : ODDR2
   generic map(DDR_ALIGNMENT => "C0", INIT => '1', SRTYPE => "ASYNC")
   port map (Q => mem_we, C0 => Clk, C1 => Clkn, CE => '1', D0 => ddrwe(0), D1 => ddrwe(1), R => '0', S => '0');

   gen1:   for i in 0 to 15 generate
      IOBUF_inst : IOBUF
      generic map (
         DRIVE            => 12,
         IBUF_DELAY_VALUE => "0", -- Specify the amount of added input delay for buffer, "0"-"12" 
         IFD_DELAY_VALUE  => "AUTO", -- Specify the amount of added delay for input register, "AUTO", "0"-"6" 
         IOSTANDARD       => "LVTTL",
         SLEW             => "SLOW")
      port map (
         O  => external_data(i), -- Buffer output
         IO => mem_data(i),      -- Buffer inout port (connect directly to top-level port)
         I  => memdata_hold(i),  -- Buffer input
         T  => tristate          -- 3-state enable input, high=input, low=output 
      );
   end generate;
   
   with state select tristate <= '1' when "00",
                                 '1' when "01",
                                 '0' when others;

   with state select mem_oe   <= '0' when "00",
                                 '0' when "01",
                                 '1' when others;

   with state select readReady <= '1' when "10",
                                   '0' when others;

   with state select ddrwe     <= (NOT (writeEnable) & "1")  when "01",
                                   ("1" & NOT (writing)) when "10",
                                  "11" when others;

   process(clk)
   begin
      if rising_edge(clk) then
         case state is 
            when "00" =>
               state     <= "01";
               -- first half of the read in progress
            when "01" =>
               state <= "10";
               -- second half of the read
               
               -- capture the read data;
               readData     <= external_data(7 downto 0);
               
               -- set up for the write (regardless of write Enable)
               mem_addr     <= writeAddress;
               writing      <= writeEnable;
               memdata_hold <= "00000000" & writeData;               
            when "10" =>
               -- write in progress (if occurring);
               state <= "11";
               writing   <= '0';                  
            when others =>  -- "11" and also the idle state.
               if idle = '0' then
                 state <= "00";
               end if;
               -- second half of write (if occurring)
               -- also set up for the read;
               mem_addr  <= readAddress;
         end case;
      end if;
   end process;
end Behavioral;

data_delay

Use the IP core generator to create a "RAM based shift register" with a width of 8 bits, and a fixed length of 243. The only control signal it needs is "CE" (Clock Enable).
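
One way to see where the length of 243 plausibly comes from is to count how often the engine is enabled per scan line. This sketch assumes the engine is clocked once per four-clock memory cycle while hCounter is below hEngineEnd, which is how the top-level process appears to gate engineCE. If that reading is right, one delay line holds exactly one line's worth of engine steps.

```python
H_VISIBLE = 1920
H_ENGINE_END = H_VISIBLE // 2 + 12    # hEngineEnd in hidef_life.vhd
CLOCKS_PER_BYTE = 4                   # one readReady pulse per memory cycle

engine_steps_per_line = H_ENGINE_END // CLOCKS_PER_BYTE
visible_bytes_per_line = H_VISIBLE // 8

print(engine_steps_per_line)     # 243
print(visible_bytes_per_line)    # 240
```

The three extra steps beyond the 240 visible bytes would then cover the pipeline slack at each end of the line.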

hidef_life.ucf

These constraints are for the Papilio Plus 1.0 with the buttons and VGA on the Arcade Megawing. Later revisions of the boards will need different constraints!

NET "clk_32" TNM_NET = clk;
NET "clk_32" LOC = "P94" | IOSTANDARD = LVTTL | PERIOD = 31.25ns;


NET Blue(0)     LOC="P99"  | IOSTANDARD=LVTTL;  # B0
NET Blue(1)     LOC="P97"  | IOSTANDARD=LVTTL;  # B1
NET Blue(2)     LOC="P92"  | IOSTANDARD=LVTTL;  # B2
NET Blue(3)     LOC="P87"  | IOSTANDARD=LVTTL;  # B3
 
NET Green(0)    LOC="P84"  | IOSTANDARD=LVTTL;  # B4
NET Green(1)    LOC="P82"  | IOSTANDARD=LVTTL;  # B5
NET Green(2)    LOC="P80"  | IOSTANDARD=LVTTL;  # B6
NET Green(3)    LOC="P78"  | IOSTANDARD=LVTTL;  # B7

NET Red(0)      LOC="P118" | IOSTANDARD=LVTTL;  # C4
NET Red(1)      LOC="P119" | IOSTANDARD=LVTTL;  # C5
NET Red(2)      LOC="P120" | IOSTANDARD=LVTTL;  # C6
NET Red(3)      LOC="P121" | IOSTANDARD=LVTTL;  # C7

NET vSync       LOC="P116" | IOSTANDARD=LVTTL;  # C2
NET hSync       LOC="P117" | IOSTANDARD=LVTTL;  # C3

# Address lines
NET "mem_addr<0>" LOC = "P6"  | IOSTANDARD=LVTTL;
NET "mem_addr<1>" LOC = "P7"  | IOSTANDARD=LVTTL;
NET "mem_addr<2>" LOC = "P9"  | IOSTANDARD=LVTTL;
NET "mem_addr<3>" LOC = "P10"  | IOSTANDARD=LVTTL;
NET "mem_addr<4>" LOC = "P11"  | IOSTANDARD=LVTTL;
NET "mem_addr<5>" LOC = "P141"  | IOSTANDARD=LVTTL;
NET "mem_addr<6>" LOC = "P140"  | IOSTANDARD=LVTTL;
NET "mem_addr<7>" LOC = "P139"  | IOSTANDARD=LVTTL;
NET "mem_addr<8>" LOC = "P138"  | IOSTANDARD=LVTTL;
NET "mem_addr<9>" LOC = "P137"  | IOSTANDARD=LVTTL;
NET "mem_addr<10>" LOC = "P46"  | IOSTANDARD=LVTTL;
NET "mem_addr<11>" LOC = "P45"  | IOSTANDARD=LVTTL;
NET "mem_addr<12>" LOC = "P44"  | IOSTANDARD=LVTTL;
NET "mem_addr<13>" LOC = "P43"  | IOSTANDARD=LVTTL;
NET "mem_addr<14>" LOC = "P41"  | IOSTANDARD=LVTTL;
NET "mem_addr<15>" LOC = "P29"  | IOSTANDARD=LVTTL;
NET "mem_addr<16>" LOC = "P30"  | IOSTANDARD=LVTTL;
NET "mem_addr<17>" LOC = "P32"  | IOSTANDARD=LVTTL;
#NET "addr<18>" LOC = ""; 

# Data lines
NET "mem_data<0>" LOC = "P14"   | IOSTANDARD=LVTTL;
NET "mem_data<1>" LOC = "P15"   | IOSTANDARD=LVTTL;
NET "mem_data<2>" LOC = "P16"   | IOSTANDARD=LVTTL;
NET "mem_data<3>" LOC = "P17"   | IOSTANDARD=LVTTL;
NET "mem_data<4>" LOC = "P5"    | IOSTANDARD=LVTTL;
NET "mem_data<5>" LOC = "P2"    | IOSTANDARD=LVTTL;
NET "mem_data<6>" LOC = "P1"    | IOSTANDARD=LVTTL;
NET "mem_data<7>" LOC = "P143"  | IOSTANDARD=LVTTL;
NET "mem_data<8>" LOC = "P40"   | IOSTANDARD=LVTTL;
NET "mem_data<9>" LOC = "P35"   | IOSTANDARD=LVTTL;
NET "mem_data<10>" LOC = "P34"  | IOSTANDARD=LVTTL;
NET "mem_data<11>" LOC = "P33"  | IOSTANDARD=LVTTL;
NET "mem_data<12>" LOC = "P21"  | IOSTANDARD=LVTTL;
NET "mem_data<13>" LOC = "P22"  | IOSTANDARD=LVTTL;
NET "mem_data<14>" LOC = "P23"  | IOSTANDARD=LVTTL;
NET "mem_data<15>" LOC = "P24"  | IOSTANDARD=LVTTL;

# Control lines
NET "mem_ce" LOC = "P12"  | IOSTANDARD=LVTTL;
NET "mem_we" LOC = "P142" | IOSTANDARD=LVTTL;
NET "mem_oe" LOC = "P27"  | IOSTANDARD=LVTTL;
NET "mem_be" LOC = "P26"  | IOSTANDARD=LVTTL; # This is wired to both UB and LB on early revision boards.

NET "button<0>" LOC="P74"  | IOSTANDARD=LVTTL; # B8   Left
NET "Button<1>" LOC="P59"  | IOSTANDARD=LVTTL; # Right

Design notes

I tried all sorts of design ideas before settling on the final design. Here are some 'dumb ideas' I dreamed up:

  • Having a separate VGA generator, fed through a FIFO from the memory controller
  • Double-buffered display (read from one frame, update the other, then flip roles)
  • Making an SRAM memory controller with two read ports (VGA, data) and one write port, all connecting to different FIFOs

The final design is really simple.

Video timing for 1920x1080 @ 60Hz

My pixel clock is a little bit out of spec, but here are the 'official' timings:

Pixel clock       148.5MHz
Active H pixels   1920
H total           2680
H back porch      32
H sync pulse      696
H front porch     32
Active V lines    1080
V total           1135
V back porch      22
V sync pulse      11
V front porch     22