Comparing programming and FPGAs

From Hamsterworks Wiki!

Jump to: navigation, search

People ask me what the difference is between programming and designing in a Hardware Definition Language. Here is a concrete example to illustrate the different paradigms (although HDL isn't a programming language, so technically it is something completely different...)

To highlight this I've decided to pick a nice simple problem, and then solve the same problem with various different ways.

If I've explained it correctly you will get an appreciation for what FPGAs can give you - high performance, low clock rates, deterministic performance, and a lot of design work.

Contents

The problem

Calculate the average grades of a student, which are being added to a total to allow the average student grade for the whole group to be calculated. Each grade is between 0 and 100, and each student can have between 0 and 7 grades.

P.S. none of these examples have been compiled, so they are most probably riddled with errors.

Object oriented

In the Object Oriented Paradigm (OOP) you tell objects to 'do stuff' by calling methods, and query their attributes.

This is the solution a generic object oriented language, something like C++.

Estimating performance of OOP solutions are awkward - as the data is hidden from you it is hard to tell how much work is being done by each method - for example, does grades.count() walk a linked list? Can you tell if the native format for the grades an integer? perhaps is it a ASCII string that must be converted? There could even be a mix of internal representations, causing different data sets to perform very differently.

Caller

  total = total+student.averageGrade();

Method

int student::averageGrade() {
  int total = 0;

  if(grades.count() == 0)
    return 0;

  Iterator *i = new Iterator(grades);

  while(Iterator.IsValid()) {
    total = total + i.value.asInteger();
    i.next();
  }
  delete(i);

  return total / grades.count();
}

In C

'C' is considered a low-level language a step up from assembler, but abstracted enough to be portable between systems, but no more than that. Rather than objects and message, C has only function calls, statements, loops and conditional execution.

In most cases performance characteristics for a simple function can be estimated pretty reliably. In this example the execution time for 'n' items will be factor*n+constant, where 'constant' will be the call/return overhead and the cost of a division. 'factor' and 'constant' will change for different CPUs (for example your processor might not have a hardware divide), but once a few data points are known performance can be characterized. These performance predictions are best case numbers - it assumes that there is no competition for system resources with other processes, or that the dataset size is that large that the system has to page memory to disk.

In the calling function

total += average(student.grades, student.gradeCount);

Function

int average(unsigned char *values, int count)
{
  int total = 0;
          
  if(count == 0) 
    return 0;

  for(i = 0; i < count; i++)
    total += values[i];
	   
  return total/count;
}

In assembler

Assembler is CPU specific, and for the most part each instruction has a one-to-one correspondence with one of the CPU native instructions.

The source code can not be moved verbatim between platforms that use a different instruction set architectures, but the same algorithms can be usually be coded in similar ways on different CPUs.

Best case performance can be calculated very precisely - 12 instructions and one jump if gradeCount is zero, or 12 instructions plus 4 more per item, including one memory access per item. If you CPU only issues one instruction at time you may even be able to convert this to clock cycles!

So if this was run on a RISC CPU which executes an instruction per clock cycle it will take something like 28 clock cycles to the average of four values.

It is important to note that Assembler still has the concept of subroutines (as most CPUs have a call stack), but lacks 'loops'. Loops are induced in code by jumping back to code that has already been executed. This lack of loops is reflected in the lack of indentation in the source making the code appear to lack structured to somebody who isn't familiar with assembler.

Because of the level of abstraction that the programmer is working at is far removed from the problem extensive comments are required to convey the function of the code to anybody reading it.

In the calling routine

         mov  ax, grades     ; calculate the average grade
         mov  bx, gradeCount
         call average        ; average is in AX, no other registers are changed

The subroutine

average:	; Save registers that will be updated
	 	push bx
		push cx	
		push dx
		
                ; Set up the registers for the loop;
		mov  dx, ax	   ; dx holds the address of the values
		xor  ax, ax	   ; empty ax to hold the total
		xor  cx, cx	   ; empty cx to hold the count of items

		; catch when count of zero
		and  bx,bx	   
		je   exit
			
nextItem:	add  ax, [dx+4*cx] ; Add the item to the total
		inc  cx		   ; Move on to the next item;
		cmp  cx,bx	   ; compare the count against the number to process 
		jl   nextItem	   ; more items to process, so loop

		idiv ax, cx	   ; divide the total by the count

exit		pop dx		   ; restore value in changed registers
		pop cx
		pop bx
		return

In a Hardware Description Language

Hardware Description Languages are completely different to programming - you are describing hardware and not a thread of execution.

  • There are no loops
  • Conditional execution works completely different - 'IF' are implemented using multiplexers or when clock enable inputs are activated on flipflops
  • There are no subroutines, only sub-components - there is no concept of code that can be 'reused'

Performance is very deterministic. When feeding values into the 'averageGrade' component we can feed in a new "grade" every clock cycle (allowing a defined maximum throughput), and can be sure that the output will be available on averageGrade and averageGradeValid a fixed number of clock cycles after 'gradeLast' is asserted - giving a fixed latency that is determined by the latency of the 'div' component that does the division.

Unlike CPUs that have a defined maximum clock cycle the clock cycle of a HDL project is limited by the complexity of the design - the number of 'levels of logic' and the time for signals to propagate within the FPGA. Due to the limitations of FPGA technology the maximum clock rate may be lower, but this design can do in one cycle what requires four in assembler.

Strangely enough, VHDL code is very portable between different FPGAs, but different FPGAs have different performance characteristics so a design will require tweaking for best results.

In the top level module

averageGrade:	average PORT MAP (
		  clk          <= clk,
		  -- Inputs
		  value        <= grade,
		  valueValid   <= gradeValid,
		  valueLast    <= gradeLast,

		  -- Outputs
		  average      <= averageGrade,
		  averageValid <= averageGradeValid
	       );

        -- statements that are active all the time
	with averageGradeValid select totalNext <= total + averageGrade when '1', 
                                                   total when others;

        -- statements that are active when clk changes
        clkproc: process(clk)
	  begin
	    if risingEdge(clk) then
	      total <= totalNext;
	    end if;
	  end process;

This requires a bit of explanation. This code creates hardware component of a type "average" called "averageGrade" that has inputs of the grade and bit signals to indicate if the current value is valid or the last grade for the current student.

When the signal coming out of that component indicates that averageGrade is valid it is added it to the running total to give totalNext, and when the clock signal 'clk' rises we store totalNext into total.

The internals of the 'average' module

This is a direct Register Transfer Level (RTL) implementation of the average function. If implemented by itself in an FPGA it would produce a design that has 20 active pins.

The performance would be determined by either the speed of the divider, or the time required get 'value' into the chip, across to the 16 bit adder that adds 'value' to 'total' and then pass it through the two multiplexers that decide what value gets stored back into total. As written this design will work at 207.340MHz (assuming that the 'division' block can run that fast).

Design changes could be made that will greatly improve speed, if processing 200 million grades per second isn't sufficient.

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

ENTITY average IS
       PORT (
        clk          : in STD_LOGIC;
        -- Inputs
        value        : in STD_LOGIC_VECTOR(7 downto 0);
        valueValid   : in STD_LOGIC;
        valueLast    : in STD_LOGIC;

        -- Outputs
        average      : out STD_LOGIC_VECTOR(7 downto 0);
        averageValid : out STD_LOGIC
          ); 
END average;


architecture Behavioral of average is
   -- used to avoid divide by zero
   signal divisor   : std_logic_vector( 7 downto 0);

   -- values to assign on rising clock
   signal totalNext : std_logic_vector(15 downto 0);
   signal countNext : std_logic_vector( 7 downto 0);

   -- values fed back when calculating the average
   signal totalIn   : std_logic_vector(15 downto 0);
   signal countIn   : std_logic_vector( 7 downto 0);

   -- refisters to hold the current average
   signal total     : std_logic_vector(15 downto 0);
   signal count     : std_logic_vector( 7 downto 0);
   
   signal doDivision : std_logic;

        -- Component declaration for the division
   COMPONENT division16by8 
        PORT (
       clk           : in STD_LOGIC;
      in_valid      : in STD_LOGIC;
      quotent       : in STD_LOGIC_VECTOR(15 downto 0);
      divisor       : in STD_LOGIC_VECTOR(7 downto 0);
      dividend      : out STD_LOGIC_VECTOR(7 downto 0);
      dividendValid : out STD_LOGIC
   );
   END COMPONENT;

begin   
        -- make a new 'division' component called 'div'
   div : division16by8 PORT MAP (
      clk           => clk,
      in_valid      => doDivision,
      quotent       => totalNext,
      divisor       => divisor,
      dividend      => average,
      dividendValid => averageValid
   );


  -- concurent statements - all of these are active all the time.
   doDivision <= valueValid and valueLast;

   -- reset the totals and count when gradeLast is set.
   with valueLast  select totalIn   <= (others => '0') when '1',
                   total when others;

   with valueLast  select countIn   <= (others => '0') when '1',
                   count when others;


   -- add to the total when valueValid is set
   with valueValid select totalNext <= totalIn + value when '1',
                   totalIn when others;

   with valueValid select countNext <= countIn + 1 when '1',
                   countIn when others;

   -- this avoids a divide by zero
   with countNext  select divisor   <= "00000001" when "00000000",
                        countNext when others;
               
      
  -- sequential statements - these are only active when clk changes
clkproc: process (clk)
   begin
     if rising_edge(clk) then
       total <= totalNext;
       count <= countNext;
          end if;
   end process;
end Behavioral;

An even better FPGA solution

The above implementation is fine, but is very 'serial'. If free to design any way we like, a could be designed that didn't accumulate values in a register ('total' in the code above) but accepted all the grades in the same cycle, then puts them through a tree of adders, counts them and passes them to the 'division' component.

Here's the external interface of the component:

ENTITY average IS
       PORT (
        clk          : in STD_LOGIC;

        -- Inputs
        in_valid      : in STD_LOGIC;
        value0        : in STD_LOGIC_VECTOR(7 downto 0);
        value0Valid   : in STD_LOGIC;
        value1        : in STD_LOGIC_VECTOR(7 downto 0);
        value1Valid   : in STD_LOGIC;
        value2        : in STD_LOGIC_VECTOR(7 downto 0);
        value2Valid   : in STD_LOGIC;
        value3        : in STD_LOGIC_VECTOR(7 downto 0);
        value3Valid   : in STD_LOGIC;
        value4        : in STD_LOGIC_VECTOR(7 downto 0);
        value4Valid   : in STD_LOGIC;
        value5        : in STD_LOGIC_VECTOR(7 downto 0);
        value5Valid   : in STD_LOGIC;
        value6        : in STD_LOGIC_VECTOR(7 downto 0);
        value6Valid   : in STD_LOGIC;
        value7        : in STD_LOGIC_VECTOR(7 downto 0);
        value7Valid   : in STD_LOGIC;

        -- Outputs
        average      : out STD_LOGIC_VECTOR(7 downto 0);
        averageValid : out STD_LOGIC
          ); 
END average;

architecture Behavioral of average is
   signal total     : std_logic_vector(15 downto 0);
   signal count     : std_logic_vector( 7 downto 0);
   signal divisor   : std_logic_vector( 7 downto 0);
   
        -- Component declaration for the division
   COMPONENT division16by8 
     PORT (
       clk           : in STD_LOGIC;
       in_valid      : in STD_LOGIC;
       quotent       : in STD_LOGIC_VECTOR(15 downto 0);
       divisor       : in STD_LOGIC_VECTOR(7 downto 0);
       dividend      : out STD_LOGIC_VECTOR(7 downto 0);
       dividendValid : out STD_LOGIC
     );
   END COMPONENT;

begin  
   total <= "0000000000000000" + value0 + value1 + value2 + value3 + value4 + value5 + value6 + value7;

   count <= "000" + value0valid + value1valid + value2valid + value3valid + value4valid + value5valid + value6valid + value7valid;

   -- avoid divide by zeros
   with count select divisor <= "00000001" when count = "00000000",
                                count when others;

   div : division16by8 PORT MAP (
      clk           => clk,
      in_valid      => in_valid,
      quotent       => total,
      divisor       => divisor,
      dividend      => average,
      dividendValid => averageValid
   );
end Behavioral;

Designed this way, the component can accept all of a students grades every every clock cycle, with the average being available a few clock cycles later. This is doing the work of upto 48 assembly instructions in each cycle!

Of course there is a few more unwritten 'contracts' that the caller has to abide by - for example unused values must be 0.

Personal tools