Module 7

From Hamsterworks Wiki!

Jump to: navigation, search

Please make sure that you have completed the earlier modules of FPGA course.

Having problems with your code? Do I need to change something? Want a reference solution to the project? Email me at "nospam-hamster@snap.net.nz", after removing the "nospam-" bit. I'll try to get back to you in a day or so.

Contents

Aims of module

  • See how fast an existing design can be clocked
  • Add a timing constraint
  • Find out what is the critical path in the performance of a design
  • Improve on the existing design

The problem of timing closure

When working with FPGAs at the bleeding edge half the problem is getting a working design. The other half of the problem is getting the design to work fast enough! A lot of effort and trial and error can go into finding an solution that meets your design requirements, and sometimes the best solution isn't the obvious one.

Sometimes it may be possible to change your FPGA for a faster grade part or use vendor specific macros to improve the performance of a design, but often there are significant gains to be made in improving your design without incurring additional cost or limiting your design to on architecture.

Even if your original design easily meets your requirements it is a good idea to look at the critical path and try to improve upon it. Not only is timing closure one of these problems where the more you practice you have the better you get, but usually a faster design will have quicker build times and fewer timing issues to solve due to the design tools will have more 'slack' when implementing a design.

This module's scenario

Imagine we are designing a project that needs to capture the timing and duration of a pulse to with better than 10ns accuracy, and these pulses occur within an interval of four seconds.

To give some design margin this calls for a 250MHz clock for the counter - giving at most 4ns of uncertainty around the start and end of the pulse, and 8ns of uncertainty around the width of each pulse. Due to the timings of up to 4 seconds (1,000,000,000 ticks) a 30 bit counter is required.

So the goals of this design are simple - make a 30 bit counter that runs at 250MHz that can be used to time-stamp events.

So how fast can a design run?

At the moment all of your designs run at one clock speed - that supplied by the on-board oscillator. Although we won't go into it in this module, within the FPGA is a device that can generate a wide range of related clock signals, giving great freedom in how fast your projects to will be clocked.

The obvious question is "How fast can a design run?". The answer is "as fast the chosen FPGA and your design allows".

Lets have a quick look at the two halves of this answer...

How the choice of FPGA changes speed

CPUs, memory and a lot of logic chips, FPGAs come in different speed grades. CPUs are rated in clock speeds of GHz or MHz, and memory comes rated in clock speeds or access times, but for FPGAs there are no "simple numbers" - they come in speed grades. A different speed grade indicates that the devices performance is guaranteed to meet or exceed a set of modeling parameters, and it is these modeling parameters allows the design tools to calculate the performance limits of your design.

Once the design has been compiled and mapped to the FPGA components the tools calculate every path from input/output pins and flip-flops and totals the delay every step of the way (much like finding the critical path in project management software). This is then used to generate a report that can be reviewed from the "Static Timing" section of the "Design Summary Report" window:

M7s1.png

The most useful number is usually right down the bottom:

 Clock to Setup on destination clock clk 
 ---------------+---------+---------+---------+---------+ 
                | Src:Rise| Src:Fall| Src:Rise| Src:Fall| 
 Source Clock   |Dest:Rise|Dest:Rise|Dest:Fall|Dest:Fall| 
 ---------------+---------+---------+---------+---------+ 
 clk            |    4.053|         |         |         | 
 ---------------+---------+---------+---------+---------+

As this design has a minimum clock of 4.053 nanoseconds it can be clocked at up to 246MHz and still be within the FPGA's timing limits.

This isn't fast enough. I could choose to use a faster grade FPGA. But the gains for using a faster, more expensive chip are only minimal - for example going from a Spartan 3E -4 grade to -5 part increases a sample design's maximum speed by 13%.

How the design changes speed

Each flip-flop acts as a start or finish line for a digital signal. When a clock signal ticks the flip-flops assumes a new value, and the updated signal comes out of the flip-flops and starts rippling out through connected logic cells until all logic signals are stable, and all the flip-flops have their updated value presented on their inputs. For a design to work correctly the updated signal has to arrive at the flip-flop with at least enough time to ensure that when the clock ticks again the signal will be reliably captured (called a "setup time").

So the three major components of this time are

  • Routing time - the time it takes to "charge the wires" that route signals between different logic cells. As you can well imagine the drive strength of the source signal, the length of these wires and the number of gates connected to the wires ("fan-out") dictates how much current is required to accurately transfer signals across the FPGA, and therefore the routing time
  • Logic time - this is the time it takes for a logic cell to react to a change of input
  • Setup time - the time required to ensure that the destination of a signal will accurately capture a changed value

As a sweeping generalization, the more complex work that is carried out each clock cycle the greater number of logic blocks will be in the critical path and the slower the design will run. As expected, the more you reduce the complexity of your design the quicker your design will run.

In some cases you can also use components from your FPGA vendor's library of standard building blocks. These building blocks will usually have more efficient implementations as the vendors are better at leveraging architecture-specific features within logic blocks (such as fast carry chains). The 'cost' for using these features is that your design becomes architecture dependent and will need to be re-engineered if you move to a different FPGA.

You would think that a basic component like a 30 bit counter would be hard to improve on, but even in this simple design gains of 20% can be achieved!

Can it be made to run faster without changing the design?

When the tools map the design to the FPGA they can take timing into account - for example, it can try different placements for components on the FPGA until a timing constraint is met. The design tools will then try to meet this constraint.

The "Static Timing Report" will detail any errors, which is where the amount of "slack" time between a signal's source and destination is less than zero - a negative "slack" means "insufficient enough time"):

The quick way to do this

The simple way to add a constraint is to include it in the Implementation constraints file by including these two lines:

NET "clk" TNM_NET = clk;
TIMESPEC TS_clk = PERIOD "clk" 4 ns HIGH 50%;

The time of 4 nanoseconds ("4 ns") gives a constraint of a 250MHz clock.

The long way to do this

You first have to successfully compile your project, so the tools can deduce what clocks are present. Then, in the process window, open the "Timing Constraints" tool:

M7s2.png

You will then be presented with the list of all unconstrained clocks:

M7s3.png

Double click on the unconstrained clock, and you will be presented with this dialogue box:

M7s4.png

Fill it in appropriately then close off all the timing constraint related windows (forcing the constraint to be saved).

You will now need to rebuild the project with this new constraint in place.

What happens if the project can't meet the constraint?

If the design is unable to meet the timing requirements details of the failing paths will be reported in the Static Timing report:

...
Timing constraint: TS_clk = PERIOD TIMEGRP "clk" 4 ns HIGH 50%; 
  465 paths analyzed, 73 endpoints analyzed, 1 failing endpoint 
  1 timing error detected. (1 setup error, 0 hold errors, 0 component switching limit errors) 
  Minimum period is   4.053ns. 
 -------------------------------------------------------------------------------- 
  
 Paths for end point counter_29 (SLICE_X53Y78.CIN), 28 paths 
 -------------------------------------------------------------------------------- 
 Slack (setup path):     -0.053ns (requirement - (data path - clock path skew + uncertainty)) 
   Source:               counter_0 (FF) 
   Destination:          counter_29 (FF) 
   Requirement:          4.000ns 
   Data Path Delay:      4.053ns (Levels of Logic = 15) 
   Clock Path Skew:      0.000ns 
   Source Clock:         clk_BUFGP rising at 0.000ns 
   Destination Clock:    clk_BUFGP rising at 4.000ns 
   Clock Uncertainty:    0.000ns 
  
   Maximum Data Path: counter_0 to counter_29 
     Location             Delay type         Delay(ns)  Physical Resource 
                                                        Logical Resource(s) 
     -------------------------------------------------  ------------------- 
     SLICE_X53Y64.XQ      Tcko                  0.514   counter<0> 
                                                        counter_0 
     SLICE_X53Y64.F4      net (fanout=1)        0.317   counter<0> 
     SLICE_X53Y64.COUT    Topcyf                1.011   counter<0> 
                                                        Mcount_counter_lut<0>_INV_0 
                                                        Mcount_counter_cy<0> 
                                                        Mcount_counter_cy<1> 
     SLICE_X53Y65.CIN     net (fanout=1)        0.000   Mcount_counter_cy<1> 
     SLICE_X53Y65.COUT    Tbyp                  0.103   counter<2> 
                                                        Mcount_counter_cy<2> 
                                                        Mcount_counter_cy<3> 
     SLICE_X53Y66.CIN     net (fanout=1)        0.000   Mcount_counter_cy<3> 
     SLICE_X53Y66.COUT    Tbyp                  0.103   counter<4> 
                                                        Mcount_counter_cy<4> 
                                                        Mcount_counter_cy<5> 
     SLICE_X53Y67.CIN     net (fanout=1)        0.000   Mcount_counter_cy<5> 
     SLICE_X53Y67.COUT    Tbyp                  0.103   counter<6> 
                                                        Mcount_counter_cy<6> 
                                                        Mcount_counter_cy<7> 
     SLICE_X53Y68.CIN     net (fanout=1)        0.000   Mcount_counter_cy<7> 
     SLICE_X53Y68.COUT    Tbyp                  0.103   counter<8> 
                                                        Mcount_counter_cy<8> 
                                                        Mcount_counter_cy<9> 
     SLICE_X53Y69.CIN     net (fanout=1)        0.000   Mcount_counter_cy<9> 
     SLICE_X53Y69.COUT    Tbyp                  0.103   counter<10> 
                                                        Mcount_counter_cy<10> 
                                                        Mcount_counter_cy<11> 
     SLICE_X53Y70.CIN     net (fanout=1)        0.000   Mcount_counter_cy<11> 
     SLICE_X53Y70.COUT    Tbyp                  0.103   counter<12> 
                                                        Mcount_counter_cy<12> 
                                                        Mcount_counter_cy<13> 
     SLICE_X53Y71.CIN     net (fanout=1)        0.000   Mcount_counter_cy<13> 
     SLICE_X53Y71.COUT    Tbyp                  0.103   counter<14> 
                                                        Mcount_counter_cy<14> 
                                                        Mcount_counter_cy<15> 
     SLICE_X53Y72.CIN     net (fanout=1)        0.000   Mcount_counter_cy<15> 
     SLICE_X53Y72.COUT    Tbyp                  0.103   counter<16> 
                                                        Mcount_counter_cy<16> 
                                                        Mcount_counter_cy<17> 
     SLICE_X53Y73.CIN     net (fanout=1)        0.000   Mcount_counter_cy<17> 
     SLICE_X53Y73.COUT    Tbyp                  0.103   counter<18> 
                                                        Mcount_counter_cy<18> 
                                                        Mcount_counter_cy<19> 
     SLICE_X53Y74.CIN     net (fanout=1)        0.000   Mcount_counter_cy<19> 
     SLICE_X53Y74.COUT    Tbyp                  0.103   counter<20> 
                                                        Mcount_counter_cy<20> 
                                                        Mcount_counter_cy<21> 
     SLICE_X53Y75.CIN     net (fanout=1)        0.000   Mcount_counter_cy<21> 
     SLICE_X53Y75.COUT    Tbyp                  0.103   counter<22> 
                                                        Mcount_counter_cy<22> 
                                                        Mcount_counter_cy<23> 
     SLICE_X53Y76.CIN     net (fanout=1)        0.000   Mcount_counter_cy<23> 
     SLICE_X53Y76.COUT    Tbyp                  0.103   counter<24> 
                                                        Mcount_counter_cy<24> 
                                                        Mcount_counter_cy<25> 
     SLICE_X53Y77.CIN     net (fanout=1)        0.000   Mcount_counter_cy<25> 
     SLICE_X53Y77.COUT    Tbyp                  0.103   counter<26> 
                                                        Mcount_counter_cy<26> 
                                                        Mcount_counter_cy<27> 
     SLICE_X53Y78.CIN     net (fanout=1)        0.000   Mcount_counter_cy<27> 
     SLICE_X53Y78.CLK     Tcinck                0.872   counter<28> 
                                                        Mcount_counter_cy<28> 
                                                        Mcount_counter_xor<29> 
                                                        counter_29 
     -------------------------------------------------  --------------------------- 
     Total                                      4.053ns (3.736ns logic, 0.317ns route) 
                                                        (92.2% logic, 7.8% route) 
...

Because this is such a simple design no automatic improvement was possible, but it allows you to see where the crunch is - it is the time taken for the signal to propagate from "counter(0)" through to "counter(29)".

Also worthy of note is that each step along the way takes 0.103ns, indicating that there is a fundamental relationship between the size of numbers you manipulate and a design's speed - a 32 bit counter would take 0.103ns longer to compute the result.

Interestingly enough, given that 3.736ns of the time is incurred by logic and only 0.317ns is incurred by routing it would be easy to assume that a 30 bit counter can't be implemented to run faster than 267MHz in this FPGA - any faster and there wouldn't be enough time for the logic to do its magic. You would be wrong - the final design on this page runs at 298MHz!

So how can this design be improved?

Here's the existing design - it is a small, easy to understand, design that allows easy verification by checking that the LEDs count at the same speed when changes are made:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

entity Switches_LEDs is
    Port ( switches : in  STD_LOGIC_VECTOR(7 downto 0);
           LEDs     : out STD_LOGIC_VECTOR(7 downto 0);
           clk      : in STD_LOGIC
         );
end Switches_LEDs;

architecture Behavioral of Switches_LEDs is
    signal counter : STD_LOGIC_VECTOR(29 downto 0) := (others => '0');
begin

  LEDs <= counter(29 downto 22); 

clk_proc: process(clk)
  begin
     if rising_edge(clk) then
        counter <= counter+1;
     end if;
  end process;
  
end Behavioral;

As mentioned earlier, the problem is the of length "counter" - at 30 bits long it will take at least 30*0.103ns = 3.09ns to increment.

But what if we split the counter into two 15 bit counters? will it be any faster? Lets see - Here's the updated design:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

entity Switches_LEDs is
    Port ( switches : in  STD_LOGIC_VECTOR(7 downto 0);
           LEDs     : out STD_LOGIC_VECTOR(7 downto 0);
           clk      : in STD_LOGIC
         );
end Switches_LEDs;

architecture Behavioral of Switches_LEDs is
    signal counter : STD_LOGIC_VECTOR(29 downto 0) := (others => '0');
    signal incHighNext : STD_LOGIC := '0';
begin

  LEDs <= counter(29 downto 22); 

clk_proc: process(clk)
  begin
     if rising_edge(clk) then
        counter(29 downto 15) <= counter(29 downto 15)+incHighNext;
	  
        if counter(14 downto 0) = "111111111111110" then
           incHighNext <= '1';
        else
           incHighNext <= '0';
        end if;
		  
        counter(14 downto 0) <= counter(14 downto 0)+1;
     end if;
  end process;
end Behavioral;

Here's the updated timing report:

All values displayed in nanoseconds (ns) 
  
 Clock to Setup on destination clock clk 
 ---------------+---------+---------+---------+---------+ 
                | Src:Rise| Src:Fall| Src:Rise| Src:Fall| 
 Source Clock   |Dest:Rise|Dest:Rise|Dest:Fall|Dest:Fall| 
 ---------------+---------+---------+---------+---------+ 
 clk            |    3.537|         |         |         | 
 ---------------+---------+---------+---------+---------+ 
  
  
 Timing summary: 
 --------------- 
  
 Timing errors: 0  Score: 0  (Setup/Max: 0, Hold: 0) 
  
 Constraints cover 284 paths, 0 nets, and 82 connections 
  
 Design statistics: 
    Minimum period:   3.537ns{1}   (Maximum frequency: 282.725MHz)

By using a little more logic the design has gone from 246MHz to 282MHz - about 14% faster.

Why does this work? By exploiting that we know in advance (when we are going to 'carry one' from bit 14 to bit 15), and storing that in a handy flipflop ('incHighNext') we have split the critical path across two clock cycles.

Project 7.1

  • See what the maximum speed the design using the 30 bit counter is for your FPGA board
  • Try changing the speed grade of the FPGA and see what difference that makes to the timing. (To do this, in the hierarchy window right-click on xc3c100e-4cp132 and choose properties - just remember to set it back!)
  • Add a timing constraint and try again
  • See what the maximum speed the design using the 15+15 split counter is for your FPGA board.

Challenges

  • Is the 15+15 split counter optimal compared against a 14+16 split or 16+14 split? If not, why not?
  • Can you increase the maximum clock speed for the project even further?
  • What is the largest counter you can make that runs at 100MHz?

An even better design

Sometimes designs just need a different way at looking at things. The code below performs faster still - why is this?

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

entity Switches_LEDs is
    Port ( switches : in  STD_LOGIC_VECTOR(7 downto 0);
           LEDs     : out STD_LOGIC_VECTOR(7 downto 0);
           clk      : in STD_LOGIC
         );
end Switches_LEDs;

architecture Behavioral of Switches_LEDs is
    signal counter : STD_LOGIC_VECTOR(29 downto 0) := (others => '0');
    signal incHighNext : STD_LOGIC := '0';
begin

  LEDs <= counter(29 downto 22); 

clk_proc: process(clk)
  begin
     if rising_edge(clk) then
        counter(29 downto 15) <= counter(29 downto 15)+incHighNext;
	  
        incHighNext <= not counter(0)  and counter(1)  and counter(2)  and counter(3)  and counter (4)
                       and counter(5)  and counter(6)  and counter(7)  and counter(8)  and counter (9)
                       and counter(10) and counter(11) and counter(12) and counter(13) and counter (14);
        
        counter(14 downto 0) <= counter(14 downto 0)+1;
     end if;
  end process;
end Behavioral;



Random thoughts

  • If there is a chance that performance will be an issue, consider setting a metric for the "levels of logic" you have in your design, and regularly review your design's static timing during the project. It is a very simple way to ensure that you will not end up with one or two very long chains of logic that limits system performance requiring significant re-work
  • Vendors will tell you that "floorplanning" (the high-level planning of the the placement of logic in a design) will allow projects to make significant improvement in achieving timing closure, but only routing delays can be reduced by controlling the placement
  • The best way to solve tricky timing closure issues is to avoid the timing issues
  • It is usually possible to guess the critical within a project in advance, allowing the feasibility of a design to be assessed early on in a project.

Ready to carry on?

Click here to carry on to the next module.

Personal tools