High Speed Link
From Hamsterworks Wiki!
This FPGA Project was started in January 2013, and finished in February 2013. I feel that this project is a mini tour de force of FPGA engineering.
For those who are not familiar, Field Programmable Gate Arrays (FPGAs) allow hobbyists to engineer very high technology designs for almost pocket-money cost. An FPGA is a mesh of simple programmable logic blocks, floating in a sea of programmable wires and routing resources.
The aim of this project is to get high speed data from one Spartan 3E FPGA to another over a single LVDS pair (and ground), without a common clock. Eventually I want to put fibre SFPs into the path allowing full electrical isolation, but only getting the data between boards will be covered here. My initial aim was to achieve 400Mb/s (line rate), quite an ambitious figure. The actual result is 512Mb/s on the wire, giving 51.2MB/s of user data once the 20% 8b/10b coding overhead has been removed.
I'm using two of Gadget Factory's low cost Papilio One boards - an equivalent of 250,000 logic gates per board, for under US$40 each. These boards are USB powered, and have 48 pins of user I/O.
The Spartan 3E FPGAs on these boards are a relatively old design, and lack the high-speed serial I/O features that newer FPGAs have - as an example, the Papilio Pro has a Spartan 6 LX9, which has SERDES blocks rated up to about 800Mb/s - 1Gb/s.
The logic blocks on the FPGA are rated for clock rates of around 300MHz, but actual performance is dependent on the complexity of the design. The more complex the design, the greater the propagation delays and so the lower the maximum clock rate.
When configured as pictured, the board on the right sends the test signal to the board on the left over the two twisted wires.
| Tested error rate | Approximately 1 frame in 10^11 |
| Design size       | 142 registers, 134 look-up tables (rx component only) |
8b/10b has a basic form of error detection built in - if a single bit error occurs it will upset the running disparity of the link, and this can be detected. However, some form of CRC or byte-level forward error correction should still be used.
However, the problem is most probably a corner case when three bits are sampled in one frame. A simple fix (and a lot of testing) could possibly greatly reduce the error rate - adjusting the offset for the third bit and seeing whether the slightly earlier or later sample is more appropriate.
The physical connection between the two boards is a low voltage differential signalling (LVDS) pair of twisted wire jumpers. Although other I/O standards are available, using LVDS provides greater signal integrity. The Papilio One can support LVDS if the output reference voltage (Vcco) is set to 2.5V on the jumper block.
On the wire format
I'm going to use IBM's classic 8b/10b encoding. The reasons are:
- Signals have plenty of clock transitions helping ease the problem of clock recovery
- It is relatively efficient (allowing 80% of the channel bit rate for user data), compared with Manchester encoding's 50% efficiency
- It is DC balanced, allowing it to be used with SFPs and other physical layers that have some form of automatic gain control in the signal path
- It includes comma sequences. Due to the coding scheme these unique bit patterns can never occur within user data, so they can be used to recover the framing of the signal. The standard comma value is "0011111010" and its inverse is "1100000101", so for this project I am going to send "00111110101100000101".
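The framing property can be illustrated with a short Python sketch (pure illustration, not the project's VHDL): scan a recovered bitstream for either comma pattern, and the match position gives the frame alignment.

```python
# Sketch: locate an 8b/10b comma sequence in a recovered bitstream
# to establish 10-bit frame alignment. Illustrative only - the real
# design does this in hardware, one alignment at a time.

COMMA_P = "0011111010"  # the standard comma
COMMA_N = "1100000101"  # its inverse

def find_frame_offset(bits):
    """Return the offset of the first comma found, or None."""
    for i in range(len(bits) - 10 + 1):
        if bits[i:i + 10] in (COMMA_P, COMMA_N):
            return i
    return None

# The test pattern sent in this project is both commas back to back.
stream = "101" + "0011111010" + "1100000101"
print(find_frame_offset(stream))  # 3
```

Because the comma can never appear inside correctly coded user data, a match unambiguously marks a frame boundary.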
The Papilio One has a 32MHz oscillator. One of the Digital Clock Managers is used to multiply this up to half the bit rate (256MHz). The CLKX2 output is a 64MHz clock used for the low-speed logic that processes the parallel data. This lower speed is ideal, as the data is transferred at about 51,000,000 frames per second.
With serial links, sending is easy. The source data is converted to a stream of 8b/10b frames and clocked out of the transmitting FPGA at half the desired bit rate. A DDR output register is used to allow two bits to be sent each clock cycle.
For the test set-up the source data is a 20 bit shift register holding the test pattern, with two bits connected to the DDR output's data lines. The output of the DDR register is connected to the LVDS driver, which is then connected to two of the FPGA's output pads.
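The transmit side can be modelled in a few lines of Python. This is a behavioural sketch of the circulating 20-bit shift register feeding two bits per clock into the DDR output (the exact rotate-and-recirculate mechanics are my assumption, not taken from the project's VHDL):

```python
# Sketch of the transmit side: a 20-bit circulating shift register
# holding the comma test pattern, clocked out two bits per cycle to
# model the DDR output register. Behavioural Python, not HDL.

PATTERN = "00111110101100000101"  # comma + inverted comma

def transmit(n_cycles):
    """Yield (rising_edge_bit, falling_edge_bit) pairs per clock."""
    sr = PATTERN
    for _ in range(n_cycles):
        pair = (sr[0], sr[1])
        sr = sr[2:] + sr[:2]  # shift two bits out, recirculate them
        yield pair

bits = "".join(a + b for a, b in transmit(10))
print(bits == PATTERN)  # True: one full pattern every 10 clocks
```

At a 256MHz clock and two bits per cycle this gives the 512Mb/s line rate.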
Clock recovery is hard, especially at high speed in an older FPGA like the Spartan 3E. The FPGA has only a few clocking resources, and they are not ideally suited for tracking a high speed bitstream.
A background on FPGA architecture and timing
The basic logic element within an FPGA is the Configurable Logic Block (CLB), which in the Spartan 3 architecture consists of a D flip-flop and a four input Look Up Table ("LUT"). The four input signals are used to address a bit in a 16 bit SRAM block, and the value of that bit is presented on the output signal. This output signal can then be latched in the flip-flop and/or sent over an interconnect to another CLB. As an aside, this is a direct digital implementation of a four input Karnaugh map (http://en.wikipedia.org/wiki/Karnaugh_map).
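The lookup mechanism is simple enough to model directly. This Python sketch treats the LUT's INIT value as the 16-bit truth table, addressed by the four inputs - for example INIT 0x5555 implements "output = NOT I0", the inverter configuration used later in this project:

```python
# Sketch of a 4-input LUT: the INIT value is a 16-bit truth table,
# and the four inputs form the address of the bit to present on the
# output. INIT 0x5555 = 0101...0101 implements "output = NOT I0".

def lut4(init, i0, i1, i2, i3):
    """Look up one bit of the 16-bit INIT truth table."""
    address = (i3 << 3) | (i2 << 2) | (i1 << 1) | i0
    return (init >> address) & 1

print(lut4(0x5555, 0, 0, 0, 0))  # 1  (I0=0 -> output 1)
print(lut4(0x5555, 1, 0, 0, 0))  # 0  (I0=1 -> output 0)
```

Any four-input boolean function can be implemented this way just by choosing a different INIT value.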
The lookup process is not instantaneous, and has a small but measurable delay. This is quoted in vendors' datasheets, and is about 0.7ns. In addition to the lookup delay, signals take time to propagate between logic blocks. Short hops may take 0.3ns, and long hops may take much longer - maybe even a few ns if the route is long and passes through many switching blocks. By carefully controlling the placement of LUTs on the FPGA die, the timing of signals can be controlled to within a few hundred picoseconds.
However, FPGA vendors only publish these figures to allow designers to assess how 'fast' their parts are, and to give designers some way to analyse and predict performance - one of the major problems in FPGA design is achieving "timing closure", the point where the project runs at the speed you require of it. This can involve 'floor planning', where the designer maps out areas of the FPGA die for various components, or maybe even manual placement of individual resources, as I have done with a few critical parts.
I am only using small FPGAs, where the distance a signal has to travel is relatively short. In larger designs the routing delays far exceed the logic delays, and this is the main reason why FPGAs cannot match the speeds of ASICs.
Sampling at a rate greater than twice the clock speed
Most FPGAs have a DDR input register that can be used to sample the incoming signal at twice the clock rate. However, this is not suited to data capture at these rates unless the sender is running on the same reference clock as the receiver, and you also have a mechanism to adjust for any phase delay that the link may introduce. Should you get it wrong and sample outside of the "eye", you will not achieve reliable transfer.
With what has been covered up to now, it should be relatively obvious that a chain of LUTs will introduce at least a 1ns delay per stage - the signal arriving at the LUT, indexing the result bit and presenting it on the output takes about 0.7ns, and the signal must then propagate across the fabric to the next LUT.
The general idea
If you have a chain of CLBs, with the LUTs connected in series and the flip-flops connected to a common clock, you are able to sample an incoming signal with approximately 1ns resolution.
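The idea can be sketched as a toy model: tap k of the chain sees the input delayed by roughly k nanoseconds, so latching every flip-flop on one clock edge captures a short history of the signal at about 1ns resolution. The ideal 1ns-per-stage timing is an assumption for illustration; real delays vary with routing, as discussed above.

```python
# Toy model of the delay-chain sampler: tap k holds the input as it
# was k * tap_delay_ns before the clock edge, so one clock edge
# captures N samples of recent signal history. Timing is idealised.

def sample_chain(signal, edge_time_ns, taps, tap_delay_ns=1.0):
    """signal: function of time (ns) -> 0/1. Returns oldest-last list."""
    return [signal(edge_time_ns - k * tap_delay_ns) for k in range(taps)]

# A 512Mb/s bitstream has a bit period of ~1.95ns.
bit_ns = 1000 / 512.0
pattern = "0011111010"
signal = lambda t: int(pattern[int(t // bit_ns) % len(pattern)])

print(sample_chain(signal, edge_time_ns=19.0, taps=8))
```

With ~2ns bits and ~1ns taps, each bit shows up in roughly two consecutive taps - which is why more resolution is needed later.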
Reality gets in the way
However, nothing is ever that simple. With each LUT that a signal passes through, the "1" pulses tend to spread out, so at the end of a usefully long chain of LUTs the "0"s may disappear completely! This had me stumped for a while, until I realised that if each LUT inverts the signal this spreading cancels out, and a clean signal is maintained. Of course you then have to invert every other sample of the parallel data, as it will be inverted.
And here is what the design of a single invert/sample stage looks like in VHDL (excluding the output inverter):

 library IEEE;
 use IEEE.STD_LOGIC_1164.ALL;
 
 -- The LUT4_D primitive comes from the Xilinx UNISIM library
 library UNISIM;
 use UNISIM.VComponents.all;
 
 entity sampler is
     Port ( clk      : in  STD_LOGIC;
            data_in  : in  STD_LOGIC;
            data_out : out STD_LOGIC;
            sampled  : out STD_LOGIC);
 end sampler;
 
 architecture Behavioral of sampler is
     signal buffered_data : std_logic;
 begin
     -- INIT x"5555" makes the LUT an inverter (output = NOT I0)
     sample_lut: LUT4_D
     generic map ( INIT => X"5555")
     port map ( I0 => data_in,
                I1 => '0',
                I2 => '0',
                I3 => '0',
                LO => data_out,
                O  => buffered_data);
 
     sample_proc: process(clk)
     begin
         if rising_edge(clk) then
             sampled <= buffered_data;
         end if;
     end process;
 end Behavioral;
This configuration allows sampling at about a gigasample per second - not enough for the speed I want to transfer data at. Unless you have about 5x oversampling it is very hard to track the clock of the sender - there is not enough information to accurately track the sender's phase while providing a little bit of noise immunity. A simple delay chain solution is therefore only useful for bit rates of up to about 200Mb/s.
Going even faster
This is where route-dependent delays come into the picture. By routing a signal out of the delay chain, across the fabric and into a second delay chain, the resolution can be increased.
It is actually surprisingly hard to obtain the desired 0.5ns delay in this FPGA - the routing tools seem to select either "local lines" with an extra 0.2 or 0.3ns, or a long line that traverses a large portion of the FPGA and gives an extra 0.8ns delay.
And faster still - the final design
On testing, however, this was not enough resolution to achieve a stable transfer. Yet another chain was introduced, with the chains offset by about a third of a nanosecond from each other.
In this screenshot the leftmost column are the 'pads' which connect to the outside world, the red lines are the wires carrying the input signal, the yellow boxes are the CLBs (LUTs and flip-flops) that are sampling the signal, the white rectangles are junctions where the routing takes place. The blue boxes are CLBs that are used elsewhere in the design (their signal wires are not shown).
A small "record a thousand samples then squirt them over RS232" component was added, and this is a trace of the input as it is sampled - each character represents about 0.3ns, and each line is one period of the 3.9ns clock. The bit boundaries are clearly visible, and the 8b/10b commas (0011111010 and 1100000101) can be seen quite easily:
 bit n+2|bit n+1|bit n
 ======================
 1111111111111111111111
 11111111........111111
 11111111........111111
 .111111111111111......
 ......................
 ......................
 .......11111111.......
 .......11111111.......
 ...............1111111
 1111111111111111111111
 1111111111111111111111
 11111111........111111
 .1111111........111111
 111111111111111.......
Selecting which samples to use for the bitstream
Now that all the ugly asynchronous parts are out of the way, we are left with the problem of how to extract the best bits from these samples and reconstruct them into a useful frame. Although this is all synchronous logic, it is by no means easy. It has to run at 256MHz, so the logic used must be simple and fast.
What isn't apparent up to now is that the sender's and receiver's clocks will be running at slightly different rates, so although the bit boundaries are clear in the above capture, they will not stay locked in place. The two crystals can each differ by up to 50 parts per million from their rated frequency, and will drift with temperature changes. At 256MHz this could give an entire bit's worth of phase drift 12,800 times a second!
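The drift figure is easy to verify with a quick back-of-the-envelope calculation:

```python
# Quick check of the clock-drift figure: a 50 ppm frequency
# difference at a 256MHz clock slips one clock period's worth of
# phase 50 millionths of 256 million times a second.

clock_hz = 256_000_000
ppm = 50
slips_per_second = clock_hz * ppm / 1_000_000
print(slips_per_second)  # 12800.0
```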
To counter this the design has to dynamically track the bits. The easiest way to do this is to track the boundaries between bits and ensure that you sample roughly in the centre of the gap between them. Most of the time two bits will be captured every clock cycle, but if the sender's clock is slow an occasional one bit will be extracted. Should the sender's clock be running fast then occasionally three bits will need to be extracted.
As you can see in the above capture, the bits are approximately 8 samples wide sometimes a little bit wider, sometimes a little bit narrower. This can be leveraged to track the relative phase of the signal and literally extract the best bits.
Aligning the data based on history
The first step in recovering the data is to select a subset of bits from the sampled data that are aligned to the sender's clock. The offset of these bits is set by a "sample_offset" value, which ranges between zero and seven.
Two internal signals are used to control how sample_offset is maintained over time. Should bits five and six of the clock-aligned data not match, then we need to increment "sample_offset", causing the data to be sampled a little later. Should bits 10 and 11 not match, then "sample_offset" is decremented, causing the data to be sampled a little sooner.
There is only a small issue when "sample_offset" reaches the limit of zero or seven. If it is zero and there is a need to sample sooner, the only option is to capture a single bit that cycle and set sample_offset to seven. Should "sample_offset" be seven and there is a need to sample later, then three bits are captured that cycle and sample_offset is set to zero. The "take1" and "take3" signals tell the downstream logic how many bits are to be captured.
Along with take1 and take3, bits 16, 8 and 0 are passed on to the next stage as the sampled bits. As a minor improvement, take3 is only asserted when sample_offset is zero, so bit three of sampled_bits will always be bit 16 of the raw samples whenever take3 is asserted.
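The tracking rules above can be sketched as a loose behavioural model. This is my Python restatement of the description, not the project's VHDL: the window size, which bits are emitted on take1, and the exact mismatch positions are assumptions for illustration.

```python
# Loose behavioural model of the bit-extraction logic: each 256MHz
# cycle a window of raw samples arrives, "sample_offset" (0..7) picks
# where the clock-aligned bits sit, mismatches nudge the offset, and
# take1/take3 handle the wrap-around cases. Illustrative only.

class Extractor:
    def __init__(self):
        self.offset = 0

    def step(self, raw):
        """raw: list of >= 24 samples. Returns the extracted bits."""
        aligned = raw[self.offset:self.offset + 17]
        take = 2
        if aligned[5] != aligned[6]:        # need to sample later
            if self.offset == 7:
                take, self.offset = 3, 0    # take3: grab an extra bit
            else:
                self.offset += 1
        elif aligned[10] != aligned[11]:    # need to sample sooner
            if self.offset == 0:
                take, self.offset = 1, 7    # take1: drop a bit this cycle
            else:
                self.offset -= 1
        bits = [aligned[16], aligned[8], aligned[0]]
        return bits[3 - take:]              # newest 1, 2 or 3 bits
```

Most cycles two bits are extracted; the take1/take3 cases absorb the slow drift between the two crystals.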
Building the data back into frames and passing to the low speed logic
Collecting the bits into sub-frames
Every cycle one, two or three bits are appended to a shift register, and when there are five or more bits in the shift register the oldest five bits are output to the next stage. Bits are either skipped or picked up when the relative phase between the sender and receiver shifts by one bit.
A cunning addition is that the "frame_align" component (further downstream) can send back a signal to skip an additional bit. This allows it to hunt for the 8b/10b sync codeword and correctly frame the data. This is a little slower to sync than searching all alignments of the final frames (as only one alignment is tested at a time), but uses minimal logic resources.
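The sub-frame collector can be modelled in a few lines. This Python sketch follows the description above; the "skips" input is my stand-in for the frame_align component's skip-a-bit request:

```python
# Sketch of the sub-frame collector: 1, 2 or 3 bits arrive per cycle
# and are appended to a shift register; whenever five or more bits
# are queued, the oldest five are emitted as a sub-frame. "skips"
# models frame_align discarding one bit while hunting for the comma.

def collect(bit_groups, skips=()):
    """bit_groups: iterable of bit-lists (len 1..3). Yields 5-bit lists."""
    sr = []
    for cycle, bits in enumerate(bit_groups):
        sr.extend(bits)
        if cycle in skips:
            sr.pop(0)            # drop one bit to shift frame alignment
        while len(sr) >= 5:
            yield sr[:5]
            sr = sr[5:]

groups = [[0, 0], [1, 1], [1, 1], [1, 0], [1, 0]]
print(list(collect(groups)))  # [[0, 0, 1, 1, 1], [1, 1, 0, 1, 0]]
```

Note how the comma "0011111010" falls out as two aligned 5-bit sub-frames once the skip count is right.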
Collecting the sub-frames into frames
At approximately 102,000,000 sub-frames per second, the data rate is still too high for most complex FPGA logic (e.g. a CPU). Two sub-frames are therefore assembled into a single ten bit frame, which can then be passed off to the rest of the design. Ten bit frames are ideal, as they can be run through an 8b/10b decoder to recover the transmitted data.
Clock domain crossing from the high speed to the low speed logic
For the high speed logic to signal the low speed logic that a new frame has arrived, a signal line is toggled. When the low speed logic sees this signal toggle it must capture (and then process) the newly arrived frame. The signal line toggles at approximately four-fifths of the rate of the 64MHz slow speed clock, ensuring that the slow speed logic will never miss a frame.
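The handshake can be sketched as follows. This is a simplified Python model of the toggle scheme; the real hardware would also pass the toggle line through a synchroniser (double flip-flop) to avoid metastability, which this model omits:

```python
# Sketch of the toggle-based clock-domain crossing: the fast domain
# flips a flag each time a frame is ready; the slow domain compares
# the flag against its last seen value and captures the frame when
# they differ. Metastability handling is omitted from this model.

class ToggleCDC:
    def __init__(self):
        self.flag = 0        # driven by the fast (256MHz) domain
        self.frame = None
        self.last_seen = 0   # slow (64MHz) domain's copy of the flag

    def fast_new_frame(self, frame):
        self.frame = frame
        self.flag ^= 1       # exactly one toggle per frame

    def slow_poll(self):
        """Called each slow clock; returns a new frame or None."""
        if self.flag != self.last_seen:
            self.last_seen = self.flag
            return self.frame
        return None

cdc = ToggleCDC()
print(cdc.slow_poll())            # None - nothing has arrived yet
cdc.fast_new_frame(0b0011111010)
print(cdc.slow_poll())            # 250 - the new frame is captured
```

Because frames arrive slower than the 64MHz polling rate, every toggle is seen before the next frame overwrites the capture register.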
In the project's zip file is the full project for the design, including a transmitter, receiver and verifier.
The transmitter sends [sync codeword] and 8b/10b coded values for 'H', 'e', 'l', 'l', 'o' over and over again, at 51.2MB/s.
The receiver checks that only those five frames are received, and maintains a count of unexpected frames and the number of times sync has been lost.
The LEDs display "sync" (one bit), the loss of sync count (3 bits), and the bad frame count (4 bits).
The RS232 serial port on the receiver shows the raw frames as they arrive, six zeros, the "drop a bit" toggle signal (used when attempting to sync), and then the 'synced' status signal.
This has been left running for twenty-four hours, with four frame errors observed and no loss of sync. During this time approximately 4TB of data was transferred. All four frame errors occurred while I was at the desk, which was only for an hour or so of the test period, so it might just be a loose wire / vibration issue.
The verification module itself has been tested by unplugging the signal wires. This caused at least one loss of sync error, and the frame error count to increase.
A zip file of the project is here File:HighSpeedLink.zip