Spartan 6 1080p
From Hamsterworks Wiki!
This FPGA Project was completed in February 2013.
The serializers on the Spartan 6 LX are rated to 1050Mb/s. In itself this is quite impressive, but for most development boards without HDMI transmitters it leaves a huge hole in the feature set - it is not fast enough for 1080p (aka Full HD, aka 1920x1080@60Hz). A 720p signal uses about 750Mb/s on each of four channels, and 1080p requires four channels of 1,500Mb/s each!
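The arithmetic behind those figures can be sketched as follows. TMDS sends 10 bits per pixel on each data channel, so the per-channel bit rate is ten times the pixel clock. The 74.25MHz and 148.5MHz pixel clocks are the standard figures for 720p and 1080p at 60Hz and are an assumption brought in from outside this page; the 1050Mb/s limit is the rating quoted above.

```python
# Per-channel TMDS bit rate = pixel clock x 10 bits per pixel.
TMDS_BITS_PER_PIXEL = 10
SERDES_LIMIT_MBPS = 1050          # Spartan 6 LX serializer rating quoted above

# Standard 60Hz pixel clocks in MHz (assumed here, not taken from this page).
PIXEL_CLOCK_MHZ = {"720p": 74.25, "1080p": 148.5}


def channel_rate_mbps(pixel_clock_mhz):
    """Bit rate on one TMDS data channel, in Mb/s."""
    return pixel_clock_mhz * TMDS_BITS_PER_PIXEL


for mode, clock in PIXEL_CLOCK_MHZ.items():
    rate = channel_rate_mbps(clock)
    verdict = "fits" if rate <= SERDES_LIMIT_MBPS else "too fast"
    print(f"{mode}: {rate:.1f} Mb/s per channel ({verdict})")
```

This project rounds the 1080p pixel clock up to 150MHz, which is where the 1,500Mb/s figure comes from.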
But wouldn't it be nice to generate 1080p? You could then test your decoders or image processing on a less expensive board like a Digilent Atlys.
Well, you can - if not for production, at least for testing. Due to the unavoidable 175 ps jitter in the PLL's outputs the signal is not up to spec. However, that doesn't mean that it won't work! Think of this as a stop-gap measure you can use while a board with an HDMI encoder chip is being engineered.
Be sure to read the section called "Manual placement of critical components" if you use this design!
The board I am using is my much treasured Pipistrello, but the design should also work on the Digilent Atlys (or any other Spartan 6 board) where the differential pairs are connected directly to the HDMI socket. If you are not using a -3 grade part some re-work may be required to achieve timing closure.
The project uses three clocks: a 150MHz pixel clock that generates the VGA signals and also drives the TMDS encoders, a 375MHz serialiser clock, and a second 375MHz serialiser clock that is phase shifted by 90 degrees from the first.
Generating VGA signals is covered all over the web, and encoding data into TMDS has been covered in my 720p DVI-D test project. The in-fabric serialization of the data is where the magic is in this project, so let's have a look at that...
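As a sanity check on that clock plan, here is a small sketch showing that four output bits per 375MHz cycle (one per edge of the two phase-shifted clocks) matches ten TMDS bits per 150MHz pixel. The constant names are mine and only for illustration.

```python
from fractions import Fraction

PIXEL_CLOCK_MHZ = 150        # pixel clock, from the text above
SERIAL_CLOCK_MHZ = 375       # each of the two serialiser clocks
BITS_PER_PIXEL = 10          # TMDS symbol width
EDGES_PER_CYCLE = 4          # rising and falling edges of the two clocks

bit_rate_needed = PIXEL_CLOCK_MHZ * BITS_PER_PIXEL       # Mb/s per channel
bit_rate_made = SERIAL_CLOCK_MHZ * EDGES_PER_CYCLE       # Mb/s per channel
fast_cycles_per_pixel = Fraction(SERIAL_CLOCK_MHZ, PIXEL_CLOCK_MHZ)

print(bit_rate_needed, bit_rate_made, fast_cycles_per_pixel)
```

Both rates come out at 1500Mb/s, and each pixel takes two and a half fast cycles, so the two domains only line up every second pixel.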
In-fabric serialization - generating the quad data rate (QDR) signal with flip-flops
The logic within an FPGA only operates on a single clock edge, so how can data be serialised within the FPGA fabric?
- Flip-flops can be configured to register on either the rising or falling edge of a clock.
- If you have two clocks that are 90 degrees out of phase you have four edges to work with:
- The rising edge of CLK0 (at 0 degrees)
- The rising edge of CLK90 (at 90 degrees)
- The falling edge of CLK0 (at 180 degrees)
- The falling edge of CLK90 (at 270 degrees)
With a flip-flop sensitive to each of these edges we need some way to combine their outputs to generate the desired QDR signal. The only safe way I know to do this is to configure a single lookup table ("LUT") as a quad-input XOR table. As each of the outputs of the flip-flops change one of two things can happen:
- The value in the flip-flop stays the same - nothing happens
- The value in the flip-flop changes and the output is cleanly toggled.
Because of this there should be no dynamic hazards to produce signal glitches. Excellent!
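The logical side of that argument can be checked with a tiny behavioural model: because only one flip-flop is updated per clock edge, the XOR output changes exactly when that flip-flop changes. This is a sketch of the truth table only; it says nothing about analogue behaviour inside a real LUT, and the function names are made up for illustration.

```python
def xor4(ffs):
    """Model of the 4-input XOR LUT fed by the four flip-flops."""
    return ffs[0] ^ ffs[1] ^ ffs[2] ^ ffs[3]


def count_output_toggles(ffs, updates):
    """Apply (index, new_value) updates one at a time, as the four clock
    edges do. Returns (flip-flop changes, output changes)."""
    ffs = list(ffs)
    ff_changes = 0
    out_changes = 0
    for index, value in updates:
        before = xor4(ffs)
        if ffs[index] != value:
            ff_changes += 1
        ffs[index] = value
        if xor4(ffs) != before:
            out_changes += 1
    return ff_changes, out_changes


# One fast clock cycle: the edges update flip-flops 0, 1, 2, 3 in turn.
print(count_output_toggles([0, 0, 0, 0], [(0, 1), (1, 1), (2, 0), (3, 1)]))
```

Three of the four flip-flops change value in this example, and the output toggles three times.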
So now we have some way to generate a QDR signal, how do we work out what value to put in the data flip-flops?
I think of it this way: if the current bit does not match the prior bit, then we must flip the value being stored in the output flip-flop.
To get this value you simply XOR the desired data value ("bit(n)") with the prior data value ("bit(n-1)") and also XOR the current value that is in the "change" flip-flop:
  if rising_edge(clk0) then
     change(0)          <= buffered(0) xor last xor change(0);
     change(3 downto 1) <= buffered(3 downto 1) xor buffered(2 downto 0) xor change(3 downto 1);
     last               <= buffered(3);
  end if;
As you can see this occurs on the rising edge of clock CLK0, and can be done well before transmitting the signal.
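To convince yourself that this recovers the original data, here is a behavioural model in Python (not the project's VHDL). It mirrors the assignments above, loads one "change" value into its flip-flop per clock phase, and XORs the four flip-flops together. It assumes everything starts at zero and ignores the extra re-clocking registers the real design needs for timing.

```python
def encode_word(buffered, last, change):
    """One rising edge of CLK0, mirroring the VHDL above.
    buffered holds 4 data bits, index 0 is sent first.
    Returns (new_change, new_last)."""
    new_change = [
        buffered[0] ^ last ^ change[0],
        buffered[1] ^ buffered[0] ^ change[1],
        buffered[2] ^ buffered[1] ^ change[2],
        buffered[3] ^ buffered[2] ^ change[3],
    ]
    return new_change, buffered[3]


def serialise(words):
    """Return the bit stream seen at the XOR output for a list of 4-bit words."""
    ffs = [0, 0, 0, 0]       # flip-flops feeding the XOR LUT
    change = [0, 0, 0, 0]
    last = 0
    out = []
    for word in words:
        change, last = encode_word(word, last, change)
        for phase in range(4):           # 0, 90, 180 and 270 degrees
            ffs[phase] = change[phase]
            out.append(ffs[0] ^ ffs[1] ^ ffs[2] ^ ffs[3])
    return out


print(serialise([[1, 0, 1, 1], [0, 0, 1, 0]]))
```

The printed stream is the input words laid end to end, which is what the receiver should see on the wire.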
You can actually get rid of the "last" register by replacing the assignment of change(0) with the following:
change(0) <= buffered(0) xor change(1) xor change(2) xor change(3);
However this makes it harder to achieve timing closure.
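The two forms give the same result because "last" is the bit currently on the wire, and that bit is the XOR of all four "change" flip-flops. A brute-force check over every state, as a sketch (it assumes "last" has not drifted out of step with the flip-flops, for example because both were reset to zero together):

```python
from itertools import product


def change0_with_last(buffered0, last, change):
    """Original form, using the separate "last" register."""
    return buffered0 ^ last ^ change[0]


def change0_without_last(buffered0, change):
    """Alternative form, using the other three change flip-flops."""
    return buffered0 ^ change[1] ^ change[2] ^ change[3]


def forms_agree():
    """Compare both forms for every data bit and flip-flop state."""
    for buffered0, *change in product((0, 1), repeat=5):
        last = change[0] ^ change[1] ^ change[2] ^ change[3]
        a = change0_with_last(buffered0, last, change)
        b = change0_without_last(buffered0, change)
        if a != b:
            return False
    return True


print(forms_agree())
```

The alternative needs a four-input XOR where the original needs three inputs, which fits with it being the harder one to get through timing.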
At this point we have enough that we can simulate a QDR output.
With 660ps or so between clock phases the timing is too tight. A little extra buffering is required to move these values into the flip-flops that are actually connected to the XOR LUT. This is the pretty standard technique of walking the values through no more than a 180 degree phase change at a time:
- Change(0) is output the following cycle
- Change(1) goes into a flip-flop at 270 degrees, and is then output at 90 degrees on the following cycle
- Change(2) goes into a second flip-flop at 0 degrees, and is then output at 180 degrees on the following cycle
- Change(3) goes into a second flip-flop at 0 degrees, and is then output at 270 degrees on the following cycle
This allows for 1.3ns paths between flip-flops, and is just in spec for the Spartan 6 at -3 grade.
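Those timing figures fall straight out of the 375MHz clock period. A quick sketch of the arithmetic (the function name is only for illustration):

```python
SERIAL_CLOCK_MHZ = 375


def phase_gap_ps(degrees, clock_mhz=SERIAL_CLOCK_MHZ):
    """Time between two clock edges that are `degrees` apart, in picoseconds."""
    period_ps = 1_000_000 / clock_mhz
    return period_ps * degrees / 360


for degrees in (90, 180):
    print(f"{degrees:3d} degrees -> {phase_gap_ps(degrees):.0f} ps")
```

A 90 degree step leaves roughly 667ps, while a 180 degree step leaves roughly 1333ps, which is the 1.3ns budget mentioned above.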
The move from the 150MHz clock domain to the 375MHz clock domain also presents issues. The receiving side is too fast to use a FIFO, but as TMDS encoded pixels only need to be transferred twice every five cycles there are two windows where the data will be stable for over 5ns. With planning, data can be transferred at these times without using any special techniques. However, I am useless with constraints, and without them this will always produce errors on the timing report. In the attached project I've introduced an extra set of registers to remove this warning.
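To get a feel for where the fast clock edges fall relative to the pixel clock, here is a rough sketch. It assumes both clocks come from the same PLL with their rising edges lined up at time zero, which is my assumption and not something stated above, and it does not try to say which edges the real design uses for the transfer.

```python
from fractions import Fraction

FAST_PERIOD_NS = Fraction(1000, 375)   # 375MHz serialiser clock
SLOW_PERIOD_NS = Fraction(1000, 150)   # 150MHz pixel clock


def edge_margins(fast_cycles=5):
    """For each fast rising edge in one repeating pattern, return
    (edge time, ns since the previous pixel-clock edge, ns until the next)."""
    margins = []
    for k in range(fast_cycles):
        t = k * FAST_PERIOD_NS
        since = t % SLOW_PERIOD_NS
        until = SLOW_PERIOD_NS - since
        margins.append((t, since, until))
    return margins


for t, since, until in edge_margins():
    print(f"t={float(t):6.3f}ns  since={float(since):.3f}ns  until={float(until):.3f}ns")
```

The pattern repeats every five fast cycles, which is exactly two pixel periods, so the relationship between the edges is fixed and can be planned around.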
At this point the project successfully builds without timing errors.
Manual placement of critical components
At this point I am sure that the design will fail to work. The problem is that the path delays from the final flip-flops to the XOR LUT have been implemented to ensure that they meet 375MHz timing, not 1.5GHz. There will also be skew between the four TMDS channels due to the differences in path delay from the XOR LUT to the IOBUF buffer.
The best way I know to fix this is to constrain the location of the XOR LUT and the four flip-flops to be very close to each other, and very close to the output pin. With a bit of experimentation I came up with the following placement, which gives +/-50ps in path delays:
You can see the clock, blue, green, and red channels are all uniformly placed, directly over the desired I/O buffer. This ensures minimal signal skew between the outputs. For some reason the routing tool selected a different path for one of the red signals, but the timing is still fine.
And here are the constraints used for my board - you must modify these constraints to suit your FPGA board:
  INST "I_dvid/qdr_b/ff*"     LOC = "SLICE_X20Y3:SLICE_X21Y4";
  INST "I_dvid/qdr_b/xor_lut" LOC = "SLICE_X20Y2";
  INST "I_dvid/qdr_g/ff*"     LOC = "SLICE_X26Y3:SLICE_X27Y4";
  INST "I_dvid/qdr_g/xor_lut" LOC = "SLICE_X26Y2";
  INST "I_dvid/qdr_r/ff*"     LOC = "SLICE_X22Y3:SLICE_X23Y4";
  INST "I_dvid/qdr_r/xor_lut" LOC = "SLICE_X22Y2";
  INST "I_dvid/qdr_c/ff*"     LOC = "SLICE_X14Y3:SLICE_X15Y4";
  INST "I_dvid/qdr_c/xor_lut" LOC = "SLICE_X14Y2";
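If you need to try several placements, a throwaway script can save some typing. This is only a sketch: the helper name and its parameters are made up, and it simply reproduces the pattern of the constraints above (flip-flops in a two by two block of slices, XOR LUT in the slice just below) for a given starting column.

```python
def channel_constraints(name, x, prefix="I_dvid"):
    """Build the two LOC constraints for one QDR channel whose slices start
    at column `x`, following the pattern of the constraints above."""
    return [
        f'INST "{prefix}/qdr_{name}/ff*" LOC = "SLICE_X{x}Y3:SLICE_X{x + 1}Y4";',
        f'INST "{prefix}/qdr_{name}/xor_lut" LOC = "SLICE_X{x}Y2";',
    ]


# Columns used on the Pipistrello, read off the constraints above.
PIPISTRELLO_COLUMNS = {"b": 20, "g": 26, "r": 22, "c": 14}

for name, x in PIPISTRELLO_COLUMNS.items():
    print("\n".join(channel_constraints(name, x)))
```

The right columns (and rows) for another board still have to be found by looking at where its HDMI pins sit on the device.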
Here are the complete source files for the project, as used on the Pipistrello. When somebody with an Atlys has this working, can you please send me your constraints and I'll add them here too!
Moving to other grades
My device is a -3C grade. If you are using a different grade you may need to rework some parts to meet timing. I think that it should be possible. Two areas will most probably need revisiting:
- The TMDS encoder will need to be pipelined. At 7 levels of logic it is a bit complex. This should not be too hard.
- In the output serializers, rather than having 180 degrees between flip-flops you may need to add an extra stage, allowing you to have 270 degrees between clocks.
If I get enough requests to do this I might get around to it....
OK, I got bored. The changes required...
- The pixel generation had to be changed to generate two pixels at a time @75MHz, making the fast domain an integer multiple of the slow one.
- A lot more precise placement is required in the "qdr" outputs. Here are the constraints for one channel:
  INST "I_dvid/qdr_b/last"      LOC = "SLICE_X20Y8";
  INST "I_dvid/qdr_b/change_1"  LOC = "SLICE_X21Y8";
  INST "I_dvid/qdr_b/change_2"  LOC = "SLICE_X20Y7";
  INST "I_dvid/qdr_b/change_3"  LOC = "SLICE_X21Y7";
  INST "I_dvid/qdr_b/change_0"  LOC = "SLICE_X20Y6";
  INST "I_dvid/qdr_b/reclock_1" LOC = "SLICE_X21Y6";
  INST "I_dvid/qdr_b/reclock_2" LOC = "SLICE_X20Y5";
  INST "I_dvid/qdr_b/reclock_3" LOC = "SLICE_X21Y5";
  INST "I_dvid/qdr_b/ff_0"      LOC = "SLICE_X20Y4";
  INST "I_dvid/qdr_b/ff_1"      LOC = "SLICE_X21Y4";
  INST "I_dvid/qdr_b/ff_2"      LOC = "SLICE_X20Y3";
  INST "I_dvid/qdr_b/ff_3"      LOC = "SLICE_X21Y3";
  INST "I_dvid/qdr_b/xor_lut"   LOC = "SLICE_X20Y2";
This is what the final routing looks like:
Very nice and smooth - it actually looks pretty much like the "Technology View" for the QDR component turned on its side! I might back-port these explicit constraints to the original project to prevent annoying timing failures on different builds. Send me an email if you want the project - I don't recommend it however, due to the extra resource usage and pain of generating pixels two at a time. It was a pain to get to pass timing due to the clock skew figures for the -2 grade. If the target device truly does have this level of clock skew then the signal most probably will not work as the signal will be way out of spec.