From Hamsterworks Wiki!
Please make sure that you have completed the earlier modules of FPGA course.
Having problems with your code? Do I need to change something? Want a reference solution to the project? Email me at "email@example.com", after removing the "nospam-" bit. I'll try to get back to you in a day or so.
Aim of this module
- Learn the basics of the multiplier blocks
- Learn how to maximise multiplier performance
Performance of binary multiplication
Binary multiplication is complex - implementing multiplication of any two 'n' bit numbers approximately n*(n-1) full adders, n half adders and n*n 'AND' operations.
To multiply the four bit binary numbers "abcd" by "efgh" the following operations are needed (where '&' is a binary AND):
Multiplication also has a big implication for your designs performance - because of the 'carries' required, multiplying two 'n' bit numbers takes around twice as long as adding two 'n' bit numbers.
It also consumes a very large amount of logic if multiplicaiton is implemented in the generic logic blocks within the FPGA.
Multiplication in FPGAs
To get around this, most FPGAs include multiple multiplier blocks - a XC3S100 has four 18 bit x 18 bit multipliers, and a XC3S250 has twelve!
To improve performance multipliers also include additional registers, allowing the multiplicands and result to be registered withing the multiplier block. There are also optional registers within the multiplier that holds the partial result half way through the multiplication.
Using these greatly improves throughput performance by removing the routing delays to get the inputs to and from the multipliers, but a the cost of increased latency - measured in either time between two number being given to the multiplier and the result being available, or the number of clock cycles.
When all these are enabled the multiplier works as follows
|0||A and B inputs are latched|
|1||The first half of the multiplication is preformed|
|2||The second half of the multiplication is preformed|
|3||The result of the multiplication is available on the P output|
Multipliers can accept a new set of 'A' and 'B' values each clock cycle, so up to three can be 'in flight' at any one time. In some cases this is useful but in other cases it can be annoying.
A useful case might be processing Red/Green/Blue video values, where each channel is separate.
An annoying case is where feedback is needed of the output value back into the input. If the math isn't in your favor you may be better off not using any registers at all - it may even be slightly faster and running at one third the clock speed will use less power.
What if 18x18 isn't 'wide' enough?
What if you want to use bigger numbers? say 32 bits? Sure!
Just like in decimal when multiplying two two digit numbers "ab" and "cd" is calculated as "a*c*10*10 + b*c*10 + a*d*10 + b*d" the same works - just replace each of a,v,c,d with an 18 bit number, and each '10' with 2^18.
As the designer you have the choice of either:
- using four multipliers and three adders, with a best-performance latency of 5 cycles with a throughput of one pair of A and B values per clock
- using the same multiplier to calculate each of the four intermediate products, with a best-performance latency of 13 cycles (four 3-cycle multiplication plus the final addition) and with careful scheduling you can process three input pairs every 12 cycles.
Project 19.1 - Digital Volume control
I had a hard time thinking of s useful project, but finally got there!
- Revisit the Audio output project
- Use the core generator to sdd an IP multipler, with an 8 bit unsigned input for the volume and the other matching your BRAM's sample size
- Add a multiplier between the block BRAM and the DAC, using the value of switches as the other input.
- Use the highest output bits of the multiplier to feed the DAC
- If you get the correct 'signed' / 'unsigned' settings for each input of the multiplier you will now be able to control the volume with the switches
Unless you are careful you may have issues with mismatching signed and unsigned values - it pays to simulate this carefully!
- You can also implement multiplication using the generic logic ("LUT"s). If interested, you can change the IP multiplier to use LUTs instead of the dedicated multiplier blocks and compare maximum performance and resource usage.
Ready to carry on?
Click here to carry on to the next module.