# Binary division

### From Hamsterworks Wiki!

Division is slow. Very slow compared to addition, subtraction or even multiplication. It also consumes a lot of CPU die space and power - so much so that some processor designers decide not to have a divide instruction at all!

If you find a better binary division algorithm than SRT Division (http://cs.uns.edu.ar/~jechaiz/arquitectura/division/SRT.html) and will be a rich man...

**PLEASE NOTE - MOST OF THIS IS JUNK - THE __udivsi3() ROUTINE IN LIBGCC IS PRETTY SIMPLE AND FAST** -

In a lot of cases it is much faster than radix 4 division, but as it's latency is variable it is unsuited for implementation in an FPGA.

## Contents |

## Division on an FPGA

I've got 16 bit Radix 2 Division up and running on an FPGA - 350MHz with a 16 stage pipleline, 29MHz fully combinatorial.

## Faster division

Pretty much everybody is familiar with using right shifts to perform division by powers of two, but can you do better when the divisor isn't a power of two, or isn't known at compile time? Well, it turns out you can - especially on AMD processors!

By exploiting knowledge of the range of numbers involved you can be much quicker.

(*As pointed out on hackaday, if you do a lot of division by constants: http://www.hackersdelight.org/divcMore.pdf. Thanks Kukuta!*)

## Radix 4 division

I'm playing around with radix-4 division to look at implementing it in an FPGA.

Here's unsigned 16 bit division in C, and a test bench main() routine:

#include <stdio.h> #include <stdlib.h> /* This code assumes that unsigned short is 16 bits, and unsigned int is 32 bits */ unsigned short radix4_division_u16(unsigned short divisor, unsigned short dividend) { unsigned short quotient = 0; unsigned char i = 16; /* Number of bits to process */ if(dividend == 0) { printf("Divide by zero"); exit(0); } while(i > 0) { unsigned char j; unsigned int d1,d2,d3; i -= 2; d1 = dividend << i; d2 = dividend << (i+1); d3 = d1+d2; if(divisor < d1) j = 0; else if(divisor < d2) { j = 1; divisor -= d1; } else if(divisor < d3) { j = 2; divisor -= d2; } else { j = 3; divisor -= d3; } quotient = (quotient << 2)+j; } return quotient; } int main(int c, char *v[]) { int i,j; for(i = 0; i < 256*256; i++) { for(j = 1; j < 256*256; j++) { unsigned k = radix4_division_u16(i,j); if(i/j != k) { printf("Error with %i/%i != %i\n",i,j,k); exit(0); } } if(i%650 == 0) printf("%2.2f%% tested\n",i*100.0/256/256); } return 0; }

## Performance on my AMD laptop

Strangely this is surprisingly fast on an AMD Athlon II P320 - or should that be that division is very slow on AMD?

To use the CPU's divide instruction to test the full 16 bit range for both dividend and divisor took 2:51 (two integer divisions * 2^16 * 2^16) = approx 8 billion divides.

With one of the divides replaced to call the Radix 4 code it now took 3:47 - only 58 seconds longer!

This works out to an extra 13.5ns per 16 bit divide (or around 27 clock cycles on a 2.1GHz AMD).

Here's the same test program used for comparison:

#include <stdio.h> #include <stdlib.h> /* This code assumes that unsigned short is 16 bits, and unsigned int is 32 bits */ unsigned short radix4_division_u16(unsigned short divisor, unsigned short dividend) { return divisor / dividend; } int main(int c, char *v[]) { int i,j; for(i = 0; i < 256*256; i++) { for(j = 1; j < 256*256; j++) { unsigned k = radix4_division_u16(i,j); if(i/j != k) { printf("Error with %i/%i != %i\n",i,j,k); exit(0); } } if(i%650 == 0) printf("%2.2f%% tested\n",i*100.0/256/256); } return 0; }

But now lets turn on optimisaitons. Compiling with -O3 turns the tables - radix4 now takes 2:26 vs a 2:47 - the Radix-4 version was quicker than the CPU's IDIV!?

Interesting. This definitely warrants further investigation...

## So what gives?

Initial guess - maybe the divide is using a 64 bit instruction, where as the code above can use the multiple integer cores in parallel. It may also be benefiting from the CPU's branch prediction.

## My new test case

To investigate this more I've changed my code to maximise the time spent doing divides - and taking a nod to IC testing methodology I'm just adding up the results and checking that the totals match.

Remember to compile with -O3!

### idiv source

#include <stdio.h> #include <stdlib.h> int main(int c, char *v[]) { int i,j; int total=0; for(i = 0; i < 256*256; i++) for(j = 1; j < 256*256; j++) total += i/j; printf("Total = %i\n",total); return 0; }

This now takes 1m38.019s

#include <stdio.h> #include <stdlib.h> /* This code assumes that unsigned short is 16 bits, and unsigned int is 32 bits */ /* (C) 2011 Mike Field (hamster@snap.net.nz */ static unsigned short radix4_division_u16(unsigned short divisor, unsigned short dividend) { unsigned short quotient = 0; unsigned char i = 16; /* Number of bits to process */ if(dividend == 0) { printf("Divide by zero"); exit(0); } while(i > 0) { unsigned char j; unsigned int d1,d2,d3; i -= 2; d1 = dividend << i; d2 = dividend << (i+1); d3 = d1+d2; if(divisor < d1) j = 0; else if(divisor < d2) { j = 1; divisor -= d1; } else if(divisor < d3) { j = 2; divisor -= d2; } else { j = 3; divisor -= d3; } quotient = (quotient << 2)+j; } return quotient; } int main(int c, char *v[]) { int i,j; int total=0; for(i = 0; i < 256*256; i++) for(j = 1; j < 256*256; j++) total += radix4_division_u16(i,j); printf("Total = %i\n",total); return 0; }

The Radix4 version now takes 1m00.364s. **Wow! radix4 is 40% faster!**

## Taking it to the next level

Ok, so 40% faster than an AMD is is pretty good - I can do 66% more divisions in the given time period. Can I actually do any better than that?

Sure can - I've got down to 36 seconds for the same test. My radix4_division_u16() function is now 64% faster - 175% more work in the same time.

This is done by avoiding 4 passes of the loop when the dividend is greater than 255, halving the work for these divides

static unsigned short radix4_division_u16(unsigned short divisor, unsigned short dividend) { unsigned short quotient = 0; unsigned char i; if(dividend == 0) { printf("Divide by zero"); exit(0); } if( dividend < 256) { i = 16; /* Number of bits to process */ while(i > 8) { unsigned char j; unsigned int d1,d2,d3; i -= 2; d1 = dividend << i; d2 = dividend << (i+1); d3 = d1+d2; if(divisor < d1) j = 0; else if(divisor < d2) { j = 1; divisor -= d1; } else if(divisor < d3) { j = 2; divisor -= d2; } else { j = 3; divisor -= d3; } quotient = (quotient << 2)+j; } } i = 8; /* Number of bits to process */ while(i > 0) { unsigned char j; unsigned int d1,d2,d3; i -= 2; d1 = dividend << i; d2 = dividend << (i+1); d3 = d1+d2; if(divisor < d1) j = 0; else if(divisor < d2) { j = 1; divisor -= d1; } else if(divisor < d3) { j = 2; divisor -= d2; } else { j = 3; divisor -= d3; } quotient = (quotient << 2)+j; } return quotient; }

## Performance on Intel

So, AMD can be improved on, what about Intel? Well, on Intel Core i5 things are dramatically different.

[root@mikef-acer ~]# time ./radix4; time ./idiv Total = 1599432336 real 0m26.835s user 0m26.696s sys 0m0.014s Total = 1599432336 real 0m16.249s user 0m16.170s sys 0m0.004s

My test that uses "idiv" takes only 16.2 seconds, the radix4 version takes 26 seconds. A bit of a loss there.

## Maybe it is just branch prediction?

Changing the order of the index variables changes the effectiveness of the branch prediction - for example, when dividing by 2^16-1 the same path is followed in all but one case.

On the AMD P320:

[hamster@linuxdev radix4]$ time ./radix4; time ./idiv Total = 1599432336 real 0m34.167s user 0m34.111s sys 0m0.016s Total = 1599432336 real 1m28.819s user 1m28.600s sys 0m0.018s

On the Intel:

[root@mikef-acer ~]# time ./radix4; time ./idiv Total = 1599432336 real 0m18.520s user 0m18.363s sys 0m0.022s Total = 1599432336 real 0m16.581s user 0m16.453s sys 0m0.020s

It is closer for Intel, but the AMD is really really slow,

## What about modulo?

When we only need the remainder (aka modulo) the loop gets tighter - the 'quotient' and 'j' variables are no longer needed:

static unsigned short radix4_modulo_u16(unsigned short divisor, unsigned short dividend) { unsigned char i; if(dividend == 0) { printf("Divide by zero"); exit(0); } i = 16; /* Number of bits to process */ while(i > 0) { unsigned int d1,d2,d3; i -= 2; d1 = dividend << i; d2 = dividend << (i+1); d3 = d1+d2; if(divisor < d1) ; else if(divisor < d2) divisor -= d1; else if(divisor < d3) divisor -= d2; else divisor -= d3; } return divisor; }

Although the code has changed quote a bit the timing is pretty much the same:

[hamster@linuxdev radix4]$ time ./radix4m; time ./idivm Total = 788240730 real 0m33.396s user 0m33.347s sys 0m0.009s Total = 788240730 real 1m27.861s user 1m27.773s sys 0m0.010s

## A few ideas on Radix 4 division on an FPGA

- d1,d2,d3 can be calculated each time, or can be put into shift-by-two registers. Using shift registers will greatly reduce the critical path between pipeline stages and allowing higher clock rates.
- for a divisor of n bits d1,d2,d3 need to be either 2n-2 bits long (or at least n+2 bits long if '1's are not shifted out bits n and n+1).
- this method allows you to generate 2 bits of the result per iteration (which will be equivalent to the FPGA clock cycle time).
- the radix of 4 is perfect for implimentation in the 4 input lookup tables in an FPGA.
- two of these stages could be put back to back to achieve 4 bits per iteration, but at only half the clock speed.
- the comparison and subtraction could be the same (using the "borrow bit" as the result of the comparison). A single stage would be an adder for calculating (2d+d), three N bit subtractors, a priority encoder and a n-bit 4-to-1 mux to select the output.
- The remainder is accessible at the end of the process.

Back to Main Page