# Binary division

Division is slow. Very slow compared to addition, subtraction or even multiplication. It also consumes a lot of CPU die space and power - so much so that some processor designers decide not to have a divide instruction at all!

If you find a better binary division algorithm than SRT Division (http://cs.uns.edu.ar/~jechaiz/arquitectura/division/SRT.html) and will be a rich man...

PLEASE NOTE - MOST OF THIS IS JUNK - THE __udivsi3() ROUTINE IN LIBGCC IS PRETTY SIMPLE AND FAST -

In a lot of cases it is much faster than radix 4 division, but as it's latency is variable it is unsuited for implementation in an FPGA.

## Division on an FPGA

I've got 16 bit Radix 2 Division up and running on an FPGA - 350MHz with a 16 stage pipleline, 29MHz fully combinatorial.

## Faster division

Pretty much everybody is familiar with using right shifts to perform division by powers of two, but can you do better when the divisor isn't a power of two, or isn't known at compile time? Well, it turns out you can - especially on AMD processors!

By exploiting knowledge of the range of numbers involved you can be much quicker.

(As pointed out on hackaday, if you do a lot of division by constants: http://www.hackersdelight.org/divcMore.pdf. Thanks Kukuta!)

## Radix 4 division

I'm playing around with radix-4 division to look at implementing it in an FPGA.

Here's unsigned 16 bit division in C, and a test bench main() routine:

```#include <stdio.h>
#include <stdlib.h>

/* This code assumes that unsigned short is 16 bits, and unsigned int is 32 bits */

unsigned short radix4_division_u16(unsigned short divisor, unsigned short dividend)
{
unsigned short quotient = 0;
unsigned char i = 16;  /* Number of bits to process */

if(dividend == 0)
{
printf("Divide by zero");
exit(0);
}

while(i > 0)
{
unsigned char j;
unsigned int d1,d2,d3;

i -= 2;

d1 = dividend << i;
d2 = dividend << (i+1);
d3 = d1+d2;

if(divisor < d1)
j = 0;
else if(divisor < d2) {
j = 1;
divisor -= d1;
} else if(divisor < d3) {
j = 2;
divisor -= d2;
} else {
j = 3;
divisor -= d3;
}
quotient = (quotient << 2)+j;
}
return quotient;
}

int main(int c, char *v[])
{
int i,j;
for(i = 0; i < 256*256; i++)
{
for(j = 1; j < 256*256; j++)
{
unsigned k = radix4_division_u16(i,j);
if(i/j != k)
{
printf("Error with %i/%i != %i\n",i,j,k);
exit(0);
}
}
if(i%650 == 0)
printf("%2.2f%% tested\n",i*100.0/256/256);
}
return 0;
}

```

## Performance on my AMD laptop

Strangely this is surprisingly fast on an AMD Athlon II P320 - or should that be that division is very slow on AMD?

To use the CPU's divide instruction to test the full 16 bit range for both dividend and divisor took 2:51 (two integer divisions * 2^16 * 2^16) = approx 8 billion divides.

With one of the divides replaced to call the Radix 4 code it now took 3:47 - only 58 seconds longer!

This works out to an extra 13.5ns per 16 bit divide (or around 27 clock cycles on a 2.1GHz AMD).

Here's the same test program used for comparison:

```#include <stdio.h>
#include <stdlib.h>

/* This code assumes that unsigned short is 16 bits, and unsigned int is 32 bits */

unsigned short radix4_division_u16(unsigned short divisor, unsigned short dividend)
{
return divisor / dividend;
}

int main(int c, char *v[])
{
int i,j;
for(i = 0; i < 256*256; i++)
{
for(j = 1; j < 256*256; j++)
{
unsigned k = radix4_division_u16(i,j);
if(i/j != k)
{
printf("Error with %i/%i != %i\n",i,j,k);
exit(0);
}
}
if(i%650 == 0)
printf("%2.2f%% tested\n",i*100.0/256/256);
}
return 0;
}

```

But now lets turn on optimisaitons. Compiling with -O3 turns the tables - radix4 now takes 2:26 vs a 2:47 - the Radix-4 version was quicker than the CPU's IDIV!?

Interesting. This definitely warrants further investigation...

## So what gives?

Initial guess - maybe the divide is using a 64 bit instruction, where as the code above can use the multiple integer cores in parallel. It may also be benefiting from the CPU's branch prediction.

## My new test case

To investigate this more I've changed my code to maximise the time spent doing divides - and taking a nod to IC testing methodology I'm just adding up the results and checking that the totals match.

Remember to compile with -O3!

### idiv source

```#include <stdio.h>
#include <stdlib.h>

int main(int c, char *v[])
{
int i,j;
int total=0;
for(i = 0; i < 256*256; i++)
for(j = 1; j < 256*256; j++)
total += i/j;

printf("Total = %i\n",total);
return 0;
}

```

This now takes 1m38.019s

```#include <stdio.h>
#include <stdlib.h>

/* This code assumes that unsigned short is 16 bits, and unsigned int is 32 bits */

/* (C) 2011 Mike Field (hamster@snap.net.nz */

static unsigned short radix4_division_u16(unsigned short divisor, unsigned short dividend)
{
unsigned short quotient = 0;
unsigned char i = 16;  /* Number of bits to process */

if(dividend == 0)
{
printf("Divide by zero");
exit(0);
}

while(i > 0)
{
unsigned char j;
unsigned int d1,d2,d3;

i -= 2;

d1 = dividend << i;
d2 = dividend << (i+1);
d3 = d1+d2;

if(divisor < d1)
j = 0;
else if(divisor < d2) {
j = 1;
divisor -= d1;
} else if(divisor < d3) {
j = 2;
divisor -= d2;
} else {
j = 3;
divisor -= d3;
}
quotient = (quotient << 2)+j;
}
return quotient;
}

int main(int c, char *v[])
{
int i,j;
int total=0;
for(i = 0; i < 256*256; i++)
for(j = 1; j < 256*256; j++)

printf("Total = %i\n",total);
return 0;
}

```

The Radix4 version now takes 1m00.364s. Wow! radix4 is 40% faster!

## Taking it to the next level

Ok, so 40% faster than an AMD is is pretty good - I can do 66% more divisions in the given time period. Can I actually do any better than that?

Sure can - I've got down to 36 seconds for the same test. My radix4_division_u16() function is now 64% faster - 175% more work in the same time.

This is done by avoiding 4 passes of the loop when the dividend is greater than 255, halving the work for these divides

```static unsigned short radix4_division_u16(unsigned short divisor, unsigned short dividend)
{
unsigned short quotient = 0;
unsigned char i;

if(dividend == 0)
{
printf("Divide by zero");
exit(0);
}
if( dividend < 256)
{
i = 16;  /* Number of bits to process */
while(i > 8)
{
unsigned char j;
unsigned int d1,d2,d3;

i -= 2;

d1 = dividend << i;
d2 = dividend << (i+1);
d3 = d1+d2;

if(divisor < d1)
j = 0;
else if(divisor < d2) {
j = 1;
divisor -= d1;
} else if(divisor < d3) {
j = 2;
divisor -= d2;
} else {
j = 3;
divisor -= d3;
}
quotient = (quotient << 2)+j;
}
}

i = 8;  /* Number of bits to process */
while(i > 0)
{
unsigned char j;
unsigned int d1,d2,d3;

i -= 2;

d1 = dividend << i;
d2 = dividend << (i+1);
d3 = d1+d2;

if(divisor < d1)
j = 0;
else if(divisor < d2) {
j = 1;
divisor -= d1;
} else if(divisor < d3) {
j = 2;
divisor -= d2;
} else {
j = 3;
divisor -= d3;
}
quotient = (quotient << 2)+j;
}
return quotient;
}

```

## Performance on Intel

So, AMD can be improved on, what about Intel? Well, on Intel Core i5 things are dramatically different.

```[root@mikef-acer ~]# time ./radix4; time ./idiv
Total = 1599432336

real	0m26.835s
user	0m26.696s
sys	0m0.014s
Total = 1599432336

real	0m16.249s
user	0m16.170s
sys	0m0.004s

```

My test that uses "idiv" takes only 16.2 seconds, the radix4 version takes 26 seconds. A bit of a loss there.

## Maybe it is just branch prediction?

Changing the order of the index variables changes the effectiveness of the branch prediction - for example, when dividing by 2^16-1 the same path is followed in all but one case.

On the AMD P320:

```[hamster@linuxdev radix4]\$ time ./radix4; time ./idiv
Total = 1599432336

real 0m34.167s
user 0m34.111s
sys 0m0.016s
Total = 1599432336

real 1m28.819s
user 1m28.600s
sys 0m0.018s

```

On the Intel:

```[root@mikef-acer ~]# time ./radix4; time ./idiv
Total = 1599432336

real	0m18.520s
user	0m18.363s
sys	0m0.022s
Total = 1599432336

real	0m16.581s
user	0m16.453s
sys	0m0.020s

```

It is closer for Intel, but the AMD is really really slow,

## What about modulo?

When we only need the remainder (aka modulo) the loop gets tighter - the 'quotient' and 'j' variables are no longer needed:

```static unsigned short radix4_modulo_u16(unsigned short divisor, unsigned short dividend)
{
unsigned char i;

if(dividend == 0)
{
printf("Divide by zero");
exit(0);
}
i = 16;  /* Number of bits to process */
while(i > 0)
{
unsigned int d1,d2,d3;
i -= 2;
d1 = dividend << i;
d2 = dividend << (i+1);
d3 = d1+d2;

if(divisor < d1)
;
else if(divisor < d2)
divisor -= d1;
else if(divisor < d3)
divisor -= d2;
else
divisor -= d3;
}
return divisor;
}

```

Although the code has changed quote a bit the timing is pretty much the same:

```[hamster@linuxdev radix4]\$ time ./radix4m; time ./idivm
Total = 788240730

real 0m33.396s
user 0m33.347s
sys 0m0.009s
Total = 788240730

real 1m27.861s
user 1m27.773s
sys 0m0.010s

```

## A few ideas on Radix 4 division on an FPGA

• d1,d2,d3 can be calculated each time, or can be put into shift-by-two registers. Using shift registers will greatly reduce the critical path between pipeline stages and allowing higher clock rates.
• for a divisor of n bits d1,d2,d3 need to be either 2n-2 bits long (or at least n+2 bits long if '1's are not shifted out bits n and n+1).
• this method allows you to generate 2 bits of the result per iteration (which will be equivalent to the FPGA clock cycle time).
• the radix of 4 is perfect for implimentation in the 4 input lookup tables in an FPGA.
• two of these stages could be put back to back to achieve 4 bits per iteration, but at only half the clock speed.
• the comparison and subtraction could be the same (using the "borrow bit" as the result of the comparison). A single stage would be an adder for calculating (2d+d), three N bit subtractors, a priority encoder and a n-bit 4-to-1 mux to select the output.
• The remainder is accessible at the end of the process.

Back to Main Page