Choosing a division implementation


This recipe shows you:

The ARM instruction set does not have a divide instruction. In some applications it is important that a general purpose divide executes as quickly as possible. This recipe shows how to choose between different divide implementations for the ARM.

Divide implementations in the C-library

The C-library offers a choice between two divide variants. This choice is basically a speed vs. space tradeoff; the options are: 'unrolled' and 'small'.

In the C-library build directory (eg. directory semi for the semi-hosted library), the file options is used to select variants of the C-library.

The supplied file contains the following:

memcpy = fast
divide = unrolled
stdfile_redirection = off
fp_type = module
stack = contiguous
backtrace = off

The default divide implementation 'unrolled' is fast, but occupies a total of 416 bytes (55 instructions for the signed version plus 49 instructions for the unsigned version). This is an appropriate default for most toolkit users who are interested in obtaining maximum performance.

Alternatively you can change this file to select 'small' divide which is more compact at 136 bytes(20 instructions for signed plus 14 instructions for unsigned) but somewhat slower as there is considerable looping overhead.

For a comparison of the speed difference between these two routines, consult the following table (the speed of divide is data-dependent):

Signed division example timings

Cycle times are F-cycles on a cached ARM6 series processor excluding call & return

---------------------------------------------------------
Calc        |unrolled     |rolled cycles                 
            |cycles       |                              
---------------------------------------------------------
0/100       |22           |19                            
---------------------------------------------------------
9/7         |22           |19                            
---------------------------------------------------------
4096/2      |70           |136                           
---------------------------------------------------------
1000000/3   |99           |240                           
---------------------------------------------------------
1000000000/1|148          |370                           
---------------------------------------------------------

If you have a specific requirement you can modify the supplied routines to suit your application better. For instance, you could write an unrolled-2-times version or you could combine the signed and unsigned versions to save more space.

Guaranteed-performance divide routines for real-time applications

The C-library also contains two fully unwound divide routines. These have been carefully implemented for maximum speed. They are useful when a guaranteed performance is required, eg. for real-time feedback control routines or DSP. The long maximum division time of the standard C-library functions may make them unsuitable for some real-time applications.

The supplied routines implement signed division only; it would be possible to modify them for unsigned division if required. The routines are described by the standard header file "stdlib.h" which contains the following declarations:

ARM real-time divide functions for guaranteed performance

typedef struct __sdiv32by16 { int quot, rem; } __sdiv32by16;
/* used int so that values return in regs, although 16-bit */
typedef struct __sdiv64by32 { int rem, quot; } __sdiv64by32;

__value_in_regs extern __sdiv32by16 __rt_sdiv32by16(
    int /*numer*/,
    short int /*denom*/);
/*
 * Signed div: (16-bit quot), (16-bit rem) = (32-bit) / (16-bit)
 */
__value_in_regs extern __sdiv64by32 __rt_sdiv64by32(
    int /*numer_h*/, unsigned int /*numer_l*/,
    int /*denom*/);
/*
 * Signed div: (32-bit quot), (32-bit rem) = (64-bit) / (32-bit)
 */

These routines both have guaranteed performance:

108 cycles for __rt_sdiv64by32 (excluding call & return)

44 cycles for __rt_sdiv32by16

Currently the C-compiler does not automatically use these routines, as the default routines have early-termination which yields good performance for small values.

In order to use these new divide routines, you should explicitly call the relevant function. The __rt_div64by32 function is complicated by the fact that our C-compiler does not currently support 64-bit integers, as you have to split a 64-bit value between two 32-bit variables.

The following example program shows how to use these routines. This program is available as dspdiv.c in the examples directory. Once the program has been compiled and linked, type

armsd dspdiv 1000 3            

to calculate 1000/3

divdsp.c source code

#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    if (argc != 3)
        puts("needs 2 numeric args");
    else
    {
        __sdiv32by16 result;

        result = __rt_sdiv32by16(atoi(argv[1]), atoi(argv[2]));

        printf("quotient %d\n", result.quot);
        printf("remainder %d\n", result.rem);
    }
    return 0;
}

Summary

The standard division routine used by the C-library can be selected by using the options file in the C-library build area. If the supplied variants are not suitable, you can write your own.

For real-time applications, the maximum division time must be as short as possible to ensure that the calculation can complete in time. In this case, the functions __rt_sdiv64by32 and __rt_sdiv32by16 are useful.

Related topics