Algorithms

Have you considered table lookup? No smiley ... In the days when a CPU filled a rack of boards, and the ALU (Arithmetic/Logic Unit) alone took up at least one board, maybe more, there was a machine that did exactly that, although for floating point rather than integer division. The most significant bits of the two mantissas (always kept normalized, with a hidden MSB) were used as indexes into a huge 2D table in ROM, giving the 11 most significant bits of the result. From that starting point, Newton iterations were done, each one doubling the precision. The entire process was done in hardware: the initial lookup took one clock cycle, and each iteration took an extra clock cycle (two for single precision, four for double precision). The final normalization of the result took yet another clock cycle. This FP divide was so fast that the CPU didn't have any integer divide logic. It was faster to convert the integers to 64-bit FP, do the division and convert back. The FP logic alone was a circuit board of about A3 size (i.e. twice the size of a standard sheet of typewriter paper), packed with chips. For all I know, modern CPUs may use the same technique today. In the late 1970s, it was so remarkable that the design was presented in internationally recognized professional magazines. If I were to write a division function for arbitrary length integers (or arbitrary precision floats), I would seriously consider something in this direction. If the machine provides a division instruction, you can use that to obtain the first 'n' bits, rather than using a huge lookup table.
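To make the idea concrete, here is a rough Python sketch of the lookup-plus-Newton scheme. It is not the design of that machine: as a simplification it uses a small 1D table of reciprocals of the divisor's mantissa instead of a 2D table indexed by both mantissas, and the names (TABLE_BITS, RECIP_TABLE, reciprocal, divide) and the table size are my own assumptions. It only shows how a seed good to a few bits gets its precision roughly doubled by each iteration.

```python
import math

TABLE_BITS = 8  # assumed table size (256 entries), chosen only for illustration

# Seed table: for a mantissa m in [1, 2), index by its top TABLE_BITS fraction
# bits and store the reciprocal of the midpoint of that sub-interval.
RECIP_TABLE = [
    1.0 / (1.0 + (i + 0.5) / (1 << TABLE_BITS))
    for i in range(1 << TABLE_BITS)
]


def reciprocal(d, iterations=3):
    """Approximate 1/d for a finite, nonzero float d."""
    if d == 0:
        raise ZeroDivisionError("reciprocal of zero")
    sign = -1.0 if d < 0 else 1.0
    m, e = math.frexp(abs(d))   # abs(d) == m * 2**e with 0.5 <= m < 1
    m *= 2.0                    # renormalize so that 1 <= m < 2
    e -= 1
    index = int((m - 1.0) * (1 << TABLE_BITS))
    x = RECIP_TABLE[index]      # seed, relative error at most about 2**-9
    for _ in range(iterations):
        # Newton step for f(x) = 1/x - m: the relative error gets squared,
        # so the number of correct bits roughly doubles each time.
        x = x * (2.0 - m * x)
    return sign * math.ldexp(x, -e)   # put the exponent and sign back


def divide(n, d, iterations=3):
    """n / d computed as n * (1/d)."""
    return n * reciprocal(d, iterations)


if __name__ == "__main__":
    for k in range(4):
        approx = reciprocal(3.0, iterations=k)
        print(k, approx, abs(approx * 3.0 - 1.0))
```

Running it prints the relative error after 0, 1, 2 and 3 iterations, which shows the doubling until the limit of a 64-bit float is reached. The table lookup could just as well be replaced by a narrow hardware divide that supplies the first few bits of the seed.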
