Tuesday, July 8, 2008

Making division in Python faster

A very useful feature of Python is that it comes with built-in arbitrary-precision integers. The implementation is not the fastest in the world, but it is adequate for many purposes.

Python is clever enough to use the Karatsuba algorithm for multiplication of large integers, which gives an O(n1.6) asymptotic time complexity for n-digit multiplication. This is a huge improvement over the O(n2) schoolbook algorithm when multiplying anything larger than a few hundred digits.

Unfortunately, division in Python is not so well equipped: Python uses a brute force O(n2) algorithm for quotients of any size. Several algorithms in mpmath perform several multiplications followed by a single large division; at high precision, the final division can take as much time as all the preceding multiplications together. Division is needed much less often than multiplication, but it is needed nonetheless.

Newton's method to the rescue. If f(x) can be evaluated to n-digit accuracy in time O(q(n)), then O(q(n)) is also the complexity for evaluating f−1(x) using Newton's method (under reasonable assumptions). Thus Newton's method allows one to divide as fast as multiplying, compute logarithms as fast as exponentials, etc. The overhead hidden in the O()-expression is typically a small factor, of the order of 2-5.

To divide with the same asymptotic speed as multiplying, we rewrite p/q as p · (1/q), and use Newton's method to calculate r = 1/q. Newton's method leads to the iteration rn+1 = 2rn - qrn2, which converges quadratically (i.e. each iteration doubles the accuracy). It is now important to exploit the self-correcting nature of Newton's method by performing each step with an arithmetic precision equal to the accuracy. This way only a single step has to be performed at full precision. If this optimization is not used, the time complexity is just O((log n) q(n)), not O(q(n)).

Here is my implementation of Newton division in Python:

from mpmath.lib import giant_steps, lshift, rshift
from math import log


def size(x):
if isinstance(x, (int, long)):
return int(log(x,2))
# GMPY support
return x.numdigits(2)

def newdiv(p, q):
szp = size(p)
szq = size(q)
szr = szp - szq
if min(szp, szq, szr) < 2*START_PREC:
return p//q
r = (1 << (2*START_PREC)) // (q >> (szq - START_PREC))
last_prec = START_PREC
for prec in giant_steps(START_PREC, szr):
a = lshift(r, prec-last_prec+1)
b = rshift(r**2 * rshift(q, szq-prec), 2*last_prec)
r = a - b
last_prec = prec
return ((p >> szq) * r) >> szr

The core algorithm is just a few lines, although those lines took me a few moments to write (getting the shifts right gives me a headache). I imported the functions lshift, rshift (which are like the << and >> operators but allow negative shifts) and giant_steps (which counts up from START_PREC to ... szr/4, szr/2, szr) from mpmath just for convenience; the implementation above requires essentially nothing but Python builtins.

How does it fare? A little benchmarking code:

from time import clock

def timing(INT):
fmt = "%10s %12f %12f %12f %8s"
print "%10s %12s %12s %12s %8s" % ("size", "old time", "new time",
"faster", "error")
print "-"*78
for i in range(4,30):
n = 2**i
Q = INT(10)**n
P = Q**2
t1 = clock()
R1 = P // Q
t2 = clock()
R2 = newdiv(P,Q)
t3 = clock()
size, old_time, new_time = n, t2-t1, t3-t2
faster, error = old_time/new_time, R2-R1
print fmt % (size, old_time, new_time, faster, error)

I choose to benchmark the performance of dividing a 2n-digit integer by an n-digit integer, for a result (and required precision) of n digits. The improvement is likely to be smaller if the denominator and quotient have unequal size, but the balanced case is the most important to optimize.

Now the performance compared to Python's builtin division operator:


size old time new time faster error
16 0.000005 0.000101 0.050000 -1
32 0.000005 0.000044 0.107595 -1
64 0.000006 0.000052 0.123656 0
128 0.000013 0.000064 0.197368 0
256 0.000033 0.000088 0.377778 -1
512 0.000115 0.000173 0.663430 0
1024 0.000413 0.000399 1.035689 1
2048 0.001629 0.001099 1.481576 0
4096 0.006453 0.004292 1.503352 -1
8192 0.025596 0.009966 2.568285 0
16384 0.101964 0.030327 3.362182 -1
32768 0.408266 0.091811 4.446792 -1
65536 1.633296 0.278531 5.863967 -1
131072 6.535185 0.834847 7.828001 0
262144 26.108710 2.517134 10.372397 0
524288 104.445635 7.576984 13.784593 -1
1048576 418.701976 22.760790 18.395758 0

The results are excellent. The Newton division breaks even with the builtin division already at 1000 digits (despite the interpreter overhead), and is 10x faster at 260,000 digits.

The implementation in newdiv is not exact; as you can see, the results differ from the (presumably correct!) values computed by Python by a few units in the last place. This is not a big concern, as I intend to use this mainly for numerical algorithms, and it is always possible to simply add a few guard digits. For number theory and other applications that require exact division, I think it is sufficient to multiply a few of the least significant bits in q and r to see if they agree with p, and correct as necessary.

The point of this exercise was to try to give division of Python long ints the advantage of the builtin Karatsuba multiplication. Just for fun, I also tried it with gmpy integers:

from gmpy import mpz

size old time new time faster error
16 0.000011 0.000068 0.168033 -3
32 0.000008 0.000045 0.185185 -2
64 0.000006 0.000047 0.137725 -1
128 0.000005 0.000053 0.089005 -2
256 0.000007 0.000070 0.099602 0
512 0.000015 0.000083 0.184564 -1
1024 0.000044 0.000156 0.282143 0
2048 0.000134 0.000279 0.481000 -1
4096 0.000404 0.000657 0.614960 -2
8192 0.001184 0.001719 0.688882 -1
16384 0.003481 0.004585 0.759322 -2
32768 0.009980 0.012043 0.828671 -1
65536 0.027762 0.028783 0.964516 -1
131072 0.072987 0.067773 1.076926 -1
262144 0.186714 0.151604 1.231588 -1
524288 0.462057 0.342160 1.350411 -1
1048576 1.119103 0.788582 1.419134 -2
2097152 2.726458 1.838570 1.482923 -1
4194304 6.607204 4.069614 1.623546 -1
8388608 15.627027 8.980883 1.740032 -1
16777216 36.581787 19.167144 1.908567 -4
33554432 83.568330 40.131393 2.082368 -1
67108864 190.596889 81.412867 2.341115 -1

Interestingly, the Newton division breaks even at 130,000 digits and is twice as fast at 30 million digits. So it is clear that either gmpy does not use GMP optimally, or GMP does not use the asymptotically fastest possible division algorithm. I think the latter is the case; this page contains development code for the next version of GMP, including improved division code, and notes that "the new code has complexity O(M(n)) for all operations, while the old mpz code is O(M(n)log(n))". This seems consistent with my timings.

A small notice: I will be away for about a week starting tomorrow. I will bring my laptop to continue working on my GSoC project during the time, but I won't have internet access.

Edit: Fixed a misplaced parenthesis in newdiv that gave the wrong criterion for falling back to p//q. Benchmarks unaffected.


Unknown said...

Can consideration of submitting your implementation to be added to Python's standard library? I don't know if the math heads from python-dev would be interested or not, but it wouldn't hurt to try.

Fredrik Johansson said...

Certainly, though I'm not sure exactly where in the standard library it would go. It would make more sense to update the builtin integer division code.

Unknown said...

A couple years ago, I wrote a Python library for arbitrary precision radix-10 arithmetic. It uses Nussbaumer convolution for multiplication and a new division algorithm.

For 524288 digits, it is about 7x faster than newdiv() but it also requires several hundred lines of code.

I don't have valid gmpy-based tests because I get a segmentation fault. I'll track that down and see how it compares against gmpy's native division.

Anonymous said...
This comment has been removed by a blog administrator.