Saturday, February 21, 2009

How (not) to compute harmonic numbers

The nth harmonic number is the nth partial sum of the divergent harmonic series, Hn = 1 + 1/2 + 1/3 + ... + 1/n. The simplest way to compute this quantity is to add it directly the way it is written: 1, 1+1/2, 1+1/2+1/3, and so on. For n greater than roughly 10 or 100, this algorithm is not a very good one. Why?

Firstly, in floating-point arithmetic, it is much better to use the asymptotic expansion

Hn ~ ln(n) + γ + 1/(2n) - 1/(12n^2) + 1/(120n^4) - ...

which requires fewer terms the larger n is. Combining this with the recurrence Hn+1 = Hn + 1/(n+1) when n is too small gives a fast numerical algorithm for fixed-precision or arbitrary-precision computation of harmonic numbers (especially large ones). It allows one to determine, for example, that it takes exactly 15092688622113788323693563264538101449859497 terms until the partial sums of the harmonic series exceed 100. (This algorithm is implemented in mpmath, and in most other comparable software.)
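
As a rough illustration (not mpmath's implementation), a plain double-precision version of the truncated expansion might look like this:

from math import log

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def harmonic_float(n):
    # Truncated asymptotic expansion; the omitted terms are O(1/n^6),
    # so this is accurate to near double precision once n is moderately large.
    return log(n) + EULER_GAMMA + 1.0/(2*n) - 1.0/(12*n**2) + 1.0/(120*n**4)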

Secondly, the naive algorithm is especially bad for exact computation of harmonic numbers. This might not be obvious at a glance. We are just adding n numbers: what's to improve? The catch is that rational arithmetic has hidden costs: a single addition requires three multiplications and usually a GCD reduction. In contrast with integer or floating-point addition, rational addition gets very slow when the numerators and denominators get large. Computing harmonic numbers the naive way triggers worst-case behavior of rational arithmetic: the denominators grow like n! (if no GCD is performed, the denominator of Hn literally becomes n!).
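
Concretely, a single reduced addition p/q + r/s involves roughly this much work (sketch; gcd is available from the fractions module in Python 2.6):

from fractions import gcd  # math.gcd in later Python versions

def add_rational(p, q, r, s):
    # p/q + r/s: three multiplications plus one GCD to reduce the result
    num, den = p*s + r*q, q*s
    g = gcd(num, den)
    return num // g, den // g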

It should be well known that computing n! as 1, 1·2, 1·2·3, ... is a poor method when n is large. A much better algorithm, assuming that multiplication is subquadratic, is to recursively split the products in half, i.e. compute (1·2·...·(n/2)) · ((n/2+1)·...·n) and so on. This algorithm has complexity O(log(n) M(n log(n))) compared to O(n^2 log(n)) for the naive algorithm, where M(n) is the complexity of multiplying two n-bit integers. This approach is called binary splitting, and has many other applications.
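
For example, a binary-splitting factorial can be sketched as follows (helper names are mine, not from any particular library):

def _prod_range(a, b):
    # Product of the integers a, a+1, ..., b-1, splitting the range in half
    # so that the two factors in each multiplication stay balanced in size.
    if b - a == 1:
        return a
    m = (a + b) // 2
    return _prod_range(a, m) * _prod_range(m, b)

def fac_bsplit(n):
    return _prod_range(1, n + 1) if n > 1 else 1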

The example of factorials directly suggests an analogous algorithm for computing harmonic numbers with reduced complexity: just recursively split the summation in half.

A few more variations are also possible. Instead of working with rational numbers, which require a GCD reduction after each addition, one can work with the unreduced numerators and denominators as integers, and form a rational number at the end. Another way to obtain a pure-integer algorithm for Hn = p/q is to compute the denominator q = n! and then use the formula

p = q/1 + q/2 + ... + q/n

for the numerator.

Below, I have implemented five different algorithms for computing harmonic numbers based on the preceding remarks. The code works with either Python 2.6, using the Fraction type from the standard library, or with gmpy, by adding the following respective header code:


# python
from math import factorial as fac
from fractions import Fraction as mpq
one = 1
mpz = int

# gmpy
from gmpy import mpz, mpq, fac
one = mpz(1)


Algorithm 1: directly add 1, 1/2, 1/3, ..., 1/n:

def harmonic1(n):
    return sum(mpq(1,k) for k in xrange(1,n+1))


Algorithm 2: like algorithm 1, GCD reduction postponed to the end:

def harmonic2(n):
    p, q = mpz(1), mpz(1)
    for k in xrange(2,n+1):
        p, q = p*k+q, k*q
    return mpq(p,q)


Algorithm 3: compute denominator; compute numerator series:

def harmonic3(n):
    q = fac(n)
    p = sum(q//k for k in xrange(1,n+1))
    return mpq(p,q)


Algorithm 4: binary splitting:

def _harmonic4(a, b):
    if b-a == 1:
        return mpq(1,a)
    m = (a+b)//2
    return _harmonic4(a,m) + _harmonic4(m,b)

def harmonic4(n):
    return _harmonic4(1,n+1)


Algorithm 5: binary splitting, GCD reduction postponed to the end:

def _harmonic5(a, b):
    if b-a == 1:
        return one, mpz(a)
    m = (a+b)//2
    p, q = _harmonic5(a,m)
    r, s = _harmonic5(m,b)
    return p*s+q*r, q*s

def harmonic5(n):
    return mpq(*_harmonic5(1,n+1))


I did some benchmarking on my laptop (32-bit, 1.6 GHz Intel Celeron M), with n up to 10^4 for the Python version and 10^6 for the gmpy version. Here are the results (times in seconds):


python:
n harmonic1 harmonic2 harmonic3 harmonic4 harmonic5
10 0.000159 0.000009 0.000010 0.000160 0.000025
100 0.004060 0.000233 0.000234 0.002211 0.000358
1000 0.889937 0.028822 0.029111 0.048106 0.026434
10000 769.109402 3.813702 3.823027 2.710409 3.475280

gmpy:
n harmonic1 harmonic2 harmonic3 harmonic4 harmonic5
10 0.000028 0.000010 0.000010 0.000033 0.000021
100 0.000284 0.000097 0.000111 0.000365 0.000226
1000 0.003543 0.001870 0.004986 0.004280 0.002651
10000 0.103110 0.142100 0.588723 0.059249 0.052265
100000 7.861546 17.002986 74.712460 1.115333 1.200865
1000000 994.548226 - - 27.927289 24.425421


Algorithm 1 clearly does not do well asymptotically. For the largest n measured, it is 284 times slower than the fastest algorithm with Python arithmetic and 40 times slower with gmpy arithmetic. Interestingly, algorithms 2 and 3 both substantially improve on algorithm 1 with Python arithmetic at least up to n = 10000, but both are worse than algorithm 1 with gmpy. This is presumably because GMP uses a much better GCD algorithm that reduces the relative cost of rational arithmetic.

Algorithms 4 and 5 are the clear winners for large n, but it is hard to tell which is better. The timings seem to fluctuate a bit when repeated, so small differences in the table above are unreliable. I think algorithm 5 could be optimized a bit if implemented in another language, by accumulating temporary values in machine integers.

It is amusing to note that computing the 10000th harmonic number with the recursive algorithm and an optimized implementation of rational arithmetic is 14,700 times faster than the direct algorithm with a non-optimized implementation of rational arithmetic. Choosing a larger n (say, 10^5) gives an even more sensational ratio, of course, which I'm not patient enough to try. The moral is: Moore's law is not an excuse for doing it wrong.

Algorithm 1 is not all bad, of course. With memoization (or implemented as a generator), it is good enough for sequential generation of the numbers H1, H2, H3, ..., and this tends to be needed much more commonly than isolated large harmonic numbers. SymPy therefore uses the memoized version of algorithm 1.
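
Such a generator is only a few lines; here is a sketch (not SymPy's actual implementation), reusing the mpq alias from the header code above:

def harmonic_gen():
    # Yields H1, H2, H3, ... by adding one new term per step.
    h, k = mpq(0), 0
    while True:
        k += 1
        h += mpq(1, k)
        yield h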

The algorithms above can be adapted for the task of computing generalized harmonic numbers, i.e. numbers of the form 1 + 1/2^k + 1/3^k + ... + 1/n^k for some integer k. Also worth noting is that algorithm 5 allows for computation of Stirling numbers of the first kind. Since |S(n+1,2)| = n! · Hn, algorithm 5 can be viewed as an efficient algorithm for simultaneous computation of S(n+1,2) and n!. Expressing Stirling numbers in terms of generalized harmonic numbers, this extends to an algorithm for S(n,k) for any (large) n and (small) k.
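
In fact, since algorithm 5 never reduces, its denominator is exactly n! and its numerator is exactly n!·Hn = |S(n+1,2)|, so a Stirling number routine is essentially a relabeling (a sketch, reusing _harmonic5 from above):

def stirling1_n2(n):
    # Returns (|S(n+1,2)|, n!) from the unreduced output of algorithm 5.
    p, q = _harmonic5(1, n + 1)
    return p, q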

A final remark: binary splitting is not the fastest algorithm for computing factorials. The fastest known method is to decompose n! into a product of prime powers and perform balanced multiplication-exponentiation. Can this idea be applied to harmonic numbers and/or Stirling numbers? It's not obvious to me how it would be done. Are there any other, possibly even faster algorithms for harmonic numbers? I've looked around, but been unable to find anything on the subject.
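
For reference, the prime-power decomposition itself (Legendre's formula) is easy to write down; the speed of the fast factorial algorithm comes from how the prime powers are then combined with balanced multiplications and exponentiations, which this naive sketch does not attempt:

def primes_up_to(n):
    # Simple sieve of Eratosthenes.
    sieve = [True] * (n + 1)
    sieve[:2] = [False, False]
    for i in xrange(2, int(n**0.5) + 1):
        if sieve[i]:
            sieve[i*i::i] = [False] * len(sieve[i*i::i])
    return [p for p in xrange(2, n + 1) if sieve[p]]

def fac_primepow(n):
    # n! as a product of prime powers; the exponent of each prime p is
    # floor(n/p) + floor(n/p^2) + ... (Legendre's formula).
    result = 1
    for p in primes_up_to(n):
        e, pk = 0, p
        while pk <= n:
            e += n // pk
            pk *= p
        result *= p**e
    return result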

Thursday, February 19, 2009

Python 3.1 twice as fast as 3.0

For large-integer arithmetic, that is. Mark Dickinson has been working on a patch that changes the internal representation of long integers from 15-bit to 30-bit digits, and there is an additional patch that adds a few algorithmic optimizations, written by Mark and Mario Pernici. Likely both patches will pass review and be included in Python 3.1. I asked Mark to run the mpmath unit tests to find out how big a difference the patches make, which he kindly did. Here are the results:

32-bit build:
unpatched (15-bit digits): 42.91 seconds
patched (30-bit digits): 40.57 seconds
patched+optimized: 30.15 seconds

64-bit build:
unpatched: 38.39 seconds
patched: 22.36 seconds
patched+optimized: 21.55 seconds
patched+optimized+bitlength: 20.20 seconds

The last result includes the use of the new int.bit_length() method (which I had a small part in implementing) instead of the pure-Python version in mpmath.
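
For comparison, a pure-Python bit_length fallback (roughly what the new C-level method replaces, though not necessarily mpmath's exact code) can be written like this:

def bit_length(x):
    # Number of bits needed to represent abs(x); 0 for x == 0.
    return len(bin(x if x >= 0 else -x)) - 2 if x else 0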

It's not surprising that the patches make high-precision arithmetic much faster, but most of the mpmath tests actually work at low precision and depend mostly on general speed of the Python interpreter. There the speedup is perhaps of order 0-30%. For the tests working at very high precision, the improvement is a factor 4 or 5 with the 64-bit build. Then there are a number of tests in between. With some imagination, all unit tests together provide a plausible estimate of actual performance for a wide range of applications.

An excerpt of before/after results for some particular tests, comparing 64-bit unpatched and 64-bit patched+optimized+bitlength:

# Only tests 53-bit precision
double_compatibility ok 1.3149240 s
double_compatibility ok 1.1979949 s

# Logarithms up to 10,000 digits
log_hp ok 4.0845711 s
log_hp ok 0.7967579 s

# Large Bernoulli numbers
bernoulli ok 6.8625491 s
bernoulli ok 1.4261260 s

# Gamma function, high precision
gamma_huge_2 ok 2.4907949 s
gamma_huge_2 ok 0.5781031 s

# Numerical quadrature, 15 to 100 digit precision
basic_integrals ok 3.0117619 s
basic_integrals ok 1.8687689 s


Mpmath does not yet run on Python 3.x; the benchmarking was done with a version of mpmath hacked slightly to make the tests pass. It would be nice to provide a 3.x compatible version of mpmath soon. The 2to3 tool fortunately handles almost all of the necessary patching; a handful of fairly simple additional changes need to be made. The most annoying problem is the lack of cmp (and particularly the cmp argument for sorted), which has no trivial workaround, but it still should not be too hard to fix. In any case, it seems likely that the 30-bit patch will also be backported to Python 2.7, so most users should be able to take advantage of it.
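
One workaround is to wrap the comparison function in a small key class, along these lines (a sketch; names are mine):

def cmp_to_key(cmpfunc):
    # Wrap an old-style cmp function so it can be passed as key= to sorted().
    class K(object):
        def __init__(self, obj):
            self.obj = obj
        def __lt__(self, other):
            return cmpfunc(self.obj, other.obj) < 0
        def __eq__(self, other):
            return cmpfunc(self.obj, other.obj) == 0
    return K

# usage: sorted(items, key=cmp_to_key(mycmp))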

It would be interesting to compare these benchmarks with the GMPY mode in mpmath. GMPY too provides a ~2x speedup for the unit tests. (This factor would be much larger if tests at even higher precision were included. At a million digits, GMPY is orders of magnitude faster than pure Python.) Originally GMPY only sped up the very high precision tests, and was otherwise slower than pure Python, probably due to Python's internal optimizations for int and long instances. This disadvantage was eliminated by implementing some helper functions for mpmath in C in GMPY. Mario Pernici has recently worked on further optimizations along the same lines, which should substantially improve low-precision performance.