Wednesday, June 8, 2011

Some FLINT 2.2 highlights

Version 2.2 of FLINT (Fast Library for Number Theory) was released last weekend. Some updated benchmarks are available.

In this blog post, I'm going to talk a bit about features I contributed in this version. With apologies to Sebastian Pancratz who wrote a whole lot of code as well -- in particular, a new module for computing with p-adic numbers, and a module for rational functions! Bill Hart also implemented a faster polynomial GCD, which is a quite important update since GCD is crucial for most polynomial business. Anyhow...

Polynomial interpolation

I've added various functions for polynomial interpolation to the fmpz_poly module. In general, these can be used to speed up various computations involving integer or rational polynomials by mapping a given problem to (Z/nZ)[x], Z or even Z/nZ, taking advantage of fast arithmetic in those rings, and then via interpolation recovering a result in Z[x] or Q[x].

Firstly, there are some new Chinese Remainder Theorem functions for integer polynomials, allowing you to reconstruct an integer polynomial from a bunch of polynomials with coefficients modulo different primes. Straightforward code (the actual work is done by functions in the fmpz module), but useful to have. The CRT functions are used by the new modular GCD code.

There are also functions for evaluating an integer polynomial on a set of points, and forming the interpolating polynomial (generally with rational coefficients) given a set of points and the values at those points.

Finally, user-friendly functions for evaluation and interpolation at a power of two (Kronecker segmentation) have been added. The code for this is actually a very old part of FLINT, and possibly some of the most complicated code in the library (packing bits efficiently is surprisingly hard). The new functions just wrap this functionality, but take care of memory management and various special cases, so you can now just safely do something like:

fmpz_t z;
long bits = fmpz_poly_max_bits(poly) + 1; /* +1 for signs */
fmpz_poly_bit_pack(z, poly, bits);
fmpz_poly_bit_unpack(poly, z, bits); /* recover poly */

Apart from Kronecker segmentation, these functions are not currently asymptotically fast. Fast multi-modulus CRT for coefficient reconstruction is probably not all that important in most circumstances, because it's more common to use evaluation-interpolation techniques for polynomials of large degree and small coefficients than the other way around. Nonetheless, polynomials with large coefficients do arise as well. For example, the vector Bernoulli number code in FLINT relies on fast CRT, and currently uses custom code for this.

Polynomial interpolation uses Lagrange interpolation with barycentric weights, with a few tricks to avoid fractions. This is all implemented using an O(n^2) algorithm, but the actual time complexity is higher due to the fact that the coefficients when working over integers usually will be large, around n! in magnitude.

Here are some timing examples, evaluating and recovering a length-n polynomial with +/-1 coefficients basically as follows:

x = _fmpz_vec_init(n);
y = _fmpz_vec_init(n);

for (i = 0; i < n; i++)
x[i] = -n/2 + i;

fmpz_poly_randtest(P, state, n, 1);

fmpz_poly_evaluate_fmpz_vec(y, P, x, n);
fmpz_poly_interpolate_fmpz_vec(Q, x, y, n);

The bits column below measures the largest value in y, which grows quite large despite the input polynomial having small coefficients:

n=8 eval=762 ns interp=13 us bits=8 ok=1
n=16 eval=3662 ns interp=61 us bits=42 ok=1
n=32 eval=29 us interp=673 us bits=-113 ok=1
n=64 eval=136 us interp=4951 us bits=-316 ok=1
n=128 eval=625 us interp=45 ms bits=-762 ok=1
n=256 eval=2500 us interp=792 ms bits=-1779 ok=1
n=512 eval=12 ms interp=10 s bits=-4089 ok=1

As you can see, the interpolation speed is not too bad for small n, but eventually grows out of control. How to do better?

Naive Lagrange interpolation is not optimal: it is possible to do n-point evaluation and interpolation in essentially O(n log2 n) operations. Such algorithms do not necessarily lead to an improvement over the integers (you still have to deal with coefficient explosion), but they should win over finite fields. So the right solution will perhaps be to add polynomial evaluation/interpolation functions based on modular arithmetic.

Rational numbers and matrices

A new module fmpq is provided for computing with arbitrary-precision rational numbers. For the user, the fmpq_t type essentially behaves identically to the MPIR mpq_t type. However, an fmpq_t only takes up two words of memory when the numerator and denominator are small (less than 262), whereas an mpq_t always requires six words plus additional heap-allocated space for the actual number data.

The fmpq functions are a bit faster than mpq functions in many cases when the numerator and/or denominator is small. But the main improvement should come for vectors, matrices or polynomials of rational numbers, due to the significantly reduced memory usage and memory management overhead (especially when many entries are zero or integer-valued).

Some higher-level functionality is also provided in the fmpq module, e.g. for rational reconstruction. The functions for computing special rational numbers (like Bernoulli numbers) have also been switched over to the fmpq type. Another supported feature is enumeration of the rationals (using the Calkin-Wilf sequence or by height). Generating the 100 million "first" positive rational numbers takes 9.6 seconds done in order of height, or 2.6 seconds in Calkin-Wilf order.

FLINT actually does not use fmpq's to represent polynomials over Q (fmpq_poly), and probably never will. The fmpq_poly module represents a polynomial over Q as an integer polynomial with a single common denominator, which is usually faster. The reason for adding the fmpq_t type is that it enabled developing the new fmpq_mat module, which implements dense matrices of rational numbers. For matrices, a common-denominator representation would be less convenient and in many cases completely impractical.

The new FLINT fmpq_mat module is very fast, or at least very non-slow. It is easy to find examples where it does a simple computation a thousand times faster than the rational matrices in Sage.

There's not actually much code in the fmpq_mat module itself; it does almost all "level 3" linear algebra (computations requiring matrix multiplication or Gaussian elimination) by clearing denominators and computing over the integers. This approach is in fact stolen shamelessly from Sage, but the functions in Sage are highly unoptimized in many cases. The code in Sage still wins for many sufficiently large problems as it has asymptotically fast algorithms for many things we do not (like computing null spaces). See the benchmarks page for more details.

I should not forget to mention that I've implemented Dixon's p-adic algorithm for solving Ax = b for nonsingular square A. (I wish I had a good link for Dixon's algorithm here, but sadly it doesn't appear to be described conveniently anywhere on the web. The original paper is "Exact solution of linear equations using P-adic expansions", if you have the means to get through the Springer paywall.)

This is now used both for solving over both Z and Q. The solver in FLINT is competitive with Sage (which uses IML+ATLAS) up to systems of dimension somewhere between perhaps 100 and 1000 (depending greatly on the size of the entries in the inputs and in the solution!). There's much to do here -- we should eventually have BLAS support in FLINT, which will speed up core matrix arithmetic, but there's room for a lot of algorithmic tuning as well.

There are some other minor new matrix features as well... they can be found in the changelog.

Polynomial matrices

A new module (fmpz_poly_mat) is provided for dense matrices over Z[x], i.e. matrices whose entries are polynomials with integer coefficients. The available functionality includes matrix multiplication, row reduction, and determinants. Matrix multiplication is particularly fast, as it uses the Kronecker segmentation interpolation/evaluation technique described above. (A similar algorithm is provided for determinants, but it's not really optimal as this point.)

The benchmarks page has detailed some detailed timings, so I won't repeat them here -- but generally speaking, the FLINT implementation is an order of magnitude faster than Sage or Magma for matrices of manageable size.

There's much more to be done for polynomial matrices. Row reduction is implemented quite efficiently, but it's too slow as an algorithm for many tasks such as computing null spaces of very large matrices. A future goal is to implement asymptotically fast algorithms (see the paper on x-adic lifting by Burçin Eröcal and Arne Storjohann for example).

Monday, March 14, 2011

100 mpmath one-liners for pi

Since it's pi day today, I thought I'd share a list of mpmath one-liners for computing the value of pi to high precision using various representations in terms of special functions, infinite series, integrals, etc. Most of them can already be found as doctest examples in some form in the mpmath documentation.

A few of the formulas explicitly involve pi. Using those to calculate pi is rather circular (!), though a few of them could still be used for computing pi using numerical root-finding. In any case, most of the formulas are circular even when pi doesn't appear explicitly since mpmath is likely using its value internally. In any further case, the majority of the formulas are not efficient for computing pi to very high precision (at least as written). Still, ~50 digits is no problem. Enjoy!

from mpmath import *
mp.dps = 50; mp.pretty = True

(2/diff(erf, 0))**2
findroot(sin, 3)
findroot(cos, 1)*2
4*(hyp1f2(1,1.5,1,1) / struvel(-0.5, 2))**2
1/meijerg([[],[]], [[0],[0.5]], 0)**2
(meijerg([[],[2]], [[1,1.5],[]], 1, 0.5) / erfc(1))**2
(1-e) / meijerg([[1],[0.5]], [[1],[0.5,0]], 1)
sqrt(7*zeta(3)/(4*diff(lerchphi, (-1,-2,1), (0,1,0))))
nsum(lambda k: 4*(-1)**(k+1)/(2*k-1), [1,inf])
nsum(lambda k: (3**k-1)/4**k*zeta(k+1), [1,inf])
nsum(lambda k: 8/(2*k-1)**2, [1,inf])**0.5
nsum(lambda k: 2*fac(k)/fac2(2*k+1), [0,inf])
nsum(lambda k: fac(k)**2/fac(2*k+1), [0,inf])*3*sqrt(3)/2
nsum(lambda k: fac(k)**2/(phi**(2*k+1)*fac(2*k+1)), [0,inf])*(5*sqrt(phi+2))/2
nsum(lambda k: (4/(8*k+1)-2/(8*k+4)-1/(8*k+5)-1/(8*k+6))/16**k, [0,inf])
2/nsum(lambda k: (-1)**k*(4*k+1)*(fac2(2*k-1)/fac2(2*k))**3, [0,inf])
nsum(lambda k: 72/(k*expm1(k*pi))-96/(k*expm1(2*pi*k))+24/(k*expm1(4*pi*k)), [1,inf])
1/nsum(lambda k: binomial(2*k,k)**3*(42*k+5)/2**(12*k+4), [0,inf])
4/nsum(lambda k: (-1)**k*(1123+21460*k)*fac2(2*k-1)*fac2(4*k-1)/(882**(2*k+1)*32**k*fac(k)**3), [0,inf])
9801/sqrt(8)/nsum(lambda k: fac(4*k)*(1103+26390*k)/(fac(k)**4*396**(4*k)), [0,inf])
426880*sqrt(10005)/nsum(lambda k: (-1)**k*fac(6*k)*(13591409+545140134*k)/(fac(k)**3*fac(3*k)*(640320**3)**k), [0,inf])
4/nsum(lambda k: (6*k+1)*rf(0.5,k)**3/(4**k*fac(k)**3), [0,inf])
(ln(8)+sqrt(48*nsum(lambda m,n: (-1)**(m+n)/(m**2+n**2), [1,inf],[1,inf]) + 9*log(2)**2))/2
-nsum(lambda x,y: (-1)**(x+y)/(x**2+y**2), [-inf,inf], [-inf,inf], ignore=True)/ln2
2*nsum(lambda k: sin(k)/k, [1,inf])+1
quad(lambda x: 2/(x**2+1), [0,inf])
quad(lambda x: exp(-x**2), [-inf,inf])**2
2*quad(lambda x: sqrt(1-x**2), [-1,1])
chop(quad(lambda z: 1/(2j*z), [1,j,-1,-j,1]))
3*(4*log(2+sqrt(3))-quad(lambda x,y: 1/sqrt(1+x**2+y**2), [-1,1],[-1,1]))/2
sqrt(8*quad(lambda x,y: 1/(1-(x*y)**2), [0,1],[0,1]))
sqrt(6*quad(lambda x,y: 1/(1-x*y), [0,1],[0,1]))
sqrt(6*quad(lambda x: x/expm1(x), [0,inf]))
quad(lambda x: (16*x-16)/(x**4-2*x**3+4*x-4), [0,1])
quad(lambda x: sqrt(x-x**2), [0,0.25])*24+3*sqrt(3)/4
mpf(22)/7 - quad(lambda x: x**4*(1-x)**4/(1+x**2), [0,1])
mpf(355)/113 - quad(lambda x: x**8*(1-x)**8*(25+816*x**2)/(1+x**2), [0,1])/3164
2*quadosc(lambda x: sin(x)/x, [0,inf], omega=1)
40*quadosc(lambda x: sin(x)**6/x**6, [0,inf], omega=1)/11
e*quadosc(lambda x: cos(x)/(1+x**2), [-inf,inf], omega=1)
8*quadosc(lambda x: cos(x**2), [0,inf], zeros=lambda n: sqrt(n))**2
2*quadosc(lambda x: sin(exp(x)), [1,inf], zeros=ln)+2*si(e)
exp(2*quad(loggamma, [0,1]))/2
2*nprod(lambda k: sec(pi/2**k), [2,inf])
s=lambda k: sqrt(0.5+s(k-1)/2) if k else 0; 2/nprod(s, [1,inf])
s=lambda k: sqrt(2+s(k-1)) if k else 0; limit(lambda k: sqrt(2-s(k))*2**(k+1), inf)
2*nprod(lambda k: (2*k)**2/((2*k-1)*(2*k+1)), [1,inf])
2*nprod(lambda k: (4*k**2)/(4*k**2-1), [1, inf])
sqrt(6*ln(nprod(lambda k: exp(1/k**2), [1,inf])))
nprod(lambda k: (k**2-1)/(k**2+1), [2,inf])/csch(pi)
nprod(lambda k: (k**2-1)/(k**2+1), [2,inf])*sinh(pi)
nprod(lambda k: (k**4-1)/(k**4+1), [2, inf])*(cosh(sqrt(2)*pi)-cos(sqrt(2)*pi))/sinh(pi)
sinh(pi)/nprod(lambda k: (1-1/k**4), [2, inf])/4
sinh(pi)/nprod(lambda k: (1+1/k**2), [2, inf])/2
(exp(1+euler/2)/nprod(lambda n: (1+1/n)**n * exp(1/(2*n)-1), [1, inf]))**2/2
3*sqrt(2)*cosh(pi*sqrt(3)/2)**2*csch(pi*sqrt(2))/nprod(lambda k: (1+1/k+1/k**2)**2/(1+2/k+3/k**2), [1, inf])
2/e*nprod(lambda k: (1+2/k)**((-1)**(k+1)*k), [1,inf])
limit(lambda k: 16**k/(k*binomial(2*k,k)**2), inf)
limit(lambda x: 4*x*hyp1f2(0.5,1.5,1.5,-x**2), inf)
1/log(limit(lambda n: nprod(lambda k: pi/(2*atan(k)), [n,2*n]), inf),4)
limit(lambda k: 2**(4*k+1)*fac(k)**4/(2*k+1)/fac(2*k)**2, inf)
limit(lambda k: fac(k) / (sqrt(k)*(k/e)**k), inf)**2/2
limit(lambda k: (-(-1)**k*bernoulli(2*k)*2**(2*k-1)/fac(2*k))**(-1/(2*k)), inf)
limit(lambda k: besseljzero(1,k)/k, inf)
1/limit(lambda x: airyai(x)*2*x**0.25*exp(2*x**1.5/3), inf, exp=True)**2
1/limit(lambda x: airybi(x)*x**0.25*exp(-2*x**1.5/3), inf, exp=True)**2

Friday, March 11, 2011

A FLINT example: Lambert W function power series

Two days ago, a new version of the Fast Library for Number Theory (FLINT) was released. I contributed a lot of new code to this release, including linear algebra speed improvements and new functionality for fast power series arithmetic and computation of special numbers and polynomials (see the release announcement and some of my benchmarking results).

In this blog post I'll demonstrate how to do power series arithmetic with FLINT, using its fmpq_poly module which implements polynomials over the rational numbers Q. Standard operations (addition, multiplication and division) were available before; the functions I've added include square root, log, exp, sin, tan, atan, etc. (all the usual elementary functions). The same functions are also available for power series over a finite field Z/nZ (with word-size n). Everything is asymptotically fast (the running time is linear in the size of the output, up to logarithmic factors).

Of course, transcendental functions are a bit restricted when considered over Q or Z/nZ, since it's only possible to obtain power series expansions at specific rational points (in most cases just x = 0). So at present, some very interesting numerical applications of fast power series arithmetic are not supported. But some time in the future, we'll probably add support for numerical power series over the reals and complexes as well.

As today's example, let us implement the Lambert W function for the power series ring Q[[x]]. The Lambert W function is defined implicitly by the equation x = W(x) exp(W(z)), which can be solved using Newton iteration with the update step w = w - (w exp(w) - x) / ((w+1) exp(w)).

Power series Newton iteration is just like numerical Newton iteration, except that the convergence behavior is much simpler: starting with a correct first-order expansion, each iteration at least doubles the number of correct coefficients.

A simple recursive implementation with asymptotically optimal performance (up to constant factors) looks as follows:

#include <stdio.h>
#include "flint.h"
#include "fmpq_poly.h"

void lambertw(fmpq_poly_t w, fmpq_poly_t x, long n)
if (n == 1)
fmpq_poly_t t, u, v;

lambertw(w, x, (n + 1) / 2);


fmpq_poly_exp_series(t, w, n);
fmpq_poly_mullow(u, t, w, n);
fmpq_poly_sub(v, u, x);
fmpq_poly_add(t, u, t);
fmpq_poly_div_series(u, v, t, n);
fmpq_poly_sub(w, w, u);


Beyond the base case W(x) = 0 + O(x), the function just computes w to accuracy ceil(n/2), and then extends it to accuracy n using a single Newton step. As we can see, C code directly using the FLINT library interface gets a bit verbose, but this style has the advantage of giving precise control over temporary memory allocation, polynomial lengths, etc. (it is very similar to the interface of GMP/MPIR).

We add a simple test main routine:

int main()
fmpq_poly_t x;
fmpq_poly_t w;


fmpq_poly_set_coeff_ui(x, 1, 1);
lambertw(w, x, 10);

fmpq_poly_print_pretty(w, "x");


The output of the program is:

531441/4480*x^9 - 16384/315*x^8 + 16807/720*x^7 - 54/5*x^6 + 125/24*x^5 - 8/3*x^4 + 3/2*x^3 - 1*x^2 + 1*x

It is well known that the coefficients in this series are given in closed form by (-k)k-1 / k!, so we can check that the output is correct.

Computing 1000 terms takes just a few seconds. If this sounds like much, remember that the coefficients grow rapidly: together, the computed numerators and denominators have over 2 million digits!

So far this is perhaps not so interesting, as we could compute the coefficients faster using a direct formula. But the nice thing is that arbitrary compositions are allowed, i.e we can compute W(f(x)) for any given power series f, and this will still be just as fast.

Let's consider a nontrivial example: the infinite "power tower" T(z) = zzzz... A moment's reflection shows that this is an analytic function with a rational power series expansion around z = 1. In fact, we have explicitly T(z) = W(-log(z))/(-log(z)). We can compute this series expansion (in the shifted variable x = z - 1) as follows:

int main()
fmpq_poly_t x;
fmpq_poly_t w;

long n = 10;


fmpq_poly_set_coeff_ui(x, 0, 1);
fmpq_poly_set_coeff_ui(x, 1, 1);
fmpq_poly_log_series(x, x, n + 1);
fmpq_poly_neg(x, x);
lambertw(w, x, n + 1);
fmpq_poly_shift_right(w, w, 1);
fmpq_poly_shift_right(x, x, 1);
fmpq_poly_div_series(w, w, x, n);

fmpq_poly_print_pretty(w, "x");


The only complication is that fmpq_poly_div_series requires a nonzero leading coefficient in the denominator, so we must shift both series down one power.

The program outputs:

118001/2520*x^9 + 123101/5040*x^8 + 4681/360*x^7 + 283/40*x^6 + 4*x^5 + 7/3*x^4 + 3/2*x^3 + 1*x^2 + 1*x + 1

To make things nicer, we assume that the coefficients have the form ak / k! (i.e. that T(z) is the exponential generating function for ak) and change the output code to something like the following:

long k;
mpq_t t;
mpz_t u;

for (k = 0; k < n; k++)
fmpq_poly_get_coeff_mpq(t, w, k);
mpz_fac_ui(u, k);
mpz_mul(mpq_numref(t), mpq_numref(t), u);
gmp_printf("%Qd ", t);


This indeed gives us an integer sequence:

1 1 2 9 56 480 5094 65534 984808 16992144

Now what is the value of the 1000th coefficient (to be precise, a1000, the initial one being the 0th!) in this sequence? After a simple modification of the program, 2.9 seconds of computation gives:


In fact, if we look up the first 10 coefficients in the On-Line Encyclopedia of Integer Sequences, we find This OEIS entry lists the representation

a(n) = Sum_{k=0..n} Stirling1(n, k)*(k+1)^(k-1)

Since FLINT supports fast vector computation of Stirling numbers, this formula can be implemented efficiently:

#include "fmpz.h"
#include "fmpz_vec.h"
#include "arith.h"

void coefficient(fmpz_t a, long n)
long k;
fmpz * s;
fmpz_t t;

s = _fmpz_vec_init(n + 1);
fmpz_stirling1_vec(s, n, n + 1);

for (k = 1; k <= n; k++)
fmpz_set_ui(t, k + 1);
fmpz_pow_ui(t, t, k - 1);
fmpz_addmul(a, s + k, t);

_fmpz_vec_clear(s, n + 1);

int main()
fmpz_t a;
coefficient(a, 1000);

And indeed, the output turns out to be the same!

This program is faster, taking only 0.1 seconds to run. But of course, it only gives us a single coefficient, and would be slower for computing a range of values by making repeated calls.

Similar ideas to those presented here (basically, reducing a problem to fast polynomial multiplication using generating functions, Newton iteration, etc.) are used internally by FLINT for computation of the standard elementary functions themselves as well as various special numbers and polynomials (Bernoulli numbers and polynomials, partitions, Stirling numbers, Bell numbers, etc). The internal code uses a lot of tricks to reduce overhead and handle special cases faster, however. (See the previous blog post Fast combinatorial and number-theoretic functions with FLINT 2, and for more recent information the release announcement and benchmarks page linked at the top of this post.)

In other news, I haven't written a lot of code for mpmath or Sage recently. Of course, my hope is that FLINT (2) will make it into Sage in the not too distant future. The fast polynomial and power series arithmetic support in FLINT will also be very useful for future special functions applications (in mpmath and elsewhere).