## Thursday, January 26, 2012

### Move along

I'm moving my blog to http://fredrikj.net/blog. If you're living in the future and reading this, you should go there instead!

## Wednesday, June 8, 2011

### Some FLINT 2.2 highlights

Version 2.2 of FLINT (Fast Library for Number Theory) was released last weekend. Some updated benchmarks are available.

In this blog post, I'm going to talk a bit about features I contributed in this version. With apologies to Sebastian Pancratz who wrote a whole lot of code as well -- in particular, a new module for computing with p-adic numbers, and a module for rational functions! Bill Hart also implemented a faster polynomial GCD, which is a quite important update since GCD is crucial for most polynomial business. Anyhow...

## Polynomial interpolation

I've added various functions for polynomial interpolation to the fmpz_poly module. In general, these can be used to speed up various computations involving integer or rational polynomials by mapping a given problem to (Z/nZ)[x], Z or even Z/nZ, taking advantage of fast arithmetic in those rings, and then via interpolation recovering a result in Z[x] or Q[x].

Firstly, there are some new Chinese Remainder Theorem functions for integer polynomials, allowing you to reconstruct an integer polynomial from a bunch of polynomials with coefficients modulo different primes. Straightforward code (the actual work is done by functions in the fmpz module), but useful to have. The CRT functions are used by the new modular GCD code.

There are also functions for evaluating an integer polynomial on a set of points, and forming the interpolating polynomial (generally with rational coefficients) given a set of points and the values at those points.

Finally, user-friendly functions for evaluation and interpolation at a power of two (Kronecker segmentation) have been added. The code for this is actually a very old part of FLINT, and possibly some of the most complicated code in the library (packing bits efficiently is surprisingly hard). The new functions just wrap this functionality, but take care of memory management and various special cases, so you can now just safely do something like:

`    fmpz_t z;    fmpz_init(z);    long bits = fmpz_poly_max_bits(poly) + 1; /* +1 for signs */    fmpz_poly_bit_pack(z, poly, bits);    fmpz_poly_bit_unpack(poly, z, bits);  /* recover poly */    fmpz_clear(z);`

Apart from Kronecker segmentation, these functions are not currently asymptotically fast. Fast multi-modulus CRT for coefficient reconstruction is probably not all that important in most circumstances, because it's more common to use evaluation-interpolation techniques for polynomials of large degree and small coefficients than the other way around. Nonetheless, polynomials with large coefficients do arise as well. For example, the vector Bernoulli number code in FLINT relies on fast CRT, and currently uses custom code for this.

Polynomial interpolation uses Lagrange interpolation with barycentric weights, with a few tricks to avoid fractions. This is all implemented using an O(n^2) algorithm, but the actual time complexity is higher due to the fact that the coefficients when working over integers usually will be large, around n! in magnitude.

Here are some timing examples, evaluating and recovering a length-n polynomial with +/-1 coefficients basically as follows:
`    x = _fmpz_vec_init(n);    y = _fmpz_vec_init(n);    fmpz_poly_init(P);    fmpz_poly_init(Q);    for (i = 0; i < n; i++)        x[i] = -n/2 + i;    fmpz_poly_randtest(P, state, n, 1);    fmpz_poly_evaluate_fmpz_vec(y, P, x, n);    fmpz_poly_interpolate_fmpz_vec(Q, x, y, n);`

The bits column below measures the largest value in y, which grows quite large despite the input polynomial having small coefficients:

`    n=8  eval=762 ns  interp=13 us  bits=8  ok=1    n=16  eval=3662 ns  interp=61 us  bits=42  ok=1    n=32  eval=29 us  interp=673 us  bits=-113  ok=1    n=64  eval=136 us  interp=4951 us  bits=-316  ok=1    n=128  eval=625 us  interp=45 ms  bits=-762  ok=1    n=256  eval=2500 us  interp=792 ms  bits=-1779  ok=1    n=512  eval=12 ms  interp=10 s  bits=-4089  ok=1`

As you can see, the interpolation speed is not too bad for small n, but eventually grows out of control. How to do better?

Naive Lagrange interpolation is not optimal: it is possible to do n-point evaluation and interpolation in essentially O(n log2 n) operations. Such algorithms do not necessarily lead to an improvement over the integers (you still have to deal with coefficient explosion), but they should win over finite fields. So the right solution will perhaps be to add polynomial evaluation/interpolation functions based on modular arithmetic.

## Rational numbers and matrices

A new module fmpq is provided for computing with arbitrary-precision rational numbers. For the user, the fmpq_t type essentially behaves identically to the MPIR mpq_t type. However, an fmpq_t only takes up two words of memory when the numerator and denominator are small (less than 262), whereas an mpq_t always requires six words plus additional heap-allocated space for the actual number data.

The fmpq functions are a bit faster than mpq functions in many cases when the numerator and/or denominator is small. But the main improvement should come for vectors, matrices or polynomials of rational numbers, due to the significantly reduced memory usage and memory management overhead (especially when many entries are zero or integer-valued).

Some higher-level functionality is also provided in the fmpq module, e.g. for rational reconstruction. The functions for computing special rational numbers (like Bernoulli numbers) have also been switched over to the fmpq type. Another supported feature is enumeration of the rationals (using the Calkin-Wilf sequence or by height). Generating the 100 million "first" positive rational numbers takes 9.6 seconds done in order of height, or 2.6 seconds in Calkin-Wilf order.

FLINT actually does not use fmpq's to represent polynomials over Q (fmpq_poly), and probably never will. The fmpq_poly module represents a polynomial over Q as an integer polynomial with a single common denominator, which is usually faster. The reason for adding the fmpq_t type is that it enabled developing the new fmpq_mat module, which implements dense matrices of rational numbers. For matrices, a common-denominator representation would be less convenient and in many cases completely impractical.

The new FLINT fmpq_mat module is very fast, or at least very non-slow. It is easy to find examples where it does a simple computation a thousand times faster than the rational matrices in Sage.

There's not actually much code in the fmpq_mat module itself; it does almost all "level 3" linear algebra (computations requiring matrix multiplication or Gaussian elimination) by clearing denominators and computing over the integers. This approach is in fact stolen shamelessly from Sage, but the functions in Sage are highly unoptimized in many cases. The code in Sage still wins for many sufficiently large problems as it has asymptotically fast algorithms for many things we do not (like computing null spaces). See the benchmarks page for more details.

I should not forget to mention that I've implemented Dixon's p-adic algorithm for solving Ax = b for nonsingular square A. (I wish I had a good link for Dixon's algorithm here, but sadly it doesn't appear to be described conveniently anywhere on the web. The original paper is "Exact solution of linear equations using P-adic expansions", if you have the means to get through the Springer paywall.)

This is now used both for solving over both Z and Q. The solver in FLINT is competitive with Sage (which uses IML+ATLAS) up to systems of dimension somewhere between perhaps 100 and 1000 (depending greatly on the size of the entries in the inputs and in the solution!). There's much to do here -- we should eventually have BLAS support in FLINT, which will speed up core matrix arithmetic, but there's room for a lot of algorithmic tuning as well.

There are some other minor new matrix features as well... they can be found in the changelog.

## Polynomial matrices

A new module (fmpz_poly_mat) is provided for dense matrices over Z[x], i.e. matrices whose entries are polynomials with integer coefficients. The available functionality includes matrix multiplication, row reduction, and determinants. Matrix multiplication is particularly fast, as it uses the Kronecker segmentation interpolation/evaluation technique described above. (A similar algorithm is provided for determinants, but it's not really optimal as this point.)

The benchmarks page has detailed some detailed timings, so I won't repeat them here -- but generally speaking, the FLINT implementation is an order of magnitude faster than Sage or Magma for matrices of manageable size.

There's much more to be done for polynomial matrices. Row reduction is implemented quite efficiently, but it's too slow as an algorithm for many tasks such as computing null spaces of very large matrices. A future goal is to implement asymptotically fast algorithms (see the paper on x-adic lifting by Burçin Eröcal and Arne Storjohann for example).

## Monday, March 14, 2011

### 100 mpmath one-liners for pi

Since it's pi day today, I thought I'd share a list of mpmath one-liners for computing the value of pi to high precision using various representations in terms of special functions, infinite series, integrals, etc. Most of them can already be found as doctest examples in some form in the mpmath documentation.

A few of the formulas explicitly involve pi. Using those to calculate pi is rather circular (!), though a few of them could still be used for computing pi using numerical root-finding. In any case, most of the formulas are circular even when pi doesn't appear explicitly since mpmath is likely using its value internally. In any further case, the majority of the formulas are not efficient for computing pi to very high precision (at least as written). Still, ~50 digits is no problem. Enjoy!

`from mpmath import *mp.dps = 50; mp.pretty = True+pi180*degree4*atan(1)16*acot(5)-4*acot(239)48*acot(49)+128*acot(57)-20*acot(239)+48*acot(110443)chop(2*j*log((1-j)/(1+j)))chop(-2j*asinh(1j))chop(ci(-inf)/1j)gamma(0.5)**2beta(0.5,0.5)(2/diff(erf, 0))**2findroot(sin, 3)findroot(cos, 1)*2chop(-2j*lambertw(-pi/2))besseljzero(0.5,1)3*sqrt(3)/2/hyp2f1((-1,3),(1,3),1,1)8/(hyp2f1(0.5,0.5,1,0.5)*gamma(0.75)/gamma(1.25))**24*(hyp1f2(1,1.5,1,1) / struvel(-0.5, 2))**21/meijerg([[],[]], [[0],[0.5]], 0)**2(meijerg([[],[2]], [[1,1.5],[]], 1, 0.5) / erfc(1))**2(1-e) / meijerg([[1],[0.5]], [[1],[0.5,0]], 1)sqrt(psi(1,0.25)-8*catalan)elliprc(1,2)*4elliprg(0,1,1)*42*agm(1,0.5)*ellipk(0.75)(gamma(0.75)*jtheta(3,0,exp(-pi)))**4cbrt(gamma(0.25)**4*agm(1,sqrt(2))**2/8)sqrt(6*zeta(2))sqrt(6*(zeta(2,3)+5./4))sqrt(zeta(2,(3,4))+8*catalan)exp(-2*zeta(0,1,1))/2sqrt(12*altzeta(2))4*dirichlet(1,[0,1,0,-1])2*catalan/dirichlet(-1,[0,1,0,-1],1)exp(-dirichlet(0,[0,1,0,-1],1))*gamma(0.25)**2/(2*sqrt(2))sqrt(7*zeta(3)/(4*diff(lerchphi, (-1,-2,1), (0,1,0))))sqrt(-12*polylog(2,-1))sqrt(6*log(2)**2+12*polylog(2,0.5))chop(root(-81j*(polylog(3,root(1,3,1))+4*zeta(3)/9)/2,3))2*clsin(1,1)+1(3+sqrt(3)*sqrt(1+8*clcos(2,1)))/2root(2,6)*sqrt(e)/(glaisher**6*barnesg(0.5)**4)nsum(lambda k: 4*(-1)**(k+1)/(2*k-1), [1,inf])nsum(lambda k: (3**k-1)/4**k*zeta(k+1), [1,inf])nsum(lambda k: 8/(2*k-1)**2, [1,inf])**0.5nsum(lambda k: 2*fac(k)/fac2(2*k+1), [0,inf])nsum(lambda k: fac(k)**2/fac(2*k+1), [0,inf])*3*sqrt(3)/2nsum(lambda k: fac(k)**2/(phi**(2*k+1)*fac(2*k+1)), [0,inf])*(5*sqrt(phi+2))/2nsum(lambda k: (4/(8*k+1)-2/(8*k+4)-1/(8*k+5)-1/(8*k+6))/16**k, [0,inf])2/nsum(lambda k: (-1)**k*(4*k+1)*(fac2(2*k-1)/fac2(2*k))**3, [0,inf])nsum(lambda k: 72/(k*expm1(k*pi))-96/(k*expm1(2*pi*k))+24/(k*expm1(4*pi*k)), [1,inf])1/nsum(lambda k: binomial(2*k,k)**3*(42*k+5)/2**(12*k+4), [0,inf])4/nsum(lambda k: (-1)**k*(1123+21460*k)*fac2(2*k-1)*fac2(4*k-1)/(882**(2*k+1)*32**k*fac(k)**3), [0,inf])9801/sqrt(8)/nsum(lambda k: fac(4*k)*(1103+26390*k)/(fac(k)**4*396**(4*k)), [0,inf])426880*sqrt(10005)/nsum(lambda k: (-1)**k*fac(6*k)*(13591409+545140134*k)/(fac(k)**3*fac(3*k)*(640320**3)**k), [0,inf])4/nsum(lambda k: (6*k+1)*rf(0.5,k)**3/(4**k*fac(k)**3), [0,inf])(ln(8)+sqrt(48*nsum(lambda m,n: (-1)**(m+n)/(m**2+n**2), [1,inf],[1,inf]) + 9*log(2)**2))/2-nsum(lambda x,y: (-1)**(x+y)/(x**2+y**2), [-inf,inf], [-inf,inf], ignore=True)/ln22*nsum(lambda k: sin(k)/k, [1,inf])+1quad(lambda x: 2/(x**2+1), [0,inf])quad(lambda x: exp(-x**2), [-inf,inf])**22*quad(lambda x: sqrt(1-x**2), [-1,1])chop(quad(lambda z: 1/(2j*z), [1,j,-1,-j,1]))3*(4*log(2+sqrt(3))-quad(lambda x,y: 1/sqrt(1+x**2+y**2), [-1,1],[-1,1]))/2sqrt(8*quad(lambda x,y: 1/(1-(x*y)**2), [0,1],[0,1]))sqrt(6*quad(lambda x,y: 1/(1-x*y), [0,1],[0,1]))sqrt(6*quad(lambda x: x/expm1(x), [0,inf]))quad(lambda x: (16*x-16)/(x**4-2*x**3+4*x-4), [0,1])quad(lambda x: sqrt(x-x**2), [0,0.25])*24+3*sqrt(3)/4mpf(22)/7 - quad(lambda x: x**4*(1-x)**4/(1+x**2), [0,1])mpf(355)/113 - quad(lambda x: x**8*(1-x)**8*(25+816*x**2)/(1+x**2), [0,1])/31642*quadosc(lambda x: sin(x)/x, [0,inf], omega=1)40*quadosc(lambda x: sin(x)**6/x**6, [0,inf], omega=1)/11e*quadosc(lambda x: cos(x)/(1+x**2), [-inf,inf], omega=1)8*quadosc(lambda x: cos(x**2), [0,inf], zeros=lambda n: sqrt(n))**22*quadosc(lambda x: sin(exp(x)), [1,inf], zeros=ln)+2*si(e)exp(2*quad(loggamma, [0,1]))/22*nprod(lambda k: sec(pi/2**k), [2,inf])s=lambda k: sqrt(0.5+s(k-1)/2) if k else 0; 2/nprod(s, [1,inf])s=lambda k: sqrt(2+s(k-1)) if k else 0; limit(lambda k: sqrt(2-s(k))*2**(k+1), inf)2*nprod(lambda k: (2*k)**2/((2*k-1)*(2*k+1)), [1,inf])2*nprod(lambda k: (4*k**2)/(4*k**2-1), [1, inf])sqrt(6*ln(nprod(lambda k: exp(1/k**2), [1,inf])))nprod(lambda k: (k**2-1)/(k**2+1), [2,inf])/csch(pi)nprod(lambda k: (k**2-1)/(k**2+1), [2,inf])*sinh(pi)nprod(lambda k: (k**4-1)/(k**4+1), [2, inf])*(cosh(sqrt(2)*pi)-cos(sqrt(2)*pi))/sinh(pi)sinh(pi)/nprod(lambda k: (1-1/k**4), [2, inf])/4sinh(pi)/nprod(lambda k: (1+1/k**2), [2, inf])/2(exp(1+euler/2)/nprod(lambda n: (1+1/n)**n * exp(1/(2*n)-1), [1, inf]))**2/23*sqrt(2)*cosh(pi*sqrt(3)/2)**2*csch(pi*sqrt(2))/nprod(lambda k: (1+1/k+1/k**2)**2/(1+2/k+3/k**2), [1, inf])2/e*nprod(lambda k: (1+2/k)**((-1)**(k+1)*k), [1,inf])limit(lambda k: 16**k/(k*binomial(2*k,k)**2), inf)limit(lambda x: 4*x*hyp1f2(0.5,1.5,1.5,-x**2), inf)1/log(limit(lambda n: nprod(lambda k: pi/(2*atan(k)), [n,2*n]), inf),4)limit(lambda k: 2**(4*k+1)*fac(k)**4/(2*k+1)/fac(2*k)**2, inf)limit(lambda k: fac(k) / (sqrt(k)*(k/e)**k), inf)**2/2limit(lambda k: (-(-1)**k*bernoulli(2*k)*2**(2*k-1)/fac(2*k))**(-1/(2*k)), inf)limit(lambda k: besseljzero(1,k)/k, inf)1/limit(lambda x: airyai(x)*2*x**0.25*exp(2*x**1.5/3), inf, exp=True)**21/limit(lambda x: airybi(x)*x**0.25*exp(-2*x**1.5/3), inf, exp=True)**2`

## Friday, March 11, 2011

### A FLINT example: Lambert W function power series

Two days ago, a new version of the Fast Library for Number Theory (FLINT) was released. I contributed a lot of new code to this release, including linear algebra speed improvements and new functionality for fast power series arithmetic and computation of special numbers and polynomials (see the release announcement and some of my benchmarking results).

In this blog post I'll demonstrate how to do power series arithmetic with FLINT, using its fmpq_poly module which implements polynomials over the rational numbers Q. Standard operations (addition, multiplication and division) were available before; the functions I've added include square root, log, exp, sin, tan, atan, etc. (all the usual elementary functions). The same functions are also available for power series over a finite field Z/nZ (with word-size n). Everything is asymptotically fast (the running time is linear in the size of the output, up to logarithmic factors).

Of course, transcendental functions are a bit restricted when considered over Q or Z/nZ, since it's only possible to obtain power series expansions at specific rational points (in most cases just x = 0). So at present, some very interesting numerical applications of fast power series arithmetic are not supported. But some time in the future, we'll probably add support for numerical power series over the reals and complexes as well.

As today's example, let us implement the Lambert W function for the power series ring Q[[x]]. The Lambert W function is defined implicitly by the equation x = W(x) exp(W(z)), which can be solved using Newton iteration with the update step w = w - (w exp(w) - x) / ((w+1) exp(w)).

Power series Newton iteration is just like numerical Newton iteration, except that the convergence behavior is much simpler: starting with a correct first-order expansion, each iteration at least doubles the number of correct coefficients.

A simple recursive implementation with asymptotically optimal performance (up to constant factors) looks as follows:

`#include <stdio.h>#include "flint.h"#include "fmpq_poly.h"void lambertw(fmpq_poly_t w, fmpq_poly_t x, long n){    if (n == 1)    {        fmpq_poly_zero(w);    }    else    {        fmpq_poly_t t, u, v;        lambertw(w, x, (n + 1) / 2);        fmpq_poly_init(t);        fmpq_poly_init(u);        fmpq_poly_init(v);        fmpq_poly_exp_series(t, w, n);        fmpq_poly_mullow(u, t, w, n);        fmpq_poly_sub(v, u, x);        fmpq_poly_add(t, u, t);        fmpq_poly_div_series(u, v, t, n);        fmpq_poly_sub(w, w, u);        fmpq_poly_clear(t);        fmpq_poly_clear(u);        fmpq_poly_clear(v);    }}`

Beyond the base case W(x) = 0 + O(x), the function just computes w to accuracy ceil(n/2), and then extends it to accuracy n using a single Newton step. As we can see, C code directly using the FLINT library interface gets a bit verbose, but this style has the advantage of giving precise control over temporary memory allocation, polynomial lengths, etc. (it is very similar to the interface of GMP/MPIR).

We add a simple test main routine:
`int main(){   fmpq_poly_t x;   fmpq_poly_t w;   fmpq_poly_init(x);   fmpq_poly_init(w);   fmpq_poly_set_coeff_ui(x, 1, 1);   lambertw(w, x, 10);   fmpq_poly_print_pretty(w, "x");   printf("\n");   fmpq_poly_clear(x);   fmpq_poly_clear(w);}`

The output of the program is:
`531441/4480*x^9 - 16384/315*x^8 + 16807/720*x^7 - 54/5*x^6 + 125/24*x^5 - 8/3*x^4 + 3/2*x^3 - 1*x^2 + 1*x`

It is well known that the coefficients in this series are given in closed form by (-k)k-1 / k!, so we can check that the output is correct.

Computing 1000 terms takes just a few seconds. If this sounds like much, remember that the coefficients grow rapidly: together, the computed numerators and denominators have over 2 million digits!

So far this is perhaps not so interesting, as we could compute the coefficients faster using a direct formula. But the nice thing is that arbitrary compositions are allowed, i.e we can compute W(f(x)) for any given power series f, and this will still be just as fast.

Let's consider a nontrivial example: the infinite "power tower" T(z) = zzzz... A moment's reflection shows that this is an analytic function with a rational power series expansion around z = 1. In fact, we have explicitly T(z) = W(-log(z))/(-log(z)). We can compute this series expansion (in the shifted variable x = z - 1) as follows:

`int main(){    fmpq_poly_t x;    fmpq_poly_t w;    long n = 10;    fmpq_poly_init(x);    fmpq_poly_init(w);    fmpq_poly_set_coeff_ui(x, 0, 1);    fmpq_poly_set_coeff_ui(x, 1, 1);    fmpq_poly_log_series(x, x, n + 1);    fmpq_poly_neg(x, x);    lambertw(w, x, n + 1);    fmpq_poly_shift_right(w, w, 1);    fmpq_poly_shift_right(x, x, 1);    fmpq_poly_div_series(w, w, x, n);    fmpq_poly_print_pretty(w, "x");    printf("\n");    fmpq_poly_clear(x);    fmpq_poly_clear(w);}`

The only complication is that fmpq_poly_div_series requires a nonzero leading coefficient in the denominator, so we must shift both series down one power.

The program outputs:
`118001/2520*x^9 + 123101/5040*x^8 + 4681/360*x^7 + 283/40*x^6 + 4*x^5 + 7/3*x^4 + 3/2*x^3 + 1*x^2 + 1*x + 1`

To make things nicer, we assume that the coefficients have the form ak / k! (i.e. that T(z) is the exponential generating function for ak) and change the output code to something like the following:
`    long k;    mpq_t t;    mpz_t u;    mpq_init(t);    mpz_init(u);    for (k = 0; k < n; k++)    {        fmpq_poly_get_coeff_mpq(t, w, k);        mpz_fac_ui(u, k);        mpz_mul(mpq_numref(t), mpq_numref(t), u);        mpq_canonicalize(t);        gmp_printf("%Qd ", t);    }    mpq_clear(t);    mpz_clear(u);`

This indeed gives us an integer sequence:
`1 1 2 9 56 480 5094 65534 984808 16992144`

Now what is the value of the 1000th coefficient (to be precise, a1000, the initial one being the 0th!) in this sequence? After a simple modification of the program, 2.9 seconds of computation gives:

`116088723416360877058165139472971678305682655888757200617040183288023530426756681791214166146936295338906200405380900565797054717998071778437757582562676432270594729770831984037179011167877182932317695683927346108840781529292782919617415801022889763531998203556748720236870472740313747820376836354056589570878404139562784693762331122998711070595645913436447537334994232839721368275902686875807251095288080395306471091025409811078916244347343343375806012255865925818202775569656436509351036076228649393400187670469063215003559774586495010151736330831006687588008043886163633208133324925968354018598718396321446522507297042269011590554350050765064097808856685726892919091844572545581642428942983342505179168857619230316014346424101371730872734534492192176599495608409492914591040791939356414531202971705769303257234151456918871942207889248610196901459400077483577940763454422516589494589386979762908326280910675714898537511196619258057757601829560715165754755469941168861084140499195252056413724265130518619966880917401902668151574186675809680229260294868082194497633384642944873208313626575767679265889756445878063163639282166245308180447623432893312520697087313187138285220141409331942812710129867491990841736391939490562342870154316209797955556381777937576606896211985949120247041122030144008552040487919104081821646288468944794572548379308285499126418611400713712447555062853630274495412279277142852027491666742488186890767945371565766096452794814548702964428648297660149787638501522977387119357596043059939423242161640102515280896797542967829629757402705726445239053261557399630212654678115919485631223995547355297477425151029625304838666187951874709256802926224889173882107084716891403043088761748938211657131479578425767585519331805968937010542495567221591600504522701519356853332139872512204043830445131201157613311750725449188186072484468315734307808390196624736783193070534665116557731933519958498663270193078704185994119446629783305199163258244436211827836670241745954935539341498910525641015621246608253851978785829794919003347187955531964814287965653050322140399695072998272983889906823049155302053273484019653833081580196857296769881600411144855641888964455021209598897362668473406912526816735047448372816163718832244604054261282083620649731423678182582137133666912162187578149277916758677659326221406922607543435597637586885441804409524773454375858826053548656981688502940651435148227696208156279868460423027051552771077659399889469617306015354335528530235916712574337562579736559278351853549825129834280128952701817672970606139463650468155476330275845066948765336085851188608302309056603401440047692698200295529572915618836122163118770906896634410940116898688481585685180958996837198544863615413808321802623327256966120967255251353141629521865937921459938657771439492527626159018195922050167504883881038997644963556212956342228712695352450134112412161126957056000000000000000000000000000000000000000000 `

In fact, if we look up the first 10 coefficients in the On-Line Encyclopedia of Integer Sequences, we find http://oeis.org/A033917. This OEIS entry lists the representation

a(n) = Sum_{k=0..n} Stirling1(n, k)*(k+1)^(k-1)

Since FLINT supports fast vector computation of Stirling numbers, this formula can be implemented efficiently:

`#include "fmpz.h"#include "fmpz_vec.h"#include "arith.h"void coefficient(fmpz_t a, long n){    long k;    fmpz * s;    fmpz_t t;    s = _fmpz_vec_init(n + 1);    fmpz_stirling1_vec(s, n, n + 1);    fmpz_init(t);    fmpz_zero(a);    for (k = 1; k <= n; k++)    {        fmpz_set_ui(t, k + 1);        fmpz_pow_ui(t, t, k - 1);        fmpz_addmul(a, s + k, t);    }    _fmpz_vec_clear(s, n + 1);    fmpz_clear(t);}int main(){    fmpz_t a;    fmpz_init(a);    coefficient(a, 1000);    fmpz_print(a);    printf("\n");    fmpz_clear(a);}`

And indeed, the output turns out to be the same!

This program is faster, taking only 0.1 seconds to run. But of course, it only gives us a single coefficient, and would be slower for computing a range of values by making repeated calls.

Similar ideas to those presented here (basically, reducing a problem to fast polynomial multiplication using generating functions, Newton iteration, etc.) are used internally by FLINT for computation of the standard elementary functions themselves as well as various special numbers and polynomials (Bernoulli numbers and polynomials, partitions, Stirling numbers, Bell numbers, etc). The internal code uses a lot of tricks to reduce overhead and handle special cases faster, however. (See the previous blog post Fast combinatorial and number-theoretic functions with FLINT 2, and for more recent information the release announcement and benchmarks page linked at the top of this post.)

In other news, I haven't written a lot of code for mpmath or Sage recently. Of course, my hope is that FLINT (2) will make it into Sage in the not too distant future. The fast polynomial and power series arithmetic support in FLINT will also be very useful for future special functions applications (in mpmath and elsewhere).

## Friday, September 24, 2010

### Announcing mpmath 0.16

I'm happy to announce the release of mpmath 0.16, which contains the usual bugfixes as well as a slew of new features!

The main focus has been to improve coverage of special functions. Additions include inhomogeneous Bessel functions, Bessel function zeros, incomplete elliptic integrals, and parabolic cylinder functions. As of 0.16, mpmath implements essentially everything listed in the NIST Digital Library of Mathematical Functions chapters 1-20, as well as 21,24,27 and 33. (For 25 and 26 -- combinatorial and number-theoretic functions, see also my post about FLINT 2.)

Another major change is that mpmath 0.16 running in Sage will be much faster thanks to new extension code (currently awaiting review for inclusion in Sage). I've clocked speedups between 1.3x and 2x for various nontrivial pieces of code (such as the mpmath test suite and the torture test programs).

Thanks to William Stein, my work on mpmath during the summer was funded using resources from NSF grant DMS-0757627. This support is gratefully acknowledged.

Most of the new features are described in previous posts on this blog. For convenience, here is a short summary:

Assorted special functions update

• The documentation now includes plots to illustrate several of the special functions.

• Airy functions have been rewritten for improved speed and accuracy and to support evaluation of derivatives.

• Functions airyaizero(), airybizero() for computation of Airy function zeros have been implemented.

• Inhomogeneous Airy (Scorer) functions scorergi() and scorerhi() have been implemented.

• Four inhomogeneous Bessel functions have been added (lommels1(), lommels2(), angerj(), webere()).

• The Lambert W function has been rewritten to fix various bugs and numerical issues

Incomplete elliptic integrals complete

• The Legendre and Carlson incomplete elliptic integrals for real and complex arguments have been implemented (ellipf(), ellipe(), ellippi(), elliprf(), elliprc(), elliprj(), elliprd(), elliprg()).

Sage Days 23, and Bessel function zeros

• Functions besseljzero() and besselyzero() have been implemented for computing the m-th zero of Jν(z), J'ν(z) Yν(z), or Y'ν(z) for any positive integer index m and real order ν ≥ 0.

Post Sage Days 24 report

• The Parabolic cylinder functions pcfd(), pcfu(), pcfv(), pcfw() have been implemented.

Euler-Maclaurin summation of hypergeometric series

• Hypergeometric functions pFp-1(...; ...; z) now support accurate evaluation close to the singularity at z = 1.

• A function sumap() has been added for summation of infinite series using the Abel-Plana formula.

• Functions diffs_prod() and diffs_prod() have been added for generating high-order derivatives of products or exponentials of functions with known derivatives.

Again, mpmath in Sage is about to get faster

• New Cython extension code has been written for Sage to speed up various operations in mpmath, including elementary functions and hypergeometric series.

There are various other changes as well, such as support for matrix slice indexing (contributed by Ioannis Tziakos -- thanks!). As usual, details are available in the changelog and the Changes page on the Google Code project site.

## Wednesday, September 22, 2010

### Again, mpmath in Sage is about to get faster

My summer project on special functions in mpmath and Sage, generously supported by William Stein with funds from NSF grant DMS-0757627, is nearing completion. I will soon release mpmath-0.16, which contains lots of new special functions and bugfixes. Sage users will also benefit from ~1500 lines of new Cython code (preliminary patch here) that speeds up various basic operations. Executing mpmath.runtests() in Sage on my laptop now takes 10.47 seconds (8.60 from a warm cache), compared to 14.21 (11.84) seconds with the new extensions disabled -- a global speedup of 30%.

For comparison, pure-Python mpmath with gmpy as the backend takes 21.46 (18.72) seconds to execute the unit tests and pure-Python mpmath with the pure-Python backend takes 52.33 (45.92) seconds.

Specifically, the new extension code implements exp for real and complex arguments, cos, sin and ln for real arguments, complex exponentiation in some cases, and summation of hypergeometric series, entirely in Cython.

Timings before (new extensions disabled):
`sage: import mpmathsage: x = mpmath.mpf(0.37)sage: y = mpmath.mpf(0.49)sage: %timeit mpmath.exp(x)625 loops, best of 3: 14.5 µs per loopsage: %timeit mpmath.ln(x)625 loops, best of 3: 23.2 µs per loopsage: %timeit mpmath.cos(x)625 loops, best of 3: 17.2 µs per loopsage: %timeit x ^ y625 loops, best of 3: 39.9 µs per loopsage: %timeit mpmath.hyp1f1(2r,3r,4r)625 loops, best of 3: 90.3 µs per loopsage: %timeit mpmath.hyp1f1(x,y,x)625 loops, best of 3: 83.6 µs per loopsage: %timeit mpmath.hyp1f1(x,y,mpmath.mpc(x,y))625 loops, best of 3: 136 µs per loop`

Timings after (new extensions enabled):
`sage: import mpmathsage: x = mpmath.mpf(0.37)sage: y = mpmath.mpf(0.49)sage: %timeit mpmath.exp(x)625 loops, best of 3: 2.72 µs per loopsage: %timeit mpmath.ln(x)625 loops, best of 3: 7.25 µs per loopsage: %timeit mpmath.cos(x)625 loops, best of 3: 4.13 µs per loopsage: %timeit x ^ y625 loops, best of 3: 10.5 µs per loopsage: %timeit mpmath.hyp1f1(2r,3r,4r)625 loops, best of 3: 47.1 µs per loopsage: %timeit mpmath.hyp1f1(x,y,x)625 loops, best of 3: 59.4 µs per loopsage: %timeit mpmath.hyp1f1(x,y,mpmath.mpc(x,y))625 loops, best of 3: 83.1 µs per loop`

The new elementary functions use a combination of custom algorithms and straightforward MPFR wrappers. Why not just wrap MPFR for everything? There are two primary reasons:

Firstly, because MPFR numbers have a limited range, custom code still needs to be used in the overflowing cases, and this is almost as much work as an implementation-from-scratch. (There are also some more minor incompatibilities, like lack of round-away-from-zero in MPFR, that result in a lot of extra work.)

Secondly, MPFR is not always fast (or as fast as it could be), so it pays off to write custom code. In fact, some of the ordinary Python implementations of functions in mpmath are faster than their MPFR counterparts in various cases, although that is rather exceptional (atan is an example). But generally, at low-mid precisions, it is possible to be perhaps 2-4x faster than MPFR with carefully optimized C code (see fastfunlib). This is a longer-term goal.

Already now, with the new extension code, the mpmath exponential function becomes faster than the Sage RealNumber version (based on MPFR) at low precision:
`sage: %timeit mpmath.exp(x)625 loops, best of 3: 2.75 µs per loopsage: w = RealField(53)(x)sage: %timeit w.exp()625 loops, best of 3: 5.57 µs per loop`

As the timings above indicate, hypergeometric series have gotten up to 2x faster. The speedup of the actual summation is much larger, but much of that gain is lost in various Python overheads (more work can be done on this). There should be a noticeable speedup for some hypergeometric function computations, while others will not benefit as much, for the moment.

Another benchmark is the extratest_zeta.py script in mpmath, which exercises the mpmath implementation of the Riemann-Siegel formula for evaluation of ζ(s) for complex s with large imaginary part. Such computations largely depend on elementary function performance (cos, sin, exp, log).

Here are the new timings for mpmath in Sage:
`fredrik@scv:~/sage\$ ./sage /home/fredrik/mp/mpmath/tests/extratest_zeta.py399999999 156762524.675 ok = True (time = 1.144)241389216 97490234.2277 ok = True (time = 9.271)526196239 202950727.691 ok = True (time = 1.671)542964976 209039046.579 ok = True (time = 1.189)1048449112 388858885.231 ok = True (time = 1.774)1048449113 388858885.384 ok = True (time = 1.604)1048449114 388858886.002 ok = True (time = 2.096)1048449115 388858886.002 ok = True (time = 2.587)1048449116 388858886.691 ok = True (time = 1.546)`

This is mpmath in Sage with the new extension code disabled:
`fredrik@scv:~/sage\$ ./sage /home/fredrik/mp/mpmath/tests/extratest_zeta.py399999999 156762524.675 ok = True (time = 2.352)241389216 97490234.2277 ok = True (time = 14.088)526196239 202950727.691 ok = True (time = 3.036)542964976 209039046.579 ok = True (time = 2.104)1048449112 388858885.231 ok = True (time = 3.707)1048449113 388858885.384 ok = True (time = 3.283)1048449114 388858886.002 ok = True (time = 4.444)1048449115 388858886.002 ok = True (time = 5.592)1048449116 388858886.691 ok = True (time = 3.101)`

This is mpmath in ordinary Python mode, using gmpy:
`fredrik@scv:~/sage\$ python /home/fredrik/mp/mpmath/tests/extratest_zeta.py399999999 156762524.675 ok = True (time = 2.741)241389216 97490234.2277 ok = True (time = 13.842)526196239 202950727.691 ok = True (time = 3.124)542964976 209039046.579 ok = True (time = 2.143)1048449112 388858885.231 ok = True (time = 3.257)1048449113 388858885.384 ok = True (time = 2.912)1048449114 388858886.002 ok = True (time = 3.953)1048449115 388858886.002 ok = True (time = 4.964)1048449116 388858886.691 ok = True (time = 2.762)`

With the new extension code, it appears that zeta computations are up to about twice as fast. This speedup could be made much larger as there still is a significant amount of Python overhead left to remove -- also a project for the future.

## Sunday, September 5, 2010

### Fast combinatorial and number-theoretic functions with FLINT 2

Time for a development update! Recently, I've done only a limited amount of work on mpmath (I have a some almost-finished Cython code for sage.libs.mpmath and new code for numerical integration in mpmath, both to be committed fairly soon -- within a couple of weeks, hopefully).

The last few weeks, I've mostly been contributing to FLINT 2. For those unfamiliar with it, FLINT is a fast C library for computational number theory developed by Bill Hart and others (the other active developers right now are Sebastian Pancratz and Andy Novocin). In particular, FLINT implements ridiculously fast multiprecision integer vectors and polynomials. It also provides very fast primality testing and factorization for word-size integers (32 or 64 bits), among other things. FLINT 2 is an in-progress rewrite of FLINT 1.x, a current standard component in Sage.

What does this have to do with numerical evaluation of special functions (the usual theme of this blog)? In short, my goal is to add code to FLINT 2 for exact special function computations -- combinatorial and number-theoretic functions, special polynomials and the like. Such functions benefit tremendously from the fast integer and polynomial arithmetic available in FLINT 2.

All my code can be found in my public GitHub repository (the most recent commits as of this writing are in the 'factor' branch).

Functions I've implemented so far include:

• Möbius μ and Euler φ (totient) functions for word-size and arbitrary-size integers

• Divisor sum function σk for arbitrary-size integers

• Ramanujan τ function (Δ-function q-expansion)

• Harmonic numbers 1 + 1/2 + 1/3 + ... + 1/n

• Primorials 2 · 3 · 5 · ... · pn

• Stirling numbers (1st and 2nd kind)

The versions in FLINT 2 of these functions should now be faster than all other implementations I've tried (GAP, Pari, Mathematica, the Sage library) for all ranges of arguments, except for those requiring factorization of large integers.

Some of these functions depend fundamentally on the ability to factorize integers efficiently. So far I've only implemented trial division for large integers in FLINT 2, with some clever code to extract large powers of small factors quickly. Sufficiently small cofactors are handled by calling Bill Hart's single-word factoring routines. The resulting code is very fast for "artificial" numbers like factorials, and will eventually be complemented with prime and perfect power detection code, plus fast implementations of Brent's algorithm and other methods. Later on the quadratic sieve from FLINT 1 will probably be ported to FLINT 2, so that FLINT 2 will be able to factor any reasonable number reasonably quickly.

Below, I've posted some benchmark results. A word of caution: all Mathematica timings were done on a different system, which is faster than my own laptop (typically by 30% or so). So in reality, Mathematica performs slightly worse relatively than indicated below. Everything else is timed on my laptop. I have not included test code for the FLINT2 functions (but it's just straightforward C code -- a function call or two between timeit_start and timeit_stop using FLINT 2's profiler module).

Möbius function (the following is basically a raw exercise of the small-integer factoring code):
`Pari:sage: %time pari('sum(n=1,10^6,moebius(n))');CPU times: user 1.04 s, sys: 0.00 s, total: 1.04 sWall time: 1.04 sMathematica:In[1]:= Timing[Sum[MoebiusMu[n], {n,1,10^6}];]Out[1]= {0.71, Null}flint2:650 ms`

Divisor sum:
`Sage (uses Cython code):sage: %time sigma(factorial(1000),1000);CPU times: user 0.47 s, sys: 0.00 s, total: 0.47 sWall time: 0.46 sMathematica:In[1]:= Timing[DivisorSigma[1000,1000!];]Out[1]= {3.01, Null}flint2:350 ms`

Ramanujan τ function:
`Sage (uses FLINT 1):sage: %time delta_qexp(100000);CPU times: user 0.42 s, sys: 0.01 s, total: 0.43 sWall time: 0.42 ssage: %time delta_qexp(1000000);CPU times: user 6.02 s, sys: 0.37 s, total: 6.39 sWall time: 6.40 sflint2:100000: 230 ms1000000: 4500 ms`

An isolated value (Mathematica seems to be the only other software that knows how to compute this):
`Mathematica:In[1]:= Timing[RamanujanTau[10000!];]Out[1]= {8.74, Null}flint2:280 ms`

Harmonic numbers (again, only Mathematica seems to implement these). See also my old blog post How (not) to compute harmonic numbers. I've included the fastest version from there, harmonic5:

`Mathematica:In[1]:= Timing[HarmonicNumber[100000];]Out[1]= {0.22, Null}In[2]:= Timing[HarmonicNumber[1000000];]Out[2]= {6.25, Null}In[3]:= Timing[HarmonicNumber[10000000];]Out[3]= {129.13, Null}harmonic5: (100000):100000: 0.471 s1000000: 8.259 s10000000: 143.639 sflint2:100000: 100 ms1000000: 2560 ms10000000: 49400 ms`

The FLINT 2 function benefits from an improved algorithm that eliminates terms and reduces the size of the temporary numerators and denominators, as well as low-level optimization (the basecase summation directly uses the MPIR mpn interface).

Isolated Stirling numbers of the first kind:

`Mathematica:In[1]:= Timing[StirlingS1[1000,500];]Out[1]= {0.24, Null}In[2]:= Timing[StirlingS1[2000,1000];]Out[2]= {1.79, Null}In[3]:= Timing[StirlingS1[3000,1500];]Out[3]= {5.13, Null}flint 2:100,500: 100 ms2000,1000: 740 ms3000,1500: 1520 ms`

Isolated Stirling numbers of the second kind:
`Mathematica:In[1]:= Timing[StirlingS2[1000,500];]Out11]= {0.21, Null}In[2]:= Timing[StirlingS2[2000,1000];]Out[2]= {1.54, Null}In[3]:= Timing[StirlingS2[3000,1500];]Out[3]= {4.55, Null}In[4]:= Timing[StirlingS2[5000,2500];]Out[4]= {29.25, Null}flint2:1000,500: 2 ms2000,1000: 17 ms3000,1500: 50 ms5000,2500: 240 ms`

In addition, fast functions are provided for computing a whole row or matrix of Stirling numbers. For example, computing the triangular matrix of ~1.1 million Stirling numbers of the first kind up to S(1500,1500) takes only 1.3 seconds. In Mathematica (again, on the faster system):
`In[1]:= Timing[Table[StirlingS1[n,k], {n,0,1500}, {k,0,n}];]Out[1]= {2.13, Null}`

The benchmarks above mostly demonstrate performance for large inputs. Another nice aspect of the FLINT 2 functions is that there typically is very little overhead for small inputs. The high performance is due to a combination of algorithms, low-level optimization, and (most importantly) the fast underlying arithmetic in FLINT 2. I will perhaps write some more about the algorithms (for e.g. Stirling numbers) in a later post.