Thursday, February 11, 2010

A new gamma function implementation

The gamma function is probably the most important nonelementary special function. The gamma function or ratios of gamma functions appear in everything from the functional equation of the Riemann zeta function to the normalization factors of hypergeometric functions. Even expressions that are usually rational functions (such as binomial coefficients) are often most conveniently expressed (or in software, implemented) in terms of the gamma function since this automatically provides correct asymptotics and extension to noninteger parameters. Needless to say, a well-implemented gamma function is fundamental for a special functions library.

The gamma function implementation used in mpmath until now is one of the oldest parts of the code (I wrote it around three years ago), and although it has done its job, it has some flaws. It mainly uses Spouge's approximation, which is fairly efficient but nonoptimal for asymptotically large arguments. The implementation also has some subtle precision issues.

The new gamma function added in this commit is the result of over a year of intermittent work. I did most of the work during last summer but didn't have time to finish it until recently. The improvements are essentially the following:

  • Removal of various overheads

  • Special-casing half-integers

  • Using Taylor series for small real arguments

  • Using Stirling's series for complex and/or large arguments

  • Optimizing for computing loggamma

  • Handling values near poles and near the real axis extremely carefully


The difference in performance is quite dramatic. The smallest speedup occurs for small complex arguments, but it turns out that even there, the Stirling series is faster than Spouge's approximation (at least with the removal of overheads in the new implementation) despite the need to perform argument transformations. For real arguments, and large complex arguments, the new implementation is consistently faster by a large factor, especially for loggamma.

I'll show some benchmarks. The times are in milliseconds; the first value in each cell is the time for the old implementation and the second is the time for the new one, followed by the speedup. The results are on my laptop, with gmpy (I'll try with Sage later):

Firstly, results for gamma(z):

zdps=15dps=50dps=250dps=1000
3.00.0086
0.0077
(1.13x)
0.0089
0.0076
(1.16x)
0.0087
0.0075
(1.16x)
0.0085
0.0077
(1.12x)
5.270.1200
0.0342
(3.51x)
0.2513
0.0545
(4.61x)
2.5964
0.4220
(6.15x)
80.7089
11.3011
(7.14x)
2.50.1255
0.0158
(7.94x)
0.2398
0.0138
(17.41x)
2.5519
0.0144
(177.72x)
79.0651
0.0149
(5319.87x)
-3.70.2518
0.0457
(5.51x)
0.4006
0.0615
(6.52x)
3.1073
0.4283
(7.26x)
87.4200
11.3981
(7.67x)
(-3.0+4.0j)0.6205
0.3942
(1.57x)
1.0095
0.6093
(1.66x)
9.0896
5.9099
(1.54x)
253.9219
191.3358
(1.33x)
(3.25-4.125j)0.3548
0.2864
(1.24x)
0.7456
0.4791
(1.56x)
8.6651
5.6367
(1.54x)
252.7598
189.2187
(1.34x)
(15.0+20.0j)0.3980
0.1973
(2.02x)
0.7526
0.4496
(1.67x)
8.6583
5.4604
(1.59x)
250.8646
188.5849
(1.33x)
52.10.1661
0.0586
(2.83x)
0.3050
0.0894
(3.41x)
2.9862
0.5685
(5.25x)
84.8334
12.4169
(6.83x)
123.250.1643
0.0638
(2.57x)
0.3071
0.0963
(3.19x)
2.8908
0.9194
(3.14x)
83.8168
14.6588
(5.72x)
-200.70.3011
0.1485
(2.03x)
0.4897
0.1914
(2.56x)
3.5145
1.1359
(3.09x)
89.8617
17.5500
(5.12x)
(100.25+100.0j)0.4238
0.1857
(2.28x)
0.7641
0.2757
(2.77x)
8.6807
3.1403
(2.76x)
249.6731
171.4369
(1.46x)
12345678.90.3032
0.0701
(4.32x)
0.4382
0.0857
(5.11x)
3.4078
0.3257
(10.46x)
88.4082
5.2828
(16.74x)
(0.0+12345678.9j)0.7814
0.1409
(5.54x)
1.1671
0.1802
(6.48x)
10.3985
0.6169
(16.86x)
265.6192
8.3569
(31.78x)
1.0e+200.8833
0.0798
(11.06x)
1.3019
0.0899
(14.48x)
6.4010
0.2939
(21.78x)
107.0924
3.5141
(30.47x)
(0.0+1.0e+20j)3.5025
0.1515
(23.11x)
3.6163
0.1834
(19.72x)
19.5676
0.5308
(36.86x)
322.9033
6.2148
(51.96x)


Secondly, results for loggamma(z):

zdps=15dps=50dps=250dps=1000
3.00.0700
0.0115
(6.07x)
0.0677
0.0116
(5.84x)
0.0683
0.0119
(5.73x)
0.0683
0.0121
(5.65x)
5.270.2207
0.0564
(3.91x)
0.3548
0.0796
(4.45x)
2.8925
0.5221
(5.54x)
88.3556
12.6452
(6.99x)
2.50.2210
0.0319
(6.93x)
0.3565
0.0374
(9.53x)
2.8595
0.1011
(28.28x)
81.8333
1.3546
(60.41x)
-3.70.4392
0.0782
(5.62x)
0.6418
0.0987
(6.50x)
3.5767
0.5429
(6.59x)
88.8484
12.7363
(6.98x)
(-3.0+4.0j)1.1235
0.4942
(2.27x)
1.5685
0.7224
(2.17x)
10.0879
6.0459
(1.67x)
267.2893
198.7613
(1.34x)
(3.25-4.125j)0.8789
0.2704
(3.25x)
1.2326
0.4752
(2.59x)
9.7501
5.4572
(1.79x)
264.7649
190.4007
(1.39x)
(15.0+20.0j)0.8983
0.1248
(7.20x)
1.2923
0.4221
(3.06x)
9.7917
5.2766
(1.86x)
265.2711
187.9775
(1.41x)
52.10.2604
0.0812
(3.21x)
0.4186
0.1140
(3.67x)
3.2493
0.6890
(4.72x)
87.0111
13.9280
(6.25x)
123.250.2629
0.0478
(5.50x)
0.4227
0.0692
(6.11x)
3.1110
1.0216
(3.05x)
85.9028
15.9917
(5.37x)
-200.70.5439
0.1631
(3.34x)
0.7465
0.2221
(3.36x)
3.9305
1.2537
(3.14x)
92.5380
18.8309
(4.91x)
(100.25+100.0j)0.9103
0.1179
(7.72x)
1.3332
0.1849
(7.21x)
10.1655
3.1083
(3.27x)
266.3565
170.3099
(1.56x)
12345678.90.4026
0.0491
(8.21x)
0.5439
0.0525
(10.36x)
3.7183
0.2051
(18.13x)
89.9402
4.0044
(22.46x)
(0.0+12345678.9j)1.2628
0.0765
(16.52x)
1.7085
0.0858
(19.91x)
11.5023
0.2749
(41.84x)
275.6174
4.4548
(61.87x)
1.0e+200.9820
0.0495
(19.84x)
1.3882
0.0540
(25.72x)
6.6983
0.1481
(45.21x)
110.1757
2.1954
(50.19x)
(0.0+1.0e+20j)4.0385
0.0788
(51.24x)
4.3758
0.0833
(52.54x)
20.9478
0.1936
(108.20x)
335.9407
2.3911
(140.50x)


All times are from a warm cache. The new implementation has slightly longer precomputation time (for the Taylor series coefficients) at high precision, and the Taylor series therefore isn't used above a certain precision (around 1000 digits) despite being much faster on subsequent calls. I will probably add some user-visible function for controlling this.

Finally, several optimizations are possible still. The most important would be to do the implementation in Cython. (Improving the underlying elementary functions will also speed up the gamma function.) Potential algorithmic improvements include faster handling of near-integers (using Taylor series with a reduced number of terms), avoiding the reflection formula for complex arguments in the left half-plane but not too close to the negative half-axis, and performing the argument transformations with a reduced number of multiplications. But I doubt I will have time (or interest -- the present code was demanding enough to finish) to work of any of these things, at least any time soon.

The speedups naturally also affect performance of many other functions, e.g. hypergeometric functions; I might post some benchmarks of that later.

No comments: