.. todo:: jython, ironpython, pyrex, pypy, stackless, profling read the sphinx doc to comment the c code. http://sphinx.pocoo.org/latest/domains.html .. c:function:: my_function Vewy interesting comment But it is not possible to use literalinclude and its pyobject function with c... I can use the line numbers, though Connecting Python and C, using multiples cores ============================================== The second part of this article :doc:`functional`, shows a few techniques available to the Python programmer to accelerate a code. The previous implementation is adapted to use efficiently the many processors and cores available on a host. Then the critical path that crunches numbers are rewritten in C, because this language is very good at this. Finally, a better mathematical method is used which is faster by several orders of magnitude. Processes for multicore ----------------------- It is interesting to note that all the previous scripts run on only one core even if the host features many processors and cores. Python makes it easy with the :mod:`multiprocessing` module to run functions into their own separate system process which are dispatched by the kernel on the available processors and cores. In the following version of the script, four processes will be run, each handling a fourth of the requested iterations. Each process is created by providing a function to be run and a list of arguments. The :mod:`multiprocessing` make the :class:`Queue` available which is reachable by each processes and safe for concurrent read and write access. An anonymous function which computes pi and writes the approximation to such a Queue instance is given as the function to be run by each processes. In this implementation, the π approximation is the mean of the approximations found in the queue. .. literalinclude:: functional/procedural_with_processes.py :lines: 7- Let's time this version: .. sourcecode:: sh ~$ test_it ./procedural_with_processes.py For 200000 points, with 4 processes, Pi = 3.13968 duration: 0.27 seconds For 1000000 points, with 4 processes, Pi = 3.141212 duration: 1.08 seconds For 5000000 points, with 4 processes, Pi = 3.1419328 duration: 5.12 seconds The durations are exactly halved when compared to *procedural.py*, the two cores of the Intel Core 2 Duo, on which this article is edited, were effectively used. Number crunching in C --------------------- Python code is transformed, by the Python interpreter, on execution into byte compiled code, which is composed of commands interpreted by the Python virtual machine. The Python virtual machine is a compiled software written in C which directly talks to the processor. For some demanding computing uses, such as this approximation of π, this two step process: interpretation first, and execution second, is suboptimal this is why it is desirable to write extensions directly in C. Retrieving of the command line argument, printing the result and splitting the work into processes are only done once in the lifetime of the script so their impact on performance are negligible, it is very practical to write it in Python. In our example, the hard work in this script is the *pi* function, which could not be simpler in terms of signature: it requires an int, returns a float, raises no errors, and makes no side effects (apart from CPU usage). We can write the pi function in C so that, when compiled, it is directly understood by the processor, sidestepping the Python virtual machine. The :mod:`distutils` module has one powerful way to create extensions in C for Python and is nicely integrated into the standard Python distribution. For the record, :mod:`ctypes` is a way to reuse installed shared librairies (also in the standard library). Swig_ is another way to mix C with many other languages. Another strategy is to build a compiler directly in the interpreter: this is called a `just in time compiler`_ and Pypy_ offers such an approach, among other features. .. _swig: http://www.swig.org/tutorial.html .. _`just in time compiler`: http://en.wikipedia.org/wiki/Just-in-time_compilation .. _Pypy: http://pypy.org/ Using distutils, two more files are needed a C file (let's call it ``pimodule.c``) and a configuration file. Here are the steps involved: #. Write standard a C function called pi: .. literalinclude:: functional/pimodule.c :language: c :lines: 6-21 The *rdtsc* function initializes the random generator with a random seed. .. literalinclude:: functional/pimodule.c :lines: 4-5 :language: c #. Write the conventional boilerplate 2 functions and one struct in the C file that distutils expect. More details can be found here_. .. _here: http://docs.python.org/dev/extending/extending.html * A wrapper function for the pi function, which matches Python interfaces, is defined. This wrapper receives the arguments in the form of Python objects that it transforms to input argument for the C function in the correct format: here, a simple int. The wrapper also builds an Python float object from the C float approximation of Pi returned by the *pi* C function: .. literalinclude:: functional/pimodule.c :lines: 24-33 :language: c The non standard type such as *PyObject* are available through the inclusion of the ```` headers. * The methods exported to Python are declared in an array of *PyMethodDef*: .. literalinclude:: functional/pimodule.c :lines: 36-39 :language: c For each method described in the array, the first element is the method name in Python, the second element is the pointer to the C function, the third argument specifies if the Python method should accept variable arguments and keyword arguments. Here no keywords arguments are possible, only variable arguments. The last element is the docstring for the method. * An initialization function is written for the package. This function will be executed when the module is imported in the Python interpreter: .. literalinclude:: functional/pimodule.c :lines: 41-45 :language: c That's it for *pimodule.c*. It is ready for compilation. #. Most Python code aimed to be distributed include a ``setup.py`` file which specifies the packaging options. This file uses the :func:`~distutils.core.setup` function and would minimally be:: from distutils.core import setup setup( name = 'pi', version = '0.1', description = 'This is simple method for approximating Pi') The :class:`~distutils.core.Extension` class from the *distutils* package has some magic taking care of the build configuration of the module. It is one more line in the ``setup.py``: .. literalinclude:: functional/setup.py #. Building and installing the module is now straightforward: .. sourcecode:: sh ~$ python setup.py build # Compilation ~$ sudo python setup.py install # Installation in the Python path ~$ python >>> import pi # The module is available >>> help(pi) # to the interpreter [ ... ] pi(...) Simple Pi Approximation >>> pi.pi(1000) 3.2040000 The script from the previous multiprocess section only needs to be slightly modified: instead of defining the *pi* function, it now needs to be be imported from the *pi* module. This multiprocess C version shows the following performance: .. sourcecode:: sh ~$ test_it ./procedural_in_c.py An approximation of Pi with 4 processes: 3.14168 duration: 0.09 seconds An approximation of Pi with 4 processes: 3.143028 duration: 0.07 seconds An approximation of Pi with 4 processes: 3.1415872 duration: 0.18 seconds The computation is accelerated by a 30 fold factor over the previous version is pure Python. This is not fantastic but it is indeed much more efficient. When compared to Python, C does shine in the processing of numbers. Running the same original procedural pure Python script with the Pypy interpreter instead of the standard Python interpreter shows a five fold speed up, for free. A fast algorithm for Pi ----------------------- If the problem really was computing π, its `Wikipedia page`_ has several better methods. Here is a seriously fast approximation known as the Bailey-Borwein-Plouffe formula : .. _`Wikipedia page`: http://en.wikipedia.org/wiki/Pi#Numerical_approximations .. math:: \pi \approx \sum_{k=0}^{\infty} \frac{1}{16^k} \left(\frac{4}{8k+1}-\frac{2}{8k+4}-\frac{1}{8k+5}-\frac{1}{8k+6}\right) This translates into the *bbp* function, below, in Python, which is correct for the first 13 digits in only ten iterations while previous methods needed millions of iterations for the same results. .. doctest:: >>> bbp = lambda n: sum( 1./(16**k)*(4./(8*k+1)-2./(8*k+4)-1./(8*k+5)-1./(8*k+6)) ... for k in xrange(n) ) >>> print bbp(2) 3.14142246642 >>> print bbp(10) 3.14159265359 >>> import math >>> bbp(10) - math.pi < 10**(-14) True :class:`~timeit.Timer` is a class which takes a callable as the argument. The method *timeit* will return the duration in seconds for *one million* executions of the callable. .. doctest:: >>> from timeit import Timer >>> Timer(lambda:bbp(10)).timeit()<20 True Obtaining the 13 correct digits of π can be done in less than 20 microseconds in pure Python with a good algorithm. If 13 digits is not considered enough, Python offers the :mod:`decimal` module. The :attr:`prec` attribute of the context object returned by the :func:`~decimal.getcontext()` method of the decimal package is the number of decimal required. .. doctest:: >>> from decimal import Decimal as d, getcontext >>> def bbp(n): ... return sum( 1/d(16**k) \ ... * (4/d(8*k+1)-2/d(8*k+4)-1/d(8*k+5)-1/d(8*k+6)) \ ... for k in xrange(n)) ... >>> print bbp(50) 3.141592653589793238462643381 >>> >>> getcontext().prec = 70 >>> print bbp(50) 3.141592653589793238462643383279502884197169399375105820974944592246655 And now, on to something **really** fantastic: *Late 2009, about 2700 billion decimal digits of Pi were computed in 116 days, using a single desktop computer. This is presently the World Record for the computation of Pi.* Kudos goes to `Fabrice Bellard`_. He combined fast math methods, with hardware optimization. .. _`Fabrice Bellard`: http://bellard.org/pi/pi2700e9/