Tuesday, December 28, 2010

Efficient MCMC in Python -- Errata and some extra info

In my previous post, some readers pointed out that the pure Python version of the code was slower than it should be. I checked and found that the timing was indeed wrong, due to a bug in the %time flag of the Sage notebook.

Some other interested readers pointed out that using numpy's RNGs in the pure Python version would surely improve its performance. Again, I went back and tested it.

So, without further ado, here are the new timings on the lame machine I am writing this on:
  • Pure Python: 107.5 seconds
  • Pure Python + Numpy: 106 seconds
  • Pure Python + Numpy + storing the results in an array: 103.7 seconds
  • Cython + standard library's random and math modules: 102 seconds
  • Cython + Numpy: 93.26 seconds
  • Cython + GSL RNGs: 5.3 seconds
The source code and instructions for compiling the Cython versions can be found in this gist. Please have fun with it, and continue to suggest further improvements.
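
For reference, here is a sketch of what the "Pure Python + Numpy" variant (with the results stored in a pre-allocated array, as in the third timing above) might look like; my exact code is in the gist:

import numpy as np

def gibbs(N=20000, thin=500):
    x = 0
    y = 0
    samples = np.empty((N, 2))  # pre-allocated array instead of a list
    for i in range(N):
        for j in range(thin):
            x = np.random.gamma(3, 1.0/(y*y + 4))
            y = np.random.normal(1.0/(x + 1), 1.0/np.sqrt(x + 1))
        samples[i] = x, y
    return samples

smp = gibbs()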

Saturday, December 25, 2010

Efficient MCMC in Python

I have recently come across a very interesting blog. Interesting, that is, for people like me who are enthusiasts of statistics, programming, and their combination, preferably expressed in the Python language. One recent post on that blog caught my attention, as it offered a comparison between MCMC implementations of a simple Gibbs sampler in multiple languages, namely R, Python, Java and C.

After presenting the Python version, the author (Darren Wilkinson) said that in an ideal world he would always program in Python, but that in reality he had to use Java or C most of the time to obtain acceptable performance. That statement got me thinking that the very success of Python as a programming language lies in its ability to take us to this ideal world and keep us there. Even in the performance domain, the Python community has developed a number of elegant solutions to minimize the need to resort to lower-level languages: the ability to call directly into C code with ctypes, and the amazing progress of the JIT (just-in-time compilation) capabilities of PyPy, which (we hope) will one day be incorporated into the reference implementation of Python, to cite just a few of the tools. But for the particular problem of optimizing numerical code to near-C performance, my favorite tool these days is Cython.

So in this post, I set out to demonstrate how far we can go in terms of performance improvement with Cython, for the example code provided by Darren. Here is the original code in pure Python, with a few modifications of my own (basically returning the samples):


import random, math

def gibbs(N=20000, thin=500):
    x = 0
    y = 0
    samples = []
    for i in range(N):
        for j in range(thin):
            x = random.gammavariate(3, 1.0/(y*y + 4))
            y = random.gauss(1.0/(x + 1), 1.0/math.sqrt(x + 1))
        samples.append((x, y))
    return samples

smp = gibbs()

Well, we can see by the look of it that it's going to be slow in Python: two nested loops, plus two calls to non-trivial Python functions at their core. In this kind of code (MCMC) there is no real way around heavily iterative structures, which is why it's such a good example to illustrate the power of Cython. This code took 1480 seconds to run on my computer (but see the errata above for corrected timings). That is the baseline against which we will compare the performance of the Cython implementations. Here is the first one:

import random, math

def gibbs(int N=20000, int thin=500):
    cdef double x=0
    cdef double y=0
    cdef int i, j
    samples = []
    for i in range(N):
        for j in range(thin):
            x = random.gammavariate(3, 1.0/(y*y + 4))
            y = random.gauss(1.0/(x + 1), 1.0/math.sqrt(x + 1))
        samples.append((x, y))
    return samples

smp = gibbs()
For those who are unfamiliar with it, Cython is just Python plus (C-style) type declarations. You save the file with a .pyx extension (gibbs.pyx) and compile it to straight C with these commands in a Python shell:
>>> import pyximport; pyximport.install()
>>> import gibbs

This will compile your code automatically and import it, executing the function. This naturally assumes you are on a Linux machine with Cython installed. I actually ran these experiments from a Sage worksheet, a marvelous tool for those doing scientific computing. In Sage, all you have to do to have your Cython code automatically compiled is to add a %cython line at the top of the cell containing the Cython code.

This implementation took 127 seconds to run, an 11.6-fold improvement in performance. My next implementation replaced the random number generators in the core of the loops with the equivalent ones from numpy.random, which shaved a few extra seconds off the time (see the Sage worksheet). The real bottleneck here is that the two function calls are still Python function calls, which carry the overhead of checking the types of their arguments at runtime. So for my definitive implementation, I decided to use GSL's random number generators. That involved declaring the parts of the GNU Scientific Library that I'd be using, but Cython makes it very easy:

#declaring external GSL functions to be used
cdef extern from "math.h":
    double sqrt(double)

cdef double Sqrt(double n):
    return sqrt(n)

cdef extern from "gsl/gsl_rng.h":
    ctypedef struct gsl_rng_type:
        pass
    ctypedef struct gsl_rng:
        pass
    gsl_rng_type *gsl_rng_mt19937
    gsl_rng *gsl_rng_alloc(gsl_rng_type * T)

cdef gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937)

cdef extern from "gsl/gsl_randist.h":
    double gamma "gsl_ran_gamma"(gsl_rng * r, double, double)
    double gaussian "gsl_ran_gaussian"(gsl_rng * r, double)

# the Gibbs sampler, now using the GSL generators
def gibbs(int N=20000, int thin=500):
    cdef double x=0
    cdef double y=0
    cdef int i, j
    samples = []
    for i in range(N):
        for j in range(thin):
            x = gamma(r, 3, 1.0/(y*y + 4))
            # gsl_ran_gaussian returns a zero-mean deviate, so the mean
            # 1/(x+1) must be added back explicitly
            y = 1.0/(x + 1) + gaussian(r, 1.0/Sqrt(x + 1))
        samples.append([x, y])
    return samples

smp = gibbs()
This implementation took 5 seconds to run, a 296-fold improvement in performance. At this point, it felt good to be in my ideal (real) Python world.
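
For anyone wanting to reproduce these comparisons outside of Sage, a plain wall-clock measurement is the simplest approach. A minimal sketch (not how the timings above were taken, which used the Sage notebook):

import time

t0 = time.time()
smp = gibbs()
print "gibbs() took %.1f seconds" % (time.time() - t0)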



Friday, October 22, 2010

Model-Builder: the road to 1.0

Model-Builder is a niche educational software package and the first "largish" Python application I ever wrote, way back in 2001. Back then, WxPython seemed like the easiest way to develop a GUI application in Python, mainly due to the automation provided by the Boa-constructor IDE.

A lot has changed since then, and for many years now my favorite GUI toolkit has been Qt, via its Python binding, PyQt. For a long time, WxPython kept me from improving the GUI of Model-Builder, but not anymore. Six months ago, on a lazy Sunday afternoon, I decided to port Model-Builder to Qt.

Naturally, that decision was made easier by a long-postponed desire to add features to Model-Builder. So with the port, I got my first foot on the road to MB 1.0.

Porting and redesigning the GUI was the easiest part of it. The new features, both planned and invented along the way, are still keeping me from making the much-anticipated stable 1.0 release.

In this post, and possibly on subsequent ones, I'll try to cover the new features implemented so far along with a brief commentary on the particular challenges associated with their implementation. First the list of the most important new features:
  1. New Qt4-based GUI
  2. Tab-based design
  3. Support for internationalization (i18n)
  4. Configuration persistence in a configuration file in the user's home directory.
  5. Splash Screen
  6. Uncertainty analysis feature based on methods from my BIP library.
  7. Report generation in Html format with Latex formatted Equations.
  8. Web application counterpart integrated with the desktop app: Model-Builder cloud community.
  9. Collection of usage statistics with user permission.
  10. Sympy based basic algebraic analysis.
  11. Social collaboration tools on Model-Builder Cloud Community.
  12. Model upload and download from the cloud.

Now let's talk a little about each of these steps:

  1. New Qt4-based GUI
    This was the best decision I could have made. Qt4 is a great GUI toolkit, and together with QtWebkit and QScintilla it allowed me to easily implement a lot of interesting features. QScintilla has built-in support for syntax highlighting, which I use for the equation editor. The current highlighting is still very simple but can be vastly improved in the future. QtWebkit allowed me to implement the display of HTML reports, which I plan to make exportable soon. Besides that, in conjunction with Eric4, it's a breeze to handle the internationalization of the GUI, thanks to the integration with QtLinguist.
  2. Tab-based design.
    The tab-based design was meant to allow for the separation of the various modes of operation of MB without the need to open separate windows all the time. Moreover, it provides events which I use to synchronize the various activities; for example, every time a tab changes, MB checks whether the model definition has changed and updates the control panel in the simulator.
  3. Support for internationalization (i18n)
    Without Qt4, I would not have had the time to do internationalization by more traditional methods. The first language supported, besides the default English, is my own native language, Brazilian Portuguese. Translators are welcome; a new translation can be completed in less than an hour.
  4. Configuration persistence in a configuration file in the User's home directory.
    That became necessary to facilitate interaction with the web app; otherwise, the user would have to enter his/her user id every time.
  5. Splash Screen.
    This is just a bit of eye candy. ;-)
  6. Uncertainty analysis feature based on methods from my BIP library.
    This feature consumed most of the implementation effort in this branch, and it is still not complete, but pretty close. It is not as flexible a tool as using BIP directly as a library in a script, but it does a good job of fitting arbitrary models to data. Implementing this feature required implementing the loading and display of data sets in CSV format: basically a simple spreadsheet-like widget, with support for opening multiple files in separate sheets.
  7. Report generation in Html format with Latex formatted Equations.
    This feature was needed for better formatting of the model's equations in mathematical notation. For the rendering part, I chose the MathJax javascript library, which converts LaTeX expressions on the fly, with great quality. The LaTeX is generated by Sympy, which is now a required dependency. The HTML is composed from Jinja2 templates. One unresolved issue with this solution is the size of the MathJax library: since QtWebkit has no caching support, packing MathJax with MB would greatly increase the total size of the package.
  8. Web application counterpart integrated with the desktop app: Model-Builder cloud community.
    This is by far the biggest innovation in this release. The Model-Builder Cloud Community is meant to serve initially as an online model repository, but to evolve into a full-fledged online collaboration platform dedicated to Model-Builder users. It already supports OpenID authentication and basic social features such as inviting others to collaborate on models, visualization of models, etc. Suggestions for new features are welcome, and the source code for the web app is also open, so if you feel like collaborating, check it out!
  9. Collection of usage statistics with user permission.
    One of the most useful features for me. It will allow me to track the number of installations and uses of MB. All with user permission, of course.
  10. Sympy based basic algebraic analysis.
    Since I had to add Sympy as a dependency for the formatting of the equations, I decided to use it to provide further algebraic manipulations of the models. The first thing I implemented was automatic derivation of the Jacobian matrix of the system of equations (a sketch of the idea appears right after this list). I am considering embedding an isympy console in a tab, to give the user the full power of Sympy to analyze the model.
  11. Social collaboration tools on Model-Builder Cloud Community.
    This is still very simple. Users can invite others to collaborate on a model, and they can also visualize their network of collaborators, both in table format and as a graph on a map. The webapp captures the user's geolocation (if permission is granted).
  12. Model upload and download from the cloud.
    This is the cornerstone of making the website into a model repository. In the future, it may be interesting to add semantic descriptors which can be filled in after the model is uploaded. Currently there is no support for storing data on the web, but graphical snapshots of the last run are planned.
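
As promised in item 10 above, here is a minimal sketch of the Jacobian derivation idea, using a hypothetical two-equation system (this is not Model-Builder's actual code):

from sympy import symbols, Matrix

x, y = symbols('x y')
# hypothetical right-hand sides of dx/dt and dy/dt
f = Matrix([x*(1 - x) - x*y,
            x*y - 0.5*y])
# Jacobian of the system with respect to the state variables
J = f.jacobian(Matrix([x, y]))
print J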
That is it. Collaborators are very much welcome, both for the desktop and the web application. My bandwidth is definitely not enough to implement all this and still put food on the table. ;-)

Sunday, August 22, 2010

Python in the Clouds

Consegi 2010 is over. It was a great meeting! I was happily surprised to realize how far the adoption of free software has come in the last few years. Congratulations to the free software community for such a successful lobby!

Big thanks are due to Giuseppe (@gsromag), Luiz Guilherme (@aldabalde) and their colleagues from SERPRO, who were responsible for the organization of this amazing meeting.

Python was also a big winner at this event. A large number of talks and workshops either focused on or used Python directly. I personally gave a workshop and a talk on Google App Engine in auditoriums filled to capacity, with people of the most varied backgrounds, from students to government IT people.

Another interesting trend (sentiment?) that I could pick up at the event was a certain discomfort around Java, mainly due to its legal liabilities, as illustrated by the recent legal action unleashed by Oracle against the use of Java in the Android platform.

I was also glad to learn about the Free Cloud alliance, and some interesting free-software, cloud related tools such as NiftyName and Neopodd.

My big thanks to the organizers for inviting me, and for such a great event.




Thursday, July 15, 2010

Consegi 2010

I have been invited to give a talk at the 3rd International Congress on Free Software and E-Government (Consegi), taking place in Brasilia on 18-20 August. It is a massive event, bringing together leading professionals from all over the world. Last year's event had 6000 participants from 19 countries!

This year's theme is cloud computing, and I'll be talking about Google App Engine (GAE). Since I am not a Google employee, I plan to make an unbiased presentation of the platform's potential in the context of the growing ecosystem of similar cloud computing platforms. I also plan to talk about GAE's integration with other Google services such as Gadgets, the Wave API, and OpenSocial, but I hope to also discuss the use of GAE as a front-end to other cloud-based systems which may require heavier computing and data storage features than GAE offers. By this I mean scientific computing and data analysis platforms, subjects I am deeply involved with.

The topics above are naturally of great importance to e-gov projects, but I am afraid I won't have time to go into that subject. Hopefully, in the Q&A after my talk, someone will give me an opportunity to talk about it.

I'd appreciate hearing suggestions of other topics that might fit in such a talk.

Thursday, June 3, 2010

PyBrain

I just came across this interesting Python library for machine learning. People interested in this should also check out Peach: http://code.google.com/p/peach/

in reference to: http://pybrain.org/pages/features (view on Google Sidewiki)

Monday, April 19, 2010

New Liveplots Release

I recently blogged about Liveplots. I have just released another version with a very important improvement.

Before, even though the plot server was running in a separate process, there was still a small overhead for the calling code, since the plotting function had to be evaluated (and the data piped to Gnuplot) before execution returned to your simulation. To improve matters, I have implemented queueing of the plotting calls, so that now any call to the plot server returns immediately, allowing the main code to run at maximum speed. In the simple.py example, the speed improvement was greater than 10x! This of course means that the plotting runs asynchronously with the simulation, and depending on the rate of plot creation, plotting can lag behind somewhat.

To make this new feature even nicer, it is implemented as a queueing decorator, so that future expansion of the plotting API can be done without worrying about the queueing machinery: a simple @enqueue decorator on the plot function will do the magic.
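
Here is a minimal sketch of the idea (not Liveplots' actual implementation, which uses a separate process): plot calls are put on a queue and consumed by a worker, so the caller returns immediately.

from Queue import Queue
from threading import Thread

plot_queue = Queue()

def worker():
    while True:
        func, args, kwargs = plot_queue.get()
        func(*args, **kwargs)  # the actual (slow) plotting happens here
        plot_queue.task_done()

t = Thread(target=worker)
t.setDaemon(True)  # don't keep the process alive just for plotting
t.start()

def enqueue(func):
    """Decorator: calling the decorated function only enqueues the call."""
    def wrapper(*args, **kwargs):
        plot_queue.put((func, args, kwargs))
    return wrapper

@enqueue
def line_plot(data):
    # here the data would be piped to Gnuplot
    pass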

Enjoy!

Thursday, April 15, 2010

Liveplots package

I have recently released a package I had been using privately for a while. It's called Liveplots, and it provides tools for monitoring computationally intensive, long-running numerical simulations. I am aware of similar tools for Java, like LiveGraph, but I needed something that was light, simple, and in Python.

Liveplots provides a daemonized plot server (i.e., running as a separate process), which accepts plot commands via XML-RPC, and a file system monitor, which can be used to produce visualizations when data files are created or modified.

Liveplots is fast enough for most applications. The simple example script available with the source code is able to generate about 15 plots per second on my laptop, but your mileage may vary, depending on how fast your computer is. For the file monitor, it doesn't make sense to talk about speed, since it is event-driven, but the same upper limit should apply. An example is provided for it as well. The file monitor works only on Linux, since it depends on pyinotify, but the plot server should work everywhere Python and Gnuplot run.

The plot server offers a very simple API, basically consisting of three types of plots: histogram, scatter, and line plots. All of these commands are able to plot multiple data sets on the same set of axes or on a multi-plot page layout. I think this should suffice for now, since the purpose of this package is to provide a quick visual inspection of long-running computations, not to produce publication-quality plots. In any case, if you would like to see its API extended, please send me patches. New plot commands should be really easy to implement if you are familiar with Gnuplot.
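
To give an idea of how such a server is used, here is a hypothetical client-side sketch; the method names and port are illustrative, not necessarily Liveplots' real API:

import xmlrpclib

server = xmlrpclib.ServerProxy("http://localhost:10001")
server.scatter(range(10), [i**2 for i in range(10)])  # scatter plot of x vs y
server.lineplot([0.1, 0.5, 0.9, 0.7])                 # line plot of a series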

The visualization part is done by Gnuplot, which I chose because it is fast and ubiquitous. It requires Gnuplot.py, which unfortunately is not yet available through easy_install, thus requiring an extra installation step. Liveplots itself is easy_installable.


Monday, April 5, 2010

Automatic Parameterization of Scipy's Random Number Generators

I am working on a new version of Model-Builder, which will incorporate Uncertainty analysis features based on my BIP package.

One key aspect of doing uncertainty analysis is the ability to attribute prior distributions to the model's parameters. Programmatically, this is easily accomplished through the use of scipy.stats random number generators, which provide a large variety of both continuous and discrete distributions. However, since Model-Builder is a GUI-based application, I need to offer a unified interface for the parameterization of such distributions.

This should not be as hard as it seems, because in the stats package all distributions are subclasses of either stats.rv_continuous or stats.rv_discrete, and thus [almost] all of them can be parameterized by calling the distribution with a tuple of shape parameters, the length of which varies from distribution to distribution but is available as the attribute numargs. This is exemplified in the docstring of every distribution as the following pseudo-doctest:
>>> import matplotlib.pyplot as plt
>>> numargs = norm.numargs
>>> [ ] = [0.9,]*numargs
>>> rv = norm()

Display frozen pdf

>>> x = np.linspace(0,np.minimum(rv.dist.b,3))
>>> h=plt.plot(x,rv.pdf(x))
So I set out to test whether this would actually work for all generators. If it worked for at least the most important ones, it would mean I could just have the user provide a tuple of numbers and pass it to whichever distribution the user chose. To run this test, I came up with the following script:


# -*- coding: utf-8 -*-
import pylab as P
import numpy as np
import scipy.stats as st  # this import was missing from the original listing

f1 = P.figure()
f2 = P.figure()
a1 = f1.add_subplot(111)
a1.set_title("Continuous Distributions")
a2 = f2.add_subplot(111)
a2.set_title("Discrete Distributions")
for d in dir(st):
    obj = getattr(st, d)
    if isinstance(obj, st.rv_continuous):
        try:
            numargs = obj.numargs
            shape_pars = [0.9]*numargs
            rv = obj(shape_pars)
            x = np.linspace(0, np.minimum(rv.dist.b, 3))
            h = a1.plot(x, rv.pdf(x), label=obj.name)
            a1.legend()
            print "+++>", obj.name
        except:
            print "--->", obj.name
    if isinstance(obj, st.rv_discrete):
        try:
            numargs = obj.numargs
            shape_pars = [0.9]*numargs
            rv = obj(shape_pars)
            x = np.linspace(np.maximum(-3, rv.dist.a), np.minimum(rv.dist.b, 3))
            h = a2.plot(x, rv.pmf(x), label=obj.name)
            a2.legend()
            print "+++>", obj.name
        except:
            print "--->", obj.name
P.show()

When I ran this simple script, I got not only the plots of each probability distribution that conformed with the parameterization via the shape tuple, but also a report of which did (+++>) or did not (--->) conform. Here are the results:

+++> alpha
---> anglit
---> arcsine
+++> bernoulli
---> beta
---> betaprime
---> binom
---> boltzmann
+++> bradford
---> burr
---> cauchy
+++> chi
+++> chi2
---> cosine
+++> dgamma
+++> dlaplace
+++> dweibull
+++> erlang
---> expon
+++> exponpow
---> exponweib
---> f
+++> fatiguelife
+++> fisk
+++> foldcauchy
+++> foldnorm
+++> frechet_l
+++> frechet_r
+++> gamma
---> gausshyper
---> genexpon
+++> genextreme
---> gengamma
+++> genhalflogistic
+++> genlogistic
+++> genpareto
+++> geom
---> gilbrat
+++> gompertz
---> gumbel_l
---> gumbel_r
---> halfcauchy
---> halflogistic
---> halfnorm
---> hypergeom
---> hypsecant
+++> invgamma
+++> invnorm
+++> invweibull
---> johnsonsb
---> johnsonsu
+++> ksone
---> kstwobign
---> laplace
---> levy
---> levy_l
---> levy_stable
+++> loggamma
---> logistic
+++> loglaplace
+++> lognorm
+++> logser
+++> lomax
---> maxwell
---> mielke
+++> nakagami
---> nbinom
---> ncf
---> nct
---> ncx2
---> norm
+++> pareto
+++> planck
+++> poisson
+++> powerlaw
---> powerlognorm
+++> powernorm
---> randint
---> rayleigh
+++> rdist
+++> recipinvgauss
---> reciprocal
+++> rice
---> semicircular
+++> t
+++> triang
+++> truncexpon
---> truncnorm
+++> tukeylambda
---> uniform
+++> vonmises
---> wald
+++> weibull_max
+++> weibull_min
+++> wrapcauchy
+++> zipf


From the list above, it is clear that although the majority passed the test, many important ones did not (norm, beta, uniform, etc.). This doesn't mean those cannot be used or won't work, only that they cannot be parameterized in the same way as the rest. The norm distribution, for instance, has numargs = 0 (WTF?), meaning it takes no shape parameters at all and is parameterized only by the generic loc and scale arguments.

Although initially I was pretty excited about the possibility of handling every probability distribution in the same standard way, in the end I was disappointed that the Scipy people, after having come up with such a nice design for the stats package, did not think of running their own doc-tests (or making them runnable in the first place).

I am still looking for a solution to this. One possible direction is sketched below. If you have an idea of how to elegantly work around this issue, please comment below.
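
Here is one possible workaround (a sketch under my own assumptions, not a vetted solution): always expose the generic loc and scale arguments, which every distribution accepts, alongside the optional shape tuple. Distributions like norm (numargs == 0) are then parameterized by loc and scale alone.

import scipy.stats as st

def make_frozen(dist, shape_pars=(), loc=0, scale=1):
    """Freeze a scipy.stats distribution from a generic parameter spec."""
    if len(shape_pars) != dist.numargs:
        raise ValueError("%s expects %d shape parameter(s)" %
                         (dist.name, dist.numargs))
    return dist(*shape_pars, loc=loc, scale=scale)

rv1 = make_frozen(st.norm, loc=1.0, scale=2.0)   # numargs == 0: loc/scale only
rv2 = make_frozen(st.gamma, shape_pars=(3.0,))   # one shape parameter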

Thursday, March 18, 2010

Converting C code to Python

This may sound like a weird thing to do, but actually I have craved something like this every time I have to read C code. Converting C code to Python can not only help us understand the code more easily, but also make non-performance-critical code easier to maintain. C can be easy enough to read if well written and adequately formatted (indented); however, having to move back and forth between .h and .c files drives me mad sometimes.

You can imagine my joy when I came across "ctopy", created by none other than Eric Raymond himself!

The first problem I ran into was the fact that ctopy was not available on my Ubuntu. No problem: a quick search through Eric's web site led me to it:
http://www.catb.org/~esr/ctopy/

Naturally, ctopy is written in Python, and my immediate impulse was to download and test it.

For test code, I decided to download a small C program from the "Computer Language Shootout" site:

http://shootout.alioth.debian.org/u64q/program.php?test=mandelbrot&lang=gcc&id=6

Unfortunately, even for this simple program, ctopy didn't do a good job. I am not including the results here, but you can easily replicate them with the command:

cat mandelbrot.c | indent |./ctopy

The man page of ctopy recommends piping the C code through indent, because it relies on the indentation of the C code to do the translation.

Well, as Raymond himself says at the beginning of ctopy's man page, it's a quick and dirty translator which requires a human to finish the job.

Anyway, that's how far my dream of a "Google Translate" for code went. I decided to post the links in case someone decides to continue where Raymond stopped.

Friday, February 26, 2010

Utopia

Interesting post. But hoping to invert those proportions, i.e., making the smarter guys the 80%, is harder than having a society where everyone earns more money than they can spend... Pretty close to impossible... ;-)

in reference to: Coding Horror: The Two Types of Programmers (view on Google Sidewiki)
