2014-09-13

Further Adventures in Python Optimisation

Previously we found that PyPy achieves the best performance gain, executing fieldfunc_py in ~6 us. At the end of that article, I mentioned that the C implementation is up to 50x faster, managing the same calculation in ~0.12 us.

The naive conclusion is that the best thing is to simply call the C function to do the heavy lifting, achieving performance somewhere between PyPy and C. But nothing in life is easy...

Naive C Interfacing

The C code in the previous article was compiled using:

gcc -O -shared -o c_fieldfunc.so -fPIC c_fieldfunc.c

The MagnetElement class was then amended to use the C function whenever possible:

class MagnetElement(object):
  def __init__(self, position, size, magnetisation, fieldcalcfunc=fieldfunc_fast_py):
    """ 
    position and magnetisation are both expected to be numpy arrays of 3
    elements each.

    size is a single number (the edge length), implying all elements are cubes
    """

    self.position = position
    self.size = size
    self.magnetisation = magnetisation
    self.moment = magnetisation * size * size * size
    self._fieldcalcfunc = fieldcalcfunc

    try:
      import ctypes
      cmodule = ctypes.cdll.LoadLibrary('c_fieldfunc.so')
      self._fieldcalcfunc = cmodule.fieldfunc
      self.fieldAt = self._cfieldAt
    except OSError:
      # Library could not be loaded: keep the pure-Python fieldAt below
      pass

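  def _cfieldAt(self, p):
    import ctypes
    # Wrap the address of a numpy array's data as a void* for the C call
    # (assumes contiguous arrays of doubles)
    def voidp(x):
      return ctypes.c_void_p(x.ctypes.data)
    # Output buffer, allocated on every call; the C function fills it in
    field = np.array([0,0,0], np.double)
    self._fieldcalcfunc(voidp(p),
                        voidp(self.position),
                        voidp(self.moment),
                        voidp(field))
    return field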

  def fieldAt(self, p): 
    """ 
    p is expected to be a numpy array of 3
    """
    return self._fieldcalcfunc(p, self.position, self.moment)

This, however, yielded a per-run time of ~28 us! Clearly there is a significant cost in interfacing. The most obvious overheads are creating a new np.array on every call, and defining and calling voidp each time.
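To get a feel for how much of the time is setup rather than the C call itself, the sketch below compares doing that setup on every call with doing it once up front. It is not the code behind the timings in this article: it uses ctypes arrays from the standard library as a stand-in for numpy arrays, and ctypes.memmove as a stand-in for fieldfunc, so that it runs without c_fieldfunc.so. The names Vec3, call_naive and call_cached are made up for the illustration, and the absolute numbers will differ from the ones quoted here.

```python
import ctypes
import timeit

Vec3 = ctypes.c_double * 3      # stand-in for a 3-element numpy array of doubles
NBYTES = ctypes.sizeof(Vec3)

src = Vec3(1.0, 2.0, 3.0)

def call_naive():
  # Per-call setup, as in the class above: a new helper function,
  # a new output buffer and new pointer wrappers on every call.
  def voidp(x):
    return ctypes.c_void_p(ctypes.addressof(x))
  out = Vec3()
  ctypes.memmove(voidp(out), voidp(src), NBYTES)   # stand-in for fieldfunc
  return out

# Same call, but with the buffer and pointer wrappers built once up front.
_out = Vec3()
_out_p = ctypes.c_void_p(ctypes.addressof(_out))
_src_p = ctypes.c_void_p(ctypes.addressof(src))

def call_cached():
  ctypes.memmove(_out_p, _src_p, NBYTES)           # stand-in for fieldfunc
  return _out

n = 100000
t_naive = timeit.timeit(call_naive, number=n) / n
t_cached = timeit.timeit(call_cached, number=n) / n
print("per-call setup: %.2f us per call" % (t_naive * 1e6))
print("cached setup  : %.2f us per call" % (t_cached * 1e6))
```

Both functions make exactly one foreign call, so any difference between the two timings comes from the Python-side setup.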

A Better C Interface

Below is an improved version:

class MagnetElement(object):
  def __init__(self, position, size, magnetisation, fieldcalcfunc=fieldfunc_fast_py):
    """ 
    position and magnetisation are both expected to be numpy arrays of 3
    elements each.

    size is a single number (the edge length), implying all elements are cubes
    """

    self.position = position
    self.size = size
    self.magnetisation = magnetisation
    self.moment = magnetisation * size * size * size
    self._fieldcalcfunc = fieldcalcfunc

    try:
      import ctypes
      cmodule = ctypes.cdll.LoadLibrary('c_fieldfunc.so')
      self._fieldcalcfunc = cmodule.fieldfunc
      self.fieldAt = self._cfieldAt
      # Preallocate the output buffer once instead of on every call
      self._field = np.array([0,0,0], np.double)

      def voidp(x):
        return ctypes.c_void_p(x.ctypes.data)
      self._position_p = voidp(self.position)
      self._moment_p = voidp(self.moment)
      self._field_p = voidp(self._field)
    except OSError:
      # Library could not be loaded: keep the pure-Python fieldAt below
      pass

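  def _cfieldAt(self, p):
    import ctypes
    # Only p changes between calls, so only its pointer is built here
    self._fieldcalcfunc(ctypes.c_void_p(p.ctypes.data),
                        self._position_p,
                        self._moment_p,
                        self._field_p)
    # Note: the same buffer is returned (and overwritten) on every call
    return self._field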

  def fieldAt(self, p):
    """
    p is expected to be a numpy array of 3
    """
    return self._fieldcalcfunc(p, self.position, self.moment)

Now we are down to ~7 us, which is about what PyPy gave us (~6 us). We are still slightly slower, and more effort is involved than simply installing PyPy + numpy. That said, this method has the advantage that it is compatible with existing Python and numpy installations, and can be used to optimise Python code that uses parts of numpy that are not yet implemented in PyPy.
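One side effect of preallocating the result is worth keeping in mind: since _cfieldAt returns self._field itself, every call hands back the same buffer, and a later call overwrites an earlier result. The sketch below shows the effect in isolation. As an assumption for illustration only, it uses a ctypes array in place of the numpy array and ctypes.memmove in place of fieldfunc, and CachedCaller is a hypothetical stand-in for MagnetElement.

```python
import ctypes

Vec3 = ctypes.c_double * 3      # stand-in for a 3-element numpy array of doubles

class CachedCaller(object):
  """Hypothetical stand-in for MagnetElement with a preallocated result."""
  def __init__(self):
    self._field = Vec3()
    self._field_p = ctypes.c_void_p(ctypes.addressof(self._field))

  def fieldAt(self, p):
    # memmove stands in for fieldfunc: it simply copies p into the buffer
    ctypes.memmove(self._field_p,
                   ctypes.c_void_p(ctypes.addressof(p)),
                   ctypes.sizeof(Vec3))
    return self._field

elem = CachedCaller()
a = elem.fieldAt(Vec3(1.0, 2.0, 3.0))
b = elem.fieldAt(Vec3(4.0, 5.0, 6.0))
print(a is b)     # True: both names refer to the one shared buffer
print(list(a))    # [4.0, 5.0, 6.0]: the first result has been overwritten

# To keep a result across calls, copy it out first.
kept = list(elem.fieldAt(Vec3(1.0, 2.0, 3.0)))
elem.fieldAt(Vec3(7.0, 8.0, 9.0))
print(kept)       # [1.0, 2.0, 3.0]
```

With numpy the equivalent is to call .copy() on the returned array, at the cost of reintroducing one allocation per call.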

Summary

C interfacing needs to be done carefully to gain maximum benefit
Not quite as easy as using PyPy, but more compatible

Addendum

While we achieved performance similar to PyPy using C interfacing in this one function, PyPy speeds up the entire program, whereas Python+C only speeds up this one function. In my particular case, Python+C is overall ~2x as slow as PyPy.

Cheers,
Steve