Recently I found myself looking at the following piece of code and trying to optimise it:

from __future__ import division

import numpy as np

MagneticPermeability = 4*np.pi*1e-7
MagneticPermeabilityOver4Pi = MagneticPermeability / (4*np.pi)

def fieldfunc_py(p, q, moment):
  r = p - q 
  rmag = np.sqrt(r.dot(r))

  rdash3 = rmag**3
  rdash5 = rmag**5

  rminus5term = r * moment.dot(r) * 3 / rdash5
  rminus3term = moment / rdash3

  field = (rminus5term - rminus3term) * MagneticPermeabilityOver4Pi
  return field

This, I have been told, computes the field at p due to a magnetic di-pole at q.

To profile this code I wrote the following function:

def profile(fieldfunc, dtype=np.double):
  def V(*args):
    return np.array(args, dtype)

  q = V(0, 0, 0)
  moment = V(1, 0, 0)
  p = V(10,10,10)

  def f():
    fieldfunc(p, q, moment)

  # force the JIT or whatever to run
  f()

  from timeit import Timer
  t = Timer(f)
  runs = 100000
  print fieldfunc.__name__,'per run', t.timeit(number=runs)*1e6/runs, 'us'

This yields a per-run time of ~22 us.

Optimise Python/Numpy

The first optimisation then is to write a better optimised version:

def fieldfunc_fast_py(p, q, moment):
  r = p - q
  rmag2 = np.dot(r, r)

  rdash3 = rmag2**1.5
  rdash5 = rmag2**2.5

  rminus5term = r * np.dot(moment, r) * 3 / rdash5
  rminus3term = moment / rdash3

  field = (rminus5term - rminus3term) * MagneticPermeabilityOver4Pi
  return field

The main difference is using np.dot() and avoiding np.sqrt(). Using ** is faster than using square root and multiplication. These changes gets us down to ~18 us or so, so slight improvement.

numba

Numba offers JIT compilation of python into native code, and is pretty easy to use and compatible with numpy. To use it the profile function was called as:

profile(jit(fieldfunc_py))
profile(jit(fieldfunc_fast_py))

Using jit without type arguments causes numba to infer the type. This yields per-run time of ~22 us for fieldfunc_py and ~18 us for fieldfunc_fast_py. There is no noticable speed up, in fact per-run times increased slightly. This is possibly due to the fact that the code is already pretty well optimised, and the extra overhead of interfacing between compiled code and the python runtime.

PyPy

PyPy is an altnerative implementation of Python that also uses JIT, but does it for the entire program. It is a little harder to use because it needs its own version of numpy, which is only partially implemented. For our needs this will suffice.

With PyPy, we see significant improvement. fieldfunc_py's per-run time went down to ~6 us, and fieldfunc_fast_py's per-run time went down to ~6 us also. This is a huge improvement.

Data Type Matters

The diligent reader may have noticed that profile accepts dtype as a keyword arguments. This was used to test the effects of using np.float32 vs np.float64. The following code was written:

for dtype in np.float32, np.float64:
  print 'With', dtype
  profile(fieldfunc_py, dtype)
  profile(fieldfunc_fast_py, dtype)

  try:
    from numba import jit
    print '==> Using numba jit'
    profile(jit(fieldfunc_py))
    profile(jit(fieldfunc_fast_py))
  except:
    pass

This yielded the following output using CPython:

With <type 'numpy.float32'>
fieldfunc_py per run 30.8365797997 us
fieldfunc_fast_py per run 27.7728509903 us
==> Using numba jit
fieldfunc_py per run 22.2719597816 us
fieldfunc_fast_py per run 18.1778311729 us

With <type 'numpy.float64'>
fieldfunc_py per run 21.8909192085 us
fieldfunc_fast_py per run 17.5257897377 us
==> Using numba jit
fieldfunc_py per run 22.4718093872 us
fieldfunc_fast_py per run 18.2463383675 us

Choice of data type has a significant effect, and it counter-intuitive: the larger data type is faster. Interestingly the same isn't true for PyPy, which outputted:

With <type 'numpy.float32'>
fieldfunc_py per run 6.29536867142 us
fieldfunc_fast_py per run 6.067070961 us

With <type 'numpy.float64'>
fieldfunc_py per run 6.28163099289 us
fieldfunc_fast_py per run 6.17563962936 us

Which is essentially the same.

How Fast Can It Be Done?

The final question is, how fast can this be done? Here is a naive C implementation:

#include <math.h>
#include <stdio.h>

typedef double dtype;

dtype vdot(dtype *u, dtype *v) {
  return u[0]*v[0] + u[1]*v[1] + u[2]*v[2];
}

/**
 * Computes w=u+v
 */
void vadd(dtype *u, dtype *v, dtype *w) {
  w[0] = u[0] + v[0];
  w[1] = u[1] + v[1];
  w[2] = u[2] + v[2];
}

/**
 * Computes w=u-v
 */
void vsub(dtype *u, dtype *v, dtype *w) {
  w[0] = u[0] - v[0];
  w[1] = u[1] - v[1];
  w[2] = u[2] - v[2];
}
/**
 * Computes w = x*u, returns w
 */ 
void vmult(dtype x, dtype *u, dtype *w) {
  w[0] = x*u[0];
  w[1] = x*u[1];
  w[2] = x*u[2];
}

#define MagneticPermeability          (4*M_PI*1e-7f)
#define MagneticPermeabilityOver4Pi   (MagneticPermeability/(4*M_PI))

void vprint(dtype *v) {
  printf("(%e, %e, %e)\n", v[0], v[1], v[2]);
}

void fieldfunc(dtype *p, dtype *q, dtype *moment, dtype *out) {
  dtype r[3];
  vsub(p, q, r);
  dtype rmag2 = vdot(r,r);

  dtype rdash3 = pow(rmag2, 1.5);
  dtype rdash5 = pow(rmag2, 2.5);


  dtype t5[3];
  vmult(vdot(moment, r), r, t5);
  vmult(3/rdash5, t5, t5);

  dtype t3[3];
  vmult(1/rdash3, moment, t3);

  vsub(t5, t3, out);
  vmult(MagneticPermeabilityOver4Pi, out, out);
}

int main(int argc, char **argv) {
  dtype q[] = {0, 0, 0};
  dtype p[] = {10, 10, 10};
  dtype moment[] = {1, 0, 0};
  dtype field[3];

  fieldfunc(p, q, moment, field);
  vprint(field);

  int x;
  for (x = 0; x < 1000000; ++x) {
    fieldfunc(p, q, moment, field);
  }

  return 0;
}

timing was done as follows:

$ gcc -o c_bench c_bench.c -lm && time ./c_bench 
(0.000000e+00, 1.924501e-11, 1.924501e-11)

real    0m0.116s
user    0m0.114s
sys     0m0.001s

Because we do 1 million iterations, per-run time is ~0.12 us, or around 50 times faster.

Summary

Most gain/effort comes from using PyPy
In this particular case, numba did not yield signficant speed up
C is still much faster

Cheers,
Steve

Journey's End

2014-09-13

Adventures in Python Optimisation