Monday, 6 March 2017

Guest Post: Python to R

This is a guest post by Alex Glaser, who runs the London Kaggle meetup and organises several dojos.

Alex took on the challenge of making his own neural networks, but instead of using Python, he used R. Here he talks about that journey, the things he had to overcome, and some insights into performance differences and profiling tools.

Python to R Translation

Having read through Make your own Neural Network (and indeed made one myself) I decided to experiment with the Python code and write a translation into R. Having been involved in statistical computing for many years I’m always interested in seeing how different languages are used and where they can be best utilised.
There were a few ground rules I set myself before starting the task:
  • All code was to be ‘base’ R (other packages could be added later)
  • The code would be as close as possible to a ‘line-by-line’ translation (again, more R-centric code could be written later)
  • The assignment operator “<-” would be used.
As a little aside, a quick word about the assignment operator. It can be confusing for new users, or those coming from other languages, but in most cases it can be used interchangeably with “=”. Having been a long-time R user I quite like the assignment operator; a little history about it can be found here. It also provides a bit of continuity with the other assignment operators, notably the global assignment operator “<<-”, and it allows a variable to be assigned within a function call.
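As a rough illustration of the points above (not code from the original post), the sketch below shows assignment inside a function call and the “<<-” operator. The names make_counter and counter are made up for the example.

```r
# With "<-", the inner assignment creates x and passes its value to mean()
m <- mean(x <- c(1, 2, 3))

# With "=", the same position is read as the named argument x of mean(),
# so the variable x in the workspace is left unchanged
m2 <- mean(x = c(4, 5, 6))

# "<<-" assigns in an enclosing environment rather than the local one.
# make_counter is a hypothetical helper used only for illustration.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1  # updates count in make_counter's environment
    count
  }
}

counter <- make_counter()
counter()
counter()
```

After running this, x still holds the values assigned in the first call, which is the practical difference between “<-” and “=” inside a function call.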

Translating the code from Python to R also allowed me to start using RStudio’s notebook. Don’t get me wrong, I do like Jupyter, but there’s always room to look at what else is out there. Each cell starts with a magic-like command saying which language is going to be used in that cell, e.g. ```{r} for R, ```{python} for Python, etc.

Just sticking with the code in Part 2 of Tariq’s book (code available here), a simple place to start was to replicate the printing of a single MNIST image (part2_mnist_data_set.ipynb). Reading the data in was fairly simple: both R and Python have a readlines command (readLines in R). R also has some nice graphical capabilities, and the matrix is a commonly used object. A few ideas cropped up which might be of interest to a new user: splitting a string results in a list (another R data type), and in order to plot the image successfully we need to reverse the ordering of the rows. The latter could be done using indexes, but I thought using an apply function would be quite a nice way of doing this. The apply suite of functions is an important part of R and often provides a succinct way of coding without lots of for loops.
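The steps above can be sketched in base R roughly as follows. This is an illustration under assumptions rather than the code from the notebook: it builds a synthetic MNIST-style record (a label followed by 784 pixel values) instead of reading the real CSV file with readLines, and the variable names are made up.

```r
# Synthetic record: label 7, then 784 pixels with the top image row set to 255
pixels <- rep(0, 28 * 28)
pixels[1:28] <- 255
record <- paste(c(7, pixels), collapse = ",")

# strsplit() returns a list with one element per input string,
# so the character vector of values is extracted with [[1]]
parts <- strsplit(record, ",")
all_values <- parts[[1]]

label <- as.integer(all_values[1])
img <- matrix(as.numeric(all_values[-1]), nrow = 28, ncol = 28, byrow = TRUE)

# Reverse the row order by applying rev() to every column
flipped <- apply(img, 2, rev)

# image() draws the first matrix index along the x axis, hence the transpose
if (interactive()) {
  image(t(flipped), col = grey.colors(255), axes = FALSE)
}
```

The same flip could be written with indexes as img[28:1, ], but apply() shows the pattern that replaces many explicit for loops in R.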

Okay, one notebook down, another one to go, this time the biggie (part2_neural_network_mnist_data.ipynb). One aspect of Python (and of other object-oriented languages) that differs from R is the notion of a class. Classes do exist in R, but they are often used internally to ‘collect’ all the output from a function.

Also, the class would typically be assigned at the end of a function rather than declared at the start. For example, a function may finish by setting the class of its result to ‘quiz’ before returning it, which would return an object of class ‘quiz’.
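A minimal sketch of that pattern is shown below. The function make_quiz and its fields are hypothetical and only serve to illustrate how a class is attached to a function’s output; the lm call at the end shows a built-in function that does the same thing.

```r
# Hypothetical function that collects its results in a list and
# tags the list with a class just before returning it
make_quiz <- function(questions, answers) {
  out <- list(questions = questions,
              answers = answers,
              n = length(questions))
  class(out) <- "quiz"  # the class is assigned at the end of the function
  out
}

# A print method is chosen according to the class name
print.quiz <- function(x, ...) {
  cat("A quiz with", x$n, "questions\n")
  invisible(x)
}

my_quiz <- make_quiz(c("2 + 2?", "Capital of France?"), c("4", "Paris"))
class(my_quiz)
print(my_quiz)

# Built-in modelling functions follow the same pattern
fit <- lm(dist ~ speed, data = cars)
class(fit)
```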

Our initial attempt at ‘translating’ the code was supposed to be as close to a ‘line-by-line’ translation as possible, so that people could see how one line in Python would be written in R. This also meant that we had to create an artificial class using R’s function; note that it uses the dollar symbol to reference elements of this class, rather than the dot that we see in Python code. Also, we used the word ‘self’ to keep continuity with the Python code, though it isn’t often used in R code. One final comment: it only replicates some of the functionality of a class. It isn’t a class replacement, so some of the behaviour may not be the same.
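One way such an artificial class might look is sketched below. This is an assumption about the general approach, not the code from the GitHub repository: neural_network, its arguments and its fields are hypothetical names chosen to mirror the Python version.

```r
# Hypothetical constructor: a function that builds and returns a list
# called self, whose elements are reached with $ instead of Python's dot
neural_network <- function(input_nodes, hidden_nodes, output_nodes) {
  self <- list(inodes = input_nodes,
               hnodes = hidden_nodes,
               onodes = output_nodes)

  # Weight matrices drawn from a normal distribution
  self$wih <- matrix(rnorm(hidden_nodes * input_nodes, 0, hidden_nodes^-0.5),
                     nrow = hidden_nodes, ncol = input_nodes)
  self$who <- matrix(rnorm(output_nodes * hidden_nodes, 0, output_nodes^-0.5),
                     nrow = output_nodes, ncol = hidden_nodes)

  # Sigmoid activation
  self$activation_function <- function(x) 1 / (1 + exp(-x))

  # 'Method' that looks up self in the constructor's environment
  self$query <- function(inputs_list) {
    inputs <- matrix(inputs_list, ncol = 1)
    hidden_outputs <- self$activation_function(self$wih %*% inputs)
    self$activation_function(self$who %*% hidden_outputs)
  }

  self
}

set.seed(42)
nn <- neural_network(3, 3, 3)
out <- nn$query(c(1.0, 0.5, -1.5))
out
```

In a sketch like this, a method that updates the weights would have to modify self with “<<-”, and the copy held in nn would not see the change, which is one example of behaviour that differs from a real Python class.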
Matrix multiplication in R is done with the “%*%” operator.
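A small illustrative example (the matrices are arbitrary) shows the operator and how it differs from element-wise multiplication with “*”:

```r
A <- matrix(1:6, nrow = 2)  # 2 x 3 matrix, filled column by column
B <- matrix(1:6, nrow = 3)  # 3 x 2 matrix

# Matrix product: a 2 x 2 result
AB <- A %*% B
AB

# "*" multiplies element by element and needs matching dimensions
A_squared <- A * A
A_squared
```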

Most of the time the coding was relatively straightforward, and after a few false starts we managed to replicate the results of the original Python code and get over 97% accuracy. However, there was one big difference: the time taken. Now I’ve heard all sorts of arguments about the speed comparison of R and Python, but had assumed that since things like matrix multiplication were undertaken in C++ or Fortran these speed differences would not be considerable. That was not the case. The Python code on my (admittedly 5+ year old) Mac takes about 6 mins, whilst the R code took roughly double that.

There are a few nice ‘profiling’ commands in R (and the profvis package provides some nice interactivity), and when we looked at the R code in more depth it was the final matrix multiplication in the ‘train’ function that was taking about 85% of the time (we used the tcrossprod command in R to separate this multiplication from the rest). This last matrix multiplication is simply the outer product of two vectors, so it’s difficult to see why it would be so time consuming.
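To make the outer product concrete, here is a hedged sketch that computes it three ways in base R and times each with system.time. The vector sizes (200 and 784) and the variable names are assumptions for illustration, and the timings will vary from machine to machine.

```r
set.seed(1)
errors <- matrix(rnorm(200), ncol = 1)  # assumed: 200 hidden nodes
inputs <- matrix(runif(784), ncol = 1)  # assumed: 784 input pixels

# Three equivalent ways of forming the 200 x 784 outer product
via_matmul     <- errors %*% t(inputs)
via_tcrossprod <- tcrossprod(errors, inputs)
via_outer      <- outer(as.vector(errors), as.vector(inputs))

stopifnot(isTRUE(all.equal(via_matmul, via_tcrossprod)),
          isTRUE(all.equal(via_matmul, via_outer)))

# Rough timing over repeated calls
reps <- 1000
t_matmul     <- system.time(for (i in seq_len(reps)) errors %*% t(inputs))
t_tcrossprod <- system.time(for (i in seq_len(reps)) tcrossprod(errors, inputs))
t_outer      <- system.time(for (i in seq_len(reps)) outer(as.vector(errors),
                                                           as.vector(inputs)))

print(rbind(matmul = t_matmul, tcrossprod = t_tcrossprod, outer = t_outer)[, "elapsed"])
```

For a breakdown by line rather than a single total, base R’s Rprof or the profvis package could be wrapped around the training loop instead.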

Looking at a few examples it’s not hard to see that Python’s function is far faster than R’s %*% command. Now for a few matrices this isn’t an issue (what’s a few hundredths of a second against a few thousandths?). However, for the MYONN model we’ll be calling each function 300,000 times, so after a while this time differential builds up.

As mentioned earlier this difference in timings is quite surprising since the underlying code should be C++ or Fortran. It could also be that some underlying library was better optimised in Python than R. This will definitely be explored at a future R or Python coding dojo.
It’s been a fun experience, and as with all work there are always more unexpected questions that come up. A brief synopsis of future work:
  • Try and figure out why Python’s matrix multiplication is so much quicker than R’s. Could also try some functions from Rcpp.
  • Write the code so that it is a bit more R-centric, and see if there are any libraries, such as those in the tidyverse, which might be useful (though it would only really be useful if we can solve the previous problem).
  • Look at using Julia to see how that compares with R and Python.

The R code is available from my GitHub page here, so feel free to download and change as you see fit. Any help with regard to optimising the numerical libraries in R to match Python’s speed would be appreciated.