Sunday, 1 March 2015

The MNIST Dataset of Handwritten Digits

In the machine learning community, common datasets have emerged. Having common datasets is a good way of making sure that different ideas can be tested and compared in a meaningful way, because the data they are tested against is the same.

You may have come across the famous iris flower dataset, which is commonly used to try out clustering or classification algorithms.

With our neural network, we eventually want it to classify human handwritten numbers. So we'd want to train it on a dataset of handwritten numbers, with labels to tell us what the numbers should be. There is in fact a very popular such dataset called the MNIST dataset. It's a big database, with 60,000 training examples, and 10,000 for testing.

The format of the MNIST database isn't the easiest to work with, so others have created simpler CSV files, such as this one. The CSV files are:



The format of these is easy to understand:

  • The first value is the "label", that is, the actual digit that the handwriting is supposed to represent, such as a "7" or a "9". It is the answer that the neural network is trying to arrive at when it classifies the image.
  • The subsequent values, all comma separated, are the pixel values of the handwritten digit. The size of the pixel array is 28 by 28, so there are 784 values after the label.
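
To make the format concrete, here is a minimal sketch that splits one record into its label and a 28 by 28 array of pixel values. The record is made up for illustration (every pixel is zero except one), and the name parse_record is my own choice, not something that comes with the MNIST files.

```python
import numpy

def parse_record(line):
    """Split one CSV record into (label, 28x28 array of pixel values)."""
    values = line.strip().split(',')
    label = int(values[0])
    pixels = numpy.asarray(values[1:], dtype=float)
    if pixels.size != 784:
        raise ValueError("expected 784 pixel values, got %d" % pixels.size)
    return label, pixels.reshape((28, 28))

# a made-up record: label 7 followed by 784 pixel values
fake_pixels = ['0'] * 784
fake_pixels[28 * 3 + 5] = '255'   # this lands at row 3, column 5
record = ','.join(['7'] + fake_pixels) + '\n'

label, image = parse_record(record)
print(label, image.shape, image[3, 5])
```

The pixel values run row by row, so value number 28*3 + 5 ends up at row 3, column 5 after the reshape.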


These files are still big, and it would be nicer to work with much smaller ones whilst we experiment and develop our code for visualising handwriting and the neural networks themselves.

So here are smaller versions of the above CSV files, but with 100 training and 10 test items:

  • mnist_train_100.csv
  • mnist_test_10.csv

Next we'll try to load these files into Python and see if we can display the handwritten characters.

We can load the data easily from a file as follows:

f = open("mnist_test_10.csv", 'r')
a = f.readlines()
f.close()

We can then split each line according to the commas, to get the label and the bitmap values, from which we build an array. We do need to reshape the linear array to 28x28 before we display it.

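import numpy
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(15,15))
count = 1
for line in a:
    # the first value is the label, the other 784 are the pixels
    linebits = line.split(',')
    imarray = numpy.asarray(linebits[1:], dtype=float).reshape((28,28))
    # draw each digit in the next cell of a 5x5 grid
    plt.subplot(5,5,count)
    plt.subplots_adjust(hspace=0.5)
    count += 1
    plt.title("Label is " + linebits[0])
    plt.imshow(imarray, cmap='Greys', interpolation='none')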

The output in IPython is a series of images. You can check that each label matches the handwritten image:


Cool - we can now import handwritten image data from the MNIST dataset and work with it in Python!

PS Yes, calling the "subplots" commands on every pass through the loop isn't efficient, but I didn't have time to work out how to do subplots properly.
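
For what it's worth, one tidier way is to ask matplotlib for the whole grid of axes in a single call to plt.subplots, and then draw into each axis in turn. This is only a sketch, not the code that produced the output above: it uses made-up random pixel values in place of the CSV records so that it runs on its own, and the 2 by 5 grid is just a choice that suits 10 test items.

```python
import numpy
import matplotlib
matplotlib.use("Agg")   # draw off-screen; leave this line out in IPython
import matplotlib.pyplot as plt

# stand-in for the lines read from the CSV file: 10 made-up records
rng = numpy.random.default_rng(0)
records = []
for digit in range(10):
    pixels = rng.integers(0, 256, size=784)
    records.append(','.join([str(digit)] + [str(p) for p in pixels]))

# create the figure and all ten axes in one call, and adjust spacing once
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
fig.subplots_adjust(hspace=0.5)

# axes.flat walks the grid row by row, one axis per record
for ax, line in zip(axes.flat, records):
    linebits = line.split(',')
    imarray = numpy.asarray(linebits[1:], dtype=float).reshape((28, 28))
    ax.imshow(imarray, cmap='Greys', interpolation='none')
    ax.set_title("Label is " + linebits[0])
```

With the real data you would replace the made-up records with the list of lines read from the CSV file.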



UPDATE: The book is out! - and provides examples of working with the MNIST data set, as well as using your own handwriting to create a test dataset.



21 comments:

  1. Hi,
    Could you please help me on how to create a dataset for my own image database which contains non-linear image dimensions each almost equal to 900x960 pixels in size?

    Replies
    1. Sure - you can resize the images to the correct size. You can do this in many ways .. eg.

      1. you can use an image editor like GIMP or Photoshop to batch process the images to the right size
      2. use command line tools like ImageMagick to script this
      3. use Python's image libraries to resize the images before they become input to the neural network .. google search for "python rescale image array"

      (sorry for late reply .. I didn't see your comment)
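
To give a flavour of the third option, here is a rough sketch that shrinks a greyscale image held in a numpy array down to 28 by 28. It uses nearest-neighbour sampling, which is the crudest kind of resize and is my own stand-in here; a proper image library would normally average neighbouring pixels for a smoother result. The function name and the 900x960 test image are made up for illustration.

```python
import numpy

def shrink_nearest(image, size=28):
    """Shrink a 2D greyscale array to size x size by sampling pixels."""
    height, width = image.shape
    # pick the source row and column at the centre of each target cell
    rows = (numpy.arange(size) + 0.5) * height / size
    cols = (numpy.arange(size) + 0.5) * width / size
    return image[numpy.ix_(rows.astype(int), cols.astype(int))]

# a made-up 900x960 "image": dark left half, bright right half
big = numpy.zeros((900, 960))
big[:, 480:] = 255.0

small = shrink_nearest(big)
print(small.shape)
```

Note that a 900x960 image is not square, so squeezing it into 28 by 28 this way stretches it slightly; cropping or padding it to a square first avoids that.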

    2. Hi, how do I process my image into MNIST format? I do not know how to center the digit in a 28*28 image using the center of mass.

    3. Hi, I believe the MNIST data set is already centred.
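
For your own images, the following is a minimal sketch of centring by centre of mass, assuming the digit is already in a 28 by 28 numpy array with a blank margin around it. The centre of mass is the ink-weighted average row and column, and the image is shifted so that point lands in the middle. The function name and the test image are made up, and this is not necessarily the exact procedure used to prepare MNIST.

```python
import numpy

def centre_by_mass(image):
    """Shift a 2D array so its centre of mass sits at the middle."""
    total = image.sum()
    if total == 0:
        return image.copy()   # blank image, nothing to centre
    rows, cols = numpy.indices(image.shape)
    # centre of mass = ink-weighted average row and column
    com_row = (rows * image).sum() / total
    com_col = (cols * image).sum() / total
    shift_row = int(round((image.shape[0] - 1) / 2 - com_row))
    shift_col = int(round((image.shape[1] - 1) / 2 - com_col))
    # numpy.roll wraps around the edges, which is harmless only
    # if the digit has a blank margin around it
    return numpy.roll(image, (shift_row, shift_col), axis=(0, 1))

# made-up example: a 4x4 blob of ink near the top-left corner
img = numpy.zeros((28, 28))
img[2:6, 3:7] = 255.0
centred = centre_by_mass(img)
```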

  2. Hello,
    I wanted to thank you for your article.
    I had a question please.
    Can you please explain step by step how to upload my own images, resize them etc., in order to create a neural network?
    I have looked to find examples but I can't.
    Thank you

    Replies
    1. Hi Nader - the process involves the following steps:

      1. Understand the size of images you want to process with your neural network. You don't always have control over the images - so the size is predetermined.

      2. Decide what size neural network you want. It is best if the input images match the network so you don't lose information. In many cases however, you will have to resize your images to the size of your network input layer.

      How do you load and resize images? Have a look at this post which explains it, and includes code on GitHub for you to see: http://makeyourownneuralnetwork.blogspot.co.uk/2016/03/your-own-handwriting-real-test.html

    2. Thank you for the reply.
      I have the following code:

      I get an error when trying to apply the below code onto the MNIST sample dataset for both training and testing.

      The error is:
      Exception: Error when checking model input: expected dense_input_2 to have shape (None, 784) but got array with shape (784L, 1L)

      import pandas
      import numpy
      from keras.datasets import mnist
      from keras.models import Sequential
      from keras.layers import Dense
      from keras.layers import Dropout
      from keras.utils import np_utils
      # fix random seed for reproducibility
      seed = 7
      numpy.random.seed(seed)
      # Read in the TRAINING dataset
      f = open("C:/Users/USER/Desktop/mnist/mnist_train_100.csv", 'r')
      a = f.readlines() # place everything in a list called 'a'
      #print(a)
      f.close()
      # go through the list a and split by comma
      output_nodes = 10
      for record in a: # go through the big list "a"
          all_values = record.split(',')
          X_train = (numpy.asfarray(all_values[1:]) / 255.0 * 0.99) + 0.01
          y_train = numpy.zeros(output_nodes) + 0.01
          y_train[int(all_values[0])] = 0.99
      # Read in the TEST data set and then split
      f = open("C:/Users/USER/Desktop/mnist/mnist_test_10.csv", 'r')
      a = f.readlines() # place everything in a list called 'a'
      #print(a)
      f.close()
      # go through the list a and split by comma
      for record in a: # go through the big list "a"
          all_values = record.split(',')
          X_test = (numpy.asfarray(all_values[1:]) / 255.0 * 0.99) + 0.01
          y_test = numpy.zeros(output_nodes) + 0.01
          y_test[int(all_values[0])] = 0.99

      num_pixels = len(X_train)
      # define baseline model
      def baseline_model():
          # create model
          model = Sequential()
          model.add(Dense(num_pixels, input_dim=num_pixels, init='normal', activation='relu'))
          model.add(Dense(output_nodes, init='normal', activation='softmax'))
          # Compile model
          model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
          return model
      ## build the model
      #model = baseline_model()
      ## Fit the model
      #model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=10, batch_size=200,verbose=2)

    3. I'm not familiar with keras.

      However the error message suggests the input isn't described sufficiently.

      I hope this keras page explaining how to describe the inputs helps:
      http://keras.io/getting-started/sequential-model-guide/
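
I can't test this against keras, so treat the following as a numpy-only sketch of what the error message seems to be asking for. A shape of (None, 784) means one row of 784 values per record, whereas the loops in the question overwrite X_train and y_train on every pass, so only the last record survives. Collecting the records in lists and stacking them at the end gives two-dimensional arrays. The records below are made up, and load_records is a name I've chosen for illustration.

```python
import numpy

def load_records(lines, output_nodes=10):
    """Turn CSV lines into an (N, 784) input array and (N, 10) targets."""
    inputs, targets = [], []
    for record in lines:
        all_values = record.split(',')
        # scale pixels from 0..255 into 0.01..1.00, as in the question
        pixels = numpy.asarray(all_values[1:], dtype=float)
        inputs.append((pixels / 255.0 * 0.99) + 0.01)
        # target is 0.01 everywhere except 0.99 at the label's position
        target = numpy.zeros(output_nodes) + 0.01
        target[int(all_values[0])] = 0.99
        targets.append(target)
    return numpy.array(inputs), numpy.array(targets)

# three made-up records standing in for the lines of the CSV file
lines = [','.join([str(d)] + ['128'] * 784) + '\n' for d in (5, 0, 4)]
X_train, y_train = load_records(lines)
print(X_train.shape, y_train.shape)   # (3, 784) (3, 10)
```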

  3. Thank you for taking the time to reply.
    I will look at the link.
    Please release your book on Amazon or any other place, so I can buy it.
    I am looking forward to more books from you.
    I bought your book from amazon, but they are having technical issues with it.

    Replies
    1. Thanks Nader.

      What are the technical issues? Email me on makeyourownneuralnetwork@gmail.com and we'll fix it if Amazon can't.

  4. Can the mnist_train_100.csv file be used to train an RBM in a MapReduce framework?

    Replies
    1. Hi - I think there are two questions here:

      1. is it possible to use the CSV file to train an RBM?
      2. is this good data for a training set?

      I think the answers are ...

      1. CSV is a very simple and open format - many frameworks can use it, and if not, it is easily convertible to other formats. I don't know which specific framework you are using .. I don't use them myself.

      2. this is a small sample of the MNIST training data with only 100 entries. You should use it only to test early versions of your code .. and then use the full data linked in the blog post eg http://www.pjreddie.com/media/files/mnist_train.csv

  5. How can I create my own CSV file containing my image information and labels, and feed it to Theano?

    Replies
    1. i don't know about theano but there are usually 2 ways to read image information.

      one is to process the image file yourself and extract the data.

      the easier option is to read image into an array (or 3 arrays for red, green, blue) as per the examples in the book or on this blog .. have a look at http://www.scipy-lectures.org/advanced/image_processing/
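
However you get the image into an array, writing it out in the same layout as the MNIST CSV files is short work. Here is a minimal sketch, assuming a greyscale image already held as a 28 by 28 numpy array with values from 0 to 255. The function name to_csv_line and the example image are made up for illustration.

```python
import numpy

def to_csv_line(label, image):
    """Format a label and a 2D greyscale array as one MNIST-style CSV line."""
    # flatten row by row, so the 28x28 array becomes 784 values
    pixels = numpy.asarray(image).astype(int).flatten()
    return ','.join([str(label)] + [str(p) for p in pixels])

# made-up 28x28 image: a vertical stroke, labelled as a "1"
img = numpy.zeros((28, 28))
img[4:24, 14] = 255

line = to_csv_line(1, img)
print(line[:20])
```

Writing one such line per image to a text file gives a CSV in the layout described in the post, with the label first and the 784 pixel values after it.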

  6. Hello sir. I have a problem using Theano. I have my own handwritten images, and I created a CSV file which contains the label for each image. I use some Python code to combine my images and their labels into a .pkl file. This code splits my images into 3 parts: 1. training set, 2. validation set and 3. test set. Now when I try to feed this .pkl file to Theano, the desired "model file" is not generated and I get the following result...
    ... loading data
    ... building the model
    ... training the model
    Optimization complete with best validation score of inf %,with test performance
    0.000000 %
    The code run for 1000 epochs, with 2373018.624642 epochs/sec
    The code for file mn.py ran for 0.0s
    I can't understand the problem.
    I use the Theano code that is provided for MNIST.
    Could you kindly help me to sort out this problem?
    Thank you.

    Replies
    1. hi Asif - I have no experience of Theano at all so I can't help you.
      Perhaps asking on stack overflow?
      Maybe this link will help .. someone else doing MNIST with Theano: http://deeplearning.net/tutorial/gettingstarted.html

  7. Is there any way to edit the CSV files to add another label for each image, like even or odd, for example to make it predict the number and whether it is even or odd?

    Replies
    1. Hi Tasneem - you can do this in several ways.

      You can do it manually by opening the csv file in a text editor (like sublimetext, or even notepad) and doing it yourself.

      You can open it in a spreadsheet program (excel, google sheets, etc) and do a search/replace

      Or you can write a program or script to load the csv and replace the label and save the data back as a csv.

      Myself I would probably use a script.

      I hope that helps.

      Remember CSV files are just plain text files ... don't save as a spreadsheet file (like .xls) as you will change the format and make it less easy to manipulate.
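
As a rough idea of the script approach, here is a minimal sketch using Python's built-in csv module. It copies each record and inserts "even" or "odd" as a second value, straight after the digit label. Putting the extra label in second place is just my choice for this example, as is the function name, and the records are made up with only 3 pixel values each to keep things short.

```python
import csv
import io

def add_parity_label(in_file, out_file):
    """Copy CSV records, adding 'even' or 'odd' after the digit label."""
    writer = csv.writer(out_file, lineterminator='\n')
    for row in csv.reader(in_file):
        if not row:
            continue   # skip blank lines
        parity = 'even' if int(row[0]) % 2 == 0 else 'odd'
        writer.writerow([row[0], parity] + row[1:])

# made-up records, held in memory instead of in real files
source = io.StringIO("7,0,255,0\n4,12,0,0\n")
result = io.StringIO()
add_parity_label(source, result)
print(result.getvalue())
```

With real data you would pass in file objects from open() instead of the in-memory ones. Any code that reads the new file then has to expect the pixel values to start at the third value rather than the second.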

  8. I need a separate CSV file for each MNIST digit. For example, I want a CSV file for the MNIST "0" images only. How can I get that?

    Replies
    1. Hey Suzan ... the best way would be to write a script to filter out the records (lines) in the csv files ... if you're using linux or macos you can use the grep or fgrep commands .. google how to use them.

      If you don't like such command line approaches, you could write a short python program .. perhaps load the csv into a python pandas dataframe and select out the '0' images into a new frame and save that as a separate file.

      Google pandas csv reader and you'll find it easy enough.
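
If you'd rather not install pandas, the same filtering can be sketched in plain Python. The important detail, which also applies to grep, is to test only the first value on each line, because a "0" also appears among the pixel values of almost every record. The function name and the records below are made up, with only 2 pixel values each to keep the example short.

```python
def select_digit(lines, digit):
    """Keep only the records whose label (first value) matches digit."""
    wanted = str(digit)
    return [line for line in lines
            if line.split(',', 1)[0].strip() == wanted]

# made-up records standing in for the lines of the CSV file
records = ["0,0,255\n", "7,12,0\n", "0,30,40\n", "10,1,1\n"]
zeros = select_digit(records, 0)
print(zeros)
```

With a real file you would read the records with readlines() and write the selected ones out with writelines().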
