Tuesday, March 29, 2011

experiment with handwritten digit recognition and SVM

For some time I have been working through various maths texts in order to build a foundation for understanding machine learning algorithms. A lot of theory without application becomes a bit boring, and I needed to take some time off and do something different. So I was looking to experiment with some well-known problem in machine learning, just for fun.

Last night I downloaded the handwritten digits dataset from the UCI machine learning repository, mainly the two files optdigits.tra (the training set) and optdigits.tes (the test set). Each instance in the dataset consists of 64 features (each an integer in the range 0..16) and the class label (the actual digit, 0..9). The features are obtained by the process described on the handwritten character dataset page..

..We used preprocessing programs made available by NIST to extract normalized bitmaps of handwritten digits from a preprinted form. From a total of 43 people, 30 contributed to the training set and different 13 to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of 4x4 and the number of on pixels are counted in each block. This generates an input matrix of 8x8 where each element is an integer in the range 0..16. This reduces dimensionality and gives invariance to small distortions..
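The block-counting step described in that quote is easy to sketch in code. Below is a minimal illustration (my own, not the NIST preprocessing program), assuming the input is a 32x32 binary bitmap given as a nested list of 0s and 1s:

```python
def downsample(bitmap):
    """Reduce a 32x32 binary bitmap to an 8x8 grid of on-pixel counts.

    Each element of the result is the number of on pixels in the
    corresponding non-overlapping 4x4 block, so it lies in 0..16.
    """
    grid = []
    for br in range(8):                # block row
        row = []
        for bc in range(8):            # block column
            count = sum(bitmap[4 * br + r][4 * bc + c]
                        for r in range(4) for c in range(4))
            row.append(count)
        grid.append(row)
    return grid

# A bitmap that is entirely "on" yields 16 in every cell.
all_on = [[1] * 32 for _ in range(32)]
print(downsample(all_on)[0][0])  # 16
```

This is where the invariance to small distortions comes from: shifting a stroke by a pixel or two barely changes the per-block counts.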


Anyway, so essentially the learning problem is to predict the class (0 to 9) from the 64 features. I had heard that SVMs are now the state-of-the-art approach for OCR (earlier it used to be neural networks) and decided to use an SVM on the dataset.
I have some familiarity with Weka, whose Explorer makes it very easy to experiment with various machine learning algorithms. From here on, this post is very Weka-centric: I describe the steps I took to apply an SVM to the dataset above.

First, I converted both dataset files from CSV to Weka's ARFF format. I opened the CSV files in Excel and gave each column a unique name (Weka expects each attribute to have a unique name), then saved the files back as CSV. Next, I loaded each file into Weka using the "Open file..." button on the Weka Explorer "Preprocess" tab, then saved the loaded dataset as an ARFF file. Only one more change was needed: I opened the ARFF file in a text editor and changed the type of the last attribute (the column that represents the actual digit) from numeric to nominal, so that Weka could apply classification algorithms to it, as follows...

@attribute @@class@@ {0,1,2,3,4,5,6,7,8,9}
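The Excel round-trip above can also be scripted. Here is a minimal sketch that builds ARFF text directly from the optdigits CSV rows; the attribute names ("pixel0".."pixel63", "class") are my own choice, since Weka only requires that they be unique:

```python
def csv_to_arff(csv_lines, relation="optdigits"):
    """Build ARFF text from optdigits CSV rows (64 features + class)."""
    header = ["@relation " + relation]
    # 64 numeric pixel-count attributes, plus a nominal class attribute
    # so Weka treats this as a classification problem.
    header += ["@attribute pixel%d numeric" % i for i in range(64)]
    header.append("@attribute class {0,1,2,3,4,5,6,7,8,9}")
    header.append("@data")
    data = [line.strip() for line in csv_lines if line.strip()]
    return "\n".join(header + data)

# Example with a single all-zero instance labeled "5".
text = csv_to_arff([",".join(["0"] * 64 + ["5"])])
print(text.splitlines()[0])  # @relation optdigits
```

In practice you would read optdigits.tra line by line, pass the lines to this function, and write the result out as optdigits.tra.arff.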

I am making both ARFF files available in case you want to skip the steps above.
optdigits.tra.arff
optdigits.tes.arff

Next, I loaded optdigits.tra.arff in the Weka Explorer using the "Open file..." button. (I have highlighted the part showing the number of instances, i.e., the number of training examples.)



Next, on the Weka Explorer, go to the "Classify" tab and click the "Choose" button to select SMO, as shown in the image below. (If you click the command text next to the "Choose" button, a window opens explaining that SMO implements the sequential minimal optimization algorithm for training a support vector classifier.)




Next, to load the test dataset, we select the "Supplied test set" radio button, click the "Set..." button, and use the dialog to load the test set file.

Then we click the "Start" button, which builds the model (i.e., the classifier), runs it on the test dataset, and shows the evaluation results below.

=== Evaluation on test set ===
=== Summary ===

Correctly Classified Instances 1734 96.4942 %
Incorrectly Classified Instances 63 3.5058 %
Kappa statistic 0.961
Mean absolute error 0.1603
Root mean squared error 0.2721
Relative absolute error 89.0407 %
Root relative squared error 90.6884 %
Total Number of Instances 1797

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.994 0.001 0.989 0.994 0.992 1 0
0.984 0.01 0.918 0.984 0.95 0.996 1
0.96 0.003 0.971 0.96 0.966 0.993 2
0.934 0.002 0.983 0.934 0.958 0.987 3
0.994 0.003 0.973 0.994 0.984 0.997 4
0.989 0.008 0.933 0.989 0.96 0.996 5
0.983 0.001 0.994 0.983 0.989 0.999 6
0.922 0.002 0.982 0.922 0.951 0.998 7
0.943 0.004 0.965 0.943 0.953 0.982 8
0.944 0.006 0.95 0.944 0.947 0.991 9
Weighted Avg. 0.965 0.004 0.966 0.965 0.965 0.994

=== Confusion Matrix ===

a b c d e f g h i j <-- classified as
177 0 0 0 1 0 0 0 0 0 | a = 0
0 179 0 0 0 0 1 0 1 1 | b = 1
0 7 170 0 0 0 0 0 0 0 | c = 2
1 1 4 171 0 2 0 3 1 0 | d = 3
0 0 0 0 180 0 0 0 1 0 | e = 4
0 0 1 0 0 180 0 0 0 1 | f = 5
1 0 0 0 1 0 178 0 1 0 | g = 6
0 0 0 0 1 7 0 165 1 5 | h = 7
0 7 0 0 0 1 0 0 164 2 | i = 8
0 1 0 3 2 3 0 0 1 170 | j = 9
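The per-class figures in the detailed accuracy table can be recomputed directly from the confusion matrix. As a check, here is the calculation for digit 2, using its row and column copied from the matrix above:

```python
# Row and column for class "2" (label c), taken from the confusion matrix.
row_c = [0, 7, 170, 0, 0, 0, 0, 0, 0, 0]   # true 2s, by predicted class
col_c = [0, 0, 170, 4, 0, 1, 0, 0, 0, 0]   # predicted 2s, by true class

tp = row_c[2]                  # 170 correctly classified 2s
precision = tp / sum(col_c)    # 170 / 175: of everything predicted "2"
recall = tp / sum(row_c)       # 170 / 177: of all the actual 2s
print(round(precision, 3), round(recall, 3))  # 0.971 0.96
```

These match the Precision and Recall values reported for class 2 in the table above.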


It is interesting to see that we could get about 96.5% correct classifications with such ease :).
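For readers who prefer scripting to the Explorer GUI, roughly the same experiment can be reproduced outside Weka. Below is a sketch using scikit-learn, whose bundled digits dataset is derived from this same UCI optdigits data (1797 instances of 64 features in 0..16). The train/test split here is my own, and a linear-kernel SVC is only roughly comparable to Weka's SMO defaults, so the numbers will not match the Weka output exactly:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 1797 instances, 64 features each (4x4 block on-pixel counts, 0..16).
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0)

# A linear-kernel support vector classifier.
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)
print("accuracy: %.3f" % clf.score(X_test, y_test))
```

As with the Weka run, accuracy in the mid-to-high 90s comes almost for free on this dataset.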

References:
Handwritten Character Dataset
Weka and the Weka book, for a practical introduction to machine learning and Weka

Note: This experiment was done with weka version 3.6.4
