Thursday, March 31, 2011

more: experiment with handwritten digit recognition and SVM

In case you haven't already, read the last post first.

In my last post, I wrote about a little experiment on handwritten digit recognition using the WEKA Explorer GUI. Here is a super simple Scala script that does the same thing with code instead of the GUI.

import weka.core.Instances
import weka.core.converters.ArffLoader.ArffReader
import weka.classifiers.{Classifier, Evaluation}
import weka.classifiers.functions.SMO
import java.io.{FileReader, BufferedReader}

object OCR {
  def main(args: Array[String]): Unit = {
    val trainingSet = trainingInstances
    evaluate(trainingSet, testInstances, svmClassifier(trainingSet))
  }

  // read an ARFF file and mark the last attribute as the class
  def getInstances(src: String) = {
    val reader = new BufferedReader(new FileReader(src))
    val arffReader = new ArffReader(reader)
    val instances = arffReader.getData()
    instances.setClassIndex(instances.numAttributes() - 1)
    instances // return the read instances
  }

  def trainingInstances = getInstances("\\path\\to\\optdigits.tra.arff")
  def testInstances = getInstances("\\path\\to\\optdigits.tes.arff")

  // train a multiclass SVM (SMO) on the given instances
  def svmClassifier(instances: Instances) = {
    val smo = new SMO()
    smo.buildClassifier(instances)
    smo // return the trained multiclass SVM classifier
  }

  // evaluate the trained model on the test set and print a summary
  def evaluate(trainingSet: Instances, testSet: Instances, model: Classifier): Unit = {
    val eval = new Evaluation(trainingSet)
    eval.evaluateModel(model, testSet)
    println(eval.toSummaryString("Results:\n", true))
  }
}

On running the above, this is what I get...
Results:

Correctly Classified Instances 1734 96.4942 %
Incorrectly Classified Instances 63 3.5058 %
Kappa statistic 0.961
K&B Relative Info Score 53500.7814 %
K&B Information Score 1777.1913 bits 0.989 bits/instance
Class complexity | order 0 5969.5443 bits 3.322 bits/instance
Class complexity | scheme 4192.4976 bits 2.3331 bits/instance
Complexity improvement (Sf) 1777.0468 bits 0.9889 bits/instance
Mean absolute error 0.1603
Root mean squared error 0.2721
Relative absolute error 89.0407 %
Root relative squared error 90.6884 %
Total Number of Instances 1797


References:
Use WEKA in your Java Code

Tuesday, March 29, 2011

experiment with handwritten digit recognition and SVM

For some time I have been working through various maths texts in order to build a foundation for understanding machine learning algorithms. A lot of theory without application becomes a bit boring, and I needed to take some time off and do something different. So I was looking to experiment with some well known problem in machine learning, just for fun.

Last night I downloaded the handwritten character dataset from the UCI machine learning repository, mainly the two files optdigits.tra (the training set) and optdigits.tes (the test set). Each instance in the dataset basically consists of 64 features (each an integer in the range 0..16) and the class (the actual digit, 0..9). These are obtained by the process described on the handwritten character dataset page..

..We used preprocessing programs made available by NIST to extract normalized bitmaps of handwritten digits from a preprinted form. From a total of 43 people, 30 contributed to the training set and different 13 to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of 4x4 and the number of on pixels are counted in each block. This generates an input matrix of 8x8 where each element is an integer in the range 0..16. This reduces dimensionality and gives invariance to small distortions..


Anyway, so essentially the learning problem is to predict the class (0 to 9) from the 64 features. I had heard somewhere that SVMs are the state of the art for OCR these days (earlier it used to be neural networks), so I decided to use an SVM on the dataset.
I have some familiarity with WEKA, and the WEKA Explorer makes it very easy to experiment with various machine learning algorithms. From here on this post is very WEKA-centric; I describe the steps I took to apply an SVM to the dataset mentioned above.

First I converted both dataset files from CSV to WEKA's ARFF format. I opened the CSV files in Excel and gave each column a unique name (as WEKA expects each attribute to have a unique name), then saved the file back as CSV. Next, I loaded it into WEKA using the "Open File..." button on the WEKA Explorer "Preprocess" tab and saved the loaded dataset as an ARFF file. It needed only one more change: I opened the ARFF file in a text editor and changed the type of the last attribute (the column that represents the actual digit) from numeric to nominal as follows, so that WEKA could apply classification algorithms to it...

@attribute @@class@@ {0,1,2,3,4,5,6,7,8,9}
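
As an aside, the same conversion can also be scripted with WEKA's converter classes instead of going through Excel and a text editor. The sketch below is only an illustration: it assumes the CSV already has a header row with unique column names (the file paths are placeholders), loads it, converts the last attribute from numeric to nominal with the NumericToNominal filter, and writes the result out as ARFF.

import java.io.File
import weka.core.converters.{ArffSaver, CSVLoader}
import weka.filters.Filter
import weka.filters.unsupervised.attribute.NumericToNominal

object CsvToArff {
  def main(args: Array[String]): Unit = {
    // load the CSV (assumes a header row with unique column names, as prepared in Excel)
    val loader = new CSVLoader()
    loader.setSource(new File("\\path\\to\\optdigits.tra.csv")) // placeholder path
    val data = loader.getDataSet()

    // change the last (class) attribute from numeric to nominal,
    // the same edit I made by hand in the ARFF file
    val toNominal = new NumericToNominal()
    toNominal.setAttributeIndices("last")
    toNominal.setInputFormat(data)
    val nominalData = Filter.useFilter(data, toNominal)

    // write the converted dataset out as ARFF
    val saver = new ArffSaver()
    saver.setInstances(nominalData)
    saver.setFile(new File("\\path\\to\\optdigits.tra.arff")) // placeholder path
    saver.writeBatch()
  }
}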

I am making both the ARFF files available in case you want to skip the above steps.
optdigits.tra.arff
optdigits.tes.arff

Next, I loaded optdigits.tra.arff in the WEKA Explorer using the "Open File..." button. (I have highlighted the part showing the number of instances, i.e. the number of training examples.)



Next, on the WEKA Explorer, go to the "Classify" tab and click on the "Choose" button to select SMO, as shown in the image below. (If you click on the command written next to the "Choose" button, it opens a screen explaining that SMO is the sequential minimal optimization algorithm for training a support vector classifier.)
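
(Incidentally, the description shown on that screen is also available from code. A tiny sketch, assuming weka.jar is on the classpath:)

import weka.classifiers.functions.SMO

// prints the same description of the classifier that the GUI shows
val smo = new SMO()
println(smo.globalInfo())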




Next, to load the test dataset, we select the "Supplied test set" radio button, click on the "Set..." button, and use the wizard to load the test dataset file.

Then we click on the "Start" button, which builds the model (the classifier), runs it on the test dataset, and shows the evaluation results as below.

=== Evaluation on test set ===
=== Summary ===

Correctly Classified Instances 1734 96.4942 %
Incorrectly Classified Instances 63 3.5058 %
Kappa statistic 0.961
Mean absolute error 0.1603
Root mean squared error 0.2721
Relative absolute error 89.0407 %
Root relative squared error 90.6884 %
Total Number of Instances 1797

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.994 0.001 0.989 0.994 0.992 1 0
0.984 0.01 0.918 0.984 0.95 0.996 1
0.96 0.003 0.971 0.96 0.966 0.993 2
0.934 0.002 0.983 0.934 0.958 0.987 3
0.994 0.003 0.973 0.994 0.984 0.997 4
0.989 0.008 0.933 0.989 0.96 0.996 5
0.983 0.001 0.994 0.983 0.989 0.999 6
0.922 0.002 0.982 0.922 0.951 0.998 7
0.943 0.004 0.965 0.943 0.953 0.982 8
0.944 0.006 0.95 0.944 0.947 0.991 9
Weighted Avg. 0.965 0.004 0.966 0.965 0.965 0.994

=== Confusion Matrix ===

a b c d e f g h i j <-- classified as
177 0 0 0 1 0 0 0 0 0 | a = 0
0 179 0 0 0 0 1 0 1 1 | b = 1
0 7 170 0 0 0 0 0 0 0 | c = 2
1 1 4 171 0 2 0 3 1 0 | d = 3
0 0 0 0 180 0 0 0 1 0 | e = 4
0 0 1 0 0 180 0 0 0 1 | f = 5
1 0 0 0 1 0 178 0 1 0 | g = 6
0 0 0 0 1 7 0 165 1 5 | h = 7
0 7 0 0 0 1 0 0 164 2 | i = 8
0 1 0 3 2 3 0 0 1 170 | j = 9


It is interesting to see that we could get about 96.5% correct classifications with such ease :).

References:
Handwritten Character Dataset
Weka and the Weka Book for practical introduction to machine learning and Weka

Note: This experiment was done with weka version 3.6.4

Thursday, March 10, 2011

concurrency concepts in databases and hibernate

I've read about this multiple times but often forget the details, as I don't often need to work on it in my applications. So here I am taking quick notes for a faster recap later.

The first thing to know is what issues can arise from multiple concurrent transactions if no preventive measures are taken.

lost update : I have found different meanings for this term. The Hibernate book describes it as follows: one transaction commits its modifications and another one rolls back, which results in undoing the updates made by the other transaction.
However, in many other places lost update means what is described later as the second lost update problem, and the ANSI SQL-92 standard does not seem to mention it at all.

dirty read : One transaction fetches a record and modifies it. Another transaction fetches the same record and sees the modifications made by the first transaction even though they are not committed.

unrepeatable read : One transaction reads a record, another transaction commits modifications to the same record, and when the first transaction reads the same record again it finds it different from what was read previously.
A special case of the unrepeatable read is the second lost update problem. Let's say one transaction reads a record and modifies it, and another transaction reads the same record and modifies it too. The first one commits its modifications, then the second one commits, and the modifications made by the first transaction are lost.

phantom read : One transaction reads a list of records matching some filter criteria. Another transaction starts, inserts records that fulfill that filter criteria, and commits. The first transaction reads the same list again with the same filter criteria but ends up getting a different set of records this time.

Now, to overcome the above issues, the ANSI standard defines the following transaction isolation levels (see the small JDBC sketch after the last level for how one is selected on a connection).

read uncommitted : this permits all three: dirty reads, unrepeatable reads and phantom reads.

read committed : this does not permit dirty reads but allows unrepeatable as well as phantom reads.

repeatable read : this does not permit dirty and unrepeatable reads but allows phantom reads.

serializable : this is the strictest transaction isolation level; it does not allow any of dirty reads, unrepeatable reads or phantom reads, and makes concurrent transactions behave as if they were executed serially. It is usually not used in typical applications because it is too restrictive and performs badly.
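
Here is the promised sketch: a minimal example of selecting an isolation level on a plain JDBC connection. The in-memory H2 URL and credentials are placeholders, not part of any real setup.

import java.sql.{Connection, DriverManager}

val conn: Connection = DriverManager.getConnection("jdbc:h2:mem:demo", "sa", "") // placeholder URL
try {
  conn.setAutoCommit(false)
  // one of TRANSACTION_READ_UNCOMMITTED, TRANSACTION_READ_COMMITTED,
  // TRANSACTION_REPEATABLE_READ or TRANSACTION_SERIALIZABLE
  conn.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ)
  // ... do the transactional work ...
  conn.commit()
} finally {
  conn.close()
}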

The exact implementation of the above isolation levels varies significantly among vendors. One has to consult the docs of the particular database to understand the impact of each isolation level on performance and scalability.

In Hibernate, by default, every JDBC connection to a database uses the default isolation level of the DBMS, usually read committed or repeatable read. You can change this by explicitly setting the hibernate.connection.isolation property; Hibernate will then set that isolation level on every new JDBC connection (note that it will not change the default isolation level of your database).
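
A minimal sketch of what that looks like with a programmatic Hibernate 3 Configuration (assuming a hibernate.cfg.xml on the classpath); the property value is the integer constant from java.sql.Connection, e.g. 2 for read committed.

import java.sql.Connection
import org.hibernate.cfg.Configuration

val cfg = new Configuration()
  .configure() // reads hibernate.cfg.xml from the classpath
  .setProperty("hibernate.connection.isolation",
               Connection.TRANSACTION_READ_COMMITTED.toString) // 2 = read committed

val sessionFactory = cfg.buildSessionFactory()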

Optimistic concurrency control:
An optimistic approach assumes that everything will be OK and detects conflicts only at the end, when data is written/flushed to the database.
Multi-user applications usually default to optimistic concurrency control and database connections with a read-committed isolation level.
Let's say the isolation level is set to read committed. Transactions A and B start at the same time and both read and modify the same record (they cannot see each other's changes, since dirty reads are not permitted at the read-committed isolation level). Transaction A commits and then B does. Then one of the following three things can happen.

last commit wins : both transactions commit successfully, B overrides any modifications made by A, and no error is raised.

first commit wins : when the second transaction (B) commits, the conflict is detected and an error is raised.

merge conflicting updates : conflicts are detected and an interactive dialog helps resolve them.

With Hibernate, you get last commit wins by default. You can enable the first-commit-wins strategy by enabling optimistic concurrency control, which requires versioning to be enabled for your entities.
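
To make that concrete, here is a minimal sketch of a versioned entity using JPA annotations from Scala. The Item class and its fields are made up for illustration, and depending on your Scala version you may need the scala.annotation.meta.field meta-annotation to make sure the annotations end up on the generated fields.

import javax.persistence.{Entity, GeneratedValue, Id, Version}

@Entity
class Item { // hypothetical entity, for illustration only
  @Id @GeneratedValue
  var id: java.lang.Long = _

  // Hibernate increments this column on every update and uses it to detect
  // conflicting concurrent modifications, giving first-commit-wins behaviour
  @Version
  var version: Int = _

  var description: String = _
}

With versioning in place, when two sessions load the same Item and both modify it, the second one to flush fails with a StaleObjectStateException instead of silently overwriting the first update.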


In the end, if you need more fine-grained control over locking, you can issue variants of "SELECT ... FOR UPDATE ..." statements by using LockMode.UPGRADE et al. This is called explicit pessimistic locking.
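
For completeness, a rough sketch with the Hibernate 3 API, reusing the hypothetical Item entity from the previous snippet:

import org.hibernate.{LockMode, Session}

// loads the row with a database-level lock (typically "SELECT ... FOR UPDATE"),
// held until the transaction ends
def loadForUpdate(session: Session, itemId: java.lang.Long): Item =
  session.load(classOf[Item], itemId, LockMode.UPGRADE).asInstanceOf[Item]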


Reference:
Chapter-9: Transactions and Concurrency in the book "Java Persistence with Hibernate".
ANSI SQL-92 Standard Spec