What is a good performance goal? If this problem were being started from scratch, it would be reasonable to create a baseline classifier with a simple method: count how often each digit appears in the training set and always predict the most common one. With ten evenly distributed digit classes, that answer would be correct about 10% of the time. It shouldn’t be too hard to beat that with a moderately more sophisticated classifier.
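As a rough illustration of that majority-class baseline, here is a minimal sketch in plain Python. The function names and the toy label lists are made up for this example and are not the competition data; with real data the labels would come from the training file.

```python
from collections import Counter


def fit_majority_baseline(train_labels):
    """Return the label that occurs most often in the training labels."""
    counts = Counter(train_labels)
    label, _ = counts.most_common(1)[0]
    return label


def accuracy(predictions, true_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == t for p, t in zip(predictions, true_labels))
    return correct / len(true_labels)


# Toy labels (hypothetical): digit 1 appears twice in training,
# every other digit appears once.
train_labels = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1]
# An evenly distributed test set: one sample per digit.
test_labels = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

majority = fit_majority_baseline(train_labels)
predictions = [majority] * len(test_labels)  # same answer in all cases
baseline_accuracy = accuracy(predictions, test_labels)
print(majority, baseline_accuracy)
```

On an evenly distributed test set like this one, the baseline gets exactly one digit in ten right, which is the 10% floor any real classifier should clear.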
It’s not going to be that easy with this problem though. Kaggle has provided two sample solutions. The first is a random forest of 1000 decision trees that vote on the digit’s identity. The other is a nearest-neighbor classifier that finds the 10 training samples closest in Euclidean distance and lets them vote on the answer. These set the bar pretty high. The competition’s leaderboard shows them scoring 96.829% and 96.557%, respectively, on test data submissions.
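To make the nearest-neighbor idea concrete, here is a small sketch of a k-nearest-neighbor vote in plain Python. It is not Kaggle’s benchmark code, whose details aren’t given here. The helper names, the toy 2-D points, and the tie-breaking behavior are assumptions for illustration. Only the Euclidean distance and the default of 10 neighbors come from the description above.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=10):
    """Predict a label by majority vote among the k closest training samples."""
    # Indices of training samples, nearest first.
    order = sorted(range(len(train_X)), key=lambda i: euclidean(train_X[i], query))
    neighbors = order[:k]
    votes = Counter(train_y[i] for i in neighbors)
    # On a tied vote, most_common returns the label encountered first,
    # which here is the label of the nearer neighbor (an arbitrary choice).
    return votes.most_common(1)[0][0]


# Toy 2-D data (hypothetical); real digit images would be long pixel vectors.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]

near_origin = knn_predict(train_X, train_y, (0.2, 0.1), k=3)
near_five = knn_predict(train_X, train_y, (5.5, 5.2), k=3)
print(near_origin, near_five)
```

This brute-force version compares the query against every training sample, so it is only practical for small data sets. A library implementation with an index structure would be the usual choice at full scale.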