Adapting ML Class Code for Kaggle Data

Kaggle Digit Recognizer series

Sample data from the Machine Learning class is arranged as one sample per row, with values ranging from just below 0 to a little above 1. Each digit image is a flattened 20×20 greyscale pixel array. The Kaggle data is arranged similarly, but its values range from 0 to 255 and its digits are flattened 28×28 pixel arrays. The difference in value ranges should not be a problem for the existing code. The neural network input layer for the Kaggle data will have 784 units versus the 400 used for the ML class data.

So the only code modification at this point is changing:

input_layer_size  = 400;  % 20x20 Input Images of Digits
hidden_layer_size = 25;   % 25 hidden units
num_labels = 10;          % 10 labels, from 1 to 10

to this:

input_layer_size  = 784;  % 28x28 Input Images of Digits
hidden_layer_size = 25;   % 25 hidden units
num_labels = 10;          % 10 labels, from 1 to 10

So we should be ready to go! The first thing we see when running the top-level Octave code is an image of a random selection of input samples.
What? This can't be right! Well, it doesn't actually matter for classification performance, but it is rather awkward to turn your head 90 degrees and look at the computer display in a mirror, so let's fix it. The problem is a difference of opinion between the two data sources on up/down/left/right. The fix is a change to the displayData() function where the display_array is being accumulated. Change this:

        reshape(X(curr_ex, :), example_height, example_width) / max_val;

to this:

        reshape(X(curr_ex, :), example_height, example_width)' / max_val;

That’s it. Just transpose the result of reshape. A single character change.
Much better.
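The transpose is needed because Octave's reshape() fills column-major, while the Kaggle pixels were flattened row-major. A tiny example (not from the original code) shows the effect:

```octave
v = 1:6;            % a 2x3 "image" [1 2 3; 4 5 6] flattened row-major
reshape(v, 3, 2)    % column-major fill gives [1 4; 2 5; 3 6] -- wrong orientation
reshape(v, 3, 2)'   % transposing recovers the original [1 2 3; 4 5 6]
```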

Need Data

Data for the digit recognizer is provided by Kaggle in CSV-formatted files: 42,000 training samples and 28,000 test samples. These can be read into Octave with the csvread() function, but that is painfully slow to witness. Much better to read the CSVs once and save the data in a more easily digestible binary format to speed up development and testing. I find it useful to have smaller slices of the data files available too; there's no need to throw 42,000 samples at the code when you're just testing paths and logic. The code below reads the CSVs and saves full-sized mat files along with 10%, 1%, and 0.1% slices.

%% training data set contains 42000 samples
%% the first row contains column labels

fname = 'train';
samples = csvread(strcat(fname, '.csv'));

samples = samples(2:end, :); % remove the first row (contains column labels)

rand('seed', 3.14159); % ensure this is repeatable
samples = samples(randperm(size(samples, 1)), :); % shuffle the sample order

X = samples(:, 2:end); % separate inputs and outputs
y = samples(:, 1);

y(y==0) = 10; % change zero labels to tens to work with existing code

% save the full training data set
save('-float-binary', strcat(fname, '.mat'), 'X', 'y');

% save the abbreviated training data sets
sizes = [4200 420 42];
for s = sizes
    X = X(1:s, :); % sizes descend, so each slice comes from the previous one
    y = y(1:s, :);
    save('-float-binary', strcat(fname, int2str(s), '.mat'), 'X', 'y');
end

%% test data set contains 28000 samples
%% the first row contains column labels

fname = 'test';

Xtest = csvread(strcat(fname, '.csv'));

Xtest = Xtest(2:end, :); % remove the first row (contains column labels)

% save the full test data set
save('-float-binary', strcat(fname, '.mat'), 'Xtest');

% save the abbreviated test data sets
sizes = [2800 280 28];
for s = sizes
    Xtest = Xtest(1:s, :); % sizes descend, so each slice comes from the previous one
    save('-float-binary', strcat(fname, int2str(s), '.mat'), 'Xtest');
end

i.e. train.csv has been transmogrified into train.mat, train4200.mat, train420.mat, and train42.mat.
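Loading one of these back in later is a one-liner; for example, the 1% training slice saved above:

```octave
load('train420.mat');   % restores X (420x784) and y (420x1) into the workspace
```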

Just as a little sanity check, let's have a look at how the training data target values are distributed. OK, not perfectly uniform, but a reasonable distribution of 0 through 9.
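The check itself is only a few lines. A sketch, assuming the full training set saved above and remembering that zeros were stored as tens:

```octave
load('train.mat');            % X and y, 42000 samples
counts = histc(y, 1:10);      % count labels 1..10 (10 represents digit 0)
disp([(1:10)' counts]);       % each label alongside its population
```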

Performance Goal

What is a good performance goal? If this problem were being started from scratch, it would be reasonable to create a baseline classifier by a simple method such as counting each digit's population in the training set and always answering with the most abundant member. In an evenly distributed population this would be correct 10% of the time, and it shouldn't be too hard to beat with a moderately more sophisticated classifier.
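That baseline is easy to compute from the training labels themselves; a sketch, assuming y has been loaded from train.mat as above:

```octave
load('train.mat');
counts = histc(y, 1:10);
[~, majority] = max(counts);      % the most abundant label
baseline = mean(y == majority);   % accuracy of always guessing that label
printf('majority-class baseline: %.1f%%\n', baseline * 100);
```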

It’s not going to be that easy with this problem, though. Kaggle has provided two sample solutions. The first is a random forest that uses 1000 decision trees and votes on the digit identification. The other is a nearest-neighbor classifier that finds the 10 Euclidean-closest training samples and votes on its solution. These set the bar pretty high: the competition’s leaderboard shows them scoring 96.829% and 96.557% on test data submissions.
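For reference, the nearest-neighbor benchmark is simple enough to sketch in Octave. This is not Kaggle's code, just a plain k=10 Euclidean vote over the training set, slow but illustrative:

```octave
function pred = knn_predict(Xtrain, ytrain, x, k)
    % squared Euclidean distance from row vector x to every training sample
    d = sum((Xtrain - x) .^ 2, 2);   % relies on implicit broadcasting of x
    [~, idx] = sort(d);
    votes = histc(ytrain(idx(1:k)), 1:10);
    [~, pred] = max(votes);          % most common label among the k nearest
end
```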