Googlie MOOCs

At some point late last summer, in down-the-rabbit-hole mode following links I'd found about some MOOC-building tools Google had produced, I ended up on the Course Builder page. The list of courses available at that time was small, perhaps six to eight, on a wide variety of subjects for such a small population. My impression was they had all used various beta forms of Course Builder and were evidence for Google that their builder tools worked. Indeed they did.

I decided to try Data Mining with Weka, from Professor Ian Witten at New Zealand's University of Waikato. What a great course! Only five weeks long, and it gave me some basic abilities with Weka, which I think of as a data workbench: a tool ready to get down to business on understanding what you've got in a dataset and then jump right into analysis. Checking the Course Builder class list today, I found the number of offerings has exploded since last year. And Data Mining with Weka will run a second time beginning in March. Don't miss it this time; it is a very low-pain / high-gain course. Witten will follow up in late April with a new MOOC, More Data Mining with Weka, which will cover huge datasets and neural nets among its topics. I will be there.

Another too-good-to-pass-up MOOC I found on the Course Builder site is Big Data Applications and Analytics by Professor Geoffrey Fox at Indiana University. The course syllabus shows case studies from many domains, and the homework involves Python, visualization, and the cloud. It's not on a hard calendar schedule; you work at your own pace. This is great for me as I already have two other courses in progress.

Andrew, Jeremy, and PayPal


Yes, PayPal! They host TechXploration, a monthly meetup in San Jose. Andrew Ng spoke on deep learning in August 2013. YouTube links to his presentation can be found in the meetup comments and are repeated here.
part 1 (14:32), part 2 (14:37), part 3 (14:54), part 4 (12:09), part 5 (14:38), part 6 (5:41)

Also of interest: Jeremy Howard, Chief Data Scientist and President at Kaggle, spoke in July 2013.
part 1 (19:14), part 2 (19:33), part 3 (14:00), part 4 (17:17)

Debugging Optimizer Failure – Not

Kaggle Digit Recognizer series

From yesterday's post:

Octave appears to be failing on the first iteration inside the fmincg() optimization function. The error message is not very helpful. It complains of being out of memory or an index exceeding its range. No line number given.

I could not reproduce this failure. The retry was on a "fresh" run of Octave, so perhaps the original crash was caused by some weird program state left over from ongoing development. Note to self: when mysterious crashes with unhelpful error messages occur, try restarting Octave.

So then, before the science begins, let's try one more Kaggle submission, this time using all 42,000 samples to train the neural network. As in the last attempt: 350- and 100-node hidden layers, lambda of 1, and 500 training iterations. Run time was about 5,000 seconds (1 hour, 23 minutes), the final cost function value was 0.0671, and self-classified accuracy was 99.714%.

A slight improvement, but probably not meaningful. There is so much randomness in play here, from the random initialization of the Thetas to whatever random selection Kaggle does when scoring results. The cross validation and analysis discussed at the end of yesterday's post really are next.

Adding a Second Hidden Layer

Kaggle Digit Recognizer series

The initial results (previous post) for the digit classifier came in with accuracy 4 points below the Kaggle-provided sample solutions. That was with only two naive attempts: first a neural net with a 25-node hidden layer trained over 50 iterations, then a net with a 300-node hidden layer trained for 5,000 iterations. Some improvement may be gained by tuning the regularization lambda; however, no training-data subsetting or cross validation has been done yet toward that goal.

It seemed reasonable (and interesting!) to modify the code to allow a second hidden layer in pursuit of better results. Where to start? The new code will be cloned from what is already working. The single-hidden-layer functions will remain intact to allow easy side-by-side testing of solutions. The cost function clearly needs to change, so nnCostFunction2H() will handle that. I'm adopting a 2H suffix for functions that support the dual-hidden-layer network model.

I like the confidence that comes from checking the cost function's gradients against numerically computed gradients, so there will be a checkNNGradients2H() as well.

A predict2H() will be needed too. It would be preferable to have all the training and test data prediction code together in a single function, but at this point in development I would rather keep the single and dual hidden layer top-level orchestration code in separate files to avoid if-then toxemia in getting all the bits right. Therefore we'll have trainNN2H.m and runNN2H.m as top-level scripts for training the net and producing predicted classifications for submission to Kaggle.

So the changes are really not that extensive. They must be precise, though; there's no room to get sloppy if the vectorized code is expected to work properly. The part of the existing code that bothered me most deals with reconstructing the Theta matrices. There is too much math going on in the parameters to the reshape() calls. I find code like this hard to read and frightening to consider extending:

% Obtain Theta1 and Theta2 back from nn_params
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));

Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                 num_labels, (hidden_layer_size + 1));

I need a picture of what is happening here!

So then for two hidden layers Theta recovery will look like this:

And from that picture I can see a pattern of how the Theta dimensions progress, giving me a pretty good idea how to proceed if I want to parameterize the hidden layer depth. Also, the Theta recovery is done in at least two places in the existing code, so I've replaced it with a function:

function [Theta1 Theta2 Theta3] = reshapeTheta2H(nn_params, ...
                                                 input_layer_size, ...
                                                 hidden_layer_size, ...
                                                 hidden_layer2_size, ...
                                                 num_labels)
%reshapeTheta2H Recovers Theta matrices for 2 hidden layer NN from flattened vector

    Theta1_size = hidden_layer_size * (input_layer_size + 1);
    Theta2_size = hidden_layer2_size * (hidden_layer_size + 1);
    Theta3_size = num_labels * (hidden_layer2_size + 1);

    Theta1_start = 1;
    Theta2_start = Theta1_size + 1;
    Theta3_start = Theta1_size + Theta2_size + 1;

    Theta1 = reshape(nn_params(Theta1_start : Theta1_size), ...
                     hidden_layer_size, (input_layer_size + 1));

    Theta2 = reshape(nn_params(Theta2_start : (Theta1_size + Theta2_size)), ...
                     hidden_layer2_size, (hidden_layer_size + 1));

    Theta3 = reshape(nn_params(Theta3_start : end), ...
                     num_labels, (hidden_layer2_size + 1));

end
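That dimension pattern generalizes to any number of layers. As a hedged sketch of what parameterizing the hidden layer depth could look like, here is a Python/NumPy version (the function name and layer list are my own inventions, not part of the Octave code; order='F' mirrors Octave's column-major reshape()):

```python
import numpy as np

def reshape_thetas(nn_params, layer_sizes):
    """Recover one Theta matrix per layer transition from a flattened vector.

    layer_sizes: e.g. [784, 350, 100, 10] for input, two hidden layers, output.
    Each Theta has shape (fan_out, fan_in + 1); the +1 is the bias column.
    """
    thetas = []
    start = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        count = fan_out * (fan_in + 1)
        # order='F' matches Octave's column-major reshape()
        thetas.append(nn_params[start:start + count]
                      .reshape((fan_out, fan_in + 1), order='F'))
        start += count
    return thetas

# round-trip check on the digit-net dimensions used in this post
sizes = [784, 350, 100, 10]
total = sum(b * (a + 1) for a, b in zip(sizes[:-1], sizes[1:]))
params = np.random.randn(total)
shapes = [t.shape for t in reshape_thetas(params, sizes)]
print(shapes)  # [(350, 785), (100, 351), (10, 101)]
```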


Now another naive test run, no regularization tuning yet. This neural net will use 350 nodes in the first hidden layer, 100 nodes in the second hidden layer, lambda = 1, and 500 training iterations. The full training set of 42,000 samples will be used.

But no! Octave appears to fail on the first iteration inside the fmincg() optimization function. The error message is not very helpful: it complains of being out of memory or of an index exceeding its range, with no line number given. This needs investigation, but not right now; I'd really like to see some results.

Cutting the training set down by 20% to 33,600 samples works with no complaints. Run time is just under 3,600 seconds (1 hour). The final-iteration (500) cost function value is 0.0596 and self-classified accuracy is 99.821%. The Kaggle submission for this net scored 96.486% accuracy.

That is an improvement, but a modest one. It is still under the Kaggle sample solution performance, though getting closer. Now it's time to put the science in Data Science. The next to-do is cross validation to find a proper lambda. After that comes examining training vs. cross validation error rates over a range of sample set sizes, which should tell whether the model is suffering from high bias or high variance.
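The lambda search itself is mechanical: train at each candidate value, score on the held-out set, keep the best. A hedged sketch in Python, with a closed-form ridge regression standing in for the neural net training (all data and names here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy regression data standing in for the digit features
X = rng.normal(size=(200, 20))
w_true = rng.normal(size=20)
y = X @ w_true + 0.5 * rng.normal(size=200)

# hold out a cross-validation split, as planned for the digit nets
X_tr, y_tr = X[:150], y[:150]
X_cv, y_cv = X[150:], y[150:]

def ridge_fit(X, y, lam):
    """Closed-form regularized fit standing in for neural-net training."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# sweep lambda, scoring each candidate on the held-out set
lambdas = [0, 0.01, 0.03, 0.1, 0.3, 1, 3, 10]
cv_errs = []
for lam in lambdas:
    w = ridge_fit(X_tr, y_tr, lam)
    cv_errs.append(np.mean((X_cv @ w - y_cv) ** 2))

best = lambdas[int(np.argmin(cv_errs))]
print('best lambda by CV error:', best)
```

The same loop shape applies to the Octave nets; only the training call and the error measure change.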

Do You Speak Whale?

Too cool, must learn more about this: Daniel Nouri's Using deep learning to listen for whales uses multiple convolution layers to process inputs to a neural net. Details? I don't know many yet. In the whale case, audio is treated as a greyscale image. Are the convolution layers an independent feature-generation stage in front of the neural net, or are all weights learned together? Since convolution-layer weights are shared among multiple neurons, it seems a deep "normal" net could achieve similar results but might have training times measured in archeological time units.
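On that weight sharing: a convolution layer slides one small kernel over the whole input, so a handful of weights produce an entire output map. A minimal illustrative sketch in Python (my own toy code, not Nouri's):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D cross-correlation: one shared kernel slides over the
    image, so every output unit reuses the same few weights."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# a 32x32 greyscale "image" (in the whale case, audio rendered as one)
spec = np.random.rand(32, 32)
edge = np.array([[1.0, -1.0]])  # 1x2 kernel: 2 weights for 32*31 outputs
print(conv2d_valid(spec, edge).shape)  # (32, 31)
```

A fully connected layer covering the same 32x31 output map would need separate weights per output unit, which is the training-time blowup worried about above.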

There is good stuff in the reference list at the end of the post. Especially see the Nature article in [1], Computer science: The learning machines, for an overview. Related: How Google Cracked House Number Identification in Street View, and of course Kaggle's The Marinexplore and Cornell University Whale Detection Challenge.

Cost Functions Series from Salford Systems

In the ten-part series How to Interpret Model Performance with Cost Functions, Salford Systems continues its run of really well done data science tutorial webinars. Segment lengths vary, with most about 10 to 20 minutes.

  • An Introduction to Understanding Cost Functions
  • Least Squares Deviation Cost for a Regression Problem
  • Least Absolute Deviation and Huber-M Cost for a Regression Problem
  • Introducing the Binary Classification Problem
  • Evaluating Prediction Success with Precision and Recall
  • Measuring Performance with the ROC Curve
  • Assessing Model Performance with Gains and Lift
  • Direct Interpretation of Response with Logistic Function
  • Multinomial Classification- Expected Cost
  • Multinomial Classification- Log Likelihood
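As a quick hand-computed refresher on two of the metrics the series covers, here are precision and recall on a toy set of binary predictions (the numbers are invented):

```python
import numpy as np

# toy binary predictions vs. ground truth
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives

precision = tp / (tp + fp)  # of predicted positives, how many were right
recall = tp / (tp + fn)     # of actual positives, how many were found
print(precision, recall)    # 0.75 0.75
```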

MOOC Watch – January 2014

A new crop of online courses has begun with the new year. To wrap my head around how GPU processing can be applied to machine learning and other data science areas, I've enrolled in the University of Illinois' Heterogeneous Parallel Programming via Coursera. It started on January 6th and begins with Nvidia CUDA, much as Udacity's Intro to Parallel Programming is taught. The U-of-I class, however, plans to cover OpenCL and other interfaces after the groundwork is done with CUDA. I like this for avoiding potential vendor lock-in with CUDA-only solutions.

On January 21st, StatLearning: Statistical Learning from Stanford University begins. They intend to cover all of An Introduction to Statistical Learning in nine weeks! This will be much wider in topics than Andrew Ng's Machine Learning class but cannot possibly go as deep in that time. A PDF of the book will be available at no cost. The R language will be used, but without a syllabus it's not clear how heavy a programming load comes with this class. The course will be on an open source version of the edX platform hosted directly by Stanford; it is not offered through edX itself.

Udacity starts the year by rolling out its New Course Experience and a Data Science track. The topics there are certainly interesting, but I'm already maxed out with the two courses above. I'm also wait-and-see on the New Course Experience: a quick look at what was available late last year under the new approach left me thinking the free-to-learn track may be pretty lightweight. More depth, in the form of projects and tutoring, comes with a per-course monthly fee. No more all-you-can-eat (can-learn) for free from Udacity.

scikit-learn info


Project web site for downloads, documentation, and tutorial.

There have been other scikit-learn tutorials making the rounds of various PyCon, SciPy, and PyData events over the last few years, and they seem to be maturing. These were presented at SciPy 2013 by Olivier Grisel, Jake Vanderplas, and Gaël Varoquaux. I like them for their coherence as a combined presentation, though each presenter has also given individual talks at conferences. It is a lot of video to watch, split into five sections. On YouTube:

Materials for these, including IPython notebooks, are at Jake's GitHub page.

Also see Ben Lorica’s Six reasons why I recommend scikit-learn post on the O’Reilly site.

Lots of other conference videos of interest are at Vimeo’s PyData Portfolios.

Oh No! Machine Learning Class is Over

Wow! Andrew Ng's Machine Learning class via Coursera was amazing. What next? For more Andrew, it looks like the iTunes store has two different versions of a 2008 on-campus course. Access these via iTunes U > Universities & Colleges > Stanford > Engineering. Both seem to have the same 20 lecture videos.

This one looks like a newer reorganization of the material, with topic titles interspersed among the videos. Where those topics lead is a mystery; they do not appear active in iTunes on Windows but might go somewhere on Apple hardware.

Update: there is a much better organized version of this material at the Stanford Engineering Everywhere site; don't bother with the iTunes version. It has links to a syllabus, lecture videos and PDFs, problem assignments and solutions, and other reference material.

Also online is tutorial material on Unsupervised Feature Learning and Deep Learning, which lists Andrew as the first contributor.

Lots of references from the Coursera Machine Learning wiki page ML:Useful Resources need to be explored too. You might need to be signed in on Coursera to access this page.

And finally, there are two courses of interest on the Stanford Open Classroom site. Both Machine Learning and Unsupervised Feature Learning and Deep Learning have some interesting bits, but neither seems complete. Works in progress, or early MOOC efforts that won't ever be finished?

Initial Results

Kaggle Digit Recognizer series

For this first attempt at getting actual Kaggle classification results, the neural network from the Machine Learning course is changed only in its input dimension. A single hidden layer with 25 units is used, along with the defaults of 50 training iterations and a regularization lambda of 1.

A little code rearrangement is in order here too. Some parts of ex4.m aren't needed (parts 2 through 5). Most of its remaining logic moves into trainNN.m, and a few lines are added to save the trained Theta matrices. New is runNN.m, which loads the Thetas, runs sample data through the network, and saves results in a Kaggle-friendly submission format like so:

%% runNN.m

thetaFile = 'Thetas.mat';
testDataFile = 'test.mat';
resultsFile = 'results.csv';

% load trained model theta matrices
load(thetaFile);

% load test data
load(testDataFile);

% predict outputs from test samples
pred = predict(Theta1, Theta2, Xtest);

% change 10 labels back to 0 for Kaggle results
pred(pred==10) = 0;

% save predicted results
outfd = fopen(resultsFile, 'w');
fprintf(outfd, 'ImageId,Label\n');
for n = 1:length(pred)
	fprintf(outfd, '%d,%d\n', n, pred(n));
end
fclose(outfd);
Initial training of the net was done with all 42,000 samples; splitting the data into training and validation sets will come later. After 50 training iterations the cost function value was 1.19, and the network self-classified its training data with 87.698% accuracy. The initial Kaggle submission of the test data set achieved 86.600% accuracy, well behind the two Kaggle sample solutions, with lots of room for improvement.


For an unguided attempt at improving on this result, I trained a new network with 300 hidden nodes (still a single layer) over 5,000 iterations. Final self-classified accuracy was 100% and the cost function value was 0.0192. Run time was about 4,200 seconds, vs. 10 seconds for the previous attempt. The Kaggle submission from this net was 92.629% accurate.