Kronecker Product

The Kronecker product is a snazzy bit of linear algebra that I don’t use often enough to remember exactly how it works. I most recently ran into it during an excellent course on machine learning with Spark MLlib, where it was used to simplify a variety of feature samplings over a data set (e.g. grouped offsets within a time series, aggregate blocks in a series). I have also used it in previous data-science-flavored courses.

Warning: the mathy bits may be slow to render while MathJax loads.

The first example in the course lab material was straightforward enough:
$$ \begin{bmatrix} 1 & 2 \\\ 3 & 4 \end{bmatrix} \otimes \begin{bmatrix} 1 & 2 \end{bmatrix} = \begin{bmatrix} 1 \cdot 1 & 1 \cdot 2 & 2 \cdot 1 & 2 \cdot 2 \\\ 3 \cdot 1 & 3 \cdot 2 & 4 \cdot 1 & 4 \cdot 2 \end{bmatrix} = \begin{bmatrix} 1 & 2 & 2 & 4 \\\ 3 & 6 & 4 & 8 \end{bmatrix} $$
But the second example starts to get obscure:
$$ \begin{bmatrix} 1 & 2 \\\ 3 & 4 \end{bmatrix} \otimes \begin{bmatrix} 1 & 2 \\\ 3 & 4 \end{bmatrix} = \begin{bmatrix} 1 \cdot 1 & 1 \cdot 2 & 2 \cdot 1 & 2 \cdot 2 \\\ 1 \cdot 3 & 1 \cdot 4 & 2 \cdot 3 & 2 \cdot 4 \\\ 3 \cdot 1 & 3 \cdot 2 & 4 \cdot 1 & 4 \cdot 2 \\\ 3 \cdot 3 & 3 \cdot 4 & 4 \cdot 3 & 4 \cdot 4 \end{bmatrix} = \begin{bmatrix} 1 & 2 & 2 & 4 \\\ 3 & 4 & 6 & 8 \\\ 3 & 6 & 4 & 8 \\\ 9 & 12 & 12 & 16 \end{bmatrix} $$
Using the same values in each input matrix makes it hard to tell which operand came from where. This would be so much clearer to me if unique numbers were used for the inputs:
$$ \begin{bmatrix} 1 & 2 \\\ 3 & 4 \end{bmatrix} \otimes \begin{bmatrix} 5 & 6 \\\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 1 \cdot 5 & 1 \cdot 6 & 2 \cdot 5 & 2 \cdot 6 \\\ 1 \cdot 7 & 1 \cdot 8 & 2 \cdot 7 & 2 \cdot 8 \\\ 3 \cdot 5 & 3 \cdot 6 & 4 \cdot 5 & 4 \cdot 6 \\\ 3 \cdot 7 & 3 \cdot 8 & 4 \cdot 7 & 4 \cdot 8 \end{bmatrix} = \begin{bmatrix} 5 & 6 & 10 & 12 \\\ 7 & 8 & 14 & 16 \\\ 15 & 18 & 20 & 24 \\\ 21 & 24 & 28 & 32 \end{bmatrix} $$
That is the approach used on the Kronecker product Wikipedia page. But I still want to see this more clearly, and without needing to be shown the intermediate computations:
$$ \begin{bmatrix} 1 & 0 & 0 \\\ 0 & 1 & 0 \\\ 0 & 0 & 1 \end{bmatrix} \otimes \begin{bmatrix} 1 & 2 \\\ 3 & 4 \end{bmatrix} = \begin{bmatrix} 1 & 2 & 0 & 0 & 0 & 0 \\\ 3 & 4 & 0 & 0 & 0 & 0 \\\ 0 & 0 & 1 & 2 & 0 & 0 \\\ 0 & 0 & 3 & 4 & 0 & 0 \\\ 0 & 0 & 0 & 0 & 1 & 2 \\\ 0 & 0 & 0 & 0 & 3 & 4 \end{bmatrix} $$
Hey, look at how that 1-2-3-4 block gets replicated! But what about all those multiplications? OK, try this:
$$ \begin{bmatrix} 1 & 0 & 0 \\\ 0 & 5 & 0 \\\ 0 & 0 & 10 \end{bmatrix} \otimes \begin{bmatrix} 1 & 2 \\\ 3 & 4 \end{bmatrix} = \begin{bmatrix} 1 & 2 & 0 & 0 & 0 & 0 \\\ 3 & 4 & 0 & 0 & 0 & 0 \\\ 0 & 0 & 5 & 10 & 0 & 0 \\\ 0 & 0 & 15 & 20 & 0 & 0 \\\ 0 & 0 & 0 & 0 & 10 & 20 \\\ 0 & 0 & 0 & 0 & 30 & 40 \end{bmatrix} $$

In Python’s numpy the function is named kron. In R it is named kronecker, and %x% can be used in place of the full name (in keeping with R’s tradition of being weird).
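The examples above are easy to check for yourself; here is a quick sketch using numpy’s kron on the same input matrices:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

# Each entry of a scales a full copy of b, giving a 4x4 block matrix.
print(np.kron(a, b))
# [[ 5  6 10 12]
#  [ 7  8 14 16]
#  [15 18 20 24]
#  [21 24 28 32]]

# Taking the identity as the left operand replicates the 1-2-3-4
# block down the diagonal, as in the third example above.
print(np.kron(np.eye(3, dtype=int), a))
```

In R the equivalent one-liner would be `matrix(1:4, 2, byrow=TRUE) %x% diag(3)` style usage, with the same block-replication behavior.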

PyCon 2014 Videos

The 2014 PyCon in Montreal is over, and 138 videos from the conference have already been posted, covering a huge variety of topics. To avoid scrolling through the entire list every time I want to see what I missed by staying home, here is a subset of session and tutorial videos of interest to data science and machine learning fans. Running times are listed as h:mm:ss. Summary text is from the site.

Sessions (short)

Diving into Open Data with IPython Notebook & Pandas, 0:30:55
I’ll walk you through Python’s best tools for getting a grip on data: IPython Notebook and pandas. I’ll show you how to read in data, clean it up, graph it, and draw some conclusions, using some open data about the number of cyclists on Montréal’s bike paths as an example.

Know Thy Neighbor: Scikit and the K-Nearest Neighbor Algorithm, 0:20:56
One of the great features of Python is its machine learning capabilities. Scikit is a rich Python package which allows developers to create predictive apps. In this presentation, we will guess what type of music Python programmers like to listen to, using Scikit and the k-nearest neighbor algorithm.

Enough Machine Learning to Make Hacker News Readable Again, 0:28:49
It’s inevitable that online communities will change, and that we’ll remember the community with a fondness that likely doesn’t accurately reflect the former reality. We’ll explore how we can take a set of articles from an online community and winnow out the stuff we feel is unworthy. We’ll explore some of the machine learning tools that are just a “pip install” away, such as scikit-learn and nltk.

How to Get Started with Machine Learning, 0:25:50
Provides an introduction to machine learning to clarify what it is, what it’s not, and how it fits into the picture of all the hot topics around data analytics and big data.

Realtime predictive analytics using scikit-learn & RabbitMQ, 0:28:58
scikit-learn is an awesome tool allowing developers with little or no machine learning knowledge to predict the future! But once you’ve trained a scikit-learn algorithm, what now? In this talk, I describe how to deploy a predictive model in a production environment using scikit-learn and RabbitMQ. You’ll see a realtime content classification system to demonstrate this design.

Tutorials (long)

Mining Social Web APIs with IPython Notebook, 3:25:24
Social websites such as Twitter, Facebook, LinkedIn, Google+, and GitHub have vast amounts of valuable insights lurking just beneath the surface, and this workshop minimizes the barriers to exploring and mining this valuable data by presenting turn-key examples from the thoroughly revised 2nd Edition of Mining the Social Web.

Bayesian statistics made simple, 3:15:29
An introduction to Bayesian statistics using Python. Bayesian statistics are usually presented mathematically, but many of the ideas are easier to understand computationally. People who know Python can get started quickly and use Bayesian analysis to solve real problems. This tutorial is based on material and case studies from Think Bayes (O’Reilly Media).

Beyond Defaults: Creating Polished Visualizations Using Matplotlib, 3:08:23
When people hear of matplotlib, they think rudimentary graphs that will need to be touched up in Photoshop. This tutorial aims to teach attendees how to exploit the functionality provided by various matplotlib libraries to create professional looking data visualizations.

Data Wrangling for Kaggle Data Science Competitions — An etude, 3:22:04
Let us mix Python analytics tools, add a dash of Machine Learning Algorithmics & work on Data Science Analytics competitions hosted by Kaggle. This tutorial introduces the intersection of Data, Inference & Machine Learning, structured in a progressive mode, so that the attendees learn by hands-on wrangling with data for interesting inferences using scikit-learn (scipy, numpy) & pandas.

Hands-on with Pydata: how to build a minimal recommendation engine, 3:21:00
In this tutorial we’ll set ourselves the goal of building a minimal recommendation engine, and in the process learn about Python’s excellent Pydata and related projects and tools: NumPy, pandas, and the IPython Notebook.

Python for Social Scientists, 3:27:00
Many provocative social questions can be answered with data, and datasets are more available than ever. Start working with it here. First we’ll download and visualize one data set from the World Bank Indicators page together, using Matplotlib. Then you’ll have time on your own to pick another data set from any online source and plot that. At the end every person/pair will share what they found.

Exploring Machine Learning with Scikit-learn, 3:24:14
This tutorial will offer an introduction to the core concepts of machine learning, and how they can be easily applied in Python using Scikit-learn. We will use the scikit-learn API to introduce and explore the basic categories of machine learning problems, related topics such as feature selection and model validation, and the application of these tools to real-world data sets.

Diving deeper into Machine Learning with Scikit-learn, 3:15:13
This tutorial session is a hands-on workshop on applied Machine Learning with the scikit-learn library. We will dive deeper into scikit-learn model evaluation and automated parameter tuning. We will also study how to scale text classification models for sentiment analysis or spam detection and use IPython.parallel to leverage multi-CPU or ad-hoc cloud clusters.

Googlie MOOCs

At some point late last summer while in down-the-rabbit-hole mode, following links I’d found about some MOOC building tools Google had produced, I ended up on the Course Builder page. The list of courses available at that time was small, perhaps 6 – 8. They were on a wide variety of subjects for such a small population. My impression was they had all used various beta forms of Course Builder and were evidence for Google that their builder tools worked. Indeed they did.

I decided to try Data Mining with Weka, from Professor Ian Witten at New Zealand’s University of Waikato. What a great course! Only five weeks long, and it gave me some basic abilities with Weka. I think of Weka as a data workbench: a tool ready to get down to business on understanding what you’ve got in a dataset and then jump right into analysis. Upon checking the Course Builder class list today I found the number of offerings has exploded since last year. And Data Mining with Weka will run for a second time beginning in March. Don’t miss it this time; it is a very low-pain / high-gain course. Witten will follow it up in late April with the new MOOC More Data Mining with Weka, which will cover huge datasets and neural nets among its topics. I will be there.

Another too-good-to-pass-up MOOC I found on the Course Builder site is Big Data Applications and Analytics by Professor Geoffrey Fox at Indiana University. The course syllabus shows case studies from many domains, and the tech used for homework involves Python, visualization, and the cloud. There is no hard calendar of deadlines; you work at your own pace. That’s great for me, as I already have two other courses in progress.

Andrew, Jeremy, and PayPal


Yes, PayPal! They host TechXploration, a monthly meetup in San Jose. Andrew Ng spoke on deep learning in August 2013. YouTube links to his presentation can be found in the meetup comments and are also repeated here.
part 1 (14:32), part 2 (14:37), part 3 (14:54), part 4 (12:09), part 5 (14:38), part 6 (5:41)

Also of interest: Jeremy Howard, Chief Data Scientist and President at Kaggle, spoke in July 2013.
part 1 (19:14), part 2 (19:33), part 3 (14:00), part 4 (17:17)

Cost Functions Series from Salford Systems

With the ten-part series How to Interpret Model Performance with Cost Functions, Salford Systems continues to publish really well done data science tutorial webinars. Segment lengths vary, with most running about 10 to 20 minutes.

  • An Introduction to Understanding Cost Functions
  • Least Squares Deviation Cost for a Regression Problem
  • Least Absolute Deviation and Huber-M Cost for a Regression Problem
  • Introducing the Binary Classification Problem
  • Evaluating Prediction Success with Precision and Recall
  • Measuring Performance with the ROC Curve
  • Assessing Model Performance with Gains and Lift
  • Direct Interpretation of Response with Logistic Function
  • Multinomial Classification- Expected Cost
  • Multinomial Classification- Log Likelihood

Free Statistics Textbook


SAS competitor StatSoft has made a free version of their monster statistics textbook available online. It covers lots of ground, wide rather than deep. I see it as more of a reference guide than a class textbook, since it skips the derivations and proofs a textbook would have; in that sense it’s more useful than a textbook. The online book is not exceedingly obvious to find from the company’s home page, so follow the link here to go directly to the book.

DaVinci Data Detectives


It all started last December with Richard Hackathorn’s post to the discussion boards of the Boulder/Denver BigData and Data Science and Business Analytics Meetups, proposing a study group for an upcoming online data analysis class.

Are you a data detective? Love to discover the secrets hidden deep within data? …Do something useful with it? Starting January 22 for 8 weeks, Coursera is offering “Data Analysis” taught by Jeff Leek. Using the R statistical language, this course is a practical hands-dirty introduction to crucial statistics, like linear regression, principal components analysis, cross-validation, and p-values. Further it’s FREE! Watch the one-minute video at http://www.coursera.o….

Taking a Coursera course is a lonely venture! Join this local study group, where the goal is for all of us to earn that certificate, becoming certified data detectives! Starting on January 24, we will meet on Thursdays over lunch, 11:30 to 1:00, at the Vault of the DaVinci Institute http://www.davinciins… 511 E South Boulder Road, Louisville, CO 80027. Pack a brown bag, stop at Subway, or pre-order from Ralphie’s Tavern (next door). Join us for the fun (and a bit of hard work)!

Richard is right: online courses can be very lonely places. Course discussion forums and wiki pages help, but are no substitute for meeting your classmates in the real world. The study group has been a great success. I learned of Roger Peng’s Computing for Data Analysis course through the group; it was a great way to learn programming in R. The weekly study sessions have helped clear up confusion and keep us motivated. One of our members is now using R and some of the techniques from this class in their day job, for a task formerly done with Excel.

Data Detectives ends on 3/14 with our 8th session and the final week of the Data Analysis class. For many of us the data science learning won’t stop there, though. We have Andrew Ng’s Machine Learning and Bill Howe’s Introduction to Data Science coming up soon.

Cowabunga! Data Science Power! (can we get a pizza?)

O’Reilly Webcast: Deep Learning


Jeremy Howard was the guest on O’Reilly’s 3/5 webcast. The focus was on how deep learning techniques are being applied in Kaggle competitions. The highlighted competition was won by a team with no knowledge of the problem domain, using what was explained as a deeper-than-usual neural network approach. The team was led by Geoffrey Hinton, who happens to teach Coursera’s Neural Networks for Machine Learning. That class was offered in 2012 but is not yet scheduled to run in 2013. Hmmm.

I found only a description of this talk in O’Reilly’s webcast archive, with no playback link like many of their other webcasts have. They did, however, send me a URL for playback ( ). I suspect this link is tied to a cookie in my browser, so it may or may not work for you, or may just require registration to watch. The playback spawns another browser tab for Jeremy’s slides, so popups need to be enabled.

Evolution of Regression Modeling


Part 1 of this series from Salford Systems was held on 3/1, the first of four parts spaced two weeks apart. Sessions 1 and 3 are lecture format; 2 and 4 will be hands-on with Salford’s modeling software. They offer a free 10-day trial of the software and promise an additional 30 days on request, so time the download right and it can cover both hands-on sessions. As a data analysis noob, I find this series gives a great view of what comes beyond the basics. Links to the video and slides for this and other Salford webinars can be found here.