Kronecker Product


The Kronecker Product is a snazzy bit of linear algebra that I don’t use often enough to remember exactly how it works. I most recently ran into it during an excellent course on machine learning using Spark MLlib. There it was used to simplify a variety of data set feature samplings (e.g. grouped offsets within a time series, aggregate blocks in a series). I have also used it in previous data-science-flavored courses.

warning: may be slow to render the mathy bits while mathjax loads

The first example in the course lab material was straightforward enough:
$$ \begin{bmatrix} 1 & 2 \\\ 3 & 4 \end{bmatrix} \otimes \begin{bmatrix} 1 & 2 \end{bmatrix} = \begin{bmatrix} 1 \cdot 1 & 1 \cdot 2 & 2 \cdot 1 & 2 \cdot 2 \\\ 3 \cdot 1 & 3 \cdot 2 & 4 \cdot 1 & 4 \cdot 2 \end{bmatrix} = \begin{bmatrix} 1 & 2 & 2 & 4 \\\ 3 & 6 & 4 & 8 \end{bmatrix} $$
But the second example starts to get obscure:
$$ \begin{bmatrix} 1 & 2 \\\ 3 & 4 \end{bmatrix} \otimes \begin{bmatrix} 1 & 2 \\\ 3 & 4 \end{bmatrix} = \begin{bmatrix} 1 \cdot 1 & 1 \cdot 2 & 2 \cdot 1 & 2 \cdot 2 \\\ 1 \cdot 3 & 1 \cdot 4 & 2 \cdot 3 & 2 \cdot 4 \\\ 3 \cdot 1 & 3 \cdot 2 & 4 \cdot 1 & 4 \cdot 2 \\\ 3 \cdot 3 & 3 \cdot 4 & 4 \cdot 3 & 4 \cdot 4 \end{bmatrix} = \begin{bmatrix} 1 & 2 & 2 & 4 \\\ 3 & 4 & 6 & 8 \\\ 3 & 6 & 4 & 8 \\\ 9 & 12 & 12 & 16 \end{bmatrix} $$
Using the same values in both input matrices makes it hard to tell which operand came from where. This would be so much clearer to me if unique numbers were used for the inputs:
$$ \begin{bmatrix} 1 & 2 \\\ 3 & 4 \end{bmatrix} \otimes \begin{bmatrix} 5 & 6 \\\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 1 \cdot 5 & 1 \cdot 6 & 2 \cdot 5 & 2 \cdot 6 \\\ 1 \cdot 7 & 1 \cdot 8 & 2 \cdot 7 & 2 \cdot 8 \\\ 3 \cdot 5 & 3 \cdot 6 & 4 \cdot 5 & 4 \cdot 6 \\\ 3 \cdot 7 & 3 \cdot 8 & 4 \cdot 7 & 4 \cdot 8 \end{bmatrix} = \begin{bmatrix} 5 & 6 & 10 & 12 \\\ 7 & 8 & 14 & 16 \\\ 15 & 18 & 20 & 24 \\\ 21 & 24 & 28 & 32 \end{bmatrix} $$
Which is the approach used on the Kronecker Wikipedia page. But I still want to see this more clearly and without needing to be shown the intermediate computations:
$$ \begin{bmatrix} 1 & 0 & 0 \\\ 0 & 1 & 0 \\\ 0 & 0 & 1 \end{bmatrix} \otimes \begin{bmatrix} 1 & 2 \\\ 3 & 4 \end{bmatrix} = \begin{bmatrix} 1 & 2 & 0 & 0 & 0 & 0 \\\ 3 & 4 & 0 & 0 & 0 & 0 \\\ 0 & 0 & 1 & 2 & 0 & 0 \\\ 0 & 0 & 3 & 4 & 0 & 0 \\\ 0 & 0 & 0 & 0 & 1 & 2 \\\ 0 & 0 & 0 & 0 & 3 & 4 \end{bmatrix} $$
Hey, look at how that 1-2-3-4 block gets replicated! But what about all those multiplications? OK, try this:
$$ \begin{bmatrix} 1 & 0 & 0 \\\ 0 & 5 & 0 \\\ 0 & 0 & 10 \end{bmatrix} \otimes \begin{bmatrix} 1 & 2 \\\ 3 & 4 \end{bmatrix} = \begin{bmatrix} 1 & 2 & 0 & 0 & 0 & 0 \\\ 3 & 4 & 0 & 0 & 0 & 0 \\\ 0 & 0 & 5 & 10 & 0 & 0 \\\ 0 & 0 & 15 & 20 & 0 & 0 \\\ 0 & 0 & 0 & 0 & 10 & 20 \\\ 0 & 0 & 0 & 0 & 30 & 40 \end{bmatrix} $$
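The last two examples show the general pattern. As a sketch in my own notation (the symbols a, m and n are my labels, not from the course material): if A has m rows and n columns, every entry of A is replaced by a copy of B scaled by that entry:
$$ A \otimes B = \begin{bmatrix} a_{11} B & \cdots & a_{1n} B \\\ \vdots & \ddots & \vdots \\\ a_{m1} B & \cdots & a_{mn} B \end{bmatrix} $$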

In Python’s numpy the function is named kron. In R it is named kronecker and %x% can be used in place of the full name (in keeping with R’s tradition of being weird).
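
Here is a minimal sketch of the numpy version, reusing the 1-2-3-4 and 5-6-7-8 matrices from above. The helper kron_by_blocks is a name I made up for illustration (it is not part of numpy); it builds the result one scaled block at a time so the structure is visible, and it is only meant for 2-D arrays.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# numpy's built-in Kronecker product.
K = np.kron(A, B)
print(K)


def kron_by_blocks(a, b):
    """Build the Kronecker product of two 2-D arrays block by block.

    Hypothetical helper for illustration only; use np.kron in real code.
    """
    m, n = a.shape
    p, q = b.shape
    out = np.zeros((m * p, n * q), dtype=np.result_type(a, b))
    for i in range(m):
        for j in range(n):
            # Block (i, j) of the result is the scalar a[i, j] times all of b.
            out[i * p:(i + 1) * p, j * q:(j + 1) * q] = a[i, j] * b
    return out


# The hand-rolled version agrees with np.kron.
assert np.array_equal(K, kron_by_blocks(A, B))

# A diagonal left operand places scaled copies of the right operand
# along the diagonal, as in the 1-5-10 example.
D = np.diag([1, 5, 10])
M = np.array([[1, 2],
              [3, 4]])
print(np.kron(D, M))
```

The two loops walk over the entries of the left matrix, which is why the output has one block per entry of A, each block the size of B.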

VirtualBox on Ubuntu

There is a ton of information available on the web to help with installing VirtualBox. Too much, maybe. It’s a bit daunting to find the right set of instructions to follow. This machine is running Ubuntu 14.04, so I started here: https://www.virtualbox.org/wiki/Linux_Downloads

But don’t start at the top; it is not really necessary to download a package as described there. I went to the Debian-based Linux distributions section and did the following:

  • Added this to the end of the /etc/apt/sources.list file:
    ## added to support virtualbox installation
    deb http://download.virtualbox.org/virtualbox/debian trusty contrib

    The trusty part will need to be changed if the host OS is not version 14.04

  • Fetched the Oracle public key into my Downloads directory (from the “here” link)
  • sudo apt-key add ~/Downloads/oracle_vbox.asc
  • Skipped the one-line command that combines downloading and registering the key
  • sudo apt-get update
    sudo apt-get install virtualbox-4.3
  • And also picked up the other recommended package:
    sudo apt-get install dkms

I did not run into any of the bad-signature messages shown on that page.

Now the system is ready to create a virtual machine using the VirtualBox Manager and install a copy of a guest OS. Done. VirtualBox runs fine and the guest OS installation went fine, BUT when the new virtual system was started up, its display area was confined to a tiny 640×480 space on the screen!

After a bit more googling I found: http://youtu.be/JH29ALyraG4

Which advises on installing under Windows 8.1. No, not what I wanted, but the good applicable part comes just after 10:30 in the video. This is where the fix for the tiny window problem is described.

In a terminal window on the virtual system, run:

sudo apt-get install virtualbox-guest-utils virtualbox-guest-x11 virtualbox-guest-dkms

Fixed! Virtual box display now resizes as desired.

Tip: change the background image for the virtual system to avoid confusion with the host OS.

PyCon 2014 Videos

The 2014 PyCon in Montreal is over and 138 videos from the conference have already been posted. There is a huge variety of topics covered. To avoid scrolling through the entire list every time I want to see what I missed by staying home, here is a subset of session and tutorial videos of interest to data science and machine learning fans. I have listed running times as h:mm:ss. Summary text is from the pyvideo.org site.

Sessions (short)

Diving into Open Data with IPython Notebook & Pandas, 0:30:55
I’ll walk you through Python’s best tools for getting a grip on data: IPython Notebook and pandas. I’ll show you how to read in data, clean it up, graph it, and draw some conclusions, using some open data about the number of cyclists on Montréal’s bike paths as an example.

Know Thy Neighbor: Scikit and the K-Nearest Neighbor Algorithm, 0:20:56
One of the great features of Python is its machine learning capabilities. Scikit is a rich Python package which allows developers to create predictive apps. In this presentation, we will guess what type of music do Python programmers like to listen to, using Scikit and the k-nearest neighbor algorithm.

Enough Machine Learning to Make Hacker News Readable Again, 0:28:49
It’s inevitable that online communities will change, and that we’ll remember the community with a fondness that likely doesn’t accurately reflect the former reality. We’ll explore how we can take a set of articles from an online community and winnow out the stuff we feel is unworthy. We’ll explore some of the machine learning tools that are just a “pip install” away, such as scikit-learn and nltk.

How to Get Started with Machine Learning, 0:25:50
Provide an introduction to machine learning to clarify what it is, what it’s not and how it fits into this picture of all the hot topics around data analytics and big data.

Realtime predictive analytics using scikit-learn & RabbitMQ, 0:28:58
scikit-learn is an awesome tool allowing developers with little or no machine learning knowledge to predict the future! But once you’ve trained a scikit-learn algorithm, what now? In this talk, I describe how to deploy a predictive model in a production environment using scikit-learn and RabbitMQ. You’ll see a realtime content classification system to demonstrate this design.

Tutorials (long)

Mining Social Web APIs with IPython Notebook, 3:25:24
Social websites such as Twitter, Facebook, LinkedIn, Google+, and GitHub have vast amounts of valuable insights lurking just beneath the surface, and this workshop minimizes the barriers to exploring and mining this valuable data by presenting turn-key examples from the thoroughly revised 2nd Edition of Mining the Social Web.

Bayesian statistics made simple, 3:15:29
An introduction to Bayesian statistics using Python. Bayesian statistics are usually presented mathematically, but many of the ideas are easier to understand computationally. People who know Python can get started quickly and use Bayesian analysis to solve real problems. This tutorial is based on material and case studies from Think Bayes (O’Reilly Media).

Beyond Defaults: Creating Polished Visualizations Using Matplotlib, 3:08:23
When people hear of matplotlib, they think rudimentary graphs that will need to be touched up in photoshop. This tutorial aims to teach attendees how to exploit the functionality provided by various matplotlib libraries to create professional looking data visualizations.

Data Wrangling for Kaggle Data Science Competitions — An etude, 3:22:04
Let us mix Python analytics tools, add a dash of Machine Learning Algorithmics & work on Data Science Analytics competitions hosted by Kaggle. This tutorial introduces the intersection of Data, Inference & Machine Learning, structured in a progressive mode, so that the attendees learn by hands-on wrangling with data for interesting inferences using scikit-learn (scipy, numpy) & pandas

Hands-on with Pydata: how to build a minimal recommendation engine, 3:21:00
In this tutorial we’ll set ourselves the goal of building a minimal recommendation engine, and in the process learn about Python’s excellent Pydata and related projects and tools: NumPy, pandas, and the IPython Notebook.

Python for Social Scientists, 3:27:00
Many provocative social questions can be answered with data, and datasets are more available than ever. Start working with it here. First we’ll download and visualize one data set from the World Bank Indicators page together, using Matplotlib. Then you’ll have time on your own to pick another data set from any online source and plot that. At the end every person/pair will share what they found.

Exploring Machine Learning with Scikit-learn, 3:24:14
This tutorial will offer an introduction to the core concepts of machine learning, and how they can be easily applied in Python using Scikit-learn. We will use the scikit-learn API to introduce and explore the basic categories of machine learning problems, related topics such as feature selection and model validation, and the application of these tools to real-world data sets.

Diving deeper into Machine Learning with Scikit-learn, 3:15:13
This tutorial session is an hands-on workshop on applied Machine Learning with the scikit-learn library. We will dive deeper into scikit-learn model evaluation and automated parameter tuning. We will also study how to scale text classification models for sentiment analysis or spam detection and use IPython.parallel to leverage multi-CPU or ad-hoc cloud clusters.

Visualization Links

These are cut’n’pasted from my Coursera Computational Investing class forum. Not yet examined, just here to follow up on.

I don’t know what Prof. Tucker chose for his company but highstock (http://www.highcharts.com/stock/demo/) is in my opinion the best around (free for non commercial products). In any case here is a list of other good visualization libraries:

http://g.raphaeljs.com/
http://www.jqplot.com/
http://plugins.jquery.com/project/gchart
http://vis.stanford.edu/protovis/
http://polymaps.org/
http://code.google.com/p/flot/

Machine Learning is coming

Are you prepared?

Take a look at Everything You Wanted to Know About Machine Learning, But Were Too Afraid To Ask; that’s a good place to start. It is an article on the BigML blog about an article (meta article?). Charles Parker summarizes Pedro Domingos’ short paper A Few Useful Things to Know about Machine Learning, which is what to read next.

I’m not real sure about this next one: A First Encounter with Machine Learning which is a 93-page pdf from Max Welling. The first paragraph of the preface explains the focus of the book. This may be an early version though; there are “??” references to figures and the bibliography is empty. Be warned, very “mathy”.

Prepared? Prepared for what? Yeah, this is where we’re going. Andrew Ng’s 10-week Machine Learning class starts on Coursera April 22. Lots of good reviews for this online.

Not yet scheduled but also very interesting looking is Geoffrey Hinton’s Neural Networks for Machine Learning, also on Coursera. Hinton was recently swallowed up by Google, but maybe not entirely. Let’s hope Coursera runs this again this year.

Tons more links to follow are in the Stack Overflow post Overwhelmed by Machine Learning—is there an ML101 book?