What Hackers Should Know About Machine Learning

A mire of algebra, stats, and dry academic research, this arcane discipline allows computers to make decisions in place of humans. But where’s a hacker to start?

Drew Conway is the co-author of Machine Learning for Hackers and must be one of the few data scientists out there who started his career working on counter-terrorism at the Department of Defense. FastCo.Labs talked to him about algebra, GitHub, and the ugly side of Machine Learning.


Why should developers learn Machine Learning?

I don’t necessarily think that every developer should learn Machine Learning. Machine Learning as a discipline is interested in the application of statistical methods to decision making. If your job as an engineer is to build large systems that have nothing to do with that, then I wouldn’t say you should learn it. That said, the process of learning it can improve your overall statistical literacy, and I would say that’s a general benefit in life.

Why did you write the book?


We were familiar with the reference texts around Machine Learning, and all of those texts require a pretty substantial foundation in linear algebra, calculus, and statistics. We wanted to create a reference book geared more toward practitioners who are used to thinking algorithmically, one that didn’t require a lot of math or formal statistical training.

What are the biggest gaps in the average hacker’s knowledge when learning Machine Learning?

A college intro-level probability class, so that you learn how different probability distributions are reflected in the real world. Why do we care so much about the normal distribution? What is it about the normal distribution that’s so fundamental to the things we observe in nature, versus a binomial distribution? What kinds of processes and phenomena does that represent? Then, in terms of actually doing the work, linear algebra and matrix algebra. You get the probabilistic stuff so you can understand the framework for thinking about how things work, and then the linear algebra and matrix algebra is often how it gets done in software.
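One reason the normal distribution is so fundamental is the central limit theorem: sums of many independent draws, such as binomial coin-flip counts, come out approximately normal. As a rough illustration of that connection (a hypothetical sketch, not code from the book), in Python:

```python
import random
import statistics

random.seed(42)

# A binomial(n, p) draw is the number of successes in n coin flips.
def binomial_draw(n, p):
    return sum(1 for _ in range(n) if random.random() < p)

# By the central limit theorem, binomial(n, p) for large n is
# approximately normal with mean n*p and variance n*p*(1-p).
n, p = 1000, 0.5
samples = [binomial_draw(n, p) for _ in range(2000)]

mean = statistics.mean(samples)    # should be close to n*p = 500
stdev = statistics.stdev(samples)  # close to sqrt(n*p*(1-p)) ≈ 15.8
print(mean, stdev)
```

The discrete coin-flip process and the continuous bell curve describe the same phenomenon at different resolutions, which is the kind of intuition an intro probability class builds.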


Someone who is a professional scientific researcher probably understands all the stuff about cleaning up data–that’s their bread and butter–whereas a professional software engineer understands how to build systems from the ground up but less often hears “Here’s some data. I need you to tell me what’s going on.” The “here’s some data” part is the really ugly part: cleaning it up, creating a matrix out of it, and so on.
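To make that ugly “here’s some data” step concrete, here is a small hypothetical Python sketch (not from the book) of turning messy string records into a numeric matrix, with simple mean imputation for missing values:

```python
# Hypothetical messy records: stray whitespace, thousands separators,
# unusable entries, and missing fields.
raw = [
    {"age": " 34 ", "income": "52,000"},
    {"age": "n/a", "income": "61000"},
    {"age": "29", "income": None},
]

def to_float(value):
    # Normalize one messy cell; map anything unusable to None.
    if value is None:
        return None
    cleaned = value.strip().replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None

# Build a row-per-record, column-per-feature matrix, then impute
# missing cells with the column mean -- the unglamorous step that
# has to happen before any model sees the data.
cols = ["age", "income"]
matrix = [[to_float(rec[c]) for c in cols] for rec in raw]
for j in range(len(cols)):
    present = [row[j] for row in matrix if row[j] is not None]
    col_mean = sum(present) / len(present)
    for row in matrix:
        if row[j] is None:
            row[j] = col_mean

print(matrix)
```

Real-world cleaning is rarely this tidy, but the shape of the work is the same: normalize, structure, fill gaps, and only then analyze.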

There’s a curiosity that’s required to do this stuff: looking at a data set and thinking about what is an appropriate or interesting question to interrogate with the data–that exploratory data analysis step. I have a new dataset, I’m just going to sit at the command line, look at the density distributions, do some scatterplots, and see what the structure of the data is. I think that requires some practice but also some intuition about the data-generating process. Of course, if you don’t have any training and have never done any of this before, it may seem a bit opaque at first. For most of the developers I know who have no background in that, it can be a bit intimidating in the beginning.
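That first exploratory pass can be as simple as a few summary statistics and a crude histogram at the command line. A minimal Python sketch of the look-before-you-model step, using made-up data (the interview itself includes no code):

```python
import random
import statistics

random.seed(0)
# Hypothetical dataset: 500 draws from a skewed (lognormal) process.
data = [random.lognormvariate(0, 0.5) for _ in range(500)]

# First-pass summaries before choosing any method. Mean well above
# the median already hints at right skew.
print("mean:  ", round(statistics.mean(data), 3))
print("median:", round(statistics.median(data), 3))
print("stdev: ", round(statistics.stdev(data), 3))

# A crude text histogram to eyeball the density's shape.
lo, hi, bins = min(data), max(data), 10
width = (hi - lo) / bins
for i in range(bins):
    count = sum(lo + i * width <= x < lo + (i + 1) * width for x in data)
    print(f"{lo + i * width:5.2f} | {'#' * (count // 5)}")
```

Nothing here commits you to a model; the point is to see the structure of the data before deciding what question to ask of it.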

What are the differences between doing a Machine Learning project and a development project?

Data analysis as an exploratory endeavor should be the first part of anything. You should never go into a project and say “The thing that I want to do is classification, so I’m always going to run my favorite classification algorithm.” For the first half of the book we talk about “Here’s a dataset, here’s how to clean it up.” The chapters that John Myles White wrote on means, medians, modes, and distributions are always the things you should do in the beginning. We want to hammer home that it’s not just input-output. Input, look around, see what’s going on, find structure in the data, then make the choice of methods. And then maybe iterate through a couple of them. It’s very cyclic. It’s not linear.
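A toy Python sketch of that cycle (entirely hypothetical, not from the book): summarize the data, try a couple of candidate methods, compare them, and only then commit. Here the “models” are just constant predictors scored by mean absolute error:

```python
import random
import statistics

random.seed(1)
# Hypothetical skewed data: which single-number summary predicts it best?
data = [random.expovariate(1.0) for _ in range(400)]

# Two candidate "models": predict every point with the mean or the median.
candidates = {
    "mean": statistics.mean(data),
    "median": statistics.median(data),
}

# Evaluate each by mean absolute error, then pick the winner -- a toy
# version of the explore / fit / compare / iterate cycle.
errors = {
    name: statistics.mean(abs(x - pred) for x in data)
    for name, pred in candidates.items()
}
best = min(errors, key=errors.get)
print(best, round(errors[best], 3))
```

On skewed data the median wins under absolute error, which a “just run my favorite algorithm” approach would never reveal; looking first and iterating does.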


A data scientist has a very different relationship with code than a developer does. I look at code as a tool to go from the question I am interested in answering to having some insight, and that’s sort of the end of it. That code is more or less disposable. Developers are thinking about writing code to build into a larger system; they are thinking about how to write something that can be reused. People who do large-scale Machine Learning, people at Google and Facebook, think in a similar way to a software engineer, because lots of interesting Machine Learning tools and methods don’t scale well to the web-scale datasets those companies are dealing with. So at the beginning their process is more like: what is the limited set of tools I have that can actually scale up and be useful for this question?

There are different levels. There’s exploratory research data science which many people coming into jobs from academia do, and they are building tools which are more like minimum viable pieces of technology. In some places there are people who do that, but then have to figure out a way to optimize that at large scale, and then there are the people who work on production systems who are writing code which is going to be used all the time as part of the product itself.

Do we need a GitHub for data analysis?

The real limitation of GitHub is that it’s not meant to be like S3, where you can store a ton of data. The data limitation is a significant one. In reality I think it’s fine for the data to be separate from the actual analytical code. The thing that I think was missing for a while was an appropriate way of conveying results. Most people who do data analysis eventually get to the point where they have a graph or something they want to show you. But now, with GitHub Pages, people do that all the time. If you look at Mike Bostock’s stuff for D3 (a JavaScript library for visualization), it’s all on GitHub; he uses GitHub Pages and does a great job with it. GitHub really gets you 80% of the way there. The data portion is the real limitation, but that’s okay, because everyone is going to want to use a different type of database, a different data structure, for their project.


What would you add in a new edition of the book?

There are lots of new methods that we would certainly add. One of the things we don’t do at all in the book is ensemble approaches to Machine Learning, combining multiple methods. We don’t talk at all about model fitting and evaluating the quality of models. Those are certainly things we would do in a second edition. Part of the reason we didn’t do them in the first one is that they are more intermediate-level topics and we were going for a novice audience.
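The core idea behind ensembles fits in a few lines. A hedged Python sketch (hypothetical setup, not material from the book): majority voting over many independent weak classifiers, each only about 70% accurate, yields a far more accurate combined prediction.

```python
import random

random.seed(7)

# Toy task: the true label of x is (x > 0.5). Each "weak learner"
# reports the true label with probability 0.7, otherwise flips it.
def weak_learner(x):
    truth = x > 0.5
    return truth if random.random() < 0.7 else not truth

def ensemble(x, k=51):
    # Majority vote over k independent weak learners.
    votes = sum(weak_learner(x) for _ in range(k))
    return votes > k / 2

points = [random.random() for _ in range(200)]
single_acc = sum(weak_learner(x) == (x > 0.5) for x in points) / len(points)
vote_acc = sum(ensemble(x) == (x > 0.5) for x in points) / len(points)
print(round(single_acc, 2), round(vote_acc, 2))
```

Real ensembles (bagging, boosting, stacking) are more careful about how the individual models are trained and combined, but the gain from aggregating many weak, independent predictors is the same phenomenon.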

My thinking has evolved on presenting results. The way I think about presenting results now is always in the browser, as an interactive thing. There’s a tremendous amount of value in giving the audience the ability to ask second-order questions about what they are observing, not just first-order ones. Imagine the thing you are looking at is a simple scatterplot and you see one outlier. A first-order question would be: who is that outlier? With an interactive graphic you can hover over the dot and it tells you who that is; the second-order question is: why is that an outlier?

You can get pretty far with Machine Learning for Hackers, but our hope is that those who want to move from hacker to real Machine Learning engineer will go out there, build on the fundamentals, and read Bishop and Hastie.


[Image: Flickr user Dustin McClure]

About the author

Lapsed software developer, tech journalist, wannabe data scientist. Ciara has a B.Sc.