Is it a boy or a girl? An introduction to Machine Learning

Have you ever noticed what happens when you hear a name you haven’t heard before? You automatically put it in a bucket: the girl-names bucket or the boy-names bucket. In this tutorial, we’re getting started with machine learning by building a classifier able to distinguish between boy and girl names. If this sounds interesting, read along. If you expect a tonne of intricate math, read along anyway: it’s easier and more fun than you think.

The US Social Security Administration publishes annual datasets with the popularity of baby names over the years. The datasets also contain the associated gender and the number of births, but we’re interested only in the name-gender pairs. You can build your own dataset by downloading the data from https://www.ssa.gov/oact/babynames/limits.html.

For convenience, I already did this for you. Here’s a cleaned dataset containing only the data of interest: names_dataset

Let’s start by reading the data:
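
A minimal sketch, assuming the cleaned file is saved as names_dataset.csv with name and gender columns (the file name and column names are assumptions):

```python
import pandas as pd

# Load the cleaned name/gender dataset into a DataFrame
names = pd.read_csv('names_dataset.csv')
print(names.head())
```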

For convenience, we’re using the pandas library, which offers a simple method for reading a dataset in CSV format. The resulting object is called a DataFrame. It offers a lot of helpful features for analysing data, though most of them won’t be of much use to us in this scenario.

Have you ever wondered what clues our intuition uses to predict whether a name belongs to a girl or a boy? My best guess is that we use some heuristics: we know that names ending in certain letters are usually boy names, while names ending in some other letters are girl names. Let’s try to teach a machine learning model these heuristics.
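
Here’s a sketch of such a feature extractor. The exact feature set is an assumption; the important thing is that it includes at least the first and last letters, which we’ll run into again later:

```python
def features(name):
    """Extract simple prefix/suffix features from a name."""
    name = name.lower()
    return {
        'first-letter': name[0],       # e.g. 'm' for mary
        'first2-letters': name[:2],    # 'ma'
        'first3-letters': name[:3],    # 'mar'
        'last-letter': name[-1],       # 'y'
        'last2-letters': name[-2:],    # 'ry'
        'last3-letters': name[-3:],    # 'ary'
    }
```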

The features function extracts the characteristics of a name that we’re considering. We expect our machine learning model to find correlations between these features and the gender.

We want our features function to work on lists (or arrays), since all the tools we’re going to be using work this way. Numpy offers a convenient function for vectorizing a function. After that, we apply the feature extraction function to the whole dataset. We name the result X; the targets (what we’re trying to predict) will be named y.
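
A sketch of that step, reusing the features function from above (the name and gender column names are assumptions about the cleaned dataset):

```python
import numpy as np

# np.vectorize lets `features` run over a whole array of names at once
features = np.vectorize(features)

X = features(names['name'].values)   # feature dicts, one per name
y = names['gender'].values           # the targets we're trying to predict
```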

Since the dataset is sorted, it’s a good idea to shuffle the data. Next, we split the data into two parts: one used for training and one used for testing. Keeping data aside for testing is essential for evaluating our model. We expect our model to perform well on the data it has seen; we want to make sure it also has predictive power, meaning it does well on unseen data too.
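
One way to do it, using scikit-learn’s shuffle helper and an 80/20 split (the ratio and random seed are illustrative):

```python
from sklearn.utils import shuffle

# Shuffle features and targets together so the pairs stay aligned
X, y = shuffle(X, y, random_state=42)

split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
```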

Here we introduce the scikit-learn (sklearn) framework. Scikit-learn is the most popular Python machine learning framework, and for good reason: it contains robust implementations of a lot of machine learning models, along with many other helpful utilities. Classifiers are mathematical models and can’t work with words or character sequences directly. Here’s where vectorizers come in: their role is to transform our features into feature vectors.
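
For dict-shaped features like ours, scikit-learn provides DictVectorizer. A sketch: train it on the training features, then transform two sample names to see what comes out:

```python
from sklearn.feature_extraction import DictVectorizer

# Learn the feature space from the training data
vectorizer = DictVectorizer()
vectorizer.fit(X_train)

# Vectorize two sample names and inspect the sparse result
print(vectorizer.transform(features(["Mary", "John"])))
```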

A vectorizer has to be trained to learn what the possible features and their possible values are. After a vectorizer is trained using the fit method, you can use it to vectorize your data using the transform method. Let’s analyse the result of transform. It returns a sparse matrix with 2 rows, because we transformed a list of 2 names, Mary and John. The columns tell us which features are present. For example, the first feature for Mary, denoted by (0, 12), is “first-letter=m”.
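
With the features vectorized, training the classifier is a one-liner. A sketch, with all hyperparameters left at their defaults:

```python
from sklearn.tree import DecisionTreeClassifier

# Train a decision tree on the vectorized training data
clf = DecisionTreeClassifier()
clf.fit(vectorizer.transform(X_train), y_train)
```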

Oh, in case you missed it, you just trained your first model. We used a DecisionTreeClassifier, one of the simplest and most popular machine learning models. Simply put, a DecisionTreeClassifier tries to extract discriminating rules from the features. The most discriminating rules sit higher up in the tree, while less discriminating ones sit towards the leaves. Here’s a very simple tree to get a sense of what’s going on:

[Image: a simple example decision tree]

This is a very oversimplified model, but its purpose is to illustrate how a decision tree works.

Here’s how to use the model to make predictions:
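
A sketch; the names below are just illustrations, and the exact labels printed depend on how the dataset encodes gender:

```python
# Vectorize new names with the fitted vectorizer, then classify them
print(clf.predict(vectorizer.transform(features(["Alex", "Emma"]))))
```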

Let’s measure how well our model is doing:
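
The classifier’s score method reports accuracy. Comparing the training and test splits shows whether the model generalizes:

```python
# Accuracy on seen (training) data vs. unseen (test) data
print(clf.score(vectorizer.transform(X_train), y_train))
print(clf.score(vectorizer.transform(X_test), y_test))
```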

Here’s what our actual model looks like (a collapsed version):

[Image: the trained decision tree, collapsed]
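
If you want to render a picture like this yourself, scikit-learn can export a trained tree to Graphviz format (a sketch; rendering the .dot file requires the separate graphviz tool):

```python
from sklearn.tree import export_graphviz

# Dump the trained tree in Graphviz .dot format; render it with e.g.:
#   dot -Tpng tree.dot -o tree.png
export_graphviz(clf, out_file='tree.dot',
                feature_names=vectorizer.feature_names_)
```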

Here’s what the root of the tree tells us: it discriminates on whether feature 4470 is less than or equal to 0.5 (meaning the feature is absent). If you want to find out what feature 4470 is, you can do this:
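
DictVectorizer keeps the learned feature names in its feature_names_ attribute, so the lookup is a simple index:

```python
# Map the tree's column index back to a readable feature name
print(vectorizer.feature_names_[4470])   # prints 'last-letter=a'
```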

last-letter=a is a strongly discriminating feature, because names ending in the letter a are mostly girl names.

If this was your first trained model, please congratulate yourself. You’ve learned quite a lot! Can you think of applications of classifiers? What do you think are some cool uses?