Is it a boy or a girl? An introduction to Machine Learning
Have you ever noticed what happens when you hear a name you haven’t heard before? You automatically put it in a bucket: the girl-names bucket or the boy-names bucket. In this tutorial, we’re getting started with machine learning by building a classifier able to distinguish between boy and girl names. If this sounds interesting, read along. If you expect a tonne of intricate math, read along anyway. It’s easier and more fun than you think.
The US Social Security Administration publishes annual datasets with the popularity of baby names over the years. The datasets also contain the associated gender along with the number of births. We’re interested only in name-gender pairs. You can build your own dataset by downloading the data from https://www.ssa.gov/oact/babynames/limits.html.
For convenience, I already did this for you. Here’s a cleaned dataset containing only the data of interest: names_dataset
Let’s start by reading the data:
import pandas as pd
import numpy as np

names = pd.read_csv('names_dataset.csv')
print(names)

print("%d names in dataset" % len(names))
# 95025 names in dataset
We’re using the pandas library here for convenience: it offers a simple method for reading a dataset in CSV format. The resulting object is called a DataFrame. It offers a lot of helpful features for analysing data, though we won’t need most of them in this scenario.
# Get the data out of the DataFrame into a numpy matrix and keep only the name and gender columns
# (to_numpy() replaces as_matrix(), which was removed from recent pandas versions)
names = names.to_numpy()[:, 1:]
print(names)

# We're using 80% of the data for training
TRAIN_SPLIT = 0.8
Have you ever wondered what clues our intuition uses to predict whether a name belongs to a girl or a boy? My best guess is that we use heuristics: we know that names ending in certain letters are usually boy names, while names ending in other letters are usually girl names. Let’s try to teach a machine learning model these heuristics.
def features(name):
    name = name.lower()
    return {
        'first-letter': name[0],      # first letter
        'first2-letters': name[0:2],  # first 2 letters
        'first3-letters': name[0:3],  # first 3 letters
        'last-letter': name[-1],      # last letter
        'last2-letters': name[-2:],   # last 2 letters
        'last3-letters': name[-3:],   # last 3 letters
    }

print(features("John"))
# {'first2-letters': 'jo', 'last-letter': 'n', 'first-letter': 'j', 'last2-letters': 'hn', 'last3-letters': 'ohn', 'first3-letters': 'joh'}
The features function extracts the characteristics of a name that we consider relevant. We expect our machine learning model to find correlations between these features and the gender.
# Vectorize the features function
features = np.vectorize(features)
print(features(["Anna", "Hannah", "Paul"]))
# [ array({'first2-letters': 'an', 'last-letter': 'a', 'first-letter': 'a', 'last2-letters': 'na', 'last3-letters': 'nna', 'first3-letters': 'ann'}, dtype=object)
#   array({'first2-letters': 'ha', 'last-letter': 'h', 'first-letter': 'h', 'last2-letters': 'ah', 'last3-letters': 'nah', 'first3-letters': 'han'}, dtype=object)
#   array({'first2-letters': 'pa', 'last-letter': 'l', 'first-letter': 'p', 'last2-letters': 'ul', 'last3-letters': 'aul', 'first3-letters': 'pau'}, dtype=object)]

# Extract the features for the whole dataset
X = features(names[:, 0])  # X contains the features

# Get the gender column
y = names[:, 1]            # y contains the targets

# Test if we built the dataset correctly
print("Name: %s, features=%s, gender=%s" % (names[0][0], X[0], y[0]))
# Name: Mary, features={'first2-letters': 'ma', 'last-letter': 'y', 'first-letter': 'm', 'last2-letters': 'ry', 'last3-letters': 'ary', 'first3-letters': 'mar'}, gender=F
We want our features function to work on lists (or arrays), since all the tools we’re going to use work that way. Numpy offers a convenient function for vectorizing a function. After that, we apply the feature extraction function to the whole dataset and name the result X. The targets (what we’re trying to predict) are named y.
from sklearn.utils import shuffle

X, y = shuffle(X, y)
X_train, X_test = X[:int(TRAIN_SPLIT * len(X))], X[int(TRAIN_SPLIT * len(X)):]
y_train, y_test = y[:int(TRAIN_SPLIT * len(y))], y[int(TRAIN_SPLIT * len(y)):]

# Check that the split sizes add up
print(len(X_train), len(X_test), len(y_train), len(y_test))
# 76020 19005 76020 19005
Since the dataset is sorted, it’s a good idea to shuffle the data. Next, we split it into two parts: one used for training and one used for testing. Keeping data aside for testing is essential for evaluating our model. We expect the model to perform well on data it has seen; we want to make sure it also has predictive power, meaning it does well on unseen data too.
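As an aside, scikit-learn (which we introduce just below) bundles this shuffle-and-split pattern into a single helper. A minimal sketch, assuming a reasonably recent scikit-learn version:

from sklearn.model_selection import train_test_split

# Shuffle and split in one call; test_size matches 1 - TRAIN_SPLIT
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1 - TRAIN_SPLIT)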
from sklearn.feature_extraction import DictVectorizer

print(features(["Mary", "John"]))

vectorizer = DictVectorizer()
vectorizer.fit(X_train)
transformed = vectorizer.transform(features(["Mary", "John"]))
print(transformed)
"""
  (0, 12)    1.0
  (0, 244)   1.0
  (0, 2722)  1.0
  (0, 4516)  1.0
  (0, 4827)  1.0
  (0, 5147)  1.0
  (1, 9)     1.0
  (1, 199)   1.0
  (1, 2263)  1.0
  (1, 4505)  1.0
  (1, 4640)  1.0
  (1, 7202)  1.0
"""

print(type(transformed))
# <class 'scipy.sparse.csr.csr_matrix'>

print(transformed.toarray()[0][12])
# 1.0

print(vectorizer.feature_names_[12])
# first-letter=m
Here we introduce the scikit-learn (sklearn) framework. Scikit-learn is the most popular Python machine learning framework, and for good reason: it contains robust implementations of a lot of machine learning models, plus a lot of other helpful utilities. Classifiers are mathematical models and can’t work with words or character sequences directly. Here’s where vectorizers come in: their role is to transform our features into feature vectors.
A vectorizer has to be trained to learn what the possible features and their values are. After a vectorizer is trained using the fit method, you can use it to vectorize your data using the transform method. Let’s analyse the result of transform: it returns a sparse matrix with 2 rows, because we transformed a list of 2 names, Mary and John. The columns tell us which features are present. For example, the first feature for Mary, denoted by (0, 12), is “first-letter=m”.
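To convince yourself of this mapping, DictVectorizer can also go the other way: its inverse_transform method recovers the feature dictionaries from the vectors. A quick sketch:

# Recover the feature dicts from the sparse vectors
# (all values are 1.0 here because our features are one-hot encoded)
print(vectorizer.inverse_transform(transformed))
# e.g. [{'first-letter=m': 1.0, 'first2-letters=ma': 1.0, ...}, {'first-letter=j': 1.0, ...}]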
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(vectorizer.transform(X_train), y_train)
Oh, in case you missed it: you just trained your first model. We used a DecisionTreeClassifier, one of the most popular and simplest machine learning models. Simply put, a DecisionTreeClassifier tries to extract discriminating rules from the features. The most discriminating rules sit near the root of the tree, while less discriminating ones sit towards the leaves. Here’s a very simple tree to get a sense of what’s going on.
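A decision tree is essentially a cascade of yes/no questions about the features, so we can write a made-up miniature one as plain Python rules (the rules below are hand-picked for illustration, not taken from the trained model):

# A tiny, hand-written "decision tree" (illustrative only):
# each if/else is one yes/no question about a feature
def toy_predict(name):
    name = name.lower()
    if name[-1] == 'a':    # root question: last-letter=a?
        return 'F'
    elif name[-1] == 'o':  # next question down: last-letter=o?
        return 'M'
    else:
        return 'M'         # leaf: default guess

print(toy_predict("Maria"), toy_predict("Marco"))
# F M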
This is a very oversimplified model, but its purpose is to illustrate how the real model works.
Here’s how to use the model to make predictions:
print(clf.predict(vectorizer.transform(features(["Alex", "Emma"]))))
# ['M' 'F']
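If you also want a confidence estimate alongside each prediction, pick a classifier that supports predict_proba and use it instead of predict; DecisionTreeClassifier supports it and returns one probability per class:

# Per-class probability estimates; columns are ordered as in clf.classes_
print(clf.classes_)
print(clf.predict_proba(vectorizer.transform(features(["Alex", "Emma"]))))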
Let’s measure how well our model is doing:
# Accuracy on the training set
print(clf.score(vectorizer.transform(X_train), y_train))
# 0.988292554591 = 98.8% accurate

# Accuracy on the test set
print(clf.score(vectorizer.transform(X_test), y_test))
# 0.863246514075 = 86.3% accurate
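Notice the gap between training and test accuracy: an unconstrained tree memorizes a lot of the training data. As a hypothetical experiment (max_depth=10 is an arbitrary value, and your scores will vary), you can cap the tree depth and observe the effect on the test score:

# Hypothetical experiment: constrain tree depth to reduce overfitting
clf_small = DecisionTreeClassifier(max_depth=10)
clf_small.fit(vectorizer.transform(X_train), y_train)
print(clf_small.score(vectorizer.transform(X_test), y_test))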
Here’s what our actual model looks like (a collapsed version):
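The full tree is far too large to print in one piece, but you can render a truncated view of it yourself. A sketch, assuming a recent scikit-learn version (0.21 or later, which provides export_text):

from sklearn.tree import export_text

# Print a truncated, human-readable view of the top of the trained tree
print(export_text(clf, feature_names=vectorizer.feature_names_, max_depth=2))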
Here’s what the root of the tree tells us: it discriminates on whether feature 4470 is less than or equal to 0.5 (in other words, whether the feature is absent). If you want to find out what feature 4470 is, you can do this:
print(vectorizer.feature_names_[4470])
# last-letter=a
last-letter=a is a strongly discriminating feature, because names ending in the letter a are mostly girl names.
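You can check this claim directly against the data. A quick sketch using the names matrix we built earlier:

# Count the genders among names ending in 'a'
genders_for_a = [gender for name, gender in names if name.lower().endswith('a')]
print("%d of %d names ending in 'a' are labelled F" % (genders_for_a.count('F'), len(genders_for_a)))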
If this was your first trained model, please congratulate yourself. You’ve learned quite a lot! Can you think of applications of classifiers? What do you think are some cool uses?