Over the past month I’ve been working through an introductory Machine Learning course with Python. Being completely new to Python and finding the pace of the course quite high, I wanted to write up a post with a handful of tips and tricks I picked up before moving on to the next module in the program.
As a case study for this post, I put together a table of Canadian data to explore the relationships behind the GDP values of each of Canada’s provinces and territories. By the end of the post our goal is to come up with a model that will predict the GDP of a region given a set of key features.
If you have a Jupyter Notebook environment available you can download this post’s Notebook file on GitHub. 👈 Cool - Jupyter Notebooks render out to HTML on GitHub!
Here’s the initial dataset mentioned above, which includes the features Population, Land (km^2), and Unemployment rate (%), as well as the target value GDP (million, CAD) (making this a Regression problem). Data has been collected primarily from articles on Wikipedia:
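As a rough sketch of the loading step, assuming the data sits in a CSV with one row per region: the rows below are made-up placeholders (not the real Wikipedia figures), and the "Region" column name is my own assumption.

```python
import io

import pandas as pd

# Placeholder rows standing in for the real file. These numbers are invented
# for illustration and are NOT the figures used in this post.
csv_text = """Region,Population,Land (km^2),Unemployment rate (%),"GDP (million, CAD)"
Region A,1000000,500000,6.5,80000
Region B,250000,70000,8.2,15000
Region C,40000,1200000,10.1,3000
"""

# With a real file this would simply be pd.read_csv("path/to/your/file.csv")
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

In a Jupyter Notebook, ending the cell with a bare `df` instead of `print(df)` renders the DataFrame as an HTML table.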
If all goes well when the above code is run, it should render a table similar to:
We’ll use this dataset as the subject of our investigation for the remainder of this post.
Most real-world datasets won’t fit on your laptop monitor, so you’ll need to examine their data programmatically. Looking at our data and searching for factors that may lead to a higher GDP, let’s begin by counting the number of provinces with “high” unemployment vs. “low”, where a province is classified as “high” if its unemployment rate is above 7% (an arbitrary number that I picked):
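One way this count could be done, sketched on a small made-up DataFrame (the regions and rates are placeholders; only the 7% threshold comes from the text above):

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
df = pd.DataFrame({
    "Region": ["Region A", "Region B", "Region C", "Region D"],
    "Unemployment rate (%)": [6.5, 8.2, 10.1, 5.9],
})

THRESHOLD = 7.0  # the arbitrary cut-off picked above

# Label each row "high" or "low", then count how often each label occurs
labels = np.where(df["Unemployment rate (%)"] > THRESHOLD, "high", "low")
counts = pd.Series(labels).value_counts()
print(counts)
```

A boolean mask works too: `(df["Unemployment rate (%)"] > THRESHOLD).sum()` gives the number of "high" rows directly.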
Reshaping Numpy arrays
I kept running into functions which required their params to be 2d arrays, and found myself often converting a 1d array to 2d for processing. Here is how that can be done easily with Numpy:
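A minimal sketch of the conversion using `reshape`, where `-1` asks Numpy to infer that dimension from the array's length:

```python
import numpy as np

a = np.array([1, 2, 3, 4])   # 1d array, shape (4,)

# One column, as many rows as needed: the layout most estimators
# expect for a single feature
column = a.reshape(-1, 1)    # shape (4, 1)

# One row, as many columns as needed: a single sample with 4 features
row = a.reshape(1, -1)       # shape (1, 4)

print(column.shape, row.shape)
```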
Classifying samples by their features
To further investigate this data, let’s generate an additional feature based on the total GDP for the region. We’ll make it a numeric value from 1 to 3, where a class of 1 indicates a relatively small GDP, 2 for medium, and 3 for large.
We can generate this feature by calling apply on the DataFrame, and passing in a function to execute on each sample of the data. By passing axis=1, the function will be applied to each row, and the value returned from the function will be included as a new feature.
Here’s how generating a GDP size class could be implemented for our problem:
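The sketch below shows the general shape of it. The cut-off values, the data, and the "GDP size class" column name are all hypothetical; I chose them only to make the example run.

```python
import pandas as pd

GDP_COL = "GDP (million, CAD)"

# Made-up values for illustration only
df = pd.DataFrame({
    "Region": ["Region A", "Region B", "Region C"],
    GDP_COL: [80000, 15000, 3000],
})

def gdp_size_class(row):
    """Return 1 (small), 2 (medium) or 3 (large) for one row of the DataFrame."""
    gdp = row[GDP_COL]
    if gdp < 10000:      # hypothetical cut-off for "small"
        return 1
    if gdp < 50000:      # hypothetical cut-off for "medium"
        return 2
    return 3

# axis=1 hands each row to the function; the returned values form the new column
df["GDP size class"] = df.apply(gdp_size_class, axis=1)
print(df)
```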
Looking for features that will be good at predicting GDP
Now that we have our samples classified as small, medium, or large, we can use the pandas.plotting.scatter_matrix function to plot a matrix of scatter plots with each set of features paired up. The samples in each plot will be color coded so we can see at a glance which features are correlated with a high GDP, and how they are grouped. Here’s how this code will look for our dataset:
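A minimal sketch, again on made-up rows. The `c` argument is forwarded to matplotlib's scatter call and colours each point by its class; the "GDP size class" column name is an assumption on my part.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs anywhere; drop this line in a notebook

import pandas as pd
from pandas.plotting import scatter_matrix

# Made-up values for illustration only
df = pd.DataFrame({
    "Population": [1000000, 250000, 40000, 600000],
    "Land (km^2)": [500000, 70000, 1200000, 300000],
    "Unemployment rate (%)": [6.5, 8.2, 10.1, 5.9],
    "GDP size class": [3, 2, 1, 3],
})

features = df[["Population", "Land (km^2)", "Unemployment rate (%)"]]

# One scatter plot per pair of features, coloured by GDP size class;
# the diagonal shows a histogram of each individual feature
axes = scatter_matrix(
    features,
    c=df["GDP size class"].to_numpy(),
    figsize=(8, 8),
    marker="o",
)
```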
Here’s how the resulting matrix looks:
What can we learn from this matrix? Well, it appears that some feature combinations like Land and Population lead to similar coloured points on the plots being grouped together. These are features which could be useful for predicting GDP. On the other hand, the plots involving Unemployment don’t seem particularly well grouped, which might indicate that this feature is less significant to a region’s GDP.
Comparing models for Regression
Let’s make some predictions! The goal of the exercise below is to pick the best model for predicting the GDP of a previously unseen province or territory. The splitting of the data will therefore look as follows, with y values being continuous (as opposed to using the “GDP size label” classification we defined above):
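A sketch of such a split with scikit-learn's `train_test_split`, on made-up rows. The `random_state` value and the default 75/25 split are my assumptions, not necessarily what the notebook uses.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

GDP_COL = "GDP (million, CAD)"
FEATURES = ["Population", "Land (km^2)", "Unemployment rate (%)"]

# Made-up values for illustration only
df = pd.DataFrame({
    "Population": [1000000, 250000, 40000, 600000, 90000, 3000000, 150000, 750000],
    "Land (km^2)": [500000, 70000, 1200000, 300000, 5000, 900000, 400000, 650000],
    "Unemployment rate (%)": [6.5, 8.2, 10.1, 5.9, 9.4, 6.1, 12.0, 7.3],
    GDP_COL: [80000, 15000, 3000, 45000, 6000, 250000, 9000, 60000],
})

X = df[FEATURES]   # feature matrix
y = df[GDP_COL]    # continuous target, so this is regression

# By default 25% of the rows are held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape, X_test.shape)
```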
We covered a variety of learning models in the first month of the course I’m taking. The approach of choosing the correct model for the task at hand is still black magic to me, so I decided to evaluate a few different approaches and “score” them on how well they generalize to the test data that was split above.
The section titled “Comparing models for Regression” in the notebook (sorry, can’t seem to deep link into the Notebook file) contains the code covering the use of the five approaches below, as well as the R-squared score (aka the coefficient of determination) of each on the training and test sets:
- Linear regression
- Ridge regression
- Ridge regression with normalization
- Polynomial regression
- Decision Tree regression
Since the model’s fit() step and the scoring are similar for each approach I won’t repeat the code here, but encourage you to check out the notebook if you’d like to see how each approach performed.
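For a feel of the overall pattern without opening the notebook, here is a rough sketch on synthetic data. It is not the notebook's code: the data is randomly generated, and `MinMaxScaler`, `alpha=1.0`, and `degree=2` are my assumptions about how the variants might be set up. The scores it prints have nothing to do with the results discussed in this post.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: 40 samples, 3 features, a mostly linear target plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(40, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, size=40)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Linear regression": LinearRegression(),
    "Ridge regression": Ridge(alpha=1.0),
    # Assumption: "normalization" here means min-max scaling of the features
    "Ridge regression with normalization": make_pipeline(MinMaxScaler(), Ridge(alpha=1.0)),
    "Polynomial regression": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "Decision Tree regression": DecisionTreeRegressor(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # For regressors, score() returns R-squared
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train R^2 = {scores[name][0]:.3f}, test R^2 = {scores[name][1]:.3f}")
```

Putting the scaler inside a pipeline means it is fitted on the training data only, so nothing about the test set leaks into the model.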
Which performed best? In this case, the Ridge regression with normalization model performed the best, predicting GDP with an R-squared score of 0.954 on the test data.
NOTE! This is a contrived example, and is extremely light on the number of samples. There is nothing at all wrong with the models that performed poorly on this dataset.
Decision tree regression
Let’s take a closer look at the decision tree regression model. This model is interesting because it ranks the importance of the features in determining the result:
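As a sketch of how such a ranking can be read off a fitted tree, here is a version on synthetic data in which the first feature drives the target by construction. The numbers it prints are therefore not the notebook's numbers.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

feature_names = ["Population", "Land (km^2)", "Unemployment rate (%)"]

# Synthetic data: the target depends almost entirely on the first column
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 3))
y = 10 * X[:, 0] + rng.normal(0, 0.1, size=30)

tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# feature_importances_ holds one weight per feature; the weights sum to 1
ranked = sorted(
    zip(feature_names, tree.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
print("Important features:", ranked)
```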
In the notebook, the final print statement for this model yields the following:
Important features: [ ('Population', 0.9868237285875826), ('Land (km^2)', 0.013176271412417347), ('Unemployment rate (%)', 0.0) ]
👆 based on this ranking, and from the perspective of building an effective decision tree, Population is the most important feature by an overwhelming margin. Which makes sense intuitively: the more people, the higher the GDP.
Additionally, as we suspected by looking at the scatter matrix above, unemployment rate does not seem to factor in much at all.
Joining DataFrames on a common column
Lastly, I’d like to highlight a handy method I used to merge (or join in SQL terms) two dataframes together on a common column.
Let’s say there was additional data which we wanted to include as part of our model, but it did not exist in the .csv file from which we read the rest of the data. With the pandas.merge method, two DataFrames can be merged together with very little fuss. Here’s an example of merging in another DataFrame which includes details on the amount of water in each region:
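A minimal sketch of such a merge. The "Region" key, the "Water (km^2)" column, and every value below are made up for illustration.

```python
import pandas as pd

# Left frame: the main dataset (made-up values)
df = pd.DataFrame({
    "Region": ["Region A", "Region B", "Region C"],
    "Population": [1000000, 250000, 40000],
})

# Right frame: extra data keyed on the same column (made-up values).
# It has no row for Region C, and an extra row for Region D.
water = pd.DataFrame({
    "Region": ["Region A", "Region B", "Region D"],
    "Water (km^2)": [1000, 250, 40],
})

merged = pd.merge(df, water, how="left", on="Region")
print(merged)
```

Region C stays in the result with a missing (NaN) water value, while Region D is dropped because it has no match in the left frame.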
To note from the above, how='left' tells the merge to operate similarly to an SQL left outer join: all keys from the left frame will be included in the resulting DataFrame, but keys in the right frame without a match on the left will be excluded.
This has been a great course so far and I am learning a ton. If you would like to check it out for yourself, click here: coursera.org/learn/python-machine-learning
Next up: Neural Networks! 🤓