Let’s get started with some Machine Learning basics! We'll get your Machine Learning development environment set up and tested.
Here’s where the fun part begins. We’re going to set up your machine learning environment and run our first classifier. It won’t do a whole lot compared to where we’ll ultimately end up, but it’s a good start.
This is truly the "hello world" of machine learning. While what we achieve here isn't about to set the world on fire with amazing data insights, it will give you a critical understanding of how machine learning produces outcomes.
We’re going to assume you’re doing this on a Raspberry Pi. If you would like to build a Virtual Machine to handle this instead of using up an actual Pi, check out our Virtual Pi article, also published in this issue. It gives you the Pi without having to dedicate hardware. That said, we’ll be working in Python, so the code is fairly standard, and you should be able to follow everything past the installation steps regardless of your platform.
BASIC SYSTEM UPDATES
Before anything else, you should always ensure your packages and package lists are up to date. This ensures you get the latest code from the repositories whenever you do an installation.
sudo apt-get update
Now we actually upgrade your system.
sudo apt-get upgrade
Let the system do its thing - it may take a few minutes depending on your internet speed and how many packages need upgrading.
Once it’s complete, you're ready to continue. If lots of packages were upgraded, a reboot is useful. The system may also prompt you to restart, in which case it's highly recommended.
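If you'd like to trigger that restart yourself from a terminal, it's simply:

sudo reboot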
GETTING STARTED WITH SCIKIT-LEARN
Scikit-Learn is what we’re going to use for a while in this machine learning series, and in this part, we’re going to get the environment set up.
Scikit-Learn is a free machine learning library. It’s pronounced like sy - kit - learn. Sci as in science, and the rest is fairly self explanatory. The project started in 2007 as a Google Summer of Code project by David Cournapeau, and grew from there. The first public release was in 2010, so it’s quite well established with a thriving development community.
It’s a great starting point for exploring machine learning because it includes a bunch of useful algorithms and has a fairly gentle learning curve compared to some alternatives.
There are packages for Windows, Mac, and Linux distributions, so you can probably get scikit-learn for your favourite operating system. For our purposes as makers, we’ll use a Raspberry Pi environment in all of our examples. If you have a fresh install of the latest Raspbian desktop, then you can probably just install the scikit-learn packages.
There’s a package available using the Debian package manager, which helps ensure any dependencies are available. We’re using Debian Buster (the most recent stable release at the time of writing).
sudo apt-get install python3-sklearn python3-sklearn-lib
There are also official packages available via the PIP package manager, which is cross-platform (though you may need to install PIP itself first). To install scikit-learn using PIP, use the following:
pip3 install -U scikit-learn
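Whichever method you used, you can quickly confirm that scikit-learn installed correctly, before writing any scripts, with a one-liner in the terminal:

python3 -c "import sklearn; print(sklearn.__version__)"

If everything installed properly, this simply prints the installed version number.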
Now you need to create a Python script and confirm that scikit-learn is available.
On Raspbian, open up Mu, a simple Python 3 editor bundled with the Raspbian desktop. You’ll find it under the Programming group in the desktop menu. You can use whatever Python editor you like, or just a text editor, but an IDE like Mu will help you along the way.
Save your blank file somewhere convenient and name it something useful such as machine_learning_hello_world.py.
Now we’re going to start this script off simply, with just two lines.
import sklearn
print("I work! Hello Machine Learning world!");
As you may have guessed, this script simply imports the Scikit-Learn library. Run the code. If something hasn't gone right during installation, it'll fail, and you'll need to retrace your steps.
Note: You may notice that we get a warning about a deprecated module. This will probably be sorted out by the scikit-learn development team, and we’ll ignore that for now.
All things being equal, the script should print your message with no errors. This means you're ready to get started with Machine Learning! We're going to run through a few examples now, and explain them further in future parts of this series.
YOUR FIRST MACHINE LEARNING OUTCOME
This is the exciting part - we're going to build our first machine learning program!
In Part Zero, we demonstrated how machine learning could predict a fruit, based on characteristics. We’ll continue that example here, with a slightly simplified dataset. You may notice that the term we use for the outcome is a prediction, because it is just that. It's a prediction based on available / trained / historic data.
Here we’re going to use supervised learning to train a decision tree classifier. Supervised learning means we train our model with known data, so it can make predictions on unknown data. The decision tree is the method our model will use to classify the data later.
We have the table below, with weight, skin thickness, and type.

WEIGHT (GRAMS)   SKIN     TYPE
100              Thin     Apple
110              Thin     Apple
140              Thick    Orange
150              Thick    Orange
In order to use this in machine learning, we’ll transform all of our values to integers. This simplifies the computational effort required. Our weights can stay as they are, since they’re already integers, but we’ll translate our text values to integers, which is fairly simple. It doesn’t matter which integers you transform to, as long as you can translate the results back. Our table now looks like this:

WEIGHT (GRAMS)   SKIN     TYPE
100              1        1
110              1        1
140              2        2
150              2        2
As you can see, it’s a little more difficult for us to read, but it’s much faster for a computer to interpret!
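If you'd prefer to do that translation in code rather than by hand, here's a minimal sketch of one approach. The dictionaries and variable names are our own, purely for illustration - nothing about them is required by scikit-learn:

# Lookup tables for translating words to integers (and back again)
skin_to_int = {"thin": 1, "thick": 2}
fruit_to_int = {"apple": 1, "orange": 2}
int_to_fruit = {1: "apple", 2: "orange"}

# The human-readable table from above
readable = [[100, "thin", "apple"], [110, "thin", "apple"],
            [140, "thick", "orange"], [150, "thick", "orange"]]

# Translate each row into the integer form the classifier needs
features = [[weight, skin_to_int[skin]] for weight, skin, fruit in readable]
labels = [fruit_to_int[fruit] for weight, skin, fruit in readable]

print(features)          # [[100, 1], [110, 1], [140, 2], [150, 2]]
print(labels)            # [1, 1, 2, 2]
print(int_to_fruit[2])   # orange - translating a result back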
In order to use scikit-learn's decision tree, we now change our import statement to bring in the decision tree classifier as well (we'll discuss classifier selection later), so our first line of code gets a quick update.
from sklearn import tree
Now we’ll get our example data into our Python code so we can use it. For machine learning, we have to split the data in what may seem an odd way.
features = [[100, 1], [110, 1], [140, 2], [150, 2]]
labels = [1, 1, 2, 2]
Not following? We have essentially split the TYPE, which is what our classifier will be asked to predict, into a separate array.
Still not following? What if we put the labels back in?
features = [ [100, thin], [110, thin], [140, thick], [150, thick] ]
labels = [ apple, apple, orange, orange ]
In our first array, we have our features. Think of them as the clues to the question “what am I”.
In our second array, we have our labels. Think of them as the answers to the same question.
Now, let’s train our classifier using our sample data. This is teaching the classifier how to make predictions by feeding it the questions and answers together.
decisionMaker = tree.DecisionTreeClassifier()
decisionMaker = decisionMaker.fit(features, labels)
This creates a variable called decisionMaker (you can call it whatever you like, though descriptive names help while you're getting the hang of it), which holds a DecisionTreeClassifier. We then tell it to fit (learn from) our features and labels - the questions and answers.
You can run this script if you like, and check it for errors. Congratulations, you've successfully trained your first machine learning classifier!
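Pulled together, your script so far should look something like this (the comments are ours, just for readability):

from sklearn import tree

# Our training data: [weight in grams, skin thickness]
features = [[100, 1], [110, 1], [140, 2], [150, 2]]

# The matching answers: 1 = apple, 2 = orange
labels = [1, 1, 2, 2]

# Create a decision tree classifier and train (fit) it on our data
decisionMaker = tree.DecisionTreeClassifier()
decisionMaker = decisionMaker.fit(features, labels)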
Now that our code has trained our classifier using our example data, we can ask it to make a prediction about a new fruit, within the scope of the data it's aware of. We do this with the "predict" method.
We’ll ask it what it thinks a 135g piece of fruit, with a thick skin is.
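# Ask the trained classifier about a 135g, thick-skinned fruit
print(decisionMaker.predict([[135, 2]]))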
As you can see, it gives us a [2], which is an orange! Sure, we only got a single number, but that's a true machine learning outcome!
There was no data that told our model that oranges might weigh 135g, or that oranges always have thick skins. The data is what wrote the rulebook for classification, and that is machine learning!
SAME DATA, DIFFERENT OUTCOMES
We passed in some data that was quite clearly an orange in the previous example. So what if we pass in data that's a little less obvious?
print(decisionMaker.predict([[130, 2]]))
Here, we’ve given it an object with a weight of 130g, which sits squarely between our heaviest apple (110g) and our lightest orange (140g). Run the script ten times, and you may get a different result from one run to the next!
Different results can come from precisely the same code snippet shown above, with no changes made at all.
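If you'd rather not launch the script ten times yourself, one purely illustrative variation re-trains the classifier and predicts in a loop, all within a single script:

from sklearn import tree

features = [[100, 1], [110, 1], [140, 2], [150, 2]]
labels = [1, 1, 2, 2]

# Re-train and predict ten times, so any run-to-run variation
# shows up without relaunching the script by hand
for run in range(10):
    decisionMaker = tree.DecisionTreeClassifier()
    decisionMaker = decisionMaker.fit(features, labels)
    print(run + 1, decisionMaker.predict([[130, 2]]))

Depending on your scikit-learn version and data, you may see the same answer on every pass, or a mix of 1s and 2s.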
You might think this is something of a problem; however, it also serves as an example of how important the quality and size of the dataset is. It also demonstrates that training a model is a fluid process. Unlike hand-written code, you may not get the same outcome each time it's run, but this is a good thing.
Machine learning simply needs more data to make accurate new predictions. Without adequate data to learn from, your classifier can be thought of as inadequately trained. In our example, if we had provided a larger dataset to learn from, this seemingly random variability would reduce or disappear altogether.
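To experiment with that idea yourself, you could hand the classifier a few extra samples before fitting it. The weights below are invented purely for illustration - they're not real measurements:

# Hypothetical extra fruits, invented for illustration only
features = [[95, 1], [100, 1], [105, 1], [110, 1],
            [135, 2], [140, 2], [150, 2], [155, 2]]
labels = [1, 1, 1, 1, 2, 2, 2, 2]

With more examples on either side of the boundary, the classifier has more evidence about where apples end and oranges begin.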
Think about how humans learn. If you train a surgeon how to perform a lung transplant, you can’t simply show them a video about lung transplants and assume everything will be fine. For something that complex, there are years and years of practical and theoretical training involved. That is the surgeon gathering data, training their model. Once adequately trained, they can apply that knowledge to successfully perform a lung transplant on a new patient.
CLOSEST MATCH
Remember that machine learning classifiers will almost always try to give you an answer.
While you can certainly program in exit routines, most of the classifiers we'll be dealing with for now will always attempt to give you a result. Even if the data you ask it to predict on is an outlier, it will still attempt to provide an answer. This can be good or bad, depending on what you're doing; however, we'll deal with this further down the track.
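For instance, continuing with our trained decisionMaker from earlier, try asking about a made-up 1000g fruit, far beyond anything in the training data:

# 1000g is well outside our training data, but predict()
# will still return one of the labels it already knows
print(decisionMaker.predict([[1000, 2]]))

The classifier will dutifully answer with a 1 or a 2, even though neither may be a sensible match.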
IN SUMMARY
If you’ve worked through this example thoroughly, you'll start to see how machine learning is very different from traditional coding, and how critically important data is.
We have now established the following:
- We provide our classifier data, and the classifier writes the virtual rulebook for making future predictions.
- The classifier will attempt to classify new data it hasn’t seen before, based on the training.