Artificial Intelligence

Decision trees and regression algorithms

Rob Bell

Issue 31, February 2020

Let's continue our education in Machine Learning by visualising a decision tree, and experimenting with regression algorithms.

In part 1, we gave you a quick introduction into the world of Machine Learning, and you got to write your first Machine Learning code!

If you haven’t read Part 1 (or Part Zero - the pre-code primer) and you’re unfamiliar with Machine Learning, it’s highly recommended that you go back and check those out. They lay important foundations that are required for a thorough understanding of Machine Learning.

We hope that your understanding of Machine Learning continues to grow, and you’re following along with us. If not, please reach out and ask questions, or leave comments on the online articles.

UNDERSTANDING THE DECISIONS

In Part 1, we ran some quick classifications by training a model with very simple data, and asking it to classify (i.e. make a prediction) some unseen data using machine learning.

One interesting aspect we noted, however, is that with machine learning you can end up with different results each time you run the script. That’s because of how the model actually trains: the observations it makes about the data can differ from run to run, and that affects the outcomes. For those examples, we used a relatively straightforward algorithm called DecisionTreeClassifier.

One amazing thing about using the DecisionTree model is that we can actually have our system visualise the process it’s using to make predictions.

This is one of the best ways to understand what Machine Learning does behind the scenes, and one of the few models which can be easily visualised, because the result can be viewed in a linear way.

When we have multi-dimensional assessments taking place with algorithms such as k-nearest-neighbors, it’s virtually impossible to visualise it in any way that’s useful, unless we look at the data one feature at a time.

While plotting data certainly has functional uses, it doesn’t fully capture the inter-feature relationships which are usually being considered by the algorithm.

If we lost you there for a second, consider this: plotting the data for one feature only is like looking at a star through a telescope. Sure, you can see stars one by one, but you need to zoom out to see the bigger picture.

So naturally, we’ll stick with the DecisionTreeClassifier model we used in the last instalment, as it’s perfectly suitable for this purpose.

In order to translate our data into something visual, we’ll use two packages: Pydotplus and Graphviz.

INSTALLING PYDOTPLUS

Pydotplus is a powerful package that helps visualise data using Graphviz. Its predecessor, Pydot, was very popular, but was only compatible with older Python versions. Since we're using Python 3, Pydotplus is the go-to package.

Fortunately commands between both Pydot and Pydotplus are very similar in many ways, so if you happen to be familiar with Pydot, you should find your way around easily.

Now, you might jump ahead and use the Raspbian package manager...

`sudo apt-get install pydotplus`

or even PIP...

`pip install pydotplus`

However, you will likely find that these don’t work. Sure, they’ll install, but you won’t be able to actually import the library into your Python code.

Part of this is that we’re running Python3. In order to ensure we get the library into the right location when running PIP, we can use the following:

`python3 -m pip install pydotplus`

This python3 prefix ensures the library is installed for the Python 3 interpreter, so it can be imported in your Python 3 scripts without any problems.
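One way to confirm that the package landed where your Python 3 interpreter can see it is to ask the interpreter itself. The following is a small sketch using only the standard library; the helper name `is_importable` is our own, made up for illustration.

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if this interpreter can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Show which interpreter is running, then check for the package.
print("Interpreter:", sys.executable)
print("pydotplus importable:", is_importable("pydotplus"))
```

Run it with `python3` so that you are testing the same interpreter that will run your Machine Learning scripts.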

Unfortunately, quirks like this (which are simply differences in how libraries and various applications are used) can create huge roadblocks to getting things going.

INSTALLING GRAPHVIZ

You’ll also need to install graphviz on your system, which unlike the previous step, is best handled with the package manager for Raspbian.

`sudo apt-get install graphviz`

There are quite a few packages that will install. When installing system packages, you will often be prompted to confirm the installation, and shown an estimate of the additional disk space required.

Once that's finished, you should now have all the packages required and be able to proceed.
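If you’d like to double-check the Graphviz install before moving on, the sketch below looks for Graphviz’s `dot` program on your system path. It uses only the standard library, and the function name `find_graphviz_dot` is a hypothetical helper of our own.

```python
import shutil


def find_graphviz_dot():
    """Return the full path of Graphviz's dot program, or None if it is not on the PATH."""
    return shutil.which("dot")


dot_path = find_graphviz_dot()
if dot_path is None:
    print("Graphviz 'dot' was not found; try the apt-get step again.")
else:
    print("Graphviz 'dot' found at", dot_path)
```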

GENERATING THE ACTUAL DECISION TREE

With datasets that have a high number of features, the usefulness of visualisation may reduce. However, it still allows you to see how the algorithm has been trained to derive a particular outcome.

For this, we’ll stick with the same fundamental data set we created last month, which was a classifier for types of fruit. This really doesn’t have enough data for true machine learning, but takes us through the motions without getting too lost in the details. The same principles work with larger data sets.

As we noted previously, when you have entirely separated data (that is, clear separation between features against classifications, as in our test case of oranges versus apples), the decision process for Machine Learning is fairly straightforward. However, due to the way Machine Learning functions, it can arrive at the same conclusion in different ways.

This is where visualising our tree can be particularly useful in helping understand what Machine Learning is doing behind the curtains.

THE SIMPLEST TREE

The code to visualise a tree is fairly straightforward, and not terribly important to understand right now. Now that you have installed the dependencies, we’ll gloss over some of the details so we can focus on Machine Learning concepts.

You can load and run the simple_tree.py code available in the digital resources, or type it out from the following:

```
from sklearn import tree

features = [[100,1],[105,1],[110,1],
            [140,2],[145,2],[150,2]]
labels = [1,1,1,2,2,2]

decisionMaker = tree.DecisionTreeClassifier()
decisionMaker = decisionMaker.fit(features, labels)

# everything below is code to display only
# there is no machine learning happening here
# (StringIO comes from Python's own io module, since
# sklearn.externals.six was removed from later scikit-learn releases)
from io import StringIO
import pydotplus

dot_data = StringIO()
tree.export_graphviz(decisionMaker,
    out_file=dot_data,
    feature_names=["Weight","Skin Thickness"],
    class_names=["Apple", "Orange"],
    filled=True, rounded=True,
    impurity=False)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("/home/pi/Desktop/simpleTree.pdf")
```

Run your code. Provided you get no errors, you should see a PDF generate on your Raspberry Pi desktop. If you’re running as a different user (the default is “pi”), you will need to update the filepath to your own desktop, or another convenient location.

Here’s what you should see when you check out the PDF:

This is as simple as it gets, with only one decision being required in order to predict the type of fruit, based on either skin thickness or weight. Note that you may see the primary decision as being skin thickness, or weight, and it may actually change if you re-execute the script. Mathematically speaking, the data provides the same probability using either of the available features.
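To see why either feature can end up as the single decision, here is a rough sketch that scores both candidate splits by hand. It assumes the Gini impurity measure, which is the default criterion for scikit-learn’s DecisionTreeClassifier; the helper names `gini` and `split_impurity` are our own, and the real library does considerably more than this.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))


def split_impurity(features, labels, column, threshold):
    """Weighted Gini impurity after splitting on features[column] <= threshold."""
    left = [lab for row, lab in zip(features, labels) if row[column] <= threshold]
    right = [lab for row, lab in zip(features, labels) if row[column] > threshold]
    total = len(labels)
    return len(left) / total * gini(left) + len(right) / total * gini(right)


features = [[100, 1], [105, 1], [110, 1], [140, 2], [145, 2], [150, 2]]
labels = [1, 1, 1, 2, 2, 2]

# Thresholds sit halfway between the two groups for each feature.
weight_score = split_impurity(features, labels, column=0, threshold=125)
skin_score = split_impurity(features, labels, column=1, threshold=1.5)

print("Weight <= 125:", weight_score)
print("Skin Thickness <= 1.5:", skin_score)
```

Both splits separate the fruit perfectly, so the scores tie and the tree has no reason to prefer one feature over the other. If you want the same tree every time, DecisionTreeClassifier accepts a `random_state` value to make the tie-break repeatable.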

As you can see, the decision tree logic is fairly similar to how a human makes a decision. However, we didn’t have to tell it how to make this assessment, which is the critical difference between traditional coding and machine learning.

MORE COMPLICATED DATA

In our original test data, we deliberately made things simple. There was a fairly simple relationship between weight and skin thickness. After all, we’re just demonstrating things here, so keeping it simple is generally better. But what happens if we deliberately complicate the data?

We’ll update the features with some additional complexity, and add support for the third feature in the visualisation with a label too. We have also thrown a few extra data objects in there to assist with training the model too.

We have also deliberately created crossover in the weights of each fruit so that some oranges are as light as apples, and some apples are as heavy as oranges.

```
features = [[121,1,5],[125,1,7],[130,2,8],
        [135,1,9],[120,1,5],[125,2,6],
        [129,2,7],[133,2,5]]
labels = [1,1,1,1,2,2,2,2]
```

This essentially trains our model with more data, and creates more complexity for classifications, so the decision tree used when we make a prediction has more steps.

We also need to add support for the third feature in our visualisation, by way of adding “Ripeness” as the label. What you actually call it doesn’t affect the outcome, other than changing what’s listed on the PDF decision tree.

`feature_names=["Weight","Skin Thickness","Ripeness"]`

Alternatively, grab the code complex_tree.py from the digital resources which has the updated features and run it. You’ll need to update the PDF location again if you’re not running as user “pi”.

Run your code. Open up complex_tree.pdf and you should have something like this:

As you can see, our decision process for making accurate predictions is now significantly more complex. Interestingly, ripeness has become the primary feature used for predictions.

Skin Thickness becomes the next important feature analysed. Ultimately, we get all the way down to Weight as well.

The actual decision values here aren’t critically important. What’s important here is understanding how Machine Learning has created a pathway for classification of new data. You can even actually use the generated decision tree to make your own predictions by following the logic shown in each step!

We should note that the data in this particular tree is fictitious, and based on nothing more than trying to demonstrate some behaviour.

The dataset is also too small to generate reliable outcomes, since with the existing data anything with a ripeness score of above 7.5 is deemed an apple, which would be a problem if we made a prediction using ripe oranges without retraining the model. This is why a high sample number is important, as is good feature selection.
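If you’d rather follow the tree’s logic without opening a PDF, and see the ripe orange problem for yourself, the sketch below may help. It assumes scikit-learn 0.21 or later (for `export_text`), fixes `random_state` purely so repeated runs build the same tree, and uses a made-up ripe orange that is not part of the training data.

```python
from sklearn import tree

features = [[121,1,5],[125,1,7],[130,2,8],
            [135,1,9],[120,1,5],[125,2,6],
            [129,2,7],[133,2,5]]
labels = [1,1,1,1,2,2,2,2]

# random_state is fixed only so that repeated runs build the same tree
decisionMaker = tree.DecisionTreeClassifier(random_state=0)
decisionMaker = decisionMaker.fit(features, labels)

# Print the tree as indented text rules instead of a PDF
rules = tree.export_text(decisionMaker,
    feature_names=["Weight", "Skin Thickness", "Ripeness"])
print(rules)

# A made-up ripe orange: orange-like weight and skin, ripeness of 9
prediction = decisionMaker.predict([[129, 2, 9]])
print(prediction)  # 1 is the apple label, 2 is the orange label
```

The first rule printed is the ripeness test, and because our ripe orange scores above 7.5 it lands on the apple side of the tree regardless of its weight or skin thickness.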

CLASSIFICATION VERSUS REGRESSION ALGORITHMS

Up until now, we’ve talked about a few different algorithms for Machine Learning. There are a growing number of tried and tested algorithms to suit a host of different data types, that’s for sure. However, when it comes to supervised learning, there are two primary categories of algorithm: classification algorithms and regression algorithms.

Under a totally different umbrella of algorithms we have the unsupervised learning options. These include clustering and association algorithms. We will deal with unsupervised learning down the track, and stick with supervised learning for now.

Selecting the correct algorithm isn’t so much a choice as a requirement of the problem you’re trying to resolve.

Classification

As in our examples provided up until now, classification algorithms are designed to fit a peg into a hole, so to speak. That is, classify a problem into a list of predetermined answers.

This is particularly useful for, as the name suggests, classifying a type of fruit.

When you’re predicting types of fruit, you already know what types of fruit you’re classifying against. It’s very unlikely that you’re looking for a new type of fruit you’ve never encountered before. Nor is it likely that you want to account for every conceivable fruit that exists on the planet.

Regression

While the name can be somewhat misleading, regression algorithms are intended to provide a prediction that doesn’t have a literal match already. This could be a floating point number (such as 3.14) or an integer (such as 25).

This is particularly useful for non-determinate predictions such as predicting a house price, or horsepower of a vehicle’s engine based on certain parameters.

Consider this; if you have house data (location, number of bedrooms, size of land, etc.) along with the sale price, there are scenarios where you may want to estimate its current value if it was to be placed on the market. A classification algorithm would still have some value here, but you’ll end up with a value that matches a previous sale value.

This may sound suitable at first, however, you probably want to predict a price that isn’t precisely in your data. As an over-simplification, if you have a 3-bedroom house at 100,000 and a 5-bedroom house for 200,000, a classification algorithm can only guide you to one of the existing options. By contrast, a regression algorithm can use the data to predict the price of 150,000 (or whatever the case may be based on the data).
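The difference can be sketched with those two sales alone. In this illustration the “classification” answer simply returns the price of the closest known sale, which is a stand-in of our own and not how any particular scikit-learn classifier works, while the “regression” answer follows the straight line through both sales.

```python
# Two known sales from the example above: bedrooms -> price.
sales = {3: 100000, 5: 200000}


def classify_price(bedrooms):
    """Classification-style answer: return the price of the closest known sale."""
    closest = min(sales, key=lambda known: abs(known - bedrooms))
    return sales[closest]


def regress_price(bedrooms):
    """Regression-style answer: follow the straight line through both sales."""
    (x1, y1), (x2, y2) = sorted(sales.items())
    slope = (y2 - y1) / (x2 - x1)
    return y1 + slope * (bedrooms - x1)


print("Classification:", classify_price(4))
print("Regression:", regress_price(4))
```

For a 4-bedroom house, the first function can only ever answer with one of the two existing prices, whereas the second lands halfway between them.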

Choosing Between Algorithms

Regression algorithms can provide infinite granularity in the output. This is perfect for continuous outputs such as a house price estimate, but not much use for determining types of fruit. By contrast, the output of a classification algorithm is limited to the labels present in the training data, which is perfect for problems where the possible output values are already known (you just need to know which value it is).

If, for example, you wanted to recommend a real estate agent to sell a particular property, then you’re probably better off using a classification algorithm.

In that scenario, there is a finite list of values: the house sales, and the agents who sold them. In fact, both regression and classification algorithms could be trained from the same data (assuming you had all the required values) to derive outcomes, depending on precisely what you wanted to do.

This should hopefully provide you some understanding of how the different algorithms have their place in machine learning.

USING A REGRESSION ALGORITHM

As we’ve just explained, you don’t always want to classify data, but sometimes you want to obtain a value in a range based on the data provided, such as our house prices example. So let’s implement a very basic idea of this. The complexity of data is similar to our fruit classification problem, but moves to a regression-compatible problem which is house price prediction.

There are several regression algorithms built into Scikit-learn, however, we’re going to use the LinearRegression algorithm for this problem. It’s a fairly straightforward mathematical regression algorithm which may have limited application in the real world, but it’s easy to implement with a simple dataset.

You can import the linear_model library easily:

`from sklearn import linear_model`

For our sample data, we have a simple table with eight rows. We are only specifying three parameters: house type (apartment or house), number of bedrooms, and the sale price (which is our label).

Naturally, there are loads of useful features that dictate the value of any property sale, such as the age of the building, postcode, quality of the fittings inside, you name it. However, this is enough data for us to demonstrate regression.

It’s also important to note that this data is entirely fictitious and has no intended correlation with anything at all, so don’t read too much into the actual numbers. It’s purely to demonstrate regression algorithms.

Before we start, we need to simplify our data. Remember that Scikit-learn likes simplified datasets, so we translate all of our text values to integers. In our table, an apartment becomes 1, a house becomes 2, and so on. Naturally, once our data becomes more complex, we need to keep a map of these values, so we can translate the integers used for machine learning back to something human readable where required.
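The translation to integers and back again can be sketched with a pair of dictionaries. The mapping below (apartment as 1, house as 2) matches the feature arrays used in this article, but the names `TYPE_TO_CODE`, `encode` and `describe` are our own and purely illustrative.

```python
# Assumed mapping, matching the feature arrays used in this article.
TYPE_TO_CODE = {"Apartment": 1, "House": 2}
CODE_TO_TYPE = {code: name for name, code in TYPE_TO_CODE.items()}


def encode(dwelling_type, bedrooms):
    """Turn a human-readable record into the integer list the model expects."""
    return [TYPE_TO_CODE[dwelling_type], bedrooms]


def describe(row):
    """Turn an integer feature row back into readable text."""
    dwelling_code, bedrooms = row
    return f"{bedrooms}-bedroom {CODE_TO_TYPE[dwelling_code].lower()}"


row = encode("Apartment", 2)
print(row)
print(describe(row))
```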

In order to put this into Python, we convert the data into a regular array. Just as we have done previously with our examples, we split the features (that is, the data points), and the labels (the values which we will train the model on, to eventually make predictions).

```
features = [[1,1],[1,1],[1,2],
        [1,2],[2,2],[2,3],
        [2,3],[2,4]]
labels = [ 100000, 120000, 150000,
        160000, 200000, 220000,
        250000, 270000 ]
```

You may notice that the arrays here are split over several lines to help present them better. If you see a backslash on the end of a line in a listing, it is a line-continuation marker, used mostly for print purposes to ensure continuity of the lines. Inside square brackets, Python continues onto the next line automatically, so the backslash isn’t required in your actual files, and does no harm if it is there. In the downloadable code files for this article, you’ll see that the arrays are simply on one line each.

We initialise the LinearRegression object, and train our model.

```
reg = linear_model.LinearRegression()
reg = reg.fit(features,labels)
```

Now we’re ready to make a prediction! Let’s predict the value of a new property that’s not in the original array. We’ll predict the value of a two bedroom apartment, to start with.

`print(reg.predict([ [1,2] ]))`

You should see a result of 151666.66666667.

Here’s the full Python code, including notes. It’s still only a handful of lines.

```
from sklearn import linear_model

# this is a multidimensional array of features
features = [[1,1],[1,1],[1,2],
        [1,2],[2,2],[2,3],
        [2,3],[2,4]]

# an array of labels which
# correspond to features
labels = [ 100000, 120000, 150000,
        160000, 200000, 220000,
        250000, 270000]

# initialise LinearRegression object
reg = linear_model.LinearRegression()

# train the model with our data
reg = reg.fit(features,labels)

# make a prediction and print the output
print(reg.predict([[1,2]]))
```

And a capture for you to see, from Mu also.

MANAGING COMPLEXITY

Now, if we were to look at the data as humans, and purely look for how we can make sense of things, we can make our own predictions. However, our own human mind really only assesses parts of the data.

If you look at the table in the spreadsheet, and want to estimate the price of the two-bedroom apartment, you’ll probably look down the list at other apartments until you get to the two bedroom options. We have two of them in our data, with a price of 150,000 and 160,000.

With no additional information, you might suggest a price of 155,000 - halfway between the two most relevant options in the table. Sounds fair, right? Sure. Mathematically speaking, however, you’ve just ignored a large part of the available data.

Despite this, you are probably wondering why our regression model has predicted a price of 151,666.

This is essentially because, even with only two attributes, machine learning considers the relationships across all of the data, rather than just the closest matches that we tend to pick out as humans.
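To make that less mysterious, here is a hand-rolled sketch of ordinary least squares, the method LinearRegression is built on, applied to our eight sales. The function `fit_plane` is our own simplified illustration for exactly two features, not scikit-learn’s implementation.

```python
features = [[1,1],[1,1],[1,2],[1,2],[2,2],[2,3],[2,3],[2,4]]
labels = [100000,120000,150000,160000,200000,220000,250000,270000]


def mean(values):
    return sum(values) / len(values)


def fit_plane(features, labels):
    """Ordinary least squares for exactly two features, solved by hand."""
    x1 = [row[0] for row in features]
    x2 = [row[1] for row in features]
    m1, m2, my = mean(x1), mean(x2), mean(labels)
    # Sums of products of deviations from the means
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (y - my) for a, y in zip(x1, labels))
    s2y = sum((b - m2) * (y - my) for b, y in zip(x2, labels))
    det = s11 * s22 - s12 ** 2  # would be zero if a feature never varied
    coef1 = (s1y * s22 - s12 * s2y) / det
    coef2 = (s11 * s2y - s12 * s1y) / det
    intercept = my - coef1 * m1 - coef2 * m2
    return intercept, coef1, coef2


intercept, per_type, per_bedroom = fit_plane(features, labels)

# Two-bedroom apartment: dwelling type 1, bedrooms 2
prediction = intercept + per_type * 1 + per_bedroom * 2
print(intercept, per_type, per_bedroom)
print(prediction)
```

The fit works out to a base of 30,000, plus 45,000 per step in dwelling type, plus roughly 38,333 per bedroom. Every sale, houses included, has pulled on those three numbers, which is why the two-bedroom apartment comes out near 151,666 rather than 155,000.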

Here’s where things get interesting. By removing the houses from the data (so that the dwelling type no longer varies) and simplifying it to purely apartment pricing, we do get the same answer as we had predicted manually.

```
features = [[1,1],[1,1],[1,2],[1,2]]
labels = [100000, 120000, 150000, 160000]
```

The prediction is, as we expect, 155,000. The only variable here is the number of bedrooms (one or two), and when we’ve asked for a prediction on a two bedroom option, it’s logically in between the two known prices. If we run the same prediction for a one bedroom, then we will automatically expect, and indeed see, a prediction of 110,000 also.
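With the dwelling type now constant, the number of bedrooms is the only feature that varies, and it only takes two values. In that situation a least-squares line passes through the average price of each group, so the two predictions can be checked with nothing more than averages, as in this small sketch.

```python
from statistics import mean

features = [[1,1],[1,1],[1,2],[1,2]]
labels = [100000, 120000, 150000, 160000]

# Group the sale prices by number of bedrooms (the second feature).
prices_by_bedrooms = {}
for (dwelling_type, bedrooms), price in zip(features, labels):
    prices_by_bedrooms.setdefault(bedrooms, []).append(price)

one_bed = mean(prices_by_bedrooms[1])
two_bed = mean(prices_by_bedrooms[2])
print("One bedroom:", one_bed)
print("Two bedrooms:", two_bed)
```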

Why do we have to remove the other data to get the regression algorithm to match our own assessment? It’s essentially because as humans, we automatically filter this data in order to process the data.

Think about your own assessment process. You automatically discarded the house that has two bedrooms when looking at the price indexes, because you’re not trying to predict the price of a house. For most problems that we have to solve as humans, this sort of filtering process serves us well and ensures that we don’t get overloaded.

The beauty of Machine Learning is how complicated the data can become, while improving the accuracy and usefulness of predictions. Now, we can step up the game and get more accurate predictions.

IMPROVING THE DATA

Up until now we’ve been using totally fictitious data with somewhat random attributes. What if we increase the complexity?

Now, we’re going to use legitimate data, gleaned from recent property sales for a suburb near the DIYODE office. Instead of just dwelling type and number of bedrooms, we now include the number of bathrooms and how many car spaces are included.

We could have used fictitious data for this purpose also, however, at this point, it became more efficient to source some real data.

We’ll include this spreadsheet in our digital resources for you to take a look at the data too.

For clarity, I have removed most of the data. It’s not that it’s not relevant at all to house prices, but that it’s not really relevant for our demonstration. Naturally, the actual location of these properties is probably a factor (eg, close to the beach or public transport), but it’s well outside of the detail we need for our demonstration.

Some aspects such as land size, etc. are useful, but need to be treated differently for units than houses, since it has less influence with a unit than a house. Just because there’s 1000 units on a quarter-acre block doesn’t mean it’s as valuable as a house on the same sized block of land.

CONVERTING A SPREADSHEET TO USABLE ARRAY

First things first. You need to convert your spreadsheet to an array. There are libraries to do this for us and we’ll explore them soon, but it’s sometimes useful to simply create a usable array and paste it into our code. We’ll deal with libraries to handle this (and create from a database) in a future instalment.

There are a few steps to doing this. First, we’ll create our features array for a single row. You can do this with a simple formula.

Note: We are using Google Docs, however, those using Microsoft Excel or open source alternatives should be able to convert these formulas fairly easily too.

`="["&if(A2="House", 2, 1)&","&join(",", B2:D2)&"]"`

Enter this on the next free column on row 2, and you should see something like this.

As you can see, we’ve done two things here. We’ve converted “House” to a 2, and anything else to a 1 (since Units are the only other thing in our dataset for this column), to provide Scikit-learn with the integers it needs to work. We’ve then created an array with comma separators for each value, aggregating the contents of columns B to D. This same method can be used to aggregate any columns - just change “D2” to whatever column you want it to go to. Naturally, we leave out the sale price (column E) as that’s our label.
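If spreadsheet formulas aren’t your thing, the same per-row logic can be mirrored in Python. This is only a sketch of what the formula does; the function name `row_to_array_text` and the example rows are hypothetical and not taken from the real sales data.

```python
def row_to_array_text(dwelling_type, *values):
    """Mirror the spreadsheet formula: "House" becomes 2, anything else 1,
    then the remaining columns are joined with commas inside brackets."""
    type_code = 2 if dwelling_type == "House" else 1
    return "[" + ",".join(str(v) for v in (type_code, *values)) + "]"


# Hypothetical rows: dwelling type (column A), then columns B to D.
print(row_to_array_text("House", 4, 2, 1))
print(row_to_array_text("Unit", 2, 1, 1))
```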

You can then drag to fill (or copy and paste) that formula to column F for all rows in the data set.

Now you have your feature arrays for each row, you can aggregate them into a single cell.

`="["&join(",",F2:F124)&"]"`

The last thing to do is create an array of labels (that is, our sale price), which is fairly straightforward too.

`="["&join(",",E2:E124)&"]"`

You should then have two rows with arrays representing all of your data.

REMEMBER: Features is a multi-dimensional array, so there are brackets at either end, while each array within has its own brackets. They’re then comma separated. Labels is a single-dimensional array, so there are no brackets around individual values.
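A quick sanity check on the pasted arrays can catch a missing bracket or a dropped row before you train anything. The sketch below is one possible check; `check_training_data` is a helper of our own, and the values shown are hypothetical stand-ins rather than the real sales data.

```python
def check_training_data(features, labels):
    """Raise ValueError if the pasted arrays are not shaped as expected."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same number of rows")
    width = len(features[0])
    for index, row in enumerate(features):
        if not isinstance(row, list) or len(row) != width:
            raise ValueError(f"feature row {index} should be a list of {width} values")
    for index, label in enumerate(labels):
        if isinstance(label, list):
            raise ValueError(f"label {index} should be a single value, not a list")
    return len(features), width


# Hypothetical stand-in values, not the real sales data.
features = [[2,4,2,1],[1,2,1,1],[2,3,1,2]]
labels = [900000, 550000, 780000]

rows, columns = check_training_data(features, labels)
print(rows, "rows with", columns, "features each")
```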

All you now have to do is copy and paste the results from your spreadsheet into your Python code. You should end up with something like this (note the arrays are large so you won’t see it all on screen).

Congratulations! You’ve turned a spreadsheet of data into neat Python arrays that are compatible with Scikit-learn!

Once you get data with more features, several hundred or thousands of rows, or anything more complicated, you’ll be better off using data import functions. We’ll explore how to do those later on. For now, we want to run some predictions!

One thing you’ll notice now that we have more features is, we need to feed Scikit-learn more features too! If you don’t, you’ll simply get errors.

Now that we have specified four features for each record in our training data, you can’t just ask for a prediction using one feature (or any fewer than the number of features you’ve trained your algorithm with) because it’s outside of the parameters of what these algorithms do. You could, in theory, progressively predict features to flesh out the full object, but that would have variable outcomes.
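You can see this behaviour for yourself with the small fictitious dataset from earlier, which has two features per record. In this sketch we deliberately ask for a prediction with only one value and catch the resulting error; the variable name `mismatch_rejected` is our own.

```python
from sklearn import linear_model

# The two-feature fictitious data from earlier in this article
features = [[1,1],[1,1],[1,2],[1,2],[2,2],[2,3],[2,3],[2,4]]
labels = [100000,120000,150000,160000,200000,220000,250000,270000]

reg = linear_model.LinearRegression()
reg = reg.fit(features, labels)

try:
    # Only one value supplied, but the model was trained on two
    reg.predict([[2]])
    mismatch_rejected = False
except ValueError as error:
    mismatch_rejected = True
    print("Prediction refused:", error)
```

The exact wording of the error varies between Scikit-learn versions, but the message is the same: give the model as many features as it was trained with.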

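Let’s make a prediction for a house with four bedrooms, two bathrooms, and one car space. Assuming the columns run in the order dwelling type, bedrooms, bathrooms, car spaces, that’s a features entry of [2,4,2,1].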

You should see a price prediction somewhere around $930,980 (and some decimals). You can play around with the values and see what predictions are made.

One thing we haven’t demonstrated before is that it’s also possible to make several predictions by adding values to the multidimensional array also.

This is useful to provide contrasting predictions (such as if one feature is changed), such as the price prediction of a property where the only difference is house versus unit (at least, where our data is concerned).

`print(reg.predict([[1,3,1,1], [2,3,1,1]]))`

Here you’ll obtain predictions for all value sets provided. In this case, we see a price difference of around $65,000 just by changing from a house to a unit.
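The same side-by-side comparison can be tried without the real sales data. The sketch below reuses the small fictitious two-feature dataset from earlier in this article, so the numbers will not match the real-data predictions above; it simply shows how to request two predictions at once and compare them.

```python
from sklearn import linear_model

# The two-feature fictitious data from earlier (dwelling type, bedrooms)
features = [[1,1],[1,1],[1,2],[1,2],[2,2],[2,3],[2,3],[2,4]]
labels = [100000,120000,150000,160000,200000,220000,250000,270000]

reg = linear_model.LinearRegression()
reg = reg.fit(features, labels)

# Two-bedroom apartment versus two-bedroom house, in one call
predictions = reg.predict([[1,2],[2,2]])
difference = predictions[1] - predictions[0]
print(predictions)
print("House versus apartment:", round(difference))
```

Because only the dwelling type changes between the two requests, the difference between the predictions is the value the model has attached to that one feature.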

If we change the values to predict on a different set of parameters, such as a four bed, three bath, two car house and unit, there’s barely 20,000 difference in the price prediction. If we were to make a human guess, it may be that four bedroom units are far less common and are only new luxury apartments, whereas four bedroom houses are fairly common in this area and may be older. In reality, we don’t have enough data to make this assessment, but if we expanded our dataset to cater for more features we could probably make a better assessment of this outcome.

MORE FEATURES

As you have seen, more data drives better predictions, however, feature selection is critical. If you, say, indexed the colour of the walls or the type of music the previous owners enjoyed, it’s probably not going to improve your predictions. However, the age of the building, whether the parking is undercover or secure, and other features would almost certainly play a role in the sale price, and therefore, enable more accurate predictions (assuming we can obtain that data too).

THE POWER OF THE MACHINE

Hand-writing code to handle this level of complexity would have taken us considerable time. We’ve done this in merely minutes.

Now we’re not about to de-throne any property valuers just yet! There’s much more to it to make actual property price predictions, but I’m confident you’re seeing the power of Machine Learning now.

We’re also using one of the most fundamental regression algorithms here. Using other algorithms, and indeed writing our own (which we can explore at a later date) can help improve the outcomes.