GitHub files: https://github.com/ghettocounselor
Intuition Lecture 126 https://www.udemy.com/machinelearning/learn/lecture/5714412
Lecture 128 https://www.udemy.com/machinelearning/learn/lecture/5765754
Random Forest algorithms use decision trees, which split the data into classifications, to build a model that predicts where new data points will land. Those classifications are based on the values of the independent and dependent variables. Now, that's the decision tree process; the forest is multiple trees! Each tree is built on a random subset of the data. You can have a Random Forest that is a collection of the same kind of tree or a collection of many types of trees working as one forest.
Microsoft xbox https://www.youtube.com/watch?v=lntbRsi8lU8
Microsoft Real-Time Human Pose recognition https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/BodyPartRecognition.pdf
Check the working directory with getwd() so you always know where you are working.
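A quick example (the path below is just a placeholder, not a real location; point it at wherever Social_Network_Ads.csv lives):
getwd()                        # print the current working directory
# setwd('~/my/project/folder') # hypothetical path - uncomment and change to your own folder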
We are after Age, EstimatedSalary, and the yes/no Purchased column, so in R that's columns 3-5.
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[3:5]
Have a look at the data.
summary(dataset)
## Age EstimatedSalary Purchased
## Min. :18.00 Min. : 15000 Min. :0.0000
## 1st Qu.:29.75 1st Qu.: 43000 1st Qu.:0.0000
## Median :37.00 Median : 70000 Median :0.0000
## Mean :37.66 Mean : 69742 Mean :0.3575
## 3rd Qu.:46.00 3rd Qu.: 88000 3rd Qu.:1.0000
## Max. :60.00 Max. :150000 Max. :1.0000
head(dataset)
## Age EstimatedSalary Purchased
## 1 19 19000 0
## 2 35 20000 0
## 3 26 43000 0
## 4 27 57000 0
## 5 19 76000 0
## 6 27 58000 0
Remember, we do this because the model we are using doesn't encode the dependent variable as a factor for us.
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
Let’s look again
summary(dataset)
## Age EstimatedSalary Purchased
## Min. :18.00 Min. : 15000 0:257
## 1st Qu.:29.75 1st Qu.: 43000 1:143
## Median :37.00 Median : 70000
## Mean :37.66 Mean : 69742
## 3rd Qu.:46.00 3rd Qu.: 88000
## Max. :60.00 Max. :150000
The general rule of thumb for the split ratio is 75%: 75% training, 25% test.
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
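A quick sanity check on the split sizes; with 400 observations and a 0.75 ratio we expect roughly 300 training and 100 test rows:
nrow(training_set) # ~300
nrow(test_set)     # ~100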
Feature Scaling - for classification it's generally better to do feature scaling, and here we also have variables whose units are not the same. For decision trees we don't strictly need it because the model is not based on Euclidean distances; however, it will make the graphing faster.
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
Let’s have a look.
head(training_set)
## Age EstimatedSalary Purchased
## 1 -1.7655475 -1.4733414 0
## 3 -1.0962966 -0.7883761 0
## 6 -1.0006894 -0.3602727 0
## 7 -1.0006894 0.3817730 0
## 8 -0.5226531 2.2654277 1
## 10 -0.2358313 -0.1604912 0
Things would be a little different with a Random Forest, where we would not need a formula, just x (the independent variables, hence the [-3] removing the column we don't need) and y (the dependent variable). Here, fitting a single decision tree with rpart, we do use a formula: Purchased ~ . says Purchased is modeled on all the other columns of the training set.
# install.packages('rpart')
library(rpart)
classifier = rpart(formula = Purchased ~ .,
data = training_set)
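For comparison only (this is a sketch, not part of the lecture code), a Random Forest on the same data would skip the formula and take x and y directly. It assumes the randomForest package is installed; ntree = 500 is an arbitrary choice.
# install.packages('randomForest')
library(randomForest)
set.seed(123)
rf_classifier = randomForest(x = training_set[-3],       # independent variables (Age, EstimatedSalary)
                             y = training_set$Purchased, # dependent variable, already a factor
                             ntree = 500)                # number of trees in the forest
rf_pred = predict(rf_classifier, newdata = test_set[-3]) # returns the class directly, no type = 'class' needed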
Because of the slight variation in how the decision tree classifier returns predictions, we will need to add type = 'class'; first, let's see what we get without it.
y_pred = predict(classifier, newdata = test_set[-3])
Note: it's a matrix! And it's a matrix of probabilities; column 0 is the probability that the user will not buy the SUV and column 1 is the probability that the user will buy the SUV.
head(y_pred)
## 0 1
## 2 0.967033 0.03296703
## 4 0.967033 0.03296703
## 5 0.967033 0.03296703
## 9 0.967033 0.03296703
## 12 0.967033 0.03296703
## 18 0.967033 0.03296703
Let’s fix that.
y_pred = predict(classifier, newdata = test_set[-3], type = 'class')
Let’s look
y_pred
## 2 4 5 9 12 18 19 20 22 29 32 34 35 38 45 46 48 52
## 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 0 0 0
## 66 69 74 75 82 84 85 86 87 89 103 104 107 108 109 117 124 126
## 0 0 1 0 0 1 0 1 0 0 1 1 0 1 1 0 0 0
## 127 131 134 139 148 154 156 159 162 163 170 175 176 193 199 200 208 213
## 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1
## 224 226 228 229 230 234 236 237 239 241 255 264 265 266 273 274 281 286
## 1 0 1 0 0 1 1 0 1 1 0 0 1 1 1 1 1 1
## 292 299 302 305 307 310 316 324 326 332 339 341 343 347 353 363 364 367
## 1 0 0 0 1 0 0 1 0 1 0 1 0 1 1 0 0 1
## 368 369 372 373 380 383 389 392 395 400
## 1 0 1 0 1 1 1 1 0 1
## Levels: 0 1
Now we get the normal confusion matrix (CM) because we added type = 'class'.
cm = table(test_set[, 3], y_pred)
cm
## y_pred
## 0 1
## 0 53 11
## 1 6 30
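To read the matrix: rows are the real values and columns are the predictions, so the diagonal holds the correct predictions (53 true negatives and 30 true positives). Out of the 100 test observations that is (53 + 30) / 100 = 83% accuracy, which we can compute straight from cm:
accuracy = sum(diag(cm)) / sum(cm) # correct predictions / all predictions
accuracy                           # 0.83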
Visualising the Training set results
library(ElemStatLearn)
# declare set as the training set
set = training_set
# this section creates the red/green background region. The 'by' argument works like the
# step size in Python: each 0.01 point of the grid is classified as 0 or 1 and colored
# green or red. The -1 and +1 give us space around the edges so the dots are not jammed.
# Another way to think of 'by' is as the resolution of the graphing of the background
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
# just giving a name to the X and Y
colnames(grid_set) = c('Age', 'EstimatedSalary')
# this is the MAGIC of the background coloring
# here we use the classifier to predict the result for each of the pixel points noted above.
# NOTE: we need type = 'class' here; otherwise y_grid would be a matrix of probabilities!
y_grid = predict(classifier, newdata = grid_set, type = 'class')
# that's the end of the background
# now we plot the actual data
plot(set[, -3],
main = 'Decision Tree (Training set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2)) # this bit sets the limits of the values plotted; together with the contour below it is part of the MAGIC that draws the line between green and red
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
# here we run through the y_grid predictions and use ifelse to color each background point
# note the dots (next line) are the real data; the background is the pixel-by-pixel determination of y/n
# graphing the dots on top of the background gives you the image
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
# NOTE: we need type = 'class' here; otherwise y_grid would be a matrix of probabilities!
y_grid = predict(classifier, newdata = grid_set, type = 'class')
plot(set[, -3], main = 'Decision Tree (Test set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
In the Training graph we may see peculiar groupings of red (for instance). These in and of themselves are not troublesome. However, when we then use the classifier on the Test set we find that not only do those areas tend to contain no accurate predictions (as in, there is nothing there), we also see mistakes, where green dots land in those peculiar areas. This is indicative of overfitting of our model to the Training data: the model has gotten more accustomed to the specific data points than to the general context of the data, so to speak.
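One common remedy (not covered in the lecture, just a sketch) is to constrain or prune the tree so it can't chase those tiny pockets of the Training data; rpart's complexity parameter cp does this, and cp = 0.05 here is an arbitrary value chosen only to illustrate the idea:
# prune the existing tree: a higher cp demands a bigger improvement per split, giving a simpler tree
pruned_classifier = prune(classifier, cp = 0.05)
# or fit a constrained tree from the start
classifier_small = rpart(formula = Purchased ~ .,
                         data = training_set,
                         control = rpart.control(cp = 0.05))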
However, we need to take the feature scaling out so we can read the splits in their original units :D
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[3:5]
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# install.packages('rpart')
library(rpart)
classifier = rpart(formula = Purchased ~ .,
data = training_set)
Plotting the tree
plot(classifier)
text(classifier)
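The base plot can be hard to read. As an optional extra (not part of the lecture code), the rpart.plot package draws the same classifier with labeled splits and class proportions:
# install.packages('rpart.plot')
library(rpart.plot)
rpart.plot(classifier) # same tree, splits shown in the original (unscaled) units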