Intuition Lecture 126 https://www.udemy.com/machinelearning/learn/lecture/5714412
Lecture 129 https://www.udemy.com/machinelearning/learn/lecture/5771094
Decision tree algorithms split the data into classification regions so that the resulting model can predict where new data points will land. Those splits are made on values of the independent variables, guided by the dependent variable we want to predict.
Random forests are about having multiple trees, a forest of trees. The trees can all come from the same algorithm, or the ensemble can mix tree types (algorithms); either way the forest classifies a new point by letting the trees vote and taking the majority, which is the metaphor for how the forest acts (decides). A small sketch of this idea follows below.
Again, as with decision trees, the random forest is not based on Euclidean distances but on classification splits.
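To make the "forest of voting trees" idea concrete, here is a minimal sketch (not the course code): it fits a few rpart decision trees on bootstrap samples of the built-in iris data and takes a majority vote. The real randomForest package used later does this for us, plus random feature selection at each split.
library(rpart)
set.seed(123)
data(iris)
n_trees = 5
votes = sapply(1:n_trees, function(i) {
  boot_rows = sample(nrow(iris), replace = TRUE)        # bootstrap sample (with replacement)
  tree = rpart(Species ~ ., data = iris[boot_rows, ])   # one decision tree per sample
  as.character(predict(tree, newdata = iris, type = 'class'))
})
# majority vote across the trees for each observation
forest_pred = apply(votes, 1, function(v) names(which.max(table(v))))
table(predicted = forest_pred, actual = iris$Species)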
Check the working directory with getwd() so you always know where you are working; a one-line sketch follows.
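A minimal sketch (the setwd() path below is just a placeholder, not one of the course files):
getwd()                        # print the current working directory
# setwd('~/path/to/project')   # placeholder path -- point it at the folder containing Social_Network_Ads.csv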
We are after Age, EstimatedSalary, and the yes/no Purchased column, so in R that's columns 3 to 5.
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[3:5]
Have a look at the data.
summary(dataset)
## Age EstimatedSalary Purchased
## Min. :18.00 Min. : 15000 Min. :0.0000
## 1st Qu.:29.75 1st Qu.: 43000 1st Qu.:0.0000
## Median :37.00 Median : 70000 Median :0.0000
## Mean :37.66 Mean : 69742 Mean :0.3575
## 3rd Qu.:46.00 3rd Qu.: 88000 3rd Qu.:1.0000
## Max. :60.00 Max. :150000 Max. :1.0000
head(dataset)
## Age EstimatedSalary Purchased
## 1 19 19000 0
## 2 35 20000 0
## 3 26 43000 0
## 4 27 57000 0
## 5 19 76000 0
## 6 27 58000 0
Remember, we encode Purchased as a factor because the model we are using doesn't do this for us.
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
Let’s look again
summary(dataset)
## Age EstimatedSalary Purchased
## Min. :18.00 Min. : 15000 0:257
## 1st Qu.:29.75 1st Qu.: 43000 1:143
## Median :37.00 Median : 70000
## Mean :37.66 Mean : 69742
## 3rd Qu.:46.00 3rd Qu.: 88000
## Max. :60.00 Max. :150000
The general rule of thumb is a 75% split ratio: 75% training, 25% test.
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
Feature scaling: for classification it's generally better to scale the features, and here the two variables are on very different scales (years vs dollars). For decision trees and random forests we don't strictly need it, because the model is not based on Euclidean distances, but it makes the high-resolution graphing below much faster.
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
Let’s have a look.
head(training_set)
## Age EstimatedSalary Purchased
## 1 -1.7655475 -1.4733414 0
## 3 -1.0962966 -0.7883761 0
## 6 -1.0006894 -0.3602727 0
## 7 -1.0006894 0.3817730 0
## 8 -0.5226531 2.2654277 1
## 10 -0.2358313 -0.1604912 0
Things are a little different here: we don't need a formula or other options, just x and y. x is the independent variables (hence training_set[-3], dropping the column we don't need), and y is the dependent variable.
# install.packages('randomForest')
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
classifier = randomForest(x = training_set[-3],
                          y = training_set$Purchased,
                          ntree = 500) # note: random_state is a Python/sklearn argument, not an R one; use set.seed() for reproducibility
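For a quick summary of the fitted forest you can simply print the object; print() on a randomForest classifier reports the number of trees, the number of variables tried at each split, and the out-of-bag (OOB) error estimate with its confusion matrix.
classifier   # or print(classifier): ntree, mtry, OOB error estimate and OOB confusion matrix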
A small difference from the decision tree model: because y was encoded as a factor, predict() on the random forest returns the class labels directly, so no type = 'class' is needed for the test-set prediction.
y_pred = predict(classifier, newdata = test_set[-3])
Let’s have a look
y_pred
## 2 4 5 9 12 18 19 20 22 29 32 34 35 38 45 46 48 52
## 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0
## 66 69 74 75 82 84 85 86 87 89 103 104 107 108 109 117 124 126
## 0 0 1 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0
## 127 131 134 139 148 154 156 159 162 163 170 175 176 193 199 200 208 213
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
## 224 226 228 229 230 234 236 237 239 241 255 264 265 266 273 274 281 286
## 1 0 1 0 0 1 0 0 1 1 1 0 1 1 1 1 1 1
## 292 299 302 305 307 310 316 324 326 332 339 341 343 347 353 363 364 367
## 1 0 0 0 1 0 0 1 0 1 0 1 0 1 1 0 0 1
## 368 369 372 373 380 383 389 392 395 400
## 1 0 1 0 1 1 1 1 0 1
## Levels: 0 1
Now we can build the normal confusion matrix from the predicted classes.
cm = table(test_set[, 3], y_pred)
cm
## y_pred
## 0 1
## 0 56 8
## 1 7 29
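From that confusion matrix we can compute the accuracy by hand: correct predictions over all predictions, which with the counts above works out to (56 + 29) / 100 = 0.85.
# accuracy = correct predictions / total predictions
accuracy = sum(diag(cm)) / sum(cm)
accuracy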
Visualising the Training set results.
library(ElemStatLearn)
# declare set as the training set
set = training_set
# This section creates the red/green background region. The 'by' step (0.01) works like the
# step size in Python: each 0.01 point of the grid is classified as 0 or 1 and coloured red
# or green. The -1 and +1 give some space around the edges so the dots are not jammed against
# the border. Another way to think of 'by' is as the resolution of the background.
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
# give the grid columns the same names as the feature columns
colnames(grid_set) = c('Age', 'EstimatedSalary')
# this is the MAGIC of the background coloring
# here we use the classifier to predict the class of each grid point noted above; the result is a factor of 0/1 labels used to colour the background
y_grid = predict(classifier, newdata = grid_set, type = 'class')
# that's the end of the background
# now we plot the actual data
plot(set[, -3],
main = 'Random Forest classification (Training set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2)) # limit the axes to the grid range; together with contour() below this draws the boundary between the green and red regions
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
# colour every background grid point by its predicted class (y_grid), using ifelse for the colours
# note the dots are the real data; the background is the point-by-point prediction of yes/no
# plotting the dots on top of the background gives the final image
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
# predict the class of every grid point for the background colouring
y_grid = predict(classifier, newdata = grid_set, type = 'class')
plot(set[, -3], main = 'Random Forest classification (Test set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
However, we need to take the feature scaling out so we can read the splits in the original units (Age in years, salary in dollars) :D
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[3:5]
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# install.packages('randomForest')
library(randomForest)
classifier = randomForest(x = training_set[-3],
                          y = training_set$Purchased,
                          ntree = 500)
Plotting the classifier: note that plot() on a randomForest object shows the error rate as trees are added (not a tree diagram), which is why it looks nothing like the single decision tree plot.
plot(classifier)
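Since plot(classifier) only shows the error curve, here is a small sketch of two randomForest helpers for actually looking inside the forest: getTree() pulls out the split table of one of the 500 trees, and importance()/varImpPlot() show which variable the forest relies on most.
# split table of the first of the 500 trees (labelVar = TRUE gives readable variable names)
head(getTree(classifier, k = 1, labelVar = TRUE))
# variable importance: mean decrease in Gini impurity for Age vs EstimatedSalary
importance(classifier)
varImpPlot(classifier)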
=========================
Github files; https://github.com/ghettocounselor
Useful PDF for common questions in Lectures;
https://github.com/ghettocounselor/Machine_Learning/blob/master/Machine-Learning-A-Z-Q-A.pdf