Intuition Lecture 126 https://www.udemy.com/machinelearning/learn/lecture/5714412
Lecture 129 https://www.udemy.com/machinelearning/learn/lecture/5771094
Decision tree algorithms split the data into classification regions so that the resulting model can predict where new data points will land. Those splits are made on values of the independent variables, guided by the dependent variable we want to predict.
Random forests are about having multiple trees, a forest of trees. The trees can all come from the same algorithm, or the ensemble can mix tree types (algorithms); either way the forest classifies a new point by letting the trees vote and taking the majority, which is the metaphor for how the forest acts (decides). A small sketch of this idea follows below.
Again, as with decision trees, the random forest is not based on Euclidean distances but on classification splits.
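To make the "forest of voting trees" idea concrete, here is a minimal sketch (not the course code): it fits a few rpart decision trees on bootstrap samples of the built-in iris data and takes a majority vote. The real randomForest package used later does this for us, plus random feature selection at each split.
library(rpart)
set.seed(123)
data(iris)
n_trees = 5
votes = sapply(1:n_trees, function(i) {
  boot_rows = sample(nrow(iris), replace = TRUE)        # bootstrap sample (with replacement)
  tree = rpart(Species ~ ., data = iris[boot_rows, ])   # one decision tree per sample
  as.character(predict(tree, newdata = iris, type = 'class'))
})
# majority vote across the trees for each observation
forest_pred = apply(votes, 1, function(v) names(which.max(table(v))))
table(predicted = forest_pred, actual = iris$Species)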
Check the working directory with getwd() so you always know where you are working; a one-line sketch follows.
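A minimal sketch (the setwd() path below is just a placeholder, not one of the course files):
getwd()                        # print the current working directory
# setwd('~/path/to/project')   # placeholder path -- point it at the folder containing Social_Network_Ads.csv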
We are after Age, EstimatedSalary, and the yes/no Purchased column, so in R that's columns 3 to 5.
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[3:5]
Have a look at the data.
summary(dataset)
## Age EstimatedSalary Purchased
## Min. :18.00 Min. : 15000 Min. :0.0000
## 1st Qu.:29.75 1st Qu.: 43000 1st Qu.:0.0000
## Median :37.00 Median : 70000 Median :0.0000
## Mean :37.66 Mean : 69742 Mean :0.3575
## 3rd Qu.:46.00 3rd Qu.: 88000 3rd Qu.:1.0000
## Max. :60.00 Max. :150000 Max. :1.0000
head(dataset)
## Age EstimatedSalary Purchased
## 1 19 19000 0
## 2 35 20000 0
## 3 26 43000 0
## 4 27 57000 0
## 5 19 76000 0
## 6 27 58000 0
Remember, we encode Purchased as a factor because the model we are using doesn't do this for us.
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
Let’s look again
summary(dataset)
## Age EstimatedSalary Purchased
## Min. :18.00 Min. : 15000 0:257
## 1st Qu.:29.75 1st Qu.: 43000 1:143
## Median :37.00 Median : 70000
## Mean :37.66 Mean : 69742
## 3rd Qu.:46.00 3rd Qu.: 88000
## Max. :60.00 Max. :150000
The general rule of thumb is a 75% split ratio: 75% training, 25% test.
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
Feature scaling: for classification it's generally better to scale the features, and here the two variables are on very different scales (years vs dollars). For decision trees and random forests we don't strictly need it, because the model is not based on Euclidean distances, but it makes the high-resolution graphing below much faster.
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
Let’s have a look.
head(training_set)
## Age EstimatedSalary Purchased
## 1 -1.7655475 -1.4733414 0
## 3 -1.0962966 -0.7883761 0
## 6 -1.0006894 -0.3602727 0
## 7 -1.0006894 0.3817730 0
## 8 -0.5226531 2.2654277 1
## 10 -0.2358313 -0.1604912 0
Things are a little different here: we don't need a formula or other options, just x and y. x is the independent variables (hence training_set[-3], dropping the column we don't need), and y is the dependent variable.
# install.packages('randomForest')
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
classifier = randomForest(x = training_set[-3],
                          y = training_set$Purchased,
                          ntree = 500) # note: random_state is a Python/sklearn argument, not an R one; use set.seed() for reproducibility
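For a quick summary of the fitted forest you can simply print the object; print() on a randomForest classifier reports the number of trees, the number of variables tried at each split, and the out-of-bag (OOB) error estimate with its confusion matrix.
classifier   # or print(classifier): ntree, mtry, OOB error estimate and OOB confusion matrix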
A small difference from the decision tree model: because y was encoded as a factor, predict() on the random forest returns the class labels directly, so no type = 'class' is needed for the test-set prediction.
y_pred = predict(classifier, newdata = test_set[-3])
Let’s have a look
y_pred
## 2 4 5 9 12 18 19 20 22 29 32 34 35 38 45 46 48 52
## 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0
## 66 69 74 75 82 84 85 86 87 89 103 104 107 108 109 117 124 126
## 0 0 1 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0
## 127 131 134 139 148 154 156 159 162 163 170 175 176 193 199 200 208 213
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
## 224 226 228 229 230 234 236 237 239 241 255 264 265 266 273 274 281 286
## 1 0 1 0 0 1 0 0 1 1 1 0 1 1 1 1 1 1
## 292 299 302 305 307 310 316 324 326 332 339 341 343 347 353 363 364 367
## 1 0 0 0 1 0 0 1 0 1 0 1 0 1 1 0 0 1
## 368 369 372 373 380 383 389 392 395 400
## 1 0 1 0 1 1 1 1 0 1
## Levels: 0 1
Now we can build the normal confusion matrix from the predicted classes.
cm = table(test_set[, 3], y_pred)
cm
## y_pred
## 0 1
## 0 56 8
## 1 7 29
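From that confusion matrix we can compute the accuracy by hand: correct predictions over all predictions, which with the counts above works out to (56 + 29) / 100 = 0.85.
# accuracy = correct predictions / total predictions
accuracy = sum(diag(cm)) / sum(cm)
accuracy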
Visualising the Training set results.
library(ElemStatLearn)
# declare set as the training set
set = training_set
# This section creates the red/green background region. The 'by' step (0.01) works like the
# step size in Python: each 0.01 point of the grid is classified as 0 or 1 and coloured red
# or green. The -1 and +1 give some space around the edges so the dots are not jammed against
# the border. Another way to think of 'by' is as the resolution of the background.
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
# give the grid columns the same names as the feature columns
colnames(grid_set) = c('Age', 'EstimatedSalary')
# this is the MAGIC of the background coloring
# here we use the classifier to predict the class of each grid point noted above; the result is a factor of 0/1 labels used to colour the background
y_grid = predict(classifier, newdata = grid_set, type = 'class')
# that's the end of the background
# now we plot the actual data
plot(set[, -3],
main = 'Random Forest classification (Training set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2)) # limit the axes to the grid range; together with contour() below this draws the boundary between the green and red regions
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
# colour every background grid point by its predicted class (y_grid), using ifelse for the colours
# note the dots are the real data; the background is the point-by-point prediction of yes/no
# plotting the dots on top of the background gives the final image
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
# predict the class of every grid point for the background colouring
y_grid = predict(classifier, newdata = grid_set, type = 'class')
plot(set[, -3], main = 'Random Forest classification (Test set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
However, we need to take the feature scaling out so we can read the splits in the original units (Age in years, salary in dollars) :D
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[3:5]
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# install.packages('randomForest')
library(randomForest)
classifier = randomForest(x = training_set[-3],
                          y = training_set$Purchased,
                          ntree = 500)
Plotting the classifier: note that plot() on a randomForest object shows the error rate as trees are added (not a tree diagram), which is why it looks nothing like the single decision tree plot.
plot(classifier)
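Since plot(classifier) only shows the error curve, here is a small sketch of two randomForest helpers for actually looking inside the forest: getTree() pulls out the split table of one of the 500 trees, and importance()/varImpPlot() show which variable the forest relies on most.
# split table of the first of the 500 trees (labelVar = TRUE gives readable variable names)
head(getTree(classifier, k = 1, labelVar = TRUE))
# variable importance: mean decrease in Gini impurity for Age vs EstimatedSalary
importance(classifier)
varImpPlot(classifier)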
=========================
Github files; https://github.com/ghettocounselor
Useful PDF for common questions in Lectures;
https://github.com/ghettocounselor/Machine_Learning/blob/master/Machine-Learning-A-Z-Q-A.pdf