GitHub files: https://github.com/ghettocounselor
Intuition Lecture 126 https://www.udemy.com/machinelearning/learn/lecture/5714412
Lecture 128 https://www.udemy.com/machinelearning/learn/lecture/5765754
Random Forest algorithms use decision trees, which split the data into classifications, to build a model that predicts where new data points will land. Those classifications are based on the values of the independent and dependent variables. Now, that's the decision tree process; the forest is multiple trees! Each tree is built on a random subset of the data. You can have a Random Forest that is a collection of the same kind of tree or a collection of many types of trees working as one forest.
Microsoft xbox https://www.youtube.com/watch?v=lntbRsi8lU8
Microsoft Real-Time Human Pose recognition https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/BodyPartRecognition.pdf
Check the working directory with getwd() so you always know where you are working.
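A quick example (the path below is just a placeholder, not a real location; point it at wherever Social_Network_Ads.csv lives):
getwd()                        # print the current working directory
# setwd('~/my/project/folder') # hypothetical path - uncomment and change to your own folder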
We are after Age, EstimatedSalary, and the yes/no Purchased column, so in R that's columns 3-5.
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[3:5]
Have a look at the data.
summary(dataset)
## Age EstimatedSalary Purchased
## Min. :18.00 Min. : 15000 Min. :0.0000
## 1st Qu.:29.75 1st Qu.: 43000 1st Qu.:0.0000
## Median :37.00 Median : 70000 Median :0.0000
## Mean :37.66 Mean : 69742 Mean :0.3575
## 3rd Qu.:46.00 3rd Qu.: 88000 3rd Qu.:1.0000
## Max. :60.00 Max. :150000 Max. :1.0000
head(dataset)
## Age EstimatedSalary Purchased
## 1 19 19000 0
## 2 35 20000 0
## 3 26 43000 0
## 4 27 57000 0
## 5 19 76000 0
## 6 27 58000 0
Remember, we do this because the model we are using doesn't encode the dependent variable as a factor for us.
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
Let’s look again
summary(dataset)
## Age EstimatedSalary Purchased
## Min. :18.00 Min. : 15000 0:257
## 1st Qu.:29.75 1st Qu.: 43000 1:143
## Median :37.00 Median : 70000
## Mean :37.66 Mean : 69742
## 3rd Qu.:46.00 3rd Qu.: 88000
## Max. :60.00 Max. :150000
The general rule of thumb for the split ratio is 75%: 75% training, 25% test.
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
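A quick sanity check on the split sizes; with 400 observations and a 0.75 ratio we expect roughly 300 training and 100 test rows:
nrow(training_set) # ~300
nrow(test_set)     # ~100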
Feature Scaling - for classification it's generally better to do feature scaling, and here we also have variables whose units are not the same. For decision trees we don't strictly need it because the model is not based on Euclidean distances; however, it will make the graphing faster.
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
Let’s have a look.
head(training_set)
## Age EstimatedSalary Purchased
## 1 -1.7655475 -1.4733414 0
## 3 -1.0962966 -0.7883761 0
## 6 -1.0006894 -0.3602727 0
## 7 -1.0006894 0.3817730 0
## 8 -0.5226531 2.2654277 1
## 10 -0.2358313 -0.1604912 0
Things would be a little different with a Random Forest, where we would not need a formula, just x (the independent variables, hence the [-3] removing the column we don't need) and y (the dependent variable). Here, fitting a single decision tree with rpart, we do use a formula: Purchased ~ . says Purchased is modeled on all the other columns of the training set.
# install.packages('rpart')
library(rpart)
classifier = rpart(formula = Purchased ~ .,
data = training_set)
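For comparison only (this is a sketch, not part of the lecture code), a Random Forest on the same data would skip the formula and take x and y directly. It assumes the randomForest package is installed; ntree = 500 is an arbitrary choice.
# install.packages('randomForest')
library(randomForest)
set.seed(123)
rf_classifier = randomForest(x = training_set[-3],       # independent variables (Age, EstimatedSalary)
                             y = training_set$Purchased, # dependent variable, already a factor
                             ntree = 500)                # number of trees in the forest
rf_pred = predict(rf_classifier, newdata = test_set[-3]) # returns the class directly, no type = 'class' needed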
Because of the slight variation in how the decision tree classifier returns predictions, we will need to add type = 'class'; first, let's see what we get without it.
y_pred = predict(classifier, newdata = test_set[-3])
Note: it's a matrix! And it's a matrix of probabilities; column 0 is the probability that the user will not buy the SUV and column 1 is the probability that the user will buy the SUV.
head(y_pred)
## 0 1
## 2 0.967033 0.03296703
## 4 0.967033 0.03296703
## 5 0.967033 0.03296703
## 9 0.967033 0.03296703
## 12 0.967033 0.03296703
## 18 0.967033 0.03296703
Let’s fix that.
y_pred = predict(classifier, newdata = test_set[-3], type = 'class')
Let’s look
y_pred
## 2 4 5 9 12 18 19 20 22 29 32 34 35 38 45 46 48 52
## 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 0 0 0
## 66 69 74 75 82 84 85 86 87 89 103 104 107 108 109 117 124 126
## 0 0 1 0 0 1 0 1 0 0 1 1 0 1 1 0 0 0
## 127 131 134 139 148 154 156 159 162 163 170 175 176 193 199 200 208 213
## 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1
## 224 226 228 229 230 234 236 237 239 241 255 264 265 266 273 274 281 286
## 1 0 1 0 0 1 1 0 1 1 0 0 1 1 1 1 1 1
## 292 299 302 305 307 310 316 324 326 332 339 341 343 347 353 363 364 367
## 1 0 0 0 1 0 0 1 0 1 0 1 0 1 1 0 0 1
## 368 369 372 373 380 383 389 392 395 400
## 1 0 1 0 1 1 1 1 0 1
## Levels: 0 1
Now we get the normal confusion matrix (CM) because we added type = 'class'.
cm = table(test_set[, 3], y_pred)
cm
## y_pred
## 0 1
## 0 53 11
## 1 6 30
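To read the matrix: rows are the real values and columns are the predictions, so the diagonal holds the correct predictions (53 true negatives and 30 true positives). Out of the 100 test observations that is (53 + 30) / 100 = 83% accuracy, which we can compute straight from cm:
accuracy = sum(diag(cm)) / sum(cm) # correct predictions / all predictions
accuracy                           # 0.83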
Visualising the Training set results
library(ElemStatLearn)
# declare set as the training set
set = training_set
# this section creates the red/green background region. The 'by' argument works like the
# step size in Python: each 0.01 point of the grid is classified as 0 or 1 and colored
# green or red. The -1 and +1 give us space around the edges so the dots are not jammed.
# Another way to think of 'by' is as the resolution of the graphing of the background
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
# just giving a name to the X and Y
colnames(grid_set) = c('Age', 'EstimatedSalary')
# this is the MAGIC of the background coloring
# here we use the classifier to predict the result for each of the pixel points noted above.
# NOTE: we need type = 'class' here; otherwise y_grid would be a matrix of probabilities!
y_grid = predict(classifier, newdata = grid_set, type = 'class')
# that's the end of the background
# now we plot the actual data
plot(set[, -3],
main = 'Decision Tree (Training set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2)) # this bit sets the limits of the values plotted; together with the contour below it is part of the MAGIC that draws the line between green and red
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
# here we run through the y_grid predictions and use ifelse to color each background point
# note the dots (next line) are the real data; the background is the pixel-by-pixel determination of y/n
# graphing the dots on top of the background gives you the image
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
# NOTE: we need type = 'class' here; otherwise y_grid would be a matrix of probabilities!
y_grid = predict(classifier, newdata = grid_set, type = 'class')
plot(set[, -3], main = 'Decision Tree (Test set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
In the Training graph we may see peculiar groupings of red (for instance). These in and of themselves are not troublesome. However, when we then use the classifier on the Test set we find that not only do those areas tend to contain no accurate predictions (as in, there is nothing there), we also see mistakes, where green dots land in those peculiar areas. This is indicative of overfitting of our model to the Training data: the model has gotten more accustomed to the specific data points than to the general context of the data, so to speak.
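One common remedy (not covered in the lecture, just a sketch) is to constrain or prune the tree so it can't chase those tiny pockets of the Training data; rpart's complexity parameter cp does this, and cp = 0.05 here is an arbitrary value chosen only to illustrate the idea:
# prune the existing tree: a higher cp demands a bigger improvement per split, giving a simpler tree
pruned_classifier = prune(classifier, cp = 0.05)
# or fit a constrained tree from the start
classifier_small = rpart(formula = Purchased ~ .,
                         data = training_set,
                         control = rpart.control(cp = 0.05))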
However, we need to take the feature scaling out so we can read the splits in their original units :D
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[3:5]
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# install.packages('rpart')
library(rpart)
classifier = rpart(formula = Purchased ~ .,
data = training_set)
Plotting the tree
plot(classifier)
text(classifier)
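The base plot can be hard to read. As an optional extra (not part of the lecture code), the rpart.plot package draws the same classifier with labeled splits and class proportions:
# install.packages('rpart.plot')
library(rpart.plot)
rpart.plot(classifier) # same tree, splits shown in the original (unscaled) units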