Based on the topics presented, bring a dataset of your choice and create a decision tree to solve a classification or regression problem and predict the outcome of a particular feature of the data used. Switch variables to generate two decision trees and compare the results. Create a random forest for regression and analyze the results.
Description: Gas mileage, horsepower, and other information for 392 vehicles.
Usage: Auto
Format: A data frame with 392 observations on the following 9 variables.
mpg: miles per gallon
cylinders: number of cylinders between 4 and 8
displacement: engine displacement (cu. inches)
horsepower: engine horsepower
weight: vehicle weight (lbs.)
acceleration: time to accelerate from 0 to 60 mph (sec.)
year: model year (modulo 100)
origin: origin of car (1. American, 2. European, 3. Japanese)
name: vehicle name
The original data contained 408 observations but 16 observations with missing values were removed.
Source: This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition.
# Get the list of data sets contained in the ISLR package
x <- data(package = "ISLR")
x$results[, "Item"]
## [1] "Auto" "Caravan" "Carseats" "College" "Credit" "Default"
## [7] "Hitters" "Khan" "NCI60" "OJ" "Portfolio" "Smarket"
## [13] "Wage" "Weekly"
colnames(x)
## NULL
As you can see, the data sets available in the package range from Auto to Weekly.
data(Carseats)
# Get the variable names
names(Carseats)
## [1] "Sales" "CompPrice" "Income" "Advertising" "Population"
## [6] "Price" "ShelveLoc" "Age" "Education" "Urban"
## [11] "US"
dim(Carseats)
## [1] 400 11
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1 9.50 138 73 11 276 120 Bad 42 17
## 2 11.22 111 48 16 260 83 Good 65 10
## 3 10.06 113 35 10 269 80 Medium 59 12
## 4 7.40 117 100 4 466 97 Medium 55 14
## 5 4.15 141 64 3 340 128 Bad 38 13
## 6 10.81 124 113 13 501 72 Bad 78 16
## Urban US
## 1 Yes Yes
## 2 Yes Yes
## 3 Yes Yes
## 4 Yes Yes
## 5 Yes No
## 6 No Yes
As suggested by the assignment instructions, I will build a classification tree to analyze the Carseats data set, a simulated data set containing sales of child car seats at 400 different stores. It has 400 observations and 11 variables. I am interested in predicting Sales from the other variables in the data set. Since Sales is a continuous variable, we need to recode it as a binary variable: the new variable, High, takes the value Yes if Sales exceeds 8 and No otherwise.
# Create the binary response: "Yes" when Sales exceeds 8, "No" otherwise
High = ifelse(Carseats$Sales <= 8, "No", "Yes")
Carseats = data.frame(Carseats, High)
# Drop Sales (column 1) and store High as a factor
Carseats.H <- Carseats[,-1]
Carseats.H$High = as.factor(Carseats$High)
class(Carseats.H$High)
## [1] "factor"
set.seed(888)
# Split the 400 observations into a 200-observation training set and a 200-observation test set
thetrain = sample(1:nrow(Carseats.H), 200)
Carseats.thetrain = Carseats.H[thetrain,]
Carseats.thetest = Carseats.H[-thetrain,]
High.thetest = High[-thetrain]
My first step is to fit a classification tree using the training set to predict High with all variables except Sales (remember that High was derived from Sales).
The cp value (complexity parameter) is a stopping parameter: a split must improve the relative fit by at least cp to be kept, which speeds up the search because splits that do not meet this criterion are pruned away before the tree grows too far.
If you want to build very deep trees, the default value of 0.01 may be too restrictive.
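As a quick illustrative sketch (the cp values below are arbitrary choices, not part of the assignment output), fitting the same model with a looser threshold and with the default shows how cp controls tree size:
library(rpart)
# A smaller cp keeps weaker splits, so the tree grows larger
deep.tree    <- rpart(High ~ ., data = Carseats.thetrain, method = "class", cp = 0.001)
default.tree <- rpart(High ~ ., data = Carseats.thetrain, method = "class", cp = 0.01)
nrow(deep.tree$frame)      # number of nodes with the looser threshold
nrow(default.tree$frame)   # number of nodes with the default cp = 0.01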
fit.thetree = rpart(High ~ ., data=Carseats.thetrain, method = "class", cp=0.008)
fit.thetree
## n= 200
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 200 87 No (0.56500000 0.43500000)
## 2) Price>=96.5 161 57 No (0.64596273 0.35403727)
## 4) ShelveLoc=Bad 42 3 No (0.92857143 0.07142857) *
## 5) ShelveLoc=Good,Medium 119 54 No (0.54621849 0.45378151)
## 10) Advertising< 8.5 65 19 No (0.70769231 0.29230769)
## 20) CompPrice< 144.5 51 9 No (0.82352941 0.17647059) *
## 21) CompPrice>=144.5 14 4 Yes (0.28571429 0.71428571) *
## 11) Advertising>=8.5 54 19 Yes (0.35185185 0.64814815)
## 22) ShelveLoc=Medium 36 18 No (0.50000000 0.50000000)
## 44) Education>=13.5 21 6 No (0.71428571 0.28571429) *
## 45) Education< 13.5 15 3 Yes (0.20000000 0.80000000) *
## 23) ShelveLoc=Good 18 1 Yes (0.05555556 0.94444444) *
## 3) Price< 96.5 39 9 Yes (0.23076923 0.76923077)
## 6) CompPrice< 99 8 3 No (0.62500000 0.37500000) *
## 7) CompPrice>=99 31 4 Yes (0.12903226 0.87096774) *
# Visualizing
rpart.plot(fit.thetree)
pred.thetree = predict(fit.thetree, Carseats.thetest, type = "class")
table(pred.thetree,High.thetest)
## High.thetest
## pred.thetree No Yes
## No 90 32
## Yes 33 45
#plotcp(fit.thetree)
printcp(fit.thetree)
##
## Classification tree:
## rpart(formula = High ~ ., data = Carseats.thetrain, method = "class",
## cp = 0.008)
##
## Variables actually used in tree construction:
## [1] Advertising CompPrice Education Price ShelveLoc
##
## Root node error: 87/200 = 0.435
##
## n= 200
##
## CP nsplit rel error xerror xstd
## 1 0.241379 0 1.00000 1.00000 0.080587
## 2 0.091954 1 0.75862 0.98851 0.080476
## 3 0.068966 3 0.57471 0.87356 0.078901
## 4 0.051724 4 0.50575 0.73563 0.075827
## 5 0.022989 6 0.40230 0.73563 0.075827
## 6 0.008000 7 0.37931 0.65517 0.073379
# cp value with the lowest cross-validated error (xerror)
fit.thetree$cptable[which.min(fit.thetree$cptable[,"xerror"]),"CP"]
## [1] 0.008
Next, we'll prune the classification tree to find the optimal value of cp (the complexity parameter), the one that leads to the lowest test error.
Note that the optimal value for cp is the one that leads to the lowest xerror in the previous output, which represents the error on the held-out observations from cross-validation.
bestcp <-fit.thetree$cptable[which.min(fit.thetree$cptable[,"xerror"]),"CP"]
pruned.thetree <- prune(fit.thetree, cp = bestcp)
rpart.plot(pruned.thetree)
pred.prune = predict(pruned.thetree, Carseats.thetest, type="class")
table(pred.prune, High.thetest)
## High.thetest
## pred.prune No Yes
## No 90 32
## Yes 33 45
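To summarize these confusion matrices with a single number, overall accuracy can be computed from the table above; here both the unpruned and pruned trees classify (90 + 45) / 200 = 67.5% of the test cases correctly (a quick check, not part of the original output):
# Overall test accuracy of the pruned classification tree
conf.prune <- table(pred.prune, High.thetest)
sum(diag(conf.prune)) / sum(conf.prune)   # (90 + 45) / 200 = 0.675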
# Remove the High variable (column 12) so Sales can be modeled directly
Carseats.S <- Carseats[,-12]
set.seed(999)
thetrain = sample(1:nrow(Carseats.S), 200)
Carseats.thetrain=Carseats.S[thetrain,]
Carseats.thetest=Carseats.S[-thetrain,]
Analysis of variance (ANOVA) consists of calculations that provide information about levels of variability within a regression model and form a basis for tests of significance. The basic regression concept, DATA = FIT + RESIDUAL, is rewritten as follows: $(y_i - \bar{y}) = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)$.
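Summing the squares of both sides (the cross term vanishes for a least-squares fit) gives the familiar identity SST = SSR + SSE. A minimal, self-contained check on simulated data (not the Carseats data):
# Illustrative check of DATA = FIT + RESIDUAL on simulated data
set.seed(1)
x <- rnorm(50)
y <- 5 + 2 * x + rnorm(50)
fit <- lm(y ~ x)
sst <- sum((y - mean(y))^2)             # total sum of squares (DATA)
ssr <- sum((fitted(fit) - mean(y))^2)   # sum of squares explained by the fit (FIT)
sse <- sum(residuals(fit)^2)            # residual sum of squares (RESIDUAL)
all.equal(sst, ssr + sse)               # TRUE: the pieces add up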
fit.thetree = rpart(Sales ~ ., data=Carseats.thetrain, method="anova", cp=0.008)
#summary(fit.thetree)
fit.thetree
## n= 200
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 200 1605.306000 7.289650
## 2) ShelveLoc=Bad,Medium 161 982.688100 6.655714
## 4) Price>=94.5 135 646.254000 6.136593
## 8) ShelveLoc=Bad 46 163.515600 4.679565
## 16) Population< 106 11 48.512490 3.319091 *
## 17) Population>=106 35 88.244510 5.107143
## 34) Age>=33.5 28 65.860070 4.787857 *
## 35) Age< 33.5 7 8.112371 6.384286 *
## 9) ShelveLoc=Medium 89 334.610500 6.889663
## 18) Price>=127 29 83.957080 5.622069
## 36) Advertising< 3.5 13 18.137710 4.636154 *
## 37) Advertising>=3.5 16 42.915940 6.423125 *
## 19) Price< 127 60 181.534500 7.502333
## 38) Age>=60.5 25 70.170620 6.565600
## 76) CompPrice< 118.5 9 10.483560 5.307778 *
## 77) CompPrice>=118.5 16 37.438540 7.273125 *
## 39) Age< 60.5 35 73.758030 8.171429
## 78) Advertising< 6 18 29.197650 7.171667 *
## 79) Advertising>=6 17 7.519200 9.230000 *
## 5) Price< 94.5 26 111.153100 9.351154
## 10) Advertising< 9 15 70.662090 8.597333 *
## 11) Advertising>=9 11 20.344090 10.379090 *
## 3) ShelveLoc=Good 39 290.814100 9.906667
## 6) Price>=135 9 15.959890 6.391111 *
## 7) Price< 135 30 130.252300 10.961330
## 14) Age>=62 7 12.926490 9.168571 *
## 15) Age< 62 23 87.980690 11.506960
## 30) Urban=No 9 22.282160 9.887778 *
## 31) Urban=Yes 14 26.934240 12.547860 *
rpart.plot(fit.thetree)
fit.thetree$variable.importance
## ShelveLoc Price Age Advertising CompPrice Income
## 483.59509 474.09259 139.48793 87.61880 78.87432 57.47487
## US Population Urban Education
## 55.24076 54.08655 38.76430 30.12734
pred.thetree = predict(fit.thetree, Carseats.thetest)
The mean squared error (MSE) tells you how close a model's predictions are to the observed values. It takes the differences between the predictions and the actual observations (these differences are the "errors"), squares them, and averages them. Squaring removes any negative signs and also gives more weight to larger differences. It's called the mean squared error because you're averaging a set of squared errors. The lower the MSE, the better the forecast.
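In symbols, with $y_i$ the observed value and $\hat{y}_i$ the prediction for test observation $i$, the quantity computed below is
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2.$$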
# mean square error
mse <- mean((pred.thetree - Carseats.thetest$Sales)^2)
mse
## [1] 4.530078
# CP value
printcp(fit.thetree)
##
## Regression tree:
## rpart(formula = Sales ~ ., data = Carseats.thetrain, method = "anova",
## cp = 0.008)
##
## Variables actually used in tree construction:
## [1] Advertising Age CompPrice Population Price ShelveLoc
## [7] Urban
##
## Root node error: 1605.3/200 = 8.0265
##
## n= 200
##
## CP nsplit rel error xerror xstd
## 1 0.2066921 0 1.00000 1.00668 0.094663
## 2 0.1403352 1 0.79331 0.87310 0.081991
## 3 0.0922740 2 0.65297 0.69819 0.067697
## 4 0.0900774 3 0.56070 0.69214 0.067207
## 5 0.0430565 4 0.47062 0.57741 0.056675
## 6 0.0234260 5 0.42756 0.57185 0.061614
## 7 0.0230742 6 0.40414 0.59280 0.063329
## 8 0.0212139 7 0.38106 0.59789 0.063341
## 9 0.0166688 9 0.33864 0.60690 0.064391
## 10 0.0142673 10 0.32197 0.64061 0.063234
## 11 0.0138594 11 0.30770 0.62653 0.061521
## 12 0.0125502 12 0.29384 0.63228 0.061349
## 13 0.0088906 13 0.28129 0.64558 0.060469
## 14 0.0080000 14 0.27240 0.64923 0.060476
bestcp <- fit.thetree$cptable[which.min(fit.thetree$cptable[,"xerror"]),"CP"]
bestcp
## [1] 0.02342595
The accuracy of the model on the test data is better when the tree is pruned, which means that the pruned decision tree model generalizes well and is more suited for a production environment. However, other factors can also influence decision tree model creation, such as building a tree on unbalanced classes. These factors were not accounted for in this demonstration, but it is very important to examine them when formulating a live model.
pruned.thetree <- prune(fit.thetree, cp = bestcp)
# Visualize the pruned tree
rpart.plot(pruned.thetree)
# Checking the order of variable importance
pruned.thetree$variable.importance
## ShelveLoc Price Age Income CompPrice Education
## 479.932017 442.221990 35.353913 33.396181 29.246370 7.150235
## Advertising Population
## 3.220173 2.383412
The decision tree can support validation because it is a strong, interpretable predictive model. It is useful for quantitative analysis of business problems, and it can help validate the results of statistical tests. In addition, with only minor modification it naturally supports classification problems with several classes.
# Use the test data to evaluate performance of the pruned regression tree
pred.prune = predict(pruned.thetree, Carseats.thetest)
# Calculate the MSE for the pruned tree
mse <- mean((pred.prune - Carseats.thetest$Sales)^2)
mse
## [1] 4.897713
Random forest is based on the bagging algorithm and uses an ensemble learning technique. It builds many trees on subsets of the data and combines the output of all the trees. In this way it reduces the overfitting problem of single decision trees, lowers the variance, and therefore improves accuracy.
Random forest can be used to solve both classification and regression problems.
Random forest works well with both categorical and continuous variables.
Random forest can handle missing values (for example, through imputation strategies provided by some implementations).
No feature scaling is required: random forest needs no standardization or normalization because it uses a rule-based approach rather than distance calculations.
# random forest using all predictors
# using the randomForest package
modFit.rf <- randomForest::randomForest(Carseats.thetrain$Sales ~ ., data = Carseats.thetrain[,c(1:11)])
modFit.rf
##
## Call:
## randomForest(formula = Carseats.thetrain$Sales ~ ., data = Carseats.thetrain[, c(1:11)])
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 3.232277
## % Var explained: 59.73
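As a follow-up sketch (not part of the original output), the forest's variable importance can be compared with the single tree's ranking shown earlier; importance() and varImpPlot() are provided by the randomForest package:
# Mean decrease in node impurity (IncNodePurity) for each predictor
randomForest::importance(modFit.rf)
randomForest::varImpPlot(modFit.rf)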
forest_pred <- predict(modFit.rf, Carseats.thetest)
table(forest_pred)
## forest_pred
## 4.49766066666666 4.553272 4.74268461904762 4.84429633333333
## 1 1 1 1
## 4.882053 4.90977966666667 4.916167 4.92679866666667
## 1 1 1 1
## 5.08684316666667 5.09389766666667 5.13174255555556 5.143352
## 1 1 1 1
## 5.20166183333333 5.33597961904761 5.41031466666666 5.42922566666667
## 1 1 1 1
## 5.44026733333333 5.62702733333333 5.64218895238094 5.66944888095238
## 1 1 1 1
## 5.67166533333333 5.69624733333333 5.825111 5.834741
## 1 1 1 1
## 5.85814066666667 5.87860738095238 5.89826866666667 5.96244161904762
## 1 1 1 1
## 6.02942271428571 6.043148 6.04935166666667 6.06333661904762
## 1 1 1 1
## 6.08969066666667 6.12923283333333 6.20497100000001 6.20790599999999
## 1 1 1 1
## 6.23353299999999 6.24949233333333 6.32152233333333 6.34873466666666
## 1 1 1 1
## 6.36093066666667 6.36178433333333 6.37020176190476 6.440513
## 1 1 1 1
## 6.45411866666667 6.460274 6.468564 6.47315
## 1 1 1 1
## 6.50512361904762 6.518236 6.542815 6.56347799999999
## 1 1 1 1
## 6.56632616666667 6.58527966666667 6.612339 6.630332
## 1 1 1 1
## 6.657704 6.659737 6.66003588888889 6.66013833333333
## 1 1 1 1
## 6.66025654761905 6.66469266666667 6.69525128571428 6.70967971428571
## 1 1 1 1
## 6.71034083333332 6.71400961904761 6.729461 6.73066033333333
## 1 1 1 1
## 6.747615 6.770789 6.78079833333332 6.823457
## 1 1 1 1
## 6.83419466666667 6.8782 6.889779 6.892132
## 1 1 1 1
## 6.89628266666666 6.91858033333334 6.95286433333334 7.00136583333334
## 1 1 1 1
## 7.02389080952381 7.02771199999999 7.03207566666666 7.07083433333335
## 1 1 1 1
## 7.08448661904762 7.09387000000001 7.09501466666666 7.11656666666667
## 1 1 1 1
## 7.134659 7.14507866666666 7.237091 7.24696433333333
## 1 1 1 1
## 7.24784095238095 7.28751466666666 7.290333 7.30234133333334
## 1 1 1 1
## 7.33092099999999 7.33240428571429 7.42347380952381 7.42905366666667
## 1 1 1 1
## 7.43322299999999 7.43324433333333 7.43324533333333 7.44922799999999
## 1 1 1 1
## 7.45635100000001 7.46085933333334 7.46259466666665 7.48436533333333
## 1 1 1 1
## 7.507944 7.519279 7.53220399999999 7.53987033333333
## 1 1 1 1
## 7.58219422222222 7.59806666666667 7.6013 7.66689866666667
## 1 1 1 1
## 7.70365899999999 7.73442738095238 7.75570966666667 7.76352133333334
## 1 1 1 1
## 7.78285166666667 7.79906728571428 7.80812733333333 7.831732
## 1 1 1 1
## 7.90635866666666 7.92275 7.93173066666667 7.95244433333334
## 1 1 1 1
## 7.95700066666666 7.963845 7.97789366666666 7.98678133333334
## 1 1 1 1
## 7.99230916666666 7.99579480952382 8.02173850000001 8.03621566666667
## 1 1 1 1
## 8.09594166666667 8.09940071428571 8.2424045 8.27331433333334
## 1 1 1 1
## 8.27478033333333 8.34325595238097 8.37773866666666 8.43998066666666
## 1 1 1 1
## 8.54173966666666 8.54311 8.56097883333334 8.612151
## 1 1 1 1
## 8.61851666666667 8.71259409523809 8.751208 8.85622416666667
## 1 1 1 1
## 8.87673733333333 8.90295466666666 8.94765433333332 8.97254733333333
## 1 1 1 1
## 8.977202 8.98490516666666 8.98880828571429 9.01419433333333
## 1 1 1 1
## 9.05158583333332 9.13910233333333 9.25350033333333 9.28426899999999
## 1 1 1 1
## 9.28980111111112 9.29261899999999 9.30257266666666 9.31912499999999
## 1 1 1 1
## 9.34772299999999 9.351979 9.39249666666667 9.44541966666667
## 1 1 1 1
## 9.44659633333334 9.45246266666666 9.50048399999999 9.63173233333331
## 1 1 1 1
## 9.6427775 9.67040744444443 9.80139166666666 9.83688099999999
## 1 1 1 1
## 9.88388566666668 9.91572733333333 9.96535833333333 10.0005271111111
## 1 1 1 1
## 10.044763 10.0535023333333 10.0589236666667 10.0940126666667
## 1 1 1 1
## 10.0968243333333 10.1466936666667 10.212493 10.2341617777778
## 1 1 1 1
## 10.2349104444444 10.2355576666667 10.267478 10.3406077777778
## 1 1 1 1
## 10.3625343333333 10.4262394444444 11.2468724444444 11.4011914444444
## 1 1 1 1
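Since forest_pred is a vector of continuous predictions, each unique value appears only once in the table above. A more informative summary is the test MSE, which puts the forest on the same footing as the two regression trees (test MSEs of about 4.53 and 4.90); the numeric result is not reproduced here:
# Test-set mean squared error for the random forest predictions
mse.rf <- mean((forest_pred - Carseats.thetest$Sales)^2)
mse.rf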
Based on real cases where decision trees went wrong, and ‘the bad & ugly’ aspects of decision trees (https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-thetrees), how can you change this perception when using the decision tree you created to solve a real problem?
In my opinion, a single decision tree is better when the dataset has one feature that is really important to the decision. Random forest selects features at random when building its trees, so if one feature is dominant, some of the trees will not reflect the significance that feature has in the final decision. On the other hand, I think random forest is good for avoiding problems caused by low-quality data. For example, imagine a dataset in which all houses with green doors happen to have a high cost: a single decision tree would pick up this bias in the data, whereas a random forest can average it away.