Based on the topics presented, bring a dataset of your choice and create a decision tree to solve a classification or regression problem and predict the outcome of a particular feature of the data used. Switch variables to generate two decision trees and compare the results. Create a random forest for regression and analyze the results.
Description: Gas mileage, horsepower, and other information for 392 vehicles.
Usage: Auto
Format: A data frame with 392 observations on the following 9 variables.
mpg: miles per gallon
cylinders: number of cylinders between 4 and 8
displacement: engine displacement (cu. inches)
horsepower: engine horsepower
weight: vehicle weight (lbs.)
acceleration: time to accelerate from 0 to 60 mph (sec.)
year: model year (modulo 100)
origin: origin of car (1. American, 2. European, 3. Japanese)
name: vehicle name
The original data contained 408 observations but 16 observations with missing values were removed.
Source: This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition.
# Get the list of data sets contained in the ISLR package
x <- data(package = "ISLR")
x$results[, "Item"]
## [1] "Auto" "Caravan" "Carseats" "College" "Credit" "Default"
## [7] "Hitters" "Khan" "NCI60" "OJ" "Portfolio" "Smarket"
## [13] "Wage" "Weekly"
colnames(x)
## NULL
As you can see, the data sets available in the package range from Auto to Weekly.
data(Carseats)
# Get the variable names
names(Carseats)
## [1] "Sales" "CompPrice" "Income" "Advertising" "Population"
## [6] "Price" "ShelveLoc" "Age" "Education" "Urban"
## [11] "US"
dim(Carseats)
## [1] 400 11
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1 9.50 138 73 11 276 120 Bad 42 17
## 2 11.22 111 48 16 260 83 Good 65 10
## 3 10.06 113 35 10 269 80 Medium 59 12
## 4 7.40 117 100 4 466 97 Medium 55 14
## 5 4.15 141 64 3 340 128 Bad 38 13
## 6 10.81 124 113 13 501 72 Bad 78 16
## Urban US
## 1 Yes Yes
## 2 Yes Yes
## 3 Yes Yes
## 4 Yes Yes
## 5 Yes No
## 6 No Yes
As suggested by the assignment instructions, I will build a classification tree to analyze the Carseats data set, a simulated data set containing sales of child car seats at 400 different stores. It has 400 observations and 11 variables. I am interested in predicting Sales from the other variables in the data set. Since Sales is a continuous variable, we need to recode it as a binary variable: the new variable, High, takes the value Yes if Sales exceeds 8 and No otherwise.
# Create the binary response: "Yes" when Sales exceeds 8, "No" otherwise
High = ifelse(Carseats$Sales <= 8, "No", "Yes")
Carseats = data.frame(Carseats, High)
# Drop Sales (column 1) and store High as a factor
Carseats.H <- Carseats[,-1]
Carseats.H$High = as.factor(Carseats$High)
class(Carseats.H$High)
## [1] "factor"
set.seed(888)
# Split the 400 observations into a 200-observation training set and a 200-observation test set
thetrain = sample(1:nrow(Carseats.H), 200)
Carseats.thetrain = Carseats.H[thetrain,]
Carseats.thetest = Carseats.H[-thetrain,]
High.thetest = High[-thetrain]
My first step is to fit a classification tree using the training set to predict High with all variables except Sales (remember that High was derived from Sales).
The cp value (complexity parameter) is a stopping parameter: a split must improve the relative fit by at least cp to be kept, which speeds up the search because splits that do not meet this criterion are pruned away before the tree grows too far.
If you want to build very deep trees, the default value of 0.01 may be too restrictive.
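As a quick illustrative sketch (the cp values below are arbitrary choices, not part of the assignment output), fitting the same model with a looser threshold and with the default shows how cp controls tree size:
library(rpart)
# A smaller cp keeps weaker splits, so the tree grows larger
deep.tree    <- rpart(High ~ ., data = Carseats.thetrain, method = "class", cp = 0.001)
default.tree <- rpart(High ~ ., data = Carseats.thetrain, method = "class", cp = 0.01)
nrow(deep.tree$frame)      # number of nodes with the looser threshold
nrow(default.tree$frame)   # number of nodes with the default cp = 0.01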
fit.thetree = rpart(High ~ ., data=Carseats.thetrain, method = "class", cp=0.008)
fit.thetree
## n= 200
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 200 87 No (0.56500000 0.43500000)
## 2) Price>=96.5 161 57 No (0.64596273 0.35403727)
## 4) ShelveLoc=Bad 42 3 No (0.92857143 0.07142857) *
## 5) ShelveLoc=Good,Medium 119 54 No (0.54621849 0.45378151)
## 10) Advertising< 8.5 65 19 No (0.70769231 0.29230769)
## 20) CompPrice< 144.5 51 9 No (0.82352941 0.17647059) *
## 21) CompPrice>=144.5 14 4 Yes (0.28571429 0.71428571) *
## 11) Advertising>=8.5 54 19 Yes (0.35185185 0.64814815)
## 22) ShelveLoc=Medium 36 18 No (0.50000000 0.50000000)
## 44) Education>=13.5 21 6 No (0.71428571 0.28571429) *
## 45) Education< 13.5 15 3 Yes (0.20000000 0.80000000) *
## 23) ShelveLoc=Good 18 1 Yes (0.05555556 0.94444444) *
## 3) Price< 96.5 39 9 Yes (0.23076923 0.76923077)
## 6) CompPrice< 99 8 3 No (0.62500000 0.37500000) *
## 7) CompPrice>=99 31 4 Yes (0.12903226 0.87096774) *
# Visualizing
rpart.plot(fit.thetree)
pred.thetree = predict(fit.thetree, Carseats.thetest, type = "class")
table(pred.thetree,High.thetest)
## High.thetest
## pred.thetree No Yes
## No 90 32
## Yes 33 45
#plotcp(fit.thetree)
printcp(fit.thetree)
##
## Classification tree:
## rpart(formula = High ~ ., data = Carseats.thetrain, method = "class",
## cp = 0.008)
##
## Variables actually used in tree construction:
## [1] Advertising CompPrice Education Price ShelveLoc
##
## Root node error: 87/200 = 0.435
##
## n= 200
##
## CP nsplit rel error xerror xstd
## 1 0.241379 0 1.00000 1.00000 0.080587
## 2 0.091954 1 0.75862 0.98851 0.080476
## 3 0.068966 3 0.57471 0.87356 0.078901
## 4 0.051724 4 0.50575 0.73563 0.075827
## 5 0.022989 6 0.40230 0.73563 0.075827
## 6 0.008000 7 0.37931 0.65517 0.073379
# cp value with the lowest cross-validated error (xerror)
fit.thetree$cptable[which.min(fit.thetree$cptable[,"xerror"]),"CP"]
## [1] 0.008
Next, we'll prune the classification tree to find the optimal value of cp (the complexity parameter), the one that leads to the lowest test error.
Note that the optimal value for cp is the one that leads to the lowest xerror in the previous output, which represents the error on the held-out observations from cross-validation.
bestcp <-fit.thetree$cptable[which.min(fit.thetree$cptable[,"xerror"]),"CP"]
pruned.thetree <- prune(fit.thetree, cp = bestcp)
rpart.plot(pruned.thetree)
pred.prune = predict(pruned.thetree, Carseats.thetest, type="class")
table(pred.prune, High.thetest)
## High.thetest
## pred.prune No Yes
## No 90 32
## Yes 33 45
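To summarize these confusion matrices with a single number, overall accuracy can be computed from the table above; here both the unpruned and pruned trees classify (90 + 45) / 200 = 67.5% of the test cases correctly (a quick check, not part of the original output):
# Overall test accuracy of the pruned classification tree
conf.prune <- table(pred.prune, High.thetest)
sum(diag(conf.prune)) / sum(conf.prune)   # (90 + 45) / 200 = 0.675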
# Remove the High variable (column 12) so Sales can be modeled directly
Carseats.S <- Carseats[,-12]
set.seed(999)
thetrain = sample(1:nrow(Carseats.S), 200)
Carseats.thetrain=Carseats.S[thetrain,]
Carseats.thetest=Carseats.S[-thetrain,]
Analysis of variance (ANOVA) consists of calculations that provide information about levels of variability within a regression model and form a basis for tests of significance. The basic regression concept, DATA = FIT + RESIDUAL, is rewritten as follows: $(y_i - \bar{y}) = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)$.
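Summing the squares of both sides (the cross term vanishes for a least-squares fit) gives the familiar identity SST = SSR + SSE. A minimal, self-contained check on simulated data (not the Carseats data):
# Illustrative check of DATA = FIT + RESIDUAL on simulated data
set.seed(1)
x <- rnorm(50)
y <- 5 + 2 * x + rnorm(50)
fit <- lm(y ~ x)
sst <- sum((y - mean(y))^2)             # total sum of squares (DATA)
ssr <- sum((fitted(fit) - mean(y))^2)   # sum of squares explained by the fit (FIT)
sse <- sum(residuals(fit)^2)            # residual sum of squares (RESIDUAL)
all.equal(sst, ssr + sse)               # TRUE: the pieces add up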
fit.thetree = rpart(Sales ~ ., data=Carseats.thetrain, method="anova", cp=0.008)
#summary(fit.thetree)
fit.thetree
## n= 200
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 200 1605.306000 7.289650
## 2) ShelveLoc=Bad,Medium 161 982.688100 6.655714
## 4) Price>=94.5 135 646.254000 6.136593
## 8) ShelveLoc=Bad 46 163.515600 4.679565
## 16) Population< 106 11 48.512490 3.319091 *
## 17) Population>=106 35 88.244510 5.107143
## 34) Age>=33.5 28 65.860070 4.787857 *
## 35) Age< 33.5 7 8.112371 6.384286 *
## 9) ShelveLoc=Medium 89 334.610500 6.889663
## 18) Price>=127 29 83.957080 5.622069
## 36) Advertising< 3.5 13 18.137710 4.636154 *
## 37) Advertising>=3.5 16 42.915940 6.423125 *
## 19) Price< 127 60 181.534500 7.502333
## 38) Age>=60.5 25 70.170620 6.565600
## 76) CompPrice< 118.5 9 10.483560 5.307778 *
## 77) CompPrice>=118.5 16 37.438540 7.273125 *
## 39) Age< 60.5 35 73.758030 8.171429
## 78) Advertising< 6 18 29.197650 7.171667 *
## 79) Advertising>=6 17 7.519200 9.230000 *
## 5) Price< 94.5 26 111.153100 9.351154
## 10) Advertising< 9 15 70.662090 8.597333 *
## 11) Advertising>=9 11 20.344090 10.379090 *
## 3) ShelveLoc=Good 39 290.814100 9.906667
## 6) Price>=135 9 15.959890 6.391111 *
## 7) Price< 135 30 130.252300 10.961330
## 14) Age>=62 7 12.926490 9.168571 *
## 15) Age< 62 23 87.980690 11.506960
## 30) Urban=No 9 22.282160 9.887778 *
## 31) Urban=Yes 14 26.934240 12.547860 *
rpart.plot(fit.thetree)
fit.thetree$variable.importance
## ShelveLoc Price Age Advertising CompPrice Income
## 483.59509 474.09259 139.48793 87.61880 78.87432 57.47487
## US Population Urban Education
## 55.24076 54.08655 38.76430 30.12734
pred.thetree = predict(fit.thetree, Carseats.thetest)
The mean squared error (MSE) tells you how close a model's predictions are to the observed values. It takes the differences between the predictions and the actual observations (these differences are the "errors"), squares them, and averages them. Squaring removes any negative signs and also gives more weight to larger differences. It's called the mean squared error because you're averaging a set of squared errors. The lower the MSE, the better the forecast.
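In symbols, with $y_i$ the observed value and $\hat{y}_i$ the prediction for test observation $i$, the quantity computed below is
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2.$$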
# mean square error
mse <- mean((pred.thetree - Carseats.thetest$Sales)^2)
mse
## [1] 4.530078
# CP value
printcp(fit.thetree)
##
## Regression tree:
## rpart(formula = Sales ~ ., data = Carseats.thetrain, method = "anova",
## cp = 0.008)
##
## Variables actually used in tree construction:
## [1] Advertising Age CompPrice Population Price ShelveLoc
## [7] Urban
##
## Root node error: 1605.3/200 = 8.0265
##
## n= 200
##
## CP nsplit rel error xerror xstd
## 1 0.2066921 0 1.00000 1.00668 0.094663
## 2 0.1403352 1 0.79331 0.87310 0.081991
## 3 0.0922740 2 0.65297 0.69819 0.067697
## 4 0.0900774 3 0.56070 0.69214 0.067207
## 5 0.0430565 4 0.47062 0.57741 0.056675
## 6 0.0234260 5 0.42756 0.57185 0.061614
## 7 0.0230742 6 0.40414 0.59280 0.063329
## 8 0.0212139 7 0.38106 0.59789 0.063341
## 9 0.0166688 9 0.33864 0.60690 0.064391
## 10 0.0142673 10 0.32197 0.64061 0.063234
## 11 0.0138594 11 0.30770 0.62653 0.061521
## 12 0.0125502 12 0.29384 0.63228 0.061349
## 13 0.0088906 13 0.28129 0.64558 0.060469
## 14 0.0080000 14 0.27240 0.64923 0.060476
bestcp <- fit.thetree$cptable[which.min(fit.thetree$cptable[,"xerror"]),"CP"]
bestcp
## [1] 0.02342595
The accuracy of the model on the test data is better when the tree is pruned, which means that the pruned decision tree model generalizes well and is more suited for a production environment. However, other factors can also influence decision tree model creation, such as building a tree on unbalanced classes. These factors were not accounted for in this demonstration, but it is very important to examine them when formulating a live model.
pruned.thetree <- prune(fit.thetree, cp = bestcp)
# Visualize the pruned tree
rpart.plot(pruned.thetree)
# Checking the order of variable importance
pruned.thetree$variable.importance
## ShelveLoc Price Age Income CompPrice Education
## 479.932017 442.221990 35.353913 33.396181 29.246370 7.150235
## Advertising Population
## 3.220173 2.383412
The decision tree can support validation because it is a strong, interpretable predictive model. It is useful for quantitative analysis of business problems, and it can help validate the results of statistical tests. In addition, with only minor modification it naturally supports classification problems with several classes.
# Use the test data to evaluate performance of the pruned regression tree
pred.prune = predict(pruned.thetree, Carseats.thetest)
# Calculate the MSE for the pruned tree
mse <- mean((pred.prune - Carseats.thetest$Sales)^2)
mse
## [1] 4.897713
Random forest is based on the bagging algorithm and uses an ensemble learning technique. It builds many trees on subsets of the data and combines the output of all the trees. In this way it reduces the overfitting problem of single decision trees, lowers the variance, and therefore improves accuracy.
Random forest can be used to solve both classification and regression problems.
Random forest works well with both categorical and continuous variables.
Random forest can handle missing values (for example, through imputation strategies provided by some implementations).
No feature scaling is required: random forest needs no standardization or normalization because it uses a rule-based approach rather than distance calculations.
# random forest using all predictors
# using the randomForest package
modFit.rf <- randomForest::randomForest(Carseats.thetrain$Sales ~ ., data = Carseats.thetrain[,c(1:11)])
modFit.rf
##
## Call:
## randomForest(formula = Carseats.thetrain$Sales ~ ., data = Carseats.thetrain[, c(1:11)])
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 3.232277
## % Var explained: 59.73
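As a follow-up sketch (not part of the original output), the forest's variable importance can be compared with the single tree's ranking shown earlier; importance() and varImpPlot() are provided by the randomForest package:
# Mean decrease in node impurity (IncNodePurity) for each predictor
randomForest::importance(modFit.rf)
randomForest::varImpPlot(modFit.rf)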
forest_pred <- predict(modFit.rf, Carseats.thetest)
table(forest_pred)
## forest_pred
## 4.49766066666666 4.553272 4.74268461904762 4.84429633333333
## 1 1 1 1
## 4.882053 4.90977966666667 4.916167 4.92679866666667
## 1 1 1 1
## 5.08684316666667 5.09389766666667 5.13174255555556 5.143352
## 1 1 1 1
## 5.20166183333333 5.33597961904761 5.41031466666666 5.42922566666667
## 1 1 1 1
## 5.44026733333333 5.62702733333333 5.64218895238094 5.66944888095238
## 1 1 1 1
## 5.67166533333333 5.69624733333333 5.825111 5.834741
## 1 1 1 1
## 5.85814066666667 5.87860738095238 5.89826866666667 5.96244161904762
## 1 1 1 1
## 6.02942271428571 6.043148 6.04935166666667 6.06333661904762
## 1 1 1 1
## 6.08969066666667 6.12923283333333 6.20497100000001 6.20790599999999
## 1 1 1 1
## 6.23353299999999 6.24949233333333 6.32152233333333 6.34873466666666
## 1 1 1 1
## 6.36093066666667 6.36178433333333 6.37020176190476 6.440513
## 1 1 1 1
## 6.45411866666667 6.460274 6.468564 6.47315
## 1 1 1 1
## 6.50512361904762 6.518236 6.542815 6.56347799999999
## 1 1 1 1
## 6.56632616666667 6.58527966666667 6.612339 6.630332
## 1 1 1 1
## 6.657704 6.659737 6.66003588888889 6.66013833333333
## 1 1 1 1
## 6.66025654761905 6.66469266666667 6.69525128571428 6.70967971428571
## 1 1 1 1
## 6.71034083333332 6.71400961904761 6.729461 6.73066033333333
## 1 1 1 1
## 6.747615 6.770789 6.78079833333332 6.823457
## 1 1 1 1
## 6.83419466666667 6.8782 6.889779 6.892132
## 1 1 1 1
## 6.89628266666666 6.91858033333334 6.95286433333334 7.00136583333334
## 1 1 1 1
## 7.02389080952381 7.02771199999999 7.03207566666666 7.07083433333335
## 1 1 1 1
## 7.08448661904762 7.09387000000001 7.09501466666666 7.11656666666667
## 1 1 1 1
## 7.134659 7.14507866666666 7.237091 7.24696433333333
## 1 1 1 1
## 7.24784095238095 7.28751466666666 7.290333 7.30234133333334
## 1 1 1 1
## 7.33092099999999 7.33240428571429 7.42347380952381 7.42905366666667
## 1 1 1 1
## 7.43322299999999 7.43324433333333 7.43324533333333 7.44922799999999
## 1 1 1 1
## 7.45635100000001 7.46085933333334 7.46259466666665 7.48436533333333
## 1 1 1 1
## 7.507944 7.519279 7.53220399999999 7.53987033333333
## 1 1 1 1
## 7.58219422222222 7.59806666666667 7.6013 7.66689866666667
## 1 1 1 1
## 7.70365899999999 7.73442738095238 7.75570966666667 7.76352133333334
## 1 1 1 1
## 7.78285166666667 7.79906728571428 7.80812733333333 7.831732
## 1 1 1 1
## 7.90635866666666 7.92275 7.93173066666667 7.95244433333334
## 1 1 1 1
## 7.95700066666666 7.963845 7.97789366666666 7.98678133333334
## 1 1 1 1
## 7.99230916666666 7.99579480952382 8.02173850000001 8.03621566666667
## 1 1 1 1
## 8.09594166666667 8.09940071428571 8.2424045 8.27331433333334
## 1 1 1 1
## 8.27478033333333 8.34325595238097 8.37773866666666 8.43998066666666
## 1 1 1 1
## 8.54173966666666 8.54311 8.56097883333334 8.612151
## 1 1 1 1
## 8.61851666666667 8.71259409523809 8.751208 8.85622416666667
## 1 1 1 1
## 8.87673733333333 8.90295466666666 8.94765433333332 8.97254733333333
## 1 1 1 1
## 8.977202 8.98490516666666 8.98880828571429 9.01419433333333
## 1 1 1 1
## 9.05158583333332 9.13910233333333 9.25350033333333 9.28426899999999
## 1 1 1 1
## 9.28980111111112 9.29261899999999 9.30257266666666 9.31912499999999
## 1 1 1 1
## 9.34772299999999 9.351979 9.39249666666667 9.44541966666667
## 1 1 1 1
## 9.44659633333334 9.45246266666666 9.50048399999999 9.63173233333331
## 1 1 1 1
## 9.6427775 9.67040744444443 9.80139166666666 9.83688099999999
## 1 1 1 1
## 9.88388566666668 9.91572733333333 9.96535833333333 10.0005271111111
## 1 1 1 1
## 10.044763 10.0535023333333 10.0589236666667 10.0940126666667
## 1 1 1 1
## 10.0968243333333 10.1466936666667 10.212493 10.2341617777778
## 1 1 1 1
## 10.2349104444444 10.2355576666667 10.267478 10.3406077777778
## 1 1 1 1
## 10.3625343333333 10.4262394444444 11.2468724444444 11.4011914444444
## 1 1 1 1
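Since forest_pred is a vector of continuous predictions, each unique value appears only once in the table above. A more informative summary is the test MSE, which puts the forest on the same footing as the two regression trees (test MSEs of about 4.53 and 4.90); the numeric result is not reproduced here:
# Test-set mean squared error for the random forest predictions
mse.rf <- mean((forest_pred - Carseats.thetest$Sales)^2)
mse.rf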
Based on real cases where decision trees went wrong, and ‘the bad & ugly’ aspects of decision trees (https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-thetrees), how can you change this perception when using the decision tree you created to solve a real problem?
In my opinion, a single decision tree is better when the dataset has one feature that is really important to the decision. Random forest selects features at random when building its trees, so if one feature is dominant, some of the trees will not reflect the significance that feature has in the final decision. On the other hand, I think random forest is good for avoiding problems caused by low-quality data. For example, imagine a dataset in which all houses with green doors happen to have a high cost: a single decision tree would pick up this bias in the data, whereas a random forest can average it away.