In this section, we answer the research questions defined in the first section of the project using decision tree models. As mentioned in the previous part, our dataset contains a continuous target variable; therefore, the appropriate decision tree technique is regression trees. Predictors are selected based on their correlation with the target variable specified by each research question, limiting the features to the six most highly correlated ones. Finally, we conduct validation and evaluation assessments and report R2, RMSE, MAE, and MSE for every model.
Pruning reduces the size of a decision tree by removing sections of the tree that are non-critical or redundant; this improves predictive accuracy by reducing overfitting.
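As a minimal, self-contained sketch of this idea (using R's built-in mtcars data purely for illustration, not our project data):
library(rpart)
#Grow a deliberately oversized tree, then cut it back
toy <- rpart(mpg ~ ., data = mtcars, control = rpart.control(cp = 0))
toy$cptable   #columns: CP, nsplit, rel error, xerror (cross-validated error), xstd
bestcp <- toy$cptable[which.min(toy$cptable[, "xerror"]), "CP"]
pruned <- prune(toy, cp = bestcp)   #removes splits whose improvement falls below bestcp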
The following steps are conducted for each research question:
Load the dataset and slice the predictor and target attributes related to the research question
Create a train-test split to train, evaluate, and validate the model
Build a full decision tree and optimize it by pruning with the optimal CP
Evaluate the model with metrics such as RMSE, R-squared, and MAE
Validate the model with LOOCV, K-fold cross-validation, and repeated K-fold cross-validation
First research question
In this section, we create a model for the following research question: “Which ML model has the best metrics to predict the percentage of kids that might end up in family housing with two parents?” The following highly correlated features are used for the model:
PctIlleg: percentage of kids born to never married
pctWPubAsst: percentage of households with public assistance income in 1989
PctPopUnderPov: percentage of people under the poverty level
medFamInc: median family income (differs from household income for non-family households)
medIncome: median household income
pctWInvInc: percentage of households with investment / rent income in 1989
PctKids2Par: percentage of kids in family housing with two parents (the target attribute for this question)
First, load the dataset and slice the predictor and target attributes related to the research question.
#setwd("D:/Downloads")
#load the data
data_corr <- read.csv("data_clean.csv",header = T)
#Slice related attributes into df dataframe
df <- data_corr[c("PctIlleg","pctWPubAsst","PctPopUnderPov","medFamInc","medIncome","pctWInvInc","PctKids2Par")]
head(df)
Next, we create a train-test split to train, evaluate, and validate our model; the test set is 20% of the entire dataset. In addition, the seed is set to zero so the experiment can be replicated.
# Test-Train Split
install.packages('caTools')
library(caTools)
set.seed(0)
#Split on the target variable so train and test keep similar distributions
split <- sample.split(df$PctKids2Par, SplitRatio = 0.8)
train = subset(df,split == TRUE)
test = subset(df,split == FALSE)
head(train)
head(test)
Next, we build a full decision tree and optimize it by pruning with the optimal CP. To find that CP, we first grow the full tree with CP equal to zero.
#install required packages
install.packages('rpart')
install.packages('rpart.plot')
library(rpart)
library(rpart.plot)
#Fit a full regression tree with cp = 0 to find the optimal cp
fulltree <- rpart(formula = PctKids2Par ~ ., data = train, control = rpart.control(cp = 0))
#Extract the cp with the lowest cross-validated error (xerror)
mincp <- fulltree$cptable[which.min(fulltree$cptable[, "xerror"]), "CP"]
plotcp(fulltree)
From figure number 1, the optimal CP is the lowest CP value with the least cross-validated relative error and tree size, which equals 0.0003; we then build the full decision tree and prune it with this CP, as shown in figure number 2.
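An alternative we did not use here is the 1-SE rule, which picks the largest CP whose cross-validated error is within one standard error of the minimum, trading slightly higher error for a smaller tree; a sketch against the fitted fulltree:
cpt <- fulltree$cptable
#threshold: minimum xerror plus its standard error
thresh <- cpt[which.min(cpt[, "xerror"]), "xerror"] + cpt[which.min(cpt[, "xerror"]), "xstd"]
cp.1se <- max(cpt[cpt[, "xerror"] <= thresh, "CP"])   #largest CP still under the threshold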
#Tree Pruning
prunedtree <- prune(fulltree, cp = mincp)
rpart.plot(prunedtree, box.palette="RdBu", digits = -3)
test$pruned <- predict(prunedtree, test, type = "vector")
Finally, we use model evaluation techniques to find R2, RMSE, MAE, and MSE.
#The R2(), RMSE() and MAE() helpers come from the caret package
library(caret)
performance <- data.frame(R2   = R2(test$pruned, test$PctKids2Par),
                          RMSE = RMSE(test$pruned, test$PctKids2Par),
                          MAE  = MAE(test$pruned, test$PctKids2Par),
                          MSE  = mean((test$pruned - test$PctKids2Par)^2))
performance
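For reference, these helpers reduce to simple formulas on the held-out predictions; a hand-computed sketch (obs and pred are just illustrative local names):
obs  <- test$PctKids2Par
pred <- test$pruned
mse  <- mean((pred - obs)^2)    #mean squared error
rmse <- sqrt(mse)               #root mean squared error
mae  <- mean(abs(pred - obs))   #mean absolute error
r2   <- cor(pred, obs)^2        #squared correlation, which caret's R2() returns by default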
Next, we apply the validation techniques in turn: LOOCV, K-fold cross-validation, and repeated K-fold cross-validation.
library(caret)
cp.grid <- expand.grid(.cp = (0:10)*0.001)
# Leave one out cross validation - LOOCV
# Define training control
train.control.LOOCV <- trainControl(method = "LOOCV")
# Train the model
LOOCV <- train(PctKids2Par ~ ., data = df, method = "rpart", trControl = train.control.LOOCV, tuneGrid = cp.grid)
# Summarize the results
print(LOOCV)
CART
1960 samples
6 predictor
No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 1959, 1959, 1959, 1959, 1959, 1959, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.000 0.07357167 0.8712806 0.05598688
0.001 0.07322123 0.8715098 0.05569811
0.002 0.07672058 0.8588502 0.05876454
0.003 0.07863310 0.8517203 0.06065240
0.004 0.07917141 0.8496305 0.06113732
0.005 0.08017425 0.8457918 0.06238850
0.006 0.08132025 0.8413248 0.06350218
0.007 0.08117448 0.8418868 0.06345920
0.008 0.08318203 0.8339625 0.06545350
0.009 0.08535404 0.8251825 0.06720654
0.010 0.08535404 0.8251825 0.06720654
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.001.
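The selected CP and its resampled metrics can also be read programmatically from the caret object instead of the printout; a short sketch:
LOOCV$bestTune                                    #the chosen cp as a one-row data frame
subset(LOOCV$results, cp == LOOCV$bestTune$cp)    #RMSE, Rsquared and MAE at that cp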
Next, we use K-fold cross-validation against the same sequence of CP values to find the optimal one; ten folds are configured for this technique.
# K-fold cross-validation
# Define training control
set.seed(123)
train.control.cv <- trainControl(method = "cv", number = 10)
# Train the model
cv <- train(PctKids2Par ~ ., data = df, method = "rpart", trControl = train.control.cv, tuneGrid = cp.grid)
# Summarize the results
print(cv)
CART
1960 samples
6 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 1764, 1763, 1765, 1764, 1764, 1765, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.000 0.07458764 0.8682880 0.05583290
0.001 0.07488059 0.8667450 0.05738251
0.002 0.07767762 0.8568727 0.05984501
0.003 0.07873371 0.8528414 0.06086148
0.004 0.07937136 0.8505033 0.06153656
0.005 0.08092711 0.8443967 0.06270297
0.006 0.08191125 0.8404299 0.06376332
0.007 0.08260000 0.8376132 0.06459001
0.008 0.08508634 0.8276669 0.06654498
0.009 0.08554216 0.8259448 0.06698376
0.010 0.08587788 0.8245843 0.06725568
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.
Next, we use repeated K-fold cross-validation against the same sequence of CP values to find the optimal one; ten folds with three repeats are configured for this technique.
# Repeated K-fold cross-validation
set.seed(123)
train.control.repeatedcv <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
# Train the model
repeatedcv <- train(PctKids2Par ~ ., data = df, method = "rpart", trControl = train.control.repeatedcv, tuneGrid = cp.grid)
# Summarize the results
print(repeatedcv)
CART
1960 samples
6 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 1764, 1763, 1765, 1764, 1764, 1765, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.000 0.07453701 0.8681612 0.05597149
0.001 0.07459356 0.8668359 0.05718294
0.002 0.07711535 0.8576849 0.05950917
0.003 0.07845355 0.8526869 0.06078661
0.004 0.07961102 0.8482006 0.06191358
0.005 0.08093959 0.8430827 0.06304916
0.006 0.08188199 0.8394141 0.06412822
0.007 0.08237809 0.8373436 0.06457404
0.008 0.08444024 0.8292835 0.06622773
0.009 0.08510176 0.8268375 0.06679961
0.010 0.08522702 0.8263016 0.06687264
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.
Second research question
In this section, we create a model for the following research question: “Which ML model has the best metrics to predict the percentage of households that might depend on public assistance income?” The following highly correlated features are used for the model:
PctYoungKids2Par: percent of kids 4 and under in two parent households
PctKids2Par: percentage of kids in family housing with two parents
pctWInvInc: percentage of households with investment / rent income in 1989
PctLess9thGrade: percentage of people 25 and over with less than a 9th grade education
PctNotHSGrad: percentage of people 25 and over that are not high school graduates
PctPopUnderPov: percentage of people under the poverty level
pctWPubAsst: percentage of households with public assistance income in 1989 (the target attribute for this question)
First, load the dataset and slice the predictor and target attributes related to the research question.
#load the data
data_corr <- read.csv("data_clean.csv",header = T)
#Slice related attributes into df dataframe
df <- data_corr[c("PctYoungKids2Par","PctKids2Par","pctWInvInc","PctLess9thGrade","PctNotHSGrad","PctPopUnderPov","pctWPubAsst")]
head(df)
By repeating all the steps described in the previous section (sketched below), we use model evaluation techniques to find R2, RMSE, MAE, and MSE, and plot the tree as shown in figure number 3.
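A minimal sketch of those repeated steps, mirroring the previous section but with pctWPubAsst as the target:
set.seed(0)
split <- sample.split(df$pctWPubAsst, SplitRatio = 0.8)
train <- subset(df, split == TRUE)
test <- subset(df, split == FALSE)
fulltree <- rpart(formula = pctWPubAsst ~ ., data = train, control = rpart.control(cp = 0))
mincp <- fulltree$cptable[which.min(fulltree$cptable[, "xerror"]), "CP"]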
#Tree Pruning
prunedtree <- prune(fulltree, cp = mincp)
rpart.plot(prunedtree, box.palette="RdBu", digits = -3)
test$pruned <- predict(prunedtree, test, type = "vector")
#Model performance metrics
performance <- data.frame(R2   = R2(test$pruned, test$pctWPubAsst),
                          RMSE = RMSE(test$pruned, test$pctWPubAsst),
                          MAE  = MAE(test$pruned, test$pctWPubAsst),
                          MSE  = mean((test$pruned - test$pctWPubAsst)^2))
performance
Next, we apply the validation techniques in turn.
Leave-One-Out Cross-Validation
The first validation technique is Leave-One-Out Cross-Validation, applied against a sequence of CP values to find the optimal one.
cp.grid <- expand.grid(.cp = (0:10)*0.001)
# Leave one out cross validation - LOOCV
# Define training control
train.control.LOOCV <- trainControl(method = "LOOCV")
# Train the model
LOOCV <- train(pctWPubAsst ~ ., data = df, method = "rpart", trControl = train.control.LOOCV, tuneGrid = cp.grid)
Next, we use repeated K-fold cross-validation against the same sequence of CP values; ten folds with three repeats are configured for this technique.
# Repeated K-fold cross-validation
set.seed(123)
train.control.repeatedcv <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
# Train the model
repeatedcv <- train(pctWPubAsst ~ ., data = df, method = "rpart", trControl = train.control.repeatedcv, tuneGrid = cp.grid)
# Summarize the results
print(repeatedcv)
CART
1960 samples
6 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 1764, 1763, 1765, 1764, 1764, 1765, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.000 0.07731842 0.8581638 0.05967035
0.001 0.07555772 0.8634534 0.05878530
0.002 0.07682919 0.8585964 0.06050429
0.003 0.07767708 0.8556103 0.06136534
0.004 0.07928180 0.8495035 0.06241600
0.005 0.08069363 0.8441580 0.06354659
0.006 0.08161399 0.8406257 0.06442809
0.007 0.08187433 0.8395691 0.06466970
0.008 0.08218811 0.8383946 0.06493020
0.009 0.08336455 0.8337437 0.06625871
0.010 0.08403381 0.8310294 0.06686190
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.001.
Third research question
In this section, we create a model for the following research question: “Which ML model has the best metrics to predict violent crimes per population?” The following highly correlated features are used for the model:
PctKids2Par: percentage of kids in family housing with two parents
PctFam2Par: percentage of families (with kids) that are headed by two parents
PctYoungKids2Par: percent of kids 4 and under in two parent households
TotalPctDiv: percentage of population who are divorced
pctWPubAsst: percentage of households with public assistance income in 1989
PctIlleg: percentage of kids born to never married
ViolentCrimesPerPop: total number of violent crimes per 100K population (the GOAL attribute in the original dataset)
First, load the dataset and slice the predictor and target attributes related to the research question.
#load the data
data_corr <- read.csv("data_clean.csv",header = T)
#Slice related attributes into df dataframe
df <- data_corr[c("PctKids2Par","PctFam2Par","PctYoungKids2Par","TotalPctDiv","pctWPubAsst","PctIlleg","ViolentCrimesPerPop")]
head(df)
By repeating all the steps described in the previous sections (sketched below), we use model evaluation techniques to find R2, RMSE, MAE, and MSE, and plot the tree as shown in figure number 4.
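A minimal sketch of those repeated steps, mirroring the previous sections but with ViolentCrimesPerPop as the target:
set.seed(0)
split <- sample.split(df$ViolentCrimesPerPop, SplitRatio = 0.8)
train <- subset(df, split == TRUE)
test <- subset(df, split == FALSE)
fulltree <- rpart(formula = ViolentCrimesPerPop ~ ., data = train, control = rpart.control(cp = 0))
mincp <- fulltree$cptable[which.min(fulltree$cptable[, "xerror"]), "CP"]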
#Tree Pruning
prunedtree <- prune(fulltree, cp = mincp)
rpart.plot(prunedtree, box.palette="RdBu", digits = -3)
test$pruned <- predict(prunedtree, test, type = "vector")
#Model performance metrics
performance <- data.frame(R2   = R2(test$pruned, test$ViolentCrimesPerPop),
                          RMSE = RMSE(test$pruned, test$ViolentCrimesPerPop),
                          MAE  = MAE(test$pruned, test$ViolentCrimesPerPop),
                          MSE  = mean((test$pruned - test$ViolentCrimesPerPop)^2))
performance
Next, we apply the validation techniques in turn: LOOCV, K-fold cross-validation, and repeated K-fold cross-validation.
Leave-One-Out Cross-Validation
The first validation technique is Leave-One-Out Cross-Validation, applied against a sequence of CP values to find the optimal one.
cp.grid <- expand.grid(.cp = (0:10)*0.001)
# Leave one out cross validation - LOOCV
# Define training control
train.control.LOOCV <- trainControl(method = "LOOCV")
# Train the model
LOOCV <- train(ViolentCrimesPerPop ~ ., data = df, method = "rpart", trControl = train.control.LOOCV, tuneGrid = cp.grid)
# Summarize the results
print(LOOCV)
CART
1960 samples
6 predictor
No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 1959, 1959, 1959, 1959, 1959, 1959, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.000 0.03418075 0.9720116 0.02637377
0.001 0.03681265 0.9674737 0.02872839
0.002 0.04196478 0.9577414 0.03340132
0.003 0.04480233 0.9518414 0.03588543
0.004 0.04707172 0.9468375 0.03737454
0.005 0.04911222 0.9421246 0.03915869
0.006 0.04920138 0.9419135 0.03924957
0.007 0.04920138 0.9419135 0.03924957
0.008 0.04920138 0.9419135 0.03924957
0.009 0.04920138 0.9419135 0.03924957
0.010 0.04920138 0.9419135 0.03924957
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.
Next, we use K-fold cross-validation against the same sequence of CP values to find the optimal one; ten folds are configured for this technique.
# K-fold cross-validation
# Define training control
set.seed(123)
train.control.cv <- trainControl(method = "cv", number = 10)
# Train the model
cv <- train(ViolentCrimesPerPop ~ ., data = df, method = "rpart", trControl = train.control.cv, tuneGrid = cp.grid)
# Summarize the results
print(cv)
CART
1960 samples
6 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 1764, 1763, 1765, 1764, 1764, 1765, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.000 0.03384649 0.9725788 0.02632753
0.001 0.03751814 0.9660019 0.02944487
0.002 0.04191489 0.9576327 0.03341077
0.003 0.04444036 0.9522942 0.03565194
0.004 0.04670451 0.9475621 0.03722992
0.005 0.04891898 0.9426378 0.03920383
0.006 0.04897554 0.9425108 0.03920224
0.007 0.04897554 0.9425108 0.03920224
0.008 0.04897554 0.9425108 0.03920224
0.009 0.04897554 0.9425108 0.03920224
0.010 0.04897554 0.9425108 0.03920224
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.
Next, we use repeated K-fold cross-validation against the same sequence of CP values to find the optimal one; ten folds with three repeats are configured for this technique.
# Repeated K-fold cross-validation
set.seed(123)
train.control.repeatedcv <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
# Train the model
repeatedcv <- train(ViolentCrimesPerPop ~ ., data = df, method = "rpart", trControl = train.control.repeatedcv, tuneGrid = cp.grid)
# Summarize the results
print(repeatedcv)
CART
1960 samples
6 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 1764, 1763, 1765, 1764, 1764, 1765, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.000 0.03372565 0.9727262 0.02628385
0.001 0.03743549 0.9662402 0.02943846
0.002 0.04167924 0.9582385 0.03324330
0.003 0.04434660 0.9526983 0.03545597
0.004 0.04679384 0.9475879 0.03727969
0.005 0.04855650 0.9435567 0.03881583
0.006 0.04877678 0.9430230 0.03899143
0.007 0.04877678 0.9430230 0.03899143
0.008 0.04877678 0.9430230 0.03899143
0.009 0.04877678 0.9430230 0.03899143
0.010 0.04877678 0.9430230 0.03899143
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.
From the previous sections, we collect our findings into two tables, a validation table and an evaluation table, as shown in tables 1 and 2; a sketch for assembling them from the caret objects follows the list below. We can conclude the following:
Question 1: Leave-One-Out Cross-Validation has the highest R2 and the lowest RMSE and MAE, so it is the most suitable for the first research question
Question 2: Leave-One-Out Cross-Validation has the highest R2 and the lowest RMSE and MAE, so it is the most suitable for the second research question
Question 3: Repeated K-fold cross-validation has the highest R2 and the lowest RMSE and MAE, so it is the most suitable for the third research question
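Tables like these can be assembled directly from the fitted caret objects; a sketch, assuming the three fits for a given question (LOOCV, cv, repeatedcv) are still in scope:
best_row <- function(fit) fit$results[which.min(fit$results$RMSE), c("cp", "RMSE", "Rsquared", "MAE")]
rbind(LOOCV = best_row(LOOCV), CV = best_row(cv), RepeatedCV = best_row(repeatedcv))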