First, let's load the data into R:
** Load the data: **
com <- read.csv("communities.csv")
** Clean and transform the data: **
There are many missing values, particularly in the county and community columns. Below is a table with the frequency of missing values per column:
# Count the number of missing values in each column
freq = c()
for(i in names(com))
{
  freq <- c(freq, sum(is.na(com[i])))
}
table(freq)
freq
   0    1 1174 1177 1675
 103    1    1    1   22
As can be seen in the table, 103 of the variables have no missing values, and only one variable has a single missing value. One column has 1174 missing values, another has 1177, and the remaining 22 variables each have 1675 missing values.
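For reference, the same per-column counts can be computed more idiomatically with sapply (an equivalent sketch of the loop above):

# One-liner: count NAs in each column of com
freq <- sapply(com, function(x) sum(is.na(x)))
table(freq)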
Let's discard the columns with many missing values:
# Drop every column with more than half of its values missing (> 997 NAs)
for(i in names(com))
{
  if (sum(is.na(com[i])) > 997)
  {
    com[i] = NULL
  }
}
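An equivalent vectorized version of the loop above would be (a sketch; it keeps only the columns with at most 997 missing values):

# Keep columns whose NA count does not exceed the threshold
com <- com[, colSums(is.na(com)) <= 997]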
# Recompute the missing-value counts after dropping columns
freq = c()
for(i in names(com))
{
  freq <- c(freq, sum(is.na(com[i])))
}
table(freq)
freq
  0   1
103   1
The columns that had missing values were usually missing most of their values, so I discarded every column with more than half of its values missing, leaving the reduced data set in com. A new frequency table shows that all but one column now have no missing data, and the one remaining column is missing only a single value. Let's remove the row with that missing value:
com = com[-c(131), ]  # row 131 contains the single remaining missing value
# Verify that no missing values remain
freq = c()
for(i in names(com))
{
  freq <- c(freq, sum(is.na(com[i])))
}
table(freq)
freq
  0
104
Now there are no missing values.
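Rather than hardcoding row index 131, the same cleanup could be done generically with complete.cases (a sketch that drops any row still containing an NA):

com <- com[complete.cases(com), ]   # keep only rows with no missing values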
** Using the fold variable, set the first fold to be the test set, and all other folds to be the training set. **
com$state <- factor(com$state)  # treat state as a categorical variable
train <- com[com$fold != 1,]    # every fold except the first: training set
test <- com[com$fold == 1,]     # the first fold: test set
train$fold = NULL               # drop fold so it is not used as a predictor
test$fold = NULL
We separated the training and testing data using the fold variable, then removed the fold column so it will not be used to train the algorithms. We also converted state into a categorical variable (factor).
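A quick sanity check on the split (a sketch; the counts shown will depend on the data):

table(com$fold)                             # number of rows in each fold
c(train = nrow(train), test = nrow(test))   # sizes of the two sets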
We will run linear regression on the data, but first need to remove the categorical variables, state and communityname.
lr_train = train
lr_test = test
lr_train$state = NULL
lr_train$communityname= NULL
lr_test$state = NULL
lr_test$communityname= NULL
linear_regression = lm(ViolentCrimesPerPop ~ ., data=lr_train)
linear_regression_pred = predict(linear_regression, newdata = lr_test)
** best test error **
# Mean squared error between actuals y and predictions yhat
MSE <- function(y, yhat)
{
  mean((y-yhat)^2)
}
lr_mse = MSE(linear_regression_pred, lr_test$ViolentCrimesPerPop)
lr_mse
[1] 0.0212055
The mean squared error for linear regression is 0.0212055.
** summarize best model **
We used all predictors except the two categorical ones, state and communityname; communityname is essentially an identifier and carries no generalizable signal, and we dropped state as well to keep the design matrix purely numeric.
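To see which predictors the fitted linear model leans on, the coefficient table can be inspected and sorted by p-value (a sketch using the standard lm summary):

coefs <- summary(linear_regression)$coefficients   # Estimate, Std. Error, t value, Pr(>|t|)
head(coefs[order(coefs[, 4]), ], 10)               # ten smallest p-values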
First, we want to use cross-validation to find the best size of the tree.
library(rpart)
set.seed(222)
# A very low cp grows a very deep tree, which we can prune back later
reg_trees <- rpart(ViolentCrimesPerPop ~ ., data=train, cp = 0.000001)
plot(reg_trees)
** best test error **
The resulting tree is very deep. Without pruning, the test error would be:
reg_trees_pred <- predict(reg_trees, test)
mse_tree = MSE(reg_trees_pred, test$ViolentCrimesPerPop)
mse_tree
[1] 0.03987249
The test error is larger than the linear regression error. Let's prune back to an appropriate size and see if we get better results. We take the cp value at which the cross-validation error was minimal and use it to prune the tree:
w <- which.min(reg_trees$cptable[,4])  # column 4 of cptable is the cross-validated error (xerror)
reg_trees_pruned <- prune(reg_trees, cp=reg_trees$cptable[w,1])  # column 1 is the matching cp
reg_trees_pred <- predict(reg_trees_pruned, test)
mse_tree = MSE(reg_trees_pred, test$ViolentCrimesPerPop)
mse_tree
[1] 0.04017651
Pruning did not seem to help the model; it actually increased the test error a bit.
** summarize best model and steps taken in model selection **
We first created a very large tree with a very low cp value, then pruned back using cross-validation. The pruned model was no better than the unpruned model on the test data. Looking at the results, single regression trees do not seem to be the best option for this data set.
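For reference, rpart provides helpers to inspect the cross-validation results behind the pruning choice (a sketch):

plotcp(reg_trees)    # cross-validated relative error against cp
printcp(reg_trees)   # the same cptable information as text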
We will run random forests on the data. Because randomForest cannot handle categorical predictors with more than 53 categories, we will remove the communityname column; we also remove the state column, as before.
library(randomForest)
train2 = train
test2 = test
train2$communityname = NULL
test2$communityname = NULL
train2$state = NULL
test2$state = NULL
set.seed(222)
# mtry = all predictors, i.e. every variable is a split candidate (bagging)
rf <- randomForest(ViolentCrimesPerPop ~ ., data=train2, mtry=ncol(train2)-1, importance=TRUE)
** best test error **
rf
Call:
randomForest(formula = ViolentCrimesPerPop ~ ., data = train2, mtry = ncol(train2) - 1, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 100
Mean of squared residuals: 0.01877328
% Var explained: 65.38
rf_pred = predict(rf, newdata = test2)
rf_mse = MSE(rf_pred, test2$ViolentCrimesPerPop)
rf_mse
[1] 0.02089148
The out-of-bag error estimate was 0.01877328, and the test error is 0.02089148. Random forests got better results than linear regression. Let's try varying the number of candidate variables at each split:
library(randomForest)
set.seed(222)
rf10 <- randomForest(ViolentCrimesPerPop ~ ., data=train2, mtry=10, importance=TRUE)
rf20 <- randomForest(ViolentCrimesPerPop ~ ., data=train2, mtry=20, importance=TRUE)
rf50 <- randomForest(ViolentCrimesPerPop ~ ., data=train2, mtry=50, importance=TRUE)
rf100 <- randomForest(ViolentCrimesPerPop ~ ., data=train2, mtry=100, importance=TRUE)
rf150 <- randomForest(ViolentCrimesPerPop ~ ., data=train2, mtry=150, importance=TRUE)
invalid mtry: reset to within valid range
rf_pred10 = predict(rf10, newdata = test2)
rf_pred20 = predict(rf20, newdata = test2)
rf_pred50 = predict(rf50, newdata = test2)
rf_pred100 = predict(rf100, newdata = test2)
rf_pred150 = predict(rf150, newdata = test2)
rf_mse10 = MSE(rf_pred10, test2$ViolentCrimesPerPop)
rf_mse20 = MSE(rf_pred20, test2$ViolentCrimesPerPop)
rf_mse50 = MSE(rf_pred50, test2$ViolentCrimesPerPop)
rf_mse100 = MSE(rf_pred100, test2$ViolentCrimesPerPop)
rf_mse150 = MSE(rf_pred150, test2$ViolentCrimesPerPop)
rf_mse10
[1] 0.02102909
rf_mse20
[1] 0.02080692
rf_mse50
[1] 0.02090452
rf_mse100
[1] 0.02086692
rf_mse150
[1] 0.02102006
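The sweep above can also be written more compactly as a loop (a sketch; the exact MSEs will differ slightly from the individual runs because the random-number stream is consumed in a different order):

set.seed(222)
mtry_vals <- c(10, 20, 50, 100)
sweep_mse <- sapply(mtry_vals, function(m) {
  fit <- randomForest(ViolentCrimesPerPop ~ ., data = train2, mtry = m)
  MSE(predict(fit, newdata = test2), test2$ViolentCrimesPerPop)
})
names(sweep_mse) <- mtry_vals
sweep_mse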
** summarize best model and Steps taken in model selection **
I tried several values for the number of candidate variables per split (mtry); note that mtry = 150 exceeds the 100 available predictors, hence the warning about resetting it. The best value is around 20, where the test error is 0.02080692, the lowest MSE so far. Let's take a look at the most important variables:
varImpPlot(rf20,n.var = 5)
The variables that contribute most to reducing MSE (%IncMSE) are racePctWhite and PctIlleg. The variables that produce the largest gains in node purity are PctIlleg and PctKids2Par.
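The numbers behind the plot can be pulled out directly (a sketch; importance() returns a matrix here because the forest was fit with importance=TRUE):

imp <- importance(rf20)                                     # "%IncMSE" and "IncNodePurity" columns
head(imp[order(imp[, "%IncMSE"], decreasing = TRUE), ], 5)  # top five by permutation importance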
install.packages("gbm")
library(gbm)
So far, random forests had the best results. Let's try boosting to see if we can do better. Because gbm cannot handle factors with more than 1024 levels, we will remove the communityname column.
train$communityname = NULL
test$communityname =NULL
b1 <- gbm(ViolentCrimesPerPop ~., data=train, distribution="gaussian")
head(summary(b1))
From the plot, it looks like the important variables in gradient boosting are similar to the ones in random forests.
boost_pred = predict(b1,newdata = test, n.trees = 100, type = "response")
boost_mse = MSE(boost_pred, test$ViolentCrimesPerPop)
boost_mse
[1] 0.05121395
** Steps taken in model selection **
The MSE is not good; the other models got better results. Let's do some tuning to see if we can improve it:
b2 <- gbm(ViolentCrimesPerPop ~., data=train, distribution="gaussian", shrinkage=0.001)
b3 <- gbm(ViolentCrimesPerPop ~., data=train, distribution="gaussian", shrinkage=0.01)
b4 <- gbm(ViolentCrimesPerPop ~., data=train, distribution="gaussian", shrinkage=0.05)
b5 <- gbm(ViolentCrimesPerPop ~., data=train, distribution="gaussian", shrinkage=0.1)
b6 <- gbm(ViolentCrimesPerPop ~., data=train, distribution="gaussian", shrinkage=1)
boost_pred2 = predict(b2,newdata = test, n.trees = 100, type = "response")
boost_pred3 = predict(b3,newdata = test, n.trees = 100, type = "response")
boost_pred4 = predict(b4,newdata = test, n.trees = 100, type = "response")
boost_pred5 = predict(b5,newdata = test, n.trees = 100, type = "response")
boost_pred6 = predict(b6,newdata = test, n.trees = 100, type = "response")
boost_mse2 = MSE(boost_pred2, test$ViolentCrimesPerPop)
boost_mse3 = MSE(boost_pred3, test$ViolentCrimesPerPop)
boost_mse4 = MSE(boost_pred4, test$ViolentCrimesPerPop)
boost_mse5 = MSE(boost_pred5, test$ViolentCrimesPerPop)
boost_mse6 = MSE(boost_pred6, test$ViolentCrimesPerPop)
boost_mse2
[1] 0.05122542
boost_mse3
[1] 0.03478026
boost_mse4
[1] 0.02026958
boost_mse5
[1] 0.01942117
boost_mse6
[1] 0.02444969
The best shrinkage was 0.1, where the error was 0.01942117. This is already better than the previous approaches. Let's try a few more values near 0.1:
b2 <- gbm(ViolentCrimesPerPop ~., data=train, distribution="gaussian", shrinkage=0.07)
b3 <- gbm(ViolentCrimesPerPop ~., data=train, distribution="gaussian", shrinkage=0.1)
b4 <- gbm(ViolentCrimesPerPop ~., data=train, distribution="gaussian", shrinkage=0.5)
boost_pred2 = predict(b2,newdata = test, n.trees = 100, type = "response")
boost_pred3 = predict(b3,newdata = test, n.trees = 100, type = "response")
boost_pred4 = predict(b4,newdata = test, n.trees = 100, type = "response")
boost_mse2 = MSE(boost_pred2, test$ViolentCrimesPerPop)
boost_mse3 = MSE(boost_pred3, test$ViolentCrimesPerPop)
boost_mse4 = MSE(boost_pred4, test$ViolentCrimesPerPop)
boost_mse2
[1] 0.02029421
boost_mse3
[1] 0.01919307
boost_mse4
[1] 0.01918006
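As a cross-check on this manual tuning, gbm can also pick the number of boosting iterations by internal cross-validation; the sketch below uses illustrative values (shrinkage = 0.1, n.trees = 1000, cv.folds = 5) rather than anything from the runs above:

set.seed(222)
b_cv <- gbm(ViolentCrimesPerPop ~ ., data = train, distribution = "gaussian",
            shrinkage = 0.1, n.trees = 1000, cv.folds = 5)
best_iter <- gbm.perf(b_cv, method = "cv")             # CV-optimal number of trees
cv_pred <- predict(b_cv, newdata = test, n.trees = best_iter)
MSE(cv_pred, test$ViolentCrimesPerPop)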
The best shrinkage is 0.5. Now let's vary the interaction depth:
b2 <- gbm(ViolentCrimesPerPop ~., data=train, distribution="gaussian", shrinkage=0.5,interaction.depth = 1)
b3 <- gbm(ViolentCrimesPerPop ~., data=train, distribution="gaussian", shrinkage=0.5,interaction.depth = 2)
b4 <- gbm(ViolentCrimesPerPop ~., data=train, distribution="gaussian", shrinkage=0.5,interaction.depth = 5)
b5 <- gbm(ViolentCrimesPerPop ~., data=train, distribution="gaussian", shrinkage=0.5,interaction.depth = 10)
boost_pred2 = predict(b2,newdata = test, n.trees = 100, type = "response")
boost_pred3 = predict(b3,newdata = test, n.trees = 100, type = "response")
boost_pred4 = predict(b4,newdata = test, n.trees = 100, type = "response")
boost_pred5 = predict(b5, newdata = test, n.trees = 100, type = "response")  # the original run predicted from b4 here by mistake, which is why mse5 below equals mse4
boost_mse2 = MSE(boost_pred2, test$ViolentCrimesPerPop)
boost_mse3 = MSE(boost_pred3, test$ViolentCrimesPerPop)
boost_mse4 = MSE(boost_pred4, test$ViolentCrimesPerPop)
boost_mse5 = MSE(boost_pred5, test$ViolentCrimesPerPop)
boost_mse2
[1] 0.02194068
boost_mse3
[1] 0.02264764
boost_mse4
[1] 0.02641267
boost_mse5
[1] 0.02641267
** best test error and summarize the best model **
The best MSE we found was for the default interaction depth and shrinkage 0.5. The test error is 0.01918006.
head(summary(b2))
** Summarize the performance of the approaches taken in Part 2. **
The best MSE for each of the algorithms: 1) Linear regression: 0.0212055, 2) Regression trees: 0.03987249, 3) Random forests: 0.02080692, 4) Gradient boosting: 0.01918006.
The best algorithm found was gradient boosting.
** Were your more sophisticated methods able to outperform linear regression? **
Yes. Gradient boosting did better than linear regression, and so did random forests, though both only by a modest margin. Regression trees, on the other hand, performed worse than linear regression.