First let's load the data into R:
com <- read.csv("communities.csv")
(1) How many distinct states are represented in this dataset? How many distinct counties are represented?
length(unique(com$state))
[1] 46
length(unique(com$county))
[1] 109
There are 46 distinct states and 109 distinct counties (108 if NA is not counted as a county). But there are many missing values in the county column, so the true number of counties may well be higher.
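If NA should not be counted, the NA entries can be dropped before counting; a quick check along these lines (not part of the original output):
# count distinct counties, excluding NA
length(unique(na.omit(com$county)))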
(2) Produce a frequency table with the number of missing values per column.
# count the number of missing values in each column
freq <- c()
for(i in names(com))
{
  j <- com[i]
  v <- sum(is.na(j))
  freq <- c(freq, v)
}
table(freq)
freq
0 1 1174 1177 1675
103 1 1 1 22
As can be seen in the table, 103 of the variables have no missing values, and one variable has a single missing value. One column has 1174 missing values, another has 1177, and 22 variables have 1675 missing values each.
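For reference, the same table can be produced without an explicit loop (an equivalent one-liner, not in the original):
# is.na() on a data frame gives a logical matrix; colSums counts NAs per column
table(colSums(is.na(com)))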
(3) Discard all columns with more than 50% of the values missing. Reproduce the frequency table from 2 using the reduced dataset.
# keep only the columns with fewer than 997 missing values; 997 is half of
# the 1994 rows, so this discards every column that is more than 50% missing
# (note: building the result with c() yields a list of columns, not a data frame)
com_reduced <- c()
for(i in names(com))
{
  j <- com[i]
  v <- sum(is.na(j))
  if (v < 997)
  {
    com_reduced <- c(com_reduced, com[i])
  }
}
# recount the missing values per column in the reduced dataset, using
# is.na() as in part (2); com_reduced is a list, so [[ ]] extracts each column
freq_red <- c()
for(i in names(com_reduced))
{
  j <- com_reduced[[i]]
  v <- sum(is.na(j))
  freq_red <- c(freq_red, v)
}
table(freq_red)
freq_red
0 1
103 1
The columns that had missing values usually had most of their values missing. I discarded every column with more than half of its values missing; the new reduced dataset is called com_reduced. The new frequency table shows that all but one column are complete, and the one column with missing data is missing only a single value.
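As an aside, the same filtering can be written in one line; unlike the loop above, this keeps the result as a data frame (com_reduced_alt is a hypothetical name):
# keep columns whose fraction of missing values is at most 50%
com_reduced_alt <- com[, colMeans(is.na(com)) <= 0.5]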
hist(com_reduced$ViolentCrimesPerPop, main = "Histogram of Violent Crimes Per Population", xlab = "Violent Crimes per 100K")
ViolentCrimesPerPop is the number of violent crimes committed per 100K people, normalized to the 0-1 range like the other variables in this dataset, which is why its mean is 0.24 rather than a raw count. Most communities have a low level of violent crime, and the frequency decreases as the level of violent crime increases.
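The reported mean can be checked directly (a quick sanity check, not shown in the original output):
# mean of the response variable
mean(com_reduced$ViolentCrimesPerPop)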
First let's remove the identifying variables. The variables we will remove:
communityname
community (already removed because of too many missing values)
fold
state
county (already removed because of too many missing values)
These variables are removed because they identify a community rather than describe it, and they have no meaningful ordering.
com_relevant <- within(com_reduced, rm(state))
com_relevant <- within(com_relevant, rm(communityname))
com_relevant <- within(com_relevant, rm(fold))
(1) Run forward stepwise selection on your dataset.
Because stepwise selection fails with an error when a predictor contains missing values, I first remove the column that causes it. Afterwards we run forward stepwise selection.
com_relevant_2 <- within(com_relevant, rm(OtherPerCap))
m.full <- lm(ViolentCrimesPerPop ~ ., data=com_relevant_2)   # largest candidate model
m.empty <- lm(ViolentCrimesPerPop ~ 1, data=com_relevant_2)  # intercept-only model
m.forward <- step(m.empty, scope=list(upper=m.full), direction="forward", trace=FALSE)
Explain why you would have a warning message if you use any predictors with missing values in the stepwise selection process.
lm() silently drops every row that has a missing value in any of the variables it uses. A candidate model that includes a predictor with NAs is therefore fit on fewer rows than the current model, and AIC values computed on different numbers of observations cannot be compared, so step() stops with the error "number of rows in use has changed: remove missing values?".
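The offending column can be identified before removing it; a sketch (com_relevant is a plain list here, so sapply is used rather than colSums):
# name of every remaining column that still contains an NA
names(which(sapply(com_relevant, function(x) sum(is.na(x))) > 0))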
(2) Also run backward and bi-directional stepwise selection on your dataset.
# backward: start from the full model and drop predictors,
# reusing m.full and m.empty from above
m.backward <- step(m.full, scope=list(lower=m.empty), direction="backward", trace=FALSE)
# bi-directional: start from the empty model, allowing both additions and removals
m.both <- step(m.empty, scope=list(upper=m.full), direction="both", trace=FALSE)
For each of the 3 models obtained via forward, backward, and bi-directional stepwise selection, report the model sizes (i.e. number of predictors) and the R squared values achieved.
sf <- summary(m.forward)
sba <- summary(m.backward)
sbo <- summary(m.both)
print ("forward:")
[1] "forward:"
sf$r.squared
[1] 0.6832967
AIC(m.forward)
[1] -2366.599
print("backward:")
[1] "backward:"
sba$r.squared
[1] 0.6908641
AIC(m.backward)
[1] -2382.823
print("both:")
[1] "both:"
sbo$r.squared
[1] 0.6832967
AIC(m.both)
[1] -2366.599
Both forward and bi-directional stepwise selection found the same 37 predictors, with an R-squared of 0.683 and an AIC of -2366.599. Backward selection kept 53 predictors, with an R-squared of 0.691 and an AIC of -2382.823.
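The model sizes can be read off the fitted objects, for example:
# number of estimated coefficients, excluding the intercept
length(coef(m.forward)) - 1
length(coef(m.backward)) - 1
length(coef(m.both)) - 1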
The model I consider best is the one produced by forward stepwise selection. First we extract its formula, then refit the regression on the same data.
f <- formula(m.forward)
f
ViolentCrimesPerPop ~ PctKids2Par + racePctWhite + HousVacant +
pctUrban + PctWorkMom + NumStreet + MalePctDivorce + PctIlleg +
numbUrban + PctPersDenseHous + racepctblack + agePct12t29 +
MedOwnCostPctIncNoMtg + PctPopUnderPov + pctWRetire + MedRentPctHousInc +
RentLowQ + MedRent + pctWWage + whitePerCap + MalePctNevMarr +
PctEmploy + PctEmplManu + TotalPctDiv + perCapInc + pctWInvInc +
LemasPctOfficDrugUn + MedOwnCostPctInc + PctVacMore6Mos +
PctVacantBoarded + HispPerCap + pctWFarmSelf + indianPerCap +
PctLess9thGrade + PctLargHouseFam + agePct12t21 + AsianPerCap
reg <- lm(f, data=com_relevant_2)
reg
Call:
lm(formula = f, data = com_relevant_2)
Coefficients:
(Intercept) PctKids2Par racePctWhite HousVacant
0.75437 -0.34470 -0.07903 0.26889
pctUrban PctWorkMom NumStreet MalePctDivorce
0.04002 -0.13716 0.18639 0.34656
PctIlleg numbUrban PctPersDenseHous racepctblack
0.14703 -0.21317 0.25001 0.17566
agePct12t29 MedOwnCostPctIncNoMtg PctPopUnderPov pctWRetire
-0.28410 -0.08087 -0.15525 -0.10541
MedRentPctHousInc RentLowQ MedRent pctWWage
0.05484 -0.24271 0.26546 -0.22630
whitePerCap MalePctNevMarr PctEmploy PctEmplManu
-0.32198 0.15148 0.19944 -0.03149
TotalPctDiv perCapInc pctWInvInc LemasPctOfficDrugUn
-0.30508 0.21736 -0.16796 0.02951
MedOwnCostPctInc PctVacMore6Mos PctVacantBoarded HispPerCap
-0.04733 -0.05978 0.04487 0.04240
pctWFarmSelf indianPerCap PctLess9thGrade PctLargHouseFam
0.03912 -0.03551 -0.05776 -0.08847
agePct12t21 AsianPerCap
0.09153 0.02642
(1) Which model did you end up choosing? Justify your answer, show the regression output, and in a sentence or two, summarize what you learn from two or three of the significant coefficients.
I chose the model from forward stepwise selection. Forward and bi-directional selection gave identical results. Backward selection achieved a slightly higher R-squared, but it fits the data only about 1% better, a gain most likely due to overfitting. Another disadvantage of backward selection is that it kept a substantially larger number of predictors, 16 more than forward selection. Therefore I thought it better to choose the simpler model.
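One way to sanity-check the overfitting concern (an extra comparison, not part of the original output) is adjusted R-squared, which penalizes the additional predictors:
# adjusted R-squared for the 37- and 53-predictor models
sf$adj.r.squared
sba$adj.r.squared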
Above we can see the selected predictors and the coefficient of each. I will say a bit more about the three predictors with the largest absolute coefficients.
MalePctDivorce (male divorce rate) - Regions with high male divorce rates are associated with higher violent crime rates.
PctKids2Par (percentage of kids living in a home with two parents) - Regions with many stable two-parent families are associated with less violent crime.
whitePerCap (per capita income for Caucasians) - Regions where Caucasians have higher incomes tend to have less violent crime.
(2) Run 10-fold cross-validation to estimate the test MSE. Use the fold column in the dataset for determining the folds.
fold <- com$fold   # pre-assigned fold label (1-10) for each row
mse <- 0
for (i in 1:10)
{
  train <- com[fold != i,]   # fit on the other nine folds
  valid <- com[fold == i,]   # hold out fold i for validation
  # com can be used directly here because the formula f only involves complete columns
  m <- lm(f, data=train)
  p <- predict(m, newdata=valid)
  mse <- mse + mean((p - valid$ViolentCrimesPerPop)^2)
}
print(mse / 10)   # average the ten per-fold MSEs
[1] 0.01793732
Report your estimated test MSE.
The estimated test MSE for the model is 0.0179.
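To put this number in context, it can be compared with the variance of the response, which is approximately the MSE that predicting the mean for every community would achieve (an aside, not in the original analysis):
# baseline MSE of a constant prediction at the mean
var(com$ViolentCrimesPerPop)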
(3) Use the bootstrap to estimate a 90% confidence interval for multiple R2.
set.seed(222)
N <- 1000
r2 <- rep(NA, N)
# com_relevant_2 is a plain list of columns, so convert it to a data
# frame to allow row indexing in the bootstrap loop
n <- length(com_relevant_2[[1]])
com_r <- structure(com_relevant_2, row.names = c(NA, -n), class = "data.frame")
for (i in 1:N) {
  # resample the rows with replacement and refit the chosen model
  s <- sample(1:nrow(com_r), nrow(com_r), replace=TRUE)
  m1.s <- lm(f, data=com_r[s,])
  r2[i] <- summary(m1.s)$r.squared
}
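The quantile extraction that produces the reported endpoints is not shown above; it would be:
# endpoints of the bootstrap interval; probs = c(0.025, 0.975) matches the
# 2.5%/97.5% values reported below (a 95% interval); the 90% interval the
# question asks for would use probs = c(0.05, 0.95)
quantile(r2, c(0.025, 0.975))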
1,000 bootstrap samples were drawn, and the model was refit on each to obtain a bootstrap distribution of multiple R-squared. The resulting interval runs from 0.6597118 (2.5% quantile) to 0.7188363 (97.5% quantile).