Question 8.1 - Describe a situation or problem from your
job, everyday life, current events, etc., for which a linear regression
model would be appropriate. List some (up to 5) predictors that you
might use.
Once again, I will use the example of measuring the health of a biome
or habitat. To determine the health of a biome, we might want to compare
previous levels of biodiversity to current levels. As such, we might
use a linear regression model to predict biodiversity levels (for
example, the number of species observed in a survey). Some predictors
we might use are as follows:
- Average annual temperature
- Annual rainfall
- Total habitat area
- A pollution index (e.g., air or water quality)
- Human population density in the surrounding area
All of these predictors might be used to determine a healthy amount
of biodiversity in a given environment.
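As a rough illustration only, such a model could be fit in R along the following lines; the data frame biome_data and every column name here are hypothetical:
# Minimal sketch: predicting a biodiversity measure (species_count)
# from the hypothetical predictors listed above
biome_model <- lm(species_count ~ avg_temperature + annual_rainfall +
                    habitat_area + pollution_index + human_density,
                  data = biome_data)
summary(biome_model)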
Question 8.2 - Using crime data from http://www.statsci.org/data/general/uscrime.txt (file
uscrime.txt, description at http://www.statsci.org/data/general/uscrime.html ), use
regression (a useful R function is lm or glm) to predict the observed
crime rate in a city with the following data:
M = 14.0
So = 0
Ed = 10.0
Po1 = 12.0
Po2 = 15.5
LF = 0.640
M.F = 94.0
Pop = 150
NW = 1.1
U1 = 0.120
U2 = 3.6
Wealth = 3200
Ineq = 20.1
Prob = 0.04
Time = 39.0
Show your model (factors used and their coefficients),
the software output, and the quality of fit.
Note that because there are only 47 data points and 15
predictors, you’ll probably notice some overfitting. We’ll see ways of
dealing with this sort of problem later in the
course.
Methodology - To begin, we read in the data and split it into training, validation, and testing sets. There are only 47 data points, so we expect a certain degree of overfitting, but the split still lets us compare models. We then fit three different linear models on the training set: one with the standard predictors, one with a log-transformed response, and one with polynomial (squared) terms. Next, we use the predict function to score each training-set model against the validation set and keep the model with the smallest mean absolute residual, which is the accuracy metric used throughout. After selecting a model on the validation data, we evaluate it once on the testing data to make sure the results are not wildly different due to overfitting. Finally, we use the selected model to predict the crime rate for the city described in the problem.
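For concreteness, the comparison metric just described could be written as a small helper function. This is only a sketch of the idea (the runs below compute the quantity inline), and the name mae is our own:
# Sketch: mean absolute residual of a fitted model on held-out data
mae <- function(model, newdata, response = "Crime") {
  mean(abs(newdata[[response]] - predict(model, newdata)))
}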
# Read in the US crime data (whitespace-delimited, with a header row)
crime_data <- read.table("C:/Users/james/OneDrive/Desktop/Georgia Tech/ISYE 6501/Homeworks/Homework 5/Homework5_ISYE6501/Homework5_ISYE6501/data 8.2/uscrime.txt", header = TRUE)
head(crime_data)
Select rows for the training, validation, and testing data sets.
set.seed(521)
# Roughly 70% of the 47 rows go to training
round(nrow(crime_data)*.7)
## [1] 33
crime_training_rows = sample(nrow(crime_data), round(nrow(crime_data)*.7))
crime_training = crime_data[crime_training_rows,]
# The remaining rows are split evenly between validation and testing
crime_valandtest = crime_data[-crime_training_rows,]
crime_validation_rows = sample(nrow(crime_valandtest), round(nrow(crime_valandtest)/2))
crime_validation = crime_valandtest[crime_validation_rows,]
crime_testing = crime_valandtest[-crime_validation_rows,]
Create our first standard linear model.
mod_1 = lm(Crime~., data = crime_training)
mod_1$coefficients
## (Intercept) M So Ed Po1
## -5.210999e+03 1.128406e+02 1.721150e+02 2.820159e+02 2.135616e+02
## Po2 LF M.F Pop NW
## -1.188034e+02 -6.549855e+01 2.046560e+00 -1.609070e+00 1.110518e+00
## U1 U2 Wealth Ineq Prob
## -3.994903e+03 2.127638e+02 -4.951951e-02 4.578133e+01 -5.538871e+03
## Time
## -7.910424e+00
summary(mod_1)
##
## Call:
## lm(formula = Crime ~ ., data = crime_training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -167.91 -79.62 -16.00 61.58 268.99
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.211e+03 1.634e+03 -3.189 0.00538 **
## M 1.128e+02 3.840e+01 2.938 0.00919 **
## So 1.721e+02 1.273e+02 1.352 0.19422
## Ed 2.820e+02 5.095e+01 5.535 3.64e-05 ***
## Po1 2.136e+02 1.006e+02 2.123 0.04873 *
## Po2 -1.188e+02 1.088e+02 -1.092 0.29015
## LF -6.550e+01 1.285e+03 -0.051 0.95994
## M.F 2.047e+00 1.799e+01 0.114 0.91076
## Pop -1.609e+00 1.334e+00 -1.206 0.24415
## NW 1.111e+00 6.383e+00 0.174 0.86393
## U1 -3.995e+03 3.303e+03 -1.210 0.24298
## U2 2.128e+02 7.214e+01 2.949 0.00897 **
## Wealth -4.952e-02 9.948e-02 -0.498 0.62501
## Ineq 4.578e+01 2.250e+01 2.035 0.05775 .
## Prob -5.539e+03 1.992e+03 -2.781 0.01282 *
## Time -7.910e+00 7.218e+00 -1.096 0.28836
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 144.6 on 17 degrees of freedom
## Multiple R-squared: 0.9127, Adjusted R-squared: 0.8356
## F-statistic: 11.85 on 15 and 17 DF, p-value: 3.452e-06
This first model has an adjusted R-squared of ~0.84, which is a fairly solid fit. Let us now look at its residuals and its Q-Q plot.
plot(crime_training$Crime)
plot(mod_1)
Our residuals look fairly well behaved. Let us now use our validation set to generate predictions from this first model.
predicted_1 = as.vector(predict.lm(mod_1, crime_validation))
predicted_1
## [1] 566.2062 835.5273 574.0591 578.8189 249.4712 1694.3905 1114.3152
tested_1 = as.vector(crime_validation$Crime)
# Compute residuals on the validation set
mod1_resid = rep(0, nrow(crime_validation))
for (i in 1:nrow(crime_validation)){
  mod1_resid[i] = tested_1[i] - predicted_1[i]
}
mod1_resid
## [1] -127.20619 838.47271 121.94093 277.18115 92.52878 274.60952 157.68475
mean1 = mean(abs(mod1_resid))
mean1
## [1] 269.9463
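As an aside, the loop above can be replaced by a single vectorized line that computes the same mean absolute residual (assuming the objects already defined):
# Vectorized equivalent of the residual loop and mean above
mean(abs(crime_validation$Crime - predict(mod_1, crime_validation)))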
This mean absolute residual (~270) is quite large relative to the model's in-sample residual standard error (~145).
# Residuals vs. fitted for the training data, with a horizontal
# reference line at the validation mean absolute residual
plot(mod_1, which = 1)
abline(h = mean1, col = "blue", lwd = 2)
As we can see, we have some overfitting here. Rather than keeping all of the predictors in raw form, let us try a second model that regresses the log of Crime on the same predictors; the log-transformed response may help account for some curvature in the data.
mod_2 = lm(log(Crime)~., data = crime_training)
summary(mod_2)
##
## Call:
## lm(formula = log(Crime) ~ ., data = crime_training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.22682 -0.08215 -0.00450 0.05784 0.32538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.511e-01 1.930e+00 -0.078 0.93848
## M 1.375e-01 4.535e-02 3.032 0.00753 **
## So 2.152e-01 1.504e-01 1.431 0.17056
## Ed 3.410e-01 6.016e-02 5.668 2.78e-05 ***
## Po1 1.114e-01 1.188e-01 0.938 0.36130
## Po2 -2.848e-02 1.285e-01 -0.222 0.82724
## LF 2.653e-01 1.517e+00 0.175 0.86327
## M.F -1.263e-02 2.124e-02 -0.594 0.56003
## Pop -6.344e-05 1.575e-03 -0.040 0.96833
## NW 5.771e-03 7.537e-03 0.766 0.45436
## U1 -1.568e+00 3.900e+00 -0.402 0.69265
## U2 1.774e-01 8.518e-02 2.082 0.05274 .
## Wealth 8.363e-05 1.175e-04 0.712 0.48615
## Ineq 6.666e-02 2.656e-02 2.509 0.02251 *
## Prob -5.846e+00 2.352e+00 -2.485 0.02364 *
## Time -1.222e-02 8.523e-03 -1.434 0.16979
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1707 on 17 degrees of freedom
## Multiple R-squared: 0.8942, Adjusted R-squared: 0.8008
## F-statistic: 9.574 on 15 and 17 DF, p-value: 1.576e-05
This model has an adjusted R-squared of ~0.80, slightly worse than our first model.
plot(mod_2)
# Back-transform the log-scale predictions to the original Crime scale
predicted_2 = exp(as.vector(predict.lm(mod_2, crime_validation)))
predicted_2
## [1] 586.7177 829.2724 601.8554 651.2973 411.0483 2410.6486 765.3705
tested_2 = as.vector(crime_validation$Crime)
mod2_resid = rep(0, nrow(crime_validation))
for (i in 1:nrow(crime_validation)){
mod2_resid[i] = tested_2[i] - predicted_2[i]
}
mod2_resid
## [1] -147.71767 844.72762 94.14458 204.70270 -69.04829 -441.64863 506.62948
mean2 = mean(abs(mod2_resid))
mean2
## [1] 329.8027
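One caveat: exponentiating predictions from a log-response model yields the conditional median rather than the conditional mean. A common lognormal correction, shown here only as a sketch (it is not applied in the results above), adds half the residual variance on the log scale before exponentiating:
# Sketch: lognormal bias correction for back-transformed predictions
sigma2 <- summary(mod_2)$sigma^2
predicted_2_adj <- exp(predict(mod_2, crime_validation) + sigma2 / 2)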
We see that the mean absolute residual here (~330) is higher than our first model's (~270). Let us also analyze whether a polynomial version of the model can improve our predictions. We now include all of the predictors along with their second-degree (squared) terms.
mod_3 = lm(Crime ~ M + I(M^2) + So + I(So^2) + Ed + I(Ed^2) +
             Po1 + I(Po1^2) + Po2 + I(Po2^2) + LF + I(LF^2) +
             M.F + I(M.F^2) + Pop + I(Pop^2) + NW + I(NW^2) +
             U1 + I(U1^2) + U2 + I(U2^2) + Wealth + I(Wealth^2) +
             Ineq + I(Ineq^2) + Prob + I(Prob^2) + Time + I(Time^2),
           data = crime_training)
summary(mod_3)
##
## Call:
## lm(formula = Crime ~ M + I(M^2) + So + I(So^2) + Ed + I(Ed^2) +
## Po1 + I(Po1^2) + Po2 + I(Po2^2) + LF + I(LF^2) + M.F + I(M.F^2) +
## Pop + I(Pop^2) + NW + I(NW^2) + U1 + I(U1^2) + U2 + I(U2^2) +
## Wealth + I(Wealth^2) + Ineq + I(Ineq^2) + Prob + I(Prob^2) +
## Time + I(Time^2), data = crime_training)
##
## Residuals:
## 8 25 18 12 10 28 46 33
## -35.0886 57.9272 -7.9831 74.3984 -29.3170 -57.7478 -39.0432 35.9234
## 42 24 40 13 37 41 29 5
## 25.7889 24.6941 9.2804 -59.8732 -5.5533 6.7215 0.5759 30.6850
## 44 34 7 6 45 21 32 26
## -43.1384 94.4272 -22.0877 -16.7503 -8.4912 18.2340 26.9663 9.4324
## 17 2 14 35 31 43 38 15
## -36.6771 22.9521 -95.8063 -27.7768 -2.9745 27.8276 16.3016 -1.0757
## 16
## 7.2483
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.645e+04 1.404e+05 0.473 0.668
## M 7.406e+02 8.106e+02 0.914 0.428
## I(M^2) -2.028e+01 2.787e+01 -0.728 0.519
## So 1.893e+02 2.690e+02 0.704 0.532
## I(So^2) NA NA NA NA
## Ed 6.712e+02 1.787e+03 0.376 0.732
## I(Ed^2) -2.139e+01 8.356e+01 -0.256 0.815
## Po1 4.172e+02 1.150e+03 0.363 0.741
## I(Po1^2) -1.199e+01 6.462e+01 -0.186 0.865
## Po2 -4.545e+02 1.307e+03 -0.348 0.751
## I(Po2^2) 1.936e+01 7.714e+01 0.251 0.818
## LF 3.519e+04 7.329e+04 0.480 0.664
## I(LF^2) -2.895e+04 6.339e+04 -0.457 0.679
## M.F -1.762e+03 3.137e+03 -0.562 0.613
## I(M.F^2) 8.587e+00 1.572e+01 0.546 0.623
## Pop -2.372e+00 6.083e+00 -0.390 0.723
## I(Pop^2) -1.495e-02 4.318e-02 -0.346 0.752
## NW 1.821e+01 6.056e+01 0.301 0.783
## I(NW^2) -5.340e-01 1.705e+00 -0.313 0.775
## U1 7.253e+03 2.454e+04 0.296 0.787
## I(U1^2) -4.471e+04 1.317e+05 -0.340 0.757
## U2 3.720e+02 7.168e+02 0.519 0.640
## I(U2^2) -2.215e+01 1.342e+02 -0.165 0.879
## Wealth 7.284e-02 1.979e+00 0.037 0.973
## I(Wealth^2) -8.865e-06 1.828e-04 -0.048 0.964
## Ineq 1.616e+02 3.575e+02 0.452 0.682
## I(Ineq^2) -2.970e+00 8.893e+00 -0.334 0.760
## Prob -1.856e+04 1.399e+04 -1.327 0.277
## I(Prob^2) 7.919e+04 9.441e+04 0.839 0.463
## Time 1.686e+01 6.299e+01 0.268 0.806
## I(Time^2) -6.079e-01 1.330e+00 -0.457 0.679
##
## Residual standard error: 127.6 on 3 degrees of freedom
## Multiple R-squared: 0.988, Adjusted R-squared: 0.872
## F-statistic: 8.516 on 29 and 3 DF, p-value: 0.05085
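As an aside, the long quadratic formula above could be built programmatically rather than typed out. A sketch using the same training data (the names preds, rhs, and mod_3_alt are our own); note that because So is a 0/1 indicator, I(So^2) duplicates So, which is why summary() reports one coefficient as NA above:
# Sketch: generate "x + I(x^2)" terms for every predictor
preds <- setdiff(names(crime_training), "Crime")
rhs <- paste(c(preds, sprintf("I(%s^2)", preds)), collapse = " + ")
mod_3_alt <- lm(as.formula(paste("Crime ~", rhs)), data = crime_training)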
This model has an adjusted R-squared of ~0.87, higher than the other two models, but that is to be expected since we have roughly doubled the number of predictors, leaving only 3 residual degrees of freedom. Let us now analyze the residuals for this model.
plot(mod_3)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
predicted_3 = as.vector(predict.lm(mod_3, crime_validation))
predicted_3
## [1] 121.3463 810.3753 485.9528 820.4733 263.3560 1495.5246 1283.6116
tested_3 = as.vector(crime_validation$Crime)
mod3_resid = rep(0, nrow(crime_validation))
for (i in 1:nrow(crime_validation)){
mod3_resid[i] = tested_3[i] - predicted_3[i]
}
mod3_resid
## [1] 317.65369 863.62467 210.04717 35.52666 78.64402 473.47542 -11.61164
mean3 = mean(abs(mod3_resid))
mean3
## [1] 284.369
Since our polynomial model had a mean absolute residual on the validation set (~284) close to the standard model's (~270) and a higher adjusted R-squared, we will use it as our main model and check whether the testing data set is also well represented.
# Score the selected polynomial model on the held-out testing set
predicted_3 = as.vector(predict.lm(mod_3, crime_testing))
predicted_3
## [1] 856.1365 494.0097 1790.2079 1100.2910 1196.5401 1130.7938 916.1548
tested_3 = as.vector(crime_testing$Crime)
mod3_resid = rep(0, nrow(crime_testing))
for (i in 1:nrow(crime_testing)){
mod3_resid[i] = tested_3[i] - predicted_3[i]
}
mod3_resid
## [1] -65.13648 83.99028 -1040.20794 124.70896 19.45994 -304.79381
## [7] -67.15479
mean3 = mean(abs(mod3_resid))
mean3
## [1] 243.636
As we can see, the mean absolute residual on the testing set (~244) is in line with (in fact lower than) the validation results, so the model does not appear wildly overfit. Thus, we will use this model to predict the crime rate for the city given in the problem.
problem_values = data.frame(M = 14.0, So = 0, Ed = 10.0, Po1 = 12.0,
                            Po2 = 15.5, LF = 0.640, M.F = 94.0, Pop = 150,
                            NW = 1.1, U1 = 0.120, U2 = 3.6, Wealth = 3200,
                            Ineq = 20.1, Prob = 0.04, Time = 39.0)
solution1 = predict(mod_3, problem_values)
solution1
## 1
## 1096.183
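To sanity-check this prediction against the spread of the data, we can look at the observed range of Crime across all 47 cities with a one-liner:
# Observed minimum and maximum crime rates in the full data set
range(crime_data$Crime)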
Discussion of Results - After testing three different linear models, our best one was the polynomial model, which produced the smallest mean absolute residual on held-out data. To put the predicted crime rate of ~1096.2 into context, the smallest crime value in the data set is around 350 and the largest is around 2000, so the prediction is sensible relative to the rest of the crime data. Ultimately, after using a train/validation/test split to select the best model, we predict a crime rate of ~1096.