Question 8.1 - Describe a situation or problem from your job, everyday life, current events, etc., for which a linear regression model would be appropriate. List some (up to 5) predictors that you might use.

Once again, I will use the example of measuring the health of a biome or habitat. To determine the health of a biome, we might want to compare previous levels of biodiversity to current levels, so a linear regression model would be appropriate for predicting biodiversity. Some predictors we might use are as follows:

- Average annual temperature
- Annual rainfall
- Pollution levels (e.g., air or water quality indices)
- Human population density in the surrounding area
- Amount of undeveloped land in the habitat

All of these predictors might be used to determine a healthy amount of biodiversity in a given environment.

Question 8.2 - Using crime data from http://www.statsci.org/data/general/uscrime.txt (file uscrime.txt, description at http://www.statsci.org/data/general/uscrime.html ), use regression (a useful R function is lm or glm) to predict the observed crime rate in a city with the following data:

M = 14.0
So = 0
Ed = 10.0
Po1 = 12.0
Po2 = 15.5
LF = 0.640
M.F = 94.0
Pop = 150
NW = 1.1
U1 = 0.120
U2 = 3.6
Wealth = 3200
Ineq = 20.1
Prob = 0.04
Time = 39.0

Show your model (factors used and their coefficients), the software output, and the quality of fit.

Note that because there are only 47 data points and 15 predictors, you’ll probably notice some overfitting. We’ll see ways of dealing with this sort of problem later in the course.

Methodology - To begin, we read in our data and create training, validation, and testing data sets. There are only 47 data points, so we expect a certain degree of overfitting, but we can still use this split to compare models. Once the data sets are created, we fit three different linear models: one with the standard predictors, one with a log-transformed response, and one with added polynomial (squared) terms. We then use the predict function to score each model, fit on the training data, against the validation set, and we measure accuracy by the mean absolute residual. Once we have selected a model using the validation data, we check it against the testing data to ensure we do not get wildly different results due to overfitting. Finally, we predict the crime rate for the given city data with the selected model.
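
Since model selection below hinges on this metric, here is a small helper that makes it explicit (a sketch; the code that follows computes the same quantity inline):

# mean absolute residual between actual and predicted values
mean_abs_resid = function(actual, predicted) {
  mean(abs(actual - predicted))
}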

# read in the US crime data (whitespace-delimited with a header row)
crime_data <- read.table("C:/Users/james/OneDrive/Desktop/Georgia Tech/ISYE 6501/Homeworks/Homework 5/Homework5_ISYE6501/Homework5_ISYE6501/data 8.2/uscrime.txt", header = TRUE)

head(crime_data)

Select rows for Training, Validation, and Testing Data sets.

set.seed(521)  # fix the random seed so the split is reproducible

round(nrow(crime_data)*.7)
## [1] 33
# sample 70% of the rows (33 of 47) for training
crime_training_rows = sample(nrow(crime_data), round(nrow(crime_data)*.7))

crime_training = crime_data[crime_training_rows,]

# split the remaining 14 rows evenly into validation and testing
crime_valandtest = crime_data[-crime_training_rows,]

crime_validation_rows = sample(nrow(crime_valandtest), round(nrow(crime_valandtest)/2))

crime_validation = crime_valandtest[crime_validation_rows,]

crime_testing = crime_valandtest[-crime_validation_rows,]
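
As a quick sanity check on the split (an added step, not part of the original output), the three sets should hold 33, 7, and 7 of the 47 rows:

# expect 33 training, 7 validation, and 7 testing rows
c(nrow(crime_training), nrow(crime_validation), nrow(crime_testing))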

Create our first standard linear model.

mod_1 = lm(Crime~., data = crime_training)

mod_1$coefficients
##   (Intercept)             M            So            Ed           Po1 
## -5.210999e+03  1.128406e+02  1.721150e+02  2.820159e+02  2.135616e+02 
##           Po2            LF           M.F           Pop            NW 
## -1.188034e+02 -6.549855e+01  2.046560e+00 -1.609070e+00  1.110518e+00 
##            U1            U2        Wealth          Ineq          Prob 
## -3.994903e+03  2.127638e+02 -4.951951e-02  4.578133e+01 -5.538871e+03 
##          Time 
## -7.910424e+00
summary(mod_1)
## 
## Call:
## lm(formula = Crime ~ ., data = crime_training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -167.91  -79.62  -16.00   61.58  268.99 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.211e+03  1.634e+03  -3.189  0.00538 ** 
## M            1.128e+02  3.840e+01   2.938  0.00919 ** 
## So           1.721e+02  1.273e+02   1.352  0.19422    
## Ed           2.820e+02  5.095e+01   5.535 3.64e-05 ***
## Po1          2.136e+02  1.006e+02   2.123  0.04873 *  
## Po2         -1.188e+02  1.088e+02  -1.092  0.29015    
## LF          -6.550e+01  1.285e+03  -0.051  0.95994    
## M.F          2.047e+00  1.799e+01   0.114  0.91076    
## Pop         -1.609e+00  1.334e+00  -1.206  0.24415    
## NW           1.111e+00  6.383e+00   0.174  0.86393    
## U1          -3.995e+03  3.303e+03  -1.210  0.24298    
## U2           2.128e+02  7.214e+01   2.949  0.00897 ** 
## Wealth      -4.952e-02  9.948e-02  -0.498  0.62501    
## Ineq         4.578e+01  2.250e+01   2.035  0.05775 .  
## Prob        -5.539e+03  1.992e+03  -2.781  0.01282 *  
## Time        -7.910e+00  7.218e+00  -1.096  0.28836    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 144.6 on 17 degrees of freedom
## Multiple R-squared:  0.9127, Adjusted R-squared:  0.8356 
## F-statistic: 11.85 on 15 and 17 DF,  p-value: 3.452e-06
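
If we want to isolate the predictors significant at the 0.05 level, one convenient way (an added snippet, not part of the original output) is to filter the coefficient table returned by summary():

coef_table = summary(mod_1)$coefficients
coef_table[coef_table[, "Pr(>|t|)"] < 0.05, ]  # keep rows with p < 0.05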

This first model has an adjusted R-squared of ~0.84, which is a pretty solid value. Let us now look at its residuals and Q-Q plot.

plot(crime_training$Crime)

plot(mod_1)
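
Note that calling plot() on an lm object cycles through four diagnostic plots; if only the Q-Q plot is wanted, the which argument isolates it:

plot(mod_1, which = 2)  # Normal Q-Q plot only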

Our residuals look pretty standard. Let us now use our validation set to predict values from our first model.

predicted_1 = as.vector(predict.lm(mod_1, crime_validation))
predicted_1
## [1]  566.2062  835.5273  574.0591  578.8189  249.4712 1694.3905 1114.3152
tested_1 = as.vector(crime_validation$Crime)

mod1_resid = tested_1 - predicted_1  # validation residuals: actual minus predicted

mod1_resid
## [1] -127.20619  838.47271  121.94093  277.18115   92.52878  274.60952  157.68475
mean1 = mean(abs(mod1_resid))
mean1
## [1] 269.9463

We can see that our mean absolute validation residual (~270) is quite high compared to the model's in-sample residual standard error (~145).

plot(mod_1, which = 1)  # training residuals vs. fitted values
abline(h = mean1, col = "blue", lwd = 2)  # overlay the validation mean absolute residual for scale

As we can see, we have a bit of overfitting here. Let us also try to find a better form for the model rather than using all of the predictors as-is. Thus, let us try a new model with the log of Crime as the response; the log transform may help account for some curvature in the data.

mod_2 = lm(log(Crime)~., data = crime_training)

summary(mod_2)
## 
## Call:
## lm(formula = log(Crime) ~ ., data = crime_training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.22682 -0.08215 -0.00450  0.05784  0.32538 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.511e-01  1.930e+00  -0.078  0.93848    
## M            1.375e-01  4.535e-02   3.032  0.00753 ** 
## So           2.152e-01  1.504e-01   1.431  0.17056    
## Ed           3.410e-01  6.016e-02   5.668 2.78e-05 ***
## Po1          1.114e-01  1.188e-01   0.938  0.36130    
## Po2         -2.848e-02  1.285e-01  -0.222  0.82724    
## LF           2.653e-01  1.517e+00   0.175  0.86327    
## M.F         -1.263e-02  2.124e-02  -0.594  0.56003    
## Pop         -6.344e-05  1.575e-03  -0.040  0.96833    
## NW           5.771e-03  7.537e-03   0.766  0.45436    
## U1          -1.568e+00  3.900e+00  -0.402  0.69265    
## U2           1.774e-01  8.518e-02   2.082  0.05274 .  
## Wealth       8.363e-05  1.175e-04   0.712  0.48615    
## Ineq         6.666e-02  2.656e-02   2.509  0.02251 *  
## Prob        -5.846e+00  2.352e+00  -2.485  0.02364 *  
## Time        -1.222e-02  8.523e-03  -1.434  0.16979    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1707 on 17 degrees of freedom
## Multiple R-squared:  0.8942, Adjusted R-squared:  0.8008 
## F-statistic: 9.574 on 15 and 17 DF,  p-value: 1.576e-05

We see this model has an adjusted R-squared of ~0.80, slightly worse than our first model. Since this R-squared is on the log scale, we will rely on the validation residuals (after exponentiating the predictions back to the original scale) for a direct comparison.

plot(mod_2)

predicted_2 = exp(as.vector(predict.lm(mod_2, crime_validation)))
predicted_2
## [1]  586.7177  829.2724  601.8554  651.2973  411.0483 2410.6486  765.3705
tested_2 = as.vector(crime_validation$Crime)

mod2_resid = tested_2 - predicted_2  # validation residuals: actual minus back-transformed predictions

mod2_resid
## [1] -147.71767  844.72762   94.14458  204.70270  -69.04829 -441.64863  506.62948
mean2 = mean(abs(mod2_resid))
mean2
## [1] 329.8027
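
One caveat with exponentiating log-model predictions: under normal errors on the log scale, exp of the fitted value estimates the median of Crime rather than the mean. A standard lognormal correction multiplies by exp(sigma^2/2); this optional adjustment (a sketch, not applied in the results above) looks like:

# bias-corrected back-transform: E[Y] = exp(mu + sigma^2/2) for lognormal Y
sigma2 = summary(mod_2)$sigma^2
predicted_2_adj = exp(predict(mod_2, crime_validation) + sigma2 / 2)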

We see that the mean absolute residual here (~330) is somewhat higher than our first model's (~270). Let us also analyze whether a polynomial version of the model can improve our predictions. We now include all of our predictors along with their squared terms.

mod_3 = lm(Crime~M + I(M^2) + So + I(So^2) + Ed + I(Ed^2) + Po1 + I(Po1^2) + Po2 + I(Po2^2) + LF + I(LF^2) + M.F + I(M.F^2) + Pop + I(Pop^2) + NW + I(NW^2) + U1 + I(U1^2) + U2 + I(U2^2) + Wealth + I(Wealth^2) + Ineq + I(Ineq^2) + Prob + I(Prob^2) + Time + I(Time^2), data = crime_training)
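
Writing out all thirty terms by hand is error-prone; an equivalent way to build the same formula programmatically (a sketch that should reproduce mod_3, assuming the column names above) is:

preds = setdiff(names(crime_training), "Crime")
sq_terms = sprintf("I(%s^2)", preds)  # a squared term for each predictor
mod_3_alt = lm(reformulate(c(preds, sq_terms), response = "Crime"), data = crime_training)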

summary(mod_3)
## 
## Call:
## lm(formula = Crime ~ M + I(M^2) + So + I(So^2) + Ed + I(Ed^2) + 
##     Po1 + I(Po1^2) + Po2 + I(Po2^2) + LF + I(LF^2) + M.F + I(M.F^2) + 
##     Pop + I(Pop^2) + NW + I(NW^2) + U1 + I(U1^2) + U2 + I(U2^2) + 
##     Wealth + I(Wealth^2) + Ineq + I(Ineq^2) + Prob + I(Prob^2) + 
##     Time + I(Time^2), data = crime_training)
## 
## Residuals:
##        8       25       18       12       10       28       46       33 
## -35.0886  57.9272  -7.9831  74.3984 -29.3170 -57.7478 -39.0432  35.9234 
##       42       24       40       13       37       41       29        5 
##  25.7889  24.6941   9.2804 -59.8732  -5.5533   6.7215   0.5759  30.6850 
##       44       34        7        6       45       21       32       26 
## -43.1384  94.4272 -22.0877 -16.7503  -8.4912  18.2340  26.9663   9.4324 
##       17        2       14       35       31       43       38       15 
## -36.6771  22.9521 -95.8063 -27.7768  -2.9745  27.8276  16.3016  -1.0757 
##       16 
##   7.2483 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  6.645e+04  1.404e+05   0.473    0.668
## M            7.406e+02  8.106e+02   0.914    0.428
## I(M^2)      -2.028e+01  2.787e+01  -0.728    0.519
## So           1.893e+02  2.690e+02   0.704    0.532
## I(So^2)             NA         NA      NA       NA
## Ed           6.712e+02  1.787e+03   0.376    0.732
## I(Ed^2)     -2.139e+01  8.356e+01  -0.256    0.815
## Po1          4.172e+02  1.150e+03   0.363    0.741
## I(Po1^2)    -1.199e+01  6.462e+01  -0.186    0.865
## Po2         -4.545e+02  1.307e+03  -0.348    0.751
## I(Po2^2)     1.936e+01  7.714e+01   0.251    0.818
## LF           3.519e+04  7.329e+04   0.480    0.664
## I(LF^2)     -2.895e+04  6.339e+04  -0.457    0.679
## M.F         -1.762e+03  3.137e+03  -0.562    0.613
## I(M.F^2)     8.587e+00  1.572e+01   0.546    0.623
## Pop         -2.372e+00  6.083e+00  -0.390    0.723
## I(Pop^2)    -1.495e-02  4.318e-02  -0.346    0.752
## NW           1.821e+01  6.056e+01   0.301    0.783
## I(NW^2)     -5.340e-01  1.705e+00  -0.313    0.775
## U1           7.253e+03  2.454e+04   0.296    0.787
## I(U1^2)     -4.471e+04  1.317e+05  -0.340    0.757
## U2           3.720e+02  7.168e+02   0.519    0.640
## I(U2^2)     -2.215e+01  1.342e+02  -0.165    0.879
## Wealth       7.284e-02  1.979e+00   0.037    0.973
## I(Wealth^2) -8.865e-06  1.828e-04  -0.048    0.964
## Ineq         1.616e+02  3.575e+02   0.452    0.682
## I(Ineq^2)   -2.970e+00  8.893e+00  -0.334    0.760
## Prob        -1.856e+04  1.399e+04  -1.327    0.277
## I(Prob^2)    7.919e+04  9.441e+04   0.839    0.463
## Time         1.686e+01  6.299e+01   0.268    0.806
## I(Time^2)   -6.079e-01  1.330e+00  -0.457    0.679
## 
## Residual standard error: 127.6 on 3 degrees of freedom
## Multiple R-squared:  0.988,  Adjusted R-squared:  0.872 
## F-statistic: 8.516 on 29 and 3 DF,  p-value: 0.05085
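
The "singularities" note in the output reflects that So is a 0/1 indicator, so I(So^2) is identical to So and is dropped. The alias() function confirms which term is linearly dependent:

alias(mod_3)  # reports I(So^2) as an exact linear combination of So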

This model has an adjusted R-squared of ~0.87, higher than the other two models, but that is to be expected since we have roughly doubled the number of predictors, leaving only 3 residual degrees of freedom. Let us now analyze the residuals for this model. (The NaN warnings from plot() below come from points with leverage at or near 1, another symptom of the overparameterized fit.)

plot(mod_3)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

predicted_3 = as.vector(predict.lm(mod_3, crime_validation))
predicted_3
## [1]  121.3463  810.3753  485.9528  820.4733  263.3560 1495.5246 1283.6116
tested_3 = as.vector(crime_validation$Crime)

mod3_resid = tested_3 - predicted_3  # validation residuals: actual minus predicted

mod3_resid
## [1] 317.65369 863.62467 210.04717  35.52666  78.64402 473.47542 -11.61164
mean3 = mean(abs(mod3_resid))
mean3
## [1] 284.369

Since our polynomial model had a mean absolute validation residual similar to the standard model's and a higher adjusted R-squared, we will use it as our main model and check whether the testing data set is also represented well.

predicted_3 = as.vector(predict.lm(mod_3, crime_testing))
predicted_3
## [1]  856.1365  494.0097 1790.2079 1100.2910 1196.5401 1130.7938  916.1548
tested_3 = as.vector(crime_testing$Crime)

mod3_resid = tested_3 - predicted_3  # test-set residuals: actual minus predicted

mod3_resid
## [1]   -65.13648    83.99028 -1040.20794   124.70896    19.45994  -304.79381
## [7]   -67.15479
mean3 = mean(abs(mod3_resid))
mean3
## [1] 243.636

As we can see, the mean absolute residual on the testing set (~244) is the lowest error we have seen across all models and splits, so performance does not degrade on fully held-out data. Thus, we will use this model to predict the crime rate for the city given in the problem.

problem_values = data.frame(M = 14.0, So = 0, Ed = 10.0, Po1 = 12.0, Po2 = 15.5, LF = 0.640, M.F = 94.0, Pop = 150, NW = 1.1, U1 = 0.120, U2 = 3.6, Wealth = 3200, Ineq = 20.1, Prob = 0.04, Time = 39.0)

solution1 = predict(mod_3, problem_values)

solution1
##        1 
## 1096.183
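
Given how few residual degrees of freedom mod_3 has, a point estimate alone can be misleading; predict() can also return a 95% prediction interval (an added check, not part of the original output):

predict(mod_3, problem_values, interval = "prediction", level = 0.95)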

Discussion of Results - After testing three different linear models, our selected model was the polynomial model, which produced the smallest mean absolute residual (~244, on the testing set). To put the predicted crime rate of ~1096.2 into context, the smallest crime value in the data is around 350 and the largest is around 2000, so the prediction makes sense relative to the rest of the crime data. Ultimately, after using a train/validation/test split to select our best model, we find a predicted crime rate of ~1096.
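
As a possible extension (not performed above), k-fold cross-validation would use every one of the 47 observations for both fitting and validation, giving a more stable error estimate than a single split. A minimal base-R sketch for the standard model:

set.seed(521)
k = 5
folds = sample(rep(1:k, length.out = nrow(crime_data)))  # random fold labels

cv_mae = sapply(1:k, function(f) {
  fit = lm(Crime ~ ., data = crime_data[folds != f, ])   # fit on the other folds
  pred = predict(fit, crime_data[folds == f, ])          # predict the held-out fold
  mean(abs(crime_data$Crime[folds == f] - pred))         # fold-level mean absolute residual
})
mean(cv_mae)  # cross-validated mean absolute residual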