QUESTION 8.1:

Describe a situation or problem from your job, everyday life, current events, etc., for which a linear regression model would be appropriate.
List some (up to 5) predictors that you might use.

ANSWER 8.1:

A linear regression model can be used in the real estate industry in order to predict the value of a home’s selling price (dependent variable) based on a number of characteristics (independent variables/predictors). Some of these predictors may have a more significant impact on the selling price of a house, and therefore have a stronger relationship on the dependent variable. Some of these characteristics may have very little impact, so it is important to try and eliminate any predictors that do not exhibit a strong relationship to the dependent variable in order to reduce overfitting.

Here are some predictors that may have a strong impact on the selling price of the house:
-Overall square footage
-Bedrooms and bathrooms
-Location/proximity
-Hazardous Potential (e.g. high risk of flooding, earthquakes)
-Noise level (near an airport, train tracks. busy road)

Here are some predictors that would likely have less of an impact on the selling price, and may or may not be worth including in this hypothetical regression model:
-Size of garage
-Number of windows
-Gas or Electric heating.
-Central Air vs Air Condition Units

The model that we build should best aid us in setting the most accurate and realistic prices for the house values, and can also help us evaluate which real estate in our market is either overvalued or undervalued.

QUESTION 8.2:

Using crime data from http://www.statsci.org/data/general/uscrime.txt (file uscrime.txt, description at http://www.statsci.org/data/general/uscrime.html), use regression (a useful R function is lm or glm) to predict the observed crime rate in a city with the following data:
M = 14.0,So = 0,Ed = 10.0, Po1 = 12.0,Po2 = 15.5, LF = 0.640, M.F = 94.0,Pop = 150,NW = 1.1,U1 = 0.120, U2 = 3.6, Wealth = 3200,Ineq = 20.1,Prob = 0.04, Time = 39.0
Show your model (factors used and their coefficients), the software output, and the quality of fit.

ANSWER 8.2:

For this question, we will build a model using the crime dataset that will help us predict crime rates for the previously given set of test data. We will create several models, each with a certain number of predictors, and then we will perform cross validation and RFE (Recursive Feature Elimination) in order to determine how many predictors we should use, and which ones to use.

First, we’ll clear the environment, set seed for reproducibility, and load in our libraries:

rm(list = ls())
set.seed(123)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(ggplot2)
library(stats)

Let’s now load in our test data:

city_data <- data.frame(M = 14.0, So = 0, Ed = 10.0, Po1 = 12.0, Po2 = 15.5,
                        LF = 0.640, M.F = 94.0, Pop = 150, NW = 1.1, U1 = 0.120,
                        U2 = 3.6, Wealth = 3200, Ineq = 20.1, Prob = 0.04, Time = 39.0)

And now we’ll load in our crime data and view a summary of it.

crime_data <- read.table("/Users/djmariano/uscrime.txt", header = TRUE, stringsAsFactors = FALSE)
summary(crime_data)
##        M               So               Ed             Po1       
##  Min.   :11.90   Min.   :0.0000   Min.   : 8.70   Min.   : 4.50  
##  1st Qu.:13.00   1st Qu.:0.0000   1st Qu.: 9.75   1st Qu.: 6.25  
##  Median :13.60   Median :0.0000   Median :10.80   Median : 7.80  
##  Mean   :13.86   Mean   :0.3404   Mean   :10.56   Mean   : 8.50  
##  3rd Qu.:14.60   3rd Qu.:1.0000   3rd Qu.:11.45   3rd Qu.:10.45  
##  Max.   :17.70   Max.   :1.0000   Max.   :12.20   Max.   :16.60  
##       Po2               LF              M.F              Pop        
##  Min.   : 4.100   Min.   :0.4800   Min.   : 93.40   Min.   :  3.00  
##  1st Qu.: 5.850   1st Qu.:0.5305   1st Qu.: 96.45   1st Qu.: 10.00  
##  Median : 7.300   Median :0.5600   Median : 97.70   Median : 25.00  
##  Mean   : 8.023   Mean   :0.5612   Mean   : 98.30   Mean   : 36.62  
##  3rd Qu.: 9.700   3rd Qu.:0.5930   3rd Qu.: 99.20   3rd Qu.: 41.50  
##  Max.   :15.700   Max.   :0.6410   Max.   :107.10   Max.   :168.00  
##        NW              U1                U2            Wealth    
##  Min.   : 0.20   Min.   :0.07000   Min.   :2.000   Min.   :2880  
##  1st Qu.: 2.40   1st Qu.:0.08050   1st Qu.:2.750   1st Qu.:4595  
##  Median : 7.60   Median :0.09200   Median :3.400   Median :5370  
##  Mean   :10.11   Mean   :0.09547   Mean   :3.398   Mean   :5254  
##  3rd Qu.:13.25   3rd Qu.:0.10400   3rd Qu.:3.850   3rd Qu.:5915  
##  Max.   :42.30   Max.   :0.14200   Max.   :5.800   Max.   :6890  
##       Ineq            Prob              Time           Crime       
##  Min.   :12.60   Min.   :0.00690   Min.   :12.20   Min.   : 342.0  
##  1st Qu.:16.55   1st Qu.:0.03270   1st Qu.:21.60   1st Qu.: 658.5  
##  Median :17.60   Median :0.04210   Median :25.80   Median : 831.0  
##  Mean   :19.40   Mean   :0.04709   Mean   :26.60   Mean   : 905.1  
##  3rd Qu.:22.75   3rd Qu.:0.05445   3rd Qu.:30.45   3rd Qu.:1057.5  
##  Max.   :27.60   Max.   :0.11980   Max.   :44.00   Max.   :1993.0

For this first step, we are going to build a model (using R’s lm() function) to predict the crime rates using all the included variables as predictors.
Since we are trying to predict crime rates, we’ll use Crime as our dependent variable.

model <- lm(Crime ~ M + So + Ed + Po1 + Po2 + LF + M.F + Pop + NW + U1 + U2 + Wealth + Ineq + Prob + Time, data=crime_data)

Now let’s view the model’s coefficients, significance levels, and other characteristics:

summary(model)
## 
## Call:
## lm(formula = Crime ~ M + So + Ed + Po1 + Po2 + LF + M.F + Pop + 
##     NW + U1 + U2 + Wealth + Ineq + Prob + Time, data = crime_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -395.74  -98.09   -6.69  112.99  512.67 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.984e+03  1.628e+03  -3.675 0.000893 ***
## M            8.783e+01  4.171e+01   2.106 0.043443 *  
## So          -3.803e+00  1.488e+02  -0.026 0.979765    
## Ed           1.883e+02  6.209e+01   3.033 0.004861 ** 
## Po1          1.928e+02  1.061e+02   1.817 0.078892 .  
## Po2         -1.094e+02  1.175e+02  -0.931 0.358830    
## LF          -6.638e+02  1.470e+03  -0.452 0.654654    
## M.F          1.741e+01  2.035e+01   0.855 0.398995    
## Pop         -7.330e-01  1.290e+00  -0.568 0.573845    
## NW           4.204e+00  6.481e+00   0.649 0.521279    
## U1          -5.827e+03  4.210e+03  -1.384 0.176238    
## U2           1.678e+02  8.234e+01   2.038 0.050161 .  
## Wealth       9.617e-02  1.037e-01   0.928 0.360754    
## Ineq         7.067e+01  2.272e+01   3.111 0.003983 ** 
## Prob        -4.855e+03  2.272e+03  -2.137 0.040627 *  
## Time        -3.479e+00  7.165e+00  -0.486 0.630708    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 209.1 on 31 degrees of freedom
## Multiple R-squared:  0.8031, Adjusted R-squared:  0.7078 
## F-statistic: 8.429 on 15 and 31 DF,  p-value: 3.539e-07

We’ll now use this model to predict crime rates using the test data:

predicted_crime_rate <- predict(model, city_data)
predicted_crime_rate
##        1 
## 155.4349

Our model predicts a crime rate of 155. Let’s view the range of the Crime rate variable and see if it falls in line.

range(crime_data$Crime)
## [1]  342 1993

This is far below our lowest value for Crime in the dataset, and therefore is likely inaccurate due to overfitting.
This is because we are including too many predictors for our model, regardless of their actual significance on the Crime rate.

We should start by removing any predictors that had stars or a period next to them in the previous summary.

newmodel <- lm(Crime ~ M + Ed + Po1 + U2 + Ineq + Prob, data=crime_data)
summary(newmodel)
## 
## Call:
## lm(formula = Crime ~ M + Ed + Po1 + U2 + Ineq + Prob, data = crime_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -470.68  -78.41  -19.68  133.12  556.23 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5040.50     899.84  -5.602 1.72e-06 ***
## M             105.02      33.30   3.154  0.00305 ** 
## Ed            196.47      44.75   4.390 8.07e-05 ***
## Po1           115.02      13.75   8.363 2.56e-10 ***
## U2             89.37      40.91   2.185  0.03483 *  
## Ineq           67.65      13.94   4.855 1.88e-05 ***
## Prob        -3801.84    1528.10  -2.488  0.01711 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 200.7 on 40 degrees of freedom
## Multiple R-squared:  0.7659, Adjusted R-squared:  0.7307 
## F-statistic: 21.81 on 6 and 40 DF,  p-value: 3.418e-11

Now we’ll use the new model to predict the crime rate for the same city_data.

predicted_newmodel <- predict(newmodel, city_data)
predicted_newmodel
##        1 
## 1304.245

This value falls more in line with our crime dataset. We also observe that our new model has a higher adjusted R-squared and a lower RSE, which means it provides a more accurate fit to the data.

Now what if we create a new model, but this time, we only use values lower than 0.05 (as discussed in the lectures)? We’ll include U2 just because it’s 0.0501 and it doesn’t feel right leaving it out :)

model_0.05 <- lm(Crime ~ M + Ed + U2 + Ineq + Prob, data=crime_data)
summary(model_0.05)
## 
## Call:
## lm(formula = Crime ~ M + Ed + U2 + Ineq + Prob, data = crime_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -478.8 -233.6  -46.5  143.2  797.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -3336.52    1435.26  -2.325  0.02512 * 
## M              85.33      54.39   1.569  0.12437   
## Ed            214.69      73.20   2.933  0.00547 **
## U2            160.01      65.54   2.441  0.01903 * 
## Ineq           29.50      21.56   1.368  0.17880   
## Prob        -6897.24    2427.81  -2.841  0.00697 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 328.6 on 41 degrees of freedom
## Multiple R-squared:  0.3565, Adjusted R-squared:  0.278 
## F-statistic: 4.542 on 5 and 41 DF,  p-value: 0.002186

Let’s now predict the crime rate of the test data using this new model

predicted_model_0.05 <- predict(model_0.05, city_data)
predicted_model_0.05
##        1 
## 898.1004

We see that this new model seems to perform worse than both our original model and our newmodel. By removing the Po1 predictor, we may have eliminated a variable that was significant in explaining the variation of the Crime variable. This led to higher RSE and lower R-squared values, which means the Po1 predictor likely has a strong relationship to the Crime variable. This model seemingly performs worse than our original model, but that is simply because our original model suffers from overfitting.

So how do we determine how many predictors, and which predictors to use? As I mentioned in the beginning of the question, we can use R’s rfe() function to perform Recursive Feature Elimination. This will remove any predictors with the weakest relationships to the Crime variable and provide us with the best estimate of which predictors to use. We can find out how many predictors should be used by using cross validation.

Here we will set our control parameter for our rfe() function. We’ll choose 4 folds.

control <- rfeControl(functions=lmFuncs, method="repeatedcv", number=4, repeats = 25, verbose = FALSE)

Now we perform our RFE and we’ll check out the results.

rfe <- rfe(crime_data[,-16], crime_data[[16]],
              sizes = c(1:15),
              rfeControl = control)
print(rfe)
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (4 fold, repeated 25 times) 
## 
## Resampling performance over subset size:
## 
##  Variables  RMSE Rsquared   MAE RMSESD RsquaredSD MAESD Selected
##          1 366.7   0.1666 282.9  75.54     0.1617 51.68         
##          2 364.6   0.1985 281.0  72.18     0.1524 54.85         
##          3 369.2   0.1726 288.4  73.68     0.1303 56.37         
##          4 348.3   0.2687 281.4  60.67     0.1827 52.06         
##          5 338.2   0.3328 272.6  59.84     0.1920 53.86         
##          6 321.9   0.4064 259.9  58.06     0.2010 51.06         
##          7 315.7   0.4340 254.7  60.98     0.2106 53.64         
##          8 288.8   0.5135 229.4  65.59     0.2242 54.64         
##          9 266.8   0.5759 208.0  69.76     0.2076 58.10         
##         10 256.7   0.5997 201.2  63.44     0.2015 54.04        *
##         11 262.0   0.5799 203.7  64.03     0.2104 52.66         
##         12 272.6   0.5546 210.5  65.41     0.2121 53.16         
##         13 275.8   0.5493 212.9  68.09     0.2119 55.89         
##         14 282.7   0.5356 217.6  69.29     0.2162 58.45         
##         15 288.0   0.5231 224.4  69.99     0.2120 61.34         
## 
## The top 5 variables (out of 10):
##    U1, Prob, LF, Po1, Ed

We see the star next to 10 which suggests the optimal number of predictors to use is 10. Now let’s see which predictors we should use:

predictors(rfe)
##  [1] "U1"   "Prob" "LF"   "Po1"  "Ed"   "U2"   "Po2"  "So"   "M"    "Ineq"

Now let’s build one last model using only these predictors.

finalmodel <- lm(Crime ~ U1 + Prob + LF + Ed + Po1 + U2 + Po2 + So + M + Ineq, data=crime_data)
summary(finalmodel)
## 
## Call:
## lm(formula = Crime ~ U1 + Prob + LF + Ed + Po1 + U2 + Po2 + So + 
##     M + Ineq, data = crime_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -446.88  -95.08   16.57  116.19  539.68 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5099.78     989.89  -5.152 9.44e-06 ***
## U1          -2925.15    3332.91  -0.878 0.385951    
## Prob        -4000.57    1711.52  -2.337 0.025094 *  
## LF            531.84    1163.49   0.457 0.650340    
## Ed            210.85      57.43   3.671 0.000777 ***
## Po1           177.49      96.84   1.833 0.075112 .  
## U2            150.25      77.07   1.949 0.059060 .  
## Po2           -79.51     105.11  -0.756 0.454314    
## So             78.48     129.24   0.607 0.547491    
## M             100.88      35.40   2.849 0.007201 ** 
## Ineq           58.80      16.77   3.506 0.001239 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 204.7 on 36 degrees of freedom
## Multiple R-squared:  0.7808, Adjusted R-squared:  0.7199 
## F-statistic: 12.82 on 10 and 36 DF,  p-value: 3.937e-09

Let’s now predict the crime rate of the test data using this final model.

predicted_final_model <- predict(finalmodel, city_data)
predicted_final_model
##        1 
## 870.6834

We observe that our finalmodel provides us with a similar crimerate to model_0.05, however we have a much lower residual standard error and higher Multiple R-squared and Adjuasted R-squared, implying that our finalmodel is the better performing model.

The choice in this example would come down to whether we should use newmodel or finalmodel. newmodel seems to perform better because it has a slightly higher Adjusted R-squared and lower RSE, however, the crime rate suggested is a lot higher compared to our model_0.05 and our finalmodel. In addition, our newmodel considers less predictors than our finalmodel, and since our Multiple R-squared is higher on our finalmodel, the newmodel may still be missing an important predictor that is included in our finalmodel. Therefore, I feel comfortable in saying our finalmodel is likely the stronger model, and is best served to use to predict the Crime rate for our test data.