QUESTION 8.1:
Describe a situation or problem from your job, everyday life, current
events, etc., for which a linear regression model would be
appropriate.
List some (up to 5) predictors that you might use.
ANSWER 8.1:
A linear regression model can be used in the real estate industry in order to predict the value of a home’s selling price (dependent variable) based on a number of characteristics (independent variables/predictors). Some of these predictors may have a more significant impact on the selling price of a house, and therefore have a stronger relationship on the dependent variable. Some of these characteristics may have very little impact, so it is important to try and eliminate any predictors that do not exhibit a strong relationship to the dependent variable in order to reduce overfitting.
Here are some predictors that may have a strong impact on the selling
price of the house:
-Overall square footage
-Bedrooms and bathrooms
-Location/proximity
-Hazardous Potential (e.g. high risk of flooding, earthquakes)
-Noise level (near an airport, train tracks. busy road)
Here are some predictors that would likely have less of an impact on
the selling price, and may or may not be worth including in this
hypothetical regression model:
-Size of garage
-Number of windows
-Gas or Electric heating.
-Central Air vs Air Condition Units
The model that we build should best aid us in setting the most accurate and realistic prices for the house values, and can also help us evaluate which real estate in our market is either overvalued or undervalued.
QUESTION 8.2:
Using crime data from http://www.statsci.org/data/general/uscrime.txt (file
uscrime.txt, description at http://www.statsci.org/data/general/uscrime.html), use
regression (a useful R function is lm or glm) to predict the observed
crime rate in a city with the following data:
M = 14.0,So = 0,Ed = 10.0, Po1 = 12.0,Po2 = 15.5, LF = 0.640, M.F =
94.0,Pop = 150,NW = 1.1,U1 = 0.120, U2 = 3.6, Wealth = 3200,Ineq =
20.1,Prob = 0.04, Time = 39.0
Show your model (factors used and their coefficients), the software
output, and the quality of fit.
ANSWER 8.2:
For this question, we will build a model using the crime dataset that will help us predict crime rates for the previously given set of test data. We will create several models, each with a certain number of predictors, and then we will perform cross validation and RFE (Recursive Feature Elimination) in order to determine how many predictors we should use, and which ones to use.
First, we’ll clear the environment, set seed for reproducibility, and load in our libraries:
rm(list = ls())
set.seed(123)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(ggplot2)
library(stats)
Let’s now load in our test data:
city_data <- data.frame(M = 14.0, So = 0, Ed = 10.0, Po1 = 12.0, Po2 = 15.5,
LF = 0.640, M.F = 94.0, Pop = 150, NW = 1.1, U1 = 0.120,
U2 = 3.6, Wealth = 3200, Ineq = 20.1, Prob = 0.04, Time = 39.0)
And now we’ll load in our crime data and view a summary of it.
crime_data <- read.table("/Users/djmariano/uscrime.txt", header = TRUE, stringsAsFactors = FALSE)
summary(crime_data)
## M So Ed Po1
## Min. :11.90 Min. :0.0000 Min. : 8.70 Min. : 4.50
## 1st Qu.:13.00 1st Qu.:0.0000 1st Qu.: 9.75 1st Qu.: 6.25
## Median :13.60 Median :0.0000 Median :10.80 Median : 7.80
## Mean :13.86 Mean :0.3404 Mean :10.56 Mean : 8.50
## 3rd Qu.:14.60 3rd Qu.:1.0000 3rd Qu.:11.45 3rd Qu.:10.45
## Max. :17.70 Max. :1.0000 Max. :12.20 Max. :16.60
## Po2 LF M.F Pop
## Min. : 4.100 Min. :0.4800 Min. : 93.40 Min. : 3.00
## 1st Qu.: 5.850 1st Qu.:0.5305 1st Qu.: 96.45 1st Qu.: 10.00
## Median : 7.300 Median :0.5600 Median : 97.70 Median : 25.00
## Mean : 8.023 Mean :0.5612 Mean : 98.30 Mean : 36.62
## 3rd Qu.: 9.700 3rd Qu.:0.5930 3rd Qu.: 99.20 3rd Qu.: 41.50
## Max. :15.700 Max. :0.6410 Max. :107.10 Max. :168.00
## NW U1 U2 Wealth
## Min. : 0.20 Min. :0.07000 Min. :2.000 Min. :2880
## 1st Qu.: 2.40 1st Qu.:0.08050 1st Qu.:2.750 1st Qu.:4595
## Median : 7.60 Median :0.09200 Median :3.400 Median :5370
## Mean :10.11 Mean :0.09547 Mean :3.398 Mean :5254
## 3rd Qu.:13.25 3rd Qu.:0.10400 3rd Qu.:3.850 3rd Qu.:5915
## Max. :42.30 Max. :0.14200 Max. :5.800 Max. :6890
## Ineq Prob Time Crime
## Min. :12.60 Min. :0.00690 Min. :12.20 Min. : 342.0
## 1st Qu.:16.55 1st Qu.:0.03270 1st Qu.:21.60 1st Qu.: 658.5
## Median :17.60 Median :0.04210 Median :25.80 Median : 831.0
## Mean :19.40 Mean :0.04709 Mean :26.60 Mean : 905.1
## 3rd Qu.:22.75 3rd Qu.:0.05445 3rd Qu.:30.45 3rd Qu.:1057.5
## Max. :27.60 Max. :0.11980 Max. :44.00 Max. :1993.0
For this first step, we are going to build a model (using R’s lm()
function) to predict the crime rates using all the included variables as
predictors.
Since we are trying to predict crime rates, we’ll use Crime as our
dependent variable.
model <- lm(Crime ~ M + So + Ed + Po1 + Po2 + LF + M.F + Pop + NW + U1 + U2 + Wealth + Ineq + Prob + Time, data=crime_data)
Now let’s view the model’s coefficients, significance levels, and other characteristics:
summary(model)
##
## Call:
## lm(formula = Crime ~ M + So + Ed + Po1 + Po2 + LF + M.F + Pop +
## NW + U1 + U2 + Wealth + Ineq + Prob + Time, data = crime_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -395.74 -98.09 -6.69 112.99 512.67
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.984e+03 1.628e+03 -3.675 0.000893 ***
## M 8.783e+01 4.171e+01 2.106 0.043443 *
## So -3.803e+00 1.488e+02 -0.026 0.979765
## Ed 1.883e+02 6.209e+01 3.033 0.004861 **
## Po1 1.928e+02 1.061e+02 1.817 0.078892 .
## Po2 -1.094e+02 1.175e+02 -0.931 0.358830
## LF -6.638e+02 1.470e+03 -0.452 0.654654
## M.F 1.741e+01 2.035e+01 0.855 0.398995
## Pop -7.330e-01 1.290e+00 -0.568 0.573845
## NW 4.204e+00 6.481e+00 0.649 0.521279
## U1 -5.827e+03 4.210e+03 -1.384 0.176238
## U2 1.678e+02 8.234e+01 2.038 0.050161 .
## Wealth 9.617e-02 1.037e-01 0.928 0.360754
## Ineq 7.067e+01 2.272e+01 3.111 0.003983 **
## Prob -4.855e+03 2.272e+03 -2.137 0.040627 *
## Time -3.479e+00 7.165e+00 -0.486 0.630708
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 209.1 on 31 degrees of freedom
## Multiple R-squared: 0.8031, Adjusted R-squared: 0.7078
## F-statistic: 8.429 on 15 and 31 DF, p-value: 3.539e-07
We’ll now use this model to predict crime rates using the test data:
predicted_crime_rate <- predict(model, city_data)
predicted_crime_rate
## 1
## 155.4349
Our model predicts a crime rate of 155. Let’s view the range of the Crime rate variable and see if it falls in line.
range(crime_data$Crime)
## [1] 342 1993
This is far below our lowest value for Crime in the dataset, and
therefore is likely inaccurate due to overfitting.
This is because we are including too many predictors for our model,
regardless of their actual significance on the Crime rate.
We should start by removing any predictors that had stars or a period next to them in the previous summary.
newmodel <- lm(Crime ~ M + Ed + Po1 + U2 + Ineq + Prob, data=crime_data)
summary(newmodel)
##
## Call:
## lm(formula = Crime ~ M + Ed + Po1 + U2 + Ineq + Prob, data = crime_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -470.68 -78.41 -19.68 133.12 556.23
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5040.50 899.84 -5.602 1.72e-06 ***
## M 105.02 33.30 3.154 0.00305 **
## Ed 196.47 44.75 4.390 8.07e-05 ***
## Po1 115.02 13.75 8.363 2.56e-10 ***
## U2 89.37 40.91 2.185 0.03483 *
## Ineq 67.65 13.94 4.855 1.88e-05 ***
## Prob -3801.84 1528.10 -2.488 0.01711 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 200.7 on 40 degrees of freedom
## Multiple R-squared: 0.7659, Adjusted R-squared: 0.7307
## F-statistic: 21.81 on 6 and 40 DF, p-value: 3.418e-11
Now we’ll use the new model to predict the crime rate for the same city_data.
predicted_newmodel <- predict(newmodel, city_data)
predicted_newmodel
## 1
## 1304.245
This value falls more in line with our crime dataset. We also observe that our new model has a higher adjusted R-squared and a lower RSE, which means it provides a more accurate fit to the data.
Now what if we create a new model, but this time, we only use values lower than 0.05 (as discussed in the lectures)? We’ll include U2 just because it’s 0.0501 and it doesn’t feel right leaving it out :)
model_0.05 <- lm(Crime ~ M + Ed + U2 + Ineq + Prob, data=crime_data)
summary(model_0.05)
##
## Call:
## lm(formula = Crime ~ M + Ed + U2 + Ineq + Prob, data = crime_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -478.8 -233.6 -46.5 143.2 797.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3336.52 1435.26 -2.325 0.02512 *
## M 85.33 54.39 1.569 0.12437
## Ed 214.69 73.20 2.933 0.00547 **
## U2 160.01 65.54 2.441 0.01903 *
## Ineq 29.50 21.56 1.368 0.17880
## Prob -6897.24 2427.81 -2.841 0.00697 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 328.6 on 41 degrees of freedom
## Multiple R-squared: 0.3565, Adjusted R-squared: 0.278
## F-statistic: 4.542 on 5 and 41 DF, p-value: 0.002186
Let’s now predict the crime rate of the test data using this new model
predicted_model_0.05 <- predict(model_0.05, city_data)
predicted_model_0.05
## 1
## 898.1004
We see that this new model seems to perform worse than both our original model and our newmodel. By removing the Po1 predictor, we may have eliminated a variable that was significant in explaining the variation of the Crime variable. This led to higher RSE and lower R-squared values, which means the Po1 predictor likely has a strong relationship to the Crime variable. This model seemingly performs worse than our original model, but that is simply because our original model suffers from overfitting.
So how do we determine how many predictors, and which predictors to use? As I mentioned in the beginning of the question, we can use R’s rfe() function to perform Recursive Feature Elimination. This will remove any predictors with the weakest relationships to the Crime variable and provide us with the best estimate of which predictors to use. We can find out how many predictors should be used by using cross validation.
Here we will set our control parameter for our rfe() function. We’ll choose 4 folds.
control <- rfeControl(functions=lmFuncs, method="repeatedcv", number=4, repeats = 25, verbose = FALSE)
Now we perform our RFE and we’ll check out the results.
rfe <- rfe(crime_data[,-16], crime_data[[16]],
sizes = c(1:15),
rfeControl = control)
print(rfe)
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (4 fold, repeated 25 times)
##
## Resampling performance over subset size:
##
## Variables RMSE Rsquared MAE RMSESD RsquaredSD MAESD Selected
## 1 366.7 0.1666 282.9 75.54 0.1617 51.68
## 2 364.6 0.1985 281.0 72.18 0.1524 54.85
## 3 369.2 0.1726 288.4 73.68 0.1303 56.37
## 4 348.3 0.2687 281.4 60.67 0.1827 52.06
## 5 338.2 0.3328 272.6 59.84 0.1920 53.86
## 6 321.9 0.4064 259.9 58.06 0.2010 51.06
## 7 315.7 0.4340 254.7 60.98 0.2106 53.64
## 8 288.8 0.5135 229.4 65.59 0.2242 54.64
## 9 266.8 0.5759 208.0 69.76 0.2076 58.10
## 10 256.7 0.5997 201.2 63.44 0.2015 54.04 *
## 11 262.0 0.5799 203.7 64.03 0.2104 52.66
## 12 272.6 0.5546 210.5 65.41 0.2121 53.16
## 13 275.8 0.5493 212.9 68.09 0.2119 55.89
## 14 282.7 0.5356 217.6 69.29 0.2162 58.45
## 15 288.0 0.5231 224.4 69.99 0.2120 61.34
##
## The top 5 variables (out of 10):
## U1, Prob, LF, Po1, Ed
We see the star next to 10 which suggests the optimal number of predictors to use is 10. Now let’s see which predictors we should use:
predictors(rfe)
## [1] "U1" "Prob" "LF" "Po1" "Ed" "U2" "Po2" "So" "M" "Ineq"
Now let’s build one last model using only these predictors.
finalmodel <- lm(Crime ~ U1 + Prob + LF + Ed + Po1 + U2 + Po2 + So + M + Ineq, data=crime_data)
summary(finalmodel)
##
## Call:
## lm(formula = Crime ~ U1 + Prob + LF + Ed + Po1 + U2 + Po2 + So +
## M + Ineq, data = crime_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -446.88 -95.08 16.57 116.19 539.68
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5099.78 989.89 -5.152 9.44e-06 ***
## U1 -2925.15 3332.91 -0.878 0.385951
## Prob -4000.57 1711.52 -2.337 0.025094 *
## LF 531.84 1163.49 0.457 0.650340
## Ed 210.85 57.43 3.671 0.000777 ***
## Po1 177.49 96.84 1.833 0.075112 .
## U2 150.25 77.07 1.949 0.059060 .
## Po2 -79.51 105.11 -0.756 0.454314
## So 78.48 129.24 0.607 0.547491
## M 100.88 35.40 2.849 0.007201 **
## Ineq 58.80 16.77 3.506 0.001239 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 204.7 on 36 degrees of freedom
## Multiple R-squared: 0.7808, Adjusted R-squared: 0.7199
## F-statistic: 12.82 on 10 and 36 DF, p-value: 3.937e-09
Let’s now predict the crime rate of the test data using this final model.
predicted_final_model <- predict(finalmodel, city_data)
predicted_final_model
## 1
## 870.6834
We observe that our finalmodel provides us with a similar crimerate to model_0.05, however we have a much lower residual standard error and higher Multiple R-squared and Adjuasted R-squared, implying that our finalmodel is the better performing model.
The choice in this example would come down to whether we should use newmodel or finalmodel. newmodel seems to perform better because it has a slightly higher Adjusted R-squared and lower RSE, however, the crime rate suggested is a lot higher compared to our model_0.05 and our finalmodel. In addition, our newmodel considers less predictors than our finalmodel, and since our Multiple R-squared is higher on our finalmodel, the newmodel may still be missing an important predictor that is included in our finalmodel. Therefore, I feel comfortable in saying our finalmodel is likely the stronger model, and is best served to use to predict the Crime rate for our test data.