Describe a situation or problem from your job, everyday life, current events, etc., for which a linear regression model would be appropriate. List some (up to 5) predictors that you might use.
Answer:
One application of a regression model could be calculating employee salaries. For example, if a company wants to hire a new employee and is wondering what salary is appropriate to offer, a regression model can be built to predict it. Some relevant predictors for building the model are level of education, years of experience, title/position, location, and so on.
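As a minimal sketch of how such a model could be fit in R (the employees and new_hire data frames and all column names here are hypothetical, not real data):
#hypothetical sketch: predicting salary from candidate attributes
#'employees' and 'new_hire' are assumed data frames with these columns
salary_model <- lm(salary ~ education + experience + title + location,
                   data = employees)
predict(salary_model, new_hire) #estimated salary for a new candidate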
Using crime data from http://www.statsci.org/data/general/uscrime.txt (file uscrime.txt, description at http://www.statsci.org/data/general/uscrime.html ), use regression (a useful R function is lm or glm) to predict the observed crime rate in a city with the following data:
test<-data.frame(M = 14.0,So = 0,Ed = 10.0, Po1 = 12.0,Po2 = 15.5,
LF = 0.640, M.F = 94.0,Pop = 150,NW = 1.1,U1 = 0.120,
U2 = 3.6, Wealth = 3200,Ineq = 20.1,Prob = 0.04, Time = 39.0)
options(warn=-1)
library(ggplot2)
library(tidyverse)
library(caret)
#loading data
crime=read_tsv("uscrime.txt")
Parsed with column specification:
cols(
  M = col_double(),
  So = col_double(),
  Ed = col_double(),
  Po1 = col_double(),
  Po2 = col_double(),
  LF = col_double(),
  M.F = col_double(),
  Pop = col_double(),
  NW = col_double(),
  U1 = col_double(),
  U2 = col_double(),
  Wealth = col_double(),
  Ineq = col_double(),
  Prob = col_double(),
  Time = col_double(),
  Crime = col_double()
)
Criminologists are interested in the effect of punishment regimes on crime rates. This has been studied using aggregate data on 47 states of the USA for 1960. The data set contains the following columns:
| Variable | Description |
|---|---|
| M | percentage of males aged 14–24 in total state population |
| So | indicator variable for a southern state |
| Ed | mean years of schooling of the population aged 25 years or over |
| Po1 | per capita expenditure on police protection in 1960 |
| Po2 | per capita expenditure on police protection in 1959 |
| LF | labour force participation rate of civilian urban males aged 14–24 |
| M.F | number of males per 100 females |
| Pop | state population in 1960, in hundred thousands |
| NW | percentage of nonwhites in the population |
| U1 | unemployment rate of urban males aged 14–24 |
| U2 | unemployment rate of urban males aged 35–39 |
| Wealth | median value of transferable assets or family income |
| Ineq | income inequality: percentage of families earning below half the median income |
| Prob | probability of imprisonment: ratio of number of commitments to number of offenses |
| Time | average time in months served by offenders in state prisons before their first release |
| Crime | crime rate: number of offenses per 100,000 population in 1960 |
#checking the data
head(crime)
| M | So | Ed | Po1 | Po2 | LF | M.F | Pop | NW | U1 | U2 | Wealth | Ineq | Prob | Time | Crime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 15.1 | 1 | 9.1 | 5.8 | 5.6 | 0.510 | 95.0 | 33 | 30.1 | 0.108 | 4.1 | 3940 | 26.1 | 0.084602 | 26.2011 | 791 |
| 14.3 | 0 | 11.3 | 10.3 | 9.5 | 0.583 | 101.2 | 13 | 10.2 | 0.096 | 3.6 | 5570 | 19.4 | 0.029599 | 25.2999 | 1635 |
| 14.2 | 1 | 8.9 | 4.5 | 4.4 | 0.533 | 96.9 | 18 | 21.9 | 0.094 | 3.3 | 3180 | 25.0 | 0.083401 | 24.3006 | 578 |
| 13.6 | 0 | 12.1 | 14.9 | 14.1 | 0.577 | 99.4 | 157 | 8.0 | 0.102 | 3.9 | 6730 | 16.7 | 0.015801 | 29.9012 | 1969 |
| 14.1 | 0 | 12.1 | 10.9 | 10.1 | 0.591 | 98.5 | 18 | 3.0 | 0.091 | 2.0 | 5780 | 17.4 | 0.041399 | 21.2998 | 1234 |
| 12.1 | 0 | 11.0 | 11.8 | 11.5 | 0.547 | 96.4 | 25 | 4.4 | 0.084 | 2.9 | 6890 | 12.6 | 0.034201 | 20.9995 | 682 |
#Looking at the structure of data
str(crime)
Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 47 obs. of 16 variables:
$ M : num 15.1 14.3 14.2 13.6 14.1 12.1 12.7 13.1 15.7 14 ...
$ So : num 1 0 1 0 0 0 1 1 1 0 ...
$ Ed : num 9.1 11.3 8.9 12.1 12.1 11 11.1 10.9 9 11.8 ...
$ Po1 : num 5.8 10.3 4.5 14.9 10.9 11.8 8.2 11.5 6.5 7.1 ...
$ Po2 : num 5.6 9.5 4.4 14.1 10.1 11.5 7.9 10.9 6.2 6.8 ...
$ LF : num 0.51 0.583 0.533 0.577 0.591 0.547 0.519 0.542 0.553 0.632 ...
$ M.F : num 95 101.2 96.9 99.4 98.5 ...
$ Pop : num 33 13 18 157 18 25 4 50 39 7 ...
$ NW : num 30.1 10.2 21.9 8 3 4.4 13.9 17.9 28.6 1.5 ...
$ U1 : num 0.108 0.096 0.094 0.102 0.091 0.084 0.097 0.079 0.081 0.1 ...
$ U2 : num 4.1 3.6 3.3 3.9 2 2.9 3.8 3.5 2.8 2.4 ...
$ Wealth: num 3940 5570 3180 6730 5780 6890 6200 4720 4210 5260 ...
$ Ineq : num 26.1 19.4 25 16.7 17.4 12.6 16.8 20.6 23.9 17.4 ...
$ Prob : num 0.0846 0.0296 0.0834 0.0158 0.0414 ...
$ Time : num 26.2 25.3 24.3 29.9 21.3 ...
$ Crime : num 791 1635 578 1969 1234 ...
- attr(*, "spec")=
.. cols(
..   M = col_double(),
..   So = col_double(),
..   Ed = col_double(),
..   Po1 = col_double(),
..   Po2 = col_double(),
..   LF = col_double(),
..   M.F = col_double(),
..   Pop = col_double(),
..   NW = col_double(),
..   U1 = col_double(),
..   U2 = col_double(),
..   Wealth = col_double(),
..   Ineq = col_double(),
..   Prob = col_double(),
..   Time = col_double(),
..   Crime = col_double()
.. )
Linear regression is a statistical model that describes the relationship between a response variable (often called y) and one or more explanatory variables (often called x), possibly including their interactions. The objective here is to find the relevant variables/predictors so that we have a good model for predicting the crime rate of a new data point.
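In symbols, with p predictors the model is y = β0 + β1·x1 + … + βp·xp + ε, where the β coefficients are estimated from the data (here by lm()) and ε is the error term.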
Simply put, the goal is to do meaningful feature selection.
Feature selection is an important task, for two reasons. First, we want to keep the model simple: simpler models are easier to interpret and tend to generalize better. Second, including insignificant variables can noticeably hurt the model's predictive performance.
One method of feature selection is hypothesis testing: checking whether each predictor has a statistically significant effect on the response.
We will first fit a simple linear model with lm() using all variables, examine the prediction, and see which predictors have a significant impact on the response.
set.seed(0)
# as simple linear model
model_1<-lm(Crime~.,data=crime)
summary(model_1)
Call:
lm(formula = Crime ~ ., data = crime)
Residuals:
Min 1Q Median 3Q Max
-395.74 -98.09 -6.69 112.99 512.67
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.984e+03 1.628e+03 -3.675 0.000893 ***
M 8.783e+01 4.171e+01 2.106 0.043443 *
So -3.803e+00 1.488e+02 -0.026 0.979765
Ed 1.883e+02 6.209e+01 3.033 0.004861 **
Po1 1.928e+02 1.061e+02 1.817 0.078892 .
Po2 -1.094e+02 1.175e+02 -0.931 0.358830
LF -6.638e+02 1.470e+03 -0.452 0.654654
M.F 1.741e+01 2.035e+01 0.855 0.398995
Pop -7.330e-01 1.290e+00 -0.568 0.573845
NW 4.204e+00 6.481e+00 0.649 0.521279
U1 -5.827e+03 4.210e+03 -1.384 0.176238
U2 1.678e+02 8.234e+01 2.038 0.050161 .
Wealth 9.617e-02 1.037e-01 0.928 0.360754
Ineq 7.067e+01 2.272e+01 3.111 0.003983 **
Prob -4.855e+03 2.272e+03 -2.137 0.040627 *
Time -3.479e+00 7.165e+00 -0.486 0.630708
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 209.1 on 31 degrees of freedom
Multiple R-squared: 0.8031, Adjusted R-squared: 0.7078
F-statistic: 8.429 on 15 and 31 DF, p-value: 3.539e-07
predict(model_1,test)
1: 155.434896887446
range(crime$Crime)
The observed crime rates in the data set range from 342 to 1993, so the predicted value of about 155 is well below the minimum. It might be possible for a data point to have such a low crime rate. However, we need to examine our regression model, since with only 47 observations and 15 predictors it is very prone to overfitting.
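As a quick sanity check (a sketch, not part of the original analysis), we could estimate the full model's out-of-sample performance with cross-validation via caret, which typically gives a much less optimistic picture than the training fit:
set.seed(0)
#cross-validated estimate of model_1's out-of-sample performance (5-fold CV)
cv_fit <- train(Crime ~ ., data = crime, method = "lm",
                trControl = trainControl(method = "cv", number = 5))
cv_fit$results #RMSE and Rsquared here are resampled estimates, not training fit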
We have used all predictors in this model. Looking at the coefficients in the model summary, we see that not every variable has a significant impact on the response. We can remove the variables whose coefficients have a p-value of 0.05 or more.
#filtering variables with significant impact on response
data.frame(summary(model_1)$coef[summary(model_1)$coef[,4] <= .05, 4])
| Variable | p-value |
|---|---|
| (Intercept) | 0.0008929887 |
| M | 0.0434433942 |
| Ed | 0.0048614327 |
| Ineq | 0.0039831365 |
| Prob | 0.0406269260 |
A p-value indicates whether you can reject the hypothesis being tested. The hypothesis in this case (the null hypothesis) is that the predictor is not meaningful for the model.
Strictly, a p-value is the probability of seeing a coefficient estimate at least this far from zero if the predictor truly had no effect, so a large p-value (say 0.85) means the data give essentially no evidence that the predictor matters. A standard rule of thumb is to check whether a p-value is larger than 0.05. If so, we assume that predictor does not add to the predictive power of the model, and we can remove it.
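As a sketch, the same selection rule can be applied programmatically; this reproduces the manual formula used in the refit below:
#sketch: keep predictors with p-values below 0.05 and rebuild the formula
pvals <- summary(model_1)$coefficients[-1, 4] #drop the intercept row
sig_vars <- names(which(pvals < 0.05))
reduced_formula <- reformulate(sig_vars, response = "Crime")
reduced_formula #Crime ~ M + Ed + Ineq + Prob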
set.seed(0)
model_2<-lm(Crime~M+Ed+Ineq+Prob,data=crime,x=TRUE,y=TRUE)
summary(model_2)
Call:
lm(formula = Crime ~ M + Ed + Ineq + Prob, data = crime, x = TRUE,
y = TRUE)
Residuals:
Min 1Q Median 3Q Max
-532.97 -254.03 -55.72 137.80 960.21
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1339.35 1247.01 -1.074 0.28893
M 35.97 53.39 0.674 0.50417
Ed 148.61 71.92 2.066 0.04499 *
Ineq 26.87 22.77 1.180 0.24458
Prob -7331.92 2560.27 -2.864 0.00651 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 347.5 on 42 degrees of freedom
Multiple R-squared: 0.2629, Adjusted R-squared: 0.1927
F-statistic: 3.745 on 4 and 42 DF, p-value: 0.01077
predict(model_2,test)
1: 897.230683687278
This prediction makes more sense in the context of our data set. If we look at the Adjusted R-squared, we notice it is significantly lower in this model. The higher Adjusted R-squared of the first model does not mean it was a better one: including many variables, some of which are not significant for prediction, can overfit the model and inflate its apparent fit to the training data. Removing variables lowers the Adjusted R-squared, but that by itself tells us nothing about the quality of the model on new data.
Although we have selected the variables with small p-values, that does not necessarily mean the other variables are not meaningful. Sometimes, when two variables are strongly correlated, the model picks one of them as an important predictor and the other receives a high p-value.
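A quick check of pairwise correlations illustrates this; for example, Po1 and Po2 (police expenditure in 1960 and 1959) are very strongly correlated in this data set:
#checking correlations among a few predictor pairs that plausibly overlap
round(cor(crime[, c("Po1", "Po2", "U1", "U2", "Wealth", "Ineq")]), 2)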
We will try using a different method to train our model and see if we can improve the predictive power.
Recursive feature elimination (RFE) is a technique in which a model is built with all the variables, and the algorithm then removes the weakest features one by one until the specified number of features is reached. When using RFE, we need to specify the number of features to keep; however, this is often not known in advance. The optimal number of features can be identified using cross-validation.
library(caret)
set.seed(0)
# Setting the cross validation parameters
ctrl_param <- rfeControl(functions = lmFuncs,
method = "repeatedcv",
number=10,
repeats = 25,
verbose = FALSE)
lm_rfe <- rfe(crime[,-16], crime[[16]],
sizes = c(1:15),
rfeControl = ctrl_param)
lm_rfe
Recursive feature selection
Outer resampling method: Cross-Validated (10 fold, repeated 25 times)
Resampling performance over subset size:
Variables RMSE Rsquared MAE RMSESD RsquaredSD MAESD Selected
1 359.9 0.3361 291.2 131.91 0.2926 100.05
2 344.1 0.3539 276.5 135.96 0.2889 111.02
3 351.9 0.3213 285.4 133.22 0.2895 109.99
4 333.3 0.4145 277.8 101.94 0.3139 91.01
5 316.1 0.4769 261.2 97.50 0.3125 86.74
6 306.8 0.4853 253.9 94.46 0.3123 84.85
7 307.3 0.4975 255.1 93.07 0.3079 83.56
8 268.5 0.5692 222.7 84.48 0.2999 72.92
9 234.7 0.6446 189.7 94.01 0.2886 76.30
10 230.6 0.6572 186.9 84.12 0.2666 67.17 *
11 235.9 0.6333 190.9 90.45 0.2748 69.68
12 248.5 0.6093 199.5 91.42 0.2867 72.36
13 253.0 0.6070 204.2 93.88 0.2896 75.76
14 259.4 0.5948 208.3 95.69 0.2928 77.11
15 263.6 0.5878 213.5 93.76 0.2866 78.58
The top 5 variables (out of 10):
U1, Prob, LF, Po1, Ed
Looking at the RMSEs, selecting 10 features results in the lowest RMSE and the highest R-squared, so the model is fitted with the 10 strongest features.
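To make this choice visible, caret provides a plot method for rfe objects that shows the resampled RMSE profile against subset size (a sketch):
#plotting cross-validated RMSE against the number of retained features
plot(lm_rfe, type = c("g", "o"))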
Below we can see which features have been selected:
predictors(lm_rfe)
lm_rfe$fit
Call:
lm(formula = y ~ ., data = tmp)
Coefficients:
(Intercept) U1 Prob LF Po1 Ed
-5099.78 -2925.15 -4000.57 531.84 177.49 210.85
U2 Po2 M Ineq So
150.25 -79.51 100.88 58.80 78.48
predict(lm_rfe,test)
1: 870.683361434229
Using cross-validation and the RFE method, we achieve a better model with a decent value for R-squared, and the predicted crime rate is in an acceptable range given our data. There are many other ways to build a linear regression model, but they are beyond the scope of this homework.
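For completeness, one such alternative (shown only as a sketch, not pursued here) is backward stepwise selection by AIC, available in base R:
#sketch: backward stepwise selection by AIC as an alternative to RFE
step_model <- step(lm(Crime ~ ., data = crime), direction = "backward", trace = 0)
summary(step_model)
predict(step_model, test)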