ISYE6501x Homework 5

Question 8.1

Describe a situation or problem from your job, everyday life, current events, etc., for which a linear regression model would be appropriate. List some (up to 5) predictors that you might use.

Answer:

One application of a regression model could be estimating employee salaries. For example, if a company wants to hire a new employee and is wondering what salary would be appropriate to offer, a regression model can be built to predict it. Some relevant predictors for building the model are level of education, years of experience, title/position, location, and so on.
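
As a rough illustration (the salaries data frame and its column names below are hypothetical, not real data), such a model could be fit in R with lm():

# hypothetical historical salary data, one row per current employee
# salaries <- read.csv("salaries.csv")
salary_model <- lm(Salary ~ Education + Experience + Title + Location,
                   data = salaries)
summary(salary_model)

# predicted offer for a new candidate with the given qualifications
predict(salary_model, newdata = data.frame(Education = "Masters",
                                           Experience = 5,
                                           Title = "Analyst",
                                           Location = "Atlanta"))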

Question 8.2

Using crime data from http://www.statsci.org/data/general/uscrime.txt (file uscrime.txt, description at http://www.statsci.org/data/general/uscrime.html ), use regression (a useful R function is lm or glm) to predict the observed crime rate in a city with the following data:

test<-data.frame(M = 14.0,So = 0,Ed = 10.0, Po1 = 12.0,Po2 = 15.5,
                 LF = 0.640, M.F = 94.0,Pop = 150,NW = 1.1,U1 = 0.120,
                 U2 = 3.6, Wealth = 3200,Ineq = 20.1,Prob = 0.04, Time = 39.0)

Loading Libraries

options(warn=-1)   # suppress warning messages in the output
library(ggplot2)
library(tidyverse)
library(caret)
#loading data
crime=read_tsv("uscrime.txt")
Parsed with column specification:
cols(
  M = col_double(),
  So = col_double(),
  Ed = col_double(),
  Po1 = col_double(),
  Po2 = col_double(),
  LF = col_double(),
  M.F = col_double(),
  Pop = col_double(),
  NW = col_double(),
  U1 = col_double(),
  U2 = col_double(),
  Wealth = col_double(),
  Ineq = col_double(),
  Prob = col_double(),
  Time = col_double(),
  Crime = col_double()
)

Criminologists are interested in the effect of punishment regimes on crime rates. This has been studied using aggregate data on 47 states of the USA for 1960. The data set contains the following columns:

Variable   Description
M       percentage of males aged 14–24 in total state population
So      indicator variable for a southern state
Ed      mean years of schooling of the population aged 25 years or over
Po1     per capita expenditure on police protection in 1960
Po2     per capita expenditure on police protection in 1959
LF      labour force participation rate of civilian urban males in the age-group 14-24
M.F     number of males per 100 females
Pop     state population in 1960 in hundred thousands
NW      percentage of nonwhites in the population
U1      unemployment rate of urban males 14–24
U2      unemployment rate of urban males 35–39
Wealth  wealth: median value of transferable assets or family income
Ineq    income inequality: percentage of families earning below half the median income
Prob    probability of imprisonment: ratio of number of commitments to number of offenses
Time    average time in months served by offenders in state prisons before their first release
Crime       crime rate: number of offenses per 100,000 population in 1960
#checking the data
head(crime)
A tibble: 6 × 16
M So Ed Po1 Po2 LF M.F Pop NW U1 U2 Wealth Ineq Prob Time Crime
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
15.1 1 9.1 5.8 5.6 0.510 95.0 33 30.1 0.108 4.1 3940 26.1 0.084602 26.2011 791
14.3 0 11.3 10.3 9.5 0.583 101.2 13 10.2 0.096 3.6 5570 19.4 0.029599 25.2999 1635
14.2 1 8.9 4.5 4.4 0.533 96.9 18 21.9 0.094 3.3 3180 25.0 0.083401 24.3006 578
13.6 0 12.1 14.9 14.1 0.577 99.4 157 8.0 0.102 3.9 6730 16.7 0.015801 29.9012 1969
14.1 0 12.1 10.9 10.1 0.591 98.5 18 3.0 0.091 2.0 5780 17.4 0.041399 21.2998 1234
12.1 0 11.0 11.8 11.5 0.547 96.4 25 4.4 0.084 2.9 6890 12.6 0.034201 20.9995 682
#Looking at the structure of data
str(crime)
Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame':    47 obs. of  16 variables:
 $ M     : num  15.1 14.3 14.2 13.6 14.1 12.1 12.7 13.1 15.7 14 ...
 $ So    : num  1 0 1 0 0 0 1 1 1 0 ...
 $ Ed    : num  9.1 11.3 8.9 12.1 12.1 11 11.1 10.9 9 11.8 ...
 $ Po1   : num  5.8 10.3 4.5 14.9 10.9 11.8 8.2 11.5 6.5 7.1 ...
 $ Po2   : num  5.6 9.5 4.4 14.1 10.1 11.5 7.9 10.9 6.2 6.8 ...
 $ LF    : num  0.51 0.583 0.533 0.577 0.591 0.547 0.519 0.542 0.553 0.632 ...
 $ M.F   : num  95 101.2 96.9 99.4 98.5 ...
 $ Pop   : num  33 13 18 157 18 25 4 50 39 7 ...
 $ NW    : num  30.1 10.2 21.9 8 3 4.4 13.9 17.9 28.6 1.5 ...
 $ U1    : num  0.108 0.096 0.094 0.102 0.091 0.084 0.097 0.079 0.081 0.1 ...
 $ U2    : num  4.1 3.6 3.3 3.9 2 2.9 3.8 3.5 2.8 2.4 ...
 $ Wealth: num  3940 5570 3180 6730 5780 6890 6200 4720 4210 5260 ...
 $ Ineq  : num  26.1 19.4 25 16.7 17.4 12.6 16.8 20.6 23.9 17.4 ...
 $ Prob  : num  0.0846 0.0296 0.0834 0.0158 0.0414 ...
 $ Time  : num  26.2 25.3 24.3 29.9 21.3 ...
 $ Crime : num  791 1635 578 1969 1234 ...
 - attr(*, "spec")=
  .. cols(
  ..   M = col_double(),
  ..   So = col_double(),
  ..   Ed = col_double(),
  ..   Po1 = col_double(),
  ..   Po2 = col_double(),
  ..   LF = col_double(),
  ..   M.F = col_double(),
  ..   Pop = col_double(),
  ..   NW = col_double(),
  ..   U1 = col_double(),
  ..   U2 = col_double(),
  ..   Wealth = col_double(),
  ..   Ineq = col_double(),
  ..   Prob = col_double(),
  ..   Time = col_double(),
  ..   Crime = col_double()
  .. )

A linear regression is a statistical model that describes the relationship between a response variable (often called y) and one or more explanatory variables (often called x), possibly including their interactions. The objective here is to find the relevant variables/predictors so that the model predicts the crime rate of a new data point well.
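
In equation form, the model we are about to fit has the shape Crime = β0 + β1·M + β2·So + … + β15·Time + ε, where lm() estimates the coefficients β0 through β15 by least squares.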

Simply put, the goal is to do a meaningful feature selection.

Feature selection is an important task, for two reasons. First, we want to keep the model simple and interpretable. Second, including insignificant variables can noticeably hurt the model's performance on new data.

Feature Selection using Hypothesis Testing

One method of feature selection is hypothesis testing: for each predictor, we check whether it has a statistically significant relationship with the response variable.

  • Null Hypothesis: the variable has no relationship with the output variable.
  • Alternative Hypothesis: the variable has a relationship with the output variable and affects the response prediction.

We will fit a simple linear model with lm() using all variables, examine the resulting fit, and see which predictors have a significant impact on the response.

set.seed(0)
# as simple linear model
model_1<-lm(Crime~.,data=crime)

summary(model_1)
Call:
lm(formula = Crime ~ ., data = crime)

Residuals:
    Min      1Q  Median      3Q     Max 
-395.74  -98.09   -6.69  112.99  512.67 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5.984e+03  1.628e+03  -3.675 0.000893 ***
M            8.783e+01  4.171e+01   2.106 0.043443 *  
So          -3.803e+00  1.488e+02  -0.026 0.979765    
Ed           1.883e+02  6.209e+01   3.033 0.004861 ** 
Po1          1.928e+02  1.061e+02   1.817 0.078892 .  
Po2         -1.094e+02  1.175e+02  -0.931 0.358830    
LF          -6.638e+02  1.470e+03  -0.452 0.654654    
M.F          1.741e+01  2.035e+01   0.855 0.398995    
Pop         -7.330e-01  1.290e+00  -0.568 0.573845    
NW           4.204e+00  6.481e+00   0.649 0.521279    
U1          -5.827e+03  4.210e+03  -1.384 0.176238    
U2           1.678e+02  8.234e+01   2.038 0.050161 .  
Wealth       9.617e-02  1.037e-01   0.928 0.360754    
Ineq         7.067e+01  2.272e+01   3.111 0.003983 ** 
Prob        -4.855e+03  2.272e+03  -2.137 0.040627 *  
Time        -3.479e+00  7.165e+00  -0.486 0.630708    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 209.1 on 31 degrees of freedom
Multiple R-squared:  0.8031,    Adjusted R-squared:  0.7078 
F-statistic: 8.429 on 15 and 31 DF,  p-value: 3.539e-07
predict(model_1,test)

1: 155.434896887446

range(crime$Crime)
  1. 342
  2. 1993

The predicted crime rate is well below the minimum crime rate in our data set. It is conceivable that a new data point could have such a low crime rate, but it is more likely a sign that we need to re-examine the model: a regression with all 15 predictors and only 47 observations is very prone to overfitting.

We used all predictors in this model. Looking at the coefficients in the model summary, we see that not every variable has a significant impact on the response. We can remove the variables whose coefficients have a p-value of 0.05 or more.

#filtering variables with significant impact on response
data.frame(summary(model_1)$coef[summary(model_1)$coef[,4] <= .05, 4])
A data.frame: 5 × 1
summary.model_1..coef.summary.model_1..coef…4…..0.05..4.
<dbl>
(Intercept) 0.0008929887
M 0.0434433942
Ed 0.0048614327
Ineq 0.0039831365
Prob 0.0406269260

A p-value indicates how much evidence the data provide against a null hypothesis; here, the null hypothesis is that the predictor is not meaningful for the model.

For example, a p-value of 0.85 means that, if the predictor truly had no effect, we would still see an estimated coefficient at least this extreme about 85% of the time, so the data give essentially no evidence that the predictor matters. A standard rule of thumb is to flag predictors whose p-values are larger than 0.05; we assume those predictors do not add to the predictive power of the model and remove them.

set.seed(0)
model_2<-lm(Crime~M+Ed+Ineq+Prob,data=crime,x=TRUE,y=TRUE)
summary(model_2)
Call:
lm(formula = Crime ~ M + Ed + Ineq + Prob, data = crime, x = TRUE, 
    y = TRUE)

Residuals:
    Min      1Q  Median      3Q     Max 
-532.97 -254.03  -55.72  137.80  960.21 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) -1339.35    1247.01  -1.074  0.28893   
M              35.97      53.39   0.674  0.50417   
Ed            148.61      71.92   2.066  0.04499 * 
Ineq           26.87      22.77   1.180  0.24458   
Prob        -7331.92    2560.27  -2.864  0.00651 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 347.5 on 42 degrees of freedom
Multiple R-squared:  0.2629,    Adjusted R-squared:  0.1927 
F-statistic: 3.745 on 4 and 42 DF,  p-value: 0.01077
predict(model_2,test)

1: 897.230683687278

This prediction makes more sense in the context of our data set. If we look at the adjusted R-squared, however, it is much lower for this model. The higher adjusted R-squared of the first model does not mean it was a better one: with only 47 observations, including many variables, some of them insignificant, lets the model fit noise in the training data and inflates the (adjusted) R-squared. Removing variables lowers the adjusted R-squared on the training data, but that by itself tells us little about the quality of the model on new data.
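
To get a less optimistic estimate of model quality than the training-set R-squared, one option is cross-validation. Below is a minimal sketch using caret (already loaded above) that refits the reduced model with 10-fold cross-validation; the exact numbers will depend on the random folds.

set.seed(0)
# estimate out-of-sample RMSE and R-squared for the reduced model
cv_ctrl <- trainControl(method = "cv", number = 10)
model_2_cv <- train(Crime ~ M + Ed + Ineq + Prob, data = crime,
                    method = "lm", trControl = cv_ctrl)
model_2_cv$results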

Although we have selected the variables with small p-values, that does not necessarily mean the other variables are not meaningful. When two predictors are strongly correlated, the model may pick one of them up as an important predictor while the other receives a large p-value.
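
One simple way to check this is a correlation matrix of the predictors. For instance, Po1 and Po2 are per capita police expenditure in consecutive years (1960 and 1959), so we would expect them to be strongly correlated:

# pairwise correlations among predictors that measure similar quantities
round(cor(crime[, c("Po1", "Po2", "U1", "U2", "Wealth", "Ineq")]), 2)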

We will try using a different method to train our model and see if we can improve the predictive power.

Feature Selection using Recursive Feature Elimination method

Recursive feature elimination (RFE) is a technique in which a model is built with all the variables and then the weakest features are removed one by one until the specified number of features is reached. When using RFE we need to specify how many features to keep, which is usually not known in advance; the optimal number of features can be identified using cross-validation.

library(caret)
set.seed(0)

# Setting the cross validation parameters
ctrl_param <- rfeControl(functions = lmFuncs,
                   method = "repeatedcv",
                   number=10,
                   repeats = 25,
                   verbose = FALSE)
                  
lm_rfe <- rfe(crime[,-16], crime[[16]],
                 sizes = c(1:15),
                 rfeControl = ctrl_param)
                

lm_rfe
Recursive feature selection

Outer resampling method: Cross-Validated (10 fold, repeated 25 times) 

Resampling performance over subset size:

 Variables  RMSE Rsquared   MAE RMSESD RsquaredSD  MAESD Selected
         1 359.9   0.3361 291.2 131.91     0.2926 100.05         
         2 344.1   0.3539 276.5 135.96     0.2889 111.02         
         3 351.9   0.3213 285.4 133.22     0.2895 109.99         
         4 333.3   0.4145 277.8 101.94     0.3139  91.01         
         5 316.1   0.4769 261.2  97.50     0.3125  86.74         
         6 306.8   0.4853 253.9  94.46     0.3123  84.85         
         7 307.3   0.4975 255.1  93.07     0.3079  83.56         
         8 268.5   0.5692 222.7  84.48     0.2999  72.92         
         9 234.7   0.6446 189.7  94.01     0.2886  76.30         
        10 230.6   0.6572 186.9  84.12     0.2666  67.17        *
        11 235.9   0.6333 190.9  90.45     0.2748  69.68         
        12 248.5   0.6093 199.5  91.42     0.2867  72.36         
        13 253.0   0.6070 204.2  93.88     0.2896  75.76         
        14 259.4   0.5948 208.3  95.69     0.2928  77.11         
        15 263.6   0.5878 213.5  93.76     0.2866  78.58         

The top 5 variables (out of 10):
   U1, Prob, LF, Po1, Ed

Looking at the RMSEs, selecting 10 features gives the lowest RMSE and the highest R-squared, so the model is fitted with the 10 strongest features.
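
We can also confirm this visually; caret provides a lattice-based plot method for rfe objects that shows the resampling profile:

# cross-validated RMSE versus the number of retained features
plot(lm_rfe, type = c("g", "o"))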

Below we can see which features have been selected:

predictors(lm_rfe)
  1. ‘U1’
  2. ‘Prob’
  3. ‘LF’
  4. ‘Po1’
  5. ‘Ed’
  6. ‘U2’
  7. ‘Po2’
  8. ‘M’
  9. ‘Ineq’
  10. ‘So’
lm_rfe$fit
Call:
lm(formula = y ~ ., data = tmp)

Coefficients:
(Intercept)           U1         Prob           LF          Po1           Ed  
   -5099.78     -2925.15     -4000.57       531.84       177.49       210.85  
         U2          Po2            M         Ineq           So  
     150.25       -79.51       100.88        58.80        78.48  
predict(lm_rfe,test)

1: 870.683361434229

Conclusion:

Using cross-validation and the RFE method, we obtain a better model with a decent R-squared, and the predicted crime rate falls in an acceptable range given our data. There are many other ways to build a linear regression model, but they are outside the scope of this homework.