ISYE 6501 Homework 8

Using the crime data set uscrime.txt from Questions 8.2, 9.1, and 10.1, build a regression model using:
1. Stepwise regression
2. Lasso
3. Elastic net
For Parts 2 and 3, remember to scale the data first – otherwise, the regression coefficients will be on different scales and the constraint won’t have the desired effect.

For Parts 2 and 3, use the glmnet function in R.

Notes on R:
• For the elastic net model, what we called λ in the videos, glmnet calls “alpha”; you can get a range of results by varying alpha from 1 (lasso) to 0 (ridge regression) [and, of course, other values of alpha in between].
• In a function call like glmnet(x,y,family=”mgaussian”,alpha=1) the predictors x need to be in R’s matrix format, rather than data frame format. You can convert a data frame to a matrix using as.matrix – for example, x <- as.matrix(data[,1:n-1])
• Rather than specifying a value of T, glmnet returns models for a variety of values of T.

Methodology- To begin this, we want to create a standard model to reference our r-squared values with. Once we create our standard model with all variables, we utilize the step function to preform a stepwise variable selection on our model to see which variables it added and which variables it dropped. Accordingly, we will then scale our data to preform our lasso and elastic net variable selections via the glmnet models. For both of these models, we will select the lambda which produces the smallest error term for our model. Then, we will manually calculate our r squared value. Since our elastic net variable selection is a combination of ridge selection and lasso selection, we are going to test 19 values of 0<alpha<1 to see which values produce the best r squared value.

library(glmnet)

## Warning: package 'glmnet' was built under R version 4.5.1

## Loading required package: Matrix

## Loaded glmnet 4.1-10

crime_data = read.table("C:/Users/james/OneDrive/Desktop/Georgia Tech/ISYE 6501/Homeworks/Homework 8/Homework8_ISYE6501/Homework8_ISYE6501/data 11.1/uscrime.txt", header = TRUE)

head(crime_data)

set.seed(777)

round(nrow(crime_data)*.7)

## [1] 33

crime_training_rows = sample(nrow(crime_data), round(nrow(crime_data)*.7))

crime_training = crime_data[crime_training_rows,]

crime_valandtest = crime_data[-crime_training_rows,]

crime_validation_rows = sample(nrow(crime_valandtest), round(nrow(crime_valandtest)/2))

crime_validation = crime_valandtest[crime_validation_rows,]

crime_testing = crime_valandtest[-crime_validation_rows,]

Let us start by making our standard lm model for reference.

sample_mod = lm(Crime~., data = crime_data)

summary(sample_mod)

## 
## Call:
## lm(formula = Crime ~ ., data = crime_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -395.74  -98.09   -6.69  112.99  512.67 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.984e+03  1.628e+03  -3.675 0.000893 ***
## M            8.783e+01  4.171e+01   2.106 0.043443 *  
## So          -3.803e+00  1.488e+02  -0.026 0.979765    
## Ed           1.883e+02  6.209e+01   3.033 0.004861 ** 
## Po1          1.928e+02  1.061e+02   1.817 0.078892 .  
## Po2         -1.094e+02  1.175e+02  -0.931 0.358830    
## LF          -6.638e+02  1.470e+03  -0.452 0.654654    
## M.F          1.741e+01  2.035e+01   0.855 0.398995    
## Pop         -7.330e-01  1.290e+00  -0.568 0.573845    
## NW           4.204e+00  6.481e+00   0.649 0.521279    
## U1          -5.827e+03  4.210e+03  -1.384 0.176238    
## U2           1.678e+02  8.234e+01   2.038 0.050161 .  
## Wealth       9.617e-02  1.037e-01   0.928 0.360754    
## Ineq         7.067e+01  2.272e+01   3.111 0.003983 ** 
## Prob        -4.855e+03  2.272e+03  -2.137 0.040627 *  
## Time        -3.479e+00  7.165e+00  -0.486 0.630708    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 209.1 on 31 degrees of freedom
## Multiple R-squared:  0.8031, Adjusted R-squared:  0.7078 
## F-statistic: 8.429 on 15 and 31 DF,  p-value: 3.539e-07

stepwise_model = step(sample_mod, direction = "both")

## Start:  AIC=514.65
## Crime ~ M + So + Ed + Po1 + Po2 + LF + M.F + Pop + NW + U1 + 
##     U2 + Wealth + Ineq + Prob + Time
## 
##          Df Sum of Sq     RSS    AIC
## - So      1        29 1354974 512.65
## - LF      1      8917 1363862 512.96
## - Time    1     10304 1365250 513.00
## - Pop     1     14122 1369068 513.14
## - NW      1     18395 1373341 513.28
## - M.F     1     31967 1386913 513.74
## - Wealth  1     37613 1392558 513.94
## - Po2     1     37919 1392865 513.95
## <none>                1354946 514.65
## - U1      1     83722 1438668 515.47
## - Po1     1    144306 1499252 517.41
## - U2      1    181536 1536482 518.56
## - M       1    193770 1548716 518.93
## - Prob    1    199538 1554484 519.11
## - Ed      1    402117 1757063 524.86
## - Ineq    1    423031 1777977 525.42
## 
## Step:  AIC=512.65
## Crime ~ M + Ed + Po1 + Po2 + LF + M.F + Pop + NW + U1 + U2 + 
##     Wealth + Ineq + Prob + Time
## 
##          Df Sum of Sq     RSS    AIC
## - Time    1     10341 1365315 511.01
## - LF      1     10878 1365852 511.03
## - Pop     1     14127 1369101 511.14
## - NW      1     21626 1376600 511.39
## - M.F     1     32449 1387423 511.76
## - Po2     1     37954 1392929 511.95
## - Wealth  1     39223 1394197 511.99
## <none>                1354974 512.65
## - U1      1     96420 1451395 513.88
## + So      1        29 1354946 514.65
## - Po1     1    144302 1499277 515.41
## - U2      1    189859 1544834 516.81
## - M       1    195084 1550059 516.97
## - Prob    1    204463 1559437 517.26
## - Ed      1    403140 1758114 522.89
## - Ineq    1    488834 1843808 525.13
## 
## Step:  AIC=511.01
## Crime ~ M + Ed + Po1 + Po2 + LF + M.F + Pop + NW + U1 + U2 + 
##     Wealth + Ineq + Prob
## 
##          Df Sum of Sq     RSS    AIC
## - LF      1     10533 1375848 509.37
## - NW      1     15482 1380797 509.54
## - Pop     1     21846 1387161 509.75
## - Po2     1     28932 1394247 509.99
## - Wealth  1     36070 1401385 510.23
## - M.F     1     41784 1407099 510.42
## <none>                1365315 511.01
## - U1      1     91420 1456735 512.05
## + Time    1     10341 1354974 512.65
## + So      1        65 1365250 513.00
## - Po1     1    134137 1499452 513.41
## - U2      1    184143 1549458 514.95
## - M       1    186110 1551425 515.01
## - Prob    1    237493 1602808 516.54
## - Ed      1    409448 1774763 521.33
## - Ineq    1    502909 1868224 523.75
## 
## Step:  AIC=509.37
## Crime ~ M + Ed + Po1 + Po2 + M.F + Pop + NW + U1 + U2 + Wealth + 
##     Ineq + Prob
## 
##          Df Sum of Sq     RSS    AIC
## - NW      1     11675 1387523 507.77
## - Po2     1     21418 1397266 508.09
## - Pop     1     27803 1403651 508.31
## - M.F     1     31252 1407100 508.42
## - Wealth  1     35035 1410883 508.55
## <none>                1375848 509.37
## - U1      1     80954 1456802 510.06
## + LF      1     10533 1365315 511.01
## + Time    1      9996 1365852 511.03
## + So      1      3046 1372802 511.26
## - Po1     1    123896 1499744 511.42
## - U2      1    190746 1566594 513.47
## - M       1    217716 1593564 514.27
## - Prob    1    226971 1602819 514.54
## - Ed      1    413254 1789103 519.71
## - Ineq    1    500944 1876792 521.96
## 
## Step:  AIC=507.77
## Crime ~ M + Ed + Po1 + Po2 + M.F + Pop + U1 + U2 + Wealth + Ineq + 
##     Prob
## 
##          Df Sum of Sq     RSS    AIC
## - Po2     1     16706 1404229 506.33
## - Pop     1     25793 1413315 506.63
## - M.F     1     26785 1414308 506.66
## - Wealth  1     31551 1419073 506.82
## <none>                1387523 507.77
## - U1      1     83881 1471404 508.52
## + NW      1     11675 1375848 509.37
## + So      1      7207 1380316 509.52
## + LF      1      6726 1380797 509.54
## + Time    1      4534 1382989 509.61
## - Po1     1    118348 1505871 509.61
## - U2      1    201453 1588976 512.14
## - Prob    1    216760 1604282 512.59
## - M       1    309214 1696737 515.22
## - Ed      1    402754 1790276 517.74
## - Ineq    1    589736 1977259 522.41
## 
## Step:  AIC=506.33
## Crime ~ M + Ed + Po1 + M.F + Pop + U1 + U2 + Wealth + Ineq + 
##     Prob
## 
##          Df Sum of Sq     RSS    AIC
## - Pop     1     22345 1426575 505.07
## - Wealth  1     32142 1436371 505.39
## - M.F     1     36808 1441037 505.54
## <none>                1404229 506.33
## - U1      1     86373 1490602 507.13
## + Po2     1     16706 1387523 507.77
## + NW      1      6963 1397266 508.09
## + So      1      3807 1400422 508.20
## + LF      1      1986 1402243 508.26
## + Time    1       575 1403654 508.31
## - U2      1    205814 1610043 510.76
## - Prob    1    218607 1622836 511.13
## - M       1    307001 1711230 513.62
## - Ed      1    389502 1793731 515.83
## - Ineq    1    608627 2012856 521.25
## - Po1     1   1050202 2454432 530.57
## 
## Step:  AIC=505.07
## Crime ~ M + Ed + Po1 + M.F + U1 + U2 + Wealth + Ineq + Prob
## 
##          Df Sum of Sq     RSS    AIC
## - Wealth  1     26493 1453068 503.93
## <none>                1426575 505.07
## - M.F     1     84491 1511065 505.77
## - U1      1     99463 1526037 506.24
## + Pop     1     22345 1404229 506.33
## + Po2     1     13259 1413315 506.63
## + NW      1      5927 1420648 506.87
## + So      1      5724 1420851 506.88
## + LF      1      5176 1421398 506.90
## + Time    1      3913 1422661 506.94
## - Prob    1    198571 1625145 509.20
## - U2      1    208880 1635455 509.49
## - M       1    320926 1747501 512.61
## - Ed      1    386773 1813348 514.35
## - Ineq    1    594779 2021354 519.45
## - Po1     1   1127277 2553852 530.44
## 
## Step:  AIC=503.93
## Crime ~ M + Ed + Po1 + M.F + U1 + U2 + Ineq + Prob
## 
##          Df Sum of Sq     RSS    AIC
## <none>                1453068 503.93
## + Wealth  1     26493 1426575 505.07
## - M.F     1    103159 1556227 505.16
## + Pop     1     16697 1436371 505.39
## + Po2     1     14148 1438919 505.47
## + So      1      9329 1443739 505.63
## + LF      1      4374 1448694 505.79
## + NW      1      3799 1449269 505.81
## + Time    1      2293 1450775 505.86
## - U1      1    127044 1580112 505.87
## - Prob    1    247978 1701046 509.34
## - U2      1    255443 1708511 509.55
## - M       1    296790 1749858 510.67
## - Ed      1    445788 1898855 514.51
## - Ineq    1    738244 2191312 521.24
## - Po1     1   1672038 3125105 537.93

summary(stepwise_model)

## 
## Call:
## lm(formula = Crime ~ M + Ed + Po1 + M.F + U1 + U2 + Ineq + Prob, 
##     data = crime_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -444.70 -111.07    3.03  122.15  483.30 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6426.10    1194.61  -5.379 4.04e-06 ***
## M              93.32      33.50   2.786  0.00828 ** 
## Ed            180.12      52.75   3.414  0.00153 ** 
## Po1           102.65      15.52   6.613 8.26e-08 ***
## M.F            22.34      13.60   1.642  0.10874    
## U1          -6086.63    3339.27  -1.823  0.07622 .  
## U2            187.35      72.48   2.585  0.01371 *  
## Ineq           61.33      13.96   4.394 8.63e-05 ***
## Prob        -3796.03    1490.65  -2.547  0.01505 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 195.5 on 38 degrees of freedom
## Multiple R-squared:  0.7888, Adjusted R-squared:  0.7444 
## F-statistic: 17.74 on 8 and 38 DF,  p-value: 1.159e-10

scaled_crime = cbind(as.data.frame(scale(crime_data[,1])),
                     as.data.frame(crime_data[,2]),
                     as.data.frame(scale(crime_data[,-c(1,2,16)])),
                     as.data.frame(crime_data[,16]))

colnames(scaled_crime)[1] = 'M'
colnames(scaled_crime)[2] = 'So'
colnames(scaled_crime)[16] = 'Crime'
scaled_crime

Here, we create our lasso model and select the best lambda value.

x <- as.matrix(scaled_crime[, -16])
y <- scaled_crime[, 16]

lasso_model <- cv.glmnet(x, y, alpha = 1, family = "gaussian")  

lasso_model$lambda.min

## [1] 5.2862

lasso_model$lambda.1se

## [1] 28.21086

coef(lasso_model, s = "lambda.min")

## 16 x 1 sparse Matrix of class "dgCMatrix"
##             lambda.min
## (Intercept)  892.67157
## M             99.58912
## So            36.46475
## Ed           162.79540
## Po1          298.16510
## Po2            .      
## LF             .      
## M.F           54.24080
## Pop          -10.58189
## NW            11.43047
## U1           -59.85369
## U2           101.53407
## Wealth        36.94309
## Ineq         229.70002
## Prob         -86.99065
## Time           .

Then we predict values with this model and compute an r squared value to examine model.

pred <- predict(lasso_model, newx = x, s = "lambda.min")
rss <- sum((y - pred)^2)
tss <- sum((y - mean(y))^2)
r_squared <- 1 - rss / tss
r_squared

## [1] 0.7886713

We are going to do a similar process with the elastic net variable selection, but we are also going to test many different values of alpha to see which one produces the best r squared value.

best_elastic <- function(x, y, alpha_value) {
  
  elastic_model <- cv.glmnet(x, y, alpha = alpha_value, family = "gaussian")
  
  best_lambda <- elastic_model$lambda.min
  
  preds <- predict(elastic_model, newx = x, s = best_lambda)
  
  r2 <- 1 - sum((y - preds)^2) / sum((y - mean(y))^2)
  
  return(r2)
}

results <- c()

for (i in 1:19) {
  alpha <- i * 0.05
  
  r2 <- best_elastic(x, y, alpha)
  
  results <- c(results, r2)
}

results

##  [1] 0.7435244 0.7758774 0.7719830 0.7873942 0.7928422 0.7917273 0.7757398
##  [8] 0.7723386 0.7870129 0.7552124 0.7753076 0.7936427 0.7268276 0.7864752
## [15] 0.7737604 0.7832011 0.7953337 0.7863184 0.7758676

alphas <- seq(0.05, 0.95, by = 0.05)
best_index <- which.max(results)
best_alpha <- alphas[best_index]
best_r2 <- results[best_index]

best_alpha

## [1] 0.85

best_r2

## [1] 0.7953337

elastic_model = cv.glmnet(x, y, alpha = best_alpha, family = "gaussian")

coef(elastic_model, s = "lambda.min")

## 16 x 1 sparse Matrix of class "dgCMatrix"
##             lambda.min
## (Intercept)  892.39097
## M             99.55622
## So            37.28903
## Ed           163.39557
## Po1          295.04358
## Po2            .      
## LF             .      
## M.F           55.25640
## Pop          -11.13501
## NW            13.37212
## U1           -62.30431
## U2           103.94979
## Wealth        40.11189
## Ineq         229.63894
## Prob         -87.76424
## Time           .

Discussion of Results - In today’s homework we created four different models, a standard linear model, a model created using stepwise variable selection, a model created using our lasso variable selection, and lastly, a model created using our elastic net variable selection. Their respective r squared values were ~.70, ~.74, ~.79, and ~.8. Accordingly, this makes sense, for every new model was becoming more complex where we began with a standard model, then we began taking variables away without altering their weights, then we began altering their weights via the lasso selection process, then we used the elastic net process which builds upon the lasso process. Ultimately, these results make sense, for our final and most complex model yielded our best r squared value.

ISYE 6501 Homework 8

James Jessup