# PROBLEM 1

Load the mtcars sample dataset from the built-in datasets (data(mtcars)) into R using a dataframe. Perform a basic 80/20 test-train split on the data (you may use caret, the sample method, or manually) and fit a linear model with mpg as the target response and all other variables as predictors/features (you will need to set up a dummy variable for am). What features are selected as relevant based on the resulting t-statistics? What are the associated coefficient values for the relevant features? Perform a ridge regression using the glmnet package from CRAN, specifying a vector of 100 values of λ for tuning. Use cross-validation (via cv.glmnet) to determine the minimum value for λ - what do you obtain? (Hint: You can use doMC in order to speed up your cross-validation by specifying parallel=TRUE in your glmnet calls.) Plot training MSE as a function of λ (you may also use log λ). What is out-of-sample test set performance (using predict), and how do the coefficients differ versus the regular linear model? Has ridge regression performed shrinkage, variable selection, or both?
library(ggplot2)
library(lattice)
library(caret)
library(Matrix)
library(foreach)
library(glmnet)
library(ROCR)
carData = data.frame(mtcars)
set.seed(123)
# Draw the 80% training indices once so the test rows are the exact complement
train_index = sample(seq_len(nrow(carData)), size = floor(0.8 * nrow(carData)))
mt_cars_trainData = carData[train_index, ]
mt_cars_testData = carData[-train_index, ]
# am is already coded 0/1 in mtcars, so it serves as the dummy variable
linearModelTrain = lm(mpg ~ ., mt_cars_trainData)
summary(linearModelTrain)
## 
## Call:
## lm(formula = mpg ~ ., data = mt_cars_trainData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8774 -1.3957 -0.0511  0.8254  4.5637 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 21.48672   21.70872   0.990   0.3391  
## cyl         -1.02280    1.35957  -0.752   0.4643  
## disp         0.02220    0.02531   0.877   0.3952  
## hp          -0.01921    0.02852  -0.674   0.5114  
## drat         0.22504    1.89085   0.119   0.9070  
## wt          -4.60044    2.37145  -1.940   0.0728 .
## qsec         0.83809    0.80049   1.047   0.3129  
## vs           0.76966    2.46407   0.312   0.7594  
## am           2.38345    2.51892   0.946   0.3601  
## gear        -0.14113    1.77632  -0.079   0.9378  
## carb         0.29213    1.01352   0.288   0.7774  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.85 on 14 degrees of freedom
## Multiple R-squared:  0.8854, Adjusted R-squared:  0.8035 
## F-statistic: 10.82 on 10 and 14 DF,  p-value: 5.62e-05
linearModelTrain$coefficients
## (Intercept)         cyl        disp          hp        drat          wt 
## 21.48672070 -1.02279987  0.02220239 -0.01921390  0.22504432 -4.60044246 
##        qsec          vs          am        gear        carb 
##  0.83808831  0.76966459  2.38344846 -0.14113143  0.29212936
# Out-of-sample (test set) MSE for the ordinary linear model
predicted = predict(linearModelTrain, mt_cars_testData, type = "response")
actual = mt_cars_testData[, "mpg"]
cat("Output Regular linear model = ", mean((predicted - actual)^2))
## Output Regular linear model = 5.598949
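The prompt also allows a caret-based split. A minimal sketch using createDataPartition is shown below for completeness; it is an alternative to the sample() split above and is not the split actually used for the results shown.

set.seed(123)
# createDataPartition draws a stratified ~80% sample of row indices on the response
in_train = createDataPartition(carData$mpg, p = 0.8, list = FALSE)
caret_train = carData[in_train, ]
caret_test  = carData[-in_train, ]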
# Grid of ~100 candidate lambda values on a log scale
lambda_seq = 10^seq(5, -5, by = -.1)
xVariable = model.matrix(mpg ~ ., mt_cars_trainData)[, -1]
yVariable = mt_cars_trainData$mpg
# Cross-validation to find the minimum lambda
cross_val_fit = cv.glmnet(xVariable, yVariable, alpha = 0, lambda = lambda_seq)
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
## fold
optimal_lambda = cross_val_fit$lambda.min
cat("Optimal Lambda Value = ", optimal_lambda)
## Optimal Lambda Value = 3.162278
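The hint in the prompt about doMC was not used here; a minimal sketch of what a parallel cross-validation call could look like is given below (assuming a Unix-like machine with several cores; the results above were produced serially).

library(doMC)
registerDoMC(cores = 4)   # register 4 fork-based parallel workers (doMC is Unix/macOS only)
cv_fit_parallel = cv.glmnet(xVariable, yVariable, alpha = 0,
                            lambda = lambda_seq, parallel = TRUE)
cv_fit_parallel$lambda.min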
fit = glmnet(xVariable, yVariable, alpha = 0, lambda = optimal_lambda)
plot(cross_val_fit)
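plot(cross_val_fit) shows the cross-validated MSE against log(λ). The training MSE curve asked for in the prompt can also be drawn directly from a ridge path fitted over the full grid; a sketch using the objects defined above:

# Fit ridge over the whole lambda grid, then compute in-sample MSE for each lambda
ridge_path = glmnet(xVariable, yVariable, alpha = 0, lambda = lambda_seq)
train_pred = predict(ridge_path, newx = xVariable)      # one column of predictions per lambda
train_mse  = colMeans((train_pred - yVariable)^2)
plot(log(ridge_path$lambda), train_mse, type = "l",
     xlab = "log(lambda)", ylab = "Training MSE")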
coef(fit)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept) 21.451812119
## cyl         -0.462106925
## disp        -0.005849156
## hp          -0.010238998
## drat         0.935508239
## wt          -1.299386742
## qsec         0.221370984
## vs           1.044852486
## am           1.638741603
## gear         0.416922103
## carb        -0.496192538
# Out-of-sample (test set) MSE for the ridge model at the optimal lambda
x_ridge_data = model.matrix(mpg ~ ., mt_cars_testData)[, -1]
predicted_ridge = predict(fit, s = optimal_lambda, newx = x_ridge_data)
actual_data = mt_cars_testData[, "mpg"]
cat("Output of Ridge Regression Model = ", mean((predicted_ridge - actual_data)^2))
## Output of Ridge Regression Model = 4.068905
After ridge regression the coefficients have shrunk toward 0 relative to the ordinary least-squares fit, but none of them is exactly 0, and the out-of-sample MSE drops from 5.598949 to 4.068905. Ridge regression has therefore performed shrinkage only, not variable selection. The optimal lambda value is 3.162278.
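To make the shrinkage explicit, the two coefficient vectors can be put side by side (a small sketch using the fitted objects above; the predictors appear in the same order in both models):

# OLS vs. ridge coefficients, term by term
coef_compare = data.frame(ols   = coef(linearModelTrain),
                          ridge = as.numeric(coef(fit)))
round(coef_compare, 4)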
# PROBLEM 2

Load the swiss sample dataset from the built-in datasets (data(swiss)) into R using a dataframe. Perform a basic 80/20 test-train split on the data (you may use caret, the sample method, or manually) and fit a linear model with Fertility as the target response, and all other variables as predictors/features. What features are selected as relevant based on the resulting t-statistics? What are the associated coefficient values for the relevant features? Perform a lasso regression using the glmnet package from CRAN, specifying a vector of 100 values of λ for tuning. Use cross-validation (via cv.glmnet) to determine the minimum value for λ - what do you obtain? (Hint: You can use doMC in order to speed up your cross-validation by specifying parallel=TRUE in your glmnet calls.) Plot training MSE as a function of λ (you may also use log λ). What is out-of-sample test set performance (using predict), and how do the coefficients differ versus the regular linear model? Has lasso regression performed shrinkage, variable selection, or both?
library(ggplot2)
library(lattice)
library(caret)
library(Matrix)
library(foreach)
library(glmnet)
swiss_data_frame = data.frame(swiss)
set.seed(200)
dataPartition = floor(0.8 * nrow(swiss))
# Draw the 80% training indices once so the test rows are the exact complement
swiss_train_index = sample(seq_len(nrow(swiss_data_frame)), size = dataPartition)
swiss_train_data = swiss_data_frame[swiss_train_index, ]
swiss_test_data = swiss_data_frame[-swiss_train_index, ]
model_linear_model = lm(Fertility ~ ., swiss_train_data)
summary(model_linear_model)
## 
## Call:
## lm(formula = Fertility ~ ., data = swiss_train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.3725  -4.8931   0.3094   5.1898  14.3784 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      66.56865   11.54371   5.767 2.39e-06 ***
## Agriculture      -0.16994    0.08158  -2.083 0.045588 *  
## Examination      -0.32529    0.28557  -1.139 0.263385    
## Education        -0.77292    0.20001  -3.864 0.000532 ***
## Catholic          0.11098    0.04100   2.707 0.010944 *  
## Infant.Mortality  1.09735    0.41155   2.666 0.012072 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.42 on 31 degrees of freedom
## Multiple R-squared:  0.7093, Adjusted R-squared:  0.6624 
## F-statistic: 15.13 on 5 and 31 DF,  p-value: 1.54e-07
model_linear_model$coefficients
##      (Intercept)      Agriculture      Examination        Education 
##       66.5686460       -0.1699429       -0.3252892       -0.7729194 
##         Catholic Infant.Mortality 
##        0.1109790        1.0973513
predicted = predict(model_linear_model, swiss_test_data, type = "response")
actual = swiss_test_data[, "Fertility"]
cat("Output Regular linear model = ", mean((predicted - actual)^2))
## Output Regular linear model = 68.15539
lambda_seq = 10^seq(5, -5, by = -.1)
yVar = swiss_train_data$Fertility
xVar = model.matrix(Fertility ~ ., swiss_train_data)[, -1]
# Cross-validation to find the minimum lambda for the lasso (alpha = 1)
lasso_cv = cv.glmnet(xVar, yVar, alpha = 1, lambda = lambda_seq)
opt_lambda_val = lasso_cv$lambda.min
cat("Optimal Lambda = ", opt_lambda_val)
## Optimal Lambda = 0.1584893
model_fit = glmnet(xVar, yVar, alpha = 1, lambda = opt_lambda_val)
plot(lasso_cv)
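As in Problem 1, the training MSE requested by the prompt can be plotted against log(λ) from a lasso path fitted over the full grid; a sketch using xVar and yVar defined above:

# Fit the lasso over the whole lambda grid and plot in-sample MSE vs. log(lambda)
lasso_path = glmnet(xVar, yVar, alpha = 1, lambda = lambda_seq)
lasso_train_mse = colMeans((predict(lasso_path, newx = xVar) - yVar)^2)  # one MSE per lambda
plot(log(lasso_path$lambda), lasso_train_mse, type = "l",
     xlab = "log(lambda)", ylab = "Training MSE")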
coef(model_fit)
## 6 x 1 sparse Matrix of class "dgCMatrix"
##                          s0
## (Intercept)      64.8324735
## Agriculture      -0.1401177
## Examination      -0.2970107
## Education        -0.7379700
## Catholic          0.1036069
## Infant.Mortality  1.0826605
xVarSec = model.matrix(Fertility ~ ., swiss_test_data)[, -1]
model_predict = predict(model_fit, s = opt_lambda_val, newx = xVarSec)
actual = swiss_test_data[, "Fertility"]
cat("Output Lasso Regression = ", mean((model_predict - actual)^2))
## Output Lasso Regression = 69.90852
coef(model_linear_model)
##      (Intercept)      Agriculture      Examination        Education 
##       66.5686460       -0.1699429       -0.3252892       -0.7729194 
##         Catholic Infant.Mortality 
##        0.1109790        1.0973513
# Note: coef() on a cv.glmnet object uses s = "lambda.1se" by default, not lambda.min
coef(lasso_cv)
## 6 x 1 sparse Matrix of class "dgCMatrix"
##                           s1
## (Intercept)      58.74328473
## Agriculture       .         
## Examination      -0.17407982
## Education        -0.54142456
## Catholic          0.06159911
## Infant.Mortality  0.91413824
Compared with the linear fit, the lasso has shrunk every coefficient toward 0. At the cross-validated minimum λ (0.1584893, used for model_fit) all coefficients remain nonzero, so at that penalty only shrinkage has occurred, and the test MSE rises slightly from 68.15539 to 69.90852. At the larger default penalty used by coef(lasso_cv) (lambda.1se), the Agriculture coefficient is set exactly to 0, so with a stronger penalty the lasso also performs variable selection. In short, the lasso has performed shrinkage at the optimal lambda of 0.1584893, and variable selection appears only at the more heavily regularized lambda.1se solution.
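The two coefficient printouts above correspond to different penalties; pulling both from the cross-validation object makes the contrast explicit (a small sketch):

# Lasso coefficients at the CV-minimum lambda and at the more conservative lambda.1se
data.frame(lambda_min = as.numeric(coef(lasso_cv, s = "lambda.min")),
           lambda_1se = as.numeric(coef(lasso_cv, s = "lambda.1se")),
           row.names  = rownames(coef(lasso_cv)))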
# PROBLEM 3

Load the Concrete Compressive Strength sample dataset from the UCI Machine Learning Repository (Concrete Data.xls) into R using a dataframe (Note: You will need to either use the xlsx or readxl packages to load Excel data, or manually save as CSV). Use the mgcv package to create a generalized additive model (via the gam function) to predict the Concrete Compressive Strength (CCS) as a non-linear function of the input components (C1-C6) - compare the R2 value for a GAM with linear terms as well as smoothed terms (Hint: Use the s() function to apply smoothing using the default bs of tp). Visualize the regression using the visreg package, showing the fit as a function of each predictor with confidence intervals - comment on the quality of the fit at extreme values of the predictors.
library(readxl)
library(corrplot)
library(mgcv)
library(visreg)
concrete_data = read_excel("/Users/dipenchawla/Desktop/Concrete_Data.xls")
summary(concrete_data)
##  Cement (component 1)(kg in a m^3 mixture)
##  Min.   :102.0
##  1st Qu.:192.4
##  Median :272.9
##  Mean   :281.2
##  3rd Qu.:350.0
##  Max.   :540.0
##  Blast Furnace Slag (component 2)(kg in a m^3 mixture)
##  Min.   :  0.0
##  1st Qu.:  0.0
##  Median : 22.0
##  Mean   : 73.9
##  3rd Qu.:142.9
##  Max.   :359.4
##  Fly Ash (component 3)(kg in a m^3 mixture)
##  Min.   :  0.00
##  1st Qu.:  0.00
##  Median :  0.00
##  Mean   : 54.19
##  3rd Qu.:118.27
##  Max.   :200.10
##  Water (component 4)(kg in a m^3 mixture)
##  Min.   :121.8
##  1st Qu.:164.9
##  Median :185.0
##  Mean   :181.6
##  3rd Qu.:192.0
##  Max.   :247.0
##  Superplasticizer (component 5)(kg in a m^3 mixture)
##  Min.   : 0.000
##  1st Qu.: 0.000
##  Median : 6.350
##  Mean   : 6.203
##  3rd Qu.:10.160
##  Max.   :32.200
##  Coarse Aggregate (component 6)(kg in a m^3 mixture)
##  Min.   : 801.0
##  1st Qu.: 932.0
##  Median : 968.0
##  Mean   : 972.9
##  3rd Qu.:1029.4
##  Max.   :1145.0
##  Fine Aggregate (component 7)(kg in a m^3 mixture)   Age (day)     
##  Min.   :594.0                                      Min.   :  1.00
##  1st Qu.:731.0                                      1st Qu.:  7.00
##  Median :779.5                                      Median : 28.00
##  Mean   :773.6                                      Mean   : 45.66
##  3rd Qu.:824.0                                      3rd Qu.: 56.00
##  Max.   :992.6                                      Max.   :365.00
##  Concrete compressive strength(MPa, megapascals)
##  Min.   : 2.332
##  1st Qu.:23.707
##  Median :34.443
##  Mean   :35.818
##  3rd Qu.:46.136
##  Max.   :82.599
str(concrete_data)
## tibble [1,030 × 9] (S3: tbl_df/tbl/data.frame)
##  $ Cement (component 1)(kg in a m^3 mixture)            : num [1:1030] 540 540 332 332 199 ...
##  $ Blast Furnace Slag (component 2)(kg in a m^3 mixture): num [1:1030] 0 0 142 142 132 ...
##  $ Fly Ash (component 3)(kg in a m^3 mixture)           : num [1:1030] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Water (component 4)(kg in a m^3 mixture)             : num [1:1030] 162 162 228 228 192 228 228 228 228 228 ...
##  $ Superplasticizer (component 5)(kg in a m^3 mixture)  : num [1:1030] 2.5 2.5 0 0 0 0 0 0 0 0 ...
##  $ Coarse Aggregate (component 6)(kg in a m^3 mixture)  : num [1:1030] 1040 1055 932 932 978 ...
##  $ Fine Aggregate (component 7)(kg in a m^3 mixture)    : num [1:1030] 676 676 594 594 826 ...
##  $ Age (day)                                            : num [1:1030] 28 28 270 365 360 90 365 28 28 28 ...
##  $ Concrete compressive strength(MPa, megapascals)      : num [1:1030] 80 61.9 40.3 41.1 44.3 ...
# We are taking columns C1-C6 (components 1-6) plus the response
colnames(concrete_data) = c("cement", "Blast_Furnace_Slag", "Fly_Ash", "Water",
                            "Superplasticizer", "Coarse_Aggregate", "Fine_Aggregate",
                            "Age", "ccs")
column_c1_c6 = c("cement", "Blast_Furnace_Slag", "Fly_Ash", "Water",
                 "Superplasticizer", "Coarse_Aggregate", "ccs")
concrete_data = concrete_data[column_c1_c6]
summary(concrete_data)
##      cement      Blast_Furnace_Slag    Fly_Ash           Water      
##  Min.   :102.0   Min.   :  0.0      Min.   :  0.00   Min.   :121.8  
##  1st Qu.:192.4   1st Qu.:  0.0      1st Qu.:  0.00   1st Qu.:164.9  
##  Median :272.9   Median : 22.0      Median :  0.00   Median :185.0  
##  Mean   :281.2   Mean   : 73.9      Mean   : 54.19   Mean   :181.6  
##  3rd Qu.:350.0   3rd Qu.:142.9      3rd Qu.:118.27   3rd Qu.:192.0  
##  Max.   :540.0   Max.   :359.4      Max.   :200.10   Max.   :247.0  
##  Superplasticizer Coarse_Aggregate      ccs        
##  Min.   : 0.000   Min.   : 801.0   Min.   : 2.332  
##  1st Qu.: 0.000   1st Qu.: 932.0   1st Qu.:23.707  
##  Median : 6.350   Median : 968.0   Median :34.443  
##  Mean   : 6.203   Mean   : 972.9   Mean   :35.818  
##  3rd Qu.:10.160   3rd Qu.:1029.4   3rd Qu.:46.136  
##  Max.   :32.200   Max.   :1145.0   Max.   :82.599
corrplot(cor(concrete_data), method = "number")
# GAM with purely linear terms for components C1-C6
first_model = gam(ccs ~ cement + Blast_Furnace_Slag + Fly_Ash + Water +
                    Superplasticizer + Coarse_Aggregate, data = concrete_data)
summary(first_model)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## ccs ~ cement + Blast_Furnace_Slag + Fly_Ash + Water + Superplasticizer + 
##     Coarse_Aggregate
## 
## Parametric coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.326997  10.510518   0.507 0.612387    
## cement              0.108256   0.005214  20.761  < 2e-16 ***
## Blast_Furnace_Slag  0.079357   0.006193  12.814  < 2e-16 ***
## Fly_Ash             0.055928   0.009287   6.022  2.4e-09 ***
## Water              -0.103871   0.027796  -3.737 0.000197 ***
## Superplasticizer    0.356016   0.110251   3.229 0.001281 ** 
## Coarse_Aggregate    0.008027   0.006272   1.280 0.200940    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## R-sq.(adj) =  0.445   Deviance explained = 44.9%
## GCV = 155.83  Scale est. = 154.77    n = 1030
# GAM with thin-plate smooths (the default bs = "tp") on each component
second_model = gam(ccs ~ s(cement) + s(Blast_Furnace_Slag) + s(Fly_Ash) + s(Water) +
                     s(Superplasticizer) + s(Coarse_Aggregate), data = concrete_data)
summary(second_model)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## ccs ~ s(cement) + s(Blast_Furnace_Slag) + s(Fly_Ash) + s(Water) + 
##     s(Superplasticizer) + s(Coarse_Aggregate)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  35.8178     0.3566   100.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                         edf Ref.df      F  p-value    
## s(cement)             4.464  5.513 69.530  < 2e-16 ***
## s(Blast_Furnace_Slag) 2.088  2.578 48.091  < 2e-16 ***
## s(Fly_Ash)            5.332  6.404  1.784    0.101    
## s(Water)              8.567  8.936 13.504  < 2e-16 ***
## s(Superplasticizer)   7.133  8.143  5.498 1.22e-06 ***
## s(Coarse_Aggregate)   1.000  1.000  0.018    0.892    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.531   Deviance explained = 54.4%
## GCV = 134.84  Scale est. = 130.96    n = 1030
# R^2 for the linear-terms GAM: 1 - SSE/SST (note: the squaring belongs inside the sums)
first_model.sse = sum((fitted(first_model) - concrete_data$ccs)^2)
first_model.ssr = sum((fitted(first_model) - mean(concrete_data$ccs))^2)
first_model.sst = first_model.sse + first_model.ssr
residualsqr_main = 1 - (first_model.sse / first_model.sst)
print(residualsqr_main)
## [1] 0.5013695
# R^2 for the smoothed-terms GAM, computed the same way
second_model.sse <- sum((fitted(second_model) - concrete_data$ccs)^2)
second_model.ssr <- sum((fitted(second_model) - mean(concrete_data$ccs))^2)
second_model.sst = second_model.sse + second_model.ssr
residualsqr_sum = 1 - (second_model.sse / second_model.sst)
print(residualsqr_sum)
## [1] 0.4999805
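The manual computation above is fragile; mgcv's summary() already reports an adjusted R² and the proportion of deviance explained (0.445 / 44.9% for the linear-terms model versus 0.531 / 54.4% for the smoothed model), and these can be pulled directly as a cross-check:

# Adjusted R^2 and deviance explained as returned by summary.gam()
c(linear_r2  = summary(first_model)$r.sq,
  linear_dev = summary(first_model)$dev.expl,
  smooth_r2  = summary(second_model)$r.sq,
  smooth_dev = summary(second_model)$dev.expl)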
anova(first_model, second_model, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: ccs ~ cement + Blast_Furnace_Slag + Fly_Ash + Water + Superplasticizer + 
##     Coarse_Aggregate
## Model 2: ccs ~ s(cement) + s(Blast_Furnace_Slag) + s(Fly_Ash) + s(Water) + 
##     s(Superplasticizer) + s(Coarse_Aggregate)
##   Resid. Df Resid. Dev     Df Deviance  Pr(>Chi)    
## 1   1023.00     158334                              
## 2    996.43     131019 26.574    27315 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
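As a complementary check (not part of the original output), the two fits can also be compared on AIC; the model with the lower AIC is preferred once the extra effective degrees of freedom used by the smooths are accounted for.

# AIC comparison of the linear-terms and smoothed-terms GAMs
AIC(first_model, second_model)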
# The deviance test gives additional statistical evidence that incorporating
# non-linear relationships for the covariates improves the model.
visreg(first_model, 'cement')
visreg(second_model,'cement')
visreg(first_model,'Blast_Furnace_Slag')
visreg(second_model,'Blast_Furnace_Slag')
visreg(first_model,'Fly_Ash')
visreg(second_model,'Fly_Ash')
visreg(first_model,'Water')
visreg(second_model,'Water')
visreg(first_model,'Superplasticizer')
visreg(second_model,'Superplasticizer')
visreg(first_model,'Coarse_Aggregate')
visreg(second_model,'Coarse_Aggregate')
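Rather than calling visreg once per predictor, a single call with no xvar argument should loop over every term; a sketch (behavior assumed from visreg's default of plotting all predictors when xvar is omitted):

par(mfrow = c(2, 3))   # arrange the six smoothed-term plots in one 2x3 layout
visreg(second_model)
par(mfrow = c(1, 1))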
From the cement plots we can see that the confidence band for the smoothed model is wider than for the linear-terms model, and the same pattern holds for the other predictors: the smoothed fits follow the data more closely over the bulk of each predictor's range, but their confidence bands widen sharply at extreme values of the predictors, where observations are sparse. The quality of the fit is therefore much lower at the extremes of the predictor ranges than near their centers.