# PROBLEM 1

Load the mtcars sample dataset from the built-in datasets (data(mtcars)) into R using a dataframe. Perform a basic 80/20 test-train split on the data (you may use caret, the sample method, or manually) and fit a linear model with mpg as the target response and all other variables as predictors/features (you will need to set up a dummy variable for am). What features are selected as relevant based on the resulting t-statistics? What are the associated coefficient values for the relevant features? Perform a ridge regression using the glmnet package from CRAN, specifying a vector of 100 values of λ for tuning. Use cross-validation (via cv.glmnet) to determine the minimum value for λ - what do you obtain? (Hint: You can use doMC in order to speed up your cross-validation by specifying parallel=TRUE in your glmnet calls.) Plot training MSE as a function of λ (you may also use log λ). What is out-of-sample test set performance (using predict), and how do the coefficients differ versus the regular linear model? Has ridge regression performed shrinkage, variable selection, or both?
library(ggplot2)
library(lattice)
library(caret)
library(Matrix)
library(foreach)
library(glmnet)
library(ROCR)
carData = data.frame(mtcars)
set.seed(123)
# Draw the 80% training indices once so the test rows are the exact complement
train_index = sample(seq_len(nrow(carData)), size = floor(0.8 * nrow(carData)))
mt_cars_trainData = carData[train_index, ]
mt_cars_testData = carData[-train_index, ]
# am is already coded 0/1 in mtcars, so it serves as the dummy variable
linearModelTrain = lm(mpg ~ ., mt_cars_trainData)
summary(linearModelTrain)
## 
## Call:
## lm(formula = mpg ~ ., data = mt_cars_trainData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8774 -1.3957 -0.0511  0.8254  4.5637 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 21.48672   21.70872   0.990   0.3391  
## cyl         -1.02280    1.35957  -0.752   0.4643  
## disp         0.02220    0.02531   0.877   0.3952  
## hp          -0.01921    0.02852  -0.674   0.5114  
## drat         0.22504    1.89085   0.119   0.9070  
## wt          -4.60044    2.37145  -1.940   0.0728 .
## qsec         0.83809    0.80049   1.047   0.3129  
## vs           0.76966    2.46407   0.312   0.7594  
## am           2.38345    2.51892   0.946   0.3601  
## gear        -0.14113    1.77632  -0.079   0.9378  
## carb         0.29213    1.01352   0.288   0.7774  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.85 on 14 degrees of freedom
## Multiple R-squared:  0.8854, Adjusted R-squared:  0.8035 
## F-statistic: 10.82 on 10 and 14 DF,  p-value: 5.62e-05
linearModelTrain$coefficients
## (Intercept)         cyl        disp          hp        drat          wt 
## 21.48672070 -1.02279987  0.02220239 -0.01921390  0.22504432 -4.60044246 
##        qsec          vs          am        gear        carb 
##  0.83808831  0.76966459  2.38344846 -0.14113143  0.29212936
# Out-of-sample (test set) MSE for the ordinary linear model
predicted = predict(linearModelTrain, mt_cars_testData, type = "response")
actual = mt_cars_testData[, "mpg"]
cat("Output Regular linear model = ", mean((predicted - actual)^2))
## Output Regular linear model = 5.598949
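The prompt also allows a caret-based split. A minimal sketch using createDataPartition is shown below for completeness; it is an alternative to the sample() split above and is not the split actually used for the results shown.

set.seed(123)
# createDataPartition draws a stratified ~80% sample of row indices on the response
in_train = createDataPartition(carData$mpg, p = 0.8, list = FALSE)
caret_train = carData[in_train, ]
caret_test  = carData[-in_train, ]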
# Grid of ~100 candidate lambda values on a log scale
lambda_seq = 10^seq(5, -5, by = -.1)
xVariable = model.matrix(mpg ~ ., mt_cars_trainData)[, -1]
yVariable = mt_cars_trainData$mpg
# Cross-validation to find the minimum lambda
cross_val_fit = cv.glmnet(xVariable, yVariable, alpha = 0, lambda = lambda_seq)
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
## fold
optimal_lambda = cross_val_fit$lambda.min
cat("Optimal Lambda Value = ", optimal_lambda)
## Optimal Lambda Value = 3.162278
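The hint in the prompt about doMC was not used here; a minimal sketch of what a parallel cross-validation call could look like is given below (assuming a Unix-like machine with several cores; the results above were produced serially).

library(doMC)
registerDoMC(cores = 4)   # register 4 fork-based parallel workers (doMC is Unix/macOS only)
cv_fit_parallel = cv.glmnet(xVariable, yVariable, alpha = 0,
                            lambda = lambda_seq, parallel = TRUE)
cv_fit_parallel$lambda.min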
fit = glmnet(xVariable, yVariable, alpha = 0, lambda = optimal_lambda)
plot(cross_val_fit)
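plot(cross_val_fit) shows the cross-validated MSE against log(λ). The training MSE curve asked for in the prompt can also be drawn directly from a ridge path fitted over the full grid; a sketch using the objects defined above:

# Fit ridge over the whole lambda grid, then compute in-sample MSE for each lambda
ridge_path = glmnet(xVariable, yVariable, alpha = 0, lambda = lambda_seq)
train_pred = predict(ridge_path, newx = xVariable)      # one column of predictions per lambda
train_mse  = colMeans((train_pred - yVariable)^2)
plot(log(ridge_path$lambda), train_mse, type = "l",
     xlab = "log(lambda)", ylab = "Training MSE")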
coef(fit)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept) 21.451812119
## cyl         -0.462106925
## disp        -0.005849156
## hp          -0.010238998
## drat         0.935508239
## wt          -1.299386742
## qsec         0.221370984
## vs           1.044852486
## am           1.638741603
## gear         0.416922103
## carb        -0.496192538
# Out-of-sample (test set) MSE for the ridge model at the optimal lambda
x_ridge_data = model.matrix(mpg ~ ., mt_cars_testData)[, -1]
predicted_ridge = predict(fit, s = optimal_lambda, newx = x_ridge_data)
actual_data = mt_cars_testData[, "mpg"]
cat("Output of Ridge Regression Model = ", mean((predicted_ridge - actual_data)^2))
## Output of Ridge Regression Model = 4.068905
After ridge regression the coefficients have shrunk toward 0 relative to the ordinary least-squares fit, but none of them is exactly 0, and the out-of-sample MSE drops from 5.598949 to 4.068905. Ridge regression has therefore performed shrinkage only, not variable selection. The optimal lambda value is 3.162278.
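To make the shrinkage explicit, the two coefficient vectors can be put side by side (a small sketch using the fitted objects above; the predictors appear in the same order in both models):

# OLS vs. ridge coefficients, term by term
coef_compare = data.frame(ols   = coef(linearModelTrain),
                          ridge = as.numeric(coef(fit)))
round(coef_compare, 4)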
# PROBLEM 2

Load the swiss sample dataset from the built-in datasets (data(swiss)) into R using a dataframe. Perform a basic 80/20 test-train split on the data (you may use caret, the sample method, or manually) and fit a linear model with Fertility as the target response, and all other variables as predictors/features. What features are selected as relevant based on the resulting t-statistics? What are the associated coefficient values for the relevant features? Perform a lasso regression using the glmnet package from CRAN, specifying a vector of 100 values of λ for tuning. Use cross-validation (via cv.glmnet) to determine the minimum value for λ - what do you obtain? (Hint: You can use doMC in order to speed up your cross-validation by specifying parallel=TRUE in your glmnet calls.) Plot training MSE as a function of λ (you may also use log λ). What is out-of-sample test set performance (using predict), and how do the coefficients differ versus the regular linear model? Has lasso regression performed shrinkage, variable selection, or both?
library(ggplot2)
library(lattice)
library(caret)
library(Matrix)
library(foreach)
library(glmnet)
swiss_data_frame = data.frame(swiss)
set.seed(200)
dataPartition = floor(0.8 * nrow(swiss))
# Draw the 80% training indices once so the test rows are the exact complement
swiss_train_index = sample(seq_len(nrow(swiss_data_frame)), size = dataPartition)
swiss_train_data = swiss_data_frame[swiss_train_index, ]
swiss_test_data = swiss_data_frame[-swiss_train_index, ]
model_linear_model = lm(Fertility ~ ., swiss_train_data)
summary(model_linear_model)
## 
## Call:
## lm(formula = Fertility ~ ., data = swiss_train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.3725  -4.8931   0.3094   5.1898  14.3784 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      66.56865   11.54371   5.767 2.39e-06 ***
## Agriculture      -0.16994    0.08158  -2.083 0.045588 *  
## Examination      -0.32529    0.28557  -1.139 0.263385    
## Education        -0.77292    0.20001  -3.864 0.000532 ***
## Catholic          0.11098    0.04100   2.707 0.010944 *  
## Infant.Mortality  1.09735    0.41155   2.666 0.012072 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.42 on 31 degrees of freedom
## Multiple R-squared:  0.7093, Adjusted R-squared:  0.6624 
## F-statistic: 15.13 on 5 and 31 DF,  p-value: 1.54e-07
model_linear_model$coefficients
##      (Intercept)      Agriculture      Examination        Education 
##       66.5686460       -0.1699429       -0.3252892       -0.7729194 
##         Catholic Infant.Mortality 
##        0.1109790        1.0973513
predicted = predict(model_linear_model, swiss_test_data, type = "response")
actual = swiss_test_data[, "Fertility"]
cat("Output Regular linear model = ", mean((predicted - actual)^2))
## Output Regular linear model = 68.15539
lambda_seq = 10^seq(5, -5, by = -.1)
yVar = swiss_train_data$Fertility
xVar = model.matrix(Fertility ~ ., swiss_train_data)[, -1]
# Cross-validation to find the minimum lambda for the lasso (alpha = 1)
lasso_cv = cv.glmnet(xVar, yVar, alpha = 1, lambda = lambda_seq)
opt_lambda_val = lasso_cv$lambda.min
cat("Optimal Lambda = ", opt_lambda_val)
## Optimal Lambda = 0.1584893
model_fit = glmnet(xVar, yVar, alpha = 1, lambda = opt_lambda_val)
plot(lasso_cv)
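As in Problem 1, the training MSE requested by the prompt can be plotted against log(λ) from a lasso path fitted over the full grid; a sketch using xVar and yVar defined above:

# Fit the lasso over the whole lambda grid and plot in-sample MSE vs. log(lambda)
lasso_path = glmnet(xVar, yVar, alpha = 1, lambda = lambda_seq)
lasso_train_mse = colMeans((predict(lasso_path, newx = xVar) - yVar)^2)  # one MSE per lambda
plot(log(lasso_path$lambda), lasso_train_mse, type = "l",
     xlab = "log(lambda)", ylab = "Training MSE")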
coef(model_fit)
## 6 x 1 sparse Matrix of class "dgCMatrix"
##                          s0
## (Intercept)      64.8324735
## Agriculture      -0.1401177
## Examination      -0.2970107
## Education        -0.7379700
## Catholic          0.1036069
## Infant.Mortality  1.0826605
xVarSec = model.matrix(Fertility ~ ., swiss_test_data)[, -1]
model_predict = predict(model_fit, s = opt_lambda_val, newx = xVarSec)
actual = swiss_test_data[, "Fertility"]
cat("Output Lasso Regression = ", mean((model_predict - actual)^2))
## Output Lasso Regression = 69.90852
coef(model_linear_model)
##      (Intercept)      Agriculture      Examination        Education 
##       66.5686460       -0.1699429       -0.3252892       -0.7729194 
##         Catholic Infant.Mortality 
##        0.1109790        1.0973513
# Note: coef() on a cv.glmnet object uses s = "lambda.1se" by default, not lambda.min
coef(lasso_cv)
## 6 x 1 sparse Matrix of class "dgCMatrix"
##                           s1
## (Intercept)      58.74328473
## Agriculture       .         
## Examination      -0.17407982
## Education        -0.54142456
## Catholic          0.06159911
## Infant.Mortality  0.91413824
Compared with the linear fit, the lasso has shrunk every coefficient toward 0. At the cross-validated minimum λ (0.1584893, used for model_fit) all coefficients remain nonzero, so at that penalty only shrinkage has occurred, and the test MSE rises slightly from 68.15539 to 69.90852. At the larger default penalty used by coef(lasso_cv) (lambda.1se), the Agriculture coefficient is set exactly to 0, so with a stronger penalty the lasso also performs variable selection. In short, the lasso has performed shrinkage at the optimal lambda of 0.1584893, and variable selection appears only at the more heavily regularized lambda.1se solution.
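The two coefficient printouts above correspond to different penalties; pulling both from the cross-validation object makes the contrast explicit (a small sketch):

# Lasso coefficients at the CV-minimum lambda and at the more conservative lambda.1se
data.frame(lambda_min = as.numeric(coef(lasso_cv, s = "lambda.min")),
           lambda_1se = as.numeric(coef(lasso_cv, s = "lambda.1se")),
           row.names  = rownames(coef(lasso_cv)))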
# PROBLEM 3

Load the Concrete Compressive Strength sample dataset from the UCI Machine Learning Repository (Concrete Data.xls) into R using a dataframe (Note: You will need to either use the xlsx or readxl packages to load Excel data, or manually save as CSV). Use the mgcv package to create a generalized additive model (via the gam function) to predict the Concrete Compressive Strength (CCS) as a non-linear function of the input components (C1-C6) - compare the R2 value for a GAM with linear terms as well as smoothed terms (Hint: Use the s() function to apply smoothing using the default bs of tp). Visualize the regression using the visreg package, showing the fit as a function of each predictor with confidence intervals - comment on the quality of the fit at extreme values of the predictors.
library(readxl)
library(corrplot)
library(mgcv)
library(visreg)
concrete_data = read_excel("/Users/dipenchawla/Desktop/Concrete_Data.xls")
summary(concrete_data)
##  Cement (component 1)(kg in a m^3 mixture)
##  Min.   :102.0
##  1st Qu.:192.4
##  Median :272.9
##  Mean   :281.2
##  3rd Qu.:350.0
##  Max.   :540.0
##  Blast Furnace Slag (component 2)(kg in a m^3 mixture)
##  Min.   :  0.0
##  1st Qu.:  0.0
##  Median : 22.0
##  Mean   : 73.9
##  3rd Qu.:142.9
##  Max.   :359.4
##  Fly Ash (component 3)(kg in a m^3 mixture)
##  Min.   :  0.00
##  1st Qu.:  0.00
##  Median :  0.00
##  Mean   : 54.19
##  3rd Qu.:118.27
##  Max.   :200.10
##  Water (component 4)(kg in a m^3 mixture)
##  Min.   :121.8
##  1st Qu.:164.9
##  Median :185.0
##  Mean   :181.6
##  3rd Qu.:192.0
##  Max.   :247.0
##  Superplasticizer (component 5)(kg in a m^3 mixture)
##  Min.   : 0.000
##  1st Qu.: 0.000
##  Median : 6.350
##  Mean   : 6.203
##  3rd Qu.:10.160
##  Max.   :32.200
##  Coarse Aggregate (component 6)(kg in a m^3 mixture)
##  Min.   : 801.0
##  1st Qu.: 932.0
##  Median : 968.0
##  Mean   : 972.9
##  3rd Qu.:1029.4
##  Max.   :1145.0
##  Fine Aggregate (component 7)(kg in a m^3 mixture)   Age (day)     
##  Min.   :594.0                                      Min.   :  1.00
##  1st Qu.:731.0                                      1st Qu.:  7.00
##  Median :779.5                                      Median : 28.00
##  Mean   :773.6                                      Mean   : 45.66
##  3rd Qu.:824.0                                      3rd Qu.: 56.00
##  Max.   :992.6                                      Max.   :365.00
##  Concrete compressive strength(MPa, megapascals)
##  Min.   : 2.332
##  1st Qu.:23.707
##  Median :34.443
##  Mean   :35.818
##  3rd Qu.:46.136
##  Max.   :82.599
str(concrete_data)
## tibble [1,030 × 9] (S3: tbl_df/tbl/data.frame)
##  $ Cement (component 1)(kg in a m^3 mixture)            : num [1:1030] 540 540 332 332 199 ...
##  $ Blast Furnace Slag (component 2)(kg in a m^3 mixture): num [1:1030] 0 0 142 142 132 ...
##  $ Fly Ash (component 3)(kg in a m^3 mixture)           : num [1:1030] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Water (component 4)(kg in a m^3 mixture)             : num [1:1030] 162 162 228 228 192 228 228 228 228 228 ...
##  $ Superplasticizer (component 5)(kg in a m^3 mixture)  : num [1:1030] 2.5 2.5 0 0 0 0 0 0 0 0 ...
##  $ Coarse Aggregate (component 6)(kg in a m^3 mixture)  : num [1:1030] 1040 1055 932 932 978 ...
##  $ Fine Aggregate (component 7)(kg in a m^3 mixture)    : num [1:1030] 676 676 594 594 826 ...
##  $ Age (day)                                            : num [1:1030] 28 28 270 365 360 90 365 28 28 28 ...
##  $ Concrete compressive strength(MPa, megapascals)      : num [1:1030] 80 61.9 40.3 41.1 44.3 ...
# We are taking columns C1-C6 (components 1-6) plus the response
colnames(concrete_data) = c("cement", "Blast_Furnace_Slag", "Fly_Ash", "Water",
                            "Superplasticizer", "Coarse_Aggregate", "Fine_Aggregate",
                            "Age", "ccs")
column_c1_c6 = c("cement", "Blast_Furnace_Slag", "Fly_Ash", "Water",
                 "Superplasticizer", "Coarse_Aggregate", "ccs")
concrete_data = concrete_data[column_c1_c6]
summary(concrete_data)
##      cement      Blast_Furnace_Slag    Fly_Ash           Water      
##  Min.   :102.0   Min.   :  0.0      Min.   :  0.00   Min.   :121.8  
##  1st Qu.:192.4   1st Qu.:  0.0      1st Qu.:  0.00   1st Qu.:164.9  
##  Median :272.9   Median : 22.0      Median :  0.00   Median :185.0  
##  Mean   :281.2   Mean   : 73.9      Mean   : 54.19   Mean   :181.6  
##  3rd Qu.:350.0   3rd Qu.:142.9      3rd Qu.:118.27   3rd Qu.:192.0  
##  Max.   :540.0   Max.   :359.4      Max.   :200.10   Max.   :247.0  
##  Superplasticizer Coarse_Aggregate      ccs        
##  Min.   : 0.000   Min.   : 801.0   Min.   : 2.332  
##  1st Qu.: 0.000   1st Qu.: 932.0   1st Qu.:23.707  
##  Median : 6.350   Median : 968.0   Median :34.443  
##  Mean   : 6.203   Mean   : 972.9   Mean   :35.818  
##  3rd Qu.:10.160   3rd Qu.:1029.4   3rd Qu.:46.136  
##  Max.   :32.200   Max.   :1145.0   Max.   :82.599
corrplot(cor(concrete_data), method = "number")
# GAM with purely linear terms for components C1-C6
first_model = gam(ccs ~ cement + Blast_Furnace_Slag + Fly_Ash + Water +
                    Superplasticizer + Coarse_Aggregate, data = concrete_data)
summary(first_model)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## ccs ~ cement + Blast_Furnace_Slag + Fly_Ash + Water + Superplasticizer + 
##     Coarse_Aggregate
## 
## Parametric coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.326997  10.510518   0.507 0.612387    
## cement              0.108256   0.005214  20.761  < 2e-16 ***
## Blast_Furnace_Slag  0.079357   0.006193  12.814  < 2e-16 ***
## Fly_Ash             0.055928   0.009287   6.022  2.4e-09 ***
## Water              -0.103871   0.027796  -3.737 0.000197 ***
## Superplasticizer    0.356016   0.110251   3.229 0.001281 ** 
## Coarse_Aggregate    0.008027   0.006272   1.280 0.200940    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## R-sq.(adj) =  0.445   Deviance explained = 44.9%
## GCV = 155.83  Scale est. = 154.77    n = 1030
# GAM with thin-plate smooths (the default bs = "tp") on each component
second_model = gam(ccs ~ s(cement) + s(Blast_Furnace_Slag) + s(Fly_Ash) + s(Water) +
                     s(Superplasticizer) + s(Coarse_Aggregate), data = concrete_data)
summary(second_model)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## ccs ~ s(cement) + s(Blast_Furnace_Slag) + s(Fly_Ash) + s(Water) + 
##     s(Superplasticizer) + s(Coarse_Aggregate)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  35.8178     0.3566   100.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                         edf Ref.df      F  p-value    
## s(cement)             4.464  5.513 69.530  < 2e-16 ***
## s(Blast_Furnace_Slag) 2.088  2.578 48.091  < 2e-16 ***
## s(Fly_Ash)            5.332  6.404  1.784    0.101    
## s(Water)              8.567  8.936 13.504  < 2e-16 ***
## s(Superplasticizer)   7.133  8.143  5.498 1.22e-06 ***
## s(Coarse_Aggregate)   1.000  1.000  0.018    0.892    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.531   Deviance explained = 54.4%
## GCV = 134.84  Scale est. = 130.96    n = 1030
# R^2 for the linear-terms GAM: 1 - SSE/SST (note: the squaring belongs inside the sums)
first_model.sse = sum((fitted(first_model) - concrete_data$ccs)^2)
first_model.ssr = sum((fitted(first_model) - mean(concrete_data$ccs))^2)
first_model.sst = first_model.sse + first_model.ssr
residualsqr_main = 1 - (first_model.sse / first_model.sst)
print(residualsqr_main)
## [1] 0.5013695
# R^2 for the smoothed-terms GAM, computed the same way
second_model.sse <- sum((fitted(second_model) - concrete_data$ccs)^2)
second_model.ssr <- sum((fitted(second_model) - mean(concrete_data$ccs))^2)
second_model.sst = second_model.sse + second_model.ssr
residualsqr_sum = 1 - (second_model.sse / second_model.sst)
print(residualsqr_sum)
## [1] 0.4999805
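The manual computation above is fragile; mgcv's summary() already reports an adjusted R² and the proportion of deviance explained (0.445 / 44.9% for the linear-terms model versus 0.531 / 54.4% for the smoothed model), and these can be pulled directly as a cross-check:

# Adjusted R^2 and deviance explained as returned by summary.gam()
c(linear_r2  = summary(first_model)$r.sq,
  linear_dev = summary(first_model)$dev.expl,
  smooth_r2  = summary(second_model)$r.sq,
  smooth_dev = summary(second_model)$dev.expl)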
anova(first_model, second_model, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: ccs ~ cement + Blast_Furnace_Slag + Fly_Ash + Water + Superplasticizer + 
##     Coarse_Aggregate
## Model 2: ccs ~ s(cement) + s(Blast_Furnace_Slag) + s(Fly_Ash) + s(Water) + 
##     s(Superplasticizer) + s(Coarse_Aggregate)
##   Resid. Df Resid. Dev     Df Deviance  Pr(>Chi)    
## 1   1023.00     158334                              
## 2    996.43     131019 26.574    27315 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
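As a complementary check (not part of the original output), the two fits can also be compared on AIC; the model with the lower AIC is preferred once the extra effective degrees of freedom used by the smooths are accounted for.

# AIC comparison of the linear-terms and smoothed-terms GAMs
AIC(first_model, second_model)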
# The deviance test gives additional statistical evidence that incorporating
# non-linear relationships for the covariates improves the model.
visreg(first_model, 'cement')
visreg(second_model,'cement')
visreg(first_model,'Blast_Furnace_Slag')
visreg(second_model,'Blast_Furnace_Slag')
visreg(first_model,'Fly_Ash')
visreg(second_model,'Fly_Ash')
visreg(first_model,'Water')
visreg(second_model,'Water')
visreg(first_model,'Superplasticizer')
visreg(second_model,'Superplasticizer')
visreg(first_model,'Coarse_Aggregate')
visreg(second_model,'Coarse_Aggregate')
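Rather than calling visreg once per predictor, a single call with no xvar argument should loop over every term; a sketch (behavior assumed from visreg's default of plotting all predictors when xvar is omitted):

par(mfrow = c(2, 3))   # arrange the six smoothed-term plots in one 2x3 layout
visreg(second_model)
par(mfrow = c(1, 1))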
From the cement plots we can see that the confidence band for the smoothed model is wider than for the linear-terms model, and the same pattern holds for the other predictors: the smoothed fits follow the data more closely over the bulk of each predictor's range, but their confidence bands widen sharply at extreme values of the predictors, where observations are sparse. The quality of the fit is therefore much lower at the extremes of the predictor ranges than near their centers.