1. Introduction

This project tries to create a prediction model based on the data provided by the World Health Organization (WHO) to evaluate the life expectancy for different countries. The data offers a timeframe from 2000 to 2015.The output algorithms have been used to test if they can maintain their accuracy in predicting the life expectancy for data they haven’t been trained.

2. Loading the data

2.1. load the packages:

The first step is to load all the required packages.

library(tidyverse)
library(readr)
library(dummies)
library(fastDummies)
library(glmnet)
library(leaps)
library(boot)

2.2 load the dataset

The data is read from the csv file and the life_exp dataset is created.

life_exp <- read_csv("Data/Life_Expectancy_Data.csv")

3. Preprocessing the data

The prerequisite for the data analysis is to preprocess the data to make it suitable for analysis.

3.1. Removing the incomplete data

All the rows with missing or null values are removed.

life_exp <- life_exp[complete.cases(life_exp), ]

Below the first 5 rows are shown. The data contains 21 columns and 1636 rows. The columns are:

  1. Country
  2. Year
  3. Status
  4. Life Expectancy
  5. Adult Mortality
  6. Alcohol
  7. percentage expenditure
  8. Hepatitis B
  9. Measles
  10. BMI
  11. under-five deaths
  12. Polio
  13. Total expenditure
  14. Diphtheria
  15. HIV/AIDS
  16. GDP
  17. Population
  18. thinness 1-19 years
  19. thinness 5-9 years
  20. Income composition of resources
  21. Schooling
head(life_exp)
## # A tibble: 6 x 22
##   Country  Year Status `Life expectanc~ `Adult Mortalit~ `infant deaths`
##   <chr>   <dbl> <chr>             <dbl>            <dbl>           <dbl>
## 1 Afghan~  2015 Devel~             65                263              62
## 2 Afghan~  2014 Devel~             59.9              271              64
## 3 Afghan~  2013 Devel~             59.9              268              66
## 4 Afghan~  2012 Devel~             59.5              272              69
## 5 Afghan~  2011 Devel~             59.2              275              71
## 6 Afghan~  2010 Devel~             58.8              279              74
## # ... with 16 more variables: Alcohol <dbl>, `percentage
## #   expenditure` <dbl>, `Hepatitis B` <dbl>, Measles <dbl>, BMI <dbl>,
## #   `under-five deaths` <dbl>, Polio <dbl>, `Total expenditure` <dbl>,
## #   Diphtheria <dbl>, `HIV/AIDS` <dbl>, GDP <dbl>, Population <dbl>,
## #   `thinness 1-19 years` <dbl>, `thinness 5-9 years` <dbl>, `Income
## #   composition of resources` <dbl>, Schooling <dbl>

3.2 Dropping year

The Year column is dropped as it will not be used in the analysis.

life_exp = subset(life_exp, select = -c(Year) )

3.3 Creating dummy variable for status variable

Except for Country name and Status(either developed or developing) columns, all of the data is numeric. The values are either in percentage, years, millions of dollars in the case of Gross Domestic Product (GDP).

For regression models, it will be handy to have the qualitative variables converted to quantitative variables. The Status of the country is changed into numerical with the dummy_cols function from the fastDummies package. This results in a new column with values 0 and 1. The original Status column is then dropped.

life_exp <- fastDummies::dummy_cols(life_exp, select_columns = "Status")
life_exp = subset(life_exp, select = -c(Status) )

3.4 Groupby country

The data is grouped by the country and the mean values for the 2000 - 2015 year period is found.

life_exp <- aggregate(life_exp[,2:21], list(life_exp$Country), mean)
life_exp = subset(life_exp, select = -c(Group.1) )

After the grouping, there are 133 rows.

3.5 Rename column names

It will be handy if the column names in the dataset does not contain spaces and other special characters. So, the spaces and other special characters are dropped from the column names.

names(life_exp)<-str_replace_all(names(life_exp), c(" " = "", "-" = "", "/"=""))

4. Train-validation-test data split.

Before proceeding with the creation of the prediction model, it is important to split the data into separate sets for training, validation and test. Out of the 133 records, 77 is used for training, 28 for validation and the remaining for test.

splits <- c(rep("train", 77), rep("validation", 28), rep("test", 28))
life_exp <- life_exp %>% mutate(splits = sample(splits))

life_exp_train <- life_exp %>% filter(splits == "train")
life_exp_train <- life_exp_train %>% dplyr::select(-splits)
life_exp_validation <- life_exp %>% filter(splits == "validation")
life_exp_validation <- life_exp_validation %>% dplyr::select(-splits)
life_exp_test  <- life_exp %>% filter(splits == "test")
life_exp_test <- life_exp_test %>% dplyr::select(-splits)

5. Important questions

The response variable in the data set is Life expectancy. The predictor variabes are Country Status, Adult Mortality, Alcohol, percentage expenditure, Hepatitis B, Measles, BMI, under-five deaths, Polio, Total expenditure, Diphtheria, HIV/AIDS, GDP, Population, thinness 1-19 years, thinness 5-9 years, Income composition of resources and Schooling. Some important questions that could be addressed are,

  1. Is there a relationship between the predictor varaibles and Life expectancy?
  2. What are the significant factors which define Life expectancy?
  3. Does Alcohol consumption impact Life expectancy?

6. Is there a relationship between the predictor variables and Life expectancy?

A multiple linear regression can be used to answer this question. The null hypothesis is that there is no relationship between the predictors and the response. The alternate hypothesis is that there is a relationship between Life expectancy and at least one of the predictors. This hypothesis test can be performed by computing the F-statistic of the linear regression. When there is no relationship between the response and predictors, the F-statistic should have a value close to 1. On the other hand, if the alternate hypothesis is true, the F-statistic should have a value greater than 1.

The linear regression is fitted with these predictors and the summary is displayed below.

lm_model <- lm(Lifeexpectancy ~ ., data = life_exp_train)
summary(lm_model)
## 
## Call:
## lm(formula = Lifeexpectancy ~ ., data = life_exp_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.0016 -1.3553  0.0736  1.6967  4.4597 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   6.025e+01  3.798e+00  15.864  < 2e-16 ***
## AdultMortality               -4.330e-02  5.749e-03  -7.532 4.13e-10 ***
## infantdeaths                 -9.306e-02  1.110e-01  -0.838  0.40532    
## Alcohol                      -2.596e-01  1.575e-01  -1.648  0.10479    
## percentageexpenditure         9.620e-04  5.740e-04   1.676  0.09924 .  
## HepatitisB                    3.698e-03  3.767e-02   0.098  0.92213    
## Measles                      -1.160e-04  2.019e-04  -0.574  0.56791    
## BMI                           6.029e-02  3.816e-02   1.580  0.11959    
## underfivedeaths               5.481e-02  7.000e-02   0.783  0.43684    
## Polio                        -2.637e-02  4.841e-02  -0.545  0.58815    
## Totalexpenditure             -2.347e-01  2.153e-01  -1.090  0.28017    
## Diphtheria                    3.457e-02  5.562e-02   0.622  0.53672    
## HIVAIDS                      -1.720e-01  7.705e-02  -2.232  0.02957 *  
## GDP                          -7.775e-05  1.184e-04  -0.657  0.51408    
## Population                    5.119e-08  3.650e-08   1.403  0.16615    
## thinness119years              6.149e-02  2.864e-01   0.215  0.83079    
## thinness59years               3.853e-02  2.952e-01   0.131  0.89663    
## Incomecompositionofresources  9.938e+00  4.065e+00   2.445  0.01760 *  
## Schooling                     8.354e-01  2.890e-01   2.890  0.00544 ** 
## Status_Developing            -1.420e+00  1.666e+00  -0.852  0.39775    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.582 on 57 degrees of freedom
## Multiple R-squared:  0.9399, Adjusted R-squared:  0.9199 
## F-statistic: 46.95 on 19 and 57 DF,  p-value: < 2.2e-16

The value of F-statistic is significantly greater than 1. This suggests that the null hypothesis can be rejected. So, there is a relationship between atleast one of the predictors and Life expectancy. The number of observations are significantly higher than the number of predcitors. Also, the p-value associated with this F-statistic is very close to 0. So, we can conlcude that atleast one of the predictors and Life expectancy are related. The linear regression plot for the test dataset is displayed in below figure.

tibble(predicted = predict(lm_model,newdata = life_exp_test),
       observed= life_exp_test$Lifeexpectancy) %>% 
  ggplot(aes(x = predicted, y = observed)) +
  geom_point() +
  theme_minimal() +
  geom_abline(slope = 1) +
  labs(title = "Predicted versus observed life expectancy using linear regression",x="Predicted value",y="True value")

7. What are the significant factors which define Life expectancy?

Since the number of predictors is not high, Best subset selection approach can be used to decide the important predictors. The regsubsets() function (part of the leaps library) performs best subset selection by identifying the best model that contains a given number of predictors, where best is quantified using RSS. The summary() command outputs the best set of variables for each model size.

dim(life_exp_train)
## [1] 77 20
regfit.best=regsubsets (Lifeexpectancy ~ ., data=life_exp_train, nvmax=19)
reg.summary =summary (regfit.best)
reg.summary
## Subset selection object
## Call: regsubsets.formula(Lifeexpectancy ~ ., data = life_exp_train, 
##     nvmax = 19)
## 19 Variables  (and intercept)
##                              Forced in Forced out
## AdultMortality                   FALSE      FALSE
## infantdeaths                     FALSE      FALSE
## Alcohol                          FALSE      FALSE
## percentageexpenditure            FALSE      FALSE
## HepatitisB                       FALSE      FALSE
## Measles                          FALSE      FALSE
## BMI                              FALSE      FALSE
## underfivedeaths                  FALSE      FALSE
## Polio                            FALSE      FALSE
## Totalexpenditure                 FALSE      FALSE
## Diphtheria                       FALSE      FALSE
## HIVAIDS                          FALSE      FALSE
## GDP                              FALSE      FALSE
## Population                       FALSE      FALSE
## thinness119years                 FALSE      FALSE
## thinness59years                  FALSE      FALSE
## Incomecompositionofresources     FALSE      FALSE
## Schooling                        FALSE      FALSE
## Status_Developing                FALSE      FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: exhaustive
##           AdultMortality infantdeaths Alcohol percentageexpenditure
## 1  ( 1 )  "*"            " "          " "     " "                  
## 2  ( 1 )  "*"            " "          " "     " "                  
## 3  ( 1 )  "*"            " "          " "     " "                  
## 4  ( 1 )  "*"            " "          " "     " "                  
## 5  ( 1 )  "*"            " "          " "     " "                  
## 6  ( 1 )  "*"            " "          " "     "*"                  
## 7  ( 1 )  "*"            " "          "*"     "*"                  
## 8  ( 1 )  "*"            " "          "*"     "*"                  
## 9  ( 1 )  "*"            "*"          "*"     "*"                  
## 10  ( 1 ) "*"            "*"          "*"     "*"                  
## 11  ( 1 ) "*"            "*"          "*"     "*"                  
## 12  ( 1 ) "*"            "*"          "*"     "*"                  
## 13  ( 1 ) "*"            "*"          "*"     "*"                  
## 14  ( 1 ) "*"            "*"          "*"     "*"                  
## 15  ( 1 ) "*"            "*"          "*"     "*"                  
## 16  ( 1 ) "*"            "*"          "*"     "*"                  
## 17  ( 1 ) "*"            "*"          "*"     "*"                  
## 18  ( 1 ) "*"            "*"          "*"     "*"                  
## 19  ( 1 ) "*"            "*"          "*"     "*"                  
##           HepatitisB Measles BMI underfivedeaths Polio Totalexpenditure
## 1  ( 1 )  " "        " "     " " " "             " "   " "             
## 2  ( 1 )  " "        " "     " " " "             " "   " "             
## 3  ( 1 )  " "        " "     " " " "             " "   " "             
## 4  ( 1 )  " "        " "     " " " "             " "   " "             
## 5  ( 1 )  " "        "*"     " " " "             " "   " "             
## 6  ( 1 )  " "        " "     " " "*"             " "   " "             
## 7  ( 1 )  " "        " "     "*" " "             " "   " "             
## 8  ( 1 )  " "        " "     "*" "*"             " "   " "             
## 9  ( 1 )  " "        " "     "*" " "             " "   " "             
## 10  ( 1 ) " "        " "     "*" " "             " "   "*"             
## 11  ( 1 ) "*"        " "     "*" " "             " "   "*"             
## 12  ( 1 ) "*"        " "     "*" " "             " "   "*"             
## 13  ( 1 ) "*"        " "     "*" "*"             " "   "*"             
## 14  ( 1 ) " "        " "     "*" "*"             " "   "*"             
## 15  ( 1 ) " "        " "     "*" "*"             " "   "*"             
## 16  ( 1 ) " "        "*"     "*" "*"             " "   "*"             
## 17  ( 1 ) " "        "*"     "*" "*"             "*"   "*"             
## 18  ( 1 ) " "        "*"     "*" "*"             "*"   "*"             
## 19  ( 1 ) "*"        "*"     "*" "*"             "*"   "*"             
##           Diphtheria HIVAIDS GDP Population thinness119years
## 1  ( 1 )  " "        " "     " " " "        " "             
## 2  ( 1 )  " "        " "     " " " "        " "             
## 3  ( 1 )  " "        " "     " " " "        " "             
## 4  ( 1 )  " "        "*"     " " " "        " "             
## 5  ( 1 )  " "        "*"     " " " "        " "             
## 6  ( 1 )  " "        "*"     " " " "        " "             
## 7  ( 1 )  " "        "*"     " " " "        " "             
## 8  ( 1 )  " "        "*"     " " " "        " "             
## 9  ( 1 )  " "        "*"     " " "*"        " "             
## 10  ( 1 ) " "        "*"     " " "*"        " "             
## 11  ( 1 ) " "        "*"     " " "*"        " "             
## 12  ( 1 ) " "        "*"     " " "*"        "*"             
## 13  ( 1 ) " "        "*"     " " "*"        "*"             
## 14  ( 1 ) " "        "*"     "*" "*"        "*"             
## 15  ( 1 ) "*"        "*"     "*" "*"        "*"             
## 16  ( 1 ) "*"        "*"     "*" "*"        "*"             
## 17  ( 1 ) "*"        "*"     "*" "*"        "*"             
## 18  ( 1 ) "*"        "*"     "*" "*"        "*"             
## 19  ( 1 ) "*"        "*"     "*" "*"        "*"             
##           thinness59years Incomecompositionofresources Schooling
## 1  ( 1 )  " "             " "                          " "      
## 2  ( 1 )  " "             " "                          "*"      
## 3  ( 1 )  " "             "*"                          "*"      
## 4  ( 1 )  " "             "*"                          "*"      
## 5  ( 1 )  " "             "*"                          "*"      
## 6  ( 1 )  " "             "*"                          "*"      
## 7  ( 1 )  " "             "*"                          "*"      
## 8  ( 1 )  " "             "*"                          "*"      
## 9  ( 1 )  " "             "*"                          "*"      
## 10  ( 1 ) " "             "*"                          "*"      
## 11  ( 1 ) " "             "*"                          "*"      
## 12  ( 1 ) " "             "*"                          "*"      
## 13  ( 1 ) " "             "*"                          "*"      
## 14  ( 1 ) " "             "*"                          "*"      
## 15  ( 1 ) " "             "*"                          "*"      
## 16  ( 1 ) " "             "*"                          "*"      
## 17  ( 1 ) " "             "*"                          "*"      
## 18  ( 1 ) "*"             "*"                          "*"      
## 19  ( 1 ) "*"             "*"                          "*"      
##           Status_Developing
## 1  ( 1 )  " "              
## 2  ( 1 )  " "              
## 3  ( 1 )  " "              
## 4  ( 1 )  " "              
## 5  ( 1 )  " "              
## 6  ( 1 )  " "              
## 7  ( 1 )  " "              
## 8  ( 1 )  " "              
## 9  ( 1 )  " "              
## 10  ( 1 ) " "              
## 11  ( 1 ) " "              
## 12  ( 1 ) " "              
## 13  ( 1 ) " "              
## 14  ( 1 ) "*"              
## 15  ( 1 ) "*"              
## 16  ( 1 ) "*"              
## 17  ( 1 ) "*"              
## 18  ( 1 ) "*"              
## 19  ( 1 ) "*"

The above summary of Best subset selection approach provides the best subset for each model size. The best model size can be decided by estimating the test MSE for each of the model size and then selecting the model size with the lowest estimated test MSE. A 10-fold cross validation is performed below to estimate the test MSE of each model size. The model size 2 is found to be one with lowest test MSE.

predict.regsubsets =function (object , newdata ,id ,...){
form=as.formula (object$call [[2]])
mat=model.matrix(form ,newdata )
coefi=coef(object ,id=id)
xvars=names(coefi)
mat[,xvars]%*%coefi
}
 

k=10
set.seed(1)
folds=sample (1:k,nrow(life_exp_validation),replace=TRUE)
cv.errors =matrix (NA,k,19, dimnames =list(NULL , paste (1:19) ))

for(j in 1:k){
 best.fit=regsubsets(Lifeexpectancy ~ .,data=life_exp_validation[folds!=j,],nvmax=19)
for(i in 1:19){
  pred=predict (best.fit, life_exp_validation[folds==j,] ,id=i)
  cv.errors[j,i]= mean( ( life_exp_validation$Lifeexpectancy[ folds==j]-pred)^2)
 }
}

mean.cv.errors.bestsubset=apply(cv.errors ,2, mean)
mean.cv.errors.bestsubset
##         1         2         3         4         5         6         7 
##  21.06960  30.19506  15.15802  15.84608  19.06236  18.18506  24.63374 
##         8         9        10        11        12        13        14 
## 177.80886 423.23436 447.83239 599.44951 452.43634 387.21358 538.07409 
##        15        16        17        18        19 
## 515.68543 529.76588 522.27302 491.09122 507.90516
min(mean.cv.errors.bestsubset)
## [1] 15.15802

As per the above summary statistics of best subset selection approach, the best 2 variables model consists of AdultMortality and Incomecompositionofresources.

So, it can be concluded that these 2 factors are significant to predict the response(i.e Life expectancy). A linear regression is performed below with these 2 factors and the plot for the test dataset is displayed.

lm_bestfit <- lm(Lifeexpectancy ~ AdultMortality+
                   Incomecompositionofresources, data = life_exp_train)

summary(lm_bestfit)
## 
## Call:
## lm(formula = Lifeexpectancy ~ AdultMortality + Incomecompositionofresources, 
##     data = life_exp_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9391 -1.3799  0.2768  1.4957  9.6980 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  66.608816   1.903651  34.990  < 2e-16 ***
## AdultMortality               -0.058066   0.003719 -15.614  < 2e-16 ***
## Incomecompositionofresources 19.615970   2.341953   8.376 2.52e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.857 on 74 degrees of freedom
## Multiple R-squared:  0.9045, Adjusted R-squared:  0.902 
## F-statistic: 350.6 on 2 and 74 DF,  p-value: < 2.2e-16
tibble(predicted = predict(lm_bestfit, newdata = life_exp_test), 
       observed  = life_exp_test$Lifeexpectancy) %>% 
  ggplot(aes(x = predicted, y = observed)) +
  geom_point() +
  theme_minimal() +
  geom_abline(slope = 1) +
  labs(title = "Predicted versus observed life expectancy using Best subset selection")

8. Does Alcohol consumption impact Life expectancy?

The answer to the previous question provides an interesting information. Alcohol consumption is not in the list of 5 factors which decide Life expectancy. This is against the popular notion that alcohol consumption reduces Life expectancy. The null hypothesis is that that there is no relationship between the two. and the alternate hypothesis is alcohol consumption reduces life expectancy.

A simple linear regression with only Alcohol as the predictor is as follows.

lm_alcohol <- lm(Lifeexpectancy ~ Alcohol, data = life_exp_train)
summary(lm_alcohol)
## 
## Call:
## lm(formula = Lifeexpectancy ~ Alcohol, data = life_exp_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.232  -5.365   1.669   6.605  11.481 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   62.943      1.456  43.242  < 2e-16 ***
## Alcohol        1.045      0.285   3.667 0.000455 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.459 on 75 degrees of freedom
## Multiple R-squared:  0.1521, Adjusted R-squared:  0.1408 
## F-statistic: 13.45 on 1 and 75 DF,  p-value: 0.0004554

The F-statistic is higher than 1 and the p-value is almost 0. So, the null hypothesis can be rejected to prove that there is correlation between Alcohol consumption and Life expectancy.

9. Lasso regression

An alternative to best subset selection is Lasso. It does variable selection and regularization to improve prediction accuracy and interpretability of the model. Lasso can be performed with glmnet function.

x_train <- model.matrix(Lifeexpectancy ~ ., data=life_exp_train)
result <- glmnet(x = x_train[, -1],              # X matrix without intercept
                 y=life_exp_train$Lifeexpectancy,# Life expectancy as response
                 family = "gaussian",            # Normally distributed errors
                 alpha  = 1,                     # LASSO penalty
                 lambda = 1)                     # Penalty value
rownames(coef(result))[which(coef(result) != 0)]
## [1] "(Intercept)"                  "AdultMortality"              
## [3] "BMI"                          "HIVAIDS"                     
## [5] "Incomecompositionofresources" "Schooling"

Assuming a value 1 for lambda, results in selecting only 8 predictors as shown in the above output.

A plot of predicted values versus observed values using test dataset for the generated model is shown below.

x_test <- model.matrix(Lifeexpectancy ~ ., data=life_exp_test)[, -1]
y_pred_lasso <- as.numeric(predict(result, newx = x_test))
tibble(Predicted = y_pred_lasso, Observed = life_exp_test$Lifeexpectancy) %>% 
  ggplot(aes(x = Predicted, y = Observed)) +
  geom_point() + 
  geom_abline(slope = 1, intercept = 0, lty = 2) +
  theme_minimal() +
  labs(title = "Predicted versus observed life expectancy using Lasso")

10. Tuning Lambda

The above Lasso regression is done with a Lambda value(tuning parameter) of 1. The lambda parameter changes the strength of the shrinkage in glmnet(). Changing the tuning parameter will change the predictions, and thus the MSE. The best value for lambda can be selected using the cv.glmnet function to determine the lambda value for which the out-of-sample MSE is lowest using k-fold cross-validation.

x_cv <- model.matrix(Lifeexpectancy ~ ., bind_rows(life_exp_train, 
                                            life_exp_validation)[, -21])[, -1]
result_cv <- cv.glmnet(x = x_cv, y = c(life_exp_train$Lifeexpectancy, 
                              life_exp_validation$Lifeexpectancy), nfolds = 15)
best_lambda <- result_cv$lambda.min
best_lambda 
## [1] 0.1690499

11. Lasso regression with cv tuning

A plot of predicted values versus observed values using test dataset with tuned lambda is shown below.

x_test <- model.matrix(Lifeexpectancy ~ ., data=life_exp_test)[, -1]
y_pred_cv_tuning <- as.numeric(predict(result_cv, newx = x_test))
tibble(Predicted = y_pred_cv_tuning, Observed = life_exp_test$Lifeexpectancy) %>% 
  ggplot(aes(x = Predicted, y = Observed)) +
  geom_point() + 
  geom_abline(slope = 1, intercept = 0, lty = 2) +
  theme_minimal() +
  labs(title = "Predicted versus observed life expectancy with tuned lambda")

12. Conclusion

A bar plot comparing the test MSE of the various prediction models described above is as follows. It is interesting to see that the test MSE of Lasso wit cv tuning is the lowest among the 4 prediction models.

# lm
mse <- function(y_true, y_pred) {
  mean((y_true - y_pred)^2)}
y_pred <- predict(lm_model, newdata = life_exp_test)
mse(life_exp_test$Lifeexpectancy, y_pred)
## [1] 18.39668
#Best subset selection
k=10
set.seed(2)
folds=sample (1:k,nrow(life_exp_test),replace=TRUE)
cv.errors =matrix (NA,k,19, dimnames =list(NULL , paste (1:19) ))

for(j in 1:k){
 best.fit=regsubsets(Lifeexpectancy ~ .,data=life_exp_test[folds!=j,],nvmax=19)
for(i in 1:19){
  pred=predict (best.fit, life_exp_test[folds==j,] ,id=i)
  cv.errors[j,i]= mean( ( life_exp_test$Lifeexpectancy[ folds==j]-pred)^2)
 }
}

mean.cv.errors.bestsubset=apply(cv.errors ,2, mean)

min(mean.cv.errors.bestsubset)
## [1] 5.248014
#Lasso
mse(life_exp_test$Lifeexpectancy, y_pred_lasso)
## [1] 6.272871
#Lasso with CV tuning
mse(life_exp_test$Lifeexpectancy, y_pred_cv_tuning)
## [1] 5.01991
# Adding to vector
mses <- c(
  mse(life_exp_test$Lifeexpectancy, y_pred),
  min(mean.cv.errors.bestsubset),
  mse(life_exp_test$Lifeexpectancy, y_pred_lasso),
  mse(life_exp_test$Lifeexpectancy, y_pred_cv_tuning)
)

# Creating a plot
tibble(Method = as_factor(c("lm", "subset", "lasso", "cv_las")), MSE = mses) %>% 
  ggplot(aes(x = Method, y = MSE, fill = Method)) +
  geom_bar(stat = "identity", col = "black") +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(title = "Comparison of test set MSE for different prediction methods") 

This paper started with an introduction of the dataset,and then explained preprocessing steps and followed it up with a linear regression model. Best subset selection method is applied to filter the significant predictors. Some of the interesting analytical questions are answered with the linear prediction model and the prediction model with the variables selected through the best subset selection approach. Lasso regression model is created with glmnet() and then the best lambda is predicted through cross validation. As expected, the test MSE of the lasso regression model with tuned lambda is lower than the test MSE of the model with a random value for lambda. Another observation is that all the variables selected by the best subset selection approach are also selected by the Lasso model.