Lasso Regression

In this analysis we want to determine which factors are associated with life expectancy across countries. The dataset records, for each country and year, the life expectancy together with a set of public health variables.

Missing Data Imputation

We can impute missing values using the mice package. Here we use classification and regression trees to impute the data by setting method = "cart". The mice() function can generate multiple imputed datasets, each of which can be run through the model so that the results can be compared to assess the effect of the imputation. Here we will generate just a single imputed dataset.

# Load the mice package
library(mice)

# Extract feature variables
feat_vars <- names(life)[5:22]

imputed_values <- mice(data = life[, feat_vars], # Set dataset
                       m = 1, # Set number of multiple imputations
                       maxit = 40, # Set maximum number of iterations
                       method = "cart", # Impute with classification and regression trees
                       printFlag = FALSE) # Suppress iteration output

We can then replace the missing values in our dataset with the imputed estimates using the complete() function:

life[,feat_vars] <- complete(imputed_values, 1) # Extract imputed data

The missing values in the data have been replaced with imputed values.
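The multiple-imputation workflow mentioned above (several imputed datasets, each run through a model) can be sketched as follows. This is a minimal illustration rather than part of the analysis: it uses nhanes, a small example dataset shipped with mice, and an ordinary linear model as a stand-in, since the choice of m = 5, the seed and the model formula are all assumptions made for the example.

```r
library(mice)

# nhanes is a small example dataset shipped with mice; it contains missing values
imp <- mice(data = nhanes,
            m = 5, # Generate five imputed datasets
            maxit = 10, # Set maximum number of iterations
            method = "cart", # Impute with classification and regression trees
            printFlag = FALSE, # Suppress iteration output
            seed = 123) # Fix the seed so the imputations are reproducible

# Fit the same linear model to each of the five imputed datasets
fits <- with(imp, lm(bmi ~ age + chl))

# Combine the five sets of estimates using Rubin's rules
pooled <- pool(fits)
summary(pooled)
```

Comparing the pooled estimates with those from any single imputed dataset gives a sense of how much the results depend on the imputation.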

The Lasso

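As an alternative to ordinary linear regression we can use the lasso, which adds a penalty on the absolute size of the coefficients. The penalty shrinks the coefficients towards zero and can set some of them to exactly zero, so the lasso also performs variable selection.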
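One way to build intuition for how the lasso differs from linear regression is the soft-thresholding rule. In the idealised case where the predictors are standardized and uncorrelated, the lasso estimate of each coefficient is the least-squares estimate moved towards zero by lambda, and set to exactly zero if its absolute value is smaller than lambda. The sketch below uses a hypothetical helper, soft_threshold(), and made-up coefficient values purely for illustration; it is not how glmnet is called in this analysis.

```r
# Hypothetical helper: soft-thresholding of least-squares coefficients
soft_threshold <- function(beta, lambda) {
  sign(beta) * pmax(abs(beta) - lambda, 0)
}

# Made-up least-squares coefficients, for illustration only
beta_ols <- c(2.5, -0.3, 0.8, 0.1)

# With lambda = 0.5 the two small coefficients become exactly zero,
# while the two larger ones are shrunk towards zero by 0.5
beta_lasso <- soft_threshold(beta_ols, lambda = 0.5)
print(beta_lasso)
```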

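Before fitting the lasso we standardize the explanatory variables so that each has mean 0 and standard deviation 1. The lasso penalty depends on the size of the coefficients, so variables measured on different scales would otherwise be penalised unequally. We can do this with the scale() command: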

# Select the response (column 4) and the explanatory variables (columns 5 to 22)
use_dat <- life[, 4:22]
# Drop rows that still contain missing values
use_dat <- na.omit(use_dat)
# Scale explanatory variables (the first column is the response)
x_vars <- scale(use_dat[, -1])
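The effect of scale() can be checked on a small example. The toy data frame below contains made-up values and is only meant to show that each column ends up with mean 0 and standard deviation 1, whatever its original units.

```r
# Toy data with made-up values on very different scales
toy <- data.frame(gdp = c(500, 1500, 12000, 40000),
                  schooling = c(6, 9, 12, 15))

# Centre each column to mean 0 and rescale to standard deviation 1
toy_scaled <- scale(toy)

# Check the result: means are (numerically) 0, standard deviations are 1
round(colMeans(toy_scaled), 10)
apply(toy_scaled, 2, sd)

# The centring and scaling constants are stored as attributes, so the
# same transformation can be reused on new data
attr(toy_scaled, "scaled:center")
attr(toy_scaled, "scaled:scale")
```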

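We fit the model using the glmnet() command from the glmnet package, setting alpha = 1 to select the lasso penalty and fixing the penalty strength at lambda = 0.5:

# Load the glmnet package
library(glmnet)

# Fit lasso model
fit_3 <- glmnet(x = x_vars, # Scaled explanatory variables
                y = use_dat$Life.expectancy, # Response variable
                alpha = 1, # Set alpha as 1 for lasso
                lambda = 0.5) # Set lambda (penalty strength) as 0.5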

We can then view the calculated coefficients using the coef() command:

coef(fit_3) # Print out lasso coefficients

## 19 x 1 sparse Matrix of class "dgCMatrix"
##                                         s0
## (Intercept)                     69.2249317
## Adult.Mortality                 -2.1635326
## infant.deaths                    .        
## Alcohol                          .        
## percentage.expenditure           .        
## Hepatitis.B                      .        
## Measles                          .        
## BMI                              0.6384696
## under.five.deaths                .        
## Polio                            0.5776042
## Total.expenditure                .        
## Diphtheria                       0.4951371
## HIV.AIDS                        -2.2283728
## GDP                              0.5427246
## Population                       .        
## thinness..1.19.years            -0.2942908
## thinness.5.9.years               .        
## Income.composition.of.resources  1.5192129
## Schooling                        2.7045326

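We can see from the printout that the lasso has set the coefficients of nine of the eighteen variables to exactly zero, indicated by a dot (.). At this value of lambda these variables are dropped from the model and do not affect the predictions we generate. A zero coefficient does not by itself show that a variable is unrelated to life expectancy. For example, thinness..1.19.years is retained while thinness.5.9.years is dropped, which may simply reflect correlation between the two.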
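The set of variables that is dropped depends on the value of lambda, which was fixed by hand at 0.5 above. A common way to choose it from the data is cross-validation with cv.glmnet(). The sketch below runs on simulated data (made up for illustration, with only two of the six predictors truly related to the response) so that it is self-contained; the sample size, seed and variable names are assumptions and not part of the life expectancy analysis.

```r
library(glmnet)

set.seed(1) # Make the simulation and the cross-validation folds reproducible

# Simulated data (made up): only x1 and x2 are related to the response
n <- 200
x_sim <- scale(matrix(rnorm(n * 6), nrow = n,
                      dimnames = list(NULL, paste0("x", 1:6))))
y_sim <- 3 * x_sim[, "x1"] - 2 * x_sim[, "x2"] + rnorm(n)

# Cross-validate over a grid of lambda values chosen by glmnet
cv_fit <- cv.glmnet(x = x_sim, y = y_sim, alpha = 1)

# Lambda with the lowest cross-validated error
cv_fit$lambda.min

# Coefficients at that lambda, keeping only the non-zero ones
coefs <- as.matrix(coef(cv_fit, s = "lambda.min"))
coefs[coefs[, 1] != 0, , drop = FALSE]
```

Using s = "lambda.1se" instead of "lambda.min" selects a larger penalty within one standard error of the minimum, which typically gives a sparser model.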