In this analysis we want to determine which factors influence life expectancy in different countries. To do this we have gathered data giving, for each country and year, the life expectancy and a number of public health variables.
load("life_expectancy.rda")
We can impute missing values using the mice package. Here we will
use classification and regression trees to impute the data, by setting
method = "cart". The mice() function can generate multiple imputed
datasets, which can each be run through the model and the results
compared to determine the effect of the imputation. Here we
will just generate a single imputed dataset.
library(mice) # Load mice package for imputation
# Extract feature variables
feat_vars <- names(life)[5:22]
imputed_values <- mice(data = life[, feat_vars], # Set dataset
                       m = 1, # Set number of multiple imputations
                       maxit = 40, # Set maximum number of iterations
                       method = "cart", # Set method
                       printFlag = FALSE) # Set whether to print output or not
We can then replace the missing values in our dataset with the
estimates generated by mice imputation. We can do this using the
complete() function:
life[,feat_vars] <- complete(imputed_values, 1) # Extract imputed data
The missing values in the data have been replaced with imputed values.
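A single imputed dataset does not show how much the results depend on the imputation. As a rough sketch of the multiple-imputation workflow mentioned earlier, the example below uses the small nhanes dataset that ships with mice rather than the life data, and the linear model bmi ~ age + chl is chosen purely for illustration. It generates five imputed datasets, fits the same model to each, and combines the estimates with pool():

```r
library(mice)

# nhanes is an example dataset bundled with mice (assumption: used here
# only as a stand-in, it is not the life expectancy data)
imp <- mice(data = nhanes,
            m = 5,             # Generate five imputed datasets
            maxit = 10,        # Maximum number of iterations
            method = "cart",   # Classification and regression trees
            seed = 123,        # Make the imputation reproducible
            printFlag = FALSE) # Suppress iteration output

# Fit the same (illustrative) linear model on each imputed dataset
fits <- with(imp, lm(bmi ~ age + chl))

# Combine the five sets of estimates using Rubin's rules
pooled <- pool(fits)
summary(pooled)
```

If the pooled standard errors are much larger than those from any single imputed dataset, the conclusions are sensitive to the imputed values.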
As an alternative to linear regression we can use the lasso model.
Prior to applying the lasso we want to scale the explanatory variables
to have mean 0 and standard deviation 1, so that the penalty treats
each variable equally. We can do this with the scale() command:
# Select the columns to use for analysis
use_dat <- life[, c(4:22)]
# Drop missing values
use_dat <- na.omit(use_dat)
# Scale explanatory variables
x_vars <- scale(use_dat[,-1])
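The effect of scale() can be checked on a small example. The sketch below uses a made-up toy data frame (the column names and values are hypothetical, not taken from the life data) to confirm that each scaled column has mean 0 and standard deviation 1, and shows how the stored centring and scaling values might be reused for new observations:

```r
# Toy data on very different scales (made-up values for illustration)
toy <- data.frame(gdp = c(500, 1200, 30000, 45000),
                  schooling = c(4, 8, 12, 16))

# Centre each column, then divide it by its standard deviation
toy_scaled <- scale(toy)

colMeans(toy_scaled)     # Column means are (numerically) zero
apply(toy_scaled, 2, sd) # Column standard deviations are one

# scale() stores the values it used as attributes, so new observations
# can be put on the same scale as the original data
new_obs <- data.frame(gdp = 10000, schooling = 10)
new_scaled <- scale(new_obs,
                    center = attr(toy_scaled, "scaled:center"),
                    scale = attr(toy_scaled, "scaled:scale"))
new_scaled
```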
We fit the lasso model using the glmnet() command from the glmnet package:
library(glmnet) # Load glmnet package
# Fit lasso model
fit_3 <- glmnet(x = x_vars, # Explanatory variables
                y = use_dat$Life.expectancy, # Response variable
                alpha = 1, # Set alpha as 1 for lasso
                lambda = 0.5) # Set lambda as 0.5
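For reference, the roles of alpha and lambda can be read from the objective that glmnet() minimises for a continuous response, as described in the package documentation. The notation here is introduced for illustration: N is the number of observations, x_i the vector of explanatory variables for observation i, and beta_0 and beta the intercept and coefficients.

$$
\min_{\beta_0,\,\beta} \; \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \beta_0 - x_i^{\top}\beta \right)^2
+ \lambda \left[ \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right]
$$

With alpha = 1 the squared (ridge) term vanishes and only the absolute-value penalty remains, which is what allows coefficients to be shrunk to exactly zero. Larger values of lambda give a stronger penalty and therefore fewer non-zero coefficients.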
We can then view the calculated coefficients using the
coef() command:
coef(fit_3) # Print out lasso coefficients
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                                         s0
## (Intercept)                     69.2249317
## Adult.Mortality                 -2.1635326
## infant.deaths                    .
## Alcohol                          .
## percentage.expenditure           .
## Hepatitis.B                      .
## Measles                          .
## BMI                              0.6384696
## under.five.deaths                .
## Polio                            0.5776042
## Total.expenditure                .
## Diphtheria                       0.4951371
## HIV.AIDS                        -2.2283728
## GDP                              0.5427246
## Population                       .
## thinness..1.19.years            -0.2942908
## thinness.5.9.years               .
## Income.composition.of.resources  1.5192129
## Schooling                        2.7045326
We can see from the printout that the lasso model has set the
coefficients of several variables, such as infant.deaths and Alcohol,
to exactly zero, indicated by "." in the output. These variables
therefore have no impact on the fitted model and will not affect the
predictions which we generate.
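The value lambda = 0.5 used above is fixed by hand. One common alternative is to choose lambda by cross-validation with cv.glmnet(). The sketch below is a minimal illustration on simulated data rather than the life data: the variable names x1 to x10, the sample size and the true coefficients are all assumptions made for the example. Only x1 and x2 influence the response, so the remaining variables act as pure noise.

```r
library(glmnet)

set.seed(42) # Simulated data, for illustration only
n <- 200
p <- 10
x <- scale(matrix(rnorm(n * p), nrow = n, ncol = p))
colnames(x) <- paste0("x", 1:p)
# Only x1 and x2 affect the response; x3 to x10 are noise
y <- 3 * x[, "x1"] - 2 * x[, "x2"] + rnorm(n)

# 10-fold cross-validation over a sequence of lambda values
cv_fit <- cv.glmnet(x = x, y = y, alpha = 1, nfolds = 10)

cv_fit$lambda.min # Lambda with the lowest cross-validated error
cv_fit$lambda.1se # Largest lambda within one standard error of that minimum

# Coefficients at lambda.1se and the names of the terms kept in the model
cf <- coef(cv_fit, s = "lambda.1se")
selected <- names(which(as.matrix(cf)[, 1] != 0))
selected
```

Using lambda.1se rather than lambda.min gives a sparser model at a small cost in cross-validated error, which tends to suit cases where the aim is to identify the most important variables.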