INTRODUCTION

In the traditional statistical sense, logistic regression is a way to estimate the parameters of a logistic model, which is commonly used to predict a dichotomous outcome from a linear combination of independent variables. Feature selection is an automated process that selects a subset of the independent variables (features) from the existing set of features to build a model. The premise is to remove redundant and irrelevant features without losing too much information. In many cases, feature selection yields models that are easier to interpret, less complex, and less computationally intensive.

For this presentation, we will use the “Pima Indians Diabetes” data set to demonstrate the application of feature selection methods to logistic models. Our goal is to build a model that accurately predicts our dichotomous outcome (diagnosis of Type 2 diabetes) from the health information provided for each patient. In addition, we demonstrate model interpretation for logistic regression to evaluate the effect of each piece of health information on the diagnosis of Type 2 diabetes.

It is important to note that the goal of this presentation is the application of logistic modeling with R. Therefore, many of the statistical concepts used in this presentation will only be introduced, but links with additional information are provided throughout the vignette.

I. Data Cleaning

Summary of the diabetes data set:

summary(diabetes)
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000

In this data set, missing measurements are recorded as 0, so let’s filter out those observations.

library(dplyr)
diabetes <- filter(diabetes, Glucose > 0, BloodPressure > 0, SkinThickness > 0, Insulin > 0, BMI > 0)
head(diabetes)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           1      89            66            23      94 28.1
## 2           0     137            40            35     168 43.1
## 3           3      78            50            32      88 31.0
## 4           2     197            70            45     543 30.5
## 5           1     189            60            23     846 30.1
## 6           5     166            72            19     175 25.8
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.167  21       0
## 2                    2.288  33       1
## 3                    0.248  26       1
## 4                    0.158  53       1
## 5                    0.398  59       1
## 6                    0.587  51       1

In this dataset, there are eight candidate explanatory/predictor variables: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, and Age. Our dichotomous response variable is Outcome, where 1 denotes that the patient was diagnosed with Type 2 diabetes and 0 that they were not.
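
Before modeling, it can be useful to check the class balance of the response after filtering; a quick check:

#Count how many patients fall in each Outcome class (0 = no diabetes, 1 = diabetes)
table(diabetes$Outcome)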

II. Logistic Regression

To fit a logistic regression model in R, we can use the glm function with family = binomial. Instead of the least squares method used in linear regression, glm estimates the coefficients by maximum likelihood, choosing the values that maximize the log-likelihood function to find the best-fitting model.

model_test <- glm(Outcome ~ ., data = diabetes)
summary(model_test)
## 
## Call:
## glm(formula = Outcome ~ ., data = diabetes)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.07966  -0.25711  -0.06177   0.25851   1.03750  
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -1.103e+00  1.436e-01  -7.681 1.34e-13 ***
## Pregnancies               1.295e-02  8.364e-03   1.549  0.12230    
## Glucose                   6.409e-03  8.159e-04   7.855 4.07e-14 ***
## BloodPressure             5.465e-05  1.730e-03   0.032  0.97482    
## SkinThickness             1.678e-03  2.522e-03   0.665  0.50631    
## Insulin                  -1.233e-04  2.045e-04  -0.603  0.54681    
## BMI                       9.325e-03  3.901e-03   2.391  0.01730 *  
## DiabetesPedigreeFunction  1.572e-01  5.804e-02   2.708  0.00707 ** 
## Age                       5.878e-03  2.787e-03   2.109  0.03559 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1484185)
## 
##     Null deviance: 86.888  on 391  degrees of freedom
## Residual deviance: 56.844  on 383  degrees of freedom
## AIC: 375.52
## 
## Number of Fisher Scoring iterations: 2

Similar to the lm function for linear regression, the summary displays our estimated coefficients and associated p-values.

In logistic regression, our “Y” is the log odds of the outcome, which is modeled as a linear combination of the independent variables.

\[log(\frac{p}{1-p})=\beta_0+\beta_1X_1+\beta_2X_2...\]

The log odds appear on the left, where the odds are the probability of success divided by the probability of failure. Each estimated coefficient is the expected change in the log odds of our response variable for a one-unit increase in that predictor, holding the other variables constant. A change in log odds is not intuitive to interpret, so we convert log odds to odds by exponentiating both sides.

First, when we exponentiate the linear combination of variables, it becomes a multiplicative expression. \[\frac{p}{1-p} = e^{\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\cdots}\] \[\frac{p}{1-p} = e^{\beta_{0}}e^{\beta_{1}X_{1}}e^{\beta_{2}X_{2}}\cdots\] Next, let’s hold every variable besides X1 constant and collect those terms into a constant C. \[\frac{p}{1-p} = e^{\beta_1X_1}C\] Now, to find the expected change in the odds of our response variable for a one-unit increase in X1, we take the ratio of the new odds to the old odds. \[\frac{e^{\beta_1(X_1+1)}C}{e^{\beta_1X_1}C}=e^{\beta_1}\] This is important to note since we will use it for our model interpretation at the end of this vignette.
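
In R, this conversion is a one-liner: exponentiating the estimated coefficients of a fitted logistic model gives the odds ratios. A minimal sketch, assuming fit is a placeholder for a glm object fitted with family = binomial:

#Convert log-odds coefficients to odds ratios (sketch; `fit` is a placeholder for a fitted logistic model)
exp(coef(fit))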

III. Correlation Matrix and Variance Inflation Factor

Before we start our feature selection, it is important to check for multi-collinearity among the independent variables. When the independent variables are highly correlated, the results of our model become unstable for both prediction and statistical inference. For statistical inference, multi-collinearity leads to unstable coefficient estimates and p-values, making it difficult to properly interpret the model. For prediction, multi-collinearity can also lead to overfitting, where the model performs poorly on out-of-sample data.

library(ggcorrplot)
library(rcompanion)
library(car)
diabetes %>%
  cor(use = "all.obs") %>%
  ggcorrplot(show.diag = FALSE, type = "lower", lab = TRUE, lab_size = 2)

data.frame(vif(model_test))
##                          vif.model_test.
## Pregnancies                     1.900719
## Glucose                         1.670072
## BloodPressure                   1.231815
## SkinThickness                   1.852772
## Insulin                         1.556143
## BMI                             1.979596
## DiabetesPedigreeFunction        1.059315
## Age                             2.129433

The Pearson correlation matrix shows some correlation among the independent variables, which might suggest multi-collinearity. We will use the variance inflation factor (VIF) to quantify the collinearity among our independent variables. The VIF scores suggest that the collinearity among our independent variables is relatively low; a VIF above 5 is commonly taken to indicate that a variable is highly correlated with the others.
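
As a sketch of what vif computes for a single predictor, each VIF is 1/(1 - R²), where R² comes from regressing that predictor on all of the others; for example, for Glucose:

#VIF for Glucose computed by hand: regress Glucose on the other predictors
r2_glucose <- summary(lm(Glucose ~ Pregnancies + BloodPressure + SkinThickness + Insulin +
                           BMI + DiabetesPedigreeFunction + Age, data = diabetes))$r.squared
1 / (1 - r2_glucose) #should match (up to rounding) the vif() value for Glucose above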

IV. Model Building with Feature Selection Methods

We will first partition our data into training and testing sets. This is important because we want to train our model to estimate the parameters and then evaluate how well it performs on new data. We use 80 percent of the dataset for training because we want to fit the model with as much data as possible while leaving a portion for testing. We do not reuse the same data for both training and testing: we want to know how well the model classifies new data, not how well it has simply “memorized” the training data.

set.seed(123) #reproducibility
#80% into train and 20% into test
samp <- sample(1:nrow(diabetes), 313)
train <- diabetes[samp, ]
test <- diabetes[-samp, ]
library(bestglm)
library(caret)
library(InformationValue)
library(pROC)
library(glmnet)

There are many different types of feature selection methods, but we will only be demonstrating the best subset selection and LASSO feature selection methods.

Our first feature selection method is best subset selection. This procedure fits all combinations of the features and compares them by a specified criterion: adjusted R-squared, the Akaike information criterion (AIC), the Bayesian information criterion (BIC), or Mallows’s Cp. These criteria measure the goodness of fit across a collection of models. The criterion we will use for our best subset selection is the AIC; one benefit of the AIC is that it includes a penalty term that balances in-sample predictive performance against model complexity. The model with the lowest AIC is the “best” model in our collection.
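
For reference, the AIC is defined as \[AIC = 2k - 2\ln(\hat{L})\] where the log-likelihood term measures in-sample fit and k is the number of estimated parameters, so 2k acts as the penalty for model complexity. Different functions can differ in exactly which parameters they count toward k, but in every case a lower AIC is better.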

best.logit_AIC <- bestglm(train, IC = "AIC", family = binomial, method = "exhaustive")
## Morgan-Tatar search since family is non-gaussian.
best.logit_AIC$Subsets
##    Intercept Pregnancies Glucose BloodPressure SkinThickness Insulin   BMI
## 0       TRUE       FALSE   FALSE         FALSE         FALSE   FALSE FALSE
## 1       TRUE       FALSE    TRUE         FALSE         FALSE   FALSE FALSE
## 2       TRUE       FALSE    TRUE         FALSE         FALSE   FALSE  TRUE
## 3       TRUE       FALSE    TRUE         FALSE         FALSE   FALSE  TRUE
## 4*      TRUE       FALSE    TRUE         FALSE         FALSE   FALSE  TRUE
## 5       TRUE       FALSE    TRUE         FALSE          TRUE   FALSE  TRUE
## 6       TRUE        TRUE    TRUE         FALSE          TRUE   FALSE  TRUE
## 7       TRUE        TRUE    TRUE         FALSE          TRUE    TRUE  TRUE
## 8       TRUE        TRUE    TRUE          TRUE          TRUE    TRUE  TRUE
##    DiabetesPedigreeFunction   Age logLikelihood      AIC
## 0                     FALSE FALSE     -201.0264 402.0528
## 1                     FALSE FALSE     -157.8023 317.6046
## 2                     FALSE FALSE     -152.1463 308.2927
## 3                     FALSE  TRUE     -147.4573 300.9146
## 4*                     TRUE  TRUE     -144.0309 296.0619
## 5                      TRUE  TRUE     -143.7900 297.5800
## 6                      TRUE  TRUE     -143.6222 299.2444
## 7                      TRUE  TRUE     -143.5364 301.0728
## 8                      TRUE  TRUE     -143.4935 302.9870

The model with the lowest (best) AIC score is row 4*, which selects the features Glucose, BMI, DiabetesPedigreeFunction, and Age. Notice that it did not select the features that were most correlated with the selected features. In addition, we can see the penalty term at work: the AIC increases once more than four features are included.

Prediction with our best subset model.

To see how our trained model performs on the testing data, we will use the predict function with type = "response" to output probabilities between zero and one. We will then use the confusionMatrix function to display how many of the predictions were correctly classified at a specific threshold. The default threshold is 0.5, which makes intuitive sense because it is the midpoint of the interval between zero and one. For example, if the model predicts a value of 0.9 and the response in the testing set is 1, the observation is correctly classified since 0.9 is greater than 0.5. However, for predicted values of 0.49 or 0.51, a hard classification is much less convincing, which is why a single fixed threshold can be misleading.
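
As a minimal sketch of what such a hard cutoff does (probs and actuals below are placeholders for a vector of predicted probabilities and the corresponding true labels):

#Classify predicted probabilities at a 0.5 cutoff by hand (sketch; `probs` and `actuals` are placeholders)
pred_class <- ifelse(probs > 0.5, 1, 0)
table(Predicted = pred_class, Actual = actuals)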

This is where we can use the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) to evaluate classification performance. The ROC curve has two axes: sensitivity and 1 - specificity. Sensitivity is the true positive rate, the proportion of positives that were correctly predicted as positive, and 1 - specificity is the false positive rate, the proportion of negatives that were wrongly predicted as positive. The benefit of the ROC curve is that it evaluates the model at all possible thresholds between 0 and 1 and can be visualized. The AUC summarizes performance across all thresholds by computing the area under the ROC curve; a value of one indicates perfect classification, with the curve hugging the top-left corner and covering all of the space beneath it.

#train with selected model
best_train <- glm(Outcome ~ Glucose + BMI + DiabetesPedigreeFunction + Age, family = binomial, data = train)
#predict probability of response on a normalized scale
best_pred <- predict(best_train, test, type = "response")
#confusion matrix to display our true/false positives and negatives
confusionMatrix(test$Outcome, best_pred)
##    0  1
## 0 52  6
## 1  4 17
#Sensitivity displays our true positive rate
sensitivity(test$Outcome, best_pred)
## [1] 0.7391304
#Specificity displays our true negative rate
specificity(test$Outcome, best_pred)
## [1] 0.9285714
#ROC to measure TPR/FPR at all possible thresholds
ROC_best <- roc(test$Outcome, best_pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(ROC_best)
## Area under the curve: 0.8913

The area under the ROC curve is 0.8913 (89.13%), which indicates that our model performs quite well.

The next feature selection method we will use is the LASSO, or least absolute shrinkage and selection operator. Similar to using the AIC for best subset selection, this method performs regularization by adding a penalty term controlled by lambda, so that the model generalizes to new data instead of over-fitting. Increasing lambda shrinks the coefficients and, in some cases, eliminates features entirely by setting their coefficients to zero. To perform the LASSO for our logistic model, we use the glmnet function with alpha = 1 and family = 'binomial'.
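
For reference, and up to glmnet’s scaling conventions, the LASSO chooses the coefficients that minimize a penalized loss of the form \[-\ell(\beta) + \lambda\sum_{j=1}^{p}|\beta_j|\] where the first term is the negative log-likelihood of the logistic model and the second is the L1 penalty that shrinks the slope coefficients toward zero (the intercept is not penalized).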

#Set our training data in matrix
x_train <- data.matrix(train[, c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
                                 "BMI", "DiabetesPedigreeFunction", "Age")])
#Set our testing data in matrix
x_test <- data.matrix(test[, c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
                               "BMI", "DiabetesPedigreeFunction", "Age")])

#Fit our lasso model with training data with standardization
lasso_train <- glmnet(x_train, y = as.factor(train$Outcome), alpha = 1, family = 'binomial')
#Display shrinkage of coefficients by lambda value
plot(lasso_train, xvar = 'lambda')

We can see that the coefficients shrink toward zero as lambda increases. Since raising lambda further eventually drives every coefficient to zero, we need a way to select the optimal lambda value.

We can achieve this by finding the lambda value that minimizes the mean squared error (MSE). The cv.glmnet function uses k-fold cross-validation to identify the “best” lambda value. K-fold cross-validation divides the training set into k folds (the default is 10) of roughly equal size. It fits the model on k - 1 folds, calculates the test MSE on the fold that was left out, repeats this process for each fold, and averages the k test MSEs to obtain the overall cross-validated MSE. This process is repeated at each candidate lambda to determine the lambda that minimizes the MSE.
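
As a conceptual sketch of how the folds might be formed (cv.glmnet handles this internally):

#Conceptual sketch: randomly assign each training row to one of 10 roughly equal-sized folds
folds <- sample(rep(1:10, length.out = nrow(x_train)))
table(folds) #fold sizes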

#k-fold cross-validation to find the best lambda
cv_lasso <- cv.glmnet(x_train, y = train$Outcome, alpha = 1)
#Best lambda value that minimizes mean squared error
best_lambda <- cv_lasso$lambda.min
#Display mean squared error by lambda value
plot(cv_lasso)

lasso_best <- glmnet(x_train, y = as.factor(train$Outcome), alpha = 1, family = 'binomial', lambda = best_lambda)
#Display shrunk and eliminated coefficients
coef(lasso_best)
## 9 x 1 sparse Matrix of class "dgCMatrix"
##                                    s0
## (Intercept)              -7.749399100
## Pregnancies               0.006800820
## Glucose                   0.031680001
## BloodPressure             .          
## SkinThickness             0.008211263
## Insulin                   .          
## BMI                       0.043670853
## DiabetesPedigreeFunction  0.746422608
## Age                       0.030050701

We can see that the LASSO method eliminated two features, whose coefficients are shown as a period (.). Compared to the coefficient estimates of the best subset selection model, these estimates are much smaller in magnitude, and this model retains two more features.
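
If we only want the features the penalty retained, a small sketch that drops the zeroed rows from the sparse coefficient matrix:

#Keep only the non-zero LASSO coefficients (the features the penalty retained)
cf <- coef(lasso_best)
cf[cf[, 1] != 0, , drop = FALSE]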

Prediction with our LASSO model.

#Prediction probability for each observation with best lambda
lasso_pred <- predict(lasso_best, newx = x_test, type = "response")
#Find optimal probability cutoff score for our binary output
optimal_lasso <- optimalCutoff(test$Outcome, lasso_pred)[1]
#Confusion matrix to show false positives and false negatives
confusionMatrix(test$Outcome, lasso_pred)
##    0  1
## 0 52  6
## 1  4 17
#Sensitivity displays our true positive rate
sensitivity(test$Outcome, lasso_pred)
## [1] 0.7391304
#Specificity displays our true negative rate
specificity(test$Outcome, lasso_pred)
## [1] 0.9285714
#Percentage of total wrong classifications of model
misClassError(test$Outcome, lasso_pred, threshold = optimal_lasso)
## [1] 0.1266
lasso_pred2 <- as.vector(lasso_pred)
#ROC to measure TPR/FPR at all possible thresholds
ROC_lasso <- roc(test$Outcome, lasso_pred2)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(ROC_lasso)
## Area under the curve: 0.8843
par(pty = "s")
plot(ROC_lasso, col = "blue", main = "ROC For LASSO(BLUE) vs Best Subset AIC(ORANGE)", legacy.axes = T)
lines(ROC_best, col = "orange")

Our area under the ROC curve is 0.8843 (88.43%), which is slightly lower than that of the best subset model. Plotting the two ROC curves, we can see that they agree with the AUC scores, as the best subset curve sits closer to the top-left corner. However, the difference between the two models is marginal, owing to the small size of the data set, the small number of features, and the low complexity of the problem.

In some cases, the area under the ROC curve has limitations depending on the objective; for example, in a medical screening context, sensitivity may be weighted more heavily than overall accuracy in order to reduce the number of false negatives.

A limitation of the LASSO is that it does not provide unbiased coefficient estimates or p-values for determining statistical significance, so we cannot properly interpret these shrunken coefficients.

V. Model Interpretation for Logistic Regression

Let’s use our best subset model to evaluate the effects of the patients’ health information on the diagnosis of Type 2 diabetes.

library(oddsratio)
best_model <- glm(Outcome ~ Glucose + BMI + DiabetesPedigreeFunction + Age, family = binomial, data = diabetes)
summary(best_model)
## 
## Call:
## glm(formula = Outcome ~ Glucose + BMI + DiabetesPedigreeFunction + 
##     Age, family = binomial, data = diabetes)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8228  -0.6617  -0.3759   0.6702   2.5881  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -10.092018   1.080251  -9.342  < 2e-16 ***
## Glucose                    0.036189   0.004982   7.264 3.76e-13 ***
## BMI                        0.074449   0.020267   3.673 0.000239 ***
## DiabetesPedigreeFunction   1.087129   0.419408   2.592 0.009541 ** 
## Age                        0.053012   0.013439   3.945 8.00e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 498.10  on 391  degrees of freedom
## Residual deviance: 347.23  on 387  degrees of freedom
## AIC: 357.23
## 
## Number of Fisher Scoring iterations: 5

The p-values show that our predictors are all statistically significant.

We can use the or_glm function to find the expected change in the odds of our response variable for a one-unit increase in each predictor, while holding the other predictors constant.

odds_ratio <- or_glm(data = diabetes, model = best_model,
                     incr = list(Glucose = 1, BMI = 1, DiabetesPedigreeFunction = 1, Age = 1)) #one unit increase
odds_ratio <- odds_ratio[,1:2]
odds_ratio$odds_percent_change <- (odds_ratio$oddsratio - 1) * 100 #percent change
odds_ratio
##                  predictor oddsratio odds_percent_change
## 1                  Glucose     1.037                 3.7
## 2                      BMI     1.077                 7.7
## 3 DiabetesPedigreeFunction     2.966               196.6
## 4                      Age     1.054                 5.4

Interpretation of p-values and coefficients

For every one-year increase in Age, we can expect a 5.4% increase in the odds of being diagnosed with diabetes, holding the other predictors constant. Age has a statistically significant relationship with the diagnosis of Type 2 diabetes. Naturally, we are more prone to diseases as we get older.

For every one-unit increase in Glucose, we can expect a 3.7% increase in the odds of being diagnosed with diabetes, holding the other predictors constant. Glucose has a statistically significant relationship with the diagnosis of Type 2 diabetes. Type 2 diabetes is a disease in which the body has difficulty lowering glucose levels.

For every one-unit increase in BMI, we can expect a 7.7% increase in the odds of being diagnosed with diabetes, holding the other predictors constant. BMI has a statistically significant relationship with the diagnosis of Type 2 diabetes. Being overweight can lead to insulin resistance, and insulin is what allows the body to use glucose for energy.

DiabetesPedigreeFunction is a “function that scores the likelihood of diabetes based on family history”, with values between 0 and 2.5 and a median of 0.3725. Because a one-unit increase spans most of this range, its odds ratio will look much larger than the others. This scaling issue could be addressed with standardization, but then we would face the different challenge of interpreting the change in odds per one standard deviation rather than per one-unit increase.

For every one-unit increase in DiabetesPedigreeFunction, we can expect a 196.6% increase in the odds of being diagnosed with diabetes, holding the other predictors constant. DiabetesPedigreeFunction has a statistically significant relationship with the diagnosis of Type 2 diabetes. Genetic factors can affect the development of Type 2 diabetes.
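
Instead of standardizing, the incr argument of or_glm lets us pick a more natural step size for each predictor. As a sketch, here the increment for DiabetesPedigreeFunction is set to 0.1 while the others stay at 1 (the increments chosen are illustrative):

#Odds ratios with a 0.1-unit step for DiabetesPedigreeFunction (sketch)
odds_ratio_01 <- or_glm(data = diabetes, model = best_model,
                        incr = list(Glucose = 1, BMI = 1, DiabetesPedigreeFunction = 0.1, Age = 1))
odds_ratio_01[, 1:2] #the DiabetesPedigreeFunction odds ratio now reflects a 0.1-unit increase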

This is a great article that explains the importance of interpreting machine learning models.


Links to important packages
bestglm caret pROC glmnet oddsratio car