The purpose of the assignment was to explore logistic and multinomial logistic regression.
The penguin dataset has a 'species' column. Please check how many categories you have in the species column. Conduct whatever data manipulation you need to do to be able to build a logistic regression with a binary outcome. Please explain your reasoning behind your decision as you manipulate the outcome/dependent variable (species).
Please make sure you are evaluating the independent variables appropriately in deciding which ones should be in the model.
Provide variable interpretations in your model.
First, we perform a light exploration of our dataset to familiarize ourselves:
#Load the packages used throughout (assumption: all are installed):
library(palmerpenguins) #penguins dataset
library(tidyverse)      #glimpse, dplyr verbs, pipes
library(MASS)           #stepAIC
library(nnet)           #multinom
library(pROC)           #roc
#We're going to deal with the Palmer penguins dataset:
data(penguins)
#Perform 'light EDA' to familiarize ourselves with the dataset:
glimpse(penguins)
summary(penguins)
We then explore how many categories there are in the species column:
#Categories in species column
penguins %>%
  group_by(species) %>%
  count() #there are THREE categories
## # A tibble: 3 x 2
## # Groups: species [3]
## species n
## <fct> <int>
## 1 Adelie 152
## 2 Chinstrap 68
## 3 Gentoo 124
From the output above, we see that there are THREE categories in the species column: Adelie, Chinstrap, and Gentoo. Adelie is the most common species (152 observations), followed by Gentoo (124) and then Chinstrap (68).
Because Adelie is the most common species, I elected to use whether or not a penguin is an Adelie as the binary outcome for our logistic regression model.
First, we remove NA values, set our outcome to be binary, and then we apply logistic regression to a full model (all variables):
#Remove NAs
penguins <- penguins %>% na.omit() #this step is essential for consistent dataframe length (later)
#Adelie or NOT Adelie
penguins$adelie <- ifelse(penguins$species == 'Adelie', 1, 0)
#Build binary outcome logistic regression model (with all vars):
model_1 <- glm(adelie ~., family=binomial(link='logit'), data=penguins)
## Warning: glm.fit: algorithm did not converge
summary(model_1)
##
## Call:
## glm(formula = adelie ~ ., family = binomial(link = "logit"),
## data = penguins)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.409e-06 -2.409e-06 -2.409e-06 2.409e-06 2.409e-06
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.657e+01 5.242e+07 0 1
## speciesChinstrap -5.313e+01 1.121e+05 0 1
## speciesGentoo -5.313e+01 1.900e+05 0 1
## islandDream -1.090e-07 7.259e+04 0 1
## islandTorgersen -2.402e-06 7.574e+04 0 1
## bill_length_mm -2.923e-08 8.936e+03 0 1
## bill_depth_mm 1.535e-07 2.521e+04 0 1
## flipper_length_mm -1.611e-08 4.094e+03 0 1
## body_mass_g 1.164e-10 6.916e+01 0 1
## sexmale 7.256e-08 6.525e+04 0 1
## year -3.261e-09 2.621e+04 0 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4.5658e+02 on 332 degrees of freedom
## Residual deviance: 1.9319e-09 on 322 degrees of freedom
## AIC: 22
##
## Number of Fisher Scoring iterations: 25
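Before interpreting anything, it's worth pausing on the convergence warning and the uniform p-values of 1: species, the very column adelie was derived from, is still among the predictors, so the outcome is perfectly separated and the coefficients and standard errors above are not meaningful. A minimal sketch of a full model that drops species (output not shown; its estimates would differ from those above):
#Refit the 'full' model without species, since adelie is derived from it
#and its inclusion guarantees perfect separation:
model_1b <- glm(adelie ~ . - species, family = binomial(link = 'logit'),
                data = penguins)
summary(model_1b)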
For continuous variables (bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year), the interpretation is as follows: holding all other variables constant, a one-unit increase in the variable changes the log-odds of the penguin being Adelie by that variable's coefficient.
For categorical variables (species, island, sex), the interpretation is as follows: the coefficient is the change in the log-odds of being Adelie for that level relative to the reference level (e.g., sexmale relative to female, islandDream relative to Biscoe), holding all other variables constant.
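To make these interpretations more tangible, here is a minimal sketch of converting log-odds coefficients into odds ratios (on a separated fit like model_1 the resulting numbers are unstable and shouldn't be read literally; the technique is most useful on a well-behaved fit such as model_3 below):
#Exponentiating a logistic regression coefficient yields an odds ratio:
#values above 1 multiply the odds of Adelie per unit increase,
#values below 1 shrink them (for categorical terms: vs. the reference level).
exp(coef(model_1))
#For example, bill_length_mm in model_3 below has coefficient -2.2099, and
#exp(-2.2099) ~ 0.11: each extra mm of bill length multiplies the odds of
#being Adelie by roughly 0.11.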
In addition to the variable interpretations above, we can pull from the summary statistics, which provide an AIC value of 22 and p-values of 1 across the board.
From these statistics (and the separation issue flagged above), we determine that there's likely a better model, and thus we'll optimize based on AIC (by applying the stepAIC function) and on p-value (by adding variables back into our model one by one) to compare the model difference (if any):
#Optimize model based on AIC value:
model_2 <- stepAIC(model_1)
#Optimize model based on p-value:
model_3 <- glm(adelie ~ bill_length_mm + bill_depth_mm, family=binomial(link='logit'), data=penguins)
#Display both models' summaries:
summary(model_2)
summary(model_3)
The results of our models are provided below:
##
## Call:
## glm(formula = adelie ~ bill_length_mm + bill_depth_mm + sex,
## family = binomial(link = "logit"), data = penguins)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.278e-04 -2.000e-08 -2.000e-08 2.000e-08 5.131e-04
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1243.45 116369.47 0.011 0.991
## bill_length_mm -94.98 7016.74 -0.014 0.989
## bill_depth_mm 158.15 11657.45 0.014 0.989
## sexmale 142.52 170018.10 0.001 0.999
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4.5658e+02 on 332 degrees of freedom
## Residual deviance: 5.4973e-07 on 329 degrees of freedom
## AIC: 8
##
## Number of Fisher Scoring iterations: 25
##
## Call:
## glm(formula = adelie ~ bill_length_mm + bill_depth_mm, family = binomial(link = "logit"),
## data = penguins)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.58614 -0.00035 0.00000 0.00234 1.85243
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 24.1314 13.5362 1.783 0.07463 .
## bill_length_mm -2.2099 0.6847 -3.228 0.00125 **
## bill_depth_mm 3.9981 1.4838 2.694 0.00705 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 456.575 on 332 degrees of freedom
## Residual deviance: 18.708 on 330 degrees of freedom
## AIC: 24.708
##
## Number of Fisher Scoring iterations: 11
Optimizing based on AIC (Akaike Information Criterion) value resulted in a model (model_2) that:
- retained bill_length_mm, bill_depth_mm, and sex as predictors;
- lowered the AIC from 22 to 8;
- yet carried enormous coefficients, huge standard errors, and p-values near 1.
The magnitude of the p-values seemed strange to me, so I optimized once again, this time based on p-value. This resulted in a model (model_3) that:
- retained only bill_length_mm and bill_depth_mm;
- produced p-values well below 0.05 (0.00125 and 0.00705, respectively);
- but carried a higher AIC of 24.708.
Thus, we had conflicting models: on one hand a model with a superior (lower) AIC score, and on the other a model with superior p-values but a worse AIC. I elected to proceed with the lower-AIC model (model_2), since a lower AIC indicates a better trade-off between goodness of fit and model complexity.
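These AIC values can be sanity-checked by hand from AIC = 2k - 2*ln(L): both fits drive the residual deviance (and hence -2*ln(L)) to essentially zero, so AIC collapses to twice the parameter count, 2 * 11 = 22 for the full model and 2 * 4 = 8 for model_2. A quick check in R:
#AIC = 2k - 2*logLik; with log-likelihood ~0, AIC is twice the parameter count:
2 * attr(logLik(model_2), "df") - 2 * as.numeric(logLik(model_2))
AIC(model_2) #both should print 8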
Between the two models, the only difference was the inclusion of the sex variable, which lowered model_2's AIC while pushing all of its p-values toward 1. bill_length_mm and bill_depth_mm appear to provide the greatest indication of which species of penguin we're dealing with: alone, each had a p-value well below the 0.05 threshold (see model_3). Adding sex, however, lets the three variables separate Adelie from the other species almost perfectly, which is exactly what inflates the standard errors and p-values in model_2 even as the deviance (and hence the AIC) drops.
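As a cross-check on that reasoning, a minimal sketch of a likelihood-ratio test between the two nested fits (its chi-squared p-value should be read cautiously here, given the separation):
#Does adding 'sex' significantly reduce the deviance? model_3 is nested in model_2:
anova(model_3, model_2, test = "Chisq")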
As a next step, we assess the accuracy of our model …
For your model from #1, please provide: AUC, Accuracy, TPR, FPR, TNR, FNR.
#Summarize the fitted probabilities produced by model_2:
summary(model_2$fitted.values)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.0000  0.0000  0.4384  1.0000  1.0000
#Analyze histogram of our fitted values
hist(model_2$fitted.values, main = "Histogram", xlab = "Probability of Adelie", col = "blue")
Based on the summary statistics, we get an idea of the range of values produced by our model, and when we visualize them, we see that the fitted probabilities have been batched into two camps: 0.0 and 1.0, where 1.0 can be interpreted as species Adelie and 0.0 as NOT species Adelie.
The NOT-Adelie bar being taller than the Adelie bar is consistent with the class balance in the data (187 non-Adelie vs. 146 Adelie observations), but we'll dig further into the accuracy of our model via the AUC and a confusion matrix:
#Sanity check: after na.omit, the outcome and the fitted values line up (333 each)
#length(penguins$adelie)        #333
#length(model_2$fitted.values)  #333
#Add Prediction column to penguins df based on model_2's fitted values:
penguins$Predict <- ifelse(model_2$fitted.values > 0.5,"+ve","-ve")
#Observe Confusion matrix:
mytable <- table(penguins$adelie,penguins$Predict)
rownames(mytable) <- c("Obs -ve","Obs +ve")
colnames(mytable) <- c("Pred -ve","Pred +ve")
mytable
##
## Pred -ve Pred +ve
## Obs -ve 187 0
## Obs +ve 0 146
#Accuracy: sum of the diagonal (true predictions) over the sum of all values
sum(diag(mytable)) / sum(mytable)
## [1] 1
Based on the confusion matrix, our results are as follows (computed in the sketch below):
TPR (sensitivity): 146 / 146 = 1.0
TNR (specificity): 187 / 187 = 1.0
FPR: 0 / 187 = 0.0
FNR: 0 / 146 = 0.0
We then take the sum of the diagonal of our confusion matrix (our True values) and divide it by the sum of all values. As you may have expected, the output is 1.0, or 100% accuracy.
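For completeness, here is a minimal sketch computing the rates above directly from mytable (assuming the row/column labels assigned earlier):
#Extract the four cells of the confusion matrix:
TN <- mytable["Obs -ve", "Pred -ve"]; FP <- mytable["Obs -ve", "Pred +ve"]
FN <- mytable["Obs +ve", "Pred -ve"]; TP <- mytable["Obs +ve", "Pred +ve"]
#Rates requested in the question:
c(TPR = TP / (TP + FN),  #sensitivity
  TNR = TN / (TN + FP),  #specificity
  FPR = FP / (FP + TN),
  FNR = FN / (FN + TP))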
*Note: we dropped NA values prior to constructing our model, removing 11 observations that might otherwise have challenged the model and produced false positives or negatives.
Regardless, based on our Confusion Matrix, the model appears to be strong. We then move on to verify our model’s ROC and AUC:
#Observe ROC and AUC:
roc(adelie ~ model_2$fitted.values, data = penguins, plot = TRUE, main = "ROC CURVE", col = "blue")
##
## Call:
## roc.formula(formula = adelie ~ model_2$fitted.values, data = penguins, plot = TRUE, main = "ROC CURVE", col = "blue")
##
## Data: model_2$fitted.values in 187 controls (adelie 0) < 146 cases (adelie 1).
## Area under the curve: 1
The ROC (Receiver Operating Characteristic) curve plots Sensitivity against Specificity across classification thresholds to measure the performance of a model. In our case, the ROC "curve" traces the full unit box, holding a Sensitivity of 1.0 at a Specificity of 1.0.
The AUC (Area Under the Curve) can be read as the probability that a randomly chosen Adelie penguin receives a higher fitted value from our model than a randomly chosen non-Adelie penguin. A value of 1.0 in this realm is a perfect score, and our model received a 1.0 :)
Based on the AUC, accuracy, and confusion matrix, we appear to have a very strong model on our hands (keeping in mind that the perfect scores reflect the fact that these predictors separate Adelie completely within this sample). Thus, choosing to optimize our model based on its AIC score appears to have been a good decision in this case.
Please apply multinomial logistic regression where your outcome variable is ‘species’.
Evaluate the independent variables appropriately to fit your most parsimonious model.
Please be sure to interpret your variables in the model.
I took “most parsimonious model” to mean that we’re looking for the optimal prediction with as few predictor variables as possible. In other words, we want our model to be as accurate as possible with as few independent variables as possible:
As such, I followed a similar approach to the binary case above: NA values were removed, the model was fit (this time using the multinom() function), and the model was then optimized using the stepAIC() function:
#'Refresh' the Palmer penguins dataset:
data(penguins)
#Remove NAs
penguins <- penguins %>% na.omit()
#is.factor(penguins$species)
#levels(penguins$species)
#Build multinomial logistic regression model (with all vars):
s_model_1 <- multinom(species ~., data=penguins, model = TRUE)
#Optimize model based on AIC value (as per #2):
s_model_2 <- stepAIC(s_model_1)
#Display both models' summaries:
summary(s_model_1)
summary(s_model_2)
## Call:
## multinom(formula = species ~ ., data = penguins, model = TRUE)
##
## Coefficients:
## (Intercept) islandDream islandTorgersen bill_length_mm bill_depth_mm
## Chinstrap -0.2496435 211.62912 97.64471 32.96624 -60.84017
## Gentoo 0.7633414 -68.84639 -140.79025 28.45770 -42.86214
## flipper_length_mm body_mass_g sexmale year
## Chinstrap 0.01515612 -0.0193027 -70.73544 -0.2242371
## Gentoo -2.79321661 0.1330873 -67.23109 -0.2391631
##
## Std. Errors:
## (Intercept) islandDream islandTorgersen bill_length_mm bill_depth_mm
## Chinstrap 0.02157370 0.02157370 6.871032e-16 0.2205894 0.1689163
## Gentoo 0.02114133 0.02114131 5.401394e-53 1.0993488 0.4376254
## flipper_length_mm body_mass_g sexmale year
## Chinstrap 4.937946 25.79546 0.02113835 42.20323
## Gentoo 4.439678 101.47836 0.02114132 42.45178
##
## Residual Deviance: 7.087421e-06
## AIC: 36.00001
## Call:
## multinom(formula = species ~ bill_length_mm + bill_depth_mm +
## sex, data = penguins, model = TRUE)
##
## Coefficients:
## (Intercept) bill_length_mm bill_depth_mm sexmale
## Chinstrap -242.9740 14.88938 -21.91411 -36.9051125
## Gentoo 121.0476 14.80645 -44.69711 -0.1301506
##
## Std. Errors:
## (Intercept) bill_length_mm bill_depth_mm sexmale
## Chinstrap 3.985819 9.369182 23.16685 23.33386
## Gentoo 4.071031 9.413971 23.17697 43.25185
##
## Residual Deviance: 0.9809986
## AIC: 16.981
When we look at the full model's summary statistics, we get a model that:
- uses every available predictor (island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, year);
- drives the residual deviance to essentially zero (7.09e-06), i.e., a perfect fit to the training data;
- carries an AIC of 36.00001.
Optimizing based on AIC (Akaike Information Criterion) value resulted in a model that:
- retains only bill_length_mm, bill_depth_mm, and sex;
- has a residual deviance of 0.981;
- lowers the AIC to 16.981.
If you recall, the variables that remain in our AIC-optimized model are consistent with those from pt. 1 of this assignment: bill_length_mm, bill_depth_mm, and sex. It appears that although we varied the levels of our dependent variable, the optimal model stayed the same. I don't know whether this observation extends to outcomes with even more levels (i.e., 4, 5, or 6 species), but it's at least thought-provoking.
Being that I provided a rather in-depth variable interpretation for the full model in pt. 1 of the assignment, I'll interpret the coefficients here in terms of the relationship between each level and the outcome (is this indicative of Adelie, Gentoo, or Chinstrap?). In a multinomial logistic model, each row of coefficients gives the log-odds of that species relative to the reference level, which here is Adelie (the first factor level):
ln(P(Chinstrap)/P(Adelie)) = -242.974 + 14.889 * bill_length_mm - 21.914 * bill_depth_mm - 36.905 * sexmale
ln(P(Gentoo)/P(Adelie)) = 121.047 + 14.806 * bill_length_mm - 44.697 * bill_depth_mm - 0.130 * sexmale
And while Adelie does not appear in our coefficients or summary statistics, that is because it serves as the baseline: when both log-odds expressions above are strongly negative, the predicted probability mass falls on Adelie. In other words, if a penguin is predicted to be neither Chinstrap nor Gentoo, it's an Adelie.
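To see how these two equations turn into class probabilities (via the softmax transformation over Chinstrap, Gentoo, and the Adelie baseline), here is a minimal sketch using a hypothetical penguin; the measurement values below are made up for illustration:
#Hypothetical penguin: 40 mm bill length, 19 mm bill depth, female
new_penguin <- data.frame(bill_length_mm = 40,
                          bill_depth_mm = 19,
                          sex = factor("female", levels = c("female", "male")))
#predict() evaluates both log-odds equations and normalizes into probabilities:
predict(s_model_2, newdata = new_penguin, type = "probs")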
What would be some of the fit statistics you would want to evaluate for your model in question #3? Feel free to share whatever you can provide.
For my model in question 3, I would want to evaluate:
Typical metrics: the p-value (at or below the 0.05 threshold preferred), residual deviance (lower values are better), McFadden's pseudo-R-squared (higher values are better), and AIC (lower values are better). These are typically generated via our summary statistics and thus provide the best early indication of how well our model represents the data it was generated from. Alternatively, we might use the glance() function from the broom package. We might also provide confidence intervals for the independent variables, as well as for our predictions, to gain better insight into the level of confidence behind our variable selection and/or our model's predictive capabilities.
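As a minimal sketch of pulling a few of these numbers for the model from question 3 (this assumes the broom package is installed; McFadden's pseudo-R-squared is computed by hand from the deviances):
library(broom)
glance(s_model_2) #one-row summary: edf, deviance, AIC, nobs
#McFadden's pseudo R-squared: 1 - (residual deviance / null deviance),
#where the null model contains only an intercept:
s_null <- multinom(species ~ 1, data = penguins)
1 - deviance(s_model_2) / deviance(s_null)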
Likelihood ratio test: we can compare the likelihood of our data under the full model against that of a model with only select variables. Removing independent variables will tend to make the model fit our data less accurately, but the test lets us check the statistical significance of the less expansive model.
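A minimal sketch of such a test for the two multinomial fits above, rolled by hand from their deviances (the chi-squared reference is an approximation, and a shaky one under near-perfect separation):
#LR statistic: deviance of the reduced model minus deviance of the full model
lr_stat <- deviance(s_model_2) - deviance(s_model_1)
#Degrees of freedom: difference in effective number of parameters
df_diff <- s_model_1$edf - s_model_2$edf
#p-value under the chi-squared approximation:
pchisq(lr_stat, df = df_diff, lower.tail = FALSE)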
Classification error rate: we can break our data into training and testing subsets, fit our model on the training set, and then observe the accuracy of its predictions on the held-out test set. This provides an applied accuracy metric based on our model's classification capabilities. Because the result depends on the particular training-test split, we can repeat the procedure over multiple splits (e.g., cross-validation) to build further confidence in the estimated error rate. A sketch follows below.
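A minimal sketch of that procedure for the question-3 model (the 80/20 proportion and the seed are arbitrary choices):
set.seed(123) #arbitrary seed for reproducibility
#80/20 train-test split:
train_idx <- sample(seq_len(nrow(penguins)), size = floor(0.8 * nrow(penguins)))
train <- penguins[train_idx, ]
test  <- penguins[-train_idx, ]
#Refit the parsimonious model on the training set only:
fit <- multinom(species ~ bill_length_mm + bill_depth_mm + sex, data = train)
#Classification error rate on the held-out test set:
pred <- predict(fit, newdata = test)
mean(pred != test$species)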
Being that I’m early in this Data Science journey, it’s more-than-likely that I’ve overlooked other useful fit statistics. This list is not comprehensive, rather it’s a “starting line” of sorts. One that will clarify via experimentation, exposure, feedback, time and experience.
In solving Homework #1, I referred to the following:
Akanksha Rawat. (2017). Binary Logistic Regression [article]. Retrieved from https://towardsdatascience.com/implementing-binary-logistic-regression-in-r-7d802a9d98fe
Parul Pandey. (2019). Simplifying the ROC and AUC metrics [article]. Retrieved from https://towardsdatascience.com/understanding-the-roc-and-auc-curves-a05b68550b69#:~:text=AUC%20stands%20for%20Area%20under,of%20one%20model%20to%20another.&text=This%20means%20that%20the%20Red%20curve%20is%20better.
Mohit Sharma. (2018). Multinomial Logistic Regression Using R [article]. Retrieved from https://datasciencebeginners.com/2018/12/20/multinomial-logistic-regression-using-r/
Alboukadel Kassambara. (2018). Regression Model Accuracy Metrics [article]. Retrieved from http://www.sthda.com/english/articles/38-regression-model-validation/158-regression-model-accuracy-metrics-r-square-aic-bic-cp-and-more/
R bloggers. (2015). Evaluating Logistic Regression Models [article]. Retrieved from https://www.r-bloggers.com/2015/08/evaluating-logistic-regression-models/