The purpose of the assignment was to explore logistic and multinomial logistic regression.
The penguin dataset has a 'species' column. Please check how many categories you have in the species column. Conduct whatever data manipulation you need to do to be able to build a logistic regression with a binary outcome. Please explain your reasoning behind your decision as you manipulate the outcome/dependent variable (species).
Please make sure you are evaluating the independent variables appropriately in deciding which ones should be in the model.
Provide variable interpretations in your model.
First, we perform a light exploration of our dataset to familiarize ourselves:
#Load the packages used throughout (assumption: all are installed):
library(palmerpenguins) #penguins dataset
library(tidyverse)      #glimpse, dplyr verbs, pipes
library(MASS)           #stepAIC
library(nnet)           #multinom
library(pROC)           #roc
#We're going to deal with the Palmer penguins dataset:
data(penguins)
#Perform 'light EDA' to familiarize ourselves with the dataset:
glimpse(penguins)
summary(penguins)
We then explore how many categories there are in the species column:
#Categories in species column
penguins %>%
  group_by(species) %>%
  count() #there are THREE categories
## # A tibble: 3 x 2
## # Groups: species [3]
## species n
## <fct> <int>
## 1 Adelie 152
## 2 Chinstrap 68
## 3 Gentoo 124
From the output above, we see that there are THREE categories in the species column: Adelie, Chinstrap, and Gentoo. Adelie is the most common species (152 observations), followed by Gentoo (124) and then Chinstrap (68).
Because Adelie is the most common species, I elected to use whether or not a penguin is an Adelie as the binary outcome for our logistic regression model.
First, we remove NA values, set our outcome to be binary, and then we apply logistic regression to a full model (all variables):
#Remove NAs
penguins <- penguins %>% na.omit() #this step is essential for consistent dataframe length (later)
#Adelie or NOT Adelie
penguins$adelie <- ifelse(penguins$species == 'Adelie', 1, 0)
#Build binary outcome logistic regression model (with all vars):
model_1 <- glm(adelie ~., family=binomial(link='logit'), data=penguins)
## Warning: glm.fit: algorithm did not converge
summary(model_1)
##
## Call:
## glm(formula = adelie ~ ., family = binomial(link = "logit"),
## data = penguins)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.409e-06 -2.409e-06 -2.409e-06 2.409e-06 2.409e-06
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.657e+01 5.242e+07 0 1
## speciesChinstrap -5.313e+01 1.121e+05 0 1
## speciesGentoo -5.313e+01 1.900e+05 0 1
## islandDream -1.090e-07 7.259e+04 0 1
## islandTorgersen -2.402e-06 7.574e+04 0 1
## bill_length_mm -2.923e-08 8.936e+03 0 1
## bill_depth_mm 1.535e-07 2.521e+04 0 1
## flipper_length_mm -1.611e-08 4.094e+03 0 1
## body_mass_g 1.164e-10 6.916e+01 0 1
## sexmale 7.256e-08 6.525e+04 0 1
## year -3.261e-09 2.621e+04 0 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4.5658e+02 on 332 degrees of freedom
## Residual deviance: 1.9319e-09 on 322 degrees of freedom
## AIC: 22
##
## Number of Fisher Scoring iterations: 25
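Before interpreting anything, it's worth pausing on the convergence warning and the uniform p-values of 1: species, the very column adelie was derived from, is still among the predictors, so the outcome is perfectly separated and the coefficients and standard errors above are not meaningful. A minimal sketch of a full model that drops species (output not shown; its estimates would differ from those above):
#Refit the 'full' model without species, since adelie is derived from it
#and its inclusion guarantees perfect separation:
model_1b <- glm(adelie ~ . - species, family = binomial(link = 'logit'),
                data = penguins)
summary(model_1b)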
For continuous variables (bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year), the interpretation is as follows: holding all other variables constant, a one-unit increase in the variable changes the log-odds of the penguin being Adelie by that variable's coefficient.
For categorical variables (species, island, sex), the interpretation is as follows: the coefficient is the change in the log-odds of being Adelie for that level relative to the reference level (e.g., sexmale relative to female, islandDream relative to Biscoe), holding all other variables constant.
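To make these interpretations more tangible, here is a minimal sketch of converting log-odds coefficients into odds ratios (on a separated fit like model_1 the resulting numbers are unstable and shouldn't be read literally; the technique is most useful on a well-behaved fit such as model_3 below):
#Exponentiating a logistic regression coefficient yields an odds ratio:
#values above 1 multiply the odds of Adelie per unit increase,
#values below 1 shrink them (for categorical terms: vs. the reference level).
exp(coef(model_1))
#For example, bill_length_mm in model_3 below has coefficient -2.2099, and
#exp(-2.2099) ~ 0.11: each extra mm of bill length multiplies the odds of
#being Adelie by roughly 0.11.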
In addition to the variable interpretations above, we can pull from the summary statistics, which provide an AIC value of 22 and p-values of 1 across the board.
From these statistics (and the separation issue flagged above), we determine that there's likely a better model, and thus we'll optimize based on AIC (by applying the stepAIC function) and on p-value (by adding variables back into our model one by one) to compare the model difference (if any):
#Optimize model based on AIC value:
model_2 <- stepAIC(model_1)
#Optimize model based on p-value:
model_3 <- glm(adelie ~ bill_length_mm + bill_depth_mm, family=binomial(link='logit'), data=penguins)
#Display both models' summaries:
summary(model_2)
summary(model_3)
The results of our models are provided below:
##
## Call:
## glm(formula = adelie ~ bill_length_mm + bill_depth_mm + sex,
## family = binomial(link = "logit"), data = penguins)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.278e-04 -2.000e-08 -2.000e-08 2.000e-08 5.131e-04
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1243.45 116369.47 0.011 0.991
## bill_length_mm -94.98 7016.74 -0.014 0.989
## bill_depth_mm 158.15 11657.45 0.014 0.989
## sexmale 142.52 170018.10 0.001 0.999
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4.5658e+02 on 332 degrees of freedom
## Residual deviance: 5.4973e-07 on 329 degrees of freedom
## AIC: 8
##
## Number of Fisher Scoring iterations: 25
##
## Call:
## glm(formula = adelie ~ bill_length_mm + bill_depth_mm, family = binomial(link = "logit"),
## data = penguins)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.58614 -0.00035 0.00000 0.00234 1.85243
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 24.1314 13.5362 1.783 0.07463 .
## bill_length_mm -2.2099 0.6847 -3.228 0.00125 **
## bill_depth_mm 3.9981 1.4838 2.694 0.00705 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 456.575 on 332 degrees of freedom
## Residual deviance: 18.708 on 330 degrees of freedom
## AIC: 24.708
##
## Number of Fisher Scoring iterations: 11
Optimizing based on AIC (Akaike Information Criterion) value resulted in a model (model_2) that:
- retained bill_length_mm, bill_depth_mm, and sex as predictors;
- lowered the AIC from 22 to 8;
- yet carried enormous coefficients, huge standard errors, and p-values near 1.
The magnitude of the p-values seemed strange to me, so I optimized once again, this time based on p-value. This resulted in a model (model_3) that:
- retained only bill_length_mm and bill_depth_mm;
- produced p-values well below 0.05 (0.00125 and 0.00705, respectively);
- but carried a higher AIC of 24.708.
Thus, we had conflicting models: on one hand a model with a superior (lower) AIC score, and on the other a model with superior p-values but a worse AIC. I elected to proceed with the lower-AIC model (model_2), since a lower AIC indicates a better trade-off between goodness of fit and model complexity.
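These AIC values can be sanity-checked by hand from AIC = 2k - 2*ln(L): both fits drive the residual deviance (and hence -2*ln(L)) to essentially zero, so AIC collapses to twice the parameter count, 2 * 11 = 22 for the full model and 2 * 4 = 8 for model_2. A quick check in R:
#AIC = 2k - 2*logLik; with log-likelihood ~0, AIC is twice the parameter count:
2 * attr(logLik(model_2), "df") - 2 * as.numeric(logLik(model_2))
AIC(model_2) #both should print 8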
Between the two models, the only difference was the inclusion of the sex variable, which lowered model_2's AIC while pushing all of its p-values toward 1. bill_length_mm and bill_depth_mm appear to provide the greatest indication of which species of penguin we're dealing with: alone, each had a p-value well below the 0.05 threshold (see model_3). Adding sex, however, lets the three variables separate Adelie from the other species almost perfectly, which is exactly what inflates the standard errors and p-values in model_2 even as the deviance (and hence the AIC) drops.
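As a cross-check on that reasoning, a minimal sketch of a likelihood-ratio test between the two nested fits (its chi-squared p-value should be read cautiously here, given the separation):
#Does adding 'sex' significantly reduce the deviance? model_3 is nested in model_2:
anova(model_3, model_2, test = "Chisq")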
As a next step, we assess the accuracy of our model …
For your model from #1, please provide: AUC, Accuracy, TPR, FPR, TNR, FNR.
#Summarize the fitted probabilities produced by model_2:
summary(model_2$fitted.values)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.0000  0.0000  0.4384  1.0000  1.0000
#Analyze histogram of our fitted values
hist(model_2$fitted.values, main = "Histogram", xlab = "Probability of Adelie", col = "blue")
Based on the summary statistics, we get an idea of the range of values produced by our model, and when we visualize them, we see that the fitted probabilities have been batched into two camps: 0.0 and 1.0, where 1.0 can be interpreted as species Adelie and 0.0 as NOT species Adelie.
The NOT-Adelie bar being taller than the Adelie bar is consistent with the class balance in the data (187 non-Adelie vs. 146 Adelie observations), but we'll dig further into the accuracy of our model via the AUC and a confusion matrix:
#Sanity check: after na.omit, the outcome and the fitted values line up (333 each)
#length(penguins$adelie)        #333
#length(model_2$fitted.values)  #333
#Add Prediction column to penguins df based on model_2's fitted values:
penguins$Predict <- ifelse(model_2$fitted.values > 0.5,"+ve","-ve")
#Observe Confusion matrix:
mytable <- table(penguins$adelie,penguins$Predict)
rownames(mytable) <- c("Obs -ve","Obs +ve")
colnames(mytable) <- c("Pred -ve","Pred +ve")
mytable
##
## Pred -ve Pred +ve
## Obs -ve 187 0
## Obs +ve 0 146
#Accuracy: sum of the diagonal (true predictions) over the sum of all values
sum(diag(mytable)) / sum(mytable)
## [1] 1
Based on the confusion matrix, our results are as follows (computed in the sketch below):
TPR (sensitivity): 146 / 146 = 1.0
TNR (specificity): 187 / 187 = 1.0
FPR: 0 / 187 = 0.0
FNR: 0 / 146 = 0.0
We then take the sum of the diagonal of our confusion matrix (our True values) and divide it by the sum of all values. As you may have expected, the output is 1.0, or 100% accuracy.
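For completeness, here is a minimal sketch computing the rates above directly from mytable (assuming the row/column labels assigned earlier):
#Extract the four cells of the confusion matrix:
TN <- mytable["Obs -ve", "Pred -ve"]; FP <- mytable["Obs -ve", "Pred +ve"]
FN <- mytable["Obs +ve", "Pred -ve"]; TP <- mytable["Obs +ve", "Pred +ve"]
#Rates requested in the question:
c(TPR = TP / (TP + FN),  #sensitivity
  TNR = TN / (TN + FP),  #specificity
  FPR = FP / (FP + TN),
  FNR = FN / (FN + TP))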
*Note: we dropped NA values prior to constructing our model, removing 11 observations that might otherwise have challenged the model and produced false positives or negatives.
Regardless, based on our Confusion Matrix, the model appears to be strong. We then move on to verify our model’s ROC and AUC:
#Observe ROC and AUC:
roc(adelie ~ model_2$fitted.values, data = penguins, plot = TRUE, main = "ROC CURVE", col = "blue")
##
## Call:
## roc.formula(formula = adelie ~ model_2$fitted.values, data = penguins, plot = TRUE, main = "ROC CURVE", col = "blue")
##
## Data: model_2$fitted.values in 187 controls (adelie 0) < 146 cases (adelie 1).
## Area under the curve: 1
The ROC (Receiver Operating Characteristic) curve plots Sensitivity against Specificity across classification thresholds to measure the performance of a model. In our case, the ROC "curve" traces the full unit box, holding a Sensitivity of 1.0 at a Specificity of 1.0.
The AUC (Area Under the Curve) can be read as the probability that a randomly chosen Adelie penguin receives a higher fitted value from our model than a randomly chosen non-Adelie penguin. A value of 1.0 in this realm is a perfect score, and our model received a 1.0 :)
Based on the AUC, accuracy, and confusion matrix, we appear to have a very strong model on our hands (keeping in mind that the perfect scores reflect the fact that these predictors separate Adelie completely within this sample). Thus, choosing to optimize our model based on its AIC score appears to have been a good decision in this case.
Please apply multinomial logistic regression where your outcome variable is ‘species’.
Evaluate the independent variables appropriately to fit your most parsimonious model.
Please be sure to interpret your variables in the model.
I took “most parsimonious model” to mean that we’re looking for the optimal prediction with as few predictor variables as possible. In other words, we want our model to be as accurate as possible with as few independent variables as possible:
As such, I followed a similar approach to the binary case above: NA values were removed, the model was fit (this time using the multinom() function), and the model was then optimized using the stepAIC() function:
#'Refresh' the Palmer penguins dataset:
data(penguins)
#Remove NAs
penguins <- penguins %>% na.omit()
#is.factor(penguins$species)
#levels(penguins$species)
#Build multinomial logistic regression model (with all vars):
s_model_1 <- multinom(species ~., data=penguins, model = TRUE)
#Optimize model based on AIC value (as per #2):
s_model_2 <- stepAIC(s_model_1)
#Display both models' summaries:
summary(s_model_1)
summary(s_model_2)
## Call:
## multinom(formula = species ~ ., data = penguins, model = TRUE)
##
## Coefficients:
## (Intercept) islandDream islandTorgersen bill_length_mm bill_depth_mm
## Chinstrap -0.2496435 211.62912 97.64471 32.96624 -60.84017
## Gentoo 0.7633414 -68.84639 -140.79025 28.45770 -42.86214
## flipper_length_mm body_mass_g sexmale year
## Chinstrap 0.01515612 -0.0193027 -70.73544 -0.2242371
## Gentoo -2.79321661 0.1330873 -67.23109 -0.2391631
##
## Std. Errors:
## (Intercept) islandDream islandTorgersen bill_length_mm bill_depth_mm
## Chinstrap 0.02157370 0.02157370 6.871032e-16 0.2205894 0.1689163
## Gentoo 0.02114133 0.02114131 5.401394e-53 1.0993488 0.4376254
## flipper_length_mm body_mass_g sexmale year
## Chinstrap 4.937946 25.79546 0.02113835 42.20323
## Gentoo 4.439678 101.47836 0.02114132 42.45178
##
## Residual Deviance: 7.087421e-06
## AIC: 36.00001
## Call:
## multinom(formula = species ~ bill_length_mm + bill_depth_mm +
## sex, data = penguins, model = TRUE)
##
## Coefficients:
## (Intercept) bill_length_mm bill_depth_mm sexmale
## Chinstrap -242.9740 14.88938 -21.91411 -36.9051125
## Gentoo 121.0476 14.80645 -44.69711 -0.1301506
##
## Std. Errors:
## (Intercept) bill_length_mm bill_depth_mm sexmale
## Chinstrap 3.985819 9.369182 23.16685 23.33386
## Gentoo 4.071031 9.413971 23.17697 43.25185
##
## Residual Deviance: 0.9809986
## AIC: 16.981
When we look at the full model's summary statistics, we get a model that:
- uses every available predictor (island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, year);
- drives the residual deviance to essentially zero (7.09e-06), i.e., a perfect fit to the training data;
- carries an AIC of 36.00001.
Optimizing based on AIC (Akaike Information Criterion) value resulted in a model that:
- retains only bill_length_mm, bill_depth_mm, and sex;
- has a residual deviance of 0.981;
- lowers the AIC to 16.981.
If you recall, the variables that remain in our AIC-optimized model are consistent with those from pt. 1 of this assignment: bill_length_mm, bill_depth_mm, and sex. It appears that although we varied the levels of our dependent variable, the optimal model stayed the same. I don't know whether this observation extends to outcomes with even more levels (i.e., 4, 5, or 6 species), but it's at least thought-provoking.
Being that I provided a rather in-depth variable interpretation for the full model in pt. 1 of the assignment, I'll interpret the coefficients here in terms of the relationship between each level and the outcome (is this indicative of Adelie, Gentoo, or Chinstrap?). In a multinomial logistic model, each row of coefficients gives the log-odds of that species relative to the reference level, which here is Adelie (the first factor level):
ln(P(Chinstrap)/P(Adelie)) = -242.974 + 14.889 * bill_length_mm - 21.914 * bill_depth_mm - 36.905 * sexmale
ln(P(Gentoo)/P(Adelie)) = 121.047 + 14.806 * bill_length_mm - 44.697 * bill_depth_mm - 0.130 * sexmale
And while Adelie does not appear in our coefficients or summary statistics, that is because it serves as the baseline: when both log-odds expressions above are strongly negative, the predicted probability mass falls on Adelie. In other words, if a penguin is predicted to be neither Chinstrap nor Gentoo, it's an Adelie.
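To see how these two equations turn into class probabilities (via the softmax transformation over Chinstrap, Gentoo, and the Adelie baseline), here is a minimal sketch using a hypothetical penguin; the measurement values below are made up for illustration:
#Hypothetical penguin: 40 mm bill length, 19 mm bill depth, female
new_penguin <- data.frame(bill_length_mm = 40,
                          bill_depth_mm = 19,
                          sex = factor("female", levels = c("female", "male")))
#predict() evaluates both log-odds equations and normalizes into probabilities:
predict(s_model_2, newdata = new_penguin, type = "probs")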
What would be some of the fit statistics you would want to evaluate for your model in question #3? Feel free to share whatever you can provide.
For my model in question 3, I would want to evaluate:
Typical metrics: the p-value (at or below the 0.05 threshold preferred), residual deviance (lower values are better), McFadden's pseudo-R-squared (higher values are better), and AIC (lower values are better). These are typically generated via our summary statistics and thus provide the best early indication of how well our model represents the data it was generated from. Alternatively, we might use the glance() function from the broom package. We might also provide confidence intervals for the independent variables, as well as for our predictions, to gain better insight into the level of confidence behind our variable selection and/or our model's predictive capabilities.
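As a minimal sketch of pulling a few of these numbers for the model from question 3 (this assumes the broom package is installed; McFadden's pseudo-R-squared is computed by hand from the deviances):
library(broom)
glance(s_model_2) #one-row summary: edf, deviance, AIC, nobs
#McFadden's pseudo R-squared: 1 - (residual deviance / null deviance),
#where the null model contains only an intercept:
s_null <- multinom(species ~ 1, data = penguins)
1 - deviance(s_model_2) / deviance(s_null)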
Likelihood ratio test: we can compare the likelihood of our data under the full model against that of a model with only select variables. Removing independent variables will tend to make the model fit our data less accurately, but the test lets us check the statistical significance of the less expansive model.
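A minimal sketch of such a test for the two multinomial fits above, rolled by hand from their deviances (the chi-squared reference is an approximation, and a shaky one under near-perfect separation):
#LR statistic: deviance of the reduced model minus deviance of the full model
lr_stat <- deviance(s_model_2) - deviance(s_model_1)
#Degrees of freedom: difference in effective number of parameters
df_diff <- s_model_1$edf - s_model_2$edf
#p-value under the chi-squared approximation:
pchisq(lr_stat, df = df_diff, lower.tail = FALSE)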
Classification error rate: we can break our data into training and testing subsets, fit our model on the training set, and then observe the accuracy of its predictions on the held-out test set. This provides an applied accuracy metric based on our model's classification capabilities. Because the result depends on the particular training-test split, we can repeat the procedure over multiple splits (e.g., cross-validation) to build further confidence in the estimated error rate. A sketch follows below.
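A minimal sketch of that procedure for the question-3 model (the 80/20 proportion and the seed are arbitrary choices):
set.seed(123) #arbitrary seed for reproducibility
#80/20 train-test split:
train_idx <- sample(seq_len(nrow(penguins)), size = floor(0.8 * nrow(penguins)))
train <- penguins[train_idx, ]
test  <- penguins[-train_idx, ]
#Refit the parsimonious model on the training set only:
fit <- multinom(species ~ bill_length_mm + bill_depth_mm + sex, data = train)
#Classification error rate on the held-out test set:
pred <- predict(fit, newdata = test)
mean(pred != test$species)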
Being that I’m early in this Data Science journey, it’s more-than-likely that I’ve overlooked other useful fit statistics. This list is not comprehensive, rather it’s a “starting line” of sorts. One that will clarify via experimentation, exposure, feedback, time and experience.
In solving Homework #1, I referred to the following:
Akanksha Rawat. (2017). Binary Logistic Regression [article]. Retrieved from https://towardsdatascience.com/implementing-binary-logistic-regression-in-r-7d802a9d98fe
Parul Pandey. (2019). Simplifying the ROC and AUC metrics [article]. Retrieved from https://towardsdatascience.com/understanding-the-roc-and-auc-curves-a05b68550b69#:~:text=AUC%20stands%20for%20Area%20under,of%20one%20model%20to%20another.&text=This%20means%20that%20the%20Red%20curve%20is%20better.
Mohit Sharma. (2018). Multinomial Logistic Regression Using R [article]. Retrieved from https://datasciencebeginners.com/2018/12/20/multinomial-logistic-regression-using-r/
Alboukadel Kassambara. (2018). Regression Model Accuracy Metrics [article]. Retrieved from http://www.sthda.com/english/articles/38-regression-model-validation/158-regression-model-accuracy-metrics-r-square-aic-bic-cp-and-more/
R bloggers. (2015). Evaluating Logistic Regression Models [article]. Retrieved from https://www.r-bloggers.com/2015/08/evaluating-logistic-regression-models/