A) The penguin dataset has a ‘species’ column. Please check how many categories you have in the species column. Conduct whatever data manipulation you need to do to be able to build a logistic regression with binary outcome. Please explain your reasoning behind your decision as you manipulate the outcome/dependent variable (species).
First, we’ll save the penguin dataset as a tibble so we can manipulate it more easily in the next steps:
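A minimal sketch of that setup step, assuming the data come from the palmerpenguins package and that dplyr is loaded for the manipulation that follows (palmerpenguins already supplies the data as a tibble, so this mainly creates a local working copy):
# assumed setup: load the data and packages used below
library(palmerpenguins)
library(dplyr)
penguins <- as_tibble(penguins)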
Next, as mentioned above, we need to identify the number of categories that make up the species column. We can do this by running the code below:
summary(penguins$species)
## Adelie Chinstrap Gentoo
## 152 68 124
As we can see above, there are three species present in this dataset. Therefore, in order to build a logistic regression with a binary outcome, we’ll have to determine which two species should be combined. We can also see from this summary that there are relatively few Chinstrap penguins in the dataset, which is an initial indication that combining this species with either Adelie or Gentoo may be best. Before doing so, however, we can look at the summary of all of the variables in the dataset, as well as the relationships among our continuous variables: bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g.
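The full summary shown below was presumably produced with a single call:
summary(penguins)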
## species island bill_length_mm bill_depth_mm
## Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
## Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
## Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_length_mm body_mass_g sex year
## Min. :172.0 Min. :2700 female:165 Min. :2007
## 1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
## Median :197.0 Median :4050 NA's : 11 Median :2008
## Mean :200.9 Mean :4202 Mean :2008
## 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
## Max. :231.0 Max. :6300 Max. :2009
## NA's :2 NA's :2
It looks like 11 individuals are missing sex information, including two individuals with NAs across most of the features. Therefore, I’ll omit all rows with missing values from the analysis:
penguins <- na.omit(penguins)
Next, we can create plots that show the relationships between some of our continuous variables. Interestingly, when we break the plots out by species, many of the feature pairings show Adelie and Chinstrap penguins (red and green) clustering together, while Gentoo penguins contrast with the other two species in most pairings. This is also confirmed by most of the single-variable distributions split out by species: with the exception of bill_length_mm, the distributions of body_mass_g, bill_depth_mm, and flipper_length_mm all overlap substantially between the Adelie and Chinstrap species.
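A sketch of how such plots could be produced, assuming the GGally and ggplot2 packages (the original figures are not reproduced here):
library(GGally)
library(ggplot2)
# pairwise scatterplots and per-species distributions of the continuous features
ggpairs(penguins,
        columns = c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"),
        mapping = aes(color = species))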
With this in mind, I think it’s best to combine the Chinstrap species with the Adelie species to create the binary outcome for our logistic regression. To do this, we can add an extra column that bifurcates the species variable:
penguins <- penguins %>%
  mutate(species_rollup = ifelse(species == 'Adelie' | species == 'Chinstrap', 'Adelie-Chinstrap', 'Gentoo'))
And now we can see the split between the two new groups of penguins:
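These counts come from tabulating the new column (the same call is used again in the downsampling step later):
table(penguins$species_rollup)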
##
## Adelie-Chinstrap Gentoo
## 214 119
B) Please make sure you are evaluating the independent variables appropriately in deciding which ones should be in the model.
Checking kurtosis and transforming variables
Now that the outcome variable species has been rolled up into two distinct groups of penguins, the next step is to evaluate the independent variables in the dataset accordingly. A good first step is to look at the distribution of each feature to get a sense of its skewness and kurtosis:
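One way those distribution plots could be produced, assuming ggplot2 and tidyr (the original figure is omitted):
library(ggplot2)
library(tidyr)
# histogram of each continuous feature, each on its own scale
penguins %>%
  pivot_longer(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g),
               names_to = "feature", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ feature, scales = "free")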
From the distributions above, we can see that several of our continuous variables are noticeably skewed. To alleviate this before fitting the logistic regression, we can apply transformations that produce more normal-looking distributions.
# transform variables
penguins$bill_length_transformed <- abs(penguins$bill_length_mm - mean(penguins$bill_length_mm))^(1/2)
penguins$flipper_length_transformed <- abs(penguins$flipper_length_mm - mean(penguins$flipper_length_mm))^(1/2)
As we can see above, we were able to transform bill_length_mm and flipper_length_mm into more normal-looking distributions. However, we will have to see whether these transformed variables remain interpretable when we run our logistic regression.
Checking for multicollinearity
Additionally, when evaluating the independent variables, it’s important to take multicollinearity into consideration. Multicollinearity can inflate the variances of our parameter estimates, which could lead to incorrect inferences about the relationships between the dependent variable (species) and the independent variables. Therefore, I created a correlation plot to check the collinearity among our continuous independent variables.
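A sketch of one way such a correlation plot could be generated, assuming the corrplot package (the original figure is not shown here):
library(corrplot)
# correlation matrix of the continuous predictors
cont_vars <- penguins[, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")]
corrplot(cor(cont_vars), method = "number")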
We can see above that flipper_length_mm and body_mass_g have a fairly strong positive correlation with one another, which makes sense given that penguins with larger body mass typically have longer flippers. Since flipper_length_mm also showed a bimodal distribution that required a fairly involved transformation, we may consider excluding this feature from the logistic regression, given its strong relationship with body_mass_g.
Downsampling the Adelie-Chinstrap group
One final step is to take a quick look at the counts for the binary species_rollup variable we created earlier as the outcome for our logistic regression:
table(penguins$species_rollup)
##
## Adelie-Chinstrap Gentoo
## 214 119
Unfortunately, there is a class imbalance here: the dataset contains almost twice as many Adelie-Chinstrap individuals as Gentoo individuals. This could affect the logistic regression and eventual model performance. To create a more balanced subset, we can downsample the Adelie-Chinstrap group to match the 119 Gentoo penguins. To do this, we can use the downSample() function from the caret package, which randomly selects 119 individuals from the Adelie-Chinstrap group to use for our model:
penguins_rollup_factor <- factor(penguins$species_rollup)
downsample_penguins <- downSample(penguins, penguins_rollup_factor, list = FALSE, yname = "Class")
table(downsample_penguins$species_rollup)
##
## Adelie-Chinstrap Gentoo
## 119 119
Now we have an equal split of 119 individuals per group in our new downsample_penguins dataset. With these adjustments, we are ready to start running our logistic regression models, after encoding the outcome numerically (Adelie-Chinstrap = 1, Gentoo = 0):
# numeric outcome: Adelie-Chinstrap = 1, Gentoo = 0
downsample_penguins$penguins_species_fnl <- ifelse(downsample_penguins$species_rollup == 'Gentoo', 0, 1)
Running the logistic regression
First, we’ll split our downsample_penguins dataset into training and testing sets (80% training, 20% testing). This is necessary so that we can measure model performance on a holdout test set later on.
set.seed(1234567)
# utilizing one dataset for all four models
penguins_partition <- createDataPartition(downsample_penguins$species_rollup, p=0.8, list=FALSE)
penguins_training <- downsample_penguins[penguins_partition,]
penguins_testing <- downsample_penguins[-penguins_partition,]
Next, after running a few preliminary models that showed little difference between the transformed variables I had created and the original variables, I decided to keep the original variables in order to ease interpretation of the log odds. Additionally, I decided not to use bill_depth_mm, island, sex, and year, since including them appeared to overfit the model and produced fitted probabilities that were numerically 0 or 1 (a sign of separation). Therefore, for the first model, I included three variables, bill_length_mm, flipper_length_mm, and body_mass_g, in the logistic regression and used stepwise selection to determine whether any should be removed:
model1 <- glm(penguins_species_fnl ~ bill_length_mm + flipper_length_mm + body_mass_g , data = penguins_training, family = binomial)
model2 <- model1 %>% stepAIC(trace=FALSE)
summary(model2)
##
## Call:
## glm(formula = penguins_species_fnl ~ bill_length_mm + flipper_length_mm +
## body_mass_g, family = binomial, data = penguins_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.22345 -0.00756 -0.00001 0.00331 2.12733
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 154.110258 46.893897 3.286 0.00101 **
## bill_length_mm 0.391771 0.211737 1.850 0.06427 .
## flipper_length_mm -0.751099 0.239089 -3.142 0.00168 **
## body_mass_g -0.003938 0.002432 -1.619 0.10536
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 266.169 on 191 degrees of freedom
## Residual deviance: 17.618 on 188 degrees of freedom
## AIC: 25.618
##
## Number of Fisher Scoring iterations: 10
As we can see, stepAIC did not remove any of the less statistically significant variables, since dropping them would not have improved the AIC. However, in model 3, I decided to remove bill_length_mm to see what impact this has on the AIC.
Before doing so, however, I wanted to check the VIF scores to assess multicollinearity among the predictors.
# print variable inflation factor score
print('VIF scores of predictors')
## [1] "VIF scores of predictors"
VIF(model2)
## bill_length_mm flipper_length_mm body_mass_g
## 2.159405 2.145169 1.101884
Fortunately, all of the VIF scores are quite low (below 5), indicating that multicollinearity among these predictors is not a concern.
model3 <- glm(penguins_species_fnl ~ flipper_length_mm + body_mass_g, data = penguins_training, family = binomial)
summary(model3)
##
## Call:
## glm(formula = penguins_species_fnl ~ flipper_length_mm + body_mass_g,
## family = binomial, data = penguins_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.67952 -0.02032 -0.00003 0.00835 2.73596
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 140.502135 42.189796 3.330 0.000868 ***
## flipper_length_mm -0.622712 0.198249 -3.141 0.001683 **
## body_mass_g -0.002802 0.001918 -1.461 0.143913
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 266.169 on 191 degrees of freedom
## Residual deviance: 23.408 on 189 degrees of freedom
## AIC: 29.408
##
## Number of Fisher Scoring iterations: 10
After running this third model, we see a small increase in AIC. I also noticed through this process that flipper_length_mm seemed to be quite predictive of our bifurcated species variable, so I ran one final model using only flipper_length_mm as the explanatory variable:
model4 <- glm(penguins_species_fnl ~ flipper_length_mm, data = penguins_training, family = binomial)
summary(model4)
##
## Call:
## glm(formula = penguins_species_fnl ~ flipper_length_mm, family = binomial,
## data = penguins_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.07052 -0.04469 -0.00018 0.01180 2.38666
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 141.4275 37.5453 3.767 0.000165 ***
## flipper_length_mm -0.6867 0.1811 -3.793 0.000149 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 266.169 on 191 degrees of freedom
## Residual deviance: 26.192 on 190 degrees of freedom
## AIC: 30.192
##
## Number of Fisher Scoring iterations: 9
This model showed a slight increase in AIC over the previous models, and I worry that if this model or model #3 were used, it might not yield predictions as accurate as model #1’s. We can examine these models during the evaluation stage in the next part. First, however, I’ll briefly break down the log odds and interpret the variables of the chosen logistic regression model below.
C) Provide variable interpretations in your model
Given that our initial model turned out to be the best parsimonious model (based on AIC), examining its summary output yields the following interpretations:
Bill Length (mm) – for each one-millimeter increase in bill length, the odds of an individual being of the Adelie or Chinstrap species are multiplied by exp(0.39) ≈ 1.48, i.e. they increase by roughly 48%.
Flipper Length (mm) – for each one-millimeter increase in flipper length, the odds of being of the Adelie or Chinstrap species are multiplied by exp(-0.75) ≈ 0.47, i.e. they decrease by roughly 53%.
Body Mass (g) – for each one-gram increase in body mass, the odds of being of the Adelie or Chinstrap species are multiplied by exp(-0.0039) ≈ 0.996, i.e. they decrease by roughly 0.4%.
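These odds ratios can be read directly from the fitted model by exponentiating its coefficients (a quick check on the numbers above; model2 contains the same terms as model1 after stepwise selection):
# odds ratios for the selected model
exp(coef(model2))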
A) For your model from #1, please provide: AUC, Accuracy, TPR, FPR, TNR, FNR
Although training-set performance is not our final evaluation, it is good to confirm that these functions work properly before applying them to the testing set. Therefore, below we examine the predictions and confusion matrix for our chosen model (model #1) on the training data:
predictTrain = predict(model1, type = "response")
penguins_training$target = round(predictTrain, 1)
table(penguins_training$penguins_species_fnl, predictTrain > 0.5)
##
## FALSE TRUE
## 0 95 1
## 1 2 94
We can also check the accuracy of the predictions on the training set:
penguins_training = penguins_training %>%
mutate(accuracy = 1*(round(target, 0) == round(penguins_species_fnl, 0)))
paste0('Accuracy on Training Set: ', sum(penguins_training$accuracy)/nrow(penguins_training))
## [1] "Accuracy on Training Set: 0.979166666666667"
Measuring Accuracy, TPR, FPR, TNR, and FNR on testing set
Now, we can evaluate our chosen model #1 on our holdout testing set. Initially, we’ll set our threshold at 0.5, to determine whether or not our predictions should be classified into the Gentoo category (less than or equal to 0.5) or Adelie-Chinstrap category (greater than 0.5):
predictTest = predict(model1, type = "response", newdata = penguins_testing)
penguins_testing$target = round(predictTest, 1)
confusion_matrix <- table(penguins_testing$penguins_species_fnl, predictTest > 0.5)
confusion_matrix
##
## FALSE TRUE
## 0 23 0
## 1 1 22
Calculating TPR, TNR, FPR, and FNR
# tp = true positive (Adelie-Chinstrap, coded 1, correctly predicted)
# tn = true negative (Gentoo, coded 0, correctly predicted)
# fp = false positive
# fn = false negative
tp <- confusion_matrix[2, 2]   # actual 1, predicted TRUE
tn <- confusion_matrix[1, 1]   # actual 0, predicted FALSE
fp <- confusion_matrix[1, 2]   # actual 0, predicted TRUE
fn <- confusion_matrix[2, 1]   # actual 1, predicted FALSE
p <- tp + fn   # actual positives (Adelie-Chinstrap)
n <- tn + fp   # actual negatives (Gentoo)
tpr <- tp / p
tnr <- tn / n
fpr <- fp / n
fnr <- fn / p
title_row <- c('True Positive Rate (TPR)', 'True Negative Rate (TNR)', 'False Positive Rate (FPR)', 'False Negative Rate (FNR)')
values <- c(tpr, tnr, fpr, fnr)
results <- data.frame(title_row, values)
colnames(results) <- c('Measure', 'Value')
results
## Measure Value
## 1 True Positive Rate (TPR) 0.95652174
## 2 True Negative Rate (TNR) 1.00000000
## 3 False Positive Rate (FPR) 0.00000000
## 4 False Negative Rate (FNR) 0.04347826
After calculating TPR, TNR, FPR, and FNR, it’s a good sign that both FPR and FNR are low. The model classified every Gentoo penguin in the testing set correctly, and correctly identified all but one of the Adelie/Chinstrap penguins.
Accuracy
penguins_testing = penguins_testing %>%
mutate(accuracy = 1*(round(target, 0) == round(penguins_species_fnl, 0)))
paste0('Accuracy on Testing Set: ', sum(penguins_testing$accuracy)/nrow(penguins_testing))
## [1] "Accuracy on Testing Set: 0.978260869565217"
Checking the accuracy of the predictions on our testing set, model 1 was roughly 98% accurate in predicting whether a penguin was a Gentoo or an Adelie/Chinstrap.
Calculating AUC
We can plot an ROC curve and calculate the AUC using the pROC package, which lets us examine performance across classification thresholds.
(I was having caching issues when knitting this document, so a screenshot of the ROC plot is included instead of the rendered output.)
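For reference, a sketch of how the ROC curve and AUC could be computed with pROC (the exact code behind the screenshot may differ slightly):
library(pROC)
# ROC curve and AUC for model 1 on the testing set
roc_model1 <- roc(penguins_testing$penguins_species_fnl, predictTest)
plot(roc_model1)
auc(roc_model1)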
Finally, the AUC of our chosen model #1 is very good at 0.994. This is very close to 1 and is consistent with our accuracy calculation showing that almost all predictions were correct.
A) Please fit a multinomial logistic regression where your outcome variable is ‘species’.
B) Please be sure to evaluate the independent variables appropriately to fit your best parsimonious model.
For this last part, we’ll use the original species variable to build a multinomial logistic regression model. This way, we do not have to bifurcate species into two categories; instead, a different type of logistic regression gives us predictions for all three species. To do this, I took the training dataset from Part 1, since it is already split appropriately, and used the multinom() function from the nnet package (with broom’s tidy() to inspect the coefficients) to run our first model.
Model 1
In this initial model, I kept all variables in, just to get a baseline AIC to compare future models on the training data.
multinom_model1 <- multinom(outcome_level ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g + sex + year + bill_length_transformed + flipper_length_transformed, data = penguins_training_mod)
## # weights: 30 (18 variable)
## initial value 210.933559
## iter 10 value 28.330919
## iter 20 value 2.932103
## iter 30 value 0.107010
## final value 0.000007
## converged
tidy(multinom_model1, conf.int = FALSE, conf.level = 0.95, exponentiate = FALSE)
## # A tibble: 18 x 6
## y.level term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie (Intercept) -0.328 0.00456 -72.0 0.
## 2 Adelie bill_length_mm -10.7 8.36 -1.28 2.00e- 1
## 3 Adelie bill_depth_mm 57.4 11.5 5.00 5.84e- 7
## 4 Adelie flipper_length_mm -3.28 72.9 -0.0450 9.64e- 1
## 5 Adelie body_mass_g -0.0511 5.04 -0.0101 9.92e- 1
## 6 Adelie sexmale 46.9 0.214 219. 0.
## 7 Adelie year 0.175 7.99 0.0220 9.82e- 1
## 8 Adelie bill_length_transformed -22.0 1.46 -15.1 2.24e- 51
## 9 Adelie flipper_length_transformed 14.8 12.2 1.21 2.28e- 1
## 10 Chinstrap (Intercept) 0.409 0.00456 89.7 0.
## 11 Chinstrap bill_length_mm 17.0 8.36 2.03 4.22e- 2
## 12 Chinstrap bill_depth_mm 46.9 11.5 4.08 4.52e- 5
## 13 Chinstrap flipper_length_mm 0.0369 72.9 0.000506 1.00e+ 0
## 14 Chinstrap body_mass_g -0.173 5.04 -0.0344 9.73e- 1
## 15 Chinstrap sexmale -62.7 0.214 -293. 0.
## 16 Chinstrap year -0.408 7.99 -0.0510 9.59e- 1
## 17 Chinstrap bill_length_transformed -43.9 1.46 -30.1 9.97e-199
## 18 Chinstrap flipper_length_transformed 24.1 12.2 1.97 4.93e- 2
summary(multinom_model1)
## Call:
## multinom(formula = outcome_level ~ bill_length_mm + bill_depth_mm +
## flipper_length_mm + body_mass_g + sex + year + bill_length_transformed +
## flipper_length_transformed, data = penguins_training_mod)
##
## Coefficients:
## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Adelie -0.3282549 -10.72648 57.44573 -3.28013884
## Chinstrap 0.4093077 16.98542 46.89662 0.03690183
## body_mass_g sexmale year bill_length_transformed
## Adelie -0.05114594 46.85619 0.1754729 -21.99939
## Chinstrap -0.17332011 -62.66099 -0.4075378 -43.88028
## flipper_length_transformed
## Adelie 14.75825
## Chinstrap 24.05821
##
## Std. Errors:
## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Adelie 0.004561299 8.362393 11.49727 72.94906
## Chinstrap 0.004561299 8.362393 11.49727 72.94906
## body_mass_g sexmale year bill_length_transformed
## Adelie 5.043124 0.2142117 7.988299 1.458978
## Chinstrap 5.043124 0.2142117 7.988299 1.458978
## flipper_length_transformed
## Adelie 12.23654
## Chinstrap 12.23654
##
## Residual Deviance: 1.365053e-05
## AIC: 36.00001
We can see that the AIC value is around 36, and the residual deviance is essentially zero, which suggests the full model separates the training data almost perfectly and is likely overfit. Looking at the p-values and summary statistics, it appears best to remove year and sex, neither of which performed well in our binomial logistic regression earlier, as well as bill_length_transformed and flipper_length_transformed, since the transformed versions did not improve on the original variables in the earlier models.
Model 2
When we re-run the model, removing these variables, we can see that there is an improvement in AIC (from 36 to 20). However, we’ll continue to remove a few variables to see if we can improve our model even more.
multinom_model2 <- multinom(outcome_level ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g, data = penguins_training_mod)
## # weights: 18 (10 variable)
## initial value 210.933559
## iter 10 value 10.367979
## iter 20 value 2.784343
## iter 30 value 0.794853
## iter 40 value 0.001436
## final value 0.000061
## converged
tidy(multinom_model2, conf.int = FALSE, conf.level = 0.95, exponentiate = FALSE)
## # A tibble: 10 x 6
## y.level term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie (Intercept) 6.02 0.126 47.8 0.
## 2 Adelie bill_length_mm -32.7 8.50 -3.85 1.20e- 4
## 3 Adelie bill_depth_mm 48.2 4.89 9.85 6.54e-23
## 4 Adelie flipper_length_mm 3.09 16.1 0.192 8.48e- 1
## 5 Adelie body_mass_g -0.00376 0.828 -0.00455 9.96e- 1
## 6 Chinstrap (Intercept) -5.09 0.944 -5.39 7.09e- 8
## 7 Chinstrap bill_length_mm 22.7 9.06 2.51 1.21e- 2
## 8 Chinstrap bill_depth_mm 1.37 77.3 0.0177 9.86e- 1
## 9 Chinstrap flipper_length_mm -2.36 20.9 -0.113 9.10e- 1
## 10 Chinstrap body_mass_g -0.143 1.07 -0.134 8.94e- 1
summary(multinom_model2)
## Call:
## multinom(formula = outcome_level ~ bill_length_mm + bill_depth_mm +
## flipper_length_mm + body_mass_g, data = penguins_training_mod)
##
## Coefficients:
## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Adelie 6.017933 -32.68266 48.164548 3.093349
## Chinstrap -5.086096 22.71875 1.366784 -2.362707
## body_mass_g
## Adelie -0.003764257
## Chinstrap -0.142928375
##
## Std. Errors:
## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Adelie 0.1258945 8.499668 4.887449 16.12609
## Chinstrap 0.9438028 9.057746 77.339941 20.93772
## body_mass_g
## Adelie 0.8280631
## Chinstrap 1.0704038
##
## Residual Deviance: 0.0001212095
## AIC: 20.00012
Model 3
In the third iteration of the model, we removed flipper_length_mm, which had a high p-value and was likely introducing multicollinearity with body_mass_g, as we saw in our initial variable examination earlier in the model-building process.
multinom_model3 <- multinom(outcome_level ~ bill_length_mm + bill_depth_mm + body_mass_g, data = penguins_training_mod)
## # weights: 15 (8 variable)
## initial value 210.933559
## iter 10 value 7.375240
## iter 20 value 1.881621
## iter 30 value 0.606133
## iter 40 value 0.039272
## iter 50 value 0.033159
## iter 60 value 0.027623
## iter 70 value 0.009051
## iter 80 value 0.008738
## iter 90 value 0.008694
## iter 100 value 0.008516
## final value 0.008516
## stopped after 100 iterations
tidy(multinom_model3, conf.int = FALSE, conf.level = 0.95, exponentiate = FALSE)
## # A tibble: 8 x 6
## y.level term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie (Intercept) 41.3 0.00000434 9525727. 0.
## 2 Adelie bill_length_mm -48.5 0.000194 -250238. 0.
## 3 Adelie bill_depth_mm 97.4 0.0000798 1220670. 0.
## 4 Adelie body_mass_g 0.0875 0.0175 4.99 6.05e- 7
## 5 Chinstrap (Intercept) -42.5 0.00000313 -13583092. 0.
## 6 Chinstrap bill_length_mm 10.2 0.000184 55533. 0.
## 7 Chinstrap bill_depth_mm 4.55 0.0000647 70309. 0.
## 8 Chinstrap body_mass_g -0.120 0.0173 -6.91 5.01e-12
summary(multinom_model3)
## Call:
## multinom(formula = outcome_level ~ bill_length_mm + bill_depth_mm +
## body_mass_g, data = penguins_training_mod)
##
## Coefficients:
## (Intercept) bill_length_mm bill_depth_mm body_mass_g
## Adelie 41.30411 -48.54704 97.397123 0.08745134
## Chinstrap -42.51029 10.21919 4.546393 -0.11957026
##
## Std. Errors:
## (Intercept) bill_length_mm bill_depth_mm body_mass_g
## Adelie 4.336059e-06 0.0001940033 7.978991e-05 0.01752673
## Chinstrap 3.129647e-06 0.0001840199 6.466277e-05 0.01731568
##
## Residual Deviance: 0.01703178
## AIC: 16.01703
We can see that by removing flipper_length_mm, the AIC continues to improve (dropping from 20 to 16), although note that this fit stopped after 100 iterations rather than fully converging. This final model yielded our lowest AIC and uses bill_length_mm, bill_depth_mm, and body_mass_g as the independent variables in our best parsimonious model.
C) Please be sure to interpret your variables in the model
Given that our third multinomial logistic regression model turned out to be the best parsimonious model (based on AIC), examining its summary output yields the following interpretations:
Bill Length (mm) – a one-millimeter increase in bill length is associated with a 48.55 decrease in the log odds of being classified as Adelie rather than Gentoo, and with a 10.22 increase in the log odds of being classified as Chinstrap rather than Gentoo. In other words, as bill length increases, a penguin becomes less likely to be classified as Adelie than as Gentoo, and more likely to be classified as Chinstrap than as Gentoo. This is consistent with the overlapping distributions seen in the data exploration in Part 1.
Bill Depth (mm) – a one-millimeter increase in bill depth is associated with a 97.40 increase in the log odds of being classified as Adelie rather than Gentoo, and with a 4.55 increase in the log odds of being classified as Chinstrap rather than Gentoo. In other words, as bill depth increases, a penguin becomes more likely to be classified as Adelie or Chinstrap than as Gentoo. This is also consistent with the distributions from Part 1.
Body Mass (g) – a one-gram increase in body mass is associated with a 0.09 increase in the log odds of being classified as Adelie rather than Gentoo, and with a 0.12 decrease in the log odds of being classified as Chinstrap rather than Gentoo. In other words, as body mass increases, a penguin becomes more likely to be classified as Gentoo than as Chinstrap, while the Adelie-versus-Gentoo effect is small and slightly positive. This is consistent with the distributions from Part 1.
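As with the binary model, these log-odds coefficients can be exponentiated to obtain relative risk ratios versus the Gentoo reference category (a quick check of the interpretations above):
# relative risk ratios vs. the Gentoo baseline
exp(coef(multinom_model3))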
Extra credit: what would be some of the fit statistics you would want to evaluate for your model in question #3? Feel free to share whatever you can provide.
Before discussing fit statistics, we can first create predictions on our holdout test set based on our model #3 in the last section. We can create a dataframe that shows the computed probabilities for an individual being classified into each class.
predictTest_mod = predict(multinom_model3, type = "probs", newdata = penguins_testing)
predictTest_mod = data.frame(predictTest_mod)
Here are the first six rows of the dataframe:
head(predictTest_mod)
## Gentoo Adelie Chinstrap
## 1 9.309919e-118 1.000000e+00 1.027839e-107
## 2 1.581919e-207 1.000000e+00 1.612077e-229
## 3 1.378137e-40 2.968561e-122 1.000000e+00
## 4 2.720822e-46 1.000000e+00 2.773275e-25
## 5 1.242817e-65 1.000000e+00 7.323893e-64
## 6 2.989775e-33 7.430538e-16 1.000000e+00
As we can see above, each row should be classified into the species whose predicted probability is closest to 1. Therefore, if we take the column with the maximum value in each row, we obtain a predicted classification for our test set:
target_test <- colnames(predictTest_mod)[apply(predictTest_mod,1,which.max)]
penguins_testing$target = target_test
cmx_test <- table(penguins_testing$target, penguins_testing$target == penguins_testing$species)
cmx_test
##
## FALSE TRUE
## Adelie 0 15
## Chinstrap 1 7
## Gentoo 0 23
We can see from the table generated above that the model predicted the correct species for all but one individual in our test set (one penguin predicted as Chinstrap belonged to a different species). This indicates that the model is working quite well.
Fit Statistics for Multinomial Logistic Regression
Some of the fit statistics you would want to evaluate for the model from question #3 are based on the predicted results. For instance, now that we have these predictions, we could assess goodness of fit by running a Pearson chi-squared test on the predicted versus observed categories, which produces an interpretable \(\chi^2\) value. It would also be possible to calculate pseudo R-squared values; these are interpreted a bit differently than the R-squared from traditional OLS, but they give a sense of the proportion of variation in species that can be explained by the independent variables.
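As a sketch of what those checks might look like, assuming an intercept-only multinom() fit as the null model (neither calculation was run as part of this assignment):
# Pearson chi-squared test of observed vs. predicted species on the test set
chisq.test(table(penguins_testing$species, penguins_testing$target))
# McFadden's pseudo R-squared: 1 - logLik(fitted model) / logLik(null model)
null_model <- multinom(outcome_level ~ 1, data = penguins_training_mod)
1 - as.numeric(logLik(multinom_model3)) / as.numeric(logLik(null_model))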