DATA 622 - Homework #1

Part 1: Logistic Regression with binary outcome

A) The penguin dataset has a ‘species’ column. Please check how many categories you have in the species column. Conduct whatever data manipulation you need to do to be able to build a logistic regression with binary outcome. Please explain your reasoning behind your decision as you manipulate the outcome/dependent variable (species).

First, we’ll save the penguin dataset as a tibble so we can manipulate it easier in the next steps:

Then, as mentioned above, we’ll first have to identify the number of categories that make up the species column. We can do this by running the syntax below:

summary(penguins$species)

##    Adelie Chinstrap    Gentoo 
##       152        68       124

As we can see above, it looks like there are three species present in this dataset. Therefore, in order to create a logistic regression with a binary outcome, we’ll have to determine which two species should be combined. We can also see from this summary that there are a small number of Chinstrap penguins in this dataset. This may be a good initial indication that combining this species with either Adelie or Gentoo will be best. However, before doing so, we can also take a look at the summary of all of the variables in the dataset, as well as some of the interactions between our continuous variables, including bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g.

##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2

It looks like there are 11 individuals that have missing sex information, including two individuals in our dataset that have a large amount of NAs across most of the features. Therefore, I’ll omit those from our analysis:

penguins <- na.omit(penguins)

Next, we can create plots that show interactions between some of our continous variables: Interestingly, when we break out the plots by species, we can see that many of the feature interactions show clustering between Adelie and Chinstrap penguins (red and green), while Gentoo penguins tend to contrast the other two species for most interactions. This is also confirmed by most of the single-variable distributions split out by species in the plots below. With the exception of the distribution of bill_length_mm, distributions for body_mass_g, bill_depth_mm, and flipper_length_mm all show there to be overlapping distributions between Adelie and Chinstrap penguin species.

With this in mind, I think it’ll be best to combine the Chinstrap penguin species with the Adelie species in order for us to create our binary outcome for our logistic regression. Therefore, to do this, we can do the following data manipulation by creating an extra column to bifurcate the species variable:

penguins <- penguins %>% 
  mutate(species_rollup = ifelse(species == 'Adelie' | species == 'Chinstrap', 'Adelie-Chinstrap', 'Gentoo'))

And now we can see the split between the two new groups of penguins:

## 
## Adelie-Chinstrap           Gentoo 
##              214              119

B) Please make sure you are evaluating the independent variables appropriately in deciding which ones should be in the model.

Checking kurtosis and transforming variables

Now that the outcome variable of species has been rolled up into two distinct groups of penguins, the next step is to evaluate the independent variables in the dataset accordingly. A good first step is to take a look at the distributions of each feature to get a sense of kurtosis:

From the above distributions, we can see that most of our continuous variables are quite skewed. In order to alleviate this before subjecting to our logistic regression, we can perform some transformations to make more normal distributions.

# transform variables
penguins$bill_length_transformed <- abs(penguins$bill_length_mm - mean(penguins$bill_length_mm))^(1/2)
penguins$flipper_length_transformed <- abs(penguins$flipper_length_mm - mean(penguins$flipper_length_mm))^(1/2)

As we can see above, we were able to perform a few transformations on bill_length_mm and flipper_length_mm to make more normal distributions. However, we will have to see if these transformations are interpretable when we run our logistic regression.

Checking for multicollinearity

Additionally, when evaluating the independent variables, it’s important to take multicollinearity into consideration. Multicollinearity can impact the variances of our parameter estimates, which could lead to incorrect inferences about relationships between our dependent variable (species) and independent variables. Therefore, I ran a correlation plot to check the collinearity between some of our independent variables.

We can see above that flipper_length and body_mass_g have a pretty strong positive correlation with one another. This makes sense given that typically penguins with larger body mass will have larger flippers. Since we are seeing a bimodal distribution for flipper_length which led to a pretty in-depth transformation, we may consider excluding this feature from our logistic regression, given that it also shows such a strong relationship with body_mass_g.

Downsampling Adelie-Chinstrap from dataset

One final step is to take a quick look at the counts for our binary variable species_rollup that we created earlier to help us create our binary logistic regression outcome:

table(penguins$species_rollup)

## 
## Adelie-Chinstrap           Gentoo 
##              214              119

Unfortunately, it looks like there’s a class imbalance here where there are almost double the amount of Adelie-Chinstrap individuals in our dataset than Gentoo individuals. This could impact our logistic regression outcome and eventual model performance. In order to create a more equal subset of individuals between the two classes, we can downsample the Adelie-Chinstrap group to match the number of Gentoo species at 119. In order to do this, we can run a function in the caret package called downSample(), which will randomly select 119 individuals from the Adelie-Chinstrap group to use for our model:

penguins_rollup_factor <- factor(penguins$species_rollup)
downsample_penguins <- downSample(penguins, penguins_rollup_factor, list = FALSE, yname = "Class")

table(downsample_penguins$species_rollup)

## 
## Adelie-Chinstrap           Gentoo 
##              119              119

Now, we have an equal share of 119 individuals in our new downsample_penguins dataset between our two groups. With these adjustments, we are now ready to start running our logistic regression models.

downsample_penguins$penguins_species_fnl <- ifelse(downsample_penguins$species_rollup == 'Gentoo', 0, 1)

Running the logistic regression

First, we’ll need to split our downsample_penguins dataset into a training and testing set (80% training, 20% testing). This was necessary in order to measure our model performance on our holdout test set later on.

set.seed(1234567)

# utilizing one dataset for all four models
penguins_partition <- createDataPartition(downsample_penguins$species_rollup, p=0.8, list=FALSE)
penguins_training <- downsample_penguins[penguins_partition,]
penguins_testing <- downsample_penguins[-penguins_partition,]

Next, after running a few preliminary models that didn’t show large differences between the transformed variables I had created and those from the original dataset, I decided to keep the existing variables in order to ease interpretation of the log odds. Additionally, I decided to not use variables such as bill_depth_mm, island, sex, and year, since they appear to be overfitting the model and creating fitted probabilities that numerically turned out to be 0 or 1. Therefore, for the first model, I added three variables bill_length_mm, flipper_length_mm, and body_mass_g to the logistic regression and used stepwise selection to determine if other variables should be removed:

model1 <- glm(penguins_species_fnl ~ bill_length_mm +  flipper_length_mm + body_mass_g , data = penguins_training, family = binomial)

model2 <- model1 %>% stepAIC(trace=FALSE)
summary(model2)

## 
## Call:
## glm(formula = penguins_species_fnl ~ bill_length_mm + flipper_length_mm + 
##     body_mass_g, family = binomial, data = penguins_training)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.22345  -0.00756  -0.00001   0.00331   2.12733  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)   
## (Intercept)       154.110258  46.893897   3.286  0.00101 **
## bill_length_mm      0.391771   0.211737   1.850  0.06427 . 
## flipper_length_mm  -0.751099   0.239089  -3.142  0.00168 **
## body_mass_g        -0.003938   0.002432  -1.619  0.10536   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 266.169  on 191  degrees of freedom
## Residual deviance:  17.618  on 188  degrees of freedom
## AIC: 25.618
## 
## Number of Fisher Scoring iterations: 10

As we can see, it doesn’t look like StepAIC removed any of our less statistically significant variables, since they didn’t meet the threshold. However, in model 3, I decided to remove bill_length_mm to see if this has an impact on our AIC score.

However, before doing so, I did want to check my VIF scores in order to check multicollinearity.

# print variable inflation factor score
print('VIF scores of predictors')

## [1] "VIF scores of predictors"

VIF(model2)

##    bill_length_mm flipper_length_mm       body_mass_g 
##          2.159405          2.145169          1.101884

Fortunately, we can see that all VIF scores are quite low (below 5), meaning that interactions with one another are quite low.

model3 <- glm(penguins_species_fnl ~ flipper_length_mm + body_mass_g, data = penguins_training, family = binomial)
summary(model3)

## 
## Call:
## glm(formula = penguins_species_fnl ~ flipper_length_mm + body_mass_g, 
##     family = binomial, data = penguins_training)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.67952  -0.02032  -0.00003   0.00835   2.73596  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       140.502135  42.189796   3.330 0.000868 ***
## flipper_length_mm  -0.622712   0.198249  -3.141 0.001683 ** 
## body_mass_g        -0.002802   0.001918  -1.461 0.143913    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 266.169  on 191  degrees of freedom
## Residual deviance:  23.408  on 189  degrees of freedom
## AIC: 29.408
## 
## Number of Fisher Scoring iterations: 10

After running this third model, we can see a small increase in the AIC. I also noticed through this process that flipper_length_mm seemed to be quite predictive of our bifurcated species variable. Therefore, I ran one final model which just exposed flipper_length_mm as the explanatory variable:

model4 <- glm(penguins_species_fnl ~ flipper_length_mm, data = penguins_training, family = binomial)
summary(model4)

## 
## Call:
## glm(formula = penguins_species_fnl ~ flipper_length_mm, family = binomial, 
##     data = penguins_training)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.07052  -0.04469  -0.00018   0.01180   2.38666  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       141.4275    37.5453   3.767 0.000165 ***
## flipper_length_mm  -0.6867     0.1811  -3.793 0.000149 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 266.169  on 191  degrees of freedom
## Residual deviance:  26.192  on 190  degrees of freedom
## AIC: 30.192
## 
## Number of Fisher Scoring iterations: 9

This model showed a slight increase in AIC from our previous models, and I do worry that if this model, or model #3 were to be used it may not yield as accurate predictions as model #1. We can examine these models in the next part during the evaluation stage. However, I’ll quickly breakdown some of the log odds information and interpret the variables of these logistic regression models below.

C) Provide variable interpretations in your model

Given our initial model turned out to be our best parsimonious model (based on AIC), if we examine the summary output, we can find the following:

Bill Length (mm) – as the bill length of an individual penguin in the dataset increases by one millimeter, then the odds of the individual being of the Adelie or Chinstrap species increases by exp(0.39), or 1.47.
Flipper Length (mm) – as the flipper length of an individual penguin in the dataset increases by one millimeter, then the odds of the individual being of the Adelie or Chinstrap species decreases by exp(-0.75), or 0.47.
Body Mass (g) – as the body mass of an individual penguin in the dataset increases by one gram, then the odds of the individual being of the Adelie or Chinstrap species decreases by exp(-0.003), or 0.10.

Part 2: Logistic Regression evaluation (AUC, TPR, FPR, TNR, FNR and Accuracy)

A) For your model from #1, please provide: AUC, Accuracy, TPR, FPR, TNR, FNR

Although we won’t evaluate the performance of this model on the training dataset, it is good to check to make sure these functions are working properly before subjecting this to the testing set. Therefore, below we will examine the predictions and confusion matrix for our model of choice (model #1):

predictTrain = predict(model1, type = "response")
penguins_training$target = round(predictTrain, 1)
table(penguins_training$penguins_species_fnl, predictTrain > 0.5)

##    
##     FALSE TRUE
##   0    95    1
##   1     2   94

We can also check the accuracy of the predictions on the training set:

penguins_training = penguins_training %>%
  mutate(accuracy = 1*(round(target, 0) == round(penguins_species_fnl, 0)))

paste0('Accuracy on Training Set: ', sum(penguins_training$accuracy)/nrow(penguins_training))

## [1] "Accuracy on Training Set: 0.979166666666667"

Measuring Accuracy, TPR, FPR, TNR, and FNR on testing set

Now, we can evaluate our chosen model #1 on our holdout testing set. Initially, we’ll set our threshold at 0.5, to determine whether or not our predictions should be classified into the Gentoo category (less than or equal to 0.5) or Adelie-Chinstrap category (greater than 0.5):

predictTest = predict(model1, type = "response", newdata = penguins_testing)
penguins_testing$target = round(predictTest, 1)
confusion_matrix <- table(penguins_testing$penguins_species_fnl, predictTest > 0.5)
confusion_matrix

##    
##     FALSE TRUE
##   0    23    0
##   1     1   22

Calculating TPR, TNR, FPR, and FNR

# tp = true positive
# tn = true negative
# fp = false positive
# fn = false negative

tp <- confusion_matrix[4]
tn <- confusion_matrix[1]
fp <- confusion_matrix[2]
fn <- confusion_matrix[3]
p <- sum(confusion_matrix[3], confusion_matrix[4])
n <- sum(confusion_matrix[1], confusion_matrix[2])

tpr <- tp / p
tnr <- tn / n
fpr <- fp / n
fnr <- fn / p

title_row <- c('True Positive Rate (TPR)', 'True Negative Rate (TNR)', 'False Positive Rate (FPR)', 'False Negative Rate (FNR)')
values <- c(tpr, tnr, fpr, fnr)

results <- data.frame(title_row, values)
colnames(results) <- c('Measure', 'Value')

results

##                     Measure      Value
## 1  True Positive Rate (TPR) 1.00000000
## 2  True Negative Rate (TNR) 0.95833333
## 3 False Positive Rate (FPR) 0.04166667
## 4 False Negative Rate (FNR) 0.00000000

After calculating TPR, TNR, FPR, and FNR, it’s a good sign to see that both FPR and FNR are quite low. Additionally, the model was very good at classifying penguins in the testing dataset that were of either the Adelie/Chinstrap species (predicted all but one of them correctly), and classified all Gentoo species correctly.

Accuracy

penguins_testing = penguins_testing %>%
  mutate(accuracy = 1*(round(target, 0) == round(penguins_species_fnl, 0)))

paste0('Accuracy on Training Set: ', sum(penguins_testing$accuracy)/nrow(penguins_testing))

## [1] "Accuracy on Training Set: 0.978260869565217"

As we can see above, when checking the accuracy of the predictions on our testing set, we can see that model 1 was roughly 98% accurate in predicting whether or not the penguin was classified as either a Gentoo or Adelie/Chinstrap species.

Calculating AUC

We can plot an ROC curve and calculate our AUC by using the pRoc package. Below, we were able to plot our thresholds.

I was having issues with caching when knitting to R, so had to take a picture instead of showing output!

Finally, the AUC of our chosen model #1 is very good, with a value of 0.994, which is very close to 1 and supports our calculation of accuracy that almost all predictions were correct.

Part 3: Multinomial Logistic Regression

A) Please fit a multinomial logistic regression where your outcome variable is ‘species’.
B) Please be sure to evaluate the independent variables appropriately to fit your best parsimonious model.

For this last part, we’ll use the initial species variable to create a multinomial logistic regression model. This way, we do not have to bifurcate our species into two categories, but instead use a different type of logistic regression to obtain predictions for all three species. In order to do this, I decided to take our initial training dataset from Part 1, since it is already split accordingly, and use the multinom() function from the broom package to run our first model.

Model 1

In this initial model, I kept all variables in, just to get a baseline AIC to compare future models on the training data.

multinom_model1 <- multinom(outcome_level ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g + sex + year + bill_length_transformed + flipper_length_transformed, data = penguins_training_mod)

## # weights:  30 (18 variable)
## initial  value 210.933559 
## iter  10 value 28.330919
## iter  20 value 2.932103
## iter  30 value 0.107010
## final  value 0.000007 
## converged

tidy(multinom_model1, conf.int = FALSE, conf.level = 0.95, exponentiate = FALSE)

## # A tibble: 18 x 6
##    y.level   term                       estimate std.error   statistic   p.value
##    <chr>     <chr>                         <dbl>     <dbl>       <dbl>     <dbl>
##  1 Adelie    (Intercept)                 -0.328    0.00456  -72.0      0.       
##  2 Adelie    bill_length_mm             -10.7      8.36      -1.28     2.00e-  1
##  3 Adelie    bill_depth_mm               57.4     11.5        5.00     5.84e-  7
##  4 Adelie    flipper_length_mm           -3.28    72.9       -0.0450   9.64e-  1
##  5 Adelie    body_mass_g                 -0.0511   5.04      -0.0101   9.92e-  1
##  6 Adelie    sexmale                     46.9      0.214    219.       0.       
##  7 Adelie    year                         0.175    7.99       0.0220   9.82e-  1
##  8 Adelie    bill_length_transformed    -22.0      1.46     -15.1      2.24e- 51
##  9 Adelie    flipper_length_transformed  14.8     12.2        1.21     2.28e-  1
## 10 Chinstrap (Intercept)                  0.409    0.00456   89.7      0.       
## 11 Chinstrap bill_length_mm              17.0      8.36       2.03     4.22e-  2
## 12 Chinstrap bill_depth_mm               46.9     11.5        4.08     4.52e-  5
## 13 Chinstrap flipper_length_mm            0.0369  72.9        0.000506 1.00e+  0
## 14 Chinstrap body_mass_g                 -0.173    5.04      -0.0344   9.73e-  1
## 15 Chinstrap sexmale                    -62.7      0.214   -293.       0.       
## 16 Chinstrap year                        -0.408    7.99      -0.0510   9.59e-  1
## 17 Chinstrap bill_length_transformed    -43.9      1.46     -30.1      9.97e-199
## 18 Chinstrap flipper_length_transformed  24.1     12.2        1.97     4.93e-  2

summary(multinom_model1)

## Call:
## multinom(formula = outcome_level ~ bill_length_mm + bill_depth_mm + 
##     flipper_length_mm + body_mass_g + sex + year + bill_length_transformed + 
##     flipper_length_transformed, data = penguins_training_mod)
## 
## Coefficients:
##           (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Adelie     -0.3282549      -10.72648      57.44573       -3.28013884
## Chinstrap   0.4093077       16.98542      46.89662        0.03690183
##           body_mass_g   sexmale       year bill_length_transformed
## Adelie    -0.05114594  46.85619  0.1754729               -21.99939
## Chinstrap -0.17332011 -62.66099 -0.4075378               -43.88028
##           flipper_length_transformed
## Adelie                      14.75825
## Chinstrap                   24.05821
## 
## Std. Errors:
##           (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Adelie    0.004561299       8.362393      11.49727          72.94906
## Chinstrap 0.004561299       8.362393      11.49727          72.94906
##           body_mass_g   sexmale     year bill_length_transformed
## Adelie       5.043124 0.2142117 7.988299                1.458978
## Chinstrap    5.043124 0.2142117 7.988299                1.458978
##           flipper_length_transformed
## Adelie                      12.23654
## Chinstrap                   12.23654
## 
## Residual Deviance: 1.365053e-05 
## AIC: 36.00001

We can see that the AIC value is around 36, and when we look at the p-values and summary statistics, we can immediately see that it’ll be best to remove variables such as year and sex, both of which didn’t perform well in our binomial logistic regression earlier, as well as bill_length_transformed and flipper_length_transformed, both of which have high p-values here as well.

Model 2

When we re-run the model, removing these variables, we can see that there is an improvement in AIC (from 36 to 20). However, we’ll continue to remove a few variables to see if we can improve our model even more.

multinom_model2 <- multinom(outcome_level ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g, data = penguins_training_mod)

## # weights:  18 (10 variable)
## initial  value 210.933559 
## iter  10 value 10.367979
## iter  20 value 2.784343
## iter  30 value 0.794853
## iter  40 value 0.001436
## final  value 0.000061 
## converged

tidy(multinom_model2, conf.int = FALSE, conf.level = 0.95, exponentiate = FALSE)

## # A tibble: 10 x 6
##    y.level   term               estimate std.error statistic  p.value
##    <chr>     <chr>                 <dbl>     <dbl>     <dbl>    <dbl>
##  1 Adelie    (Intercept)         6.02        0.126  47.8     0.      
##  2 Adelie    bill_length_mm    -32.7         8.50   -3.85    1.20e- 4
##  3 Adelie    bill_depth_mm      48.2         4.89    9.85    6.54e-23
##  4 Adelie    flipper_length_mm   3.09       16.1     0.192   8.48e- 1
##  5 Adelie    body_mass_g        -0.00376     0.828  -0.00455 9.96e- 1
##  6 Chinstrap (Intercept)        -5.09        0.944  -5.39    7.09e- 8
##  7 Chinstrap bill_length_mm     22.7         9.06    2.51    1.21e- 2
##  8 Chinstrap bill_depth_mm       1.37       77.3     0.0177  9.86e- 1
##  9 Chinstrap flipper_length_mm  -2.36       20.9    -0.113   9.10e- 1
## 10 Chinstrap body_mass_g        -0.143       1.07   -0.134   8.94e- 1

summary(multinom_model2)

## Call:
## multinom(formula = outcome_level ~ bill_length_mm + bill_depth_mm + 
##     flipper_length_mm + body_mass_g, data = penguins_training_mod)
## 
## Coefficients:
##           (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Adelie       6.017933      -32.68266     48.164548          3.093349
## Chinstrap   -5.086096       22.71875      1.366784         -2.362707
##            body_mass_g
## Adelie    -0.003764257
## Chinstrap -0.142928375
## 
## Std. Errors:
##           (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Adelie      0.1258945       8.499668      4.887449          16.12609
## Chinstrap   0.9438028       9.057746     77.339941          20.93772
##           body_mass_g
## Adelie      0.8280631
## Chinstrap   1.0704038
## 
## Residual Deviance: 0.0001212095 
## AIC: 20.00012

Model 3

In our third iteration of our model, we decided to remove flipper_length_mm, which seemed to have the highest p-value and was likely exposing our model to multicollinearity issues with body_mass_g, which we could see from our initial variable examination earlier in our model building.

multinom_model3 <- multinom(outcome_level ~ bill_length_mm + bill_depth_mm  + body_mass_g, data = penguins_training_mod)

## # weights:  15 (8 variable)
## initial  value 210.933559 
## iter  10 value 7.375240
## iter  20 value 1.881621
## iter  30 value 0.606133
## iter  40 value 0.039272
## iter  50 value 0.033159
## iter  60 value 0.027623
## iter  70 value 0.009051
## iter  80 value 0.008738
## iter  90 value 0.008694
## iter 100 value 0.008516
## final  value 0.008516 
## stopped after 100 iterations

tidy(multinom_model3, conf.int = FALSE, conf.level = 0.95, exponentiate = FALSE)

## # A tibble: 8 x 6
##   y.level   term           estimate  std.error    statistic  p.value
##   <chr>     <chr>             <dbl>      <dbl>        <dbl>    <dbl>
## 1 Adelie    (Intercept)     41.3    0.00000434   9525727.   0.      
## 2 Adelie    bill_length_mm -48.5    0.000194     -250238.   0.      
## 3 Adelie    bill_depth_mm   97.4    0.0000798    1220670.   0.      
## 4 Adelie    body_mass_g      0.0875 0.0175             4.99 6.05e- 7
## 5 Chinstrap (Intercept)    -42.5    0.00000313 -13583092.   0.      
## 6 Chinstrap bill_length_mm  10.2    0.000184       55533.   0.      
## 7 Chinstrap bill_depth_mm    4.55   0.0000647      70309.   0.      
## 8 Chinstrap body_mass_g     -0.120  0.0173            -6.91 5.01e-12

summary(multinom_model3)

## Call:
## multinom(formula = outcome_level ~ bill_length_mm + bill_depth_mm + 
##     body_mass_g, data = penguins_training_mod)
## 
## Coefficients:
##           (Intercept) bill_length_mm bill_depth_mm body_mass_g
## Adelie       41.30411      -48.54704     97.397123  0.08745134
## Chinstrap   -42.51029       10.21919      4.546393 -0.11957026
## 
## Std. Errors:
##            (Intercept) bill_length_mm bill_depth_mm body_mass_g
## Adelie    4.336059e-06   0.0001940033  7.978991e-05  0.01752673
## Chinstrap 3.129647e-06   0.0001840199  6.466277e-05  0.01731568
## 
## Residual Deviance: 0.01703178 
## AIC: 16.01703

We can see that by removing flipper_length_mm, AIC continues to improve (dropping from 20 to 16). This final model yielded our lowest AIC and contains bill_length_mm bill_depth_mm and body_mass_g as our chosen independent variables for our best parsimonious model.

C) Please be sure to interpret your variables in the model

Given our third multinomial logisitic regression model turned out to be our best parsimonious model (based on AIC), if we examine the summary output, we can find the following:

Bill Length (mm) – A one millimeter increase in bill length is associated with the decrease in the log odds of being classified as the Adelie species vs. Gentoo species in the amount of 48.55. Similarly, a one millimeter increase in bill length is associated with the increase in the log odds of being classified as the Chinstrap species vs. Gentoo species by 10.22. Essentially, as bill length increases, it’s less likely to be classified as an Adelie species than Gentoo species. Similarly, as bill length increases, the log odds become more certain to be classified as a Chinstrap species than Gentoo species. This can be confirmed by the initial overlapping distributions represented in the data exploration part of the assignment (Part 1).
Bill Depth (mm) – A one millimeter increase in bill depth is associated with the increase in the log odds of being classified as the Adelie species vs. Gentoo species in the amount of 97.40. Similarly, a one millimeter increase in bill length is associated with the increase in the log odds of being classified as the Chinstrap species vs. Gentoo species by 4.55. Essentially, as bill depth increases, it’s more likely to be classified as an Adelie species than Gentoo species. Similarly, as bill depth increases, the log odds become more certain to be classified as a Chinstrap species than Gentoo species. This can be confirmed by the initial overlapping distributions represented in the data exploration part of the assignment (Part 1).
Body Mass (g) – A one gram increase in body mass is associated with the increase in the log odds of being classified as the Adelie species vs. Gentoo species in the amount of 0.09. Similarly, a one gram increase in body mass is associated with the decrease in the log odds of being classified as the Chinstrap species vs. Gentoo species by 0.11. Essentially, as body mass increases, it’s more likely to be classified as an Gentoo species over either Adelie or Chinstrap species. This can be confirmed by the initial overlapping distributions represented in the data exploration part of the assignment (Part 1).

Part 4: Extra Credit

Extra credit: what would be some of the fit statistics you would want to evaluate for your model in question #3? Feel free to share whatever you can provide.

Before discussing fit statistics, we can first create predictions on our holdout test set based on our model #3 in the last section. We can create a dataframe that shows the computed probabilities for an individual being classified into each class.

predictTest_mod = predict(multinom_model3, type = "probs", newdata = penguins_testing)
predictTest_mod = data.frame(predictTest_mod)

Here’s the first five values of the dataframe:

head(predictTest_mod)

##          Gentoo        Adelie     Chinstrap
## 1 9.309919e-118  1.000000e+00 1.027839e-107
## 2 1.581919e-207  1.000000e+00 1.612077e-229
## 3  1.378137e-40 2.968561e-122  1.000000e+00
## 4  2.720822e-46  1.000000e+00  2.773275e-25
## 5  1.242817e-65  1.000000e+00  7.323893e-64
## 6  2.989775e-33  7.430538e-16  1.000000e+00

As we can see above, probabilities falling closer to 1 will be utilized for classification. Therefore, if we take the maximum value of the three columns for each row, we can then estimate a classification for our target set:

target_test <- colnames(predictTest_mod)[apply(predictTest_mod,1,which.max)]
penguins_testing$target = target_test

cmx_test <- table(penguins_testing$target, penguins_testing$target == penguins_testing$species)
cmx_test

##            
##             FALSE TRUE
##   Adelie        0   15
##   Chinstrap     1    7
##   Gentoo        0   23

We can see from the confusion matrix generated, that the model estimated the correct classification for all individuals in our test set. This indicates that our model is working quite well.

Fit Statistics for Multinomial Logisitic Regression

Some of the fit statistics that you’d want to evaluate for the model produced in part 3 would be based on the predicted results. For instance, now that we have these predictions, we could test the goodness of fit by running a Pearson’s Chi-squared test on these predictions and the categories. This would produce an interpretable \(X^2\) value. It would also be possible to calculate a psuedo R square values for each of our species – we’d interpret it a bit differently than traditional OLS, but would be available for us to make some assumptions about the proportion of variance of species that can be explained by the independent variables.

DATA 622 - Homework #1

Zach Alexander

2/12/2021

Part 1: Logistic Regression with binary outcome

Part 2: Logistic Regression evaluation (AUC, TPR, FPR, TNR, FNR and Accuracy)

Part 3: Multinomial Logistic Regression

Part 4: Extra Credit