This assignment analyzes the Palmer Penguins dataset using binary and multinomial logistic regression. The dataset contains physical measurements and the sex of individual penguins, collected over a three-year period (2007-2009) from three closely related species (Adelie, Chinstrap and Gentoo) in the Palmer Archipelago, Antarctica.
In the first section, Problem 1, we explore the data and construct a binary logistic regression. Much of the exploratory analysis also applies to the multinomial regression model in later sections. We evaluate the appropriateness of the independent variables for inclusion in the selected model, and we interpret the output and the meaning of the coefficients. In Problem 2, we compute model diagnostics such as AUC, accuracy and other statistics derived from the confusion matrix.
In the third section, Problem 3, we construct a multinomial logistic regression using species as the response variable and interpret the model. In the fourth section, Problem 4, we consider model diagnostics for the multinomial logistic regression. Although diagnostics are harder to construct than in the binary case, several metrics remain informative.
Background research was relevant and helpful for variable selection and model building in this assignment. To get a deeper understanding of penguin biology and the associated research study, I explored the Palmer Penguins dataset contained in the R package palmerpenguins, read the 2014 research paper by Gorman, Williams and Fraser, and consulted additional background reading on Palmer Station and penguins. This leads to several observations:
The relative proportions of each species in the sample data do not reflect their actual populations, either globally or on the islands, but rather the researchers’ preferences. However, we assume the physical measurements of the samples are typical of each species. The data were gathered over 3 years (2007-2009) by a researcher (K. Gorman) essentially working alone in the field. Moreover, the birds were evaluated at the start of nesting season as breeding pairs. The authors also state that additional samples were excluded due to research criteria related to their breeding success (Gorman, Williams and Fraser, 2014).
Temperature and winter-ice conditions varied significantly over the three years of the study, so we need to explore the year variable even if we expect it to contribute little or nothing to species prediction. The authors state that climate variation may affect food supply, habitat availability and penguin body mass. The year 2008 was warmer, with less sea ice in the Palmer Archipelago. This suggests that time variation in body mass could be present or significant.
The sex of a penguin is difficult to determine and requires genetic testing of blood samples or examination of internal organs, partly because penguins show no external genitalia (Wells, 2019). Collecting the sex information is therefore costly, whereas the culmen and flipper measurements are comparatively easy and cheap to obtain. We also expect little statistical evidence of sexual dimorphism; that is, sex should not be a relevant variable in model building.
All three locations are quite close together. They are small, rocky, uninhabited islands near Palmer Station, the researchers’ home base. We don’t expect location to significantly affect our species prediction model except where it may alter sample frequencies. Moreover, if we wish the model to predict the outcome variable (species) for penguins beyond these 3 islands, the location of each bird should play no role as a model input.
We observe that predicting penguin species from all of the independent variables may be an impractical exercise. In practice, species is the least costly information to obtain: visual observation of the head markings with binoculars allows species identification. Measuring physical attributes, and especially determining the penguin’s sex, can be very time consuming and costly. A more practical model would seek to predict sex from species and other variables to replace costly chromosomal tests.
Initial exploratory data analysis (EDA) of the entire dataset gives several findings:
| Data summary | |
|---|---|
| Name | penguins |
| Number of rows | 344 |
| Number of columns | 8 |
| Column type frequency: | |
| factor | 3 |
| numeric | 5 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
| island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
| sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
| bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
| flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
| body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
| year | 0 | 1.00 | 2008.03 | 0.82 | 2007.0 | 2007.00 | 2008.00 | 2009.0 | 2009.0 | ▇▁▇▁▇ |
| Data summary | |
|---|---|
| Name | Piped data |
| Number of rows | 333 |
| Number of columns | 8 |
| Column type frequency: | |
| factor | 1 |
| numeric | 5 |
| Group variables | species, sex |
Variable type: factor
| skim_variable | species | sex | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|---|---|
| island | Adelie | female | 0 | 1 | FALSE | 3 | Dre: 27, Tor: 24, Bis: 22 |
| island | Adelie | male | 0 | 1 | FALSE | 3 | Dre: 28, Tor: 23, Bis: 22 |
| island | Chinstrap | female | 0 | 1 | FALSE | 1 | Dre: 34, Bis: 0, Tor: 0 |
| island | Chinstrap | male | 0 | 1 | FALSE | 1 | Dre: 34, Bis: 0, Tor: 0 |
| island | Gentoo | female | 0 | 1 | FALSE | 1 | Bis: 58, Dre: 0, Tor: 0 |
| island | Gentoo | male | 0 | 1 | FALSE | 1 | Bis: 61, Dre: 0, Tor: 0 |
Variable type: numeric
| skim_variable | species | sex | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | Adelie | female | 0 | 1 | 37.26 | 2.03 | 32.1 | 35.90 | 37.00 | 38.80 | 42.2 | ▁▆▇▅▂ |
| bill_length_mm | Adelie | male | 0 | 1 | 40.39 | 2.28 | 34.6 | 39.00 | 40.60 | 41.50 | 46.0 | ▁▅▇▃▁ |
| bill_length_mm | Chinstrap | female | 0 | 1 | 46.57 | 3.11 | 40.9 | 45.42 | 46.30 | 47.38 | 58.0 | ▂▇▂▁▁ |
| bill_length_mm | Chinstrap | male | 0 | 1 | 51.09 | 1.56 | 48.5 | 50.05 | 50.95 | 51.98 | 55.8 | ▅▇▅▁▁ |
| bill_length_mm | Gentoo | female | 0 | 1 | 45.56 | 2.05 | 40.9 | 43.85 | 45.50 | 46.88 | 50.5 | ▂▃▇▃▁ |
| bill_length_mm | Gentoo | male | 0 | 1 | 49.47 | 2.72 | 44.4 | 48.10 | 49.50 | 50.50 | 59.6 | ▃▇▃▁▁ |
| bill_depth_mm | Adelie | female | 0 | 1 | 17.62 | 0.94 | 15.5 | 17.00 | 17.60 | 18.30 | 20.7 | ▂▇▇▂▁ |
| bill_depth_mm | Adelie | male | 0 | 1 | 19.07 | 1.02 | 17.0 | 18.50 | 18.90 | 19.60 | 21.5 | ▂▇▇▃▂ |
| bill_depth_mm | Chinstrap | female | 0 | 1 | 17.59 | 0.78 | 16.4 | 17.00 | 17.65 | 18.05 | 19.4 | ▇▆▇▃▂ |
| bill_depth_mm | Chinstrap | male | 0 | 1 | 19.25 | 0.76 | 17.5 | 18.80 | 19.30 | 19.80 | 20.8 | ▁▆▅▇▂ |
| bill_depth_mm | Gentoo | female | 0 | 1 | 14.24 | 0.54 | 13.1 | 13.80 | 14.25 | 14.60 | 15.5 | ▂▆▇▅▁ |
| bill_depth_mm | Gentoo | male | 0 | 1 | 15.72 | 0.74 | 14.1 | 15.20 | 15.70 | 16.10 | 17.3 | ▂▆▇▅▂ |
| flipper_length_mm | Adelie | female | 0 | 1 | 187.79 | 5.60 | 172.0 | 185.00 | 188.00 | 191.00 | 202.0 | ▁▂▇▅▁ |
| flipper_length_mm | Adelie | male | 0 | 1 | 192.41 | 6.60 | 178.0 | 189.00 | 193.00 | 197.00 | 210.0 | ▃▅▇▃▁ |
| flipper_length_mm | Chinstrap | female | 0 | 1 | 191.74 | 5.75 | 178.0 | 187.25 | 192.00 | 195.75 | 202.0 | ▂▅▇▇▆ |
| flipper_length_mm | Chinstrap | male | 0 | 1 | 199.91 | 5.98 | 187.0 | 196.00 | 200.50 | 203.00 | 212.0 | ▁▇▅▅▂ |
| flipper_length_mm | Gentoo | female | 0 | 1 | 212.71 | 3.90 | 203.0 | 210.00 | 212.00 | 215.00 | 222.0 | ▁▇▆▅▂ |
| flipper_length_mm | Gentoo | male | 0 | 1 | 221.54 | 5.67 | 208.0 | 218.00 | 221.00 | 225.00 | 231.0 | ▁▆▇▇▆ |
| body_mass_g | Adelie | female | 0 | 1 | 3368.84 | 269.38 | 2850.0 | 3175.00 | 3400.00 | 3550.00 | 3900.0 | ▅▅▇▅▅ |
| body_mass_g | Adelie | male | 0 | 1 | 4043.49 | 346.81 | 3325.0 | 3800.00 | 4000.00 | 4300.00 | 4775.0 | ▃▇▇▇▃ |
| body_mass_g | Chinstrap | female | 0 | 1 | 3527.21 | 285.33 | 2700.0 | 3362.50 | 3550.00 | 3693.75 | 4150.0 | ▁▂▇▇▂ |
| body_mass_g | Chinstrap | male | 0 | 1 | 3938.97 | 362.14 | 3250.0 | 3731.25 | 3950.00 | 4100.00 | 4800.0 | ▃▇▇▂▂ |
| body_mass_g | Gentoo | female | 0 | 1 | 4679.74 | 281.58 | 3950.0 | 4462.50 | 4700.00 | 4875.00 | 5200.0 | ▂▅▇▇▃ |
| body_mass_g | Gentoo | male | 0 | 1 | 5484.84 | 313.16 | 4750.0 | 5300.00 | 5500.00 | 5700.00 | 6300.0 | ▂▆▇▅▂ |
| year | Adelie | female | 0 | 1 | 2008.05 | 0.81 | 2007.0 | 2007.00 | 2008.00 | 2009.00 | 2009.0 | ▇▁▇▁▇ |
| year | Adelie | male | 0 | 1 | 2008.05 | 0.81 | 2007.0 | 2007.00 | 2008.00 | 2009.00 | 2009.0 | ▇▁▇▁▇ |
| year | Chinstrap | female | 0 | 1 | 2007.97 | 0.87 | 2007.0 | 2007.00 | 2008.00 | 2009.00 | 2009.0 | ▇▁▆▁▇ |
| year | Chinstrap | male | 0 | 1 | 2007.97 | 0.87 | 2007.0 | 2007.00 | 2008.00 | 2009.00 | 2009.0 | ▇▁▆▁▇ |
| year | Gentoo | female | 0 | 1 | 2008.07 | 0.79 | 2007.0 | 2007.00 | 2008.00 | 2009.00 | 2009.0 | ▆▁▇▁▇ |
| year | Gentoo | male | 0 | 1 | 2008.07 | 0.79 | 2007.0 | 2007.00 | 2008.00 | 2009.00 | 2009.0 | ▆▁▇▁▇ |
The plot below shows two key points:
Sample gathering was not balanced across islands. Gentoo were collected exclusively from Biscoe Island, Chinstrap from Dream Island, and Adelie from all three islands.
Sample sizes are small, so we should expect variability in the measurements over time and geography.
Based on the exploratory data analysis below, we pool all observations across years and islands in model building. The facet plots below show the density of each feature by year, faceted by island and species. We don’t observe significant differences in the means or standard deviations of Adelie penguin features across the three islands, nor in the distributions of bill depth, bill length, flipper length or body mass for Gentoo and Chinstrap penguins over time. For Adelie penguins, we observe a moderate difference in the means of bill_depth_mm and body_mass_g on Torgersen Island and of flipper_length_mm on Biscoe Island. It is difficult to argue that these differences are significant for such small samples. Moreover, despite the authors’ concern about time variation in sea-ice conditions, the observed changes in penguin weights went in the opposite direction to that implied by the species’ habitat preferences. For these reasons I choose to exclude year as a feature.
Reproducing the pairwise plot below from the vignette/pca, we assess the six scatterplots of quantitative features.
Of the six scatterplots, the most promising pair for distinguishing species is bill_length_mm and bill_depth_mm: the clusters of observations for the three species have almost no overlap. Moreover, when I condition the distributions of bill_depth_mm and bill_length_mm on sex, the clusters separate even further visually. This can be seen clearly in the bottom right-most scatterplot within each of the three ggpairs panels.
We choose to define a binary variable based on species called Gentoo which we define as:
\[ \text{Gentoo} = \begin{cases} 1 & \text{if species} = \text{Gentoo} \\ 0 & \text{if species} \in \{\text{Adelie}, \text{Chinstrap}\} \end{cases} \]
We chose this partition of the three-class dataset because Gentoo penguins appear to be neatly separated by the available features and because the resulting class imbalance is not excessive. Gentoo penguins comprise 35.7% of the sample, as shown below.
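As a quick arithmetic check on the class balance, the species counts in the grouped summary table (73 + 73 Adelie, 34 + 34 Chinstrap, 58 + 61 Gentoo, after dropping birds with missing sex) can be converted to shares. This is only a sketch with the counts typed in by hand rather than read from `pc`.

```r
# Species counts among the 333 birds with known sex (from the grouped skim table)
counts <- c(Adelie = 146, Chinstrap = 68, Gentoo = 119)

# Share of each species in percent
pct <- round(100 * counts / sum(counts), 1)
print(pct)

# Share of the positive class for the binary Gentoo indicator
gentoo_share <- round(100 * counts[["Gentoo"]] / sum(counts), 1)
print(gentoo_share)  # 35.7
```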
We first consider the most parsimonious model, which uses only the culmen characteristics bill_length_mm and bill_depth_mm.
##
## Call:
## glm(formula = Gentoo ~ bill_length_mm + bill_depth_mm, family = binomial,
## data = pc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.83346 -0.01519 -0.00119 0.01196 3.03982
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 48.1175 14.8730 3.235 0.00122 **
## bill_length_mm 0.5561 0.1344 4.138 3.50e-05 ***
## bill_depth_mm -4.4750 1.0500 -4.262 2.03e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 434.154 on 332 degrees of freedom
## Residual deviance: 29.822 on 330 degrees of freedom
## AIC: 35.822
##
## Number of Fisher Scoring iterations: 10
The marginal effects allow us to interpret model coefficients in probability terms. Using the margins library, we calculate the average marginal effect of each variable on the predicted probability.
A 1 mm increase in bill length increases the average predicted probability of being classified as Gentoo by 0.699 percentage points. A 1 mm increase in bill depth decreases that probability by 5.6 percentage points. Thus, bill depth has the larger effect on the predicted outcome.
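For a logistic model, the marginal effect of a predictor on the predicted probability p is its coefficient times p(1 - p), and the average marginal effect is the mean of that quantity over the sample. The sketch below illustrates the calculation on simulated data (the data, seed and variable names are illustrative assumptions, not the penguin measurements) and checks it against a finite-difference approximation. It is meant to convey the idea behind the `margins` output, not to reproduce that package's internals.

```r
set.seed(1)

# Simulated stand-in data (hypothetical, not the penguin data)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 1.0 * x1 - 0.8 * x2))
d  <- data.frame(y, x1, x2)

fit <- glm(y ~ x1 + x2, data = d, family = binomial)
p   <- fitted(fit)

# Analytic average marginal effect: mean over the sample of beta_j * p * (1 - p)
ame_analytic <- coef(fit)[c("x1", "x2")] * mean(p * (1 - p))

# Finite-difference check for x1: nudge x1 by a small step and average the change
h  <- 1e-5
d2 <- transform(d, x1 = x1 + h)
ame_numeric_x1 <- mean((predict(fit, newdata = d2, type = "response") - p) / h)

print(ame_analytic)
print(ame_numeric_x1)
```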
Note that coefficient estimates in the logistic regression are interpreted as linear shifts in log odds, not in probability. So a 1 mm increase in bill length increases the log odds of being classified as Gentoo by 0.56, and a 1 mm increase in bill depth decreases the log odds by 4.475. Both coefficient estimates are highly statistically significant.
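Exponentiating a log-odds coefficient converts it to an odds ratio, which is often easier to read. The snippet below does this for the two culmen coefficients, copied by hand from the model summary above.

```r
# Coefficients copied from the culmen-model summary
beta <- c(bill_length_mm = 0.5561, bill_depth_mm = -4.4750)

# exp(beta) is the multiplicative change in the odds of Gentoo per 1 mm increase
odds_ratio <- exp(beta)
print(round(odds_ratio, 4))
```

Read this way, each additional millimetre of bill length multiplies the odds of Gentoo by roughly 1.74, while each additional millimetre of bill depth multiplies them by roughly 0.011.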
We also consider two other candidate models. To keep this assignment brief, we report only selected results.
We next add sex to the culmen characteristics to form a culmen and gender model.
##
## Call:
## glm(formula = Gentoo ~ bill_length_mm + bill_depth_mm + as.factor(sex),
## family = binomial, data = pc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.99977 -0.00009 0.00000 0.00002 1.73786
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 140.8342 102.2827 1.377 0.169
## bill_length_mm 0.7401 0.5107 1.449 0.147
## bill_depth_mm -11.0369 7.5647 -1.459 0.145
## as.factor(sex)male 15.9818 12.8480 1.244 0.214
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 434.1538 on 332 degrees of freedom
## Residual deviance: 5.1265 on 329 degrees of freedom
## AIC: 13.127
##
## Number of Fisher Scoring iterations: 13
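For ungrouped 0/1 responses the residual deviance equals minus twice the log-likelihood, so the AIC reported by `glm` is simply the residual deviance plus twice the number of estimated coefficients. The check below uses the deviances printed in the two summaries above, typed in by hand.

```r
# Residual deviances and coefficient counts from the two model summaries
dev <- c(culmen = 29.822, culmen_sex = 5.1265)
k   <- c(culmen = 3,      culmen_sex = 4)

# AIC = residual deviance + 2 * number of coefficients (binary response)
aic <- dev + 2 * k
print(aic)
```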
Lastly, we consider a full model using all four quantitative physical measurements together with sex.
##
## Call:
## glm(formula = Gentoo ~ bill_length_mm + bill_depth_mm + flipper_length_mm +
## body_mass_g + I(sex), family = binomial, data = pc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.518e-05 -2.100e-08 -2.100e-08 2.100e-08 3.530e-05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.776e+02 5.425e+05 0.000 1.000
## bill_length_mm 3.331e-02 3.488e+03 0.000 1.000
## bill_depth_mm -1.034e+01 1.536e+04 -0.001 0.999
## flipper_length_mm 1.272e+00 1.866e+03 0.001 0.999
## body_mass_g 1.931e-02 3.375e+01 0.001 1.000
## I(sex)male -3.729e+00 5.142e+04 0.000 1.000
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4.3415e+02 on 332 degrees of freedom
## Residual deviance: 7.0884e-09 on 327 degrees of freedom
## AIC: 12
##
## Number of Fisher Scoring iterations: 25
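The very large standard errors, p-values near 1, residual deviance of essentially zero and 25 Fisher scoring iterations in the output above are the pattern usually associated with complete (or quasi-complete) separation, where some combination of predictors classifies every observation perfectly and the maximum-likelihood estimates diverge. The toy example below reproduces those symptoms on a small made-up dataset; the data are hypothetical and chosen only because they are perfectly separable.

```r
# Hypothetical toy data: y is 1 exactly when x > 0, so x separates the classes
x <- c(-3, -2.5, -2, -1.5, -1, 1, 1.5, 2, 2.5, 3)
y <- as.integer(x > 0)

# glm() warns that fitted probabilities are numerically 0 or 1;
# the warning is silenced here only to keep the output short
fit <- suppressWarnings(glm(y ~ x, family = binomial))

print(summary(fit)$coefficients)  # very large standard errors
print(deviance(fit))              # residual deviance close to zero
print(fit$iter)                   # number of Fisher scoring iterations
```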
Our chosen binary logistic regression model appears to perform well. All the requested statistics are summarized as follows:
We now show the work that produced the above results. The pROC library, introduced in Data 621, gives us the ROC curve which shows an excellent fit for the culmen model.
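The area under the ROC curve has a simple interpretation: it is the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. A minimal base-R sketch of that rank-based (Mann-Whitney) calculation follows. The helper name `auc_rank` is introduced here for illustration and is not part of pROC, and the labels and scores are made up rather than taken from the culmen model.

```r
# Rank-based AUC: the Mann-Whitney U statistic scaled to [0, 1]
# (illustrative helper, not a pROC function)
auc_rank <- function(labels, scores) {
  r     <- rank(scores)            # average ranks handle ties
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical labels and predicted probabilities
labels <- c(0, 0, 0, 1, 0, 1, 1, 1)
scores <- c(0.05, 0.10, 0.30, 0.35, 0.60, 0.70, 0.90, 0.95)

print(auc_rank(labels, scores))  # 15 of 16 positive/negative pairs are ordered correctly
```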
Although any probability threshold between 0 and 1 could be used to classify observations, we compare two obvious candidates.
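\[ \mathrm{mod}(X) = \hat{Y} = \mathbf{1}\left\{ \Pr[\text{Gentoo} = 1 \mid X] > 0.5 \right\} \] \[ \mathrm{mod}(X) = \hat{Y} = \mathbf{1}\left\{ \Pr[\text{Gentoo} = 1 \mid X] > 0.36 \right\} \]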
The threshold of 36% is the observed sample frequency of Gentoo penguins and is often recommended for modeling purposes over the 50% threshold. The 50% threshold seems to lead to slightly higher accuracy, but given the success of the 36% threshold we prefer the 36% threshold, which is the one used for the confusion matrix below.
## Confusion Matrix and Statistics
##
## Reference
## Prediction F T
## F 210 2
## T 4 117
##
## Accuracy : 0.982
## 95% CI : (0.9612, 0.9934)
## No Information Rate : 0.6426
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9609
##
## Mcnemar's Test P-Value : 0.6831
##
## Sensitivity : 0.9832
## Specificity : 0.9813
## Pos Pred Value : 0.9669
## Neg Pred Value : 0.9906
## Prevalence : 0.3574
## Detection Rate : 0.3514
## Detection Prevalence : 0.3634
## Balanced Accuracy : 0.9823
##
## 'Positive' Class : T
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000000 0.0000000 0.0000005 0.3573574 0.9999996 1.0000000
##
## FALSE TRUE
## 0 214 0
## 1 1 118
##
## FALSE TRUE
## 0 213 1
## 1 1 118
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3574 1.0000 1.0000
##
## FALSE TRUE
## 0 214 0
## 1 0 119
##
## FALSE TRUE
## 0 214 0
## 1 0 119
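The statistics in the culmen-model confusion matrix (threshold 0.36) follow directly from its four counts. The sketch below recomputes them by hand as a check on how each one is defined; the counts are typed in from the printed table.

```r
# Counts from the culmen-model confusion matrix (rows = prediction, columns = reference)
tn <- 210; fn <- 2
fp <- 4;   tp <- 117

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # true positive rate
specificity <- tn / (tn + fp)   # true negative rate
ppv         <- tp / (tp + fp)   # positive predictive value
npv         <- tn / (tn + fn)   # negative predictive value

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv, npv = npv), 4))
```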
The multinomial logistic regression model is relatively straightforward to implement using the nnet library.
## # weights: 12 (6 variable)
## initial value 365.837892
## iter 10 value 35.950274
## iter 20 value 25.109999
## iter 30 value 24.623162
## iter 40 value 23.976224
## iter 50 value 23.895226
## final value 23.895025
## converged
## Call:
## multinom(formula = species2 ~ bill_length_mm + bill_depth_mm,
## data = pc)
##
## Coefficients:
## (Intercept) bill_length_mm bill_depth_mm
## Adelie -22.96375 -2.7001766 8.218320
## Chinstrap -48.02067 -0.4846485 4.257897
##
## Std. Errors:
## (Intercept) bill_length_mm bill_depth_mm
## Adelie 19.89498 0.7083994 1.795938
## Chinstrap 14.46719 0.1555850 1.000543
##
## Residual Deviance: 47.79005
## AIC: 59.79005
## (Intercept) bill_length_mm bill_depth_mm
## Adelie -1.154249 -3.811658 4.576059
## Chinstrap -3.319281 -3.115008 4.255586
## (Intercept) bill_length_mm bill_depth_mm
## Adelie 0.2483981569 0.0001380376 4.738163e-06
## Chinstrap 0.0009024962 0.0018394000 2.085024e-05
Multinomial coefficients are interpreted as the impact on log odds of a given class relative to the base class. The base class was chosen to be Gentoo penguins.
For the Adelie class, the coefficient -2.70 for bill_length_mm means a 1mm increase in bill length decreases the log odds of Adelie versus Gentoo by 2.70 units of log odds. The coefficient 8.22 for bill_depth_mm means a 1mm increase in bill depth increases the log odds of Adelie versus Gentoo by 8.22.
For the Chinstrap class, the coefficient of -0.48 for bill_length_mm means a 1 mm increase in bill length decreases the log odds of Chinstrap versus Gentoo by 0.48. The coefficient of 4.26 for bill_depth_mm means a 1 mm increase in bill depth increases the log odds of Chinstrap versus Gentoo by 4.26.
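To see how these log odds translate into class probabilities, the coefficients can be applied to a single bird and passed through the softmax function, with the baseline class (Gentoo) assigned a log odds of zero. The measurements used below (45 mm bill length, 15 mm bill depth) are a hypothetical example, and the coefficients are copied by hand from the output above.

```r
# Coefficients from the multinom output (baseline class: Gentoo)
B <- rbind(
  Adelie    = c(-22.96375, -2.7001766, 8.218320),
  Chinstrap = c(-48.02067, -0.4846485, 4.257897)
)
colnames(B) <- c("(Intercept)", "bill_length_mm", "bill_depth_mm")

# Hypothetical bird: intercept term, 45 mm bill length, 15 mm bill depth
x <- c(1, 45, 15)

# Log odds of each class versus Gentoo; the baseline itself has log odds 0
eta <- c(Gentoo = 0, drop(B %*% x))

# Softmax converts the log odds into probabilities that sum to 1
prob <- exp(eta) / sum(exp(eta))
print(round(prob, 4))
```

A long, shallow bill of this kind is assigned to Gentoo with probability above 0.99, consistent with the signs of the coefficients.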
Because the nnet package does not calculate z statistics or p-values, we follow the UCLA IDRE webpage in estimating them from the coefficients and standard errors. Note that the resulting z and p-values indicate that every coefficient except the Adelie intercept (p ≈ 0.25) is statistically significant at the 1% level.
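The Wald calculation is short enough to verify by hand: each z statistic is a coefficient divided by its standard error, and the two-sided p-value comes from the standard normal distribution. The sketch below uses the estimates and standard errors typed in from the multinom summary rather than the fitted object.

```r
# Coefficients and standard errors copied from the multinom summary
est <- rbind(
  Adelie    = c(-22.96375, -2.7001766, 8.218320),
  Chinstrap = c(-48.02067, -0.4846485, 4.257897)
)
se <- rbind(
  Adelie    = c(19.89498, 0.7083994, 1.795938),
  Chinstrap = c(14.46719, 0.1555850, 1.000543)
)
colnames(est) <- colnames(se) <- c("(Intercept)", "bill_length_mm", "bill_depth_mm")

z <- est / se             # Wald z statistics
p <- 2 * pnorm(-abs(z))   # two-sided p-values under a standard normal

print(round(z, 3))
print(signif(p, 3))
```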
We consider the use of the multi-class confusion matrix as the most practical evaluation of model performance.
If held-out test data were available, it could be used to evaluate model performance out of sample.
In this case, we calculate the accuracy on the training sample only, for illustrative purposes.
The predict method, applied to the multinomial model, returns the predicted class. Next, we use confusionMatrix from the caret package to build the multi-class confusion matrix and tabulate additional statistics.
## Confusion Matrix and Statistics
##
## Reference
## Prediction Gentoo Adelie Chinstrap
## Gentoo 117 0 2
## Adelie 0 143 3
## Chinstrap 4 3 61
##
## Overall Statistics
##
## Accuracy : 0.964
## 95% CI : (0.9379, 0.9812)
## No Information Rate : 0.4384
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9435
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Gentoo Class: Adelie Class: Chinstrap
## Sensitivity 0.9669 0.9795 0.9242
## Specificity 0.9906 0.9840 0.9738
## Pos Pred Value 0.9832 0.9795 0.8971
## Neg Pred Value 0.9813 0.9840 0.9811
## Prevalence 0.3634 0.4384 0.1982
## Detection Rate 0.3514 0.4294 0.1832
## Detection Prevalence 0.3574 0.4384 0.2042
## Balanced Accuracy 0.9788 0.9817 0.9490
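We conclude that the above multinomial model performs well, with 96.4% accuracy on the training data. Across the three species, sensitivity and specificity are around 95% on average. Note that the rows of the table sum to the observed species counts (119 Gentoo, 146 Adelie, 68 Chinstrap), so the rows hold the observed classes and the columns the predicted ones, and the reported Sensitivity and Pos Pred Value are therefore interchanged. Performance is weakest for Chinstrap penguins, where the sensitivity is 61/68 = 89.7% and the positive predictive value is 61/66 = 92.4%.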
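Sensitivity is the share of a class's observed members that are predicted correctly, and positive predictive value is the share of a class's predictions that are correct. Which of the two is computed along rows and which along columns depends on which dimension of the table holds the observed species, so the sketch below recomputes both ratios from the printed counts for comparison with the caret statistics.

```r
# Multi-class table, entered row by row exactly as printed above
cm <- matrix(c(117,   0,  2,
                 0, 143,  3,
                 4,   3, 61),
             nrow = 3, byrow = TRUE,
             dimnames = list(row = c("Gentoo", "Adelie", "Chinstrap"),
                             col = c("Gentoo", "Adelie", "Chinstrap")))

accuracy     <- sum(diag(cm)) / sum(cm)
share_of_row <- diag(cm) / rowSums(cm)   # correct count / row total
share_of_col <- diag(cm) / colSums(cm)   # correct count / column total

print(round(accuracy, 4))
print(round(rbind(share_of_row, share_of_col), 4))
```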
Another statistic of interest is kappa. The value of 94.35% is regarded as very high. This value agrees closely with the Matthews correlation coefficient (abbreviated MCC), which some machine learning theorists regard as the best single statistic for assessing binary classification and which is also defined for multi-class problems. The two statistics are not identical in general, but they share a numerator and nearly coincide when the predicted and observed class totals are similar, as they are here. The values of kappa for the binary and ternary models above agree closely with the MCC calculated from the mltools package. MCC contains more information on the calculation method.
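Both Cohen's kappa and the multi-class generalization of the MCC can be computed directly from the confusion matrix. The base-R sketch below does so for the three-class table; it is an independent hand calculation, not the code used by caret or mltools, and the variable names are illustrative.

```r
# Multi-class table, entered row by row as printed above
cm <- matrix(c(117, 0, 2,
                 0, 143, 3,
                 4, 3, 61), nrow = 3, byrow = TRUE)

s       <- sum(cm)         # total observations
correct <- sum(diag(cm))   # correctly classified observations
r       <- rowSums(cm)
k       <- colSums(cm)

# Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)
po        <- correct / s
pe        <- sum(r * k) / s^2
kappa_hat <- (po - pe) / (1 - pe)

# Multi-class MCC: same numerator as kappa, geometric-mean style denominator
mcc <- (correct * s - sum(r * k)) / sqrt((s^2 - sum(r^2)) * (s^2 - sum(k^2)))

print(round(c(kappa = kappa_hat, mcc = mcc), 4))
```

Because the row and column totals are almost equal in this table, the two denominators are nearly the same and the statistics differ only beyond the fourth decimal place.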
We summarize all the R code used in this project in this appendix for ease of reading.
library(tidyverse)
library(ggplot2)
library(knitr)
library(kableExtra)
library(GGally)
library(margins)
knitr::opts_chunk$set(echo = FALSE, message=FALSE, warning=FALSE)
library(palmerpenguins)
library(skimr)
skim(penguins)
penguins %>% filter(is.na(sex) == FALSE) %>% group_by(species, sex) %>% skim()
pc = penguins %>% filter( is.na(sex) == FALSE)
ggplot(pc) + geom_bar(alpha=0.8, aes(x=year, fill = species)) + facet_grid(vars(island), vars(species))
mu <- pc %>% group_by(species, year, island) %>% summarize(bmg.mean = mean(body_mass_g),
flm.mean = mean(flipper_length_mm),
blm.mean = mean(bill_length_mm),
bdm.mean = mean(bill_depth_mm))
ggplot(pc) + geom_density(aes(x=body_mass_g, fill=as.factor(year)), alpha = 0.4) + facet_grid(vars(island), vars(species)) + geom_vline( data=mu, aes(xintercept=bmg.mean, color=as.factor(year) ), linetype='dashed' )
ggplot(pc) + geom_density(aes(x=bill_length_mm, fill=as.factor(year)), alpha = 0.4) + facet_grid(vars(island), vars(species)) + geom_vline( data=mu, aes(xintercept=blm.mean, color=as.factor(year) ), linetype='dashed' )
ggplot(pc) + geom_density(aes(x=flipper_length_mm, fill=as.factor(year)), alpha = 0.4) + facet_grid(vars(island), vars(species)) + geom_vline( data=mu, aes(xintercept=flm.mean, color=as.factor(year) ), linetype='dashed' )
ggplot(pc) + geom_density(aes(x=bill_depth_mm, fill=as.factor(year)), alpha = 0.4) + facet_grid(vars(island), vars(species)) + geom_vline( data=mu, aes(xintercept=bdm.mean, color=as.factor(year) ), linetype='dashed' )
pc %>%
select(species, body_mass_g, ends_with("_mm")) %>%
GGally::ggpairs(aes(color = species),
columns = c("flipper_length_mm", "body_mass_g",
"bill_length_mm", "bill_depth_mm")
) +
scale_colour_manual(values = c("darkorange","purple","cyan4")) +
scale_fill_manual(values = c("darkorange","purple","cyan4"))
pc %>% filter(sex == 'female') %>%
select(species, body_mass_g, ends_with("_mm")) %>%
GGally::ggpairs(aes(color = species),
columns = c("flipper_length_mm", "body_mass_g",
"bill_length_mm", "bill_depth_mm")
) +
scale_colour_manual(values = c("darkorange","purple","cyan4")) +
scale_fill_manual(values = c("darkorange","purple","cyan4"))
pc %>% filter(sex == 'male') %>%
select(species, body_mass_g, ends_with("_mm")) %>%
GGally::ggpairs(aes(color = species),
columns = c("flipper_length_mm", "body_mass_g",
"bill_length_mm", "bill_depth_mm")
) +
scale_colour_manual(values = c("darkorange","purple","cyan4")) +
scale_fill_manual(values = c("darkorange","purple","cyan4"))
# Population Comparison of the Gentoos vs. Adelie and Chinstrap penguins.
tot = nrow(pc)
pc %>% group_by(species) %>% summarize( n() , pct = round( n()/tot * 100 , 1 ))
pc$Gentoo = ifelse(pc$species == 'Gentoo', 1, 0)
pc %>% group_by(Gentoo) %>% summarize( n() , pct = round( n()/tot * 100 , 1 ))
# Parsimonious model with Culmen characteristics only
culmenmod = glm( Gentoo ~ bill_length_mm + bill_depth_mm, data = pc, family = binomial)
summary(culmenmod)
margins(culmenmod)
cgmod = glm( Gentoo ~ bill_length_mm + bill_depth_mm + as.factor(sex) , data = pc, family=binomial)
summary(cgmod)
margins(cgmod)
fullmod = glm( Gentoo ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g + I(sex), data = pc, family = binomial)
summary(fullmod)
margins(fullmod)
library(pROC) # needed for the ROC and AUC statistics
pred_culmenmod = predict(culmenmod, type="response")
par(pty='s') # set the printing aspect ratio to a square
roc_culmenmod = plot.roc(pc$Gentoo, pred_culmenmod, print.auc = TRUE, col='blue', lwd = 4)
pred_cgmod = predict(cgmod, type="response")
pred_fullmod = predict(fullmod, type="response")
library(caret)
predicted_class = factor(ifelse(pred_culmenmod > 0.36 , "T", "F") )
actual_class = factor( ifelse( pc$Gentoo == 1, "T", "F" ))
(c_culmenmod = confusionMatrix(predicted_class, actual_class, positive = c("T") ) )
summary(pred_cgmod)
table(pc$Gentoo, pred_cgmod > 0.5)
table(pc$Gentoo, pred_cgmod > 0.36)
summary(pred_fullmod)
table(pc$Gentoo, pred_fullmod > 0.5)
table(pc$Gentoo, pred_fullmod > 0.36)
par(pty='s')
roc_cgmod = plot.roc(pc$Gentoo, pred_cgmod, print.auc = TRUE, col='blue', lwd = 4)
par(pty='s')
roc_fullmod = plot.roc(pc$Gentoo, pred_fullmod, print.auc = TRUE, col='blue', lwd = 4)
library(foreign)
library(nnet)
pc$species2 = relevel( pc$species, ref = 'Gentoo')
multi_mod1 = multinom( species2 ~ bill_length_mm + bill_depth_mm, data = pc )
summary(multi_mod1)
#-pvalues
(z <- summary(multi_mod1)$coefficients / summary( multi_mod1)$standard.errors )
(p <- (1 - pnorm(abs(z), 0, 1 )) *2 )
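# Confusion matrix for the ternary classification model.
pred_multi_mod1 = predict( multi_mod1, newdata=pc, "class")
# The caret library implements the key metrics derived for the 3-class confusion matrix.
# confusionMatrix() expects the predictions first and the observed classes second;
# reversing them transposes the table and swaps sensitivity with positive predictive value.
confusionMatrix(data = pred_multi_mod1, reference = pc$species2)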