Let’s use the Penguin dataset for our assignment. To learn more about the dataset, please visit: <https://allisonhorst.github.io/palmerpenguins/articles/intro.html>
Below describes the penguin dataset where we have 344 observations with 8 features used to describe these observations. Three of these variables are nominal while the other five are numeric.
##Conflicted is used to identify duplicate functions in different packages and explicitely define which should be used
library(conflicted)
pacman::p_load(
# for data importing and pre-processing
dplyr, tidyverse,readxl, palmerpenguins,tidyselect,
# for EDA
skimr, DataExplorer,
# for data modeling and visualization
tidymodels, broom, dotwhisker, vip,
# table processing
kableExtra
)
summary(penguins) %>%kable() %>% kable_styling(
full_width = F, position="center") %>%
scroll_box()| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | |
|---|---|---|---|---|---|---|---|---|
| Adelie :152 | Biscoe :168 | Min. :32.10 | Min. :13.10 | Min. :172.0 | Min. :2700 | female:165 | Min. :2007 | |
| Chinstrap: 68 | Dream :124 | 1st Qu.:39.23 | 1st Qu.:15.60 | 1st Qu.:190.0 | 1st Qu.:3550 | male :168 | 1st Qu.:2007 | |
| Gentoo :124 | Torgersen: 52 | Median :44.45 | Median :17.30 | Median :197.0 | Median :4050 | NA’s : 11 | Median :2008 | |
| NA | NA | Mean :43.92 | Mean :17.15 | Mean :200.9 | Mean :4202 | NA | Mean :2008 | |
| NA | NA | 3rd Qu.:48.50 | 3rd Qu.:18.70 | 3rd Qu.:213.0 | 3rd Qu.:4750 | NA | 3rd Qu.:2009 | |
| NA | NA | Max. :59.60 | Max. :21.50 | Max. :231.0 | Max. :6300 | NA | Max. :2009 | |
| NA | NA | NA’s :2 | NA’s :2 | NA’s :2 | NA’s :2 | NA | NA |
There are a few observations that have majority of NA’s. these are removed from our analyses as they do not provide enough information about the observation to make an inference. The observations that were removed are shown below and the only information included for those were the island and year for the penguin species
limitmissing <- .5*ncol(penguins)
retain<-apply(penguins, MARGIN= 1, function(y) sum(length(which(is.na(y)))))<limitmissing
Dataset<-penguins[retain,]
penguins[!retain,] %>% kable() %>% kable_styling(
full_width = F, position="center") %>%
scroll_box()| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
| Gentoo | Biscoe | NA | NA | NA | NA | NA | 2009 |
For this assignment, let us use ‘species’ as our outcome or the dependent variable.
The penguin dataset has ‘species’ column. Please check how many categories you have in the species column. Conduct whatever data manipulation you need to do to be able to build a logistic regression with binary outcome. Please explain your reasoning behind your decision as you manipulate the outcome/dependent variable (species).
The table below provides the class distribution of the penguin species. It shows that we have 152 Adelie penguins, 68 Chinstrap penguins and 124 Gentoo penguins.
table(penguins$species)##
## Adelie Chinstrap Gentoo
## 152 68 124
As shown in the table above, the penguins dataset has approximately 3 classes, Adelie, Chinstrap, and Gentoo. The percentage associated with each class is shown below.
Dataset<-penguins[retain,]
Dataset<-
Dataset %>% droplevels.data.frame() %>%
mutate(species = as.character(species)) %>%
#mutate(species =if_else(species == "Adelie", species, "Not Adelie", missing = NULL)) %>% droplevels.data.frame() %>%
mutate(species= as.factor(species)) %>%
filter(is.na(species)==F)
# count of each dataset
#table(Dataset$species)
# proportion of each datasetprop.table(table(Dataset$species)) %>% as.data.frame() %>%
mutate(Freq= percent(Freq)) %>%
rename( "Class" = Var1, "Proportion" = "Freq")## Class Proportion
## 1 Adelie 44.2%
## 2 Chinstrap 19.9%
## 3 Gentoo 36.0%
We can look at a distribution of our numeric variables below and we can see that each of the features follow close to a gaussian distribution. flipper length follows a bimodal distribution and is likely a distribution of 2 different penguin species.
DataExplorer::plot_histogram(Dataset)Year is inappropriately classified as a numeric feature and so convert it into a factor for our predictions.
Dataset$year <-as.factor(Dataset$year) For our nominal or non-numeric variables, We can see the distribution of each. there are a few species where “sex” is missing and we will use various imputation methods to see if we may appropriately be able to classify the gender.
DataExplorer::plot_bar(Dataset)Below shows the distribution of our Dataset with the three penguin classes included. We can see that there is quite the distinction in classes especially when it comes to Bill length
GGally::ggpairs(Dataset[,c(1,3,4,5,6)], progress = F, aes(color = species))+ theme_bw()Because we are looking to perform a binary classification problem, we will be aggregating the Chinstrap variable with Gentoo as “non-Adelie” in order to maintain a larger dataset with more information per each class. Understanding the proportion that belongs to each class within the dataset is important to assess whether or not there will be bias toward predicting any particular class. with a 45/55 class distribution, the dataset is sufficiently diverse to make predictions on these classes without having to alter the random sampling methodology. Below provides the exploration of this dataset when treating this as a binary problem with the classes Adelie and non-adelie
unique(penguins$species)## [1] Adelie Gentoo Chinstrap
## Levels: Adelie Chinstrap Gentoo
Dataset_rev<-
Dataset %>% droplevels.data.frame() %>%
mutate(species = as.character(species)) %>%
mutate(species =if_else(species == "Adelie", species, "Not Adelie", missing = NULL)) %>% droplevels.data.frame() %>%
#filter(species!= "Gentoo") %>%
mutate(species= as.factor(species))
GGally::ggpairs(Dataset_rev[,c(1,3,4,5,6)], progress = F, aes(color = species))+ theme_bw()For our binomial logistic regression model, we’re using the glm engine from the tidymodels package and using a test/train split of 75%/25%, this sampling is stratified by species to obtain a sample closely representing the division between the two classifications. A 10-fold cross-validation set is pulled in-order avoid over-training by testing the data on 10 different splits of the training data.
When considering all the data, we actually obtain a perfect regression model and can predict the penguin species with 100% accuracy. this is primarily due to the bill_length_mm feature which very distinctly identifies different penguins when regressed against bill bill depth, flipper length, and body mass. Because of this, to make the model actually serve any purpose, bill length was removed from the dataset to purposely decrease the accuracy and attempt to predict with the other not-so-perfect features.
LogR_Engine<-
logistic_reg(mode = "classification") %>%
set_engine(engine = "glm")
set.seed(3)
Datasplit<- initial_split(Dataset_rev, prop = .75, strata = species)
Datatrain<-training(Datasplit)
datacv<- vfold_cv(Datatrain, v = 10, repeats =1, strata = species)
datatest<-testing(Datasplit)
dataval<-validation_split(Datatrain, prop = .75, strata = species)we can see the accuracy based on our cross validation metrics below. This accuracy is based on the average accuracy of 10 resampled test/train splits. The Akaine Information Criteria (AIC) is presented below and and provides an estimate of the prediction error.
Logistic_Recipe<-
recipe(species ~ ., data = Datatrain) %>%
step_rm(bill_length_mm) %>%
step_medianimpute(all_numeric()) %>%
step_zv(all_predictors()) %>%
step_nzv(all_predictors())
Penguin_Wflow<-
workflow() %>%
add_model(LogR_Engine) %>%
add_recipe(Logistic_Recipe)
#DataExplorer::plot_correlation(Dataset, type = "all")
#
### tuning the model
##LRreg_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 10))
#
#
# fit on the crossvalidation dataset
LR_rs<- Penguin_Wflow %>% fit_resamples(resamples = datacv,
control = control_resamples(save_pred = T))
#collect average ROC from cross validation training set
LR_rs %>% collect_metrics(summarize = T)## # A tibble: 2 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.896 10 0.0237 Preprocessor1_Model1
## 2 roc_auc binary 0.963 10 0.0124 Preprocessor1_Model1
Additionally, we can observe which variables are most significant in our regression as well as the weight that they have on our model. Island, bodymass and sex each have pvalues >0.5 and indicate that these features are not significant within our model.
Finally, Our ROC curve and our confusion matrix is presented below showing how well our model performs. The closer the ROC curve is to the upper right corner of the plot (1.0), the more accurate it is. With bill length included in the model, this would show a perfect curve. The confusion matrics shows that we correctly classified Adelie 30/35 times, and not-adelie 38/45 times. This results in an overall accuracy of 85% on the testing dataset.
logmodel<-Penguin_Wflow %>%
fit(data = Datatrain)
logmodel %>%
pull_workflow_fit() ## parsnip model object
##
## Fit time: 11ms
##
## Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) islandDream islandTorgersen bill_depth_mm
## -4.570e+01 3.199e+00 -1.714e+01 -5.099e-01
## flipper_length_mm body_mass_g sexmale year2008
## 2.769e-01 4.923e-06 -8.802e-01 -1.798e+00
## year2009
## -1.497e+00
##
## Degrees of Freedom: 252 Total (i.e. Null); 244 Residual
## (5 observations deleted due to missingness)
## Null Deviance: 346.9
## Residual Deviance: 105.5 AIC: 123.5
tidy(logmodel) %>% kable() %>% kable_styling(
full_width = F, position="center") %>%
scroll_box()| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -45.6981890 | 9.6250843 | -4.7478222 | 0.0000021 |
| islandDream | 3.1992578 | 1.0342586 | 3.0932861 | 0.0019795 |
| islandTorgersen | -17.1363499 | 1569.3017858 | -0.0109197 | 0.9912875 |
| bill_depth_mm | -0.5098921 | 0.2457986 | -2.0744305 | 0.0380393 |
| flipper_length_mm | 0.2768751 | 0.0496462 | 5.5769636 | 0.0000000 |
| body_mass_g | 0.0000049 | 0.0007712 | 0.0063836 | 0.9949067 |
| sexmale | -0.8802486 | 0.7912972 | -1.1124121 | 0.2659610 |
| year2008 | -1.7984172 | 0.6780637 | -2.6522834 | 0.0079949 |
| year2009 | -1.4969014 | 0.6576584 | -2.2761078 | 0.0228396 |
penguin_pred_class <-
predict(logmodel, datatest, type = c("class")) %>%
bind_cols(datatest %>% select(species))
penguin_pred_prob <-
predict(logmodel, datatest, type = c("prob")) %>%
bind_cols(datatest %>% select(species))
penguin_pred_prob %>%
roc_curve(truth =species,estimate= .pred_Adelie) %>%
autoplot()penguin_pred_class %>% conf_mat(truth = species, estimate = .pred_class) %>%
autoplot(type = "heatmap")penguin_pred_class %>%
accuracy(truth = .pred_class, estimate=species)## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.85
Next, we run a multinomial regression to try and predict based on three distinct outcome classes. These will be the original penguin species identified in the dataset. The same test/train logic is used for this process and we use the “nnet” engine from the tidymodels package. This is representative of a single later neural network. The steps taken to treat the dataset include: median imputations for any missing numeric values, removing variables with near-zero variance that do not provide useful information for model to distinguish between outcomes.
#
#MLogR_Engine<-
#multinom_reg(mode = "classification") %>%
# set_engine(engine = "glmnet")
#
#MLogR_Engine<-
#multinom_reg(mode = "classification") %>%
# set_engine(engine = "keras")
#
MLogR_Engine<-
multinom_reg(mode = "classification") %>%
set_engine(engine = "nnet")
set.seed(3)
Datasplit<- initial_split(Dataset, prop = .75, strata = species)
Datatrain<-training(Datasplit)
datacv<- vfold_cv(Datatrain, v = 10, repeats =1, strata = species)
datatest<-testing(Datasplit)
Logistic_Recipe<-
recipe(species ~ ., data = Datatrain) %>%
#step_rm(bill_length_mm) %>%
step_medianimpute(all_numeric()) %>%
step_modeimpute(all_nominal(), -all_outcomes()) %>%
step_zv(all_predictors()) %>%
step_nzv(all_predictors())
Penguin_Wflow<-
workflow() %>%
add_model(MLogR_Engine) %>%
add_recipe(Logistic_Recipe)The results of this model is provided below and we obtain an accuracy of 0.984 on the testing set. The confusion matrix is presented below and shows we correctly predicted chinstrap penguins 100% of the time. one case of an Adelie penguin was misclassified as a Chinstrap, and one gase of a Gentou penguin was misclassified as an Adelie.
# fit on the crossvalidation dataset
LR_rs<- Penguin_Wflow %>% fit_resamples(resamples = datacv,
control = control_resamples(save_pred = T))
#collect average ROC from cross validation training set
LR_rs %>% collect_metrics(summarize = T)## # A tibble: 2 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy multiclass 0.984 10 0.00635 Preprocessor1_Model1
## 2 roc_auc hand_till 1 10 0 Preprocessor1_Model1
logmodel<-Penguin_Wflow %>%
fit(data = Datatrain)
penguin_pred_class <-
predict(logmodel, datatest, type = c("class")) %>%
bind_cols(datatest %>% select(species))
penguin_pred_prob <-
predict(logmodel, datatest, type = c("prob")) %>%
bind_cols(datatest %>% select(species))
penguin_pred_class %>% conf_mat(truth = species, estimate = .pred_class) %>%
autoplot(type = "heatmap")The accuracy of the training set is shown below and measures 0.976 which shows high predictability and no over/undertraining. The figure provides a visual representation of a single-layer nueral network.
penguin_pred_class %>%
accuracy(truth = .pred_class, estimate=species)## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.976
NeuralNetTools::plotnet(logmodel$fit$fit$fit)