DATA 622 – Homework #1

Let’s use the Penguin dataset for our assignment. To learn more about the dataset, please visit: <https://allisonhorst.github.io/palmerpenguins/articles/intro.html>

The summary below describes the penguin dataset, which contains 344 observations described by 8 features. Three of these variables are nominal while the other five are numeric.

# The conflicted package is used to identify functions duplicated across packages and to explicitly define which one should be used
library(conflicted)

pacman::p_load(
  # for data importing and pre-processing
  dplyr, tidyverse, readxl, palmerpenguins, tidyselect,
  # for EDA
  skimr, DataExplorer,
  # for data modeling and visualization
  tidymodels, broom, dotwhisker, vip,
  # table processing
  kableExtra
)

summary(penguins) %>% 
  kable() %>% 
  kable_styling(full_width = F, position = "center") %>% 
  scroll_box()
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Adelie :152 | Biscoe :168 | Min. :32.10 | Min. :13.10 | Min. :172.0 | Min. :2700 | female:165 | Min. :2007 |
| Chinstrap: 68 | Dream :124 | 1st Qu.:39.23 | 1st Qu.:15.60 | 1st Qu.:190.0 | 1st Qu.:3550 | male :168 | 1st Qu.:2007 |
| Gentoo :124 | Torgersen: 52 | Median :44.45 | Median :17.30 | Median :197.0 | Median :4050 | NA's : 11 | Median :2008 |
| | | Mean :43.92 | Mean :17.15 | Mean :200.9 | Mean :4202 | | Mean :2008 |
| | | 3rd Qu.:48.50 | 3rd Qu.:18.70 | 3rd Qu.:213.0 | 3rd Qu.:4750 | | 3rd Qu.:2009 |
| | | Max. :59.60 | Max. :21.50 | Max. :231.0 | Max. :6300 | | Max. :2009 |
| | | NA's :2 | NA's :2 | NA's :2 | NA's :2 | | |

A few observations consist mostly of NA values. These are removed from our analyses, as they do not provide enough information about the observation to support inference. The removed observations are shown below; the only information recorded for them is the species, island, and year.

# rows missing at least half of the columns will be dropped
limitmissing <- 0.5 * ncol(penguins)

# TRUE for rows with fewer missing values than the threshold
retain <- apply(penguins, MARGIN = 1, function(y) sum(is.na(y))) < limitmissing

Dataset <- penguins[retain, ]

penguins[!retain,] %>% 
  kable() %>% 
  kable_styling(full_width = F, position = "center") %>% 
  scroll_box()
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
| Gentoo | Biscoe | NA | NA | NA | NA | NA | 2009 |

For this assignment, let us use ‘species’ as our outcome or the dependent variable.

Logistic Regression with a binary outcome.

The penguin dataset has a 'species' column. Please check how many categories you have in the species column. Conduct whatever data manipulation you need to be able to build a logistic regression with a binary outcome, and explain the reasoning behind your decisions as you manipulate the outcome/dependent variable (species).

The table below provides the class distribution of the penguin species. It shows that we have 152 Adelie penguins, 68 Chinstrap penguins and 124 Gentoo penguins.

table(penguins$species)
## 
##    Adelie Chinstrap    Gentoo 
##       152        68       124

As shown in the table above, the penguins dataset has three classes: Adelie, Chinstrap, and Gentoo. The percentage associated with each class is shown below.

Dataset <- penguins[retain, ]

Dataset <-
  Dataset %>% 
  droplevels() %>% 
  mutate(species = as.character(species)) %>% 
  mutate(species = as.factor(species)) %>% 
  filter(!is.na(species))

# count of each class
#table(Dataset$species)
# proportion of each class
prop.table(table(Dataset$species)) %>% 
  as.data.frame() %>% 
  mutate(Freq = percent(Freq)) %>% 
  rename("Class" = Var1, "Proportion" = Freq)
##       Class Proportion
## 1    Adelie      44.2%
## 2 Chinstrap      19.9%
## 3    Gentoo      36.0%

We can look at the distributions of our numeric variables below. Most of the features are close to Gaussian; flipper length is bimodal and likely reflects a mixture of the different penguin species.

DataExplorer::plot_histogram(Dataset)

Year is read in as a numeric feature, which is inappropriate for our purposes, so we convert it into a factor for our predictions.

Dataset$year <- as.factor(Dataset$year)

For our nominal (non-numeric) variables, we can see the distribution of each below. There are a few observations where sex is missing, and we will use imputation to see whether those values can reasonably be filled in; a quick count of the missing values follows the bar plots.

DataExplorer::plot_bar(Dataset)
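
As a quick check of how many sex values are actually missing (a small sketch using dplyr's count(), which tallies each factor level including NA):

# tally the sex levels, including the NA group, in the cleaned dataset
Dataset %>% count(sex)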

Below shows the pairwise distributions of our dataset with the three penguin classes included. We can see that there is quite a distinction between the classes, especially when it comes to bill length.

GGally::ggpairs(Dataset[,c(1,3,4,5,6)], progress = F, aes(color = species))+ theme_bw()

Because we are performing binary classification, we aggregate the Chinstrap and Gentoo penguins into a single "Not Adelie" class rather than dropping one species, which keeps a larger dataset with more information per class. Understanding the proportion belonging to each class is important for assessing whether the model will be biased toward predicting any particular class. With a roughly 45/55 class split, the dataset is sufficiently balanced to make predictions on these classes without altering the random sampling methodology; the class proportions are verified in the quick check after the recoding below. The pairs plot that follows explores the dataset when treated as a binary problem with the classes Adelie and Not Adelie.

unique(penguins$species)
## [1] Adelie    Gentoo    Chinstrap
## Levels: Adelie Chinstrap Gentoo
Dataset_rev <-
  Dataset %>% 
  droplevels() %>% 
  mutate(species = as.character(species)) %>% 
  mutate(species = if_else(species == "Adelie", species, "Not Adelie")) %>% 
  droplevels() %>% 
  #filter(species != "Gentoo") %>% 
  mutate(species = as.factor(species))
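
As a quick sanity check on the roughly 45/55 split mentioned above, we can recompute the class proportions on the recoded outcome (a small sketch reusing the earlier prop.table() approach):

# class proportions after collapsing Chinstrap and Gentoo into "Not Adelie"
prop.table(table(Dataset_rev$species))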

GGally::ggpairs(Dataset_rev[,c(1,3,4,5,6)], progress = F, aes(color = species))+ theme_bw()

For our binomial logistic regression model, we use the glm engine through tidymodels with a 75%/25% train/test split; the sampling is stratified by species so that each split closely reflects the division between the two classes. A 10-fold cross-validation set is created from the training data in order to avoid over-fitting by evaluating the model on 10 different resamples of the training data.

When all the features are considered, we actually obtain a perfect classification model and can predict the penguin species with 100% accuracy. This is primarily due to the bill_length_mm feature, which very distinctly separates the species when combined with bill depth, flipper length, and body mass. Because of this, to make the model exercise more informative, bill length was removed from the predictors, intentionally decreasing the accuracy and forcing the model to rely on the remaining, less-separable features; a quick check of this claim is sketched after the data-splitting code below.

# model specification: binomial logistic regression via the glm engine
LogR_Engine <-
  logistic_reg(mode = "classification") %>% 
  set_engine(engine = "glm")

# stratified 75/25 train/test split and 10-fold cross-validation on the training set
set.seed(3)
Datasplit <- initial_split(Dataset_rev, prop = .75, strata = species)
Datatrain <- training(Datasplit)
datacv    <- vfold_cv(Datatrain, v = 10, repeats = 1, strata = species)
datatest  <- testing(Datasplit)
dataval   <- validation_split(Datatrain, prop = .75, strata = species)
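
As noted above, a quick way to verify the effect of bill length is to refit the same resampled workflow while keeping bill_length_mm among the predictors (a hedged sketch reusing the engine, training data, and folds defined above; Full_Recipe is just an illustrative name, and the exact numbers may vary, but the cross-validated accuracy should sit at or very near 100%):

# Sketch: same glm engine and resamples, but with bill_length_mm retained,
# to illustrate the near-perfect separation described above
Full_Recipe <-
  recipe(species ~ ., data = Datatrain) %>% 
  step_medianimpute(all_numeric()) %>% 
  step_zv(all_predictors()) %>% 
  step_nzv(all_predictors())

workflow() %>% 
  add_model(LogR_Engine) %>% 
  add_recipe(Full_Recipe) %>% 
  fit_resamples(resamples = datacv) %>% 
  collect_metrics()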

We can see the accuracy based on our cross-validation metrics below. This accuracy is the average over the 10 resampled splits of the training data. The Akaike Information Criterion (AIC), reported in the model output further below, provides an estimate of relative prediction error for comparing candidate models.
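For reference, AIC is defined as AIC = 2k − 2·ln(L̂), where k is the number of estimated parameters and L̂ is the maximized likelihood of the fitted model; lower AIC indicates a better trade-off between goodness of fit and model complexity.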

Logistic_Recipe <-
  recipe(species ~ ., data = Datatrain) %>% 
  step_rm(bill_length_mm) %>%           # drop the near-perfectly separating feature
  step_medianimpute(all_numeric()) %>%  # impute missing numeric values with the median
  step_zv(all_predictors()) %>%         # drop zero-variance predictors
  step_nzv(all_predictors())            # drop near-zero-variance predictors


Penguin_Wflow<-
  workflow() %>% 
  add_model(LogR_Engine) %>% 
  add_recipe(Logistic_Recipe)

#DataExplorer::plot_correlation(Dataset, type = "all")

### tuning grid (not needed for the plain glm engine; kept for reference)
##LRreg_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 10))

# fit the workflow on the cross-validation resamples
LR_rs<- Penguin_Wflow %>% fit_resamples(resamples = datacv,
                                        control = control_resamples(save_pred = T))
# collect the average accuracy and ROC AUC across the cross-validation resamples
LR_rs %>% collect_metrics(summarize = T)
## # A tibble: 2 x 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.896    10  0.0237 Preprocessor1_Model1
## 2 roc_auc  binary     0.963    10  0.0124 Preprocessor1_Model1

Additionally, we can observe which variables are most significant in our regression, as well as the weight (coefficient) each has in the model. The islandTorgersen, body_mass_g, and sex terms each have p-values greater than 0.05, indicating that these features are not significant within our model.

Finally, our ROC curve and confusion matrix are presented below, showing how well our model performs. The closer the ROC curve is to the top-left corner of the plot (area under the curve approaching 1.0), the better the model discriminates; with bill length included, this would be an essentially perfect curve. The confusion matrix shows that we correctly classified Adelie 30/35 times and Not Adelie 38/45 times, for an overall accuracy of (30 + 38)/80 = 85% on the testing dataset.

logmodel<-Penguin_Wflow %>% 
  fit(data = Datatrain)

logmodel %>% 
  pull_workflow_fit() 
## parsnip model object
## 
## Fit time:  11ms 
## 
## Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
##       (Intercept)        islandDream    islandTorgersen      bill_depth_mm  
##        -4.570e+01          3.199e+00         -1.714e+01         -5.099e-01  
## flipper_length_mm        body_mass_g            sexmale           year2008  
##         2.769e-01          4.923e-06         -8.802e-01         -1.798e+00  
##          year2009  
##        -1.497e+00  
## 
## Degrees of Freedom: 252 Total (i.e. Null);  244 Residual
##   (5 observations deleted due to missingness)
## Null Deviance:       346.9 
## Residual Deviance: 105.5     AIC: 123.5
tidy(logmodel) %>% 
  kable() %>% 
  kable_styling(full_width = F, position = "center") %>% 
  scroll_box()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -45.6981890 | 9.6250843 | -4.7478222 | 0.0000021 |
| islandDream | 3.1992578 | 1.0342586 | 3.0932861 | 0.0019795 |
| islandTorgersen | -17.1363499 | 1569.3017858 | -0.0109197 | 0.9912875 |
| bill_depth_mm | -0.5098921 | 0.2457986 | -2.0744305 | 0.0380393 |
| flipper_length_mm | 0.2768751 | 0.0496462 | 5.5769636 | 0.0000000 |
| body_mass_g | 0.0000049 | 0.0007712 | 0.0063836 | 0.9949067 |
| sexmale | -0.8802486 | 0.7912972 | -1.1124121 | 0.2659610 |
| year2008 | -1.7984172 | 0.6780637 | -2.6522834 | 0.0079949 |
| year2009 | -1.4969014 | 0.6576584 | -2.2761078 | 0.0228396 |

penguin_pred_class <- 
  predict(logmodel, datatest, type = c("class")) %>% 
  bind_cols(datatest %>% select(species)) 
penguin_pred_prob <- 
  predict(logmodel, datatest, type = c("prob")) %>% 
  bind_cols(datatest %>% select(species)) 


penguin_pred_prob %>% 
  roc_curve(truth =species,estimate= .pred_Adelie) %>% 
  autoplot()

penguin_pred_class %>% conf_mat(truth = species, estimate = .pred_class) %>%  
  autoplot(type = "heatmap")

penguin_pred_class %>% 
  accuracy(truth = species, estimate = .pred_class)
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary          0.85

Multinomial Logistic Regression (40)

Next, we run a multinomial regression to predict among the three original penguin species in the dataset. The same train/test logic is used, and we fit the model with the "nnet" engine through tidymodels, which estimates the multinomial model as a single-layer neural network. The preprocessing steps include median imputation for missing numeric values, mode imputation for missing nominal values, and removal of zero- and near-zero-variance predictors that would not help the model distinguish between outcomes.
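For reference, the multinomial model generalizes the binary logit by fitting one linear predictor per class and converting them to class probabilities with a softmax, P(species = k | x) = exp(x·βk) / Σj exp(x·βj), with one species serving as the reference level in the nnet parameterization.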

# alternative engines that were considered but not used:
#MLogR_Engine <- multinom_reg(mode = "classification") %>% set_engine(engine = "glmnet")
#MLogR_Engine <- multinom_reg(mode = "classification") %>% set_engine(engine = "keras")

# multinomial logistic regression via the nnet engine
MLogR_Engine <-
  multinom_reg(mode = "classification") %>% 
  set_engine(engine = "nnet")

set.seed(3)
Datasplit <- initial_split(Dataset, prop = .75, strata = species)
Datatrain <- training(Datasplit)
datacv    <- vfold_cv(Datatrain, v = 10, repeats = 1, strata = species)
datatest  <- testing(Datasplit)


Logistic_Recipe <-
  recipe(species ~ ., data = Datatrain) %>% 
  #step_rm(bill_length_mm) %>%                         # bill length is retained for the multinomial model
  step_medianimpute(all_numeric()) %>%                 # impute missing numeric values with the median
  step_modeimpute(all_nominal(), -all_outcomes()) %>%  # impute missing nominal values with the mode
  step_zv(all_predictors()) %>%                        # drop zero-variance predictors
  step_nzv(all_predictors())                           # drop near-zero-variance predictors


Penguin_Wflow<-
  workflow() %>% 
  add_model(MLogR_Engine) %>% 
  add_recipe(Logistic_Recipe)

The results of this model are provided below, with an average accuracy of 0.984 across the cross-validation resamples. The confusion matrix for the test set shows that we correctly predicted Chinstrap penguins 100% of the time; one Adelie penguin was misclassified as a Chinstrap, and one Gentoo penguin was misclassified as an Adelie.

# fit the workflow on the cross-validation resamples
LR_rs<- Penguin_Wflow %>% fit_resamples(resamples = datacv,
                                        control = control_resamples(save_pred = T))
# collect the average accuracy and ROC AUC across the cross-validation resamples
LR_rs %>% collect_metrics(summarize = T)
## # A tibble: 2 x 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy multiclass 0.984    10 0.00635 Preprocessor1_Model1
## 2 roc_auc  hand_till  1        10 0       Preprocessor1_Model1
logmodel<-Penguin_Wflow %>% 
  fit(data = Datatrain)

penguin_pred_class <- 
  predict(logmodel, datatest, type = c("class")) %>% 
  bind_cols(datatest %>% select(species)) 
penguin_pred_prob <- 
  predict(logmodel, datatest, type = c("prob")) %>% 
  bind_cols(datatest %>% select(species)) 

penguin_pred_class %>% conf_mat(truth = species, estimate = .pred_class) %>%  
  autoplot(type = "heatmap")

The accuracy on the held-out test set is shown below and measures 0.976, in line with the cross-validation accuracy, which suggests the model generalizes well with no obvious over- or under-fitting. The figure that follows provides a visual representation of the fitted single-layer neural network.

penguin_pred_class %>% 
  accuracy(truth = species, estimate = .pred_class)
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.976
# visualize the fitted nnet object extracted from the workflow
NeuralNetTools::plotnet(logmodel$fit$fit$fit)