library(tidyverse)
## -- Attaching packages ----------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.4
## v tibble  3.0.1     v dplyr   0.8.5
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts -------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tidymodels)
## -- Attaching packages ---------------------------------------------------------- tidymodels 0.1.0 --
## v broom     0.5.6      v rsample   0.0.6 
## v dials     0.0.6      v tune      0.1.0 
## v infer     0.5.1      v workflows 0.1.1 
## v parsnip   0.1.4      v yardstick 0.0.6 
## v recipes   0.1.12
## -- Conflicts ------------------------------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x dials::margin()   masks ggplot2::margin()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
library(skimr) #skim
library(vip) #variable importance
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi
library(palmerpenguins)
data <- palmerpenguins::penguins_raw

First we load the Palmer Penguins data. Next we use skim() to view the number of rows and columns, each column's data type, the percentage of complete rows per column, the number of unique values, and the distribution of each column.

skim(data)
Data summary

|                        |      |
|:-----------------------|:-----|
| Name                   | data |
| Number of rows         | 344  |
| Number of columns      | 17   |
| Column type frequency: |      |
| character              | 9    |
| Date                   | 1    |
| numeric                | 7    |
| Group variables        | None |

Variable type: character

| skim_variable     | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|:------------------|----------:|--------------:|----:|----:|------:|---------:|-----------:|
| studyName         |         0 |          1.00 |   7 |   7 |     0 |        3 |          0 |
| Species           |         0 |          1.00 |  33 |  41 |     0 |        3 |          0 |
| Region            |         0 |          1.00 |   6 |   6 |     0 |        1 |          0 |
| Island            |         0 |          1.00 |   5 |   9 |     0 |        3 |          0 |
| Stage             |         0 |          1.00 |  18 |  18 |     0 |        1 |          0 |
| Individual ID     |         0 |          1.00 |   4 |   6 |     0 |      190 |          0 |
| Clutch Completion |         0 |          1.00 |   2 |   3 |     0 |        2 |          0 |
| Sex               |        11 |          0.97 |   4 |   6 |     0 |        2 |          0 |
| Comments          |       290 |          0.16 |  18 |  68 |     0 |       10 |          0 |

Variable type: Date

| skim_variable | n_missing | complete_rate | min        | max        | median     | n_unique |
|:--------------|----------:|--------------:|:-----------|:-----------|:-----------|---------:|
| Date Egg      |         0 |             1 | 2007-11-09 | 2009-12-01 | 2008-11-09 |       50 |

Variable type: numeric

| skim_variable       | n_missing | complete_rate |    mean |     sd |      p0 |      p25 |      p50 |      p75 |    p100 | hist  |
|:--------------------|----------:|--------------:|--------:|-------:|--------:|---------:|---------:|---------:|--------:|:------|
| Sample Number       |         0 |          1.00 |   63.15 |  40.43 |    1.00 |    29.00 |    58.00 |    95.25 |  152.00 | ▇▇▆▅▃ |
| Culmen Length (mm)  |         2 |          0.99 |   43.92 |   5.46 |   32.10 |    39.23 |    44.45 |    48.50 |   59.60 | ▃▇▇▆▁ |
| Culmen Depth (mm)   |         2 |          0.99 |   17.15 |   1.97 |   13.10 |    15.60 |    17.30 |    18.70 |   21.50 | ▅▅▇▇▂ |
| Flipper Length (mm) |         2 |          0.99 |  200.92 |  14.06 |  172.00 |   190.00 |   197.00 |   213.00 |  231.00 | ▂▇▃▅▂ |
| Body Mass (g)       |         2 |          0.99 | 4201.75 | 801.95 | 2700.00 |  3550.00 |  4050.00 |  4750.00 | 6300.00 | ▃▇▆▃▂ |
| Delta 15 N (o/oo)   |        14 |          0.96 |    8.73 |   0.55 |    7.63 |     8.30 |     8.65 |     9.17 |   10.03 | ▃▇▆▅▂ |
| Delta 13 C (o/oo)   |        13 |          0.96 |  -25.69 |   0.79 |  -27.02 |   -26.32 |   -25.83 |   -25.06 |  -23.79 | ▆▇▅▅▂ |

Reviewing the data: Sample Number is an index column and Individual ID is a unique identifier, so neither adds value to our model. Region and Stage have no variance, so those will be excluded. Comments is free text, so we exclude that too. Date Egg (the date the sample was collected) and studyName are procedural, so we remove those as well. The remaining columns will be the independent variables in our model.
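
As a quick check that Region and Stage really are constant, counting their distinct values should return 1 for each (matching the n_unique column in the skim table above):

data %>% summarise_at(vars(Region, Stage), n_distinct) # expect 1 and 1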

Let’s take a look at the distribution of our dependent variable.

data %>% 
  count(Species)

For our binomial prediction model, we’ll group the Chinstrap and Gentoo species into an “Other” class so the Adelie and Other classes are roughly equal in size, which should help prediction. We will call this dataset data1.

names(data) <- str_remove(names(data), " ") # drop the first space in each name (str_remove only removes the first match); names with spaces kept throwing an unexpected symbol error in the workflow
data <- mutate_if(data, is.character, as.factor) # convert character columns to factors

data1 <- data
data1$Species <- data1$Species %>% str_replace("Gentoo.*|Chinstrap.*", "Other") # collapse both species into "Other" (result is character, hence the re-factoring below)
data1 <- mutate_if(data1, is.character, as.factor)
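
As a sanity check that the collapse worked and the two classes are now roughly balanced, we can re-count:

data1 %>% count(Species)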

set.seed(123)

split <- initial_split(data, strata = Species)
split
## <Training/Validation/Total>
## <258/86/344>
train <- training(split)
test <- testing(split)

split1 <- initial_split(data1, strata = Species)
split1
## <Training/Validation/Total>
## <258/86/344>
train1 <- training(split1)
test1 <- testing(split1)

initial_split() uses a 0.75/0.25 training/test split by default, which is what we see above (258 training rows, 86 test rows).
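
For reference, the same split could be requested explicitly via the prop argument:

initial_split(data1, prop = 3/4, strata = Species) # equivalent to the default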

For our binomial model we’ll fit a logistic regression with the glm engine, and for the multinomial model we’ll use the glmnet engine. Both modes will be classification.

For our recipe, we’ll make Species the dependent variable and use the remaining variables as independent variables. First we remove the columns mentioned earlier, either through the step_rm function or the step_zv (zero variance) function. Next we impute the missing data with k-nearest neighbors. Then we break the factor variables into dummy variables. Lastly, we center and scale our data with the step_normalize function.

model <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")

model2 <- multinom_reg() %>% 
  set_engine("glmnet") %>% 
  set_mode("classification")

tux_duck <- recipe(Species~., data=train1) %>% 
  step_rm("Comments", "SampleNumber", "IndividualID", "DateEgg", "studyName") %>%  
  step_zv(all_predictors()) %>% 
  step_knnimpute(all_predictors()) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_normalize(all_numeric()) 

tux_duck2 <- recipe(Species~., data=train) %>% 
  step_rm("Comments", "SampleNumber", "IndividualID", "DateEgg", "studyName") %>%  
  step_zv(all_predictors()) %>% 
  step_knnimpute(all_predictors()) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_normalize(all_numeric()) 

flow <- workflow() %>% 
  add_model(model) %>% 
  add_recipe(tux_duck)

flow2 <- workflow() %>% 
  add_model(model2) %>% 
  add_recipe(tux_duck2) 

penguin_fit <- flow %>% fit(data = train1)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
penguin_fit2 <- flow2 %>% fit(data = train)
penguin_fit
## == Workflow [trained] ==============================================================================
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## -- Preprocessor ------------------------------------------------------------------------------------
## 5 Recipe Steps
## 
## * step_rm()
## * step_zv()
## * step_knnimpute()
## * step_dummy()
## * step_normalize()
## 
## -- Model -------------------------------------------------------------------------------------------
## 
## Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
##          (Intercept)   `CulmenLength (mm)`    `CulmenDepth (mm)`  
##               23.933                54.349               -42.485  
## `FlipperLength (mm)`        `BodyMass (g)`    `Delta15 N (o/oo)`  
##                8.658                19.840                -3.672  
##   `Delta13 C (o/oo)`          Island_Dream      Island_Torgersen  
##               24.428                -6.077               -12.573  
## ClutchCompletion_Yes              Sex_MALE  
##               -6.286               -19.595  
## 
## Degrees of Freedom: 257 Total (i.e. Null);  247 Residual
## Null Deviance:       354.2 
## Residual Deviance: 1.777e-08     AIC: 22

The coefficients of our binomial model are listed above. The larger a coefficient’s absolute value, the more impact that variable has on the outcome. A positive coefficient means that for each one-unit increase in the (normalized) predictor, the outcome is more likely to be Adelie, and vice versa. Culmen length and depth are the most impactful. Note, though, that glm.fit warned the algorithm did not converge (a symptom of perfectly separated classes), so the coefficient magnitudes themselves should be read with caution.
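
If you’d rather see the coefficients (with standard errors) as a tibble than as raw print output, broom’s tidy() should work on the pulled fit:

penguin_fit %>% pull_workflow_fit() %>% tidy() # one row per coefficient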

penguin_fit2
## == Workflow [trained] ==============================================================================
## Preprocessor: Recipe
## Model: multinom_reg()
## 
## -- Preprocessor ------------------------------------------------------------------------------------
## 5 Recipe Steps
## 
## * step_rm()
## * step_zv()
## * step_knnimpute()
## * step_dummy()
## * step_normalize()
## 
## -- Model -------------------------------------------------------------------------------------------
## 
## Call:  glmnet::glmnet(x = maybe_matrix(x), y = y, family = "multinomial") 
## 
##    Df    %Dev  Lambda
## 1   0 0.00000 0.42050
## 2   2 0.08095 0.38310
## 3   3 0.16220 0.34910
## 4   3 0.23750 0.31810
## 5   3 0.30290 0.28980
## 6   3 0.36050 0.26410
## 7   4 0.41700 0.24060
## 8   4 0.47400 0.21920
## 9   5 0.52400 0.19980
## 10  5 0.56900 0.18200
## 11  5 0.60830 0.16580
## 12  6 0.64300 0.15110
## 13  7 0.67430 0.13770
## 14  7 0.70220 0.12550
## 15  7 0.72710 0.11430
## 16  7 0.74940 0.10420
## 17  7 0.76950 0.09490
## 18  7 0.78770 0.08647
## 19  7 0.80410 0.07879
## 20  7 0.81900 0.07179
## 21  7 0.83250 0.06541
## 22  7 0.84540 0.05960
## 23  7 0.85780 0.05431
## 24  7 0.86900 0.04948
## 25  7 0.87930 0.04509
## 26  7 0.88860 0.04108
## 27  7 0.89710 0.03743
## 28  7 0.90490 0.03411
## 29  7 0.91210 0.03108
## 30  7 0.91860 0.02832
## 31  7 0.92460 0.02580
## 32  7 0.93010 0.02351
## 33  8 0.93550 0.02142
## 34  8 0.94050 0.01952
## 35  8 0.94510 0.01778
## 36  8 0.94930 0.01620
## 37  9 0.95310 0.01476
## 38  9 0.95670 0.01345
## 39  9 0.96000 0.01226
## 40  9 0.96300 0.01117
## 41  9 0.96580 0.01018
## 42  9 0.96840 0.00927
## 43  9 0.97070 0.00845
## 44  9 0.97290 0.00770
## 45  9 0.97500 0.00701
## 46  9 0.97680 0.00639
## 
## ...
## and 38 more lines.

In the multinomial model output above, the penalty is printed as Lambda, and the mixture corresponds to glmnet’s alpha.
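
To inspect the multinomial coefficients at a single point on that path, you can call coef() on the underlying glmnet object; the lambda value here is purely illustrative:

penguin_fit2 %>% 
  pull_workflow_fit() %>% 
  purrr::pluck("fit") %>%   # the raw glmnet object
  coef(s = 0.01)            # one sparse coefficient matrix per class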

In our variable importance graph we see that the importance matches the absolute values of the coefficients.

penguin_fit %>% pull_workflow_fit() %>%  vip()
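
The scores behind this plot are also available as a tibble via vi() (the vip function we saw masking utils::vi at load time):

penguin_fit %>% pull_workflow_fit() %>% vi()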

In the multinomial model, our variable importance graph looks more like a principal component analysis: it shows the variables that best help separate the three species outcomes.

penguin_fit2 %>% pull_workflow_fit() %>%  vip()

Now we run our predictions. For both our binomial and multinomial models we get an accuracy of 100%, with an AUC of 1 and zero false positives or false negatives. This looked suspicious at first, but the earlier glm.fit warning (“fitted probabilities numerically 0 or 1 occurred”) suggests why: the species are almost perfectly separable on these measurements, so a perfect test-set score is plausible. Feedback is still welcome! The raw probabilities are listed below.

predict(penguin_fit, test1, type="prob") %>% 
  bind_cols(test1) %>% 
  roc_curve(Species,".pred_Adelie Penguin (Pygoscelis adeliae)") %>% 
  autoplot()
## Warning: Novel levels found in column 'Sex': NA. The levels have been
## removed, and values have been coerced to 'NA'.
## Warning: Novel levels found in column 'Comments': NA. The levels have been
## removed, and values have been coerced to 'NA'.

results1 <- predict(penguin_fit, test1, type = "prob") %>% 
  bind_cols(test1) %>% 
  select(1, 5) # column 1 = Adelie probability, column 5 = Species
## Warning: Novel levels found in column 'Sex': NA. The levels have been
## removed, and values have been coerced to 'NA'.
## Warning: Novel levels found in column 'Comments': NA. The levels have been
## removed, and values have been coerced to 'NA'.
names(results1)<-c("Adelie", "Actual")
results1
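
To make the zero false positives/negatives claim explicit, a confusion matrix on the hard class predictions (predict() defaults to class predictions; the same novel-level warnings will appear):

predict(penguin_fit, test1) %>% 
  bind_cols(test1) %>% 
  conf_mat(truth = Species, estimate = .pred_class)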
predict(penguin_fit2, test, type="prob", penalty = 0) %>% 
  bind_cols(test) %>% 
  roc_curve(Species,".pred_Adelie Penguin (Pygoscelis adeliae)":".pred_Gentoo penguin (Pygoscelis papua)") %>% 
  autoplot()
## Warning: Novel levels found in column 'Sex': NA. The levels have been
## removed, and values have been coerced to 'NA'.
## Warning: Novel levels found in column 'Comments': NA. The levels have been
## removed, and values have been coerced to 'NA'.

results <- predict(penguin_fit2, test, type = "prob", penalty = 0) %>% 
  bind_cols(test) %>% 
  select(1:3, 6) # columns 1-3 = class probabilities, column 6 = Species
## Warning: Novel levels found in column 'Sex': NA. The levels have been
## removed, and values have been coerced to 'NA'.
## Warning: Novel levels found in column 'Comments': NA. The levels have been
## removed, and values have been coerced to 'NA'.
names(results)<-c("Adelie", "Chinstrap", "Gentoo", "Actual")
results
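
And a single-number check on the multinomial model: yardstick’s roc_auc() computes a multiclass AUC (Hand-Till by default) from the same probability columns used for the ROC curves:

predict(penguin_fit2, test, type = "prob", penalty = 0) %>% 
  bind_cols(test) %>% 
  roc_auc(Species, ".pred_Adelie Penguin (Pygoscelis adeliae)":".pred_Gentoo penguin (Pygoscelis papua)")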