Source Code: https://github.com/djlofland/DATA622_MachineLearning/tree/master/Homework1

Instructions

Let’s use the Penguin dataset for our assignment. To learn more about the dataset, please visit: https://allisonhorst.github.io/palmerpenguins/articles/intro.html

For this assignment, let us use species as our outcome or the dependent variable.

1 Logistic Regression with a binary outcome. (40)

  1. The penguin dataset has species column. Please check how many categories you have in the species column. Conduct whatever data manipulation you need to do to be able to build a logistic regression with binary outcome. Please explain your reasoning behind your decision as you manipulate the outcome/dependent variable (species).

  2. Please make sure you are evaluating the independent variables appropriately in deciding which ones should be in the model.

  3. Provide variable interpretations in your model.

2 For your model from #1, please provide: AUC, Accuracy, TPR, FPR, TNR, FNR (20)

3 Multinomial Logistic Regression. (40)

  1. Please fit a multinomial logistic regression where your outcome variable is species.

  2. Please be sure to evaluate the independent variables appropriately to fit your best parsimonious model.

  3. Please be sure to interpret your variables in the model.

4 Extra credit: what would be some of the fit statistics you would want to evaluate for your model in question #3? Feel free to share whatever you can provide. (10)

1. Binary Logistic Regression

EDA

Load Data & Summary

# Load dataframe
df <- penguins %>% 
  as_tibble()

# show summary info
skim(df)
Data summary
Name df
Number of rows 344
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1.00 FALSE 3 Ade: 152, Gen: 124, Chi: 68
island 0 1.00 FALSE 3 Bis: 168, Dre: 124, Tor: 52
sex 11 0.97 FALSE 2 mal: 168, fem: 165

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁
bill_depth_mm 2 0.99 17.15 1.97 13.1 15.60 17.30 18.7 21.5 ▅▅▇▇▂
flipper_length_mm 2 0.99 200.92 14.06 172.0 190.00 197.00 213.0 231.0 ▂▇▃▅▂
body_mass_g 2 0.99 4201.75 801.95 2700.0 3550.00 4050.00 4750.0 6300.0 ▃▇▆▃▂
year 0 1.00 2008.03 0.82 2007.0 2007.00 2008.00 2009.0 2009.0 ▇▁▇▁▇

Plan of Attack

Our dataset has the following independent features:

  • Geography - island (Biscoe, Dream or Torgersen)
  • Bill Dimensions - bill_length_mm and bill_depth_mm
  • Flipper Dimensions - flipper_length_mm
  • Overall Weight - body_mass_g
  • Gender - sex
  • Year - year … I assume this is the year the penguin was measured

Our Dependent variable is species, a categorical variable with 3 distinct values: Adelie, Chinstrap, and Gentoo.

Recode species to a Binary outcome

Our dataset contains data on 3 different species: Adelie (152 cases), Chinstrap (68 cases) and Gentoo (124 cases). Since we have 3 outcomes, in order to build a binary (2-class) classifier, we would have to either drop a category altogether or combine two groups into a single outcome. The problem with dropping a category is that we cannot be assured that new cases presented for prediction would also exclude the third group, and the presence of a 3rd group in new data could lead to very poor performance. Since we are being asked to build a binary classifier, I will restate the problem as “Build a classifier to identify whether a penguin is Gentoo or not Gentoo.” I’ll treat Gentoo as the positive case since identifying one species against the rest seems the most useful framing, though if we were doing specific research on a given species, we might choose that species as the positive case.

Note - this first problem is contrived and poorly designed. It would make far more sense to build a multi-group classifier for species (as in Part 3) - binary logistic regression makes little sense in this scenario. If I were writing this problem, I would have asked for a binary logistic regression classifier on the sex feature, since that only has 2 values. That said, despite being a bad problem, I’ll do as asked.

With this in mind, I’ll create a new boolean column, Gentoo, where 1 means Gentoo and 0 means not Gentoo. I’ll use this new feature as my target and ignore the existing species column when setting up my logistic regression. I don’t actually know the similarities or differences between the species, so through my arbitrary choice to make Gentoo the positive outcome, I may have complicated the model by lumping the other two species together and overlapping their variance within features. My classifier may in fact do a poor job because of this arbitrary choice.

Note that for logistic regression, we want to ensure class balance so we don’t skew results by over-training on one class. My arbitrary choice of Gentoo vs non-Gentoo does lead to a class imbalance (124 Gentoo vs 220 non-Gentoo). In a more complete analysis, I would address this imbalance (up-sampling, down-sampling, or bootstrapping; a down-sampling sketch follows the recoding step below), but for this problem, I will assume we aren’t introducing too much bias.

# Create new binary column, Gentoo or not
df <- df %>%
  mutate(Gentoo = ifelse(species == 'Gentoo', 1, 0)) %>%
  dplyr::select(-species)
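If we did want to correct the imbalance, caret ships simple helpers for this. A minimal down-sampling sketch (not applied to the analysis that follows; it discards majority-class rows at random):

# Down-sample the majority (non-Gentoo) class to match the minority class - sketch only
set.seed(123)
balanced <- caret::downSample(x = df %>% dplyr::select(-Gentoo),
                              y = factor(df$Gentoo),
                              yname = "Gentoo")
table(balanced$Gentoo)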

Handle NAs

We know from the summary above that we have a few NAs. However, it is useful to know how those NAs cluster before making any decisions on how to handle them.

# show missing data
gg_miss_upset(df)

We have some NAs that need handling. Since 2 records are missing most of their data, I’ll drop those entirely. This leaves 9 records missing the sex feature. Since sex might be meaningful, we can either KNN-impute those values or drop the rows. Since we only have a few such rows, I am going to drop them for convenience; if there had been a larger number, we would have to consider whether to drop the feature instead. If we chose to impute, the best strategy would likely be KNN imputation based on the other features in each row (a sketch follows the code below). However, while imputation is a go-to strategy, it does carry the risk of biasing our data by reducing variance. All imputation techniques run this risk - when we fill in values with the mean, median, or an imputed estimate, we are artificially weighting our data. This can work in our favor, but can also work against us. As such, the decision to impute should be explored in more depth for the specific problem.

# remove rows with NAs
df_na <- na.omit(df)
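For reference, had we chosen imputation, a KNN approach that also handles the categorical sex column could look like the sketch below (assuming the VIM package is available; not used in the rest of this analysis):

# KNN-impute the missing `sex` values from the other row features - sketch only
library(VIM)
df_knn <- kNN(df, variable = "sex", k = 5, imp_var = FALSE)
summary(df_knn$sex)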

Dummy Code Categorical

Our island feature is categorical - for machine learning, we really need everything recoded as numeric. I will do a simple dummy encoding for island, which will generate 3 new binary columns, one for each island. Note that dummy coding can be done full-rank (dropping one reference level) or with all levels kept; I’ll keep all levels, so we will have 3 columns, not 2.

# dummy encode our categorical `island`
dummies <- dummyVars(Gentoo ~ ., drop2nd=TRUE, data = df_na)
df_imp <- data.frame(predict(dummies, newdata = df_na))

df_dum <- df_imp
df_dum$island.Biscoe <- as.factor(df_dum$island.Biscoe)
df_dum$island.Dream <- as.factor(df_dum$island.Dream)
df_dum$island.Torgersen <- as.factor(df_dum$island.Torgersen)
df_dum$sex.female <- as.factor(df_dum$sex.female)
df_dum$sex.male <- as.factor(df_dum$sex.male)
df_dum$year <- as.factor(as.character(df_dum$year))

skim(df_dum)
Data summary
Name df_dum
Number of rows 333
Number of columns 10
_______________________
Column type frequency:
factor 6
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
island.Biscoe 0 1 FALSE 2 0: 170, 1: 163
island.Dream 0 1 FALSE 2 0: 210, 1: 123
island.Torgersen 0 1 FALSE 2 0: 286, 1: 47
sex.female 0 1 FALSE 2 0: 168, 1: 165
sex.male 0 1 FALSE 2 1: 168, 0: 165
year 0 1 FALSE 3 200: 117, 200: 113, 200: 103

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 0 1 43.99 5.47 32.1 39.5 44.5 48.6 59.6 ▃▇▇▆▁
bill_depth_mm 0 1 17.16 1.97 13.1 15.6 17.3 18.7 21.5 ▅▆▇▇▂
flipper_length_mm 0 1 200.97 14.02 172.0 190.0 197.0 213.0 231.0 ▂▇▃▅▃
body_mass_g 0 1 4207.06 805.22 2700.0 3550.0 4050.0 4775.0 6300.0 ▃▇▅▃▂
# Note that dummyVars dropped our `Gentoo` column ... I'll have to add that back in a later step

Multicollinearity

Between-feature correlations are a major problem for standard linear and logistic regressions (there are other techniques for which multicollinearity is not an issue). The problem with multicollinearity is that the algorithm cannot tease out which variables are contributing to the outcome when they are correlated with each other. When we see correlated features, we need to either remove some of them or reach for a technique like PCA, which reframes the features into a new set of orthogonal components. The downside of PCA is that we lose the explainability of the features in our model. Typically we do a forward or backward step-wise approach to select the minimal set of features which provides the best model performance.

# Calculate and plot the Multicollinearity
correlation = cor(df_imp, use = 'pairwise.complete.obs')

corrplot(correlation, 'ellipse', type = 'lower', order = 'hclust',
         col=brewer.pal(n=8, name="RdYlBu"))

We see some strong correlations between the features - this multicollinearity will be an issue for any standard linear regression approach. If we want the best model possible, we might need to consider an approach like PCA to reframe our features into an independent set. Looking at the feature-target correlations (computed below), flipper length, body mass and bill length are strongly positively correlated with being Gentoo - I’m assuming this means Gentoo penguins are larger in stature. We also see that Gentoo penguins are more likely to be found on Biscoe Island and to have a smaller bill depth than other penguins.
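Beyond visual inspection, caret can flag highly correlated predictors numerically. A minimal sketch (the 0.75 cutoff is an arbitrary choice, and this is shown for reference only):

# Flag predictors whose pairwise correlation exceeds the cutoff - sketch only
numeric_cols <- df_imp[, sapply(df_imp, is.numeric)]
high_corr <- caret::findCorrelation(cor(numeric_cols), cutoff = 0.75, names = TRUE)
high_corr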

Feature-Target Correlation

For linear and logistic regressions, we need a direct relationship between changes in each feature and our target outcome - more specifically, a linear relationship (for logistic regression, linear in the log-odds). We can employ transformations such as Box-Cox to help with variables that don’t show a linear relationship; a sketch follows.
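caret’s preProcess can estimate such transformations for us. A minimal sketch applied to the numeric measurement columns (for illustration only; the transformed data is not used below):

# Estimate Box-Cox transformations for the numeric measurements - sketch only
num_feats <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")
pp <- caret::preProcess(df_na[, num_feats], method = "BoxCox")
df_bc <- predict(pp, df_na[, num_feats])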

Now let’s explore the correlations between the features and with the target, Gentoo.

df_na$Gentoo <- as.factor(df_na$Gentoo)

# Bar plot Gentoo by Island
ggplot(data = df_na) +
  geom_bar(aes(x = factor(island), fill = factor(Gentoo)), position = "fill")

# grouped boxplot
ggplot(df_na, aes(x=island, y=bill_length_mm, fill=Gentoo)) + 
    geom_boxplot()

ggplot(df_na, aes(x=island, y=bill_depth_mm, fill=Gentoo)) + 
    geom_boxplot()

ggplot(df_na, aes(x=island, y=body_mass_g, fill=Gentoo)) + 
    geom_boxplot()

# Show feature/target correlations in decreasing order
df_imp$Gentoo <- as.numeric(as.character(df_na$Gentoo))
stack(sort(cor(df_imp$Gentoo, df_imp[, 1:(ncol(df_imp) - 1)])[,], decreasing = TRUE))
##         values               ind
## 1   0.86685359 flipper_length_mm
## 2   0.82117751       body_mass_g
## 3   0.76154795     island.Biscoe
## 4   0.48825586    bill_length_mm
## 5   0.02313654              year
## 6   0.01208170          sex.male
## 7  -0.01208170        sex.female
## 8  -0.30229607  island.Torgersen
## 9  -0.57070214      island.Dream
## 10 -0.82229303     bill_depth_mm

We see that flipper_length_mm, bill_depth_mm, body_mass_g, and island.Biscoe have the strongest correlations with Gentoo; however, as we saw above, there is also multicollinearity among these features.

Note - I’ll leave all features in for now and use variable importance in the modeling step below (a step-wise approach such as stepAIC() is sketched there as well) to identify the features that matter most for model performance. This lets us drop correlated and uninformative features, leaving a model containing only those features which offer the most explanatory value.

Modeling

Training-Test Split

We were not given a separate hold-out group - so to better measure model performance against unseen samples, I’ll set aside 20% of our initial data as a hold-out Test group. While this may slightly lower our training performance (fewer samples means less information captured by the model), the trade-off is that we can identify under- and over-fitting and better measure model performance.

# --- Setup Training and Test datasets
set.seed(3456)
trainIndex <- createDataPartition(df_imp$Gentoo, p = .8, list = FALSE, times = 1)

pengTrain <- df_imp[ trainIndex,]
pengTest  <- df_imp[-trainIndex,]

Modeling

We train our logistic regression on the Training data only. I’ll then use variable importance (with a step-wise stepAIC sketch shown for comparison) to identify the most important features to include.

# Note, I'm including all features and will do feature selection in the next step
pengTrain$Gentoo <- as.factor(pengTrain$Gentoo)
pengTest$Gentoo <- as.factor(pengTest$Gentoo)

# method = "glm" with family = "binomial" gives a binomial logistic regression
model_1 <- train(Gentoo ~ island.Biscoe + island.Dream + island.Torgersen + bill_length_mm + 
                 bill_depth_mm + flipper_length_mm + body_mass_g + year + sex.female + sex.male, 
               data = pengTrain, method = "glm", family = "binomial"
)
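As noted earlier, a step-wise selection via stepAIC() is an alternate route to a parsimonious feature set. A minimal sketch against a plain glm on the same training data (not used for the results below; with perfectly separable classes, expect convergence warnings):

# Step-wise feature selection by AIC on a plain binomial GLM - sketch only
# (one redundant dummy from each set is dropped to avoid aliased coefficients)
library(MASS)
full_glm <- glm(Gentoo ~ . - sex.male - island.Torgersen, data = pengTrain, family = binomial)
step_glm <- stepAIC(full_glm, direction = "both", trace = FALSE)
summary(step_glm)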

Evaluate Model

Evaluating against our Training data:

train_pred <- as.factor(predict(model_1, newdata = pengTrain))
confusionMatrix(train_pred, pengTrain$Gentoo)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 173   0
##          1   0  94
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9863, 1)
##     No Information Rate : 0.6479     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.6479     
##          Detection Rate : 0.6479     
##    Detection Prevalence : 0.6479     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
## 

Evaluating our model with the holdout Test data:

test_pred <- as.factor(predict(model_1, newdata = pengTest))
confusionMatrix(test_pred, pengTest$Gentoo)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 41  0
##          1  0 25
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9456, 1)
##     No Information Rate : 0.6212     
##     P-Value [Acc > NIR] : 2.259e-14  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.6212     
##          Detection Rate : 0.6212     
##    Detection Prevalence : 0.6212     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
## 

Now, check variable importance:

imp <- as.data.frame(varImp(model_1$finalModel))
imp <- data.frame(overall = imp$Overall, names = rownames(imp))
imp[order(imp$overall,decreasing = T),]
##       overall             names
## 6  33.9291852 flipper_length_mm
## 5  27.5712380     bill_depth_mm
## 7  23.1642856       body_mass_g
## 1  13.0501757     island.Biscoe
## 4  10.9220967    bill_length_mm
## 2   8.4994601      island.Dream
## 3   1.7329501  island.Torgersen
## 9   0.4566677        sex.female
## 10  0.2755593          sex.male
## 8   0.1338426              year

As we see, caret has found that the three features flipper_length_mm, bill_depth_mm, and body_mass_g are the most important. Let’s try building a model with only these features and see how it performs.

# Refit using only the three most important features
model_2 <- train(Gentoo ~ bill_depth_mm + flipper_length_mm + body_mass_g, 
               data = pengTrain, method = "glm", family = "binomial"
)
train_pred <- as.factor(predict(model_2, newdata = pengTrain))
confusionMatrix(train_pred, pengTrain$Gentoo)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 173   0
##          1   0  94
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9863, 1)
##     No Information Rate : 0.6479     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.6479     
##          Detection Rate : 0.6479     
##    Detection Prevalence : 0.6479     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
## 
test_pred <- as.factor(predict(model_2, newdata = pengTest))
confusionMatrix(test_pred, pengTest$Gentoo)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 41  1
##          1  0 24
##                                           
##                Accuracy : 0.9848          
##                  95% CI : (0.9184, 0.9996)
##     No Information Rate : 0.6212          
##     P-Value [Acc > NIR] : 9.315e-13       
##                                           
##                   Kappa : 0.9676          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9600          
##          Pos Pred Value : 0.9762          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.6212          
##          Detection Rate : 0.6212          
##    Detection Prevalence : 0.6364          
##       Balanced Accuracy : 0.9800          
##                                           
##        'Positive' Class : 0               
## 

This new model performs almost as well as our original model that included all features - with the reduced model, we had one incorrect prediction in the hold-out group. Given essentially equivalent performance with far fewer features, this is the more parsimonious model relative to our first.

2. Model Performance

The confusion matrices above report Accuracy, Sensitivity (TPR) and Specificity (TNR); FPR = 1 - TNR and FNR = 1 - TPR. Note that caret treated “0” (not Gentoo) as the positive class, so the TPR/TNR labels flip if Gentoo is taken as the positive case. AUC is not part of the confusion matrix output - it requires predicted probabilities.
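A sketch of how all of the requested metrics could be assembled for the hold-out Test set, assuming the pROC package is available (the probability column index and factor-level handling may need adjusting - caret may require class levels that are valid R names, e.g. "No"/"Yes", before returning class probabilities):

# Compute AUC plus the requested rates on the hold-out Test set - sketch only
library(pROC)

test_prob <- predict(model_2, newdata = pengTest, type = "prob")[, 2]   # assumed P(Gentoo = 1) column
roc_obj   <- roc(response = pengTest$Gentoo, predictor = test_prob)

cm  <- confusionMatrix(as.factor(predict(model_2, newdata = pengTest)), pengTest$Gentoo, positive = "1")
tpr <- unname(cm$byClass["Sensitivity"])   # True Positive Rate
tnr <- unname(cm$byClass["Specificity"])   # True Negative Rate

c(AUC      = as.numeric(auc(roc_obj)),
  Accuracy = unname(cm$overall["Accuracy"]),
  TPR      = tpr,
  FPR      = 1 - tnr,
  TNR      = tnr,
  FNR      = 1 - tpr)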

3. Multinomial Regression

# --- reload our data - I modified the df dataframe above
df <- penguins %>% 
  as_tibble()

df$species <- as.factor(df$species)

# --- remove rows with NAs
df_na <- na.omit(df)

# --- dummy encode our categorical `island`
dummies <- dummyVars(species ~ ., drop2nd=TRUE, data = df_na)
df_imp <- data.frame(predict(dummies, newdata = df_na))
df_imp$species <- df_na$species

# --- Setup Training and Test datasets
set.seed(3456)
trainIndex <- createDataPartition(df_imp$species, p = .8, list = FALSE, times = 1)

pengTrain <- df_imp[ trainIndex,]
pengTest  <- df_imp[-trainIndex,]

# --- Build a multinomial model
model_3 <- multinom(species ~ ., data=pengTrain)
## # weights:  36 (22 variable)
## initial  value 294.428093 
## iter  10 value 26.931157
## iter  20 value 1.336507
## iter  30 value 0.007517
## final  value 0.000070 
## converged
summary(model_3)
## Call:
## multinom(formula = species ~ ., data = pengTrain)
## 
## Coefficients:
##           (Intercept) island.Biscoe island.Dream island.Torgersen
## Chinstrap  -0.1135876     -60.23573     64.69650        -4.574355
## Gentoo      0.3619683      12.15698    -21.33891         9.543893
##           bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex.female
## Chinstrap       23.71177     -39.50452        -0.5979877 -0.02990597   14.61851
## Gentoo          12.92696     -22.75387         4.8277078  0.02244845   19.49351
##            sex.male        year
## Chinstrap -14.73210 -0.08115464
## Gentoo    -19.13154 -0.62974780
## 
## Std. Errors:
##           (Intercept) island.Biscoe island.Dream island.Torgersen
## Chinstrap 0.003627723           NaN  0.003627722              NaN
## Gentoo    0.016114795  2.851901e-20  0.016114795     7.382359e-22
##           bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## Chinstrap     11.2964304     4.1970020         81.446429    13.44172
## Gentoo         0.2748947     0.1270828          3.759791    16.76411
##             sex.female   sex.male     year
## Chinstrap 4.954216e-01 0.49204677 14.37416
## Gentoo    2.851907e-20 0.01611479 32.54634
## 
## Residual Deviance: 0.0001409674 
## AIC: 36.00014
# --- Evaluate our model
train_pred <- as.factor(predict(model_3, newdata = pengTrain))
confusionMatrix(train_pred, pengTrain$species)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie       117         0      0
##   Chinstrap      0        55      0
##   Gentoo         0         0     96
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9863, 1)
##     No Information Rate : 0.4366     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 1.0000           1.0000        1.0000
## Specificity                 1.0000           1.0000        1.0000
## Pos Pred Value              1.0000           1.0000        1.0000
## Neg Pred Value              1.0000           1.0000        1.0000
## Prevalence                  0.4366           0.2052        0.3582
## Detection Rate              0.4366           0.2052        0.3582
## Detection Prevalence        0.4366           0.2052        0.3582
## Balanced Accuracy           1.0000           1.0000        1.0000
test_pred <- as.factor(predict(model_3, newdata = pengTest))
confusionMatrix(test_pred, pengTest$species)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        29         0      0
##   Chinstrap      0        13      0
##   Gentoo         0         0     23
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9448, 1)
##     No Information Rate : 0.4462     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 1.0000              1.0        1.0000
## Specificity                 1.0000              1.0        1.0000
## Pos Pred Value              1.0000              1.0        1.0000
## Neg Pred Value              1.0000              1.0        1.0000
## Prevalence                  0.4462              0.2        0.3538
## Detection Rate              0.4462              0.2        0.3538
## Detection Prevalence        0.4462              0.2        0.3538
## Balanced Accuracy           1.0000              1.0        1.0000
tidy(model_3)
## # A tibble: 22 x 6
##    y.level   term              estimate std.error   statistic     p.value
##    <chr>     <chr>                <dbl>     <dbl>       <dbl>       <dbl>
##  1 Chinstrap (Intercept)        -0.114    0.00363   -31.3       3.31e-215
##  2 Chinstrap island.Biscoe     -60.2    NaN         NaN       NaN        
##  3 Chinstrap island.Dream       64.7      0.00363 17834.        0.       
##  4 Chinstrap island.Torgersen   -4.57   NaN         NaN       NaN        
##  5 Chinstrap bill_length_mm     23.7     11.3         2.10      3.58e-  2
##  6 Chinstrap bill_depth_mm     -39.5      4.20       -9.41      4.84e- 21
##  7 Chinstrap flipper_length_mm  -0.598   81.4        -0.00734   9.94e-  1
##  8 Chinstrap body_mass_g        -0.0299  13.4        -0.00222   9.98e-  1
##  9 Chinstrap sex.female         14.6      0.495      29.5       2.33e-191
## 10 Chinstrap sex.male          -14.7      0.492     -29.9       5.86e-197
## # … with 12 more rows
glance(model_3)
## # A tibble: 1 x 4
##     edf deviance   AIC  nobs
##   <dbl>    <dbl> <dbl> <int>
## 1    18 0.000141  36.0   268
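To help with interpretation, the multinomial coefficients (log relative-risk ratios against the Adelie baseline) can be exponentiated onto a relative-risk scale. A brief sketch (note that with perfectly separated classes, as here, the raw coefficients are inflated, so the magnitudes should be read with caution):

# Relative risk ratios vs. the Adelie baseline - sketch only
rrr <- exp(coef(model_3))
round(rrr, 3)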

The multinomial regression, with no extra work, was able to correctly classify all three species in both the training and hold-out test data sets. One has to love toy datasets that behave nothing like real-world data.

4. Bonus Problem