Source Code: https://github.com/djlofland/DATA622_MachineLearning/tree/master/Homework1
Let’s use the Penguin dataset for our assignment. To learn more about the dataset, please visit: https://allisonhorst.github.io/palmerpenguins/articles/intro.html
For this assignment, let us use species as our outcome or the dependent variable.
1 Logistic Regression with a binary outcome. (40)
The penguin dataset has species column. Please check how many categories you have in the species column. Conduct whatever data manipulation you need to do to be able to build a logistic regression with binary outcome. Please explain your reasoning behind your decision as you manipulate the outcome/dependent variable (species).
Please make sure you are evaluating the independent variables appropriately in deciding which ones should be in the model.
Provide variable interpretations in your model.
2 For your model from #1, please provide: AUC, Accuracy, TPR, FPR, TNR, FNR (20)
3 Multinomial Logistic Regression. (40)
Please fit it a multinomial logistic regression where your outcome variable is species.
Please be sure to evaluate the independent variables appropriately to fit your best parsimonious model.
Please be sure to interpret your variables in the model.
4 Extra credit: what would be some of the fit statistics you would want to evaluate for your model in question #3? Feel free to share whatever you can provide. (10)
# Load dataframe
df <- penguins %>%
as_tibble()
# show summary infio
skim(df)| Name | df |
| Number of rows | 344 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
| island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
| sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
| bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
| flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
| body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
| year | 0 | 1.00 | 2008.03 | 0.82 | 2007.0 | 2007.00 | 2008.00 | 2009.0 | 2009.0 | ▇▁▇▁▇ |
Our dataset has the following independent features:
island (Biscoe, Dream or TorgerSen)bill_length_mm and bill_depth_mmflipper_length_mmbody_weight_gsexyear … I assume this is the year the penguin was measuredOur Dependent variable is species, a categorical variable with 3 distinct values: Adelie, Chinstrap, and Gentoo.
species to a Binary outcomeOur dataset contains data on 3 different species: Adelie (152 cases), Chinstrap (68 cases) and Gentoo (124 cases). Since we have 3 outcomes, in order to do a binary (2 cases) classifier, we would have to either drop a category altogether or combine two groups into a single outcome. The problem with dropping a category is that we cannot be assured that new cases for prediction would also drop the third group and the presence of a 3rd group in bew data could lead to very poor performance. Since we are being asked to build a binary classifier, I will probably restate the problem as “Build a classifier to identify whether a penguin is Gentoo or not Gentoo”. I’ll pick the dominant class as this seems more useful, though if we were doing specific research on a given species, we might choose to use that species as the positive case.
Note - this first problem is contrived and a very poorly designed. It would make far more sense to build a multigroup classifier for species (as in Part 3) - binary logistic regression actually makes little sense in this scenario. If I were writing this problem, I would have asked us to build a logistic binary regression classifier on the sex feature since that only has 2 values. That said, despite being a bad problem, I’ll do as asked.
With this in mind, I’ll create a new column, Gentoo, which is boolean, where 1 means Gentoo and 0 means not Gentoo. I’ll use this new feature as my target and ignore the existing species column when setting up my logistic regression. I actually don’t know the similarities or difference between the species, so through my arbitrary choice to make Gentoo the positive outcome, I may have greatly complicated the model by mixing the other two and overlapping variance within features. My classifier may in fact do a terrible job through this arbitrary choice.
Note that for logistic regression, we want to ensure class balance so we don’t skew results by over training on one class. My arbitrary choice of Gentoo vs non-Gentoo does lead to a class imbalance (124 Gentoo and 220 non-Gentoo). In a more complete analysis, I would address this imbalance (up-sampling, down-sampling, or bootstrapping), but for this problem, I will assume we aren’t introducing too much bias.
# Create new binary column, Adelie or not
df <- df %>%
mutate(Gentoo = ifelse(species == 'Gentoo', 1, 0)) %>%
dplyr::select(-species)We know from the summary above that we have a few NAs. However, it is useful to know how those NAs cluster before making any decisions on how to handle them.
# show missing data
gg_miss_upset(df)We have some NAs that need handling. Since 2 records are missing most of their data, I’ll drop those entirely. This leave 9 records missing the sex feature. Since sex might be meaningful we can either KNN impute or drop those rows. Since we only have a few rows, I am going to drop them for convenience. If there had been a larger number then we would have to consider whether to drop the feature. If we chose to impute, thge best strategy would be KNN imputing based on other row features; however, while imputing is a go to strategy, it does in fact carry the risk of biasing our data by reducing variance. All impute techniques run this risk - when we fill in values with mean, median, or imputed, we are artificially weighing our data. This can work in our favor, but can also work against us. AS such, the decision to impute should be explored in more depth given the specific problem.
# remove rows with NAs
df_na <- na.omit(df)Our island feature is categorical - for machine learning, we really need everything recoded as numeric. I will do a simple dummy encoding for island which will generate 3 new binary columns, one for each island. Note, dummy coding can be done with or without an intercept. I’ll skip so we will have 3 columns, not 2.
# dummy encode our categorical `island`
dummies <- dummyVars(Gentoo ~ ., drop2nd=TRUE, data = df_na)
df_imp <- data.frame(predict(dummies, newdata = df_na))
df_dum <- df_imp
df_dum$island.Biscoe <- as.factor(df_dum$island.Biscoe)
df_dum$island.Dream <- as.factor(df_dum$island.Dream)
df_dum$island.Torgersen <- as.factor(df_dum$island.Torgersen)
df_dum$sex.female <- as.factor(df_dum$sex.female)
df_dum$sex.male <- as.factor(df_dum$sex.male)
df_dum$year <- as.factor(as.character(df_dum$year))
skim(df_dum)| Name | df_dum |
| Number of rows | 333 |
| Number of columns | 10 |
| _______________________ | |
| Column type frequency: | |
| factor | 6 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| island.Biscoe | 0 | 1 | FALSE | 2 | 0: 170, 1: 163 |
| island.Dream | 0 | 1 | FALSE | 2 | 0: 210, 1: 123 |
| island.Torgersen | 0 | 1 | FALSE | 2 | 0: 286, 1: 47 |
| sex.female | 0 | 1 | FALSE | 2 | 0: 168, 1: 165 |
| sex.male | 0 | 1 | FALSE | 2 | 1: 168, 0: 165 |
| year | 0 | 1 | FALSE | 3 | 200: 117, 200: 113, 200: 103 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | 0 | 1 | 43.99 | 5.47 | 32.1 | 39.5 | 44.5 | 48.6 | 59.6 | ▃▇▇▆▁ |
| bill_depth_mm | 0 | 1 | 17.16 | 1.97 | 13.1 | 15.6 | 17.3 | 18.7 | 21.5 | ▅▆▇▇▂ |
| flipper_length_mm | 0 | 1 | 200.97 | 14.02 | 172.0 | 190.0 | 197.0 | 213.0 | 231.0 | ▂▇▃▅▃ |
| body_mass_g | 0 | 1 | 4207.06 | 805.22 | 2700.0 | 3550.0 | 4050.0 | 4775.0 | 6300.0 | ▃▇▅▃▂ |
# Note that dummyVars dropped our `Gentoo` column ... I'll have to add that back in a later stepBetween feature correlations are a huge problem for standard linear and logistic regressions. There are better techniques for which multicollinearity is not an issue. The problem with multicollinearity is that our algorithms cannot tease out which variables are contributing to the outcome when they are correlated. When we see correlated features, we need to arbitrarily remove some or reach for a between technique like PCA which reframes the features into a new set with complete independence. The problem with PCA is that we loose all explainibility of the features in our model. Typically we do a forward or reverse step-wise approach to select the minimal set of features which provide the best model performance.
# Calculate and plot the Multicollinearity
correlation = cor(df_imp, use = 'pairwise.complete.obs')
corrplot(correlation, 'ellipse', type = 'lower', order = 'hclust',
col=brewer.pal(n=8, name="RdYlBu"))We see some strong correlations between the features - this multicollinearity will be an issues with any standard linear regression approaches. If we are wanting the best model possible, we might need to consider an approach like PCA to reframe our features into a set that are independent. We also notice in this plot that bill length, body mass and flipper length are more strongly negatively correlated with being Gentoo - I’m assuming this means Gentoo penguins are larger in stature. We also see that Gentoo penguins are more likely to be found on the Biscoe Island and have a smaller bill depth that other penguins.
For linear and logistic regressions, we need to ensure there is a direct relationship between changes in each feature and our target outcome. More specifically, there should be a linear relationship. We can employ transformations, BoxCox, etc to help tease variables which don’t show a linear relationship.
Now lets explore any correlations between features and with the target, Gentoo.
df_na$Gentoo <- as.factor(df_na$Gentoo)
# Bar plot Gentoo by Island
ggplot(data = df_na) +
geom_bar(aes(x = factor(island), fill = factor(Gentoo)), position = "fill")# grouped boxplot
ggplot(df_na, aes(x=island, y=bill_length_mm, fill=Gentoo)) +
geom_boxplot()ggplot(df_na, aes(x=island, y=bill_depth_mm, fill=Gentoo)) +
geom_boxplot()ggplot(df_na, aes(x=island, y=body_mass_g, fill=Gentoo)) +
geom_boxplot()# Show feature correlations/target by decreasing correlation
df_imp$Gentoo <- as.numeric(as.character(df_na$Gentoo))
stack(sort(cor(df_imp[, 11], df_imp[, 1:ncol(df_imp)-1])[,], decreasing=TRUE))## values ind
## 1 0.86685359 flipper_length_mm
## 2 0.82117751 body_mass_g
## 3 0.76154795 island.Biscoe
## 4 0.48825586 bill_length_mm
## 5 0.02313654 year
## 6 0.01208170 sex.male
## 7 -0.01208170 sex.female
## 8 -0.30229607 island.Torgersen
## 9 -0.57070214 island.Dream
## 10 -0.82229303 bill_depth_mm
We see that bill_depth_mm, isalnd.Torgersen, flipper_length_mm, and bill_length_mm have the strongest correlations with Gentoo; however, as we saw, there are also multicollinearity within these features.
Note - I’ll leave all features for now and use stepAIC() in the modeling step below to identify the most important features that improve model performance. This will remove correlated features leaving us with a model containing only those features which offere the most explanitory value in the model.
We were not given a separate hold out group - so to better measure model performance agains unseen samples, I’ll set aside 20% of our initial data as a holdout Test group. While this may slightly lower our training performanac (few samples means less information captured by the model), the trade-off is that we can idenitfy under- and over-fitting and better measure model performance.
# --- Setup Training and Test datasets
set.seed(3456)
trainIndex <- createDataPartition(df_imp$Gentoo, p = .8, list = FALSE, times = 1)
pengTrain <- df_imp[ trainIndex,]
pengTest <- df_imp[-trainIndex,]We train our Logistic regression on the Training Data only. I’ll than do a stepAIC to identify the most important features to include.
# Note, I'm including all features and will use stepAIC in next step to forward/backward do feature selection
pengTrain$Gentoo <- as.factor(pengTrain$Gentoo)
pengTest$Gentoo <- as.factor(pengTest$Gentoo)
model_1 <- train(Gentoo ~ island.Biscoe + island.Dream + island.Torgersen + bill_length_mm +
bill_depth_mm + flipper_length_mm + body_mass_g + year + sex.female + sex.male,
data = pengTrain, family = "binomial"
)Evaluating against our Training data:
train_pred <- as.factor(predict(model_1, newdata = pengTrain))
confusionMatrix(train_pred, pengTrain$Gentoo)## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 173 0
## 1 0 94
##
## Accuracy : 1
## 95% CI : (0.9863, 1)
## No Information Rate : 0.6479
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.6479
## Detection Rate : 0.6479
## Detection Prevalence : 0.6479
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
Evaluating our model with the holdout Test data:
test_pred <- as.factor(predict(model_1, newdata = pengTest))
confusionMatrix(test_pred, pengTest$Gentoo)## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 41 0
## 1 0 25
##
## Accuracy : 1
## 95% CI : (0.9456, 1)
## No Information Rate : 0.6212
## P-Value [Acc > NIR] : 2.259e-14
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.6212
## Detection Rate : 0.6212
## Detection Prevalence : 0.6212
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
Now, check variable importance:
imp <- as.data.frame(varImp(model_1$finalModel))
imp <- data.frame(overall = imp$Overall, names = rownames(imp))
imp[order(imp$overall,decreasing = T),]## overall names
## 6 33.9291852 flipper_length_mm
## 5 27.5712380 bill_depth_mm
## 7 23.1642856 body_mass_g
## 1 13.0501757 island.Biscoe
## 4 10.9220967 bill_length_mm
## 2 8.4994601 island.Dream
## 3 1.7329501 island.Torgersen
## 9 0.4566677 sex.female
## 10 0.2755593 sex.male
## 8 0.1338426 year
As we see, caret has found that the two features flipper_length_mm, bill_depth_mm, and body_mass_g are the most important features. Let’s try building a model with only these features and see how it performs.
model_2 <- train(Gentoo ~ bill_depth_mm + flipper_length_mm + body_mass_g,
data = pengTrain, family = "binomial"
)## note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
train_pred <- as.factor(predict(model_2, newdata = pengTrain))
confusionMatrix(train_pred, pengTrain$Gentoo)## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 173 0
## 1 0 94
##
## Accuracy : 1
## 95% CI : (0.9863, 1)
## No Information Rate : 0.6479
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.6479
## Detection Rate : 0.6479
## Detection Prevalence : 0.6479
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
test_pred <- as.factor(predict(model_2, newdata = pengTest))
confusionMatrix(test_pred, pengTest$Gentoo)## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 41 1
## 1 0 24
##
## Accuracy : 0.9848
## 95% CI : (0.9184, 0.9996)
## No Information Rate : 0.6212
## P-Value [Acc > NIR] : 9.315e-13
##
## Kappa : 0.9676
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 1.0000
## Specificity : 0.9600
## Pos Pred Value : 0.9762
## Neg Pred Value : 1.0000
## Prevalence : 0.6212
## Detection Rate : 0.6212
## Detection Prevalence : 0.6364
## Balanced Accuracy : 0.9800
##
## 'Positive' Class : 0
##
This new model performs almost as well as our original that included all features. With the new model, we had one incorrect prediction in the holdout group. This new model is an improvement and is more parsimonious relative to our first model.
See above
# --- reload our data - I modified the df dataframe above
df <- penguins %>%
as_tibble()
df$species <- as.factor(df$species)
# --- remove rows with NAs
df_na <- na.omit(df)
# --- dummy encode our categorical `island`
dummies <- dummyVars(species ~ ., drop2nd=TRUE, data = df_na)
df_imp <- data.frame(predict(dummies, newdata = df_na))
df_imp$species <- df_na$species
# --- Setup Training and Test datasets
set.seed(3456)
trainIndex <- createDataPartition(df_imp$species, p = .8, list = FALSE, times = 1)
pengTrain <- df_imp[ trainIndex,]
pengTest <- df_imp[-trainIndex,]
# --- Build a multinomial model
model_3 <- multinom(species ~ ., data=pengTrain)## # weights: 36 (22 variable)
## initial value 294.428093
## iter 10 value 26.931157
## iter 20 value 1.336507
## iter 30 value 0.007517
## final value 0.000070
## converged
summary(model_3)## Call:
## multinom(formula = species ~ ., data = pengTrain)
##
## Coefficients:
## (Intercept) island.Biscoe island.Dream island.Torgersen
## Chinstrap -0.1135876 -60.23573 64.69650 -4.574355
## Gentoo 0.3619683 12.15698 -21.33891 9.543893
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex.female
## Chinstrap 23.71177 -39.50452 -0.5979877 -0.02990597 14.61851
## Gentoo 12.92696 -22.75387 4.8277078 0.02244845 19.49351
## sex.male year
## Chinstrap -14.73210 -0.08115464
## Gentoo -19.13154 -0.62974780
##
## Std. Errors:
## (Intercept) island.Biscoe island.Dream island.Torgersen
## Chinstrap 0.003627723 NaN 0.003627722 NaN
## Gentoo 0.016114795 2.851901e-20 0.016114795 7.382359e-22
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## Chinstrap 11.2964304 4.1970020 81.446429 13.44172
## Gentoo 0.2748947 0.1270828 3.759791 16.76411
## sex.female sex.male year
## Chinstrap 4.954216e-01 0.49204677 14.37416
## Gentoo 2.851907e-20 0.01611479 32.54634
##
## Residual Deviance: 0.0001409674
## AIC: 36.00014
# --- Evaluate our model
train_pred <- as.factor(predict(model_3, newdata = pengTrain))
confusionMatrix(train_pred, pengTrain$species)## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 117 0 0
## Chinstrap 0 55 0
## Gentoo 0 0 96
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9863, 1)
## No Information Rate : 0.4366
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000
## Prevalence 0.4366 0.2052 0.3582
## Detection Rate 0.4366 0.2052 0.3582
## Detection Prevalence 0.4366 0.2052 0.3582
## Balanced Accuracy 1.0000 1.0000 1.0000
test_pred <- as.factor(predict(model_3, newdata = pengTest))
confusionMatrix(test_pred, pengTest$species)## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 29 0 0
## Chinstrap 0 13 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9448, 1)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 1.0000 1.0 1.0000
## Specificity 1.0000 1.0 1.0000
## Pos Pred Value 1.0000 1.0 1.0000
## Neg Pred Value 1.0000 1.0 1.0000
## Prevalence 0.4462 0.2 0.3538
## Detection Rate 0.4462 0.2 0.3538
## Detection Prevalence 0.4462 0.2 0.3538
## Balanced Accuracy 1.0000 1.0 1.0000
tidy(model_3)## # A tibble: 22 x 6
## y.level term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Chinstrap (Intercept) -0.114 0.00363 -31.3 3.31e-215
## 2 Chinstrap island.Biscoe -60.2 NaN NaN NaN
## 3 Chinstrap island.Dream 64.7 0.00363 17834. 0.
## 4 Chinstrap island.Torgersen -4.57 NaN NaN NaN
## 5 Chinstrap bill_length_mm 23.7 11.3 2.10 3.58e- 2
## 6 Chinstrap bill_depth_mm -39.5 4.20 -9.41 4.84e- 21
## 7 Chinstrap flipper_length_mm -0.598 81.4 -0.00734 9.94e- 1
## 8 Chinstrap body_mass_g -0.0299 13.4 -0.00222 9.98e- 1
## 9 Chinstrap sex.female 14.6 0.495 29.5 2.33e-191
## 10 Chinstrap sex.male -14.7 0.492 -29.9 5.86e-197
## # … with 12 more rows
glance(model_3)## # A tibble: 1 x 4
## edf deviance AIC nobs
## <dbl> <dbl> <dbl> <int>
## 1 18 0.000141 36.0 268
The multinomial regression with no extra work was able to correctly classify all three species in both the training and hold-out test data set. One has to love toy datasets that behave nothing like real world data.