1. Logistic Regression with a binary outcome.

Data Exploration

The penguins dataset has 344 observations, with one response (dependent) variable, species, and seven independent variables: island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, and year.
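The code below assumes the usual setup has already been run; a minimal sketch, assuming the palmerpenguins and tidyverse packages, where penguins_data is this document's alias for the built-in penguins data frame:

# Hypothetical setup: load packages and alias the raw data
library(tidyverse)       # %>%, mutate(), select_if(), etc.
library(palmerpenguins)  # the penguins dataset (344 rows)
library(pROC)            # auc(), roc() used in the Metrics section
library(caret)           # confusionMatrix() used in the Metrics section
penguins_data <- penguins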

Missing values

We will remove the records that have an NA in any column.

#Remove NAs
penguins_data <- penguins_data %>% na.omit() 
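A quick check of how many complete cases remain; the expected count follows from the species totals shown below (146 + 68 + 119 = 333):

nrow(penguins_data)  # 333 complete cases; 11 of the 344 rows contained an NA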

Creating a Binary Variable isAdelie

#Categories in species column
penguins_data %>%
  group_by(species) %>%
  count() #there are THREE categories
## # A tibble: 3 x 2
## # Groups:   species [3]
##   species       n
##   <fct>     <int>
## 1 Adelie      146
## 2 Chinstrap    68
## 3 Gentoo      119

After exploring the data we see that there are 3 categories in the species column (Adelie, Chinstrap, and Gentoo). Adelie is the most common category. In order to convert our outcome to a binary one, we can introduce two categories for the species column: “Adelie” (1) and “Other” (0). We will add an isAdelie column for that.

# create isAdelie classification response variable
penguins_data <- penguins_data %>%
  mutate(isAdelie = ifelse(species == 'Adelie','Adelie','Other'))

penguins_data$isAdelie <- factor(penguins_data$isAdelie, levels = c("Other", "Adelie"))
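A quick sanity check that the recode worked as intended; the expected counts follow from the species table above (187 = 68 + 119):

# Expect Other = 187 (68 Chinstrap + 119 Gentoo) and Adelie = 146
table(penguins_data$isAdelie)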

Evaluating Independent Variables

Let's look at the correlation between the numeric variables. The variables body_mass_g and flipper_length_mm are strongly positively correlated (r ≈ 0.87), so we may want to keep only one of them in our model.

penguins_data %>% select_if(is.numeric) %>% cor()
##                   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm         1.0000000    -0.2286256         0.6530956  0.58945111
## bill_depth_mm         -0.2286256     1.0000000        -0.5777917 -0.47201566
## flipper_length_mm      0.6530956    -0.5777917         1.0000000  0.87297890
## body_mass_g            0.5894511    -0.4720157         0.8729789  1.00000000
## year                   0.0326569    -0.0481816         0.1510679  0.02186213
##                          year
## bill_length_mm     0.03265690
## bill_depth_mm     -0.04818160
## flipper_length_mm  0.15106792
## body_mass_g        0.02186213
## year               1.00000000

For the continuous independent variables, we can get more clarity on their distributions by analyzing them with respect to the dependent variable.

par(mfrow = c(2, 2))

boxplot(bill_length_mm ~ isAdelie, ylab = "bill_length", xlab = "isAdelie", col = "light blue", data = penguins_data)
boxplot(bill_depth_mm ~ isAdelie, ylab = "bill_depth", xlab = "isAdelie", col = "light blue", data = penguins_data)
boxplot(flipper_length_mm ~ isAdelie, ylab = "flipper_length", xlab = "isAdelie", col = "light blue", data = penguins_data)
boxplot(body_mass_g ~ isAdelie, ylab = "body_mass", xlab = "isAdelie", col = "light blue", data = penguins_data)

Some observations are as follows:

  • Adelie penguins have noticeably shorter bills (bill_length_mm) than the other two species.
  • Adelie penguins tend to have deeper bills (bill_depth_mm) but shorter flippers and lower body mass than the “Other” group, which is dominated by the larger Gentoo.

For the categorical independent variables, we can analyze the frequency of each category with respect to the dependent variable.

xtabs(~isAdelie + year, data = penguins_data)
##         year
## isAdelie 2007 2008 2009
##   Other    59   63   65
##   Adelie   44   50   52
xtabs(~isAdelie + sex, data = penguins_data)
##         sex
## isAdelie female male
##   Other      92   95
##   Adelie     73   73

Some observations are as follows:

  • The proportion of Adelie penguins is roughly constant across the three years (about 43–44% each year).
  • The split is nearly identical for female and male penguins (73 Adelie in each group).

Neither year nor sex appears informative for separating Adelie from the other species, so we can remove these variables from our dataset.

penguins_data <- subset(penguins_data, select = -c(year, sex))

Train-Test Split

Let's split the data: use 70% of it to train the model and keep the remaining 30% for testing.

which_train <- sample(x = c(TRUE, FALSE), size = nrow(penguins_data), replace = TRUE, prob = c(0.7, 0.3))
train_data <- penguins_data[which_train, ]
test_data <- penguins_data[!which_train, ]
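Because the split is random, calling set.seed() beforehand makes it reproducible. Note also that sampling with prob = c(0.7, 0.3) yields only an approximate 70/30 split and does not preserve the class ratio. A stratified alternative, sketched here assuming the caret package:

# Hypothetical alternative: stratified split that preserves the isAdelie ratio
library(caret)
idx <- createDataPartition(penguins_data$isAdelie, p = 0.7, list = FALSE)
train_data <- penguins_data[idx, ]
test_data  <- penguins_data[-idx, ]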

Binary Logistic Regression

Backward stepwise regression is performed; the resulting model retains the following variables: island, bill_depth_mm, and flipper_length_mm.
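The selection step itself is not shown; a minimal sketch of how it could be run with step() from base R (note that species must be excluded, since it determines isAdelie exactly):

# Hypothetical sketch of the backward selection starting from the full model
full_model <- glm(isAdelie ~ ., family = 'binomial', data = train_data %>% select(-species))
step_model <- step(full_model, direction = 'backward', trace = FALSE)
formula(step_model)  # should recover isAdelie ~ island + bill_depth_mm + flipper_length_mm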

logmodel <- glm(isAdelie ~ island + bill_depth_mm + flipper_length_mm, 
                 family = 'binomial', 
                 data = train_data)

summary(logmodel)
## 
## Call:
## glm(formula = isAdelie ~ island + bill_depth_mm + flipper_length_mm, 
##     family = "binomial", data = train_data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.59262  -0.20631  -0.06577   0.08972   2.63842  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         37.51719    8.79948   4.264 2.01e-05 ***
## islandDream         -3.80276    1.03652  -3.669 0.000244 ***
## islandTorgersen     16.96797 1505.96799   0.011 0.991010    
## bill_depth_mm        0.75693    0.20186   3.750 0.000177 ***
## flipper_length_mm   -0.24812    0.04305  -5.763 8.25e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 341.80  on 248  degrees of freedom
## Residual deviance: 101.93  on 244  degrees of freedom
## AIC: 111.93
## 
## Number of Fisher Scoring iterations: 18

Coefficient Interpretation

The larger a coefficient's magnitude, the greater its impact on the classification. Variables with positive coefficients push a prediction toward the Adelie species, while variables with negative coefficients push it toward the other species.
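Because these coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read:

# Odds ratios: multiplicative change in the odds of Adelie per one-unit increase
exp(coef(logmodel))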

Some interesting observations:

  • bill_depth_mm has a positive coefficient: deeper bills are associated with the Adelie species.
  • flipper_length_mm has a negative coefficient: longer flippers point away from Adelie (toward the larger Gentoo in particular).
  • islandTorgersen has a large estimate with an enormous standard error (≈1506), a classic sign of quasi-complete separation: Torgersen island hosts only Adelie penguins in this dataset, so that coefficient is not reliably estimated.

All of these coefficients align with our previous exploratory analysis!

2. Metrics

Let's use our model to predict values for the test set and then evaluate the model.

# auc() comes from the pROC package; confusionMatrix() from caret
log_preds <- predict(logmodel, test_data, type = 'response')
class_prediction <- factor(ifelse(log_preds > 0.50, "Adelie", "Other"), 
                           levels = c("Other", "Adelie"))

log_auc <- auc(response = test_data$isAdelie, predictor = log_preds)
## Setting levels: control = Other, case = Adelie
## Setting direction: controls < cases
log_cm <- confusionMatrix(data = class_prediction, reference = test_data$isAdelie)
log_accuracy <- log_cm$overall['Accuracy']
log_tpr <- log_cm$byClass['Sensitivity']
log_tnr <- log_cm$byClass['Specificity']
log_fpr <- 1 - log_tnr  # false positive rate = 1 - specificity
log_fnr <- 1 - log_tpr  # false negative rate = 1 - sensitivity
  • AUC - 0.97
  • Accuracy - 0.8977
  • TPR (Sensitivity) - 0.875
  • TNR (Specificity) - 0.937
  • FPR (1 - TNR) - 0.063
  • FNR (1 - TPR) - 0.125
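
The AUC summarizes the full ROC curve; to visualize the curve itself, a sketch assuming the pROC package:

# Plot the ROC curve underlying the AUC value above
plot(roc(response = test_data$isAdelie, predictor = log_preds))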

3. Multinomial Logistic Regression

First, we will define “Adelie” as the reference level (or “baseline species”) for the dataset. This means the trained model will report feature coefficients for each of the remaining two species relative to Adelie, as formalized below. We will start with a baseline model that includes all features.
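Concretely, for each non-reference species $k \in \{\text{Chinstrap}, \text{Gentoo}\}$ and feature vector $x$, the multinomial model estimates

$$\log \frac{P(\text{species} = k \mid x)}{P(\text{species} = \text{Adelie} \mid x)} = \beta_{0k} + \beta_k^\top x,$$

so each reported coefficient is a change in the log-odds of that species relative to Adelie.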

require(nnet)
## Loading required package: nnet
train_data$species <- relevel(train_data$species, ref = "Adelie")
test_data$species <- relevel(test_data$species, ref = "Adelie")

multinom_model <- multinom(species ~ ., data = train_data %>% select(-isAdelie))
## # weights:  24 (14 variable)
## initial  value 273.554460 
## iter  10 value 13.698503
## iter  20 value 2.253936
## iter  30 value 0.060087
## iter  40 value 0.002295
## iter  50 value 0.000494
## iter  60 value 0.000490
## iter  70 value 0.000487
## iter  80 value 0.000255
## iter  90 value 0.000247
## iter 100 value 0.000237
## final  value 0.000237 
## stopped after 100 iterations
summary(multinom_model)
## Warning in sqrt(diag(vc)): NaNs produced
## Call:
## multinom(formula = species ~ ., data = train_data %>% select(-isAdelie))
## 
## Coefficients:
##           (Intercept) islandDream islandTorgersen bill_length_mm bill_depth_mm
## Chinstrap  -136.36389    32.93073       -14.42726       14.13510     -19.90270
## Gentoo      -10.21908   -46.22665       -39.47398        8.73007     -18.41169
##           flipper_length_mm body_mass_g
## Chinstrap        -0.2377843 -0.02749231
## Gentoo           -0.6030912  0.01881209
## 
## Std. Errors:
##           (Intercept)  islandDream islandTorgersen bill_length_mm bill_depth_mm
## Chinstrap   0.6957534 1.403937e+00             NaN      101.64960     54.922341
## Gentoo      0.1268320 8.301517e-17    8.486003e-08        8.75875      4.370859
##           flipper_length_mm body_mass_g
## Chinstrap          18.21333   0.5917839
## Gentoo             30.70866   1.3707121
## 
## Residual Deviance: 0.000474253 
## AIC: 28.00047

The low AIC is indicative of a good model fit, so we will keep all of the variables in the model. Note, however, that the near-zero residual deviance, the very large coefficient estimates, and the NaN standard errors in the summary suggest the training classes are (quasi-)completely separated, so the individual coefficients should be interpreted with caution.
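As with the binary model, the real test is held-out performance; a sketch of evaluating the multinomial model on the test set, reusing confusionMatrix() from caret:

# Hypothetical evaluation sketch: predicted classes vs. actual species on the test set
multinom_preds <- predict(multinom_model, newdata = test_data %>% select(-isAdelie))
confusionMatrix(data = multinom_preds, reference = test_data$species)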