The penguins dataset has 344 observations, with one response (dependent) variable, species, and 7 independent variables: island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, and year.
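The loading step is not shown; a minimal setup sketch, assuming the data comes from the palmerpenguins package, might look like this:
library(palmerpenguins)  # assumed source of the penguins data
library(dplyr)           # used for the pipe and data wrangling below
penguins_data <- penguins
dim(penguins_data)  # 344 rows, 8 columns (species plus the 7 predictors)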
We will remove the records that have NA values in any column.
#Remove NAs
penguins_data <- penguins_data %>% na.omit()
#Categories in species column
penguins_data %>%
group_by(species) %>%
count() #there are THREE categories
## # A tibble: 3 x 2
## # Groups: species [3]
## species n
## <fct> <int>
## 1 Adelie 146
## 2 Chinstrap 68
## 3 Gentoo 119
After exploring the data we see that there are 3 categories in the species column (Adelie, Chinstrap, and Gentoo), with Adelie the most common. To convert our outcome to a binary one, we can introduce two categories for the species column: "Adelie" (1) and "Other" (0). We will add an isAdelie column for that.
# create isAdelie classification response variable
penguins_data <- penguins_data %>%
mutate(isAdelie = ifelse(species == 'Adelie','Adelie','Other'))
penguins_data$isAdelie <- factor(penguins_data$isAdelie, levels = c("Other", "Adelie"))
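As a quick sanity check, the counts in the new column should match the species table above:
# Adelie should have 146 rows; Other should have 68 + 119 = 187
table(penguins_data$isAdelie)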
Let's look at the correlation between the numeric variables. body_mass_g and flipper_length_mm are strongly positively correlated (r ≈ 0.87), so we may keep only one of them in our model.
penguins_data %>% select_if(is.numeric) %>% cor()
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm 1.0000000 -0.2286256 0.6530956 0.58945111
## bill_depth_mm -0.2286256 1.0000000 -0.5777917 -0.47201566
## flipper_length_mm 0.6530956 -0.5777917 1.0000000 0.87297890
## body_mass_g 0.5894511 -0.4720157 0.8729789 1.00000000
## year 0.0326569 -0.0481816 0.1510679 0.02186213
## year
## bill_length_mm 0.03265690
## bill_depth_mm -0.04818160
## flipper_length_mm 0.15106792
## body_mass_g 0.02186213
## year 1.00000000
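For a visual version of the same information, a scatterplot matrix (base R's pairs()) makes the strong flipper_length_mm/body_mass_g relationship easy to spot:
penguins_data %>% select_if(is.numeric) %>% pairs()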
For continuous independent variables, we can get more clarity on their distributions by analyzing them with respect to the dependent variable.
par(mfrow = c(2,2))
boxplot(bill_length_mm~isAdelie, ylab="bill_length", xlab= "isAdelie", col="light blue",data = penguins_data)
boxplot(bill_depth_mm~isAdelie, ylab="bill_depth", xlab= "isAdelie", col="light blue",data = penguins_data)
boxplot(flipper_length_mm~isAdelie, ylab="flipper_length", xlab= "isAdelie", col="light blue",data = penguins_data)
boxplot(body_mass_g~isAdelie, ylab="body_mass", xlab= "isAdelie", col="light blue",data = penguins_data)
For categorical independent variables, we can analyze the frequency of each category with respect to the dependent variable.
xtabs(~isAdelie + year, data = penguins_data)
## year
## isAdelie 2007 2008 2009
## Other 59 63 65
## Adelie 44 50 52
xtabs(~isAdelie + sex, data = penguins_data)
## sex
## isAdelie female male
## Other 92 95
## Adelie 73 73
The Adelie/Other counts are nearly uniform across both year and sex, so these two variables carry little information for distinguishing Adelie penguins. We can remove them from our dataset.
penguins_data <- subset(penguins_data, select = -c(year, sex))
Train-Test Split
Let's split the data: use 70% of it to train the model and keep the remaining 30% for testing.
# Note: a set.seed() call here would make this random split reproducible.
which_train <- sample(x = c(TRUE, FALSE), size = nrow(penguins_data), replace = TRUE, prob = c(0.7, 0.3))
train_data <- penguins_data[which_train, ]
test_data <- penguins_data[!which_train, ]
Backward stepwise regression is performed, and the result is a model with the following variables: island, bill_depth_mm, and flipper_length_mm.
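The selection step itself is not shown; a minimal sketch of how it could be done with step(), assuming we start from a full model (species must be dropped first, since it determines isAdelie exactly), might look like this:
# Hypothetical sketch: backward stepwise selection via AIC (not shown in the original)
full_model <- glm(isAdelie ~ ., family = 'binomial',
                  data = train_data %>% select(-species))
step_model <- step(full_model, direction = 'backward', trace = FALSE)
formula(step_model)  # the predictors that survive elimination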
logmodel <- glm(isAdelie ~ island + bill_depth_mm + flipper_length_mm,
family = 'binomial',
data = train_data)
summary(logmodel)
##
## Call:
## glm(formula = isAdelie ~ island + bill_depth_mm + flipper_length_mm,
## family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.59262 -0.20631 -0.06577 0.08972 2.63842
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 37.51719 8.79948 4.264 2.01e-05 ***
## islandDream -3.80276 1.03652 -3.669 0.000244 ***
## islandTorgersen 16.96797 1505.96799 0.011 0.991010
## bill_depth_mm 0.75693 0.20186 3.750 0.000177 ***
## flipper_length_mm -0.24812 0.04305 -5.763 8.25e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 341.80 on 248 degrees of freedom
## Residual deviance: 101.93 on 244 degrees of freedom
## AIC: 111.93
##
## Number of Fisher Scoring iterations: 18
The larger a coefficient, the more impact it has on the positive classification, i.e., on the penguin being of the Adelie species. Variables with positive coefficients are indicative of Adelie, and variables with negative coefficients are indicative of not Adelie. One caveat: the islandTorgersen coefficient has a huge standard error (about 1506), a symptom of quasi-complete separation, since Torgersen island hosts only Adelie penguins in this dataset, so that estimate should not be over-interpreted. All of these coefficients align with our previous exploratory analysis!
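Since the model operates on the log-odds scale, exponentiating the coefficients yields odds ratios, which are easier to read: a value above 1 increases the odds of Adelie, a value below 1 decreases them.
exp(coef(logmodel))  # odds ratio per one-unit change in each predictor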
Let's use our model to predict values for the test set and then evaluate the model.
library(pROC)   # provides auc()
library(caret)  # provides confusionMatrix()
log_preds <- predict(logmodel, test_data, type = 'response')
class_prediction <- factor(ifelse(log_preds > 0.50, "Adelie", "Other"),
levels = c("Other", "Adelie"))
log_auc <- auc(response = test_data$isAdelie, predictor = log_preds)
## Setting levels: control = Other, case = Adelie
## Setting direction: controls < cases
log_cm <- confusionMatrix(data = class_prediction, reference = test_data$isAdelie,
                          positive = "Adelie")  # treat Adelie as the positive class
log_accuracy <- log_cm$overall['Accuracy']
log_tpr <- log_cm$byClass['Sensitivity']  # true positive rate
log_fnr <- 1 - log_tpr                    # false negative rate = 1 - TPR
log_tnr <- log_cm$byClass['Specificity']  # true negative rate
log_fpr <- 1 - log_tnr                    # false positive rate = 1 - TNR
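Gathering the metrics into a single row makes them easy to compare against other models later (a convenience step, not part of the original analysis):
data.frame(AUC = as.numeric(log_auc),
           Accuracy = as.numeric(log_accuracy),
           TPR = as.numeric(log_tpr),
           FPR = as.numeric(log_fpr))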
Next, we extend the analysis from the binary outcome to a multinomial model over all three species. First, we will define "Adelie" as the reference level (or "baseline species") for the dataset. This means that the trained model will produce coefficients for the remaining two species relative to Adelie. We will start with a baseline model that includes all features.
require(nnet)
## Loading required package: nnet
train_data$species <- relevel(train_data$species, ref = "Adelie")
test_data$species <- relevel(test_data$species, ref = "Adelie")
multinom_model <- multinom(species ~ ., data = train_data %>% select(-isAdelie))
## # weights: 24 (14 variable)
## initial value 273.554460
## iter 10 value 13.698503
## iter 20 value 2.253936
## iter 30 value 0.060087
## iter 40 value 0.002295
## iter 50 value 0.000494
## iter 60 value 0.000490
## iter 70 value 0.000487
## iter 80 value 0.000255
## iter 90 value 0.000247
## iter 100 value 0.000237
## final value 0.000237
## stopped after 100 iterations
summary(multinom_model)
## Warning in sqrt(diag(vc)): NaNs produced
## Call:
## multinom(formula = species ~ ., data = train_data %>% select(-isAdelie))
##
## Coefficients:
## (Intercept) islandDream islandTorgersen bill_length_mm bill_depth_mm
## Chinstrap -136.36389 32.93073 -14.42726 14.13510 -19.90270
## Gentoo -10.21908 -46.22665 -39.47398 8.73007 -18.41169
## flipper_length_mm body_mass_g
## Chinstrap -0.2377843 -0.02749231
## Gentoo -0.6030912 0.01881209
##
## Std. Errors:
## (Intercept) islandDream islandTorgersen bill_length_mm bill_depth_mm
## Chinstrap 0.6957534 1.403937e+00 NaN 101.64960 54.922341
## Gentoo 0.1268320 8.301517e-17 8.486003e-08 8.75875 4.370859
## flipper_length_mm body_mass_g
## Chinstrap 18.21333 0.5917839
## Gentoo 30.70866 1.3707121
##
## Residual Deviance: 0.000474253
## AIC: 28.00047
The low AIC is suggestive of a good fit, so we will keep all of the variables in the model. Note, however, that the near-zero residual deviance and the NaN standard errors indicate the three species are almost perfectly separable on the training data, so the individual coefficient estimates are unstable and should not be interpreted too literally.
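As a final sketch, not part of the original analysis, the multinomial model could be evaluated on the held-out test set in the same way as the logistic model:
multinom_preds <- predict(multinom_model, newdata = test_data, type = 'class')
mean(multinom_preds == test_data$species)  # overall test-set accuracy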