Assignment 1 (Logistic Regression)

Subhalaxmi Rout

02/19/2021


Instruction

Let’s use the Penguin dataset for our assignment. To learn more about the dataset, please visit: https://allisonhorst.github.io/palmerpenguins/articles/intro.html

For this assignment, let us use ‘species’ as our outcome or the dependent variable.

\(1.\) Logistic Regression with a binary outcome. (40)

\(a.\) The penguin dataset has ‘species’ column. Please check how many categories you have in the species column. Conduct whatever data manipulation you need to do to be able to build a logistic regression with binary outcome. Please explain your reasoning behind your decision as you manipulate the outcome/dependent variable (species).

\(b.\) Please make sure you are evaluating the independent variables appropriately in deciding which ones should be in the model.

\(c.\) Provide variable interpretations in your model.

\(2.\) For your model from #1, please provide: AUC, Accuracy, TPR, FPR, TNR, FNR (20)

\(3.\) Multinomial Logistic Regression. (40)

\(a.\) Please fit it a multinomial logistic regression where your outcome variable is ‘species’.

\(b.\) Please be sure to evaluate the independent variables appropriately to fit your best parsimonious model.

\(c.\) Please be sure to interpret your variables in the model.

\(4.\) Extra credit: what would be some of the fit statistics you would want to evaluate for your model in question #3? Feel free to share whatever you can provide. (10)

Load libraries

library(palmerpenguins)
library(tibble)
library(DT)
library(dplyr)
library(explore)
library(ggplot2)
library(knitr)
library(caTools)
library(stats)
library(pROC)
library(nnet)

Solution

# glimpse() prints the structure and (invisibly) returns the data
penguine_data <- glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ade…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgers…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1,…
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1,…
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 18…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475,…
## $ sex               <fct> male, female, female, NA, female, male, female, mal…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…
DT::datatable(head(penguine_data,10))

The penguins dataset comes with the palmerpenguins package. It has 344 observations and 8 variables, containing measurements of penguins living on different islands across several years. The image below illustrates flipper_length, bill_length, and bill_depth.

knitr::include_graphics("https://raw.githubusercontent.com/SubhalaxmiRout002/DATA-622/main/Assignment%201/Screen%20Shot%202021-02-19%20at%205.54.09%20AM.png")

# size of data
dim(penguine_data)
## [1] 344   8
# shows Summary and null values 
summary(penguine_data)
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2

The data contain 19 NAs in total: sex has the most (11), while bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g have 2 each (8 combined).
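As a quick check, the per-column NA counts can be tallied directly (a minimal base-R sketch):

# count missing values in each column
colSums(is.na(penguins))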

EDA

# Group penguins by island and species (the bar chart below shows the counts)
penguins %>% group_by(island, species, .drop = FALSE)
## # A tibble: 344 x 8
## # Groups:   island, species [9]
##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
##  1 Adelie  Torge…           39.1          18.7              181        3750
##  2 Adelie  Torge…           39.5          17.4              186        3800
##  3 Adelie  Torge…           40.3          18                195        3250
##  4 Adelie  Torge…           NA            NA                 NA          NA
##  5 Adelie  Torge…           36.7          19.3              193        3450
##  6 Adelie  Torge…           39.3          20.6              190        3650
##  7 Adelie  Torge…           38.9          17.8              181        3625
##  8 Adelie  Torge…           39.2          19.6              195        4675
##  9 Adelie  Torge…           34.1          18.1              193        3475
## 10 Adelie  Torge…           42            20.2              190        4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
ggplot(penguins) + aes(x = island, fill = species) +
  geom_bar(alpha = 0.8) +
  scale_fill_manual(values = c("indianred2","forestgreen","dodgerblue"),
                    guide = FALSE) +
  theme_minimal() +
  facet_wrap(~species, ncol = 1) +
  coord_flip()

Chinstrap and Gentoo are located on separate islands: Chinstrap only on Dream and Gentoo only on Biscoe, while Adelie is found on all three.

penguine_data %>% explore_all(target = species)

The density plots show some interesting facts:

  • The bill_length plot shows similarity between Chinstrap and Gentoo but separates Adelie. However, the bill_length density for Chinstrap and Gentoo has two peaks, meaning the data are not normally distributed.
  • The bill_depth plot shows similarity between Chinstrap and Adelie but separates Gentoo.
  • Similarly, the flipper_length_mm and body_mass_g plots show similarity between Chinstrap and Adelie but separate Gentoo.

Distribution of bill_length_mm for each penguin species:

hist(penguine_data$bill_length_mm[penguine_data$species=='Gentoo'],col = 'forestgreen'
     , xlab = "bill_length_mm"
     , main = "Gentoos' distribution of bill_length_mm")

hist(penguine_data$bill_length_mm[penguine_data$species=='Chinstrap'],col = 'indianred2'
     , xlab = "bill_length_mm"
     , main = "Chinstraps' Histogram of bill_length_mm")

hist(penguine_data$bill_length_mm[penguine_data$species=='Adelie'],col = 'dodgerblue'
     , xlab = "bill_length_mm"
     , main = "Adelies' Histogram of bill_length_mm")

The histogram for Adelie looks approximately normally distributed, but the other species are skewed.
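As a hedged follow-up, a Shapiro-Wilk test (stats::shapiro.test, which tolerates NAs) can put a number on this visual impression; this is a minimal sketch, not part of the model pipeline:

# formal normality check of bill_length_mm within each species
by(penguine_data$bill_length_mm, penguine_data$species, shapiro.test)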

Let's fit a decision tree:

penguine_data %>% explain_tree(target = species)

The tree above shows:

  • if flipper_length_mm >= 207 then Gentoo
  • if flipper_length_mm < 207 and bill_length_mm < 43 then Adelie
  • if flipper_length_mm < 207 and bill_length_mm >= 43 then Chinstrap

Now let’s take a closer look at these variables:

# Flipper length vs. bill length
flipper_bill <- ggplot(data = penguine_data,
                         aes(x = flipper_length_mm,
                             y = bill_length_mm)) +
  geom_point(aes(color = species,
                 shape = species),
             size = 3,
             alpha = 0.8) +
  theme_minimal() +
  scale_color_manual(values = c("indianred2","forestgreen","dodgerblue")) +
  labs(title = "Flipper and bill length",
       x = "Flipper length (mm)",
       y = "Bill length (mm)",
       color = "Penguin species",
       shape = "Penguin species") +
  theme(legend.position = c(0.85, 0.15),
        legend.background = element_rect(fill = "white", color = NA),
        plot.title = element_text(hjust = 0.5, size = 15, colour = "black"),
        plot.title.position = "plot",
        plot.caption = element_text(hjust = 0, face= "italic"),
        plot.caption.position = "plot")

flipper_bill

The plot above does not show a single clean clustering of species. The density plots show that body_mass_g and flipper_length_mm for Adelie and Chinstrap differ from Gentoo, so let's look at body_mass_g versus flipper_length_mm.

ggplot(data = penguine_data, aes(x = body_mass_g, y = flipper_length_mm, col = species)) +
    geom_point() + 
    theme_bw() + 
    labs(title = "Flipper Length and Body Mass of the Palmer Penguins") +
  theme(legend.position = c(0.85, 0.15),
        legend.background = element_rect(fill = "white", color = NA),
        plot.title = element_text(hjust = 0.5, size = 15, colour = "black"),
        plot.title.position = "plot",
        plot.caption = element_text(hjust = 0, face= "italic"),
        plot.caption.position = "plot")

This plot clearly shows a positive relationship between flipper_length_mm and body_mass_g. Moreover, Adelie and Chinstrap cluster together, and Gentoo sits apart from that cluster.
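A quick correlation quantifies that visual association (a minimal sketch; use = "complete.obs" drops the rows with NAs):

# Pearson correlation between body mass and flipper length, ignoring NAs
cor(penguins$body_mass_g, penguins$flipper_length_mm, use = "complete.obs")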

Logistic Regression with a binary outcome

Binary logistic regression requires the dependent variable to be binary.

# remove rows with NAs (keep complete cases only)
penguine_data_2 <- penguine_data[complete.cases(penguine_data), ]

# size of data
dim(penguine_data_2)
## [1] 333   8
#summary(penguine_data_2)

Species is the dependent variable and has 3 categories, i.e. Adelie, Chinstrap, and Gentoo. Based on body_mass_g and flipper_length_mm, Adelie and Chinstrap form one cluster and Gentoo another, so I combine Adelie and Chinstrap into a single class for the binary outcome.

# convert species to a binary dummy: 1 = Adelie or Chinstrap, 0 = Gentoo
penguine_data_2$ade_chin <-  ifelse(penguine_data_2$species == 'Adelie' | penguine_data_2$species == 'Chinstrap', 1, 0)


# Drop species column from data
penguine_data_2 <- penguine_data_2 %>% select(-c(species) ) 

# View data
DT::datatable(head(penguine_data_2,10))
table(penguine_data_2$ade_chin)
## 
##   0   1 
## 119 214

The new class ade_chin (Adelie and Chinstrap combined) has 214 rows and Gentoo has 119 rows. The data are imbalanced, so in this case I will use undersampling.

Under Sampling:

Under-sampling balances the dataset by reducing the size of the abundant class. This method is used when the quantity of data is sufficient: by keeping all samples from the rare class and randomly selecting an equal number from the abundant class, a balanced dataset can be obtained for further modelling.

Take a random sample of 119 rows from the ade_chin class to balance the data; the resulting dataset then has 119 rows from each class, i.e. ade_chin and Gentoo.

Split the data into two sets, train and test, assigning 80% of the data to train and 20% to test.

set.seed(1)
smpl1 <- penguine_data_2 %>% filter(ade_chin == 1) %>% sample_n(size = 119)
smpl2 <- penguine_data_2 %>% filter(ade_chin == 0) 
smpl_119 <- rbind(smpl1, smpl2)
# Data split

sample = sample.split(smpl_119$ade_chin, SplitRatio = 0.80)

penguine_train = subset(smpl_119, sample == TRUE)
penguine_test = subset(smpl_119, sample == FALSE)

dim(penguine_train)
## [1] 190   8
dim(penguine_test)
## [1] 48  8

For the first binomial model, choose flipper_length_mm, body_mass_g, bill_length_mm, and bill_depth_mm. I do not select the other features (island, year, and sex), as I do not find a relevant impact of these features on the prediction.

Binomial Model 1

bi_model1 <- glm(ade_chin ~  flipper_length_mm + body_mass_g + bill_length_mm + bill_depth_mm
                 , data = penguine_train, family=binomial(link="logit"))
summary(bi_model1)
## 
## Call:
## glm(formula = ade_chin ~ flipper_length_mm + body_mass_g + bill_length_mm + 
##     bill_depth_mm, family = binomial(link = "logit"), data = penguine_train)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -3.223e-05  -2.100e-08   0.000e+00   2.100e-08   3.015e-05  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)        1.590e+02  8.088e+05   0.000        1
## flipper_length_mm -1.374e+00  2.518e+03  -0.001        1
## body_mass_g       -1.578e-02  4.498e+01   0.000        1
## bill_length_mm     1.389e-01  9.538e+03   0.000        1
## bill_depth_mm      1.140e+01  1.893e+04   0.001        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2.6340e+02  on 189  degrees of freedom
## Residual deviance: 3.6989e-09  on 185  degrees of freedom
## AIC: 10
## 
## Number of Fisher Scoring iterations: 25

Interpretation

Null deviance: 263.40, residual deviance: ≈ 3.70e-09, AIC: 10. The near-zero residual deviance, the enormous standard errors, and p-values of 1 are classic signs of complete separation: the two classes can be separated perfectly by these predictors, so the individual coefficient estimates are unstable.

Equation:

log(p/(1-p)) = 159.0 - flipper_length_mm * 1.374 - body_mass_g * 0.01578 + bill_length_mm * 0.1389 + bill_depth_mm * 11.40
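To make the equation concrete, the fitted log-odds can be converted to a probability with the logistic function. A minimal sketch (the first test row is used purely for illustration):

# linear predictor (log-odds) for one test penguin; predict() defaults to type = "link"
eta <- predict(bi_model1, penguine_test[1, ])
# logistic transform: equivalent to exp(eta) / (1 + exp(eta))
plogis(eta)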

Residual deviance is calculated from the model containing all the selected features. By analogy with linear regression, think of residual deviance as the residual sum of squares (RSS) and null deviance as the total sum of squares (TSS). The larger the difference between null and residual deviance, the better the model.

The model with the lowest AIC is relatively better. In practice, AIC is usually preferred over raw deviance for comparing models, since it penalizes the number of parameters.
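Following the RSS/TSS analogy, a McFadden-style pseudo-R² can be computed from the two deviances stored on the fitted glm (a minimal sketch):

# McFadden's pseudo-R^2: 1 - residual deviance / null deviance
pseudo_r2 <- 1 - bi_model1$deviance / bi_model1$null.deviance
pseudo_r2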

Binomial Model 2

Remove bill_length_mm because its density is bimodal, i.e. not normally distributed.

bi_model2 <- glm(ade_chin ~  flipper_length_mm  + body_mass_g + bill_depth_mm
                 , data = penguine_train, family=binomial(link="logit"))

summary(bi_model2)
## 
## Call:
## glm(formula = ade_chin ~ flipper_length_mm + body_mass_g + bill_depth_mm, 
##     family = binomial(link = "logit"), data = penguine_train)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -3.100e-05  -2.110e-08   0.000e+00   2.110e-08   2.678e-05  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)        1.688e+02  4.379e+05   0.000    1.000
## flipper_length_mm -1.374e+00  2.537e+03  -0.001    1.000
## body_mass_g       -1.581e-02  4.347e+01   0.000    1.000
## bill_depth_mm      1.119e+01  1.081e+04   0.001    0.999
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2.6340e+02  on 189  degrees of freedom
## Residual deviance: 3.6928e-09  on 186  degrees of freedom
## AIC: 8
## 
## Number of Fisher Scoring iterations: 25

Interpretation

Null deviance: 263.40, residual deviance: ≈ 3.69e-09, AIC: 8.

Equation:

log(p/(1-p)) = 168.8 - flipper_length_mm * 1.374 - body_mass_g * 0.01581 + bill_depth_mm * 11.19

Binomial Model 3

Remove body_mass_g because it is positively correlated with flipper_length_mm.

bi_model3 <- glm(ade_chin ~  flipper_length_mm  + bill_depth_mm
                 , data = penguine_train, family=binomial(link="logit"))

summary(bi_model3)
## 
## Call:
## glm(formula = ade_chin ~ flipper_length_mm + bill_depth_mm, family = binomial(link = "logit"), 
##     data = penguine_train)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -3.615e-05  -2.100e-08   0.000e+00   2.100e-08   4.642e-05  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)          280.510 325904.300   0.001    0.999
## flipper_length_mm     -2.434   1657.410  -0.001    0.999
## bill_depth_mm         13.360  10101.308   0.001    0.999
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2.6340e+02  on 189  degrees of freedom
## Residual deviance: 4.8487e-09  on 187  degrees of freedom
## AIC: 6
## 
## Number of Fisher Scoring iterations: 25

Interpretation

Null deviance: 263.40, residual deviance: ≈ 4.85e-09, AIC: 6.

Equation:

log(p/(1-p)) = 280.510 - flipper_length_mm * 2.434 + bill_depth_mm * 13.360

Of these three models, I select model 3 because it has the lowest AIC.
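As a side-by-side check, stats::AIC accepts several fitted models at once (a minimal sketch):

# compare the three candidate models directly by AIC
AIC(bi_model1, bi_model2, bi_model3)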

Confusion Matrix (CM)

Test the model using penguine_test. The confusion matrix contains True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN).

The confusion matrix avoids “confusion” by laying out actual versus predicted values in a tabular format.

# predicted probabilities: type = "response" returns P(ade_chin = 1),
# whereas the default returns the log-odds
pred_prob_3 <- predict(bi_model3, penguine_test, type = "response")
pred_3 <- if_else(pred_prob_3 > 0.5, 1, 0)

cm <- table(true = penguine_test$ade_chin, pred_3)

# rows are the true class (0 = Gentoo, 1 = ade_chin), columns the predictions
TN <- cm[1, 1]  # true 0, predicted 0
FP <- cm[1, 2]  # true 0, predicted 1
FN <- cm[2, 1]  # true 1, predicted 0
TP <- cm[2, 2]  # true 1, predicted 1

AUC

The area under the curve (AUC), also referred to as the index of accuracy (A), summarizes the performance of the ROC curve. The higher the area, the better the model. The ROC curve plots the True Positive Rate (y-axis) against the False Positive Rate (x-axis).

# build the ROC curve from the predicted probabilities (not the 0/1 labels)
roc_3 <- roc(penguine_test$ade_chin, pred_prob_3, direction = "<")
plot(roc_3, col = "blue", lwd = 3, main = "ROC Curve", print.auc = TRUE)

Accuracy

It measures the overall predictive accuracy of the model, calculated as Accuracy = (TP + TN) / (TP + TN + FP + FN).

accuracy <- (TP + TN)/(TN + FN + TP + FP)
accuracy
## [1] 1

True Positive Rate (TPR)

It indicates how many positive values, out of all the positive values, have been correctly predicted: TPR = TP / (TP + FN). Also, TPR = 1 - False Negative Rate. It is also known as Sensitivity or Recall.

TPR = (TP)/(TP+FN)
TPR
## [1] 1

False Positive Rate (FPR)

It indicates how many negative values, out of all the negative values, have been incorrectly predicted: FPR = FP / (FP + TN). Also, FPR = 1 - True Negative Rate.

FPR = (FP)/(FP+TN)
FPR
## [1] 0

True Negative Rate (TNR)

It indicates how many negative values, out of all the negative values, have been correctly predicted: TNR = TN / (TN + FP). It is also known as Specificity.

TNR = (TN)/(FP+TN)
TNR
## [1] 1

False Negative Rate (FNR)

It indicates how many positive values, out of all the positive values, have been incorrectly predicted: FNR = FN / (FN + TP).

FNR = (FN)/(FN+TP)
FNR
## [1] 0

Other metrics for model 3:

#  Classification Error Rate
CER = (FP+FN)/(TP+TN+FP+FN)
CER
## [1] 0
# Precision
Precision = (TP)/(TP+FP)
Precision
## [1] 1
# F1 score
F1 = 2*(Precision * TPR)/(Precision + TPR)
F1
## [1] 1

Multinomial logistic regression

Multinomial regression is an extension of binomial logistic regression. The algorithm allows us to predict a categorical dependent variable with more than two levels. As with any other regression model, the multinomial outcome can be predicted using one or more independent variables, which can be nominal, ordinal, or continuous.
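Concretely, with class 1 as the reference level, the model fits one log-odds equation per non-reference class:

ln[P(species = k)/P(species = 1)] = b_k0 + b_k1 * x_1 + ... + b_kp * x_p, for k = 2, 3

These are exactly the shapes of the fitted equations interpreted below.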

Allocate a number to each species:

  • Adelie = 1
  • Chinstrap = 2
  • Gentoo = 3
penguine_data_3 <- penguine_data[complete.cases(penguine_data), ]
#summary(penguine_data_3)
penguine_data_3$species <- case_when(
  penguine_data_3$species == "Adelie" ~ 1,
  penguine_data_3$species == "Chinstrap" ~ 2,
  penguine_data_3$species == "Gentoo" ~ 3
)

table(penguine_data_3$species)
## 
##   1   2   3 
## 146  68 119

Applying dummy_cols() automates the process; it is useful when you have many columns to generate dummy variables from, or many categories within a column. This way we get all-numerical data for the model. Split the data into two sets, train and test, in a 70% train / 30% test ratio.

set.seed(22)

ind <- sample(2, nrow(penguine_data_3), replace = TRUE,
              prob = c(0.7, 0.3))
train <- penguine_data_3[ind == 1,]
test <- penguine_data_3[ind == 2,]

dim(train)
## [1] 245   8
dim(test)
## [1] 88  8
train <- fastDummies::dummy_cols(train) %>% select (-c(island, sex, year))
test <- fastDummies::dummy_cols(test) %>% select (-c(island, sex, year))

train$species <- as.factor(train$species)
test$species <- as.factor(test$species)

train$species <- relevel(train$species, ref = "1")
test$species <- relevel(test$species, ref = "1")

Multinomial Model 1

Model with all variables

mul_model1 <- multinom(species ~ ., data = train)
## # weights:  33 (20 variable)
## initial  value 269.160011 
## iter  10 value 6.490882
## iter  20 value 1.000447
## iter  30 value 0.007333
## final  value 0.000004 
## converged
summary(mul_model1)
## Call:
## multinom(formula = species ~ ., data = train)
## 
## Coefficients:
##   (Intercept) bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 2   -90.37271      11.193024      -9.19526         -1.300113  0.01892986
## 3   -14.66748       9.790331     -17.77214         -1.672496  0.05237551
##   island_Biscoe island_Dream island_Torgersen sex_female  sex_male
## 2     -52.31516    4.2585232        -42.31608 -18.490427 -71.88229
## 3      16.51737   -0.2050357        -30.97981   8.801406 -23.46888
## 
## Std. Errors:
##   (Intercept) bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 2   19.715172       792.7148       5.39504         212.94684   4.5525507
## 3    3.087224       164.5498     127.63921          23.94005   0.9888599
##   island_Biscoe island_Dream island_Torgersen   sex_female   sex_male
## 2  1.855764e-12 1.970569e+01     9.486383e-03 1.628462e+02 143.132554
## 3  3.087221e+00 9.552211e-05     2.621256e-13 7.019081e-12   3.087224
## 
## Residual Deviance: 8.079054e-06 
## AIC: 32.00001

Interpretation

From multinomial model 1, we can get 2 equations:

Eqn 1:

ln[P(species = 2)/P(species = 1)] = -90.37271 + bill_length_mm * 11.193024 - bill_depth_mm * 9.19526 - flipper_length_mm * 1.300113 + body_mass_g * 0.01892986 - island_Biscoe * 52.31516 + island_Dream * 4.2585232 - island_Torgersen * 42.31608 - sex_female * 18.490427 - sex_male * 71.88229

Eqn 2:

ln[P(species = 3)/P(species = 1)] = -14.66748 + bill_length_mm * 9.790331 - bill_depth_mm * 17.77214 - flipper_length_mm * 1.672496 + body_mass_g * 0.05237551 + island_Biscoe * 16.51737 - island_Dream * 0.2050357 - island_Torgersen * 30.97981 + sex_female * 8.801406 - sex_male * 23.46888

To identify the statistically significant variables, perform a two-tailed test. Below we calculate the p-values; if a p-value is > 0.05, that variable is not statistically significant.

z <- summary(mul_model1)$coefficients / summary(mul_model1)$standard.errors

# two-tailed p-value; the parentheses matter (double the upper-tail area)
p <- (1 - pnorm(abs(z), 0, 1)) * 2

p
##   (Intercept) bill_length_mm bill_depth_mm flipper_length_mm  body_mass_g
## 2  -0.9999954    -0.01126565    -0.9116920      -0.004871327 -0.003317657
## 3  -0.9999980    -0.04744429    -0.1107374      -0.055696368 -0.042240642
##   island_Biscoe island_Dream island_Torgersen  sex_female   sex_male
## 2    -1.0000000   -0.1710951               -1 -0.09040179 -0.3844786
## 3    -0.9999999   -1.0000000               -1 -1.00000000 -1.0000000

We can see that bill_depth_mm, island_Biscoe, island_Dream, island_Torgersen, sex_female, and sex_male have high p-values.

In the next model we will exclude all of these variables.

Multinomial Model 2

Model without bill_depth_mm, island_Biscoe, island_Dream, island_Torgersen, sex_female, and sex_male.

mul_model2 <- multinom(species ~ bill_length_mm + flipper_length_mm + body_mass_g
                       , data = train)
## # weights:  15 (8 variable)
## initial  value 269.160011 
## iter  10 value 58.353361
## iter  20 value 17.264952
## iter  30 value 11.867321
## iter  40 value 6.840817
## iter  50 value 6.077581
## iter  60 value 6.052656
## iter  70 value 5.461953
## iter  80 value 5.375168
## iter  90 value 5.317128
## iter 100 value 5.309806
## final  value 5.309806 
## stopped after 100 iterations
summary(mul_model2)
## Call:
## multinom(formula = species ~ bill_length_mm + flipper_length_mm + 
##     body_mass_g, data = train)
## 
## Coefficients:
##   (Intercept) bill_length_mm flipper_length_mm  body_mass_g
## 2   -174.1776     15.0936970        -1.0568874 -0.075407128
## 3   -203.1864      0.7861321         0.6996343  0.005863934
## 
## Std. Errors:
##    (Intercept) bill_length_mm flipper_length_mm body_mass_g
## 2 0.0008027276     0.01267965        0.08366188 0.004353293
## 3 0.0010528341     0.56610128        0.11811937 0.003882355
## 
## Residual Deviance: 10.61961 
## AIC: 26.61961

Interpretation

From multinomial mul_model2, we can write the following equations:

ln[P(species = 2)/P(species = 1)] = -174.1776 + bill_length_mm * 15.0936970 - flipper_length_mm * 1.0568874 - body_mass_g * 0.075407128

ln[P(species = 3)/P(species = 1)] = -203.1864 + bill_length_mm * 0.7861321 + flipper_length_mm * 0.6996343 + body_mass_g * 0.005863934

Of the two multinomial models, mul_model2 has the lower AIC, so I will go with model 2. Apply predictions to both the training and test datasets using model 2.
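Beyond AIC, a likelihood-ratio test between the two nested models can check whether dropping those variables significantly worsens the fit. A minimal sketch, assuming both models were fit on the same training data (multinom objects store the residual deviance and effective degrees of freedom as $deviance and $edf):

# LR statistic: deviance of the smaller model minus deviance of the larger model
lr_stat <- mul_model2$deviance - mul_model1$deviance
df_diff <- mul_model1$edf - mul_model2$edf   # number of parameters dropped
pchisq(lr_stat, df = df_diff, lower.tail = FALSE)   # p-value of the LR test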

Extra Credit

We will look at the confusion matrix and calculate accuracy for both the train and test sets.

pred_mul <- predict(mul_model2, train)

# confusion matrix
tab <- table(pred_mul, train$species)
tab
##         
## pred_mul   1   2   3
##        1 103   0   1
##        2   0  49   0
##        3   1   0  91
accuracy <- sum(diag(tab))/ sum(tab)
accuracy
## [1] 0.9918367

With the training data, the confusion matrix shows 99.18% accuracy. Let's check the test data.

pred_mul_test <- predict(mul_model2, test)
tab_2 <- table(pred_mul_test, test$species)
tab_2
##              
## pred_mul_test  1  2  3
##             1 39  0  0
##             2  3 19  0
##             3  0  0 27
test_accuracy <-  sum(diag(tab_2))/ sum(tab_2)
test_accuracy
## [1] 0.9659091

The test data also show 96.59% accuracy.

Understanding the performance of the model is as important as building it. Let's see how our model is doing.

Suppose we do not apply any ML algorithm and simply predict by class frequency: the training-set class proportions are 42.45% Adelie, 20% Chinstrap, and 37.55% Gentoo, so a naive baseline that always predicts the majority class would be right only 42.45% of the time.

n <- table(train$species)
n/sum(n)
## 
##         1         2         3 
## 0.4244898 0.2000000 0.3755102

Let's see our training model performance:

# column-wise proportions: each true class sums to 1 (per-species recall)
prop.table(tab, 2)
##         
## pred_mul           1           2           3
##        1 0.990384615 0.000000000 0.010869565
##        2 0.000000000 1.000000000 0.000000000
##        3 0.009615385 0.000000000 0.989130435

The matrix above shows the proportion of each species predicted correctly:

Adelie = 99%, Chinstrap = 100%, Gentoo = 98.9%

This is quite good. Let's see the test set performance.

# per-species recall on the test set
prop.table(tab_2, 2)
##              
## pred_mul_test         1         2         3
##             1 0.9285714 0.0000000 0.0000000
##             2 0.0714286 1.0000000 0.0000000
##             3 0.0000000 0.0000000 1.0000000

The matrix above shows the proportion of each species predicted correctly:

Adelie = 92.9%, Chinstrap = 100%, Gentoo = 100%

In the train set the Adelie recall was 99%, but in the test set it drops to 92.9%. Gentoo performance, on the other hand, improves relative to the train set.

I personally enjoyed this assignment and learned many new things!

There is room for improvement in the model; if I get time, I will pursue it in the future.