DATA622-Assignment 1

Logistic Regression with Penguin dataset

Don Padmaperuma

Penguin Data Set

This dataset contains size measurements for three differen kind penguin species observed on three islands in the Palmer Archipelago, Antarctica.These data were collected from 2007 - 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network. Penguins dataset contains 8 variables (n = 344 penguins).

Overview

In this assignment I will build a Logistic regression model with binary outcome and a Multinomial logistic regression using palmer dataset and “Species” as the outcome variable. I will be conducting several data preparations to build the models.

Logistic Regression with Binary Outcome

  1. The penguin dataset has ‘species’ column. Please check how many categories you have in the species column. Conduct whatever data manipulation you need to do to be able to build a logistic regression with binary outcome. Please explain your reasoning behind your decision as you manipulate the outcome/dependent variable (species).
  2. Please make sure you are evaluating the independent variables appropriately in deciding which ones should be in the model.
  3. Provide variable interpretations in your model.

Data Preparation

# Load data
data("penguins")
# Closer look at the data
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, A...
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torge...
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34....
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18....
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, ...
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 347...
## $ sex               <fct> male, female, female, NA, female, male, female, m...
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2...
summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2

As mentioned in part a, we will check what are the categories in the species. From the summary we already know that there are three categories but we will focus only on the species column with the below syntax. This will help us to understand how many in each category.

summary(penguins$species)
##    Adelie Chinstrap    Gentoo 
##       152        68       124

species and its relationship with numerical variables.

penguins %>%
  group_by(species) %>%
  summarize(across(where(is.numeric), mean, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 6
##   species   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
##   <fct>              <dbl>         <dbl>             <dbl>       <dbl> <dbl>
## 1 Adelie              38.8          18.3              190.       3701. 2008.
## 2 Chinstrap           48.8          18.4              196.       3733. 2008.
## 3 Gentoo              47.5          15.0              217.       5076. 2008.
penguins %>%
  count(species) %>%
  ggplot() + geom_col(aes(x = species, y = n, fill = species)) +
  geom_label(aes(x = species, y = n, label = n)) +
  scale_fill_manual(values = c("darkorange","purple","cyan4")) +
  theme_minimal() +
  labs(title = 'Penguins Species & Count')

As you can see from the above visualization and the summaries, We see from the three categories Adelie has the highest count followed by Gentoo and Chinstrap.

Relationship with categorical features

In this section, my purpose is to analyze the relationship between species with other categorical features such as sex and the island.

plot_island<-ggplot(penguins, aes(x = island, y = species, color = species)) +
  geom_jitter(size = 3)
plot_sex<-ggplot(penguins, aes(x = sex, y = species, color = species)) +
  geom_jitter(size = 3)
plot_grid(plot_island, plot_sex, labels = "AUTO")

From the above island graph we can see that the Adelie penguins can be found in all three islands while Chintstrap in only Dream island and Gentoo only in Biscoe island.

Relationship with numerical features

Penguin dataset also has four continuous variables.We can visualize the relationship between those variables from the below plots.

# Scatterplot 1: penguin flipper length versus body mass
plot1<-ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species,
                 shape = species),
             size = 2) +
  scale_color_manual(values = c("darkorange","darkorchid","cyan"))
# Scatterplot 2: penguin bill length versus bill depth
plot2<-ggplot(data = penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species,
                 shape = species),
             size = 2)  +
  scale_color_manual(values = c("darkorange","darkorchid","cyan"))
plot3<-ggplot(penguins, aes(x = flipper_length_mm,
                     y = body_mass_g)) +
  geom_point(aes(color = sex)) +
  scale_color_manual(values = c("darkorange","cyan4"),
                     na.translate = FALSE) +
  facet_wrap(~species)

plot_grid(plot1, plot2, plot3, labels = "AUTO")

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(penguins, columns = 3:6, title = "Correlogram of Variables", 
        ggplot2::aes(color = island),
        progress = FALSE, 
        lower = list(continuous = wrap("smooth", alpha = 0.3, size = 0.1))) 

With above observations, we need to determine how are we going to proceed with three categories of outcome variable species to build a logistic regression with binary outcome. Being that Adelie is the only species that is in all three island, we can get a binary outcome of species as if whether the species is Adelie or not. To go more further with the data processing, I will remove the missing data and set the outcome to binary with Adelie or NOn Adelie, before we apply to the logistic regression model with all variables. Dealing with missing data is important as it can be problematic in predictive modeling.

# remove missing data
penguins <- na.omit(penguins)
# create two categories of species - Adelie and not_Adelie
penguins$target_adelie = penguins$species
levels(penguins$target_adelie)[levels(penguins$target_adelie) != "Adelie"] <- "NotAdelie"

Model1

library(dplyr)
set.seed(1234567)

# create test and train dataset
penguins_partition <- createDataPartition(penguins$target_adelie, p=0.8, list=FALSE)
train <- penguins[penguins_partition,]
## Warning: The `i` argument of ``[`()` can't be a matrix as of tibble 3.0.0.
## Convert to a vector.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
test <- penguins[-penguins_partition,]
model1_LR <- glm(target_adelie~.,data = train, family = binomial('logit'))
## Warning: glm.fit: algorithm did not converge
summary(model1_LR)
## 
## Call:
## glm(formula = target_adelie ~ ., family = binomial("logit"), 
##     data = train)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -2.409e-06  -2.409e-06   2.409e-06   2.409e-06   2.409e-06  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)       -2.657e+01  5.803e+07       0        1
## speciesChinstrap   5.313e+01  1.260e+05       0        1
## speciesGentoo      5.313e+01  2.114e+05       0        1
## islandDream        7.230e-12  8.171e+04       0        1
## islandTorgersen    1.606e-11  8.710e+04       0        1
## bill_length_mm     3.395e-13  9.858e+03       0        1
## bill_depth_mm      6.329e-12  2.812e+04       0        1
## flipper_length_mm -3.869e-12  4.590e+03       0        1
## body_mass_g       -3.576e-14  7.675e+01       0        1
## sexmale            4.818e-11  7.377e+04       0        1
## year               3.839e-11  2.900e+04       0        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3.6605e+02  on 266  degrees of freedom
## Residual deviance: 1.5490e-09  on 256  degrees of freedom
## AIC: 22
## 
## Number of Fisher Scoring iterations: 25

Model2

We can optimize the AIC and derive a better model.

model2_LR <- stepAIC(model1_LR)
## Start:  AIC=22
## target_adelie ~ species + island + bill_length_mm + bill_depth_mm + 
##     flipper_length_mm + body_mass_g + sex + year
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge
##                     Df   Deviance AIC
## - island             2 1.5490e-09  18
## - species            2 2.8409e-08  18
## - bill_length_mm     1 1.5490e-09  20
## - bill_depth_mm      1 1.5490e-09  20
## - flipper_length_mm  1 1.5490e-09  20
## - body_mass_g        1 1.5490e-09  20
## - sex                1 1.5490e-09  20
## - year               1 1.5490e-09  20
## <none>                 1.5490e-09  22
## Warning: glm.fit: algorithm did not converge
## 
## Step:  AIC=18
## target_adelie ~ species + bill_length_mm + bill_depth_mm + flipper_length_mm + 
##     body_mass_g + sex + year
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge
##                     Df   Deviance AIC
## - species            2 1.4122e-07  14
## - bill_length_mm     1 1.5490e-09  16
## - bill_depth_mm      1 1.5490e-09  16
## - flipper_length_mm  1 1.5490e-09  16
## - body_mass_g        1 1.5490e-09  16
## - sex                1 1.5490e-09  16
## - year               1 1.5490e-09  16
## <none>                 1.5490e-09  18
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Step:  AIC=14
## target_adelie ~ bill_length_mm + bill_depth_mm + flipper_length_mm + 
##     body_mass_g + sex + year
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                     Df Deviance     AIC
## - year               1    0.000  12.000
## - flipper_length_mm  1    0.000  12.000
## - body_mass_g        1    0.000  12.000
## <none>                    0.000  14.000
## - bill_depth_mm      1    6.423  18.423
## - sex                1    9.173  21.173
## - bill_length_mm     1  183.194 195.194
## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Step:  AIC=12
## target_adelie ~ bill_length_mm + bill_depth_mm + flipper_length_mm + 
##     body_mass_g + sex
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                     Df Deviance     AIC
## - flipper_length_mm  1    0.000  10.000
## - body_mass_g        1    0.000  10.000
## <none>                    0.000  12.000
## - bill_depth_mm      1    7.221  17.221
## - sex                1    9.175  19.175
## - bill_length_mm     1  190.509 200.509
## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Step:  AIC=10
## target_adelie ~ bill_length_mm + bill_depth_mm + body_mass_g + 
##     sex
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                  Df Deviance    AIC
## - body_mass_g     1     0.00   8.00
## <none>                  0.00  10.00
## - sex             1     9.20  17.20
## - bill_depth_mm   1    11.50  19.50
## - bill_length_mm  1   239.91 247.91
## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Step:  AIC=8
## target_adelie ~ bill_length_mm + bill_depth_mm + sex
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                  Df Deviance     AIC
## <none>                 0.000   8.000
## - sex             1   17.389  23.389
## - bill_depth_mm   1   26.376  32.376
## - bill_length_mm  1  267.598 273.598
summary(model2_LR)
## 
## Call:
## glm(formula = target_adelie ~ bill_length_mm + bill_depth_mm + 
##     sex, family = binomial("logit"), data = train)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -4.558e-04  -2.000e-08   2.000e-08   2.000e-08   4.648e-04  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -1260.95  126865.82  -0.010    0.992
## bill_length_mm     96.41    7885.01   0.012    0.990
## bill_depth_mm    -160.60   13178.47  -0.012    0.990
## sexmale          -143.54  126874.52  -0.001    0.999
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3.6605e+02  on 266  degrees of freedom
## Residual deviance: 4.3094e-07  on 263  degrees of freedom
## AIC: 8
## 
## Number of Fisher Scoring iterations: 25

Model 2 with optimized AIC has a AIC value of 8.

# print variable inflation factor score
print('VIF scores of predictors')
## [1] "VIF scores of predictors"
VIF(model2_LR)
## bill_length_mm  bill_depth_mm            sex 
##      13.090152      13.050076       1.011118

There is a high correlation in bill_length_mm and bill_depth_mm as their values are over 5. And a moderate correlation for variable sex.

model3_LR <- glm(target_adelie ~ bill_length_mm + bill_depth_mm, family=binomial(link='logit'), data=test, maxit = 100)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model3_LR)
## 
## Call:
## glm(formula = target_adelie ~ bill_length_mm + bill_depth_mm, 
##     family = binomial(link = "logit"), data = test, maxit = 100)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -1.251e-05  -2.110e-08   2.110e-08   2.110e-08   1.585e-05  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)       -188.8  1149506.0   0.000        1
## bill_length_mm      27.1    43437.1   0.001        1
## bill_depth_mm      -54.2    97217.0  -0.001        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9.0523e+01  on 65  degrees of freedom
## Residual deviance: 5.6205e-10  on 63  degrees of freedom
## AIC: 6
## 
## Number of Fisher Scoring iterations: 28

When comparing AIC values of all the models, Model 3 has a lower value of AIC and could be a good fit.

Variable Interpretation

Continuous variables

  1. Each one unit change in bill_length_mm will increase the log odds of being an Adelie by 2.377*10^-11
  2. Each one unit change in bill_depth_mm will increase the log odds of being an Adelie by 2.589*10^-11
  3. Each one unit change in flipper_length_mm will decrease the log odds of being an Adelie by 2.821*10^-12
  4. Each one unit change in body_mass_g will increase the log odds of being an Adelie by 1.452*10^-13
  5. Each one unit change in year will increase the log odds of being an Adelie by 6.835*10^-11

Categorical variables

  1. If the species is Chinstrap or Gentoo, the log odds of being an Adelie increased by 5.313*10^01
  2. If the island is Dream, the log odds of being an Adelie increased by 3.192*10^-11
  3. If the island is Torgersen, the log odds of being an Adelie increased by 1.043*10^-11
  4. If the sex is male, the log odds of being an Adelie is decreased by 1.773*10^-10

Model evaluation: AUC, Accuracy, TPR, FPR, TNR, FNR

We will check the predictions and confusion matrix for the Model 1

predictTrain = predict(model1_LR, type = "response")
train$target = round(predictTrain, 1)
mytable <- table(train$target_adelie, predictTrain > 0.5)
mytable
##            
##             FALSE TRUE
##   Adelie      117    0
##   NotAdelie     0  150
# Accuracy
accuracy <- sum(diag(mytable))/sum(mytable)
accuracy
## [1] 1

100 accuracy!!! Based on the Confusion Matrix, our results are as follows:

True Positive Result (TPR): 150 / 150 = 100% False Positive Result (FPR): 0 / 150 = 0% True Negative Result (TNR): 117 / 117 = 100% False Negative Result (FNR): 0 / 117 = 0%

predictTest = predict(model1_LR, type = "response", newdata = test)
test$target = round(predictTest, 1)
confusion_matrix <- table(test$target_adelie, predictTest > 0.5)
confusion_matrix
##            
##             FALSE TRUE
##   Adelie       29    0
##   NotAdelie     0   37

tp = true positive tn = true negative fp = false positive fn = false negative

# TPR, TNR, FPR, and FNR
tp <- confusion_matrix[4]
tn <- confusion_matrix[1]
fp <- confusion_matrix[2]
fn <- confusion_matrix[3]
p <- sum(confusion_matrix[3], confusion_matrix[4])
n <- sum(confusion_matrix[1], confusion_matrix[2])
tpr <- tp / p
tnr <- tn / n
fpr <- fp / n
fnr <- fn / p
title_row <- c('(TPR)', '(TNR)', '(FPR)', '(FNR)')
values <- c(tpr, tnr, fpr, fnr)
results <- data.frame(title_row, values)
colnames(results) <- c('Measure', 'Value')
results
##   Measure Value
## 1   (TPR)     1
## 2   (TNR)     1
## 3   (FPR)     0
## 4   (FNR)     0
# Accuracy
accuracy <- sum(diag(confusion_matrix))/sum(confusion_matrix)
accuracy
## [1] 1

95% of accuracy!

Based on accuracy and confusion matrix, model 1 seems a good fit.

Multinomial Logistic Regression

For multinomial logistic regression section I will be using our initial species variable. For that I will be getting data from the original dataset. Also will be using multinom() function in this case.

# newly taking data from the penguin dataset
data(penguins)

# Remove NAs
penguins <- penguins %>% na.omit()
#Build multinomial logistic regression model

model1_MLR <- multinom(species~., data = penguins, model = TRUE)
## # weights:  30 (18 variable)
## initial  value 365.837892 
## iter  10 value 38.413009
## iter  20 value 1.753187
## iter  30 value 0.032963
## final  value 0.000004 
## converged
#Optimized model based on AIC value
model2_MLR <- stepAIC(model1_MLR)
## Start:  AIC=36
## species ~ island + bill_length_mm + bill_depth_mm + flipper_length_mm + 
##     body_mass_g + sex + year
## 
## # weights:  24 (14 variable)
## initial  value 365.837892 
## iter  10 value 39.684733
## iter  20 value 3.493800
## iter  30 value 0.041983
## iter  40 value 0.004550
## iter  50 value 0.002583
## iter  60 value 0.000942
## iter  70 value 0.000750
## iter  80 value 0.000166
## final  value 0.000072 
## converged
## # weights:  27 (16 variable)
## initial  value 365.837892 
## iter  10 value 107.613850
## iter  20 value 70.986260
## iter  30 value 70.288710
## iter  40 value 69.111777
## iter  50 value 68.116406
## final  value 67.940910 
## converged
## # weights:  27 (16 variable)
## initial  value 365.837892 
## iter  10 value 29.234711
## iter  20 value 2.653631
## iter  30 value 1.924314
## iter  40 value 0.144006
## iter  50 value 0.104312
## iter  60 value 0.015151
## iter  70 value 0.013014
## iter  80 value 0.011594
## iter  90 value 0.007703
## iter 100 value 0.005010
## final  value 0.005010 
## stopped after 100 iterations
## # weights:  27 (16 variable)
## initial  value 365.837892 
## iter  10 value 36.868693
## iter  20 value 1.102155
## iter  30 value 0.002861
## final  value 0.000006 
## converged
## # weights:  27 (16 variable)
## initial  value 365.837892 
## iter  10 value 14.861553
## iter  20 value 1.788791
## iter  30 value 0.460648
## iter  40 value 0.043136
## iter  50 value 0.020817
## iter  60 value 0.008312
## iter  70 value 0.006563
## iter  80 value 0.005939
## iter  90 value 0.002767
## iter 100 value 0.002125
## final  value 0.002125 
## stopped after 100 iterations
## # weights:  27 (16 variable)
## initial  value 365.837892 
## iter  10 value 38.605627
## iter  20 value 1.754644
## iter  30 value 0.004685
## iter  40 value 0.000130
## final  value 0.000092 
## converged
## # weights:  27 (16 variable)
## initial  value 365.837892 
## iter  10 value 14.312441
## iter  20 value 2.590536
## iter  30 value 0.589502
## iter  40 value 0.059470
## iter  50 value 0.014357
## iter  60 value 0.005840
## iter  70 value 0.003263
## iter  80 value 0.002841
## iter  90 value 0.002636
## iter 100 value 0.002421
## final  value 0.002421 
## stopped after 100 iterations
##                     Df     AIC
## - island             4  28.000
## - flipper_length_mm  2  32.000
## - sex                2  32.000
## - body_mass_g        2  32.004
## - year               2  32.005
## - bill_depth_mm      2  32.010
## <none>                  36.000
## - bill_length_mm     2 167.882
## # weights:  24 (14 variable)
## initial  value 365.837892 
## iter  10 value 39.684733
## iter  20 value 3.493800
## iter  30 value 0.041983
## iter  40 value 0.004550
## iter  50 value 0.002583
## iter  60 value 0.000942
## iter  70 value 0.000750
## iter  80 value 0.000166
## final  value 0.000072 
## converged
## 
## Step:  AIC=28
## species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + 
##     body_mass_g + sex + year
## 
## # weights:  21 (12 variable)
## initial  value 365.837892 
## iter  10 value 146.564554
## iter  20 value 113.998585
## iter  30 value 113.009733
## iter  40 value 112.969642
## iter  50 value 112.476681
## iter  60 value 111.106656
## iter  70 value 109.300764
## final  value 107.543041 
## converged
## # weights:  21 (12 variable)
## initial  value 365.837892 
## iter  10 value 36.356943
## iter  20 value 4.689074
## iter  30 value 3.216662
## iter  40 value 2.025031
## iter  50 value 1.937979
## iter  60 value 1.882920
## iter  70 value 1.636742
## iter  80 value 0.656217
## iter  90 value 0.488711
## iter 100 value 0.409244
## final  value 0.409244 
## stopped after 100 iterations
## # weights:  21 (12 variable)
## initial  value 365.837892 
## iter  10 value 29.688186
## iter  20 value 2.173176
## iter  30 value 0.018585
## iter  40 value 0.002157
## iter  50 value 0.001850
## iter  60 value 0.001671
## iter  70 value 0.000175
## iter  70 value 0.000092
## iter  70 value 0.000090
## final  value 0.000090 
## converged
## # weights:  21 (12 variable)
## initial  value 365.837892 
## iter  10 value 16.984156
## iter  20 value 3.900785
## iter  30 value 2.266374
## iter  40 value 0.537528
## iter  50 value 0.060681
## iter  60 value 0.031129
## iter  70 value 0.009115
## iter  80 value 0.004372
## iter  90 value 0.003783
## iter 100 value 0.003204
## final  value 0.003204 
## stopped after 100 iterations
## # weights:  21 (12 variable)
## initial  value 365.837892 
## iter  10 value 39.951831
## iter  20 value 4.346676
## iter  30 value 1.080490
## iter  40 value 0.395837
## iter  50 value 0.285150
## iter  60 value 0.184950
## iter  70 value 0.135562
## iter  80 value 0.018328
## iter  90 value 0.010902
## iter 100 value 0.009150
## final  value 0.009150 
## stopped after 100 iterations
## # weights:  21 (12 variable)
## initial  value 365.837892 
## iter  10 value 16.719027
## iter  20 value 3.047829
## iter  30 value 0.103484
## iter  40 value 0.063985
## iter  50 value 0.006201
## iter  60 value 0.003924
## iter  70 value 0.001733
## iter  80 value 0.001547
## iter  90 value 0.001513
## iter 100 value 0.001498
## final  value 0.001498 
## stopped after 100 iterations
##                     Df     AIC
## - flipper_length_mm  2  24.000
## - year               2  24.003
## - body_mass_g        2  24.006
## - sex                2  24.018
## - bill_depth_mm      2  24.818
## <none>                  28.000
## - bill_length_mm     2 239.086
## # weights:  21 (12 variable)
## initial  value 365.837892 
## iter  10 value 29.688186
## iter  20 value 2.173176
## iter  30 value 0.018585
## iter  40 value 0.002157
## iter  50 value 0.001850
## iter  60 value 0.001671
## iter  70 value 0.000175
## iter  70 value 0.000092
## iter  70 value 0.000090
## final  value 0.000090 
## converged
## 
## Step:  AIC=24
## species ~ bill_length_mm + bill_depth_mm + body_mass_g + sex + 
##     year
## 
## # weights:  18 (10 variable)
## initial  value 365.837892 
## iter  10 value 145.597733
## iter  20 value 133.588157
## iter  30 value 133.557418
## iter  40 value 133.531903
## final  value 133.358517 
## converged
## # weights:  18 (10 variable)
## initial  value 365.837892 
## iter  10 value 29.102517
## iter  20 value 5.398473
## iter  30 value 4.687579
## iter  40 value 4.582359
## iter  50 value 4.561633
## iter  60 value 4.557843
## iter  70 value 4.557615
## iter  80 value 4.557541
## iter  90 value 4.556795
## iter 100 value 4.556677
## final  value 4.556677 
## stopped after 100 iterations
## # weights:  18 (10 variable)
## initial  value 365.837892 
## iter  10 value 9.505164
## iter  20 value 5.215222
## iter  30 value 2.122952
## iter  40 value 0.005871
## iter  50 value 0.000406
## iter  60 value 0.000290
## iter  70 value 0.000289
## iter  80 value 0.000289
## iter  90 value 0.000278
## iter 100 value 0.000275
## final  value 0.000275 
## stopped after 100 iterations
## # weights:  18 (10 variable)
## initial  value 365.837892 
## iter  10 value 32.702824
## iter  20 value 4.436647
## iter  30 value 1.747794
## iter  40 value 0.957199
## iter  50 value 0.208096
## iter  60 value 0.174845
## iter  70 value 0.164077
## iter  80 value 0.149767
## iter  90 value 0.142422
## iter 100 value 0.104917
## final  value 0.104917 
## stopped after 100 iterations
## # weights:  18 (10 variable)
## initial  value 365.837892 
## iter  10 value 25.561758
## iter  20 value 1.350538
## iter  30 value 0.065803
## iter  40 value 0.020499
## iter  50 value 0.016911
## iter  60 value 0.012712
## iter  70 value 0.005348
## iter  80 value 0.001290
## iter  90 value 0.000225
## iter 100 value 0.000216
## final  value 0.000216 
## stopped after 100 iterations
##                  Df     AIC
## - year            2  20.000
## - body_mass_g     2  20.001
## - sex             2  20.210
## <none>               24.000
## - bill_depth_mm   2  29.113
## - bill_length_mm  2 286.717
## # weights:  18 (10 variable)
## initial  value 365.837892 
## iter  10 value 25.561758
## iter  20 value 1.350538
## iter  30 value 0.065803
## iter  40 value 0.020499
## iter  50 value 0.016911
## iter  60 value 0.012712
## iter  70 value 0.005348
## iter  80 value 0.001290
## iter  90 value 0.000225
## iter 100 value 0.000216
## final  value 0.000216 
## stopped after 100 iterations
## 
## Step:  AIC=20
## species ~ bill_length_mm + bill_depth_mm + body_mass_g + sex
## 
## # weights:  15 (8 variable)
## initial  value 365.837892 
## iter  10 value 136.772749
## iter  20 value 133.782638
## iter  30 value 133.545483
## final  value 133.538635 
## converged
## # weights:  15 (8 variable)
## initial  value 365.837892 
## iter  10 value 30.179445
## iter  20 value 5.699489
## iter  30 value 5.175455
## iter  40 value 5.153196
## iter  50 value 5.070255
## iter  60 value 4.997789
## iter  70 value 4.893410
## iter  80 value 4.826209
## iter  90 value 4.708542
## iter 100 value 4.695168
## final  value 4.695168 
## stopped after 100 iterations
## # weights:  15 (8 variable)
## initial  value 365.837892 
## iter  10 value 11.953428
## iter  20 value 3.975600
## iter  30 value 2.172365
## iter  40 value 2.111553
## iter  50 value 1.881400
## iter  60 value 1.236767
## iter  70 value 1.133364
## iter  80 value 1.059253
## iter  90 value 0.588576
## iter 100 value 0.490499
## final  value 0.490499 
## stopped after 100 iterations
## # weights:  15 (8 variable)
## initial  value 365.837892 
## iter  10 value 20.098944
## iter  20 value 3.621850
## iter  30 value 1.160548
## iter  40 value 1.108738
## iter  50 value 1.063387
## iter  60 value 0.958674
## iter  70 value 0.926099
## iter  80 value 0.837719
## iter  90 value 0.797644
## iter 100 value 0.741759
## final  value 0.741759 
## stopped after 100 iterations
##                  Df     AIC
## - body_mass_g     2  16.981
## - sex             2  17.484
## <none>               20.000
## - bill_depth_mm   2  25.390
## - bill_length_mm  2 283.077
## # weights:  15 (8 variable)
## initial  value 365.837892 
## iter  10 value 11.953428
## iter  20 value 3.975600
## iter  30 value 2.172365
## iter  40 value 2.111553
## iter  50 value 1.881400
## iter  60 value 1.236767
## iter  70 value 1.133364
## iter  80 value 1.059253
## iter  90 value 0.588576
## iter 100 value 0.490499
## final  value 0.490499 
## stopped after 100 iterations
## 
## Step:  AIC=16.98
## species ~ bill_length_mm + bill_depth_mm + sex
## 
## # weights:  12 (6 variable)
## initial  value 365.837892 
## iter  10 value 150.109767
## iter  20 value 142.419807
## iter  30 value 142.047323
## final  value 142.041293 
## converged
## # weights:  12 (6 variable)
## initial  value 365.837892 
## iter  10 value 139.953898
## iter  20 value 133.697479
## iter  30 value 133.399000
## iter  40 value 133.326376
## final  value 133.325373 
## converged
## # weights:  12 (6 variable)
## initial  value 365.837892 
## iter  10 value 26.650783
## iter  20 value 23.943597
## iter  30 value 23.916873
## iter  40 value 23.901339
## iter  50 value 23.895442
## iter  60 value 23.894251
## final  value 23.892065 
## converged
##                  Df     AIC
## <none>               16.981
## - sex             2  59.784
## - bill_depth_mm   2 278.651
## - bill_length_mm  2 296.083
summary(model1_MLR)
## Call:
## multinom(formula = species ~ ., data = penguins, model = TRUE)
## 
## Coefficients:
##           (Intercept) islandDream islandTorgersen bill_length_mm bill_depth_mm
## Chinstrap  -0.2496435   211.62912        97.64471       32.96624     -60.84017
## Gentoo      0.7633414   -68.84639      -140.79025       28.45770     -42.86214
##           flipper_length_mm body_mass_g   sexmale       year
## Chinstrap        0.01515612  -0.0193027 -70.73544 -0.2242371
## Gentoo          -2.79321661   0.1330873 -67.23109 -0.2391631
## 
## Std. Errors:
##           (Intercept) islandDream islandTorgersen bill_length_mm bill_depth_mm
## Chinstrap  0.02157370  0.02157370    6.871032e-16      0.2205894     0.1689163
## Gentoo     0.02114133  0.02114131    5.401394e-53      1.0993488     0.4376254
##           flipper_length_mm body_mass_g    sexmale     year
## Chinstrap          4.937946    25.79546 0.02113835 42.20323
## Gentoo             4.439678   101.47836 0.02114132 42.45178
## 
## Residual Deviance: 7.087421e-06 
## AIC: 36.00001
summary(model2_MLR)
## Call:
## multinom(formula = species ~ bill_length_mm + bill_depth_mm + 
##     sex, data = penguins, model = TRUE)
## 
## Coefficients:
##           (Intercept) bill_length_mm bill_depth_mm     sexmale
## Chinstrap   -242.9740       14.88938     -21.91411 -36.9051125
## Gentoo       121.0476       14.80645     -44.69711  -0.1301506
## 
## Std. Errors:
##           (Intercept) bill_length_mm bill_depth_mm  sexmale
## Chinstrap    3.985819       9.369182      23.16685 23.33386
## Gentoo       4.071031       9.413971      23.17697 43.25185
## 
## Residual Deviance: 0.9809986 
## AIC: 16.981

Summary

When looking at the summary of model1_MLR, we can see that it has a high AIC value compared to the model2_MLR. Optimized AIC model has only three variables bill_length_mm, bill_depth_mm and sex.

Negative Coefficients of bill_depth_mm and sexmale in Chinstrap can be seen as negatively correlated with the species being Chinstraps. Also positive coefficient of bill_length_mm can be interpret as its being positively correlated with the species being Chinstrap. Chinstrap has a negative coefficient intercept which can be interpret as Chinstraps as a unpopular species.

Positive intercept of Gentoo can be represent Gentoo as one of the popular species. As in Chinstrap species, Gentoo also have a positively correlated bill_length_mm that is positively correlated with Gentoos.