GOAL

Let’s use the Penguin dataset for our assignment. To learn more about the dataset, please visit: https://allisonhorst.github.io/palmerpenguins/articles/intro.html For this assignment, let us use ‘species’ as our outcome or the dependent variable. 1. Logistic Regression with a binary outcome. (40)

  1. The penguin dataset has ‘species’ column. Please check how many categories you have in the species column. Conduct whatever data manipulation you need to do to be able to build a logistic regression with binary outcome. Please explain your reasoning behind your decision as you manipulate the outcome/dependent variable (species).

  2. Please make sure you are evaluating the independent variables appropriately in deciding which ones should be in the model.

  3. Provide variable interpretations in your model.

  1. For your model from #1, please provide: AUC, Accuracy, TPR, FPR, TNR, FNR (20)
  2. Multinomial Logistic Regression. (40)
  1. Please fit it a multinomial logistic regression where your outcome variable is ‘species’.

  2. Please be sure to evaluate the independent variables appropriately to fit your best parsimonious model.

  3. Please be sure to interpret your variables in the model.

  1. Extra credit: what would be some of the fit statistics you would want to evaluate for your model in question #3? Feel free to share whatever you can provide. (10)

Library

library(dplyr)  ## must install, besides tidyverse, otherwise some error
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(caret)  ## main library for this task
## Loading required package: lattice
## Loading required package: ggplot2
library(palmerpenguins)
library(tidyr)
library(corrplot)
## corrplot 0.88 loaded
library(tibble)
library(DT)

library(explore)

library(knitr)
library(caTools)
library(stats)
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(nnet)  ## multi nomial 

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v readr   1.4.0     v stringr 1.4.0
## v purrr   0.3.4     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x purrr::lift()   masks caret::lift()
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:purrr':
## 
##     some
## The following object is masked from 'package:dplyr':
## 
##     recode
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:explore':
## 
##     rescale01
library(tidyr)

library(VIM)
## Loading required package: colorspace
## 
## Attaching package: 'colorspace'
## The following object is masked from 'package:pROC':
## 
##     coords
## Loading required package: grid
## Loading required package: data.table
## 
## Attaching package: 'data.table'
## The following object is masked from 'package:purrr':
## 
##     transpose
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## VIM is ready to use. 
##  Since version 4.0.0 the GUI is in its own package VIMGUI.
## 
##           Please use the package to use the new (and old) GUI.
## Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues
## 
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
## 
##     sleep
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:car':
## 
##     logit
## The following object is masked from 'package:explore':
## 
##     describe
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(class)

library(rpart)
library(rpart.plot)

library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(naniar)
library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(skimr)
## 
## Attaching package: 'skimr'
## The following object is masked from 'package:naniar':
## 
##     n_complete
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

DATA

penguins <- read.csv ('penguins.csv', fileEncoding = 'UTF-8-BOM', header=TRUE)

EDA EXPLORATORY DATA ANALYSIS

sprintf("The penguins dataset has %d columns and %d rows", ncol(penguins), nrow(penguins))
## [1] "The penguins dataset has 8 columns and 344 rows"
sprintf("The column names are as follows")
## [1] "The column names are as follows"
colnames(penguins)
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"
sprintf("The first 5 rows of the dataset are as follows:")
## [1] "The first 5 rows of the dataset are as follows:"
head(penguins,5)
##   species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1  Adelie Torgersen           39.1          18.7               181        3750
## 2  Adelie Torgersen           39.5          17.4               186        3800
## 3  Adelie Torgersen           40.3          18.0               195        3250
## 4  Adelie Torgersen             NA            NA                NA          NA
## 5  Adelie Torgersen           36.7          19.3               193        3450
##      sex year
## 1   male 2007
## 2 female 2007
## 3 female 2007
## 4   <NA> 2007
## 5 female 2007
sprintf("The summary of the dataset is as follows:")
## [1] "The summary of the dataset is as follows:"
summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2

MISSING VALUE EXPLORATION

sprintf("The missing data information is as follows:")
## [1] "The missing data information is as follows:"
missing_count <-sapply(penguins, function(x) sum(length(which(is.na(x)))))

data.frame(missing_count)
##                   missing_count
## species                       0
## island                        0
## bill_length_mm                2
## bill_depth_mm                 2
## flipper_length_mm             2
## body_mass_g                   2
## sex                          11
## year                          0
sprintf("Lets remove the missing values rows from the dataset.")
## [1] "Lets remove the missing values rows from the dataset."
penguins_tb <- penguins %>% 
  as_tibble()

# show summary infio
skim(penguins_tb)
Data summary
Name penguins_tb
Number of rows 344
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1.00 FALSE 3 Ade: 152, Gen: 124, Chi: 68
island 0 1.00 FALSE 3 Bis: 168, Dre: 124, Tor: 52
sex 11 0.97 FALSE 2 mal: 168, fem: 165

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 <U+2583><U+2587><U+2587><U+2586><U+2581>
bill_depth_mm 2 0.99 17.15 1.97 13.1 15.60 17.30 18.7 21.5 <U+2585><U+2585><U+2587><U+2587><U+2582>
flipper_length_mm 2 0.99 200.92 14.06 172.0 190.00 197.00 213.0 231.0 <U+2582><U+2587><U+2583><U+2585><U+2582>
body_mass_g 2 0.99 4201.75 801.95 2700.0 3550.00 4050.00 4750.0 6300.0 <U+2583><U+2587><U+2586><U+2583><U+2582>
year 0 1.00 2008.03 0.82 2007.0 2007.00 2008.00 2009.0 2009.0 <U+2587><U+2581><U+2587><U+2581><U+2587>
# Visualize Missing Values Using VIM Package

library (VIM)
colSums(is.na(penguins))                 # Count missing values by column
##           species            island    bill_length_mm     bill_depth_mm 
##                 0                 0                 2                 2 
## flipper_length_mm       body_mass_g               sex              year 
##                 2                 2                11                 0
colSums(is.na(penguins)) / nrow(penguins)    # Percentage of missing values by column
##           species            island    bill_length_mm     bill_depth_mm 
##       0.000000000       0.000000000       0.005813953       0.005813953 
## flipper_length_mm       body_mass_g               sex              year 
##       0.005813953       0.005813953       0.031976744       0.000000000
aggr(penguins)                           # Create aggregation plot

We can see that most of the missing’s are in the sex column. We will delete these small amount of missing variables in our analysis.

penguins_visl <- penguins %>%
    rename(
         bill_length = bill_length_mm,
         bill_depth = bill_depth_mm,
         flipper_length = flipper_length_mm,
         body_mass = body_mass_g
         ) %>%
  mutate(species = as.factor(species),
         island = as.factor(island),
         sex = as.factor(substr(sex,1,1))) %>%
  filter(!is.na(bill_depth))

PCA

PRINCIPAL COMPONENT ANALYSIS

penguins_visl.pca <- prcomp (~ bill_length + bill_depth + flipper_length + body_mass,
                    data=penguins_visl,
                    na.action=na.omit,  # not actually necessary: we removed NA
                    scale. = TRUE)

penguins_visl.pca
## Standard deviations (1, .., p=4):
## [1] 1.6594442 0.8789293 0.6043475 0.3293816
## 
## Rotation (n x k) = (4 x 4):
##                       PC1          PC2        PC3        PC4
## bill_length     0.4552503 -0.597031143 -0.6443012  0.1455231
## bill_depth     -0.4003347 -0.797766572  0.4184272 -0.1679860
## flipper_length  0.5760133 -0.002282201  0.2320840 -0.7837987
## body_mass       0.5483502 -0.084362920  0.5966001  0.5798821
screeplot(penguins_visl.pca, type = "line", lwd=3, cex=3, 
        main="Variances of PCA Components")

PCA Results

The PCA analysis is looking at the elbow of the change. Looks like that the variance of 1 variable will explain most of the variances. The drop from one variable to two variable are significant then further, drop from 2 2-3 variables and drop from 3 to fair for variables are less pronounced.

o These two principal components account for 90% of the total variance of these four size variables. o PC1 is largely determined by two variables that are describing the size of penguin ( flipper length and body mass) .
o PC2 is mainly determined by variation in the two beak variables: culmen length and depth. I guess these two variables describing the shape of penguin.

Obviously this high variation is not of desirable to do modeling.

This remains to be seen and the confirmed in the future logistic regression.

DATA VISUALIZATION - OVER ALL VARIABLES

library(car)
library(GGally)

Scatterplot Mtrix

Numerical VR only

scatterplotMatrix(~ bill_length + bill_depth + flipper_length + body_mass | species,
                  data=penguins_visl,
                  ellipse=list(levels=0.68),
                  col = scales::hue_pal()(3),
                  legend=list(coords="bottomright"))

Paired Variable Analysis

Extends to discrete VR (stacked bar)

Plus numerical VR (histogram)

library (GGally)
ggpairs(penguins_visl, mapping = aes(color = species), 
        columns = c("bill_length", "bill_depth", 
                    "flipper_length", "body_mass",
                    "island", "sex"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can see that most of the distributions are approximately unimodal and symmetric within species.

within species, all pairs of variables are positively correlated

library(kableExtra)
kable(head(penguins)) %>%
       kable_styling(bootstrap_options = c("striped", "hover", "bordered", full_width = F)) %>%
  row_spec(0, color = "white", background = "skyblue") %>%
  column_spec(1, bold = T)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen 39.1 18.7 181 3750 male 2007
Adelie Torgersen 39.5 17.4 186 3800 female 2007
Adelie Torgersen 40.3 18.0 195 3250 female 2007
Adelie Torgersen NA NA NA NA NA 2007
Adelie Torgersen 36.7 19.3 193 3450 female 2007
Adelie Torgersen 39.3 20.6 190 3650 male 2007
p_1 <- penguins %>%
    filter(!is.na(sex)) %>%  ## only non missing sex, many penguins have missing value in  sex and other (missing are in the same obbs)
  group_by(island) %>%
  summarize(
    mean_wt = round(mean(body_mass_g), 1),
    sd_wt = round(sd(body_mass_g), 1),
    length(body_mass_g)
  )
kable(p_1, col.names = c("Island","Mean Body Mass (g)","SD Mean Body Mass (g)","Sample Size (# of Obs. Penguins)")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "bordered", full_width = F)) %>%
  row_spec(0, color = "white", background = "skyblue") %>%
  column_spec(1, bold = T)
Island Mean Body Mass (g) SD Mean Body Mass (g) Sample Size (# of Obs. Penguins)
Biscoe 4719.2 790.9 163
Dream 3718.9 412.9 123
Torgersen 3708.5 451.8 47

DATA VISUALIZATION - SELECTIVE VARIABLES OF INTEREST

p_2 <- penguins %>%
   filter(species == "Adelie") %>%
  group_by(island)

mass_histogram <- ggplot(p_2, aes(x = body_mass_g)) +
  geom_histogram(aes(fill = island), bins = 8, color = "white") +
  facet_wrap(~ island, scale = "free")

mass_histogram
## Warning: Removed 1 rows containing non-finite values (stat_bin).

penguin_boxplot <- ggplot(p_2, aes(x = island, y = body_mass_g)) +
  geom_boxplot(aes(fill = island), alpha = 0.8, show.legend = F) +
  geom_jitter(width = 0.1, alpha = 0.8) +
  labs(x = "Penguin Observations by island ", y = "Body Mass (g)", title = "Adiele Penguin Body Mass Observations ")

penguin_boxplot
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1 rows containing missing values (geom_point).

p_non_adelie <- penguins %>%
   filter(species == "Chinstrap" | species == 'Gentoo') %>%
  group_by(island)

mass_histogram <- ggplot(p_2, aes(x = body_mass_g)) +
  geom_histogram(aes(fill = island), bins = 8, color = "white") +
  labs( title = "Adiele Penguin Body Mass Observations ")+
  facet_wrap(~ island, scale = "free")

mass_histogram
## Warning: Removed 1 rows containing non-finite values (stat_bin).

penguin_boxplot <- ggplot(p_non_adelie, aes(x = island, y = body_mass_g)) +
  geom_boxplot(aes(fill = island), alpha = 0.8, show.legend = F) +
  geom_jitter(width = 0.1, alpha = 0.8) +
   labs( title = "Non-Adiele Penguin Body Mass Observations ")

penguin_boxplot
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1 rows containing missing values (geom_point).

penguins %>% 
   filter (!is.na(sex)) %>% 
  ggplot (aes (flipper_length_mm, bill_length_mm, 
               color= sex, size= body_mass_g))+
  geom_point (alpha= 0.7)+
  labs(x="Flipper length (mm)", y="Bill length (mm)")+
  facet_wrap (~species)

DATA PRE-PROCESSING FOR MODELING

PART D STARTS HERE (after data read, library)

# remove NAS
penguins <- penguins[complete.cases(penguins), ] 

# size of data
dim(penguins) ## 333*8 VR
## [1] 333   8

Make Binary Outcome -Adelie or Not

library(dplyr)

# change the sepcies outcome into binary, based on Adelie or not
penguins$adelie <- ifelse(penguins$species == 'Adelie', 1, 0)  ## Adelie is cap in original , thats why the error
head(penguins,1)
##   species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1  Adelie Torgersen           39.1          18.7               181        3750
##    sex year adelie
## 1 male 2007      1
dim(penguins) #333*9VR
## [1] 333   9
colnames (penguins)
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"              "adelie"
penguins %>% dplyr:: select (species)  %>% 
  group_by (species) %>% 
  summarize (count=n())
## # A tibble: 3 x 2
##   species   count
##   <fct>     <int>
## 1 Adelie      146
## 2 Chinstrap    68
## 3 Gentoo      119
penguins %>% dplyr:: select (species, adelie)  %>% 
  group_by (species, adelie) %>% 
  summarize (count=n())
## `summarise()` has grouped output by 'species'. You can override using the `.groups` argument.
## # A tibble: 3 x 3
## # Groups:   species [3]
##   species   adelie count
##   <fct>      <dbl> <int>
## 1 Adelie         1   146
## 2 Chinstrap      0    68
## 3 Gentoo         0   119
df_adelie2<-  dplyr::select(penguins,-'species')

## to drop the original species 
head(df_adelie2,2)
##      island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g    sex
## 1 Torgersen           39.1          18.7               181        3750   male
## 2 Torgersen           39.5          17.4               186        3800 female
##   year adelie
## 1 2007      1
## 2 2007      1
## need to change adelie into factor in the future,
df_adelie2$adelie<- as.factor(df_adelie2$adelie)
head(df_adelie2,3)  ## 333 * 8
##      island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g    sex
## 1 Torgersen           39.1          18.7               181        3750   male
## 2 Torgersen           39.5          17.4               186        3800 female
## 3 Torgersen           40.3          18.0               195        3250 female
##   year adelie
## 1 2007      1
## 2 2007      1
## 3 2007      1
dim(df_adelie2)  ## 333*8
## [1] 333   8
df_adelie2 %>% dplyr:: select (adelie) %>% 
  group_by (adelie) %>% 
summarise(count=n())  ## wrong here, eveyone is adelie, n=333 this is wrong
## # A tibble: 2 x 2
##   adelie count
##   <fct>  <int>
## 1 0        187
## 2 1        146
## now right adelie n = 146

colnames (df_adelie2)
## [1] "island"            "bill_length_mm"    "bill_depth_mm"    
## [4] "flipper_length_mm" "body_mass_g"       "sex"              
## [7] "year"              "adelie"

Dummy Code Categorical Variable

Finally, we’ll want to dummy code our categorical island and sex variables.

library(psych)
# dummy.code: Convert a factor to a matrix of dummy codes

# Converts a factor (categorical) variable to a matrix of dummy codes using a 1 of C-1 binary coding scheme.

# Dummy code the island feature - this creates a new column for each island (0 || 1)

island_dcode <- as.data.frame(dummy.code(df_adelie2$island))
dim(island_dcode)   ###333*3 VR
## [1] 333   3
df_adelie3 <- cbind(df_adelie2, island_dcode)
dim(df_adelie3)  ##333 *11 VR
## [1] 333  11
colnames (df_adelie3)
##  [1] "island"            "bill_length_mm"    "bill_depth_mm"    
##  [4] "flipper_length_mm" "body_mass_g"       "sex"              
##  [7] "year"              "adelie"            "Biscoe"           
## [10] "Dream"             "Torgersen"
# remove the original island factor columns - we will rely on the new dummy columns
df_adelie4 <- df_adelie3 %>% 
  dplyr::select(-island)
dim(df_adelie3)  ##333*11 VR
## [1] 333  11
# dummy code the sex feature - since there are only 2 states, we will only be adding a single column (0 || 1)
df_adelie4$sex <- dummy.code(df_adelie3$sex)

dim(df_adelie4)  ### 333*10
## [1] 333  10
df_adelie5 <- df_adelie4 %>%
  dplyr::select (-year)
dim(df_adelie5)  #333*9
## [1] 333   9
colnames (df_adelie5)
## [1] "bill_length_mm"    "bill_depth_mm"     "flipper_length_mm"
## [4] "body_mass_g"       "sex"               "adelie"           
## [7] "Biscoe"            "Dream"             "Torgersen"
## Code year as factor, instead of dropping it ## not possilbe, bc error
# df_adelie5$year <- as.factor(as.character(df_adelie5$year))
# Error in `$<-.data.frame`(`*tmp*`, year, value = integer(0)) : replacement has 0 rows, data has 333
df_adelie5 %>% dplyr::select (adelie) %>% 
  group_by (adelie) %>% 
  summarise(count=n())
## # A tibble: 2 x 2
##   adelie count
##   <fct>  <int>
## 1 0        187
## 2 1        146

Examine Dataset before modeling

# quick inspect of dataframe
kable(head(df_adelie5)) %>%
  kable_styling(bootstrap_options = "basic")
## Warning in `[<-.data.frame`(`*tmp*`, , isn, value = structure(list(bill_length_mm =
## structure(c("39.1", : provided 9 variables to replace 8 variables
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex adelie Biscoe Dream Torgersen
1 39.1 18.7 181 3750 1 1 0 0 0
2 39.5 17.4 186 3800 0 1 1 0 0
3 40.3 18.0 195 3250 0 1 1 0 0
5 36.7 19.3 193 3450 0 1 1 0 0
6 39.3 20.6 190 3650 1 1 0 0 0
7 38.9 17.8 181 3625 0 1 1 0 0
## factor: adelie, dream: num, Torgersen: num
## everyhing ele: num , this is waht we want
## year: dropped no more
library(car)

library(GGally)

MODELING

LOGISTIC REGRESSION

To Predict Adelie Species or Not

Refers to: STHDA-Stepwise Logistic Regression Essential http://www.sthda.com/english/articles/36-classification-methods-essentials/150-stepwise-logistic-regression-essentials-in-r/ Computing stepwise logistique regression

Full logistic regression model
Perform stepwise variable selection
Compare the full and the stepwise models

To perform stepwise model selection based on Akaike information criterion (AIC) value optimization.

Full logistic regression model

model1_fullmodel <- glm (adelie~., family=binomial (link='logit'),data=df_adelie5)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
coef(model1_fullmodel)
##       (Intercept)    bill_length_mm     bill_depth_mm flipper_length_mm 
##      365.84155736      -23.41050269       32.50218136       -0.04903018 
##       body_mass_g           sexmale         sexfemale            Biscoe 
##        0.02729885       10.23181146                NA      -51.89053890 
##             Dream         Torgersen 
##      -44.76170285                NA
summary(model1_fullmodel)
## 
## Call:
## glm(formula = adelie ~ ., family = binomial(link = "logit"), 
##     data = df_adelie5)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -1.239e-04  -2.100e-08  -2.100e-08   2.100e-08   9.978e-05  
## 
## Coefficients: (2 not defined because of singularities)
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)        3.658e+02  3.193e+05   0.001    0.999
## bill_length_mm    -2.341e+01  5.504e+03  -0.004    0.997
## bill_depth_mm      3.250e+01  1.274e+04   0.003    0.998
## flipper_length_mm -4.903e-02  7.429e+02   0.000    1.000
## body_mass_g        2.730e-02  2.428e+01   0.001    0.999
## sexmale            1.023e+01  5.289e+04   0.000    1.000
## sexfemale                 NA         NA      NA       NA
## Biscoe            -5.189e+01  4.391e+04  -0.001    0.999
## Dream             -4.476e+01  6.464e+04  -0.001    0.999
## Torgersen                 NA         NA      NA       NA
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4.5658e+02  on 332  degrees of freedom
## Residual deviance: 4.5520e-08  on 325  degrees of freedom
## AIC: 16
## 
## Number of Fisher Scoring iterations: 25
# ;log: algorithm did not converge

AIC Reduced Model

To perform stepwise model selection based on Akaike information criterion (AIC) value optimization.

library(MASS)
# modelAIC<- stepAIC(model1)
modelAIC<-model1_fullmodel %>% stepAIC( trace=FALSE)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
coef(modelAIC)
##    (Intercept) bill_length_mm  bill_depth_mm          Dream 
##      609.10000      -40.93968       68.41795      -88.10707
summary(modelAIC)
## 
## Call:
## glm(formula = adelie ~ bill_length_mm + bill_depth_mm + Dream, 
##     family = binomial(link = "logit"), data = df_adelie5)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -2.026e-04  -2.100e-08  -2.100e-08   2.100e-08   1.878e-04  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)       609.10  168154.13   0.004    0.997
## bill_length_mm    -40.94    5615.85  -0.007    0.994
## bill_depth_mm      68.42    9486.84   0.007    0.994
## Dream             -88.11  137196.84  -0.001    0.999
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4.5658e+02  on 332  degrees of freedom
## Residual deviance: 1.2158e-07  on 329  degrees of freedom
## AIC: 8
## 
## Number of Fisher Scoring iterations: 25
# Error in stepAIC(model1) : number of rows in use has
## still do not converge
summary(modelAIC$fitted.values)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4384  1.0000  1.0000

MODEL EVALUATION

Confusion Matrix

confusion matrix has True Positive (TP), False Positive (FP), False Negative(FN), and True Negative (TN).

df_adelie5$Predict <- ifelse (modelAIC$fitted.values>.5, 'pos', 'neg')  

CMtable<- table(df_adelie5$adelie, df_adelie5$Predict)

CMtable
##    
##     neg pos
##   0 187   0
##   1   0 146
TP <- CMtable[1]
FP <- CMtable[2]
FN <- CMtable[3]
TN <- CMtable[4]

True Positive (TP), False Positive (FP), False Negative(FN), and True Negative (TN).

on AIC REDUCED MODEL
print (TP)
## [1] 187
print (FP)
## [1] 0
print (FN)
## [1] 0
print (TN)
## [1] 146

ACCURACY

on AIC REDUCED MODEL

Accuracy of our model via Confusion Matrix

accuracy <-sum(diag(CMtable))/sum(CMtable)
print(accuracy)
## [1] 1

ROC CURVE

Sensitivity / specificity is indicated by ROC curve, and AUC (area under curve)
on AIC REDUCED MODEL
#Observe ROC and AUC :
roc(adelie~modelAIC$fitted.values, data = df_adelie5, plot = TRUE, main = "ROC CURVE", col= "blue")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## 
## Call:
## roc.formula(formula = adelie ~ modelAIC$fitted.values, data = df_adelie5,     plot = TRUE, main = "ROC CURVE", col = "blue")
## 
## Data: modelAIC$fitted.values in 187 controls (adelie 0) < 146 cases (adelie 1).
## Area under the curve: 1
# 

OPTED OUT -TRAINING TESTING SPLIT

Training Set / Testing Set Split for Logistic Regression

Solving Overfitting Problem

Using Minimum Amounts of Variables

Prior PCA Analysis teslls us 2 Variables are enough

## three predictors
model4 <- glm(adelie~ bill_length_mm  +bill_depth_mm  + body_mass_g , family=binomial(link='logit'), data=df_adelie5)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model4)
## 
## Call:
## glm(formula = adelie ~ bill_length_mm + bill_depth_mm + body_mass_g, 
##     family = binomial(link = "logit"), data = df_adelie5)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.308   0.000   0.000   0.000   1.711  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)  
## (Intercept)    32.965097  25.641491   1.286   0.1986  
## bill_length_mm -4.903436   2.653993  -1.848   0.0647 .
## bill_depth_mm   8.616112   4.810140   1.791   0.0733 .
## body_mass_g     0.006746   0.003854   1.751   0.0800 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 456.5751  on 332  degrees of freedom
## Residual deviance:   9.6518  on 329  degrees of freedom
## AIC: 17.652
## 
## Number of Fisher Scoring iterations: 13
# glm.fit: fitted probabilities numerically 0 or 1 occurred
## two predictors
model5 <- glm(adelie ~  flipper_length_mm  + bill_depth_mm
                 , data = df_adelie5, family=binomial(link="logit"))
summary(model5)
## 
## Call:
## glm(formula = adelie ~ flipper_length_mm + bill_depth_mm, family = binomial(link = "logit"), 
##     data = df_adelie5)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.58735  -0.19448  -0.07433   0.64787   2.23661  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        26.8488     4.8412   5.546 2.92e-08 ***
## flipper_length_mm  -0.1763     0.0239  -7.376 1.63e-13 ***
## bill_depth_mm       0.4308     0.1230   3.502 0.000463 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 456.58  on 332  degrees of freedom
## Residual deviance: 236.11  on 330  degrees of freedom
## AIC: 242.11
## 
## Number of Fisher Scoring iterations: 6
# two predictors, final choice
model_chosen <- glm(adelie~ bill_length_mm   + body_mass_g , family=binomial(link='logit'), data=df_adelie5)

summary(model_chosen)
## 
## Call:
## glm(formula = adelie ~ bill_length_mm + body_mass_g, family = binomial(link = "logit"), 
##     data = df_adelie5)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.31516  -0.10608  -0.00967   0.09498   2.66818  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    50.0420895  6.8695806   7.285 3.23e-13 ***
## bill_length_mm -1.1304381  0.1668171  -6.777 1.23e-11 ***
## body_mass_g    -0.0003744  0.0005156  -0.726    0.468    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 456.575  on 332  degrees of freedom
## Residual deviance:  89.722  on 330  degrees of freedom
## AIC: 95.722
## 
## Number of Fisher Scoring iterations: 8
## 

MODEL EVALUATION FOR CHOSEN MODEL

Using 2 Predictors

Logistic Classification Metrics
Confusion Matrix (CM)

Test the model using penguine_test. confusion matrix has True Positive (TP), False Positive (FP), False Negative(FN), and True Negative (TN).

The confusion matrix avoids “confusion” by measuring the actual and predicted values in a tabular format.

summary(model_chosen$fitted.values)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.001081 0.107628 0.438438 0.984288 0.999997

Confusion Matrix

df_adelie5$Predict <- ifelse (model_chosen$fitted.values>.5, 'pos', 'neg')  

CMtable<- table(df_adelie5$adelie, df_adelie5$Predict)

CMtable
##    
##     neg pos
##   0 177  10
##   1   8 138
TP <- CMtable[1]
FP <- CMtable[2]
FN <- CMtable[3]
TN <- CMtable[4]
print (TP)
## [1] 177
print (FP)
## [1] 8
print (FN)
## [1] 10
print (TN)
## [1] 138

Accuracy of our model via Confusion Matrix

accuracy <-sum(diag(CMtable))/sum(CMtable)
print(accuracy)
## [1] 0.9459459

ROC curve

Sensitivity / specificity is indicated by ROC curve, and AUC (area under curve)

#Observe ROC and AUC :
roc(adelie~model_chosen$fitted.values, data = df_adelie5, plot = TRUE, main = "ROC CURVE", col= "blue")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## 
## Call:
## roc.formula(formula = adelie ~ model_chosen$fitted.values, data = df_adelie5,     plot = TRUE, main = "ROC CURVE", col = "blue")
## 
## Data: model_chosen$fitted.values in 187 controls (adelie 0) < 146 cases (adelie 1).
## Area under the curve: 0.9881
# 

Explation - Logistic Regression

We feel that the logistic regression model, to predict the outcome of Penguin species, in this case, “Adelie” species versus not, that’s not really the best approach. The model has a very significant overfitting problem period even after the AIC reduction, still this problem persist. There is a prominent one to two variables that explains almost 95% of the variants come up leaving the rest of the 10 variables almost powerless in prediction.

This overfitting problem is reflected in the true positive and true negative value, confusion matrix. In The Roc curve, the AUC is close to 100%, for almost all the models.

I tried to overcome the overfitting problems by looking at individual variables 1 by 1, and still find that to not very helpful.

So the logistic regression really have a huge limitation in this model, which lean too the reason that why we need to use and learn other machine learning models, such as KNN, random forest, who is not as restricted by the colinearity in the variables.

Logistic Regression Ends Above

Although training and testing model will work for logistic regression prediction overall, I decide to leave that part out.

Because looking from the overall sample, all of the logistic regression model is highly skewed to overfitting, so this model is really free not fitting for training, therefore the testing model will not be accurate.

Second reason is that the sample size has only three forty observations, which means 80% to 20% split will leave a few dozens of observations in the testing data.

Given that the accuracy of model is already poor to start with, having a even smaller sample size for training and testing purpose really will not work period so I decided to leave that part entirely out from this project.

MultiNomial Regression

Starts Below

DATA PREPROCESSING FOR MultiNomial Model

For the multi nomial analysis, I will use the original outcome as species, which has three categories. Before any analysis, I am doing the usual preprocessing of the data, taking out the missing variables, especially in sex column.

library(nnet)
library(dplyr)
library(MASS)
## Now we refer to the original penguins dataset:
data(penguins)


# #Remove NAs
penguins <- penguins %>% na.omit()
is.factor(penguins$species)
## [1] TRUE
## remove year variable, island varaible
penguins6 <- penguins %>% 
  dplyr::select (-c(year, island))

dim(penguins6)
## [1] 333   6

Examine Dataset before modeling

dim(penguins6)
## [1] 333   6
library(dplyr)
# quick inspect of dataframe
kable(head(penguins6)) %>%
  kable_styling(bootstrap_options = "basic")
species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
Adelie 39.1 18.7 181 3750 male
Adelie 39.5 17.4 186 3800 female
Adelie 40.3 18.0 195 3250 female
Adelie 36.7 19.3 193 3450 female
Adelie 39.3 20.6 190 3650 male
Adelie 38.9 17.8 181 3625 female
## factor: adelie, dream: num, Torgersen: num
## everyhing ele: num , this is waht we want
## year: dropped no more

For the multi nomial analysis, I will use the original outcome as species, which has three categories. Before any analysis, I am doing the usual preprocessing of the data, taking out the missing variables, especially in sex column.

For the 3 species, the Adelie species is the comparison, and the other two species (Chinstrap, Gentoo) are compared aginst Adelie species.

FULL MODEL

MultiNomial

levels(penguins6$species)
## [1] "Adelie"    "Chinstrap" "Gentoo"
#Build multinomial logistic regression model (with all vars):
s_model_1 <- multinom (species ~., data=penguins6, model = TRUE)
## # weights:  21 (12 variable)
## initial  value 365.837892 
## iter  10 value 16.719027
## iter  20 value 3.047829
## iter  30 value 0.103484
## iter  40 value 0.063985
## iter  50 value 0.006201
## iter  60 value 0.003924
## iter  70 value 0.001733
## iter  80 value 0.001547
## iter  90 value 0.001513
## iter 100 value 0.001498
## final  value 0.001498 
## stopped after 100 iterations
summary(s_model_1)
## Call:
## multinom(formula = species ~ ., data = penguins6, model = TRUE)
## 
## Coefficients:
##           (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap -103.422229       16.51642     -24.41758       -0.31003929
## Gentoo      -6.756321       12.21924     -30.09757       -0.08758996
##            body_mass_g    sexmale
## Chinstrap -0.030200928 -15.820607
## Gentoo     0.005019684  -8.132101
## 
## Std. Errors:
##           (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap   0.5715154      27.749774      27.43640          4.097365
## Gentoo      0.5715734       8.968787      33.25608         11.905692
##           body_mass_g  sexmale
## Chinstrap   0.1350169 3.077521
## Gentoo      0.5848602 3.498288
## 
## Residual Deviance: 0.002996934 
## AIC: 24.003

REDUCED MODEL BASED ON AIC

MultiNomial

#Optimize model based on AIC value (as per #2):
s_model_2 <- stepAIC(s_model_1)
## Start:  AIC=24
## species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + 
##     body_mass_g + sex
## 
## # weights:  18 (10 variable)
## initial  value 365.837892 
## iter  10 value 129.185715
## iter  20 value 114.018556
## iter  30 value 113.336460
## iter  40 value 113.216108
## final  value 113.215886 
## converged
## # weights:  18 (10 variable)
## initial  value 365.837892 
## iter  10 value 66.463118
## iter  20 value 12.514139
## iter  30 value 7.743531
## iter  40 value 5.662963
## iter  50 value 4.350499
## iter  60 value 4.145847
## iter  70 value 4.093510
## iter  80 value 3.984724
## iter  90 value 3.903126
## iter 100 value 3.831010
## final  value 3.831010 
## stopped after 100 iterations
## # weights:  18 (10 variable)
## initial  value 365.837892 
## iter  10 value 25.561758
## iter  20 value 1.350538
## iter  30 value 0.065803
## iter  40 value 0.020499
## iter  50 value 0.016911
## iter  60 value 0.012712
## iter  70 value 0.005348
## iter  80 value 0.001290
## iter  90 value 0.000225
## iter 100 value 0.000216
## final  value 0.000216 
## stopped after 100 iterations
## # weights:  18 (10 variable)
## initial  value 365.837892 
## iter  10 value 9.579038
## iter  20 value 4.114389
## iter  30 value 2.340467
## iter  40 value 0.614487
## iter  50 value 0.055642
## iter  60 value 0.012303
## iter  70 value 0.007327
## iter  80 value 0.007163
## iter  90 value 0.007015
## iter 100 value 0.006995
## final  value 0.006995 
## stopped after 100 iterations
## # weights:  18 (10 variable)
## initial  value 365.837892 
## iter  10 value 16.321214
## iter  20 value 3.754897
## iter  30 value 1.631859
## iter  40 value 0.012427
## iter  50 value 0.001125
## iter  60 value 0.001108
## iter  70 value 0.001006
## iter  80 value 0.000906
## iter  90 value 0.000498
## iter 100 value 0.000498
## final  value 0.000498 
## stopped after 100 iterations
##                     Df     AIC
## - flipper_length_mm  2  20.000
## - sex                2  20.001
## - body_mass_g        2  20.014
## <none>                  24.003
## - bill_depth_mm      2  27.662
## - bill_length_mm     2 246.432
## # weights:  18 (10 variable)
## initial  value 365.837892 
## iter  10 value 25.561758
## iter  20 value 1.350538
## iter  30 value 0.065803
## iter  40 value 0.020499
## iter  50 value 0.016911
## iter  60 value 0.012712
## iter  70 value 0.005348
## iter  80 value 0.001290
## iter  90 value 0.000225
## iter 100 value 0.000216
## final  value 0.000216 
## stopped after 100 iterations
## 
## Step:  AIC=20
## species ~ bill_length_mm + bill_depth_mm + body_mass_g + sex
## 
## # weights:  15 (8 variable)
## initial  value 365.837892 
## iter  10 value 136.772749
## iter  20 value 133.782638
## iter  30 value 133.545483
## final  value 133.538635 
## converged
## # weights:  15 (8 variable)
## initial  value 365.837892 
## iter  10 value 30.179445
## iter  20 value 5.699489
## iter  30 value 5.175455
## iter  40 value 5.153196
## iter  50 value 5.070255
## iter  60 value 4.997789
## iter  70 value 4.893410
## iter  80 value 4.826209
## iter  90 value 4.708542
## iter 100 value 4.695168
## final  value 4.695168 
## stopped after 100 iterations
## # weights:  15 (8 variable)
## initial  value 365.837892 
## iter  10 value 11.953428
## iter  20 value 3.975600
## iter  30 value 2.172365
## iter  40 value 2.111553
## iter  50 value 1.881400
## iter  60 value 1.236767
## iter  70 value 1.133364
## iter  80 value 1.059253
## iter  90 value 0.588576
## iter 100 value 0.490499
## final  value 0.490499 
## stopped after 100 iterations
## # weights:  15 (8 variable)
## initial  value 365.837892 
## iter  10 value 20.098944
## iter  20 value 3.621850
## iter  30 value 1.160548
## iter  40 value 1.108738
## iter  50 value 1.063387
## iter  60 value 0.958674
## iter  70 value 0.926099
## iter  80 value 0.837719
## iter  90 value 0.797644
## iter 100 value 0.741759
## final  value 0.741759 
## stopped after 100 iterations
##                  Df     AIC
## - body_mass_g     2  16.981
## - sex             2  17.484
## <none>               20.000
## - bill_depth_mm   2  25.390
## - bill_length_mm  2 283.077
## # weights:  15 (8 variable)
## initial  value 365.837892 
## iter  10 value 11.953428
## iter  20 value 3.975600
## iter  30 value 2.172365
## iter  40 value 2.111553
## iter  50 value 1.881400
## iter  60 value 1.236767
## iter  70 value 1.133364
## iter  80 value 1.059253
## iter  90 value 0.588576
## iter 100 value 0.490499
## final  value 0.490499 
## stopped after 100 iterations
## 
## Step:  AIC=16.98
## species ~ bill_length_mm + bill_depth_mm + sex
## 
## # weights:  12 (6 variable)
## initial  value 365.837892 
## iter  10 value 150.109767
## iter  20 value 142.419807
## iter  30 value 142.047323
## final  value 142.041293 
## converged
## # weights:  12 (6 variable)
## initial  value 365.837892 
## iter  10 value 139.953898
## iter  20 value 133.697479
## iter  30 value 133.399000
## iter  40 value 133.326376
## final  value 133.325373 
## converged
## # weights:  12 (6 variable)
## initial  value 365.837892 
## iter  10 value 26.650783
## iter  20 value 23.943597
## iter  30 value 23.916873
## iter  40 value 23.901339
## iter  50 value 23.895442
## iter  60 value 23.894251
## final  value 23.892065 
## converged
##                  Df     AIC
## <none>               16.981
## - sex             2  59.784
## - bill_depth_mm   2 278.651
## - bill_length_mm  2 296.083
#Review summary statistics
summary(s_model_2)
## Call:
## multinom(formula = species ~ bill_length_mm + bill_depth_mm + 
##     sex, data = penguins6, model = TRUE)
## 
## Coefficients:
##           (Intercept) bill_length_mm bill_depth_mm     sexmale
## Chinstrap   -242.9740       14.88938     -21.91411 -36.9051125
## Gentoo       121.0476       14.80645     -44.69711  -0.1301506
## 
## Std. Errors:
##           (Intercept) bill_length_mm bill_depth_mm  sexmale
## Chinstrap    3.985819       9.369182      23.16685 23.33386
## Gentoo       4.071031       9.413971      23.17697 43.25185
## 
## Residual Deviance: 0.9809986 
## AIC: 16.981

Explation - MultiNomial Regression

The multinomial analysis definitely is a better fit than the logistic regression analysis. The full model and the AIC reduced model show significant difference, and the convergence problem do not exist, at least for these two models only. The AIC is 16.9, which is lot more satisfying. Chinstrap and the geeto, footballs these species, the bill lands is the most significant variables in prediction, compared with the Adelie species.

REFERENCE

https://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/ MULTINOMIAL LOGISTIC REGRESSION | R DATA ANALYSIS EXAMPLES

mlogit package: Multinomial logit model Description Estimation by maximum likelihood of the multinomial logit model, with alternative-specific and/or individual specific variables.

Exercise 1: Multinomial logit model https://cran.r-project.org/web/packages/mlogit/vignettes/e1mlogit.html

https://www.joshuapkeller.com/page/introregression/ Interpretation of Logistic Regression Chapter 18-Logistic Regression

https://www.datacamp.com/community/tutorials/logistic-regression-R

https://statisticsglobe.com/report-missing-values-in-data-frame-in-r