GOAL

Let’s use the Penguin dataset for our assignment. To learn more about the dataset, please visit: https://allisonhorst.github.io/palmerpenguins/articles/intro.html For this assignment, let us use ‘species’ as our outcome or the dependent variable. 1. Logistic Regression with a binary outcome. (40)

The penguin dataset has ‘species’ column. Please check how many categories you have in the species column. Conduct whatever data manipulation you need to do to be able to build a logistic regression with binary outcome. Please explain your reasoning behind your decision as you manipulate the outcome/dependent variable (species).
Please make sure you are evaluating the independent variables appropriately in deciding which ones should be in the model.
Provide variable interpretations in your model.

For your model from #1, please provide: AUC, Accuracy, TPR, FPR, TNR, FNR (20)
Multinomial Logistic Regression. (40)

Please fit it a multinomial logistic regression where your outcome variable is ‘species’.
Please be sure to evaluate the independent variables appropriately to fit your best parsimonious model.
Please be sure to interpret your variables in the model.

Extra credit: what would be some of the fit statistics you would want to evaluate for your model in question #3? Feel free to share whatever you can provide. (10)

Library

library(dplyr)  ## must install, besides tidyverse, otherwise some error

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(caret)  ## main library for this task

## Loading required package: lattice

## Loading required package: ggplot2

library(palmerpenguins)
library(tidyr)
library(corrplot)

## corrplot 0.88 loaded

library(tibble)
library(DT)

library(explore)

library(knitr)
library(caTools)
library(stats)
library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

library(nnet)  ## multi nomial 

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v readr   1.4.0     v stringr 1.4.0
## v purrr   0.3.4     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x purrr::lift()   masks caret::lift()

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:purrr':
## 
##     some

## The following object is masked from 'package:dplyr':
## 
##     recode

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

## 
## Attaching package: 'GGally'

## The following object is masked from 'package:explore':
## 
##     rescale01

library(tidyr)

library(VIM)

## Loading required package: colorspace

## 
## Attaching package: 'colorspace'

## The following object is masked from 'package:pROC':
## 
##     coords

## Loading required package: grid

## Loading required package: data.table

## 
## Attaching package: 'data.table'

## The following object is masked from 'package:purrr':
## 
##     transpose

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

## VIM is ready to use. 
##  Since version 4.0.0 the GUI is in its own package VIMGUI.
## 
##           Please use the package to use the new (and old) GUI.

## Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues

## 
## Attaching package: 'VIM'

## The following object is masked from 'package:datasets':
## 
##     sleep

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

library(psych)

## 
## Attaching package: 'psych'

## The following object is masked from 'package:car':
## 
##     logit

## The following object is masked from 'package:explore':
## 
##     describe

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

library(class)

library(rpart)
library(rpart.plot)

library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(naniar)
library(mice)

## 
## Attaching package: 'mice'

## The following object is masked from 'package:stats':
## 
##     filter

## The following objects are masked from 'package:base':
## 
##     cbind, rbind

library(skimr)

## 
## Attaching package: 'skimr'

## The following object is masked from 'package:naniar':
## 
##     n_complete

library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

DATA

penguins <- read.csv ('penguins.csv', fileEncoding = 'UTF-8-BOM', header=TRUE)

EDA EXPLORATORY DATA ANALYSIS

sprintf("The penguins dataset has %d columns and %d rows", ncol(penguins), nrow(penguins))

## [1] "The penguins dataset has 8 columns and 344 rows"

sprintf("The column names are as follows")

## [1] "The column names are as follows"

colnames(penguins)

## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"

sprintf("The first 5 rows of the dataset are as follows:")

## [1] "The first 5 rows of the dataset are as follows:"

head(penguins,5)

##   species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1  Adelie Torgersen           39.1          18.7               181        3750
## 2  Adelie Torgersen           39.5          17.4               186        3800
## 3  Adelie Torgersen           40.3          18.0               195        3250
## 4  Adelie Torgersen             NA            NA                NA          NA
## 5  Adelie Torgersen           36.7          19.3               193        3450
##      sex year
## 1   male 2007
## 2 female 2007
## 3 female 2007
## 4   <NA> 2007
## 5 female 2007

sprintf("The summary of the dataset is as follows:")

## [1] "The summary of the dataset is as follows:"

summary(penguins)

##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2

MISSING VALUE EXPLORATION

sprintf("The missing data information is as follows:")

## [1] "The missing data information is as follows:"

missing_count <-sapply(penguins, function(x) sum(length(which(is.na(x)))))

data.frame(missing_count)

##                   missing_count
## species                       0
## island                        0
## bill_length_mm                2
## bill_depth_mm                 2
## flipper_length_mm             2
## body_mass_g                   2
## sex                          11
## year                          0

sprintf("Lets remove the missing values rows from the dataset.")

## [1] "Lets remove the missing values rows from the dataset."

penguins_tb <- penguins %>% 
  as_tibble()

# show summary infio
skim(penguins_tb)

Data summary

Name	penguins_tb
Number of rows	344
Number of columns	8
_______________________
Column type frequency:
factor	3
numeric	5
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
species	0	1.00	FALSE	3	Ade: 152, Gen: 124, Chi: 68
island	0	1.00	FALSE	3	Bis: 168, Dre: 124, Tor: 52
sex	11	0.97	FALSE	2	mal: 168, fem: 165

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
bill_length_mm	2	0.99	43.92	5.46	32.1	39.23	44.45	48.5	59.6	<U+2583><U+2587><U+2587><U+2586><U+2581>
bill_depth_mm	2	0.99	17.15	1.97	13.1	15.60	17.30	18.7	21.5	<U+2585><U+2585><U+2587><U+2587><U+2582>
flipper_length_mm	2	0.99	200.92	14.06	172.0	190.00	197.00	213.0	231.0	<U+2582><U+2587><U+2583><U+2585><U+2582>
body_mass_g	2	0.99	4201.75	801.95	2700.0	3550.00	4050.00	4750.0	6300.0	<U+2583><U+2587><U+2586><U+2583><U+2582>
year	0	1.00	2008.03	0.82	2007.0	2007.00	2008.00	2009.0	2009.0	<U+2587><U+2581><U+2587><U+2581><U+2587>

# Visualize Missing Values Using VIM Package

library (VIM)
colSums(is.na(penguins))                 # Count missing values by column

##           species            island    bill_length_mm     bill_depth_mm 
##                 0                 0                 2                 2 
## flipper_length_mm       body_mass_g               sex              year 
##                 2                 2                11                 0

colSums(is.na(penguins)) / nrow(penguins)    # Percentage of missing values by column

##           species            island    bill_length_mm     bill_depth_mm 
##       0.000000000       0.000000000       0.005813953       0.005813953 
## flipper_length_mm       body_mass_g               sex              year 
##       0.005813953       0.005813953       0.031976744       0.000000000

aggr(penguins)                           # Create aggregation plot

We can see that most of the missing’s are in the sex column. We will delete these small amount of missing variables in our analysis.

penguins_visl <- penguins %>%
    rename(
         bill_length = bill_length_mm,
         bill_depth = bill_depth_mm,
         flipper_length = flipper_length_mm,
         body_mass = body_mass_g
         ) %>%
  mutate(species = as.factor(species),
         island = as.factor(island),
         sex = as.factor(substr(sex,1,1))) %>%
  filter(!is.na(bill_depth))

PCA

PRINCIPAL COMPONENT ANALYSIS

penguins_visl.pca <- prcomp (~ bill_length + bill_depth + flipper_length + body_mass,
                    data=penguins_visl,
                    na.action=na.omit,  # not actually necessary: we removed NA
                    scale. = TRUE)

penguins_visl.pca

## Standard deviations (1, .., p=4):
## [1] 1.6594442 0.8789293 0.6043475 0.3293816
## 
## Rotation (n x k) = (4 x 4):
##                       PC1          PC2        PC3        PC4
## bill_length     0.4552503 -0.597031143 -0.6443012  0.1455231
## bill_depth     -0.4003347 -0.797766572  0.4184272 -0.1679860
## flipper_length  0.5760133 -0.002282201  0.2320840 -0.7837987
## body_mass       0.5483502 -0.084362920  0.5966001  0.5798821

screeplot(penguins_visl.pca, type = "line", lwd=3, cex=3, 
        main="Variances of PCA Components")

PCA Results

The PCA analysis is looking at the elbow of the change. Looks like that the variance of 1 variable will explain most of the variances. The drop from one variable to two variable are significant then further, drop from 2 2-3 variables and drop from 3 to fair for variables are less pronounced.

o These two principal components account for 90% of the total variance of these four size variables. o PC1 is largely determined by two variables that are describing the size of penguin ( flipper length and body mass) .
o PC2 is mainly determined by variation in the two beak variables: culmen length and depth. I guess these two variables describing the shape of penguin.

Obviously this high variation is not of desirable to do modeling.

This remains to be seen and the confirmed in the future logistic regression.

DATA VISUALIZATION - OVER ALL VARIABLES

library(car)
library(GGally)

Scatterplot Mtrix

Numerical VR only

scatterplotMatrix(~ bill_length + bill_depth + flipper_length + body_mass | species,
                  data=penguins_visl,
                  ellipse=list(levels=0.68),
                  col = scales::hue_pal()(3),
                  legend=list(coords="bottomright"))

Paired Variable Analysis

Extends to discrete VR (stacked bar)

Plus numerical VR (histogram)

library (GGally)
ggpairs(penguins_visl, mapping = aes(color = species), 
        columns = c("bill_length", "bill_depth", 
                    "flipper_length", "body_mass",
                    "island", "sex"))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can see that most of the distributions are approximately unimodal and symmetric within species.

within species, all pairs of variables are positively correlated

library(kableExtra)

kable(head(penguins)) %>%
       kable_styling(bootstrap_options = c("striped", "hover", "bordered", full_width = F)) %>%
  row_spec(0, color = "white", background = "skyblue") %>%
  column_spec(1, bold = T)

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
Adelie	Torgersen	39.1	18.7	181	3750	male	2007
Adelie	Torgersen	39.5	17.4	186	3800	female	2007
Adelie	Torgersen	40.3	18.0	195	3250	female	2007
Adelie	Torgersen	NA	NA	NA	NA	NA	2007
Adelie	Torgersen	36.7	19.3	193	3450	female	2007
Adelie	Torgersen	39.3	20.6	190	3650	male	2007

p_1 <- penguins %>%
    filter(!is.na(sex)) %>%  ## only non missing sex, many penguins have missing value in  sex and other (missing are in the same obbs)
  group_by(island) %>%
  summarize(
    mean_wt = round(mean(body_mass_g), 1),
    sd_wt = round(sd(body_mass_g), 1),
    length(body_mass_g)
  )

kable(p_1, col.names = c("Island","Mean Body Mass (g)","SD Mean Body Mass (g)","Sample Size (# of Obs. Penguins)")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "bordered", full_width = F)) %>%
  row_spec(0, color = "white", background = "skyblue") %>%
  column_spec(1, bold = T)

Island	Mean Body Mass (g)	SD Mean Body Mass (g)	Sample Size (# of Obs. Penguins)
Biscoe	4719.2	790.9	163
Dream	3718.9	412.9	123
Torgersen	3708.5	451.8	47

DATA VISUALIZATION - SELECTIVE VARIABLES OF INTEREST

p_2 <- penguins %>%
   filter(species == "Adelie") %>%
  group_by(island)

mass_histogram <- ggplot(p_2, aes(x = body_mass_g)) +
  geom_histogram(aes(fill = island), bins = 8, color = "white") +
  facet_wrap(~ island, scale = "free")

mass_histogram

## Warning: Removed 1 rows containing non-finite values (stat_bin).

penguin_boxplot <- ggplot(p_2, aes(x = island, y = body_mass_g)) +
  geom_boxplot(aes(fill = island), alpha = 0.8, show.legend = F) +
  geom_jitter(width = 0.1, alpha = 0.8) +
  labs(x = "Penguin Observations by island ", y = "Body Mass (g)", title = "Adiele Penguin Body Mass Observations ")

penguin_boxplot

## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

## Warning: Removed 1 rows containing missing values (geom_point).

p_non_adelie <- penguins %>%
   filter(species == "Chinstrap" | species == 'Gentoo') %>%
  group_by(island)

mass_histogram <- ggplot(p_2, aes(x = body_mass_g)) +
  geom_histogram(aes(fill = island), bins = 8, color = "white") +
  labs( title = "Adiele Penguin Body Mass Observations ")+
  facet_wrap(~ island, scale = "free")

mass_histogram

## Warning: Removed 1 rows containing non-finite values (stat_bin).

penguin_boxplot <- ggplot(p_non_adelie, aes(x = island, y = body_mass_g)) +
  geom_boxplot(aes(fill = island), alpha = 0.8, show.legend = F) +
  geom_jitter(width = 0.1, alpha = 0.8) +
   labs( title = "Non-Adiele Penguin Body Mass Observations ")

penguin_boxplot

## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

## Warning: Removed 1 rows containing missing values (geom_point).

penguins %>% 
   filter (!is.na(sex)) %>% 
  ggplot (aes (flipper_length_mm, bill_length_mm, 
               color= sex, size= body_mass_g))+
  geom_point (alpha= 0.7)+
  labs(x="Flipper length (mm)", y="Bill length (mm)")+
  facet_wrap (~species)

DATA PRE-PROCESSING FOR MODELING

PART D STARTS HERE (after data read, library)

# remove NAS
penguins <- penguins[complete.cases(penguins), ] 

# size of data
dim(penguins) ## 333*8 VR

## [1] 333   8

Make Binary Outcome -Adelie or Not

library(dplyr)

# change the sepcies outcome into binary, based on Adelie or not
penguins$adelie <- ifelse(penguins$species == 'Adelie', 1, 0)  ## Adelie is cap in original , thats why the error
head(penguins,1)

##   species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1  Adelie Torgersen           39.1          18.7               181        3750
##    sex year adelie
## 1 male 2007      1

dim(penguins) #333*9VR

## [1] 333   9

colnames (penguins)

## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"              "adelie"

penguins %>% dplyr:: select (species)  %>% 
  group_by (species) %>% 
  summarize (count=n())

## # A tibble: 3 x 2
##   species   count
##   <fct>     <int>
## 1 Adelie      146
## 2 Chinstrap    68
## 3 Gentoo      119

penguins %>% dplyr:: select (species, adelie)  %>% 
  group_by (species, adelie) %>% 
  summarize (count=n())

## `summarise()` has grouped output by 'species'. You can override using the `.groups` argument.

## # A tibble: 3 x 3
## # Groups:   species [3]
##   species   adelie count
##   <fct>      <dbl> <int>
## 1 Adelie         1   146
## 2 Chinstrap      0    68
## 3 Gentoo         0   119

df_adelie2<-  dplyr::select(penguins,-'species')

## to drop the original species 
head(df_adelie2,2)

##      island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g    sex
## 1 Torgersen           39.1          18.7               181        3750   male
## 2 Torgersen           39.5          17.4               186        3800 female
##   year adelie
## 1 2007      1
## 2 2007      1

## need to change adelie into factor in the future,
df_adelie2$adelie<- as.factor(df_adelie2$adelie)
head(df_adelie2,3)  ## 333 * 8

##      island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g    sex
## 1 Torgersen           39.1          18.7               181        3750   male
## 2 Torgersen           39.5          17.4               186        3800 female
## 3 Torgersen           40.3          18.0               195        3250 female
##   year adelie
## 1 2007      1
## 2 2007      1
## 3 2007      1

dim(df_adelie2)  ## 333*8

## [1] 333   8

df_adelie2 %>% dplyr:: select (adelie) %>% 
  group_by (adelie) %>% 
summarise(count=n())  ## wrong here, eveyone is adelie, n=333 this is wrong

## # A tibble: 2 x 2
##   adelie count
##   <fct>  <int>
## 1 0        187
## 2 1        146

## now right adelie n = 146

colnames (df_adelie2)

## [1] "island"            "bill_length_mm"    "bill_depth_mm"    
## [4] "flipper_length_mm" "body_mass_g"       "sex"              
## [7] "year"              "adelie"

Dummy Code Categorical Variable

Finally, we’ll want to dummy code our categorical island and sex variables.

library(psych)
# dummy.code: Convert a factor to a matrix of dummy codes

# Converts a factor (categorical) variable to a matrix of dummy codes using a 1 of C-1 binary coding scheme.

# Dummy code the island feature - this creates a new column for each island (0 || 1)

island_dcode <- as.data.frame(dummy.code(df_adelie2$island))
dim(island_dcode)   ###333*3 VR

## [1] 333   3

df_adelie3 <- cbind(df_adelie2, island_dcode)
dim(df_adelie3)  ##333 *11 VR

## [1] 333  11

colnames (df_adelie3)

##  [1] "island"            "bill_length_mm"    "bill_depth_mm"    
##  [4] "flipper_length_mm" "body_mass_g"       "sex"              
##  [7] "year"              "adelie"            "Biscoe"           
## [10] "Dream"             "Torgersen"

# remove the original island factor columns - we will rely on the new dummy columns
df_adelie4 <- df_adelie3 %>% 
  dplyr::select(-island)
dim(df_adelie3)  ##333*11 VR

## [1] 333  11

# dummy code the sex feature - since there are only 2 states, we will only be adding a single column (0 || 1)
df_adelie4$sex <- dummy.code(df_adelie3$sex)

dim(df_adelie4)  ### 333*10

## [1] 333  10

df_adelie5 <- df_adelie4 %>%
  dplyr::select (-year)
dim(df_adelie5)  #333*9

## [1] 333   9

colnames (df_adelie5)

## [1] "bill_length_mm"    "bill_depth_mm"     "flipper_length_mm"
## [4] "body_mass_g"       "sex"               "adelie"           
## [7] "Biscoe"            "Dream"             "Torgersen"

## Code year as factor, instead of dropping it ## not possilbe, bc error
# df_adelie5$year <- as.factor(as.character(df_adelie5$year))
# Error in `$<-.data.frame`(`*tmp*`, year, value = integer(0)) : replacement has 0 rows, data has 333

df_adelie5 %>% dplyr::select (adelie) %>% 
  group_by (adelie) %>% 
  summarise(count=n())

## # A tibble: 2 x 2
##   adelie count
##   <fct>  <int>
## 1 0        187
## 2 1        146

Examine Dataset before modeling

# quick inspect of dataframe
kable(head(df_adelie5)) %>%
  kable_styling(bootstrap_options = "basic")

## Warning in `[<-.data.frame`(`*tmp*`, , isn, value = structure(list(bill_length_mm =
## structure(c("39.1", : provided 9 variables to replace 8 variables

	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	adelie	Biscoe
1	39.1	18.7	181	3750	1	1	0
2	39.5	17.4	186	3800	0	1	1
3	40.3	18.0	195	3250	0	1	1
5	36.7	19.3	193	3450	0	1	1
6	39.3	20.6	190	3650	1	1	0
7	38.9	17.8	181	3625	0	1	1

## factor: adelie, dream: num, Torgersen: num
## everyhing ele: num , this is waht we want
## year: dropped no more

library(car)

library(GGally)

MODELING

LOGISTIC REGRESSION

To Predict Adelie Species or Not

Refers to: STHDA-Stepwise Logistic Regression Essential http://www.sthda.com/english/articles/36-classification-methods-essentials/150-stepwise-logistic-regression-essentials-in-r/ Computing stepwise logistique regression

Full logistic regression model
Perform stepwise variable selection
Compare the full and the stepwise models

To perform stepwise model selection based on Akaike information criterion (AIC) value optimization.

Full logistic regression model

model1_fullmodel <- glm (adelie~., family=binomial (link='logit'),data=df_adelie5)

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

coef(model1_fullmodel)

##       (Intercept)    bill_length_mm     bill_depth_mm flipper_length_mm 
##      365.84155736      -23.41050269       32.50218136       -0.04903018 
##       body_mass_g           sexmale         sexfemale            Biscoe 
##        0.02729885       10.23181146                NA      -51.89053890 
##             Dream         Torgersen 
##      -44.76170285                NA

summary(model1_fullmodel)

## 
## Call:
## glm(formula = adelie ~ ., family = binomial(link = "logit"), 
##     data = df_adelie5)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -1.239e-04  -2.100e-08  -2.100e-08   2.100e-08   9.978e-05  
## 
## Coefficients: (2 not defined because of singularities)
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)        3.658e+02  3.193e+05   0.001    0.999
## bill_length_mm    -2.341e+01  5.504e+03  -0.004    0.997
## bill_depth_mm      3.250e+01  1.274e+04   0.003    0.998
## flipper_length_mm -4.903e-02  7.429e+02   0.000    1.000
## body_mass_g        2.730e-02  2.428e+01   0.001    0.999
## sexmale            1.023e+01  5.289e+04   0.000    1.000
## sexfemale                 NA         NA      NA       NA
## Biscoe            -5.189e+01  4.391e+04  -0.001    0.999
## Dream             -4.476e+01  6.464e+04  -0.001    0.999
## Torgersen                 NA         NA      NA       NA
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4.5658e+02  on 332  degrees of freedom
## Residual deviance: 4.5520e-08  on 325  degrees of freedom
## AIC: 16
## 
## Number of Fisher Scoring iterations: 25

# ;log: algorithm did not converge

AIC Reduced Model

To perform stepwise model selection based on Akaike information criterion (AIC) value optimization.

library(MASS)
# modelAIC<- stepAIC(model1)
modelAIC<-model1_fullmodel %>% stepAIC( trace=FALSE)

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

coef(modelAIC)

##    (Intercept) bill_length_mm  bill_depth_mm          Dream 
##      609.10000      -40.93968       68.41795      -88.10707

summary(modelAIC)

## 
## Call:
## glm(formula = adelie ~ bill_length_mm + bill_depth_mm + Dream, 
##     family = binomial(link = "logit"), data = df_adelie5)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -2.026e-04  -2.100e-08  -2.100e-08   2.100e-08   1.878e-04  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)       609.10  168154.13   0.004    0.997
## bill_length_mm    -40.94    5615.85  -0.007    0.994
## bill_depth_mm      68.42    9486.84   0.007    0.994
## Dream             -88.11  137196.84  -0.001    0.999
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4.5658e+02  on 332  degrees of freedom
## Residual deviance: 1.2158e-07  on 329  degrees of freedom
## AIC: 8
## 
## Number of Fisher Scoring iterations: 25

# Error in stepAIC(model1) : number of rows in use has
## still do not converge

summary(modelAIC$fitted.values)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4384  1.0000  1.0000

MODEL EVALUATION

Confusion Matrix

confusion matrix has True Positive (TP), False Positive (FP), False Negative(FN), and True Negative (TN).

df_adelie5$Predict <- ifelse (modelAIC$fitted.values>.5, 'pos', 'neg')  

CMtable<- table(df_adelie5$adelie, df_adelie5$Predict)

CMtable

##    
##     neg pos
##   0 187   0
##   1   0 146

TP <- CMtable[1]
FP <- CMtable[2]
FN <- CMtable[3]
TN <- CMtable[4]

True Positive (TP), False Positive (FP), False Negative(FN), and True Negative (TN).

on AIC REDUCED MODEL

print (TP)

## [1] 187

print (FP)

## [1] 0

print (FN)

## [1] 0

print (TN)

## [1] 146

ACCURACY

on AIC REDUCED MODEL

Accuracy of our model via Confusion Matrix

accuracy <-sum(diag(CMtable))/sum(CMtable)
print(accuracy)

## [1] 1

ROC CURVE

Sensitivity / specificity is indicated by ROC curve, and AUC (area under curve)

on AIC REDUCED MODEL

#Observe ROC and AUC :
roc(adelie~modelAIC$fitted.values, data = df_adelie5, plot = TRUE, main = "ROC CURVE", col= "blue")

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

## 
## Call:
## roc.formula(formula = adelie ~ modelAIC$fitted.values, data = df_adelie5,     plot = TRUE, main = "ROC CURVE", col = "blue")
## 
## Data: modelAIC$fitted.values in 187 controls (adelie 0) < 146 cases (adelie 1).
## Area under the curve: 1

OPTED OUT -TRAINING TESTING SPLIT

Training Set / Testing Set Split for Logistic Regression

Solving Overfitting Problem

Using Minimum Amounts of Variables

Prior PCA Analysis teslls us 2 Variables are enough

## three predictors
model4 <- glm(adelie~ bill_length_mm  +bill_depth_mm  + body_mass_g , family=binomial(link='logit'), data=df_adelie5)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(model4)

## 
## Call:
## glm(formula = adelie ~ bill_length_mm + bill_depth_mm + body_mass_g, 
##     family = binomial(link = "logit"), data = df_adelie5)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.308   0.000   0.000   0.000   1.711  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)  
## (Intercept)    32.965097  25.641491   1.286   0.1986  
## bill_length_mm -4.903436   2.653993  -1.848   0.0647 .
## bill_depth_mm   8.616112   4.810140   1.791   0.0733 .
## body_mass_g     0.006746   0.003854   1.751   0.0800 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 456.5751  on 332  degrees of freedom
## Residual deviance:   9.6518  on 329  degrees of freedom
## AIC: 17.652
## 
## Number of Fisher Scoring iterations: 13

# glm.fit: fitted probabilities numerically 0 or 1 occurred

## two predictors
model5 <- glm(adelie ~  flipper_length_mm  + bill_depth_mm
                 , data = df_adelie5, family=binomial(link="logit"))
summary(model5)

## 
## Call:
## glm(formula = adelie ~ flipper_length_mm + bill_depth_mm, family = binomial(link = "logit"), 
##     data = df_adelie5)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.58735  -0.19448  -0.07433   0.64787   2.23661  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        26.8488     4.8412   5.546 2.92e-08 ***
## flipper_length_mm  -0.1763     0.0239  -7.376 1.63e-13 ***
## bill_depth_mm       0.4308     0.1230   3.502 0.000463 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 456.58  on 332  degrees of freedom
## Residual deviance: 236.11  on 330  degrees of freedom
## AIC: 242.11
## 
## Number of Fisher Scoring iterations: 6

# two predictors, final choice
model_chosen <- glm(adelie~ bill_length_mm   + body_mass_g , family=binomial(link='logit'), data=df_adelie5)

summary(model_chosen)

## 
## Call:
## glm(formula = adelie ~ bill_length_mm + body_mass_g, family = binomial(link = "logit"), 
##     data = df_adelie5)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.31516  -0.10608  -0.00967   0.09498   2.66818  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    50.0420895  6.8695806   7.285 3.23e-13 ***
## bill_length_mm -1.1304381  0.1668171  -6.777 1.23e-11 ***
## body_mass_g    -0.0003744  0.0005156  -0.726    0.468    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 456.575  on 332  degrees of freedom
## Residual deviance:  89.722  on 330  degrees of freedom
## AIC: 95.722
## 
## Number of Fisher Scoring iterations: 8

##

MODEL EVALUATION FOR CHOSEN MODEL

Using 2 Predictors

Logistic Classification Metrics

Confusion Matrix (CM)

Test the model using penguine_test. confusion matrix has True Positive (TP), False Positive (FP), False Negative(FN), and True Negative (TN).

The confusion matrix avoids “confusion” by measuring the actual and predicted values in a tabular format.

summary(model_chosen$fitted.values)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.001081 0.107628 0.438438 0.984288 0.999997

Confusion Matrix

df_adelie5$Predict <- ifelse (model_chosen$fitted.values>.5, 'pos', 'neg')  

CMtable<- table(df_adelie5$adelie, df_adelie5$Predict)

CMtable

##    
##     neg pos
##   0 177  10
##   1   8 138

TP <- CMtable[1]
FP <- CMtable[2]
FN <- CMtable[3]
TN <- CMtable[4]

print (TP)

## [1] 177

print (FP)

## [1] 8

print (FN)

## [1] 10

print (TN)

## [1] 138

Accuracy of our model via Confusion Matrix

accuracy <-sum(diag(CMtable))/sum(CMtable)
print(accuracy)

## [1] 0.9459459

ROC curve

Sensitivity / specificity is indicated by ROC curve, and AUC (area under curve)

#Observe ROC and AUC :
roc(adelie~model_chosen$fitted.values, data = df_adelie5, plot = TRUE, main = "ROC CURVE", col= "blue")

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

## 
## Call:
## roc.formula(formula = adelie ~ model_chosen$fitted.values, data = df_adelie5,     plot = TRUE, main = "ROC CURVE", col = "blue")
## 
## Data: model_chosen$fitted.values in 187 controls (adelie 0) < 146 cases (adelie 1).
## Area under the curve: 0.9881

Explation - Logistic Regression

We feel that the logistic regression model, to predict the outcome of Penguin species, in this case, “Adelie” species versus not, that’s not really the best approach. The model has a very significant overfitting problem period even after the AIC reduction, still this problem persist. There is a prominent one to two variables that explains almost 95% of the variants come up leaving the rest of the 10 variables almost powerless in prediction.

This overfitting problem is reflected in the true positive and true negative value, confusion matrix. In The Roc curve, the AUC is close to 100%, for almost all the models.

I tried to overcome the overfitting problems by looking at individual variables 1 by 1, and still find that to not very helpful.

So the logistic regression really have a huge limitation in this model, which lean too the reason that why we need to use and learn other machine learning models, such as KNN, random forest, who is not as restricted by the colinearity in the variables.

Logistic Regression Ends Above

Although training and testing model will work for logistic regression prediction overall, I decide to leave that part out.

Because looking from the overall sample, all of the logistic regression model is highly skewed to overfitting, so this model is really free not fitting for training, therefore the testing model will not be accurate.

Second reason is that the sample size has only three forty observations, which means 80% to 20% split will leave a few dozens of observations in the testing data.

Given that the accuracy of model is already poor to start with, having a even smaller sample size for training and testing purpose really will not work period so I decided to leave that part entirely out from this project.

MultiNomial Regression

Starts Below

DATA PREPROCESSING FOR MultiNomial Model

For the multi nomial analysis, I will use the original outcome as species, which has three categories. Before any analysis, I am doing the usual preprocessing of the data, taking out the missing variables, especially in sex column.

library(nnet)
library(dplyr)
library(MASS)
## Now we refer to the original penguins dataset:
data(penguins)


# #Remove NAs
penguins <- penguins %>% na.omit()
is.factor(penguins$species)

## [1] TRUE

## remove year variable, island varaible
penguins6 <- penguins %>% 
  dplyr::select (-c(year, island))

dim(penguins6)

## [1] 333   6

Examine Dataset before modeling

dim(penguins6)

## [1] 333   6

library(dplyr)
# quick inspect of dataframe
kable(head(penguins6)) %>%
  kable_styling(bootstrap_options = "basic")

species	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
Adelie	39.1	18.7	181	3750	male
Adelie	39.5	17.4	186	3800	female
Adelie	40.3	18.0	195	3250	female
Adelie	36.7	19.3	193	3450	female
Adelie	39.3	20.6	190	3650	male
Adelie	38.9	17.8	181	3625	female

## factor: adelie, dream: num, Torgersen: num
## everyhing ele: num , this is waht we want
## year: dropped no more

For the 3 species, the Adelie species is the comparison, and the other two species (Chinstrap, Gentoo) are compared aginst Adelie species.

FULL MODEL

MultiNomial

levels(penguins6$species)

## [1] "Adelie"    "Chinstrap" "Gentoo"

#Build multinomial logistic regression model (with all vars):
s_model_1 <- multinom (species ~., data=penguins6, model = TRUE)

## # weights:  21 (12 variable)
## initial  value 365.837892 
## iter  10 value 16.719027
## iter  20 value 3.047829
## iter  30 value 0.103484
## iter  40 value 0.063985
## iter  50 value 0.006201
## iter  60 value 0.003924
## iter  70 value 0.001733
## iter  80 value 0.001547
## iter  90 value 0.001513
## iter 100 value 0.001498
## final  value 0.001498 
## stopped after 100 iterations

summary(s_model_1)

## Call:
## multinom(formula = species ~ ., data = penguins6, model = TRUE)
## 
## Coefficients:
##           (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap -103.422229       16.51642     -24.41758       -0.31003929
## Gentoo      -6.756321       12.21924     -30.09757       -0.08758996
##            body_mass_g    sexmale
## Chinstrap -0.030200928 -15.820607
## Gentoo     0.005019684  -8.132101
## 
## Std. Errors:
##           (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap   0.5715154      27.749774      27.43640          4.097365
## Gentoo      0.5715734       8.968787      33.25608         11.905692
##           body_mass_g  sexmale
## Chinstrap   0.1350169 3.077521
## Gentoo      0.5848602 3.498288
## 
## Residual Deviance: 0.002996934 
## AIC: 24.003

REDUCED MODEL BASED ON AIC

MultiNomial

#Optimize model based on AIC value (as per #2):
s_model_2 <- stepAIC(s_model_1)

## Start:  AIC=24
## species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + 
##     body_mass_g + sex
## 
## # weights:  18 (10 variable)
## initial  value 365.837892 
## iter  10 value 129.185715
## iter  20 value 114.018556
## iter  30 value 113.336460
## iter  40 value 113.216108
## final  value 113.215886 
## converged
## # weights:  18 (10 variable)
## initial  value 365.837892 
## iter  10 value 66.463118
## iter  20 value 12.514139
## iter  30 value 7.743531
## iter  40 value 5.662963
## iter  50 value 4.350499
## iter  60 value 4.145847
## iter  70 value 4.093510
## iter  80 value 3.984724
## iter  90 value 3.903126
## iter 100 value 3.831010
## final  value 3.831010 
## stopped after 100 iterations
## # weights:  18 (10 variable)
## initial  value 365.837892 
## iter  10 value 25.561758
## iter  20 value 1.350538
## iter  30 value 0.065803
## iter  40 value 0.020499
## iter  50 value 0.016911
## iter  60 value 0.012712
## iter  70 value 0.005348
## iter  80 value 0.001290
## iter  90 value 0.000225
## iter 100 value 0.000216
## final  value 0.000216 
## stopped after 100 iterations
## # weights:  18 (10 variable)
## initial  value 365.837892 
## iter  10 value 9.579038
## iter  20 value 4.114389
## iter  30 value 2.340467
## iter  40 value 0.614487
## iter  50 value 0.055642
## iter  60 value 0.012303
## iter  70 value 0.007327
## iter  80 value 0.007163
## iter  90 value 0.007015
## iter 100 value 0.006995
## final  value 0.006995 
## stopped after 100 iterations
## # weights:  18 (10 variable)
## initial  value 365.837892 
## iter  10 value 16.321214
## iter  20 value 3.754897
## iter  30 value 1.631859
## iter  40 value 0.012427
## iter  50 value 0.001125
## iter  60 value 0.001108
## iter  70 value 0.001006
## iter  80 value 0.000906
## iter  90 value 0.000498
## iter 100 value 0.000498
## final  value 0.000498 
## stopped after 100 iterations
##                     Df     AIC
## - flipper_length_mm  2  20.000
## - sex                2  20.001
## - body_mass_g        2  20.014
## <none>                  24.003
## - bill_depth_mm      2  27.662
## - bill_length_mm     2 246.432
## # weights:  18 (10 variable)
## initial  value 365.837892 
## iter  10 value 25.561758
## iter  20 value 1.350538
## iter  30 value 0.065803
## iter  40 value 0.020499
## iter  50 value 0.016911
## iter  60 value 0.012712
## iter  70 value 0.005348
## iter  80 value 0.001290
## iter  90 value 0.000225
## iter 100 value 0.000216
## final  value 0.000216 
## stopped after 100 iterations
## 
## Step:  AIC=20
## species ~ bill_length_mm + bill_depth_mm + body_mass_g + sex
## 
## # weights:  15 (8 variable)
## initial  value 365.837892 
## iter  10 value 136.772749
## iter  20 value 133.782638
## iter  30 value 133.545483
## final  value 133.538635 
## converged
## # weights:  15 (8 variable)
## initial  value 365.837892 
## iter  10 value 30.179445
## iter  20 value 5.699489
## iter  30 value 5.175455
## iter  40 value 5.153196
## iter  50 value 5.070255
## iter  60 value 4.997789
## iter  70 value 4.893410
## iter  80 value 4.826209
## iter  90 value 4.708542
## iter 100 value 4.695168
## final  value 4.695168 
## stopped after 100 iterations
## # weights:  15 (8 variable)
## initial  value 365.837892 
## iter  10 value 11.953428
## iter  20 value 3.975600
## iter  30 value 2.172365
## iter  40 value 2.111553
## iter  50 value 1.881400
## iter  60 value 1.236767
## iter  70 value 1.133364
## iter  80 value 1.059253
## iter  90 value 0.588576
## iter 100 value 0.490499
## final  value 0.490499 
## stopped after 100 iterations
## # weights:  15 (8 variable)
## initial  value 365.837892 
## iter  10 value 20.098944
## iter  20 value 3.621850
## iter  30 value 1.160548
## iter  40 value 1.108738
## iter  50 value 1.063387
## iter  60 value 0.958674
## iter  70 value 0.926099
## iter  80 value 0.837719
## iter  90 value 0.797644
## iter 100 value 0.741759
## final  value 0.741759 
## stopped after 100 iterations
##                  Df     AIC
## - body_mass_g     2  16.981
## - sex             2  17.484
## <none>               20.000
## - bill_depth_mm   2  25.390
## - bill_length_mm  2 283.077
## # weights:  15 (8 variable)
## initial  value 365.837892 
## iter  10 value 11.953428
## iter  20 value 3.975600
## iter  30 value 2.172365
## iter  40 value 2.111553
## iter  50 value 1.881400
## iter  60 value 1.236767
## iter  70 value 1.133364
## iter  80 value 1.059253
## iter  90 value 0.588576
## iter 100 value 0.490499
## final  value 0.490499 
## stopped after 100 iterations
## 
## Step:  AIC=16.98
## species ~ bill_length_mm + bill_depth_mm + sex
## 
## # weights:  12 (6 variable)
## initial  value 365.837892 
## iter  10 value 150.109767
## iter  20 value 142.419807
## iter  30 value 142.047323
## final  value 142.041293 
## converged
## # weights:  12 (6 variable)
## initial  value 365.837892 
## iter  10 value 139.953898
## iter  20 value 133.697479
## iter  30 value 133.399000
## iter  40 value 133.326376
## final  value 133.325373 
## converged
## # weights:  12 (6 variable)
## initial  value 365.837892 
## iter  10 value 26.650783
## iter  20 value 23.943597
## iter  30 value 23.916873
## iter  40 value 23.901339
## iter  50 value 23.895442
## iter  60 value 23.894251
## final  value 23.892065 
## converged
##                  Df     AIC
## <none>               16.981
## - sex             2  59.784
## - bill_depth_mm   2 278.651
## - bill_length_mm  2 296.083

#Review summary statistics
summary(s_model_2)

## Call:
## multinom(formula = species ~ bill_length_mm + bill_depth_mm + 
##     sex, data = penguins6, model = TRUE)
## 
## Coefficients:
##           (Intercept) bill_length_mm bill_depth_mm     sexmale
## Chinstrap   -242.9740       14.88938     -21.91411 -36.9051125
## Gentoo       121.0476       14.80645     -44.69711  -0.1301506
## 
## Std. Errors:
##           (Intercept) bill_length_mm bill_depth_mm  sexmale
## Chinstrap    3.985819       9.369182      23.16685 23.33386
## Gentoo       4.071031       9.413971      23.17697 43.25185
## 
## Residual Deviance: 0.9809986 
## AIC: 16.981

Explation - MultiNomial Regression

The multinomial analysis definitely is a better fit than the logistic regression analysis. The full model and the AIC reduced model show significant difference, and the convergence problem do not exist, at least for these two models only. The AIC is 16.9, which is lot more satisfying. Chinstrap and the geeto, footballs these species, the bill lands is the most significant variables in prediction, compared with the Adelie species.

REFERENCE

https://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/ MULTINOMIAL LOGISTIC REGRESSION | R DATA ANALYSIS EXAMPLES

mlogit package: Multinomial logit model Description Estimation by maximum likelihood of the multinomial logit model, with alternative-specific and/or individual specific variables.

Exercise 1: Multinomial logit model https://cran.r-project.org/web/packages/mlogit/vignettes/e1mlogit.html

https://www.joshuapkeller.com/page/introregression/ Interpretation of Logistic Regression Chapter 18-Logistic Regression

https://www.datacamp.com/community/tutorials/logistic-regression-R

https://statisticsglobe.com/report-missing-values-in-data-frame-in-r

Logistic Regression and Multinomial Model on Penguin Data

Hui Gracie Han

GOAL

Library

DATA

EDA EXPLORATORY DATA ANALYSIS

MISSING VALUE EXPLORATION

PCA

PRINCIPAL COMPONENT ANALYSIS

PCA Results

DATA VISUALIZATION - OVER ALL VARIABLES

Scatterplot Mtrix

Numerical VR only

Paired Variable Analysis

Extends to discrete VR (stacked bar)

Plus numerical VR (histogram)

DATA VISUALIZATION - SELECTIVE VARIABLES OF INTEREST

DATA PRE-PROCESSING FOR MODELING

Make Binary Outcome -Adelie or Not

Dummy Code Categorical Variable

MODELING

LOGISTIC REGRESSION

To Predict Adelie Species or Not

Full logistic regression model

AIC Reduced Model

MODEL EVALUATION

Confusion Matrix

True Positive (TP), False Positive (FP), False Negative(FN), and True Negative (TN).

on AIC REDUCED MODEL

ACCURACY

on AIC REDUCED MODEL

ROC CURVE

Sensitivity / specificity is indicated by ROC curve, and AUC (area under curve)

on AIC REDUCED MODEL

OPTED OUT -TRAINING TESTING SPLIT

Solving Overfitting Problem

Using Minimum Amounts of Variables

Prior PCA Analysis teslls us 2 Variables are enough

MODEL EVALUATION FOR CHOSEN MODEL

Using 2 Predictors

Logistic Classification Metrics

Confusion Matrix (CM)

ROC curve

Sensitivity / specificity is indicated by ROC curve, and AUC (area under curve)

Explation - Logistic Regression

Logistic Regression Ends Above

MultiNomial Regression

Starts Below

DATA PREPROCESSING FOR MultiNomial Model

Examine Dataset before modeling

FULL MODEL

MultiNomial

REDUCED MODEL BASED ON AIC

MultiNomial

Explation - MultiNomial Regression

REFERENCE