Let’s use the Penguin dataset for our assignment. To learn more about the dataset, please visit: https://allisonhorst.github.io/palmerpenguins/articles/intro.html For this assignment, let us use ‘species’ as our outcome or the dependent variable. 1. Logistic Regression with a binary outcome. (40)
The penguin dataset has ‘species’ column. Please check how many categories you have in the species column. Conduct whatever data manipulation you need to do to be able to build a logistic regression with binary outcome. Please explain your reasoning behind your decision as you manipulate the outcome/dependent variable (species).
Please make sure you are evaluating the independent variables appropriately in deciding which ones should be in the model.
Provide variable interpretations in your model.
Please fit it a multinomial logistic regression where your outcome variable is ‘species’.
Please be sure to evaluate the independent variables appropriately to fit your best parsimonious model.
Please be sure to interpret your variables in the model.
library(dplyr) ## must install, besides tidyverse, otherwise some error
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(caret) ## main library for this task
## Loading required package: lattice
## Loading required package: ggplot2
library(palmerpenguins)
library(tidyr)
library(corrplot)
## corrplot 0.88 loaded
library(tibble)
library(DT)
library(explore)
library(knitr)
library(caTools)
library(stats)
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(nnet) ## multi nomial
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v readr 1.4.0 v stringr 1.4.0
## v purrr 0.3.4 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::lift() masks caret::lift()
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:purrr':
##
## some
## The following object is masked from 'package:dplyr':
##
## recode
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:explore':
##
## rescale01
library(tidyr)
library(VIM)
## Loading required package: colorspace
##
## Attaching package: 'colorspace'
## The following object is masked from 'package:pROC':
##
## coords
## Loading required package: grid
## Loading required package: data.table
##
## Attaching package: 'data.table'
## The following object is masked from 'package:purrr':
##
## transpose
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## VIM is ready to use.
## Since version 4.0.0 the GUI is in its own package VIMGUI.
##
## Please use the package to use the new (and old) GUI.
## Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:car':
##
## logit
## The following object is masked from 'package:explore':
##
## describe
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(class)
library(rpart)
library(rpart.plot)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(naniar)
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(skimr)
##
## Attaching package: 'skimr'
## The following object is masked from 'package:naniar':
##
## n_complete
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
penguins <- read.csv ('penguins.csv', fileEncoding = 'UTF-8-BOM', header=TRUE)
sprintf("The penguins dataset has %d columns and %d rows", ncol(penguins), nrow(penguins))
## [1] "The penguins dataset has 8 columns and 344 rows"
sprintf("The column names are as follows")
## [1] "The column names are as follows"
colnames(penguins)
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
sprintf("The first 5 rows of the dataset are as follows:")
## [1] "The first 5 rows of the dataset are as follows:"
head(penguins,5)
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18.0 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## sex year
## 1 male 2007
## 2 female 2007
## 3 female 2007
## 4 <NA> 2007
## 5 female 2007
sprintf("The summary of the dataset is as follows:")
## [1] "The summary of the dataset is as follows:"
summary(penguins)
## species island bill_length_mm bill_depth_mm
## Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
## Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
## Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_length_mm body_mass_g sex year
## Min. :172.0 Min. :2700 female:165 Min. :2007
## 1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
## Median :197.0 Median :4050 NA's : 11 Median :2008
## Mean :200.9 Mean :4202 Mean :2008
## 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
## Max. :231.0 Max. :6300 Max. :2009
## NA's :2 NA's :2
sprintf("The missing data information is as follows:")
## [1] "The missing data information is as follows:"
missing_count <-sapply(penguins, function(x) sum(length(which(is.na(x)))))
data.frame(missing_count)
## missing_count
## species 0
## island 0
## bill_length_mm 2
## bill_depth_mm 2
## flipper_length_mm 2
## body_mass_g 2
## sex 11
## year 0
sprintf("Lets remove the missing values rows from the dataset.")
## [1] "Lets remove the missing values rows from the dataset."
penguins_tb <- penguins %>%
as_tibble()
# show summary infio
skim(penguins_tb)
Name | penguins_tb |
Number of rows | 344 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
factor | 3 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | <U+2583><U+2587><U+2587><U+2586><U+2581> |
bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | <U+2585><U+2585><U+2587><U+2587><U+2582> |
flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | <U+2582><U+2587><U+2583><U+2585><U+2582> |
body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | <U+2583><U+2587><U+2586><U+2583><U+2582> |
year | 0 | 1.00 | 2008.03 | 0.82 | 2007.0 | 2007.00 | 2008.00 | 2009.0 | 2009.0 | <U+2587><U+2581><U+2587><U+2581><U+2587> |
# Visualize Missing Values Using VIM Package
library (VIM)
colSums(is.na(penguins)) # Count missing values by column
## species island bill_length_mm bill_depth_mm
## 0 0 2 2
## flipper_length_mm body_mass_g sex year
## 2 2 11 0
colSums(is.na(penguins)) / nrow(penguins) # Percentage of missing values by column
## species island bill_length_mm bill_depth_mm
## 0.000000000 0.000000000 0.005813953 0.005813953
## flipper_length_mm body_mass_g sex year
## 0.005813953 0.005813953 0.031976744 0.000000000
aggr(penguins) # Create aggregation plot
We can see that most of the missing’s are in the sex column. We will delete these small amount of missing variables in our analysis.
penguins_visl <- penguins %>%
rename(
bill_length = bill_length_mm,
bill_depth = bill_depth_mm,
flipper_length = flipper_length_mm,
body_mass = body_mass_g
) %>%
mutate(species = as.factor(species),
island = as.factor(island),
sex = as.factor(substr(sex,1,1))) %>%
filter(!is.na(bill_depth))
penguins_visl.pca <- prcomp (~ bill_length + bill_depth + flipper_length + body_mass,
data=penguins_visl,
na.action=na.omit, # not actually necessary: we removed NA
scale. = TRUE)
penguins_visl.pca
## Standard deviations (1, .., p=4):
## [1] 1.6594442 0.8789293 0.6043475 0.3293816
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## bill_length 0.4552503 -0.597031143 -0.6443012 0.1455231
## bill_depth -0.4003347 -0.797766572 0.4184272 -0.1679860
## flipper_length 0.5760133 -0.002282201 0.2320840 -0.7837987
## body_mass 0.5483502 -0.084362920 0.5966001 0.5798821
screeplot(penguins_visl.pca, type = "line", lwd=3, cex=3,
main="Variances of PCA Components")
The PCA analysis is looking at the elbow of the change. Looks like that the variance of 1 variable will explain most of the variances. The drop from one variable to two variable are significant then further, drop from 2 2-3 variables and drop from 3 to fair for variables are less pronounced.
o These two principal components account for 90% of the total variance of these four size variables. o PC1 is largely determined by two variables that are describing the size of penguin ( flipper length and body mass) .
o PC2 is mainly determined by variation in the two beak variables: culmen length and depth. I guess these two variables describing the shape of penguin.
Obviously this high variation is not of desirable to do modeling.
This remains to be seen and the confirmed in the future logistic regression.
library(car)
library(GGally)
scatterplotMatrix(~ bill_length + bill_depth + flipper_length + body_mass | species,
data=penguins_visl,
ellipse=list(levels=0.68),
col = scales::hue_pal()(3),
legend=list(coords="bottomright"))
library (GGally)
ggpairs(penguins_visl, mapping = aes(color = species),
columns = c("bill_length", "bill_depth",
"flipper_length", "body_mass",
"island", "sex"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can see that most of the distributions are approximately unimodal and symmetric within species.
within species, all pairs of variables are positively correlated
library(kableExtra)
kable(head(penguins)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "bordered", full_width = F)) %>%
row_spec(0, color = "white", background = "skyblue") %>%
column_spec(1, bold = T)
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|
Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
p_1 <- penguins %>%
filter(!is.na(sex)) %>% ## only non missing sex, many penguins have missing value in sex and other (missing are in the same obbs)
group_by(island) %>%
summarize(
mean_wt = round(mean(body_mass_g), 1),
sd_wt = round(sd(body_mass_g), 1),
length(body_mass_g)
)
kable(p_1, col.names = c("Island","Mean Body Mass (g)","SD Mean Body Mass (g)","Sample Size (# of Obs. Penguins)")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "bordered", full_width = F)) %>%
row_spec(0, color = "white", background = "skyblue") %>%
column_spec(1, bold = T)
Island | Mean Body Mass (g) | SD Mean Body Mass (g) | Sample Size (# of Obs. Penguins) |
---|---|---|---|
Biscoe | 4719.2 | 790.9 | 163 |
Dream | 3718.9 | 412.9 | 123 |
Torgersen | 3708.5 | 451.8 | 47 |
p_2 <- penguins %>%
filter(species == "Adelie") %>%
group_by(island)
mass_histogram <- ggplot(p_2, aes(x = body_mass_g)) +
geom_histogram(aes(fill = island), bins = 8, color = "white") +
facet_wrap(~ island, scale = "free")
mass_histogram
## Warning: Removed 1 rows containing non-finite values (stat_bin).
penguin_boxplot <- ggplot(p_2, aes(x = island, y = body_mass_g)) +
geom_boxplot(aes(fill = island), alpha = 0.8, show.legend = F) +
geom_jitter(width = 0.1, alpha = 0.8) +
labs(x = "Penguin Observations by island ", y = "Body Mass (g)", title = "Adiele Penguin Body Mass Observations ")
penguin_boxplot
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1 rows containing missing values (geom_point).
p_non_adelie <- penguins %>%
filter(species == "Chinstrap" | species == 'Gentoo') %>%
group_by(island)
mass_histogram <- ggplot(p_2, aes(x = body_mass_g)) +
geom_histogram(aes(fill = island), bins = 8, color = "white") +
labs( title = "Adiele Penguin Body Mass Observations ")+
facet_wrap(~ island, scale = "free")
mass_histogram
## Warning: Removed 1 rows containing non-finite values (stat_bin).
penguin_boxplot <- ggplot(p_non_adelie, aes(x = island, y = body_mass_g)) +
geom_boxplot(aes(fill = island), alpha = 0.8, show.legend = F) +
geom_jitter(width = 0.1, alpha = 0.8) +
labs( title = "Non-Adiele Penguin Body Mass Observations ")
penguin_boxplot
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1 rows containing missing values (geom_point).
penguins %>%
filter (!is.na(sex)) %>%
ggplot (aes (flipper_length_mm, bill_length_mm,
color= sex, size= body_mass_g))+
geom_point (alpha= 0.7)+
labs(x="Flipper length (mm)", y="Bill length (mm)")+
facet_wrap (~species)
PART D STARTS HERE (after data read, library)
# remove NAS
penguins <- penguins[complete.cases(penguins), ]
# size of data
dim(penguins) ## 333*8 VR
## [1] 333 8
library(dplyr)
# change the sepcies outcome into binary, based on Adelie or not
penguins$adelie <- ifelse(penguins$species == 'Adelie', 1, 0) ## Adelie is cap in original , thats why the error
head(penguins,1)
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1 Adelie Torgersen 39.1 18.7 181 3750
## sex year adelie
## 1 male 2007 1
dim(penguins) #333*9VR
## [1] 333 9
colnames (penguins)
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year" "adelie"
penguins %>% dplyr:: select (species) %>%
group_by (species) %>%
summarize (count=n())
## # A tibble: 3 x 2
## species count
## <fct> <int>
## 1 Adelie 146
## 2 Chinstrap 68
## 3 Gentoo 119
penguins %>% dplyr:: select (species, adelie) %>%
group_by (species, adelie) %>%
summarize (count=n())
## `summarise()` has grouped output by 'species'. You can override using the `.groups` argument.
## # A tibble: 3 x 3
## # Groups: species [3]
## species adelie count
## <fct> <dbl> <int>
## 1 Adelie 1 146
## 2 Chinstrap 0 68
## 3 Gentoo 0 119
df_adelie2<- dplyr::select(penguins,-'species')
## to drop the original species
head(df_adelie2,2)
## island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 1 Torgersen 39.1 18.7 181 3750 male
## 2 Torgersen 39.5 17.4 186 3800 female
## year adelie
## 1 2007 1
## 2 2007 1
## need to change adelie into factor in the future,
df_adelie2$adelie<- as.factor(df_adelie2$adelie)
head(df_adelie2,3) ## 333 * 8
## island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 1 Torgersen 39.1 18.7 181 3750 male
## 2 Torgersen 39.5 17.4 186 3800 female
## 3 Torgersen 40.3 18.0 195 3250 female
## year adelie
## 1 2007 1
## 2 2007 1
## 3 2007 1
dim(df_adelie2) ## 333*8
## [1] 333 8
df_adelie2 %>% dplyr:: select (adelie) %>%
group_by (adelie) %>%
summarise(count=n()) ## wrong here, eveyone is adelie, n=333 this is wrong
## # A tibble: 2 x 2
## adelie count
## <fct> <int>
## 1 0 187
## 2 1 146
## now right adelie n = 146
colnames (df_adelie2)
## [1] "island" "bill_length_mm" "bill_depth_mm"
## [4] "flipper_length_mm" "body_mass_g" "sex"
## [7] "year" "adelie"
Finally, we’ll want to dummy code our categorical island
and sex
variables.
library(psych)
# dummy.code: Convert a factor to a matrix of dummy codes
# Converts a factor (categorical) variable to a matrix of dummy codes using a 1 of C-1 binary coding scheme.
# Dummy code the island feature - this creates a new column for each island (0 || 1)
island_dcode <- as.data.frame(dummy.code(df_adelie2$island))
dim(island_dcode) ###333*3 VR
## [1] 333 3
df_adelie3 <- cbind(df_adelie2, island_dcode)
dim(df_adelie3) ##333 *11 VR
## [1] 333 11
colnames (df_adelie3)
## [1] "island" "bill_length_mm" "bill_depth_mm"
## [4] "flipper_length_mm" "body_mass_g" "sex"
## [7] "year" "adelie" "Biscoe"
## [10] "Dream" "Torgersen"
# remove the original island factor columns - we will rely on the new dummy columns
df_adelie4 <- df_adelie3 %>%
dplyr::select(-island)
dim(df_adelie3) ##333*11 VR
## [1] 333 11
# dummy code the sex feature - since there are only 2 states, we will only be adding a single column (0 || 1)
df_adelie4$sex <- dummy.code(df_adelie3$sex)
dim(df_adelie4) ### 333*10
## [1] 333 10
df_adelie5 <- df_adelie4 %>%
dplyr::select (-year)
dim(df_adelie5) #333*9
## [1] 333 9
colnames (df_adelie5)
## [1] "bill_length_mm" "bill_depth_mm" "flipper_length_mm"
## [4] "body_mass_g" "sex" "adelie"
## [7] "Biscoe" "Dream" "Torgersen"
## Code year as factor, instead of dropping it ## not possilbe, bc error
# df_adelie5$year <- as.factor(as.character(df_adelie5$year))
# Error in `$<-.data.frame`(`*tmp*`, year, value = integer(0)) : replacement has 0 rows, data has 333
df_adelie5 %>% dplyr::select (adelie) %>%
group_by (adelie) %>%
summarise(count=n())
## # A tibble: 2 x 2
## adelie count
## <fct> <int>
## 1 0 187
## 2 1 146
Examine Dataset before modeling
# quick inspect of dataframe
kable(head(df_adelie5)) %>%
kable_styling(bootstrap_options = "basic")
## Warning in `[<-.data.frame`(`*tmp*`, , isn, value = structure(list(bill_length_mm =
## structure(c("39.1", : provided 9 variables to replace 8 variables
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | adelie | Biscoe | Dream | Torgersen | |
---|---|---|---|---|---|---|---|---|---|
1 | 39.1 | 18.7 | 181 | 3750 | 1 | 1 | 0 | 0 | 0 |
2 | 39.5 | 17.4 | 186 | 3800 | 0 | 1 | 1 | 0 | 0 |
3 | 40.3 | 18.0 | 195 | 3250 | 0 | 1 | 1 | 0 | 0 |
5 | 36.7 | 19.3 | 193 | 3450 | 0 | 1 | 1 | 0 | 0 |
6 | 39.3 | 20.6 | 190 | 3650 | 1 | 1 | 0 | 0 | 0 |
7 | 38.9 | 17.8 | 181 | 3625 | 0 | 1 | 1 | 0 | 0 |
## factor: adelie, dream: num, Torgersen: num
## everyhing ele: num , this is waht we want
## year: dropped no more
library(car)
library(GGally)
Refers to: STHDA-Stepwise Logistic Regression Essential http://www.sthda.com/english/articles/36-classification-methods-essentials/150-stepwise-logistic-regression-essentials-in-r/ Computing stepwise logistique regression
Full logistic regression model
Perform stepwise variable selection
Compare the full and the stepwise models
To perform stepwise model selection based on Akaike information criterion (AIC) value optimization.
model1_fullmodel <- glm (adelie~., family=binomial (link='logit'),data=df_adelie5)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
coef(model1_fullmodel)
## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## 365.84155736 -23.41050269 32.50218136 -0.04903018
## body_mass_g sexmale sexfemale Biscoe
## 0.02729885 10.23181146 NA -51.89053890
## Dream Torgersen
## -44.76170285 NA
summary(model1_fullmodel)
##
## Call:
## glm(formula = adelie ~ ., family = binomial(link = "logit"),
## data = df_adelie5)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.239e-04 -2.100e-08 -2.100e-08 2.100e-08 9.978e-05
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.658e+02 3.193e+05 0.001 0.999
## bill_length_mm -2.341e+01 5.504e+03 -0.004 0.997
## bill_depth_mm 3.250e+01 1.274e+04 0.003 0.998
## flipper_length_mm -4.903e-02 7.429e+02 0.000 1.000
## body_mass_g 2.730e-02 2.428e+01 0.001 0.999
## sexmale 1.023e+01 5.289e+04 0.000 1.000
## sexfemale NA NA NA NA
## Biscoe -5.189e+01 4.391e+04 -0.001 0.999
## Dream -4.476e+01 6.464e+04 -0.001 0.999
## Torgersen NA NA NA NA
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4.5658e+02 on 332 degrees of freedom
## Residual deviance: 4.5520e-08 on 325 degrees of freedom
## AIC: 16
##
## Number of Fisher Scoring iterations: 25
# ;log: algorithm did not converge
To perform stepwise model selection based on Akaike information criterion (AIC) value optimization.
library(MASS)
# modelAIC<- stepAIC(model1)
modelAIC<-model1_fullmodel %>% stepAIC( trace=FALSE)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
coef(modelAIC)
## (Intercept) bill_length_mm bill_depth_mm Dream
## 609.10000 -40.93968 68.41795 -88.10707
summary(modelAIC)
##
## Call:
## glm(formula = adelie ~ bill_length_mm + bill_depth_mm + Dream,
## family = binomial(link = "logit"), data = df_adelie5)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.026e-04 -2.100e-08 -2.100e-08 2.100e-08 1.878e-04
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 609.10 168154.13 0.004 0.997
## bill_length_mm -40.94 5615.85 -0.007 0.994
## bill_depth_mm 68.42 9486.84 0.007 0.994
## Dream -88.11 137196.84 -0.001 0.999
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4.5658e+02 on 332 degrees of freedom
## Residual deviance: 1.2158e-07 on 329 degrees of freedom
## AIC: 8
##
## Number of Fisher Scoring iterations: 25
# Error in stepAIC(model1) : number of rows in use has
## still do not converge
summary(modelAIC$fitted.values)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4384 1.0000 1.0000
confusion matrix has True Positive (TP), False Positive (FP), False Negative(FN), and True Negative (TN).
df_adelie5$Predict <- ifelse (modelAIC$fitted.values>.5, 'pos', 'neg')
CMtable<- table(df_adelie5$adelie, df_adelie5$Predict)
CMtable
##
## neg pos
## 0 187 0
## 1 0 146
TP <- CMtable[1]
FP <- CMtable[2]
FN <- CMtable[3]
TN <- CMtable[4]
print (TP)
## [1] 187
print (FP)
## [1] 0
print (FN)
## [1] 0
print (TN)
## [1] 146
Accuracy of our model via Confusion Matrix
accuracy <-sum(diag(CMtable))/sum(CMtable)
print(accuracy)
## [1] 1
#Observe ROC and AUC :
roc(adelie~modelAIC$fitted.values, data = df_adelie5, plot = TRUE, main = "ROC CURVE", col= "blue")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Call:
## roc.formula(formula = adelie ~ modelAIC$fitted.values, data = df_adelie5, plot = TRUE, main = "ROC CURVE", col = "blue")
##
## Data: modelAIC$fitted.values in 187 controls (adelie 0) < 146 cases (adelie 1).
## Area under the curve: 1
#
Training Set / Testing Set Split for Logistic Regression
## three predictors
model4 <- glm(adelie~ bill_length_mm +bill_depth_mm + body_mass_g , family=binomial(link='logit'), data=df_adelie5)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model4)
##
## Call:
## glm(formula = adelie ~ bill_length_mm + bill_depth_mm + body_mass_g,
## family = binomial(link = "logit"), data = df_adelie5)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.308 0.000 0.000 0.000 1.711
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 32.965097 25.641491 1.286 0.1986
## bill_length_mm -4.903436 2.653993 -1.848 0.0647 .
## bill_depth_mm 8.616112 4.810140 1.791 0.0733 .
## body_mass_g 0.006746 0.003854 1.751 0.0800 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 456.5751 on 332 degrees of freedom
## Residual deviance: 9.6518 on 329 degrees of freedom
## AIC: 17.652
##
## Number of Fisher Scoring iterations: 13
# glm.fit: fitted probabilities numerically 0 or 1 occurred
## two predictors
model5 <- glm(adelie ~ flipper_length_mm + bill_depth_mm
, data = df_adelie5, family=binomial(link="logit"))
summary(model5)
##
## Call:
## glm(formula = adelie ~ flipper_length_mm + bill_depth_mm, family = binomial(link = "logit"),
## data = df_adelie5)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.58735 -0.19448 -0.07433 0.64787 2.23661
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 26.8488 4.8412 5.546 2.92e-08 ***
## flipper_length_mm -0.1763 0.0239 -7.376 1.63e-13 ***
## bill_depth_mm 0.4308 0.1230 3.502 0.000463 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 456.58 on 332 degrees of freedom
## Residual deviance: 236.11 on 330 degrees of freedom
## AIC: 242.11
##
## Number of Fisher Scoring iterations: 6
# two predictors, final choice
model_chosen <- glm(adelie~ bill_length_mm + body_mass_g , family=binomial(link='logit'), data=df_adelie5)
summary(model_chosen)
##
## Call:
## glm(formula = adelie ~ bill_length_mm + body_mass_g, family = binomial(link = "logit"),
## data = df_adelie5)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.31516 -0.10608 -0.00967 0.09498 2.66818
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 50.0420895 6.8695806 7.285 3.23e-13 ***
## bill_length_mm -1.1304381 0.1668171 -6.777 1.23e-11 ***
## body_mass_g -0.0003744 0.0005156 -0.726 0.468
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 456.575 on 332 degrees of freedom
## Residual deviance: 89.722 on 330 degrees of freedom
## AIC: 95.722
##
## Number of Fisher Scoring iterations: 8
##
Test the model using penguine_test. confusion matrix has True Positive (TP), False Positive (FP), False Negative(FN), and True Negative (TN).
The confusion matrix avoids “confusion” by measuring the actual and predicted values in a tabular format.
summary(model_chosen$fitted.values)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.001081 0.107628 0.438438 0.984288 0.999997
Confusion Matrix
df_adelie5$Predict <- ifelse (model_chosen$fitted.values>.5, 'pos', 'neg')
CMtable<- table(df_adelie5$adelie, df_adelie5$Predict)
CMtable
##
## neg pos
## 0 177 10
## 1 8 138
TP <- CMtable[1]
FP <- CMtable[2]
FN <- CMtable[3]
TN <- CMtable[4]
print (TP)
## [1] 177
print (FP)
## [1] 8
print (FN)
## [1] 10
print (TN)
## [1] 138
Accuracy of our model via Confusion Matrix
accuracy <-sum(diag(CMtable))/sum(CMtable)
print(accuracy)
## [1] 0.9459459
#Observe ROC and AUC :
roc(adelie~model_chosen$fitted.values, data = df_adelie5, plot = TRUE, main = "ROC CURVE", col= "blue")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Call:
## roc.formula(formula = adelie ~ model_chosen$fitted.values, data = df_adelie5, plot = TRUE, main = "ROC CURVE", col = "blue")
##
## Data: model_chosen$fitted.values in 187 controls (adelie 0) < 146 cases (adelie 1).
## Area under the curve: 0.9881
#
We feel that the logistic regression model, to predict the outcome of Penguin species, in this case, “Adelie” species versus not, that’s not really the best approach. The model has a very significant overfitting problem period even after the AIC reduction, still this problem persist. There is a prominent one to two variables that explains almost 95% of the variants come up leaving the rest of the 10 variables almost powerless in prediction.
This overfitting problem is reflected in the true positive and true negative value, confusion matrix. In The Roc curve, the AUC is close to 100%, for almost all the models.
I tried to overcome the overfitting problems by looking at individual variables 1 by 1, and still find that to not very helpful.
So the logistic regression really have a huge limitation in this model, which lean too the reason that why we need to use and learn other machine learning models, such as KNN, random forest, who is not as restricted by the colinearity in the variables.
Although training and testing model will work for logistic regression prediction overall, I decide to leave that part out.
Because looking from the overall sample, all of the logistic regression model is highly skewed to overfitting, so this model is really free not fitting for training, therefore the testing model will not be accurate.
Second reason is that the sample size has only three forty observations, which means 80% to 20% split will leave a few dozens of observations in the testing data.
Given that the accuracy of model is already poor to start with, having a even smaller sample size for training and testing purpose really will not work period so I decided to leave that part entirely out from this project.
For the multi nomial analysis, I will use the original outcome as species, which has three categories. Before any analysis, I am doing the usual preprocessing of the data, taking out the missing variables, especially in sex column.
library(nnet)
library(dplyr)
library(MASS)
## Now we refer to the original penguins dataset:
data(penguins)
# #Remove NAs
penguins <- penguins %>% na.omit()
is.factor(penguins$species)
## [1] TRUE
## remove year variable, island varaible
penguins6 <- penguins %>%
dplyr::select (-c(year, island))
dim(penguins6)
## [1] 333 6
dim(penguins6)
## [1] 333 6
library(dplyr)
# quick inspect of dataframe
kable(head(penguins6)) %>%
kable_styling(bootstrap_options = "basic")
species | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|---|---|---|---|---|
Adelie | 39.1 | 18.7 | 181 | 3750 | male |
Adelie | 39.5 | 17.4 | 186 | 3800 | female |
Adelie | 40.3 | 18.0 | 195 | 3250 | female |
Adelie | 36.7 | 19.3 | 193 | 3450 | female |
Adelie | 39.3 | 20.6 | 190 | 3650 | male |
Adelie | 38.9 | 17.8 | 181 | 3625 | female |
## factor: adelie, dream: num, Torgersen: num
## everyhing ele: num , this is waht we want
## year: dropped no more
For the multi nomial analysis, I will use the original outcome as species, which has three categories. Before any analysis, I am doing the usual preprocessing of the data, taking out the missing variables, especially in sex column.
For the 3 species, the Adelie species is the comparison, and the other two species (Chinstrap, Gentoo) are compared aginst Adelie species.
levels(penguins6$species)
## [1] "Adelie" "Chinstrap" "Gentoo"
#Build multinomial logistic regression model (with all vars):
s_model_1 <- multinom (species ~., data=penguins6, model = TRUE)
## # weights: 21 (12 variable)
## initial value 365.837892
## iter 10 value 16.719027
## iter 20 value 3.047829
## iter 30 value 0.103484
## iter 40 value 0.063985
## iter 50 value 0.006201
## iter 60 value 0.003924
## iter 70 value 0.001733
## iter 80 value 0.001547
## iter 90 value 0.001513
## iter 100 value 0.001498
## final value 0.001498
## stopped after 100 iterations
summary(s_model_1)
## Call:
## multinom(formula = species ~ ., data = penguins6, model = TRUE)
##
## Coefficients:
## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap -103.422229 16.51642 -24.41758 -0.31003929
## Gentoo -6.756321 12.21924 -30.09757 -0.08758996
## body_mass_g sexmale
## Chinstrap -0.030200928 -15.820607
## Gentoo 0.005019684 -8.132101
##
## Std. Errors:
## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap 0.5715154 27.749774 27.43640 4.097365
## Gentoo 0.5715734 8.968787 33.25608 11.905692
## body_mass_g sexmale
## Chinstrap 0.1350169 3.077521
## Gentoo 0.5848602 3.498288
##
## Residual Deviance: 0.002996934
## AIC: 24.003
#Optimize model based on AIC value (as per #2):
s_model_2 <- stepAIC(s_model_1)
## Start: AIC=24
## species ~ bill_length_mm + bill_depth_mm + flipper_length_mm +
## body_mass_g + sex
##
## # weights: 18 (10 variable)
## initial value 365.837892
## iter 10 value 129.185715
## iter 20 value 114.018556
## iter 30 value 113.336460
## iter 40 value 113.216108
## final value 113.215886
## converged
## # weights: 18 (10 variable)
## initial value 365.837892
## iter 10 value 66.463118
## iter 20 value 12.514139
## iter 30 value 7.743531
## iter 40 value 5.662963
## iter 50 value 4.350499
## iter 60 value 4.145847
## iter 70 value 4.093510
## iter 80 value 3.984724
## iter 90 value 3.903126
## iter 100 value 3.831010
## final value 3.831010
## stopped after 100 iterations
## # weights: 18 (10 variable)
## initial value 365.837892
## iter 10 value 25.561758
## iter 20 value 1.350538
## iter 30 value 0.065803
## iter 40 value 0.020499
## iter 50 value 0.016911
## iter 60 value 0.012712
## iter 70 value 0.005348
## iter 80 value 0.001290
## iter 90 value 0.000225
## iter 100 value 0.000216
## final value 0.000216
## stopped after 100 iterations
## # weights: 18 (10 variable)
## initial value 365.837892
## iter 10 value 9.579038
## iter 20 value 4.114389
## iter 30 value 2.340467
## iter 40 value 0.614487
## iter 50 value 0.055642
## iter 60 value 0.012303
## iter 70 value 0.007327
## iter 80 value 0.007163
## iter 90 value 0.007015
## iter 100 value 0.006995
## final value 0.006995
## stopped after 100 iterations
## # weights: 18 (10 variable)
## initial value 365.837892
## iter 10 value 16.321214
## iter 20 value 3.754897
## iter 30 value 1.631859
## iter 40 value 0.012427
## iter 50 value 0.001125
## iter 60 value 0.001108
## iter 70 value 0.001006
## iter 80 value 0.000906
## iter 90 value 0.000498
## iter 100 value 0.000498
## final value 0.000498
## stopped after 100 iterations
## Df AIC
## - flipper_length_mm 2 20.000
## - sex 2 20.001
## - body_mass_g 2 20.014
## <none> 24.003
## - bill_depth_mm 2 27.662
## - bill_length_mm 2 246.432
## # weights: 18 (10 variable)
## initial value 365.837892
## iter 10 value 25.561758
## iter 20 value 1.350538
## iter 30 value 0.065803
## iter 40 value 0.020499
## iter 50 value 0.016911
## iter 60 value 0.012712
## iter 70 value 0.005348
## iter 80 value 0.001290
## iter 90 value 0.000225
## iter 100 value 0.000216
## final value 0.000216
## stopped after 100 iterations
##
## Step: AIC=20
## species ~ bill_length_mm + bill_depth_mm + body_mass_g + sex
##
## # weights: 15 (8 variable)
## initial value 365.837892
## iter 10 value 136.772749
## iter 20 value 133.782638
## iter 30 value 133.545483
## final value 133.538635
## converged
## # weights: 15 (8 variable)
## initial value 365.837892
## iter 10 value 30.179445
## iter 20 value 5.699489
## iter 30 value 5.175455
## iter 40 value 5.153196
## iter 50 value 5.070255
## iter 60 value 4.997789
## iter 70 value 4.893410
## iter 80 value 4.826209
## iter 90 value 4.708542
## iter 100 value 4.695168
## final value 4.695168
## stopped after 100 iterations
## # weights: 15 (8 variable)
## initial value 365.837892
## iter 10 value 11.953428
## iter 20 value 3.975600
## iter 30 value 2.172365
## iter 40 value 2.111553
## iter 50 value 1.881400
## iter 60 value 1.236767
## iter 70 value 1.133364
## iter 80 value 1.059253
## iter 90 value 0.588576
## iter 100 value 0.490499
## final value 0.490499
## stopped after 100 iterations
## # weights: 15 (8 variable)
## initial value 365.837892
## iter 10 value 20.098944
## iter 20 value 3.621850
## iter 30 value 1.160548
## iter 40 value 1.108738
## iter 50 value 1.063387
## iter 60 value 0.958674
## iter 70 value 0.926099
## iter 80 value 0.837719
## iter 90 value 0.797644
## iter 100 value 0.741759
## final value 0.741759
## stopped after 100 iterations
## Df AIC
## - body_mass_g 2 16.981
## - sex 2 17.484
## <none> 20.000
## - bill_depth_mm 2 25.390
## - bill_length_mm 2 283.077
## # weights: 15 (8 variable)
## initial value 365.837892
## iter 10 value 11.953428
## iter 20 value 3.975600
## iter 30 value 2.172365
## iter 40 value 2.111553
## iter 50 value 1.881400
## iter 60 value 1.236767
## iter 70 value 1.133364
## iter 80 value 1.059253
## iter 90 value 0.588576
## iter 100 value 0.490499
## final value 0.490499
## stopped after 100 iterations
##
## Step: AIC=16.98
## species ~ bill_length_mm + bill_depth_mm + sex
##
## # weights: 12 (6 variable)
## initial value 365.837892
## iter 10 value 150.109767
## iter 20 value 142.419807
## iter 30 value 142.047323
## final value 142.041293
## converged
## # weights: 12 (6 variable)
## initial value 365.837892
## iter 10 value 139.953898
## iter 20 value 133.697479
## iter 30 value 133.399000
## iter 40 value 133.326376
## final value 133.325373
## converged
## # weights: 12 (6 variable)
## initial value 365.837892
## iter 10 value 26.650783
## iter 20 value 23.943597
## iter 30 value 23.916873
## iter 40 value 23.901339
## iter 50 value 23.895442
## iter 60 value 23.894251
## final value 23.892065
## converged
## Df AIC
## <none> 16.981
## - sex 2 59.784
## - bill_depth_mm 2 278.651
## - bill_length_mm 2 296.083
#Review summary statistics
summary(s_model_2)
## Call:
## multinom(formula = species ~ bill_length_mm + bill_depth_mm +
## sex, data = penguins6, model = TRUE)
##
## Coefficients:
## (Intercept) bill_length_mm bill_depth_mm sexmale
## Chinstrap -242.9740 14.88938 -21.91411 -36.9051125
## Gentoo 121.0476 14.80645 -44.69711 -0.1301506
##
## Std. Errors:
## (Intercept) bill_length_mm bill_depth_mm sexmale
## Chinstrap 3.985819 9.369182 23.16685 23.33386
## Gentoo 4.071031 9.413971 23.17697 43.25185
##
## Residual Deviance: 0.9809986
## AIC: 16.981
The multinomial analysis definitely is a better fit than the logistic regression analysis. The full model and the AIC reduced model show significant difference, and the convergence problem do not exist, at least for these two models only. The AIC is 16.9, which is lot more satisfying. Chinstrap and the geeto, footballs these species, the bill lands is the most significant variables in prediction, compared with the Adelie species.
https://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/ MULTINOMIAL LOGISTIC REGRESSION | R DATA ANALYSIS EXAMPLES
mlogit package: Multinomial logit model Description Estimation by maximum likelihood of the multinomial logit model, with alternative-specific and/or individual specific variables.
Exercise 1: Multinomial logit model https://cran.r-project.org/web/packages/mlogit/vignettes/e1mlogit.html
https://www.joshuapkeller.com/page/introregression/ Interpretation of Logistic Regression Chapter 18-Logistic Regression
https://www.datacamp.com/community/tutorials/logistic-regression-R
https://statisticsglobe.com/report-missing-values-in-data-frame-in-r