The penguins dataset has 344 observations, with one response (dependent) variable, species, and 7 independent variables: island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, and year.
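The loading step is not shown; a minimal setup sketch, assuming the data comes from the palmerpenguins package, might look like this:
library(palmerpenguins)  # assumed source of the penguins data
library(dplyr)           # used for the pipe and data wrangling below
penguins_data <- penguins
dim(penguins_data)  # 344 rows, 8 columns (species plus the 7 predictors)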
We will remove the records that have NA values in any column.
#Remove NAs
penguins_data <- penguins_data %>% na.omit()
#Categories in species column
penguins_data %>%
group_by(species) %>%
count() #there are THREE categories
## # A tibble: 3 x 2
## # Groups: species [3]
## species n
## <fct> <int>
## 1 Adelie 146
## 2 Chinstrap 68
## 3 Gentoo 119
After exploring the data we see that there are 3 categories in the species column (Adelie, Chinstrap, and Gentoo), with Adelie the most common. To convert our outcome to a binary one, we can introduce two categories for the species column: "Adelie" (1) and "Other" (0). We will add an isAdelie column for that.
# create isAdelie classification response variable
penguins_data <- penguins_data %>%
mutate(isAdelie = ifelse(species == 'Adelie','Adelie','Other'))
penguins_data$isAdelie <- factor(penguins_data$isAdelie, levels = c("Other", "Adelie"))
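As a quick sanity check, the counts in the new column should match the species table above:
# Adelie should have 146 rows; Other should have 68 + 119 = 187
table(penguins_data$isAdelie)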
Let's look at the correlation between the numeric variables. body_mass_g and flipper_length_mm are strongly positively correlated (r ≈ 0.87), so we may keep only one of them in our model.
penguins_data %>% select_if(is.numeric) %>% cor()
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm 1.0000000 -0.2286256 0.6530956 0.58945111
## bill_depth_mm -0.2286256 1.0000000 -0.5777917 -0.47201566
## flipper_length_mm 0.6530956 -0.5777917 1.0000000 0.87297890
## body_mass_g 0.5894511 -0.4720157 0.8729789 1.00000000
## year 0.0326569 -0.0481816 0.1510679 0.02186213
## year
## bill_length_mm 0.03265690
## bill_depth_mm -0.04818160
## flipper_length_mm 0.15106792
## body_mass_g 0.02186213
## year 1.00000000
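For a visual version of the same information, a scatterplot matrix (base R's pairs()) makes the strong flipper_length_mm/body_mass_g relationship easy to spot:
penguins_data %>% select_if(is.numeric) %>% pairs()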
For continuous independent variables, we can get more clarity on their distributions by analyzing them with respect to the dependent variable.
par(mfrow = c(2,2))
boxplot(bill_length_mm~isAdelie, ylab="bill_length", xlab= "isAdelie", col="light blue",data = penguins_data)
boxplot(bill_depth_mm~isAdelie, ylab="bill_depth", xlab= "isAdelie", col="light blue",data = penguins_data)
boxplot(flipper_length_mm~isAdelie, ylab="flipper_length", xlab= "isAdelie", col="light blue",data = penguins_data)
boxplot(body_mass_g~isAdelie, ylab="body_mass", xlab= "isAdelie", col="light blue",data = penguins_data)
For categorical independent variables, we can analyze the frequency of each category with respect to the dependent variable.
xtabs(~isAdelie + year, data = penguins_data)
## year
## isAdelie 2007 2008 2009
## Other 59 63 65
## Adelie 44 50 52
xtabs(~isAdelie + sex, data = penguins_data)
## sex
## isAdelie female male
## Other 92 95
## Adelie 73 73
The Adelie/Other counts are nearly uniform across both year and sex, so these two variables carry little information for distinguishing Adelie penguins. We can remove them from our dataset.
penguins_data <- subset(penguins_data, select = -c(year, sex))
Train-Test Split
Let's split the data: use 70% of it to train the model and keep the remaining 30% for testing.
# Note: a set.seed() call here would make this random split reproducible.
which_train <- sample(x = c(TRUE, FALSE), size = nrow(penguins_data), replace = TRUE, prob = c(0.7, 0.3))
train_data <- penguins_data[which_train, ]
test_data <- penguins_data[!which_train, ]
Backward stepwise regression is performed, and the result is a model with the following variables: island, bill_depth_mm, and flipper_length_mm.
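The selection step itself is not shown; a minimal sketch of how it could be done with step(), assuming we start from a full model (species must be dropped first, since it determines isAdelie exactly), might look like this:
# Hypothetical sketch: backward stepwise selection via AIC (not shown in the original)
full_model <- glm(isAdelie ~ ., family = 'binomial',
                  data = train_data %>% select(-species))
step_model <- step(full_model, direction = 'backward', trace = FALSE)
formula(step_model)  # the predictors that survive elimination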
logmodel <- glm(isAdelie ~ island + bill_depth_mm + flipper_length_mm,
family = 'binomial',
data = train_data)
summary(logmodel)
##
## Call:
## glm(formula = isAdelie ~ island + bill_depth_mm + flipper_length_mm,
## family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.59262 -0.20631 -0.06577 0.08972 2.63842
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 37.51719 8.79948 4.264 2.01e-05 ***
## islandDream -3.80276 1.03652 -3.669 0.000244 ***
## islandTorgersen 16.96797 1505.96799 0.011 0.991010
## bill_depth_mm 0.75693 0.20186 3.750 0.000177 ***
## flipper_length_mm -0.24812 0.04305 -5.763 8.25e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 341.80 on 248 degrees of freedom
## Residual deviance: 101.93 on 244 degrees of freedom
## AIC: 111.93
##
## Number of Fisher Scoring iterations: 18
The larger a coefficient, the more impact it has on the positive classification, i.e., on the penguin being of the Adelie species. Variables with positive coefficients are indicative of Adelie, and variables with negative coefficients are indicative of not Adelie. One caveat: the islandTorgersen coefficient has a huge standard error (about 1506), a symptom of quasi-complete separation, since Torgersen island hosts only Adelie penguins in this dataset, so that estimate should not be over-interpreted. All of these coefficients align with our previous exploratory analysis!
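Since the model operates on the log-odds scale, exponentiating the coefficients yields odds ratios, which are easier to read: a value above 1 increases the odds of Adelie, a value below 1 decreases them.
exp(coef(logmodel))  # odds ratio per one-unit change in each predictor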
Let's use our model to predict values for the test set and then evaluate the model.
library(pROC)   # provides auc()
library(caret)  # provides confusionMatrix()
log_preds <- predict(logmodel, test_data, type = 'response')
class_prediction <- factor(ifelse(log_preds > 0.50, "Adelie", "Other"),
levels = c("Other", "Adelie"))
log_auc <- auc(response = test_data$isAdelie, predictor = log_preds)
## Setting levels: control = Other, case = Adelie
## Setting direction: controls < cases
log_cm <- confusionMatrix(data = class_prediction, reference = test_data$isAdelie,
                          positive = "Adelie")  # treat Adelie as the positive class
log_accuracy <- log_cm$overall['Accuracy']
log_tpr <- log_cm$byClass['Sensitivity']  # true positive rate
log_fnr <- 1 - log_tpr                    # false negative rate = 1 - TPR
log_tnr <- log_cm$byClass['Specificity']  # true negative rate
log_fpr <- 1 - log_tnr                    # false positive rate = 1 - TNR
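Gathering the metrics into a single row makes them easy to compare against other models later (a convenience step, not part of the original analysis):
data.frame(AUC = as.numeric(log_auc),
           Accuracy = as.numeric(log_accuracy),
           TPR = as.numeric(log_tpr),
           FPR = as.numeric(log_fpr))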
Next, we extend the analysis from the binary outcome to a multinomial model over all three species. First, we will define "Adelie" as the reference level (or "baseline species") for the dataset. This means that the trained model will produce coefficients for the remaining two species relative to Adelie. We will start with a baseline model that includes all features.
require(nnet)
## Loading required package: nnet
train_data$species <- relevel(train_data$species, ref = "Adelie")
test_data$species <- relevel(test_data$species, ref = "Adelie")
multinom_model <- multinom(species ~ ., data = train_data %>% select(-isAdelie))
## # weights: 24 (14 variable)
## initial value 273.554460
## iter 10 value 13.698503
## iter 20 value 2.253936
## iter 30 value 0.060087
## iter 40 value 0.002295
## iter 50 value 0.000494
## iter 60 value 0.000490
## iter 70 value 0.000487
## iter 80 value 0.000255
## iter 90 value 0.000247
## iter 100 value 0.000237
## final value 0.000237
## stopped after 100 iterations
summary(multinom_model)
## Warning in sqrt(diag(vc)): NaNs produced
## Call:
## multinom(formula = species ~ ., data = train_data %>% select(-isAdelie))
##
## Coefficients:
## (Intercept) islandDream islandTorgersen bill_length_mm bill_depth_mm
## Chinstrap -136.36389 32.93073 -14.42726 14.13510 -19.90270
## Gentoo -10.21908 -46.22665 -39.47398 8.73007 -18.41169
## flipper_length_mm body_mass_g
## Chinstrap -0.2377843 -0.02749231
## Gentoo -0.6030912 0.01881209
##
## Std. Errors:
## (Intercept) islandDream islandTorgersen bill_length_mm bill_depth_mm
## Chinstrap 0.6957534 1.403937e+00 NaN 101.64960 54.922341
## Gentoo 0.1268320 8.301517e-17 8.486003e-08 8.75875 4.370859
## flipper_length_mm body_mass_g
## Chinstrap 18.21333 0.5917839
## Gentoo 30.70866 1.3707121
##
## Residual Deviance: 0.000474253
## AIC: 28.00047
The low AIC is suggestive of a good fit, so we will keep all of the variables in the model. Note, however, that the near-zero residual deviance and the NaN standard errors indicate the three species are almost perfectly separable on the training data, so the individual coefficient estimates are unstable and should not be interpreted too literally.
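As a final sketch, not part of the original analysis, the multinomial model could be evaluated on the held-out test set in the same way as the logistic model:
multinom_preds <- predict(multinom_model, newdata = test_data, type = 'class')
mean(multinom_preds == test_data$species)  # overall test-set accuracy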