library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
msleep <- read.csv("C:/Users/ABHIRAM/Downloads/msleep.csv")
To perform logistic regression using the “msleep” dataset in R, we
first need to identify a binary response variable. We can select the
“conservation” column as the response variable. This column contains
information about the conservation status of each species. Let’s convert
it into a binary variable by considering species with a conservation
status of “lc” (Least Concern) as one class and all other statuses as
the second class. This allows us to model whether a species is “Least
Concern” or not based on the other variables.
Here are the steps to build and interpret a logistic regression
model for this binary response variable using 1-4 explanatory
variables:
1. Data Preparation:
# Loading necessary library
library(dplyr)
# Selecting the relevant columns for analysis
msleep_binary <- msleep %>% select(conservation, sleep_total, sleep_rem, awake, brainwt, bodywt)
# Converting the conservation variable into a binary response variable
msleep_binary <- msleep_binary %>% mutate(is_lc = ifelse(conservation == "lc", 1, 0)) %>%
select(-conservation)
# Removing rows with missing values
msleep_binary <- na.omit(msleep_binary)
2. Building a Logistic Regression Model:
# Model 1: Simple model with one explanatory variable
model1 <- glm(is_lc ~ sleep_total, data = msleep_binary, family = "binomial")
# Model 2: Adding another explanatory variable
model2 <- glm(is_lc ~ sleep_total + brainwt, data = msleep_binary, family = "binomial")
# Model 3: Adding more explanatory variables
model3 <- glm(is_lc ~ sleep_total + brainwt + bodywt, data = msleep_binary, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Model 4: Maximum complexity model with all available explanatory variables
model4 <- glm(is_lc ~ sleep_total + sleep_rem + awake + brainwt + bodywt, data = msleep_binary, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
3. Interpretation of Coeficients:
We can interpret the coefficients by looking at the results of the
logistic regression models. The coefficients represent the log-odds of
being in the “Least Concern” class. A positive coefficient indicates an
increased likelihood of being in the “Least Concern” class, while a
negative coefficient indicates a decreased likelihood. We can use the
summary function to see coefficient estimates, standard errors,
z-values, and p-values for each model.
summary(model1)
##
## Call:
## glm(formula = is_lc ~ sleep_total, family = "binomial", data = msleep_binary)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.71524 0.88074 -0.812 0.417
## sleep_total 0.09215 0.08212 1.122 0.262
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 42.684 on 30 degrees of freedom
## Residual deviance: 41.357 on 29 degrees of freedom
## AIC: 45.357
##
## Number of Fisher Scoring iterations: 4
summary(model2)
##
## Call:
## glm(formula = is_lc ~ sleep_total + brainwt, family = "binomial",
## data = msleep_binary)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.07807 1.28915 0.836 0.403
## sleep_total -0.02768 0.10123 -0.273 0.785
## brainwt -7.67105 4.60907 -1.664 0.096 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 42.684 on 30 degrees of freedom
## Residual deviance: 37.223 on 28 degrees of freedom
## AIC: 43.223
##
## Number of Fisher Scoring iterations: 5
summary(model3)
##
## Call:
## glm(formula = is_lc ~ sleep_total + brainwt + bodywt, family = "binomial",
## data = msleep_binary)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.593772 1.417061 0.419 0.675
## sleep_total 0.006834 0.114338 0.060 0.952
## brainwt 28.478245 24.534201 1.161 0.246
## bodywt -0.106034 0.083328 -1.272 0.203
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 42.684 on 30 degrees of freedom
## Residual deviance: 31.238 on 27 degrees of freedom
## AIC: 39.238
##
## Number of Fisher Scoring iterations: 9
summary(model4)
##
## Call:
## glm(formula = is_lc ~ sleep_total + sleep_rem + awake + brainwt +
## bodywt, family = "binomial", data = msleep_binary)
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.10299 1.53665 0.718 0.473
## sleep_total -0.15383 0.20886 -0.737 0.461
## sleep_rem 0.62067 0.67939 0.914 0.361
## awake NA NA NA NA
## brainwt 31.96376 23.51186 1.359 0.174
## bodywt -0.12387 0.08092 -1.531 0.126
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 42.684 on 30 degrees of freedom
## Residual deviance: 30.339 on 26 degrees of freedom
## AIC: 40.339
##
## Number of Fisher Scoring iterations: 9
Coefficient Intervals:
# Calculating confidence intervals for coefficients in model 1
conf_intervals_model1 <- confint(model1)
## Waiting for profiling to be done...
5. Scatter Plots:
# Scatter plot of sleep_total vs. bodywt
plot(msleep_binary$sleep_total, msleep_binary$bodywt, xlab = "Sleep Total", ylab = "Body Weight")
