library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)

msleep <- read.csv("C:/Users/ABHIRAM/Downloads/msleep.csv")

To perform logistic regression using the “msleep” dataset in R, we first need to identify a binary response variable. We can select the “conservation” column as the response variable. This column contains information about the conservation status of each species. Let’s convert it into a binary variable by considering species with a conservation status of “lc” (Least Concern) as one class and all other statuses as the second class. This allows us to model whether a species is “Least Concern” or not based on the other variables.

Here are the steps to build and interpret a logistic regression model for this binary response variable using 1-4 explanatory variables:

1. Data Preparation:

# Loading necessary library
library(dplyr)

# Selecting the relevant columns for analysis
msleep_binary <- msleep %>% select(conservation, sleep_total, sleep_rem, awake, brainwt, bodywt)

# Converting the conservation variable into a binary response variable
msleep_binary <- msleep_binary %>% mutate(is_lc = ifelse(conservation == "lc", 1, 0)) %>%
  select(-conservation)

# Removing rows with missing values
msleep_binary <- na.omit(msleep_binary)

2. Building a Logistic Regression Model:

# Model 1: Simple model with one explanatory variable
model1 <- glm(is_lc ~ sleep_total, data = msleep_binary, family = "binomial")

# Model 2: Adding another explanatory variable
model2 <- glm(is_lc ~ sleep_total + brainwt, data = msleep_binary, family = "binomial")

# Model 3: Adding more explanatory variables
model3 <- glm(is_lc ~ sleep_total + brainwt + bodywt, data = msleep_binary, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Model 4: Maximum complexity model with all available explanatory variables
model4 <- glm(is_lc ~ sleep_total + sleep_rem + awake + brainwt + bodywt, data = msleep_binary, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

3. Interpretation of Coeficients:

We can interpret the coefficients by looking at the results of the logistic regression models. The coefficients represent the log-odds of being in the “Least Concern” class. A positive coefficient indicates an increased likelihood of being in the “Least Concern” class, while a negative coefficient indicates a decreased likelihood. We can use the summary function to see coefficient estimates, standard errors, z-values, and p-values for each model.

summary(model1)
## 
## Call:
## glm(formula = is_lc ~ sleep_total, family = "binomial", data = msleep_binary)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.71524    0.88074  -0.812    0.417
## sleep_total  0.09215    0.08212   1.122    0.262
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 42.684  on 30  degrees of freedom
## Residual deviance: 41.357  on 29  degrees of freedom
## AIC: 45.357
## 
## Number of Fisher Scoring iterations: 4
summary(model2)
## 
## Call:
## glm(formula = is_lc ~ sleep_total + brainwt, family = "binomial", 
##     data = msleep_binary)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  1.07807    1.28915   0.836    0.403  
## sleep_total -0.02768    0.10123  -0.273    0.785  
## brainwt     -7.67105    4.60907  -1.664    0.096 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 42.684  on 30  degrees of freedom
## Residual deviance: 37.223  on 28  degrees of freedom
## AIC: 43.223
## 
## Number of Fisher Scoring iterations: 5
summary(model3)
## 
## Call:
## glm(formula = is_lc ~ sleep_total + brainwt + bodywt, family = "binomial", 
##     data = msleep_binary)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.593772   1.417061   0.419    0.675
## sleep_total  0.006834   0.114338   0.060    0.952
## brainwt     28.478245  24.534201   1.161    0.246
## bodywt      -0.106034   0.083328  -1.272    0.203
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 42.684  on 30  degrees of freedom
## Residual deviance: 31.238  on 27  degrees of freedom
## AIC: 39.238
## 
## Number of Fisher Scoring iterations: 9
summary(model4)
## 
## Call:
## glm(formula = is_lc ~ sleep_total + sleep_rem + awake + brainwt + 
##     bodywt, family = "binomial", data = msleep_binary)
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  1.10299    1.53665   0.718    0.473
## sleep_total -0.15383    0.20886  -0.737    0.461
## sleep_rem    0.62067    0.67939   0.914    0.361
## awake             NA         NA      NA       NA
## brainwt     31.96376   23.51186   1.359    0.174
## bodywt      -0.12387    0.08092  -1.531    0.126
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 42.684  on 30  degrees of freedom
## Residual deviance: 30.339  on 26  degrees of freedom
## AIC: 40.339
## 
## Number of Fisher Scoring iterations: 9

Coefficient Intervals:

# Calculating confidence intervals for coefficients in model 1
conf_intervals_model1 <- confint(model1)
## Waiting for profiling to be done...

4. Variable Transformations:

Depending on the model and the relationship between variables, we can consider transformations. For example, we may want to transform variables like “brainwt” or “bodywt” if they have a skewed distribution.

5. Scatter Plots:

# Scatter plot of sleep_total vs. bodywt
plot(msleep_binary$sleep_total, msleep_binary$bodywt, xlab = "Sleep Total", ylab = "Body Weight")