Demo Statistical Analysis with RStudio Logistic Regression

Statistical Analysis with the MTCARS Dataset - Logistic Regression

Background

The mtcars dataset is a list of cars featured in the 1974 issue of the Motor Trend US magazine. The list includes information on car design and performance, such as miles per gallon, weight, and horsepower.

It is a foundational dataset used for learning R programming and data analysis skills.

Note: This script assumes that you have worked through Statistical Analysis 1 with MTCARS, which is an introduction to R packages, hypothesis testing, and linear regression models.

Install Relevant Packages

install.packages("tidyverse", repos = "http://cran.us.r-project.org")
Installing package into 'C:/Users/17062/AppData/Local/R/win-library/4.5'
(as 'lib' is unspecified)
package 'tidyverse' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\17062\AppData\Local\Temp\RtmpYZocpK\downloaded_packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Import the mtcars dataset

cars <- datasets::mtcars

In the first demo, we picked out two continuous variables, miles per gallon and weight, to create a linear regression model. That’s because the linear regression model assumes, among other things, that the dependent variable is continuous and follows a normal distribution (bell curve).

Logistic regression is similar, but it assumes that our dependent variable can take on only two values, i.e., that it is a binary/dichotomous variable (no/yes, 0/1, absent/present). So our first step is finding a binary variable in our dataset. Let’s explore.

Exploratory Data Analysis

We’ll check the structure of our dataset to see if we can find any binary variables. We could use View to open the dataframe in a separate viewer, or we can explore it right in our script to get used to reading R output.

#explore the mtcars dataset and find a binary variable 

str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

The structure function gives us information about how our dataframe is organized:

  1. We have a dataframe with 32 rows and 11 columns
  2. All 11 of our variables are numeric

Additionally, the vs and am variables seem to take on only two values: 0 and 1. Let’s confirm that with a table.

table(mtcars$am)

 0  1 
19 13 
table(mtcars$vs)

 0  1 
18 14 

Our table function, also from base R, helps confirm that both the am and vs variables have only two values: 0 and 1. But I don’t know a TON about cars, so I want to understand what 0/1 means in the context of our variables. This will help us design a hypothesis test that makes sense in the real world.

Concept Check: What other tools or techniques have you used to understand and explore your data?

#learn more about the mtcars dataset

?mtcars
starting httpd help server ... done

So running the help ( ? ) function on mtcars tells us that:

  • vs describes the type of car engine, v-shaped or straight

  • am describes the transmission, automatic or manual

So we have two binary variables. We can run a logistic regression with two binary variables, and we can also run one with a continuous predictor and a binary outcome. The bottom line is that our outcome needs to be binary. We’ll run both models as examples.

Logistic Regression with a Categorical Predictor

What if I were shopping for a car and wanted to see if the car engine I like also comes with the transmission I need? I drive an automatic, so I might want to see if the type of car engine can predict a car’s transmission. Let’s find out.

#select vs and transmission variables 

data <- cars %>% select(c(vs, am))

Data Manipulation

Words make more sense to me than numbers, so I am going to recode my dataset using mutate and if_else from tidyverse to add label columns that describe my values.

data <- data %>%
  mutate(engine_label = if_else(vs > 0, "Straight engine", "V-shaped Engine")) %>%
  mutate(transmission_label = if_else(am > 0, "Manual", "Automatic"))

Let’s break down what just happened since there is some new code:

  1. I’m assigning my manipulated data back to my original dataset
  2. I’m using the pipe ( %>% ) operator to run multiple lines of code together and in order
  3. mutate is a new function from the dplyr package in tidyverse that creates new columns in a dataset (engine_label =, transmission_label =)
  4. if_else is also a dplyr function; it lets us create new values based on the value of another column

In summary, I told R to create two new columns (engine_label and transmission_label) that spell out what the values in each original column mean (0 = V-shaped/automatic, 1 = straight/manual), depending on whether the value is greater than 0 or not. The pipe operator let me do all of this in one block of code.
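
As an aside, if_else is just one way to do this; base R’s factor function can do the same recode in one step by mapping levels to labels, and it produces a factor directly (which we’ll want later anyway). A minimal sketch; engine_label2 is just a hypothetical name so it doesn’t overwrite our column:

#equivalent recode with base R's factor - alternative sketch, not used below

data$engine_label2 <- factor(data$vs, levels = c(0, 1),
                             labels = c("V-shaped Engine", "Straight engine"))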

Lastly, we’ll use tables to double-check our work, since our label columns should have the same counts as our original columns.

table(data$engine_label)

Straight engine V-shaped Engine 
             14              18 
table(data$vs)

 0  1 
18 14 
table(data$transmission_label)

Automatic    Manual 
       19        13 
table(data$am)

 0  1 
19 13 
#good to go!

Chi-Square Analysis

Similar to how we used correlation analyses to test the strength and direction of a proposed relationship between continuous variables, we can use a chi-square test to learn about the proposed relationship between our two binary variables.

First, let’s turn our data into factors. Factors are how R handles categorical data, or data whose numbers are not continuous. Factoring variables helps R understand the data better and opens up functions for us to use.

Since all of our variables are categorical, we’ll turn all of them into factors using mutate again.

data <- data %>%
  mutate(across(c(am, vs, engine_label, transmission_label),
                as.factor))

#double check with str

str(data)
'data.frame':   32 obs. of  4 variables:
 $ vs                : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
 $ am                : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
 $ engine_label      : Factor w/ 2 levels "Straight engine",..: 2 2 1 1 2 1 2 1 1 1 ...
 $ transmission_label: Factor w/ 2 levels "Automatic","Manual": 2 2 2 1 1 1 1 1 1 1 ...

Then, since a chi-square test expects a contingency table with the observed values of our distribution, we’ll use R’s base stats package to build a 2x2 contingency table of am and vs and pass it into our chi-square test of independence.

Null Hypothesis: am and vs have no association

Alternative Hypothesis: am and vs are associated

chi1 <- chisq.test(table(data$vs, data$am))
chi1

    Pearson's Chi-squared test with Yates' continuity correction

data:  table(data$vs, data$am)
X-squared = 0.34754, df = 1, p-value = 0.5555

Let’s break down the code and output

  1. A Yates’ continuity correction was applied automatically because R’s base chi-square test guards against small expected cell counts (less than 5) in 2x2 tables. It’s meant to correct for small-sample bias (see the sketch after this list).
  2. Data is the table we entered to run the chi-square
  3. X-squared is a measure of the difference between the values expected under independence and the values observed in our table. The larger the X-squared, the bigger the difference. This is our test statistic
  4. DF is degrees of freedom - simple explanation here: https://www.reddit.com/r/explainlikeimfive/comments/19d3opt/eli5_degrees_of_freedom/
  5. P-value is the probability of observing a test statistic at least as extreme as ours if the null hypothesis were true. A low p-value means our data would be unlikely under the null (evidence against it), while a high p-value means our data are consistent with the null
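
One way to check whether the correction matters is to look at the expected counts that chisq.test itself computed; they are stored on the test object. A minimal sketch (output not shown; compare it against the observed table in the concept check below):

#inspect the expected cell counts behind our first chi-square test

chi1$expected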

Concept Check: Let’s review our table to see if we need a Yates’ continuity correction in the first place. We’ll look at our 2x2 table to see if any cells have counts less than 5.

table(data$vs, data$am)
   
     0  1
  0 12  6
  1  7  7

So while it’s close, we technically don’t need Yates’ continuity correction. Let’s run our test again and turn off the correction.

chi2 <- chisq.test(table(data$vs, data$am), correct = FALSE)
chi2

    Pearson's Chi-squared test

data:  table(data$vs, data$am)
X-squared = 0.90688, df = 1, p-value = 0.3409

And we see that our X-squared is larger (the gap between observed and expected values looks bigger without the correction) and our p-value is smaller, meaning we come closer to rejecting the null with this version of the test, but the result still would not be considered statistically significant under the usual p < 0.05 criterion. That is fine!

We’ll run our logistic regression to see if our model results are consistent.

We can run a logistic regression using the glm function from R’s base stats package. glm is very similar to lm for linear regression, but with the binomial family it no longer assumes our outcome follows a normal, continuous distribution; instead it fits a model where the outcome variable has only two values.

Question: Can engine type be used to predict transmission type?

Null hypothesis: These variables are not associated, so no.

Alternative hypothesis: These variables are associated, so yes.

References and Dummies - Logistic regression needs a reference level to compare against. For factors, R defaults to the first level, which is alphabetical. So here, Straight engine is our reference level for x, and the model estimates the probability of the second level of am, manual transmission. You can adjust reference and dummy levels as necessary for your analysis, but we won’t cover it in depth here.
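
Though we won’t cover it in depth, changing the reference level is a one-liner with base R’s relevel function. A minimal sketch (not run here, since the output below assumes the default reference):

#make V-shaped Engine the reference level instead - sketch only, not run

data$engine_label <- relevel(data$engine_label, ref = "V-shaped Engine")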

#remember - engine is our x and transmission is y

logmod1 <- glm(am ~ engine_label, data = data, family = binomial(link = "logit"))
summary(logmod1)

Call:
glm(formula = am ~ engine_label, family = binomial(link = "logit"), 
    data = data)

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)
(Intercept)                   0.0000     0.5345   0.000    1.000
engine_labelV-shaped Engine  -0.6931     0.7319  -0.947    0.344

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.230  on 31  degrees of freedom
Residual deviance: 42.323  on 30  degrees of freedom
AIC: 46.323

Number of Fisher Scoring iterations: 4

Let’s break this down:

  1. We have our model call as usual
  2. We have our model coefficients much like linear regression, but for logistic regression the coefficient is expressed as a log odds (the logarithm of an odds ratio)
  3. Our p-values are still non-significant, which tracks
  4. Our coefficient is negative (-0.6931). Exponentiating gives the odds ratio, exp(-0.6931) ≈ 0.50, which we interpret to mean that cars with V-shaped engines have about 50% lower odds of having a manual transmission than cars with straight engines (see the sketch below this list)
  5. Null deviance: how well an intercept-only model, with no predictors, fits the data
  6. Residual deviance: how well the model fits once we include our predictor. It is only slightly lower than the null deviance here, so adding engine type barely improves on the intercept-only model
  7. AIC - a measure of model fit that penalizes complexity, to let you know if you have too many variables. When comparing models, lower deviance and lower AIC indicate better performance

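Since glm reports log odds, a quick way to get interpretable odds ratios is to exponentiate the coefficients; the same trick works for confidence intervals. A minimal sketch (output not shown; run it to verify the 0.50 odds ratio above):

#convert log odds coefficients to odds ratios

exp(coef(logmod1))

#odds ratios with 95% confidence intervals

exp(confint(logmod1))
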
Concept Check: What is a real-world explanation for why engine shape is not a statistically significant predictor of transmission type?

Logistic Regression with a Continuous Predictor

I don’t know a lot about cars, but I can gather that gross horsepower is related to the car’s engine. I wonder if we can use a logistic regression to predict the likelihood of a car having a V-shaped or straight engine based on horsepower.

Exploratory Data Analysis

We already know what our vs engine variable looks like as a table so let’s add our horsepower variable back to our dataset and run a histogram.

data$hp <- cars$hp
hist(data$hp)

We see a right skew, which means this isn’t a perfectly normal distribution. Strictly speaking, logistic regression doesn’t require a normally distributed predictor, but it’s still good practice to know the shape of your variables; for instructional purposes, we’ll treat this as close enough.

Correlation Analysis

Since we have two different data types, hp (continuous) and vs (binary), we’ll need to think harder about which correlation we can use. For continuous and binary data, we can use the point-biserial correlation via the biserial.cor function from the ltm package.

install.packages("ltm", repos = "http://cran.us.r-project.org")
Installing package into 'C:/Users/17062/AppData/Local/R/win-library/4.5'
(as 'lib' is unspecified)
package 'ltm' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\17062\AppData\Local\Temp\RtmpYZocpK\downloaded_packages
library(ltm)
Loading required package: MASS

Attaching package: 'MASS'
The following object is masked from 'package:dplyr':

    select
Loading required package: msm
Loading required package: polycor
#assign to a new name (pb_cor) so we don't mask the biserial.cor function

pb_cor <- biserial.cor(data$hp, data$vs)
pb_cor
[1] 0.7230967

So our point-biserial correlation is 0.72. Please note that biserial.cor does not produce a hypothesis test like our cor.test function. However, this effect size is relatively strong, which is promising for our alternative hypothesis.

Concept Check: Sometimes you can use Pearson’s correlation to better understand the behavior of the point-biserial by comparing the two. When the correlation values have the same magnitude but different directions (positive/negative), it means we need to double-check which level is being used as the reference.

#cor.test(data$hp, data$vs) - would error now that vs is a factor; see the sketch below

biserial.cor(data$hp, data$vs, level = 2) 
[1] -0.7230967
#level = 2 uses the second level of vs (1 = straight engine) - matches Pearson's

biserial.cor(data$hp, data$vs, level = 1) 
[1] 0.7230967
#level = 1 uses the first level of vs (0 = v-shaped engine) - the sign flips

Pearson’s correlation between hp and vs is negative: cars with more horsepower tend to have vs = 0, a V-shaped engine. The level argument tells biserial.cor which level of vs to treat as the focal group. With level = 2 (vs = 1, the straight engine), the point-biserial matches Pearson’s correlation, which is what theory predicts: the point-biserial and Pearson’s correlation are identical when computed on the same coding. So we’ll go with -0.72: as horsepower increases, a straight engine becomes less likely. Now let’s see what the logistic regression says.
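
If you want an actual p-value for this correlation, recall that the point-biserial is just Pearson’s r computed on 0/1 codes, so cor.test works once we convert the factor back to numeric. A minimal sketch (output not shown):

#vs is a factor, so recover the 0/1 codes before running cor.test

cor.test(data$hp, as.numeric(as.character(data$vs)))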

logmod2 <- glm(am ~ hp, data = data, family = binomial(link = "logit"))

summary(logmod2)

Call:
glm(formula = am ~ hp, family = binomial(link = "logit"), data = data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.776614   0.915429   0.848    0.396
hp          -0.008117   0.006074  -1.336    0.181

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.230  on 31  degrees of freedom
Residual deviance: 41.228  on 30  degrees of freedom
AIC: 45.228

Number of Fisher Scoring iterations: 4

We have the same kind of output as our first model. One thing to notice: the formula we ran (am ~ hp) again models transmission as the outcome, this time with horsepower as a continuous predictor. The hp coefficient is a log odds: exp(-0.008117) ≈ 0.992, so each additional unit of horsepower is associated with roughly a 0.8% decrease in the odds of a manual transmission, and the effect is not statistically significant (p = 0.181).

To answer the engine question our correlation analysis set up, we would swap vs in as the outcome instead; see the sketch below.
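
Here is a minimal sketch of that engine model, plus predicted probabilities at two illustrative horsepower values (logmod3 is just a hypothetical name; we leave running it and interpreting the output to you):

#model engine type (vs) directly from horsepower

logmod3 <- glm(vs ~ hp, data = data, family = binomial(link = "logit"))
summary(logmod3)

#predicted probability of a straight engine (vs = 1) at 100 and 200 horsepower

predict(logmod3, newdata = data.frame(hp = c(100, 200)), type = "response")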

Concept Check: Why do you think our models were never significant? Is it because we have way more of some values than others? Are there other variables that might better explain our outcomes?