Demo 1 Statistical Data Analysis with mtcars

Statistical Data Analysis with the mtcars dataset

Background

The mtcars dataset is a list of cars included in the 1974 issue of the Motor Trend US magazine. The list includes information on car design and performance like miles per gallon, weight, horsepower, etc.

It is a foundational dataset used for learning R programming and data analysis skills.

R 101: Installing, Loading, and Using Packages

A package is a collection of software tools and functions that R uses to perform data analysis and programming tasks.

Base packages are “built-in” to R. Examples include:

base
stats
datasets

Base packages are convenient because you can use them as soon as you open RStudio. However, most packages require you to install and load them before you can use their functions. We are going to install a package called explore, which has functions that help us explore and understand our dataset.

```r
# Installing and loading a package
install.packages("explore", repos = "http://cran.us.r-project.org")
library(explore)
```

Concept Check: Did your package install correctly? How do you know?

Start typing explore - do you see a package?
Type explore and then add two colons ( :: ) - what do you see?
The purple p before explore lets us know that RStudio recognizes explore as an available package, and the two colons are the “namespace” operator, which lets you explore a package’s functions by typing the package name first.
Note that the repos = argument selects a repository to download the R package from. Usually, you only need to specify the repository when working with a Quarto or Markdown file like this one.
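
One way to answer this concept check from code, as a minimal sketch that uses only base R: `requireNamespace()` reports whether a package is installed, and `search()` lists what is currently attached. The object names (`pkg`, `is_installed`, `is_attached`) are illustrative and not part of the original demo.

```r
# Name of the package to check (swap in any package name)
pkg <- "explore"

# TRUE if the package is installed and its namespace can be loaded
is_installed <- requireNamespace(pkg, quietly = TRUE)

# TRUE if the package has been attached with library() in this session
is_attached <- paste0("package:", pkg) %in% search()

cat("installed:", is_installed, "| attached:", is_attached, "\n")
```

If `is_installed` is FALSE, the install step did not succeed; if it is TRUE but `is_attached` is FALSE, the package is installed but `library(explore)` has not been run yet.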

Tidyverse

We will use the explore package later, but for now we will focus on another package called tidyverse. Start by installing and loading the package. It may take a while if this is your first time installing the package.

```r
# Install and load tidyverse
install.packages("tidyverse", repos = "http://cran.us.r-project.org")
library(tidyverse)
```

I already have these packages loaded, so the message I see is telling me that R decided to update my packages instead of installing them again. What is most interesting is that we are also getting a message that says “Attaching core tidyverse packages.”

This is because Tidyverse is actually a collection of packages that are used for working with data. For example, dplyr helps you select and filter data, while ggplot2 is used for data visualizations.

The last bit of information about tidyverse_conflicts means that other packages we already have loaded contain functions with the same names as functions in our tidyverse packages. For example, filter and lag also exist in the built-in stats package. “dplyr masks stats” means that unless we use the namespace operator for stats (e.g., stats::filter), R will default to using dplyr’s filter function.
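
As a rough illustration of masking, the sketch below calls each version of filter explicitly through the namespace operator. The object names are made up for this example, and the dplyr call is guarded so the block still runs where dplyr is not installed.

```r
# stats::filter() applies a linear filter to a numeric series;
# here it computes a 3-point moving average of 1, 2, 3, 4, 5
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
print(moving_avg)

# dplyr::filter() keeps the rows of a data frame that match a condition
if (requireNamespace("dplyr", quietly = TRUE)) {
  light_cars <- dplyr::filter(datasets::mtcars, wt < 2)
  print(light_cars)
}
```

Writing the package name in front of the function removes any doubt about which filter is being called, whatever order the packages were loaded in.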

Concept Check: Keep an eye on how many packages you have installed. Figure out what tasks you want to complete FIRST, and then install packages that can help you. Some packages do not work well with each other, which can turn into a headache for longer analyses.

Loading the Dataset

We have to load our dataset before we can work with it. Thankfully, our mtcars dataset is built into R through the datasets package, which is a collection of datasets in R.
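
A minimal sketch of this step, assuming we only want a quick look at the data: the namespace operator pulls mtcars from the datasets package, and `head()` is used here just to keep the printed output short.

```r
# Print the first six rows of mtcars straight from the datasets package
first_rows <- head(datasets::mtcars)
first_rows
```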

This opens up the dataset, but we can’t do anything with the data until we assign it to a data object. We could just write out (datasets::mtcars) every time we want to use the dataset, but that is time-consuming. Creating data objects saves time, helps create more efficient code, and also helps with file management by making sure you always have an original copy of the data available.

We will assign our dataset to a data object using the assignment operator ( <- )
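
The assignment might look like the sketch below. The object name `mtcars` is chosen to match the name used in the later code in this demo; `dim()` and `str()` are optional checks added for illustration.

```r
# Copy the built-in dataset into an object in our own workspace
mtcars <- datasets::mtcars

# Check the shape: number of rows (cars) and columns (variables)
dim(mtcars)

# Check the type of each variable
str(mtcars)
```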

Data Analysis

State the Hypothesis

I’ve been reading the 1974 issue of Motor Trend magazine and saw that some cars are way heavier than others. I’m also curious about what gas mileage was like for cars back then. My thinking is that the weight (wt) of the car has some relationship with gas mileage (mpg), with the assumption that a heavier car needs to burn MORE gas to move the same distance as a lighter car. So my hypotheses are:

Null hypothesis: There is no relationship between weight and mpg

Alternative hypothesis: There is a relationship between weight and mpg

Alternative hypothesis 2: If there IS a relationship, then I think that cars with more weight would have worse gas mileage than cars with less weight.

This means that we think mpg is dependent on weight. So mpg is our y variable (dependent/target/outcome) and wt is our x variable (independent/feature/predictor).

Exploratory Data Analysis

First we need to explore our data to understand what data types we are working with. We can do that by plotting a chart of the distribution and calculating summary statistics.

Selecting and Piping Variables

The variable selection operator ( $ ) helps you choose variables or columns from your dataset.

The variable selection operator lets us choose variables by name, similar to how the namespace operator helps us search packages.

You can use dplyr to select and filter variables based on their names (for columns) and values (for rows)

You can also pipe lines of code together using dplyr’s pipe operator ( %>% )

```r
# Select the wt and mpg columns
# (the pipe passes mtcars into select() without rewriting the dataset name)
cars <- mtcars %>% select(wt, mpg)

# Filter to rows where mpg is greater than 10
mpg.filter.gt10 <- mtcars %>% filter(mpg > 10)

# Filter to rows where wt is less than or equal to 4
wt.filter.lte4 <- mtcars %>% filter(wt <= 4)
```
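
For comparison, here is a base R sketch of the variable selector ( $ ) and of the same column selection and row filter without dplyr. The object names ending in `_base` are illustrative only.

```r
mtcars <- datasets::mtcars

# $ pulls a single column out of the data frame as a vector
mpg_values <- mtcars$mpg
head(mpg_values)

# Base R equivalent of select(wt, mpg): index columns by name
cars_base <- mtcars[, c("wt", "mpg")]

# Base R equivalent of filter(wt <= 4): index rows by a condition
wt_lte4_base <- mtcars[mtcars$wt <= 4, ]
```

Both approaches return the same rows and columns; the dplyr version tends to read more easily once several steps are chained together.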

Plot the Distribution

Since both our variables are numeric and include decimals, I would think this is a continuous data type or continuous variable. We use histograms to plot the distribution of continuous variables.
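
A minimal sketch of the two histograms using base R’s `hist()`; the titles and axis labels are my own wording. The wt variable is recorded in units of 1,000 lbs.

```r
mtcars <- datasets::mtcars

# Histogram of miles per gallon
hist(mtcars$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")

# Histogram of weight
hist(mtcars$wt, main = "Distribution of wt", xlab = "Weight (1,000 lbs)")
```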

Concept Check: Our mpg variable checks out, but our wt variable doesn’t seem to follow the bell curve shape we expect from a normal distribution of continuous data. Let’s hold that thought for now, and see what our summary statistics tell us.

Summary Statistics

We can calculate statistics of central tendency (e.g., means and medians) of our data using the summary function that is built into R for both variables
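
The calls might look like this sketch; the object names `mpg_summary` and `wt_summary` are illustrative. For a numeric variable, `summary()` returns the minimum, first quartile, median, mean, third quartile, and maximum.

```r
mtcars <- datasets::mtcars

# Summary statistics for miles per gallon
mpg_summary <- summary(mtcars$mpg)
mpg_summary

# Summary statistics for weight (in units of 1,000 lbs)
wt_summary <- summary(mtcars$wt)
wt_summary
```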

Distribution

Again, the miles per gallon variable still checks out, but the weight variable has such a narrow range that we may want to consider transforming this into a different data type or variable type due to the distribution. Let’s hold off on that for now. We will keep going so that we have a model to compare.

Key Takeaway: Understanding the distribution of your data types will help you understand if data transformations, manipulations, or changes to your analysis need to be made.

Correlations and Associations

So again, we’ll work with these being continuous variables for now and run correlation analysis to see if there is a relationship, how strong it is, and what direction it is moving (e.g., positive or negative).

We can do that using the cor.test function from the stats package. Since we have two continuous data types (for now) we will use Pearson’s correlation

```r
# Run Pearson's correlation to see if there is a relationship
cor.test(cars$wt, cars$mpg, method = "pearson", conf.level = 0.95)
```

That’s a LOT of output - let’s break that down:

Data = the variables and dataset used in the correlation test
t = the test statistic for the correlation; the further it is from 0, the stronger the evidence against the null hypothesis. It can also be compared against a traditional critical value table to find the p-value
df = degrees of freedom; for Pearson’s correlation this is the number of observations minus 2, so 30 for our 32 cars (simple explanation here: https://www.reddit.com/r/explainlikeimfive/comments/19d3opt/eli5_degrees_of_freedom/)
p-value = the probability of seeing a correlation at least this strong in a sample like ours if the null hypothesis (no relationship) were true; a small p-value is evidence against the null hypothesis
Alternative hypothesis: the hypothesis being tested against the null of no relationship; here, that the true correlation is not equal to 0
95% confidence interval - the range of plausible values for the true correlation given this sample, here roughly -0.93 to -0.74. If we repeated the sampling many times, about 95% of intervals built this way would contain the true correlation
cor: the sample estimate of the r-value from Pearson’s correlation.
1. The correlation is negative, meaning that as weight goes up, miles per gallon goes down
2. The correlation is large in magnitude (r ≈ -0.87), so the relationship between weight and miles per gallon is strong
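
Each piece of the output described above can also be pulled out of the test result with the variable selector ( $ ). This is a sketch that stores the result in an object I have called `ct` and works on mtcars directly so that it runs on its own.

```r
mtcars <- datasets::mtcars

# Store the test result instead of only printing it
ct <- cor.test(mtcars$wt, mtcars$mpg, method = "pearson", conf.level = 0.95)

ct$statistic  # t statistic
ct$parameter  # degrees of freedom (number of observations minus 2)
ct$p.value    # p-value
ct$conf.int   # 95% confidence interval for the correlation
ct$estimate   # sample correlation (r)
```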

Putting it all together: Our hypothesis seems to line up so far! Heavier cars seem to get lower gas mileage. We should expect to see this in our linear regression model too.

Concept Check: The correlation you use will change based on your data types and their distribution. Pearson’s may not be appropriate for all data, and you will want to explore Spearman’s, Kendall, and others as appropriate.
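
As a sketch of what switching methods looks like, `cor.test()` accepts `method = "spearman"` or `method = "kendall"`. I have set `exact = FALSE` as an assumption on my part, because mtcars contains tied values and exact p-values cannot be computed with ties.

```r
mtcars <- datasets::mtcars

# Rank-based alternatives to Pearson's correlation
spearman <- cor.test(mtcars$wt, mtcars$mpg, method = "spearman", exact = FALSE)
kendall <- cor.test(mtcars$wt, mtcars$mpg, method = "kendall", exact = FALSE)

spearman$estimate  # Spearman's rho
kendall$estimate   # Kendall's tau
```

Both are rank-based, so they measure whether mpg tends to fall as wt rises without assuming the relationship is a straight line.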

Linear Regression

Thankfully, we can build our linear regression model in base R using the stats package. The stats package has a lot of functions, so to make sure I’m using the right one, I can type stats:: and search for lm.

I always forget how to build the model formula so I should check the documentation
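
Checking the documentation might look like the sketch below. The last line is an extra I have added: it lists the first two arguments of lm without opening the help page.

```r
# Open the help page for a function whose name you already know
?lm

# Search across help pages when you only know a keyword (left commented out)
# ??"linear model"

# Peek at the first two arguments of lm()
lm_args <- names(formals(lm))[1:2]
lm_args
```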

I recommend using ( ? ) vs ( ?? ) if you already know what specific function you are using. We see that the bare minimum the documentation is asking for is:

A y ~ x formula shape: So our dependent variable goes first followed by a tilde ( ~ ) and then our independent variable last (mpg ~ wt)
The dataset our model variables are coming from

We have all we need for our model so let’s build it, store it in a data object, and inspect it

```r
# Build the model, store it in an object, and print it
lm1 <- lm(mpg ~ wt, data = cars)
lm1
```

So we have the call that was used to build our model, and now we have the coefficients, which tell us that:

When car weight is 0, miles per gallon is expected to be 37.285 (the intercept)

As wt increases by 1 (wt is recorded in units of 1,000 lbs), miles per gallon decreases by 5.344, which lines up with what we’ve seen
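
One way to check this reading of the coefficients is to predict mpg for two weights that are one unit apart. This is a sketch that refits the model on mtcars directly so it runs on its own; `coefs` and `pred` are illustrative names.

```r
mtcars <- datasets::mtcars
lm1 <- lm(mpg ~ wt, data = mtcars)

# Intercept and slope
coefs <- coef(lm1)
coefs

# Predicted mpg for cars weighing 3,000 and 4,000 lbs (wt = 3 and wt = 4)
pred <- predict(lm1, newdata = data.frame(wt = c(3, 4)))
pred

# The difference between the two predictions equals the wt coefficient
diff(pred)
```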

But that’s not all the information we expect - so let’s return to our summary function to see how our model actually performed
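
The summary call could be sketched as follows, again refitting on mtcars so the block is self-contained; the name `lm1_summary` is illustrative.

```r
mtcars <- datasets::mtcars
lm1 <- lm(mpg ~ wt, data = mtcars)

# Full model output: call, residuals, coefficients, and fit statistics
lm1_summary <- summary(lm1)
lm1_summary
```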

Now we see our full model output and again we’ll break it down

Call includes the data and variables used to build our model
The residuals are the differences between the observed value of mpg for each car and the value of mpg the model predicts from that car’s wt (they can be used as a measure of model error)
The coefficients for the intercept and the wt variable, alongside each one’s standard error, t-statistic, and p-value
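
To make the residual definition concrete, this sketch recomputes the residuals by hand as observed mpg minus fitted mpg and compares them with what the model stores. The names `res_manual` and `coef_table` are illustrative.

```r
mtcars <- datasets::mtcars
lm1 <- lm(mpg ~ wt, data = mtcars)

# Residual = observed mpg minus the mpg predicted by the model
res_manual <- mtcars$mpg - fitted(lm1)

# Compare with the residuals stored in the model (TRUE if they match)
all.equal(unname(res_manual), unname(residuals(lm1)))

# Coefficient table: estimate, standard error, t value, p-value
coef_table <- coef(summary(lm1))
coef_table
```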

The next few items make the most sense for linear models with more than one independent variable (multivariate analyses) or when comparing multiple models.

Residual standard error of the model - the typical size of the residuals, in the units of y (mpg here); can be used to judge how closely the fitted values match the actual values in the data. A smaller RSE means a more precise model, but it is most informative when you have another model to compare against
Multiple R-Squared - a goodness-of-fit measure: the variance explained by the model divided by the total variance observed in the y variable; can be used to compare multiple models
Adjusted R-Squared - a goodness-of-fit measure that penalizes R-squared for each added variable, which helps flag when you’ve included too many variables in your model; can be used to compare multiple models
F-statistic - a test of whether the model explains the dependent variable (y) better than a model with no independent variables; can be used to compare multiple models
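
These fit statistics can be pulled out of the summary object individually. The sketch below also checks a property specific to a model with one predictor: the multiple R-squared equals the squared Pearson correlation between wt and mpg. The object name `s` is illustrative.

```r
mtcars <- datasets::mtcars
s <- summary(lm(mpg ~ wt, data = mtcars))

s$sigma          # residual standard error
s$r.squared      # multiple R-squared
s$adj.r.squared  # adjusted R-squared
s$fstatistic     # F-statistic with its degrees of freedom

# With a single predictor, R-squared is the squared correlation
all.equal(s$r.squared, cor(mtcars$wt, mtcars$mpg)^2)
```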

Recap + Next Steps

We’ve used tidyverse for data manipulation
We’ve explored, hypothesized, and modeled our data
We built a linear regression

We covered the following packages and functions

What’s the function?	What’s the package?	What does it do?
install.packages	utils (base)	Installs a package into R so that you can load it. R remembers the packages you install across sessions, but if an install.packages call sits inside a markdown document like RMarkdown or Quarto, you will need to specify the repository or CRAN mirror you want to download it from
library	base	Loads a package so that you can use its functions without calling the package every time
mtcars	datasets	Loads the mtcars dataset into our R session
dplyr	tidyverse	Used for manipulating data in R, similar to Power BI, pandas/polars in Python, and corresponding SQL functions
Piping operator (%>%)	dplyr (re-exported from magrittr)	Pipe operator that lets us chain steps so the code runs in the order it is written; it passes the result on the left into the function on the right, so it can also be used to select variables without retyping the dataset name
select	dplyr	Keeps or drops columns with our variables of choice
filter	dplyr	Keeps or drops rows with our values of choice
lm	stats	Builds a basic linear regression model. Use summary to see the full model output. Use the variable selector ( $ ) to select individual model components
cor.test	stats	Calculates the correlation and the hypothesis test on a pair of variables, the default is Pearson for two continuous variables
summary	base	Can be used to describe data objects. On variables, it calculates summary statistics. On models, it shows the full output
head, tail	utils	View the first few rows at the top (head) or bottom (tail) of the dataset
View	utils	View the whole dataset as rows and columns (like an Excel table)
Variable selector ( $ )	base	Pulls up a list of components for a data object. Datasets - pulls variables. Models - pulls components
Namespace operator ( :: )	base	Calls a function from a specific package (e.g., stats::filter). In RStudio, typing a package name followed by :: also lets you confirm the package is available and search the functions within that package
Assignment operator ( <- )	base	Assigns a value (such as a dataset or a model) to a named object

Next Steps

In the next sessions we’ll build on the model by:

Running the analysis with a transformed weight variable
Running a logistic regression with a binary outcome
Comparing model performance
Automated exploration techniques with the explore and EDA packages

Helpful Resources

R for Data Science and Introduction to Tidyverse: Working with R from start to finish

Handbook for People Analytics - R Statistical Foundations: Similar to the above but includes more background on statistical foundations

Dplyr Cheat Sheet: Note that this uses R’s native pipe ( |> ) for piping instead of Tidyverse’s ( %>% ) operator.
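
For reference, here is a small sketch of the native pipe using only base R functions. It assumes R 4.1.0 or later, where `|>` was introduced; the object name `light_cars` is illustrative.

```r
# Native pipe: pass mtcars into subset(), then pass that result into head()
light_cars <- datasets::mtcars |>
  subset(wt <= 4) |>
  head(3)

light_cars
```

In simple chains like this one, `|>` and `%>%` behave the same way, so the cheat sheet examples carry over to the code in this demo.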