Basic Linear Regression in R

Michelle Knopp

October 9, 2024

Introduction

Let’s see how R can help us fit and interpret a simple linear regression.

We are going to use R’s pre-loaded airquality data set, which contains air quality measurements from New York. Specifically, we are going to look at the relationship between wind speed (our independent variable) and temperature (our dependent variable). Note that wind speed in this data is measured in MPH (miles per hour) and temperature is measured in degrees Fahrenheit.

  1. We are going to start by taking a look at a scatterplot.

  2. Next we’ll build a regression formula and make a temperature prediction based on our formula.

  3. Then we use summary information to see if our model is statistically significant.

First we need to load the tidyverse package, which includes the ggplot2 package that we need for our scatterplots.

We also want to load our dataset, “airquality”.

## Load tidyverse package

library(tidyverse)
# ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
# ✔ forcats   1.0.0     ✔ readr     2.1.5
# ✔ ggplot2   3.5.1     ✔ stringr   1.5.1
# ✔ lubridate 1.9.3     ✔ tibble    3.2.1
# ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
# ✖ dplyr::filter() masks stats::filter()
# ✖ dplyr::lag()    masks stats::lag()
# ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Import airquality data

data("airquality")

Now let’s take a look at the airquality data.

head(airquality)
#   Ozone Solar.R Wind Temp Month Day
# 1    41     190  7.4   67     5   1
# 2    36     118  8.0   72     5   2
# 3    12     149 12.6   74     5   3
# 4    18     313 11.5   62     5   4
# 5    NA      NA 14.3   56     5   5
# 6    28      NA 14.9   66     5   6
## head() shows us the first 6 rows of our data. We are going to look at the wind and temperature columns. The output also tells us that wind speed is labeled "Wind" and temperature is labeled "Temp" within the dataset.
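
Before plotting, it can help to confirm that the two columns we care about have no missing values, since head() shows NA entries in other columns. The following is a minimal sketch using base R only; the names wind_temp and missing_counts are ours, chosen for illustration.

```r
# Load the built-in dataset (ships with base R's datasets package)
data("airquality")

# Keep only the two columns used in this tutorial
wind_temp <- airquality[, c("Wind", "Temp")]

# Count missing values in each column
missing_counts <- colSums(is.na(wind_temp))
print(missing_counts)

# Minimum, quartiles, mean, and maximum for each column
print(summary(wind_temp))
```

If both counts are zero, every row can be used in the regression without any extra cleaning.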

Step 1: Visualize our Data

Remember, we want to see if there is a relationship between wind speed (x) and temperature (y). Let’s build a scatterplot to see this relationship.

## We use ggplot to build our scatterplot.

## ggplot(datasetname, aes(x = wind speed, y = temperature)) + geom_point()

ggplot(airquality, aes(x = Wind, y = Temp)) + geom_point()

## This produces a basic scatterplot of our selected variables. In looking at this scatterplot we can see a "somewhat" linear relationship between wind speed and temperature: temperature tends to fall as wind speed rises.
## Now let's add a regression line to our scatterplot. To do this we are going to add the following code to our scatterplot code. Note: lm means linear model in R. se = FALSE removes the shaded confidence band around the line.

  ## + geom_smooth(method = lm, se = FALSE)


ggplot(airquality, aes(x = Wind, y = Temp)) + geom_point() + geom_smooth(method = lm, se = FALSE)
# `geom_smooth()` using formula = 'y ~ x'

## The blue line is our linear regression line.
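
To put a number on what the scatterplot suggests, one option is the Pearson correlation coefficient. This is a small sketch with base R's cor(); the variable name r is ours. A value near -1 or 1 indicates a strong linear relationship, and a value near 0 indicates a weak one.

```r
# Load the built-in dataset
data("airquality")

# Pearson correlation between wind speed and temperature
r <- cor(airquality$Wind, airquality$Temp)
print(round(r, 3))

# Squaring r gives the share of variation in Temp explained by a
# straight-line relationship with Wind
print(round(r^2, 4))
```

A negative value here matches the downward slope of the regression line in the plot.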

Step 2: Build a Linear Regression Formula.

The formula for a linear regression model is:

y = c + b(x)

y = predicted value of our dependent variable

c = constant or intercept

b = coefficient or slope of x

x = value of our independent variable.

## Finding this model formula is simple in R. We use the following code:
  
## lm(dependentvariable ~ independentvariable, data = dataset)

lm(Temp ~ Wind, data = airquality)
# 
# Call:
# lm(formula = Temp ~ Wind, data = airquality)
# 
# Coefficients:
# (Intercept)         Wind  
#       90.13        -1.23
## We can use this output to create our formula.

## Our regression formula will then be:

##  y = 90.13 - 1.23(x)
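
Rather than copying the coefficients by hand, we can pull them out of the model object with coef(). This is a sketch; the names fit, intercept, and slope are ours and are used only for illustration.

```r
# Load the built-in dataset and fit the same model as above
data("airquality")
fit <- lm(Temp ~ Wind, data = airquality)

# coef() returns a named numeric vector of the estimated coefficients
b <- coef(fit)
intercept <- b[["(Intercept)"]]
slope <- b[["Wind"]]

# Print the regression formula with the estimates rounded to 2 decimals
cat(sprintf("y = %.2f + (%.2f)(x)\n", intercept, slope))
```

Working from the stored coefficients avoids rounding errors that can creep in when numbers are retyped.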

Now, let’s use our formula to estimate the predicted temperature when wind speed is 11 MPH. All we do is plug in 11 for x. We can do this in R.

x <- 11

90.13 - 1.23*x
# [1] 76.6
## So, this regression formula estimates that the temperature will be about 76.6 degrees F when wind speed is 11 MPH.
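
The same estimate can be obtained without typing the coefficients, using R's predict() function on a fitted model. This is a sketch; the names fit, new_wind, and pred are ours. Note that the new data frame must use the same column name as the model's predictor, Wind.

```r
# Load the built-in dataset and fit the model
data("airquality")
fit <- lm(Temp ~ Wind, data = airquality)

# New data must use the predictor's column name from the model
new_wind <- data.frame(Wind = 11)

# Point estimate of temperature at 11 MPH
pred <- predict(fit, newdata = new_wind)
print(round(unname(pred), 1))

# Optional: a 95% prediction interval for a single day at 11 MPH
pred_interval <- predict(fit, newdata = new_wind, interval = "prediction")
print(pred_interval)
```

The point estimate should agree with the hand calculation above, and the prediction interval gives a sense of how far an individual day's temperature may fall from that estimate.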

Step 3: Let’s check for statistical significance.

Let’s use a model summary to check whether our model is statistically significant.

## Let's name our linear model. We'll use m to keep it simple.

m <- lm(Temp ~ Wind, data = airquality)


## Now let's look at our summary.

summary(m)
# 
# Call:
# lm(formula = Temp ~ Wind, data = airquality)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -23.291  -5.723   1.709   6.016  19.199 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  90.1349     2.0522  43.921  < 2e-16 ***
# Wind         -1.2305     0.1944  -6.331 2.64e-09 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 8.442 on 151 degrees of freedom
# Multiple R-squared:  0.2098,  Adjusted R-squared:  0.2045 
# F-statistic: 40.08 on 1 and 151 DF,  p-value: 2.642e-09
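## We can see by the 3 asterisks (***) after the p-values in the Coefficients table that the Wind coefficient is statistically significant (p = 2.64e-09), so we can reject the null hypothesis that there is no linear relationship between wind speed and temperature. Note that the Multiple R-squared of 0.2098 means wind speed explains only about 21% of the variation in temperature.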


## Additionally, we can see that this output summary also gives us the constant and slope that we used to find our regression formula. 
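
The individual numbers in the summary can also be extracted in code, which is handy for reporting. This is a sketch under our own naming (fit, s, wind_p, r_squared, ci); it also uses confint() to get a 95% confidence interval for the slope, which the summary output does not print.

```r
# Load the built-in dataset and fit the model
data("airquality")
fit <- lm(Temp ~ Wind, data = airquality)
s <- summary(fit)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(s)

# p-value for the Wind slope
wind_p <- coef_table["Wind", "Pr(>|t|)"]
print(wind_p)

# Share of variation in Temp explained by Wind
r_squared <- s$r.squared
print(round(r_squared, 4))

# 95% confidence interval for the slope
ci <- confint(fit, "Wind", level = 0.95)
print(ci)
```

If the whole confidence interval lies below zero, that agrees with the small p-value: the data support a negative slope.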

Conclusion

Hopefully, you now have a better understanding of how to use R and a dataset to run a simple linear regression analysis.

References

airquality data: Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. (1983) Graphical Methods for Data Analysis. Belmont, CA: Wadsworth.