Let’s take a look to see how R can help us analyze simple linear regression data.
We are going to use R’s built-in airquality data set, which contains daily air quality measurements from New York. Specifically, we are going to look at the relationship between wind speed (our independent variable) and temperature (our dependent variable). Note that wind speed in this data set is measured in miles per hour (MPH) and temperature is measured in degrees Fahrenheit.
We are going to start by taking a look at a scatterplot.
Next we’ll build a regression formula and make a temperature prediction based on our formula.
Then we’ll use the model summary to see whether our model is statistically significant.
First we need to load the tidyverse package, which includes ggplot2, the package we’ll use for our plots.
We also want to load our dataset, “airquality”.
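The code for this step is not shown, only its output. A minimal sketch, assuming the usual calls were used (airquality already ships with base R, so the `data()` call is just a way to make the load explicit):

```r
# Load the tidyverse (this attaches ggplot2, which we use for plotting)
library(tidyverse)

# airquality is part of base R's datasets package; data() loads it explicitly
data("airquality")
```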
# ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
# ✔ forcats 1.0.0 ✔ readr 2.1.5
# ✔ ggplot2 3.5.1 ✔ stringr 1.5.1
# ✔ lubridate 1.9.3 ✔ tibble 3.2.1
# ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
# ✖ dplyr::filter() masks stats::filter()
# ✖ dplyr::lag() masks stats::lag()
# ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Now let’s take a look at the airquality data.
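The output below looks like the first six rows of the data set, which is what `head()` returns by default. Assuming that was the call used, it would be:

```r
# Show the first six rows of the data set
head(airquality)
```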
# Ozone Solar.R Wind Temp Month Day
# 1 41 190 7.4 67 5 1
# 2 36 118 8.0 72 5 2
# 3 12 149 12.6 74 5 3
# 4 18 313 11.5 62 5 4
# 5 NA NA 14.3 56 5 5
# 6 28 NA 14.9 66 5 6
Remember, we want to see if there is a relationship between wind speed (x) and temperature (y). Let’s build a scatterplot to see this relationship.
## We use ggplot() to build our scatterplot.
## ggplot(datasetname, aes(x = wind speed, y = temperature)) + geom_point()
ggplot(airquality, aes(x = Wind, y = Temp)) + geom_point()
## This produces a basic scatterplot of our selected variables. Looking at this scatterplot, we can see a "somewhat" linear relationship between wind speed and temperature.
## Now let's add a regression line to our scatterplot. To do this, we add the following code to our scatterplot. Note: lm means linear model in R, and se = FALSE removes the error ribbon (the confidence band) around the line.
## + geom_smooth(method = lm, se = FALSE)
ggplot(airquality, aes(x = Wind, y = Temp)) + geom_point() + geom_smooth(method = lm, se = FALSE)
# `geom_smooth()` using formula = 'y ~ x'
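As an aside that is not part of the original walkthrough, one way to put a number on "somewhat linear" is the Pearson correlation coefficient. The sketch below uses base R's `cor()`; the variable name `wind_temp_cor` is our own choice.

```r
# Pearson correlation between wind speed and temperature.
# Neither column has missing values, so no NA handling is needed here.
wind_temp_cor <- cor(airquality$Wind, airquality$Temp)
wind_temp_cor
```

A negative value means temperature tends to fall as wind speed rises, and a value closer to -1 or 1 means the points sit closer to a straight line.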
The formula for a linear regression model is:
y = c + b(x)
y = predicted value of our dependent variable
c = constant or intercept
b = coefficient or slope of x
x = value of our independent variable.
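To make the pieces of the formula concrete, here is a small sketch of it as an R function. The function name `linear_prediction` and the numbers passed to it are made up purely for illustration; they are not from the airquality data.

```r
# Generic straight-line formula: y = c + b(x)
# `intercept` plays the role of c and `slope` the role of b.
linear_prediction <- function(x, intercept, slope) {
  intercept + slope * x
}

# Made-up values for illustration only: c = 10, b = 2, x = 3
linear_prediction(x = 3, intercept = 10, slope = 2)  # 10 + 2 * 3 = 16
```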
## Finding this model formula is simple in R. We just use the following code:
## lm(dependentvariable ~ independentvariable, data = dataset)
lm(Temp ~ Wind, data = airquality)
#
# Call:
# lm(formula = Temp ~ Wind, data = airquality)
#
# Coefficients:
# (Intercept) Wind
# 90.13 -1.23
## We can use this output to create our formula.
## Our regression formula will then be:
## y = 90.13 - 1.23(x)
Now, let’s use our formula to estimate the predicted temperature when wind speed is 11 MPH. All we do is plug in 11 for x. We can do this in R.
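The calculation itself is not shown, only its result. A minimal sketch, assuming it was plain arithmetic on the rounded coefficients (the variable names here are our own):

```r
# Coefficients as rounded in the lm() output
intercept <- 90.13
slope <- -1.23
wind_speed <- 11  # MPH

# Plug the wind speed into y = c + b(x)
predicted_temp <- intercept + slope * wind_speed
predicted_temp

# Alternative: let R do the work with predict() on the fitted model.
# `fit` is a name chosen for this sketch; it uses the unrounded
# coefficients, so it should agree with the hand calculation up to rounding.
fit <- lm(Temp ~ Wind, data = airquality)
predict(fit, newdata = data.frame(Wind = wind_speed))
```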
# [1] 76.6
Let’s use a model summary to check whether our model is statistically significant.
## Let's name our linear model. We'll use m to keep it simple.
m <- lm(Temp ~ Wind, data = airquality)
## Now let's look at our summary.
summary(m)
#
# Call:
# lm(formula = Temp ~ Wind, data = airquality)
#
# Residuals:
# Min 1Q Median 3Q Max
# -23.291 -5.723 1.709 6.016 19.199
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 90.1349 2.0522 43.921 < 2e-16 ***
# Wind -1.2305 0.1944 -6.331 2.64e-09 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 8.442 on 151 degrees of freedom
# Multiple R-squared: 0.2098, Adjusted R-squared: 0.2045
# F-statistic: 40.08 on 1 and 151 DF, p-value: 2.642e-09
## We can see from the three asterisks (***) next to the p-values, which are far below 0.05 (2.64e-09 for Wind), that our model is statistically significant, so we can reject the null hypothesis that there is no relationship between wind speed and temperature.
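If we would rather check these numbers in code than read them off the printout, they can be pulled out of the summary object. This is a sketch only; the names `s`, `wind_p` and `r_squared` are our own, while `m` is the model fitted earlier (refit here so the block runs on its own).

```r
# Refit the model so this block is self-contained
m <- lm(Temp ~ Wind, data = airquality)
s <- summary(m)

# p-value for the Wind coefficient, taken from the coefficient table
wind_p <- s$coefficients["Wind", "Pr(>|t|)"]

# Multiple R-squared: the share of variance in Temp explained by Wind
r_squared <- s$r.squared

wind_p < 0.05  # TRUE when the slope is significant at the 5% level
r_squared
```

Note that statistical significance and fit are separate questions: the Multiple R-squared in the summary tells us how much of the variation in temperature the model accounts for.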
## Additionally, we can see that this summary output also gives us the constant and slope that we used to build our regression formula.
Hopefully, you now have a better understanding of how to use R and a data set to run a simple linear regression analysis.
airquality data: Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. (1983) Graphical Methods for Data Analysis. Belmont, CA: Wadsworth.