Simple linear regression is a statistical method that models the relationship between two variables by fitting a linear equation to the data. The equation is \(y = B_0 + B_1x\), where \(y\) is the dependent variable, \(x\) is the independent variable, \(B_1\) is the slope, and \(B_0\) is the y-intercept (the value of \(y\) when \(x\) is 0).
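The usual way to fit this line is ordinary least squares, which picks the slope and intercept that minimize the sum of squared vertical distances between the data and the line. As a sketch of the standard closed-form estimates, where \(\bar{x}\) and \(\bar{y}\) are the sample means and \(n\) is the number of observations (notation introduced here for illustration):

\[
B_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad B_0 = \bar{y} - B_1 \bar{x}
\]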
To run simple linear regression in R, we can use the built-in function lm(). lm() takes a formula of the form \(y\) ~ \(x\), where, once again, \(y\) is the dependent variable and \(x\) is the independent variable, along with a data source that contains both columns. In this presentation, we will use lm() in two places: through geom_smooth() in a ggplot, and directly when we build a 3D plotly plot of the linear regression for our example problem.
As our example of simple linear regression, we will relate the time (in hours) that dough is given to rise to the height of the dough (in inches) after rising.
##   Hours Height
## 1  0.25   4.25
## 2  0.50   4.50
## 3  0.75   4.75
## 4  1.00   5.25
## 5  1.25   5.80
## 6  1.50   6.30
## 7  1.75   7.10
## 8  2.00   8.00
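The table above can be entered as a data frame and fit with lm() directly. This is a minimal sketch: the data frame name data matches the name the slides use for the data set, while fit is a name chosen here for illustration.

```r
# Dough data from the table above
data <- data.frame(
  Hours  = c(0.25, 0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00),
  Height = c(4.25, 4.50, 4.75, 5.25, 5.80, 6.30, 7.10, 8.00)
)

# Fit Height as a linear function of Hours
fit <- lm(Height ~ Hours, data = data)

# Estimated intercept (B_0) and slope (B_1)
coef(fit)
```

For this data the estimates work out to roughly \(B_0 \approx 3.36\) and \(B_1 \approx 2.12\), that is, about 2.12 inches of additional height per hour of rising.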
Here is the ggplot of the data set after applying simple linear regression.
## `geom_smooth()` using formula = 'y ~ x'
Here is the R code used to create the plot on the previous slide:
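library(ggplot2)

# Plot the raw data and overlay the fitted regression line
ggplot(data, aes(x = Hours, y = Height)) +
  geom_line() +                                 # connect the observed points
  geom_point() +                                # draw the observed points
  geom_smooth(method = "lm", color = "blue") +  # fit and draw the linear regression
  labs(title = "Height of Dough (in inches) vs Hours Risen", x = "Hours", y = "Height")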
To create the graph, you need to supply the data source (our data set
was already named ‘data’), the columns in the data set that the x and y
axes are based on, geom_line() to connect the points drawn by
geom_point(), geom_smooth() to fit the linear regression to the data set,
and finally labs() to add the title, x-axis label, and y-axis label of the
graph.
Here is a 3D visualization, built with plotly, of hours risen vs height of dough in inches, together with the fitted linear regression.
Here is the R code used to create this plot:
library(plotly)

# Fit the model and store the fitted values as a new column
linearRegression <- lm(Height ~ Hours, data = data)
data$linearRegression <- predict(linearRegression)

plot_ly(data) %>%
  # Observed heights
  add_trace(x = ~Hours, y = ~linearRegression, z = ~Height, type = 'scatter3d',
            name = 'Hours vs Height', mode = 'lines', line = list(color = 'black')) %>%
  # Fitted values from the regression
  add_trace(x = ~Hours, y = ~linearRegression, z = ~linearRegression, type = 'scatter3d',
            name = 'Linear Regression', mode = 'lines', line = list(color = 'blue')) %>%
  layout(title = "Linear Regression of Height of Dough (in inches) vs Hours Risen",
         scene = list(xaxis = list(title = 'Hours'),
                      yaxis = list(title = 'Height'),
                      zaxis = list(title = 'Height')))

As we can see, we used lm() to fit the data (as geom_smooth() did in the ggplot earlier) and created a column in the data set called linearRegression to store the fitted values returned by predict(). Next, in the plotly graph, we add one trace for Hours vs Height and one trace for the linear regression we calculated earlier. Finally, we call layout() to give the graph a title and labels for the x, y, and z axes.
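To make the role of predict() concrete, the sketch below rebuilds the data frame from the table, checks that the stored fitted values equal \(B_0 + B_1 \cdot\) Hours, and predicts the height at a rise time that is not in the table. The names fit, manual, and newDough, and the value of 1.1 hours, are chosen here for illustration only.

```r
# Dough data from the table
data <- data.frame(
  Hours  = c(0.25, 0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00),
  Height = c(4.25, 4.50, 4.75, 5.25, 5.80, 6.30, 7.10, 8.00)
)

fit <- lm(Height ~ Hours, data = data)

# Fitted values, one per row (what the slides store in data$linearRegression)
data$linearRegression <- predict(fit)

# The same values computed by hand from the estimated coefficients
b <- coef(fit)
manual <- b[["(Intercept)"]] + b[["Hours"]] * data$Hours
all.equal(unname(data$linearRegression), manual)

# Predicted height for a hypothetical rise time inside the observed range
newDough <- data.frame(Hours = 1.1)
predict(fit, newdata = newDough)
```

Keeping the new rise time between 0.25 and 2 hours avoids extrapolating beyond the data the line was fit to.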
In conclusion, simple linear regression (\(y = B_0 + B_1x\)) is an important part of statistics. The ability to describe the relationship between an independent variable and a dependent variable (in our case, hours risen vs height of dough) with a single fitted line is invaluable.