Simple linear regression tests whether a linear relationship exists between two variables and, if so, estimates the dependent variable from values of the independent variable. For R's built-in trees data set, the model is:
\(\text{Volume} = \beta_0 + (\beta_1\cdot\text{Girth}) + \varepsilon\)
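where Volume is the dependent variable (y), Girth is the independent variable (x), \(\beta_0\) is the y-intercept, \(\beta_1\) is the slope coefficient, and \(\varepsilon\) is the error term.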
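Before fitting a line, it can help to check whether the two variables appear related at all. The sketch below is one possible check, not part of the original walkthrough: it assumes the built-in trees data set and uses the correlation coefficient as a rough indicator of a linear relationship.

```r
# Assumption: the built-in trees data set (datasets package) is the data in use.
data(trees)

# Pearson correlation between the independent and dependent variable
r <- cor(trees$Girth, trees$Volume)

# cor.test() also reports a p-value for the null hypothesis of zero correlation
ct <- cor.test(trees$Girth, trees$Volume)

print(r)
print(ct$p.value)
```

A correlation close to 1 or -1 with a small p-value suggests a linear model is worth fitting.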
Finding Values
Calling R's linear model function, lm(), estimates \(\beta_0\) and \(\beta_1\) for a data set:
lm(Volume ~ Girth, data = trees)
Call:
lm(formula = Volume ~ Girth, data = trees)
Coefficients:
(Intercept)        Girth
    -36.943        5.066
\(\beta_0 = \text{y-intercept} = -36.943\)
\(\beta_1 = \text{slope (coefficient of Girth)} = 5.066\)
\(\text{Volume} = \beta_0 + (\beta_1\cdot\text{Girth}) + \varepsilon = -36.943 + (5.066\cdot\text{Girth}) + \varepsilon\)
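The fitted equation can be used to estimate Volume for a given Girth. The following is a minimal sketch: the girth value of 12 is a hypothetical input chosen for illustration, and the model is refit here so the block runs on its own.

```r
# Refit the model from the text on the built-in trees data
treeSize <- lm(Volume ~ Girth, data = trees)

# Hypothetical new tree with a girth of 12 (same units as the data)
newTree <- data.frame(Girth = 12)

# Estimate from predict() ...
predicted <- predict(treeSize, newdata = newTree)

# ... and the same value computed by hand from the coefficients
b <- coef(treeSize)
byHand <- b[["(Intercept)"]] + b[["Girth"]] * 12

print(predicted)  # roughly 23.8
print(byHand)
```

Both approaches apply the same equation; predict() is more convenient when estimating for many values at once.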
Summary Function
In R, calling summary() on a linear model reports more detail, including the standard error of each coefficient and the residual standard error of the model:
treeSize <- lm(Volume ~ Girth, data = trees)
summary(treeSize)
Call:
lm(formula = Volume ~ Girth, data = trees)
Residuals:
   Min     1Q Median     3Q    Max
-8.065 -3.107  0.152  3.495  9.587
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
Girth         5.0659     0.2474   20.48  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.252 on 29 degrees of freedom
Multiple R-squared: 0.9353, Adjusted R-squared: 0.9331
F-statistic: 419.4 on 1 and 29 DF, p-value: < 2.2e-16
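The numbers in the summary can also be pulled out programmatically. The sketch below, which assumes the same model refit on the trees data, shows how the t value for Girth follows from the estimate and its standard error, and how to extract the residual standard error and R-squared.

```r
treeSize <- lm(Volume ~ Girth, data = trees)

# Coefficient table from the summary, as a numeric matrix
coefTable <- coef(summary(treeSize))

# t value = Estimate / Std. Error
tGirth <- coefTable["Girth", "Estimate"] / coefTable["Girth", "Std. Error"]

# Residual standard error and multiple R-squared
rse <- sigma(treeSize)
r2 <- summary(treeSize)$r.squared

print(round(tGirth, 2))
print(round(rse, 3))
print(round(r2, 4))
```

The printed values should agree with the t value, Residual standard error, and Multiple R-squared lines in the summary output.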
Plotting the data with Plotly
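R can plot the data and the fitted line to help visualize the model. This example uses the plotly package and the treeSize model fitted above.
library(plotly)

# Independent and dependent variables as plain vectors
x <- trees$Girth
y <- trees$Volume
# Axis settings
xax <- list(
title = "Girth"
)
yax <- list(
title = "Volume",
range = c(0, 80)
)
# Scatter plot of the data with the fitted regression line overlaid
treePlot <- plot_ly(trees, x = x, y = y, type = "scatter",
mode = "markers", name = "data") %>%
add_lines(x = x, y = fitted(treeSize), name = "fitted") %>%
layout(xaxis = xax, yaxis = yax)
treePlot  # display the plot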
Plotly Code Plotted
Using ggplot
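ggplot2 is another option in R for visualizing data.
library(ggplot2)

# Regression line without the standard error band
TTplot <- ggplot(data = trees, aes(x = Girth, y = Volume)) +
  geom_smooth(method = "lm", se = FALSE)
# Add the data points, colored by volume and sized by girth
TTplot + geom_point(aes(color = factor(Volume), size = Girth), show.legend = FALSE)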
`geom_smooth()` using formula = 'y ~ x'
What is \(\varepsilon\)?
\(\varepsilon\) is the error term: the amount by which an observed Volume deviates from the value given by the regression equation. Because simple linear regression fits a single line to many data points, the error term accounts for the variation of individual points around that line. Its typical size is estimated by the residual standard error, 4.252 in the summary output above.
The residual standard error works like a buffer zone around the regression line: most data points fall within a few multiples of it above or below the line, and few fall outside.
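To make the deviation from the line concrete, the residuals can be inspected directly. This is a minimal sketch assuming the same model on the trees data; the two-standard-error cutoff is an illustrative choice, not something the original text specifies.

```r
treeSize <- lm(Volume ~ Girth, data = trees)

# Residuals: observed Volume minus fitted Volume, one per tree
res <- residuals(treeSize)

# Residual standard error: sqrt(sum of squared residuals / residual degrees of freedom)
rse <- sqrt(sum(res^2) / df.residual(treeSize))

# Share of trees whose residual lies within two residual standard errors of the line
within2 <- mean(abs(res) <= 2 * rse)

print(round(rse, 3))
print(within2)
```

The hand-computed value should match the Residual standard error reported by summary().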
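ggplot can also shade a band around the line, shown in the graph below, when se = TRUE is passed to geom_smooth(). That band is a confidence interval (95% by default) for the fitted line, computed from the standard error of the fitted values, so it is typically narrower than the spread of the individual points.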
ggplot with a Standard Error buffer
`geom_smooth()` using formula = 'y ~ x'
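The limits of a shaded band like the one above can also be computed numerically. The sketch below assumes the band corresponds to a 95% confidence interval from predict() and compares it with the wider prediction interval for individual trees; the variable names are illustrative.

```r
treeSize <- lm(Volume ~ Girth, data = trees)

# 95% confidence interval for the fitted line at each observed Girth
confBand <- predict(treeSize, interval = "confidence", level = 0.95)

# 95% prediction interval for individual trees, for comparison
# (predict() warns that these refer to future responses; suppressed here)
predBand <- suppressWarnings(
  predict(treeSize, interval = "prediction", level = 0.95)
)

confWidth <- confBand[, "upr"] - confBand[, "lwr"]
predWidth <- predBand[, "upr"] - predBand[, "lwr"]

print(head(confBand, 3))
print(all(predWidth > confWidth))
```

The confidence band describes uncertainty in the line itself, while the prediction interval also includes the scatter of individual points, which is why it is wider.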