2024-11-15

What is Simple Linear Regression?

Simple linear regression models the relationship between two variables as a straight line. If a relationship exists, the fitted line can be used to estimate the dependent variable from values of the independent variable.



            \(\text{Volume} = \beta_0 + (\beta_1\cdot\text{Girth}) + \varepsilon\)

              x = independent variable (Girth)
              y = dependent variable (Volume)
              \(\beta_0\) = y-intercept, a constant
              \(\beta_1\) = coefficient/slope of the line
              \(\varepsilon\) = error term

Finding Values

Calling R's linear model function, lm(), on a data set estimates \(\beta_0\) and \(\beta_1\). The example uses trees, a data set that ships with R.

lm(Volume ~ Girth, data = trees)
Call:
lm(formula = Volume ~ Girth, data = trees)

Coefficients:
(Intercept)        Girth  
    -36.943        5.066  
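As a cross-check on the output above, the least-squares estimates for a single predictor have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below uses only base R; the names b0 and b1 are illustrative and not part of the original slides.

```r
# Closed-form least-squares estimates for one predictor (base R only)
x <- trees$Girth    # independent variable
y <- trees$Volume   # dependent variable

b1 <- cov(x, y) / var(x)        # slope
b0 <- mean(y) - b1 * mean(x)    # intercept

# Rounded for comparison with the coefficients printed by lm()
round(c(intercept = b0, slope = b1), 3)
```

The two numbers should agree with the (Intercept) and Girth coefficients reported by lm().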

\(\beta_0 = \text{y-intercept} = -36.943\)

\(\beta_1 = \text{coefficient (slope)} = 5.066\)

\(\text{Volume} = \beta_0 + (\beta_1\cdot\text{Girth}) + \varepsilon = -36.943 + (5.066\cdot\text{Girth}) + \varepsilon\)
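Once the coefficients are known, the equation can be used to estimate Volume for a given Girth. The sketch below does this twice, once by plugging values into the equation and once with predict(). The girth values 10 and 15 and the names treeFit, newTrees, byHand and byPredict are made up for illustration.

```r
# Fit the model, then estimate Volume for two hypothetical girths
treeFit <- lm(Volume ~ Girth, data = trees)
newTrees <- data.frame(Girth = c(10, 15))   # illustrative values, in inches

# 1. By hand: Volume = beta_0 + beta_1 * Girth (error term left out)
b <- coef(treeFit)
byHand <- b[["(Intercept)"]] + b[["Girth"]] * newTrees$Girth

# 2. With predict(), which applies the same equation
byPredict <- predict(treeFit, newdata = newTrees)

# Side-by-side comparison of the two estimates
cbind(newTrees, byHand, byPredict)
```

Both columns should match, since predict() evaluates the fitted line at the new Girth values.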

Summary Function

Calling summary() on a fitted linear model also reports the standard error of each coefficient and the residual standard error of the model.

treeSize <- lm(Volume ~ Girth, data = trees)
summary(treeSize)
Call:
lm(formula = Volume ~ Girth, data = trees)

Residuals:
   Min     1Q Median     3Q    Max 
-8.065 -3.107  0.152  3.495  9.587 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
Girth         5.0659     0.2474   20.48  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.252 on 29 degrees of freedom
Multiple R-squared:  0.9353,    Adjusted R-squared:  0.9331 
F-statistic: 419.4 on 1 and 29 DF,  p-value: < 2.2e-16
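The numbers in this printout can also be pulled out of the summary object for further use. The sketch below is one way to do that with base R; the names s, coefTable, se and tvals are illustrative.

```r
# Extract individual pieces of the summary instead of reading the printout
treeSize <- lm(Volume ~ Girth, data = trees)
s <- summary(treeSize)

coefTable <- coef(s)                      # matrix: Estimate, Std. Error, t value, Pr(>|t|)
se <- coefTable[, "Std. Error"]           # standard error of each coefficient
tvals <- coefTable[, "Estimate"] / se     # t value = estimate / standard error

s$sigma        # residual standard error
s$r.squared    # multiple R-squared

# 95% confidence intervals for the intercept and slope
confint(treeSize, level = 0.95)
```

The t values computed here by division should equal the "t value" column of the table.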

Plotting the data with Plotly

Plotting the data together with the fitted line helps visualize the relationship.

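library(plotly)  # also provides the %>% pipe

# Axis titles; the y-axis range is fixed so the plot starts at zero
xax <- list(title = "Girth")
yax <- list(title = "Volume", range = c(0, 80))

# Scatter plot of the data with the fitted regression line on top
treePlot <- plot_ly(trees, x = ~Girth, y = ~Volume, type = "scatter",
                    mode = "markers", name = "data") %>%
  add_lines(x = ~Girth, y = fitted(treeSize), name = "fitted") %>%
  layout(xaxis = xax, yaxis = yax)
treePlot  # display the plot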

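Plotly Code Plotted

(Interactive Plotly figure: Volume against Girth, with the data as markers and the fitted line drawn through them.)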

Using ggplot

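ggplot2 is another option in R for visualizing data.

library(ggplot2)
# Fitted regression line without the shaded band, then the data points on top
TTplot <- ggplot(data = trees, aes(x = Girth, y = Volume)) + geom_smooth(method = "lm", se = FALSE)
TTplot + geom_point(aes(color = factor(Volume), size = Girth), show.legend = FALSE)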
`geom_smooth()` using formula = 'y ~ x'

What is \(\varepsilon\)?

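\(\varepsilon\) is the error term: the difference between each observed value and the value predicted by the equation. Because the regression line is fitted to many scattered data points, individual observations rarely fall exactly on it, and \(\varepsilon\) accounts for those deviations. The residual standard error reported by summary() (4.252 for this model) estimates the typical size of those deviations.

The residual standard error can be read as a buffer zone around the regression line. Least squares chooses the line that makes the squared deviations as small as possible overall, so most data points fall close to the line and few fall far outside that zone.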
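The deviations from the line can be computed directly. The sketch below, using base R only, takes the residuals (observed Volume minus fitted Volume) and rebuilds the residual standard error from them; the names res and rse are illustrative.

```r
# Residuals and the residual standard error, computed by hand
treeSize <- lm(Volume ~ Girth, data = trees)

res <- residuals(treeSize)   # observed Volume minus fitted Volume, one per tree

# Square root of the residual sum of squares over the residual degrees of freedom
rse <- sqrt(sum(res^2) / df.residual(treeSize))

rse                     # computed by hand
summary(treeSize)$sigma # value reported by summary()
```

The degrees of freedom are the number of observations minus the two estimated coefficients, which is why the summary reports 29 for the 31 trees.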

ggplot with a Standard Error buffer

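geom_smooth() can also draw the uncertainty around the fitted line (se = TRUE, its default), which is the shaded area in the graph below. The band is built from the standard error of the fitted line and, by default, marks a 95% confidence interval.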

`geom_smooth()` using formula = 'y ~ x'
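The code for this plot is not shown on the slide. A minimal sketch that should produce a comparable figure, assuming ggplot2 is installed, is given below. The name bandPlot is illustrative, and the plain geom_point() layer is a simplification rather than the original styling.

```r
library(ggplot2)

# Scatter plot with the fitted line and its shaded band
bandPlot <- ggplot(data = trees, aes(x = Girth, y = Volume)) +
  geom_point() +
  # se = TRUE draws the band; level sets the confidence level (0.95 is the default).
  # Giving the formula explicitly suppresses the "using formula" message.
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, level = 0.95)

bandPlot  # display the plot
```

Setting se = FALSE, as in the earlier ggplot example, hides the band and leaves only the line.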