Simple Linear Regression

  • Process used to quantify the relationship between a predictor variable \(\color{darkgreen} x\) and a response variable \(\color{darkgreen} y\)
  • Uses the line of best fit, also called the least squares regression line:

\[\color{darkgreen} {\hat{y} = b_0 + b_1 x}\]
where:
\(\color{Darkgreen} {\hat{y}}\)  is the predicted value of response variable
\(\color{Darkgreen} {b_0}\) is the \(\color{darkgreen}y\)-intercept
\(\color{Darkgreen} {b_1}\) is the slope (the regression coefficient of \(\color{darkgreen}x\))
\(\color{Darkgreen} {x}\)  is the value of the predictor variable

How do we find the regression line?

-R's lm() function finds the line of best fit by minimizing the sum of squares of errors (SSE): \[\color{Darkgreen} {SSE = \sum |errors|^2}\]
-An error is the vertical distance from a data point (actual) to the estimated regression line (predicted):
\[\color{Darkgreen} {SSE = \sum (y - \hat{y})^2}\]
-Of all possible regression lines, the one with the smallest SSE (the “least squares”) is our line of best fit
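
The least-squares idea above can be sketched in a few lines of R. This is a minimal illustration, not the deck's own code: the vectors x and y are hypothetical toy data (not CanPop), and sse() is a hypothetical helper written only for this example. It compares the SSE of the least-squares line with the SSE of two nearby candidate lines.

```r
## hypothetical toy data (not from CanPop), for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

## sse(): hypothetical helper returning the sum of squared errors
## for the candidate line y_hat = b0 + b1 * x
sse <- function(b0, b1) {
  y_hat <- b0 + b1 * x
  sum((y - y_hat)^2)
}

## closed-form least-squares estimates of slope and intercept
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

## lm() returns the same two coefficients
fit <- lm(y ~ x)
coef(fit)

## the least-squares line has a smaller SSE than nearby candidate lines
sse(b0, b1)
sse(b0 + 0.5, b1)
sse(b0, b1 + 0.1)
```

In practice the minimizing line is not found by trying every line; the closed-form expressions for b1 and b0 shown in the sketch give it directly.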


                                        [Image taken from https://serokell.io/blog/regression-analysis-overview]

To illustrate, consider the CanPop dataset

Decennial census of population in Canada 1851-2001

This is our CanPop data plotted in ggplot2:

Here is the code to plot CanPop and its line of best fit in plotly:

## load plotly (plotting functions and the %>% pipe)
library(plotly)
## CanPop is assumed to come from the carData package
library(carData)

## simple regression model: population as a linear function of year
lin_Mod <- lm(population ~ year, data = CanPop)

## x & y values
x <- CanPop$year
y <- CanPop$population

## x-axis details
X_ax <- list(title = "Census Year", range = c(1845, 2010))

## y-axis details
Y_ax <- list(title = "Population (in millions)")

## data points
fig1 <- plot_ly(x = x, y = y, type = "scatter",
                mode = "markers", name = "data") %>%
  ## add line of best fit
  add_lines(x = x, y = fitted(lin_Mod), name = "fitted") %>%
  ## axis titles & plot margins
  layout(xaxis = X_ax, yaxis = Y_ax,
         margin = list(l = 150, r = 50, t = 10, b = 50))

## display the figure
fig1

Population vs Year

Plotted in plotly

(Observe our line of best fit)

CanPop Regression Model

The summary(lin_Mod) call displays information specific to our linear model

## 
## Call:
## lm(formula = population ~ year, data = CanPop)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3660 -2.3010 -0.1938  1.8580  4.2539 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -337.09856   27.71240  -12.16 7.85e-09 ***
## year           0.18134    0.01438   12.61 4.96e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.652 on 14 degrees of freedom
## Multiple R-squared:  0.919,  Adjusted R-squared:  0.9133 
## F-statistic: 158.9 on 1 and 14 DF,  p-value: 4.955e-09

-The Estimate column gives our fitted \(y\)-intercept \(\color{Darkgreen}{(b_0)}\) in the (Intercept) row and our fitted slope \(\color{Darkgreen}{(b_1)}\) in the year row
-Multiple R-squared is 0.919, which means that about 91.9% of the variability of population can be explained by year; Adjusted R-squared (0.9133) is the same measure with a penalty for the number of predictors
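
As a worked example of reading these coefficients, the estimates printed in the summary output can be plugged into the regression equation to predict the population for a given census year. The year 1951 is chosen here only for illustration, and the result is approximate because the printed coefficients are rounded.

\[\color{darkgreen} {\hat{y} = -337.09856 + 0.18134 \times 1951 \approx 16.7}\]

That is, the fitted line predicts a population of roughly 16.7 million for 1951, and the slope says the fitted population grows by about 0.18 million per year.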

Confidence band


-The gray shadow on our plot is our 95% confidence band

-It signifies that, over repeated samples of the same size, about 95% of the intervals constructed this way at a given \(\color{darkgreen}x\) would contain the true expected value of \(\color{darkgreen}y\)
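
The limits of such a band can also be computed directly. The sketch below is a minimal illustration using base R's predict() with interval = "confidence"; the data frame toy and the grid new_x are hypothetical and are not taken from CanPop.

```r
## hypothetical toy data (not from CanPop), for illustration only
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1, 11.8))
fit <- lm(y ~ x, data = toy)

## x values at which to evaluate the band
new_x <- data.frame(x = seq(1, 6, by = 0.5))

## 95% confidence interval for the expected value of y at each x;
## columns are fit (y_hat), lwr (lower limit) and upr (upper limit)
band <- predict(fit, newdata = new_x, interval = "confidence", level = 0.95)
head(band)

## the band is narrowest at mean(x) and widens toward the ends
width <- band[, "upr"] - band[, "lwr"]
new_x$x[which.min(width)]
```

With the model from this deck, the same call would presumably take lin_Mod and a data frame with a year column in place of fit and new_x.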

END