What is Simple Linear Regression?

In Simple Linear Regression, we study the linear relationship between an explanatory (independent) variable and a response (dependent) variable.

We can look at the strength of this relationship or use the explanatory variable to help us predict values of the response variable.

In this lesson, we will focus on fitting a linear regression line to a set of data.

Simple Linear Regression Model

\[Y=\beta_0 + \beta_1\cdot X + \varepsilon\]

  • Y : Observed value of the response variable for a given value of X
  • X : Given value of X
  • \(\beta_0\) : Y-intercept
  • \(\beta_1\) : Slope
  • \(\varepsilon\) : Random error term (epsilon)
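
To make the roles of \(\beta_0\), \(\beta_1\) and \(\varepsilon\) concrete, the sketch below simulates data from the model and then recovers the parameters with lm(). The values 2 and 3 for the intercept and slope, and the error standard deviation of 1, are arbitrary assumptions chosen for illustration; they do not come from any dataset in this lesson.

```r
## Simulate data from Y = beta_0 + beta_1 * X + epsilon
## (beta_0 = 2, beta_1 = 3 and sd = 1 are assumed values for illustration)
set.seed(1)                           # make the simulation reproducible
n = 200
beta_0 = 2; beta_1 = 3
x = runif(n, min = 0, max = 10)       # given values of X
epsilon = rnorm(n, mean = 0, sd = 1)  # random error term
y = beta_0 + beta_1 * x + epsilon     # observed values of Y

## Estimate the intercept and slope from the simulated data
sim_model = lm(y ~ x)
coef(sim_model)                       # estimates should land near 2 and 3
```

Because of the error term, the estimates will be close to the true parameters but not exactly equal to them.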

Example : trees dataset (Height vs Volume: Scatter Plot)

Example : trees dataset (Height vs Volume: Fitted Line)

The orange line is the linear regression line fitted to the data.
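
A plot along these lines could be produced with base R graphics, as in the minimal sketch below. It uses the built-in trees dataset and assumes Height is the explanatory variable and Volume the response; the original figures may have been drawn with a different plotting package.

```r
## Scatter plot of the built-in trees dataset with a fitted line
## (assumes Height is the explanatory variable and Volume the response)
tree_model = lm(Volume ~ Height, data = trees)
plot(trees$Height, trees$Volume, xlab = "Height", ylab = "Volume")
abline(tree_model, col = "orange")    # draw the fitted regression line
```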

Estimated Regression Line

\[\hat{Y}=\hat{\beta_0}+\hat{\beta_1}\cdot X\]

  • \(\hat{Y}\) : Predicted value of Y
  • X : Given value of X
  • \(\hat{\beta_0}\) : Estimated value of \(\beta_0\)
  • \(\hat{\beta_1}\) : Estimated value of \(\beta_1\)
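
As a sketch of how the estimated line turns a given X into a predicted Y, the code below computes \(\hat{Y}\) directly from the estimated coefficients and compares the result with R's predict(). It uses the built-in trees dataset; the heights 70 and 80 are hypothetical inputs chosen for illustration.

```r
## Predicted values from the estimated regression line (built-in trees data)
tree_model = lm(Volume ~ Height, data = trees)
b0_hat = unname(coef(tree_model)[1])    # estimated intercept
b1_hat = unname(coef(tree_model)[2])    # estimated slope

new_height = c(70, 80)                  # hypothetical X values
y_hat_manual = b0_hat + b1_hat * new_height
y_hat_predict = unname(predict(tree_model,
                               newdata = data.frame(Height = new_height)))

all.equal(y_hat_manual, y_hat_predict)  # both approaches agree
```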

Plotting the fitted line using R (Plotly)

We are going to learn how to fit a linear regression line to data using both ggplot2 and plotly.

Plotly:

  • First, create a variable model equal to lm(y~x), where y is a vector of the observed response values and x is a vector of the explanatory values. The function name lm() stands for “linear model”.
  • After creating a scatter plot named P, use the pipe operator %>% and the function add_lines() to add the fitted regression line. Pass add_lines() the argument x (our vector of x values) and the argument y set to fitted(model), where model is the variable defined earlier.
  • We will show an example of this using the diamond dataset from the UsingR package. (The ggplot2 package ships a different, much larger dataset named diamonds; the output shown in this lesson, with 46 residual degrees of freedom, matches the 48-row diamond data.)

diamond dataset from the UsingR package: carat vs price

This is what the code should look like:

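## Load packages and data, then create variables
library(plotly)   # plot_ly(), layout(), add_lines() and the %>% pipe
library(UsingR)   # provides the diamond dataset
data(diamond)
x = diamond$carat; y = diamond$price
model = lm(y~x)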

## Organizing the plot axes
xax = list(title="Carat", titlefont = list(family="Modern Computer Roman"))
yax = list(title="Price", titlefont = list(family="Modern Computer Roman"))

## Create the scatter plot
P = plot_ly(x=x, y=y, type="scatter", mode="markers") %>%
  layout(xaxis=xax, yaxis=yax)

## Fit line to data and plot
P %>% add_lines(x=x, y=fitted(model))

This is what the plot should look like:

Plotting the fitted line using R (ggplot2)

Now we will learn how to fit a linear regression line to data using ggplot2.

ggplot2:

  • First create a scatter plot named P in which we specify the data set used and the explanatory and response variables.
  • After creating the scatter plot, use the addition operator + and the function geom_smooth() to add a fitted linear regression line. The function geom_smooth() takes the argument method (which should equal “lm” for “linear model”) and the optional arguments level (the confidence level for the confidence band, 0.95 by default) and se (a logical value, TRUE by default, that controls whether the confidence band is drawn).
  • For simplicity, we will not include the confidence band, so we will ignore the optional level argument and set the optional se argument to FALSE.
  • We will show an example of this using the diamond dataset from the UsingR package.

This is what the code should look like:

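## Load ggplot2 and the data (diamond comes from the UsingR package)
library(ggplot2)
library(UsingR)
data(diamond)

## Create the scatter plot
P = ggplot(data=diamond, aes(x=carat, y=price)) + geom_point()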

## Fit line to data and plot
P + geom_smooth(method="lm", se=FALSE)

This is what the plot should look like:

How to find \(\hat{\beta_0}\) and \(\hat{\beta_1}\) using R

There is also a way to find the specific values of \(\hat{\beta_0}\) and \(\hat{\beta_1}\) using R.

Step 1: Define model variable

model = lm(y~x, data=dataset), where:

  • dataset is the name of a data frame or other set of data.
  • y is the name of the column of the data set corresponding to the response variable.
  • x is the name of the column of the data set corresponding to the explanatory variable.

OR

model = lm(y~x), where:

  • y is a vector containing the observed values of the response variable.
  • x is a vector containing the values of the explanatory variable.

Step 2: Summarize model variable

summary(model)

  • Returns information about the model, including \(\hat{\beta_0}\) and \(\hat{\beta_1}\), which are listed under the section labeled “Coefficients:”.
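  • \(\hat{\beta_0}\) is the value in the “Estimate” column, in the “(Intercept)” row. (The neighboring “Std. Error” column is a separate column that holds the standard error of each estimate.)
  • \(\hat{\beta_1}\) is the value directly below \(\hat{\beta_0}\) in the “Estimate” column, in the row labeled with the name of the explanatory variable.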
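
Instead of reading the estimates off the printed summary, they can also be extracted programmatically. The sketch below uses the built-in trees dataset (with Height assumed to be the explanatory variable) so that it runs without extra packages; the same calls should apply to any model created with lm().

```r
## Pull the estimates out of the model instead of reading the printout
tree_model = lm(Volume ~ Height, data = trees)
coef_table = coef(summary(tree_model))     # matrix behind "Coefficients:"
b0_hat = coef_table["(Intercept)", "Estimate"]
b1_hat = coef_table["Height", "Estimate"]  # row named after the explanatory variable
c(b0_hat, b1_hat)
coef(tree_model)                           # same two numbers, as a named vector
```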

This is what the code and output should look like for our carat vs price example:

model = lm(price~carat, data=diamond)
summary(model)
Call:
lm(formula = price ~ carat, data = diamond)

Residuals:
    Min      1Q  Median      3Q     Max 
-85.159 -21.448  -0.869  18.972  79.370 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -259.63      17.32  -14.99   <2e-16 ***
carat        3721.02      81.79   45.50   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 31.84 on 46 degrees of freedom
Multiple R-squared:  0.9783,    Adjusted R-squared:  0.9778 
F-statistic:  2070 on 1 and 46 DF,  p-value: < 2.2e-16

The same model can be fitted from vectors instead of a data frame, which gives identical estimates:

## OR
x = diamond$carat; y=diamond$price
model = lm(y~x)
summary(model)
Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-85.159 -21.448  -0.869  18.972  79.370 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -259.63      17.32  -14.99   <2e-16 ***
x            3721.02      81.79   45.50   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 31.84 on 46 degrees of freedom
Multiple R-squared:  0.9783,    Adjusted R-squared:  0.9778 
F-statistic:  2070 on 1 and 46 DF,  p-value: < 2.2e-16

Using this method, we find that our estimated y-intercept \(\hat{\beta_0}\) equals approximately -259.63 and our estimated slope \(\hat{\beta_1}\) equals approximately 3,721.02.

So, using R, we find that our estimated regression line is:

\[\hat{Y}=-259.63 + 3721.02\cdot X\]
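
As a small worked example of using this line, the sketch below plugs a carat value into the estimated equation. The coefficients are the rounded estimates reported above; the carat value 0.20 is a hypothetical input chosen for illustration.

```r
## Predicted price from the estimated line, using the rounded coefficients above
b0_hat = -259.63
b1_hat = 3721.02
carat = 0.20                         # hypothetical X value for illustration
price_hat = b0_hat + b1_hat * carat
price_hat                            # 484.574
```

The slope can be read the same way: each additional 0.1 carat raises the predicted price by about 372.10.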

Estimating the regression line by hand

Method of Least Squares

The best-fitting line is the one that minimizes the sum of the squared residuals:

\[\sum(Y_i - (\hat{\beta_0} + \hat{\beta_1}\cdot X_i))^2\]

A residual is the signed vertical difference between the observed Y value at a given X value and the value \(\hat{Y}\) predicted by the line at that same X value. You can calculate a residual as follows:

Residual = \(Y_i - \hat{Y_i}\) = \(Y_i - (\hat{\beta_0} + \hat{\beta_1}\cdot X_i)\)
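
The residual formula can be checked numerically. The sketch below, which uses the built-in trees dataset with Height assumed as the explanatory variable, computes the residuals from the definition, compares them with R's resid(), and shows that a competing line (the slope nudged by an arbitrary 0.1) has a larger sum of squared residuals.

```r
## Residuals and the sum of squared residuals (built-in trees data)
tree_model = lm(Volume ~ Height, data = trees)
b0_hat = unname(coef(tree_model)[1])
b1_hat = unname(coef(tree_model)[2])

y_hat = b0_hat + b1_hat * trees$Height    # fitted values
res = trees$Volume - y_hat                # residual = observed - fitted
all.equal(res, unname(resid(tree_model))) # matches R's own residuals

## Sum of squared residuals for the least squares line ...
sse_fit = sum(res^2)
## ... and for an arbitrary competing line (slope nudged by 0.1)
sse_other = sum((trees$Volume - (b0_hat + (b1_hat + 0.1) * trees$Height))^2)
sse_fit < sse_other                       # TRUE: the least squares line wins
```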

Here are the formulas to find \(\hat{\beta_0}\) and \(\hat{\beta_1}\):

\[\hat{\beta_0} = \bar{Y} - \hat{\beta_1}\cdot \bar{X}\]

\[\hat{\beta_1} = \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{\sum(X_i - \bar{X})^2}\]

The process of solving for these coefficients by hand will give us the same results as the process of using R.
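
The two formulas translate almost line for line into R. The following sketch applies them to the built-in trees dataset (Height assumed as X, Volume as Y) and compares the result with lm(); the variable names are chosen here for illustration.

```r
## Least squares estimates "by hand" (built-in trees data)
x = trees$Height; y = trees$Volume
x_bar = mean(x); y_bar = mean(y)

b1_hat = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)  # slope
b0_hat = y_bar - b1_hat * x_bar                                # intercept

## Compare with lm()
c(b0_hat, b1_hat)
unname(coef(lm(y ~ x)))
```

Note that the slope has to be computed first, because the intercept formula depends on it.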

Conclusion

We’ve looked at three methods of fitting a linear regression line to a data set containing two variables:

  • Plotting the regression line using R with the packages Plotly and ggplot2
  • Solving for the coefficients \(\hat{\beta_0}\) and \(\hat{\beta_1}\) using R
  • Solving for the coefficients \(\hat{\beta_0}\) and \(\hat{\beta_1}\) by hand

All three methods describe the same fitted line in different ways. ggplot2 needs less code to plot a regression line, while plots made with Plotly are interactive. Solving for the coefficients in R and by hand gives the same estimates, but solving by hand is more time-consuming, and the R model summary reports more than just the coefficients, such as their standard errors and the R-squared value.
