Statistics: Simple Linear Regression

February 28, 2023

Introduction

Simple linear regression is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables:

One variable, denoted \(x\), is regarded as the predictor, explanatory, or independent variable.

The other variable, denoted \(y\), is regarded as the response, outcome, or dependent variable.

Simple linear regression is similar to correlation in that both methods are used to study the relationship between two continuous variables.

The difference is that correlation measures the strength of the linear relationship between two variables, whereas simple linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data.

Equation models

A simple linear regression model is represented by the equation: \[y = b_0 + b_1x + \epsilon \] where:

\(y\) is the response variable, \(x\) is the predictor variable, \(b_0\) is the intercept, \(b_1\) is the slope of the line, and \(\epsilon\) is the error term.

The error term \(\epsilon\) is the difference between the observed value of the response variable (y) and the value given by the true regression line, \(b_0 + b_1x\). It cannot be observed directly; it is estimated by the residual, which is the difference between the observed value (y) and the fitted value (ŷ). The error term is assumed to be normally distributed with a mean of zero and a constant variance. The errors are also assumed to be independent of the predictor variable and of one another.
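The distinction between the unobserved error term and the fitted model can be illustrated with simulated data. The following is a minimal sketch: the sample size, the "true" coefficients, and the error standard deviation are arbitrary values assumed for illustration only.

```r
# Simulate data from y = b0 + b1 * x + epsilon.
# All numeric settings below are illustrative assumptions, not values from the text.
set.seed(1)
n       <- 200
b0_true <- 2
b1_true <- 0.5
x       <- runif(n, min = 0, max = 10)
epsilon <- rnorm(n, mean = 0, sd = 1)   # unobserved error term
y       <- b0_true + b1_true * x + epsilon

# Fit the simple linear regression by least squares
fit <- lm(y ~ x)
coef(fit)   # estimates of the intercept and slope

# Residuals estimate the errors but are not identical to them
head(cbind(error = epsilon, residual = resid(fit)))
```

In a real analysis only x and y are available, so the errors in the first column can never be inspected directly; the residuals stand in for them.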

The goal of simple linear regression is to model the relationship between the predictor variable and the response variable by fitting a straight line to the observed data.

The fitted straight line is described by the equation: \[ \hat{y} = b_0 + b_1x \] where:

\(\hat{y}\) is the fitted (predicted) value of the response variable, \(x\) is the predictor variable, \(b_0\) is the intercept, and \(b_1\) is the slope of the line. The intercept \(b_0\) is the predicted value of \(y\) when \(x = 0\). The slope \(b_1\) is the change in the predicted value of \(y\) associated with a one-unit increase in \(x\). The least-squares slope is also the correlation \(r\) between \(x\) and \(y\) multiplied by the ratio of the standard deviation of \(y\) to the standard deviation of \(x\): \[ b_1 = r \frac{s_y}{s_x} \] Because both standard deviations are positive, the slope always has the same sign as the correlation.
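The link between the slope and the correlation can be checked numerically. This sketch uses the built-in mtcars data that appears later in these notes and assumes weight (wt) as the predictor and miles per gallon (mpg) as the response; the variable names b0 and b1 are chosen here to match the notation above.

```r
data(mtcars)
x <- mtcars$wt    # predictor (assumed choice for this sketch)
y <- mtcars$mpg   # response

# Slope from the correlation and the two standard deviations
b1 <- cor(x, y) * sd(y) / sd(x)

# The least-squares line passes through the point of means,
# which determines the intercept
b0 <- mean(y) - b1 * mean(x)

# Compare with the coefficients reported by lm()
fit <- lm(mpg ~ wt, data = mtcars)
c(b0 = b0, b1 = b1)
coef(fit)
```

Both approaches should give the same intercept and slope up to floating-point rounding.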

The coefficient of determination (\(R^2\)) is the proportion of the variance in the response variable that is explained by the predictor variable. In simple linear regression, the coefficient of determination is also the square of the correlation between \(x\) and \(y\).
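As a quick check of both descriptions, the following sketch computes the coefficient of determination three ways on the built-in mtcars data, assuming mpg as the response and wt as the predictor. The variable names are chosen for this example.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# 1. As reported by the model summary
r2_model <- summary(fit)$r.squared

# 2. As the squared correlation between predictor and response
r2_cor <- cor(mtcars$wt, mtcars$mpg)^2

# 3. As the proportion of variance explained: 1 - SSE / SST
sse    <- sum(resid(fit)^2)                        # unexplained variation
sst    <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total variation
r2_var <- 1 - sse / sst

c(model = r2_model, correlation = r2_cor, variance = r2_var)
```

The three values agree for a simple linear regression with an intercept; with more than one predictor, the squared pairwise correlation no longer equals the coefficient of determination.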

Residuals

The residual for each observation is given by: \[ e_i = y_i - \hat{y}_i \] where \(e_i\) is the residual for the \(i\)th observation, \(y_i\) is the observed value of the dependent variable, and \(\hat{y}_i\) is the predicted value of the dependent variable.
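The residual formula can be applied directly in R and compared with the residuals that lm() stores. This is a small sketch on the built-in mtcars data, assuming mpg as the response and wt as the predictor.

```r
fit <- lm(mpg ~ wt, data = mtcars)

y_hat <- fitted(fit)          # predicted values
e     <- mtcars$mpg - y_hat   # residuals: observed minus predicted

# The hand-computed residuals match those returned by resid()
all.equal(e, resid(fit))

# With an intercept in the model, least-squares residuals sum to zero
# (up to floating-point rounding)
sum(e)
```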

Example Data Set: Motor Trend Car Road Tests

library(ggplot2)
data(mtcars)

# Weight (wt, in 1000 lbs) is the predictor; fuel economy (mpg) is the response
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() + geom_smooth(method = "lm")

The plot shows the relationship between car weight and miles per gallon. As the weight of the car increases, the miles per gallon tends to decrease, so the fitted regression line has a negative slope, reflecting a negative correlation between the two variables. The residuals are the differences between the observed values and the values predicted by the line. A scatterplot alone does not show whether the residuals are normally distributed around 0; that assumption should be checked with a residual plot or a normal Q-Q plot before concluding that the linear regression model is a good fit for this data.
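One way to examine the residuals is sketched below, assuming mpg is regressed on wt. It uses base R graphics rather than ggplot2 to keep the example short; the axis labels are chosen for this example.

```r
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)   # intercept and slope of the fitted line

# Residuals against fitted values: look for curvature or a funnel shape,
# either of which would suggest the straight-line model is inadequate
plot(fitted(fit), resid(fit),
     xlab = "Fitted mpg", ylab = "Residual")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line are consistent with
# approximately normal residuals
qqnorm(resid(fit))
qqline(resid(fit))
```

These plots are judged by eye, so they support or cast doubt on the model assumptions rather than proving them.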

Example Data Set: Iris

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + 
  geom_point() + geom_smooth(method = "lm")

The plot shows the relationship between sepal length and sepal width for each species of iris. Each line of best fit is a separate linear regression for one species. The residuals are the vertical distances between the points and the line for their species, and each line is chosen to minimize the sum of the squared residuals. Note that the residuals of a least-squares fit with an intercept always average to 0; whether they are normally distributed cannot be judged from this plot and needs a separate check.

The residuals for setosa tend to be somewhat smaller than those for versicolor and virginica. This reflects the stronger linear relationship between sepal length and sepal width in setosa (correlation about 0.74, compared with about 0.53 for versicolor and about 0.46 for virginica). The least-squares equations for each line of best fit, with \(x\) as sepal length and \(y\) as sepal width, are: \[ \text{Setosa: } y = 0.799x - 0.569 \] \[ \text{Versicolor: } y = 0.320x + 0.872 \] \[ \text{Virginica: } y = 0.232x + 1.446 \]
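The per-species lines can be reproduced by fitting one regression per species. The sketch below is one possible approach using split() and lapply() on the built-in iris data; the object names fits and coefs are chosen for this example.

```r
data(iris)

# Fit Sepal.Width ~ Sepal.Length separately within each species
fits <- lapply(split(iris, iris$Species), function(d) {
  lm(Sepal.Width ~ Sepal.Length, data = d)
})

# One row per species: intercept and slope of each line of best fit
coefs <- t(sapply(fits, coef))
coefs

# Residual standard error per species, a summary of residual size
sapply(fits, sigma)

# Correlation between sepal length and sepal width within each species
sapply(split(iris, iris$Species),
       function(d) cor(d$Sepal.Length, d$Sepal.Width))
```

Comparing the residual standard errors and the correlations side by side shows how closely each species follows its own line.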