Simple Linear Regression

2024-02-04

Introduction to Simple Linear Regression

Simple linear regression is a statistical method for studying the relationship between two continuous variables. It models the dependent variable as a linear function of a single independent variable. By fitting the line that minimizes the sum of squared residuals (the least-squares line), we can make predictions and quantify the relationship between the variables.

Purpose: Predicting a response variable from a single predictor variable.
Use Cases: Forecasting sales, evaluating trends, risk assessment, etc.

Theory of Simple Linear Regression

Simple linear regression is based on the following model:

\[ y = \beta_0 + \beta_1x + \epsilon \]

Where:

\(y\) is the dependent variable (the variable we are trying to predict).
\(x\) is the independent variable (the predictor variable).
\(\beta_0\) is the intercept of the regression line (the expected value of \(y\) when \(x\) is 0).
\(\beta_1\) is the slope of the regression line (the expected change in \(y\) for a one-unit increase in \(x\)).
\(\epsilon\) represents the error term (the deviation of an observed value from the true regression line; the residuals, which are the differences between observed and fitted values, serve as its estimates).
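
In practice, \(\beta_0\) and \(\beta_1\) are estimated by ordinary least squares, which picks the line that minimizes the sum of squared residuals. The sketch below is illustrative and not part of the original analysis. It computes the closed-form estimates \(\hat{\beta}_1 = \mathrm{cov}(x, y) / \mathrm{var}(x)\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\) on R's built-in mtcars data and compares them with lm(). The variable names (b0_hat, b1_hat, and so on) are assumptions chosen for the example.

```r
# Closed-form OLS estimates for y = b0 + b1 * x (illustrative sketch)
x <- mtcars$wt   # predictor: weight in 1,000 lbs
y <- mtcars$mpg  # response: miles per gallon

# Slope: sample covariance of x and y divided by the sample variance of x
b1_hat <- cov(x, y) / var(x)
# Intercept: forces the fitted line through the point (mean(x), mean(y))
b0_hat <- mean(y) - b1_hat * mean(x)

# Residuals are the observed values minus the fitted values
fitted_y <- b0_hat + b1_hat * x
resid_y  <- y - fitted_y

# lm() should reproduce the same two estimates
fit <- lm(y ~ x)
print(c(intercept = b0_hat, slope = b1_hat))
print(coef(fit))
```

Because the model includes an intercept, the residuals in resid_y sum to zero up to floating-point error, which is a quick sanity check on a hand-computed fit.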

Assumptions of Simple Linear Regression

Simple linear regression relies on several key assumptions:

Linearity: The relationship between the independent and dependent variable must be linear.
Independence: The residuals (errors) should be independent of each other.
Homoscedasticity: The residuals should have constant variance at every level of the independent variable.
Normality: The residuals should be normally distributed.
No Extreme Outliers: The data should not have influential outliers that could significantly skew the results.

Understanding and checking these assumptions is important for the proper application of simple linear regression and for ensuring valid results.
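
As one way to screen for influential outliers, the sketch below computes Cook's distance for each observation of a simple regression on R's built-in mtcars data. The 4/n cutoff is an assumed rule of thumb, not a value taken from this document, and flagged points deserve inspection rather than automatic removal.

```r
# Flag potentially influential observations with Cook's distance (sketch)
fit <- lm(mpg ~ wt, data = mtcars)
cooks_d <- cooks.distance(fit)

# Assumed rule-of-thumb cutoff of 4/n; other cutoffs are also in use
cutoff <- 4 / nrow(mtcars)
influential <- cooks_d[cooks_d > cutoff]

# Show the flagged cars, largest distance first
print(sort(influential, decreasing = TRUE))
```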

Example Using mtcars Dataset

The mtcars dataset records fuel consumption and 10 aspects of automobile design and performance for 32 automobiles. For a simple linear regression example, we will explore the relationship between a car’s weight (wt) and its miles per gallon (mpg), where:

mpg: Miles per gallon (dependent variable)
wt: Weight of the car (1000 lbs) (independent variable)

# Load the mtcars dataset
data(mtcars)
# Display the first six rows focusing on mpg and wt columns
head(mtcars[, c("mpg", "wt")])

                   mpg    wt
Mazda RX4         21.0 2.620
Mazda RX4 Wag     21.0 2.875
Datsun 710        22.8 2.320
Hornet 4 Drive    21.4 3.215
Hornet Sportabout 18.7 3.440
Valiant           18.1 3.460
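
Before fitting a model, it can help to summarize both variables and measure their linear association. The following is a minimal sketch using base R only; the variable name r is an assumption made for the example.

```r
# Summarize the two variables and measure their linear association
data(mtcars)
print(summary(mtcars[, c("mpg", "wt")]))

# Pearson correlation between weight and fuel efficiency
r <- cor(mtcars$wt, mtcars$mpg)
print(r)

# In simple linear regression, the squared correlation equals R-squared
print(r^2)
```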

Interactive Data Visualization with plotly

The following is an interactive scatter plot of the mtcars dataset’s mpg against wt using plotly:
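
The interactive plot itself is not reproduced in this text. A plot of this kind might be built roughly as follows, assuming the plotly package is installed; the title, axis labels, and hover text are assumptions and may differ from the original figure.

```r
# Interactive scatter plot of mpg against wt (sketch; requires the plotly package)
library(plotly)

p <- plot_ly(
  data = mtcars,
  x = ~wt,
  y = ~mpg,
  type = "scatter",
  mode = "markers",
  text = rownames(mtcars)  # car name appears in the hover tooltip
)

# Add a title and axis labels
p <- layout(
  p,
  title = "MPG vs Car Weight",
  xaxis = list(title = "Weight (1000 lbs)"),
  yaxis = list(title = "Miles per gallon")
)
```

Evaluating p at the console or in a rendered document displays the widget.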

Static Data Visualization with ggplot2

The scatter plot below shows the relationship between a car’s weight and its fuel efficiency.

Scatter plot of MPG vs Car Weight
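
A static plot like this one might be produced with the sketch below, assuming the ggplot2 package is installed. The fitted line added by geom_smooth() and the axis labels are assumptions; the original figure may not have included them.

```r
# Static scatter plot of mpg against wt (sketch; requires the ggplot2 package)
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Overlay the least-squares line with its confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(
    title = "Scatter plot of MPG vs Car Weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon"
  )
print(p)
```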

Regression Analysis with R - Code

To analyze how the weight of a car (wt) influences its fuel efficiency (mpg), we perform a simple linear regression analysis using the lm() function in R. The code below fits a linear model with mpg as the response variable and wt as the explanatory variable.

# Fit the linear model
model <- lm(mpg ~ wt, data = mtcars)
# Print the coefficient table and fit statistics
summary(model)

Regression Analysis Model Output

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Interpreting the Regression Results

After fitting the simple linear regression model, we can interpret the results to understand the relationship between the variables:

The coefficient for wt indicates how much mpg is expected to change for each one-unit (1,000 lb) increase in weight.
The \(R^2\) value tells us the proportion of the variance in mpg that can be explained by wt.
The p-value tests the null hypothesis that the coefficient equals 0; a small p-value is evidence against that hypothesis.

From our model, we see that:

The coefficient for wt is negative (-5.3445), indicating that heavier cars tend to have lower fuel efficiency.
The \(R^2\) value of 0.7528 indicates a strong linear relationship between wt and mpg.
The p-value for wt (1.29e-10) is far below the conventional 0.05 threshold, so the relationship is statistically significant.
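
The quantities discussed above can also be pulled out of the fitted model programmatically instead of being read off the printed summary. The sketch below refits the model so that it runs on its own; the variable names are assumptions, and the 95% confidence interval is an addition not shown in the original output.

```r
# Extract the slope, its p-value, and R-squared from the fitted model (sketch)
model <- lm(mpg ~ wt, data = mtcars)
model_summary <- summary(model)
coef_table <- coef(model_summary)  # columns: Estimate, Std. Error, t value, Pr(>|t|)

slope     <- coef_table["wt", "Estimate"]
slope_se  <- coef_table["wt", "Std. Error"]
slope_p   <- coef_table["wt", "Pr(>|t|)"]
r_squared <- model_summary$r.squared

# The t value is the estimate divided by its standard error
t_value <- slope / slope_se

# 95% confidence intervals for the intercept and slope
ci <- confint(model, level = 0.95)

print(c(slope = slope, t = t_value, p = slope_p, r_squared = r_squared))
print(ci)
```

Dividing the slope by its standard error, -5.3445 / 0.5591, gives the t value of about -9.559 shown in the coefficient table.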

Checking Regression Assumptions

After fitting a linear regression model, it is critical to check that the assumptions underlying the model are satisfied. We can use diagnostic plots to examine these assumptions:

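Linearity: The relationship between the predictor and the response variable should be linear. In the Residuals vs Fitted plot, the residuals should scatter around zero with no systematic curve.
Homoscedasticity: The variance of the errors should be constant across all values of the predictor. In the Scale-Location plot, the points should form a roughly horizontal band.
Normality: The residuals should be approximately normally distributed. In the Normal Q-Q plot, the points should fall close to the reference line.
Independence: The residuals should be independent of each other. The default diagnostic plots do not assess this directly; it is judged mainly from how the data were collected.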

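# Arrange the four diagnostic plots in a 2x2 grid with narrow margins
par(mfrow = c(2, 2), mar = c(2, 2, 2, 2))
# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
plot(model)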

Checking Regression Assumptions - Plot
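
Diagnostic plots can be supplemented with numerical checks. The sketch below uses base R only and is not part of the original analysis: it runs a Shapiro-Wilk test on the residuals and computes a studentized Breusch-Pagan statistic by hand. Both are swapped-in techniques chosen for illustration, and with only 32 observations such tests have limited power, so the plots remain the primary tool.

```r
# Numerical checks to accompany the diagnostic plots (sketch, base R only)
model <- lm(mpg ~ wt, data = mtcars)
res <- residuals(model)

# Normality: Shapiro-Wilk test; a small p-value suggests non-normal residuals
sw <- shapiro.test(res)
print(sw)

# Homoscedasticity: studentized Breusch-Pagan statistic computed by hand.
# Regress the squared residuals on the predictor; n * R^2 is compared with a
# chi-squared distribution whose df equals the number of predictors (1 here).
aux <- lm(I(res^2) ~ wt, data = mtcars)
bp_stat <- nrow(mtcars) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(statistic = bp_stat, p_value = bp_p))
```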

Conclusions from Regression Analysis

The analysis of the mtcars dataset using simple linear regression provided the following insights:

There is a statistically significant negative relationship between the weight of a car (wt) and its fuel efficiency (mpg), with a p-value much less than the standard significance level of 0.05.
The coefficient for weight suggests that for each 1,000 lb increase in weight, the car’s fuel efficiency decreases by an average of approximately 5.34 mpg.
The \(R^2\) value of 0.7528 indicates that approximately 75% of the variability in fuel efficiency can be explained by the car’s weight.
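
To turn the fitted line into predictions, the estimated coefficients can be applied to new weights. For a 3,000 lb car, for example, the fitted line gives roughly 37.2851 - 5.3445 × 3 ≈ 21.25 mpg. The sketch below does the same with predict(); the data frame new_cars and its weights are hypothetical values chosen inside the observed range of wt.

```r
# Predict mpg for hypothetical car weights (sketch)
model <- lm(mpg ~ wt, data = mtcars)

# Hypothetical weights, in 1,000 lbs, chosen inside the observed range of wt
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5))

# Point predictions with 95% prediction intervals for individual cars
pred <- predict(model, newdata = new_cars, interval = "prediction", level = 0.95)
print(cbind(new_cars, pred))

# Predictions one unit of wt apart differ by exactly the slope
print(pred[3, "fit"] - pred[1, "fit"])
print(unname(coef(model)["wt"]))
```

Prediction intervals are wider than confidence intervals for the mean response because they also account for the scatter of individual cars around the line.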

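These findings could have important implications for automobile design, where reducing the weight of a vehicle could be a key strategy for improving fuel efficiency. Note, however, that the model describes an association in observational data and does not by itself establish that weight causes the difference.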

Further analysis could involve examining the impact of other variables, performing multiple regression, and validating these findings with additional data.