2025-10-24

Introduction to Simple Linear Regression

Simple linear regression is a statistical method used to model the relationship between two continuous variables:

  • Independent variable (X): The predictor or explanatory variable
  • Dependent variable (Y): The response or outcome variable

The goal is to find the straight line that best fits the data points. That line can then be used to predict Y from X and to describe how Y changes as X changes.

The Linear Model

The simple linear regression model is expressed mathematically as:

\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]

where:

  • \(Y_i\) is the observed response for observation \(i\)
  • \(\beta_0\) is the y-intercept (constant term)
  • \(\beta_1\) is the slope (regression coefficient)
  • \(X_i\) is the predictor value for observation \(i\)
  • \(\epsilon_i\) is the random error term, assumed to be independent across observations with \(\epsilon_i \sim N(0, \sigma^2)\)
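
One way to get a feel for the model is to simulate data from it. The sketch below is only an illustration: the sample size, the coefficient values, the error standard deviation, and the range of X are all hypothetical and do not come from the study data used later.

```r
# Simulate data from Y_i = beta_0 + beta_1 * X_i + epsilon_i.
# All numeric values below are hypothetical, chosen only for illustration.
set.seed(42)                 # make the random draws reproducible

n      <- 100                # number of observations
beta_0 <- 50                 # true intercept (hypothetical)
beta_1 <- 4                  # true slope (hypothetical)
sigma  <- 3                  # standard deviation of the error term (hypothetical)

x       <- runif(n, min = 0, max = 10)       # predictor values
epsilon <- rnorm(n, mean = 0, sd = sigma)    # errors drawn from N(0, sigma^2)
y       <- beta_0 + beta_1 * x + epsilon     # observed responses

sim_data <- data.frame(x = x, y = y)
head(sim_data)
```

Because the true coefficients are known here, fitting a line to sim_data shows how close the estimates come to the values that generated the data.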

Estimating the Regression Line

The regression coefficients are estimated using the method of least squares, which minimizes the sum of squared residuals:

\[\text{SSE} = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2\]

The estimated regression equation is:

\[\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\]

where \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are the estimated coefficients.
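
Minimizing SSE with respect to both coefficients gives the standard closed-form least-squares estimates:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\]

As a rough check, the sketch below computes these by hand and compares them with lm(). The data are simulated and hypothetical; they only give the formulas something to work on.

```r
# Hypothetical data, simulated only to have something to fit.
set.seed(1)
x <- runif(30, min = 0, max = 10)
y <- 50 + 4 * x + rnorm(30, sd = 3)

# Closed-form least-squares estimates
x_bar <- mean(x)
y_bar <- mean(y)
beta_1_hat <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
beta_0_hat <- y_bar - beta_1_hat * x_bar

# Sum of squared residuals at the estimates
y_hat <- beta_0_hat + beta_1_hat * x
sse   <- sum((y - y_hat)^2)

# lm() should return the same coefficients up to numerical tolerance
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(beta_0_hat, beta_1_hat))
```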

Example: Study Time vs Exam Score

Let’s examine the relationship between hours studied and exam scores for a group of students.

Figure: exam scores plotted against hours studied, with a geom_smooth() trend line (formula y ~ x).

Fitting the Regression Model

# Fit the linear model
model <- lm(scores ~ hours, data = study_data)
summary(model)
## 
## Call:
## lm(formula = scores ~ hours, data = study_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4043 -1.6873 -0.4177  1.1561  5.0443 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  51.8223     2.2927   22.60 1.55e-08 ***
## hours         3.7541     0.3226   11.64 2.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.93 on 8 degrees of freedom
## Multiple R-squared:  0.9442, Adjusted R-squared:  0.9372 
## F-statistic: 135.4 on 1 and 8 DF,  p-value: 2.71e-06
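
Several numbers in this output are linked, and the links can be checked with a few lines of arithmetic. The sketch below copies the slope estimate, its standard error, and the residual degrees of freedom from the printed output; the variable names are made up for this illustration. It relies on two standard facts for a model with a single predictor: the F-statistic equals the squared t value of the slope, and R-squared equals F / (F + residual df).

```r
# Values copied from the summary(model) output
est_hours <- 3.7541    # slope estimate
se_hours  <- 0.3226    # its standard error
df_resid  <- 8         # residual degrees of freedom

t_value <- est_hours / se_hours                  # about 11.64
p_value <- 2 * pt(-abs(t_value), df = df_resid)  # two-sided p-value
f_stat  <- t_value^2                             # about 135.4 with one predictor
r_sq    <- f_stat / (f_stat + df_resid)          # about 0.944

# 95% confidence interval for the slope
t_crit   <- qt(0.975, df = df_resid)
ci_hours <- est_hours + c(-1, 1) * t_crit * se_hours

round(c(t = t_value, F = f_stat, R2 = r_sq), 4)
ci_hours
```

Small differences in the last digit are expected because the inputs are the rounded values printed by summary().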

Interpreting the Results

The fitted regression equation is:

\[\hat{\text{Score}} = 51.82 + 3.75 \times \text{Hours}\]

Interpretation:

  • Intercept (\(\hat{\beta}_0 = 51.82\)): Expected exam score with zero hours of study
  • Slope (\(\hat{\beta}_1 = 3.75\)): For each additional hour studied, the predicted exam score increases by approximately 3.75 points
  • R-squared (0.94): About 94% of the variance in exam scores is explained by hours studied
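
To make the interpretation concrete, the sketch below turns the coefficients printed by summary(model) into a prediction. The helper predict_score() is hypothetical, and the example assumes that 5 hours lies within the range of the observed data, which the text does not show.

```r
# Coefficients copied from the summary(model) output
b0 <- 51.8223   # intercept
b1 <- 3.7541    # slope for hours

# Hypothetical helper: predicted score for a given number of study hours
predict_score <- function(hours) b0 + b1 * hours

predict_score(5)                     # 51.8223 + 3.7541 * 5 = 70.5928
predict_score(6) - predict_score(5)  # one extra hour adds the slope, 3.7541
```

With the fitted object available, predict(model, newdata = data.frame(hours = 5)) should give the same point prediction without copying coefficients by hand.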

Residual Analysis
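
Residuals are the differences between the observed and fitted values, \(Y_i - \hat{Y}_i\). The sketch below shows one way they might be extracted and plotted. The contents of study_data are not shown in the text, so the code builds a simulated stand-in with the same column names and ten observations (consistent with the 8 residual degrees of freedom in the output); its numbers are hypothetical and will not reproduce the summary above.

```r
# Hypothetical stand-in for study_data (the original data are not shown)
set.seed(123)
hours  <- seq(2, 11, length.out = 10)
scores <- 52 + 3.75 * hours + rnorm(10, sd = 3)
study_data <- data.frame(hours = hours, scores = scores)

model <- lm(scores ~ hours, data = study_data)

fit_vals <- fitted(model)   # fitted values
res      <- resid(model)    # observed minus fitted values

# Residual standard error: sqrt(SSE / (n - 2))
rse <- sqrt(sum(res^2) / df.residual(model))

# Residuals against fitted values; ideally no visible pattern around zero
plot(fit_vals, res, xlab = "Fitted score", ylab = "Residual")
abline(h = 0, lty = 2)
```

A curved pattern in this plot would suggest non-linearity, and a funnel shape would suggest non-constant variance.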

Assumptions of Linear Regression

For valid inference, the following assumptions should be met:

  1. Linearity: The relationship between X and Y is linear
  2. Independence: Observations are independent of each other
  3. Homoscedasticity: Constant variance of residuals across all levels of X
  4. Normality: Residuals are approximately normally distributed

These assumptions can be checked using diagnostic plots such as residual plots, Q-Q plots, and scale-location plots.
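
A minimal sketch of these checks uses the built-in plot() method for lm objects. Since study_data itself is not shown in the text, the code again uses a simulated, hypothetical stand-in with the same column names.

```r
# Hypothetical stand-in data; replace with the real study_data when available
set.seed(123)
study_data <- data.frame(hours = seq(2, 11, length.out = 10))
study_data$scores <- 52 + 3.75 * study_data$hours + rnorm(10, sd = 3)

model <- lm(scores ~ hours, data = study_data)

# Three of the built-in lm diagnostic plots, side by side
old_par <- par(mfrow = c(1, 3))
plot(model, which = 1)   # Residuals vs Fitted: linearity, constant variance
plot(model, which = 2)   # Normal Q-Q: normality of residuals
plot(model, which = 3)   # Scale-Location: homoscedasticity
par(old_par)             # restore the previous plotting layout
```

These plots do not address independence, which usually has to be judged from how the data were collected.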

Conclusion

Simple linear regression is a powerful tool for:

  • Modeling relationships between two variables
  • Making predictions based on predictor values
  • Understanding the strength and direction of associations
  • Testing hypotheses about relationships