2025-04-13

Introduction

Simple linear regression is a statistical method that models the relationship between a single predictor \(x\) and a response \(y\) by fitting a linear equation to observed data.

The Model

The general form of the regression line is:

\[ y = \beta_0 + \beta_1 x + \varepsilon \]

Where:

- \(y\) is the dependent variable
- \(x\) is the independent variable
- \(\beta_0\) is the intercept
- \(\beta_1\) is the slope
- \(\varepsilon\) is the error term
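
For example, if \(\beta_0 = 50\) and \(\beta_1 = 5\), a student who studies \(x = 4\) hours has an expected score of \(50 + 5 \times 4 = 70\); the error term \(\varepsilon\) captures how far an individual score deviates from that expectation.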

Use Case Example

Let’s examine the relationship between study hours and exam scores using simulated data; the first six observations are shown below.

##   study_hours exam_scores
## 1    3.588198    59.50752
## 2    8.094746    94.66267
## 3    4.680792    74.17083
## 4    8.947157    89.04510
## 5    9.464206   100.00000
## 6    1.410008    59.18236
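
A minimal sketch of how comparable data could be simulated. The seed, sample size, and noise level are assumptions; only the intercept near 50, the slope near 5, and the cap at 100 are suggested by the output above and the model summary later on.

set.seed(123)                                  # assumed seed; the original seed is unknown
n <- 50                                        # sample size implied by the 48 residual df in the summary
study_hours <- runif(n, min = 1, max = 10)     # hours studied
exam_scores <- 50 + 5 * study_hours +          # assumed intercept and slope, close to the fitted values
  rnorm(n, mean = 0, sd = 4.5)                 # assumed noise level
exam_scores <- pmin(exam_scores, 100)          # cap scores at 100, as suggested by row 5 above
data <- data.frame(study_hours, exam_scores)
head(data)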

GGPlot: Scatterplot with Regression Line

The scatterplot shows a positive linear relationship: exam scores tend to increase with study hours.
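
A sketch of the ggplot2 call that would produce such a plot; the axis labels, title, and standard-error ribbon are assumptions.

library(ggplot2)

# Scatterplot of exam scores against study hours with a fitted least-squares line
ggplot(data, aes(x = study_hours, y = exam_scores)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +      # regression line, default formula y ~ x
  labs(x = "Study hours", y = "Exam score",
       title = "Study hours vs. exam scores")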

GGPlot: Residual Plot
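
Plotting residuals against fitted values helps check the linearity and constant-variance assumptions. A minimal sketch of one way to build this plot with ggplot2; the model object name is an assumption.

model <- lm(exam_scores ~ study_hours, data = data)

# Residuals vs. fitted values; a flat, patternless cloud supports the model assumptions
ggplot(data.frame(fitted = fitted(model), resid = resid(model)),
       aes(x = fitted, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals", title = "Residual plot")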

Plotly: 3D View

A 3D plot lets us preview how the relationship would look if a second predictor (practice) were added alongside study hours.
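
A sketch of a plotly call for such a 3D view. The practice variable is not part of the simulated data shown earlier, so it is generated here purely for illustration.

library(plotly)

# Hypothetical extra predictor: practice hours (assumed, not in the original data)
data$practice <- runif(nrow(data), min = 0, max = 5)

# 3D scatter of study hours, practice hours, and exam scores
plot_ly(data, x = ~study_hours, y = ~practice, z = ~exam_scores,
        type = "scatter3d", mode = "markers")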

Math Slide: Least Squares Estimation

The least squares estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) minimize the sum of squared errors:

\[ S(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \]
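
Setting \(\partial S / \partial \beta_0 = 0\) and \(\partial S / \partial \beta_1 = 0\) gives the closed-form solutions:

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]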

R Code Summary
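
The summary below corresponds to fitting the model with lm() and printing its summary, along the lines of:

model <- lm(exam_scores ~ study_hours, data = data)   # fit the simple linear regression
summary(model)                                        # coefficient table shown below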

## 
## Call:
## lm(formula = exam_scores ~ study_hours, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.4639  -2.3416   0.1516   2.2989  10.9300 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  49.9300     1.4916   33.47   <2e-16 ***
## study_hours   4.9961     0.2384   20.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.42 on 48 degrees of freedom
## Multiple R-squared:  0.9015, Adjusted R-squared:  0.8994 
## F-statistic: 439.2 on 1 and 48 DF,  p-value: < 2.2e-16