What is Simple Linear Regression?

Simple linear regression models the relationship between two numeric variables:

One predictor variable \(x\)
One response variable \(y\)

The idea is to draw a straight line through the data that best explains how \(y\) changes as \(x\) changes.

A common use case: predicting a student’s exam score based on hours studied.

The Model

\[y = \beta_0 + \beta_1 x + \varepsilon\]

Where:

\(y\) is the response variable
\(x\) is the predictor variable
\(\beta_0\) is the intercept (expected value of \(y\) when \(x = 0\))
\(\beta_1\) is the slope (expected change in \(y\) per one-unit increase in \(x\))
\(\varepsilon\) is the random error term, assumed independent across observations with \(\varepsilon \sim N(0, \sigma^2)\)
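
One way to get a feel for the model is to simulate data from it. The sketch below is only an illustration: the parameter values (`beta0`, `beta1`, `sigma`), the sample size, and the range of `x` are all assumptions chosen for the demo, not values from the exam-score example.

```r
# Simulate data from y = beta0 + beta1 * x + epsilon.
# Every number below is an illustrative assumption.
set.seed(42)                 # make the random draws reproducible

n     <- 50                  # number of observations (assumed)
beta0 <- 40                  # intercept (assumed)
beta1 <- 5                   # slope (assumed)
sigma <- 2                   # standard deviation of the errors (assumed)

x       <- runif(n, min = 0, max = 10)     # predictor values
epsilon <- rnorm(n, mean = 0, sd = sigma)  # errors drawn from N(0, sigma^2)
y       <- beta0 + beta1 * x + epsilon     # response built from the model

sim <- data.frame(x = x, y = y)
head(sim)
```

Plotting `sim$y` against `sim$x` should show points scattered around a straight line, with the scatter controlled by `sigma`.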

Estimating the Coefficients

We estimate \(\beta_0\) and \(\beta_1\) using Ordinary Least Squares (OLS). OLS chooses the values that minimize the sum of squared differences between the observed and predicted values of \(y\):

\[\min_{\beta_0, \beta_1} \sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2\]

The minimizing values have closed forms, where \(\bar{x}\) and \(\bar{y}\) are the sample means of \(x\) and \(y\):

\[\hat{\beta}_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]
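
The closed-form estimates can be computed directly and compared with `lm()`. This is a minimal sketch on a small made-up data set; the `hours` and `scores` values here are hypothetical and are not the data behind the example in this document.

```r
# Hypothetical data, made up for illustration only
hours  <- c(1, 2, 3, 4, 5, 6)
scores <- c(52, 58, 67, 71, 80, 85)

# Sample means
x_bar <- mean(hours)
y_bar <- mean(scores)

# Slope: sum of cross-deviations divided by sum of squared x-deviations
beta1_hat <- sum((hours - x_bar) * (scores - y_bar)) / sum((hours - x_bar)^2)

# Intercept: forces the fitted line through the point (x_bar, y_bar)
beta0_hat <- y_bar - beta1_hat * x_bar

c(intercept = beta0_hat, slope = beta1_hat)

# lm() should agree with the hand calculation
fit <- lm(scores ~ hours)
all.equal(unname(coef(fit)), c(beta0_hat, beta1_hat))
```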

Example Data: Hours Studied vs Exam Score

Fitting the Regression Line

R Code Used to Fit the Model

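# Load ggplot2 for the plot below
library(ggplot2)

# Fit a simple linear regression
# (df is the example data frame with columns hours and scores)
model <- lm(scores ~ hours, data = df)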

# View the results
summary(model)

# Plot with regression line
ggplot(df, aes(x = hours, y = scores)) +
  geom_point(color = "#8C1D40", size = 3) +
  geom_smooth(method = "lm", se = TRUE, color = "steelblue") +
  labs(title = "Regression Line: Hours vs Score",
       x = "Hours Studied", y = "Exam Score") +
  theme_minimal()

Model Summary

## 
## Call:
## lm(formula = scores ~ hours, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4400 -0.8317 -0.3194  1.0280  3.3825 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.1388     1.4872   31.02 1.27e-09 ***
## hours         6.5859     0.3518   18.72 6.84e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.943 on 8 degrees of freedom
## Multiple R-squared:  0.9777, Adjusted R-squared:  0.9749 
## F-statistic: 350.5 on 1 and 8 DF,  p-value: 6.84e-08
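
Several numbers in this output can be reproduced by hand from the coefficient table. The sketch below uses the rounded values printed above, so results match only up to rounding. The sample size `n = 10` is inferred from the 8 residual degrees of freedom plus the two estimated coefficients, and the choice of 5 hours for the prediction is an arbitrary example value.

```r
# Values copied from the printed summary (rounded)
b0    <- 46.1388   # intercept estimate
b1    <- 6.5859    # slope estimate
se_b1 <- 0.3518    # standard error of the slope
r2    <- 0.9777    # multiple R-squared
n     <- 10        # 8 residual df + 2 estimated coefficients

# Predicted exam score for 5 hours of study (5 is an arbitrary example)
pred_5 <- b0 + b1 * 5
pred_5                     # about 79.07

# t value = estimate / standard error
t_slope <- b1 / se_b1
t_slope                    # about 18.72

# With a single predictor, the F-statistic is the squared slope t value
t_slope^2                  # about 350.5

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - 2)
adj_r2                     # about 0.9749

# 95% confidence interval for the slope, using the t distribution on 8 df
ci_slope <- b1 + c(-1, 1) * qt(0.975, df = n - 2) * se_b1
ci_slope
```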

Checking Assumptions

The estimates and p-values above are only trustworthy if a few assumptions hold:

Linearity: the relationship between \(x\) and \(y\) should be roughly linear
Normality of residuals: residuals should be approximately normally distributed
Constant variance: the spread of the residuals should be roughly the same across all fitted values
Independence: observations should not be related to each other
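
A rough sketch of how these checks might look in base R follows. The data frame is simulated as a stand-in: it reuses the column names `hours` and `scores`, but the values and the parameters used to generate them are assumptions, not the example data.

```r
# Simulated stand-in data; column names match the example, values do not
set.seed(1)
df <- data.frame(hours = seq(1, 10, by = 0.5))
df$scores <- 45 + 6 * df$hours + rnorm(nrow(df), sd = 2)

model <- lm(scores ~ hours, data = df)
res   <- residuals(model)

# Linearity and constant variance: residuals vs fitted values
# should show no curve and no funnel shape
plot(fitted(model), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality: points in the Q-Q plot should fall close to the line
qqnorm(res)
qqline(res)

# Shapiro-Wilk test as a formal check of normality
shapiro.test(res)
```

Independence usually cannot be read off a plot like this. It is mostly judged from how the data were collected, for example whether the same student appears more than once.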

Summary

Simple linear regression fits a line: \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\)
Coefficients are estimated by minimizing squared residuals (OLS)
The \(R^2\) value is the proportion of the variation in \(y\) that is explained by \(x\) (here 0.9777, or about 98%)
Always check the model assumptions before drawing conclusions
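
To make the \(R^2\) point concrete, it can be computed as one minus the ratio of residual to total variation. The sketch below uses a small hypothetical data set (the `hours` and `scores` values are made up for illustration) and compares the hand calculation with what `summary()` reports.

```r
# Hypothetical data, made up for illustration only
hours  <- c(1, 2, 3, 4, 5, 6)
scores <- c(52, 58, 67, 71, 80, 85)

fit <- lm(scores ~ hours)

rss <- sum(residuals(fit)^2)            # variation left unexplained by the line
tss <- sum((scores - mean(scores))^2)   # total variation in the response

r2_manual <- 1 - rss / tss

# All three should agree; with one predictor, R-squared equals
# the squared correlation between x and y
r2_manual
summary(fit)$r.squared
cor(hours, scores)^2
```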

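In our example, hours studied is a strong predictor of exam score (\(R^2 \approx 0.98\)): each additional hour of study is associated with about 6.6 more points on the exam. The more you study, the better you tend to do (no surprise there).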
