Understanding the linear relationship between a predictor \(x\) and a response \(y\).
2025-10-18
What this presentation covers
The model (mathematical form)
We model the data as \[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0,\sigma^2) \]
Key goal: estimate \(\beta_0\) and \(\beta_1\) and quantify their uncertainty (standard errors, \(t\)-tests, confidence intervals).
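To make the model concrete, the sketch below simulates data from it and then recovers the parameters with `lm()`. The parameter values (`beta0`, `beta1`, `sigma`), the sample size, and the range of `x` are hypothetical choices made only for illustration; they are not taken from the presentation's data.

```r
# Simulate from y_i = beta0 + beta1 * x_i + eps_i, with eps_i iid N(0, sigma^2).
# All numeric values here are hypothetical, chosen only for illustration.
set.seed(1)
n     <- 200
beta0 <- 50
beta1 <- 2.5
sigma <- 2

x   <- runif(n, min = 0, max = 10)      # predictor values
eps <- rnorm(n, mean = 0, sd = sigma)   # iid normal errors
y   <- beta0 + beta1 * x + eps          # response generated by the model

# Estimate the parameters and quantify their uncertainty
sim_fit <- lm(y ~ x)
coef(sim_fit)      # point estimates of beta0 and beta1
confint(sim_fit)   # 95% confidence intervals for both
```

With a reasonably large sample the estimates should land close to the values used to generate the data, and the intervals show how much uncertainty remains.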
Scenario: A data analyst wants to predict student exam scores (\(y\)) based on the number of hours spent studying (\(x\)).
Question: Is there a linear relationship between the amount of time a student spends studying and their exam score?
Visualization idea: If we plot exam scores against study hours, we might expect the points to cluster roughly along an upward-sloping line. This would indicate that more study time is generally associated with higher exam scores.
Goal of regression:
For a simple linear regression model to produce valid estimates, the following assumptions must hold:
\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) \]
Linearity: The expected value of \(y\) is a linear function of \(x\): \[ E[Y|X=x] = \beta_0 + \beta_1 x \]
Independence: Observations \((x_i, y_i)\) are independent of each other.
Homoscedasticity: The variance of the errors is constant: \[ Var(\varepsilon_i) = \sigma^2 \quad \text{for all } i \]
Normality of errors: The errors are normally distributed: \[ \varepsilon_i \sim N(0, \sigma^2) \]
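These assumptions are usually checked informally from the residuals of a fitted model. The sketch below uses the small study-hours dataset from this presentation's worked example; the object names `fit`, `res`, and `sw` and the choice of diagnostics are assumptions for illustration. Independence is mostly a matter of how the data were collected and is not checked here.

```r
# Study-hours data from the worked example in this presentation
hours  <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scores <- c(52, 55, 57, 60, 63, 65, 67, 70, 72, 75)

fit <- lm(scores ~ hours)
res <- resid(fit)

# Linearity and homoscedasticity: residuals should scatter evenly
# around zero with no curve or funnel shape
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality of errors: points should fall near the reference line
qqnorm(res)
qqline(res)

# Shapiro-Wilk test as a rough numerical check (low power when n is small)
sw <- shapiro.test(res)
sw$p.value
```

With only ten observations these checks are indicative at best; they are more informative on larger samples.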
Why it is important:
In simple linear regression, we estimate the parameters \(\beta_0\) and \(\beta_1\) by minimizing the Residual Sum of Squares (RSS):
\[ RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \]
Closed-form solutions: \[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]
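The closed-form expressions can be evaluated directly and compared with `lm()`. The sketch below does this for the study-hours data from this presentation's worked example; the names `b0` and `b1` are introduced here only for illustration.

```r
# Study-hours data from the worked example in this presentation
hours  <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scores <- c(52, 55, 57, 60, 63, 65, 67, 70, 72, 75)

x_bar <- mean(hours)
y_bar <- mean(scores)

# Slope: sum of cross-deviations divided by sum of squared x-deviations
b1 <- sum((hours - x_bar) * (scores - y_bar)) / sum((hours - x_bar)^2)

# Intercept: forces the fitted line through (x_bar, y_bar)
b0 <- y_bar - b1 * x_bar

c(intercept = b0, slope = b1)   # 49.8 and about 2.509

# Same estimates from lm()
coef(lm(scores ~ hours))
```

Both approaches give the same numbers, since `lm()` solves the same least-squares problem.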
Interpretation:
Key point:
Least squares provides the line that best fits the data we observed by minimizing the squared vertical distances between the points and the regression line.
We will use a simple linear regression model to predict exam scores based on hours studied.
# Sample dataset
hours <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scores <- c(52, 55, 57, 60, 63, 65, 67, 70, 72, 75)
df <- data.frame(hours, scores)

# Fit linear regression model
model <- lm(scores ~ hours, data = df)

# View summary of model
summary(model)
##
## Call:
## lm(formula = scores ~ hours, data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.3818 -0.3227  0.1182  0.1591  0.6545
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.80000    0.24371  204.34 3.68e-16 ***
## hours        2.50909    0.03928   63.88 4.01e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3568 on 8 degrees of freedom
## Multiple R-squared: 0.998, Adjusted R-squared: 0.9978
## F-statistic: 4081 on 1 and 8 DF, p-value: 4.01e-12
We can visualize the relationship between exam scores and hours studied:
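One way to draw such a plot in base R is sketched below. The plotting choices (labels, point style, line colour) are assumptions for illustration and may differ from the figure shown in the original slides.

```r
# Same data and model as in the worked example
hours  <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scores <- c(52, 55, 57, 60, 63, 65, 67, 70, 72, 75)
df     <- data.frame(hours, scores)
model  <- lm(scores ~ hours, data = df)

# Scatter plot of the observed data
plot(df$hours, df$scores,
     xlab = "Hours studied", ylab = "Exam score",
     main = "Exam score vs. hours studied", pch = 19)

# Overlay the fitted least-squares line
abline(model, col = "blue", lwd = 2)
```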
We want to test whether hours studied significantly affects exam scores:
\[ H_0: \beta_1 = 0 \quad \text{vs} \quad H_1: \beta_1 \neq 0 \]
Test statistic:
\[ t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} \]
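Under \(H_0\) and the model assumptions, this statistic follows a \(t\) distribution with \(n - 2\) degrees of freedom. The sketch below computes the statistic and its two-sided p-value by hand from the fitted model and should agree with the `hours` row of `summary(model)`; the names `t_stat` and `p_value` are introduced here only for illustration.

```r
# Same data and model as in the worked example
hours  <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scores <- c(52, 55, 57, 60, 63, 65, 67, 70, 72, 75)
model  <- lm(scores ~ hours)

coefs  <- summary(model)$coefficients
b1_hat <- coefs["hours", "Estimate"]
se_b1  <- coefs["hours", "Std. Error"]

# Test statistic for H0: beta_1 = 0
t_stat <- b1_hat / se_b1

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
df_resid <- df.residual(model)
p_value  <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

c(t = t_stat, p = p_value)
```

A p-value this small leads us to reject \(H_0\) at any conventional significance level.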
Once the model is fitted, we can compute a \(100(1-\alpha)\%\) confidence interval for the expected exam score at a given number of study hours \(x_0\):
\[ \hat{y}_0 \pm t_{\alpha/2, n-2} \cdot SE(\hat{y}_0) \]
Where the standard error of the predicted mean is:
\[ SE(\hat{y}_0) = \sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right)} \]
In R:
predict(model, newdata = data.frame(hours = 7), interval = "confidence")
##        fit      lwr      upr
## 1 67.36364 67.07014 67.65713
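The same interval can be reproduced from the standard-error formula, which shows what `predict()` is doing. This is a sketch under the assumption of a 95% level (the default for `predict()`); the intermediate names (`sigma2_hat`, `se_y0`, `t_crit`) are introduced here only for illustration.

```r
# Same data and model as in the worked example
hours  <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scores <- c(52, 55, 57, 60, 63, 65, 67, 70, 72, 75)
model  <- lm(scores ~ hours)

x0 <- 7
n  <- length(hours)

# Estimated error variance: RSS / (n - 2)
sigma2_hat <- sum(resid(model)^2) / (n - 2)

# Point estimate of the mean response at x0
y0_hat <- sum(coef(model) * c(1, x0))

# Standard error of the estimated mean response at x0
sxx   <- sum((hours - mean(hours))^2)
se_y0 <- sqrt(sigma2_hat * (1 / n + (x0 - mean(hours))^2 / sxx))

# Critical value for a 95% interval with n - 2 degrees of freedom
t_crit <- qt(0.975, df = n - 2)

ci <- c(fit = y0_hat,
        lwr = y0_hat - t_crit * se_y0,
        upr = y0_hat + t_crit * se_y0)
ci
```

The interval is narrowest at \(x_0 = \bar{x}\) and widens as \(x_0\) moves away from the mean of the observed study hours.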
Big picture
Simple linear regression provides a foundation for understanding predictive relationships and is a first step toward more advanced models such as multiple regression and logistic regression.