2026-03-01
Simple Linear Regression
What Does It Do?
- Simple linear regression models the relationship between a dependent variable (Y) and one independent variable (X).
- It helps us to understand trends and make predictions.
- For example, predicting exam scores (Y) from the number of hours spent studying (X).
What Does the Model Look Like?
Simple Linear Regression Model
- \(y=b_{0}+b_{1}x+ε\)
- Y: Dependent variable (Exam scores)
- X: Independent variable (Hours studied)
- \(b_{0}\): Intercept
- \(b_{1}\): Slope
- ε: Random error
- \(e_{i}\): Residuals (observed y minus predicted y, the sample estimates of ε)
- This equation says that the dependent variable changes linearly with the independent variable, plus some random noise.
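To make the equation concrete, the sketch below simulates data from the model and then fits a line to it. The intercept, slope, noise level, and sample size are illustrative assumptions, not estimates from the student dataset.

```r
# Simulate data from y = b0 + b1 * x + epsilon.
# All numeric values below are made up for illustration.
set.seed(42)                              # make the random draws reproducible

n  <- 100                                 # hypothetical sample size
b0 <- 20                                  # hypothetical intercept
b1 <- 5                                   # hypothetical slope

x       <- runif(n, min = 1, max = 9)     # hours studied
epsilon <- rnorm(n, mean = 0, sd = 4)     # random error
y       <- b0 + b1 * x + epsilon          # exam scores

# Fitting a line to the simulated data should give estimates near b0 and b1
sim_fit <- lm(y ~ x)
coef(sim_fit)
```

Because of the random error, the estimated coefficients will be close to, but not exactly equal to, the values used to generate the data.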
Relationship Between the Slope, Intercept, and Random Error
Interpretations
- \(b_{0}\) is the predicted value of Y when X = 0.
- \(b_{1}\) is the expected change in Y for a one-unit increase in X.
- ε exists because factors other than X influence Y, data collection is imperfect, and human behavior is unpredictable.
- Example: X = hours studied and Y = exam scores.
- If \(b_{1}\) = +15, then each additional hour studied raises the predicted exam score by 15 points, and each hour less lowers it by 15 points.
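The interpretations above can be checked with a small sketch. The slope of 15 comes from the example; the intercept of 10 and the helper `predict_score()` are hypothetical and exist only for illustration.

```r
# Coefficients used only to illustrate the interpretation
b0 <- 10   # hypothetical intercept: predicted score when hours studied = 0
b1 <- 15   # slope from the example: change in predicted score per extra hour

# Hypothetical helper that returns the predicted score (no random error term)
predict_score <- function(hours) {
  b0 + b1 * hours
}

predict_score(0)                      # equals the intercept b0
predict_score(3) - predict_score(2)   # equals the slope b1
```

The difference between any two predictions one hour apart is always \(b_{1}\), which is what "expected change in Y for a one-unit increase in X" means.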
Example Dataset and Scatterplot
Hours Studied vs Performance Index

Fitting the Model in R
Simple Linear Regression Code
- library(ggplot2)
- df <- read.csv("R/Student_Performance.csv")
- model <- lm(Performance.Index ~ Hours.Studied, data = df)
- summary(model)
- ggplot(df, aes(x = Hours.Studied, y = Performance.Index)) + geom_point() + geom_smooth(method = "lm", se = FALSE) + labs(title = "Hours Studied vs Performance Index", x = "Hours Studied", y = "Performance Index")
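Once the model is fitted, the coefficients and predictions can be pulled out directly. The sketch below uses a small invented data frame with the same column names as the dataset so that it runs without the CSV file; the numbers are not real student data.

```r
# Stand-in data with the same column names as the slides' dataset.
# The values are invented so the example runs without the CSV file.
df <- data.frame(
  Hours.Studied     = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  Performance.Index = c(22, 31, 33, 45, 48, 59, 61, 72, 78)
)

model <- lm(Performance.Index ~ Hours.Studied, data = df)

estimates <- coef(model)                  # estimated intercept (b0) and slope (b1)
r_squared <- summary(model)$r.squared     # share of variance in Y explained by X

# Predicted performance index for a hypothetical student who studies 5.5 hours
new_student <- data.frame(Hours.Studied = 5.5)
prediction  <- unname(predict(model, newdata = new_student))

estimates
r_squared
prediction
```

`predict()` applies the fitted equation, so the result equals the estimated intercept plus the estimated slope times 5.5.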
Regression Line on the Data
Visualizing the Fitted Line

- The scatterplot shows a positive linear relationship between hours studied and performance index: as study time increases, performance tends to increase, although scores still vary at any given number of study hours.
Checking Model Fit
How Well Does the Model Fit?
- Residuals (\(e_{i}\)) are the differences between the observed values and the predicted values, representing the part of the data the model fails to explain.
- Formula: \(e_{i}\) = Observed y - Predicted y.

- The residuals are centered around 0, which is a good sign and suggests that a linear model is a reasonable choice here.
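The residual formula can be applied by hand and compared with R's built-in `resid()`. As a minimal sketch, the data frame below is an invented stand-in with the dataset's column names, not the real student data.

```r
# Invented stand-in data so the example is self-contained
df <- data.frame(
  Hours.Studied     = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  Performance.Index = c(22, 31, 33, 45, 48, 59, 61, 72, 78)
)

model <- lm(Performance.Index ~ Hours.Studied, data = df)

# Residual = observed y - predicted y
predicted <- unname(fitted(model))
e         <- df$Performance.Index - predicted

# The hand-computed residuals should match R's built-in residuals
matches_builtin <- isTRUE(all.equal(e, unname(resid(model))))

# With an intercept in the model, least-squares residuals sum to (numerically) zero
residual_sum <- sum(e)

matches_builtin
residual_sum
```

To inspect the residuals visually, `plot(fitted(model), resid(model))` followed by `abline(h = 0)` draws them against the fitted values; points scattered evenly around the horizontal line, with no curve or funnel shape, support the linear fit.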
Interactive Plot
Final Takeaway
- Dependent Variable: Performance Index
- Independent Variable: Hours studied
- Positive relationship shown by regression line
- The model helps to explain and predict outcomes
- Residuals help check whether the fit is reasonable