2025-02-08

Simple Linear Regression - An Introduction

Simple linear regression models the relationship between one dependent variable and one independent variable. The model is described by the following equation:

\(Y = \beta_0 + \beta_1 X + \epsilon\)

In this equation, \(Y\) is the dependent variable, \(\beta_0\) is the intercept, \(\beta_1\) is the slope, \(X\) is the independent variable, and \(\epsilon\) is the error term.
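As a minimal sketch of how the intercept and slope can be estimated, the snippet below simulates data from the equation above, computes the ordinary least squares estimates by hand, and compares them with R's `lm()`. The simulated values (true intercept 60, true slope 2.5, 30 observations) and all variable names are assumptions chosen for illustration; they are not the data set used in these slides.

```r
# Simulated data: assumed values for illustration only.
set.seed(1)
n <- 30
x <- runif(n, min = 0, max = 15)      # independent variable X
eps <- rnorm(n, mean = 0, sd = 10)    # error term epsilon
y <- 60 + 2.5 * x + eps               # Y = beta_0 + beta_1 * X + epsilon

# Ordinary least squares estimates computed by hand.
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# The same estimates from lm().
fit <- lm(y ~ x)
print(c(by_hand_intercept = b0, by_hand_slope = b1))
print(coef(fit))
```

The two printed pairs should agree, since `lm()` fits the model by least squares.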

Where Linear Regression Accurately Models Data

To understand which data sets work best with linear regression, let’s look at one as an example.

This scatter plot shows the exam scores students achieved against the number of hours they studied.

This data set works well to demonstrate linear regression because there’s an obvious relationship between the amount of time spent studying and the exam score achieved by a student.

Application of Simple Linear Regression

Now that we have examined the dataset, let’s apply a simple linear regression model to analyze the relationship between study time and exam score.

The trendline shows a positive correlation between study time and exam score. Trendlines let us predict an outcome from a given data set. For example, if a student studies for \(X = 7\) hours, the predicted score is:

\[ \hat{Y} = \beta_0 + \beta_1 X = 60.81 + 2.55 \times 7 = 78.66 \]

So if a student studies for 7 hours, the trendline predicts a score of 78.66.
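The same prediction can be sketched in R. The helper `predict_score` is a hypothetical name introduced here for illustration; the coefficients are the rounded values from the trendline equation.

```r
# Coefficients as rounded in the trendline equation from the slides.
intercept <- 60.81
slope <- 2.55

# Hypothetical helper: predicted exam score for a given number of study hours.
predict_score <- function(hours) {
  intercept + slope * hours
}

print(predict_score(7))            # prediction for 7 hours of study
print(predict_score(c(0, 5, 10)))  # the function is vectorised over hours
```

With a fitted model object, `predict(model, newdata = data.frame(studyhours = 7))` performs the same calculation using the unrounded coefficients.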

Residuals Analysis

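To understand whether our linear regression model is appropriate for the data, we use residuals. A residual is the difference between the score a student actually achieved and the score the trendline predicts. This is the residual summary for the model from the last slide: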

Residuals: Min: -36.673 1Q: -3.572 Median: 2.966 3Q: 8.613 Max: 23.992

Min = -36.673 is the largest over-prediction: for one student, the trendline predicted a score 36.673 points higher than the score actually achieved.

Max = 23.992 is the largest under-prediction: for one student, the trendline predicted a score 23.992 points lower than the score actually achieved.

1Q = -3.572 means 25% of the residuals are less than or equal to -3.572.

Median = 2.966 means half of the residuals are less than or equal to 2.966.

3Q = 8.613 means 75% of the residuals are less than or equal to 8.613.
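A residual summary like the one above can be reproduced with a short sketch. The data here are simulated stand-ins (an assumption, not the data set from the slides), so the numbers will differ from those shown; only the calculation is the point.

```r
# Hypothetical data standing in for the study data (not the slides' data set).
set.seed(42)
studydata <- data.frame(studyhours = runif(30, 0, 15))
studydata$examscore <- 60 + 2.5 * studydata$studyhours + rnorm(30, sd = 12)

fit <- lm(examscore ~ studyhours, data = studydata)

# A residual is the actual score minus the score predicted by the trendline.
res <- studydata$examscore - fitted(fit)

# Five-number summary in the same layout as the slide: Min, 1Q, Median, 3Q, Max.
res_summary <- quantile(res, probs = c(0, 0.25, 0.5, 0.75, 1))
names(res_summary) <- c("Min", "1Q", "Median", "3Q", "Max")
print(round(res_summary, 3))
```

Negative residuals are over-predictions and positive residuals are under-predictions, which is why Min and Max mark the two extremes.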

Coefficients Analysis

Coefficients in a simple linear regression model represent how the dependent and independent variables relate. These are the coefficients of the aforementioned linear regression model:

             Estimate   Std. Error  t value    Pr(>|t|)
(Intercept)  60.807434  4.6723065   13.014436  0.00e+00
studyhours   2.550183   0.5222307   4.883249   3.82e-05

From the table, \(\beta_0 = 60.807\) means the predicted score for a student who studied 0 hours is 60.807. \(\beta_1 = 2.550\) means that for each additional hour studied, the trendline predicts an increase of 2.55 points in exam score.

Both \(\beta_0\) and \(\beta_1\) are statistically significant, as their \(p\)-values are less than 0.05. For the slope, this means the data give strong evidence that exam score changes with study hours. For the intercept, it means the predicted score at 0 hours differs significantly from 0.
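The t values and p-values in the table follow from the estimates and standard errors, as the sketch below shows. The residual degrees of freedom (28, that is, 30 observations minus 2 coefficients) are an assumption: the slides do not state the sample size, and this value is inferred from the reported R squared and adjusted R squared.

```r
# Estimates and standard errors copied from the coefficient table.
estimate  <- c(intercept = 60.807434, studyhours = 2.550183)
std_error <- c(intercept = 4.6723065, studyhours = 0.5222307)

# Assumption: 28 residual degrees of freedom (30 observations, 2 coefficients).
df_resid <- 28

# t value = estimate / standard error.
t_value <- estimate / std_error

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

print(round(t_value, 4))
print(signif(p_value, 3))
```

The computed t values should match the table's `t value` column to the precision shown.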

Model Fit and Evaluation

The residual standard error, \(R^2\), and F-statistic let us determine the quality of our linear regression model. The following values are from the example introduced earlier:

Residual Standard Error: 13.24
Multiple R squared: 0.4599
Adjusted R squared: 0.4407
F-statistic: 23.85, p-value: 3.82e-05

Residual standard error = 13.24 means the scores students actually achieved typically deviate from the trendline by about 13.24 points.

Multiple R squared = 46% means 46% of the variation in exam scores is explained by the amount of time studied, with the remaining 54% due to factors outside the linear regression model. Adjusted R squared = 44% is the same measure with a penalty for the number of predictors. With a single predictor (study time) the two values are close; the adjusted value matters most when comparing models with different numbers of predictors.

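F-statistic = 23.85 with p-value = 3.82e-05 tests whether the model explains more variation than a model with no predictors. The very small p-value means study hours are a statistically significant predictor of exam score. With a single predictor, this p-value is the same as the one for the slope in the coefficients table.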
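To make these three measures concrete, the sketch below computes them by hand from the residuals and compares them with the values reported by `summary()`. The data are simulated stand-ins (an assumption, not the slides' data set), so the numbers will not match those above.

```r
# Hypothetical data (not the slides' data set).
set.seed(7)
d <- data.frame(studyhours = runif(30, 0, 15))
d$examscore <- 60 + 2.5 * d$studyhours + rnorm(30, sd = 12)
fit <- lm(examscore ~ studyhours, data = d)

n <- nrow(d)
p <- 1                                            # number of predictors
rss <- sum(residuals(fit)^2)                      # residual sum of squares
tss <- sum((d$examscore - mean(d$examscore))^2)   # total sum of squares

rse    <- sqrt(rss / (n - p - 1))                 # residual standard error
r2     <- 1 - rss / tss                           # multiple R squared
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)    # adjusted R squared
f_stat <- ((tss - rss) / p) / (rss / (n - p - 1)) # F-statistic

# Compare with the values summary() reports.
s <- summary(fit)
print(c(rse = rse, r2 = r2, adj_r2 = adj_r2, f = f_stat))
print(c(rse = s$sigma, r2 = s$r.squared, adj_r2 = s$adj.r.squared,
        f = unname(s$fstatistic["value"])))
```

Both printed rows should agree, which shows that all three measures are derived from the same residual and total sums of squares.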

Exploring Regression Model Fit in 3D

We can gain a deeper understanding of how well our simple linear regression model fits the data by examining this 3D interactive scatter plot. This is achieved by visualizing the relationship between:

Study Hours (x-axis): The number of hours studied.

Predicted Scores (y-axis): The exam scores predicted by the regression model.

Residuals (z-axis): The difference between actual scores and predicted scores.
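A plot with these three axes could be built along the following lines. This is only a sketch: the slides do not say which plotting package produced the original figure, so the use of the plotly package is an assumption, and the data are simulated stand-ins rather than the slides' data set.

```r
# Hypothetical data (not the slides' data set).
set.seed(3)
d <- data.frame(studyhours = runif(30, 0, 15))
d$examscore <- 60 + 2.5 * d$studyhours + rnorm(30, sd = 12)
fit <- lm(examscore ~ studyhours, data = d)

# One row per student: the three quantities shown on the plot's axes.
plot_data <- data.frame(
  studyhours = d$studyhours,            # x-axis
  predicted  = unname(fitted(fit)),     # y-axis
  residual   = unname(residuals(fit))   # z-axis: actual - predicted
)

# Draw the plot only in an interactive session with plotly installed.
if (interactive() && requireNamespace("plotly", quietly = TRUE)) {
  fig <- plotly::plot_ly(
    plot_data,
    x = ~studyhours, y = ~predicted, z = ~residual,
    type = "scatter3d", mode = "markers"
  )
  print(fig)
}
```

Because predicted scores lie exactly on the trendline, the points form a straight line in the x-y plane, and the z-axis shows how far each actual score sits above or below it.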

Calculating the Regression Model in R

This slide shows how to fit the simple linear regression model in R and print the resulting trendline equation:

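# Fit the model; studydata has one row per student.
model <- lm(examscore ~ studyhours, data = studydata)
# Extract the intercept and slope, rounded to two decimals.
intercept <- round(coef(model)[1], 2)
slope <- round(coef(model)[2], 2)
# Print the trendline equation.
cat(sprintf("Trendline Equation: Y = %.2f + %.2f * X\n", intercept, slope))
## Trendline Equation: Y = 60.81 + 2.55 * X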

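The same code can be used on any similar data set to produce a linear regression model in R: replace studydata with your data frame, and examscore and studyhours with the names of your dependent and independent variables.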
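One way to reuse the code is to wrap it in a function that takes the data frame and column names as arguments. The function name `trendline_equation` and its parameters are hypothetical, introduced here for illustration; the example call uses R's built-in `mtcars` data set rather than the study data.

```r
# Hypothetical helper: fit response ~ predictor on any data frame and
# return the trendline equation as text.
trendline_equation <- function(data, response, predictor) {
  model <- lm(reformulate(predictor, response = response), data = data)
  sprintf("Trendline Equation: Y = %.2f + %.2f * X",
          coef(model)[1], coef(model)[2])
}

# Example with R's built-in mtcars data: fuel economy (mpg) against weight (wt).
eq <- trendline_equation(mtcars, response = "mpg", predictor = "wt")
cat(eq, "\n")
```

`reformulate()` builds the model formula from column names given as strings, so the same function works for any pair of numeric columns. Note that a negative slope is printed after the plus sign.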