Simple Linear Regression

What is Simple Linear Regression?

Simple Linear Regression is a statistical method used to model the relationship between two variables:

Dependent variable (Y): The outcome we want to predict
Independent variable (X): The predictor variable

The relationship is expressed as:

\[Y = \beta_0 + \beta_1 X + \epsilon\]

where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\epsilon\) is the error term.

The Regression Equation

The estimated regression line is given by:

\[\hat{Y} = b_0 + b_1 X\]

where:

\(\hat{Y}\) is the predicted value of Y
\(b_0 = \bar{Y} - b_1\bar{X}\) is the estimated intercept
\(b_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\) is the estimated slope

The slope \(b_1\) is the estimated average change in Y for a one-unit increase in X.
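The two formulas above can be checked numerically. The sketch below is an illustration only: it uses R's built-in `cars` data set (speed as X, stopping distance as Y) as a stand-in, not the study-hours data, and the variable names `b0`, `b1`, and `manual` are chosen here for the example.

```r
# Illustration on R's built-in `cars` data (an assumption; not the study-hours data)
x <- cars$speed
y <- cars$dist

# Slope: sum of cross-deviations divided by sum of squared deviations of X
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: chosen so the line passes through the point of means
b0 <- mean(y) - b1 * mean(x)

# The hand-computed estimates should agree with lm()
fit <- lm(dist ~ speed, data = cars)
manual <- c(b0, b1)
all.equal(manual, unname(coef(fit)))
```

If the formulas are implemented correctly, `all.equal()` reports that the hand-computed estimates match the coefficients returned by `lm()`.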

Key Assumptions

For simple linear regression to be valid, several assumptions must be met:

Linearity: The relationship between X and Y is linear
Independence: Observations are independent of each other
Normality: Residuals are normally distributed
Homoscedasticity: Residuals have constant variance across values of X

Violations of these assumptions can lead to biased or inefficient estimates and to unreliable standard errors, confidence intervals, and p-values.
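A few common residual checks can be sketched as follows. This is a minimal example under stated assumptions: it fits a model to R's built-in `cars` data rather than the study-hours data, and the Shapiro-Wilk test is only one of several possible normality checks. Independence usually has to be judged from how the data were collected rather than from a plot.

```r
# Illustration on R's built-in `cars` data (an assumption; not the study-hours data)
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Linearity and spread: residuals vs fitted values should show no clear pattern
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality: points should fall close to the reference line
qqnorm(res)
qqline(res)

# Formal normality check; a small p-value suggests non-normal residuals
sw <- shapiro.test(res)
sw$p.value
```

Plots are usually more informative than the test alone, since a formal test can flag trivial departures in large samples and miss real ones in small samples.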

Example: Study Hours vs Exam Scores

Let’s examine the relationship between study hours and exam scores for a group of students.

## Mean Study Hours: 10

## Mean Exam Score: 81.27

## Correlation: 0.967

Scatter Plot with Regression Line

Fitting the Model in R

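# study_data is assumed to be a data frame with numeric columns Hours and Score
model <- lm(Score ~ Hours, data = study_data)

# Coefficient table, residual summary, and fit statistics
summary(model)

# Estimated intercept and slope
coefficients(model)

# Predicted mean scores at 5, 10, and 15 study hours, with 95% confidence intervals
new_data <- data.frame(Hours = c(5, 10, 15))
predictions <- predict(model, newdata = new_data, interval = "confidence")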

Model Results

## 
## Call:
## lm(formula = Score ~ Hours, data = study_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4802 -3.2541 -0.2431  2.1036  7.6561 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  50.0946     2.4966   20.07 1.03e-11 ***
## Hours         3.1178     0.2206   14.14 1.11e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.679 on 14 degrees of freedom
## Multiple R-squared:  0.9345, Adjusted R-squared:  0.9298 
## F-statistic: 199.8 on 1 and 14 DF,  p-value: 1.113e-09
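As a worked check of the fitted equation, plugging the reported coefficients into \(\hat{Y} = b_0 + b_1 X\) for 5, 10, and 15 study hours gives the point predictions below. These are computed by hand from the rounded coefficients in the table, so the last digit may differ slightly from R's own output.

\[\hat{Y}(5) = 50.0946 + 3.1178 \times 5 = 65.6836\]

\[\hat{Y}(10) = 50.0946 + 3.1178 \times 10 = 81.2726\]

\[\hat{Y}(15) = 50.0946 + 3.1178 \times 15 = 96.8616\]

The prediction at 10 hours, the mean of Hours, reproduces the mean exam score of 81.27, as expected because the least-squares line passes through \((\bar{X}, \bar{Y})\). The t value for the slope can be recovered the same way as the estimate divided by its standard error:

\[t = \frac{3.1178}{0.2206} \approx 14.13\]

which agrees with the reported 14.14 up to rounding of the inputs.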

Regression Plot

Residual Analysis

Model Evaluation Metrics

The quality of our regression model can be assessed using several metrics:

Coefficient of Determination (\(R^2\)):

\[R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}\]

For our model:

## R-squared: 0.9345

## Adjusted R-squared: 0.9298

## RMSE: 4.377

## 
## Interpretation: 93.5 % of variance in exam scores is explained by study hours
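These metrics can be computed directly from the residuals. The sketch below is an illustration on R's built-in `cars` data, not the study-hours data, and the variable names are chosen for the example. It takes RMSE as the square root of the mean squared residual (dividing by n), which is an assumption about how the figure above was obtained. That assumption is consistent with the reported values, since \(4.679 \times \sqrt{14/16} \approx 4.377\).

```r
# Illustration on R's built-in `cars` data (an assumption; not the study-hours data)
fit <- lm(dist ~ speed, data = cars)
y <- cars$dist
n <- length(y)

ss_res <- sum(resid(fit)^2)      # residual sum of squares
ss_tot <- sum((y - mean(y))^2)   # total sum of squares

r2     <- 1 - ss_res / ss_tot
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - 2)   # n - 2 because there is one predictor
rmse   <- sqrt(ss_res / n)                   # divides by n
rse    <- sqrt(ss_res / (n - 2))             # residual standard error, divides by n - 2

# Compare with the values stored in summary()
s <- summary(fit)
all.equal(r2, s$r.squared)
all.equal(adj_r2, s$adj.r.squared)
all.equal(rse, s$sigma)
```

Because RMSE divides by n and the residual standard error divides by the residual degrees of freedom, RMSE computed this way is always slightly smaller than the residual standard error for the same model.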

Summary

Key Takeaways:

Simple linear regression models the relationship between two continuous variables
The model provides both prediction capability and insight into variable relationships
Always check assumptions and examine residuals
\(R^2\) indicates how well the model explains variability in the data
From our example: Each additional study hour is associated with an increase of approximately 3.12 points in exam score