2025-04-13

What is Linear Regression?

Linear regression is a way to look at how two things are related.
Example: How does the number of hours spent studying affect test scores?

When you want to predict something, you can use linear regression to find the line that best fits the data. This line helps you understand the relationship between two variables.

Why Should We Use Linear Regression?

  • To check if two things are related
  • To make well-informed predictions
  • To find and understand patterns in data

Simple Linear Regression: Overview

Simple linear regression models the relationship between two variables:

  1. Dependent variable (y) - the outcome we want to predict
  2. Independent variable (x) - what we use to predict

The Math

Formula:

\[ y = \beta_0 + \beta_1 x + \epsilon \]

Where:

- \(y\) = outcome we’re predicting (like test score)
- \(x\) = thing we know (like study hours)
- \(\beta_0\) and \(\beta_1\) = the intercept and the slope, the numbers R finds for us
- \(\epsilon\) = small differences we can’t explain
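
To make "numbers R finds for us" concrete, here is a minimal sketch of how the least-squares estimates can be computed by hand and checked against lm(). The vectors hours and score are hypothetical toy data made up for illustration, not the data used elsewhere in these slides.

```r
# Hypothetical toy data (an assumption for illustration only)
hours <- c(1, 2, 3, 4, 5)
score <- c(52, 54, 61, 64, 70)

# Closed-form least-squares estimates for y = beta_0 + beta_1 * x + epsilon
beta1_hat <- cov(hours, score) / var(hours)          # slope
beta0_hat <- mean(score) - beta1_hat * mean(hours)   # intercept

# lm() should arrive at the same two numbers
fit <- lm(score ~ hours)
print(c(intercept = beta0_hat, slope = beta1_hat))
print(coef(fit))
```

The slope is the covariance of x and y divided by the variance of x, and the intercept is chosen so the line passes through the point of means.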

Visualization

# Load ggplot2, which provides ggplot(), geom_point(), and geom_smooth()
library(ggplot2)

# Create more realistic data with some random variation
set.seed(123)
study_data <- data.frame(
  hours_studied = seq(1, 10, by = 1)
)
study_data$test_score <- 45 + 5 * study_data$hours_studied + rnorm(10, 0, 3)
study_data$hours_slept <- rnorm(10, mean = 7, sd = 1)

# Create the plot
ggplot(study_data, aes(x = hours_studied, y = test_score)) +
  geom_point(size = 3, color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Test Scores vs Hours Studied",
       x = "Hours Studied",
       y = "Test Score") +
  theme_minimal(base_size = 14)

3D Analysis

Residuals Analysis
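
One common way to examine residuals, sketched here under the assumption that the model is the same test_score ~ hours_studied fit used in these slides, is to plot them against the fitted values. The data are regenerated with the same seed so the block runs on its own. The base-graphics plot is an illustrative choice, not necessarily the figure originally shown here.

```r
# Recreate the simulated study data (same seed and formula as the slides)
set.seed(123)
study_data <- data.frame(hours_studied = seq(1, 10, by = 1))
study_data$test_score <- 45 + 5 * study_data$hours_studied + rnorm(10, 0, 3)

# Fit the simple linear regression
fit <- lm(test_score ~ hours_studied, data = study_data)

# Residuals = observed score minus the score predicted by the line
res <- residuals(fit)

# Residuals vs fitted values: points scattered evenly around zero
# suggest the straight-line model is a reasonable description
plot(fitted(fit), res,
     xlab = "Fitted Test Score", ylab = "Residual",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)
```

A curve or funnel shape in this plot would hint that a straight line, or the constant-variance assumption, does not suit the data.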

Results & Significance

The regression shows:

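  1. A strong positive relationship between study hours and test scores (R-squared of about 0.96)
  2. For each additional hour of studying, the fitted slope is about 4.75, so: \[\Delta \text{Score} \approx +5\ \text{points}\]
  3. The slope is statistically significant (p-value of about 4.4e-07)

summary(lm(test_score ~ hours_studied, data = study_data))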
## 
## Call:
## lm(formula = test_score ~ hours_studied, data = study_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4043 -1.6873 -0.4177  1.1561  5.0443 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    46.5764     2.0018   23.27 1.24e-08 ***
## hours_studied   4.7541     0.3226   14.74 4.42e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.93 on 8 degrees of freedom
## Multiple R-squared:  0.9645, Adjusted R-squared:   0.96 
## F-statistic: 217.1 on 1 and 8 DF,  p-value: 4.423e-07
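
The numbers in this summary can also be pulled out programmatically and used for prediction. The sketch below regenerates the same simulated data so it runs on its own. The new_hours values are hypothetical inputs chosen only for illustration.

```r
# Recreate the simulated study data (same seed and formula as the slides)
set.seed(123)
study_data <- data.frame(hours_studied = seq(1, 10, by = 1))
study_data$test_score <- 45 + 5 * study_data$hours_studied + rnorm(10, 0, 3)

fit <- lm(test_score ~ hours_studied, data = study_data)

# Estimated intercept and slope
print(coef(fit))

# 95% confidence intervals for both coefficients
print(confint(fit, level = 0.95))

# Predicted scores, with prediction intervals, for hypothetical study times
new_hours <- data.frame(hours_studied = c(4.5, 8))
pred <- predict(fit, newdata = new_hours, interval = "prediction")
print(pred)
```

The prediction interval is wider than the confidence interval for the line itself, because it also accounts for the scatter of individual scores around the line.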