What is Regression?

Understanding Relationships in Data

  • Regression analysis models the relationship between a dependent variable (what we predict) and independent variables (what we use to predict).
  • Goal: Understand how the dependent variable changes as the independent variables change, and predict outcomes for new observations.
  • Types:
    • Simple Linear Regression: One dependent, one independent variable.
    • Multiple Linear Regression: One dependent, two or more independent variables.
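
In R, the two types differ only in how many terms appear on the right-hand side of the model formula. The sketch below uses `mtcars`, a dataset bundled with R, purely as an illustration; it is not part of this deck's example.

```r
# mtcars ships with R; it is used here only to illustrate the formula syntax
simple_fit   <- lm(mpg ~ wt, data = mtcars)       # one independent variable
multiple_fit <- lm(mpg ~ wt + hp, data = mtcars)  # two independent variables

# One coefficient per independent variable, plus the intercept
names(coef(simple_fit))
names(coef(multiple_fit))
```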

Simple Linear Regression Model

The Basics

  • Focuses on the relationship between two continuous variables.
  • Assumes a linear relationship.
  • Dependent Variable (Y): Outcome or response.
  • Independent Variable (X): Predictor or explanatory.

The Regression Line

The Mathematical Model

The relationship is represented by a straight line. The population equation is:

\[ Y = \beta_0 + \beta_1 X + \epsilon \]

  • \(Y\): Dependent variable.
  • \(X\): Independent variable.
  • \(\beta_0\): Y-intercept (expected Y when X is 0).
  • \(\beta_1\): Slope (expected change in Y for one-unit increase in X).
  • \(\epsilon\): Error term (random variation in Y that X does not explain).
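
One way to get a feel for the population equation is to simulate from it. The values of `beta_0`, `beta_1`, and `sigma` below are assumptions chosen for illustration, as is the normal distribution used for the error term.

```r
# Simulate from Y = beta_0 + beta_1 * X + epsilon
# (all numeric values are illustrative assumptions)
set.seed(1)
beta_0 <- 50  # assumed intercept
beta_1 <- 4   # assumed slope
sigma  <- 3   # assumed standard deviation of the error term

x       <- seq(2, 12, by = 1)
epsilon <- rnorm(length(x), mean = 0, sd = sigma)
y       <- beta_0 + beta_1 * x + epsilon

# Deterministic part of the model: the expected value of Y at each X
expected_y <- beta_0 + beta_1 * x

# Raising X by one unit shifts the expected Y by beta_1
diff(expected_y)
```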

Estimating Coefficients

Finding the Best Fit

We estimate population parameters (\(\beta_0\), \(\beta_1\)) using sample statistics (\(\hat{\beta}_0\), \(\hat{\beta}_1\)). The estimated regression line is:

\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X \]

Coefficients are estimated using Ordinary Least Squares (OLS), which minimizes the Sum of Squared Residuals:

\[ \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \]
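
For simple linear regression, minimizing this sum has a standard closed-form solution. Here \(\bar{X}\) and \(\bar{Y}\) denote the sample means of X and Y (notation introduced for this note):

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \]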

Example: Study Hours vs. Exam Score

A Practical Scenario

We investigate the linear relationship between hours studied (X) and exam score (Y).

Hypothesis: More study hours are associated with higher exam scores.

R Code for Regression

Let’s see it in action!

Here’s how to perform simple linear regression in R.

# Create some sample data
set.seed(123) # for reproducibility
study_data <- data.frame(
  hours_studied = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
  exam_score = c(55, 60, 65, 70, 75, 80, 85, 90, 92, 95, 98) + rnorm(11, 0, 3)
)

# Fit the linear model
model <- lm(exam_score ~ hours_studied, data = study_data)

# Display the estimated coefficients
coef(model)
##   (Intercept) hours_studied 
##     48.440294      4.390491
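
As a sanity check, the same estimates can be reproduced without `lm()`. For one predictor, the OLS slope equals the sample covariance of X and Y divided by the sample variance of X, and the intercept follows from the two sample means. The variable names `b1_hat` and `b0_hat` are introduced here for illustration.

```r
# Recreate the sample data and the fitted model from above
set.seed(123)
study_data <- data.frame(
  hours_studied = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
  exam_score = c(55, 60, 65, 70, 75, 80, 85, 90, 92, 95, 98) + rnorm(11, 0, 3)
)
model <- lm(exam_score ~ hours_studied, data = study_data)

# Closed-form OLS estimates for a single predictor
x <- study_data$hours_studied
y <- study_data$exam_score
b1_hat <- cov(x, y) / var(x)           # slope
b0_hat <- mean(y) - b1_hat * mean(x)   # intercept

# TRUE if the hand-computed values agree with lm() up to numerical tolerance
all.equal(unname(coef(model)), c(b0_hat, b1_hat))
```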

Visualizing the Relationship (ggplot2)

Scatter Plot with Regression Line

A scatter plot visualizes the relationship. The regression line shows the estimated linear trend.
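
The plotting code is not shown on this slide. A minimal ggplot2 sketch that draws such a figure could look like the following; the axis labels and the choice to hide the confidence band (`se = FALSE`) are assumptions, so the original figure may differ.

```r
library(ggplot2)

# Recreate the sample data from the regression example
set.seed(123)
study_data <- data.frame(
  hours_studied = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
  exam_score = c(55, 60, 65, 70, 75, 80, 85, 90, 92, 95, 98) + rnorm(11, 0, 3)
)

# Points show the observations; geom_smooth(method = "lm") overlays the OLS line
scatter_plot <- ggplot(study_data, aes(x = hours_studied, y = exam_score)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Hours studied", y = "Exam score")

scatter_plot
```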

## `geom_smooth()` using formula = 'y ~ x'

We see a positive linear relationship: the fitted slope of about 4.39 means each additional hour of study is associated with roughly 4.4 more points on the exam.

Checking Assumptions: Residuals (ggplot2)

Are the errors random?

Residuals (\(Y_i - \hat{Y}_i\)) are differences between observed and predicted values. Plotting them against fitted values helps check assumptions (linearity, constant variance). Ideally, residuals scatter randomly around zero.
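
The plotting code is not shown on this slide. One possible way to build a residuals-versus-fitted plot with ggplot2 is sketched below; the data frame `diagnostics` and its column names are introduced here for illustration.

```r
library(ggplot2)

# Recreate the sample data and the fitted model
set.seed(123)
study_data <- data.frame(
  hours_studied = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
  exam_score = c(55, 60, 65, 70, 75, 80, 85, 90, 92, 95, 98) + rnorm(11, 0, 3)
)
model <- lm(exam_score ~ hours_studied, data = study_data)

# Collect fitted values and residuals (observed minus fitted)
diagnostics <- data.frame(
  fitted_value = fitted(model),
  residual     = resid(model)
)

# The dashed line at zero is the reference the residuals should scatter around
residual_plot <- ggplot(diagnostics, aes(x = fitted_value, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals")

residual_plot
```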

Random scatter around zero suggests our model assumptions are reasonable.

Interactive Plot (plotly)

Explore the Data!

An interactive plot allows zooming, panning, and hovering for exact data values.
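
The code behind the interactive figure is not shown on this slide. A rough plotly sketch is given below, assuming a scatter trace for the observations and a line trace for the fitted values; the column `fitted_score` and the trace names are introduced here for illustration.

```r
library(plotly)

# Recreate the sample data and the fitted model
set.seed(123)
study_data <- data.frame(
  hours_studied = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
  exam_score = c(55, 60, 65, 70, 75, 80, 85, 90, 92, 95, 98) + rnorm(11, 0, 3)
)
model <- lm(exam_score ~ hours_studied, data = study_data)

# Store the fitted values so the line trace can refer to them by name
study_data$fitted_score <- fitted(model)

# Markers for the observations, a line for the fitted regression
interactive_plot <- plot_ly(
  study_data,
  x = ~hours_studied,
  y = ~exam_score,
  type = "scatter",
  mode = "markers",
  name = "Observed"
) %>%
  add_lines(y = ~fitted_score, name = "Fitted line")

interactive_plot
```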

Note: For simple linear regression, a 3D plot isn’t directly applicable as we only have two main variables (X and Y). A 3D plot would typically involve a third independent variable, moving into the realm of multiple regression.

Conclusion

Key Takeaways

  • Simple Linear Regression is a fundamental tool for understanding and modeling linear relationships between two continuous variables.
  • The model is defined by an intercept (\(\beta_0\)) and a slope (\(\beta_1\)), which are estimated from data.
  • Visualizations (like scatter plots and residual plots) are crucial for interpreting the model and checking its assumptions.
  • It provides a way to predict the dependent variable based on the independent variable.
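
Prediction with the fitted model can be sketched with `predict()`. The data frame `new_students` and its values are hypothetical, and they are kept inside the observed range of 2 to 12 hours to avoid extrapolating.

```r
# Recreate the sample data and the fitted model
set.seed(123)
study_data <- data.frame(
  hours_studied = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
  exam_score = c(55, 60, 65, 70, 75, 80, 85, 90, 92, 95, 98) + rnorm(11, 0, 3)
)
model <- lm(exam_score ~ hours_studied, data = study_data)

# Hypothetical new observations; the column name must match the predictor
new_students <- data.frame(hours_studied = c(5.5, 8))
predicted_scores <- predict(model, newdata = new_students)

# The same predictions computed directly from the estimated line
by_hand <- coef(model)[["(Intercept)"]] +
  coef(model)[["hours_studied"]] * new_students$hours_studied

predicted_scores
```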

This basic understanding forms the foundation for more advanced statistical modeling techniques!