Introduction to Linear Regression

A Practical Guide to Predictive Modeling

Statistical Methods for Data Analysis

What is Linear Regression Anyway?

Linear regression helps us answer questions like:

  • “If I study one more hour, how much will my grade improve?”
  • “How does temperature affect ice cream sales?”
  • “What’s the relationship between car weight and fuel efficiency?”

Basic idea: Find a straight line that best fits scattered data points

Real-world use: Predicting outcomes based on patterns in data

The Math Behind the Magic

A simple linear model looks like this:

\[Y_i = \alpha + \beta X_i + \varepsilon_i\]

Breaking it down:

  • \(Y_i\) = What we’re trying to predict (dependent variable)
  • \(X_i\) = What we’re using to make predictions (independent variable)
  • \(\alpha\) = Where the line crosses the Y-axis (intercept)
  • \(\beta\) = How steep the line is (slope)
  • \(\varepsilon_i\) = The part we can’t explain (error)

To find the best line, we minimize:

\[\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2\]

where \(\hat{Y}_i = \hat{\alpha} + \hat{\beta}X_i\) is the model's prediction for observation \(i\).
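
For the single-predictor case, this minimization has a well-known closed-form solution. As a quick reference (standard least-squares algebra, not specific to any dataset):

\[\hat{\beta} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad \hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}\]

where \(\bar{X}\) and \(\bar{Y}\) are the sample means. Software such as R computes these estimates for us.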

Let’s Look at Some Real Data

We’ll examine how a car’s horsepower relates to its miles per gallon (MPG).

(Figure slides: Understanding Residuals, Diagnostic Checks, Multiple Regression in 3D)

R Code Example: Building Your First Model

# Load and explore the data
data(mtcars)
head(mtcars)

# Create a scatter plot
plot(mtcars$hp, mtcars$mpg, 
     main = "Quick Scatter Plot",
     xlab = "Horsepower", 
     ylab = "MPG",
     pch = 19, 
     col = "blue")

# Build the linear model
my_model <- lm(mpg ~ hp, data = mtcars)

# View the results
summary(my_model)
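
A quick visual check of the fit is to overlay the estimated line on the scatter plot. The sketch below reuses the my_model object and the plotting settings from above:

# Redraw the scatter plot and overlay the fitted regression line
plot(mtcars$hp, mtcars$mpg,
     main = "MPG vs Horsepower with Fitted Line",
     xlab = "Horsepower",
     ylab = "MPG",
     pch = 19,
     col = "blue")
abline(my_model, col = "red", lwd = 2)

# Extract just the estimated intercept and slope
coef(my_model)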

Making Predictions

# Display model equation
cat("Model equation: MPG =", 
    round(coef(my_model)[1], 2), "+",
    round(coef(my_model)[2], 2), "* HP\n")

# Make predictions for new cars
new_cars <- data.frame(hp = c(100, 150, 200))
predictions <- predict(my_model, newdata = new_cars)
print(predictions)

# Confidence intervals for the mean response
conf_intervals <- predict(my_model, 
                         newdata = new_cars,
                         interval = "confidence")
print(conf_intervals)
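
Note that interval = "confidence" describes the uncertainty in the average MPG at each horsepower value. If the goal is the likely MPG range for an individual car, predict() also accepts interval = "prediction", which gives wider intervals; a minimal sketch using the same new_cars data frame:

# Prediction intervals for individual cars (wider than confidence intervals)
pred_intervals <- predict(my_model,
                          newdata = new_cars,
                          interval = "prediction")
print(pred_intervals)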

Model Diagnostics

# Create diagnostic plots
par(mfrow = c(2, 2))
plot(my_model)
par(mfrow = c(1, 1))

What to look for:

  • Residuals vs Fitted: Residuals should scatter randomly around zero with no visible pattern
  • Q-Q Plot: Points should follow the diagonal line, indicating roughly normal residuals
  • Scale-Location: Points should spread evenly across fitted values (constant variance)
  • Residuals vs Leverage: Flags points with outsized influence on the fitted line
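
The plots can be backed up with a couple of numeric checks using base R. In this sketch, the 4/n cutoff for Cook's distance is only a common rule of thumb, not a strict threshold:

# Normality of residuals (complements the Q-Q plot)
res <- residuals(my_model)
shapiro.test(res)

# Flag potentially influential observations via Cook's distance
cooks_d <- cooks.distance(my_model)
which(cooks_d > 4 / nrow(mtcars))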

Key Takeaways

What we learned:

  1. Linear regression finds the best-fitting line through data points
  2. The model equation is: \(Y = \alpha + \beta X + \varepsilon\)
  3. Residuals help us check if our model fits well
  4. Multiple regression extends the same model to several predictors (see the sketch after this list)
  5. Always check diagnostic plots before trusting your model
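
As a taste of point 4, the same lm() call handles several predictors at once; a minimal sketch adding car weight (wt) alongside horsepower:

# Multiple regression: MPG explained by horsepower and weight
multi_model <- lm(mpg ~ hp + wt, data = mtcars)
summary(multi_model)

# Compare the one-predictor and two-predictor models
AIC(my_model, multi_model)

Each coefficient now describes the effect of one predictor while the others are held fixed.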

Next steps:

  • Practice with different datasets
  • Learn about multiple regression in depth
  • Explore polynomial and non-linear models
  • Study model validation techniques

Questions?

Thank you for your attention!

Resources: R Documentation, “An Introduction to Statistical Learning”