October 18, 2024

A Comprehensive Overview with R

Content summary

  • Introduction to simple linear regression
  • Mathematical formulas for regression equations
  • Example data set
  • Scatterplots and regression lines
  • Plotly interactive charts
  • Interpretation of regression results
  • Residual plots
  • Mathematical derivation of regression parameters
  • Summary and References

Introduction to Simple Linear Regression

Simple linear regression is a statistical method that allows us to model the relationship between two variables by fitting a linear equation to observed data.

  • Independent variable (x): The variable used to predict another variable (also called the predictor or explanatory variable).
  • Dependent variable (y): The variable being predicted (also called the response variable).

Real-world applications:

  • Predicting fuel efficiency based on horsepower.
  • Estimating house prices based on square footage.

The goal is to find a linear equation that best predicts \(y\) based on \(x\).

Mathematical Formula

The simple linear regression equation is:

\[ y = \beta_0 + \beta_1x + \epsilon \]

Where:

  • \(y\) is the dependent variable (response variable).
  • \(x\) is the independent variable (predictor variable).
  • \(\beta_0\) is the intercept (the expected value of \(y\) when \(x = 0\)).
  • \(\beta_1\) is the slope of the line (the expected change in \(y\) for a one-unit increase in \(x\)).
  • \(\epsilon\) is the error term: the deviation of an observed \(y\) from the line \(\beta_0 + \beta_1x\), capturing variation that \(x\) does not explain.

Our goal is to estimate \(\beta_0\) and \(\beta_1\) using data.
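To make the roles of the parameters concrete, here is a minimal sketch that simulates data from the equation above and recovers the parameters with lm(). The true values (2 and 0.5), the sample size, the range of x, and the noise level are arbitrary assumptions chosen for illustration; they are not part of the mtcars analysis.

```r
# Simulate data from y = beta_0 + beta_1 * x + epsilon.
# All numeric settings below are illustrative assumptions.
set.seed(42)
n      <- 500
beta_0 <- 2     # assumed true intercept
beta_1 <- 0.5   # assumed true slope

x       <- runif(n, min = 0, max = 10)   # predictor values
epsilon <- rnorm(n, mean = 0, sd = 1)    # random error term
y       <- beta_0 + beta_1 * x + epsilon

# Estimate the intercept and slope from the simulated data
sim_fit <- lm(y ~ x)
sim_est <- coef(sim_fit)
print(sim_est)
```

The estimates should land close to the assumed true values, and they tighten further as n grows.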

Example Dataset: mtcars

For this simple linear regression analysis, we use the built-in R dataset mtcars.

  • Dependent variable (y): mpg (Miles per Gallon) – The fuel efficiency of a car.
  • Independent variable (x): hp (Horsepower) – The power output of the car.

This dataset contains observations from 32 car models, with variables such as:

  • Miles per Gallon (mpg)
  • Horsepower (hp)
  • Number of cylinders (cyl)
  • Weight of the car (wt)

Our goal is to use horsepower to predict miles per gallon using a simple linear regression model.
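Before fitting anything, it can help to look at the two columns involved. The sketch below uses only base R; the object names (cars, n_cars, hp_mpg_cor) are illustrative.

```r
# mtcars ships with base R (datasets package), so nothing needs to be installed
cars <- mtcars[, c("mpg", "hp")]

n_cars <- nrow(cars)      # number of car models
print(n_cars)
print(head(cars))         # first few rows
print(summary(cars))      # ranges and quartiles of mpg and hp

# Pearson correlation between horsepower and fuel efficiency
hp_mpg_cor <- cor(cars$hp, cars$mpg)
print(hp_mpg_cor)
```

A negative correlation here is a first hint that the fitted slope will be negative.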

Scatter Plot with Regression Line

library(ggplot2)

# Scatterplot of mpg against hp with a fitted least-squares line
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(color = "blue") +                                  # observed cars
  geom_smooth(method = "lm", formula = y ~ x, color = "red") +  # fitted line with 95% confidence band
  labs(title = "Simple Linear Regression: MPG vs Horsepower",
       x = "Horsepower (hp)",
       y = "Miles per Gallon (mpg)")

Plotly Interactive Regression Plot
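The interactive chart is not reproduced in this text. One way such a plot could be built is sketched below, assuming the plotly package is installed. The object names (fit, plot_data, fig) and the hover text are illustrative choices, and this is not necessarily the code behind the original slide.

```r
library(plotly)

# Fit the model and collect observed and fitted values in one data frame
fit <- lm(mpg ~ hp, data = mtcars)
plot_data <- data.frame(
  model  = rownames(mtcars),
  hp     = mtcars$hp,
  mpg    = mtcars$mpg,
  fitted = unname(fitted(fit))
)

# Observed points plus the fitted regression line; hovering shows the car model
fig <- plot_ly(plot_data, x = ~hp) %>%
  add_markers(y = ~mpg, text = ~model, name = "Observed") %>%
  add_lines(y = ~fitted, name = "Fitted line") %>%
  layout(
    title = "Simple Linear Regression: MPG vs Horsepower",
    xaxis = list(title = "Horsepower (hp)"),
    yaxis = list(title = "Miles per Gallon (mpg)")
  )

fig
```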

Regression Results Interpretation

The regression model helps us understand the relationship between horsepower (hp) and miles per gallon (mpg).

Key Insights:

  • Intercept (\(\beta_0\)): The expected value of mpg when horsepower is 0. This is the point where the regression line crosses the y-axis.
  • Slope (\(\beta_1\)): The change in mpg for every 1-unit increase in horsepower. A negative slope indicates that as horsepower increases, mpg decreases.
  • \(R^2\) (Coefficient of Determination):
    • \(R^2\) is a measure of how well the regression line fits the data.
    • A higher \(R^2\) value (closer to 1) indicates that a large proportion of the variance in mpg is explained by horsepower.

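From our regression model:

  • Intercept: about 30.1 mpg. No car has 0 horsepower, so this value is an extrapolation that anchors the line rather than a realistic prediction.
  • Slope: about −0.068, so each additional unit of horsepower is associated with a decrease of roughly 0.068 mpg (about 6.8 mpg per 100 hp).
  • \(R^2\): about 0.60, so horsepower explains roughly 60% of the variance in mpg.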
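The quantities discussed in this section can be read directly from a fitted lm object. The sketch below fits the model and also recomputes \(R^2\) by hand as one minus the ratio of residual to total sum of squares; the variable names are illustrative.

```r
# Fit mpg as a linear function of hp
fit <- lm(mpg ~ hp, data = mtcars)
fit_summary <- summary(fit)

# Extract the estimates discussed above
intercept <- coef(fit)[["(Intercept)"]]
slope     <- coef(fit)[["hp"]]
r_squared <- fit_summary$r.squared

# R^2 by hand: 1 - (residual sum of squares / total sum of squares)
sse <- sum(residuals(fit)^2)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r_squared_manual <- 1 - sse / sst

print(c(intercept = intercept, slope = slope, r_squared = r_squared))
```

Calling print(fit_summary) additionally shows standard errors, t-statistics, and p-values for both coefficients.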

Residual Plot

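A residual plot helps evaluate the fit of the regression model by plotting the residuals (observed minus predicted values), typically against the fitted values. Residuals scattered randomly around zero with no visible pattern suggest that a straight line is an adequate description; a curve or funnel shape suggests it is not.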
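The plot is not reproduced in this text. A minimal sketch of a residuals-versus-fitted plot for the mpg ~ hp model, using ggplot2 and illustrative object names, might look like this:

```r
library(ggplot2)

# Fit the model and collect fitted values and residuals
fit <- lm(mpg ~ hp, data = mtcars)
resid_data <- data.frame(
  fitted   = fitted(fit),
  residual = residuals(fit)   # observed mpg minus fitted mpg
)

# Residuals against fitted values, with a reference line at zero
resid_plot <- ggplot(resid_data, aes(x = fitted, y = residual)) +
  geom_point(color = "blue") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs Fitted Values",
       x = "Fitted mpg",
       y = "Residual (observed - fitted)")

print(resid_plot)
```

Base R offers a similar diagnostic through plot(fit, which = 1).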

Mathematical Interpretation

The formulas for calculating the slope (\(\beta_1\)) and intercept (\(\beta_0\)) in simple linear regression are:

\[ \beta_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \]

\[ \beta_0 = \bar{y} - \beta_1 \bar{x} \]

Where:

  • \(x_i\) and \(y_i\) are the individual data points.
  • \(\bar{x}\) and \(\bar{y}\) are the means of the independent and dependent variables, respectively.
  • \(\beta_1\) represents the slope of the regression line, indicating the change in \(y\) for a one-unit change in \(x\).
  • \(\beta_0\) represents the intercept, which is the expected value of \(y\) when \(x = 0\).
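As a brief sketch of where these expressions come from, assume the standard least-squares criterion, which chooses \(\beta_0\) and \(\beta_1\) to minimize the sum of squared residuals (the symbol \(S\) is introduced here for illustration):

\[ S(\beta_0, \beta_1) = \sum (y_i - \beta_0 - \beta_1 x_i)^2 \]

Setting the partial derivative with respect to \(\beta_0\) to zero gives the intercept formula:

\[ \frac{\partial S}{\partial \beta_0} = -2 \sum (y_i - \beta_0 - \beta_1 x_i) = 0 \quad \Rightarrow \quad \beta_0 = \bar{y} - \beta_1 \bar{x} \]

Setting the partial derivative with respect to \(\beta_1\) to zero and substituting the expression for \(\beta_0\) gives:

\[ \frac{\partial S}{\partial \beta_1} = -2 \sum x_i (y_i - \beta_0 - \beta_1 x_i) = 0 \quad \Rightarrow \quad \sum x_i \left[ (y_i - \bar{y}) - \beta_1 (x_i - \bar{x}) \right] = 0 \]

Because \(\sum (y_i - \bar{y}) = 0\) and \(\sum (x_i - \bar{x}) = 0\), replacing \(x_i\) with \(x_i - \bar{x}\) in the leading factor does not change the sum, so:

\[ \sum (x_i - \bar{x})(y_i - \bar{y}) = \beta_1 \sum (x_i - \bar{x})^2 \]

Dividing both sides by \(\sum (x_i - \bar{x})^2\) yields the slope formula.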

These formulas help us understand how the slope and intercept are derived from the data.
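As a check, the two formulas can be evaluated directly on the mtcars data and compared with the output of lm(). The names beta_1_manual, beta_0_manual, and lm_coefs are illustrative.

```r
x <- mtcars$hp    # independent variable
y <- mtcars$mpg   # dependent variable

# Slope: sum of cross-deviations divided by sum of squared deviations of x
beta_1_manual <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: forces the line through the point (mean(x), mean(y))
beta_0_manual <- mean(y) - beta_1_manual * mean(x)

# Compare with the estimates from lm()
lm_coefs <- coef(lm(mpg ~ hp, data = mtcars))
print(c(beta_0 = beta_0_manual, beta_1 = beta_1_manual))
print(lm_coefs)
```

The two sets of numbers should agree up to floating-point rounding.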

Conclusion

In this presentation, we explored the basics of simple linear regression, a powerful tool for understanding relationships between variables.

Key Takeaways:

  • Simple linear regression models the relationship between an independent variable and a dependent variable by fitting a straight line.
  • The slope (\(\beta_1\)) tells us how much the dependent variable changes for a one-unit increase in the independent variable.
  • The intercept (\(\beta_0\)) is the expected value of the dependent variable when the independent variable is 0.
  • The \(R^2\) value shows how well the model fits the data.

Simple linear regression is widely used in various fields, from predicting fuel efficiency to estimating real estate prices. Understanding the basics of this method can help in building more complex models and interpreting real-world data.
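As a closing illustration of prediction, the fitted model can be applied to new horsepower values with predict(). The three values below (100, 150, 200) are arbitrary examples that fall within the range of hp observed in mtcars.

```r
fit <- lm(mpg ~ hp, data = mtcars)

# Hypothetical cars for which we want a predicted mpg
new_cars <- data.frame(hp = c(100, 150, 200))

# Point predictions with 95% prediction intervals (columns: fit, lwr, upr)
pred <- predict(fit, newdata = new_cars, interval = "prediction")
print(cbind(new_cars, pred))
```

Predictions for horsepower far outside the observed range would be extrapolations and should be treated with caution.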

References

  • R Documentation: mtcars dataset
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. Springer.
  • Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer.
  • Plotly: Plotly for R
  • R Documentation: lm function