2024-10-20

Introduction to Simple Linear Regression

Simple linear regression models the relationship between two variables by fitting a linear equation to observed data:

\[ y = \beta_0 + \beta_1 x + \epsilon \]

Where:

  • \(y\): the dependent variable, the quantity we want to predict
  • \(\beta_0\): the intercept, the value of \(y\) when \(x\) is 0
  • \(\beta_1\): the slope of the regression line, the change in \(y\) for a one-unit change in \(x\)
  • \(x\): the independent variable, the predictor or input variable
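  • \(\epsilon\): the error term, the part of \(y\) that the linear relationship does not explain; in practice it is estimated by the residuals, the differences between the actual and predicted values of \(y\)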
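
To make the roles of the terms concrete, here is a minimal Python sketch that generates data from the model equation. The intercept (30.0), slope (-0.07), noise level (sigma = 2.0), and x values are illustrative assumptions, not estimates from any dataset, and the error term is assumed to be normally distributed.

```python
import random


def simulate(beta0, beta1, xs, sigma, seed=0):
    """Draw y = beta0 + beta1 * x + eps, with eps ~ Normal(0, sigma)."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]


# Hypothetical inputs chosen only for illustration.
xs = [50, 100, 150, 200]
line = [30.0 - 0.07 * x for x in xs]        # deterministic part: beta0 + beta1 * x
ys = simulate(30.0, -0.07, xs, sigma=2.0)   # same line plus random error

for x, on_line, y in zip(xs, line, ys):
    print(f"x={x:3d}  line={on_line:6.2f}  y={y:6.2f}  eps={y - on_line:+.2f}")
```

Each simulated \(y\) sits off the line by its own draw of \(\epsilon\); setting sigma to 0 puts every point exactly on the line.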

Introduction to mtcars Dataset

We will predict mpg (miles per gallon) from hp (horsepower) using the mtcars dataset. Here are the first rows of the dataset:

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Scatter plot with regression line

[Figure: scatter plot of mpg against hp with the fitted regression line]

The Least Squares Method

In linear regression, we estimate the coefficients \(\beta_0\) and \(\beta_1\) using the least squares method, which minimizes the sum of squared errors:

\[ SSE = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \]

Where:

  • \(y_i\) is the observed value of the dependent variable
  • \(\hat{y}_i\) is the predicted value
  • \(n\) is the number of observations
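
For a single predictor, minimizing the SSE has a well-known closed-form solution:

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]

The following Python sketch applies these formulas to the six rows shown in the dataset snippet. The function names are hypothetical. Because it uses only those six rows, its estimates will differ from a fit on the full dataset.

```python
# (hp, mpg) pairs for the six mtcars rows shown in the dataset snippet.
rows = [(110, 21.0), (110, 21.0), (93, 22.8), (110, 21.4), (175, 18.7), (105, 18.1)]


def sse(b0, b1, data):
    """Sum of squared errors of the line y = b0 + b1 * x on the data."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in data)


def least_squares(data):
    """Closed-form least squares estimates (b0, b1) for one predictor."""
    n = len(data)
    x_bar = sum(x for x, _ in data) / n
    y_bar = sum(y for _, y in data) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in data)
    sxx = sum((x - x_bar) ** 2 for x, _ in data)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares(rows)
best = sse(b0, b1, rows)
print(f"intercept={b0:.4f}  slope={b1:.4f}  SSE={best:.4f}")

# Nudging either coefficient away from the estimate can only increase the SSE.
print(sse(b0 + 0.5, b1, rows) > best, sse(b0, b1 + 0.01, rows) > best)
```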

Fitting the Linear Model

Coefficients   Estimate   Std. Error   t value   p value
(Intercept)     30.0989       1.6339   18.4212         0
hp              -0.0682       0.0101   -6.7424         0

Where:

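  • Estimate: the fitted coefficient. For hp, it is the expected change in mpg for a one-unit increase in horsepower; for the intercept, it is the predicted mpg when hp is 0
  • Std. Error: the estimated standard deviation of the coefficient estimate; smaller values imply more precision
  • t value: how many standard errors the estimate is from zero (Estimate divided by Std. Error)
  • p value: the probability of a t value at least this extreme if the true coefficient were zero; small values indicate statistical significance. The values shown as 0 are rounded and are very small, not exactly zero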
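
As a quick check on how the columns relate, this small Python sketch recomputes each t value as the estimate divided by its standard error, using the rounded numbers from the table. Because the inputs are rounded to four decimals, the results only approximately match the reported t values.

```python
# (estimate, standard error) pairs copied from the coefficient table.
coefs = {
    "(Intercept)": (30.0989, 1.6339),
    "hp": (-0.0682, 0.0101),
}

# t value = estimate / standard error
t_values = {name: est / se for name, (est, se) in coefs.items()}

for name, t in t_values.items():
    print(f"{name:12s} t = {t:.3f}")
```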

Residuals vs Fitted Plot

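This plot displays the residuals (the differences between observed and predicted values) against the fitted values from the regression model. A random scatter around zero suggests the linear form is adequate; a visible pattern suggests it is not.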
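
The quantities on the two axes can be computed by hand. The Python sketch below uses the rounded coefficients from the fitted model table (intercept 30.0989, slope -0.0682) and the six rows from the dataset snippet. It is an illustration of the definition, not the code used to draw the plot.

```python
# Rounded coefficients from the fitted model table.
B0, B1 = 30.0989, -0.0682

# (name, hp, mpg) for the six rows shown in the dataset snippet.
cars = [
    ("Mazda RX4", 110, 21.0),
    ("Mazda RX4 Wag", 110, 21.0),
    ("Datsun 710", 93, 22.8),
    ("Hornet 4 Drive", 110, 21.4),
    ("Hornet Sportabout", 175, 18.7),
    ("Valiant", 105, 18.1),
]


def fitted(hp):
    """Predicted mpg for a given horsepower."""
    return B0 + B1 * hp


# residual = observed mpg - fitted mpg
residuals = [(name, fitted(hp), mpg - fitted(hp)) for name, hp, mpg in cars]

for name, fit, res in residuals:
    print(f"{name:18s} fitted={fit:7.4f}  residual={res:+.4f}")
```

A positive residual means the car gets better mileage than the line predicts for its horsepower; a negative one means worse.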

Plot of the Regression

This 3D graph plots horsepower, miles per gallon, and the fitted values from the linear regression model. The light blue plane shows the model’s fit, and the black dots show the observed data.

Conclusion

This presentation explored the basics of simple linear regression using the mtcars dataset. We modeled the relationship between horsepower and miles per gallon and visualized the results with a scatter plot, a residual plot, and a 3D regression surface. The fitted line indicates that higher horsepower is associated with lower fuel efficiency (about 0.068 fewer mpg for each additional unit of horsepower), though the fit is not perfect. The analysis shows how regression can describe the relationship between variables even when the data are noisy.