2023-03-12

Introduction

  • Simple linear regression is a statistical technique used to model the relationship between a dependent variable and an independent variable.
  • In this example, we will use the mtcars dataset to build a simple linear regression model to predict the miles per gallon (mpg) of a car based on its horsepower.
  • This means that horsepower is the independent variable X and miles per gallon is the dependent variable Y.

Example Dataset

  • The mtcars data set contains information about different car models, including the number of cylinders, horsepower, miles per gallon, and other variables.
  • The data set has 32 observations and 11 variables, but we will focus on only two variables, hp and mpg.
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
  • We can visualize the relationship between hp and mpg with a scatter plot, using the following ggplot2 code.
library(ggplot2)
ggplot(mtcars, aes(x=hp, y=mpg)) +
  geom_point() +
  xlab("Horsepower") +
  ylab("Miles per Gallon") +
  ggtitle("Scatter Plot of Horsepower and Miles per Gallon")
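
  • To check the dimensions quoted above and look at only the two variables of interest, a minimal sketch like the following can be used. The object name `cars_sub` is an assumption introduced for illustration.

```r
# Confirm the size of the built-in mtcars data set
dims <- dim(mtcars)
print(dims)

# Keep only the two variables used in this example
# (`cars_sub` is an illustrative name, not from the original slides)
cars_sub <- mtcars[, c("hp", "mpg")]
print(head(cars_sub))
```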

Scatter Plot Results

  • The points alone suggest a negative relationship between horsepower and miles per gallon: cars with more horsepower tend to get fewer miles per gallon. We can quantify this relationship with a regression line.

Fitting the Regression Line

  • The linear regression model can be expressed as: \[ Y = \beta_0 + \beta_1 X + \epsilon \]
  • where Y is the dependent variable, X is the independent variable, \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\epsilon\) is the error term.
  • We can use the lm() function in R to fit the regression line.
                  Estimate
    (Intercept) 30.0988605
    hp          -0.0682283
  • So here we have \(\beta_0 = 30.0988605\) and \(\beta_1 = -0.0682283\); that is, each additional unit of horsepower is associated with about 0.068 fewer miles per gallon.
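
  • The fit can be reproduced with a minimal sketch like the following. The object names `fit`, `b0`, and `b1` are assumptions, since the original fitting code is not shown. The second half cross-checks lm() against the standard closed-form least-squares estimates for a single predictor.

```r
# Fit mpg as a linear function of hp (`fit` is an illustrative name)
fit <- lm(mpg ~ hp, data = mtcars)
coefs <- coef(fit)
print(coefs)

# Cross-check with the closed-form least-squares estimates
b1 <- cov(mtcars$hp, mtcars$mpg) / var(mtcars$hp)  # slope
b0 <- mean(mtcars$mpg) - b1 * mean(mtcars$hp)      # intercept
print(c(b0 = b0, b1 = b1))
```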

Scatter Plot with Regression Line

  • The regression line supports our earlier conclusion that there is a negative relationship between a car's horsepower and the miles per gallon it gets.
## `geom_smooth()` using formula = 'y ~ x'
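
  • A plot like this can be produced with a sketch along the following lines. The original plotting code is not shown, so the choice of `se = FALSE` (no confidence band) and the title are assumptions. Because no formula is passed to geom_smooth(), ggplot2 reports the default formula 'y ~ x', as in the message above.

```r
library(ggplot2)

# Scatter plot of hp vs mpg with a least-squares line overlaid
p <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # se = FALSE is an assumption
  xlab("Horsepower") +
  ylab("Miles per Gallon") +
  ggtitle("Horsepower and Miles per Gallon with Regression Line")
print(p)
```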

Model Summary

  • The summary() function provides an overview of the model’s fit, including the coefficient estimates, their standard errors, and R-squared:
## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07
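
  • The output above can be regenerated, and individual values pulled out of it, with a short sketch like this one. The names `fit`, `s`, `r2`, and `slope_p` are illustrative assumptions.

```r
# Refit the model so this block is self-contained
fit <- lm(mpg ~ hp, data = mtcars)

# Full summary, as printed above
s <- summary(fit)
print(s)

# Extract selected quantities from the summary object
r2 <- s$r.squared                        # Multiple R-squared
slope_p <- coef(s)["hp", "Pr(>|t|)"]     # p-value for the hp coefficient
print(c(r_squared = r2, slope_p_value = slope_p))
```

  • Read against the printed summary, an R-squared of 0.6024 means that horsepower accounts for about 60% of the variation in mpg in this data set.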

Residual Plot

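  • The residual plot is a useful tool for checking whether a linear model is appropriate for the data.
  • Each residual is the difference between the actual and predicted value of the dependent variable. If the linear model fits well, the residuals should scatter randomly around zero with no clear pattern.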
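
  • One way to draw such a plot is sketched below, plotting residuals against fitted values. The original slide's plot is not shown, so the axes, labels, and the names `fit`, `res_df`, and `p_res` are assumptions.

```r
library(ggplot2)

# Refit the model so this block is self-contained
fit <- lm(mpg ~ hp, data = mtcars)

# Residual = actual mpg minus predicted mpg
res_df <- data.frame(fitted = fitted(fit), residual = resid(fit))

# Residuals vs fitted values, with a reference line at zero
p_res <- ggplot(res_df, aes(x = fitted, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted mpg") +
  ylab("Residual") +
  ggtitle("Residuals vs Fitted Values")
print(p_res)
```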

Plotly 3D Scatter Plot

  • We can explore whether there is a further relationship between mpg, hp, and engine displacement (disp) by using a 3D scatter plot.
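
  • A minimal sketch of such a plot with the plotly package follows. The assignment of variables to axes and the axis titles are assumptions, since the original plot is not shown.

```r
library(plotly)

# 3D scatter plot: hp and disp on the horizontal axes, mpg on the vertical
fig <- plot_ly(mtcars, x = ~hp, y = ~disp, z = ~mpg,
               type = "scatter3d", mode = "markers")

# Label the axes (titles are illustrative)
fig <- layout(fig, scene = list(
  xaxis = list(title = "Horsepower"),
  yaxis = list(title = "Displacement"),
  zaxis = list(title = "Miles per Gallon")
))
fig
```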

Conclusion

  • Simple linear regression is a useful technique for modeling the relationship between a dependent variable and an independent variable. It fits the line that minimizes the sum of the squared residuals.
  • The fitted regression line \(\hat{Y} = \beta_0 + \beta_1 X\) can be used to predict the dependent variable from the independent variable, while the error term \(\epsilon\) in \(Y = \beta_0 + \beta_1 X + \epsilon\) captures the difference between the observed and predicted values.
  • This allows for the modeling of simple relationships within our data to help us understand what trends we might be able to dive further into with more advanced techniques.
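
  • As an illustration of prediction with the fitted line, the sketch below estimates mpg for a hypothetical car. The value of 150 horsepower and the names `fit`, `new_car`, `pred`, and `manual` are assumptions chosen for the example.

```r
# Refit the model so this block is self-contained
fit <- lm(mpg ~ hp, data = mtcars)

# Hypothetical car with 150 horsepower (value chosen for illustration)
new_car <- data.frame(hp = 150)
pred <- predict(fit, newdata = new_car)
print(pred)

# The same prediction computed directly from the coefficients
manual <- coef(fit)[["(Intercept)"]] + coef(fit)[["hp"]] * 150
print(manual)
```

  • With the coefficients reported earlier, this works out to roughly 30.10 - 0.0682 × 150, or about 19.9 miles per gallon.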