2024-10-17

Welcome to the Simple Linear Regression Presentation

In this presentation, we will cover simple linear regression: defining the model, fitting it, and visualizing it on simulated sample data and on the mtcars dataset.

Introduction

  • Simple linear regression is a linear regression model that estimates the relationship between one independent variable and one dependent variable. It is a statistical method for modeling the relationship between two continuous variables, and it assumes a linear relationship between an independent variable \(x\) and a dependent variable \(y\).

The Linear Regression Model

  • The model can be expressed using the formula:

\[ y = \beta_0 + \beta_1 x + \varepsilon \]

Where:

  • \(y\) = Dependent variable
  • \(x\) = Independent variable
  • \(\beta_0\) = Intercept
  • \(\beta_1\) = Slope
  • \(\varepsilon\) = Error term

Estimating Parameters

  • Parameters \(\beta_0\) and \(\beta_1\) are estimated by minimizing the sum of squared residuals (ordinary least squares):

\[ \min_{\beta_0, \beta_1} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 \]
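
For simple linear regression, this minimization has a closed-form solution in terms of the sample means \(\bar{x}\) and \(\bar{y}\):

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]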

Example Dataset

library(ggplot2)
library(plotly)
library(tidyr)
library(dplyr)
# Simulate sample data: true intercept 2, true slope 3, plus Gaussian noise (sd = 30)
set.seed(123)
x <- 1:100
y <- 2 + 3 * x + rnorm(100, mean = 0, sd = 30)
data <- data.frame(x, y)

Scatter plot of the sample data:
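
The plotting chunk is not shown in the source; a minimal ggplot2 sketch that would produce such a scatter plot of the simulated data (colors and labels are assumptions) is:

# Scatter plot of the simulated data (sketch; the original chunk is hidden)
ggplot(data, aes(x = x, y = y)) +
  geom_point(color = "blue") +
  labs(title = "Simulated Sample Data", x = "x", y = "y") +
  theme_minimal()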

Fitting the Linear Model
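
The summary below comes from fitting the model with lm(); a call of this form reproduces it (the object name model is an assumption, since the original chunk is hidden):

# Fit y ~ x by ordinary least squares and print the summary
model <- lm(y ~ x, data = data)
summary(model)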

## 
## Call:
## lm(formula = y ~ x, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -73.607 -16.571  -1.039  19.455  62.846 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.90788    5.52862   0.164     0.87    
## x            3.07533    0.09505  32.356   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.44 on 98 degrees of freedom
## Multiple R-squared:  0.9144, Adjusted R-squared:  0.9135 
## F-statistic:  1047 on 1 and 98 DF,  p-value: < 2.2e-16

What do the numbers and variables say?

The intercept (about 0.91) is the predicted value of \(y\) when \(x = 0\). The slope (about 3.08) shows how much \(y\) changes with a one-unit increase in \(x\), which is close to the true slope of 3 used to simulate the data. The R-squared of 0.91 shows how well the fitted line explains the variation in the data.
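
To work with these quantities programmatically, they can be extracted from the fitted object (reusing the model name assumed above):

coef(model)                # intercept and slope estimates
confint(model)             # 95% confidence intervals for both parameters
summary(model)$r.squared   # proportion of the variance in y explained by x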

Using the mtcars Dataset
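
The preview below can be reproduced with head() (the original chunk is hidden):

# Show the first six rows of the built-in mtcars data
head(mtcars)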

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

MPG vs. Car Weight

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "blue") +
  labs(
    title = "MPG vs. Car Weight",
    x = "Weight (1000 lbs)",
    y = "Miles Per Gallon (MPG)"
  ) +
  theme_minimal()

Fitting a Linear Regression Model to mtcars

mtcars_model <- lm(mpg ~ wt, data = mtcars)
summary(mtcars_model)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Linear regression line for mtcars
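
The plotting chunk is hidden; a minimal sketch that overlays the fitted line with geom_smooth(), which is what prints the message below, could look like this (point and line colors are assumptions):

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +  # fitted regression line with confidence band
  labs(title = "Linear Regression Line for mtcars",
       x = "Weight (1000 lbs)", y = "Miles Per Gallon (MPG)") +
  theme_minimal()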

## `geom_smooth()` using formula = 'y ~ x'

Actual vs. Predicted MPG
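
The original figure is not reproduced here; one way to build it is to plot the model's fitted values against the observed MPG (the data frame and column names below are illustrative):

# Compare observed MPG with the values predicted by mtcars_model
pred_df <- data.frame(actual = mtcars$mpg, predicted = predict(mtcars_model))
ggplot(pred_df, aes(x = actual, y = predicted)) +
  geom_point(color = "blue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +  # perfect-prediction reference line
  labs(title = "Actual vs. Predicted MPG", x = "Actual MPG", y = "Predicted MPG") +
  theme_minimal()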

Plotly Plot 2D
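
The plotly chunk itself is hidden. The message below is what plotly prints when a marker list is supplied to a trace whose mode does not include markers; one call pattern that draws the data with the fitted line and triggers that message is sketched here (trace styling is an assumption):

# 2D plotly scatter of weight vs. MPG with the fitted line overlaid (sketch)
ord <- order(mtcars$wt)
plot_ly(mtcars, x = ~wt, y = ~mpg, type = "scatter", mode = "lines",
        marker = list(color = "blue", size = 6), name = "Cars") %>%  # marker spec without markers in the mode
  add_lines(x = mtcars$wt[ord], y = fitted(mtcars_model)[ord], name = "Fitted line")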

## A marker object has been specified, but markers is not in the mode
## Adding markers to the mode...

Plotly Plot 3D
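
The 3D figure is also not reproduced; a hedged sketch of a 3D scatter is shown below, where using horsepower (hp) as the third axis is an assumption:

# 3D plotly scatter: weight and horsepower vs. MPG (sketch)
plot_ly(mtcars, x = ~wt, y = ~hp, z = ~mpg,
        type = "scatter3d", mode = "markers",
        marker = list(size = 4)) %>%
  layout(scene = list(xaxis = list(title = "Weight (1000 lbs)"),
                      yaxis = list(title = "Horsepower"),
                      zaxis = list(title = "MPG")))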

Final plot

library(ggplot2)
small_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "blue", size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = "darkgreen") +
  labs(
    title = "MPG vs. Weight: Summary Plot",
    x = "Weight (1000 lbs)",
    y = "Miles Per Gallon (MPG)"
  ) +
  theme_minimal(base_size = 5)
small_plot
## `geom_smooth()` using formula = 'y ~ x'

Conclusion

  • Simple Linear Regression allows us to model the relationship between two continuous variables.
  • In the mtcars dataset, weight has a significant negative effect on MPG: each additional 1,000 lbs of weight is associated with roughly 5.3 fewer miles per gallon.
    • Heavier cars tend to have lower fuel efficiency.
  • Regression tools help us make predictions and uncover patterns in data, as in the prediction sketch below.
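
As a closing example, predict() can turn the fitted mtcars model into concrete predictions; the weights below are illustrative:

# Predicted MPG for hypothetical cars weighing 2,500 and 3,500 lbs (wt is in 1000-lb units)
new_cars <- data.frame(wt = c(2.5, 3.5))
predict(mtcars_model, newdata = new_cars)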