2023-04-05

SIMPLE LINEAR REGRESSION

What is Simple Linear Regression?

  • It is a technique for building a mathematical model that describes the relationship between two continuous variables, based on the available data.

  • The model can then be used to predict values that are not present in the data set: given the value of one variable, the value of the other can be estimated.

  • The model created through simple linear regression is a straight line of the following form.

\(y = mx + c\)

What is Simple Linear Regression? [ctd…]

  • The following is a sample plot of the equation \(y = 2x + 5\).

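library(ggplot2)  # provides ggplot(), aes(), geom_line(), geom_point()

x <- seq(0, 10, by = 1)  # x = 0, 1, ..., 10
y <- 2 * x + 5           # points on the line y = 2x + 5
df <- data.frame(x, y)
g <- ggplot(data = df, aes(x = x, y = y)) + geom_line() + geom_point()
g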

What is Simple Linear Regression? [ctd…]

  • More specifically, the equation can be written as,

    \(\hat{y} = b_{1}x + b_{0}\)

    where

    \(\hat{y} = \text{predicted value of the second variable}\)

    \(b_{1} = \text{regression coefficient (slope of the line)}\)

    \(x = \text{given value of the first variable}\)

    \(b_{0} = \text{y-intercept}\)

Methodology

  • The regression coefficient (\(b_{1}\)) and the y-intercept (\(b_{0}\)) are chosen so that the sum of the squared errors between the predicted and actual values of the second variable, taken over every data point in the data set, is minimized.

Squared error between the predicted and actual values of the second variable for the \(i^{th}\) data point:

\((\hat{y}_{i}-y_{i})^2\)

Sum of the squared errors over all \(n\) data points in the data set:

\(\sum_{i=1}^{n}(\hat{y}_{i}-y_{i})^2 = \sum_{i=1}^{n}(b_{1}x_{i} + b_{0}-y_{i})^2\)
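
  • To make the sum of squared errors concrete, the sketch below evaluates it for two candidate lines. The data vectors and the helper name `sse` are made up for illustration and are not part of the original example; the line with the smaller value fits the data better.

```r
# Hypothetical toy data, invented for illustration only.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sum of squared errors for the candidate line y_hat = b1 * x + b0.
# `sse` is an illustrative helper, not a built-in R function.
sse <- function(b1, b0, x, y) {
  y_hat <- b1 * x + b0   # predicted values for every data point
  sum((y_hat - y)^2)     # squared errors, summed over all points
}

sse_a <- sse(b1 = 2, b0 = 0, x, y)  # candidate line y_hat = 2x
sse_b <- sse(b1 = 1, b0 = 3, x, y)  # candidate line y_hat = x + 3
c(line_a = sse_a, line_b = sse_b)
```

  • Simple linear regression searches over all possible pairs (\(b_{1}\), \(b_{0}\)) for the one that makes this quantity as small as possible.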

  • The values of \(b_{1}\) and \(b_{0}\) can be found by differentiating the sum of squared errors with respect to \(b_{1}\) and \(b_{0}\), setting each partial derivative to zero, and solving the resulting pair of equations.
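
  • Setting both partial derivatives of the sum of squared errors to zero and solving gives the standard closed-form estimates \(b_{1} = \frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sum_{i=1}^{n}(x_{i}-\bar{x})^2}\) and \(b_{0} = \bar{y} - b_{1}\bar{x}\), where \(\bar{x}\) and \(\bar{y}\) are the sample means. The sketch below applies these formulas to a small made-up data set (an assumption for illustration) and cross-checks the result against R's `lm()`.

```r
# Hypothetical toy data, invented for illustration only.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Slope: co-variation of x and y divided by the variation of x.
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: makes the line pass through the point (mean(x), mean(y)).
b0 <- mean(y) - b1 * mean(x)

c(b0 = b0, b1 = b1)

# Cross-check: lm() should report the same intercept and slope.
fit <- lm(y ~ x)
coef(fit)
```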

Practical Applications

  • Let us look at the following sample data set to understand more about how simple linear regression is used.

  • The following is R's built-in mtcars data set, which records details of motor vehicles.

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Practical Applications [ctd…]

  • Let’s take a quick look at how the data points of the two variables weight (\(wt\)) and miles per gallon (\(mpg\)) are distributed.

g <- ggplot(data = mtcars, aes(x = wt, y = mpg)) + geom_point()
g

Practical Applications [ctd…]

  • Even though we can clearly see a trend in which mpg decreases as weight increases, the scatter plot alone does not let us make an accurate estimate of the mpg of a vehicle whose weight is not present in the data set.

  • This is where simple linear regression comes into play.

  • Let us create a simple linear model for the above scenario using the “ggplot2” and “plotly” packages of R.

Practical Applications [ctd…]

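library(plotly)  # provides ggplotly()

# Overlay the least-squares line; se = FALSE hides the confidence band
g <- g + geom_smooth(method = "lm", se = FALSE)
ggplotly(g)
`geom_smooth()` using formula = 'y ~ x'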

Practical Applications [ctd…]

  • We can also inspect the numerical values of the coefficients of the regression line.

mod <- lm(mpg ~ wt, data=mtcars)
mod$coefficients
(Intercept)          wt 
  37.285126   -5.344472 
  • Therefore, the equation of the simple linear regression line for the above scenario is,

    \(\widehat{mpg} = (-5.344472)\,wt + 37.285126\)
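
  • The fitted line can now estimate the mpg of a vehicle whose weight does not appear in the data set. The sketch below assumes a hypothetical vehicle with \(wt = 3.0\) (in mtcars, weight is recorded in units of 1000 lb) and compares `predict()` with the same calculation done by hand from the coefficients.

```r
# Refit the model from the slides on R's built-in mtcars data.
mod <- lm(mpg ~ wt, data = mtcars)

# Hypothetical vehicle weighing 3,000 lb (wt is in units of 1000 lb).
new_vehicle <- data.frame(wt = 3.0)

# Predicted miles per gallon from the fitted regression line.
pred <- predict(mod, newdata = new_vehicle)
pred

# The same value computed by hand: intercept + slope * weight.
manual <- coef(mod)[["(Intercept)"]] + coef(mod)[["wt"]] * 3.0
manual
```

  • Both calculations give roughly 21.25 mpg, which matches substituting \(wt = 3\) into the equation above.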

THANK YOU!