2023-04-05
Simple linear regression is a technique for building a mathematical model that describes the relationship between two continuous variables, based on the data currently available.
This model can then be used to predict data points that are not present in the data set: given the value of one variable, the value of the other can be estimated.
The model created through simple linear regression is a straight line of the following form:
\(y = mx + c\)
The following is a sample plot of the equation \(y = 2x + 5\):
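library(ggplot2)
# Points on the line y = 2x + 5 for x = 0, 1, ..., 10
x <- seq(0, 10, by = 1)
y <- seq(5, 25, by = 2)
df <- data.frame(x, y)
# Draw the line and mark each point
g <- ggplot(data = df, aes(x = x, y = y)) + geom_line() + geom_point()
g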
More specifically, the equation can be written as,
\(\hat{y} = b_{1}x + b_{0}\)
where
\(\hat{y} = \text{predicted value of the second variable}\)
\(b_{1} = \text{regression coefficient}\)
\(x = \text{given value of the first variable}\)
\(b_{0} = \text{y-intercept}\)
The regression coefficient (\(b_{1}\)) and the y-intercept (\(b_{0}\)) are chosen so that the sum of the squared errors between the predicted and actual values of the second variable, taken over every data point in the data set, is minimized.
The squared error between the predicted and actual values of the second variable at the \(i^{th}\) data point:
\((\hat{y}_{i}-y_{i})^2\)
The sum of the squared errors over all \(n\) data points in the data set:
\(\sum_{i=1}^{n}(\hat{y}_{i}-y_{i})^2 = \sum_{i=1}^{n}(b_{1}x_{i} + b_{0}-y_{i})^2\)
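To make the quantity being minimized concrete, here is a minimal sketch in R that evaluates the sum of squared errors for two candidate lines. The helper name sse and the five data points are made up for illustration and are not part of the data used elsewhere in this post.

```r
# Hypothetical toy data, made up for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(7.1, 8.9, 11.2, 12.8, 15.1)

# Sum of squared errors for the candidate line y_hat = b1 * x + b0
# (sse is an illustrative helper, not a built-in R function)
sse <- function(b1, b0, x, y) {
  y_hat <- b1 * x + b0
  sum((y_hat - y)^2)
}

sse(2, 5, x, y)  # candidate close to the trend: 0.11
sse(1, 5, x, y)  # candidate with too small a slope: 55.51
```

The candidate with the smaller sum follows the points more closely. Simple linear regression picks the pair \(b_{1}\), \(b_{0}\) for which this sum is as small as possible.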
The values of \(b_{1}\) and \(b_{0}\) can be found by differentiating the sum of squared errors with respect to \(b_{1}\) and \(b_{0}\), setting both partial derivatives to zero, and solving the resulting pair of equations.
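Setting the two partial derivatives to zero and solving gives the standard closed-form least-squares estimates, where \(\bar{x}\) and \(\bar{y}\) (notation introduced here) are the sample means of the two variables:
\(b_{1} = \frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sum_{i=1}^{n}(x_{i}-\bar{x})^2}\)
\(b_{0} = \bar{y} - b_{1}\bar{x}\)
As a rough check, the sketch below applies these formulas to the same points used in the sample plot of \(y = 2x + 5\) and compares the result with R's built-in lm(). Since the points lie exactly on the line, the estimates should recover a slope of 2 and an intercept of 5.

```r
# Points on the line y = 2x + 5 for x = 0, 1, ..., 10
x <- seq(0, 10, by = 1)
y <- 2 * x + 5

# Closed-form least-squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

c(b0 = b0, b1 = b1)  # intercept 5, slope 2

# Cross-check against R's built-in linear model fit
coef(lm(y ~ x))
```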
Let us look at the following sample data set to understand more about how simple linear regression is used.
The mtcars data set, which ships with R, contains details of motor vehicles.
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
Let’s take a quick look at how the data points of the two variables weight (\(wt\)) and miles per gallon (\(mpg\)) are distributed.
# Scatter plot of mpg against weight
g <- ggplot(data = mtcars, aes(x = wt, y = mpg)) + geom_point()
g
Even though we can clearly see a trend in which mpg decreases as weight increases, the scatter plot alone does not let us make an accurate, informed guess of the mpg of a vehicle whose weight is not present in the data set.
This is where simple linear regression comes into play.
Let us fit and plot a simple linear model for the above scenario using the “ggplot2” and “plotly” packages of R.
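library(plotly)
# Add the least-squares regression line, without the confidence band
g <- g + geom_smooth(method = "lm", se = FALSE)
# Render the plot as an interactive plotly figure
ggplotly(g)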
`geom_smooth()` using formula = 'y ~ x'
We can also inspect the numerical values of the coefficients of the regression line.
mod <- lm(mpg ~ wt, data = mtcars)
mod$coefficients
(Intercept)          wt 
  37.285126   -5.344472 
Therefore, the equation of the simple linear regression line for the above scenario is,
\(\hat{mpg} = (-5.344472)wt+37.285126\)
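With the line fitted, the model can estimate mpg for a weight that does not appear in the data set. The sketch below does this for a hypothetical vehicle with \(wt = 3\), a value chosen purely for illustration (mtcars records wt in units of 1000 lbs). It uses predict() and then repeats the calculation by hand from the coefficients. The name new_vehicle is an assumption of this example.

```r
# Refit the model from the text
mod <- lm(mpg ~ wt, data = mtcars)

# Hypothetical vehicle with wt = 3, chosen for illustration
new_vehicle <- data.frame(wt = 3)
predict(mod, newdata = new_vehicle)

# Same estimate computed by hand from the fitted coefficients
b <- coef(mod)
unname(b["(Intercept)"] + b["wt"] * 3)  # about 21.25 mpg
```

Both calls should give the same value, since predict() on a simple linear model evaluates the fitted line at the supplied weight.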