2023-03-14

What is Simple Linear Regression?

Regression models in general describe relationships between one or more independent variables and a dependent variable by fitting a line or curve to the data.

Simple linear regression is a linear regression model with only one independent variable and one dependent variable. It finds a linear function that predicts the value of the dependent variable from the value of the independent variable. It can be used to assess how strong the relationship between two variables is, or to estimate the value of the dependent variable given some value of the independent variable.

How Linear Regression is Performed

A linear regression function is in the form \(y = a + bx\), where y is the dependent variable, x is the independent variable, a is the y-intercept, and b is the slope of the line.

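The slope of the regression line is given by the following formula:
\(b = r{S_y\over S_x}\)
where \(r\) is the Pearson correlation coefficient, and \(S_y\) and \(S_x\) are the standard deviations of y and x respectively. Once the slope is known, the intercept follows from \(a = \bar{y} - b\bar{x}\), where \(\bar{x}\) and \(\bar{y}\) are the sample means of x and y.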
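As a rough illustration, the slope formula can be evaluated directly in R. The sketch below uses a small set of made-up numbers (the vectors `x` and `y` are hypothetical and not part of the air quality data used later), and it takes the intercept from the standard least-squares relation \(a = \bar{y} - b\bar{x}\).

```r
# Hypothetical toy data, chosen only for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Slope: correlation times the ratio of standard deviations
r <- cor(x, y)
b <- r * sd(y) / sd(x)

# Intercept: the fitted line passes through the point of means
a <- mean(y) - b * mean(x)

c(intercept = a, slope = b)
```

The same two numbers should come back from R's built-in fitting function, which is introduced further down.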

How the Pearson Correlation Coefficient is Calculated

The Pearson correlation coefficient can be calculated using the following formula:
\(r = {\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \over \sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}\)
where \(n\) is the sample size, \(x_i\) and \(y_i\) are sample points indexed with i, and \(\bar{x}\) and \(\bar{y}\) are the sample means for x and y.
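To make the formula concrete, here is a minimal sketch that evaluates it term by term in R and compares the result with the built-in `cor()` function. The vectors `x` and `y` are made-up values used only for illustration.

```r
# Hypothetical toy data, chosen only for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Deviations of each point from its sample mean
x_dev <- x - mean(x)
y_dev <- y - mean(y)

# Numerator: sum of cross-products; denominator: product of the two root sums of squares
r_manual <- sum(x_dev * y_dev) / (sqrt(sum(x_dev^2)) * sqrt(sum(y_dev^2)))

# cor() uses the Pearson method by default
r_builtin <- cor(x, y)

all.equal(r_manual, r_builtin)
```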

Evaluating these formulas by hand is tedious for anything but a tiny sample. R's built-in tools do the same work with far less effort. We’ll go over an example using a data set of air quality measurements in New York from May to September 1973.

The Data We’ll Be Using

data(airquality)
air <- airquality[!is.na(airquality$Solar.R) & 
                  !is.na(airquality$Temp),c("Solar.R","Temp")]
dim(air)
## [1] 146   2

As we can see, the data set has 146 complete observations. Computing the regression by hand for that many points is possible but tedious, which is why we use R instead.

Scatter Plot of Data

Looking at the scatter plot, there could be a relationship between solar radiation and temperature, but it is not entirely clear. Linear regression can help make the trend easier to judge.

Simple Linear Regression in R

mod <- lm(Temp~Solar.R,data=air)
mod
## 
## Call:
## lm(formula = Temp ~ Solar.R, data = air)
## 
## Coefficients:
## (Intercept)      Solar.R  
##    72.86301      0.02825

To do linear regression in R, we only have to call the lm() function with a model formula (here Temp ~ Solar.R, meaning Temp modeled as a function of Solar.R) and our data set as arguments. The lm() function gives us the intercept and slope of the model. With the coefficients rounded, the equation from our example is:
\(\text{Temperature} = 72.863 + 0.028 \times \text{Solar Radiation}\).
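One use of the fitted equation is estimating temperature for a given level of solar radiation. The sketch below shows one way this might be done with `predict()`, and checks the result against the equation evaluated by hand. The solar radiation values in `new_obs` are hypothetical inputs picked for illustration; they are not observations from the data set.

```r
data(airquality)
air <- airquality[!is.na(airquality$Solar.R) &
                  !is.na(airquality$Temp), c("Solar.R", "Temp")]
mod <- lm(Temp ~ Solar.R, data = air)

# Hypothetical solar radiation values (Langleys) to predict for
new_obs <- data.frame(Solar.R = c(100, 200, 300))

# Prediction from the fitted model
predicted <- predict(mod, newdata = new_obs)

# The same calculation written out as intercept + slope * x
manual <- coef(mod)[["(Intercept)"]] + coef(mod)[["Solar.R"]] * new_obs$Solar.R

cbind(new_obs, predicted, manual)
```

The two columns should agree, since `predict()` on a simple linear model is just the fitted equation applied to the new values.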

Code used for Plot

library(ggplot2)

# Scatter plot of the data with the fitted regression line overlaid
plot1 <- ggplot(air, aes(x = Solar.R, y = Temp)) +
  geom_point() +
  # Fitted values: intercept + slope * Solar.R
  geom_line(aes(y = coef(mod)[1] + coef(mod)[2] * Solar.R), col = "blue") +
  labs(title = "Temperature vs Solar Radiation in New York, 1973 (Linear Regression)",
       x = "Solar Radiation (Langleys)", y = "Temperature (Fahrenheit)")
plot1

Scatter Plot with Linear Regression

As we can see, the fitted line slopes upward, which indicates a positive relationship between solar radiation and temperature. The trend is easier to see with the line drawn over the scatter plot than from the points alone. We can look at another example comparing ozone levels and solar radiation instead.
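Before moving on, the strength of the relationship can also be put into numbers rather than judged by eye. The following is a minimal sketch, assuming the same `air` subset as above: it computes the Pearson correlation coefficient with `cor()` and the R-squared of the model with `summary()`. For simple linear regression the two are linked, since R-squared equals the square of the correlation.

```r
data(airquality)
air <- airquality[!is.na(airquality$Solar.R) &
                  !is.na(airquality$Temp), c("Solar.R", "Temp")]
mod <- lm(Temp ~ Solar.R, data = air)

# Pearson correlation between the two variables
r <- cor(air$Solar.R, air$Temp)

# Proportion of the variance in Temp explained by the model
r_squared <- summary(mod)$r.squared

c(r = r, r_squared = r_squared)

# For a single predictor, R-squared is the squared correlation
all.equal(r^2, r_squared)
```

A correlation close to 1 or -1 points to a strong linear relationship, while a value near 0 points to a weak one, so these numbers give useful context for the plot.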

Linear Regression on New Data

newData <- airquality[!is.na(airquality$Solar.R) &
                      !is.na(airquality$Ozone),c("Solar.R","Ozone")]
mod <- lm(Solar.R~Ozone,data=newData)
mod
## 
## Call:
## lm(formula = Solar.R ~ Ozone, data = newData)
## 
## Coefficients:
## (Intercept)        Ozone  
##    144.6306       0.9542

This gives us the equation \(\text{Solar Radiation} = 144.63 + 0.9542 \times \text{Ozone}\),
which we can plot along with the scatter plot of the data.
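The plotting code for this second example is not shown, so the following is only a sketch of how such a plot might be produced with ggplot2. The names `mod2` and `plot2`, the title, and the use of `geom_abline()` are choices made here for illustration. The axis units follow the documentation of the airquality data set (ozone in parts per billion, solar radiation in Langleys).

```r
library(ggplot2)

data(airquality)
newData <- airquality[!is.na(airquality$Solar.R) &
                      !is.na(airquality$Ozone), c("Solar.R", "Ozone")]

# Same model as above, stored under a separate (hypothetical) name
mod2 <- lm(Solar.R ~ Ozone, data = newData)

# Scatter plot with the fitted line drawn from the model's intercept and slope
plot2 <- ggplot(newData, aes(x = Ozone, y = Solar.R)) +
  geom_point() +
  geom_abline(intercept = coef(mod2)[["(Intercept)"]],
              slope = coef(mod2)[["Ozone"]],
              colour = "blue") +
  labs(title = "Solar Radiation vs Ozone in New York, 1973 (Linear Regression)",
       x = "Ozone (ppb)", y = "Solar Radiation (Langleys)")
print(plot2)
```

Unlike `geom_line()` over the observed points, `geom_abline()` draws the line across the full width of the panel, which is a matter of taste rather than a difference in the model.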

Linear Regression Scatter Plot (Ex.2)

Once again, the fitted line slopes upward, indicating a positive relationship between ozone and solar radiation.