2024-02-01

What is Linear Regression

Linear regression is a statistical model that creates a line that best fits the given data.

It can be used on data with two or more variables.

The closer the data points are to the line, the better the line fits the data.

Linear regression has many uses and applications. Some of the most popular are:

  • finding the impact one independent variable has on a dependent variable

  • predicting values based on input parameters

  • forecasting how the data will change in the future

  • understanding the relationships within the data

Variables for Linear Regression

\(x_i\) = value of independent variable for observation \(i\)

\(y_i\) = value of dependent variable for observation \(i\)

\(\bar{x}\) = mean value for independent variable

\(\bar{y}\) = mean value for dependent variable

\(n\) = total number of observations

Equations for Linear Regression

The following are some of the common equations used to calculate a linear regression:

The standard equation to find the line of best fit is: \(\hat{y} = b_0 + b_1 x\)

Least squares criterion: \(\min \sum(y_i - \hat{y}_i)^2\)

Slope for estimated regression line: \(b_1 = \frac{\sum (x_i - \bar{x}) (y_i - \bar{y})}{\sum (x_i - \bar{x})^2}\)

\(Y\)-Intercept for estimated regression equation: \(b_0 = \bar{y} - b_1 \bar{x}\)
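
As a quick check of the slope and intercept formulas above, the sketch below computes \(b_1\) and \(b_0\) by hand and compares them with the coefficients from R's lm function. It uses R's built-in airquality data frame with Temp as \(x\) and Ozone as \(y\); the names aq, x, y, and fit are chosen here only for illustration.

```r
# Keep only rows where both variables are observed (Ozone has missing values)
aq <- airquality[complete.cases(airquality[, c("Temp", "Ozone")]), ]
x <- aq$Temp
y <- aq$Ozone

# Slope and intercept from the closed-form least squares formulas
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Reference fit from R's built-in lm()
fit <- lm(Ozone ~ Temp, data = aq)

c(b0 = b0, b1 = b1)  # hand-computed coefficients
coef(fit)            # coefficients reported by lm()
```

The two sets of coefficients should agree up to floating-point rounding, since lm fits the same least squares criterion.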

Loading in Needed Libraries

In order to create our graphs and use statistical tools, we will need to import the following libraries. We can use ggplot2 to create more complex graphs than what could be made with the original R built-in tools. Plotly is a popular library that is used to create interactive plots.

library(ggplot2) # library for creating statistical graphics
library(plotly) # library that allows us to create interactive graphs
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Loading in our Test Data

To demonstrate how to plot a simple linear regression in R, we are going to use airquality, a data frame that is built into R. It has 153 rows and 6 variables: Ozone, Solar.R, Wind, Temp, Month, and Day.

head(airquality) # this shows us the first 6 rows of airquality
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
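
The NA entries in rows 5 and 6 above show that some measurements are missing. A small sketch, using only base R, of how one might count the missing values per column and the rows usable for a Temp and Ozone regression (the names na_counts and n_complete are illustrative):

```r
# Count missing values in each column of the built-in airquality data frame
na_counts <- colSums(is.na(airquality))
na_counts

# Number of rows where both Temp and Ozone are observed
n_complete <- sum(complete.cases(airquality[, c("Temp", "Ozone")]))
n_complete
```

Rows with a missing Ozone value cannot contribute to a regression of Ozone on Temp, so only the complete rows are used when the model is fitted.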

Plotting Scatter Plot

We will first plot a simple scatter plot of the variables Temp and Ozone, with Temp on the x-axis and Ozone on the y-axis.

linear <- ggplot(data = airquality, aes(x = Temp, y = Ozone)) + geom_point()

Now we have a simple scatter plot of two variables. As seen in the plot on the next slide, there seems to be some correlation between the two variables, but it isn't clear, so let's add a linear regression line to the graph.

Basic Scatter Plot of Two Variables

plot(linear)

Adding Linear Regression Line

To add a linear regression line to an existing scatter plot we will use the stat_smooth function. Within the stat_smooth function, we will define that we want a linear regression line by setting the method to be lm.

linear2 <- linear + stat_smooth(method = "lm", se = FALSE, col = '#619CFF')

On the next slide, we can see the scatter plot with a linear regression line. The line has a positive slope, meaning that Ozone tends to increase as Temp increases. The line also passes visually through the middle of the data points, suggesting that it is a reasonable model for the data.

Scatter Plot with Linear Regression Line

plot(linear2)
## `geom_smooth()` using formula = 'y ~ x'
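
To put a number on the positive relationship seen in the plot, one option is the Pearson correlation coefficient. The sketch below uses base R's cor function and drops rows with missing values via use = "complete.obs"; the variable name r is illustrative.

```r
# Pearson correlation between Temp and Ozone, ignoring rows with missing values
r <- cor(airquality$Temp, airquality$Ozone, use = "complete.obs")
r

# For a simple linear regression with one predictor,
# the squared correlation equals the model's R-squared
r^2
```

A value of r close to 1 would indicate a strong positive linear relationship, while a value near 0 would indicate little linear relationship.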

Adding Shading to show 0.95 Confidence Interval

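We can also add gray shading around the linear regression line to show its 95% confidence interval, the range within which the true regression line for Temp and Ozone is expected to lie. stat_smooth draws this band by default (se = TRUE, level = 0.95), so we simply leave out the se argument used earlier.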

linear3 <- ggplot(data=airquality,aes(x=Temp,y=Ozone))+
  geom_point()+
  stat_smooth(method='lm',col='#619CFF')

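In the graph on the next slide, the shaded band stays close to the line, further supporting the idea that the line is a reasonable model for the data. Keep in mind that the band describes the uncertainty in the fitted line itself, so individual data points are not expected to fall inside it.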

Linear Regression Line and Confidence Int.

plot(linear3)
## `geom_smooth()` using formula = 'y ~ x'
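
The shaded band can also be inspected numerically. The sketch below refits the model with lm and asks predict for 95% confidence limits at a few temperatures; the three Temp values are hypothetical and chosen only for illustration, as are the names fit, new_temps, ci, and widths.

```r
# lm() drops rows with a missing Ozone value by default
fit <- lm(Ozone ~ Temp, data = airquality)

# Hypothetical temperatures (degrees F) chosen for illustration
new_temps <- data.frame(Temp = c(60, 78, 95))

# Fitted mean Ozone with lower (lwr) and upper (upr) 95% confidence limits
ci <- predict(fit, newdata = new_temps, interval = "confidence", level = 0.95)
cbind(new_temps, ci)

# Width of the confidence interval at each temperature
widths <- ci[, "upr"] - ci[, "lwr"]
widths
```

The interval is narrowest near the mean temperature and widens toward the extremes, which is why the shaded band in the plot flares out at both ends. The same predict call without the interval argument returns only the fitted values, which covers the prediction use case listed at the start.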

Making Interactive Graph

With Plotly we can create interactive graphs in R. Here we create an interactive version of the scatter plot with the linear regression line. The benefit of making the graph interactive is that we can show and hide individual graph elements, such as the data points or the regression line. The interactive graph also lets us see the exact value of each data point.

# Fit the linear regression model for Temp and Ozone
Rline <- lm(Ozone ~ Temp, data = airquality)

# Fitted Ozone values for the rows where Ozone is not missing
rline <- airquality %>% filter(!is.na(Ozone)) %>% lm(Ozone ~ Temp, .) %>% fitted.values()

# Build the interactive graph: observed data points plus the fitted line
linear4 <- airquality %>% filter(!is.na(Ozone)) %>%
  plot_ly(x = ~Temp, y = ~Ozone, mode = 'markers', name = 'data') %>%
  add_markers(y = ~Ozone) %>%
  add_trace(x = ~Temp, y = rline, mode = "lines", name = 'line') %>%
  layout(showlegend = TRUE)

Interactive Graph

linear4

Evaluation of the Linear Regression Model

In R, we can use the summary function to create a summary of our linear regression model to see how well it fits the data.

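model_summary <- summary(Rline) # named model_summary to avoid masking base R's sum function

As we can see from the summary on the next slide, the adjusted R-squared value is 0.48, indicating that about 48% of the variance in Ozone can be explained by Temp. The very small p-value for Temp (< 2e-16) indicates a statistically significant linear relationship between the two variables, although roughly half of the variance remains unexplained. The coefficients tell us that the slope of the line is 2.43 and the intercept is -147, so each additional degree of Temp is associated with an increase of about 2.43 in Ozone.

Summary of Linear Regression Model

model_summary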
## 
## Call:
## lm(formula = Ozone ~ Temp, data = airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.729 -17.409  -0.587  11.306 118.271 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -146.9955    18.2872  -8.038 9.37e-13 ***
## Temp           2.4287     0.2331  10.418  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.71 on 114 degrees of freedom
##   (37 observations deleted due to missingness)
## Multiple R-squared:  0.4877, Adjusted R-squared:  0.4832 
## F-statistic: 108.5 on 1 and 114 DF,  p-value: < 2.2e-16
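
To see where the R-squared values in the summary come from, the sketch below recomputes them from the residuals. It assumes the usual definitions \(R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}\) and adjusted \(R^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}\), where \(p\) is the number of predictors; the names fit, sse, sst, and p are illustrative.

```r
fit <- lm(Ozone ~ Temp, data = airquality)

# Observed Ozone values actually used in the fit (rows with missing values are dropped)
y <- model.response(model.frame(fit))

sse <- sum(residuals(fit)^2)   # variation left unexplained by the line
sst <- sum((y - mean(y))^2)    # total variation around the mean

n <- length(y)                 # number of observations used in the fit
p <- 1                         # one predictor (Temp)

r_squared <- 1 - sse / sst
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

c(r_squared = r_squared, adj_r_squared = adj_r_squared)
```

These values should match the Multiple R-squared and Adjusted R-squared lines in the summary output above.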