2024-03-23

Introduction

Simple linear regression is a basic statistical method for modeling the relationship between two variables. It is important in fields such as data science, where it is used for predictive analysis and as a building block for machine learning. This presentation explains why simple linear regression matters and how to use it in the R programming language.

Overview of Simple Linear Regression

According to Newcastle University, simple linear regression “aims to find a linear relationship to describe the correlation between an independent and possibly dependent variable.” By convention, the independent variable is denoted X and the dependent variable Y. Once the relationship between X and Y is established, the regression line can be used for interpolation or for predicting future or missing values.

Mathematical Formulation

The simple linear regression model is formulated as: \[ Y = \beta_0 + \beta_1X+\epsilon \] In this equation, \(Y\) represents the dependent variable, \(\beta_0\) represents the intercept, \(\beta_1\) represents the slope, \(X\) represents the independent variable, and \(\epsilon\) represents the error component.
The intercept is the expected value of the dependent variable when the independent variable is zero. The slope is the expected change in the dependent variable for a one-unit increase in the independent variable. The error term captures the variation in the dependent variable that the line does not explain; its observed counterpart, the residual, is the difference between the observed value of the dependent variable and the value predicted by the fitted regression model.
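
The slides do not say how the intercept and slope are estimated. Assuming ordinary least squares, which is the method R's lm function uses, the estimates minimize the sum of squared residuals and have the closed form below. Here \(\bar{X}\) and \(\bar{Y}\) denote the sample means and \(n\) the number of observations; this notation is introduced for illustration and does not appear in the original slides.

```latex
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^{n}(X_i-\bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X} \]
```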

Assumptions of Simple Linear Regression

Linear regression models rest on four assumptions:

  • Linearity: A linear relationship exists between the dependent variable, Y, and the independent variable, X.
  • Homoscedasticity: The variance of the regression residuals is the same for all observations.
  • Independence: The observed (X, Y) pairs are independent of one another.
  • Normality: The regression residuals are normally distributed.
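
These assumptions can be checked informally once a model has been fitted. The following is a minimal sketch and not part of the original slides: it assumes the built-in trees data and the Volume ~ Girth model used later in this presentation, and the name fit is chosen here for illustration.

```r
# Fit the model used later in the presentation (Volume explained by Girth).
fit <- lm(Volume ~ Girth, data = trees)

# Linearity and homoscedasticity: residuals vs. fitted values should show
# no clear curve and a roughly constant vertical spread.
plot(fit, which = 1)

# Normality: points on the normal Q-Q plot should lie close to the line.
plot(fit, which = 2)

# A formal normality check on the residuals (Shapiro-Wilk test);
# a small p-value would suggest the residuals are not normal.
shapiro.test(residuals(fit))
```

Independence cannot be read off these plots; it depends mainly on how the data were collected.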

Scatterplot

A scatter plot is the natural graph for simple linear regression, since it shows the form, strength, and direction of the relationship between the two variables.

This example uses the “trees” data set built into R, which contains the diameter (recorded as Girth, in inches), height (in feet), and volume (in cubic feet) of 31 trees. The next slide shows a scatter plot of two of these variables, produced with the following code:

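library(ggplot2)
data <- trees  # built-in data set: Girth, Height, and Volume of 31 trees
# Girth is the predictor (x) and Volume the response (y), matching lm(Volume ~ Girth)
ggplot(data, aes(x = Girth, y = Volume)) + geom_point() +
  labs(x = "Independent Variable (Girth)", y = "Dependent Variable (Volume)",
       title = "Scatter Plot")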

Scatterplot

As we can see, the volume of a tree’s trunk increases steadily with its girth.

3D Scatterplot

Here is a 3D scatter plot of all three variables. For simplicity, this model will just focus on volume vs girth.

Model Fitting

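We’ll use the lm function to fit the linear regression model. Using the summary function, we can get more information about our model.

model <- lm(Volume ~ Girth, data = data)  # Volume is the response, Girth the predictor
summary(model)                            # prints the output shown on the next slide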

Model Fitting (cont.)

## 
## Call:
## lm(formula = Volume ~ Girth, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.065 -3.107  0.152  3.495  9.587 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
## Girth         5.0659     0.2474   20.48  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.252 on 29 degrees of freedom
## Multiple R-squared:  0.9353, Adjusted R-squared:  0.9331 
## F-statistic: 419.4 on 1 and 29 DF,  p-value: < 2.2e-16
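
Individual quantities from this summary can also be pulled out programmatically. A small sketch, assuming the same Volume ~ Girth model (refitted here so the block runs on its own; the names fit and s are illustrative and not from the original slides):

```r
fit <- lm(Volume ~ Girth, data = trees)
s <- summary(fit)

s$r.squared                 # proportion of variance in Volume explained by Girth
s$sigma                     # residual standard error
coef(s)                     # coefficient table: estimates, std. errors, t and p values
confint(fit, level = 0.95)  # 95% confidence intervals for intercept and slope
```

An R-squared of about 0.935 says that girth alone accounts for roughly 93.5% of the variation in volume in this sample.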

Model Fitting (cont.)

## `geom_smooth()` using formula = 'y ~ x'
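
The code behind this slide is not shown; the message above suggests a ggplot2 figure with a smoothing layer. One plausible way to draw the scatter plot with the fitted regression line is sketched below. This is an assumption about what the slide displayed, not the author's original code, and the name p is illustrative.

```r
library(ggplot2)

# Scatter plot of Volume against Girth with the least-squares line overlaid.
# method = "lm" asks geom_smooth() for a linear fit; the shaded band is the
# default 95% confidence interval around the fitted line.
p <- ggplot(trees, aes(x = Girth, y = Volume)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "Girth", y = "Volume", title = "Volume vs. Girth with Regression Line")
p
```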

Model Evaluation

The equation for this linear regression is: \[ Y=5.065856X-36.94346 \]

intercept <- coef(model)[1]  # estimated intercept
intercept
slope <- coef(model)[2]      # estimated slope (coefficient on Girth)
slope

The slope means that for each additional unit (inch) of girth, the predicted volume increases by about 5.07 units (cubic feet). The intercept of about -36.94 has no physical meaning in this context: it is the volume the model would predict for a girth of 0, and a negative volume is impossible. A tree with zero girth would realistically have zero volume, but the intercept is still needed to position the line so that it fits the observed data.
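
Once fitted, the model can be used for prediction with the predict function. A hedged sketch, assuming the same Volume ~ Girth model (refitted here so the block runs on its own); the girth of 12 is a made-up input chosen for illustration, not a value from the slides.

```r
fit <- lm(Volume ~ Girth, data = trees)

# Hypothetical new tree with a girth of 12 (same units as the data).
new_tree <- data.frame(Girth = 12)

# Point prediction: intercept + slope * 12
pred <- predict(fit, newdata = new_tree)
pred

# Same prediction with a 95% prediction interval for a single new tree.
predict(fit, newdata = new_tree, interval = "prediction", level = 0.95)
```

With the coefficients reported above, the point prediction works out to \(5.065856 \times 12 - 36.94346 \approx 23.85\). Predictions are most trustworthy for girths inside the range seen in the data.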

Conclusion

This presentation provided a brief overview of simple linear regression and demonstrated its application using the trees data set in R. We examined the relationship between volume and girth and found a slope of about 5, meaning that each additional unit of girth is associated with roughly 5 more units of volume. Simple linear regression is a valuable tool for exploring the relationship between variables, as well as for predicting future and missing values.