2025-03-15

Introduction to Linear Regression

In the following presentation we will explore simple linear regression

We will use R's built-in trees data set, which contains measurements of black cherry trees

We will model the relationship between the girth of each tree and its volume

Linear regression rests on assumptions: there needs to be a linear relationship between the two variables, and the observations should be independent

Before doing the linear regression we will make a scatter plot to check for a positive association

After creating the model we will check the residual plot to confirm that the assumptions we made hold
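As a quick orientation, the data set can be inspected before any modeling. This is a minimal sketch that assumes a standard R installation, where trees is available from the datasets package without any extra setup.

```r
# trees ships with base R (package "datasets"), so nothing needs installing.
data(trees)

str(trees)    # 31 observations of three variables: Girth, Height, Volume
head(trees)   # first six rows

# Ranges of the two variables modeled in this presentation
summary(trees[, c("Girth", "Volume")])
```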

Scatter Plot

This plot shows a roughly linear, positive relationship between girth and volume in trees

This justifies fitting a line of best fit
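A scatter plot like this one can be sketched with base R graphics. The plotting function and labels here are assumptions for illustration, not necessarily what produced the slide's figure. The correlation coefficient is added as a numeric check on the visual impression.

```r
# Scatter plot of volume against girth (base R graphics).
plot(trees$Girth, trees$Volume,
     xlab = "Girth in Inches", ylab = "Volume in Cubic Feet",
     main = "Volume v. Girth")

# Pearson correlation: values near 1 indicate a strong positive
# linear association.
r <- cor(trees$Girth, trees$Volume)
r
```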

Simple Linear Regression Definition

  • \(\alpha\) is the intercept (constant)
  • \(\beta\) is the slope (coefficient)
  • \(y\) is the dependent variable
  • \(x\) is the independent variable

\[ y = \alpha + \beta x \]

Simple linear regression can be used to predict the value of the dependent variable from the independent variable. The slope \(\beta\) gives the expected change in \(y\) for a one-unit increase in \(x\), and \(R^2\) measures how much of the variation in \(y\) the line explains.

Estimating alpha and beta

We cannot know the exact values of \(\alpha\) and \(\beta\), so we estimate them from the data. The line of best fit uses the least squares estimates: the values that minimize the sum of squared errors (SSE).

\[ SSE = \sum_{i=1}^{n} \bigl(y_i - (\alpha + \beta x_i)\bigr)^2 \]

In this presentation we will use R to create the linear model, using the code:

model = lm(Volume ~ Girth, data = trees)
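For simple linear regression, minimizing the SSE has a closed-form solution: \(\hat\beta = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2\) and \(\hat\alpha = \bar{y} - \hat\beta \bar{x}\). The sketch below computes these by hand and compares them with lm(). The names x, y, beta_hat, alpha_hat, sse and fit are illustrative only.

```r
x <- trees$Girth
y <- trees$Volume

# Closed-form least squares estimates for simple linear regression.
beta_hat  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha_hat <- mean(y) - beta_hat * mean(x)

# Sum of squared errors evaluated at the estimates.
sse <- sum((y - (alpha_hat + beta_hat * x))^2)

# The hand-computed values should agree with lm().
fit <- lm(Volume ~ Girth, data = trees)
c(alpha_hat = alpha_hat, beta_hat = beta_hat)
coef(fit)
```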

The summary of the model will be displayed on Slide 5

Summary of the model

\[ y = -36.94 + 5.07x \quad R^2 \approx 0.94 \]

## 
## Call:
## lm(formula = Volume ~ Girth, data = trees)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.065 -3.107  0.152  3.495  9.587 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
## Girth         5.0659     0.2474   20.48  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.252 on 29 degrees of freedom
## Multiple R-squared:  0.9353, Adjusted R-squared:  0.9331 
## F-statistic: 419.4 on 1 and 29 DF,  p-value: < 2.2e-16
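The quantities in the summary above can also be pulled out programmatically, and the fitted model can be used for prediction. This is a minimal sketch. The girth of 12 inches is a hypothetical value chosen for illustration, and the names s and pred are not from the original slides.

```r
model <- lm(Volume ~ Girth, data = trees)
s <- summary(model)

coef(model)      # intercept and slope estimates
s$r.squared      # multiple R-squared
confint(model)   # 95% confidence intervals for the coefficients

# Predicted volume for a hypothetical tree with a girth of 12 inches.
pred <- predict(model, newdata = data.frame(Girth = 12))
pred
```

With the fitted equation this gives a predicted volume of roughly 24 cubic feet. Predictions are only trustworthy for girths within the range seen in the data.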

We will use plots to help visualize the data

Scatter Plot with Line of Best Fit

R Code for ggplot

Here is the code that produced the previous ggplot

The final line, p, is commented out so that the graph is not printed again

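library(ggplot2)  # provides ggplot(), geom_point(), geom_smooth(), labs()

# Scatter plot of Volume against Girth
p = ggplot(trees, aes(x = Girth, y = Volume))
p = p + geom_point(color = "purple", size = 2)
# Least squares line with its confidence band (se = TRUE)
p = p + geom_smooth(method = "lm", se = TRUE, color = "green")
p = p + labs(title = "Volume v. Girth Linear Model",
       subtitle = "Scatterplot",
       x = "Girth in Inches",
       y = "Volume in Cubic Feet")
#p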

Residual Plot

In a residual plot, the residuals should be scattered randomly and evenly around 0

The residuals show no strong pattern, so the assumptions we made earlier appear reasonable
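A residual plot of this kind can be sketched with base R graphics. The choice of plotting residuals against fitted values, and the labels, are assumptions for illustration and may differ from the figure on the slide.

```r
model <- lm(Volume ~ Girth, data = trees)

# Residuals against fitted values. A patternless band around 0 supports
# the linearity assumption; curvature or a funnel shape would count
# against it.
plot(fitted(model), resid(model),
     xlab = "Fitted Volume", ylab = "Residual",
     main = "Residuals v. Fitted")
abline(h = 0, lty = 2)  # dashed reference line at zero
```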

Plotly plot

Simple linear regression involves only two variables, so there is no natural use for a 3D plot

This plot is created using plotly and is more interactive than the previous ggplots

Conclusion

\[ y = -36.94 + 5.07x \quad R^2 \approx 0.94 \]

Since the \(R^2\) value is greater than 0.90, we can conclude that there is a strong linear relationship between the girth and volume of the trees