11/12/23

What is Simple Linear Regression?

Simple linear regression models the relationship between two variables, where one variable is dependent and the other is independent.

Some common pairs of variables that can be modeled with simple linear regression include:

  • Number of hours employees spend training and Performance
  • Height and Weight
  • Volume and Temperature
  • RAM and Cost
  • Pollution and Rising Temperature

Line of Best Fit

The relationship between the two variables is summarized by the line of best fit, which is given by the linear equation:

\[ y = \beta_0 + \beta_1x \]
y = dependent variable
x = independent variable
beta_0 = y-intercept
beta_1 = slope of the line
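
As a rough illustration of where beta_0 and beta_1 come from, the sketch below computes the usual least-squares estimates by hand and compares them with what lm() returns. It borrows tree 1 from the built-in Orange data set used in the example further down. The variable names (x, y, beta_0, beta_1) are chosen here only to mirror the equation.

```r
# Built-in data set; keep only tree 1, as in the example later on
data(Orange)
tree1 <- Orange[Orange$Tree == 1, ]

x <- tree1$circumference  # independent variable
y <- tree1$age            # dependent variable

# Closed-form least-squares estimates of the slope and intercept
beta_1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta_0 <- mean(y) - beta_1 * mean(x)

c(intercept = beta_0, slope = beta_1)

# The same two numbers as estimated by lm()
coef(lm(age ~ circumference, data = tree1))
```

The two printed results should agree, since lm() fits the same line by least squares.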

Coefficient of Determination (R^2)

The coefficient of determination tells us how well the line of best fit describes the data, and therefore how far it can be trusted as a predictor of future data points. It is a value between 0 and 1. The closer R^2 is to 1, the more of the variation in the dependent variable the line of best fit explains. R^2 is calculated using the following formula:

\[ R^2 = \frac{SSR}{SST} \]
SSR = regression sum of squares (variation explained by the line)
SST = total sum of squares (total variation in y)
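
To make the formula concrete, here is a minimal sketch that computes SSR and SST directly and checks the ratio against the value summary() reports. It again uses tree 1 from the built-in Orange data set. The names fit, y_hat and r_squared are illustrative only.

```r
data(Orange)
tree1 <- Orange[Orange$Tree == 1, ]
fit <- lm(age ~ circumference, data = tree1)

y     <- tree1$age
y_hat <- fitted(fit)  # values predicted by the line of best fit

SSR <- sum((y_hat - mean(y))^2)  # variation explained by the line
SST <- sum((y - mean(y))^2)      # total variation in y
r_squared <- SSR / SST

r_squared
summary(fit)$r.squared  # printed by summary() as "Multiple R-squared"
```

For a model with an intercept, the two values should match. The "Adjusted R-squared" that summary() also prints is a different quantity that penalizes the number of predictors.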

Example of Simple Linear Regression

I’m going to use a built-in R data set called Orange, which records the age and trunk circumference of 5 orange trees. Its dimensions are 35 by 3.

Data Load

library(ggplot2)
library(plotly)
data(Orange)
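
A quick way to confirm the shape of the data after loading it is sketched below. It uses only base R functions.

```r
data(Orange)

dim(Orange)         # number of rows and columns
str(Orange)         # column names and types
table(Orange$Tree)  # number of observations per tree
```

Each of the 5 trees should contribute the same number of rows to the 35 total.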

Grouping Data Based on Tree

tree1 = Orange[Orange$Tree == 1, ]
tree3 = Orange[Orange$Tree == 3, ]

I’m separating out the data for 2 of the trees to compare their linear regressions before fitting one on all of the data.

Tree 1

mod1 = lm(age~circumference, data = tree1)
summary(mod1)
Call:
lm(formula = age ~ circumference, data = tree1)

Residuals:
      1       2       3       4       5       6       7 
  25.10   57.36 -108.30 -102.04   65.36  -55.86  118.38 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -264.6734    98.6206  -2.684   0.0436 *  
circumference   11.9192     0.9188  12.973 4.85e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 97.44 on 5 degrees of freedom
Multiple R-squared:  0.9711,    Adjusted R-squared:  0.9654 
F-statistic: 168.3 on 1 and 5 DF,  p-value: 4.852e-05

Tree 1 Continuation

Based on this information, the line of best fit for tree 1, with y as age and x as circumference, is:
\[ y = -264.6734 + 11.9192x \]
It has an R^2 of 0.9711 (adjusted R^2 of 0.9654), so the line of best fit lies very close to most of the points, which makes it a strong predictor of age from circumference for this tree.
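
As a sketch of how this line can be used, the code below predicts the age for a single circumference in two ways: by plugging into the equation, and with predict(). The value new_circ = 100 is a hypothetical input picked only for illustration, not a figure from the analysis.

```r
data(Orange)
tree1 <- Orange[Orange$Tree == 1, ]
mod1 <- lm(age ~ circumference, data = tree1)

b <- coef(mod1)  # named vector: "(Intercept)" and "circumference"

# Hypothetical circumference, chosen only for illustration
new_circ <- 100

# Prediction by plugging into y = beta_0 + beta_1 * x
by_hand <- b[["(Intercept)"]] + b[["circumference"]] * new_circ

# The same prediction via predict()
by_predict <- predict(mod1, newdata = data.frame(circumference = new_circ))

c(by_hand = by_hand, by_predict = unname(by_predict))
```

Both approaches should give the same number. Predictions are most trustworthy for circumferences inside the range observed for the tree.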

Tree 1 Scatter Plot

# Axes match mod1: age (y) as a function of circumference (x)
g1 = ggplot(tree1, aes(x = circumference, y = age) ) + geom_point() +
  geom_smooth(method='lm', se=FALSE, formula = y ~ x) + ggtitle("Tree 1 Scatter Plot")
g1

Tree 3

mod3 = lm(age~circumference, data = tree3)
summary(mod3)
Call:
lm(formula = age ~ circumference, data = tree3)

Residuals:
    15     16     17     18     19     20     21 
-33.65  79.53 -29.40 -86.69  56.04 -91.89 106.07 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -209.5123    85.2683  -2.457   0.0574 .  
circumference   12.0389     0.8353  14.412  2.9e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 87.95 on 5 degrees of freedom
Multiple R-squared:  0.9765,    Adjusted R-squared:  0.9718 
F-statistic: 207.7 on 1 and 5 DF,  p-value: 2.901e-05

Tree 3 Continuation

Based on this information, the line of best fit for tree 3, with y as age and x as circumference, is:
\[ y = -209.5123 + 12.0389x \]
It has an R^2 of 0.9765 (adjusted R^2 of 0.9718), so the line of best fit lies very close to most of the points, which makes it a strong predictor of age from circumference for this tree.

Tree 3 Scatter Plot

# Axes match mod3: age (y) as a function of circumference (x)
g3 = ggplot(tree3, aes(x = circumference, y = age) ) + geom_point() +
  geom_smooth(method='lm', se=FALSE, formula = y ~ x) + ggtitle("Tree 3 Scatter Plot")
g3

Looking at the Entire Data

mod = lm(age~circumference, data = Orange)
summary(mod)
Call:
lm(formula = age ~ circumference, data = Orange)

Residuals:
    Min      1Q  Median      3Q     Max 
-317.88 -140.90  -17.20   96.54  471.16 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    16.6036    78.1406   0.212    0.833    
circumference   7.8160     0.6059  12.900 1.93e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 203.1 on 33 degrees of freedom
Multiple R-squared:  0.8345,    Adjusted R-squared:  0.8295 
F-statistic: 166.4 on 1 and 33 DF,  p-value: 1.931e-14

Looking at the Entire Data Continuation

Based on this information, the line of best fit for the full Orange data, with y as age and x as circumference, is:
\[ y = 16.6036 + 7.8160x \]
It has an R^2 of 0.8345 (adjusted R^2 of 0.8295), so the line of best fit is still close to most of the points, which makes it a good predictor of age from circumference. The models for trees 1 and 3 have higher R^2 values, in part because each has fewer points to fit (7 instead of 35) and does not have to account for differences between trees.
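
The comparison between the pooled model and the single-tree models can be checked for every tree at once. The sketch below is one way to do it, assuming the same age ~ circumference model throughout. The names pooled_r2 and per_tree_r2 are illustrative.

```r
data(Orange)

# R^2 of the pooled model (all 35 observations)
pooled_r2 <- summary(lm(age ~ circumference, data = Orange))$r.squared

# R^2 of a separate model for each tree
per_tree_r2 <- sapply(split(Orange, Orange$Tree), function(d) {
  summary(lm(age ~ circumference, data = d))$r.squared
})

# Entries after "pooled" are named by tree number
round(c(pooled = pooled_r2, per_tree_r2), 4)
```

The per-tree values are labeled with the tree numbers, listed in the order of the Tree factor's levels rather than 1 to 5.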

Orange Scatter Plot

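# Axes match the model age ~ circumference; mode is set explicitly to draw points
g = plot_ly(Orange, x = ~circumference, y = ~age, type = "scatter", mode = "markers")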
g