March 17, 2024

Linear Regression

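Linear regression is a statistical model that describes the relationship between a dependent (response) variable and one or more independent (explanatory) variables. Simple linear regression, covered here, uses a single explanatory variable. The fitted model is used to predict the value of the dependent variable for given values of the independent variable.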

Linear regression is a common first choice among statistical models because it is simple to fit, its coefficients are directly interpretable, and it applies to a wide variety of datasets.

Line of Best Fit

The line of best fit describes how a response variable (y) changes as an explanatory variable (x) changes.

For a simple linear regression, the line of best fit has the general equation:

\(y = \beta_1 x + \beta_0 +\epsilon\)


- \(\beta_1\) = slope
- \(\beta_0\) = y-intercept
- \(\epsilon\) = error (the part of y that the line does not explain)
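
For simple linear regression, the least-squares estimates have closed forms: the slope is the sample covariance of x and y divided by the sample variance of x, and the intercept is chosen so that the line passes through the point of means. The sketch below checks these formulas against `lm()` using the built-in airquality data from the example that follows, with Ozone as x and Temp as y. The names `ok`, `x`, `y`, `b1`, `b0`, and `fit` are illustrative only.

```r
# Keep only rows where both variables are observed, as lm() does by default
ok <- complete.cases(airquality$Ozone, airquality$Temp)
x <- airquality$Ozone[ok]
y <- airquality$Temp[ok]

# Closed-form least-squares estimates for simple linear regression
b1 <- cov(x, y) / var(x)        # slope
b0 <- mean(y) - b1 * mean(x)    # intercept

# Compare with the coefficients estimated by lm()
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))
```

If the formulas are applied correctly, the comparison should agree with `lm()` up to floating-point tolerance.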

Example: Dataset airquality

d <- airquality
str(d)
## 'data.frame':   153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
head(d)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
summary(d)
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  

How is the amount of ozone in the atmosphere related to temperature?

Adding the Line of Best Fit

library(ggplot2)
# Scatter plot of temperature against ozone with the fitted least-squares line
ggplot(airquality, aes(x = Ozone, y = Temp)) + geom_point() +
    geom_smooth(method = "lm", se = FALSE) + labs(y = "Temp (F)",
    x = "Ozone (ppb)")

Regression Analysis

  • Model: Temp = 0.2 Ozone + 69.41
    • The intercept, \(\beta_0\), is 69.41
    • The slope, \(\beta_1\), is 0.2
      • For every 1 ppb increase in ozone, the predicted temperature increases by about 0.2\(^{\circ}\)F
  • The residual standard error is 6.82
    • This is the typical size of the difference between predicted and actual temperatures, in \(^{\circ}\)F
  • The \(R^{2}\) value is 0.49
    • Approximately 49% of the variance in temperature is explained by ozone
  • The p-value is < \(2.2 \times 10^{-16}\), which indicates that the relationship is statistically significant
summary(lm(Temp ~ Ozone, data = airquality))
## 
## Call:
## lm(formula = Temp ~ Ozone, data = airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.147  -4.858   1.828   4.342  12.328 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 69.41072    1.02971   67.41   <2e-16 ***
## Ozone        0.20081    0.01928   10.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.819 on 114 degrees of freedom
##   (37 observations deleted due to missingness)
## Multiple R-squared:  0.4877, Adjusted R-squared:  0.4832 
## F-statistic: 108.5 on 1 and 114 DF,  p-value: < 2.2e-16
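
The figures quoted in the bullets can also be pulled from the fitted model object rather than read off the printout. The sketch below shows one way to do this, and recomputes the residual standard error by hand as the square root of the residual sum of squares divided by the residual degrees of freedom. The names `fit`, `s`, and `rse` are illustrative.

```r
fit <- lm(Temp ~ Ozone, data = airquality)
s <- summary(fit)

coef(fit)        # intercept and slope
s$sigma          # residual standard error
s$r.squared      # multiple R-squared
s$adj.r.squared  # adjusted R-squared

# Residual standard error by hand: sqrt(RSS / residual degrees of freedom)
rse <- sqrt(sum(residuals(fit)^2) / df.residual(fit))
all.equal(rse, s$sigma)
```

Working from the model object avoids transcription slips when copying numbers from the summary into a report.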

Another Example: How are Wind and Ozone related?

Comparison (Temp or Wind vs Ozone)

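  • Temperature and ozone are positively correlated: as the amount of ozone increases, the temperature tends to increase
  • Wind and ozone are negatively correlated (r is about -0.60): as the amount of ozone increases, the wind speed tends to decrease
  • The \(R^{2}\) value for wind is around 0.36, the square of the correlation
    • This is lower than the 0.49 for temperature, so the linear model relating temperature and ozone fits better than the one relating wind and ozone
  • The model fitted below regresses Solar.R (not Wind) on Ozone; its \(R^{2}\) of 0.12 indicates an even weaker linear relationship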
model <- lm(airquality$Solar.R ~ airquality$Ozone)
model
## 
## Call:
## lm(formula = airquality$Solar.R ~ airquality$Ozone)
## 
## Coefficients:
##      (Intercept)  airquality$Ozone  
##         144.6306            0.9542
cor.test(airquality$Ozone, airquality$Wind)
## 
##  Pearson's product-moment correlation
## 
## data:  airquality$Ozone and airquality$Wind
## t = -8.0401, df = 114, p-value = 9.272e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7063918 -0.4708713
## sample estimates:
##        cor 
## -0.6015465
summary(model)
## 
## Call:
## lm(formula = airquality$Solar.R ~ airquality$Ozone)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -153.577  -65.349   -0.555   68.110  176.011 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      144.6306    13.1749   10.98  < 2e-16 ***
## airquality$Ozone   0.9542     0.2459    3.88 0.000179 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 85.83 on 109 degrees of freedom
##   (42 observations deleted due to missingness)
## Multiple R-squared:  0.1213, Adjusted R-squared:  0.1133 
## F-statistic: 15.05 on 1 and 109 DF,  p-value: 0.0001793
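
The \(R^{2}\) of about 0.36 quoted for wind can be obtained directly by fitting wind against ozone. With a single predictor, \(R^{2}\) equals the squared Pearson correlation, so the value can be checked against the `cor.test()` output above. This is a sketch; the names `wind_fit`, `r2_wind`, `r`, and `r2_temp` are illustrative.

```r
# Regress wind speed on ozone; rows with missing Ozone are dropped by lm()
wind_fit <- lm(Wind ~ Ozone, data = airquality)
r2_wind <- summary(wind_fit)$r.squared

# With one predictor, R-squared equals the squared Pearson correlation
r <- cor(airquality$Ozone, airquality$Wind, use = "complete.obs")
all.equal(r2_wind, r^2)

# Compare with the temperature model from the earlier section
r2_temp <- summary(lm(Temp ~ Ozone, data = airquality))$r.squared
r2_temp > r2_wind
```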

Conclusion

Simple linear regression is a helpful statistical method for showing the relationship between a response variable and an explanatory variable.

The line of best fit shows the direction of the relationship (positive or negative) through the sign of its slope and can be used for prediction.
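
As a sketch of the prediction step, `predict()` can apply the fitted temperature model to new ozone readings. The ozone values in `new_days` are hypothetical and chosen only for illustration; `fit`, `new_days`, `pred`, and `pred_int` are illustrative names.

```r
fit <- lm(Temp ~ Ozone, data = airquality)

# Hypothetical ozone readings (ppb) at which to predict temperature
new_days <- data.frame(Ozone = c(20, 50, 100))

# Point predictions from the fitted line
pred <- predict(fit, newdata = new_days)
pred

# Point predictions with 95% prediction intervals
pred_int <- predict(fit, newdata = new_days,
                    interval = "prediction", level = 0.95)
pred_int
```

Prediction intervals are wider than the fitted line alone suggests because they account for the scatter of individual days around the line.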

While linear regression can be applied to many different datasets, it is not very informative when the relationship is only weakly linear, and it will predict better for some datasets than for others.