What is Simple Linear Regression?

  • Simple linear regression is a model used to measure the relationship between two quantitative variables (x and y).
  • x = the predictor, or independent, variable
  • y = the response, or dependent, variable
  • Using these variables, simple linear regression predicts the outcome, y, based on the value of x
  • It is called “simple” because it involves only one independent variable

Simple Linear Regression Formula

\(y = \beta_0 + \beta_1\cdot x + \varepsilon\), where \(\varepsilon \sim \mathcal{N}(\mu=0; \,\,\sigma^2)\)

\(\beta_0\) = Constant/y-intercept

\(\beta_1\) = the slope (the expected change in y for a one-unit increase in x)

\(\varepsilon\) = the error (represents how the observed outcome differs from the outcome predicted by the model)
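
To make the role of each term concrete, data can be simulated from this model in R (a minimal sketch; the parameter values below are arbitrary and not taken from the slides):

# simulate y = beta0 + beta1*x + error, with normally distributed error
set.seed(1)
x     <- 1:50
beta0 <- 10   # intercept (assumed value for illustration)
beta1 <- 2    # slope (assumed value for illustration)
y     <- beta0 + beta1 * x + rnorm(length(x), mean = 0, sd = 5)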

Orange Dataset

On the following slide, simple linear regression is demonstrated using the built-in “Orange” data set, which records the age and circumference of orange trees across 35 observations.

head(Orange):

##   Tree  age circumference
## 1    1  118            30
## 2    1  484            58
## 3    1  664            87
## 4    1 1004           115
## 5    1 1231           120
## 6    1 1372           142

ggplot Example #1

This graph shows a positive linear relationship: generally, as the age of an orange tree increases, so does its circumference.
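
The plot on this slide can be reproduced with code along the following lines (a sketch mirroring the cars example later in the deck; the title, colours, and axis labels are assumptions based on the units documented for the built-in Orange data):

library(ggplot2)

ggplot(data = Orange, aes(x = age, y = circumference)) + 
  geom_point() + 
  geom_smooth(formula = y ~ x, 
              method = "lm", 
              se = FALSE) +
  ggtitle("Circumference (mm) vs Age (days) of Orange Trees") + 
  xlab("Age (days)") + 
  ylab("Circumference (mm)")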

Linear Model
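
The summary output below can be produced by fitting the model with lm() and printing its summary (a minimal sketch; the object name model is an assumption):

model <- lm(circumference ~ age, data = Orange)
summary(model)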

## 
## Call:
## lm(formula = circumference ~ age, data = Orange)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.310 -14.946  -0.076  19.697  45.111 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17.399650   8.622660   2.018   0.0518 .  
## age          0.106770   0.008277  12.900 1.93e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.74 on 33 degrees of freedom
## Multiple R-squared:  0.8345, Adjusted R-squared:  0.8295 
## F-statistic: 166.4 on 1 and 33 DF,  p-value: 1.931e-14

\(R^2\) Interpretation

  • \(R^2\) = the coefficient of determination, which measures how well the linear model fits the data.
  • Values range from 0 to 1
  • The closer \(R^2\) is to 1, the better the linear model fits the data
  • \(R^2\) = 0.8345, so the model explains about 83% of the variation in circumference, indicating a strong linear fit (see the code sketch below)
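
If the fitted model is stored in an object (here assumed to be called model, as on the earlier slide), \(R^2\) can be pulled directly from the summary:

model <- lm(circumference ~ age, data = Orange)
summary(model)$r.squared      # multiple R-squared, about 0.8345
summary(model)$adj.r.squared  # adjusted R-squared, about 0.8295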

Line of Best Fit Equation

\(\hat{y} = \beta_0 + \beta_1\cdot x\)

predicted circumference = intercept + (slope * age)

\(\beta_0 = 17.40\) → intercept

\(\beta_1 = 0.107\) → slope

\(\hat{y} = 0.107x + 17.40\)
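
The estimated intercept and slope can be read off the fitted model with coef(), and predictions from the line of best fit can be made with predict() (a sketch; the age of 1000 days is an arbitrary illustration):

model <- lm(circumference ~ age, data = Orange)
coef(model)                                        # (Intercept) ~ 17.40, age ~ 0.107
predict(model, newdata = data.frame(age = 1000))   # ~ 17.40 + 0.107 * 1000 = 124.2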

Cars Dataset

The next data set used is “cars.” This data frame contains 50 observations depicting the relationship between speed (mph) and stopping distance (ft).

head(cars):

##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

ggplot Example #2

As the speed increases, the stopping distance also tends to increase. Therefore, these two variables have a positive linear relationship.

Code for Previous Slide

library(ggplot2)

# scatter plot of stopping distance vs speed, with a fitted regression line
ggplot(data = cars, aes(x = speed, y = dist)) + 
  geom_point(col='deeppink3') + 
  geom_smooth(formula = y ~ x, 
              method = "lm", 
              se = FALSE, 
              col='royalblue') +
  ggtitle("Stopping Distance (ft) vs Speed (mph) of Cars") + 
  xlab("Speed (mph)") + 
  ylab("Stopping Distance (ft)")

Trees Dataset

The following “trees” data set contains 31 observations of the girth (diameter), height, and volume of black cherry trees.

head(trees):

##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
## 5  10.7     81   18.8
## 6  10.8     83   19.7

Plotly Example

This graph shows that as the diameter and height of a tree increase, its volume also increases.
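
A 3D scatter plot of this kind can be built with the plotly package’s plot_ly() function (a sketch; the exact styling of the original graph is an assumption):

library(plotly)

plot_ly(data = trees, 
        x = ~Girth, 
        y = ~Height, 
        z = ~Volume, 
        type = "scatter3d", 
        mode = "markers")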