Simple Linear Regression

Simple linear regression is a statistical method that models the relationship between two variables by fitting a straight line to the data.

With this line, we can predict a y value (response variable) based on a given x value (predictor variable).

To introduce simple linear regression, we will use the built-in R dataset “faithful”.

Faithful Dataset

We will use the “faithful” dataset in R to illustrate what can be done with a simple linear regression model. First, let’s take a look at our data:

## 'data.frame':    272 obs. of  2 variables:
##  $ eruptions: num  3.6 1.8 3.33 2.28 4.53 ...
##  $ waiting  : num  79 54 74 62 85 55 88 85 51 85 ...

##    eruptions        waiting    
##  Min.   :1.600   Min.   :43.0  
##  1st Qu.:2.163   1st Qu.:58.0  
##  Median :4.000   Median :76.0  
##  Mean   :3.488   Mean   :70.9  
##  3rd Qu.:4.454   3rd Qu.:82.0  
##  Max.   :5.100   Max.   :96.0
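
The output above has the shape of what `str()` and `summary()` print for a data frame. The slides do not show the code, so the following is a minimal sketch that assumes those two calls produced it:

```r
# faithful ships with base R (package "datasets"), so no extra package is needed
data(faithful)

# Structure: number of observations, variable names, and types
str(faithful)

# Minimum, quartiles, median, mean, and maximum of each variable
summary(faithful)
```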

Faithful Dataset

As we saw on the previous slide, our dataset has 272 observations of 2 variables:

eruptions - the duration of an eruption in minutes

waiting - the amount of time, in minutes, until the next eruption

This gives us a good amount of data to look at. We will treat the eruptions variable as our predictor and the waiting variable as our response.

Faithful Dataset

Let’s take a look at the data before modeling. The relationship between eruption duration and waiting time appears roughly linear, so a straight-line model is a reasonable choice.
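
One way to put a number on that visual impression is the Pearson correlation coefficient. This check is an addition to the slides rather than part of the original analysis, and the variable name `r` is our own:

```r
# Pearson correlation between eruption duration and waiting time
r <- cor(faithful$eruptions, faithful$waiting)
round(r, 2)  # about 0.90
```

A value near 1 indicates a strong positive linear association. It does not by itself show that a straight line is the right shape, so the plot is still worth checking.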

The Model

We will move forward, assuming this linear relationship between eruptions and wait time. The simple linear regression model is defined by:

\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]

\[\begin{aligned} & \text{Where:} \\ & - y_i \text{ is the response variable (Waiting Time)} \\ & - x_i \text{ is the predictor variable (Eruption Duration)} \\ & - \beta_0 \text{ is the y-intercept} \\ & - \beta_1 \text{ is the slope coefficient} \\ & - \epsilon_i \text{ is the random error term, where } \epsilon_i\sim N(0, \sigma^2) \\ \end{aligned} \]
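
In R, the intercept and slope can be estimated by least squares with `lm()`. The sketch below is illustrative only: the names `fit`, `b0`, `b1`, and `new_eruption`, and the 3.5-minute example duration, are our own choices.

```r
# Estimate beta_0 and beta_1 by ordinary least squares
fit <- lm(waiting ~ eruptions, data = faithful)

b0 <- coef(fit)[["(Intercept)"]]  # estimated intercept
b1 <- coef(fit)[["eruptions"]]    # estimated slope

# Predicted waiting time after a hypothetical 3.5-minute eruption
new_eruption <- data.frame(eruptions = 3.5)
y_hat <- predict(fit, newdata = new_eruption)

# predict() agrees with plugging x into the fitted line by hand
all.equal(unname(y_hat), b0 + b1 * 3.5)
```

On this dataset the fit gives an intercept of roughly 33.5 and a slope of roughly 10.7, so each additional minute of eruption is associated with about 10.7 more minutes of waiting.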

Residual Sum of Squares

The residual sum of squares (RSS) measures how well the regression model fits our dataset. It is defined by:

\[RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]

\[\begin{aligned} & \text{Where:} \\ & - y_i \text{ is the actual value of the } i^{th} \text{ observation} \\ & - \hat{y}_i \text{ is the predicted value for that observation} \\ \end{aligned} \]

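RSS = 0: The regression line is a perfect fit; every point lies on the line.
Low RSS: The predictions are close to the observed values, so the model is a good fit.
High RSS: The predictions are far from the observed values, so the model is not a good fit.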
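
RSS can be computed directly from a fitted model. Because its size depends on the units of y and on the number of observations, "low" and "high" are easier to judge against a baseline. The sketch below compares RSS with the total sum of squares, which is an addition to the slides; the names `rss`, `tss`, and `r_squared` are our own.

```r
fit <- lm(waiting ~ eruptions, data = faithful)

# RSS: sum of squared differences between observed and fitted values
rss <- sum((faithful$waiting - fitted(fit))^2)

# For an lm fit, deviance() returns the same quantity
all.equal(rss, deviance(fit))

# Total sum of squares: the RSS of a model that always predicts the mean
tss <- sum((faithful$waiting - mean(faithful$waiting))^2)

# Share of the variation in waiting time explained by the line
r_squared <- 1 - rss / tss
```

Here `r_squared` matches the R-squared value that `summary(fit)` reports.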

The Plot

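library(ggplot2)  # provides ggplot(), geom_point(), geom_smooth()

# Fit the simple linear regression of waiting time on eruption duration
fit <- lm(waiting ~ eruptions, data = faithful)

# Scatter plot with the least-squares line; geom_smooth(method = "lm")
# fits the same model internally, so the drawn line matches `fit`
ggplot(faithful, aes(x = eruptions, y = waiting)) +
  geom_point(color = "darkgrey") +
  geom_smooth(method = "lm", color = "#8C1D40", se = FALSE) +
  labs(x = "Eruption Duration (min)",
       y = "Waiting Time (min)") +
  theme_minimal()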