What Will You See in this Presentation?

In this presentation, we use linear regression to analyze the relationship between Old Faithful’s eruption duration and the waiting time between eruptions. We use R’s built-in faithful dataset, which contains 272 observations of Old Faithful eruptions.
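
As a quick check before modelling, the dataset’s size and the raw correlation can be inspected directly. This is a minimal sketch using only base R; the variable names n_obs and r are illustrative and do not come from the original slides.

```r
# `faithful` ships with base R (datasets package), so no download is needed.
data(faithful)

n_obs <- nrow(faithful)                          # number of recorded eruptions
r <- cor(faithful$eruptions, faithful$waiting)   # Pearson correlation coefficient

n_obs
r
```

A correlation close to 1 suggests that a straight-line model is a reasonable starting point.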

What is the linear regression model?

We can define the linear regression model as:

\[Y_i = \beta_0 + \beta_1X_i + \epsilon_i\]

Where:

  • \(Y_i\) is the dependent variable (here, the waiting time)
  • \(\beta_0\) is the y-intercept
  • \(\beta_1\) is the slope
  • \(X_i\) is the independent variable (here, the eruption duration)
  • \(\epsilon_i\) is the error term
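
To connect these symbols to the data, the model can be fitted with base R’s lm(). The sketch below assumes waiting time is the response and eruption duration is the predictor; the names fit, b0, b1, and eps_hat are illustrative, and the 3.5-minute eruption is a hypothetical input.

```r
# Fit waiting time (Y) as a linear function of eruption duration (X).
fit <- lm(waiting ~ eruptions, data = faithful)

b0 <- unname(coef(fit)["(Intercept)"])  # estimate of beta_0 (y-intercept)
b1 <- unname(coef(fit)["eruptions"])    # estimate of beta_1 (slope)

# Residuals are the estimated error terms, one per observation.
eps_hat <- residuals(fit)

# Predicted waiting time for a hypothetical 3.5-minute eruption.
predict(fit, newdata = data.frame(eruptions = 3.5))
```

The slope b1 is the estimated change in waiting time, in minutes, for each additional minute of eruption duration.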

Scatter Plot Comparison

Code for Scatter Plot

Here is the code for the previous scatter plot:

library(ggplot2)

ggplot(faithful, aes(x = eruptions, y = waiting)) +
  geom_point(color = "green") +
  # Overlay the least-squares line with its confidence band
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  theme_minimal() +
  labs(
    title = "Eruption Duration vs. Waiting Time",
    x = "Eruption Duration (min)",
    y = "Waiting Time (min)"
  )

Interactive Density Plot

Here is the code for the density plot:

library(ggplot2)
library(plotly)

p <- ggplot(faithful, aes(x = eruptions, y = waiting)) +
  geom_point(alpha = 0.5) +
  # 2D kernel density contours drawn over the points
  geom_density_2d(color = "red") +
  theme_minimal() +
  labs(
    title = "Interactive Density Contours of Eruption Patterns",
    x = "Eruption Duration (minutes)",
    y = "Waiting Time (minutes)"
  )

# No `text` aesthetic is mapped, so show the x and y values in the tooltip
ggplotly(p, tooltip = c("x", "y"))

Frequency Density Graph of Old Faithful

R-squared in Our Model

The \(R^2\) statistic measures how well eruption duration predicts waiting time:

\[R^2 = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}\]

For Old Faithful:

  • \(y_i\) = actual waiting times (ranging from 43 to 96 minutes)
  • \(\hat{y}_i\) = predicted waiting times from the model
  • \(\bar{y}\) = mean waiting time (70.9 minutes)

Our model’s \(R^2 = 0.811\), so eruption duration explains roughly 81.1% of the variance in waiting time.
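
The formula can be checked by hand against the value that lm() reports. This is a minimal sketch assuming the model is waiting ~ eruptions fitted on the faithful dataset; the names ss_res, ss_tot, r2_manual, and r2_lm are illustrative.

```r
fit <- lm(waiting ~ eruptions, data = faithful)

y     <- faithful$waiting   # actual waiting times, y_i
y_hat <- fitted(fit)        # predicted waiting times, y-hat_i

ss_res <- sum((y - y_hat)^2)     # residual sum of squares (numerator)
ss_tot <- sum((y - mean(y))^2)   # total sum of squares (denominator)

r2_manual <- 1 - ss_res / ss_tot
r2_lm     <- summary(fit)$r.squared

# The two values should agree up to floating-point tolerance.
all.equal(r2_manual, r2_lm)
```

Computing both sums explicitly shows that \(R^2\) compares the model’s squared errors with those of a baseline that always predicts the mean waiting time.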