2025-11-09

What is Simple Linear Regression?

It is a method that models the relationship between one independent variable (usually x) and one dependent variable (usually y) as a straight line.

Real-life uses: it helps with predicting costs (such as travel), forecasting business sales, analyzing profit and loss, etc.

How do you write it in mathematical terms?

Simple linear regression can be written as a single, very simple equation:

\[ y = \beta_0 + \beta_1 x + \varepsilon \]

Here, y is the dependent variable, x is the independent variable,
\(\beta_0\) is the y-intercept, \(\beta_1\) is the slope,
and \(\varepsilon\) is the error term (the difference between the actual value and the value predicted by the line).
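To make the equation concrete, here is a minimal sketch that simulates data from it. The values of beta0, beta1 and the noise level are hypothetical and chosen only for illustration. They are not taken from any real dataset.

```r
# Hypothetical parameter values, chosen only for illustration
set.seed(42)                                   # make the random noise reproducible
beta0 <- 2                                     # assumed y-intercept
beta1 <- 3                                     # assumed slope
x <- 1:10                                      # independent variable
epsilon <- rnorm(length(x), mean = 0, sd = 1)  # random error term
y <- beta0 + beta1 * x + epsilon               # dependent variable

# Fitting a line to the simulated data should give estimates close to 2 and 3
sim_model <- lm(y ~ x)
coef(sim_model)
```

Because of the random error, the estimated intercept and slope will be close to the assumed values but not exactly equal to them.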

Importance of the error term and how to calculate it

The error term represents variability in y that x does not explain, which can arise for many reasons. It also captures noise, and quantifying it tells us how well the model fits. In practice we calculate it for each observation (where it is called the residual) using:

\[ \varepsilon = y_{\text{actual}} - y_{\text{predicted}} \]
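As a quick illustration of the formula above, here is a small sketch on toy numbers. Both the data and the line y = 2 + 3x are hypothetical.

```r
# Hypothetical toy data and an assumed line y = 2 + 3x
x <- c(1, 2, 3)
y_actual <- c(5.2, 7.9, 11.1)

y_predicted <- 2 + 3 * x                  # 5, 8, 11
residuals_toy <- y_actual - y_predicted   # actual minus predicted
round(residuals_toy, 1)                   # 0.2 -0.1 0.1
```

A positive value means the line under-predicts that observation, and a negative value means it over-predicts.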

Details about the dataset we are about to use

We are going to use the faithful dataset, which records eruptions of the Old Faithful geyser. It is a convenient size (272 observations) and I have been meaning to experiment with it recently.

Here is a small breakdown of the column names and the details included in this dataset.
Let’s check that it’s available first (it should be, as it is a built-in dataset in R).

We will check for it by using these commands:

data(faithful)
head(faithful)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55
names(faithful)
## [1] "eruptions" "waiting"

head(faithful) prints the first few rows, and names(faithful) lists the column names. eruptions is the eruption duration in minutes, and waiting is the waiting time until the next eruption, also in minutes.
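Beyond the first few rows, a few more base R calls give a fuller picture of the data. This is only a sketch of checks one might run before modelling.

```r
data(faithful)

dim(faithful)             # number of rows and columns
str(faithful)             # column types and a preview of the values
summary(faithful)         # min, quartiles, mean and max for each column
colSums(is.na(faithful))  # count of missing values per column
```

Checking for missing values up front is useful because lm() drops incomplete rows by default, which changes the number of observations behind the fit.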

Fitting the model

model <- lm(waiting ~ eruptions, data = faithful)
summary(model)
## 
## Call:
## lm(formula = waiting ~ eruptions, data = faithful)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0796  -4.4831   0.2122   3.9246  15.9719 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.4744     1.1549   28.98   <2e-16 ***
## eruptions    10.7296     0.3148   34.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.914 on 270 degrees of freedom
## Multiple R-squared:  0.8115, Adjusted R-squared:  0.8108 
## F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16
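Reading the coefficients off the summary, the fitted line is roughly waiting = 33.47 + 10.73 × eruptions, so each extra minute of eruption goes with about 10.7 more minutes of waiting. The sketch below reproduces those two estimates with the standard closed-form least-squares formulas and then predicts the waiting time for one new eruption. The 3.5-minute duration is a hypothetical input chosen for illustration.

```r
data(faithful)
model <- lm(waiting ~ eruptions, data = faithful)

# Closed-form least-squares estimates for a single predictor
slope <- cov(faithful$eruptions, faithful$waiting) / var(faithful$eruptions)
intercept <- mean(faithful$waiting) - slope * mean(faithful$eruptions)
c(intercept = intercept, slope = slope)  # should match coef(model)
coef(model)

# Predicted waiting time for a hypothetical 3.5-minute eruption
new_eruption <- data.frame(eruptions = 3.5)
predict(model, newdata = new_eruption)   # about 71 minutes
```

The newdata data frame must use the same column name as the predictor in the formula (eruptions here), otherwise predict() cannot match it to the model.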

Plotly Plot

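library(plotly)  # provides plot_ly(), add_lines(), layout() and the %>% pipe

# Fitted waiting times for every observed eruption duration
pred <- predict(model, newdata = faithful)

# Scatter plot of the data with the fitted regression line on top
plot_ly(faithful, x = ~eruptions, y = ~waiting, type = "scatter", mode = "markers",
        name = "Data Points") %>%
  add_lines(x = ~eruptions, y = ~pred, name = "Fitted Line") %>%
  layout(title = "Waiting Time vs Eruption Duration",
         xaxis = list(title = "Eruption Duration"),
         yaxis = list(title = "Waiting Time"))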

GGPLOT 1

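library(ggplot2)  # provides ggplot(), geom_point(), geom_smooth(), labs()

# Scatter plot with the least-squares line and its confidence band (se = TRUE)
ggplot(faithful, aes(x = eruptions, y = waiting)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(title = "Waiting Time vs Eruption Duration",
       x = "Eruption Duration",
       y = "Waiting Time")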

GGPLOT 2

# Store the residuals and fitted values as new columns so ggplot can use them
faithful$residuals <- resid(model)
faithful$fitted <- fitted(model)

ggplot(faithful, aes(x = fitted, y = residuals)) +
  geom_point(color = "orange") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Fitted Values",
       x = "Fitted Waiting Time",
       y = "Residuals")
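The residual plot can be backed up with a couple of numbers. The sketch below checks that the residuals average out to zero and recomputes by hand the residual standard error that summary(model) reported (5.914 on 270 degrees of freedom).

```r
data(faithful)
model <- lm(waiting ~ eruptions, data = faithful)
res <- resid(model)

mean(res)  # essentially zero for a model that includes an intercept

# Residual standard error: sqrt(sum of squared residuals / residual df)
rse <- sqrt(sum(res^2) / df.residual(model))
rse
sigma(model)  # the same quantity, as extracted by R
```

An rse of about 5.9 means a typical prediction misses the actual waiting time by roughly 6 minutes.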