2025-10-20

What is Simple Linear Regression?

  • Simple linear regression is a statistical method that fits a straight line to predict the value of a dependent variable from a single independent variable.
  • It helps us understand how one variable (the predictor) relates to another (the outcome).
  • Let’s explore this with an example.
  • Example: How does house area (sq ft) influence house price ($K)?
  • We will:
    • Simulate a small housing dataset
    • Fit a linear model
    • Visualize results with ggplot2 and Plotly

Creating the Data

set.seed(42)
n <- 120
area <- round(rnorm(n, mean = 1800, sd = 450))
area[area < 600] <- 600
price <- 100 + 0.15 * area + rnorm(n, 0, 30)

housing <- data.frame(area, price)
head(housing)
  area    price
1 2417 417.7412
2 1546 287.7869
3 1963 398.1911
4 2085 382.8508
5 1982 397.2453
6 1752 349.9522

The Model

Let’s write down the simple linear regression model we will fit:

\[ \text{Price}_i = \beta_0 + \beta_1 \,\text{Area}_i + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0,\sigma^2) \]

  • \(\beta_0\): intercept (expected price when area = 0)
  • \(\beta_1\): slope (expected change in price, in $K, per additional sq ft)
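As a rough sketch of how these parameters are estimated in R, the block below re-creates the simulated data from the previous section and reads the estimates off the fitted model. Because the data were generated with an intercept of 100 and a slope of 0.15, the estimates should land near those values, though not exactly on them. The object names `fit` and `beta_hat` are chosen for this illustration only.

```r
# Re-create the simulated housing data (same seed and settings as above)
set.seed(42)
n <- 120
area <- round(rnorm(n, mean = 1800, sd = 450))
area[area < 600] <- 600
price <- 100 + 0.15 * area + rnorm(n, 0, 30)
housing <- data.frame(area, price)

# Fit the model Price = beta0 + beta1 * Area + error
fit <- lm(price ~ area, data = housing)

# Estimated intercept (beta0) and slope (beta1)
beta_hat <- coef(fit)
print(beta_hat)

# Estimated residual standard deviation (sigma); the simulation used 30
print(sigma(fit))
```

The slope is the quantity of interest here: multiplying it by 100 gives the estimated change in price ($K) for an extra 100 sq ft.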

What is a Scatter Plot and how to code it?

A scatter plot is a graph that displays the relationship between two quantitative variables using points. One variable is on the x-axis and the other on the y-axis.

library(ggplot2)

# Fit the simple linear regression of price on area
m <- lm(price ~ area, data = housing)

# Scatter plot with the fitted line and its confidence band
ggplot(housing, aes(x = area, y = price)) +
  geom_point(color = "#e5a0c6") +
  geom_smooth(method = "lm", se = TRUE, color = "#7f3667") +
  labs(title = "Housing Prices vs Area",
       x = "Area (sq ft)", y = "Price ($K)") +
  theme_classic()

Scatter Plot Output

What is a Residuals Plot and how to code it?

A residual plot shows the difference between the observed and predicted values (residuals) on the y-axis and the fitted values on the x-axis, helping us check how well our regression model fits the data.

housing$fitted <- fitted(m)
housing$resid  <- resid(m)

ggplot(housing, aes(x = fitted, y = resid)) +
  geom_point(color = "#7f3667") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Fitted Values",
       x = "Fitted Price ($K)", y = "Residuals") +
  theme_minimal()

Residuals Plot Output
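To make the definition of a residual concrete, here is a minimal sketch that computes the residuals by hand (observed minus fitted) and compares them with what `resid()` returns. It re-creates the simulated data so that it runs on its own; `fit` and `manual_resid` are names introduced just for this example.

```r
# Re-create the simulated housing data (same seed and settings as above)
set.seed(42)
n <- 120
area <- round(rnorm(n, mean = 1800, sd = 450))
area[area < 600] <- 600
price <- 100 + 0.15 * area + rnorm(n, 0, 30)
housing <- data.frame(area, price)

fit <- lm(price ~ area, data = housing)

# Residual = observed price minus fitted (predicted) price
manual_resid <- housing$price - fitted(fit)

# Should agree with resid() up to floating-point error
print(all.equal(unname(manual_resid), unname(resid(fit))))

# With an intercept in the model, least-squares residuals sum to
# (numerically) zero, which is why the plot is centered on the dashed line
print(sum(resid(fit)))
```

If the model fits well, the points in the residual plot should scatter around zero with no visible curve or funnel shape.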

What is an Interactive Plot and how to code it?

An interactive plot lets users explore data by hovering, zooming, or rotating elements, making it easier to visualize and understand relationships dynamically.

library(plotly)

# Interactive scatter plot with the fitted regression line overlaid
plot_ly(housing, x = ~area, y = ~price,
        type = "scatter", mode = "markers",
        marker = list(color = '#e5a0c6', size = 8)) %>%
  add_lines(x = housing$area,
            y = fitted(m),
            line = list(color = '#7f3667')) %>%
  layout(title = "Interactive Price vs Area",
         xaxis = list(title = "Area (sq ft)"),
         yaxis = list(title = "Price ($K)"))

Interactive Plot

Inference

A \(100(1-\alpha)\%\) confidence interval for the slope is:

\[ \hat\beta_1 \pm t_{n-2,\,1-\alpha/2}\,\mathrm{SE}(\hat\beta_1) \]

Test \(H_0:\ \beta_1=0\) with the \(t\)-statistic, which follows a \(t\) distribution with \(n-2\) degrees of freedom under \(H_0\):

\[ t = \frac{\hat\beta_1}{\mathrm{SE}(\hat\beta_1)} \]
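The two formulas above can be checked against R's built-in output. The sketch below, which assumes \(\alpha = 0.05\) and re-creates the simulated data so that it runs on its own, computes the interval and the \(t\)-statistic by hand and prints them next to the results of `confint()`. The names `fit`, `b1`, `se_b1`, `ci_manual`, and `t_stat` are illustrative.

```r
# Re-create the simulated housing data (same seed and settings as above)
set.seed(42)
n <- 120
area <- round(rnorm(n, mean = 1800, sd = 450))
area[area < 600] <- 600
price <- 100 + 0.15 * area + rnorm(n, 0, 30)
housing <- data.frame(area, price)

fit <- lm(price ~ area, data = housing)

# Slope estimate and its standard error from the coefficient table
coefs <- summary(fit)$coefficients
b1    <- coefs["area", "Estimate"]
se_b1 <- coefs["area", "Std. Error"]

# Confidence interval: estimate +/- t quantile * standard error
alpha  <- 0.05  # assumed significance level for this example
t_crit <- qt(1 - alpha / 2, df = n - 2)
ci_manual <- c(b1 - t_crit * se_b1, b1 + t_crit * se_b1)

# t-statistic and two-sided p-value for H0: beta1 = 0
t_stat  <- b1 / se_b1
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

print(ci_manual)
print(confint(fit, "area", level = 1 - alpha))  # should match ci_manual
print(c(t = t_stat, p = p_value))
```

An interval that excludes zero and a small p-value both point to the same conclusion: the data are inconsistent with a slope of zero.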