Muhammad Arif
trees dataset (base R)
We model a numerical response \(Y\) with a single predictor \(X\): \[ Y = \beta_0 + \beta_1 X + \varepsilon, \quad \varepsilon \sim N(0,\sigma^2). \]
Ordinary Least Squares (OLS) chooses \(\hat{\beta}_0,\hat{\beta}_1\) to minimize the sum of squared errors: \[ \text{SSE} = \sum_{i=1}^n \bigl(y_i - \hat{y}_i\bigr)^2 = \sum_{i=1}^n \bigl(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\bigr)^2. \]
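For simple linear regression the minimizer has a well-known closed form, \(\hat{\beta}_1 = \sum (x_i-\bar{x})(y_i-\bar{y}) / \sum (x_i-\bar{x})^2\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\). The sketch below computes both by hand on the trees data and cross-checks them against `lm()`. The names `x`, `y`, `b0`, and `b1` are illustrative and not part of the original analysis.

```r
# Closed-form OLS estimates for Volume ~ Girth (illustrative sketch)
data(trees)
x <- trees$Girth
y <- trees$Volume

# Slope = S_xy / S_xx; intercept = ybar - slope * xbar
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Hand-computed estimates
c(intercept = b0, slope = b1)

# Cross-check: lm() should return the same two numbers
coef(lm(Volume ~ Girth, data = trees))
```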
Model fit quality is summarized by: \[ R^2 = 1 - \frac{\text{SSE}}{\text{SST}}, \quad \text{MSE}=\frac{\text{SSE}}{n-2}. \]
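As a rough check of these two formulas, the sketch below recomputes \(R^2\) and MSE from the residuals of the trees model. The object names (`mod_check`, `sse`, `sst`, `r2`, `mse`) are assumptions made for this example only. The square root of MSE is the "Residual standard error" that `summary()` reports.

```r
# Recompute R^2 and MSE by hand (illustrative sketch)
data(trees)
mod_check <- lm(Volume ~ Girth, data = trees)

n   <- nrow(trees)                                   # number of observations
sse <- sum(resid(mod_check)^2)                       # sum of squared errors
sst <- sum((trees$Volume - mean(trees$Volume))^2)    # total sum of squares

r2  <- 1 - sse / sst
mse <- sse / (n - 2)    # two estimated parameters, so n - 2 degrees of freedom

c(R2 = r2, MSE = mse, RSE = sqrt(mse))

# Cross-check against the values stored in the model summary
summary(mod_check)$r.squared
summary(mod_check)$sigma
```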
We will use residual plots to check the model assumptions: a linear mean, constant error variance, and approximately normal errors.
Use the built-in trees data to predict Volume from Girth.
Girth (inches), Volume (cubic feet)
Volume ~ Girth
# Only base R + ggplot2 + plotly as covered in notes
library(ggplot2)
library(plotly)
data(trees)
str(trees)
## 'data.frame': 31 obs. of 3 variables:
## $ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
## $ Height: num 70 65 63 72 81 83 66 75 80 75 ...
## $ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
g <- ggplot(trees, aes(x = Girth, y = Volume)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "firebrick") +
  labs(title = "Tree Volume vs Girth",
       x = "Girth (inches)",
       y = "Volume (cubic ft)") +
  theme_bw()
g
## `geom_smooth()` using formula = 'y ~ x'
# Fit the simple linear regression and print its summary
mod <- lm(Volume ~ Girth, data = trees)
summary(mod)
##
## Call:
## lm(formula = Volume ~ Girth, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.065 -3.107 0.152 3.495 9.587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.9435 3.3651 -10.98 7.62e-12 ***
## Girth 5.0659 0.2474 20.48 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.252 on 29 degrees of freedom
## Multiple R-squared: 0.9353, Adjusted R-squared: 0.9331
## F-statistic: 419.4 on 1 and 29 DF, p-value: < 2.2e-16
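# Estimated coefficients (call reconstructed from the output shown)
coef(mod)
## (Intercept) Girth
## -36.943459 5.065856
# First six fitted values
head(fitted(mod))
## 1 2 3 4 5 6
## 5.103149 6.622906 7.636077 16.248033 17.261205 17.767790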
res_df <- data.frame(
  Girth = trees$Girth,
  Residuals = resid(mod)
)
ggplot(res_df, aes(x = Girth, y = Residuals)) +
  geom_point(alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = 2) +
  labs(title = "Residuals vs Girth", x = "Girth", y = "Residuals") +
  theme_minimal()
We also include a simple 3D plot to satisfy the requirement. It uses the built-in mtcars data and is illustrative only; it is not part of the trees model.
# 3D scatter of mtcars: mpg vs. wt & hp (illustrative)
xax <- list(title = "wt")
yax <- list(title = "hp")
zax <- list(title = "mpg")
plot_ly(mtcars, x = ~wt, y = ~hp, z = ~mpg,
        type = "scatter3d", mode = "markers",
        color = ~as.factor(cyl)) %>%
  layout(title = "mpg vs (wt, hp) - plotly 3D",
         scene = list(xaxis = xax, yaxis = yax, zaxis = zax))
Confidence interval for slope \(\beta_1\): \[ \hat{\beta}_1 \pm t_{\alpha/2,\,n-2}\cdot \text{SE}(\hat{\beta}_1). \]
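The interval formula can be applied directly with `qt()` and the standard error from the coefficient table. The sketch below does this for the trees model at an assumed level of 95% (\(\alpha = 0.05\)) and compares the result with `confint()`. The names `fit_ci`, `est`, `se`, `tcrit`, and `ci_manual` are illustrative.

```r
# 95% confidence interval for the slope, by hand and via confint() (sketch)
data(trees)
fit_ci <- lm(Volume ~ Girth, data = trees)

coefs <- coef(summary(fit_ci))            # coefficient table as a matrix
est   <- coefs["Girth", "Estimate"]
se    <- coefs["Girth", "Std. Error"]

alpha <- 0.05                             # assumed significance level
tcrit <- qt(1 - alpha / 2, df = df.residual(fit_ci))   # t quantile with n - 2 df

ci_manual <- c(lower = est - tcrit * se, upper = est + tcrit * se)
ci_manual

# Built-in equivalent; the two intervals should agree
confint(fit_ci, "Girth", level = 1 - alpha)
```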
For a given \(x_0\), prediction for \(Y_0\): \[ \hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0. \]
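A minimal sketch of this prediction, assuming a hypothetical new tree with girth \(x_0 = 12\) inches (a value chosen only for illustration). It evaluates the formula by hand and then calls `predict()`, which can also return a prediction interval. The names `fit_pred`, `x0`, `y0_manual`, and `pred` are not from the original analysis.

```r
# Point prediction at a hypothetical girth, by hand and via predict() (sketch)
data(trees)
fit_pred <- lm(Volume ~ Girth, data = trees)

x0 <- 12   # hypothetical girth in inches, chosen for illustration

# y0_hat = b0_hat + b1_hat * x0
y0_manual <- unname(coef(fit_pred)[1] + coef(fit_pred)[2] * x0)
y0_manual

# Same point estimate plus a 95% prediction interval (columns fit, lwr, upr)
pred <- predict(fit_pred, newdata = data.frame(Girth = x0),
                interval = "prediction", level = 0.95)
pred
```

The prediction interval is wider than a confidence interval for the mean response at the same \(x_0\), because it also accounts for the error term \(\varepsilon\) of a single new observation.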
# Data
data(trees)
# Model
mod <- lm(Volume ~ Girth, data = trees)
# Fitted values and residuals
head(fitted(mod)); head(resid(mod))
## 1 2 3 4 5 6
## 5.103149 6.622906 7.636077 16.248033 17.261205 17.767790
## 1 2 3 4 5 6
## 5.1968508 3.6770939 2.5639226 0.1519667 1.5387954 1.9322098
# Two ggplots
library(ggplot2)
ggplot(trees, aes(Girth, Volume)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
res_df <- data.frame(Girth = trees$Girth, Residuals = resid(mod))
ggplot(res_df, aes(Girth, Residuals)) + geom_point() + geom_hline(yintercept = 0, linetype = 2)