# I am assessing the accuracy of coefficient estimates in simple linear regression.
# My focus is on understanding how well the intercept (β0) and slope (β1) approximate
# the true relationship between X (predictor) and Y (response) in the presence of random error (ϵ).
# I assume that the true relationship is Y = β0 + β1X + ϵ, where ϵ is independent of X
# and has a mean of zero.
# This plot compares the true population regression line (red, dashed) with the least squares
# regression line (blue, solid) based on the observed data. The true population line represents
# the systematic part of the relationship, Y = 2 + 3X; the observed responses are generated as
# Y = 2 + 3X + ϵ, where ϵ is random error with a mean of zero. The least squares line is
# estimated from the observed data using coefficients derived from the least squares method.
# The orange points represent the observed data points, which scatter around the true population
# line due to the influence of random error (ϵ). These deviations highlight how real-world data
# rarely aligns perfectly with theoretical models. The least squares line attempts to minimize
# these deviations by fitting the data as closely as possible.
# The two lines, while not identical, are very close to each other, demonstrating that the least
# squares method effectively estimates the underlying relationship between X and Y. This close
# agreement is consistent with the unbiasedness of the least squares estimators of the true
# coefficients (β0 = 2, β1 = 3), although a single dataset cannot establish unbiasedness on its own.
# Observing this plot, I conclude that the least squares line closely approximates the true population
# line for this particular dataset. This supports the reliability of the least squares method in
# estimating the coefficients when the assumptions of linear regression hold. However, variability
# in real-world data due to measurement error or other factors could cause greater discrepancies in
# practice.
# I simulate a data set to explore these concepts:
set.seed(42) # I set a seed for reproducibility
n <- 100 # I decide to generate 100 observations
x <- rnorm(n, mean = 0, sd = 1) # I generate random X values from a normal distribution
epsilon <- rnorm(n, mean = 0, sd = 1) # I generate random error terms (ϵ)
# I define the true population regression line: Y = 2 + 3X + ϵ
y <- 2 + 3 * x + epsilon # Y values based on the true relationship
# I fit a linear model to estimate the coefficients (β0 and β1) using least squares
lm_fit <- lm(y ~ x)
# I display the summary of my linear model to see the estimated coefficients
summary(lm_fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.88842 -0.50664  0.01225  0.54106  2.86240
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.91163    0.09088   21.04   <2e-16 ***
## x            3.02716    0.08767   34.53   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9083 on 98 degrees of freedom
## Multiple R-squared: 0.9241, Adjusted R-squared: 0.9233
## F-statistic: 1192 on 1 and 98 DF, p-value: < 2.2e-16
# I extract the estimated coefficients
beta_0_hat <- coef(lm_fit)[1] # Intercept (β̂0)
beta_1_hat <- coef(lm_fit)[2] # Slope (β̂1)
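# As a quick sanity check (a small sketch; the "true" values 2 and 3 come from the simulation above),
# I compare the estimates with the coefficients I used to generate the data:
cat("Estimated intercept:", beta_0_hat, "vs. true intercept: 2\n")
cat("Estimated slope:    ", beta_1_hat, "vs. true slope: 3\n")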
# I visualize the data and regression lines
plot(x, y,
     main = "Population vs. Least Squares Regression Lines",
     xlab = "X (Predictor)", ylab = "Y (Response)",
     col = "darkorange", pch = 16, cex = 1.2, # Points in orange
     bg = "lightgray", xlim = c(-3, 3), ylim = c(-10, 15))
abline(2, 3, col = "red", lwd = 3, lty = 2) # True population regression line in bold red
abline(lm_fit, col = "blue", lwd = 3) # Least squares regression line in bold blue
legend("topleft", legend = c("True Population Line", "Least Squares Line"),
col = c("red", "blue"), lty = c(2, 1), lwd = 3, bg = "white")

# I interpret the plot:
# The red dashed line represents the true population regression line (Y = 2 + 3X).
# The blue line is my least squares regression line based on the observed data.
# The two lines are close, showing that the least squares method provides a good approximation.
# I assess the variability in the least squares estimates:
# I generate multiple datasets to observe variability in the regression lines.
n_simulations <- 10 # Number of datasets I simulate
plot(x, y,
     main = "Variability in Least Squares Lines",
     xlab = "X (Predictor)", ylab = "Y (Response)",
     col = "gray", pch = 16, cex = 1.2, bg = "lightgray", xlim = c(-3, 3), ylim = c(-10, 15))
abline(2, 3, col = "red", lwd = 3, lty = 2) # True population regression line
for (i in 1:n_simulations) {
  y_sim <- 2 + 3 * x + rnorm(n, mean = 0, sd = 1) # New dataset
  lm_sim <- lm(y_sim ~ x) # Fit least squares line
  abline(lm_sim, col = adjustcolor("lightblue", alpha.f = 0.5), lwd = 2) # Plot least squares line
}
abline(lm_fit, col = "blue", lwd = 3) # Add the first least squares line for comparison
legend("topleft", legend = c("True Population Line", "Least Squares Lines"),
col = c("red", "lightblue"), lty = c(2, 1), lwd = 3, bg = "white")

# I notice variability in the light blue least squares lines due to random error.
# However, on average, the least squares lines stay close to the true population line,
# which is consistent with the unbiasedness of the least squares estimates.
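# To check this averaging argument more directly (a rough sketch; the choice of 1,000
# replications is mine, not part of the original analysis), I refit the model on many
# freshly simulated responses and average the estimated coefficients:
n_reps <- 1000
beta_estimates <- replicate(n_reps, {
  y_rep <- 2 + 3 * x + rnorm(n, mean = 0, sd = 1) # New response values with fresh errors
  coef(lm(y_rep ~ x)) # Estimated (β̂0, β̂1) for this replication
})
rowMeans(beta_estimates) # Averages should be close to the true values (2, 3)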
# I calculate confidence intervals for the coefficients:
confint(lm_fit, level = 0.95)
##                2.5 %   97.5 %
## (Intercept) 1.731289 2.091977
## x           2.853191 3.201128
# I interpret the confidence intervals:
# For β0 (intercept), the 95% confidence interval is approximately [1.73, 2.09]; I am 95%
# confident that an interval constructed this way captures the true intercept, and here it
# does contain the true value of 2.
# For β1 (slope), the interval is approximately [2.85, 3.20], which likewise contains the
# true slope of 3.
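# To see where these intervals come from (a sketch using the usual formula estimate ± t * SE
# with the model's residual degrees of freedom), I reconstruct the slope interval by hand:
slope_se <- summary(lm_fit)$coefficients["x", "Std. Error"]
t_crit <- qt(0.975, df = df.residual(lm_fit)) # 97.5th percentile of the t distribution
c(beta_1_hat - t_crit * slope_se, beta_1_hat + t_crit * slope_se) # Should match confint() above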
# I test the hypothesis H0: β1 = 0 (no relationship between X and Y)
# The summary of the model provides a t-statistic and p-value for β1.
# If the p-value is small (e.g., < 0.05), I reject the null hypothesis,
# concluding that there is evidence of a relationship between X and Y.
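# For reference (a small sketch of pulling these quantities out of the fitted model object),
# I extract the t-statistic and p-value for β1 directly from the coefficient table:
slope_t <- summary(lm_fit)$coefficients["x", "t value"]
slope_p <- summary(lm_fit)$coefficients["x", "Pr(>|t|)"]
cat("t-statistic for beta_1:", slope_t, " p-value:", slope_p, "\n")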
# I calculate the residual standard error (RSE) to assess the model's fit:
rse <- summary(lm_fit)$sigma
cat("Residual Standard Error (RSE):", rse, "\n")
## Residual Standard Error (RSE): 0.9083303
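# As a check on what the RSE measures (a sketch based on the formula RSE = sqrt(RSS / (n - 2))),
# I recompute it from the residuals:
rss <- sum(residuals(lm_fit)^2) # Residual sum of squares
rse_manual <- sqrt(rss / (n - 2)) # n - 2 degrees of freedom in simple linear regression
cat("RSE computed by hand:", rse_manual, "\n") # Should match summary(lm_fit)$sigma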