# I am assessing the accuracy of coefficient estimates in simple linear regression.
# My focus is on understanding how well the intercept (β0) and slope (β1) approximate
# the true relationship between X (predictor) and Y (response) in the presence of random error (ϵ).
# I assume that the true relationship is Y = β0 + β1X + ϵ, where ϵ is independent of X
# and has a mean of zero.

# This plot compares the true population regression line (red, dashed) with the least squares
# regression line (blue, solid) fitted to the observed data. The true population line is the
# systematic part of the relationship, E(Y) = 2 + 3X; the observed responses are generated as
# Y = 2 + 3X + ϵ, where ϵ is random error with a mean of zero. The least squares line is estimated
# from the observed data using the coefficients produced by the least squares method.

# The orange points represent the observed data points, which scatter around the true population
# line due to the influence of random error (ϵ). These deviations highlight how real-world data
# rarely align perfectly with theoretical models. The least squares line accounts for these deviations
# by minimizing the sum of squared vertical distances (residuals) between the points and the fitted line.

# The two lines, while not identical, are very close to each other, demonstrating that the least 
# squares method effectively estimates the underlying relationship between X and Y. In this simulated 
# dataset the estimated coefficients land near the true values (β0 = 2, β1 = 3); unbiasedness itself 
# is a property of the estimator averaged over many datasets, which I examine below.

# Observing this plot, I conclude that the least squares line closely approximates the true population
# line for this particular dataset. This supports the reliability of the least squares method in 
# estimating the coefficients when the assumptions of linear regression hold. However, variability 
# in real-world data due to measurement error or other factors could cause greater discrepancies in 
# practice.


# I simulate a data set to explore these concepts:
set.seed(42)  # I set a seed for reproducibility
n <- 100      # I decide to generate 100 observations
x <- rnorm(n, mean = 0, sd = 1)  # I generate random X values from a normal distribution
epsilon <- rnorm(n, mean = 0, sd = 1)  # I generate random error terms (ϵ)

# I define the true population regression line: Y = 2 + 3X + ϵ
y <- 2 + 3 * x + epsilon  # Y values based on the true relationship

# I fit a linear model to estimate the coefficients (β0 and β1) using least squares
lm_fit <- lm(y ~ x)

# I display the summary of my linear model to see the estimated coefficients
summary(lm_fit)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.88842 -0.50664  0.01225  0.54106  2.86240 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.91163    0.09088   21.04   <2e-16 ***
## x            3.02716    0.08767   34.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9083 on 98 degrees of freedom
## Multiple R-squared:  0.9241, Adjusted R-squared:  0.9233 
## F-statistic:  1192 on 1 and 98 DF,  p-value: < 2.2e-16
# I extract the estimated coefficients
beta_0_hat <- coef(lm_fit)[1]  # Intercept (β̂0)
beta_1_hat <- coef(lm_fit)[2]  # Slope (β̂1)
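
# As a sanity check (a minimal sketch, not part of the fitted model itself), I can also compute the
# least squares estimates directly from their closed-form formulas,
# β̂1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² and β̂0 = ȳ - β̂1 * x̄;
# beta_0_manual and beta_1_manual are just illustrative names I introduce here.
beta_1_manual <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope formula
beta_0_manual <- mean(y) - beta_1_manual * mean(x)                          # intercept formula
c(beta_0_manual, beta_1_manual)  # should match coef(lm_fit)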

# I visualize the data and regression lines
plot(x, y, 
     main = "Population vs. Least Squares Regression Lines", 
     xlab = "X (Predictor)", ylab = "Y (Response)",
     col = "darkorange", pch = 16, cex = 1.2,  # Points in orange
     bg = "lightgray", xlim = c(-3, 3), ylim = c(-10, 15))
abline(2, 3, col = "red", lwd = 3, lty = 2)  # True population regression line in bold red
abline(lm_fit, col = "blue", lwd = 3)  # Least squares regression line in bold blue
legend("topleft", legend = c("True Population Line", "Least Squares Line"),
       col = c("red", "blue"), lty = c(2, 1), lwd = 3, bg = "white")

# I interpret the plot:
# The red dashed line represents the true population regression line (Y = 2 + 3X).
# The blue line is my least squares regression line based on the observed data.
# The two lines are close, showing that the least squares method provides a good approximation.

# I assess the variability in the least squares estimates:
# I generate multiple datasets to observe variability in the regression lines.
n_simulations <- 10  # Number of datasets I simulate
plot(x, y, 
     main = "Variability in Least Squares Lines", 
     xlab = "X (Predictor)", ylab = "Y (Response)",
     col = "gray", pch = 16, cex = 1.2, bg = "lightgray", xlim = c(-3, 3), ylim = c(-10, 15))
abline(2, 3, col = "red", lwd = 3, lty = 2)  # True population regression line

for (i in 1:n_simulations) {
  y_sim <- 2 + 3 * x + rnorm(n, mean = 0, sd = 1)  # New dataset
  lm_sim <- lm(y_sim ~ x)  # Fit least squares line
  abline(lm_sim, col = adjustcolor("lightblue", alpha.f = 0.5), lwd = 2)  # Plot least squares line
}
abline(lm_fit, col = "blue", lwd = 3)  # Add the first least squares line for comparison
legend("topleft", legend = c("True Population Line", "Least Squares Lines"),
       col = c("red", "lightblue"), lty = c(2, 1), lwd = 3, bg = "white")

# I notice variability in the light blue least squares lines due to random error.
# However, the lines scatter around the true population line rather than sitting systematically above
# or below it, which is consistent with the unbiasedness of the least squares estimates: averaged over
# many datasets, the estimated coefficients equal the true coefficients.
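
# To make the unbiasedness claim concrete, I could average the estimates over many simulated datasets;
# a rough sketch below, where 1000 replications is an arbitrary choice and x is held fixed as above.
estimates <- replicate(1000, {
  y_rep <- 2 + 3 * x + rnorm(n, mean = 0, sd = 1)  # new response under the same true model
  coef(lm(y_rep ~ x))                              # re-estimate beta0 and beta1
})
rowMeans(estimates)  # the averages should be close to the true values (2, 3)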

# I calculate confidence intervals for the coefficients:
confint(lm_fit, level = 0.95)
##                2.5 %   97.5 %
## (Intercept) 1.731289 2.091977
## x           2.853191 3.201128
# I interpret the confidence intervals:
# For β0 (intercept), the 95% confidence interval is approximately [1.73, 2.09]; under repeated
# sampling, intervals constructed this way would contain the true intercept about 95% of the time.
# For β1 (slope), the interval is approximately [2.85, 3.20]. Both intervals contain the true values
# (β0 = 2, β1 = 3) used to simulate the data.
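
# For reference, the same intervals can be built by hand as estimate ± t-quantile × standard error,
# with the t-quantile based on the residual degrees of freedom; a small sketch:
est    <- coef(lm_fit)                                   # point estimates
se     <- summary(lm_fit)$coefficients[, "Std. Error"]   # standard errors
t_crit <- qt(0.975, df = df.residual(lm_fit))            # critical value for a 95% interval
cbind(lower = est - t_crit * se, upper = est + t_crit * se)  # should match confint(lm_fit)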

# I test the hypothesis H0: β1 = 0 (no relationship between X and Y)
# The summary of the model provides a t-statistic and p-value for β1.
# If the p-value is small (e.g., < 0.05), I reject the null hypothesis,
# concluding that there is evidence of a relationship between X and Y.
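
# To pull the test statistic out programmatically rather than reading it off the printed summary,
# a short sketch (t_stat and p_value are illustrative names I introduce here):
coef_table <- summary(lm_fit)$coefficients
t_stat  <- coef_table["x", "t value"]    # t-statistic for the slope
p_value <- coef_table["x", "Pr(>|t|)"]   # two-sided p-value for H0: beta1 = 0
cat("t-statistic:", t_stat, "  p-value:", p_value, "\n")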

# I calculate the residual standard error (RSE) to assess the model's fit:
rse <- summary(lm_fit)$sigma
cat("Residual Standard Error (RSE):", rse, "\n")
## Residual Standard Error (RSE): 0.9083303
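
# The RSE can also be reproduced from the residuals directly, using n - 2 degrees of freedom
# for simple linear regression; a quick sketch:
rss        <- sum(residuals(lm_fit)^2)   # residual sum of squares
rse_manual <- sqrt(rss / (n - 2))        # should equal summary(lm_fit)$sigma
cat("Manual RSE:", rse_manual, "\n")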