# In this analysis, I am interpreting the results of the Linear Regression Fit and Residuals plot.
# I created this plot to visualize the relationship between the predictor (X) and the response (Y)
# and to assess the accuracy of my linear regression model.
# As I examine the plot, I see the blue regression line running through the data points, which shows
# the linear relationship estimated by my model. This regression line is my model's best attempt
# to capture the true relationship between X and Y, obtained by minimizing the sum of squared residuals (errors).
# The orange points represent the observed data, which scatter around the regression line.
# These points reflect the real-world deviations caused by the random error (ϵ) inherent in the data.
# For each observed point, I drew a red line connecting it to the predicted value on the regression line.
# These red lines highlight the residuals—the difference between the observed value and the predicted value.
# Observing these residuals helps me evaluate how well the model fits the data.
# I notice that the residuals are relatively small and evenly distributed around the regression line.
# This indicates that my model performs well in predicting the response variable, with no systematic pattern
# of under- or over-prediction. The spread of the residuals also appears to remain consistent across
# different values of X, suggesting that the assumption of constant variance (homoscedasticity) holds true.
# Next, I calculated two key metrics to quantify the accuracy of my model: the Residual Standard Error (RSE)
# and \( R^2 \). The RSE provides me with an absolute measure of how much the observed values deviate from
# the regression line on average. For example, an RSE of 3.26 would mean my predictions are typically
# off by about 3.26 units. I use this value to assess whether my model's error is acceptable given the
# scale of my response variable.
# On the other hand, \( R^2 \) tells me how much of the variability in Y is explained by X.
# For example, an \( R^2 \) of 0.61 would mean that 61% of the variability in Y is captured by my model.
# A value like that would give me confidence in the model's ability to explain a substantial portion of the
# variation in the response that is associated with the predictor.
# In conclusion, I see that my linear regression model fits the data well. The residuals are small
# and evenly distributed, and the metrics support my visual observations. By combining visual analysis
# with numerical measures like RSE and \( R^2 \), I am confident in the accuracy of my model
# and its ability to describe the relationship between X and Y. This analysis strengthens my understanding
# of how well linear regression can approximate real-world relationships and helps me evaluate
# the model's usefulness for prediction.
# Simulate a dataset to explore these metrics
set.seed(42) # I set a seed for reproducibility
n <- 100 # Number of observations
x <- rnorm(n, mean = 0, sd = 1) # Predictor variable X
epsilon <- rnorm(n, mean = 0, sd = 1) # Random error term ϵ
# Define the true relationship: Y = 2 + 3X + ϵ
y <- 2 + 3 * x + epsilon # Response variable Y
# Fit a linear regression model
lm_fit <- lm(y ~ x)
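# As a quick supplementary sanity check of the fit, the estimated coefficients can be compared
# with the true intercept (2) and slope (3) used to generate the data.
print(coef(lm_fit)) # Estimates should be close to the true values 2 and 3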
# Calculate Residual Standard Error (RSE)
# RSE estimates the standard deviation of the error term ϵ, i.e., roughly the average amount
# by which observed values deviate from the true regression line.
rss <- sum(residuals(lm_fit)^2) # Residual Sum of Squares (RSS)
rse <- sqrt(rss / (n - 2)) # RSE formula
cat("Residual Standard Error (RSE):", rse, "\n")
## Residual Standard Error (RSE): 0.9083303
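# As a cross-check, summary() computes the same quantity, so summary(lm_fit)$sigma should match
# the manual RSE calculated above.
cat("RSE from summary():", summary(lm_fit)$sigma, "\n")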
# Calculate \( R^2 \)
# \( R^2 \) is the proportion of variance in the response explained by the predictor.
tss <- sum((y - mean(y))^2) # Total Sum of Squares (TSS)
r_squared <- 1 - (rss / tss) # \( R^2 \) formula
cat("R-squared (R^2):", r_squared, "\n")
## R-squared (R^2): 0.9240538
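# Likewise, summary(lm_fit)$r.squared should match the manual RSS/TSS calculation above.
cat("R-squared from summary():", summary(lm_fit)$r.squared, "\n")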
# Visualize the regression model and residuals
plot(x, y,
     main = "Linear Regression Fit and Residuals",
     xlab = "Predictor (X)", ylab = "Response (Y)",
     col = "darkorange", pch = 16)
abline(lm_fit, col = "blue", lwd = 2) # Add the regression line
legend("topleft", legend = c("Regression Line"), col = c("blue"), lwd = 2, bg = "white")
# Add residuals to the plot for visualization
for (i in 1:n) {
lines(c(x[i], x[i]), c(y[i], predict(lm_fit)[i]), col = "red")
}
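# To check the homoscedasticity claim discussed above, a supplementary residuals-vs-fitted
# diagnostic plot is a standard next step: the residuals should scatter evenly around zero
# with no funnel shape or curvature.
plot(fitted(lm_fit), residuals(lm_fit),
     main = "Residuals vs Fitted Values",
     xlab = "Fitted values", ylab = "Residuals",
     col = "darkorange", pch = 16)
abline(h = 0, col = "blue", lty = 2) # Horizontal reference line at zero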

# Interpret the metrics
# RSE (Residual Standard Error):
# This value estimates the average deviation of the response variable from the true regression line.
# Here, the RSE of about 0.91 indicates that predictions deviate from the observed values by roughly
# 0.91 units on average, close to the error standard deviation of 1 used to generate the data.
# Whether such an error is acceptable depends on the problem context and the scale of the response variable.
# R^2 (Proportion of Variance Explained):
# The \( R^2 \) value represents the fraction of the variance in the response variable
# that is explained by the predictor variable. An \( R^2 \) close to 1 indicates a strong
# linear relationship, while a value near 0 indicates a weak one. Here, \( R^2 \approx 0.92 \)
# means that about 92% of the variability in the response is explained by the predictor.
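# Finally, both metrics, along with the coefficient estimates and their standard errors, are
# reported together by summary(), which offers a convenient way to confirm the values computed
# manually above.
print(summary(lm_fit))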