First, let us establish some key properties. For OLS with an intercept, the residuals sum to zero:
\[ \sum_{i=1}^n (y_i - \hat y_i) = 0 \implies \sum_{i=1}^n y_i = \sum_{i=1}^n \hat y_i \implies \bar y = \bar{\hat y}. \]
Subtracting \( n\bar y \) from both sides of the middle equality gives
\[ \sum_{i=1}^n (y_i - \bar y) = \sum_{i=1}^n (\hat y_i - \bar y). \]
The normal equations also imply that the residuals are orthogonal to the fitted values, \( \sum_{i=1}^n (y_i - \hat y_i)\hat y_i = 0 \), so together with \( \sum_{i=1}^n (y_i - \hat y_i) = 0 \) we get
\[ \sum_{i=1}^n (y_i - \hat y_i)(\hat y_i - \bar y) = 0. \]
We will use these properties in the proof. Let \( \rho(y, \hat y) \) denote the correlation coefficient between \( y \) and \( \hat y \):
\[ \rho(y, \hat y) = \frac{\sum_{i=1}^n (y_i - \bar y)(\hat y_i - \bar{\hat y})}{\sqrt{\sum_{i=1}^n (y_i - \bar y)^2 \sum_{i=1}^n (\hat y_i - \bar{\hat y})^2}} = \frac{\sum_{i=1}^n (y_i - \bar y)(\hat y_i - \bar y)}{\sqrt{\sum_{i=1}^n (y_i - \bar y)^2 \sum_{i=1}^n (\hat y_i - \bar y)^2}}, \]
using \( \bar{\hat y} = \bar y \). Writing \( y_i - \bar y = (y_i - \hat y_i) + (\hat y_i - \bar y) \) and applying the orthogonality property above, the numerator simplifies:
\[ \sum_{i=1}^n (y_i - \bar y)(\hat y_i - \bar y) = \sum_{i=1}^n (\hat y_i - \bar y)^2. \]
Therefore
\[ \rho(y, \hat y) = \frac{\sum_{i=1}^n (\hat y_i - \bar y)^2}{\sqrt{\sum_{i=1}^n (y_i - \bar y)^2 \sum_{i=1}^n (\hat y_i - \bar y)^2}} \implies \rho(y, \hat y)^2 = \frac{\sum_{i=1}^n (\hat y_i - \bar y)^2}{\sum_{i=1}^n (y_i - \bar y)^2}. \]
Also,
\[ \mathbf{R}^2 = 1 - \frac{SSE}{SS_y} = \frac{SS_y - SSE}{SS_y} = \frac{SS_{reg}}{SS_y} = \frac{\sum_{i=1}^n (\hat y_i - \bar y)^2}{\sum_{i=1}^n (y_i - \bar y)^2}, \]
by the definition of the regression sum of squares. Hence
\[ \mathbf{R}^2 = \rho(y, \hat y)^2, \]
i.e. \( R^2 \) equals the squared empirical correlation between \( y \) and \( \hat y \) for OLS in multiple regression.
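As a quick numerical sanity check (a minimal sketch on simulated data, not part of the proof), we can compare the reported R-squared with the squared correlation between the response and the fitted values:
# Illustrative check that R^2 equals cor(y, y_hat)^2
set.seed(1)
x <- runif(100, -1, 1)
y <- rnorm(100, 3 + 0.5 * x, 0.5)
fit <- lm(y ~ x)
c(summary(fit)$r.squared, cor(y, fitted(fit))^2)  # the two values agree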
# Function to simulate and fit
simulate_and_fit <- function(n, N) {
  intercepts <- slopes <- numeric(N)
  for (i in 1:N) {
    # Simulate data
    x <- runif(n, -1, 1)
    y <- rnorm(n, 3 + 0.5 * x, 0.5) # Modified: Generate y according to N(3 + 0.5x, 0.5^2)
    # Fit least squares model
    model <- lm(y ~ x)
    # Record coefficients
    intercepts[i] <- coef(model)[1]
    slopes[i] <- coef(model)[2]
  }
  return(data.frame(Intercept = intercepts, Slope = slopes))
}
# Simulation and fitting for n = 50, 100, 200
n_values <- c(50, 100, 200)
N <- 1000
# Results storage
results <- list()
# Perform simulation for each n
for (n in n_values) {
  results[[as.character(n)]] <- simulate_and_fit(n, N)
}
We are knitting to HTML and then converting it to a PDF, as mfrow is creating rendering issues and misplacing plots when knitting directly to PDF.
# Visualize Marginal Normality
par(mfrow = c(3, 2))
for (n in n_values) {
  data_i <- results[[as.character(n)]]$Intercept
  hist(data_i, main = paste("Intercept, n =", n), col = "lightblue", freq = FALSE,
       xlim = c(min(data_i), max(data_i)))
  # Compute mean and standard deviation of the data
  mean_data_i <- mean(data_i)
  sd_data_i <- sd(data_i)
  # Generate x values for the normal curve
  x_values <- seq(mean_data_i - 3 * sd_data_i, mean_data_i + 3 * sd_data_i, length.out = n)
  # Compute corresponding y values using the normal density and overlay the curve
  y_values <- dnorm(x_values, mean_data_i, sd_data_i)
  lines(x_values, y_values, col = "darkred", lwd = 2)
  data_s <- results[[as.character(n)]]$Slope
  # Compute mean and standard deviation of the data
  mean_data_s <- mean(data_s)
  sd_data_s <- sd(data_s)
  hist(data_s, main = paste("Slope, n =", n), col = "lightgreen", freq = FALSE,
       xlim = c(min(data_s), max(data_s)))
  # Generate x values for the normal curve
  x_values <- seq(mean_data_s - 3 * sd_data_s, mean_data_s + 3 * sd_data_s, length.out = n)
  y_values <- dnorm(x_values, mean_data_s, sd_data_s)
  lines(x_values, y_values, col = "darkred", lwd = 2)
}
We can clearly see that the histograms of intercept and slope coefficients for each sample size exhibit bell-shaped distributions resembling the normal distribution.
As the sample size increases, the histograms become smoother and more tightly concentrated around the true values (intercept 3, slope 0.5). Because the errors here are normal, the least squares estimates are in fact exactly normally distributed given x for every sample size; more generally, the Central Limit Theorem guarantees approximate normality of the estimates in large samples regardless of the distribution of the errors (provided they have finite variance) or of the predictor variable.
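As an optional numeric complement to the histograms (not part of the original analysis), Shapiro-Wilk tests can be applied to the simulated coefficients; large p-values are consistent with marginal normality.
# Optional: Shapiro-Wilk p-values for the simulated intercepts and slopes
for (n in n_values) {
  cat("n =", n,
      "intercept p =", shapiro.test(results[[as.character(n)]]$Intercept)$p.value,
      "slope p =", shapiro.test(results[[as.character(n)]]$Slope)$p.value, "\n")
}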
par(mfrow = c(1, 1))
# Visualize Joint Normality
plot(results$'50', col = 'lightblue', pch = 16, xlab = 'Intercept', ylab = 'Slope', main = 'Joint Distribution, n = 50')
points(results$'100', col = 'lightgreen', pch = 16)
points(results$'200', col = 'lightcoral', pch = 16)
legend("topright", legend = c("n = 50", "n = 100", "n = 200"), col = c('lightblue', 'lightgreen', 'lightcoral'), pch = 16)
library(MASS)
# Function to compute joint density
compute_joint_density <- function(data_i, data_s) {
  # Estimate the bivariate density using kde2d
  density <- kde2d(data_i, data_s)
  return(density)
}
# Visualize Joint Normality with density contours
par(mfrow = c(1, 3))
for (n in n_values) {
  data_i <- results[[as.character(n)]]$Intercept
  data_s <- results[[as.character(n)]]$Slope
  # Create scatter plot
  plot(data_i, data_s, col = 'lightgreen', pch = 16, xlab = 'Intercept', ylab = 'Slope',
       main = paste("Joint Distribution, n =", n))
  # Compute joint density
  density <- compute_joint_density(data_i, data_s)
  # Add contour lines
  contour(density, add = TRUE, col = 'darkred', lwd = 2)
}
We can clearly see that the scatter plots of the intercept and slope coefficients display elliptical density contours, which is consistent with a bivariate normal joint distribution. As the sample size increases, the point clouds contract around the true values (3, 0.5), reflecting the decreasing variance of the estimates, while the elliptical shape of the contours is preserved.
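As a rough check (an approximation, since x is redrawn in every replication), the empirical covariance of the simulated coefficients can be compared with the large-sample covariance \( \sigma^2 (X^\top X)^{-1} \approx (\sigma^2/n)\,\mathrm{diag}(1, 3) \), which follows from \( \sigma = 0.5 \) and \( \mathrm{Var}(x) = 1/3 \) for \( x \sim \mathrm{Uniform}(-1, 1) \):
# Approximate theoretical covariance of (intercept, slope) for n = 200
(0.5^2 / 200) * diag(c(1, 3))
# Empirical covariance of the 1000 simulated coefficient pairs for n = 200
cov(results[["200"]])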
Generate Data and Run Model
# Function to simulate and fit with t-distribution errors
simulate_and_fit_t <- function(n, N, df) {
  intercepts <- slopes <- numeric(N)
  for (i in 1:N) {
    # Simulate data with t-distribution errors: y = 3 + 0.5x + epsilon, epsilon ~ t(df)
    x <- runif(n, -1, 1)
    epsilon <- rt(n, df)
    y <- 3 + 0.5 * x + epsilon
    # Fit least squares model
    model <- lm(y ~ x)
    # Record coefficients
    intercepts[i] <- coef(model)[1]
    slopes[i] <- coef(model)[2]
  }
  return(data.frame(Intercept = intercepts, Slope = slopes))
}
# Simulation and fitting for t-distribution errors
n_values <- c(50, 100, 200)
k_values <- c(2, 5, 10, 20)
N <- 1000
# Results storage for t-distribution errors
results_t <- list()
# Perform simulation for each combination of n and k
for (n in n_values) {
  for (k in k_values) {
    key <- paste("n=", n, "_k=", k, sep = "")
    results_t[[key]] <- simulate_and_fit_t(n, N, k)
  }
}
# Visualize Marginal Normality with t-distribution errors
par(mfrow = c(3, 4), mar = c(4, 4, 2, 1))
for (n in n_values) {
  for (k in k_values) {
    key <- paste("n=", n, "_k=", k, sep = "")
    data_i <- results_t[[key]]$Intercept
    hist(data_i, main = paste("t(", k, "): Intercept, n=", n, sep = ""), include.lowest = TRUE,
         col = "lightblue", xlab = "Intercept Values", xlim = c(2, 4), freq = FALSE)
    # Compute mean and standard deviation of the data
    mean_data_i <- mean(data_i)
    sd_data_i <- sd(data_i)
    # Generate x values for the normal curve
    x_values <- seq(mean_data_i - 3 * sd_data_i, mean_data_i + 3 * sd_data_i, length.out = 100)
    # Compute corresponding y values using the normal density and overlay the curve
    y_values <- dnorm(x_values, mean_data_i, sd_data_i)
    lines(x_values, y_values, col = "darkred", lwd = 2)
    data_s <- results_t[[key]]$Slope
    # Compute mean and standard deviation of the data
    mean_data_s <- mean(data_s)
    sd_data_s <- sd(data_s)
    hist(data_s, main = paste("t(", k, "): Slope, n=", n, sep = ""), include.lowest = TRUE,
         col = "lightgreen", xlab = "Slope Values", xlim = c(0, 1), freq = FALSE)
    # Generate x values for the normal curve
    x_values <- seq(mean_data_s - 3 * sd_data_s, mean_data_s + 3 * sd_data_s, length.out = 100)
    y_values <- dnorm(x_values, mean_data_s, sd_data_s)
    lines(x_values, y_values, col = "darkred", lwd = 2)
  }
}
Similar to the previous scenario, the histograms of the intercept and slope coefficients exhibit bell-shaped distributions. However, their shape varies with the degrees of freedom k of the t-distribution: for lower degrees of freedom (e.g., k = 2), the histograms display heavier tails and some skewness compared to the normal density.
As the degrees of freedom increase (e.g., k = 20), the histograms approach the shape of a normal distribution.
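As an optional numeric complement (not in the original analysis), the empirical spread of the slope estimates can be compared across degrees of freedom; since the t-distribution has infinite variance at k = 2, its spread should be noticeably larger than at k = 20.
# Optional: standard deviation of the simulated slopes by degrees of freedom (n = 200)
sapply(k_values, function(k) sd(results_t[[paste0("n=200_k=", k)]]$Slope))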
# Visualize Joint Normality with t-distribution errors
par(mfrow = c(2, 2))
for (n in n_values) {
  for (k in k_values) {
    key <- paste("n=", n, "_k=", k, sep = "")
    plot(results_t[[key]]$Intercept, results_t[[key]]$Slope, col = 'lightblue', pch = 16,
         xlab = 'Intercept', ylab = 'Slope', main = paste("t(", k, "): Joint Distribution, n =", n))
    # Add legend
    legend("topright", legend = paste("n =", n, ", k =", k), col = 'lightblue', pch = 16,
           title = "Settings", cex = 0.5)
  }
}
#par(mfrow = c(1, 1))
The shape and orientation of the point clouds vary across degrees of freedom but remain roughly elliptical.
For lower degrees of freedom the clouds are more distorted, reflecting deviations from exact bivariate normality; as the degrees of freedom increase they become more symmetric and regular, in better agreement with the joint normality assumption, and the pattern also improves as the number of data points increases.
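To make the elliptical shape explicit, density contours can be overlaid on one of the panels. This is a sketch reusing compute_joint_density from earlier; the choice of n = 200 and k = 5 is arbitrary.
# Overlay kernel density contours on one joint-distribution panel
key <- "n=200_k=5"
dens <- compute_joint_density(results_t[[key]]$Intercept, results_t[[key]]$Slope)
plot(results_t[[key]]$Intercept, results_t[[key]]$Slope, col = 'lightblue', pch = 16,
     xlab = 'Intercept', ylab = 'Slope', main = "t(5): Joint Distribution, n = 200")
contour(dens, add = TRUE, col = 'darkred', lwd = 2)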
# Load the Boston dataset
library(alr4)
## Loading required package: car
## Loading required package: carData
## Loading required package: effects
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
library(car)
library(MASS)
library(ggplot2)
library(ellipse)
##
## Attaching package: 'ellipse'
## The following object is masked from 'package:car':
##
## ellipse
## The following object is masked from 'package:graphics':
##
## pairs
data(Boston)
# Check the structure of the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# Identify unique values and data types of each variable
sapply(Boston, function(x) { c(length(unique(x)), class(x)) })
## crim zn indus chas nox rm age
## [1,] "504" "26" "76" "2" "81" "446" "356"
## [2,] "numeric" "numeric" "numeric" "integer" "numeric" "numeric" "numeric"
## dis rad tax ptratio black lstat medv
## [1,] "412" "9" "66" "46" "357" "455" "229"
## [2,] "numeric" "integer" "numeric" "numeric" "numeric" "numeric" "numeric"
We can see that chas and rad are discrete variables, so we will remove them from our model.
# Fit a Linear Model
data(Boston)
lm_model <- lm(medv ~ . - chas - rad, data = Boston)
summary(lm_model)
##
## Call:
## lm(formula = medv ~ . - chas - rad, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.3315 -2.8771 -0.6792 1.6858 27.4744
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.970e+01 5.051e+00 5.879 7.59e-09 ***
## crim -7.010e-02 3.269e-02 -2.144 0.032482 *
## zn 3.989e-02 1.409e-02 2.831 0.004835 **
## indus -4.198e-02 6.080e-02 -0.691 0.490195
## nox -1.458e+01 3.899e+00 -3.740 0.000206 ***
## rm 4.188e+00 4.255e-01 9.843 < 2e-16 ***
## age -1.868e-03 1.359e-02 -0.137 0.890696
## dis -1.503e+00 2.059e-01 -7.301 1.15e-12 ***
## tax 8.334e-04 2.386e-03 0.349 0.727038
## ptratio -8.738e-01 1.323e-01 -6.607 1.02e-10 ***
## black 8.843e-03 2.763e-03 3.200 0.001461 **
## lstat -5.267e-01 5.224e-02 -10.083 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.899 on 494 degrees of freedom
## Multiple R-squared: 0.7225, Adjusted R-squared: 0.7163
## F-statistic: 116.9 on 11 and 494 DF, p-value: < 2.2e-16
# Checking for Mean Zero assumption
plot(lm_model, which = 1)
cat("Mean of the residuals is: ", mean(lm_model$residuals))
## Mean of the residuals is: 2.106352e-16
As most of the points are scattered randomly around the zero line without any clear pattern, the linearity assumption appears reasonable. Also, the mean of the residuals is essentially zero (about 2.1e-16), which satisfies the mean-zero assumption.
# Checking for the Homoscedasticity assumption
plot(lm_model, which = 3)
Homoscedasticity assumes that the variances of the residuals are finite and roughly equal, so their spread should be relatively constant across the plot. This is a scale-location plot of the square root of the standardized residuals against the fitted values.
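As a formal complement to the visual check (an optional addition; the car package is already loaded above), a score test for non-constant error variance can be run:
# Optional: Breusch-Pagan-type score test for non-constant variance
car::ncvTest(lm_model)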
# Checking for Normality assumption using the Q-Q plot
plot(lm_model, which = 2)
In the normal Q-Q plot we can observe a few outliers. Apart from these, the points lie close to the reference line, so we can conclude that the residuals approximately follow a normal distribution.
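A formal test can supplement the Q-Q plot (an optional addition; with n = 506 even small departures from normality tend to be flagged, so the visual check remains the primary tool):
# Optional: Shapiro-Wilk test on the residuals
shapiro.test(residuals(lm_model))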
plot(lm_model, which = 2)
# Identify the most significant outlier
max_residual <- max(abs(lm_model$residuals))
max_residual_index <- which.max(abs(lm_model$residuals))
max_residual_observation <- row.names(Boston)[max_residual_index]
print(paste("Most significant outlier:", max_residual_observation))
## [1] "Most significant outlier: 369"
print(paste("Absolute value of residual:", max_residual))
## [1] "Absolute value of residual: 27.4744025269803"
# Using Leverage-Residuals^2 plot
plot(hatvalues(lm_model), rstudent(lm_model)^2, pch = 20,
     main = "Outliers in Predictor: Leverage vs. Squared Residuals")
cat("Index of the outlier with the largest squared studentized residual: ", which.max(rstudent(lm_model)^2))
## Index of the outlier with the largest squared studentized residual: 369
The most significant outlier is the observation at index 369. It has a strong influence on the regression coefficients: its value is extreme relative to the rest of the data, and it appears as an outlier in almost all of the previous plots.
When there are outliers in the predictors, check the hat values:
plot(hatvalues(lm_model), type = "h")
p <- length(coef(lm_model)) - 1  # 11 predictors in the fitted model
n <- nrow(Boston)                # 506 observations
abline(h = 2 * (p + 1) / n, lty = 2, col = 'darkred')
The hat-value summarizes the potential contribution of observation i to the fitted values collectively, and is a suitable measure of the leverage of this observation because the ith row of H represents the weight associated with the ith observed response in determining each of the fitted values.
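Beyond the plot (an optional addition, reusing p and n defined above), the observations whose leverage exceeds the 2(p + 1)/n cutoff can be listed directly:
# Optional: indices of observations exceeding the leverage rule of thumb
which(hatvalues(lm_model) > 2 * (p + 1) / n)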
In this case visualising the points directly would be hard, as we have 11 predictor variables and would need a higher-dimensional space to see them.
# Look for extreme values in medv
boxplot(Boston$medv, main = "Boxplot of medv")
# Calculate median and interquartile range
median_medv <- median(Boston$medv)
iqr_medv <- IQR(Boston$medv)
# Define outlier cutoff values
lower_cutoff <- median_medv - 1.5 * iqr_medv
upper_cutoff <- median_medv + 1.5 * iqr_medv
# Identify outliers
outliers <- Boston$medv[Boston$medv < lower_cutoff | Boston$medv > upper_cutoff]
# Print outliers
print(outliers)
## [1] 34.7 33.4 36.2 34.9 35.4 38.7 43.8 33.2 41.3 50.0 50.0 50.0 50.0 37.2 39.8
## [16] 36.2 37.9 50.0 34.9 37.0 36.4 50.0 33.3 34.6 34.9 42.3 48.5 50.0 44.8 50.0
## [31] 37.6 46.7 41.7 48.3 42.8 44.0 50.0 36.0 33.8 43.1 48.8 36.5 50.0 43.5 35.2
## [46] 33.2 35.1 45.4 35.4 46.0 50.0 37.3 36.1 33.4 50.0 50.0 50.0 50.0 50.0 8.8
## [61] 7.2 7.4 8.5 5.0 6.3 5.6 7.2 8.3 8.5 5.0 7.0 7.2 7.5 8.8 8.4
## [76] 8.3 8.7 8.4 7.0 8.1
This code calculates the median and interquartile range (IQR) of the medv variable. Outliers are then flagged when they deviate from the median by more than 1.5 times the IQR; note that the conventional boxplot rule instead uses the quartiles (Q1 - 1.5 IQR and Q3 + 1.5 IQR), so the median-based cutoff used here is a somewhat stricter variant. Any data points that fall well outside the whiskers of the boxplot may be considered outliers, and these could have a significant impact on the analysis.
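For comparison (an optional addition), the sketch below applies the conventional quartile-based cutoffs instead of the median-based ones:
# Conventional quartile-based outlier rule for medv
q <- quantile(Boston$medv, c(0.25, 0.75))
Boston$medv[Boston$medv < q[1] - 1.5 * iqr_medv | Boston$medv > q[2] + 1.5 * iqr_medv]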
Studentized Residuals Concept
The basic idea is to delete the observations one at a time, each time refitting the regression model on the remaining n–1 observations. Then, we compare the observed response values to their fitted values based on the models with the ith observation deleted. This produces deleted residuals. Standardizing the deleted residuals produces studentized residuals.
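Equivalently, the studentized (externally studentized) residual has a closed form that avoids literally refitting the model n times, where \( \hat\sigma_{(i)} \) is the residual standard error from the fit with observation \( i \) deleted and \( h_{ii} \) is its leverage: \[ t_i = \frac{e_i}{\hat\sigma_{(i)} \sqrt{1 - h_{ii}}}. \] This is what rstudent() computes below.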
# Calculating the studentized residuals
studentized_resid <- rstudent(lm_model)
# Identifying outliers based on a threshold (e.g., 2 standard deviations)
outlier_threshold <- 2
outliers <- which(abs(studentized_resid) > outlier_threshold)
# Plotting the calculated studentized residuals
plot(studentized_resid, pch = 20, main = "Outliers in Response")
# Adding horizontal lines at the positive and negative thresholds
abline(h = c(-outlier_threshold, outlier_threshold), col = "red", lty = 2)
Here we take the outlier threshold to be 2 in absolute value. The scatter plot shows the outliers in the response; the most extreme one has index 369, which aligns with the outlier flagged in the predictors.
To check for influential observations, we are using Cook’s distance and leverage. Cook’s distance measures the influence of each observation on the model’s coefficients, while leverage measures how much an observation’s predictor values differ from the mean predictor values.
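In closed form, with \( r_i \) the standardized (internally studentized) residual and \( p' \) the number of estimated coefficients, \[ D_i = \frac{r_i^2}{p'} \cdot \frac{h_{ii}}{1 - h_{ii}}, \] so Cook's distance is large when an observation has both a large residual and high leverage.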
# Calculate Cook's distance and leverage
cook_dist <- cooks.distance(lm_model)
leverage <- hatvalues(lm_model)
# Plot Cook's distance
plot(cook_dist, pch = 20, main = "Cook's Distance", xlab = "Observation", ylab = "Cook's Distance")
abline(h = 4 / nrow(Boston), col = "red")
# Plot leverage
plot(leverage, pch = 20, main = "Leverage", xlab = "Observation", ylab = "Leverage")
abline(h = 0.05, col = "red")
# Identify influential observations based on Cook's distance and leverage
influential_obs <- which(cook_dist > 4 / nrow(Boston) & leverage > 0.05)
# Print influential observations
print(Boston[influential_obs, ])
## crim zn indus chas nox rm age dis rad tax ptratio black
## 215 0.28955 0 10.59 0 0.489 5.412 9.8 3.5875 4 277 18.6 348.93
## 254 0.36894 22 5.86 0 0.431 8.259 8.4 8.9067 7 330 19.1 396.90
## 354 0.01709 90 2.02 0 0.410 6.728 36.1 12.1265 5 187 17.0 384.46
## 365 3.47428 0 18.10 1 0.718 8.780 82.9 1.9047 24 666 20.2 354.55
## 366 4.55587 0 18.10 0 0.718 3.561 87.9 1.6132 24 666 20.2 354.70
## 368 13.52220 0 18.10 0 0.631 3.863 100.0 1.5106 24 666 20.2 131.42
## 369 4.89822 0 18.10 0 0.631 4.970 100.0 1.3325 24 666 20.2 375.52
## 381 88.97620 0 18.10 0 0.671 6.968 91.9 1.4165 24 666 20.2 396.90
## 406 67.92080 0 18.10 0 0.693 5.683 100.0 1.4254 24 666 20.2 384.97
## 413 18.81100 0 18.10 0 0.597 4.628 100.0 1.5539 24 666 20.2 28.79
## 415 45.74610 0 18.10 0 0.693 4.519 100.0 1.6582 24 666 20.2 88.27
## lstat medv
## 215 29.55 23.7
## 254 3.54 42.8
## 354 4.50 30.1
## 365 5.29 21.9
## 366 7.12 27.5
## 368 13.33 23.1
## 369 3.26 50.0
## 381 17.21 10.4
## 406 22.98 5.0
## 413 34.37 17.9
## 415 36.98 7.0
There are 11 influential observations, and we can conclude that they affect the model significantly more than the other points. (Note: index 369 is among them.)
VIF stands for variance inflation factor. VIF values measure the degree of multicollinearity among the predictor variables: a VIF greater than about 5 or 10 indicates potential multicollinearity, i.e. a predictor that is highly correlated with the other predictors. High VIF values inflate the standard errors of the regression coefficients.
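Recall that the VIF of predictor \( x_j \) is computed from the \( R_j^2 \) of the regression of \( x_j \) on all the other predictors: \[ \mathrm{VIF}_j = \frac{1}{1 - R_j^2}. \]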
# Calculate VIFs
vif_values <- car::vif(lm_model)
# Plot VIF values as a bar chart, highlighting the five largest
barplot(vif_values, main = "VIF Values", xlab = "Predictor Variables", ylab = "VIF",
        ylim = c(0, max(vif_values) + 1),
        col = ifelse(vif_values >= sort(vif_values, decreasing = TRUE)[5], "red", "blue"),
        names.arg = names(vif_values))
# Add VIF values over the bars
text(x = 1:length(vif_values), y = vif_values, labels = round(vif_values, 2), pos = 3, col = "black", cex = 0.8)
# Add legend
legend("topright", legend = c("Top 5 Highest VIFs"), fill = c("red"), border = "transparent")