First, let us establish some key properties. For OLS with an intercept, the residuals sum to zero:
\[ \sum_{i=1}^n (y_i - \hat y_i) = 0 \implies \sum_{i=1}^n y_i = \sum_{i=1}^n \hat y_i \implies \bar y = \bar{\hat y}. \]
Subtracting \( n\bar y \) from both sides of the middle equality gives
\[ \sum_{i=1}^n (y_i - \bar y) = \sum_{i=1}^n (\hat y_i - \bar y). \]
The normal equations also imply that the residuals are orthogonal to the fitted values, \( \sum_{i=1}^n (y_i - \hat y_i)\hat y_i = 0 \), so together with \( \sum_{i=1}^n (y_i - \hat y_i) = 0 \) we get
\[ \sum_{i=1}^n (y_i - \hat y_i)(\hat y_i - \bar y) = 0. \]
We will use these properties in the proof. Let \( \rho(y, \hat y) \) denote the correlation coefficient between \( y \) and \( \hat y \):
\[ \rho(y, \hat y) = \frac{\sum_{i=1}^n (y_i - \bar y)(\hat y_i - \bar{\hat y})}{\sqrt{\sum_{i=1}^n (y_i - \bar y)^2 \sum_{i=1}^n (\hat y_i - \bar{\hat y})^2}} = \frac{\sum_{i=1}^n (y_i - \bar y)(\hat y_i - \bar y)}{\sqrt{\sum_{i=1}^n (y_i - \bar y)^2 \sum_{i=1}^n (\hat y_i - \bar y)^2}}, \]
using \( \bar{\hat y} = \bar y \). Writing \( y_i - \bar y = (y_i - \hat y_i) + (\hat y_i - \bar y) \) and applying the orthogonality property above, the numerator simplifies:
\[ \sum_{i=1}^n (y_i - \bar y)(\hat y_i - \bar y) = \sum_{i=1}^n (\hat y_i - \bar y)^2. \]
Therefore
\[ \rho(y, \hat y) = \frac{\sum_{i=1}^n (\hat y_i - \bar y)^2}{\sqrt{\sum_{i=1}^n (y_i - \bar y)^2 \sum_{i=1}^n (\hat y_i - \bar y)^2}} \implies \rho(y, \hat y)^2 = \frac{\sum_{i=1}^n (\hat y_i - \bar y)^2}{\sum_{i=1}^n (y_i - \bar y)^2}. \]
Also,
\[ \mathbf{R}^2 = 1 - \frac{SSE}{SS_y} = \frac{SS_y - SSE}{SS_y} = \frac{SS_{reg}}{SS_y} = \frac{\sum_{i=1}^n (\hat y_i - \bar y)^2}{\sum_{i=1}^n (y_i - \bar y)^2}, \]
by the definition of the regression sum of squares. Hence
\[ \mathbf{R}^2 = \rho(y, \hat y)^2, \]
i.e. \( R^2 \) equals the squared empirical correlation between \( y \) and \( \hat y \) for OLS in multiple regression.
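As a quick numerical sanity check (a minimal sketch on simulated data, not part of the proof), we can compare the reported R-squared with the squared correlation between the response and the fitted values:
# Illustrative check that R^2 equals cor(y, y_hat)^2
set.seed(1)
x <- runif(100, -1, 1)
y <- rnorm(100, 3 + 0.5 * x, 0.5)
fit <- lm(y ~ x)
c(summary(fit)$r.squared, cor(y, fitted(fit))^2)  # the two values agree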
# Function to simulate and fit
simulate_and_fit <- function(n, N) {
  intercepts <- slopes <- numeric(N)
  for (i in 1:N) {
    # Simulate data
    x <- runif(n, -1, 1)
    y <- rnorm(n, 3 + 0.5 * x, 0.5) # Modified: Generate y according to N(3 + 0.5x, 0.5^2)
    # Fit least squares model
    model <- lm(y ~ x)
    # Record coefficients
    intercepts[i] <- coef(model)[1]
    slopes[i] <- coef(model)[2]
  }
  return(data.frame(Intercept = intercepts, Slope = slopes))
}
# Simulation and fitting for n = 50, 100, 200
n_values <- c(50, 100, 200)
N <- 1000
# Results storage
results <- list()
# Perform simulation for each n
for (n in n_values) {
  results[[as.character(n)]] <- simulate_and_fit(n, N)
}
We are knitting to HTML and then converting it to a PDF, as mfrow is creating rendering issues and misplacing plots when knitting directly to PDF.
# Visualize Marginal Normality
par(mfrow = c(3, 2))
for (n in n_values) {
  data_i <- results[[as.character(n)]]$Intercept
  hist(data_i, main = paste("Intercept, n =", n), col = "lightblue", freq = FALSE,
       xlim = c(min(data_i), max(data_i)))
  # Compute mean and standard deviation of the data
  mean_data_i <- mean(data_i)
  sd_data_i <- sd(data_i)
  # Generate x values for the normal curve
  x_values <- seq(mean_data_i - 3 * sd_data_i, mean_data_i + 3 * sd_data_i, length.out = n)
  # Compute corresponding y values using the normal density and overlay the curve
  y_values <- dnorm(x_values, mean_data_i, sd_data_i)
  lines(x_values, y_values, col = "darkred", lwd = 2)
  data_s <- results[[as.character(n)]]$Slope
  # Compute mean and standard deviation of the data
  mean_data_s <- mean(data_s)
  sd_data_s <- sd(data_s)
  hist(data_s, main = paste("Slope, n =", n), col = "lightgreen", freq = FALSE,
       xlim = c(min(data_s), max(data_s)))
  # Generate x values for the normal curve
  x_values <- seq(mean_data_s - 3 * sd_data_s, mean_data_s + 3 * sd_data_s, length.out = n)
  y_values <- dnorm(x_values, mean_data_s, sd_data_s)
  lines(x_values, y_values, col = "darkred", lwd = 2)
}
We can clearly see that the histograms of intercept and slope coefficients for each sample size exhibit bell-shaped distributions resembling the normal distribution.
As the sample size increases, the histograms become smoother and more tightly concentrated around the true values (intercept 3, slope 0.5). Because the errors here are normal, the least squares estimates are in fact exactly normally distributed given x for every sample size; more generally, the Central Limit Theorem guarantees approximate normality of the estimates in large samples regardless of the distribution of the errors (provided they have finite variance) or of the predictor variable.
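As an optional numeric complement to the histograms (not part of the original analysis), Shapiro-Wilk tests can be applied to the simulated coefficients; large p-values are consistent with marginal normality.
# Optional: Shapiro-Wilk p-values for the simulated intercepts and slopes
for (n in n_values) {
  cat("n =", n,
      "intercept p =", shapiro.test(results[[as.character(n)]]$Intercept)$p.value,
      "slope p =", shapiro.test(results[[as.character(n)]]$Slope)$p.value, "\n")
}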
par(mfrow = c(1, 1))
# Visualize Joint Normality
plot(results$'50', col = 'lightblue', pch = 16, xlab = 'Intercept', ylab = 'Slope', main = 'Joint Distribution, n = 50')
points(results$'100', col = 'lightgreen', pch = 16)
points(results$'200', col = 'lightcoral', pch = 16)
legend("topright", legend = c("n = 50", "n = 100", "n = 200"), col = c('lightblue', 'lightgreen', 'lightcoral'), pch = 16)
library(MASS)
# Function to compute joint density
compute_joint_density <- function(data_i, data_s) {
  # Estimate the bivariate density using kde2d
  density <- kde2d(data_i, data_s)
  return(density)
}
# Visualize Joint Normality with density contours
par(mfrow = c(1, 3))
for (n in n_values) {
  data_i <- results[[as.character(n)]]$Intercept
  data_s <- results[[as.character(n)]]$Slope
  # Create scatter plot
  plot(data_i, data_s, col = 'lightgreen', pch = 16, xlab = 'Intercept', ylab = 'Slope',
       main = paste("Joint Distribution, n =", n))
  # Compute joint density
  density <- compute_joint_density(data_i, data_s)
  # Add contour lines
  contour(density, add = TRUE, col = 'darkred', lwd = 2)
}
We can clearly see that the scatter plots of the intercept and slope coefficients display elliptical density contours, which is consistent with a bivariate normal joint distribution. As the sample size increases, the point clouds contract around the true values (3, 0.5), reflecting the decreasing variance of the estimates, while the elliptical shape of the contours is preserved.
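As a rough check (an approximation, since x is redrawn in every replication), the empirical covariance of the simulated coefficients can be compared with the large-sample covariance \( \sigma^2 (X^\top X)^{-1} \approx (\sigma^2/n)\,\mathrm{diag}(1, 3) \), which follows from \( \sigma = 0.5 \) and \( \mathrm{Var}(x) = 1/3 \) for \( x \sim \mathrm{Uniform}(-1, 1) \):
# Approximate theoretical covariance of (intercept, slope) for n = 200
(0.5^2 / 200) * diag(c(1, 3))
# Empirical covariance of the 1000 simulated coefficient pairs for n = 200
cov(results[["200"]])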
Generate Data and Run Model
# Function to simulate and fit with t-distribution errors
simulate_and_fit_t <- function(n, N, df) {
  intercepts <- slopes <- numeric(N)
  for (i in 1:N) {
    # Simulate data with t-distribution errors: y = 3 + 0.5x + epsilon, epsilon ~ t(df)
    x <- runif(n, -1, 1)
    epsilon <- rt(n, df)
    y <- 3 + 0.5 * x + epsilon
    # Fit least squares model
    model <- lm(y ~ x)
    # Record coefficients
    intercepts[i] <- coef(model)[1]
    slopes[i] <- coef(model)[2]
  }
  return(data.frame(Intercept = intercepts, Slope = slopes))
}
# Simulation and fitting for t-distribution errors
n_values <- c(50, 100, 200)
k_values <- c(2, 5, 10, 20)
N <- 1000
# Results storage for t-distribution errors
results_t <- list()
# Perform simulation for each combination of n and k
for (n in n_values) {
  for (k in k_values) {
    key <- paste("n=", n, "_k=", k, sep = "")
    results_t[[key]] <- simulate_and_fit_t(n, N, k)
  }
}
# Visualize Marginal Normality with t-distribution errors
par(mfrow = c(3, 4), mar = c(4, 4, 2, 1))
for (n in n_values) {
  for (k in k_values) {
    key <- paste("n=", n, "_k=", k, sep = "")
    data_i <- results_t[[key]]$Intercept
    hist(data_i, main = paste("t(", k, "): Intercept, n=", n, sep = ""), include.lowest = TRUE,
         col = "lightblue", xlab = "Intercept Values", xlim = c(2, 4), freq = FALSE)
    # Compute mean and standard deviation of the data
    mean_data_i <- mean(data_i)
    sd_data_i <- sd(data_i)
    # Generate x values for the normal curve
    x_values <- seq(mean_data_i - 3 * sd_data_i, mean_data_i + 3 * sd_data_i, length.out = 100)
    # Compute corresponding y values using the normal density and overlay the curve
    y_values <- dnorm(x_values, mean_data_i, sd_data_i)
    lines(x_values, y_values, col = "darkred", lwd = 2)
    data_s <- results_t[[key]]$Slope
    # Compute mean and standard deviation of the data
    mean_data_s <- mean(data_s)
    sd_data_s <- sd(data_s)
    hist(data_s, main = paste("t(", k, "): Slope, n=", n, sep = ""), include.lowest = TRUE,
         col = "lightgreen", xlab = "Slope Values", xlim = c(0, 1), freq = FALSE)
    # Generate x values for the normal curve
    x_values <- seq(mean_data_s - 3 * sd_data_s, mean_data_s + 3 * sd_data_s, length.out = 100)
    y_values <- dnorm(x_values, mean_data_s, sd_data_s)
    lines(x_values, y_values, col = "darkred", lwd = 2)
  }
}
Similar to the previous scenario, the histograms of the intercept and slope coefficients exhibit bell-shaped distributions. However, their shape varies with the degrees of freedom k of the t-distribution: for lower degrees of freedom (e.g., k = 2), the histograms display heavier tails and some skewness compared to the normal density.
As the degrees of freedom increase (e.g., k = 20), the histograms approach the shape of a normal distribution.
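As an optional numeric complement (not in the original analysis), the empirical spread of the slope estimates can be compared across degrees of freedom; since the t-distribution has infinite variance at k = 2, its spread should be noticeably larger than at k = 20.
# Optional: standard deviation of the simulated slopes by degrees of freedom (n = 200)
sapply(k_values, function(k) sd(results_t[[paste0("n=200_k=", k)]]$Slope))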
# Visualize Joint Normality with t-distribution errors
par(mfrow = c(2, 2))
for (n in n_values) {
  for (k in k_values) {
    key <- paste("n=", n, "_k=", k, sep = "")
    plot(results_t[[key]]$Intercept, results_t[[key]]$Slope, col = 'lightblue', pch = 16,
         xlab = 'Intercept', ylab = 'Slope', main = paste("t(", k, "): Joint Distribution, n =", n))
    # Add legend
    legend("topright", legend = paste("n =", n, ", k =", k), col = 'lightblue', pch = 16,
           title = "Settings", cex = 0.5)
  }
}
#par(mfrow = c(1, 1))
The shape and orientation of the point clouds vary across degrees of freedom but remain roughly elliptical.
For lower degrees of freedom the clouds are more distorted, reflecting deviations from exact bivariate normality; as the degrees of freedom increase they become more symmetric and regular, in better agreement with the joint normality assumption, and the pattern also improves as the number of data points increases.
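To make the elliptical shape explicit, density contours can be overlaid on one of the panels. This is a sketch reusing compute_joint_density from earlier; the choice of n = 200 and k = 5 is arbitrary.
# Overlay kernel density contours on one joint-distribution panel
key <- "n=200_k=5"
dens <- compute_joint_density(results_t[[key]]$Intercept, results_t[[key]]$Slope)
plot(results_t[[key]]$Intercept, results_t[[key]]$Slope, col = 'lightblue', pch = 16,
     xlab = 'Intercept', ylab = 'Slope', main = "t(5): Joint Distribution, n = 200")
contour(dens, add = TRUE, col = 'darkred', lwd = 2)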
# Load the Boston dataset
library(alr4)
## Loading required package: car
## Loading required package: carData
## Loading required package: effects
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
library(car)
library(MASS)
library(ggplot2)
library(ellipse)
##
## Attaching package: 'ellipse'
## The following object is masked from 'package:car':
##
## ellipse
## The following object is masked from 'package:graphics':
##
## pairs
data(Boston)
# Check the structure of the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# Identify unique values and data types of each variable
sapply(Boston, function(x) { c(length(unique(x)), class(x)) })
## crim zn indus chas nox rm age
## [1,] "504" "26" "76" "2" "81" "446" "356"
## [2,] "numeric" "numeric" "numeric" "integer" "numeric" "numeric" "numeric"
## dis rad tax ptratio black lstat medv
## [1,] "412" "9" "66" "46" "357" "455" "229"
## [2,] "numeric" "integer" "numeric" "numeric" "numeric" "numeric" "numeric"
We can see that chas and rad are discrete variables, so we will remove them from our model.
# Fit a Linear Model
data(Boston)
lm_model <- lm(medv ~ . - chas - rad, data = Boston)
summary(lm_model)
##
## Call:
## lm(formula = medv ~ . - chas - rad, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.3315 -2.8771 -0.6792 1.6858 27.4744
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.970e+01 5.051e+00 5.879 7.59e-09 ***
## crim -7.010e-02 3.269e-02 -2.144 0.032482 *
## zn 3.989e-02 1.409e-02 2.831 0.004835 **
## indus -4.198e-02 6.080e-02 -0.691 0.490195
## nox -1.458e+01 3.899e+00 -3.740 0.000206 ***
## rm 4.188e+00 4.255e-01 9.843 < 2e-16 ***
## age -1.868e-03 1.359e-02 -0.137 0.890696
## dis -1.503e+00 2.059e-01 -7.301 1.15e-12 ***
## tax 8.334e-04 2.386e-03 0.349 0.727038
## ptratio -8.738e-01 1.323e-01 -6.607 1.02e-10 ***
## black 8.843e-03 2.763e-03 3.200 0.001461 **
## lstat -5.267e-01 5.224e-02 -10.083 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.899 on 494 degrees of freedom
## Multiple R-squared: 0.7225, Adjusted R-squared: 0.7163
## F-statistic: 116.9 on 11 and 494 DF, p-value: < 2.2e-16
# Checking for Mean Zero assumption
plot(lm_model, which = 1)
cat("Mean of the residuals is: ", mean(lm_model$residuals))
## Mean of the residuals is: 2.106352e-16
As most of the points are scattered randomly around the zero line without any clear pattern, the linearity assumption appears reasonable. Also, the mean of the residuals is essentially zero (about 2.1e-16), which satisfies the mean-zero assumption.
# Checking for the Homoscedasticity assumption
plot(lm_model, which = 3)
Homoscedasticity assumes that the variances of the residuals are finite and roughly equal, so their spread should be relatively constant across the plot. This is a scale-location plot of the square root of the standardized residuals against the fitted values.
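As a formal complement to the visual check (an optional addition; the car package is already loaded above), a score test for non-constant error variance can be run:
# Optional: Breusch-Pagan-type score test for non-constant variance
car::ncvTest(lm_model)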
# Checking for Normality assumption using the Q-Q plot
plot(lm_model, which = 2)
In the normal Q-Q plot we can observe a few outliers. Apart from these, the points lie close to the reference line, so we can conclude that the residuals approximately follow a normal distribution.
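A formal test can supplement the Q-Q plot (an optional addition; with n = 506 even small departures from normality tend to be flagged, so the visual check remains the primary tool):
# Optional: Shapiro-Wilk test on the residuals
shapiro.test(residuals(lm_model))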
plot(lm_model, which = 2)
# Identify the most significant outlier
max_residual <- max(abs(lm_model$residuals))
max_residual_index <- which.max(abs(lm_model$residuals))
max_residual_observation <- row.names(Boston)[max_residual_index]
print(paste("Most significant outlier:", max_residual_observation))
## [1] "Most significant outlier: 369"
print(paste("Absolute value of residual:", max_residual))
## [1] "Absolute value of residual: 27.4744025269803"
# Using Leverage-Residuals^2 plot
plot(hatvalues(lm_model), rstudent(lm_model)^2, pch = 20,
     main = "Outliers in Predictor: Leverage vs. Squared Residuals")
cat("Index of the outlier with the largest squared studentized residual: ", which.max(rstudent(lm_model)^2))
## Index of the outlier with the largest squared studentized residual: 369
The most significant outlier is the observation at index 369. It has a strong influence on the regression coefficients: its value is extreme relative to the rest of the data, and it appears as an outlier in almost all of the previous plots.
When there are outliers in the predictors, check the hat values:
plot(hatvalues(lm_model), type = "h")
p <- length(coef(lm_model)) - 1  # 11 predictors in the fitted model
n <- nrow(Boston)                # 506 observations
abline(h = 2 * (p + 1) / n, lty = 2, col = 'darkred')
The hat-value summarizes the potential contribution of observation i to the fitted values collectively, and is a suitable measure of the leverage of this observation because the ith row of H represents the weight associated with the ith observed response in determining each of the fitted values.
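Beyond the plot (an optional addition, reusing p and n defined above), the observations whose leverage exceeds the 2(p + 1)/n cutoff can be listed directly:
# Optional: indices of observations exceeding the leverage rule of thumb
which(hatvalues(lm_model) > 2 * (p + 1) / n)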
In this case visualising the points directly would be hard, as we have 11 predictor variables and would need a higher-dimensional space to see them.
# Look for extreme values in medv
boxplot(Boston$medv, main = "Boxplot of medv")
# Calculate median and interquartile range
median_medv <- median(Boston$medv)
iqr_medv <- IQR(Boston$medv)
# Define outlier cutoff values
lower_cutoff <- median_medv - 1.5 * iqr_medv
upper_cutoff <- median_medv + 1.5 * iqr_medv
# Identify outliers
outliers <- Boston$medv[Boston$medv < lower_cutoff | Boston$medv > upper_cutoff]
# Print outliers
print(outliers)
## [1] 34.7 33.4 36.2 34.9 35.4 38.7 43.8 33.2 41.3 50.0 50.0 50.0 50.0 37.2 39.8
## [16] 36.2 37.9 50.0 34.9 37.0 36.4 50.0 33.3 34.6 34.9 42.3 48.5 50.0 44.8 50.0
## [31] 37.6 46.7 41.7 48.3 42.8 44.0 50.0 36.0 33.8 43.1 48.8 36.5 50.0 43.5 35.2
## [46] 33.2 35.1 45.4 35.4 46.0 50.0 37.3 36.1 33.4 50.0 50.0 50.0 50.0 50.0 8.8
## [61] 7.2 7.4 8.5 5.0 6.3 5.6 7.2 8.3 8.5 5.0 7.0 7.2 7.5 8.8 8.4
## [76] 8.3 8.7 8.4 7.0 8.1
This code calculates the median and interquartile range (IQR) of the medv variable. Outliers are then flagged when they deviate from the median by more than 1.5 times the IQR; note that the conventional boxplot rule instead uses the quartiles (Q1 - 1.5 IQR and Q3 + 1.5 IQR), so the median-based cutoff used here is a somewhat stricter variant. Any data points that fall well outside the whiskers of the boxplot may be considered outliers, and these could have a significant impact on the analysis.
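For comparison (an optional addition), the sketch below applies the conventional quartile-based cutoffs instead of the median-based ones:
# Conventional quartile-based outlier rule for medv
q <- quantile(Boston$medv, c(0.25, 0.75))
Boston$medv[Boston$medv < q[1] - 1.5 * iqr_medv | Boston$medv > q[2] + 1.5 * iqr_medv]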
Studentized Residuals Concept
The basic idea is to delete the observations one at a time, each time refitting the regression model on the remaining n–1 observations. Then, we compare the observed response values to their fitted values based on the models with the ith observation deleted. This produces deleted residuals. Standardizing the deleted residuals produces studentized residuals.
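Equivalently, the studentized (externally studentized) residual has a closed form that avoids literally refitting the model n times, where \( \hat\sigma_{(i)} \) is the residual standard error from the fit with observation \( i \) deleted and \( h_{ii} \) is its leverage: \[ t_i = \frac{e_i}{\hat\sigma_{(i)} \sqrt{1 - h_{ii}}}. \] This is what rstudent() computes below.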
# Calculating the studentized residuals
studentized_resid <- rstudent(lm_model)
# Identifying outliers based on a threshold (e.g., 2 standard deviations)
outlier_threshold <- 2
outliers <- which(abs(studentized_resid) > outlier_threshold)
# Plotting the calculated studentized residuals
plot(studentized_resid, pch = 20, main = "Outliers in Response")
# Adding horizontal lines at the positive and negative thresholds
abline(h = c(-outlier_threshold, outlier_threshold), col = "red", lty = 2)
Here we take the outlier threshold to be 2 in absolute value. The scatter plot shows the outliers in the response; the most extreme one has index 369, which aligns with the outlier flagged in the predictors.
To check for influential observations, we are using Cook’s distance and leverage. Cook’s distance measures the influence of each observation on the model’s coefficients, while leverage measures how much an observation’s predictor values differ from the mean predictor values.
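In closed form, with \( r_i \) the standardized (internally studentized) residual and \( p' \) the number of estimated coefficients, \[ D_i = \frac{r_i^2}{p'} \cdot \frac{h_{ii}}{1 - h_{ii}}, \] so Cook's distance is large when an observation has both a large residual and high leverage.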
# Calculate Cook's distance and leverage
cook_dist <- cooks.distance(lm_model)
leverage <- hatvalues(lm_model)
# Plot Cook's distance
plot(cook_dist, pch = 20, main = "Cook's Distance", xlab = "Observation", ylab = "Cook's Distance")
abline(h = 4 / nrow(Boston), col = "red")
# Plot leverage
plot(leverage, pch = 20, main = "Leverage", xlab = "Observation", ylab = "Leverage")
abline(h = 0.05, col = "red")
# Identify influential observations based on Cook's distance and leverage
influential_obs <- which(cook_dist > 4 / nrow(Boston) & leverage > 0.05)
# Print influential observations
print(Boston[influential_obs, ])
## crim zn indus chas nox rm age dis rad tax ptratio black
## 215 0.28955 0 10.59 0 0.489 5.412 9.8 3.5875 4 277 18.6 348.93
## 254 0.36894 22 5.86 0 0.431 8.259 8.4 8.9067 7 330 19.1 396.90
## 354 0.01709 90 2.02 0 0.410 6.728 36.1 12.1265 5 187 17.0 384.46
## 365 3.47428 0 18.10 1 0.718 8.780 82.9 1.9047 24 666 20.2 354.55
## 366 4.55587 0 18.10 0 0.718 3.561 87.9 1.6132 24 666 20.2 354.70
## 368 13.52220 0 18.10 0 0.631 3.863 100.0 1.5106 24 666 20.2 131.42
## 369 4.89822 0 18.10 0 0.631 4.970 100.0 1.3325 24 666 20.2 375.52
## 381 88.97620 0 18.10 0 0.671 6.968 91.9 1.4165 24 666 20.2 396.90
## 406 67.92080 0 18.10 0 0.693 5.683 100.0 1.4254 24 666 20.2 384.97
## 413 18.81100 0 18.10 0 0.597 4.628 100.0 1.5539 24 666 20.2 28.79
## 415 45.74610 0 18.10 0 0.693 4.519 100.0 1.6582 24 666 20.2 88.27
## lstat medv
## 215 29.55 23.7
## 254 3.54 42.8
## 354 4.50 30.1
## 365 5.29 21.9
## 366 7.12 27.5
## 368 13.33 23.1
## 369 3.26 50.0
## 381 17.21 10.4
## 406 22.98 5.0
## 413 34.37 17.9
## 415 36.98 7.0
There are 11 influential observations, and we can conclude that they affect the model significantly more than the other points. (Note: index 369 is among them.)
VIF stands for variance inflation factor. VIF values measure the degree of multicollinearity among the predictor variables: a VIF greater than about 5 or 10 indicates potential multicollinearity, i.e. a predictor that is highly correlated with the other predictors. High VIF values inflate the standard errors of the regression coefficients.
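Recall that the VIF of predictor \( x_j \) is computed from the \( R_j^2 \) of the regression of \( x_j \) on all the other predictors: \[ \mathrm{VIF}_j = \frac{1}{1 - R_j^2}. \]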
# Calculate VIFs
vif_values <- car::vif(lm_model)
# Plot VIF values as a bar chart, highlighting the five largest
barplot(vif_values, main = "VIF Values", xlab = "Predictor Variables", ylab = "VIF",
        ylim = c(0, max(vif_values) + 1),
        col = ifelse(vif_values >= sort(vif_values, decreasing = TRUE)[5], "red", "blue"),
        names.arg = names(vif_values))
# Add VIF values over the bars
text(x = 1:length(vif_values), y = vif_values, labels = round(vif_values, 2), pos = 3, col = "black", cex = 0.8)
# Add legend
legend("topright", legend = c("Top 5 Highest VIFs"), fill = c("red"), border = "transparent")