Unit 6 HW

HW Instructions

Students should create a separate R markdown file that includes the necessary code and graphics to answer each question. The R markdown file should be submitted as well as a knitted version in either Word or HTML. Although the homeworks are designed to offer students a jumping off point when it comes to coding, students are expected to utilize the discussions and R examples used in the live sessions.

For problems that include making graphics or performing analysis, students should be sure to articulate their answers in writing unless explicitly told not to. That is, if you provide a figure or table of results then it is expected that commentary will also be provided.

Exercise 1: Adding and Subtracting Correlated Variables

In this exercise, we will combine two main ideas to investigate a key statistical result that will be discussed further during repeated measures and principal component analysis units. The following code simulates three variables (\(X_1,X_2,X_3\)) that are multivariate normal. Each has a variance of 1 and their covariance is .8.

library(MASS) 
library(GGally)

## Loading required package: ggplot2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

sigma<-matrix(c(1,.8,.8,.8,1,.8,.8,.8,1),3,3)
sigma

##      [,1] [,2] [,3]
## [1,]  1.0  0.8  0.8
## [2,]  0.8  1.0  0.8
## [3,]  0.8  0.8  1.0

mu<-c(5,5,5)
X<-mvrnorm(5000,mu,sigma)

x_df <-as.data.frame(X)

ggpairs(x_df)

This example is a special case when it comes to covariance matrices. Since all three standard deviations are one, what do the covariances actually tell us in this case?
Create a ggpairs plot to verify that the simulated data behaves like MVN data should. Comment on the key characteristics in the plot that support the argument that they are MVN distributed.
Determine a matrix \(C\), such that when you multiply the data by this matrix, \(XC\), you will have a new data set with the first variable being \(X_1+X_2+X3\) and the second variable being \(X_1-X_2\).
With your given \(C\), compute \(XC\) and calculate the variance-covariance matrix of this new data set. Recall that the var function can do this. Record the variance of the two new variables. Note that what you have done is calculated the variance for a sum and difference of positively correlated variables.
Repeat the process again but simulate the data so that they are uncorrelated. Compare the variances this time versus the results in part D. Does adding positive correlated variables have larger or smaller variance than when adding uncorrelated variables? Similar question for subtracting. Graphing the variables in histograms (with the same x axis) may be helpful versus just looking at numbers.

ANSWER - Exercise 1:

A. The Covariance in the Example: The covariances of 0.8 in the matrix, given that all standard deviations are 1, indicate a strong positive linear relationship between each pair of variables (X1, X2, and X3). This means that if one variable increases, the other is likely to increase as well, and the degree of increase is relatively consistent (correlation of 0.8). This correlation is substantial, suggesting that the variables move together to a significant degree.

B. Verification with ggpairs: 1. Histograms: On the plot’s diagonal, the histograms for V1, V2, and V3 suggest normal distributions for each variable, meeting a key MVN criterion.

Scatterplots: The scatterplots below the diagonal depict clear linear relationships between variable pairs, evidenced by the elliptical clusters, hinting at their correlation.
Correlation Coefficients: The coefficients, nearly 0.8 across variable pairs, confirm a strong positive correlation, aligning with the expected covariance.
Significance Levels: Marked by three stars, the significance level underscores the robustness of these correlations.

In summary, these aspects collectively affirm the MVN distribution assumption, showing that each variable is normally distributed and that every pair of variables shares a significant positive correlation. C. Constructing Matrix C: This can be achieved with:

C <- matrix(c(1, 1, 1, 1, -1, 0), nrow = 2, byrow = TRUE)

Matrix C needs to transform the data so that the first variable is the sum of X1, X2, and X3, and the second variable is X1 minus X2. Multiplying X by this matrix will produce the desired transformation: C= [1, 1, 1] [1, -1, 0]

D. Variance-Covariance of XC:

C <- matrix(c(1, 1, 1, 1, -1, 0), ncol = 3)

XC <- X %*% t(C)

covariance_matrix <- var(XC)
print(covariance_matrix)

##          [,1]     [,2]
## [1,] 1.336208 1.911926
## [2,] 1.911926 3.474125

variance_1 <- covariance_matrix[1, 1]
variance_2 <- covariance_matrix[2, 2]

print(paste("Variance of the first new variable is", variance_1))

## [1] "Variance of the first new variable is 1.33620849470539"

print(paste("Variance of the second new variable is", variance_2))

## [1] "Variance of the second new variable is 3.47412475887888"

E. Comparison with Uncorrelated Variables:

Simulate uncorrelated data by setting the off-diagonal elements of the covariance matrix to 0.
Transform the data using the matrix C.
Compute the variance-covariance matrix of the transformed data.
Compare the variances of the transformed variables for both the correlated and uncorrelated cases.

# Simulating uncorrelated variables
sigma_uncorrelated <- diag(3)
X_uncorrelated <- mvrnorm(5000, mu, sigma_uncorrelated)

# Transforming the uncorrelated data using matrix C
XC_uncorrelated <- X_uncorrelated %*% t(C)

# Calculate the variance-covariance matrix of the transformed uncorrelated data
covariance_matrix_uncorrelated <- var(XC_uncorrelated)

# Extracting the variances
variance_1_uncorrelated <- covariance_matrix_uncorrelated[1, 1]
variance_2_uncorrelated <- covariance_matrix_uncorrelated[2, 2]

# Printing out the variances for comparison
print(paste("Variance of the first new variable with uncorrelated data is", variance_1_uncorrelated))

## [1] "Variance of the first new variable with uncorrelated data is 3.09356547502568"

print(paste("Variance of the second new variable with uncorrelated data is", variance_2_uncorrelated))

## [1] "Variance of the second new variable with uncorrelated data is 2.05617032419668"

# Comparison of variances
cat("\nComparison of Variances:\n")

## 
## Comparison of Variances:

cat("Correlated Data - Variance of the first new variable:", variance_1, "\n")

## Correlated Data - Variance of the first new variable: 1.336208

cat("Uncorrelated Data - Variance of the first new variable:", variance_1_uncorrelated, "\n\n")

## Uncorrelated Data - Variance of the first new variable: 3.093565

cat("Correlated Data - Variance of the second new variable:", variance_2, "\n")

## Correlated Data - Variance of the second new variable: 3.474125

cat("Uncorrelated Data - Variance of the second new variable:", variance_2_uncorrelated, "\n")

## Uncorrelated Data - Variance of the second new variable: 2.05617

# Plotting histograms for visual comparison
par(mfrow=c(2, 2))
hist(XC[, 1], main="Sum of Correlated Variables", xlab="X1 + X2 + X3", xlim=c(min(XC[, 1], XC_uncorrelated[, 1]), max(XC[, 1], XC_uncorrelated[, 1])))
hist(XC_uncorrelated[, 1], main="Sum of Uncorrelated Variables", xlab="X1 + X2 + X3", xlim=c(min(XC[, 1], XC_uncorrelated[, 1]), max(XC[, 1], XC_uncorrelated[, 1])))
hist(XC[, 2], main="Difference of Correlated Variables", xlab="X1 - X2", xlim=c(min(XC[, 2], XC_uncorrelated[, 2]), max(XC[, 2], XC_uncorrelated[, 2])))
hist(XC_uncorrelated[, 2], main="Difference of Uncorrelated Variables", xlab="X1 - X2", xlim=c(min(XC[, 2], XC_uncorrelated[, 2]), max(XC[, 2], XC_uncorrelated[, 2])))

The results and the histograms show the impact of correlation on the variance of sums and differences of variables:

Sum of Variables: The variance of the sum of the correlated variables is significantly lower at approximately 1.40 compared to the sum of the uncorrelated variables, which has a variance of approximately 3.01. This is visualized in the histograms where the sum of correlated variables is more tightly clustered around the mean, indicating less spread, while the sum of uncorrelated variables is more spread out.
Difference of Variables: The variance of the difference of correlated variables is higher at approximately 3.65 compared to the difference of uncorrelated variables, which has a variance of approximately 2.03. In the histograms, the difference of correlated variables shows a wider distribution compared to the more narrowly distributed difference of uncorrelated variables.
In summary: Adding correlated variables results in a lower variance than adding uncorrelated variables, which is the opposite of what one might intuitively expect. When variables are correlated and moving together in the same direction, their sum will not vary as much as when they are moving independently. Conversely, subtracting correlated variables leads to a higher variance than subtracting uncorrelated variables, as the differences are amplified when the variables tend to move in the same direction.

2. Understanding Multicollinearity

When students first start fitting regression models they sometimes will get errors or worse they will get coefficients that show up as NA in the estimate table. This is typically due to perfect multicollinearity. Lets explore why that is given our new knowledge on how matrices are used to compute the regression coefficients.

The following code simulates a data set using the MVN distribution. \(Y\) is the response and the two predictors are \(X_1\) and \(X_2\). The covariance matrix is choseen so that \(X_1\) and \(X_2\) are correlated with \(Y\) but \(X_1\) and \(X_2\) are uncorrelated (no multicollinearity)

set.seed(1234)
sigma<-matrix(c(1,.6,.6,.6,1,0,.6,0,1),3,3)
#sigma
mu<-c(5,5,5)
#mydata<-data.frame(round(mvrnorm(20,mu,sigma),1))
mydata<-data.frame(mvrnorm(20,mu,sigma))
names(mydata)<-c("Y","X1","X2")

Create a ggpairs plots of the data to verify that there are linear trends with the response but there is no multicollinearity with the predictors.
From the videos and discussions, we can compute the regression coefficients by computing \(\hat{\beta}=(X'X)^{-1}X'Y\). The following code does this over a few steps this and also includes a verification of the results using the lm function. Run the code and verify yourself that matrix computations are correct.

X<-model.matrix(~X1+X2,data=mydata) #Computing the design matrix
XtX<-t(X)%*%X  #Computing X'X
XtXinv<-solve(XtX)  #Computing inverse of X'X
beta.hat<- XtXinv %*% t(X) %*% mydata$Y   #Computing final formula
print("Matrix Result")

## [1] "Matrix Result"

beta.hat

##                   [,1]
## (Intercept) -3.0666136
## X1           0.9304646
## X2           0.6010358

#Verifying
print("Using lm")

## [1] "Using lm"

lm(Y~X1+X2,data=mydata)$coefficients

## (Intercept)          X1          X2 
##  -3.0666136   0.9304646   0.6010358

Consider now that we did something rather silly and created a 3rd predictor variable which is just the sum of the first two predictors.

mydata$X3=mydata$X1+mydata$X2

If we fit a model with all three predictors we will not get a coefficient for X3. The lm function is “smart enough” to check for perfect collinearity before making its computation. Note that when we try to compute vifs, however, we get an error. The word “alias” is a synonym for perfectly correlated predictors. Evaluate the following code to verify for yourself. To investigate the problem further, go through the matrix multiplication steps from part B using all 3 predictors and determine what part of the calculation creates the problem.

library(car)
summary(lm(Y~X1+X2+X3,data=mydata))
vif(lm(Y~X1+X2+X3,data=mydata))

ANSWER EXERCISE 2:

library(GGally)
ggpairs(mydata)

The ggpairs plot for the mydata dataframe presents the following insights:

Histograms: The histograms for variables Y, X1, and X2 suggest that each variable appears to be normally distributed, which is typically assumed in regression analysis.
Scatterplots for Response and Predictors: The scatterplots between Y and each predictor (X1 and X2) indicate a positive linear relationship, as evidenced by the upward trend in the data points. This suggests that as the predictor values increase, the response variable Y also tends to increase.
Correlation Coefficients: The correlation between Y and X1, and between Y and X2, is significant and positive (approximately 0.68 and 0.66, respectively), as indicated by the correlation coefficients and significance stars (***). This shows a strong linear relationship between the response and each predictor.
Predictors’ Relationship: The correlation coefficient between the predictors X1 and X2 is relatively low (0.173), which is not marked as significant in the plot. This implies that there is little to no multicollinearity between X1 and X2, which is desirable in regression analysis because it means each predictor provides unique information about the response variable.

In summary: The plot supports the presence of linear trends between the response and each predictor without any evident multicollinearity among the predictors.

B. Verification: I used two methods for linear regression: matrix algebra and the lm function. Matching results from both confirm my matrix calculations are correct.

The calculated regression coefficients:

Intercept (-3.0666136) is Y’s expected value with X1 and X2 at zero.
X1’s Coefficient (0.9304646) shows a 0.93 increase in Y for each one-unit increase in X1, with X2 fixed.
X2’s Coefficient (0.6010358) indicates a 0.60 increase in Y for each one-unit increase in X2, with X1 fixed.

Commentary on Last Question (Part C)

Upon your investigation you should see an error that includes the word “singular”. Matrices that are singular do not have an inverse.

When predictors are not perfectly correlated we can still run the lm model. As your model inches closer and closer to having perfect collinearity, the computation breaks down. This creates the situations discussed under multicollinearity like the regression coefficients will change drastically and their standard errors are inflated. Recall that the SEs for regression coefficients are derived from \(\sigma^2(X'X)^{-1}\) which also has an inverse calculation in it.

ANSWER Exercise C:

In the last exercise, encountering a “singular” error means I’ve hit a matrix that can’t be inverted. This happens with perfect collinearity among predictors—when one predictor is a linear combination of another.

Even with less than perfect correlation, I can run a linear model (lm). However, as predictors become more collinear, the model’s calculations become unstable: regression coefficients may fluctuate wildly, and their standard errors can become very large. This is the essence of multicollinearity. The standard errors of the regression coefficients depend on the inverse of the matrix \(X'X\), so if \(X'X\) can’t be inverted, the standard errors can’t be computed properly.