Students should create a separate R Markdown file that includes the necessary code and graphics to answer each question. Submit both the R Markdown file and a knitted version in either Word or HTML. Although the homeworks are designed to offer students a jumping-off point when it comes to coding, students are expected to utilize the discussions and R examples used in the live sessions.
For problems that include making graphics or performing analysis, students should be sure to articulate their answers in writing unless explicitly told not to. That is, if you provide a figure or table of results then it is expected that commentary will also be provided.
A. The Covariance in the Example: Because all standard deviations are 1, the covariances of 0.8 in the matrix are also correlations of 0.8, indicating a strong positive linear relationship between each pair of variables (X1, X2, and X3). If one variable increases, the others are likely to increase as well, and a correlation of 0.8 is substantial, so the variables move together to a considerable degree.
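For reference, a minimal sketch of the covariance matrix being described (the object name Sigma is illustrative; the assignment's own simulation code is assumed to define the actual matrix):

Sigma <- matrix(c(1.0, 0.8, 0.8,
                  0.8, 1.0, 0.8,
                  0.8, 0.8, 1.0), nrow = 3, byrow = TRUE)  # unit variances, 0.8 covariances
cov2cor(Sigma)  # identical to Sigma here, since all standard deviations are 1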
B. Verification with ggpairs:
1. Histograms: On the plot’s diagonal, the histograms for V1, V2, and V3 suggest normal distributions for each variable, meeting a key MVN criterion.
2. Scatterplots: The scatterplots below the diagonal depict clear linear relationships between variable pairs, evidenced by the elliptical clusters, hinting at their correlation.
3. Correlation Coefficients: The coefficients, nearly 0.8 across variable pairs, confirm a strong positive correlation, aligning with the expected covariance.
4. Significance Levels: Marked by three stars, the significance levels underscore the robustness of these correlations.
In summary, these aspects collectively affirm the MVN distribution assumption, showing that each variable is normally distributed and that every pair of variables shares a significant positive correlation.
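For reference, a sketch of the kind of code that would produce the ggpairs output described above; MASS and GGally are assumed to be available, and the mean vector, sample size, and object names below are illustrative rather than taken from the original solution:

library(MASS)    # for mvrnorm
library(GGally)  # for ggpairs
mu_sketch    <- c(0, 0, 0)                                # illustrative mean vector
Sigma_sketch <- matrix(0.8, 3, 3); diag(Sigma_sketch) <- 1 # unit variances, 0.8 covariances
sim <- as.data.frame(mvrnorm(5000, mu_sketch, Sigma_sketch))
ggpairs(sim)  # diagonal: marginal distributions; lower: pairwise scatterplots; upper: correlations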
C. Constructing Matrix C: Matrix C needs to transform the data so that the first new variable is the sum of X1, X2, and X3 and the second is X1 minus X2:

\[
C = \begin{bmatrix} 1 & 1 & 1 \\ 1 & -1 & 0 \end{bmatrix}
\]

This can be achieved with:

C <- matrix(c(1, 1, 1, 1, -1, 0), nrow = 2, byrow = TRUE)

Multiplying the data matrix X by the transpose of C (i.e., computing \(XC'\), as in part D) produces the desired transformation.
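The variance-covariance matrix computed in part D can also be anticipated from theory: if a random vector with covariance matrix \(\Sigma\) is transformed by \(C\), the transformed vector has covariance matrix \(C\Sigma C'\). Written out for the two new variables, with \(\sigma_{ij}\) denoting the entries of \(\Sigma\):

\[
\operatorname{Var}(X_1+X_2+X_3)=\sigma_{11}+\sigma_{22}+\sigma_{33}+2(\sigma_{12}+\sigma_{13}+\sigma_{23}), \qquad
\operatorname{Var}(X_1-X_2)=\sigma_{11}+\sigma_{22}-2\sigma_{12}.
\]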
D. Variance-Covariance of XC:
C <- matrix(c(1, 1, 1, 1, -1, 0), nrow = 2, byrow = TRUE)  # same C as constructed in part C (filled by row)
XC <- X %*% t(C)
covariance_matrix <- var(XC)
print(covariance_matrix)
## [,1] [,2]
## [1,] 1.336208 1.911926
## [2,] 1.911926 3.474125
variance_1 <- covariance_matrix[1, 1]
variance_2 <- covariance_matrix[2, 2]
print(paste("Variance of the first new variable is", variance_1))
## [1] "Variance of the first new variable is 1.33620849470539"
print(paste("Variance of the second new variable is", variance_2))
## [1] "Variance of the second new variable is 3.47412475887888"
E. Comparison with Uncorrelated Variables:
# Simulating uncorrelated variables
sigma_uncorrelated <- diag(3)
X_uncorrelated <- mvrnorm(5000, mu, sigma_uncorrelated)
# Transforming the uncorrelated data using matrix C
XC_uncorrelated <- X_uncorrelated %*% t(C)
# Calculate the variance-covariance matrix of the transformed uncorrelated data
covariance_matrix_uncorrelated <- var(XC_uncorrelated)
# Extracting the variances
variance_1_uncorrelated <- covariance_matrix_uncorrelated[1, 1]
variance_2_uncorrelated <- covariance_matrix_uncorrelated[2, 2]
# Printing out the variances for comparison
print(paste("Variance of the first new variable with uncorrelated data is", variance_1_uncorrelated))
## [1] "Variance of the first new variable with uncorrelated data is 3.09356547502568"
print(paste("Variance of the second new variable with uncorrelated data is", variance_2_uncorrelated))
## [1] "Variance of the second new variable with uncorrelated data is 2.05617032419668"
# Comparison of variances
cat("\nComparison of Variances:\n")
##
## Comparison of Variances:
cat("Correlated Data - Variance of the first new variable:", variance_1, "\n")
## Correlated Data - Variance of the first new variable: 1.336208
cat("Uncorrelated Data - Variance of the first new variable:", variance_1_uncorrelated, "\n\n")
## Uncorrelated Data - Variance of the first new variable: 3.093565
cat("Correlated Data - Variance of the second new variable:", variance_2, "\n")
## Correlated Data - Variance of the second new variable: 3.474125
cat("Uncorrelated Data - Variance of the second new variable:", variance_2_uncorrelated, "\n")
## Uncorrelated Data - Variance of the second new variable: 2.05617
# Plotting histograms for visual comparison
par(mfrow=c(2, 2))
hist(XC[, 1], main="Sum of Correlated Variables", xlab="X1 + X2 + X3", xlim=c(min(XC[, 1], XC_uncorrelated[, 1]), max(XC[, 1], XC_uncorrelated[, 1])))
hist(XC_uncorrelated[, 1], main="Sum of Uncorrelated Variables", xlab="X1 + X2 + X3", xlim=c(min(XC[, 1], XC_uncorrelated[, 1]), max(XC[, 1], XC_uncorrelated[, 1])))
hist(XC[, 2], main="Difference of Correlated Variables", xlab="X1 - X2", xlim=c(min(XC[, 2], XC_uncorrelated[, 2]), max(XC[, 2], XC_uncorrelated[, 2])))
hist(XC_uncorrelated[, 2], main="Difference of Uncorrelated Variables", xlab="X1 - X2", xlim=c(min(XC[, 2], XC_uncorrelated[, 2]), max(XC[, 2], XC_uncorrelated[, 2])))
The results and the histograms show the impact of correlation on the variance of sums and differences of variables:
Sum of Variables: For positively correlated variables, the variance of the sum is larger than it is for uncorrelated variables. With unit variances and covariances of 0.8, the theoretical variance of X1 + X2 + X3 is 3 + 2(0.8 + 0.8 + 0.8) = 7.8 for the correlated data versus 3 for the uncorrelated data. In the histograms, the sum of the correlated variables should therefore be noticeably more spread out than the sum of the uncorrelated variables.
Difference of Variables: The opposite holds for the difference. The theoretical variance of X1 - X2 is 2 - 2(0.8) = 0.4 for the correlated data versus 2 for the uncorrelated data, so the difference of the correlated variables should be much more tightly clustered around its mean.
In summary: Adding positively correlated variables produces a larger variance than adding uncorrelated variables, because the variables tend to move in the same direction at the same time, so their movements reinforce one another in the sum. Subtracting positively correlated variables produces a smaller variance, because the shared movement cancels in the difference. Both facts follow from \(\operatorname{Var}(aX+bY)=a^2\operatorname{Var}(X)+b^2\operatorname{Var}(Y)+2ab\operatorname{Cov}(X,Y)\): positive covariance inflates the variance of a sum (a = b = 1) and deflates the variance of a difference (a = 1, b = -1).
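A quick way to check these theoretical values, assuming the covariance matrix from part A (unit variances, 0.8 covariances); the "_theory" object names are illustrative:

Sigma_theory <- matrix(0.8, 3, 3); diag(Sigma_theory) <- 1
C_theory     <- matrix(c(1, 1, 1, 1, -1, 0), nrow = 2, byrow = TRUE)
C_theory %*% Sigma_theory %*% t(C_theory)   # diagonal: 7.8 for the sum, 0.4 for the difference
C_theory %*% diag(3) %*% t(C_theory)        # uncorrelated case, diagonal: 3 and 2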
When students first start fitting regression models, they sometimes get errors or, worse, coefficients that show up as NA in the estimate table. This is typically due to perfect multicollinearity. Let’s explore why that is, given our new knowledge of how matrices are used to compute the regression coefficients.
The following code simulates a data set from the MVN distribution. \(Y\) is the response and the two predictors are \(X_1\) and \(X_2\). The covariance matrix is chosen so that \(X_1\) and \(X_2\) are each correlated with \(Y\) but are uncorrelated with each other (no multicollinearity).
set.seed(1234)
sigma<-matrix(c(1,.6,.6,.6,1,0,.6,0,1),3,3)
#sigma
mu<-c(5,5,5)
#mydata<-data.frame(round(mvrnorm(20,mu,sigma),1))
mydata<-data.frame(mvrnorm(20,mu,sigma))
names(mydata)<-c("Y","X1","X2")
Create a ggpairs plot of the data to verify that there are linear trends with the response but no multicollinearity between the predictors.
From the videos and discussions, we can compute the regression coefficients as \(\hat{\beta}=(X'X)^{-1}X'Y\). The following code does this over a few steps and also includes a verification of the results using the lm function. Run the code and verify for yourself that the matrix computations are correct.
X<-model.matrix(~X1+X2,data=mydata) #Computing the design matrix
XtX<-t(X)%*%X #Computing X'X
XtXinv<-solve(XtX) #Computing inverse of X'X
beta.hat<- XtXinv %*% t(X) %*% mydata$Y #Computing final formula
print("Matrix Result")
## [1] "Matrix Result"
beta.hat
## [,1]
## (Intercept) -3.0666136
## X1 0.9304646
## X2 0.6010358
#Verifying
print("Using lm")
## [1] "Using lm"
lm(Y~X1+X2,data=mydata)$coefficients
## (Intercept) X1 X2
## -3.0666136 0.9304646 0.6010358
mydata$X3 <- mydata$X1 + mydata$X2  # X3 is an exact linear combination of X1 and X2
If we fit a model with all three predictors, we will not get a coefficient for X3. The lm function is “smart enough” to check for perfect collinearity before making its computation. Note, however, that when we try to compute VIFs we get an error; the word “alias” in that error refers to perfectly correlated predictors. Evaluate the following code to verify this for yourself. To investigate the problem further, go through the matrix multiplication steps from part B using all three predictors and determine what part of the calculation creates the problem.
library(car)
summary(lm(Y~X1+X2+X3,data=mydata))
vif(lm(Y~X1+X2+X3,data=mydata))
A:
library(GGally)
ggpairs(mydata)
The ggpairs plot for the mydata data frame presents the following insights:
Histograms: The histograms for variables Y, X1, and X2 suggest that each variable is approximately normally distributed, consistent with the data having been simulated from an MVN distribution.
Scatterplots for Response and Predictors: The scatterplots between Y and each predictor (X1 and X2) indicate a positive linear relationship, as evidenced by the upward trend in the data points. This suggests that as the predictor values increase, the response variable Y also tends to increase.
Correlation Coefficients: The correlation between Y and X1, and between Y and X2, is significant and positive (approximately 0.68 and 0.66, respectively), as indicated by the correlation coefficients and significance stars (***). This shows a strong linear relationship between the response and each predictor.
Predictors’ Relationship: The correlation coefficient between the predictors X1 and X2 is relatively low (0.173), which is not marked as significant in the plot. This implies that there is little to no multicollinearity between X1 and X2, which is desirable in regression analysis because it means each predictor provides unique information about the response variable.
B. Verification: I used two methods for the linear regression: matrix algebra and the lm function. The results from both match, confirming that the matrix calculations are correct; both approaches give (Intercept) = -3.0666, X1 = 0.9305, and X2 = 0.6010.
Upon your investigation you should see an error that includes the word “singular”. Matrices that are singular do not have an inverse.
When predictors are not perfectly correlated, we can still run the lm model. However, as the model inches closer and closer to perfect collinearity, the computation breaks down. This creates the situations discussed under multicollinearity: the regression coefficients can change drastically and their standard errors become inflated. Recall that the SEs for the regression coefficients are derived from \(\sigma^2(X'X)^{-1}\), which also involves an inverse calculation.
In the last exercise, encountering a “singular” error means I’ve hit a matrix that can’t be inverted. This happens with perfect collinearity among the predictors, that is, when one predictor is an exact linear combination of the others.
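A minimal sketch of that investigation, reusing mydata from the earlier chunks (the object names X3mat and XtX3 are just illustrative):

X3mat <- model.matrix(~X1 + X2 + X3, data = mydata)  # design matrix now includes the redundant X3 column
qr(X3mat)$rank                                       # rank is 3 even though the matrix has 4 columns
XtX3 <- t(X3mat) %*% X3mat                           # X'X inherits the rank deficiency
try(solve(XtX3))                                     # the inversion step fails with a 'singular' error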
Even with less-than-perfect correlation, I can still run a linear model with lm. However, as the predictors become more collinear, the model’s calculations become unstable: regression coefficients may fluctuate wildly, and their standard errors can become very large. This is the essence of multicollinearity. The standard errors of the regression coefficients depend on the inverse of the matrix \(X'X\); when \(X'X\) is nearly singular that inverse is numerically unstable, and when it can’t be inverted at all the standard errors can’t be computed properly.
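As a check on that last point, here is a short sketch comparing the matrix-formula standard errors with those reported by lm for the two-predictor model; it reuses XtXinv and mydata from part B, and the names fit and sigma2.hat are illustrative:

fit        <- lm(Y ~ X1 + X2, data = mydata)
sigma2.hat <- sum(residuals(fit)^2) / fit$df.residual   # estimate of sigma^2 (MSE)
sqrt(diag(sigma2.hat * XtXinv))                         # SEs from sigma^2 (X'X)^{-1}
summary(fit)$coefficients[, "Std. Error"]               # should agree with the line above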