EPsy8266 Assignment 1

Introduction and Overview

Author

SEMGopher (Jichuan Wu & Mingyang Pang)

Published

February 7, 2025

Question 1

Describe how to convert a covariance matrix to a correlation matrix in a way that a person not familiar with statistics can understand. Be sure to explain all the statistical concepts that are necessary to understand in order to convert a covariance matrix to a correlation matrix. (1 point)

Key Concepts:

Covariance measures how two variables change together. For example, if we want to examine whether “time spent on homework” and “academic performance” vary together for all students in EPSY8266, we can look at the covariance of the two variables. A positive value indicates that they tend to increase together, while a negative value indicates that one tends to increase as the other decreases. If their changes are unrelated, the covariance is close to zero. Covariance depends on the units of measurement, so its value has no fixed bounds and can be any negative or positive number. A covariance matrix shows the covariances between multiple pairs of variables. For example, if we would like to look at the covariance of each pair among “time spent on homework,” “academic performance,” “intelligence score,” and “parent education level,” we can use a covariance matrix to display the values clearly. Its diagonal holds each variable's variance (the covariance of a variable with itself), and its off-diagonal entries show how each pair of variables changes relative to one another.

Correlation is a standardized measure that shows both the strength and the direction (positive or negative) of the relationship between two variables. Unlike covariance, which has no fixed bounds, correlation values always fall between -1 and 1. A correlation of 1 means a perfect positive relationship, -1 means a perfect negative relationship, and 0 means no linear relationship. A correlation matrix works like a covariance matrix, but its entries are correlations, and every diagonal entry is 1.

Value standardization is the process of transforming data into a common scale, removing the influence of different units and magnitudes so that the values can be more easily compared. It ensures that the data is on the same scale, which is particularly useful when comparing variables that have different units or ranges.
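As a small illustration of standardization, the sketch below converts two variables to z-scores (subtract the mean, then divide by the standard deviation). The data values and the variable names `hours` and `score` are hypothetical and chosen only for illustration.

```r
# Hypothetical data for five students (illustrative values only)
hours <- c(2, 4, 5, 7, 9)        # weekly homework time, in hours
score <- c(61, 70, 68, 82, 90)   # exam score, in points

# z-score: subtract the mean, then divide by the standard deviation
z_hours <- (hours - mean(hours)) / sd(hours)
z_score <- (score - mean(score)) / sd(score)

# Standardized variables have mean 0 and standard deviation 1
round(c(mean(z_hours), sd(z_hours)), 10)

# The covariance of the z-scores equals the correlation of the raw values
cov(z_hours, z_score)
cor(hours, score)
```

The last two lines return the same number, which is one way to see that a correlation is simply a covariance computed on standardized values.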

Covariance-correlation Conversion:

Different variables may have different variances and units, which makes their covariances hard to compare. To solve this problem, we need to remove the units and put the variables on a common scale of variability. This process is called standardization. To standardize the values:

  • Find each variable's standard deviation, which describes how much that variable naturally varies. It is the square root of the variable's variance, found on the diagonal of the covariance matrix.
  • Divide each covariance by the product of the standard deviations of the two variables involved.

Formula: \[ \text{Correlation} = \frac{\text{Covariance}}{\text{Standard deviation of first variable} \times \text{Standard deviation of second variable}} \]
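The formula can be sketched in R as follows. The covariance matrix `S` is hypothetical (it is not the matrix from this assignment). The manual calculation is compared against `cov2cor()`, the base R function for this conversion.

```r
# Hypothetical covariance matrix for three variables (illustrative values only)
S <- matrix(c(16,  6, 10,
               6,  9,  3,
              10,  3, 25),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("V1", "V2", "V3"), c("V1", "V2", "V3")))

# Standard deviations are the square roots of the diagonal (the variances)
sds <- sqrt(diag(S))

# Divide each covariance by the product of the two standard deviations
R_manual <- S / outer(sds, sds)

# Base R performs the same conversion
R_builtin <- cov2cor(S)

R_manual
all.equal(R_manual, R_builtin)
```

`outer(sds, sds)` builds a matrix whose entry in row i, column j is the product of the i-th and j-th standard deviations, so the element-wise division applies the formula to every pair at once.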

Question 2

Convert the following covariance matrix to a correlation matrix. (2 points)

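Using \(r_{XY}=\frac{s_{XY}}{s_X s_Y}\), where \(s_{XY}\) is the covariance of X and Y and \(s_X\) and \(s_Y\) are their standard deviations, we converted the covariance matrix into the correlation matrix as follows: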

Question 3

What do you notice about the correlation matrix that was not apparent in the covariance matrix? What does this lead you to conclude with regards to utilizing and interpreting the covariance matrix? Why do you see this difference? (1 point)

In the correlation matrix, all the correlations are the same (0.5). In the covariance matrix, however, the covariance values differ from one pair of variables to another.

This observation leads us to conclude that the differing values in a covariance matrix do not by themselves tell us which relationships are stronger, so covariances should not be compared directly across pairs of variables.

The difference arises because covariance carries the scales or units of the variables. In other words, the covariance values depend on the units of measurement, so large values might simply reflect the size of the units rather than the actual strength of the relationship between variables. The correlation matrix, in contrast, standardizes these values, removing the impact of differing units and scales. As a result, every correlation lies between -1 and 1, which makes it easier to compare the strength of relationships across variables, regardless of their original units.
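The dependence on units can be illustrated with a short sketch. The data are hypothetical. The same homework time is expressed first in hours and then in minutes.

```r
# Hypothetical data (illustrative values only)
hours   <- c(2, 4, 5, 7, 9)        # homework time in hours
score   <- c(61, 70, 68, 82, 90)   # exam score in points
minutes <- hours * 60              # the same homework time in minutes

# Covariance changes with the unit: it is 60 times larger in minutes
cov(hours, score)
cov(minutes, score)

# Correlation is unaffected by the change of unit
cor(hours, score)
cor(minutes, score)
```

Nothing about the relationship itself changes between the two versions, yet the covariance grows by a factor of 60 while the correlation stays the same.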

Question 4

For Question 4, part – a & b, you will need to utilize the data below. 17 schools were analyzed and the following scores were found:

X= Percentage of African American students in the cohort

Y= Percentage of white students who said that they “knew well” 2 or more African American students

a. Enter the data into SPSS as a data file (or a different statistics program of your choosing). Then analyze them, first with all 17 data points. Next, reanalyze with only the first 16 data points to determine the line of best fit. Fit the regression line to the data for both analyses separately.

17 data points analysis

# Load libraries
library(ggplot2)
library(broom)


# Create the data frame (all 17 schools)
school_17 <- data.frame(
  school = 1:17,
  X = c(2.6, 2.7, 3.5, 3.8, 4.3, 4.6, 5.3, 6.4, 6.9, 7.1, 7.3, 7.4, 7.6, 8.4, 8.4, 8.6, 12.5),
  Y = c(46, 58, 50.5, 52, 44, 49, 58.5, 62.5, 57, 67, 54, 62, 66.5, 63, 69, 65, 59)
)

# Create a scatterplot and fit a line
ggplot (data = school_17, aes(x = X, y = Y)) +
  geom_point(size = 4, shape = 21, fill = "skyblue", color = "black")+
  geom_smooth(method = "lm", formula = y ~ x, color = "red", se = FALSE) +
  theme_light() +
  xlab ("Percentage of African American Students") +
  ylab ("Percentage of White Students who “knew well” African American Students")+
  theme(
    text = element_text(size = 10)
  )

# Fit the model
lm_17 = lm(Y~1 + X, data = school_17)

# Intercept and slope
tidy(lm_17)
# A tibble: 2 × 5
  term        estimate std.error statistic       p.value
  <chr>          <dbl>     <dbl>     <dbl>         <dbl>
1 (Intercept)    46.1      3.98      11.6  0.00000000719
2 X               1.86     0.586      3.18 0.00625      
# R^2
glance(lm_17)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.402         0.362  6.04      10.1 0.00625     1  -53.6  113.  116.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

\[ \hat{Y}_i = 46.1 + 1.86(X_i) \] \[ R^2 = 0.402 \]

16 data points analysis

# Load libraries
library(ggplot2)
library(broom)

# Create the data frame (first 16 schools only)
school_16 <- data.frame(
  school = 1:16,
  X = c(2.6, 2.7, 3.5, 3.8, 4.3, 4.6, 5.3, 6.4, 6.9, 7.1, 7.3, 7.4, 7.6, 8.4, 8.4, 8.6),
  Y = c(46, 58, 50.5, 52, 44, 49, 58.5, 62.5, 57, 67, 54, 62, 66.5, 63, 69, 65)
)

# Create a scatterplot and fit a line
ggplot (data = school_16, aes(x = X, y = Y)) +
  geom_point(size = 4, shape = 21, fill = "skyblue", color = "black")+
  geom_smooth(method = "lm", formula = y ~ x, color = "red", se = FALSE) +
  theme_light() +
  xlab ("Percentage of African American Students") +
  ylab ("Percentage of White Students who “knew well” African American Students")+
  theme(
    text = element_text(size = 10)
  )

# Fit the model
lm_16 = lm(Y~1 + X, data = school_16)

# Intercept and slope
tidy(lm_16)
# A tibble: 2 × 5
  term        estimate std.error statistic      p.value
  <chr>          <dbl>     <dbl>     <dbl>        <dbl>
1 (Intercept)    40.5      3.94      10.3  0.0000000663
2 X               2.90     0.629      4.61 0.000405    
# R^2
glance(lm_16)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.603         0.574  5.09      21.2 0.000405     1  -47.7  101.  104.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

\[ \hat{Y}_i = 40.5 + 2.90(X_i) \] \[ R^2 = 0.603 \]

b. Do you think these two regression lines are different? If so, why do you think so? Explain. (2 points)

The two regression lines differ because their slopes and intercepts differ. The 16-school model also explains more of the variance in the dependent variable (Y) than the 17-school model, with R² rising from 0.402 to 0.603. This suggests that including the 17th school drives the difference between the two models and lowers the overall fit.

When the 17th school is removed from the analysis, both the intercept and the slope of the regression line change substantially, indicating that this data point has a strong influence on the model. The 17th school follows a different pattern from the other schools: it has by far the highest percentage of African American students (12.5), yet its Y value (59) is lower than the upward trend among the other 16 schools would suggest. This accounts for the better fit when only 16 schools are included in the analysis.
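As a rough check, the value that each fitted line predicts for the 17th school (X = 12.5, observed Y = 59) can be worked out from the rounded coefficients reported in part a, so the results are approximate:

\[ \hat{Y}_{17\text{ schools}} = 46.1 + 1.86(12.5) = 69.35 \] \[ \hat{Y}_{16\text{ schools}} = 40.5 + 2.90(12.5) = 76.75 \]

Both predictions lie well above the observed value of 59. Relative to the line fitted without it, the 17th school falls about \(59 - 76.75 = -17.75\) points below the trend, which is consistent with this point pulling the 17-school line toward a flatter slope.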

When the 17th school is excluded, the slope of the regression line increases from 1.86 to 2.90 and the intercept decreases from 46.1 to 40.5. This suggests that, without the 17th school, there is a stronger relationship between the percentage of African American students in the cohort and the percentage of white students who said they “knew well” 2 or more African American students.
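One way to examine how much the 17th school affects the fit is to look at standard influence diagnostics. This is a supplementary sketch, not part of the original analysis. It uses the base R functions `hatvalues()` (leverage) and `cooks.distance()` on the same data entered in part a.

```r
# Same data as in part a
school_17 <- data.frame(
  X = c(2.6, 2.7, 3.5, 3.8, 4.3, 4.6, 5.3, 6.4, 6.9, 7.1, 7.3, 7.4, 7.6,
        8.4, 8.4, 8.6, 12.5),
  Y = c(46, 58, 50.5, 52, 44, 49, 58.5, 62.5, 57, 67, 54, 62, 66.5,
        63, 69, 65, 59)
)

lm_17 <- lm(Y ~ X, data = school_17)

# Leverage: how unusual each school's X value is
leverage <- hatvalues(lm_17)

# Cook's distance: how much the fitted line moves if a school is dropped
cooks <- cooks.distance(lm_17)

# Which school has the largest leverage and the largest Cook's distance?
which.max(leverage)
which.max(cooks)

# Refit without the 17th school and compare the coefficients
lm_16 <- lm(Y ~ X, data = school_17[-17, ])
rbind(all_17 = coef(lm_17), first_16 = coef(lm_16))
```

Both diagnostics single out the 17th school, which agrees with the change in slope and intercept described above.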

Question 5

Give an example of each of the following types of matrices (2 points)

a. Symmetric

A symmetric matrix is a square matrix that is equal to its transpose. Example: \[ \begin{bmatrix} 1 & 7 & 3 \\ 7 & 4 & 5 \\ 3 & 5 & 2 \end{bmatrix} \]

b. Identity

An identity matrix is a square matrix in which all the elements on the principal diagonal are one and all other elements are zero. Example: \[ \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \]

c. Diagonal

A diagonal matrix is a square matrix in which all the elements off the principal diagonal are zero. The elements on the principal diagonal can take any value. Example: \[ \begin{bmatrix} 5 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 7 \end{bmatrix} \]

d. Rectangular

A rectangular matrix is a matrix with unequal numbers of rows and columns. Example: \[ \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \]
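The four examples can be built and checked in R. This is a minimal sketch using only base R functions. The object names are ours.

```r
# Symmetric: equal to its own transpose
sym <- matrix(c(1, 7, 3,
                7, 4, 5,
                3, 5, 2), nrow = 3, byrow = TRUE)
isSymmetric(sym)

# Identity: ones on the principal diagonal, zeros elsewhere
identity3 <- diag(3)

# Diagonal: any values on the principal diagonal, zeros elsewhere
diag_m <- diag(c(5, 3, 7))

# Rectangular: unequal numbers of rows and columns (here 2 x 3)
rect <- matrix(1:6, nrow = 2, byrow = TRUE)
dim(rect)

# Multiplying by the identity matrix leaves a matrix unchanged
all.equal(sym %*% identity3, sym)
```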

Question 6

Let A, B, C, and D be the following matrices:

a. Identify whether the following are conformable or not and explain why they are conformable or not. (2 points)

A * B: They are conformable because the number of columns of matrix A is equal to the number of rows of matrix B.

A * C: They are not conformable because the number of columns of matrix A is not equal to the number of rows of matrix C.

B * C: They are conformable because the number of columns of matrix B is equal to the number of rows of matrix C.

D * B: They are not conformable because the number of columns of matrix D is not equal to the number of rows of matrix B.
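The general rule behind these answers can be written with the dimensions under each matrix: the inner dimensions must match, and the outer dimensions give the size of the product.

\[ \underset{(m \times n)}{P} \; \underset{(n \times q)}{Q} = \underset{(m \times q)}{PQ} \]

Here \(P\), \(Q\), \(m\), \(n\), and \(q\) are generic placeholders. Taking the dimensions of A (3 × 2), B (2 × 3), and C (3 × 3) from the matrices written out in part b:

\[ \underset{(3 \times 2)}{A} \; \underset{(2 \times 3)}{B} \rightarrow 3 \times 3, \qquad \underset{(2 \times 3)}{B} \; \underset{(3 \times 3)}{C} \rightarrow 2 \times 3, \qquad \underset{(3 \times 2)}{A} \; \underset{(3 \times 3)}{C}: \; 2 \neq 3 \]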

b. For the matrices which are conformable in part a, complete the matrix multiplication. (2 points)

\[ A * B = \begin{bmatrix} 1 & 4 \\ 6 & 3 \\ 5 & 6 \end{bmatrix} \times \begin{bmatrix} 2 & 1 & 3 \\ 8 & 3 & 2 \\ \end{bmatrix} =\begin{bmatrix} (1 \times 2 + 4 \times 8 ) & (1 \times 1 + 4 \times 3 ) & (1 \times 3 + 4 \times 2 ) \\ (6 \times 2 + 3 \times 8 ) & (6 \times 1 + 3 \times 3) & (6 \times 3 + 3 \times 2) \\ (5 \times 2 + 6 \times 8 ) & (5 \times 1 + 6 \times 3) & (5 \times 3 + 6 \times 2) \end{bmatrix} = \begin{bmatrix} 34 & 13 & 11 \\ 36 & 15 & 24 \\ 58 & 23 & 27 \end{bmatrix} \]

\[ B * C = \begin{bmatrix} 2 & 1 & 3 \\ 8 & 3 & 2 \end{bmatrix} \times \begin{bmatrix} 4 & 3 & 7 \\ 3 & 3 & 0 \\ 7 & 0 & 2 \end{bmatrix} = \begin{bmatrix} (2 \times 4 + 1 \times 3 + 3 \times 7) & (2 \times 3 + 1 \times 3 + 3 \times 0) & (2 \times 7 + 1 \times 0 + 3 \times 2) \\ (8 \times 4 + 3 \times 3 + 2 \times 7) & (8 \times 3 + 3 \times 3 + 2 \times 0) & (8 \times 7 + 3 \times 0 + 2 \times 2) \end{bmatrix} = \begin{bmatrix} 32 & 9 & 20 \\ 55 & 33 & 60 \end{bmatrix} \]
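The hand calculations can be checked in R with the matrix product operator `%*%`. This is a minimal sketch. The helper `conformable()` is a hypothetical convenience function, not a base R function, and matrix D is omitted because it is not written out in this text.

```r
# Matrices A, B, and C as written out in part b
A <- matrix(c(1, 4,
              6, 3,
              5, 6), nrow = 3, byrow = TRUE)
B <- matrix(c(2, 1, 3,
              8, 3, 2), nrow = 2, byrow = TRUE)
C <- matrix(c(4, 3, 7,
              3, 3, 0,
              7, 0, 2), nrow = 3, byrow = TRUE)

# Hypothetical helper: columns of the left matrix must equal rows of the right
conformable <- function(left, right) ncol(left) == nrow(right)

conformable(A, B)
conformable(A, C)
conformable(B, C)

# Matrix products for the conformable pairs
AB <- A %*% B
BC <- B %*% C
AB
BC
```

Note that `*` in R multiplies element by element. The matrix product used in this question requires `%*%`.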

Question 7

Given the following model:

Write the (3) regression equations for the model above in the following matrix form:

\[ \textit{Dependent} = \textit{Path Coefficients} \times \textit{Independent Variables} + \\ \textit{Path Coefficients} \times \textit{Dependent Variables} + \textit{Residuals} \]

Note that all path coefficients should be written in standardized notation. There are 3 dependent variables and 2 independent variables in the model above. For your path coefficients matrices, place the correct path variables which need to be estimated. Think of each column to represent where the path is coming from and each row representing where the path is leading to. If there is no path in that direction, place a 0 in the matrix. Be sure that the matrices you write are conformable. (2 points)

The structural equations can be written as:

\(X3 = \beta_{31} X1 + \beta_{32} X2 + e_3\)

\(X4 = \beta_{41} X1 + \beta_{42} X2 + \beta_{43} X3 + e_4\)

\(X5 = \beta_{51} X1 + \beta_{52} X2 + \beta_{53} X3 + \beta_{54} X4 +e_5\)

Matrix Definitions:

\[ \text{Dependent} =\begin{bmatrix} X3 \\ X4 \\ X5 \end{bmatrix} \] \[ \text{Independent} =\begin{bmatrix} X1 \\ X2 \end{bmatrix} \]

Dependent-to-Dependent Path Matrix:

\[ \begin{bmatrix} 0&0&0\\ \beta_{43}&0&0\\ \beta_{53}&\beta_{54}&0\\ \end{bmatrix} \]

Independent-to-Dependent Path Matrix:

\[ \begin{bmatrix} \beta_{31}&\beta_{32}\\ \beta_{41}&\beta_{42}\\ \beta_{51}&\beta_{52}\\ \end{bmatrix} \]

Residuals:

\[ \begin{bmatrix} e_3\\ e_4\\ e_5\\ \end{bmatrix} \]

Final Matrix Equation:

\[ \begin{bmatrix} X3\\ X4\\ X5 \end{bmatrix}= \begin{bmatrix} \beta_{31}&\beta_{32}\\ \beta_{41}&\beta_{42}\\ \beta_{51}&\beta_{52}\\ \end{bmatrix} \begin{bmatrix} X1 \\ X2 \end{bmatrix}+ \begin{bmatrix} 0&0&0\\ \beta_{43}&0&0\\ \beta_{53}&\beta_{54}&0\\ \end{bmatrix} \begin{bmatrix} X3 \\ X4 \\ X5 \end{bmatrix} + \begin{bmatrix} e_3\\ e_4\\ e_5\\ \end{bmatrix} \]

Note: \(\beta_{ij}\) represents the standardized path coefficient from variable \(j\) to variable \(i\). \(e_3\), \(e_4\), and \(e_5\) are the residuals for \(X3\), \(X4\), and \(X5\), respectively.
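To check that the matrix equation is conformable and reproduces the three structural equations, the sketch below plugs in made-up numbers. All coefficient, predictor, and residual values are hypothetical, and the names `Gamma` (independent-to-dependent paths) and `Beta` (dependent-to-dependent paths) are labels chosen here for illustration.

```r
# Hypothetical path coefficients (illustrative values only)
Gamma <- matrix(c(0.3, 0.2,
                  0.1, 0.4,
                  0.2, 0.1),
                nrow = 3, byrow = TRUE)   # rows: X3, X4, X5; columns: X1, X2
Beta <- matrix(c(0,   0,   0,
                 0.5, 0,   0,
                 0.3, 0.2, 0),
               nrow = 3, byrow = TRUE)    # rows and columns: X3, X4, X5

x <- matrix(c(1.0, -0.5), ncol = 1)        # hypothetical values of X1 and X2
e <- matrix(c(0.1, -0.2, 0.05), ncol = 1)  # hypothetical residuals e3, e4, e5

# Evaluate the three structural equations one at a time
X3 <- Gamma[1, 1] * x[1] + Gamma[1, 2] * x[2] + e[1]
X4 <- Gamma[2, 1] * x[1] + Gamma[2, 2] * x[2] + Beta[2, 1] * X3 + e[2]
X5 <- Gamma[3, 1] * x[1] + Gamma[3, 2] * x[2] +
  Beta[3, 1] * X3 + Beta[3, 2] * X4 + e[3]
y <- matrix(c(X3, X4, X5), ncol = 1)

# Matrix form: (3 x 2)(2 x 1) + (3 x 3)(3 x 1) + (3 x 1) gives a 3 x 1 result
rhs <- Gamma %*% x + Beta %*% y + e
all.equal(y, rhs)
```

The zeros on and above the diagonal of `Beta` mean that no dependent variable predicts itself or an earlier variable, which is why the equations can be evaluated in order from X3 to X5.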