Assignment 1 - STA 8301

1) Suppose our multivariate data have sample covariance matrix S

# Load necessary libraries
library(MASS)   # For the ginv function
# mahalanobis() is part of the stats package, which R loads by default

# Define the covariance matrix
my.S <- matrix(c(2, -3, 2, -3, 6, 4, 2, 4, 3), byrow = TRUE, nrow = 3, ncol = 3)

# Display the matrix
my.S

##      [,1] [,2] [,3]
## [1,]    2   -3    2
## [2,]   -3    6    4
## [3,]    2    4    3

1.a) Based on this covariance matrix S, how many columns (variables) does the original data matrix have?

1.b) Can you tell how many rows the original data matrix has?

1.c) Find and write (or print) the inverse of S.

# S is nonsingular (its determinant is -95), so solve() returns the ordinary inverse
S_inverse <- solve(my.S)
S_inverse

##             [,1]        [,2]        [,3]
## [1,] -0.02105263 -0.17894737  0.25263158
## [2,] -0.17894737 -0.02105263  0.14736842
## [3,]  0.25263158  0.14736842 -0.03157895
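
One way to check an inverse is to multiply it back against the original matrix and compare the product with the identity. The sketch below is only illustrative: it redefines the matrix under the hypothetical name `S_check` so that it runs on its own, and the other variable names are likewise assumptions.

```r
# Same entries as my.S above, redefined so this chunk runs on its own
S_check <- matrix(c(2, -3, 2, -3, 6, 4, 2, 4, 3), byrow = TRUE, nrow = 3)

# A nonzero determinant means the ordinary inverse exists
det_S <- det(S_check)

# solve() with a single argument returns the matrix inverse
S_check_inv <- solve(S_check)

# The product should equal the 3 x 3 identity up to rounding error
product <- S_check %*% S_check_inv
isTRUE(all.equal(product, diag(3)))
```

`all.equal()` compares with a small numerical tolerance, which is usually preferable to `==` for floating-point matrices.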

1.d) Find and write (or print) the correlation matrix for this data set.

# my.S is already a covariance matrix, so convert it to a correlation matrix directly
corr <- cov2cor(my.S)
corr

##            [,1]       [,2]      [,3]
## [1,]  1.0000000 -0.8660254 0.8164966
## [2,] -0.8660254  1.0000000 0.9428090
## [3,]  0.8164966  0.9428090 1.0000000

2) Suppose a multivariate data set has sample covariance matrix S with rows (16, -2, 4), (-2, 9, -1), and (4, -1, 25).

2.a) Determine the matrix D^{-1/2}, where D^{-1/2} is defined in the Lecture 1 notes.

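S <- matrix(c(16, -2, 4, -2, 9, -1, 4, -1, 25), nrow = 3, ncol = 3, byrow = TRUE)

# Reciprocals of the standard deviations (square roots of the diagonal of S)
diag_ele <- 1 / sqrt(diag(S))

# D^{-1/2} is the diagonal matrix with these values on its diagonal
D <- diag(diag_ele)
D

##      [,1]      [,2] [,3]
## [1,] 0.25 0.0000000  0.0
## [2,] 0.00 0.3333333  0.0
## [3,] 0.00 0.0000000  0.2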

2.b) Calculate and print the sample correlation matrix R for this data set.

# S is already a covariance matrix, so rescale it to a correlation matrix
correlation_matrix <- cov2cor(S)
correlation_matrix

##            [,1]        [,2]        [,3]
## [1,]  1.0000000 -0.16666667  0.20000000
## [2,] -0.1666667  1.00000000 -0.06666667
## [3,]  0.2000000 -0.06666667  1.00000000
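
The link between S, D^{-1/2}, and R can be checked numerically. The sketch below assumes the usual definition, in which D^{-1/2} is the diagonal matrix of reciprocal standard deviations and R = D^{-1/2} S D^{-1/2}; the Lecture 1 notes are not reproduced here, so treat that definition as an assumption. The names `S2`, `D_inv_sqrt`, `R_manual`, and `R_builtin` are illustrative.

```r
# Covariance matrix from problem 2, redefined so the chunk is self-contained
S2 <- matrix(c(16, -2, 4, -2, 9, -1, 4, -1, 25), nrow = 3, byrow = TRUE)

# D^{-1/2}: diagonal matrix of reciprocal standard deviations
D_inv_sqrt <- diag(1 / sqrt(diag(S2)))

# R = D^{-1/2} S D^{-1/2}
R_manual <- D_inv_sqrt %*% S2 %*% D_inv_sqrt

# cov2cor() performs the same rescaling in one call
R_builtin <- cov2cor(S2)

# The two results should agree up to rounding error
isTRUE(all.equal(R_manual, R_builtin))
```

Each off-diagonal entry is simply s_ij / sqrt(s_ii * s_jj); for example, the (1, 2) entry is -2 / (4 * 3).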

3) The air pollution data set is provided (airpoll.txt). See also Table 2.1, Everitt (2005).

For this problem, we will focus only on the first 16 observations (cities).

You can read the data into R (as a data frame) with the code:

airpol.full <- read.table("airpoll.txt", header = TRUE)
city.names <- as.character(airpol.full[1:16, 1])
airpol.data.sub <- airpol.full[1:16, 2:8]

Perform your analysis on the ‘airpol.data.sub’ subset.

3.a) Use R to calculate the sample covariance matrix and the sample correlation matrix for this data subset.

Identify which pairs of variables seem to be strongly associated. Write a paragraph describing the nature (strength and direction) of the relationship between these variable pairs.

library(DT)
# Calculate the sample covariance matrix
# (computed from every row of airpol.full, not only the first 16 cities)
covariance_matrix <- cov(airpol.full[, 2:ncol(airpol.full)])
datatable(covariance_matrix)

# Calculate the sample correlation matrix (again from every row of airpol.full)
correlation_matrix <- cor(airpol.full[, 2:ncol(airpol.full)])
datatable(correlation_matrix)

Dataset Overview:

The correlation matrix summarizes the pairwise linear relationships among the variables in the airpol data set.

The variables included in the analysis are Rainfall, Education, Popden (Population Density), Nonwhite, NOX (Nitrogen Oxides), SO2 (Sulfur Dioxide), and Mortality.

Key Findings:

Rainfall and Mortality: There is a moderately strong positive correlation between Rainfall and Mortality (correlation coefficient = 0.509). Regions with higher rainfall tend to have higher mortality rates. Possible explanations include the influence of weather on health conditions and infrastructure vulnerabilities during heavy rainfall.
Education and Mortality: There is a moderately strong negative correlation between Education and Mortality (correlation coefficient = -0.510). Areas with higher educational attainment tend to have lower mortality rates. This is consistent with existing research on the effect of education on health behaviors, access to healthcare, and socio-economic status.
Nonwhite and Mortality: There is a moderately strong positive correlation between the proportion of Nonwhite population and Mortality (correlation coefficient = 0.644), the strongest association with Mortality in the matrix. Areas with a higher percentage of Nonwhite residents tend to have higher mortality rates. This pattern is consistent with health disparities related to race or ethnicity, which may reflect unequal access to healthcare and differences in socio-economic factors.

Other Relationships:

Several other notable correlations include a moderate negative correlation between Rainfall and Education (-0.490), and a moderate positive correlation between NOX and SO2 (0.409). These relationships may warrant further investigation to understand their underlying causes and implications.

Conclusion:

The correlation matrix analysis provides valuable insights into the interrelationships between various factors and mortality rates within the dataset. The findings highlight the importance of considering socio-economic, demographic, and environmental factors in understanding health outcomes.
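
Reading strong pairs off a correlation matrix by eye is error-prone, so it can help to extract them programmatically. The sketch below is not the analysis above: it uses simulated toy data in place of airpoll.txt, the variable names `x1`, `x2`, `x3` are hypothetical, and the 0.5 cutoff for "strong" is an assumed threshold.

```r
# Toy data standing in for the air pollution variables (hypothetical values)
set.seed(1)
toy <- data.frame(x1 = rnorm(30))
toy$x2 <- toy$x1 + rnorm(30, sd = 0.3)   # built to be strongly related to x1
toy$x3 <- rnorm(30)                      # independent noise

R_toy <- cor(toy)

# Look only at the upper triangle so each pair is reported once
threshold <- 0.5   # assumed cutoff for "strong"; adjust as needed
idx <- which(abs(R_toy) >= threshold & upper.tri(R_toy), arr.ind = TRUE)

strong_pairs <- data.frame(
  var1 = rownames(R_toy)[idx[, "row"]],
  var2 = colnames(R_toy)[idx[, "col"]],
  r    = R_toy[idx]
)

# Sort by absolute correlation, strongest first
strong_pairs[order(-abs(strong_pairs$r)), ]
```

Applied to a real correlation matrix, the same pattern would list each variable pair with its coefficient, which makes the direction (sign) and strength (magnitude) easy to report.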

3.b) Use R to calculate the distance matrix for these observations (after scaling the variables by dividing each variable by its standard deviation).

Write a paragraph describing some of the most similar pairs of cities and some of the most different pairs of cities, giving evidence from the distance matrix.

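# Scale the variables (excluding the city names).
# Note: with center = FALSE, scale() divides each column by its root mean square,
# not by its standard deviation.
pollution_data <- scale(airpol.full[, -1], center = FALSE)
# Calculate the Euclidean distance matrix
dis <- dist(pollution_data)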

# Extract city names
city_names <- airpol.full[, 1]
dis_matrix <- as.matrix(dis)
rownames(dis_matrix) <- city_names
colnames(dis_matrix) <- city_names
datatable(dis_matrix)

# Find the indices of the most similar and most different pairs of cities
# (each pair appears twice because the distance matrix is symmetric)
similar_pairs_indices <- which(dis_matrix == min(dis_matrix[dis_matrix > 0]), arr.ind = TRUE)
different_pairs_indices <- which(dis_matrix == max(dis_matrix), arr.ind = TRUE)

# Extract city names for the most similar and most different pairs
most_similar_cities <- city_names[similar_pairs_indices]
most_different_cities <- city_names[different_pairs_indices]

# Extract distances for the most similar and most different pairs
most_similar_distance <- dis_matrix[similar_pairs_indices][1]
most_different_distance <- dis_matrix[different_pairs_indices][1]

# Print the most different cities with the corresponding distance

most_different_distance

## [1] 6.48197

most_different_cities

## [1] "miamiFL"  "losangCA" "losangCA" "miamiFL"

Based on the distance matrix derived from the scaled variables, the most similar pair of cities is Dayton, Ohio (daytonOH) and Columbus, Ohio (colombOH), with a distance of only 0.1342168. Such a small distance indicates that the two cities have nearly identical profiles across the scaled variables, suggesting that they experience similar pollutant levels and share comparable environmental conditions.

In contrast, the most different pair of cities is Miami, Florida (miamiFL) and Los Angeles, California (losangCA), separated by a distance of 6.48197, the largest in the matrix. This large distance reflects a substantial disparity in their measured profiles, which could stem from differences in industrial activity and traffic patterns: the pollution sources and dispersal patterns in Miami are likely to differ from those of the sprawling urban environment and higher traffic density of Los Angeles.
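
The question asks for each variable to be divided by its standard deviation, and in R that requires passing the standard deviations to `scale()` explicitly. The sketch below shows one way to do this and to pick out the closest and farthest pairs once each; the toy values and the names `cityA` to `cityD` are hypothetical stand-ins for the airpoll.txt rows.

```r
# Toy data standing in for the city measurements (hypothetical values)
toy <- data.frame(
  a = c(10, 12, 30, 11),
  b = c(100, 105, 300, 98)
)
rownames(toy) <- c("cityA", "cityB", "cityC", "cityD")

# Divide each column by its standard deviation, without centering.
# scale(x, center = FALSE) on its own would divide by the root mean square.
toy_scaled <- scale(toy, center = FALSE, scale = apply(toy, 2, sd))

# Euclidean distances between rows, as a full symmetric matrix
d <- as.matrix(dist(toy_scaled))

# Mask the diagonal and lower triangle so each pair is considered once
d[!upper.tri(d)] <- NA

closest  <- which(d == min(d, na.rm = TRUE), arr.ind = TRUE)
farthest <- which(d == max(d, na.rm = TRUE), arr.ind = TRUE)

closest_pair  <- c(rownames(d)[closest[1, "row"]],  colnames(d)[closest[1, "col"]])
farthest_pair <- c(rownames(d)[farthest[1, "row"]], colnames(d)[farthest[1, "col"]])
```

Masking the lower triangle avoids listing every pair twice, and using the row and column indices separately keeps the two city names of a pair together.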

3.c) Give a plot that will help assess whether this data set comes from a multivariate normal distribution. What is your conclusion, based on the plot?

# Ensure column 1 contains the city names
city_names <- airpol.full[, 1]

# Scale the pollution metrics (excluding the city names)
pollution_data <- scale(airpol.full[, -1], center = FALSE)

# Calculate the squared Mahalanobis distances
center <- colMeans(pollution_data)
cov_matrix <- cov(pollution_data)
# Get the inverse of the covariance matrix
inv_cov_matrix <- ginv(cov_matrix)
# inverted = TRUE tells mahalanobis() that the supplied matrix is already inverted
mahalanobis_distances <- mahalanobis(pollution_data, center, inv_cov_matrix, inverted = TRUE)

# Calculate the theoretical quantiles of the chi-squared distribution
df <- ncol(pollution_data)
theoretical_quantiles <- qchisq((1:nrow(pollution_data) - 0.5) / nrow(pollution_data), df)

# Sort the Mahalanobis distances
sorted_mahalanobis_distances <- sort(mahalanobis_distances)

# Create the Q-Q plot
qqplot(theoretical_quantiles, sorted_mahalanobis_distances,
       main = "Chi-Squared Q-Q Plot for Multivariate Normality",
       xlab = "Theoretical Quantiles of Chi-Squared Distribution",
       ylab = "Sorted Mahalanobis Distances")
abline(0, 1, col = "red")

Discussion

The points deviate substantially and systematically from the reference line, curving away from it. This suggests that the data set may not follow a multivariate normal distribution.
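
For comparison, it can be useful to see what the same kind of plot looks like when the data really are multivariate normal. The sketch below uses simulated data, not airpoll.txt, with an assumed sample size of 200 and 3 variables. It also shows the two equivalent ways of calling `mahalanobis()`: with the covariance matrix itself, or with its inverse together with `inverted = TRUE`.

```r
set.seed(42)
p <- 3     # number of variables (assumed for illustration)
n <- 200   # number of observations (assumed for illustration)

# Independent standard normal columns form a multivariate normal sample
sim <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Squared Mahalanobis distances; by default mahalanobis() inverts cov itself
d2 <- mahalanobis(sim, colMeans(sim), cov(sim))

# Passing an already inverted matrix requires inverted = TRUE
d2_alt <- mahalanobis(sim, colMeans(sim), solve(cov(sim)), inverted = TRUE)

# Chi-squared quantiles with p degrees of freedom
q <- qchisq((seq_len(n) - 0.5) / n, df = p)

# Under multivariate normality the points should fall close to the line y = x
plot(q, sort(d2),
     xlab = "Chi-squared quantiles",
     ylab = "Ordered squared Mahalanobis distances")
abline(0, 1, col = "red")
```

With normal data the points generally track the red line closely, with some scatter in the upper tail, so a pronounced systematic curve in a real data set stands out by contrast.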