For the Multivariate Analysis course assessment, we calculate the eigenvalues, eigenvectors, covariance matrix, and correlation matrix of a dataset taken from a credit card approval process.
credit_card_approval_dataset <- read.csv("credit_card_approval_dataset.csv")
credit_card_approval_dataset
## Gender Age Debt Married BankCustomer Industry YearsEmployed
## 1 1 30.83 0.000 1 1 Industrials 1.25
## 2 0 58.67 4.460 1 1 Materials 3.04
## 3 0 24.50 0.500 0 1 Materials 1.50
## 4 1 27.83 1.540 1 1 Industrials 3.75
## 5 1 20.17 5.625 1 1 Industrials 1.71
## PriorDefault Employed CreditScore DriversLicense Citizen ZipCode Income
## 1 1 1 1 0 ByBirth 202 0
## 2 1 1 6 0 ByBirth 43 560
## 3 1 0 0 0 ByBirth 280 824
## 4 1 1 5 1 ByBirth 100 3
## 5 1 0 0 0 ByOtherMeans 120 0
## Approved
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
selected_columns <- credit_card_approval_dataset[, c("Age", "Debt", "YearsEmployed", "Income")]
print(selected_columns)
## Age Debt YearsEmployed Income
## 1 30.83 0.000 1.25 0
## 2 58.67 4.460 3.04 560
## 3 24.50 0.500 1.50 824
## 4 27.83 1.540 3.75 3
## 5 20.17 5.625 1.71 0
eigen_values_vectors <- eigen(cor(selected_columns))
print(eigen_values_vectors$values)
## [1] 1.6257301 1.2116374 0.7801983 0.3824342
eigen_values <- eigen_values_vectors$values
Each eigenvalue is the variance captured by one principal component (PC), ordered from PC1 to PC4. The percentage of total variance explained by each PC is computed below.
total_variance <- sum(eigen_values)
explained_variance <- (eigen_values / total_variance) * 100
print(explained_variance)
## [1] 40.643254 30.290936 19.504956 9.560854
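Each percentage is the share of the total variance captured by the corresponding PC. PC1 accounts for about 40.6%, PC2 for 30.3%, PC3 for 19.5%, and PC4 for 9.6%, so the first two components together retain about 70.9% of the variance.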
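The running total of these percentages is often used to decide how many components to keep. The following is a minimal sketch of that calculation, assuming the five rows printed above are the complete dataset; the data frame is rebuilt by hand so the snippet runs on its own, and `cumulative_variance` is a name introduced here for illustration.

```r
# Rebuild the four numeric columns from the five rows printed above
# (assumption: these five rows are the whole dataset)
selected_columns <- data.frame(
  Age           = c(30.83, 58.67, 24.50, 27.83, 20.17),
  Debt          = c(0.000, 4.460, 0.500, 1.540, 5.625),
  YearsEmployed = c(1.25, 3.04, 1.50, 3.75, 1.71),
  Income        = c(0, 560, 824, 3, 0)
)

eigen_values <- eigen(cor(selected_columns))$values

# Eigenvalues of a correlation matrix sum to the number of variables (here 4)
explained_variance <- eigen_values / sum(eigen_values) * 100

# Running total: variance retained by the first k components together
cumulative_variance <- cumsum(explained_variance)
names(cumulative_variance) <- paste0("PC", seq_along(cumulative_variance))
print(round(cumulative_variance, 2))
```

A common rule of thumb is to keep enough components to reach a chosen threshold of cumulative variance; the threshold itself is a modelling choice, not something the data dictates.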
print(eigen_values_vectors$vectors)
## [,1] [,2] [,3] [,4]
## [1,] 0.6669885 -0.2767578 -0.04870403 0.6900431
## [2,] 0.4325557 0.4214898 0.77292363 -0.1945018
## [3,] 0.5743585 0.2815327 -0.59684810 -0.4843800
## [4,] 0.1952799 -0.8163889 0.20973632 -0.5013838
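Each column of this matrix is an eigenvector, that is, one PC, and each row corresponds to a variable in the order Age, Debt, YearsEmployed, Income. The entries (loadings) show how strongly each variable contributes to each PC. For example, row 1 shows that ‘Age’ loads heavily on PC1 (0.67) and PC4 (0.69) but only weakly on PC2 (-0.28) and PC3 (-0.05). The sign of a whole eigenvector is arbitrary, so only the relative signs within a column are meaningful.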
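The unlabeled matrix is easier to read once variable and component names are attached. The sketch below does that and cross-checks the result against `prcomp()` on standardized data, which should agree with the eigenvectors of the correlation matrix up to sign. It assumes the five printed rows are the complete dataset; `loadings` and `pca` are names introduced here for illustration.

```r
# Rebuild the four numeric columns from the five rows printed above
# (assumption: these five rows are the whole dataset)
selected_columns <- data.frame(
  Age           = c(30.83, 58.67, 24.50, 27.83, 20.17),
  Debt          = c(0.000, 4.460, 0.500, 1.540, 5.625),
  YearsEmployed = c(1.25, 3.04, 1.50, 3.75, 1.71),
  Income        = c(0, 560, 824, 3, 0)
)

# Eigenvectors of the correlation matrix, labeled by variable and component
loadings <- eigen(cor(selected_columns))$vectors
rownames(loadings) <- colnames(selected_columns)
colnames(loadings) <- paste0("PC", seq_len(ncol(loadings)))
print(round(loadings, 3))

# Each eigenvector has unit length, so squared loadings in a column sum to 1
print(colSums(loadings^2))

# Cross-check: PCA on centered, scaled data gives the same loadings up to sign,
# so the difference of absolute values should be close to zero everywhere
pca <- prcomp(selected_columns, scale. = TRUE)
print(round(abs(pca$rotation) - abs(loadings), 6))
```

Comparing absolute values sidesteps the sign ambiguity: two routines can return the same component with opposite signs and both are correct.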
cov_matrix <- cov(selected_columns)
print(cov_matrix)
## Age Debt YearsEmployed Income
## Age 231.361400 9.345663 6.999375 2046.9725
## Debt 9.345663 6.187675 0.605225 -112.3137
## YearsEmployed 6.999375 0.605225 1.182050 -42.7750
## Income 2046.972500 -112.313750 -42.775000 151957.8000
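The diagonal of the covariance matrix holds the variance of each variable, and the off-diagonal entries hold the covariances between pairs of variables. ‘YearsEmployed’ has the smallest variance (1.18), indicating that most individuals have a similar number of working years. ‘Debt’ also has a low variance (6.19), meaning most individuals have similar debt values. ‘Age’ has a wider spread (231.36), reflecting more variation in the age of individuals. ‘Income’ has by far the largest variance (151,957.80), suggesting a wide range of income levels and the potential presence of outliers. Because the variables are measured in different units, these variances are not directly comparable in magnitude, which is why the eigen decomposition above is based on the correlation matrix rather than the covariance matrix.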
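To compare spread across variables with different units, the variances can be converted to standard deviations and to a unit-free coefficient of variation. The sketch below also checks that rescaling the covariance matrix recovers the correlation matrix. It assumes the five printed rows are the complete dataset; `spread_summary` is a name introduced here for illustration.

```r
# Rebuild the four numeric columns from the five rows printed above
# (assumption: these five rows are the whole dataset)
selected_columns <- data.frame(
  Age           = c(30.83, 58.67, 24.50, 27.83, 20.17),
  Debt          = c(0.000, 4.460, 0.500, 1.540, 5.625),
  YearsEmployed = c(1.25, 3.04, 1.50, 3.75, 1.71),
  Income        = c(0, 560, 824, 3, 0)
)

cov_matrix <- cov(selected_columns)

# Variances sit on the diagonal; their square roots are the standard deviations
variances <- diag(cov_matrix)
std_devs  <- sqrt(variances)

# Coefficient of variation: standard deviation relative to the mean (unit-free)
spread_summary <- cbind(
  variance = variances,
  sd       = std_devs,
  cv       = std_devs / colMeans(selected_columns)
)
print(round(spread_summary, 3))

# Dividing each covariance by the two standard deviations gives the correlation
print(all.equal(cov2cor(cov_matrix), cor(selected_columns)))
```

The coefficient of variation is only meaningful for variables with a positive mean, which holds for all four columns here.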
cor_matrix <- cor(selected_columns)
print(cor_matrix)
## Age Debt YearsEmployed Income
## Age 1.0000000 0.2470022 0.4232489 0.3452272
## Debt 0.2470022 1.0000000 0.2237872 -0.1158264
## YearsEmployed 0.4232489 0.2237872 1.0000000 -0.1009278
## Income 0.3452272 -0.1158264 -0.1009278 1.0000000
Correlation ranges from -1 to 1, and its absolute value measures the strength of the linear relationship between two variables. An absolute value below 0.2 indicates a very weak relationship, from 0.2 to below 0.4 a weak one, from 0.4 to below 0.6 a moderate one, from 0.6 to below 0.8 a strong one, and from 0.8 to below 1 a very strong one. A value of 1 represents a perfect positive correlation and -1 a perfect negative correlation; the sign gives the direction of the relationship, and values closer to 0 indicate weaker relationships. For example, the correlation of 0.42 between Age and YearsEmployed is moderate, while the correlation of -0.12 between Debt and Income is very weak.
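The scale described above can be applied to every pair of variables at once. The following is a minimal sketch, assuming the five printed rows are the complete dataset; `classify_correlation` is a hypothetical helper written for this illustration, not a function from base R or any package.

```r
# Rebuild the four numeric columns from the five rows printed above
# (assumption: these five rows are the whole dataset)
selected_columns <- data.frame(
  Age           = c(30.83, 58.67, 24.50, 27.83, 20.17),
  Debt          = c(0.000, 4.460, 0.500, 1.540, 5.625),
  YearsEmployed = c(1.25, 3.04, 1.50, 3.75, 1.71),
  Income        = c(0, 560, 824, 3, 0)
)
cor_matrix <- cor(selected_columns)

# Hypothetical helper: map |r| to the strength labels used in the text.
# Intervals are closed on the left, so exactly 0.2 counts as "weak".
classify_correlation <- function(r) {
  as.character(cut(
    abs(r),
    breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1),
    labels = c("very weak", "weak", "moderate", "strong", "very strong"),
    right = FALSE,
    include.lowest = TRUE
  ))
}

# Each pair appears twice in the symmetric matrix; keep the upper triangle only
pairs <- which(upper.tri(cor_matrix), arr.ind = TRUE)
pair_strength <- data.frame(
  var1     = rownames(cor_matrix)[pairs[, "row"]],
  var2     = colnames(cor_matrix)[pairs[, "col"]],
  r        = round(cor_matrix[pairs], 3),
  strength = classify_correlation(cor_matrix[pairs])
)
print(pair_strength)
```

The cut-offs are a convention rather than a statistical test, and with only a handful of observations even a moderate correlation may not be reliable.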