The Applicability of Principal Component Analysis as Tested through Visualizing and Interpreting Data

Introduction

We tested Principal Component Analysis (PCA), an unsupervised machine learning technique, on COVID-19 data from many nations to learn about dimension reduction and how it can help in displaying and interpreting data. In its simplest form, PCA summarizes a data set with fewer variables by building new variables, the principal components (PCs), as linear combinations of the original ones. It can be used to observe trends, outliers, and clusters, and it can also serve as a preliminary step in preparing data for further machine learning. The components are ordered by how much of the data's variance they capture, so in the simplest case the first two, PC1 and PC2, are enough to plot the data in two dimensions.

Environment Setup

A variety of packages were used to ensure a smooth analysis. The factoextra package provides easy-to-use functions to extract and visualize the results of a PCA. The softImpute package was used to impute missing values in a matrix (PCA requires a complete matrix) through a process called nuclear-norm regularization. The plotly package was loaded for standard graphing and visualization. Finally, the dplyr package provided the data manipulation functions used throughout the test.

# install.packages("factoextra")
library(factoextra)
# install.packages("softImpute")
library(softImpute)
# install.packages("plotly")
library(plotly)
# install.packages("dplyr")
library(dplyr)

We then loaded the data. The data frame we used was stored in the file country_daily_data.RDS. It was prepared by combining daily COVID-19 data with relevant country-level socioeconomic and political factors.

For simplicity, we focused on the data for February 20, 2022 only.

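# Load the prepared data (assumes the file is in the working directory).
covid <- readRDS("country_daily_data.RDS")

Feb20 <- covid %>%
  filter(date == "2022-02-20") %>%
  filter(!is.na(fully_vaccinated_pct))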

After filtering, we selected variables that both caught our attention and seemed important. PCA can only use numerical variables in its calculations; however, a few categorical variables were kept for grouping the points in visualizations and for filling in missing values.

pca_df <- Feb20 %>%
  select(iso3c,             # categorical
         country,           # categorical
         subregion,         # categorical
         wb_income_group,   # categorical
         hospital_beds_per_thousand,
         population,
         population_density,
         median_age,
         aged_65_older,
         aged_70_older,
         life_expectancy,
         vdem_freedom_of_expression_score,
         vdem_liberal_democracy_score,
         boix_democracy_yes_no,
         boix_democracy_duration_years,
         freedom_house_civil_liberties,
         freedom_house_political_rights,
         freedom_house_freedom_score,
         polity_democracy_score,
         wdi_prop_less_2_usd_day,
         wdi_gdppc_nominal,
         wdi_gdppc_ppp,
         wdi_urban_population_pct,
         wdi_urban_pop_1m_cities_pct,
         wdi_gini_index,
         wdi_pop_under_15)

To ensure an accurate result in our trial, any country missing 10 or more of the selected variables was excluded. Likewise, any variable missing 10 or more values was excluded (here, wdi_urban_pop_1m_cities_pct and wdi_prop_less_2_usd_day). This was done because prcomp cannot run on data that contains missing values, and records with that many gaps are too incomplete to fill in reliably.

# Keep countries with fewer than 10 missing values, then drop the two sparse variables.
clean_df <- subset(pca_df[rowSums(is.na(pca_df)) < 10, ],
                   select = -c(wdi_urban_pop_1m_cities_pct, wdi_prop_less_2_usd_day))
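
Before choosing such thresholds, it helps to count the missing values per row and per column. The following is a minimal sketch on a toy data frame; the data, the names toy, na_per_row, na_per_col, and kept, and the cutoff of 3 are all illustrative assumptions and are not taken from the COVID data.

```r
# Toy data frame (hypothetical values) with scattered missing entries.
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  x1 = c(1.0, NA, 3.0, NA),
  x2 = c(NA, NA, 2.5, 4.0),
  x3 = c(0.2, NA, 0.4, 0.1)
)

# Number of missing values per row (country) and per column (variable).
na_per_row <- rowSums(is.na(toy))
na_per_col <- colSums(is.na(toy))

# Keep rows with fewer than 3 missing values; the cutoff is illustrative only.
kept <- toy[na_per_row < 3, ]
kept$country
```

Inspecting na_per_row and na_per_col first makes it easier to see how many records a given cutoff would discard.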

Because some values were still missing, each remaining gap was replaced with the median of its column within the same subregion and income group.

clean_df2 <- clean_df %>%
  group_by(subregion, wb_income_group) %>%
  mutate_if(is.numeric,
            function(x) ifelse(is.na(x), 
                               median(x, na.rm = TRUE), 
                               x))
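
The grouped median fill can be checked on a small example. This sketch uses across(), which current dplyr versions recommend in place of mutate_if(); the toy data and the names toy and filled are hypothetical. Note that a group with no observed values has no median to borrow, so its gap stays NA.

```r
library(dplyr)

# Toy data (hypothetical values): one grouping column, one numeric column with gaps.
toy <- tibble(
  group = c("a", "a", "a", "b", "b", "c"),
  value = c(1, NA, 3, 10, NA, NA)
)

# Replace each NA with the median of the observed values in the same group.
filled <- toy %>%
  group_by(group) %>%
  mutate(across(where(is.numeric),
                ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x))) %>%
  ungroup()

filled
```

Group "c" has only a missing value, so it remains NA after the fill. This is the situation in which a second pass with a coarser grouping becomes necessary.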

Even after this process, some values were still missing, because some combinations of subregion and income group had no observed values for a variable. We therefore filled those gaps with the median of the country's income group. We then removed the categorical variables and, in preparation for visualizing the data, set each row name to its country to ensure readability.

clean_df3 <- clean_df2 %>%
  group_by(wb_income_group) %>%
  mutate_if(is.numeric,
            function(x) ifelse(is.na(x), 
                               median(x, na.rm = TRUE), 
                               x))

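# Exclude the categorical variables.
num_df <- subset(ungroup(clean_df3), select = -c(iso3c, country, subregion, wb_income_group))
# Convert to a plain data frame (tibbles do not keep row names), then label rows by country.
num_df <- as.data.frame(num_df)
row.names(num_df) <- clean_df3$country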

Conducting and Analyzing the PCA

With the data prepared, we could now conduct the PCA using the prcomp function. Scaling was turned on by setting scale. = TRUE so that every variable is standardized to unit variance; otherwise, variables measured on large scales (such as population) would dominate the components regardless of their importance.

Feb20_pca <- prcomp(num_df, center = TRUE, scale. = TRUE)
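
The effect of scaling is easy to see on a small built-in data set. The sketch below uses USArrests, which ships with R, rather than the COVID data; the object names pca_raw and pca_scaled are illustrative.

```r
# USArrests ships with R; Assault has a much larger variance than the other columns.
pca_raw    <- prcomp(USArrests, center = TRUE, scale. = FALSE)
pca_scaled <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Squared loadings on PC1 sum to 1, so each value is that variable's share of PC1.
round(pca_raw$rotation[, 1]^2, 3)
round(pca_scaled$rotation[, 1]^2, 3)
```

Without scaling, PC1 is almost entirely the Assault column simply because its numbers are larger; with scaling, the shares are spread across the variables.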

Feb20_pca$x holds the computed coordinates (scores) of the countries on the resulting PCs. It has one row per country and one column per PC, so with more countries than variables it has the same dimensions as the input data frame.
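
To make the meaning of the x element concrete, the scores can be recomputed by hand: standardize the data, then project it onto the loadings stored in rotation. This is a minimal sketch on the built-in USArrests data; scores_manual is a name introduced here for illustration.

```r
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

dim(USArrests)  # observations x variables
dim(pca$x)      # observations x principal components

# Standardize each column, then project onto the loadings.
scores_manual <- scale(USArrests) %*% pca$rotation
all.equal(unname(scores_manual), unname(pca$x))
```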

We could now plot the countries in PC coordinates. We used only the first two PCs, as these often capture a large share of the variance and are enough for a first interpretation of the data; the scree plot produced by fviz_eig shows how much each PC explains.

fviz_eig(Feb20_pca)
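
The percentages shown in a scree plot can also be computed directly from the sdev element of a prcomp result, which is useful for checking how much of the variance the first two PCs really explain. The sketch below again uses the built-in USArrests data; var_explained and cumulative are illustrative names.

```r
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# The variance of each PC is its squared standard deviation.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cumulative    <- cumsum(var_explained)

# Cumulative percentage of variance explained by the first k PCs.
round(100 * cumulative, 1)
```

If the second entry of the cumulative vector is low, a two-dimensional plot hides much of the structure and more PCs should be inspected.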

fviz_pca_ind(Feb20_pca,
             col.ind = "cos2", # Color by the quality of representation
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     # Avoid text overlapping
             )

With PCA, we could also see how the original variables contributed to the PCs, both by visualizing those contributions and by printing them as numbers.

fviz_pca_var(Feb20_pca,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     # Avoid text overlapping
             )

res.var <- get_pca_var(Feb20_pca)
data.frame(variable_importance_PC1 = sort(res.var$contrib[,1], decreasing = TRUE))
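
For intuition, a variable's percentage contribution to a PC is its squared loading on that PC, since the squared loadings of each PC sum to 1. The base R sketch below on USArrests should agree with the contrib values reported by get_pca_var for a prcomp object, although that equivalence is an assumption about factoextra that is not checked here; contrib_pc1 is an illustrative name.

```r
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Percentage contribution of each variable to PC1: 100 times its squared loading.
contrib_pc1 <- 100 * pca$rotation[, 1]^2
sort(contrib_pc1, decreasing = TRUE)
```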

Imputing Missing Values

We also filled in missing values using the same kind of low-rank matrix decomposition that underlies PCA. This way of filling in missing values is called “imputing”. Here we started from clean_df (after removing poor-quality records but before replacing any NA with a median).

# Xna is the matrix with NA.
Xna <- data.matrix(scale(subset(clean_df, 
                                select = - c(iso3c, country, subregion, wb_income_group))))

# Xhat is the matrix with NAs filled by grouped medians. 
Xhat <- data.matrix(scale(subset(clean_df3, 
                                 select = - c(iso3c, country, subregion, wb_income_group))))

# Center and scale the rows and columns of Xna before fitting.
Xnas <- biScale(Xna)

fit <- softImpute(Xnas, type = "svd")
Ximp <- complete(Xna, fit)

Ximp[is.na(Xna)]  # Imputed missing values

Xhat[is.na(Xna)]  # Missing values filled with grouped medians

plot(Ximp[is.na(Xna)], Xhat[is.na(Xna)])

cor(Ximp[is.na(Xna)], Xhat[is.na(Xna)])
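
The idea behind this kind of imputation can be sketched in base R: repeatedly fill the gaps with the current low-rank estimate, recompute a truncated SVD, and optionally shrink the singular values. This is only an illustration of the principle, not the softImpute package's implementation; svd_impute_sketch and its arguments rank, lambda, and n_iter are hypothetical names, and the toy matrix is made up.

```r
# Hypothetical helper, not part of softImpute: iterative SVD imputation.
svd_impute_sketch <- function(X, rank = 1, lambda = 0, n_iter = 200) {
  miss <- is.na(X)
  Z <- matrix(0, nrow(X), ncol(X))            # current low-rank estimate
  for (i in seq_len(n_iter)) {
    filled <- ifelse(miss, Z, X)              # keep observed entries, plug in estimates elsewhere
    s <- svd(filled, nu = rank, nv = rank)
    d <- pmax(s$d[seq_len(rank)] - lambda, 0) # soft-threshold the leading singular values
    Z <- s$u %*% (d * t(s$v))                 # rebuild the low-rank approximation
  }
  ifelse(miss, Z, X)
}

# Toy example: an exact rank-1 matrix with two entries removed.
truth <- outer(1:6, c(1, 2, 3))
X <- truth
X[2, 3] <- NA
X[5, 1] <- NA

imputed <- svd_impute_sketch(X, rank = 1, lambda = 0)
imputed[2, 3]  # compare with truth[2, 3]
imputed[5, 1]  # compare with truth[5, 1]
```

Because the toy matrix has exact low-rank structure, the missing entries can be recovered from the observed ones. Real data is only approximately low rank, so the imputed values are estimates, not recovered truths.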

With the missing values now imputed by softImpute rather than filled with medians, we could see how the resulting PCA differed from the previous one.

# Ximp is a matrix, so its labels are set with colnames() and rownames().
colnames(Ximp) <- colnames(num_df)
rownames(Ximp) <- rownames(num_df)

Ximp_pca <- prcomp(Ximp, center = TRUE, scale. = TRUE)

fviz_eig(Ximp_pca)

fviz_pca_ind(Ximp_pca,
             col.ind = "cos2", # Color by the quality of representation
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     # Avoid text overlapping
             )

fviz_pca_var(Ximp_pca,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     # Avoid text overlapping
             )

res.var <- get_pca_var(Ximp_pca)
data.frame(variable_importance_PC1 = sort(res.var$contrib[,1], decreasing = TRUE))

Conclusion

Using PCA, we were able to visualize and interpret data by displaying only the directions of greatest variance, in other words, the combinations of variables that capture the most information. Computing the PCA gave us coordinates for each country that could be plotted on a PC coordinate plane, revealing patterns that would have been harder to see with other visualization techniques. Used correctly, Principal Component Analysis can be of great aid in gaining insight into valuable correlations, with applications in fields such as healthcare and finance. In the future, we would like to continue exploring Principal Component Analysis, specifically how effective its dimension reduction could be in preparing data for more advanced analyses.