University of Warsaw - Faculty of Economic Sciences

Libraries

library('plotly')
library(knitr)
library(readxl)
library(clusterSim)
library(corrplot)
library(GGally)
library(tidyverse)
library(gridExtra)
library(grid)
library(lattice)
library(psych)
library(factoextra)
library(kableExtra)

1 Introduction and data description

1.1 Motivation and dataset

Medical problems are complex, and so are the data that describe them. Many variables can influence a person's state of health, so medical datasets tend to be large. That richness is valuable, because conclusions in medical analysis must be highly reliable and leave as little room for error as possible.

In this analysis, I will try to reduce the dimensionality of such a dataset with principal component analysis (PCA). The dataset is publicly available at this link

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/SOQM4D

The author of the dataset combined statistics from various sources, including meteorological stations, public surveys, and medical reports, to assemble a set of variables that may influence gastric cancer. The data were collected for Mashhad, one of the biggest cities in Iran and an important pilgrimage site for Muslims. The dataset consists of 165 observations, one for each of 165 regions of Mashhad.

1.2 Why PCA?

PCA is a powerful tool for exploratory analysis. It reduces a dataset with many variables to a few components that capture the largest possible share of the variance. It is also useful because the components can be used later in econometric or predictive models, and because we can check which variables contribute the most to each dimension. PCA is therefore an accessible way to handle multidimensional data and to obtain a dataset that is easier to work with. Interpreting the components is not always easy, but rotation methods can help.

1.3 Variables

Before going further, it is worth describing the data. Because many different areas of life can contribute to gastric cancer, the author of the dataset used 18 variables. Among them are social, environmental, and behavioral variables that, at least according to theory, should affect the incidence of gastric cancer.


NUMBER | VARIABLE | VARIABLE_BACKGROUND | UNIT_OF_MEASURE
1 | Stomach cancer incidence per 10,000 people | Dependent variable | Number of patients
2 | Average Age | Social conditions | Years
3 | Average BMI | Social conditions | BMI score
4 | Population Density | Social conditions | People per km^2
5 | Average fruit consumption | Behavioral (food) | Grams per day
6 | Average vegetables consumption | Behavioral (food) | Grams per day
7 | Average meat consumption | Behavioral (food) | Grams per week
8 | Average consumption of processed meat | Behavioral (food) | Grams per week
9 | Average smoked food consumption | Behavioral (food) | Grams per month
10 | Average salt consumption | Behavioral (food) | Grams per month
11 | Percentage of smoking people | Behavioral (risk factor) | %
12 | Percentage of drinking (alcohol) people | Behavioral (risk factor) | %
13 | O3 contamination level | Environmental | µg/m3
14 | NO2 contamination level | Environmental | µg/m3
15 | SO2 contamination level | Environmental | µg/m3
16 | CO contamination level | Environmental | µg/m3
17 | PM2_5 contamination level | Environmental | µg/m3
18 | PM10 contamination level | Environmental | µg/m3

1.4 Data preparation

Data preparation for PCA should include at least two steps.

1) Numerical continuous variables:

All of the data must be numerical and continuous. PCA is designed for continuous data and can be distorted by, for example, binary (0/1) variables. This was the first step I took, even before presenting the variables, because the original data also contained binary and descriptive values. This step is therefore already done.

2) Data normalisation:

PCA looks for components with maximal variance, so variables measured on larger scales would dominate the result. Our variables differ in nature, background, and units of measure; for example, Age cannot be compared directly with CO contamination levels. Normalization is therefore a necessary step before conducting PCA. I will use data.Normalization from the clusterSim package with type n1, i.e. standardization ((x-mean)/sd).
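The two preparation steps can be sketched in base R as follows. This is a minimal illustration on a small made-up data frame (the column names and values are hypothetical and not taken from the Mashhad dataset). It keeps only the numeric columns, applies the n1 formula by hand, and compares the result with base R's scale(), which performs the same centring and scaling.

```r
# Toy data frame (hypothetical values) with one non-numeric column
toy <- data.frame(
  region = c("A", "B", "C", "D"),
  age    = c(31, 35, 29, 40),
  co     = c(1200, 950, 1430, 1100)
)

# Step 1: keep only the numeric columns
num_cols <- sapply(toy, is.numeric)
toy_num  <- toy[, num_cols, drop = FALSE]

# Step 2: n1 standardization, (x - mean) / sd, applied column by column
standardize <- function(x) (x - mean(x)) / sd(x)
toy_std <- as.data.frame(lapply(toy_num, standardize))

# scale() should give the same numbers (attributes aside)
all.equal(as.matrix(toy_std), scale(toy_num), check.attributes = FALSE)

round(colMeans(toy_std), 10)   # column means after standardization
apply(toy_std, 2, sd)          # column standard deviations after standardization
```

After this step every column has mean 0 and standard deviation 1, so no variable dominates the PCA merely because of its unit of measure.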

The standardized data summarized below are used in the rest of the analysis.

Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max.
Avg_AGE | -3.14886 | -0.68543 | 0.07304 | 0.00000 | 0.54231 | 4.11825
Avg_BMI | -3.66541 | -0.37905 | 0.08715 | 0.00000 | 0.41827 | 4.46552
Density | -1.72863 | -0.69630 | -0.01839 | 0.00000 | 0.45297 | 3.57691
Fruits | -3.08053 | -0.50949 | -0.05413 | 0.00000 | 0.39218 | 5.43700
Avg_Vegetable | -2.96070 | -0.42939 | 0.05777 | 0.00000 | 0.43226 | 5.43761
Avg_RedMeat | -3.1804 | -0.4600 | 0.0204 | 0.0000 | 0.4682 | 3.8616
Avg_ProcessedMeat | -0.8505 | -0.4609 | -0.1316 | 0.0000 | 0.1678 | 10.2372
Total_SmokedFood | -0.5639 | -0.5074 | -0.3944 | 0.0000 | 0.1140 | 5.8199
Total_SaltUse | -0.7836 | -0.6327 | -0.3741 | 0.0000 | 0.2725 | 4.4966
Smokers | -1.41775 | -0.64715 | 0.02728 | 0.00000 | 0.53314 | 3.83636
AlcoholUsers | -0.78681 | -0.78681 | -0.04193 | 0.00000 | 0.20983 | 6.13230
O3 | -2.2465 | -0.6853 | 0.1288 | 0.0000 | 0.4028 | 2.8482
NO2 | -1.8039 | -0.6523 | -0.1510 | 0.0000 | 0.3214 | 2.8379
SO2 | -2.5528 | -0.4836 | -0.2235 | 0.0000 | 0.3217 | 2.5409
CO | -3.18225 | -0.64456 | -0.03721 | 0.00000 | 0.60672 | 2.72867
PM2_5 | -1.695921 | -0.828227 | -0.008023 | 0.000000 | 0.784941 | 2.321502
PM10 | -1.95762 | -0.48994 | 0.01614 | 0.00000 | 0.18858 | 4.70469

2 PCA

2.1 Correlation Matrix

A correlation matrix is a useful first step before PCA because it shows where the correlations in the data occur and so gives a better understanding of the data we are dealing with. Let's compute it and display it as a heatmap.

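# Pairwise correlations between the standardized variables
mydata.cor <- cor(data)
# 20-step colour ramp from green (lowest values) through white to red (highest values)
palette <- colorRampPalette(c("green", "white", "red"))(20)
heatmap(x = mydata.cor, col = palette, symm = TRUE)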

2.2 Principal Component Analysis - Summary and Variable Loadings

pca<-prcomp(data, center=FALSE, scale.=FALSE)
summary(pca)
## Importance of components:
##                          PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.788 1.3151 1.25031 1.23613 1.12380 1.04618 1.00649
## Proportion of Variance 0.188 0.1017 0.09196 0.08988 0.07429 0.06438 0.05959
## Cumulative Proportion  0.188 0.2898 0.38172 0.47160 0.54589 0.61027 0.66986
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.95914 0.89153 0.84181 0.80981 0.79929 0.73118 0.68278
## Proportion of Variance 0.05411 0.04675 0.04169 0.03858 0.03758 0.03145 0.02742
## Cumulative Proportion  0.72398 0.77073 0.81242 0.85099 0.88857 0.92002 0.94744
##                           PC15    PC16    PC17
## Standard deviation     0.61918 0.50814 0.50189
## Proportion of Variance 0.02255 0.01519 0.01482
## Cumulative Proportion  0.96999 0.98518 1.00000

From the summary of the PCA we can see that explaining at least 70% of the variation requires as many as 8 principal components (72.4% cumulatively). That is not good news: we reduce the 17 input variables only to 8, that is, to about 47% of the starting number. The cost may be significant for medical data, because we retain only about 72% of the variation.

The share of variance explained by each component, together with the cumulative share, can also be shown on a plot.
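One way to draw such a plot with base R graphics is sketched below. It uses the built-in USArrests data as a stand-in (an assumption made only so that the example runs on its own; the object names are illustrative). The bars show the percentage of variance per component and the line shows the cumulative percentage. For prcomp objects, factoextra's fviz_eig() offers a ready-made alternative.

```r
# Stand-in data: built-in USArrests, standardized as in the analysis
x   <- scale(USArrests)
fit <- prcomp(x, center = FALSE, scale. = FALSE)

# Percentage of variance explained by each component, and its running total
var_pct <- 100 * fit$sdev^2 / sum(fit$sdev^2)
cum_pct <- cumsum(var_pct)

# Bars: individual percentages; line with points: cumulative percentage
mids <- barplot(var_pct, names.arg = paste0("PC", seq_along(var_pct)),
                ylim = c(0, 100), ylab = "Explained variance (%)")
lines(mids, cum_pct, type = "b", pch = 19)
```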

An alternative approach is based on eigenvalues: we keep only the components whose eigenvalues are greater than or equal to 1.

The plot of eigenvalues suggests taking only 7 principal components, because the eigenvalue of the 8th is slightly below 1. We should certainly not take the 9th, whose eigenvalue is well below 1. Nevertheless, I will stay with 8 principal components in order to explain at least 70% of the variation.
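Both selection rules can be checked directly from the standard deviations printed by summary(pca). The sketch below re-types those rounded values into a vector (in a live session one would use pca$sdev instead), so the results agree with the output above only up to rounding.

```r
# Standard deviations of the 17 components, copied from the summary(pca) output
sdev <- c(1.788, 1.3151, 1.25031, 1.23613, 1.12380, 1.04618, 1.00649,
          0.95914, 0.89153, 0.84181, 0.80981, 0.79929, 0.73118, 0.68278,
          0.61918, 0.50814, 0.50189)

eigenvalues <- sdev^2                               # variance of each component
cum_share   <- cumsum(eigenvalues) / sum(eigenvalues)

# Rule 1: smallest number of components reaching 70% of the variance
n_70 <- which(cum_share >= 0.70)[1]

# Rule 2 (eigenvalue rule): number of components with eigenvalue >= 1
n_kaiser <- sum(eigenvalues >= 1)

c(components_for_70pct = n_70, eigenvalue_rule = n_kaiser)
```

With these values the first rule gives 8 components and the second gives 7, in line with the discussion above.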

2.3 Individual results - Variable Loadings and contributions to principal components

First, let's analyze the variable loadings. The rule is simple: the larger the absolute value, the better. A loading can be interpreted as the variable's participation in a given component. The larger the absolute loading, the greater the participation, and the more that variable should figure in the interpretation of the component. In addition, the loadings scaled by the square roots of the eigenvalues give the coordinates of the variables.
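To make the link between loadings, eigenvalues and variable coordinates concrete, here is a small sketch on the built-in USArrests data (a stand-in chosen only so that the code runs on its own). For standardized data, multiplying each column of loadings by the component's standard deviation (the square root of its eigenvalue) gives the correlation between each variable and each component, and these correlations are what variable plots display as coordinates.

```r
x   <- scale(USArrests)
fit <- prcomp(x, center = FALSE, scale. = FALSE)

loadings <- fit$rotation                  # one column of weights per component

# Scale each column of loadings by that component's standard deviation
coords <- sweep(loadings, 2, fit$sdev, `*`)

# The same numbers are the correlations between variables and component scores
cors <- cor(x, fit$x)
all.equal(coords, cors, check.attributes = FALSE)

round(coords, 3)
```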

Variable loadings for the first 8 principal components based on previous analysis
Variable | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8
Avg_AGE | -0.2796461 | -0.1984843 | 0.0253811 | -0.0749842 | 0.0555563 | -0.1031867 | 0.2215427 | -0.1452071
Avg_BMI | -0.0705409 | 0.1081842 | 0.2597773 | 0.1164589 | 0.3658137 | 0.3915323 | 0.1319622 | -0.6377863
Density | 0.1115762 | 0.0591775 | 0.3656833 | -0.3903685 | -0.2458868 | 0.0273745 | 0.3173151 | 0.2221843
Fruits | -0.1763033 | -0.5203652 | -0.0651447 | -0.1319989 | 0.0078307 | 0.1742188 | -0.0651132 | -0.2044404
Avg_Vegetable | -0.2110913 | -0.4781619 | 0.1104302 | -0.2197124 | -0.0799904 | -0.0915293 | 0.0253990 | 0.1321966
Avg_RedMeat | -0.1793520 | -0.3867984 | -0.0173552 | -0.1010523 | 0.1586233 | 0.3050748 | 0.2775429 | 0.1741535
Avg_ProcessedMeat | 0.0004931 | 0.1558846 | -0.0554942 | 0.2573249 | 0.2058881 | 0.5626912 | 0.2907927 | 0.4892113
Total_SmokedFood | -0.3861181 | 0.1761112 | 0.2163192 | 0.1084444 | -0.2534631 | -0.0472967 | 0.2120895 | 0.0036892
Total_SaltUse | -0.3633487 | 0.1567816 | 0.2488753 | 0.2026163 | -0.3910196 | -0.0394616 | 0.1581481 | -0.0642671
Smokers | -0.0995252 | 0.0501363 | -0.3461041 | -0.1087179 | -0.4366890 | 0.3426938 | -0.2498862 | -0.2000702
AlcoholUsers | 0.0441641 | -0.0382806 | -0.4813558 | 0.0676116 | -0.3886511 | 0.2564562 | 0.1564594 | -0.0097166
O3 | -0.3743289 | 0.2893929 | -0.0896498 | -0.1814557 | -0.0189554 | -0.0084843 | 0.1101649 | -0.0889192
NO2 | 0.3299184 | 0.0138434 | 0.2671411 | -0.2747684 | -0.1319723 | 0.2283809 | -0.0204853 | -0.1317156
SO2 | 0.3882276 | -0.1062309 | -0.0066057 | 0.0502294 | -0.1630516 | -0.1313334 | 0.4113718 | -0.0651393
CO | -0.1133730 | -0.0107916 | 0.3910967 | -0.0204941 | -0.0945676 | 0.2883007 | -0.5682499 | 0.2768281
PM2_5 | 0.0528429 | 0.2851432 | -0.0807033 | -0.6578471 | 0.0693154 | 0.1397915 | 0.0641057 | -0.0687079
PM10 | -0.2970910 | 0.1893244 | -0.2814050 | -0.2701261 | 0.3341051 | -0.1665634 | -0.0249998 | 0.1967967

We can also check how much each variable contributes to each principal component. Because the first components are the most important ones, I will analyze only the first 4 principal components.

As we can see, the first principal component is primarily composed of SO2, smoked food, O3, total salt use, NO2, PM10, and average age. In the second principal component we can distinguish mainly fruits, vegetables, red meat, O3, and PM2_5. The first principal component is therefore more about environmental pollution and the second more about diet.
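A contribution can be computed by hand from the loadings: because each column of loadings has unit length, the squared loading multiplied by 100 is the percentage that a variable contributes to that component. The sketch below uses the built-in USArrests data as a stand-in (an assumption for illustration only); factoextra reports the same quantity in the contrib element returned by get_pca_var().

```r
x   <- scale(USArrests)
fit <- prcomp(x, center = FALSE, scale. = FALSE)

# Contribution (%) of each variable to each component
contrib <- 100 * fit$rotation^2

colSums(contrib)     # the contributions to each component add up to 100

# Variables ranked by their contribution to the first component
sort(contrib[, "PC1"], decreasing = TRUE)
```

A variable whose contribution exceeds 100 divided by the number of variables contributes more than it would if all variables contributed equally, which is a common threshold for deciding which variables to mention in the interpretation.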

2.4 Visualisation of PCA

Machine learning methods are complicated, so it is worth visualising them. PCA can be visualised too, which I do below.

# Column 1 is the cancer incidence, columns 2-18 are the explanatory variables
database <- read_excel("database.xlsx")
data <- scale(database[, 2:18])
a <- scale(database[, 1])
prin_comp <- prcomp(data, center = FALSE, scale. = FALSE)

# Keep the scores on the first three components (signs of PC2 and PC3 flipped for display)
components <- data.frame(prin_comp[["x"]])[, 1:3]
components$PC2 <- -components$PC2
components$PC3 <- -components$PC3
components$cancer <- a[, 1]
# Bin the standardized incidence into three groups
components$level <- ifelse(components$cancer > 0.5, "high level of cancer",
                    ifelse(components$cancer > 0, "medium level of cancer",
                           "low level of cancer"))

# Share of variance explained by the three plotted components only
explained <- summary(prin_comp)[["importance"]]["Proportion of Variance", 1:3]
tot_explained_variance_ratio <- 100 * sum(explained)

tit <- sprintf("Total Explained Variance = %.2f%%", tot_explained_variance_ratio)

fig <- plot_ly(components, x = ~PC1, y = ~PC2, z = ~PC3, color = ~level,
               colors = c('#EF553B', '#00CC96', '#FFFF00')) %>%
  add_markers(size = 12)

fig <- fig %>%
  layout(
    title = tit,
    scene = list(bgcolor = "#e5ecf6")
  )

fig

For me, it is hard to judge from this plot which principal component is more closely related to a high level of cancer.

3 Analysis and Rotation of PCA

As we have seen, interpreting PCA can be very hard, especially without domain knowledge, and even a medical degree does not guarantee that the components will be easy to read. The components usually have to be used in further econometric or statistical methods, for example linear regression, but it is still worth knowing what each component means. A method called Varimax Rotation is useful here. It rotates the components so as to maximize the variance of the squared loadings, which pushes each loading towards either a large value or zero, so that each component is described by only a few variables. The fewer variables per component, the easier the interpretation.
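The effect of the rotation can be illustrated with varimax() from base R's stats package. This is only a sketch on the built-in mtcars data with two retained components (both choices are assumptions made for illustration). psych::principal() carries out its own extraction and rotation, so its numbers need not match this hand-made version exactly.

```r
x   <- scale(mtcars)
fit <- prcomp(x, center = FALSE, scale. = FALSE)

k <- 2                                     # number of components kept (illustrative)

# Loadings scaled by the component standard deviations
L_raw <- sweep(fit$rotation[, 1:k], 2, fit$sdev[1:k], `*`)

rot   <- varimax(L_raw)                    # orthogonal varimax rotation
L_rot <- unclass(rot$loadings)             # rotated loadings as a plain matrix

# The rotation redistributes variance between the components but keeps, for
# every variable, the total share explained by the k components (communality)
all.equal(rowSums(L_raw^2), rowSums(L_rot^2))

round(cbind(L_raw, L_rot), 2)              # loadings before and after rotation
```

Because the rotation is orthogonal, the retained components together explain exactly as much variance as before; only the way that variance is split between them changes.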

3.1 Rotation of PCA

pca.varimax<-principal(data, nfactors=8, rotate="varimax")
print(loadings(pca.varimax), digits=3, cutoff=0.35, sort=TRUE)
## 
## Loadings:
##                   RC1    RC5    RC2    RC4    RC3    RC7    RC6    RC8   
## Total_SmokedFood   0.840                                                 
## Total_SaltUse      0.906                                                 
## O3                 0.491  0.631                                          
## SO2                      -0.550                      -0.517              
## PM10                      0.863                                          
## Fruits                           0.758                                   
## Avg_Vegetable                    0.739                                   
## Avg_RedMeat                      0.704                                   
## Density                                 0.790                            
## NO2                      -0.362         0.615                            
## PM2_5                     0.532         0.678                            
## Smokers                                        0.814                     
## AlcoholUsers                                   0.708                     
## CO                                                    0.867              
## Avg_ProcessedMeat                                            0.914       
## Avg_BMI                                                             0.927
## Avg_AGE                          0.479                                   
## 
##                  RC1   RC5   RC2   RC4   RC3   RC7   RC6   RC8
## SS loadings    2.132 1.977 1.924 1.588 1.317 1.208 1.084 1.077
## Proportion Var 0.125 0.116 0.113 0.093 0.077 0.071 0.064 0.063
## Cumulative Var 0.125 0.242 0.355 0.448 0.526 0.597 0.661 0.724

Let's interpret the rotated components (RC). The first one (RC1) seems to describe taste preferences in the diet (smoked food and salt). The second (RC5) seems to be related to the level of air pollution. The third (RC2) is mainly about diet (the balance of meat, vegetables, and fruits), but age also loads on it, which may suggest that diet becomes a more important factor for gastric cancer with age. RC3 is also easy to read because it captures the risk factors of smoking and drinking. RC8 is only about BMI. Some of the components have a really clear interpretation, which is very nice.

To be sure, we can also visualise the loadings and see the connections.

fa.diagram(loadings(pca.varimax))

4 Conclusion

Performing PCA on our data was not as efficient as hoped, because we had to take 8 principal components to explain 70% of the variation. In addition, the interpretation of the principal components was not easy. Varimax Rotation helped: the rotated components were easier to interpret, and the interpretations were logical. Further econometric methods should be applied to the principal or rotated components to learn more about their relationship with the cancer level. That analysis will now need only 8 variables instead of the original 17 explanatory variables. So let's call it a success.