University of Warsaw - Faculty of Economic Sciences

Libraries

library('plotly')
library(knitr)
library(readxl)
library(clusterSim)
library(corrplot)
library(GGally)
library(tidyverse)
library(gridExtra)
library(grid)
library(lattice)
library(psych)
library(factoextra)
library(kableExtra)

1 Introduction and data description

1.1 Motivation and dataset

Medical questions are complex, and so are the data that describe them. Many variables can influence a person's state of health, so medical datasets tend to be very large. That richness is helpful, because conclusions in medical analysis need a high level of confidence if we are to avoid as many mistakes as possible.

In this analysis, I will try to reduce the dimensionality of such a dataset by conducting Principal Component Analysis (PCA). The dataset is publicly available at this link:

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/SOQM4D

The author of the dataset combined statistics from various sources, including meteorological stations, public surveys, and medical reports, to create the best possible mix of variables that can influence gastric cancer. The data were compiled for Mashhad, a city in Iran that is an important pilgrimage site for Muslims and one of the biggest cities in the country. The dataset consists of 165 observations, each representing one of 165 regions of Mashhad.

1.2 Why PCA?

PCA is a powerful tool for exploratory analysis. It lets us reduce a dataset with many variables to a few components that capture the largest possible share of the variance. It is also useful because the components can be used further in econometric or predictive models, and because we can check which variables contribute the most to a given dimension. PCA is therefore an easy and accessible way to deal with multidimensional data and to obtain a dataset that is easier to use later on. Interpreting the components is not always easy, but rotation methods can help with that.

1.3 Variables

Before going any further, it is worth describing the data we are using. Because many different aspects of life can contribute to gastric cancer, the author of the dataset used 18 variables. Among them are social, environmental, and behavioral variables that should, at least according to theory, affect the level of gastric cancer.

NUMBER	VARIABLE	VARIABLE_BACKGROUND	UNIT_OF_MEASURE
1	Stomach cancer incidence per 10,000 people	dependent variable	Number of patients
2	Average Age	Social conditions	Years
3	Average BMI	Social conditions	BMI score
4	Population Density	Social conditions	people per km^2
5	Average fruit consumption	Behavioral (food)	gram per day
6	Average vegetables consumption	Behavioral (food)	gram per day
7	Average meat consumption	Behavioral (food)	gram per week
8	Average consumption of processed meat	Behavioral (food)	gram per week
9	Average smoked food consumption	Behavioral (food)	gram per month
10	Average salt consumption	Behavioral (food)	gram per month
11	Percentage of smoking people	Behavioral (risk factor)	%
12	Percentage of drinking (alcohol) people	Behavioral (risk factor)	%
13	O3 contamination level	Environmental	µg/m3
14	NO2 contamination level	Environmental	µg/m3
15	SO2 contamination level	Environmental	µg/m3
16	CO contamination level	Environmental	µg/m3
17	PM2_5 contamination level	Environmental	µg/m3
18	PM10 contamination level	Environmental	µg/m3

1.4 Data preparation

Data preparation for PCA should contain at least 2 steps:

1) Numerical continuous variables:

The first step is to make sure that all of the data are numerical and continuous. PCA is designed for continuous data, and its results may be distorted by, for example, binary (0/1) variables. I actually took this step before presenting the variables, because the original data also contained binary and descriptive values, so it can be considered done.

2) Data normalisation:

PCA maximizes the variance of the components. For that reason, we should scale our variables, which differ in nature, background, and units of measure. For example, we cannot directly compare Age with CO contamination levels, because they are measured on entirely different scales, and the variable with the larger numerical variance would dominate the components. Data normalization is therefore a necessary step in conducting PCA successfully. I will perform it using data.Normalization from clusterSim with type n1, i.e. standardization ((x-mean)/sd).

The standardized data summarised below are used in the rest of the analysis.

Avg_AGE	Avg_BMI	Density	Fruits	Avg_Vegetable	Avg_RedMeat	Avg_ProcessedMeat	Total_SmokedFood	Total_SaltUse
Min. :-3.14886	Min. :-3.66541	Min. :-1.72863	Min. :-3.08053	Min. :-2.96070	Min. :-3.1804	Min. :-0.8505	Min. :-0.5639	Min. :-0.7836
1st Qu.:-0.68543	1st Qu.:-0.37905	1st Qu.:-0.69630	1st Qu.:-0.50949	1st Qu.:-0.42939	1st Qu.:-0.4600	1st Qu.:-0.4609	1st Qu.:-0.5074	1st Qu.:-0.6327
Median : 0.07304	Median : 0.08715	Median :-0.01839	Median :-0.05413	Median : 0.05777	Median : 0.0204	Median :-0.1316	Median :-0.3944	Median :-0.3741
Mean : 0.00000	Mean : 0.00000	Mean : 0.00000	Mean : 0.00000	Mean : 0.00000	Mean : 0.0000	Mean : 0.0000	Mean : 0.0000	Mean : 0.0000
3rd Qu.: 0.54231	3rd Qu.: 0.41827	3rd Qu.: 0.45297	3rd Qu.: 0.39218	3rd Qu.: 0.43226	3rd Qu.: 0.4682	3rd Qu.: 0.1678	3rd Qu.: 0.1140	3rd Qu.: 0.2725
Max. : 4.11825	Max. : 4.46552	Max. : 3.57691	Max. : 5.43700	Max. : 5.43761	Max. : 3.8616	Max. :10.2372	Max. : 5.8199	Max. : 4.4966

Smokers	AlcoholUsers	O3	NO2	SO2	CO	PM2_5	PM10
Min. :-1.41775	Min. :-0.78681	Min. :-2.2465	Min. :-1.8039	Min. :-2.5528	Min. :-3.18225	Min. :-1.695921	Min. :-1.95762
1st Qu.:-0.64715	1st Qu.:-0.78681	1st Qu.:-0.6853	1st Qu.:-0.6523	1st Qu.:-0.4836	1st Qu.:-0.64456	1st Qu.:-0.828227	1st Qu.:-0.48994
Median : 0.02728	Median :-0.04193	Median : 0.1288	Median :-0.1510	Median :-0.2235	Median :-0.03721	Median :-0.008023	Median : 0.01614
Mean : 0.00000	Mean : 0.00000	Mean : 0.0000	Mean : 0.0000	Mean : 0.0000	Mean : 0.00000	Mean : 0.000000	Mean : 0.00000
3rd Qu.: 0.53314	3rd Qu.: 0.20983	3rd Qu.: 0.4028	3rd Qu.: 0.3214	3rd Qu.: 0.3217	3rd Qu.: 0.60672	3rd Qu.: 0.784941	3rd Qu.: 0.18858
Max. : 3.83636	Max. : 6.13230	Max. : 2.8482	Max. : 2.8379	Max. : 2.5409	Max. : 2.72867	Max. : 2.321502	Max. : 4.70469
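
The zero means in the summary above are what standardization should produce. The following is a minimal sketch of the same transformation in base R, on made-up data (the toy columns `age` and `co` are illustrative and are not taken from the Mashhad dataset). It uses base `scale()` instead of clusterSim, on the assumption that both compute (x-mean)/sd column by column.

```r
set.seed(42)

# Two toy variables on very different scales (illustrative values only)
toy <- data.frame(
  age = rnorm(50, mean = 35, sd = 5),
  co  = rnorm(50, mean = 2000, sd = 300)
)

# Standardize by hand: (x - mean) / sd for every column
manual <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# The same transformation with base R's scale()
scaled <- as.data.frame(scale(toy))

# After standardization every column has mean 0 and standard deviation 1
col_means <- colMeans(manual)
col_sds   <- sapply(manual, sd)
print(round(col_means, 10))
print(round(col_sds, 10))
```

After this step both toy columns are on the same scale, so neither dominates the components merely because of its unit of measure.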

2 PCA

2.1 Correlation Matrix

A correlation matrix can be very useful in PCA for getting a deeper understanding of the data we are dealing with. It gives a first indication of where correlations occur. So let's compute the correlation matrix and plot it as a heatmap:

mydata.cor = cor(data)
palette = colorRampPalette(c("green", "white", "red")) (20)
heatmap(x = mydata.cor, col = palette, symm = TRUE)
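
The correlation matrix is more than a diagnostic: for standardized data, PCA amounts to an eigendecomposition of that matrix. The sketch below illustrates this on simulated data (the matrix `toy` and its columns are assumptions made for the example, not the report's data), using only base R functions.

```r
set.seed(1)
n <- 200
z <- rnorm(n)

# Four toy variables; a and b share a common factor, c and d are noise
toy <- cbind(a = z + rnorm(n), b = z + rnorm(n), c = rnorm(n), d = rnorm(n))
toy_std <- scale(toy)

# Eigenvalues of the correlation matrix
eig <- eigen(cor(toy))

# PCA on the standardized data, as in the report (no extra centering or scaling)
pca_toy <- prcomp(toy_std, center = FALSE, scale. = FALSE)
variances <- pca_toy$sdev^2

# The component variances equal the eigenvalues of the correlation matrix
print(round(cbind(eigen = eig$values, pca = variances), 4))
```

Because the eigenvalues of a correlation matrix sum to the number of variables, each eigenvalue divided by that number is the proportion of variance of the corresponding component.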

2.2 Principal Component Analysis - Summary and Variable Loadings

pca<-prcomp(data, center=FALSE, scale.=FALSE)
summary(pca)

## Importance of components:
##                          PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.788 1.3151 1.25031 1.23613 1.12380 1.04618 1.00649
## Proportion of Variance 0.188 0.1017 0.09196 0.08988 0.07429 0.06438 0.05959
## Cumulative Proportion  0.188 0.2898 0.38172 0.47160 0.54589 0.61027 0.66986
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.95914 0.89153 0.84181 0.80981 0.79929 0.73118 0.68278
## Proportion of Variance 0.05411 0.04675 0.04169 0.03858 0.03758 0.03145 0.02742
## Cumulative Proportion  0.72398 0.77073 0.81242 0.85099 0.88857 0.92002 0.94744
##                           PC15    PC16    PC17
## Standard deviation     0.61918 0.50814 0.50189
## Proportion of Variance 0.02255 0.01519 0.01482
## Cumulative Proportion  0.96999 0.98518 1.00000

From the PCA summary and the eigenvalues we can see that to explain 70% of the variation we need to take as many as 8 principal components. That is not good news: we reduce the number of variables only to 8, that is, to about 44% of the 18 original variables (8 of the 17 that enter the PCA). The cost may be significant in the case of medical data, because we keep only about 72% of the variation.

The percentage of variance explained by each component, together with its cumulative value, can also be shown on a plot.

It is also possible to apply an approach based on eigenvalues. In this approach, we keep only the components with eigenvalues greater than or equal to 1.

The plot of eigenvalues suggests that we should take only 7 principal components, because the eigenvalue of the 8th is slightly below 1. We should certainly not take the 9th, because its eigenvalue is too far below 1. Despite that, I will stay with 8 principal components in order to reach 70% of explained variation.
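
Both selection rules can be checked directly from the standard deviations printed in the PCA summary. The sketch below copies those 17 values by hand (so they carry the rounding of the printed output) and recomputes the eigenvalues, the cumulative proportion of variance, and the number of components each rule suggests.

```r
# Standard deviations of PC1..PC17, copied from the summary(pca) output
sdev <- c(1.788, 1.3151, 1.25031, 1.23613, 1.12380, 1.04618, 1.00649,
          0.95914, 0.89153, 0.84181, 0.80981, 0.79929, 0.73118, 0.68278,
          0.61918, 0.50814, 0.50189)

# Eigenvalues are the squared standard deviations
eigenvalues <- sdev^2

# Proportion and cumulative proportion of explained variance
prop_var <- eigenvalues / sum(eigenvalues)
cum_var  <- cumsum(prop_var)

# Rule 1: smallest number of components explaining at least 70%
n_70 <- which(cum_var >= 0.70)[1]

# Rule 2: components with an eigenvalue of at least 1
n_kaiser <- sum(eigenvalues >= 1)

print(round(eigenvalues[7:9], 3))
print(c(n_70 = n_70, n_kaiser = n_kaiser))
```

The 70% rule gives 8 components and the eigenvalue rule gives 7, which is the trade-off discussed above.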

2.3 Individual results - Variable Loadings and contribution to

First, let's analyze the variable loadings. The rule is simple: the higher the absolute value, the better. A variable loading can be interpreted as the participation of a variable in a given component. The higher the absolute value of the loading, the greater the participation, and the more that variable should be taken into account when interpreting the component. Together with the eigenvalues, the loadings also determine the coordinates of the variables.
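
The link between loadings, eigenvalues, and variable coordinates can be sketched on simulated data (the matrix `toy` and its columns `v1` to `v3` are assumptions made for the example). For standardized variables, a loading multiplied by the standard deviation of its component gives the coordinate of the variable, which is also the correlation between the variable and the component.

```r
set.seed(7)
n <- 150
z <- rnorm(n)

# Three standardized toy variables; v1 and v2 share a common factor
toy <- scale(cbind(v1 = z + rnorm(n), v2 = z + rnorm(n), v3 = rnorm(n)))
pca_toy <- prcomp(toy, center = FALSE, scale. = FALSE)

# Loadings: each column is a unit-length vector of weights
loadings  <- pca_toy$rotation
col_norms <- colSums(loadings^2)

# Variable coordinates: loading times the standard deviation of the component
coords <- sweep(loadings, 2, pca_toy$sdev, "*")

# For standardized data these coordinates are variable-component correlations
cor_check <- cor(toy, pca_toy$x)
print(round(coords, 3))
print(round(cor_check, 3))
```

This is why the sign of a loading matters less than its absolute value: a large negative loading indicates just as strong a relationship with the component as a large positive one.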

Variable loadings for the first 8 principal components based on previous analysis

VARIABLE	PC1	PC2	PC3	PC4	PC5	PC6	PC7	PC8
Avg_AGE	-0.2796461	-0.1984843	0.0253811	-0.0749842	0.0555563	-0.1031867	0.2215427	-0.1452071
Avg_BMI	-0.0705409	0.1081842	0.2597773	0.1164589	0.3658137	0.3915323	0.1319622	-0.6377863
Density	0.1115762	0.0591775	0.3656833	-0.3903685	-0.2458868	0.0273745	0.3173151	0.2221843
Fruits	-0.1763033	-0.5203652	-0.0651447	-0.1319989	0.0078307	0.1742188	-0.0651132	-0.2044404
Avg_Vegetable	-0.2110913	-0.4781619	0.1104302	-0.2197124	-0.0799904	-0.0915293	0.0253990	0.1321966
Avg_RedMeat	-0.1793520	-0.3867984	-0.0173552	-0.1010523	0.1586233	0.3050748	0.2775429	0.1741535
Avg_ProcessedMeat	0.0004931	0.1558846	-0.0554942	0.2573249	0.2058881	0.5626912	0.2907927	0.4892113
Total_SmokedFood	-0.3861181	0.1761112	0.2163192	0.1084444	-0.2534631	-0.0472967	0.2120895	0.0036892
Total_SaltUse	-0.3633487	0.1567816	0.2488753	0.2026163	-0.3910196	-0.0394616	0.1581481	-0.0642671
Smokers	-0.0995252	0.0501363	-0.3461041	-0.1087179	-0.4366890	0.3426938	-0.2498862	-0.2000702
AlcoholUsers	0.0441641	-0.0382806	-0.4813558	0.0676116	-0.3886511	0.2564562	0.1564594	-0.0097166
O3	-0.3743289	0.2893929	-0.0896498	-0.1814557	-0.0189554	-0.0084843	0.1101649	-0.0889192
NO2	0.3299184	0.0138434	0.2671411	-0.2747684	-0.1319723	0.2283809	-0.0204853	-0.1317156
SO2	0.3882276	-0.1062309	-0.0066057	0.0502294	-0.1630516	-0.1313334	0.4113718	-0.0651393
CO	-0.1133730	-0.0107916	0.3910967	-0.0204941	-0.0945676	0.2883007	-0.5682499	0.2768281
PM2_5	0.0528429	0.2851432	-0.0807033	-0.6578471	0.0693154	0.1397915	0.0641057	-0.0687079
PM10	-0.2970910	0.1893244	-0.2814050	-0.2701261	0.3341051	-0.1665634	-0.0249998	0.1967967

We can also check the contribution of each variable to each principal component. Because the first components are the most important ones, I will analyze only the first 4 principal components.

As we can see, the first principal component is primarily composed of SO2, smoked food, O3, total salt use, NO2, PM10, and average age. In the second principal component, we can distinguish mainly fruits, vegetables, red meat, O3, and PM2_5. So the first principal component is more about environmental pollution and the second more about diet.
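
The ranking for the first component can be reproduced from the PC1 loadings in the table of variable loadings. The sketch below assumes the usual definition of contribution, the squared loading as a percentage of the sum of squared loadings for that component, and compares each value with the contribution expected if all 17 variables contributed equally (100/17, about 5.9%). Whether this is exactly how the original contribution plots were produced is an assumption.

```r
# PC1 loadings, copied from the table of variable loadings
pc1 <- c(Avg_AGE = -0.2796461, Avg_BMI = -0.0705409, Density = 0.1115762,
         Fruits = -0.1763033, Avg_Vegetable = -0.2110913,
         Avg_RedMeat = -0.1793520, Avg_ProcessedMeat = 0.0004931,
         Total_SmokedFood = -0.3861181, Total_SaltUse = -0.3633487,
         Smokers = -0.0995252, AlcoholUsers = 0.0441641, O3 = -0.3743289,
         NO2 = 0.3299184, SO2 = 0.3882276, CO = -0.1133730,
         PM2_5 = 0.0528429, PM10 = -0.2970910)

# Contribution in percent: squared loading over the sum of squared loadings
contrib <- sort(100 * pc1^2 / sum(pc1^2), decreasing = TRUE)

# Contribution expected if every variable contributed equally
expected <- 100 / length(pc1)

# Variables contributing more than the uniform share
above_avg <- names(contrib)[contrib > expected]
print(round(contrib, 2))
print(above_avg)
```

The seven variables above the uniform share are exactly those named for the first component, in the same order.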

2.4 Visualisation of PCA

Because machine learning methods are complicated, it is well worth visualising them. PCA can also be visualised, which I do below.

# Load the data and standardize the 17 explanatory variables (columns 2-18)
database <- read_excel("database.xlsx")
data <- scale(database[, 2:18])
# Standardized cancer incidence (column 1), used only to colour the points
a <- scale(database[, 1])

prin_comp <- prcomp(data, center = FALSE, scale. = FALSE)
components <- data.frame(prin_comp[["x"]])
# Flip PC2 and PC3 (the sign of a principal component is arbitrary)
components$PC2 <- -components$PC2
components$PC3 <- -components$PC3
components <- components[, 1:3]
components$cancer <- a[, 1]
components$level <- ifelse(components$cancer > 0.5, "high level of cancer",
                    ifelse(components$cancer > 0, "medium level of cancer",
                           "low level of cancer"))

# Share of variance explained by the three plotted components only
explained <- summary(prin_comp)[["importance"]]["Proportion of Variance", 1:3]
tit <- sprintf("Total Explained Variance = %.2f%%", 100 * sum(explained))

fig <- plot_ly(components, x = ~PC1, y = ~PC2, z = ~PC3, color = ~level,
               colors = c('#EF553B', '#00CC96', '#FFFF00')) %>%
  add_markers(size = 12)
fig <- fig %>% layout(title = tit, scene = list(bgcolor = "#e5ecf6"))
fig

For me, it is hard to judge from this plot which principal component may be more closely related to a high level of cancer.

3 Analysis and Rotation of PCA

As we have seen, interpreting PCA can be very hard, especially without specialist knowledge of the topic. Even with a medical degree, we might still not be able to use PCA as a powerful tool for such problems. The components often need to be used in other econometric or statistical methods, for example linear regression, but it is still worth knowing what each component means. A method called Varimax rotation is useful here. It rotates the components so as to maximize the variance of the squared loadings within each component, which pushes each loading towards either zero or a large absolute value, so we can attach an interpretation to each component. The fewer variables load on a component, the easier the interpretation.

3.1 Rotation of PCA

pca.varimax<-principal(data, nfactors=8, rotate="varimax")
print(loadings(pca.varimax), digits=3, cutoff=0.35, sort=TRUE)

## 
## Loadings:
##                   RC1    RC5    RC2    RC4    RC3    RC7    RC6    RC8   
## Total_SmokedFood   0.840                                                 
## Total_SaltUse      0.906                                                 
## O3                 0.491  0.631                                          
## SO2                      -0.550                      -0.517              
## PM10                      0.863                                          
## Fruits                           0.758                                   
## Avg_Vegetable                    0.739                                   
## Avg_RedMeat                      0.704                                   
## Density                                 0.790                            
## NO2                      -0.362         0.615                            
## PM2_5                     0.532         0.678                            
## Smokers                                        0.814                     
## AlcoholUsers                                   0.708                     
## CO                                                    0.867              
## Avg_ProcessedMeat                                            0.914       
## Avg_BMI                                                             0.927
## Avg_AGE                          0.479                                   
## 
##                  RC1   RC5   RC2   RC4   RC3   RC7   RC6   RC8
## SS loadings    2.132 1.977 1.924 1.588 1.317 1.208 1.084 1.077
## Proportion Var 0.125 0.116 0.113 0.093 0.077 0.071 0.064 0.063
## Cumulative Var 0.125 0.242 0.355 0.448 0.526 0.597 0.661 0.724
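
Rotation redistributes the explained variance among the components but does not change its total: the SS loadings above sum to 12.307, which is 72.4% of 17 and matches the cumulative proportion of the first 8 unrotated components. The sketch below illustrates this property with base R's `varimax()` on simulated data (the matrix `toy` and the choice of 3 components are assumptions made for the example). It is meant as an illustration of the idea, not a reproduction of what `principal()` does internally.

```r
set.seed(1)

# Six standardized toy variables
toy <- scale(matrix(rnorm(200 * 6), ncol = 6))
pca_toy <- prcomp(toy, center = FALSE, scale. = FALSE)

# Component loadings for the first k components:
# eigenvector entries scaled by the component standard deviations
k <- 3
L <- pca_toy$rotation[, 1:k] %*% diag(pca_toy$sdev[1:k])

# Varimax rotation (stats package, part of base R)
rot <- varimax(L)
R <- unclass(rot$loadings)

# Variance explained per variable (communality) before and after rotation
comm_before <- rowSums(L^2)
comm_after  <- rowSums(R^2)

# Variance per component changes, the total does not
print(round(rbind(before = colSums(L^2), after = colSums(R^2)), 3))
print(round(c(total_before = sum(L^2), total_after = sum(R^2)), 3))
```

Because the rotation is orthogonal, each variable keeps the same share of explained variance; only the way that share is split across components changes, which is what makes the rotated components easier to read.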

Let's interpret the Rotated Components (RC) we obtained, in the order in which they are printed. The first one (RC1) seems to describe taste preferences in the diet (smoked food and salt). The second (RC5) seems to be related to the level of air pollution. The third (RC2) is mainly about diet (the balance of meat, vegetables, and fruits), but it also includes age, so I assume that beyond a certain age diet becomes a really important factor in gastric cancer, which is quite logical. RC3 is also informative because it captures the risk factors of smoking and drinking. RC8 is only about BMI. Some of the components have a really clear interpretation, which is very helpful.

To be sure, we can also visualise the result and see the connections:

fa.diagram(loadings(pca.varimax))

4 Conclusion

Performing PCA on our data was not as efficient as hoped, because we had to take 8 principal components to reach 70% of explained variation. In addition, interpreting the principal components was not easy. Varimax rotation helped: with it we could more easily attach an interpretation to the RCs, and those interpretations were logical. It is certainly necessary to apply further econometric methods to the resulting PCs or RCs to learn more about their relationship with the cancer level. But that will now take only 8 variables, not 18 as at the beginning. So let's call it a success.

Dimension reduction on variables determining gastric cancer - PCA approach

- Mateusz Domaradzki