University of Warsaw - Faculty of Economic Sciences
Libraries
library('plotly')
library(knitr)
library(readxl)
library(clusterSim)
library(corrplot)
library(GGally)
library(tidyverse)
library(gridExtra)
library(grid)
library(lattice)
library(psych)
library(factoextra)
library(kableExtra)
Medical problems are complex, and so is the data that describes them. Many variables can determine a person's state of health, so medical datasets tend to be very large. This breadth is helpful in medical analysis, because the results need to be highly reliable and mistakes must be kept to a minimum.
In this analysis, I will try to reduce the dimensionality of such a dataset by conducting PCA. The dataset is publicly available at this link:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/SOQM4D
The author of the dataset combined statistics from various sources, including meteorological stations, public surveys, and medical reports, to assemble a set of variables that may influence gastric cancer. The data was collected for Mashhad, one of the biggest cities in Iran and an important pilgrimage site for Muslims. The dataset consists of 165 observations, one for each of 165 regions of Mashhad.
PCA is a powerful tool for exploratory analysis. It lets us reduce a dataset with many variables to a few components that capture the largest possible share of the variance. It is also useful because the components can be used further in econometric or predictive models, and because we can check which variables contribute most to each dimension. PCA is therefore an accessible way to deal with multidimensional data and to obtain a dataset that is easier to work with. Interpreting the components is not always easy, but rotation methods can help with that.
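To make the idea of "components that capture variance" concrete, here is a minimal sketch in base R. It uses the built-in `mtcars` data as a stand-in (an assumption for illustration, not the gastric cancer data) and shows that the variances of the principal components are the eigenvalues of the correlation matrix.

```r
# Stand-in data: the built-in mtcars data set, NOT the gastric cancer data
X <- scale(mtcars)                      # standardise every column

# PCA "by hand": eigen-decomposition of the correlation matrix
eig <- eigen(cor(mtcars))

# PCA with prcomp() on the already standardised data
pca <- prcomp(X, center = FALSE, scale. = FALSE)

# The eigenvalues are the variances of the principal components
eigenvalues  <- eig$values
pc_variances <- pca$sdev^2

# Share of the total variance captured by each component
explained <- eigenvalues / sum(eigenvalues)
print(round(explained, 3))
```

The shares are sorted from largest to smallest, which is why keeping only the first few components can preserve most of the information.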
Before going any further, it is worth describing the data. Because many different aspects of our lives can lead to gastric cancer, the author of the dataset used 18 variables. Among them are social, environmental, and behavioral variables which, at least according to theory, should affect the incidence of gastric cancer.
| NUMBER | VARIABLE | VARIABLE_BACKGROUND | UNIT_OF_MEASURE |
|---|---|---|---|
| 1 | Stomach cancer incidence per 10,000 people | dependent variable | Number of patients |
| 2 | Average Age | Social conditions | Years |
| 3 | Average BMI | Social conditions | BMI score |
| 4 | Population Density | Social conditions | number of people per 1km^2 |
| 5 | Average fruit consumption | Behavioral (food) | gram per day |
| 6 | Average vegetables consumption | Behavioral (food) | gram per day |
| 7 | Average meat consumption | Behavioral (food) | gram per week |
| 8 | Average consumption of processed meat | Behavioral (food) | gram per week |
| 9 | Average smoked food consumption | Behavioral (food) | gram per month |
| 10 | Average salt consumption | Behavioral (food) | gram per month |
| 11 | Percentage of smoking people | Behavioral (risk factor) | % |
| 12 | Percentage of drinking (alcohol) people | Behavioral (risk factor) | % |
| 13 | O3 contamination level | Environmental | µg/m3 |
| 14 | NO2 contamination level | Environmental | µg/m3 |
| 15 | SO2 contamination level | Environmental | µg/m3 |
| 16 | CO contamination level | Environmental | µg/m3 |
| 17 | PM2_5 contamination level | Environmental | µg/m3 |
| 18 | PM10 contamination level | Environmental | µg/m3 |
Data preparation for PCA should include at least 2 steps:
1) Numerical continuous variables:
All of the data should be numerical and continuous. PCA is designed for continuous data and its results may be distorted by, for example, binary (0/1) variables. I took this step even before presenting the variables, because the original data also contained binary and descriptive values, so it can be considered done.
2) Data normalisation:
PCA maximizes the variance of the components, so variables measured on larger scales would dominate the result. Our variables differ in nature, background, and units of measure; for example, Age cannot be compared directly with CO contamination levels. Normalization is therefore a necessary step before conducting PCA. I will perform it using data.Normalization from clusterSim with type n1, i.e. standardization ((x-mean)/sd).
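The standardization step can be sketched in base R as follows. The example uses three columns of the built-in `mtcars` data as a stand-in (an assumption for illustration); the `clusterSim` call is shown only as a comment and is not run here.

```r
# Stand-in data: three columns of the built-in mtcars data set
x <- as.matrix(mtcars[, c("mpg", "hp", "wt")])

# Manual standardisation: (x - mean) / sd, column by column
z_manual <- sweep(sweep(x, 2, colMeans(x), "-"), 2, apply(x, 2, sd), "/")

# Base R equivalent
z_scale <- scale(x)

# With clusterSim installed, the call described in the text would be
# (assumed to give the same result, not run here):
# z_n1 <- clusterSim::data.Normalization(x, type = "n1", normalization = "column")

print(round(colMeans(z_manual), 10))   # column means are numerically zero
print(apply(z_manual, 2, sd))          # column standard deviations are one
```

After this step every variable has mean 0 and standard deviation 1, so no variable dominates the components just because of its unit of measure.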
The standardized data summarized below will be used in the rest of the analysis.
| Avg_AGE | Avg_BMI | Density | Fruits | Avg_Vegetable | Avg_RedMeat | Avg_ProcessedMeat | Total_SmokedFood | Total_SaltUse |
|---|---|---|---|---|---|---|---|---|
| Min. :-3.14886 | Min. :-3.66541 | Min. :-1.72863 | Min. :-3.08053 | Min. :-2.96070 | Min. :-3.1804 | Min. :-0.8505 | Min. :-0.5639 | Min. :-0.7836 |
| 1st Qu.:-0.68543 | 1st Qu.:-0.37905 | 1st Qu.:-0.69630 | 1st Qu.:-0.50949 | 1st Qu.:-0.42939 | 1st Qu.:-0.4600 | 1st Qu.:-0.4609 | 1st Qu.:-0.5074 | 1st Qu.:-0.6327 |
| Median : 0.07304 | Median : 0.08715 | Median :-0.01839 | Median :-0.05413 | Median : 0.05777 | Median : 0.0204 | Median :-0.1316 | Median :-0.3944 | Median :-0.3741 |
| Mean : 0.00000 | Mean : 0.00000 | Mean : 0.00000 | Mean : 0.00000 | Mean : 0.00000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 |
| 3rd Qu.: 0.54231 | 3rd Qu.: 0.41827 | 3rd Qu.: 0.45297 | 3rd Qu.: 0.39218 | 3rd Qu.: 0.43226 | 3rd Qu.: 0.4682 | 3rd Qu.: 0.1678 | 3rd Qu.: 0.1140 | 3rd Qu.: 0.2725 |
| Max. : 4.11825 | Max. : 4.46552 | Max. : 3.57691 | Max. : 5.43700 | Max. : 5.43761 | Max. : 3.8616 | Max. :10.2372 | Max. : 5.8199 | Max. : 4.4966 |

| Total_SaltUse | Smokers | AlcoholUsers | O3 | NO2 | SO2 | CO | PM2_5 | PM10 |
|---|---|---|---|---|---|---|---|---|
| Min. :-0.7836 | Min. :-1.41775 | Min. :-0.78681 | Min. :-2.2465 | Min. :-1.8039 | Min. :-2.5528 | Min. :-3.18225 | Min. :-1.695921 | Min. :-1.95762 |
| 1st Qu.:-0.6327 | 1st Qu.:-0.64715 | 1st Qu.:-0.78681 | 1st Qu.:-0.6853 | 1st Qu.:-0.6523 | 1st Qu.:-0.4836 | 1st Qu.:-0.64456 | 1st Qu.:-0.828227 | 1st Qu.:-0.48994 |
| Median :-0.3741 | Median : 0.02728 | Median :-0.04193 | Median : 0.1288 | Median :-0.1510 | Median :-0.2235 | Median :-0.03721 | Median :-0.008023 | Median : 0.01614 |
| Mean : 0.0000 | Mean : 0.00000 | Mean : 0.00000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.00000 | Mean : 0.000000 | Mean : 0.00000 |
| 3rd Qu.: 0.2725 | 3rd Qu.: 0.53314 | 3rd Qu.: 0.20983 | 3rd Qu.: 0.4028 | 3rd Qu.: 0.3214 | 3rd Qu.: 0.3217 | 3rd Qu.: 0.60672 | 3rd Qu.: 0.784941 | 3rd Qu.: 0.18858 |
| Max. : 4.4966 | Max. : 3.83636 | Max. : 6.13230 | Max. : 2.8482 | Max. : 2.8379 | Max. : 2.5409 | Max. : 2.72867 | Max. : 2.321502 | Max. : 4.70469 |
A correlation matrix is very useful before PCA, because it gives a deeper understanding of the data and shows where correlations occur. So let's compute and plot it:
mydata.cor <- cor(data)
palette <- colorRampPalette(c("green", "white", "red"))(20)
heatmap(x = mydata.cor, col = palette, symm = TRUE)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.788 1.3151 1.25031 1.23613 1.12380 1.04618 1.00649
## Proportion of Variance 0.188 0.1017 0.09196 0.08988 0.07429 0.06438 0.05959
## Cumulative Proportion 0.188 0.2898 0.38172 0.47160 0.54589 0.61027 0.66986
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.95914 0.89153 0.84181 0.80981 0.79929 0.73118 0.68278
## Proportion of Variance 0.05411 0.04675 0.04169 0.03858 0.03758 0.03145 0.02742
## Cumulative Proportion 0.72398 0.77073 0.81242 0.85099 0.88857 0.92002 0.94744
## PC15 PC16 PC17
## Standard deviation 0.61918 0.50814 0.50189
## Proportion of Variance 0.02255 0.01519 0.01482
## Cumulative Proportion 0.96999 0.98518 1.00000
From the summary of the PCA and its eigenvalues we can see that explaining 70% of the variation requires as many as 8 principal components (7 components explain 67%, 8 explain 72%). That's not good news: we reduce the number of variables only to 8, i.e. to about 44% of the 18 variables we started with. The cost may be significant for medical data, because in return we retain only about 70% of the variation.
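The 70% threshold can be checked by hand from the printed standard deviations. The sketch below copies those values from the summary output (they are rounded, so the results match the printed proportions only approximately) and recomputes the cumulative share of variance.

```r
# Standard deviations of the 17 components, copied from the summary output
sdev <- c(1.788, 1.3151, 1.25031, 1.23613, 1.12380, 1.04618, 1.00649,
          0.95914, 0.89153, 0.84181, 0.80981, 0.79929, 0.73118, 0.68278,
          0.61918, 0.50814, 0.50189)

eigenvalues <- sdev^2                           # variance of each component
prop_var    <- eigenvalues / sum(eigenvalues)   # proportion of variance
cum_var     <- cumsum(prop_var)                 # cumulative proportion

# Smallest number of components whose cumulative share reaches 70%
n_70 <- which(cum_var >= 0.70)[1]
print(round(cum_var[1:8], 3))
print(n_70)
```

Changing the `0.70` cut-off shows how quickly the number of required components grows with the desired share of explained variation.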
The share of variance explained by each component, together with its cumulative value, can also be shown on a plot.
It's also possible to apply an approach based on eigenvalues, in which we keep only the components with eigenvalues greater than or equal to 1.
The plot of eigenvalues suggests keeping only 7 principal components, because the 8th eigenvalue is slightly below 1. We should certainly not take the 9th, because it falls too far below 1. Nevertheless, I will stay with 8 principal components to reach 70% of explained variation.
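The eigenvalue rule can be verified directly, because each eigenvalue is the square of the corresponding standard deviation in the summary output. A small sketch, again using the rounded values copied from that output:

```r
# Standard deviations of the 17 components, copied from the summary output
sdev <- c(1.788, 1.3151, 1.25031, 1.23613, 1.12380, 1.04618, 1.00649,
          0.95914, 0.89153, 0.84181, 0.80981, 0.79929, 0.73118, 0.68278,
          0.61918, 0.50814, 0.50189)

eigenvalues <- sdev^2

# Eigenvalue rule: keep the components whose eigenvalue is at least 1
n_kaiser <- sum(eigenvalues >= 1)

print(round(eigenvalues[7:9], 3))   # the eigenvalues around the cut-off
print(n_kaiser)
```

On standardized data an eigenvalue of 1 is the variance of a single original variable, so the rule keeps only components that carry at least as much information as one variable.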
First, let's analyze the variable loadings. The rule is simple: the higher the absolute value, the better. A loading can be interpreted as the participation of a variable in a given component: the higher the absolute loading, the greater the participation, and the more that variable should be taken into account when interpreting the component. In addition, the loadings together with the eigenvalues determine the coordinates of the variables.
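The relation between loadings, eigenvalues, and variable coordinates can be sketched as follows. The example uses the built-in `mtcars` data as a stand-in (an assumption for illustration, not the gastric cancer data): each coordinate is the loading multiplied by the standard deviation of its component, i.e. by the square root of the eigenvalue.

```r
# Stand-in data: the built-in mtcars data set, NOT the gastric cancer data
X   <- scale(mtcars)
pca <- prcomp(X, center = FALSE, scale. = FALSE)

loadings <- pca$rotation          # unit-length eigenvectors, one column per PC

# Variable coordinates: loading times the component's standard deviation
var_coord <- sweep(loadings, 2, pca$sdev, "*")

# For standardised data these coordinates equal the correlations
# between the original variables and the component scores
var_cor <- cor(X, pca$x)

print(round(var_coord[, 1:2], 3))
```

This is why, on standardized data, the coordinates can be read as correlations between a variable and a component.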
Variable loadings for the first 8 principal components, based on the previous analysis:
| Variable | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 |
|---|---|---|---|---|---|---|---|---|
| Avg_AGE | -0.2796461 | -0.1984843 | 0.0253811 | -0.0749842 | 0.0555563 | -0.1031867 | 0.2215427 | -0.1452071 |
| Avg_BMI | -0.0705409 | 0.1081842 | 0.2597773 | 0.1164589 | 0.3658137 | 0.3915323 | 0.1319622 | -0.6377863 |
| Density | 0.1115762 | 0.0591775 | 0.3656833 | -0.3903685 | -0.2458868 | 0.0273745 | 0.3173151 | 0.2221843 |
| Fruits | -0.1763033 | -0.5203652 | -0.0651447 | -0.1319989 | 0.0078307 | 0.1742188 | -0.0651132 | -0.2044404 |
| Avg_Vegetable | -0.2110913 | -0.4781619 | 0.1104302 | -0.2197124 | -0.0799904 | -0.0915293 | 0.0253990 | 0.1321966 |
| Avg_RedMeat | -0.1793520 | -0.3867984 | -0.0173552 | -0.1010523 | 0.1586233 | 0.3050748 | 0.2775429 | 0.1741535 |
| Avg_ProcessedMeat | 0.0004931 | 0.1558846 | -0.0554942 | 0.2573249 | 0.2058881 | 0.5626912 | 0.2907927 | 0.4892113 |
| Total_SmokedFood | -0.3861181 | 0.1761112 | 0.2163192 | 0.1084444 | -0.2534631 | -0.0472967 | 0.2120895 | 0.0036892 |
| Total_SaltUse | -0.3633487 | 0.1567816 | 0.2488753 | 0.2026163 | -0.3910196 | -0.0394616 | 0.1581481 | -0.0642671 |
| Smokers | -0.0995252 | 0.0501363 | -0.3461041 | -0.1087179 | -0.4366890 | 0.3426938 | -0.2498862 | -0.2000702 |
| AlcoholUsers | 0.0441641 | -0.0382806 | -0.4813558 | 0.0676116 | -0.3886511 | 0.2564562 | 0.1564594 | -0.0097166 |
| O3 | -0.3743289 | 0.2893929 | -0.0896498 | -0.1814557 | -0.0189554 | -0.0084843 | 0.1101649 | -0.0889192 |
| NO2 | 0.3299184 | 0.0138434 | 0.2671411 | -0.2747684 | -0.1319723 | 0.2283809 | -0.0204853 | -0.1317156 |
| SO2 | 0.3882276 | -0.1062309 | -0.0066057 | 0.0502294 | -0.1630516 | -0.1313334 | 0.4113718 | -0.0651393 |
| CO | -0.1133730 | -0.0107916 | 0.3910967 | -0.0204941 | -0.0945676 | 0.2883007 | -0.5682499 | 0.2768281 |
| PM2_5 | 0.0528429 | 0.2851432 | -0.0807033 | -0.6578471 | 0.0693154 | 0.1397915 | 0.0641057 | -0.0687079 |
| PM10 | -0.2970910 | 0.1893244 | -0.2814050 | -0.2701261 | 0.3341051 | -0.1665634 | -0.0249998 | 0.1967967 |
We can also check the contribution of each variable to each principal component. Because the first components are the most important ones, I will analyze only the first 4 principal components.
As we can see, the first principal component is primarily composed of SO2, smoked food, O3, total salt use, NO2, PM10, and average age. The second principal component is made up mainly of fruits, vegetables, red meat, O3, and PM2_5. So the first principal component is more about environmental pollution and the second about diet.
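The contributions to the first component can be recomputed from the PC1 column of the loadings table: a variable's contribution is its squared loading as a percentage of the sum of squared loadings. In the sketch below, the "equal contribution" cut-off of 100/17 percent is an assumption chosen for illustration.

```r
# PC1 loadings copied from the loadings table
pc1 <- c(Avg_AGE = -0.2796461, Avg_BMI = -0.0705409, Density = 0.1115762,
         Fruits = -0.1763033, Avg_Vegetable = -0.2110913, Avg_RedMeat = -0.1793520,
         Avg_ProcessedMeat = 0.0004931, Total_SmokedFood = -0.3861181,
         Total_SaltUse = -0.3633487, Smokers = -0.0995252, AlcoholUsers = 0.0441641,
         O3 = -0.3743289, NO2 = 0.3299184, SO2 = 0.3882276, CO = -0.1133730,
         PM2_5 = 0.0528429, PM10 = -0.2970910)

# Contribution (%) of each variable: squared loading as a share of the total
contrib <- 100 * pc1^2 / sum(pc1^2)

# Variables above the contribution expected if all 17 contributed equally
threshold <- 100 / length(pc1)
top <- sort(contrib[contrib > threshold], decreasing = TRUE)
print(round(top, 2))
```

The seven variables above the cut-off are the same ones named in the text for the first component, in the same order.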
Because machine learning methods are complicated, it is well worth visualising them. PCA can be visualised too, which I will do below.
# Column 1 is the cancer incidence, columns 2-18 are the 17 explanatory variables
database <- read_excel("database.xlsx")
data <- scale(database[, 2:18])   # standardise the explanatory variables
a <- scale(database[, 1])         # standardise the cancer incidence

# The data is already standardised, so prcomp() does not centre or scale it again
prin_comp <- prcomp(data, center = FALSE, scale. = FALSE)
components <- data.frame(prin_comp[["x"]])
# Flip PC2 and PC3 (the sign of a principal component is arbitrary)
components$PC2 <- -components$PC2
components$PC3 <- -components$PC3
components <- components[, 1:3]
components$cancer <- a[, 1]
# Bin the standardised incidence into three levels
components$level <- ifelse(components$cancer > 0.5, "high level of cancer",
                    ifelse(components$cancer > 0, "medium level of cancer",
                           "low level of cancer"))

# Share of variance explained by the three plotted components
explained <- summary(prin_comp)[["importance"]]["Proportion of Variance", 1:3]
tit <- sprintf("Total Explained Variance = %.2f%%", 100 * sum(explained))

fig <- plot_ly(components, x = ~PC1, y = ~PC2, z = ~PC3, color = ~level,
               colors = c("#EF553B", "#00CC96", "#FFFF00")) %>%
  add_markers(size = 12)
fig <- fig %>%
  layout(
    title = tit,
    scene = list(bgcolor = "#e5ecf6")
  )
fig
For me, it's hard to judge from this plot which principal component is more related to a high level of cancer.
As we have seen, interpreting PCA can be very hard, especially without sufficient knowledge of the topic. Even with a medical degree, we might not be able to draw conclusions from the raw components alone. PCA components are usually fed into other econometric or statistical methods, for example linear regression, but it is still worth knowing what each component means. Here a method called Varimax rotation is useful. It rotates the components to maximize the variance of the squared loadings, so that each component has a few large loadings and many near-zero ones, which lets us put an interpretation on each component. The fewer variables load on a component, the easier the interpretation.
pca.varimax <- principal(data, nfactors = 8, rotate = "varimax")
print(loadings(pca.varimax), digits = 3, cutoff = 0.35, sort = TRUE)
##
## Loadings:
## RC1 RC5 RC2 RC4 RC3 RC7 RC6 RC8
## Total_SmokedFood 0.840
## Total_SaltUse 0.906
## O3 0.491 0.631
## SO2 -0.550 -0.517
## PM10 0.863
## Fruits 0.758
## Avg_Vegetable 0.739
## Avg_RedMeat 0.704
## Density 0.790
## NO2 -0.362 0.615
## PM2_5 0.532 0.678
## Smokers 0.814
## AlcoholUsers 0.708
## CO 0.867
## Avg_ProcessedMeat 0.914
## Avg_BMI 0.927
## Avg_AGE 0.479
##
## RC1 RC5 RC2 RC4 RC3 RC7 RC6 RC8
## SS loadings 2.132 1.977 1.924 1.588 1.317 1.208 1.084 1.077
## Proportion Var 0.125 0.116 0.113 0.093 0.077 0.071 0.064 0.063
## Cumulative Var 0.125 0.242 0.355 0.448 0.526 0.597 0.661 0.724
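To show what the rotation does to the loadings, here is a minimal sketch that uses `stats::varimax` from base R as a stand-in for `psych::principal`, applied to the built-in `mtcars` data (both are assumptions for illustration, so signs and component order may differ from the output above). The number of retained components, `k = 3`, is also illustrative.

```r
# Stand-in data: the built-in mtcars data set, NOT the gastric cancer data
X   <- scale(mtcars)
pca <- prcomp(X, center = FALSE, scale. = FALSE)

k   <- 3                                                  # components kept (illustrative)
raw <- sweep(pca$rotation[, 1:k], 2, pca$sdev[1:k], "*")  # unrotated loadings

rot     <- varimax(raw)            # stats::varimax, Kaiser-normalised by default
rotated <- unclass(rot$loadings)   # rotated loadings as a plain matrix

# Rotation redistributes variance between the components but keeps each
# variable's communality (row sum of squared loadings) unchanged
communality_raw <- rowSums(raw^2)
communality_rot <- rowSums(rotated^2)

print(round(rotated, 2))
print(round(colSums(rotated^2) / ncol(X), 3))   # proportion of variance per rotated component
```

Because the rotation is orthogonal, the total variance explained by the retained components stays the same; only its split between components changes. This matches the output above, where the cumulative variance of the 8 rotated components (0.724) equals that of the first 8 principal components.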
Let's put an interpretation on the resulting Rotated Components (RC). The first one seems to describe taste preferences in the diet. The second seems to be related to the level of air pollution. The third is mainly about diet (the balance of meat, vegetables, and fruits), but age appears there as well, so I assume that after a certain age diet becomes a really important factor in gastric cancer, which is quite logical. RC3 is also useful because it captures the risk factors of smoking and drinking. RC8 is only about BMI. Some of the components have a really clear interpretation, which is very nice.
To be sure, we can also visualise the loadings and see the connections.
Performing PCA on our data was not as efficient as hoped, because we had to keep 8 principal components to reach 70% of explained variation. In addition, interpreting the principal components was not easy. Varimax rotation helped: it made it easier to put an interpretation on the rotated components, and those interpretations were logical. It is certainly necessary to apply further econometric methods to the resulting PCs or RCs to learn more about their relationship with the cancer level. But now this will take only 8 variables instead of the 18 we started with. So let's call it a success.