The aim of this paper is to explore the possibilities for dimension reduction of a Covid-19 symptoms dataset and to group co-occurring symptoms into clusters. Dimension reduction was performed with Principal Component Analysis (PCA) and then with Multiple Correspondence Analysis (MCA). Clustering was performed with the hierarchical clustering method.
The data was taken from https://www.kaggle.com/imdevskp/corona-virus-report and contains a list of officially recorded Covid-19 cases with their coexisting symptoms. The data was collected from the beginning of the pandemic until March 2020 and contains 316 800 observations. The data is divided into 27 medical categories. Apart from the country column, the dataset contains binary information.
## [1] "C:/Users/Mateusz/Documents/Studia/Master/UW/DS WNE/Unsupervised Learning/projekt UL"
# Load the cleaned symptoms dataset (comma-separated, "." as the decimal mark)
df <- read.csv("cleaned_data_ul_covid19.csv", sep = ",", dec = ".", header = TRUE)
summary(df)
## Fever Tiredness Dry.Cough Difficulty.in.Breathing
## Min. :0.0000 Min. :0.0 Min. :0.0000 Min. :0.0
## 1st Qu.:0.0000 1st Qu.:0.0 1st Qu.:0.0000 1st Qu.:0.0
## Median :0.0000 Median :0.5 Median :1.0000 Median :0.5
## Mean :0.3125 Mean :0.5 Mean :0.5625 Mean :0.5
## 3rd Qu.:1.0000 3rd Qu.:1.0 3rd Qu.:1.0000 3rd Qu.:1.0
## Max. :1.0000 Max. :1.0 Max. :1.0000 Max. :1.0
## Sore.Throat None_Sympton Pains Nasal.Congestion
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :1.0000
## Mean :0.3125 Mean :0.0625 Mean :0.3636 Mean :0.5455
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Runny.Nose Diarrhea None_Experiencing Age_0.9
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0
## Median :1.0000 Median :0.0000 Median :0.00000 Median :0.0
## Mean :0.5455 Mean :0.3636 Mean :0.09091 Mean :0.2
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.0
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0
## Age_10.19 Age_20.24 Age_25.59 Age_60. Gender_Female
## Min. :0.0 Min. :0.0 Min. :0.0 Min. :0.0 Min. :0.0000
## 1st Qu.:0.0 1st Qu.:0.0 1st Qu.:0.0 1st Qu.:0.0 1st Qu.:0.0000
## Median :0.0 Median :0.0 Median :0.0 Median :0.0 Median :0.0000
## Mean :0.2 Mean :0.2 Mean :0.2 Mean :0.2 Mean :0.3333
## 3rd Qu.:0.0 3rd Qu.:0.0 3rd Qu.:0.0 3rd Qu.:0.0 3rd Qu.:1.0000
## Max. :1.0 Max. :1.0 Max. :1.0 Max. :1.0 Max. :1.0000
## Gender_Male Gender_Transgender Severity_Mild Severity_Moderate
## Min. :0.0000 Min. :0.0000 Min. :0.00 Min. :0.00
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:0.00
## Median :0.0000 Median :0.0000 Median :0.00 Median :0.00
## Mean :0.3333 Mean :0.3333 Mean :0.25 Mean :0.25
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.25 3rd Qu.:0.25
## Max. :1.0000 Max. :1.0000 Max. :1.00 Max. :1.00
## Severity_None Severity_Severe Contact_Dont.Know Contact_No
## Min. :0.00 Min. :0.00 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00 Median :0.00 Median :0.0000 Median :0.0000
## Mean :0.25 Mean :0.25 Mean :0.3333 Mean :0.3333
## 3rd Qu.:0.25 3rd Qu.:0.25 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.00 Max. :1.00 Max. :1.0000 Max. :1.0000
## Contact_Yes Country
## Min. :0.0000 Length:316800
## 1st Qu.:0.0000 Class :character
## Median :0.0000 Mode :character
## Mean :0.3333
## 3rd Qu.:1.0000
## Max. :1.0000
dim(df)
## [1] 316800 27
There are 316 800 observations and 27 variables: 26 binary variables and the character variable ‘Country’, which is dropped in the next step.
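Before dropping columns by position, it can be worth confirming that the remaining columns really are binary. The sketch below does this on a small toy data frame; the values and the name `toy` are illustrative assumptions, not taken from the dataset.

```r
# Toy stand-in for the real data (hypothetical values, not from the dataset)
toy <- data.frame(
  Fever     = c(1, 0, 1, 0),
  Tiredness = c(0, 0, 1, 1),
  Country   = c("China", "Italy", "Iran", "Other"),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns instead of relying on column positions
toy_num <- toy[, sapply(toy, is.numeric), drop = FALSE]

# TRUE for every column that contains only the values 0 and 1
is_binary <- sapply(toy_num, function(x) all(x %in% c(0, 1)))
print(is_binary)
```

Selecting columns by type rather than by index (`df[,1:26]`) is a design choice that keeps the step valid if the column order of the input file changes.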
covid <- df[,1:26]
covid.cor <- cor(covid, method="pearson")
print(covid.cor, digits=2)
## Fever Tiredness Dry.Cough Difficulty.in.Breathing
## Fever 1.0e+00 0.40 5.1e-02 -0.13
## Tiredness 4.0e-01 1.00 3.8e-01 0.00
## Dry.Cough 5.1e-02 0.38 1.0e+00 0.38
## Difficulty.in.Breathing -1.3e-01 0.00 3.8e-01 1.00
## Sore.Throat -1.6e-01 -0.13 5.1e-02 0.40
## None_Sympton -1.7e-01 -0.26 -2.9e-01 -0.26
## Pains 0.0e+00 0.00 0.0e+00 0.00
## Nasal.Congestion 0.0e+00 0.00 0.0e+00 0.00
## Runny.Nose 0.0e+00 0.00 0.0e+00 0.00
## Diarrhea 0.0e+00 0.00 0.0e+00 0.00
## None_Experiencing 0.0e+00 0.00 0.0e+00 0.00
## Age_0.9 -1.7e-18 0.00 -4.1e-18 0.00
## Age_10.19 -7.5e-18 0.00 -2.6e-18 0.00
## Age_20.24 -7.5e-18 0.00 -2.6e-18 0.00
## Age_25.59 -7.5e-18 0.00 -2.6e-18 0.00
## Age_60. -1.7e-18 0.00 -4.1e-18 0.00
## Gender_Female -1.0e-17 0.00 -5.4e-18 0.00
## Gender_Male -1.0e-17 0.00 -5.4e-18 0.00
## Gender_Transgender -1.0e-17 0.00 -5.4e-18 0.00
## Severity_Mild 0.0e+00 0.00 0.0e+00 0.00
## Severity_Moderate 0.0e+00 0.00 0.0e+00 0.00
## Severity_None 0.0e+00 0.00 0.0e+00 0.00
## Severity_Severe 0.0e+00 0.00 0.0e+00 0.00
## Contact_Dont.Know 0.0e+00 0.00 0.0e+00 0.00
## Contact_No 0.0e+00 0.00 0.0e+00 0.00
## Contact_Yes 0.0e+00 0.00 0.0e+00 0.00
## Sore.Throat None_Sympton Pains Nasal.Congestion
## Fever -1.6e-01 -1.7e-01 0.0e+00 0.0e+00
## Tiredness -1.3e-01 -2.6e-01 0.0e+00 0.0e+00
## Dry.Cough 5.1e-02 -2.9e-01 0.0e+00 0.0e+00
## Difficulty.in.Breathing 4.0e-01 -2.6e-01 0.0e+00 0.0e+00
## Sore.Throat 1.0e+00 -1.7e-01 0.0e+00 0.0e+00
## None_Sympton -1.7e-01 1.0e+00 0.0e+00 0.0e+00
## Pains 0.0e+00 0.0e+00 1.0e+00 3.1e-01
## Nasal.Congestion 0.0e+00 0.0e+00 3.1e-01 1.0e+00
## Runny.Nose 0.0e+00 0.0e+00 -6.9e-02 2.7e-01
## Diarrhea 0.0e+00 0.0e+00 -1.8e-01 -6.9e-02
## None_Experiencing 0.0e+00 0.0e+00 -2.4e-01 -3.5e-01
## Age_0.9 -9.1e-19 -2.4e-18 2.5e-19 -1.6e-19
## Age_10.19 -4.6e-19 -2.5e-18 1.1e-19 2.1e-19
## Age_20.24 -4.6e-19 -2.5e-18 1.1e-19 2.1e-19
## Age_25.59 -4.6e-19 -2.5e-18 1.1e-19 2.1e-19
## Age_60. -9.1e-19 -2.4e-18 2.5e-19 -1.6e-19
## Gender_Female -1.6e-19 -2.3e-18 -6.0e-19 -9.3e-21
## Gender_Male -1.6e-19 -2.3e-18 -6.0e-19 1.1e-24
## Gender_Transgender -1.6e-19 -2.3e-18 -6.0e-19 -1.9e-20
## Severity_Mild 0.0e+00 0.0e+00 0.0e+00 0.0e+00
## Severity_Moderate 0.0e+00 0.0e+00 0.0e+00 0.0e+00
## Severity_None 0.0e+00 0.0e+00 0.0e+00 0.0e+00
## Severity_Severe 0.0e+00 0.0e+00 0.0e+00 0.0e+00
## Contact_Dont.Know 0.0e+00 0.0e+00 1.6e-20 3.5e-21
## Contact_No 0.0e+00 0.0e+00 1.6e-20 3.5e-21
## Contact_Yes 0.0e+00 0.0e+00 1.6e-20 3.5e-21
## Runny.Nose Diarrhea None_Experiencing Age_0.9
## Fever 0.0e+00 0.0e+00 0.0e+00 -1.7e-18
## Tiredness 0.0e+00 0.0e+00 0.0e+00 0.0e+00
## Dry.Cough 0.0e+00 0.0e+00 0.0e+00 -4.1e-18
## Difficulty.in.Breathing 0.0e+00 0.0e+00 0.0e+00 0.0e+00
## Sore.Throat 0.0e+00 0.0e+00 0.0e+00 -9.1e-19
## None_Sympton 0.0e+00 0.0e+00 0.0e+00 -2.4e-18
## Pains -6.9e-02 -1.8e-01 -2.4e-01 2.5e-19
## Nasal.Congestion 2.7e-01 -6.9e-02 -3.5e-01 -1.6e-19
## Runny.Nose 1.0e+00 3.1e-01 -3.5e-01 3.9e-20
## Diarrhea 3.1e-01 1.0e+00 -2.4e-01 5.1e-20
## None_Experiencing -3.5e-01 -2.4e-01 1.0e+00 1.2e-19
## Age_0.9 3.9e-20 5.1e-20 1.2e-19 1.0e+00
## Age_10.19 1.3e-20 -4.4e-20 1.1e-19 -2.5e-01
## Age_20.24 1.3e-20 -4.4e-20 1.1e-19 -2.5e-01
## Age_25.59 1.3e-20 -4.4e-20 1.1e-19 -2.5e-01
## Age_60. 4.0e-20 5.1e-20 1.1e-19 -2.5e-01
## Gender_Female -8.4e-20 9.7e-21 -2.9e-19 -1.8e-18
## Gender_Male -8.4e-20 9.7e-21 -2.9e-19 -8.9e-18
## Gender_Transgender -8.4e-20 9.7e-21 -2.9e-19 -8.9e-18
## Severity_Mild 0.0e+00 0.0e+00 0.0e+00 0.0e+00
## Severity_Moderate 0.0e+00 0.0e+00 0.0e+00 0.0e+00
## Severity_None 0.0e+00 0.0e+00 0.0e+00 0.0e+00
## Severity_Severe 0.0e+00 0.0e+00 0.0e+00 0.0e+00
## Contact_Dont.Know 3.5e-21 1.6e-20 6.1e-21 0.0e+00
## Contact_No 3.5e-21 1.6e-20 4.5e-21 0.0e+00
## Contact_Yes 3.5e-21 1.6e-20 6.1e-21 0.0e+00
## Age_10.19 Age_20.24 Age_25.59 Age_60. Gender_Female
## Fever -7.5e-18 -7.5e-18 -7.5e-18 -1.7e-18 -1.0e-17
## Tiredness 0.0e+00 0.0e+00 0.0e+00 0.0e+00 0.0e+00
## Dry.Cough -2.6e-18 -2.6e-18 -2.6e-18 -4.1e-18 -5.4e-18
## Difficulty.in.Breathing 0.0e+00 0.0e+00 0.0e+00 0.0e+00 0.0e+00
## Sore.Throat -4.6e-19 -4.6e-19 -4.6e-19 -9.1e-19 -1.6e-19
## None_Sympton -2.5e-18 -2.5e-18 -2.5e-18 -2.4e-18 -2.3e-18
## Pains 1.1e-19 1.1e-19 1.1e-19 2.5e-19 -6.0e-19
## Nasal.Congestion 2.1e-19 2.1e-19 2.1e-19 -1.6e-19 -9.3e-21
## Runny.Nose 1.3e-20 1.3e-20 1.3e-20 4.0e-20 -8.4e-20
## Diarrhea -4.4e-20 -4.4e-20 -4.4e-20 5.1e-20 9.7e-21
## None_Experiencing 1.1e-19 1.1e-19 1.1e-19 1.1e-19 -2.9e-19
## Age_0.9 -2.5e-01 -2.5e-01 -2.5e-01 -2.5e-01 -1.8e-18
## Age_10.19 1.0e+00 -2.5e-01 -2.5e-01 -2.5e-01 1.8e-18
## Age_20.24 -2.5e-01 1.0e+00 -2.5e-01 -2.5e-01 1.8e-18
## Age_25.59 -2.5e-01 -2.5e-01 1.0e+00 -2.5e-01 1.8e-18
## Age_60. -2.5e-01 -2.5e-01 -2.5e-01 1.0e+00 -1.8e-18
## Gender_Female 1.8e-18 1.8e-18 1.8e-18 -1.8e-18 1.0e+00
## Gender_Male 9.6e-22 2.7e-23 9.6e-22 -8.9e-18 -5.0e-01
## Gender_Transgender 5.9e-22 -3.4e-22 5.9e-22 -8.9e-18 -5.0e-01
## Severity_Mild 0.0e+00 0.0e+00 0.0e+00 0.0e+00 0.0e+00
## Severity_Moderate 0.0e+00 0.0e+00 0.0e+00 0.0e+00 0.0e+00
## Severity_None 0.0e+00 0.0e+00 0.0e+00 0.0e+00 0.0e+00
## Severity_Severe 0.0e+00 0.0e+00 0.0e+00 0.0e+00 0.0e+00
## Contact_Dont.Know 0.0e+00 0.0e+00 0.0e+00 0.0e+00 0.0e+00
## Contact_No 0.0e+00 0.0e+00 0.0e+00 0.0e+00 0.0e+00
## Contact_Yes 0.0e+00 0.0e+00 0.0e+00 0.0e+00 0.0e+00
## Gender_Male Gender_Transgender Severity_Mild
## Fever -1.0e-17 -1.0e-17 0.00
## Tiredness 0.0e+00 0.0e+00 0.00
## Dry.Cough -5.4e-18 -5.4e-18 0.00
## Difficulty.in.Breathing 0.0e+00 0.0e+00 0.00
## Sore.Throat -1.6e-19 -1.6e-19 0.00
## None_Sympton -2.3e-18 -2.3e-18 0.00
## Pains -6.0e-19 -6.0e-19 0.00
## Nasal.Congestion 1.1e-24 -1.9e-20 0.00
## Runny.Nose -8.4e-20 -8.4e-20 0.00
## Diarrhea 9.7e-21 9.7e-21 0.00
## None_Experiencing -2.9e-19 -2.9e-19 0.00
## Age_0.9 -8.9e-18 -8.9e-18 0.00
## Age_10.19 9.6e-22 5.9e-22 0.00
## Age_20.24 2.7e-23 -3.4e-22 0.00
## Age_25.59 9.6e-22 5.9e-22 0.00
## Age_60. -8.9e-18 -8.9e-18 0.00
## Gender_Female -5.0e-01 -5.0e-01 0.00
## Gender_Male 1.0e+00 -5.0e-01 0.00
## Gender_Transgender -5.0e-01 1.0e+00 0.00
## Severity_Mild 0.0e+00 0.0e+00 1.00
## Severity_Moderate 0.0e+00 0.0e+00 -0.33
## Severity_None 0.0e+00 0.0e+00 -0.33
## Severity_Severe 0.0e+00 0.0e+00 -0.33
## Contact_Dont.Know 0.0e+00 0.0e+00 0.00
## Contact_No 0.0e+00 0.0e+00 0.00
## Contact_Yes 0.0e+00 0.0e+00 0.00
## Severity_Moderate Severity_None Severity_Severe
## Fever 0.00 0.00 0.00
## Tiredness 0.00 0.00 0.00
## Dry.Cough 0.00 0.00 0.00
## Difficulty.in.Breathing 0.00 0.00 0.00
## Sore.Throat 0.00 0.00 0.00
## None_Sympton 0.00 0.00 0.00
## Pains 0.00 0.00 0.00
## Nasal.Congestion 0.00 0.00 0.00
## Runny.Nose 0.00 0.00 0.00
## Diarrhea 0.00 0.00 0.00
## None_Experiencing 0.00 0.00 0.00
## Age_0.9 0.00 0.00 0.00
## Age_10.19 0.00 0.00 0.00
## Age_20.24 0.00 0.00 0.00
## Age_25.59 0.00 0.00 0.00
## Age_60. 0.00 0.00 0.00
## Gender_Female 0.00 0.00 0.00
## Gender_Male 0.00 0.00 0.00
## Gender_Transgender 0.00 0.00 0.00
## Severity_Mild -0.33 -0.33 -0.33
## Severity_Moderate 1.00 -0.33 -0.33
## Severity_None -0.33 1.00 -0.33
## Severity_Severe -0.33 -0.33 1.00
## Contact_Dont.Know 0.00 0.00 0.00
## Contact_No 0.00 0.00 0.00
## Contact_Yes 0.00 0.00 0.00
## Contact_Dont.Know Contact_No Contact_Yes
## Fever 0.0e+00 0.0e+00 0.0e+00
## Tiredness 0.0e+00 0.0e+00 0.0e+00
## Dry.Cough 0.0e+00 0.0e+00 0.0e+00
## Difficulty.in.Breathing 0.0e+00 0.0e+00 0.0e+00
## Sore.Throat 0.0e+00 0.0e+00 0.0e+00
## None_Sympton 0.0e+00 0.0e+00 0.0e+00
## Pains 1.6e-20 1.6e-20 1.6e-20
## Nasal.Congestion 3.5e-21 3.5e-21 3.5e-21
## Runny.Nose 3.5e-21 3.5e-21 3.5e-21
## Diarrhea 1.6e-20 1.6e-20 1.6e-20
## None_Experiencing 6.1e-21 4.5e-21 6.1e-21
## Age_0.9 0.0e+00 0.0e+00 0.0e+00
## Age_10.19 0.0e+00 0.0e+00 0.0e+00
## Age_20.24 0.0e+00 0.0e+00 0.0e+00
## Age_25.59 0.0e+00 0.0e+00 0.0e+00
## Age_60. 0.0e+00 0.0e+00 0.0e+00
## Gender_Female 0.0e+00 0.0e+00 0.0e+00
## Gender_Male 0.0e+00 0.0e+00 0.0e+00
## Gender_Transgender 0.0e+00 0.0e+00 0.0e+00
## Severity_Mild 0.0e+00 0.0e+00 0.0e+00
## Severity_Moderate 0.0e+00 0.0e+00 0.0e+00
## Severity_None 0.0e+00 0.0e+00 0.0e+00
## Severity_Severe 0.0e+00 0.0e+00 0.0e+00
## Contact_Dont.Know 1.0e+00 -5.0e-01 -5.0e-01
## Contact_No -5.0e-01 1.0e+00 -5.0e-01
## Contact_Yes -5.0e-01 -5.0e-01 1.0e+00
library(corrplot)
## corrplot 0.92 loaded
corrplot(covid.cor, order = "alphabet", tl.cex=0.6)
In the correlation plot above we can see that ‘Difficulty in breathing’ is positively correlated with both ‘Dry cough’ and ‘Sore throat’, and that ‘Tiredness’ is positively correlated with both ‘Dry cough’ and ‘Fever’ (coefficients of about 0.4). Apart from that, there are negative correlations among the indicator variables within the same category, for example among the age intervals of the patients and among the severity levels of the disease, whereas the age and severity variables are not correlated with each other.
An earlier modelling analysis showed that the MDS technique is not well suited to binary data; therefore PCA, followed by MCA, was used for dimension reduction.
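The negative coefficients among the age and severity indicators are what one would expect from one-hot encoding alone: for a balanced categorical variable with k levels, any two of its indicator columns have a Pearson correlation of -1/(k-1). The sketch below checks this on synthetic data, assuming perfectly balanced levels; the variable names are illustrative.

```r
# Balanced one-hot encoding of a factor with k levels (synthetic example)
k <- 5
f <- factor(rep(seq_len(k), times = 20))

# n x k indicator matrix, one column per level (no intercept)
dummies <- model.matrix(~ f - 1)

# Pairwise Pearson correlations between the indicator columns
r <- cor(dummies)
off_diag <- r[upper.tri(r)]

# Theoretical value for balanced levels
expected <- -1 / (k - 1)
print(round(off_diag, 2))
print(expected)
```

With k = 5, 3 and 4 the formula gives -0.25, -0.5 and about -0.33, which matches the values printed in the correlation matrix for the age, gender/contact and severity groups respectively. These coefficients therefore reflect the encoding rather than a relationship between symptoms.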
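As a minimal sketch of the PCA step, the example below runs `prcomp` on a small synthetic binary data frame and derives the proportion of variance explained by each component from the component standard deviations. The data and the names `toy`, `prop_var` and `cum_var` are assumptions made for illustration only.

```r
set.seed(1)

# Synthetic binary data (hypothetical, not the Covid-19 dataset)
toy <- data.frame(
  a = rbinom(200, 1, 0.5),
  b = rbinom(200, 1, 0.3),
  c = rbinom(200, 1, 0.6)
)

# PCA on centred and scaled variables, as in prcomp(..., scale. = TRUE)
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Variance of each component is the squared standard deviation
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
cum_var  <- cumsum(prop_var)

# With scaled inputs the component variances sum to the number of variables
total_var <- sum(pca$sdev^2)
print(round(prop_var, 4))
print(round(cum_var, 4))
```

Because the variables are scaled to unit variance, the total variance equals the number of variables, so each component's share is its squared standard deviation divided by that count.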
covid.pca1<-prcomp(covid, center=TRUE, scale.=TRUE) # stats::
options("scipen"=100, "digits"=4)
covid.pca1$rotation
## PC1 PC2
## Fever -0.21405978143411083514 0.0000000000000605386
## Tiredness -0.42889968191431682154 0.0000000000000044484
## Dry.Cough -0.54019740294622264898 -0.0000000000000881574
## Difficulty.in.Breathing -0.42889968191858962587 -0.0000000000001456485
## Sore.Throat -0.21405978143676643310 -0.0000000000001196803
## None_Sympton 0.49863183963492629935 0.0000000000000754247
## Pains -0.00000000000004121897 0.2678090525075103590
## Nasal.Congestion -0.00000000000007986030 0.5026760082128256535
## Runny.Nose -0.00000000000008798873 0.5026760082126404683
## Diarrhea -0.00000000000004543182 0.2678090525074020567
## None_Experiencing 0.00000000000008586092 -0.5926130983405116792
## Age_0.9 -0.00000000000000553552 0.0000000000000105575
## Age_10.19 0.00000000000000009317 -0.0000000000000013600
## Age_20.24 0.00000000000000009596 -0.0000000000000014433
## Age_25.59 0.00000000000000014216 -0.0000000000000008604
## Age_60. 0.00000000000000010325 -0.0000000000000012768
## Gender_Female -0.00000000000000566779 -0.0000000000000038858
## Gender_Male 0.00000000000001584326 0.0000000000000122541
## Gender_Transgender -0.00000000000000566748 -0.0000000000000033584
## Severity_Mild -0.00000000000003774828 0.0000000000000004441
## Severity_Moderate 0.00000000000001068735 -0.0000000000000032474
## Severity_None 0.00000000000001385733 0.0000000000000024841
## Severity_Severe 0.00000000000001385733 0.0000000000000023176
## Contact_Dont.Know -0.00000000000003590453 0.0000000000000024980
## Contact_No 0.00000000000000678649 -0.0000000000000020262
## Contact_Yes 0.00000000000003617388 0.0000000000000005551
## PC3 PC4
## Fever -0.536185408933701257 0.0000000000000083382
## Tiredness -0.460982870884000140 0.0000000000000235166
## Dry.Cough -0.000000000001752917 0.0000000000000360036
## Difficulty.in.Breathing 0.460982870882539419 0.0000000000000166206
## Sore.Throat 0.536185408931496799 0.0000000000000057350
## None_Sympton 0.000000000002703865 -0.0000000000000116614
## Pains 0.000000000000040755 -0.0000000000000291364
## Nasal.Congestion 0.000000000000077265 -0.0000000000000125559
## Runny.Nose 0.000000000000079155 0.0000000000000178902
## Diarrhea 0.000000000000042363 0.0000000000000240814
## None_Experiencing -0.000000000000109562 -0.0000000000000005516
## Age_0.9 0.000000000000011814 0.0000000000000020648
## Age_10.19 -0.000000000000004038 -0.0000000000000211723
## Age_20.24 -0.000000000000004038 0.0000000000000050619
## Age_25.59 -0.000000000000004069 0.0000000000000155813
## Age_60. -0.000000000000004377 0.0000000000000080075
## Gender_Female 0.000000000000021496 -0.3064165338893526935
## Gender_Male -0.000000000000033662 0.0184581787200654088
## Gender_Transgender 0.000000000000021494 0.2879583551692940779
## Severity_Mild 0.000000000000096209 0.0000000000000159421
## Severity_Moderate -0.000000000000006684 -0.0000000000000391336
## Severity_None -0.000000000000030819 0.0000000000000297245
## Severity_Severe -0.000000000000030819 -0.0000000000000061843
## Contact_Dont.Know 0.000000000000005103 -0.7299113064428662812
## Contact_No -0.000000000000039856 0.2561153661579838037
## Contact_Yes 0.000000000000041218 0.4737959402848411217
## PC5 PC6
## Fever 0.0000000000000000000 0.0000000000000328129
## Tiredness -0.0000000000000055352 0.0000000000000417560
## Dry.Cough -0.0000000000000170692 0.0000000000000038874
## Difficulty.in.Breathing -0.0000000000000149465 -0.0000000000000280029
## Sore.Throat -0.0000000000000100344 -0.0000000000000380875
## None_Sympton 0.0000000000000074794 -0.0000000000000011114
## Pains 0.0000000000000164775 -0.0000000000000147937
## Nasal.Congestion 0.0000000000000082825 -0.0000000000000019498
## Runny.Nose -0.0000000000000079014 0.0000000000000095618
## Diarrhea -0.0000000000000127642 0.0000000000000107692
## None_Experiencing -0.0000000000000010292 -0.0000000000000006314
## Age_0.9 -0.0000000000000010714 -0.0000000000000040367
## Age_10.19 0.0000000000000017115 -0.0000000000000157652
## Age_20.24 -0.0000000000000010924 0.0000000000000073067
## Age_25.59 0.0000000000000013746 0.0000000000000133990
## Age_60. -0.0000000000000079681 0.0000000000000032752
## Gender_Female -0.5276948015111558865 0.2374185122007645421
## Gender_Male -0.1938153091677393136 -0.3237629746918117557
## Gender_Transgender 0.7215101106789235663 0.0863444624910671282
## Severity_Mild 0.0000000000000031955 -0.0000000000000084238
## Severity_Moderate -0.0000000000000042374 0.0000000000000084516
## Severity_None 0.0000000000000019242 -0.0000000000000060334
## Severity_Severe -0.0000000000000004056 0.0000000000000060160
## Contact_Dont.Know 0.3282400019238625455 0.0782797113003221579
## Contact_No -0.1342474861369017392 -0.6802962524358183716
## Contact_Yes -0.1939925157869437922 0.6020165411354851948
## PC7 PC8
## Fever 0.0000000000000000000 -0.000000000000000002665
## Tiredness 0.0000000000000030003 0.000000000000000804932
## Dry.Cough -0.0000000000000020311 0.000000000000002523496
## Difficulty.in.Breathing -0.0000000000000181243 0.000000000000001745773
## Sore.Throat -0.0000000000000133986 0.000000000000002266489
## None_Sympton 0.0000000000000102454 -0.000000000000005013272
## Pains 0.0000000000000126881 0.613584431441651201311
## Nasal.Congestion 0.0000000000000093593 0.351445793106101023184
## Runny.Nose 0.0000000000000031778 -0.351445793106191450850
## Diarrhea -0.0000000000000040275 -0.613584431442331323936
## None_Experiencing -0.0000000000000069148 -0.000000000000162814207
## Age_0.9 0.0000000000000064518 -0.000000000000000648787
## Age_10.19 -0.0000000000000011954 0.000000000000001415534
## Age_20.24 -0.0000000000000104853 0.000000000000001207368
## Age_25.59 0.0000000000000007026 0.000000000000000388578
## Age_60. -0.0000000000000092528 0.000000000000000846545
## Gender_Female 0.4877973154378615894 -0.000000000000003719247
## Gender_Male -0.7238363934314839820 0.000000000000006571133
## Gender_Transgender 0.2360390779936666628 -0.000000000000007341350
## Severity_Mild -0.0000000000000009466 0.000000000000008243406
## Severity_Moderate -0.0000000000000014141 0.000000000000028310687
## Severity_None 0.0000000000000002835 -0.000000000000006120104
## Severity_Severe 0.0000000000000025449 -0.000000000000031037672
## Contact_Dont.Know -0.1415165690981745050 -0.000000000000037136960
## Contact_No 0.3467653490072623579 -0.000000000000004510281
## Contact_Yes -0.2052487799090883247 0.000000000000038496983
## PC9 PC10
## Fever 0.000000000000011048 0.0000000000000250244
## Tiredness 0.000000000000007409 0.0000000000000202558
## Dry.Cough -0.000000000000004346 0.0000000000000066174
## Difficulty.in.Breathing -0.000000000000006223 -0.0000000000000656811
## Sore.Throat 0.000000000000006113 -0.0000000000000846728
## None_Sympton 0.000000000000001521 0.0000000000000271320
## Pains -0.000000000000009034 -0.0000000000000005829
## Nasal.Congestion -0.000000000000008382 -0.0000000000000018596
## Runny.Nose 0.000000000000009194 0.0000000000000009021
## Diarrhea 0.000000000000016823 0.0000000000000021753
## None_Experiencing -0.000000000000002134 0.0000000000000008743
## Age_0.9 -0.000000000000006366 -0.0000000000000647113
## Age_10.19 0.000000000000006762 -0.0000000000000334247
## Age_20.24 -0.000000000000001599 0.0000000000000785344
## Age_25.59 0.000000000000001183 -0.0000000000000103251
## Age_60. -0.000000000000001079 0.0000000000000448461
## Gender_Female -0.000000000000019772 0.0000000000000082434
## Gender_Male 0.000000000000002999 -0.0000000000000025639
## Gender_Transgender 0.000000000000016886 -0.0000000000000056899
## Severity_Mild -0.064342457660595959 0.8385063838452186946
## Severity_Moderate 0.708725691715895278 -0.3346287630444449435
## Severity_None -0.700313995285159807 -0.4222096341228042116
## Severity_Severe 0.055930761229906575 -0.0816679866780549713
## Contact_Dont.Know -0.000000000000035503 0.0000000000000098740
## Contact_No 0.000000000000020451 -0.0000000000000082469
## Contact_Yes 0.000000000000016084 -0.0000000000000018457
## PC11 PC12
## Fever 0.00000000000001473314 -0.0000000000000157343
## Tiredness 0.00000000000001091327 0.0000000000000023893
## Dry.Cough -0.00000000000000090200 0.0000000000000089824
## Difficulty.in.Breathing -0.00000000000002363464 -0.0000000000000014412
## Sore.Throat -0.00000000000002113663 0.0000000000000047328
## None_Sympton 0.00000000000000907006 -0.0000000000000007347
## Pains -0.00000000000002491063 -0.0000000000000003886
## Nasal.Congestion -0.00000000000000850708 0.0000000000000049440
## Runny.Nose 0.00000000000001418657 0.0000000000000025882
## Diarrhea 0.00000000000002247681 0.0000000000000019238
## None_Experiencing 0.00000000000000003816 -0.0000000000000043368
## Age_0.9 -0.00000000000002264204 -0.5761004231848011470
## Age_10.19 -0.00000000000000953404 -0.4784341110756626381
## Age_20.24 0.00000000000002283243 0.3310718718876559663
## Age_25.59 -0.00000000000000075981 0.5459884985979381877
## Age_60. 0.00000000000001405473 0.1774741637791431570
## Gender_Female 0.00000000000000351802 0.0000000000000054436
## Gender_Male -0.00000000000000366200 0.0000000000000009350
## Gender_Transgender 0.00000000000000036776 -0.0000000000000041425
## Severity_Mild 0.20680206090118474771 -0.0000000000000739547
## Severity_Moderate 0.36841754144612937072 0.0000000000000170627
## Severity_None 0.28513037870706497978 0.0000000000000267685
## Severity_Severe -0.86034998105432525239 0.0000000000000298372
## Contact_Dont.Know 0.00000000000000202963 0.0000000000000131180
## Contact_No -0.00000000000000323006 0.0000000000000114023
## Contact_Yes 0.00000000000000151962 -0.0000000000000232661
## PC13 PC14
## Fever -0.0000000000000050658 -0.0000000000000178271
## Tiredness 0.0000000000000006295 0.0000000000000027374
## Dry.Cough 0.0000000000000028276 0.0000000000000102126
## Difficulty.in.Breathing -0.0000000000000003565 -0.0000000000000016880
## Sore.Throat 0.0000000000000016750 0.0000000000000052746
## None_Sympton -0.0000000000000001954 -0.0000000000000007889
## Pains -0.0000000000000004233 -0.0000000000000010408
## Nasal.Congestion 0.0000000000000015899 0.0000000000000057905
## Runny.Nose 0.0000000000000011033 0.0000000000000033411
## Diarrhea 0.0000000000000011948 0.0000000000000027790
## None_Experiencing -0.0000000000000016046 -0.0000000000000049856
## Age_0.9 -0.1845713023717037438 -0.6526481777607980872
## Age_10.19 -0.0977535710873589925 0.7433487541608907456
## Age_20.24 0.1492120919401024037 -0.1390165838393611075
## Age_25.59 -0.6136104707521466972 0.0464083619546589443
## Age_60. 0.7467232522713272980 0.0019076454808065191
## Gender_Female 0.0000000000000009480 0.0000000000000014745
## Gender_Male -0.0000000000000099486 -0.0000000000000036152
## Gender_Transgender 0.0000000000000072277 0.0000000000000065538
## Severity_Mild -0.0000000000000586944 -0.0000000000000057038
## Severity_Moderate 0.0000000000000172883 -0.0000000000000065711
## Severity_None 0.0000000000000229894 0.0000000000000077091
## Severity_Severe 0.0000000000000235840 0.0000000000000035041
## Contact_Dont.Know 0.0000000000000004337 -0.0000000000000122784
## Contact_No 0.0000000000000012473 0.0000000000000001006
## Contact_Yes -0.0000000000000012646 0.0000000000000143219
## PC15 PC16
## Fever -0.00000000000000244892 -0.488781361520113422
## Tiredness 0.00000000000000045823 0.111383616625499254
## Dry.Cough 0.00000000000000146689 0.580361264593369341
## Difficulty.in.Breathing -0.00000000000000032739 0.111383616624441614
## Sore.Throat 0.00000000000000057455 -0.488781361521033519
## None_Sympton -0.00000000000000007854 0.400691581786861373
## Pains -0.00000000000000056205 -0.000000000000010540
## Nasal.Congestion 0.00000000000000026194 0.000000000000005588
## Runny.Nose 0.00000000000000150531 0.000000000000008024
## Diarrhea 0.00000000000000024568 -0.000000000000014163
## None_Experiencing -0.00000000000000094456 -0.000000000000012753
## Age_0.9 -0.08995606045526935435 0.000000000000016707
## Age_10.19 0.09475056930898999308 -0.000000000000001483
## Age_20.24 0.80548218892901113364 -0.000000000000001482
## Age_25.59 -0.35103420563199405624 -0.000000000000001413
## Age_60. -0.45924249215140583447 -0.000000000000001285
## Gender_Female 0.00000000000000064272 -0.000000000000001338
## Gender_Male -0.00000000000000486828 0.000000000000008359
## Gender_Transgender 0.00000000000000009801 -0.000000000000001339
## Severity_Mild -0.00000000000004526848 -0.000000000000035937
## Severity_Moderate 0.00000000000001030209 0.000000000000016663
## Severity_None 0.00000000000001546940 0.000000000000006128
## Severity_Severe 0.00000000000001392831 0.000000000000006128
## Contact_Dont.Know -0.00000000000000776245 0.000000000000012295
## Contact_No 0.00000000000000202225 -0.000000000000002629
## Contact_Yes 0.00000000000000283714 -0.000000000000009163
## PC17 PC18
## Fever -0.00000000000001744464 0.3903340329286217325
## Tiredness -0.00000000000000595821 0.2269288384815393855
## Dry.Cough -0.00000000000002545808 0.0397382035935744116
## Difficulty.in.Breathing 0.00000000000000062004 0.2269288384846792350
## Sore.Throat -0.00000000000001278646 0.3903340329243994988
## None_Sympton -0.00000000000003389717 0.7685741117050595150
## Pains -0.49166360000560599408 0.0000000000000009229
## Nasal.Congestion 0.45336347670304943946 0.0000000000000117792
## Runny.Nose 0.45336347670359111728 0.0000000000000361888
## Diarrhea -0.49166360000552394860 -0.0000000000000167910
## None_Experiencing 0.32474131988721954833 0.0000000000000324174
## Age_0.9 0.00000000000000194983 0.0000000000000006170
## Age_10.19 -0.00000000000000059674 0.0000000000000080583
## Age_20.24 -0.00000000000000056899 0.0000000000000080598
## Age_25.59 -0.00000000000000009714 0.0000000000000080772
## Age_60. 0.00000000000000002776 0.0000000000000079726
## Gender_Female 0.00000000000000204003 -0.0000000000000008905
## Gender_Male 0.00000000000000312250 -0.0000000000000028569
## Gender_Transgender 0.00000000000000224820 -0.0000000000000008910
## Severity_Mild 0.00000000000000133227 0.0000000000000110916
## Severity_Moderate 0.00000000000000263678 -0.0000000000000099553
## Severity_None -0.00000000000000290046 0.0000000000000002970
## Severity_Severe 0.00000000000000491274 0.0000000000000002970
## Contact_Dont.Know 0.00000000000000367761 -0.0000000000000023336
## Contact_No 0.00000000000000238698 -0.0000000000000102085
## Contact_Yes -0.00000000000000582867 -0.0000000000000129348
## PC19 PC20
## Fever -0.4609828708800784991 0.00000000000018514121
## Tiredness 0.5361854089309092686 -0.00000000000015723211
## Dry.Cough 0.0000000000042169115 -0.00000000000003973745
## Difficulty.in.Breathing -0.5361854089342895646 0.00000000000023115821
## Sore.Throat 0.4609828708864614488 -0.00000000000017281484
## None_Sympton 0.0000000000022613855 0.00000000000000889718
## Pains -0.0000000000001756315 -0.43190880498485145766
## Nasal.Congestion -0.0000000000000499215 -0.20439762415453949229
## Runny.Nose -0.0000000000000987951 -0.20439762414852172268
## Diarrhea -0.0000000000001467525 -0.43190880498759387507
## None_Experiencing -0.0000000000002762506 -0.73712467794340263971
## Age_0.9 0.0000000000000153425 0.00000000000000064185
## Age_10.19 -0.0000000000000047256 -0.00000000000000123512
## Age_20.24 -0.0000000000000047234 -0.00000000000000141553
## Age_25.59 -0.0000000000000047344 -0.00000000000000111022
## Age_60. -0.0000000000000046711 -0.00000000000000099920
## Gender_Female -0.0000000000000044903 -0.00000000000000002776
## Gender_Male 0.0000000000000033204 -0.00000000000000136696
## Gender_Transgender -0.0000000000000044904 0.00000000000000001388
## Severity_Mild 0.0000000000000049358 0.00000000000000358047
## Severity_Moderate -0.0000000000000046872 0.00000000000000378864
## Severity_None 0.0000000000000020329 0.00000000000000078410
## Severity_Severe 0.0000000000000020329 0.00000000000000256739
## Contact_Dont.Know -0.0000000000000005298 0.00000000000000144329
## Contact_No -0.0000000000000018665 0.00000000000000063838
## Contact_Yes -0.0000000000000052411 -0.00000000000000111022
## PC21 PC22
## Fever -0.00000000000002865428 -0.250819722793808153
## Tiredness 0.00000000000002077141 0.502137486220437301
## Dry.Cough -0.00000000000001112701 -0.608102329863072444
## Difficulty.in.Breathing -0.00000000000002281610 0.502137486213335205
## Sore.Throat 0.00000000000000771039 -0.250819722789301425
## None_Sympton -0.00000000000001577137 -0.010314048875738085
## Pains 0.35144579310832746444 0.000000000000032715
## Nasal.Congestion -0.61358443144113872236 0.000000000000011233
## Runny.Nose 0.61358443144284258164 -0.000000000000002998
## Diarrhea -0.35144579310396539817 0.000000000000039611
## None_Experiencing 0.00000000000328935490 0.000000000000034161
## Age_0.9 -0.00000000000000006418 -0.000000000000002836
## Age_10.19 0.00000000000000349720 0.000000000000001036
## Age_20.24 0.00000000000000380251 0.000000000000001036
## Age_25.59 0.00000000000000409395 0.000000000000001014
## Age_60. 0.00000000000000421885 0.000000000000001038
## Gender_Female -0.00000000000000107553 -0.000000000000001914
## Gender_Male 0.00000000000000027756 -0.000000000000001518
## Gender_Transgender -0.00000000000000131145 -0.000000000000001914
## Severity_Mild 0.00000000000000099920 0.000000000000009229
## Severity_Moderate 0.00000000000000108247 -0.000000000000003488
## Severity_None 0.00000000000000359435 -0.000000000000006431
## Severity_Severe -0.00000000000000092287 -0.000000000000006431
## Contact_Dont.Know 0.00000000000000403844 -0.000000000000006467
## Contact_No 0.00000000000000180411 0.000000000000005790
## Contact_Yes 0.00000000000000532907 -0.000000000000002278
## PC23 PC24
## Fever 0.000000000000005495 0.00000000000000604140
## Tiredness 0.000000000000005284 0.00000000000000819646
## Dry.Cough 0.000000000000004330 -0.00000000000000005413
## Difficulty.in.Breathing 0.000000000000005007 -0.00000000000000315429
## Sore.Throat 0.000000000000001815 -0.00000000000001099775
## None_Sympton 0.000000000000012104 0.00000000000000087149
## Pains 0.000000000000004274 0.00000000000000283107
## Nasal.Congestion -0.000000000000001651 0.00000000000000066613
## Runny.Nose 0.000000000000005468 -0.00000000000000255351
## Diarrhea -0.000000000000001846 0.00000000000000408701
## None_Experiencing -0.000000000000000111 0.00000000000000352496
## Age_0.9 -0.447213508214742483 -0.00021356787907845197
## Age_10.19 -0.447213508216977140 -0.00021356787907871044
## Age_20.24 -0.447213508214116817 -0.00021356787907625407
## Age_25.59 -0.447213508213751387 -0.00021356787907753083
## Age_60. -0.447213508214213573 -0.00021356787908130559
## Gender_Female 0.000211405624445354 0.03332673349230962784
## Gender_Male 0.000211405624451870 0.03332673349231155685
## Gender_Transgender 0.000211405624445341 0.03332673349230839271
## Severity_Mild -0.000251187242563372 0.49877764549716285813
## Severity_Moderate -0.000251187242571380 0.49877764549709063813
## Severity_None -0.000251187242571359 0.49877764549712144682
## Severity_Severe -0.000251187242571484 0.49877764549714942444
## Contact_Dont.Know 0.000036028205324973 0.02273909890735868744
## Contact_No 0.000036028205326000 0.02273909890735935357
## Contact_Yes 0.000036028205332855 0.02273909890736006134
## PC25 PC26
## Fever 0.0000000000000043200 0.0000000000000054800
## Tiredness 0.0000000000000080508 0.0000000000000086093
## Dry.Cough -0.0000000000000020127 0.0000000000000018950
## Difficulty.in.Breathing -0.0000000000000008553 0.0000000000000020854
## Sore.Throat 0.0000000000000026839 0.0000000000000064849
## None_Sympton 0.0000000000000007144 0.0000000000000091043
## Pains 0.0000000000000023315 -0.0000000000000018874
## Nasal.Congestion -0.0000000000000027339 0.0000000000000047046
## Runny.Nose -0.0000000000000041356 -0.0000000000000040246
## Diarrhea -0.0000000000000011935 0.0000000000000012837
## None_Experiencing -0.0000000000000003469 0.0000000000000006523
## Age_0.9 0.0001798737341731204 0.0000102199423333248
## Age_10.19 0.0001798737341707352 0.0000102199423324262
## Age_20.24 0.0001798737341728585 0.0000102199423332450
## Age_25.59 0.0001798737341681955 0.0000102199423319266
## Age_60. 0.0001798737341699996 0.0000102199423316213
## Gender_Female 0.5700301824464704126 -0.0853710054747235797
## Gender_Male 0.5700301824465009437 -0.0853710054747276320
## Gender_Transgender 0.5700301824464559797 -0.0853710054747208041
## Severity_Mild -0.0314427698329692307 -0.0152364524262884882
## Severity_Moderate -0.0314427698329646926 -0.0152364524262871004
## Severity_None -0.0314427698329665176 -0.0152364524262873433
## Severity_Severe -0.0314427698329686131 -0.0152364524262883771
## Contact_Dont.Know 0.0841467377625422508 0.5707325047165981990
## Contact_No 0.0841467377625457480 0.5707325047166211807
## Contact_Yes 0.0841467377625474966 0.5707325047166350585
summary(covid.pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.3823 1.3430 1.2757 1.2247 1.2247 1.2247 1.2247 1.1815
## Proportion of Variance 0.0735 0.0694 0.0626 0.0577 0.0577 0.0577 0.0577 0.0537
## Cumulative Proportion 0.0735 0.1429 0.2055 0.2631 0.3208 0.3785 0.4362 0.4899
## PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16
## Standard deviation 1.1547 1.1547 1.1547 1.1180 1.1180 1.1180 1.1180 0.926
## Proportion of Variance 0.0513 0.0513 0.0513 0.0481 0.0481 0.0481 0.0481 0.033
## Cumulative Proportion 0.5412 0.5925 0.6438 0.6918 0.7399 0.7880 0.8361 0.869
## PC17 PC18 PC19 PC20 PC21 PC22
## Standard deviation 0.8698 0.8097 0.7323 0.7265 0.7183 0.6426
## Proportion of Variance 0.0291 0.0252 0.0206 0.0203 0.0198 0.0159
## Cumulative Proportion 0.8981 0.9234 0.9440 0.9643 0.9841 1.0000
## PC23 PC24 PC25
## Standard deviation 0.00000000000288 0.0000000000000671 0.000000000000042
## Proportion of Variance 0.00000000000000 0.0000000000000000 0.000000000000000
## Cumulative Proportion 1.00000000000000 1.0000000000000000 1.000000000000000
## PC26
## Standard deviation 0.0000000000000356
## Proportion of Variance 0.0000000000000000
## Cumulative Proportion 1.0000000000000000
PCA is also not the best way to reduce the dimensions of the given dataset, because the first two principal components explain only 14% of the variance. The 95% level is reached only with the 20th component, which is a rather poor outcome.
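The component count quoted above can be read off programmatically. The sketch below is a minimal illustration on simulated binary data (the toy matrix and the helper name `n_components_for` are assumptions, not part of the original analysis). It computes the cumulative share of variance from a `prcomp` object and returns the first component at which a given threshold is reached.

```r
# Hypothetical helper: smallest number of components whose cumulative
# share of variance reaches `threshold`.
n_components_for <- function(pca, threshold = 0.95) {
  var_share <- pca$sdev^2 / sum(pca$sdev^2)   # proportion of variance per PC
  which(cumsum(var_share) >= threshold)[1]
}

set.seed(1)
# Simulated 0/1 data, used only for illustration.
toy <- matrix(rbinom(500 * 8, size = 1, prob = 0.5), ncol = 8)
toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)

k95 <- n_components_for(toy_pca, 0.95)
k95
```

Applied to `covid.pca1`, the same call should return 20, in line with the cumulative proportions in the summary above.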
library(ggplot2)
library(factoextra)
fviz_eig(covid.pca1, choice='eigenvalue') # eigenvalues on y-axis
fviz_eig(covid.pca1)
The scree plots present the eigenvalues and the percentage of explained variance for each component.
eig.val<-get_eigenvalue(covid.pca1)
eig.val
## eigenvalue variance.percent
## Dim.1 1.910817285867398629406466170622 7.349297253337395474659388128202
## Dim.2 1.803730751615455263348053449590 6.937425967752927746801105968188
## Dim.3 1.627348479587384932898430633941 6.259032613798695798834614834050
## Dim.4 1.500000000000014432899320127035 5.769230769231803002128344814992
## Dim.5 1.500000000000012212453270876722 5.769230769231794120344147813739
## Dim.6 1.499999999999937827510620991234 5.769230769231508126893004373414
## Dim.7 1.499999999999932276395497865451 5.769230769231487698789351270534
## Dim.8 1.395960175667761760109897295479 5.369077598723070998687489918666
## Dim.9 1.333333333333392101138770158286 5.128205128206223761822002416011
## Dim.10 1.333333333333301951029170595575 5.128205128205876484059899667045
## Dim.11 1.333333333333276193854999291943 5.128205128205778784433732653270
## Dim.12 1.250000000002040811963865962753 4.807692307700972556006036029430
## Dim.13 1.250000000000484279283341493283 4.807692307694985345278837485239
## Dim.14 1.249999999998756772257024749706 4.807692307688341770699480548501
## Dim.15 1.249999999998430144643180028652 4.807692307687084998235604871297
## Dim.16 0.857100054855390425245786900632 3.296538672521291513106689308188
## Dim.17 0.756608521367482578234842094389 2.910032774490811213752294861479
## Dim.18 0.655574438451520746795608829416 2.521440147890892102822135711904
## Dim.19 0.536287884047667695597283454845 2.062645707875994549596043725614
## Dim.20 0.527755965111434854897254354000 2.029830635044324260718440200435
## Dim.21 0.515944586236422697567149953102 1.984402254755808447939102734381
## Dim.22 0.412871857188091051504841288988 1.587968681492927247589364014857
## Dim.23 0.000000000000000000000008304858 0.000000000000000000000031941762
## Dim.24 0.000000000000000000000000004500 0.000000000000000000000000017308
## Dim.25 0.000000000000000000000000001764 0.000000000000000000000000006786
## Dim.26 0.000000000000000000000000001269 0.000000000000000000000000004882
## cumulative.variance.percent
## Dim.1 7.349
## Dim.2 14.287
## Dim.3 20.546
## Dim.4 26.315
## Dim.5 32.084
## Dim.6 37.853
## Dim.7 43.623
## Dim.8 48.992
## Dim.9 54.120
## Dim.10 59.248
## Dim.11 64.376
## Dim.12 69.184
## Dim.13 73.992
## Dim.14 78.799
## Dim.15 83.607
## Dim.16 86.904
## Dim.17 89.814
## Dim.18 92.335
## Dim.19 94.398
## Dim.20 96.428
## Dim.21 98.412
## Dim.22 100.000
## Dim.23 100.000
## Dim.24 100.000
## Dim.25 100.000
## Dim.26 100.000
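The four eigenvalues that are numerically zero (Dim.23 to Dim.26) are consistent with the one-hot coding of the data: within each of the age, gender, severity and contact groups the indicator columns appear to sum to 1 in every row, which makes one direction per group redundant after centring. The following sketch reproduces the effect on simulated data; the two toy factors are assumptions made only for illustration.

```r
set.seed(2)
n <- 300
# Two simulated categorical variables; rep() guarantees every level occurs.
f1 <- factor(sample(rep(c("a", "b", "c"), length.out = n)))
f2 <- factor(sample(rep(c("w", "x", "y", "z"), length.out = n)))

# One-hot (dummy) coding: each group of columns sums to 1 in every row.
onehot <- cbind(model.matrix(~ f1 - 1), model.matrix(~ f2 - 1))

pca <- prcomp(onehot, center = TRUE, scale. = TRUE)
eig <- pca$sdev^2

# Number of eigenvalues that are zero up to floating-point noise:
# one per one-hot group.
n_zero <- sum(eig < 1e-10)
n_zero
```

With two one-hot groups the sketch yields two null eigenvalues; the four groups in the Covid-19 data would give four by the same reasoning.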
library(gridExtra)
var<-get_pca_var(covid.pca1)
a<-fviz_contrib(covid.pca1, "var", axes=1, xtickslab.rt=90) # default angle=45°
b<-fviz_contrib(covid.pca1, "var", axes=2, xtickslab.rt=90)
grid.arrange(a,b,top='Contribution to the first two Principal Components')
In the first plot “Contribution of variables to Dim-1” it can be seen that the most important variables are ‘Dry cough’, ‘None symptom’, ‘Difficulty in breathing’, and ‘Tiredness’.
From the second plot “Contribution of variables to Dim-2” it can be seen that the most important variables are ‘None Experiencing’, ‘Nasal Congestion’, and ‘Runny nose’ symptoms.
The Multiple Correspondence Analysis (MCA) procedure was applied because it is the proper way to reduce dimensions when the data are categorical.
According to Wikipedia: “(…) multiple correspondence analysis (MCA) is a data analysis technique for nominal categorical data, used to detect and represent underlying structures in a data set. It does this by representing data as points in a low-dimensional Euclidean space. The procedure thus appears to be the counterpart of principal component analysis for categorical data. MCA can be viewed as an extension of simple correspondence analysis (CA) in that it is applicable to a large set of categorical variables.”
df <- read.csv("cleaned_data_ul_covid19.csv", sep=",", dec=".", header=T)
covid <- df[,1:26]
covid$Fever <- ifelse(covid$Fever==1, "Fever", "No fever")
covid$Tiredness <- ifelse(covid$Tiredness==1, "Tiredness", "No tiredness")
covid$Dry.Cough <- ifelse(covid$Dry.Cough==1, "Dry cough", "No dry cough")
covid$Difficulty.in.Breathing <- ifelse(covid$Difficulty.in.Breathing==1, "Difficulties in breathing", "No difficulties in breathing")
covid$Sore.Throat <- ifelse(covid$Sore.Throat==1, "Sore throat", "No sore throat")
covid$None_Sympton <- ifelse(covid$None_Sympton==1, "None symptom", "Symptom")
covid$Pains <- ifelse(covid$Pains==1, "Pains", "No pains")
covid$Nasal.Congestion <- ifelse(covid$Nasal.Congestion==1, "Nasal congestion", "No nasal congestion")
covid$Runny.Nose <- ifelse(covid$Runny.Nose==1, "Runny nose", "No runny nose")
covid$Diarrhea <- ifelse(covid$Diarrhea==1, "Diarrhea", "No diarrhea")
covid$None_Experiencing <- ifelse(covid$None_Experiencing==1, "None experiencing", "Experiencing")
covid$Age_0.9 <- ifelse(covid$Age_0.9==1, "In age between 0 and 9 y.o.", "Not in age between 0 and 9 y.o.")
covid$Age_10.19 <- ifelse(covid$Age_10.19==1, "In age between 10 and 19 y.o.", "Not in age between 10 and 19 y.o.")
covid$Age_20.24 <- ifelse(covid$Age_20.24==1, "In age between 20 and 24 y.o.", "Not in age between 20 and 24 y.o.")
covid$Age_25.59 <- ifelse(covid$Age_25.59==1, "In age between 25 and 59 y.o.", "Not in age between 25 and 59 y.o.")
covid$Age_60. <- ifelse(covid$Age_60.==1, "60 years old and older", "Not 60 years old or older")
covid$Gender_Female <- ifelse(covid$Gender_Female==1, "Female", "Not female")
covid$Gender_Male <- ifelse(covid$Gender_Male==1, "Male", "Not male")
covid$Gender_Transgender <- ifelse(covid$Gender_Transgender==1, "Transgender", "Not transgender")
covid$Severity_Mild <- ifelse(covid$Severity_Mild==1, "Severity mild", "Not severity mild")
covid$Severity_Moderate <- ifelse(covid$Severity_Moderate==1, "Severity moderate", "Not severity moderate")
covid$Severity_Severe <- ifelse(covid$Severity_Severe==1, "Severity severe", "Not severity severe")
covid$Severity_None <- ifelse(covid$Severity_None==1, "Severity none", "Not severity none")
covid$Contact_Dont.Know <- ifelse(covid$Contact_Dont.Know==1, "Doesn't know about the contact", "Knows about the contact")
covid$Contact_No <- ifelse(covid$Contact_No==1, "No contact", "Contact")
covid$Contact_Yes <- ifelse(covid$Contact_Yes==1, "Having contact with ill person", "Not having contact with ill person")
library(FactoMineR)
mca <- MCA(covid, graph=F)
library(factoextra)
fviz_screeplot(mca, addlabels = T)
fviz_contrib(mca, choice = "var", axes =1:2)
The share of explained variance is the same as in the PCA; however, the categories contributing most turned out to be ‘None experiencing’, ‘None symptom’, and ‘No dry cough’, which is a very interesting outcome.
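The column-by-column recoding used before the MCA call can also be written as a loop. The sketch below is one possible shorthand on a toy data frame. The helper `to_category` is hypothetical, and the labels it builds ("Fever" / "No Fever") are generated from the column names, so they differ slightly from the hand-written labels used in the analysis.

```r
# Toy 0/1 data frame standing in for the symptom columns.
toy <- data.frame(Fever = c(1, 0, 1, 0),
                  Dry.Cough = c(0, 0, 1, 1))

# Hypothetical helper: turn a 0/1 vector into a two-level factor
# labelled from the column name.
to_category <- function(x, name) {
  label <- gsub(".", " ", name, fixed = TRUE)
  factor(ifelse(x == 1, label, paste("No", label)),
         levels = c(label, paste("No", label)))
}

# Apply the helper to every column, keeping the column names.
toy_cat <- as.data.frame(Map(to_category, toy, names(toy)),
                         stringsAsFactors = TRUE)
str(toy_cat)
```

The result is a data frame of factors, which is the kind of input a categorical method such as MCA works on.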
fviz_mca_var(mca, col.var = "contrib",
gradient.cols = c("darkgreen", "yellow", "darkred"),
repel = TRUE,
ggtheme = theme_minimal()
)
The above chart presents the MCA results. Categories with the largest contribution to the first two dimensions are shown in reddish colours; the greener the colour, the smaller the contribution of the category.
The hierarchical clustering method was applied to the dataset. However, the computational limits of my computer did not allow me to cluster all 316 800 observations, so I drew a random sample of 10 000 observations from the whole dataset. All further operations are done on this sample of the initial dataset.
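Because the sample is random, the clusters reported below change from run to run unless the random seed is fixed. A minimal sketch, assuming a seed value of 123 (an arbitrary choice, not taken from the original analysis) and a toy data frame in place of `df`:

```r
# Toy data frame standing in for the full dataset.
toy_df <- data.frame(id = 1:1000, x = rep(c(0, 1), 500))

set.seed(123)                          # arbitrary seed (assumption)
rows_a <- sample(nrow(toy_df), 100)    # row indices, drawn without replacement
set.seed(123)
rows_b <- sample(nrow(toy_df), 100)

identical(rows_a, rows_b)              # same seed gives the same sample
sampl_toy <- toy_df[rows_a, ]
```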
covid_clust <- df[,1:26]
sampl <- covid_clust[sample(nrow(df), 10000), ]
dm<-dist(t(sampl))
hc<-hclust(dm, method="complete")
plot(hc)
plot(density(dm))
The dendrogram is not very clear. The density plot of the distances indicates that the density peaks at a distance of around 60.
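For 0/1 columns the Euclidean distance used by `dist` has a direct reading: the squared distance between two variables equals the number of observations on which they disagree. A distance of about 60 therefore corresponds to roughly 60² = 3 600 disagreements among the 10 000 sampled rows. The toy matrix below is an assumption used only to check this identity.

```r
set.seed(3)
# Simulated binary data: 200 rows, 4 variables.
toy <- matrix(rbinom(200 * 4, 1, 0.5), ncol = 4,
              dimnames = list(NULL, c("A", "B", "C", "D")))

# Euclidean distances between columns (hence the transpose).
d <- as.matrix(dist(t(toy)))

# Number of rows where columns A and B differ.
mismatch_AB <- sum(toy[, "A"] != toy[, "B"])

c(squared_distance = d["A", "B"]^2, mismatches = mismatch_AB)
```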
# cutting by distance between units
clust<-cutree(hc, h=60)
summary(clust)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 4.00 8.50 8.31 11.75 18.00
clust
## Fever Tiredness Dry.Cough
## 1 1 2
## Difficulty.in.Breathing Sore.Throat None_Sympton
## 2 3 4
## Pains Nasal.Congestion Runny.Nose
## 5 6 7
## Diarrhea None_Experiencing Age_0.9
## 8 4 9
## Age_10.19 Age_20.24 Age_25.59
## 10 11 12
## Age_60. Gender_Female Gender_Male
## 4 13 14
## Gender_Transgender Severity_Mild Severity_Moderate
## 15 10 4
## Severity_None Severity_Severe Contact_Dont.Know
## 11 9 16
## Contact_No Contact_Yes
## 17 18
Cutting the tree at this distance produced 18 clusters, which is quite a large number. I therefore checked which number of clusters gives a reasonable compromise: not as high as 18, but also not too small, and with a high value of Q (the share of inter-cluster inertia in the total inertia). I compared 3, 4, 5, and 6 clusters. The results are presented below.
library(ClustGeo)
dm_1<-dist(t(sampl)) # distances between variables (columns of the sample)
hc_1<-hclust(dm_1, method="complete") # simple dendrogram
# cutting by number of clusters
clust.vec.3<-cutree(hc_1, k=3) # division into 3 clusters
clust.vec.4<-cutree(hc_1, k=4) # division into 4 clusters
clust.vec.5<-cutree(hc_1, k=5) # division into 5 clusters
clust.vec.6<-cutree(hc_1, k=6) # division into 6 clusters
diss.mat<-dm_1
inertion<-matrix(0, nrow=4, ncol=4)
colnames(inertion)<-c("division to 3 clust.", "division to 4 clust.", "division to 5 clust.", "division to 6 clust.")
rownames(inertion)<-c("intra-clust", "total", "percentage", "Q")
inertion[1,1]<-withindiss(diss.mat, part=clust.vec.3)# intra-cluster
inertion[2,1]<-inertdiss(diss.mat) # overall
inertion[3,1]<-inertion[1,1]/ inertion[2,1] # ratio
inertion[4,1]<-1-inertion[3,1] # Q, inter-cluster
inertion[1,2]<-withindiss(diss.mat, part=clust.vec.4) # intra-cluster
inertion[2,2]<-inertdiss(diss.mat) # overall
inertion[3,2]<-inertion[1,2]/ inertion[2,2] # ratio
inertion[4,2]<-1-inertion[3,2] # Q, inter-cluster
inertion[1,3]<-withindiss(diss.mat, part=clust.vec.5) # intra-cluster
inertion[2,3]<-inertdiss(diss.mat) # overall
inertion[3,3]<-inertion[1,3]/ inertion[2,3] # ratio
inertion[4,3]<-1-inertion[3,3] # Q, inter-cluster
inertion[1,4]<-withindiss(diss.mat, part=clust.vec.6) # intra-cluster
inertion[2,4]<-inertdiss(diss.mat) # overall
inertion[3,4]<-inertion[1,4]/ inertion[2,4] # ratio
inertion[4,4]<-1-inertion[3,4] # Q, inter-cluster
options("scipen"=100, "digits"=4)
inertion
## division to 3 clust. division to 4 clust. division to 5 clust.
## intra-clust 1871.7256 1733.7154 1607.0449
## total 2116.2544 2116.2544 2116.2544
## percentage 0.8845 0.8192 0.7594
## Q 0.1155 0.1808 0.2406
## division to 6 clust.
## intra-clust 1477.160
## total 2116.254
## percentage 0.698
## Q 0.302
Based on these results, I decided that 4 clusters provide the best balance between the number of clusters and the Q value. For 3, 4, 5, and 6 clusters Q equals 11.6%, 18.1%, 24.1%, and 30.2% respectively, and the largest single gain (6.5 percentage points) occurs between 3 and 4 clusters. Therefore, the further analysis is conducted for the 4-cluster solution.
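The share of between-cluster inertia can also be computed for a whole range of k in one pass. The sketch below is a base-R stand-in rather than the ClustGeo functions used above: it assumes Euclidean coordinates and equal weights, defines inertia as the sum of squared deviations from the centroid, and runs on simulated data. The helper names `inertia` and `q_for_k` are hypothetical.

```r
# Sum of squared deviations of the rows of x from their centroid.
inertia <- function(x) {
  x <- as.matrix(x)
  sum(sweep(x, 2, colMeans(x))^2)
}

# Q = 1 - within-cluster inertia / total inertia for a k-cluster cut.
q_for_k <- function(x, hc, k) {
  groups <- cutree(hc, k = k)
  within <- sum(sapply(split(seq_len(nrow(x)), groups),
                       function(idx) inertia(x[idx, , drop = FALSE])))
  1 - within / inertia(x)
}

set.seed(4)
# 26 simulated "variables" stored as rows, 50 binary observations each.
x <- matrix(rbinom(26 * 50, 1, 0.5), nrow = 26)
hc_toy <- hclust(dist(x), method = "complete")

q <- sapply(3:6, function(k) q_for_k(x, hc_toy, k))
names(q) <- paste0("k=", 3:6)
round(q, 3)
diff(q)   # gain in Q from each additional cluster
```

Since Q is a ratio, a constant rescaling of the inertia leaves it unchanged, so the comparison of gains between consecutive values of k reads the same way as in the table above.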
clust_4<-cutree(hc, k=4) # division into 4 clusters
summary(clust_4) # -> 4 clusters
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 2.00 2.31 3.00 4.00
clust_4
## Fever Tiredness Dry.Cough
## 1 1 2
## Difficulty.in.Breathing Sore.Throat None_Sympton
## 2 3 4
## Pains Nasal.Congestion Runny.Nose
## 1 1 2
## Diarrhea None_Experiencing Age_0.9
## 2 4 3
## Age_10.19 Age_20.24 Age_25.59
## 1 1 3
## Age_60. Gender_Female Gender_Male
## 4 3 1
## Gender_Transgender Severity_Mild Severity_Moderate
## 4 1 4
## Severity_None Severity_Severe Contact_Dont.Know
## 1 3 3
## Contact_No Contact_Yes
## 1 4
plot(hc, hang=-1) # hang=-1 draws all labels at the same level, which reads better
plot(hc)
rect.hclust(hc, k=4, border=5:8)
library(factoextra)
fviz_cluster(list(data=t(sampl), cluster=clust.vec.4))
library(cluster)
plot(silhouette(clust.vec.4,dm_1))
The plots above visualize the division of the data into 4 clusters. On the one hand, dimension reduction and clustering make messy data easier to understand. On the other hand, some variables in my analysis have a negative silhouette value, which suggests that the clustering could be done better.
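A negative silhouette width means that an item is, on average, closer to the members of another cluster than to those of its own. The sketch below computes the widths by hand from a distance matrix to make that definition concrete. The function name `silhouette_widths` and the toy points are assumptions; for real work `cluster::silhouette`, as used above, is the tool to reach for.

```r
# Hand-rolled silhouette widths (illustrative, assumes at least two clusters).
silhouette_widths <- function(d, groups) {
  d <- as.matrix(d)
  n <- nrow(d)
  sapply(seq_len(n), function(i) {
    own <- groups == groups[i]
    if (sum(own) == 1) return(0)              # singleton cluster: width set to 0
    a <- sum(d[i, own]) / (sum(own) - 1)      # mean distance within own cluster
    others <- setdiff(unique(groups), groups[i])
    # mean distance to the nearest other cluster
    b <- min(sapply(others, function(g) mean(d[i, groups == g])))
    (b - a) / max(a, b)
  })
}

# Two tight groups on a line; point 4 (value 5.1) is assigned to the wrong group.
pts <- c(0, 0.1, 0.2, 5.1, 5, 5.2, 5.3)
groups <- c(1, 1, 1, 1, 2, 2, 2)
s <- silhouette_widths(dist(pts), groups)
round(s, 2)
```

In this toy example only the misassigned point gets a negative width, which is the pattern to look for in the silhouette plot.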
Nevertheless, clustering into 18 groups (the number obtained earlier by cutting the tree at a distance of 60) turned out to be extremely messy overall. Frankly speaking, dividing the data into that many groups makes little sense for further analysis.
This paper addressed the problem of analysing big data (316 800 observations) with unsupervised learning methods, using a medical dataset on Covid-19 illnesses and coexisting symptoms. Dimension reduction via multiple correspondence analysis and the hierarchical clustering method were applied in the analysis. The results partially helped with understanding the dataset; however, I personally expected the outcome to be more transparent. The outcome of the MCA is not that easy to interpret (probably because of the characteristics of the data). The outcome of the hierarchical clustering is clearer.
Both dimension reduction and clustering are crucial in the world of data science. These techniques not only make data easier to understand and visualize, but are also very powerful tools for handling big data, as was presented in my paper. Nevertheless, my paper also shows that even specialized methods (like MCA or PCA) are not universal and do not suit every type of data.