Covid-19 dimension reduction and clustering on symptoms

The aim of this paper is to explore how far the dimensionality of a Covid-19 symptoms dataset can be reduced and to group co-occurring symptoms into clusters. Dimension reduction was carried out with Principal Component Analysis (PCA) and then with Multiple Correspondence Analysis (MCA). Clustering was carried out with the hierarchical clustering method.

Data

The data was taken from https://www.kaggle.com/imdevskp/corona-virus-report and contains a list of officially recorded Covid-19 cases together with their coexisting symptoms. The data was collected from the beginning of the pandemic until March 2020 and contains 316 800 observations. The data is divided into 27 categories. Apart from the country name, all variables are binary.

## [1] "C:/Users/Mateusz/Documents/Studia/Master/UW/DS WNE/Unsupervised Learning/projekt UL"
# Read the cleaned dataset; TRUE is spelled out because T can be reassigned
df <- read.csv("cleaned_data_ul_covid19.csv", sep = ",", dec = ".", header = TRUE)
summary(df)
##      Fever          Tiredness     Dry.Cough      Difficulty.in.Breathing
##  Min.   :0.0000   Min.   :0.0   Min.   :0.0000   Min.   :0.0            
##  1st Qu.:0.0000   1st Qu.:0.0   1st Qu.:0.0000   1st Qu.:0.0            
##  Median :0.0000   Median :0.5   Median :1.0000   Median :0.5            
##  Mean   :0.3125   Mean   :0.5   Mean   :0.5625   Mean   :0.5            
##  3rd Qu.:1.0000   3rd Qu.:1.0   3rd Qu.:1.0000   3rd Qu.:1.0            
##  Max.   :1.0000   Max.   :1.0   Max.   :1.0000   Max.   :1.0            
##   Sore.Throat      None_Sympton        Pains        Nasal.Congestion
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :1.0000  
##  Mean   :0.3125   Mean   :0.0625   Mean   :0.3636   Mean   :0.5455  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##    Runny.Nose        Diarrhea      None_Experiencing    Age_0.9   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0  
##  Median :1.0000   Median :0.0000   Median :0.00000   Median :0.0  
##  Mean   :0.5455   Mean   :0.3636   Mean   :0.09091   Mean   :0.2  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.0  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0  
##    Age_10.19     Age_20.24     Age_25.59      Age_60.    Gender_Female   
##  Min.   :0.0   Min.   :0.0   Min.   :0.0   Min.   :0.0   Min.   :0.0000  
##  1st Qu.:0.0   1st Qu.:0.0   1st Qu.:0.0   1st Qu.:0.0   1st Qu.:0.0000  
##  Median :0.0   Median :0.0   Median :0.0   Median :0.0   Median :0.0000  
##  Mean   :0.2   Mean   :0.2   Mean   :0.2   Mean   :0.2   Mean   :0.3333  
##  3rd Qu.:0.0   3rd Qu.:0.0   3rd Qu.:0.0   3rd Qu.:0.0   3rd Qu.:1.0000  
##  Max.   :1.0   Max.   :1.0   Max.   :1.0   Max.   :1.0   Max.   :1.0000  
##   Gender_Male     Gender_Transgender Severity_Mild  Severity_Moderate
##  Min.   :0.0000   Min.   :0.0000     Min.   :0.00   Min.   :0.00     
##  1st Qu.:0.0000   1st Qu.:0.0000     1st Qu.:0.00   1st Qu.:0.00     
##  Median :0.0000   Median :0.0000     Median :0.00   Median :0.00     
##  Mean   :0.3333   Mean   :0.3333     Mean   :0.25   Mean   :0.25     
##  3rd Qu.:1.0000   3rd Qu.:1.0000     3rd Qu.:0.25   3rd Qu.:0.25     
##  Max.   :1.0000   Max.   :1.0000     Max.   :1.00   Max.   :1.00     
##  Severity_None  Severity_Severe Contact_Dont.Know   Contact_No    
##  Min.   :0.00   Min.   :0.00    Min.   :0.0000    Min.   :0.0000  
##  1st Qu.:0.00   1st Qu.:0.00    1st Qu.:0.0000    1st Qu.:0.0000  
##  Median :0.00   Median :0.00    Median :0.0000    Median :0.0000  
##  Mean   :0.25   Mean   :0.25    Mean   :0.3333    Mean   :0.3333  
##  3rd Qu.:0.25   3rd Qu.:0.25    3rd Qu.:1.0000    3rd Qu.:1.0000  
##  Max.   :1.00   Max.   :1.00    Max.   :1.0000    Max.   :1.0000  
##   Contact_Yes       Country         
##  Min.   :0.0000   Length:316800     
##  1st Qu.:0.0000   Class :character  
##  Median :0.0000   Mode  :character  
##  Mean   :0.3333                     
##  3rd Qu.:1.0000                     
##  Max.   :1.0000
dim(df)
## [1] 316800     27

There are 316 800 observations and 27 variables: 26 binary variables and the character variable ‘Country’, which is dropped in the next step.
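Before dropping the last column by position, it can help to confirm which columns really are binary. The sketch below is only an illustration on a small made-up data frame: the helper `is_binary` and the toy values (including the country names) are assumptions, not part of the original analysis.

```r
# Hypothetical helper: TRUE when a column is numeric and holds only 0/1 values
is_binary <- function(x) is.numeric(x) && all(x %in% c(0, 1))

# Made-up toy data standing in for the real dataset
toy <- data.frame(
  Fever     = c(0, 1, 1),
  Tiredness = c(1, 0, 1),
  Country   = c("China", "Italy", "Iran"),
  stringsAsFactors = FALSE
)

# One logical flag per column
binary_cols <- vapply(toy, is_binary, logical(1))
print(binary_cols)

# Keep only the binary columns, selected by content rather than by position
toy_binary <- toy[, binary_cols, drop = FALSE]
print(names(toy_binary))
```

Applied to the real data, the same check would be expected to flag every column except ‘Country’, which matches the `summary(df)` output above.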

covid <- df[,1:26]
covid.cor <- cor(covid, method="pearson")
print(covid.cor, digits=2)
##                            Fever Tiredness Dry.Cough Difficulty.in.Breathing
## Fever                    1.0e+00      0.40   5.1e-02                   -0.13
## Tiredness                4.0e-01      1.00   3.8e-01                    0.00
## Dry.Cough                5.1e-02      0.38   1.0e+00                    0.38
## Difficulty.in.Breathing -1.3e-01      0.00   3.8e-01                    1.00
## Sore.Throat             -1.6e-01     -0.13   5.1e-02                    0.40
## None_Sympton            -1.7e-01     -0.26  -2.9e-01                   -0.26
## Pains                    0.0e+00      0.00   0.0e+00                    0.00
## Nasal.Congestion         0.0e+00      0.00   0.0e+00                    0.00
## Runny.Nose               0.0e+00      0.00   0.0e+00                    0.00
## Diarrhea                 0.0e+00      0.00   0.0e+00                    0.00
## None_Experiencing        0.0e+00      0.00   0.0e+00                    0.00
## Age_0.9                 -1.7e-18      0.00  -4.1e-18                    0.00
## Age_10.19               -7.5e-18      0.00  -2.6e-18                    0.00
## Age_20.24               -7.5e-18      0.00  -2.6e-18                    0.00
## Age_25.59               -7.5e-18      0.00  -2.6e-18                    0.00
## Age_60.                 -1.7e-18      0.00  -4.1e-18                    0.00
## Gender_Female           -1.0e-17      0.00  -5.4e-18                    0.00
## Gender_Male             -1.0e-17      0.00  -5.4e-18                    0.00
## Gender_Transgender      -1.0e-17      0.00  -5.4e-18                    0.00
## Severity_Mild            0.0e+00      0.00   0.0e+00                    0.00
## Severity_Moderate        0.0e+00      0.00   0.0e+00                    0.00
## Severity_None            0.0e+00      0.00   0.0e+00                    0.00
## Severity_Severe          0.0e+00      0.00   0.0e+00                    0.00
## Contact_Dont.Know        0.0e+00      0.00   0.0e+00                    0.00
## Contact_No               0.0e+00      0.00   0.0e+00                    0.00
## Contact_Yes              0.0e+00      0.00   0.0e+00                    0.00
##                         Sore.Throat None_Sympton    Pains Nasal.Congestion
## Fever                      -1.6e-01     -1.7e-01  0.0e+00          0.0e+00
## Tiredness                  -1.3e-01     -2.6e-01  0.0e+00          0.0e+00
## Dry.Cough                   5.1e-02     -2.9e-01  0.0e+00          0.0e+00
## Difficulty.in.Breathing     4.0e-01     -2.6e-01  0.0e+00          0.0e+00
## Sore.Throat                 1.0e+00     -1.7e-01  0.0e+00          0.0e+00
## None_Sympton               -1.7e-01      1.0e+00  0.0e+00          0.0e+00
## Pains                       0.0e+00      0.0e+00  1.0e+00          3.1e-01
## Nasal.Congestion            0.0e+00      0.0e+00  3.1e-01          1.0e+00
## Runny.Nose                  0.0e+00      0.0e+00 -6.9e-02          2.7e-01
## Diarrhea                    0.0e+00      0.0e+00 -1.8e-01         -6.9e-02
## None_Experiencing           0.0e+00      0.0e+00 -2.4e-01         -3.5e-01
## Age_0.9                    -9.1e-19     -2.4e-18  2.5e-19         -1.6e-19
## Age_10.19                  -4.6e-19     -2.5e-18  1.1e-19          2.1e-19
## Age_20.24                  -4.6e-19     -2.5e-18  1.1e-19          2.1e-19
## Age_25.59                  -4.6e-19     -2.5e-18  1.1e-19          2.1e-19
## Age_60.                    -9.1e-19     -2.4e-18  2.5e-19         -1.6e-19
## Gender_Female              -1.6e-19     -2.3e-18 -6.0e-19         -9.3e-21
## Gender_Male                -1.6e-19     -2.3e-18 -6.0e-19          1.1e-24
## Gender_Transgender         -1.6e-19     -2.3e-18 -6.0e-19         -1.9e-20
## Severity_Mild               0.0e+00      0.0e+00  0.0e+00          0.0e+00
## Severity_Moderate           0.0e+00      0.0e+00  0.0e+00          0.0e+00
## Severity_None               0.0e+00      0.0e+00  0.0e+00          0.0e+00
## Severity_Severe             0.0e+00      0.0e+00  0.0e+00          0.0e+00
## Contact_Dont.Know           0.0e+00      0.0e+00  1.6e-20          3.5e-21
## Contact_No                  0.0e+00      0.0e+00  1.6e-20          3.5e-21
## Contact_Yes                 0.0e+00      0.0e+00  1.6e-20          3.5e-21
##                         Runny.Nose Diarrhea None_Experiencing  Age_0.9
## Fever                      0.0e+00  0.0e+00           0.0e+00 -1.7e-18
## Tiredness                  0.0e+00  0.0e+00           0.0e+00  0.0e+00
## Dry.Cough                  0.0e+00  0.0e+00           0.0e+00 -4.1e-18
## Difficulty.in.Breathing    0.0e+00  0.0e+00           0.0e+00  0.0e+00
## Sore.Throat                0.0e+00  0.0e+00           0.0e+00 -9.1e-19
## None_Sympton               0.0e+00  0.0e+00           0.0e+00 -2.4e-18
## Pains                     -6.9e-02 -1.8e-01          -2.4e-01  2.5e-19
## Nasal.Congestion           2.7e-01 -6.9e-02          -3.5e-01 -1.6e-19
## Runny.Nose                 1.0e+00  3.1e-01          -3.5e-01  3.9e-20
## Diarrhea                   3.1e-01  1.0e+00          -2.4e-01  5.1e-20
## None_Experiencing         -3.5e-01 -2.4e-01           1.0e+00  1.2e-19
## Age_0.9                    3.9e-20  5.1e-20           1.2e-19  1.0e+00
## Age_10.19                  1.3e-20 -4.4e-20           1.1e-19 -2.5e-01
## Age_20.24                  1.3e-20 -4.4e-20           1.1e-19 -2.5e-01
## Age_25.59                  1.3e-20 -4.4e-20           1.1e-19 -2.5e-01
## Age_60.                    4.0e-20  5.1e-20           1.1e-19 -2.5e-01
## Gender_Female             -8.4e-20  9.7e-21          -2.9e-19 -1.8e-18
## Gender_Male               -8.4e-20  9.7e-21          -2.9e-19 -8.9e-18
## Gender_Transgender        -8.4e-20  9.7e-21          -2.9e-19 -8.9e-18
## Severity_Mild              0.0e+00  0.0e+00           0.0e+00  0.0e+00
## Severity_Moderate          0.0e+00  0.0e+00           0.0e+00  0.0e+00
## Severity_None              0.0e+00  0.0e+00           0.0e+00  0.0e+00
## Severity_Severe            0.0e+00  0.0e+00           0.0e+00  0.0e+00
## Contact_Dont.Know          3.5e-21  1.6e-20           6.1e-21  0.0e+00
## Contact_No                 3.5e-21  1.6e-20           4.5e-21  0.0e+00
## Contact_Yes                3.5e-21  1.6e-20           6.1e-21  0.0e+00
##                         Age_10.19 Age_20.24 Age_25.59  Age_60. Gender_Female
## Fever                    -7.5e-18  -7.5e-18  -7.5e-18 -1.7e-18      -1.0e-17
## Tiredness                 0.0e+00   0.0e+00   0.0e+00  0.0e+00       0.0e+00
## Dry.Cough                -2.6e-18  -2.6e-18  -2.6e-18 -4.1e-18      -5.4e-18
## Difficulty.in.Breathing   0.0e+00   0.0e+00   0.0e+00  0.0e+00       0.0e+00
## Sore.Throat              -4.6e-19  -4.6e-19  -4.6e-19 -9.1e-19      -1.6e-19
## None_Sympton             -2.5e-18  -2.5e-18  -2.5e-18 -2.4e-18      -2.3e-18
## Pains                     1.1e-19   1.1e-19   1.1e-19  2.5e-19      -6.0e-19
## Nasal.Congestion          2.1e-19   2.1e-19   2.1e-19 -1.6e-19      -9.3e-21
## Runny.Nose                1.3e-20   1.3e-20   1.3e-20  4.0e-20      -8.4e-20
## Diarrhea                 -4.4e-20  -4.4e-20  -4.4e-20  5.1e-20       9.7e-21
## None_Experiencing         1.1e-19   1.1e-19   1.1e-19  1.1e-19      -2.9e-19
## Age_0.9                  -2.5e-01  -2.5e-01  -2.5e-01 -2.5e-01      -1.8e-18
## Age_10.19                 1.0e+00  -2.5e-01  -2.5e-01 -2.5e-01       1.8e-18
## Age_20.24                -2.5e-01   1.0e+00  -2.5e-01 -2.5e-01       1.8e-18
## Age_25.59                -2.5e-01  -2.5e-01   1.0e+00 -2.5e-01       1.8e-18
## Age_60.                  -2.5e-01  -2.5e-01  -2.5e-01  1.0e+00      -1.8e-18
## Gender_Female             1.8e-18   1.8e-18   1.8e-18 -1.8e-18       1.0e+00
## Gender_Male               9.6e-22   2.7e-23   9.6e-22 -8.9e-18      -5.0e-01
## Gender_Transgender        5.9e-22  -3.4e-22   5.9e-22 -8.9e-18      -5.0e-01
## Severity_Mild             0.0e+00   0.0e+00   0.0e+00  0.0e+00       0.0e+00
## Severity_Moderate         0.0e+00   0.0e+00   0.0e+00  0.0e+00       0.0e+00
## Severity_None             0.0e+00   0.0e+00   0.0e+00  0.0e+00       0.0e+00
## Severity_Severe           0.0e+00   0.0e+00   0.0e+00  0.0e+00       0.0e+00
## Contact_Dont.Know         0.0e+00   0.0e+00   0.0e+00  0.0e+00       0.0e+00
## Contact_No                0.0e+00   0.0e+00   0.0e+00  0.0e+00       0.0e+00
## Contact_Yes               0.0e+00   0.0e+00   0.0e+00  0.0e+00       0.0e+00
##                         Gender_Male Gender_Transgender Severity_Mild
## Fever                      -1.0e-17           -1.0e-17          0.00
## Tiredness                   0.0e+00            0.0e+00          0.00
## Dry.Cough                  -5.4e-18           -5.4e-18          0.00
## Difficulty.in.Breathing     0.0e+00            0.0e+00          0.00
## Sore.Throat                -1.6e-19           -1.6e-19          0.00
## None_Sympton               -2.3e-18           -2.3e-18          0.00
## Pains                      -6.0e-19           -6.0e-19          0.00
## Nasal.Congestion            1.1e-24           -1.9e-20          0.00
## Runny.Nose                 -8.4e-20           -8.4e-20          0.00
## Diarrhea                    9.7e-21            9.7e-21          0.00
## None_Experiencing          -2.9e-19           -2.9e-19          0.00
## Age_0.9                    -8.9e-18           -8.9e-18          0.00
## Age_10.19                   9.6e-22            5.9e-22          0.00
## Age_20.24                   2.7e-23           -3.4e-22          0.00
## Age_25.59                   9.6e-22            5.9e-22          0.00
## Age_60.                    -8.9e-18           -8.9e-18          0.00
## Gender_Female              -5.0e-01           -5.0e-01          0.00
## Gender_Male                 1.0e+00           -5.0e-01          0.00
## Gender_Transgender         -5.0e-01            1.0e+00          0.00
## Severity_Mild               0.0e+00            0.0e+00          1.00
## Severity_Moderate           0.0e+00            0.0e+00         -0.33
## Severity_None               0.0e+00            0.0e+00         -0.33
## Severity_Severe             0.0e+00            0.0e+00         -0.33
## Contact_Dont.Know           0.0e+00            0.0e+00          0.00
## Contact_No                  0.0e+00            0.0e+00          0.00
## Contact_Yes                 0.0e+00            0.0e+00          0.00
##                         Severity_Moderate Severity_None Severity_Severe
## Fever                                0.00          0.00            0.00
## Tiredness                            0.00          0.00            0.00
## Dry.Cough                            0.00          0.00            0.00
## Difficulty.in.Breathing              0.00          0.00            0.00
## Sore.Throat                          0.00          0.00            0.00
## None_Sympton                         0.00          0.00            0.00
## Pains                                0.00          0.00            0.00
## Nasal.Congestion                     0.00          0.00            0.00
## Runny.Nose                           0.00          0.00            0.00
## Diarrhea                             0.00          0.00            0.00
## None_Experiencing                    0.00          0.00            0.00
## Age_0.9                              0.00          0.00            0.00
## Age_10.19                            0.00          0.00            0.00
## Age_20.24                            0.00          0.00            0.00
## Age_25.59                            0.00          0.00            0.00
## Age_60.                              0.00          0.00            0.00
## Gender_Female                        0.00          0.00            0.00
## Gender_Male                          0.00          0.00            0.00
## Gender_Transgender                   0.00          0.00            0.00
## Severity_Mild                       -0.33         -0.33           -0.33
## Severity_Moderate                    1.00         -0.33           -0.33
## Severity_None                       -0.33          1.00           -0.33
## Severity_Severe                     -0.33         -0.33            1.00
## Contact_Dont.Know                    0.00          0.00            0.00
## Contact_No                           0.00          0.00            0.00
## Contact_Yes                          0.00          0.00            0.00
##                         Contact_Dont.Know Contact_No Contact_Yes
## Fever                             0.0e+00    0.0e+00     0.0e+00
## Tiredness                         0.0e+00    0.0e+00     0.0e+00
## Dry.Cough                         0.0e+00    0.0e+00     0.0e+00
## Difficulty.in.Breathing           0.0e+00    0.0e+00     0.0e+00
## Sore.Throat                       0.0e+00    0.0e+00     0.0e+00
## None_Sympton                      0.0e+00    0.0e+00     0.0e+00
## Pains                             1.6e-20    1.6e-20     1.6e-20
## Nasal.Congestion                  3.5e-21    3.5e-21     3.5e-21
## Runny.Nose                        3.5e-21    3.5e-21     3.5e-21
## Diarrhea                          1.6e-20    1.6e-20     1.6e-20
## None_Experiencing                 6.1e-21    4.5e-21     6.1e-21
## Age_0.9                           0.0e+00    0.0e+00     0.0e+00
## Age_10.19                         0.0e+00    0.0e+00     0.0e+00
## Age_20.24                         0.0e+00    0.0e+00     0.0e+00
## Age_25.59                         0.0e+00    0.0e+00     0.0e+00
## Age_60.                           0.0e+00    0.0e+00     0.0e+00
## Gender_Female                     0.0e+00    0.0e+00     0.0e+00
## Gender_Male                       0.0e+00    0.0e+00     0.0e+00
## Gender_Transgender                0.0e+00    0.0e+00     0.0e+00
## Severity_Mild                     0.0e+00    0.0e+00     0.0e+00
## Severity_Moderate                 0.0e+00    0.0e+00     0.0e+00
## Severity_None                     0.0e+00    0.0e+00     0.0e+00
## Severity_Severe                   0.0e+00    0.0e+00     0.0e+00
## Contact_Dont.Know                 1.0e+00   -5.0e-01    -5.0e-01
## Contact_No                       -5.0e-01    1.0e+00    -5.0e-01
## Contact_Yes                      -5.0e-01   -5.0e-01     1.0e+00
library(corrplot)
## corrplot 0.92 loaded
corrplot(covid.cor, order = "alphabet", tl.cex=0.6)

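The correlation plot above shows that ‘Difficulty in breathing’ is positively correlated with ‘Dry cough’ (0.38), ‘Sore throat’ is positively correlated with ‘Difficulty in breathing’ (0.40), and ‘Tiredness’ is positively correlated with both ‘Dry cough’ (0.38) and ‘Fever’ (0.40). Apart from that, negative correlations appear within the groups of indicator variables that encode the age interval of the patient (-0.25) and the severity of the disease (-0.33), and likewise within the gender and contact groups (-0.50). This is expected, because the categories within each group are mutually exclusive. The correlations between variables from different groups are zero or numerically indistinguishable from zero.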
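The correlation matrix printed earlier contains -0.25 among the five age indicators, -0.33 among the four severity indicators and -0.5 among the three gender and the three contact indicators. A short sketch of where these numbers come from: for k mutually exclusive, equally frequent categories coded as 0/1 indicators, the Pearson correlation between any two indicators is -1/(k-1). The function name `dummy_cor` and the balanced toy factor are assumptions made for illustration only.

```r
# Hypothetical helper: correlation between two indicator columns of a
# balanced factor with k levels (each level appears `reps` times)
dummy_cor <- function(k, reps = 100) {
  g <- factor(rep(seq_len(k), times = reps))
  X <- model.matrix(~ g - 1)   # one 0/1 column per level, no intercept
  cor(X)[1, 2]                 # correlation between the first two indicators
}

# k = 3 (gender, contact), k = 4 (severity), k = 5 (age)
observed <- sapply(c(3, 4, 5), dummy_cor)
expected <- -1 / (c(3, 4, 5) - 1)
print(round(observed, 2))
print(round(expected, 2))
```

Under these assumptions, the negative values in the matrix follow from the one-hot encoding itself and say nothing about how age, gender, severity or contact relate to the symptoms.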

PCA

A previous modelling analysis showed me that multidimensional scaling (MDS) is not well suited to binary data; therefore PCA, followed by MCA, was used for dimension reduction.
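As a minimal sketch of how a PCA result on binary indicators can be read, the example below runs `prcomp` on a small simulated data frame (not the Covid-19 data). The variable names and the simulated values are assumptions. It computes the share of variance explained by each component and rounds the loadings so that values which are only floating-point noise print as zero.

```r
set.seed(1)

# Simulated toy data: two independent pairs of related binary indicators
a <- rbinom(200, 1, 0.5)
b <- rbinom(200, 1, 0.5)
toy <- data.frame(
  a1 = a,
  a2 = ifelse(runif(200) < 0.9, a, 1 - a),  # agrees with a1 about 90% of the time
  b1 = b,
  b2 = ifelse(runif(200) < 0.9, b, 1 - b)   # agrees with b1 about 90% of the time
)

# PCA on centred and scaled variables, i.e. on the correlation matrix
toy.pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Share of total variance carried by each principal component
var_explained <- toy.pca$sdev^2 / sum(toy.pca$sdev^2)
print(round(var_explained, 3))

# Rounded loadings are easier to read than the full-precision matrix
print(round(toy.pca$rotation, 2))
```

Rounding the `rotation` matrix in this way is a presentation choice only; it does not change the fitted components.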

covid.pca1 <- prcomp(covid, center = TRUE, scale. = TRUE)  # prcomp() is in the stats package
options("scipen"=100, "digits"=4)
covid.pca1$rotation
##                                             PC1                    PC2
## Fever                   -0.21405978143411083514  0.0000000000000605386
## Tiredness               -0.42889968191431682154  0.0000000000000044484
## Dry.Cough               -0.54019740294622264898 -0.0000000000000881574
## Difficulty.in.Breathing -0.42889968191858962587 -0.0000000000001456485
## Sore.Throat             -0.21405978143676643310 -0.0000000000001196803
## None_Sympton             0.49863183963492629935  0.0000000000000754247
## Pains                   -0.00000000000004121897  0.2678090525075103590
## Nasal.Congestion        -0.00000000000007986030  0.5026760082128256535
## Runny.Nose              -0.00000000000008798873  0.5026760082126404683
## Diarrhea                -0.00000000000004543182  0.2678090525074020567
## None_Experiencing        0.00000000000008586092 -0.5926130983405116792
## Age_0.9                 -0.00000000000000553552  0.0000000000000105575
## Age_10.19                0.00000000000000009317 -0.0000000000000013600
## Age_20.24                0.00000000000000009596 -0.0000000000000014433
## Age_25.59                0.00000000000000014216 -0.0000000000000008604
## Age_60.                  0.00000000000000010325 -0.0000000000000012768
## Gender_Female           -0.00000000000000566779 -0.0000000000000038858
## Gender_Male              0.00000000000001584326  0.0000000000000122541
## Gender_Transgender      -0.00000000000000566748 -0.0000000000000033584
## Severity_Mild           -0.00000000000003774828  0.0000000000000004441
## Severity_Moderate        0.00000000000001068735 -0.0000000000000032474
## Severity_None            0.00000000000001385733  0.0000000000000024841
## Severity_Severe          0.00000000000001385733  0.0000000000000023176
## Contact_Dont.Know       -0.00000000000003590453  0.0000000000000024980
## Contact_No               0.00000000000000678649 -0.0000000000000020262
## Contact_Yes              0.00000000000003617388  0.0000000000000005551
##                                           PC3                    PC4
## Fever                   -0.536185408933701257  0.0000000000000083382
## Tiredness               -0.460982870884000140  0.0000000000000235166
## Dry.Cough               -0.000000000001752917  0.0000000000000360036
## Difficulty.in.Breathing  0.460982870882539419  0.0000000000000166206
## Sore.Throat              0.536185408931496799  0.0000000000000057350
## None_Sympton             0.000000000002703865 -0.0000000000000116614
## Pains                    0.000000000000040755 -0.0000000000000291364
## Nasal.Congestion         0.000000000000077265 -0.0000000000000125559
## Runny.Nose               0.000000000000079155  0.0000000000000178902
## Diarrhea                 0.000000000000042363  0.0000000000000240814
## None_Experiencing       -0.000000000000109562 -0.0000000000000005516
## Age_0.9                  0.000000000000011814  0.0000000000000020648
## Age_10.19               -0.000000000000004038 -0.0000000000000211723
## Age_20.24               -0.000000000000004038  0.0000000000000050619
## Age_25.59               -0.000000000000004069  0.0000000000000155813
## Age_60.                 -0.000000000000004377  0.0000000000000080075
## Gender_Female            0.000000000000021496 -0.3064165338893526935
## Gender_Male             -0.000000000000033662  0.0184581787200654088
## Gender_Transgender       0.000000000000021494  0.2879583551692940779
## Severity_Mild            0.000000000000096209  0.0000000000000159421
## Severity_Moderate       -0.000000000000006684 -0.0000000000000391336
## Severity_None           -0.000000000000030819  0.0000000000000297245
## Severity_Severe         -0.000000000000030819 -0.0000000000000061843
## Contact_Dont.Know        0.000000000000005103 -0.7299113064428662812
## Contact_No              -0.000000000000039856  0.2561153661579838037
## Contact_Yes              0.000000000000041218  0.4737959402848411217
##                                            PC5                    PC6
## Fever                    0.0000000000000000000  0.0000000000000328129
## Tiredness               -0.0000000000000055352  0.0000000000000417560
## Dry.Cough               -0.0000000000000170692  0.0000000000000038874
## Difficulty.in.Breathing -0.0000000000000149465 -0.0000000000000280029
## Sore.Throat             -0.0000000000000100344 -0.0000000000000380875
## None_Sympton             0.0000000000000074794 -0.0000000000000011114
## Pains                    0.0000000000000164775 -0.0000000000000147937
## Nasal.Congestion         0.0000000000000082825 -0.0000000000000019498
## Runny.Nose              -0.0000000000000079014  0.0000000000000095618
## Diarrhea                -0.0000000000000127642  0.0000000000000107692
## None_Experiencing       -0.0000000000000010292 -0.0000000000000006314
## Age_0.9                 -0.0000000000000010714 -0.0000000000000040367
## Age_10.19                0.0000000000000017115 -0.0000000000000157652
## Age_20.24               -0.0000000000000010924  0.0000000000000073067
## Age_25.59                0.0000000000000013746  0.0000000000000133990
## Age_60.                 -0.0000000000000079681  0.0000000000000032752
## Gender_Female           -0.5276948015111558865  0.2374185122007645421
## Gender_Male             -0.1938153091677393136 -0.3237629746918117557
## Gender_Transgender       0.7215101106789235663  0.0863444624910671282
## Severity_Mild            0.0000000000000031955 -0.0000000000000084238
## Severity_Moderate       -0.0000000000000042374  0.0000000000000084516
## Severity_None            0.0000000000000019242 -0.0000000000000060334
## Severity_Severe         -0.0000000000000004056  0.0000000000000060160
## Contact_Dont.Know        0.3282400019238625455  0.0782797113003221579
## Contact_No              -0.1342474861369017392 -0.6802962524358183716
## Contact_Yes             -0.1939925157869437922  0.6020165411354851948
##                                            PC7                      PC8
## Fever                    0.0000000000000000000 -0.000000000000000002665
## Tiredness                0.0000000000000030003  0.000000000000000804932
## Dry.Cough               -0.0000000000000020311  0.000000000000002523496
## Difficulty.in.Breathing -0.0000000000000181243  0.000000000000001745773
## Sore.Throat             -0.0000000000000133986  0.000000000000002266489
## None_Sympton             0.0000000000000102454 -0.000000000000005013272
## Pains                    0.0000000000000126881  0.613584431441651201311
## Nasal.Congestion         0.0000000000000093593  0.351445793106101023184
## Runny.Nose               0.0000000000000031778 -0.351445793106191450850
## Diarrhea                -0.0000000000000040275 -0.613584431442331323936
## None_Experiencing       -0.0000000000000069148 -0.000000000000162814207
## Age_0.9                  0.0000000000000064518 -0.000000000000000648787
## Age_10.19               -0.0000000000000011954  0.000000000000001415534
## Age_20.24               -0.0000000000000104853  0.000000000000001207368
## Age_25.59                0.0000000000000007026  0.000000000000000388578
## Age_60.                 -0.0000000000000092528  0.000000000000000846545
## Gender_Female            0.4877973154378615894 -0.000000000000003719247
## Gender_Male             -0.7238363934314839820  0.000000000000006571133
## Gender_Transgender       0.2360390779936666628 -0.000000000000007341350
## Severity_Mild           -0.0000000000000009466  0.000000000000008243406
## Severity_Moderate       -0.0000000000000014141  0.000000000000028310687
## Severity_None            0.0000000000000002835 -0.000000000000006120104
## Severity_Severe          0.0000000000000025449 -0.000000000000031037672
## Contact_Dont.Know       -0.1415165690981745050 -0.000000000000037136960
## Contact_No               0.3467653490072623579 -0.000000000000004510281
## Contact_Yes             -0.2052487799090883247  0.000000000000038496983
##                                           PC9                   PC10
## Fever                    0.000000000000011048  0.0000000000000250244
## Tiredness                0.000000000000007409  0.0000000000000202558
## Dry.Cough               -0.000000000000004346  0.0000000000000066174
## Difficulty.in.Breathing -0.000000000000006223 -0.0000000000000656811
## Sore.Throat              0.000000000000006113 -0.0000000000000846728
## None_Sympton             0.000000000000001521  0.0000000000000271320
## Pains                   -0.000000000000009034 -0.0000000000000005829
## Nasal.Congestion        -0.000000000000008382 -0.0000000000000018596
## Runny.Nose               0.000000000000009194  0.0000000000000009021
## Diarrhea                 0.000000000000016823  0.0000000000000021753
## None_Experiencing       -0.000000000000002134  0.0000000000000008743
## Age_0.9                 -0.000000000000006366 -0.0000000000000647113
## Age_10.19                0.000000000000006762 -0.0000000000000334247
## Age_20.24               -0.000000000000001599  0.0000000000000785344
## Age_25.59                0.000000000000001183 -0.0000000000000103251
## Age_60.                 -0.000000000000001079  0.0000000000000448461
## Gender_Female           -0.000000000000019772  0.0000000000000082434
## Gender_Male              0.000000000000002999 -0.0000000000000025639
## Gender_Transgender       0.000000000000016886 -0.0000000000000056899
## Severity_Mild           -0.064342457660595959  0.8385063838452186946
## Severity_Moderate        0.708725691715895278 -0.3346287630444449435
## Severity_None           -0.700313995285159807 -0.4222096341228042116
## Severity_Severe          0.055930761229906575 -0.0816679866780549713
## Contact_Dont.Know       -0.000000000000035503  0.0000000000000098740
## Contact_No               0.000000000000020451 -0.0000000000000082469
## Contact_Yes              0.000000000000016084 -0.0000000000000018457
##                                            PC11                   PC12
## Fever                    0.00000000000001473314 -0.0000000000000157343
## Tiredness                0.00000000000001091327  0.0000000000000023893
## Dry.Cough               -0.00000000000000090200  0.0000000000000089824
## Difficulty.in.Breathing -0.00000000000002363464 -0.0000000000000014412
## Sore.Throat             -0.00000000000002113663  0.0000000000000047328
## None_Sympton             0.00000000000000907006 -0.0000000000000007347
## Pains                   -0.00000000000002491063 -0.0000000000000003886
## Nasal.Congestion        -0.00000000000000850708  0.0000000000000049440
## Runny.Nose               0.00000000000001418657  0.0000000000000025882
## Diarrhea                 0.00000000000002247681  0.0000000000000019238
## None_Experiencing        0.00000000000000003816 -0.0000000000000043368
## Age_0.9                 -0.00000000000002264204 -0.5761004231848011470
## Age_10.19               -0.00000000000000953404 -0.4784341110756626381
## Age_20.24                0.00000000000002283243  0.3310718718876559663
## Age_25.59               -0.00000000000000075981  0.5459884985979381877
## Age_60.                  0.00000000000001405473  0.1774741637791431570
## Gender_Female            0.00000000000000351802  0.0000000000000054436
## Gender_Male             -0.00000000000000366200  0.0000000000000009350
## Gender_Transgender       0.00000000000000036776 -0.0000000000000041425
## Severity_Mild            0.20680206090118474771 -0.0000000000000739547
## Severity_Moderate        0.36841754144612937072  0.0000000000000170627
## Severity_None            0.28513037870706497978  0.0000000000000267685
## Severity_Severe         -0.86034998105432525239  0.0000000000000298372
## Contact_Dont.Know        0.00000000000000202963  0.0000000000000131180
## Contact_No              -0.00000000000000323006  0.0000000000000114023
## Contact_Yes              0.00000000000000151962 -0.0000000000000232661
##                                           PC13                   PC14
## Fever                   -0.0000000000000050658 -0.0000000000000178271
## Tiredness                0.0000000000000006295  0.0000000000000027374
## Dry.Cough                0.0000000000000028276  0.0000000000000102126
## Difficulty.in.Breathing -0.0000000000000003565 -0.0000000000000016880
## Sore.Throat              0.0000000000000016750  0.0000000000000052746
## None_Sympton            -0.0000000000000001954 -0.0000000000000007889
## Pains                   -0.0000000000000004233 -0.0000000000000010408
## Nasal.Congestion         0.0000000000000015899  0.0000000000000057905
## Runny.Nose               0.0000000000000011033  0.0000000000000033411
## Diarrhea                 0.0000000000000011948  0.0000000000000027790
## None_Experiencing       -0.0000000000000016046 -0.0000000000000049856
## Age_0.9                 -0.1845713023717037438 -0.6526481777607980872
## Age_10.19               -0.0977535710873589925  0.7433487541608907456
## Age_20.24                0.1492120919401024037 -0.1390165838393611075
## Age_25.59               -0.6136104707521466972  0.0464083619546589443
## Age_60.                  0.7467232522713272980  0.0019076454808065191
## Gender_Female            0.0000000000000009480  0.0000000000000014745
## Gender_Male             -0.0000000000000099486 -0.0000000000000036152
## Gender_Transgender       0.0000000000000072277  0.0000000000000065538
## Severity_Mild           -0.0000000000000586944 -0.0000000000000057038
## Severity_Moderate        0.0000000000000172883 -0.0000000000000065711
## Severity_None            0.0000000000000229894  0.0000000000000077091
## Severity_Severe          0.0000000000000235840  0.0000000000000035041
## Contact_Dont.Know        0.0000000000000004337 -0.0000000000000122784
## Contact_No               0.0000000000000012473  0.0000000000000001006
## Contact_Yes             -0.0000000000000012646  0.0000000000000143219
##                                            PC15                  PC16
## Fever                   -0.00000000000000244892 -0.488781361520113422
## Tiredness                0.00000000000000045823  0.111383616625499254
## Dry.Cough                0.00000000000000146689  0.580361264593369341
## Difficulty.in.Breathing -0.00000000000000032739  0.111383616624441614
## Sore.Throat              0.00000000000000057455 -0.488781361521033519
## None_Sympton            -0.00000000000000007854  0.400691581786861373
## Pains                   -0.00000000000000056205 -0.000000000000010540
## Nasal.Congestion         0.00000000000000026194  0.000000000000005588
## Runny.Nose               0.00000000000000150531  0.000000000000008024
## Diarrhea                 0.00000000000000024568 -0.000000000000014163
## None_Experiencing       -0.00000000000000094456 -0.000000000000012753
## Age_0.9                 -0.08995606045526935435  0.000000000000016707
## Age_10.19                0.09475056930898999308 -0.000000000000001483
## Age_20.24                0.80548218892901113364 -0.000000000000001482
## Age_25.59               -0.35103420563199405624 -0.000000000000001413
## Age_60.                 -0.45924249215140583447 -0.000000000000001285
## Gender_Female            0.00000000000000064272 -0.000000000000001338
## Gender_Male             -0.00000000000000486828  0.000000000000008359
## Gender_Transgender       0.00000000000000009801 -0.000000000000001339
## Severity_Mild           -0.00000000000004526848 -0.000000000000035937
## Severity_Moderate        0.00000000000001030209  0.000000000000016663
## Severity_None            0.00000000000001546940  0.000000000000006128
## Severity_Severe          0.00000000000001392831  0.000000000000006128
## Contact_Dont.Know       -0.00000000000000776245  0.000000000000012295
## Contact_No               0.00000000000000202225 -0.000000000000002629
## Contact_Yes              0.00000000000000283714 -0.000000000000009163
##                                            PC17                   PC18
## Fever                   -0.00000000000001744464  0.3903340329286217325
## Tiredness               -0.00000000000000595821  0.2269288384815393855
## Dry.Cough               -0.00000000000002545808  0.0397382035935744116
## Difficulty.in.Breathing  0.00000000000000062004  0.2269288384846792350
## Sore.Throat             -0.00000000000001278646  0.3903340329243994988
## None_Sympton            -0.00000000000003389717  0.7685741117050595150
## Pains                   -0.49166360000560599408  0.0000000000000009229
## Nasal.Congestion         0.45336347670304943946  0.0000000000000117792
## Runny.Nose               0.45336347670359111728  0.0000000000000361888
## Diarrhea                -0.49166360000552394860 -0.0000000000000167910
## None_Experiencing        0.32474131988721954833  0.0000000000000324174
## Age_0.9                  0.00000000000000194983  0.0000000000000006170
## Age_10.19               -0.00000000000000059674  0.0000000000000080583
## Age_20.24               -0.00000000000000056899  0.0000000000000080598
## Age_25.59               -0.00000000000000009714  0.0000000000000080772
## Age_60.                  0.00000000000000002776  0.0000000000000079726
## Gender_Female            0.00000000000000204003 -0.0000000000000008905
## Gender_Male              0.00000000000000312250 -0.0000000000000028569
## Gender_Transgender       0.00000000000000224820 -0.0000000000000008910
## Severity_Mild            0.00000000000000133227  0.0000000000000110916
## Severity_Moderate        0.00000000000000263678 -0.0000000000000099553
## Severity_None           -0.00000000000000290046  0.0000000000000002970
## Severity_Severe          0.00000000000000491274  0.0000000000000002970
## Contact_Dont.Know        0.00000000000000367761 -0.0000000000000023336
## Contact_No               0.00000000000000238698 -0.0000000000000102085
## Contact_Yes             -0.00000000000000582867 -0.0000000000000129348
##                                           PC19                    PC20
## Fever                   -0.4609828708800784991  0.00000000000018514121
## Tiredness                0.5361854089309092686 -0.00000000000015723211
## Dry.Cough                0.0000000000042169115 -0.00000000000003973745
## Difficulty.in.Breathing -0.5361854089342895646  0.00000000000023115821
## Sore.Throat              0.4609828708864614488 -0.00000000000017281484
## None_Sympton             0.0000000000022613855  0.00000000000000889718
## Pains                   -0.0000000000001756315 -0.43190880498485145766
## Nasal.Congestion        -0.0000000000000499215 -0.20439762415453949229
## Runny.Nose              -0.0000000000000987951 -0.20439762414852172268
## Diarrhea                -0.0000000000001467525 -0.43190880498759387507
## None_Experiencing       -0.0000000000002762506 -0.73712467794340263971
## Age_0.9                  0.0000000000000153425  0.00000000000000064185
## Age_10.19               -0.0000000000000047256 -0.00000000000000123512
## Age_20.24               -0.0000000000000047234 -0.00000000000000141553
## Age_25.59               -0.0000000000000047344 -0.00000000000000111022
## Age_60.                 -0.0000000000000046711 -0.00000000000000099920
## Gender_Female           -0.0000000000000044903 -0.00000000000000002776
## Gender_Male              0.0000000000000033204 -0.00000000000000136696
## Gender_Transgender      -0.0000000000000044904  0.00000000000000001388
## Severity_Mild            0.0000000000000049358  0.00000000000000358047
## Severity_Moderate       -0.0000000000000046872  0.00000000000000378864
## Severity_None            0.0000000000000020329  0.00000000000000078410
## Severity_Severe          0.0000000000000020329  0.00000000000000256739
## Contact_Dont.Know       -0.0000000000000005298  0.00000000000000144329
## Contact_No              -0.0000000000000018665  0.00000000000000063838
## Contact_Yes             -0.0000000000000052411 -0.00000000000000111022
##                                            PC21                  PC22
## Fever                   -0.00000000000002865428 -0.250819722793808153
## Tiredness                0.00000000000002077141  0.502137486220437301
## Dry.Cough               -0.00000000000001112701 -0.608102329863072444
## Difficulty.in.Breathing -0.00000000000002281610  0.502137486213335205
## Sore.Throat              0.00000000000000771039 -0.250819722789301425
## None_Sympton            -0.00000000000001577137 -0.010314048875738085
## Pains                    0.35144579310832746444  0.000000000000032715
## Nasal.Congestion        -0.61358443144113872236  0.000000000000011233
## Runny.Nose               0.61358443144284258164 -0.000000000000002998
## Diarrhea                -0.35144579310396539817  0.000000000000039611
## None_Experiencing        0.00000000000328935490  0.000000000000034161
## Age_0.9                 -0.00000000000000006418 -0.000000000000002836
## Age_10.19                0.00000000000000349720  0.000000000000001036
## Age_20.24                0.00000000000000380251  0.000000000000001036
## Age_25.59                0.00000000000000409395  0.000000000000001014
## Age_60.                  0.00000000000000421885  0.000000000000001038
## Gender_Female           -0.00000000000000107553 -0.000000000000001914
## Gender_Male              0.00000000000000027756 -0.000000000000001518
## Gender_Transgender      -0.00000000000000131145 -0.000000000000001914
## Severity_Mild            0.00000000000000099920  0.000000000000009229
## Severity_Moderate        0.00000000000000108247 -0.000000000000003488
## Severity_None            0.00000000000000359435 -0.000000000000006431
## Severity_Severe         -0.00000000000000092287 -0.000000000000006431
## Contact_Dont.Know        0.00000000000000403844 -0.000000000000006467
## Contact_No               0.00000000000000180411  0.000000000000005790
## Contact_Yes              0.00000000000000532907 -0.000000000000002278
##                                          PC23                    PC24
## Fever                    0.000000000000005495  0.00000000000000604140
## Tiredness                0.000000000000005284  0.00000000000000819646
## Dry.Cough                0.000000000000004330 -0.00000000000000005413
## Difficulty.in.Breathing  0.000000000000005007 -0.00000000000000315429
## Sore.Throat              0.000000000000001815 -0.00000000000001099775
## None_Sympton             0.000000000000012104  0.00000000000000087149
## Pains                    0.000000000000004274  0.00000000000000283107
## Nasal.Congestion        -0.000000000000001651  0.00000000000000066613
## Runny.Nose               0.000000000000005468 -0.00000000000000255351
## Diarrhea                -0.000000000000001846  0.00000000000000408701
## None_Experiencing       -0.000000000000000111  0.00000000000000352496
## Age_0.9                 -0.447213508214742483 -0.00021356787907845197
## Age_10.19               -0.447213508216977140 -0.00021356787907871044
## Age_20.24               -0.447213508214116817 -0.00021356787907625407
## Age_25.59               -0.447213508213751387 -0.00021356787907753083
## Age_60.                 -0.447213508214213573 -0.00021356787908130559
## Gender_Female            0.000211405624445354  0.03332673349230962784
## Gender_Male              0.000211405624451870  0.03332673349231155685
## Gender_Transgender       0.000211405624445341  0.03332673349230839271
## Severity_Mild           -0.000251187242563372  0.49877764549716285813
## Severity_Moderate       -0.000251187242571380  0.49877764549709063813
## Severity_None           -0.000251187242571359  0.49877764549712144682
## Severity_Severe         -0.000251187242571484  0.49877764549714942444
## Contact_Dont.Know        0.000036028205324973  0.02273909890735868744
## Contact_No               0.000036028205326000  0.02273909890735935357
## Contact_Yes              0.000036028205332855  0.02273909890736006134
##                                           PC25                   PC26
## Fever                    0.0000000000000043200  0.0000000000000054800
## Tiredness                0.0000000000000080508  0.0000000000000086093
## Dry.Cough               -0.0000000000000020127  0.0000000000000018950
## Difficulty.in.Breathing -0.0000000000000008553  0.0000000000000020854
## Sore.Throat              0.0000000000000026839  0.0000000000000064849
## None_Sympton             0.0000000000000007144  0.0000000000000091043
## Pains                    0.0000000000000023315 -0.0000000000000018874
## Nasal.Congestion        -0.0000000000000027339  0.0000000000000047046
## Runny.Nose              -0.0000000000000041356 -0.0000000000000040246
## Diarrhea                -0.0000000000000011935  0.0000000000000012837
## None_Experiencing       -0.0000000000000003469  0.0000000000000006523
## Age_0.9                  0.0001798737341731204  0.0000102199423333248
## Age_10.19                0.0001798737341707352  0.0000102199423324262
## Age_20.24                0.0001798737341728585  0.0000102199423332450
## Age_25.59                0.0001798737341681955  0.0000102199423319266
## Age_60.                  0.0001798737341699996  0.0000102199423316213
## Gender_Female            0.5700301824464704126 -0.0853710054747235797
## Gender_Male              0.5700301824465009437 -0.0853710054747276320
## Gender_Transgender       0.5700301824464559797 -0.0853710054747208041
## Severity_Mild           -0.0314427698329692307 -0.0152364524262884882
## Severity_Moderate       -0.0314427698329646926 -0.0152364524262871004
## Severity_None           -0.0314427698329665176 -0.0152364524262873433
## Severity_Severe         -0.0314427698329686131 -0.0152364524262883771
## Contact_Dont.Know        0.0841467377625422508  0.5707325047165981990
## Contact_No               0.0841467377625457480  0.5707325047166211807
## Contact_Yes              0.0841467377625474966  0.5707325047166350585
summary(covid.pca1)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8
## Standard deviation     1.3823 1.3430 1.2757 1.2247 1.2247 1.2247 1.2247 1.1815
## Proportion of Variance 0.0735 0.0694 0.0626 0.0577 0.0577 0.0577 0.0577 0.0537
## Cumulative Proportion  0.0735 0.1429 0.2055 0.2631 0.3208 0.3785 0.4362 0.4899
##                           PC9   PC10   PC11   PC12   PC13   PC14   PC15  PC16
## Standard deviation     1.1547 1.1547 1.1547 1.1180 1.1180 1.1180 1.1180 0.926
## Proportion of Variance 0.0513 0.0513 0.0513 0.0481 0.0481 0.0481 0.0481 0.033
## Cumulative Proportion  0.5412 0.5925 0.6438 0.6918 0.7399 0.7880 0.8361 0.869
##                          PC17   PC18   PC19   PC20   PC21   PC22
## Standard deviation     0.8698 0.8097 0.7323 0.7265 0.7183 0.6426
## Proportion of Variance 0.0291 0.0252 0.0206 0.0203 0.0198 0.0159
## Cumulative Proportion  0.8981 0.9234 0.9440 0.9643 0.9841 1.0000
##                                    PC23               PC24              PC25
## Standard deviation     0.00000000000288 0.0000000000000671 0.000000000000042
## Proportion of Variance 0.00000000000000 0.0000000000000000 0.000000000000000
## Cumulative Proportion  1.00000000000000 1.0000000000000000 1.000000000000000
##                                      PC26
## Standard deviation     0.0000000000000356
## Proportion of Variance 0.0000000000000000
## Cumulative Proportion  1.0000000000000000

PCA is also not the best way to reduce the dimensions of the given dataset: the first two principal components explain only 14% of the variance, and the 95% level is reached only with the 20th component, which is a rather poor outcome.
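The two figures quoted above can be rechecked by hand. The sketch below is only an illustration: it assumes the eigenvalues reported by get_eigenvalue() in this section, rounded to four decimals and typed in manually, and the names `eig`, `cum_pct` and `n_95` are introduced here for the example.

```r
# Eigenvalues of the 26 components, rounded from the get_eigenvalue() output
# (the last four are numerically zero).
eig <- c(1.9108, 1.8037, 1.6273, rep(1.5, 4), 1.3960, rep(1.3333, 3),
         rep(1.25, 4), 0.8571, 0.7566, 0.6556, 0.5363, 0.5278, 0.5159,
         0.4129, rep(0, 4))

# Cumulative share of variance, in percent
cum_pct <- 100 * cumsum(eig) / sum(eig)

# Share explained by the first two components
round(cum_pct[2], 1)

# First component at which the cumulative share reaches 95%
n_95 <- which(cum_pct >= 95)[1]
n_95
```

With these rounded inputs the first two components cover about 14.3% of the variance and the 95% threshold is crossed at component 20.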

library(ggplot2)
library(factoextra)
fviz_eig(covid.pca1, choice='eigenvalue')   # eigenvalues on y-axis

fviz_eig(covid.pca1)

The scree plots present the eigenvalues and the percentage of explained variance.

eig.val<-get_eigenvalue(covid.pca1)
eig.val
##                              eigenvalue                 variance.percent
## Dim.1  1.910817285867398629406466170622 7.349297253337395474659388128202
## Dim.2  1.803730751615455263348053449590 6.937425967752927746801105968188
## Dim.3  1.627348479587384932898430633941 6.259032613798695798834614834050
## Dim.4  1.500000000000014432899320127035 5.769230769231803002128344814992
## Dim.5  1.500000000000012212453270876722 5.769230769231794120344147813739
## Dim.6  1.499999999999937827510620991234 5.769230769231508126893004373414
## Dim.7  1.499999999999932276395497865451 5.769230769231487698789351270534
## Dim.8  1.395960175667761760109897295479 5.369077598723070998687489918666
## Dim.9  1.333333333333392101138770158286 5.128205128206223761822002416011
## Dim.10 1.333333333333301951029170595575 5.128205128205876484059899667045
## Dim.11 1.333333333333276193854999291943 5.128205128205778784433732653270
## Dim.12 1.250000000002040811963865962753 4.807692307700972556006036029430
## Dim.13 1.250000000000484279283341493283 4.807692307694985345278837485239
## Dim.14 1.249999999998756772257024749706 4.807692307688341770699480548501
## Dim.15 1.249999999998430144643180028652 4.807692307687084998235604871297
## Dim.16 0.857100054855390425245786900632 3.296538672521291513106689308188
## Dim.17 0.756608521367482578234842094389 2.910032774490811213752294861479
## Dim.18 0.655574438451520746795608829416 2.521440147890892102822135711904
## Dim.19 0.536287884047667695597283454845 2.062645707875994549596043725614
## Dim.20 0.527755965111434854897254354000 2.029830635044324260718440200435
## Dim.21 0.515944586236422697567149953102 1.984402254755808447939102734381
## Dim.22 0.412871857188091051504841288988 1.587968681492927247589364014857
## Dim.23 0.000000000000000000000008304858 0.000000000000000000000031941762
## Dim.24 0.000000000000000000000000004500 0.000000000000000000000000017308
## Dim.25 0.000000000000000000000000001764 0.000000000000000000000000006786
## Dim.26 0.000000000000000000000000001269 0.000000000000000000000000004882
##        cumulative.variance.percent
## Dim.1                        7.349
## Dim.2                       14.287
## Dim.3                       20.546
## Dim.4                       26.315
## Dim.5                       32.084
## Dim.6                       37.853
## Dim.7                       43.623
## Dim.8                       48.992
## Dim.9                       54.120
## Dim.10                      59.248
## Dim.11                      64.376
## Dim.12                      69.184
## Dim.13                      73.992
## Dim.14                      78.799
## Dim.15                      83.607
## Dim.16                      86.904
## Dim.17                      89.814
## Dim.18                      92.335
## Dim.19                      94.398
## Dim.20                      96.428
## Dim.21                      98.412
## Dim.22                     100.000
## Dim.23                     100.000
## Dim.24                     100.000
## Dim.25                     100.000
## Dim.26                     100.000
library(gridExtra)
var<-get_pca_var(covid.pca1)
a<-fviz_contrib(covid.pca1, "var", axes=1, xtickslab.rt=90) # default angle=45°
b<-fviz_contrib(covid.pca1, "var", axes=2, xtickslab.rt=90)
grid.arrange(a,b,top='Contribution to the first two Principal Components')

In the first plot “Contribution of variables to Dim-1” it can be seen that the most important variables are ‘Dry cough’, ‘None symptom’, ‘Difficulty in breathing’, and ‘Tiredness’.

From the second plot “Contribution of variables to Dim-2” it can be seen that the most important variables are ‘None Experiencing’, ‘Nasal Congestion’, and ‘Runny nose’ symptoms.
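The contribution plots can be reproduced numerically. The sketch below assumes that the PCA object was created with prcomp() (this is not shown in this section) and uses toy binary data instead of the Covid-19 file; `toy`, `pca_toy`, `contrib` and `expected_avg` are hypothetical names. Under that assumption, the contribution of a variable to a component is its squared loading expressed in percent.

```r
set.seed(1)  # arbitrary seed, only for a repeatable toy example

# Toy stand-in for the symptom indicators: 200 rows, 5 binary columns
toy <- as.data.frame(matrix(rbinom(200 * 5, 1, 0.4), ncol = 5))

pca_toy <- prcomp(toy, center = TRUE, scale. = TRUE)

# Loadings are unit-length vectors, so squared loadings sum to 1 per component
contrib <- 100 * pca_toy$rotation^2

# Contributions to the first component, largest first
round(sort(contrib[, "PC1"], decreasing = TRUE), 1)

# Contribution each variable would have if all contributed equally
expected_avg <- 100 / ncol(toy)
expected_avg
```

Variables whose contribution exceeds `expected_avg` are the ones usually read as important for a given component.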

MCA

The Multiple Correspondence Analysis (MCA) procedure was carried out because it is the proper way to reduce dimensions for categorical data.

According to Wikipedia: “(…), multiple correspondence analysis (MCA) is a data analysis technique for nominal categorical data, used to detect and represent underlying structures in a data set. It does this by representing data as points in a low-dimensional Euclidean space. The procedure thus appears to be the counterpart of principal component analysis for categorical data. MCA can be viewed as an extension of simple correspondence analysis (CA) in that it is applicable to a large set of categorical variables.”
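Because MCA works on categorical columns, the 0/1 indicators are recoded into labelled categories before the analysis. A compact way to express such a recoding is sketched here on toy data; the helper `to_factor` and the data frame `toy` are hypothetical, and the generated labels ("No Fever" and so on) are simpler than the hand-written labels used in this paper.

```r
# Toy 0/1 indicators standing in for the real symptom columns
toy <- data.frame(Fever = c(1, 0, 1, 0), Pains = c(0, 0, 1, 1))

# Turn one 0/1 column into a two-level factor named after the column
to_factor <- function(x, name) {
  factor(ifelse(x == 1, name, paste("No", name)))
}

# Apply the helper to every column, passing the column name alongside
toy_cat <- as.data.frame(Map(to_factor, toy, names(toy)),
                         stringsAsFactors = TRUE)
str(toy_cat)
```

The explicit column-by-column version keeps full control over each label, which matters when the labels appear on the MCA plots.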

df <- read.csv("cleaned_data_ul_covid19.csv", sep=",", dec=".", header=T)
covid <- df[,1:26]
covid$Fever <- ifelse(covid$Fever==1, "Fever", "No fever")
covid$Tiredness <- ifelse(covid$Tiredness==1, "Tiredness", "No tiredness")
covid$Dry.Cough <- ifelse(covid$Dry.Cough==1, "Dry cough", "No dry cough")
covid$Difficulty.in.Breathing <- ifelse(covid$Difficulty.in.Breathing==1, "Difficulties in breathing", "No difficulties in breathing")
covid$Sore.Throat <- ifelse(covid$Sore.Throat==1, "Sore throat", "No sore throat")
covid$None_Sympton <- ifelse(covid$None_Sympton==1, "None symptom", "Symptom")
covid$Pains <- ifelse(covid$Pains==1, "Pains", "No pains")
covid$Nasal.Congestion <- ifelse(covid$Nasal.Congestion==1, "Nasal congestion", "No nasal congestion")
covid$Runny.Nose <- ifelse(covid$Runny.Nose==1, "Runny nose", "No runny nose")
covid$Diarrhea <- ifelse(covid$Diarrhea==1, "Diarrhea", "No diarrhea")
covid$None_Experiencing <- ifelse(covid$None_Experiencing==1, "None experiencing", "Experiencing")
covid$Age_0.9 <- ifelse(covid$Age_0.9==1, "In age between 0 and 9 y.o.", "Not in age between 0 and 9 y.o.")
covid$Age_10.19 <- ifelse(covid$Age_10.19==1, "In age between 10 and 19 y.o.", "Not in age between 10 and 19 y.o.")
covid$Age_20.24 <- ifelse(covid$Age_20.24==1, "In age between 20 and 24 y.o.", "Not in age between 20 and 24 y.o.")
covid$Age_25.59 <- ifelse(covid$Age_25.59==1, "In age between 25 and 59 y.o.", "Not in age between 25 and 59 y.o.")
covid$Age_60. <- ifelse(covid$Age_60.==1, "60 years old and older", "Not 60 years old or older")
covid$Gender_Female <- ifelse(covid$Gender_Female==1, "Female", "Not female")
covid$Gender_Male <- ifelse(covid$Gender_Male==1, "Male", "Not male")
covid$Gender_Transgender <- ifelse(covid$Gender_Transgender==1, "Transgender", "Not transgender")
covid$Severity_Mild <- ifelse(covid$Severity_Mild==1, "Severity mild", "Not severity mild")
covid$Severity_Moderate <- ifelse(covid$Severity_Moderate==1, "Severity moderate", "Not severity moderate")
covid$Severity_Severe <- ifelse(covid$Severity_Severe==1, "Severity severe", "Not severity severe")
covid$Severity_None <- ifelse(covid$Severity_None==1, "Severity none", "Not severity none")
covid$Contact_Dont.Know <- ifelse(covid$Contact_Dont.Know==1, "Doesn't know about the contact", "Knows about the contact")
covid$Contact_No <- ifelse(covid$Contact_No==1, "No contact", "Contact")
covid$Contact_Yes <- ifelse(covid$Contact_Yes==1, "Having contact with ill person", "Not having contact with ill person")
library(FactoMineR)
mca <- MCA(covid, graph=F)
library(factoextra)
fviz_screeplot(mca, addlabels = T)

fviz_contrib(mca, choice = "var", axes =1:2)

The percentage of explained variance is the same as in the PCA; however, the most important variables turned out to be ‘None experiencing’, ‘None symptom’, and ‘No dry cough’, which is a very interesting outcome.
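One possible explanation for the identical variance shares, offered here as an assumption rather than something checked on the Covid-19 data, is that every variable is binary. In the indicator-matrix formulation of MCA, the eigenvalues for K binary variables equal the eigenvalues of their correlation matrix divided by K, so the percentages match those of a standardised PCA. The base-R sketch below illustrates this on toy data; all object names are hypothetical.

```r
set.seed(1)  # arbitrary seed for a repeatable toy example
n <- 200
x <- data.frame(a = rbinom(n, 1, 0.3),
                b = rbinom(n, 1, 0.5),
                c = rbinom(n, 1, 0.6))
K <- ncol(x)

# Indicator matrix: one column per category ("yes" and "no" for each variable)
Z <- cbind(as.matrix(x), 1 - as.matrix(x))

# Correspondence analysis of the indicator matrix via standardised residuals
P <- Z / sum(Z)
r_mass <- rowSums(P)
c_mass <- colSums(P)
S <- (P - r_mass %o% c_mass) / sqrt(r_mass %o% c_mass)
mca_eig <- svd(S)$d^2        # the first K values are the non-trivial ones

# Standardised PCA eigenvalues (eigenvalues of the correlation matrix)
pca_eig <- eigen(cor(x))$values

# Shares of variance by both routes
round(100 * mca_eig[1:K] / sum(mca_eig[1:K]), 2)
round(100 * pca_eig / sum(pca_eig), 2)
```

Both routes give the same percentages on the toy data, which is consistent with the scree plots of the two methods looking alike.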

fviz_mca_var(mca, col.var = "contrib",
             gradient.cols = c("darkgreen", "yellow", "darkred"), 
             repel = TRUE, 
             ggtheme = theme_minimal()
)

The above chart presents the MCA results. Variables with the largest contribution to the first two dimensions are shown in a reddish colour; the greener the colour, the less the variable contributes.

Clustering

The hierarchical clustering method was applied to the dataset. However, the computational limits of my computer did not allow me to cluster all 316 800 observations, so I ran a function which randomly chooses 10 000 observations from the whole dataset. All further operations are done on this sample of the initial dataset.

covid_clust <- df[,1:26]
sampl <- covid_clust[sample(nrow(df), 10000), ]
dm<-dist(t(sampl))
hc<-hclust(dm, method="complete") 
plot(hc)

plot(density(dm))

The hierarchical tree is not very clear. The density plot indicates that the density of the distances peaks at a distance of around 60.
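Since the sample is transposed before dist() is called, the distances are measured between variables, not between patients. For 0/1 data the squared Euclidean distance between two columns is simply the number of rows in which they differ. The sketch below shows this on toy data; the seed and all object names are hypothetical.

```r
set.seed(42)  # hypothetical seed; fixing one also makes a random sample repeatable

# Toy binary data: 1000 rows (patients), 4 columns (symptoms)
toy <- matrix(rbinom(1000 * 4, 1, 0.5), ncol = 4,
              dimnames = list(NULL, c("A", "B", "C", "D")))

# Transposing makes the columns the clustered units
d <- dist(t(toy))

# Distance between columns A and B, and the number of rows where they differ
d_AB <- as.matrix(d)["A", "B"]
mismatch_AB <- sum(toy[, "A"] != toy[, "B"])

c(squared_distance = d_AB^2, mismatches = mismatch_AB)
```

Read this way, a distance of about 60 in a sample of 10 000 rows corresponds to two variables disagreeing in roughly 3 600 rows.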

# cutting by distance between units
clust<-cutree(hc, h=60)
summary(clust)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    4.00    8.50    8.31   11.75   18.00
clust
##                   Fever               Tiredness               Dry.Cough 
##                       1                       1                       2 
## Difficulty.in.Breathing             Sore.Throat            None_Sympton 
##                       2                       3                       4 
##                   Pains        Nasal.Congestion              Runny.Nose 
##                       5                       6                       7 
##                Diarrhea       None_Experiencing                 Age_0.9 
##                       8                       4                       9 
##               Age_10.19               Age_20.24               Age_25.59 
##                      10                      11                      12 
##                 Age_60.           Gender_Female             Gender_Male 
##                       4                      13                      14 
##      Gender_Transgender           Severity_Mild       Severity_Moderate 
##                      15                      10                       4 
##           Severity_None         Severity_Severe       Contact_Dont.Know 
##                      11                       9                      16 
##              Contact_No             Contact_Yes 
##                      17                      18

Cutting the tree at this distance grouped the data into 18 clusters, which is quite a large number. Therefore, I decided to check which number of clusters offers a reasonable balance between the number of clusters (not as high as 18, but also not too small) and the Q value, i.e. the share of inter-cluster inertia, which should be high. I chose between 3, 4, 5, and 6 clusters. The results are presented below.

library(ClustGeo)
dm_1<-dist(t(sampl)) # distances between variables (columns of the sample)
hc_1<-hclust(dm_1, method="complete") # simple dendrogram
    
# cutting by number of clusters
clust.vec.3<-cutree(hc_1, k=3) # division into 3 clusters
clust.vec.4<-cutree(hc_1, k=4) # division into 4 clusters
clust.vec.5<-cutree(hc_1, k=5) # division into 5 clusters
clust.vec.6<-cutree(hc_1, k=6) # division into 6 clusters

diss.mat<-dm_1  

inertion<-matrix(0, nrow=4, ncol=4)
colnames(inertion)<-c("division to 3 clust.", "division to 4 clust.", "division to 5 clust.", "division to 6 clust.")
rownames(inertion)<-c("intra-clust", "total", "percentage", "Q")

inertion[1,1]<-withindiss(diss.mat, part=clust.vec.3)# intra-cluster
inertion[2,1]<-inertdiss(diss.mat)              # overall
inertion[3,1]<-inertion[1,1]/ inertion[2,1]     # ratio
inertion[4,1]<-1-inertion[3,1]              # Q, inter-cluster

inertion[1,2]<-withindiss(diss.mat, part=clust.vec.4)   # intra-cluster
inertion[2,2]<-inertdiss(diss.mat)              # overall
inertion[3,2]<-inertion[1,2]/ inertion[2,2]     # ratio
inertion[4,2]<-1-inertion[3,2]              # Q, inter-cluster

inertion[1,3]<-withindiss(diss.mat, part=clust.vec.5)   # intra-cluster
inertion[2,3]<-inertdiss(diss.mat)              # overall
inertion[3,3]<-inertion[1,3]/ inertion[2,3]     # ratio
inertion[4,3]<-1-inertion[3,3]              # Q, inter-cluster

inertion[1,4]<-withindiss(diss.mat, part=clust.vec.6)   # intra-cluster
inertion[2,4]<-inertdiss(diss.mat)              # overall
inertion[3,4]<-inertion[1,4]/ inertion[2,4]     # ratio
inertion[4,4]<-1-inertion[3,4]              # Q, inter-cluster
options("scipen"=100, "digits"=4)
inertion
##             division to 3 clust. division to 4 clust. division to 5 clust.
## intra-clust            1871.7256            1733.7154            1607.0449
## total                  2116.2544            2116.2544            2116.2544
## percentage                0.8845               0.8192               0.7594
## Q                         0.1155               0.1808               0.2406
##             division to 6 clust.
## intra-clust             1477.160
## total                   2116.254
## percentage                 0.698
## Q                          0.302

Based on the results, I decided that 4 clusters provide the best balance between the number of clusters and the Q value. The Q values for 3, 4, 5, and 6 clusters are 11.6%, 18.1%, 24.1%, and 30.2% respectively, and the largest gain (6.5 percentage points) occurs between 3 and 4 clusters. Therefore, the remaining analysis is conducted for the division into 4 clusters.
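The same comparison can be written as a loop. The sketch below is a base-R re-derivation on toy data, not the ClustGeo functions used in this paper: it assumes Euclidean distances and equal weights, so that Q is one minus the ratio of the within-cluster to the total sum of squares. The helper `q_ratio` and all object names are hypothetical. The last lines compute the gains between consecutive Q values read from the inertia table above.

```r
# Q = 1 - (within-cluster sum of squares / total sum of squares)
q_ratio <- function(x, cl) {
  ss <- function(m) sum(sweep(m, 2, colMeans(m))^2)
  within <- sum(sapply(split(seq_len(nrow(x)), cl),
                       function(idx) ss(x[idx, , drop = FALSE])))
  1 - within / ss(x)
}

set.seed(123)  # arbitrary seed for a repeatable toy example
toy <- matrix(rbinom(30 * 50, 1, 0.5), nrow = 30)  # 30 toy items, 50 features
hc_toy <- hclust(dist(toy), method = "complete")

# Q for each candidate number of clusters
ks <- 3:6
q_toy <- sapply(ks, function(k) q_ratio(toy, cutree(hc_toy, k = k)))
names(q_toy) <- paste0("k=", ks)
round(q_toy, 4)

# Q values read from the inertia table in the text, and the gain per extra cluster
q_doc <- c("3" = 0.1155, "4" = 0.1808, "5" = 0.2406, "6" = 0.3020)
gain <- diff(q_doc)
round(100 * gain, 2)
```

Because the partitions of a hierarchical tree are nested, Q can only grow as clusters are added, so the size of each gain rather than Q itself is what guides the choice.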

clust_4<-cutree(hc, k=4) # division into 4 clusters
summary(clust_4) # -> 4 clusters
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    2.00    2.31    3.00    4.00
clust_4
##                   Fever               Tiredness               Dry.Cough 
##                       1                       1                       2 
## Difficulty.in.Breathing             Sore.Throat            None_Sympton 
##                       2                       3                       4 
##                   Pains        Nasal.Congestion              Runny.Nose 
##                       1                       1                       2 
##                Diarrhea       None_Experiencing                 Age_0.9 
##                       2                       4                       3 
##               Age_10.19               Age_20.24               Age_25.59 
##                       1                       1                       3 
##                 Age_60.           Gender_Female             Gender_Male 
##                       4                       3                       1 
##      Gender_Transgender           Severity_Mild       Severity_Moderate 
##                       4                       1                       4 
##           Severity_None         Severity_Severe       Contact_Dont.Know 
##                       1                       3                       3 
##              Contact_No             Contact_Yes 
##                       1                       4
plot(hc, hang=-1) # hang=-1 aligns all labels at the bottom of the plot, which looks better

plot(hc)
rect.hclust(hc, k=4, border=5:8)

library(factoextra)
fviz_cluster(list(data=t(sampl), cluster=clust.vec.4))

library(cluster)
plot(silhouette(clust.vec.4,dm_1))

The plots above visualize the clustering of the data into 4 clusters. On the one hand, dimension reduction and clustering make messy data easier to understand. On the other hand, in my analysis some variables have a negative silhouette value, which means that the clustering could perhaps be done better.
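The units with a negative silhouette can also be listed directly instead of being read off the plot. The sketch below uses toy data and the silhouette() function from the cluster package; the seed and the names `toy`, `cl`, `sil`, `widths` and `negative` are hypothetical.

```r
library(cluster)

set.seed(7)  # arbitrary seed for a repeatable toy example

# Two loose groups of 20 points each, in two dimensions
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))
rownames(toy) <- paste0("v", seq_len(nrow(toy)))

d <- dist(toy)
cl <- cutree(hclust(d, method = "complete"), k = 2)

# Silhouette width of every unit: values near 1 fit well, negative values
# suggest the unit sits closer to a neighbouring cluster
sil <- silhouette(cl, d)
widths <- sil[, "sil_width"]

negative <- names(cl)[widths < 0]  # units that may be misassigned
negative
mean(widths)                       # average silhouette width of the partition
```

Comparing the average width across several values of k is another common way to choose the number of clusters.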

Nevertheless, clustering into 18 groups (the number obtained earlier by cutting the tree at a distance of 60) turned out to be extremely messy overall. Frankly speaking, clustering the data into that many groups makes little sense for further analysis.

Conclusion

This paper addressed the problem of analysing big data (316 800 observations) with unsupervised learning methods, using a medical dataset on Covid-19 illnesses and coexisting symptoms. Dimension reduction via multiple correspondence analysis and hierarchical clustering were applied in the analysis. The computed outcome partially helped with understanding the dataset; however, I personally expected the outcome to be more transparent. The outcome of the MCA is not that easy to understand (probably because of the characteristics of the data). The outcome of the hierarchical clustering is, however, clearer.

Both dimension reduction and clustering are crucial in the world of data science. These techniques not only make data easier to understand and visualize, but are also very powerful tools for handling big data, as was presented in my paper. Nevertheless, my paper also shows that even specialized methods (like MCA or PCA) are sometimes not suitable for some types of data.