Introduction

Abstract

Addressing sample biases in genome-wide association study for SARS-CoV-2

Monitoring adaptive changes in SARS-CoV-2 is critical to mitigating its transmission. Many studies have shown that viral genomic mutations play a key role in the spread of SARS-CoV-2. A genome-wide association study (GWAS) is a standard method for studying the association between genotype and phenotypic measures. However, GWAS typically requires random sampling. Currently, SARS-CoV-2 isolates tend to be sequenced in regions with advanced research capacity, and most sequences were reported during March-April 2020. These and other sampling biases pose challenges to GWAS of SARS-CoV-2. This work proposes several statistical and computational methods to mitigate sampling biases in SARS-CoV-2 data.

[1] 1400711      31

Summary of the data

# A tibble: 6 x 31
     X1 `Virus name`    Type   `Accession ID` `Collection dat~ `Additional loca~
  <dbl> <chr>           <chr>  <chr>          <date>           <chr>            
1     1 hCoV-19/Czech ~ betac~ EPI_ISL_426889 2020-04-02       <NA>             
2     2 hCoV-19/USA/NY~ betac~ EPI_ISL_426617 2020-04-01       <NA>             
3     3 hCoV-19/USA/NY~ betac~ EPI_ISL_426618 2020-04-01       <NA>             
4     4 hCoV-19/USA/NY~ betac~ EPI_ISL_426619 2020-04-01       <NA>             
5     5 hCoV-19/USA/NY~ betac~ EPI_ISL_426620 2020-04-01       <NA>             
6     6 hCoV-19/USA/NY~ betac~ EPI_ISL_426621 2020-04-01       <NA>             
# ... with 25 more variables: Sequence length <dbl>, Host <chr>,
#   Patient age <chr>, Gender <chr>, Clade <chr>, Pango lineage <chr>,
#   Pangolin version <date>, Variant <lgl>, AA Substitutions <chr>,
#   Submission date <date>, Is reference? <lgl>, Is complete? <lgl>,
#   Is high coverage? <lgl>, Is low coverage? <lgl>, N-Content <dbl>,
#   GC-Content <dbl>, date <date>, GISAID_clade <chr>, region <chr>,
#   country <chr>, date2 <date>, monthDate <date>, weekDate <date>, week <dbl>,
#   month <dbl>

[Figures: bar charts of sequence counts by GISAID clade, one tab each for the Original Data Set, Undersample of 20, Undersample of 40, Oversample of 200, and Oversample of 500.]

Date Frequency Plots

[Figures: two columns of plots, each with tabs for the Original Data, Undersample of 20, Undersample of 40, Oversample of 200, and Oversample of 500.]

Frequency Plots by Location

[Figures: two columns of plots, each with tabs for the Original Data, Undersample of 20, Undersample of 40, Oversample of 200, and Oversample of 500.]

Chi Squared Tests of Undersamples

Column

Month

   Month      p-Value DF
1      1 1.654640e-14  6
2      2 6.195658e-06  6
3      3 3.615681e-03  4
4      4 9.242295e-04  5
5      5 8.581578e-03  4
6      6 7.523235e-03  4
7      7 2.805183e-03  4
8      8 3.130698e-12  4
9      9 3.780490e-98  6
10    10 1.267388e-42  6
11    11 2.551189e-61  6
12    12 1.956698e-14  6
13    13 5.003433e-07  4

Week

   Week      p-Value DF
1     1 2.368208e-01  6
2     2 2.224464e-07  6
3     3 3.799037e-01  6
4     4 3.297999e-05  6
5     5 1.449028e-03  4
6     6 5.623520e-02  3
7     7 4.017306e-04  4
8     8 2.119456e-01  4
9     9 5.524509e-04  4
10   10 3.001509e-01  3
11   14 1.063090e-01  3
12   20 4.253254e-02  3
13   21 8.280664e-01  3
14   22 4.587896e-01  3
15   23 3.557941e-01  3
16   24 4.113412e-02  3
17   25 7.778238e-01  3
18   26 4.752911e-01  3
19   27 5.824367e-03  3
20   28 2.482789e-01  3
21   29 1.615003e-03  3
22   30 2.474218e-01  3
23   31 8.528808e-01  3
24   32 1.675442e-03  3
25   33 4.964944e-03  4
26   34 6.918604e-05  4
27   35 1.925310e-04  4
28   36 5.671213e-02  4
29   37 1.105719e-06  4
30   38 1.340265e-08  4
31   39 4.930772e-02  4
32   40 6.122400e-06  4
33   41 2.318575e-09  4
34   42 1.512491e-06  4
35   43 5.830431e-09  5
36   44 1.620449e-09  4
37   45 1.585289e-07  5
38   46 2.089959e-11  5
39   47 7.323962e-58  5
40   48 9.358187e-07  5
41   49 8.841524e-04  5
42   50 9.093478e-06  4
43   51 3.190217e-02  4
44   52 4.909901e-03  4
45   53 1.966564e-03  3
46   54 6.631603e-04  3
47   55 5.190509e-02  3
48   56 7.212334e-01  3
49   57 2.992473e-09  3

Continent

     Continent      p-Value DF
1       Africa 1.346341e-08  7
2         Asia 3.220917e-29  8
3       Europe 4.850537e-19  8
4 NorthAmerica 3.639654e-88  5
5      Oceania 2.760799e-39  5
6 SouthAmerica 2.169172e-14  5

Country

                        Country    p-Value DF
1                        Angola 0.45878964  3
2                     Argentina 0.73342246  3
3                         Aruba 0.66950875  3
4                       Austria 0.63125809  4
5                       Bahrain 0.84358306  5
6                    Bangladesh 1.00000000  3
7                       Belgium 0.71676002  4
8          BosniaandHerzegovina 0.45042658  4
9                   BurkinaFaso 0.49740448  3
10                     Cambodia 0.83592179  4
11                     Cameroon 0.75581481  4
12                       Canada 0.09933304  3
13                        Chile 0.42349992  3
14                        China 0.50366827  4
15                     Colombia 0.35360179  3
16                    CostaRica 0.53417724  4
17                 Coted'Ivoire 0.64762268  4
18                      Croatia 0.92456082  4
19                      Curacao 0.17390128  3
20                       Cyprus 0.53681431  4
21                CzechRepublic 0.56076560  4
22 DemocraticRepublicoftheCongo 0.19104576  3
23                      Denmark 0.27529913  4
24                      Ecuador 0.28003264  3
25                        Egypt 0.47057266  5
26                      Estonia 0.93677844  3
27                      Finland 0.82083597  5
28                       France 0.71966918  4
29                       Gambia 0.61598170  3
30                      Germany 0.45299261  4
31                        Ghana 0.27301011  4
32                    Gibraltar 0.78109893  3
33                       Greece 0.16063801  4
34                   Guadeloupe 0.92830304  3
35                      Hungary 0.23107824  4
36                      Iceland 0.10595978  3
37                        India 0.04625015  4
38                    Indonesia 0.71278093  4
39                         Iran 0.66262727  4
40                         Iraq 0.30572941  4
41                      Ireland 0.21749440  3
42                       Israel 0.26752595  3
43                        Italy 0.90979599  4
44                   Kazakhstan 0.02905894  4
45                        Kenya 0.03506043  4
46                    Lithuania 0.19272480  3
47                   Luxembourg 0.06652308  4
48                     Malaysia 0.15515427  3
49                    Mauritius 0.26146413  3
50                      Morocco 0.43374900  4
51                  Netherlands 0.07913008  4
52                   NewZealand 0.70523331  4
53                      Nigeria 0.08484479  4
54               NorthMacedonia 0.56716513  4
55                       Norway 0.80633919  4
56                         Oman 0.16727738  3
57                     Pakistan 0.39426448  4
58                     Portugal 0.61018711  4
59                        Qatar 0.83642845  5
60           RepublicoftheCongo 0.75484983  3
61                      Romania 0.14325829  4
62                       Rwanda 0.46252052  3
63                  SaudiArabia 0.34434784  4
64                      Senegal 0.40379846  5
65                       Serbia 0.54329125  3
66                    Singapore 0.01241423  4
67                     Slovakia 0.62744808  3
68                     Slovenia 0.10377716  3
69                   SouthKorea 0.44592170  3
70                        Spain 0.26327399  4
71                     SriLanka 0.84914535  4
72                     Suriname 0.73575888  4
73                       Sweden 0.75873590  4
74                  Switzerland 0.15458730  4
75                       Taiwan 0.64776109  7
76                     Thailand 0.50217069  4
77                         Togo 0.02800535  4
78                      Tunisia 0.76128480  4
79                       Turkey 0.13197750  4
80                       Uganda 0.72004823  3
81           UnitedArabEmirates 0.60974498  4
82                UnitedKingdom 0.64687550  4
83                          USA 0.01242136  3

Column

Month

   Month       p-Value DF
1      1  2.315282e-32  6
2      2  2.560532e-13  6
3      3  1.392890e-24  6
4      4  7.337918e-49  5
5      5  6.933985e-15  5
6      6  2.317907e-09  4
7      7  4.193184e-09  4
8      8  2.163757e-46  5
9      9 6.064160e-158  6
10    10 3.674524e-125  6
11    11 4.026039e-113  6
12    12  1.135786e-59  6
13    13  2.184995e-24  5

Week

   Week      p-Value DF
1     1 3.148633e-04  6
2     2 1.443161e-08  6
3     3 7.977575e-11  6
4     4 3.121344e-06  6
5     5 1.894588e-05  6
6     6 1.444988e-04  5
7     7 2.853322e-04  4
8     8 6.546771e-03  4
9     9 6.257266e-09  4
10   10 7.743987e-02  4
11   11 2.052983e-06  4
12   12 3.793150e-15  4
13   13 1.608997e-06  4
14   14 3.442165e-02  3
15   15 2.179858e-04  3
16   18 7.940317e-03  3
17   19 5.449143e-01  3
18   20 1.197949e-01  3
19   21 3.624162e-01  4
20   22 3.749875e-01  3
21   23 3.761749e-01  3
22   24 3.555137e-02  3
23   25 5.158688e-02  3
24   26 1.422558e-02  3
25   27 6.969311e-05  3
26   28 2.781476e-01  3
27   29 1.353555e-07  4
28   30 5.021056e-05  3
29   31 1.722020e-02  3
30   32 7.581415e-08  4
31   33 5.068218e-07  4
32   34 2.482507e-08  4
33   35 4.483360e-04  4
34   36 3.319017e-10  4
35   37 8.109786e-09  4
36   38 4.836214e-18  4
37   39 4.511623e-10  5
38   40 1.663799e-21  5
39   41 3.304054e-12  5
40   42 2.335019e-15  6
41   43 1.595251e-33  6
42   44 9.828228e-20  5
43   45 8.335500e-51  6
44   46 2.883698e-14  6
45   47 1.772866e-97  6
46   48 8.876446e-05  6
47   49 9.060032e-09  5
48   50 8.793899e-15  5
49   51 8.602499e-13  5
50   52 1.507619e-10  4
51   53 1.328952e-06  4
52   54 2.632379e-06  4
53   55 5.472436e-04  3
54   56 1.432595e-06  3
55   57 6.360602e-06  3

Continent

     Continent       p-Value DF
1       Africa  1.753342e-31  7
2         Asia  1.503543e-56  8
3       Europe  7.443628e-45  8
4 NorthAmerica 6.244032e-162  5
5      Oceania  2.269244e-53  6
6 SouthAmerica  1.368290e-31  6

Country

                         Country     p-Value DF
1                         Angola 0.412403258  5
2                      Argentina 0.738474603  3
3                          Aruba 0.821544305  4
4                      Australia 0.529694509  4
5                        Austria 0.002064464  5
6                        Bahrain 0.872706357  5
7                     Bangladesh 0.154841583  4
8                        Belgium 0.285203142  5
9           BosniaandHerzegovina 0.714560106  5
10                      Bulgaria 0.684184844  4
11                   BurkinaFaso 0.802372190  3
12                      Cambodia 0.613952403  5
13                      Cameroon 0.944877365  5
14                        Canada 0.883151853  3
15                         Chile 0.358814024  3
16                         China 0.769445621  5
17                      Colombia 0.780119905  3
18                     CostaRica 0.795916015  4
19                  Coted'Ivoire 0.520328569  4
20                       Croatia 0.535765547  4
21                       Curacao 0.507424711  4
22                        Cyprus 0.855695198  4
23                 CzechRepublic 0.592521414  4
24  DemocraticRepublicoftheCongo 0.741038889  3
25                       Denmark 0.861587577  4
26                       Ecuador 0.559247323  5
27                         Egypt 0.433137968  5
28                       Estonia 0.079079614  4
29                       Finland 0.765701451  5
30                        France 0.735636631  6
31                  FrenchGuiana 0.789731668  3
32                        Gambia 0.145923694  4
33                       Germany 0.898179130  5
34                         Ghana 0.136788957  5
35                     Gibraltar 0.607757176  4
36                        Greece 0.569184171  4
37                    Guadeloupe 0.560364164  4
38                      HongKong 0.038574072  3
39                       Hungary 0.498439266  4
40                       Iceland 0.693540815  4
41                         India 0.733569137  4
42                     Indonesia 0.429703019  4
43                          Iran 0.255877620  4
44                          Iraq 0.666489701  4
45                       Ireland 0.393149218  4
46                        Israel 0.578582677  4
47                         Italy 0.767505195  5
48                        Jordan 0.628962571  4
49                    Kazakhstan 0.947193324  4
50                         Kenya 0.071726945  5
51                        Latvia 0.641195568  4
52                     Lithuania 0.125200668  3
53                    Luxembourg 0.738981716  4
54                      Malaysia 0.656730911  5
55                     Mauritius 0.751370710  5
56                       Mayotte 0.499869772  3
57                        Mexico 0.133145889  4
58                       Morocco 0.854553325  4
59                    Mozambique 0.521627741  3
60                   Netherlands 0.528561321  4
61                    NewZealand 0.385701146  6
62                       Nigeria 0.997567264  4
63        NorthernMarianaIslands 0.889859868  4
64                NorthMacedonia 0.475348671  4
65                        Norway 0.642466625  4
66                          Oman 0.947457534  3
67                      Pakistan 0.510154372  5
68                      Paraguay 0.793447116  3
69                   Philippines 0.769412756  4
70                        Poland 0.048849194  3
71                      Portugal 0.418879184  4
72                         Qatar 0.738598080  5
73            RepublicoftheCongo 0.956945924  3
74                       Reunion 0.872083333  3
75                       Romania 0.702950812  5
76                        Rwanda 0.612864644  3
77                   SaudiArabia 0.163470592  4
78                       Senegal 0.990820576  6
79                        Serbia 0.623452656  4
80                     Singapore 0.649301094  5
81                   SintMaarten 0.787022372  3
82                      Slovakia 0.949395645  3
83                      Slovenia 0.502415904  3
84                    SouthKorea 0.909646571  3
85                         Spain 0.160484970  6
86                      SriLanka 0.277757310  4
87                      Suriname 0.593413253  4
88                        Sweden 0.299479049  4
89                   Switzerland 0.519657018  4
90                        Taiwan 0.903276297  7
91                      Thailand 0.568599998  5
92                          Togo 0.711941825  4
93             TrinidadandTobago 0.700334086  3
94                       Tunisia 0.824178410  4
95                        Turkey 0.135010085  4
96                        Uganda 0.160345077  4
97                       Ukraine 0.981225804  3
98            UnitedArabEmirates 0.631874261  5
99                 UnitedKingdom 0.451041490  4
100                          USA 0.728756190  3
101                       Zambia 1.000000000  3

Chi Squared Tests of Oversamples

Column

Month

   Month       p-Value DF
1      1 1.183292e-224  6
2      2  2.961689e-71  6
3      3 6.268041e-124  7
4      4 1.872053e-146  7
5      5 4.272289e-249  6
6      6  0.000000e+00  5
7      7 1.254006e-256  6
8      8  0.000000e+00  6
9      9  0.000000e+00  6
10    10  0.000000e+00  7
11    11  0.000000e+00  6
12    12 7.886256e-140  7
13    13 1.254577e-113  6
14    14  7.237216e-12  3

Week

   Week       p-Value DF
1     1  1.741785e-29  6
2     2  4.169705e-63  6
3     3  2.926284e-38  6
4     4  1.314584e-80  6
5     5  3.733575e-21  6
6     6  1.513699e-25  6
7     7  1.629846e-15  6
8     8  3.231018e-15  6
9     9  1.666734e-20  6
10   10  1.009987e-10  6
11   11  6.466170e-29  6
12   12  3.646701e-78  5
13   13  1.312207e-52  6
14   14 6.508301e-131  4
15   15  2.127646e-14  4
16   16  6.257602e-09  5
17   17  1.048050e-69  5
18   18  2.227441e-26  5
19   19  1.254421e-06  4
20   20 1.369052e-280  5
21   21  1.387302e-02  4
22   22  3.115603e-27  5
23   23  1.216882e-72  5
24   24  4.637465e-09  4
25   25  1.467601e-09  4
26   26  2.167824e-07  4
27   27  5.646454e-17  4
28   28  1.496538e-18  4
29   29  1.603255e-16  4
30   30  9.755995e-11  4
31   31  2.312134e-06  5
32   32  1.273501e-16  4
33   33  3.250696e-37  5
34   34  6.389566e-23  5
35   35 9.953590e-155  5
36   36 6.525695e-118  5
37   37 5.931354e-129  6
38   38 2.049733e-197  6
39   39  0.000000e+00  6
40   40 1.780454e-168  6
41   41 1.308159e-109  6
42   42 1.170157e-220  6
43   43 4.377060e-131  6
44   44 3.854736e-149  6
45   45 5.734014e-160  6
46   46 2.578503e-228  6
47   47  0.000000e+00  6
48   48  2.646403e-67  6
49   49  3.129490e-39  6
50   50  4.605544e-60  6
51   51  9.483241e-57  6
52   52  9.561212e-13  6
53   53  2.464960e-28  6
54   54  2.250829e-49  6
55   55  2.531055e-14  5
56   56  4.575516e-25  5
57   57  2.503096e-23  3

Continent

     Continent       p-Value DF
1       Africa 6.650859e-156  8
2         Asia  0.000000e+00  8
3       Europe 2.191332e-231  8
4 NorthAmerica  0.000000e+00  8
5      Oceania  0.000000e+00  7
6 SouthAmerica 5.335575e-193  8

Country

                         Country      p-Value DF
1                         Angola 0.6515359976  6
2                      Argentina 0.4709293725  3
3                        Armenia 0.9695707706  4
4                          Aruba 0.2443370966  4
5                      Australia 0.4663553533  6
6                        Austria 0.4998086298  5
7                        Bahrain 0.9455674179  7
8                     Bangladesh 0.3633510042  5
9                        Belgium 0.1181876382  6
10          BosniaandHerzegovina 0.5232293464  5
11                      Botswana 0.3349677938  3
12                        Brazil 0.6309425876  4
13                      Bulgaria 0.9094949820  4
14                   BurkinaFaso 0.9165174649  3
15                      Cambodia 0.9185984322  5
16                      Cameroon 0.2276476216  5
17                        Canada 0.7618381671  5
18                         Chile 0.9075064225  3
19                         China 0.5715929940  5
20                      Colombia 0.8443503245  4
21                     CostaRica 0.0323832030  4
22                  Coted'Ivoire 0.6503016037  5
23                       Croatia 0.9391483327  4
24                       Curacao 0.6748335982  4
25                        Cyprus 0.5333418807  4
26                 CzechRepublic 0.5170781727  5
27  DemocraticRepublicoftheCongo 0.4859151767  6
28                       Denmark 0.8036216038  4
29                       Ecuador 0.7012853551  5
30                         Egypt 0.8725233264  5
31              EquatorialGuinea 0.8979161202  4
32                       Estonia 0.6421887198  4
33                       Finland 0.2258860459  5
34                        France 0.1724944703  6
35                  FrenchGuiana 0.5608053332  5
36                         Gabon 0.5285727305  4
37                        Gambia 0.2723578866  4
38                       Germany 0.9192934735  5
39                         Ghana 0.0078597525  6
40                     Gibraltar 0.2530656247  4
41                        Greece 0.7237299844  5
42                    Guadeloupe 0.8191882378  4
43                      HongKong 0.0837983901  5
44                       Hungary 0.2092241513  4
45                       Iceland 0.7919381898  7
46                         India 0.0638228240  6
47                     Indonesia 0.5107744600  6
48                          Iran 0.0004395766  5
49                          Iraq 0.0020308386  4
50                       Ireland 0.8285258552  5
51                        Israel 0.3551945409  4
52                         Italy 0.0571963239  6
53                         Japan 0.3785975940  3
54                        Jordan 0.8682199786  5
55                    Kazakhstan 0.5188388984  5
56                         Kenya 0.5598834278  5
57                        Latvia 0.9780187218  5
58                     Lithuania 0.7924127824  5
59                    Luxembourg 0.4624086501  4
60                        Malawi 0.5397582095  3
61                      Malaysia 0.9097120446  5
62                     Mauritius 0.9531200562  5
63                       Mayotte 0.8708710247  4
64                        Mexico 0.4653209588  4
65                       Morocco 0.8219462953  5
66                    Mozambique 0.0187155567  3
67                   Netherlands 0.6760835057  4
68                    NewZealand 0.0442937843  8
69                       Nigeria 0.6954171588  6
70        NorthernMarianaIslands 0.5045974485  4
71                NorthMacedonia 0.0854486252  5
72                        Norway 0.5893702339  4
73                          Oman 0.5399916029  6
74                      Pakistan 0.9174697489  7
75                        Panama 0.2338557450  4
76                      Paraguay 0.3377765823  3
77                          Peru 0.0053265799  4
78                   Philippines 0.8277767556  4
79                        Poland 0.6971863002  4
80                      Portugal 0.5717702566  5
81                         Qatar 0.0331219657  5
82            RepublicoftheCongo 0.3763359201  5
83                       Reunion 0.7729258540  4
84                       Romania 0.9854825339  5
85                        Russia 0.9143103372  4
86                        Rwanda 0.8455949627  6
87                   SaudiArabia 0.7084726210  4
88                       Senegal 0.2061907250  7
89                        Serbia 0.6458574560  4
90                     Singapore 0.3207904259  6
91                   SintMaarten 0.0144812761  4
92                      Slovakia 0.1688736325  4
93                      Slovenia 0.3174657457  5
94                   SouthAfrica 0.9860278217  4
95                    SouthKorea 0.7216906609  6
96                         Spain 0.1215741626  6
97                      SriLanka 0.5333081871  5
98                      Suriname 0.8570321342  5
99                        Sweden 0.8730205463  4
100                  Switzerland 0.0821408528  5
101                       Taiwan 0.5005912983  7
102                     Thailand 0.1027226911  6
103                         Togo 0.8467004756  4
104            TrinidadandTobago 0.4665183446  3
105                      Tunisia 0.9432945044  5
106                       Turkey 0.3198341434  6
107                       Uganda 0.5111953911  5
108                      Ukraine 0.0600657116  5
109           UnitedArabEmirates 0.9349687846  7
110                UnitedKingdom 0.7115848299  5
111                      Uruguay 0.1424328895  3
112                          USA 0.3097782128  5
113                       Zambia 0.4357704330  3
114                     Zimbabwe 0.5210815886  4

Column

Month

   Month       p-Value DF
1      1  0.000000e+00  6
2      2 6.171736e-187  7
3      3 1.810998e-287  7
4      4 4.720677e-276  7
5      5  0.000000e+00  7
6      6  0.000000e+00  5
7      7  0.000000e+00  6
8      8  0.000000e+00  6
9      9  0.000000e+00  6
10    10  0.000000e+00  7
11    11  0.000000e+00  7
12    12  0.000000e+00  7
13    13 4.811437e-263  6
14    14  3.416967e-26  3

Week

   Week       p-Value DF
1     1  2.786640e-64  6
2     2 2.441660e-202  6
3     3  3.146641e-81  6
4     4 4.902682e-147  6
5     5  9.543251e-57  6
6     6  1.563405e-69  6
7     7  1.728113e-34  6
8     8  3.430324e-37  7
9     9  3.547748e-74  6
10   10  4.532506e-22  7
11   11  8.120484e-54  6
12   12 1.277179e-139  6
13   13 3.061146e-106  6
14   14 1.063466e-120  7
15   15  1.981572e-54  6
16   16  7.957999e-19  6
17   17  3.755110e-88  5
18   18 2.286600e-132  5
19   19  2.832595e-41  5
20   20  0.000000e+00  5
21   21 8.055207e-103  5
22   22 4.162990e-178  5
23   23 6.366328e-232  5
24   24  0.000000e+00  5
25   25  1.293235e-22  4
26   26  8.764413e-16  4
27   27  1.652961e-47  4
28   28 3.944140e-117  4
29   29  2.443328e-55  4
30   30  8.124950e-22  5
31   31 2.153270e-203  6
32   32  4.501931e-54  5
33   33 1.324636e-196  6
34   34  1.045446e-70  5
35   35  0.000000e+00  6
36   36  0.000000e+00  6
37   37  0.000000e+00  6
38   38  0.000000e+00  6
39   39  0.000000e+00  6
40   40  0.000000e+00  6
41   41 6.096042e-253  6
42   42  0.000000e+00  6
43   43  0.000000e+00  7
44   44  0.000000e+00  7
45   45  0.000000e+00  6
46   46  0.000000e+00  6
47   47  0.000000e+00  7
48   48 5.537390e-161  6
49   49 1.417578e-100  7
50   50 2.284104e-212  6
51   51 1.165777e-165  6
52   52  1.621691e-70  6
53   53  2.321790e-56  6
54   54  7.611955e-90  6
55   55  4.548222e-61  6
56   56  2.855869e-81  6
57   57 7.204460e-130  5

Continent

     Continent p-Value DF
1       Africa       0  8
2         Asia       0  8
3       Europe       0  8
4 NorthAmerica       0  8
5      Oceania       0  8
6 SouthAmerica       0  8

Country

                         Country     p-Value DF
1                         Angola 0.594795623  6
2                      Argentina 0.010495577  4
3                        Armenia 0.694201799  4
4                          Aruba 0.635130965  5
5                      Australia 0.836591512  7
6                        Austria 0.355419547  6
7                        Bahrain 0.785129415  7
8                     Bangladesh 0.231462230  6
9                        Belgium 0.244447986  7
10          BosniaandHerzegovina 0.986432694  5
11                      Botswana 0.375415974  3
12                        Brazil 0.753199760  4
13                      Bulgaria 0.262446204  5
14                   BurkinaFaso 0.093257121  3
15                      Cambodia 0.321339802  5
16                      Cameroon 0.035997682  5
17                        Canada 0.635171618  5
18                         Chile 0.717708104  5
19                         China 0.106332435  5
20                      Colombia 0.004513380  8
21                     CostaRica 0.983640980  4
22                  Coted'Ivoire 0.450753467  5
23                       Croatia 0.031941182  4
24                       Curacao 0.397503348  4
25                        Cyprus 0.768321572  4
26                 CzechRepublic 0.224972259  5
27  DemocraticRepublicoftheCongo 0.809700374  6
28                       Denmark 0.332331885  5
29                       Ecuador 0.294707867  5
30                         Egypt 0.077160181  5
31              EquatorialGuinea 0.757858665  4
32                       Estonia 0.779790842  4
33                       Finland 0.080879888  5
34                        France 0.675545836  6
35                  FrenchGuiana 0.742715229  5
36                         Gabon 0.212445349  4
37                        Gambia 0.939151901  5
38                       Germany 0.110950752  6
39                         Ghana 0.360899925  6
40                     Gibraltar 0.355541871  4
41                        Greece 0.481174900  6
42                    Guadeloupe 0.077706108  4
43                      HongKong 0.809100522  6
44                       Hungary 0.757821349  4
45                       Iceland 0.666960250  7
46                         India 0.379573609  7
47                     Indonesia 0.623342518  6
48                          Iran 0.165828494  5
49                          Iraq 0.880576621  4
50                       Ireland 0.025456029  6
51                        Israel 0.633307812  5
52                         Italy 0.044558542  6
53                         Japan 0.524990488  3
54                        Jordan 0.789061977  6
55                    Kazakhstan 0.642360921  5
56                         Kenya 0.748827177  6
57                        Latvia 0.221239941  5
58                     Lithuania 0.228656361  5
59                    Luxembourg 0.421781294  5
60                        Malawi 0.979093888  3
61                      Malaysia 0.926186929  8
62                     Mauritius 0.880321742  5
63                       Mayotte 0.384108554  5
64                        Mexico 0.688997006  5
65                       Morocco 0.848492390  5
66                    Mozambique 0.332423972  3
67                   Netherlands 0.954842186  6
68                    NewZealand 0.128838714  8
69                       Nigeria 0.484197599  7
70        NorthernMarianaIslands 0.264531103  4
71                NorthMacedonia 0.356310749  5
72                        Norway 0.667349054  5
73                          Oman 0.966613008  6
74                      Pakistan 0.835673155  7
75                        Panama 0.672052774  5
76                      Paraguay 0.717206695  3
77                          Peru 0.001008969  5
78                   Philippines 0.535583339  4
79                        Poland 0.744951208  5
80                      Portugal 0.730978045  7
81                         Qatar 0.157973991  7
82            RepublicoftheCongo 0.674215085  5
83                       Reunion 0.343431313  4
84                       Romania 0.993120672  5
85                        Russia 0.221017827  5
86                        Rwanda 0.174758671  6
87                   SaudiArabia 0.660730948  4
88                       Senegal 0.591412820  7
89                        Serbia 0.174787973  4
90                     Singapore 0.046677409  8
91                   SintMaarten 0.247445086  4
92                      Slovakia 0.515574270  6
93                      Slovenia 0.853984441  5
94                   SouthAfrica 0.634627571  4
95                    SouthKorea 0.631480839  7
96                         Spain 0.742953301  7
97                      SriLanka 0.535609743  5
98                      Suriname 0.130620836  5
99                        Sweden 0.712196383  4
100                  Switzerland 0.024884613  5
101                       Taiwan 0.045372977  7
102                     Thailand 0.630499094  7
103                         Togo 0.130059651  4
104            TrinidadandTobago 0.778130330  3
105                      Tunisia 0.034225547  5
106                       Turkey 0.301188326  7
107                       Uganda 0.403145929  5
108                      Ukraine 0.465363576  5
109           UnitedArabEmirates 0.228151650  7
110                UnitedKingdom 0.526859530  8
111                      Uruguay 0.474548289  3
112                          USA 0.373792870  5
113                       Zambia 0.877798904  3
114                     Zimbabwe 0.283869873  4
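
The tables above report a p-value and degrees of freedom (DF) for each month, week, continent, and country, but the code that produced them is not shown in this section. As a rough sketch only, one way such a test could be set up is a chi-squared goodness-of-fit test that compares clade counts in a resample against the clade proportions of the original data within one stratum. The counts and the names `original`, `resample`, and `fit` below are hypothetical, and this is not necessarily the test used for the tables. Note that a printed p-value of exactly 0 in output like this generally means the value underflowed double precision, not that it is truly zero.

```r
# Hypothetical clade counts for a single stratum (for example, one month).
original <- c(G = 500, GH = 300, GR = 150, S = 50)  # full data set
resample <- c(G = 30,  GH = 35,  GR = 25,  S = 10)  # resampled data set

# Goodness-of-fit test: do the resampled counts follow the
# clade proportions observed in the original data?
fit <- chisq.test(x = resample, p = original / sum(original))

fit$statistic  # chi-squared test statistic
fit$parameter  # degrees of freedom = number of clades - 1
fit$p.value    # small values indicate the distributions differ
```
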
---
title: "COVID-19 Data Resampling"
author: "Landon Kehr"
output: 
  flexdashboard::flex_dashboard:
    theme: cosmo
    orientation: columns
    social: ["facebook", "twitter", "linkedin"]
    source_code: embed
    vertical_layout: scroll
---

```{r setup, include=FALSE}
#2.1
#rm(list=ls())
library(tidyverse)
library(dplyr)
library(ROSE)
library(RColorBrewer)
library(Rfast)
library(ggplot2)
library(lubridate)
library(flexdashboard)
#list.files("msa_1007")
#setwd(getwd()) #set working directory to source directory
```

# Introduction

## Column {.tabset data-width=600}

### Abstract

**Addressing sample biases in genome-wide association study for SARS-CoV-2**

Monitoring adaptive changes in SARS-CoV-2 is critical to mitigating its transmission. Many studies have shown that viral genomic mutations play a key role in the spread of SARS-CoV-2. A genome-wide association study (GWAS) is a standard method for studying the association between genotype and phenotypic measures. However, GWAS typically requires random sampling. Currently, SARS-CoV-2 isolates tend to be sequenced in regions with advanced research capacity, and most sequences were reported during March-April 2020. These and other sampling biases pose challenges to GWAS of SARS-CoV-2. This work proposes several statistical and computational methods to mitigate sampling biases in SARS-CoV-2 data.

```{r warning=FALSE}
tb <- read_csv("tb.csv") 

```


```{r}
dim(tb)
```



```{r}
#Filtering out countries that have fewer than 50 samples

group <- tb %>% group_by(country) %>% summarize(N=n()) %>% filter(N>=50)
#sum(group$N)
#group
cleaned <- tb %>% filter(country %in% group$country)
```
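
The country filter above can also be expressed as a grouped `filter()` without the intermediate `group` table. The following is a minimal sketch on hypothetical toy data; `toy`, `min_n`, and `toy_cleaned` are illustrative names and are not part of the analysis.

```r
library(dplyr)

# Hypothetical toy data: countries A, B, C with 3, 1, and 2 samples
toy <- tibble(
  country = c("A", "A", "A", "B", "C", "C"),
  GISAID_clade = c("G", "GR", "G", "S", "GH", "G")
)

min_n <- 2  # stand-in for the threshold of 50 used in the analysis

# Keep only rows whose country contributes at least min_n samples
toy_cleaned <- toy %>%
  group_by(country) %>%
  filter(n() >= min_n) %>%
  ungroup()

count(toy_cleaned, country)
```

Country B has a single sample, so it is dropped, while A and C are kept in full.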

### Summary of the data

```{r}
head(cleaned)
```

## Column {.tabset data-width=600}

### Original Data Set
```{r}
#Bar graph of the original data set (tb) grouped by strain clade determined by GISAID

originalBar <- ggplot(tb,aes(x=GISAID_clade)) + 
  geom_bar()
originalBar
```

### Undersample of 20
```{r}
#Create an undersample of 20 data points from each country

set.seed(2)
undersample20 <- cleaned %>% group_by(country) %>% sample_n(20)
under20Bar <- ggplot(undersample20, aes(x=GISAID_clade)) + 
  geom_bar()
under20Bar
```

### Undersample of 40
```{r}
#Create an undersample of 40 data points from each country

set.seed(1)
undersample40 <- cleaned %>% group_by(country) %>% sample_n(40)
under40Bar <- ggplot(undersample40, aes(x=GISAID_clade)) + 
  geom_bar()
under40Bar
```
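
Both undersamples draw a fixed number of rows per country without replacement. In dplyr 1.0.0 and later, `sample_n()` is superseded by `slice_sample()`. The sketch below shows the same idea on hypothetical toy data (`toy` and `under3` are illustrative names). One behavioural difference is worth keeping in mind: when a group has fewer rows than `n`, `slice_sample()` without replacement returns the whole group instead of raising an error, so the earlier filter on country size still matters if equal group sizes are required.

```r
library(dplyr)

set.seed(1)

# Hypothetical toy data: country A has 10 rows, country B has 4
toy <- tibble(
  country = rep(c("A", "B"), times = c(10, 4)),
  id = 1:14
)

# Undersample: 3 rows per country, drawn without replacement
under3 <- toy %>%
  group_by(country) %>%
  slice_sample(n = 3) %>%
  ungroup()

count(under3, country)
```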

### Oversample of 200
```{r}
#Create an oversample of 200 data points from each country

set.seed(3)
oversample200 <- cleaned %>% group_by(country) %>% slice_sample(n=200,replace=TRUE)
over200Bar <- ggplot(oversample200, aes(x=GISAID_clade)) + 
  geom_bar()
over200Bar
```

### Oversample of 500
```{r}
#Create an oversample of 500 data points from each country

set.seed(4)
oversample500 <- cleaned %>% group_by(country) %>% slice_sample(n=500,replace=TRUE)
over500Bar <- ggplot(oversample500, aes(x=GISAID_clade)) + 
  geom_bar()
over500Bar
```
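
Oversampling with `replace=TRUE` reaches the target size by repeating rows, so a country with fewer than 200 (or 500) sequences necessarily contributes duplicates. The sketch below, on hypothetical toy data with illustrative names, counts how often each original row is drawn. Repeated rows are not independent observations, which may matter when the oversample is later fed into a significance test.

```r
library(dplyr)

set.seed(3)

# Hypothetical toy data: country A has 10 rows, country B has 4
toy <- tibble(
  country = rep(c("A", "B"), times = c(10, 4)),
  id = 1:14
)

# Oversample: 6 rows per country, drawn with replacement
over6 <- toy %>%
  group_by(country) %>%
  slice_sample(n = 6, replace = TRUE) %>%
  ungroup()

# Number of times each original row was drawn
draws <- count(over6, country, id, name = "times_drawn")

# Distinct original rows per country (B cannot exceed its 4 rows)
distinct_rows <- over6 %>%
  group_by(country) %>%
  summarise(distinct_rows = n_distinct(id))
distinct_rows
```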

# Something {.hidden}

## Column {.tabset}

#By Month
##Prop Table
```{r}
#Original Data Set Proportion Table by Month

mainCladeTableMonth <- table(tb$GISAID_clade, tb$month)
#mainCladeTableMonth
mainPropTableMonth <- as.data.frame(prop.table(mainCladeTableMonth,2)) %>% 
  mutate("Clade"=Var1,
        "Month"=Var2,
         "Frequency"=Freq*100) %>%
  select(Clade,Month,Frequency)
#mainPropTableMonth
```
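
`prop.table(x, 2)` normalises each column of the contingency table, so the clade percentages within every month sum to 100. A small base R sketch on hypothetical labels illustrates this; `clade`, `month`, `counts`, `props`, and `long` are illustrative names.

```r
# Hypothetical clade and month labels for six sequences
clade <- c("G", "G", "GR", "S", "S", "S")
month <- c("1", "2", "2", "1", "1", "2")

counts <- table(clade, month)

# margin = 2 divides every cell by its column (month) total
props <- prop.table(counts, 2)
colSums(props)

# Long format with percentages, similar to the Clade / Month / Frequency table
long <- as.data.frame(props)
long$Frequency <- long$Freq * 100
long
```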
##Frequency Plot
```{r}
#Original Data Set Frequency Percentage Plot Month

originalDateMonth <- ggplot(mainPropTableMonth,aes(x=Month,y=Frequency,fill=Clade, width=6)) + 
  geom_col() + 
  scale_x_discrete(limits=mainPropTableMonth$Month) + 
  scale_fill_brewer(palette="Paired") + 
  theme_bw()
#originalDateMonth
```

#By Week
##Prop Table
```{r}
#Original Data Set Proportion Table by Week

mainCladeTableWeek <- table(tb$GISAID_clade, tb$week)
mainPropTableWeek <- as.data.frame(prop.table(mainCladeTableWeek,2)) %>% 
  mutate("Clade"=Var1,
         "Week"=Var2,
         "Frequency"=Freq*100) %>%
  select(Clade,Week,Frequency)
#mainPropTableWeek
```
##Frequency Plot
```{r}
#Original Data Set Frequency Percentage Plot Week

originalDateWeek <- ggplot(mainPropTableWeek,aes(x=Week,y=Frequency,fill=Clade,width=6)) + 
  geom_col() + 
  scale_x_discrete(limits=mainPropTableWeek$Week) + 
  scale_fill_brewer(palette="Paired") + 
  theme_bw()
#originalDateWeek
```

#By Continent
##Prop Table
```{r}
#Original Data Set Proportion Table by Continent

mainCladeTableLocCont <- table(tb$GISAID_clade, tb$region)
mainPropTableLocCont <- as.data.frame(prop.table(mainCladeTableLocCont,2)) %>% 
  mutate("Clade"=Var1,
         "Continent"=Var2,
         "Frequency"=Freq*100) %>%
  select(Clade,Continent,Frequency)
#mainPropTableLocCont
```

##Frequency Plot
```{r}
#Original Data Set Percentage by Continent

originalCont <- ggplot(mainPropTableLocCont, aes(x="",y=Frequency,fill=Clade)) + 
  geom_bar(stat="identity",width=1) + 
  coord_polar("y", start=0) + 
  theme_void() + 
  scale_fill_brewer(palette="Paired") + 
  facet_wrap(~Continent)
#originalCont
```
#By Country
##Prop Table
```{r}
#Original Data Set Proportion Table by Country

mainCladeTableLoc<- table(tb$GISAID_clade, tb$country)
mainPropTableLoc <- as.data.frame(prop.table(mainCladeTableLoc,2)) %>% 
  mutate("Clade"=Var1,
       "Country"=Var2,
         "Frequency"=Freq*100) %>%
  select(Clade,Country,Frequency)
#mainPropTableLoc
```

##Frequency Plot
```{r}
#Original Data Set Percentage by Country

originalLoc <- ggplot(mainPropTableLoc, aes(x="",y=Frequency,fill=Clade)) + 
  geom_bar(stat="identity",width=1) + 
  coord_polar("y", start=0) + 
  theme_void() + 
  scale_fill_brewer(palette="Paired") + 
  facet_wrap(~Country)
#originalLoc
```
#Undersample 40
#By Month
##Prop Table
```{r}
#Undersample 40 Proportion Table by Month

under40DateTableMonth <- table(undersample40$GISAID_clade, undersample40$month)
under40PropTableDateMonth <- as.data.frame(prop.table(under40DateTableMonth,2)) %>% 
  mutate("Clade"=Var1,
         "Month"=Var2,
         "Frequency"=Freq*100) %>%
  select(Clade,Month,Frequency)
#under40PropTableDateMonth
```

##Frequency Plot
```{r}
#Undersample 40 Frequency Percentage Plot Month

under40DateMonth <- ggplot(under40PropTableDateMonth,aes(x=Month,y=Frequency,fill=Clade, width=6)) + 
  geom_col() + 
  scale_x_discrete(limits=under40PropTableDateMonth$Month) + 
  scale_fill_brewer(palette="Paired") + 
  theme_bw()
#under40DateMonth
```

#By Week
##Prop Table
```{r}
#Undersample 40 Proportion Table by Week

under40DateTableWeek <- table(undersample40$GISAID_clade, undersample40$week)
under40PropTableDateWeek <- as.data.frame(prop.table(under40DateTableWeek,2)) %>% 
  mutate("Clade"=Var1,
         "Week"=Var2,
         "Frequency"=Freq*100) %>%
  select(Clade,Week,Frequency)
#under40PropTableDateWeek
```

##Frequency Plot
```{r}
#Undersample 40 Frequency Percentage Plot Week
under40DateWeek <- ggplot(under40PropTableDateWeek,aes(x=Week,y=Frequency,fill=Clade,width=6)) + 
  geom_col() + 
  scale_x_discrete(limits=under40PropTableDateWeek$Week) + 
  scale_fill_brewer(palette="Paired") + 
  theme_bw()
#under40DateWeek
```
#By Continent
##Prop Table
```{r}
#Undersample 40 Proportion Table by Continent
under40CladeTableLocCont <- table(undersample40$GISAID_clade, undersample40$region)
under40PropTableLocCont <- as.data.frame(prop.table(under40CladeTableLocCont,2)) %>% 
  mutate("Clade"=Var1,
       "Continent"=Var2,
         "Frequency"=Freq*100) %>%
  select(Clade,Continent,Frequency)
#under40PropTableLocCont
```

##Frequency Plot
```{r}
#Undersample 40 Percentage by Continent

under40Cont <- ggplot(under40PropTableLocCont, aes(x="",y=Frequency,fill=Clade)) + 
  geom_bar(stat="identity",width=1) + 
  coord_polar("y", start=0) + 
  theme_void() + 
  scale_fill_brewer(palette="Paired") + 
  facet_wrap(~Continent)
#under40Cont
```

#By Country
##Prop Table
```{r}
#Undersample 40 Proportion Table by Country

under40CladeTableLoc <- table(undersample40$GISAID_clade, undersample40$country)
under40PropTableLoc <- as.data.frame(prop.table(under40CladeTableLoc,2)) %>% 
  mutate("Clade"=Var1,
       "Country"=Var2,
         "Frequency"=Freq*100) %>%
  select(Clade,Country,Frequency)
#under40PropTableLoc
```

##Frequency Plot
```{r}
#Undersample 40 Percentage by Country

under40Loc <- ggplot(under40PropTableLoc, aes(x="",y=Frequency,fill=Clade)) + 
  geom_bar(stat="identity",width=1) + 
  coord_polar("y", start=0) + 
  theme_void() +
  scale_fill_brewer(palette="Paired") + 
  facet_wrap(~Country)
#under40Loc
```

#Undersample of 20
#By Month
##Prop Table
```{r}
#Undersample 20 Proportion Table by Month

under20DateTableMonth <- table(undersample20$GISAID_clade, undersample20$month)
under20PropTableDateMonth <- as.data.frame(prop.table(under20DateTableMonth,2)) %>% 
  mutate("Clade"=Var1,
         "Month"=Var2,
         "Frequency"=Freq*100) %>%
  select(Clade,Month,Frequency)
#under20PropTableDateMonth
```

##Frequency Plot
```{r}
#Undersample 20 Frequency Percentage Plot Month

under20DateMonth <- ggplot(under20PropTableDateMonth,aes(x=Month,y=Frequency,fill=Clade, width=6)) + 
  geom_col() + 
  scale_x_discrete(limits=under20PropTableDateMonth$Month) + 
  scale_fill_brewer(palette="Paired") + 
  theme_bw()
#under20DateMonth
```

#By Week
##Prop Table
```{r}
#Undersample 20 Proportion Table by Week

under20DateTableWeek <- table(undersample20$GISAID_clade, undersample20$week)
under20PropTableDateWeek <- as.data.frame(prop.table(under20DateTableWeek,2)) %>% 
  mutate("Clade"=Var1,
         "Week"=Var2,
         "Frequency"=Freq*100) %>%
  select(Clade,Week,Frequency)
#under20PropTableDateWeek
```

##Frequency Plot
```{r}
#Undersample 20 Frequency Percentage Plot Week

under20DateWeek <- ggplot(under20PropTableDateWeek,aes(x=Week,y=Frequency,fill=Clade,width=6)) + 
  geom_col() + 
  scale_x_discrete(limits=under20PropTableDateWeek$Week) + 
  scale_fill_brewer(palette="Paired") + 
  theme_bw()
#under20DateWeek
```
#By Continent
##Prop Table
```{r}
#Undersample 20 Proportion Table by Continent

under20CladeTableLocCont <- table(undersample20$GISAID_clade, undersample20$region)
under20PropTableLocCont <- as.data.frame(prop.table(under20CladeTableLocCont,2)) %>% 
  mutate("Clade"=Var1,
       "Continent"=Var2,
         "Frequency"=Freq*100) %>%
  select(Clade,Continent,Frequency)
#under20PropTableLocCont
```
##Frequency Plot
```{r}
#Undersample 20 Percentage by Continent

under20Cont <- ggplot(under20PropTableLocCont, aes(x="",y=Frequency,fill=Clade)) + 
  geom_bar(stat="identity",width=1) + 
  coord_polar("y", start=0) + 
  theme_void() + 
  scale_fill_brewer(palette="Paired") + 
  facet_wrap(~Continent)
#under20Cont
```

#By Country
##Prop Table
```{r}
#Undersample 20 Proportion Table by Country

under20CladeTableLoc <- table(undersample20$GISAID_clade, undersample20$country)
under20PropTableLoc <- as.data.frame(prop.table(under20CladeTableLoc,2)) %>% 
  mutate("Clade"=Var1,
       "Country"=Var2,
         "Frequency"=Freq*100) %>%
  select(Clade,Country,Frequency)
#under20PropTableLoc
```
##Frequency Plot
```{r}
#Undersample 20 Percentage by Country

under20Loc <- ggplot(under20PropTableLoc, aes(x="",y=Frequency,fill=Clade)) + 
  geom_bar(stat="identity",width=1) + 
  coord_polar("y", start=0) + 
  theme_void() +
  scale_fill_brewer(palette="Paired") + 
  facet_wrap(~Country)
#under20Loc
```

#Oversample of 200
#By Month
##Prop Table
```{r}
#Oversample 200 Proportion Table by Month

over200DateTableMonth <- table(oversample200$GISAID_clade, oversample200$month)
over200PropTableDateMonth <- as.data.frame(prop.table(over200DateTableMonth,2)) %>% 
  mutate("Clade"=Var1,
         "Month"=Var2,
         "Frequency"=Freq*100) %>%
  select(Clade,Month,Frequency)
#over200PropTableDateMonth
```

##Frequency Plot
```{r}
#Oversample 200 Frequency Percentage Plot Month

over200DateMonth <- ggplot(over200PropTableDateMonth,aes(x=Month,y=Frequency,fill=Clade, width=6)) + 
  geom_col() + 
  scale_x_discrete(limits=over200PropTableDateMonth$Month) + 
  scale_fill_brewer(palette="Paired") + 
  theme_bw()
#over200DateMonth
```
#By Week
##Prop Table
```{r}
#Oversample 200 Proportion Table by Week

over200DateTableWeek <- table(oversample200$GISAID_clade, oversample200$week)
over200PropTableDateWeek <- as.data.frame(prop.table(over200DateTableWeek,2)) %>% 
  mutate("Clade"=Var1,
         "Week"=Var2,
         "Frequency"=Freq*100) %>%
  select(Clade,Week,Frequency)
#over200PropTableDateWeek
```

##Frequency Plot
```{r}
#Oversample 200 Frequency Percentage Plot Week

over200DateWeek <- ggplot(over200PropTableDateWeek,aes(x=Week,y=Frequency,fill=Clade,width=6)) + 
  geom_col() + 
  scale_x_discrete(limits=over200PropTableDateWeek$Week) + 
  scale_fill_brewer(palette="Paired") + 
  theme_bw()
#over200DateWeek
```
#By Continent
##Prop Table
```{r}
#Oversample 200 Proportion Table by Continent

over200CladeTableLocCont <- table(oversample200$GISAID_clade, oversample200$region)
over200PropTableLocCont <- as.data.frame(prop.table(over200CladeTableLocCont,2)) %>% 
  mutate("Clade"=Var1,
       "Continent"=Var2,
         "Frequency"=Freq*100) %>%
  select(Clade,Continent,Frequency)
#over200PropTableLocCont
```
##Frequency Plot
```{r}
#Oversample 200 Percentage by Continent

over200Cont <- ggplot(over200PropTableLocCont, aes(x="",y=Frequency,fill=Clade)) + 
  geom_bar(stat="identity",width=1) + 
  coord_polar("y", start=0) + 
  theme_void() + 
  scale_fill_brewer(palette="Paired") + 
  facet_wrap(~Continent)
#over200Cont
```

#By Country
##Prop Table
```{r}
#Oversample 200 Proportion Table by Country

over200CladeTableLoc <- table(oversample200$GISAID_clade, oversample200$country)
over200PropTableLoc <- as.data.frame(prop.table(over200CladeTableLoc,2)) %>% 
  mutate("Clade"=Var1,
       "Country"=Var2,
         "Frequency"=Freq*100) %>%
  select(Clade,Country,Frequency)
#over200PropTableLoc
```
##Frequency Plot
```{r}
#Oversample 200 Percentage by Country

over200Loc <- ggplot(over200PropTableLoc, aes(x="",y=Frequency,fill=Clade)) + 
  geom_bar(stat="identity",width=1) + 
  coord_polar("y", start=0) + 
  theme_void() + 
  scale_fill_brewer(palette="Paired") + 
  facet_wrap(~Country)
#over200Loc
```

#Oversample of 500
#By Month
##Prop Table
```{r}
#Oversample 500 Proportion Table by Month

over500DateTableMonth <- table(oversample500$GISAID_clade, oversample500$month)
over500PropTableDateMonth <- as.data.frame(prop.table(over500DateTableMonth,2)) %>% 
  mutate("Clade"=Var1,
       "Month"=Var2,
         "Frequency"=Freq*100) %>%
  select(Clade,Month,Frequency)
#over500PropTableDateMonth
```

##Frequency Plot
```{r}
#Oversample 500 Frequency Percentage Plot Month

over500DateMonth <- ggplot(over500PropTableDateMonth,aes(x=Month,y=Frequency,fill=Clade, width=6)) + 
  geom_col() + 
  scale_x_discrete(limits=over500PropTableDateMonth$Month) + 
  scale_fill_brewer(palette="Paired") + 
  theme_bw()
#over500DateMonth
```

#By Week
##Prop Table
```{r}
#Oversample 500 Proportion Table by Week

over500DateTableWeek <- table(oversample500$GISAID_clade, oversample500$week)
over500PropTableDateWeek <- as.data.frame(prop.table(over500DateTableWeek,2)) %>% 
  mutate("Clade"=Var1,
         "Week"=Var2,
         "Frequency"=Freq*100) %>%
  select(Clade,Week,Frequency)
#over500PropTableDateWeek
```

##Frequency Plot
```{r}
#Oversample 500 Frequency Percentage Plot Week

over500DateWeek <- ggplot(over500PropTableDateWeek,aes(x=Week,y=Frequency,fill=Clade,width=6)) + 
  geom_col() + 
  scale_x_discrete(limits=over500PropTableDateWeek$Week) + 
  scale_fill_brewer(palette="Paired") + 
  theme_bw()
#over500DateWeek
```

#By Continent
##Prop Table
```{r}
#Oversample 500 Proportion Table by Continent

over500CladeTableLocCont <- table(oversample500$GISAID_clade, oversample500$region)
over500PropTableLocCont <- as.data.frame(prop.table(over500CladeTableLocCont,2)) %>% 
  mutate("Clade"=Var1,
       "Continent"=Var2,
         "Frequency"=Freq*100) %>%
  select(Clade,Continent,Frequency)
#over500PropTableLocCont
```

##Frequency Plot
```{r}
#Oversample 500 Percentage by Continent

over500Cont <- ggplot(over500PropTableLocCont, aes(x="",y=Frequency,fill=Clade)) + 
  geom_bar(stat="identity",width=1) + 
  coord_polar("y", start=0) + 
  theme_void() + 
  scale_fill_brewer(palette="Paired") + 
  facet_wrap(~Continent)
#over500Cont
```

#By Country
##Prop Table
```{r}
#Oversample 500 Proportion Table by Country

over500CladeTableLoc <- table(oversample500$GISAID_clade, oversample500$country)
over500PropTableLoc <- as.data.frame(prop.table(over500CladeTableLoc,2)) %>% 
  mutate("Clade"=Var1,
       "Country"=Var2,
         "Frequency"=Freq*100) %>%
  select(Clade,Country,Frequency)
#over500PropTableLoc
```

##Frequency Plot
```{r}
#Oversample 500 Percentage by Country

over500Loc <- ggplot(over500PropTableLoc, aes(x="",y=Frequency,fill=Clade)) + 
  geom_bar(stat="identity",width=1) + 
  coord_polar("y", start=0) + 
  theme_void() + 
  scale_fill_brewer(palette="Paired") + 
  facet_wrap(~Country)
#over500Loc
```

# Date Frequency Plots

## Column {.tabset}

### Original Data
```{r}
#Print all Month Frequency Plots

originalDateMonth
```

### Undersample of 20
```{r}
under20DateMonth
```

### Undersample of 40
```{r}
under40DateMonth
```

### Oversample of 200
```{r}
over200DateMonth
```

### Oversample of 500
```{r}
over500DateMonth
```

## Column {.tabset}

### Original Data
```{r}
#Print all Week Frequency Plots

originalDateWeek
```

### Undersample of 20
```{r}
under20DateWeek
```

### Undersample of 40
```{r}
under40DateWeek 
```

### Oversample of 200
```{r}
over200DateWeek
```

### Oversample of 500
```{r}
over500DateWeek
```

# Frequency Plots by Location

## Column {.tabset}

### Original Data
```{r}
#Print all Continent Frequency Plots

originalCont
```

### Undersample of 20
```{r}
under20Cont
```

### Undersample of 40
```{r}
under40Cont
```

### Oversample of 200
```{r}
over200Cont
```

### Oversample of 500
```{r}
over500Cont
```

## Column {.tabset}

### Original Data
```{r}
#Print all Country Frequency Plots

originalLoc
```

### Undersample of 20
```{r}
under20Loc
```

### Undersample of 40
```{r}
under40Loc
```

### Oversample of 200
```{r}
over200Loc
```

### Oversample of 500
```{r}
over500Loc
```

# Chi Squared Tests of Undersamples

## Column {.tabset}

### Month
```{r warning=FALSE}
#Creating sub tables undersample 20 Month

originalCladeTableMonth <- as.data.frame(table(tb$GISAID_clade,tb$month)) %>% mutate("Clade"=Var1,"Month"=Var2) %>% 
  select(Clade,Month,Freq)
under20CladeTableMonth <- as.data.frame(table(undersample20$GISAID_clade,undersample20$month))
compareMonths <- cbind2(originalCladeTableMonth,mainPropTableMonth$Frequency) %>% 
  mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Month,Original,Prop) %>% 
  filter(Month %in% under20CladeTableMonth$Var2)

#Combining tables and renaming variables

compareMonths <- cbind2(compareMonths,under20CladeTableMonth$Freq) %>% group_by(Month) %>% 
  mutate("N_under20"=sum(y), 
         "Observed_under20"=y,
         "Expected_under20"=round(N_under20*Prop)) %>% 
  select(-y)
#compareMonths

#Chi-squared tests for undersample 20 Months

list20_1 <- split(compareMonths,compareMonths$Month, drop=TRUE)
month20pVals <- as.data.frame(matrix(nrow=length(list20_1),ncol=3))
colnames(month20pVals) <- c("Month","p-Value","DF")
for (i in 1:length(list20_1)){
  temp <- as.data.frame(list20_1[i])
  colnames(temp) <- c("Clade","Month","Original","Prop","N_under20","Observed_under20","Expected_under20")
  temp <- temp %>% filter(Expected_under20!=0)
  if (nrow(temp)>=4){
    originalChiMonth20 <- chisq.test(temp$Observed_under20,correct=TRUE,p=temp$Expected_under20,rescale.p=TRUE)
    #Store chi squared tests in data frame
    #print(originalChiMonth20)
    temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
    month20pVals[i,1] <- temp[1,2]
    month20pVals[i,2] <- originalChiMonth20$p.value
    month20pVals[i,3] <- originalChiMonth20$parameter
  }
}
month20pVals <- month20pVals %>% drop_na
month20pVals
```
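
Each iteration of the loop above is a chi-squared goodness-of-fit test of the subsample's clade counts against the clade proportions of the full data set. Passing expected counts through `p` with `rescale.p=TRUE` is equivalent to passing the corresponding proportions, and the `correct` argument only applies to 2 by 2 tables, so it has no effect on a one-dimensional test. The sketch below illustrates this with hypothetical numbers (`observed` and `full_prop` are assumed values, not results from the data). In the analysis the expected counts are rounded first, so its proportions can differ slightly from the unrounded ones.

```r
# Hypothetical clade counts in a subsample for one month (N = 40)
observed <- c(G = 24, GR = 10, S = 6)

# Assumed clade proportions for the same month in the full data set
full_prop <- c(G = 0.5, GR = 0.3, S = 0.2)

# Goodness-of-fit test against the full-data proportions
fit_prop <- chisq.test(observed, p = full_prop)

# Same test specified through expected counts:
# rescale.p = TRUE divides the counts by their sum before testing
expected <- sum(observed) * full_prop
fit_count <- chisq.test(observed, p = expected, rescale.p = TRUE)

fit_prop$statistic  # X-squared
fit_prop$parameter  # degrees of freedom: number of clades minus 1
fit_prop$p.value
```

A small p-value would suggest that the clade composition of the subsample differs from that of the full data set for that month.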

### Week
```{r warning=FALSE}
#Creating sub tables undersample 20 Week

originalCladeTableWeek <- as.data.frame(table(tb$GISAID_clade,tb$week)) %>% mutate("Clade"=Var1,"Week"=Var2) %>% 
  select(Clade,Week,Freq)
under20CladeTableWeek <- as.data.frame(table(undersample20$GISAID_clade,undersample20$week))
compareWeeks <- cbind2(originalCladeTableWeek,mainPropTableWeek$Frequency) %>%
  mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Week,Original,Prop) %>% 
  filter(Week %in% under20CladeTableWeek$Var2)

#Combining tables and renaming variables

compareWeeks <- cbind2(compareWeeks,under20CladeTableWeek$Freq) %>% group_by(Week) %>% 
  mutate("N_under20"=sum(y), 
         "Observed_under20"=y,
         "Expected_under20"=round(N_under20*Prop)) %>% 
  select(-y)
#compareWeeks

#Chi-squared tests for undersample 20 Week

list20_2 <- split(compareWeeks,compareWeeks$Week, drop=TRUE)
week20pVals <- as.data.frame(matrix(nrow=length(list20_2),ncol=3))
colnames(week20pVals) <- c("Week","p-Value","DF")
for (i in 1:length(list20_2)){
  temp <- as.data.frame(list20_2[i])
  colnames(temp) <- c("Clade","Week","Original","Prop","N_under20","Observed_under20","Expected_under20")
  temp <- temp %>% filter(Expected_under20!=0)
  if (nrow(temp)>=4){
    originalChiWeek20 <- chisq.test(temp$Observed_under20,correct=TRUE,p=temp$Expected_under20,rescale.p=TRUE)
    #Store chi squared tests in data frame
    #print(originalChiWeek20)
    temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
    week20pVals[i,1] <- temp[1,2]
    week20pVals[i,2] <- originalChiWeek20$p.value
    week20pVals[i,3] <- originalChiWeek20$parameter
  }
}
week20pVals <- week20pVals %>% drop_na
week20pVals
```

### Continent
```{r warning=FALSE}
#Creating sub tables undersample 20 Continent

originalCladeTableCont <- as.data.frame(table(tb$GISAID_clade,tb$region)) %>% mutate("Clade"=Var1,"Continent"=Var2) %>% 
  select(Clade,Continent,Freq)
under20CladeTableCont <- as.data.frame(table(undersample20$GISAID_clade,undersample20$region))
compareCont <- cbind2(originalCladeTableCont,mainPropTableLocCont$Frequency) %>% 
  mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Continent,Original,Prop) %>% 
  filter(Continent %in% under20CladeTableCont$Var2)

#Combining tables and renaming variables

compareCont <- cbind2(compareCont,under20CladeTableCont$Freq) %>% group_by(Continent) %>% 
  mutate("N_under20"=sum(y), 
         "Observed_under20"=y,
         "Expected_under20"=round(N_under20*Prop)) %>% 
  select(-y)
#compareCont

#Chi-squared tests for undersample 20 Continents

list20_3 <- split(compareCont,compareCont$Continent, drop=TRUE)
cont20pVals <- as.data.frame(matrix(nrow=length(list20_3),ncol=3))
colnames(cont20pVals) <- c("Continent","p-Value","DF")
for (i in 1:length(list20_3)){
  temp <- as.data.frame(list20_3[i])
  colnames(temp) <- c("Clade","Continent","Original","Prop","N_under20","Observed_under20","Expected_under20")
  temp <- temp %>% filter(Expected_under20!=0)
  if (nrow(temp)>=4){
    originalChiCont20 <- chisq.test(temp$Observed_under20,correct=TRUE,p=temp$Expected_under20,rescale.p=TRUE)
    #Store chi squared tests in data frame
    #print(originalChiCont20)
    temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
    cont20pVals[i,1] <- temp[1,2]
    cont20pVals[i,2] <- originalChiCont20$p.value
    cont20pVals[i,3] <- originalChiCont20$parameter
  }
}
cont20pVals <- cont20pVals %>% drop_na
cont20pVals
```

### Country
```{r warning=FALSE}
#Creating sub tables undersample 20 Country

originalCladeTableLoc <- as.data.frame(table(tb$GISAID_clade,tb$country)) %>% mutate("Clade"=Var1,"Country"=Var2) %>% 
  select(Clade,Country,Freq)
under20CladeTableLoc <- as.data.frame(table(undersample20$GISAID_clade,undersample20$country))
compareLoc <- cbind2(originalCladeTableLoc,mainPropTableLoc$Frequency) %>%      
  mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Country,Original,Prop) %>% 
  filter(Country %in% under20CladeTableLoc$Var2)

#Combining tables and renaming variables

compareLoc <- cbind2(compareLoc,under20CladeTableLoc$Freq) %>% group_by(Country) %>% 
  mutate("N_under20"=sum(y), 
         "Observed_under20"=y,
         "Expected_under20"=round(N_under20*Prop)) %>% 
  select(-y)
#compareLoc

#Chi-squared tests for undersample 20 Countries

list20_4 <- split(compareLoc,compareLoc$Country, drop=TRUE)
country20pVals <- as.data.frame(matrix(nrow=length(list20_4),ncol=3))
colnames(country20pVals) <- c("Country","p-Value","DF")
for (i in 1:length(list20_4)){
  temp <- as.data.frame(list20_4[i])
  colnames(temp) <- c("Clade","Country","Original","Prop","N_under20","Observed_under20","Expected_under20")
  temp <- temp %>% filter(Expected_under20!=0)
  if (nrow(temp)>=4){
    originalChiLoc20 <- chisq.test(temp$Observed_under20,correct=TRUE,p=temp$Expected_under20,rescale.p=TRUE)
    #Store chi squared tests in data frame
    #print(originalChiLoc20)
    temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
    country20pVals[i,1] <- temp[1,2]
    country20pVals[i,2] <- originalChiLoc20$p.value
    country20pVals[i,3] <- originalChiLoc20$parameter
  }
}
country20pVals <- country20pVals %>% drop_na
country20pVals
```
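
The four chunks above repeat the same steps for month, week, continent, and country. One possible way to factor out the repetition is sketched below in base R. `gof_by_group()` is a hypothetical helper, not the code used for the reported tables, and it deviates from the chunks above in two ways: expected proportions are not rounded to whole counts, and the test cells are the clades present in the full data for each group. The `min_cells` argument loosely mirrors the `nrow(temp)>=4` condition.

```r
# Hypothetical helper: one goodness-of-fit test per level of `group`,
# comparing clade counts in `sub` with clade proportions in `full`.
gof_by_group <- function(full, sub, group, clade = "GISAID_clade", min_cells = 4) {
  groups <- sort(unique(as.character(sub[[group]])))
  rows <- lapply(groups, function(g) {
    full_g <- full[which(full[[group]] == g), , drop = FALSE]
    sub_g <- sub[which(sub[[group]] == g), , drop = FALSE]
    # Clades present in the full data for this group define the test cells
    clades <- sort(unique(as.character(full_g[[clade]])))
    if (length(clades) < min_cells) return(NULL)
    full_counts <- as.numeric(table(factor(full_g[[clade]], levels = clades)))
    observed <- as.numeric(table(factor(sub_g[[clade]], levels = clades)))
    # Warnings about small expected counts are suppressed, as in the chunks above
    test <- suppressWarnings(chisq.test(observed, p = full_counts / sum(full_counts)))
    data.frame(group = g, p_value = test$p.value, df = unname(test$parameter))
  })
  do.call(rbind, rows)
}

# Usage on simulated data (hypothetical values, not the GISAID data)
set.seed(42)
full <- data.frame(
  month = rep(c("3", "4"), each = 200),
  GISAID_clade = sample(c("G", "GH", "GR", "S"), 400, replace = TRUE)
)
sub <- full[sample(nrow(full), 80), ]
res <- gof_by_group(full, sub, group = "month")
res
```

Called with `group = "week"`, `"region"`, or `"country"`, the same function would cover the remaining tabs, provided the column names match those in `tb`.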

## Column {.tabset}

### Month
```{r warning=FALSE}
#Creating sub tables undersample 40 Month

originalCladeTableMonth <- as.data.frame(table(tb$GISAID_clade,tb$month)) %>% mutate("Clade"=Var1,"Month"=Var2) %>% 
  select(Clade,Month,Freq)
under40CladeTableMonth <- as.data.frame(table(undersample40$GISAID_clade,undersample40$month))
#under40CladeTableMonth
compare40Months <- cbind2(originalCladeTableMonth,mainPropTableMonth$Frequency) %>% 
  mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Month,Original,Prop) %>% 
  filter(Month %in% under40CladeTableMonth$Var2)

#Combining tables and renaming variables

compare40Months <- cbind2(compare40Months,under40CladeTableMonth$Freq) %>% group_by(Month) %>% 
  mutate("N_under40"=sum(y), 
         "Observed_under40"=y,
         "Expected_under40"=round(N_under40*Prop)) %>% 
  select(-y)
#compare40Months

#Chi-squared tests for undersample 40 Months

list40_1 <- split(compare40Months,compare40Months$Month, drop=TRUE)
month40pVals <- as.data.frame(matrix(nrow=length(list40_1),ncol=3))
colnames(month40pVals) <- c("Month","p-Value","DF")
for (i in 1:length(list40_1)){
  temp <- as.data.frame(list40_1[i])
  colnames(temp) <- c("Clade","Month","Original","Prop","N_under40","Observed_under40","Expected_under40")
  temp <- temp %>% filter(Expected_under40!=0)
  if (nrow(temp)>=4){
    originalChiMonth40 <- chisq.test(temp$Observed_under40,correct=TRUE,p=temp$Expected_under40,rescale.p=TRUE)
    #Store chi squared tests in data frame
    #print(originalChiMonth40)
    temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
    month40pVals[i,1] <- temp[1,2]
    month40pVals[i,2] <- originalChiMonth40$p.value
    month40pVals[i,3] <- originalChiMonth40$parameter
  }
}
month40pVals <- month40pVals %>% drop_na
month40pVals
```

### Week
```{r warning=FALSE}
#Creating sub tables undersample 40 Week

originalCladeTableWeek <- as.data.frame(table(tb$GISAID_clade,tb$week)) %>% mutate("Clade"=Var1,"Week"=Var2) %>% 
  select(Clade,Week,Freq)
under40CladeTableWeek <- as.data.frame(table(undersample40$GISAID_clade,undersample40$week))
compare40Weeks <- cbind2(originalCladeTableWeek,mainPropTableWeek$Frequency) %>%
  mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Week,Original,Prop) %>% 
  filter(Week %in% under40CladeTableWeek$Var2)

#Combining tables and renaming variables

compare40Weeks <- cbind2(compare40Weeks,under40CladeTableWeek$Freq) %>% group_by(Week) %>% 
  mutate("N_under40"=sum(y), 
         "Observed_under40"=y,
         "Expected_under40"=round(N_under40*Prop)) %>% 
  select(-y)
#compare40Weeks

#Chi-squared tests for undersample 40 Weeks

list40_2 <- split(compare40Weeks,compare40Weeks$Week, drop=TRUE)
week40pVals <- as.data.frame(matrix(nrow=length(list40_2),ncol=3))
colnames(week40pVals) <- c("Week","p-Value","DF")
for (i in 1:length(list40_2)){
  temp <- as.data.frame(list40_2[i])
  colnames(temp) <- c("Clade","Week","Original","Prop","N_under40","Observed_under40","Expected_under40")
  temp <- temp %>% filter(Expected_under40!=0)
  if (nrow(temp)>=4){
    originalChiWeek40 <- chisq.test(temp$Observed_under40,correct=TRUE,p=temp$Expected_under40,rescale.p=TRUE)
    #Store chi squared tests in data frame
    #print(originalChiWeek40)
    temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
    week40pVals[i,1] <- temp[1,2]
    week40pVals[i,2] <- originalChiWeek40$p.value
    week40pVals[i,3] <- originalChiWeek40$parameter
  }
}
week40pVals <- week40pVals %>% drop_na
week40pVals
```

### Continent
```{r warning=FALSE}
#Creating sub tables undersample 40 Continent

originalCladeTableCont <- as.data.frame(table(tb$GISAID_clade,tb$region)) %>% mutate("Clade"=Var1,"Continent"=Var2) %>% 
  select(Clade,Continent,Freq)
under40CladeTableCont <- as.data.frame(table(undersample40$GISAID_clade,undersample40$region))
compare40Cont <- cbind2(originalCladeTableCont,mainPropTableLocCont$Frequency) %>% 
  mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Continent,Original,Prop) %>% 
  filter(Continent %in% under40CladeTableCont$Var2)

#Combining tables and renaming variables

compare40Cont <- cbind2(compare40Cont,under40CladeTableCont$Freq) %>% group_by(Continent) %>% 
  mutate("N_under40"=sum(y), 
         "Observed_under40"=y,
         "Expected_under40"=round(N_under40*Prop)) %>% 
  select(-y)
#compare40Cont

#Chi-squared tests for undersample 40 Continents

list40_3 <- split(compare40Cont,compare40Cont$Continent, drop=TRUE)
cont40pVals <- as.data.frame(matrix(nrow=length(list40_3),ncol=3))
colnames(cont40pVals) <- c("Continent","p-Value","DF")
for (i in 1:length(list40_3)){
  temp <- as.data.frame(list40_3[i])
  colnames(temp) <- c("Clade","Continent","Original","Prop","N_under40","Observed_under40","Expected_under40")
  temp <- temp %>% filter(Expected_under40!=0)
  if (nrow(temp)>=4){
    originalChiCont40 <- chisq.test(temp$Observed_under40,correct=TRUE,p=temp$Expected_under40,rescale.p=TRUE)
    #Store chi squared tests in data frame
    #print(originalChiCont40)
    temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
    cont40pVals[i,1] <- temp[1,2]
    cont40pVals[i,2] <- originalChiCont40$p.value
    cont40pVals[i,3] <- originalChiCont40$parameter
  }
}
cont40pVals <- cont40pVals %>% drop_na
cont40pVals
```

### Country
```{r warning=FALSE}
#Creating sub tables undersample 40 Country

originalCladeTableLoc <- as.data.frame(table(tb$GISAID_clade,tb$country)) %>% mutate("Clade"=Var1,"Country"=Var2) %>% 
  select(Clade,Country,Freq)
under40CladeTableLoc <- as.data.frame(table(undersample40$GISAID_clade,undersample40$country))
compare40Loc <- cbind2(originalCladeTableLoc,mainPropTableLoc$Frequency) %>% 
  mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Country,Original,Prop) %>% 
  filter(Country %in% under40CladeTableLoc$Var2)

#Combining tables and renaming variables

compare40Loc <- cbind2(compare40Loc,under40CladeTableLoc$Freq) %>% group_by(Country) %>% 
  mutate("N_under40"=sum(y), 
         "Observed_under40"=y,
         "Expected_under40"=round(N_under40*Prop)) %>% 
  select(-y)
#compare40Loc

#Chi-squared tests for undersample 40 Countries

list40_4 <- split(compare40Loc,compare40Loc$Country, drop=TRUE)
country40pVals <- as.data.frame(matrix(nrow=length(list40_4),ncol=3))
colnames(country40pVals) <- c("Country","p-Value","DF")
for (i in 1:length(list40_4)){
  temp <- as.data.frame(list40_4[i])
  colnames(temp) <- c("Clade","Country","Original","Prop","N_under40","Observed_under40","Expected_under40")
  temp <- temp %>% filter(Expected_under40!=0)
  if (nrow(temp)>=4){
    originalChiLoc40 <- chisq.test(temp$Observed_under40,correct=TRUE,p=temp$Expected_under40,rescale.p=TRUE)
    #Store chi squared tests in data frame
    #print(originalChiLoc40)
    temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
    country40pVals[i,1] <- temp[1,2]
    country40pVals[i,2] <- originalChiLoc40$p.value
    country40pVals[i,3] <- originalChiLoc40$parameter
  }
}
country40pVals <- country40pVals %>% drop_na
country40pVals
```
# Chi Squared Tests of Oversamples

## Column {.tabset}

### Month
```{r warning=FALSE}
#Creating sub tables oversample 200 Month

originalCladeTableMonth <- as.data.frame(table(tb$GISAID_clade,tb$month)) %>% mutate("Clade"=Var1,"Month"=Var2) %>% 
  select(Clade,Month,Freq)
over200CladeTableMonth <- as.data.frame(table(oversample200$GISAID_clade,oversample200$month))
compare200Months <- cbind2(originalCladeTableMonth,mainPropTableMonth$Frequency) %>% 
  mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Month,Original,Prop) %>% 
  filter(Month %in% over200CladeTableMonth$Var2)

#Combining tables and renaming variables

compare200Months <- cbind2(compare200Months,over200CladeTableMonth$Freq) %>% group_by(Month) %>% 
  mutate("N_over200"=sum(y), 
         "Observed_over200"=y,
         "Expected_over200"=round(N_over200*Prop)) %>% 
  select(-y)
#compare200Months

#Chi-squared tests for oversample 200 Months

list200_1 <- split(compare200Months,compare200Months$Month, drop=TRUE)
month200pVals <- as.data.frame(matrix(nrow=length(list200_1),ncol=3))
colnames(month200pVals) <- c("Month","p-Value","DF")
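for (i in seq_along(list200_1)){
  temp <- as.data.frame(list200_1[i])
  colnames(temp) <- c("Clade","Month","Original","Prop","N_over200","Observed_over200","Expected_over200")
  #Drop clades with an expected count of zero, which would make the statistic undefined
  temp <- temp %>% filter(Expected_over200!=0)
  #Run the test only when at least four clades remain
  if (nrow(temp)>=4){
    #Goodness-of-fit test; correct=TRUE has no effect here (it applies only to 2x2 tables)
    originalChiMonth200 <- chisq.test(temp$Observed_over200,correct=TRUE,p=temp$Expected_over200,rescale.p=TRUE)
    #Store the p-value and degrees of freedom in the results data frame
    #print(originalChiMonth200)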
    temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
    month200pVals[i,1] <- temp[1,2]
    month200pVals[i,2] <- originalChiMonth200$p.value
    month200pVals[i,3] <- originalChiMonth200$parameter
  }
}
month200pVals <- month200pVals %>% drop_na
month200pVals
```

### Week
```{r warning=FALSE}
#Creating sub tables oversample 200 Week

originalCladeTableWeek <- as.data.frame(table(tb$GISAID_clade,tb$week)) %>% mutate("Clade"=Var1,"Week"=Var2) %>% 
  select(Clade,Week,Freq)
over200CladeTableWeek <- as.data.frame(table(oversample200$GISAID_clade,oversample200$week))
compare200Weeks <- cbind2(originalCladeTableWeek,mainPropTableWeek$Frequency) %>%
  mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Week,Original,Prop) %>% 
  filter(Week %in% over200CladeTableWeek$Var2)

#Combining tables and renaming variables

compare200Weeks <- cbind2(compare200Weeks,over200CladeTableWeek$Freq) %>% group_by(Week) %>% 
  mutate("N_over200"=sum(y), 
         "Observed_over200"=y,
         "Expected_over200"=round(N_over200*Prop)) %>% 
  select(-y)
#compare200Weeks

#Chi-squared tests for oversample 200 Weeks

list200_2 <- split(compare200Weeks,compare200Weeks$Week, drop=TRUE)
week200pVals <- as.data.frame(matrix(nrow=length(list200_2),ncol=3))
colnames(week200pVals) <- c("Week","p-Value","DF")
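for (i in seq_along(list200_2)){
  temp <- as.data.frame(list200_2[i])
  colnames(temp) <- c("Clade","Week","Original","Prop","N_over200","Observed_over200","Expected_over200")
  #Drop clades with an expected count of zero, which would make the statistic undefined
  temp <- temp %>% filter(Expected_over200!=0)
  #Run the test only when at least four clades remain
  if (nrow(temp)>=4){
    #Goodness-of-fit test; correct=TRUE has no effect here (it applies only to 2x2 tables)
    originalChiWeek200 <- chisq.test(temp$Observed_over200,correct=TRUE,p=temp$Expected_over200,rescale.p=TRUE)
    #Store the p-value and degrees of freedom in the results data frame
    #print(originalChiWeek200)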
    temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
    week200pVals[i,1] <- temp[1,2]
    week200pVals[i,2] <- originalChiWeek200$p.value
    week200pVals[i,3] <- originalChiWeek200$parameter
  }
}
week200pVals <- week200pVals %>% drop_na
week200pVals
```

### Continent
```{r warning=FALSE}
#Creating sub tables oversample 200 Continent

originalCladeTableCont <- as.data.frame(table(tb$GISAID_clade,tb$region)) %>% mutate("Clade"=Var1,"Continent"=Var2) %>% 
  select(Clade,Continent,Freq)
over200CladeTableCont <- as.data.frame(table(oversample200$GISAID_clade,oversample200$region))
compare200Cont <- cbind2(originalCladeTableCont,mainPropTableLocCont$Frequency) %>% 
  mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Continent,Original,Prop) %>% 
  filter(Continent %in% over200CladeTableCont$Var2)

#Combining tables and renaming variables

compare200Cont <- cbind2(compare200Cont,over200CladeTableCont$Freq) %>% group_by(Continent) %>% 
  mutate("N_over200"=sum(y), 
         "Observed_over200"=y,
         "Expected_over200"=round(N_over200*Prop)) %>% 
  select(-y)
#compare200Cont

#Chi-squared tests for oversample 200 Continents

list200_3 <- split(compare200Cont,compare200Cont$Continent, drop=TRUE)
cont200pVals <- as.data.frame(matrix(nrow=length(list200_3),ncol=3))
colnames(cont200pVals) <- c("Continent","p-Value","DF")
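for (i in seq_along(list200_3)){
  temp <- as.data.frame(list200_3[i])
  colnames(temp) <- c("Clade","Continent","Original","Prop","N_over200","Observed_over200","Expected_over200")
  #Drop clades with an expected count of zero, which would make the statistic undefined
  temp <- temp %>% filter(Expected_over200!=0)
  #Run the test only when at least four clades remain
  if (nrow(temp)>=4){
    #Goodness-of-fit test; correct=TRUE has no effect here (it applies only to 2x2 tables)
    originalChiCont200 <- chisq.test(temp$Observed_over200,correct=TRUE,p=temp$Expected_over200,rescale.p=TRUE)
    #Store the p-value and degrees of freedom in the results data frame
    #print(originalChiCont200)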
    temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
    cont200pVals[i,1] <- temp[1,2]
    cont200pVals[i,2] <- originalChiCont200$p.value
    cont200pVals[i,3] <- originalChiCont200$parameter
  }
}
cont200pVals <- cont200pVals %>% drop_na
cont200pVals
```

### Country
```{r warning=FALSE}
#Creating sub tables oversample 200 Country

originalCladeTableLoc <- as.data.frame(table(tb$GISAID_clade,tb$country)) %>% mutate("Clade"=Var1,"Country"=Var2) %>% 
  select(Clade,Country,Freq)
over200CladeTableLoc <- as.data.frame(table(oversample200$GISAID_clade,oversample200$country))
compare200Loc <- cbind2(originalCladeTableLoc,mainPropTableLoc$Frequency) %>% 
  mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Country,Original,Prop) %>% 
  filter(Country %in% over200CladeTableLoc$Var2)

#Combining tables and renaming variables

compare200Loc <- cbind2(compare200Loc,over200CladeTableLoc$Freq) %>% group_by(Country) %>% 
  mutate("N_over200"=sum(y), 
         "Observed_over200"=y,
         "Expected_over200"=round(N_over200*Prop)) %>% 
  select(-y)
#compare200Loc

#Chi-squared tests for oversample 200 Countries

list200_4 <- split(compare200Loc,compare200Loc$Country, drop=TRUE)
country200pVals <- as.data.frame(matrix(nrow=length(list200_4),ncol=3))
colnames(country200pVals) <- c("Country","p-Value","DF")
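for (i in seq_along(list200_4)){
  temp <- as.data.frame(list200_4[i])
  colnames(temp) <- c("Clade","Country","Original","Prop","N_over200","Observed_over200","Expected_over200")
  #Drop clades with an expected count of zero, which would make the statistic undefined
  temp <- temp %>% filter(Expected_over200!=0)
  #Run the test only when at least four clades remain
  if (nrow(temp)>=4){
    #Goodness-of-fit test; correct=TRUE has no effect here (it applies only to 2x2 tables)
    originalChiLoc200 <- chisq.test(temp$Observed_over200,correct=TRUE,p=temp$Expected_over200,rescale.p=TRUE)
    #Store the p-value and degrees of freedom in the results data frame
    #print(originalChiLoc200)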
    temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
    country200pVals[i,1] <- temp[1,2]
    country200pVals[i,2] <- originalChiLoc200$p.value
    country200pVals[i,3] <- originalChiLoc200$parameter
  }
}
country200pVals <- country200pVals %>% drop_na
country200pVals
```
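
The tests in these chunks pass expected counts, not probabilities, as `p` and rely on `rescale.p=TRUE` to normalise them. The sketch below, on invented counts, illustrates that this is equivalent to dividing the expected counts by their sum, and that `correct=TRUE` leaves a goodness-of-fit test unchanged. All variable names here are illustrative; the block is displayed only and is not evaluated by the dashboard.

```r
# Invented counts, for illustration only
observed        <- c(30, 50, 20)
expected_counts <- c(25, 50, 25)

# Expected counts rescaled internally to probabilities
fit_rescaled <- chisq.test(observed, p = expected_counts, rescale.p = TRUE)

# The same test with probabilities computed by hand
fit_manual <- chisq.test(observed, p = expected_counts / sum(expected_counts))

# correct = TRUE applies only to 2x2 tables, so it changes nothing here
fit_correct <- chisq.test(observed, p = expected_counts, rescale.p = TRUE,
                          correct = TRUE)

print(c(rescaled = fit_rescaled$p.value,
        manual   = fit_manual$p.value,
        correct  = fit_correct$p.value))
```

All three calls give the same statistic and p-value, so the `correct` argument in the chunks above is harmless but inactive.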

## Column {.tabset}

### Month
```{r warning=FALSE}
#Creating sub tables oversample 500 Month

originalCladeTableMonth <- as.data.frame(table(tb$GISAID_clade,tb$month)) %>% mutate("Clade"=Var1,"Month"=Var2) %>% 
  select(Clade,Month,Freq)
over500CladeTableMonth <- as.data.frame(table(oversample500$GISAID_clade,oversample500$month))
compare500Months <- cbind2(originalCladeTableMonth,mainPropTableMonth$Frequency) %>% 
  mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Month,Original,Prop) %>% 
  filter(Month %in% over500CladeTableMonth$Var2)

#Combining tables and renaming variables

compare500Months <- cbind2(compare500Months,over500CladeTableMonth$Freq) %>% group_by(Month) %>% 
  mutate("N_over500"=sum(y), 
         "Observed_over500"=y,
         "Expected_over500"=round(N_over500*Prop)) %>% 
  select(-y)
#compare500Months

#Chi-squared tests for oversample 500 Months

list500_1 <- split(compare500Months,compare500Months$Month, drop=TRUE)
month500pVals <- as.data.frame(matrix(nrow=length(list500_1),ncol=3))
colnames(month500pVals) <- c("Month","p-Value","DF")
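for (i in seq_along(list500_1)){
  temp <- as.data.frame(list500_1[i])
  colnames(temp) <- c("Clade","Month","Original","Prop","N_over500","Observed_over500","Expected_over500")
  #Drop clades with an expected count of zero, which would make the statistic undefined
  temp <- temp %>% filter(Expected_over500!=0)
  #Run the test only when at least four clades remain
  if (nrow(temp)>=4){
    #Goodness-of-fit test; correct=TRUE has no effect here (it applies only to 2x2 tables)
    originalChiMonth500 <- chisq.test(temp$Observed_over500,correct=TRUE,p=temp$Expected_over500,rescale.p=TRUE)
    #Store the p-value and degrees of freedom in the results data frame
    #print(originalChiMonth500)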
    temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
    month500pVals[i,1] <- temp[1,2]
    month500pVals[i,2] <- originalChiMonth500$p.value
    month500pVals[i,3] <- originalChiMonth500$parameter
  }
}
month500pVals <- month500pVals %>% drop_na
month500pVals
```

### Week
```{r warning=FALSE}
#Creating sub tables oversample 500 Week

originalCladeTableWeek <- as.data.frame(table(tb$GISAID_clade,tb$week)) %>% mutate("Clade"=Var1,"Week"=Var2) %>% 
  select(Clade,Week,Freq)
over500CladeTableWeek <- as.data.frame(table(oversample500$GISAID_clade,oversample500$week))
compare500Weeks <- cbind2(originalCladeTableWeek,mainPropTableWeek$Frequency) %>%
  mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Week,Original,Prop) %>% 
  filter(Week %in% over500CladeTableWeek$Var2)

#Combining tables and renaming variables

compare500Weeks <- cbind2(compare500Weeks,over500CladeTableWeek$Freq) %>% group_by(Week) %>% 
  mutate("N_over500"=sum(y), 
         "Observed_over500"=y,
         "Expected_over500"=round(N_over500*Prop)) %>% 
  select(-y)
#compare500Weeks

#Chi-squared tests for oversample 500 Weeks

list500_2 <- split(compare500Weeks,compare500Weeks$Week, drop=TRUE)
week500pVals <- as.data.frame(matrix(nrow=length(list500_2),ncol=3))
colnames(week500pVals) <- c("Week","p-Value","DF")
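for (i in seq_along(list500_2)){
  temp <- as.data.frame(list500_2[i])
  colnames(temp) <- c("Clade","Week","Original","Prop","N_over500","Observed_over500","Expected_over500")
  #Drop clades with an expected count of zero, which would make the statistic undefined
  temp <- temp %>% filter(Expected_over500!=0)
  #Run the test only when at least four clades remain
  if (nrow(temp)>=4){
    #Goodness-of-fit test; correct=TRUE has no effect here (it applies only to 2x2 tables)
    originalChiWeek500 <- chisq.test(temp$Observed_over500,correct=TRUE,p=temp$Expected_over500,rescale.p=TRUE)
    #Store the p-value and degrees of freedom in the results data frame
    #print(originalChiWeek500)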
    temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
    week500pVals[i,1] <- temp[1,2]
    week500pVals[i,2] <- originalChiWeek500$p.value
    week500pVals[i,3] <- originalChiWeek500$parameter
  }
}
week500pVals <- week500pVals %>% drop_na
week500pVals
```

### Continent
```{r warning=FALSE}
#Creating sub tables oversample 500 Continent

originalCladeTableCont <- as.data.frame(table(tb$GISAID_clade,tb$region)) %>% mutate("Clade"=Var1,"Continent"=Var2) %>% 
  select(Clade,Continent,Freq)
over500CladeTableCont <- as.data.frame(table(oversample500$GISAID_clade,oversample500$region))
compare500Cont <- cbind2(originalCladeTableCont,mainPropTableLocCont$Frequency) %>%
  mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Continent,Original,Prop) %>% 
  filter(Continent %in% over500CladeTableCont$Var2)

#Combining tables and renaming variables

compare500Cont <- cbind2(compare500Cont,over500CladeTableCont$Freq) %>% group_by(Continent) %>% 
  mutate("N_over500"=sum(y), 
         "Observed_over500"=y,
         "Expected_over500"=round(N_over500*Prop)) %>% 
  select(-y)
#compare500Cont

#Chi-squared tests for oversample 500 Continents

list500_3 <- split(compare500Cont,compare500Cont$Continent, drop=TRUE)
cont500pVals <- as.data.frame(matrix(nrow=length(list500_3),ncol=3))
colnames(cont500pVals) <- c("Continent","p-Value","DF")
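for (i in seq_along(list500_3)){
  temp <- as.data.frame(list500_3[i])
  colnames(temp) <- c("Clade","Continent","Original","Prop","N_over500","Observed_over500","Expected_over500")
  #Drop clades with an expected count of zero, which would make the statistic undefined
  temp <- temp %>% filter(Expected_over500!=0)
  #Run the test only when at least four clades remain
  if (nrow(temp)>=4){
    #Goodness-of-fit test; correct=TRUE has no effect here (it applies only to 2x2 tables)
    originalChiCont500 <- chisq.test(temp$Observed_over500,correct=TRUE,p=temp$Expected_over500,rescale.p=TRUE)
    #Store the p-value and degrees of freedom in the results data frame
    #print(originalChiCont500)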
    temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
    cont500pVals[i,1] <- temp[1,2]
    cont500pVals[i,2] <- originalChiCont500$p.value
    cont500pVals[i,3] <- originalChiCont500$parameter
  }
}
cont500pVals <- cont500pVals %>% drop_na
cont500pVals
```

### Country
```{r warning=FALSE}
#Creating sub tables oversample 500 Country

originalCladeTableLoc <- as.data.frame(table(tb$GISAID_clade,tb$country)) %>% mutate("Clade"=Var1,"Country"=Var2) %>% 
  select(Clade,Country,Freq)
over500CladeTableLoc <- as.data.frame(table(oversample500$GISAID_clade,oversample500$country))
compare500Loc <- cbind2(originalCladeTableLoc,mainPropTableLoc$Frequency) %>% 
  mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Country,Original,Prop) %>% 
  filter(Country %in% over500CladeTableLoc$Var2)

#Combining tables and renaming variables

compare500Loc <- cbind2(compare500Loc,over500CladeTableLoc$Freq) %>% group_by(Country) %>% 
  mutate("N_over500"=sum(y), 
         "Observed_over500"=y,
         "Expected_over500"=round(N_over500*Prop)) %>% 
  select(-y)
#compare500Loc

#Chi-squared tests for oversample 500 Countries

list500_4 <- split(compare500Loc,compare500Loc$Country, drop=TRUE)
country500pVals <- as.data.frame(matrix(nrow=length(list500_4),ncol=3))
colnames(country500pVals) <- c("Country","p-Value","DF")
for (i in seq_along(list500_4)){
  temp <- as.data.frame(list500_4[i])
  colnames(temp) <- c("Clade","Country","Original","Prop","N_over500","Observed_over500","Expected_over500")
  #Drop clades with an expected count of zero, which would make the statistic undefined
  temp <- temp %>% filter(Expected_over500!=0)
  #Run the test only when at least four clades remain
  if (nrow(temp)>=4){
    #Goodness-of-fit test; correct=TRUE has no effect here (it applies only to 2x2 tables)
    originalChiLoc500 <- chisq.test(temp$Observed_over500,correct=TRUE,p=temp$Expected_over500,rescale.p=TRUE)
    #Store the p-value and degrees of freedom in the results data frame
    #print(originalChiLoc500)
    temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
    country500pVals[i,1] <- temp[1,2]
    country500pVals[i,2] <- originalChiLoc500$p.value
    country500pVals[i,3] <- originalChiLoc500$parameter
  }
}
country500pVals <- country500pVals %>% drop_na
country500pVals
```
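
The chunks above attach subsample counts with `cbind2`, which binds by position and therefore assumes both tables list the same clade and group combinations in the same order. If a subsample happened to lack a clade entirely, the row counts would differ. A possible alternative, sketched below on invented data, joins on the key columns with base R `merge()` and fills missing combinations with zero. The data frames and column values are hypothetical; the block is displayed only and is not evaluated by the dashboard.

```r
# Invented proportions from the "original" data, for illustration only
original <- data.frame(
  Clade = c("G", "GH", "S", "G", "GH", "S"),
  Month = rep(c("2020-03", "2020-04"), each = 3),
  Prop  = c(0.5, 0.3, 0.2, 0.6, 0.3, 0.1)
)

# Invented subsample counts; clade "S" is absent in 2020-04
subsample <- data.frame(
  Clade = c("G", "GH", "S", "G", "GH"),
  Month = c("2020-03", "2020-03", "2020-03", "2020-04", "2020-04"),
  Freq  = c(12, 5, 3, 14, 6)
)

# Join on the key columns instead of relying on row order
joined <- merge(original, subsample, by = c("Clade", "Month"), all.x = TRUE)

# Combinations missing from the subsample count as zero observations
joined$Freq[is.na(joined$Freq)] <- 0

# Per-month subsample size and expected counts under the original proportions
joined$N        <- ave(joined$Freq, joined$Month, FUN = sum)
joined$Expected <- round(joined$N * joined$Prop)

joined <- joined[order(joined$Month, joined$Clade), ]
print(joined)
```

Because the join is keyed on `Clade` and `Month`, the result does not depend on how either table is sorted.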