Addressing sample biases in genome-wide association study for SARS-CoV-2
Monitoring adaptive changes in SARS-CoV-2 is critical to mitigating its transmission. Many studies show that viral genomic mutations play a key role in the propagation of SARS-CoV-2. A genome-wide association study (GWAS) is a standard method for studying the association between genotype and phenotypic measures. However, GWAS typically requires random sampling. Currently, SARS-CoV-2 isolates tend to be sequenced in regions with advanced research capacity, and most sequences were reported during March-April 2020. These and other sampling biases pose challenges to GWAS of SARS-CoV-2. In this work, several statistical and computational methods to mitigate the sampling biases in SARS-CoV-2 data are proposed.
[1] 1400711 31
# A tibble: 6 x 31
X1 `Virus name` Type `Accession ID` `Collection dat~ `Additional loca~
<dbl> <chr> <chr> <chr> <date> <chr>
1 1 hCoV-19/Czech ~ betac~ EPI_ISL_426889 2020-04-02 <NA>
2 2 hCoV-19/USA/NY~ betac~ EPI_ISL_426617 2020-04-01 <NA>
3 3 hCoV-19/USA/NY~ betac~ EPI_ISL_426618 2020-04-01 <NA>
4 4 hCoV-19/USA/NY~ betac~ EPI_ISL_426619 2020-04-01 <NA>
5 5 hCoV-19/USA/NY~ betac~ EPI_ISL_426620 2020-04-01 <NA>
6 6 hCoV-19/USA/NY~ betac~ EPI_ISL_426621 2020-04-01 <NA>
# ... with 25 more variables: Sequence length <dbl>, Host <chr>,
# Patient age <chr>, Gender <chr>, Clade <chr>, Pango lineage <chr>,
# Pangolin version <date>, Variant <lgl>, AA Substitutions <chr>,
# Submission date <date>, Is reference? <lgl>, Is complete? <lgl>,
# Is high coverage? <lgl>, Is low coverage? <lgl>, N-Content <dbl>,
# GC-Content <dbl>, date <date>, GISAID_clade <chr>, region <chr>,
# country <chr>, date2 <date>, monthDate <date>, weekDate <date>, week <dbl>,
# month <dbl>
Month p-Value DF
1 1 1.654640e-14 6
2 2 6.195658e-06 6
3 3 3.615681e-03 4
4 4 9.242295e-04 5
5 5 8.581578e-03 4
6 6 7.523235e-03 4
7 7 2.805183e-03 4
8 8 3.130698e-12 4
9 9 3.780490e-98 6
10 10 1.267388e-42 6
11 11 2.551189e-61 6
12 12 1.956698e-14 6
13 13 5.003433e-07 4
Week p-Value DF
1 1 2.368208e-01 6
2 2 2.224464e-07 6
3 3 3.799037e-01 6
4 4 3.297999e-05 6
5 5 1.449028e-03 4
6 6 5.623520e-02 3
7 7 4.017306e-04 4
8 8 2.119456e-01 4
9 9 5.524509e-04 4
10 10 3.001509e-01 3
11 14 1.063090e-01 3
12 20 4.253254e-02 3
13 21 8.280664e-01 3
14 22 4.587896e-01 3
15 23 3.557941e-01 3
16 24 4.113412e-02 3
17 25 7.778238e-01 3
18 26 4.752911e-01 3
19 27 5.824367e-03 3
20 28 2.482789e-01 3
21 29 1.615003e-03 3
22 30 2.474218e-01 3
23 31 8.528808e-01 3
24 32 1.675442e-03 3
25 33 4.964944e-03 4
26 34 6.918604e-05 4
27 35 1.925310e-04 4
28 36 5.671213e-02 4
29 37 1.105719e-06 4
30 38 1.340265e-08 4
31 39 4.930772e-02 4
32 40 6.122400e-06 4
33 41 2.318575e-09 4
34 42 1.512491e-06 4
35 43 5.830431e-09 5
36 44 1.620449e-09 4
37 45 1.585289e-07 5
38 46 2.089959e-11 5
39 47 7.323962e-58 5
40 48 9.358187e-07 5
41 49 8.841524e-04 5
42 50 9.093478e-06 4
43 51 3.190217e-02 4
44 52 4.909901e-03 4
45 53 1.966564e-03 3
46 54 6.631603e-04 3
47 55 5.190509e-02 3
48 56 7.212334e-01 3
49 57 2.992473e-09 3
Continent p-Value DF
1 Africa 1.346341e-08 7
2 Asia 3.220917e-29 8
3 Europe 4.850537e-19 8
4 NorthAmerica 3.639654e-88 5
5 Oceania 2.760799e-39 5
6 SouthAmerica 2.169172e-14 5
Country p-Value DF
1 Angola 0.45878964 3
2 Argentina 0.73342246 3
3 Aruba 0.66950875 3
4 Austria 0.63125809 4
5 Bahrain 0.84358306 5
6 Bangladesh 1.00000000 3
7 Belgium 0.71676002 4
8 BosniaandHerzegovina 0.45042658 4
9 BurkinaFaso 0.49740448 3
10 Cambodia 0.83592179 4
11 Cameroon 0.75581481 4
12 Canada 0.09933304 3
13 Chile 0.42349992 3
14 China 0.50366827 4
15 Colombia 0.35360179 3
16 CostaRica 0.53417724 4
17 Coted'Ivoire 0.64762268 4
18 Croatia 0.92456082 4
19 Curacao 0.17390128 3
20 Cyprus 0.53681431 4
21 CzechRepublic 0.56076560 4
22 DemocraticRepublicoftheCongo 0.19104576 3
23 Denmark 0.27529913 4
24 Ecuador 0.28003264 3
25 Egypt 0.47057266 5
26 Estonia 0.93677844 3
27 Finland 0.82083597 5
28 France 0.71966918 4
29 Gambia 0.61598170 3
30 Germany 0.45299261 4
31 Ghana 0.27301011 4
32 Gibraltar 0.78109893 3
33 Greece 0.16063801 4
34 Guadeloupe 0.92830304 3
35 Hungary 0.23107824 4
36 Iceland 0.10595978 3
37 India 0.04625015 4
38 Indonesia 0.71278093 4
39 Iran 0.66262727 4
40 Iraq 0.30572941 4
41 Ireland 0.21749440 3
42 Israel 0.26752595 3
43 Italy 0.90979599 4
44 Kazakhstan 0.02905894 4
45 Kenya 0.03506043 4
46 Lithuania 0.19272480 3
47 Luxembourg 0.06652308 4
48 Malaysia 0.15515427 3
49 Mauritius 0.26146413 3
50 Morocco 0.43374900 4
51 Netherlands 0.07913008 4
52 NewZealand 0.70523331 4
53 Nigeria 0.08484479 4
54 NorthMacedonia 0.56716513 4
55 Norway 0.80633919 4
56 Oman 0.16727738 3
57 Pakistan 0.39426448 4
58 Portugal 0.61018711 4
59 Qatar 0.83642845 5
60 RepublicoftheCongo 0.75484983 3
61 Romania 0.14325829 4
62 Rwanda 0.46252052 3
63 SaudiArabia 0.34434784 4
64 Senegal 0.40379846 5
65 Serbia 0.54329125 3
66 Singapore 0.01241423 4
67 Slovakia 0.62744808 3
68 Slovenia 0.10377716 3
69 SouthKorea 0.44592170 3
70 Spain 0.26327399 4
71 SriLanka 0.84914535 4
72 Suriname 0.73575888 4
73 Sweden 0.75873590 4
74 Switzerland 0.15458730 4
75 Taiwan 0.64776109 7
76 Thailand 0.50217069 4
77 Togo 0.02800535 4
78 Tunisia 0.76128480 4
79 Turkey 0.13197750 4
80 Uganda 0.72004823 3
81 UnitedArabEmirates 0.60974498 4
82 UnitedKingdom 0.64687550 4
83 USA 0.01242136 3
Month p-Value DF
1 1 2.315282e-32 6
2 2 2.560532e-13 6
3 3 1.392890e-24 6
4 4 7.337918e-49 5
5 5 6.933985e-15 5
6 6 2.317907e-09 4
7 7 4.193184e-09 4
8 8 2.163757e-46 5
9 9 6.064160e-158 6
10 10 3.674524e-125 6
11 11 4.026039e-113 6
12 12 1.135786e-59 6
13 13 2.184995e-24 5
Week p-Value DF
1 1 3.148633e-04 6
2 2 1.443161e-08 6
3 3 7.977575e-11 6
4 4 3.121344e-06 6
5 5 1.894588e-05 6
6 6 1.444988e-04 5
7 7 2.853322e-04 4
8 8 6.546771e-03 4
9 9 6.257266e-09 4
10 10 7.743987e-02 4
11 11 2.052983e-06 4
12 12 3.793150e-15 4
13 13 1.608997e-06 4
14 14 3.442165e-02 3
15 15 2.179858e-04 3
16 18 7.940317e-03 3
17 19 5.449143e-01 3
18 20 1.197949e-01 3
19 21 3.624162e-01 4
20 22 3.749875e-01 3
21 23 3.761749e-01 3
22 24 3.555137e-02 3
23 25 5.158688e-02 3
24 26 1.422558e-02 3
25 27 6.969311e-05 3
26 28 2.781476e-01 3
27 29 1.353555e-07 4
28 30 5.021056e-05 3
29 31 1.722020e-02 3
30 32 7.581415e-08 4
31 33 5.068218e-07 4
32 34 2.482507e-08 4
33 35 4.483360e-04 4
34 36 3.319017e-10 4
35 37 8.109786e-09 4
36 38 4.836214e-18 4
37 39 4.511623e-10 5
38 40 1.663799e-21 5
39 41 3.304054e-12 5
40 42 2.335019e-15 6
41 43 1.595251e-33 6
42 44 9.828228e-20 5
43 45 8.335500e-51 6
44 46 2.883698e-14 6
45 47 1.772866e-97 6
46 48 8.876446e-05 6
47 49 9.060032e-09 5
48 50 8.793899e-15 5
49 51 8.602499e-13 5
50 52 1.507619e-10 4
51 53 1.328952e-06 4
52 54 2.632379e-06 4
53 55 5.472436e-04 3
54 56 1.432595e-06 3
55 57 6.360602e-06 3
Continent p-Value DF
1 Africa 1.753342e-31 7
2 Asia 1.503543e-56 8
3 Europe 7.443628e-45 8
4 NorthAmerica 6.244032e-162 5
5 Oceania 2.269244e-53 6
6 SouthAmerica 1.368290e-31 6
Country p-Value DF
1 Angola 0.412403258 5
2 Argentina 0.738474603 3
3 Aruba 0.821544305 4
4 Australia 0.529694509 4
5 Austria 0.002064464 5
6 Bahrain 0.872706357 5
7 Bangladesh 0.154841583 4
8 Belgium 0.285203142 5
9 BosniaandHerzegovina 0.714560106 5
10 Bulgaria 0.684184844 4
11 BurkinaFaso 0.802372190 3
12 Cambodia 0.613952403 5
13 Cameroon 0.944877365 5
14 Canada 0.883151853 3
15 Chile 0.358814024 3
16 China 0.769445621 5
17 Colombia 0.780119905 3
18 CostaRica 0.795916015 4
19 Coted'Ivoire 0.520328569 4
20 Croatia 0.535765547 4
21 Curacao 0.507424711 4
22 Cyprus 0.855695198 4
23 CzechRepublic 0.592521414 4
24 DemocraticRepublicoftheCongo 0.741038889 3
25 Denmark 0.861587577 4
26 Ecuador 0.559247323 5
27 Egypt 0.433137968 5
28 Estonia 0.079079614 4
29 Finland 0.765701451 5
30 France 0.735636631 6
31 FrenchGuiana 0.789731668 3
32 Gambia 0.145923694 4
33 Germany 0.898179130 5
34 Ghana 0.136788957 5
35 Gibraltar 0.607757176 4
36 Greece 0.569184171 4
37 Guadeloupe 0.560364164 4
38 HongKong 0.038574072 3
39 Hungary 0.498439266 4
40 Iceland 0.693540815 4
41 India 0.733569137 4
42 Indonesia 0.429703019 4
43 Iran 0.255877620 4
44 Iraq 0.666489701 4
45 Ireland 0.393149218 4
46 Israel 0.578582677 4
47 Italy 0.767505195 5
48 Jordan 0.628962571 4
49 Kazakhstan 0.947193324 4
50 Kenya 0.071726945 5
51 Latvia 0.641195568 4
52 Lithuania 0.125200668 3
53 Luxembourg 0.738981716 4
54 Malaysia 0.656730911 5
55 Mauritius 0.751370710 5
56 Mayotte 0.499869772 3
57 Mexico 0.133145889 4
58 Morocco 0.854553325 4
59 Mozambique 0.521627741 3
60 Netherlands 0.528561321 4
61 NewZealand 0.385701146 6
62 Nigeria 0.997567264 4
63 NorthernMarianaIslands 0.889859868 4
64 NorthMacedonia 0.475348671 4
65 Norway 0.642466625 4
66 Oman 0.947457534 3
67 Pakistan 0.510154372 5
68 Paraguay 0.793447116 3
69 Philippines 0.769412756 4
70 Poland 0.048849194 3
71 Portugal 0.418879184 4
72 Qatar 0.738598080 5
73 RepublicoftheCongo 0.956945924 3
74 Reunion 0.872083333 3
75 Romania 0.702950812 5
76 Rwanda 0.612864644 3
77 SaudiArabia 0.163470592 4
78 Senegal 0.990820576 6
79 Serbia 0.623452656 4
80 Singapore 0.649301094 5
81 SintMaarten 0.787022372 3
82 Slovakia 0.949395645 3
83 Slovenia 0.502415904 3
84 SouthKorea 0.909646571 3
85 Spain 0.160484970 6
86 SriLanka 0.277757310 4
87 Suriname 0.593413253 4
88 Sweden 0.299479049 4
89 Switzerland 0.519657018 4
90 Taiwan 0.903276297 7
91 Thailand 0.568599998 5
92 Togo 0.711941825 4
93 TrinidadandTobago 0.700334086 3
94 Tunisia 0.824178410 4
95 Turkey 0.135010085 4
96 Uganda 0.160345077 4
97 Ukraine 0.981225804 3
98 UnitedArabEmirates 0.631874261 5
99 UnitedKingdom 0.451041490 4
100 USA 0.728756190 3
101 Zambia 1.000000000 3
Month p-Value DF
1 1 1.183292e-224 6
2 2 2.961689e-71 6
3 3 6.268041e-124 7
4 4 1.872053e-146 7
5 5 4.272289e-249 6
6 6 0.000000e+00 5
7 7 1.254006e-256 6
8 8 0.000000e+00 6
9 9 0.000000e+00 6
10 10 0.000000e+00 7
11 11 0.000000e+00 6
12 12 7.886256e-140 7
13 13 1.254577e-113 6
14 14 7.237216e-12 3
Week p-Value DF
1 1 1.741785e-29 6
2 2 4.169705e-63 6
3 3 2.926284e-38 6
4 4 1.314584e-80 6
5 5 3.733575e-21 6
6 6 1.513699e-25 6
7 7 1.629846e-15 6
8 8 3.231018e-15 6
9 9 1.666734e-20 6
10 10 1.009987e-10 6
11 11 6.466170e-29 6
12 12 3.646701e-78 5
13 13 1.312207e-52 6
14 14 6.508301e-131 4
15 15 2.127646e-14 4
16 16 6.257602e-09 5
17 17 1.048050e-69 5
18 18 2.227441e-26 5
19 19 1.254421e-06 4
20 20 1.369052e-280 5
21 21 1.387302e-02 4
22 22 3.115603e-27 5
23 23 1.216882e-72 5
24 24 4.637465e-09 4
25 25 1.467601e-09 4
26 26 2.167824e-07 4
27 27 5.646454e-17 4
28 28 1.496538e-18 4
29 29 1.603255e-16 4
30 30 9.755995e-11 4
31 31 2.312134e-06 5
32 32 1.273501e-16 4
33 33 3.250696e-37 5
34 34 6.389566e-23 5
35 35 9.953590e-155 5
36 36 6.525695e-118 5
37 37 5.931354e-129 6
38 38 2.049733e-197 6
39 39 0.000000e+00 6
40 40 1.780454e-168 6
41 41 1.308159e-109 6
42 42 1.170157e-220 6
43 43 4.377060e-131 6
44 44 3.854736e-149 6
45 45 5.734014e-160 6
46 46 2.578503e-228 6
47 47 0.000000e+00 6
48 48 2.646403e-67 6
49 49 3.129490e-39 6
50 50 4.605544e-60 6
51 51 9.483241e-57 6
52 52 9.561212e-13 6
53 53 2.464960e-28 6
54 54 2.250829e-49 6
55 55 2.531055e-14 5
56 56 4.575516e-25 5
57 57 2.503096e-23 3
Continent p-Value DF
1 Africa 6.650859e-156 8
2 Asia 0.000000e+00 8
3 Europe 2.191332e-231 8
4 NorthAmerica 0.000000e+00 8
5 Oceania 0.000000e+00 7
6 SouthAmerica 5.335575e-193 8
Country p-Value DF
1 Angola 0.6515359976 6
2 Argentina 0.4709293725 3
3 Armenia 0.9695707706 4
4 Aruba 0.2443370966 4
5 Australia 0.4663553533 6
6 Austria 0.4998086298 5
7 Bahrain 0.9455674179 7
8 Bangladesh 0.3633510042 5
9 Belgium 0.1181876382 6
10 BosniaandHerzegovina 0.5232293464 5
11 Botswana 0.3349677938 3
12 Brazil 0.6309425876 4
13 Bulgaria 0.9094949820 4
14 BurkinaFaso 0.9165174649 3
15 Cambodia 0.9185984322 5
16 Cameroon 0.2276476216 5
17 Canada 0.7618381671 5
18 Chile 0.9075064225 3
19 China 0.5715929940 5
20 Colombia 0.8443503245 4
21 CostaRica 0.0323832030 4
22 Coted'Ivoire 0.6503016037 5
23 Croatia 0.9391483327 4
24 Curacao 0.6748335982 4
25 Cyprus 0.5333418807 4
26 CzechRepublic 0.5170781727 5
27 DemocraticRepublicoftheCongo 0.4859151767 6
28 Denmark 0.8036216038 4
29 Ecuador 0.7012853551 5
30 Egypt 0.8725233264 5
31 EquatorialGuinea 0.8979161202 4
32 Estonia 0.6421887198 4
33 Finland 0.2258860459 5
34 France 0.1724944703 6
35 FrenchGuiana 0.5608053332 5
36 Gabon 0.5285727305 4
37 Gambia 0.2723578866 4
38 Germany 0.9192934735 5
39 Ghana 0.0078597525 6
40 Gibraltar 0.2530656247 4
41 Greece 0.7237299844 5
42 Guadeloupe 0.8191882378 4
43 HongKong 0.0837983901 5
44 Hungary 0.2092241513 4
45 Iceland 0.7919381898 7
46 India 0.0638228240 6
47 Indonesia 0.5107744600 6
48 Iran 0.0004395766 5
49 Iraq 0.0020308386 4
50 Ireland 0.8285258552 5
51 Israel 0.3551945409 4
52 Italy 0.0571963239 6
53 Japan 0.3785975940 3
54 Jordan 0.8682199786 5
55 Kazakhstan 0.5188388984 5
56 Kenya 0.5598834278 5
57 Latvia 0.9780187218 5
58 Lithuania 0.7924127824 5
59 Luxembourg 0.4624086501 4
60 Malawi 0.5397582095 3
61 Malaysia 0.9097120446 5
62 Mauritius 0.9531200562 5
63 Mayotte 0.8708710247 4
64 Mexico 0.4653209588 4
65 Morocco 0.8219462953 5
66 Mozambique 0.0187155567 3
67 Netherlands 0.6760835057 4
68 NewZealand 0.0442937843 8
69 Nigeria 0.6954171588 6
70 NorthernMarianaIslands 0.5045974485 4
71 NorthMacedonia 0.0854486252 5
72 Norway 0.5893702339 4
73 Oman 0.5399916029 6
74 Pakistan 0.9174697489 7
75 Panama 0.2338557450 4
76 Paraguay 0.3377765823 3
77 Peru 0.0053265799 4
78 Philippines 0.8277767556 4
79 Poland 0.6971863002 4
80 Portugal 0.5717702566 5
81 Qatar 0.0331219657 5
82 RepublicoftheCongo 0.3763359201 5
83 Reunion 0.7729258540 4
84 Romania 0.9854825339 5
85 Russia 0.9143103372 4
86 Rwanda 0.8455949627 6
87 SaudiArabia 0.7084726210 4
88 Senegal 0.2061907250 7
89 Serbia 0.6458574560 4
90 Singapore 0.3207904259 6
91 SintMaarten 0.0144812761 4
92 Slovakia 0.1688736325 4
93 Slovenia 0.3174657457 5
94 SouthAfrica 0.9860278217 4
95 SouthKorea 0.7216906609 6
96 Spain 0.1215741626 6
97 SriLanka 0.5333081871 5
98 Suriname 0.8570321342 5
99 Sweden 0.8730205463 4
100 Switzerland 0.0821408528 5
101 Taiwan 0.5005912983 7
102 Thailand 0.1027226911 6
103 Togo 0.8467004756 4
104 TrinidadandTobago 0.4665183446 3
105 Tunisia 0.9432945044 5
106 Turkey 0.3198341434 6
107 Uganda 0.5111953911 5
108 Ukraine 0.0600657116 5
109 UnitedArabEmirates 0.9349687846 7
110 UnitedKingdom 0.7115848299 5
111 Uruguay 0.1424328895 3
112 USA 0.3097782128 5
113 Zambia 0.4357704330 3
114 Zimbabwe 0.5210815886 4
Month p-Value DF
1 1 0.000000e+00 6
2 2 6.171736e-187 7
3 3 1.810998e-287 7
4 4 4.720677e-276 7
5 5 0.000000e+00 7
6 6 0.000000e+00 5
7 7 0.000000e+00 6
8 8 0.000000e+00 6
9 9 0.000000e+00 6
10 10 0.000000e+00 7
11 11 0.000000e+00 7
12 12 0.000000e+00 7
13 13 4.811437e-263 6
14 14 3.416967e-26 3
Week p-Value DF
1 1 2.786640e-64 6
2 2 2.441660e-202 6
3 3 3.146641e-81 6
4 4 4.902682e-147 6
5 5 9.543251e-57 6
6 6 1.563405e-69 6
7 7 1.728113e-34 6
8 8 3.430324e-37 7
9 9 3.547748e-74 6
10 10 4.532506e-22 7
11 11 8.120484e-54 6
12 12 1.277179e-139 6
13 13 3.061146e-106 6
14 14 1.063466e-120 7
15 15 1.981572e-54 6
16 16 7.957999e-19 6
17 17 3.755110e-88 5
18 18 2.286600e-132 5
19 19 2.832595e-41 5
20 20 0.000000e+00 5
21 21 8.055207e-103 5
22 22 4.162990e-178 5
23 23 6.366328e-232 5
24 24 0.000000e+00 5
25 25 1.293235e-22 4
26 26 8.764413e-16 4
27 27 1.652961e-47 4
28 28 3.944140e-117 4
29 29 2.443328e-55 4
30 30 8.124950e-22 5
31 31 2.153270e-203 6
32 32 4.501931e-54 5
33 33 1.324636e-196 6
34 34 1.045446e-70 5
35 35 0.000000e+00 6
36 36 0.000000e+00 6
37 37 0.000000e+00 6
38 38 0.000000e+00 6
39 39 0.000000e+00 6
40 40 0.000000e+00 6
41 41 6.096042e-253 6
42 42 0.000000e+00 6
43 43 0.000000e+00 7
44 44 0.000000e+00 7
45 45 0.000000e+00 6
46 46 0.000000e+00 6
47 47 0.000000e+00 7
48 48 5.537390e-161 6
49 49 1.417578e-100 7
50 50 2.284104e-212 6
51 51 1.165777e-165 6
52 52 1.621691e-70 6
53 53 2.321790e-56 6
54 54 7.611955e-90 6
55 55 4.548222e-61 6
56 56 2.855869e-81 6
57 57 7.204460e-130 5
Continent p-Value DF
1 Africa 0 8
2 Asia 0 8
3 Europe 0 8
4 NorthAmerica 0 8
5 Oceania 0 8
6 SouthAmerica 0 8
Country p-Value DF
1 Angola 0.594795623 6
2 Argentina 0.010495577 4
3 Armenia 0.694201799 4
4 Aruba 0.635130965 5
5 Australia 0.836591512 7
6 Austria 0.355419547 6
7 Bahrain 0.785129415 7
8 Bangladesh 0.231462230 6
9 Belgium 0.244447986 7
10 BosniaandHerzegovina 0.986432694 5
11 Botswana 0.375415974 3
12 Brazil 0.753199760 4
13 Bulgaria 0.262446204 5
14 BurkinaFaso 0.093257121 3
15 Cambodia 0.321339802 5
16 Cameroon 0.035997682 5
17 Canada 0.635171618 5
18 Chile 0.717708104 5
19 China 0.106332435 5
20 Colombia 0.004513380 8
21 CostaRica 0.983640980 4
22 Coted'Ivoire 0.450753467 5
23 Croatia 0.031941182 4
24 Curacao 0.397503348 4
25 Cyprus 0.768321572 4
26 CzechRepublic 0.224972259 5
27 DemocraticRepublicoftheCongo 0.809700374 6
28 Denmark 0.332331885 5
29 Ecuador 0.294707867 5
30 Egypt 0.077160181 5
31 EquatorialGuinea 0.757858665 4
32 Estonia 0.779790842 4
33 Finland 0.080879888 5
34 France 0.675545836 6
35 FrenchGuiana 0.742715229 5
36 Gabon 0.212445349 4
37 Gambia 0.939151901 5
38 Germany 0.110950752 6
39 Ghana 0.360899925 6
40 Gibraltar 0.355541871 4
41 Greece 0.481174900 6
42 Guadeloupe 0.077706108 4
43 HongKong 0.809100522 6
44 Hungary 0.757821349 4
45 Iceland 0.666960250 7
46 India 0.379573609 7
47 Indonesia 0.623342518 6
48 Iran 0.165828494 5
49 Iraq 0.880576621 4
50 Ireland 0.025456029 6
51 Israel 0.633307812 5
52 Italy 0.044558542 6
53 Japan 0.524990488 3
54 Jordan 0.789061977 6
55 Kazakhstan 0.642360921 5
56 Kenya 0.748827177 6
57 Latvia 0.221239941 5
58 Lithuania 0.228656361 5
59 Luxembourg 0.421781294 5
60 Malawi 0.979093888 3
61 Malaysia 0.926186929 8
62 Mauritius 0.880321742 5
63 Mayotte 0.384108554 5
64 Mexico 0.688997006 5
65 Morocco 0.848492390 5
66 Mozambique 0.332423972 3
67 Netherlands 0.954842186 6
68 NewZealand 0.128838714 8
69 Nigeria 0.484197599 7
70 NorthernMarianaIslands 0.264531103 4
71 NorthMacedonia 0.356310749 5
72 Norway 0.667349054 5
73 Oman 0.966613008 6
74 Pakistan 0.835673155 7
75 Panama 0.672052774 5
76 Paraguay 0.717206695 3
77 Peru 0.001008969 5
78 Philippines 0.535583339 4
79 Poland 0.744951208 5
80 Portugal 0.730978045 7
81 Qatar 0.157973991 7
82 RepublicoftheCongo 0.674215085 5
83 Reunion 0.343431313 4
84 Romania 0.993120672 5
85 Russia 0.221017827 5
86 Rwanda 0.174758671 6
87 SaudiArabia 0.660730948 4
88 Senegal 0.591412820 7
89 Serbia 0.174787973 4
90 Singapore 0.046677409 8
91 SintMaarten 0.247445086 4
92 Slovakia 0.515574270 6
93 Slovenia 0.853984441 5
94 SouthAfrica 0.634627571 4
95 SouthKorea 0.631480839 7
96 Spain 0.742953301 7
97 SriLanka 0.535609743 5
98 Suriname 0.130620836 5
99 Sweden 0.712196383 4
100 Switzerland 0.024884613 5
101 Taiwan 0.045372977 7
102 Thailand 0.630499094 7
103 Togo 0.130059651 4
104 TrinidadandTobago 0.778130330 3
105 Tunisia 0.034225547 5
106 Turkey 0.301188326 7
107 Uganda 0.403145929 5
108 Ukraine 0.465363576 5
109 UnitedArabEmirates 0.228151650 7
110 UnitedKingdom 0.526859530 8
111 Uruguay 0.474548289 3
112 USA 0.373792870 5
113 Zambia 0.877798904 3
114 Zimbabwe 0.283869873 4
---
title: "COVID-19 Data Resampling"
author: "Landon Kehr"
output:
flexdashboard::flex_dashboard:
theme: cosmo
orientation: columns
social: ["facebook", "twitter", "linkedin"]
source_code: embed
vertical_layout: scroll
---
```{r setup, include=FALSE}
#2.1
#rm(list=ls())
library(tidyverse)
library(dplyr)
library(ROSE)
library(RColorBrewer)
library(Rfast)
library(ggplot2)
library(lubridate)
library(flexdashboard)
#list.files("msa_1007")
#setwd(getwd()) #set working directory to source directory
```
# Introduction
## Column {.tabset data-width=600}
### Abstract
**Addressing sample biases in genome-wide association study for SARS-CoV-2**
Monitoring adaptive changes in SARS-CoV-2 is critical to mitigating its transmission. Many studies show that viral genomic mutations play a key role in the propagation of SARS-CoV-2. A genome-wide association study (GWAS) is a standard method for studying the association between genotype and phenotypic measures. However, GWAS typically requires random sampling. Currently, SARS-CoV-2 isolates tend to be sequenced in regions with advanced research capacity, and most sequences were reported during March-April 2020. These and other sampling biases pose challenges to GWAS of SARS-CoV-2.
In this work, several statistical and computational methods to mitigate the sampling biases in SARS-CoV-2 data are proposed.
```{r warning=FALSE}
tb <- read_csv("tb.csv")
```
```{r}
dim(tb)
```
```{r}
#Filtering out countries that have fewer than 50 samples
group <- tb %>% group_by(country) %>% summarize(N=n()) %>% filter(N>=50)
#sum(group$N)
#group
cleaned <- tb %>% filter(country %in% group$country)
```
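The filter above keeps only countries with at least 50 sequences. A minimal sketch of the same idea on made-up data follows; the `toy` tibble, its counts, and the `min_n` name are illustrative assumptions and do not come from `tb.csv`. It uses `add_count()` so the per-country size stays attached to each row while filtering.

```r
library(dplyr)

# Hypothetical toy data: country "A" has 60 rows, country "B" has 10 rows.
toy <- tibble(
  country = rep(c("A", "B"), times = c(60, 10)),
  GISAID_clade = rep(c("G", "GR"), length.out = 70)
)

min_n <- 50  # same threshold as the chunk above

# add_count() appends the group size as column `n`;
# filter on it, then drop the helper column again.
toy_cleaned <- toy %>%
  add_count(country) %>%
  filter(n >= min_n) %>%
  select(-n)

unique(toy_cleaned$country)
nrow(toy_cleaned)
```

Only country "A" survives here, because "B" falls below the threshold.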
### Summary of the data
```{r}
head(cleaned)
```
## Column {.tabset data-width=600}
### Original Data Set
```{r}
#Bar graph of the original (unfiltered) data set grouped by strain clade determined by GISAID
originalBar <- ggplot(tb,aes(x=GISAID_clade)) +
geom_bar()
originalBar
```
### Undersample of 20
```{r}
#Create an undersample of 20 data points from each country
set.seed(2)
undersample20 <- cleaned %>% group_by(country) %>% sample_n(20)
under20Bar <- ggplot(undersample20, aes(x=GISAID_clade)) +
geom_bar()
under20Bar
```
### Undersample of 40
```{r}
#Create an undersample of 40 data points from each country
set.seed(1)
undersample40 <- cleaned %>% group_by(country) %>% sample_n(40)
under40Bar <- ggplot(undersample40, aes(x=GISAID_clade)) +
geom_bar()
under40Bar
```
### Oversample of 200
```{r}
#Create an oversample of 200 data points from each country
set.seed(3)
oversample200 <- cleaned %>% group_by(country) %>% slice_sample(n=200,replace=TRUE)
over200Bar <- ggplot(oversample200, aes(x=GISAID_clade)) +
geom_bar()
over200Bar
```
### Oversample of 500
```{r}
#Create an oversample of 500 data points from each country
set.seed(4)
oversample500 <- cleaned %>% group_by(country) %>% slice_sample(n=500,replace=TRUE)
over500Bar <- ggplot(oversample500, aes(x=GISAID_clade)) +
geom_bar()
over500Bar
```
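The four resamples above differ only in the per-country sample size and in whether rows are drawn with replacement. A minimal sketch on made-up data of how grouped `slice_sample()` behaves in both modes; the `toy` tibble and its group sizes are assumptions for illustration only.

```r
library(dplyr)

set.seed(1)

# Hypothetical toy data: 60 rows for country "A", 300 rows for country "B".
toy <- tibble(
  country = rep(c("A", "B"), times = c(60, 300)),
  GISAID_clade = sample(c("G", "GH", "GR"), 360, replace = TRUE)
)

# Undersample: 40 distinct rows per country (no replacement).
under <- toy %>%
  group_by(country) %>%
  slice_sample(n = 40) %>%
  ungroup()

# Oversample: 200 rows per country, drawn with replacement so that a
# small country can contribute more rows than it actually has.
over <- toy %>%
  group_by(country) %>%
  slice_sample(n = 200, replace = TRUE) %>%
  ungroup()

count(under, country)
count(over, country)
```

Every country ends up with the same number of rows in each resample, which is the point of the balancing step. Oversampling repeats rows, so the oversampled sets contain duplicates by construction.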
# Something {.hidden}
## Column {.tabset}
#By Month
##Prop Table
```{r}
#Original Data Set Proportion Table by Month
mainCladeTableMonth <- table(tb$GISAID_clade, tb$month)
#mainCladeTableMonth
mainPropTableMonth <- as.data.frame(prop.table(mainCladeTableMonth,2)) %>%
mutate("Clade"=Var1,
"Month"=Var2,
"Frequency"=Freq*100) %>%
select(Clade,Month,Frequency)
#mainPropTableMonth
```
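The same table-to-percentage pattern recurs for every grouping variable and every resample. A minimal sketch of a reusable helper follows; `clade_prop_table()` is a hypothetical name introduced here, not a function from this project, and the toy data are made up.

```r
# Hypothetical helper: percentage of each clade within each level of `by`.
clade_prop_table <- function(df, by, label) {
  counts <- table(df$GISAID_clade, df[[by]])
  # margin = 2 normalises each column, so every level of `by` sums to 1.
  out <- as.data.frame(prop.table(counts, 2))
  names(out) <- c("Clade", label, "Frequency")
  out$Frequency <- out$Frequency * 100  # proportions to percentages
  out
}

# Made-up data: three sequences in each of two months.
toy <- data.frame(
  GISAID_clade = c("G", "G", "GR", "GR", "GR", "G"),
  month = c(1, 1, 1, 2, 2, 2)
)

toy_props <- clade_prop_table(toy, by = "month", label = "Month")
toy_props
```

With such a helper, each proportion table would become a single call, for example `clade_prop_table(tb, "week", "Week")`.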
##Frequency Plot
```{r}
#Original Data Set Frequency Percentage Plot Month
originalDateMonth <- ggplot(mainPropTableMonth,aes(x=Month,y=Frequency,fill=Clade, width=6)) +
geom_col() +
scale_x_discrete(limits=mainPropTableMonth$Month) +
scale_fill_brewer(palette="Paired") +
theme_bw()
#originalDateMonth
```
#By Week
##Prop Table
```{r}
#Original Data Set Proportion Table by Week
mainCladeTableWeek <- table(tb$GISAID_clade, tb$week)
mainPropTableWeek <- as.data.frame(prop.table(mainCladeTableWeek,2)) %>%
mutate("Clade"=Var1,
"Week"=Var2,
"Frequency"=Freq*100) %>%
select(Clade,Week,Frequency)
#mainPropTableWeek
```
##Frequency Plot
```{r}
#Original Data Set Frequency Percentage Plot Week
originalDateWeek <- ggplot(mainPropTableWeek,aes(x=Week,y=Frequency,fill=Clade,width=6)) +
geom_col() +
scale_x_discrete(limits=mainPropTableWeek$Week) +
scale_fill_brewer(palette="Paired") +
theme_bw()
#originalDateWeek
```
#By Continent
##Prop Table
```{r}
#Original Data Set Proportion Table by Continent
mainCladeTableLocCont <- table(tb$GISAID_clade, tb$region)
mainPropTableLocCont <- as.data.frame(prop.table(mainCladeTableLocCont,2)) %>%
mutate("Clade"=Var1,
"Continent"=Var2,
"Frequency"=Freq*100) %>%
select(Clade,Continent,Frequency)
#mainPropTableLocCont
```
##Frequency Plot
```{r}
#Original Data Set Percentage by Continent
originalCont <- ggplot(mainPropTableLocCont, aes(x="",y=Frequency,fill=Clade)) +
geom_bar(stat="identity",width=1) +
coord_polar("y", start=0) +
theme_void() +
scale_fill_brewer(palette="Paired") +
facet_wrap(~Continent)
#originalCont
```
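`facet_wrap()` is normally given a one-sided formula or `vars()`, so that the faceting variable is looked up by name in the plot data. A minimal sketch of the faceted pie chart on made-up percentages; the `toy_props` values are assumptions for illustration only.

```r
library(ggplot2)

# Hypothetical percentages: two continents, two clades each.
toy_props <- data.frame(
  Clade = c("G", "GR", "G", "GR"),
  Continent = c("Asia", "Asia", "Europe", "Europe"),
  Frequency = c(30, 70, 55, 45)
)

toy_pies <- ggplot(toy_props, aes(x = "", y = Frequency, fill = Clade)) +
  geom_col(width = 1) +
  coord_polar("y", start = 0) +  # turn each stacked bar into a pie
  theme_void() +
  scale_fill_brewer(palette = "Paired") +
  facet_wrap(~Continent)         # one pie per continent

toy_pies
```

`geom_col()` is equivalent to `geom_bar(stat = "identity")`, so either form draws the same chart.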
#By Country
##Prop Table
```{r}
#Original Data Set Proportion Table by Country
mainCladeTableLoc<- table(tb$GISAID_clade, tb$country)
mainPropTableLoc <- as.data.frame(prop.table(mainCladeTableLoc,2)) %>%
mutate("Clade"=Var1,
"Country"=Var2,
"Frequency"=Freq*100) %>%
select(Clade,Country,Frequency)
#mainPropTableLoc
```
##Frequency Plot
```{r}
#Original Data Set Percentage by Country
originalLoc <- ggplot(mainPropTableLoc, aes(x="",y=Frequency,fill=Clade)) +
geom_bar(stat="identity",width=1) +
coord_polar("y", start=0) +
theme_void() +
scale_fill_brewer(palette="Paired") +
facet_wrap(~Country)
#originalLoc
```
#Undersample 40
#By Month
##Prop Table
```{r}
#Undersample 40 Proportion Table by Month
under40DateTableMonth <- table(undersample40$GISAID_clade, undersample40$month)
under40PropTableDateMonth <- as.data.frame(prop.table(under40DateTableMonth,2)) %>%
mutate("Clade"=Var1,
"Month"=Var2,
"Frequency"=Freq*100) %>%
select(Clade,Month,Frequency)
#under40PropTableDateMonth
```
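To check whether a resample's clade mix in a given month differs from the original data, one option is a chi-square goodness-of-fit test. A minimal sketch on made-up counts follows. This is an assumption about a reasonable test, and not necessarily the procedure behind any p-values reported for this dashboard.

```r
# Hypothetical clade counts for one month in a resample.
sample_counts <- c(G = 30, GH = 25, GR = 45)

# Hypothetical clade proportions for the same month in the original data.
original_props <- c(G = 0.40, GH = 0.20, GR = 0.40)

# Goodness of fit: do the resampled counts match the original proportions?
fit <- chisq.test(x = sample_counts, p = original_props)

fit$statistic  # chi-square statistic
fit$parameter  # degrees of freedom = number of clades - 1
fit$p.value    # small values suggest the clade mix differs
```

Running many such tests, one per month, week, or country, calls for a multiple-testing correction such as `p.adjust()` before reading individual p-values.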
##Frequency Plot
```{r}
#Undersample 40 Frequency Percentage Plot Month
under40DateMonth <- ggplot(under40PropTableDateMonth,aes(x=Month,y=Frequency,fill=Clade, width=6)) +
geom_col() +
scale_x_discrete(limits=under40PropTableDateMonth$Month) +
scale_fill_brewer(palette="Paired") +
theme_bw()
#under40DateMonth
```
#By Week
##Prop Table
```{r}
#Undersample 40 Proportion Table by Week
under40DateTableWeek <- table(undersample40$GISAID_clade, undersample40$week)
under40PropTableDateWeek <- as.data.frame(prop.table(under40DateTableWeek,2)) %>%
mutate("Clade"=Var1,
"Week"=Var2,
"Frequency"=Freq*100) %>%
select(Clade,Week,Frequency)
#under40PropTableDateWeek
```
##Frequency Plot
```{r}
#Undersample 40 Frequency Percentage Plot Week
under40DateWeek <- ggplot(under40PropTableDateWeek,aes(x=Week,y=Frequency,fill=Clade,width=6)) +
geom_col() +
scale_x_discrete(limits=under40PropTableDateWeek$Week) +
scale_fill_brewer(palette="Paired") +
theme_bw()
#under40DateWeek
```
#By Continent
##Prop Table
```{r}
#Undersample 40 Proportion Table by Continent
under40CladeTableLocCont <- table(undersample40$GISAID_clade, undersample40$region)
under40PropTableLocCont <- as.data.frame(prop.table(under40CladeTableLocCont,2)) %>%
mutate("Clade"=Var1,
"Continent"=Var2,
"Frequency"=Freq*100) %>%
select(Clade,Continent,Frequency)
#under40PropTableLocCont
```
##Frequency Plot
```{r}
#Undersample 40 Percentage by Continent
under40Cont <- ggplot(under40PropTableLocCont, aes(x="",y=Frequency,fill=Clade)) +
geom_bar(stat="identity",width=1) +
coord_polar("y", start=0) +
theme_void() +
scale_fill_brewer(palette="Paired") +
facet_wrap(~Continent)
#under40Cont
```
#By Country
##Prop Table
```{r}
#Undersample 40 Proportion Table by Country
under40CladeTableLoc <- table(undersample40$GISAID_clade, undersample40$country)
under40PropTableLoc <- as.data.frame(prop.table(under40CladeTableLoc,2)) %>%
mutate("Clade"=Var1,
"Country"=Var2,
"Frequency"=Freq*100) %>%
select(Clade,Country,Frequency)
#under40PropTableLoc
```
##Frequency Plot
```{r}
#Undersample 40 Percentage by Country
under40Loc <- ggplot(under40PropTableLoc, aes(x="",y=Frequency,fill=Clade)) +
geom_bar(stat="identity",width=1) +
coord_polar("y", start=0) +
theme_void() +
scale_fill_brewer(palette="Paired") +
facet_wrap(~Country)
#under40Loc
```
#Undersample of 20
#By Month
##Prop Table
```{r}
#Undersample 20 Proportion Table by Month
under20DateTableMonth <- table(undersample20$GISAID_clade, undersample20$month)
under20PropTableDateMonth <- as.data.frame(prop.table(under20DateTableMonth,2)) %>%
mutate("Clade"=Var1,
"Month"=Var2,
"Frequency"=Freq*100) %>%
select(Clade,Month,Frequency)
#under20PropTableDateMonth
```
##Frequency Plot
```{r}
#Undersample 20 Frequency Percentage Plot Month
under20DateMonth <- ggplot(under20PropTableDateMonth,aes(x=Month,y=Frequency,fill=Clade, width=6)) +
geom_col() +
scale_x_discrete(limits=under20PropTableDateMonth$Month) +
scale_fill_brewer(palette="Paired") +
theme_bw()
#under20DateMonth
```
#By Week
##Prop Table
```{r}
#Undersample 20 Proportion Table by Week
under20DateTableWeek <- table(undersample20$GISAID_clade, undersample20$week)
under20PropTableDateWeek <- as.data.frame(prop.table(under20DateTableWeek,2)) %>%
mutate("Clade"=Var1,
"Week"=Var2,
"Frequency"=Freq*100) %>%
select(Clade,Week,Frequency)
#under20PropTableDateWeek
```
##Frequency Plot
```{r}
#Undersample 20 Frequency Percentage Plot Week
under20DateWeek <- ggplot(under20PropTableDateWeek,aes(x=Week,y=Frequency,fill=Clade,width=6)) +
geom_col() +
scale_x_discrete(limits=under20PropTableDateWeek$Week) +
scale_fill_brewer(palette="Paired") +
theme_bw()
#under20DateWeek
```
#By Continent
##Prop Table
```{r}
#Undersample 20 Proportion Table by Continent
under20CladeTableLocCont <- table(undersample20$GISAID_clade, undersample20$region)
under20PropTableLocCont <- as.data.frame(prop.table(under20CladeTableLocCont,2)) %>%
mutate("Clade"=Var1,
"Continent"=Var2,
"Frequency"=Freq*100) %>%
select(Clade,Continent,Frequency)
#under20PropTableLocCont
```
##Frequency Plot
```{r}
#Undersample 20 Percentage by Continent
under20Cont <- ggplot(under20PropTableLocCont, aes(x="",y=Frequency,fill=Clade)) +
geom_bar(stat="identity",width=1) +
coord_polar("y", start=0) +
theme_void() +
scale_fill_brewer(palette="Paired") +
facet_wrap(~Continent)
#under20Cont
```
#By Country
##Prop Table
```{r}
#Undersample 20 Proportion Table by Country
under20CladeTableLoc <- table(undersample20$GISAID_clade, undersample20$country)
under20PropTableLoc <- as.data.frame(prop.table(under20CladeTableLoc,2)) %>%
mutate("Clade"=Var1,
"Country"=Var2,
"Frequency"=Freq*100) %>%
select(Clade,Country,Frequency)
#under20PropTableLoc
```
##Frequency Plot
```{r}
#Undersample 20 Percentage by Country
under20Loc <- ggplot(under20PropTableLoc, aes(x="",y=Frequency,fill=Clade)) +
geom_bar(stat="identity",width=1) +
coord_polar("y", start=0) +
theme_void() +
scale_fill_brewer(palette="Paired") +
facet_wrap(~Country)
#under20Loc
```
#Oversample of 200
#By Month
##Prop Table
```{r}
#Oversample 200 Proportion Table by Month
over200DateTableMonth <- table(oversample200$GISAID_clade, oversample200$month)
over200PropTableDateMonth <- as.data.frame(prop.table(over200DateTableMonth,2)) %>%
mutate("Clade"=Var1,
"Month"=Var2,
"Frequency"=Freq*100) %>%
select(Clade,Month,Frequency)
#over200PropTableDateMonth
```
##Frequency Plot
```{r}
#Oversample 200 Frequency Percentage Plot Month
over200DateMonth <- ggplot(over200PropTableDateMonth,aes(x=Month,y=Frequency,fill=Clade, width=6)) +
geom_col() +
scale_x_discrete(limits=over200PropTableDateMonth$Month) +
scale_fill_brewer(palette="Paired") +
theme_bw()
#over200DateMonth
```
#By Week
##Prop Table
```{r}
#Oversample 200 Proportion Table by Week
over200DateTableWeek <- table(oversample200$GISAID_clade, oversample200$week)
over200PropTableDateWeek <- as.data.frame(prop.table(over200DateTableWeek,2)) %>%
mutate("Clade"=Var1,
"Week"=Var2,
"Frequency"=Freq*100) %>%
select(Clade,Week,Frequency)
#over200PropTableDateWeek
```
##Frequency Plot
```{r}
#Oversample 200 Frequency Percentage Plot Week
over200DateWeek <- ggplot(over200PropTableDateWeek,aes(x=Week,y=Frequency,fill=Clade,width=6)) +
geom_col() +
scale_x_discrete(limits=over200PropTableDateWeek$Week) +
scale_fill_brewer(palette="Paired") +
theme_bw()
#over200DateWeek
```
#By Continent
##Prop Table
```{r}
#Oversample 200 Proportion Table by Continent
over200CladeTableLocCont <- table(oversample200$GISAID_clade, oversample200$region)
over200PropTableLocCont <- as.data.frame(prop.table(over200CladeTableLocCont,2)) %>%
mutate("Clade"=Var1,
"Continent"=Var2,
"Frequency"=Freq*100) %>%
select(Clade,Continent,Frequency)
#over200PropTableLocCont
```
##Frequency Plot
```{r}
#Oversample 200 Percentage by Continent
over200Cont <- ggplot(over200PropTableLocCont, aes(x="",y=Frequency,fill=Clade,)) +
geom_bar(stat="identity",width=1) +
coord_polar("y", start=0) +
theme_void() +
scale_fill_brewer(palette="Paired") +
facet_wrap(~Continent)
#over200Cont
```
#By Country
##Prop Table
```{r}
#Oversample 200 Proportion Table by Country
over200CladeTableLoc <- table(oversample200$GISAID_clade, oversample200$country)
over200PropTableLoc <- as.data.frame(prop.table(over200CladeTableLoc,2)) %>%
mutate("Clade"=Var1,
"Country"=Var2,
"Frequency"=Freq*100) %>%
select(Clade,Country,Frequency)
#over200PropTableLoc
```
##Frequency Plot
```{r}
#Oversample 200 Percentage by Country
over200Loc <- ggplot(over200PropTableLoc, aes(x="",y=Frequency,fill=Clade)) +
geom_bar(stat="identity",width=1) +
coord_polar("y", start=0) +
theme_void() +
scale_fill_brewer(palette="Paired") +
facet_wrap(~Country)
#over200Loc
```
#Oversample of 500
#By Month
##Prop Table
```{r}
#Oversample 500 Proportion Table by Month
over500DateTableMonth <- table(oversample500$GISAID_clade, oversample500$month)
over500PropTableDateMonth <- as.data.frame(prop.table(over500DateTableMonth,2)) %>%
mutate("Clade"=Var1,
"Month"=Var2,
"Frequency"=Freq*100) %>%
select(Clade,Month,Frequency)
#over500PropTableDateMonth
```
##Frequency Plot
```{r}
#Oversample 500 Frequency Percentage Plot Month
over500DateMonth <- ggplot(over500PropTableDateMonth,aes(x=Month,y=Frequency,fill=Clade, width=6)) +
geom_col() +
scale_x_discrete(limits=over500PropTableDateMonth$Month) +
scale_fill_brewer(palette="Paired") +
theme_bw()
#over500DateMonth
```
#By Week
##Prop Table
```{r}
#Oversample 500 Proportion Table by Week
over500DateTableWeek <- table(oversample500$GISAID_clade, oversample500$week)
over500PropTableDateWeek <- as.data.frame(prop.table(over500DateTableWeek,2)) %>%
mutate("Clade"=Var1,
"Week"=Var2,
"Frequency"=Freq*100) %>%
select(Clade,Week,Frequency)
#over500PropTableDateWeek
```
##Frequency Plot
```{r}
#Oversample 500 Frequency Percentage Plot Week
over500DateWeek <- ggplot(over500PropTableDateWeek,aes(x=Week,y=Frequency,fill=Clade,width=6)) +
geom_col() +
scale_x_discrete(limits=over500PropTableDateWeek$Week) +
scale_fill_brewer(palette="Paired") +
theme_bw()
#over500DateWeek
```
#By Continent
##Prop Table
```{r}
#Oversample 500 Proportion Table by Continent
over500CladeTableLocCont <- table(oversample500$GISAID_clade, oversample500$region)
over500PropTableLocCont <- as.data.frame(prop.table(over500CladeTableLocCont,2)) %>%
mutate("Clade"=Var1,
"Continent"=Var2,
"Frequency"=Freq*100) %>%
select(Clade,Continent,Frequency)
#over500PropTableLocCont
```
##Frequency Plot
```{r}
#Oversample 500 Percentage by Continent
over500Cont <- ggplot(over500PropTableLocCont, aes(x="",y=Frequency,fill=Clade)) +
geom_bar(stat="identity",width=1) +
coord_polar("y", start=0) +
theme_void() +
scale_fill_brewer(palette="Paired") +
facet_wrap(~Continent)
#over500Cont
```
#By Country
##Prop Table
```{r}
#Oversample 500 Proportion Table by Country
over500CladeTableLoc <- table(oversample500$GISAID_clade, oversample500$country)
over500PropTableLoc <- as.data.frame(prop.table(over500CladeTableLoc,2)) %>%
mutate("Clade"=Var1,
"Country"=Var2,
"Frequency"=Freq*100) %>%
select(Clade,Country,Frequency)
#over500PropTableLoc
```
##Frequency Plot
```{r}
#Oversample 500 Percentage by Country
over500Loc <- ggplot(over500PropTableLoc, aes(x="",y=Frequency,fill=Clade)) +
geom_bar(stat="identity",width=1) +
coord_polar("y", start=0) +
theme_void() +
scale_fill_brewer(palette="Paired") +
facet_wrap(~Country)
#over500Loc
```
# Date Frequency Plots
## Column {.tabset}
### Original Data
```{r}
#Print all Month Frequency Plots
originalDateMonth
```
### Undersample of 20
```{r}
under20DateMonth
```
### Undersample of 40
```{r}
under40DateMonth
```
### Oversample of 200
```{r}
over200DateMonth
```
### Oversample of 500
```{r}
over500DateMonth
```
## Column {.tabset}
### Original Data
```{r}
#Print all Week Frequency Plots
originalDateWeek
```
### Undersample of 20
```{r}
under20DateWeek
```
### Undersample of 40
```{r}
under40DateWeek
```
### Oversample of 200
```{r}
over200DateWeek
```
### Oversample of 500
```{r}
over500DateWeek
```
# Frequency Plots by Location
## Column {.tabset}
### Original Data
```{r}
#Print all Continent Frequency Plots
originalCont
```
### Undersample of 20
```{r}
under20Cont
```
### Undersample of 40
```{r}
under40Cont
```
### Oversample of 200
```{r}
over200Cont
```
### Oversample of 500
```{r}
over500Cont
```
## Column {.tabset}
### Original Data
```{r}
#Print all Country Frequency Plots
originalLoc
```
### Undersample of 20
```{r}
under20Loc
```
### Undersample of 40
```{r}
under40Loc
```
### Oversample of 200
```{r}
over200Loc
```
### Oversample of 500
```{r}
over500Loc
```
# Chi Squared Tests of Undersamples
## Column {.tabset}
### Month
```{r warning=FALSE}
#Creating sub tables undersample 20 Month
originalCladeTableMonth <- as.data.frame(table(tb$GISAID_clade,tb$month)) %>% mutate("Clade"=Var1,"Month"=Var2) %>%
select(Clade,Month,Freq)
under20CladeTableMonth <- as.data.frame(table(undersample20$GISAID_clade,undersample20$month))
compareMonths <- cbind2(originalCladeTableMonth,mainPropTableMonth$Frequency) %>%
mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Month,Original,Prop) %>%
filter(Month %in% under20CladeTableMonth$Var2)
#Combining tables and renaming variables
compareMonths <- cbind2(compareMonths,under20CladeTableMonth$Freq) %>% group_by(Month) %>%
mutate("N_under20"=sum(y),
"Observed_under20"=y,
"Expected_under20"=round(N_under20*Prop)) %>%
select(-y)
#compareMonths
#Chi-squared tests for undersample 20 Months
list20_1 <- split(compareMonths,compareMonths$Month, drop=TRUE)
month20pVals <- as.data.frame(matrix(nrow=length(list20_1),ncol=3))
colnames(month20pVals) <- c("Month","p-Value","DF")
for (i in seq_along(list20_1)){
temp <- as.data.frame(list20_1[i])
colnames(temp) <- c("Clade","Month","Original","Prop","N_under20","Observed_under20","Expected_under20")
temp <- temp %>% filter(Expected_under20!=0)
if (nrow(temp)>=4){
originalChiMonth20 <- chisq.test(temp$Observed_under20,correct=TRUE,p=temp$Expected_under20,rescale.p=TRUE)
#Store chi squared tests in data frame
#print(originalChiMonth20)
temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
month20pVals[i,1] <- temp[1,2]
month20pVals[i,2] <- originalChiMonth20$p.value
month20pVals[i,3] <- originalChiMonth20$parameter
}
}
month20pVals <- month20pVals %>% drop_na
month20pVals
```
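Each test in this section compares the clade counts observed in a sample with the clade proportions of the full data set, one stratum (month, week, continent, or country) at a time. The following is a minimal sketch of that goodness-of-fit step as a reusable function. The helper name `gof_test`, its `min_classes` argument, and the `toyCompare` data are hypothetical and invented for illustration. As a simplification, the sketch drops classes whose expected proportion is zero, whereas the chunks in this file filter on the rounded expected count.

```r
# Hypothetical helper: chi-squared goodness-of-fit test of sampled clade
# counts against reference clade proportions within a single stratum.
gof_test <- function(observed, expected_prop, min_classes = 4) {
  keep <- expected_prop > 0            # drop classes with zero expected share
  observed <- observed[keep]
  expected_prop <- expected_prop[keep]
  if (length(observed) < min_classes) {
    # Too few classes left: report NA instead of running the test.
    return(data.frame(p_value = NA_real_, df = NA_real_))
  }
  # rescale.p = TRUE renormalises the proportions so they sum to 1.
  res <- chisq.test(observed, p = expected_prop, rescale.p = TRUE)
  data.frame(p_value = res$p.value, df = unname(res$parameter))
}

# Invented example: two months, four clades each.
toyCompare <- data.frame(
  Month    = rep(c("1", "2"), each = 4),
  Clade    = rep(c("G", "GH", "GR", "S"), times = 2),
  Prop     = c(0.40, 0.30, 0.20, 0.10, 0.25, 0.25, 0.25, 0.25),
  Observed = c(40, 30, 20, 10, 15, 35, 30, 20)
)

# Run one test per month and stack the results into a single data frame.
results <- do.call(rbind, lapply(split(toyCompare, toyCompare$Month), function(g) {
  cbind(Month = g$Month[1], gof_test(g$Observed, g$Prop))
}))
results
```

In the toy data, month 1 matches its reference proportions exactly, while month 2 deviates from a uniform reference, so the two rows illustrate a non-significant and a significant result respectively.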
### Week
```{r warning=FALSE}
#Creating sub tables undersample 20 Week
originalCladeTableWeek <- as.data.frame(table(tb$GISAID_clade,tb$week)) %>% mutate("Clade"=Var1,"Week"=Var2) %>%
select(Clade,Week,Freq)
under20CladeTableWeek <- as.data.frame(table(undersample20$GISAID_clade,undersample20$week))
compareWeeks <- cbind2(originalCladeTableWeek,mainPropTableWeek$Frequency) %>%
mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Week,Original,Prop) %>%
filter(Week %in% under20CladeTableWeek$Var2)
#Combining tables and renaming variables
compareWeeks <- cbind2(compareWeeks,under20CladeTableWeek$Freq) %>% group_by(Week) %>%
mutate("N_under20"=sum(y),
"Observed_under20"=y,
"Expected_under20"=round(N_under20*Prop)) %>%
select(-y)
#compareWeeks
#Chi-squared tests for undersample 20 Week
list20_2 <- split(compareWeeks,compareWeeks$Week, drop=TRUE)
week20pVals <- as.data.frame(matrix(nrow=length(list20_2),ncol=3))
colnames(week20pVals) <- c("Week","p-Value","DF")
for (i in seq_along(list20_2)){
temp <- as.data.frame(list20_2[i])
colnames(temp) <- c("Clade","Week","Original","Prop","N_under20","Observed_under20","Expected_under20")
temp <- temp %>% filter(Expected_under20!=0)
if (nrow(temp)>=4){
originalChiWeek20 <- chisq.test(temp$Observed_under20,correct=TRUE,p=temp$Expected_under20,rescale.p=TRUE)
#Store chi squared tests in data frame
#print(originalChiWeek20)
temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
week20pVals[i,1] <- temp[1,2]
week20pVals[i,2] <- originalChiWeek20$p.value
week20pVals[i,3] <- originalChiWeek20$parameter
}
}
week20pVals <- week20pVals %>% drop_na
week20pVals
```
### Continent
```{r warning=FALSE}
#Creating sub tables undersample 20 Continent
originalCladeTableCont <- as.data.frame(table(tb$GISAID_clade,tb$region)) %>% mutate("Clade"=Var1,"Continent"=Var2) %>%
select(Clade,Continent,Freq)
under20CladeTableCont <- as.data.frame(table(undersample20$GISAID_clade,undersample20$region))
compareCont <- cbind2(originalCladeTableCont,mainPropTableLocCont$Frequency) %>%
mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Continent,Original,Prop) %>%
filter(Continent %in% under20CladeTableCont$Var2)
#Combining tables and renaming variables
compareCont <- cbind2(compareCont,under20CladeTableCont$Freq) %>% group_by(Continent) %>%
mutate("N_under20"=sum(y),
"Observed_under20"=y,
"Expected_under20"=round(N_under20*Prop)) %>%
select(-y)
#compareCont
#Chi-squared tests for undersample 20 Continents
list20_3 <- split(compareCont,compareCont$Continent, drop=TRUE)
cont20pVals <- as.data.frame(matrix(nrow=length(list20_3),ncol=3))
colnames(cont20pVals) <- c("Continent","p-Value","DF")
for (i in seq_along(list20_3)){
temp <- as.data.frame(list20_3[i])
colnames(temp) <- c("Clade","Continent","Original","Prop","N_under20","Observed_under20","Expected_under20")
temp <- temp %>% filter(Expected_under20!=0)
if (nrow(temp)>=4){
originalChiCont20 <- chisq.test(temp$Observed_under20,correct=TRUE,p=temp$Expected_under20,rescale.p=TRUE)
#Store chi squared tests in data frame
#print(originalChiCont20)
temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
cont20pVals[i,1] <- temp[1,2]
cont20pVals[i,2] <- originalChiCont20$p.value
cont20pVals[i,3] <- originalChiCont20$parameter
}
}
cont20pVals <- cont20pVals %>% drop_na
cont20pVals
```
### Country
```{r warning=FALSE}
#Creating sub tables undersample 20 Country
originalCladeTableLoc <- as.data.frame(table(tb$GISAID_clade,tb$country)) %>% mutate("Clade"=Var1,"Country"=Var2) %>%
select(Clade,Country,Freq)
under20CladeTableLoc <- as.data.frame(table(undersample20$GISAID_clade,undersample20$country))
compareLoc <- cbind2(originalCladeTableLoc,mainPropTableLoc$Frequency) %>%
mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Country,Original,Prop) %>%
filter(Country %in% under20CladeTableLoc$Var2)
#Combining tables and renaming variables
compareLoc <- cbind2(compareLoc,under20CladeTableLoc$Freq) %>% group_by(Country) %>%
mutate("N_under20"=sum(y),
"Observed_under20"=y,
"Expected_under20"=round(N_under20*Prop)) %>%
select(-y)
#compareLoc
#Chi-squared tests for undersample 20 Countries
list20_4 <- split(compareLoc,compareLoc$Country, drop=TRUE)
country20pVals <- as.data.frame(matrix(nrow=length(list20_4),ncol=3))
colnames(country20pVals) <- c("Country","p-Value","DF")
for (i in seq_along(list20_4)){
temp <- as.data.frame(list20_4[i])
colnames(temp) <- c("Clade","Country","Original","Prop","N_under20","Observed_under20","Expected_under20")
temp <- temp %>% filter(Expected_under20!=0)
if (nrow(temp)>=4){
originalChiLoc20 <- chisq.test(temp$Observed_under20,correct=TRUE,p=temp$Expected_under20,rescale.p=TRUE)
#Store chi squared tests in data frame
#print(originalChiLoc20)
temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
country20pVals[i,1] <- temp[1,2]
country20pVals[i,2] <- originalChiLoc20$p.value
country20pVals[i,3] <- originalChiLoc20$parameter
}
}
country20pVals <- country20pVals %>% drop_na
country20pVals
```
## Column {.tabset}
### Month
```{r warning=FALSE}
#Creating sub tables undersample 40 Month
originalCladeTableMonth <- as.data.frame(table(tb$GISAID_clade,tb$month)) %>% mutate("Clade"=Var1,"Month"=Var2) %>%
select(Clade,Month,Freq)
under40CladeTableMonth <- as.data.frame(table(undersample40$GISAID_clade,undersample40$month))
#under40CladeTableMonth
compare40Months <- cbind2(originalCladeTableMonth,mainPropTableMonth$Frequency) %>%
mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Month,Original,Prop) %>%
filter(Month %in% under40CladeTableMonth$Var2)
#Combining tables and renaming variables
compare40Months <- cbind2(compare40Months,under40CladeTableMonth$Freq) %>% group_by(Month) %>%
mutate("N_under40"=sum(y),
"Observed_under40"=y,
"Expected_under40"=round(N_under40*Prop)) %>%
select(-y)
#compare40Months
#Chi-squared tests for undersample 40 Months
list40_1 <- split(compare40Months,compare40Months$Month, drop=TRUE)
month40pVals <- as.data.frame(matrix(nrow=length(list40_1),ncol=3))
colnames(month40pVals) <- c("Month","p-Value","DF")
for (i in seq_along(list40_1)){
temp <- as.data.frame(list40_1[i])
colnames(temp) <- c("Clade","Month","Original","Prop","N_under40","Observed_under40","Expected_under40")
temp <- temp %>% filter(Expected_under40!=0)
if (nrow(temp)>=4){
originalChiMonth40 <- chisq.test(temp$Observed_under40,correct=TRUE,p=temp$Expected_under40,rescale.p=TRUE)
#Store chi squared tests in data frame
#print(originalChiMonth40)
temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
month40pVals[i,1] <- temp[1,2]
month40pVals[i,2] <- originalChiMonth40$p.value
month40pVals[i,3] <- originalChiMonth40$parameter
}
}
month40pVals <- month40pVals %>% drop_na
month40pVals
```
### Week
```{r warning=FALSE}
#Creating sub tables undersample 40 Week
originalCladeTableWeek <- as.data.frame(table(tb$GISAID_clade,tb$week)) %>% mutate("Clade"=Var1,"Week"=Var2) %>%
select(Clade,Week,Freq)
under40CladeTableWeek <- as.data.frame(table(undersample40$GISAID_clade,undersample40$week))
compare40Weeks <- cbind2(originalCladeTableWeek,mainPropTableWeek$Frequency) %>%
mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Week,Original,Prop) %>%
filter(Week %in% under40CladeTableWeek$Var2)
#Combining tables and renaming variables
compare40Weeks <- cbind2(compare40Weeks,under40CladeTableWeek$Freq) %>% group_by(Week) %>%
mutate("N_under40"=sum(y),
"Observed_under40"=y,
"Expected_under40"=round(N_under40*Prop)) %>%
select(-y)
#compare40Weeks
#Chi-squared tests for undersample 40 Weeks
list40_2 <- split(compare40Weeks,compare40Weeks$Week, drop=TRUE)
week40pVals <- as.data.frame(matrix(nrow=length(list40_2),ncol=3))
colnames(week40pVals) <- c("Week","p-Value","DF")
for (i in seq_along(list40_2)){
temp <- as.data.frame(list40_2[i])
colnames(temp) <- c("Clade","Week","Original","Prop","N_under40","Observed_under40","Expected_under40")
temp <- temp %>% filter(Expected_under40!=0)
if (nrow(temp)>=4){
originalChiWeek40 <- chisq.test(temp$Observed_under40,correct=TRUE,p=temp$Expected_under40,rescale.p=TRUE)
#Store chi squared tests in data frame
#print(originalChiWeek40)
temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
week40pVals[i,1] <- temp[1,2]
week40pVals[i,2] <- originalChiWeek40$p.value
week40pVals[i,3] <- originalChiWeek40$parameter
}
}
week40pVals <- week40pVals %>% drop_na
week40pVals
```
### Continent
```{r warning=FALSE}
#Creating sub tables undersample 40 Continent
originalCladeTableCont <- as.data.frame(table(tb$GISAID_clade,tb$region)) %>% mutate("Clade"=Var1,"Continent"=Var2) %>%
select(Clade,Continent,Freq)
under40CladeTableCont <- as.data.frame(table(undersample40$GISAID_clade,undersample40$region))
compare40Cont <- cbind2(originalCladeTableCont,mainPropTableLocCont$Frequency) %>%
mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Continent,Original,Prop) %>%
filter(Continent %in% under40CladeTableCont$Var2)
#Combining tables and renaming variables
compare40Cont <- cbind2(compare40Cont,under40CladeTableCont$Freq) %>% group_by(Continent) %>%
mutate("N_under40"=sum(y),
"Observed_under40"=y,
"Expected_under40"=round(N_under40*Prop)) %>%
select(-y)
#compare40Cont
#Chi-squared tests for undersample 40 Continents
list40_3 <- split(compare40Cont,compare40Cont$Continent, drop=TRUE)
cont40pVals <- as.data.frame(matrix(nrow=length(list40_3),ncol=3))
colnames(cont40pVals) <- c("Continent","p-Value","DF")
for (i in seq_along(list40_3)){
temp <- as.data.frame(list40_3[i])
colnames(temp) <- c("Clade","Continent","Original","Prop","N_under40","Observed_under40","Expected_under40")
temp <- temp %>% filter(Expected_under40!=0)
if (nrow(temp)>=4){
originalChiCont40 <- chisq.test(temp$Observed_under40,correct=TRUE,p=temp$Expected_under40,rescale.p=TRUE)
#Store chi squared tests in data frame
#print(originalChiCont40)
temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
cont40pVals[i,1] <- temp[1,2]
cont40pVals[i,2] <- originalChiCont40$p.value
cont40pVals[i,3] <- originalChiCont40$parameter
}
}
cont40pVals <- cont40pVals %>% drop_na
cont40pVals
```
### Country
```{r warning=FALSE}
#Creating sub tables undersample 40 Country
originalCladeTableLoc <- as.data.frame(table(tb$GISAID_clade,tb$country)) %>% mutate("Clade"=Var1,"Country"=Var2) %>%
select(Clade,Country,Freq)
under40CladeTableLoc <- as.data.frame(table(undersample40$GISAID_clade,undersample40$country))
compare40Loc <- cbind2(originalCladeTableLoc,mainPropTableLoc$Frequency) %>%
mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Country,Original,Prop) %>%
filter(Country %in% under40CladeTableLoc$Var2)
#Combining tables and renaming variables
compare40Loc <- cbind2(compare40Loc,under40CladeTableLoc$Freq) %>% group_by(Country) %>%
mutate("N_under40"=sum(y),
"Observed_under40"=y,
"Expected_under40"=round(N_under40*Prop)) %>%
select(-y)
#compare40Loc
#Chi-squared tests for undersample 40 Countries
list40_4 <- split(compare40Loc,compare40Loc$Country, drop=TRUE)
country40pVals <- as.data.frame(matrix(nrow=length(list40_4),ncol=3))
colnames(country40pVals) <- c("Country","p-Value","DF")
for (i in seq_along(list40_4)){
temp <- as.data.frame(list40_4[i])
colnames(temp) <- c("Clade","Country","Original","Prop","N_under40","Observed_under40","Expected_under40")
temp <- temp %>% filter(Expected_under40!=0)
if (nrow(temp)>=4){
originalChiLoc40 <- chisq.test(temp$Observed_under40,correct=TRUE,p=temp$Expected_under40,rescale.p=TRUE)
#Store chi squared tests in data frame
#print(originalChiLoc40)
temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
country40pVals[i,1] <- temp[1,2]
country40pVals[i,2] <- originalChiLoc40$p.value
country40pVals[i,3] <- originalChiLoc40$parameter
}
}
country40pVals <- country40pVals %>% drop_na
country40pVals
```
# Chi Squared Tests of Oversamples
## Column {.tabset}
### Month
```{r warning=FALSE}
#Creating sub tables oversample 200 Month
originalCladeTableMonth <- as.data.frame(table(tb$GISAID_clade,tb$month)) %>% mutate("Clade"=Var1,"Month"=Var2) %>%
select(Clade,Month,Freq)
over200CladeTableMonth <- as.data.frame(table(oversample200$GISAID_clade,oversample200$month))
compare200Months <- cbind2(originalCladeTableMonth,mainPropTableMonth$Frequency) %>%
mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Month,Original,Prop) %>%
filter(Month %in% over200CladeTableMonth$Var2)
#Combining tables and renaming variables
compare200Months <- cbind2(compare200Months,over200CladeTableMonth$Freq) %>% group_by(Month) %>%
mutate("N_over200"=sum(y),
"Observed_over200"=y,
"Expected_over200"=round(N_over200*Prop)) %>%
select(-y)
#compare200Months
#Chi-squared tests for oversample 200 Months
list200_1 <- split(compare200Months,compare200Months$Month, drop=TRUE)
month200pVals <- as.data.frame(matrix(nrow=length(list200_1),ncol=3))
colnames(month200pVals) <- c("Month","p-Value","DF")
for (i in seq_along(list200_1)){
temp <- as.data.frame(list200_1[i])
colnames(temp) <- c("Clade","Month","Original","Prop","N_over200","Observed_over200","Expected_over200")
temp <- temp %>% filter(Expected_over200!=0)
if (nrow(temp)>=4){
originalChiMonth200 <- chisq.test(temp$Observed_over200,correct=TRUE,p=temp$Expected_over200,rescale.p=TRUE)
#Store chi squared tests in data frame
#print(originalChiMonth200)
temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
month200pVals[i,1] <- temp[1,2]
month200pVals[i,2] <- originalChiMonth200$p.value
month200pVals[i,3] <- originalChiMonth200$parameter
}
}
month200pVals <- month200pVals %>% drop_na
month200pVals
```
### Week
```{r warning=FALSE}
#Creating sub tables oversample 200 Week
originalCladeTableWeek <- as.data.frame(table(tb$GISAID_clade,tb$week)) %>% mutate("Clade"=Var1,"Week"=Var2) %>%
select(Clade,Week,Freq)
over200CladeTableWeek <- as.data.frame(table(oversample200$GISAID_clade,oversample200$week))
compare200Weeks <- cbind2(originalCladeTableWeek,mainPropTableWeek$Frequency) %>%
mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Week,Original,Prop) %>%
filter(Week %in% over200CladeTableWeek$Var2)
#Combining tables and renaming variables
compare200Weeks <- cbind2(compare200Weeks,over200CladeTableWeek$Freq) %>% group_by(Week) %>%
mutate("N_over200"=sum(y),
"Observed_over200"=y,
"Expected_over200"=round(N_over200*Prop)) %>%
select(-y)
#compare200Weeks
#Chi-squared tests for oversample 200 Weeks
list200_2 <- split(compare200Weeks,compare200Weeks$Week, drop=TRUE)
week200pVals <- as.data.frame(matrix(nrow=length(list200_2),ncol=3))
colnames(week200pVals) <- c("Week","p-Value","DF")
for (i in seq_along(list200_2)){
temp <- as.data.frame(list200_2[i])
colnames(temp) <- c("Clade","Week","Original","Prop","N_over200","Observed_over200","Expected_over200")
temp <- temp %>% filter(Expected_over200!=0)
if (nrow(temp)>=4){
originalChiWeek200 <- chisq.test(temp$Observed_over200,correct=TRUE,p=temp$Expected_over200,rescale.p=TRUE)
#Store chi squared tests in data frame
#print(originalChiWeek200)
temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
week200pVals[i,1] <- temp[1,2]
week200pVals[i,2] <- originalChiWeek200$p.value
week200pVals[i,3] <- originalChiWeek200$parameter
}
}
week200pVals <- week200pVals %>% drop_na
week200pVals
```
### Continent
```{r warning=FALSE}
#Creating sub tables oversample 200 Continent
originalCladeTableCont <- as.data.frame(table(tb$GISAID_clade,tb$region)) %>% mutate("Clade"=Var1,"Continent"=Var2) %>%
select(Clade,Continent,Freq)
over200CladeTableCont <- as.data.frame(table(oversample200$GISAID_clade,oversample200$region))
compare200Cont <- cbind2(originalCladeTableCont,mainPropTableLocCont$Frequency) %>%
mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Continent,Original,Prop) %>%
filter(Continent %in% over200CladeTableCont$Var2)
#Combining tables and renaming variables
compare200Cont <- cbind2(compare200Cont,over200CladeTableCont$Freq) %>% group_by(Continent) %>%
mutate("N_over200"=sum(y),
"Observed_over200"=y,
"Expected_over200"=round(N_over200*Prop)) %>%
select(-y)
#compare200Cont
#Chi-squared tests for oversample 200 Continents
list200_3 <- split(compare200Cont,compare200Cont$Continent, drop=TRUE)
cont200pVals <- as.data.frame(matrix(nrow=length(list200_3),ncol=3))
colnames(cont200pVals) <- c("Continent","p-Value","DF")
for (i in seq_along(list200_3)){
temp <- as.data.frame(list200_3[i])
colnames(temp) <- c("Clade","Continent","Original","Prop","N_over200","Observed_over200","Expected_over200")
temp <- temp %>% filter(Expected_over200!=0)
if (nrow(temp)>=4){
originalChiCont200 <- chisq.test(temp$Observed_over200,correct=TRUE,p=temp$Expected_over200,rescale.p=TRUE)
#Store chi squared tests in data frame
#print(originalChiCont200)
temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
cont200pVals[i,1] <- temp[1,2]
cont200pVals[i,2] <- originalChiCont200$p.value
cont200pVals[i,3] <- originalChiCont200$parameter
}
}
cont200pVals <- cont200pVals %>% drop_na
cont200pVals
```
### Country
```{r warning=FALSE}
#Creating sub tables oversample 200 Country
originalCladeTableLoc <- as.data.frame(table(tb$GISAID_clade,tb$country)) %>% mutate("Clade"=Var1,"Country"=Var2) %>%
select(Clade,Country,Freq)
over200CladeTableLoc <- as.data.frame(table(oversample200$GISAID_clade,oversample200$country))
compare200Loc <- cbind2(originalCladeTableLoc,mainPropTableLoc$Frequency) %>%
mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Country,Original,Prop) %>%
filter(Country %in% over200CladeTableLoc$Var2)
#Combining tables and renaming variables
compare200Loc <- cbind2(compare200Loc,over200CladeTableLoc$Freq) %>% group_by(Country) %>%
mutate("N_over200"=sum(y),
"Observed_over200"=y,
"Expected_over200"=round(N_over200*Prop)) %>%
select(-y)
#compare200Loc
#Chi-squared tests for oversample 200 Countries
list200_4 <- split(compare200Loc,compare200Loc$Country, drop=TRUE)
country200pVals <- as.data.frame(matrix(nrow=length(list200_4),ncol=3))
colnames(country200pVals) <- c("Country","p-Value","DF")
for (i in seq_along(list200_4)){
temp <- as.data.frame(list200_4[i])
colnames(temp) <- c("Clade","Country","Original","Prop","N_over200","Observed_over200","Expected_over200")
temp <- temp %>% filter(Expected_over200!=0)
if (nrow(temp)>=4){
originalChiLoc200 <- chisq.test(temp$Observed_over200,correct=TRUE,p=temp$Expected_over200,rescale.p=TRUE)
#Store chi squared tests in data frame
#print(originalChiLoc200)
temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
country200pVals[i,1] <- temp[1,2]
country200pVals[i,2] <- originalChiLoc200$p.value
country200pVals[i,3] <- originalChiLoc200$parameter
}
}
country200pVals <- country200pVals %>% drop_na
country200pVals
```
## Column {.tabset}
### Month
```{r warning=FALSE}
#Creating sub tables oversample 500 Month
originalCladeTableMonth <- as.data.frame(table(tb$GISAID_clade,tb$month)) %>% mutate("Clade"=Var1,"Month"=Var2) %>%
select(Clade,Month,Freq)
over500CladeTableMonth <- as.data.frame(table(oversample500$GISAID_clade,oversample500$month))
compare500Months <- cbind2(originalCladeTableMonth,mainPropTableMonth$Frequency) %>%
mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Month,Original,Prop) %>%
filter(Month %in% over500CladeTableMonth$Var2)
#Combining tables and renaming variables
compare500Months <- cbind2(compare500Months,over500CladeTableMonth$Freq) %>% group_by(Month) %>%
mutate("N_over500"=sum(y),
"Observed_over500"=y,
"Expected_over500"=round(N_over500*Prop)) %>%
select(-y)
#compare500Months
#Chi-squared tests for oversample 500 Months
list500_1 <- split(compare500Months,compare500Months$Month, drop=TRUE)
month500pVals <- as.data.frame(matrix(nrow=length(list500_1),ncol=3))
colnames(month500pVals) <- c("Month","p-Value","DF")
for (i in seq_along(list500_1)){
temp <- as.data.frame(list500_1[i])
colnames(temp) <- c("Clade","Month","Original","Prop","N_over500","Observed_over500","Expected_over500")
temp <- temp %>% filter(Expected_over500!=0)
if (nrow(temp)>=4){
originalChiMonth500 <- chisq.test(temp$Observed_over500,correct=TRUE,p=temp$Expected_over500,rescale.p=TRUE)
#Store chi squared tests in data frame
#print(originalChiMonth500)
temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
month500pVals[i,1] <- temp[1,2]
month500pVals[i,2] <- originalChiMonth500$p.value
month500pVals[i,3] <- originalChiMonth500$parameter
}
}
month500pVals <- month500pVals %>% drop_na
month500pVals
```
### Week
```{r warning=FALSE}
#Creating sub tables oversample 500 Week
originalCladeTableWeek <- as.data.frame(table(tb$GISAID_clade,tb$week)) %>% mutate("Clade"=Var1,"Week"=Var2) %>%
select(Clade,Week,Freq)
over500CladeTableWeek <- as.data.frame(table(oversample500$GISAID_clade,oversample500$week))
compare500Weeks <- cbind2(originalCladeTableWeek,mainPropTableWeek$Frequency) %>%
mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Week,Original,Prop) %>%
filter(Week %in% over500CladeTableWeek$Var2)
#Combining tables and renaming variables
compare500Weeks <- cbind2(compare500Weeks,over500CladeTableWeek$Freq) %>% group_by(Week) %>%
mutate("N_over500"=sum(y),
"Observed_over500"=y,
"Expected_over500"=round(N_over500*Prop)) %>%
select(-y)
#compare500Weeks
#Chi-squared tests for oversample 500 Weeks
list500_2 <- split(compare500Weeks,compare500Weeks$Week, drop=TRUE)
week500pVals <- as.data.frame(matrix(nrow=length(list500_2),ncol=3))
colnames(week500pVals) <- c("Week","p-Value","DF")
for (i in seq_along(list500_2)){
temp <- as.data.frame(list500_2[i])
colnames(temp) <- c("Clade","Week","Original","Prop","N_over500","Observed_over500","Expected_over500")
temp <- temp %>% filter(Expected_over500!=0)
if (nrow(temp)>=4){
originalChiWeek500 <- chisq.test(temp$Observed_over500,correct=TRUE,p=temp$Expected_over500,rescale.p=TRUE)
#Store chi squared tests in data frame
#print(originalChiWeek500)
temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
week500pVals[i,1] <- temp[1,2]
week500pVals[i,2] <- originalChiWeek500$p.value
week500pVals[i,3] <- originalChiWeek500$parameter
}
}
week500pVals <- week500pVals %>% drop_na
week500pVals
```
### Continent
```{r warning=FALSE}
#Creating sub tables oversample 500 Continent
originalCladeTableCont <- as.data.frame(table(tb$GISAID_clade,tb$region)) %>% mutate("Clade"=Var1,"Continent"=Var2) %>%
select(Clade,Continent,Freq)
over500CladeTableCont <- as.data.frame(table(oversample500$GISAID_clade,oversample500$region))
compare500Cont <- cbind2(originalCladeTableCont,mainPropTableLocCont$Frequency) %>%
mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Continent,Original,Prop) %>%
filter(Continent %in% over500CladeTableCont$Var2)
#Combining tables and renaming variables
compare500Cont <- cbind2(compare500Cont,over500CladeTableCont$Freq) %>% group_by(Continent) %>%
mutate("N_over500"=sum(y),
"Observed_over500"=y,
"Expected_over500"=round(N_over500*Prop)) %>%
select(-y)
#compare500Cont
#Chi-squared tests for oversample 500 Continents
list500_3 <- split(compare500Cont,compare500Cont$Continent, drop=TRUE)
cont500pVals <- as.data.frame(matrix(nrow=length(list500_3),ncol=3))
colnames(cont500pVals) <- c("Continent","p-Value","DF")
for (i in seq_along(list500_3)){
temp <- as.data.frame(list500_3[i])
colnames(temp) <- c("Clade","Continent","Original","Prop","N_over500","Observed_over500","Expected_over500")
temp <- temp %>% filter(Expected_over500!=0)
if (nrow(temp)>=4){
originalChiCont500 <- chisq.test(temp$Observed_over500,correct=TRUE,p=temp$Expected_over500,rescale.p=TRUE)
#Store chi squared tests in data frame
#print(originalChiCont500)
temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
cont500pVals[i,1] <- temp[1,2]
cont500pVals[i,2] <- originalChiCont500$p.value
cont500pVals[i,3] <- originalChiCont500$parameter
}
}
cont500pVals <- cont500pVals %>% drop_na
cont500pVals
```
### Country
```{r warning=FALSE}
#Creating sub tables oversample 500 Country
originalCladeTableLoc <- as.data.frame(table(tb$GISAID_clade,tb$country)) %>% mutate("Clade"=Var1,"Country"=Var2) %>%
select(Clade,Country,Freq)
over500CladeTableLoc <- as.data.frame(table(oversample500$GISAID_clade,oversample500$country))
compare500Loc <- cbind2(originalCladeTableLoc,mainPropTableLoc$Frequency) %>%
mutate("Prop"=y/100,"Original"=Freq) %>% select(Clade,Country,Original,Prop) %>%
filter(Country %in% over500CladeTableLoc$Var2)
#Combining tables and renaming variables
compare500Loc <- cbind2(compare500Loc,over500CladeTableLoc$Freq) %>% group_by(Country) %>%
mutate("N_over500"=sum(y),
"Observed_over500"=y,
"Expected_over500"=round(N_over500*Prop)) %>%
select(-y)
#compare500Loc
#Chi-squared tests for oversample 500 Countries
list500_4 <- split(compare500Loc,compare500Loc$Country, drop=TRUE)
country500pVals <- as.data.frame(matrix(nrow=length(list500_4),ncol=3))
colnames(country500pVals) <- c("Country","p-Value","DF")
for (i in seq_along(list500_4)){
temp <- as.data.frame(list500_4[i])
colnames(temp) <- c("Clade","Country","Original","Prop","N_over500","Observed_over500","Expected_over500")
temp <- temp %>% filter(Expected_over500!=0)
if (nrow(temp)>=4){
originalChiLoc500 <- chisq.test(temp$Observed_over500,correct=TRUE,p=temp$Expected_over500,rescale.p=TRUE)
#Store chi squared tests in data frame
#print(originalChiLoc500)
temp <- data.frame(lapply(temp, as.character), stringsAsFactors=FALSE)
country500pVals[i,1] <- temp[1,2]
country500pVals[i,2] <- originalChiLoc500$p.value
country500pVals[i,3] <- originalChiLoc500$parameter
}
}
country500pVals <- country500pVals %>% drop_na
country500pVals
```