Step 1: Data Selection

Summary of Dataset

The data used in this analysis comes from the Kaggle dataset “Coronavirus Dataset” compiled by Kim Jihoo, based on official reports from the Korea Disease Control and Prevention Agency (KDCA) and local governments in South Korea.

The analysis primarily uses three datasets: (1) time-series data containing daily confirmed, released, and deceased counts; (2) patient demographic data including age and gender; and (3) policy data documenting major public health interventions.

Step 2: Exploratory Visual Analysis

Phase 1: Overview of the Structure of the Dataset

Let’s begin by looking at the first few rows of each dataset to get a feel for what they look like.

head(case_data)
##   case_id province         city group              infection_case confirmed
## 1 1000001    Seoul   Yongsan-gu  TRUE               Itaewon Clubs       139
## 2 1000002    Seoul    Gwanak-gu  TRUE                     Richway       119
## 3 1000003    Seoul      Guro-gu  TRUE         Guro-gu Call Center        95
## 4 1000004    Seoul Yangcheon-gu  TRUE Yangcheon Table Tennis Club        43
## 5 1000005    Seoul    Dobong-gu  TRUE             Day Care Center        43
## 6 1000006    Seoul      Guro-gu  TRUE       Manmin Central Church        41
##    latitude  longitude
## 1 37.538621 126.992652
## 2  37.48208 126.901384
## 3 37.508163 126.884387
## 4 37.546061 126.874209
## 5 37.679422 127.044374
## 6 37.481059 126.894343
head(patient_data)
##   patient_id    sex age country province        city       infection_case
## 1      1e+09   male 50s   Korea    Seoul  Gangseo-gu      overseas inflow
## 2      1e+09   male 30s   Korea    Seoul Jungnang-gu      overseas inflow
## 3      1e+09   male 50s   Korea    Seoul   Jongno-gu contact with patient
## 4      1e+09   male 20s   Korea    Seoul     Mapo-gu      overseas inflow
## 5      1e+09 female 20s   Korea    Seoul Seongbuk-gu contact with patient
## 6      1e+09 female 50s   Korea    Seoul   Jongno-gu contact with patient
##   infected_by contact_number symptom_onset_date confirmed_date released_date
## 1                         75         2020-01-22     2020-01-23    2020-02-05
## 2                         31                        2020-01-30    2020-03-02
## 3  2002000001             17                        2020-01-30    2020-02-19
## 4                          9         2020-01-26     2020-01-30    2020-02-15
## 5  1000000002              2                        2020-01-31    2020-02-24
## 6  1000000003             43                        2020-01-31    2020-02-19
##   deceased_date    state
## 1               released
## 2               released
## 3               released
## 4               released
## 5               released
## 6               released
head(policy_data)
##   policy_id country        type                     gov_policy           detail
## 1         1   Korea       Alert Infectious Disease Alert Level   Level 1 (Blue)
## 2         2   Korea       Alert Infectious Disease Alert Level Level 2 (Yellow)
## 3         3   Korea       Alert Infectious Disease Alert Level Level 3 (Orange)
## 4         4   Korea       Alert Infectious Disease Alert Level    Level 4 (Red)
## 5         5   Korea Immigration  Special Immigration Procedure       from China
## 6         6   Korea Immigration  Special Immigration Procedure   from Hong Kong
##   start_date   end_date
## 1 2020-01-03 2020-01-19
## 2 2020-01-20 2020-01-27
## 3 2020-01-28 2020-02-22
## 4 2020-02-23           
## 5 2020-02-04           
## 6 2020-02-12
head(region_data)
##    code province        city latitude longitude elementary_school_count
## 1 10000    Seoul       Seoul 37.56695  126.9780                     607
## 2 10010    Seoul  Gangnam-gu 37.51842  127.0472                      33
## 3 10020    Seoul Gangdong-gu 37.53049  127.1238                      27
## 4 10030    Seoul  Gangbuk-gu 37.63994  127.0255                      14
## 5 10040    Seoul  Gangseo-gu 37.55117  126.8495                      36
## 6 10050    Seoul   Gwanak-gu 37.47829  126.9515                      22
##   kindergarten_count university_count academy_ratio elderly_population_ratio
## 1                830               48          1.44                    15.38
## 2                 38                0          4.18                    13.17
## 3                 32                0          1.54                    14.55
## 4                 21                0          0.67                    19.49
## 5                 56                1          1.17                    14.39
## 6                 33                1          0.89                    15.12
##   elderly_alone_ratio nursing_home_count
## 1                 5.8              22739
## 2                 4.3               3088
## 3                 5.4               1023
## 4                 8.5                628
## 5                 5.7               1080
## 6                 4.9                909
head(search_trend_data)
##         date    cold     flu pneumonia coronavirus
## 1 2016-01-01 0.11663 0.05590   0.15726     0.00736
## 2 2016-01-02 0.13372 0.17135   0.20826     0.00890
## 3 2016-01-03 0.14917 0.22317   0.19326     0.00845
## 4 2016-01-04 0.17463 0.18626   0.29008     0.01145
## 5 2016-01-05 0.17226 0.15072   0.24562     0.01381
## 6 2016-01-06 0.17272 0.14399   0.25081     0.01381
head(seoul_floating_data)
##         date hour birth_year    sex province          city fp_num
## 1 2020-01-01    0         20 female    Seoul     Dobong-gu  19140
## 2 2020-01-01    0         20   male    Seoul     Dobong-gu  19950
## 3 2020-01-01    0         20 female    Seoul Dongdaemun-gu  25450
## 4 2020-01-01    0         20   male    Seoul Dongdaemun-gu  27050
## 5 2020-01-01    0         20 female    Seoul    Dongjag-gu  28880
## 6 2020-01-01    0         20   male    Seoul    Dongjag-gu  30350
head(time_data)
##         date time test negative confirmed released deceased
## 1 2020-01-20   16    1        0         1        0        0
## 2 2020-01-21   16    1        0         1        0        0
## 3 2020-01-22   16    4        3         1        0        0
## 4 2020-01-23   16   22       21         1        0        0
## 5 2020-01-24   16   27       25         2        0        0
## 6 2020-01-25   16   27       25         2        0        0
head(time_gender_data)
##         date time    sex confirmed deceased
## 1 2020-03-02    0   male      1591       13
## 2 2020-03-02    0 female      2621        9
## 3 2020-03-03    0   male      1810       16
## 4 2020-03-03    0 female      3002       12
## 5 2020-03-04    0   male      1996       20
## 6 2020-03-04    0 female      3332       12
head(time_province_data)
##         date time province confirmed released deceased
## 1 2020-01-20   16    Seoul         0        0        0
## 2 2020-01-20   16    Busan         0        0        0
## 3 2020-01-20   16    Daegu         0        0        0
## 4 2020-01-20   16  Incheon         1        0        0
## 5 2020-01-20   16  Gwangju         0        0        0
## 6 2020-01-20   16  Daejeon         0        0        0
head(weather_data)
##    code province       date avg_temp min_temp max_temp precipitation
## 1 10000    Seoul 2016-01-01      1.2     -3.3      4.0             0
## 2 11000    Busan 2016-01-01      5.3      1.1     10.9             0
## 3 12000    Daegu 2016-01-01      1.7     -4.0      8.0             0
## 4 13000  Gwangju 2016-01-01      3.2     -1.5      8.1             0
## 5 14000  Incheon 2016-01-01      3.1     -0.4      5.7             0
## 6 15000  Daejeon 2016-01-01      1.6     -4.2      7.7             0
##   max_wind_speed most_wind_direction avg_relative_humidity
## 1            3.5                  90                  73.0
## 2            7.4                 340                  52.1
## 3            3.7                 270                  70.5
## 4            2.7                 230                  73.1
## 5            5.3                 180                  83.9
## 6            4.4                 320                  77.4

Let’s look at the structure of all data frames too.

str(case_data)
## 'data.frame':    174 obs. of  8 variables:
##  $ case_id       : int  1000001 1000002 1000003 1000004 1000005 1000006 1000007 1000008 1000009 1000010 ...
##  $ province      : chr  "Seoul" "Seoul" "Seoul" "Seoul" ...
##  $ city          : chr  "Yongsan-gu" "Gwanak-gu" "Guro-gu" "Yangcheon-gu" ...
##  $ group         : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ infection_case: chr  "Itaewon Clubs" "Richway" "Guro-gu Call Center" "Yangcheon Table Tennis Club" ...
##  $ confirmed     : int  139 119 95 43 43 41 36 17 25 30 ...
##  $ latitude      : chr  "37.538621" "37.48208" "37.508163" "37.546061" ...
##  $ longitude     : chr  "126.992652" "126.901384" "126.884387" "126.874209" ...
str(patient_data)
## 'data.frame':    5165 obs. of  14 variables:
##  $ patient_id        : num  1e+09 1e+09 1e+09 1e+09 1e+09 ...
##  $ sex               : chr  "male" "male" "male" "male" ...
##  $ age               : chr  "50s" "30s" "50s" "20s" ...
##  $ country           : chr  "Korea" "Korea" "Korea" "Korea" ...
##  $ province          : chr  "Seoul" "Seoul" "Seoul" "Seoul" ...
##  $ city              : chr  "Gangseo-gu" "Jungnang-gu" "Jongno-gu" "Mapo-gu" ...
##  $ infection_case    : chr  "overseas inflow" "overseas inflow" "contact with patient" "overseas inflow" ...
##  $ infected_by       : chr  "" "" "2002000001" "" ...
##  $ contact_number    : chr  "75" "31" "17" "9" ...
##  $ symptom_onset_date: chr  "2020-01-22" "" "" "2020-01-26" ...
##  $ confirmed_date    : chr  "2020-01-23" "2020-01-30" "2020-01-30" "2020-01-30" ...
##  $ released_date     : chr  "2020-02-05" "2020-03-02" "2020-02-19" "2020-02-15" ...
##  $ deceased_date     : chr  "" "" "" "" ...
##  $ state             : chr  "released" "released" "released" "released" ...
str(policy_data)
## 'data.frame':    61 obs. of  7 variables:
##  $ policy_id : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ country   : chr  "Korea" "Korea" "Korea" "Korea" ...
##  $ type      : chr  "Alert" "Alert" "Alert" "Alert" ...
##  $ gov_policy: chr  "Infectious Disease Alert Level" "Infectious Disease Alert Level" "Infectious Disease Alert Level" "Infectious Disease Alert Level" ...
##  $ detail    : chr  "Level 1 (Blue)" "Level 2 (Yellow)" "Level 3 (Orange)" "Level 4 (Red)" ...
##  $ start_date: chr  "2020-01-03" "2020-01-20" "2020-01-28" "2020-02-23" ...
##  $ end_date  : chr  "2020-01-19" "2020-01-27" "2020-02-22" "" ...
str(region_data)
## 'data.frame':    244 obs. of  12 variables:
##  $ code                    : int  10000 10010 10020 10030 10040 10050 10060 10070 10080 10090 ...
##  $ province                : chr  "Seoul" "Seoul" "Seoul" "Seoul" ...
##  $ city                    : chr  "Seoul" "Gangnam-gu" "Gangdong-gu" "Gangbuk-gu" ...
##  $ latitude                : num  37.6 37.5 37.5 37.6 37.6 ...
##  $ longitude               : num  127 127 127 127 127 ...
##  $ elementary_school_count : int  607 33 27 14 36 22 22 26 18 42 ...
##  $ kindergarten_count      : int  830 38 32 21 56 33 33 34 19 66 ...
##  $ university_count        : int  48 0 0 0 1 1 3 3 0 6 ...
##  $ academy_ratio           : num  1.44 4.18 1.54 0.67 1.17 0.89 1.16 1 0.96 1.39 ...
##  $ elderly_population_ratio: num  15.4 13.2 14.6 19.5 14.4 ...
##  $ elderly_alone_ratio     : num  5.8 4.3 5.4 8.5 5.7 4.9 4.8 5.7 6.7 7.4 ...
##  $ nursing_home_count      : int  22739 3088 1023 628 1080 909 723 741 475 952 ...
str(search_trend_data)
## 'data.frame':    1642 obs. of  5 variables:
##  $ date       : chr  "2016-01-01" "2016-01-02" "2016-01-03" "2016-01-04" ...
##  $ cold       : num  0.117 0.134 0.149 0.175 0.172 ...
##  $ flu        : num  0.0559 0.1714 0.2232 0.1863 0.1507 ...
##  $ pneumonia  : num  0.157 0.208 0.193 0.29 0.246 ...
##  $ coronavirus: num  0.00736 0.0089 0.00845 0.01145 0.01381 ...
str(seoul_floating_data)
## 'data.frame':    1084800 obs. of  7 variables:
##  $ date      : chr  "2020-01-01" "2020-01-01" "2020-01-01" "2020-01-01" ...
##  $ hour      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ birth_year: int  20 20 20 20 20 20 20 20 20 20 ...
##  $ sex       : chr  "female" "male" "female" "male" ...
##  $ province  : chr  "Seoul" "Seoul" "Seoul" "Seoul" ...
##  $ city      : chr  "Dobong-gu" "Dobong-gu" "Dongdaemun-gu" "Dongdaemun-gu" ...
##  $ fp_num    : int  19140 19950 25450 27050 28880 30350 27750 27910 19490 21940 ...
str(time_data)
## 'data.frame':    163 obs. of  7 variables:
##  $ date     : chr  "2020-01-20" "2020-01-21" "2020-01-22" "2020-01-23" ...
##  $ time     : int  16 16 16 16 16 16 16 16 16 16 ...
##  $ test     : int  1 1 4 22 27 27 51 61 116 187 ...
##  $ negative : int  0 0 3 21 25 25 47 56 97 155 ...
##  $ confirmed: int  1 1 1 1 2 2 3 4 4 4 ...
##  $ released : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ deceased : int  0 0 0 0 0 0 0 0 0 0 ...
str(time_gender_data)
## 'data.frame':    242 obs. of  5 variables:
##  $ date     : chr  "2020-03-02" "2020-03-02" "2020-03-03" "2020-03-03" ...
##  $ time     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sex      : chr  "male" "female" "male" "female" ...
##  $ confirmed: int  1591 2621 1810 3002 1996 3332 2149 3617 2345 3939 ...
##  $ deceased : int  13 9 16 12 20 12 21 14 25 17 ...
str(time_province_data)
## 'data.frame':    2771 obs. of  6 variables:
##  $ date     : chr  "2020-01-20" "2020-01-20" "2020-01-20" "2020-01-20" ...
##  $ time     : int  16 16 16 16 16 16 16 16 16 16 ...
##  $ province : chr  "Seoul" "Busan" "Daegu" "Incheon" ...
##  $ confirmed: int  0 0 0 1 0 0 0 0 0 0 ...
##  $ released : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ deceased : int  0 0 0 0 0 0 0 0 0 0 ...
str(weather_data)
## 'data.frame':    26271 obs. of  10 variables:
##  $ code                 : int  10000 11000 12000 13000 14000 15000 16000 20000 30000 40000 ...
##  $ province             : chr  "Seoul" "Busan" "Daegu" "Gwangju" ...
##  $ date                 : chr  "2016-01-01" "2016-01-01" "2016-01-01" "2016-01-01" ...
##  $ avg_temp             : num  1.2 5.3 1.7 3.2 3.1 1.6 4 1.6 5.1 -1 ...
##  $ min_temp             : num  -3.3 1.1 -4 -1.5 -0.4 -4.2 -1.6 -4.2 2.1 -5.9 ...
##  $ max_temp             : num  4 10.9 8 8.1 5.7 7.7 12 5.7 8.9 4.1 ...
##  $ precipitation        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ max_wind_speed       : num  3.5 7.4 3.7 2.7 5.3 4.4 2.7 2.1 9.6 1.6 ...
##  $ most_wind_direction  : int  90 340 270 230 180 320 320 180 290 110 ...
##  $ avg_relative_humidity: num  73 52.1 70.5 73.1 83.9 77.4 53.3 80.1 33 79.4 ...

For case_data, latitude and longitude are stored as chr because missing coordinates are recorded as "-". Both columns should be converted into num, with "-" becoming NA.

case_data <- case_data %>%
  mutate(
    latitude = na_if(latitude, "-"),
    longitude = na_if(longitude, "-"),
    latitude = as.numeric(latitude),
    longitude = as.numeric(longitude)
  )
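
The same placeholders could also be declared as missing when the file is read, so that the coordinate columns arrive as numeric. This is a minimal sketch rather than the approach used in this analysis: it relies on the na.strings argument of base R's read.csv, the three inline rows are illustrative (modeled on the output above), and it assumes "-" never carries meaning in any other column.

```r
# Illustrative sample only; the real file is not read here.
csv_text <- "case_id,city,latitude,longitude
1000001,Yongsan-gu,37.538621,126.992652
1000007,from other city,-,-
1000036,-,-,-"

# Treat "-" and blank fields as NA at read time.
case_toy <- read.csv(
  text = csv_text,
  na.strings = c("-", ""),
  stringsAsFactors = FALSE
)

str(case_toy)
```

With the placeholders handled at read time, latitude and longitude no longer need a separate as.numeric() step, and city already contains NA instead of "-".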

As an observation, some rows have missing latitude, longitude, or city values.

For patient_data,

  1. sex: should be converted into a factor to prepare the data for analysis.

  2. contact_number: represents a count of contacts, so it should be converted into int.

  3. symptom_onset_date, confirmed_date, released_date, deceased_date: should be converted into Date values.

  4. age: stored as decade labels such as "50s"; should be converted into an integer value.
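
The conversion code for patient_data leaves the age column untouched. A minimal sketch of item 4, assuming every non-missing age is a decade label of the form "50s"; the helper name age_to_decade and the sample vector are hypothetical:

```r
# Hypothetical helper: turn decade labels such as "50s" into integers (50).
age_to_decade <- function(age) {
  age <- trimws(age)
  age[!is.na(age) & age == ""] <- NA   # blank labels become NA
  as.integer(sub("s$", "", age))       # drop the trailing "s", then convert
}

# Made-up sample values for illustration.
age_sample <- c("50s", "30s", "", "0s", "100s")
age_to_decade(age_sample)
```

Inside a dplyr pipeline this could be applied as mutate(age = age_to_decade(age)). Labels that do not match the assumed pattern would come out as NA with a coercion warning.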

patient_data <- patient_data %>%
  mutate(
    sex = na_if(sex, ""),
    sex = as.factor(sex),
    contact_number = as.integer(contact_number),
    symptom_onset_date = as.Date(symptom_onset_date),
    released_date = as.Date(released_date),
    deceased_date = as.Date(deceased_date),
    confirmed_date = as.Date(confirmed_date)
  )
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `contact_number = as.integer(contact_number)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
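
The coercion warning suggests that at least one contact_number value is neither blank nor a valid integer. One way to see which raw values are responsible is to run a check like the following on the original character column, before it is converted. This is only a sketch: non_integer_values is a hypothetical helper, and the sample vector is made up for illustration.

```r
# Hypothetical helper: list the distinct non-blank strings that as.integer()
# cannot convert (these are the values that trigger the warning).
non_integer_values <- function(x) {
  x <- trimws(x)
  converted <- suppressWarnings(as.integer(x))
  unique(x[!is.na(x) & x != "" & is.na(converted)])
}

# Made-up sample values for illustration.
contact_sample <- c("75", "31", "-", "", "17", "n/a", "-")
non_integer_values(contact_sample)
```

Once the offending values are known, they can be mapped to NA explicitly (for example with na_if) so that the conversion runs without warnings.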

For policy_data, some policies do not have an end date. Since we plan to look at the effects of each policy from its start date onward, the missing end dates do not need to be filled in at this point.

policy_data <- policy_data %>% mutate(start_date = as.Date(start_date), end_date = as.Date(end_date))
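
Leaving the open end dates as NA still allows date-based filtering, as long as NA is read as "still in effect". A minimal sketch under that assumption; is_active is a hypothetical helper, and the toy rows copy the six policies shown in the head(policy_data) output above:

```r
# Toy copy of the first six policies; NA end_date means no recorded end.
policy_toy <- data.frame(
  policy_id  = 1:6,
  start_date = as.Date(c("2020-01-03", "2020-01-20", "2020-01-28",
                         "2020-02-23", "2020-02-04", "2020-02-12")),
  end_date   = as.Date(c("2020-01-19", "2020-01-27", "2020-02-22",
                         NA, NA, NA))
)

# Hypothetical helper: TRUE for policies in effect on the given day,
# treating a missing end_date as open-ended.
is_active <- function(policies, on) {
  on <- as.Date(on)
  policies$start_date <= on &
    (is.na(policies$end_date) | policies$end_date >= on)
}

active_ids <- policy_toy$policy_id[is_active(policy_toy, "2020-02-10")]
active_ids
```

On 2020-02-10 this selects policy 3 (which ends on 2020-02-22) and policy 5 (which has no end date), without any end dates being imputed.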

For search_trend_data, the date values should be converted to date data type.

search_trend_data <- search_trend_data %>% mutate(date = as.Date(date))

For seoul_floating_data,

  1. date: converted to date data type

  2. sex: converted into a factor

seoul_floating_data <- seoul_floating_data %>% mutate(date = as.Date(date), sex = as.factor(sex)) 

For time_data, time_age_data, time_gender_data, time_province_data, and weather_data, all date values should be converted to the Date data type. In time_gender_data, sex is also converted into a factor.

time_data <- time_data %>% mutate(date = as.Date(date))
time_age_data <- time_age_data %>% mutate(date = as.Date(date))
time_province_data <- time_province_data %>% mutate(date = as.Date(date))
time_gender_data <- time_gender_data %>% mutate(date = as.Date(date), sex = na_if(sex, ""), sex = as.factor(sex))
weather_data <- weather_data %>% mutate(date = as.Date(date))
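
Since the same date conversion is repeated for several data frames, it could also be wrapped in a small helper. A minimal sketch in base R: convert_date_column is a hypothetical name, the toy data frames stand in for the real ones, and the format is passed explicitly on the assumption that all dates are ISO formatted as in the output above.

```r
# Hypothetical helper: convert one character column to Date.
# Blank strings do not match the format and become NA.
convert_date_column <- function(df, col = "date") {
  df[[col]] <- as.Date(df[[col]], format = "%Y-%m-%d")
  df
}

# Toy stand-ins for the real data frames.
toy_frames <- list(
  time_data = data.frame(
    date = c("2020-01-20", "2020-01-21"),
    confirmed = c(1L, 1L),
    stringsAsFactors = FALSE
  ),
  weather_data = data.frame(
    date = c("2016-01-01", ""),
    avg_temp = c(1.2, 5.3),
    stringsAsFactors = FALSE
  )
)

toy_frames <- lapply(toy_frames, convert_date_column)
sapply(toy_frames, function(df) class(df$date))
```

Passing the format explicitly avoids relying on how as.Date() guesses the format from the first value of the column.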

Next, let’s look at missing/null values.

case_data %>% summarise(missing_latitude = sum(is.na(latitude)))
case_data %>% summarise(missing_longitude = sum(is.na(longitude)))
case_data %>% summarise(missing_city = sum(city == "-"))
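
The per-column checks can also be collected into one overview. A minimal sketch with base R's colSums(is.na(...)) on a toy data frame whose rows are modeled on case_data; it assumes the "-" placeholders have already been converted to NA:

```r
# Toy rows modeled on case_data, with placeholders already replaced by NA.
case_toy <- data.frame(
  city = c("Yongsan-gu", "from other city", NA, NA),
  latitude = c(37.538621, NA, NA, NA),
  longitude = c(126.992652, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column.
missing_counts <- colSums(is.na(case_toy))
missing_counts
```

Seeing the counts side by side makes it easier to spot columns whose missing values do not line up.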

For case_data, latitude and longitude give the location of the group, so missing coordinates would suggest that the infection case was not a group infection but something else. However, the number of missing city values does not seem to match the number of missing latitude/longitude values. Let’s investigate further.

# Convert "-" placeholders in city to NA
case_data <- case_data %>%  mutate(city = na_if(city, "-"))
case_data %>% filter(is.na(city))
##    case_id          province city group                   infection_case
## 1  1000034             Seoul <NA>  TRUE                      Orange Life
## 2  1000036             Seoul <NA> FALSE                  overseas inflow
## 3  1000037             Seoul <NA> FALSE             contact with patient
## 4  1000038             Seoul <NA> FALSE                              etc
## 5  1100008             Busan <NA> FALSE                  overseas inflow
## 6  1100009             Busan <NA> FALSE             contact with patient
## 7  1100010             Busan <NA> FALSE                              etc
## 8  1200008             Daegu <NA> FALSE                  overseas inflow
## 9  1200009             Daegu <NA> FALSE             contact with patient
## 10 1200010             Daegu <NA> FALSE                              etc
## 11 1300003           Gwangju <NA> FALSE                  overseas inflow
## 12 1300004           Gwangju <NA> FALSE             contact with patient
## 13 1300005           Gwangju <NA> FALSE                              etc
## 14 1400005           Incheon <NA> FALSE                  overseas inflow
## 15 1400006           Incheon <NA> FALSE             contact with patient
## 16 1400007           Incheon <NA> FALSE                              etc
## 17 1500001           Daejeon <NA>  TRUE    Door-to-door sales in Daejeon
## 18 1500008           Daejeon <NA> FALSE                  overseas inflow
## 19 1500009           Daejeon <NA> FALSE             contact with patient
## 20 1500010           Daejeon <NA> FALSE                              etc
## 21 1600002             Ulsan <NA> FALSE                  overseas inflow
## 22 1600003             Ulsan <NA> FALSE             contact with patient
## 23 1600004             Ulsan <NA> FALSE                              etc
## 24 1700004            Sejong <NA> FALSE                  overseas inflow
## 25 1700005            Sejong <NA> FALSE             contact with patient
## 26 1700006            Sejong <NA> FALSE                              etc
## 27 2000009       Gyeonggi-do <NA>  TRUE SMR Newly Planted Churches Group
## 28 2000020       Gyeonggi-do <NA> FALSE                  overseas inflow
## 29 2000021       Gyeonggi-do <NA> FALSE             contact with patient
## 30 2000022       Gyeonggi-do <NA> FALSE                              etc
## 31 3000006        Gangwon-do <NA> FALSE                  overseas inflow
## 32 3000007        Gangwon-do <NA> FALSE             contact with patient
## 33 3000008        Gangwon-do <NA> FALSE                              etc
## 34 4000005 Chungcheongbuk-do <NA> FALSE                  overseas inflow
## 35 4000006 Chungcheongbuk-do <NA> FALSE             contact with patient
## 36 4000007 Chungcheongbuk-do <NA> FALSE                              etc
## 37 4100006 Chungcheongnam-do <NA> FALSE                  overseas inflow
## 38 4100007 Chungcheongnam-do <NA> FALSE             contact with patient
## 39 4100008 Chungcheongnam-do <NA> FALSE                              etc
## 40 5000004      Jeollabuk-do <NA> FALSE                  overseas inflow
## 41 5000005      Jeollabuk-do <NA> FALSE                              etc
## 42 5100003      Jeollanam-do <NA> FALSE                  overseas inflow
## 43 5100004      Jeollanam-do <NA> FALSE             contact with patient
## 44 5100005      Jeollanam-do <NA> FALSE                              etc
## 45 6000011  Gyeongsangbuk-do <NA> FALSE                  overseas inflow
## 46 6000012  Gyeongsangbuk-do <NA> FALSE             contact with patient
## 47 6000013  Gyeongsangbuk-do <NA> FALSE                              etc
## 48 6100010  Gyeongsangnam-do <NA> FALSE                  overseas inflow
## 49 6100011  Gyeongsangnam-do <NA> FALSE             contact with patient
## 50 6100012  Gyeongsangnam-do <NA> FALSE                              etc
## 51 7000001           Jeju-do <NA> FALSE                  overseas inflow
## 52 7000002           Jeju-do <NA> FALSE             contact with patient
## 53 7000003           Jeju-do <NA> FALSE                              etc
##    confirmed latitude longitude
## 1          1       NA        NA
## 2        298       NA        NA
## 3        162       NA        NA
## 4        100       NA        NA
## 5         36       NA        NA
## 6         19       NA        NA
## 7         30       NA        NA
## 8         41       NA        NA
## 9        917       NA        NA
## 10       747       NA        NA
## 11        23       NA        NA
## 12         5       NA        NA
## 13         1       NA        NA
## 14        68       NA        NA
## 15         6       NA        NA
## 16        11       NA        NA
## 17        55       NA        NA
## 18        15       NA        NA
## 19        15       NA        NA
## 20        15       NA        NA
## 21        25       NA        NA
## 22         3       NA        NA
## 23         7       NA        NA
## 24         5       NA        NA
## 25         3       NA        NA
## 26         1       NA        NA
## 27        25       NA        NA
## 28       305       NA        NA
## 29        63       NA        NA
## 30        84       NA        NA
## 31        16       NA        NA
## 32         0       NA        NA
## 33         7       NA        NA
## 34        13       NA        NA
## 35         8       NA        NA
## 36        11       NA        NA
## 37        16       NA        NA
## 38         2       NA        NA
## 39        12       NA        NA
## 40        12       NA        NA
## 41         5       NA        NA
## 42        14       NA        NA
## 43         4       NA        NA
## 44         4       NA        NA
## 45        22       NA        NA
## 46       190       NA        NA
## 47       133       NA        NA
## 48        26       NA        NA
## 49         6       NA        NA
## 50        20       NA        NA
## 51        14       NA        NA
## 52         0       NA        NA
## 53         4       NA        NA
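
Every row with a missing city in the output above also lacks coordinates, but the reverse need not hold. A cross-tabulation of the two kinds of missingness makes the overlap explicit. A minimal sketch using base R's table() on a toy data frame (the rows are modeled on case_data and are not the full dataset):

```r
# Toy rows modeled on case_data.
case_toy <- data.frame(
  city = c("Yongsan-gu", "from other city", "Gangnam-gu", NA, NA),
  group = c(TRUE, TRUE, TRUE, TRUE, FALSE),
  latitude = c(37.538621, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Rows: is city missing? Columns: is the coordinate missing?
overlap <- table(
  city_missing = is.na(case_toy$city),
  coord_missing = is.na(case_toy$latitude)
)
overlap
```

The cell with city_missing = FALSE and coord_missing = TRUE counts rows that name a city (or "from other city") but have no coordinates, which is where the two counts diverge.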
case_data %>% filter(is.na(latitude) & is.na(longitude))
##     case_id          province            city group
## 1   1000007             Seoul from other city  TRUE
## 2   1000009             Seoul from other city  TRUE
## 3   1000018             Seoul      Gangnam-gu  TRUE
## 4   1000019             Seoul from other city  TRUE
## 5   1000020             Seoul    Geumcheon-gu  TRUE
## 6   1000021             Seoul from other city  TRUE
## 7   1000022             Seoul from other city  TRUE
## 8   1000027             Seoul       Seocho-gu  TRUE
## 9   1000028             Seoul from other city  TRUE
## 10  1000031             Seoul from other city  TRUE
## 11  1000033             Seoul from other city  TRUE
## 12  1000034             Seoul            <NA>  TRUE
## 13  1000036             Seoul            <NA> FALSE
## 14  1000037             Seoul            <NA> FALSE
## 15  1000038             Seoul            <NA> FALSE
## 16  1100002             Busan from other city  TRUE
## 17  1100006             Busan from other city  TRUE
## 18  1100007             Busan from other city  TRUE
## 19  1100008             Busan            <NA> FALSE
## 20  1100009             Busan            <NA> FALSE
## 21  1100010             Busan            <NA> FALSE
## 22  1200006             Daegu from other city  TRUE
## 23  1200007             Daegu from other city  TRUE
## 24  1200008             Daegu            <NA> FALSE
## 25  1200009             Daegu            <NA> FALSE
## 26  1200010             Daegu            <NA> FALSE
## 27  1300002           Gwangju from other city  TRUE
## 28  1300003           Gwangju            <NA> FALSE
## 29  1300004           Gwangju            <NA> FALSE
## 30  1300005           Gwangju            <NA> FALSE
## 31  1400001           Incheon from other city  TRUE
## 32  1400002           Incheon from other city  TRUE
## 33  1400003           Incheon from other city  TRUE
## 34  1400004           Incheon from other city  TRUE
## 35  1400005           Incheon            <NA> FALSE
## 36  1400006           Incheon            <NA> FALSE
## 37  1400007           Incheon            <NA> FALSE
## 38  1500001           Daejeon            <NA>  TRUE
## 39  1500006           Daejeon from other city  TRUE
## 40  1500007           Daejeon from other city  TRUE
## 41  1500008           Daejeon            <NA> FALSE
## 42  1500009           Daejeon            <NA> FALSE
## 43  1500010           Daejeon            <NA> FALSE
## 44  1600001             Ulsan from other city  TRUE
## 45  1600002             Ulsan            <NA> FALSE
## 46  1600003             Ulsan            <NA> FALSE
## 47  1600004             Ulsan            <NA> FALSE
## 48  1700003            Sejong from other city  TRUE
## 49  1700004            Sejong            <NA> FALSE
## 50  1700005            Sejong            <NA> FALSE
## 51  1700006            Sejong            <NA> FALSE
## 52  2000003       Gyeonggi-do from other city  TRUE
## 53  2000004       Gyeonggi-do from other city  TRUE
## 54  2000006       Gyeonggi-do from other city  TRUE
## 55  2000007       Gyeonggi-do from other city  TRUE
## 56  2000008       Gyeonggi-do from other city  TRUE
## 57  2000009       Gyeonggi-do            <NA>  TRUE
## 58  2000015       Gyeonggi-do from other city  TRUE
## 59  2000016       Gyeonggi-do from other city  TRUE
## 60  2000017       Gyeonggi-do from other city  TRUE
## 61  2000018       Gyeonggi-do from other city  TRUE
## 62  2000019       Gyeonggi-do     Seongnam-si  TRUE
## 63  2000020       Gyeonggi-do            <NA> FALSE
## 64  2000021       Gyeonggi-do            <NA> FALSE
## 65  2000022       Gyeonggi-do            <NA> FALSE
## 66  3000001        Gangwon-do from other city  TRUE
## 67  3000002        Gangwon-do from other city  TRUE
## 68  3000004        Gangwon-do from other city  TRUE
## 69  3000005        Gangwon-do from other city  TRUE
## 70  3000006        Gangwon-do            <NA> FALSE
## 71  3000007        Gangwon-do            <NA> FALSE
## 72  3000008        Gangwon-do            <NA> FALSE
## 73  4000002 Chungcheongbuk-do from other city  TRUE
## 74  4000003 Chungcheongbuk-do from other city  TRUE
## 75  4000004 Chungcheongbuk-do from other city  TRUE
## 76  4000005 Chungcheongbuk-do            <NA> FALSE
## 77  4000006 Chungcheongbuk-do            <NA> FALSE
## 78  4000007 Chungcheongbuk-do            <NA> FALSE
## 79  4100002 Chungcheongnam-do from other city  TRUE
## 80  4100004 Chungcheongnam-do from other city  TRUE
## 81  4100005 Chungcheongnam-do from other city  TRUE
## 82  4100006 Chungcheongnam-do            <NA> FALSE
## 83  4100007 Chungcheongnam-do            <NA> FALSE
## 84  4100008 Chungcheongnam-do            <NA> FALSE
## 85  5000001      Jeollabuk-do from other city  TRUE
## 86  5000002      Jeollabuk-do from other city  TRUE
## 87  5000003      Jeollabuk-do from other city  TRUE
## 88  5000004      Jeollabuk-do            <NA> FALSE
## 89  5000005      Jeollabuk-do            <NA> FALSE
## 90  5100002      Jeollanam-do from other city  TRUE
## 91  5100003      Jeollanam-do            <NA> FALSE
## 92  5100004      Jeollanam-do            <NA> FALSE
## 93  5100005      Jeollanam-do            <NA> FALSE
## 94  6000001  Gyeongsangbuk-do from other city  TRUE
## 95  6000005  Gyeongsangbuk-do from other city  TRUE
## 96  6000010  Gyeongsangbuk-do         Gumi-si  TRUE
## 97  6000011  Gyeongsangbuk-do            <NA> FALSE
## 98  6000012  Gyeongsangbuk-do            <NA> FALSE
## 99  6000013  Gyeongsangbuk-do            <NA> FALSE
## 100 6100001  Gyeongsangnam-do from other city  TRUE
## 101 6100008  Gyeongsangnam-do from other city  TRUE
## 102 6100009  Gyeongsangnam-do from other city  TRUE
## 103 6100010  Gyeongsangnam-do            <NA> FALSE
## 104 6100011  Gyeongsangnam-do            <NA> FALSE
## 105 6100012  Gyeongsangnam-do            <NA> FALSE
## 106 7000001           Jeju-do            <NA> FALSE
## 107 7000002           Jeju-do            <NA> FALSE
## 108 7000003           Jeju-do            <NA> FALSE
## 109 7000004           Jeju-do from other city  TRUE
##                                    infection_case confirmed latitude longitude
## 1                SMR Newly Planted Churches Group        36       NA        NA
## 2                        Coupang Logistics Center        25       NA        NA
## 3                  Gangnam Yeoksam-dong gathering         6       NA        NA
## 4                      Daejeon door-to-door sales         1       NA        NA
## 5   Geumcheon-gu rice milling machine manufacture         6       NA        NA
## 6                              Shincheonji Church         8       NA        NA
## 7                       Guri Collective Infection         5       NA        NA
## 8                                   Seocho Family         5       NA        NA
## 9                      Anyang Gunpo Pastors Group         1       NA        NA
## 10                                Yongin Brothers         4       NA        NA
## 11                        Uiwang Logistics Center         2       NA        NA
## 12                                    Orange Life         1       NA        NA
## 13                                overseas inflow       298       NA        NA
## 14                           contact with patient       162       NA        NA
## 15                                            etc       100       NA        NA
## 16                             Shincheonji Church        12       NA        NA
## 17                                  Itaewon Clubs         4       NA        NA
## 18                       Cheongdo Daenam Hospital         1       NA        NA
## 19                                overseas inflow        36       NA        NA
## 20                           contact with patient        19       NA        NA
## 21                                            etc        30       NA        NA
## 22                                  Itaewon Clubs         2       NA        NA
## 23                       Cheongdo Daenam Hospital         2       NA        NA
## 24                                overseas inflow        41       NA        NA
## 25                           contact with patient       917       NA        NA
## 26                                            etc       747       NA        NA
## 27                             Shincheonji Church         9       NA        NA
## 28                                overseas inflow        23       NA        NA
## 29                           contact with patient         5       NA        NA
## 30                                            etc         1       NA        NA
## 31                                  Itaewon Clubs        53       NA        NA
## 32                       Coupang Logistics Center        42       NA        NA
## 33                            Guro-gu Call Center        20       NA        NA
## 34                             Shincheonji Church         2       NA        NA
## 35                                overseas inflow        68       NA        NA
## 36                           contact with patient         6       NA        NA
## 37                                            etc        11       NA        NA
## 38                  Door-to-door sales in Daejeon        55       NA        NA
## 39                             Shincheonji Church         2       NA        NA
## 40                           Seosan-si Laboratory         2       NA        NA
## 41                                overseas inflow        15       NA        NA
## 42                           contact with patient        15       NA        NA
## 43                                            etc        15       NA        NA
## 44                             Shincheonji Church        16       NA        NA
## 45                                overseas inflow        25       NA        NA
## 46                           contact with patient         3       NA        NA
## 47                                            etc         7       NA        NA
## 48                             Shincheonji Church         1       NA        NA
## 49                                overseas inflow         5       NA        NA
## 50                           contact with patient         3       NA        NA
## 51                                            etc         1       NA        NA
## 52                                  Itaewon Clubs        59       NA        NA
## 53                                        Richway        58       NA        NA
## 54                            Guro-gu Call Center        50       NA        NA
## 55                             Shincheonji Church        29       NA        NA
## 56                    Yangcheon Table Tennis Club        28       NA        NA
## 57               SMR Newly Planted Churches Group        25       NA        NA
## 58                 Korea Campus Crusade of Christ         7       NA        NA
## 59  Geumcheon-gu rice milling machine manufacture         6       NA        NA
## 60                                Wangsung Church         6       NA        NA
## 61          Seoul City Hall Station safety worker         5       NA        NA
## 62                   Seongnam neighbors gathering         5       NA        NA
## 63                                overseas inflow       305       NA        NA
## 64                           contact with patient        63       NA        NA
## 65                                            etc        84       NA        NA
## 66                             Shincheonji Church        17       NA        NA
## 67                  Uijeongbu St. Mary’s Hospital        10       NA        NA
## 68                                        Richway         4       NA        NA
## 69  Geumcheon-gu rice milling machine manufacture         4       NA        NA
## 70                                overseas inflow        16       NA        NA
## 71                           contact with patient         0       NA        NA
## 72                                            etc         7       NA        NA
## 73                                  Itaewon Clubs         9       NA        NA
## 74                            Guro-gu Call Center         2       NA        NA
## 75                             Shincheonji Church         6       NA        NA
## 76                                overseas inflow        13       NA        NA
## 77                           contact with patient         8       NA        NA
## 78                                            etc        11       NA        NA
## 79                  Door-to-door sales in Daejeon        10       NA        NA
## 80                                        Richway         3       NA        NA
## 81              Eunpyeong-Boksagol culture center         3       NA        NA
## 82                                overseas inflow        16       NA        NA
## 83                           contact with patient         2       NA        NA
## 84                                            etc        12       NA        NA
## 85                                  Itaewon Clubs         2       NA        NA
## 86                  Door-to-door sales in Daejeon         3       NA        NA
## 87                             Shincheonji Church         1       NA        NA
## 88                                overseas inflow        12       NA        NA
## 89                                            etc         5       NA        NA
## 90                             Shincheonji Church         1       NA        NA
## 91                                overseas inflow        14       NA        NA
## 92                           contact with patient         4       NA        NA
## 93                                            etc         4       NA        NA
## 94                             Shincheonji Church       566       NA        NA
## 95                           Pilgrimage to Israel        41       NA        NA
## 96                               Gumi Elim Church        10       NA        NA
## 97                                overseas inflow        22       NA        NA
## 98                           contact with patient       190       NA        NA
## 99                                            etc       133       NA        NA
## 100                            Shincheonji Church        32       NA        NA
## 101                                 Itaewon Clubs         2       NA        NA
## 102                                 Onchun Church         2       NA        NA
## 103                               overseas inflow        26       NA        NA
## 104                          contact with patient         6       NA        NA
## 105                                           etc        20       NA        NA
## 106                               overseas inflow        14       NA        NA
## 107                          contact with patient         0       NA        NA
## 108                                           etc         4       NA        NA
## 109                                 Itaewon Clubs         1       NA        NA

As shown above, several rows are missing latitude and longitude values even though a city is given:
Gangnam-gu (1000018), Geumcheon-gu (1000020), Seocho-gu (1000027), Seongnam-si (2000019), Gumi-si (6000010)

Let’s look at the other rows that share the same city values.

case_data %>% filter(city == "Gangnam-gu")
##   case_id province       city group                  infection_case confirmed
## 1 1000014    Seoul Gangnam-gu  TRUE          Samsung Medical Center         7
## 2 1000018    Seoul Gangnam-gu  TRUE  Gangnam Yeoksam-dong gathering         6
## 3 1000025    Seoul Gangnam-gu  TRUE           Gangnam Dongin Church         1
## 4 1000029    Seoul Gangnam-gu  TRUE Samsung Fire & Marine Insurance         4
##   latitude longitude
## 1 37.48825  127.0856
## 2       NA        NA
## 3 37.52233  127.0574
## 4 37.49828  127.0301
case_data %>% filter(city == "Geumcheon-gu")
##   case_id province         city group
## 1 1000020    Seoul Geumcheon-gu  TRUE
##                                  infection_case confirmed latitude longitude
## 1 Geumcheon-gu rice milling machine manufacture         6       NA        NA
case_data %>% filter(city == "Seocho-gu")
##   case_id province      city group infection_case confirmed latitude longitude
## 1 1000027    Seoul Seocho-gu  TRUE  Seocho Family         5       NA        NA
case_data %>% filter(city == "Seongnam-si")
##   case_id    province        city group                  infection_case
## 1 2000001 Gyeonggi-do Seongnam-si  TRUE River of Grace Community Church
## 2 2000010 Gyeonggi-do Seongnam-si  TRUE        Bundang Jesaeng Hospital
## 3 2000019 Gyeonggi-do Seongnam-si  TRUE    Seongnam neighbors gathering
##   confirmed latitude longitude
## 1        67 37.45569  127.1616
## 2        22 37.38833  127.1218
## 3         5       NA        NA
case_data %>% filter(city == "Gumi-si")
##   case_id         province    city group   infection_case confirmed latitude
## 1 6000010 Gyeongsangbuk-do Gumi-si  TRUE Gumi Elim Church        10       NA
##   longitude
## 1        NA

As the results above show, this does not look like a data entry error; the latitude and longitude were simply not available or not provided for these cases.
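If approximate coordinates were ever needed for mapping, one option (not part of this analysis) would be to fall back to the city-level coordinates in region_data. The sketch below assumes a join on province and city and uses small made-up tables with placeholder coordinates; `case_toy`, `region_toy`, and `coord_source` are hypothetical names.

```r
library(dplyr)

# Toy stand-ins for case_data and region_data (placeholder values, not real data)
case_toy <- tibble(
  case_id   = c(1L, 2L, 3L),
  province  = c("Seoul", "Seoul", "Seoul"),
  city      = c("Gangnam-gu", "Gangnam-gu", "Seocho-gu"),
  latitude  = c(37.5, NA, NA),
  longitude = c(127.1, NA, NA)
)
region_toy <- tibble(
  province  = c("Seoul", "Seoul"),
  city      = c("Gangnam-gu", "Seocho-gu"),
  latitude  = c(37.4, 37.3),
  longitude = c(127.0, 126.9)
)

# Keep case-level coordinates where present; otherwise use the city-level ones
case_filled <- case_toy %>%
  left_join(region_toy, by = c("province", "city"), suffix = c("", "_city")) %>%
  mutate(
    coord_source = if_else(is.na(latitude), "city fallback", "case record"),
    latitude     = coalesce(latitude, latitude_city),
    longitude    = coalesce(longitude, longitude_city)
  ) %>%
  select(-latitude_city, -longitude_city)

case_filled
```

Recording the source of each coordinate keeps the imputed points distinguishable from the reported ones on any later map.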

In patient_data, some age, city, and infection_case values are missing and stored as empty strings; we replace those with NA.

patient_data <- patient_data %>%  mutate(age = na_if(age, ""), city = na_if(city, ""), infection_case = na_if(infection_case, ""))
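The same replacement can be written once for several columns with `across()`, which is convenient if more character columns turn out to contain empty strings. This is a minimal sketch on a made-up table; `patient_toy` and `patient_clean` are hypothetical names.

```r
library(dplyr)

# Toy stand-in for patient_data (made-up rows)
patient_toy <- tibble(
  age            = c("50s", "", "20s"),
  city           = c("Gangseo-gu", "Mapo-gu", ""),
  infection_case = c("", "overseas inflow", "")
)

# Convert empty strings to NA in every listed column at once
patient_clean <- patient_toy %>%
  mutate(across(c(age, city, infection_case), ~ na_if(.x, "")))

# Count NA values per column as a quick check
colSums(is.na(patient_clean))
```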

As for the outlier values,

summary(patient_data)
##    patient_id            sex           age              country         
##  Min.   :1.000e+09   female:2218   Length:5165        Length:5165       
##  1st Qu.:1.000e+09   male  :1825   Class :character   Class :character  
##  Median :2.000e+09   NA's  :1122   Mode  :character   Mode  :character  
##  Mean   :2.864e+09                                                      
##  3rd Qu.:6.001e+09                                                      
##  Max.   :7.000e+09                                                      
##                                                                         
##    province             city           infection_case     infected_by       
##  Length:5165        Length:5165        Length:5165        Length:5165       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  contact_number      symptom_onset_date   confirmed_date      
##  Min.   :0.000e+00   Min.   :2020-01-19   Min.   :2020-01-20  
##  1st Qu.:2.000e+00   1st Qu.:2020-02-29   1st Qu.:2020-03-04  
##  Median :4.000e+00   Median :2020-03-20   Median :2020-03-27  
##  Mean   :1.274e+06   Mean   :2020-04-05   Mean   :2020-04-10  
##  3rd Qu.:1.400e+01   3rd Qu.:2020-05-23   3rd Qu.:2020-05-27  
##  Max.   :1.000e+09   Max.   :2020-06-28   Max.   :2020-06-30  
##  NA's   :4380        NA's   :4476         NA's   :3           
##  released_date        deceased_date           state          
##  Min.   :2020-02-05   Min.   :2020-02-19   Length:5165       
##  1st Qu.:2020-03-20   1st Qu.:2020-03-02   Class :character  
##  Median :2020-03-28   Median :2020-03-09   Mode  :character  
##  Mean   :2020-04-03   Mean   :2020-03-17                     
##  3rd Qu.:2020-04-14   3rd Qu.:2020-03-30                     
##  Max.   :2020-06-28   Max.   :2020-05-25                     
##  NA's   :3578         NA's   :5099

The maximum value of contact_number is extraordinarily large, which suggests a data error.

patient_data %>% arrange(desc(contact_number))

The largest value appears to be a data error: nobody can plausibly have had 1000000796 contacts, and the magnitude and digit length of the value strongly suggest that a patient_id was mistakenly entered in the contact_number field. The row also has no infection_case, so keeping it would add nothing to the analysis. We will therefore drop the entire row, but first let’s check whether any other patient is recorded as having been infected by this patient.

patient_data %>% filter(infected_by == "1000000819")
##  [1] patient_id         sex                age                country           
##  [5] province           city               infection_case     infected_by       
##  [9] contact_number     symptom_onset_date confirmed_date     released_date     
## [13] deceased_date      state             
## <0 rows> (or 0-length row.names)

Since no one is recorded as infected by this patient, the row can be safely removed.

patient_data <- patient_data %>%  filter(patient_id != "1000000819")
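A rule-based check can surface this kind of error without scanning the sorted table by eye. The sketch below uses an assumed heuristic, namely that any contact count at least as large as the smallest patient_id is really an ID entered in the wrong field. The rows are made up and `patient_toy` and `suspect` are hypothetical names.

```r
library(dplyr)

# Toy stand-in for patient_data (made-up rows)
patient_toy <- tibble(
  patient_id     = c(1000000001, 1000000002, 1000000003),
  contact_number = c(14, NA, 1000000002)
)

# Assumed heuristic: a contact count as large as a patient ID is suspicious
suspect <- patient_toy %>%
  filter(!is.na(contact_number), contact_number >= min(patient_id))

suspect
```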

In addition, the sex column could contain categories beyond female and male; let’s check for any other values.

patient_data %>%  filter(!sex %in% c("female", "male")) %>% filter(!is.na(sex))
##  [1] patient_id         sex                age                country           
##  [5] province           city               infection_case     infected_by       
##  [9] contact_number     symptom_onset_date confirmed_date     released_date     
## [13] deceased_date      state             
## <0 rows> (or 0-length row.names)

There are no sex values other than female and male; the remaining rows are NA, meaning no sex was recorded for those patients.

Let’s examine all the other datasets.

summary(time_data)
##       date                 time             test            negative      
##  Min.   :2020-01-20   Min.   : 0.000   Min.   :      1   Min.   :      0  
##  1st Qu.:2020-02-29   1st Qu.: 0.000   1st Qu.:  96488   1st Qu.:  58774  
##  Median :2020-04-10   Median : 0.000   Median : 503051   Median : 477303  
##  Mean   :2020-04-10   Mean   : 4.123   Mean   : 497780   Mean   : 475484  
##  3rd Qu.:2020-05-20   3rd Qu.:16.000   3rd Qu.: 782558   3rd Qu.: 754222  
##  Max.   :2020-06-30   Max.   :16.000   Max.   :1273766   Max.   :1240157  
##    confirmed        released        deceased    
##  Min.   :    1   Min.   :    0   Min.   :  0.0  
##  1st Qu.: 3443   1st Qu.:   29   1st Qu.: 17.5  
##  Median :10450   Median : 7117   Median :208.0  
##  Mean   : 7835   Mean   : 5604   Mean   :157.1  
##  3rd Qu.:11116   3rd Qu.:10100   3rd Qu.:263.5  
##  Max.   :12800   Max.   :11537   Max.   :282.0
summary(time_age_data)
##       date                 time       age              confirmed   
##  Min.   :2020-03-02   Min.   :0   Length:1089        Min.   :  32  
##  1st Qu.:2020-04-01   1st Qu.:0   Class :character   1st Qu.: 530  
##  Median :2020-05-01   Median :0   Mode  :character   Median :1052  
##  Mean   :2020-05-01   Mean   :0                      Mean   :1158  
##  3rd Qu.:2020-05-31   3rd Qu.:0                      3rd Qu.:1537  
##  Max.   :2020-06-30   Max.   :0                      Max.   :3362  
##     deceased     
##  Min.   :  0.00  
##  1st Qu.:  0.00  
##  Median :  3.00  
##  Mean   : 23.42  
##  3rd Qu.: 35.00  
##  Max.   :139.00
summary(time_gender_data)
##       date                 time       sex        confirmed       deceased    
##  Min.   :2020-03-02   Min.   :0   female:121   Min.   :1591   Min.   :  9.0  
##  1st Qu.:2020-04-01   1st Qu.:0   male  :121   1st Qu.:4328   1st Qu.: 82.0  
##  Median :2020-05-01   Median :0                Median :5118   Median :125.0  
##  Mean   :2020-05-01   Mean   :0                Mean   :5212   Mean   :105.4  
##  3rd Qu.:2020-05-31   3rd Qu.:0                3rd Qu.:6417   3rd Qu.:131.0  
##  Max.   :2020-06-30   Max.   :0                Max.   :7305   Max.   :151.0
summary(time_province_data)
##       date                 time          province           confirmed     
##  Min.   :2020-01-20   Min.   : 0.000   Length:2771        Min.   :   0.0  
##  1st Qu.:2020-02-29   1st Qu.: 0.000   Class :character   1st Qu.:   9.0  
##  Median :2020-04-10   Median : 0.000   Mode  :character   Median :  42.0  
##  Mean   :2020-04-10   Mean   : 4.123                      Mean   : 444.3  
##  3rd Qu.:2020-05-21   3rd Qu.:16.000                      3rd Qu.: 133.0  
##  Max.   :2020-06-30   Max.   :16.000                      Max.   :6906.0  
##     released         deceased     
##  Min.   :   0.0   Min.   :  0.00  
##  1st Qu.:   1.0   1st Qu.:  0.00  
##  Median :  21.0   Median :  0.00  
##  Mean   : 320.7   Mean   :  9.24  
##  3rd Qu.:  92.0   3rd Qu.:  1.00  
##  Max.   :6700.0   Max.   :189.00
summary(region_data)
##       code         province             city              latitude    
##  Min.   :10000   Length:244         Length:244         Min.   :33.49  
##  1st Qu.:14028   Class :character   Class :character   1st Qu.:35.41  
##  Median :30075   Mode  :character   Mode  :character   Median :36.39  
##  Mean   :32912                                         Mean   :36.40  
##  3rd Qu.:51062                                         3rd Qu.:37.47  
##  Max.   :80000                                         Max.   :38.38  
##    longitude     elementary_school_count kindergarten_count university_count 
##  Min.   :126.3   Min.   :   4.00         Min.   :   4.00    Min.   :  0.000  
##  1st Qu.:126.9   1st Qu.:  14.75         1st Qu.:  16.00    1st Qu.:  0.000  
##  Median :127.4   Median :  22.00         Median :  31.00    Median :  1.000  
##  Mean   :127.7   Mean   :  74.18         Mean   : 107.90    Mean   :  4.152  
##  3rd Qu.:128.5   3rd Qu.:  36.25         3rd Qu.:  55.25    3rd Qu.:  3.000  
##  Max.   :130.9   Max.   :6087.00         Max.   :8837.00    Max.   :340.000  
##  academy_ratio   elderly_population_ratio elderly_alone_ratio
##  Min.   :0.190   Min.   : 7.69            Min.   : 3.30      
##  1st Qu.:0.870   1st Qu.:14.12            1st Qu.: 6.10      
##  Median :1.270   Median :18.53            Median : 8.75      
##  Mean   :1.295   Mean   :20.92            Mean   :10.64      
##  3rd Qu.:1.613   3rd Qu.:27.26            3rd Qu.:14.62      
##  Max.   :4.180   Max.   :40.26            Max.   :24.70      
##  nursing_home_count
##  Min.   :   11.0   
##  1st Qu.:  111.0   
##  Median :  300.0   
##  Mean   : 1159.3   
##  3rd Qu.:  694.5   
##  Max.   :94865.0
summary(weather_data)
##       code         province              date               avg_temp     
##  Min.   :10000   Length:26271       Min.   :2016-01-01   Min.   :-14.80  
##  1st Qu.:13500   Class :character   1st Qu.:2017-02-14   1st Qu.:  6.00  
##  Median :20000   Mode  :character   Median :2018-04-01   Median : 14.60  
##  Mean   :32125                      Mean   :2018-03-31   Mean   : 13.86  
##  3rd Qu.:50500                      3rd Qu.:2019-05-16   3rd Qu.: 21.90  
##  Max.   :70000                      Max.   :2020-06-29   Max.   : 33.90  
##                                                          NA's   :15      
##     min_temp          max_temp      precipitation     max_wind_speed 
##  Min.   :-19.200   Min.   :-11.90   Min.   :  0.000   Min.   : 1.00  
##  1st Qu.:  1.400   1st Qu.: 10.90   1st Qu.:  0.000   1st Qu.: 3.80  
##  Median :  9.900   Median : 19.80   Median :  0.000   Median : 4.70  
##  Mean   :  9.665   Mean   : 18.78   Mean   :  1.487   Mean   : 5.11  
##  3rd Qu.: 18.200   3rd Qu.: 26.70   3rd Qu.:  0.000   3rd Qu.: 6.00  
##  Max.   : 30.300   Max.   : 40.00   Max.   :266.000   Max.   :29.40  
##  NA's   :5         NA's   :3                          NA's   :9      
##  most_wind_direction avg_relative_humidity
##  Min.   : 20.0       Min.   : 10.4        
##  1st Qu.: 90.0       1st Qu.: 53.6        
##  Median :200.0       Median : 66.9        
##  Mean   :195.9       Mean   : 65.7        
##  3rd Qu.:290.0       3rd Qu.: 78.6        
##  Max.   :360.0       Max.   :100.0        
##  NA's   :29          NA's   :20
summary(search_trend_data)
##       date                 cold               flu             pneumonia       
##  Min.   :2016-01-01   Min.   : 0.05163   Min.   : 0.00981   Min.   : 0.06881  
##  1st Qu.:2017-02-14   1st Qu.: 0.10663   1st Qu.: 0.04210   1st Qu.: 0.12863  
##  Median :2018-03-31   Median : 0.13317   Median : 0.09785   Median : 0.16445  
##  Mean   :2018-03-31   Mean   : 0.19051   Mean   : 0.24495   Mean   : 0.22143  
##  3rd Qu.:2019-05-15   3rd Qu.: 0.16590   3rd Qu.: 0.25004   3rd Qu.: 0.20977  
##  Max.   :2020-06-29   Max.   :15.72071   Max.   :27.32727   Max.   :11.39320  
##   coronavirus       
##  Min.   :  0.00154  
##  1st Qu.:  0.00627  
##  Median :  0.00890  
##  Mean   :  1.86252  
##  3rd Qu.:  0.01316  
##  Max.   :100.00000
summary(seoul_floating_data)
##       date                 hour         birth_year     sex        
##  Min.   :2020-01-01   Min.   : 0.00   Min.   :20   female:542400  
##  1st Qu.:2020-02-07   1st Qu.: 5.00   1st Qu.:30   male  :542400  
##  Median :2020-03-17   Median :11.00   Median :45                  
##  Mean   :2020-03-16   Mean   :11.48   Mean   :45                  
##  3rd Qu.:2020-04-23   3rd Qu.:17.00   3rd Qu.:60                  
##  Max.   :2020-05-31   Max.   :23.00   Max.   :70                  
##    province             city               fp_num      
##  Length:1084800     Length:1084800     Min.   :  3630  
##  Class :character   Class :character   1st Qu.: 18350  
##  Mode  :character   Mode  :character   Median : 25510  
##                                        Mean   : 27427  
##                                        3rd Qu.: 33940  
##                                        Max.   :127640
summary(policy_data)
##    policy_id    country              type            gov_policy       
##  Min.   : 1   Length:61          Length:61          Length:61         
##  1st Qu.:16   Class :character   Class :character   Class :character  
##  Median :31   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :31                                                           
##  3rd Qu.:46                                                           
##  Max.   :61                                                           
##                                                                       
##     detail            start_date            end_date         
##  Length:61          Min.   :2020-01-03   Min.   :2020-01-19  
##  Class :character   1st Qu.:2020-02-29   1st Qu.:2020-04-06  
##  Mode  :character   Median :2020-03-15   Median :2020-05-27  
##                     Mean   :2020-03-22   Mean   :2020-05-02  
##                     3rd Qu.:2020-04-16   3rd Qu.:2020-06-03  
##                     Max.   :2020-06-10   Max.   :2020-06-14  
##                                          NA's   :37

Next, we can look at the date range coverage.

range(time_data$date, na.rm = TRUE)
## [1] "2020-01-20" "2020-06-30"
range(time_age_data$date, na.rm = TRUE)
## [1] "2020-03-02" "2020-06-30"
range(time_gender_data$date, na.rm = TRUE)
## [1] "2020-03-02" "2020-06-30"
range(time_province_data$date, na.rm = TRUE)
## [1] "2020-01-20" "2020-06-30"

As the results above show, all four time-series datasets end on 2020-06-30. The overall and province series start on 2020-01-20, whereas the age and gender series start later, on 2020-03-02, so comparisons that involve age or gender are limited to the period from 2020-03-02 onward.
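The coverage check can also be collected into a single table, which makes differing start dates easier to spot. This is a sketch in base R on toy data frames that contain only the first and last dates reported above; `datasets` and `coverage` are hypothetical names.

```r
# Toy stand-ins holding only the first and last dates of two of the series
time_toy     <- data.frame(date = as.Date(c("2020-01-20", "2020-06-30")))
time_age_toy <- data.frame(date = as.Date(c("2020-03-02", "2020-06-30")))

datasets <- list(time = time_toy, time_age = time_age_toy)

# One row per dataset: first date, last date, number of distinct dates
coverage <- do.call(rbind, lapply(names(datasets), function(nm) {
  d <- datasets[[nm]]$date
  data.frame(
    dataset = nm,
    start   = min(d, na.rm = TRUE),
    end     = max(d, na.rm = TRUE),
    n_dates = length(unique(d[!is.na(d)]))
  )
}))

coverage
```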

Next, we check that patient_id is unique.

sum(duplicated(patient_data$patient_id))
## [1] 1

As the result above shows, there is one duplicated patient_id.

patient_data %>% 
  filter(duplicated(patient_id) | duplicated(patient_id, fromLast = TRUE))
##   patient_id    sex age country province        city  infection_case
## 1 1200012238 female 20s   Korea    Daegu Icheon-dong overseas inflow
## 2 1200012238 female 20s   Korea    Daegu      Nam-gu overseas inflow
##   infected_by contact_number symptom_onset_date confirmed_date released_date
## 1                         NA               <NA>     2020-06-17          <NA>
## 2                         NA               <NA>     2020-06-17          <NA>
##   deceased_date    state
## 1          <NA> isolated
## 2          <NA> isolated

According to Wikipedia, Icheon-dong is a sub-district within Nam-gu, so the two rows can be assumed to describe the same patient, and it is safe to drop the duplicate row.

patient_data <- patient_data %>% filter(!duplicated(patient_id))
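Before dropping duplicates, it can help to see which columns actually differ between the copies. The sketch below counts distinct values per column within each duplicated ID. The rows are made up and `dup_toy` and `dup_diff` are hypothetical names.

```r
library(dplyr)

# Toy stand-in with one duplicated ID (made-up rows)
dup_toy <- tibble(
  patient_id = c(1, 1, 2),
  province   = c("Daegu", "Daegu", "Seoul"),
  city       = c("Icheon-dong", "Nam-gu", "Jongno-gu")
)

# For each duplicated ID, count distinct values per column;
# a count above 1 marks a column where the copies disagree
dup_diff <- dup_toy %>%
  group_by(patient_id) %>%
  filter(n() > 1) %>%
  summarise(across(everything(), n_distinct), .groups = "drop")

dup_diff
```

Note that `filter(!duplicated(patient_id))` keeps the first occurrence, so the surviving row carries whichever city value appears first in the data.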

Next, we check for logical inconsistencies in the date values: a released_date that falls after the deceased_date would indicate a data entry error.

patient_data %>% filter(released_date > deceased_date)
##  [1] patient_id         sex                age                country           
##  [5] province           city               infection_case     infected_by       
##  [9] contact_number     symptom_onset_date confirmed_date     released_date     
## [13] deceased_date      state             
## <0 rows> (or 0-length row.names)

The check returns no rows, so patient_data contains no such inconsistency.
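The same idea extends to the other date columns. The sketch below flags rows where symptom onset falls after confirmation or release falls before confirmation; these two rules are assumptions added for illustration, and an onset after confirmation may be legitimate (for example, a patient tested before symptoms appeared), so flagged rows are candidates for review rather than confirmed errors. The rows are made up and `dates_toy` and `date_issues` are hypothetical names.

```r
library(dplyr)

# Toy stand-in for the date columns of patient_data (made-up rows)
dates_toy <- tibble(
  patient_id         = 1:3,
  symptom_onset_date = as.Date(c("2020-02-01", "2020-03-10", NA)),
  confirmed_date     = as.Date(c("2020-02-03", "2020-03-05", "2020-04-01")),
  released_date      = as.Date(c("2020-02-20", NA, "2020-03-25"))
)

# Comparisons with NA yield NA; coalesce() treats those as "no issue found"
date_issues <- dates_toy %>%
  mutate(
    onset_after_confirmed     = symptom_onset_date > confirmed_date,
    released_before_confirmed = released_date < confirmed_date
  ) %>%
  filter(coalesce(onset_after_confirmed, FALSE) |
           coalesce(released_before_confirmed, FALSE))

date_issues
```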

Let’s do some distribution analysis.

For the patient data,

ggplot(data = patient_data, mapping = aes(x = sex)) + geom_bar()

The sex variable takes only the values female, male, and NA (no sex information provided).

ggplot(patient_data, aes(x = contact_number)) + geom_histogram(bins = 30)
## Warning: Removed 4379 rows containing non-finite outside the scale range
## (`stat_bin()`).

There are a few outlying values, yet a contact count of around 1000 is plausible given the possibility of mass infection or group spread.

Let’s look at province, a categorical variable one level above city.

ggplot(data = patient_data, mapping = aes(x = province)) + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Patient information is present for all 17 provinces, although the data source states that not all patient information from Daegu was provided.

For time_age_data, let’s look at the age categories.

ggplot(time_age_data, mapping = aes(x = age)) + geom_bar()

All age groups appear with equal frequency.

Let’s look at time_gender data.

ggplot(data = time_gender_data, mapping = aes(x = sex)) + geom_bar()

Here as well, both categories appear with equal frequency.

Finally, for time_province_data, let’s look at the province categories.

ggplot(time_province_data, mapping = aes(x = province)) + geom_bar() + coord_flip()

As with the other time-series datasets, the provinces appear with equal frequency.

The time-series datasets exhibit consistent and evenly distributed observations across dates. Each categorical group is represented uniformly throughout the time period, allowing for valid comparison of confirmed case and deceased case rates without concerns of temporal imbalance or reporting gaps.
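The balance read off the bar charts can also be verified numerically. The sketch below checks three things on a made-up gender series: every category has the same number of rows, every date carries each category exactly once, and the dates have no gaps. `time_gender_toy` and the three result names are hypothetical.

```r
library(dplyr)

# Toy stand-in for time_gender_data: three consecutive days, two categories
time_gender_toy <- tibble(
  date = rep(as.Date("2020-03-02") + 0:2, each = 2),
  sex  = rep(c("female", "male"), times = 3)
)

# 1. Same number of rows in every category
rows_per_group <- time_gender_toy %>% count(sex)
balanced <- n_distinct(rows_per_group$n) == 1

# 2. Every date carries each category exactly once
rows_per_date <- time_gender_toy %>% count(date)
complete_dates <- all(rows_per_date$n == n_distinct(time_gender_toy$sex))

# 3. No missing days between the first and last date
no_gaps <- all(diff(sort(unique(time_gender_toy$date))) == 1)

c(balanced = balanced, complete_dates = complete_dates, no_gaps = no_gaps)
```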

Phase 2: Investigate Initial Questions

This exploratory analysis investigates the first wave of COVID-19 in South Korea using national surveillance data. The analysis focuses on three outcomes: infection growth over time, demographic disparities in case fatality rates, and potential indicators of reduced disease severity such as recovery rates and policy interventions.

The data used for this analysis comes directly from https://www.kaggle.com/datasets/kimjihoo/coronavirusdataset/data, a structured dataset based on the report materials of the KDCA (formerly KCDC) and local governments.

To provide an overview, this analysis and visualization report attempts to answer four fundamental questions:

  1. Temporal Dynamics: How did infections evolve over time?
  2. Demographic Impact: Which groups were most affected?
  3. Regional Spread: Where were outbreaks concentrated?
  4. Disease Outcomes: What factors influenced mortality and recovery?

To start, let’s take a look at the overall national trend.

Question 1: How did infections evolve over time in Korea?

To examine the overall scale of infections, the analysis uses the cumulative number of confirmed cases from the Time.csv dataset, which records the total number of positive cases over time. The data spans from January 20, 2020 to June 30, 2020, allowing for an assessment of infection trends during the initial phase of the pandemic in South Korea.

ggplot(time_data, mapping = aes(x = date, y = confirmed)) + geom_line() + labs(title = "Cumulative COVID-19 Cases In South Korea", x = "Date", y = "# of confirmed cases", subtitle = "2020/01/20 - 2020/06/30", caption = "https://www.kaggle.com/datasets/kimjihoo/coronavirusdataset/data") + theme_linedraw() + theme(plot.title = element_text(face = "bold", hjust = 0.5), plot.subtitle = element_text(hjust = 0.5)) + annotate("rect", xmin = as.Date("2020-02-15"), xmax = as.Date("2020-04-15"), ymin = -Inf, ymax = Inf, fill = "red", alpha = 0.1) 

As shown in the line graph, cumulative confirmed COVID-19 cases rose sharply from below 1000 in mid-February 2020 to over 10000 by April 2020, indicating a rapid acceleration in transmission. The steep slope during this period reflects a high growth rate in cumulative cases. Beginning in April 2020, the curve noticeably flattens, suggesting a deceleration in spread and a reduction in the rate of new infections. This slowdown coincides with the Korean government’s infectious disease alert level and the strict interventions that were in place by April, together with expanded testing and contact tracing; the timing is suggestive, but this plot alone cannot establish a causal effect.
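Because confirmed is cumulative, the slope described above corresponds to daily new cases, which can be recovered by differencing. The following is a minimal sketch on made-up counts; `time_toy`, `new_cases`, and `growth_rate` are hypothetical names, and the growth rate is defined here as new cases divided by the previous day’s cumulative total.

```r
library(dplyr)

# Toy cumulative series (made-up counts, not the real data)
time_toy <- tibble(
  date      = as.Date("2020-02-18") + 0:4,
  confirmed = c(10, 15, 30, 60, 100)
)

# Difference the cumulative counts to get daily new cases and a growth rate
daily <- time_toy %>%
  arrange(date) %>%
  mutate(
    new_cases   = confirmed - lag(confirmed),
    growth_rate = new_cases / lag(confirmed)
  )

daily
```

Plotting `new_cases` against `date` would show the flattening of the cumulative curve as a falling daily count.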

What about the case fatality ratio and the recovery rate?

cfr_recovery_rate <- time_data %>% mutate(cfr = deceased / confirmed, recovery_rate = released / confirmed)

long_rates <- cfr_recovery_rate %>%  select(date, cfr, recovery_rate) %>% pivot_longer(cols = c(cfr, recovery_rate), names_to = "rate_type", values_to = "rate")

ggplot(long_rates, mapping = aes(x = date, y = rate, color = rate_type)) + geom_line() + annotate("rect", xmin = as.Date("2020-02-15"), xmax = as.Date("2020-04-15"), ymin = -Inf, ymax = Inf, fill = "red", alpha = 0.1) + labs(
    title = "Evolution of Case Fatality Ratio (CFR) and Recovery Rate in South Korea",
    x = "Date",
    y = "Proportion of Confirmed Cases",
    color = "Case Fatality Ratio/ Recovery Rate",
    caption = "Data: Kim Jihoo Kaggle Coronavirus Dataset (KDCA & local governments)") + theme_minimal() + theme(plot.title = element_text(face = "bold"), legend.title = element_text(face = "bold", size = 11))

As illustrated in the graph, the recovery rate declines noticeably during the initial phase of the outbreak. This pattern primarily reflects timing: recoveries occur days or weeks after diagnosis, so the rapid surge in confirmed cases during late February increases the denominator (total confirmed cases) faster than recoveries can accumulate, temporarily lowering the observed recovery rate.

Moreover, early testing strategies often prioritize symptomatic or severe cases, while mild or asymptomatic infections may go undetected. This selective detection can make the proportion of severe cases appear higher, further contributing to a lower observed recovery rate. Together, these dynamics explain why recovery rates often decline during periods of rapid case growth, even if the underlying probability of recovery has not worsened.
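The lag effect can be reproduced with a toy simulation in which every case recovers, yet the observed recovery rate still dips during the surge. All numbers below are hypothetical, including the fixed 14-day delay between confirmation and release.

```r
# Hypothetical outbreak: low baseline, a 10-day surge, then a decline
new_cases <- c(rep(10, 20), rep(500, 10), rep(50, 30))
lag_days  <- 14  # assumed fixed delay between confirmation and release

confirmed <- cumsum(new_cases)

# Every case is released exactly lag_days after confirmation
released <- cumsum(c(rep(0, lag_days), head(new_cases, -lag_days)))

recovery_rate <- released / confirmed

# Rate just before the surge, at its end, and at the end of the series
round(recovery_rate[c(20, 30, 60)], 3)
```

In this setup the rate falls sharply between day 20 and day 30 and then climbs back, even though the underlying probability of recovery is fixed at 100 percent throughout.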

Question 2: Which demographic groups exhibited the highest case fatality rates and the largest share of confirmed cases during Korea’s first wave of COVID-19?

How did case fatality rates and the distribution of confirmed COVID-19 cases vary across age groups and gender during Korea’s first wave?

It is evident that South Korea was significantly impacted by the global pandemic. However, an important question remains: which demographic groups were most affected? Were younger generations more impacted due to their higher levels of social interaction and physical mobility? Or did older populations experience greater consequences, given their comparatively weaker immune systems and higher vulnerability to severe illness? What about gender?

Let’s take a look at the correlation between age and infection.

weekly_infection <- time_age_data %>% mutate(week = floor_date(date, unit = "week")) %>%  group_by(week, age) %>% slice_max(date, n = 1) %>% ungroup() %>% group_by(week) %>% mutate(case_share = confirmed / sum(confirmed, na.rm = TRUE)) %>% ungroup()

ggplot(data = weekly_infection, mapping = aes(x = week, y = age, fill = case_share)) + geom_tile() + labs(title = "Age distribution of confirmed cases over time", y = "Age Group", x = "Week", caption = "https://www.kaggle.com/datasets/kimjihoo/coronavirusdataset/data", fill = "Weekly share of confirmed cases") + theme_bw()+ theme(plot.title = element_text(face = "bold"), legend.title = element_text(face = "bold"), panel.grid = element_blank()) + scale_fill_viridis_c(option = "cividis") +
  scale_x_date(date_breaks = "1 month", date_labels = "%b %Y")

The distribution of confirmed COVID-19 cases indicates that individuals in their 20s accounted for the largest share of detected infections during Korea’s first wave, followed by individuals in their 50s. This suggests that transmission was concentrated among socially and economically active age groups rather than being evenly distributed across the population.

The relatively high share among individuals in their 20s may reflect greater social mobility and frequency of in-person interactions during the early phase of the outbreak, while the substantial share among those in their 50s may be associated with continued workplace participation. These interpretations remain speculative, as the case data do not include behavioral or mobility measures. A separate dataset on the floating population within Seoul, however, can serve as a rough indicator of mobility and social interaction among people in their 20s and 50s.

seoul_floating_age <- seoul_floating_data %>%
  group_by(birth_year) %>%
  summarise(total_fp_age = sum(fp_num, na.rm = TRUE)) %>%
  mutate(share = total_fp_age / sum(total_fp_age))

ggplot(seoul_floating_age, aes(x = birth_year, y = share)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = birth_year),
            hjust = 1.1,  
            color = "white",
            size = 4) +
  geom_label(aes(label = scales::percent(share, accuracy = 0.1)),
            hjust = 0,
            size = 2.5) +
  
  scale_y_continuous(labels = scales::percent_format()) +
  coord_flip() +
  labs(
    title = "Share of Seoul Floating Population by Age Group",
    x = "",
    y = "Share of Floating Population"
  ) +
  theme_minimal()

The distribution of Seoul’s floating population indicates that individuals in their 20s through 50s account for the majority of daily mobility within the city. This pattern aligns with the observed concentration of confirmed cases among these age groups and is consistent with the idea that higher levels of social and economic activity may be associated with increased exposure opportunities during the first wave. However, because mobility and contact rates are not directly measured in this dataset, this relationship should be interpreted as suggestive rather than causal.

Importantly, the distribution of confirmed cases does not necessarily imply greater clinical severity. To assess whether these age groups were also disproportionately affected in terms of mortality risk, it is necessary to examine differences in case fatality ratios and the share of total deaths across age groups. This distinction allows us to separate transmission burden from outcome severity.

Because age-specific population denominators are not available, these findings describe the composition of confirmed cases within the dataset rather than population-level infection risk.

age_death_distribution <- time_age_data %>% mutate(week = floor_date(date, unit = "week")) %>%  group_by(week, age) %>% slice_max(date, n = 1) %>% ungroup() %>% group_by(week) %>% mutate(death_share = deceased / sum(deceased, na.rm = TRUE)) %>% ungroup()

ggplot(age_death_distribution, aes(x = week, y = age, fill = death_share)) +
  geom_tile() +
  labs(
    title = "Distribution of COVID-19 Deaths Across Age Groups Over Time",
    subtitle = "Weekly share of total deaths attributed to each age group",
    x = "Week",
    y = "Age Group",
    fill = "Weekly\nDeath Share",
    caption = "https://www.kaggle.com/datasets/kimjihoo/coronavirusdataset/data"
  ) +
  theme_bw() +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    legend.title = element_text(face = "bold"), 
    panel.grid = element_blank()
  ) +
  scale_fill_viridis_c(
    option = "magma",
    labels = scales::percent
  ) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b %Y")

weekly_cfr <- time_age_data %>%
  mutate(week = floor_date(date, unit = "week")) %>%
  group_by(week, age) %>%
  slice_max(date, n = 1) %>%   
  ungroup() %>%
  mutate(cfr = deceased / confirmed)

ggplot(weekly_cfr, mapping = aes(x = week, y = age, fill = cfr)) + geom_tile() + labs(
    title = "Case Fatality Ratio (CFR) by Age Group Over Time in South Korea",
    subtitle = "Deaths / confirmed cases (end-of-week cumulative values)",
    x = "Week",
    y = "Age Group",
    fill = "Weekly\nCFR",
    caption = "Data: Kim Jihoo Kaggle Coronavirus Dataset (KDCA & local governments)") + theme_bw()+ theme(plot.title = element_text(face = "bold"), legend.title = element_text(face = "bold")) + scale_fill_viridis_c(option = "magma", labels = scales::percent) + 
  scale_x_date(date_breaks = "1 month", date_labels = "%b %Y")

In contrast to the age distribution of confirmed cases, older age groups—particularly individuals in their 60s, 70s, and 80s—accounted for a smaller share of total infections but exhibited markedly higher case fatality ratios (CFR). Individuals in their 80s consistently displayed the highest proportion of deaths relative to confirmed cases throughout the observed period. Although individuals in their 50s experienced a substantial number of infections, mortality risk increased sharply with advancing age once infected.

This contrast underscores the distinction between transmission burden and conditional severity. Younger age groups, especially those in their 20s, comprised a larger share of confirmed cases, suggesting greater exposure opportunities. However, conditional on infection, older individuals faced significantly elevated mortality risk. The pronounced increase in CFR with age is consistent with established patterns of age-related vulnerability to severe viral illness. Thus, while infection burden was concentrated among socially active age groups, clinical severity was disproportionately concentrated among older cohorts.
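
A small worked example may make the two metrics easier to tell apart. The counts below are hypothetical and chosen only to show that the group with the largest case share need not be the group with the highest CFR.

```r
# Hypothetical cumulative counts, NOT taken from the dataset.
toy <- data.frame(
  age       = c("20s", "50s", "80s"),
  confirmed = c(3000, 2000, 500),
  deceased  = c(0, 15, 125)
)

# Transmission burden: each group's share of all confirmed cases.
toy$case_share <- toy$confirmed / sum(toy$confirmed)

# Conditional severity: deaths divided by confirmed cases within the same group.
toy$cfr <- toy$deceased / toy$confirmed

toy
```

Here the 20s group holds the largest case share while the 80s group has the highest CFR, because the two metrics use different denominators: all confirmed cases for the share, and the group's own confirmed cases for the CFR.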

To further explore age-related differences in disease progression, the next analysis examines whether age influences the duration of treatment, measured as the time between confirmation and clinical release among recovered cases.

age_duration <- patient_data %>% filter(!is.na(confirmed_date) & !is.na(released_date) & !is.na(age)) %>% mutate(treatment_duration = as.numeric(released_date - confirmed_date)) %>% select(age, treatment_duration) 
age_duration <- age_duration %>% group_by(age) 

ggplot(age_duration, mapping = aes(x = reorder(age, treatment_duration, median, na.rm = TRUE), y =treatment_duration)) +  geom_boxplot(fill = "#4E79A7", alpha = 0.7, outlier.alpha = 0.3)  + labs(title = "Duration of Treatment by Age", x = "Age Group Ordered by Median Duration of Treatment", y = "Duration of Treatment (Days)", subtitle = "Age-related differences in recovery time") + theme_bw() + theme(plot.title = element_text(face = "bold", hjust = 0.5), plot.subtitle = element_text(hjust = 0.5)) + coord_flip() 

Across age groups, treatment duration exhibits clear variation in both central tendency and dispersion.

0s and 10s: The youngest cohorts display the lowest median treatment durations, indicating relatively rapid recovery among most cases. However, both distributions are right-skewed, suggesting that while typical cases resolved quickly, a small subset experienced substantially longer recovery periods.

20s: This group shows notable variability, with a larger number of outliers compared to other cohorts. Although individuals in their 20s accounted for a substantial share of confirmed infections, their median treatment duration remains comparatively short. This indicates that higher infection burden did not translate into prolonged clinical recovery on average.

30s, 40s, and 50s: The distributions for individuals in their 30s, 40s, and 50s are broadly similar in shape, though the 30s group exhibits a slightly higher median treatment duration despite a lower overall infection share. This suggests that exposure frequency and recovery duration may not move in parallel and that factors beyond infection prevalence may influence clinical resolution time.

Overall, median treatment duration increases with age, indicating a positive association between age and recovery time. When considered alongside earlier findings on case fatality ratios and mortality share, a consistent pattern emerges: while younger and middle-aged groups bore a greater share of infections, older individuals experienced more severe outcomes conditional on infection, as reflected in both higher fatality ratios and longer recovery periods.
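
As a minimal sketch of how the medians behind the boxplot can be checked numerically, the example below computes treatment duration from two date columns and takes the median per age group. The six patient records are hypothetical, and the base R functions `difftime()` and `aggregate()` stand in for the dplyr pipeline used above.

```r
# Hypothetical patient records, NOT taken from patient_data.
toy_patients <- data.frame(
  age = c("20s", "20s", "20s", "70s", "70s", "70s"),
  confirmed_date = as.Date(c("2020-03-01", "2020-03-02", "2020-03-05",
                             "2020-03-01", "2020-03-03", "2020-03-04")),
  released_date = as.Date(c("2020-03-15", "2020-03-20", "2020-04-20",
                            "2020-03-25", "2020-04-02", "2020-04-10"))
)

# Requesting days explicitly avoids relying on the unit R would pick by default.
toy_patients$treatment_duration <- as.numeric(
  difftime(toy_patients$released_date, toy_patients$confirmed_date, units = "days")
)

# Median duration per age group, plus the mean to expose right skew.
medians <- aggregate(treatment_duration ~ age, data = toy_patients, FUN = median)
means   <- aggregate(treatment_duration ~ age, data = toy_patients, FUN = mean)

medians
means
```

In the toy 20s group a single long stay pulls the mean well above the median, which is the same right-skew pattern described for the youngest cohorts.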

What about gender?

weekly_confirmed_cases <- time_gender_data %>% mutate(week = floor_date(date, unit = "week")) %>% group_by(week, sex) %>% slice_max(date, n = 1) %>% ungroup() %>% group_by(week) %>% mutate(case_share = confirmed / sum(confirmed, na.rm = TRUE)) %>% ungroup()

ggplot(weekly_confirmed_cases, aes(x = week, y = sex, fill = case_share)) +
  geom_tile() +
  labs(
    title = "Distribution of COVID-19 Cases Across Gender Over Time",
    subtitle = "Weekly share of confirmed cases attributed to each gender group",
    x = "Week",
    y = "Gender: Female, Male",
    fill = "Weekly share of confirmed cases",
    caption = "https://www.kaggle.com/datasets/kimjihoo/coronavirusdataset/data"
  ) +
  theme_bw() +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    legend.title = element_text(face = "bold"), 
    panel.grid = element_blank()
  ) +
  scale_fill_viridis_c(
    option = "magma",
    labels = scales::percent
  ) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b %Y")

weekly_cfr_gender <- time_gender_data %>%
  mutate(week = floor_date(date, unit = "week")) %>%
  group_by(week, sex) %>%
  slice_max(date, n = 1) %>%   
  ungroup() %>%
  mutate(cfr = deceased / confirmed)

ggplot(weekly_cfr_gender, mapping = aes(x = week, y = sex, fill = cfr)) + geom_tile() + labs(
    title = "Case Fatality Ratio (CFR) by Gender Over Time in South Korea",
    subtitle = "Deaths / confirmed cases (end-of-week cumulative values)",
    x = "Week",
    y = "Gender: Female, Male",
    fill = "Weekly\nCFR",
    caption = "Data: Kim Jihoo Kaggle Coronavirus Dataset (KDCA & local governments)") + theme_bw()+ theme(plot.title = element_text(face = "bold"), legend.title = element_text(face = "bold")) + scale_fill_viridis_c(option = "magma", labels = scales::percent) + 
  scale_x_date(date_breaks = "1 month", date_labels = "%b %Y")

As for gender differences, females accounted for a higher weekly share of confirmed COVID-19 cases, indicating a greater observed infection burden during the study period. However, the case fatality ratio (CFR) was consistently higher among males. This contrast reinforces the distinction between infection concentration and conditional mortality risk: higher case share does not necessarily imply higher fatality risk once infected.

Several factors may help explain this divergence. Biological differences in immune response and underlying health conditions have been documented as contributors to differential COVID-19 severity. In addition, behavioral and structural factors, such as higher smoking prevalence, occupational exposure patterns, and delayed healthcare-seeking behavior, may have elevated mortality risk among males. Meanwhile, the higher share of confirmed cases among females may reflect differences in occupational roles or testing patterns rather than intrinsic biological susceptibility.

Taken together, the findings suggest that gender disparities during the first wave were shaped by both exposure patterns and differential clinical severity. While females represented a larger share of confirmed infections, males faced a disproportionately higher mortality risk conditional on infection.

Question 3: How did the severity and distribution of COVID-19 vary across provinces during Korea’s first wave, as measured by case fatality rates, share of confirmed cases, and share of deaths?

Having examined who was most affected by the pandemic in South Korea, we now turn to where the impact was geographically concentrated.

To provide context, South Korea is administratively divided into 17 provinces and metropolitan cities, each further subdivided into numerous districts and municipalities. Given the large number of city-level units, using cities as the primary geographic category would introduce excessive fragmentation and reduce interpretability. Therefore, this analysis adopts the provincial level as the unit of comparison in order to more clearly assess regional disparities in the impact of COVID-19.

province_last <- time_province_data %>%
  group_by(province) %>%
  slice_max(date, n = 1) %>%
  ungroup()

province_last <- province_last %>%
  mutate(share = confirmed / sum(confirmed, na.rm = TRUE))


sk_provinces <- gadm(country = "KOR", level = 1, path = tempdir())
sk_provinces <- st_as_sf(sk_provinces)

map_data <- sk_provinces %>%
  left_join(province_last, by = c("NAME_1" = "province"))

label_points <- st_point_on_surface(map_data)
## Warning: st_point_on_surface assumes attributes are constant over geometries
## Warning in st_point_on_surface.sfc(st_geometry(x)): st_point_on_surface may not
## give correct results for longitude/latitude data
label_coords <- cbind(st_coordinates(label_points), map_data)

ggplot(map_data) +
  geom_sf(aes(fill = share), color = "white") + 
  geom_label_repel(data = label_coords, aes(X, Y, label = NAME_1), size = 2, color = "black") + 
  scale_fill_viridis_c(option = "cividis", labels = scales::percent) +
  labs(
  title = "Share of Total Confirmed Cases by Province in South Korea",
  subtitle = "Proportion of cumulative confirmed cases as of June 30, 2020",
  x = NULL,
  y = NULL,
  fill = "Share of\nConfirmed Cases",
  caption = "Data: Kim Jihoo Kaggle Coronavirus Dataset (KDCA & local governments)") +
  theme_bw() + theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    axis.text = element_blank(),
    axis.ticks = element_blank()
  )
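
Because the map is built with a left_join() on province names, any province spelled differently in the two sources would receive NA values and be drawn in the default grey. A quick way to check for this is to compare the two sets of keys. The name vectors below are hypothetical stand-ins for `sk_provinces$NAME_1` and `province_last$province`; whether the real sources disagree is not something this sketch establishes.

```r
# Hypothetical name vectors standing in for sk_provinces$NAME_1 and
# province_last$province; the real values may or may not differ.
map_names  <- c("Seoul", "Daegu", "Jeju")
data_names <- c("Seoul", "Daegu", "Jeju-do")

# Provinces on the map that would get no data after the join.
unmatched_in_map <- setdiff(map_names, data_names)

# Provinces in the data that would be dropped from the map.
unmatched_in_data <- setdiff(data_names, map_names)

unmatched_in_map
unmatched_in_data
```

If both results are empty for the real columns, every province on the map has a matching row and no region is silently left unfilled.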

The map indicates that Daegu accounted for a disproportionately large share of confirmed cases during the initial stage of Korea’s first COVID-19 wave. Relative to other provinces, the concentration of infections in Daegu stands out clearly, reflecting a localized outbreak that rapidly escalated in intensity.

This spatial concentration is consistent with contemporary reports identifying Daegu as a major epicenter during the early phase of the pandemic. The outbreak linked to the Shincheonji Church of Jesus generated a substantial infection cluster, which significantly accelerated community transmission within the region. Such cluster-based amplification helps explain the sharp geographic imbalance observed in the data.

The rapid escalation in Daegu also reflects the broader uncertainty surrounding the virus during its early spread. Limited information about transmissibility and delayed implementation of stringent containment measures likely allowed the outbreak to intensify before effective mitigation strategies were fully in place.

province_death_share_rate <- time_province_data %>%
  group_by(province) %>%
  slice_max(date, n = 1) %>%
  ungroup() %>% 
  mutate(share = deceased / sum(deceased, na.rm = TRUE))

map_data <- sk_provinces %>% left_join(province_death_share_rate, by=c("NAME_1" = "province"))
label_points <- st_point_on_surface(map_data)
## Warning: st_point_on_surface assumes attributes are constant over geometries
## Warning in st_point_on_surface.sfc(st_geometry(x)): st_point_on_surface may not
## give correct results for longitude/latitude data
label_coords <- cbind(st_coordinates(label_points), map_data)

ggplot(map_data) + geom_sf(aes(fill= share), color = "white") + scale_fill_viridis_c(option = "magma", labels = scales::percent) + geom_label_repel(data = label_coords, aes(X, Y, label = NAME_1), size = 2, color = "black") + labs(
  title = "Share of Total COVID-19 Deaths by Province in South Korea",
  subtitle = "Proportion of cumulative national deaths as of June 30, 2020",
  fill = "Share of\nNational Deaths",
  caption = "Data: Kim Jihoo Kaggle Coronavirus Dataset (KDCA & local governments)", x= NULL, y = NULL) + theme_bw() + theme(plot.title = element_text(face = "bold", hjust = 0.5), plot.subtitle = element_text(hjust = 0.5), axis.text = element_blank(), axis.ticks = element_blank()) 

province_cfr <- time_province_data %>%
  group_by(province) %>%
  slice_max(date, n = 1) %>%
  ungroup() %>% 
  mutate(cfr = deceased / confirmed)

map_data_cfr <- sk_provinces %>% left_join(province_cfr, by=c("NAME_1" = "province"))
label_points <- st_point_on_surface(map_data_cfr)
## Warning: st_point_on_surface assumes attributes are constant over geometries
## Warning in st_point_on_surface.sfc(st_geometry(x)): st_point_on_surface may not
## give correct results for longitude/latitude data
label_coords <- cbind(st_coordinates(label_points), map_data_cfr)

ggplot(map_data_cfr) + geom_sf(aes(fill= cfr), color = "white") + scale_fill_viridis_c(option = "magma", labels = scales::percent) + geom_label_repel(data = label_coords, aes(X, Y, label = NAME_1), size = 2, color = "black") + labs(
    title = "Case Fatality Ratio (CFR) by Province in South Korea",
    subtitle = "Deaths divided by confirmed cases (cumulative as of June 30, 2020)",
    fill = "CFR\n(Deaths / Confirmed)",
    caption = "Data: Kim Jihoo Kaggle Coronavirus Dataset (KDCA & local governments)",
    x = NULL,
    y = NULL) + theme_bw() + theme(plot.title = element_text(face = "bold", hjust = 0.5), plot.subtitle = element_text(hjust = 0.5), axis.text = element_blank(), axis.ticks = element_blank())

Furthermore, the death-share map reinforces this pattern, indicating that Daegu accounted for the largest proportion of national deaths during the observed period. While Daegu concentrated much of the early burden, provinces such as Gangwon-do and Gyeongsangbuk-do exhibit higher CFR, suggesting differences in conditional severity even when total deaths were lower.

To further examine the relationship between the Shincheonji Church–related events in Daegu and the subsequent intensification of COVID-19 transmission, more granular data—such as case-level information from hospitals and local health institutions in Daegu—would be necessary. However, access to such detailed records is limited due to privacy protections and data confidentiality regulations.

Nevertheless, investigating the correlation between confirmed cases in Daegu and documented sources of infection would provide valuable insight into the dynamics of early cluster-based transmission. A more detailed analysis could help clarify the extent to which the Shincheonji-related outbreak contributed to the rapid regional escalation observed during the initial stages of the outbreak.

Daegu_data <- time_province_data %>% filter(province == "Daegu") %>% left_join(time_data, by="date") %>% select(date, confirmed = confirmed.x, released = released.x, deceased = deceased.x, total_number_tests = test, province)
Seoul_data <- time_province_data %>% filter(province == "Seoul") %>% left_join(time_data, by= "date") %>% select(date, confirmed = confirmed.x, released = released.x, deceased = deceased.x, total_number_tests = test, province)

Daegu_data <- Daegu_data %>% mutate(cfr = (deceased / confirmed) * 100, recovery_rate = (released / confirmed) *100) 
Seoul_data <- Seoul_data %>% mutate(cfr = (deceased / confirmed) * 100, recovery_rate = (released / confirmed) *100)

comparison_data <- bind_rows(Daegu_data, Seoul_data)
comparison_data <- comparison_data %>% arrange(desc(date))

peak_daegu <- comparison_data %>%
  filter(province == "Daegu") %>%
  slice_max(cfr, n = 1)

peak_seoul <- comparison_data %>% 
  filter(province == "Seoul") %>% 
  slice_max(cfr, n = 1)

ggplot(comparison_data, aes(x = date, y = cfr, color = province)) + geom_line() + 
  labs(
  title = "CFR Comparison: Daegu and Seoul",
  subtitle = "Case fatality ratio over time (deaths / confirmed cases)",
  x = "Date",
  y = "Case Fatality Ratio (Percent)",
  color = "Province",
  caption = "Data: Kim Jihoo Kaggle Coronavirus Dataset (KDCA & local governments)") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 11),
    legend.position = "top") + geom_point(data = peak_daegu, size = 2) + 
  geom_text(data = peak_daegu, aes(label = "Peak in Daegu"), vjust = 2, show.legend = FALSE) + 
  geom_point(data = peak_seoul, size = 2) + 
  geom_text(data = peak_seoul, aes(label = "Peak in Seoul"), vjust = -1, show.legend = FALSE )
## Warning: Removed 36 rows containing missing values or values outside the scale range
## (`geom_line()`).

Comparing Daegu — the initial epicenter of the national outbreak — with Seoul highlights the disproportionate severity Daegu faced during the early phase of the pandemic. Unlike Seoul, where case counts rose gradually, Daegu’s trajectory shows a sharp, unrelenting climb following its initial peak. Moreover, the timing of case surges in other provinces appears to follow Daegu’s peak, suggesting that the virus radiated outward from Daegu rather than emerging independently across regions — further reinforcing its role as the primary epicenter of Korea’s first wave.

Question 4: How did differences in case fatality rates and recovery rates between Seoul and Daegu evolve over the course of Korea’s first COVID-19 wave?

To examine how severity evolved and resolved over time in South Korea, we focus on a comparison between Seoul and Daegu using the case fatality ratio (CFR) and the recovery rate.

convergence_data <- Daegu_data %>% inner_join(Seoul_data, by="date") %>% rename(cfr_daegu = cfr.x, cfr_seoul = cfr.y, recovery_rate_daegu = recovery_rate.x, recovery_rate_seoul = recovery_rate.y) %>% select(date, cfr_daegu, cfr_seoul, recovery_rate_daegu, recovery_rate_seoul) %>% mutate(disparity_cfr = (cfr_daegu - cfr_seoul), disparity_recovery = (recovery_rate_daegu - recovery_rate_seoul)) %>% select(date, disparity_cfr, disparity_recovery)

ggplot(convergence_data, aes(x = date, y = disparity_cfr)) + geom_line(linewidth = 1, color = "darkred") + geom_hline(yintercept = 0, linetype = "dashed") + 
  labs(title = "CFR Gap Between Daegu and Seoul Converged Over Time", subtitle = "Difference in case fatality ratios (Daegu − Seoul, percentage points)", x = "Date", y = "CFR Difference (pp)", caption = "Gap approaching zero indicates convergence in severity") + theme_minimal()
## Warning: Removed 29 rows containing missing values or values outside the scale range
## (`geom_line()`).

The gap closely mirrors the CFR trend observed in Daegu, suggesting that the disparity between the two regions was driven largely by the outbreak in Daegu during the early phase rather than by changes in Seoul. Beginning in mid-April, the severity gap gradually narrows, indicating that the disparity between Daegu and Seoul diminished as the case fatality ratio in Daegu declined.

ggplot(convergence_data, aes(x = date, y = disparity_recovery)) + geom_line(linewidth = 1, color = "lightblue") + geom_hline(yintercept = 0, linetype = "dashed") + 
  labs(title = "Recovery Gap Between Daegu and Seoul Converged Over Time", subtitle = "Difference in recovery rates (Daegu − Seoul, percentage points)", x = "Date", y = "Recovery Rate Difference (pp)", caption = "Gap approaching zero indicates convergence in recovery") + theme_minimal()
## Warning: Removed 29 rows containing missing values or values outside the scale range
## (`geom_line()`).

Comparing the recovery gap and CFR gap charts together provides a more comprehensive view of regional disparity during the outbreak. The patterns suggest that Daegu was not only the epicenter of the epidemic but also faced significant strain on medical resources, as widely reported at the time. Notably, while the maximum CFR gap reaches approximately 2.3 percentage points, the recovery gap peaks at around 40 percentage points, indicating a much larger divergence in recovery outcomes. This disparity may reflect the relative strain on Daegu’s healthcare system compared to Seoul, where more advanced infrastructure and greater resource availability may have facilitated faster recovery rates.
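
The peak gaps quoted above can be located programmatically rather than read off the chart. The sketch below uses a hypothetical four-row series in the same units as the analysis (percent), so the difference is in percentage points. The column names follow `convergence_data`, but the values are invented.

```r
# Hypothetical CFR series in percent, NOT taken from the dataset.
toy_gap <- data.frame(
  date      = as.Date("2020-03-01") + c(0, 30, 60, 90),
  cfr_daegu = c(0.4, 1.8, 2.4, 2.5),
  cfr_seoul = c(0.0, 0.2, 0.4, 0.6)
)

# Difference of two percentages is expressed in percentage points.
toy_gap$disparity_cfr <- toy_gap$cfr_daegu - toy_gap$cfr_seoul

# Row where the gap is widest; na.rm is not needed because which.max() skips NA.
peak <- toy_gap[which.max(toy_gap$disparity_cfr), ]
peak
```

Applied to `convergence_data`, the same `which.max()` call on `disparity_cfr` or `disparity_recovery` would return the date and size of each maximum gap.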

Question 6: How did different categories of infection sources contribute to the overall distribution and severity of COVID-19 cases during Korea’s first wave?

infection_summary <- case_data %>% mutate(infection_source = case_when(
  # case_when() uses the first matching rule, so the order below matters
  str_detect(infection_case, regex("church", ignore_case = TRUE)) ~ "CHURCH",  # also matches "churches"
  str_detect(infection_case, regex("hospital", ignore_case = TRUE)) ~ "HOSPITAL",
  str_detect(infection_case, regex("medical", ignore_case = TRUE)) ~ "HOSPITAL",
  str_detect(infection_case, regex("overseas inflow", ignore_case = TRUE)) ~ "OVERSEAS",
  str_detect(infection_case, regex("etc", ignore_case = TRUE)) ~ "UNKNOWN",
  str_detect(infection_case, regex("clubs", ignore_case = TRUE)) ~ "SOCIAL HANGOUT",
  str_detect(infection_case, regex("contact with patient", ignore_case = TRUE)) ~ "CONTACT WITH A PATIENT",
  TRUE ~ "OTHER"  # everything that matched none of the patterns above
))

infection_summary <- infection_summary %>% count(infection_source) %>% mutate(prop = n /sum(n)) 

ggplot(infection_summary,
       aes(x = reorder(infection_source, prop),
           y = prop,
           fill = infection_source)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Distribution of COVID-19 Infection Sources",
    x = "Infection Source",
    y = "Share of Cases"
  ) +
  theme_minimal() + scale_fill_brewer() + theme(plot.title = element_text(face = "bold"))

As the case data indicates, the sources of infection were diverse, encompassing workplaces, schools, community gatherings, and overseas exposure. However, when excluding the broad “Other” category, which aggregates multiple smaller exposure types, church-related gatherings emerge as the most prominent identifiable source of transmission. This pattern highlights the role of large, close-contact congregational settings in facilitating rapid cluster-based spread during the early phase of the outbreak.
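
One caveat when reading the chart: count() tallies rows, and each row of case_data appears to describe one cluster with its own confirmed total, so the bars show the share of cluster records rather than the share of confirmed cases. The sketch below, using hypothetical records, shows how the two can differ. It uses base R, and in dplyr the weighted version would correspond to passing `wt = confirmed` to count().

```r
# Hypothetical cluster records shaped like case_data (one row per cluster),
# NOT taken from the dataset.
toy_cases <- data.frame(
  infection_source = c("CHURCH", "HOSPITAL", "HOSPITAL", "HOSPITAL"),
  confirmed        = c(400, 20, 30, 50)
)

# Share of cluster records: what counting rows measures.
row_share <- prop.table(table(toy_cases$infection_source))

# Share of confirmed cases: each cluster weighted by its size.
case_totals <- tapply(toy_cases$confirmed, toy_cases$infection_source, sum)
case_share  <- case_totals / sum(case_totals)

row_share
case_share
```

In this toy data the church category is a quarter of the records but four fifths of the confirmed cases, so the choice of weighting can change which source looks dominant.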

Reflection

Throughout this project, the most challenging aspect was not generating visualizations, but ensuring that the metrics I used were statistically coherent and conceptually meaningful. At the beginning of the analysis, I initially attempted to compute infection and mortality rates using the total number of tests as the denominator. At first glance, this seemed reasonable, since testing volume reflects detection activity. However, upon closer examination, I realized that using national-level test counts as a denominator for age-specific or province-specific confirmed cases was methodologically inconsistent. The numerator and denominator did not represent the same risk pool, which would have led to misleading interpretations.

This realization forced me to rethink how rates should be constructed. I learned that denominators must correspond to the same population as the numerator. When population data were unavailable, I replaced “infection rate” with share of confirmed cases to describe burden concentration rather than risk. For mortality analysis, I distinguished between province-level death share (deceased divided by total national deaths) and case fatality ratios (deceased divided by confirmed within the same province). This distinction clarified the difference between overall burden and conditional severity. I also reconsidered how cumulative data should be handled. Because the dataset recorded cumulative confirmed cases, summing across dates would have resulted in double counting. Instead, I extracted final cumulative values or computed daily new cases to avoid distortion in time-based analysis.
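
The double-counting problem with cumulative data is easy to see on a short made-up series. This sketch assumes a cumulative confirmed column like the one described above; the five values are hypothetical.

```r
# Hypothetical cumulative confirmed counts on five consecutive days.
cumulative_confirmed <- c(10, 25, 25, 60, 100)

# Summing cumulative values counts early cases again on every later day.
naive_total <- sum(cumulative_confirmed)

# Daily new cases are the day-to-day differences; the leading 0 keeps day one.
daily_new <- diff(c(0, cumulative_confirmed))

# The correct total is the last cumulative value, which equals sum(daily_new).
final_total <- tail(cumulative_confirmed, 1)

naive_total
daily_new
final_total
```

The naive sum is more than double the true total here, while the daily differences add back up to the final cumulative value.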

Another major realization was the conceptual difference between concentration and severity. Initially, I expected Daegu to exhibit both the highest case share and the highest fatality rate. However, after computing the appropriate metrics, I found that while Daegu accounted for the largest share of confirmed cases and deaths, it did not necessarily have the highest case fatality ratio. This forced me to separate disease burden (share of confirmed cases or deaths) from conditional severity (CFR). Understanding this distinction significantly strengthened the clarity and precision of my analysis. Share-based measures were more appropriate for comparing overall impact, while CFR was better suited for evaluating risk conditional on infection.

An additional insight was that data cleaning is not a one-time preliminary step, but an ongoing process that continues throughout visualization and analysis. As I created plots, I discovered inconsistencies in variable definitions, cumulative structures, and grouping logic that required adjustments mid-analysis. For example, calculating growth rates from cumulative data initially seemed straightforward, but I later recognized that daily new cases provided a clearer and more interpretable measure of transmission trends when analyzing policy timing. Determining which variables were appropriate for comparison, extracting the correct level of aggregation (province versus city), and selecting meaningful denominators were decisions that evolved throughout the project. This experience reinforced that effective analysis depends not only on plotting data, but on understanding what the data truly represent.

There was also considerable trial and error in selecting appropriate visualizations. I explored more complex visual forms, such as parallel coordinate plots, to compare provincial characteristics simultaneously, but encountered challenges related to scaling and grouping across mixed variable types. This process highlighted the importance of matching visualization structure to the analytical question. Some plots are useful for exploratory pattern detection, while others are better suited for clearly communicating specific relationships.

Overall, this project pushed me to think more critically about how data structure influences interpretation. I learned that generating a plot is relatively straightforward, but ensuring that the underlying metric is logically and statistically valid requires careful reasoning. The process of debugging denominators, distinguishing burden from severity, handling cumulative values appropriately, and refining definitions of “rate” significantly improved both the rigor and credibility of my analysis.