The data used in this analysis comes from the Kaggle dataset “Coronavirus Dataset” compiled by Kim Jihoo, based on official reports from the Korea Disease Control and Prevention Agency (KDCA) and local governments in South Korea.
The analysis primarily uses three datasets: (1) time-series data containing daily confirmed, released, and deceased counts; (2) patient demographic data including age and gender; and (3) policy data documenting major public health interventions. Several supporting tables (region, search trend, Seoul floating population, and weather) are also loaded and inspected here.
Let’s begin by looking at each dataset to get a feel for what it looks like.
head(case_data)
## case_id province city group infection_case confirmed
## 1 1000001 Seoul Yongsan-gu TRUE Itaewon Clubs 139
## 2 1000002 Seoul Gwanak-gu TRUE Richway 119
## 3 1000003 Seoul Guro-gu TRUE Guro-gu Call Center 95
## 4 1000004 Seoul Yangcheon-gu TRUE Yangcheon Table Tennis Club 43
## 5 1000005 Seoul Dobong-gu TRUE Day Care Center 43
## 6 1000006 Seoul Guro-gu TRUE Manmin Central Church 41
## latitude longitude
## 1 37.538621 126.992652
## 2 37.48208 126.901384
## 3 37.508163 126.884387
## 4 37.546061 126.874209
## 5 37.679422 127.044374
## 6 37.481059 126.894343
head(patient_data)
## patient_id sex age country province city infection_case
## 1 1e+09 male 50s Korea Seoul Gangseo-gu overseas inflow
## 2 1e+09 male 30s Korea Seoul Jungnang-gu overseas inflow
## 3 1e+09 male 50s Korea Seoul Jongno-gu contact with patient
## 4 1e+09 male 20s Korea Seoul Mapo-gu overseas inflow
## 5 1e+09 female 20s Korea Seoul Seongbuk-gu contact with patient
## 6 1e+09 female 50s Korea Seoul Jongno-gu contact with patient
## infected_by contact_number symptom_onset_date confirmed_date released_date
## 1 75 2020-01-22 2020-01-23 2020-02-05
## 2 31 2020-01-30 2020-03-02
## 3 2002000001 17 2020-01-30 2020-02-19
## 4 9 2020-01-26 2020-01-30 2020-02-15
## 5 1000000002 2 2020-01-31 2020-02-24
## 6 1000000003 43 2020-01-31 2020-02-19
## deceased_date state
## 1 released
## 2 released
## 3 released
## 4 released
## 5 released
## 6 released
head(policy_data)
## policy_id country type gov_policy detail
## 1 1 Korea Alert Infectious Disease Alert Level Level 1 (Blue)
## 2 2 Korea Alert Infectious Disease Alert Level Level 2 (Yellow)
## 3 3 Korea Alert Infectious Disease Alert Level Level 3 (Orange)
## 4 4 Korea Alert Infectious Disease Alert Level Level 4 (Red)
## 5 5 Korea Immigration Special Immigration Procedure from China
## 6 6 Korea Immigration Special Immigration Procedure from Hong Kong
## start_date end_date
## 1 2020-01-03 2020-01-19
## 2 2020-01-20 2020-01-27
## 3 2020-01-28 2020-02-22
## 4 2020-02-23
## 5 2020-02-04
## 6 2020-02-12
head(region_data)
## code province city latitude longitude elementary_school_count
## 1 10000 Seoul Seoul 37.56695 126.9780 607
## 2 10010 Seoul Gangnam-gu 37.51842 127.0472 33
## 3 10020 Seoul Gangdong-gu 37.53049 127.1238 27
## 4 10030 Seoul Gangbuk-gu 37.63994 127.0255 14
## 5 10040 Seoul Gangseo-gu 37.55117 126.8495 36
## 6 10050 Seoul Gwanak-gu 37.47829 126.9515 22
## kindergarten_count university_count academy_ratio elderly_population_ratio
## 1 830 48 1.44 15.38
## 2 38 0 4.18 13.17
## 3 32 0 1.54 14.55
## 4 21 0 0.67 19.49
## 5 56 1 1.17 14.39
## 6 33 1 0.89 15.12
## elderly_alone_ratio nursing_home_count
## 1 5.8 22739
## 2 4.3 3088
## 3 5.4 1023
## 4 8.5 628
## 5 5.7 1080
## 6 4.9 909
head(search_trend_data)
## date cold flu pneumonia coronavirus
## 1 2016-01-01 0.11663 0.05590 0.15726 0.00736
## 2 2016-01-02 0.13372 0.17135 0.20826 0.00890
## 3 2016-01-03 0.14917 0.22317 0.19326 0.00845
## 4 2016-01-04 0.17463 0.18626 0.29008 0.01145
## 5 2016-01-05 0.17226 0.15072 0.24562 0.01381
## 6 2016-01-06 0.17272 0.14399 0.25081 0.01381
head(seoul_floating_data)
## date hour birth_year sex province city fp_num
## 1 2020-01-01 0 20 female Seoul Dobong-gu 19140
## 2 2020-01-01 0 20 male Seoul Dobong-gu 19950
## 3 2020-01-01 0 20 female Seoul Dongdaemun-gu 25450
## 4 2020-01-01 0 20 male Seoul Dongdaemun-gu 27050
## 5 2020-01-01 0 20 female Seoul Dongjag-gu 28880
## 6 2020-01-01 0 20 male Seoul Dongjag-gu 30350
head(time_data)
## date time test negative confirmed released deceased
## 1 2020-01-20 16 1 0 1 0 0
## 2 2020-01-21 16 1 0 1 0 0
## 3 2020-01-22 16 4 3 1 0 0
## 4 2020-01-23 16 22 21 1 0 0
## 5 2020-01-24 16 27 25 2 0 0
## 6 2020-01-25 16 27 25 2 0 0
head(time_gender_data)
## date time sex confirmed deceased
## 1 2020-03-02 0 male 1591 13
## 2 2020-03-02 0 female 2621 9
## 3 2020-03-03 0 male 1810 16
## 4 2020-03-03 0 female 3002 12
## 5 2020-03-04 0 male 1996 20
## 6 2020-03-04 0 female 3332 12
head(time_province_data)
## date time province confirmed released deceased
## 1 2020-01-20 16 Seoul 0 0 0
## 2 2020-01-20 16 Busan 0 0 0
## 3 2020-01-20 16 Daegu 0 0 0
## 4 2020-01-20 16 Incheon 1 0 0
## 5 2020-01-20 16 Gwangju 0 0 0
## 6 2020-01-20 16 Daejeon 0 0 0
head(weather_data)
## code province date avg_temp min_temp max_temp precipitation
## 1 10000 Seoul 2016-01-01 1.2 -3.3 4.0 0
## 2 11000 Busan 2016-01-01 5.3 1.1 10.9 0
## 3 12000 Daegu 2016-01-01 1.7 -4.0 8.0 0
## 4 13000 Gwangju 2016-01-01 3.2 -1.5 8.1 0
## 5 14000 Incheon 2016-01-01 3.1 -0.4 5.7 0
## 6 15000 Daejeon 2016-01-01 1.6 -4.2 7.7 0
## max_wind_speed most_wind_direction avg_relative_humidity
## 1 3.5 90 73.0
## 2 7.4 340 52.1
## 3 3.7 270 70.5
## 4 2.7 230 73.1
## 5 5.3 180 83.9
## 6 4.4 320 77.4
Let’s look at the structure of all data frames too.
str(case_data)
## 'data.frame': 174 obs. of 8 variables:
## $ case_id : int 1000001 1000002 1000003 1000004 1000005 1000006 1000007 1000008 1000009 1000010 ...
## $ province : chr "Seoul" "Seoul" "Seoul" "Seoul" ...
## $ city : chr "Yongsan-gu" "Gwanak-gu" "Guro-gu" "Yangcheon-gu" ...
## $ group : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ infection_case: chr "Itaewon Clubs" "Richway" "Guro-gu Call Center" "Yangcheon Table Tennis Club" ...
## $ confirmed : int 139 119 95 43 43 41 36 17 25 30 ...
## $ latitude : chr "37.538621" "37.48208" "37.508163" "37.546061" ...
## $ longitude : chr "126.992652" "126.901384" "126.884387" "126.874209" ...
str(patient_data)
## 'data.frame': 5165 obs. of 14 variables:
## $ patient_id : num 1e+09 1e+09 1e+09 1e+09 1e+09 ...
## $ sex : chr "male" "male" "male" "male" ...
## $ age : chr "50s" "30s" "50s" "20s" ...
## $ country : chr "Korea" "Korea" "Korea" "Korea" ...
## $ province : chr "Seoul" "Seoul" "Seoul" "Seoul" ...
## $ city : chr "Gangseo-gu" "Jungnang-gu" "Jongno-gu" "Mapo-gu" ...
## $ infection_case : chr "overseas inflow" "overseas inflow" "contact with patient" "overseas inflow" ...
## $ infected_by : chr "" "" "2002000001" "" ...
## $ contact_number : chr "75" "31" "17" "9" ...
## $ symptom_onset_date: chr "2020-01-22" "" "" "2020-01-26" ...
## $ confirmed_date : chr "2020-01-23" "2020-01-30" "2020-01-30" "2020-01-30" ...
## $ released_date : chr "2020-02-05" "2020-03-02" "2020-02-19" "2020-02-15" ...
## $ deceased_date : chr "" "" "" "" ...
## $ state : chr "released" "released" "released" "released" ...
str(policy_data)
## 'data.frame': 61 obs. of 7 variables:
## $ policy_id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ country : chr "Korea" "Korea" "Korea" "Korea" ...
## $ type : chr "Alert" "Alert" "Alert" "Alert" ...
## $ gov_policy: chr "Infectious Disease Alert Level" "Infectious Disease Alert Level" "Infectious Disease Alert Level" "Infectious Disease Alert Level" ...
## $ detail : chr "Level 1 (Blue)" "Level 2 (Yellow)" "Level 3 (Orange)" "Level 4 (Red)" ...
## $ start_date: chr "2020-01-03" "2020-01-20" "2020-01-28" "2020-02-23" ...
## $ end_date : chr "2020-01-19" "2020-01-27" "2020-02-22" "" ...
str(region_data)
## 'data.frame': 244 obs. of 12 variables:
## $ code : int 10000 10010 10020 10030 10040 10050 10060 10070 10080 10090 ...
## $ province : chr "Seoul" "Seoul" "Seoul" "Seoul" ...
## $ city : chr "Seoul" "Gangnam-gu" "Gangdong-gu" "Gangbuk-gu" ...
## $ latitude : num 37.6 37.5 37.5 37.6 37.6 ...
## $ longitude : num 127 127 127 127 127 ...
## $ elementary_school_count : int 607 33 27 14 36 22 22 26 18 42 ...
## $ kindergarten_count : int 830 38 32 21 56 33 33 34 19 66 ...
## $ university_count : int 48 0 0 0 1 1 3 3 0 6 ...
## $ academy_ratio : num 1.44 4.18 1.54 0.67 1.17 0.89 1.16 1 0.96 1.39 ...
## $ elderly_population_ratio: num 15.4 13.2 14.6 19.5 14.4 ...
## $ elderly_alone_ratio : num 5.8 4.3 5.4 8.5 5.7 4.9 4.8 5.7 6.7 7.4 ...
## $ nursing_home_count : int 22739 3088 1023 628 1080 909 723 741 475 952 ...
str(search_trend_data)
## 'data.frame': 1642 obs. of 5 variables:
## $ date : chr "2016-01-01" "2016-01-02" "2016-01-03" "2016-01-04" ...
## $ cold : num 0.117 0.134 0.149 0.175 0.172 ...
## $ flu : num 0.0559 0.1714 0.2232 0.1863 0.1507 ...
## $ pneumonia : num 0.157 0.208 0.193 0.29 0.246 ...
## $ coronavirus: num 0.00736 0.0089 0.00845 0.01145 0.01381 ...
str(seoul_floating_data)
## 'data.frame': 1084800 obs. of 7 variables:
## $ date : chr "2020-01-01" "2020-01-01" "2020-01-01" "2020-01-01" ...
## $ hour : int 0 0 0 0 0 0 0 0 0 0 ...
## $ birth_year: int 20 20 20 20 20 20 20 20 20 20 ...
## $ sex : chr "female" "male" "female" "male" ...
## $ province : chr "Seoul" "Seoul" "Seoul" "Seoul" ...
## $ city : chr "Dobong-gu" "Dobong-gu" "Dongdaemun-gu" "Dongdaemun-gu" ...
## $ fp_num : int 19140 19950 25450 27050 28880 30350 27750 27910 19490 21940 ...
str(time_data)
## 'data.frame': 163 obs. of 7 variables:
## $ date : chr "2020-01-20" "2020-01-21" "2020-01-22" "2020-01-23" ...
## $ time : int 16 16 16 16 16 16 16 16 16 16 ...
## $ test : int 1 1 4 22 27 27 51 61 116 187 ...
## $ negative : int 0 0 3 21 25 25 47 56 97 155 ...
## $ confirmed: int 1 1 1 1 2 2 3 4 4 4 ...
## $ released : int 0 0 0 0 0 0 0 0 0 0 ...
## $ deceased : int 0 0 0 0 0 0 0 0 0 0 ...
str(time_gender_data)
## 'data.frame': 242 obs. of 5 variables:
## $ date : chr "2020-03-02" "2020-03-02" "2020-03-03" "2020-03-03" ...
## $ time : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sex : chr "male" "female" "male" "female" ...
## $ confirmed: int 1591 2621 1810 3002 1996 3332 2149 3617 2345 3939 ...
## $ deceased : int 13 9 16 12 20 12 21 14 25 17 ...
str(time_province_data)
## 'data.frame': 2771 obs. of 6 variables:
## $ date : chr "2020-01-20" "2020-01-20" "2020-01-20" "2020-01-20" ...
## $ time : int 16 16 16 16 16 16 16 16 16 16 ...
## $ province : chr "Seoul" "Busan" "Daegu" "Incheon" ...
## $ confirmed: int 0 0 0 1 0 0 0 0 0 0 ...
## $ released : int 0 0 0 0 0 0 0 0 0 0 ...
## $ deceased : int 0 0 0 0 0 0 0 0 0 0 ...
str(weather_data)
## 'data.frame': 26271 obs. of 10 variables:
## $ code : int 10000 11000 12000 13000 14000 15000 16000 20000 30000 40000 ...
## $ province : chr "Seoul" "Busan" "Daegu" "Gwangju" ...
## $ date : chr "2016-01-01" "2016-01-01" "2016-01-01" "2016-01-01" ...
## $ avg_temp : num 1.2 5.3 1.7 3.2 3.1 1.6 4 1.6 5.1 -1 ...
## $ min_temp : num -3.3 1.1 -4 -1.5 -0.4 -4.2 -1.6 -4.2 2.1 -5.9 ...
## $ max_temp : num 4 10.9 8 8.1 5.7 7.7 12 5.7 8.9 4.1 ...
## $ precipitation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ max_wind_speed : num 3.5 7.4 3.7 2.7 5.3 4.4 2.7 2.1 9.6 1.6 ...
## $ most_wind_direction : int 90 340 270 230 180 320 320 180 290 110 ...
## $ avg_relative_humidity: num 73 52.1 70.5 73.1 83.9 77.4 53.3 80.1 33 79.4 ...
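For case_data, the latitude and longitude values are chr and should be converted into num. The “-” placeholder is replaced with NA first, so that it becomes an explicit missing value before the conversion.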
case_data <- case_data %>% mutate(latitude = na_if(latitude, "-"), longitude = na_if(longitude, "-"), latitude = as.numeric(latitude), longitude = as.numeric(longitude))
As an observation, some rows have missing latitude, longitude, or city values.
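The conversion above can be tried on a small stand-in table. The following is a minimal sketch in base R (used here so it runs without extra packages); the `toy_cases` rows and the helper name `to_numeric_coord` are made up for illustration and are not part of the original analysis.

```r
# Toy stand-in for case_data; the values are made up for illustration.
toy_cases <- data.frame(
  city      = c("Yongsan-gu", "-", "Gangnam-gu"),
  latitude  = c("37.538621", "-", "-"),
  longitude = c("126.992652", "-", "-"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace the placeholder with NA first, so that
# as.numeric() only sees parseable strings or NA.
to_numeric_coord <- function(x, placeholder = "-") {
  x[!is.na(x) & x == placeholder] <- NA
  as.numeric(x)
}

toy_cases$latitude  <- to_numeric_coord(toy_cases$latitude)
toy_cases$longitude <- to_numeric_coord(toy_cases$longitude)

# Inspect the resulting types and the number of missing coordinates.
str(toy_cases)
colSums(is.na(toy_cases[c("latitude", "longitude")]))
```

Counting the NA values right after the conversion gives a quick check that only the placeholder rows were turned into missing values.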
For patient_data:
sex: can be converted into a factor to prepare the data for analysis.
contact_number: represents the count of contacts, so it should be converted into int.
symptom_onset_date, confirmed_date, released_date, deceased_date: should be converted into date values.
age: should be an integer value; it is stored as decade labels such as “50s”, so it would need parsing first and is not converted in this step.
patient_data <- patient_data %>% mutate(sex = na_if(sex, ""), sex = as.factor(sex), contact_number = as.integer(contact_number), symptom_onset_date = as.Date(symptom_onset_date), released_date = as.Date(released_date), deceased_date = as.Date(deceased_date), confirmed_date = as.Date(confirmed_date))
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `contact_number = as.integer(contact_number)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
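The warning above says that some contact_number strings could not be read as integers. One way to see which strings are responsible is to coerce a copy quietly and list the non-empty entries that turned into NA. This is a minimal sketch in base R; the `raw_contact` values are hypothetical and only illustrate what non-numeric entries might look like.

```r
# Hypothetical raw values; "-" and "unknown" are assumptions about what
# non-numeric entries might look like, not values taken from the dataset.
raw_contact <- c("75", "31", "", "-", "17", "unknown")

# Coerce quietly, then flag entries that were non-empty but became NA.
parsed <- suppressWarnings(as.integer(raw_contact))
failed <- is.na(parsed) & !is.na(raw_contact) & raw_contact != ""

# Tabulate the offending strings.
table(raw_contact[failed])
```

Running a check like this on the raw column before the mutate() step would show whether the NAs come from placeholders or from genuinely malformed entries.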
For policy_data, some policies do not have an end date. Because we plan to look at the effects of a policy from its start date onward, the missing end dates do not need to be filled in for now.
policy_data <- policy_data %>% mutate(start_date = as.Date(start_date), end_date = as.Date(end_date))
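If the end dates are needed later, a missing end_date can be read as "still in effect". The sketch below shows one way to do that in base R, under that assumption; `toy_policy` and the helper `is_active` are hypothetical names, and the three rows only mimic the layout of policy_data.

```r
# Toy stand-in for policy_data; the third policy has no end date.
toy_policy <- data.frame(
  policy_id  = 1:3,
  start_date = as.Date(c("2020-01-03", "2020-01-20", "2020-02-23")),
  end_date   = as.Date(c("2020-01-19", "2020-01-27", NA))
)

# Hypothetical helper: a policy counts as active on `day` if it has
# started and either has no recorded end date or has not ended yet.
is_active <- function(policy, day) {
  day <- as.Date(day)
  policy$start_date <= day &
    (is.na(policy$end_date) | policy$end_date >= day)
}

# Policies in effect on a given day.
toy_policy[is_active(toy_policy, "2020-03-01"), ]
```

Treating NA as open-ended keeps the original data untouched, which fits the decision not to fill in end dates.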
For search_trend_data, the date values should be converted to date data type.
search_trend_data <- search_trend_data %>% mutate(date = as.Date(date))
For seoul_floating_data,
date: converted to date data type
sex: converted into a factor
seoul_floating_data <- seoul_floating_data %>% mutate(date = as.Date(date), sex = as.factor(sex))
For time_data, time_age_data, time_gender_data, time_province_data, and weather_data, all date values should be converted to date data type. In time_gender_data, sex is also converted into a factor.
time_data <- time_data %>% mutate(date = as.Date(date))
time_age_data <- time_age_data %>% mutate(date = as.Date(date))
time_province_data <- time_province_data %>% mutate(date = as.Date(date))
time_gender_data <- time_gender_data %>% mutate(date = as.Date(date), sex = na_if(sex, ""), sex = as.factor(sex))
weather_data <- weather_data %>% mutate(date = as.Date(date))
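Since the same date conversion is repeated for several tables, it could also be applied in one pass over a list of data frames. This is only a sketch in base R with made-up rows; the list name `tables` is an assumption, and the original analysis converts each table separately.

```r
# Toy stand-ins for two of the tables; the rows are made up.
tables <- list(
  time_data    = data.frame(date = c("2020-01-20", "2020-01-21"),
                            confirmed = c(1L, 1L)),
  weather_data = data.frame(date = c("2016-01-01", "2016-01-02"),
                            avg_temp = c(1.2, 5.3))
)

# Convert the `date` column of every table in one pass.
tables <- lapply(tables, function(df) {
  df$date <- as.Date(df$date)
  df
})

# Check the resulting column class for each table.
sapply(tables, function(df) class(df$date))
```

The trade-off is that the tables then live inside a list rather than as separate objects, so this suits cases where they are processed together.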
Now let’s look at missing/null values.
case_data %>% summarise(missing_latitude = sum(is.na(latitude)))
case_data %>% summarise(missing_longitude = sum(is.na(longitude)))
case_data %>% summarise(missing_city = sum(city == "-"))
In case_data, latitude and longitude give the location of a group infection, so missing coordinates would suggest that the infection case was not a group infection but something else. However, the number of missing city values does not seem to match the number of missing latitude/longitude values. Let’s investigate further.
# convert "-" to NA
case_data <- case_data %>% mutate(city = na_if(city, "-"))
case_data %>% filter(is.na(city))
## case_id province city group infection_case
## 1 1000034 Seoul <NA> TRUE Orange Life
## 2 1000036 Seoul <NA> FALSE overseas inflow
## 3 1000037 Seoul <NA> FALSE contact with patient
## 4 1000038 Seoul <NA> FALSE etc
## 5 1100008 Busan <NA> FALSE overseas inflow
## 6 1100009 Busan <NA> FALSE contact with patient
## 7 1100010 Busan <NA> FALSE etc
## 8 1200008 Daegu <NA> FALSE overseas inflow
## 9 1200009 Daegu <NA> FALSE contact with patient
## 10 1200010 Daegu <NA> FALSE etc
## 11 1300003 Gwangju <NA> FALSE overseas inflow
## 12 1300004 Gwangju <NA> FALSE contact with patient
## 13 1300005 Gwangju <NA> FALSE etc
## 14 1400005 Incheon <NA> FALSE overseas inflow
## 15 1400006 Incheon <NA> FALSE contact with patient
## 16 1400007 Incheon <NA> FALSE etc
## 17 1500001 Daejeon <NA> TRUE Door-to-door sales in Daejeon
## 18 1500008 Daejeon <NA> FALSE overseas inflow
## 19 1500009 Daejeon <NA> FALSE contact with patient
## 20 1500010 Daejeon <NA> FALSE etc
## 21 1600002 Ulsan <NA> FALSE overseas inflow
## 22 1600003 Ulsan <NA> FALSE contact with patient
## 23 1600004 Ulsan <NA> FALSE etc
## 24 1700004 Sejong <NA> FALSE overseas inflow
## 25 1700005 Sejong <NA> FALSE contact with patient
## 26 1700006 Sejong <NA> FALSE etc
## 27 2000009 Gyeonggi-do <NA> TRUE SMR Newly Planted Churches Group
## 28 2000020 Gyeonggi-do <NA> FALSE overseas inflow
## 29 2000021 Gyeonggi-do <NA> FALSE contact with patient
## 30 2000022 Gyeonggi-do <NA> FALSE etc
## 31 3000006 Gangwon-do <NA> FALSE overseas inflow
## 32 3000007 Gangwon-do <NA> FALSE contact with patient
## 33 3000008 Gangwon-do <NA> FALSE etc
## 34 4000005 Chungcheongbuk-do <NA> FALSE overseas inflow
## 35 4000006 Chungcheongbuk-do <NA> FALSE contact with patient
## 36 4000007 Chungcheongbuk-do <NA> FALSE etc
## 37 4100006 Chungcheongnam-do <NA> FALSE overseas inflow
## 38 4100007 Chungcheongnam-do <NA> FALSE contact with patient
## 39 4100008 Chungcheongnam-do <NA> FALSE etc
## 40 5000004 Jeollabuk-do <NA> FALSE overseas inflow
## 41 5000005 Jeollabuk-do <NA> FALSE etc
## 42 5100003 Jeollanam-do <NA> FALSE overseas inflow
## 43 5100004 Jeollanam-do <NA> FALSE contact with patient
## 44 5100005 Jeollanam-do <NA> FALSE etc
## 45 6000011 Gyeongsangbuk-do <NA> FALSE overseas inflow
## 46 6000012 Gyeongsangbuk-do <NA> FALSE contact with patient
## 47 6000013 Gyeongsangbuk-do <NA> FALSE etc
## 48 6100010 Gyeongsangnam-do <NA> FALSE overseas inflow
## 49 6100011 Gyeongsangnam-do <NA> FALSE contact with patient
## 50 6100012 Gyeongsangnam-do <NA> FALSE etc
## 51 7000001 Jeju-do <NA> FALSE overseas inflow
## 52 7000002 Jeju-do <NA> FALSE contact with patient
## 53 7000003 Jeju-do <NA> FALSE etc
## confirmed latitude longitude
## 1 1 NA NA
## 2 298 NA NA
## 3 162 NA NA
## 4 100 NA NA
## 5 36 NA NA
## 6 19 NA NA
## 7 30 NA NA
## 8 41 NA NA
## 9 917 NA NA
## 10 747 NA NA
## 11 23 NA NA
## 12 5 NA NA
## 13 1 NA NA
## 14 68 NA NA
## 15 6 NA NA
## 16 11 NA NA
## 17 55 NA NA
## 18 15 NA NA
## 19 15 NA NA
## 20 15 NA NA
## 21 25 NA NA
## 22 3 NA NA
## 23 7 NA NA
## 24 5 NA NA
## 25 3 NA NA
## 26 1 NA NA
## 27 25 NA NA
## 28 305 NA NA
## 29 63 NA NA
## 30 84 NA NA
## 31 16 NA NA
## 32 0 NA NA
## 33 7 NA NA
## 34 13 NA NA
## 35 8 NA NA
## 36 11 NA NA
## 37 16 NA NA
## 38 2 NA NA
## 39 12 NA NA
## 40 12 NA NA
## 41 5 NA NA
## 42 14 NA NA
## 43 4 NA NA
## 44 4 NA NA
## 45 22 NA NA
## 46 190 NA NA
## 47 133 NA NA
## 48 26 NA NA
## 49 6 NA NA
## 50 20 NA NA
## 51 14 NA NA
## 52 0 NA NA
## 53 4 NA NA
case_data %>% filter(is.na(latitude) & is.na(longitude))
## case_id province city group
## 1 1000007 Seoul from other city TRUE
## 2 1000009 Seoul from other city TRUE
## 3 1000018 Seoul Gangnam-gu TRUE
## 4 1000019 Seoul from other city TRUE
## 5 1000020 Seoul Geumcheon-gu TRUE
## 6 1000021 Seoul from other city TRUE
## 7 1000022 Seoul from other city TRUE
## 8 1000027 Seoul Seocho-gu TRUE
## 9 1000028 Seoul from other city TRUE
## 10 1000031 Seoul from other city TRUE
## 11 1000033 Seoul from other city TRUE
## 12 1000034 Seoul <NA> TRUE
## 13 1000036 Seoul <NA> FALSE
## 14 1000037 Seoul <NA> FALSE
## 15 1000038 Seoul <NA> FALSE
## 16 1100002 Busan from other city TRUE
## 17 1100006 Busan from other city TRUE
## 18 1100007 Busan from other city TRUE
## 19 1100008 Busan <NA> FALSE
## 20 1100009 Busan <NA> FALSE
## 21 1100010 Busan <NA> FALSE
## 22 1200006 Daegu from other city TRUE
## 23 1200007 Daegu from other city TRUE
## 24 1200008 Daegu <NA> FALSE
## 25 1200009 Daegu <NA> FALSE
## 26 1200010 Daegu <NA> FALSE
## 27 1300002 Gwangju from other city TRUE
## 28 1300003 Gwangju <NA> FALSE
## 29 1300004 Gwangju <NA> FALSE
## 30 1300005 Gwangju <NA> FALSE
## 31 1400001 Incheon from other city TRUE
## 32 1400002 Incheon from other city TRUE
## 33 1400003 Incheon from other city TRUE
## 34 1400004 Incheon from other city TRUE
## 35 1400005 Incheon <NA> FALSE
## 36 1400006 Incheon <NA> FALSE
## 37 1400007 Incheon <NA> FALSE
## 38 1500001 Daejeon <NA> TRUE
## 39 1500006 Daejeon from other city TRUE
## 40 1500007 Daejeon from other city TRUE
## 41 1500008 Daejeon <NA> FALSE
## 42 1500009 Daejeon <NA> FALSE
## 43 1500010 Daejeon <NA> FALSE
## 44 1600001 Ulsan from other city TRUE
## 45 1600002 Ulsan <NA> FALSE
## 46 1600003 Ulsan <NA> FALSE
## 47 1600004 Ulsan <NA> FALSE
## 48 1700003 Sejong from other city TRUE
## 49 1700004 Sejong <NA> FALSE
## 50 1700005 Sejong <NA> FALSE
## 51 1700006 Sejong <NA> FALSE
## 52 2000003 Gyeonggi-do from other city TRUE
## 53 2000004 Gyeonggi-do from other city TRUE
## 54 2000006 Gyeonggi-do from other city TRUE
## 55 2000007 Gyeonggi-do from other city TRUE
## 56 2000008 Gyeonggi-do from other city TRUE
## 57 2000009 Gyeonggi-do <NA> TRUE
## 58 2000015 Gyeonggi-do from other city TRUE
## 59 2000016 Gyeonggi-do from other city TRUE
## 60 2000017 Gyeonggi-do from other city TRUE
## 61 2000018 Gyeonggi-do from other city TRUE
## 62 2000019 Gyeonggi-do Seongnam-si TRUE
## 63 2000020 Gyeonggi-do <NA> FALSE
## 64 2000021 Gyeonggi-do <NA> FALSE
## 65 2000022 Gyeonggi-do <NA> FALSE
## 66 3000001 Gangwon-do from other city TRUE
## 67 3000002 Gangwon-do from other city TRUE
## 68 3000004 Gangwon-do from other city TRUE
## 69 3000005 Gangwon-do from other city TRUE
## 70 3000006 Gangwon-do <NA> FALSE
## 71 3000007 Gangwon-do <NA> FALSE
## 72 3000008 Gangwon-do <NA> FALSE
## 73 4000002 Chungcheongbuk-do from other city TRUE
## 74 4000003 Chungcheongbuk-do from other city TRUE
## 75 4000004 Chungcheongbuk-do from other city TRUE
## 76 4000005 Chungcheongbuk-do <NA> FALSE
## 77 4000006 Chungcheongbuk-do <NA> FALSE
## 78 4000007 Chungcheongbuk-do <NA> FALSE
## 79 4100002 Chungcheongnam-do from other city TRUE
## 80 4100004 Chungcheongnam-do from other city TRUE
## 81 4100005 Chungcheongnam-do from other city TRUE
## 82 4100006 Chungcheongnam-do <NA> FALSE
## 83 4100007 Chungcheongnam-do <NA> FALSE
## 84 4100008 Chungcheongnam-do <NA> FALSE
## 85 5000001 Jeollabuk-do from other city TRUE
## 86 5000002 Jeollabuk-do from other city TRUE
## 87 5000003 Jeollabuk-do from other city TRUE
## 88 5000004 Jeollabuk-do <NA> FALSE
## 89 5000005 Jeollabuk-do <NA> FALSE
## 90 5100002 Jeollanam-do from other city TRUE
## 91 5100003 Jeollanam-do <NA> FALSE
## 92 5100004 Jeollanam-do <NA> FALSE
## 93 5100005 Jeollanam-do <NA> FALSE
## 94 6000001 Gyeongsangbuk-do from other city TRUE
## 95 6000005 Gyeongsangbuk-do from other city TRUE
## 96 6000010 Gyeongsangbuk-do Gumi-si TRUE
## 97 6000011 Gyeongsangbuk-do <NA> FALSE
## 98 6000012 Gyeongsangbuk-do <NA> FALSE
## 99 6000013 Gyeongsangbuk-do <NA> FALSE
## 100 6100001 Gyeongsangnam-do from other city TRUE
## 101 6100008 Gyeongsangnam-do from other city TRUE
## 102 6100009 Gyeongsangnam-do from other city TRUE
## 103 6100010 Gyeongsangnam-do <NA> FALSE
## 104 6100011 Gyeongsangnam-do <NA> FALSE
## 105 6100012 Gyeongsangnam-do <NA> FALSE
## 106 7000001 Jeju-do <NA> FALSE
## 107 7000002 Jeju-do <NA> FALSE
## 108 7000003 Jeju-do <NA> FALSE
## 109 7000004 Jeju-do from other city TRUE
## infection_case confirmed latitude longitude
## 1 SMR Newly Planted Churches Group 36 NA NA
## 2 Coupang Logistics Center 25 NA NA
## 3 Gangnam Yeoksam-dong gathering 6 NA NA
## 4 Daejeon door-to-door sales 1 NA NA
## 5 Geumcheon-gu rice milling machine manufacture 6 NA NA
## 6 Shincheonji Church 8 NA NA
## 7 Guri Collective Infection 5 NA NA
## 8 Seocho Family 5 NA NA
## 9 Anyang Gunpo Pastors Group 1 NA NA
## 10 Yongin Brothers 4 NA NA
## 11 Uiwang Logistics Center 2 NA NA
## 12 Orange Life 1 NA NA
## 13 overseas inflow 298 NA NA
## 14 contact with patient 162 NA NA
## 15 etc 100 NA NA
## 16 Shincheonji Church 12 NA NA
## 17 Itaewon Clubs 4 NA NA
## 18 Cheongdo Daenam Hospital 1 NA NA
## 19 overseas inflow 36 NA NA
## 20 contact with patient 19 NA NA
## 21 etc 30 NA NA
## 22 Itaewon Clubs 2 NA NA
## 23 Cheongdo Daenam Hospital 2 NA NA
## 24 overseas inflow 41 NA NA
## 25 contact with patient 917 NA NA
## 26 etc 747 NA NA
## 27 Shincheonji Church 9 NA NA
## 28 overseas inflow 23 NA NA
## 29 contact with patient 5 NA NA
## 30 etc 1 NA NA
## 31 Itaewon Clubs 53 NA NA
## 32 Coupang Logistics Center 42 NA NA
## 33 Guro-gu Call Center 20 NA NA
## 34 Shincheonji Church 2 NA NA
## 35 overseas inflow 68 NA NA
## 36 contact with patient 6 NA NA
## 37 etc 11 NA NA
## 38 Door-to-door sales in Daejeon 55 NA NA
## 39 Shincheonji Church 2 NA NA
## 40 Seosan-si Laboratory 2 NA NA
## 41 overseas inflow 15 NA NA
## 42 contact with patient 15 NA NA
## 43 etc 15 NA NA
## 44 Shincheonji Church 16 NA NA
## 45 overseas inflow 25 NA NA
## 46 contact with patient 3 NA NA
## 47 etc 7 NA NA
## 48 Shincheonji Church 1 NA NA
## 49 overseas inflow 5 NA NA
## 50 contact with patient 3 NA NA
## 51 etc 1 NA NA
## 52 Itaewon Clubs 59 NA NA
## 53 Richway 58 NA NA
## 54 Guro-gu Call Center 50 NA NA
## 55 Shincheonji Church 29 NA NA
## 56 Yangcheon Table Tennis Club 28 NA NA
## 57 SMR Newly Planted Churches Group 25 NA NA
## 58 Korea Campus Crusade of Christ 7 NA NA
## 59 Geumcheon-gu rice milling machine manufacture 6 NA NA
## 60 Wangsung Church 6 NA NA
## 61 Seoul City Hall Station safety worker 5 NA NA
## 62 Seongnam neighbors gathering 5 NA NA
## 63 overseas inflow 305 NA NA
## 64 contact with patient 63 NA NA
## 65 etc 84 NA NA
## 66 Shincheonji Church 17 NA NA
## 67 Uijeongbu St. Mary’s Hospital 10 NA NA
## 68 Richway 4 NA NA
## 69 Geumcheon-gu rice milling machine manufacture 4 NA NA
## 70 overseas inflow 16 NA NA
## 71 contact with patient 0 NA NA
## 72 etc 7 NA NA
## 73 Itaewon Clubs 9 NA NA
## 74 Guro-gu Call Center 2 NA NA
## 75 Shincheonji Church 6 NA NA
## 76 overseas inflow 13 NA NA
## 77 contact with patient 8 NA NA
## 78 etc 11 NA NA
## 79 Door-to-door sales in Daejeon 10 NA NA
## 80 Richway 3 NA NA
## 81 Eunpyeong-Boksagol culture center 3 NA NA
## 82 overseas inflow 16 NA NA
## 83 contact with patient 2 NA NA
## 84 etc 12 NA NA
## 85 Itaewon Clubs 2 NA NA
## 86 Door-to-door sales in Daejeon 3 NA NA
## 87 Shincheonji Church 1 NA NA
## 88 overseas inflow 12 NA NA
## 89 etc 5 NA NA
## 90 Shincheonji Church 1 NA NA
## 91 overseas inflow 14 NA NA
## 92 contact with patient 4 NA NA
## 93 etc 4 NA NA
## 94 Shincheonji Church 566 NA NA
## 95 Pilgrimage to Israel 41 NA NA
## 96 Gumi Elim Church 10 NA NA
## 97 overseas inflow 22 NA NA
## 98 contact with patient 190 NA NA
## 99 etc 133 NA NA
## 100 Shincheonji Church 32 NA NA
## 101 Itaewon Clubs 2 NA NA
## 102 Onchun Church 2 NA NA
## 103 overseas inflow 26 NA NA
## 104 contact with patient 6 NA NA
## 105 etc 20 NA NA
## 106 overseas inflow 14 NA NA
## 107 contact with patient 0 NA NA
## 108 etc 4 NA NA
## 109 Itaewon Clubs 1 NA NA
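As shown above, some rows simply have the latitude and longitude values missing even though a city value is provided: Gangnam-gu (1000018), Geumcheon-gu (1000020), Seocho-gu (1000027), Seongnam-si (2000019), and Gumi-si (6000010). The output also shows that missing coordinates are not limited to non-group cases: many group rows labeled “from other city” lack them as well.
Let’s look at other rows that share these city values.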
case_data %>% filter(city == "Gangnam-gu")
## case_id province city group infection_case confirmed
## 1 1000014 Seoul Gangnam-gu TRUE Samsung Medical Center 7
## 2 1000018 Seoul Gangnam-gu TRUE Gangnam Yeoksam-dong gathering 6
## 3 1000025 Seoul Gangnam-gu TRUE Gangnam Dongin Church 1
## 4 1000029 Seoul Gangnam-gu TRUE Samsung Fire & Marine Insurance 4
## latitude longitude
## 1 37.48825 127.0856
## 2 NA NA
## 3 37.52233 127.0574
## 4 37.49828 127.0301
case_data %>% filter(city == "Geumcheon-gu")
## case_id province city group
## 1 1000020 Seoul Geumcheon-gu TRUE
## infection_case confirmed latitude longitude
## 1 Geumcheon-gu rice milling machine manufacture 6 NA NA
case_data %>% filter(city == "Seocho-gu")
## case_id province city group infection_case confirmed latitude longitude
## 1 1000027 Seoul Seocho-gu TRUE Seocho Family 5 NA NA
case_data %>% filter(city == "Seongnam-si")
## case_id province city group infection_case
## 1 2000001 Gyeonggi-do Seongnam-si TRUE River of Grace Community Church
## 2 2000010 Gyeonggi-do Seongnam-si TRUE Bundang Jesaeng Hospital
## 3 2000019 Gyeonggi-do Seongnam-si TRUE Seongnam neighbors gathering
## confirmed latitude longitude
## 1 67 37.45569 127.1616
## 2 22 37.38833 127.1218
## 3 5 NA NA
case_data %>% filter(city == "Gumi-si")
## case_id province city group infection_case confirmed latitude
## 1 6000010 Gyeongsangbuk-do Gumi-si TRUE Gumi Elim Church 10 NA
## longitude
## 1 NA
As the results above show, this does not appear to be a data entry error; rather, the latitude/longitude information was not available or not provided.
In patient_data, some age, city, and infection_case values are missing and stored as empty strings; we should replace them with NA values.
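The comparison between city values and missing coordinates can also be condensed into a cross-tabulation. The sketch below uses base R on a made-up `toy_cases` table; the three category labels in `city_kind` are names chosen for illustration.

```r
# Toy stand-in for case_data; rows are made up for illustration.
toy_cases <- data.frame(
  city     = c("Gangnam-gu", "Gangnam-gu", "from other city", NA, NA),
  group    = c(TRUE, TRUE, TRUE, FALSE, TRUE),
  latitude = c(37.48825, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Classify each row by the kind of city value it carries.
city_kind <- ifelse(is.na(toy_cases$city), "city missing",
             ifelse(toy_cases$city == "from other city", "from other city",
                    "city given"))

# Cross-tabulate city kind against whether coordinates are missing.
missing_table <- table(city_kind, coords_missing = is.na(toy_cases$latitude))
missing_table
```

The cell for "city given" with missing coordinates is the one that corresponds to the rows discussed here, where a city is known but no location was recorded.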
patient_data <- patient_data %>% mutate(age = na_if(age, ""), city = na_if(city, ""), infection_case = na_if(infection_case, ""))
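The age column still holds decade labels such as "50s" after this step. If a numeric age is wanted, one option is to strip the trailing "s" and keep the decade as an integer. This is a minimal sketch in base R under the assumption that every non-missing label has the form digits followed by "s"; the sample values, including "100s", are hypothetical.

```r
# Hypothetical sample of age labels, including an empty field.
age_raw <- c("50s", "30s", "", "0s", "100s")

# Treat empty strings as missing, then strip the trailing "s".
age_clean  <- ifelse(age_raw == "", NA_character_, age_raw)
age_decade <- as.integer(sub("s$", "", age_clean))

age_decade
```

The result is the lower bound of each decade band, not an exact age, so it is better suited to ordering or grouping than to computing precise averages.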
As for the outlier values,
summary(patient_data)
## patient_id sex age country
## Min. :1.000e+09 female:2218 Length:5165 Length:5165
## 1st Qu.:1.000e+09 male :1825 Class :character Class :character
## Median :2.000e+09 NA's :1122 Mode :character Mode :character
## Mean :2.864e+09
## 3rd Qu.:6.001e+09
## Max. :7.000e+09
##
## province city infection_case infected_by
## Length:5165 Length:5165 Length:5165 Length:5165
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## contact_number symptom_onset_date confirmed_date
## Min. :0.000e+00 Min. :2020-01-19 Min. :2020-01-20
## 1st Qu.:2.000e+00 1st Qu.:2020-02-29 1st Qu.:2020-03-04
## Median :4.000e+00 Median :2020-03-20 Median :2020-03-27
## Mean :1.274e+06 Mean :2020-04-05 Mean :2020-04-10
## 3rd Qu.:1.400e+01 3rd Qu.:2020-05-23 3rd Qu.:2020-05-27
## Max. :1.000e+09 Max. :2020-06-28 Max. :2020-06-30
## NA's :4380 NA's :4476 NA's :3
## released_date deceased_date state
## Min. :2020-02-05 Min. :2020-02-19 Length:5165
## 1st Qu.:2020-03-20 1st Qu.:2020-03-02 Class :character
## Median :2020-03-28 Median :2020-03-09 Mode :character
## Mean :2020-04-03 Mean :2020-03-17
## 3rd Qu.:2020-04-14 3rd Qu.:2020-03-30
## Max. :2020-06-28 Max. :2020-05-25
## NA's :3578 NA's :5099
The maximum value of contact_number is extraordinarily large, which indicates a possible data error.
patient_data %>% arrange(desc(contact_number))
The largest value appears to be a data error: it is not feasible for anyone to have had 1000000796 contacts, and the magnitude and digit length of this value strongly suggest that a patient ID was mistakenly entered into the contact_number field. The row for this patient has no infection case either, so keeping it would not contribute anything. We are therefore going to drop the entire row; before doing that, let’s check whether any other patient was infected by this patient.
patient_data %>% filter(infected_by == "1000000819")
## [1] patient_id sex age country
## [5] province city infection_case infected_by
## [9] contact_number symptom_onset_date confirmed_date released_date
## [13] deceased_date state
## <0 rows> (or 0-length row.names)
Since no one was infected by this patient, the row can be safely removed.
patient_data <- patient_data %>% filter(patient_id != 1000000819)
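The same reasoning can be turned into a rule for flagging such entries instead of spotting them by eye. The sketch below, in base R with made-up rows, assumes that a contact count with as many digits as a patient_id is an entry error; that threshold is an illustrative choice, not something the dataset documents.

```r
# Made-up rows; the IDs and counts are illustrative only.
toy_patients <- data.frame(
  patient_id     = c(1000000001, 1000000002, 1000000003, 1000000004),
  contact_number = c(75, 1160, 1000000796, NA)
)

# Assumption: a contact count with as many digits as a patient_id
# (10 digits here) is treated as an entry error, not a real count.
id_digits <- nchar(format(min(toy_patients$patient_id), scientific = FALSE))
suspect <- !is.na(toy_patients$contact_number) &
  toy_patients$contact_number >= 10^(id_digits - 1)

# Rows flagged for manual review.
toy_patients[suspect, ]
```

Flagging first and removing second leaves room to inspect each suspect row, as was done here by checking the infected_by column before dropping the record.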
In addition, the sex column could contain categories other than female and male; let’s check for any such values.
patient_data %>% filter(!sex %in% c("female", "male")) %>% filter(!is.na(sex))
## [1] patient_id sex age country
## [5] province city infection_case infected_by
## [9] contact_number symptom_onset_date confirmed_date released_date
## [13] deceased_date state
## <0 rows> (or 0-length row.names)
There are no sex values other than female and male; the remaining entries are NA, meaning that no sex was recorded for the patient.
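A compact alternative check is to tabulate the factor together with its missing values. This is a small sketch in base R on a made-up vector; `useNA = "ifany"` asks table() to include an NA column when missing values are present.

```r
# Made-up sex values, including missing entries.
sex <- factor(c("male", "female", NA, "female", NA))

# Count each level and, if present, the missing values.
sex_counts <- table(sex, useNA = "ifany")
sex_counts

# The factor levels list every category that actually occurs.
levels(sex)
```

Any unexpected category would show up both as an extra level and as an extra column in the table.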
Let’s examine all the other datasets.
summary(time_data)
## date time test negative
## Min. :2020-01-20 Min. : 0.000 Min. : 1 Min. : 0
## 1st Qu.:2020-02-29 1st Qu.: 0.000 1st Qu.: 96488 1st Qu.: 58774
## Median :2020-04-10 Median : 0.000 Median : 503051 Median : 477303
## Mean :2020-04-10 Mean : 4.123 Mean : 497780 Mean : 475484
## 3rd Qu.:2020-05-20 3rd Qu.:16.000 3rd Qu.: 782558 3rd Qu.: 754222
## Max. :2020-06-30 Max. :16.000 Max. :1273766 Max. :1240157
## confirmed released deceased
## Min. : 1 Min. : 0 Min. : 0.0
## 1st Qu.: 3443 1st Qu.: 29 1st Qu.: 17.5
## Median :10450 Median : 7117 Median :208.0
## Mean : 7835 Mean : 5604 Mean :157.1
## 3rd Qu.:11116 3rd Qu.:10100 3rd Qu.:263.5
## Max. :12800 Max. :11537 Max. :282.0
summary(time_age_data)
## date time age confirmed
## Min. :2020-03-02 Min. :0 Length:1089 Min. : 32
## 1st Qu.:2020-04-01 1st Qu.:0 Class :character 1st Qu.: 530
## Median :2020-05-01 Median :0 Mode :character Median :1052
## Mean :2020-05-01 Mean :0 Mean :1158
## 3rd Qu.:2020-05-31 3rd Qu.:0 3rd Qu.:1537
## Max. :2020-06-30 Max. :0 Max. :3362
## deceased
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 3.00
## Mean : 23.42
## 3rd Qu.: 35.00
## Max. :139.00
summary(time_gender_data)
## date time sex confirmed deceased
## Min. :2020-03-02 Min. :0 female:121 Min. :1591 Min. : 9.0
## 1st Qu.:2020-04-01 1st Qu.:0 male :121 1st Qu.:4328 1st Qu.: 82.0
## Median :2020-05-01 Median :0 Median :5118 Median :125.0
## Mean :2020-05-01 Mean :0 Mean :5212 Mean :105.4
## 3rd Qu.:2020-05-31 3rd Qu.:0 3rd Qu.:6417 3rd Qu.:131.0
## Max. :2020-06-30 Max. :0 Max. :7305 Max. :151.0
summary(time_province_data)
## date time province confirmed
## Min. :2020-01-20 Min. : 0.000 Length:2771 Min. : 0.0
## 1st Qu.:2020-02-29 1st Qu.: 0.000 Class :character 1st Qu.: 9.0
## Median :2020-04-10 Median : 0.000 Mode :character Median : 42.0
## Mean :2020-04-10 Mean : 4.123 Mean : 444.3
## 3rd Qu.:2020-05-21 3rd Qu.:16.000 3rd Qu.: 133.0
## Max. :2020-06-30 Max. :16.000 Max. :6906.0
## released deceased
## Min. : 0.0 Min. : 0.00
## 1st Qu.: 1.0 1st Qu.: 0.00
## Median : 21.0 Median : 0.00
## Mean : 320.7 Mean : 9.24
## 3rd Qu.: 92.0 3rd Qu.: 1.00
## Max. :6700.0 Max. :189.00
summary(region_data)
## code province city latitude
## Min. :10000 Length:244 Length:244 Min. :33.49
## 1st Qu.:14028 Class :character Class :character 1st Qu.:35.41
## Median :30075 Mode :character Mode :character Median :36.39
## Mean :32912 Mean :36.40
## 3rd Qu.:51062 3rd Qu.:37.47
## Max. :80000 Max. :38.38
## longitude elementary_school_count kindergarten_count university_count
## Min. :126.3 Min. : 4.00 Min. : 4.00 Min. : 0.000
## 1st Qu.:126.9 1st Qu.: 14.75 1st Qu.: 16.00 1st Qu.: 0.000
## Median :127.4 Median : 22.00 Median : 31.00 Median : 1.000
## Mean :127.7 Mean : 74.18 Mean : 107.90 Mean : 4.152
## 3rd Qu.:128.5 3rd Qu.: 36.25 3rd Qu.: 55.25 3rd Qu.: 3.000
## Max. :130.9 Max. :6087.00 Max. :8837.00 Max. :340.000
## academy_ratio elderly_population_ratio elderly_alone_ratio
## Min. :0.190 Min. : 7.69 Min. : 3.30
## 1st Qu.:0.870 1st Qu.:14.12 1st Qu.: 6.10
## Median :1.270 Median :18.53 Median : 8.75
## Mean :1.295 Mean :20.92 Mean :10.64
## 3rd Qu.:1.613 3rd Qu.:27.26 3rd Qu.:14.62
## Max. :4.180 Max. :40.26 Max. :24.70
## nursing_home_count
## Min. : 11.0
## 1st Qu.: 111.0
## Median : 300.0
## Mean : 1159.3
## 3rd Qu.: 694.5
## Max. :94865.0
summary(weather_data)
## code province date avg_temp
## Min. :10000 Length:26271 Min. :2016-01-01 Min. :-14.80
## 1st Qu.:13500 Class :character 1st Qu.:2017-02-14 1st Qu.: 6.00
## Median :20000 Mode :character Median :2018-04-01 Median : 14.60
## Mean :32125 Mean :2018-03-31 Mean : 13.86
## 3rd Qu.:50500 3rd Qu.:2019-05-16 3rd Qu.: 21.90
## Max. :70000 Max. :2020-06-29 Max. : 33.90
## NA's :15
## min_temp max_temp precipitation max_wind_speed
## Min. :-19.200 Min. :-11.90 Min. : 0.000 Min. : 1.00
## 1st Qu.: 1.400 1st Qu.: 10.90 1st Qu.: 0.000 1st Qu.: 3.80
## Median : 9.900 Median : 19.80 Median : 0.000 Median : 4.70
## Mean : 9.665 Mean : 18.78 Mean : 1.487 Mean : 5.11
## 3rd Qu.: 18.200 3rd Qu.: 26.70 3rd Qu.: 0.000 3rd Qu.: 6.00
## Max. : 30.300 Max. : 40.00 Max. :266.000 Max. :29.40
## NA's :5 NA's :3 NA's :9
## most_wind_direction avg_relative_humidity
## Min. : 20.0 Min. : 10.4
## 1st Qu.: 90.0 1st Qu.: 53.6
## Median :200.0 Median : 66.9
## Mean :195.9 Mean : 65.7
## 3rd Qu.:290.0 3rd Qu.: 78.6
## Max. :360.0 Max. :100.0
## NA's :29 NA's :20
summary(search_trend_data)
## date cold flu pneumonia
## Min. :2016-01-01 Min. : 0.05163 Min. : 0.00981 Min. : 0.06881
## 1st Qu.:2017-02-14 1st Qu.: 0.10663 1st Qu.: 0.04210 1st Qu.: 0.12863
## Median :2018-03-31 Median : 0.13317 Median : 0.09785 Median : 0.16445
## Mean :2018-03-31 Mean : 0.19051 Mean : 0.24495 Mean : 0.22143
## 3rd Qu.:2019-05-15 3rd Qu.: 0.16590 3rd Qu.: 0.25004 3rd Qu.: 0.20977
## Max. :2020-06-29 Max. :15.72071 Max. :27.32727 Max. :11.39320
## coronavirus
## Min. : 0.00154
## 1st Qu.: 0.00627
## Median : 0.00890
## Mean : 1.86252
## 3rd Qu.: 0.01316
## Max. :100.00000
summary(seoul_floating_data)
## date hour birth_year sex
## Min. :2020-01-01 Min. : 0.00 Min. :20 female:542400
## 1st Qu.:2020-02-07 1st Qu.: 5.00 1st Qu.:30 male :542400
## Median :2020-03-17 Median :11.00 Median :45
## Mean :2020-03-16 Mean :11.48 Mean :45
## 3rd Qu.:2020-04-23 3rd Qu.:17.00 3rd Qu.:60
## Max. :2020-05-31 Max. :23.00 Max. :70
## province city fp_num
## Length:1084800 Length:1084800 Min. : 3630
## Class :character Class :character 1st Qu.: 18350
## Mode :character Mode :character Median : 25510
## Mean : 27427
## 3rd Qu.: 33940
## Max. :127640
summary(policy_data)
## policy_id country type gov_policy
## Min. : 1 Length:61 Length:61 Length:61
## 1st Qu.:16 Class :character Class :character Class :character
## Median :31 Mode :character Mode :character Mode :character
## Mean :31
## 3rd Qu.:46
## Max. :61
##
## detail start_date end_date
## Length:61 Min. :2020-01-03 Min. :2020-01-19
## Class :character 1st Qu.:2020-02-29 1st Qu.:2020-04-06
## Mode :character Median :2020-03-15 Median :2020-05-27
## Mean :2020-03-22 Mean :2020-05-02
## 3rd Qu.:2020-04-16 3rd Qu.:2020-06-03
## Max. :2020-06-10 Max. :2020-06-14
## NA's :37
Next, we can look at the date range coverage.
range(time_data$date, na.rm = TRUE)
## [1] "2020-01-20" "2020-06-30"
range(time_age_data$date, na.rm = TRUE)
## [1] "2020-03-02" "2020-06-30"
range(time_gender_data$date, na.rm = TRUE)
## [1] "2020-03-02" "2020-06-30"
range(time_province_data$date, na.rm = TRUE)
## [1] "2020-01-20" "2020-06-30"
As the results above show, time_data and time_province_data cover 2020-01-20 to 2020-06-30, while the age and gender time series start later, on 2020-03-02. All four datasets end on the same date, so they overlap from 2020-03-02 onward.
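When several datasets need to be compared, the coverage check can be done in one pass. This is a sketch in base R using small stand-in data frames that carry only the first and last dates reported above; the names `time_list`, `coverage`, `common_start`, and `common_end` are hypothetical.

```r
# Stand-ins for the time-series data frames (only the boundary dates are kept)
time_list <- list(
  time_data        = data.frame(date = as.Date(c("2020-01-20", "2020-06-30"))),
  time_age_data    = data.frame(date = as.Date(c("2020-03-02", "2020-06-30"))),
  time_gender_data = data.frame(date = as.Date(c("2020-03-02", "2020-06-30")))
)

# One row per dataset with its first and last date
coverage <- do.call(rbind, lapply(names(time_list), function(nm) {
  d <- time_list[[nm]]$date
  data.frame(dataset = nm, start = min(d, na.rm = TRUE), end = max(d, na.rm = TRUE))
}))

# The shared window is bounded by the latest start and the earliest end
common_start <- max(coverage$start)
common_end   <- min(coverage$end)

coverage
```

Any comparison that mixes the age or gender series with the national series should be restricted to the shared window.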
Next, we’re going to check that patient_id is unique.
sum(duplicated(patient_data$patient_id))
## [1] 1
As the result above shows, there is one duplicated patient_id.
patient_data %>%
filter(duplicated(patient_id) | duplicated(patient_id, fromLast = TRUE))
## patient_id sex age country province city infection_case
## 1 1200012238 female 20s Korea Daegu Icheon-dong overseas inflow
## 2 1200012238 female 20s Korea Daegu Nam-gu overseas inflow
## infected_by contact_number symptom_onset_date confirmed_date released_date
## 1 NA <NA> 2020-06-17 <NA>
## 2 NA <NA> 2020-06-17 <NA>
## deceased_date state
## 1 <NA> isolated
## 2 <NA> isolated
According to Wikipedia, Icheon-dong is a sub-district within Nam-gu, so we can assume the two rows describe the same patient and that it is safe to drop one of them.
patient_data <- patient_data %>% filter(!duplicated(patient_id))
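Before dropping duplicates, it can help to see exactly which columns disagree between the copies. Below is a minimal sketch on toy rows that mirror the duplicate above (only a few columns are kept, and the third row is hypothetical); `conflicts` and `deduped` are illustrative names.

```r
library(dplyr)

# Toy rows mirroring the duplicated patient_id, plus one unrelated row
toy <- data.frame(
  patient_id = c("1200012238", "1200012238", "1200012239"),
  province   = c("Daegu", "Daegu", "Daegu"),
  city       = c("Icheon-dong", "Nam-gu", "Nam-gu"),
  state      = c("isolated", "isolated", "released")
)

# For each duplicated ID, count distinct values per column;
# a count above 1 marks a column where the copies disagree
conflicts <- toy %>%
  group_by(patient_id) %>%
  filter(n() > 1) %>%
  summarise(across(everything(), n_distinct), .groups = "drop")

# Keep the first row per patient_id (same effect as !duplicated(patient_id))
deduped <- toy %>% distinct(patient_id, .keep_all = TRUE)

conflicts
```

Here only `city` differs, which supports treating the rows as one patient recorded at two levels of geographic detail.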
Next, we check for logical inconsistencies in the date fields: a patient recorded as released after being recorded as deceased would indicate a data-entry error.
patient_data %>% filter(released_date > deceased_date)
## [1] patient_id sex age country
## [5] province city infection_case infected_by
## [9] contact_number symptom_onset_date confirmed_date released_date
## [13] deceased_date state
## <0 rows> (or 0-length row.names)
No rows are returned, so the patient data contains no such inconsistency.
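One caveat with this kind of check is that `filter()` keeps only rows where the condition is TRUE, so any comparison involving a missing date is dropped silently instead of being flagged. The sketch below, on hypothetical toy dates, shows two further checks one might run: events dated before confirmation, and rows that carry both a release and a death date.

```r
library(dplyr)

# Toy patients (hypothetical dates); NA means the event was not recorded
toy <- data.frame(
  patient_id     = c("A", "B", "C"),
  confirmed_date = as.Date(c("2020-02-01", "2020-02-10", "2020-03-01")),
  released_date  = as.Date(c("2020-02-20", "2020-02-05", NA)),
  deceased_date  = as.Date(c(NA, NA, "2020-03-10"))
)

# Release or death recorded before confirmation; rows where the
# condition evaluates to NA are not returned
bad_order <- toy %>%
  filter(released_date < confirmed_date | deceased_date < confirmed_date)

# Rows carrying both outcomes are suspicious regardless of date order
both_outcomes <- toy %>%
  filter(!is.na(released_date), !is.na(deceased_date))

bad_order
```

An empty result therefore means "no violation among rows with both dates present", not "every row was verified".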
Let’s do some distribution analysis.
For the patient data,
ggplot(data = patient_data, mapping = aes(x = sex)) + geom_bar()
The sex variable splits cleanly into female, male, and NA (no sex information provided).
ggplot(patient_data, aes(x = contact_number)) + geom_histogram(bins = 30)
## Warning: Removed 4379 rows containing non-finite outside the scale range
## (`stat_bin()`).
There seem to be a few outlying values, yet a contact count around 1000 is plausible considering that mass infection or group spread could have occurred.
Let’s look at province, a categorical variable at a coarser level than city.
ggplot(data = patient_data, mapping = aes(x = province)) + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Patient information appears to have been provided for all 17 provinces, although the data source states that not all patient information from Daegu was included.
For time_age data, let’s look at age categorical data.
ggplot(time_age_data, mapping = aes(x = age)) + geom_bar()
All age groups appear to have the same number of observations.
Let’s look at time_gender data.
ggplot(data = time_gender_data, mapping = aes(x = sex)) + geom_bar()
Here as well, both sexes have the same number of observations (121 each, per the summary above).
Last but not least for time_province data, let’s look at the province (categorical data).
ggplot(time_province_data, mapping = aes(x = province)) + geom_bar() + coord_flip()
As with the other time-series datasets, the provinces appear to be equally represented.
The time-series datasets exhibit consistent and evenly distributed observations across dates. Each categorical group is represented uniformly throughout the time period, allowing for valid comparison of confirmed case and deceased case rates without concerns of temporal imbalance or reporting gaps.
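The bar charts suggest balance visually; the same claim can be checked numerically. This is a sketch on a hypothetical stand-in for time_gender_data (three dates by two sexes), where `balanced` and `no_repeats` are illustrative names.

```r
library(dplyr)

# Toy stand-in for time_gender_data: 3 dates x 2 sexes (hypothetical)
toy <- data.frame(
  date = rep(as.Date("2020-03-02") + 0:2, times = 2),
  sex  = rep(c("female", "male"), each = 3)
)

rows_per_group <- toy %>% count(sex)

# Balanced if every group has the same number of rows ...
balanced <- n_distinct(rows_per_group$n) == 1

# ... and no (date, sex) combination appears more than once
no_repeats <- !any(duplicated(toy[, c("date", "sex")]))

c(balanced = balanced, no_repeats = no_repeats)
```

Equal row counts alone do not rule out gaps, since a missing date in one group could be offset by a repeated date; checking for repeated combinations covers that case.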
This exploratory analysis investigates the first wave of COVID-19 in South Korea using national surveillance data. The analysis focuses on three outcomes: infection growth over time, demographic disparities in case fatality rates, and potential indicators of reduced disease severity such as recovery rates and policy interventions.
The data used for this analysis directly comes from https://www.kaggle.com/datasets/kimjihoo/coronavirusdataset/data, a structured dataset based on the report materials of KCDC and local governments.
To provide an overview of this analysis/ visualization report, this project will attempt to answer these fundamental questions:
To start things off, let’s take a look at the overall national trend
To examine the overall scale of infections, the analysis uses the cumulative number of confirmed cases from the Time.csv dataset, which records the total number of positive cases over time. The data spans from January 20, 2020 to June 30, 2020, allowing for an assessment of infection trends during the initial phase of the pandemic in South Korea.
ggplot(time_data, mapping = aes(x = date, y = confirmed)) +
  geom_line() +
  labs(title = "Cumulative COVID-19 Cases In South Korea", x = "Date", y = "# of confirmed cases", subtitle = "2020/01/20 - 2020/06/30", caption = "https://www.kaggle.com/datasets/kimjihoo/coronavirusdataset/data") +
  theme_linedraw() +
  theme(plot.title = element_text(face = "bold", hjust = 0.5), plot.subtitle = element_text(hjust = 0.5)) +
  annotate("rect", xmin = as.Date("2020-02-15"), xmax = as.Date("2020-04-15"), ymin = -Inf, ymax = Inf, fill = "red", alpha = 0.1)
As shown in the line graph, cumulative confirmed COVID-19 cases rose sharply from below 1000 in mid-February 2020 to over 10000 by April 2020, indicating a rapid acceleration in transmission. The steep slope during this period reflects a high growth rate in cumulative cases. Beginning in April 2020, the curve noticeably flattens, suggesting a deceleration in spread and a reduction in the rate of new infections. This slowdown coincides with the Korean government’s infectious disease alert level and strict interventions in April, together with expanded testing and contact tracing measures.
Then, what about the case fatality ratio and recovery rate?
cfr_recovery_rate <- time_data %>% mutate(cfr = deceased / confirmed, recovery_rate = released / confirmed)
long_rates <- cfr_recovery_rate %>% select(date, cfr, recovery_rate) %>% pivot_longer(cols = c(cfr, recovery_rate), names_to = "rate_type", values_to = "rate")
ggplot(long_rates, mapping = aes(x = date, y = rate, color = rate_type)) + geom_line() + annotate("rect", xmin = as.Date("2020-02-15"), xmax = as.Date("2020-04-15"), ymin = -Inf, ymax = Inf, fill = "red", alpha = 0.1) + labs(
title = "Evolution of Case Fatality Ratio (CFR) and Recovery Rate in South Korea",
x = "Date",
y = "Proportion of Confirmed Cases",
color = "Case Fatality Ratio/ Recovery Rate",
caption = "Data: Kim Jihoo Kaggle Coronavirus Dataset (KDCA & local governments)") + theme_minimal() + theme(plot.title = element_text(face = "bold"), legend.title = element_text(face = "bold", size = 11))
As illustrated in the graph, the recovery rate declines noticeably during the initial phase of the outbreak. This pattern primarily reflects timing: recoveries occur days or weeks after diagnosis, so the rapid surge in confirmed cases during late February grows the denominator (total confirmed cases) faster than recoveries can accumulate, temporarily lowering the observed recovery rate.
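A toy calculation makes the denominator effect concrete. The numbers below are entirely hypothetical: 10 new cases a day for 30 days, then 200 a day for 10 days, with every patient released exactly 14 days after confirmation (the 14-day lag is an assumption for illustration, not an estimate from the data).

```r
# Hypothetical outbreak: slow phase (days 1-30), then a surge (days 31-40)
days      <- 1:40
new_cases <- ifelse(days <= 30, 10, 200)
confirmed <- cumsum(new_cases)

# Everyone recovers: cumulative releases equal cumulative
# confirmations from 14 days earlier
released <- c(rep(0, 14), confirmed[1:26])

recovery_rate <- released / confirmed

# Observed recovery rate just before and during the surge
round(recovery_rate[c(30, 35, 40)], 3)
```

Even though every patient in this toy series recovers, the observed rate falls once the surge begins, because the numerator still reflects the smaller cohort confirmed two weeks earlier.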
Moreover, early testing strategies often prioritize symptomatic or severe cases, while mild or asymptomatic infections may go undetected. This selective detection can make the proportion of severe cases appear higher, further contributing to a lower observed recovery rate. Together, these dynamics explain why recovery rates often decline during periods of rapid case growth, even if the underlying probability of recovery has not worsened.
To examine how severity evolved and resolved over time in South Korea, we focus on a comparison between Seoul and Daegu using the case fatality ratio (CFR) and the recovery rate.
convergence_data <- Daegu_data %>%
  inner_join(Seoul_data, by = "date") %>%
  rename(cfr_daegu = cfr.x, cfr_seoul = cfr.y, recovery_rate_daegu = recovery_rate.x, recovery_rate_seoul = recovery_rate.y) %>%
  select(date, cfr_daegu, cfr_seoul, recovery_rate_daegu, recovery_rate_seoul) %>%
  mutate(disparity_cfr = cfr_daegu - cfr_seoul, disparity_recovery = recovery_rate_daegu - recovery_rate_seoul) %>%
  select(date, disparity_cfr, disparity_recovery)
ggplot(convergence_data, aes(x = date, y = disparity_cfr)) + geom_line(linewidth = 1, color = "darkred") + geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "CFR Gap Between Daegu and Seoul Converged Over Time", subtitle = "Difference in case fatality ratios (Daegu − Seoul, percentage points)", x = "Date", y = "CFR Difference (pp)", caption = "Gap approaching zero indicates convergence in severity") + theme_minimal()
## Warning: Removed 29 rows containing missing values or values outside the scale range
## (`geom_line()`).
The graph closely mirrors the CFR trend observed in Daegu, suggesting that the overall national severity of the pandemic was largely driven by the outbreak in Daegu during the early phase. Beginning in mid-April, the severity gap gradually narrows, indicating that the disparity between Daegu and Seoul diminished as case fatality rates in Daegu declined.
ggplot(convergence_data, aes(x = date, y = disparity_recovery)) + geom_line(linewidth = 1, color = "lightblue") + geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "Recovery Gap Between Daegu and Seoul Converged Over Time", subtitle = "Difference in recovery rates (Daegu − Seoul, percentage points)", x = "Date", y = "Recovery Rate Difference (pp)", caption = "Gap approaching zero indicates convergence in recovery") + theme_minimal()
## Warning: Removed 29 rows containing missing values or values outside the scale range
## (`geom_line()`).
Comparing the recovery gap and CFR gap charts together provides a more comprehensive view of regional disparity during the outbreak. The patterns suggest that Daegu was not only the epicenter of the epidemic but also faced significant strain on medical resources, as widely reported at the time. Notably, while the maximum CFR gap reaches approximately 2.3 percentage points, the recovery gap peaks at around 40 percentage points, indicating a much larger divergence in recovery outcomes. This disparity reflects the relative strain on Daegu’s healthcare system compared to Seoul, where more advanced infrastructure and greater resource availability may have facilitated faster recovery rates.
time_data_rate <- time_data %>%
arrange(date) %>% mutate(new_cases = confirmed - lag(confirmed))
ggplot(time_data_rate, aes(x = date, y = new_cases)) + geom_line(color = "darkred", alpha = 0.8, linewidth = 0.5) + geom_vline(data = policy_data, aes(xintercept = start_date), inherit.aes = FALSE, linetype = "dashed", alpha = 0.35, linewidth = 0.3, color = "grey40") +
labs(
title = "Daily New Confirmed COVID-19 Cases in South Korea",
subtitle = "New cases per day with policy implementation dates",
x = "Date",
y = "Number of New Confirmed Cases",
caption = "Vertical dashed lines indicate policy start dates | Data: Kim Jihoo Kaggle Coronavirus Dataset (KDCA & local governments)"
) +
theme_minimal() + scale_x_date(date_breaks = "1 month", date_labels = "%b %Y")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
As shown above, periods marked by intensified policy implementation are followed by noticeable declines in the number of daily new confirmed cases, suggesting that stricter nationwide interventions were associated with a slowdown in transmission. Several policy markers appear shortly after major spikes in daily case counts, indicating that the government responded rapidly to surges in infections. Although the effects of these interventions were not immediate, given the incubation period of the virus and reporting delays, the overall downward trend in new cases following peak periods suggests that policy measures contributed to stabilizing and eventually reducing transmission.
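Daily counts are noisy, which makes it harder to judge by eye whether a decline follows a policy date. One option, not used in the plot above, is a trailing 7-day mean of new cases. The sketch below uses hypothetical cumulative counts and base R only; the 7-day window is an assumption chosen to smooth day-to-day fluctuations.

```r
# Hypothetical cumulative confirmed counts for 10 consecutive days
confirmed <- c(1, 2, 4, 8, 16, 30, 50, 70, 85, 95)

# Daily new cases; the first day has no predecessor, hence NA
new_cases <- c(NA, diff(confirmed))

# Trailing 7-day mean: average of the current day and the six days before it.
# stats::filter is called with its namespace because dplyr masks filter()
new_cases_7d <- as.numeric(stats::filter(new_cases, rep(1 / 7, 7), sides = 1))

data.frame(day = seq_along(confirmed), new_cases, new_cases_7d)
```

The smoothed series is NA until seven non-missing daily values are available, so the first week of any such series carries no estimate.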
# Classify each cluster by keyword. case_when() applies the first matching rule,
# so "church" already covers "churches" and the order of the rules matters.
infection_summary <- case_data %>% mutate(infection_source = case_when(
str_detect(infection_case, regex("church", ignore_case = TRUE)) ~ "CHURCH",
str_detect(infection_case, regex("hospital|medical", ignore_case = TRUE)) ~ "HOSPITAL",
str_detect(infection_case, regex("overseas inflow", ignore_case = TRUE)) ~ "OVERSEAS",
str_detect(infection_case, regex("etc", ignore_case = TRUE)) ~ "UNKNOWN",
str_detect(infection_case, regex("clubs", ignore_case = TRUE)) ~ "SOCIAL HANGOUT",
str_detect(infection_case, regex("contact with patient", ignore_case = TRUE)) ~ "CONTACT WITH A PATIENT",
TRUE ~ "OTHER"
))
infection_summary <- infection_summary %>% count(infection_source) %>% mutate(prop = n /sum(n))
ggplot(infection_summary,
aes(x = reorder(infection_source, prop),
y = prop,
fill = infection_source)) +
geom_col() +
coord_flip() +
scale_y_continuous(labels = scales::percent) +
labs(
title = "Distribution of COVID-19 Infection Sources",
x = "Infection Source",
y = "Share of Cases"
) +
theme_minimal() + scale_fill_brewer() + theme(plot.title = element_text(face = "bold"))
As the case data indicates, the sources of infection were diverse, encompassing workplaces, schools, community gatherings, and overseas exposure. However, when excluding the broad “Other” category, which aggregates multiple smaller exposure types, church-related gatherings emerge as the most prominent identifiable source of transmission. This pattern highlights the role of large, close-contact congregational settings in facilitating rapid cluster-based spread during the early phase of the outbreak.
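One thing to keep in mind when reading this chart is that `count(infection_source)` tallies rows of case_data, and each row is a cluster with its own `confirmed` total (as `head(case_data)` shows). The result is therefore a share of cluster records. If a share of confirmed cases is wanted instead, one possible variant weights each row by `confirmed`. The sketch below uses hypothetical sources and counts to show how far the two measures can diverge.

```r
library(dplyr)

# Toy cluster records (hypothetical sources and case counts)
toy_cases <- data.frame(
  infection_source = c("CHURCH", "CHURCH", "OTHER", "OTHER", "OTHER"),
  confirmed        = c(4000, 100, 50, 30, 20)
)

# Share of cluster records per source (what count() alone measures)
by_rows <- toy_cases %>%
  count(infection_source) %>%
  mutate(prop = n / sum(n))

# Share of confirmed cases per source, weighting each record by its size
by_cases <- toy_cases %>%
  count(infection_source, wt = confirmed) %>%
  mutate(prop = n / sum(n))

by_rows
by_cases
```

Which of the two is appropriate depends on the question: the number of distinct clusters tied to a setting, or the number of people infected through it.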
Throughout this project, the most challenging aspect was not generating visualizations, but ensuring that the metrics I used were statistically coherent and conceptually meaningful. At the beginning of the analysis, I initially attempted to compute infection and mortality rates using the total number of tests as the denominator. At first glance, this seemed reasonable, since testing volume reflects detection activity. However, upon closer examination, I realized that using national-level test counts as a denominator for age-specific or province-specific confirmed cases was methodologically inconsistent. The numerator and denominator did not represent the same risk pool, which would have led to misleading interpretations.
This realization forced me to rethink how rates should be constructed. I learned that denominators must correspond to the same population as the numerator. When population data were unavailable, I replaced “infection rate” with share of confirmed cases to describe burden concentration rather than risk. For mortality analysis, I distinguished between province-level death share (deceased divided by total national deaths) and case fatality ratios (deceased divided by confirmed within the same province). This distinction clarified the difference between overall burden and conditional severity. I also reconsidered how cumulative data should be handled. Because the dataset recorded cumulative confirmed cases, summing across dates would have resulted in double counting. Instead, I extracted final cumulative values or computed daily new cases to avoid distortion in time-based analysis.
Another major realization was the conceptual difference between concentration and severity. Initially, I expected Daegu to exhibit both the highest case share and the highest fatality rate. However, after computing the appropriate metrics, I found that while Daegu accounted for the largest share of confirmed cases and deaths, it did not necessarily have the highest case fatality ratio. This forced me to separate disease burden (share of confirmed cases or deaths) from conditional severity (CFR). Understanding this distinction significantly strengthened the clarity and precision of my analysis. Share-based measures were more appropriate for comparing overall impact, while CFR was better suited for evaluating risk conditional on infection.
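The distinction between burden and severity, and the handling of cumulative counts, can be sketched in a few lines. The provinces and numbers below are hypothetical placeholders, not values from the dataset; `case_share`, `death_share`, and `cfr` follow the definitions given above.

```r
library(dplyr)

# Toy cumulative province series (hypothetical numbers, two dates)
toy <- data.frame(
  date      = as.Date(rep(c("2020-06-29", "2020-06-30"), each = 3)),
  province  = rep(c("P1", "P2", "P3"), times = 2),
  confirmed = c(6800, 1300, 60, 6900, 1320, 62),
  deceased  = c(188, 7, 3, 189, 7, 3)
)

# Counts are cumulative, so take the latest row per province
# instead of summing over dates (which would double count)
latest <- toy %>%
  group_by(province) %>%
  slice_max(date, n = 1, with_ties = FALSE) %>%
  ungroup()

rates <- latest %>%
  mutate(
    case_share  = confirmed / sum(confirmed),  # burden: share of national cases
    death_share = deceased / sum(deceased),    # burden: share of national deaths
    cfr         = deceased / confirmed         # severity: deaths per confirmed case
  )

rates
```

In this toy example P1 carries the largest share of cases and deaths, while the much smaller P3 has the highest CFR, which is the same separation between concentration and conditional severity described above.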
An additional insight was that data cleaning is not a one-time preliminary step, but an ongoing process that continues throughout visualization and analysis. As I created plots, I discovered inconsistencies in variable definitions, cumulative structures, and grouping logic that required adjustments mid-analysis. For example, calculating growth rates from cumulative data initially seemed straightforward, but I later recognized that daily new cases provided a clearer and more interpretable measure of transmission trends when analyzing policy timing. Determining which variables were appropriate for comparison, extracting the correct level of aggregation (province versus city), and selecting meaningful denominators were decisions that evolved throughout the project. This experience reinforced that effective analysis depends not only on plotting data, but on understanding what the data truly represent.
There was also considerable trial and error in selecting appropriate visualizations. I explored more complex visual forms, such as parallel coordinate plots, to compare provincial characteristics simultaneously, but encountered challenges related to scaling and grouping across mixed variable types. This process highlighted the importance of matching visualization structure to the analytical question. Some plots are useful for exploratory pattern detection, while others are better suited for clearly communicating specific relationships.
Overall, this project pushed me to think more critically about how data structure influences interpretation. I learned that generating a plot is relatively straightforward, but ensuring that the underlying metric is logically and statistically valid requires careful reasoning. The process of debugging denominators, distinguishing burden from severity, handling cumulative values appropriately, and refining definitions of “rate” significantly improved both the rigor and credibility of my analysis.