CIS670 Project: Analyze COVID-19 in South Korea

Important Note

Depending on how you access our project, some figures may not be displayed properly:

From RPubs: Everything works on RPubs. This is the recommended way to view our project.
From GitHub: The README displays everything EXCEPT for the interactive leaflet map. To view the interactive map, you need to download README.html and open it with a browser.

Introduction

In this project, we perform in-depth EDA(Exploratory Data Analysis) on a collection of data sets about COVID-19 in South Korea. Our goal is to use these data to better understand COVID and learn how to prevent it effectively to save lives.

We answered the following questions by performing in-depth data visualization, and regression analysis.

What is the general trend of confirmed, released, and deceased cases of COVID-19 in South Korea? How can we visualize it on a map?
What can we learn from the COVID patients? How does COVID affect different age groups, gender, and location? What is the mortality rate among different age groups and gender groups?
How can we identify massive infection events? What can we learn from those massive infection events?
How is the web search trend affected by COVID in South Korea since the pandemic started?
Is there a relationship between daily temperature and the number of confirmed cases?

Library Dependencies

Our project utilizes the following libraries, please make sure they are installed by using the install.packages() command.

tidyverse
ggplot2
ggmap
gganimate
gifski
tibbletime
data.table
htmltools
leaflet
scales

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.2     ✓ dplyr   1.0.6
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(ggplot2)
library(ggmap)

## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.

## Please cite ggmap if you use it! See citation("ggmap") for details.

library(gganimate)

library(tibbletime)

## 
## Attaching package: 'tibbletime'

## The following object is masked from 'package:stats':
## 
##     filter

library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

## The following object is masked from 'package:purrr':
## 
##     transpose

library(htmltools)
library(leaflet)

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

Datasets

There are many COVID datasets available on Kaggle. However, most of them only have information on date and the number of confirmed cases. This South Korea COVID dataset, on the other hand, contains comprehensive information on cases, patient info, weather, geographical location, contact tracing, etc. Therefore, it is ideal for this project.

The data is collected from KCDC (Korea Centers for Disease Control & Prevention). The data set is available on Kaggle. The following code snippet loads the dataset into R workspace, the best way to view the data is through R.

# Read Datasets, treat empty string value as NA
case <- read.csv('./Data/Case.csv', na.strings = c("", "NA"))
patient <- read.csv('./Data/PatientInfo.csv', na.strings = c("", "NA"))
policy <- read.csv('./Data/Policy.csv', na.strings = c("", "NA"))
region <- read.csv('./Data/Region.csv', na.strings = c("", "NA"))
searchtrend <- read.csv('./Data/SearchTrend.csv', na.strings = c("", "NA"))
seoulfloating <- read.csv('./Data/SeoulFloating.csv', na.strings = c("", "NA"))
time <- read.csv('./Data/Time.csv', na.strings = c("", "NA"))
timeage <- read.csv('./Data/TimeAge.csv', na.strings = c("", "NA"))
timegender <- read.csv('./Data/TimeGender.csv', na.strings = c("", "NA"))
timeprovince <- read.csv('./Data/TimeProvince.csv', na.strings = c("", "NA"))
weather <- read.csv('./Data/Weather.csv', na.strings = c("", "NA"))
route <- read.csv('./Data/routes.csv', na.strings = c("", "NA"))
patient_1 <- read.csv('./Data/patients.csv', na.strings = c("", "NA"))

Qestion 1: What is the general trend of confirmed, released, and deceased cases of COVID-19 in South Korea?

In this section, we will be plotting the the confirmed, released, and deceased cases of COVID-19 in South Korea against time. These trends would help us or other researchers with building math models and identify key features that could potentially lowered the growth rate of the confirmed or deceased cases throughout time.

Daily Confirmed Cases Trend

We plotted the daily confirmed cases by date. We also used geom_smooth function to plot the trend for better visualization.

# Filter out empty value for confirmed date
patient_dates <- patient_1 %>% 
  filter(!is.na(confirmed_date))

# Plot confirmed cases for each day
patient_dates %>%
  group_by(confirmed_date) %>%
  summarise(n = n()) %>%
  ggplot(aes(as.Date(confirmed_date, format = "%m/%d/%Y"), n))+
  geom_line(size = 2, alpha = 0.8, col="red")+
  geom_point(size = 5, alpha = 1, col="purple")+
  labs(title = "Daily Confirmed Cases of COVID-19 in South Korea by Date",
       subtitle = paste0("Total Confirmed Cases: ", nrow(patient)),
       caption = "Green Line Indicates the Trend") + 
  scale_x_date(date_labels = "%b %d", date_breaks = "2 days") +
  theme_light() +
  theme(plot.title = element_text(face = "bold", 
                                  hjust = 0.5, size = 18, color = "black"),
        plot.subtitle = element_text(face = "bold", 
                                     size = 12, color = "blue")) +
  ggrepel::geom_label_repel(aes(label=n), hjust=1.5, vjust=0.9, 
                            fill ="orange") +
  geom_smooth(method = "gam", se = TRUE, alpha = 0.5, col = "green")

Analysis:
As shown in the figure, marked by the green line, we can see that there is a huge surge from late Feburary to Early March in South Korea.

Total Confirmed Cases, Deceased Cases, and Released Cases Trend

After plotting the daily confirmed cases trend, we are interested in learning the trend for total confirmed, deceased, and released cases since the start of the pandemic. The blue line marked the total confirmed cases, the red line marked the total deceased cases, and the green line marked the total released cases.

# Filter out empty value for confirmed date
time_data <- time
# Plot confirmed cases for each day
ggplot () + 
  geom_point(data = time_data, aes(x = as.Date(date, format = "%Y-%m-%d"), y = confirmed), color = "blue") +
  geom_point(data = time_data, aes(x = as.Date(date, format = "%Y-%m-%d"), y = deceased), color = "red") + 
  geom_point(data = time_data, aes(x = as.Date(date, format = "%Y-%m-%d"), y = released), color = "green") +
  scale_x_date(date_labels = "%b %d", date_breaks = "30 days") +
  labs(title = "Total Confirmed, Deased, Released Cases by Date",
       caption = "Blue: Confirmed, Red: Deceased, Green: Released")

Analysis:
As shown in the figure, we can see that the total deceased cases trend is relatively flat, while both the total confirmed and total released cases has a huge surge around late Feburary. This matches the observation from previous fiture, where there is a huge surge in confirmed cases around late Feburary to early March.

Question 2: What can we learn from the COVID patients?

In this section, we will be dividing the patients into age and gender groups, and compare the different statistics among them. These comparisons could help normal individuals to be more aware of their risk of having COVID-19, and medical professionals to better care for patients in different groups.

Confirmed Cases Among Different Age Groups

# Calculate Age for each Confirmed Case
patient_age <- patient_1
patient_age$age = 2020 - patient_1$birth_year

# Calculate Age Groups for Confirmed Cases by 10
patient_age$age_group = cut(patient_age$age,
                        breaks = seq(0,90, by=10),
                        right = FALSE,
                        labels = c("0-10","10-20","20-30",
                                   "30-40","40-50","50-60",
                                   "60-70","70-80","80-90"))

patient_age <- patient_age %>%
    select(age_group) %>%
    na.omit() %>%
    group_by(age_group) %>%
    summarise(confirmed_cases = n())
  
ggplot(patient_age, aes(y = age_group, x = confirmed_cases, fill = age_group)) +
  geom_bar(stat="identity") +
  xlab("Confirmed Cases") +
  ylab("Age Group") +
  ggtitle(label = "Confirmed Cases by Age Group")

Analysis:
As shown in the figure, the most affected age group is 20-30, the second most affected group is 50-60.

Most dangenrous group by age

patient_age <- patient
patient_age$contact_number[patient_age$contact_number == "-"] = NA
patient_age_c = patient_age[!is.na(patient_age$contact_number),]
patient_age_c$contact_number = as.numeric(patient_age_c$contact_number)


patient_age = aggregate(patient_age_c$contact_number, by = list(Category = patient_age_c$age), FUN = mean)

ggplot(patient_age, aes(y = Category, x = x, fill = Category)) +
  geom_bar(stat="identity") +
  xlab("Contact Number") +
  ylab("Age Group") +
  ggtitle(label = "Average Contact Number by Age Group")

Analysis:
As shown in the figure, On average, people in 20s-50s, and 70s have contact number above 30.

Deceased Cases Among Different Age Groups

# Calculate Age for each Confirmed Case
patient_age <- patient_1
patient_age$age = 2020 - patient_1$birth_year

# Calculate Age Groups for Confirmed Cases by 10
patient_age$age_group = cut(patient_age$age,
                            breaks = seq(0,90, by=10),
                            right = FALSE,
                            labels = c("0-10","10-20","20-30",
                                       "30-40","40-50","50-60",
                                       "60-70","70-80","80-90"))
patient_age = patient_age[!is.na(patient_age$deceased_date),]

patient_age <- patient_age %>%
  select(age_group) %>%
  na.omit() %>%
  group_by(age_group) %>%
  summarise(deceased_cases = n())

ggplot(patient_age, aes(y = age_group, x = deceased_cases, fill = age_group)) +
  geom_bar(stat="identity") +
  xlab("deceased Cases") +
  ylab("Age Group") +
  ggtitle(label = "Deceased Cases by Age Group")

Analysis:
As shown in the figure, the most affected age group is 80-90.

Province with the Highest Number of Confirmed Cases

# Group number of patients by province
patient_province <- patient
patient_province <- patient_province %>%
  select(province) %>%
  na.omit()%>%
  group_by(province)%>%
  summarise(n=n())

ggplot(patient_province, aes(y=n, x=reorder(province,n), fill=province)) +
    geom_bar(stat="identity") +
    coord_flip() +
    xlab("Province Name") +
    ylab("Confirmed Cases") +
    ggtitle(label = "Confirmed Cases in Each Province")

Analysis:
As shown in the figure, Seoul is the most affected province, with Gyeongsangbuk-do and Gyeonggi-do close to Seoul.

Confirmed Cases among Genders

patient_gender <- patient
patient_gender <- patient_gender %>%
  select(sex) %>%
  na.omit() %>%
  group_by(sex) %>%
  summarise(patient_num = n())

ggplot(patient_gender,
       aes(x="",y=patient_num,fill=sex)) + 
  geom_bar(width=1, stat="identity",color="white") + 
  ggtitle(label = "Confirmed Cases Gender Ratio") + 
  coord_polar("y",start=0) + 
  geom_text(aes(label=paste0(round(patient_num/ sum(patient_num)*100,1),"%")),
            position=position_stack(vjust=0.5)) + 
  theme_void() + 
  theme(plot.title = element_text(size = 30, face = "bold"))

Analysis:
As shown in the figure, COVID affects females more than male.

Deceased Cases among Genders

patient_gender <- patient
patient_gender = patient_gender[!is.na(patient_gender$deceased_date),]
patient_gender <- patient_gender %>%
  select(sex) %>%
  na.omit() %>%
  group_by(sex) %>%
  summarise(patient_num = n())
ggplot(patient_gender,
       aes(x="",y=patient_num,fill=sex)) + 
  geom_bar(width=1, stat="identity",color="white") + 
  ggtitle(label = "Deceased Gender Ratio") + 
  coord_polar("y",start=0) + 
  geom_text(aes(label=paste0(round(patient_num/ sum(patient_num)*100,1),"%")),
            position=position_stack(vjust=0.5)) + 
  theme_void() + 
  theme(plot.title = element_text(size = 30, face = "bold"))

Analysis:
As shown in the figure, male has a much higher mortality rate than female.

Mortality Rate among Genders

patient_gender <- patient
patient_gender_d = patient_gender[!is.na(patient_gender$deceased_date),]

patient_gender <- patient_gender %>%
  select(sex) %>%
  na.omit() %>%
  group_by(sex) %>%
  summarise(patient_num = n())

patient_gender_d <- patient_gender_d %>%
  select(sex) %>%
  na.omit() %>%
  group_by(sex) %>%
  summarise(patient_num = n())

patient_gender_d[2] = patient_gender_d[2]/ patient_gender[2] * 100

ggplot(patient_gender_d, aes(x = sex, y = patient_num, fill = sex)) +
  geom_bar(stat="identity") +
  xlab("mortality rate") +
  ylab("Gender") +
  ggtitle(label = "Mortality Rate by Gender")

Confirmed Cases: Age Group vs. Province

patient_age_province <- patient_1
patient_age_province$age = 2020 - patient_1$birth_year
patient_age_province$age_group = cut(patient_age_province$age,
                        breaks = seq(0,90, by=10),
                        right = FALSE,
                        labels = c("0-10","10-20","20-30",
                                   "30-40","40-50","50-60",
                                   "60-70","70-80","80-90"))

patient_age_province %>%
    select(age_group, province) %>%
    na.omit() %>%
    group_by(age_group, province) %>%
    summarise(Count = n()) %>%
    ggplot() +
    geom_tile(aes(x = age_group, y = province, fill = Count)) +
    geom_vline(xintercept = seq(0.5,9.5,by = 1), linetype = 'dashed') +
    geom_hline(yintercept = seq(0.5,14.5,by = 1), linetype = 'dashed') +
    scale_fill_gradientn(colours = c("green","red"),
                         values = c(0,0.1,1)) +
    scale_x_discrete(expand = c(0,0)) +
    scale_y_discrete(expand = c(0,0)) +
    theme_classic() +
    theme(text = element_text(size = 15, face = "bold"),
          axis.text.x = element_text(angle = 90),
          legend.position = "top",
          legend.key.width = unit(3,"cm"))

Analysis:
As shown in the figure, the most affected group is age from 30-40, living in capital area.

Confirmed Cases: Age Group vs. Province vs. Gender

patient_age_province_gender <- patient_age_province
patient_age_province_gender %>%
    select(sex, age_group, province) %>%
    na.omit() %>%
    group_by(sex, age_group, province) %>%
    summarise(Count = n()) %>%
    ggplot() +
    geom_tile(aes(x = age_group, y = province, fill = Count)) +
    geom_vline(xintercept = seq(0.5,9.5,by = 1), linetype = 'dashed') +
    geom_hline(yintercept = seq(0.5,14.5,by = 1), linetype = 'dashed') +
    scale_fill_gradientn(colours = c("green","red"),
                         values = c(0,0.1,1)) +
    scale_x_discrete(expand = c(0,0)) +
    scale_y_discrete(expand = c(0,0)) +
    theme_classic() +
    theme(text = element_text(size = 15, face = "bold"),
          axis.text.x = element_text(angle = 90),
          legend.position = "top",
          legend.key.width = unit(3,"cm"))   +
    facet_wrap(~sex)

Analysis:
As shown in the figure, the most affected group is age from 30-40, male, living in capital area.

Question 3: How can we identify mass infection events?

In this section we visualize the infection events using the Google Map API. The following figures illustrate the mass infection locations in South Korea. These illustrations could help local disease control centers to better prevent Covid-19 infection spikes.

Visualize Infection Events on a Map

We want to mark the infection events along with their infection numbers on a map for better geological visualization. We used ggmap function to create a base map of South Korea, then plot each infection events with different size and color to mark its infection number.

case$longitude <- as.numeric(as.character(case$longitude))

## Warning: NAs introduced by coercion

case$latitude <- as.numeric(as.character(case$latitude))

## Warning: NAs introduced by coercion

# Set Google API key
register_google(key = "AIzaSyCiHpBw6DsGxXs6pgfgQTd2tP4wAvzrmJI")
# Get Map from Google Maps API
map = get_googlemap(center = c(lon = 127.7669, lat = 35.9087),
                    zoom = 7, scale = 4,
                    maptype = 'roadmap',
                    color = 'color')

figure <- ggmap(map) +
    theme_void() +
    ggtitle("Number of Confrimed Cases by Case ID in South Korea") +
    scale_color_gradientn(colors = rainbow(5)) +
    theme(plot.title = element_text(color = "blue"),
          panel.border = element_rect(color = "grey", fill = NA, size = 2))

figure + geom_point(data = case, 
                    aes(longitude, latitude, size= confirmed, 
                        color=confirmed, group=confirmed))

## Warning: Removed 109 rows containing missing values (geom_point).

Analysis:
As shown in the figure, the area around Seoul have a large number of small infection events. There is a massive infection event in the city of Nam-gu, Daegu province. This massive infection event is at Shincheonji Church, which infected 4511 people.

Infection Events Over Time Animation

After creating the plot above, we are interested in learning how those infection events evolve over time. We combined the case dataframe with timprovince dataframe to get a dataframe with both geological and time infromation. Then, we use gganimate to create a gif that shows how the infection events evolve over time.

# Get a data frame that contains both time and coordinates
case_bak <- subset(case, select = c(province, longitude, latitude))

geo_time <- merge(case_bak, timeprovince, by="province")
geo_time$date <- as.Date(geo_time$date, format = "%Y-%m-%d")


# Set Google API key
register_google(key = "AIzaSyCiHpBw6DsGxXs6pgfgQTd2tP4wAvzrmJI")
# Get Map from Google Maps API
map = get_googlemap(center = c(lon = 127.7669, lat = 35.9087),
                    zoom = 7, scale = 4,
                    maptype = 'roadmap',
                    color = 'color')

figure <- ggmap(map) +
    theme_void() +
    ggtitle("Number of Confrimed Cases by Case ID in South Korea Over Time") +
    scale_color_gradientn(colors = rainbow(5)) +
    theme(plot.title = element_text(color = "blue"),
          panel.border = element_rect(color = "grey", fill = NA, size = 2))

figure + geom_point(data = geo_time, 
                    aes(longitude, latitude, size= confirmed, 
                        color=confirmed, group=confirmed)) +
  transition_time(date)

Analysis:
The following animation shows the number of confirmed cases by location increasing over time. The figure shows that Covid-19 spreads most rapidly in Seoul, Incheon, and Daegu. However, the spread of Covid-19 in Busan, the second largest city in South Korea, was low in comparison. Future work investigating the different treatments of disease control in Busan and other large cities could give insight in how to reduce the spread in large populations.

Massive Infections with Details Visualization

To take the visualization to another level, we created an interactive map that let users view the massive infection events along with patient info by using leaflet widget.

# Known Group Infection Events
group_infec <- tibble(lon = c(127.0518049,126.9165592,127.1217472,128.7368285,129.0771404,128.4941047,128.5707814,128.5665743,128.9099632),
                          lat = c(37.2367708,37.6338843,37.3881528,35.6486172,35.2159947,36.0579173,35.8507155,35.8397168,36.9273528),
                          address = c('650-103 mangpo-dong, yeongtong-gu, suwon, gyeonggi-do, south korea','1021 tongil-ro, jingwan-dong, eunpyeong-gu, seoul, south korea','20 seohyeon-ro 180(baekpalsip)beo, seohyeon-dong, bundang-gu, seongnam-si, gyeonggi-do, south korea','100-1 beomgok-ri, hwayang-eup, cheongdo, gyeongsangbuk-do, south korea','436-8 oncheon 1(il)-dong, dongnae-gu, busan, south korea','918 yuhak-ro, gasan-myeon, chilgok-gun, gyeongsangbuk-do, south korea','72-10 seongdang-dong, dalseo-gu, daegu, south korea','81 daemyeong-ro, daemyeong 10(sip)-dong, nam-gu, daegu, south korea','916 soro-ri, chunyang-myeon, bonghwa-gun, gyeongsangbuk-do, south korea'),
                          korean_name = c('수원 생명샘교회','은평 성모병원','분당 제생병원','청도 대남병원','부산 온천교회','칠곡 밀알 사랑의집','대구 한마음아파트','대구 신천지 교회','봉화 푸른 요양원'),
                          english_name = c('Soowon Saengmyungsame Church','Eunpyung Sungmo Hospital','Boondang Jesaeng Hospital','Chungdo Daenam Hospital','Busan Onchun Church','Chilgok Milal Sarang House','Daegu Hanmaeum Apartment','Daegu Sinchunji Church','Bonghwa Pureun Clinic'),
                          n_of_infections= c(10,14,11,118,34,22,46,4482,51))

# Pre-processing
group_infec_2 <- group_infec %>% select(lon,lat,korean_name,english_name,n_of_infections) %>% 
  rename(latitude2 = lat, longitude2 = lon, id = english_name) %>% 
  as.data.table() 

group_infec_3 <- group_infec_2[, list(label=HTML(
  paste(sep = "<br>",
        paste("G.I(eng): ", id),
        paste("G.I(kor): ", korean_name),
        paste("Num of infects:", n_of_infections)))),
  list(id, longitude2, latitude2)]

#Combining route and patient and pre-processing

agg_1 <- route %>% left_join(patient_1, by="global_id") %>% 
  mutate(infected_by = ifelse(is.na(infected_by),"Unk",infected_by)) %>% 
  as.data.table

agg_2 <- agg_1[, list(label=HTML(
  paste(sep="<br>",
        paste("Patient Id :", global_id),
        paste("Sex :", sex),
        paste("Country :", country),
        paste("Disease :", disease),
        paste("Group:", group),
        paste("Infection_reason:", infection_reason),
        paste("Infection_order:",  infection_order),
        paste("Infected_by:", infected_by),
        paste("Contact_number:", contact_number),
        paste("Confirmed_date:", confirmed_date),
        paste("State:", state)))), 
  list(global_id, longitude, latitude)]

listed <- list(agg_2, group_infec_3)
agg_3 <- rbindlist(listed, use.names = TRUE, fill = TRUE)

agg_3 %>% 
  leaflet() %>%
  setView(128.5, 36.5, zoom = 7) %>%
  addTiles() %>%
  addProviderTiles("CartoDB.Positron") %>%
  #addProviderTiles : Stamen.Toner, CartoDB.Positron, Esri.NatGeoWorldMap, MtbMap
  addCircleMarkers(
    lat =  ~ latitude,
    lng =  ~ longitude,
    label = ~label,
    clusterOptions = markerClusterOptions(),
    labelOptions = labelOptions(noHide = F, 
                                direction = "left",
                                style = list(
                                  "color" = "black",
                                  "box-shadow" = "4px 4px rgba(0,0,0,0.25)",
                                  "font-size" = "13px",
                                  "border-color" = "rgba(0.5,3,0.5,0.5)"))) %>% 
  
  #Addding group infections
  addMarkers(
    lat =  ~ latitude2,
    lng =  ~ longitude2,
    label = ~label,
    labelOptions = labelOptions(noHide = F, 
                                direction = "right",
                                style = list(
                                  "color" = "red",
                                  "font-size" = "13px")))

## Warning in validateCoords(lng, lat, funcName): Data contains 338 rows with
## either missing or invalid lat/lon values and will be ignored

## Warning in validateCoords(lng, lat, funcName): Data contains 2050 rows with
## either missing or invalid lat/lon values and will be ignored

Analysis: This figure shows the COVID-19 contact locations for patients who reported where they were exposed. Community spread in residential areas is common. There are several locations that have large numbers of infections. Knowing what types of locations tend to have large COVID-19 infection events could help the CDC to identify particular community events according to the building type and come up with better policies.

Question 4: How is the web search trend affected by COVID in South Korea since the pandemic started?

We are interested in learning how COVID affect the web search trend. We plotted four different search word (cold, coronavirus, flue, pneumonia) along with confirmed, deceased and released cases over time. We also use a linear model to predict confirmed cases with different search trend.

Trends in Web Search

time_case <- read.csv("./Data/TimeProvince.csv", header = TRUE, stringsAsFactor =FALSE)  %>% 
    gather(key = "Cumulative_case",value="number", -c(1:3)) %>%
    group_by(date, Cumulative_case) %>% summarise(cum_number = sum(number))
search_trend <- read.csv("./Data/SearchTrend.csv", header = TRUE, stringsAsFactor =FALSE)  %>% 
    gather(key="Search_word", value="SearchTrend", -date) %>%
    filter(date >= "2020-01-20")


search_and_case <- search_trend  %>% left_join(time_case, by = 'date')

options(repr.plot.width = 20, repr.plot.height = 10)
search_and_case %>%
    ggplot(aes(as.Date(date, format = "%Y-%m-%d"), cum_number, group = Cumulative_case, color= Cumulative_case)) +
    geom_point() +
    geom_line(linetype = 'dashed') +
　　 theme(axis.text.x = element_text(angle = 75, hjust = 1),
        legend.text=element_text(size=20),
        legend.title=element_text(size=20),
        axis.title.x = element_text(color="black", size=20, face="bold"),
        axis.title.y = element_text(color="black", size=20, face="bold")) +
   geom_point(aes(as.Date(date, format = "%Y-%m-%d"), SearchTrend, group=Search_word, shape=Search_word), color='black', size=3) + 
   geom_line(aes(as.Date(date, format = "%Y-%m-%d"), SearchTrend, group=Search_word), color='black') +
   scale_y_log10() + 
   scale_x_date(date_labels = "%b %d", date_breaks = "30 days") +
   ylab("Search Volume or Cumulative Cases in log scale") + 
   ggtitle("Growth of COVID-19 cases and Web Search in South Korea")

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Transformation introduced infinite values in continuous y-axis

temp_trend <- read.csv("./Data/SearchTrend.csv", header = TRUE, stringsAsFactor =FALSE)  %>% 
    gather(key="Search_word", value="SearchTrend", -date) %>%
    filter(date >= "2020-01-20") %>%
    filter(Search_word == "coronavirus")
st_covid = subset(search_trend, select = -c(Search_word))
names(st_covid)[2] <- "covid"

temp_trend <- read.csv("./Data/SearchTrend.csv", header = TRUE, stringsAsFactor =FALSE)  %>% 
    gather(key="Search_word", value="SearchTrend", -date) %>%
    filter(date >= "2020-01-20") %>%
    filter(Search_word == "cold")
st_cold = subset(search_trend, select = -c(Search_word))
names(st_cold)[2] <- "cold"
temp_trend <- read.csv("./Data/SearchTrend.csv", header = TRUE, stringsAsFactor =FALSE)  %>% 
    gather(key="Search_word", value="SearchTrend", -date) %>%
    filter(date >= "2020-01-20") %>%
    filter(Search_word == "flu")
st_flu = subset(search_trend, select = -c(Search_word))
names(st_flu)[2] <- "flu"
temp_trend <- read.csv("./Data/SearchTrend.csv", header = TRUE, stringsAsFactor =FALSE)  %>% 
    gather(key="Search_word", value="SearchTrend", -date) %>%
    filter(date >= "2020-01-20") %>%
    filter(Search_word == "pneumonia")
st_pneumonia = subset(search_trend, select = -c(Search_word))
names(st_pneumonia)[2] <- "pneumonia"

st <- merge(st_pneumonia, st_flu, by="date")
st <- merge(st, st_cold, by="date")
st <- merge(st, st_covid, by="date")
t <- time %>% select(date,confirmed)
t <- merge (st, t, by = "date")

fit_lm = lm(confirmed~ . - date, data = t)
summary (fit_lm)

## 
## Call:
## lm(formula = confirmed ~ . - date, data = t)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9238  -1128   1659   2404  21255 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9294.732     21.676  428.80   <2e-16 ***
## pneumonia    -76.372      1.422  -53.69   <2e-16 ***
## flu          -76.372      1.422  -53.69   <2e-16 ***
## cold         -76.372      1.422  -53.69   <2e-16 ***
## covid        -76.372      1.422  -53.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3838 on 41467 degrees of freedom
## Multiple R-squared:  0.3186, Adjusted R-squared:  0.3185 
## F-statistic:  4847 on 4 and 41467 DF,  p-value: < 2.2e-16

Analysis:
This plot shows that the search for “cold”, “coronavirus”, “flu” and “pneumonia” skyrocketed as confirmed cases of COVID-19 went up, and as the growth rate of the cases decreases, the search trends also decrease. This is particularly helpfully because it shows that the general public’s awareness of the virus is related to its spread. In addition, we used a linear fit to predict confirmed cases with different search trend. In time of pandemic, common sickness’ search trends have similar weight in predicting confirmed cases.

Is there a relationship between daily temperature and the number of confirmed cases?

During the early stage of COVID, there is a rumor saying that COVID will go away as temperature warms up. We are interested in finding if there is a correlation between temperature and confirmed COVID cases. We performed a simple linear regression on temperature and confirmed cases for the period between 1/20 - 3/20.

w <- weather %>% group_by(date) %>% summarize(mean(avg_temp))
colnames(w) <- c("date","temperature")

t <- time %>% select(date,confirmed)

wt <- merge(w, t, by="date")

fit_lm <- lm(confirmed ~ . -date, data=wt)
summary(fit_lm)

## 
## Call:
## lm(formula = confirmed ~ . - date, data = wt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6752.2 -1919.3    90.2  2364.6  5350.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1140.75     446.24   2.556   0.0115 *  
## temperature   528.60      30.87  17.123   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2780 on 160 degrees of freedom
## Multiple R-squared:  0.647,  Adjusted R-squared:  0.6448 
## F-statistic: 293.2 on 1 and 160 DF,  p-value: < 2.2e-16

Analysis: As the summary shows, the p-value is very small (<2.2e-16), we can reject the null hypothesis. Therefore, there is a relationship between temperature and the number of confirmed cases.

wt <- head(wt, 61)

ggplot(wt, aes(x=temperature, y=confirmed)) +
         geom_point(aes(color=date)) +
         stat_smooth(method='lm') +
         stat_smooth(method ='glm')+
  labs(title = "Temperature and Number of Confirmed Cases", x = "Temp", y= "Case")

Analysis:
We select data from 1-20 to 3-20 for clear visualization. A simple linear regression regression does not fit the data. The relationship cannot be explained linearly.

Discussion

Limitation

There are a lot of missing values in the dataset. This definitely impact our analysis since we have to drop a significant amount of NA values. However, it is understandable that in real world, data collection is far from perfect, many data cannot be collected due to privacy and other reasons. Also, the time span for this dataset is from January to June, although it is sufficient enough, we would like to have a broader time period to study. Also, since this dataset only records COVID in South Korea, it is not representative enough to generalize our observations to other countries in the world.

Impact

Our analysis shows the most affected group of people who has a higher chance of catching COVID, higher chance of dying due to COVID. Therefore, those group of people should take extra precaution dealing with COVID. Also, our analysis shows the trend, and massive infection events, it is useful for the infection experts to study how to prevent COVID from spreading. The trend of gross should also be informative for local CDC to properly allocate medical resources to different areas.