Depending on how you access our project, some figures may not be displayed properly:
In this project, we perform in-depth EDA(Exploratory Data Analysis) on a collection of data sets about COVID-19 in South Korea. Our goal is to use these data to better understand COVID and learn how to prevent it effectively to save lives.
We answered the following questions by performing in-depth data visualization, and regression analysis.
Our project utilizes the following libraries, please make sure they are installed by using the install.packages() command.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.2 ✓ dplyr 1.0.6
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
library(gganimate)
library(tibbletime)
##
## Attaching package: 'tibbletime'
## The following object is masked from 'package:stats':
##
## filter
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(htmltools)
library(leaflet)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
There are many COVID datasets available on Kaggle. However, most of them only have information on date and the number of confirmed cases. This South Korea COVID dataset, on the other hand, contains comprehensive information on cases, patient info, weather, geographical location, contact tracing, etc. Therefore, it is ideal for this project.
The data is collected from KCDC (Korea Centers for Disease Control & Prevention). The data set is available on Kaggle. The following code snippet loads the dataset into R workspace, the best way to view the data is through R.
# Read Datasets, treat empty string value as NA
case <- read.csv('./Data/Case.csv', na.strings = c("", "NA"))
patient <- read.csv('./Data/PatientInfo.csv', na.strings = c("", "NA"))
policy <- read.csv('./Data/Policy.csv', na.strings = c("", "NA"))
region <- read.csv('./Data/Region.csv', na.strings = c("", "NA"))
searchtrend <- read.csv('./Data/SearchTrend.csv', na.strings = c("", "NA"))
seoulfloating <- read.csv('./Data/SeoulFloating.csv', na.strings = c("", "NA"))
time <- read.csv('./Data/Time.csv', na.strings = c("", "NA"))
timeage <- read.csv('./Data/TimeAge.csv', na.strings = c("", "NA"))
timegender <- read.csv('./Data/TimeGender.csv', na.strings = c("", "NA"))
timeprovince <- read.csv('./Data/TimeProvince.csv', na.strings = c("", "NA"))
weather <- read.csv('./Data/Weather.csv', na.strings = c("", "NA"))
route <- read.csv('./Data/routes.csv', na.strings = c("", "NA"))
patient_1 <- read.csv('./Data/patients.csv', na.strings = c("", "NA"))
In this section, we will be plotting the the confirmed, released, and deceased cases of COVID-19 in South Korea against time. These trends would help us or other researchers with building math models and identify key features that could potentially lowered the growth rate of the confirmed or deceased cases throughout time.
We plotted the daily confirmed cases by date. We also used geom_smooth function to plot the trend for better visualization.
# Filter out empty value for confirmed date
patient_dates <- patient_1 %>%
filter(!is.na(confirmed_date))
# Plot confirmed cases for each day
patient_dates %>%
group_by(confirmed_date) %>%
summarise(n = n()) %>%
ggplot(aes(as.Date(confirmed_date, format = "%m/%d/%Y"), n))+
geom_line(size = 2, alpha = 0.8, col="red")+
geom_point(size = 5, alpha = 1, col="purple")+
labs(title = "Daily Confirmed Cases of COVID-19 in South Korea by Date",
subtitle = paste0("Total Confirmed Cases: ", nrow(patient)),
caption = "Green Line Indicates the Trend") +
scale_x_date(date_labels = "%b %d", date_breaks = "2 days") +
theme_light() +
theme(plot.title = element_text(face = "bold",
hjust = 0.5, size = 18, color = "black"),
plot.subtitle = element_text(face = "bold",
size = 12, color = "blue")) +
ggrepel::geom_label_repel(aes(label=n), hjust=1.5, vjust=0.9,
fill ="orange") +
geom_smooth(method = "gam", se = TRUE, alpha = 0.5, col = "green")
Analysis:
As shown in the figure, marked by the green line, we can see that there is a huge surge from late Feburary to Early March in South Korea.
After plotting the daily confirmed cases trend, we are interested in learning the trend for total confirmed, deceased, and released cases since the start of the pandemic. The blue line marked the total confirmed cases, the red line marked the total deceased cases, and the green line marked the total released cases.
# Filter out empty value for confirmed date
time_data <- time
# Plot confirmed cases for each day
ggplot () +
geom_point(data = time_data, aes(x = as.Date(date, format = "%Y-%m-%d"), y = confirmed), color = "blue") +
geom_point(data = time_data, aes(x = as.Date(date, format = "%Y-%m-%d"), y = deceased), color = "red") +
geom_point(data = time_data, aes(x = as.Date(date, format = "%Y-%m-%d"), y = released), color = "green") +
scale_x_date(date_labels = "%b %d", date_breaks = "30 days") +
labs(title = "Total Confirmed, Deased, Released Cases by Date",
caption = "Blue: Confirmed, Red: Deceased, Green: Released")
Analysis:
As shown in the figure, we can see that the total deceased cases trend is relatively flat, while both the total confirmed and total released cases has a huge surge around late Feburary. This matches the observation from previous fiture, where there is a huge surge in confirmed cases around late Feburary to early March.
In this section, we will be dividing the patients into age and gender groups, and compare the different statistics among them. These comparisons could help normal individuals to be more aware of their risk of having COVID-19, and medical professionals to better care for patients in different groups.
# Calculate Age for each Confirmed Case
patient_age <- patient_1
patient_age$age = 2020 - patient_1$birth_year
# Calculate Age Groups for Confirmed Cases by 10
patient_age$age_group = cut(patient_age$age,
breaks = seq(0,90, by=10),
right = FALSE,
labels = c("0-10","10-20","20-30",
"30-40","40-50","50-60",
"60-70","70-80","80-90"))
patient_age <- patient_age %>%
select(age_group) %>%
na.omit() %>%
group_by(age_group) %>%
summarise(confirmed_cases = n())
ggplot(patient_age, aes(y = age_group, x = confirmed_cases, fill = age_group)) +
geom_bar(stat="identity") +
xlab("Confirmed Cases") +
ylab("Age Group") +
ggtitle(label = "Confirmed Cases by Age Group")
Analysis:
As shown in the figure, the most affected age group is 20-30, the second most affected group is 50-60.
patient_age <- patient
patient_age$contact_number[patient_age$contact_number == "-"] = NA
patient_age_c = patient_age[!is.na(patient_age$contact_number),]
patient_age_c$contact_number = as.numeric(patient_age_c$contact_number)
patient_age = aggregate(patient_age_c$contact_number, by = list(Category = patient_age_c$age), FUN = mean)
ggplot(patient_age, aes(y = Category, x = x, fill = Category)) +
geom_bar(stat="identity") +
xlab("Contact Number") +
ylab("Age Group") +
ggtitle(label = "Average Contact Number by Age Group")
Analysis:
As shown in the figure, On average, people in 20s-50s, and 70s have contact number above 30.
# Calculate Age for each Confirmed Case
patient_age <- patient_1
patient_age$age = 2020 - patient_1$birth_year
# Calculate Age Groups for Confirmed Cases by 10
patient_age$age_group = cut(patient_age$age,
breaks = seq(0,90, by=10),
right = FALSE,
labels = c("0-10","10-20","20-30",
"30-40","40-50","50-60",
"60-70","70-80","80-90"))
patient_age = patient_age[!is.na(patient_age$deceased_date),]
patient_age <- patient_age %>%
select(age_group) %>%
na.omit() %>%
group_by(age_group) %>%
summarise(deceased_cases = n())
ggplot(patient_age, aes(y = age_group, x = deceased_cases, fill = age_group)) +
geom_bar(stat="identity") +
xlab("deceased Cases") +
ylab("Age Group") +
ggtitle(label = "Deceased Cases by Age Group")
Analysis:
As shown in the figure, the most affected age group is 80-90.
# Group number of patients by province
patient_province <- patient
patient_province <- patient_province %>%
select(province) %>%
na.omit()%>%
group_by(province)%>%
summarise(n=n())
ggplot(patient_province, aes(y=n, x=reorder(province,n), fill=province)) +
geom_bar(stat="identity") +
coord_flip() +
xlab("Province Name") +
ylab("Confirmed Cases") +
ggtitle(label = "Confirmed Cases in Each Province")
Analysis:
As shown in the figure, Seoul is the most affected province, with Gyeongsangbuk-do and Gyeonggi-do close to Seoul.
patient_gender <- patient
patient_gender <- patient_gender %>%
select(sex) %>%
na.omit() %>%
group_by(sex) %>%
summarise(patient_num = n())
ggplot(patient_gender,
aes(x="",y=patient_num,fill=sex)) +
geom_bar(width=1, stat="identity",color="white") +
ggtitle(label = "Confirmed Cases Gender Ratio") +
coord_polar("y",start=0) +
geom_text(aes(label=paste0(round(patient_num/ sum(patient_num)*100,1),"%")),
position=position_stack(vjust=0.5)) +
theme_void() +
theme(plot.title = element_text(size = 30, face = "bold"))
Analysis:
As shown in the figure, COVID affects females more than male.
patient_gender <- patient
patient_gender = patient_gender[!is.na(patient_gender$deceased_date),]
patient_gender <- patient_gender %>%
select(sex) %>%
na.omit() %>%
group_by(sex) %>%
summarise(patient_num = n())
ggplot(patient_gender,
aes(x="",y=patient_num,fill=sex)) +
geom_bar(width=1, stat="identity",color="white") +
ggtitle(label = "Deceased Gender Ratio") +
coord_polar("y",start=0) +
geom_text(aes(label=paste0(round(patient_num/ sum(patient_num)*100,1),"%")),
position=position_stack(vjust=0.5)) +
theme_void() +
theme(plot.title = element_text(size = 30, face = "bold"))
Analysis:
As shown in the figure, male has a much higher mortality rate than female.
patient_gender <- patient
patient_gender_d = patient_gender[!is.na(patient_gender$deceased_date),]
patient_gender <- patient_gender %>%
select(sex) %>%
na.omit() %>%
group_by(sex) %>%
summarise(patient_num = n())
patient_gender_d <- patient_gender_d %>%
select(sex) %>%
na.omit() %>%
group_by(sex) %>%
summarise(patient_num = n())
patient_gender_d[2] = patient_gender_d[2]/ patient_gender[2] * 100
ggplot(patient_gender_d, aes(x = sex, y = patient_num, fill = sex)) +
geom_bar(stat="identity") +
xlab("mortality rate") +
ylab("Gender") +
ggtitle(label = "Mortality Rate by Gender")
patient_age_province <- patient_1
patient_age_province$age = 2020 - patient_1$birth_year
patient_age_province$age_group = cut(patient_age_province$age,
breaks = seq(0,90, by=10),
right = FALSE,
labels = c("0-10","10-20","20-30",
"30-40","40-50","50-60",
"60-70","70-80","80-90"))
patient_age_province %>%
select(age_group, province) %>%
na.omit() %>%
group_by(age_group, province) %>%
summarise(Count = n()) %>%
ggplot() +
geom_tile(aes(x = age_group, y = province, fill = Count)) +
geom_vline(xintercept = seq(0.5,9.5,by = 1), linetype = 'dashed') +
geom_hline(yintercept = seq(0.5,14.5,by = 1), linetype = 'dashed') +
scale_fill_gradientn(colours = c("green","red"),
values = c(0,0.1,1)) +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
theme_classic() +
theme(text = element_text(size = 15, face = "bold"),
axis.text.x = element_text(angle = 90),
legend.position = "top",
legend.key.width = unit(3,"cm"))
Analysis:
As shown in the figure, the most affected group is age from 30-40, living in capital area.
patient_age_province_gender <- patient_age_province
patient_age_province_gender %>%
select(sex, age_group, province) %>%
na.omit() %>%
group_by(sex, age_group, province) %>%
summarise(Count = n()) %>%
ggplot() +
geom_tile(aes(x = age_group, y = province, fill = Count)) +
geom_vline(xintercept = seq(0.5,9.5,by = 1), linetype = 'dashed') +
geom_hline(yintercept = seq(0.5,14.5,by = 1), linetype = 'dashed') +
scale_fill_gradientn(colours = c("green","red"),
values = c(0,0.1,1)) +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
theme_classic() +
theme(text = element_text(size = 15, face = "bold"),
axis.text.x = element_text(angle = 90),
legend.position = "top",
legend.key.width = unit(3,"cm")) +
facet_wrap(~sex)
Analysis:
As shown in the figure, the most affected group is age from 30-40, male, living in capital area.
In this section we visualize the infection events using the Google Map API. The following figures illustrate the mass infection locations in South Korea. These illustrations could help local disease control centers to better prevent Covid-19 infection spikes.
We want to mark the infection events along with their infection numbers on a map for better geological visualization. We used ggmap function to create a base map of South Korea, then plot each infection events with different size and color to mark its infection number.
case$longitude <- as.numeric(as.character(case$longitude))
## Warning: NAs introduced by coercion
case$latitude <- as.numeric(as.character(case$latitude))
## Warning: NAs introduced by coercion
# Set Google API key
register_google(key = "AIzaSyCiHpBw6DsGxXs6pgfgQTd2tP4wAvzrmJI")
# Get Map from Google Maps API
map = get_googlemap(center = c(lon = 127.7669, lat = 35.9087),
zoom = 7, scale = 4,
maptype = 'roadmap',
color = 'color')
figure <- ggmap(map) +
theme_void() +
ggtitle("Number of Confrimed Cases by Case ID in South Korea") +
scale_color_gradientn(colors = rainbow(5)) +
theme(plot.title = element_text(color = "blue"),
panel.border = element_rect(color = "grey", fill = NA, size = 2))
figure + geom_point(data = case,
aes(longitude, latitude, size= confirmed,
color=confirmed, group=confirmed))
## Warning: Removed 109 rows containing missing values (geom_point).
Analysis:
As shown in the figure, the area around Seoul have a large number of small infection events. There is a massive infection event in the city of Nam-gu, Daegu province. This massive infection event is at Shincheonji Church, which infected 4511 people.
After creating the plot above, we are interested in learning how those infection events evolve over time. We combined the case dataframe with timprovince dataframe to get a dataframe with both geological and time infromation. Then, we use gganimate to create a gif that shows how the infection events evolve over time.
# Get a data frame that contains both time and coordinates
case_bak <- subset(case, select = c(province, longitude, latitude))
geo_time <- merge(case_bak, timeprovince, by="province")
geo_time$date <- as.Date(geo_time$date, format = "%Y-%m-%d")
# Set Google API key
register_google(key = "AIzaSyCiHpBw6DsGxXs6pgfgQTd2tP4wAvzrmJI")
# Get Map from Google Maps API
map = get_googlemap(center = c(lon = 127.7669, lat = 35.9087),
zoom = 7, scale = 4,
maptype = 'roadmap',
color = 'color')
figure <- ggmap(map) +
theme_void() +
ggtitle("Number of Confrimed Cases by Case ID in South Korea Over Time") +
scale_color_gradientn(colors = rainbow(5)) +
theme(plot.title = element_text(color = "blue"),
panel.border = element_rect(color = "grey", fill = NA, size = 2))
figure + geom_point(data = geo_time,
aes(longitude, latitude, size= confirmed,
color=confirmed, group=confirmed)) +
transition_time(date)
Analysis:
The following animation shows the number of confirmed cases by location increasing over time. The figure shows that Covid-19 spreads most rapidly in Seoul, Incheon, and Daegu. However, the spread of Covid-19 in Busan, the second largest city in South Korea, was low in comparison. Future work investigating the different treatments of disease control in Busan and other large cities could give insight in how to reduce the spread in large populations.
To take the visualization to another level, we created an interactive map that let users view the massive infection events along with patient info by using leaflet widget.
# Known Group Infection Events
group_infec <- tibble(lon = c(127.0518049,126.9165592,127.1217472,128.7368285,129.0771404,128.4941047,128.5707814,128.5665743,128.9099632),
lat = c(37.2367708,37.6338843,37.3881528,35.6486172,35.2159947,36.0579173,35.8507155,35.8397168,36.9273528),
address = c('650-103 mangpo-dong, yeongtong-gu, suwon, gyeonggi-do, south korea','1021 tongil-ro, jingwan-dong, eunpyeong-gu, seoul, south korea','20 seohyeon-ro 180(baekpalsip)beo, seohyeon-dong, bundang-gu, seongnam-si, gyeonggi-do, south korea','100-1 beomgok-ri, hwayang-eup, cheongdo, gyeongsangbuk-do, south korea','436-8 oncheon 1(il)-dong, dongnae-gu, busan, south korea','918 yuhak-ro, gasan-myeon, chilgok-gun, gyeongsangbuk-do, south korea','72-10 seongdang-dong, dalseo-gu, daegu, south korea','81 daemyeong-ro, daemyeong 10(sip)-dong, nam-gu, daegu, south korea','916 soro-ri, chunyang-myeon, bonghwa-gun, gyeongsangbuk-do, south korea'),
korean_name = c('수원 생명샘교회','은평 성모병원','분당 제생병원','청도 대남병원','부산 온천교회','칠곡 밀알 사랑의집','대구 한마음아파트','대구 신천지 교회','봉화 푸른 요양원'),
english_name = c('Soowon Saengmyungsame Church','Eunpyung Sungmo Hospital','Boondang Jesaeng Hospital','Chungdo Daenam Hospital','Busan Onchun Church','Chilgok Milal Sarang House','Daegu Hanmaeum Apartment','Daegu Sinchunji Church','Bonghwa Pureun Clinic'),
n_of_infections= c(10,14,11,118,34,22,46,4482,51))
# Pre-processing
group_infec_2 <- group_infec %>% select(lon,lat,korean_name,english_name,n_of_infections) %>%
rename(latitude2 = lat, longitude2 = lon, id = english_name) %>%
as.data.table()
group_infec_3 <- group_infec_2[, list(label=HTML(
paste(sep = "<br>",
paste("G.I(eng): ", id),
paste("G.I(kor): ", korean_name),
paste("Num of infects:", n_of_infections)))),
list(id, longitude2, latitude2)]
#Combining route and patient and pre-processing
agg_1 <- route %>% left_join(patient_1, by="global_id") %>%
mutate(infected_by = ifelse(is.na(infected_by),"Unk",infected_by)) %>%
as.data.table
agg_2 <- agg_1[, list(label=HTML(
paste(sep="<br>",
paste("Patient Id :", global_id),
paste("Sex :", sex),
paste("Country :", country),
paste("Disease :", disease),
paste("Group:", group),
paste("Infection_reason:", infection_reason),
paste("Infection_order:", infection_order),
paste("Infected_by:", infected_by),
paste("Contact_number:", contact_number),
paste("Confirmed_date:", confirmed_date),
paste("State:", state)))),
list(global_id, longitude, latitude)]
listed <- list(agg_2, group_infec_3)
agg_3 <- rbindlist(listed, use.names = TRUE, fill = TRUE)
agg_3 %>%
leaflet() %>%
setView(128.5, 36.5, zoom = 7) %>%
addTiles() %>%
addProviderTiles("CartoDB.Positron") %>%
#addProviderTiles : Stamen.Toner, CartoDB.Positron, Esri.NatGeoWorldMap, MtbMap
addCircleMarkers(
lat = ~ latitude,
lng = ~ longitude,
label = ~label,
clusterOptions = markerClusterOptions(),
labelOptions = labelOptions(noHide = F,
direction = "left",
style = list(
"color" = "black",
"box-shadow" = "4px 4px rgba(0,0,0,0.25)",
"font-size" = "13px",
"border-color" = "rgba(0.5,3,0.5,0.5)"))) %>%
#Addding group infections
addMarkers(
lat = ~ latitude2,
lng = ~ longitude2,
label = ~label,
labelOptions = labelOptions(noHide = F,
direction = "right",
style = list(
"color" = "red",
"font-size" = "13px")))
## Warning in validateCoords(lng, lat, funcName): Data contains 338 rows with
## either missing or invalid lat/lon values and will be ignored
## Warning in validateCoords(lng, lat, funcName): Data contains 2050 rows with
## either missing or invalid lat/lon values and will be ignored
Analysis: This figure shows the COVID-19 contact locations for patients who reported where they were exposed. Community spread in residential areas is common. There are several locations that have large numbers of infections. Knowing what types of locations tend to have large COVID-19 infection events could help the CDC to identify particular community events according to the building type and come up with better policies.
We are interested in learning how COVID affect the web search trend. We plotted four different search word (cold, coronavirus, flue, pneumonia) along with confirmed, deceased and released cases over time. We also use a linear model to predict confirmed cases with different search trend.
time_case <- read.csv("./Data/TimeProvince.csv", header = TRUE, stringsAsFactor =FALSE) %>%
gather(key = "Cumulative_case",value="number", -c(1:3)) %>%
group_by(date, Cumulative_case) %>% summarise(cum_number = sum(number))
search_trend <- read.csv("./Data/SearchTrend.csv", header = TRUE, stringsAsFactor =FALSE) %>%
gather(key="Search_word", value="SearchTrend", -date) %>%
filter(date >= "2020-01-20")
search_and_case <- search_trend %>% left_join(time_case, by = 'date')
options(repr.plot.width = 20, repr.plot.height = 10)
search_and_case %>%
ggplot(aes(as.Date(date, format = "%Y-%m-%d"), cum_number, group = Cumulative_case, color= Cumulative_case)) +
geom_point() +
geom_line(linetype = 'dashed') +
theme(axis.text.x = element_text(angle = 75, hjust = 1),
legend.text=element_text(size=20),
legend.title=element_text(size=20),
axis.title.x = element_text(color="black", size=20, face="bold"),
axis.title.y = element_text(color="black", size=20, face="bold")) +
geom_point(aes(as.Date(date, format = "%Y-%m-%d"), SearchTrend, group=Search_word, shape=Search_word), color='black', size=3) +
geom_line(aes(as.Date(date, format = "%Y-%m-%d"), SearchTrend, group=Search_word), color='black') +
scale_y_log10() +
scale_x_date(date_labels = "%b %d", date_breaks = "30 days") +
ylab("Search Volume or Cumulative Cases in log scale") +
ggtitle("Growth of COVID-19 cases and Web Search in South Korea")
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Transformation introduced infinite values in continuous y-axis
temp_trend <- read.csv("./Data/SearchTrend.csv", header = TRUE, stringsAsFactor =FALSE) %>%
gather(key="Search_word", value="SearchTrend", -date) %>%
filter(date >= "2020-01-20") %>%
filter(Search_word == "coronavirus")
st_covid = subset(search_trend, select = -c(Search_word))
names(st_covid)[2] <- "covid"
temp_trend <- read.csv("./Data/SearchTrend.csv", header = TRUE, stringsAsFactor =FALSE) %>%
gather(key="Search_word", value="SearchTrend", -date) %>%
filter(date >= "2020-01-20") %>%
filter(Search_word == "cold")
st_cold = subset(search_trend, select = -c(Search_word))
names(st_cold)[2] <- "cold"
temp_trend <- read.csv("./Data/SearchTrend.csv", header = TRUE, stringsAsFactor =FALSE) %>%
gather(key="Search_word", value="SearchTrend", -date) %>%
filter(date >= "2020-01-20") %>%
filter(Search_word == "flu")
st_flu = subset(search_trend, select = -c(Search_word))
names(st_flu)[2] <- "flu"
temp_trend <- read.csv("./Data/SearchTrend.csv", header = TRUE, stringsAsFactor =FALSE) %>%
gather(key="Search_word", value="SearchTrend", -date) %>%
filter(date >= "2020-01-20") %>%
filter(Search_word == "pneumonia")
st_pneumonia = subset(search_trend, select = -c(Search_word))
names(st_pneumonia)[2] <- "pneumonia"
st <- merge(st_pneumonia, st_flu, by="date")
st <- merge(st, st_cold, by="date")
st <- merge(st, st_covid, by="date")
t <- time %>% select(date,confirmed)
t <- merge (st, t, by = "date")
fit_lm = lm(confirmed~ . - date, data = t)
summary (fit_lm)
##
## Call:
## lm(formula = confirmed ~ . - date, data = t)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9238 -1128 1659 2404 21255
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9294.732 21.676 428.80 <2e-16 ***
## pneumonia -76.372 1.422 -53.69 <2e-16 ***
## flu -76.372 1.422 -53.69 <2e-16 ***
## cold -76.372 1.422 -53.69 <2e-16 ***
## covid -76.372 1.422 -53.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3838 on 41467 degrees of freedom
## Multiple R-squared: 0.3186, Adjusted R-squared: 0.3185
## F-statistic: 4847 on 4 and 41467 DF, p-value: < 2.2e-16
Analysis:
This plot shows that the search for “cold”, “coronavirus”, “flu” and “pneumonia” skyrocketed as confirmed cases of COVID-19 went up, and as the growth rate of the cases decreases, the search trends also decrease. This is particularly helpfully because it shows that the general public’s awareness of the virus is related to its spread. In addition, we used a linear fit to predict confirmed cases with different search trend. In time of pandemic, common sickness’ search trends have similar weight in predicting confirmed cases.
During the early stage of COVID, there is a rumor saying that COVID will go away as temperature warms up. We are interested in finding if there is a correlation between temperature and confirmed COVID cases. We performed a simple linear regression on temperature and confirmed cases for the period between 1/20 - 3/20.
w <- weather %>% group_by(date) %>% summarize(mean(avg_temp))
colnames(w) <- c("date","temperature")
t <- time %>% select(date,confirmed)
wt <- merge(w, t, by="date")
fit_lm <- lm(confirmed ~ . -date, data=wt)
summary(fit_lm)
##
## Call:
## lm(formula = confirmed ~ . - date, data = wt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6752.2 -1919.3 90.2 2364.6 5350.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1140.75 446.24 2.556 0.0115 *
## temperature 528.60 30.87 17.123 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2780 on 160 degrees of freedom
## Multiple R-squared: 0.647, Adjusted R-squared: 0.6448
## F-statistic: 293.2 on 1 and 160 DF, p-value: < 2.2e-16
Analysis: As the summary shows, the p-value is very small (<2.2e-16), we can reject the null hypothesis. Therefore, there is a relationship between temperature and the number of confirmed cases.
wt <- head(wt, 61)
ggplot(wt, aes(x=temperature, y=confirmed)) +
geom_point(aes(color=date)) +
stat_smooth(method='lm') +
stat_smooth(method ='glm')+
labs(title = "Temperature and Number of Confirmed Cases", x = "Temp", y= "Case")
Analysis:
We select data from 1-20 to 3-20 for clear visualization. A simple linear regression regression does not fit the data. The relationship cannot be explained linearly.
There are a lot of missing values in the dataset. This definitely impact our analysis since we have to drop a significant amount of NA values. However, it is understandable that in real world, data collection is far from perfect, many data cannot be collected due to privacy and other reasons. Also, the time span for this dataset is from January to June, although it is sufficient enough, we would like to have a broader time period to study. Also, since this dataset only records COVID in South Korea, it is not representative enough to generalize our observations to other countries in the world.
Our analysis shows the most affected group of people who has a higher chance of catching COVID, higher chance of dying due to COVID. Therefore, those group of people should take extra precaution dealing with COVID. Also, our analysis shows the trend, and massive infection events, it is useful for the infection experts to study how to prevent COVID from spreading. The trend of gross should also be informative for local CDC to properly allocate medical resources to different areas.