Pandemics are not a new phenomenon; they have recurred in different periods since the beginning of civilization. Still, few would have thought that we would face another major pandemic a century after the Spanish flu of 1918. There have been epidemics since then, such as Ebola, but Ebola was not as contagious as COVID-19, which first appeared in Wuhan, China, in December 2019. This project aims to explore and visualize how COVID-19 has affected the world overall. The data has been taken from https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset. In this project, I will visualize how different countries have been affected by COVID-19 and which symptoms appear most often in affected patients.
The data comes from Johns Hopkins University and is published on Kaggle at the link given above. It consists of 9 CSV files. I will focus on the datasets that answer my research questions, although I have opened all of them for data exploration. I will explore each dataset and clean it as required.
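# Reading all files together
library(tidyverse) # read_csv, dplyr, tidyr, stringr and ggplot2 are used throughout
# list.files() takes a regular expression, so the .csv extension is matched explicitly
files <- list.files("C:/Users/hukha/Desktop/coronavirus dataset", pattern = "\\.csv$", full.names = TRUE)
coronavirus_dataset <- sapply(files, read_csv, simplify = FALSE)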
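Because the list is keyed by full file path, every later access needs a long backtick-quoted name. A minimal sketch of one alternative names each element after its file, without path or extension. The temporary folder and the two toy files are made up for the demonstration, and base R's `read.csv` stands in for `read_csv` so the snippet runs without extra packages.

```r
# Hypothetical demo folder with two small CSV files (not the Kaggle data)
demo_dir <- file.path(tempdir(), "coronavirus_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(SNo = 1:2, Confirmed = c(1, 14)),
          file.path(demo_dir, "covid_19_data.csv"), row.names = FALSE)
write.csv(data.frame(ID = 1:3, symptoms = c("fever", NA, "cough")),
          file.path(demo_dir, "COVID19_open_line_list.csv"), row.names = FALSE)

# list.files() expects a regular expression, so anchor on the extension
demo_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file and name each list element after its file (no path, no extension)
demo_dataset <- lapply(demo_files, read.csv, stringsAsFactors = FALSE)
names(demo_dataset) <- tools::file_path_sans_ext(basename(demo_files))

names(demo_dataset)
nrow(demo_dataset$covid_19_data)
```

With names like these, a table can be reached as `demo_dataset$covid_19_data` instead of through its full path.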
The first dataset contains the core figures: the number of patients confirmed, dead and recovered in each country on specific dates. The data was not in a tidy format, and I had to convert the dates into a proper Date format to use them later. The required columns had some missing values, which were replaced to avoid bias in the analysis. I used the DT package to display the data for exploration.
# First dataset
# Let's look at the data structure first to see what needs to be done to make the data tidy
#head(coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`) %>% kable() %>% kable_styling()
#glimpse(coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`)
# Previewing the data; "ObservationDate" is still a character column here and is converted to Date class after the preview
coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv` %>% head()
## # A tibble: 6 x 8
## SNo ObservationDate `Province/State` `Country/Region` `Last Update`
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 01/22/2020 Anhui Mainland China 1/22/2020 17~
## 2 2 01/22/2020 Beijing Mainland China 1/22/2020 17~
## 3 3 01/22/2020 Chongqing Mainland China 1/22/2020 17~
## 4 4 01/22/2020 Fujian Mainland China 1/22/2020 17~
## 5 5 01/22/2020 Gansu Mainland China 1/22/2020 17~
## 6 6 01/22/2020 Guangdong Mainland China 1/22/2020 17~
## # ... with 3 more variables: Confirmed <dbl>, Deaths <dbl>,
## # Recovered <dbl>
coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`$ObservationDate <- as.Date(coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`$ObservationDate, format= "%m/%d/%Y")
#coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`$`Last Update` <- as.Date(coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`$`Last Update`, format= "%m/%d/%y")
# Looking for missing values
colSums(is.na(coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`)) # Only State has missing values so we have to change the missing values with "Unknown"
## SNo ObservationDate Province/State Country/Region
## 0 0 12195 0
## Last Update Confirmed Deaths Recovered
## 0 0 0 0
# Replacing missing values for States
coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`$`Province/State` <- str_replace_na(coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`$`Province/State`, replacement= "Unknown")
# Creating copy of file locally
covid_19_data <- coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`
#sample_n(covid_19_data, 10) %>% kable() %>% kable_styling()
#covid_19_data %>% datatable
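The date strings in these files come in two styles, which is why the format strings differ between datasets: `%Y` expects a four-digit year and `%y` a two-digit one. The base R sketch below, on made-up values rather than rows from the dataset, shows both conversions, one pitfall worth knowing, and a base R way to replace missing province names.

```r
# Toy values in the two date styles that appear in the files
obs_dates <- c("01/22/2020", "02/01/2020")  # four-digit year, as in ObservationDate
col_dates <- c("1/22/20", "2/1/20")         # two-digit year, as in the time-series headers

parsed_obs <- as.Date(obs_dates, format = "%m/%d/%Y")
parsed_col <- as.Date(col_dates, format = "%m/%d/%y")
identical(parsed_obs, parsed_col)  # both styles give the same Date values

# Pitfall: %y reads only two digits and trailing characters are ignored,
# so a four-digit year parsed with %y is silently truncated ("2019" is read as "20")
truncated <- as.Date("12/31/2019", format = "%m/%d/%y")
truncated

# Replacing missing province names with a placeholder (base R analogue of str_replace_na)
province <- c("Anhui", NA, "Beijing")
province[is.na(province)] <- "Unknown"
province
```

Checking the range of the parsed dates after conversion is a cheap way to catch a wrong year code.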
The next dataset contains the confirmed cases throughout the world, along with the longitude and latitude needed to plot the data on a map. Once the missing Province/State values were replaced with "Unknown", no missing values remained in this dataset, but the data was in wide format, so I used tidyr’s gather function to convert it into long format. I have printed a few random rows from the data for exploration. The next two datasets, covid19_deaths and covid19_recovered, had the same issues and were handled in the same way. All three are now tidy and clean.
# Confirmed Patients' Dataset
#glimpse(coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/time_series_covid_19_confirmed.csv`)
## Checking for missing values
#colSums(is.na(coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/time_series_covid_19_confirmed.csv`))
## Replacing missing values
coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/time_series_covid_19_confirmed.csv`$`Province/State` <-str_replace_na( coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/time_series_covid_19_confirmed.csv`$`Province/State`, replacement= "Unknown")
# Tidying the data by converting into long format and creating a copy
covid19_confirmed <- coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/time_series_covid_19_confirmed.csv` %>% gather(Date,
Confirmed_Cases, -c(`Province/State`, `Country/Region`, Long, Lat))
# Checking for missing values
colSums(is.na(covid19_confirmed))
## Province/State Country/Region Lat Long
## 0 0 0 0
## Date Confirmed_Cases
## 0 0
# Converting Date column into Date
covid19_confirmed$Date <- as.Date(covid19_confirmed$Date, format="%m/%d/%y")
#glimpse(covid19_confirmed)
sample_n(covid19_confirmed, 10) %>% kable() %>% kable_styling()
Province/State | Country/Region | Lat | Long | Date | Confirmed_Cases |
---|---|---|---|---|---|
Unknown | Mexico | 23.63450 | -102.55280 | 2020-02-11 | 0 |
Unknown | Mozambique | -18.66569 | 35.52956 | 2020-01-26 | 0 |
Unknown | Trinidad and Tobago | 10.69180 | -61.22250 | 2020-04-07 | 107 |
Unknown | Armenia | 40.06910 | 45.03820 | 2020-03-10 | 1 |
Unknown | Tajikistan | 38.86103 | 71.27609 | 2020-03-03 | 0 |
Unknown | Bahamas | 25.03430 | -77.39630 | 2020-02-22 | 0 |
Unknown | Eritrea | 15.17940 | 39.78230 | 2020-04-03 | 22 |
Unknown | Algeria | 28.03390 | 1.65960 | 2020-02-25 | 1 |
Sichuan | China | 30.61710 | 102.71030 | 2020-03-28 | 548 |
Unknown | Holy See | 41.90290 | 12.45340 | 2020-04-17 | 8 |
# Dataset - Deaths globally
#glimpse(coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/time_series_covid_19_deaths.csv`)
# Creating local file
covid19_deaths <- coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/time_series_covid_19_deaths.csv`
# Tidying the data; converting from wide to long
covid19_deaths <- covid19_deaths %>% gather(Date, Deaths, -c(`Province/State`, `Country/Region`, Lat, Long))
# Converting the data type into date
covid19_deaths$Date <- as.Date(covid19_deaths$Date, format="%m/%d/%y")
# replacing missing values
covid19_deaths$`Province/State` <- str_replace_na(covid19_deaths$`Province/State`, replacement="Unknown")
# Checking for structure and missing values
#glimpse(covid19_deaths)
#colSums(is.na(covid19_deaths))
sample_n(covid19_deaths, 10) %>% kable() %>% kable_styling()
Province/State | Country/Region | Lat | Long | Date | Deaths |
---|---|---|---|---|---|
Unknown | Burundi | -3.3731 | 29.9189 | 2020-04-17 | 1 |
Unknown | Thailand | 15.0000 | 101.0000 | 2020-03-06 | 1 |
Liaoning | China | 41.2956 | 122.6085 | 2020-02-04 | 0 |
Unknown | San Marino | 43.9424 | 12.4578 | 2020-03-25 | 21 |
Unknown | Maldives | 3.2028 | 73.2207 | 2020-01-24 | 0 |
Unknown | Romania | 45.9432 | 24.9668 | 2020-02-13 | 0 |
Unknown | Croatia | 45.1000 | 15.2000 | 2020-03-14 | 0 |
Northern Territory | Australia | -12.4634 | 130.8456 | 2020-04-29 | 0 |
Unknown | Guyana | 5.0000 | -58.7500 | 2020-03-14 | 1 |
Unknown | Mongolia | 46.8625 | 103.8467 | 2020-01-26 | 0 |
# Dataset - Recovered Patients
# creating local file
covid19_recovered <- coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/time_series_covid_19_recovered.csv`
# Tidying the data
covid19_recovered <- covid19_recovered %>% gather(Date, Recovered, -c(`Province/State`, `Country/Region`, Lat, Long))
# Data cleaning
covid19_recovered$Date <- as.Date(covid19_recovered$Date, format= "%m/%d/%y")
colSums(is.na(covid19_recovered))
## Province/State Country/Region Lat Long Date
## 20350 0 0 0 0
## Recovered
## 0
# replacing missing values
covid19_recovered$`Province/State` <- str_replace_na(covid19_recovered$`Province/State`, replacement="Unknown")
# Checking
glimpse(covid19_recovered)
## Observations: 27,720
## Variables: 6
## $ `Province/State` <chr> "Unknown", "Unknown", "Unknown", "Unknown", "...
## $ `Country/Region` <chr> "Afghanistan", "Albania", "Algeria", "Andorra...
## $ Lat <dbl> 33.0000, 41.1533, 28.0339, 42.5063, -11.2027,...
## $ Long <dbl> 65.0000, 20.1683, 1.6596, 1.5218, 17.8739, -6...
## $ Date <date> 2020-01-22, 2020-01-22, 2020-01-22, 2020-01-...
## $ Recovered <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
colSums(is.na(covid19_recovered))
## Province/State Country/Region Lat Long Date
## 0 0 0 0 0
## Recovered
## 0
sample_n(covid19_recovered, 10) %>% kable() %>% kable_styling()
Province/State | Country/Region | Lat | Long | Date | Recovered |
---|---|---|---|---|---|
Unknown | Somalia | 5.1521 | 46.1996 | 2020-03-26 | 0 |
Unknown | Switzerland | 46.8182 | 8.2275 | 2020-03-14 | 4 |
Unknown | Paraguay | -23.4425 | -58.4438 | 2020-02-18 | 0 |
Unknown | Senegal | 14.4974 | -14.4524 | 2020-03-30 | 27 |
Unknown | Germany | 51.0000 | 9.0000 | 2020-04-15 | 72600 |
Unknown | Gambia | 13.4432 | -15.3101 | 2020-01-26 | 0 |
British Virgin Islands | United Kingdom | 18.4207 | -64.6400 | 2020-02-20 | 0 |
Unknown | Nigeria | 9.0820 | 8.6753 | 2020-04-26 | 239 |
Unknown | Denmark | 56.2639 | 9.5018 | 2020-02-15 | 0 |
Unknown | Chile | -35.6751 | -71.5430 | 2020-04-10 | 1571 |
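The three time-series tables go through the same steps: reshape from wide to long, parse the date, and fill in missing province names. A sketch of how those steps could be wrapped in one helper is shown below. The function name `tidy_time_series` and the toy table are hypothetical, and `pivot_longer()` is used in place of `gather()` because tidyr documents it as the successor to `gather()`.

```r
library(tidyr)

# Hypothetical helper: reshape one wide time-series table and clean it
tidy_time_series <- function(wide, value_name) {
  long <- pivot_longer(
    wide,
    cols = -c(`Province/State`, `Country/Region`, Lat, Long),
    names_to = "Date",
    values_to = value_name
  )
  # Column headers such as "1/22/20" use a two-digit year
  long$Date <- as.Date(long$Date, format = "%m/%d/%y")
  # Countries reported as a whole have no province
  long$`Province/State`[is.na(long$`Province/State`)] <- "Unknown"
  long
}

# Made-up wide table with two locations and two date columns
toy_wide <- data.frame(
  `Province/State` = c(NA, "Sichuan"),
  `Country/Region` = c("Mexico", "China"),
  Lat = c(23.6, 30.6),
  Long = c(-102.6, 102.7),
  `1/22/20` = c(0, 5),
  `1/23/20` = c(0, 8),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

deaths_long <- tidy_time_series(toy_wide, "Deaths")
deaths_long
```

The same call with `"Confirmed_Cases"` or `"Recovered"` as the value name would cover the other two tables, assuming they share the four identifying columns.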
The dataset below is still untidy, but I need only the “symptoms” column from it for the text mining analysis. I will do the necessary cleaning while conducting the analysis.
# Dataset for symptoms
covid19_updated <- coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/COVID19_open_line_list.csv`
In this section, I explore the data through visualization.
COVID-19 is spreading very quickly. According to Johns Hopkins University, its contagiousness, expressed as a reproduction number, is about 2, while the Spanish flu, which killed millions of people globally, had a figure of about 1.8. From the beginning of the pandemic up to 5 May 2020, almost 4 million people worldwide have contracted the coronavirus. Unfortunately, the mortality rate for COVID-19 is 4.5%, whereas the Spanish flu had a mortality rate of 2.5%. The source data shows that the disease is spreading very quickly throughout the world, and so far it has killed almost 250,000 people. Around 500,000 people have recovered globally.
The comparison with the Spanish flu is taken from https://www.cnbc.com/2020/03/26/coronavirus-may-be-deadlier-than-1918-flu-heres-how-it-stacks-up-to-other-pandemics.html.
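The rounded totals quoted above allow a quick arithmetic check. The sketch below computes a crude case fatality ratio (deaths divided by confirmed cases) and the share of recovered patients. It is an approximation on rounded inputs, and it need not match the 4.5% from the article, which was published in March 2020 and predates the May totals.

```r
# Rounded totals quoted in the text (as of 5 May 2020)
confirmed <- 4e6
deaths    <- 250000
recovered <- 500000

crude_cfr       <- deaths / confirmed * 100     # deaths per 100 confirmed cases
recovered_share <- recovered / confirmed * 100  # recovered per 100 confirmed cases

round(c(crude_cfr = crude_cfr, recovered_share = recovered_share), 2)
```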
# Creating line plot to see how coronavirus affected
covid_19_data %>%
gather(Type,
Freq2,
-c(SNo, ObservationDate, `Province/State`, `Country/Region`, `Last Update`)) %>%
select(ObservationDate, Type, Freq2) %>%
group_by(ObservationDate, Type) %>%
summarise(n= sum(Freq2)) %>%
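ggplot() +
geom_line(aes(ObservationDate, n, color = Type), size = 1) +
theme_classic() +
# theme() adjusts this plot only; theme_update() changes the global theme and returns the old one
theme(plot.title = element_text(hjust = 0.5)) +
labs(title = "Covid-19 growth trend over time", x = "Months", y = "Frequency") +
scale_y_continuous(labels = comma) # comma comes from the scales package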
COVID-19 started in Wuhan, China, but China managed to reduce its confirmed cases. As of May 10, 2020, the US unfortunately leads with the highest number of confirmed cases, around 1.3 million, followed by Spain, the UK and Italy in second, third and fourth place respectively.
covid_19_data %>%
filter(`ObservationDate` == max(`ObservationDate`)) %>%
group_by(`Country/Region`) %>%
summarise(n = sum(Confirmed)) %>%
arrange(desc(n)) %>%
head(n=20) %>%
ggplot() +
geom_bar(aes(x = reorder(`Country/Region`, n), y = n), stat = "identity", fill = "steelblue", width = 0.9) +
scale_y_continuous(labels = comma) +
labs(title = "Countries with Top COVID-19 Confirmed Cases", x = "Countries", y = "Confirmed Cases") +
geom_text(aes(`Country/Region`, n, label = n), hjust = -0.1, size = 3, color = "black", inherit.aes = TRUE, position = position_dodge(width = 0.7)) +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5)) + # theme(), not theme_update(), so theme_classic() is kept
coord_flip()
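The ranking filters to the latest observation date before summing because, assuming the counts are cumulative as in the Johns Hopkins series, adding up every date would count the same cases many times. A base R sketch on made-up numbers shows the same two steps: keep the last date, then add the provinces of each country together.

```r
# Made-up cumulative counts for two dates (not real figures)
toy <- data.frame(
  ObservationDate = as.Date(c("2020-05-09", "2020-05-09", "2020-05-10",
                              "2020-05-10", "2020-05-10")),
  Country   = c("US", "Spain", "US", "US", "Spain"),
  Province  = c("New York", "Unknown", "New York", "California", "Unknown"),
  Confirmed = c(330000, 222000, 335000, 67000, 224000),
  stringsAsFactors = FALSE
)

# Keep only the most recent date, then sum provinces within each country
latest <- toy[toy$ObservationDate == max(toy$ObservationDate), ]
totals <- aggregate(Confirmed ~ Country, data = latest, FUN = sum)
totals <- totals[order(-totals$Confirmed), ]
totals
```

One caveat under this approach: a location that has no row on the final date drops out of the ranking entirely.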
# Creating map for confirmed cases
covid19_confirmed %>%
filter(Date == max(Date)) %>%
ggplot()+
borders("world",color="gray85", fill="gray80", resolution=0.1)+
theme_map(base_size = 15)+
geom_point(aes(x=Long, y=Lat, size=Confirmed_Cases),
color='purple', alpha=.5)+
scale_size_continuous(range=c(1,8),
#breaks=c(50000,100000,150000,200000,250000,300000),
label= comma)+
labs(size="Confirmed_Cases", title="Confirmed Cases of COVID-19")+
theme_classic()+theme(plot.title= element_text(hjust=0.5)) # theme() centers the title of this plot only
# Animated map
anime_confirmed <- covid19_confirmed %>%
mutate(Week = week(Date)) %>%
group_by(`Country/Region`, Week) %>%
ggplot()+
borders("world",color="gray85", fill="gray80", resolution=0.1)+
theme_map(base_size = 15)+
geom_point(aes(x=Long, y=Lat, size=Confirmed_Cases),
color='purple', alpha=.5)+
scale_size_continuous(range=c(1,8),
#breaks=c(50000,100000,150000,200000,250000,300000),
label= comma)+
labs(size="Confirmed_Cases", title="Confirmed Cases of COVID-19")+
theme_classic()+theme(plot.title= element_text(hjust=0.5))+
labs(subtitle="Week: {frame_time}")+
transition_time(Week)+
shadow_wake(wake_length = 0.1)
anime_confirmed
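The pipeline above groups by country and week but never summarises, so the grouping does not change the rows that reach the plot. One possible refinement, sketched in base R on made-up numbers, keeps a single value per location and week; the maximum is used because the counts are cumulative. The week index is computed as the number of complete seven-day periods since 1 January plus one, which is how lubridate documents `week()`. The toy data frame is hypothetical.

```r
# Made-up cumulative counts for one country over ten consecutive days
toy <- data.frame(
  Country = "Italy",
  Date = as.Date("2020-02-24") + 0:9,
  Confirmed_Cases = seq(10, 100, by = 10),
  stringsAsFactors = FALSE
)

# Week of the year: complete seven-day periods since 1 January, plus one
# (POSIXlt stores the day of the year starting at 0)
toy$Week <- as.POSIXlt(toy$Date)$yday %/% 7 + 1

# One row per country and week, keeping the highest cumulative count
weekly <- aggregate(Confirmed_Cases ~ Country + Week, data = toy, FUN = max)
weekly
```

A table shaped like `weekly`, with coordinates kept alongside, would give the animation exactly one point per location in each frame.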
As of May 5, 2020, the US has the highest number of deaths, almost 80,000, followed by the UK with almost 32,000 and Italy with about 30,500. Spain and France also have around 26,000 deaths each.
covid_19_data %>%
filter(`ObservationDate` == max(`ObservationDate`)) %>%
group_by(`Country/Region`) %>%
summarise(n = sum(Deaths)) %>%
arrange(desc(n)) %>%
head(n=20) %>%
ggplot() +
geom_bar(aes(x = reorder(`Country/Region`, n), y = n), stat = "identity", fill = "darkred", width = 0.9) +
scale_y_continuous(labels = comma) +
labs(title = "Countries with Top COVID-19 Death Cases", x = "Countries", y = "Deaths") +
geom_text(aes(`Country/Region`, n, label = n), hjust = -0.1, size = 3, color = "black", inherit.aes = TRUE, position = position_dodge(width = 0.7)) +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5)) + # theme(), not theme_update(), so theme_classic() is kept
coord_flip()
# Creating map for deaths
covid19_deaths %>%
filter(Date == max(Date)) %>%
ggplot()+
borders("world",color="gray85", fill="gray80", resolution=0.1)+
theme_map(base_size = 15)+
geom_point(aes(x=Long, y=Lat, size=Deaths),
color='Darkred', alpha=.5)+
scale_size_continuous(label= comma)+
labs(size="Deaths", title="Deaths")+
theme_bw()+theme(plot.title= element_text(hjust=0.5)) # theme() centers the title of this plot only
# Animated map for deaths
# anime_deaths <- covid19_deaths %>%
# mutate(Week= week(Date)) %>%
# group_by(`Country/Region`, Week) %>%
# ggplot()+
# borders("world",color="gray85", fill="gray80", resolution=0.1)+
# theme_map(base_size = 15)+
# geom_point(aes(x=Long, y=Lat, size=Deaths),
# color='Darkred', alpha=.5)+
# scale_size_continuous(label= comma)+
# labs(size="Deaths", title="Deaths")+
# theme_bw()+theme_update(plot.title= element_text(hjust=0.5))+
# labs(subtitle="Week: {frame_time}")+
# transition_time(Week)+
# shadow_wake(wake_length = 0.1)
#anime_deaths
Fortunately, many COVID-19 patients are recovering without any significant damage, and some of them do not even have symptoms. The US has the highest number of recovered patients, almost 210,000, followed by Germany with 144,000 and Spain with 136,666.
covid_19_data %>%
filter(`ObservationDate` == max(`ObservationDate`)) %>%
group_by(`Country/Region`) %>%
summarise(n = sum(Recovered)) %>%
arrange(desc(n)) %>%
head(n=20) %>%
ggplot() +
geom_bar(aes(x = reorder(`Country/Region`, n), y = n), stat = "identity", fill = "darkgreen", width = 0.9) +
scale_y_continuous(labels = comma) +
labs(title = "Countries with Top COVID-19 Recovered Cases", x = "Countries", y = "Recovered Cases") +
geom_text(aes(`Country/Region`, n, label = n), hjust = -0.1, size = 3, color = "black", inherit.aes = TRUE, position = position_dodge(width = 0.7)) +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5)) + # theme(), not theme_update(), so theme_classic() is kept
coord_flip()
# Creating map for recovered cases
covid19_recovered %>%
filter(Date == max(Date)) %>%
ggplot()+
borders("world",color="gray85", fill="gray80", resolution=0.1)+
theme_map(base_size = 15)+
geom_point(aes(x=Long, y=Lat, size=Recovered),
color='DarkGreen', alpha=.5)+
scale_size_continuous(label= comma)+
labs(size="Recovered", title="Recovered cases of COVID-19")+
theme_bw()+theme(plot.title= element_text(hjust=0.5)) # theme() centers the title of this plot only
# Creating animated maps
# anime_recovered <- covid19_recovered %>%
# mutate(Week= week(Date)) %>%
# group_by(`Country/Region`, Week) %>%
# ggplot()+
# borders("world",color="gray85", fill="gray80", resolution=0.1)+
# theme_map(base_size = 15)+
# geom_point(aes(x=Long, y=Lat, size=Recovered),
# color='DarkGreen', alpha=.5)+
# scale_size_continuous(label= comma)+
# labs(size="Recovered", title="Recovered cases of COVID-19")+
# theme_bw()+theme_update(plot.title= element_text(hjust=0.5))+
# labs(subtitle="Week: {frame_time}")+
# transition_time(Week)+
# shadow_wake(wake_length = 0.1)
# anime_recovered
#glimpse(covid19_updated)
covid19_updated %>%
mutate(sex = str_replace_all(sex, c("^female$" = "Female", "^male$" = "Male"))) %>% # anchored so "male" does not also match inside "Female"
group_by(sex) %>%
drop_na(sex) %>%
filter(!sex %in% c("4000", "N/A")) %>%
count() %>%
ggplot()+ geom_bar(aes(reorder(sex, n), n), stat="identity")
# Counting patients by sex and age group
covid19_updated %>%
mutate(sex = str_replace_all(sex, c("^female$" = "Female", "^male$" = "Male")),
age = suppressWarnings(as.integer(age)), # non-numeric entries become NA
# ages 0 to 19 are binned; all other ages keep their own label
age2 = case_when(age <= 9 ~ "0 - 9", age <= 19 ~ "10 - 19", TRUE ~ as.character(age))) %>%
drop_na(sex, age) %>%
filter(!sex %in% c("4000", "N/A")) %>%
count(sex, age2) %>%
head(n=25)
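When labels are recoded by substring replacement, the order of the replacements matters: "male" also occurs inside "female" and "Female". The base R sketch below, on made-up labels, shows the effect and one way around it. `gsub()` is used here as a stand-in for the stringr call, on the assumption that the replacements are applied one after another.

```r
# Made-up labels in the mixed styles found in free-text columns
sex <- c("female", "male", "Female", "N/A")

# Replacing one pattern after the other: "male" also matches inside "Female"
sequential <- gsub("male", "Male", gsub("female", "Female", sex))
sequential

# Anchoring each pattern to the whole string avoids the accidental second match
anchored <- gsub("^male$", "Male", gsub("^female$", "Female", sex))
anchored
```

Anchoring with `^` and `$` also leaves unexpected values such as "N/A" untouched, so they can still be filtered out afterwards.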
One of the datasets contained information about the symptoms some of the patients had. Although this column has a lot of missing values, I used text mining techniques on the available entries to visualize the most frequent symptoms in COVID-19 patients. I used the tm package to clean the data and converted the symptoms into a term matrix in order to build a word cloud. The result indicates that the majority of COVID-19 patients had fever, cough and sore throat. Some of the patients also had pneumonia, headache and chills.
# Text mining
symptoms <- covid19_updated$symptoms
words <- Corpus(VectorSource(symptoms))
words <- tm_map(words, removeNumbers)
words <- tm_map(words, removePunctuation)
words <- tm_map(words, stripWhitespace)
words <- tm_map(words, removeWords, stopwords("english"))
#words <- tm_map(words,removeWords, c(""))
head(words)
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 6
tdm <- TermDocumentMatrix((words))
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word= names(v), freq=v)
head(d, 50) %>% kable(row.names = FALSE) %>% kable_styling()
word | freq |
---|---|
fever | 364 |
cough | 179 |
sore | 38 |
throat | 38 |
℃ | 37 |
fatigue | 28 |
pneumonitis | 19 |
pneumonia | 17 |
headache | 17 |
chills | 16 |
nose | 15 |
runny | 15 |
chest | 13 |
pain | 13 |
malaise | 13 |
symptoms | 12 |
dry | 11 |
respiratory | 10 |
discomfort | 10 |
dyspnea | 9 |
sputum | 9 |
muscle | 9 |
soreness | 8 |
weakness | 8 |
breath | 7 |
shortness | 7 |
nausea | 7 |
muscular | 6 |
diarrhea | 6 |
tightness | 6 |
asymptomatic | 6 |
myalgia | 6 |
joint | 6 |
acute | 5 |
difficulty | 5 |
phlegm | 5 |
weak | 4 |
nasal | 4 |
vomiting | 4 |
breathing | 4 |
mild | 4 |
congestion | 3 |
distress | 3 |
aches | 3 |
coughing | 3 |
infection | 3 |
pharyngeal | 3 |
mouth | 3 |
anorexia | 3 |
low | 3 |
# wordcloud
set.seed(1235)
# The image size is set by the graphics device or chunk options, not by wordcloud()
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
# Creating bar chart of the most frequent symptoms in COVID-19 patients
ggplot(head(d,30), aes(reorder(word, freq), freq))+geom_bar(stat="identity", fill="DarkRed",width = 0.9)+coord_flip()+labs(title="Most frequent symptoms in COVID-19 patients", x="Symptoms", y="Frequency")+theme_classic()
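The frequency table behind the word cloud and the bar chart is, in essence, a count of single words. The base R sketch below reproduces that idea on made-up symptom strings without the tm package; the splitting rule (anything that is not a letter separates words) is a simplifying assumption, not tm's exact tokenizer.

```r
# Made-up symptom strings; NA stands for patients with no symptoms recorded
symptoms_toy <- c("fever, cough", "fever", "sore throat, fever 38", NA)

# Lower-case, drop missing entries, and split on anything that is not a letter
tokens <- unlist(strsplit(tolower(symptoms_toy[!is.na(symptoms_toy)]), "[^a-z]+"))
tokens <- tokens[nzchar(tokens)]

# Count each word and sort from most to least frequent
freq <- sort(table(tokens), decreasing = TRUE)
d_toy <- data.frame(word = names(freq), freq = as.integer(freq),
                    stringsAsFactors = FALSE)
d_toy
```

Counting single words splits multi-word symptoms, which is why "sore" and "throat" appear as separate rows with the same count in the table above. Counting whole comma-separated phrases instead would keep "sore throat" together.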
COVID-19 started in December 2019 in Wuhan, China, and later became a pandemic that spread throughout the world. Initially, China and Italy were the hardest-hit countries, but later the United States became the epicentre, with a huge number of confirmed cases, deaths and recovered cases. The data also shows that most patients have fever, cough and sore throat, with some showing pneumonia and chills. A dashboard has also been made that shows the summary of this project. The link is given below.