Introduction

Pandemics are not a new phenomenon; they have recurred in different periods since the beginning of civilization. Few people imagined that we would face another major pandemic after the Spanish flu of 1918. There have been epidemics since then, such as Ebola, but none was as contagious as COVID-19, which first appeared in Wuhan, China, in December 2019. This project explores and visualizes how COVID-19 has affected the world overall. The data is taken from https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset. In this project, I visualize how different countries have been affected by COVID-19 and which symptoms appear most often in affected patients.

Data cleaning and exploration

The data originates from Johns Hopkins University and is published on Kaggle at the link given above. It consists of 9 CSV files. I focus on the datasets that address my research questions, although I opened all of them for exploration. I explore each dataset and clean it as required.

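# Reading all files together
library(tidyverse)  # readr, dplyr, tidyr, stringr, ggplot2
# Later chunks also use lubridate, scales, ggthemes, gganimate, knitr, kableExtra, tm, wordcloud and RColorBrewer

files <- list.files("C:/Users/hukha/Desktop/coronavirus dataset", pattern = "\\.csv$", full.names = TRUE)  # pattern is a regular expression, not a glob
coronavirus_dataset <- sapply(files, read_csv, simplify = FALSE)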
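Because `sapply(..., simplify = FALSE)` names each list element by its full path, every later reference has to repeat that path in backticks. A minimal sketch of one alternative, assuming the same one-folder layout: name each element by its file name without the extension. The helper `read_named_csvs`, the temporary folder and the two toy files are illustrative assumptions, not part of the original project.

```r
library(readr)

# Hypothetical helper: read every CSV in `dir` into a list named by file stem
read_named_csvs <- function(dir) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  datasets <- lapply(files, read_csv, col_types = cols())  # cols() = guess types quietly
  names(datasets) <- tools::file_path_sans_ext(basename(files))
  datasets
}

# Toy demonstration in a temporary folder (made-up contents)
demo_dir <- file.path(tempdir(), "covid_demo")
dir.create(demo_dir, showWarnings = FALSE)
write_csv(data.frame(SNo = 1:2, Confirmed = c(1, 14)),
          file.path(demo_dir, "covid_19_data.csv"))
write_csv(data.frame(ID = 1, symptoms = "fever"),
          file.path(demo_dir, "COVID19_open_line_list.csv"))

datasets <- read_named_csvs(demo_dir)
names(datasets)
datasets$covid_19_data
```

With names like these, the first dataset could be reached as `datasets$covid_19_data` instead of through the full Windows path.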

The first dataset contains the key figures: the number of confirmed, deceased and recovered patients by country and date. The data was not in a tidy format, and I had to convert the dates into a proper Date type to use them later. The required columns had some missing values, which I handled to avoid bias in the analysis. I used the DT package to display the data for exploration.

# First dataset

#Let's read the data structure first to see if there is any thing we can do to make the data tidy
#head(coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`) %>% kable() %>% kable_styling() 
#glimpse(coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`)


# Previewing the data before converting "ObservationDate" from character class into Date class
coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv` %>% head()
## # A tibble: 6 x 8
##     SNo ObservationDate `Province/State` `Country/Region` `Last Update`
##   <dbl> <chr>           <chr>            <chr>            <chr>        
## 1     1 01/22/2020      Anhui            Mainland China   1/22/2020 17~
## 2     2 01/22/2020      Beijing          Mainland China   1/22/2020 17~
## 3     3 01/22/2020      Chongqing        Mainland China   1/22/2020 17~
## 4     4 01/22/2020      Fujian           Mainland China   1/22/2020 17~
## 5     5 01/22/2020      Gansu            Mainland China   1/22/2020 17~
## 6     6 01/22/2020      Guangdong        Mainland China   1/22/2020 17~
## # ... with 3 more variables: Confirmed <dbl>, Deaths <dbl>,
## #   Recovered <dbl>
coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`$ObservationDate <- as.Date(coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`$ObservationDate, format= "%m/%d/%Y")

#coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`$`Last Update` <- as.Date(coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`$`Last Update`, format= "%m/%d/%y")

# Looking for missing values
colSums(is.na(coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`)) # Only Province/State has missing values; they are replaced with "Unknown" below
##             SNo ObservationDate  Province/State  Country/Region 
##               0               0           12195               0 
##     Last Update       Confirmed          Deaths       Recovered 
##               0               0               0               0
# Replacing missing values for States

coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`$`Province/State` <-  str_replace_na(coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`$`Province/State`, replacement= "Unknown")

# Creating copy of file locally

covid_19_data <- coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`
#sample_n(covid_19_data, 10) %>% kable() %>% kable_styling()
#covid_19_data %>% datatable

The next dataset contains confirmed cases worldwide together with the latitude and longitude needed to plot the data on a map. After filling missing Province/State entries with "Unknown", there were no missing values in this dataset, but the data was in wide format, so I used tidyr’s gather function to convert it into long format. I printed a few random rows for exploration. The next datasets, covid19_deaths and covid19_recovered, had the same issues. All of them are now tidy and clean.
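The three time-series files go through the same three steps: fill missing provinces, reshape from wide to long, and parse the date. A minimal sketch of those steps as one reusable function, using `pivot_longer()` (the newer tidyr counterpart of `gather()`). The helper name `tidy_time_series` and the toy input values are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: wide time series -> tidy long format
tidy_time_series <- function(wide, value_name) {
  wide %>%
    mutate(`Province/State` = replace_na(`Province/State`, "Unknown")) %>%
    pivot_longer(
      cols = -c(`Province/State`, `Country/Region`, Lat, Long),
      names_to = "Date",
      values_to = value_name
    ) %>%
    mutate(Date = as.Date(Date, format = "%m/%d/%y"))
}

# Toy wide table with two date columns (counts are made up)
toy_wide <- tibble(
  `Province/State` = c(NA, "Sichuan"),
  `Country/Region` = c("Armenia", "China"),
  Lat = c(40.0691, 30.6171),
  Long = c(45.0382, 102.7103),
  `1/22/20` = c(0, 5),
  `1/23/20` = c(0, 8)
)

toy_long <- tidy_time_series(toy_wide, "Confirmed_Cases")
toy_long
```

The rows come out in a different order than with `gather()` (grouped by location rather than by date), but the content is the same.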

# Confirmed Patients' Dataset
#glimpse(coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/time_series_covid_19_confirmed.csv`)

## Checking for missing values
#colSums(is.na(coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/time_series_covid_19_confirmed.csv`))

## Replacing missing values
coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/time_series_covid_19_confirmed.csv`$`Province/State` <-str_replace_na( coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/time_series_covid_19_confirmed.csv`$`Province/State`, replacement= "Unknown")

# Tidying the data by converting into long format and creating a copy
covid19_confirmed <- coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/time_series_covid_19_confirmed.csv` %>% gather(Date,
                             Confirmed_Cases, -c(`Province/State`, `Country/Region`, Long, Lat))


# Checking for missing values
colSums(is.na(covid19_confirmed))
##  Province/State  Country/Region             Lat            Long 
##               0               0               0               0 
##            Date Confirmed_Cases 
##               0               0
# Converting Date column into Date 
covid19_confirmed$Date <- as.Date(covid19_confirmed$Date, format="%m/%d/%y") 

#glimpse(covid19_confirmed)
sample_n(covid19_confirmed, 10) %>% kable() %>% kable_styling()
Province/State  Country/Region       Lat        Long        Date        Confirmed_Cases
Unknown         Mexico               23.63450   -102.55280  2020-02-11  0
Unknown         Mozambique           -18.66569  35.52956    2020-01-26  0
Unknown         Trinidad and Tobago  10.69180   -61.22250   2020-04-07  107
Unknown         Armenia              40.06910   45.03820    2020-03-10  1
Unknown         Tajikistan           38.86103   71.27609    2020-03-03  0
Unknown         Bahamas              25.03430   -77.39630   2020-02-22  0
Unknown         Eritrea              15.17940   39.78230    2020-04-03  22
Unknown         Algeria              28.03390   1.65960     2020-02-25  1
Sichuan         China                30.61710   102.71030   2020-03-28  548
Unknown         Holy See             41.90290   12.45340    2020-04-17  8
# Dataset - Deaths globally

#glimpse(coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/time_series_covid_19_deaths.csv`)

# Creating local file
covid19_deaths <- coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/time_series_covid_19_deaths.csv`

# Tidying the data; converting from wide to long
covid19_deaths <- covid19_deaths %>% gather(Date, Deaths, -c(`Province/State`, `Country/Region`, Lat, Long))

# Converting the data type into date
covid19_deaths$Date <- as.Date(covid19_deaths$Date, format="%m/%d/%y")

# replacing missing values
covid19_deaths$`Province/State` <- str_replace_na(covid19_deaths$`Province/State`, replacement="Unknown")

# Checking for structure and missing values
#glimpse(covid19_deaths)
#colSums(is.na(covid19_deaths))
sample_n(covid19_deaths, 10) %>% kable() %>% kable_styling()
Province/State      Country/Region  Lat       Long      Date        Deaths
Unknown             Burundi         -3.3731   29.9189   2020-04-17  1
Unknown             Thailand        15.0000   101.0000  2020-03-06  1
Liaoning            China           41.2956   122.6085  2020-02-04  0
Unknown             San Marino      43.9424   12.4578   2020-03-25  21
Unknown             Maldives        3.2028    73.2207   2020-01-24  0
Unknown             Romania         45.9432   24.9668   2020-02-13  0
Unknown             Croatia         45.1000   15.2000   2020-03-14  0
Northern Territory  Australia       -12.4634  130.8456  2020-04-29  0
Unknown             Guyana          5.0000    -58.7500  2020-03-14  1
Unknown             Mongolia        46.8625   103.8467  2020-01-26  0
# Dataset - Recovered Patients

# creating local file
covid19_recovered <- coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/time_series_covid_19_recovered.csv`

# Tidying the data
covid19_recovered <- covid19_recovered %>% gather(Date, Recovered, -c(`Province/State`, `Country/Region`, Lat, Long))

# Data cleaning
covid19_recovered$Date <- as.Date(covid19_recovered$Date, format= "%m/%d/%y")
colSums(is.na(covid19_recovered))
## Province/State Country/Region            Lat           Long           Date 
##          20350              0              0              0              0 
##      Recovered 
##              0
# replacing missing values
covid19_recovered$`Province/State` <- str_replace_na(covid19_recovered$`Province/State`, replacement="Unknown")

# Checking
glimpse(covid19_recovered)
## Observations: 27,720
## Variables: 6
## $ `Province/State` <chr> "Unknown", "Unknown", "Unknown", "Unknown", "...
## $ `Country/Region` <chr> "Afghanistan", "Albania", "Algeria", "Andorra...
## $ Lat              <dbl> 33.0000, 41.1533, 28.0339, 42.5063, -11.2027,...
## $ Long             <dbl> 65.0000, 20.1683, 1.6596, 1.5218, 17.8739, -6...
## $ Date             <date> 2020-01-22, 2020-01-22, 2020-01-22, 2020-01-...
## $ Recovered        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
colSums(is.na(covid19_recovered))
## Province/State Country/Region            Lat           Long           Date 
##              0              0              0              0              0 
##      Recovered 
##              0
sample_n(covid19_recovered, 10) %>% kable() %>% kable_styling()
Province/State          Country/Region  Lat       Long      Date        Recovered
Unknown                 Somalia         5.1521    46.1996   2020-03-26  0
Unknown                 Switzerland     46.8182   8.2275    2020-03-14  4
Unknown                 Paraguay        -23.4425  -58.4438  2020-02-18  0
Unknown                 Senegal         14.4974   -14.4524  2020-03-30  27
Unknown                 Germany         51.0000   9.0000    2020-04-15  72600
Unknown                 Gambia          13.4432   -15.3101  2020-01-26  0
British Virgin Islands  United Kingdom  18.4207   -64.6400  2020-02-20  0
Unknown                 Nigeria         9.0820    8.6753    2020-04-26  239
Unknown                 Denmark         56.2639   9.5018    2020-02-15  0
Unknown                 Chile           -35.6751  -71.5430  2020-04-10  1571

The dataset below is still untidy, but I only need its “symptoms” column for the text mining analysis. I will do the necessary cleaning while conducting the analysis.

# Dataset for symptoms
covid19_updated <- coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/COVID19_open_line_list.csv` 

Data Visualization

In this section, I explore the data by visualizing it.

COVID-19 growth trend over time

COVID-19 is spreading very quickly. According to Johns Hopkins University, it is 2% contagious, while the Spanish flu, which killed millions of people globally, was 1.8% contagious. From the beginning of the pandemic up to 5 May 2020, almost 4 million people worldwide have contracted the coronavirus. Unfortunately, the mortality rate for COVID-19 is 4.5%, whereas the Spanish flu had a mortality rate of 2.5%. The data shows that the virus is spreading very quickly throughout the world; so far it has killed almost 250,000 people, and around 500,000 people have recovered globally.

The source article is available at https://www.cnbc.com/2020/03/26/coronavirus-may-be-deadlier-than-1918-flu-heres-how-it-stacks-up-to-other-pandemics.html.
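A mortality rate of this kind is commonly computed as deaths divided by confirmed cases. A minimal sketch of that arithmetic on the latest observation date, assuming the same column names as covid_19_data. The toy rows and the name `crude_cfr` are made up for illustration, and a crude ratio like this can differ from published estimates.

```r
library(dplyr)

# Toy cumulative counts for two countries on two dates (made-up numbers)
toy <- tibble(
  ObservationDate = as.Date(c("2020-05-04", "2020-05-04", "2020-05-05", "2020-05-05")),
  `Country/Region` = c("A", "B", "A", "B"),
  Confirmed = c(900, 90, 1000, 100),
  Deaths = c(40, 4, 50, 5)
)

# Counts are cumulative, so only the latest date is summed
crude_cfr <- toy %>%
  filter(ObservationDate == max(ObservationDate)) %>%
  summarise(confirmed = sum(Confirmed), deaths = sum(Deaths)) %>%
  mutate(cfr_percent = 100 * deaths / confirmed)

crude_cfr
```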

# Creating line plot to see how coronavirus affected
covid_19_data %>% 
  gather(Type,
         Freq2,
         -c(SNo, ObservationDate, `Province/State`, `Country/Region`, `Last Update`)) %>% 
  select(ObservationDate, Type, Freq2) %>% 
  group_by(ObservationDate, Type) %>% 
  summarise(n= sum(Freq2)) %>% 
  ggplot()+
  geom_line(aes(ObservationDate, n, color=Type), size=1)+
  scale_y_continuous(labels = comma)+
  labs(title= "Covid-19 growth trend over time", x="Months", y="Frequency")+
  theme_classic()+
  theme(plot.title= element_text(hjust=0.5))  # theme(), not theme_update(): the latter edits the global theme

Confirmed cases

COVID-19 started in Wuhan, China, but China managed to reduce its confirmed cases. Unfortunately, the US is now the leading country, with around 1.3 million confirmed cases as of May 10, 2020, followed by Spain, the UK and Italy in second, third and fourth place respectively.

covid_19_data %>% 
  filter(`ObservationDate` == max(`ObservationDate`)) %>% 
  group_by(`Country/Region`) %>% 
  summarise(n = sum(Confirmed)) %>% 
  arrange(desc(n)) %>%
  head(n=20) %>% 
  ggplot() +
  geom_bar(aes(x=reorder(`Country/Region`, n), y=n), stat="identity", fill="steelblue", width=0.9)+
  geom_text(aes(`Country/Region`, n, label=n), hjust= -0.1, size=3, color="black", inherit.aes = TRUE, position = position_dodge(width=0.7))+
  scale_y_continuous(labels=comma)+
  labs(title="Countries with Top COVID-19 Confirmed Cases", x="Countries", y="Confirmed Cases")+
  theme_classic()+
  theme(plot.title= element_text(hjust=0.5))+  # theme(), not theme_update(): the latter edits the global theme
  coord_flip()

# Creating map for confirmed cases
covid19_confirmed %>% 
  filter(Date == max(Date)) %>% 
  ggplot()+
  borders("world",color="gray85", fill="gray80", resolution=0.1)+
  theme_map(base_size = 15)+
  geom_point(aes(x=Long, y=Lat, size=Confirmed_Cases),
             color='purple', alpha=.5)+
  scale_size_continuous(range=c(1,8),
                        #breaks=c(50000,100000,150000,200000,250000,300000),
                        label= comma)+
  labs(size="Confirmed_Cases", title="Confirmed Cases of COVID-19")+
  theme_classic()+theme(plot.title= element_text(hjust=0.5))

# Animated map
anime_confirmed <- covid19_confirmed %>% 
  mutate(Week = week(Date)) %>% 
  group_by(`Country/Region`, Week) %>% 
  ggplot()+
  borders("world",color="gray85", fill="gray80", resolution=0.1)+
  theme_map(base_size = 15)+
  geom_point(aes(x=Long, y=Lat, size=Confirmed_Cases),
             color='purple', alpha=.5)+
  scale_size_continuous(range=c(1,8),
                        #breaks=c(50000,100000,150000,200000,250000,300000),
                        label= comma)+
  labs(size="Confirmed_Cases", title="Confirmed Cases of COVID-19")+
  theme_classic()+theme(plot.title= element_text(hjust=0.5))+
  labs(subtitle="Week: {frame_time}")+
  transition_time(Week)+
  shadow_wake(wake_length = 0.1)
anime_confirmed

Deaths

The US has the highest number of deaths so far, almost 80,000, followed by the UK with almost 32,000 and Italy with 30,500. Spain and France each have around 26,000 deaths as of May 5, 2020.

covid_19_data %>% 
  filter(`ObservationDate` == max(`ObservationDate`)) %>% 
  group_by(`Country/Region`) %>% 
  summarise(n = sum(Deaths)) %>% 
  arrange(desc(n)) %>%
  head(n=20) %>% 
  ggplot() +
  geom_bar(aes(x=reorder(`Country/Region`, n), y=n), stat="identity", fill="darkred", width=0.9)+
  geom_text(aes(`Country/Region`, n, label=n), hjust= -0.1, size=3, color="black", inherit.aes = TRUE, position = position_dodge(width=0.7))+
  scale_y_continuous(labels=comma)+
  labs(title="Countries with Top COVID-19 Death Cases", x="Countries", y="Deaths")+
  theme_classic()+
  theme(plot.title= element_text(hjust=0.5))+  # theme(), not theme_update(): the latter edits the global theme
  coord_flip()

# Creating map for deaths
covid19_deaths %>% 
  filter(Date == max(Date)) %>% 
  ggplot()+
  borders("world",color="gray85", fill="gray80", resolution=0.1)+
  theme_map(base_size = 15)+
  geom_point(aes(x=Long, y=Lat, size=Deaths),
             color='Darkred', alpha=.5)+
  scale_size_continuous(label= comma)+
  labs(size="Deaths", title="Deaths")+
  theme_bw()+theme(plot.title= element_text(hjust=0.5))

# Animated map for deaths

# anime_deaths <-  covid19_deaths %>% 
#    mutate(Week= week(Date)) %>% 
#    group_by(`Country/Region`, Week) %>% 
#    ggplot()+
#    borders("world",color="gray85", fill="gray80", resolution=0.1)+
#    theme_map(base_size = 15)+
#    geom_point(aes(x=Long, y=Lat, size=Deaths),
#               color='Darkred', alpha=.5)+
#    scale_size_continuous(label= comma)+
#    labs(size="Deaths", title="Deaths")+
#    theme_bw()+theme_update(plot.title= element_text(hjust=0.5))+
#    labs(subtitle="Week: {frame_time}")+
#    transition_time(Week)+
#    shadow_wake(wake_length = 0.1)
#anime_deaths

Recovered cases

Fortunately, many COVID-19 patients recover without any significant damage, and some of them do not even have symptoms. The US has the highest number of recovered patients, almost 210,000, followed by Germany with 144,000 and Spain with 136,666.

covid_19_data %>% 
  filter(`ObservationDate` == max(`ObservationDate`)) %>% 
  group_by(`Country/Region`) %>% 
  summarise(n = sum(Recovered)) %>% 
  arrange(desc(n)) %>%
  head(n=20) %>% 
  ggplot() +
  geom_bar(aes(x=reorder(`Country/Region`, n), y=n), stat="identity", fill="darkgreen", width=0.9)+
  geom_text(aes(`Country/Region`, n, label=n), hjust= -0.1, size=3, color="black", inherit.aes = TRUE, position = position_dodge(width=0.7))+
  scale_y_continuous(labels=comma)+
  labs(title="Countries with Top COVID-19 Recovered Cases", x="Countries", y="Recovered Cases")+
  theme_classic()+
  theme(plot.title= element_text(hjust=0.5))+  # theme(), not theme_update(): the latter edits the global theme
  coord_flip()

# Creating map for recovered cases
covid19_recovered %>% 
  filter(Date == max(Date)) %>% 
  ggplot()+
  borders("world",color="gray85", fill="gray80", resolution=0.1)+
  theme_map(base_size = 15)+
  geom_point(aes(x=Long, y=Lat, size=Recovered),
             color='DarkGreen', alpha=.5)+
  scale_size_continuous(label= comma)+
  labs(size="Recovered", title="Recovered cases of COVID-19")+
  theme_bw()+theme(plot.title= element_text(hjust=0.5))

# Creating animated maps
# anime_recovered <- covid19_recovered %>% 
#    mutate(Week= week(Date)) %>% 
#    group_by(`Country/Region`, Week) %>% 
#    ggplot()+
#    borders("world",color="gray85", fill="gray80", resolution=0.1)+
#    theme_map(base_size = 15)+
#    geom_point(aes(x=Long, y=Lat, size=Recovered),
#               color='DarkGreen', alpha=.5)+
#    scale_size_continuous(label= comma)+
#    labs(size="Recovered", title="Recovered cases of COVID-19")+
#    theme_bw()+theme_update(plot.title= element_text(hjust=0.5))+
#    labs(subtitle="Week: {frame_time}")+
#    transition_time(Week)+
#    shadow_wake(wake_length = 0.1)
# anime_recovered
#glimpse(covid19_updated)
# Counting patients by sex
covid19_updated %>% 
  drop_na(sex) %>%
  filter(!sex %in% c("4000", "N/A")) %>% 
  mutate(sex = str_to_title(sex)) %>%  # chained replacements would turn "Female" into "FeMale"
  count(sex) %>% 
  ggplot()+ geom_bar(aes(reorder(sex, n), n), stat="identity")

# Counting patients by sex and age band
sex_age_counts <- covid19_updated %>% 
  drop_na(sex) %>%
  filter(!sex %in% c("4000", "N/A")) %>% 
  mutate(sex = str_to_title(sex),
         age = suppressWarnings(as.integer(age)),  # non-numeric ages such as ranges become NA
         age2 = case_when(age <= 9 ~ "0 - 9",
                          age <= 19 ~ "10 - 19",
                          TRUE ~ as.character(age))) %>% 
  drop_na(age) %>%
  count(sex, age2)

head(sex_age_counts, n=25)
ggplot(sex_age_counts)+ geom_bar(aes(reorder(sex, n), n), stat="identity")

Symptoms

One of the datasets contains information about the symptoms some of the patients had. Although this data has a lot of missing values, I used text mining techniques on the available entries to visualize the most frequent symptoms in COVID-19 patients. I used the tm package to clean the data and converted the symptoms into a matrix in order to build a word cloud. The result indicates that the majority of COVID-19 patients had fever, cough and sore throat. Some of the patients also had pneumonia, headache and chills.
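The counting step behind the word cloud can be illustrated without tm. A minimal base R sketch, assuming the symptoms are free-text strings with missing entries; the toy vector and the names `toy_symptoms` and `d_toy` are made up for illustration. Unlike the tm pipeline it does not remove stop words, but dropping NA entries and empty tokens up front keeps a blank term out of the frequency table.

```r
# Toy symptom strings, including a missing entry (made-up data)
toy_symptoms <- c("fever, cough", NA, "Fever; sore throat", "cough", "fever 38.5")

# Lower-case, then replace punctuation and digits with spaces
clean <- tolower(toy_symptoms[!is.na(toy_symptoms)])
clean <- gsub("[[:punct:][:digit:]]+", " ", clean)

# Split on whitespace and drop empty tokens
tokens <- unlist(strsplit(clean, "\\s+"))
tokens <- tokens[tokens != ""]

# Term frequencies, most frequent first
freq <- sort(table(tokens), decreasing = TRUE)
d_toy <- data.frame(word = names(freq), freq = as.integer(freq))
d_toy
```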

# Text mining 
symptoms <- covid19_updated$symptoms

words <- Corpus(VectorSource(symptoms))
words <- tm_map(words, removeNumbers)
words <- tm_map(words, removePunctuation)
words <- tm_map(words, stripWhitespace)
words <- tm_map(words, removeWords, stopwords("english"))
#words <- tm_map(words,removeWords, c(""))
head(words)
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 6
tdm <- TermDocumentMatrix((words))
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word= names(v), freq=v)
head(d, 50) %>% kable(row.names = FALSE) %>% kable_styling()
word          freq
fever         364
cough         179
sore          38
throat        38
(empty term)  37
fatigue       28
pneumonitis   19
pneumonia     17
headache      17
chills        16
nose          15
runny         15
chest         13
pain          13
malaise       13
symptoms      12
dry           11
respiratory   10
discomfort    10
dyspnea       9
sputum        9
muscle        9
soreness      8
weakness      8
breath        7
shortness     7
nausea        7
muscular      6
diarrhea      6
tightness     6
asymptomatic  6
myalgia       6
joint         6
acute         5
difficulty    5
phlegm        5
weak          4
nasal         4
vomiting      4
breathing     4
mild          4
congestion    3
distress      3
aches         3
coughing      3
infection     3
pharyngeal    3
mouth         3
anorexia      3
low           3
# wordcloud

set.seed(1235)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"), width=1200, height=800)

# Creating bar chart of the most frequent symptoms in COVID-19 patients
ggplot(head(d,30), aes(reorder(word, freq), freq))+
  geom_bar(stat="identity", fill="DarkRed", width = 0.9)+
  coord_flip()+
  labs(title="Most frequent symptoms in COVID-19 patients", x="Symptoms", y="Frequency")+
  theme_classic()

Conclusion

COVID-19 started in December 2019 in Wuhan, China, later became a pandemic and spread throughout the world. Initially, China and Italy were the hardest-hit countries, but the United States later became the epicentre, with a huge number of confirmed cases, deaths and recoveries. The data also shows that most patients have fever, cough and sore throat, with some showing pneumonia and chills. A dashboard summarizing this project has also been made. The link is given below.