Introduction

Pandemics are not a new phenomenon; they have recurred in different periods since the beginning of civilization. Few expected that the world would face another major pandemic after the Spanish flu of 1918. There have been epidemics since then, such as Ebola, but none was as contagious as COVID-19, which first appeared in Wuhan, China, in December 2019. This project aims to explore and visualize how COVID-19 has affected the world overall. The data were taken from https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset. In this project, I will visualize how different countries have been affected by COVID-19 and which symptoms appear most often in affected patients.

Data cleaning and exploration

The data originate from Johns Hopkins University and are published on Kaggle at the link given above. The collection contains 9 CSV files. I load all of them for exploration, but I focus on the datasets that address my research questions, exploring and cleaning each one as required.

# Reading all files together

files <- list.files("C:/Users/hukha/Desktop/coronavirus dataset", pattern = "\\.csv$", full.names = TRUE)  # pattern is a regular expression, not a glob
coronavirus_dataset <- sapply(files, read_csv, simplify = FALSE)
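
Because `sapply()` names each list element by its full file path, every later access needs a long backtick-quoted name. One option is to name the elements by file name instead. The following is a minimal sketch on throwaway files in a temporary folder: the folder, the file contents and the `datasets` name are illustrative assumptions, and base `read.csv()` stands in for `read_csv()`.

```r
# Illustrative only: build two tiny CSV files in a temporary folder
demo_dir <- file.path(tempdir(), "covid_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(x = 1:2), file.path(demo_dir, "covid_19_data.csv"), row.names = FALSE)
write.csv(data.frame(y = 3:5), file.path(demo_dir, "time_series_covid_19_deaths.csv"), row.names = FALSE)

# List the CSV files; the pattern is a regular expression, so the dot is escaped
demo_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file and key the list by file name without the extension
datasets <- lapply(demo_files, read.csv)
names(datasets) <- tools::file_path_sans_ext(basename(demo_files))

names(datasets)
nrow(datasets$covid_19_data)
```

With names like these, a dataset can be reached as `datasets$covid_19_data` rather than through its full path.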

The first dataset contains the core figures: the number of confirmed, deceased and recovered patients by country and date. The data were not in a tidy format, and the dates had to be converted to a proper Date type for later use. Some required columns had missing values, which were handled to avoid biasing the analysis. I used the DT package to display the data for exploration.
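
The two cleaning steps described above can be sketched on a toy data frame. The rows below are invented for illustration, and base R's `as.Date()` and `is.na()` are used so the example runs without extra packages.

```r
# Toy rows mimicking the raw layout (values are made up)
toy <- data.frame(
  ObservationDate = c("01/22/2020", "01/23/2020", "02/01/2020"),
  Province        = c("Anhui", NA, "Beijing"),
  Confirmed       = c(1, 0, 14),
  stringsAsFactors = FALSE
)

# Parse month/day/four-digit-year strings into Date objects
toy$ObservationDate <- as.Date(toy$ObservationDate, format = "%m/%d/%Y")

# Count missing values per column before replacing them
na_before <- colSums(is.na(toy))

# Replace missing province names with an explicit label
toy$Province[is.na(toy$Province)] <- "Unknown"

na_before
toy
```

`%Y` expects a four-digit year and `%y` a two-digit year. Mixing them up can silently produce wrong dates, so it is worth checking the range of the parsed column.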

# First dataset

#Let's look at the data structure first to see whether anything is needed to make the data tidy
#head(coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`) %>% kable() %>% kable_styling() 
#glimpse(coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`)


# Previewing the first rows; "ObservationDate" is still a character column at this point
coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv` %>% head()

## # A tibble: 6 x 8
##     SNo ObservationDate `Province/State` `Country/Region` `Last Update`
##   <dbl> <chr>           <chr>            <chr>            <chr>        
## 1     1 01/22/2020      Anhui            Mainland China   1/22/2020 17~
## 2     2 01/22/2020      Beijing          Mainland China   1/22/2020 17~
## 3     3 01/22/2020      Chongqing        Mainland China   1/22/2020 17~
## 4     4 01/22/2020      Fujian           Mainland China   1/22/2020 17~
## 5     5 01/22/2020      Gansu            Mainland China   1/22/2020 17~
## 6     6 01/22/2020      Guangdong        Mainland China   1/22/2020 17~
## # ... with 3 more variables: Confirmed <dbl>, Deaths <dbl>,
## #   Recovered <dbl>

# Converting "ObservationDate" from character class into Date class
coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`$ObservationDate <- as.Date(coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`$ObservationDate, format= "%m/%d/%Y")

#coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`$`Last Update` <- as.Date(coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`$`Last Update`, format= "%m/%d/%y")

# Looking for missing values
# Only Province/State has missing values, so they are replaced with "Unknown" below
colSums(is.na(coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`))

##             SNo ObservationDate  Province/State  Country/Region 
##               0               0           12195               0 
##     Last Update       Confirmed          Deaths       Recovered 
##               0               0               0               0

# Replacing missing values for States

coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`$`Province/State` <-  str_replace_na(coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`$`Province/State`, replacement= "Unknown")

# Creating copy of file locally

covid_19_data <- coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/covid_19_data.csv`
#sample_n(covid_19_data, 10) %>% kable() %>% kable_styling()
#covid_19_data %>% datatable

The next dataset contains confirmed cases worldwide together with latitude and longitude, which are needed to plot the data on a map. After missing Province/State entries were replaced with "Unknown", no missing values remained, but the data were in wide format, so I used tidyr’s gather function to convert them to long format. A few random rows are printed for exploration. The next two datasets, covid19_deaths and covid19_recovered, had the same issues and received the same treatment. All three are now tidy and clean.
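
`gather()` still works, but the tidyr documentation marks it as superseded by `pivot_longer()`. The following is a minimal sketch of the same wide-to-long step on a made-up two-row table; the `wide` and `long` names and all values are illustrative.

```r
library(tidyr)

# Made-up wide table: one column per date, as in the time-series files
wide <- data.frame(
  `Province/State` = c(NA, "Sichuan"),
  `Country/Region` = c("Armenia", "China"),
  Lat  = c(40.07, 30.62),
  Long = c(45.04, 102.71),
  `1/22/20` = c(0, 5),
  `1/23/20` = c(0, 8),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Stack every date column into Date / Confirmed_Cases pairs
long <- pivot_longer(
  wide,
  cols = -c(`Province/State`, `Country/Region`, Lat, Long),
  names_to = "Date",
  values_to = "Confirmed_Cases"
)

# The column names were two-digit-year strings, hence %y
long$Date <- as.Date(long$Date, format = "%m/%d/%y")
long
```

Unlike `gather()`, `pivot_longer()` keeps the rows of each original record together. The row order differs, but that does not affect the plots in this report.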

# Confirmed Patients' Dataset
#glimpse(coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/time_series_covid_19_confirmed.csv`)

## Checking for missing values
#colSums(is.na(coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/time_series_covid_19_confirmed.csv`))

## Replacing missing values
coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/time_series_covid_19_confirmed.csv`$`Province/State` <-str_replace_na( coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/time_series_covid_19_confirmed.csv`$`Province/State`, replacement= "Unknown")

# Tidying the data by converting into long format and creating a copy
covid19_confirmed <- coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/time_series_covid_19_confirmed.csv` %>% gather(Date,
                             Confirmed_Cases, -c(`Province/State`, `Country/Region`, Long, Lat))


# Checking for missing values
colSums(is.na(covid19_confirmed))

##  Province/State  Country/Region             Lat            Long 
##               0               0               0               0 
##            Date Confirmed_Cases 
##               0               0

# Converting Date column into Date 
covid19_confirmed$Date <- as.Date(covid19_confirmed$Date, format="%m/%d/%y") 

#glimpse(covid19_confirmed)
sample_n(covid19_confirmed, 10) %>% kable() %>% kable_styling()

Province/State	Country/Region	Lat	Long	Date	Confirmed_Cases
Unknown	Mexico	23.63450	-102.55280	2020-02-11	0
Unknown	Mozambique	-18.66569	35.52956	2020-01-26	0
Unknown	Trinidad and Tobago	10.69180	-61.22250	2020-04-07	107
Unknown	Armenia	40.06910	45.03820	2020-03-10	1
Unknown	Tajikistan	38.86103	71.27609	2020-03-03	0
Unknown	Bahamas	25.03430	-77.39630	2020-02-22	0
Unknown	Eritrea	15.17940	39.78230	2020-04-03	22
Unknown	Algeria	28.03390	1.65960	2020-02-25	1
Sichuan	China	30.61710	102.71030	2020-03-28	548
Unknown	Holy See	41.90290	12.45340	2020-04-17	8

# Dataset - Deaths globally

#glimpse(coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/time_series_covid_19_deaths.csv`)

# Creating local file
covid19_deaths <- coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/time_series_covid_19_deaths.csv`

# Tidying the data; converting from wide to long
covid19_deaths <- covid19_deaths %>% gather(Date, Deaths, -c(`Province/State`, `Country/Region`, Lat, Long))

# Converting the data type into date
covid19_deaths$Date <- as.Date(covid19_deaths$Date, format="%m/%d/%y")

# replacing missing values
covid19_deaths$`Province/State` <- str_replace_na(covid19_deaths$`Province/State`, replacement="Unknown")

# Checking for structure and missing values
#glimpse(covid19_deaths)
#colSums(is.na(covid19_deaths))
sample_n(covid19_deaths, 10) %>% kable() %>% kable_styling()

Province/State	Country/Region	Lat	Long	Date	Deaths
Unknown	Burundi	-3.3731	29.9189	2020-04-17	1
Unknown	Thailand	15.0000	101.0000	2020-03-06	1
Liaoning	China	41.2956	122.6085	2020-02-04	0
Unknown	San Marino	43.9424	12.4578	2020-03-25	21
Unknown	Maldives	3.2028	73.2207	2020-01-24	0
Unknown	Romania	45.9432	24.9668	2020-02-13	0
Unknown	Croatia	45.1000	15.2000	2020-03-14	0
Northern Territory	Australia	-12.4634	130.8456	2020-04-29	0
Unknown	Guyana	5.0000	-58.7500	2020-03-14	1
Unknown	Mongolia	46.8625	103.8467	2020-01-26	0

# Dataset - Recovered Patients

# creating local file
covid19_recovered <- coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/time_series_covid_19_recovered.csv`

# Tidying the data
covid19_recovered <- covid19_recovered %>% gather(Date, Recovered, -c(`Province/State`, `Country/Region`, Lat, Long))

# Data cleaning
covid19_recovered$Date <- as.Date(covid19_recovered$Date, format= "%m/%d/%y")
colSums(is.na(covid19_recovered))

## Province/State Country/Region            Lat           Long           Date 
##          20350              0              0              0              0 
##      Recovered 
##              0

# replacing missing values
covid19_recovered$`Province/State` <- str_replace_na(covid19_recovered$`Province/State`, replacement="Unknown")

# Checking
glimpse(covid19_recovered)

## Observations: 27,720
## Variables: 6
## $ `Province/State` <chr> "Unknown", "Unknown", "Unknown", "Unknown", "...
## $ `Country/Region` <chr> "Afghanistan", "Albania", "Algeria", "Andorra...
## $ Lat              <dbl> 33.0000, 41.1533, 28.0339, 42.5063, -11.2027,...
## $ Long             <dbl> 65.0000, 20.1683, 1.6596, 1.5218, 17.8739, -6...
## $ Date             <date> 2020-01-22, 2020-01-22, 2020-01-22, 2020-01-...
## $ Recovered        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

colSums(is.na(covid19_recovered))

## Province/State Country/Region            Lat           Long           Date 
##              0              0              0              0              0 
##      Recovered 
##              0

sample_n(covid19_recovered, 10) %>% kable() %>% kable_styling()

Province/State	Country/Region	Lat	Long	Date	Recovered
Unknown	Somalia	5.1521	46.1996	2020-03-26	0
Unknown	Switzerland	46.8182	8.2275	2020-03-14	4
Unknown	Paraguay	-23.4425	-58.4438	2020-02-18	0
Unknown	Senegal	14.4974	-14.4524	2020-03-30	27
Unknown	Germany	51.0000	9.0000	2020-04-15	72600
Unknown	Gambia	13.4432	-15.3101	2020-01-26	0
British Virgin Islands	United Kingdom	18.4207	-64.6400	2020-02-20	0
Unknown	Nigeria	9.0820	8.6753	2020-04-26	239
Unknown	Denmark	56.2639	9.5018	2020-02-15	0
Unknown	Chile	-35.6751	-71.5430	2020-04-10	1571

The dataset below is still untidy, but I need only its “symptoms” column for the text mining analysis, so I will do the necessary cleaning while conducting that analysis.

# Dataset for symptoms
covid19_updated <- coronavirus_dataset$`C:/Users/hukha/Desktop/coronavirus dataset/COVID19_open_line_list.csv`

Data Visualization

In this section, I explore the data by visualizing them.

COVID-19’s growing trend over time

COVID-19 is spreading very quickly. According to Johns Hopkins University, it is 2% contagious, while the Spanish flu, which killed millions of people globally, was 1.8% contagious. From the beginning of the pandemic up to 5 May 2020, almost 4 million people contracted the coronavirus worldwide. Unfortunately, the mortality rate for COVID-19 is 4.5%, whereas the Spanish flu had a mortality rate of 2.5%. The data show that the disease is spreading very quickly throughout the world; so far it has killed almost 250,000 people, and around 500,000 people have recovered globally.

The comparison with the Spanish flu is taken from the article at https://www.cnbc.com/2020/03/26/coronavirus-may-be-deadlier-than-1918-flu-heres-how-it-stacks-up-to-other-pandemics.html.
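
A mortality figure like the one quoted above is commonly computed as deaths divided by confirmed cases. The sketch below assumes that definition and uses invented totals, not values from the dataset.

```r
# Invented cumulative totals for three dates (not taken from the dataset)
totals <- data.frame(
  ObservationDate = as.Date(c("2020-03-01", "2020-04-01", "2020-05-01")),
  Confirmed = c(1000, 20000, 50000),
  Deaths    = c(20, 900, 2500)
)

# Deaths as a percentage of confirmed cases on each date
totals$fatality_pct <- 100 * totals$Deaths / totals$Confirmed

totals
```

This ratio depends heavily on how many mild cases are tested and confirmed, so it is only a rough indicator.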

# Creating line plot to see how coronavirus affected
covid_19_data %>% 
  gather(Type,
         Freq2,
         -c(SNo, ObservationDate, `Province/State`, `Country/Region`, `Last Update`)) %>% 
  select(ObservationDate, Type, Freq2) %>% 
  group_by(ObservationDate, Type) %>% 
  summarise(n= sum(Freq2)) %>% 
  ggplot() +
  geom_line(aes(ObservationDate, n, color = Type), size = 1) +
  scale_y_continuous(labels = comma) +
  labs(title = "COVID-19 growth trend over time", x = "Months", y = "Frequency") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))  # theme(), not theme_update(), so theme_classic() is kept

Confirmed cases

COVID-19 started in Wuhan, China, but China managed to reduce its confirmed cases. Unfortunately, as of May 10, 2020, the US leads the world with around 1.3 million confirmed cases, followed by Spain, the UK and Italy in second, third and fourth place respectively.

covid_19_data %>% 
  filter(`ObservationDate` == max(`ObservationDate`)) %>% 
  group_by(`Country/Region`) %>% 
  summarise(n = sum(Confirmed)) %>% 
  arrange(desc(n)) %>%
  head(n=20) %>% 
  ggplot() +
  geom_bar(aes(x = reorder(`Country/Region`, n), y = n), stat = "identity", fill = "steelblue", width = 0.9) +
  geom_text(aes(`Country/Region`, n, label = n), hjust = -0.1, size = 3, color = "black", position = position_dodge(width = 0.7)) +
  scale_y_continuous(labels = comma) +
  labs(title = "Countries with Top COVID-19 Confirmed Cases", x = "Countries", y = "Confirmed Cases") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) +  # theme() centers the title for this plot only
  coord_flip()

# Creating map for confirmed cases
covid19_confirmed %>% 
  filter(Date == max(Date)) %>% 
  ggplot()+
  borders("world",color="gray85", fill="gray80", resolution=0.1)+
  theme_map(base_size = 15)+
  geom_point(aes(x=Long, y=Lat, size=Confirmed_Cases),
             color='purple', alpha=.5)+
  scale_size_continuous(range=c(1,8),
                        #breaks=c(50000,100000,150000,200000,250000,300000),
                        label= comma)+
  labs(size="Confirmed_Cases", title="Confirmed Cases of COVID-19")+
  theme_classic()+theme(plot.title= element_text(hjust=0.5))

# Animated map
anime_confirmed <- covid19_confirmed %>% 
  mutate(Week = week(Date)) %>% 
  group_by(`Country/Region`, Week) %>% 
  ggplot()+
  borders("world",color="gray85", fill="gray80", resolution=0.1)+
  theme_map(base_size = 15)+
  geom_point(aes(x=Long, y=Lat, size=Confirmed_Cases),
             color='purple', alpha=.5)+
  scale_size_continuous(range=c(1,8),
                        #breaks=c(50000,100000,150000,200000,250000,300000),
                        label= comma)+
  labs(size="Confirmed_Cases", title="Confirmed Cases of COVID-19")+
  theme_classic()+theme(plot.title= element_text(hjust=0.5))+
  labs(subtitle="Week: {frame_time}")+
  transition_time(Week)+
  shadow_wake(wake_length = 0.1)
anime_confirmed

Deaths

The US has the highest number of deaths so far, almost 80,000, followed by the UK with almost 32,000 and Italy with about 30,500. Spain and France each have around 26,000 deaths as of May 5, 2020.

covid_19_data %>% 
  filter(`ObservationDate` == max(`ObservationDate`)) %>% 
  group_by(`Country/Region`) %>% 
  summarise(n = sum(Deaths)) %>% 
  arrange(desc(n)) %>%
  head(n=20) %>% 
  ggplot() +
  geom_bar(aes(x = reorder(`Country/Region`, n), y = n), stat = "identity", fill = "darkred", width = 0.9) +
  geom_text(aes(`Country/Region`, n, label = n), hjust = -0.1, size = 3, color = "black", position = position_dodge(width = 0.7)) +
  scale_y_continuous(labels = comma) +
  labs(title = "Countries with Top COVID-19 Death Cases", x = "Countries", y = "Deaths") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) +  # theme() centers the title for this plot only
  coord_flip()

# Creating map for deaths
covid19_deaths %>% 
  filter(Date == max(Date)) %>% 
  ggplot()+
  borders("world",color="gray85", fill="gray80", resolution=0.1)+
  theme_map(base_size = 15)+
  geom_point(aes(x=Long, y=Lat, size=Deaths),
             color='Darkred', alpha=.5)+
  scale_size_continuous(label= comma)+
  labs(size="Deaths", title="Deaths")+
  theme_bw()+theme(plot.title= element_text(hjust=0.5))

# Animated map for deaths

# anime_deaths <-  covid19_deaths %>% 
#    mutate(Week= week(Date)) %>% 
#    group_by(`Country/Region`, Week) %>% 
#    ggplot()+
#    borders("world",color="gray85", fill="gray80", resolution=0.1)+
#    theme_map(base_size = 15)+
#    geom_point(aes(x=Long, y=Lat, size=Deaths),
#               color='Darkred', alpha=.5)+
#    scale_size_continuous(label= comma)+
#    labs(size="Deaths", title="Deaths")+
#    theme_bw()+theme_update(plot.title= element_text(hjust=0.5))+
#    labs(subtitle="Week: {frame_time}")+
#    transition_time(Week)+
#    shadow_wake(wake_length = 0.1)
#anime_deaths

Recovered cases

Fortunately, many COVID-19 patients recover without significant harm, and some never show symptoms. The US has the highest number of recovered patients, almost 210,000, followed by Germany with 144,000 and Spain with 136,666.

covid_19_data %>% 
  filter(`ObservationDate` == max(`ObservationDate`)) %>% 
  group_by(`Country/Region`) %>% 
  summarise(n = sum(Recovered)) %>% 
  arrange(desc(n)) %>%
  head(n=20) %>% 
  ggplot() +
  geom_bar(aes(x = reorder(`Country/Region`, n), y = n), stat = "identity", fill = "darkgreen", width = 0.9) +
  geom_text(aes(`Country/Region`, n, label = n), hjust = -0.1, size = 3, color = "black", position = position_dodge(width = 0.7)) +
  scale_y_continuous(labels = comma) +
  labs(title = "Countries with Top COVID-19 Recovered Cases", x = "Countries", y = "Recovered Cases") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) +  # theme() centers the title for this plot only
  coord_flip()

# Creating map for recovered cases
covid19_recovered %>% 
  filter(Date == max(Date)) %>% 
  ggplot()+
  borders("world",color="gray85", fill="gray80", resolution=0.1)+
  theme_map(base_size = 15)+
  geom_point(aes(x=Long, y=Lat, size=Recovered),
             color='DarkGreen', alpha=.5)+
  scale_size_continuous(label= comma)+
  labs(size="Recovered", title="Recovered cases of COVID-19")+
  theme_bw()+theme(plot.title= element_text(hjust=0.5))

# Creating animated maps
# anime_recovered <- covid19_recovered %>% 
#    mutate(Week= week(Date)) %>% 
#    group_by(`Country/Region`, Week) %>% 
#    ggplot()+
#    borders("world",color="gray85", fill="gray80", resolution=0.1)+
#    theme_map(base_size = 15)+
#    geom_point(aes(x=Long, y=Lat, size=Recovered),
#               color='DarkGreen', alpha=.5)+
#    scale_size_continuous(label= comma)+
#    labs(size="Recovered", title="Recovered cases of COVID-19")+
#    theme_bw()+theme_update(plot.title= element_text(hjust=0.5))+
#    labs(subtitle="Week: {frame_time}")+
#    transition_time(Week)+
#    shadow_wake(wake_length = 0.1)
# anime_recovered

#glimpse(covid19_updated)
covid19_updated %>% 
  mutate(sex = str_replace_all(sex, c("^female$" = "Female", "^male$" = "Male"))) %>%  # anchored so "Female" is not rewritten to "FeMale"
  group_by(sex) %>% 
  drop_na(sex) %>%
  filter(!sex %in% c("4000", "N/A")) %>% 
  count() %>% 
  ggplot()+ geom_bar(aes(reorder(sex, n), n), stat="identity")

# Counting patients by sex and age band (only the two youngest bands are defined so far)
covid19_updated %>% 
  mutate(sex = str_replace_all(sex, c("^female$" = "Female", "^male$" = "Male")),
         age = suppressWarnings(as.integer(age)),  # non-numeric entries become NA
         age2 = case_when(age <= 9 ~ "0 - 9",
                          age <= 19 ~ "10 - 19",
                          TRUE ~ as.character(age))) %>% 
  drop_na(sex, age) %>%
  filter(!sex %in% c("4000", "N/A")) %>% 
  count(sex, age2) %>% 
  head(n=25)

Symptoms

One of the datasets contains information about the symptoms reported for some patients. Although this data has many missing values, I used text mining techniques on what was available to visualize the most frequently appearing symptoms in COVID-19 patients. I used the tm package to clean the text and converted the symptoms into a term matrix in order to draw a word cloud. The results indicate that the majority of COVID-19 patients with recorded symptoms had fever, cough and sore throat. Some patients also had pneumonia, headache and chills.
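
A word-level term matrix splits multi-word symptoms, so "sore throat" is counted as "sore" and "throat" separately. Counting whole phrases can be a useful cross-check. The sketch below assumes symptoms are separated by commas or semicolons and uses an invented vector rather than the real column.

```r
# Invented free-text entries standing in for the symptoms column
symptoms_demo <- c("fever, cough", "Fever; sore throat", NA,
                   "cough, sore throat, fever", "")

# Split each non-missing entry on commas or semicolons, then trim and lower-case
phrases <- unlist(strsplit(symptoms_demo[!is.na(symptoms_demo)], "[,;]"))
phrases <- tolower(trimws(phrases))
phrases <- phrases[phrases != ""]

# Frequency table of whole phrases, most common first
phrase_freq <- sort(table(phrases), decreasing = TRUE)
phrase_freq
```

Lower-casing before counting merges variants such as "Fever" and "fever", which the word-level pipeline would otherwise keep apart unless the corpus is lower-cased first.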

# Text mining 
symptoms <- covid19_updated$symptoms

words <- Corpus(VectorSource(symptoms))
words <- tm_map(words, removeNumbers)
words <- tm_map(words, removePunctuation)
words <- tm_map(words, stripWhitespace)
words <- tm_map(words, removeWords, stopwords("english"))
#words <- tm_map(words,removeWords, c(""))
head(words)

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 6

tdm <- TermDocumentMatrix((words))
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word= names(v), freq=v)
head(d,50, row.names=FALSE) %>% kable() %>% kable_styling()

	word	freq
fever	fever	364
cough	cough	179
sore	sore	38
throat	throat	38
℃	℃	37
fatigue	fatigue	28
pneumonitis	pneumonitis	19
pneumonia	pneumonia	17
headache	headache	17
chills	chills	16
nose	nose	15
runny	runny	15
chest	chest	13
pain	pain	13
malaise	malaise	13
symptoms	symptoms	12
dry	dry	11
respiratory	respiratory	10
discomfort	discomfort	10
dyspnea	dyspnea	9
sputum	sputum	9
muscle	muscle	9
soreness	soreness	8
weakness	weakness	8
breath	breath	7
shortness	shortness	7
nausea	nausea	7
muscular	muscular	6
diarrhea	diarrhea	6
tightness	tightness	6
asymptomatic	asymptomatic	6
myalgia	myalgia	6
joint	joint	6
acute	acute	5
difficulty	difficulty	5
phlegm	phlegm	5
weak	weak	4
nasal	nasal	4
vomiting	vomiting	4
breathing	breathing	4
mild	mild	4
congestion	congestion	3
distress	distress	3
aches	aches	3
coughing	coughing	3
infection	infection	3
pharyngeal	pharyngeal	3
mouth	mouth	3
anorexia	anorexia	3
low	low	3

# wordcloud

set.seed(1235)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))  # wordcloud() has no width/height arguments; set the size on the graphics device

# Creating a bar chart of the most frequently mentioned symptom terms in COVID-19 patients
ggplot(head(d,30), aes(reorder(word, freq), freq))+geom_bar(stat="identity", fill="DarkRed",width = 0.9)+coord_flip()+
  labs(title="Most frequent symptoms in COVID-19 patients", x="Symptoms", y="Frequency")+theme_classic()

Conclusion

COVID-19 emerged in December 2019 in Wuhan, China, later became a pandemic and spread throughout the world. Initially, China and Italy were the hardest-hit countries, but the United States later became the epicentre, with a huge number of confirmed cases, deaths and recoveries. The data also show that most patients with recorded symptoms had fever, cough and sore throat, with some showing pneumonia and chills. A dashboard summarizing this project has also been built; the link is given below.

Appendix

https://hkhan10.shinyapps.io/corona/

An analysis of COVID-19

Habib Khan

April 16, 2020