Dataset Source: https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset
library(readr)
corona <- read_csv("R:/Datasets/novel-corona-virus-2019-dataset/2019_nCoV_data.csv")
## Parsed with column specification:
## cols(
## Sno = col_double(),
## Date = col_character(),
## `Province/State` = col_character(),
## Country = col_character(),
## `Last Update` = col_character(),
## Confirmed = col_double(),
## Deaths = col_double(),
## Recovered = col_double()
## )
Viewing the head of the table
head(corona)
## # A tibble: 6 x 8
## Sno Date `Province/State` Country `Last Update` Confirmed Deaths Recovered
## <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 01/22… Anhui China 01/22/2020 1… 1 0 0
## 2 2 01/22… Beijing China 01/22/2020 1… 14 0 0
## 3 3 01/22… Chongqing China 01/22/2020 1… 6 0 0
## 4 4 01/22… Fujian China 01/22/2020 1… 1 0 0
## 5 5 01/22… Gansu China 01/22/2020 1… 0 0 0
## 6 6 01/22… Guangdong China 01/22/2020 1… 26 0 0
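We start with a quick descriptive summary of the data: the sum of the confirmed, death, and recovered columns over all rows, and the maximum value in each column. Note that the dataset has one row per location per reporting date and the counts appear to be cumulative, so these sums add up the same cases across dates and overstate the actual worldwide totals.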
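sum(corona$Confirmed) # Sum of confirmed counts over all rows
## [1] 781452
sum(corona$Deaths) # Sum of death counts over all rows
## [1] 17949
sum(corona$Recovered) # Sum of recovered counts over all rows
## [1] 76258
max(corona$Deaths) # Largest death count in a single row
## [1] 1789
max(corona$Recovered) # Largest recovered count in a single row
## [1] 7862
max(corona$Confirmed) # Largest confirmed count in a single row
## [1] 59989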
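Because each row seems to hold a cumulative count for one location on one reporting date, a total that avoids double counting can be sketched by keeping only the most recent row per location before summing. The toy data frame below is made up for illustration (it is not the Kaggle file), and the column names `Location` and `Date` are assumptions:

```r
# Hypothetical toy data: cumulative counts per location per reporting date
toy <- data.frame(
  Location  = c("A", "A", "A", "B", "B"),
  Date      = as.Date(c("2020-01-22", "2020-01-23", "2020-01-24",
                        "2020-01-23", "2020-01-24")),
  Confirmed = c(1, 4, 9, 2, 3)
)

# Summing every row counts the same cases once per reporting date
naive_total <- sum(toy$Confirmed)

# Sort by location and date, then keep only the last (most recent) row per location
toy <- toy[order(toy$Location, toy$Date), ]
latest <- toy[!duplicated(toy$Location, fromLast = TRUE), ]
latest_total <- sum(latest$Confirmed)

naive_total   # 19
latest_total  # 12
```

On the real data the same idea would need the `Date` column parsed into a date type first, and a location key that combines country and province.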
Let us view the number of confirmed cases in different countries graphically. We will use ggplot2 to construct a bar chart.
library(ggplot2)
options(scipen = 1) # Bias output toward fixed notation over scientific notation
ggplot(corona) +
  aes(Country, Confirmed) +
  geom_bar(stat = "identity") +
  coord_flip()
We can see that the number of cases in Mainland China is so large that it makes the cases in other countries very difficult to see. Therefore, we will exclude Mainland China and China from the graph and analyse them separately.
We remove the rows whose Country is “Mainland China” or “China” and store the rest in a data frame named “world”, which holds the data for all other countries.
world = corona[!(corona$Country %in% c("Mainland China", "China")), ]
We then aggregate the data by country for easier analysis.
world_confirm = aggregate(world$Confirmed~world$Country, FUN = sum)
colnames(world_confirm)= c("Country", "Confirmed")
head(world_confirm)
## Country Confirmed
## 1 Australia 284
## 2 Belgium 14
## 3 Brazil 0
## 4 Cambodia 22
## 5 Canada 116
## 6 Egypt 4
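The same aggregation can also be written with the `data` argument of `aggregate()`, which keeps the original column names so no renaming step is needed. A minimal sketch on a made-up data frame (`toy_world` and its values are hypothetical, not taken from the dataset):

```r
# Hypothetical toy data with several rows per country
toy_world <- data.frame(
  Country   = c("Japan", "Japan", "Singapore", "Singapore", "Brazil"),
  Confirmed = c(2, 4, 3, 5, 0)
)

# The formula refers to columns of `data`, so the result keeps the
# names "Country" and "Confirmed"
by_country <- aggregate(Confirmed ~ Country, data = toy_world, FUN = sum)

# Sort from the largest to the smallest total
by_country <- by_country[order(-by_country$Confirmed), ]
by_country
```

With the real data, this would presumably read `aggregate(Confirmed ~ Country, data = world, FUN = sum)`.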
ggplot(world_confirm) +
  aes(reorder(Country, Confirmed), Confirmed) +
  geom_bar(stat = "identity", fill = "Purple", color = "white") +
  geom_text(aes(label = Confirmed), size = 3.5, hjust = -0.3) + # x and y are inherited
  coord_flip() +
  xlab("Country")
When the data was analysed (Feb 17, 2020), the largest numbers of confirmed cases outside China were in Singapore, Hong Kong and Japan. Now it is time to look at the deaths in the rest of the world (excluding China). We will prepare the data for analysis and plot it. Similarly, we will also look at the graph for recovered cases.
world_death = aggregate(world$Deaths~world$Country, FUN = sum) # Deaths summed by country
colnames(world_death) = c("Country", "Deaths") # Changing column names
head(world_death)
## Country Deaths
## 1 Australia 0
## 2 Belgium 0
## 3 Brazil 0
## 4 Cambodia 0
## 5 Canada 0
## 6 Egypt 0
Since some countries have zero deaths, we want to exclude them from the graph. We therefore use the dplyr package to keep only the countries with more than zero deaths.
library(dplyr)
library(kableExtra)
world_death = filter(world_death, Deaths > 0) # Keep countries with at least one death
kable(world_death) %>%
  kable_styling()
| Country | Deaths |
|---|---|
| France | 3 |
| Hong Kong | 14 |
| Japan | 5 |
| Philippines | 17 |
| Taiwan | 2 |
It is now time to analyse the coronavirus cases in Mainland China.
First, we collect the rows labelled “Mainland China” and “China” into a single table and exclude all other countries.
china = filter(corona, corona$Country == "China" | corona$Country == "Mainland China")
Next, we aggregate the confirmed, death, and recovered counts by province and combine them into one table.
confirmed = aggregate(china$Confirmed~china$`Province/State`, FUN = sum)
deaths = aggregate(china$Deaths~china$`Province/State`, FUN = sum)
recovered = aggregate(china$Recovered~china$`Province/State`, FUN = sum)
# Binding column-wise assumes the three results list the provinces in the same order
province = cbind.data.frame(confirmed, deaths$`china$Deaths`, recovered$`china$Recovered`)
colnames(province) = c("Province/State", "Confirmed", "Deaths", "Recovered")
colnames(deaths) = c("Province", "Deaths") # Renamed for the plots of deaths by province
kable(province) %>%
kable_styling() %>%
scroll_box(width = "800px", height = "300px")
| Province/State | Confirmed | Deaths | Recovered |
|---|---|---|---|
| Anhui | 13622 | 46 | 1729 |
| Beijing | 5953 | 42 | 876 |
| Chongqing | 8708 | 50 | 1340 |
| Fujian | 4614 | 1 | 614 |
| Gansu | 1424 | 19 | 374 |
| Guangdong | 20084 | 19 | 3389 |
| Guangxi | 3709 | 16 | 414 |
| Guizhou | 1806 | 13 | 297 |
| Hainan | 2425 | 48 | 361 |
| Hebei | 3917 | 42 | 747 |
| Heilongjiang | 5405 | 117 | 470 |
| Henan | 18153 | 126 | 3267 |
| Hong Kong | 0 | 0 | 0 |
| Hubei | 589921 | 17228 | 44863 |
| Hunan | 15347 | 19 | 3564 |
| Inner Mongolia | 999 | 0 | 77 |
| Jiangsu | 8603 | 0 | 1538 |
| Jiangxi | 13057 | 9 | 1906 |
| Jilin | 1243 | 11 | 227 |
| Liaoning | 1978 | 6 | 262 |
| Macau | 1 | 0 | 0 |
| Ningxia | 930 | 0 | 245 |
| Qinghai | 324 | 0 | 93 |
| Shaanxi | 3676 | 0 | 500 |
| Shandong | 7934 | 17 | 1239 |
| Shanghai | 5382 | 24 | 965 |
| Shanxi | 1986 | 0 | 402 |
| Sichuan | 7324 | 24 | 1171 |
| Taiwan | 1 | 0 | 0 |
| Tianjin | 1738 | 25 | 227 |
| Tibet | 20 | 0 | 6 |
| Xinjiang | 903 | 6 | 52 |
| Yunnan | 2721 | 0 | 331 |
| Zhejiang | 19592 | 0 | 3894 |
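Building the table by binding three separate `aggregate()` results relies on all three returning the provinces in the same order. One alternative, sketched below on made-up numbers, aggregates the three columns in a single call by putting `cbind()` on the left-hand side of the formula. The data frame `toy_china` and the column name `Province` are assumptions for this example:

```r
# Hypothetical toy data: two provinces, two reporting dates each
toy_china <- data.frame(
  Province  = c("Hubei", "Hubei", "Anhui", "Anhui"),
  Confirmed = c(10, 20, 1, 2),
  Deaths    = c(1, 2, 0, 0),
  Recovered = c(0, 3, 0, 1)
)

# One call sums all three columns per province, so the rows cannot get misaligned
province_toy <- aggregate(cbind(Confirmed, Deaths, Recovered) ~ Province,
                          data = toy_china, FUN = sum)
province_toy
```

For the real table, the grouping column would have to be written with backticks (`` `Province/State` ``) because its name contains a slash.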
Now let’s visualize the data. We expect the bar for Hubei province to completely overshadow the bars of the other provinces. We will visualize only the deaths.
ggplot(deaths) +
  aes(Province, Deaths, fill = Province) +
  geom_bar(stat = "identity") +
  labs(x = "Province", y = "Deaths") + # labs() names the aesthetics, also after coord_flip()
  coord_flip()
Let’s see the graph without Hubei province.
death = deaths[deaths$Province != "Hubei", ] # Dropping Hubei by name rather than by row position
ggplot(death) +
  aes(Province, Deaths) +
  geom_bar(stat = "identity") +
  labs(x = "Province/State", y = "Total Deaths") +
  coord_flip()
Thank you for your time. I will add more analysis of the coronavirus data in the future.