COVID-19 dataset
- Data overview
- Top 10 countries with the most confirmed cases
Daily statistics

COVID-19 dataset

In this part we pull COVID-19 data from the GitHub repository of the author RamiKrispin. The coronavirus package provides the data already preprocessed into the variables we need. The data is collected directly from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) Coronavirus repository. Each record has the variables date, province, country, lat (latitude), long (longitude), type (confirmed, death, recovered), and cases (the number of daily cases).
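
To make the layout concrete, the following is a minimal sketch of a data frame with the same seven columns. The country name and every value in it are invented for illustration; only the column names and the three type labels come from the description above.

```r
# Toy rows mimicking the layout of the dataset; all values are invented.
toy <- data.frame(
  date     = as.Date(c("2020-03-01", "2020-03-01", "2020-03-01")),
  province = c("", "", ""),
  country  = c("Atlantis", "Atlantis", "Atlantis"),  # hypothetical country
  lat      = c(10, 10, 10),
  long     = c(20, 20, 20),
  type     = c("confirmed", "death", "recovered"),
  cases    = c(12, 1, 3),
  stringsAsFactors = FALSE
)

# One row per date, location and type: a "long" layout rather than
# one column per case type.
str(toy)
```

In this layout a single day for a single location takes three rows, one per case type, which is why the later steps filter or pivot on type.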

# Install the development version from GitHub first (needs devtools),
# then load the package.
library(devtools)
devtools::install_github("RamiKrispin/coronavirus")
library(coronavirus)

## v  checking for file 'C:\Users\Admin\AppData\Local\Temp\RtmpAb5ppB\remotes30945178562e\RamiKrispin-coronavirus-b244498/DESCRIPTION'
## -  preparing 'coronavirus': (484ms)
## v  checking DESCRIPTION meta-information
## -  checking for LF line-endings in source and make files and shell scripts
## -  checking for empty or unneeded directories
## -  building 'coronavirus_0.3.3.tar.gz'

update_dataset()

## Updates are available on the coronavirus Dev version, do you want to update? n/Y

#covid19_df <- refresh_coronavirus_jhu()
data("coronavirus")
mycovid19_df <- coronavirus

Data overview

In this part we list all the variables in the dataset and compute summary statistics for them.

str(mycovid19_df)

## 'data.frame':    399996 obs. of  7 variables:
##  $ date    : Date, format: "2020-01-22" "2020-01-22" ...
##  $ province: chr  "" "" "" "" ...
##  $ country : chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
##  $ lat     : num  33.9 41.2 28 42.5 -11.2 ...
##  $ long    : num  67.71 20.17 1.66 1.52 17.87 ...
##  $ type    : chr  "confirmed" "confirmed" "confirmed" "confirmed" ...
##  $ cases   : num  0 0 0 0 0 0 0 0 0 0 ...

summary(mycovid19_df)

##       date              province           country               lat         
##  Min.   :2020-01-22   Length:399996      Length:399996      Min.   :-51.796  
##  1st Qu.:2020-05-23   Class :character   Class :character   1st Qu.:  4.788  
##  Median :2020-09-23   Mode  :character   Mode  :character   Median : 21.513  
##  Mean   :2020-09-23                                         Mean   : 19.987  
##  3rd Qu.:2021-01-24                                         3rd Qu.: 40.143  
##  Max.   :2021-05-27                                         Max.   : 71.707  
##                                                             NA's   :2460     
##       long             type               cases          
##  Min.   :-178.12   Length:399996      Min.   :-349116.0  
##  1st Qu.: -15.18   Class :character   1st Qu.:      0.0  
##  Median :  21.75   Mode  :character   Median :      0.0  
##  Mean   :  24.07                      Mean   :    712.1  
##  3rd Qu.:  85.24                      3rd Qu.:     34.0  
##  Max.   : 178.06                      Max.   :1123456.0  
##  NA's   :2460
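
The summary above reports a negative minimum for cases. A plausible reading, which this report does not confirm, is that daily values are derived from cumulative counts and go negative when the source revises a total downward. The sketch below shows one way to inspect such rows; it uses a toy data frame with invented values, and the same indexing could be applied to mycovid19_df.

```r
# Toy daily counts; the values are invented for illustration only.
toy <- data.frame(
  country = c("A", "A", "A", "B", "B"),
  type    = "confirmed",
  cases   = c(5, -2, 7, 0, 3)
)

# Rows whose daily count is negative
negative_rows <- toy[toy$cases < 0, ]
print(negative_rows)

# Share of rows with a negative count
negative_share <- mean(toy$cases < 0)
print(negative_share)
```

Whether to keep, clip or investigate such rows depends on the analysis; the totals computed later in this report simply sum them as they are.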

Top 10 countries with the most confirmed cases

library(dplyr)

# Total confirmed cases per country, largest first
summary_df <- mycovid19_df %>%
  filter(type == "confirmed") %>%
  group_by(country) %>%
  summarise(total_case_confirmed = sum(cases)) %>%
  arrange(desc(total_case_confirmed))

summary_df %>% head(n = 10)

## # A tibble: 10 x 2
##    country        total_case_confirmed
##    <chr>                         <dbl>
##  1 US                         33217995
##  2 India                      27555457
##  3 Brazil                     16342162
##  4 France                      5697076
##  5 Turkey                      5220549
##  6 Russia                      4977332
##  7 United Kingdom              4489552
##  8 Italy                       4205970
##  9 Germany                     3673990
## 10 Argentina                   3663215
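
As a cross-check of the filter, group, sum and sort steps, the same ranking can be sketched in base R. The toy data frame and its values below are invented; only the column names match the dataset.

```r
# Toy data; countries and counts are invented for illustration.
toy <- data.frame(
  country = c("A", "B", "A", "C", "B", "A"),
  type    = c("confirmed", "confirmed", "death",
              "confirmed", "confirmed", "confirmed"),
  cases   = c(10, 4, 1, 7, 6, 5)
)

# Keep confirmed rows, sum cases per country, sort in decreasing order
confirmed <- toy[toy$type == "confirmed", ]
totals <- aggregate(cases ~ country, data = confirmed, FUN = sum)
totals <- totals[order(-totals$cases), ]

# Top 2 here; use n = 10 on the real data
top2 <- head(totals, n = 2)
print(top2)
```

Running both versions on the real data and comparing the results is a cheap way to catch a mistyped filter or grouping column.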

Daily statistics

COVID-19 cases on the most recent day, by country

library(tidyr)
mycovid19_df %>% 
  filter(date == max(date)) %>% 
  select(country, type, cases) %>% 
  group_by(country, type) %>% 
  summarise(total_cases = sum(cases)) %>% 
  pivot_wider(names_from = type, values_from = total_cases) %>% 
  arrange(-confirmed)

## # A tibble: 193 x 4
## # Groups:   country [193]
##    country   confirmed death recovered
##    <chr>         <dbl> <dbl>     <dbl>
##  1 India        186364  3660    259459
##  2 Brazil        67467  2245    183636
##  3 Argentina     41080   547     38186
##  4 US            27525  1338         0
##  5 Colombia      25092   513     22425
##  6 France        13933   142       977
##  7 Iran           9994   165     17401
##  8 Russia         8911   395      9593
##  9 Turkey         8426   183     13102
## 10 Chile          8105   185      5542
## # ... with 183 more rows
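
The reshaping step, from one row per country and type to one column per type, can also be sketched with base R's xtabs. This is an illustrative alternative on invented toy data, not the method used in this report.

```r
# Toy data covering two dates; all values are invented.
toy <- data.frame(
  date    = as.Date(c("2021-05-26", "2021-05-27", "2021-05-27",
                      "2021-05-27", "2021-05-27")),
  country = c("A", "A", "A", "B", "B"),
  type    = c("confirmed", "confirmed", "death", "confirmed", "death"),
  cases   = c(9, 5, 1, 8, 2)
)

# Keep only the most recent date
latest <- toy[toy$date == max(toy$date), ]

# Cross-tabulate: one row per country, one column per type
wide <- xtabs(cases ~ country + type, data = latest)

# Sort rows by confirmed cases, largest first
wide <- wide[order(-wide[, "confirmed"]), ]
print(wide)
```

One difference from pivot_wider is that xtabs fills a missing country and type combination with 0 rather than NA.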

Visualizing worldwide data with plotly

library(plotly)
mycovid19_df %>% 
  group_by(type, date) %>% 
  summarise(total_cases = sum(cases)) %>% 
  pivot_wider(names_from = type, values_from = total_cases) %>% 
  arrange(date) %>% 
  mutate(active = confirmed - death - recovered) %>% 
  mutate(active_total = cumsum(active), recovered_total = cumsum(recovered),
         death_total = cumsum(death)) %>% 
  plot_ly(x = ~ date, 
          y = ~ active_total,
          name = "Active",
          fillcolor = "#1f77b4",
          type = "scatter",
          mode = "lines",
          stackgroup = "one") %>% 
  add_trace(y = ~ death_total, 
            name = "Death",
            fillcolor = "#E41317") %>% 
  add_trace(y = ~ recovered_total,
            name = "Recovered",
            fillcolor = "forestgreen") %>% 
  layout(title = "The distribution of COVID 19 Cases Worldwide",
         legend = list(x = 0.5, y = 0.9),
         yaxis = list(title = "Number of cases"),
         xaxis = list(title = "Source: Johns Hopkins University Center for Systems Science and Engineering"))
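
The arithmetic behind the stacked areas can be checked on a small example. The sketch below uses invented daily counts and repeats the active and cumsum steps from the pipeline above. It shows that the three running totals add up to the cumulative number of confirmed cases, which is what the top edge of the stacked chart represents.

```r
# Toy worldwide daily counts; all values are invented.
daily <- data.frame(
  date      = as.Date("2020-03-01") + 0:3,
  confirmed = c(10, 20, 15, 5),
  death     = c(0, 1, 2, 1),
  recovered = c(0, 3, 10, 12)
)

# Daily change in active cases: new cases minus deaths and recoveries
daily$active <- daily$confirmed - daily$death - daily$recovered

# Running totals, as in the plotly pipeline
daily$active_total    <- cumsum(daily$active)
daily$death_total     <- cumsum(daily$death)
daily$recovered_total <- cumsum(daily$recovered)

# The stacked layers should sum to cumulative confirmed cases
stacked_total <- daily$active_total + daily$death_total + daily$recovered_total
totals_match  <- all(stacked_total == cumsum(daily$confirmed))
print(daily)
print(totals_match)
```

Note that the daily active value can be negative on days when recoveries and deaths exceed new cases; only its running total is plotted.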

Treemap visualization by country

library(plotly)
country_covid19 <- mycovid19_df %>% 
  filter(type == "confirmed") %>% 
  group_by(country) %>% 
  summarise(total_cases  = sum(cases)) %>% 
  arrange(-total_cases) %>% 
  mutate(parents = "Confirmed") %>% 
  ungroup()

plot_ly(data = country_covid19,
        type = "treemap",
        values = ~ total_cases,
        labels = ~ country,
        parents = ~ parents,
        domain = list(column=0),
        name = "Confirmed",
        textinfo = "label+value+percent parent")
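
The treemap asks plotly to label each tile with its percentage of the parent. Since every country has the same parent, that percentage should correspond to the country's share of the overall total. This reading is an assumption and has not been checked against the plotly documentation here. The sketch below computes such shares by hand on invented totals.

```r
# Toy per-country totals; the values are invented.
totals <- data.frame(
  country     = c("A", "B", "C"),
  total_cases = c(50, 30, 20)
)

# Share of each country in the overall total, as a percentage
totals$share_pct <- 100 * totals$total_cases / sum(totals$total_cases)
print(totals)
```

Computing the shares separately also gives a table that can be reported alongside the chart.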

COVID-19 Visualization

TRẦN QUANG QUÝ

Đại học Công nghệ Thông tin & Truyền thông Thái Nguyên

05 September 2021