Data Manipulation with dplyr using Covid19 Data

In this tutorial, we are going to analyze COVID-19 data using R and the dplyr package.
dplyr is a powerful tool for manipulating, cleaning and summarizing data.

Important dplyr functions

select() select columns
filter() filter rows
arrange() re-order rows
mutate() create new columns
summarise() summarise values
group_by() group rows for grouped operations

Let's first install the dplyr package.
install.packages("dplyr")

# Load dplyr package
library("dplyr")

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Get data set

Before we get started, we need to download our data set. For this example, we are going to use the dataset published by the European Centre for Disease Prevention and Control (ECDC).
The R code for loading the data is copied from their website.

#these libraries need to be loaded

library(utils)

#read the Dataset sheet into "R".
data <- read.csv("https://opendata.ecdc.europa.eu/covid19/casedistribution/csv", na.strings = "", fileEncoding = "UTF-8-BOM")

First 6 rows

#library(knitr)
#kable(head(data))
head(data)

##      dateRep day month year cases deaths countriesAndTerritories geoId
## 1 02/05/2020   2     5 2020   164      4             Afghanistan    AF
## 2 01/05/2020   1     5 2020   222      4             Afghanistan    AF
## 3 30/04/2020  30     4 2020   122      0             Afghanistan    AF
## 4 29/04/2020  29     4 2020   124      3             Afghanistan    AF
## 5 28/04/2020  28     4 2020   172      0             Afghanistan    AF
## 6 27/04/2020  27     4 2020    68     10             Afghanistan    AF
##   countryterritoryCode popData2018 continentExp
## 1                  AFG    37172386         Asia
## 2                  AFG    37172386         Asia
## 3                  AFG    37172386         Asia
## 4                  AFG    37172386         Asia
## 5                  AFG    37172386         Asia
## 6                  AFG    37172386         Asia

Internal structure of an R object

str(data)

## 'data.frame':    14450 obs. of  11 variables:
##  $ dateRep                : Factor w/ 124 levels "01/01/2020","01/02/2020",..: 10 5 121 118 114 110 106 102 98 94 ...
##  $ day                    : int  2 1 30 29 28 27 26 25 24 23 ...
##  $ month                  : int  5 5 4 4 4 4 4 4 4 4 ...
##  $ year                   : int  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
##  $ cases                  : int  164 222 122 124 172 68 112 70 105 84 ...
##  $ deaths                 : int  4 4 0 3 0 10 4 1 2 4 ...
##  $ countriesAndTerritories: Factor w/ 209 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ geoId                  : Factor w/ 209 levels "AD","AE","AF",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ countryterritoryCode   : Factor w/ 205 levels "ABW","AFG","AGO",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ popData2018            : int  37172386 37172386 37172386 37172386 37172386 37172386 37172386 37172386 37172386 37172386 ...
##  $ continentExp           : Factor w/ 6 levels "Africa","America",..: 3 3 3 3 3 3 3 3 3 3 ...
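
The str() output shows that dateRep is stored as a factor of day/month/year strings rather than as a date. If real dates are needed later (for sorting or plotting by time), one option is to convert the column with mutate(). The following is only a minimal sketch: the three-row `toy` data frame is a hypothetical stand-in built from the first rows printed above, and the new column name `date` is an arbitrary choice.

```r
library(dplyr)

# Hypothetical stand-in for the ECDC data (first three rows shown by head())
toy <- data.frame(
  dateRep = c("02/05/2020", "01/05/2020", "30/04/2020"),
  cases   = c(164, 222, 122),
  stringsAsFactors = TRUE   # mimic the factor column reported by str()
)

# mutate() adds a new column; as.character() makes the factor-to-text step explicit
toy_dates <- mutate(toy, date = as.Date(as.character(dateRep), format = "%d/%m/%Y"))

str(toy_dates)
range(toy_dates$date)
```

The same mutate() call should work on the full data frame, assuming every dateRep value follows the day/month/year pattern.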

Dimensions of dataset

dim(data)

## [1] 14450    11

Summary of Data

summary(data)

##        dateRep           day            month             year     
##  02/05/2020:  208   Min.   : 1.00   Min.   : 1.000   Min.   :2019  
##  01/05/2020:  207   1st Qu.: 9.00   1st Qu.: 2.000   1st Qu.:2020  
##  26/04/2020:  206   Median :17.00   Median : 3.000   Median :2020  
##  27/04/2020:  206   Mean   :16.38   Mean   : 3.101   Mean   :2020  
##  28/04/2020:  206   3rd Qu.:24.00   3rd Qu.: 4.000   3rd Qu.:2020  
##  29/04/2020:  206   Max.   :31.00   Max.   :12.000   Max.   :2020  
##  (Other)   :13211                                                  
##      cases             deaths       countriesAndTerritories     geoId      
##  Min.   :-1430.0   Min.   :   0.0   Australia:  124         AT     :  124  
##  1st Qu.:    0.0   1st Qu.:   0.0   Austria  :  124         AU     :  124  
##  Median :    1.0   Median :   0.0   Belgium  :  124         BE     :  124  
##  Mean   :  228.9   Mean   :  16.5   Brazil   :  124         BR     :  124  
##  3rd Qu.:   30.0   3rd Qu.:   1.0   Canada   :  124         CA     :  124  
##  Max.   :48529.0   Max.   :4928.0   China    :  124         CH     :  124  
##                                     (Other)  :13706         (Other):13706  
##  countryterritoryCode  popData2018         continentExp 
##  AUS    :  124        Min.   :1.000e+03   Africa :2563  
##  AUT    :  124        1st Qu.:2.790e+06   America:2661  
##  BEL    :  124        Median :9.942e+06   Asia   :3797  
##  BRA    :  124        Mean   :5.494e+07   Europe :4873  
##  CAN    :  124        3rd Qu.:3.717e+07   Oceania: 492  
##  (Other):13726        Max.   :1.393e+09   Other  :  64  
##  NA's   :  104        NA's   :146

Data manipulation functions

Selecting some columns or fields

The select() function picks out one or more columns.
Here, the glimpse() function is used to take a quick look at the result.
The dplyr package must be loaded to run these functions.

data_select <-  select(data, countriesAndTerritories, deaths)
glimpse(data_select)

## Rows: 14,450
## Columns: 2
## $ countriesAndTerritories <fct> Afghanistan, Afghanistan, Afghanistan, Afgh...
## $ deaths                  <int> 4, 4, 0, 3, 0, 10, 4, 1, 2, 4, 1, 2, 3, 0, ...
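
select() also accepts a few other ways of naming columns. The sketch below shows three of them on a small, hypothetical `toy` data frame that reuses the column names of the ECDC data; the object names `no_geo`, `case_cols` and `renamed` are made up for illustration.

```r
library(dplyr)

# Hypothetical two-row data frame with the same column names as the ECDC data
toy <- data.frame(
  countriesAndTerritories = c("Afghanistan", "Afghanistan"),
  geoId  = c("AF", "AF"),
  cases  = c(164, 222),
  deaths = c(4, 4)
)

# Drop a column with a minus sign
no_geo <- select(toy, -geoId)

# Select a contiguous range of columns by name
case_cols <- select(toy, cases:deaths)

# Rename a column while selecting it
renamed <- select(toy, country = countriesAndTerritories, deaths)

names(no_geo)
names(case_cols)
names(renamed)
```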

Rearranging the rows by variables

data_arrange  <- arrange(data, countriesAndTerritories, deaths)
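
By default arrange() sorts in ascending order. To put the largest values first, a column can be wrapped in desc(). A minimal sketch on a hypothetical `toy` data frame (the Albania values are invented for illustration):

```r
library(dplyr)

# Hypothetical data; only the column names match the ECDC data
toy <- data.frame(
  countriesAndTerritories = c("Afghanistan", "Afghanistan", "Albania", "Albania"),
  deaths = c(4, 10, 1, 0)
)

# Ascending by country, then by deaths within each country
asc <- arrange(toy, countriesAndTerritories, deaths)

# Largest death counts first
desc_deaths <- arrange(toy, desc(deaths))

asc
desc_deaths
```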

Filter

The filter() function can be used to select the rows that satisfy a condition.

# Select only data from Asia
# (the %>% pipe is explained in the next section):
data_filter <- data %>% filter(continentExp == "Asia")
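
filter() can also combine several conditions. In the sketch below, conditions separated by commas must all hold, and %in% matches any value from a set. The `toy` data frame and the result names are hypothetical and only borrow the column names of the ECDC data.

```r
library(dplyr)

# Hypothetical data with the same column names as the ECDC data
toy <- data.frame(
  countriesAndTerritories = c("Afghanistan", "Albania", "Algeria"),
  continentExp = c("Asia", "Europe", "Africa"),
  deaths = c(4, 1, 0)
)

# Conditions separated by commas are combined with AND
asia_with_deaths <- filter(toy, continentExp == "Asia", deaths > 0)

# Keep rows whose continent is any of the listed values
asia_or_europe <- filter(toy, continentExp %in% c("Asia", "Europe"))

asia_with_deaths
asia_or_europe
```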

Pipe operator: %>%

This operator allows us to pipe the output of one function into the input of the next function.

Total number of deaths across all countries:

data %>% summarise_at(vars(deaths), list(sum))

##   deaths
## 1 238431
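
The pipe is most useful when several steps are chained. As a sketch of how group_by() and summarise() could be combined to get one total per continent, consider the following. It runs on a hypothetical `toy` data frame with invented values, and `total_deaths` is an arbitrary column name.

```r
library(dplyr)

# Hypothetical data; values are invented for illustration
toy <- data.frame(
  continentExp = c("Asia", "Asia", "Europe", "Europe", "Africa"),
  deaths = c(4, 10, 1, 2, 0)
)

deaths_by_continent <- toy %>%
  group_by(continentExp) %>%                  # one group per continent
  summarise(total_deaths = sum(deaths)) %>%   # one row per group
  arrange(desc(total_deaths))                 # largest total first

deaths_by_continent
```

Replacing `toy` with `data` should give the per-continent totals for the real data set, assuming the deaths column contains no missing values (the summary above reports none).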

Additional Readings

(1) R Programming For Beginners Part 1
https://www.youtube.com/watch?v=DmX5TW473BA
(2) R tutorial by examples: Data Frame
https://rpubs.com/sheikh/data_frame
(3) Data Manipulation with dplyr using Covid19 Data
https://rpubs.com/sheikh/dplyr