In this tutorial, we analyze COVID-19 data using R and the dplyr package.
dplyr is a powerful tool for manipulating, cleaning, and summarizing data. Its main functions are:
select() selects columns
filter() filters rows
arrange() re-orders rows
mutate() creates new columns
summarise() summarises values
group_by() groups rows so that operations are applied per group
Let's first install the dplyr package.
install.packages("dplyr")
# Load dplyr package
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Before we get started, we need to download our data set. For this example, we use the dataset published by the European Centre for Disease Prevention and Control (ECDC).
The R code for loading the data is copied from their website.
# These libraries need to be loaded
library(utils)
# Read the dataset into R
data <- read.csv("https://opendata.ecdc.europa.eu/covid19/casedistribution/csv", na.strings = "", fileEncoding = "UTF-8-BOM")
#library(knitr)
#kable(head(data))
head(data)
## dateRep day month year cases deaths countriesAndTerritories geoId
## 1 02/05/2020 2 5 2020 164 4 Afghanistan AF
## 2 01/05/2020 1 5 2020 222 4 Afghanistan AF
## 3 30/04/2020 30 4 2020 122 0 Afghanistan AF
## 4 29/04/2020 29 4 2020 124 3 Afghanistan AF
## 5 28/04/2020 28 4 2020 172 0 Afghanistan AF
## 6 27/04/2020 27 4 2020 68 10 Afghanistan AF
## countryterritoryCode popData2018 continentExp
## 1 AFG 37172386 Asia
## 2 AFG 37172386 Asia
## 3 AFG 37172386 Asia
## 4 AFG 37172386 Asia
## 5 AFG 37172386 Asia
## 6 AFG 37172386 Asia
str(data)
## 'data.frame': 14450 obs. of 11 variables:
## $ dateRep : Factor w/ 124 levels "01/01/2020","01/02/2020",..: 10 5 121 118 114 110 106 102 98 94 ...
## $ day : int 2 1 30 29 28 27 26 25 24 23 ...
## $ month : int 5 5 4 4 4 4 4 4 4 4 ...
## $ year : int 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
## $ cases : int 164 222 122 124 172 68 112 70 105 84 ...
## $ deaths : int 4 4 0 3 0 10 4 1 2 4 ...
## $ countriesAndTerritories: Factor w/ 209 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ geoId : Factor w/ 209 levels "AD","AE","AF",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ countryterritoryCode : Factor w/ 205 levels "ABW","AFG","AGO",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ popData2018 : int 37172386 37172386 37172386 37172386 37172386 37172386 37172386 37172386 37172386 37172386 ...
## $ continentExp : Factor w/ 6 levels "Africa","America",..: 3 3 3 3 3 3 3 3 3 3 ...
dim(data)
## [1] 14450 11
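The str() output above reports dateRep as a factor of day/month/year strings rather than a date. The following is a minimal sketch of one way to convert such a column, assuming the format is always dd/mm/yyyy as in the rows shown. The data frame `toy` and its values are made up for illustration and are not part of the ECDC data.

```r
# Toy data frame with made-up values; only the dateRep format mirrors the output above
toy <- data.frame(
  dateRep = c("02/05/2020", "01/05/2020", "30/04/2020"),
  cases   = c(10, 20, 30),
  stringsAsFactors = TRUE  # mimic the factor column reported by str()
)

# Go through character first, then parse as day/month/year
toy$date <- as.Date(as.character(toy$dateRep), format = "%d/%m/%Y")

str(toy$date)
range(toy$date)
```

With a real Date column, sorting and filtering by time behave chronologically instead of alphabetically.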
summary(data)
## dateRep day month year
## 02/05/2020: 208 Min. : 1.00 Min. : 1.000 Min. :2019
## 01/05/2020: 207 1st Qu.: 9.00 1st Qu.: 2.000 1st Qu.:2020
## 26/04/2020: 206 Median :17.00 Median : 3.000 Median :2020
## 27/04/2020: 206 Mean :16.38 Mean : 3.101 Mean :2020
## 28/04/2020: 206 3rd Qu.:24.00 3rd Qu.: 4.000 3rd Qu.:2020
## 29/04/2020: 206 Max. :31.00 Max. :12.000 Max. :2020
## (Other) :13211
## cases deaths countriesAndTerritories geoId
## Min. :-1430.0 Min. : 0.0 Australia: 124 AT : 124
## 1st Qu.: 0.0 1st Qu.: 0.0 Austria : 124 AU : 124
## Median : 1.0 Median : 0.0 Belgium : 124 BE : 124
## Mean : 228.9 Mean : 16.5 Brazil : 124 BR : 124
## 3rd Qu.: 30.0 3rd Qu.: 1.0 Canada : 124 CA : 124
## Max. :48529.0 Max. :4928.0 China : 124 CH : 124
## (Other) :13706 (Other):13706
## countryterritoryCode popData2018 continentExp
## AUS : 124 Min. :1.000e+03 Africa :2563
## AUT : 124 1st Qu.:2.790e+06 America:2661
## BEL : 124 Median :9.942e+06 Asia :3797
## BRA : 124 Mean :5.494e+07 Europe :4873
## CAN : 124 3rd Qu.:3.717e+07 Oceania: 492
## (Other):13726 Max. :1.393e+09 Other : 64
## NA's : 104 NA's :146
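The summary above shows a negative minimum for cases and NA's in two columns. As a hedged sketch, rows like these could be inspected before any further analysis. The data frame `toy` below is hypothetical and only imitates the column names of the real data.

```r
library(dplyr)

# Made-up values; only the column names follow the ECDC data
toy <- data.frame(
  countriesAndTerritories = c("A", "A", "B", "B"),
  cases       = c(5, -2, 7, 0),
  popData2018 = c(1000, 1000, NA, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Rows with a negative daily case count
negative_rows <- filter(toy, cases < 0)
negative_rows
```

Whether such rows should be kept, corrected, or dropped depends on the analysis; this sketch only locates them.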
The select() function picks out columns. For example, we can use it to pick out a single column or more than one.
Here, the glimpse() function is used to get a quick overview of the result.
The dplyr package must be loaded to run this function.
data_select <- select(data, countriesAndTerritories, deaths)
glimpse(data_select)
## Rows: 14,450
## Columns: 2
## $ countriesAndTerritories <fct> Afghanistan, Afghanistan, Afghanistan, Afgh...
## $ deaths <int> 4, 4, 0, 3, 0, 10, 4, 1, 2, 4, 1, 2, 3, 0, ...
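Besides naming columns one by one, select() also accepts a minus sign, helper functions such as starts_with(), and column ranges. The sketch below illustrates these on a small made-up data frame (`toy` is hypothetical; only its column names follow the ECDC data).

```r
library(dplyr)

# Made-up values for illustration
toy <- data.frame(
  countriesAndTerritories = c("A", "B"),
  continentExp = c("Asia", "Europe"),
  cases  = c(10, 20),
  deaths = c(1, 2)
)

# Drop a column with a minus sign
no_cases <- select(toy, -cases)

# Keep every column whose name starts with "co"
co_cols <- select(toy, starts_with("co"))

# Keep a contiguous range of columns
range_cols <- select(toy, cases:deaths)

names(no_cases)
names(co_cols)
names(range_cols)
```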
# Sort rows by country, then by number of deaths (ascending)
data_arrange <- arrange(data, countriesAndTerritories, deaths)
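arrange() sorts in ascending order by default. As a small sketch, wrapping a column in desc() reverses the order for that column only. The data frame `toy` is made up for illustration.

```r
library(dplyr)

# Made-up values for illustration
toy <- data.frame(
  countriesAndTerritories = c("B", "A", "A", "B"),
  deaths = c(3, 1, 5, 2)
)

# Ascending by country, then descending by deaths within each country
sorted <- arrange(toy, countriesAndTerritories, desc(deaths))
sorted
```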
The filter() function can be used to select rows that match a condition.
# Select only data from Asia:
data_filter <- data %>% filter(continentExp == "Asia")
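filter() can also take several conditions at once. The following sketch, on a hypothetical data frame `toy` with made-up values, shows conditions combined with AND (separated by commas) and a match against several values.

```r
library(dplyr)

# Made-up values for illustration
toy <- data.frame(
  countriesAndTerritories = c("A", "B", "C", "D"),
  continentExp = c("Asia", "Asia", "Europe", "Africa"),
  deaths = c(0, 12, 30, 4)
)

# Comma-separated conditions must all hold (logical AND)
asia_with_deaths <- toy %>% filter(continentExp == "Asia", deaths > 0)

# Keep rows whose continent is any of the listed values
asia_or_europe <- toy %>% filter(continentExp %in% c("Asia", "Europe"))

asia_with_deaths
asia_or_europe
```

As in the example above, both calls pass the data frame into filter() with `%>%`.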
The %>% operator used above allows us to pipe the output of one function into the input of another function.
Total number of deaths across all countries:
data %>% summarise_at(vars(deaths), list(sum))
## deaths
## 1 238431
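The same kind of total can be computed per group by adding group_by() before summarise(), and mutate() can then derive a new column from the totals. This is a minimal sketch on a hypothetical data frame `toy` with made-up numbers. The names `total_deaths`, `total_cases`, and `deaths_per_100_cases` are illustrative, not part of the ECDC data.

```r
library(dplyr)

# Made-up values for illustration
toy <- data.frame(
  continentExp = c("Asia", "Asia", "Europe", "Europe"),
  deaths       = c(1, 2, 10, 20),
  cases        = c(10, 20, 100, 100)
)

by_continent <- toy %>%
  group_by(continentExp) %>%
  summarise(
    total_deaths = sum(deaths, na.rm = TRUE),
    total_cases  = sum(cases, na.rm = TRUE)
  ) %>%
  # mutate() adds a column computed from existing ones
  mutate(deaths_per_100_cases = 100 * total_deaths / total_cases)

by_continent
```

Using na.rm = TRUE is a design choice here: it keeps a single missing value from turning a whole group total into NA.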