This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter.
Hello, and welcome to the first session in this R series, our contribution (as data scientists and data scientists in the making) to the fight against this pandemic. We plan to analyze the data to help others understand and monitor the spread of the coronavirus disease, and ultimately help curb it. Join us as we embark on this cause, in the hope that it helps everyone keep an eye on the outbreak.
So let's get started.
## [1] source: R for Data Science by Hadley Wickham and Garrett Grolemund
We need data to import, and good, reliable data at that. So where will we obtain it? Here: https://github.com/CSSEGISandData/COVID-19, the data repository for the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE), supported by the ESRI Living Atlas Team and the Johns Hopkins University Applied Physics Lab (JHU APL).
Navigate to this page of the repository: https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series, which contains the global time series data of the pandemic (confirmed, recovered and succumbed). Clicking on one of the datasets brings up an Excel-like table. To read the data into R we need the address of the raw CSV file behind each table, so let’s store the three URLs:
global_conf_url <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
global_recov_url <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv"
global_succumbed_url <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"
## [1] artwork by @allison_horst
Now that we have our URLs, let's get rolling by loading the required package: readr, which is part of the tidyverse. The tidyverse is a collection of packages designed to make the data science process fast, fluent and fun. The packages in the tidyverse share a common philosophy of data and R programming, and are designed to work together naturally. It is convenient to load the whole tidyverse at once rather than one package at a time.
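The chunk that produced the attach message below is not shown; it was presumably a single `library()` call. A minimal sketch, assuming the tidyverse is already installed (the install line is left commented out):

```r
# install.packages("tidyverse")  # one-time setup, only if the package is missing
library(tidyverse)  # attaches readr, dplyr, tidyr, ggplot2 and the other core packages

# readr is now on the search path, so read_csv() can be called directly.
"package:readr" %in% search()
```

The two conflicts reported in the message are expected: after attaching dplyr, `filter()` and `lag()` refer to the dplyr versions unless you write `stats::filter()` or `stats::lag()` explicitly.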
## -- Attaching packages ------------------------ tidyverse 1.2.1 --
## v ggplot2 3.2.1 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## Warning: package 'tidyr' was built under R version 3.6.3
## -- Conflicts --------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
All R statements where you create objects (assignment statements) have the same form: object_name <- value. When reading that code, say “object name gets value” in your head.
You will make lots of assignments, and <- is a pain to type. Don’t be lazy and use =: it will work, but it will cause confusion later. Instead, use RStudio’s keyboard shortcut: Alt + - (the minus sign). Object names must start with a letter, and can only contain letters, numbers, _, and .
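The chunk that read the three files is not shown; it presumably calls `read_csv()` once per URL and assigns each result to a name. The sketch below shows the same assignment pattern on a tiny made-up CSV with the same column layout, so it runs without network access. The names `toy_csv` and `toy_conf` and all the values in the sample are illustrative assumptions, not part of the real data.

```r
library(readr)

# In the notebook, each dataset would be read straight from its URL, e.g.
#   global_conf <- read_csv(global_conf_url)
# Here a string containing newlines is parsed as literal CSV data instead.
toy_csv <- "Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
Queensland,Australia,-27.5,153,0,1
,Kenya,-0.02,37.9,0,0
"

# Read as: "toy_conf gets the parsed table".
toy_conf <- read_csv(toy_csv)
toy_conf
```

Note that the empty Province/State field for Kenya is read as `NA`, and that each date becomes its own numeric column.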
## Parsed with column specification:
## cols(
## .default = col_double(),
## `Province/State` = col_character(),
## `Country/Region` = col_character()
## )
## See spec(...) for full column specifications.
## Parsed with column specification:
## cols(
## .default = col_double(),
## `Province/State` = col_character(),
## `Country/Region` = col_character()
## )
## See spec(...) for full column specifications.
## Parsed with column specification:
## cols(
## .default = col_double(),
## `Province/State` = col_character(),
## `Country/Region` = col_character()
## )
## See spec(...) for full column specifications.
Easy, right? Now that we have our data, let’s run some sanity checks on it.
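There are three interrelated rules which make a dataset tidy: each variable must have its own column, each observation must have its own row, and each value must have its own cell.
Data wrangling is the art of getting your data into R in a useful form for visualization and modeling. It is very important: without it you can’t work with your own data!
The first step is always to figure out what the variables and observations are. The second step is to resolve one of two common problems: one variable might be spread across multiple columns, or one observation might be scattered across multiple rows.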
To fix these problems, you’ll need the two most important functions in tidyr: gather() and spread().
Our problem calls for gather(). This is because the date columns are not actually variables; their names are values of a variable that could be called date.
Let’s begin by removing some columns that we won’t be needing for now: the Lat and Long columns.
Before we go any further, we need to introduce a powerful new idea: the pipe, %>%, from the magrittr package, which pipes an object forward into a function or call expression. A good way to pronounce %>% when reading code is “then.” Let’s try it out:
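A minimal sketch of dropping the two columns with the pipe, using a small stand-in table (`toy_conf` and its values are made up for illustration; in the notebook the same call would be applied to each of the three datasets):

```r
library(dplyr)
library(tibble)

toy_conf <- tibble(
  `Province/State` = c("Queensland", NA),
  `Country/Region` = c("Australia", "Kenya"),
  Lat = c(-27.5, -0.02),
  Long = c(153, 37.9),
  `1/22/20` = c(0, 0),
  `1/23/20` = c(1, 0)
)

# Read as: take toy_conf, then drop the Lat and Long columns.
toy_conf <- toy_conf %>% select(-Lat, -Long)
names(toy_conf)
```

Without the pipe, the same step would be written `select(toy_conf, -Lat, -Long)`; the pipe pays off once several steps are chained.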
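To tidy a dataset like this, we need to gather those columns into a new pair of variables. To describe that operation we need three parameters: the set of columns that represent values rather than variables (here, the date columns); the key, which is the name of the variable whose values form the column names (here, date); and the value, which is the name of the variable whose values are spread over the cells (here, confirmed, recovered or succumbed).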
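The gathering step can be sketched as follows, again on a made-up stand-in (`toy_conf` and `toy_long` are illustrative names). The two identifying columns are excluded with a minus sign, so every remaining column is treated as a date column:

```r
library(dplyr)
library(tibble)
library(tidyr)

toy_conf <- tibble(
  `Province/State` = c("Queensland", NA),
  `Country/Region` = c("Australia", "Kenya"),
  `1/22/20` = c(0, 0),
  `1/23/20` = c(1, 0)
)

# key   = name of the new column holding the old column names (the dates)
# value = name of the new column holding the cell values (the counts)
toy_long <- toy_conf %>%
  gather(key = "date", value = "confirmed",
         -`Province/State`, -`Country/Region`)
toy_long
```

Each original row now appears once per date. In tidyr 1.0.0 and later, `pivot_longer()` is the recommended replacement for `gather()`, which still works but is no longer being developed.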
Let’s see the structure of our data at this point (the output below is in the format printed by glimpse()):
## Observations: 25,872
## Variables: 4
## $ `Province/State` <chr> " ", " ", " ", " ", " ", " ", " ", " ", "Aust...
## $ `Country/Region` <chr> "Afghanistan", "Albania", "Algeria", "Andorra...
## $ date <chr> "1/22/20", "1/22/20", "1/22/20", "1/22/20", "...
## $ confirmed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## Observations: 24,500
## Variables: 4
## $ `Province/State` <chr> " ", " ", " ", " ", " ", " ", " ", " ", "Aust...
## $ `Country/Region` <chr> "Afghanistan", "Albania", "Algeria", "Andorra...
## $ date <chr> "1/22/20", "1/22/20", "1/22/20", "1/22/20", "...
## $ recovered <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## Observations: 25,872
## Variables: 4
## $ `Province/State` <chr> " ", " ", " ", " ", " ", " ", " ", " ", "Aust...
## $ `Country/Region` <chr> "Afghanistan", "Albania", "Algeria", "Andorra...
## $ date <chr> "1/22/20", "1/22/20", "1/22/20", "1/22/20", "...
## $ succumbed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
global_conf and global_succ have the same structure: 25,872 observations (rows) and 4 variables (columns), while global_recov has 24,500 observations and 4 variables.
That means we cannot merge them yet. But they still have columns in common: Province/State, Country/Region and date.
In a numeric vector, every entry must be a number. To store character strings, vectors can also be of class character. A factor stores categorical data, e.g. male/female or benign/malignant.
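A short base R illustration of the three classes; the vector names and values are made up for the example:

```r
counts <- c(0, 3, 12)                                   # numeric
countries <- c("Kenya", "Australia", "Kenya")           # character
outcome <- factor(c("benign", "malignant", "benign"))   # factor (categorical)

class(counts)
class(countries)
class(outcome)

# A factor also records its set of categories.
levels(outcome)
```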
Let’s convert the date column into a date object using the lubridate package and, in doing so, introduce the accessor $.
For our analysis, we will need to access the different variables represented by the columns of this data frame. To do this, we use the accessor operator $ in the following way:
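A minimal sketch of the conversion, assuming the dates are month/day/year strings such as "1/22/20" as in the structure output earlier (`toy_recov` and its values are illustrative):

```r
library(tibble)
library(lubridate)

toy_recov <- tibble(
  date = c("1/22/20", "1/23/20"),
  recovered = c(0, 2)
)

# $ pulls a single column out of the data frame as a vector.
class(toy_recov$date)  # still character at this point

# mdy() parses month/day/year strings; assigning back through $ replaces the column.
toy_recov$date <- mdy(toy_recov$date)
class(toy_recov$date)
toy_recov
```

Once the column is a proper date, it sorts chronologically and can be used on a plot axis, which would not work reliably with the original strings.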
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
## Observations: 24,500
## Variables: 4
## $ `Province/State` <chr> " ", " ", " ", " ", " ", " ", " ", " ", "Aust...
## $ `Country/Region` <chr> "Afghanistan", "Albania", "Algeria", "Andorra...
## $ date <date> 2020-01-22, 2020-01-22, 2020-01-22, 2020-01-...
## $ recovered <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
group_by() takes an existing tbl and converts it into a grouped tbl where operations are performed “by group”.
summarise() collapses many values down to a single summary. It is not terribly useful unless we pair it with group_by().
Why do we need group_by()? Great question. Some countries, like Australia, have 8 entries in the Province/State column, while others, like Kenya, have no entry there. For our analysis right now, we want to aggregate all the cases of Australia’s states into one single value for Australia.
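# Toy census data; county is assigned at random, so the totals differ between runs.
census <- tibble(
  gender = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
  pop = c(8, 10, 45, 89, 100, 45, 8, 132),
  county = sample(1:2, 8, replace = TRUE)
)
# Total population for each gender/county combination.
grouped_data <- census %>%
  group_by(gender, county) %>%
  summarise(total_pop = sum(pop))
grouped_data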
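Applied to the long-format data, the same pattern collapses the province rows into one row per country and date. A minimal sketch on made-up values (`toy_recov` and `by_country` are illustrative names; the notebook would presumably run this on each of the three datasets):

```r
library(dplyr)
library(tibble)

toy_recov <- tibble(
  `Province/State` = c("Queensland", "Victoria", NA, "Queensland", "Victoria", NA),
  `Country/Region` = c("Australia", "Australia", "Kenya", "Australia", "Australia", "Kenya"),
  date = as.Date(c("2020-01-22", "2020-01-22", "2020-01-22",
                   "2020-01-23", "2020-01-23", "2020-01-23")),
  recovered = c(0, 0, 0, 1, 2, 0)
)

# One group per country and date; summing within each group merges the provinces.
by_country <- toy_recov %>%
  group_by(`Country/Region`, date) %>%
  summarise(recovered = sum(recovered)) %>%
  ungroup()
by_country
```

The Province/State column disappears because it is neither a grouping variable nor a summary, which matches the three-column structure shown below.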
## Observations: 18,130
## Variables: 3
## $ `Country/Region` <chr> "Afghanistan", "Afghanistan", "Afghanistan", ...
## $ date <date> 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-...
## $ recovered <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## Observations: 18,130
## Variables: 3
## $ `Country/Region` <chr> "Afghanistan", "Afghanistan", "Afghanistan", ...
## $ date <date> 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-...
## $ confirmed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## Observations: 18,130
## Variables: 3
## $ `Country/Region` <chr> "Afghanistan", "Afghanistan", "Afghanistan", ...
## $ date <date> 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-...
## $ succumbed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
The most commonly used join is the left join: you use this whenever you look up additional data from another table, because it preserves the original observations even when there isn’t a match. The left join should be your default join: use it unless you have a strong reason to prefer one of the others.
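A minimal sketch of a left join on the two shared columns, with made-up tables (`conf`, `recov` and `joined` are illustrative names). Kenya has no row in the second table, so its `recovered` value comes back as `NA` rather than the row being dropped:

```r
library(dplyr)
library(tibble)

conf <- tibble(
  `Country/Region` = c("Australia", "Kenya"),
  date = as.Date(c("2020-01-23", "2020-01-23")),
  confirmed = c(1, 0)
)
recov <- tibble(
  `Country/Region` = "Australia",
  date = as.Date("2020-01-23"),
  recovered = 0
)

# Keep every row of conf and look up the matching recovered count.
joined <- conf %>%
  left_join(recov, by = c("Country/Region", "date"))
joined
```

Chaining a second `left_join()` with the succumbed table in the same way would give one table with all three counts per country and date.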
## [1] source: R for Data Science by Hadley Wickham and Garrett Grolemund
## [1] artwork by @allison_horst
You have come this far! You did it! That was quite the wrangling we did, though I am sure there are easier ways to do it.
Anyhow, this adventure ends here, for now.
Until next time, take care of yourself by following the guidelines set out by the government and the WHO.