R Notebook

Data wrangling with R

Hello! Welcome to the first session in this R series, our contribution (as data scientists and data scientists in the making) to the fight against this pandemic. We plan to analyze the data and help others understand, monitor and curb further spread of the coronavirus disease. Join us as we embark on this cause, with the hope that it will help everyone keep an eye on this outbreak.

So let's get started.

## [1] source: R for Data Science by Hadley Wickham and Garrett Grolemund

Getting data into a format that will permit further analysis

Step 1: Importing data with readr

We need to import data, and good, reliable data at that. So where will we obtain it? Here: https://github.com/CSSEGISandData/COVID-19, the data repository for the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE), with support from the ESRI Living Atlas Team and the Johns Hopkins University Applied Physics Lab (JHU APL).

Navigate to this page of the repository: https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series. It contains the global time series data of the pandemic (confirmed, recovered and succumbed). Clicking on one of the datasets brings up an Excel-like table. Then follow these steps to get the datasets:

Click on the Raw option at the top right of the data.
This will open the raw CSV file in a new browser window.
Copy the URL of that window and assign it to a variable, since we will need it to download the CSV file from the repository.

global_conf_url <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"

global_recov_url <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv"

global_succumbed_url <-  "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

## [1] artwork by @allison_horst

Now that we have our URLs, let's get rolling by loading the required package: readr, which is part of the tidyverse. The tidyverse is a collection of packages designed to make the data science process fast, fluent and fun. The packages in the tidyverse share a common philosophy of data and R programming, and are designed to work together naturally. It is good practice to load the whole collection at once.
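The startup message that follows is what R prints when the tidyverse is attached. A minimal sketch of the call, assuming the package is already installed (the install line is left commented out):

```r
# install.packages("tidyverse")  # one-time setup, only if the package is missing
library(tidyverse)               # attaches readr, dplyr, tidyr, ggplot2 and friends
```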

## -- Attaching packages ------------------------ tidyverse 1.2.1 --

## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## Warning: package 'tidyr' was built under R version 3.6.3

## -- Conflicts --------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

All R statements where you create objects (assignment statements) have the same form: object_name <- value. When reading that code, say “object name gets value” in your head.

You will make lots of assignments and <- is a pain to type. Don’t be lazy and use =: it will work, but it will cause confusion later. Instead, use RStudio’s keyboard shortcut: Alt + - (the minus sign). Object names must start with a letter, and can only contain letters, numbers, _, and . (the period).

global_conf <- read_csv(global_conf_url)

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Province/State` = col_character(),
##   `Country/Region` = col_character()
## )

## See spec(...) for full column specifications.

global_recov <- read_csv(global_recov_url)

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Province/State` = col_character(),
##   `Country/Region` = col_character()
## )
## See spec(...) for full column specifications.

global_succ <- read_csv(global_succumbed_url)

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Province/State` = col_character(),
##   `Country/Region` = col_character()
## )
## See spec(...) for full column specifications.

Easy, right? Now that we have our data, let’s run a quick sanity check on it.

Step 2: A sanity check on our data

So what really is tidy data?

There are three interrelated rules which make a dataset tidy:

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.
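A quick way to check a dataset against these rules is to look at its columns. The sketch below uses a small, hypothetical stand-in for the JHU table (wide_example and its counts are made up for illustration, not taken from the repository); the same calls can be run on global_conf.

```r
library(tibble)
library(dplyr)

# Hypothetical miniature of the wide JHU layout; the counts are made up.
wide_example <- tribble(
  ~`Province/State`, ~`Country/Region`, ~Lat,   ~Long,  ~`1/22/20`, ~`1/23/20`,
  NA,                "Kenya",           -0.02,  37.91,  0,          0,
  "Victoria",        "Australia",       -37.81, 144.96, 0,          1
)

glimpse(wide_example)   # one line per column: name, type and first values
dim(wide_example)       # number of rows, then number of columns
names(wide_example)     # the last two "variables" are really dates
```

The date columns break the first rule: 1/22/20 and 1/23/20 are values of a single variable (the date), not variables in their own right.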

Step 3: Data Wrangling

Data wrangling is the art of getting your data into R in a useful form for visualization and modeling. Data wrangling is very important: without it you can’t work with your own data!

Tidying the data

The first step is always to figure out what the variables and observations are. The second step is to resolve one of two common problems:

One variable might be spread across multiple columns.
One observation might be scattered across multiple rows.

To fix these problems, you’ll need the two most important functions in tidyr: gather() and spread().

Let’s begin by removing two columns that we won’t need for now: Lat and Long.

Selecting & Deselecting Columns with select()

Before we go any further with this, we need to introduce a powerful new idea: the pipe, %>%, from the magrittr package, which pipes an object forward into a function or call expression. A good way to pronounce %>% when reading code is “then.” Let’s try it out:
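As a minimal sketch of the pipe combined with select(), the example below drops Lat and Long from a small, hypothetical stand-in for global_conf (conf_small and its counts are made up for illustration). A minus sign in front of a column name deselects it.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for global_conf; the counts are made up.
conf_small <- tibble(
  `Province/State` = c(NA, "Victoria"),
  `Country/Region` = c("Kenya", "Australia"),
  Lat  = c(-0.02, -37.81),
  Long = c(37.91, 144.96),
  `1/22/20` = c(0, 0),
  `1/23/20` = c(0, 1)
)

# Read as: take conf_small, then drop the Lat and Long columns.
conf_no_coords <- conf_small %>%
  select(-Lat, -Long)

names(conf_no_coords)
```

The same pattern, applied to global_conf, global_recov and global_succ, would leave only the location names and the date columns.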

Gathering

Our problem is centered around gather(). This is because the date columns are not actually variables; they are values of a variable that could be called “date”.

To tidy a dataset like this, we need to gather those columns into a new pair of variables. To describe that operation we need three parameters:

The set of columns that represent values, not variables. In this example, those are the columns 1/22/20, 1/23/20, 1/24/20, etc.
The name of the variable whose values form the column names. It’s called the key, and here it is date.
The name of the variable whose values are spread over the cells. It’s called the value, and here it’s the number of confirmed cases.
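The three parameters above map directly onto a gather() call. Here is a minimal sketch on a hypothetical two-row table (conf_wide and its counts are made up); every column except the two location columns is gathered into a date/confirmed pair.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical wide table; the counts are made up.
conf_wide <- tibble(
  `Province/State` = c(NA, "Victoria"),
  `Country/Region` = c("Kenya", "Australia"),
  `1/22/20` = c(0, 0),
  `1/23/20` = c(0, 1)
)

# key = new column holding the old column names (the dates)
# value = new column holding the cell contents (the counts)
# the minus signs exclude the location columns from the gathering
conf_long <- conf_wide %>%
  gather(key = "date", value = "confirmed",
         -`Province/State`, -`Country/Region`)

# Equivalent call with the newer interface (needs tidyr 1.0.0 or later)
conf_long2 <- conf_wide %>%
  pivot_longer(cols = -c(`Province/State`, `Country/Region`),
               names_to = "date", values_to = "confirmed")

conf_long
```

Note that the new date column is still text at this stage, because it was built from column names.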

Let’s see the structure of our data at this point:

## Observations: 25,872
## Variables: 4
## $ `Province/State` <chr> " ", " ", " ", " ", " ", " ", " ", " ", "Aust...
## $ `Country/Region` <chr> "Afghanistan", "Albania", "Algeria", "Andorra...
## $ date             <chr> "1/22/20", "1/22/20", "1/22/20", "1/22/20", "...
## $ confirmed        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

## Observations: 24,500
## Variables: 4
## $ `Province/State` <chr> " ", " ", " ", " ", " ", " ", " ", " ", "Aust...
## $ `Country/Region` <chr> "Afghanistan", "Albania", "Algeria", "Andorra...
## $ date             <chr> "1/22/20", "1/22/20", "1/22/20", "1/22/20", "...
## $ recovered        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

## Observations: 25,872
## Variables: 4
## $ `Province/State` <chr> " ", " ", " ", " ", " ", " ", " ", " ", "Aust...
## $ `Country/Region` <chr> "Afghanistan", "Albania", "Algeria", "Andorra...
## $ date             <chr> "1/22/20", "1/22/20", "1/22/20", "1/22/20", "...
## $ succumbed        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

The global_conf and global_succ datasets have the same structure: 25,872 observations (rows) and 4 variables (columns),

while the global_recov dataset has 24,500 observations and 4 variables.

That means we cannot merge them yet. But they still share three columns: Province/State, Country/Region and date.

In a numeric vector, every entry must be a number. To store character strings, vectors can also be of class character. A factor holds categorical data, e.g. male/female or benign/malignant. Our date column is currently of class character (<chr> in the output above).
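A short sketch of the three classes mentioned here; the vectors are made up purely for illustration.

```r
counts  <- c(0, 1, 5)                                  # numeric vector
dates   <- c("1/22/20", "1/23/20")                     # character vector: text, not yet dates
outcome <- factor(c("benign", "malignant", "benign"))  # factor: categorical data

class(counts)
class(dates)
class(outcome)
levels(outcome)   # the distinct categories of the factor
```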

Let’s convert the date column into a date object using the lubridate package and, in doing so, introduce the accessor ‘$’.

The accessor: $

For our analysis, we will need to access the different variables represented by columns included in this data frame. To do this, we use the accessor operator $ in the following way:
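One possible version of this step, sketched on a hypothetical long table (recov_long and its counts are made up). Since the dates are written month/day/year, lubridate's mdy() is a natural parser; that choice is an assumption about how the conversion was done.

```r
library(tibble)
library(lubridate)

# Hypothetical long table; the counts are made up.
recov_long <- tibble(
  `Country/Region` = c("Kenya", "Kenya"),
  date      = c("1/22/20", "1/23/20"),
  recovered = c(0, 0)
)

class(recov_long$date)                    # $ pulls out one column as a vector

# Overwrite the text column with parsed dates (month/day/year order)
recov_long$date <- mdy(recov_long$date)

class(recov_long$date)
```

Once the column holds real dates, it sorts chronologically and works with date arithmetic and plotting axes.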

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

## Observations: 24,500
## Variables: 4
## $ `Province/State` <chr> " ", " ", " ", " ", " ", " ", " ", " ", "Aust...
## $ `Country/Region` <chr> "Afghanistan", "Albania", "Algeria", "Andorra...
## $ date             <date> 2020-01-22, 2020-01-22, 2020-01-22, 2020-01-...
## $ recovered        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

One last detour: group_by() & summarise()

group_by() takes an existing tbl and converts it into a grouped tbl where operations are performed “by group”

summarise() collapses many values down to a single summary. summarise() is not terribly useful unless we pair it with group_by().

Why do we need group_by()? Great question. Some countries, like Australia, have 8 entries in the ‘Province/State’ column, while others, like Kenya, have no entry in the ‘Province/State’ column. For our analysis right now, we want to aggregate all the cases of Australia’s states into one single value for Australia.

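set.seed(1)  # make the random county assignment reproducible
census <- tibble(
  gender = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
  pop    = c(8, 10, 45, 89, 100, 45, 8, 132),
  county = sample(1:2, 8, replace = TRUE)  # randomly assign each row to county 1 or 2
)
# Total population for every gender/county combination
grouped_data <- census %>%
  group_by(gender, county) %>%
  summarise(total_pop = sum(pop))
grouped_data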
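Applied to our data, the same pattern could look like the sketch below: group by country and date, then sum over the provinces. The table recov_long and its counts are hypothetical, and ungroup() is used at the end so that the result behaves like an ordinary tibble.

```r
library(dplyr)
library(tibble)

# Hypothetical long table; the counts are made up.
recov_long <- tibble(
  `Province/State` = c("Victoria", "Queensland", NA,
                       "Victoria", "Queensland", NA),
  `Country/Region` = c("Australia", "Australia", "Kenya",
                       "Australia", "Australia", "Kenya"),
  date = as.Date(c("2020-03-01", "2020-03-01", "2020-03-01",
                   "2020-03-02", "2020-03-02", "2020-03-02")),
  recovered = c(1, 2, 0, 3, 4, 1)
)

# One row per country per day: provinces are summed together
recov_by_country <- recov_long %>%
  group_by(`Country/Region`, date) %>%
  summarise(recovered = sum(recovered)) %>%
  ungroup()

recov_by_country
```

The Province/State column disappears because it is neither a grouping variable nor a summary, which is why the outputs that follow have three variables instead of four.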

## Observations: 18,130
## Variables: 3
## $ `Country/Region` <chr> "Afghanistan", "Afghanistan", "Afghanistan", ...
## $ date             <date> 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-...
## $ recovered        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

## Observations: 18,130
## Variables: 3
## $ `Country/Region` <chr> "Afghanistan", "Afghanistan", "Afghanistan", ...
## $ date             <date> 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-...
## $ confirmed        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

## Observations: 18,130
## Variables: 3
## $ `Country/Region` <chr> "Afghanistan", "Afghanistan", "Afghanistan", ...
## $ date             <date> 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-...
## $ succumbed        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

Wrapping up

The most commonly used join is the left join: you use this whenever you look up additional data from another table, because it preserves the original observations even when there isn’t a match. The left join should be your default join: use it unless you have a strong reason to prefer one of the others.

## [1] source: R for Data Science by Hadley Wickham and Garrett Grolemund

## [1] artwork by @allison_horst
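With the three datasets aggregated to one row per country and date, a left join can bring them together. The sketch below uses hypothetical miniature tables (conf, recov, succ and their counts are made up) and joins on the two columns they share.

```r
library(dplyr)
library(tibble)

# Hypothetical per-country tables; the counts are made up.
conf <- tibble(
  `Country/Region` = c("Kenya", "Kenya"),
  date      = as.Date(c("2020-03-13", "2020-03-14")),
  confirmed = c(1, 1)
)
recov <- tibble(
  `Country/Region` = "Kenya",
  date      = as.Date("2020-03-13"),   # deliberately missing the second day
  recovered = 0
)
succ <- tibble(
  `Country/Region` = c("Kenya", "Kenya"),
  date      = as.Date(c("2020-03-13", "2020-03-14")),
  succumbed = c(0, 0)
)

# Keep every row of conf; look up matching values in the other two tables
covid <- conf %>%
  left_join(recov, by = c("Country/Region", "date")) %>%
  left_join(succ,  by = c("Country/Region", "date"))

covid
```

Because recov has no row for the second day, the left join keeps that observation and fills recovered with NA instead of dropping it.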

You have come this far! You did it! That was quite the wrangling we did. I am sure there are easier ways to do it though.

Anyhow, this adventure will end here, for now.

Until next time, take care of yourself by following the guidelines set by the government and the WHO.