library(ggplot2); library(dplyr); library(readr); library(tidyr)
library(nycflights13); library(fivethirtyeight)

Import data

Text-based data

Let’s create our own dataset and call it d, which is short for data and is only one letter so we can type it quickly. I always call my datasets d, if I’m working with only one. I used to call them df but then realized that’s a function’s name.

d <- tribble(
    ~Name, ~Age, ~Sex,
    "John", 5, "Male",
    "Jane", 39, "Female",
    "Yoda", 637, "Not to answer I choose"
)

CSV (Comma Separated Values)

d %>% 
    write_csv("my_csv_data.csv")

read_csv("my_csv_data.csv")
## # A tibble: 3 x 3
##   Name    Age Sex                   
##   <chr> <dbl> <chr>                 
## 1 John      5 Male                  
## 2 Jane     39 Female                
## 3 Yoda    637 Not to answer I choose

This is how the data look on our disk as text. Is there a problem with this? If you are Icelandic you might often write the number two and a half as 2,5. Might that cause a problem in a csv file?

read_lines("my_csv_data.csv") %>% 
    writeLines()
## Name,Age,Sex
## John,5,Male
## Jane,39,Female
## Yoda,637,Not to answer I choose

CSV2 (Comma Separated Values …. two? (Semicolons))

d %>% 
    write_csv2("my_csv2_data.csv")

read_csv2("my_csv2_data.csv")
## # A tibble: 3 x 3
##   Name    Age Sex                   
##   <chr> <dbl> <chr>                 
## 1 John      5 Male                  
## 2 Jane     39 Female                
## 3 Yoda    637 Not to answer I choose
read_lines("my_csv2_data.csv") %>% 
    writeLines()
## Name;Age;Sex
## John;  5;Male
## Jane; 39;Female
## Yoda;637;Not to answer I choose

TSV (Tab Separated Values)

d %>% 
    write_tsv("my_tsv_data.csv")

read_tsv("my_tsv_data.csv")
## # A tibble: 3 x 3
##   Name    Age Sex                   
##   <chr> <dbl> <chr>                 
## 1 John      5 Male                  
## 2 Jane     39 Female                
## 3 Yoda    637 Not to answer I choose
read_lines("my_tsv_data.csv") %>% 
    writeLines()
## Name Age Sex
## John 5   Male
## Jane 39  Female
## Yoda 637 Not to answer I choose

Delimited (Anything Separated Values)

read_delim("my_csv_data.csv", delim = ",")
## # A tibble: 3 x 3
##   Name    Age Sex                   
##   <chr> <dbl> <chr>                 
## 1 John      5 Male                  
## 2 Jane     39 Female                
## 3 Yoda    637 Not to answer I choose
read_delim("my_csv2_data.csv", delim = ";")
## # A tibble: 3 x 3
##   Name  Age   Sex                   
##   <chr> <chr> <chr>                 
## 1 John  "  5" Male                  
## 2 Jane  " 39" Female                
## 3 Yoda  637   Not to answer I choose
read_delim("my_tsv_data.csv", delim = "\t")
## # A tibble: 3 x 3
##   Name    Age Sex                   
##   <chr> <dbl> <chr>                 
## 1 John      5 Male                  
## 2 Jane     39 Female                
## 3 Yoda    637 Not to answer I choose

Other data forms

Excell

Use the package readxl

SPSS and SAS

Use the package haven

Google Drive

Download the data and read it in as a text-based file or excell.

Alternatively use the package googlesheets

Other

If you have other types of data, someone has worked with it in R before. Spend some time googling around.

Tidy data

Gather

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

How can we calculate the mean of each variable for each type of flower?

iris %>% 
    group_by(Species) %>% 
    summarise(mean_sepallength = mean(Sepal.Length),
           mean_sepalwidth = mean(Sepal.Width),
           mean_petallength = mean(Petal.Length),
           mean_petalwidth = mean(Petal.Width))
## # A tibble: 3 x 5
##   Species mean_sepallength mean_sepalwidth mean_petallength mean_petalwidth
##   <fct>              <dbl>           <dbl>            <dbl>           <dbl>
## 1 setosa              5.01            3.43             1.46           0.246
## 2 versic…             5.94            2.77             4.26           1.33 
## 3 virgin…             6.59            2.97             5.55           2.03

Sure, that wasn’t the absolute worst, but it’s an invitation to typos! And what if I told you I want the standard deviation as well? And why stop there, maybe I want the median and interquartile range! Good luck typing all that up!

A faster way using gather()

iris <- iris %>% 
    gather(Attribute, Value, -Species)
head(iris)
##   Species    Attribute Value
## 1  setosa Sepal.Length   5.1
## 2  setosa Sepal.Length   4.9
## 3  setosa Sepal.Length   4.7
## 4  setosa Sepal.Length   4.6
## 5  setosa Sepal.Length   5.0
## 6  setosa Sepal.Length   5.4

Now we can do everything quckly

iris %>% 
    group_by(Attribute, Species) %>% 
    summarise(mean = mean(Value),
              sd = sd(Value),
              median = median(Value),
              iqr = quantile(Value, .75) - quantile(Value, .25))
## # A tibble: 12 x 6
## # Groups:   Attribute [4]
##    Attribute    Species     mean    sd median   iqr
##    <chr>        <fct>      <dbl> <dbl>  <dbl> <dbl>
##  1 Petal.Length setosa     1.46  0.174   1.5  0.175
##  2 Petal.Length versicolor 4.26  0.470   4.35 0.600
##  3 Petal.Length virginica  5.55  0.552   5.55 0.775
##  4 Petal.Width  setosa     0.246 0.105   0.2  0.100
##  5 Petal.Width  versicolor 1.33  0.198   1.3  0.3  
##  6 Petal.Width  virginica  2.03  0.275   2    0.500
##  7 Sepal.Length setosa     5.01  0.352   5    0.4  
##  8 Sepal.Length versicolor 5.94  0.516   5.9  0.7  
##  9 Sepal.Length virginica  6.59  0.636   6.5  0.675
## 10 Sepal.Width  setosa     3.43  0.379   3.4  0.475
## 11 Sepal.Width  versicolor 2.77  0.314   2.8  0.475
## 12 Sepal.Width  virginica  2.97  0.322   3    0.375

Spread

This is the opposite of spread, so each observation is spread out over two or more rows. Let’s look at the table2 data from the tidyverse package. We see that one observation is different attributes of one country in one year, but the observations are spread out over two rows, one row showing the number of cases and one showing the population in the country. We use spread to fix this.

table2 
## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
table2 %>% 
    spread(type, count)
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Separate

This is if we have two values in one column. In the data below we see the rate stored as a character showing the number of cases divided by the total population. Let’s split this into two columns, one for cases and one for population.

table3 %>% 
    separate(col = rate, into = c("cases", "population"))
## # A tibble: 6 x 4
##   country      year cases  population
##   <chr>       <int> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

We see that the columns are still stored as characters. We need to change them into numbers

table3 %>% 
    separate(col = rate, into = c("cases", "population")) %>% 
    mutate_at(vars(cases, population), parse_number)
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Unite

The opposite of separate. If we have two columns that together form one value. In the data below we have one column for century and one for year. This is nonsense. Let’s change it into one column for year

table5
## # A tibble: 6 x 4
##   country     century year  rate             
## * <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583
table5 %>% 
    unite(col = "year", century, year)
## # A tibble: 6 x 3
##   country     year  rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 19_99 745/19987071     
## 2 Afghanistan 20_00 2666/20595360    
## 3 Brazil      19_99 37737/172006362  
## 4 Brazil      20_00 80488/174504898  
## 5 China       19_99 212258/1272915272
## 6 China       20_00 213766/1280428583
table5 %>% 
    unite(col = "year", century, year, sep = "")
## # A tibble: 6 x 3
##   country     year  rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 1999  745/19987071     
## 2 Afghanistan 2000  2666/20595360    
## 3 Brazil      1999  37737/172006362  
## 4 Brazil      2000  80488/174504898  
## 5 China       1999  212258/1272915272
## 6 China       2000  213766/1280428583

And change it into a number

table5 %>% 
    unite(col = "year", century, year, sep = "") %>% 
    mutate(year = parse_number(year))
## # A tibble: 6 x 3
##   country      year rate             
##   <chr>       <dbl> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

And complete the tidying with what we learned above

table5 %>% 
    unite(col = "year", century, year, sep = "") %>% 
    separate(col = "rate", into = c("cases", "population")) %>% 
    mutate_at(vars(year, cases, population), parse_number)
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Swirl exercises

Swirl, https://swirlstats.com, is an interactive learning environment situated inside R! You will now do one exercise in Swirl.

First, if the package is not installed, install it

install.packages("swirl")

Then load the package

library(swirl)

The exercise is part of a bigger course. You need to install the course

install_course("Getting and Cleaning Data")

Then run Swirl and find the exercise, Tidying Data with tidyr, inside the course.

swirl()

Case study: An epidemic ravages Iceland’s capitol!

The year is 2020 and new epidemic, midi-chlorianism, is spreading through Reykjavík’s districts as well as neighboring ones. The Directorate of Health has chosen you as its crack team of data analysts to find some clues. Good luck, we’re all counting on you.

The assignment

The files cases_2019.csv and population_2019.csv show the total number of cases and population respectively for each of the largest districts in and around Reykjavík. Working alone or as groups you must read the data into R, tidy them and use them to answer the following questions.

Be careful, the datasets are both saved as .csv files, but they don’t use the same delimiters. Use what you’ve learned in this class to import both files into R using read_csv, read_csv2 and read_tsv, or you could use read_delim for both and manually enter the delimiter!

The files are available in Ugla.

Cases

The cases_2019.csv file is messy. Find a way to make it tidy so that you can answer the other questions much faster!

What are the total number of cases in all of Reykjavík in January and December 2019?

Which district has the highest number of cases in January? How about in December?

Create a lineplot that shows how the number of cases in each district has evolved throughout the year. In which district has the number of cases increased fastest?

Population

The population.csv also messy. Make it tidy to help with the next questions!

Which district has the largest population in December?

What does this mean for your results above about which district has seem the fastest increase?

Connecting the datasets

Use what you learned in chapter 3.7 join data frames to combine the cases and population into one dataset (you can call it d for example)

Use your new dataset to calculate the prevalence of the disease for each district at each timepoint

Create a lineplot showing the evolution of prevalence in each district during each month. In which districts has the prevalence increased? In which districts has it decreased?

Create a table containing, for each city, how much the prevalence in that city has changed during the year.

Hint: For each city you only need the rows corresponding to the first month and the last month. The function slice(a, b) will return only rows a and b from the data. If you use slice() after using group_by() it wil return rows number a and b for each group. Once you have only the first and last month you can use spread() to create one column for each month and then a simple mutate() will allow you to calculate the difference. Remember that you cant use numbers as column names so you will have to place `` around any column names that are numbers to use them for calculations.

The final product

Finally, use what you’ve learned to make recommendations to the Directorate of Health on which districts need more help and which ones are doing great and should maybe be used as learning examples!