library(ggplot2); library(dplyr); library(readr); library(tidyr)
library(nycflights13); library(fivethirtyeight)
Let’s create our own dataset and call it d, which is short for data and is only one letter so we can type it quickly. I always call my datasets d, if I’m working with only one. I used to call them df but then realized that’s a function’s name.
d <- tribble(
~Name, ~Age, ~Sex,
"John", 5, "Male",
"Jane", 39, "Female",
"Yoda", 637, "Not to answer I choose"
)
d %>%
write_csv("my_csv_data.csv")
read_csv("my_csv_data.csv")
## # A tibble: 3 x 3
## Name Age Sex
## <chr> <dbl> <chr>
## 1 John 5 Male
## 2 Jane 39 Female
## 3 Yoda 637 Not to answer I choose
This is how the data look on our disk as text. Is there a problem with this? If you are Icelandic you might often write the number two and a half as 2,5. Might that cause a problem in a csv file?
read_lines("my_csv_data.csv") %>%
writeLines()
## Name,Age,Sex
## John,5,Male
## Jane,39,Female
## Yoda,637,Not to answer I choose
d %>%
write_csv2("my_csv2_data.csv")
read_csv2("my_csv2_data.csv")
## # A tibble: 3 x 3
## Name Age Sex
## <chr> <dbl> <chr>
## 1 John 5 Male
## 2 Jane 39 Female
## 3 Yoda 637 Not to answer I choose
read_lines("my_csv2_data.csv") %>%
writeLines()
## Name;Age;Sex
## John; 5;Male
## Jane; 39;Female
## Yoda;637;Not to answer I choose
d %>%
write_tsv("my_tsv_data.csv")
read_tsv("my_tsv_data.csv")
## # A tibble: 3 x 3
## Name Age Sex
## <chr> <dbl> <chr>
## 1 John 5 Male
## 2 Jane 39 Female
## 3 Yoda 637 Not to answer I choose
read_lines("my_tsv_data.csv") %>%
writeLines()
## Name Age Sex
## John 5 Male
## Jane 39 Female
## Yoda 637 Not to answer I choose
read_delim("my_csv_data.csv", delim = ",")
## # A tibble: 3 x 3
## Name Age Sex
## <chr> <dbl> <chr>
## 1 John 5 Male
## 2 Jane 39 Female
## 3 Yoda 637 Not to answer I choose
read_delim("my_csv2_data.csv", delim = ";")
## # A tibble: 3 x 3
## Name Age Sex
## <chr> <chr> <chr>
## 1 John " 5" Male
## 2 Jane " 39" Female
## 3 Yoda 637 Not to answer I choose
read_delim("my_tsv_data.csv", delim = "\t")
## # A tibble: 3 x 3
## Name Age Sex
## <chr> <dbl> <chr>
## 1 John 5 Male
## 2 Jane 39 Female
## 3 Yoda 637 Not to answer I choose
Use the package readxl
Use the package haven
Download the data and read it in as a text-based file or excell.
Alternatively use the package googlesheets
If you have other types of data, someone has worked with it in R before. Spend some time googling around.
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
How can we calculate the mean of each variable for each type of flower?
iris %>%
group_by(Species) %>%
summarise(mean_sepallength = mean(Sepal.Length),
mean_sepalwidth = mean(Sepal.Width),
mean_petallength = mean(Petal.Length),
mean_petalwidth = mean(Petal.Width))
## # A tibble: 3 x 5
## Species mean_sepallength mean_sepalwidth mean_petallength mean_petalwidth
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 5.01 3.43 1.46 0.246
## 2 versic… 5.94 2.77 4.26 1.33
## 3 virgin… 6.59 2.97 5.55 2.03
Sure, that wasn’t the absolute worst, but it’s an invitation to typos! And what if I told you I want the standard deviation as well? And why stop there, maybe I want the median and interquartile range! Good luck typing all that up!
A faster way using gather()
iris <- iris %>%
gather(Attribute, Value, -Species)
head(iris)
## Species Attribute Value
## 1 setosa Sepal.Length 5.1
## 2 setosa Sepal.Length 4.9
## 3 setosa Sepal.Length 4.7
## 4 setosa Sepal.Length 4.6
## 5 setosa Sepal.Length 5.0
## 6 setosa Sepal.Length 5.4
Now we can do everything quckly
iris %>%
group_by(Attribute, Species) %>%
summarise(mean = mean(Value),
sd = sd(Value),
median = median(Value),
iqr = quantile(Value, .75) - quantile(Value, .25))
## # A tibble: 12 x 6
## # Groups: Attribute [4]
## Attribute Species mean sd median iqr
## <chr> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Petal.Length setosa 1.46 0.174 1.5 0.175
## 2 Petal.Length versicolor 4.26 0.470 4.35 0.600
## 3 Petal.Length virginica 5.55 0.552 5.55 0.775
## 4 Petal.Width setosa 0.246 0.105 0.2 0.100
## 5 Petal.Width versicolor 1.33 0.198 1.3 0.3
## 6 Petal.Width virginica 2.03 0.275 2 0.500
## 7 Sepal.Length setosa 5.01 0.352 5 0.4
## 8 Sepal.Length versicolor 5.94 0.516 5.9 0.7
## 9 Sepal.Length virginica 6.59 0.636 6.5 0.675
## 10 Sepal.Width setosa 3.43 0.379 3.4 0.475
## 11 Sepal.Width versicolor 2.77 0.314 2.8 0.475
## 12 Sepal.Width virginica 2.97 0.322 3 0.375
This is the opposite of spread, so each observation is spread out over two or more rows. Let’s look at the table2 data from the tidyverse package. We see that one observation is different attributes of one country in one year, but the observations are spread out over two rows, one row showing the number of cases and one showing the population in the country. We use spread to fix this.
table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
table2 %>%
spread(type, count)
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
This is if we have two values in one column. In the data below we see the rate stored as a character showing the number of cases divided by the total population. Let’s split this into two columns, one for cases and one for population.
table3 %>%
separate(col = rate, into = c("cases", "population"))
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
We see that the columns are still stored as characters. We need to change them into numbers
table3 %>%
separate(col = rate, into = c("cases", "population")) %>%
mutate_at(vars(cases, population), parse_number)
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
The opposite of separate. If we have two columns that together form one value. In the data below we have one column for century and one for year. This is nonsense. Let’s change it into one column for year
table5
## # A tibble: 6 x 4
## country century year rate
## * <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
table5 %>%
unite(col = "year", century, year)
## # A tibble: 6 x 3
## country year rate
## <chr> <chr> <chr>
## 1 Afghanistan 19_99 745/19987071
## 2 Afghanistan 20_00 2666/20595360
## 3 Brazil 19_99 37737/172006362
## 4 Brazil 20_00 80488/174504898
## 5 China 19_99 212258/1272915272
## 6 China 20_00 213766/1280428583
table5 %>%
unite(col = "year", century, year, sep = "")
## # A tibble: 6 x 3
## country year rate
## <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
And change it into a number
table5 %>%
unite(col = "year", century, year, sep = "") %>%
mutate(year = parse_number(year))
## # A tibble: 6 x 3
## country year rate
## <chr> <dbl> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
And complete the tidying with what we learned above
table5 %>%
unite(col = "year", century, year, sep = "") %>%
separate(col = "rate", into = c("cases", "population")) %>%
mutate_at(vars(year, cases, population), parse_number)
## # A tibble: 6 x 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Swirl, https://swirlstats.com, is an interactive learning environment situated inside R! You will now do one exercise in Swirl.
First, if the package is not installed, install it
install.packages("swirl")
Then load the package
library(swirl)
The exercise is part of a bigger course. You need to install the course
install_course("Getting and Cleaning Data")
Then run Swirl and find the exercise, Tidying Data with tidyr, inside the course.
swirl()
The year is 2020 and new epidemic, midi-chlorianism, is spreading through Reykjavík’s districts as well as neighboring ones. The Directorate of Health has chosen you as its crack team of data analysts to find some clues. Good luck, we’re all counting on you.
The files cases_2019.csv and population_2019.csv show the total number of cases and population respectively for each of the largest districts in and around Reykjavík. Working alone or as groups you must read the data into R, tidy them and use them to answer the following questions.
Be careful, the datasets are both saved as .csv files, but they don’t use the same delimiters. Use what you’ve learned in this class to import both files into R using read_csv, read_csv2 and read_tsv, or you could use read_delim for both and manually enter the delimiter!
The files are available in Ugla.
cases_2019.csv file is messy. Find a way to make it tidy so that you can answer the other questions much faster!population.csv also messy. Make it tidy to help with the next questions!join data frames to combine the cases and population into one dataset (you can call it d for example)Hint: For each city you only need the rows corresponding to the first month and the last month. The function slice(a, b) will return only rows a and b from the data. If you use slice() after using group_by() it wil return rows number a and b for each group. Once you have only the first and last month you can use spread() to create one column for each month and then a simple mutate() will allow you to calculate the difference. Remember that you cant use numbers as column names so you will have to place `` around any column names that are numbers to use them for calculations.