---
title: "R for Data Science Chapter 9 TidyData"
output: html_notebook
---

<h1> Tidy Data with tidyr </h1>

There are three interrelated rules which make a dataset tidy: </br>

1. Each variable must have its own column </br>
2. Each observation must have its own row </br>
3. Each value must have its own cell </br>

<img src="http://r4ds.had.co.nz/images/tidy-6.png" > </img>

In practice, this means you put each dataset in a tibble and put each variable in a column. </p>

```{r}
# only table1 is tidy: it is the only one where each column is a variable
library(tidyverse)
table1
table2
table3
table4a
table4b

```

dplyr, ggplot2 and all the other packages in the tidyverse are designed to work with tidy data. Here are some examples:

```{r}
# compute rate per 10,000
table1 %>% 
  mutate(rate = cases/population *10000)
```

```{r}
# compute cases per year
table1 %>% 
  count(year, wt = cases)
```


```{r}
# visualize changes over time 
library(ggplot2)
ggplot(table1, aes(year, cases)) +
  geom_line(aes(group=country), color = "grey50")+
  geom_point(aes(color = country))
```



```{r}
library(tidyverse)
View(table1)
View(table2)
View(table4a)
View(table4b)
```

Exercise: Compute the rate for table2

```{r}
# table 2 extract the rate per country per year
tb2_cases <- filter(table2, type == "cases")[["count"]]
tb2_country <- filter(table2, type == "cases")[["country"]]
tb2_year <- filter(table2, type == "cases")[["year"]]
tb2_population <- filter(table2, type == "population")[["count"]]
table2_clean <- tibble(country = tb2_country,
       year = tb2_year,
       rate = tb2_cases / tb2_population)
table2_clean

```


Then compute the rate using the data from the two tables table4a and table4b:

```{r}
tibble(country = table4a[["country"]],
       `1999` = table4a[["1999"]] / table4b[["1999"]],
       `2000` = table4a[["2000"]] / table4b[["2000"]]
)
```

Now plot using table2 

```{r}
library(ggplot2)
table2 %>%
  filter(type == "cases") %>%
  ggplot(aes(year, count)) +
  geom_line(aes(group = country), color = "grey50") +
  geom_point(aes(color = country))

```




<h2> Spreading and Gathering </h2>

There are two common problems with untidy datasets: one variable might be spread across multiple columns, or one observation might be scattered across multiple rows.  </p>

The first problem occurs when some of the column names are not names of variables, but values of a variable. Take table4a: the column names 1999 and 2000 represent values of the year variable, and each row represents two observations, not one. </br>


To fix this, we need to 'gather' those columns into a new pair of variables. This reshapes the table from wide to long; it is not a transpose. </br>
<img src ="http://garrettgman.github.io/images/tidy-9.png" > </img>

You will need three parameters: </br>
The set of columns that represent values, not variables. In table4a, those are the columns 1999 and 2000. </br>
The name of the variable whose values form the column names. This is the key, and in this case it is "Year". </br>
The name of the variable whose values are spread over the cells. This is the value, and here it is the number of cases. </p>

```{r}
# example of gather function
# note the use of backticks (`), not single quotes ('), around 1999 and 2000
table4a %>%
  gather(`1999`,`2000`, key = "Year", value = "cases")
```

We can use gather() to tidy table4b as well:

```{r}
table4b %>%
  gather(`1999`,`2000`, key = "year", value = "population")
```

To combine the tidied versions of table4a and table4b into a single tibble, use left_join():

```{r}
tidy4a <-  table4a %>%
  gather(`1999`,`2000`, key = "Year", value = "cases")
tidy4b <- table4b %>%
  gather(`1999`,`2000`, key = "Year", value = "population")
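# name the join columns explicitly instead of relying on the "Joining, by = ..." message
left_join(tidy4a, tidy4b, by = c("country", "Year"))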

```
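
Once the two tables are joined, the rate can be computed in a single step. The chunk below is a minimal sketch of that idea, not part of the original exercise: the names `cases_long`, `pop_long` and `rates` are made up for illustration, and `convert = TRUE` is used so that the gathered year column becomes an integer rather than a character string.

```{r}
library(tidyverse)

# gather each wide table; convert = TRUE turns the key column ("1999", "2000") into integers
cases_long <- table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases", convert = TRUE)
pop_long <- table4b %>%
  gather(`1999`, `2000`, key = "year", value = "population", convert = TRUE)

# join on both identifying columns, then compute cases per 10,000 people
rates <- left_join(cases_long, pop_long, by = c("country", "year")) %>%
  mutate(rate = cases / population * 10000)
rates
```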



<h2> Spreading </h2>
Spreading is the opposite of gathering. You use it when an observation is scattered across multiple rows. For example, take table2: the observation for each country and year is spread across two rows, one for cases and one for population.
This time we need only two parameters: the key column, which contains the variable names (in table2, the type column), and the value column, which contains the values (here, the count column).

<img src="http://r4ds.had.co.nz/images/tidy-8.png" > </img>


```{r}
spread(table2, key = type, value = count)
```
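
After spreading, table2 has one row per country and year, so the rate exercise from earlier becomes a single mutate() call. This is a sketch of that alternative, assuming the same rate definition (cases per 10,000 people) used for table1; the name `table2_rate` is hypothetical.

```{r}
library(tidyverse)

# spread type/count into cases and population columns, then compute the rate
table2_rate <- table2 %>%
  spread(key = type, value = count) %>%
  mutate(rate = cases / population * 10000)
table2_rate
```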


So gather() makes wide tables narrower and longer, while spread() makes long tables shorter and wider.
</p>
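
In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The chunk below sketches the equivalent calls for table4a and table2, assuming a tidyr version that provides these functions; `long4a` and `wide2` are names chosen here for illustration.

```{r}
library(tidyverse)

# equivalent of gather(): column names go to "year", cell values go to "cases"
long4a <- table4a %>%
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year", values_to = "cases")

# equivalent of spread(): values of type become column names, filled from count
wide2 <- table2 %>%
  pivot_wider(names_from = type, values_from = count)

long4a
wide2
```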

<h2> Separating and Pull </h2>

In table3, the rate column contains two values (cases and population). We use separate() to pull these two values apart.

<img src="http://r4ds.had.co.nz/images/tidy-17.png" > </img>

```{r}
table3 %>%
  separate(rate, into = c("cases", "population"), sep ="/")
```
In the output above, cases and population are character columns, because separate() keeps the type of the original column by default. Use convert = TRUE to have separate() try to convert them to better types.

```{r}
table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)
```

Now cases and population are parsed as integer columns. </p>

You can also split at a fixed character position by passing an integer to the sep argument:

```{r}
table3 %>%
  separate(year, into = c("century","year"), sep =2)
```


<h2> Unite </h2>

unite() is the inverse of separate(): it combines multiple columns into a single column. (A note on separate(): when sep is an integer, positive values start at 1 on the far left of the string, and negative values start at -1 on the far right.)

We can use unite() to rejoin the century and year columns created in the example above. That data is saved as tidyr::table5.
<img src="http://r4ds.had.co.nz/images/tidy-18.png"> </img>


```{r}
table5 %>%
  unite(new, century, year)
```


Notice the underscore? Let's take it out with the sep="" argument

```{r}
table5 %>%
  unite(new, century, year, sep="")
```
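
Note that unite() always produces a character column. If the rejoined year should be numeric again, it has to be converted explicitly. A minimal sketch, using the hypothetical names `year4` and `table5_year`:

```{r}
library(tidyverse)

# paste century and year together, then turn the resulting string back into an integer
table5_year <- table5 %>%
  unite(year4, century, year, sep = "") %>%
  mutate(year4 = as.integer(year4))
table5_year
```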


Exercise: </p>


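What do the extra and fill arguments do in separate()? With the defaults, separate() emits a warning when a row has too many or too few pieces, as the next two chunks show: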
```{r}
tibble(x=c("a,b,c","d,e,f,g","h,i,j")) %>%
  separate(x, c("one","two","three"))
```

```{r}
tibble(x = c("a,b,c","d,e","f,g,i")) %>%
  separate(x, c("one","two","three"))
```



Using extra = "drop" discards the extra pieces without a warning:
```{r}
tibble(x = c("a,b,c","d,e,f,g","h,i,j")) %>%
  separate(x, c("one","two","three"), extra = "drop")
```
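Using extra = "merge" splits only as many times as there are new columns, so the extra pieces stay in the last column: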

```{r}
tibble(x = c("a,b,c","d,e,f,g","h,i,j")) %>%
  separate(x, c("one","two","three"), extra = "merge")
```


Using fill = "right" pads the missing pieces with NA on the right, without a warning:

```{r}
tibble(x = c("a,b,c","d,e","f,g,i")) %>%
  separate(x, c("one","two","three"), fill = "right")
```

Using fill = "left" pads the missing pieces with NA on the left:

```{r}
tibble(x = c("a,b,c","d,e","f,g,i")) %>%
  separate(x, c("one","two","three"), fill = "left")
```


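Both unite() and separate() have a remove option. Why would we set it to FALSE? With remove = FALSE, the original input column is kept alongside the new columns, which helps when you still need the unsplit value: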


```{r}
tibble(x = c("a,b,c","d,e","f,g,i")) %>%
  separate(x, c("one","two","three"), remove = FALSE)
```
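
The same option works for unite(). As a small sketch on table5 (the name `table5_keep` is made up here), remove = FALSE keeps century and year next to the combined column:

```{r}
library(tidyverse)

# keep the input columns century and year in addition to the united column
table5_keep <- table5 %>%
  unite(new, century, year, sep = "", remove = FALSE)
table5_keep
```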


<h2> Missing Values </h2>

A value can be missing in one of two ways: explicitly, i.e. flagged with NA, or implicitly, i.e. simply not present in the data. </p>

```{r}
stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

stocks
```

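The way a dataset is represented can make implicit values explicit. For example, putting the years in the columns turns every missing year and quarter combination into an explicit NA: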

```{r}
stocks %>%
  spread(year, return)

```


To turn explicit missing values implicit, use na.rm = TRUE in gather():

```{r}
stocks %>%
  spread(year, return) %>%
  gather(year, return, `2015`:`2016`, na.rm=TRUE)

```



complete() takes a set of columns and finds all unique combinations. It then ensures the original dataset contains all those combinations, filling in explicit NAs where necessary.

```{r}
stocks %>% 
  complete(year,qtr)
```
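
complete() also has a fill argument that supplies a replacement value instead of NA. The sketch below uses a small stand-in dataset (`stocks_demo`, a hypothetical name) and fills missing returns with 0 purely for illustration; whether 0 is a sensible replacement depends on the data.

```{r}
library(tidyverse)

# stand-in data: one explicit NA (2015 Q4) and one implicit gap (2016 Q1)
stocks_demo <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(1, 2, 3, 4, 2, 3, 4),
  return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)

# add the missing year/qtr combinations and replace missing returns with 0
stocks_filled <- stocks_demo %>%
  complete(year, qtr, fill = list(return = 0))
stocks_filled
```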



When a dataset has been used for data entry, missing values indicate that the previous value should be carried forward:


```{r}
treatment <- tribble(
  ~ person, ~treatment, ~response, 
  "Derrick Whitmore",1,7,
  NA,2,10,
  NA,3,9,
  "Katherine Burke",1,4
)

treatment

```

You can fill in these missing values with fill(). It takes a set of columns where you want missing values to be replaced by the most recent nonmissing value (sometimes called last observation carried forward).

```{r}
treatment %>%
  fill(person)
```
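
By default fill() carries values downward, but its .direction argument can reverse that. A minimal sketch on a made-up table (`visits`, `filled_down` and `filled_up` are illustrative names, not part of the original notes):

```{r}
library(tidyverse)

# made-up data-entry style table where repeated names were left blank
visits <- tribble(
  ~person, ~visit,
  "A",     1,
  NA,      2,
  NA,      3,
  "B",     1
)

# default: last observation carried forward
filled_down <- visits %>% fill(person)

# .direction = "up": next observation carried backward
filled_up <- visits %>% fill(person, .direction = "up")

filled_down
filled_up
```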


<h2> Case Study: tidyr::who </h2>

```{r}
who

```


```{r}
who1 <-  who %>%
  gather(
    new_sp_m014:newrel_f65, key = "key",
    value = "cases",
    na.rm = TRUE
  )

who1
```

We can get some hint of the structure of the values in the new key column by counting them:


```{r}
who1 %>%
  count(key)
```



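The codes are not quite consistent: some keys use newrel where the others follow the new_ pattern (new_sp, new_sn, new_ep). To make the format consistent, we use str_replace() to turn newrel into new_rel: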

```{r}
who2 <-  who1 %>%
  mutate(key = stringr::str_replace(key, "newrel", "new_rel"))
who2
```


We can separate the values in each code with two passes of separate(). The first pass splits the codes at each underscore:

```{r}
who3 <-  who2 %>%
  separate(key, c("new","type", "sexage"), sep="_")
who3

```

Then we might as well drop the new column because it's constant in this dataset. While we're dropping columns, let's also drop iso2 and iso3 since they're redundant:

```{r}
who3 %>%
  count(new)
who4 <- who3 %>%
  select(-new, -iso2,-iso3)
who4

```



Next we separate sexage into sex and age by splitting after the first character:

```{r}
who5 <- who4  %>%
  separate(sexage, c("Sex","Age"), sep = 1)
who5
```
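
The intermediate objects who1 to who5 are useful for inspecting each step, but the same steps can be chained into one pipeline. The chunk below is a sketch that repeats the calls used above with the same column names; `who_tidy` is a name chosen here for illustration.

```{r}
library(tidyverse)

who_tidy <- who %>%
  # gather all the case-count columns into key/cases pairs, dropping NAs
  gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE) %>%
  # make the codes consistent
  mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
  # split the code at the underscores, then drop the constant and redundant columns
  separate(key, c("new", "type", "sexage"), sep = "_") %>%
  select(-new, -iso2, -iso3) %>%
  # split sex from age group after the first character
  separate(sexage, c("Sex", "Age"), sep = 1)
who_tidy
```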
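Exercise: Confirm that country, iso2 and iso3 are redundant. </br>
My answer: count the combinations of country, iso2 and iso3. If the three columns are redundant, each country appears with exactly one iso2 and one iso3 code, so the result has one row per country. If they were not redundant, some country would show up in more than one row.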


```{r}
who %>% 
  count(country, iso2, iso3)

```
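
Reading through the counts by eye is error-prone, so the same check can be reduced to a few numbers. This is a sketch of one way to do it (the name `code_check` is hypothetical): if the number of distinct combinations equals the number of distinct values in each column, the three columns carry the same information.

```{r}
library(tidyverse)

code_check <- who %>%
  # one row per combination that actually occurs in the data
  distinct(country, iso2, iso3) %>%
  summarise(
    n_rows    = n(),
    n_country = n_distinct(country),
    n_iso2    = n_distinct(iso2),
    n_iso3    = n_distinct(iso3)
  )
code_check
```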



<h2> Nontidy Data </h2>
There are two main reasons to use other data structures:</br>
Alternative representations may have substantial performance or space advantages. </br>
Specialized fields have evolved their own conventions for storing data, which may be quite different from the conventions of tidy data.
