Preamble
In this notes we shall look at how best we can use tidyr for structuring our data in desired format which may be called as ‘Tidy data’. This is most important while we aim at creating visualization and modeling. We can perform tasks such as converting columns to rows, rows to columns with tidyr
Need for Structuring and tidyr’s role
When we wish to analyze a data, we must first decide which format our data should be in order to make full use of the data. i.e we must determine what each row and column will represent. For this tidyr package (Which is part of Tidyverse) is used. It follows 3 principle for tidy data
- Each variable is represented in column
- Each observation is represented in row
- Each value will be in cell
Keeping these guidelines in mind we structure the data using tidyr.
Structuring data with tidyr
Loading the package and dataset
library(tidyr)We shall use us_rent_income dataset which is in tidyr package
data1=us_rent_income
data2=data1[c("NAME","variable","estimate")]| NAME | variable | estimate |
|---|---|---|
| Alabama | income | 24476 |
| Alabama | rent | 747 |
| Alaska | income | 32940 |
| Alaska | rent | 1200 |
| Arizona | income | 27517 |
| Arizona | rent | 972 |
tidyr is used to change the data into tidy one based on the principles given above, we shall see the actions which can be performed by tidyr one by one
Changing from rows to columns (Long to Wide)
A data which is having more number of rows because of more categories of a sigle variable is called as long data.
Let us look at an example of long data
| NAME | variable | estimate |
|---|---|---|
| Alabama | income | 24476 |
| Alabama | rent | 747 |
| Alaska | income | 32940 |
| Alaska | rent | 1200 |
| Arizona | income | 27517 |
| Arizona | rent | 972 |
| Arkansas | income | 23789 |
| Arkansas | rent | 709 |
| California | income | 29454 |
| California | rent | 1358 |
Here we have “variable” which is categorical and because of this we have long data
A data of long format can be transformed into wide format using pivot_wider() function , it considers a key column whose features are spread in to multiple columns which reduces number of rows. Arguments of this function are
names_from- we indicate the variable which we wish to spread across columns to convert long data in to wide
values_from- The values of the variables for each observation
datas=pivot_wider(data2,names_from=variable,values_from = estimate)| NAME | income | rent |
|---|---|---|
| Alabama | 24476 | 747 |
| Alaska | 32940 | 1200 |
| Arizona | 27517 | 972 |
| Arkansas | 23789 | 709 |
| California | 29454 | 1358 |
| Colorado | 32401 | 1125 |
Changing from columns to rows (Wide to Long)
We can change wide data into long one using pivot_longer().
we specify new variable name with which we mention the categories in names_to and values_to is specified for its values and we specify which column to leave as it is using ‘-’ operator.
data_long=pivot_longer(datas,names_to='variable_category',values_to='estimate',-NAME)| NAME | variable_category | estimate |
|---|---|---|
| Alabama | income | 24476 |
| Montana | income | 26249 |
| Alabama | rent | 747 |
| Montana | rent | 751 |
| Alaska | income | 32940 |
| Nebraska | income | 30020 |
| Alaska | rent | 1200 |
| Nebraska | rent | 773 |
| Arizona | income | 27517 |
| Nevada | income | 29019 |
separate
It is needed sometimes to separate a variable (which is character) in to two columns. This can be achieved using separate() function. We shall use the following data
| Name | sample.mail | phone |
|---|---|---|
| name1 | awesomename1@yahoo.com | 1234567890 |
| name2 | fabulousname2@gmail.com | 9876543210 |
| name3 | name321@outlook.com | 6234115600 |
We specify which column to separate using its name or position using col and names of the separated columns we are forming using into, we must specify which character to use for splitting string using sep
sep=separate(sample1,col=sample.mail,into=c("name","service"),sep='@')| Name | name | service | phone |
|---|---|---|---|
| name1 | awesomename1 | yahoo.com | 1234567890 |
| name2 | fabulousname2 | gmail.com | 9876543210 |
| name3 | name321 | outlook.com | 6234115600 |
we can keep the original column which we are separating using the following
sep2=separate(sample1,sample.mail,c("name","service"),remove=F,sep='@')| Name | sample.mail | name | service | phone |
|---|---|---|---|---|
| name1 | awesomename1@yahoo.com | awesomename1 | yahoo.com | 1234567890 |
| name2 | fabulousname2@gmail.com | fabulousname2 | gmail.com | 9876543210 |
| name3 | name321@outlook.com | name321 | outlook.com | 6234115600 |
unite
We can unite two columns in to one column by using unite(), we provide data and name of new column using col argument, we can also give separator between them using sep
united=unite(sep[c('name','service')],col=email_ID,sep='@')| email_ID |
|---|
| awesomename1@yahoo.com |
| fabulousname2@gmail.com |
| name321@outlook.com |
Final Note
In this notes we have seen how important it is to get a tidy data and how can we achieve it by using tidyr package. We have looked at how to convert long table into wide and wide table to long,separating a variable and uniting multiple variables.