Preamble

In this notes we shall look at how best we can use tidyr for structuring our data in desired format which may be called as ‘Tidy data’. This is most important while we aim at creating visualization and modeling. We can perform tasks such as converting columns to rows, rows to columns with tidyr

Need for Structuring and tidyr’s role

When we wish to analyze a data, we must first decide which format our data should be in order to make full use of the data. i.e we must determine what each row and column will represent. For this tidyr package (Which is part of Tidyverse) is used. It follows 3 principle for tidy data

  1. Each variable is represented in column
  2. Each observation is represented in row
  3. Each value will be in cell

Keeping these guidelines in mind we structure the data using tidyr.

Structuring data with tidyr

Loading the package and dataset

library(tidyr)

We shall use us_rent_income dataset which is in tidyr package

data1=us_rent_income
data2=data1[c("NAME","variable","estimate")]
We consider the variables ‘NAME’,‘variable’ and ‘estimate’ from the data for example, let us have a look at first few rows of data
NAME variable estimate
Alabama income 24476
Alabama rent 747
Alaska income 32940
Alaska rent 1200
Arizona income 27517
Arizona rent 972

tidyr is used to change the data into tidy one based on the principles given above, we shall see the actions which can be performed by tidyr one by one

Changing from rows to columns (Long to Wide)

A data which is having more number of rows because of more categories of a sigle variable is called as long data.

Let us look at an example of long data

NAME variable estimate
Alabama income 24476
Alabama rent 747
Alaska income 32940
Alaska rent 1200
Arizona income 27517
Arizona rent 972
Arkansas income 23789
Arkansas rent 709
California income 29454
California rent 1358

Here we have “variable” which is categorical and because of this we have long data

A data of long format can be transformed into wide format using pivot_wider() function , it considers a key column whose features are spread in to multiple columns which reduces number of rows. Arguments of this function are
names_from- we indicate the variable which we wish to spread across columns to convert long data in to wide
values_from- The values of the variables for each observation

datas=pivot_wider(data2,names_from=variable,values_from = estimate)
NAME income rent
Alabama 24476 747
Alaska 32940 1200
Arizona 27517 972
Arkansas 23789 709
California 29454 1358
Colorado 32401 1125

Changing from columns to rows (Wide to Long)

We can change wide data into long one using pivot_longer().

we specify new variable name with which we mention the categories in names_to and values_to is specified for its values and we specify which column to leave as it is using ‘-’ operator.

data_long=pivot_longer(datas,names_to='variable_category',values_to='estimate',-NAME)
Let us look at some rows of data
NAME variable_category estimate
Alabama income 24476
Montana income 26249
Alabama rent 747
Montana rent 751
Alaska income 32940
Nebraska income 30020
Alaska rent 1200
Nebraska rent 773
Arizona income 27517
Nevada income 29019

separate

It is needed sometimes to separate a variable (which is character) in to two columns. This can be achieved using separate() function. We shall use the following data

Name sample.mail phone
name1 1234567890
name2 9876543210
name3 6234115600

We specify which column to separate using its name or position using col and names of the separated columns we are forming using into, we must specify which character to use for splitting string using sep

sep=separate(sample1,col=sample.mail,into=c("name","service"),sep='@')
Let us look at the data
Name name service phone
name1 awesomename1 yahoo.com 1234567890
name2 fabulousname2 gmail.com 9876543210
name3 name321 outlook.com 6234115600

we can keep the original column which we are separating using the following

sep2=separate(sample1,sample.mail,c("name","service"),remove=F,sep='@')
Let us look at the data
Name sample.mail name service phone
name1 awesomename1 yahoo.com 1234567890
name2 fabulousname2 gmail.com 9876543210
name3 name321 outlook.com 6234115600

unite

We can unite two columns in to one column by using unite(), we provide data and name of new column using col argument, we can also give separator between them using sep

united=unite(sep[c('name','service')],col=email_ID,sep='@')
email_ID

Final Note

In this notes we have seen how important it is to get a tidy data and how can we achieve it by using tidyr package. We have looked at how to convert long table into wide and wide table to long,separating a variable and uniting multiple variables.