Preamble

In this notes we shall look at how best we can use tidyr for structuring our data in desired format which may be called as ‘Tidy data’. This is most important while we aim at creating visualization and modeling. We can perform tasks such as converting columns to rows, rows to columns with tidyr

Need for Structuring and tidyr’s role

When we wish to analyze a data, we must first decide which format our data should be in order to make full use of the data. i.e we must determine what each row and column will represent. For this tidyr package (Which is part of Tidyverse) is used. It follows 3 principle for tidy data

Each variable is represented in column
Each observation is represented in row
Each value will be in cell

Keeping these guidelines in mind we structure the data using tidyr.

Structuring data with tidyr

Loading the package and dataset

library(tidyr)

We shall use us_rent_income dataset which is in tidyr package

data1=us_rent_income
data2=data1[c("NAME","variable","estimate")]

We consider the variables ‘NAME’,‘variable’ and ‘estimate’ from the data for example, let us have a look at first few rows of data

NAME	variable	estimate
Alabama	income	24476
Alabama	rent	747
Alaska	income	32940
Alaska	rent	1200
Arizona	income	27517
Arizona	rent	972

tidyr is used to change the data into tidy one based on the principles given above, we shall see the actions which can be performed by tidyr one by one

Changing from rows to columns (Long to Wide)

A data which is having more number of rows because of more categories of a sigle variable is called as long data.

Let us look at an example of long data

NAME	variable	estimate
Alabama	income	24476
Alabama	rent	747
Alaska	income	32940
Alaska	rent	1200
Arizona	income	27517
Arizona	rent	972
Arkansas	income	23789
Arkansas	rent	709
California	income	29454
California	rent	1358

Here we have “variable” which is categorical and because of this we have long data

A data of long format can be transformed into wide format using pivot_wider() function , it considers a key column whose features are spread in to multiple columns which reduces number of rows. Arguments of this function are
names_from- we indicate the variable which we wish to spread across columns to convert long data in to wide
values_from- The values of the variables for each observation

datas=pivot_wider(data2,names_from=variable,values_from = estimate)

NAME	income	rent
Alabama	24476	747
Alaska	32940	1200
Arizona	27517	972
Arkansas	23789	709
California	29454	1358
Colorado	32401	1125

Changing from columns to rows (Wide to Long)

We can change wide data into long one using pivot_longer().

we specify new variable name with which we mention the categories in names_to and values_to is specified for its values and we specify which column to leave as it is using ‘-’ operator.

data_long=pivot_longer(datas,names_to='variable_category',values_to='estimate',-NAME)

Let us look at some rows of data

NAME	variable_category	estimate
Alabama	income	24476
Montana	income	26249
Alabama	rent	747
Montana	rent	751
Alaska	income	32940
Nebraska	income	30020
Alaska	rent	1200
Nebraska	rent	773
Arizona	income	27517
Nevada	income	29019

separate

It is needed sometimes to separate a variable (which is character) in to two columns. This can be achieved using separate() function. We shall use the following data

Name	sample.mail	phone
name1	awesomename1@yahoo.com	1234567890
name2	fabulousname2@gmail.com	9876543210
name3	name321@outlook.com	6234115600

We specify which column to separate using its name or position using col and names of the separated columns we are forming using into, we must specify which character to use for splitting string using sep

sep=separate(sample1,col=sample.mail,into=c("name","service"),sep='@')

Let us look at the data

Name	name	service	phone
name1	awesomename1	yahoo.com	1234567890
name2	fabulousname2	gmail.com	9876543210
name3	name321	outlook.com	6234115600

we can keep the original column which we are separating using the following

sep2=separate(sample1,sample.mail,c("name","service"),remove=F,sep='@')

Let us look at the data

Name	sample.mail	name	service	phone
name1	awesomename1@yahoo.com	awesomename1	yahoo.com	1234567890
name2	fabulousname2@gmail.com	fabulousname2	gmail.com	9876543210
name3	name321@outlook.com	name321	outlook.com	6234115600

unite

We can unite two columns in to one column by using unite(), we provide data and name of new column using col argument, we can also give separator between them using sep

united=unite(sep[c('name','service')],col=email_ID,sep='@')

email_ID
awesomename1@yahoo.com
fabulousname2@gmail.com
name321@outlook.com

Final Note

In this notes we have seen how important it is to get a tidy data and how can we achieve it by using tidyr package. We have looked at how to convert long table into wide and wide table to long,separating a variable and uniting multiple variables.