The data we work with is rarely in the ideal format for analysis. Different software packages often require data in different formats, which can be a challenge. To work efficiently, we need reliable tools to reshape and reformat data as needed, allowing us to focus on tasks like creating visualizations and building models.
The dplyr and tidyr packages in R offer a
comprehensive and highly effective collection of functions for reshaping
data efficiently. We need to call/install them first before we go to the
cleaning and manipulation.
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.2.3
Many forms of data shapes can be conducted, depending on the analyses, but commonly they will fall into these categories:
Split: Split the data into logical groups
Apply: Calculate some summary statistic on each group (e.g. mean total length by year)
Combine: Combine the groups back together into a single table
Most commonly used functions:
mutate()
group_by()
summarise()
select()
filter()
rename()
%>%Both tidyr and dplyr make use of the pipe
operator %>%, which might seem unfamiliar at first. This
operator is a powerful tool for efficiently linking multiple operations
together. It works by taking the result of one step and passing it as
the input to the next, creating a clear and readable sequence of data
transformations.
For example, if we want to filter and
select data. Instead of doing these actions separately:
df <- mtcars #default data from R
df_filtered <- filter(df, mpg < 21) #take the data from variable mpg within df data that contain value less than 21
head(df_filtered)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.44 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
## Merc 280 19.2 6 167.6 123 3.92 3.44 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.44 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.07 17.40 0 0 3 3
df_selected <- select(df_filtered, hp) #take only the data from column name 'hp', delete other columns
head(df_selected)
## hp
## Hornet Sportabout 175
## Valiant 105
## Duster 360 245
## Merc 280 123
## Merc 280C 123
## Merc 450SE 180
We can just do this:
df_cleaned <- df %>%
filter(mpg < 21) %>%
select(hp)
head(df_cleaned)
## hp
## Hornet Sportabout 175
## Valiant 105
## Duster 360 245
## Merc 280 123
## Merc 280C 123
## Merc 450SE 180
select()Check columns for mtcars data frame or our variable
df.
head(df) #check the first six rows of the data
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
If we want to select certain columns (mpg, cyl, disp):
df_1 <- df %>%
select(mpg, cyl, disp)
head(df_1)
## mpg cyl disp
## Mazda RX4 21.0 6 160
## Mazda RX4 Wag 21.0 6 160
## Datsun 710 22.8 4 108
## Hornet 4 Drive 21.4 6 258
## Hornet Sportabout 18.7 8 360
## Valiant 18.1 6 225
If we don’t want certain columns (mpg, cyl, disp):
df_2 <- df %>%
select(-mpg, -cyl, -disp)
head(df_2)
## hp drat wt qsec vs am gear carb
## Mazda RX4 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 175 3.15 3.440 17.02 0 0 3 2
## Valiant 105 2.76 3.460 20.22 1 0 3 1
mutate()We want to create a new column within df, named
watt, containing converted value of power in horsepower
(hp) unit to watt (W) unit, by multiplying hp values with 745.7.
df_3 <- df %>%
mutate(watt = hp*747.7)
head(df_3)
## mpg cyl disp hp drat wt qsec vs am gear carb watt
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 82247.0
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 82247.0
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 69536.1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 82247.0
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 130847.5
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 78508.5
group_by() and
summarise()We want to know the mean of mpg (miles per gallon) based
on type of transmission (0 = automatic, 1 = manual).
df_4 <- df %>%
group_by(am) %>%
summarise(mean(mpg))
head(df_4)
## # A tibble: 2 × 2
## am `mean(mpg)`
## <dbl> <dbl>
## 1 0 17.1
## 2 1 24.4
filter()We use filter() to extract rows in a data frame that
satisfy certain criteria. For example, we want to filter data that
contain cars with manual transmission (am = 1).
df_5 <- df %>%
filter(am == 1)
head(df_5)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
rename()If we want to change the name of a column, e.g. disp to
display, we can do this by applying:
df_6 <- df %>%
rename(display = disp)
head(df_6)
## mpg cyl display hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1