What do we have to clean and manipulate our data?

The data we work with is rarely in the ideal format for analysis. Different software packages often require data in different formats, which can be a challenge. To work efficiently, we need reliable tools to reshape and reformat data as needed, allowing us to focus on tasks like creating visualizations and building models.

What packages will we use?

The dplyr and tidyr packages in R offer a comprehensive and highly effective collection of functions for reshaping data efficiently. We need to call/install them first before we go to the cleaning and manipulation.

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.2.3

Common functions

Many forms of data shapes can be conducted, depending on the analyses, but commonly they will fall into these categories:

Functions

Most commonly used functions:

Pipe operator %>%

Both tidyr and dplyr make use of the pipe operator %>%, which might seem unfamiliar at first. This operator is a powerful tool for efficiently linking multiple operations together. It works by taking the result of one step and passing it as the input to the next, creating a clear and readable sequence of data transformations.

For example, if we want to filter and select data. Instead of doing these actions separately:

df <- mtcars #default data from R

df_filtered <- filter(df, mpg < 21) #take the data from variable mpg within df data that contain value less than 21
head(df_filtered)
##                    mpg cyl  disp  hp drat   wt  qsec vs am gear carb
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.44 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.46 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.57 15.84  0  0    3    4
## Merc 280          19.2   6 167.6 123 3.92 3.44 18.30  1  0    4    4
## Merc 280C         17.8   6 167.6 123 3.92 3.44 18.90  1  0    4    4
## Merc 450SE        16.4   8 275.8 180 3.07 4.07 17.40  0  0    3    3
df_selected <- select(df_filtered, hp) #take only the data from column name 'hp', delete  other columns
head(df_selected)
##                    hp
## Hornet Sportabout 175
## Valiant           105
## Duster 360        245
## Merc 280          123
## Merc 280C         123
## Merc 450SE        180

We can just do this:

df_cleaned <- df %>%
  filter(mpg < 21) %>%
  select(hp)

head(df_cleaned)
##                    hp
## Hornet Sportabout 175
## Valiant           105
## Duster 360        245
## Merc 280          123
## Merc 280C         123
## Merc 450SE        180

Selecting or removing columns: select()

Check columns for mtcars data frame or our variable df.

head(df) #check the first six rows of the data
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

If we want to select certain columns (mpg, cyl, disp):

df_1 <- df %>%
  select(mpg, cyl, disp)

head(df_1)
##                    mpg cyl disp
## Mazda RX4         21.0   6  160
## Mazda RX4 Wag     21.0   6  160
## Datsun 710        22.8   4  108
## Hornet 4 Drive    21.4   6  258
## Hornet Sportabout 18.7   8  360
## Valiant           18.1   6  225

If we don’t want certain columns (mpg, cyl, disp):

df_2 <- df %>%
  select(-mpg, -cyl, -disp)

head(df_2)
##                    hp drat    wt  qsec vs am gear carb
## Mazda RX4         110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     110 3.90 2.875 17.02  0  1    4    4
## Datsun 710         93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 175 3.15 3.440 17.02  0  0    3    2
## Valiant           105 2.76 3.460 20.22  1  0    3    1

Adding columns: mutate()

We want to create a new column within df, named watt, containing converted value of power in horsepower (hp) unit to watt (W) unit, by multiplying hp values with 745.7.

df_3 <- df %>%
  mutate(watt = hp*747.7)

head(df_3)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb     watt
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4  82247.0
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4  82247.0
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1  69536.1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1  82247.0
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 130847.5
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1  78508.5

Grouping and summarising data: group_by() and summarise()

We want to know the mean of mpg (miles per gallon) based on type of transmission (0 = automatic, 1 = manual).

df_4 <- df %>%
  group_by(am) %>%
  summarise(mean(mpg))

head(df_4)
## # A tibble: 2 × 2
##      am `mean(mpg)`
##   <dbl>       <dbl>
## 1     0        17.1
## 2     1        24.4

Filtering rows: filter()

We use filter() to extract rows in a data frame that satisfy certain criteria. For example, we want to filter data that contain cars with manual transmission (am = 1).

df_5 <- df %>%
  filter(am == 1)

head(df_5)
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1

Renaming columns: rename()

If we want to change the name of a column, e.g. disp to display, we can do this by applying:

df_6 <- df %>%
  rename(display = disp)

head(df_6)
##                    mpg cyl display  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6     160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6     160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4     108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6     258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8     360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6     225 105 2.76 3.460 20.22  1  0    3    1