Introduction

In this exercise document, we will explore the airquality and mtcars datasets, using the dplyr package. Our goal is to demonstrate the power of dplyr functions for data manipulation by applying various operations such as filtering, selecting, grouping, summarizing, and arranging on both datasets. We will dive into the datasets, extract insights, and make meaningful observations using these powerful functions.

Loading the Datasets

Let’s load both the airquality and mtcars datasets into R to begin.

# Load the Needed Libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Load the 'airquality' dataset
data(airquality)

# Load the 'mtcars' dataset
data(mtcars)

# Check the data structure
str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

str(airquality)

## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

The Pipe Operator %>%

Before going the dplyr practice, we need to explain what the pipe operator is.

The pipe operator, represented by %>%, is a powerful tool introduced by the dplyr package in R. It allows you to chain multiple operations together in a more readable and efficient manner. With the pipe operator, you can pass the output of one function as the first argument to another function, reducing the need for intermediate assignments and making the code easier to follow.

Syntax:

The syntax of the pipe operator is as follows:

output <- input %>% function1() %>% function2() %>% ... %>% functionN()

Explanation:

input: This is the initial data or object that serves as the input to the first function in the chain.
%>%: The pipe operator takes the output from the left-hand side (LHS) and passes it as the first argument to the function on the right-hand side (RHS).
function1(): This is the first dplyr function that will be applied to the input data.
function2(): This is the second dplyr function, which takes the output of function1() as its input.
...: You can continue chaining as many dplyr functions as needed, and each function will operate on the output of the previous one.
functionN(): This is the last dplyr function in the chain, and its output will be stored in the output variable.

Now we can start with the Filter and Select functions in dplyr.

Filtering and Selecting Columns

Task: Use the filter function to extract data for the month of July from the ‘airquality’ dataset and select specific columns for analysis.

# Filter data for the month of July and select Ozone, Temp, Wind only
july_air_quality <- airquality %>%
  filter(Month == 7) %>%
  select(Ozone, Temp, Wind)

# Show the first 6 rows of the july_air_quality
head(july_air_quality)

##   Ozone Temp Wind
## 1   135   84  4.1
## 2    49   85  9.2
## 3    32   81  9.2
## 4    NA   84 10.9
## 5    64   83  4.6
## 6    40   83 10.9

Task: Filter the records with both month = 7 and temperature over 80.

# Do the same as previous, but for the temperatures over 80
july_air_quality <- airquality %>%
  filter(Month == 7 & Temp >80) %>%
  select(Ozone, Temp, Wind)

# Show the first 6 rows of the july_air_quality
head(july_air_quality)

##   Ozone Temp Wind
## 1   135   84  4.1
## 2    49   85  9.2
## 3    32   81  9.2
## 4    NA   84 10.9
## 5    64   83  4.6
## 6    40   83 10.9

Task: Oh, my bad. I was gonna say month = 7 OR temperature over 80. Sorry. Can you revise it?

# Do the same as previous, but with the difference of 'OR'
july_air_quality <- airquality %>%
  filter(Month == 7 | Temp >80) %>%
  select(Ozone, Temp, Wind)

# Show the first 6 rows of the july_air_quality
head(july_air_quality)

##   Ozone Temp Wind
## 1    45   81 14.9
## 2    NA   84  9.2
## 3    NA   85  8.6
## 4    29   82  9.7
## 5    NA   87  6.9
## 6    71   90 13.8

Perfect! Let’s do more filtering! But this time, on MTCARS!

Task: I’m more interested in Merc 240 and Mazda RX4 cars. Can you filter them in mtcars dataset? I believe they have Merc in their name. Also, you need to convert row names as column.

# Convert row names to a new column 'Car_Name'
mtcars_with_car_name <- mtcars %>%
  mutate(Car_Name = rownames(.))

# Show the first 6 rows
head(mtcars_with_car_name)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
##                            Car_Name
## Mazda RX4                 Mazda RX4
## Mazda RX4 Wag         Mazda RX4 Wag
## Datsun 710               Datsun 710
## Hornet 4 Drive       Hornet 4 Drive
## Hornet Sportabout Hornet Sportabout
## Valiant                     Valiant

# Filter the cars with specific countries in the name
selected_brands <- c("Merc 240D", "Mazda RX4")
Merc_and_RX4 <- mtcars_with_car_name %>%
  filter(Car_Name %in% selected_brands)

top_n As A Filter

Task: I want to see the most powerful cars in the data! Filter only the top 3 cars by HP!

# Filter the most powerful ones as the_muscles
the_muscles <- mtcars %>%
  select(hp) %>% 
  top_n(3, hp) 

# Show "the_muscles"
the_muscles

##                 hp
## Duster 360     245
## Camaro Z28     245
## Ford Pantera L 264
## Maserati Bora  335

Using Arrange

Task: Also arrange the top 3 cars by hp!

# Arrange the_muscles ascending
the_muscles %>% arrange(hp)

##                 hp
## Duster 360     245
## Camaro Z28     245
## Ford Pantera L 264
## Maserati Bora  335

Task: No this is what I want! I want them to be arranged in DESCending order!

# Arrange the_muscles descending
the_muscles %>% arrange(desc(hp))

##                 hp
## Maserati Bora  335
## Ford Pantera L 264
## Duster 360     245
## Camaro Z28     245

Wow that was great! You are truly a DPLYR wizard! 🪄

But we are not done yet! I have still tasks for you…

Using Group By and Summarize

Task : Group the cars by the number of cylinders and calculate the average MPG for each group

# Group the cars by the number of cyls and calculate the avg MPG for each
grouped_avg_mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg))
# Show grouped_avg_mpg_by_cyl
grouped_avg_mpg_by_cyl

## # A tibble: 3 × 2
##     cyl avg_mpg
##   <dbl>   <dbl>
## 1     4    26.7
## 2     6    19.7
## 3     8    15.1

Task: I know it doesn’t make a lot of sense but, I really wonder the total and average hp of small cars engine cars (4cyl) in our dataset. Can you calculate?

# Calculate and record as mtcars_hps
mtcars_hps <- mtcars %>% 
  filter(cyl == 4) %>% 
  summarise(total_hp=sum(hp), avg_hp = mean(hp))
# Show the result of mtcars_hps
mtcars_hps

##   total_hp   avg_hp
## 1      909 82.63636

Using Mutate

Did you realize that summarize created new columns but in a grouped way. Wha if I just need a new column based on an existing column?

Task: MPG is good for U.S but not too clear for Europeans. Let’s add a column for them and convert MPG to liters per 100kms.

(Use this formula = 235.215 / mpg)

# Convert MPG to L/100km using mutate, record as mtcars_converted
mtcars_converted <- mtcars %>%
  mutate(L_per_100km = 235.215 / mpg)
# Show the mtcars_converted
head(mtcars_converted)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
##                   L_per_100km
## Mazda RX4            11.20071
## Mazda RX4 Wag        11.20071
## Datsun 710           10.31645
## Hornet 4 Drive       10.99136
## Hornet Sportabout    12.57834
## Valiant              12.99530

Nice! Much better for internationals! 👍😎

Using Rename

Yet, L_per_100km is still not so good, not seem so clear to me…

Task: Rename it as Liters_per_100km.

# Rename with the variable name mtcars_renamed
mtcars_renamed <- mtcars %>%
  rename(Liters_per_100km = mpg)
# Show the result
mtcars_renamed

##                     Liters_per_100km cyl  disp  hp drat    wt  qsec vs am gear
## Mazda RX4                       21.0   6 160.0 110 3.90 2.620 16.46  0  1    4
## Mazda RX4 Wag                   21.0   6 160.0 110 3.90 2.875 17.02  0  1    4
## Datsun 710                      22.8   4 108.0  93 3.85 2.320 18.61  1  1    4
## Hornet 4 Drive                  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3
## Hornet Sportabout               18.7   8 360.0 175 3.15 3.440 17.02  0  0    3
## Valiant                         18.1   6 225.0 105 2.76 3.460 20.22  1  0    3
## Duster 360                      14.3   8 360.0 245 3.21 3.570 15.84  0  0    3
## Merc 240D                       24.4   4 146.7  62 3.69 3.190 20.00  1  0    4
## Merc 230                        22.8   4 140.8  95 3.92 3.150 22.90  1  0    4
## Merc 280                        19.2   6 167.6 123 3.92 3.440 18.30  1  0    4
## Merc 280C                       17.8   6 167.6 123 3.92 3.440 18.90  1  0    4
## Merc 450SE                      16.4   8 275.8 180 3.07 4.070 17.40  0  0    3
## Merc 450SL                      17.3   8 275.8 180 3.07 3.730 17.60  0  0    3
## Merc 450SLC                     15.2   8 275.8 180 3.07 3.780 18.00  0  0    3
## Cadillac Fleetwood              10.4   8 472.0 205 2.93 5.250 17.98  0  0    3
## Lincoln Continental             10.4   8 460.0 215 3.00 5.424 17.82  0  0    3
## Chrysler Imperial               14.7   8 440.0 230 3.23 5.345 17.42  0  0    3
## Fiat 128                        32.4   4  78.7  66 4.08 2.200 19.47  1  1    4
## Honda Civic                     30.4   4  75.7  52 4.93 1.615 18.52  1  1    4
## Toyota Corolla                  33.9   4  71.1  65 4.22 1.835 19.90  1  1    4
## Toyota Corona                   21.5   4 120.1  97 3.70 2.465 20.01  1  0    3
## Dodge Challenger                15.5   8 318.0 150 2.76 3.520 16.87  0  0    3
## AMC Javelin                     15.2   8 304.0 150 3.15 3.435 17.30  0  0    3
## Camaro Z28                      13.3   8 350.0 245 3.73 3.840 15.41  0  0    3
## Pontiac Firebird                19.2   8 400.0 175 3.08 3.845 17.05  0  0    3
## Fiat X1-9                       27.3   4  79.0  66 4.08 1.935 18.90  1  1    4
## Porsche 914-2                   26.0   4 120.3  91 4.43 2.140 16.70  0  1    5
## Lotus Europa                    30.4   4  95.1 113 3.77 1.513 16.90  1  1    5
## Ford Pantera L                  15.8   8 351.0 264 4.22 3.170 14.50  0  1    5
## Ferrari Dino                    19.7   6 145.0 175 3.62 2.770 15.50  0  1    5
## Maserati Bora                   15.0   8 301.0 335 3.54 3.570 14.60  0  1    5
## Volvo 142E                      21.4   4 121.0 109 4.11 2.780 18.60  1  1    4
##                     carb
## Mazda RX4              4
## Mazda RX4 Wag          4
## Datsun 710             1
## Hornet 4 Drive         1
## Hornet Sportabout      2
## Valiant                1
## Duster 360             4
## Merc 240D              2
## Merc 230               2
## Merc 280               4
## Merc 280C              4
## Merc 450SE             3
## Merc 450SL             3
## Merc 450SLC            3
## Cadillac Fleetwood     4
## Lincoln Continental    4
## Chrysler Imperial      4
## Fiat 128               1
## Honda Civic            2
## Toyota Corolla         1
## Toyota Corona          1
## Dodge Challenger       2
## AMC Javelin            2
## Camaro Z28             4
## Pontiac Firebird       2
## Fiat X1-9              1
## Porsche 914-2          2
## Lotus Europa           2
## Ford Pantera L         4
## Ferrari Dino           6
## Maserati Bora          8
## Volvo 142E             2

Using Count and Add Count

Task: We want to see the counts of observations in each month, can you count them without making a grouping?

# Add the number of observations for each Month as airquality_with_count
airquality_with_count <- airquality %>%
  add_count(Month)
# Show the first 6 rows of airquality_with_count
head(airquality_with_count)

##   Ozone Solar.R Wind Temp Month Day  n
## 1    41     190  7.4   67     5   1 31
## 2    36     118  8.0   72     5   2 31
## 3    12     149 12.6   74     5   3 31
## 4    18     313 11.5   62     5   4 31
## 5    NA      NA 14.3   56     5   5 31
## 6    28      NA 14.9   66     5   6 31

It was nice to add this. But I need also a pivot table with only observation count per month.

Task: Create a summary table which includes counts of observations and monthly average temp.

# Createa a pivot table as monthly_observation_count
monthly_observation_count <- airquality %>%
  group_by(Month) %>%
  count()
# Show the result of monthly_observation_count
monthly_observation_count

## # A tibble: 5 × 2
## # Groups:   Month [5]
##   Month     n
##   <int> <int>
## 1     5    31
## 2     6    30
## 3     7    31
## 4     8    31
## 5     9    30

Task: Can you put the average temp also to the table?

# Calculate and record as monthly_observation_count
monthly_observation_count <- airquality %>%
  group_by(Month) %>%
  summarise(Nr_of_Obs = n(), 
            Avg_Temp = mean(Temp))
# Show the result of monthly_observation_count
monthly_observation_count

## # A tibble: 5 × 3
##   Month Nr_of_Obs Avg_Temp
##   <int>     <int>    <dbl>
## 1     5        31     65.5
## 2     6        30     79.1
## 3     7        31     83.9
## 4     8        31     84.0
## 5     9        30     76.9

Some More Exercise!

Task: Group the July air quality data based on temperature category, calculate the average ozone levels and maximum wind speed for each category, and arrange the data based on average ozone levels in descending order.

# Group by temperature category, summarize average ozone levels and max wind speed, and arrange by average ozone levels as grouped_summarized_arranged_data
grouped_summarized_arranged_data <- july_air_quality %>%
  mutate(Temp_Category = ifelse(Temp >= 80, "Hot", "Moderate")) %>%
  group_by(Temp_Category) %>%
  summarise(avg_ozone = mean(Ozone, na.rm = TRUE), 
            max_wind = max(Wind, na.rm = TRUE)) %>%
  arrange(desc(avg_ozone))
# Show the result of grouped_summarized_arranged_data
grouped_summarized_arranged_data

## # A tibble: 2 × 3
##   Temp_Category avg_ozone max_wind
##   <chr>             <dbl>    <dbl>
## 1 Hot                63.1     15.5
## 2 Moderate           13       14.3

Task: Find the average hp by cylinder for the cars above mpg over 25 and arrange them in descending order.

# Filter cars with high MPG, group by number of cylinders, and calculate average horsepower as avg_hp_by_cyl
avg_hp_by_cyl <- mtcars %>%
  filter(mpg >= 25) %>%
  group_by(cyl) %>%
  summarise(avg_hp = mean(hp)) %>%
  arrange(desc(avg_hp))
# Show avg_hp_by_cyl
avg_hp_by_cyl

## # A tibble: 1 × 2
##     cyl avg_hp
##   <dbl>  <dbl>
## 1     4   75.5

Wrap Up and Conclusion

In this practice file, we practiced the fundamental dplyr functions which are quite useful for data summaries, filtering and transforming the data. Yet, these are not the only functions you can use. You can check this cheat sheet as a handy memory jogger and to see and practice the other functions. Also you should check the dplyr documentation in detail for more…

Why don’t you try those functions in another embedded R dataset called iris ?

Good luck!

Project Dplyr - Basic Dplyr Functions and Practices

Burak Demirtas

2023-06-05