M05-2-Application: Data Wrangling with Tidyverse in R

Author

Nicole Bewley-Hudson

Published

March 8, 2026

1 Ex

1.1 Convert airquality to a tibble

Rows: 153
Columns: 6
$ Ozone   <int> 41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16, 11, 14, 18, 14, …
$ Solar.R <int> 190, 118, 149, 313, NA, NA, 299, 99, 19, 194, NA, 256, 290, 27…
$ Wind    <dbl> 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6, 6.9, 9…
$ Temp    <int> 67, 72, 74, 62, 56, 66, 65, 59, 61, 69, 74, 69, 66, 68, 58, 64…
$ Month   <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
$ Day     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…

1.2 Create a data frame from the tibble

  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

# A tibble: 6 × 3
  Ozone  Temp  Wind
  <int> <int> <dbl>
1    41    67   7.4
2    36    72   8  
3    12    74  12.6
4    18    62  11.5
5    NA    56  14.3
6    28    66  14.9

1.3 Find the Ozone amount when Temp is 97

Note

This question is solved using both Base R and Tidyverse. The goal is to identify the Ozone amount when the temperature equals 97.

Notes:

First, the entire row where Temp = 97 will be identified.
Then the Ozone value will be extracted from that row.
Both Base R and Tidyverse approaches are demonstrated.

# Base R: show the entire row where Temp is 97
air.tib[air.tib$Temp == 97, ]

# A tibble: 1 × 6
  Ozone Solar.R  Wind  Temp Month   Day
  <int>   <int> <dbl> <int> <int> <int>
1    76     203   9.7    97     8    28

# Base R: show only the Ozone value where Temp is 97
air.tib[air.tib$Temp == 97, "Ozone"]

# A tibble: 1 × 1
  Ozone
  <int>
1    76

# Tidyverse: filter the row where Temp equals 97
air.tib |>
  filter(Temp == 97)

# A tibble: 1 × 6
  Ozone Solar.R  Wind  Temp Month   Day
  <int>   <int> <dbl> <int> <int> <int>
1    76     203   9.7    97     8    28

# Tidyverse: filter Temp == 97 then select the Ozone column
air.tib |>
  filter(Temp == 97) |>
  select(Ozone)

# A tibble: 1 × 1
  Ozone
  <int>
1    76

2 Ex

2.1 Select the top 10 worst arrival delays for each day and calculate the monthly average delay

Note

In this exercise, I first identify the top 10 worst arrival delays for each day.
Then I remove the grouping and calculate the average delay for each month.

# A tibble: 3,650 × 19
# Groups:   year, month, day [365]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      848           1835       853     1001           1950
 2  2013     1     1     2343           1724       379      314           1938
 3  2013     1     1     1815           1325       290     2120           1542
 4  2013     1     1     1842           1422       260     1958           1535
 5  2013     1     1     2115           1700       255     2330           1920
 6  2013     1     1     2205           1720       285       46           2040
 7  2013     1     1     2006           1630       216     2230           1848
 8  2013     1     1     2312           2000       192       21           2110
 9  2013     1     1     1942           1705       157     2124           1830
10  2013     1     1     1938           1703       155     2109           1823
# ℹ 3,640 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Month	Average Delay (minutes)
Average Arrival Delay by Month
1	200.0161
2	198.6214
3	202.1194
4	220.4000
5	201.6613
6	249.4167
7	254.1290
8	194.8839
9	166.0233
10	161.4903
11	161.8767
12	225.0581

2.2 Visualize the average delay by month

The chart shows that the average arrival delay among the top 10 worst delayed flights varies across the months. Summer months, especially June and July, appear to have the highest average delays, while some fall months have lower average delays. This suggests that delays may be affected by seasonal travel patterns, heavier traffic, and possible weather-related disruptions during busier times of the year.

3 Ex

The final goal of this exercise is to visualize the data. To do so, you will want to understand the data and determine the kind of relationship you would like to visualize. Next you will want to wrangle data to support the visualization objective. In the following sub-questions, you will be asked a step-by-step process. Organize well in your Quarto Markdown by showing sub-question numbers: 3.1, 3.2, 3.3, 3.4, and 3.5 right above each code chunk, except for 3.3. Note that the 3.3 activity doesn’t require coding, but you need to provide your answer as text in the text area, not inside the code chunk. This is because When your response is longer than one line inside the code chunk, the response will go out of the boundary of the chunk highlight when knitted.

3.1 Explore the relig_income data

Note

In this section, I examine the structure, size, variables, and missing values in the `relig_income` dataset before reshaping it for visualization.

# A tibble: 6 × 11
  religion  `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
  <chr>       <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
1 Agnostic       27        34        60        81        76       137        122
2 Atheist        12        27        37        52        35        70         73
3 Buddhist       27        21        30        34        33        58         62
4 Catholic      418       617       732       670       638      1116        949
5 Don’t kn…      15        14        15        11        10        35         21
6 Evangeli…     575       869      1064       982       881      1486        949
# ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
#   `Don't know/refused` <dbl>

Rows: 18
Columns: 11
$ religion             <chr> "Agnostic", "Atheist", "Buddhist", "Catholic", "D…
$ `<$10k`              <dbl> 27, 12, 27, 418, 15, 575, 1, 228, 20, 19, 289, 29…
$ `$10-20k`            <dbl> 34, 27, 21, 617, 14, 869, 9, 244, 27, 19, 495, 40…
$ `$20-30k`            <dbl> 60, 37, 30, 732, 15, 1064, 7, 236, 24, 25, 619, 4…
$ `$30-40k`            <dbl> 81, 52, 34, 670, 11, 982, 9, 238, 24, 25, 655, 51…
$ `$40-50k`            <dbl> 76, 35, 33, 638, 10, 881, 11, 197, 21, 30, 651, 5…
$ `$50-75k`            <dbl> 137, 70, 58, 1116, 35, 1486, 34, 223, 30, 95, 110…
$ `$75-100k`           <dbl> 122, 73, 62, 949, 21, 949, 47, 131, 15, 69, 939, …
$ `$100-150k`          <dbl> 109, 59, 39, 792, 17, 723, 48, 81, 11, 87, 753, 4…
$ `>150k`              <dbl> 84, 74, 53, 633, 18, 414, 54, 78, 6, 151, 634, 42…
$ `Don't know/refused` <dbl> 96, 76, 54, 1489, 116, 1529, 37, 339, 37, 162, 13…

# A tibble: 1 × 3
   Rows Columns Missing_Values
  <int>   <int>          <int>
1    18      11              0

3.2 View the help file for relig_income

Religious Landscape Study

The relig_income dataset comes from a Pew Research survey that examines the relationship between religion and household income. Each row represents a religion category, while each column represents an income bracket. The values in the table represent the number of respondents in each income category for that religion.

Because the income categories are stored as column names, the dataset is currently in a
wide format rather than tidy format. For tidy data, each variable should have its own
column, each observation should have its own row, and each value should have its own
cell. To properly analyze and visualize this dataset, the income columns need to be
reshaped into a single column representing income brackets with another column
representing the counts of respondents.

3.3 Identify the relationship you want to visualize

The relationship I want to visualize is how the distribution of household income varies across different religious groups. Specifically, I want to see which religions have higher proportions of respondents in higher income brackets compared to lower income brackets. This will help illustrate any potential correlations between religious affiliation and income levels.

3.4 Reshape the data and visualize income distribution by religion

# A tibble: 6 × 3
  religion income  count
  <chr>    <chr>   <dbl>
1 Agnostic <$10k      27
2 Agnostic $10-20k    34
3 Agnostic $20-30k    60
4 Agnostic $30-40k    81
5 Agnostic $40-50k    76
6 Agnostic $50-75k   137

Figure 1. Income Distribution by Religion (Stacked View)

This chart displays the distribution of respondents across income brackets for all religious groups combined. Each color represents a different religion, allowing us to compare how income categories are distributed across the entire population. While this visualization shows the overall distribution, it can become visually crowded because many religions are stacked together.

Figure 2. Income Distribution by Religion (Faceted View)

This visualization separates each religion into its own panel, making it easier to observe the income distribution within each religious group individually. By isolating the groups, patterns become clearer and comparisons between income brackets within each religion are easier to interpret.

4 Ex

4.1 Explore the us_rent_income data

In this section, I explore the structure and characteristics of the `us_rent_income` dataset in order to understand the variables and determine how the data should be reshaped for visualization.

# A tibble: 6 × 5
  GEOID NAME    variable estimate   moe
  <chr> <chr>   <chr>       <dbl> <dbl>
1 01    Alabama income      24476   136
2 01    Alabama rent          747     3
3 02    Alaska  income      32940   508
4 02    Alaska  rent         1200    13
5 04    Arizona income      27517   148
6 04    Arizona rent          972     4

Measure	Value
Summary of us_rent_income Dataset
Rows	104
Columns	5
Missing Values	2

4.2 View the help file for us_rent_income

The us_rent_income dataset provides estimates of household income and gross rent for U.S. states. The GEOID variable identifies the geographic region, while NAME provides the name of the state. The variable column indicates whether the observation refers to income or rent. The estimate column contains the estimated value, and moe represents the margin of error for that estimate. Because income and rent values are stored as rows within the variable column rather than separate columns, the data is currently not structured in the ideal format for directly visualizing the relationship between income and rent. To visualize the relationship between income and rent, the data will need to be reshaped so that income and rent values are in separate columns, allowing for easier comparison and analysis of how these two variables relate across different states.

4.3 Does the current data shape support the visualization goal?

No, the current data shape does not support the visualization goal because the variable column contains both income and rent values in a long format. To visualize the relationship between income and rent effectively, it would be more useful to have these as separate columns (e.g., income and rent) for each state. This would allow for easier comparison and analysis of how income and rent relate to each other across different states. The current structure requires additional wrangling to reshape the data into a format that is suitable for visualization.

For tidy data, each variable should have its own column and each observation should occupy its own row. In this case, income and rent should be stored in separate columns for each state. To achieve this structure, the data needs to be reshaped from long format to wide format using the pivot_wider() function. This transformation will allow income and rent estimates to appear side-by-side for each state, making it possible to visualize the relationship between the two variables.

4.4 Reshape the data and visualize the relationship between income and rent

Note

Code chunk to reshape

The scatterplot shows a positive relationship between estimated household income and estimated gross rent across states. In general, states with higher household income tend to also have higher average rents. This pattern suggests that housing costs typically increase as income levels rise. However, some variation is visible across states, which may reflect differences in housing markets, cost of living, and regional economic conditions.

4.5 One long chain of wrangling and visualization

Note

Visualization code

The scatterplot shows a positive relationship between household income and rent across states. States with higher estimated household income tend to also have higher average rents. This pattern suggests that housing costs generally increase as income levels rise. However, there may be some variation in this relationship due to factors such as local housing markets, cost of living, and regional economic conditions. Overall, the visualization indicates that income and rent are correlated, with higher income states typically experiencing higher rents.

5 Ex

5.1 Create the dataset

  player year  stats
1      A    1 22/2/3
2      A    2 29/3/4
3      B    1 18/6/7
4      B    2 11/1/2
5      C    1 12/1/1
6      C    2 19/2/4

This creates a dataset like: | player | year | stats | |——–|——|———| | A | 1 | 22/2/3 | | A | 2 | 29/3/4 |

Where stats contains: points/assists/rebounds for each player in each year.

5.2 Separate the stats column

  player year points assists steals
1      A    1     22       2      3
2      A    2     29       3      4
3      B    1     18       6      7
4      B    2     11       1      2
5      C    1     12       1      1
6      C    2     19       2      4

This code uses the separate() function to split the stats column into three new columns: points, assists, and steals. The sep = "/" argument indicates that the values in the stats column are separated by slashes, which is how the function knows where to split the data. After running this code, the resulting dataset will have separate columns for points, assists, and steals for each player and year.

5.3 Unite the columns back into one stats column

  player year  stats
1      A    1 22/2/3
2      A    2 29/3/4
3      B    1 18/6/7
4      B    2 11/1/2
5      C    1 12/1/1
6      C    2 19/2/4

This code uses the unite() function to combine the points, assists, and steals columns back into a single stats column. The col = stats argument specifies the name of the new column that will contain the combined values. The points:steals argument indicates that all columns from points to steals should be united, and sep = "/" specifies that the values should be separated by slashes in the resulting stats column. After running this code, the dataset will return to its original format with a single stats column containing the combined values for points, assists, and steals.

6 Ex

6.1 Filter non-canceled flights and identify the day with the latest first departure in each month

This code filters out canceled flights (where dep_time is not NA) and then groups the data by month and day to find the latest departure time for the first flight of each day. The slice_max() function is used to select the row with the maximum departure time for each group, effectively identifying the day with the latest first departure in each month.

Note

In this section, I first remove canceled flights. Then I calculate the first and last departure times for each day. After that, I identify one day in each month where the first flight departure was the latest and display the results in descending order.

Year	Month	Day	First Departure	Last Departure	Number of Flights
Day with the Latest First Departure in Each Month
2013	2	9	901	2358	291
2013	1	20	525	2356	782
2013	9	29	521	2356	911
2013	10	6	519	2346	914
2013	11	3	519	2358	900
2013	3	3	509	2353	910
2013	12	18	500	2358	953
2013	5	29	457	2352	969
2013	6	15	456	2356	795
2013	4	30	455	2351	958
2013	8	31	455	2359	676
2013	7	5	38	2358	819

6.2 Find the top 5 destinations

In this section, I group the flights data by destination and count the number of flights to each destination. Then I use slice_max() to select the top 5 destinations with the highest flight counts. Finally, I display the results in a formatted table.

Destination	Number of Flights
Top 5 Destinations by Number of Flights
ORD	17283
ATL	17215
LAX	16174
BOS	15508
MCO	14082

The results show the top 5 destinations with the highest number of flights. These destinations are likely major hubs or popular travel locations, which is why they have a high volume of flights. The table provides a clear summary of the most frequently traveled destinations from the dataset.

7 Ex

7.1 Check missing values in Solar.R and replace them with the mean

This code checks for missing values in the Solar.R column of the airquality dataset. If any missing values are found, it calculates the mean of the non-missing values in the Solar.R column and replaces the missing values with this mean. Finally, it confirms that there are no more missing values in the Solar.R column.

# A tibble: 1 × 2
  missing_values percent_missing
           <int>           <dbl>
1              7            4.58

[1] 185.9315

# A tibble: 1 × 1
  remaining_missing
              <int>
1                 0

8 Ex

In this section, I calculate the total temperature for each month by weighting the count with Temp. Then I convert the numeric month values into month names, change the month variable to a factor, reorder the months by total temperature in descending order, and visualize the results in a horizontal bar chart.

8.1 Alternative visualization

This code achieves the same goal as the previous method but uses group_by() and summarise() to calculate the total temperature for each month instead of count(). The remaining steps—recoding the month names, converting the variable to a factor, reordering the months, and visualizing the results—are the same as before.

9 Ex

9.1 Find the maxium temperature for each month and visualize the results

This code groups the air.tib dataset by month and calculates the maximum temperature for each month using summarise(max_temp = max(Temp)). It then recodes the numeric month values into month names, converts the month variable into a factor, reorders the months by maximum temperature in descending order, and visualizes the results in a horizontal bar chart.

10 Ex

10.1 Create temp_wind and visualize the relationships between Ozone and selected predictors

This code creates a new variable temp_wind by dividing the Temp column by the Wind column in the air.tib dataset. It then reshapes the data to compare the relationships between Ozone and three predictors: temperature, wind speed, and the combined temp_wind variable. A faceted scatterplot is used to visualize these relationships.

The scatterplots show the relationships between Ozone levels and three predictors: temperature, wind speed, and the combined temp_wind variable. The plot for temperature shows a positive relationship, indicating that higher temperatures are associated with higher Ozone levels. The plot for wind speed shows a negative relationship, suggesting that higher wind speeds are associated with lower Ozone levels. The temp_wind plot combines these effects and appears to show a stronger positive relationship with Ozone. Overall, these visualizations help illustrate how different weather variables relate to Ozone concentrations in the air.