Rows: 153
Columns: 6
$ Ozone <int> 41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16, 11, 14, 18, 14, …
$ Solar.R <int> 190, 118, 149, 313, NA, NA, 299, 99, 19, 194, NA, 256, 290, 27…
$ Wind <dbl> 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6, 6.9, 9…
$ Temp <int> 67, 72, 74, 62, 56, 66, 65, 59, 61, 69, 74, 69, 66, 68, 58, 64…
$ Month <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
$ Day <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
M05-2-Application: Data Wrangling with Tidyverse in R
1 Ex
1.1 Convert airquality to a tibble
1.2 Create a data frame from the tibble
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
# A tibble: 6 × 3
Ozone Temp Wind
<int> <int> <dbl>
1 41 67 7.4
2 36 72 8
3 12 74 12.6
4 18 62 11.5
5 NA 56 14.3
6 28 66 14.9
1.3 Find the Ozone amount when Temp is 97
This question is solved using both Base R and Tidyverse. The goal is to identify the Ozone amount when the temperature equals 97.
Notes:
- First, the entire row where Temp = 97 will be identified.
- Then the Ozone value will be extracted from that row.
- Both Base R and Tidyverse approaches are demonstrated.
# A tibble: 1 × 6
Ozone Solar.R Wind Temp Month Day
<int> <int> <dbl> <int> <int> <int>
1 76 203 9.7 97 8 28
# A tibble: 1 × 1
Ozone
<int>
1 76
# A tibble: 1 × 6
Ozone Solar.R Wind Temp Month Day
<int> <int> <dbl> <int> <int> <int>
1 76 203 9.7 97 8 28
2 Ex
2.1 Select the top 10 worst arrival delays for each day and calculate the monthly average delay
In this exercise, I first identify the top 10 worst arrival delays for each day.
Then I remove the grouping and calculate the average delay for each month.
# A tibble: 3,650 × 19
# Groups: year, month, day [365]
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 848 1835 853 1001 1950
2 2013 1 1 2343 1724 379 314 1938
3 2013 1 1 1815 1325 290 2120 1542
4 2013 1 1 1842 1422 260 1958 1535
5 2013 1 1 2115 1700 255 2330 1920
6 2013 1 1 2205 1720 285 46 2040
7 2013 1 1 2006 1630 216 2230 1848
8 2013 1 1 2312 2000 192 21 2110
9 2013 1 1 1942 1705 157 2124 1830
10 2013 1 1 1938 1703 155 2109 1823
# ℹ 3,640 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
| Average Arrival Delay by Month | |
| Month | Average Delay (minutes) |
|---|---|
| 1 | 200.0161 |
| 2 | 198.6214 |
| 3 | 202.1194 |
| 4 | 220.4000 |
| 5 | 201.6613 |
| 6 | 249.4167 |
| 7 | 254.1290 |
| 8 | 194.8839 |
| 9 | 166.0233 |
| 10 | 161.4903 |
| 11 | 161.8767 |
| 12 | 225.0581 |
2.2 Visualize the average delay by month
The chart shows that the average arrival delay among the top 10 worst delayed flights varies across the months. Summer months, especially June and July, appear to have the highest average delays, while some fall months have lower average delays. This suggests that delays may be affected by seasonal travel patterns, heavier traffic, and possible weather-related disruptions during busier times of the year.
3 Ex
The final goal of this exercise is to visualize the data. To do so, you will want to understand the data and determine the kind of relationship you would like to visualize. Next you will want to wrangle data to support the visualization objective. In the following sub-questions, you will be asked a step-by-step process. Organize well in your Quarto Markdown by showing sub-question numbers: 3.1, 3.2, 3.3, 3.4, and 3.5 right above each code chunk, except for 3.3. Note that the 3.3 activity doesn’t require coding, but you need to provide your answer as text in the text area, not inside the code chunk. This is because When your response is longer than one line inside the code chunk, the response will go out of the boundary of the chunk highlight when knitted.
3.1 Explore the relig_income data
In this section, I examine the structure, size, variables, and missing values in the `relig_income` dataset before reshaping it for visualization.
# A tibble: 6 × 11
religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Agnostic 27 34 60 81 76 137 122
2 Atheist 12 27 37 52 35 70 73
3 Buddhist 27 21 30 34 33 58 62
4 Catholic 418 617 732 670 638 1116 949
5 Don’t kn… 15 14 15 11 10 35 21
6 Evangeli… 575 869 1064 982 881 1486 949
# ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
# `Don't know/refused` <dbl>
Rows: 18
Columns: 11
$ religion <chr> "Agnostic", "Atheist", "Buddhist", "Catholic", "D…
$ `<$10k` <dbl> 27, 12, 27, 418, 15, 575, 1, 228, 20, 19, 289, 29…
$ `$10-20k` <dbl> 34, 27, 21, 617, 14, 869, 9, 244, 27, 19, 495, 40…
$ `$20-30k` <dbl> 60, 37, 30, 732, 15, 1064, 7, 236, 24, 25, 619, 4…
$ `$30-40k` <dbl> 81, 52, 34, 670, 11, 982, 9, 238, 24, 25, 655, 51…
$ `$40-50k` <dbl> 76, 35, 33, 638, 10, 881, 11, 197, 21, 30, 651, 5…
$ `$50-75k` <dbl> 137, 70, 58, 1116, 35, 1486, 34, 223, 30, 95, 110…
$ `$75-100k` <dbl> 122, 73, 62, 949, 21, 949, 47, 131, 15, 69, 939, …
$ `$100-150k` <dbl> 109, 59, 39, 792, 17, 723, 48, 81, 11, 87, 753, 4…
$ `>150k` <dbl> 84, 74, 53, 633, 18, 414, 54, 78, 6, 151, 634, 42…
$ `Don't know/refused` <dbl> 96, 76, 54, 1489, 116, 1529, 37, 339, 37, 162, 13…
# A tibble: 1 × 3
Rows Columns Missing_Values
<int> <int> <int>
1 18 11 0
3.2 View the help file for relig_income
The relig_income dataset comes from a Pew Research survey that examines the relationship between religion and household income. Each row represents a religion category, while each column represents an income bracket. The values in the table represent the number of respondents in each income category for that religion.
Because the income categories are stored as column names, the dataset is currently in a
wide format rather than tidy format. For tidy data, each variable should have its own
column, each observation should have its own row, and each value should have its own
cell. To properly analyze and visualize this dataset, the income columns need to be
reshaped into a single column representing income brackets with another column
representing the counts of respondents.
3.3 Identify the relationship you want to visualize
The relationship I want to visualize is how the distribution of household income varies across different religious groups. Specifically, I want to see which religions have higher proportions of respondents in higher income brackets compared to lower income brackets. This will help illustrate any potential correlations between religious affiliation and income levels.
3.4 Reshape the data and visualize income distribution by religion
# A tibble: 6 × 3
religion income count
<chr> <chr> <dbl>
1 Agnostic <$10k 27
2 Agnostic $10-20k 34
3 Agnostic $20-30k 60
4 Agnostic $30-40k 81
5 Agnostic $40-50k 76
6 Agnostic $50-75k 137
Figure 1. Income Distribution by Religion (Stacked View)
This chart displays the distribution of respondents across income brackets for all religious groups combined. Each color represents a different religion, allowing us to compare how income categories are distributed across the entire population. While this visualization shows the overall distribution, it can become visually crowded because many religions are stacked together.
Figure 2. Income Distribution by Religion (Faceted View)
This visualization separates each religion into its own panel, making it easier to observe the income distribution within each religious group individually. By isolating the groups, patterns become clearer and comparisons between income brackets within each religion are easier to interpret.
4 Ex
4.1 Explore the us_rent_income data
In this section, I explore the structure and characteristics of the `us_rent_income` dataset in order to understand the variables and determine how the data should be reshaped for visualization.
# A tibble: 6 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 01 Alabama income 24476 136
2 01 Alabama rent 747 3
3 02 Alaska income 32940 508
4 02 Alaska rent 1200 13
5 04 Arizona income 27517 148
6 04 Arizona rent 972 4
| Summary of us_rent_income Dataset | |
| Measure | Value |
|---|---|
| Rows | 104 |
| Columns | 5 |
| Missing Values | 2 |
4.2 View the help file for us_rent_income
The
us_rent_incomedataset provides estimates of household income and gross rent for U.S. states. TheGEOIDvariable identifies the geographic region, whileNAMEprovides the name of the state. Thevariablecolumn indicates whether the observation refers to income or rent. Theestimatecolumn contains the estimated value, andmoerepresents the margin of error for that estimate. Because income and rent values are stored as rows within thevariablecolumn rather than separate columns, the data is currently not structured in the ideal format for directly visualizing the relationship between income and rent. To visualize the relationship between income and rent, the data will need to be reshaped so that income and rent values are in separate columns, allowing for easier comparison and analysis of how these two variables relate across different states.
4.3 Does the current data shape support the visualization goal?
No, the current data shape does not support the visualization goal because the
variablecolumn contains both income and rent values in a long format. To visualize the relationship between income and rent effectively, it would be more useful to have these as separate columns (e.g.,incomeandrent) for each state. This would allow for easier comparison and analysis of how income and rent relate to each other across different states. The current structure requires additional wrangling to reshape the data into a format that is suitable for visualization.
For tidy data, each variable should have its own column and each observation should occupy its own row. In this case, income and rent should be stored in separate columns for each state. To achieve this structure, the data needs to be reshaped from long format to wide format using the
pivot_wider()function. This transformation will allow income and rent estimates to appear side-by-side for each state, making it possible to visualize the relationship between the two variables.
4.4 Reshape the data and visualize the relationship between income and rent
Code chunk to reshape
The scatterplot shows a positive relationship between estimated household income and estimated gross rent across states. In general, states with higher household income tend to also have higher average rents. This pattern suggests that housing costs typically increase as income levels rise. However, some variation is visible across states, which may reflect differences in housing markets, cost of living, and regional economic conditions.
4.5 One long chain of wrangling and visualization
Visualization code
The scatterplot shows a positive relationship between household income and rent across states. States with higher estimated household income tend to also have higher average rents. This pattern suggests that housing costs generally increase as income levels rise. However, there may be some variation in this relationship due to factors such as local housing markets, cost of living, and regional economic conditions. Overall, the visualization indicates that income and rent are correlated, with higher income states typically experiencing higher rents.
5 Ex
5.1 Create the dataset
player year stats
1 A 1 22/2/3
2 A 2 29/3/4
3 B 1 18/6/7
4 B 2 11/1/2
5 C 1 12/1/1
6 C 2 19/2/4
This creates a dataset like: | player | year | stats | |——–|——|———| | A | 1 | 22/2/3 | | A | 2 | 29/3/4 |
Where
statscontains: points/assists/rebounds for each player in each year.
5.2 Separate the stats column
player year points assists steals
1 A 1 22 2 3
2 A 2 29 3 4
3 B 1 18 6 7
4 B 2 11 1 2
5 C 1 12 1 1
6 C 2 19 2 4
This code uses the
separate()function to split thestatscolumn into three new columns:points,assists, andsteals. Thesep = "/"argument indicates that the values in thestatscolumn are separated by slashes, which is how the function knows where to split the data. After running this code, the resulting dataset will have separate columns for points, assists, and steals for each player and year.
5.3 Unite the columns back into one stats column
player year stats
1 A 1 22/2/3
2 A 2 29/3/4
3 B 1 18/6/7
4 B 2 11/1/2
5 C 1 12/1/1
6 C 2 19/2/4
This code uses the
unite()function to combine thepoints,assists, andstealscolumns back into a singlestatscolumn. Thecol = statsargument specifies the name of the new column that will contain the combined values. Thepoints:stealsargument indicates that all columns frompointstostealsshould be united, andsep = "/"specifies that the values should be separated by slashes in the resultingstatscolumn. After running this code, the dataset will return to its original format with a singlestatscolumn containing the combined values for points, assists, and steals.
6 Ex
6.1 Filter non-canceled flights and identify the day with the latest first departure in each month
This code filters out canceled flights (where
dep_timeis not NA) and then groups the data by month and day to find the latest departure time for the first flight of each day. Theslice_max()function is used to select the row with the maximum departure time for each group, effectively identifying the day with the latest first departure in each month.
In this section, I first remove canceled flights. Then I calculate the first and last departure times for each day. After that, I identify one day in each month where the first flight departure was the latest and display the results in descending order.
| Day with the Latest First Departure in Each Month | |||||
| Year | Month | Day | First Departure | Last Departure | Number of Flights |
|---|---|---|---|---|---|
| 2013 | 2 | 9 | 901 | 2358 | 291 |
| 2013 | 1 | 20 | 525 | 2356 | 782 |
| 2013 | 9 | 29 | 521 | 2356 | 911 |
| 2013 | 10 | 6 | 519 | 2346 | 914 |
| 2013 | 11 | 3 | 519 | 2358 | 900 |
| 2013 | 3 | 3 | 509 | 2353 | 910 |
| 2013 | 12 | 18 | 500 | 2358 | 953 |
| 2013 | 5 | 29 | 457 | 2352 | 969 |
| 2013 | 6 | 15 | 456 | 2356 | 795 |
| 2013 | 4 | 30 | 455 | 2351 | 958 |
| 2013 | 8 | 31 | 455 | 2359 | 676 |
| 2013 | 7 | 5 | 38 | 2358 | 819 |
6.2 Find the top 5 destinations
In this section, I group the flights data by destination and count the number of flights to each destination. Then I use
slice_max()to select the top 5 destinations with the highest flight counts. Finally, I display the results in a formatted table.
| Top 5 Destinations by Number of Flights | |
| Destination | Number of Flights |
|---|---|
| ORD | 17283 |
| ATL | 17215 |
| LAX | 16174 |
| BOS | 15508 |
| MCO | 14082 |
The results show the top 5 destinations with the highest number of flights. These destinations are likely major hubs or popular travel locations, which is why they have a high volume of flights. The table provides a clear summary of the most frequently traveled destinations from the dataset.
7 Ex
7.1 Check missing values in Solar.R and replace them with the mean
This code checks for missing values in the
Solar.Rcolumn of theairqualitydataset. If any missing values are found, it calculates the mean of the non-missing values in theSolar.Rcolumn and replaces the missing values with this mean. Finally, it confirms that there are no more missing values in theSolar.Rcolumn.
# A tibble: 1 × 2
missing_values percent_missing
<int> <dbl>
1 7 4.58
[1] 185.9315
# A tibble: 1 × 1
remaining_missing
<int>
1 0
8 Ex
In this section, I calculate the total temperature for each month by weighting the count with
Temp. Then I convert the numeric month values into month names, change the month variable to a factor, reorder the months by total temperature in descending order, and visualize the results in a horizontal bar chart.
8.1 Alternative visualization
This code achieves the same goal as the previous method but uses
group_by()andsummarise()to calculate the total temperature for each month instead ofcount(). The remaining steps—recoding the month names, converting the variable to a factor, reordering the months, and visualizing the results—are the same as before.
9 Ex
9.1 Find the maxium temperature for each month and visualize the results
This code groups the
air.tibdataset by month and calculates the maximum temperature for each month usingsummarise(max_temp = max(Temp)). It then recodes the numeric month values into month names, converts the month variable into a factor, reorders the months by maximum temperature in descending order, and visualizes the results in a horizontal bar chart.
10 Ex
10.1 Create temp_wind and visualize the relationships between Ozone and selected predictors
This code creates a new variable
temp_windby dividing theTempcolumn by theWindcolumn in theair.tibdataset. It then reshapes the data to compare the relationships between Ozone and three predictors: temperature, wind speed, and the combinedtemp_windvariable. A faceted scatterplot is used to visualize these relationships.
The scatterplots show the relationships between Ozone levels and three predictors: temperature, wind speed, and the combined
temp_windvariable. The plot for temperature shows a positive relationship, indicating that higher temperatures are associated with higher Ozone levels. The plot for wind speed shows a negative relationship, suggesting that higher wind speeds are associated with lower Ozone levels. Thetemp_windplot combines these effects and appears to show a stronger positive relationship with Ozone. Overall, these visualizations help illustrate how different weather variables relate to Ozone concentrations in the air.