Assignment 2 - Carlos Rivera (car808)

Step 2: Workbook exercises

library(tidyverse)

data = nycflights13::flights

not_cancelled = data %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay))

5.2.4 Exercises

1. Find all flights that
  • Had an arrival delay of two or more hours
data %>% 
  filter(arr_delay >= 2*60)

The resulting table has 10200 rows and 19 columns.

  • Flew to Houston (IAH or HOU)
data %>% 
  filter(dest %in% c('IAH', 'HOU'))

The resulting table has 9313 rows and 19 columns.

  • Were operated by United, American, or Delta
# AA - American Airlines Inc.     
# DL - Delta Air Lines Inc.       
# UA - United Air Lines Inc.      

data %>% 
  filter(carrier %in% c('AA', 'DL', 'UA'))

The resulting table has 139504 rows and 19 columns.

  • Departed in summer (July, August, and September)
# month: July=7, August=8, September=9
data %>% 
  filter(month >= 7,
         month <= 9)

The resulting table has 86326 rows and 19 columns.

  • Arrived more than two hours late, but didn’t leave late
data %>% 
  filter(arr_delay >= 2 * 60,
         dep_delay == 0)

The resulting table has 3 rows and 19 columns.

  • Were delayed by at least an hour, but made up over 30 minutes in flight
data %>% 
  filter(dep_delay >= 60,
         (dep_delay - arr_delay) > 30)

The resulting table has 1844 rows and 19 columns.

  • Departed between midnight and 6am (inclusive)
data %>% 
  filter(dep_time <= 600)

The resulting table has 9344 rows and 19 columns.

2. Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?
  • Departed in summer (July, August, and September)
# month: July=7, August=8, September=9
data %>% 
  filter(between(month, 7, 9))
  • Departed between midnight and 6am (inclusive)
data %>% 
  filter(between(, 0, 600))
3. How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
data %>% 
  filter(is.na(dep_time)) %>% 
  head(5)
  • How many flights have a missing dep_time?

In total there are 8,255 flights with the dep_time missing.

  • What other variables are missing?

Other variables which are also missing in the same flights are: dep_time, dep_delay, arr_time, arr_delay, air_time.

  • What might these rows represent?

Most likely these flights will be cancelled, so there is no departure nor arrival information.

4. Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)

This behavior may be due to the fact that NA stands for Not Available and has logical class so that for any of the 3 situations the result does not change independently of the value that the variable might take:

  • NA ^ 0: any value numeric (number or Inf), logical (TRUE, FALSE) power 0 is equal 1.

  • NA | TRUE: any numeric value != 0 is TRUE and value == 0 is FALSE, independent the first argument due is an operator OR is always TRUE because the second argument. Different situation if the second argument were FALSE, in which case the result is NA because it depends on the first argument.

  • FALSE & NA: in the same sense, since it is an operator AND independent of the value that the NA argument could take, the result will be FALSE.

  • NA * 0: The behavior of this tricky counterexample is due to the fact that not any value multiplied by 0 will equal to be, punctually there are two indeterminations when the variable takes the values -Inf or Inf, in which case multiplying by 0 results in NaN, in which case NA * 0 will be indeterminate.

5.3.1 Exercises

1. How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).
data %>% 
  arrange(desc(is.na(dep_time))) %>% 
  head(5)
2. Sort flights to find the most delayed flights. Find the flights that left earliest.
data %>% 
  arrange(desc(dep_delay), desc(arr_delay)) %>% 
  head(5)
3. Sort flights to find the fastest (highest speed) flights.
# air_time: Amount of time spent in the air, in minutes.
# distance: Distance between airports, in miles.
# speed = distance / (air_time / 60) [mile/hour]

data %>% 
  mutate(speed = distance / (air_time / 60)) %>% 
  arrange(desc(speed)) %>% 
  head(5)
4. Which flights travelled the farthest? Which travelled the shortest?
data %>% 
  arrange(desc(distance)) %>% 
  head(5)

5.4.1 Exercises

1. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
# dep_time
# dep_delay
# arr_time
# arr_delay

var_sel = c('dep_time', 'dep_delay', 'arr_time', 'arr_delay')

head(data[, var_sel], 5)
head(data[, c(4,6,7,9)], 5)
data %>% select(dep_time, dep_delay, arr_time, arr_delay) %>% head(5)
data %>% select(any_of(var_sel)) %>% head(5)
data %>% 
  select(starts_with('dep'), starts_with('arr')) %>% 
  head(5)
data %>% 
  select(dep_time:arr_delay, -starts_with('sched')) %>% 
  head(5)
2. What happens if you include the name of a variable multiple times in a select() call?
data %>% 
  select(year, year, year) %>% 
  head(5)

When including a variable name multiple times in the select function, the resulting data.frame contains the variable only once.

3. What does the any_of() function do? Why might it be helpful in conjunction with this vector?

vars <- c("year", "month", "day", "dep_delay", "arr_delay")

vars <- c("year", "month", "day", "dep_delay", "arr_delay")

data %>% 
  select(any_of(vars)) %>% 
  head(5)

the any_of functoin allows you to define a vector with the names of the columns you want to select, and then the select function returns each of the columns that match the values in the vector of names, this is very useful and has multiple applications, for example sleeving the same variables in multiple data.frame.

4. Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

select(flights, contains("TIME"))

data %>% 
  select(contains('TIME')) %>% 
  head(5)
data %>% 
  select(contains('TIME', ignore.case = F))
  • I was surprised, since R is generally case sensitive so I expected it not to return columns.

  • The selection helpers have the ignore.case argument by default TRUE, so it returns a case-insensitive match.

  • changing the argument to FALSE: ignore.case = F.

Step 3: Generate your own exercises

Exploration of contaminant production in Mississippi in the last 20 years

The EPA Toxics Release Inventory data is a valuable source of information for the state of Mississippi. From this data we can take a look at how emissions have evolved over the last two decade 2004 - 2023.

Since 1987, the EPA has registered the Toxics Release Inventory (TRI), which contains the 100 most used data fields of the TRI Reporting Form R and the Form A Certification Statement. One of the most relevant variables is the total emissions generated. This information is disaggregated by year, county, industrial sector, among others, which allows for detailed observation and analysis of the information through summary tables (Table 1).

Table 1: Dataset variables
ID col_name Variable Description
1 year YEAR Calendar year in which the reported activities occurred.Source: TRI_REPORTING_FORM.REPORTING_YEAR
Reference: Part I, Section 1
Maximum Length: 4
7 county COUNTY County in which the reporting facility is located.
Reference: Part I, Section 4.1
Maximum Length: 50
23 sector INDUSTRY SECTOR The industry or sector (e.g., Coal Mining, Metal Mining,Electrical Utilities, etc.) a facility belongs to. This categorization is primarily used to classify, analyze, and show industrial trends within TRI data.
Maximum Length: 120
37 chemical CHEMICAL Name of the chemical as listed on the TRI chemical list, orgeneric name, if the chemical is claimed as a trade secret.
Reference: Part II, Section 1.2 or Part II, Section1.3
Maximum Length: 70
46 carc CARCINOGEN Flag indicating whether the chemical is classified as acarcinogen by the Occupational Safety and HealthAdministration (OSHA).
Yes = CARC
No = Non-CARC
See “Appendix B: Chemical Classifications - Carcinogens” for a list of TRI chemicals classified as OSHA carcinogens.
Maximum Length: 3
107 total TOTAL RELEASES The total on- and off-site releases from sections 5 and 6 of the Form R. The value for this field equals On-site Release Total (row #65) + Off-site Release Total (row #88).
Maximum Length: 22,7

Given the format in which the data is found, we must download the csv files by year and then bind by rows with bind_rows, as a final result our dataset looks like in Table 2. First we can check the number of categories of the variables year, county, sector, and chemical (Table 3).

Table 2: Data header: first 5 row of the dataset
year county sector chemical carc total_rel
2004 DESOTO Fabricated Metals Xylene (mixed isomers) NO 33731.0
2004 DESOTO Fabricated Metals Toluene NO 22202.0
2004 LEFLORE Electrical Equipment Diisocyanates NO 0.0
2004 HINDS Petroleum Polycyclic aromatic compounds YES 400.9
2004 LAMAR Petroleum Polycyclic aromatic compounds YES 0.0
Table 3: Categorical variables
Variable Number categories Categories (at least first 30)
year 20 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023
county 76 DESOTO, LEFLORE, HINDS, LAMAR, HARRISON, GRENADA, LOWNDES, TISHOMINGO, JACKSON, CALHOUN, MADISON, COPIAH, UNION, TIPPAH, NOXUBEE, JONES, WARREN, PIKE, CLARKE, RANKIN, OKTIBBEHA, CLAY, FORREST, SMITH, SCOTT, WASHINGTON, PERRY, CHOCTAW, LAFAYETTE, LEE
sector 24 Fabricated Metals, Electrical Equipment, Petroleum, Chemicals, Wood Products, Textiles, Electric Utilities, Nonmetallic Mineral Product, Transportation Equipment, Food, Primary Metals, Plastics and Rubber, Machinery, Paper, Petroleum Bulk Terminals, Furniture, Printing, Chemical Wholesalers, Computers and Electronic Products, Miscellaneous Manufacturing, Other, Textile Product, Hazardous Waste, Natural Gas Processing
chemical 233 Xylene (mixed isomers), Toluene, Diisocyanates, Polycyclic aromatic compounds, Titanium tetrachloride, Chromium compounds (except for chromite ore mined in the Transvaal Region), N-Methyl-2-pyrrolidone, Hydrochloric acid (acid aerosols including mists, vapors, gas, fog, and other airborne forms of any particle size), Methanol, tert-Butyl alcohol, Vanadium compounds, Nitric acid, Lead compounds, Lead, Manganese compounds, Copper compounds, Pentachlorobenzene, Cresol (mixed isomers), Nickel, Di(2-ethylhexyl) phthalate, Creosote, Chlorodifluoromethane (HCFC-22), Ethylbenzene, Acetaldehyde, 1,2,4-Trimethylbenzene, Pendimethalin, Hydramethylnon, Nickel compounds, Zinc compounds, Copper

The Table 4 reflects the top 5 years that most polluting waste was produced, in this top we can see that during the period 2013 - 2017 was by far the most polluting years.

Table 4: Top 5 year release contamination in Millions of Pounds.
Year Total Over the mean over mean [%]
2014 70.33 10.28 17.12
2004 69.74 9.69 16.14
2013 67.84 7.79 12.97
2015 65.54 5.49 9.15
2017 65.44 5.39 8.98

If we take the annual averages of toxic pollutant production, we can identify the counties with the highest level of pollutant production as well as particularly which industrial sectors are responsible for generating these wastes. Thus, we identified Harrison and Monroe counties as having by far the highest annual average pollutant production from the chemical production sector. In addition, in this same top we identified that the state’s pollution is led by three main industries, chemicals, electric utilities and paper production (Table 5).

Table 5: Top 10 average pollution release by county and industrial sector [Millions of pounds].
County Industrial Sector Average Release
HARRISON Chemicals 15.13
MONROE Chemicals 12.27
HARRISON Electric Utilities 3.56
CHOCTAW Electric Utilities 2.41
PERRY Paper 2.28
WARREN Paper 2.12
LAWRENCE Paper 2.02
LEAKE Food 1.74
LOWNDES Paper 1.73
YAZOO Chemicals 1.64

These results indicate which sectors should be given more attention for contaminant reduction.