Assignment 2 - Carlos Rivera (car808)

Step 2: Workbook exercises

library(tidyverse)

data = nycflights13::flights

not_cancelled = data %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay))

5.2.4 Exercises

1. Find all flights that

Had an arrival delay of two or more hours

data %>% 
  filter(arr_delay >= 2*60)

The resulting table has 10200 rows and 19 columns.

Flew to Houston (IAH or HOU)

data %>% 
  filter(dest %in% c('IAH', 'HOU'))

The resulting table has 9313 rows and 19 columns.

Were operated by United, American, or Delta

# AA - American Airlines Inc.     
# DL - Delta Air Lines Inc.       
# UA - United Air Lines Inc.      

data %>% 
  filter(carrier %in% c('AA', 'DL', 'UA'))

The resulting table has 139504 rows and 19 columns.

Departed in summer (July, August, and September)

# month: July=7, August=8, September=9
data %>% 
  filter(month >= 7,
         month <= 9)

The resulting table has 86326 rows and 19 columns.

Arrived more than two hours late, but didn’t leave late

data %>% 
  filter(arr_delay >= 2 * 60,
         dep_delay == 0)

The resulting table has 3 rows and 19 columns.

Were delayed by at least an hour, but made up over 30 minutes in flight

data %>% 
  filter(dep_delay >= 60,
         (dep_delay - arr_delay) > 30)

The resulting table has 1844 rows and 19 columns.

Departed between midnight and 6am (inclusive)

data %>% 
  filter(dep_time <= 600)

The resulting table has 9344 rows and 19 columns.

2. Another useful dplyr filtering helper is `between()`. What does it do? Can you use it to simplify the code needed to answer the previous challenges?

Departed in summer (July, August, and September)

# month: July=7, August=8, September=9
data %>% 
  filter(between(month, 7, 9))

Departed between midnight and 6am (inclusive)

data %>% 
  filter(between(, 0, 600))

3. How many flights have a missing `dep_time`? What other variables are missing? What might these rows represent?

data %>% 
  filter(is.na(dep_time)) %>% 
  head(5)

How many flights have a missing dep_time?

In total there are 8,255 flights with the dep_time missing.

What other variables are missing?

Other variables which are also missing in the same flights are: dep_time, dep_delay, arr_time, arr_delay, air_time.

What might these rows represent?

Most likely these flights will be cancelled, so there is no departure nor arrival information.

4. Why is `NA ^ 0` not missing? Why is `NA | TRUE` not missing? Why is `FALSE & NA` not missing? Can you figure out the general rule? (`NA * 0` is a tricky counterexample!)

This behavior may be due to the fact that NA stands for Not Available and has logical class so that for any of the 3 situations the result does not change independently of the value that the variable might take:

NA ^ 0: any value numeric (number or Inf), logical (TRUE, FALSE) power 0 is equal 1.
NA | TRUE: any numeric value != 0 is TRUE and value == 0 is FALSE, independent the first argument due is an operator OR is always TRUE because the second argument. Different situation if the second argument were FALSE, in which case the result is NA because it depends on the first argument.
FALSE & NA: in the same sense, since it is an operator AND independent of the value that the NA argument could take, the result will be FALSE.
NA * 0: The behavior of this tricky counterexample is due to the fact that not any value multiplied by 0 will equal to be, punctually there are two indeterminations when the variable takes the values -Inf or Inf, in which case multiplying by 0 results in NaN, in which case NA * 0 will be indeterminate.

5.3.1 Exercises

1. How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).

data %>% 
  arrange(desc(is.na(dep_time))) %>% 
  head(5)

2. Sort flights to find the most delayed flights. Find the flights that left earliest.

data %>% 
  arrange(desc(dep_delay), desc(arr_delay)) %>% 
  head(5)

3. Sort flights to find the fastest (highest speed) flights.

# air_time: Amount of time spent in the air, in minutes.
# distance: Distance between airports, in miles.
# speed = distance / (air_time / 60) [mile/hour]

data %>% 
  mutate(speed = distance / (air_time / 60)) %>% 
  arrange(desc(speed)) %>% 
  head(5)

4. Which flights travelled the farthest? Which travelled the shortest?

data %>% 
  arrange(desc(distance)) %>% 
  head(5)

5.4.1 Exercises

1. Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights`.

# dep_time
# dep_delay
# arr_time
# arr_delay

var_sel = c('dep_time', 'dep_delay', 'arr_time', 'arr_delay')

head(data[, var_sel], 5)

head(data[, c(4,6,7,9)], 5)

data %>% select(dep_time, dep_delay, arr_time, arr_delay) %>% head(5)

data %>% select(any_of(var_sel)) %>% head(5)

data %>% 
  select(starts_with('dep'), starts_with('arr')) %>% 
  head(5)

data %>% 
  select(dep_time:arr_delay, -starts_with('sched')) %>% 
  head(5)

2. What happens if you include the name of a variable multiple times in a `select()` call?

data %>% 
  select(year, year, year) %>% 
  head(5)

When including a variable name multiple times in the select function, the resulting data.frame contains the variable only once.

3. What does the `any_of()` function do? Why might it be helpful in conjunction with this vector?

vars <- c("year", "month", "day", "dep_delay", "arr_delay")

vars <- c("year", "month", "day", "dep_delay", "arr_delay")

data %>% 
  select(any_of(vars)) %>% 
  head(5)

the any_of functoin allows you to define a vector with the names of the columns you want to select, and then the select function returns each of the columns that match the values in the vector of names, this is very useful and has multiple applications, for example sleeving the same variables in multiple data.frame.

4. Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

select(flights, contains("TIME"))

data %>% 
  select(contains('TIME')) %>% 
  head(5)

data %>% 
  select(contains('TIME', ignore.case = F))

I was surprised, since R is generally case sensitive so I expected it not to return columns.
The selection helpers have the ignore.case argument by default TRUE, so it returns a case-insensitive match.
changing the argument to FALSE: ignore.case = F.

Step 3: Generate your own exercises

Exploration of contaminant production in Mississippi in the last 20 years

The EPA Toxics Release Inventory data is a valuable source of information for the state of Mississippi. From this data we can take a look at how emissions have evolved over the last two decade 2004 - 2023.

Since 1987, the EPA has registered the Toxics Release Inventory (TRI), which contains the 100 most used data fields of the TRI Reporting Form R and the Form A Certification Statement. One of the most relevant variables is the total emissions generated. This information is disaggregated by year, county, industrial sector, among others, which allows for detailed observation and analysis of the information through summary tables (Table 1).

Table 1: Dataset variables

ID	col_name	Variable	Description
1	year	YEAR	Calendar year in which the reported activities occurred.Source: TRI_REPORTING_FORM.REPORTING_YEAR Reference: Part I, Section 1 Maximum Length: 4
7	county	COUNTY	County in which the reporting facility is located. Reference: Part I, Section 4.1 Maximum Length: 50
23	sector	INDUSTRY SECTOR	The industry or sector (e.g., Coal Mining, Metal Mining,Electrical Utilities, etc.) a facility belongs to. This categorization is primarily used to classify, analyze, and show industrial trends within TRI data. Maximum Length: 120
37	chemical	CHEMICAL	Name of the chemical as listed on the TRI chemical list, orgeneric name, if the chemical is claimed as a trade secret. Reference: Part II, Section 1.2 or Part II, Section1.3 Maximum Length: 70
46	carc	CARCINOGEN	Flag indicating whether the chemical is classified as acarcinogen by the Occupational Safety and HealthAdministration (OSHA). Yes = CARC No = Non-CARC See “Appendix B: Chemical Classifications - Carcinogens” for a list of TRI chemicals classified as OSHA carcinogens. Maximum Length: 3
107	total	TOTAL RELEASES	The total on- and off-site releases from sections 5 and 6 of the Form R. The value for this field equals On-site Release Total (row #65) + Off-site Release Total (row #88). Maximum Length: 22,7

Given the format in which the data is found, we must download the csv files by year and then bind by rows with bind_rows, as a final result our dataset looks like in Table 2. First we can check the number of categories of the variables year, county, sector, and chemical (Table 3).

Table 2: Data header: first 5 row of the dataset

year	county	sector	chemical	carc	total_rel
2004	DESOTO	Fabricated Metals	Xylene (mixed isomers)	NO	33731.0
2004	DESOTO	Fabricated Metals	Toluene	NO	22202.0
2004	LEFLORE	Electrical Equipment	Diisocyanates	NO	0.0
2004	HINDS	Petroleum	Polycyclic aromatic compounds	YES	400.9
2004	LAMAR	Petroleum	Polycyclic aromatic compounds	YES	0.0

Table 3: Categorical variables

Variable	Number categories	Categories (at least first 30)
year	20	2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023
county	76	DESOTO, LEFLORE, HINDS, LAMAR, HARRISON, GRENADA, LOWNDES, TISHOMINGO, JACKSON, CALHOUN, MADISON, COPIAH, UNION, TIPPAH, NOXUBEE, JONES, WARREN, PIKE, CLARKE, RANKIN, OKTIBBEHA, CLAY, FORREST, SMITH, SCOTT, WASHINGTON, PERRY, CHOCTAW, LAFAYETTE, LEE
sector	24	Fabricated Metals, Electrical Equipment, Petroleum, Chemicals, Wood Products, Textiles, Electric Utilities, Nonmetallic Mineral Product, Transportation Equipment, Food, Primary Metals, Plastics and Rubber, Machinery, Paper, Petroleum Bulk Terminals, Furniture, Printing, Chemical Wholesalers, Computers and Electronic Products, Miscellaneous Manufacturing, Other, Textile Product, Hazardous Waste, Natural Gas Processing
chemical	233	Xylene (mixed isomers), Toluene, Diisocyanates, Polycyclic aromatic compounds, Titanium tetrachloride, Chromium compounds (except for chromite ore mined in the Transvaal Region), N-Methyl-2-pyrrolidone, Hydrochloric acid (acid aerosols including mists, vapors, gas, fog, and other airborne forms of any particle size), Methanol, tert-Butyl alcohol, Vanadium compounds, Nitric acid, Lead compounds, Lead, Manganese compounds, Copper compounds, Pentachlorobenzene, Cresol (mixed isomers), Nickel, Di(2-ethylhexyl) phthalate, Creosote, Chlorodifluoromethane (HCFC-22), Ethylbenzene, Acetaldehyde, 1,2,4-Trimethylbenzene, Pendimethalin, Hydramethylnon, Nickel compounds, Zinc compounds, Copper

The Table 4 reflects the top 5 years that most polluting waste was produced, in this top we can see that during the period 2013 - 2017 was by far the most polluting years.

Table 4: Top 5 year release contamination in Millions of Pounds.

Year	Total	Over the mean	over mean [%]
2014	70.33	10.28	17.12
2004	69.74	9.69	16.14
2013	67.84	7.79	12.97
2015	65.54	5.49	9.15
2017	65.44	5.39	8.98

If we take the annual averages of toxic pollutant production, we can identify the counties with the highest level of pollutant production as well as particularly which industrial sectors are responsible for generating these wastes. Thus, we identified Harrison and Monroe counties as having by far the highest annual average pollutant production from the chemical production sector. In addition, in this same top we identified that the state’s pollution is led by three main industries, chemicals, electric utilities and paper production (Table 5).

Table 5: Top 10 average pollution release by county and industrial sector [Millions of pounds].

County	Industrial Sector	Average Release
HARRISON	Chemicals	15.13
MONROE	Chemicals	12.27
HARRISON	Electric Utilities	3.56
CHOCTAW	Electric Utilities	2.41
PERRY	Paper	2.28
WARREN	Paper	2.12
LAWRENCE	Paper	2.02
LEAKE	Food	1.74
LOWNDES	Paper	1.73
YAZOO	Chemicals	1.64

These results indicate which sectors should be given more attention for contaminant reduction.