Data Wrangling (ModernDive, Chapter 4)

importing libraries

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(nycflights13)
  1. Find the average air time and the minimum distance for flights to San Francisco International Airport.
flights %>%
  filter(dest == "SFO") %>%
  group_by(dest) %>%
  summarise(avg_air_time = mean(air_time, na.rm = TRUE),
            min_dist = min(distance, na.rm = TRUE))
  2. How many “NA” values are there in the air_time column for flights to “MDW” (Chicago Midway)? If you don’t know, you should ask your instructor or read about selecting rows with NA values in a particular column.
flights %>%
  filter(dest == "MDW") %>%
  group_by(dest) %>%
  summarise(num_na = sum(is.na(air_time)))
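
For reference, here is a minimal sketch of the two usual ways to handle missing values in one column: keep only the rows where the column is NA with `filter(is.na(...))`, or count them with `sum(is.na(...))`. The `toy` tibble and its values are made up for illustration and are not taken from nycflights13.

```r
library(dplyr)

# Hypothetical toy data (not from nycflights13): one MDW flight has no air_time
toy <- tibble(
  dest     = c("MDW", "MDW", "SFO"),
  air_time = c(110, NA, 345)
)

# Keep only the MDW rows whose air_time is missing
na_rows <- toy %>%
  filter(dest == "MDW", is.na(air_time))

# Count the missing values without keeping the rows
na_count <- toy %>%
  filter(dest == "MDW") %>%
  summarise(num_na = sum(is.na(air_time))) %>%
  pull(num_na)
```

Note that `air_time == NA` would not work here, because any comparison with NA returns NA rather than TRUE or FALSE.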

Work with flights in September to either Dallas Fort Worth or Denver International Airport. (This means consider both flights to Dallas and flights to Denver!)

flights2 <- flights %>%
  filter(month == 9, dest %in% c("DEN", "DFW"))
  1. How many flights depart at least five minutes early?
flights2 %>% filter(dep_delay <= -5) %>% count()
  2. What is the average air time for the flights that depart at least 5 minutes early? Write an answer in a sentence including units.
flights2 %>% filter(dep_delay <= -5) %>% summarise(mean(air_time, na.rm=TRUE))
  3. Review: Make a histogram of the scheduled departure times for those early flights.
ggplot(flights2 %>% filter(dep_delay <= -5)) +
  geom_histogram(aes(sched_dep_time))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
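
The message above suggests choosing a bin width. A possible sketch follows, assuming the departure times are stored as HHMM integers (for example 1745 for 5:45 pm), as they are in nycflights13. Raw HHMM values skip from 1259 to 1300, so converting them to decimal hours first gives evenly spaced bins. The `toy` tibble and the `sched_dep_hour` column are made up for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: scheduled departure times in HHMM format
toy <- tibble(sched_dep_time = c(600, 615, 1230, 1745, 1759))

# Convert HHMM to decimal hours: whole hours plus minutes / 60
toy_hours <- toy %>%
  mutate(sched_dep_hour = sched_dep_time %/% 100 + (sched_dep_time %% 100) / 60)

# One bin per hour, with bin edges on whole hours
p <- ggplot(toy_hours, aes(x = sched_dep_hour)) +
  geom_histogram(binwidth = 1, boundary = 0) +
  labs(x = "Scheduled departure (hour of day)", y = "Number of flights")
```

Printing `p` draws the plot. Setting `binwidth` explicitly also silences the `stat_bin()` message.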

  4. Make a table of average and median arrival delays for each destination.
flights %>%
  group_by(dest) %>%
  summarise(avg_arr_delay = mean(arr_delay, na.rm = TRUE),
            med_arr_delay = median(arr_delay, na.rm = TRUE))
  5. Make a table containing the carrier and the average flight time, arranged in descending order of time (so longest flight times come first).
flights3 <- flights %>%
  group_by(carrier) %>%
  summarise(avg_flight_time = mean(air_time, na.rm = TRUE)) %>%
  arrange(desc(avg_flight_time))
  6. Join your answer to question 5 with the ‘airlines’ table so that it shows the names of each airline in addition to the carrier code.
flights3 %>% left_join(airlines)
## Joining, by = "carrier"
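
A small sketch of how `left_join()` behaves, using made-up tables in place of `flights3` and `airlines` (the carrier codes, names, and times below are hypothetical). Naming the key with `by = "carrier"` avoids the "Joining, by" message, and the result keeps every row of the left table in the left table's order, so putting the sorted table on the left preserves the descending order.

```r
library(dplyr)

# Hypothetical stand-in for flights3: already sorted, longest time first
avg_times <- tibble(
  carrier         = c("YY", "XX"),
  avg_flight_time = c(600, 190)
)

# Hypothetical stand-in for airlines: carrier code plus full name
carrier_names <- tibble(
  carrier = c("XX", "YY", "ZZ"),
  name    = c("Airline X", "Airline Y", "Airline Z")
)

# Every row of avg_times is kept, in its original order; ZZ is dropped
# because it has no match in the left table
joined <- avg_times %>%
  left_join(carrier_names, by = "carrier")
```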

Data set: car speeding and warning signs

Read about the car speeding data set.

“In a study into the effect that warning signs have on speeding patterns, Cambridgeshire County Council considered 14 pairs of locations. The locations were paired to account for factors such as traffic volume and type of road. One site in each pair had a sign erected warning of the dangers of speeding and asking drivers to slow down. No action was taken at the second site. Three sets of measurements were taken at each site. Each set of measurements was nominally of the speeds of 100 cars but not all sites have exactly 100 measurements. These speed measurements were taken before the erection of the sign, shortly after the erection of the sign, and again after the sign had been in place for some time.”

speeding <- read_csv("http://vincentarelbundock.github.io/Rdatasets/csv/boot/amis.csv", 
                     col_types = cols(col_double(), col_double(), col_factor(), col_factor(), col_factor())) %>%
            rename(row = `...1`)
## New names:
## • `` -> `...1`
head(speeding)

Do the warning signs make any difference?

Answer in words. Produce graphs. Explain how graphs support your answer.

Danger: If you have not carefully read the problem description paragraph, your answer is not likely to make sense!

graph 1 (this one isn't that useful, but I sketched it by hand first and the plot came out a lot like my drawing, which is cool)

speeding %>%
  group_by(period, warning, pair) %>%
  summarise(avg_speed = mean(speed, na.rm = TRUE)) %>%
  ggplot() +
  geom_point(aes(x = period, y = avg_speed, color = pair, shape = warning))
## `summarise()` has grouped output by 'period', 'warning'. You can override using
## the `.groups` argument.
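
The message above is informational: after `summarise()`, dplyr drops only the last grouping variable and keeps the rest. A minimal sketch of making the choice explicit with `.groups = "drop"`, which returns an ungrouped table and silences the message. The `toy` values are made up and are not from amis.csv.

```r
library(dplyr)

# Hypothetical toy data with the same column names as the speeding table
toy <- tibble(
  period  = c(1, 1, 2, 2),
  warning = c(1, 2, 1, 2),
  speed   = c(30, 32, 28, 33)
)

# .groups = "drop" removes all grouping from the result
avg <- toy %>%
  group_by(period, warning) %>%
  summarise(avg_speed = mean(speed, na.rm = TRUE), .groups = "drop")
```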

graph 2

speeding %>%
  group_by(period, warning, pair) %>%
  summarise(avg_speed = mean(speed, na.rm = TRUE)) %>%
  ggplot() +
  geom_point(aes(x = period, y = avg_speed, color = warning)) +
  facet_wrap(~pair)
## `summarise()` has grouped output by 'period', 'warning'. You can override using
## the `.groups` argument.

graph 3

speeding %>%
  group_by(warning, period) %>%
  summarise(avg_speed = mean(speed, na.rm = TRUE)) %>%
  ggplot() +
  geom_point(aes(x = period, y = avg_speed, color = warning))
## `summarise()` has grouped output by 'warning'. You can override using the
## `.groups` argument.
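
Because the study measured each site three times, one more way to look at the question is the change in average speed relative to the first ("before") measurement at the same site. The sketch below assumes that period 1 is the "before" measurement, which should be checked against the data set's documentation. The `toy` numbers are invented purely to show the calculation and say nothing about the real data.

```r
library(dplyr)

# Invented averages for a single pair of sites (not from amis.csv)
toy <- tibble(
  pair      = factor(c(1, 1, 1, 1, 1, 1)),
  warning   = factor(c(1, 1, 1, 2, 2, 2)),
  period    = factor(c(1, 2, 3, 1, 2, 3)),
  avg_speed = c(40, 38, 41, 40, 40, 42)
)

# Within each site (pair + warning), subtract the period-1 average
changes <- toy %>%
  group_by(pair, warning) %>%
  mutate(change_from_before = avg_speed - avg_speed[period == "1"]) %>%
  ungroup()
```

Comparing `change_from_before` between the two sites of a pair separates the effect of the sign from differences that were already there before it went up.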

I would conclude that the warning signs are effective initially, but their effect falls off once drivers get used to them.