Directions

During ANLY 512 we will be studying the theory and practice of data visualization. We will be using R and the packages within R to assemble data and construct many different types of visualizations. Before we begin studying data visualizations we need to develop some data wrangling skills. We will use these skills to wrangle our data into a form that we can use for visualizations.

The objective of this assignment is to introduce you to RStudio, R Markdown, the tidyverse, and more specifically the dplyr package.

Each question is worth 5 points.

To submit this homework, create the PDF document in RStudio using the knitr package (via the Knit button included in RStudio), then submit the PDF to Canvas.

Question #1

Use the nycflights13 package and the flights data frame to answer the following questions:

a. What month had the highest proportion of cancelled flights?

b. What month had the lowest?

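library(nycflights13)
library(dplyr)  # supplies %>%, group_by(), summarize(), and arrange()

# A flight with a missing dep_time is treated as cancelled.
flight_cancel <- flights %>%
  group_by(month) %>%
  summarize(cancelled = sum(is.na(dep_time)),
            cancelled_proportion = cancelled / n() * 100) %>%  # expressed as a percentage
  arrange(cancelled_proportion)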

flight_cancel
## # A tibble: 12 × 3
##    month cancelled cancelled_proportion
##    <int>     <int>                <dbl>
##  1    10       236                0.817
##  2    11       233                0.854
##  3     9       452                1.64 
##  4     8       486                1.66 
##  5     1       521                1.93 
##  6     5       563                1.96 
##  7     4       668                2.36 
##  8     3       861                2.99 
##  9     7       940                3.19 
## 10     6      1009                3.57 
## 11    12      1025                3.64 
## 12     2      1261                5.05
knitr::kable(flight_cancel, caption = "Cancellation Proportions by Month")
Cancellation Proportions by Month
month cancelled cancelled_proportion
   10       236            0.8169199
   11       233            0.8544814
    9       452            1.6392254
    8       486            1.6571760
    1       521            1.9293438
    5       563            1.9551327
    4       668            2.3579245
    3       861            2.9860581
    7       940            3.1945624
    6      1009            3.5725667
   12      1025            3.6431491
    2      1261            5.0539057

a. February had the highest proportion of cancelled flights, at 5.05%.

b. October had the lowest proportion of cancelled flights, at 0.82%.
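
The two answers can also be read off programmatically rather than by eye. The sketch below is a minimal illustration, assuming the percentages printed in the table above are retyped into a small data frame so that it runs without nycflights13. The name cancel_tbl is hypothetical; with the real flight_cancel object the same two lookups would apply.

```r
# Percentages copied from the printed table above.
# cancel_tbl is a hypothetical stand-in for flight_cancel.
cancel_tbl <- data.frame(
  month = c(10, 11, 9, 8, 1, 5, 4, 3, 7, 6, 12, 2),
  cancelled_proportion = c(0.817, 0.854, 1.64, 1.66, 1.93, 1.96,
                           2.36, 2.99, 3.19, 3.57, 3.64, 5.05)
)

# which.max() and which.min() return the row index of the extreme value;
# month.name is R's built-in vector of English month names.
highest_month <- month.name[cancel_tbl$month[which.max(cancel_tbl$cancelled_proportion)]]
lowest_month  <- month.name[cancel_tbl$month[which.min(cancel_tbl$cancelled_proportion)]]

highest_month
lowest_month
```

Looking the months up this way avoids depending on the sort order produced by arrange().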

Question #2

Consider the following pipeline:

library(tidyverse)
mtcars %>%
  group_by(cyl) %>%
  filter(am == 1) %>%
  summarize(avg_mpg = mean(mpg))
## # A tibble: 3 × 2
##     cyl avg_mpg
##   <dbl>   <dbl>
## 1     4    28.1
## 2     6    20.6
## 3     8    15.4

What is the problem with this pipeline?

The problem arises when filter(am == 1) is placed after summarize(). summarize() keeps only the grouping variable from group_by() (cyl) and the columns it computes (avg_mpg); every other column, including am, is dropped during aggregation. A subsequent filter(am == 1) then looks for a column that no longer exists and raises an error. The fix, which the code above already applies, is to filter by am before calling summarize().
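
The behavior described above can be checked directly. The following is a minimal sketch, assuming dplyr is installed; the object names summarized, late_filter_failed, and fixed are illustrative only. It inspects the columns left after summarize(), captures the error from filtering too late, and then runs the reordered pipeline.

```r
library(dplyr)

# summarize() keeps only the grouping column and the new summary column.
summarized <- mtcars %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg))
names(summarized)

# Filtering on am after summarize() raises an error because am is gone;
# tryCatch() records the failure so the script keeps running.
late_filter_failed <- tryCatch(
  {
    summarized %>% filter(am == 1)
    FALSE
  },
  error = function(e) TRUE
)
late_filter_failed

# Filtering first keeps am available, so the pipeline runs.
fixed <- mtcars %>%
  filter(am == 1) %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg))
fixed
```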

Question #3

Define two new variables in the mtcars data frame.

  1. A variable that indicates whether mpg is above the average of all vehicles in the dataset. You might want to use ifelse logic to do this.

  2. A variable that indicates whether hp is in the top quartile of all vehicles in the dataset.

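mtcars <- mtcars %>%
  mutate(
    # Label each vehicle relative to the mean mpg of all 32 vehicles
    above_avg_mpg = ifelse(mpg > mean(mpg), "Above Average", "At or Below Average"),
    # The comparison already yields TRUE/FALSE, so no ifelse() wrapper is needed
    top_quartile_hp = hp >= quantile(hp, 0.75)
  )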
#a
mtcars %>%
  rownames_to_column("vehicle") %>%
  select(vehicle, mpg, above_avg_mpg) %>%
  arrange(desc(mpg))
##                vehicle  mpg       above_avg_mpg
## 1       Toyota Corolla 33.9       Above Average
## 2             Fiat 128 32.4       Above Average
## 3          Honda Civic 30.4       Above Average
## 4         Lotus Europa 30.4       Above Average
## 5            Fiat X1-9 27.3       Above Average
## 6        Porsche 914-2 26.0       Above Average
## 7            Merc 240D 24.4       Above Average
## 8           Datsun 710 22.8       Above Average
## 9             Merc 230 22.8       Above Average
## 10       Toyota Corona 21.5       Above Average
## 11      Hornet 4 Drive 21.4       Above Average
## 12          Volvo 142E 21.4       Above Average
## 13           Mazda RX4 21.0       Above Average
## 14       Mazda RX4 Wag 21.0       Above Average
## 15        Ferrari Dino 19.7 At or Below Average
## 16            Merc 280 19.2 At or Below Average
## 17    Pontiac Firebird 19.2 At or Below Average
## 18   Hornet Sportabout 18.7 At or Below Average
## 19             Valiant 18.1 At or Below Average
## 20           Merc 280C 17.8 At or Below Average
## 21          Merc 450SL 17.3 At or Below Average
## 22          Merc 450SE 16.4 At or Below Average
## 23      Ford Pantera L 15.8 At or Below Average
## 24    Dodge Challenger 15.5 At or Below Average
## 25         Merc 450SLC 15.2 At or Below Average
## 26         AMC Javelin 15.2 At or Below Average
## 27       Maserati Bora 15.0 At or Below Average
## 28   Chrysler Imperial 14.7 At or Below Average
## 29          Duster 360 14.3 At or Below Average
## 30          Camaro Z28 13.3 At or Below Average
## 31  Cadillac Fleetwood 10.4 At or Below Average
## 32 Lincoln Continental 10.4 At or Below Average
#b
mtcars %>%
  rownames_to_column("vehicle") %>%
  select(vehicle, hp, top_quartile_hp) %>%
  arrange(desc(hp))
##                vehicle  hp top_quartile_hp
## 1        Maserati Bora 335            TRUE
## 2       Ford Pantera L 264            TRUE
## 3           Duster 360 245            TRUE
## 4           Camaro Z28 245            TRUE
## 5    Chrysler Imperial 230            TRUE
## 6  Lincoln Continental 215            TRUE
## 7   Cadillac Fleetwood 205            TRUE
## 8           Merc 450SE 180            TRUE
## 9           Merc 450SL 180            TRUE
## 10         Merc 450SLC 180            TRUE
## 11   Hornet Sportabout 175           FALSE
## 12    Pontiac Firebird 175           FALSE
## 13        Ferrari Dino 175           FALSE
## 14    Dodge Challenger 150           FALSE
## 15         AMC Javelin 150           FALSE
## 16            Merc 280 123           FALSE
## 17           Merc 280C 123           FALSE
## 18        Lotus Europa 113           FALSE
## 19           Mazda RX4 110           FALSE
## 20       Mazda RX4 Wag 110           FALSE
## 21      Hornet 4 Drive 110           FALSE
## 22          Volvo 142E 109           FALSE
## 23             Valiant 105           FALSE
## 24       Toyota Corona  97           FALSE
## 25            Merc 230  95           FALSE
## 26          Datsun 710  93           FALSE
## 27       Porsche 914-2  91           FALSE
## 28            Fiat 128  66           FALSE
## 29           Fiat X1-9  66           FALSE
## 30      Toyota Corolla  65           FALSE
## 31           Merc 240D  62           FALSE
## 32         Honda Civic  52           FALSE
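
To make the two cutoffs explicit, the sketch below computes them with base R only. The names mpg_mean and hp_q3 are illustrative. Because the 75th percentile of hp falls exactly on a value that several vehicles share, the choice between >= and > changes how many vehicles are flagged, so both counts are shown.

```r
# Thresholds behind the two indicator variables.
mpg_mean <- mean(mtcars$mpg)
hp_q3 <- unname(quantile(mtcars$hp, 0.75))  # unname() drops the "75%" label

mpg_mean
hp_q3

# Number of vehicles above the mean mpg.
n_above_avg <- sum(mtcars$mpg > mpg_mean)

# Vehicles flagged as top quartile with >= versus a strict >.
n_top_inclusive <- sum(mtcars$hp >= hp_q3)
n_top_strict <- sum(mtcars$hp > hp_q3)

c(n_above_avg, n_top_inclusive, n_top_strict)
```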

Question #4

  1. Name the make and model of every vehicle with above average mpg that has more than 4 gears.

  2. Name the vehicles with the top five times in the quarter mile (qsec).

#a
mtcars %>%
  filter(mpg > mean(mpg), gear > 4) %>%
  select(mpg, gear) %>%
  rownames_to_column("vehicle")
##         vehicle  mpg gear
## 1 Porsche 914-2 26.0    5
## 2  Lotus Europa 30.4    5
#b
mtcars %>%
  arrange(qsec) %>%
  head(5) %>%
  select(qsec) %>%
  rownames_to_column("vehicle")
##          vehicle  qsec
## 1 Ford Pantera L 14.50
## 2  Maserati Bora 14.60
## 3     Camaro Z28 15.41
## 4   Ferrari Dino 15.50
## 5     Duster 360 15.84

a. Porsche 914-2 and Lotus Europa are the only vehicles with above-average mpg and more than 4 gears; both have 5 gears.

b. The five fastest quarter-mile times (lowest qsec) belong to:
Ford Pantera L: 14.50 s
Maserati Bora: 14.60 s
Camaro Z28: 15.41 s
Ferrari Dino: 15.50 s
Duster 360: 15.84 s
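
The vehicle names can also be pulled straight from the data instead of being retyped from the printed output. This is a minimal sketch, assuming dplyr (version 1.0.0 or later, for slice_min()) and tibble are installed; mtcars_named, high_mpg_5gear, and fastest_five are hypothetical names. Moving the row names into a column first means later verbs cannot lose them.

```r
library(dplyr)
library(tibble)

# Store the make and model in a regular column.
mtcars_named <- mtcars %>% rownames_to_column("vehicle")

# a. Above-average mpg and more than 4 gears.
high_mpg_5gear <- mtcars_named %>%
  filter(mpg > mean(mpg), gear > 4) %>%
  pull(vehicle)

# b. The five lowest qsec values, returned in ascending order.
fastest_five <- mtcars_named %>%
  slice_min(qsec, n = 5) %>%
  pull(vehicle)

high_mpg_5gear
fastest_five
```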