Problem Set #1 - dplyr basics

Directions

During ANLY 512 we will be studying the theory and practice of data visualization. We will be using R and the packages within R to assemble data and construct many different types of visualizations. Before we begin studying data visualizations we need to develop some data wrangling skills. We will use these skills to wrangle our data into a form that we can use for visualizations.

The objective of this assignment is to introduce you to R Studio, Rmarkdown, the tidyverse and more specifically the dplyr package.

Each question is worth 5 points.

To submit this homework you will create the document in Rstudio, using the knitr package (button included in Rstudio) and then submit the document to your Rpubs account. Once uploaded you will submit the link to that document on Canvas. Please make sure that this link is hyper linked and that I can see the visualization and the code required to create it.

Question #1

Use the nycflights13 package and the flights data frame to answer the following questions: a.What month had the highest proportion of cancelled flights? ## Answer: Month 2 is the highest proportion of cancelled flights

b.What month had the lowest? ## Answer: Month 10 is the lowest.

library(nycflights13)
flights

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

flights_cancel <- flights %>%
  group_by(month) %>%
  summarize(cancel_percent = sum(is.na(dep_time))/n()) %>% #assume no depart time means cancelled flights
  arrange(desc(cancel_percent))

flights_cancel

## # A tibble: 12 × 2
##    month cancel_percent
##    <int>          <dbl>
##  1     2        0.0505 
##  2    12        0.0364 
##  3     6        0.0357 
##  4     7        0.0319 
##  5     3        0.0299 
##  6     4        0.0236 
##  7     5        0.0196 
##  8     1        0.0193 
##  9     8        0.0166 
## 10     9        0.0164 
## 11    11        0.00854
## 12    10        0.00817

Question #2

Consider the following pipeline:

#library(tidyverse)
#mtcars %>%
  #group_by(cyl) %>%
  #summarize(avg_mpg = mean(mpg)) %>%
  #filter(am == 1)

mtcars

##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

#Fix
mtcars %>%
  group_by(cyl) %>%
  filter(am == 1) %>%
  summarize(avg_mpg = mean(mpg))

## # A tibble: 3 × 2
##     cyl avg_mpg
##   <dbl>   <dbl>
## 1     4    28.1
## 2     6    20.6
## 3     8    15.4

What is the problem with this pipeline? ## Answer There is no ‘am’ group by cyl. We can put filter(am == 1) after group_by to fix this error.

Question #3

Define two new variables in the Teams data frame in the pkg Lahman() package.

batting average (BA). Batting average is the ratio of hits (H) to at-bats (AB)
slugging percentage (SLG). Slugging percentage is total bases divided by at-bats (AB). To compute total bases, you get 1 for a single, 2 for a double, 3 for a triple, and 4 for a home run.

library(Lahman)
Teams <- Teams %>%
  mutate(BA = H/AB)%>%
  mutate(SLG = (H+X2B*2+X3B*3+HR*4)/AB)
head(Teams)

##   yearID lgID teamID franchID divID Rank  G Ghome  W  L DivWin WCWin LgWin
## 1   1871   NA    BS1      BNA  <NA>    3 31    NA 20 10   <NA>  <NA>     N
## 2   1871   NA    CH1      CNA  <NA>    2 28    NA 19  9   <NA>  <NA>     N
## 3   1871   NA    CL1      CFC  <NA>    8 29    NA 10 19   <NA>  <NA>     N
## 4   1871   NA    FW1      KEK  <NA>    7 19    NA  7 12   <NA>  <NA>     N
## 5   1871   NA    NY2      NNA  <NA>    5 33    NA 16 17   <NA>  <NA>     N
## 6   1871   NA    PH1      PNA  <NA>    1 28    NA 21  7   <NA>  <NA>     Y
##   WSWin   R   AB   H X2B X3B HR BB SO SB CS HBP SF  RA  ER  ERA CG SHO SV
## 1  <NA> 401 1372 426  70  37  3 60 19 73 16  NA NA 303 109 3.55 22   1  3
## 2  <NA> 302 1196 323  52  21 10 60 22 69 21  NA NA 241  77 2.76 25   0  1
## 3  <NA> 249 1186 328  35  40  7 26 25 18  8  NA NA 341 116 4.11 23   0  0
## 4  <NA> 137  746 178  19   8  2 33  9 16  4  NA NA 243  97 5.17 19   1  0
## 5  <NA> 302 1404 403  43  21  1 33 15 46 15  NA NA 313 121 3.72 32   1  0
## 6  <NA> 376 1281 410  66  27  9 46 23 56 12  NA NA 266 137 4.95 27   0  0
##   IPouts  HA HRA BBA SOA   E DP    FP                    name
## 1    828 367   2  42  23 243 24 0.834    Boston Red Stockings
## 2    753 308   6  28  22 229 16 0.829 Chicago White Stockings
## 3    762 346  13  53  34 234 15 0.818  Cleveland Forest Citys
## 4    507 261   5  21  17 163  8 0.803    Fort Wayne Kekiongas
## 5    879 373   7  42  22 235 14 0.840        New York Mutuals
## 6    747 329   3  53  16 194 13 0.845  Philadelphia Athletics
##                           park attendance BPF PPF teamIDBR teamIDlahman45
## 1          South End Grounds I         NA 103  98      BOS            BS1
## 2      Union Base-Ball Grounds         NA 104 102      CHI            CH1
## 3 National Association Grounds         NA  96 100      CLE            CL1
## 4               Hamilton Field         NA 101 107      KEK            FW1
## 5     Union Grounds (Brooklyn)         NA  90  88      NYU            NY2
## 6     Jefferson Street Grounds         NA 102  98      ATH            PH1
##   teamIDretro        BA       SLG
## 1         BS1 0.3104956 0.5021866
## 2         CH1 0.2700669 0.4431438
## 3         CL1 0.2765599 0.4603710
## 4         FW1 0.2386059 0.3324397
## 5         NY2 0.2870370 0.3960114
## 6         PH1 0.3200625 0.5144418

Question #4

Using the Teams data frame in the pkg Lahman() package. display the top-5 teams ranked in terms of slugging percentage (SLG) in Major League Baseball history. Repeat this using teams since 1969. Slugging percentage is total bases divided by at-bats.To compute total bases, you get 1 for a single, 2 for a double, 3 for a triple, and 4 for a home run.

library(Lahman)
Teams %>%
  select(yearID, teamID, SLG) %>%
  arrange(desc(SLG)) %>%
  head(5)

##   yearID teamID       SLG
## 1   2019    HOU 0.6092998
## 2   2019    MIN 0.6071179
## 3   2003    BOS 0.6033975
## 4   2019    NYA 0.5996776
## 5   2020    ATL 0.5964320

#Year after 1969
Teams %>%
  select(yearID, teamID, SLG) %>%
  filter(yearID > 1969) %>%
  arrange(desc(SLG)) %>%
  head(5)

##   yearID teamID       SLG
## 1   2019    HOU 0.6092998
## 2   2019    MIN 0.6071179
## 3   2003    BOS 0.6033975
## 4   2019    NYA 0.5996776
## 5   2020    ATL 0.5964320

Question #5

Use the Batting, Pitching, and Master tables in the pkg Lahman() package to answer the following questions.

a.Name every player in baseball history who has accumulated at least 300 home runs (HR) and at least 300 stolen bases (SB). You can find the first and last name of the player in the Master data frame. Join this to your result along with the total home runs and total bases stolen for each of these elite players.

Similarly, name every pitcher in baseball history who has accumulated at least 300 wins (W) and at least 3,000 strikeouts (SO).
Identify the name and year of every player who has hit at least 50 home runs in a single season. Which player had the lowest batting average in that season? ## Answer Peter Morgan(Given Name) had the lowest batting average in Year 2019.

library(Lahman)
head(People$playerID)

## [1] "aardsda01" "aaronha01" "aaronto01" "aasedo01"  "abadan01"  "abadfe01"

head(Batting$playerID)

## [1] "abercda01" "addybo01"  "allisar01" "allisdo01" "ansonca01" "armstbo01"

#Batting
Batting %>%
  group_by(playerID) %>%
  summarize(player = sum(HR), steal = sum(SB)) %>%
  filter(player > 300, steal > 300) %>%
  right_join(People, by = c("playerID" = "playerID")) %>%
  select(nameFirst, nameLast, nameGiven,player, steal)

## # A tibble: 20,370 × 5
##    nameFirst nameLast  nameGiven          player steal
##    <chr>     <chr>     <chr>               <int> <int>
##  1 Carlos    Beltran   Carlos Ivan           435   312
##  2 Barry     Bonds     Barry Lamar           762   514
##  3 Bobby     Bonds     Bobby Lee             332   461
##  4 Andre     Dawson    Andre Nolan           438   314
##  5 Steve     Finley    Steven Allen          304   320
##  6 Willie    Mays      Willie Howard         660   338
##  7 Alex      Rodriguez Alexander Enmanuel    696   329
##  8 Reggie    Sanders   Reginald Laverne      305   304
##  9 David     Aardsma   David Allan            NA    NA
## 10 Hank      Aaron     Henry Louis            NA    NA
## # … with 20,360 more rows

#Pitching
Pitching %>%
  group_by(playerID) %>%
  summarize(wins = sum(W), strikeouts = sum(SO)) %>%
  filter(wins > 300, strikeouts > 3000) %>%
  right_join(People, by = c("playerID" = "playerID")) %>%
  select(nameFirst, nameLast, nameGiven,wins, strikeouts)

## # A tibble: 20,370 × 5
##    nameFirst nameLast nameGiven        wins strikeouts
##    <chr>     <chr>    <chr>           <int>      <int>
##  1 Steve     Carlton  Steven Norman     329       4136
##  2 Roger     Clemens  William Roger     354       4672
##  3 Randy     Johnson  Randall David     303       4875
##  4 Walter    Johnson  Walter Perry      417       3509
##  5 Greg      Maddux   Gregory Alan      355       3371
##  6 Phil      Niekro   Philip Henry      318       3342
##  7 Gaylord   Perry    Gaylord Jackson   314       3534
##  8 Nolan     Ryan     Lynn Nolan        324       5714
##  9 Tom       Seaver   George Thomas     311       3640
## 10 Don       Sutton   Donald Howard     324       3574
## # … with 20,360 more rows

#Batting
Batting %>%
  group_by(playerID, yearID) %>%
  summarize(player = sum(HR), batting_mean = sum(H)/sum(AB)) %>%
  filter(player >= 50) %>%
  left_join(People, by = c("playerID" = "playerID")) %>%
  select(yearID, nameFirst, nameLast, nameGiven,player, batting_mean) %>%
  arrange(batting_mean)

## # A tibble: 46 × 7
## # Groups:   playerID [30]
##    playerID  yearID nameFirst nameLast nameGiven             player batting_mean
##    <chr>      <int> <chr>     <chr>    <chr>                  <int>        <dbl>
##  1 alonspe01   2019 Pete      Alonso   Peter Morgan              53        0.260
##  2 bautijo02   2010 Jose      Bautista Jose Antonio              54        0.260
##  3 jonesan01   2005 Andruw    Jones    Andruw Rudolf             51        0.263
##  4 marisro01   1961 Roger     Maris    Roger Eugene              61        0.269
##  5 vaughgr01   1998 Greg      Vaughn   Gregory Lamont            50        0.272
##  6 mcgwima01   1997 Mark      McGwire  Mark David                58        0.274
##  7 fieldce01   1990 Cecil     Fielder  Cecil Grant               51        0.277
##  8 mcgwima01   1999 Mark      McGwire  Mark David                65        0.278
##  9 stantmi03   2017 Giancarlo Stanton  Giancarlo Cruz-Micha…     59        0.281
## 10 judgeaa01   2017 Aaron     Judge    Aaron James               52        0.284
## # … with 36 more rows

Problem Set #1 - `dplyr` basics

Data Wrangling with `dplyr`

Xiaoxiao Lu

`2022-09-11`

Directions

Problem Set #1 - dplyr basics

Data Wrangling with dplyr

Xiaoxiao Lu

2022-09-11

Directions

Problem Set #1 - `dplyr` basics

Data Wrangling with `dplyr`

`2022-09-11`