Problem Set #1 - dplyr basics

Directions

During ANLY 512 we will be studying the theory and practice of data visualization. We will be using R and the packages within R to assemble data and construct many different types of visualizations. Before we begin studying data visualizations we need to develop some data wrangling skills. We will use these skills to wrangle our data into a form that we can use for visualizations.

The objective of this assignment is to introduce you to RStudio, R Markdown, the tidyverse, and more specifically the dplyr package.

Each question is worth 5 points.

To submit this homework you will create the document in RStudio, using the knitr package (button included in RStudio), and then submit the document to your RPubs account. Once uploaded, you will submit the link to that document on Canvas. Please make sure that this link is hyperlinked and that I can see the visualization and the code required to create it.

Question #1

Use the nycflights13 package and the flights data frame to answer the following questions:

a. What month had the highest proportion of cancelled flights? February
b. What month had the lowest? October

library(nycflights13)
library(dplyr)

# A flight with a missing dep_time never departed, so it is counted as cancelled
flights %>%
  group_by(month) %>%
  summarise(total = n(), proportion = mean(is.na(dep_time)))

## # A tibble: 12 × 3
##    month total proportion
##    <int> <int>      <dbl>
##  1     1 27004    0.0193 
##  2     2 24951    0.0505 
##  3     3 28834    0.0299 
##  4     4 28330    0.0236 
##  5     5 28796    0.0196 
##  6     6 28243    0.0357 
##  7     7 29425    0.0319 
##  8     8 29327    0.0166 
##  9     9 27574    0.0164 
## 10    10 28889    0.00817
## 11    11 27268    0.00854
## 12    12 28135    0.0364
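The two answers can also be picked out programmatically rather than by eye. The sketch below re-enters the monthly proportions printed above as a small tibble (the name cancelled_by_month is made up for illustration) and selects the extremes with slice_max() and slice_min(). In practice the same two calls could be piped onto the summarised flights result.

```r
library(dplyr)

# Monthly cancellation proportions copied from the summary table above;
# cancelled_by_month is an illustrative name, not part of nycflights13
cancelled_by_month <- tibble(
  month = 1:12,
  proportion = c(0.0193, 0.0505, 0.0299, 0.0236, 0.0196, 0.0357,
                 0.0319, 0.0166, 0.0164, 0.00817, 0.00854, 0.0364)
)

# Month with the highest proportion of cancelled flights
highest <- cancelled_by_month %>% slice_max(proportion, n = 1)

# Month with the lowest proportion of cancelled flights
lowest <- cancelled_by_month %>% slice_min(proportion, n = 1)

highest$month  # 2 (February)
lowest$month   # 10 (October)
```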

Question #2

Consider the following pipeline:

#library(tidyverse)
mtcars = mtcars

mtcars %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg)) %>%
  filter(am == 1)

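What is the problem with this pipeline? summarize() collapses the grouped data to one row per cyl and keeps only the grouping column cyl and the new column avg_mpg. By the time filter(am == 1) runs, the am column no longer exists, so the pipeline stops with an error saying that am cannot be found. To keep only the cars with am == 1, the filter has to come before group_by() and summarize().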
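One way to repair the pipeline, assuming the intent is the average mpg of manual-transmission cars (am == 1) within each cylinder class, is to filter while am is still a column and summarise afterwards. A minimal sketch (manual_mpg is an illustrative name):

```r
library(dplyr)

# Filter first, while am is still available, then group and summarise
manual_mpg <- mtcars %>%
  filter(am == 1) %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg))

manual_mpg
```

The result has one row per cylinder class that has at least one manual car, with only the cyl and avg_mpg columns.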

Question #3

Define two new variables in the Teams data frame in the Lahman package.

batting average (BA). Batting average is the ratio of hits (H) to at-bats (AB)
slugging percentage (SLG). Slugging percentage is total bases divided by at-bats (AB). To compute total bases, you get 1 for a single, 2 for a double, 3 for a triple, and 4 for a home run.
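Because H already counts every hit once, the total-bases rule above can be written without first isolating singles. The sketch below checks this arithmetic on the 1871 Boston Red Stockings line from the Teams table (H = 426, X2B = 70, X3B = 37, HR = 3, AB = 1372). The variable names are illustrative.

```r
# 1871 Boston Red Stockings totals, taken from the Teams table
H <- 426; X2B <- 70; X3B <- 37; HR <- 3; AB <- 1372

# Singles are the hits that are not doubles, triples, or home runs
singles <- H - X2B - X3B - HR

# Total bases straight from the definition: 1, 2, 3, 4 bases per hit type
tb_definition <- 1 * singles + 2 * X2B + 3 * X3B + 4 * HR

# Equivalent shortcut: every hit is worth one base, plus the extra bases
tb_shortcut <- H + X2B + 2 * X3B + 3 * HR

BA  <- H / AB
SLG <- tb_definition / AB
```

Both routes give 579 total bases for this team, so the shortcut form can be used directly inside mutate().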

library(Lahman)
library(dplyr)

Teams %>%
  mutate(
    BA = H / AB,
    # H already counts each hit once, so add 1 extra base per double,
    # 2 per triple, and 3 per home run to get total bases
    SLG = (H + X2B + 2 * X3B + 3 * HR) / AB
  ) %>%
  select(BA, SLG, everything()) %>%
  slice(1:10)

##           BA       SLG yearID lgID teamID franchID divID Rank  G Ghome  W  L
## 1  0.3104956 0.4220117   1871   NA    BS1      BNA  <NA>    3 31    NA 20 10
## 2  0.2700669 0.3737458   1871   NA    CH1      CNA  <NA>    2 28    NA 19  9
## 3  0.2765599 0.3912310   1871   NA    CL1      CFC  <NA>    8 29    NA 10 19
## 4  0.2386059 0.2935657   1871   NA    FW1      KEK  <NA>    7 19    NA  7 12
## 5  0.2870370 0.3497151   1871   NA    NY2      NNA  <NA>    5 33    NA 16 17
## 6  0.3200625 0.4348165   1871   NA    PH1      PNA  <NA>    1 28    NA 21  7
## 7  0.2644788 0.3638996   1871   NA    RC1      ROK  <NA>    9 25    NA  4 21
## 8  0.3076923 0.4174679   1871   NA    TRO      TRO  <NA>    6 29    NA 13 15
## 9  0.2771619 0.3688101   1871   NA    WS3      OLY  <NA>    4 32    NA 15 15
## 10 0.2928821 0.3745624   1872   NA    BL1      BLC  <NA>    2 58    NA 35 19
##    DivWin WCWin LgWin WSWin   R   AB   H X2B X3B HR BB SO SB CS HBP SF  RA  ER
## 1    <NA>  <NA>     N  <NA> 401 1372 426  70  37  3 60 19 73 16  NA NA 303 109
## 2    <NA>  <NA>     N  <NA> 302 1196 323  52  21 10 60 22 69 21  NA NA 241  77
## 3    <NA>  <NA>     N  <NA> 249 1186 328  35  40  7 26 25 18  8  NA NA 341 116
## 4    <NA>  <NA>     N  <NA> 137  746 178  19   8  2 33  9 16  4  NA NA 243  97
## 5    <NA>  <NA>     N  <NA> 302 1404 403  43  21  1 33 15 46 15  NA NA 313 121
## 6    <NA>  <NA>     Y  <NA> 376 1281 410  66  27  9 46 23 56 12  NA NA 266 137
## 7    <NA>  <NA>     N  <NA> 231 1036 274  44  25  3 38 30 53 10  NA NA 287 108
## 8    <NA>  <NA>     N  <NA> 351 1248 384  51  34  6 49 19 62 24  NA NA 362 153
## 9    <NA>  <NA>     N  <NA> 310 1353 375  54  26  6 48 13 48 13  NA NA 303 137
## 10   <NA>  <NA>     N  <NA> 617 2571 753 106  31 14 29 28 53 18  NA NA 434 166
##     ERA CG SHO SV IPouts  HA HRA BBA SOA   E DP    FP                    name
## 1  3.55 22   1  3    828 367   2  42  23 243 24 0.834    Boston Red Stockings
## 2  2.76 25   0  1    753 308   6  28  22 229 16 0.829 Chicago White Stockings
## 3  4.11 23   0  0    762 346  13  53  34 234 15 0.818  Cleveland Forest Citys
## 4  5.17 19   1  0    507 261   5  21  17 163  8 0.803    Fort Wayne Kekiongas
## 5  3.72 32   1  0    879 373   7  42  22 235 14 0.840        New York Mutuals
## 6  4.95 27   0  0    747 329   3  53  16 194 13 0.845  Philadelphia Athletics
## 7  4.30 23   1  0    678 315   3  34  16 220 14 0.821   Rockford Forest Citys
## 8  5.51 28   0  0    750 431   4  75  12 198 22 0.845          Troy Haymakers
## 9  4.37 32   0  0    846 371   4  45  13 218 20 0.850     Washington Olympics
## 10 2.90 48   1  1   1548 573   3  63  77 432 22 0.830      Baltimore Canaries
##                                 park attendance BPF PPF teamIDBR teamIDlahman45
## 1                South End Grounds I         NA 103  98      BOS            BS1
## 2            Union Base-Ball Grounds         NA 104 102      CHI            CH1
## 3       National Association Grounds         NA  96 100      CLE            CL1
## 4                     Hamilton Field         NA 101 107      KEK            FW1
## 5           Union Grounds (Brooklyn)         NA  90  88      NYU            NY2
## 6           Jefferson Street Grounds         NA 102  98      ATH            PH1
## 7  Agricultural Society Fair Grounds         NA  97  99      ROK            RC1
## 8                 Haymakers' Grounds         NA 101 100      TRO            TRO
## 9                   Olympics Grounds         NA  94  98      OLY            WS3
## 10                    Newington Park         NA 106 102      BAL            BL1
##    teamIDretro
## 1          BS1
## 2          CH1
## 3          CL1
## 4          FW1
## 5          NY2
## 6          PH1
## 7          RC1
## 8          TRO
## 9          WS3
## 10         BL1

Question #4

Using the Teams data frame in the Lahman package, display the top-5 teams ranked in terms of slugging percentage (SLG) in Major League Baseball history. Repeat this using teams since 1969. Slugging percentage is total bases divided by at-bats. To compute total bases, you get 1 for a single, 2 for a double, 3 for a triple, and 4 for a home run.

#library(Lahman)

Teams %>%
  mutate(
    BA = H/AB,
    SLG = (H+2*X2B+3*X3B+4*HR)/AB) %>%
    arrange(desc(SLG)) %>%
    select(yearID, name, SLG) %>%
    slice(1:5)

##   yearID             name       SLG
## 1   2019   Houston Astros 0.6092998
## 2   2019  Minnesota Twins 0.6071179
## 3   2003   Boston Red Sox 0.6033975
## 4   2019 New York Yankees 0.5996776
## 5   2020   Atlanta Braves 0.5964320

#Since 1969
Teams %>%
  mutate(
    BA = H/AB,
    SLG = (H+2*X2B+3*X3B+4*HR)/AB) %>%
    filter(yearID>=1969) %>%
    arrange(desc(SLG)) %>%
    select(yearID, name, SLG) %>%
    slice(1:5)

##   yearID             name       SLG
## 1   2019   Houston Astros 0.6092998
## 2   2019  Minnesota Twins 0.6071179
## 3   2003   Boston Red Sox 0.6033975
## 4   2019 New York Yankees 0.5996776
## 5   2020   Atlanta Braves 0.5964320
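The rank-then-truncate step can also be expressed with slice_max(), which sorts and keeps the top rows in one call. The sketch below uses a small made-up data frame (toy_teams and all of its values are hypothetical, not Lahman data) so that it runs without the package, and it assumes SLG has already been computed.

```r
library(dplyr)

# Hypothetical seasons with SLG already computed; not real Lahman values
toy_teams <- tibble(
  yearID = c(1927, 1950, 1969, 1994, 2003, 2019, 2019),
  name   = c("Team A", "Team B", "Team C", "Team D", "Team E", "Team F", "Team G"),
  SLG    = c(0.480, 0.455, 0.410, 0.470, 0.490, 0.500, 0.495)
)

# Top 5 over all seasons
top5_all <- toy_teams %>%
  slice_max(SLG, n = 5)

# Top 5 restricted to seasons from 1969 onward
top5_since_1969 <- toy_teams %>%
  filter(yearID >= 1969) %>%
  slice_max(SLG, n = 5)
```

In this toy data the pre-1969 Team A appears in the all-time list but drops out of the second one, which is the kind of difference the year filter is meant to expose.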

Question #5

Use the Batting, Pitching, and People tables in the Lahman package to answer the following questions.

a. Name every player in baseball history who has accumulated at least 300 home runs (HR) and at least 300 stolen bases (SB). You can find the first and last name of the player in the People data frame (called Master in older versions of Lahman). Join this to your result along with the total home runs and total bases stolen for each of these elite players.

b. Similarly, name every pitcher in baseball history who has accumulated at least 300 wins (W) and at least 3,000 strikeouts (SO).

c. Identify the name and year of every player who has hit at least 50 home runs in a single season. Which player had the lowest batting average in that season?

library(Lahman)

# Career home run and stolen base totals per player, joined to player details
Batting %>%
  group_by(playerID) %>%
  summarise(HRs = sum(HR), SBs = sum(SB)) %>%
  filter(HRs >= 300 & SBs >= 300) %>%
  inner_join(People, by = "playerID") %>%
  slice(1:5)

## # A tibble: 5 × 28
##   playerID    HRs   SBs birthY…¹ birth…² birth…³ birth…⁴ birth…⁵ birth…⁶ death…⁷
##   <chr>     <int> <int>    <int>   <int>   <int> <chr>   <chr>   <chr>     <int>
## 1 beltrca01   435   312     1977       4      24 P.R.    <NA>    Manati       NA
## 2 bondsba01   762   514     1964       7      24 USA     CA      Rivers…      NA
## 3 bondsbo01   332   461     1946       3      15 USA     CA      Rivers…    2003
## 4 dawsoan01   438   314     1954       7      10 USA     FL      Miami        NA
## 5 finlest01   304   320     1965       3      12 USA     TN      Union …      NA
## # … with 18 more variables: deathMonth <int>, deathDay <int>,
## #   deathCountry <chr>, deathState <chr>, deathCity <chr>, nameFirst <chr>,
## #   nameLast <chr>, nameGiven <chr>, weight <int>, height <int>, bats <fct>,
## #   throws <fct>, debut <chr>, finalGame <chr>, retroID <chr>, bbrefID <chr>,
## #   deathDate <date>, birthDate <date>, and abbreviated variable names
## #   ¹birthYear, ²birthMonth, ³birthDay, ⁴birthCountry, ⁵birthState, ⁶birthCity,
## #   ⁷deathYear

# Career win and strikeout totals per pitcher, joined to player details
Pitching %>%
  group_by(playerID) %>%
  summarise(Ws = sum(W), SOs = sum(SO)) %>%
  filter(Ws >= 300 & SOs >= 3000) %>%
  inner_join(People, by = "playerID") %>%
  slice(1:5)

## # A tibble: 5 × 28
##   playerID     Ws   SOs birthY…¹ birth…² birth…³ birth…⁴ birth…⁵ birth…⁶ death…⁷
##   <chr>     <int> <int>    <int>   <int>   <int> <chr>   <chr>   <chr>     <int>
## 1 carltst01   329  4136     1944      12      22 USA     FL      Miami        NA
## 2 clemero02   354  4672     1962       8       4 USA     OH      Dayton       NA
## 3 johnsra05   303  4875     1963       9      10 USA     CA      Walnut…      NA
## 4 johnswa01   417  3509     1887      11       6 USA     KS      Humbol…    1946
## 5 maddugr01   355  3371     1966       4      14 USA     TX      San An…      NA
## # … with 18 more variables: deathMonth <int>, deathDay <int>,
## #   deathCountry <chr>, deathState <chr>, deathCity <chr>, nameFirst <chr>,
## #   nameLast <chr>, nameGiven <chr>, weight <int>, height <int>, bats <fct>,
## #   throws <fct>, debut <chr>, finalGame <chr>, retroID <chr>, bbrefID <chr>,
## #   deathDate <date>, birthDate <date>, and abbreviated variable names
## #   ¹birthYear, ²birthMonth, ³birthDay, ⁴birthCountry, ⁵birthState, ⁶birthCity,
## #   ⁷deathYear

# Season totals per player (summed across stints), sorted by batting average
Batting %>%
  group_by(playerID, yearID) %>%
  summarize(HRs = sum(HR), BA = sum(H)/sum(AB)) %>%
  filter(HRs >= 50) %>%
  inner_join(People, by = "playerID") %>%
  arrange(BA)

## # A tibble: 46 × 29
## # Groups:   playerID [30]
##    playerID  yearID   HRs    BA birthY…¹ birth…² birth…³ birth…⁴ birth…⁵ birth…⁶
##    <chr>      <int> <int> <dbl>    <int>   <int>   <int> <chr>   <chr>   <chr>  
##  1 alonspe01   2019    53 0.260     1994      12       7 USA     FL      Tampa  
##  2 bautijo02   2010    54 0.260     1980      10      19 D.R.    Distri… Santo …
##  3 jonesan01   2005    51 0.263     1977       4      23 Curacao <NA>    Willem…
##  4 marisro01   1961    61 0.269     1934       9      10 USA     MN      Hibbing
##  5 vaughgr01   1998    50 0.272     1965       7       3 USA     CA      Sacram…
##  6 mcgwima01   1997    58 0.274     1963      10       1 USA     CA      Pomona 
##  7 fieldce01   1990    51 0.277     1963       9      21 USA     CA      Los An…
##  8 mcgwima01   1999    65 0.278     1963      10       1 USA     CA      Pomona 
##  9 stantmi03   2017    59 0.281     1989      11       8 USA     CA      Panora…
## 10 judgeaa01   2017    52 0.284     1992       4      26 USA     CA      Linden 
## # … with 36 more rows, 19 more variables: deathYear <int>, deathMonth <int>,
## #   deathDay <int>, deathCountry <chr>, deathState <chr>, deathCity <chr>,
## #   nameFirst <chr>, nameLast <chr>, nameGiven <chr>, weight <int>,
## #   height <int>, bats <fct>, throws <fct>, debut <chr>, finalGame <chr>,
## #   retroID <chr>, bbrefID <chr>, deathDate <date>, birthDate <date>, and
## #   abbreviated variable names ¹birthYear, ²birthMonth, ³birthDay,
## #   ⁴birthCountry, ⁵birthState, ⁶birthCity
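Since the questions ask for names, it can help to keep only the name columns after the join instead of carrying every column along. The sketch below shows the summarise, filter, join, select pattern on two tiny made-up tables (toy_batting and toy_people are hypothetical stand-ins for Batting and People). Using na.rm = TRUE is an assumption about how seasons with unrecorded statistics should be treated.

```r
library(dplyr)

# Hypothetical season-level batting rows (stand-in for Lahman::Batting)
toy_batting <- tibble(
  playerID = c("aaa01", "aaa01", "bbb01", "bbb01", "ccc01"),
  HR = c(200, 150, 290, NA, 310),
  SB = c(180, 160, 400, 10, 20)
)

# Hypothetical name lookup (stand-in for Lahman::People)
toy_people <- tibble(
  playerID  = c("aaa01", "bbb01", "ccc01"),
  nameFirst = c("Alex", "Blake", "Casey"),
  nameLast  = c("Adams", "Brown", "Clark")
)

elite <- toy_batting %>%
  group_by(playerID) %>%
  # Career totals; na.rm = TRUE skips seasons where a stat was not recorded
  summarise(HRs = sum(HR, na.rm = TRUE), SBs = sum(SB, na.rm = TRUE)) %>%
  filter(HRs >= 300, SBs >= 300) %>%
  inner_join(toy_people, by = "playerID") %>%
  select(nameFirst, nameLast, HRs, SBs)

elite
```

Only the made-up player aaa01 clears both thresholds here, so elite has a single row with the name and the two career totals.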


Data Wrangling with `dplyr`

Edwin Villavicencio

2023-01-24
