Problem Set #1 - dplyr basics

Directions

During ANLY 512 we will be studying the theory and practice of data visualization. We will be using R and the packages within R to assemble data and construct many different types of visualizations. Before we begin studying data visualizations we need to develop some data wrangling skills. We will use these skills to wrangle our data into a form that we can use for visualizations.

The objective of this assignment is to introduce you to R Studio, Rmarkdown, the tidyverse and more specifically the dplyr package.

Each question is worth 5 points.

To submit this homework you will create the document in Rstudio, using the knitr package (button included in Rstudio) and then submit the document to your Rpubs account. Once uploaded you will submit the link to that document on Canvas. Please make sure that this link is hyper linked and that I can see the visualization and the code required to create it.

Question #1

Use the nycflights13 package and the flights data frame to answer the following questions:

library(nycflights13)

flights_df = nycflights13::flights # NOTE: get `flights`  dataframe

# Get columns 
names(flights_df)

##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "hour"           "minute"         "time_hour"

# Cancelled flights is defined as there is no dep_time

agg_flights_cancelled <- flights_df %>% # dataframe
  group_by(month) %>% # aggregate/group by month
  summarise( 
    flights_cancelled_per_month = sum(is.na(dep_time)), # NOTE: this is total records of dept_time is N.A.
    flights_total_per_month = n(),  # NOTE: this is total records per month
    flights_cancelled_proportion = flights_cancelled_per_month / flights_total_per_month * 100, # NOTE: this is in percentage
            ) %>%
  arrange(desc(flights_cancelled_proportion)) # NOTE: sort flights_cancelled_proportion in descending order


# NOTE: final aggregated output
 print(agg_flights_cancelled)

## # A tibble: 12 × 4
##    month flights_cancelled_per_month flights_total_per_month flights_cancelled_…
##    <int>                       <int>                   <int>               <dbl>
##  1     2                        1261                   24951               5.05 
##  2    12                        1025                   28135               3.64 
##  3     6                        1009                   28243               3.57 
##  4     7                         940                   29425               3.19 
##  5     3                         861                   28834               2.99 
##  6     4                         668                   28330               2.36 
##  7     5                         563                   28796               1.96 
##  8     1                         521                   27004               1.93 
##  9     8                         486                   29327               1.66 
## 10     9                         452                   27574               1.64 
## 11    11                         233                   27268               0.854
## 12    10                         236                   28889               0.817

a.What month had the highest proportion of cancelled flights?

Ans: month 2 (Feb) has the highest cancellation proportion (5.05 %). The first row of the aggregated dataframe

b.What month had the lowest?

Ans: month 10 (Oct) has the lowest (0.82 %). The last row of the dataframe

Question #2

Consider the following pipeline:

library(tidyverse)
test_agg <- mtcars %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg))
  #filter(am == 1)

# 
print(test_agg)

What is the problem with this pipeline?

Ans: after summarizing, there is no am field anymore. See the dataframe test_agg

We could filter mtcars by am first, then apply summarise. See the dataframe test_agg_fixed

test_agg_fixed <- mtcars %>%
  filter(am == 1) %>% # filter first

  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg))
  #filter(am == 1) # comment out


print(test_agg_fixed)

## # A tibble: 3 × 2
##     cyl avg_mpg
##   <dbl>   <dbl>
## 1     4    28.1
## 2     6    20.6
## 3     8    15.4

Question #3

Define two new variables in the Teams data frame in the pkg Lahman() package.

batting average (BA). Batting average is the ratio of hits (H) to at-bats (AB)
slugging percentage (SLG). Slugging percentage is total bases divided by at-bats (AB). To compute total bases, you get 1 for a single, 2 for a double, 3 for a triple, and 4 for a home run.

library(Lahman)

team_df = Lahman::Teams # Get Teams

names(team_df)

##  [1] "yearID"         "lgID"           "teamID"         "franchID"      
##  [5] "divID"          "Rank"           "G"              "Ghome"         
##  [9] "W"              "L"              "DivWin"         "WCWin"         
## [13] "LgWin"          "WSWin"          "R"              "AB"            
## [17] "H"              "X2B"            "X3B"            "HR"            
## [21] "BB"             "SO"             "SB"             "CS"            
## [25] "HBP"            "SF"             "RA"             "ER"            
## [29] "ERA"            "CG"             "SHO"            "SV"            
## [33] "IPouts"         "HA"             "HRA"            "BBA"           
## [37] "SOA"            "E"              "DP"             "FP"            
## [41] "name"           "park"           "attendance"     "BPF"           
## [45] "PPF"            "teamIDBR"       "teamIDlahman45" "teamIDretro"

# NOTE: create new field - BA

team_df$BA = team_df$H / team_df$AB # ans of a


# NOTE: create another field - SLG

# Reference: https://rdrr.io/cran/Lahman/man/BattingPost.html

# H: hits
# X2B: doubles
# X3B: triples
# HR: homerun

team_df$total_base = team_df$H * 1 + team_df$X2B * 2 + team_df$X3B * 3 + team_df$HR * 4


team_df$SLG = team_df$total_base / team_df$AB # ans of b


print(head(team_df))

##   yearID lgID teamID franchID divID Rank  G Ghome  W  L DivWin WCWin LgWin
## 1   1871   NA    BS1      BNA  <NA>    3 31    NA 20 10   <NA>  <NA>     N
## 2   1871   NA    CH1      CNA  <NA>    2 28    NA 19  9   <NA>  <NA>     N
## 3   1871   NA    CL1      CFC  <NA>    8 29    NA 10 19   <NA>  <NA>     N
## 4   1871   NA    FW1      KEK  <NA>    7 19    NA  7 12   <NA>  <NA>     N
## 5   1871   NA    NY2      NNA  <NA>    5 33    NA 16 17   <NA>  <NA>     N
## 6   1871   NA    PH1      PNA  <NA>    1 28    NA 21  7   <NA>  <NA>     Y
##   WSWin   R   AB   H X2B X3B HR BB SO SB CS HBP SF  RA  ER  ERA CG SHO SV
## 1  <NA> 401 1372 426  70  37  3 60 19 73 16  NA NA 303 109 3.55 22   1  3
## 2  <NA> 302 1196 323  52  21 10 60 22 69 21  NA NA 241  77 2.76 25   0  1
## 3  <NA> 249 1186 328  35  40  7 26 25 18  8  NA NA 341 116 4.11 23   0  0
## 4  <NA> 137  746 178  19   8  2 33  9 16  4  NA NA 243  97 5.17 19   1  0
## 5  <NA> 302 1404 403  43  21  1 33 15 46 15  NA NA 313 121 3.72 32   1  0
## 6  <NA> 376 1281 410  66  27  9 46 23 56 12  NA NA 266 137 4.95 27   0  0
##   IPouts  HA HRA BBA SOA   E DP    FP                    name
## 1    828 367   2  42  23 243 24 0.834    Boston Red Stockings
## 2    753 308   6  28  22 229 16 0.829 Chicago White Stockings
## 3    762 346  13  53  34 234 15 0.818  Cleveland Forest Citys
## 4    507 261   5  21  17 163  8 0.803    Fort Wayne Kekiongas
## 5    879 373   7  42  22 235 14 0.840        New York Mutuals
## 6    747 329   3  53  16 194 13 0.845  Philadelphia Athletics
##                           park attendance BPF PPF teamIDBR teamIDlahman45
## 1          South End Grounds I         NA 103  98      BOS            BS1
## 2      Union Base-Ball Grounds         NA 104 102      CHI            CH1
## 3 National Association Grounds         NA  96 100      CLE            CL1
## 4               Hamilton Field         NA 101 107      KEK            FW1
## 5     Union Grounds (Brooklyn)         NA  90  88      NYU            NY2
## 6     Jefferson Street Grounds         NA 102  98      ATH            PH1
##   teamIDretro        BA total_base       SLG
## 1         BS1 0.3104956        689 0.5021866
## 2         CH1 0.2700669        530 0.4431438
## 3         CL1 0.2765599        546 0.4603710
## 4         FW1 0.2386059        248 0.3324397
## 5         NY2 0.2870370        556 0.3960114
## 6         PH1 0.3200625        659 0.5144418

Question #4

Using the Teams data frame in the pkg Lahman() package. display the top-5 teams ranked in terms of slugging percentage (SLG) in Major League Baseball history. Repeat this using teams since 1969. Slugging percentage is total bases divided by at-bats.To compute total bases, you get 1 for a single, 2 for a double, 3 for a triple, and 4 for a home run.

# top 5 in history
team_df %>%
  select(yearID, teamID, SLG) %>%
  arrange(desc(SLG)) %>%
  head(5)

##   yearID teamID       SLG
## 1   2019    HOU 0.6092998
## 2   2019    MIN 0.6071179
## 3   2003    BOS 0.6033975
## 4   2019    NYA 0.5996776
## 5   2020    ATL 0.5964320

# since 1969
team_df %>%
  filter(yearID>=1969) %>% # NOTE: filter by years
  select(yearID, teamID, SLG) %>%
  arrange(desc(SLG)) %>%
  head(5)

##   yearID teamID       SLG
## 1   2019    HOU 0.6092998
## 2   2019    MIN 0.6071179
## 3   2003    BOS 0.6033975
## 4   2019    NYA 0.5996776
## 5   2020    ATL 0.5964320

Question #5

Use the Batting, Pitching, and People tables in the pkg Lahman() package to answer the following questions.

a.Name every player in baseball history who has accumulated at least 300 home runs (HR) and at least 300 stolen bases (SB). You can find the first and last name of the player in the Master data frame. Join this to your result along with the total home runs and total bases stolen for each of these elite players.

library(Lahman)

batting_df = Lahman::Batting 

names(batting_df)

##  [1] "playerID" "yearID"   "stint"    "teamID"   "lgID"     "G"       
##  [7] "AB"       "R"        "H"        "X2B"      "X3B"      "HR"      
## [13] "RBI"      "SB"       "CS"       "BB"       "SO"       "IBB"     
## [19] "HBP"      "SH"       "SF"       "GIDP"

people_df = Lahman::People

names(people_df)

##  [1] "playerID"     "birthYear"    "birthMonth"   "birthDay"     "birthCountry"
##  [6] "birthState"   "birthCity"    "deathYear"    "deathMonth"   "deathDay"    
## [11] "deathCountry" "deathState"   "deathCity"    "nameFirst"    "nameLast"    
## [16] "nameGiven"    "weight"       "height"       "bats"         "throws"      
## [21] "debut"        "finalGame"    "retroID"      "bbrefID"      "deathDate"   
## [26] "birthDate"

batting_df %>%
  group_by(playerID) %>% # aggregated by each playerID
  summarise(
    total_HR = sum(HR), # total of HR per player
    total_SB = sum(SB), # total of SB per player
  ) %>%
  filter(total_HR >= 300 & total_SB >= 300) %>%
  
  inner_join(people_df, by = "playerID") %>% # inner join with people dataframe on playerID
  
  select(nameFirst, nameLast, nameGiven, total_HR, total_SB) # pick the necessary field

## # A tibble: 8 × 5
##   nameFirst nameLast  nameGiven          total_HR total_SB
##   <chr>     <chr>     <chr>                 <int>    <int>
## 1 Carlos    Beltran   Carlos Ivan             435      312
## 2 Barry     Bonds     Barry Lamar             762      514
## 3 Bobby     Bonds     Bobby Lee               332      461
## 4 Andre     Dawson    Andre Nolan             438      314
## 5 Steve     Finley    Steven Allen            304      320
## 6 Willie    Mays      Willie Howard           660      338
## 7 Alex      Rodriguez Alexander Enmanuel      696      329
## 8 Reggie    Sanders   Reginald Laverne        305      304

Similarly, name every pitcher in baseball history who has accumulated at least 300 wins (W) and at least 3,000 strikeouts (SO).

pitching_df = Lahman::Pitching


names(pitching_df)

##  [1] "playerID" "yearID"   "stint"    "teamID"   "lgID"     "W"       
##  [7] "L"        "G"        "GS"       "CG"       "SHO"      "SV"      
## [13] "IPouts"   "H"        "ER"       "HR"       "BB"       "SO"      
## [19] "BAOpp"    "ERA"      "IBB"      "WP"       "HBP"      "BK"      
## [25] "BFP"      "GF"       "R"        "SH"       "SF"       "GIDP"

pitching_df %>% 
  group_by(playerID) %>%
  summarise(
    total_Win = sum(W),
    total_SO = sum(SO),
  ) %>%
  
  filter(total_Win >= 300 & total_SO >= 3000) %>%
  
  inner_join(people_df, by = "playerID") %>%
  select(nameFirst, nameLast, nameGiven, total_Win, total_SO)

## # A tibble: 10 × 5
##    nameFirst nameLast nameGiven       total_Win total_SO
##    <chr>     <chr>    <chr>               <int>    <int>
##  1 Steve     Carlton  Steven Norman         329     4136
##  2 Roger     Clemens  William Roger         354     4672
##  3 Randy     Johnson  Randall David         303     4875
##  4 Walter    Johnson  Walter Perry          417     3509
##  5 Greg      Maddux   Gregory Alan          355     3371
##  6 Phil      Niekro   Philip Henry          318     3342
##  7 Gaylord   Perry    Gaylord Jackson       314     3534
##  8 Nolan     Ryan     Lynn Nolan            324     5714
##  9 Tom       Seaver   George Thomas         311     3640
## 10 Don       Sutton   Donald Howard         324     3574

Identify the name and year of every player who has hit at least 50 home runs in a single season. Which player had the lowest batting average in that season?

Ans: 2019, (Pete Alonso Peter Morgan)

# NOTE: season by year

batting_df %>%
  group_by(playerID, yearID) %>%
  summarise(
    total_HR_per_year = sum(HR), # total Home Run per year
    batting_avg_per_year = sum(H) / sum(AB), # batting average per year
  ) %>%
  filter(total_HR_per_year >= 50) %>%
  
  inner_join(people_df, by = "playerID") %>% # merge with people_df dataframe on playerID
  select(nameFirst, nameLast, nameGiven, yearID, total_HR_per_year, batting_avg_per_year) %>%
  
  arrange(batting_avg_per_year)

## # A tibble: 46 × 7
## # Groups:   playerID [30]
##    playerID  nameFirst nameLast nameGiven              yearID total_HR_per_year
##    <chr>     <chr>     <chr>    <chr>                   <int>             <int>
##  1 alonspe01 Pete      Alonso   Peter Morgan             2019                53
##  2 bautijo02 Jose      Bautista Jose Antonio             2010                54
##  3 jonesan01 Andruw    Jones    Andruw Rudolf            2005                51
##  4 marisro01 Roger     Maris    Roger Eugene             1961                61
##  5 vaughgr01 Greg      Vaughn   Gregory Lamont           1998                50
##  6 mcgwima01 Mark      McGwire  Mark David               1997                58
##  7 fieldce01 Cecil     Fielder  Cecil Grant              1990                51
##  8 mcgwima01 Mark      McGwire  Mark David               1999                65
##  9 stantmi03 Giancarlo Stanton  Giancarlo Cruz-Michael   2017                59
## 10 judgeaa01 Aaron     Judge    Aaron James              2017                52
## # … with 36 more rows, and 1 more variable: batting_avg_per_year <dbl>

Problem Set #1 - `dplyr` basics

Data Wrangling with `dplyr`

Hao Zhang

2023-01-07

Directions

Problem Set #1 - dplyr basics

Data Wrangling with dplyr

Hao Zhang

2023-01-07

Directions

Problem Set #1 - `dplyr` basics

Data Wrangling with `dplyr`