Problem Set #1 - dplyr basics

Directions

During ANLY 512 we will be studying the theory and practice of data visualization. We will be using R and the packages within R to assemble data and construct many different types of visualizations. Before we begin studying data visualizations we need to develop some data wrangling skills. We will use these skills to wrangle our data into a form that we can use for visualizations.

The objective of this assignment is to introduce you to R Studio, Rmarkdown, the tidyverse and more specifically the dplyr package.

Each question is worth 5 points.

To submit this homework you will create the document in Rstudio, using the knitr package (button included in Rstudio) and then submit the document to your Rpubs account. Once uploaded you will submit the link to that document on Canvas. Please make sure that this link is hyper linked and that I can see the visualization and the code required to create it.

Question #1

Use the nycflights13 package and the flights data frame to answer the following questions: a.What month had the highest proportion of cancelled flights? b.What month had the lowest?

library(nycflights13)
library(tidyverse)

flights = nycflights13::flights

flights <- flights %>%
  mutate(
    cancelled =ifelse(is.na(dep_time),1,0)
  )

flight_cancellation_by_month_sorted <- flights %>%
  select(year,month,day,cancelled) %>%
  group_by(month,year) %>%
  summarise(
    month = first(month),
    year  = first(year),
    count = sum(cancelled),
    total = n(),
    pct   = sum(cancelled)/n()
  ) %>%
  arrange(pct)

flight_cancellation_by_month_sorted[12, , drop=TRUE]

## # A tibble: 1 × 5
##   month  year count total    pct
##   <int> <int> <dbl> <int>  <dbl>
## 1     2  2013  1261 24951 0.0505

flight_cancellation_by_month_sorted[1, , drop=TRUE]

## # A tibble: 1 × 5
##   month  year count total     pct
##   <int> <int> <dbl> <int>   <dbl>
## 1    10  2013   236 28889 0.00817

February has the highest cancellations at 5.05% October had the lowest cancellations at 0.81%

Question #2

Consider the following pipeline:

#mtcars %>%
#  group_by(cyl) %>%
#  summarize(avg_mpg = mean(mpg)) %>%
#  filter(am == 1)

What is the problem with this pipeline?

The problem with the pipeline is the order of our operations. We first perform a group by operation on one column and use the summarize option next to only calculate the average mpg for the group. The resulting tibble only has the grouping column (cyl) and the summarized column, therefore the filter step found right after will fail since that column does not exist. This can be fixed by including the filter above the group_by operation so that only rows with am==1 are passed into the group_by function. See the code snippet below to generate the example in this context.

library(tidyverse)
#what the earlier operation was doing
res1<-mtcars %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg))
print(res1,n=5)

res2<-mtcars %>%
  filter(am == 1) %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg)) 

print(res2,n=5)

Question #3

Define two new variables in the Teams data frame in the pkg Lahman() package.

batting average (BA). Batting average is the ratio of hits (H) to at-bats (AB)
slugging percentage (SLG). Slugging percentage is total bases divided by at-bats (AB). To compute total bases, you get 1 for a single, 2 for a double, 3 for a triple, and 4 for a home run.

library(Lahman)
teams <- Teams

teams <- teams %>%
  mutate(
    BA = H/AB,
    X1B= H-X2B-X3B-HR,
    SLG= (((X1B)*1) + (X2B*2) + (X3B*3) + (HR*4))/AB #since singles are not recorded and hits include singles
  )

Question #4

Using the Teams data frame in the pkg Lahman() package. display the top-5 teams ranked in terms of slugging percentage (SLG) in Major League Baseball history. Repeat this using teams since 1969. Slugging percentage is total bases divided by at-bats.To compute total bases, you get 1 for a single, 2 for a double, 3 for a triple, and 4 for a home run.

library(Lahman)
#lets use the teams tibble we made earlier which has the SLG calculated.

teams %>%
  arrange(SLG) %>%
  top_n(n=5,wt=SLG) #top 5 teams ranked by SLG

##   yearID lgID teamID franchID divID Rank   G Ghome   W  L DivWin WCWin LgWin
## 1   1927   AL    NYA      NYY  <NA>    1 155    77 110 44   <NA>  <NA>     Y
## 2   2019   AL    NYA      NYY     E    1 162    81 103 59      Y     N     N
## 3   2003   AL    BOS      BOS     E    2 162    81  95 67      N     Y     N
## 4   2019   AL    MIN      MIN     C    1 162    81 101 61      Y     N     N
## 5   2019   AL    HOU      HOU     W    1 162    81 107 55      Y     N     Y
##   WSWin   R   AB    H X2B X3B  HR  BB   SO SB CS HBP SF  RA  ER  ERA CG SHO SV
## 1     Y 975 5347 1644 291 103 158 635  605 90 64  NA NA 599 494 3.20 82  11 20
## 2     N 943 5583 1493 290  17 306 569 1437 55 22  49 33 739 691 4.31  1   9 50
## 3     N 961 5769 1667 371  40 238 620  943 88 35  53 64 809 729 4.48  5   6 36
## 4     N 939 5732 1547 318  23 307 525 1334 28 21  81 41 754 680 4.18  1  10 50
## 5     N 920 5613 1538 323  28 288 645 1166 67 27  66 57 640 595 3.66  2  14 47
##   IPouts   HA HRA BBA  SOA   E  DP    FP             name               park
## 1   4167 1403  42 409  431 196 123 0.969 New York Yankees   Yankee Stadium I
## 2   4329 1374 248 507 1534 102 135 0.982 New York Yankees Yankee Stadium III
## 3   4394 1503 153 488 1141 113 130 0.982   Boston Red Sox     Fenway Park II
## 4   4390 1456 198 452 1463 111 130 0.981  Minnesota Twins       Target Field
## 5   4387 1205 230 448 1671  71  96 0.988   Houston Astros   Minute Maid Park
##   attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro        BA  X1B
## 1    1164015  98  94      NYY            NYA         NYA 0.3074621 1092
## 2    3304404  98  96      NYY            NYA         NYA 0.2674190  880
## 3    2724165 105 103      BOS            BOS         BOS 0.2889582 1018
## 4    2294152 100  99      MIN            MIN         MIN 0.2698883  899
## 5    2857367 103 100      HOU            HOU         HOU 0.2740068  899
##         SLG
## 1 0.4890593
## 2 0.4898800
## 3 0.4908996
## 4 0.4940684
## 5 0.4954570

teams %>%
  filter(yearID>=1969) %>%
  arrange(SLG) %>%
  top_n(n=5,wt=SLG)

##   yearID lgID teamID franchID divID Rank   G Ghome   W  L DivWin WCWin LgWin
## 1   1997   AL    SEA      SEA     W    1 162    81  90 72      Y     N     N
## 2   2019   AL    NYA      NYY     E    1 162    81 103 59      Y     N     N
## 3   2003   AL    BOS      BOS     E    2 162    81  95 67      N     Y     N
## 4   2019   AL    MIN      MIN     C    1 162    81 101 61      Y     N     N
## 5   2019   AL    HOU      HOU     W    1 162    81 107 55      Y     N     Y
##   WSWin   R   AB    H X2B X3B  HR  BB   SO SB CS HBP SF  RA  ER  ERA CG SHO SV
## 1     N 925 5614 1574 312  21 264 626 1110 89 40  49 49 833 769 4.78  9   8 38
## 2     N 943 5583 1493 290  17 306 569 1437 55 22  49 33 739 691 4.31  1   9 50
## 3     N 961 5769 1667 371  40 238 620  943 88 35  53 64 809 729 4.48  5   6 36
## 4     N 939 5732 1547 318  23 307 525 1334 28 21  81 41 754 680 4.18  1  10 50
## 5     N 920 5613 1538 323  28 288 645 1166 67 27  66 57 640 595 3.66  2  14 47
##   IPouts   HA HRA BBA  SOA   E  DP    FP             name               park
## 1   4343 1500 192 598 1207 126 143 0.979 Seattle Mariners           Kingdome
## 2   4329 1374 248 507 1534 102 135 0.982 New York Yankees Yankee Stadium III
## 3   4394 1503 153 488 1141 113 130 0.982   Boston Red Sox     Fenway Park II
## 4   4390 1456 198 452 1463 111 130 0.981  Minnesota Twins       Target Field
## 5   4387 1205 230 448 1671  71  96 0.988   Houston Astros   Minute Maid Park
##   attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro        BA  X1B
## 1    3192237  98  98      SEA            SEA         SEA 0.2803705  977
## 2    3304404  98  96      NYY            NYA         NYA 0.2674190  880
## 3    2724165 105 103      BOS            BOS         BOS 0.2889582 1018
## 4    2294152 100  99      MIN            MIN         MIN 0.2698883  899
## 5    2857367 103 100      HOU            HOU         HOU 0.2740068  899
##         SLG
## 1 0.4845030
## 2 0.4898800
## 3 0.4908996
## 4 0.4940684
## 5 0.4954570

Question #5

Use the Batting, Pitching, and People tables in the pkg Lahman() package to answer the following questions.

a.Name every player in baseball history who has accumulated at least 300 home runs (HR) and at least 300 stolen bases (SB). You can find the first and last name of the player in the Master data frame. Join this to your result along with the total home runs and total bases stolen for each of these elite players.

Similarly, name every pitcher in baseball history who has accumulated at least 300 wins (W) and at least 3,000 strikeouts (SO).
Identify the name and year of every player who has hit at least 50 home runs in a single season. Which player had the lowest batting average in that season?

library(Lahman)

batting <- Batting
pitching<- Pitching
people  <- People

batting_joined <-
  batting %>%
  left_join(people,by=c("playerID" = "playerID"))
  
batting_joined %>%
  group_by(playerID)%>%
  summarise(
    firstName= first(nameFirst),
    lastName = first(nameLast),
    total_HR = sum(HR),
    total_SB = sum(SB),
  ) %>%
  filter(
    total_HR>=300 & total_SB >=300
  )

## # A tibble: 8 × 5
##   playerID  firstName lastName  total_HR total_SB
##   <chr>     <chr>     <chr>        <int>    <int>
## 1 beltrca01 Carlos    Beltran        435      312
## 2 bondsba01 Barry     Bonds          762      514
## 3 bondsbo01 Bobby     Bonds          332      461
## 4 dawsoan01 Andre     Dawson         438      314
## 5 finlest01 Steve     Finley         304      320
## 6 mayswi01  Willie    Mays           660      338
## 7 rodrial01 Alex      Rodriguez      696      329
## 8 sandere02 Reggie    Sanders        305      304

pitching_joined <-
  pitching %>%
  left_join(people,by=c("playerID" = "playerID"))

pitching_joined %>%
  group_by(playerID) %>%
  summarise(
    firstName = first(nameFirst),
    lastName  = first(nameLast),
    total_W   = sum(W),
    total_SO  = sum(SO),
  ) %>%
  filter(
    total_W>=300 & total_SO>=3000)

## # A tibble: 10 × 5
##    playerID  firstName lastName total_W total_SO
##    <chr>     <chr>     <chr>      <int>    <int>
##  1 carltst01 Steve     Carlton      329     4136
##  2 clemero02 Roger     Clemens      354     4672
##  3 johnsra05 Randy     Johnson      303     4875
##  4 johnswa01 Walter    Johnson      417     3509
##  5 maddugr01 Greg      Maddux       355     3371
##  6 niekrph01 Phil      Niekro       318     3342
##  7 perryga01 Gaylord   Perry        314     3534
##  8 ryanno01  Nolan     Ryan         324     5714
##  9 seaveto01 Tom       Seaver       311     3640
## 10 suttodo01 Don       Sutton       324     3574

batting_joined %>%
  filter(HR>=50) %>%
  mutate(
    BA = H/AB
  ) %>%
  group_by(yearID) %>%
  arrange(desc(BA)) %>%
  top_n(n=-1) %>%
  select(
    nameFirst,
    nameLast,
    yearID,
    BA,
    HR
  ) %>%
  arrange(BA)

## # A tibble: 30 × 5
## # Groups:   yearID [30]
##    nameFirst nameLast yearID    BA    HR
##    <chr>     <chr>     <int> <dbl> <int>
##  1 Pete      Alonso     2019 0.260    53
##  2 Jose      Bautista   2010 0.260    54
##  3 Andruw    Jones      2005 0.263    51
##  4 Roger     Maris      1961 0.269    61
##  5 Greg      Vaughn     1998 0.272    50
##  6 Cecil     Fielder    1990 0.277    51
##  7 Mark      McGwire    1999 0.278    65
##  8 Giancarlo Stanton    2017 0.281    59
##  9 Chris     Davis      2013 0.286    53
## 10 David     Ortiz      2006 0.287    54
## # … with 20 more rows

Problem Set #1 - `dplyr` basics

Data Wrangling with `dplyr`

Anshuman Narayan

2023-01-24

Directions

Problem Set #1 - dplyr basics

Data Wrangling with dplyr

Anshuman Narayan

2023-01-24

Directions

Problem Set #1 - `dplyr` basics

Data Wrangling with `dplyr`