dplyr
basicsdplyrDuring ANLY 512 we will be studying the theory and practice of
data visualization. We will be using R and the
packages within R to assemble data and construct many
different types of visualizations. Before we begin studying data
visualizations we need to develop some data wrangling skills. We will
use these skills to wrangle our data into a form that we can use for
visualizations.
The objective of this assignment is to introduce you to R Studio,
Rmarkdown, the tidyverse and more specifically the dplyr
package.
Each question is worth 5 points.
To submit this homework you will create the document in Rstudio, using the knitr package (button included in Rstudio) and then submit the document to your Rpubs account. Once uploaded you will submit the link to that document on Canvas. Please make sure that this link is hyper linked and that I can see the visualization and the code required to create it.
Question #1
Use the nycflights13 package and the flights data frame to answer the following questions: a.What month had the highest proportion of cancelled flights? b.What month had the lowest?
library(nycflights13)
library(tidyverse)
flights = nycflights13::flights
flights <- flights %>%
mutate(
cancelled =ifelse(is.na(dep_time),1,0)
)
flight_cancellation_by_month_sorted <- flights %>%
select(year,month,day,cancelled) %>%
group_by(month,year) %>%
summarise(
month = first(month),
year = first(year),
count = sum(cancelled),
total = n(),
pct = sum(cancelled)/n()
) %>%
arrange(pct)
flight_cancellation_by_month_sorted[12, , drop=TRUE]
## # A tibble: 1 × 5
## month year count total pct
## <int> <int> <dbl> <int> <dbl>
## 1 2 2013 1261 24951 0.0505
flight_cancellation_by_month_sorted[1, , drop=TRUE]
## # A tibble: 1 × 5
## month year count total pct
## <int> <int> <dbl> <int> <dbl>
## 1 10 2013 236 28889 0.00817
February has the highest cancellations at 5.05% October had the lowest cancellations at 0.81%
Question #2
Consider the following pipeline:
#mtcars %>%
# group_by(cyl) %>%
# summarize(avg_mpg = mean(mpg)) %>%
# filter(am == 1)
What is the problem with this pipeline?
The problem with the pipeline is the order of our operations. We
first perform a group by operation on one column and use the summarize
option next to only calculate the average mpg for the group. The
resulting tibble only has the grouping column (cyl) and the summarized
column, therefore the filter step found right after will fail since that
column does not exist. This can be fixed by including the filter above
the group_by operation so that only rows with am==1 are
passed into the group_by function. See the code snippet below to
generate the example in this context.
library(tidyverse)
#what the earlier operation was doing
res1<-mtcars %>%
group_by(cyl) %>%
summarize(avg_mpg = mean(mpg))
print(res1,n=5)
res2<-mtcars %>%
filter(am == 1) %>%
group_by(cyl) %>%
summarize(avg_mpg = mean(mpg))
print(res2,n=5)
Question #3
Define two new variables in the Teams data frame in the
pkg Lahman() package.
batting average (BA). Batting average is the ratio of hits (H) to at-bats (AB)
slugging percentage (SLG). Slugging percentage is total bases divided by at-bats (AB). To compute total bases, you get 1 for a single, 2 for a double, 3 for a triple, and 4 for a home run.
library(Lahman)
teams <- Teams
teams <- teams %>%
mutate(
BA = H/AB,
X1B= H-X2B-X3B-HR,
SLG= (((X1B)*1) + (X2B*2) + (X3B*3) + (HR*4))/AB #since singles are not recorded and hits include singles
)
Question #4
Using the Teams data frame in the
pkg Lahman() package. display the top-5 teams ranked in
terms of slugging percentage (SLG) in Major League Baseball history.
Repeat this using teams since 1969. Slugging percentage is total bases
divided by at-bats.To compute total bases, you get 1 for a single, 2 for
a double, 3 for a triple, and 4 for a home run.
library(Lahman)
#lets use the teams tibble we made earlier which has the SLG calculated.
teams %>%
arrange(SLG) %>%
top_n(n=5,wt=SLG) #top 5 teams ranked by SLG
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
## 1 1927 AL NYA NYY <NA> 1 155 77 110 44 <NA> <NA> Y
## 2 2019 AL NYA NYY E 1 162 81 103 59 Y N N
## 3 2003 AL BOS BOS E 2 162 81 95 67 N Y N
## 4 2019 AL MIN MIN C 1 162 81 101 61 Y N N
## 5 2019 AL HOU HOU W 1 162 81 107 55 Y N Y
## WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV
## 1 Y 975 5347 1644 291 103 158 635 605 90 64 NA NA 599 494 3.20 82 11 20
## 2 N 943 5583 1493 290 17 306 569 1437 55 22 49 33 739 691 4.31 1 9 50
## 3 N 961 5769 1667 371 40 238 620 943 88 35 53 64 809 729 4.48 5 6 36
## 4 N 939 5732 1547 318 23 307 525 1334 28 21 81 41 754 680 4.18 1 10 50
## 5 N 920 5613 1538 323 28 288 645 1166 67 27 66 57 640 595 3.66 2 14 47
## IPouts HA HRA BBA SOA E DP FP name park
## 1 4167 1403 42 409 431 196 123 0.969 New York Yankees Yankee Stadium I
## 2 4329 1374 248 507 1534 102 135 0.982 New York Yankees Yankee Stadium III
## 3 4394 1503 153 488 1141 113 130 0.982 Boston Red Sox Fenway Park II
## 4 4390 1456 198 452 1463 111 130 0.981 Minnesota Twins Target Field
## 5 4387 1205 230 448 1671 71 96 0.988 Houston Astros Minute Maid Park
## attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro BA X1B
## 1 1164015 98 94 NYY NYA NYA 0.3074621 1092
## 2 3304404 98 96 NYY NYA NYA 0.2674190 880
## 3 2724165 105 103 BOS BOS BOS 0.2889582 1018
## 4 2294152 100 99 MIN MIN MIN 0.2698883 899
## 5 2857367 103 100 HOU HOU HOU 0.2740068 899
## SLG
## 1 0.4890593
## 2 0.4898800
## 3 0.4908996
## 4 0.4940684
## 5 0.4954570
teams %>%
filter(yearID>=1969) %>%
arrange(SLG) %>%
top_n(n=5,wt=SLG)
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
## 1 1997 AL SEA SEA W 1 162 81 90 72 Y N N
## 2 2019 AL NYA NYY E 1 162 81 103 59 Y N N
## 3 2003 AL BOS BOS E 2 162 81 95 67 N Y N
## 4 2019 AL MIN MIN C 1 162 81 101 61 Y N N
## 5 2019 AL HOU HOU W 1 162 81 107 55 Y N Y
## WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV
## 1 N 925 5614 1574 312 21 264 626 1110 89 40 49 49 833 769 4.78 9 8 38
## 2 N 943 5583 1493 290 17 306 569 1437 55 22 49 33 739 691 4.31 1 9 50
## 3 N 961 5769 1667 371 40 238 620 943 88 35 53 64 809 729 4.48 5 6 36
## 4 N 939 5732 1547 318 23 307 525 1334 28 21 81 41 754 680 4.18 1 10 50
## 5 N 920 5613 1538 323 28 288 645 1166 67 27 66 57 640 595 3.66 2 14 47
## IPouts HA HRA BBA SOA E DP FP name park
## 1 4343 1500 192 598 1207 126 143 0.979 Seattle Mariners Kingdome
## 2 4329 1374 248 507 1534 102 135 0.982 New York Yankees Yankee Stadium III
## 3 4394 1503 153 488 1141 113 130 0.982 Boston Red Sox Fenway Park II
## 4 4390 1456 198 452 1463 111 130 0.981 Minnesota Twins Target Field
## 5 4387 1205 230 448 1671 71 96 0.988 Houston Astros Minute Maid Park
## attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro BA X1B
## 1 3192237 98 98 SEA SEA SEA 0.2803705 977
## 2 3304404 98 96 NYY NYA NYA 0.2674190 880
## 3 2724165 105 103 BOS BOS BOS 0.2889582 1018
## 4 2294152 100 99 MIN MIN MIN 0.2698883 899
## 5 2857367 103 100 HOU HOU HOU 0.2740068 899
## SLG
## 1 0.4845030
## 2 0.4898800
## 3 0.4908996
## 4 0.4940684
## 5 0.4954570
Question #5
Use the Batting, Pitching, and
People tables in the pkg Lahman() package to
answer the following questions.
a.Name every player in baseball history who has accumulated at least 300 home runs (HR) and at least 300 stolen bases (SB). You can find the first and last name of the player in the Master data frame. Join this to your result along with the total home runs and total bases stolen for each of these elite players.
Similarly, name every pitcher in baseball history who has accumulated at least 300 wins (W) and at least 3,000 strikeouts (SO).
Identify the name and year of every player who has hit at least 50 home runs in a single season. Which player had the lowest batting average in that season?
library(Lahman)
batting <- Batting
pitching<- Pitching
people <- People
batting_joined <-
batting %>%
left_join(people,by=c("playerID" = "playerID"))
batting_joined %>%
group_by(playerID)%>%
summarise(
firstName= first(nameFirst),
lastName = first(nameLast),
total_HR = sum(HR),
total_SB = sum(SB),
) %>%
filter(
total_HR>=300 & total_SB >=300
)
## # A tibble: 8 × 5
## playerID firstName lastName total_HR total_SB
## <chr> <chr> <chr> <int> <int>
## 1 beltrca01 Carlos Beltran 435 312
## 2 bondsba01 Barry Bonds 762 514
## 3 bondsbo01 Bobby Bonds 332 461
## 4 dawsoan01 Andre Dawson 438 314
## 5 finlest01 Steve Finley 304 320
## 6 mayswi01 Willie Mays 660 338
## 7 rodrial01 Alex Rodriguez 696 329
## 8 sandere02 Reggie Sanders 305 304
pitching_joined <-
pitching %>%
left_join(people,by=c("playerID" = "playerID"))
pitching_joined %>%
group_by(playerID) %>%
summarise(
firstName = first(nameFirst),
lastName = first(nameLast),
total_W = sum(W),
total_SO = sum(SO),
) %>%
filter(
total_W>=300 & total_SO>=3000)
## # A tibble: 10 × 5
## playerID firstName lastName total_W total_SO
## <chr> <chr> <chr> <int> <int>
## 1 carltst01 Steve Carlton 329 4136
## 2 clemero02 Roger Clemens 354 4672
## 3 johnsra05 Randy Johnson 303 4875
## 4 johnswa01 Walter Johnson 417 3509
## 5 maddugr01 Greg Maddux 355 3371
## 6 niekrph01 Phil Niekro 318 3342
## 7 perryga01 Gaylord Perry 314 3534
## 8 ryanno01 Nolan Ryan 324 5714
## 9 seaveto01 Tom Seaver 311 3640
## 10 suttodo01 Don Sutton 324 3574
batting_joined %>%
filter(HR>=50) %>%
mutate(
BA = H/AB
) %>%
group_by(yearID) %>%
arrange(desc(BA)) %>%
top_n(n=-1) %>%
select(
nameFirst,
nameLast,
yearID,
BA,
HR
) %>%
arrange(BA)
## # A tibble: 30 × 5
## # Groups: yearID [30]
## nameFirst nameLast yearID BA HR
## <chr> <chr> <int> <dbl> <int>
## 1 Pete Alonso 2019 0.260 53
## 2 Jose Bautista 2010 0.260 54
## 3 Andruw Jones 2005 0.263 51
## 4 Roger Maris 1961 0.269 61
## 5 Greg Vaughn 1998 0.272 50
## 6 Cecil Fielder 1990 0.277 51
## 7 Mark McGwire 1999 0.278 65
## 8 Giancarlo Stanton 2017 0.281 59
## 9 Chris Davis 2013 0.286 53
## 10 David Ortiz 2006 0.287 54
## # … with 20 more rows