Problem Set #1 - dplyr basics

Directions

In ANLY 512 we will study the theory and practice of data visualization, using R and its packages to assemble data and construct many different types of visualizations. Before we begin, we need to develop some data wrangling skills so that we can get our data into a form suitable for visualization.

The objective of this assignment is to introduce you to RStudio, R Markdown, the tidyverse and, more specifically, the dplyr package.

Each question is worth 5 points.

To submit this homework, create the document in RStudio using the knitr package (the Knit button is included in RStudio) and publish it to your RPubs account. Once it is uploaded, submit the link to that document on Canvas. Please make sure that the link is hyperlinked and that I can see the visualization and the code required to create it.

Question #1

Use the nycflights13 package and the flights data frame to answer the following questions:

a. What month had the highest proportion of cancelled flights?
b. What month had the lowest?

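library(dplyr)          # %>%, group_by(), summarise(), arrange()
library(nycflights13)

# A flight with a missing departure time is treated as cancelled.
FlightsCancelled <- 
  flights %>%
  group_by(month) %>%
  summarise(cancelled = sum(is.na(dep_time)),
            Proportioncancelled = (cancelled/n())*100,  # percent of the month's flights
            N = n()) %>%
  arrange(Proportioncancelled)  # ascending: lowest rate first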

FlightsCancelled

## # A tibble: 12 × 4
##    month cancelled Proportioncancelled     N
##    <int>     <int>               <dbl> <int>
##  1    10       236               0.817 28889
##  2    11       233               0.854 27268
##  3     9       452               1.64  27574
##  4     8       486               1.66  29327
##  5     1       521               1.93  27004
##  6     5       563               1.96  28796
##  7     4       668               2.36  28330
##  8     3       861               2.99  28834
##  9     7       940               3.19  29425
## 10     6      1009               3.57  28243
## 11    12      1025               3.64  28135
## 12     2      1261               5.05  24951

Month 2 (February) had the highest proportion of cancelled flights (5.05%) and Month 10 (October) had the lowest (0.82%).
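The two extremes can also be pulled out programmatically instead of being read off the sorted table. The following is a minimal sketch on a small hypothetical data frame (`toy_flights` and its values are made up for illustration); it assumes the same convention as above, that a missing `dep_time` marks a cancelled flight.

```r
library(dplyr)

# Hypothetical toy data: an NA dep_time marks a cancelled flight,
# mirroring the convention used for the flights data frame.
toy_flights <- data.frame(
  month    = c(1, 1, 1, 1, 2, 2, 2, 2),
  dep_time = c(517, NA, 533, 542, NA, NA, 554, 555)
)

cancel_rate <- toy_flights %>%
  group_by(month) %>%
  summarise(prop_cancelled = mean(is.na(dep_time)))  # share of NA departures

# Select the rows with the largest and smallest cancellation share.
highest <- slice_max(cancel_rate, prop_cancelled, n = 1)
lowest  <- slice_min(cancel_rate, prop_cancelled, n = 1)
```

Taking `mean()` of a logical vector gives the proportion of `TRUE` values, so no separate count and division is needed.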

Question #2

Consider the following pipeline:

library(tidyverse)
mtcars %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg)) %>%
  filter(am == 1)

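What is the problem with this pipeline?

Running the chunk fails with an error that the object ‘am’ is not found. After summarize(), the result contains only cyl and avg_mpg, so the am column no longer exists when filter() runs. The filter on am has to come before the summary.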
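A reordered version of the pipeline might look like the sketch below. It assumes the intent was the average mpg per cylinder count among cars with `am == 1` (manual transmission in `mtcars`); the name `manual_mpg` is introduced here only for illustration.

```r
library(dplyr)

# Filter on am first, while the column still exists, then aggregate.
manual_mpg <- mtcars %>%
  filter(am == 1) %>%               # keep manual-transmission cars only
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg))    # one row per cylinder count
```

If the goal were instead to keep `am` in the result, it could be added to `group_by()` so that it survives the summary.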

Question #3

Define two new variables in the Teams data frame in the Lahman package.

Batting average (BA). Batting average is the ratio of hits (H) to at-bats (AB).
Slugging percentage (SLG). Slugging percentage is total bases divided by at-bats (AB). To compute total bases, you get 1 for a single, 2 for a double, 3 for a triple, and 4 for a home run.

library(Lahman)
Teams <- 
  Teams %>%
  mutate( BA = H/AB, SLG = (H+2*X2B+3*X3B+4*HR)/AB)

head(select(Teams, BA, SLG))

##          BA       SLG
## 1 0.3104956 0.5021866
## 2 0.2700669 0.4431438
## 3 0.2765599 0.4603710
## 4 0.2386059 0.3324397
## 5 0.2870370 0.3960114
## 6 0.3200625 0.5144418
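In the Lahman tables, H counts every hit, including doubles, triples, and home runs. Adding 2*X2B + 3*X3B + 4*HR on top of H therefore counts each extra-base hit twice, which is why the SLG values printed above are higher than the conventional statistic. A minimal sketch of a total-bases calculation that avoids the double counting, shown on one hypothetical row (`toy_team` and its numbers are made up; `X1B` and `TB` are helper columns introduced here):

```r
library(dplyr)

# Hypothetical single-team season, using Lahman-style column names.
toy_team <- data.frame(AB = 400, H = 100, X2B = 20, X3B = 5, HR = 10)

toy_team <- toy_team %>%
  mutate(
    X1B = H - X2B - X3B - HR,                # singles: hits that are not extra-base hits
    TB  = X1B + 2 * X2B + 3 * X3B + 4 * HR,  # total bases
    SLG = TB / AB
  )
```

The same total can be written without the helper column as H + X2B + 2*X3B + 3*HR.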

Question #4

Using the Teams data frame in the Lahman package, display the top 5 teams ranked by slugging percentage (SLG) in Major League Baseball history. Repeat this using teams since 1969. Slugging percentage is total bases divided by at-bats. To compute total bases, you get 1 for a single, 2 for a double, 3 for a triple, and 4 for a home run.

library(Lahman)
T <- Teams

TopSlug <-
  filter(T, yearID >= 1969) %>%  # yearID is an integer, so compare against a number
  group_by(yearID) %>%
  arrange(desc(SLG)) %>%
  select(yearID, name, SLG)
head(TopSlug,5)

## # A tibble: 5 × 3
## # Groups:   yearID [3]
##   yearID name               SLG
##    <int> <chr>            <dbl>
## 1   2019 Houston Astros   0.609
## 2   2019 Minnesota Twins  0.607
## 3   2003 Boston Red Sox   0.603
## 4   2019 New York Yankees 0.600
## 5   2020 Atlanta Braves   0.596

Top5SluggingPerYear <- 
  filter(T, yearID >= 1969) %>%  # yearID is an integer, so compare against a number
  group_by(yearID) %>%
  arrange(desc(SLG)) %>%
  slice(1:5) %>%
  select(yearID, name, SLG)

Top5SluggingPerYear

## # A tibble: 265 × 3
## # Groups:   yearID [53]
##    yearID name                   SLG
##     <int> <chr>                <dbl>
##  1   1969 Boston Red Sox       0.500
##  2   1969 Cincinnati Reds      0.500
##  3   1969 Baltimore Orioles    0.493
##  4   1969 Minnesota Twins      0.486
##  5   1969 Pittsburgh Pirates   0.467
##  6   1970 Cincinnati Reds      0.524
##  7   1970 Boston Red Sox       0.515
##  8   1970 Chicago Cubs         0.497
##  9   1970 San Francisco Giants 0.491
## 10   1970 Pittsburgh Pirates   0.483
## # … with 255 more rows
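The first table above covers seasons from 1969 onward, and the second ranks teams within each year. For the all-time part of the question, the same ranking can be taken over the whole table without the year filter. A minimal sketch using `slice_max()` on a hypothetical data frame (`toy_teams`, its team names, and its SLG values are made up; `n = 2` is used only to keep the toy example small):

```r
library(dplyr)

# Hypothetical team seasons with a precomputed SLG column.
toy_teams <- data.frame(
  yearID = c(1930, 1950, 1969, 1994, 2003, 2019),
  name   = c("Team A", "Team B", "Team C", "Team D", "Team E", "Team F"),
  SLG    = c(0.51, 0.41, 0.39, 0.47, 0.49, 0.50)
)

# Top teams over the whole table (n = 2 here; n = 5 for the question).
top_all_time <- toy_teams %>%
  slice_max(SLG, n = 2) %>%
  select(yearID, name, SLG)

# Same ranking restricted to seasons from 1969 onward.
top_since_1969 <- toy_teams %>%
  filter(yearID >= 1969) %>%
  slice_max(SLG, n = 2) %>%
  select(yearID, name, SLG)
```

Because the data are not grouped, `slice_max()` returns the top rows overall rather than the top rows per year.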

Question #5

Use the Batting, Pitching, and People tables in the Lahman package to answer the following questions.

a. Name every player in baseball history who has accumulated at least 300 home runs (HR) and at least 300 stolen bases (SB). You can find the first and last name of the player in the People data frame (called Master in older versions of Lahman). Join this to your result along with the total home runs and total bases stolen for each of these elite players.

b. Similarly, name every pitcher in baseball history who has accumulated at least 300 wins (W) and at least 3,000 strikeouts (SO).

c. Identify the name and year of every player who has hit at least 50 home runs in a single season. Which player had the lowest batting average in that season?

library(Lahman)
# Career totals per player, keeping those with 300+ home runs and 300+ steals
Batting %>%
  group_by(playerID) %>%
  summarize(totalHomeRuns = sum(HR), totalStolenBases = sum(SB)) %>%
  filter(totalHomeRuns >= 300 & totalStolenBases >= 300) %>%
  left_join(People, by = "playerID") %>%  # attach first and last names
  select(nameFirst, nameLast, totalHomeRuns, totalStolenBases)

## # A tibble: 8 × 4
##   nameFirst nameLast  totalHomeRuns totalStolenBases
##   <chr>     <chr>             <int>            <int>
## 1 Carlos    Beltran             435              312
## 2 Barry     Bonds               762              514
## 3 Bobby     Bonds               332              461
## 4 Andre     Dawson              438              314
## 5 Steve     Finley              304              320
## 6 Willie    Mays                660              338
## 7 Alex      Rodriguez           696              329
## 8 Reggie    Sanders             305              304

# Career totals per pitcher, keeping those with 300+ wins and 3,000+ strikeouts
Pitching %>%
  group_by(playerID) %>%
  summarize(TotalWins = sum(W), TotalStrikeouts = sum(SO)) %>%
  filter(TotalWins >= 300 & TotalStrikeouts >= 3000) %>%
  left_join(People, by = "playerID") %>%  # attach first and last names
  select(nameFirst, nameLast, TotalWins, TotalStrikeouts)

## # A tibble: 10 × 4
##    nameFirst nameLast TotalWins TotalStrikeouts
##    <chr>     <chr>        <int>           <int>
##  1 Steve     Carlton        329            4136
##  2 Roger     Clemens        354            4672
##  3 Randy     Johnson        303            4875
##  4 Walter    Johnson        417            3509
##  5 Greg      Maddux         355            3371
##  6 Phil      Niekro         318            3342
##  7 Gaylord   Perry          314            3534
##  8 Nolan     Ryan           324            5714
##  9 Tom       Seaver         311            3640
## 10 Don       Sutton         324            3574

# Season totals per player (stints within the same year are combined)
Batting %>%
  group_by(playerID, yearID) %>%
  summarize(TotalHomeRuns = sum(HR), BattingAverage = sum(H)/sum(AB)) %>%
  filter(TotalHomeRuns >= 50) %>%
  left_join(People, by = "playerID") %>%  # attach first and last names
  select(nameFirst, nameLast, TotalHomeRuns, BattingAverage)

## # A tibble: 46 × 5
## # Groups:   playerID [30]
##    playerID  nameFirst nameLast TotalHomeRuns BattingAverage
##    <chr>     <chr>     <chr>            <int>          <dbl>
##  1 alonspe01 Pete      Alonso              53          0.260
##  2 anderbr01 Brady     Anderson            50          0.297
##  3 bautijo02 Jose      Bautista            54          0.260
##  4 belleal01 Albert    Belle               50          0.317
##  5 bondsba01 Barry     Bonds               73          0.328
##  6 davisch02 Chris     Davis               53          0.286
##  7 fieldce01 Cecil     Fielder             51          0.277
##  8 fieldpr01 Prince    Fielder             50          0.288
##  9 fostege01 George    Foster              52          0.320
## 10 foxxji01  Jimmie    Foxx                58          0.364
## # … with 36 more rows
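The table above does not show the season, because yearID is not listed in select(), and it does not single out the lowest batting average. A minimal sketch of one way to keep the year and pick the minimum, shown on a hypothetical data frame (`toy_batting`, its player IDs, and its numbers are made up; in the real pipeline the join to People would follow the filter as before):

```r
library(dplyr)

# Hypothetical player seasons; the first player has two stints in 2001.
toy_batting <- data.frame(
  playerID = c("aaa01", "aaa01", "bbb01", "ccc01"),
  yearID   = c(2001, 2001, 2001, 2005),
  AB       = c(300, 250, 560, 600),
  H        = c(90, 75, 150, 180),
  HR       = c(30, 22, 51, 40)
)

fifty_hr <- toy_batting %>%
  group_by(playerID, yearID) %>%
  summarize(HR = sum(HR), BA = sum(H) / sum(AB), .groups = "drop") %>%
  filter(HR >= 50)                 # yearID stays as a column

# Player season with the lowest batting average among 50-homer seasons.
lowest_ba <- slice_min(fifty_hr, BA, n = 1)
```

Setting `.groups = "drop"` returns an ungrouped result, so `slice_min()` looks across all qualifying seasons rather than within each player.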

Data Wrangling with `dplyr`

Ranjani Sudarsan

2023-01-24
