Problem Set #1 - dplyr basics

Directions

During ANLY 512 we will be studying the theory and practice of data visualization. We will be using R and the packages within R to assemble data and construct many different types of visualizations. Before we begin studying data visualizations we need to develop some data wrangling skills. We will use these skills to wrangle our data into a form that we can use for visualizations.

The objective of this assignment is to introduce you to R Studio, Rmarkdown, the tidyverse and more specifically the dplyr package.

Each question is worth 5 points.

To submit this homework you will create the document in Rstudio, using the knitr package (button included in Rstudio) and then submit the document to your Rpubs account. Once uploaded you will submit the link to that document on Canvas. Please make sure that this link is hyper linked and that I can see the visualization and the code required to create it.

Question #1

Use the nycflights13 package and the flights data frame to answer the following questions: a.What month had the highest proportion of cancelled flights? b.What month had the lowest?

#install.packages("nycflights13")
library(nycflights13)
library(dplyr)


cancelledFlights <- flights %>%
                    group_by(month) %>%
                    summarize(flights_cancelled = sum(is.na(dep_time)),
                              flights_cancelled_proportion = flights_cancelled/n()*100) %>%
                    arrange(flights_cancelled_proportion)
cancelledFlights

## # A tibble: 12 × 3
##    month flights_cancelled flights_cancelled_proportion
##    <int>             <int>                        <dbl>
##  1    10               236                        0.817
##  2    11               233                        0.854
##  3     9               452                        1.64 
##  4     8               486                        1.66 
##  5     1               521                        1.93 
##  6     5               563                        1.96 
##  7     4               668                        2.36 
##  8     3               861                        2.99 
##  9     7               940                        3.19 
## 10     6              1009                        3.57 
## 11    12              1025                        3.64 
## 12     2              1261                        5.05

#A) February with a proportion of 5.05 cancelled flights 
#B) October with a proportion of 0.82 cancelled flights

Question #2

Consider the following pipeline:

library(tidyverse)
mtcars %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg)) %>%
  filter(am == 1)

#Corrected pipeline
mtcars %>%
  filter(am == 1) %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg))

What is the problem with this pipeline?

The issue with this pipeline is the filter of am, in order for this to work we would have to filter am before grouping by cyl and getting the average of mpg.

Question #3

Define two new variables in the Teams data frame in the pkg Lahman() package.

batting average (BA). Batting average is the ratio of hits (H) to at-bats (AB)
slugging percentage (SLG). Slugging percentage is total bases divided by at-bats (AB). To compute total bases, you get 1 for a single, 2 for a double, 3 for a triple, and 4 for a home run.

#install.packages("Lahman")
library(Lahman)

#A)
Teams <- Teams %>%
          mutate(BA = H/AB)

#B)
Teams$totalbases <- Teams$H+2*Teams$X2B+3*Teams$X3B+4*Teams$HR
Teams <- Teams %>%
          mutate(SLG = totalbases/AB)

str(Teams)

## 'data.frame':    2985 obs. of  51 variables:
##  $ yearID        : int  1871 1871 1871 1871 1871 1871 1871 1871 1871 1872 ...
##  $ lgID          : Factor w/ 7 levels "AA","AL","FL",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ teamID        : Factor w/ 149 levels "ALT","ANA","ARI",..: 24 31 39 56 90 97 111 136 142 8 ...
##  $ franchID      : Factor w/ 120 levels "ALT","ANA","ARI",..: 13 36 25 56 70 85 91 109 77 9 ...
##  $ divID         : chr  NA NA NA NA ...
##  $ Rank          : int  3 2 8 7 5 1 9 6 4 2 ...
##  $ G             : int  31 28 29 19 33 28 25 29 32 58 ...
##  $ Ghome         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ W             : int  20 19 10 7 16 21 4 13 15 35 ...
##  $ L             : int  10 9 19 12 17 7 21 15 15 19 ...
##  $ DivWin        : chr  NA NA NA NA ...
##  $ WCWin         : chr  NA NA NA NA ...
##  $ LgWin         : chr  "N" "N" "N" "N" ...
##  $ WSWin         : chr  NA NA NA NA ...
##  $ R             : int  401 302 249 137 302 376 231 351 310 617 ...
##  $ AB            : int  1372 1196 1186 746 1404 1281 1036 1248 1353 2571 ...
##  $ H             : int  426 323 328 178 403 410 274 384 375 753 ...
##  $ X2B           : int  70 52 35 19 43 66 44 51 54 106 ...
##  $ X3B           : int  37 21 40 8 21 27 25 34 26 31 ...
##  $ HR            : int  3 10 7 2 1 9 3 6 6 14 ...
##  $ BB            : int  60 60 26 33 33 46 38 49 48 29 ...
##  $ SO            : int  19 22 25 9 15 23 30 19 13 28 ...
##  $ SB            : int  73 69 18 16 46 56 53 62 48 53 ...
##  $ CS            : int  16 21 8 4 15 12 10 24 13 18 ...
##  $ HBP           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ SF            : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ RA            : int  303 241 341 243 313 266 287 362 303 434 ...
##  $ ER            : int  109 77 116 97 121 137 108 153 137 166 ...
##  $ ERA           : num  3.55 2.76 4.11 5.17 3.72 4.95 4.3 5.51 4.37 2.9 ...
##  $ CG            : int  22 25 23 19 32 27 23 28 32 48 ...
##  $ SHO           : int  1 0 0 1 1 0 1 0 0 1 ...
##  $ SV            : int  3 1 0 0 0 0 0 0 0 1 ...
##  $ IPouts        : int  828 753 762 507 879 747 678 750 846 1548 ...
##  $ HA            : int  367 308 346 261 373 329 315 431 371 573 ...
##  $ HRA           : int  2 6 13 5 7 3 3 4 4 3 ...
##  $ BBA           : int  42 28 53 21 42 53 34 75 45 63 ...
##  $ SOA           : int  23 22 34 17 22 16 16 12 13 77 ...
##  $ E             : int  243 229 234 163 235 194 220 198 218 432 ...
##  $ DP            : int  24 16 15 8 14 13 14 22 20 22 ...
##  $ FP            : num  0.834 0.829 0.818 0.803 0.84 0.845 0.821 0.845 0.85 0.83 ...
##  $ name          : chr  "Boston Red Stockings" "Chicago White Stockings" "Cleveland Forest Citys" "Fort Wayne Kekiongas" ...
##  $ park          : chr  "South End Grounds I" "Union Base-Ball Grounds" "National Association Grounds" "Hamilton Field" ...
##  $ attendance    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ BPF           : int  103 104 96 101 90 102 97 101 94 106 ...
##  $ PPF           : int  98 102 100 107 88 98 99 100 98 102 ...
##  $ teamIDBR      : chr  "BOS" "CHI" "CLE" "KEK" ...
##  $ teamIDlahman45: chr  "BS1" "CH1" "CL1" "FW1" ...
##  $ teamIDretro   : chr  "BS1" "CH1" "CL1" "FW1" ...
##  $ BA            : num  0.31 0.27 0.277 0.239 0.287 ...
##  $ totalbases    : num  689 530 546 248 556 ...
##  $ SLG           : num  0.502 0.443 0.46 0.332 0.396 ...

Question #4

Using the Teams data frame in the pkg Lahman() package. display the top-5 teams ranked in terms of slugging percentage (SLG) in Major League Baseball history. Repeat this using teams since 1969. Slugging percentage is total bases divided by at-bats.To compute total bases, you get 1 for a single, 2 for a double, 3 for a triple, and 4 for a home run.

library(Lahman)

Teams %>%
  select(teamID, SLG, yearID) %>%
  arrange(desc(SLG)) %>%
  head(5)

##   teamID       SLG yearID
## 1    HOU 0.6092998   2019
## 2    MIN 0.6071179   2019
## 3    BOS 0.6033975   2003
## 4    NYA 0.5996776   2019
## 5    ATL 0.5964320   2020

Teams %>%
  select(teamID, SLG, yearID) %>%
  filter(yearID >= 1969) %>%
  arrange(desc(SLG)) %>%
  head(5)

##   teamID       SLG yearID
## 1    HOU 0.6092998   2019
## 2    MIN 0.6071179   2019
## 3    BOS 0.6033975   2003
## 4    NYA 0.5996776   2019
## 5    ATL 0.5964320   2020

Question #5

Use the Batting, Pitching, and People tables in the pkg Lahman() package to answer the following questions.

a.Name every player in baseball history who has accumulated at least 300 home runs (HR) and at least 300 stolen bases (SB). You can find the first and last name of the player in the Master data frame. Join this to your result along with the total home runs and total bases stolen for each of these elite players.

Similarly, name every pitcher in baseball history who has accumulated at least 300 wins (W) and at least 3,000 strikeouts (SO).
Identify the name and year of every player who has hit at least 50 home runs in a single season. Which player had the lowest batting average in that season?

library(Lahman)

#A)
Batting %>% 
  group_by(playerID) %>%
  summarize(homeRuns = sum(HR), stolenBases = sum(SB)) %>%
  filter(homeRuns >= 300 & stolenBases >= 300) %>%
  inner_join(People, by = c("playerID" = "playerID")) %>%
  select(nameFirst, nameLast, homeRuns, stolenBases)

## # A tibble: 8 × 4
##   nameFirst nameLast  homeRuns stolenBases
##   <chr>     <chr>        <int>       <int>
## 1 Carlos    Beltran        435         312
## 2 Barry     Bonds          762         514
## 3 Bobby     Bonds          332         461
## 4 Andre     Dawson         438         314
## 5 Steve     Finley         304         320
## 6 Willie    Mays           660         338
## 7 Alex      Rodriguez      696         329
## 8 Reggie    Sanders        305         304

#B)
Pitching %>% 
  group_by(playerID) %>%
  summarize(wins = sum(W), strikeouts = sum(SO)) %>%
  filter(wins >= 300 & strikeouts >= 3000) %>%
  inner_join(People, by = c("playerID" = "playerID")) %>%
  select(nameFirst, nameLast, wins, strikeouts)

## # A tibble: 10 × 4
##    nameFirst nameLast  wins strikeouts
##    <chr>     <chr>    <int>      <int>
##  1 Steve     Carlton    329       4136
##  2 Roger     Clemens    354       4672
##  3 Randy     Johnson    303       4875
##  4 Walter    Johnson    417       3509
##  5 Greg      Maddux     355       3371
##  6 Phil      Niekro     318       3342
##  7 Gaylord   Perry      314       3534
##  8 Nolan     Ryan       324       5714
##  9 Tom       Seaver     311       3640
## 10 Don       Sutton     324       3574

#C) Pete Alonso, PlayerID = alonspe01 has the lowest batting average of the season with 0.259315.
Batting %>% 
  group_by(playerID, yearID) %>%
  summarize(homeRuns = sum(HR), battingAvrg = sum(H)/sum(AB)) %>%
  filter(homeRuns >= 50) %>%
  inner_join(People, by = c("playerID" = "playerID")) %>%
  select(nameFirst, nameLast, yearID, homeRuns, battingAvrg) %>%
  arrange(battingAvrg)

## # A tibble: 46 × 6
## # Groups:   playerID [30]
##    playerID  nameFirst nameLast yearID homeRuns battingAvrg
##    <chr>     <chr>     <chr>     <int>    <int>       <dbl>
##  1 alonspe01 Pete      Alonso     2019       53       0.260
##  2 bautijo02 Jose      Bautista   2010       54       0.260
##  3 jonesan01 Andruw    Jones      2005       51       0.263
##  4 marisro01 Roger     Maris      1961       61       0.269
##  5 vaughgr01 Greg      Vaughn     1998       50       0.272
##  6 mcgwima01 Mark      McGwire    1997       58       0.274
##  7 fieldce01 Cecil     Fielder    1990       51       0.277
##  8 mcgwima01 Mark      McGwire    1999       65       0.278
##  9 stantmi03 Giancarlo Stanton    2017       59       0.281
## 10 judgeaa01 Aaron     Judge      2017       52       0.284
## # … with 36 more rows

Problem Set #1 - `dplyr` basics

Data Wrangling with `dplyr`

Belosan W Jekale

2023-01-12

Directions

The issue with this pipeline is the filter of am, in order for this to work we would have to filter am before grouping by cyl and getting the average of mpg.

Problem Set #1 - dplyr basics

Data Wrangling with dplyr

Belosan W Jekale

2023-01-12

Directions

The issue with this pipeline is the filter of am, in order for this to work we would have to filter am before grouping by cyl and getting the average of mpg.

Problem Set #1 - `dplyr` basics

Data Wrangling with `dplyr`