Directions

During ANLY 512 we will be studying the theory and practice of data visualization. We will be using R and the packages within R to assemble data and construct many different types of visualizations. Before we begin studying data visualizations we need to develop some data wrangling skills. We will use these skills to wrangle our data into a form that we can use for visualizations.

The objective of this assignment is to introduce you to R Studio, Rmarkdown, the tidyverse and more specifically the dplyr package.

Each question is worth 5 points.

To submit this homework you will create the document in Rstudio, using the knitr package (button included in Rstudio) and then submit the document to your Rpubs account. Once uploaded you will submit the link to that document on Canvas. Please make sure that this link is hyper linked and that I can see the visualization and the code required to create it.

Question #1

Use the nycflights13 package and the flights data frame to answer the following questions: a.What month had the highest proportion of cancelled flights? February, 2013 has highest flights cancelled with 1261 cancelled flight.

b.What month had the lowest?

October, 2013 has lowest cancelled flight with 236 number of cancelled flight.

library(nycflights13)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
month_cancelledFlights <- flights %>%
group_by(year,month) %>%
  
summarize(flights_cancelled = sum(is.na(dep_time)),
                              flights_cancelled_proportion = flights_cancelled/n()*100) %>%
                    arrange(flights_cancelled_proportion)
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
month_cancelledFlights
## # A tibble: 12 × 4
## # Groups:   year [1]
##     year month flights_cancelled flights_cancelled_proportion
##    <int> <int>             <int>                        <dbl>
##  1  2013    10               236                        0.817
##  2  2013    11               233                        0.854
##  3  2013     9               452                        1.64 
##  4  2013     8               486                        1.66 
##  5  2013     1               521                        1.93 
##  6  2013     5               563                        1.96 
##  7  2013     4               668                        2.36 
##  8  2013     3               861                        2.99 
##  9  2013     7               940                        3.19 
## 10  2013     6              1009                        3.57 
## 11  2013    12              1025                        3.64 
## 12  2013     2              1261                        5.05

Question #2

Consider the following pipeline:

library(tidyverse)

mtcars %>%
  filter(am == 1) %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg)) 

What is the problem with this pipeline?

The problem with above pipeline is that filter (am) is used before summarize funcation. If filter function is used after its sumarization then it becomes hard to filter.

Question #3

Define two new variables in the Teams data frame in the pkg Lahman() package.

  1. batting average (BA). Batting average is the ratio of hits (H) to at-bats (AB)

  2. slugging percentage (SLG). Slugging percentage is total bases divided by at-bats (AB). To compute total bases, you get 1 for a single, 2 for a double, 3 for a triple, and 4 for a home run.

library(Lahman)


Teams <- Teams %>%
  mutate(BA = H / AB,
         SLG = (H + 2 * X2B + 3 * X3B + 4 * HR) / AB)

head(Teams, 5)
##   yearID lgID teamID franchID divID Rank  G Ghome  W  L DivWin WCWin LgWin
## 1   1871   NA    BS1      BNA  <NA>    3 31    NA 20 10   <NA>  <NA>     N
## 2   1871   NA    CH1      CNA  <NA>    2 28    NA 19  9   <NA>  <NA>     N
## 3   1871   NA    CL1      CFC  <NA>    8 29    NA 10 19   <NA>  <NA>     N
## 4   1871   NA    FW1      KEK  <NA>    7 19    NA  7 12   <NA>  <NA>     N
## 5   1871   NA    NY2      NNA  <NA>    5 33    NA 16 17   <NA>  <NA>     N
##   WSWin   R   AB   H X2B X3B HR BB SO SB CS HBP SF  RA  ER  ERA CG SHO SV
## 1  <NA> 401 1372 426  70  37  3 60 19 73 16  NA NA 303 109 3.55 22   1  3
## 2  <NA> 302 1196 323  52  21 10 60 22 69 21  NA NA 241  77 2.76 25   0  1
## 3  <NA> 249 1186 328  35  40  7 26 25 18  8  NA NA 341 116 4.11 23   0  0
## 4  <NA> 137  746 178  19   8  2 33  9 16  4  NA NA 243  97 5.17 19   1  0
## 5  <NA> 302 1404 403  43  21  1 33 15 46 15  NA NA 313 121 3.72 32   1  0
##   IPouts  HA HRA BBA SOA   E DP    FP                    name
## 1    828 367   2  42  23 243 24 0.834    Boston Red Stockings
## 2    753 308   6  28  22 229 16 0.829 Chicago White Stockings
## 3    762 346  13  53  34 234 15 0.818  Cleveland Forest Citys
## 4    507 261   5  21  17 163  8 0.803    Fort Wayne Kekiongas
## 5    879 373   7  42  22 235 14 0.840        New York Mutuals
##                           park attendance BPF PPF teamIDBR teamIDlahman45
## 1          South End Grounds I         NA 103  98      BOS            BS1
## 2      Union Base-Ball Grounds         NA 104 102      CHI            CH1
## 3 National Association Grounds         NA  96 100      CLE            CL1
## 4               Hamilton Field         NA 101 107      KEK            FW1
## 5     Union Grounds (Brooklyn)         NA  90  88      NYU            NY2
##   teamIDretro        BA       SLG
## 1         BS1 0.3104956 0.5021866
## 2         CH1 0.2700669 0.4431438
## 3         CL1 0.2765599 0.4603710
## 4         FW1 0.2386059 0.3324397
## 5         NY2 0.2870370 0.3960114
summary(Teams)
##      yearID     lgID          teamID        franchID       divID          
##  Min.   :1871   AA:  85   CHN    : 146   ATL    : 146   Length:2985       
##  1st Qu.:1922   AL:1295   PHI    : 139   CHC    : 146   Class :character  
##  Median :1967   FL:  16   PIT    : 135   CIN    : 140   Mode  :character  
##  Mean   :1959   NA:  50   CIN    : 132   PIT    : 140                     
##  3rd Qu.:1997   NL:1519   SLN    : 130   STL    : 140                     
##  Max.   :2021   PL:   8   BOS    : 121   PHI    : 139                     
##                 UA:  12   (Other):2182   (Other):2134                     
##       Rank              G           Ghome             W         
##  Min.   : 1.000   Min.   :  6   Min.   :24.00   Min.   :  0.00  
##  1st Qu.: 2.000   1st Qu.:154   1st Qu.:77.00   1st Qu.: 66.00  
##  Median : 4.000   Median :159   Median :81.00   Median : 77.00  
##  Mean   : 4.039   Mean   :150   Mean   :78.05   Mean   : 74.61  
##  3rd Qu.: 6.000   3rd Qu.:162   3rd Qu.:81.00   3rd Qu.: 87.00  
##  Max.   :13.000   Max.   :165   Max.   :84.00   Max.   :116.00  
##                                 NA's   :399                     
##        L             DivWin             WCWin              LgWin          
##  Min.   :  4.00   Length:2985        Length:2985        Length:2985       
##  1st Qu.: 65.00   Class :character   Class :character   Class :character  
##  Median : 76.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 74.61                                                           
##  3rd Qu.: 87.00                                                           
##  Max.   :134.00                                                           
##                                                                           
##     WSWin                 R              AB             H       
##  Length:2985        Min.   :  24   Min.   : 211   Min.   :  33  
##  Class :character   1st Qu.: 614   1st Qu.:5135   1st Qu.:1299  
##  Mode  :character   Median : 691   Median :5402   Median :1390  
##                     Mean   : 681   Mean   :5129   Mean   :1339  
##                     3rd Qu.: 764   3rd Qu.:5519   3rd Qu.:1465  
##                     Max.   :1220   Max.   :5781   Max.   :1783  
##                                                                 
##       X2B             X3B               HR              BB       
##  Min.   :  1.0   Min.   :  0.00   Min.   :  0.0   Min.   :  1.0  
##  1st Qu.:194.0   1st Qu.: 29.00   1st Qu.: 45.0   1st Qu.:425.8  
##  Median :234.0   Median : 40.00   Median :110.0   Median :494.0  
##  Mean   :228.7   Mean   : 45.67   Mean   :105.9   Mean   :473.6  
##  3rd Qu.:272.0   3rd Qu.: 59.00   3rd Qu.:155.0   3rd Qu.:554.2  
##  Max.   :376.0   Max.   :150.00   Max.   :307.0   Max.   :835.0  
##                                                   NA's   :1      
##        SO               SB              CS              HBP        
##  Min.   :   3.0   Min.   :  1.0   Min.   :  3.00   Min.   :  7.00  
##  1st Qu.: 516.0   1st Qu.: 62.5   1st Qu.: 33.00   1st Qu.: 32.00  
##  Median : 761.0   Median : 93.0   Median : 44.00   Median : 43.00  
##  Mean   : 762.1   Mean   :109.4   Mean   : 46.55   Mean   : 45.82  
##  3rd Qu.: 990.0   3rd Qu.:137.0   3rd Qu.: 56.00   3rd Qu.: 57.00  
##  Max.   :1596.0   Max.   :581.0   Max.   :191.00   Max.   :160.00  
##  NA's   :16       NA's   :126     NA's   :832      NA's   :1158    
##        SF              RA             ER              ERA       
##  Min.   : 7.00   Min.   :  34   Min.   :  23.0   Min.   :1.220  
##  1st Qu.:38.00   1st Qu.: 610   1st Qu.: 503.0   1st Qu.:3.370  
##  Median :44.00   Median : 689   Median : 594.0   Median :3.840  
##  Mean   :44.11   Mean   : 681   Mean   : 573.4   Mean   :3.841  
##  3rd Qu.:50.00   3rd Qu.: 766   3rd Qu.: 671.0   3rd Qu.:4.330  
##  Max.   :77.00   Max.   :1252   Max.   :1023.0   Max.   :8.000  
##  NA's   :1541                                                   
##        CG              SHO               SV            IPouts    
##  Min.   :  0.00   Min.   : 0.000   Min.   : 0.00   Min.   : 162  
##  1st Qu.:  9.00   1st Qu.: 6.000   1st Qu.:10.00   1st Qu.:4080  
##  Median : 41.00   Median : 9.000   Median :25.00   Median :4252  
##  Mean   : 47.55   Mean   : 9.588   Mean   :24.42   Mean   :4013  
##  3rd Qu.: 76.00   3rd Qu.:12.000   3rd Qu.:39.00   3rd Qu.:4341  
##  Max.   :148.00   Max.   :32.000   Max.   :68.00   Max.   :4518  
##                                                                  
##        HA            HRA             BBA             SOA        
##  Min.   :  49   Min.   :  0.0   Min.   :  1.0   Min.   :   0.0  
##  1st Qu.:1287   1st Qu.: 51.0   1st Qu.:429.0   1st Qu.: 511.0  
##  Median :1389   Median :113.0   Median :495.0   Median : 762.0  
##  Mean   :1339   Mean   :105.9   Mean   :473.7   Mean   : 761.6  
##  3rd Qu.:1468   3rd Qu.:153.0   3rd Qu.:554.0   3rd Qu.: 997.0  
##  Max.   :1993   Max.   :305.0   Max.   :827.0   Max.   :1687.0  
##                                                                 
##        E               DP              FP             name          
##  Min.   : 20.0   Min.   :  0.0   Min.   :0.7610   Length:2985       
##  1st Qu.:111.0   1st Qu.:116.0   1st Qu.:0.9660   Class :character  
##  Median :141.0   Median :140.0   Median :0.9770   Mode  :character  
##  Mean   :180.8   Mean   :132.6   Mean   :0.9664                     
##  3rd Qu.:207.0   3rd Qu.:157.0   3rd Qu.:0.9810                     
##  Max.   :639.0   Max.   :217.0   Max.   :0.9910                     
##                                                                     
##      park             attendance           BPF             PPF       
##  Length:2985        Min.   :      0   Min.   : 60.0   Min.   : 60.0  
##  Class :character   1st Qu.: 538461   1st Qu.: 97.0   1st Qu.: 97.0  
##  Mode  :character   Median :1190886   Median :100.0   Median :100.0  
##                     Mean   :1376599   Mean   :100.2   Mean   :100.2  
##                     3rd Qu.:2066598   3rd Qu.:103.0   3rd Qu.:103.0  
##                     Max.   :4483350   Max.   :129.0   Max.   :141.0  
##                     NA's   :279                                      
##    teamIDBR         teamIDlahman45     teamIDretro              BA        
##  Length:2985        Length:2985        Length:2985        Min.   :0.1564  
##  Class :character   Class :character   Class :character   1st Qu.:0.2494  
##  Mode  :character   Mode  :character   Mode  :character   Median :0.2600  
##                                                           Mean   :0.2607  
##                                                           3rd Qu.:0.2708  
##                                                           Max.   :0.3498  
##                                                                           
##       SLG        
##  Min.   :0.1659  
##  1st Qu.:0.4192  
##  Median :0.4596  
##  Mean   :0.4561  
##  3rd Qu.:0.4950  
##  Max.   :0.6093  
## 

Question #4

Using the Teams data frame in the pkg Lahman() package. display the top-5 teams ranked in terms of slugging percentage (SLG) in Major League Baseball history. Repeat this using teams since 1969. Slugging percentage is total bases divided by at-bats.To compute total bases, you get 1 for a single, 2 for a double, 3 for a triple, and 4 for a home run.

library(Lahman)

Teams %>% 
  mutate(SLG = (H + 2 * X2B + 3 * X3B + 4 * HR) / AB) %>% 
  arrange(desc(SLG)) %>% 
head(5)
##   yearID lgID teamID franchID divID Rank   G Ghome   W  L DivWin WCWin LgWin
## 1   2019   AL    HOU      HOU     W    1 162    81 107 55      Y     N     Y
## 2   2019   AL    MIN      MIN     C    1 162    81 101 61      Y     N     N
## 3   2003   AL    BOS      BOS     E    2 162    81  95 67      N     Y     N
## 4   2019   AL    NYA      NYY     E    1 162    81 103 59      Y     N     N
## 5   2020   NL    ATL      ATL     E    1  60    30  35 25      Y     N     N
##   WSWin   R   AB    H X2B X3B  HR  BB   SO SB CS HBP SF  RA  ER  ERA CG SHO SV
## 1     N 920 5613 1538 323  28 288 645 1166 67 27  66 57 640 595 3.66  2  14 47
## 2     N 939 5732 1547 318  23 307 525 1334 28 21  81 41 754 680 4.18  1  10 50
## 3     N 961 5769 1667 371  40 238 620  943 88 35  53 64 809 729 4.48  5   6 36
## 4     N 943 5583 1493 290  17 306 569 1437 55 22  49 33 739 691 4.31  1   9 50
## 5     N 348 2074  556 130   3 103 239  573 23  4  23  7 288 257 4.41  0   0 13
##   IPouts   HA HRA BBA  SOA   E  DP    FP             name               park
## 1   4387 1205 230 448 1671  71  96 0.988   Houston Astros   Minute Maid Park
## 2   4390 1456 198 452 1463 111 130 0.981  Minnesota Twins       Target Field
## 3   4394 1503 153 488 1141 113 130 0.982   Boston Red Sox     Fenway Park II
## 4   4329 1374 248 507 1534 102 135 0.982 New York Yankees Yankee Stadium III
## 5   1573  494  69 220  506  33  52 0.985   Atlanta Braves      SunTrust Park
##   attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro        BA       SLG
## 1    2857367 103 100      HOU            HOU         HOU 0.2740068 0.6092998
## 2    2294152 100  99      MIN            MIN         MIN 0.2698883 0.6071179
## 3    2724165 105 103      BOS            BOS         BOS 0.2889582 0.6033975
## 4    3304404  98  96      NYY            NYA         NYA 0.2674190 0.5996776
## 5          0 107 106      ATL            ATL         ATL 0.2680810 0.5964320
Teams %>% 
  filter(yearID >= 1969) %>% 
  mutate(SLG = (H + 2 * X2B + 3 * X3B + 4 * HR) / AB) %>% 
  arrange(desc(SLG)) %>% 
head(5)
##   yearID lgID teamID franchID divID Rank   G Ghome   W  L DivWin WCWin LgWin
## 1   2019   AL    HOU      HOU     W    1 162    81 107 55      Y     N     Y
## 2   2019   AL    MIN      MIN     C    1 162    81 101 61      Y     N     N
## 3   2003   AL    BOS      BOS     E    2 162    81  95 67      N     Y     N
## 4   2019   AL    NYA      NYY     E    1 162    81 103 59      Y     N     N
## 5   2020   NL    ATL      ATL     E    1  60    30  35 25      Y     N     N
##   WSWin   R   AB    H X2B X3B  HR  BB   SO SB CS HBP SF  RA  ER  ERA CG SHO SV
## 1     N 920 5613 1538 323  28 288 645 1166 67 27  66 57 640 595 3.66  2  14 47
## 2     N 939 5732 1547 318  23 307 525 1334 28 21  81 41 754 680 4.18  1  10 50
## 3     N 961 5769 1667 371  40 238 620  943 88 35  53 64 809 729 4.48  5   6 36
## 4     N 943 5583 1493 290  17 306 569 1437 55 22  49 33 739 691 4.31  1   9 50
## 5     N 348 2074  556 130   3 103 239  573 23  4  23  7 288 257 4.41  0   0 13
##   IPouts   HA HRA BBA  SOA   E  DP    FP             name               park
## 1   4387 1205 230 448 1671  71  96 0.988   Houston Astros   Minute Maid Park
## 2   4390 1456 198 452 1463 111 130 0.981  Minnesota Twins       Target Field
## 3   4394 1503 153 488 1141 113 130 0.982   Boston Red Sox     Fenway Park II
## 4   4329 1374 248 507 1534 102 135 0.982 New York Yankees Yankee Stadium III
## 5   1573  494  69 220  506  33  52 0.985   Atlanta Braves      SunTrust Park
##   attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro        BA       SLG
## 1    2857367 103 100      HOU            HOU         HOU 0.2740068 0.6092998
## 2    2294152 100  99      MIN            MIN         MIN 0.2698883 0.6071179
## 3    2724165 105 103      BOS            BOS         BOS 0.2889582 0.6033975
## 4    3304404  98  96      NYY            NYA         NYA 0.2674190 0.5996776
## 5          0 107 106      ATL            ATL         ATL 0.2680810 0.5964320

Question #5

Use the Batting, Pitching, and People tables in the pkg Lahman() package to answer the following questions.

a.Name every player in baseball history who has accumulated at least 300 home runs (HR) and at least 300 stolen bases (SB). You can find the first and last name of the player in the Master data frame. Join this to your result along with the total home runs and total bases stolen for each of these elite players.

  1. Similarly, name every pitcher in baseball history who has accumulated at least 300 wins (W) and at least 3,000 strikeouts (SO).

  2. Identify the name and year of every player who has hit at least 50 home runs in a single season. Which player had the lowest batting average in that season?

library(Lahman)
Batting %>% 
    group_by(playerID) %>%
    summarize(HR=sum(HR), SB=sum(SB)) %>%
    filter(HR >= 300 & SB >= 300) %>%
    inner_join(People, by = c("playerID" = "playerID")) %>% 
    select(nameFirst, nameLast, HR, SB)
## # A tibble: 8 × 4
##   nameFirst nameLast     HR    SB
##   <chr>     <chr>     <int> <int>
## 1 Carlos    Beltran     435   312
## 2 Barry     Bonds       762   514
## 3 Bobby     Bonds       332   461
## 4 Andre     Dawson      438   314
## 5 Steve     Finley      304   320
## 6 Willie    Mays        660   338
## 7 Alex      Rodriguez   696   329
## 8 Reggie    Sanders     305   304
Pitching %>% 
  group_by(playerID) %>%
  summarize(W=sum(W), SO=sum(SO)) %>%
  filter(W >= 300 & SO >= 3000) %>%
  inner_join(People, by = c("playerID" = "playerID")) %>% 
  select(nameFirst, nameLast, W, SO)
## # A tibble: 10 × 4
##    nameFirst nameLast     W    SO
##    <chr>     <chr>    <int> <int>
##  1 Steve     Carlton    329  4136
##  2 Roger     Clemens    354  4672
##  3 Randy     Johnson    303  4875
##  4 Walter    Johnson    417  3509
##  5 Greg      Maddux     355  3371
##  6 Phil      Niekro     318  3342
##  7 Gaylord   Perry      314  3534
##  8 Nolan     Ryan       324  5714
##  9 Tom       Seaver     311  3640
## 10 Don       Sutton     324  3574
Batting %>%
  group_by(playerID, yearID) %>%
  summarize(HR = sum(HR), BA = sum(H)/sum(AB)) %>%
  filter(HR >=50) %>%
  left_join(People, by = c("playerID" = "playerID")) %>%
  select(nameFirst, nameLast, nameGiven, HR, BA) %>%
  arrange(BA)
## `summarise()` has grouped output by 'playerID'. You can override using the
## `.groups` argument.
## Adding missing grouping variables: `playerID`
## # A tibble: 46 × 6
## # Groups:   playerID [30]
##    playerID  nameFirst nameLast nameGiven                 HR    BA
##    <chr>     <chr>     <chr>    <chr>                  <int> <dbl>
##  1 alonspe01 Pete      Alonso   Peter Morgan              53 0.260
##  2 bautijo02 Jose      Bautista Jose Antonio              54 0.260
##  3 jonesan01 Andruw    Jones    Andruw Rudolf             51 0.263
##  4 marisro01 Roger     Maris    Roger Eugene              61 0.269
##  5 vaughgr01 Greg      Vaughn   Gregory Lamont            50 0.272
##  6 mcgwima01 Mark      McGwire  Mark David                58 0.274
##  7 fieldce01 Cecil     Fielder  Cecil Grant               51 0.277
##  8 mcgwima01 Mark      McGwire  Mark David                65 0.278
##  9 stantmi03 Giancarlo Stanton  Giancarlo Cruz-Michael    59 0.281
## 10 judgeaa01 Aaron     Judge    Aaron James               52 0.284
## # … with 36 more rows