dplyr
basicsdplyrDuring ANLY 512 we will be studying the theory and practice of
data visualization. We will be using R and the
packages within R to assemble data and construct many
different types of visualizations. Before we begin studying data
visualizations we need to develop some data wrangling skills. We will
use these skills to wrangle our data into a form that we can use for
visualizations.
The objective of this assignment is to introduce you to R Studio,
Rmarkdown, the tidyverse and more specifically the dplyr
package.
Each question is worth 5 points.
To submit this homework you will create the document in Rstudio, using the knitr package (button included in Rstudio) and then submit the document to your Rpubs account. Once uploaded you will submit the link to that document on Canvas. Please make sure that this link is hyper linked and that I can see the visualization and the code required to create it.
Question #1
Use the nycflights13 package and the flights data frame to answer the following questions:
a.What month had the highest proportion of cancelled flights?
** Month of February has the highest proportion of cancelled flights i.e. around 1340 cancelled flights or 0.053 proportion of cancelled flights compared to other months.
b.What month had the lowest?
** Month of October has the lowest proportion of cancelled flights i.e. around 571 cancelled flights or 0.019 proportion of cancelled flights compared to other months.
library(nycflights13)
str(flights)
## tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr [1:336776] "UA" "UA" "AA" "B6" ...
## $ flight : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num [1:336776] 1400 1416 1089 1576 762 ...
## $ hour : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
#Determination for the month having the highest proportion of cancelled flights leveraging dplyr & pipe function.
Flights_Cancelled <- flights %>%
group_by(month, year) %>%
summarize(Cancelled_Flights = sum(is.na(arr_delay)),
total = n(),
Prop_cancelled_flights = Cancelled_Flights/total) %>%
arrange(desc(Prop_cancelled_flights))
Flights_Cancelled
## # A tibble: 12 × 5
## # Groups: month [12]
## month year Cancelled_Flights total Prop_cancelled_flights
## <int> <int> <int> <int> <dbl>
## 1 2 2013 1340 24951 0.0537
## 2 6 2013 1168 28243 0.0414
## 3 12 2013 1115 28135 0.0396
## 4 7 2013 1132 29425 0.0385
## 5 3 2013 932 28834 0.0323
## 6 4 2013 766 28330 0.0270
## 7 5 2013 668 28796 0.0232
## 8 1 2013 606 27004 0.0224
## 9 9 2013 564 27574 0.0205
## 10 8 2013 571 29327 0.0195
## 11 11 2013 297 27268 0.0109
## 12 10 2013 271 28889 0.00938
# Result Inference:
#Month of February has the highest proportion of cancelled flights i.e. around 1340 cancelled flights or 0.053 proportion of cancelled flights compared to other months.
#Month of October has the lowest proportion of cancelled flights i.e. around 571 flights or 0.019 proportion of cancelled flights compared to other months.
Question #2
Consider the following pipeline:
#mtcars %>%
#group_by(cyl) %>%
#summarize(avg_mpg = mean(mpg)) %>%
#filter(am == 1)
## Issue with the incorrect pipeline: The problem issue is with the code element arrangement order for the fourth element # of the code i.e. for deploying the filter function "filter(am == 1), which is deployed after or followed by the #summarize function for the mean (mpg), however, if the arrangement order for this filter function is reversed or #corrected as indicated below, this pipeline error issue will be resolved.
## Pipeline with correction:
mtcars %>%
group_by(cyl) %>%
filter(am == 1) %>%
summarize(avg_mpg = mean(mpg))
## # A tibble: 3 × 2
## cyl avg_mpg
## <dbl> <dbl>
## 1 4 28.1
## 2 6 20.6
## 3 8 15.4
What is the problem with this pipeline?
The problem issue with the incorrect pipeline: The problem issue is with the code element arrangement order for the fourth element of the code i.e. for deploying the filter function “filter(am == 1), which is deployed after or followed by the summarize function for the mean (mpg), however, if the arrangement order for this filter function is reversed or corrected as indicated below, this pipeline error issue will be resolved.
Accurate Pipeline with correction: mtcars %>% group_by(cyl) %>% filter(am == 1) %>% summarize(avg_mpg = mean(mpg))
Question #3
Define two new variables in the Teams data frame in the
pkg Lahman() package.
batting average (BA). Batting average is the ratio of hits (H) to at-bats (AB)
slugging percentage (SLG). Slugging percentage is total bases divided by at-bats (AB). To compute total bases, you get 1 for a single, 2 for a double, 3 for a triple, and 4 for a home run.
library(Lahman)
str(Teams)
## 'data.frame': 2985 obs. of 48 variables:
## $ yearID : int 1871 1871 1871 1871 1871 1871 1871 1871 1871 1872 ...
## $ lgID : Factor w/ 7 levels "AA","AL","FL",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ teamID : Factor w/ 149 levels "ALT","ANA","ARI",..: 24 31 39 56 90 97 111 136 142 8 ...
## $ franchID : Factor w/ 120 levels "ALT","ANA","ARI",..: 13 36 25 56 70 85 91 109 77 9 ...
## $ divID : chr NA NA NA NA ...
## $ Rank : int 3 2 8 7 5 1 9 6 4 2 ...
## $ G : int 31 28 29 19 33 28 25 29 32 58 ...
## $ Ghome : int NA NA NA NA NA NA NA NA NA NA ...
## $ W : int 20 19 10 7 16 21 4 13 15 35 ...
## $ L : int 10 9 19 12 17 7 21 15 15 19 ...
## $ DivWin : chr NA NA NA NA ...
## $ WCWin : chr NA NA NA NA ...
## $ LgWin : chr "N" "N" "N" "N" ...
## $ WSWin : chr NA NA NA NA ...
## $ R : int 401 302 249 137 302 376 231 351 310 617 ...
## $ AB : int 1372 1196 1186 746 1404 1281 1036 1248 1353 2571 ...
## $ H : int 426 323 328 178 403 410 274 384 375 753 ...
## $ X2B : int 70 52 35 19 43 66 44 51 54 106 ...
## $ X3B : int 37 21 40 8 21 27 25 34 26 31 ...
## $ HR : int 3 10 7 2 1 9 3 6 6 14 ...
## $ BB : int 60 60 26 33 33 46 38 49 48 29 ...
## $ SO : int 19 22 25 9 15 23 30 19 13 28 ...
## $ SB : int 73 69 18 16 46 56 53 62 48 53 ...
## $ CS : int 16 21 8 4 15 12 10 24 13 18 ...
## $ HBP : int NA NA NA NA NA NA NA NA NA NA ...
## $ SF : int NA NA NA NA NA NA NA NA NA NA ...
## $ RA : int 303 241 341 243 313 266 287 362 303 434 ...
## $ ER : int 109 77 116 97 121 137 108 153 137 166 ...
## $ ERA : num 3.55 2.76 4.11 5.17 3.72 4.95 4.3 5.51 4.37 2.9 ...
## $ CG : int 22 25 23 19 32 27 23 28 32 48 ...
## $ SHO : int 1 0 0 1 1 0 1 0 0 1 ...
## $ SV : int 3 1 0 0 0 0 0 0 0 1 ...
## $ IPouts : int 828 753 762 507 879 747 678 750 846 1548 ...
## $ HA : int 367 308 346 261 373 329 315 431 371 573 ...
## $ HRA : int 2 6 13 5 7 3 3 4 4 3 ...
## $ BBA : int 42 28 53 21 42 53 34 75 45 63 ...
## $ SOA : int 23 22 34 17 22 16 16 12 13 77 ...
## $ E : int 243 229 234 163 235 194 220 198 218 432 ...
## $ DP : int 24 16 15 8 14 13 14 22 20 22 ...
## $ FP : num 0.834 0.829 0.818 0.803 0.84 0.845 0.821 0.845 0.85 0.83 ...
## $ name : chr "Boston Red Stockings" "Chicago White Stockings" "Cleveland Forest Citys" "Fort Wayne Kekiongas" ...
## $ park : chr "South End Grounds I" "Union Base-Ball Grounds" "National Association Grounds" "Hamilton Field" ...
## $ attendance : int NA NA NA NA NA NA NA NA NA NA ...
## $ BPF : int 103 104 96 101 90 102 97 101 94 106 ...
## $ PPF : int 98 102 100 107 88 98 99 100 98 102 ...
## $ teamIDBR : chr "BOS" "CHI" "CLE" "KEK" ...
## $ teamIDlahman45: chr "BS1" "CH1" "CL1" "FW1" ...
## $ teamIDretro : chr "BS1" "CH1" "CL1" "FW1" ...
## Determination for the new revised "Teams" data frame mutated with two new additional variables i.e. Batting Average #(BA) & Slugging percentage (SLG), reference to the above defined ratio and average calculation conditions.
Revised_Teams <- Teams %>%
mutate(BA = H/AB, SLG = (H + 2 * X2B + 3 * X3B + 4 * HR) / AB)
head(Revised_Teams)
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
## 1 1871 NA BS1 BNA <NA> 3 31 NA 20 10 <NA> <NA> N
## 2 1871 NA CH1 CNA <NA> 2 28 NA 19 9 <NA> <NA> N
## 3 1871 NA CL1 CFC <NA> 8 29 NA 10 19 <NA> <NA> N
## 4 1871 NA FW1 KEK <NA> 7 19 NA 7 12 <NA> <NA> N
## 5 1871 NA NY2 NNA <NA> 5 33 NA 16 17 <NA> <NA> N
## 6 1871 NA PH1 PNA <NA> 1 28 NA 21 7 <NA> <NA> Y
## WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV
## 1 <NA> 401 1372 426 70 37 3 60 19 73 16 NA NA 303 109 3.55 22 1 3
## 2 <NA> 302 1196 323 52 21 10 60 22 69 21 NA NA 241 77 2.76 25 0 1
## 3 <NA> 249 1186 328 35 40 7 26 25 18 8 NA NA 341 116 4.11 23 0 0
## 4 <NA> 137 746 178 19 8 2 33 9 16 4 NA NA 243 97 5.17 19 1 0
## 5 <NA> 302 1404 403 43 21 1 33 15 46 15 NA NA 313 121 3.72 32 1 0
## 6 <NA> 376 1281 410 66 27 9 46 23 56 12 NA NA 266 137 4.95 27 0 0
## IPouts HA HRA BBA SOA E DP FP name
## 1 828 367 2 42 23 243 24 0.834 Boston Red Stockings
## 2 753 308 6 28 22 229 16 0.829 Chicago White Stockings
## 3 762 346 13 53 34 234 15 0.818 Cleveland Forest Citys
## 4 507 261 5 21 17 163 8 0.803 Fort Wayne Kekiongas
## 5 879 373 7 42 22 235 14 0.840 New York Mutuals
## 6 747 329 3 53 16 194 13 0.845 Philadelphia Athletics
## park attendance BPF PPF teamIDBR teamIDlahman45
## 1 South End Grounds I NA 103 98 BOS BS1
## 2 Union Base-Ball Grounds NA 104 102 CHI CH1
## 3 National Association Grounds NA 96 100 CLE CL1
## 4 Hamilton Field NA 101 107 KEK FW1
## 5 Union Grounds (Brooklyn) NA 90 88 NYU NY2
## 6 Jefferson Street Grounds NA 102 98 ATH PH1
## teamIDretro BA SLG
## 1 BS1 0.3104956 0.5021866
## 2 CH1 0.2700669 0.4431438
## 3 CL1 0.2765599 0.4603710
## 4 FW1 0.2386059 0.3324397
## 5 NY2 0.2870370 0.3960114
## 6 PH1 0.3200625 0.5144418
# Revised Teams dataframe, mutated with two new additional variables i.e. Batting Average (BA) & Slugging percentage (SLG)
str(Revised_Teams)
## 'data.frame': 2985 obs. of 50 variables:
## $ yearID : int 1871 1871 1871 1871 1871 1871 1871 1871 1871 1872 ...
## $ lgID : Factor w/ 7 levels "AA","AL","FL",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ teamID : Factor w/ 149 levels "ALT","ANA","ARI",..: 24 31 39 56 90 97 111 136 142 8 ...
## $ franchID : Factor w/ 120 levels "ALT","ANA","ARI",..: 13 36 25 56 70 85 91 109 77 9 ...
## $ divID : chr NA NA NA NA ...
## $ Rank : int 3 2 8 7 5 1 9 6 4 2 ...
## $ G : int 31 28 29 19 33 28 25 29 32 58 ...
## $ Ghome : int NA NA NA NA NA NA NA NA NA NA ...
## $ W : int 20 19 10 7 16 21 4 13 15 35 ...
## $ L : int 10 9 19 12 17 7 21 15 15 19 ...
## $ DivWin : chr NA NA NA NA ...
## $ WCWin : chr NA NA NA NA ...
## $ LgWin : chr "N" "N" "N" "N" ...
## $ WSWin : chr NA NA NA NA ...
## $ R : int 401 302 249 137 302 376 231 351 310 617 ...
## $ AB : int 1372 1196 1186 746 1404 1281 1036 1248 1353 2571 ...
## $ H : int 426 323 328 178 403 410 274 384 375 753 ...
## $ X2B : int 70 52 35 19 43 66 44 51 54 106 ...
## $ X3B : int 37 21 40 8 21 27 25 34 26 31 ...
## $ HR : int 3 10 7 2 1 9 3 6 6 14 ...
## $ BB : int 60 60 26 33 33 46 38 49 48 29 ...
## $ SO : int 19 22 25 9 15 23 30 19 13 28 ...
## $ SB : int 73 69 18 16 46 56 53 62 48 53 ...
## $ CS : int 16 21 8 4 15 12 10 24 13 18 ...
## $ HBP : int NA NA NA NA NA NA NA NA NA NA ...
## $ SF : int NA NA NA NA NA NA NA NA NA NA ...
## $ RA : int 303 241 341 243 313 266 287 362 303 434 ...
## $ ER : int 109 77 116 97 121 137 108 153 137 166 ...
## $ ERA : num 3.55 2.76 4.11 5.17 3.72 4.95 4.3 5.51 4.37 2.9 ...
## $ CG : int 22 25 23 19 32 27 23 28 32 48 ...
## $ SHO : int 1 0 0 1 1 0 1 0 0 1 ...
## $ SV : int 3 1 0 0 0 0 0 0 0 1 ...
## $ IPouts : int 828 753 762 507 879 747 678 750 846 1548 ...
## $ HA : int 367 308 346 261 373 329 315 431 371 573 ...
## $ HRA : int 2 6 13 5 7 3 3 4 4 3 ...
## $ BBA : int 42 28 53 21 42 53 34 75 45 63 ...
## $ SOA : int 23 22 34 17 22 16 16 12 13 77 ...
## $ E : int 243 229 234 163 235 194 220 198 218 432 ...
## $ DP : int 24 16 15 8 14 13 14 22 20 22 ...
## $ FP : num 0.834 0.829 0.818 0.803 0.84 0.845 0.821 0.845 0.85 0.83 ...
## $ name : chr "Boston Red Stockings" "Chicago White Stockings" "Cleveland Forest Citys" "Fort Wayne Kekiongas" ...
## $ park : chr "South End Grounds I" "Union Base-Ball Grounds" "National Association Grounds" "Hamilton Field" ...
## $ attendance : int NA NA NA NA NA NA NA NA NA NA ...
## $ BPF : int 103 104 96 101 90 102 97 101 94 106 ...
## $ PPF : int 98 102 100 107 88 98 99 100 98 102 ...
## $ teamIDBR : chr "BOS" "CHI" "CLE" "KEK" ...
## $ teamIDlahman45: chr "BS1" "CH1" "CL1" "FW1" ...
## $ teamIDretro : chr "BS1" "CH1" "CL1" "FW1" ...
## $ BA : num 0.31 0.27 0.277 0.239 0.287 ...
## $ SLG : num 0.502 0.443 0.46 0.332 0.396 ...
Question #4
Using the Teams data frame in the
pkg Lahman() package. display the top-5 teams ranked in
terms of slugging percentage (SLG) in Major League Baseball history.
Repeat this using teams since 1969. Slugging percentage is total bases
divided by at-bats.To compute total bases, you get 1 for a single, 2 for
a double, 3 for a triple, and 4 for a home run.
library(Lahman)
## Determination for the display of the top-5 teams ranked in terms of sluggish percentage (SLG) leveraging pipeline #function:
Top_five_Teams <- Revised_Teams %>%
arrange(desc(SLG)) %>%
select(teamID, SLG) %>%
head(n = 5)
Top_five_Teams
## teamID SLG
## 1 HOU 0.6092998
## 2 MIN 0.6071179
## 3 BOS 0.6033975
## 4 NYA 0.5996776
## 5 ATL 0.5964320
## Determination for the display of top-5 teams since 1969 ranked in terms of sluggish percentage (SLG).
Top_five_Teams_since_1969 <- Revised_Teams %>%
select(yearID, teamID, SLG) %>%
filter(yearID >= 1969) %>%
arrange(desc(SLG)) %>%
head(n = 5)
Top_five_Teams_since_1969
## yearID teamID SLG
## 1 2019 HOU 0.6092998
## 2 2019 MIN 0.6071179
## 3 2003 BOS 0.6033975
## 4 2019 NYA 0.5996776
## 5 2020 ATL 0.5964320
Question #5
Use the Batting, Pitching, and
People tables in the pkg Lahman() package to
answer the following questions.
a.Name every player in baseball history who has accumulated at least 300 home runs (HR) and at least 300 stolen bases (SB). You can find the first and last name of the player in the Master data frame. Join this to your result along with the total home runs and total bases stolen for each of these elite players.
Similarly, name every pitcher in baseball history who has accumulated at least 300 wins (W) and at least 3,000 strikeouts (SO).
Identify the name and year of every player who has hit at least 50 home runs in a single season. Which player had the lowest batting average in that season?
** The player “Pete Alonso” alias “Peter Morgon” had the lowest batting average of 0.25 in that season.
library(Lahman)
str(Batting)
## 'data.frame': 110495 obs. of 22 variables:
## $ playerID: chr "abercda01" "addybo01" "allisar01" "allisdo01" ...
## $ yearID : int 1871 1871 1871 1871 1871 1871 1871 1871 1871 1871 ...
## $ stint : int 1 1 1 1 1 1 1 1 1 1 ...
## $ teamID : Factor w/ 149 levels "ALT","ANA","ARI",..: 136 111 39 142 111 56 111 24 56 24 ...
## $ lgID : Factor w/ 7 levels "AA","AL","FL",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ G : int 1 25 29 27 25 12 1 31 1 18 ...
## $ AB : int 4 118 137 133 120 49 4 157 5 86 ...
## $ R : int 0 30 28 28 29 9 0 66 1 13 ...
## $ H : int 0 32 40 44 39 11 1 63 1 13 ...
## $ X2B : int 0 6 4 10 11 2 0 10 1 2 ...
## $ X3B : int 0 0 5 2 3 1 0 9 0 1 ...
## $ HR : int 0 0 0 2 0 0 0 0 0 0 ...
## $ RBI : int 0 13 19 27 16 5 2 34 1 11 ...
## $ SB : int 0 8 3 1 6 0 0 11 0 1 ...
## $ CS : int 0 1 1 1 2 1 0 6 0 0 ...
## $ BB : int 0 4 2 0 2 0 1 13 0 0 ...
## $ SO : int 0 0 5 2 1 1 0 1 0 0 ...
## $ IBB : int NA NA NA NA NA NA NA NA NA NA ...
## $ HBP : int NA NA NA NA NA NA NA NA NA NA ...
## $ SH : int NA NA NA NA NA NA NA NA NA NA ...
## $ SF : int NA NA NA NA NA NA NA NA NA NA ...
## $ GIDP : int 0 0 1 0 0 0 0 1 0 0 ...
str(People)
## 'data.frame': 20370 obs. of 26 variables:
## $ playerID : chr "aardsda01" "aaronha01" "aaronto01" "aasedo01" ...
## $ birthYear : int 1981 1934 1939 1954 1972 1985 1850 1877 1869 1866 ...
## $ birthMonth : int 12 2 8 9 8 12 11 4 11 10 ...
## $ birthDay : int 27 5 5 8 25 17 4 15 11 14 ...
## $ birthCountry: chr "USA" "USA" "USA" "USA" ...
## $ birthState : chr "CO" "AL" "AL" "CA" ...
## $ birthCity : chr "Denver" "Mobile" "Mobile" "Orange" ...
## $ deathYear : int NA 2021 1984 NA NA NA 1905 1957 1962 1926 ...
## $ deathMonth : int NA 1 8 NA NA NA 5 1 6 4 ...
## $ deathDay : int NA 22 16 NA NA NA 17 6 11 27 ...
## $ deathCountry: chr NA "USA" "USA" NA ...
## $ deathState : chr NA "GA" "GA" NA ...
## $ deathCity : chr NA "Atlanta" "Atlanta" NA ...
## $ nameFirst : chr "David" "Hank" "Tommie" "Don" ...
## $ nameLast : chr "Aardsma" "Aaron" "Aaron" "Aase" ...
## $ nameGiven : chr "David Allan" "Henry Louis" "Tommie Lee" "Donald William" ...
## $ weight : int 215 180 190 190 184 235 192 170 175 169 ...
## $ height : int 75 72 75 75 73 74 72 71 71 68 ...
## $ bats : Factor w/ 3 levels "B","L","R": 3 3 3 3 2 2 3 3 3 2 ...
## $ throws : Factor w/ 3 levels "L","R","S": 2 2 2 2 1 1 2 2 2 1 ...
## $ debut : chr "2004-04-06" "1954-04-13" "1962-04-10" "1977-07-26" ...
## $ finalGame : chr "2015-08-23" "1976-10-03" "1971-09-26" "1990-10-03" ...
## $ retroID : chr "aardd001" "aaroh101" "aarot101" "aased001" ...
## $ bbrefID : chr "aardsda01" "aaronha01" "aaronto01" "aasedo01" ...
## $ deathDate : Date, format: NA "2021-01-22" ...
## $ birthDate : Date, format: "1981-12-27" "1934-02-05" ...
## Determination for every player in baseball history who has accumulated at least 300 Home runs (HR) and at least 300 #Stolen bases (SB), leveraging "Batting" & "people" table in "Lahman" pkg.
Batting %>%
group_by(playerID) %>%
summarise(tHR = sum(HR), tSB = sum(SB)) %>%
filter(tHR >= 300 & tSB >= 300) %>%
left_join(People, by = c("playerID" = "playerID")) %>%
select(nameFirst, nameLast, tHR, tSB)
## # A tibble: 8 × 4
## nameFirst nameLast tHR tSB
## <chr> <chr> <int> <int>
## 1 Carlos Beltran 435 312
## 2 Barry Bonds 762 514
## 3 Bobby Bonds 332 461
## 4 Andre Dawson 438 314
## 5 Steve Finley 304 320
## 6 Willie Mays 660 338
## 7 Alex Rodriguez 696 329
## 8 Reggie Sanders 305 304
## Determination for every player in baseball history who has accumulated at least 300 wins (W) and at least 3,000 strikeouts (SO), leveraging "Pitching" & "people" table in "Lahman" pkg.
str(Pitching)
## 'data.frame': 49430 obs. of 30 variables:
## $ playerID: chr "bechtge01" "brainas01" "fergubo01" "fishech01" ...
## $ yearID : int 1871 1871 1871 1871 1871 1871 1871 1871 1871 1871 ...
## $ stint : int 1 1 1 1 1 1 1 1 1 1 ...
## $ teamID : Factor w/ 149 levels "ALT","ANA","ARI",..: 97 142 90 111 90 136 111 56 97 136 ...
## $ lgID : Factor w/ 7 levels "AA","AL","FL",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ W : int 1 12 0 4 0 0 0 6 18 12 ...
## $ L : int 2 15 0 16 1 0 1 11 5 15 ...
## $ G : int 3 30 1 24 1 1 3 19 25 29 ...
## $ GS : int 3 30 0 24 1 0 1 19 25 29 ...
## $ CG : int 2 30 0 22 1 0 1 19 25 28 ...
## $ SHO : int 0 0 0 1 0 0 0 1 0 0 ...
## $ SV : int 0 0 0 0 0 0 0 0 0 0 ...
## $ IPouts : int 78 792 3 639 27 3 39 507 666 747 ...
## $ H : int 43 361 8 295 20 1 20 261 285 430 ...
## $ ER : int 23 132 3 103 10 0 5 97 113 153 ...
## $ HR : int 0 4 0 3 0 0 0 5 3 4 ...
## $ BB : int 11 37 0 31 3 0 3 21 40 75 ...
## $ SO : int 1 13 0 15 0 0 1 17 15 12 ...
## $ BAOpp : num NA NA NA NA NA NA NA NA NA NA ...
## $ ERA : num 7.96 4.5 27 4.35 10 0 3.46 5.17 4.58 5.53 ...
## $ IBB : int NA NA NA NA NA NA NA NA NA NA ...
## $ WP : int 7 7 2 20 0 0 1 15 3 44 ...
## $ HBP : int NA NA NA NA NA NA NA NA NA NA ...
## $ BK : int 0 0 0 0 0 0 0 2 0 0 ...
## $ BFP : int 146 1291 14 1080 57 3 70 876 1059 1334 ...
## $ GF : int 0 0 0 1 0 1 1 0 0 0 ...
## $ R : int 42 292 9 257 21 0 30 243 223 362 ...
## $ SH : int NA NA NA NA NA NA NA NA NA NA ...
## $ SF : int NA NA NA NA NA NA NA NA NA NA ...
## $ GIDP : int NA NA NA NA NA NA NA NA NA NA ...
Pitching %>%
group_by(playerID) %>%
summarise(tW = sum(W), tSO = sum(SO)) %>%
filter(tW >= 300 & tSO >= 3000) %>%
left_join(People, by = c("playerID" = "playerID")) %>%
select(nameFirst, nameLast, tW, tSO)
## # A tibble: 10 × 4
## nameFirst nameLast tW tSO
## <chr> <chr> <int> <int>
## 1 Steve Carlton 329 4136
## 2 Roger Clemens 354 4672
## 3 Randy Johnson 303 4875
## 4 Walter Johnson 417 3509
## 5 Greg Maddux 355 3371
## 6 Phil Niekro 318 3342
## 7 Gaylord Perry 314 3534
## 8 Nolan Ryan 324 5714
## 9 Tom Seaver 311 3640
## 10 Don Sutton 324 3574
## Determination for the name and year of every player who has hit at least 50 home runs in a single season & player having the lowest batting average in that season.
Batting %>%
group_by(playerID, yearID) %>%
summarise(total_Home_Run = sum(HR), Batting_Avg = sum(H)/sum(AB)) %>%
filter(total_Home_Run >= 50) %>%
left_join(People, by = c("playerID" = "playerID")) %>%
select(yearID, playerID, nameFirst, nameLast, nameGiven, total_Home_Run, Batting_Avg) %>%
ungroup() %>%
arrange(Batting_Avg)
## # A tibble: 46 × 7
## yearID playerID nameFirst nameLast nameGiven total_Ho…¹ Batti…²
## <int> <chr> <chr> <chr> <chr> <int> <dbl>
## 1 2019 alonspe01 Pete Alonso Peter Morgan 53 0.260
## 2 2010 bautijo02 Jose Bautista Jose Antonio 54 0.260
## 3 2005 jonesan01 Andruw Jones Andruw Rudolf 51 0.263
## 4 1961 marisro01 Roger Maris Roger Eugene 61 0.269
## 5 1998 vaughgr01 Greg Vaughn Gregory Lamont 50 0.272
## 6 1997 mcgwima01 Mark McGwire Mark David 58 0.274
## 7 1990 fieldce01 Cecil Fielder Cecil Grant 51 0.277
## 8 1999 mcgwima01 Mark McGwire Mark David 65 0.278
## 9 2017 stantmi03 Giancarlo Stanton Giancarlo Cruz-Michael 59 0.281
## 10 2017 judgeaa01 Aaron Judge Aaron James 52 0.284
## # … with 36 more rows, and abbreviated variable names ¹total_Home_Run,
## # ²Batting_Avg
## The player "Pete Alonso" alias "Peter Morgon" had the lowest batting average of 0.25 in that season.