Problem Set #1 - dplyr basics

Directions

During ANLY 512 we will be studying the theory and practice of data visualization. We will be using R and the packages within R to assemble data and construct many different types of visualizations. Before we begin studying data visualizations we need to develop some data wrangling skills. We will use these skills to wrangle our data into a form that we can use for visualizations.

The objective of this assignment is to introduce you to RStudio, R Markdown, the tidyverse, and more specifically the dplyr package.

Each question is worth 5 points.

To submit this homework you will create the document in RStudio, using the knitr package (button included in RStudio), and then publish the document to your RPubs account. Once uploaded, you will submit the link to that document on Canvas. Please make sure that this link is hyperlinked and that I can see the visualization and the code required to create it.

Question #1

Use the nycflights13 package and the flights data frame to answer the following questions: a. What month had the highest proportion of cancelled flights? b. What month had the lowest?

library(tidyverse)
library(nycflights13)
view(flights)

# A flight with a missing dep_time is treated as cancelled
cancelled_f = flights %>%
  group_by(month) %>%
  summarize(cancelled_f = sum(is.na(dep_time)),
            cancelled_p = cancelled_f/n()) %>%
  arrange(cancelled_p)

print(cancelled_f)

## # A tibble: 12 × 3
##    month cancelled_f cancelled_p
##    <int>       <int>       <dbl>
##  1    10         236     0.00817
##  2    11         233     0.00854
##  3     9         452     0.0164 
##  4     8         486     0.0166 
##  5     1         521     0.0193 
##  6     5         563     0.0196 
##  7     4         668     0.0236 
##  8     3         861     0.0299 
##  9     7         940     0.0319 
## 10     6        1009     0.0357 
## 11    12        1025     0.0364 
## 12     2        1261     0.0505

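# The table is sorted in ascending order of cancelled_p, so the lowest proportion is in the first row and the highest is in the last row.
# highest proportion of cancelled flights: February (0.0505).
# lowest proportion of cancelled flights: October (0.00817).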
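As a small cross-check of the approach above, the sketch below uses a made-up data frame (the name `toy` and all of its values are hypothetical, not taken from nycflights13). It shows that `mean(is.na(dep_time))` gives the same proportion as `sum(is.na(dep_time))/n()`, and that `slice_max()` and `slice_min()` can pull out the extreme months without reading a sorted table by eye.

```r
library(dplyr)

# Hypothetical toy data: an NA in dep_time marks a cancelled flight
toy <- data.frame(
  month    = c(1, 1, 2, 2, 2, 3),
  dep_time = c(517, NA, NA, NA, 830, 900)
)

by_month <- toy %>%
  group_by(month) %>%
  summarize(
    cancelled_n  = sum(is.na(dep_time)),
    cancelled_p  = cancelled_n / n(),
    cancelled_p2 = mean(is.na(dep_time))  # same proportion, computed directly
  )

# Month with the highest and the lowest cancellation proportion
highest <- by_month %>% slice_max(cancelled_p, n = 1)
lowest  <- by_month %>% slice_min(cancelled_p, n = 1)
```

Using `slice_max()` and `slice_min()` avoids the easy mistake of reading an ascending sort as if it were descending.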

Question #2

Consider the following pipeline:

library(tidyverse)
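mtcars %>%
  group_by(cyl) %>%
  filter(am == 1) %>%
  summarize(avg_mpg = mean(mpg))

# The pipeline as originally given failed with: object "am" not found. summarize() keeps only the grouping column cyl and the new column avg_mpg, so am no longer exists for a filter() that runs after it. The version above moves filter(am == 1) to right after group_by(cyl), so am is filtered before the following commands.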

What is the problem with this pipeline?
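To see why a column can go missing partway through a pipeline, the sketch below inspects what `summarize()` returns for `mtcars`. The object names `summarized` and `fixed` are only illustrative.

```r
library(dplyr)

# After summarize(), only the grouping column and the new summary column remain
summarized <- mtcars %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg))

names(summarized)  # "cyl" "avg_mpg"; am is no longer available

# Filtering on am before summarizing works, because am still exists at that point
fixed <- mtcars %>%
  filter(am == 1) %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg))
```

Because `filter()` works row by row here, it gives the same rows whether it is placed before or after `group_by(cyl)`; what matters is that it runs before `summarize()`.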

Question #3

Define two new variables in the Teams data frame in the Lahman package.

batting average (BA). Batting average is the ratio of hits (H) to at-bats (AB)
slugging percentage (SLG). Slugging percentage is total bases divided by at-bats (AB). To compute total bases, you get 1 for a single, 2 for a double, 3 for a triple, and 4 for a home run.

library(Lahman)
view(Teams)

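# BA: hits per at-bat.
# SLG: total bases per at-bat. H already counts every hit once (singles, doubles, triples and home runs),
# so a double adds 1 extra base, a triple 2 and a home run 3. The whole sum is divided by AB.
Teams = Teams %>%
  mutate(BA = H/AB,
         SLG = (H + X2B + 2*X3B + 3*HR)/AB)

summary(Teams$BA)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1564  0.2494  0.2600  0.2607  0.2708  0.3498

summary(Teams$SLG)

# Note: the summary below was produced by the earlier formula H + 2*X2B + 3*X3B + 4*HR/AB, in which only 4*HR is divided by AB.
# Those values are in the thousands and are not slugging percentages; rerun the chunk with the corrected formula to refresh them.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      35    1842    1993    1934    2131    2745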
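The arithmetic behind total bases can be checked on a single made-up season line (all numbers below are hypothetical). Because H counts every hit once, weighting singles, doubles, triples and home runs by 1, 2, 3 and 4 is the same as adding 1, 2 and 3 extra bases on top of H. The sketch also shows what happens when the parentheses around the numerator are left out.

```r
# Hypothetical season line: 150 hits, of which 30 doubles, 5 triples, 20 home runs
H <- 150; X2B <- 30; X3B <- 5; HR <- 20; AB <- 500

singles <- H - X2B - X3B - HR                 # hits that were not extra-base hits

tb_long  <- 1*singles + 2*X2B + 3*X3B + 4*HR  # definition from the question
tb_short <- H + X2B + 2*X3B + 3*HR            # same total, written in terms of H

SLG <- tb_short / AB                          # total bases per at-bat

# Without parentheses, only the last term is divided by AB
no_parens <- H + 2*X2B + 3*X3B + 4*HR/AB
```

Here `tb_long` and `tb_short` agree, while `no_parens` is a number in the hundreds, which is the same kind of symptom as a slugging summary reported in the thousands.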

Question #4

Using the Teams data frame in the Lahman package, display the top-5 teams ranked in terms of slugging percentage (SLG) in Major League Baseball history. Repeat this using teams since 1969. Slugging percentage is total bases divided by at-bats. To compute total bases, you get 1 for a single, 2 for a double, 3 for a triple, and 4 for a home run.

library(Lahman)
summary(Teams$SLG)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      35    1842    1993    1934    2131    2745

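# SLG is the column added to Teams in Question #3.
# Note: a slugging percentage can never exceed 4, so the values in the thousands printed in this section
# are not valid slugging percentages; the rankings should be rechecked with SLG computed as total bases divided by AB.
Teams %>%
  select(yearID, name, SLG) %>%
  arrange(desc(SLG)) %>%
  head(5)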

##   yearID                  name      SLG
## 1   1930   St. Louis Cardinals 2745.075
## 2   1894 Philadelphia Phillies 2707.031
## 3   1936     Cleveland Indians 2675.087
## 4   1929        Detroit Tigers 2640.079
## 5   1894     Baltimore Orioles 2639.028

Teams %>%
select(yearID, name, SLG) %>%
filter(yearID >= 1969) %>%
arrange(desc(SLG)) %>%
head(5)

##   yearID             name      SLG
## 1   2003   Boston Red Sox 2529.165
## 2   1997   Boston Red Sox 2526.128
## 3   2007   Detroit Tigers 2506.123
## 4   2001 Colorado Rockies 2494.150
## 5   2008    Texas Rangers 2476.135

Question #5

Use the Batting, Pitching, and People tables in the Lahman package to answer the following questions.

a. Name every player in baseball history who has accumulated at least 300 home runs (HR) and at least 300 stolen bases (SB). You can find the first and last name of the player in the People data frame (called Master in older versions of Lahman). Join this to your result along with the total home runs and total bases stolen for each of these elite players.

b. Similarly, name every pitcher in baseball history who has accumulated at least 300 wins (W) and at least 3,000 strikeouts (SO).

c. Identify the name and year of every player who has hit at least 50 home runs in a single season. Which player had the lowest batting average in that season?

library(Lahman)

#view(Batting)
Batting %>%
  group_by(playerID) %>%
  summarize(Total_HR = sum(HR),Total_SB = sum(SB)) %>%
  inner_join(People, by = c("playerID" = "playerID")) %>%
  filter(Total_HR >= 300 & Total_SB >= 300) %>%
  select(nameGiven, Total_HR, Total_SB)

## # A tibble: 8 × 3
##   nameGiven          Total_HR Total_SB
##   <chr>                 <int>    <int>
## 1 Carlos Ivan             435      312
## 2 Barry Lamar             762      514
## 3 Bobby Lee               332      461
## 4 Andre Nolan             438      314
## 5 Steven Allen            304      320
## 6 Willie Howard           660      338
## 7 Alexander Enmanuel      696      329
## 8 Reginald Laverne        305      304

#View (Pitching)
Pitching %>%
  group_by(playerID) %>%
  summarize(Total_W = sum(W), Total_S = sum(SO)) %>%
  inner_join(People, by =  c("playerID" = "playerID")) %>%
  filter(Total_W >= 300 & Total_S >= 3000) %>%
  select(nameGiven, Total_W, Total_S)

## # A tibble: 10 × 3
##    nameGiven       Total_W Total_S
##    <chr>             <int>   <int>
##  1 Steven Norman       329    4136
##  2 William Roger       354    4672
##  3 Randall David       303    4875
##  4 Walter Perry        417    3509
##  5 Gregory Alan        355    3371
##  6 Philip Henry        318    3342
##  7 Gaylord Jackson     314    3534
##  8 Lynn Nolan          324    5714
##  9 George Thomas       311    3640
## 10 Donald Howard       324    3574

Batting %>%
  group_by(playerID, yearID) %>%
  summarize(Total_HR = sum(HR), BA = sum(H)/sum(AB)) %>%
  inner_join(People, by = c("playerID" = "playerID")) %>%
  filter(Total_HR >= 50) %>%
  select(nameGiven, yearID, Total_HR, BA) %>%
  arrange(BA)

## # A tibble: 46 × 5
## # Groups:   playerID [30]
##    playerID  nameGiven              yearID Total_HR    BA
##    <chr>     <chr>                   <int>    <int> <dbl>
##  1 alonspe01 Peter Morgan             2019       53 0.260
##  2 bautijo02 Jose Antonio             2010       54 0.260
##  3 jonesan01 Andruw Rudolf            2005       51 0.263
##  4 marisro01 Roger Eugene             1961       61 0.269
##  5 vaughgr01 Gregory Lamont           1998       50 0.272
##  6 mcgwima01 Mark David               1997       58 0.274
##  7 fieldce01 Cecil Grant              1990       51 0.277
##  8 mcgwima01 Mark David               1999       65 0.278
##  9 stantmi03 Giancarlo Cruz-Michael   2017       59 0.281
## 10 judgeaa01 Aaron James              2017       52 0.284
## # … with 36 more rows

# Among the seasons with at least 50 home runs, Peter Morgan (playerID alonspe01) had the lowest batting average: 0.260 in 2019.
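The results above report `nameGiven`, which holds given names rather than the first and last name the question asks for. The sketch below shows one way to aggregate career totals, filter, join, and build a full name from `nameFirst` and `nameLast`, which are columns of the Lahman `People` table. The two small data frames (`batting_toy`, `people_toy`) and every ID, name and number in them are hypothetical stand-ins so the example runs without the Lahman package.

```r
library(dplyr)

# Hypothetical stand-ins for Batting and People (IDs, names and numbers are made up)
batting_toy <- data.frame(
  playerID = c("aaa01", "aaa01", "bbb01", "bbb01"),
  HR       = c(200, 150, 100, 50),
  SB       = c(180, 140, 400, 10)
)
people_toy <- data.frame(
  playerID  = c("aaa01", "bbb01"),
  nameFirst = c("Alex", "Blake"),
  nameLast  = c("Able", "Baker")
)

elite <- batting_toy %>%
  group_by(playerID) %>%
  # na.rm = TRUE guards against seasons where a statistic was not recorded
  summarize(Total_HR = sum(HR, na.rm = TRUE),
            Total_SB = sum(SB, na.rm = TRUE)) %>%
  filter(Total_HR >= 300, Total_SB >= 300) %>%
  inner_join(people_toy, by = "playerID") %>%
  mutate(name = paste(nameFirst, nameLast)) %>%
  select(name, Total_HR, Total_SB)
```

Filtering before the join keeps the join small; the same pattern should carry over to the pitching question by swapping in wins and strikeouts.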

Problem Set #1 - `dplyr` basics

Data Wrangling with `dplyr`

S.S

2023-01-24
