In this assignment we will leverage the five verbs for dplyr to recreate a Five Thirty Eight Viz on Baseball Team Payrolls and Win Percentages. The data for this project will be imported from Kagle. We will then seek to leverage the tidyverse and the Five Verbs Of dpyr to recreate the Five Thirty-Eight Viz on the importance of baseball team payrolls. Before we get started, however, we will first review dplyr’s five verbs and what we can do with them.
You can checkout the FiveThirtyEight article here:
https://fivethirtyeight.com/features/how-your-favorite-baseball-team-blows-its-money/
Inspiration for this vingette came from The Yhat Blog - July 20, 2015-The Code Behind Building a FiveThirtyEight post.
dplyr is a package in the Tidyverse. Its primary purpose is to manipulate data frames.The name dplyr is a play on words that seeks to combine the word data with pliers - dplyr for short. In the same way a handyman might use a pair of pliers to shape and/or transform wire or metal, the data scientist can use dplyr’s five verbs to shape and/or transform her data frames. dplyr’s five verbs follow:
The magic of dplyr is that these five simple words enable the data scientiest to complete 80 to 90 percent of the data manipulation that is required for various data science task. When compbined with other tidyverse packages and base r the data scientist have everything she could need to complete the task at hand.
__Here’s a brief overview of each of the verbs. We will use a teams Tibble in our example.
teams <-
tibble(
name = c("Red Sox", "Rays", "Yankees", "Blue Jays", "Orioles"),
order = c(3,2,1,4,5),
wins = c(88,96,103,76,50),
games= c(162,162,162,162,162))Arrange is the sorter in chief. This verb is used to change the order of data frame rows, similar to Excel’s sort function. By default Arrange sorts in ascending order. To sort in descending order simply use the desc() modifier. The examples below use arrange to sort in ascending and descending order
## # A tibble: 5 x 4
## name order wins games
## <chr> <dbl> <dbl> <dbl>
## 1 Yankees 1 103 162
## 2 Rays 2 96 162
## 3 Red Sox 3 88 162
## 4 Blue Jays 4 76 162
## 5 Orioles 5 50 162
## # A tibble: 5 x 4
## name order wins games
## <chr> <dbl> <dbl> <dbl>
## 1 Orioles 5 50 162
## 2 Blue Jays 4 76 162
## 3 Red Sox 3 88 162
## 4 Rays 2 96 162
## 5 Yankees 1 103 162
Select picks variables (columns) based on their names. Select allows you to rapidly zoom in on a useful subset of coloumns based on the names of the variables. Select also enjoys a number of predicates and helper function that make it even easier to get to the data one wants. Here is SELECT in action:
## # A tibble: 5 x 3
## name wins games
## <chr> <dbl> <dbl>
## 1 Red Sox 88 162
## 2 Rays 96 162
## 3 Yankees 103 162
## 4 Blue Jays 76 162
## 5 Orioles 50 162
## # A tibble: 5 x 3
## order wins games
## <dbl> <dbl> <dbl>
## 1 3 88 162
## 2 2 96 162
## 3 1 103 162
## 4 4 76 162
## 5 5 50 162
## # A tibble: 5 x 1
## name
## <chr>
## 1 Red Sox
## 2 Rays
## 3 Yankees
## 4 Blue Jays
## 5 Orioles
There are a number of helper functions you can use within select():
starts_with(“abc”): matches names that begin with “abc”.
ends_with(“xyz”): matches names that end with “xyz”.
contains(“ijk”): matches names that contain “ijk”.
matches(“(.)\1”): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.
num_range(“x”, 1:3): matches x1, x2 and x3.
Filter picks observations (rows) based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame. Filter leverages both comparison (==, >, <) and logical (& (and) and |(or)) operators, thus enabling the data scientist to filtr using an infinite number of criteria. Check out some of the examples below:
## # A tibble: 4 x 4
## name order wins games
## <chr> <dbl> <dbl> <dbl>
## 1 Red Sox 3 88 162
## 2 Rays 2 96 162
## 3 Yankees 1 103 162
## 4 Blue Jays 4 76 162
## # A tibble: 3 x 4
## name order wins games
## <chr> <dbl> <dbl> <dbl>
## 1 Red Sox 3 88 162
## 2 Rays 2 96 162
## 3 Blue Jays 4 76 162
## # A tibble: 1 x 4
## name order wins games
## <chr> <dbl> <dbl> <dbl>
## 1 Yankees 1 103 162
## # A tibble: 2 x 4
## name order wins games
## <chr> <dbl> <dbl> <dbl>
## 1 Red Sox 3 88 162
## 2 Yankees 1 103 162
#filter using in allows you to filter x if its one of the values in y
teams %>%
filter(wins %in% c(96,103))## # A tibble: 2 x 4
## name order wins games
## <chr> <dbl> <dbl> <dbl>
## 1 Rays 2 96 162
## 2 Yankees 1 103 162
Mutate adds new variables that are functions of existing variables. Mutate will add columns to the end of your tibble (data frame). If you only want to keep new columns you can use transmute instead of mutate. We will add some new columns to our teams tibble to demonstrate:
# Add win percentage to teams
teams <- teams %>%
mutate(win_percentage = wins / games)
# Add team name in Capital Letters
teams <- teams %>%
mutate(team_in_caps= str_to_upper(name))
# Add team losses to the team tibble
teams <- teams %>%
mutate(losses= games - wins)
# Add a good team / bad team column to the teams tibble
teams <- teams %>%
mutate(goodOrbad = if_else((wins/games) < 0.5,"BAD", "GOOD")) %>%
arrange(desc(wins))
teams## # A tibble: 5 x 8
## name order wins games win_percentage team_in_caps losses goodOrbad
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
## 1 Yankees 1 103 162 0.636 YANKEES 59 GOOD
## 2 Rays 2 96 162 0.593 RAYS 66 GOOD
## 3 Red Sox 3 88 162 0.543 RED SOX 74 GOOD
## 4 Blue Jays 4 76 162 0.469 BLUE JAYS 86 BAD
## 5 Orioles 5 50 162 0.309 ORIOLES 112 BAD
Summarize reduces multiple values down to a single summary row. Summarize is most useful when paired with group_by(). This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame they’ll be automatically applied “by group”. Here are some examples of using summarize. We’ll have more in our Five Thirty Eight project below.
summary <- teams %>%
group_by(goodOrbad) %>%
summarise(
mostwins = max(wins),
leastwins = min(wins),
avgwins = mean(wins),
totalwins =sum(wins)
)
summary## # A tibble: 2 x 5
## goodOrbad mostwins leastwins avgwins totalwins
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 BAD 76 50 63 126
## 2 GOOD 103 88 95.7 287
s <- read_csv("https://raw.githubusercontent.com/MundyMSDS/DATA607/master/Salaries.csv")
t <- read_csv("https://raw.githubusercontent.com/MundyMSDS/DATA607/master/Teams.csv")
#Filter dates greater than 1984 and select the 4 columns of interest
t <- t %>%
filter(yearID > 1984) %>%
select(teamID, yearID, G, W)
#Group by Teams and Year, so Summarize can calculate est_payroll by summing individual player salaries
s <- s %>%
group_by(yearID, teamID) %>%
summarise(est_payroll = sum(salary))
#Arrange (sort) t by yearID and teamID so you feel more organized
#Use mutate to create the Team winning percentages (wins / games) column
#Select three columns of interest (teamID, yearID wpct)
t <- t %>%
arrange(yearID, teamID) %>%
mutate(wpct = (W/G)) %>%
select(teamID, yearID, wpct)
#Join the s and t tibbles to create data frame df
df <- inner_join(s, t, by = c("yearID" = "yearID", "teamID" = "teamID"))yearly_payroll <- df %>%
#Use the dynamic duo of group_by and summarize to create two new columns (mean_payroll and std_payroll)
group_by(yearID) %>%
summarise(
mean_payroll=mean(est_payroll, na.rm=TRUE),
std_payroll=sd(est_payroll, na.rm=TRUE)
)
# Merge newly calculated variable into the df data frame
df <- merge(df, yearly_payroll, by="yearID")df <- df %>%
#Using our two new variable we use mutate to create standardized_payroll
mutate(standardized_payroll = (est_payroll - mean_payroll) / std_payroll) %>%
arrange(yearID,teamID)
# Now dplyr has gotten our data prepared for charting!ggplot2
divisions <- c("AL East","AL Central","AL West", "NL East","NL Central","NL West")
for (div in divisions) {
df.division <- subset(df, division==div)
p <- ggplot(df.division, aes(x=standardized_payroll, y=wpct, color=team_color)) +
geom_point(alpha=0.75, size=3) +
stat_smooth(data=within(df, teamID <- NULL), color="grey", size=.5,
method="lm", formula = y ~ poly(x, 2), se=FALSE) +
stat_smooth(size=1.5, method="lm", formula = y ~ poly(x, 2), se=FALSE) +
scale_color_identity() +
scale_x_continuous(name="Standardized Salary\n(# of standard deviations from yearly mean)",
breaks=c(-2, 0, 2), limit=c(-2.5, 2.5), labels=c("-2", "0", "+2")) +
scale_y_continuous(name="Win/Loss %", breaks=seq(0.3, 0.7, 0.1), limit=c(0.25, 0.75)) +
facet_wrap(~teamID, ncol=5, scales="free_x") +
theme_fivethirtyeight() +
ggtitle(div)
ggsave(filename=paste0(div, ".png"), plot=p, width=13.65, height=3.59)
}