My Approach

In this assignment we will use dplyr’s five verbs to recreate a FiveThirtyEight visualization of baseball team payrolls and win percentages. The data for this project will be imported from Kaggle. Before we get started, however, we will first review dplyr’s five verbs and what we can do with them.

You can check out the FiveThirtyEight article here:

https://fivethirtyeight.com/features/how-your-favorite-baseball-team-blows-its-money/

Inspiration for this vignette came from The Yhat Blog, July 20, 2015: The Code Behind Building a FiveThirtyEight post.

dplyr’s FIVE VERBS

dplyr is a package in the tidyverse. Its primary purpose is to manipulate data frames. The name dplyr is a play on words that combines the word data with pliers. In the same way a handyman might use a pair of pliers to shape or transform wire or metal, the data scientist can use dplyr’s five verbs to shape or transform her data frames. The five verbs are:

  1. Arrange
  2. Select
  3. Filter
  4. Mutate
  5. Summarize

The magic of dplyr is that these five simple words enable the data scientist to complete roughly 80 to 90 percent of the data manipulation required for various data science tasks. When combined with other tidyverse packages and base R, the data scientist has everything she could need to complete the task at hand.

Here’s a brief overview of each of the verbs. We will use a teams tibble in our examples.
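The code that builds the teams tibble is not shown in this write-up. As a minimal sketch, the values below are read off the printed output in the sections that follow, so the construction itself is an assumption rather than the original code:

```r
library(tibble)

# Assumed reconstruction of the example data, copied from the printed output.
teams <- tibble(
  name  = c("Red Sox", "Rays", "Yankees", "Blue Jays", "Orioles"),
  order = c(3, 2, 1, 4, 5),
  wins  = c(88, 96, 103, 76, 50),
  games = rep(162, 5)
)

teams
```

All four columns are built from character or double vectors, which matches the `<chr>` and `<dbl>` column types shown in the output.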

Arrange

Arrange is the sorter in chief. This verb changes the order of a data frame’s rows, similar to Excel’s sort function. By default, arrange() sorts in ascending order. To sort in descending order, wrap the column in desc(). The examples below use arrange() to sort in ascending and then descending order.

## # A tibble: 5 x 4
##   name      order  wins games
##   <chr>     <dbl> <dbl> <dbl>
## 1 Yankees       1   103   162
## 2 Rays          2    96   162
## 3 Red Sox       3    88   162
## 4 Blue Jays     4    76   162
## 5 Orioles       5    50   162
## # A tibble: 5 x 4
##   name      order  wins games
##   <chr>     <dbl> <dbl> <dbl>
## 1 Orioles       5    50   162
## 2 Blue Jays     4    76   162
## 3 Red Sox       3    88   162
## 4 Rays          2    96   162
## 5 Yankees       1   103   162
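The two tibbles above could be produced by calls along these lines. This is a sketch: teams is rebuilt by hand from the printed values, and sorting on the order column is an assumption (sorting on wins would give the same row order).

```r
library(dplyr)
library(tibble)

# Assumed reconstruction of the example data from the printed output.
teams <- tibble(
  name  = c("Red Sox", "Rays", "Yankees", "Blue Jays", "Orioles"),
  order = c(3, 2, 1, 4, 5),
  wins  = c(88, 96, 103, 76, 50),
  games = rep(162, 5)
)

# Ascending is the default.
ascending <- arrange(teams, order)

# desc() reverses the sort for the column it wraps.
descending <- arrange(teams, desc(order))

ascending
descending
```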

Select

Select picks variables (columns) based on their names, which lets you rapidly zoom in on a useful subset of columns. select() also has a number of predicates and helper functions that make it even easier to get to the data one wants. Here is select() in action:

## # A tibble: 5 x 3
##   name       wins games
##   <chr>     <dbl> <dbl>
## 1 Red Sox      88   162
## 2 Rays         96   162
## 3 Yankees     103   162
## 4 Blue Jays    76   162
## 5 Orioles      50   162
## # A tibble: 5 x 3
##   order  wins games
##   <dbl> <dbl> <dbl>
## 1     3    88   162
## 2     2    96   162
## 3     1   103   162
## 4     4    76   162
## 5     5    50   162
## # A tibble: 5 x 1
##   name     
##   <chr>    
## 1 Red Sox  
## 2 Rays     
## 3 Yankees  
## 4 Blue Jays
## 5 Orioles
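As a sketch, the three results above match the following calls. The teams tibble is rebuilt by hand from the printed values, and the exact arguments used in the original code are an assumption; several other spellings, such as select(teams, -order), give the same columns.

```r
library(dplyr)
library(tibble)

# Assumed reconstruction of the example data from the printed output.
teams <- tibble(
  name  = c("Red Sox", "Rays", "Yankees", "Blue Jays", "Orioles"),
  order = c(3, 2, 1, 4, 5),
  wins  = c(88, 96, 103, 76, 50),
  games = rep(162, 5)
)

# Name the columns to keep.
by_name <- select(teams, name, wins, games)

# Keep a contiguous range of columns with the colon operator.
by_range <- select(teams, order:games)

# Keep a single column; the result is still a tibble.
just_name <- select(teams, name)

by_name
by_range
just_name
```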

There are a number of helper functions you can use within select():

starts_with("abc"): matches names that begin with "abc".

ends_with("xyz"): matches names that end with "xyz".

contains("ijk"): matches names that contain "ijk".

matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters.

num_range("x", 1:3): matches x1, x2 and x3.
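Here is a short sketch of these helpers at work. The teams tibble is rebuilt by hand from the printed values, and demo is a hypothetical tibble invented only so that matches() and num_range() have something to find.

```r
library(dplyr)
library(tibble)

# Assumed reconstruction of the example data from the printed output.
teams <- tibble(
  name  = c("Red Sox", "Rays", "Yankees", "Blue Jays", "Orioles"),
  order = c(3, 2, 1, 4, 5),
  wins  = c(88, 96, 103, 76, 50),
  games = rep(162, 5)
)

starts_w    <- select(teams, starts_with("w"))  # wins
ends_s      <- select(teams, ends_with("s"))    # wins, games
contains_am <- select(teams, contains("am"))    # name, games

# Hypothetical tibble, not part of the original data.
demo <- tibble(x1 = 1, x2 = 2, x3 = 3, allowed = 4)

# "allowed" contains a doubled letter, so the regular expression matches it.
repeated <- select(demo, matches("(.)\\1"))

# Picks x1, x2 and x3 by prefix and numeric range.
x_cols <- select(demo, num_range("x", 1:3))
```

Note the doubled backslash: inside an R string, the backreference \1 has to be written as "\\1".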

Filter

Filter picks observations (rows) based on their values. The first argument is the data frame. The second and subsequent arguments are the expressions that filter the data frame. filter() works with both comparison operators (==, >, <) and logical operators (& for and, | for or), so conditions can be combined freely to express whatever criteria the data scientist needs. Check out some of the examples below:

## # A tibble: 4 x 4
##   name      order  wins games
##   <chr>     <dbl> <dbl> <dbl>
## 1 Red Sox       3    88   162
## 2 Rays          2    96   162
## 3 Yankees       1   103   162
## 4 Blue Jays     4    76   162
## # A tibble: 3 x 4
##   name      order  wins games
##   <chr>     <dbl> <dbl> <dbl>
## 1 Red Sox       3    88   162
## 2 Rays          2    96   162
## 3 Blue Jays     4    76   162
## # A tibble: 1 x 4
##   name    order  wins games
##   <chr>   <dbl> <dbl> <dbl>
## 1 Yankees     1   103   162
## # A tibble: 2 x 4
##   name    order  wins games
##   <chr>   <dbl> <dbl> <dbl>
## 1 Red Sox     3    88   162
## 2 Yankees     1   103   162
## # A tibble: 2 x 4
##   name    order  wins games
##   <chr>   <dbl> <dbl> <dbl>
## 1 Rays        2    96   162
## 2 Yankees     1   103   162
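The five results above are consistent with filters like the ones in this sketch. The teams tibble is rebuilt by hand from the printed values, and the specific conditions are assumptions chosen to reproduce the same rows; the original code may have used different expressions.

```r
library(dplyr)
library(tibble)

# Assumed reconstruction of the example data from the printed output.
teams <- tibble(
  name  = c("Red Sox", "Rays", "Yankees", "Blue Jays", "Orioles"),
  order = c(3, 2, 1, 4, 5),
  wins  = c(88, 96, 103, 76, 50),
  games = rep(162, 5)
)

# A single comparison.
over_50 <- filter(teams, wins > 50)

# Two conditions joined with & (both must hold).
middle <- filter(teams, wins > 50 & wins < 100)

# Equality test on a character column.
yankees <- filter(teams, name == "Yankees")

# Two conditions joined with | (either may hold).
two_teams <- filter(teams, name == "Red Sox" | name == "Yankees")

# Another single comparison with a higher cutoff.
over_90 <- filter(teams, wins > 90)
```

Separating conditions with commas, as in filter(teams, wins > 50, wins < 100), is equivalent to joining them with &.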

Mutate

Mutate adds new variables that are functions of existing variables. By default, mutate() adds the new columns to the end of your tibble (data frame). If you only want to keep the new columns, you can use transmute() instead of mutate(). We will add some new columns to our teams tibble to demonstrate:

## # A tibble: 5 x 8
##   name      order  wins games win_percentage team_in_caps losses goodOrbad
##   <chr>     <dbl> <dbl> <dbl>          <dbl> <chr>         <dbl> <chr>    
## 1 Yankees       1   103   162          0.636 YANKEES          59 GOOD     
## 2 Rays          2    96   162          0.593 RAYS             66 GOOD     
## 3 Red Sox       3    88   162          0.543 RED SOX          74 GOOD     
## 4 Blue Jays     4    76   162          0.469 BLUE JAYS        86 BAD      
## 5 Orioles       5    50   162          0.309 ORIOLES         112 BAD
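A sketch of how the four new columns above could be computed follows. The teams tibble is rebuilt by hand from the printed values, and both the formulas and the 0.5 cutoff for GOOD versus BAD are assumptions inferred from the output, not taken from the original code.

```r
library(dplyr)
library(tibble)

# Assumed reconstruction of the example data from the printed output.
teams <- tibble(
  name  = c("Red Sox", "Rays", "Yankees", "Blue Jays", "Orioles"),
  order = c(3, 2, 1, 4, 5),
  wins  = c(88, 96, 103, 76, 50),
  games = rep(162, 5)
)

teams_plus <- teams %>%
  arrange(order) %>%
  mutate(
    win_percentage = wins / games,
    team_in_caps   = toupper(name),
    losses         = games - wins,
    # Later columns can refer to ones created earlier in the same call.
    goodOrbad      = if_else(win_percentage > 0.5, "GOOD", "BAD")
  )

teams_plus
```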

Summarize

Summarize reduces multiple values down to a single summary row. Summarize is most useful when paired with group_by(), which changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame, they’ll be automatically applied “by group”. Here is an example of using summarize. We’ll have more in our FiveThirtyEight project below.

## # A tibble: 2 x 5
##   goodOrbad mostwins leastwins avgwins totalwins
##   <chr>        <dbl>     <dbl>   <dbl>     <dbl>
## 1 BAD             76        50    63         126
## 2 GOOD           103        88    95.7       287
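The grouped summary above could be computed roughly as follows. This is a sketch: teams is rebuilt by hand from the printed values, and the goodOrbad rule (win percentage above 0.5) is an assumption inferred from the output.

```r
library(dplyr)
library(tibble)

# Assumed reconstruction of the example data from the printed output.
teams <- tibble(
  name  = c("Red Sox", "Rays", "Yankees", "Blue Jays", "Orioles"),
  order = c(3, 2, 1, 4, 5),
  wins  = c(88, 96, 103, 76, 50),
  games = rep(162, 5)
)

wins_by_group <- teams %>%
  mutate(goodOrbad = if_else(wins / games > 0.5, "GOOD", "BAD")) %>%
  group_by(goodOrbad) %>%
  # One output row per group, one column per summary statistic.
  summarize(
    mostwins  = max(wins),
    leastwins = min(wins),
    avgwins   = mean(wins),
    totalwins = sum(wins)
  )

wins_by_group
```

Without the group_by() step, the same summarize() call would return a single row covering all five teams.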

That’s a summary of dplyr’s verbs! In the following sections you can see these verbs in action and how they enable the data scientist to wrangle her data into the desired shape, size or configuration. Check out the code comments to see how it works.

How Does Your Favorite Baseball Team Spend Its Money

Win Percentage vs Standard Deviation From Average Salary