Lesson 3

Pulling Data

We are going to make use of the dpylr() library throughout this lesson which is already pre-loaded into the tidyverse. The dpylr library contains five main function that have the following purposes:

Order the rows however you want using arrange()

Pull only certain data using filter()

Make new variables by using other already known variables using mutate()

Select a group of variables by name using select()

Getting basic summaries of the data using summarise()

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.5.1

## -- Attaching packages ----------------------------------------------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.1.0     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.7
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0

## Warning: package 'ggplot2' was built under R version 3.5.1

## Warning: package 'tibble' was built under R version 3.5.1

## Warning: package 'tidyr' was built under R version 3.5.1

## Warning: package 'readr' was built under R version 3.5.1

## Warning: package 'purrr' was built under R version 3.5.1

## Warning: package 'dplyr' was built under R version 3.5.1

## Warning: package 'stringr' was built under R version 3.5.1

## Warning: package 'forcats' was built under R version 3.5.1

## -- Conflicts -------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

shooting<-read_csv(file="C:/Users/ankit/OneDrive/Desktop/Robotics Scouting/Data Sets/nba_history.csv")

## Parsed with column specification:
## cols(
##   PLAYER = col_character(),
##   SEASON = col_integer(),
##   FGM = col_integer(),
##   FGA = col_integer(),
##   TPM = col_integer(),
##   TPA = col_integer(),
##   FTM = col_integer(),
##   FTA = col_integer(),
##   FGP = col_double(),
##   TPP = col_double(),
##   FTP = col_double()
## )

Arranging Data

The arrange() function takes the data from the tibble and the column names sorts data according tho the values in the columns we specify.

arrange(shooting,FGM)

## Warning: package 'bindrcpp' was built under R version 3.5.1

## # A tibble: 7,447 x 11
##    PLAYER     SEASON   FGM   FGA   TPM   TPA   FTM   FTA   FGP   TPP   FTP
##    <chr>       <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
##  1 Dell Demps   1997     0     3     0     1     2     2     0     0  1   
##  2 James Sco~   1997     0     8     0     4     1     2     0     0  0.5 
##  3 Darrick M~   2002     0    10     0     2     1     2     0     0  0.5 
##  4 Alvin Wil~   2006     0     3     0     2     1     2     0     0  0.5 
##  5 Alvin Wil~   2007     0     2     0     1     2     4     0     0  0.5 
##  6 Roger Pow~   2007     0     3     0     1     2     2     0     0  1   
##  7 Patrick E~   2011     0     5     0     2     3     4     0     0  0.75
##  8 Will Conr~   2013     0     9     0     3     1     2     0     0  0.5 
##  9 Scotty Ho~   2014     0     4     0     2     1     2     0     0  0.5 
## 10 Mustafa S~   2014     0     3     0     1     1     2     0     0  0.5 
## # ... with 7,437 more rows

What happened here was our dataset, shooting, was arranged from least to greatest by FGM and we see many players with 0 FGM. You will also notice that you cannot see the full ammount of entries as only a certain dimension of the tibble is shown. If we wanted to View the whole shooting tibble arranged we could use the View() function to see our full tibble.

View(arrange(shooting,FGM))

If you ever want to view the full tibble at anytime just use the View() function. Always have capital V in View()

We can do the exact same thing but in descending order with the simple desc() function.

arrange(shooting, desc(FGM))

## # A tibble: 7,447 x 11
##    PLAYER     SEASON   FGM   FGA   TPM   TPA   FTM   FTA   FGP   TPP   FTP
##    <chr>       <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
##  1 Kobe Brya~   2006   978  2173   180   518   696   819 0.450 0.347 0.850
##  2 Shaquille~   2000   956  1665     0     1   432   824 0.574 0     0.524
##  3 Michael J~   1997   920  1892   111   297   480   576 0.486 0.374 0.833
##  4 Michael J~   1998   881  1893    30   126   565   721 0.465 0.238 0.784
##  5 LeBron Ja~   2006   875  1823   127   379   601   814 0.480 0.335 0.738
##  6 Kobe Brya~   2003   868  1924   124   324   601   713 0.451 0.383 0.843
##  7 Karl Malo~   1997   864  1571     0    13   521   690 0.550 0     0.755
##  8 Dwyane Wa~   2009   854  1739    88   278   590   771 0.491 0.317 0.765
##  9 Kevin Dur~   2014   849  1688   192   491   703   805 0.503 0.391 0.873
## 10 Tracy McG~   2003   829  1813   173   448   576   726 0.457 0.386 0.793
## # ... with 7,437 more rows

We could even sort more than 1 column at once as shown below:

arrange(shooting, FGM, FGA, TPM, TPA)

## # A tibble: 7,447 x 11
##    PLAYER     SEASON   FGM   FGA   TPM   TPA   FTM   FTA   FGP   TPP   FTP
##    <chr>       <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
##  1 Alvin Wil~   2007     0     2     0     1     2     4     0     0  0.5 
##  2 Dell Demps   1997     0     3     0     1     2     2     0     0  1   
##  3 Roger Pow~   2007     0     3     0     1     2     2     0     0  1   
##  4 Mustafa S~   2014     0     3     0     1     1     2     0     0  0.5 
##  5 Alvin Wil~   2006     0     3     0     2     1     2     0     0  0.5 
##  6 Scotty Ho~   2014     0     4     0     2     1     2     0     0  0.5 
##  7 Patrick E~   2011     0     5     0     2     3     4     0     0  0.75
##  8 Quincy Mi~   2015     0     6     0     3     2     4     0     0  0.5 
##  9 Darius Jo~   2014     0     7     0     3     0     2     0     0  0   
## 10 James Sco~   1997     0     8     0     4     1     2     0     0  0.5 
## # ... with 7,437 more rows

Filtering Data

We can see from the arrange that there are a bunch of players who pretty much have done nothing and when we want to do meaningful things with datasets those useless numbers don’t help us. We can make use of the filter() function to pull specific subsets of data to satisfy the logic statement we choose such as FGM>500 or TPA<=100.

There are a variety of logical operators to help us make comparison:

== used for saying equal to

!= used for not equal to

< and <= used for less than and less than or equal to respectively

> and >= used for greater than and greater than or equal to respectively

&, |, ! used for AND, OR, NOT respectively

Let’s filter the data by only players who have made more than 100 Three Pointers

filter(shooting, TPM>100)

## # A tibble: 910 x 11
##    PLAYER     SEASON   FGM   FGA   TPM   TPA   FTM   FTA   FGP   TPP   FTP
##    <chr>       <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
##  1 Michael J~   1997   920  1892   111   297   480   576 0.486 0.374 0.833
##  2 Glen Rice    1997   722  1513   207   440   464   535 0.477 0.470 0.867
##  3 Mitch Ric~   1997   717  1578   204   477   457   531 0.454 0.428 0.861
##  4 Latrell S~   1997   649  1444   147   415   493   585 0.449 0.354 0.843
##  5 Allen Ive~   1997   625  1504   155   455   382   544 0.416 0.341 0.702
##  6 Gary Payt~   1997   706  1482   119   380   254   355 0.476 0.313 0.715
##  7 Reggie Mi~   1997   552  1244   229   536   418   475 0.444 0.427 0.88 
##  8 Jerry Sta~   1997   533  1308   102   342   511   667 0.407 0.298 0.766
##  9 Tim Harda~   1997   575  1384   203   590   291   364 0.415 0.344 0.799
## 10 Scottie P~   1997   648  1366   156   424   204   291 0.474 0.368 0.701
## # ... with 900 more rows

Notice how the size of the tibble went down from 7447 players to 910 players.

If were to view our shooting tibble we would see that it still contains all 7447 players, in order to make a seperate tibble for our TPM>100 condition we just assign it another name.

tpm_shooting<-filter(shooting, TPM>100)
arrange(tpm_shooting,FGM)

## # A tibble: 910 x 11
##    PLAYER     SEASON   FGM   FGA   TPM   TPA   FTM   FTA   FGP   TPP   FTP
##    <chr>       <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
##  1 Anthony T~   2014   129   307   102   247    33    41 0.420 0.413 0.805
##  2 James Jon~   2011   146   346   123   287    60    72 0.422 0.429 0.833
##  3 Damon Jon~   2008   151   363   115   276    20    28 0.416 0.417 0.714
##  4 Shane Bat~   2013   152   362   136   316    32    38 0.420 0.430 0.842
##  5 Jason Kidd   2013   152   409   114   325    40    48 0.372 0.351 0.833
##  6 Matt Bonn~   2012   154   350   105   250    16    21 0.44  0.42  0.762
##  7 James Pos~   2006   159   395   117   290    48    61 0.403 0.403 0.787
##  8 Steve Nov~   2012   161   337   133   282    22    26 0.478 0.472 0.846
##  9 Anthony T~   2015   161   393   106   290    54    69 0.410 0.366 0.783
## 10 Charlie W~   2003   165   414   101   267    41    53 0.399 0.378 0.774
## # ... with 900 more rows

Now let’s try filtering the data by TPM>100 or FGM>300

filter(shooting,TPM>100|FGM>300)

## # A tibble: 2,325 x 11
##    PLAYER     SEASON   FGM   FGA   TPM   TPA   FTM   FTA   FGP   TPP   FTP
##    <chr>       <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
##  1 Michael J~   1997   920  1892   111   297   480   576 0.486 0.374 0.833
##  2 Karl Malo~   1997   864  1571     0    13   521   690 0.550 0     0.755
##  3 Glen Rice    1997   722  1513   207   440   464   535 0.477 0.470 0.867
##  4 Shaquille~   1997   552   991     0     4   232   479 0.557 0     0.484
##  5 Mitch Ric~   1997   717  1578   204   477   457   531 0.454 0.428 0.861
##  6 Latrell S~   1997   649  1444   147   415   493   585 0.449 0.354 0.843
##  7 Allen Ive~   1997   625  1504   155   455   382   544 0.416 0.341 0.702
##  8 Hakeem Ol~   1997   727  1426     5    16   351   446 0.510 0.312 0.787
##  9 Patrick E~   1997   655  1342     2     9   439   582 0.488 0.222 0.754
## 10 LaPhonso ~   1997   445  1014    95   259   218   282 0.439 0.367 0.773
## # ... with 2,315 more rows

We can have multiple logical statements in one line rather than multiple by using parentheses:

filter(shooting, (TPM>100 & FGM>300) | (TPP >= 0.40 & FGP>=0.45))

## # A tibble: 1,063 x 11
##    PLAYER     SEASON   FGM   FGA   TPM   TPA   FTM   FTA   FGP   TPP   FTP
##    <chr>       <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
##  1 Michael J~   1997   920  1892   111   297   480   576 0.486 0.374 0.833
##  2 Glen Rice    1997   722  1513   207   440   464   535 0.477 0.470 0.867
##  3 Mitch Ric~   1997   717  1578   204   477   457   531 0.454 0.428 0.861
##  4 Latrell S~   1997   649  1444   147   415   493   585 0.449 0.354 0.843
##  5 Allen Ive~   1997   625  1504   155   455   382   544 0.416 0.341 0.702
##  6 Gary Payt~   1997   706  1482   119   380   254   355 0.476 0.313 0.715
##  7 Reggie Mi~   1997   552  1244   229   536   418   475 0.444 0.427 0.88 
##  8 Jerry Sta~   1997   533  1308   102   342   511   667 0.407 0.298 0.766
##  9 Tim Harda~   1997   575  1384   203   590   291   364 0.415 0.344 0.799
## 10 Scottie P~   1997   648  1366   156   424   204   291 0.474 0.368 0.701
## # ... with 1,053 more rows

In 1999 and 2012 the NBA had a lockout which shortened the seasons which may make our data bad. We can remove the data from those seasons in a quick way using the operator %in% which lets us select multiple things at once rather than saying SEASON== for every season we don’t want.

filter(shooting, !SEASON %in% c(1999, 2012))

## # A tibble: 6,721 x 11
##    PLAYER     SEASON   FGM   FGA   TPM   TPA   FTM   FTA   FGP   TPP   FTP
##    <chr>       <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
##  1 Michael J~   1997   920  1892   111   297   480   576 0.486 0.374 0.833
##  2 Karl Malo~   1997   864  1571     0    13   521   690 0.550 0     0.755
##  3 Glen Rice    1997   722  1513   207   440   464   535 0.477 0.470 0.867
##  4 Shaquille~   1997   552   991     0     4   232   479 0.557 0     0.484
##  5 Mitch Ric~   1997   717  1578   204   477   457   531 0.454 0.428 0.861
##  6 Latrell S~   1997   649  1444   147   415   493   585 0.449 0.354 0.843
##  7 Allen Ive~   1997   625  1504   155   455   382   544 0.416 0.341 0.702
##  8 Hakeem Ol~   1997   727  1426     5    16   351   446 0.510 0.312 0.787
##  9 Patrick E~   1997   655  1342     2     9   439   582 0.488 0.222 0.754
## 10 LaPhonso ~   1997   445  1014    95   259   218   282 0.439 0.367 0.773
## # ... with 6,711 more rows

For the rest of this lesson let’s only focus on players who attempted at least 150 field goals, 200 free throws, and 75 three pointers non-lockout seasons and 2015 because I hate the Golden State Warriors.

new_shooting <- filter(shooting, FGA >= 150 & FTA >= 200 & TPA >= 75 & !SEASON %in% c(1999, 2012, 2015))
new_shooting

## # A tibble: 1,069 x 11
##    PLAYER     SEASON   FGM   FGA   TPM   TPA   FTM   FTA   FGP   TPP   FTP
##    <chr>       <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
##  1 Michael J~   1997   920  1892   111   297   480   576 0.486 0.374 0.833
##  2 Glen Rice    1997   722  1513   207   440   464   535 0.477 0.470 0.867
##  3 Mitch Ric~   1997   717  1578   204   477   457   531 0.454 0.428 0.861
##  4 Latrell S~   1997   649  1444   147   415   493   585 0.449 0.354 0.843
##  5 Allen Ive~   1997   625  1504   155   455   382   544 0.416 0.341 0.702
##  6 LaPhonso ~   1997   445  1014    95   259   218   282 0.439 0.367 0.773
##  7 Kendall G~   1997   644  1453    74   220   427   536 0.443 0.336 0.797
##  8 Gary Payt~   1997   706  1482   119   380   254   355 0.476 0.313 0.715
##  9 Reggie Mi~   1997   552  1244   229   536   418   475 0.444 0.427 0.88 
## 10 Glenn Rob~   1997   669  1438    63   180   288   364 0.465 0.35  0.791
## # ... with 1,059 more rows

Making New Variables Using Old Vairables

We have our new table but we realize that we don’t have some other statistics we want such as Effective Field Goal Percentage, Total Points, and True Shooting Percentage. The mutate() function allows us to do this by creating a new variable using old variables. Let’s overwrite our current new_shooting to add EFGP.

new_shooting<-mutate(new_shooting, EFGP = (FGM + 0.5 * TPM)/FGA)
new_shooting

## # A tibble: 1,069 x 12
##    PLAYER SEASON   FGM   FGA   TPM   TPA   FTM   FTA   FGP   TPP   FTP
##    <chr>   <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
##  1 Micha~   1997   920  1892   111   297   480   576 0.486 0.374 0.833
##  2 Glen ~   1997   722  1513   207   440   464   535 0.477 0.470 0.867
##  3 Mitch~   1997   717  1578   204   477   457   531 0.454 0.428 0.861
##  4 Latre~   1997   649  1444   147   415   493   585 0.449 0.354 0.843
##  5 Allen~   1997   625  1504   155   455   382   544 0.416 0.341 0.702
##  6 LaPho~   1997   445  1014    95   259   218   282 0.439 0.367 0.773
##  7 Kenda~   1997   644  1453    74   220   427   536 0.443 0.336 0.797
##  8 Gary ~   1997   706  1482   119   380   254   355 0.476 0.313 0.715
##  9 Reggi~   1997   552  1244   229   536   418   475 0.444 0.427 0.88 
## 10 Glenn~   1997   669  1438    63   180   288   364 0.465 0.35  0.791
## # ... with 1,059 more rows, and 1 more variable: EFGP <dbl>

You can also add the other two categories yourself with the given formulas:

PTS=FTM+2*FGM+TPM

TSP=PTS/(2(FGA+0.44FTA))

Creating Categorical Variables

We have used mutate() to create numeric variables. If we wanted to assign qualitative categories such as bad or good to the players performance we wouldn’t be able to use only mutate, instead we could the case_when() command which lets us categorize however we choose as shown below. Let’s assign the following categories to player’s Three Point Percentages.

Garabge: TPP < 20%

Better Garbage: 20% <= TPP < 30%

Fine: 30% <= TPP < 35%

Pretty Good: 35% < TPP < 40%

Godly: TPP > 40%

new_shooting <- mutate(new_shooting, 
                       TPPTier= case_when(
                          TPP < 0.2 ~  "Garbage",
                          0.2 <= TPP & TPP < 0.3 ~ "Better Garbage",
                          0.3 <= TPP & TPP < 0.35 ~  "Fine",
                         0.35 <= TPP & TPP < 0.4 ~ "Pretty Good",
                         0.4 <= TPP ~ "Godly"))
View(new_shooting)

You can see that R created tiers for each player based on their Three Point Percentages.

Column Summaries

If we want to find the average Three Point Percentage for the 2004 season it is pretty simple. We can use some the functions we learned in previous lessons.

new_shooting_2004 <- filter(shooting, SEASON == 2004)
summarise(new_shooting_2004, TPP = mean(TPP))

## # A tibble: 1 x 1
##     TPP
##   <dbl>
## 1 0.262

We caould do that for multiple categories if we wanted which you could try out yourself.

Selecting Variables

We have a lot of columns here and we may not care about some of them. To look at only certain columns we can use the select() function which does exactly what it says, selects certain columns.

new_shooting_tpptier<-select(new_shooting, PLAYER, SEASON,TPP,TPPTier)
new_shooting_tpptier

## # A tibble: 1,069 x 4
##    PLAYER           SEASON   TPP TPPTier    
##    <chr>             <int> <dbl> <chr>      
##  1 Michael Jordan     1997 0.374 Pretty Good
##  2 Glen Rice          1997 0.470 Godly      
##  3 Mitch Richmond     1997 0.428 Godly      
##  4 Latrell Sprewell   1997 0.354 Pretty Good
##  5 Allen Iverson      1997 0.341 Fine       
##  6 LaPhonso Ellis     1997 0.367 Pretty Good
##  7 Kendall Gill       1997 0.336 Fine       
##  8 Gary Payton        1997 0.313 Fine       
##  9 Reggie Miller      1997 0.427 Godly      
## 10 Glenn Robinson     1997 0.35  Pretty Good
## # ... with 1,059 more rows