We are going to use the dplyr library throughout this lesson; it is loaded automatically as part of the tidyverse. The dplyr library contains five main functions that have the following purposes:
Order the rows however you want using arrange()
Pull only certain data using filter()
Make new variables from existing variables using mutate()
Select a group of variables by name using select()
Get basic summaries of the data using summarise()
library(tidyverse)
## -- Attaching packages ----------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0 v purrr 0.2.5
## v tibble 1.4.2 v dplyr 0.7.7
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts -------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
shooting<-read_csv(file="C:/Users/ankit/OneDrive/Desktop/Robotics Scouting/Data Sets/nba_history.csv")
## Parsed with column specification:
## cols(
## PLAYER = col_character(),
## SEASON = col_integer(),
## FGM = col_integer(),
## FGA = col_integer(),
## TPM = col_integer(),
## TPA = col_integer(),
## FTM = col_integer(),
## FTA = col_integer(),
## FGP = col_double(),
## TPP = col_double(),
## FTP = col_double()
## )
The arrange() function takes a tibble and one or more column names, and sorts the rows according to the values in the columns we specify.
arrange(shooting,FGM)
## # A tibble: 7,447 x 11
## PLAYER SEASON FGM FGA TPM TPA FTM FTA FGP TPP FTP
## <chr> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 Dell Demps 1997 0 3 0 1 2 2 0 0 1
## 2 James Sco~ 1997 0 8 0 4 1 2 0 0 0.5
## 3 Darrick M~ 2002 0 10 0 2 1 2 0 0 0.5
## 4 Alvin Wil~ 2006 0 3 0 2 1 2 0 0 0.5
## 5 Alvin Wil~ 2007 0 2 0 1 2 4 0 0 0.5
## 6 Roger Pow~ 2007 0 3 0 1 2 2 0 0 1
## 7 Patrick E~ 2011 0 5 0 2 3 4 0 0 0.75
## 8 Will Conr~ 2013 0 9 0 3 1 2 0 0 0.5
## 9 Scotty Ho~ 2014 0 4 0 2 1 2 0 0 0.5
## 10 Mustafa S~ 2014 0 3 0 1 1 2 0 0 0.5
## # ... with 7,437 more rows
What happened here is that our dataset, shooting, was arranged from least to greatest by FGM, and we see many players with 0 FGM. You will also notice that you cannot see every entry, because printing a tibble shows only the first 10 rows. If we wanted to see the whole arranged shooting tibble, we could use the View() function.
View(arrange(shooting,FGM))
If you ever want to view the full tibble at any time, just use the View() function. R is case-sensitive, so always write View() with a capital V.
We can do the exact same thing but in descending order with the simple desc() function.
arrange(shooting, desc(FGM))
## # A tibble: 7,447 x 11
## PLAYER SEASON FGM FGA TPM TPA FTM FTA FGP TPP FTP
## <chr> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 Kobe Brya~ 2006 978 2173 180 518 696 819 0.450 0.347 0.850
## 2 Shaquille~ 2000 956 1665 0 1 432 824 0.574 0 0.524
## 3 Michael J~ 1997 920 1892 111 297 480 576 0.486 0.374 0.833
## 4 Michael J~ 1998 881 1893 30 126 565 721 0.465 0.238 0.784
## 5 LeBron Ja~ 2006 875 1823 127 379 601 814 0.480 0.335 0.738
## 6 Kobe Brya~ 2003 868 1924 124 324 601 713 0.451 0.383 0.843
## 7 Karl Malo~ 1997 864 1571 0 13 521 690 0.550 0 0.755
## 8 Dwyane Wa~ 2009 854 1739 88 278 590 771 0.491 0.317 0.765
## 9 Kevin Dur~ 2014 849 1688 192 491 703 805 0.503 0.391 0.873
## 10 Tracy McG~ 2003 829 1813 173 448 576 726 0.457 0.386 0.793
## # ... with 7,437 more rows
We can even sort by more than one column at once. Rows that tie on the first column are ordered by the second, and so on, as shown below:
arrange(shooting, FGM, FGA, TPM, TPA)
## # A tibble: 7,447 x 11
## PLAYER SEASON FGM FGA TPM TPA FTM FTA FGP TPP FTP
## <chr> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 Alvin Wil~ 2007 0 2 0 1 2 4 0 0 0.5
## 2 Dell Demps 1997 0 3 0 1 2 2 0 0 1
## 3 Roger Pow~ 2007 0 3 0 1 2 2 0 0 1
## 4 Mustafa S~ 2014 0 3 0 1 1 2 0 0 0.5
## 5 Alvin Wil~ 2006 0 3 0 2 1 2 0 0 0.5
## 6 Scotty Ho~ 2014 0 4 0 2 1 2 0 0 0.5
## 7 Patrick E~ 2011 0 5 0 2 3 4 0 0 0.75
## 8 Quincy Mi~ 2015 0 6 0 3 2 4 0 0 0.5
## 9 Darius Jo~ 2014 0 7 0 3 0 2 0 0 0
## 10 James Sco~ 1997 0 8 0 4 1 2 0 0 0.5
## # ... with 7,437 more rows
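To make the tie-breaking easier to see, here is a small sketch on a made-up four-row tibble (the toy tibble and its values are invented for illustration and are not rows from nba_history.csv). It also shows that desc() can wrap a single column, so ascending and descending sorts can be mixed in one call.

```r
library(dplyr)

# Toy data, made up for illustration (not taken from nba_history.csv)
toy <- tibble(
  PLAYER = c("A", "B", "C", "D"),
  FGM    = c(0, 0, 5, 5),
  FGA    = c(8, 3, 20, 12)
)

# Sort by FGM first; rows tied on FGM are then ordered by FGA
by_fgm_then_fga <- arrange(toy, FGM, FGA)

# Highest FGM first, and within a tie the fewest attempts first
mixed <- arrange(toy, desc(FGM), FGA)

by_fgm_then_fga$PLAYER   # "B" "A" "D" "C"
mixed$PLAYER             # "D" "C" "B" "A"
```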
We can see from the arranged output that there are a bunch of players who have pretty much done nothing, and when we want to do meaningful things with datasets those useless numbers don’t help us. We can use the filter() function to pull out only the rows that satisfy a logical statement we choose, such as FGM>500 or TPA<=100.
There are a variety of logical operators to help us make comparisons:
== used for saying equal to
!= used for not equal to
< and <= used for less than and less than or equal to respectively
> and >= used for greater than and greater than or equal to respectively
&, |, ! used for AND, OR, NOT respectively
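The operators above can be tried directly on plain vectors before using them inside filter(). This is a minimal sketch with made-up numbers (the two vectors here only borrow the column names TPM and FGM). Each comparison returns one TRUE or FALSE per element, and filter() keeps the rows where the result is TRUE.

```r
# Made-up values for illustration
TPM <- c(50, 120, 100)
FGM <- c(400, 250, 300)

is_equal   <- TPM == 100              # FALSE FALSE  TRUE
not_equal  <- TPM != 100              #  TRUE  TRUE FALSE
at_least   <- TPM >= 100              # FALSE  TRUE  TRUE
both_true  <- TPM > 100 & FGM > 300   # FALSE FALSE FALSE
either     <- TPM > 100 | FGM > 300   #  TRUE  TRUE FALSE
negated    <- !(TPM > 100)            #  TRUE FALSE  TRUE
```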
Let’s filter the data down to only players who have made more than 100 Three Pointers:
filter(shooting, TPM>100)
## # A tibble: 910 x 11
## PLAYER SEASON FGM FGA TPM TPA FTM FTA FGP TPP FTP
## <chr> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 Michael J~ 1997 920 1892 111 297 480 576 0.486 0.374 0.833
## 2 Glen Rice 1997 722 1513 207 440 464 535 0.477 0.470 0.867
## 3 Mitch Ric~ 1997 717 1578 204 477 457 531 0.454 0.428 0.861
## 4 Latrell S~ 1997 649 1444 147 415 493 585 0.449 0.354 0.843
## 5 Allen Ive~ 1997 625 1504 155 455 382 544 0.416 0.341 0.702
## 6 Gary Payt~ 1997 706 1482 119 380 254 355 0.476 0.313 0.715
## 7 Reggie Mi~ 1997 552 1244 229 536 418 475 0.444 0.427 0.88
## 8 Jerry Sta~ 1997 533 1308 102 342 511 667 0.407 0.298 0.766
## 9 Tim Harda~ 1997 575 1384 203 590 291 364 0.415 0.344 0.799
## 10 Scottie P~ 1997 648 1366 156 424 204 291 0.474 0.368 0.701
## # ... with 900 more rows
Notice how the size of the tibble went down from 7,447 rows to 910 rows.
If we were to view our shooting tibble, we would see that it still contains all 7,447 rows, because filter() returns a new tibble and leaves the original unchanged. To keep a separate tibble for our TPM>100 condition, we just assign the result to another name.
tpm_shooting<-filter(shooting, TPM>100)
arrange(tpm_shooting,FGM)
## # A tibble: 910 x 11
## PLAYER SEASON FGM FGA TPM TPA FTM FTA FGP TPP FTP
## <chr> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 Anthony T~ 2014 129 307 102 247 33 41 0.420 0.413 0.805
## 2 James Jon~ 2011 146 346 123 287 60 72 0.422 0.429 0.833
## 3 Damon Jon~ 2008 151 363 115 276 20 28 0.416 0.417 0.714
## 4 Shane Bat~ 2013 152 362 136 316 32 38 0.420 0.430 0.842
## 5 Jason Kidd 2013 152 409 114 325 40 48 0.372 0.351 0.833
## 6 Matt Bonn~ 2012 154 350 105 250 16 21 0.44 0.42 0.762
## 7 James Pos~ 2006 159 395 117 290 48 61 0.403 0.403 0.787
## 8 Steve Nov~ 2012 161 337 133 282 22 26 0.478 0.472 0.846
## 9 Anthony T~ 2015 161 393 106 290 54 69 0.410 0.366 0.783
## 10 Charlie W~ 2003 165 414 101 267 41 53 0.399 0.378 0.774
## # ... with 900 more rows
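The same point can be checked on a tiny example: filter() hands back a new tibble, and the original only "changes" if we assign over it. The toy tibble below is made up for illustration.

```r
library(dplyr)

# Toy data, made up for illustration
toy <- tibble(PLAYER = c("A", "B", "C"), TPM = c(50, 120, 101))

# Calling filter() without assigning only prints the result
filter(toy, TPM > 100)
rows_before <- nrow(toy)     # 3: toy itself is untouched

# Assigning the result to a new name keeps both versions around
toy_tpm <- filter(toy, TPM > 100)
rows_kept <- nrow(toy_tpm)   # 2
```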
Now let’s try filtering the data by TPM>100 or FGM>300
filter(shooting,TPM>100|FGM>300)
## # A tibble: 2,325 x 11
## PLAYER SEASON FGM FGA TPM TPA FTM FTA FGP TPP FTP
## <chr> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 Michael J~ 1997 920 1892 111 297 480 576 0.486 0.374 0.833
## 2 Karl Malo~ 1997 864 1571 0 13 521 690 0.550 0 0.755
## 3 Glen Rice 1997 722 1513 207 440 464 535 0.477 0.470 0.867
## 4 Shaquille~ 1997 552 991 0 4 232 479 0.557 0 0.484
## 5 Mitch Ric~ 1997 717 1578 204 477 457 531 0.454 0.428 0.861
## 6 Latrell S~ 1997 649 1444 147 415 493 585 0.449 0.354 0.843
## 7 Allen Ive~ 1997 625 1504 155 455 382 544 0.416 0.341 0.702
## 8 Hakeem Ol~ 1997 727 1426 5 16 351 446 0.510 0.312 0.787
## 9 Patrick E~ 1997 655 1342 2 9 439 582 0.488 0.222 0.754
## 10 LaPhonso ~ 1997 445 1014 95 259 218 282 0.439 0.367 0.773
## # ... with 2,315 more rows
We can combine multiple logical statements in one filter() call, rather than filtering several times, by using parentheses to group them:
filter(shooting, (TPM>100 & FGM>300) | (TPP >= 0.40 & FGP>=0.45))
## # A tibble: 1,063 x 11
## PLAYER SEASON FGM FGA TPM TPA FTM FTA FGP TPP FTP
## <chr> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 Michael J~ 1997 920 1892 111 297 480 576 0.486 0.374 0.833
## 2 Glen Rice 1997 722 1513 207 440 464 535 0.477 0.470 0.867
## 3 Mitch Ric~ 1997 717 1578 204 477 457 531 0.454 0.428 0.861
## 4 Latrell S~ 1997 649 1444 147 415 493 585 0.449 0.354 0.843
## 5 Allen Ive~ 1997 625 1504 155 455 382 544 0.416 0.341 0.702
## 6 Gary Payt~ 1997 706 1482 119 380 254 355 0.476 0.313 0.715
## 7 Reggie Mi~ 1997 552 1244 229 536 418 475 0.444 0.427 0.88
## 8 Jerry Sta~ 1997 533 1308 102 342 511 667 0.407 0.298 0.766
## 9 Tim Harda~ 1997 575 1384 203 590 291 364 0.415 0.344 0.799
## 10 Scottie P~ 1997 648 1366 156 424 204 291 0.474 0.368 0.701
## # ... with 1,053 more rows
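Parentheses matter because R evaluates & before |, much like multiplication before addition. The sketch below uses a made-up three-row tibble to show that leaving the parentheses off gives the same grouping as the first version, while moving them asks a different question. Writing them out anyway makes the intent obvious to the reader.

```r
library(dplyr)

# Toy data, made up for illustration
toy <- tibble(
  PLAYER = c("A", "B", "C"),
  TPM    = c(150, 150, 20),
  FGM    = c(400, 100, 100),
  TPP    = c(0.30, 0.30, 0.45)
)

# (high volume) OR (accurate)
with_parens <- filter(toy, (TPM > 100 & FGM > 300) | TPP >= 0.40)

# Same result: & is evaluated before |
no_parens <- filter(toy, TPM > 100 & FGM > 300 | TPP >= 0.40)

# Different question: many threes AND (many field goals OR accurate)
regrouped <- filter(toy, TPM > 100 & (FGM > 300 | TPP >= 0.40))

with_parens$PLAYER   # "A" "C"
no_parens$PLAYER     # "A" "C"
regrouped$PLAYER     # "A"
```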
The NBA had lockouts that shortened the 1999 and 2012 seasons, which may skew our data. We can remove those seasons quickly using the %in% operator, which lets us match against several values at once rather than writing SEASON== for every season we don’t want.
filter(shooting, !SEASON %in% c(1999, 2012))
## # A tibble: 6,721 x 11
## PLAYER SEASON FGM FGA TPM TPA FTM FTA FGP TPP FTP
## <chr> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 Michael J~ 1997 920 1892 111 297 480 576 0.486 0.374 0.833
## 2 Karl Malo~ 1997 864 1571 0 13 521 690 0.550 0 0.755
## 3 Glen Rice 1997 722 1513 207 440 464 535 0.477 0.470 0.867
## 4 Shaquille~ 1997 552 991 0 4 232 479 0.557 0 0.484
## 5 Mitch Ric~ 1997 717 1578 204 477 457 531 0.454 0.428 0.861
## 6 Latrell S~ 1997 649 1444 147 415 493 585 0.449 0.354 0.843
## 7 Allen Ive~ 1997 625 1504 155 455 382 544 0.416 0.341 0.702
## 8 Hakeem Ol~ 1997 727 1426 5 16 351 446 0.510 0.312 0.787
## 9 Patrick E~ 1997 655 1342 2 9 439 582 0.488 0.222 0.754
## 10 LaPhonso ~ 1997 445 1014 95 259 218 282 0.439 0.367 0.773
## # ... with 6,711 more rows
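To see what %in% is saving us from, here is a small sketch on made-up data comparing it with the long-hand version built from != and &. The toy tibble is invented for illustration.

```r
library(dplyr)

# Toy data, made up for illustration
toy <- tibble(
  PLAYER = c("A", "B", "C", "D"),
  SEASON = c(1998L, 1999L, 2012L, 2013L)
)

# Short version: drop every row whose SEASON is in the list
with_in <- filter(toy, !SEASON %in% c(1999, 2012))

# Long-hand version: one comparison per season, joined with &
with_neq <- filter(toy, SEASON != 1999 & SEASON != 2012)

with_in$SEASON    # 1998 2013
with_neq$SEASON   # 1998 2013
```

The long-hand version grows by one comparison for every extra season, while the %in% version only needs another number inside c().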
For the rest of this lesson, let’s focus only on players who attempted at least 150 field goals, 200 free throws, and 75 three pointers. We will also drop the lockout seasons, plus 2015 because I hate the Golden State Warriors.
new_shooting <- filter(shooting, FGA >= 150 & FTA >= 200 & TPA >= 75 & !SEASON %in% c(1999, 2012, 2015))
new_shooting
## # A tibble: 1,069 x 11
## PLAYER SEASON FGM FGA TPM TPA FTM FTA FGP TPP FTP
## <chr> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 Michael J~ 1997 920 1892 111 297 480 576 0.486 0.374 0.833
## 2 Glen Rice 1997 722 1513 207 440 464 535 0.477 0.470 0.867
## 3 Mitch Ric~ 1997 717 1578 204 477 457 531 0.454 0.428 0.861
## 4 Latrell S~ 1997 649 1444 147 415 493 585 0.449 0.354 0.843
## 5 Allen Ive~ 1997 625 1504 155 455 382 544 0.416 0.341 0.702
## 6 LaPhonso ~ 1997 445 1014 95 259 218 282 0.439 0.367 0.773
## 7 Kendall G~ 1997 644 1453 74 220 427 536 0.443 0.336 0.797
## 8 Gary Payt~ 1997 706 1482 119 380 254 355 0.476 0.313 0.715
## 9 Reggie Mi~ 1997 552 1244 229 536 418 475 0.444 0.427 0.88
## 10 Glenn Rob~ 1997 669 1438 63 180 288 364 0.465 0.35 0.791
## # ... with 1,059 more rows
We have our new table, but we realize that it is missing some other statistics we want, such as Effective Field Goal Percentage, Total Points, and True Shooting Percentage. The mutate() function lets us add them by creating a new variable from existing variables. Let’s overwrite our current new_shooting to add EFGP.
new_shooting<-mutate(new_shooting, EFGP = (FGM + 0.5 * TPM)/FGA)
new_shooting
## # A tibble: 1,069 x 12
## PLAYER SEASON FGM FGA TPM TPA FTM FTA FGP TPP FTP
## <chr> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 Micha~ 1997 920 1892 111 297 480 576 0.486 0.374 0.833
## 2 Glen ~ 1997 722 1513 207 440 464 535 0.477 0.470 0.867
## 3 Mitch~ 1997 717 1578 204 477 457 531 0.454 0.428 0.861
## 4 Latre~ 1997 649 1444 147 415 493 585 0.449 0.354 0.843
## 5 Allen~ 1997 625 1504 155 455 382 544 0.416 0.341 0.702
## 6 LaPho~ 1997 445 1014 95 259 218 282 0.439 0.367 0.773
## 7 Kenda~ 1997 644 1453 74 220 427 536 0.443 0.336 0.797
## 8 Gary ~ 1997 706 1482 119 380 254 355 0.476 0.313 0.715
## 9 Reggi~ 1997 552 1244 229 536 418 475 0.444 0.427 0.88
## 10 Glenn~ 1997 669 1438 63 180 288 364 0.465 0.35 0.791
## # ... with 1,059 more rows, and 1 more variable: EFGP <dbl>
You can also add the other two categories yourself with the given formulas:
PTS=FTM+2*FGM+TPM
TSP=PTS/(2*(FGA+0.44*FTA))
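If you want to check your work, here is one possible way to write both formulas in a single mutate() call. It is only a sketch: it uses one row of counts copied from the first row of the new_shooting output above (SEASON 1997) rather than the full tibble, and row1 is just a name chosen for this example.

```r
library(dplyr)

# Counts copied from the first row of the new_shooting output above
row1 <- tibble(FGM = 920, FGA = 1892, TPM = 111, FTM = 480, FTA = 576)

# mutate() can create several columns at once, and a later column (TSP)
# may use one created earlier in the same call (PTS)
row1 <- mutate(row1,
  PTS = FTM + 2 * FGM + TPM,
  TSP = PTS / (2 * (FGA + 0.44 * FTA))
)

row1$PTS   # 2431
row1$TSP   # a little under 0.567
```

Running the same mutate() call on new_shooting instead of row1 would add both columns for every player.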
We have used mutate() to create numeric variables. If we wanted to assign qualitative categories such as bad or good to a player’s performance, mutate() alone would not be enough. Instead we can combine it with the case_when() function, which lets us categorize however we choose, as shown below. Let’s assign the following categories to players’ Three Point Percentages.
Garbage: TPP < 20%
Better Garbage: 20% <= TPP < 30%
Fine: 30% <= TPP < 35%
Pretty Good: 35% <= TPP < 40%
Godly: TPP >= 40%
new_shooting <- mutate(new_shooting,
TPPTier= case_when(
TPP < 0.2 ~ "Garbage",
0.2 <= TPP & TPP < 0.3 ~ "Better Garbage",
0.3 <= TPP & TPP < 0.35 ~ "Fine",
0.35 <= TPP & TPP < 0.4 ~ "Pretty Good",
0.4 <= TPP ~ "Godly"))
View(new_shooting)
You can see that R created tiers for each player based on their Three Point Percentages.
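One detail worth knowing is that case_when() checks its conditions from top to bottom and uses the first one that is TRUE. That means the lower bounds can be left out, as in the sketch below. It uses a made-up column of percentages (including one missing value) and is an alternative way of writing the same tiers, not a replacement for the code above.

```r
library(dplyr)

# Made-up percentages for illustration, including a missing value
toy <- tibble(TPP = c(0.15, 0.20, 0.349, 0.35, 0.40, NA))

toy <- mutate(toy,
  TPPTier = case_when(
    TPP < 0.2  ~ "Garbage",
    TPP < 0.3  ~ "Better Garbage",   # only reached when TPP >= 0.2
    TPP < 0.35 ~ "Fine",             # only reached when TPP >= 0.3
    TPP < 0.4  ~ "Pretty Good",      # only reached when TPP >= 0.35
    TPP >= 0.4 ~ "Godly"
  )
)

toy$TPPTier
# "Garbage" "Better Garbage" "Fine" "Pretty Good" "Godly" NA
```

A row where no condition is TRUE, such as the missing value here, gets NA as its tier.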
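If we want to find the average Three Point Percentage for the 2004 season, it is pretty simple. We can filter() down to that season and pass the result to summarise(), which collapses the rows into a single summary row using functions we learned in previous lessons.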
shooting_2004 <- filter(shooting, SEASON == 2004)
summarise(shooting_2004, TPP = mean(TPP))
## # A tibble: 1 x 1
## TPP
## <dbl>
## 1 0.262
We could do that for multiple categories if we wanted, which you could try out yourself.
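As a starting point for trying it yourself, here is a sketch of summarise() computing several summaries in one call. The toy tibble and the summary names (mean_TPP, mean_FTP, best_TPP, players) are made up for illustration.

```r
library(dplyr)

# Toy data, made up for illustration
toy <- tibble(
  TPP = c(0.30, 0.40, 0.35),
  FTP = c(0.70, 0.90, 0.80)
)

# Each name = expression pair becomes one column of the one-row result
toy_summary <- summarise(toy,
  mean_TPP = mean(TPP),
  mean_FTP = mean(FTP),
  best_TPP = max(TPP),
  players  = n()          # n() counts the rows being summarised
)

toy_summary
```

If a column contains missing values, mean() returns NA unless you write mean(TPP, na.rm = TRUE).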
We have a lot of columns here, and we may not care about some of them. To look at only certain columns we can use the select() function, which does exactly what it says: it selects certain columns.
new_shooting_tpptier<-select(new_shooting, PLAYER, SEASON,TPP,TPPTier)
new_shooting_tpptier
## # A tibble: 1,069 x 4
## PLAYER SEASON TPP TPPTier
## <chr> <int> <dbl> <chr>
## 1 Michael Jordan 1997 0.374 Pretty Good
## 2 Glen Rice 1997 0.470 Godly
## 3 Mitch Richmond 1997 0.428 Godly
## 4 Latrell Sprewell 1997 0.354 Pretty Good
## 5 Allen Iverson 1997 0.341 Fine
## 6 LaPhonso Ellis 1997 0.367 Pretty Good
## 7 Kendall Gill 1997 0.336 Fine
## 8 Gary Payton 1997 0.313 Fine
## 9 Reggie Miller 1997 0.427 Godly
## 10 Glenn Robinson 1997 0.35 Pretty Good
## # ... with 1,059 more rows
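Listing every column by name works, but select() also accepts a few shortcuts. The sketch below shows three of them on a made-up one-row tibble: a range of neighbouring columns with :, dropping columns with a minus sign, and the ends_with() helper. The toy tibble is invented for illustration and only borrows column names from our data.

```r
library(dplyr)

# Toy data, made up for illustration
toy <- tibble(
  PLAYER = "A", SEASON = 1997L,
  FGM = 10L, FGA = 20L,
  FGP = 0.5, TPP = 0.4
)

# Every column from PLAYER through FGA, in their existing order
range_cols <- names(select(toy, PLAYER:FGA))

# Everything except the columns marked with a minus sign
dropped <- names(select(toy, -FGM, -FGA))

# PLAYER plus every column whose name ends in "P"
pct_cols <- names(select(toy, PLAYER, ends_with("P")))

range_cols   # "PLAYER" "SEASON" "FGM" "FGA"
dropped      # "PLAYER" "SEASON" "FGP" "TPP"
pct_cols     # "PLAYER" "FGP" "TPP"
```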