In the last lesson we did a lot of basic R calculations, but now we can get more advanced by using libraries. The tidyverse is a collection of packages bundled under one name. Let’s load it. We only need to install a package once, but we need to load it with the library() command in every new R session. If you have not installed it yet, run install.packages("tidyverse") before loading the library.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.1
## -- Attaching packages ----------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0 v purrr 0.2.5
## v tibble 1.4.2 v dplyr 0.7.7
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## Warning: package 'ggplot2' was built under R version 3.5.1
## Warning: package 'tibble' was built under R version 3.5.1
## Warning: package 'tidyr' was built under R version 3.5.1
## Warning: package 'readr' was built under R version 3.5.1
## Warning: package 'purrr' was built under R version 3.5.1
## Warning: package 'dplyr' was built under R version 3.5.1
## Warning: package 'stringr' was built under R version 3.5.1
## Warning: package 'forcats' was built under R version 3.5.1
## -- Conflicts -------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
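We can see all the different packages that the tidyverse attaches together, and we will make use of these as we progress. Don’t worry about the warnings and conflicts: the warnings only say that the packages were built under a newer version of R than the one running here, and the conflicts say that dplyr’s filter() and lag() now take precedence over the functions of the same name in the stats package.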
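To make the conflicts message concrete, here is a minimal sketch showing that both versions of filter() stay available through the package::function form. The small data frame and the moving-average weights are made up for illustration.

```r
library(dplyr)

# Made-up data for illustration only
small <- data.frame(x = 1:5)

# dplyr::filter() keeps the rows that satisfy a condition
kept <- dplyr::filter(small, x > 3)

# stats::filter() is a different tool: it applies a linear filter to a series,
# here a three-point moving average
smoothed <- stats::filter(1:5, rep(1 / 3, 3))

kept
smoothed
```

Writing the package name in front is only needed when you want the masked stats version; a plain filter() call will use dplyr once the tidyverse is loaded.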
As our data gets larger, it becomes beneficial to keep it in tabular form. In the tidyverse, the standard way to work with tabular data is a tbl (pronounced “tibble”; I guess it sounds cooler than table). A tibble is essentially a two-dimensional table whose columns can each hold a different data type. For example, we could have a column of characters followed by a column of doubles.
To practice with tibbles let’s use our data from Lesson 1 to save some time.
players <- c("Ray Allen", "Gary Payton", "Shawn Kemp", "Mitch Richmond", "Rick Barry", "Oscar Robertson", "Hakeem Olajuwon", "Vince Carter", "Moses Malone", "Patrick Ewing")
fgm <- c(804, 571, 470, 710, 698, 737, 498, 373, 215, 552)
fga <- c(1596, 1341, 1033, 1617, 1381, 1416, 1112, 886, 442, 1061)
tpm <- c(401, 118, 63, 236, 186, 87, 126, 81, 0, 2)
tpa <- c(877, 328, 270, 657, 480, 282, 342, 243, 2, 6)
ftm <- c(362, 278, 397, 720, 447, 359, 250, 201, 92, 208)
fta <- c(390, 348, 477, 837, 498, 491, 280, 240, 131, 586)
To create the tibble we do the following:
nba_legends_data <- tibble(
  Players = players,
  FGM = fgm,
  FGA = fga,
  TPM = tpm,
  TPA = tpa,
  FTM = ftm,
  FTA = fta
)
We use the function tibble() to create nba_legends_data. Within the parentheses, we have several expressions of the form name = vector. The name on the left of the equal sign sets the column heading, and the vector on the right supplies the values for that column. It’s generally better to put one expression per line: the columns are easier to tell apart and the code is easier to debug. Now let’s run nba_legends_data and see if we have a tibble.
nba_legends_data
## # A tibble: 10 x 7
## Players FGM FGA TPM TPA FTM FTA
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Ray Allen 804 1596 401 877 362 390
## 2 Gary Payton 571 1341 118 328 278 348
## 3 Shawn Kemp 470 1033 63 270 397 477
## 4 Mitch Richmond 710 1617 236 657 720 837
## 5 Rick Barry 698 1381 186 480 447 498
## 6 Oscar Robertson 737 1416 87 282 359 491
## 7 Hakeem Olajuwon 498 1112 126 342 250 280
## 8 Vince Carter 373 886 81 243 201 240
## 9 Moses Malone 215 442 0 2 92 131
## 10 Patrick Ewing 552 1061 2 6 208 586
Like we did in Lesson 1, we can subset a tibble and pull out a specific column by giving the tibble name and the heading we want inside double square brackets.
quantile(nba_legends_data[["FTM"]])
## 0% 25% 50% 75% 100%
## 92.00 218.50 318.50 388.25 720.00
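As a small sketch of the different ways to subset, the example below uses a three-row tibble built from the first three players above (the name toy is made up for illustration). Double brackets and $ both return the column as a plain vector, while single brackets return a one-column tibble.

```r
library(tibble)

# Three rows taken from the table above; "toy" is an illustrative name
toy <- tibble(
  Players = c("Ray Allen", "Gary Payton", "Shawn Kemp"),
  FTM = c(362, 278, 397)
)

ftm_vector <- toy[["FTM"]]  # a numeric vector
ftm_dollar <- toy$FTM       # the same vector, shorter syntax
ftm_tibble <- toy["FTM"]    # still a tibble, with one column

ftm_vector
ftm_tibble
```

Functions such as quantile() expect a vector, which is why the double-bracket form is used above.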
You might have noticed that entering stats for each player was a relatively tedious process. Now imagine doing that for thousands of players; of course, we would never manually enter that many rows of data. Luckily, data often comes in CSV (comma-separated values) files, which are plain-text spreadsheets that Excel can open and save, and R makes it easy to load them as tibbles. We will use the read_csv() command, since most of the files we will use are CSVs; it comes from readr, which is part of the tidyverse, so there is no need to load extra libraries. Let’s create a tibble from the nba_history csv file. Your file location will be different from mine depending on where you keep your datasets, so write the location of your own copy inside read_csv().
nba_history <- read_csv(file="C:/Users/ankit/OneDrive/Desktop/Robotics Scouting/Data Sets/nba_history.csv")
## Parsed with column specification:
## cols(
## PLAYER = col_character(),
## SEASON = col_integer(),
## FGM = col_integer(),
## FGA = col_integer(),
## TPM = col_integer(),
## TPA = col_integer(),
## FTM = col_integer(),
## FTA = col_integer(),
## FGP = col_double(),
## TPP = col_double(),
## FTP = col_double()
## )
nba_history
## # A tibble: 7,447 x 11
## PLAYER SEASON FGM FGA TPM TPA FTM FTA FGP TPP FTP
## <chr> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 Michael J~ 1997 920 1892 111 297 480 576 0.486 0.374 0.833
## 2 Karl Malo~ 1997 864 1571 0 13 521 690 0.550 0 0.755
## 3 Glen Rice 1997 722 1513 207 440 464 535 0.477 0.470 0.867
## 4 Shaquille~ 1997 552 991 0 4 232 479 0.557 0 0.484
## 5 Mitch Ric~ 1997 717 1578 204 477 457 531 0.454 0.428 0.861
## 6 Latrell S~ 1997 649 1444 147 415 493 585 0.449 0.354 0.843
## 7 Allen Ive~ 1997 625 1504 155 455 382 544 0.416 0.341 0.702
## 8 Hakeem Ol~ 1997 727 1426 5 16 351 446 0.510 0.312 0.787
## 9 Patrick E~ 1997 655 1342 2 9 439 582 0.488 0.222 0.754
## 10 LaPhonso ~ 1997 445 1014 95 259 218 282 0.439 0.367 0.773
## # ... with 7,437 more rows
Take note of the doubles, integers and characters. In my experience, I have spent tons of time trying to figure out why I couldn’t manipulate data in R once it was loaded into a tibble, only to find that the data I was trying to manipulate was being read as characters. If you have numbers being read as characters, it is often easiest to just fix the formatting in the actual file.
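If editing the file is not convenient, one possible alternative is to tell read_csv() how to read the columns. The sketch below writes a tiny made-up CSV to a temporary file (the player names, values and the "n/a" marker are all hypothetical) and then uses the na and col_types arguments of read_csv().

```r
library(readr)

# Hypothetical CSV written to a temporary file, for illustration only
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("PLAYER,FGM,FGA",
    "Player A,100,200",
    "Player B,n/a,150"),
  csv_path
)

# With the defaults, the "n/a" entry makes FGM come in as character
raw <- read_csv(csv_path)
class(raw$FGM)

# Treat "n/a" as missing and state the column types explicitly
fixed <- read_csv(
  csv_path,
  na = c("", "NA", "n/a"),
  col_types = cols(
    PLAYER = col_character(),
    FGM = col_double(),
    FGA = col_double()
  )
)
class(fixed$FGM)
```

Which markers count as missing depends on your own file, so check a few rows before deciding what to pass to na.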
Plotting helps us study the distribution of individual columns and the relationships between variables. For example, is a player’s field goal percentage predictive of their three-point percentage?
The graphics we produce are created with ggplot2, a package that is part of the tidyverse. Essentially, ggplot2 builds a plot by layering graphics and features on top of each other. More info on ggplot2 can be found here: https://ggplot2.tidyverse.org/. For now we will stick to basic plotting and get to more advanced stuff later.
A histogram is a plot that lets you discover, and show, the underlying frequency distribution (shape) of a set of continuous data. Let’s make a histogram for Three Point Percentage (TPP).
ggplot(data = nba_history) +
  geom_histogram(mapping = aes(x = TPP), bins = 20)
The first part of the code, ggplot(data = nba_history), tells R to generate a coordinate system that more layers can be added to, and that the data we want to plot comes from our nba_history tibble. The next part, + geom_histogram(mapping = aes(x = TPP), bins = 20), adds a layer to the plot, with the layer type being a histogram as specified by geom_histogram. The mapping sets how each variable in our data is mapped to the visual parts of the histogram, which are called the aesthetics. Because we are making a histogram, we only have to specify the x axis; the y axis is the count of values in each bin, which R computes automatically. Once again we split the code over two lines to reduce clutter, and it is important that the + sign sits at the end of the first line when splitting it up. The bins argument sets the number of intervals the values fall into. More bins show finer detail in the distribution, although too many bins can make the histogram look noisy.
ggplot(data = nba_history) +
  geom_histogram(mapping = aes(x = TPP), bins = 75)
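Besides the number of bins, geom_histogram() also accepts a binwidth argument that sets the width of each interval directly. The sketch below compares the two on simulated percentages (fake_shooting is made-up data, not the nba_history file), so it can be run without the dataset.

```r
library(ggplot2)

# Simulated three-point percentages, for illustration only
set.seed(42)
fake_shooting <- data.frame(TPP = runif(500, min = 0, max = 0.6))

# Control the histogram by the number of bins...
by_count <- ggplot(data = fake_shooting) +
  geom_histogram(mapping = aes(x = TPP), bins = 20)

# ...or by the width of each bin, in the units of the x axis
by_width <- ggplot(data = fake_shooting) +
  geom_histogram(mapping = aes(x = TPP), binwidth = 0.05)

by_count
by_width
```

Either way every observation lands in exactly one bar, so the bar heights always add up to the number of rows.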
If we wanted, we could also make a density histogram, where the total area of the bars is always equal to 1.
ggplot(data = nba_history) +
  geom_histogram(mapping = aes(x = TPP, y = ..density..), bins = 60)
The syntax takes some getting used to and becomes easy with practice. Try making histograms yourself with varying numbers of bins for the other columns and observe how the distributions change.
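One note on the density histogram: the output at the top of this lesson shows ggplot2 3.1.0, where ..density.. is the way to ask for densities. From ggplot2 3.3.0 onward the same thing can be written as after_stat(density), and newer versions warn about the dot-dot form. The sketch below assumes a recent ggplot2 and uses simulated data (fake_shooting is made up) to check that the bar areas really sum to 1.

```r
library(ggplot2)

# Simulated three-point percentages, for illustration only
set.seed(7)
fake_shooting <- data.frame(TPP = rbeta(1000, 5, 10))

# after_stat(density) requires ggplot2 3.3.0 or later
density_hist <- ggplot(data = fake_shooting) +
  geom_histogram(mapping = aes(x = TPP, y = after_stat(density)), bins = 60)

# Pull out the computed bars and add up height * width
bars <- layer_data(density_hist, 1)
total_area <- sum(bars$density * (bars$xmax - bars$xmin))

density_hist
total_area
```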
Visualizing one variable is pretty cool, but relationships between two or more variables are often cooler. R gives us a simple way to compare variables visually and look for relationships between them.
We can make a scatterplot in R with code very similar to the chunk we used to generate the histograms. Let’s look at Field Goal Percentage and Free Throw Percentage.
ggplot(data = nba_history) +
  geom_point(mapping = aes(x = FGP, y = FTP))
The syntax here is much the same, but because we want a scatterplot we use geom_point() instead. We also want two variables, so we added a comma and then y = FTP to map the second axis. One shortcoming here is that we can’t see the density of points, since they all have the same fill. Is the point at (0, 0.5) one player or many players? We can solve this problem with what is known as alpha-blending, which changes the transparency of the points. If we have more points in a region, the region will appear darker.
ggplot(data = nba_history) +
  geom_point(mapping = aes(x = FGP, y = FTP), alpha = 0.08)
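Alpha-blending shows overlap visually; if you want the exact number of rows sitting on the same spot, one option is to count them with dplyr. This is a minimal sketch on made-up values (the shots tibble is hypothetical), not the nba_history data.

```r
library(dplyr)

# Made-up percentages; three rows share the point (0, 0.5)
shots <- tibble(
  FGP = c(0, 0, 0, 0.45, 0.50),
  FTP = c(0.5, 0.5, 0.5, 0.80, 0.75)
)

# Count how many rows fall on each (FGP, FTP) pair, largest first
overlaps <- shots %>%
  count(FGP, FTP) %>%
  arrange(desc(n))

overlaps
```

The n column holds the number of rows at each location, so a value above 1 means several players are plotted on top of each other.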
Let’s get even more advanced by showing the frequency of points within certain areas of the scatterplot. We can do this with a heatmap. The plot is divided into rectangular areas based on the number of bins, and the color of each area represents how many points fall inside it. Once again, more bins give a finer-grained picture.
ggplot(data = nba_history) +
  geom_bin2d(mapping = aes(x = FGP, y = FTP), bins = 150)
Once again, the syntax for creating these graphs will come with practice. Try creating various scatterplots and heatmaps with the other variables and look at the distributions.
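To see the effect of the bins argument on a heatmap without the dataset, here is a sketch on simulated percentages (the fake data frame and its distribution parameters are made up). It builds a coarse and a fine heatmap and compares how many colored tiles each one ends up with.

```r
library(ggplot2)

# Simulated percentages, for illustration only
set.seed(3)
fake <- data.frame(
  FGP = rnorm(2000, mean = 0.45, sd = 0.05),
  FTP = rnorm(2000, mean = 0.75, sd = 0.08)
)

# Few large tiles versus many small tiles
coarse <- ggplot(data = fake) +
  geom_bin2d(mapping = aes(x = FGP, y = FTP), bins = 10)
fine <- ggplot(data = fake) +
  geom_bin2d(mapping = aes(x = FGP, y = FTP), bins = 50)

# The computed tiles behind each plot
coarse_tiles <- layer_data(coarse, 1)
fine_tiles <- layer_data(fine, 1)

nrow(coarse_tiles)
nrow(fine_tiles)
coarse
fine
```

With more bins the same points are spread over more tiles, so each tile holds fewer points and the picture gets more detailed but also patchier.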
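If we wanted to look at only a specific range of the data, we could specify limits and have only that region plotted. Rows that fall outside the limits are dropped from the plot, and R warns us about how many were removed.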
ggplot(data = nba_history) +
  geom_point(aes(x = FTA, y = FTP)) +
  xlim(0, 150)
## Warning: Removed 2656 rows containing missing values (geom_point).
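If you would rather not see that warning, two possible alternatives are sketched below on a tiny made-up table (toy is hypothetical data): filter the rows yourself before plotting, or zoom the view with coord_cartesian(), which keeps every row and only changes what is visible.

```r
library(ggplot2)
library(dplyr)

# Made-up free throw data, for illustration only
toy <- tibble(
  FTA = c(10, 50, 120, 300, 600),
  FTP = c(0.70, 0.80, 0.75, 0.85, 0.60)
)

# Option 1: keep only the rows in range, then plot them
in_range <- toy %>% filter(FTA >= 0, FTA <= 150)
p_filtered <- ggplot(data = in_range) +
  geom_point(aes(x = FTA, y = FTP))

# Option 2: plot everything but zoom the x axis to 0-150
p_zoomed <- ggplot(data = toy) +
  geom_point(aes(x = FTA, y = FTP)) +
  coord_cartesian(xlim = c(0, 150))

p_filtered
p_zoomed
```

For a plain scatterplot the two look the same; the difference matters for layers that compute summaries, since zooming keeps the out-of-range rows in the calculation.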