In this workshop, we’ll show you how to create data visualizations in R with ggplot2, one of the most popular plotting packages.

Start by loading the packages we need. If you don’t have them already, install them first with install.packages().
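
If you are not sure which packages are already installed, something like the following can list the ones that are missing. This is only a sketch: the helper name missing_packages is made up for illustration, and the install.packages() call is left commented out so nothing is downloaded unless you choose to.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages used in this workshop (ggthemes is loaded later, in section 2.6).
to_install <- missing_packages(c("ggplot2", "dplyr", "ggrepel", "ggthemes"))
print(to_install)

# Uncomment to install whatever is missing:
# install.packages(to_install)
```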

library(ggplot2)
library(dplyr)
library(ggrepel)

1. Import & Clean Data

Alright, for our next tasks, we’re going to play around with some data from the NBA_player_regular_season.csv file. Let’s start by bringing that data into R and call it ‘nba’ in our workspace.
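
The import step itself might look like the sketch below. It assumes the file sits at data/NBA_player_regular_season.csv relative to your working directory (the same path used for the re-import later in this section); adjust it if your copy lives elsewhere.

```r
# Assumed location of the file, relative to the working directory.
csv_path <- "data/NBA_player_regular_season.csv"

if (file.exists(csv_path)) {
  nba <- read.csv(csv_path)  # read the CSV into a data frame called `nba`
  str(nba)                   # compact overview of the columns and their types
} else {
  message("File not found: ", csv_path, " (check your working directory)")
}
```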

Run a check of the variables and their types:

summary(nba)
##       id                 year       firstname           lastname        
##  Length:21961       Min.   :1946   Length:21961       Length:21961      
##  Class :character   1st Qu.:1974   Class :character   Class :character  
##  Mode  :character   Median :1988   Mode  :character   Mode  :character  
##                     Mean   :1986                                        
##                     3rd Qu.:1999                                        
##                     Max.   :2009                                        
##      team               leag                gp               minutes    
##  Length:21961       Length:21961       Length:21961       Min.   :   0  
##  Class :character   Class :character   Class :character   1st Qu.: 275  
##  Mode  :character   Mode  :character   Mode  :character   Median :1038  
##                                                           Mean   :1204  
##                                                           3rd Qu.:2009  
##                                                           Max.   :3882  
##       pts              oreb             dreb             reb        
##  Min.   :   0.0   Min.   :  0.00   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.: 113.0   1st Qu.:  0.00   1st Qu.:   1.0   1st Qu.:  44.0  
##  Median : 386.0   Median : 22.00   Median :  60.0   Median : 160.0  
##  Mean   : 531.3   Mean   : 49.85   Mean   : 117.8   Mean   : 229.7  
##  3rd Qu.: 811.0   3rd Qu.: 75.00   3rd Qu.: 180.0   3rd Qu.: 333.0  
##  Max.   :4029.0   Max.   :895.00   Max.   :1538.0   Max.   :2149.0  
##       asts            stl                blk              turnover        
##  Min.   :   0.0   Length:21961       Length:21961       Length:21961      
##  1st Qu.:  20.0   Class :character   Class :character   Class :character  
##  Median :  71.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 118.1                                                           
##  3rd Qu.: 167.0                                                           
##  Max.   :1164.0                                                           
##        pf             fga              fgm              fta        
##  Min.   :  0.0   Min.   :   0.0   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.: 43.0   1st Qu.: 106.0   1st Qu.:  43.0   1st Qu.:  30.0  
##  Median :118.0   Median : 345.0   Median : 148.0   Median :  99.0  
##  Mean   :123.6   Mean   : 452.5   Mean   : 204.3   Mean   : 146.9  
##  3rd Qu.:193.0   3rd Qu.: 696.0   3rd Qu.: 313.0   3rd Qu.: 218.0  
##  Max.   :386.0   Max.   :3159.0   Max.   :1597.0   Max.   :1363.0  
##       ftm             tpa              tpm        
##  Min.   :  0.0   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 20.0   1st Qu.:  0.00   1st Qu.:  0.00  
##  Median : 70.0   Median :  2.00   Median :  0.00  
##  Mean   :109.6   Mean   : 38.08   Mean   : 13.11  
##  3rd Qu.:161.0   3rd Qu.: 27.00   3rd Qu.:  7.00  
##  Max.   :840.0   Max.   :678.00   Max.   :269.00

So, when we look at the summary, it seems that some data like games played, steals, blocks, and turnovers got imported as text instead of numbers. To figure out what’s happening, we can use the ‘table’ function, which can give us a better view of the situation.

table(nba$gp)
## 
##    1   10   11   12   13   14   15   16   17   18   19    2   20   21   22   23 
##  260  206  206  159  176  177  158  174  146  163  161  278  142  182  163  177 
##   24   25   26   27   28   29    3   30   31   32   33   34   35   36   37   38 
##  184  163  161  184  181  173  298  158  179  162  172  128  161  149  156  137 
##   39    4   40   41   42   43   44   45   46   47   48   49    5   50   51   52 
##  162  271  143  163  176  162  153  158  174  195  214  202  237  247  185  163 
##   53   54   55   56   57   58   59    6   60   61   62   63   64   65   66   67 
##  190  184  218  215  187  210  204  243  241  216  236  232  240  283  307  321 
##   68   69    7   70   71   72   73   74   75   76   77   78   79    8   80   81 
##  319  272  223  362  375  506  327  386  465  464  503  580  704  199  867  880 
##   82   83   84   85   86   87   88    9   90    N 
## 1716   80  106    6    7    3    2  200    1    2
table(nba$stl)
## 
##    0    1   10  100  101  102  103  104  105  106  107  108  109   11  110  111 
## 5939  607  262   51   42   41   47   35   44   38   34   42   30  218   40   37 
##  112  113  114  115  116  117  118  119   12  120  121  122  123  124  125  126 
##   19   32   31   24   28   33   30   28  232   22   24   22   30   21   30   22 
##  127  128  129   13  130  131  132  133  134  135  136  137  138  139   14  140 
##   22   16   37  234   10   17   22   13   12   22   19   18   30   17  211   17 
##  141  142  143  144  145  146  147  148  149   15  150  151  152  153  154  155 
##   11    9   12   16    4   10   12    9    6  227   10    8   11    7   10    7 
##  156  157  158  159   16  160  161  162  163  164  165  166  167  168  169   17 
##   10   10   11    7  212    9    7    8    7    4    6   13    9    6    6  199 
##  170  171  172  173  174  175  176  177  178  179   18  180  181  182  183  184 
##    8    5    8    9    5    5    7    6    5    2  188    5    5    3    4    1 
##  185  186  187  188  189   19  190  191  192  193  194  195  196  197  198  199 
##    7    3    4    2    3  202    5    2    2    2    3    1    2    6    1    5 
##    2   20  200  201  202  203  204  205  206  207  208  209   21  210  211  212 
##  479  199    2    4    2    3    3    1    2    4    1    2  204    3    3    4 
##  213  214  215  216  217   22  220  221  222  223  225  227  228  229   23  231 
##    3    2    2    3    2  203    1    2    1    3    2    1    1    1  208    1 
##  232  233  234  236   24  242  243  244  246   25  250  259   26  260  261  263 
##    2    1    2    1  171    1    2    1    1  193    1    1  155    1    1    1 
##  265   27   28  281   29    3   30  301   31   32   33   34  346   35  354   36 
##    1  164  170    1  177  445  181    1  132  165  153  136    1  144    1  155 
##   37   38   39    4   40   41   42   43   44   45   46   47   48   49    5   50 
##  149  148  137  343  155  135  149  133  138  142  130  137  153  128  345  120 
##   51   52   53   54   55   56   57   58   59    6   60   61   62   63   64   65 
##  113  126  125  109  115   98  114   99  123  292  117  114  116  120   95  106 
##   66   67   68   69    7   70   71   72   73   74   75   76   77   78   79    8 
##   99   91   98   85  280  100   88   76   70   77   81   79   79   83   64  280 
##   80   81   82   83   84   85   86   87   88   89    9   90   91   92   93   94 
##   71   80   73   53   68   74   69   62   66   54  255   55   59   51   54   54 
##   95   96   97   98   99 NULL 
##   64   47   39   43   40    1

OK, so this shows us that the data contain the values N and NULL. Let’s re-import the CSV file, this time using the option na.strings = c('N', 'NULL') so that those values are read as missing (NA) and the columns come in as numbers.

nba <- read.csv("data/NBA_player_regular_season.csv",
                na.strings = c('N', 'NULL'))
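
To see what na.strings does in isolation, here is a small sketch on an inline CSV. The three rows are made up for illustration and are not taken from the NBA file.

```r
# Made-up CSV text with the same two placeholder values, "N" and "NULL".
csv_text <- "lastname,gp,stl
Jordan,82,234
Smith,N,NULL
Pippen,79,150"

# Without na.strings, one stray "N" or "NULL" makes the whole column non-numeric.
raw <- read.csv(text = csv_text)

# With na.strings, those placeholders become NA and the columns are read as numbers.
clean <- read.csv(text = csv_text, na.strings = c("N", "NULL"))

sapply(raw, class)
sapply(clean, class)
clean$gp
```

The second row of clean has NA in both gp and stl, which is why ggplot later warns about rows with missing values being removed.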

To begin with, filter the dataset to only include player-seasons with more than 100 total minutes. Then add a new calculated column that adds up some generally positive stats and subtracts some negative ones (turnovers and missed field goals).

nba <- nba %>%
  filter(minutes > 100) %>%
  mutate(
    nba_eff = pts + reb + stl + asts + blk - turnover - (fga - fgm)
    )
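
To check the arithmetic, here is the same filter and formula applied to two made-up player-seasons (the numbers are invented for illustration only).

```r
library(dplyr)

# Made-up numbers for two player-seasons, purely for illustration.
toy <- data.frame(
  lastname = c("A", "B"),
  minutes  = c(500, 50),
  pts = c(100, 10), reb = c(50, 5), stl = c(10, 1), asts = c(20, 2),
  blk = c(5, 0), turnover = c(15, 3), fga = c(80, 12), fgm = c(40, 4)
)

toy_eff <- toy %>%
  filter(minutes > 100) %>%   # player B (50 minutes) is dropped
  mutate(nba_eff = pts + reb + stl + asts + blk - turnover - (fga - fgm))

toy_eff$nba_eff
# 100 + 50 + 10 + 20 + 5 - 15 - (80 - 40) = 130
```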

2. Using ggplot

Let’s start by plotting two variables we might expect to be related: steals and turnovers.

ggplot(data = nba,                      # plot data from this data frame
       aes(x = stl, y = turnover)) +    # specify the mapping between variables in the data and plot elements
  geom_point()                          # ask for points (i.e. a scatter plot)

Okay, but what does it mean? Let’s view that subset of data.

Now we can create a basic scatterplot using the geom_point() function, this time also mapping year to colour.

ggplot(data = nba, 
       aes(x = stl, y = turnover, color = year)) +
  geom_point()
## Warning: Removed 253 rows containing missing values (`geom_point()`).

2.1 Add colour

Let’s break down the code. In the initial part, ‘ggplot(data = nba)’, we’re essentially creating a plot object that is linked to the dataset we want to work with.

ggplot(data = nba, aes(x = tpa, y = tpm)) + 
  geom_point()

What will happen if you run it like this?

ggplot(data = nba)

ggplot(data = nba, aes(x = tpa, y = tpm))

After the initial declaration, we use the ‘+’ operator to add more elements to our ggplot object. In this case, we’re adding a ‘geom_point()’ layer, which draws the data as a scatter plot. The ‘aes(x = tpa, y = tpm)’ part tells ggplot how our data variables map to the visual aspects of the plot.

In the next plot, we’ll introduce a ‘color = team’ argument. This assigns a colour to each point according to the team it represents.

ggplot(nba, aes(x = tpa, y = tpm, color = team)) + 
  geom_point()

However, if you want to set an aesthetic to a fixed value that is not tied to a variable in your dataset, it must go outside of the aes( ) function. For example:

ggplot(nba, aes(x = tpa, y = tpm)) + 
  geom_point(color = "hotpink")
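
The difference between setting and mapping is easiest to see side by side. The sketch below uses a small made-up data frame in place of nba; the object names p_set and p_map are only for illustration.

```r
library(ggplot2)

# Made-up data standing in for the tpa / tpm columns.
toy <- data.frame(tpa = c(10, 50, 120, 200), tpm = c(3, 18, 45, 80))

# Setting: colour is a fixed value, so it goes outside aes().
p_set <- ggplot(toy, aes(x = tpa, y = tpm)) +
  geom_point(color = "hotpink")

# Mapping: inside aes(), "hotpink" is treated as data (a one-level category),
# so the points get a default palette colour and a legend entry named "hotpink".
p_map <- ggplot(toy, aes(x = tpa, y = tpm)) +
  geom_point(aes(color = "hotpink"))

print(p_set)
print(p_map)
```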

Try it out with your favourite colour!

Okay guys, back to business.

2.2 Size

Now, we’ll focus on a specific subset of the data, the Chicago Bulls teams from 1991-1998. In addition, we’ll make the visualization more interesting by mapping the size of each point to the minutes played.

bulls <- nba %>% 
  filter(team == "CHI" & year >= 1991 & year <= 1998)

ggplot(bulls, aes(x = tpm, y = tpa, color = lastname, size = minutes)) + 
  geom_point()

Is this visualization effective in conveying its message, and if so, what makes it effective? If not, what could be improved, and why?

2.3 Shape

In order to make it easier to distinguish between each player on the team, we can use different shapes to represent each athlete. Additionally, we’ll apply a filter to narrow down the data and focus on a smaller group of players from that time period.

bulls <- bulls %>% 
  filter(lastname == "Jordan" |
           lastname == "Pippen" | 
           lastname == "Longley" | 
           lastname == "Rodman" |
           lastname == "Kukoc")
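
A chain of == comparisons joined by | can also be written with %in%, which some people find easier to read and extend. A small sketch on made-up rows (the vector name keep is just for illustration):

```r
library(dplyr)

# Made-up rows for illustration.
toy <- data.frame(lastname = c("Jordan", "Kerr", "Pippen", "Harper", "Kukoc"))

keep <- c("Jordan", "Pippen", "Longley", "Rodman", "Kukoc")

# The long form, as in the filter above.
with_or <- toy %>%
  filter(lastname == "Jordan" | lastname == "Pippen" | lastname == "Longley" |
           lastname == "Rodman" | lastname == "Kukoc")

# The same selection using %in%.
with_in <- toy %>%
  filter(lastname %in% keep)

with_in$lastname
# "Jordan" "Pippen" "Kukoc"
```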

ggplot(bulls, aes(x = tpa, y = tpm, color = lastname, size = minutes, shape = lastname)) +
  geom_point()

2.4 Facets

Another way to add more details to a plot is by using facets. Facets are handy when dealing with categories or groups in your data. They allow you to create small plots for each group, which is a great way to show additional information without making a single plot too cluttered.

ggplot(data = bulls, aes(x = fga, y = fgm, color = lastname, size = nba_eff)) +
  geom_point() +
  facet_wrap(~lastname, nrow = 2) +
  theme_bw()

2.5 Highlights

If you want to emphasize specific parts of your visualization to grab the viewer’s attention and convey important information, you can do that. For example, if we’re plotting two traits that might show a player’s style and we want to focus on three well-known players like Michael Jordan, John Stockton, and Shaq, we can make them stand out in the plot.

ggplot(data = nba, aes(x = reb / minutes, y = asts / minutes)) +
  geom_point(alpha = 0.2, size = 0.1) +
  geom_point(data = subset(nba, id == 'JORDAMI01'), color = 'yellow') +
  geom_point(data = subset(nba, id == 'STOCKJO01'), color = 'orange') +
  geom_point(data = subset(nba, id == 'ONEASH01'), color = 'red')

So the points definitely stand out in that visualisation, but you can’t tell which player is which (although you may be able to guess).

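Creating legends is very easy in ggplot: whenever an aesthetic is mapped inside aes(), a legend is added by default. Here we map color to a fixed label for each player, so each label gets its own legend entry.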

ggplot(data = nba, aes(x = reb / minutes, y = asts / minutes)) +
  geom_point(alpha = 0.2, size = 0.1) +
  geom_point(data = subset(nba, id == 'JORDAMI01'), aes(color = 'Jordan')) +
  geom_point(data = subset(nba, id == 'STOCKJO01'), aes(color = 'Stockton')) +
  geom_point(data = subset(nba, id == 'ONEASH01'), aes(color = 'Shaq'))
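
Once colour is mapped to labels like this, ggplot picks the colours itself. If you want to choose them, one option is scale_color_manual() with a named vector. The sketch below uses made-up per-minute rates, and the colour choices simply reuse yellow, orange and red from the earlier plot.

```r
library(ggplot2)

# Made-up per-minute rates for three labelled points.
toy <- data.frame(
  reb_rate = c(0.15, 0.08, 0.30),
  ast_rate = c(0.14, 0.32, 0.07),
  player   = c("Jordan", "Stockton", "Shaq")
)

# Assumed colour choices: one named entry per legend label.
player_colors <- c(Jordan = "yellow", Stockton = "orange", Shaq = "red")

p_legend <- ggplot(toy, aes(x = reb_rate, y = ast_rate, color = player)) +
  geom_point(size = 3) +
  scale_color_manual(values = player_colors, name = "Player")

print(p_legend)
```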

2.6 Themes

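There is an add-on package for adding themes to ggplots called ggthemes. Let’s install it and try it out. In the examples below, theme_base(), theme_fivethirtyeight(), theme_excel(), theme_economist() and theme_wsj() come from ggthemes, while theme_bw(), theme_dark(), theme_minimal(), theme_void() and theme_gray() ship with ggplot2 itself.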

library(ggthemes)
p1 <- ggplot(data = nba, aes(x = reb / minutes, y = asts / minutes)) +
  geom_point(alpha = 0.2, size = 0.1) +
  geom_point(data = subset(nba, id == 'JORDAMI01'), aes(color = 'Jordan')) +
  geom_point(data = subset(nba, id == 'STOCKJO01'), aes(color = 'Stockton')) +
  geom_point(data = subset(nba, id == 'ONEASH01'), aes(color = 'Shaq'))

p1 + theme_base()

p1 + theme_bw() # This is a good one in my opinion

p1 + theme_fivethirtyeight()

p1 + theme_dark()

p1 + theme_excel()

p1 + theme_minimal()

p1 + theme_economist()

p1 + theme_wsj()

p1 + theme_void()

# Default is
p1 + theme_gray()

2.7 Passing string arguments to ggplot

This is a very common task when creating dashboards and web-apps and commonly causes some confusion.

Consider the situation where we want to specify a team and year somewhere in our code then make a plot using those values later on.

team_pick <- 'LAL'
year_pick <- '2000'
x_stat <- 'pts'
y_stat <- 'reb'

Then we want to create a scatter plot of only team_pick in year_pick, with x_stat and y_stat on the x and y axes. A first try might be:

ggplot(data = filter(nba,
                     team == team_pick,
                     year == year_pick),
       aes(x = x_stat, y = y_stat)) +
  geom_point() +
  geom_text(aes(label = lastname))

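This does not give the plot we want. Inside aes(), a quoted string is treated as a constant value rather than a column name, so aes(x = 'pts') places every point at the single value 'pts' instead of using the pts column (compare aes(x = pts)).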

To get around this we can use get():

ggplot(data = filter(nba,
                     team == team_pick,
                     year == year_pick),
       aes(x = get(x_stat), y = get(y_stat))) +
  geom_point() +
  geom_text(aes(label = lastname))
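
An alternative to get() is the .data pronoun, which ggplot2 exports for looking up a column by name inside aes(). The sketch below uses a made-up data frame rather than nba, and the labs() call is an addition so the axes show the stat names instead of the .data expression.

```r
library(ggplot2)

# Made-up data for illustration.
toy <- data.frame(pts = c(300, 900, 1500),
                  reb = c(120, 400, 250),
                  lastname = c("A", "B", "C"))

x_stat <- "pts"
y_stat <- "reb"

# .data[[name]] looks up the column called `name` in the plot's data.
p_data <- ggplot(toy, aes(x = .data[[x_stat]], y = .data[[y_stat]])) +
  geom_point() +
  geom_text(aes(label = lastname), vjust = -1) +
  labs(x = x_stat, y = y_stat)

print(p_data)
```

Unlike get(), .data only ever looks inside the data frame, so a stray variable called pts in your workspace cannot be picked up by accident.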

2.8 Labelling

Annotating points is a common task that ggplot does not do very well by default (see the plot above). There is a package called ggrepel that is very useful for this task.

library(ggrepel)

ggplot(data = filter(nba,
                     team == team_pick,
                     year == year_pick),
       aes(x = get(x_stat), y = get(y_stat))) +
  geom_point() +
  geom_text_repel(aes(label = lastname))

Annotation is a good way of adding extra relevant information to a plot.

ggplot(data = nba,
       aes(x = get(x_stat), y = get(y_stat))) +
  geom_point(alpha = 0.1) +
  geom_point(data = filter(nba, id == 'JORDAMI01'),
             aes(size = nba_eff)) +
  geom_text_repel(
    data = filter(nba, id == 'JORDAMI01'),
    aes(label = year), size = 3) +
  xlab(x_stat) +
  ylab(y_stat) +
  theme_bw() +
  ggtitle('Michael Jordan - Points and rebounds each season')

2.9 Combining geoms

Create a new variable in the data for each player’s career year, counting from 0 in their first season.

nba <- nba %>%
  group_by(id) %>%
  mutate(
    careeryear = year - min(year)
  )
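
Here is the same grouped calculation on a tiny made-up table, so you can see what careeryear looks like. The ungroup() at the end is an addition that is not in the workshop code above: it drops the grouping so that later dplyr steps are not silently carried out per player.

```r
library(dplyr)

# Made-up seasons for two hypothetical players.
toy <- data.frame(
  id   = c("P1", "P1", "P1", "P2", "P2"),
  year = c(1990, 1991, 1992, 2001, 2002)
)

toy_career <- toy %>%
  group_by(id) %>%                          # min(year) is computed per player
  mutate(careeryear = year - min(year)) %>%
  ungroup()

toy_career$careeryear
# 0 1 2 0 1
```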

Now, we can create a plot that represents our hypothetical performance metric for each player throughout their career. By adding some extra details to the plot, we can identify patterns and trends in the data.

ggplot(data = nba, 
       aes(x = careeryear, y = nba_eff)) +
  geom_point(size = 0.5, alpha = 0.1) +
  geom_line(aes(group = id), alpha = 0.1) +
  geom_smooth(aes(color = 'Smoothed fit')) +
  geom_smooth(method = 'lm', aes(color = 'Linear model')) +
  theme_bw()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using formula = 'y ~ x'

We can also add highlighting to pick out the careers of a few individual players:

ggplot(data = nba, 
       aes(x = careeryear, y = nba_eff)) +
  geom_point(size = 0.5, alpha = 0.1) +
  geom_line(aes(group = id), alpha = 0.1) +
  geom_smooth(aes(color = 'Smoothed fit')) +
  geom_smooth(method = 'lm', aes(color = 'Linear model')) +
  geom_line(
    data = filter(nba, id %in% c('ABDULKA01',
                                 'BIRDLA01',
                                 'JAMESLE01',
                                 'MCGRATR01',
                                 'JORDAMI01')),
    aes(group = id, color = lastname), alpha = 0.8, linewidth = 2) +
  theme_bw()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using formula = 'y ~ x'