In this workshop, we’ll show you how to visualize data in R using ggplot2, one of the most popular plotting packages.
Start by getting the necessary packages, and if you don’t have them already, we’ll help you install them.
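If some of the packages are not installed yet, something like the following could be run once before the library() calls. This is only a sketch: the names pkgs, installed and missing_pkgs are made up for illustration, ggthemes is listed because it is used later in the workshop, and installing from your default CRAN mirror is assumed.

```r
# Packages used in this workshop (ggthemes appears in a later section)
pkgs <- c("ggplot2", "dplyr", "ggrepel", "ggthemes")

# requireNamespace() returns TRUE when a package can be loaded
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

# Only install in an interactive session, and only what is missing
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```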
library(ggplot2)
library(dplyr)
library(ggrepel)
Alright, for our next tasks, we’re going to play around with some data from the NBA_player_regular_season.csv file. Let’s start by bringing that data into R and calling it ‘nba’ in our workspace.
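A first import might look like the sketch below. The path is the one used in the re-import step of this workshop, so it is an assumption about where your copy of the file lives; csv_path is just an illustrative name.

```r
# Path as used in the re-import step of this workshop; adjust it if your
# copy of the file lives somewhere else
csv_path <- "data/NBA_player_regular_season.csv"

if (file.exists(csv_path)) {
  nba <- read.csv(csv_path)   # plain import with default settings
} else {
  message("Could not find ", csv_path, "; check getwd() and the file location")
}
```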
Run a check of the variables and their types:
summary(nba)
## id year firstname lastname
## Length:21961 Min. :1946 Length:21961 Length:21961
## Class :character 1st Qu.:1974 Class :character Class :character
## Mode :character Median :1988 Mode :character Mode :character
## Mean :1986
## 3rd Qu.:1999
## Max. :2009
## team leag gp minutes
## Length:21961 Length:21961 Length:21961 Min. : 0
## Class :character Class :character Class :character 1st Qu.: 275
## Mode :character Mode :character Mode :character Median :1038
## Mean :1204
## 3rd Qu.:2009
## Max. :3882
## pts oreb dreb reb
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 113.0 1st Qu.: 0.00 1st Qu.: 1.0 1st Qu.: 44.0
## Median : 386.0 Median : 22.00 Median : 60.0 Median : 160.0
## Mean : 531.3 Mean : 49.85 Mean : 117.8 Mean : 229.7
## 3rd Qu.: 811.0 3rd Qu.: 75.00 3rd Qu.: 180.0 3rd Qu.: 333.0
## Max. :4029.0 Max. :895.00 Max. :1538.0 Max. :2149.0
## asts stl blk turnover
## Min. : 0.0 Length:21961 Length:21961 Length:21961
## 1st Qu.: 20.0 Class :character Class :character Class :character
## Median : 71.0 Mode :character Mode :character Mode :character
## Mean : 118.1
## 3rd Qu.: 167.0
## Max. :1164.0
## pf fga fgm fta
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 43.0 1st Qu.: 106.0 1st Qu.: 43.0 1st Qu.: 30.0
## Median :118.0 Median : 345.0 Median : 148.0 Median : 99.0
## Mean :123.6 Mean : 452.5 Mean : 204.3 Mean : 146.9
## 3rd Qu.:193.0 3rd Qu.: 696.0 3rd Qu.: 313.0 3rd Qu.: 218.0
## Max. :386.0 Max. :3159.0 Max. :1597.0 Max. :1363.0
## ftm tpa tpm
## Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 20.0 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 70.0 Median : 2.00 Median : 0.00
## Mean :109.6 Mean : 38.08 Mean : 13.11
## 3rd Qu.:161.0 3rd Qu.: 27.00 3rd Qu.: 7.00
## Max. :840.0 Max. :678.00 Max. :269.00
So, when we look at the summary, it seems that some data like games played, steals, blocks, and turnovers got imported as text instead of numbers. To figure out what’s happening, we can use the ‘table’ function, which can give us a better view of the situation.
table(nba$gp)
##
## 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 23
## 260 206 206 159 176 177 158 174 146 163 161 278 142 182 163 177
## 24 25 26 27 28 29 3 30 31 32 33 34 35 36 37 38
## 184 163 161 184 181 173 298 158 179 162 172 128 161 149 156 137
## 39 4 40 41 42 43 44 45 46 47 48 49 5 50 51 52
## 162 271 143 163 176 162 153 158 174 195 214 202 237 247 185 163
## 53 54 55 56 57 58 59 6 60 61 62 63 64 65 66 67
## 190 184 218 215 187 210 204 243 241 216 236 232 240 283 307 321
## 68 69 7 70 71 72 73 74 75 76 77 78 79 8 80 81
## 319 272 223 362 375 506 327 386 465 464 503 580 704 199 867 880
## 82 83 84 85 86 87 88 9 90 N
## 1716 80 106 6 7 3 2 200 1 2
table(nba$stl)
##
## 0 1 10 100 101 102 103 104 105 106 107 108 109 11 110 111
## 5939 607 262 51 42 41 47 35 44 38 34 42 30 218 40 37
## 112 113 114 115 116 117 118 119 12 120 121 122 123 124 125 126
## 19 32 31 24 28 33 30 28 232 22 24 22 30 21 30 22
## 127 128 129 13 130 131 132 133 134 135 136 137 138 139 14 140
## 22 16 37 234 10 17 22 13 12 22 19 18 30 17 211 17
## 141 142 143 144 145 146 147 148 149 15 150 151 152 153 154 155
## 11 9 12 16 4 10 12 9 6 227 10 8 11 7 10 7
## 156 157 158 159 16 160 161 162 163 164 165 166 167 168 169 17
## 10 10 11 7 212 9 7 8 7 4 6 13 9 6 6 199
## 170 171 172 173 174 175 176 177 178 179 18 180 181 182 183 184
## 8 5 8 9 5 5 7 6 5 2 188 5 5 3 4 1
## 185 186 187 188 189 19 190 191 192 193 194 195 196 197 198 199
## 7 3 4 2 3 202 5 2 2 2 3 1 2 6 1 5
## 2 20 200 201 202 203 204 205 206 207 208 209 21 210 211 212
## 479 199 2 4 2 3 3 1 2 4 1 2 204 3 3 4
## 213 214 215 216 217 22 220 221 222 223 225 227 228 229 23 231
## 3 2 2 3 2 203 1 2 1 3 2 1 1 1 208 1
## 232 233 234 236 24 242 243 244 246 25 250 259 26 260 261 263
## 2 1 2 1 171 1 2 1 1 193 1 1 155 1 1 1
## 265 27 28 281 29 3 30 301 31 32 33 34 346 35 354 36
## 1 164 170 1 177 445 181 1 132 165 153 136 1 144 1 155
## 37 38 39 4 40 41 42 43 44 45 46 47 48 49 5 50
## 149 148 137 343 155 135 149 133 138 142 130 137 153 128 345 120
## 51 52 53 54 55 56 57 58 59 6 60 61 62 63 64 65
## 113 126 125 109 115 98 114 99 123 292 117 114 116 120 95 106
## 66 67 68 69 7 70 71 72 73 74 75 76 77 78 79 8
## 99 91 98 85 280 100 88 76 70 77 81 79 79 83 64 280
## 80 81 82 83 84 85 86 87 88 89 9 90 91 92 93 94
## 71 80 73 53 68 74 69 62 66 54 255 55 59 51 54 54
## 95 96 97 98 99 NULL
## 64 47 39 43 40 1
OK, so this shows us that there are values N and NULL in the data. Let’s re-import the csv file, this time with the option na.strings = c('N', 'NULL') so that those values are read as missing (NA).
nba <- read.csv("data/NBA_player_regular_season.csv",
na.strings = c('N', 'NULL'))
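To see what na.strings does on a small scale, here is a sketch with a made-up three-row CSV. The names csv_text, raw and clean, and all the values, are invented for illustration and are not taken from the NBA file.

```r
# A tiny made-up CSV containing the same two placeholder values
csv_text <- "id,gp,stl
A,82,100
B,N,NULL
C,10,5"

# Default import: the stray N and NULL stop gp and stl being read as numbers
raw <- read.csv(text = csv_text)
sapply(raw, is.numeric)

# Treat N and NULL as missing: gp and stl are now numeric, with NA in row 2
clean <- read.csv(text = csv_text, na.strings = c("N", "NULL"))
sapply(clean, is.numeric)
clean
```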
To begin with, filter the dataset to only include player-seasons with more than 100 total minutes. Then add a new calculated column that simply adds up some generally positive stats and subtracts some negative ones.
nba <- nba %>%
filter(minutes > 100) %>%
mutate(
nba_eff = pts + reb + stl + asts + blk - turnover - (fga - fgm)
)
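To check the arithmetic, the same filter and formula can be tried on a few made-up player-seasons. Everything in toy is invented for illustration; the sketch also shows that a missing value in any input makes nba_eff missing too.

```r
library(dplyr)

# Three made-up player-seasons (values invented for illustration)
toy <- data.frame(
  minutes  = c(50, 2000, 1500),
  pts      = c(20, 1500, 900),
  reb      = c(5, 400, 300),
  stl      = c(1, 100, NA),    # one missing steals value
  asts     = c(2, 300, 200),
  blk      = c(0, 50, 20),
  turnover = c(3, 150, 100),
  fga      = c(20, 1200, 800),
  fgm      = c(8, 600, 350)
)

toy_eff <- toy %>%
  filter(minutes > 100) %>%   # drops the 50-minute row
  mutate(nba_eff = pts + reb + stl + asts + blk - turnover - (fga - fgm))

toy_eff$nba_eff
# first remaining row: 1500 + 400 + 100 + 300 + 50 - 150 - (1200 - 600) = 1600
# second remaining row: NA, because stl is missing
```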
Let’s start by plotting two variables we might expect to be related: steals and turnovers.
ggplot(data = nba, # plot data from this data frame
aes(x = stl, y = turnover)) + # specify the mapping between variables in the data and plot elements
geom_point() # ask for points (i.e. a scatter plot)
Okay, what does it mean? Let’s view that subset of data.
Now we can create a basic scatterplot visualisation using the geom_point() function, this time colouring the points by year.
ggplot(data = nba,
aes(x = stl, y = turnover, color = year)) +
geom_point()
## Warning: Removed 253 rows containing missing values (`geom_point()`).
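The warning above appears because ggplot drops rows where stl or turnover is missing (NA) before drawing the points. One way to look at such rows is sketched below on a made-up data frame; the names toy and dropped and all the values are illustrative, and the same filter() call could be applied to nba.

```r
library(dplyr)

# Made-up rows: player B has no steals value, player C has no turnovers value
toy <- data.frame(
  lastname = c("A", "B", "C", "D"),
  stl      = c(10, NA, 35, 50),
  turnover = c(20, 15, NA, 60)
)

# Rows a scatter plot of stl against turnover could not draw
dropped <- toy %>%
  filter(is.na(stl) | is.na(turnover))

dropped
nrow(dropped)   # number of rows that would be removed
```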
Let’s break down the code. In the initial part, where we have ‘ggplot(data = nba)’, we’re essentially creating a plot object that is linked to the dataset we want to work with.
ggplot(data = nba, aes(x = tpa, y = tpm)) +
geom_point()
What will happen if you run it like this?
ggplot(data = nba)
ggplot(data = nba, aes(x = tpa, y = tpm))
After the initial declaration, we use the ‘+’ operator to add more elements to our ggplot object. In this case, we’re adding a ‘geom_point()’ layer, which draws the data as a scatter plot. The ‘aes(x = tpa, y = tpm)’ part tells ggplot how the variables in our data map to the visual aspects of the plot.
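The layering idea can be checked directly by building a plot step by step and counting its layers. This is a small sketch on made-up numbers; toy, p_data, p_axes and p_full are illustrative names.

```r
library(ggplot2)

# Made-up three-point attempts and makes, for illustration only
toy <- data.frame(tpa = c(10, 50, 120), tpm = c(3, 18, 45))

p_data <- ggplot(data = toy)                          # data only: blank panel
p_axes <- ggplot(data = toy, aes(x = tpa, y = tpm))   # axes, but nothing drawn
p_full <- p_axes + geom_point()                       # one layer: the points

length(p_axes$layers)   # no layers yet
length(p_full$layers)   # one layer after adding geom_point()
```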
In the next plot, we’ll introduce a ‘color = team’ argument. This will assign a color to each point according to the team it represents.
ggplot(nba, aes(x = tpa, y = tpm, color = team)) +
geom_point()
However, if you want to set an aesthetic to a fixed value that does not depend on your dataset, it must go outside the aes() function. For example:
ggplot(nba, aes(x = tpa, y = tpm)) +
geom_point(color = "hotpink")
Try it out with your favourite colour!
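To see why the position of the argument matters, compare setting the colour outside aes() with mapping it inside aes(). This is a sketch on made-up numbers; toy, p_set and p_map are illustrative names.

```r
library(ggplot2)

toy <- data.frame(tpa = c(10, 50, 120), tpm = c(3, 18, 45))  # made-up values

# Setting: colour given outside aes(), so every point is drawn in hot pink
p_set <- ggplot(toy, aes(x = tpa, y = tpm)) +
  geom_point(color = "hotpink")

# Mapping: inside aes() the string is treated as data, so ggplot shows a
# legend entry labelled "hotpink" and picks a colour from its default palette
p_map <- ggplot(toy, aes(x = tpa, y = tpm)) +
  geom_point(aes(color = "hotpink"))
```

Printing p_set and p_map side by side shows the difference.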
Okay guys, back to business.
Now, we’ll focus on a specific subset of the data, specifically the Chicago Bulls teams from 1991-1998. In addition, we’ll make the visualization more interesting by adjusting the size of the elements.
bulls <- nba %>%
filter(team == "CHI" & year >= 1991 & year <= 1998)
ggplot(bulls, aes(x = tpm, y = tpa, color = lastname, size = minutes)) +
geom_point()
Is this visualization effective in conveying its message, and if so, what makes it effective? If not, what could be improved, and why?
In order to make it easier to distinguish between each player on the team, we can use different shapes to represent each athlete. Additionally, we’ll apply a filter to narrow down the data and focus on a smaller group of players from that time period.
bulls <- bulls %>%
filter(lastname == "Jordan" |
lastname == "Pippen" |
lastname == "Longley" |
lastname == "Rodman" |
lastname == "Kukoc")
ggplot(bulls, aes(x = tpa, y = tpm, color = lastname, size = minutes, shape = lastname)) +
geom_point()
Another way to add more details to a plot is by using facets. Facets are handy when dealing with categories or groups in your data. They allow you to create small plots for each group, which is a great way to show additional information without making a single plot too cluttered.
ggplot(data = bulls, aes(x = fga, y = fgm, color = lastname, size = nba_eff)) +
geom_point() +
facet_wrap(~lastname, nrow = 2) +
theme_bw()
If you want to emphasize specific parts of your visualization to grab the viewer’s attention and convey important information, you can do that. For example, if we’re plotting two traits that might show a player’s style and we want to focus on three well-known players like Michael Jordan, John Stockton, and Shaq, we can make them stand out in the plot.
ggplot(data = nba, aes(x = reb / minutes, y = asts / minutes)) +
geom_point(alpha = 0.2, size = 0.1) +
geom_point(data = subset(nba, id == 'JORDAMI01'), color = 'yellow') +
geom_point(data = subset(nba, id == 'STOCKJO01'), color = 'orange') +
geom_point(data = subset(nba, id == 'ONEASH01'), color = 'red')
So the points definitely stand out in that visualisation, but you can’t tell which player is which (although you may be able to guess).
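Creating legends is very easy in ggplot because it happens by default whenever an aesthetic is mapped inside aes(). Here we map color to a constant label for each player, so ggplot builds a legend from those labels and chooses the colors itself.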
ggplot(data = nba, aes(x = reb / minutes, y = asts / minutes)) +
geom_point(alpha = 0.2, size = 0.1) +
geom_point(data = subset(nba, id == 'JORDAMI01'), aes(color = 'Jordan')) +
geom_point(data = subset(nba, id == 'STOCKJO01'), aes(color = 'Stockton')) +
geom_point(data = subset(nba, id == 'ONEASH01'), aes(color = 'Shaq'))
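Because the colors are now mapped rather than set, ggplot picks them from its default palette. If you want to choose them yourself, scale_color_manual() can be added. The sketch below uses made-up per-minute values for three hypothetical players; toy, reb_min, ast_min and p_legend are illustrative names.

```r
library(ggplot2)

# Made-up per-minute values for three hypothetical players
toy <- data.frame(
  player  = c("Player A", "Player B", "Player C"),
  reb_min = c(0.16, 0.08, 0.30),
  ast_min = c(0.14, 0.33, 0.07)
)

p_legend <- ggplot(toy, aes(x = reb_min, y = ast_min, color = player)) +
  geom_point(size = 3) +
  scale_color_manual(
    name   = "Player",   # legend title
    values = c("Player A" = "yellow", "Player B" = "orange", "Player C" = "red")
  )
```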
There is an add-on package for adding themes to ggplots called ggthemes. Let’s install it and try it out.
library(ggthemes)
p1 <- ggplot(data = nba, aes(x = reb / minutes, y = asts / minutes)) +
geom_point(alpha = 0.2, size = 0.1) +
geom_point(data = subset(nba, id == 'JORDAMI01'), aes(color = 'Jordan')) +
geom_point(data = subset(nba, id == 'STOCKJO01'), aes(color = 'Stockton')) +
geom_point(data = subset(nba, id == 'ONEASH01'), aes(color = 'Shaq'))
p1 + theme_base()
p1 + theme_bw() # This is a good one in my opinion
p1 + theme_fivethirtyeight()
p1 + theme_dark()
p1 + theme_excel()
p1 + theme_minimal()
p1 + theme_economist()
p1 + theme_wsj()
p1 + theme_void()
# Default is
p1 + theme_gray()
Using variables to choose what to plot is a very common task when creating dashboards and web apps, and it commonly causes some confusion. Consider the situation where we want to specify a team and year somewhere in our code, then make a plot using those values later on.
team_pick <- 'LAL'
year_pick <- '2000'
x_stat <- 'pts'
y_stat <- 'reb'
Then we want to create a scatter plot of only team_pick in year_pick, with x_stat and y_stat on the x- and y-axes. A first try might be:
ggplot(data = filter(nba,
team == team_pick,
year == year_pick),
aes(x = x_stat, y = y_stat)) +
geom_point() +
geom_text(aes(label = lastname))
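This does not produce the plot we want: inside aes(), x_stat evaluates to the string 'pts', which ggplot treats as a single constant value rather than as the name of a column (i.e. we get aes(x = 'pts') rather than aes(x = pts)).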
To get around this we can use get():
ggplot(data = filter(nba,
team == team_pick,
year == year_pick),
aes(x = get(x_stat), y = get(y_stat))) +
geom_point() +
geom_text(aes(label = lastname))
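An alternative to get() is the .data pronoun, which ggplot2 supports inside aes() for looking up a column by a name stored in a string. The sketch below uses made-up season totals for three hypothetical players; toy and p_pick are illustrative names. The labs() call sets the axis titles to the chosen column names explicitly.

```r
library(ggplot2)

# Made-up season totals for three hypothetical players
toy <- data.frame(
  lastname = c("Player A", "Player B", "Player C"),
  pts      = c(1500, 900, 400),
  reb      = c(400, 700, 150)
)

x_stat <- "pts"
y_stat <- "reb"

p_pick <- ggplot(toy, aes(x = .data[[x_stat]], y = .data[[y_stat]])) +
  geom_point() +
  geom_text(aes(label = lastname), vjust = -1) +
  labs(x = x_stat, y = y_stat)   # label the axes with the column names
```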
Annotating points is a common task that ggplot does not do very well by default (see the plot above). There is a package called ggrepel that is very useful for this task.
library(ggrepel)
ggplot(data = filter(nba,
team == team_pick,
year == year_pick),
aes(x = get(x_stat), y = get(y_stat))) +
geom_point() +
geom_text_repel(aes(label = lastname))
Annotation is a good way of adding extra relevant information to a plot.
ggplot(data = nba,
aes(x = get(x_stat), y = get(y_stat))) +
geom_point(alpha = 0.1) +
geom_point(data = filter(nba, id == 'JORDAMI01'),
aes(size = nba_eff)) +
geom_text_repel(
data = filter(nba, id == 'JORDAMI01'),
aes(label = year), size = 3) +
xlab(x_stat) +
ylab(y_stat) +
theme_bw() +
ggtitle('Michael Jordan - Points and rebounds each season')
Create a new variable in the data for each player’s career year.
nba <- nba %>%
group_by(id) %>%
mutate(
careeryear = year - min(year)
) %>%
ungroup() # drop the grouping so later steps work on the whole data frame
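To see what the grouped mutate() does, here is the same calculation on two made-up players. The ids and years in toy are invented for illustration; each player’s first season becomes career year 0.

```r
library(dplyr)

# Two made-up players: A plays 1990-1992, B plays 2001 and 2003
toy <- data.frame(
  id   = c("A", "A", "A", "B", "B"),
  year = c(1990, 1991, 1992, 2001, 2003)
)

toy <- toy %>%
  group_by(id) %>%
  mutate(careeryear = year - min(year)) %>%   # min(year) is taken per player
  ungroup()

toy$careeryear   # 0 1 2 for player A, then 0 2 for player B
```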
Now, we can create a plot that represents our hypothetical performance metric for each player throughout their career. By adding some extra details to the plot, we can identify patterns and trends in the data.
ggplot(data = nba,
aes(x = careeryear, y = nba_eff)) +
geom_point(size = 0.5, alpha = 0.1) +
geom_line(aes(group = id), alpha = 0.1) +
geom_smooth(aes(color = 'Smoothed fit')) +
geom_smooth(method = 'lm', aes(color = 'Linear model')) +
theme_bw()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using formula = 'y ~ x'
We can also highlight the careers of a few individual players:
ggplot(data = nba,
aes(x = careeryear, y = nba_eff)) +
geom_point(size = 0.5, alpha = 0.1) +
geom_line(aes(group = id), alpha = 0.1) +
geom_smooth(aes(color = 'Smoothed fit')) +
geom_smooth(method = 'lm', aes(color = 'Linear model')) +
geom_line(
data = filter(nba, id %in% c('ABDULKA01',
'BIRDLA01',
'JAMESLE01',
'MCGRATR01',
'JORDAMI01')),
aes(group = id, color = lastname), alpha = 0.8, linewidth = 2) + # linewidth replaces size for lines in ggplot2 >= 3.4.0
theme_bw()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using formula = 'y ~ x'