In this lesson, we take data visualization further with the ggplot2 package, which is part of the tidyverse. As a quick review, ggplot2 builds a visualization from several components: data, aesthetic mappings, and geometric objects. These components are then added together (or layered) to produce the final graph.
We are going to use a diving dataset to see how scores vary with the judge’s country (I don’t really know much about diving, but it works for this lesson). Let’s load our dataset.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.1
## -- Attaching packages ----------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0 v purrr 0.2.5
## v tibble 1.4.2 v dplyr 0.7.7
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## Warning: package 'ggplot2' was built under R version 3.5.1
## Warning: package 'tibble' was built under R version 3.5.1
## Warning: package 'tidyr' was built under R version 3.5.1
## Warning: package 'readr' was built under R version 3.5.1
## Warning: package 'purrr' was built under R version 3.5.1
## Warning: package 'dplyr' was built under R version 3.5.1
## Warning: package 'stringr' was built under R version 3.5.1
## Warning: package 'forcats' was built under R version 3.5.1
## -- Conflicts -------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
dive <- read_csv("C:/Users/ankit/OneDrive/Desktop/Robotics Scouting/Data Sets/bias_diving.csv")
## Parsed with column specification:
## cols(
## Event = col_character(),
## Round = col_character(),
## Diver = col_character(),
## Country = col_character(),
## Rank = col_integer(),
## DiveNo = col_integer(),
## Difficulty = col_double(),
## JScore = col_double(),
## Judge = col_character(),
## JCountry = col_character()
## )
The first thing to do is tell ggplot what data to graph. This is done with the call to ggplot() below. We save the graph as histo so that we can add layers to it.
histo <- ggplot(data = dive)
The second component is the aesthetics, which map variables in the data to visual properties of the plot. As seen before, we use aes() with arguments such as:
x: the variable that will be on the x-axis
y: the variable that will be on the y-axis
color: the variable that categorizes data by color
shape: the variable that categorizes data by shape
The third component is the geometric object. Geometric objects, or geoms, determine the type of plot that will be created, and their functions all start with geom_. Examples include:
geom_point(): create a scatterplot
geom_histogram(): create a histogram
geom_line(): create a line plot
geom_boxplot(): create a boxplot
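To see how the geom alone decides the plot type, here is a minimal sketch that keeps the data and aesthetic mappings fixed and swaps only the geom. The small toy data frame and its values are made up for illustration; they are not from the diving dataset.

```r
library(ggplot2)

# Hypothetical toy data (not the diving dataset)
toy <- data.frame(
  round = rep(c("Prelim", "Final"), each = 4),
  score = c(6.5, 7.0, 7.5, 8.0, 7.0, 8.0, 8.5, 9.0)
)

# Data and aesthetics are shared by both plots
base <- ggplot(data = toy, aes(x = round, y = score))

# Same base, different geoms
p_points <- base + geom_point()    # scatterplot
p_boxes  <- base + geom_boxplot()  # boxplot
```

Printing p_points or p_boxes draws the corresponding plot.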
Putting the three components together, we get a histogram of the judges’ scores.
histo <- histo + geom_histogram(aes(x = JScore), binwidth = 0.25)
histo
Now let’s change the x-axis label from the default, JScore:
histo <- histo + labs(x = "Judge Scores")
histo
If we wanted a graph of scores for each judge’s country, we could do it in one line of code, rather than building a separate plot for every country, by using facets. Facets split a graph into panels by category. Let’s add a facet to the plot we have.
histo <- histo + facet_wrap(~ JCountry, nrow = 4)
histo
The ~ is needed in this call because it turns JCountry into a formula, which is how facet_wrap() is told which variable to split on. It is just a piece of syntax to get used to.
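As an aside, the formula is not the only way to name the faceting variable: in ggplot2 3.0.0 and later, vars() does the same job. The sketch below uses a made-up toy data frame (an assumption for illustration) and counts the panels each spelling produces.

```r
library(ggplot2)

# Hypothetical toy data (not the diving dataset)
toy <- data.frame(
  country = rep(c("AUS", "CHN", "USA"), each = 3),
  score   = c(6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 7.0, 7.5, 8.0)
)

base <- ggplot(toy) + geom_histogram(aes(x = score), binwidth = 0.5)

by_formula <- base + facet_wrap(~ country)       # formula syntax
by_vars    <- base + facet_wrap(vars(country))   # vars() syntax

# Count the panels in each built plot
n_formula <- nrow(ggplot_build(by_formula)$layout$layout)
n_vars    <- nrow(ggplot_build(by_vars)$layout$layout)
```

Both counts should equal the number of distinct countries in the toy data.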
If we wanted to see how skewed each panel is in comparison to the median of all scores, we could draw that median on top of the plot with geom_vline(), which adds a vertical line at the x value we specify. We store the value under the name median_score so that it does not share a name with the median() function.
median_score <- median(dive[["JScore"]])
histo <- histo + geom_vline(xintercept = median_score, color = "blue")
histo
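Note that a single xintercept draws the same overall median in every facet. If a separate median per panel is wanted instead, one possible approach is to summarize the data first and hand that summary to geom_vline(). This is only a sketch on a made-up toy data frame, with column names chosen for illustration.

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data (not the diving dataset)
toy <- data.frame(
  country = rep(c("AUS", "CHN"), each = 4),
  score   = c(6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5)
)

# One median per country
medians <- toy %>%
  group_by(country) %>%
  summarize(med = median(score))

per_panel <- ggplot(toy) +
  geom_histogram(aes(x = score), binwidth = 0.5) +
  facet_wrap(~ country) +
  # This layer's data has a country column, so each line lands in its own panel
  geom_vline(data = medians, aes(xintercept = med), color = "blue")
```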
Now try it yourself: create your own facet wrap using a different category (not country).
A boxplot is another visualization tool. We can make boxplots of the judges’ scores separated by diving round. Here we use the aesthetic fill = Round to color (or “fill”) the boxplots by round.
bp <- ggplot(data = dive)
bp <- bp + geom_boxplot(aes(x = Round, y = JScore, fill = Round))
bp <- bp + labs(title = "Scores by Round", x = "", y = "Judge Score", fill = "Round")
bp
Let’s also get rid of the redundant labels on the x-axis, since the legend already identifies each round:
bp <- bp + theme(axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank())
bp
Using geom_bar() we can create a bar chart.
bp <- ggplot(data = dive)
bp <- bp + geom_bar(aes(x = JCountry, fill = JCountry))
bp <- bp + labs(x = "Judge Country", fill = "Judge Country")
bp
Notice that fill automatically assigns a color to each country and adds a matching legend.
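geom_bar() needs only an x aesthetic because it counts the rows in each category itself. The sketch below, on a made-up toy data frame (an assumption for illustration), pulls the bar heights out of the built plot so they can be compared with table().

```r
library(ggplot2)

# Hypothetical toy data (not the diving dataset)
toy <- data.frame(country = c("AUS", "AUS", "CHN", "USA", "USA", "USA"))

bars <- ggplot(toy) + geom_bar(aes(x = country))

# Heights computed by geom_bar() versus counts from table()
heights <- ggplot_build(bars)$data[[1]]$count
counts  <- as.vector(table(toy$country))
```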
Let’s go back to scatterplots. We can plot the rank of divers against the judges’ scores. We would expect better-ranked divers (rank closer to 1) to receive higher scores.
scat <- ggplot(data = dive)
scat <- scat + geom_point(aes(x = Rank, y = JScore, color = Country))
scat
We can also add another layer using stat_, which stands for statistical transformation. This is useful if we want to plot a summary statistic of our data, such as a mean or median. With a stat_ layer, we do not have to compute any summaries beforehand, as ggplot does it automatically.
Suppose we want to plot the mean of each judge’s scores with an error bar around it. We could use group_by and summarize to compute those numbers for each judge ourselves, or we could just use a stat_ layer!
The layer stat_summary() computes and plots a summary statistic. We can pass it the function mean_se, which calculates the mean of each judge’s scores plus and minus one standard error. Note that this is the standard error (the standard deviation divided by the square root of the number of scores), not the standard deviation itself.
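To make concrete what mean_se returns, the sketch below applies it to a short made-up vector of scores and repeats the calculation by hand. The vector x is an assumption for illustration.

```r
library(ggplot2)

# Hypothetical scores (not from the diving dataset)
x <- c(6.5, 7.0, 7.5, 8.0)

# mean_se() returns a one-row data frame with columns y, ymin, ymax
by_function <- mean_se(x)

# The same numbers by hand: mean plus or minus one standard error
se <- sd(x) / sqrt(length(x))
by_hand <- c(y = mean(x), ymin = mean(x) - se, ymax = mean(x) + se)
```

mean_se also takes a mult argument to widen the interval, for example mean_se(x, mult = 2) for two standard errors.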
Let’s add a stat_summary() layer
judge <- ggplot(data = dive, aes(x = Judge, y = JScore))
judge <- judge + stat_summary(fun.data = mean_se)
judge <- judge + labs(y = "Judge Score")
judge
As you can see, the judges’ names overlap and cannot be read, but we can fix that easily with coord_flip(), which swaps the x and y axes.
judge <- judge + coord_flip()
judge
Scales allow adjustment of aesthetics. We will reuse the scatterplot of the judges’ scores versus the rank of the divers. This time, we want to color the points by the difficulty of the dive.
We use the layer: scale_color_distiller
In that name, the second word, color, is the aesthetic we want to adjust. Other scale functions put x, y, or fill in that position (for example, scale_fill_distiller or scale_x_continuous), depending on the aesthetic we want to change.
The third word is distiller, which we use because our color variable, Difficulty, is continuous.
scat <- ggplot(data = dive)
scat <- scat + geom_point(aes(x = Rank, y = JScore, color = Difficulty))
scat <- scat + scale_color_distiller(palette = "OrRd", direction = 1)
scat
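For contrast, a discrete color variable would typically use the brewer variant rather than distiller. Here is a minimal sketch on made-up data; the toy data frame, its columns, and the "Set1" palette choice are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical toy data (not the diving dataset)
toy <- data.frame(
  rank       = 1:6,
  score      = c(9.0, 8.5, 8.0, 7.5, 7.0, 6.5),
  difficulty = c(3.4, 3.2, 3.0, 2.8, 2.6, 2.4),
  country    = rep(c("AUS", "CHN", "USA"), times = 2)
)

base <- ggplot(toy, aes(x = rank, y = score))

# Continuous variable: distiller interpolates a ColorBrewer palette
continuous <- base +
  geom_point(aes(color = difficulty)) +
  scale_color_distiller(palette = "OrRd", direction = 1)

# Discrete variable: brewer picks one palette color per category
discrete <- base +
  geom_point(aes(color = country)) +
  scale_color_brewer(palette = "Set1")

# Number of distinct colors actually drawn in the discrete plot
n_colors <- length(unique(ggplot_build(discrete)$data[[1]]$colour))
```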