In this module, we will first study a few data visualization and analysis examples, which naturally raise the necessity of performing data transformation before data visualization in many situations.
Using facets is another way to add additional variables to a graph.
Facets divide a plot into subplots based on the values of one or more discrete variables.
When creating subplots based on the values of a single categorical variable, use facet_wrap().
An example is shown below.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2) +
labs(title = "Vehicle Fuel Economy Data by Vehicle Class",
x = "Engine Displacement (liter)",
y = "Highway Mile per Gallon") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
The facet_wrap() function wraps subplots into a 2-dimensional array. This is generally a better use of screen space because most displays are roughly rectangular.
In the code above, ~ class is called a formula in R. We will study it later. For now you just need to know that ~ <VARIABLE_NAME> is needed as the first argument of the facet_wrap() function.
In the graph above, we still plot highway mpg against engine displacement, but each subplot shows only the data for one class. This makes it clear where each group lies, more so than plotting all the groups together.
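As a minimal sketch of what the formula passed to facet_wrap() carries, the snippet below checks that ~ class is a formula object and extracts the variable name from it. It also shows vars(class), which recent ggplot2 versions accept as an alternative way to name the faceting variable (this assumes ggplot2 3.0.0 or later). The object names f, p_formula, and p_vars are only illustrative.

```r
library(ggplot2)

# A one-sided formula: nothing to the left of ~, one variable to the right
f <- ~ class
class(f)      # "formula"
all.vars(f)   # "class"

# Two ways to request one subplot per vehicle class
p_formula <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)

p_vars <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(vars(class), nrow = 2)
```

Both plot objects should produce the same set of subplots; the formula form is the one used throughout these notes.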
To create subplots based on the values of two categorical variables, use facet_grid():
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ class) + # A formula with two variables
labs(title = "Fuel Economy Data by Vehicle Class and Drive Train Type",
x = "Engine Displacement (liter)",
y = "Highway Mile per Gallon") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
In this case, a grid of subplots is created. The rows of the grid correspond to the values of drv, and the columns correspond to the values of class.
For example, the subplot at the top right corner plots hwy against displ for data points with a class value of suv and a drv value of 4, which corresponds to 4-wheel drive SUVs.
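To make the link between a panel and its data concrete, the sketch below tabulates how many observations fall into each drv/class combination and then extracts the rows behind the top right panel. It uses only base R on the mpg data set; the names panel_counts and suv_4wd are made up for this illustration.

```r
library(ggplot2)  # provides the mpg data set

# Rows of the grid come from drv, columns from class
panel_counts <- table(drv = mpg$drv, class = mpg$class)
panel_counts   # a zero marks an empty panel in the grid

# The observations drawn in the top right panel: 4-wheel drive SUVs
suv_4wd <- subset(mpg, class == "suv" & drv == "4")
nrow(suv_4wd)
```

A table like this is a quick way to see which panels of the grid will be sparse or empty before drawing the plot.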
facet_grid() can be used to study the relationship between four variables (two numeric and two categorical). When the data set is large and complicated, this can provide useful insights.
diamonds data set
When scatter plots are too messy for a data set, a smoothed line graph can be very helpful for identifying the effect of variables.
For example, in the diamonds data set, we want to know the effect of color on price. If we directly plot price against color using a box plot, it won’t work well, since price also depends on another major factor, carat.
So a good way to show the effect of color is to also consider the effect of carat. To do so, we can make the following graph:
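For comparison, the direct box plot that this paragraph warns about can be sketched as follows. Color is used as the grouping variable here, to match the smoothed-line code that comes next; the second plot is an added illustration of how carat itself varies across the same groups. The names p_price and p_carat are illustrative only.

```r
library(ggplot2)

# Price against color alone: carat is ignored, so each box mixes
# small and large diamonds of the same color level
p_price <- ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = color, y = price))

# Carat against color: shows how diamond weight differs between color levels
p_carat <- ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = color, y = carat))
```

If the carat distributions differ between groups, differences in the price boxes cannot be attributed to the grouping variable alone.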
ggplot(diamonds) +
geom_smooth(mapping = aes(x = carat, y = price, color = color)) +
labs(title = "Diamond Price Data by Carat and Color",
x = "Carat",
y = "Price in US dollar") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
Graph analysis: This figure shows that for color levels “H”, “I”, and “J”, the color level has an effect on the mean price of diamonds under 2 carats. For color levels “G” or better, or for diamonds above 2 carats, the effect is not significant.
Another useful type of bar chart is the grouped bar chart, which places the bars for each subgroup side by side. An example is shown below:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge") +
labs(title = "Diamond Data by Cut Quality and Clarity",
x = "Cut Quality",
y = "Counts") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
Here we show separate bars for the combinations of two categorical variables, cut and clarity. To create a grouped bar chart, the key is to set position to "dodge" in the plotting function.
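To see what the position argument changes, the sketch below builds the same bar chart with three position settings that geom_bar() accepts: "stack" (the default), "dodge", and "fill". The object names are illustrative only.

```r
library(ggplot2)

p_base <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity))

p_stack <- p_base + geom_bar()                     # default: bars stacked on top of each other
p_dodge <- p_base + geom_bar(position = "dodge")   # grouped: bars side by side
p_fill  <- p_base + geom_bar(position = "fill")    # stacked and rescaled to proportions
```

The counts behind the three charts are the same; only the arrangement of the bars differs.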
stat_summary()
So far our bar plots have all shown counts. In many cases we want to plot other descriptive measures. In the diamonds data set, for example, we may want to plot the mean price for diamonds with various combinations of color and clarity levels. The following code does the job for us:
ggplot(data = diamonds) +
stat_summary(mapping = aes(x = clarity, y = price, fill = color), fun = "mean", geom = "bar", position = "dodge") +
labs(title = "Diamond Price by Clarity and Color",
x = "Clarity Level",
y = "Mean Price (US dollar)") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
Basically, stat_summary() or any other <STAT_FUNCTION> is an alternative to a <GEOM_FUNCTION>. The difference is that:
- <GEOM_FUNCTION> creates a graph of a particular graph type (point, line, bar plot, etc.), and we need to specify the mapping in it.
- <STAT_FUNCTION> creates a graph that plots a particular statistic (count, proportion, density, or any specified variables and their functions), and we need to specify the graph type in the function by using the geom argument.
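The correspondence between the two kinds of functions can be sketched with the simplest pair, geom_bar() and stat_count(): the former uses the count statistic by default, and the latter uses the bar geom by default. The object names below are illustrative only.

```r
library(ggplot2)

# Start from the graph type; the count statistic is implied
p_geom <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# Start from the statistic; the graph type is named through geom
p_stat <- ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut), geom = "bar")

# Compare the bar heights computed for the two plots
all.equal(layer_data(p_geom)$count, layer_data(p_stat)$count)
```

Which form to use is mostly a matter of emphasis: whether the code should highlight the graph type or the statistic being plotted.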
Graph analysis: The bar chart of mean price above suggests something that is not consistent with common sense: diamonds with worse color levels often appear to have higher mean prices. Why? Because this graph does not account for the major price indicator carat, or for another important factor, sample size. To find the correct trend, we must look into the data more carefully, and for that we have to learn data transformation, the next topic of our course.
# Bin carat into four groups; cut() here is the base R function, not the cut column
diamonds <- mutate(diamonds,
carat_group = cut(carat,
breaks = c(0, 1, 2, 3, Inf),
labels = c('<= 1 carat', '1-2 carat', '2-3 carat', '> 3 carat')))
ggplot(data = diamonds) +
stat_summary(mapping = aes(x = carat_group, y = price, fill = color), fun = "mean", geom = "bar", position = "dodge") +
labs(title = "Diamond Price by Carat and Color",
x = "Carat",
y = "Mean Price (US dollar)") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
Analysis: This graph makes much more sense: for diamonds of less than 3 carats, the overall trend is that the mean price decreases as the color level gets worse. However, the “> 3 carat” group still looks odd, since the worst color still has the highest mean price.
In this case, we need to speculate about why this happens and then use graphs to check whether the speculation is correct. Here a natural speculation is that “> 3 carat” diamonds are very rare, so the sample size is too small to reflect the trend. We can check this with the following graph:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = carat_group, fill = color) , position = "dodge") +
labs(title = "Diamond Data by Carat and Color",
x = "Carat",
y = "Counts") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
The graph above clearly shows that the sample size in the “> 3 carat” group is simply too small, so we are not able to draw reliable conclusions for that group.
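The binning and the sample size check can also be done numerically. The sketch below first applies cut() to a few hypothetical carat values to show how the bins work, then counts the diamonds in each bin using base R's table(). The values in toy_carat and all object names are made up for this illustration.

```r
library(ggplot2)  # provides the diamonds data set

bin_breaks <- c(0, 1, 2, 3, Inf)
bin_labels <- c("<= 1 carat", "1-2 carat", "2-3 carat", "> 3 carat")

# Hypothetical carat values, chosen to land in different bins
toy_carat <- c(0.5, 1, 1.5, 2.5, 3.2)
toy_group <- cut(toy_carat, breaks = bin_breaks, labels = bin_labels)

# Intervals are closed on the right by default, so 1 falls in "<= 1 carat"
as.character(toy_group)
# "<= 1 carat" "<= 1 carat" "1-2 carat" "2-3 carat" "> 3 carat"

# The same binning applied to the real data
carat_group <- cut(diamonds$carat, breaks = bin_breaks, labels = bin_labels)

# Number of diamonds in each carat group
table(carat_group)

# Counts by carat group and color, matching the bars in the graph
table(carat_group, color = diamonds$color)
```

Reading the counts directly avoids having to judge very short bars by eye.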
It is also helpful to check the mean carat for each color level:
ggplot(data = diamonds) +
stat_summary(mapping = aes(x = color, y = carat, fill = color), fun = "mean", geom = "bar") +
labs(title = "Mean Diamond Weight by Color",
x = "Color Level",
y = "Mean Weight (carat)") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
As we see, in this data set diamonds with lower color quality are heavier on average, and that’s why we have to consider this factor when analyzing the relationship between price and color level.
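The same comparison can be read off as numbers. The sketch below uses base R's tapply() to compute the mean carat and mean price for each color level; the names mean_carat and mean_price are illustrative only.

```r
library(ggplot2)  # provides the diamonds data set

# One mean per color level, D (best) through J (worst)
mean_carat <- tapply(diamonds$carat, diamonds$color, mean)
mean_price <- tapply(diamonds$price, diamonds$color, mean)

# Side by side, rounded for readability
round(cbind(mean_carat, mean_price), 2)
```

Placing the two columns next to each other makes it easier to see whether higher mean prices go together with higher mean weights.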
It is also helpful to study only diamonds of less than 2 carats, to exclude big diamonds that are rare. In this case, we first need to filter our data using the filter() function in the dplyr package.
ggplot(data = filter(diamonds, carat < 2)) +
stat_summary(mapping = aes(x = color, y = carat, fill = color), fun = "mean", geom = "bar") +
labs(title = "Mean Weight vs Color for '< 2 carat' Diamonds",
x = "Color Level",
y = "Mean Weight (carat)") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
The trend changes little compared with the previous graph. Therefore we must take into account the effect of carat when analyzing the effect of color on price.
This also shows that diamond color does affect the price: diamonds with color “D” and “F” have similar mean prices, even though “F” diamonds are heavier on average than “D” diamonds.
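The filtering step used for the last graph can be examined on its own. The sketch below keeps only diamonds under 2 carats with dplyr's filter() and compares the row counts before and after; a base R equivalent is included for reference. The names small_diamonds and small_base are illustrative only.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Keep only the rows for which the condition is TRUE
small_diamonds <- filter(diamonds, carat < 2)

nrow(diamonds)              # rows before filtering
nrow(small_diamonds)        # rows after filtering
max(small_diamonds$carat)   # largest remaining carat value

# Base R equivalent using logical indexing
small_base <- diamonds[diamonds$carat < 2, ]
```

Checking the row counts after a filter is a simple way to confirm how much data the following graph is based on.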
a) Mean price vs carat groups colored by different clarity level.
b) Mean weight vs different clarity level for diamonds of less than 2 carat.
The example above shows the process of Exploratory Data Analysis (EDA), which involves using visualization and transformation to explore your data in a systematic way. This is really an iterative cycle of the following steps:
1. Generate questions about your data.
2. Search for answers by visualizing, transforming, and modelling your data.
3. Use what you learn to refine your questions and/or generate new questions.
During this process, you will find that in many situations the data is not in exactly the right form for what we need. In that case we have to turn to data transformation, which includes operations such as:
- filter()
- arrange()
- select()
- mutate()
- collapse()
- group_by()
We will start studying data transformation with examples in the next lecture.
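As a brief preview, and only as a sketch, several of these operations can be chained on the diamonds data set as shown below. The column names price_per_carat and mean_ppc are hypothetical, ungroup() is an extra dplyr helper not named in the list, and collapse() is left out of this sketch.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

result <- diamonds %>%
  filter(carat < 2) %>%                          # keep rows that meet a condition
  select(carat, color, price) %>%                # keep only some columns
  mutate(price_per_carat = price / carat) %>%    # add a new column
  group_by(color) %>%                            # later steps work within each color
  mutate(mean_ppc = mean(price_per_carat)) %>%   # one mean per color, repeated on its rows
  ungroup() %>%                                  # drop the grouping again
  arrange(desc(price_per_carat))                 # sort rows, largest first

head(result)
```

Each step takes a data frame and returns a new one, which is what allows the steps to be chained in this way.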