In this module, we first study a few data visualization and analysis examples. They show why, in many situations, data must be transformed before it can be visualized.

1. Facets

Using “Facets” is another way to add additional variables into a graph.

Facets divide a plot into subplots based on the values of one or more discrete variables.
When creating subplots based on the values of a single categorical variable, use facet_wrap(). Here is an example:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2) +
  labs(title = "Vehicle Fuel Economy Data by Vehicle Class", 
       x = "Engine Displacement (liter)", 
       y = "Highway Miles per Gallon") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

The facet_wrap() function wraps subplots into a 2-dimensional array. This is generally a better use of screen space because most displays are roughly rectangular.

In the code above, ~ class is called a formula in R. We will study formulas later. For now, you only need to know that ~ <VARIABLE_NAME> is passed as the first argument of facet_wrap().
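As a small sketch of what the formula is, the code below inspects ~ class as an ordinary R object and then builds the same faceted plot with vars(), an alternative way of naming the faceting variable that ggplot2 also accepts. The object names f and p are only illustrative.

```r
library(ggplot2)

# `~ class` is an ordinary R object of class "formula"
f <- ~ class
class(f)      # "formula"
all.vars(f)   # "class"

# The same faceting written with vars() instead of a formula
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(vars(class), nrow = 2)
p
```

Either spelling should give the same subplots; the formula form is the one used in the rest of these notes.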

In the graph above, we still plot engine displacement against highway mpg, but each subplot shows only the data for one class. This shows where each group lies more clearly than plotting all groups together.

When creating subplots based on the values of two categorical variables, use facet_grid():

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ class) + # A formula with two variables
  labs(title = "Fuel Economy Data by Vehicle Class and Drive Train Type", 
       x = "Engine Displacement (liter)", 
       y = "Highway Miles per Gallon") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

In this case, a grid of subplots is created: the rows of the grid correspond to the values of drv and the columns to the values of class.

For example, the subplot in the top right corner plots hwy against displ for data points with a class value of suv and a drv value of 4, that is, 4-wheel-drive SUVs.

facet_grid() can be used to study the relationship among four variables (two numeric and two categorical). This can give useful insight when the data set is large and complicated.
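To see how the grid is laid out, one can tabulate the two faceting variables. The sketch below counts the observations behind each panel of facet_grid(drv ~ class); panel_counts is an illustrative name, and a zero in the table corresponds to an empty panel.

```r
library(ggplot2)  # provides the mpg data set

# One cell per panel: rows follow drv, columns follow class
panel_counts <- table(drv = mpg$drv, class = mpg$class)
panel_counts

# Grid shape: number of drv values by number of class values
dim(panel_counts)
```

The row and column order of this table matches the panel order in the plot, so it is a quick way to check which drive train and class combinations have few or no cars.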

Exemplary exploratory data analysis of `diamonds` data set

2. Aesthetic mapping with different line colors

When a scatter plot is too cluttered for a data set, a smoothed line graph can help identify the effect of a variable.

For example, in the diamonds data set, we want to know the effect of color on price. Plotting price directly against color with a box plot won’t work well, because carat is another major factor in price.

So a good way to show the effect of color is to account for the effect of carat as well. To do so, we can make the following graph:

ggplot(diamonds) +
  geom_smooth(mapping = aes(x = carat, y = price, color = color)) + 
  labs(title = "Diamond Price Data by Carat and Color", 
       x = "Carat", 
       y = "Price in US dollar") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

Graph analysis: From this figure we can see that for color levels “H”, “I”, and “J”, the color level has a visible effect on the mean price of diamonds under 2 carats. For color levels “G” or better, or for diamonds above 2 carats, no clear effect is visible.
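For comparison, here is a sketch of the direct box plot mentioned earlier, using color as the quality variable to match the smoothed graph, together with the median carat per color level. The names naive_plot and median_carat are illustrative only.

```r
library(ggplot2)  # provides the diamonds data set

# Direct box plot of price against color, ignoring carat
naive_plot <- ggplot(diamonds) +
  geom_boxplot(mapping = aes(x = color, y = price))
naive_plot

# Median weight per color level, to see whether carat differs across colors
median_carat <- tapply(diamonds$carat, diamonds$color, median)
median_carat
```

If the median weights differ across color levels, the box plot mixes the effect of color with the effect of carat, which is the reason for plotting price against carat with one smoothed line per color.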

3. Grouped Bar Chart

Another useful type of bar chart is the grouped bar chart. Here is an example:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge") + 
  labs(title = "Diamond Data by Cut Quality and Clarity", 
       x = "Cut Quality", 
       y = "Counts") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

Here we show separate bars for each combination of two categorical variables, cut and clarity. The key to creating a grouped bar chart is to set position = "dodge" in the plotting function.
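As a sketch of what the grouped bars represent, the code below tabulates the counts behind them and draws the same data with position = "stack", which is geom_bar()'s default position, for comparison. The names bar_counts and stacked are illustrative.

```r
library(ggplot2)  # provides the diamonds data set

# Counts behind the bars: one cell per cut/clarity combination
bar_counts <- table(cut = diamonds$cut, clarity = diamonds$clarity)
bar_counts

# The same counts stacked on top of each other instead of side by side
stacked <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "stack")
stacked
```

Dodged bars make it easier to compare clarity levels within one cut, while stacked bars make the total count per cut easier to read.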

4. Change the y variable in bar charts with `stat_summary()`

So far, our bar plots have all plotted counts. In many cases we want to plot other descriptive measures. In the diamonds data set, for example, we may want to plot the mean price of diamonds for each combination of color and clarity levels. The following code does this:

ggplot(data = diamonds) + 
  stat_summary(mapping = aes(x = clarity, y = price, fill = color), fun = "mean", geom = "bar", position = "dodge") + 
  labs(title = "Diamond Price by Clarity and Color", 
       x = "Clarity Level", 
       y = "Mean Price (US dollar)") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

Basically, stat_summary() or any other <STAT_FUNCTION> is an alternative to a <GEOM_FUNCTION>. The difference is as follows:

<GEOM_FUNCTION> creates a graph of a particular type (point, line, bar plot, etc.), and we specify the mapping in it.
<STAT_FUNCTION> creates a graph that plots a particular statistic (count, proportion, density, or any specified function of the variables), and we specify the graph type with the geom argument.
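A minimal sketch of this correspondence: the two plots below should draw the same bars, once starting from the statistic and once starting from the graph type. It uses a single x variable to keep the comparison simple; the names p_stat, p_geom, bars_stat and bars_geom are illustrative.

```r
library(ggplot2)  # provides the diamonds data set

# Stat-first: name the statistic, pass the graph type through `geom`
p_stat <- ggplot(data = diamonds) +
  stat_summary(mapping = aes(x = clarity, y = price),
               fun = "mean", geom = "bar")

# Geom-first: name the graph type, pass the statistic through `stat`
p_geom <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = clarity, y = price),
           stat = "summary", fun = "mean")

# Extract the computed bar heights from each plot and compare them
bars_stat <- layer_data(p_stat)$y
bars_geom <- layer_data(p_geom)$y
all.equal(bars_stat, bars_geom)
```

Which form to use is mostly a matter of emphasis: the stat-first form reads naturally when the statistic is the point of the graph.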

Graph analysis: The graph above suggests some information that is not consistent with common sense.

The clarity level runs from I1 (worst) to IF (best), but the mean price seems to decrease as the clarity level improves.
Similarly, for most clarity groups, diamonds of the best color “D” do not have the highest mean price; instead, the worst color “J” usually has a higher one.

Why? Because this graph does not account for carat, the major driver of price, or for another important factor, sample size. To find the correct trend, we must look into the data more carefully, and for that we need data transformation, the next topic of our course.
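A quick numeric check of the carat explanation, as a sketch: the code below computes the mean weight and the number of diamonds for each clarity level using base R. The names mean_carat and n_by_clarity are illustrative.

```r
library(ggplot2)  # provides the diamonds data set

# Mean weight per clarity level, ordered from I1 (worst) to IF (best)
mean_carat <- tapply(diamonds$carat, diamonds$clarity, mean)
round(mean_carat, 2)

# Number of diamonds per clarity level (the sample size behind each mean)
n_by_clarity <- table(diamonds$clarity)
n_by_clarity
```

If the low-clarity diamonds turn out to be heavier on average, their higher mean price can come from weight rather than clarity.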

5. An Advanced Example leading to Data Transformation and EDA (exploratory data analysis)

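# Bin carat into four groups; cut() intervals are right-closed: (0,1], (1,2], (2,3], (3,Inf]
diamonds <- mutate(diamonds, carat_group = cut(carat, breaks = c(0, 1, 2, 3, Inf), labels = c('<= 1 carat', '1-2 carat', '2-3 carat', '> 3 carat')))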

ggplot(data = diamonds) + 
  stat_summary(mapping = aes(x = carat_group, y = price, fill = color), fun = "mean", geom = "bar", position = "dodge") + 
  labs(title = "Diamond Price by Carat and Color", 
       x = "Carat", 
       y = "Mean Price (US dollar)") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))
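The cut() call at the start of the code above turns a numeric variable into a factor of intervals. As a small sketch of how the binning behaves, here it is applied to a few made-up weights (the vector weights is hypothetical, not taken from the data set):

```r
# Hypothetical carat values, chosen to hit a boundary and three different bins
weights <- c(0.5, 1, 1.5, 3.2)

groups <- cut(weights,
              breaks = c(0, 1, 2, 3, Inf),
              labels = c('<= 1 carat', '1-2 carat', '2-3 carat', '> 3 carat'))

as.character(groups)
# "<= 1 carat" "<= 1 carat" "1-2 carat" "> 3 carat"
```

Note that a weight of exactly 1 falls into the first group, because by default each interval includes its right endpoint.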

Analysis: This graph makes much more sense: for diamonds of up to 3 carats, the overall trend is that the mean price decreases as the color level worsens. However, the “> 3 carat” group still looks odd, since the worst color still has the highest mean price.

In this case, we need to speculate about why this happens and use graphs to check whether the speculation is correct. A natural guess is that “> 3 carat” diamonds are very rare, so the sample size is too small to reflect the trend. We can check this with the following graph:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = carat_group, fill = color) , position = "dodge") + 
  labs(title = "Diamond Data by Carat and Color", 
       x = "Carat", 
       y = "Counts") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

The graph above clearly shows that the sample size in the “> 3 carats” group is simply too small, so we are not able to draw reliable conclusions for that group.
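The same check can be made numerically. The sketch below recomputes the carat groups with base R so that it runs on its own, then counts the diamonds in each group; group and group_sizes are illustrative names.

```r
library(ggplot2)  # provides the diamonds data set

# Recompute the carat groups so this snippet is self-contained
group <- cut(diamonds$carat,
             breaks = c(0, 1, 2, 3, Inf),
             labels = c('<= 1 carat', '1-2 carat', '2-3 carat', '> 3 carat'))

# Number of diamonds in each group, and the share of the whole data set
group_sizes <- table(group)
group_sizes
round(prop.table(group_sizes), 4)
```

Splitting the smallest group further by color leaves even fewer diamonds per bar, which is why its mean prices are unreliable.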

It is also helpful to check the mean carat for each color level:

ggplot(data = diamonds) + 
  stat_summary(mapping = aes(x = color, y = carat, fill = color), fun = "mean", geom = "bar") + 
  labs(title = "Mean Diamond Weight by Color", 
       x = "Color Level", 
       y = "Mean Weight (carat)") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

As we see, in this data set diamonds with lower color quality are heavier on average, and that’s why we have to consider this factor when analyzing the relationship between price and color level.

It is also helpful to study only diamonds of less than 2 carats, which excludes the rare big diamonds. To do so, we first filter the data using the filter() function from the dplyr package.

ggplot(data = filter(diamonds, carat < 2)) + 
  stat_summary(mapping = aes(x = color, y = carat, fill = color), fun = "mean", geom = "bar") + 
  labs(title = "Mean Weight vs Color for '< 2 carat' Diamonds", 
       x = "Color Level", 
       y = "Mean Weight (carat)") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))
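As a sketch of what the filter() call in the code above does on its own, the snippet below keeps only the rows with carat below 2 and compares the row counts before and after. The name small_diamonds is illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Keep only the rows that satisfy the condition carat < 2
small_diamonds <- filter(diamonds, carat < 2)

nrow(diamonds)             # rows before filtering
nrow(small_diamonds)       # rows after filtering
max(small_diamonds$carat)  # largest remaining weight, below 2
```

The filtered data frame has the same columns as the original, so it can be passed to ggplot() directly, as in the plot above.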

The trend changes little compared with the previous graph. Therefore we must take into account the effect of carat when analyzing the effect of color on price.

This also shows that diamond color does affect the price: diamonds with colors “D” and “F” have similar mean prices, even though “F” diamonds are heavier on average than “D” diamonds.
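One way to inspect this comparison numerically is to put the mean price and the mean weight per color level side by side. The sketch below uses base R and restricts the data to diamonds under 2 carats, which is an assumption about the subset this comparison refers to; color_means is an illustrative name.

```r
library(ggplot2)  # provides the diamonds data set

# Mean price and mean weight for each color level, diamonds under 2 carats only
color_means <- aggregate(cbind(price, carat) ~ color,
                         data = subset(diamonds, carat < 2),
                         FUN = mean)
color_means
```

Reading across a row shows both quantities for one color level, so two colors with close mean prices but different mean weights are easy to spot.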

Lab Exercise: Do a similar analysis regarding the effect of diamond clarity on the price by creating the following graphs:

a) Mean price vs carat groups colored by different clarity level.

b) Mean weight vs different clarity level for diamonds of less than 2 carat.

Data Transformation and Exploratory Data Analysis (EDA)

The example above shows the process of Exploratory Data Analysis (EDA), which involves using visualization and transformation to explore your data in a systematic way. This is really an iterative cycle of the following steps:

Generate questions about your data.
Search for answers by visualizing, transforming, and modelling your data.
Use what you learn to refine your questions and/or generate new questions.

During this process, you will often find that the data are not in exactly the form we need. In that case we turn to data transformation, which includes the following operations:

pick observations that meet our requirement (filter())
reorder the rows (arrange())
pick variables by their names (select())
create new variables with functions of existing variables (mutate())
collapse many values down to a single summary (summarise())
group data by particular conditions (group_by())

We will start to study data transformation with examples from the next lecture.

Lecture 5: From Data Visualization to Data Transformation

Miao Yu

2023-02-04
