Introduction

Most people want to visualize data in compelling ways so that they can extract meaningful information from it, but many do not know how. One approach is to bring more variables into a plot, which makes the data easier to understand and helps reveal relationships among the variables.

This vignette explains how to add more variables to an existing visualization to uncover the connections among them. We start with a simple histogram and then add layers until we can visualize a multivariate dataset with the ggplot2 package in R.

Loading Dataset

This vignette uses the diamonds dataset, which ships with the ggplot2 package and contains price and quality information for about 54,000 diamonds.

In addition, the gridExtra package is used to arrange multiple plots on a page.

library(ggplot2)
library(gridExtra)

data("diamonds")

Describing dataset

The diamonds dataset contains the prices of diamonds together with various characteristics. Some of these are known to influence price, namely the 4 C’s (carat, cut, colour, and clarity); the rest are physical measurements (depth, table, x, y, and z).

head(diamonds)

## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

Here, cut is the quality of the diamond's cut, ranging from Fair (worst) to Ideal (best). Colour is the diamond's colour grade, from D (best) to J (worst). Clarity measures how clear the diamond is: I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best).
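
The ordering of these categories can be checked directly in R. The snippet below is a minimal sketch, assuming ggplot2 is installed; the variable names are illustrative. Note that R stores the colour levels from D to J, so for colour the "lowest" level is the best grade.

```r
library(ggplot2)
data("diamonds")

# Levels of an ordered factor are stored in their defined order
cut_levels     <- levels(diamonds$cut)
color_levels   <- levels(diamonds$color)
clarity_levels <- levels(diamonds$clarity)

print(cut_levels)
print(color_levels)
print(clarity_levels)

# Ordered factors support comparisons: count diamonds with a cut above "Good"
n_better_than_good <- sum(diamonds$cut > "Good")
print(n_better_than_good)
```

Because the variables are ordered factors, ggplot2 also keeps this order in legends and facets.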

The summary() function gives an overview of the entire diamonds dataset. For the ordered factor variables (cut, color, clarity), it shows the count for each category. For the numeric variables (carat, depth, table, price, x, y, z), it shows the minimum, 1st quartile, median, mean, 3rd quartile, and maximum.

summary(diamonds)

##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
##
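
In the output above, the clarity column lumps its two smallest categories into "(Other)", because summary() on a data frame limits how many factor levels it prints. As a small sketch (variable names are illustrative), table() shows every category separately:

```r
library(ggplot2)
data("diamonds")

# Count of diamonds in every clarity category, with no "(Other)" group
clarity_counts <- table(diamonds$clarity)
print(clarity_counts)

# The two categories hidden behind "(Other)" in the summary above
other_total <- sum(clarity_counts[c("I1", "IF")])
print(other_total)
```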

Basic Histogram

Let’s create a simple histogram with diamond price on the x-axis and the number of diamonds at each price on the y-axis. In Figure 1 below, the histogram is highly right-skewed, which makes it hard to infer anything from it. We therefore apply a log transformation in Figure 2.

A log transformation (scale_x_log10()) makes a highly skewed distribution less skewed, which can make patterns in the data easier to interpret.
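
One rough way to see the effect of the transformation numerically is to compare the mean with the median, since a long right tail pulls the mean above the median. This is only an informal sketch, not a formal skewness measure, and the variable names are illustrative.

```r
library(ggplot2)
data("diamonds")

# On the raw scale the mean sits well above the median (right skew)
raw_ratio <- mean(diamonds$price) / median(diamonds$price)

# After a log10 transformation the mean and median nearly coincide
log_price <- log10(diamonds$price)
log_ratio <- mean(log_price) / median(log_price)

print(c(raw = raw_ratio, log10 = log_ratio))
```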

Did you notice something? In Figure 2 there are two peaks, around the $1000 and $6000 price points, which may reflect two distinct groups of buyers, such as middle-class and higher-income families. A distribution with two peaks like this is called a bimodal distribution.

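What is binwidth? A histogram splits the x-axis into equally spaced bins (or buckets) and uses the height of each bar to show the number of observations that fall in that bin; binwidth sets the width of those bins. With scale_x_log10(), the bin width is measured in log10 units, which is why Figure 2 uses a much smaller value than Figure 1. Try changing the binwidth to see how the shape of the histogram changes.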
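
The binning idea can be sketched without any plotting, using base R's cut() and table(). The bin width of 1000 and the variable names below are arbitrary choices for illustration, and ggplot2 may place its bin edges differently, so the counts will not match the figures exactly.

```r
library(ggplot2)
data("diamonds")

# Split the price range into equally spaced bins of width 1000
bin_width <- 1000
breaks <- seq(0, 19000, by = bin_width)

# Count how many diamonds fall into each bin; these are the bar heights
bin_counts <- table(cut(diamonds$price, breaks = breaks))
print(bin_counts)
```

A smaller bin_width gives more, narrower bins and a more detailed but noisier picture.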

# Figure 1: price on the original scale; binwidth is in dollars
plot1 <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(fill = "black", color = "grey", binwidth = 100) +
  ggtitle("Figure 1: Diamond price histogram (skewed)") +
  ylab("Quantity of diamonds") +
  xlab("Diamond price")

# Figure 2: price on a log10 scale; binwidth is in log10 units
plot2 <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(fill = "black", color = "grey", binwidth = 0.01) +
  scale_x_log10() +
  ggtitle("Figure 2: Diamond price histogram (log transformed)") +
  ylab("Quantity of diamonds") +
  xlab("Diamond price")

# Stack the two plots in a single column
grid.arrange(plot1, plot2, ncol = 1)

Here, grid.arrange() places the two histograms in a grid, where ncol is the number of columns. Because ncol is one in this case, the plots are stacked vertically rather than placed side by side.
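
To compare layouts, the sketch below builds both a one-column and a two-column arrangement. It uses arrangeGrob() from gridExtra, which builds the layout without drawing it; the plot objects and variable names here are illustrative stand-ins for the figures above.

```r
library(ggplot2)
library(gridExtra)
data("diamonds")

p_raw <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 100)
p_log <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 0.01) +
  scale_x_log10()

# One column: the plots are stacked on top of each other
stacked <- arrangeGrob(p_raw, p_log, ncol = 1)

# Two columns: the plots sit next to each other
side_by_side <- arrangeGrob(p_raw, p_log, ncol = 2)

# Draw one of the layouts on the current graphics device
grid::grid.draw(side_by_side)
```

Calling grid.arrange(p_raw, p_log, ncol = 2) would build and draw the two-column layout in one step.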

Adding more variables to our histogram

Now let’s look at the prices of diamonds by cut. To do that, we use an aesthetic mapping (aes()), provided by the ggplot2 package, to set the fill colour from the cut variable.

# Map the fill colour of each bar segment to the cut variable
ggplot(diamonds, aes(x = price)) +
  geom_histogram(aes(fill = cut), binwidth = 0.05) +
  scale_x_log10() +
  ggtitle("Differentiating price with cut") +
  ylab("Quantity of diamonds") +
  xlab("Diamond price") +
  labs(fill = "Diamond cut")

Looking at the histogram, most diamonds in the dataset have an Ideal, Premium, or Very Good cut. In contrast, very few have a Fair or Good cut, which matches the counts in the summary() output above.
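
Stacked bars can be hard to compare by eye, so a quick numeric check of how many diamonds fall into each cut is useful alongside the plot. A minimal sketch, with illustrative variable names:

```r
library(ggplot2)
data("diamonds")

# Number of diamonds per cut, and the same as a percentage of all diamonds
cut_counts <- table(diamonds$cut)
cut_share  <- round(100 * prop.table(cut_counts), 1)

print(cut_counts)
print(cut_share)

# Most and least common cut in the dataset
most_common  <- names(which.max(cut_counts))
least_common <- names(which.min(cut_counts))
print(c(most = most_common, least = least_common))
```

Keep in mind that these are counts of diamonds in the dataset, which is not necessarily the same thing as what buyers prefer.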

Adding a third variable to the histogram

What if we want to add a third variable to our existing plot? We can do so with the facet_wrap() or facet_grid() function, which splits a single plot into many related panels.
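
As a rough illustration of the second option, facet_grid() lays the panels out as a matrix defined by two variables, one for rows and one for columns. The sketch below is not part of the original analysis; the binwidth of 0.1 and the choice of cut for rows are arbitrary assumptions.

```r
library(ggplot2)
data("diamonds")

# One panel per combination of cut (rows) and color (columns)
facet_plot <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 0.1) +
  facet_grid(cut ~ color) +
  scale_x_log10() +
  ylab("Quantity of diamonds") +
  xlab("Diamond price")

print(facet_plot)
```

facet_wrap(), by contrast, takes a single sequence of panels and wraps it into rows and columns, which suits one faceting variable with many levels.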

Here, I have applied facet_wrap() to create one panel per diamond colour.

# Fill by cut and draw one panel per diamond colour
ggplot(diamonds, aes(x = price)) +
  geom_histogram(aes(fill = cut)) +
  facet_wrap(~color) +
  scale_x_log10() +
  ggtitle("Price histogram, colored by cut and faceted by diamond color") +
  ylab("Quantity of diamonds") +
  xlab("Diamond price") +
  labs(fill = "Diamond cut")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Interestingly, the histogram shows that E, F, and G are the most common colours in the dataset. At the same time, J, the worst colour grade, is the least common.
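
The same reading can be cross-checked with a contingency table, which gives the number of diamonds behind each panel and each fill colour. A minimal sketch, with illustrative variable names:

```r
library(ggplot2)
data("diamonds")

# Diamonds per colour, and per combination of cut and colour
color_counts <- table(diamonds$color)
cut_by_color <- table(diamonds$cut, diamonds$color)

print(color_counts)
print(cut_by_color)

# The three most common colours and the least common one
top_three    <- names(sort(color_counts, decreasing = TRUE))[1:3]
least_common <- names(which.min(color_counts))
print(top_three)
print(least_common)
```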

Conclusion

A simple histogram on its own did not reveal much, but as we combined more variables into it, we were able to discover associations and hidden patterns among them. In the real world, this approach is very useful for finding connections in large amounts of data.

Multivariate data visualization using ggplot

Tirth Patel

23 March 2020