Most people want to visualize data in compelling ways so that they can extract meaningful information from it, but many do not know how. Mapping additional variables onto a plot makes the data easier to understand and helps reveal correlations among those variables.
This vignette explains how to add more variables to an existing visualization to uncover relationships among them. We start with a simple histogram and, by adding more layers, build up to visualizing a multivariate dataset with the ggplot2 package in R.
I use the diamonds dataset in this vignette. It ships with the ggplot2 package and contains price and quality information for about 54,000 diamonds.
In addition, the gridExtra package is used to arrange multiple plots on a page.
library(ggplot2)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.6.3
data("diamonds")
The diamonds dataset contains the prices of diamonds together with various characteristics, some of which are known to influence price: the 4 C’s (carat, cut, colour, and clarity) and several physical measurements (depth, table, x, y, and z).
head(diamonds)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
Here, cut is the quality of the diamond’s cut, ranging from Fair (worst) to Ideal (best). Colour (the color column) is the diamond’s colour grade, from D (best) to J (worst). Clarity measures how clear the diamond is: I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best).
The summary() function gives a quick overview of the entire diamonds dataset. For the ordered factors (cut, color, clarity), it shows the count for each category; for clarity, the least frequent levels are grouped under (Other). For the numeric variables (price, carat, depth, table, x, y, z), it shows the minimum, first quartile, median, mean, third quartile, and maximum.
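The ordering of these grades can be checked directly in R. The short sketch below assumes only that ggplot2 is installed. It prints the levels of the three ordered factors in the order in which they are stored, and the variable names it introduces (cut_levels, n_premium_or_better) are purely illustrative.

```r
library(ggplot2)

# Levels are printed in the order stored in each ordered factor.
# cut and clarity run from worst to best; color runs from D (best) to J (worst).
cut_levels <- levels(diamonds$cut)
color_levels <- levels(diamonds$color)
clarity_levels <- levels(diamonds$clarity)
print(cut_levels)
print(color_levels)
print(clarity_levels)

# Because the factors are ordered, grades can be compared directly
n_premium_or_better <- sum(diamonds$cut >= "Premium")
print(n_premium_or_better)
```

Note that a "higher" level means a better grade for cut and clarity, but a worse grade for color, so comparisons on color need to be read the other way round.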
summary(diamonds)
##      carat               cut        color        clarity          depth
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00
##                                     J: 2808   (Other): 2531
##      table           price             x                y
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900
##
##        z
##  Min.   : 0.000
##  1st Qu.: 2.910
##  Median : 3.530
##  Mean   : 3.539
##  3rd Qu.: 4.040
##  Max.   :31.800
##
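The two styles of summary above follow from the class of each column. As a minimal sketch (the name col_classes is just for illustration), the class of every column can be listed, and summary() can be called on a single ordered factor and a single numeric column to see the difference in isolation.

```r
library(ggplot2)

# First class of each column: "ordered" columns are summarised as counts,
# "numeric" and "integer" columns as quantiles plus the mean
col_classes <- sapply(diamonds, function(col) class(col)[1])
print(col_classes)

# The same distinction for individual columns
summary(diamonds$cut)    # counts per level
summary(diamonds$price)  # min, quartiles, mean, max
```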
Let’s create a simple histogram with diamond price on the x-axis and the number of diamonds at each price on the y-axis. In Figure 1 below, the histogram is heavily right-skewed, which makes it hard to see any structure in the data. We therefore apply a log transformation in Figure 2.
A log transformation (scale_x_log10()) makes a highly skewed distribution less skewed, which can make patterns in the data easier to interpret.
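The effect of the log transformation on skewness can also be checked numerically. The sketch below defines a small moment-based skewness() helper; this helper is hypothetical and written here for illustration, it is not part of ggplot2 or base R. A value near zero indicates a roughly symmetric distribution, and a large positive value indicates a long right tail.

```r
library(ggplot2)

# Hypothetical helper: moment-based sample skewness
skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

skew_raw <- skewness(diamonds$price)         # raw prices
skew_log <- skewness(log10(diamonds$price))  # log10-transformed prices

print(c(raw = skew_raw, log10 = skew_log))
```

If the transformation does what is described above, the second value should be noticeably closer to zero than the first.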
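Did you notice something? In Figure 2 there are two peaks, at roughly $1,000 and $6,000. One possible interpretation is that they reflect two groups of buyers with different budgets, although the data alone cannot confirm this. A distribution with two peaks like this is called a bimodal distribution.
What is binwidth? A histogram splits the x-axis into equally spaced bins (or buckets) and uses the height of each bar to show the number of observations that fall in that bin; binwidth sets the width of each bin. When the axis is log-transformed, binwidth is measured in log10 units, which is why Figure 2 uses a much smaller value than Figure 1. Try changing the binwidth to see how the picture changes.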
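To make the idea of binning concrete, the counts behind a histogram can be computed without drawing anything. This sketch uses base R's hist() with plot = FALSE and two illustrative bin widths ($1,000 and $500) chosen for this example. The bin edges used here are an assumption and may not line up exactly with the ones geom_histogram() picks by default.

```r
library(ggplot2)

# Bin prices into $1000-wide buckets; the range 0..19000 covers all prices
breaks_1000 <- seq(0, 19000, by = 1000)
bins_1000 <- hist(diamonds$price, breaks = breaks_1000, plot = FALSE)

# Halving the bin width doubles the number of bins
breaks_500 <- seq(0, 19000, by = 500)
bins_500 <- hist(diamonds$price, breaks = breaks_500, plot = FALSE)

n_bins_1000 <- length(bins_1000$counts)
n_bins_500 <- length(bins_500$counts)
print(c(n_bins_1000, n_bins_500))

# Every diamond lands in exactly one bin, whatever the width
print(sum(bins_1000$counts) == nrow(diamonds))
```

Narrower bins show more detail but also more noise; wider bins give a smoother but coarser picture.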
# Figure 1: raw prices, binned in steps of $100
plot1 <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(fill = "black", color = "grey", binwidth = 100) +
  ggtitle("Figure 1: Diamond price histogram (skewed)") +
  ylab("Quantity of diamonds") +
  xlab("Diamond price")

# Figure 2: same data on a log10 x-axis; binwidth is in log10 units
plot2 <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(fill = "black", color = "grey", binwidth = 0.01) +
  scale_x_log10() +
  ggtitle("Figure 2: Diamond price histogram (log transformed)") +
  ylab("Quantity of diamonds") +
  xlab("Diamond price")

# Stack the two plots in a single column
grid.arrange(plot1, plot2, ncol = 1)
Here, grid.arrange() places the two histograms on one page. ncol is the number of columns in the grid; with ncol = 1, the plots are stacked one above the other.
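For comparison, setting ncol = 2 puts the plots next to each other in one row. The following is a minimal sketch with two simplified plots (p_raw and p_log are illustrative names, and titles and colours are left out to keep it short). It also uses arrangeGrob() from gridExtra, which builds the same layout as an object without drawing it.

```r
library(ggplot2)
library(gridExtra)

p_raw <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 100)

p_log <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 0.01) +
  scale_x_log10()

# Build a one-row, two-column layout without drawing it
side_by_side <- arrangeGrob(p_raw, p_log, ncol = 2)

# Build the same layout and draw it on the current device
grid.arrange(p_raw, p_log, ncol = 2)
```

Stacking (ncol = 1) keeps the x-axes wide, which suits histograms of a long-tailed variable; a side-by-side layout is usually better when the plots share a y-axis range.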
Now let’s look at diamond prices by cut. For that, we use an aesthetic mapping (aes()), part of the ggplot2 package, to map the cut variable to the fill colour.
ggplot(diamonds, aes(x = price)) +
  geom_histogram(aes(fill = cut), binwidth = 0.05) +
  scale_x_log10() +
  ggtitle("Differentiating price with cut") +
  ylab("Quantity of diamonds") +
  xlab("Diamond price") +
  labs(fill = "Diamond cut")
Looking at the histogram, Ideal-cut diamonds make up the largest share of the dataset, followed by Premium and Very Good, while Fair-cut diamonds are the rarest. This matches the counts reported by summary() above (21,551 Ideal versus 1,610 Fair). Note that these are counts of diamonds in the dataset, so they do not by themselves show what buyers prefer.
What if we want to add a third variable to our existing plot? We can do that with facet_wrap() or facet_grid(), which split a single plot into several related panels.
Here, I have applied facet_wrap() to create one panel per diamond colour.
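Totals are hard to read off a stacked histogram, so it can help to tabulate the counts per cut directly. This is a small sketch using base R's table() and prop.table(); the variable names are illustrative.

```r
library(ggplot2)

# Number of diamonds in each cut category
cut_counts <- table(diamonds$cut)
print(cut_counts)

# The same counts as percentages of the whole dataset
cut_share <- round(100 * prop.table(cut_counts), 1)
print(cut_share)

# Name of the most frequent cut
most_common <- names(which.max(cut_counts))
print(most_common)
```

These counts agree with the summary() output shown earlier: Ideal is the largest group and Fair the smallest.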
# No binwidth is set here, so stat_bin() falls back to its default of 30 bins
ggplot(diamonds, aes(x = price)) +
  geom_histogram(aes(fill = cut)) +
  facet_wrap(~color) +
  scale_x_log10() +
  ggtitle("Price histogram, coloured by cut and faceted by diamond colour") +
  ylab("Quantity of diamonds") +
  xlab("Diamond price") +
  labs(fill = "Diamond cut")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Interestingly, the histogram shows that E, F and G are the most common colours in the dataset, while J, the worst colour grade, is the least common.
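facet_grid(), mentioned earlier as an alternative to facet_wrap(), lays panels out in a two-way grid defined by two variables. As a rough sketch, cut can be moved from the fill colour to the rows of the grid, with colour in the columns. The binwidth of 0.1 (in log10 units) is an arbitrary choice for this example, not a recommended value.

```r
library(ggplot2)

p_grid <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 0.1) +
  facet_grid(cut ~ color) +  # rows: cut, columns: colour
  scale_x_log10() +
  ylab("Quantity of diamonds") +
  xlab("Diamond price")

# Draw the grid of panels
print(p_grid)
```

facet_wrap() is usually the better fit for a single faceting variable with many levels, whereas facet_grid() makes it easier to compare across two variables at once, at the cost of smaller panels.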
A simple histogram on its own did not reveal much, but as we combine more variables into it, we are able to discover associations and patterns that were hidden before. In real-world work, this is very useful for finding relationships within large amounts of data.