1. A few ways to visualize a family of density plots

Let’s say we want to visualize the distribution of entry-level lawyers’ salaries in 4 different towns. Let’s start by creating a dataset and showing its first few rows.

Just as a note, I am leaving all the code snippets in on purpose, together with the graphs and tables they output, so that you can see how I got there.

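library(dplyr)       # %>%, group_by(), summarise(), filter()
library(ggplot2)     # plotting
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

# Simulated salaries: 200 normal draws per city
df <- data.frame(
  City = factor(rep(c("Hong Kong", "New York", "Mumbai", "Paris"), each = 200)),
  Salary = round(c(rnorm(200, mean = 30000, sd = 5000),
                   rnorm(200, mean = 50000, sd = 10000),
                   rnorm(200, mean = 15000, sd = 3000),
                   rnorm(200, mean = 45000, sd = 7000))))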

head(df, 10) %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover"),row_label_position = "l", full_width = F, position = "center")
City Salary
Hong Kong 33558
Hong Kong 27076
Hong Kong 28727
Hong Kong 28638
Hong Kong 31100
Hong Kong 26946
Hong Kong 30356
Hong Kong 29554
Hong Kong 36649
Hong Kong 27698

Maybe the first thing we would like to do is draw a density plot of all the data at hand:

ggplot(df, aes(x = Salary)) + geom_density(fill = "gray", alpha = 0.3) +
  ggtitle("Density plot of salaries") +
  theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5)) 

This shows that roughly half of the lawyers have a salary of less than 35000, since about half of the area under the curve lies to the left of that value. Now that we have the global picture, we might want one density curve for each city:

p <- ggplot(df, aes(x=Salary, color=City)) +
  geom_density() +
  ggtitle("Density plot of salaries in 4 towns") +
  theme(plot.title = element_text(hjust = 0.5))

p
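
Shares like the one quoted from the overall density plot can also be checked numerically instead of being read off the curve. Here is a minimal sketch, assuming a re-simulated salary vector with the same parameters as df (the seed is my own arbitrary choice, so the numbers will not match the tables in this post exactly):

```r
# Re-simulate salaries with the same parameters as df (arbitrary seed, an assumption)
set.seed(42)
salary <- round(c(rnorm(200, mean = 30000, sd = 5000),
                  rnorm(200, mean = 50000, sd = 10000),
                  rnorm(200, mean = 15000, sd = 3000),
                  rnorm(200, mean = 45000, sd = 7000)))

# Empirical share of salaries below the threshold
threshold <- 35000
share_below <- mean(salary < threshold)

# The same share approximated as the area under the kernel density curve
d <- density(salary)
step <- d$x[2] - d$x[1]  # density() evaluates the curve on an evenly spaced grid
area_below <- sum(d$y[d$x < threshold]) * step

share_below
area_below
```

The two numbers should be close to each other, since the density curve is just a smoothed version of the same data.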

And maybe we would want to add mean lines and fill in the area under the curves.

So first we would calculate the mean salary for each city:

mean_data <- df %>% group_by(City) %>%
             summarise(Salary_Mean = mean(Salary))

mean_data %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover"),row_label_position = "l", full_width = F, position = "center")
City Salary_Mean
Hong Kong 29768.96
Mumbai 14641.94
New York 50263.44
Paris 44746.46
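
The same summarise() call can return a measure of spread next to the mean, which helps when comparing how wide each curve is. A minimal sketch, assuming dplyr is installed and using a re-simulated data frame (the names df_sim and spread_data and the seed are my own choices):

```r
library(dplyr)

set.seed(42)  # arbitrary seed, so values differ from the tables in this post
df_sim <- data.frame(
  City = factor(rep(c("Hong Kong", "New York", "Mumbai", "Paris"), each = 200)),
  Salary = round(c(rnorm(200, mean = 30000, sd = 5000),
                   rnorm(200, mean = 50000, sd = 10000),
                   rnorm(200, mean = 15000, sd = 3000),
                   rnorm(200, mean = 45000, sd = 7000))))

# One row per city with its mean and standard deviation
spread_data <- df_sim %>%
  group_by(City) %>%
  summarise(Salary_Mean = mean(Salary),
            Salary_SD = sd(Salary))

spread_data
```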

And then we would plot the graph:

ggplot(df, aes(x=Salary, fill = City)) +
  geom_density(alpha = 0.4) +
  ggtitle("Density plot of salaries in 4 towns") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  geom_vline(data = mean_data, aes(xintercept=Salary_Mean, color=City), linetype="dashed") 

We can then see that New York has the highest mean and the biggest spread in salaries, whereas Mumbai has the lowest mean and the least fluctuation in salaries. We might also want to add histograms, to make it easier to estimate the share of lawyers in a given salary interval:

ggplot(df, aes(x=Salary, fill = City)) +
  geom_density(alpha = 0.4) +
  # after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
  geom_histogram(aes(y = after_stat(density)), alpha=0.5, binwidth = 1000, position = "identity")+
  ggtitle("Density plot of salaries in 4 towns") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  geom_vline(data = mean_data, aes(xintercept=Salary_Mean, color=City), linetype="dashed") 

In order to make this graph more readable, we could split it into one panel per city:

ggplot(df, aes(x=Salary, fill = City)) +
  geom_density(alpha = 0.4) +
  geom_histogram(aes(y = after_stat(density)), alpha=0.5, binwidth = 1000, position = "identity")+
  ggtitle("Density plot of salaries in 4 towns") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  geom_vline(data = mean_data, aes(xintercept=Salary_Mean, color=City), linetype="dashed") +
  facet_wrap(~City)

2. Difference between layout() and split.screen() commands

In all the examples above except the last one, each output contains only one graph. But what if we wanted to show 3 histograms in one figure, with the histogram of salaries in Hong Kong in the top row and the bottom row split in 2, with New York on the left and Paris on the right? The split.screen() and layout() functions are the solution. Here is how we can obtain that with split.screen():

# Draw a base-R histogram of the salaries in city x; y scales the text sizes
histogram_plotter <- function(x, y, color) {
  hist(df[df$City == x, ]$Salary,
       breaks = 50,
       border = FALSE,
       col = color,
       xlab = "Salary",
       cex.main = y,
       cex.lab = 0.9 * y,
       cex.axis = 0.9 * y,
       main = paste("Histogram of salaries in", x))
}

split.screen(c(2,1))
## [1] 1 2
split.screen(c(1,2), screen=2)
## [1] 3 4
screen(1)
histogram_plotter("Hong Kong",0.9,"gray74")
screen(3)
histogram_plotter("New York",0.8,"orangered")
screen(4)
histogram_plotter("Paris",0.8,"deepskyblue2")
close.screen(all.screens = TRUE)  # leave split-screen mode before drawing anything else

But what if we wanted the “Paris” graph to take up 31% of the width of the bottom row and the “New York” graph to take up 69% of it? (I don’t know why we would want to be that precise, but let’s say our boss is very demanding.) The rows-and-columns form of split.screen() used above always splits the screen evenly, so it cannot do this. This is where the layout() function comes in handy: it lets us define the relative widths of the columns and heights of the rows very precisely, as follows:

nf <- layout(matrix(c(1,1,2,3), nrow = 2, ncol = 2, byrow = TRUE),
             widths = c(6.9,3.1), heights = c(3.3,3.3), respect = TRUE)

histogram_plotter("Hong Kong",0.9,"gray74")
histogram_plotter("New York",0.8,"orangered")
histogram_plotter("Paris",0.65,"deepskyblue2")

Now our boss is happy :)

To summarise, layout() gives us additional flexibility in defining the size of each part of the screen.
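
Before drawing anything, the regions defined by layout() can be previewed with layout.show(), which outlines and numbers each region. A minimal sketch using the same matrix and widths as above (the pdf(NULL) line is my own addition, only there so that the snippet also runs without a screen device, and n_regions is a name I made up):

```r
pdf(NULL)  # null graphics device; drop this line in an interactive session

m <- matrix(c(1, 1, 2, 3), nrow = 2, ncol = 2, byrow = TRUE)

# layout() returns the number of regions it defined
n_regions <- layout(m, widths = c(6.9, 3.1), heights = c(3.3, 3.3))

layout.show(n_regions)  # draws the outline and number of each region

invisible(dev.off())
```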

Note that in the examples above I did not use the ggplot2 library to draw the histograms. That is because it is incompatible with these functions. If we wanted to split the screen the same way using ggplot2, we would have to use the grid.arrange() function of the gridExtra package, as follows:

HK <- df %>% filter(City == "Hong Kong") %>% ggplot(aes(x=Salary)) +
  geom_density(fill = "gray74") +
  ggtitle("Hong Kong salary density plot") +
  theme(plot.title = element_text(hjust = 0.5))

NY <- df %>% filter(City == "New York") %>% ggplot(aes(x=Salary)) +
  geom_density(fill = "orangered") +
  ggtitle("New York salary density plot") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  theme(plot.title = element_text(size=10),
        axis.title = element_text(size=8))

Par <- df %>% filter(City == "Paris") %>% ggplot(aes(x=Salary)) +
  geom_density(fill = "deepskyblue") +
  ggtitle("Paris salary density plot") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  theme(plot.title = element_text(size=10),
        axis.title = element_text(size=8))
  
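library(gridExtra)  # provides grid.arrange()

# Plot 1 spans the top row; plots 2 and 3 share the bottom row
lay <- rbind(c(1,1),
             c(2,3))

grid.arrange(HK,NY,Par,layout_matrix = lay)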
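
grid.arrange() also accepts a widths argument, so the 69%/31% split from the layout() example can be reproduced with ggplot2 objects as well. A minimal sketch, assuming ggplot2 and gridExtra are installed; the toy data frame and the plot g are placeholders of my own standing in for HK, NY and Par:

```r
library(ggplot2)
library(gridExtra)

set.seed(1)  # arbitrary seed
toy <- data.frame(Salary = rnorm(100, mean = 30000, sd = 5000))  # placeholder data
g <- ggplot(toy, aes(x = Salary)) + geom_density(fill = "gray74")

lay <- rbind(c(1, 1),
             c(2, 3))

# Numeric widths are relative: the left column gets 69% and the right one 31%
combined <- grid.arrange(g, g, g, layout_matrix = lay, widths = c(6.9, 3.1))
```

The heights argument works the same way for the rows.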

3. Visualization examples

In my portfolio I have, first of all, all the plots created above :) Additionally, here are some visualizations that I like to use. The dataset used is a movie rating dataset.

Here is an extract of it:

head(movies, 10) %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover"),row_label_position = "l", full_width = F, position = "center")
Film Genre CriticRating AudienceRating BudgetMillions Year
(500) Days of Summer Comedy 87 81 8 2009
10000 B.C. Adventure 9 44 105 2008
12 Rounds Action 30 52 20 2009
127 Hours Adventure 93 84 18 2010
17 Again Comedy 55 70 20 2009
2012 Action 39 63 200 2009
27 Dresses Comedy 40 71 30 2008
30 Days of Night Horror 50 57 32 2007
30 Minutes or Less Comedy 43 48 28 2011
50/50 Comedy 93 93 8 2011

Maybe the first thing that we would like to know is how many movies there are in each genre. For that, we could use a coxcomb chart as follows:

s <- ggplot(data = movies) + geom_bar(mapping = aes(x = Genre, fill = Genre)) + coord_polar()

s

We see that comedy and action movies are the most common and romance and adventure movies are the least common ones.
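
The counts behind a chart like this can also be tabulated directly. A minimal sketch with a tiny made-up data frame (movies_toy is hypothetical and is not the dataset used in this post):

```r
# Hypothetical toy data, only to show the mechanics
movies_toy <- data.frame(
  Film = c("A", "B", "C", "D", "E", "F"),
  Genre = factor(c("Comedy", "Action", "Comedy", "Horror", "Action", "Comedy")))

# Number of movies per genre, most common first
genre_counts <- sort(table(movies_toy$Genre), decreasing = TRUE)
genre_counts
```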

Next, we might wonder how the critic and audience ratings correlate. For that, we could use a scatter plot with smoothing lines as follows:

u <- ggplot(data = movies, aes(x=CriticRating, y=AudienceRating, color = Genre)) +
  geom_point(size = 1) +
  geom_smooth(fill = NA) +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("How are critic and audience ratings correlated?")

u 

We see that audience ratings tend to be concentrated between 40 and 80, whereas critic ratings cover the full spectrum. We can derive other interesting insights, such as that romance movies with a low critic rating (between 12 and 25) tend to have a high audience rating (between 50 and 65). Horror movies with a critic rating of 75 tend to have a lower audience rating than action movies.

If we wanted to understand how this has evolved over the years for each genre, we could add facets:

w <- ggplot(data = movies, aes(x=CriticRating, y=AudienceRating, colour = Genre)) +
  geom_point(aes(size = BudgetMillions), alpha = 0.3)  # alpha is a constant, so it goes outside aes()

w + geom_smooth() +
  facet_grid(Genre ~ Year) +
  coord_cartesian(ylim=c(0,100))

Interestingly, action movies with low critic ratings in 2010 and 2011 had higher audience ratings than those in 2008 and 2009.

Finally, if we wanted to understand how the audience ratings are distributed for each genre, we could use boxplots as follows:

v <- ggplot(data = movies, aes(x=Genre, y=AudienceRating, color = Genre))
v + geom_jitter(size = 0.8) + geom_boxplot(size = 0.8, alpha = 0.4) +
  ggtitle("Audience rating distribution per Genre") +
  theme(plot.title = element_text(hjust = 0.5))

We see that thrillers and romance movies have similar median ratings but the ratings for the thrillers are more concentrated around the median than for romance movies.
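
Statements like this one can be backed by the numbers the boxplots are drawn from. A minimal sketch, assuming dplyr is installed and again using a small made-up data frame (ratings_toy and its values are hypothetical, not taken from the movie dataset):

```r
library(dplyr)

# Hypothetical toy ratings: same median, different spread
ratings_toy <- data.frame(
  Genre = rep(c("Thriller", "Romance"), each = 5),
  AudienceRating = c(58, 60, 62, 64, 66,    # tightly grouped
                     40, 50, 62, 74, 84))   # more spread out

# Median and interquartile range per genre
rating_summary <- ratings_toy %>%
  group_by(Genre) %>%
  summarise(Rating_Median = median(AudienceRating),
            Rating_IQR = IQR(AudienceRating))

rating_summary
```

A similar median with a smaller interquartile range corresponds to a box at the same height that is shorter.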

That’s it, thank you for reading this far!