Project application

1. A few ways to visualize a family of density plots

Let’s say we want to visualize the distribution of entry-level lawyers’ salaries in 4 different towns. Let’s start by creating a dataset and showing its first few rows.

Just as a note, I am deliberately leaving in all the code snippets, together with the graphs and tables they output, so that you can see how I got there.

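library(dplyr)       # %>%, group_by(), summarise(), filter()
library(ggplot2)     # ggplot() and the geom_*() layers
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
library(gridExtra)   # grid.arrange()
# Simulated salaries, 200 per city; the values are random and vary between runs unless a seed is set
df <- data.frame(
  City = factor(rep(c("Hong Kong", "New York", "Mumbai", "Paris"), each = 200)),
  Salary = round(c(rnorm(200, mean = 30000, sd = 5000),
                   rnorm(200, mean = 50000, sd = 10000),
                   rnorm(200, mean = 15000, sd = 3000),
                   rnorm(200, mean = 45000, sd = 7000))))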

head(df, 10) %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover"),row_label_position = "l", full_width = F, position = "center")

City	Salary
Hong Kong	33558
Hong Kong	27076
Hong Kong	28727
Hong Kong	28638
Hong Kong	31100
Hong Kong	26946
Hong Kong	30356
Hong Kong	29554
Hong Kong	36649
Hong Kong	27698

A first way to visualize the data is a single density plot for all the data at hand:

ggplot(df, aes(x = Salary)) + geom_density(fill = "gray", alpha = 0.3) +
  ggtitle("Density plot of salaries") +
  theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5))

From the area under the curve to the left of 35000, this plot lets us estimate that roughly half of the lawyers have a salary of less than 35000. Now that we have the global picture, we may want one density curve for each city:

p <- ggplot(df, aes(x=Salary, color=City)) +
  geom_density() +
  ggtitle("Density plot of salaries in 4 towns") +
  theme(plot.title = element_text(hjust = 0.5))

p
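The share of lawyers below a given salary corresponds to the area under the overall density curve to the left of that salary, so an estimate read off the plot can be checked directly against the data. Here is a minimal sketch in base R. The seed and the names `salary`, `threshold`, `share_below` and `salary_cdf` are assumptions made for this illustration, so the simulated values will not match the table above exactly.

```r
# Assumption: a fixed seed (not in the original code) so the sketch is reproducible
set.seed(42)

# Same simulation parameters as the dataset above, 200 salaries per city
salary <- round(c(rnorm(200, mean = 30000, sd = 5000),   # Hong Kong
                  rnorm(200, mean = 50000, sd = 10000),  # New York
                  rnorm(200, mean = 15000, sd = 3000),   # Mumbai
                  rnorm(200, mean = 45000, sd = 7000)))  # Paris

# Share of all salaries strictly below the threshold,
# i.e. the area under the density curve to the left of it
threshold <- 35000
share_below <- mean(salary < threshold)
share_below

# The empirical CDF gives the share at or below any threshold
salary_cdf <- ecdf(salary)
salary_cdf(threshold)
```

Changing `threshold` gives the share for any other salary level.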

And maybe we would want to add mean lines and fill in the area under the curves.

So first we would calculate the mean salary for each city:

mean_data <- df %>% group_by(City) %>%
             summarise(Salary_Mean = mean(Salary))

mean_data %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover"),row_label_position = "l", full_width = F, position = "center")

City	Salary_Mean
Hong Kong	29768.96
Mumbai	14641.94
New York	50263.44
Paris	44746.46
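The mean only tells us where each curve is centred. How wide each curve is can be summarised the same way, with one standard deviation per city. Here is a minimal sketch using base R's tapply(); the seed and the names `df_sketch` and `sd_by_city` are assumptions made for this illustration, and the simulated values will differ from the tables above.

```r
# Assumption: a fixed seed (not in the original code) so the sketch is reproducible
set.seed(42)

# Same simulation parameters as the dataset above
df_sketch <- data.frame(
  City = factor(rep(c("Hong Kong", "New York", "Mumbai", "Paris"), each = 200)),
  Salary = round(c(rnorm(200, mean = 30000, sd = 5000),
                   rnorm(200, mean = 50000, sd = 10000),
                   rnorm(200, mean = 15000, sd = 3000),
                   rnorm(200, mean = 45000, sd = 7000))))

# One standard deviation per city (a named vector, ordered by factor level)
sd_by_city <- tapply(df_sketch$Salary, df_sketch$City, sd)
round(sd_by_city)

# Cities with the widest and the narrowest salary distribution
widest <- names(sd_by_city)[which.max(sd_by_city)]
narrowest <- names(sd_by_city)[which.min(sd_by_city)]
c(widest = widest, narrowest = narrowest)
```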

And then we would plot the graph:

ggplot(df, aes(x=Salary, fill = City)) +
  geom_density(alpha = 0.4) +
  ggtitle("Density plot of salaries in 4 towns") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  geom_vline(data = mean_data, aes(xintercept=Salary_Mean, color=City), linetype="dashed")

We can then see that New York has the highest mean and the biggest spread in salaries, whereas Mumbai has the lowest mean and the least variation in salaries. We may also want to add histograms, which make it easier to estimate the share of lawyers in a given salary interval:

ggplot(df, aes(x=Salary, fill = City)) +
  geom_density(alpha = 0.4) +
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5, binwidth = 1000, position = "identity") +
  ggtitle("Density plot of salaries in 4 towns") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  geom_vline(data = mean_data, aes(xintercept=Salary_Mean, color=City), linetype="dashed")

To make this graph more readable, we could split it up to have one graph for each city as follows:

ggplot(df, aes(x=Salary, fill = City)) +
  geom_density(alpha = 0.4) +
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5, binwidth = 1000, position = "identity") +
  ggtitle("Density plot of salaries in 4 towns") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  geom_vline(data = mean_data, aes(xintercept=Salary_Mean, color=City), linetype="dashed") +
  facet_wrap(~City)
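The histograms above are drawn on the density scale so that they line up with the density curves. On that scale the height of a bar times its width is the share of observations in the bin, which is what makes it possible to read shares off the bars. Here is a minimal sketch of that arithmetic with base R's hist(), for a single city. The seed, the 25000 to 35000 interval and the variable names are assumptions made for this illustration.

```r
# Assumption: a fixed seed and a single simulated city, for illustration only
set.seed(42)
salary_hk <- round(rnorm(200, mean = 30000, sd = 5000))

# Bins of width 1000, aligned on multiples of 1000
binwidth <- 1000
breaks <- seq(floor(min(salary_hk) / binwidth) * binwidth,
              ceiling(max(salary_hk) / binwidth) * binwidth,
              by = binwidth)
h <- hist(salary_hk, breaks = breaks, plot = FALSE)

# Bar areas (height times width) add up to 1 on the density scale
total_area <- sum(h$density * binwidth)

# Share of salaries between 25000 and 35000, read from the bars ...
in_range <- h$mids > 25000 & h$mids < 35000
share_from_bars <- sum(h$density[in_range] * binwidth)

# ... and computed directly (hist() bins are right-closed by default)
share_direct <- mean(salary_hk > 25000 & salary_hk <= 35000)

c(from_bars = share_from_bars, direct = share_direct)
```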

2. Difference between layout() and split.screen() commands

In all the examples above except the last one, each output contains only one graph. But what if we wanted to visualise 3 histograms in one figure, with the top row showing the histogram of salaries in Hong Kong and the bottom row split in 2, with the histogram for New York on the left and the one for Paris on the right? The split.screen() and layout() functions are the solution. Here is how we can obtain that with split.screen():

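# Draw a base-graphics histogram of the salaries in one city.
#   x:     city name, as it appears in df$City
#   y:     text scaling factor (cex) for the title; labels and axes use 0.9 * y
#   color: fill colour of the bars
histogram_plotter <- function(x, y, color) {
  hist(df[df$City == x, ]$Salary,
       breaks = 50,
       border = FALSE,
       col = color,
       xlab = "Salary",
       cex.main = y,
       cex.lab = 0.9 * y,
       cex.axis = 0.9 * y,
       main = paste("Histogram of salaries in", x))
}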

split.screen(c(2,1))

## [1] 1 2

split.screen(c(1,2), screen=2)

## [1] 3 4

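screen(1)
histogram_plotter("Hong Kong",0.9,"gray74")
screen(3)
histogram_plotter("New York",0.8,"orangered")
screen(4)
histogram_plotter("Paris",0.8,"deepskyblue2")
# Leave split-screen mode so that later plots use the whole device again
close.screen(all.screens = TRUE)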

But what if we wanted the “Paris” graph to take up 31% of the width of the bottom row and the “New York” graph to take up 69% of it? (I don’t know why we would want to be that precise, but let’s say our boss is very demanding.) The rows-and-columns form of split.screen() used above only produces equal splits; unequal ones require passing a matrix of screen coordinates by hand. This is where the layout() function comes in handy: it lets us define the relative widths of the columns and heights of the rows directly, as follows:

nf <- layout(matrix(c(1, 1, 2, 3), nrow = 2, ncol = 2, byrow = TRUE),
             widths = c(6.9, 3.1), heights = c(3.3, 3.3), respect = TRUE)

histogram_plotter("Hong Kong",0.9,"gray74")
histogram_plotter("New York",0.8,"orangered")
histogram_plotter("Paris",0.65,"deepskyblue2")

Now our boss is happy :)

To summarise, layout() gives us additional flexibility in defining the size of each part of the screen.
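To see how the arguments of layout() map to the figure, it can help to look at the matrix and the relative widths on their own. In the matrix, each number is the index of the plot that occupies that cell, so a repeated number means the plot spans several cells. The sketch below uses names of my choosing (`panel_matrix`, `col_widths`, `width_share`, `n_fig`) and calls layout.show() to outline the regions on the current graphics device.

```r
# Cell (row, column) -> index of the plot drawn there:
# plot 1 spans the whole top row, plots 2 and 3 share the bottom row
panel_matrix <- matrix(c(1, 1, 2, 3), nrow = 2, ncol = 2, byrow = TRUE)
panel_matrix

# Relative column widths; only their ratio matters
col_widths <- c(6.9, 3.1)
width_share <- col_widths / sum(col_widths)
width_share

# layout() returns the number of figures it has set up
n_fig <- layout(panel_matrix, widths = col_widths,
                heights = c(3.3, 3.3), respect = TRUE)

# Outline the numbered regions to check the split before plotting
layout.show(n_fig)
```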

Note that in the examples above I did not use the ggplot2 library to draw the histograms. That is because ggplot2 is built on grid graphics, which is incompatible with these base-graphics functions. To get the same split of the screen with ggplot2 plots, we can use the grid.arrange() function of the gridExtra package as follows:

HK <- df %>% filter(City == "Hong Kong") %>% ggplot(aes(x=Salary)) +
  geom_density(fill = "gray74") +
  ggtitle("Hong Kong salary density plot") +
  theme(plot.title = element_text(hjust = 0.5))

NY <- df %>% filter(City == "New York") %>% ggplot(aes(x=Salary)) +
  geom_density(fill = "orangered") +
  ggtitle("New York salary density plot") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  theme(plot.title = element_text(size=10),
        axis.title = element_text(size=8))

Par <- df %>% filter(City == "Paris") %>% ggplot(aes(x=Salary)) +
  geom_density(fill = "deepskyblue") +
  ggtitle("Paris salary density plot") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  theme(plot.title = element_text(size=10),
        axis.title = element_text(size=8))
  
lay <- rbind(c(1,1),
             c(2,3))

grid.arrange(HK,NY,Par,layout_matrix = lay)
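The unequal 69% / 31% split of the bottom row can be reproduced here too, since grid.arrange() accepts a widths argument giving the relative column widths. The sketch below assumes the ggplot2 and gridExtra packages are installed. The seed, the reduced three-city dataset `df_sketch` and the helper `density_plot()` are hypothetical names introduced for this illustration.

```r
library(ggplot2)
library(gridExtra)

# Assumption: a fixed seed and a reduced three-city dataset, for illustration only
set.seed(42)
df_sketch <- data.frame(
  City = rep(c("Hong Kong", "New York", "Paris"), each = 200),
  Salary = round(c(rnorm(200, mean = 30000, sd = 5000),
                   rnorm(200, mean = 50000, sd = 10000),
                   rnorm(200, mean = 45000, sd = 7000))))

# Hypothetical helper: one density plot for one city
density_plot <- function(city, fill) {
  ggplot(df_sketch[df_sketch$City == city, ], aes(x = Salary)) +
    geom_density(fill = fill) +
    ggtitle(paste(city, "salary density plot"))
}

# Plot 1 spans the top row, plots 2 and 3 share the bottom row
lay <- rbind(c(1, 1),
             c(2, 3))

# widths gives the relative column widths, like the widths argument of layout()
g <- grid.arrange(density_plot("Hong Kong", "gray74"),
                  density_plot("New York", "orangered"),
                  density_plot("Paris", "deepskyblue"),
                  layout_matrix = lay,
                  widths = c(6.9, 3.1))
```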

3. Visualization examples

My portfolio includes, first of all, the plots created above :) Additionally, here are some visualizations that I like to use. The dataset used is a movie rating dataset.

Here is an extract of it:

head(movies, 10) %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover"),row_label_position = "l", full_width = F, position = "center")

Film	Genre	CriticRating	AudienceRating	BudgetMillions	Year
Days of Summer	Comedy	87	81	8	2009
10000 B.C.	Adventure	9	44	105	2008
12 Rounds	Action	30	52	20	2009
127 Hours	Adventure	93	84	18	2010
17 Again	Comedy	55	70	20	2009
2012	Action	39	63	200	2009
27 Dresses	Comedy	40	71	30	2008
30 Days of Night	Horror	50	57	32	2007
30 Minutes or Less	Comedy	43	48	28	2011
50/50	Comedy	93	93	8	2011

Maybe the first thing that we would like to know is how many movies there are in each genre. For that, we could use a coxcomb chart as follows:

s <- ggplot(data = movies) + geom_bar(mapping = aes(x = Genre, fill = Genre)) + coord_polar()

s

We see that comedy and action movies are the most common and romance and adventure movies are the least common ones.
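The size of each wedge in the coxcomb chart is simply the number of movies in that genre, so the same information can be obtained as a table of counts. Here is a minimal sketch using only the genres of the ten movies shown in the extract above, so the counts describe that extract and not the full dataset. The names `genre_extract` and `genre_counts` are assumptions made for this illustration.

```r
# Genres of the ten movies shown in the extract above (not the full dataset)
genre_extract <- c("Comedy", "Adventure", "Action", "Adventure", "Comedy",
                   "Action", "Comedy", "Horror", "Comedy", "Comedy")

# Number of movies per genre, most common first
genre_counts <- sort(table(genre_extract), decreasing = TRUE)
genre_counts

# The same counts as shares of the extract
round(prop.table(genre_counts), 2)
```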

Next, we might wonder how the critic and audience ratings correlate. For that, we could use a scatter plot with smoothing lines as follows:

u <- ggplot(data = movies, aes(x=CriticRating, y=AudienceRating, color = Genre)) +
  geom_point(size = 1) +
  geom_smooth(fill = NA) +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("How are critic and audience ratings correlated?")

u

We see that audience ratings tend to be concentrated between 40 and 80, whereas critic ratings cover the full spectrum. We can derive other interesting insights: romance movies that have a low critic rating (between 12 and 25) tend to have a high audience rating (between 50 and 65), and horror movies that have a critic rating of 75 tend to have a lower audience rating than action movies.
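What the scatter plot shows visually can also be put into numbers: a correlation coefficient for the relationship between the two ratings, and the range of each rating for how much of the scale it covers. Here is a minimal sketch using only the ten rows of the extract shown earlier, so the figures describe that extract and not the full dataset. The variable names are assumptions made for this illustration.

```r
# Ratings of the ten movies shown in the extract above (not the full dataset)
critic   <- c(87,  9, 30, 93, 55, 39, 40, 50, 43, 93)
audience <- c(81, 44, 52, 84, 70, 63, 71, 57, 48, 93)

# Pearson correlation between critic and audience ratings
rating_cor <- cor(critic, audience)
rating_cor

# Width of the range covered by each rating
critic_span   <- diff(range(critic))
audience_span <- diff(range(audience))
c(critic = critic_span, audience = audience_span)
```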

If we wanted to understand how this has evolved over the years for each genre, we could add facets:

w <- ggplot(data = movies, aes(x=CriticRating, y=AudienceRating, colour = Genre)) +
  geom_point(aes(size = BudgetMillions), alpha = 0.3)  # alpha is a fixed setting, so it goes outside aes()

w + geom_smooth() +
  facet_grid(Genre ~ Year) +
  coord_cartesian(ylim=c(0,100))

Interestingly, action movies with low critic ratings in 2010 and 2011 had higher audience ratings than those in 2008 and 2009.

Finally, if we wanted to understand how the audience ratings are distributed for each genre, we could use boxplots as follows:

v <- ggplot(data = movies, aes(x=Genre, y=AudienceRating, color = Genre))
v + geom_jitter(size = 0.8) + geom_boxplot(size = 0.8, alpha = 0.4) +
  ggtitle("Audience rating distribution per Genre") +
  theme(plot.title = element_text(hjust = 0.5))

We see that thrillers and romance movies have similar median ratings but the ratings for the thrillers are more concentrated around the median than for romance movies.
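The two things compared in the boxplots, the median and how tightly the ratings cluster around it, correspond to the median and the interquartile range (the height of the box) per genre. Here is a minimal sketch on the ten rows of the extract shown earlier. That extract contains no thrillers or romance movies, so it only illustrates the computation. The variable names are assumptions made for this illustration.

```r
# Genre and audience rating of the ten movies in the extract above
genre    <- c("Comedy", "Adventure", "Action", "Adventure", "Comedy",
              "Action", "Comedy", "Horror", "Comedy", "Comedy")
audience <- c(81, 44, 52, 84, 70, 63, 71, 57, 48, 93)

# Median audience rating per genre (the line inside each box)
median_by_genre <- tapply(audience, genre, median)

# Interquartile range per genre (the height of each box)
iqr_by_genre <- tapply(audience, genre, IQR)

data.frame(median = median_by_genre, iqr = iqr_by_genre)
```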

That’s it, thank you for reading this far!