DataM: Homework Exercise 0420 - Data visualization

Exercise 1.
- Load in the data set and check its structure.
- Plot the distribution
Exercise 2 (Not finished).
Exercise 3.

library(dplyr)
library(ggplot2)

Exercise 1.

The distribution of personal disposable income in Taiwan in 2015 has a story to tell. Revise the following plot to enhance that message.

Target output 2

Load in the data set and check its structure.

income_tw <- read.table('../data/income_tw.txt', header = TRUE, sep = ',')
head(income_tw)

str(income_tw)

'data.frame':   41 obs. of  2 variables:
 $ Income: Factor w/ 41 levels "1,000,000 to 1,069,999",..: 7 8 9 11 12 13 14 15 16 17 ...
 $ Count : int  807160 301650 313992 329290 369583 452671 495387 517779 557786 584497 ...

Plot the distribution

[Key message] The distribution fo income seems to be a law-power distrubtution: Most people have lower income and few people have higher income. For more details, there are the most people the stage of the lowest income and most of people cluster in stages from 240,000 to 480,000.

To make the message more obvious, I:

sort the order of income stage,
transform the original dot plot,
add the vertical segments for each income stage,
remove useless vertical grid lines,
add a horizontal dashed line marking level of the median of counts in income stages,
and add color hue marker to highlight that income concentrated in few people.

income_tw <- income_tw %>% 
  mutate(Income = factor(income_tw$Income, levels = Income),
         CumCount_pro = rev(cumsum(rev(Count))) / sum(Count),
         CumCount_pro_group = 
           cut(CumCount_pro, include.lowest = TRUE,
               breaks = c(0, .05, .1, .2, .4, 1),
               labels = c('top-5%', 'top-10%', 'top-20%', 'top-40%', 'other')))

ggplot(aes(x = Income, y = Count), data = income_tw) +
  geom_point(aes(x = Income, y = Count, color = CumCount_pro_group)) +
  geom_segment(aes(xend = Income, yend = 0, color = CumCount_pro_group)) +
  scale_colour_manual(name = 'Top-k% rich', values = paste0('steelblue', c(4, '', 3:1))) +
  geom_hline(yintercept = median(income_tw$Count), color = 'grey60', lwd = .5, lty = 2) +
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, margin = margin(t = 6), hjust = 1, size = 8),
        panel.grid.major.x = element_blank(),
        legend.position = 'top')

Exercise 2 (Not finished).

Comment on how the graphs presented in this link violate the principles for effective graphics and how would you revise them.

Exercise 3.

Sarah Leo at the Economist magazine published a data set to accompany the story about how scientific publishing is dominated by men. The plot on the left panel below is the orignal graph that appeared in the article. Help her find a better plot.

Target output

Load in the data set and display it

dta3 <- read.table('../data/women_science.txt', sep = ',', header = TRUE)
dta3

Data transformation

dta3_long <- dta3 %>% reshape2::melt() %>%
  mutate(female_dominated = cut(value, include.lowest = TRUE,
                                breaks = c(0, .2, .5, 1), 
                                labels = c('< 20%', '20-50%', '>= 50%')),
         variable = factor(variable,
                           labels = c('Health sciences', 'Physical sciences', 'Engineering',
                                      'Computer science and maths', 'Inventores (patent applications)')))

Plot 1: Lollipop plot with faceting by disciplines

Turn one single panel into plot with faceting by the discipline to see difference among countries more easily in one discipline.
Add segments to make the dot plot become the lollipop plot and remove useless horizontal grid lines to make differences more obvious (representation of the proportion: the dot location -> the lollipop length).
Use two vertical dasher lines to mark boundarise of different stages, and mark different proportion stage with different colors, to highlight difference among disciplines as well (e.g., proportions in Health sciences are much higher and proportions in Inventores are much lower).
Scale x-axis to range of \([0, 1]\) to show how low these proportions are to highlight the phenomenon of "men's" world.

qplot(x = value, y = Country, color = female_dominated, data = dta3_long) +
  facet_wrap(. ~ variable, nrow = 5) + 
  scale_x_continuous(limits = c(0, 1),
                     breaks = seq(0, 1, by=.1),
                     labels = seq(0, 1, by=.1)*100) +
  scale_color_manual(name = '',
                     values = c('deeppink2', 'black', 'dodgerblue2')) + 
  geom_segment(aes(xend = 0, yend = Country)) +
  geom_vline(xintercept = c(.2, .5), color = 'grey55', lty = 2) +
  xlab('Proportion (%)') + ylab('Country') + 
  labs(caption = 'Source: \"Gender in the Global Research Landscape by Elsevier;\nThe Economist\"') +
  ggtitle('Still a man\'s world',
          subtitle = 'Women among researchers with papers published*\n(index in Scopus) in 2011-15') +
  theme_bw() +
  theme(plot.title.position = 'plot', 
        legend.position = 'top',
        axis.text.y = element_text(size = 7),
        plot.caption.position = 'plot',
        plot.caption = element_text(hjust = 0),
        panel.grid.major.y = element_blank())

Plot 2: Lollipop plot with faceting by countries

Try to exchange the role of discipline and country. This plot can help to see difference among discipline more easily in one country. There are 12 countries in this data set, putting 12 panels in one single column would make the plot too long. Thus, I put them into 4 columns. Color marker still can help us to see difference among countries as well (e.g., there are the most disciplines with proportion lower than 20% in Japan. That is, the scientific community in Japan is much men-dominated than other countries).

qplot(x = value, y = variable, color = female_dominated, data = dta3_long) +
  facet_wrap(. ~ Country, nrow = 4) + 
  scale_x_continuous(limits = c(0, 1),
                     breaks = seq(0, 1, by=.1),
                     labels = seq(0, 1, by=.1)*100) +
  scale_color_manual(name = '',
                     values = c('deeppink2', 'black', 'dodgerblue2')) + 
  geom_segment(aes(xend = 0, yend = variable)) +
  geom_vline(xintercept = c(.2, .5), color = 'grey55', lty = 2) +
  xlab('Proportion (%)') + ylab('Discipline') + 
  labs(caption = 'Source: \"Gender in the Global Research Landscape by Elsevier;\nThe Economist\"') +
  ggtitle('Still a man\'s world',
          subtitle = 'Women among researchers with papers published*\n(index in Scopus) in 2011-15') +
  theme_bw() + 
  theme(plot.title.position = 'plot', 
        legend.position = 'top',
        plot.caption.position = 'plot',
        plot.caption = element_text(hjust = 0),
        panel.grid.major.y = element_blank())

Future work

Sort y labels (e.g., country in plot 1, discipline in plot 2) by the order of proption in one facet panel.