Initial plot
In fact, the plot I have chosen to improve is itself an improved version of an earlier graph (let me call that earlier one graph 1). It is a pie chart that was offered as a better way of presenting data on how American citizens get to work. Here is the pie chart:
American Community Survey of the US Census Bureau (2005)
The link below leads to the very first version of the graph, or graph 1; I do not include it here, to avoid confusion between three different graphs:
What is wrong?
Basically, as I have already got used to thinking, pie charts are best avoided altogether, but if one does want to use one, no more than 3-5 categories should remain.
I must admit that I like some parts of this graph: for instance, the background picture, which does not distract and illustrates the subject, and the fact that the categories are placed hierarchically and are easy to read.
Solution
In my opinion, the first problem is that the graph contains too many categories, so the best solution here would be to use a bar plot.
Then, the percentages sum to 100.1% rather than 100%, which is a mistake. I would perhaps subtract 0.1 from the biggest category in this case.
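As a minimal sketch of that check in R: the vector below assumes the original chart showed 77.0% for the largest category (this is inferred from the corrected value of 76.9 used in the revised graph; the other values are taken from the revised data), and `orig`, `excess` and `fixed` are illustrative names only.

```r
# Assumed original shares: 77.0 for the largest slice is an inference,
# the remaining values are the ones used in the revised graph.
orig <- c(0.1, 0.2, 0.4, 0.9, 2.5, 3.6, 4.7, 10.7, 77.0)

# Round to one decimal to avoid floating-point noise in the comparison
excess <- round(sum(orig) - 100, 1)
excess
## [1] 0.1

# Subtract the excess from the biggest category
fixed <- orig
fixed[which.max(fixed)] <- fixed[which.max(fixed)] - excess
round(sum(fixed), 1)
## [1] 100
```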
Finally, the colours are too bright and compete with each other. Since the categories all relate to the same subject, I believe the bars can be drawn in one colour; it is not crucial to differentiate the categories in this manner.
Revised graph
# Share of each commuting mode, in percent
perc <- c(0.1, 0.2, 0.4, 0.9, 2.5, 3.6, 4.7, 10.7, 76.9)
catg <- c("Taxi", "Motorcycle", "Bike", "Other means", "Walk", "Work from home", "Public Transportation", "Car Pool", "Drive alone")
amer <- data.frame(perc, catg)
sum(amer$perc) # now the sum is fine
## [1] 100
# Text labels shown next to the bars
percent <- c("0.1%", "0.2%", "0.4%", "0.9%", "2.5%", "3.6%", "4.7%", "10.7%", "76.9%")
amer <- data.frame(perc, catg, percent)
Bar plot:
library(ggplot2)
library(ggpubr)
# Fix the bar order; the first level ends up at the bottom after coord_flip()
amer$catg <- factor(amer$catg, levels = c("Other means", "Taxi", "Motorcycle", "Bike", "Walk", "Work from home", "Public Transportation", "Car Pool", "Drive alone"))
ggplot(amer, aes(x = catg, y = perc)) +
geom_bar(aes(alpha = catg == "Drive alone"), stat = "identity", color = "darkseagreen", fill = "darkseagreen3") +
geom_text(aes(label = percent), hjust = -0.2, vjust = 0.5, size = 3.8, color = "darkolivegreen") +
labs(title = "How do Americans get to work", caption = "American Community Survey of the US Census Bureau, 2005", x = NULL, y = NULL) +
coord_flip() +
theme(plot.caption = element_text(color = "coral4", size = 10.5), plot.title = element_text(hjust = 0.08), axis.text.y = element_text(size = 11, color = "gray7"), axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
theme(panel.grid.minor = element_blank(), panel.grid.major = element_blank(), panel.background = element_blank()) +
ylim(0, 100) + scale_alpha_manual(values = c("TRUE" = 1, "FALSE" = 0.6), guide = "none")
P.S. I have also decided to move Other means to the bottom, even though its share is higher than that of taxi, bike, and motorcycle. I think it is simply more logical for the category that does not imply any specific answer to come last.
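The hand-typed percentage labels and level order could also be derived from the data. The following is only a sketch of one way to do that in base R, not the code used for the graph above; `amer2`, `ord` and `lvls` are illustrative names.

```r
perc <- c(0.1, 0.2, 0.4, 0.9, 2.5, 3.6, 4.7, 10.7, 76.9)
catg <- c("Taxi", "Motorcycle", "Bike", "Other means", "Walk", "Work from home",
          "Public Transportation", "Car Pool", "Drive alone")
amer2 <- data.frame(perc, catg)

# Build the "x.x%" labels from the numbers instead of typing them
amer2$percent <- sprintf("%.1f%%", amer2$perc)

# Sort the categories by share, then move "Other means" to the front:
# after coord_flip() the first factor level is drawn at the bottom
ord <- amer2$catg[order(amer2$perc)]
lvls <- c("Other means", setdiff(ord, "Other means"))
amer2$catg <- factor(amer2$catg, levels = lvls)
levels(amer2$catg)
```

The resulting data frame can be passed to the same `ggplot()` call in place of `amer`.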