Authors: Charlie Ding, Geoffrey Jing
library(readxl)
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)
library(png)
library(grid)
df <- read_excel("portfolio1.xlsx")
# Average tax collected per return in each bracket
df <- df %>%
  mutate(tax_per_return = tax / returns)
Original graph screenshot for reference:
Replicating the original graph
# Replicate the original graph as closely as possible
ggplot(df, aes(y = rate)) +
  # Returns are drawn to the left of zero, tax revenue to the right
  geom_col(aes(x = -returns), fill = "steelblue") +
  geom_text(aes(x = -returns, label = comma(returns)),
            color = "black", fontface = "bold", size = 4, hjust = 0.5) +
  geom_col(aes(x = tax), fill = "darkorange") +
  geom_text(aes(x = tax, label = comma(tax)),
            color = "black", fontface = "bold", size = 4, hjust = -0.1) +
  labs(
    title = "NUMBER OF TAX RETURNS AND REVENUE GENERATED BY BRACKET (TY 2015)",
    x = NULL,
    y = NULL
  ) +
  theme_minimal(base_size = 13) +
  theme(
    panel.grid.major.x = element_blank(),
    panel.grid.minor = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )
Our redesign of original graph
# Reshape to long format: one row per bracket and measure
df_long <- df %>%
  pivot_longer(
    cols = c("returns", "tax", "tax_per_return"),
    names_to = "variable",
    values_to = "value"
  )
# Draw the redesigned graph: one facet per measure, horizontal bars
ggplot(df_long, aes(x = rate, y = value, fill = variable)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ variable, scales = "free_x") +
  labs(
    title = "Number of Tax Returns, Revenue Generated, and Tax per Return By Bracket",
    x = NULL,
    y = NULL
  ) +
  theme_minimal(base_size = 14) +
  theme(panel.grid.major.y = element_blank())
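To illustrate the reshaping step above, here is a minimal sketch on a made-up two-bracket table. The numbers are hypothetical and are not the values in portfolio1.xlsx. It shows how pivot_longer() stacks the three measure columns into variable/value pairs, which is what lets facet_wrap() draw one panel per measure.

```r
library(tidyr)

# Hypothetical toy data; not the real portfolio1.xlsx values
toy <- data.frame(
  rate    = c("10%", "15%"),
  returns = c(100, 200),
  tax     = c(1000, 6000)
)
toy$tax_per_return <- toy$tax / toy$returns

# Long format: one row per (rate, variable) pair
toy_long <- pivot_longer(
  toy,
  cols      = c("returns", "tax", "tax_per_return"),
  names_to  = "variable",
  values_to = "value"
)

print(toy_long)
```

Two brackets and three measures give six rows, each carrying the bracket, the name of the measure, and its value.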
The graph that we chose displays the number of tax returns and the revenue generated by each tax bracket in 2015. The graph does some things well. It uses a bar chart to show the differences in tax returns and revenue across brackets, which lets the viewer read the data values more accurately than, say, a pie chart. It also distinguishes the tax return and revenue values with different colors, which reduces cognitive load for the viewer. Lastly, as mentioned in slide 06 in class, the graph uses whitespace to separate the elements, so the viewer can tell the data apart without relying only on color contrast.
However, there are a few aspects of the original graph that could be improved. First, the colors are not ideal for interpretation: the blue is dark and hard to distinguish from the text label placed on top of it. Choosing better colors or, better yet, moving the label beside the bar instead of inside it would make the graph easier to read. Second, combining tax returns and revenue in the same graph makes it difficult to take in their differences and compare them effectively. Separating returns and revenue would make them easier to compare visually. On the topic of comparison, the graph also makes it difficult to see how much tax is collected per return in each bracket, because the viewer has to work that out from two separate bars. This places a heavy cognitive load on the viewer, and adding an extra data field for the amount of tax collected per return in each bracket would help the viewer understand the differences. Finally, although the number of returns is hard to see for certain brackets, leaving the data on a linear scale is probably still the best option: as Chapter 17.2 of Fundamentals of Data Visualization by Wilke points out, bars on a logarithmic axis cannot start at 0.
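The log-scale point can be checked numerically. The sketch below uses made-up return counts (hypothetical, not the 2015 data) to show that a baseline of 0 has no finite position on a log axis, and that log-transformed positions are no longer proportional to the underlying values.

```r
# Hypothetical return counts for two brackets
returns <- c(small = 1000, large = 100000)

# A bar's baseline of 0 has no finite position on a log10 axis
log10(0)  # evaluates to -Inf

# On the linear scale the large value is 100 times the small one
linear_ratio <- returns[["large"]] / returns[["small"]]

# After a log10 transform the positions are 5 and 3, a ratio of only 5/3
log_ratio <- log10(returns[["large"]]) / log10(returns[["small"]])

print(c(linear = linear_ratio, log = log_ratio))
```

Because bar length is read as magnitude, a bar drawn on the log scale would understate the gap between the two brackets.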
For our improved graph, we made several changes to better visualize the data. First, we changed the facets layer: we split the graph into multiple panels and separated the tax returns from the revenue generated, so that it is easier for the viewer to see their differences. We also changed the data layer by adding another value that shows how much tax is collected per return in each bracket. This makes it easier for the viewer to see the differences in tax per return, an important metric in the context of this graph, without having to calculate it on their own. Additionally, we changed the aesthetic layer by moving the y-axis labels (percentages) to the left side instead of placing them inside the graph, which makes them much clearer to read. We also changed the theme layer by switching the grid lines from horizontal to vertical. Because the bars themselves are horizontal, vertical grid lines reduce the viewer's cognitive load, and we want to make our content as easy as possible to understand, as mentioned in slide 06 in class. This choice also follows the point made by Claus O. Wilke in Chapter 17.3 of Fundamentals of Data Visualization: humans perceive a data value encoded as a single distance more accurately than a value encoded through a combination of two or more distances that together create an area. Lastly, within the theme layer we also changed the legend to make it larger and easier to read than the one in the original graph.
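One way to enlarge a legend in the theme layer is sketched below. The toy data, the font sizes (14 and 12), and the key size are assumptions chosen for illustration, not the settings used for the figure above. The sketch also maps the value directly to the x-axis rather than using coord_flip(), which is a simplification on our part.

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
toy <- data.frame(
  rate     = c("10%", "15%", "10%", "15%"),
  variable = c("returns", "returns", "tax", "tax"),
  value    = c(100, 200, 1000, 6000)
)

p <- ggplot(toy, aes(x = value, y = rate, fill = variable)) +
  geom_col() +
  facet_wrap(~ variable, scales = "free_x") +
  theme_minimal(base_size = 14) +
  theme(
    panel.grid.major.y = element_blank(),          # keep only the vertical grid lines
    legend.title    = element_text(size = 14),     # assumed size
    legend.text     = element_text(size = 12),     # assumed size
    legend.key.size = grid::unit(1.2, "lines")     # assumed key size
  )

p
```

Raising base_size in theme_minimal() scales all text together, while the legend.* elements adjust the legend on its own.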
# https://tidyr.tidyverse.org/reference/pivot_longer.html
# https://www.datacamp.com/tutorial/facets-ggplot-r
# https://www.r-bloggers.com/2021/09/adding-text-labels-to-ggplot2-bar-chart/