Click the Original, Code and Reconstruction tabs to learn about the issues and how they were fixed.

Original


Source:" How Much Americans Make in Wages ,2018


Objective

The circle graph shown represents the salary compensation received or wages earned in the U.S.A for the year 2018. The objective here is to highlight the disparity in wages earned by the popultation. The size/area of the slice represents the percentage of Americans falling in that net compensation interval.Also, the compensation intervals are further consolidated to make intervals representing low-income, mid-income wage earners and this is shown by different colours.

The target audience for the graph are policy makers, people following economic trends, and the working population of the U.S.

The visualisation chosen had the following three main issues:

  • Circle Graph - There are about 62 intervals represented in graph. Too many proportions in the circle graph make it had to read. This makes the compensation intervals(areas represented by the slices) in graph hard to compare.It is difficult to rely on the visuals alone and figure out the diffrence in area of slices and we rely on the value written in the graph.

  • Ignoring Convention - The slices for net compensation intervals do not follow the convention of finding the next compensation interval adjacentto it in the image. The slices for compensation interval $50K- $99.9K (in green) is not preceeded by $30K- $49.9K (in light pink) nor followed by $100K- $249.9K (in light blue). To look for the each interval we have to look over the whole image again using legends and labels.

  • Visual Bombardment/Colours - There is too much information in the visuaisation, giving a complicated objective. It is hard to understand and compare the different compensation intervals and their percentages. With too many scattered information labels and bright or loud colours, the visual takes significant amount of time comprehend.

Reference

Code

The following code was used to fix the issues identified in the original.

library(ggplot2)

library(dplyr)
library(readxl)
library(forcats)

#Reading the file
wages_csv <- read.csv("E:\\DV - Data Visualisation\\A2\\A2_Data preprocessed.csv", stringsAsFactors = TRUE, header = TRUE)
#wages_csv

#Checking the datatypes of the dataset
str(wages_csv)
## 'data.frame':    48 obs. of  5 variables:
##  $ Net.compensation.interval....         : Factor w/ 48 levels "0 — 5k","100k — 105k",..: 1 40 4 15 26 28 30 32 34 36 ...
##  $ Wage.earners.Number                   : int  20912646 12993812 11550542 10937984 10930426 10649685 10157201 9281197 8242459 7405906 ...
##  $ Wage.earners..Percentage.of.Total     : Factor w/ 45 levels "0.08%","0.09%",..: 32 45 44 43 43 42 41 40 39 38 ...
##  $ Net.compensation......Aggregate.amount: Factor w/ 48 levels "$105,394,346,162 ",..: 33 47 7 12 17 21 26 28 29 30 ...
##  $ Net.compensation.......Average.amount : Factor w/ 48 levels "$1,199,356 ",..: 24 42 6 17 25 27 29 31 33 35 ...
#Getting the column names
colnames(wages_csv)
## [1] "Net.compensation.interval...."         
## [2] "Wage.earners.Number"                   
## [3] "Wage.earners..Percentage.of.Total"     
## [4] "Net.compensation......Aggregate.amount"
## [5] "Net.compensation.......Average.amount"
#Renaming unusual columnnames
names(wages_csv)[names(wages_csv) == "Net.compensation.interval...."] <- "Compensation_Intervals"
names(wages_csv)[names(wages_csv) == "Wage.earners.Number"] <- "Earners_Count"
names(wages_csv)[names(wages_csv) == "Wage.earners..Percentage.of.Total" ] <- "Earners_Percent"

# Selecting only the variables required i.e. the intervals and the frequency, percentage occurance ion the intervals
wages <- wages_csv %>% select(Compensation_Intervals, Earners_Count, Earners_Percent)
#wages

#wages["Earners_Percent"]

# Converting 'Percentage of total earners" to numeric from factor datatype
prcnt_to_num<- function(n){
  n_replace_prcnt<-sub("%", "", n)
  n_as_numeric<-as.numeric(n_replace_prcnt)
}
wages[['Earners_Percent']] = prcnt_to_num(wages[['Earners_Percent']])

#wages["Earners_Percent"]

#wages$Compensation_Intervals

#Checking the ordering of the confidence intervals
#levels(wages$Compensation_Intervals)

# Reordering the compensation intervals in the ascending order
wages$Compensation_Intervals <- factor(wages$Compensation_Intervals , levels = c('0 - 5k','5k - 10k','10k - 15k','15k - 20k','20k - 25k','25k - 30k','30k- 35k','35k- 40k','40k - 45k','45k - 50k','50k - 55k','55k - 60k','60k - 65k','65k - 70k','70k - 75k','75k - 80k','80k - 85k','85k - 90k','90k - 94k','95k - 100k','100k - 105k','105k - 110k','110k - 115k','115k - 120k','120k - 125k','125k - 130k','130k - 135k','135k - 140k','140k - 145k','145k - 150k','150k - 155k','155k - 160k','160k - 165k','165k - 170k','170k - 175k','175k - 180k','180k - 185k','185k - 190k','190k - 195k','195k - 200k','200k - 250k','250k - 300k','300k - 350k','350k - 400k','400k - 450k','450k - 500k','500k - 1M','1M and over'))

#wages

# Creating divisions based on the compensation intervals to give idea of income levels
wages$Compensation_Intervals_2 <-fct_collapse(wages$Compensation_Intervals,
                                              "0K-50K" = c("0 - 5k","5k - 10k","10k - 15k","15k - 20k","20k - 25k","25k - 30k","30k- 35k","35k- 40k","40k - 45k","45k - 50k"
                                              ),
                                              "50K-100K" =c("50k - 55k","55k - 60k","60k - 65k","65k - 70k","70k - 75k","75k - 80k","80k - 85k","85k - 90k","90k - 94k","95k - 100k"
                                              ),
                                              "100K-150K" =c("100k - 105k","105k - 110k","110k - 115k","115k - 120k","120k - 125k","125k - 130k","130k - 135k","135k - 140k","140k - 145k","145k - 150k"
                                              ),
                                              "150K-200K "=c("150k - 155k","155k - 160k","160k - 165k","165k - 170k","170k - 175k","175k - 180k","180k - 185k","185k - 190k","190k - 195k","195k - 200k"
                                              ),
                                              "200K-1M and over" = c("200k - 250k","250k - 300k","300k - 350k","350k - 400k","400k - 450k","450k - 500k","500k - 1M","1M and over"
                                              )
)

#table(wages$Compensation_Intervals_2)


#Using the ggplot2 to reconstruct the graph
# Constructing a barchart
p <- ggplot(data = wages,
            aes(x = Compensation_Intervals, y = Earners_Percent, fill = Compensation_Intervals_2)) +
  geom_bar(stat = "identity") + geom_text(aes(label=Earners_Percent), hjust=0,  position = 'dodge') + coord_flip()

# Adding details of the graph
p2 <- p +
  labs(title = "How much Americans Make in Wages (2018)",
       x = "Percentage of Total Earners (in percent)",
       y = "Compensation Intervals",
      subtitle  = "Highest Percentage of the Wage earners, earn the lowest in wages, in the year 2018"
              )

# Changing color palette to use more subtle and pleasant to the eye colors
#install.packages("extrafont")
library(extrafont)
#font_import()

pal <- c("#89141cff",
         "#b16264ff",
         "#d8b0acff",
         "#d8d0c1ff",
         "#3a506bff",
         "#29265bff"
)

p3 <- p2 +
  theme_gray() +
  scale_fill_manual(values = pal) +
  theme(        text=element_text(family="Georgia"),
        title = element_text(face = "bold"),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        legend.title = element_blank())

Data Reference

Reconstruction

The following plot fixes the main issues in the original.