Click the Original, Code and Reconstruction tabs to read about the issues and how they were fixed.

Original


Source: R. Amoros, The Cheapest and Most Expensive U.S. Cities to Start a New Company (2018).


Objective

The objective of the visualisation is to highlight the most expensive cities in the United States in which to start a new company. The costs represented include filing fees, office space, utilities, legal and accounting fees, and payroll. The visualisation is aimed at prospective entrepreneurs, helping them understand the costs associated with creating a new company and choose a city in which to set up operations based on total cost.

The visualisation chosen had the following three main issues:

  • Use of Pie Charts: The original is a variation of a pie chart. A standard pie chart uses angle to represent proportions, whereas this modified version appears to rely on area to represent each expense's percentage of total start-up costs. In isolation, each chart shows the cost breakdown for one city; however, accurately comparing the same expense across all 10 charts is difficult. Although the visualisation is eye-catching, it does not make the data easy to understand or the cities easy to compare, which is its intended purpose.

  • Poor Labelling: The annotations are intended to show the percentage breakdown of costs numerically for each of the 10 cities. However, comparing one set of percentages with another strains the eyes. Labels for small categories such as filing fees and utilities are difficult to position because little space is available for them. In addition, the single colour palette makes it difficult to tell which percentage figure belongs to which cost variable; this is a separate issue in itself.

  • Use of Colour: The cost categories (e.g. Payroll, Utilities) are encoded with a single-hue palette of red, from light to dark, as if the categories were ordered. In general, colour scales that rely on red should be avoided because red-green deficiencies are the most common forms of colour blindness. The scale is also difficult to interpret at the lighter shades of red. In every chart, a darker segment representing ‘Office Space’ separates two lighter segments (‘Legal & Accounting’ and ‘Utilities’). These are difficult to distinguish, and the audience may end up guessing which variable belongs to which shade of red. The problem may be more pronounced for viewers with a colour vision deficiency such as protanomaly. To aid comprehension and comparison, each cost category should be assigned a unique colour.
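The colour recommendation above can be sketched in R as a named vector that gives each expense category its own hue. This is a minimal illustration, not the palette used in the reconstruction: the hex values are taken from the Okabe-Ito colour-blind-friendly palette, the mapping of colours to categories is an arbitrary assumption, and `expense_colours` is a hypothetical name. The category names follow the spelling of the dataset's column headers.

```r
# Five colours from the Okabe-Ito colour-blind-friendly palette,
# one per expense category. The category-to-colour mapping is an
# illustrative assumption, not the one used in the reconstruction.
expense_colours <- c(
  "Filling Fees"       = "#E69F00",  # orange
  "Office Space"       = "#56B4E9",  # sky blue
  "Utilities"          = "#009E73",  # bluish green
  "Legal & Accounting" = "#0072B2",  # blue
  "Payroll"            = "#D55E00"   # vermillion
)

# Every category should have its own, distinct colour.
stopifnot(
  !anyDuplicated(expense_colours),
  !anyDuplicated(names(expense_colours))
)

# With ggplot2 loaded, the vector could be supplied to a plot `p` as:
#   p + scale_fill_manual(values = expense_colours)
```

Because the vector is named, `scale_fill_manual()` matches colours to categories by name rather than by position, so the mapping stays stable if the order of the categories changes.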

Code

The following code was used to fix the issues identified in the original.

library(ggplot2)
library(magrittr)
library(readr)
library(tidyr)
library(dplyr)

#Load Data
getwd()
## [1] "C:/Users/andre/Desktop/RMIT/Data Visualisation"
startup<-read_csv("startup.csv")
#Converting columns to the appropriate variable types
startup$`Filling Fees`<-as.numeric(startup$`Filling Fees`)
startup$`Office Space`<-as.numeric(startup$`Office Space`)
startup$Utilities<-as.numeric(startup$Utilities)
startup$`Legal & Accounting`<-as.numeric(startup$`Legal & Accounting`)
startup$Payroll<-as.numeric(startup$Payroll)
startup$Total<-as.numeric(startup$Total)
startup$City<-as.factor(startup$City)

str(startup)
## tibble [10 x 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Rank              : num [1:10] 1 2 3 4 5 6 7 8 9 10
##  $ City              : Factor w/ 10 levels "Boston, MA","Bridgeport, CT",..: 7 6 10 3 1 2 8 5 9 4
##  $ Filling Fees      : num [1:10] 100 100 220 238 450 293 180 100 130 130
##  $ Office Space      : num [1:10] 37580 50430 53290 69310 41170 ...
##  $ Utilities         : num [1:10] 3107 3107 2045 3267 2351 ...
##  $ Legal & Accounting: num [1:10] 6255 5771 5536 5671 5382 ...
##  $ Payroll           : num [1:10] 394992 380744 344448 316576 339664 ...
##  $ Total             : num [1:10] 442034 440152 405539 395061 389017 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Rank = col_double(),
##   ..   City = col_character(),
##   ..   `Filling Fees` = col_double(),
##   ..   `Office Space` = col_double(),
##   ..   Utilities = col_double(),
##   ..   `Legal & Accounting` = col_double(),
##   ..   Payroll = col_double(),
##   ..   Total = col_double()
##   .. )
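#Data Modification & Preparation
#Order the City factor from highest to lowest total cost so the facets appear in that order
startup$City <- startup$City %>% 
  factor(levels = startup$City[order(-startup$Total)]) 
#Reshape from wide to long: one row per city and expense type
startup_1 <- gather(startup, Variable, Value, 'Filling Fees':'Payroll') 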
str(startup_1)
## tibble [50 x 5] (S3: tbl_df/tbl/data.frame)
##  $ Rank    : num [1:50] 1 2 3 4 5 6 7 8 9 10 ...
##  $ City    : Factor w/ 10 levels "San Jose, CA",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Total   : num [1:50] 442034 440152 405539 395061 389017 ...
##  $ Variable: chr [1:50] "Filling Fees" "Filling Fees" "Filling Fees" "Filling Fees" ...
##  $ Value   : num [1:50] 100 100 220 238 450 293 180 100 130 130 ...
startup_2<-startup_1 %>% mutate(Percentage = (Value/Total)*100)
startup_2
## # A tibble: 50 x 6
##     Rank City               Total Variable     Value Percentage
##    <dbl> <fct>              <dbl> <chr>        <dbl>      <dbl>
##  1     1 San Jose, CA      442034 Filling Fees   100     0.0226
##  2     2 San Francisco, CA 440152 Filling Fees   100     0.0227
##  3     3 Washington, DC    405539 Filling Fees   220     0.0542
##  4     4 New York, NY      395061 Filling Fees   238     0.0602
##  5     5 Boston, MA        389017 Filling Fees   450     0.116 
##  6     6 Bridgeport, CT    367980 Filling Fees   293     0.0796
##  7     7 Seattle, WA       357023 Filling Fees   180     0.0504
##  8     8 Oakland, CA       351379 Filling Fees   100     0.0285
##  9     9 Trenton, NJ       342695 Filling Fees   130     0.0379
## 10    10 Newark, NJ        336207 Filling Fees   130     0.0387
## # ... with 40 more rows
str(startup_2)
## tibble [50 x 6] (S3: tbl_df/tbl/data.frame)
##  $ Rank      : num [1:50] 1 2 3 4 5 6 7 8 9 10 ...
##  $ City      : Factor w/ 10 levels "San Jose, CA",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Total     : num [1:50] 442034 440152 405539 395061 389017 ...
##  $ Variable  : chr [1:50] "Filling Fees" "Filling Fees" "Filling Fees" "Filling Fees" ...
##  $ Value     : num [1:50] 100 100 220 238 450 293 180 100 130 130 ...
##  $ Percentage: num [1:50] 0.0226 0.0227 0.0542 0.0602 0.1157 ...
startup_2$Total<-as.factor(startup_2$Total)
str(startup_2)
## tibble [50 x 6] (S3: tbl_df/tbl/data.frame)
##  $ Rank      : num [1:50] 1 2 3 4 5 6 7 8 9 10 ...
##  $ City      : Factor w/ 10 levels "San Jose, CA",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Total     : Factor w/ 10 levels "336207","342695",..: 10 9 8 7 6 5 4 3 2 1 ...
##  $ Variable  : chr [1:50] "Filling Fees" "Filling Fees" "Filling Fees" "Filling Fees" ...
##  $ Value     : num [1:50] 100 100 220 238 450 293 180 100 130 130 ...
##  $ Percentage: num [1:50] 0.0226 0.0227 0.0542 0.0602 0.1157 ...
#Build the facet label from the numeric totals in startup_1
#(startup_2$Total is already a factor at this point, so it is not converted again)
startup_2$Totallbl <- paste('Startup Cost: $',format(startup_1$Total, big.mark=",", scientific=FALSE))
str(startup_2)
## tibble [50 x 7] (S3: tbl_df/tbl/data.frame)
##  $ Rank      : num [1:50] 1 2 3 4 5 6 7 8 9 10 ...
##  $ City      : Factor w/ 10 levels "San Jose, CA",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Total     : Factor w/ 10 levels "336207","342695",..: 10 9 8 7 6 5 4 3 2 1 ...
##  $ Variable  : chr [1:50] "Filling Fees" "Filling Fees" "Filling Fees" "Filling Fees" ...
##  $ Value     : num [1:50] 100 100 220 238 450 293 180 100 130 130 ...
##  $ Percentage: num [1:50] 0.0226 0.0227 0.0542 0.0602 0.1157 ...
##  $ Totallbl  : chr [1:50] "Startup Cost: $ 442,034" "Startup Cost: $ 440,152" "Startup Cost: $ 405,539" "Startup Cost: $ 395,061" ...
startup_3<-rename(startup_2,"Expenses"="Variable")
str(startup_3)
## tibble [50 x 7] (S3: tbl_df/tbl/data.frame)
##  $ Rank      : num [1:50] 1 2 3 4 5 6 7 8 9 10 ...
##  $ City      : Factor w/ 10 levels "San Jose, CA",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Total     : Factor w/ 10 levels "336207","342695",..: 10 9 8 7 6 5 4 3 2 1 ...
##  $ Expenses  : chr [1:50] "Filling Fees" "Filling Fees" "Filling Fees" "Filling Fees" ...
##  $ Value     : num [1:50] 100 100 220 238 450 293 180 100 130 130 ...
##  $ Percentage: num [1:50] 0.0226 0.0227 0.0542 0.0602 0.1157 ...
##  $ Totallbl  : chr [1:50] "Startup Cost: $ 442,034" "Startup Cost: $ 440,152" "Startup Cost: $ 405,539" "Startup Cost: $ 395,061" ...
#Data Visualisation
p3 <- ggplot(startup_3, aes(x = Expenses, y = Percentage, fill = Expenses))
p4 <- p3 +
  geom_bar(stat = "identity") +
  facet_grid(City + Totallbl ~ .) +
  ylim(0, 100) +
  ylab("Percentage (%) of Total Expenses") +
  xlab("") +
  labs(
    title = "US Cities with the Highest Startup Costs (From Highest to Lowest) broken down by Expense Type",
    subtitle = "Based on minimum requirements of a 1,000 square foot commercial office and 5 full-time employees",
    caption = "Source: R. Amoros, The Cheapest and Most Expensive U.S. Cities to Start a New Company (2018) - https://howmuch.net/articles/cities-with-lowest-highest-startup-costs"
  ) +
  geom_text(aes(label = paste(round(Percentage, 1), "%", sep = "")), vjust = -0.5, size = 4) +
  theme(
    legend.position = "right",
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 13),
    plot.caption = element_text(size = 11, face = "italic"),
    legend.title = element_text(size = 14, face = "bold"),
    legend.text = element_text(size = 12),
    axis.text.x = element_text(size = 14, face = "bold"),
    axis.title.y = element_text(size = 14, face = "bold"),
    strip.text.y = element_text(size = 12, face = "bold")
  )
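As a sanity check on the Percentage column computed during data preparation, the sketch below recomputes the shares for the two most expensive cities in base R. The figures are copied from the first two entries of each column in the `str(startup)` output above; the shortened column names and the object names (`costs`, `shares`) are assumptions made for illustration and do not appear in the original script.

```r
# Values for the two most expensive cities, copied from the str() output.
# Row and column names are shortened here for illustration only.
costs <- matrix(
  c(100, 37580, 3107, 6255, 394992,
    100, 50430, 3107, 5771, 380744),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("San Jose, CA", "San Francisco, CA"),
    c("filing", "office", "utilities", "legal", "payroll")
  )
)
totals <- c(442034, 440152)

# The five expense columns should add up to the reported total.
stopifnot(all(rowSums(costs) == totals))

# Share of each expense in its city's total, in percent.
# Dividing a matrix by a vector of length nrow() recycles down each
# column, so every row is divided by its own city's total.
shares <- costs / totals * 100

# Each city's shares should sum to 100 (up to floating-point error).
stopifnot(all(abs(rowSums(shares) - 100) < 1e-9))

round(shares, 1)
```

If either `stopifnot()` call fails on the full dataset, the expense columns and the Total column disagree for at least one city, and the percentages in the faceted bar chart would not add up to 100.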

Reconstruction

The following plot fixes the main issues in the original.