Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:

inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)

And lets preview this data:

head(inc)
##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX
summary(inc)
##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2502   110 Consulting        :   1   Mean   :  4.612  
##  3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4995                    
##     Revenue                                  Industry      Employees      
##  Min.   :2.000e+06   IT Services                 : 733   Min.   :    1.0  
##  1st Qu.:5.100e+06   Business Products & Services: 482   1st Qu.:   25.0  
##  Median :1.090e+07   Advertising & Marketing     : 471   Median :   53.0  
##  Mean   :4.822e+07   Health                      : 355   Mean   :  232.7  
##  3rd Qu.:2.860e+07   Software                    : 342   3rd Qu.:  132.0  
##  Max.   :1.010e+10   Financial Services          : 260   Max.   :66803.0  
##                      (Other)                     :2358   NA's   :12       
##             City          State     
##  New York     : 160   CA     : 701  
##  Chicago      :  90   TX     : 387  
##  Austin       :  88   NY     : 311  
##  Houston      :  76   VA     : 283  
##  San Francisco:  75   FL     : 282  
##  Atlanta      :  74   IL     : 273  
##  (Other)      :4438   (Other):2764

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

Exploratory statistics

Number of Rows

The rank appears to be 1 through 5000. So let’s see if there are exactly 5000 rows.

nrow(inc)
## [1] 5001
inc_comp <- inc[complete.cases(inc), ]

nrow(inc_comp)
## [1] 4989

The dataframe contains 5001 rows. Interesting given the highest numeric Rank is 5000. Also, the number of rows with at least one missing element is 12 (5001-4989).

Unique values per column

Turns out the dataframe contains 5001 rows, so let’s investigate the number of ranks included and the number of companies identified, and the number of states included.

# Ouput number of distinct values for each column
sapply(inc, function(x) length(unique(x)))
##        Rank        Name Growth_Rate     Revenue    Industry   Employees 
##        4999        5001        1147        1069          25         692 
##        City       State 
##        1519          52

The number of ranks being 4999 indicates duplicate rank numbers exist. We should be able to identify two ranks duplicated or one rank used three times.

The name column returns 5001, a comforting result which indicates each name is unique, and thus should represent a different company.

The state count results in 52. Given 50 states in the United States, let’s see if Washington, D.C. is included or perhaps a couple of US territories. If not true, then perhaps a company exists that is not in the United States.

Identify duplicate ranks

# Identify duplicated values for Rank
inc$dupe <- duplicated(inc$Rank)

subset(inc, dupe == TRUE)
##      Rank                    Name Growth_Rate  Revenue
## 3424 3424 Total Beverage Solution        0.90 41500000
## 5001 5000                    ALL4        0.34  4700000
##                    Industry Employees         City State dupe
## 3424        Food & Beverage        35 Mt. Pleasant    SC TRUE
## 5001 Environmental Services        34    Kimberton    PA TRUE
subset(inc, Rank == 3424 | Rank == 5000)
##      Rank                    Name Growth_Rate  Revenue
## 3423 3424     Stemp Systems Group       19.37  6800000
## 3424 3424 Total Beverage Solution        0.90 41500000
## 5000 5000                     INE        0.34  6800000
## 5001 5000                    ALL4        0.34  4700000
##                    Industry Employees             City State  dupe
## 3423            IT Services        39 Long Island City    NY FALSE
## 3424        Food & Beverage        35     Mt. Pleasant    SC  TRUE
## 5000            IT Services        35         Bellevue    WA FALSE
## 5001 Environmental Services        34        Kimberton    PA  TRUE

Based on above output, the ranks of 3424 and 5000 are each used twice. Further investigation of Stemp Systems Group finds the Growth Rate value indicates the company should be ranked higher. After visual examination of the dataset, I find rank 218 is missing and appears Stemp Systems Group should be there.

As for rank 5000, based on the pattern of duplicate growth rates, yet unique ranks, the final rank should be 5001.

52 States … United States is growing?

unique(inc$State)
##  [1] CA VA FL TX MA TN UT RI SC DC NJ OH WA ME NY CO GA IL AZ NC MD MN OK
## [24] PA CT IN MS WI WY MI MO KS OR NE AL HI NV IA KY ID AK LA DE AR NH VT
## [47] NM SD ND PR MT WV
## 52 Levels: AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA ... WY

Not too surprisingly, the two additional entries in the State column are DC (Washington, D.C.) and PR (Puerto Rico). This discovery is certainly reasonable given the territorial make-up of the United States.

Question 1

Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.

Answer 1

library(ggplot2)

# Bar graph of counts
# This uses the mtcars data frame, with the "cyl" column for
# x position. The y position is calculated by counting the number of rows for
# each value of cyl.
ggplot(inc, aes(x = State)) + 
  geom_bar() +
  coord_flip() +
  scale_x_discrete(limits = rev(levels(as.factor(inc$State)))) +
  ggtitle("Companies per State") +
  xlab("State") +
  ylab("Companies") +
  geom_text(stat='count', aes(label=..count..), hjust=-.3) +
  theme(axis.text.x=element_blank()) # Following the gif from the week 2 discussion

Explanation of decisions

  • I chose a bar graph as the chart style allows for clear comparison of the different bars. Each state is an individual count so a line graph would not have made any sense. I also chose not to use a histogram or distribution chart as that would force the states to be out of alphabetical order. Also, the histogram or distribution plot often implies a relationship for the given distribution. Seeing as the states are vastly different in size and population, not each state is equal in that regard.

  • As the number of states is 52 in the chart, I decided to list the states along the Y axis. This allows the chart to be presented in portrait orientation. I also had to reorder the list to ensure the state abbreviations are alphabetical starting at the top. I also believe the choice of Y axis for states is good because the abbreviations aren’t overlapping and aren’t tilted at 30 degrees which often happens when many categories are identified along an X axis.

  • I chose alphabetical ordering over numeric ordering because I believe the alphabetizing of the state abbreviations allows for easier comprehension by the viewer.

  • Following some principles of Edward Tufte, and also with the default ggplot2 theme, the bars are one solid color, no reason to present each state with a different color. I removed the Y axis tick labels in favor of displaying the actual count for each bar on the graph itself.

Question 2

Let’s dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

New York is 3rd

Based on Question 1 output, New York is the state with the 3rd most companies.

# Subset the intial dataframe to just the rows with State equal to 'NY'
ny <- subset(inc, State == 'NY')

# Output data for visual confirmation
head(ny)
##    Rank                      Name Growth_Rate  Revenue
## 26   26              BeenVerified       84.43 13700000
## 30   30                  Sailthru       73.22  8100000
## 37   37              YellowHammer       67.40 18000000
## 38   38                 Conductor       67.02  7100000
## 48   48 Cinium Financial Services       53.65  5900000
## 70   70                  33Across       44.99 27900000
##                        Industry Employees      City State  dupe
## 26 Consumer Products & Services        17  New York    NY FALSE
## 30      Advertising & Marketing        79  New York    NY FALSE
## 37      Advertising & Marketing        27  New York    NY FALSE
## 38      Advertising & Marketing        89  New York    NY FALSE
## 48           Financial Services        32 Rock Hill    NY FALSE
## 70      Advertising & Marketing        75  New York    NY FALSE
nrow(ny)
## [1] 311
# Remove any non-complete rows from the NY dataframe
ny_comp <- ny[complete.cases(ny), ]

nrow(ny_comp)
## [1] 311

Confirming the 311 rows of New York have complete data

library(dplyr)

# Reorder based on the employee average instead of industry name
ny$Industry = with(ny, reorder(Industry, Employees, mean))

# Calculate the mean, min, and max of employees for each company and group by industry
ny.summary <- ny %>%
  group_by(Industry) %>%
  summarise(
    emp_mean = mean(Employees),
    emp_min = min(Employees),
    emp_max = max(Employees)
  )

# Output the resulting dataframe for visual inspection
ny.summary
## # A tibble: 25 x 4
##    Industry                   emp_mean emp_min emp_max
##    <fct>                         <dbl>   <int>   <int>
##  1 Government Services            17        17      17
##  2 Real Estate                    18.2       7      30
##  3 Retail                         24.8       3      75
##  4 Logistics & Transportation     29.5       1      70
##  5 Insurance                      32.5      15      50
##  6 Computer Hardware              44        44      44
##  7 Engineering                    53.5      11      94
##  8 Advertising & Marketing        58.4       2     270
##  9 Education                      59.9      19     200
## 10 Construction                   61        10     219
## # … with 15 more rows
# Trying to identify outliers
maxval <- 3200

dd <- ny.summary %>% filter(ny.summary$emp_max > maxval) %>%
        group_by(Industry) %>%
          summarise(outlier_txt=paste(emp_max,collapse=","))

# Output resulting df
dd
## # A tibble: 2 x 2
##   Industry                     outlier_txt
##   <fct>                        <chr>      
## 1 Consumer Products & Services 10000      
## 2 Business Products & Services 32000

Answer 2

# Round the mean values for display purposes
ny.summary$emp_mean <- round(ny.summary$emp_mean, digits=0)

# Plot the chart for Question 2
ggplot() + 
  geom_linerange(ny.summary, mapping=aes(x=Industry, ymin=emp_min, ymax=emp_max), width=0.2, size=1) +
  geom_point(ny.summary, mapping=aes(x=Industry, y=emp_mean), size=2, shape=23, fill="white") +
  geom_text(ny.summary, mapping=aes(x=Industry, y=emp_mean, label=emp_mean), size=3, vjust=-.6, nudge_x = 0.05) +
  coord_flip(ylim=c(0, maxval)) +
  ggtitle("Average Employees with Range for New York") +
  xlab("Industry") +
  ylab("Employees") +
  geom_text(data=dd,aes(x=Industry, y=maxval,label=outlier_txt),size=3,vjust=-.6,hjust=.5) +
  geom_segment(data=dd,aes(x=Industry, y=maxval*0.95,yend=3365, xend=factor(Industry)),
               arrow = arrow(length = unit(0.3,"cm")))

Explanation of decisions

  • Based on the question requiring an identification of the ranges, initially I generated a boxplot using geom_boxplot(). But then the more I thought about the prompt, I realized the boxplot captures the entire range as long as all the outliers are intact along with median value and the quartiles of distribution. As the prompt left the design open-ended (on purpose), I chose the geom_linerage approach with an identifier for the average as the final approach. I wanted to portray just the data and information requested in the prompt, and thus not include additional information. Yes, the Edward Tufte approach is really starting to weigh on my design decisions. I was imagining this chart in a newspaper or on a news website, not as a submission for a data science analysis project, thus the less-is-more approach to the presented chart.

  • I ordered the industries by average employees. Given this statistic as the focus of the chart, ordering on the statistic brings awareness to its importance. Also, I figured ordering by alphabetized industry names would provide little value. Sure, the alphabetized list provides a mechanism of order to a viewer, but the names aren’t household names in the way the state abbreviations are in Question 1. I wouldn’t expect a casual viewer to look for “Transportation” as “Logistics & Transportation”.

  • Using geom_linerange, I’m able to create the range of data using an indiscreet black line, a la Tufte. The average for each industry is portrayed as a diamond on the range line along with the true numeric value. I believe the specific number provides more clarity to a viewer instead of requiring the viewer to estimate the number based on the X axis values. I left the X axis labels on the chart, as many averages are below 1000, I felt the higher numbers would be harder to estimate because the chart doesn’t stop at an exact multiple of 1000.

  • I presented the industry names on the Y axis, following the approach in Question 1. This allows for a long list of industry names and also allows the names to be written horizontally (not tilted).

  • As for the outliers, I’m not completely satisfied with this approach, as I added arrows to the two lines that would extend past the graph’s right edge. With those arrows, I also identify the specific maximum values of the range. The size of the chart seemed a good compromise to capture all the numbers below 1000 along with capturing the extent to which the two outliers extend farther beyond the presented chart.

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

library(ggthemes)

# Set theme from package ggthemes
theme_set(theme_fivethirtyeight())

# Remove incomplete rows
inc <- inc[complete.cases(inc), ]

nrow(inc)
## [1] 4989
# Calculate the revenue by the number of employees for each company
inc$Rev_Per_Emp <- inc$Revenue / inc$Employees

# Reorder by industry using the mean of the revenue per employee
inc$Industry = with(inc, reorder(Industry, Rev_Per_Emp, mean))

# Graph the boxplot of the data using ggplot
ggplot(inc, aes(x=Industry, y=Rev_Per_Emp)) + 
  geom_boxplot(outlier.shape = NA) +
  stat_summary(fun.y = "mean", geom = "point", shape = 23, size = 2, fill = "white") +
  ggtitle("Revenue Per Employee") +
  xlab("Industry") +
  ylab("Revenue") +
  coord_flip(ylim = c(0, 1800000))

As mentioned in Question 2, I first attempted a boxplot in order to capture the range of values. The more I thought about the prompt, the less I considered the boxplot the correct the approach. Nonetheless, I left the boxplot here as evidence of my thought process. The boxplot does capture the revenue per employee by industry, but doesn’t exactly capture the full range of values. The outliers are not captured on this graph, and thus the graph doesn’t identify the full range as requested in the question prompt.

I understood the prompt to imply a presentation in a print publication or news website, thus the use of boxplots would be overkill to a typical consumer of news graphics. I also used the theme of the website fivethirtyeight.com from the ggthemes package for experimentation purposes.

# Set theme to Tufte, an homage
theme_set(theme_tufte())

# Reorder the data based on the calculation of revenue per employee
inc$Industry = with(inc, reorder(Industry, Rev_Per_Emp, mean))

# Calculate the mean, min, and max of the revenue per employee and group by industry
inc.summary <- inc %>%
  group_by(Industry) %>%
  summarise(
    rev_emp_mean = mean(Rev_Per_Emp),
    rev_emp_min = min(Rev_Per_Emp),
    rev_emp_max = max(Rev_Per_Emp)
  )

# Output for visual inspection
inc.summary
## # A tibble: 25 x 4
##    Industry                rev_emp_mean rev_emp_min rev_emp_max
##    <fct>                          <dbl>       <dbl>       <dbl>
##  1 Engineering                  201120.      37067.     725424.
##  2 Software                     225989.      14062.   12600000 
##  3 Government Services          243596.      18000     3835616.
##  4 IT Services                  270494.       8369.    3506349.
##  5 Security                     283391.      14865.    3075000 
##  6 Environmental Services       283607.      41623.    2009231.
##  7 Education                    296454.       4706.    5100000 
##  8 Advertising & Marketing      306036.       2076.    5663492.
##  9 Media                        307144.       6273.    1550000 
## 10 Health                       325199.       9778.    6835484.
## # … with 15 more rows
# Round the mean and max values for presentation purposes
inc.summary$rev_emp_mean <- round(inc.summary$rev_emp_mean, digits=0)
inc.summary$rev_emp_max <- round(inc.summary$rev_emp_max, digits=0)

# Trying to identify outliers
maxval <- 8000000

dd <- inc.summary %>% filter(inc.summary$rev_emp_max > maxval) %>%
        group_by(Industry) %>%
          summarise(outlier_txt=paste(rev_emp_max,collapse=","))

# Output resulting df of outliers
dd
## # A tibble: 4 x 2
##   Industry                   outlier_txt
##   <fct>                      <chr>      
## 1 Software                   12600000   
## 2 Manufacturing              12800000   
## 3 Logistics & Transportation 9666667    
## 4 Energy                     40740000

Answer 3

# Plot the chart for Question 3
ggplot() + 
  geom_linerange(inc.summary, mapping=aes(x=Industry, ymin=rev_emp_min, ymax=rev_emp_max), width=0.2, size=1) +
  geom_point(inc.summary, mapping=aes(x=Industry, y=rev_emp_mean), size=2, shape=23, fill="white") +
  geom_text(inc.summary, mapping=aes(x=Industry, y=rev_emp_mean, label=rev_emp_mean), size=3, vjust=-.6, nudge_x = 0.05) +
  coord_flip(ylim=c(0, maxval)) +
  ggtitle("Average Revenue Per Employee with Range") +
  xlab("Industry") +
  ylab("Revenue Per Employee") +
  geom_text(data=dd,aes(x=Industry, y=maxval,label=outlier_txt),size=3,vjust=-.6,hjust=.5) +
  geom_segment(data=dd,aes(x=Industry, y=maxval*0.95,yend=8400000, xend=factor(Industry)),
              arrow = arrow(length = unit(0.3,"cm"))) +
  scale_y_continuous(labels = scales::number)

Explanation of decisions

  • Overall, I followed the display approach used for Question 2. The boxplot was considered first, but instead I used the linerange to clearly denote the range of values for each industry. The boxplot gives more information but doesn’t capture the full range. I wanted the chart to capture the exact data requested in the question prompt, and no more.

  • The industries are presented on the Y axis to allow for easy readability of the industry names instead of presenting the names tilted on the X axis.

  • The industries are ordered by the average of revenue per employee. As the prompt requests the presentation of this statistic primarily, I chose to order by this statistic to bring more attention specifically to that information.

  • In an effort to indicate the exact values, the averages are presented as white diamonds along with the exact number rounded to a whole number. I did keep the X axis numeric values to help readability, but I presented the exact averages to provide the information clearly and not require visual estimating by the viewer.

  • The outliers are treated the same as in Question 2. Arrow heads are presented for those line ranges that extend beyond the right edge of the chart with the maximum value. To me, not the ideal presentation of the outliers but does include the information.

  • The selection of just over 8 million as the upper boundary was based on trial an error in trying to allow enough space for the line ranges and averages below 1 million while also having as few ranges extend beyond the chart. At a lower maximum value on the X axis, say 2 million, the averages would be much easier to focus on and compare, but then most line ranges would extend beyond the chart and thus provide no real value. Again, a compromise to capture the averages and full line ranges with just a few identified outliers.

  • I did have to manipulate the X axis labels to show the full numbers, multiples of 1 million. By default the numbers were presented in scientific notation, but I felt that would be too obtuse of a casual viewer.