Data-Ink Ratio Exercise with ggplot2

Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest-growing companies in the US, as compiled by Inc. magazine. Let's read this in:

inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)

And let's preview this data:

inc <- dplyr::as_tibble(inc)
head(inc)
## # A tibble: 6 x 8
##    Rank Name         Growth_Rate  Revenue Industry       Employees City    State
##   <int> <chr>              <dbl>    <dbl> <chr>              <int> <chr>   <chr>
## 1     1 Fuhu                421.   1.18e8 Consumer Prod~       104 El Seg~ CA   
## 2     2 FederalConf~        248.   4.96e7 Government Se~        51 Dumfri~ VA   
## 3     3 The HCI Gro~        245.   2.55e7 Health               132 Jackso~ FL   
## 4     4 Bridger             233.   1.90e9 Energy                50 Addison TX   
## 5     5 DataXu              213.   8.70e7 Advertising &~       220 Boston  MA   
## 6     6 MileStone C~        179.   4.57e7 Real Estate           63 Austin  TX
summary(inc)
##       Rank          Name            Growth_Rate         Revenue         
##  Min.   :   1   Length:5001        Min.   :  0.340   Min.   :2.000e+06  
##  1st Qu.:1252   Class :character   1st Qu.:  0.770   1st Qu.:5.100e+06  
##  Median :2502   Mode  :character   Median :  1.420   Median :1.090e+07  
##  Mean   :2502                      Mean   :  4.612   Mean   :4.822e+07  
##  3rd Qu.:3751                      3rd Qu.:  3.290   3rd Qu.:2.860e+07  
##  Max.   :5000                      Max.   :421.480   Max.   :1.010e+10  
##                                                                         
##    Industry           Employees           City              State          
##  Length:5001        Min.   :    1.0   Length:5001        Length:5001       
##  Class :character   1st Qu.:   25.0   Class :character   Class :character  
##  Mode  :character   Median :   53.0   Mode  :character   Mode  :character  
##                     Mean   :  232.7                                        
##                     3rd Qu.:  132.0                                        
##                     Max.   :66803.0                                        
##                     NA's   :12

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

I’ll keep this section relatively short, as experience has taught me that numeric summaries, like the ones above, can be misleading without visualizing the data as well.

I am curious to see how many companies are in each industry, the industries with the highest median growth rates, as well as the aggregated revenue from each industry.

# Load dplyr for %>%, n(), and desc()
library(dplyr)

# Per-industry count, share of all companies, total revenue, and median growth rate
inc %>%
  dplyr::select(Industry, Revenue, Growth_Rate) %>%
  dplyr::group_by(Industry) %>%
  dplyr::summarize(
    Count = n(),
    '%' = round(n()/nrow(.),2),
    Total_Revenue = sum(Revenue),
    Mdn_Growth_Rate = median(Growth_Rate)
  ) %>%
  dplyr::arrange(desc(Mdn_Growth_Rate)) 
## # A tibble: 25 x 5
##    Industry                     Count   `%` Total_Revenue Mdn_Growth_Rate
##    <chr>                        <int> <dbl>         <dbl>           <dbl>
##  1 Government Services            202  0.04    6009100000            2.11
##  2 Energy                         109  0.02   13771600000            2.08
##  3 Real Estate                     96  0.02    2965700000            2.07
##  4 Media                           54  0.01    1742400000            1.94
##  5 Consumer Products & Services   203  0.04   14956400000            1.82
##  6 Retail                         203  0.04   10257400000            1.76
##  7 Software                       342  0.07    8140600000            1.72
##  8 Advertising & Marketing        471  0.09    7785000000            1.61
##  9 Health                         355  0.07   17863400000            1.57
## 10 Security                        73  0.01    3812800000            1.54
## # ... with 15 more rows

Interestingly, Government Services has the highest median growth rate, followed by Energy and Real Estate. It is also notable that several of the industries with the highest median growth rates, such as Energy, Real Estate, and Media, each make up only 1-2% of the companies on this list. None of these industries are new, so something happening at the time the data was gathered may have fueled that growth.
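The summary earlier reports a mean Growth_Rate (4.612) well above the median (1.420), which suggests a right-skewed column and is one reason to rank industries by median rather than mean. A minimal sketch of that effect, using made-up numbers rather than values from inc:

```r
# Hypothetical growth rates (made-up values, not taken from inc):
# most are small and one is extreme, mimicking a right-skewed column.
growth <- c(0.5, 0.8, 1.2, 1.4, 2.0, 3.1, 420)

growth_mean   <- mean(growth)    # pulled upward by the single extreme value
growth_median <- median(growth)  # middle value, unaffected by the extreme

print(c(mean = growth_mean, median = growth_median))
```

A single extreme company is enough to move the mean far from the typical value, while the median stays put.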

Question 1

Create a graph that shows the distribution of companies in the dataset by State (i.e., how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (i.e., taller than wide), which should further guide your layout choices.

library(ggplot2)

# Count companies per state, then draw bars ordered by that count
inc %>% 
  dplyr::count(State) %>%
  ggplot() +
    aes(x = reorder(State, n) , y = n, fill = n) +
    geom_col(position = 'dodge') +
    geom_text(aes(label = n), size = 3,  hjust=-0.10)+ 
    labs(title = 'Number of Fastest Growing Companies by State') +
    xlab('') + 
    ylab('') + 
    scale_fill_gradient(low = "indianred1", 
                        high = "indianred4" 
                         ) + 
    coord_flip() +
    theme(
      panel.background = element_rect(fill = "white", color = NA),
      axis.text.x = element_blank(), 
      axis.ticks.x = element_blank(),
      legend.position="none",
      plot.margin=unit(c(.2,.5,.2,.2),"cm")
    )

The chart above shows that a substantial number of the fastest-growing companies come from California.
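The ordering of the bars comes from reorder(State, n), which sorts the factor levels by count. A small sketch of that behavior, with hypothetical counts that are not the real state totals:

```r
# Hypothetical per-state counts (illustrative values only)
counts <- data.frame(
  State = c("CA", "NY", "TX"),
  n     = c(70, 31, 38)
)

# reorder() returns a factor whose levels are sorted by the second argument,
# from the smallest count to the largest
state_sorted  <- reorder(counts$State, counts$n)
sorted_levels <- levels(state_sorted)
print(sorted_levels)
```

Because the levels run from smallest to largest, the state with the highest count is drawn last along the axis.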

Question 2

Let's dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function). In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

The summary above shows 12 rows with missing values in the Employees column. After filtering to New York (State == 'NY'), the state with the third-most companies in the chart above, we remove incomplete rows with tidyr::drop_na(), which drops any row containing a missing value. Outliers are an issue for this visual. As an employee of the state, however, we are more interested in the general case than in the extremes, so we’ll exclude the largest outliers to keep the plot readable.

# Keep New York companies, then drop any row with a missing value
for_box <- inc %>%
  dplyr::filter(State == 'NY') %>%
  tidyr::drop_na()
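The prompt mentions complete.cases(), while the chunk above uses tidyr::drop_na(). When applied to all columns, both keep only rows with no missing values. A base-R sketch on a hypothetical three-row data frame (the column values are made up for illustration):

```r
# Hypothetical mini data set with one missing Employees value
toy <- data.frame(
  Name      = c("A", "B", "C"),
  Employees = c(10L, NA, 25L),
  Revenue   = c(1e6, 2e6, 3e6)
)

# complete.cases() returns TRUE for rows in which every column is non-missing
toy_complete <- toy[complete.cases(toy), ]
print(toy_complete)
```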

for_box %>%
  ggplot() + 
    aes(x = reorder(Industry, Employees, FUN = median), y = Employees) + 
    geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.45) +
    # Note: ylim() drops values above 1,250 before the box statistics are computed
    ylim(0,1250) +
    xlab('') + 
    ylab('') +
    coord_flip() +
    labs(title = "Number of Employees by Industry", caption = "*Outliers above 1,250 employees were excluded") +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.15),
    panel.grid.major.x =  element_line(color = "gray90", linetype = "dashed"),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey"),
    plot.margin=unit(c(.2,.5,.2,.2),"cm")
  )

The box plots above show that Environmental Services has the highest median number of employees of any industry among the New York companies plotted.
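One caveat worth keeping in mind when reading these medians: in ggplot2, ylim() sets scale limits, and values outside them are removed before the box plot statistics are computed, so each box describes only companies with at most 1,250 employees. Coordinate-system limits such as coord_cartesian(ylim = ...) zoom the view without dropping data. The sketch below uses made-up employee counts and base R only to show how much trimming can move a median:

```r
# Hypothetical employee counts for one industry (made-up values)
employees <- c(20, 40, 60, 80, 1500, 2000, 3000)

# Median using every company
median_all <- median(employees)

# Median after discarding values above the 1,250 limit, which mirrors
# how scale limits treat out-of-range points
median_trimmed <- median(employees[employees <= 1250])

print(c(all = median_all, trimmed = median_trimmed))
```

In this toy case the median falls from 80 to 50 once the large companies are dropped, so an industry with several large employers can rank lower than it would on the full data.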

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

Outliers are an issue here as well. As an investor, however, we’re interested in the overall health of an industry rather than a one-off outlier, so excluding the most extreme values helps the plot show the typical case. Note that the chart reuses for_box, so it covers only New York companies with complete data.

# Revenue per employee, with values above $1,750,000 removed before plotting
for_box %>%
  dplyr::mutate(rev_per_emp = Revenue / Employees) %>%
  dplyr::filter(rev_per_emp <= 1750000) %>%
  ggplot() +
    aes(x = reorder(Industry, rev_per_emp, FUN = median), y = rev_per_emp) +
    geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.45) +
    xlab('') +
    ylab('') +
    coord_flip() +
    labs(title = "Revenue per Employee by Industry", caption = "*Outliers above $1,750,000 were excluded") +
    scale_y_continuous(labels=scales::dollar_format()) +
  theme_minimal() +
  theme(
    panel.grid.major.x =  element_line(color = "gray90", linetype = "dashed"),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey"),
    plot.margin=unit(c(.2,.5,.2,.2),"cm")
  )

Looking at the above plot as an investor, Logistics & Transportation is probably the best bet for our money. Although its range is wide, its minimum revenue per employee is fairly high, its median is the third highest (almost tied for second), and the upper end of its range is the highest of any industry, which suggests a lot of upside potential.
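Reading medians off a box plot can be cross-checked numerically. A minimal base-R sketch of that check, using a hypothetical data frame with made-up industries and values (toy and med_rev_per_emp are illustrative names, not objects from the analysis above):

```r
# Hypothetical companies (made-up values, not taken from inc)
toy <- data.frame(
  Industry  = c("Retail", "Retail", "Retail", "Energy", "Energy", "Energy"),
  Revenue   = c(2e6, 6e6, 9e6, 5e6, 8e6, 30e6),
  Employees = c(10, 20, 30, 10, 10, 20)
)

toy$rev_per_emp <- toy$Revenue / toy$Employees

# Median revenue per employee for each industry, largest first
med_rev_per_emp <- sort(tapply(toy$rev_per_emp, toy$Industry, median),
                        decreasing = TRUE)
print(med_rev_per_emp)
```

Applying the same tapply() call to the mutated for_box data would give a table to compare against the ordering shown in the plot.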