Libraries Used

require(dplyr)
require(tidyr)
require(knitr)
require(kableExtra)
require(kable)
require(ggplot2)

Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:

inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)

And lets preview this data:

head(inc)
##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX
summary(inc)
##       Rank          Name            Growth_Rate         Revenue         
##  Min.   :   1   Length:5001        Min.   :  0.340   Min.   :2.000e+06  
##  1st Qu.:1252   Class :character   1st Qu.:  0.770   1st Qu.:5.100e+06  
##  Median :2502   Mode  :character   Median :  1.420   Median :1.090e+07  
##  Mean   :2502                      Mean   :  4.612   Mean   :4.822e+07  
##  3rd Qu.:3751                      3rd Qu.:  3.290   3rd Qu.:2.860e+07  
##  Max.   :5000                      Max.   :421.480   Max.   :1.010e+10  
##                                                                         
##    Industry           Employees           City              State          
##  Length:5001        Min.   :    1.0   Length:5001        Length:5001       
##  Class :character   1st Qu.:   25.0   Class :character   Class :character  
##  Mode  :character   Median :   53.0   Mode  :character   Mode  :character  
##                     Mean   :  232.7                                        
##                     3rd Qu.:  132.0                                        
##                     Max.   :66803.0                                        
##                     NA's   :12

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

The initial summaries above are helpful in allowing us to conclude the following:

We can also investigate other aspects of this data by using the group_by function to create more specific breakdowns of some of our variables.


Average Revenue By Industry

If we were to break down the amount of revenue per industry, we can see the following:

inc %>% 
  group_by(Industry) %>% 
  summarise(Mean_Revenue = mean(Revenue), Number_of_Companies = n()) %>% 
  arrange(desc(Mean_Revenue)) %>% 
  kable('html') %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Industry Mean_Revenue Number_of_Companies
Computer Hardware 270129545 44
Energy 126344954 109
Food & Beverage 98559542 131
Logistics & Transportation 95745161 155
Consumer Products & Services 73676847 203
Construction 70450802 187
Telecommunications 56855814 129
Business Products & Services 54705187 482
Security 52230137 73
Environmental Services 51741176 51
Financial Services 50580385 260
Retail 50529064 203
Health 50319437 355
Manufacturing 49546875 256
Travel & Hospitality 47283871 62
Human Resources 47173980 196
Insurance 46758000 50
Engineering 34222973 74
Media 32266667 54
Real Estate 30892708 96
Government Services 29748020 202
IT Services 28214598 733
Software 23802924 342
Advertising & Marketing 16528662 471
Education 13726506 83

It looks like, on average, the Computer Hardware industry generates the most revenue out of the 25 industries in this dataset.


Average Revenue Per Employee

Next, we will factor in the number of employees into this calculation to see how this impacts the average revenue calculation.

inc %>% 
  group_by(Industry) %>% 
  summarise(Mean_Revenue_per_employee = mean((Revenue / Employees), na.rm = TRUE), Number_of_companies = n(), Mean_Employees = mean(Employees, na.rm = TRUE)) %>% 
  arrange(desc(Mean_Revenue_per_employee)) %>% 
  kable('html') %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Industry Mean_Revenue_per_employee Number_of_companies Mean_Employees
Energy 1554655.8 109 242.54128
Computer Hardware 817702.2 44 220.77273
Logistics & Transportation 794810.9 155 259.70130
Food & Beverage 618382.8 131 510.93798
Insurance 474966.4 50 146.78000
Consumer Products & Services 466068.1 203 223.96059
Construction 465682.4 187 155.60963
Manufacturing 453524.0 256 172.32157
Telecommunications 449259.6 129 242.85039
Real Estate 434515.6 96 198.87368
Travel & Hospitality 414788.1 62 371.53226
Retail 412554.9 203 182.60099
Human Resources 395972.2 196 1158.06122
Financial Services 394230.9 260 183.43462
Business Products & Services 359096.9 482 244.49375
Health 325198.9 355 232.85311
Media 307143.8 54 176.51852
Advertising & Marketing 306036.3 471 84.35456
Education 296453.5 83 92.59036
Environmental Services 283607.3 51 199.11765
Security 283391.4 73 562.45205
IT Services 270494.2 733 140.42076
Government Services 243596.0 202 129.62871
Software 225989.2 342 150.32845
Engineering 201119.9 74 276.14865

As we can see above, although the Computer Hardware industry appears to generate, on average, the most revenue, when factoring in the number of employees, we see the Energy industry supersedes Computer Hardware. Additionally, it’s interesting to see Insurance jump up into the top 5 industries when factoring in the average number of employees, and to see Food & Beverage fall below Logistics & Transportation. I believe this is due to the number of employees required for each industry. As we can see from the Mean_Employees column, on average, Food & Beverage companies require over 500 employees, which is far greater than many others on the list. Additionally, Insurance companies require about 150 employees on average, which is far lower than many others on this list. Therefore, these movements seem to be impacted by this phenomenon.


Top 10 Companies with Highest/Lowest Growth Rate

We can also take a look at the companies that had the highest and lowest growth rates. Below, are the top 10 companies with the highest growth rates:

inc %>% 
  arrange(desc(Growth_Rate)) %>% 
  head(n = 10L) %>% 
  kable('html') %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Rank Name Growth_Rate Revenue Industry Employees City State
1 Fuhu 421.48 1.179e+08 Consumer Products & Services 104 El Segundo CA
2 FederalConference.com 248.31 4.960e+07 Government Services 51 Dumfries VA
3 The HCI Group 245.45 2.550e+07 Health 132 Jacksonville FL
4 Bridger 233.08 1.900e+09 Energy 50 Addison TX
5 DataXu 213.37 8.700e+07 Advertising & Marketing 220 Boston MA
6 MileStone Community Builders 179.38 4.570e+07 Real Estate 63 Austin TX
7 Value Payment Systems 174.04 2.550e+07 Financial Services 27 Nashville TN
8 Emerge Digital Group 170.64 2.390e+07 Advertising & Marketing 75 San Francisco CA
9 Goal Zero 169.81 3.310e+07 Consumer Products & Services 97 Bluffdale UT
10 Yagoozon 166.89 1.860e+07 Retail 15 Warwick RI

As we can see, it looks like Fuhu, based in El Segundo, CA had an extremely high growth rate (almost double the next highest company on the list).

And here are the 10 companies with the lowest growth rates:

inc %>% 
  arrange(desc(Growth_Rate)) %>% 
  tail(n = 10L) %>% 
  kable('html') %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Rank Name Growth_Rate Revenue Industry Employees City State
4992 4992 Salem Metal Fabricators 0.35 7.40e+06 Manufacturing 50 Middleton MA
4993 4993 The PI Company 0.35 2.00e+06 Business Products & Services 6 North Little Rock AR
4994 4994 RFB Holdings 0.35 7.20e+06 Human Resources 27 Downer Grove IL
4995 4995 Sterling Computers 0.35 1.66e+08 Government Services 98 Norfolk NE
4996 4996 cSubs 0.34 1.34e+07 Business Products & Services 19 Montvale NJ
4997 4997 Dot Foods 0.34 4.50e+09 Food & Beverage 3919 Mt. Sterling IL
4998 4998 Lethal Performance 0.34 6.80e+06 Retail 8 Wellington FL
4999 4999 ArcaTech Systems 0.34 3.26e+07 Financial Services 63 Mebane NC
5000 5000 INE 0.34 6.80e+06 IT Services 35 Bellevue WA
5001 5000 ALL4 0.34 4.70e+06 Environmental Services 34 Kimberton PA

Above, we can see that ALL4, located in Kimberton, PA had the slowest growth rate among the top 5,000 companies in the dataset.


Growth Rates by State (states with at least 50 companies)

We can also break down growth rates by state:

inc %>% 
  group_by(State) %>% 
  summarise(Mean_Growth_Rate = mean(Growth_Rate), Number_of_Companies = n()) %>% 
  arrange(desc(Mean_Growth_Rate)) %>% 
  filter(Number_of_Companies >= 50) %>% 
  kable('html') %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
State Mean_Growth_Rate Number_of_Companies
UT 6.307790 95
TX 6.024186 387
CA 5.897489 701
FL 5.846099 282
MA 5.416648 182
CT 4.994600 50
MD 4.984809 131
CO 4.952015 134
TN 4.950366 82
VA 4.877350 283
IN 4.788261 69
AZ 4.616700 100
NJ 4.445380 158
NY 4.371158 311
WA 4.002692 130
MN 3.821477 88
IL 3.738755 273
OH 3.557527 186
GA 3.523538 212
NC 3.511679 137
WI 2.690253 79
PA 2.566585 164
MO 2.497288 59
AL 2.407451 51
MI 2.238571 126

Although it’s difficult to glean too much here given the wide variety of the number of companies per state in this dataset, we can see that states with at least 50 companies in this dataset have mean growth rates ranging from ~6.3 percent to ~2.2 percent.


Question 1

Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.

In order to do this, we’ll first need to isolate and group the data by state, counting the number of companies per state using the group_by and summarise functions.

comp_state <- inc %>% 
  group_by(State) %>% 
  summarise(Number_of_companies = n()) %>% 
  arrange(desc(Number_of_companies))

Then, using ggplot, we can create a bar graph:

g <- ggplot(comp_state, aes(x = reorder(State, Number_of_companies), y = Number_of_companies)) 
g <- g + geom_bar(stat = "identity", fill = "#009a9c") + coord_flip()
g <- g + scale_y_continuous(expand = c(0, 0), limits = c(0, 900)) 
g <- g + ggtitle("Distribution of the 5,000 Fastest Growing \nCompanies in the U.S. (by State)")
g <- g + geom_hline(yintercept=seq(1,800,250), col="white", lwd=1)
g <- g + ylab("Number of Companies")
g <- g + xlab("State")
g <- g + geom_text(aes(label = Number_of_companies, size = 1), vjust = 0.20, hjust = -0.25, size=2.5, fontface='bold', color="#009a9c")
g <- g + theme(plot.title = element_text(size=12, face="bold", hjust = 0.25),
        axis.text=element_text(size=9, face = "bold"),
        axis.title=element_text(size=9,face="bold", color = "#555555"),
        panel.background = element_blank(),
        axis.line = element_line(color = "#cccccc", 
                      size = 0.5, linetype = "solid"),
        axis.ticks.y = element_blank(),
        axis.ticks.x = element_line(color="#cccccc"))
g

As mentioned in the question text, I decided to display the bar graph in portrait format, with the bars extending horizontally rather than vertically since there were a large number of states. I also attempted to be mindful of the data-ink ratio here, by removing extra gridlines in the background, and any extraneous borders and ink that could distract from the overall message of the graph. I did add some white guidelines to the bars in order to show deliniations between our major gridlines on the x-axis (0, 250, 500, and 750).


Question 2

Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

First, we’ll have to isolate the state with the 3rd most companies in the dataset. This appears to be the state of New York. In order to do this, we can use the filter function in R:

ny_companies <- inc %>% 
  filter(State == 'NY') %>% 
  filter(complete.cases(.))

With the New York State companies saved in a separate data frame, checking to ensure that all companies in the list are complete.cases(), we can now create another data frame that holds the specific mean employee calculations for each industry in the state. Also, we’ll find the maximum average mean employee value for all of the industries in the data set – in order to use later for our scaling on our chart. Similar to before, we can use the group_by and summarise functions again:

mean_industry <- ny_companies %>% 
  group_by(Industry) %>% 
  summarise(Mean_employees = mean(Employees))

mean_industry_max <- max(mean_industry$Mean_employees)
ny_companies_max <- max(ny_companies$Employees)

With these data frames and values ready, we can now create our graph that shows both the mean and median employment by industry for companies in NY. To do this, we’ll set up a ggplot boxplot:

g <- ggplot(ny_companies, aes(x= reorder(Industry, Employees, mean), y=Employees))
g <- g + geom_boxplot(width = 0.5, fill = "aliceblue", fatten = 2, color="#333333", alpha=0.2) + coord_flip()
g <- g + geom_point(data = mean_industry, aes(x=Industry, y=Mean_employees, fill = "Mean Employees"),  color="maroon", size = 2, show.legend = TRUE)
g <- g + scale_y_continuous(expand = c(0,0), limits = c(0, mean_industry_max + 50))
g <- g + ggtitle("Distribution of Employees per Industry (New York State)")
g <- g + ylab("Employees")
g <- g + xlab("Industry")
g <- g + theme_bw()
g <- g + theme(plot.title = element_text(size=12, face="bold", hjust = 0.5, color = "#555555"),
        axis.text=element_text(size=9, face = "bold"),
        axis.title=element_text(size=9,face="bold", color = "#555555"),
        panel.background = element_blank(),
        panel.border = element_blank(),
        axis.line = element_line(color = "#cccccc", 
                      size = 0.5, linetype = "solid"),
        axis.ticks.y = element_blank(),
        axis.ticks.x = element_line(color="#cccccc"),
        legend.position = "top")
g <- g + labs(fill = "")
g

As we can see above, it looks like our mean employee calculations are a bit skewed. When examining the data, for the Business Products & Services industry, we can see one company really pulls the mean employee value much higher (Sutherland Global Services), and given our scaling, it isn’t even pictured on this first iteration of the graph. This is also happening with a few other industries. Therefore, to account for this, we can do a log transformation on our visualization to account for these outliers:

g <- ggplot(ny_companies, aes(x= reorder(Industry, Employees, mean), y=Employees))
g <- g + geom_boxplot(width = 0.5, fill = "aliceblue", fatten = 2, color="#333333", alpha = 0.2) + coord_flip()
g <- g + geom_point(data = mean_industry, aes(x=Industry, y=Mean_employees, fill = "Mean Employees"),  color="maroon", size = 2, show.legend = TRUE)
g <- g + scale_y_continuous(expand = c(0,0), limits = c(0, mean_industry_max))
g <- g + ggtitle("Distribution of Employees per Industry (New York State) | Log Transformation")
g <- g + ylab("Employees")
g <- g + xlab("Industry")
g <- g + theme_bw()
g <- g + theme(plot.title = element_text(size=11, face="bold", color = "#555555"),
        axis.text=element_text(size=9, face = "bold"),
        axis.title=element_text(size=9,face="bold", color = "#555555"),
        panel.background = element_blank(),
        panel.border = element_blank(),
        axis.line = element_line(color = "#cccccc", 
                      size = 0.5, linetype = "solid"),
        axis.ticks.y = element_blank(),
        axis.ticks.x = element_line(color="#cccccc"),
        legend.position = "top")
g <- g + labs(fill = "")
g <- g + scale_y_log10(limits = c(1, mean_industry_max))
g <- g + labs(caption = "(log transformation of employees | x-axis on log scale)")
g <- g + theme(plot.caption = element_text(size = 8, hjust = 0.5, color = "#333333"))
g

Above, we can see that the visualization now accounts for these outliers. As with the first graph, I did my best to minimize the data-ink ratio by removing excess borders, only including important gridlines, and keeping the non-data ink to a minimum.


Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

Similar to our work in the exploratory analysis section above, I’ll use the group_by and summarise functions again to group the data based on Industry, and calculate the mean revenue per employee for each industry in the dataset:

rev_emp <- inc %>% 
  filter(complete.cases(.)) %>% 
  group_by(Industry) %>% 
  summarise(Rev_total = sum(Revenue), Employee_total =  sum(Employees)) %>%
  mutate(Revenue_per_employee = Rev_total / Employee_total) %>% 
  arrange(desc(Revenue_per_employee))

With the data grouped accordingly, I then decided to create a similar visualization (bar chart), as I did in the first question:

g <- ggplot(rev_emp, aes(x = reorder(Industry, Revenue_per_employee), y = Revenue_per_employee)) 
g <- g + geom_bar(stat = "identity", fill = "#009a9c") + coord_flip()
g <- g + scale_y_continuous(expand = c(0, 0), limits = c(0, 1500000), breaks = c(0, 500000, 1000000), labels = scales::comma) 
g <- g + ggtitle("Revenue per Employee by Industry of the 5,000 Fastest Growing \nCompanies in the U.S.")
g <- g + geom_hline(yintercept=seq(0,1500000,250000), col="white", lwd=1)
g <- g + ylab("Revenue per Employee")
g <- g + xlab("Industry")
g <- g + geom_text(aes(label = scales::comma(round(Revenue_per_employee, 0)), size = 1), vjust = 0.20, hjust = -0.15, size=2.5, fontface='bold', color="#009a9c")
g <- g + theme(plot.title = element_text(size=11, face="bold", color = "#555555"),
        axis.text=element_text(size=9, face = "bold"),
        axis.title=element_text(size=9,face="bold", color = "#555555"),
        panel.background = element_blank(),
        axis.line = element_line(color = "#cccccc", 
                      size = 0.5, linetype = "solid"),
        axis.ticks.y = element_blank(),
        axis.ticks.x = element_line(color="#cccccc"))
g

Above, I used the same format as Question 1, trying my best to minimize my data-ink ratio, by reducing my gridlines and extra non-data ink. As we can see above, the Computer Hardware industry far surpasses all other industries for the amount of revenue generated per employee amongst the 5,000 fastest growing companies in the United States.