Introduction

The aim is this work is to apply principles of data visualization using the tidyverse and ggplot2 packages while exploring the 5,000 fastest growing companies in the US in the year 2020, as compiled by Inc. magazine.

One of key principles of data visualization is the data ink ratio. The concept behind data ink ratio is that non-data ink (now ink or pixels) is distracting and that the proportion of data ink relative to the total ink required to print/create a graph should be maximized. This is a minimalist approach with a couple keys decorating methods to avoid such as 3D effects, background images and unnecessary axes and grid lines.

Setup

library(tidyverse)
library(skimr)

inc <- read_csv("https://raw.githubusercontent.com/maelillien/cuny-projects/main/us-company-distribution/Data/inc5000_data.csv", col_names=TRUE)

Tabular Exploration

Lets preview this data:

head(inc)

While the base R functions can be used to get a quick glance at the data or statistical summaries are useful, the output presents the data in a rather messy way.

summary(inc)

##       Rank          Name            Growth_Rate         Revenue         
##  Min.   :   1   Length:5001        Min.   :  0.340   Min.   :2.000e+06  
##  1st Qu.:1252   Class :character   1st Qu.:  0.770   1st Qu.:5.100e+06  
##  Median :2502   Mode  :character   Median :  1.420   Median :1.090e+07  
##  Mean   :2502                      Mean   :  4.612   Mean   :4.822e+07  
##  3rd Qu.:3751                      3rd Qu.:  3.290   3rd Qu.:2.860e+07  
##  Max.   :5000                      Max.   :421.480   Max.   :1.010e+10  
##                                                                         
##    Industry           Employees           City              State          
##  Length:5001        Min.   :    1.0   Length:5001        Length:5001       
##  Class :character   1st Qu.:   25.0   Class :character   Class :character  
##  Mode  :character   Median :   53.0   Mode  :character   Mode  :character  
##                     Mean   :  232.7                                        
##                     3rd Qu.:  132.0                                        
##                     Max.   :66803.0                                        
##                     NA's   :12

The skimr package is used to provide compact and flexible summaries of data. The returned descriptions of the variable types are very helpful in determining how the data was ingested and if any corrections need to be made.

skim(inc)

Data summary
Name	inc
Number of rows	5001
Number of columns	8
_______________________
Column type frequency:
character	4
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
Name	1	2	51	5001
Industry	1	5	28	25
City	1	4	22	1519
State	1	2	2	52

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Rank	0	1	2501.64	1443.51	1.0e+00	1.252e+03	2.502e+03	3.751e+03	5.0000e+03	▇▇▇▇▇
Growth_Rate	0	1	4.61	14.12	3.4e-01	7.700e-01	1.420e+00	3.290e+00	4.2148e+02	▇▁▁▁▁
Revenue	0	1	48222535.49	240542281.14	2.0e+06	5.100e+06	1.090e+07	2.860e+07	1.0100e+10	▇▁▁▁▁
Employees	12	1	232.72	1353.13	1.0e+00	2.500e+01	5.300e+01	1.320e+02	6.6803e+04	▇▁▁▁▁

Cleaning up and wrangling this data we can add some more relevant non-visual exploratory information presented in a cleaner fashion and more digestible format. The summaries below help to understand the skew of the data via the total, average and mean columns and also provide an aggregate of the revenue (in billions of $) per state and per industry.

#inc$Employees[is.na(inc$Employees)] <- 0 # Replace NAs with 0
inc <- na.omit(inc)

state_data <- inc %>% group_by(State) %>% 
      summarise(CompanyCount = n(), TotEmployee = sum(Employees), AvgEmployee = round(mean(Employees),2), StdEmployee = round(sd(Employees),2), SumRevenue_b = sum(Revenue)/(10^9)) %>% 
      arrange(desc(CompanyCount,na.rm=TRUE)) %>% top_n(10)
knitr::kable(state_data, caption = "Summary: Top 10 State by Company Count")

Summary: Top 10 State by Company Count
State	CompanyCount	TotEmployee	AvgEmployee	StdEmployee	SumRevenue_b
CA	700	161219	230.31	1213.67	23.3646
TX	386	90765	235.14	739.43	22.1543
NY	311	84370	271.29	1916.18	18.2604
VA	283	35667	126.03	263.66	8.6677
FL	282	61221	217.10	960.70	10.6103
IL	272	103266	379.65	1463.61	33.2388
OH	186	38002	204.31	818.26	12.7866
NC	135	36685	271.74	819.37	9.2525
MI	126	36905	292.90	850.47	7.8058
WI	77	15548	201.92	757.54	7.1314

industry_data <- inc %>% group_by(Industry) %>% 
      summarise(CompanyCount = n(), TotEmployee = sum(Employees), AvgEmployee = round(mean(Employees),2), StdEmployee = round(sd(Employees),2), SumRevenue_b = sum(Revenue)/(10^9)) %>% 
  arrange(desc(CompanyCount,na.rm=TRUE))
knitr::kable(industry_data, caption = "Summary: Industries")

Summary: Industries
Industry	CompanyCount	TotEmployee	AvgEmployee	StdEmployee	SumRevenue_b
IT Services	732	102788	140.42	392.37	20.5250
Business Products & Services	480	117357	244.49	1519.52	26.3459
Advertising & Marketing	471	39731	84.35	287.40	7.7850
Health	354	82430	232.85	490.96	17.8601
Software	341	51262	150.33	267.57	8.1346
Financial Services	260	47693	183.43	302.83	13.1509
Manufacturing	255	43942	172.32	617.10	12.6036
Consumer Products & Services	203	45464	223.96	1214.94	14.9564
Retail	203	37068	182.60	594.90	10.2574
Government Services	202	26185	129.63	182.64	6.0091
Human Resources	196	226980	1158.06	5474.04	9.2461
Construction	187	29099	155.61	589.23	13.1743
Logistics & Transportation	154	39994	259.70	928.82	14.8378
Food & Beverage	129	65911	510.94	1250.18	12.8125
Telecommunications	127	30842	242.85	919.95	7.2879
Energy	109	26437	242.54	454.36	13.7716
Real Estate	95	18893	198.87	412.58	2.9568
Education	83	7685	92.59	136.42	1.1393
Engineering	74	20435	276.15	1166.04	2.5325
Security	73	41059	562.45	2433.98	3.8128
Travel & Hospitality	62	23035	371.53	900.79	2.9316
Media	54	9532	176.52	502.23	1.7424
Environmental Services	51	10155	199.12	742.18	2.6388
Insurance	50	7339	146.78	412.92	2.3379
Computer Hardware	44	9714	220.77	1016.74	11.8857

Visualizations

Company Distribution by State

Given that there are a lot of states and that this visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (taller than wider), it is important to consider which axis to use in the choice of layout. I chose to use a sorted bar graph. The large number of states justifies the flip in axes. The bar graph uses length to display information which is is visually easy to interpret, while the sorting eliminates having to visually compare non-adjacent bars.

Workforce Distribution per Industry in NY State

Narrowing the exploration of this data to NY state, it is interesting to know how many people are employed by companies in different industries, and how variable the ranges are. To visualize this, I chose the boxplot and the log transform because the it had the nicest spread of data over the plot and allowed for the larger industries not to visually swamp out the smaller ones. Sorting the boxplot by median also helps to makes the pattern in the mean more discernable. While the data may have a nice spread, the reader may have difficulties interpreting the mean and median values relative to the log scale.

ny_state <- inc %>% filter(complete.cases(inc)) %>% filter(State == 'NY') %>% arrange(desc(Employees))

ggplot(data = ny_state, aes(x=reorder(Industry, Employees, FUN=median), y=Employees)) + 
  geom_boxplot(outlier.colour = "red", outlier.shape = NA) +
  stat_summary(fun=mean, geom="point", shape=20, size=2, color="blue", fill="red", show.legend=TRUE) +
  scale_y_continuous(trans='log10', breaks=c(1,5,10,50,100,500,1000,5000)) +
  coord_flip() +
  ggtitle("Distribution of Employee Counts in NY State Industries") +
  theme(panel.background = element_blank(),
        panel.grid.major.x = element_line(colour = "gray"),
        axis.title.y = element_blank(),
        axis.title.x = element_blank()) +
  annotate(geom="text", x=20, y=13000, label="mean",color="black") +
  geom_point(aes(x=20, y=6500), size=1, color="blue")

Industry Revenue per Employee

Another interesting analysis of this NY State data is finding out which industries generate the most revenue per employee. Here, a rotated and sorted bar seemed again to be the most appropriate choice. The bar labels were divided by 10^3 and rounded to make the number more digestible to the user.

ind_rev_per_em <- inc %>% filter(complete.cases(inc)) %>% 
  group_by(Industry) %>% 
  summarise(TotalRevenue = sum(Revenue), TotalEmp = sum(Employees)) %>% 
  mutate(RevPerEmp = TotalRevenue/TotalEmp) %>% 
  arrange(desc(RevPerEmp))

ggplot(data = ind_rev_per_em, aes(x=reorder(Industry, RevPerEmp), y=RevPerEmp)) + 
  geom_bar(stat = 'identity', fill= 'lightblue', width = 0.8) + 
  geom_text(aes(label = paste0("$", round(RevPerEmp/1000,0), 'k')), hjust = -0.1, color='black') +
  #aes(label=ifelse(EUR>100,paste0("€", round(EUR,0)),""),)
  scale_y_continuous(labels=scales::dollar_format()) +
  coord_flip() + 
  ggtitle("Industry Revenue per Employee") +
  theme(
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        axis.title = element_blank(),
        panel.background = element_blank()) +
  expand_limits(y = c(0, 1500000))

Exploring the 5000 Fastest Growing Companies in the US

Mael Illien

9/5/2020