Introduction

The aim is this work is to apply principles of data visualization using the tidyverse and ggplot2 packages while exploring the 5,000 fastest growing companies in the US in the year 2020, as compiled by Inc. magazine.

One of key principles of data visualization is the data ink ratio. The concept behind data ink ratio is that non-data ink (now ink or pixels) is distracting and that the proportion of data ink relative to the total ink required to print/create a graph should be maximized. This is a minimalist approach with a couple keys decorating methods to avoid such as 3D effects, background images and unnecessary axes and grid lines.

Tabular Exploration

Lets preview this data:

While the base R functions can be used to get a quick glance at the data or statistical summaries are useful, the output presents the data in a rather messy way.

##       Rank          Name            Growth_Rate         Revenue         
##  Min.   :   1   Length:5001        Min.   :  0.340   Min.   :2.000e+06  
##  1st Qu.:1252   Class :character   1st Qu.:  0.770   1st Qu.:5.100e+06  
##  Median :2502   Mode  :character   Median :  1.420   Median :1.090e+07  
##  Mean   :2502                      Mean   :  4.612   Mean   :4.822e+07  
##  3rd Qu.:3751                      3rd Qu.:  3.290   3rd Qu.:2.860e+07  
##  Max.   :5000                      Max.   :421.480   Max.   :1.010e+10  
##                                                                         
##    Industry           Employees           City              State          
##  Length:5001        Min.   :    1.0   Length:5001        Length:5001       
##  Class :character   1st Qu.:   25.0   Class :character   Class :character  
##  Mode  :character   Median :   53.0   Mode  :character   Mode  :character  
##                     Mean   :  232.7                                        
##                     3rd Qu.:  132.0                                        
##                     Max.   :66803.0                                        
##                     NA's   :12

The skimr package is used to provide compact and flexible summaries of data. The returned descriptions of the variable types are very helpful in determining how the data was ingested and if any corrections need to be made.

Data summary
Name inc
Number of rows 5001
Number of columns 8
_______________________
Column type frequency:
character 4
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Name 0 1 2 51 0 5001 0
Industry 0 1 5 28 0 25 0
City 0 1 4 22 0 1519 0
State 0 1 2 2 0 52 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Rank 0 1 2501.64 1443.51 1.0e+00 1.252e+03 2.502e+03 3.751e+03 5.0000e+03 ▇▇▇▇▇
Growth_Rate 0 1 4.61 14.12 3.4e-01 7.700e-01 1.420e+00 3.290e+00 4.2148e+02 ▇▁▁▁▁
Revenue 0 1 48222535.49 240542281.14 2.0e+06 5.100e+06 1.090e+07 2.860e+07 1.0100e+10 ▇▁▁▁▁
Employees 12 1 232.72 1353.13 1.0e+00 2.500e+01 5.300e+01 1.320e+02 6.6803e+04 ▇▁▁▁▁

Cleaning up and wrangling this data we can add some more relevant non-visual exploratory information presented in a cleaner fashion and more digestible format. The summaries below help to understand the skew of the data via the total, average and mean columns and also provide an aggregate of the revenue (in billions of $) per state and per industry.

Summary: Top 10 State by Company Count
State CompanyCount TotEmployee AvgEmployee StdEmployee SumRevenue_b
CA 700 161219 230.31 1213.67 23.3646
TX 386 90765 235.14 739.43 22.1543
NY 311 84370 271.29 1916.18 18.2604
VA 283 35667 126.03 263.66 8.6677
FL 282 61221 217.10 960.70 10.6103
IL 272 103266 379.65 1463.61 33.2388
OH 186 38002 204.31 818.26 12.7866
NC 135 36685 271.74 819.37 9.2525
MI 126 36905 292.90 850.47 7.8058
WI 77 15548 201.92 757.54 7.1314
Summary: Industries
Industry CompanyCount TotEmployee AvgEmployee StdEmployee SumRevenue_b
IT Services 732 102788 140.42 392.37 20.5250
Business Products & Services 480 117357 244.49 1519.52 26.3459
Advertising & Marketing 471 39731 84.35 287.40 7.7850
Health 354 82430 232.85 490.96 17.8601
Software 341 51262 150.33 267.57 8.1346
Financial Services 260 47693 183.43 302.83 13.1509
Manufacturing 255 43942 172.32 617.10 12.6036
Consumer Products & Services 203 45464 223.96 1214.94 14.9564
Retail 203 37068 182.60 594.90 10.2574
Government Services 202 26185 129.63 182.64 6.0091
Human Resources 196 226980 1158.06 5474.04 9.2461
Construction 187 29099 155.61 589.23 13.1743
Logistics & Transportation 154 39994 259.70 928.82 14.8378
Food & Beverage 129 65911 510.94 1250.18 12.8125
Telecommunications 127 30842 242.85 919.95 7.2879
Energy 109 26437 242.54 454.36 13.7716
Real Estate 95 18893 198.87 412.58 2.9568
Education 83 7685 92.59 136.42 1.1393
Engineering 74 20435 276.15 1166.04 2.5325
Security 73 41059 562.45 2433.98 3.8128
Travel & Hospitality 62 23035 371.53 900.79 2.9316
Media 54 9532 176.52 502.23 1.7424
Environmental Services 51 10155 199.12 742.18 2.6388
Insurance 50 7339 146.78 412.92 2.3379
Computer Hardware 44 9714 220.77 1016.74 11.8857

Visualizations

Company Distribution by State

Given that there are a lot of states and that this visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (taller than wider), it is important to consider which axis to use in the choice of layout. I chose to use a sorted bar graph. The large number of states justifies the flip in axes. The bar graph uses length to display information which is is visually easy to interpret, while the sorting eliminates having to visually compare non-adjacent bars.

Workforce Distribution per Industry in NY State

Narrowing the exploration of this data to NY state, it is interesting to know how many people are employed by companies in different industries, and how variable the ranges are. To visualize this, I chose the boxplot and the log transform because the it had the nicest spread of data over the plot and allowed for the larger industries not to visually swamp out the smaller ones. Sorting the boxplot by median also helps to makes the pattern in the mean more discernable. While the data may have a nice spread, the reader may have difficulties interpreting the mean and median values relative to the log scale.