The aim is this work is to apply principles of data visualization using the tidyverse and ggplot2 packages while exploring the 5,000 fastest growing companies in the US in the year 2020, as compiled by Inc. magazine.
One of key principles of data visualization is the data ink ratio. The concept behind data ink ratio is that non-data ink (now ink or pixels) is distracting and that the proportion of data ink relative to the total ink required to print/create a graph should be maximized. This is a minimalist approach with a couple keys decorating methods to avoid such as 3D effects, background images and unnecessary axes and grid lines.
Lets preview this data:
While the base R functions can be used to get a quick glance at the data or statistical summaries are useful, the output presents the data in a rather messy way.
## Rank Name Growth_Rate Revenue
## Min. : 1 Length:5001 Min. : 0.340 Min. :2.000e+06
## 1st Qu.:1252 Class :character 1st Qu.: 0.770 1st Qu.:5.100e+06
## Median :2502 Mode :character Median : 1.420 Median :1.090e+07
## Mean :2502 Mean : 4.612 Mean :4.822e+07
## 3rd Qu.:3751 3rd Qu.: 3.290 3rd Qu.:2.860e+07
## Max. :5000 Max. :421.480 Max. :1.010e+10
##
## Industry Employees City State
## Length:5001 Min. : 1.0 Length:5001 Length:5001
## Class :character 1st Qu.: 25.0 Class :character Class :character
## Mode :character Median : 53.0 Mode :character Mode :character
## Mean : 232.7
## 3rd Qu.: 132.0
## Max. :66803.0
## NA's :12
The skimr package is used to provide compact and flexible summaries of data. The returned descriptions of the variable types are very helpful in determining how the data was ingested and if any corrections need to be made.
| Name | inc |
| Number of rows | 5001 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Name | 0 | 1 | 2 | 51 | 0 | 5001 | 0 |
| Industry | 0 | 1 | 5 | 28 | 0 | 25 | 0 |
| City | 0 | 1 | 4 | 22 | 0 | 1519 | 0 |
| State | 0 | 1 | 2 | 2 | 0 | 52 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Rank | 0 | 1 | 2501.64 | 1443.51 | 1.0e+00 | 1.252e+03 | 2.502e+03 | 3.751e+03 | 5.0000e+03 | ▇▇▇▇▇ |
| Growth_Rate | 0 | 1 | 4.61 | 14.12 | 3.4e-01 | 7.700e-01 | 1.420e+00 | 3.290e+00 | 4.2148e+02 | ▇▁▁▁▁ |
| Revenue | 0 | 1 | 48222535.49 | 240542281.14 | 2.0e+06 | 5.100e+06 | 1.090e+07 | 2.860e+07 | 1.0100e+10 | ▇▁▁▁▁ |
| Employees | 12 | 1 | 232.72 | 1353.13 | 1.0e+00 | 2.500e+01 | 5.300e+01 | 1.320e+02 | 6.6803e+04 | ▇▁▁▁▁ |
Cleaning up and wrangling this data we can add some more relevant non-visual exploratory information presented in a cleaner fashion and more digestible format. The summaries below help to understand the skew of the data via the total, average and mean columns and also provide an aggregate of the revenue (in billions of $) per state and per industry.
#inc$Employees[is.na(inc$Employees)] <- 0 # Replace NAs with 0
inc <- na.omit(inc)
state_data <- inc %>% group_by(State) %>%
summarise(CompanyCount = n(), TotEmployee = sum(Employees), AvgEmployee = round(mean(Employees),2), StdEmployee = round(sd(Employees),2), SumRevenue_b = sum(Revenue)/(10^9)) %>%
arrange(desc(CompanyCount,na.rm=TRUE)) %>% top_n(10)
knitr::kable(state_data, caption = "Summary: Top 10 State by Company Count")| State | CompanyCount | TotEmployee | AvgEmployee | StdEmployee | SumRevenue_b |
|---|---|---|---|---|---|
| CA | 700 | 161219 | 230.31 | 1213.67 | 23.3646 |
| TX | 386 | 90765 | 235.14 | 739.43 | 22.1543 |
| NY | 311 | 84370 | 271.29 | 1916.18 | 18.2604 |
| VA | 283 | 35667 | 126.03 | 263.66 | 8.6677 |
| FL | 282 | 61221 | 217.10 | 960.70 | 10.6103 |
| IL | 272 | 103266 | 379.65 | 1463.61 | 33.2388 |
| OH | 186 | 38002 | 204.31 | 818.26 | 12.7866 |
| NC | 135 | 36685 | 271.74 | 819.37 | 9.2525 |
| MI | 126 | 36905 | 292.90 | 850.47 | 7.8058 |
| WI | 77 | 15548 | 201.92 | 757.54 | 7.1314 |
industry_data <- inc %>% group_by(Industry) %>%
summarise(CompanyCount = n(), TotEmployee = sum(Employees), AvgEmployee = round(mean(Employees),2), StdEmployee = round(sd(Employees),2), SumRevenue_b = sum(Revenue)/(10^9)) %>%
arrange(desc(CompanyCount,na.rm=TRUE))
knitr::kable(industry_data, caption = "Summary: Industries")| Industry | CompanyCount | TotEmployee | AvgEmployee | StdEmployee | SumRevenue_b |
|---|---|---|---|---|---|
| IT Services | 732 | 102788 | 140.42 | 392.37 | 20.5250 |
| Business Products & Services | 480 | 117357 | 244.49 | 1519.52 | 26.3459 |
| Advertising & Marketing | 471 | 39731 | 84.35 | 287.40 | 7.7850 |
| Health | 354 | 82430 | 232.85 | 490.96 | 17.8601 |
| Software | 341 | 51262 | 150.33 | 267.57 | 8.1346 |
| Financial Services | 260 | 47693 | 183.43 | 302.83 | 13.1509 |
| Manufacturing | 255 | 43942 | 172.32 | 617.10 | 12.6036 |
| Consumer Products & Services | 203 | 45464 | 223.96 | 1214.94 | 14.9564 |
| Retail | 203 | 37068 | 182.60 | 594.90 | 10.2574 |
| Government Services | 202 | 26185 | 129.63 | 182.64 | 6.0091 |
| Human Resources | 196 | 226980 | 1158.06 | 5474.04 | 9.2461 |
| Construction | 187 | 29099 | 155.61 | 589.23 | 13.1743 |
| Logistics & Transportation | 154 | 39994 | 259.70 | 928.82 | 14.8378 |
| Food & Beverage | 129 | 65911 | 510.94 | 1250.18 | 12.8125 |
| Telecommunications | 127 | 30842 | 242.85 | 919.95 | 7.2879 |
| Energy | 109 | 26437 | 242.54 | 454.36 | 13.7716 |
| Real Estate | 95 | 18893 | 198.87 | 412.58 | 2.9568 |
| Education | 83 | 7685 | 92.59 | 136.42 | 1.1393 |
| Engineering | 74 | 20435 | 276.15 | 1166.04 | 2.5325 |
| Security | 73 | 41059 | 562.45 | 2433.98 | 3.8128 |
| Travel & Hospitality | 62 | 23035 | 371.53 | 900.79 | 2.9316 |
| Media | 54 | 9532 | 176.52 | 502.23 | 1.7424 |
| Environmental Services | 51 | 10155 | 199.12 | 742.18 | 2.6388 |
| Insurance | 50 | 7339 | 146.78 | 412.92 | 2.3379 |
| Computer Hardware | 44 | 9714 | 220.77 | 1016.74 | 11.8857 |
Given that there are a lot of states and that this visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (taller than wider), it is important to consider which axis to use in the choice of layout. I chose to use a sorted bar graph. The large number of states justifies the flip in axes. The bar graph uses length to display information which is is visually easy to interpret, while the sorting eliminates having to visually compare non-adjacent bars.
Narrowing the exploration of this data to NY state, it is interesting to know how many people are employed by companies in different industries, and how variable the ranges are. To visualize this, I chose the boxplot and the log transform because the it had the nicest spread of data over the plot and allowed for the larger industries not to visually swamp out the smaller ones. Sorting the boxplot by median also helps to makes the pattern in the mean more discernable. While the data may have a nice spread, the reader may have difficulties interpreting the mean and median values relative to the log scale.
ggplot(data = ny_state, aes(x=reorder(Industry, Employees, FUN=median), y=Employees)) +
geom_boxplot(outlier.colour = "red", outlier.shape = NA) +
stat_summary(fun=mean, geom="point", shape=20, size=2, color="blue", fill="red", show.legend=TRUE) +
scale_y_continuous(trans='log10', breaks=c(1,5,10,50,100,500,1000,5000)) +
coord_flip() +
ggtitle("Distribution of Employee Counts in NY State Industries") +
theme(panel.background = element_blank(),
panel.grid.major.x = element_line(colour = "gray"),
axis.title.y = element_blank(),
axis.title.x = element_blank()) +
annotate(geom="text", x=20, y=13000, label="mean",color="black") +
geom_point(aes(x=20, y=6500), size=1, color="blue")Another interesting analysis of this NY State data is finding out which industries generate the most revenue per employee. Here, a rotated and sorted bar seemed again to be the most appropriate choice. The bar labels were divided by 10^3 and rounded to make the number more digestible to the user.
ind_rev_per_em <- inc %>% filter(complete.cases(inc)) %>%
group_by(Industry) %>%
summarise(TotalRevenue = sum(Revenue), TotalEmp = sum(Employees)) %>%
mutate(RevPerEmp = TotalRevenue/TotalEmp) %>%
arrange(desc(RevPerEmp))ggplot(data = ind_rev_per_em, aes(x=reorder(Industry, RevPerEmp), y=RevPerEmp)) +
geom_bar(stat = 'identity', fill= 'lightblue', width = 0.8) +
geom_text(aes(label = paste0("$", round(RevPerEmp/1000,0), 'k')), hjust = -0.1, color='black') +
#aes(label=ifelse(EUR>100,paste0("€", round(EUR,0)),""),)
scale_y_continuous(labels=scales::dollar_format()) +
coord_flip() +
ggtitle("Industry Revenue per Employee") +
theme(
axis.ticks = element_blank(),
axis.text.x = element_blank(),
axis.title = element_blank(),
panel.background = element_blank()) +
expand_limits(y = c(0, 1500000))