I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. Let’s read this in:
And let’s preview this data:
head(inc)
##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX
summary(inc)
##       Rank          Name            Growth_Rate         Revenue
##  Min.   :   1   Length:5001        Min.   :  0.340   Min.   :2.000e+06
##  1st Qu.:1252   Class :character   1st Qu.:  0.770   1st Qu.:5.100e+06
##  Median :2502   Mode  :character   Median :  1.420   Median :1.090e+07
##  Mean   :2502                      Mean   :  4.612   Mean   :4.822e+07
##  3rd Qu.:3751                      3rd Qu.:  3.290   3rd Qu.:2.860e+07
##  Max.   :5000                      Max.   :421.480   Max.   :1.010e+10
##
##    Industry           Employees           City              State
##  Length:5001        Min.   :    1.0   Length:5001        Length:5001
##  Class :character   1st Qu.:   25.0   Class :character   Class :character
##  Mode  :character   Median :   53.0   Mode  :character   Mode  :character
##                     Mean   :  232.7
##                     3rd Qu.:  132.0
##                     Max.   :66803.0
##                     NA's   :12
Think a bit about what these summaries mean. Use the space below to add any further non-visual exploratory information that helps you understand this data:
# Insert your code here, create more chunks as necessary
inc %>%
select(Industry) %>%
table() %>%
sort() %>%
rev()
## .
##                  IT Services Business Products & Services
##                          733                          482
##      Advertising & Marketing                       Health
##                          471                          355
##                     Software           Financial Services
##                          342                          260
##                Manufacturing                       Retail
##                          256                          203
## Consumer Products & Services          Government Services
##                          203                          202
##              Human Resources                 Construction
##                          196                          187
##   Logistics & Transportation              Food & Beverage
##                          155                          131
##           Telecommunications                       Energy
##                          129                          109
##                  Real Estate                    Education
##                           96                           83
##                  Engineering                     Security
##                           74                           73
##         Travel & Hospitality                        Media
##                           62                           54
##       Environmental Services                    Insurance
##                           51                           50
##            Computer Hardware
##                           44
And how are they distributed by State?
inc %>%
select(State) %>%
table() %>%
sort() %>%
rev()
## .
##  CA  TX  NY  VA  FL  IL  GA  OH  MA  PA  NJ  NC  CO  MD  WA  MI  AZ  UT  MN  TN
## 701 387 311 283 282 273 212 186 182 164 158 137 134 131 130 126 100  95  88  82
##  WI  IN  MO  AL  CT  OR  SC  OK  DC  KY  KS  LA  IA  NE  NV  NH  ID  RI  DE  ME
##  79  69  59  51  50  49  48  46  43  40  38  37  28  27  26  24  17  16  16  13
##  MS  ND  AR  HI  VT  NM  MT  SD  WY  WV  AK  PR
##  12  10   9   7   6   5   4   3   2   2   2   1
Which Industry has the highest median growth rate?
inc %>%
group_by(Industry) %>%
summarise(MedianGrowth=median(Growth_Rate)) %>%
arrange(desc(MedianGrowth))
## # A tibble: 25 × 2
##    Industry                     MedianGrowth
##    <chr>                               <dbl>
##  1 Government Services                  2.11
##  2 Energy                               2.08
##  3 Real Estate                          2.07
##  4 Media                                1.94
##  5 Consumer Products & Services         1.82
##  6 Retail                               1.76
##  7 Software                             1.72
##  8 Advertising & Marketing              1.61
##  9 Health                               1.57
## 10 Security                             1.54
## # … with 15 more rows
Possible talking points:
The median growth rates are very similar across industries: roughly 1 to 2 in the units of Growth_Rate (the ten highest span 1.54 to 2.11)
There are huge outliers in all three numeric variables (Growth_Rate, Revenue, Employees)
Companies per state would be more insightful if scaled by something like state population
This is a list of some 5,000 companies, so it might be more useful to weight large companies differently from small ones
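The outlier point can be checked without any plotting. The sketch below simply re-types the mean, median and maximum values printed by summary(inc) above (the data frame name `stats` and its column names are made up for illustration) and compares them. A mean several times larger than the median is a rough sign of a long right tail.

```r
# Values copied by hand from the summary(inc) output shown earlier.
# `stats` and its column names are illustrative, not part of the dataset.
stats <- data.frame(
  variable = c("Growth_Rate", "Revenue", "Employees"),
  mean     = c(4.612, 4.822e7, 232.7),
  median   = c(1.42, 1.09e7, 53),
  max      = c(421.48, 1.01e10, 66803)
)

# How far the mean and the maximum sit above the median.
stats$mean_over_median <- stats$mean / stats$median
stats$max_over_median  <- stats$max / stats$median

print(stats)
```

In all three variables the mean is more than three times the median and the maximum is hundreds of times the median, which is why log scales are a reasonable choice in the plots further down.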
Create a graph that shows the distribution of companies in the dataset by State (i.e. how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (i.e. taller than wide), which should further guide your layout choices.
# Answer Question 1 here
library(ggplot2)
# Use a paper size as aspect ratio
PORTRAIT = 11 / 8.5
# Possibly limit overcrowding of chart
NSTATES = 52
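# Count companies per state; arrange ascending so the largest state lands on top after coord_flip()
states = inc %>%
  count(State, name = 'Freq') %>%
  arrange(Freq) %>%
  tail(NSTATES)
states %>%
  ggplot(aes(x = reorder(State, Freq), y = Freq)) +
  # bar width is a fixed setting, so it belongs in geom_col() rather than aes()
  geom_col(width = .5) +
  coord_flip() +
  ggtitle('Fast-Growth Company Locations') +
  theme_minimal() +
  theme(aspect.ratio = PORTRAIT) +
  xlab('State') +
  ylab('Number of Companies') +
  theme(axis.text.y = element_text(size = 6))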
Depending on the audience and topic, it could be more meaningful to scale these numbers by the total number of companies in each state.
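Total company counts per state are not part of this dataset, so the sketch below illustrates the same normalisation idea with state population as the denominator instead. The company counts are the top five from the table above; the population figures are approximate 2020 census values in millions that I have typed in by hand, so treat them as an outside assumption rather than part of the data.

```r
# Company counts for the five largest states, copied from the table above.
counts <- c(CA = 701, TX = 387, NY = 311, VA = 283, FL = 282)

# ASSUMPTION: approximate 2020 census populations, in millions.
# These are not in the Inc. dataset; swap in an authoritative source for real work.
pop_millions <- c(CA = 39.5, TX = 29.1, NY = 20.2, VA = 8.6, FL = 21.5)

# Companies per million residents, matched by state abbreviation.
per_million <- counts / pop_millions[names(counts)]

# Largest first
print(round(sort(per_million, decreasing = TRUE), 1))
```

Under these assumed populations the ordering changes: California has the most companies in absolute terms, but Virginia comes out ahead once the counts are scaled.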
Let’s dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data; use R’s complete.cases() function). In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.
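As a small illustration of what complete.cases() does, the sketch below uses the six rows shown by head(inc) plus one made-up row ("Hypothetical Co") with a missing Employees value, standing in for the 12 NAs that summary(inc) reports. The data frame name `preview` is also just for illustration.

```r
# Six rows from head(inc), plus one HYPOTHETICAL row with a missing value.
preview <- data.frame(
  Name = c("Fuhu", "FederalConference.com", "The HCI Group", "Bridger",
           "DataXu", "MileStone Community Builders", "Hypothetical Co"),
  Employees = c(104, 51, 132, 50, 220, 63, NA),
  State = c("CA", "VA", "FL", "TX", "MA", "TX", "NY"),
  stringsAsFactors = FALSE
)

# complete.cases() returns TRUE for rows that have no NA in any column.
keep <- complete.cases(preview)
full <- preview[keep, ]

nrow(preview)               # all rows, including the incomplete one
nrow(full)                  # rows that survive the filter
median(preview$Employees)   # NA, because one value is missing
median(full$Employees)      # computed on complete rows only
```

Filtering first means later summaries such as median() do not silently return NA, which is the reason for the filter(complete.cases(.)) step in the code that follows.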
# inspect numbers
inc %>%
filter(State=='NY') %>%
filter(complete.cases(.)) %>%
group_by(Industry) %>%
summarise(MedianEmpl=median(Employees), Companies=n()) %>%
arrange(desc(MedianEmpl))
## # A tibble: 25 × 3
##    Industry                     MedianEmpl Companies
##    <chr>                             <dbl>     <int>
##  1 Environmental Services            155           2
##  2 Energy                            120           5
##  3 Financial Services                 81          13
##  4 Software                           80          13
##  5 Business Products & Services       70.5        26
##  6 Travel & Hospitality               61           7
##  7 Human Resources                    56          11
##  8 Engineering                        54.5         4
##  9 IT Services                        54          43
## 10 Education                          50.5        14
## # … with 15 more rows
# Answer Question 2 here
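inc %>%
  filter(State == 'NY') %>%
  filter(complete.cases(.)) %>%
  # order industries by their median employee count
  ggplot(aes(reorder(Industry, Employees, FUN = median), Employees)) +
  # varwidth = TRUE scales each box width with the square root of its group size
  geom_boxplot(varwidth = TRUE, outlier.size = .7) +
  coord_flip() +
  theme_minimal() +
  xlab('') +
  # log scale keeps the largest employers from flattening the other boxes
  scale_y_log10() +
  labs(title = 'Employees per NY Growth Company, by Industry',
       caption = 'Log-scaled; box width grows with the square root of the number of companies in that industry')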
Notes:
Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.
# Answer Question 3 here
inc %>%
  filter(complete.cases(.)) %>%
  # revenue per employee, computed per company
  mutate(RpE = Revenue / Employees) %>%
  ggplot(aes(reorder(Industry, RpE, FUN = median), RpE)) +
  # varwidth = TRUE scales each box width with the square root of its group size
  geom_boxplot(varwidth = TRUE, outlier.size = .7) +
  coord_flip() +
  theme_minimal() +
  xlab('') +
  scale_y_log10() +
  labs(title = 'Revenue per Employee at each Company, by Industry',
       caption = 'Log-scaled; box width grows with the square root of the number of companies in that industry')
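To make the RpE calculation concrete, here is a minimal base-R sketch that applies it to the six rows shown by head(inc) at the top of this document. Six companies say nothing about whole industries, so this only demonstrates the arithmetic; the data frame name `preview` is made up for illustration.

```r
# The six rows printed by head(inc); nothing here goes beyond that preview.
preview <- data.frame(
  Name = c("Fuhu", "FederalConference.com", "The HCI Group", "Bridger",
           "DataXu", "MileStone Community Builders"),
  Industry = c("Consumer Products & Services", "Government Services", "Health",
               "Energy", "Advertising & Marketing", "Real Estate"),
  Revenue = c(1.179e8, 4.96e7, 2.55e7, 1.9e9, 8.7e7, 4.57e7),
  Employees = c(104, 51, 132, 50, 220, 63),
  stringsAsFactors = FALSE
)

# Revenue per employee for each company.
preview$RpE <- preview$Revenue / preview$Employees

# Sort from highest to lowest revenue per employee.
ranked <- preview[order(-preview$RpE), c("Name", "Industry", "RpE")]
print(ranked)
```

Among these six, Bridger (Energy) stands far above the rest, with 1.9e9 in revenue spread over 50 employees. A single company like that can dominate a mean, which is one reason the chart above orders industries by the median.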
Notes: