I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:
inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)
And lets preview this data:
head(inc)
## Rank Name Growth_Rate Revenue
## 1 1 Fuhu 421.48 1.179e+08
## 2 2 FederalConference.com 248.31 4.960e+07
## 3 3 The HCI Group 245.45 2.550e+07
## 4 4 Bridger 233.08 1.900e+09
## 5 5 DataXu 213.37 8.700e+07
## 6 6 MileStone Community Builders 179.38 4.570e+07
## Industry Employees City State
## 1 Consumer Products & Services 104 El Segundo CA
## 2 Government Services 51 Dumfries VA
## 3 Health 132 Jacksonville FL
## 4 Energy 50 Addison TX
## 5 Advertising & Marketing 220 Boston MA
## 6 Real Estate 63 Austin TX
summary(inc)
## Rank Name Growth_Rate Revenue
## Min. : 1 Length:5001 Min. : 0.340 Min. :2.000e+06
## 1st Qu.:1252 Class :character 1st Qu.: 0.770 1st Qu.:5.100e+06
## Median :2502 Mode :character Median : 1.420 Median :1.090e+07
## Mean :2502 Mean : 4.612 Mean :4.822e+07
## 3rd Qu.:3751 3rd Qu.: 3.290 3rd Qu.:2.860e+07
## Max. :5000 Max. :421.480 Max. :1.010e+10
##
## Industry Employees City State
## Length:5001 Min. : 1.0 Length:5001 Length:5001
## Class :character 1st Qu.: 25.0 Class :character Class :character
## Mode :character Median : 53.0 Mode :character Mode :character
## Mean : 232.7
## 3rd Qu.: 132.0
## Max. :66803.0
## NA's :12
Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:
Environmental Dependancies
library( dplyr )
library( kableExtra )
library( ggplot2 )
library( forcats )
library( scales )
# Insert your code here, create more chunks as necessary
# I find it helpful use dplyr's glimpse() function to view a dataframe
glimpse( inc )
## Rows: 5,001
## Columns: 8
## $ Rank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Name <chr> "Fuhu", "FederalConference.com", "The HCI Group", "Bridger…
## $ Growth_Rate <dbl> 421.48, 248.31, 245.45, 233.08, 213.37, 179.38, 174.04, 17…
## $ Revenue <dbl> 1.179e+08, 4.960e+07, 2.550e+07, 1.900e+09, 8.700e+07, 4.5…
## $ Industry <chr> "Consumer Products & Services", "Government Services", "He…
## $ Employees <int> 104, 51, 132, 50, 220, 63, 27, 75, 97, 15, 149, 165, 250, …
## $ City <chr> "El Segundo", "Dumfries", "Jacksonville", "Addison", "Bost…
## $ State <chr> "CA", "VA", "FL", "TX", "MA", "TX", "TN", "CA", "UT", "RI"…
glimpse organizes the dataframe in a more succinct way than the head function and, in combination with the statistical output of the summary function hints at several ways that we might explore the data further to understand it further:
State-Level Data The following code will group data features by State, although a similar analysis can be performed by city
#Geographic grouping (State) & summarize several features
inc_state <- inc %>%
group_by( State ) %>%
summarise( num_companies = n(),
num_city = n_distinct( City ),
mean_Rev = mean( Revenue ),
mean_Growth = mean( Growth_Rate ),
mean_Rank = mean( Rank ),
mean_Size = mean( Employees, na.rm=TRUE ),
num_Industry = n_distinct( Industry ) )
A full view of state level data sorted in descending order by the number of companies:
inc_state %>%
arrange( desc( num_companies ) ) %>%
kbl() %>%
kable_classic(bootstrap_options = "striped",full_width = F)
| State | num_companies | num_city | mean_Rev | mean_Growth | mean_Rank | mean_Size | num_Industry |
|---|---|---|---|---|---|---|---|
| CA | 701 | 204 | 33463481 | 5.897489 | 2273.867 | 230.31286 | 25 |
| TX | 387 | 67 | 57271835 | 6.024186 | 2468.930 | 235.14249 | 24 |
| NY | 311 | 90 | 58715113 | 4.371158 | 2611.688 | 271.28617 | 25 |
| VA | 283 | 56 | 30627915 | 4.877350 | 2437.286 | 126.03180 | 24 |
| FL | 282 | 102 | 37625177 | 5.846099 | 2366.695 | 217.09574 | 25 |
| IL | 273 | 104 | 121773993 | 3.738755 | 2530.011 | 379.65441 | 25 |
| GA | 212 | 45 | 30510849 | 3.523538 | 2499.193 | 163.72512 | 22 |
| OH | 186 | 79 | 68745161 | 3.557527 | 2739.651 | 204.31183 | 24 |
| MA | 182 | 72 | 33174725 | 5.416648 | 2522.874 | 135.61538 | 24 |
| PA | 164 | 80 | 34568902 | 2.566585 | 2755.872 | 186.45399 | 24 |
| NJ | 158 | 97 | 29574684 | 4.445380 | 2453.924 | 190.89873 | 22 |
| NC | 137 | 46 | 67580292 | 3.511679 | 2535.482 | 271.74074 | 22 |
| CO | 134 | 28 | 31270149 | 4.952015 | 2481.709 | 198.78195 | 23 |
| MD | 131 | 44 | 25193893 | 4.984809 | 2255.901 | 308.69466 | 22 |
| WA | 130 | 30 | 27156923 | 4.002692 | 2447.946 | 135.00775 | 19 |
| MI | 126 | 52 | 61950794 | 2.238571 | 2721.484 | 292.89683 | 22 |
| AZ | 100 | 13 | 55015000 | 4.616700 | 2488.230 | 342.81000 | 22 |
| UT | 95 | 26 | 36038947 | 6.307790 | 2234.516 | 200.29474 | 20 |
| MN | 88 | 37 | 57256818 | 3.821477 | 2496.966 | 210.61364 | 19 |
| TN | 82 | 21 | 35870732 | 4.950366 | 2789.878 | 177.87805 | 20 |
| WI | 79 | 43 | 92362025 | 2.690253 | 2968.519 | 201.92208 | 20 |
| IN | 69 | 30 | 50105797 | 4.788261 | 2150.290 | 184.01449 | 21 |
| MO | 59 | 27 | 45164407 | 2.497288 | 2811.085 | 293.15254 | 18 |
| AL | 51 | 13 | 25907843 | 2.407451 | 2728.137 | 125.35294 | 18 |
| CT | 50 | 34 | 49486000 | 4.994600 | 2495.440 | 139.78000 | 18 |
| OR | 49 | 12 | 28589796 | 3.148367 | 2726.918 | 89.77551 | 16 |
| SC | 48 | 22 | 30993750 | 6.060625 | 2397.167 | 111.41667 | 18 |
| OK | 46 | 12 | 41015217 | 3.097174 | 2582.065 | 151.65217 | 15 |
| DC | 43 | 3 | 76344186 | 8.298139 | 2161.047 | 219.54762 | 13 |
| KY | 40 | 13 | 33232500 | 2.064000 | 2693.725 | 138.60000 | 14 |
| KS | 38 | 16 | 40752632 | 3.628684 | 2737.842 | 229.60526 | 17 |
| LA | 37 | 15 | 56648649 | 1.944595 | 2700.973 | 288.35135 | 13 |
| IA | 28 | 18 | 123142857 | 1.761071 | 3195.964 | 405.14286 | 13 |
| NE | 27 | 7 | 40696296 | 2.078889 | 3343.333 | 141.59259 | 13 |
| NV | 26 | 6 | 19915385 | 2.330769 | 2778.308 | 66.34615 | 11 |
| NH | 24 | 15 | 41691667 | 1.512917 | 2870.708 | 120.41667 | 13 |
| ID | 17 | 12 | 231523529 | 2.645294 | 2561.824 | 342.17647 | 10 |
| DE | 16 | 7 | 42300000 | 2.420000 | 2424.562 | 4284.00000 | 11 |
| RI | 16 | 9 | 46981250 | 16.031250 | 1513.500 | 185.25000 | 12 |
| ME | 13 | 5 | 12476923 | 16.210000 | 1951.615 | 67.61538 | 8 |
| MS | 12 | 8 | 43766667 | 5.642500 | 2641.417 | 460.91667 | 9 |
| ND | 10 | 5 | 18240000 | 1.227000 | 2965.200 | 96.30000 | 8 |
| AR | 9 | 7 | 8333333 | 1.670000 | 3449.222 | 55.11111 | 7 |
| HI | 7 | 4 | 99485714 | 6.792857 | 1453.571 | 88.71429 | 5 |
| VT | 6 | 4 | 46200000 | 1.296667 | 3089.167 | 178.16667 | 6 |
| NM | 5 | 2 | 9640000 | 1.364000 | 2965.000 | 123.40000 | 5 |
| MT | 4 | 2 | 6150000 | 0.762500 | 3835.500 | 418.25000 | 3 |
| SD | 3 | 3 | 5900000 | 1.406667 | 2719.333 | 253.66667 | 3 |
| AK | 2 | 2 | 171500000 | 4.805000 | 2634.500 | 1264.00000 | 2 |
| WV | 2 | 2 | 15650000 | 0.620000 | 4164.500 | 120.00000 | 2 |
| WY | 2 | 2 | 34750000 | 19.145000 | 1998.500 | 53.50000 | 2 |
| PR | 1 | 1 | 2300000 | 1.730000 | 2140.000 | 29.00000 | 1 |
The above table organizes State level data summaries by the number of top 5000 ranked companies (num_companies). Predictably, we see that the larger states (California, CA; Texas, TX; New York, NY) have more companies that made the ranked listing compared to smaller states (e.g. West Virginia, WV; Wyoming, WY). Additionally, we can see that states with more ranked companies also have more represented industries (num_Industry); this can be interpreted as a measure of a state’s economic diversity.
The table above is very dense with information. We can subset the aggregated data to pull out details of interest. For example, the following table organizes the top 10 states by mean Growth Rate
#sort by a feature to rearrange the data
inc_state %>%
select( State, mean_Growth ) %>%
arrange( desc( mean_Growth ) ) %>%
top_n(10) %>%
kbl() %>%
kable_classic(bootstrap_options = "striped",full_width = F)
## Selecting by mean_Growth
| State | mean_Growth |
|---|---|
| WY | 19.145000 |
| ME | 16.210000 |
| RI | 16.031250 |
| DC | 8.298139 |
| HI | 6.792857 |
| UT | 6.307790 |
| SC | 6.060625 |
| TX | 6.024186 |
| CA | 5.897489 |
| FL | 5.846099 |
Industry-Level Data Industry is another interesting categorical variable organize the data by. The following lists the unique Industry categories:
unique( inc$Industry )
## [1] "Consumer Products & Services" "Government Services"
## [3] "Health" "Energy"
## [5] "Advertising & Marketing" "Real Estate"
## [7] "Financial Services" "Retail"
## [9] "Software" "Computer Hardware"
## [11] "Logistics & Transportation" "Food & Beverage"
## [13] "IT Services" "Business Products & Services"
## [15] "Education" "Construction"
## [17] "Manufacturing" "Telecommunications"
## [19] "Security" "Human Resources"
## [21] "Travel & Hospitality" "Media"
## [23] "Environmental Services" "Engineering"
## [25] "Insurance"
Here, the data is aggregated by Industry and several summary metrics are tabulated
#Aggregating by Industry & summarize several features
inc_industry <- inc %>%
group_by( Industry ) %>%
summarise( num_companies = n(),
num_city = n_distinct( City ),
mean_Rev = mean( Revenue ),
mean_Growth = mean( Growth_Rate ),
mean_Rank = mean( Rank ),
mean_Size = mean( Employees, na.rm=TRUE ),
num_State = n_distinct( State ) )
#A full view of industry level data sorted in descending order by the number of companies:
inc_industry %>%
arrange( desc( num_companies ) ) %>%
kbl() %>%
kable_classic(bootstrap_options = "striped",full_width = F)
| Industry | num_companies | num_city | mean_Rev | mean_Growth | mean_Rank | mean_Size | num_State |
|---|---|---|---|---|---|---|---|
| IT Services | 733 | 389 | 28214598 | 3.331815 | 2540.729 | 140.42076 | 44 |
| Business Products & Services | 482 | 293 | 54705187 | 3.518486 | 2878.041 | 244.49375 | 42 |
| Advertising & Marketing | 471 | 252 | 16528662 | 6.225478 | 2358.718 | 84.35456 | 43 |
| Health | 355 | 258 | 50319437 | 4.856394 | 2415.746 | 232.85311 | 44 |
| Software | 342 | 202 | 23802924 | 5.020643 | 2328.196 | 150.32845 | 44 |
| Financial Services | 260 | 191 | 50580385 | 5.435308 | 2352.785 | 183.43462 | 39 |
| Manufacturing | 256 | 224 | 49546875 | 2.295391 | 3000.590 | 172.32157 | 38 |
| Consumer Products & Services | 203 | 152 | 73676847 | 8.776108 | 2193.768 | 223.96059 | 34 |
| Retail | 203 | 164 | 50529064 | 6.184729 | 2233.729 | 182.60099 | 37 |
| Government Services | 202 | 102 | 29748020 | 7.238168 | 2087.866 | 129.62871 | 30 |
| Human Resources | 196 | 130 | 47173980 | 3.300459 | 2426.398 | 1158.06122 | 33 |
| Construction | 187 | 157 | 70450802 | 3.366684 | 2642.241 | 155.60963 | 36 |
| Logistics & Transportation | 155 | 124 | 95745161 | 4.339226 | 2580.284 | 259.70130 | 32 |
| Food & Beverage | 131 | 108 | 98559542 | 3.636565 | 2649.435 | 510.93798 | 32 |
| Telecommunications | 129 | 103 | 56855814 | 2.883721 | 2596.504 | 242.85039 | 28 |
| Energy | 109 | 83 | 126344954 | 9.603303 | 2038.220 | 242.54128 | 31 |
| Real Estate | 96 | 75 | 30892708 | 7.746667 | 2017.740 | 198.87368 | 27 |
| Education | 83 | 68 | 13726506 | 3.642651 | 2758.000 | 92.59036 | 25 |
| Engineering | 74 | 65 | 34222973 | 1.984324 | 2733.095 | 276.14865 | 26 |
| Security | 73 | 63 | 52230137 | 3.388904 | 2494.041 | 562.45205 | 30 |
| Travel & Hospitality | 62 | 52 | 47283871 | 2.353064 | 2906.242 | 371.53226 | 20 |
| Media | 54 | 39 | 32266667 | 4.374074 | 2222.870 | 176.51852 | 21 |
| Environmental Services | 51 | 50 | 51741176 | 2.068039 | 2699.118 | 199.11765 | 22 |
| Insurance | 50 | 49 | 46758000 | 2.008400 | 2899.400 | 146.78000 | 23 |
| Computer Hardware | 44 | 39 | 270129545 | 4.089773 | 3008.068 | 220.77273 | 17 |
This organized view of Industry listed in descending order of num_companies informs on what industries have a dominate presence in the ranking of fastest growing companies. However, we can draw subsets from the aggregate to highlight details of the dataset. The following ranks the top 10 Industry by mean Growth Rate (mean_growth)
inc_industry %>%
select( Industry, mean_Growth ) %>%
arrange( desc( mean_Growth ) ) %>%
top_n(10) %>%
kbl() %>%
kable_classic(bootstrap_options = "striped",full_width = F)
## Selecting by mean_Growth
| Industry | mean_Growth |
|---|---|
| Energy | 9.603303 |
| Consumer Products & Services | 8.776108 |
| Real Estate | 7.746667 |
| Government Services | 7.238168 |
| Advertising & Marketing | 6.225478 |
| Retail | 6.184729 |
| Financial Services | 5.435308 |
| Software | 5.020643 |
| Health | 4.856394 |
| Media | 4.374074 |
Aggregating the data by categorical features provides some informative summaries of the data. However, we may also be interested in drawing information based on one of the numeric features. For example, the following code will rank the 10 top companies by highest Growth Rate:
inc %>%
arrange( Rank ) %>%
top_n( 10, Growth_Rate ) %>%
kbl() %>%
kable_classic(bootstrap_options = "striped",full_width = F)
| Rank | Name | Growth_Rate | Revenue | Industry | Employees | City | State |
|---|---|---|---|---|---|---|---|
| 1 | Fuhu | 421.48 | 1.179e+08 | Consumer Products & Services | 104 | El Segundo | CA |
| 2 | FederalConference.com | 248.31 | 4.960e+07 | Government Services | 51 | Dumfries | VA |
| 3 | The HCI Group | 245.45 | 2.550e+07 | Health | 132 | Jacksonville | FL |
| 4 | Bridger | 233.08 | 1.900e+09 | Energy | 50 | Addison | TX |
| 5 | DataXu | 213.37 | 8.700e+07 | Advertising & Marketing | 220 | Boston | MA |
| 6 | MileStone Community Builders | 179.38 | 4.570e+07 | Real Estate | 63 | Austin | TX |
| 7 | Value Payment Systems | 174.04 | 2.550e+07 | Financial Services | 27 | Nashville | TN |
| 8 | Emerge Digital Group | 170.64 | 2.390e+07 | Advertising & Marketing | 75 | San Francisco | CA |
| 9 | Goal Zero | 169.81 | 3.310e+07 | Consumer Products & Services | 97 | Bluffdale | UT |
| 10 | Yagoozon | 166.89 | 1.860e+07 | Retail | 15 | Warwick | RI |
This view organizes the top 10 companies ranked by Revenue
inc %>%
arrange( Rank ) %>%
top_n( 10, Revenue ) %>%
kbl() %>%
kable_classic(bootstrap_options = "striped",full_width = F)
| Rank | Name | Growth_Rate | Revenue | Industry | Employees | City | State |
|---|---|---|---|---|---|---|---|
| 1397 | EnvisionRxOptions | 2.88 | 2.70e+09 | Health | 625 | Twinsburg | OH |
| 2522 | DLA Piper | 1.41 | 2.40e+09 | Business Products & Services | 4036 | Chicago | IL |
| 3853 | ABC Supply | 0.73 | 4.70e+09 | Construction | 6549 | Beloit | WI |
| 4052 | Kum & Go | 0.65 | 2.80e+09 | Retail | 4589 | West Des Moines | IA |
| 4246 | American Tire Distributors | 0.59 | 3.50e+09 | Consumer Products & Services | 3341 | Huntersville | NC |
| 4716 | Westcon Group | 0.44 | 3.80e+09 | IT Services | 3000 | Tarrytown | NY |
| 4788 | CDW | 0.41 | 1.01e+10 | Computer Hardware | 6800 | Vernon Hills | IL |
| 4802 | Boise Cascade | 0.41 | 2.80e+09 | Construction | 4470 | Boise | ID |
| 4936 | Coty | 0.36 | 4.60e+09 | Consumer Products & Services | 10000 | New York | NY |
| 4997 | Dot Foods | 0.34 | 4.50e+09 | Food & Beverage | 3919 | Mt. Sterling | IL |
It is interesting that none of the top 10 in Growth_Rate are present in the top 10 companies listed by Revenue. Out of curiosity, let’s investigate to see how many of the top 100 Growth_Rate companies rank in the top 500 of Revenue:
num = 500
top_GR <- inc %>%
arrange( Rank ) %>%
top_n( num, Growth_Rate ) %>%
select( Name )
top_Rev <- inc %>%
arrange( Rank ) %>%
top_n( num, Revenue ) %>%
select( Name )
both <- intersect( top_GR, top_Rev )
per_both <- nrow( both )/num*100
paste0( 'Only ', per_both, '% of the top ', num, ' Growth Rate ranked companies feature among the top 100 in Revenue (on the list)' )
## [1] "Only 6.6% of the top 500 Growth Rate ranked companies feature among the top 100 in Revenue (on the list)"
That only 6.6% intersect in the two orderings of the data is an interesting outcome.
Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.
y_lines = c( 200, 400, 600 )
# Answer Question 1 here
inc %>%
group_by( State ) %>% #aggregate by state
count() %>% #count the number of records for each state
ggplot( aes( x= reorder(State,n), y= n ) ) +
geom_col( ) +
coord_flip() +
labs(
title = 'Number of top 5000 companies by State',
x = 'State',
y = '# Companies'
) +
theme_classic() +
geom_hline(yintercept=y_lines, color="white", size=1)
Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.
From visual inspection of the figure above, we see that the state with the 3rd most companies in the data set if New York (NY).
# Answer Question 2 here
backtrans <- function(y){log10(mean(10 ^ y)) }
inc %>%
filter( State == 'NY' ) %>% #filter records for NY state
filter( complete.cases(.) ) %>% #filter only complete cases
select( Industry, Employees ) %>% #select the features Industry & Employees
mutate( Industry = as_factor( Industry ) ) %>% #recast Idustry as a factor
mutate(zRT = scale(Employees)) %>%
filter(between(zRT,-2.5,+2.5)) %>% #filter outliers based on Z-score
ggplot( aes(x = reorder( Industry, Employees, FUN = median ), y = Employees ) ) +
geom_boxplot( alpha= 0.7, outlier.shape = NA ) +
stat_summary(fun.y = backtrans, geom = "point", aes(group=1), colour = "red") +
scale_y_log10() +
coord_flip() +
labs(
title = 'Employment by Industry for New York State',
subtitle = 'descending order by median, mean given in red',
x = 'Industry',
y = 'Employees'
) +
theme_linedraw()
Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.
# Answer Question 3 here
inc %>%
filter( complete.cases(.) ) %>% #filter incomplete cases
mutate( rev_per_emp = Revenue/Employees ) %>% #create a new feature that quantifies revenue per employee
ggplot( aes(x = reorder( Industry, rev_per_emp, FUN = median ), y = rev_per_emp )) +
geom_boxplot( alpha= 0.7, outlier.shape = NA ) +
stat_summary(fun.y = backtrans, geom = "point", aes(group=1), colour = "red") +
scale_y_log10( labels = trans_format("log10", math_format(10^.x)) ) +
coord_flip() +
labs(
title = 'Revenue per Employee by Industry',
subtitle = 'descending order by Industry median, mean given in red',
x = 'Industry',
y = 'Revenue per Employee'
) +
theme_linedraw()