Data 608 Assignment 1

Alexander Ng

09/04/2020

Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:

inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)

And lets preview this data:

head(inc)
##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX
summary(inc)
##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2502   110 Consulting        :   1   Mean   :  4.612  
##  3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4995                    
##     Revenue                                  Industry      Employees      
##  Min.   :2.000e+06   IT Services                 : 733   Min.   :    1.0  
##  1st Qu.:5.100e+06   Business Products & Services: 482   1st Qu.:   25.0  
##  Median :1.090e+07   Advertising & Marketing     : 471   Median :   53.0  
##  Mean   :4.822e+07   Health                      : 355   Mean   :  232.7  
##  3rd Qu.:2.860e+07   Software                    : 342   3rd Qu.:  132.0  
##  Max.   :1.010e+10   Financial Services          : 260   Max.   :66803.0  
##                      (Other)                     :2358   NA's   :12       
##             City          State     
##  New York     : 160   CA     : 701  
##  Chicago      :  90   TX     : 387  
##  Austin       :  88   NY     : 311  
##  Houston      :  76   VA     : 283  
##  San Francisco:  75   FL     : 282  
##  Atlanta      :  74   IL     : 273  
##  (Other)      :4438   (Other):2764

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

Response 0

For reading ease, we display all R code at the end of this document. I first observe below that only 12 companies out of 5001 in the dataset lack the number of employees. Moreover, the summary of those 12 companies shows no systematic pattern by city, state, industry or revenues. This suggests the problem is idiosyncratic. I will eliminate such data points from subsequent study.

My non-visual observations on the data include

  1. There are 52 states including District of Columbia, Puerto Rico.
  2. There are 25 industries.
  3. Number of employees range from 1 to 66803. So firms range from startups to large mature corporations.
  4. There is no obvious way to validate or interprete the growth rate since explanatory notes about calculation methodology are absent.

Supporting calculations are in the appendix.

##       Rank                                    Name    Growth_Rate    
##  Min.   : 183   Carolinas Home Medical Equipment:1   Min.   : 0.350  
##  1st Qu.:1521   Excalibur Exhibits              :1   1st Qu.: 0.670  
##  Median :2470   First Flight Solutions          :1   Median : 1.475  
##  Mean   :2606   Global Communications Group     :1   Mean   : 3.408  
##  3rd Qu.:4012   Heartland Business Systems      :1   3rd Qu.: 2.700  
##  Max.   :4968   Higher Logic                    :1   Max.   :22.320  
##                 (Other)                         :6                   
##     Revenue                                  Industry   Employees  
##  Min.   :  2700000   Business Products & Services:2   Min.   : NA  
##  1st Qu.:  5025000   Food & Beverage             :2   1st Qu.: NA  
##  Median :  9400000   Telecommunications          :2   Median : NA  
##  Mean   : 35408333   Health                      :1   Mean   :NaN  
##  3rd Qu.: 52275000   IT Services                 :1   3rd Qu.: NA  
##  Max.   :156300000   Logistics & Transportation  :1   Max.   : NA  
##                      (Other)                     :3   NA's   :12   
##            City       State  
##  Atlanta     :1   NC     :2  
##  Bellevue    :1   WI     :2  
##  Emerald Isle:1   CA     :1  
##  Englewood   :1   CO     :1  
##  Horsham     :1   DC     :1  
##  houston     :1   GA     :1  
##  (Other)     :6   (Other):4
##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2501   110 Consulting        :   1   Mean   :  4.615  
##  3rd Qu.:3750   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4983                    
##     Revenue                                  Industry      Employees      
##  Min.   :2.000e+06   IT Services                 : 732   Min.   :    1.0  
##  1st Qu.:5.100e+06   Business Products & Services: 480   1st Qu.:   25.0  
##  Median :1.090e+07   Advertising & Marketing     : 471   Median :   53.0  
##  Mean   :4.825e+07   Health                      : 354   Mean   :  232.7  
##  3rd Qu.:2.860e+07   Software                    : 341   3rd Qu.:  132.0  
##  Max.   :1.010e+10   Financial Services          : 260   Max.   :66803.0  
##                      (Other)                     :2351                    
##             City          State     
##  New York     : 160   CA     : 700  
##  Chicago      :  90   TX     : 386  
##  Austin       :  88   NY     : 311  
##  Houston      :  76   VA     : 283  
##  San Francisco:  74   FL     : 282  
##  Atlanta      :  73   IL     : 272  
##  (Other)      :4428   (Other):2755

Question 1

Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.

Response 1

There are 52 states or territories includes in the data sample. This suggests that we use the vertical axis to display the large number of states and the horizontal axis to display the number of firms in each state.

Our first chart below uses ggplot2 to display the number of firms. The default scale of the bar chart labels (for the states) were initially unsatisfactory. The labels were too crowded together along the Y-axis. By adjusting the aspect ratio of the picture and reducing font size, we obtained satisfactory results.

The number of firms drops off rapidly after the first state, California. The number of firms follows some type of statistical power law since the decay in number of firms is clearly nonlinear.

In the above visual, we realize that California, Texas, New York are the top 3 states by number of firms. This ranking is consistent with their greater state populations. On the other hand, West Virginia, Arkansas, Wyoming have small populations.

Question 2

Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

Response 2

New York state clearly has the 3rd most companies in the data set. We obtained this answer from the previous Question. The first R statement below obtains the state with the third most companies (without hardcoding for New York).

Industries in NYS Sorted by Median Employees

Industry numFirms avgEmployees p0 p25 p50 p75 p100
Environmental Services 2 155.0 60 107.5 155.0 202.5 250
Energy 5 129.2 5 90.0 120.0 137.0 294
Financial Services 13 144.3 14 35.0 81.0 198.0 483
Software 13 245.9 15 28.0 80.0 220.0 1271
Business Products & Services 26 1492.5 4 30.5 70.5 332.8 32000
Travel & Hospitality 7 547.7 6 36.0 61.0 707.5 2280
Human Resources 11 437.5 7 21.0 56.0 662.5 2081
Engineering 4 53.5 11 24.5 54.5 83.5 94
IT Services 43 204.1 8 25.5 54.0 183.0 3000
Education 14 59.9 19 24.2 50.5 67.8 200
Health 13 81.8 2 31.0 45.0 121.0 298
Media 11 108.0 4 11.5 45.0 122.5 602
Computer Hardware 1 44.0 44 44.0 44.0 44.0 44
Food & Beverage 9 76.4 5 14.0 41.0 63.0 383
Advertising & Marketing 57 58.4 2 21.0 38.0 65.0 270
Insurance 2 32.5 15 23.8 32.5 41.2 50
Security 4 135.0 25 25.0 32.5 142.5 450
Telecommunications 17 95.4 6 21.0 31.0 139.0 316
Manufacturing 13 73.3 11 16.0 30.0 77.0 307
Consumer Products & Services 17 626.3 5 17.0 25.0 64.0 10000
Construction 6 61.0 10 21.0 24.5 57.2 219
Logistics & Transportation 4 29.5 1 16.8 23.5 36.2 70
Real Estate 4 18.2 7 13.0 18.0 23.2 30
Government Services 1 17.0 17 17.0 17.0 17.0 17
Retail 14 24.8 3 10.2 13.5 20.8 75

In the table below, we also report the outliers by number of employees. The table below shows that 9 companies have over 1000 employees with headquarters in NY State. These outliers have the potential to skew the box plot if maximum number of employees per industry are displayed. However, we choose the suppress them in the boxplot and include them in the table below.

Name Rank Industry Employees City State
Sutherland Global Services 4577 Business Products & Services 32000 Pittsford NY
Coty 4936 Consumer Products & Services 10000 New York NY
Westcon Group 4716 IT Services 3000 Tarrytown NY
Denihan Hospitality Group 3899 Travel & Hospitality 2280 New York NY
TransPerfect 4363 Business Products & Services 2218 New York NY
Sterling Infosystems 1499 Human Resources 2081 New York NY
OpenLink 4465 Software 1271 Uniondale NY
FSO Onsite Outsourcing 3136 Human Resources 1134 New York NY
ReSource Pro 2830 Business Products & Services 1124 New York NY

We conclude there is great variability in the number of employees in each Industry. This variation is not proportional to the median number of employees in the Industry. More research into the cause of this variation may be warranted.

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

Response 3

We first display a table of the relevant summary statistics.

US Industries sorted by Revenue Per Employee

Industry totRev numFirms totEmployees RevPerEmp
Computer Hardware 11886 44 9714 1223564
Energy 13772 109 26437 520921
Construction 13174 187 29099 452741
Logistics & Transportation 14838 154 39994 371001
Consumer Products & Services 14956 203 45464 328972
Insurance 2338 50 7339 318558
Manufacturing 12604 255 43942 286824
Retail 10257 203 37068 276718
Financial Services 13151 260 47693 275741
Environmental Services 2639 51 10155 259852
Telecommunications 7288 127 30842 236298
Government Services 6009 202 26185 229486
Business Products & Services 26346 480 117357 224494
Health 17860 354 82430 216670
IT Services 20525 732 102788 199683
Advertising & Marketing 7785 471 39731 195943
Food & Beverage 12812 129 65911 194391
Media 1742 54 9532 182795
Software 8135 341 51262 158687
Real Estate 2957 95 18893 156502
Education 1139 83 7685 148250
Travel & Hospitality 2932 62 23035 127267
Engineering 2532 74 20435 123930
Security 3813 73 41059 92861
Human Resources 9246 196 226980 40735
Note:
totRev is expressed in millions.

The above table illustrates that Computer Hardware, Energy and Construction rank as the top 3. Computer Hardware is really the outlier compared to other industries as its revenues are more than twice the second ranking industry Energy. Computer Hardware revenues are 1.22 million per employee on average while traditional industries like Financial Services yield about 275 thousand per employee. Industries at the bottom are Human Resources (the worst performing) at 40 thousand per employee and Security at 92 thousand per employee. These rankings are consistent with my expectations on the growth potential of industries.

The above chart shows the majority of industries earn between 150-350 thousand in revenues per employee. Industries that are human labor intensive like Security, Travel and Hospitality and Human Resources make sense to rank at the bottom. Industries which are capital intensive like Computer Hardware and Energy seem to come out on top.

Code

In this section, we provide all the code used in the assignment.

knitr::opts_chunk$set(echo = TRUE)

inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)


head(inc)


summary(inc)

# Insert your code here, create more chunks as necessary
# Libraries that I use are included
library(tidyverse)
library(kableExtra)

# Helps to move code to a single location.
knitr::opts_chunk$set(echo = FALSE, message=FALSE, warning=FALSE) 

summary(inc[complete.cases(inc)==0,])

good_firms = complete.cases(inc)
inc[good_firms,] %>% group_by(Industry)  %>% summarize( sumR = sum(Revenue), numFirms= n(), 
                                                                 totEmployees=sum(Employees) ,
                                                                 RevPerEmp = sumR/totEmployees) -> by_industry

inc[good_firms,] %>% group_by(State)  %>% summarize( sumR = sum(Revenue), 
                                                              numFirms= n(), 
                                                              totEmployees=sum(Employees) ,
                                                              RevPerEmp = sumR/totEmployees) -> by_state

summary(inc[good_firms,] )
# Answer Question 1 here
by_state %>% mutate( State = fct_reorder(State, numFirms) )  %>% 
  ggplot(aes(x=State, y=numFirms)) + 
  geom_bar(stat="identity") + 
  coord_flip() +
  ggtitle("States by Number of Firms in Top 5000") + 
  theme(aspect.ratio = 2.3 , text= element_text(size=7))

# Answer Question 2 here

# Get the 3rd state with the most companies.  Then get the dataframe of all companies 
# matching that State.

inc[complete.cases(inc),] %>% filter(State == slice(arrange(by_state, desc(numFirms)) , 3)$State ) -> state3data

#
# Get the summary statistics for the 3rd state and display them neatly
# --------------------------------------------------
state3data %>%
  group_by(Industry) %>% summarize(numFirms = n(),
                                   avgEmployees=mean(Employees), 
                                   p0=min(Employees),
                                   p25=quantile(Employees, probs=0.25),
                                   p50=quantile(Employees, probs=0.5), 
                                   p75=quantile(Employees, probs=0.75),
                                   p100=max(Employees)) %>% arrange(desc(p50)) %>% kable(digits=c(0,0,1,0,1,1,1,0), caption="Industries in NYS Sorted by Median Employees") 

#
#  Sort the industries by median number of employees and plot as a box plot.
# ---------------------------------------------------------------------------
state3data %>% ggplot(aes(   x=reorder(Industry, Employees, FUN=median), y=Employees)) + geom_boxplot(outlier.shape = NA)  + 
  labs( x="A", title="NY State Employees") +
  coord_cartesian(ylim=c(0,1200)) + theme(legend.position = "none", axis.text.x = element_text(angle=90, vjust=0.5, hjust=1 ) )

#
# Display the companies with very large number of employees.
# ----------------------------------------------------------------
state3data %>% filter( Employees > 1000) %>% arrange(desc(Employees)) %>% 
  select(Name, Rank,  Industry, Employees, City, State) %>%
  kable()



# Answer Question 3 here

by_industry %>% arrange(desc(RevPerEmp)) -> industry_by_revperemp 

industry_by_revperemp %>% mutate(totRev = sumR/1000000) %>% select(Industry, totRev, numFirms, totEmployees, RevPerEmp) %>% kable(digit=c(0,0,0,0,0), caption="US Industries sorted by Revenue Per Employee") %>% footnote(general="totRev is expressed in millions.")


industry_by_revperemp %>% mutate(Industry = fct_reorder(Industry, RevPerEmp)) %>%
  ggplot(aes(x=Industry, y=RevPerEmp)) + 
  geom_bar(stat="identity") + 
  coord_flip() +
  ggtitle("Industries Ranked by Revenues/Employees") + 
  theme(aspect.ratio = 2.3 , text= element_text(size=7))