DATA 608 - Homework 1

R Libraries:

Load the necessary libraries:

library(kableExtra)
library(dplyr)
library(ggplot2)

Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. Let's read this in:

inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)

And let's preview this data:

#head(inc)
head(inc) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")

Rank	Name	Growth_Rate	Revenue	Industry	Employees	City	State
1	Fuhu	421.48	1.179e+08	Consumer Products & Services	104	El Segundo	CA
2	FederalConference.com	248.31	4.960e+07	Government Services	51	Dumfries	VA
3	The HCI Group	245.45	2.550e+07	Health	132	Jacksonville	FL
4	Bridger	233.08	1.900e+09	Energy	50	Addison	TX
5	DataXu	213.37	8.700e+07	Advertising & Marketing	220	Boston	MA
6	MileStone Community Builders	179.38	4.570e+07	Real Estate	63	Austin	TX

# Apply the complete.cases() function to exclude records with missing (NA) values in any column

inc <- inc[complete.cases(inc),]

summary(inc)

##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2501   110 Consulting        :   1   Mean   :  4.615  
##  3rd Qu.:3750   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4983                    
##     Revenue                                  Industry      Employees      
##  Min.   :2.000e+06   IT Services                 : 732   Min.   :    1.0  
##  1st Qu.:5.100e+06   Business Products & Services: 480   1st Qu.:   25.0  
##  Median :1.090e+07   Advertising & Marketing     : 471   Median :   53.0  
##  Mean   :4.825e+07   Health                      : 354   Mean   :  232.7  
##  3rd Qu.:2.860e+07   Software                    : 341   3rd Qu.:  132.0  
##  Max.   :1.010e+10   Financial Services          : 260   Max.   :66803.0  
##                      (Other)                     :2351                    
##             City          State     
##  New York     : 160   CA     : 700  
##  Chicago      :  90   TX     : 386  
##  Austin       :  88   NY     : 311  
##  Houston      :  76   VA     : 283  
##  San Francisco:  74   FL     : 282  
##  Atlanta      :  73   IL     : 272  
##  (Other)      :4428   (Other):2755

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

High-level understanding from the statistical summary:

A. California (CA) has the largest number of fast-growing companies, most likely because of its high concentration of start-ups and tech firms.

B. New York has the highest city-level concentration of fast-growing companies, most likely because it is a major hub for the banking and financial industry.

C. IT Services is the industry with the largest number of fast-growing companies.

D. Employee counts in the data set range from 1 to about 67,000, and revenue ranges from $2M to more than $10B. The data set therefore covers growing companies of all sizes, from start-ups to large corporations.
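The counts behind observations A to C can also be computed directly, since summary() lists only the six most frequent levels of each factor. The following is a minimal sketch, assuming a small made-up data frame (`toy`, a hypothetical stand-in for `inc`) so that it runs on its own; the column names match the data set but the values are invented.

```r
library(dplyr)

# Hypothetical stand-in for `inc`; the values are invented for illustration only
toy <- data.frame(
  State    = c("CA", "CA", "CA", "NY", "NY", "TX"),
  Industry = c("Software", "IT Services", "IT Services",
               "Financial Services", "IT Services", "Energy"),
  stringsAsFactors = FALSE
)

# Number of companies per state, largest first
state_counts <- toy %>% count(State, sort = TRUE)

# Number of companies per industry, largest first
industry_counts <- toy %>% count(Industry, sort = TRUE)

print(state_counts)
print(industry_counts)
```

Replacing `toy` with `inc` should give the full ranking for every state and industry rather than only the top six.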

# Insert your code here, create more chunks as necessary

# Ranking of States based on Mean/Avg. Company Growth Rate
topGrowthStates <- inc %>% group_by(State) %>% summarise(Growth_Rate.mean = mean(Growth_Rate)) %>% mutate(rank = rank(-Growth_Rate.mean)) %>% arrange(rank)


topGrowthStates %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")

State	Growth_Rate.mean	rank
WY	19.145000	1
ME	16.210000	2
RI	16.031250	3
DC	8.439524	4
HI	6.792857	5
UT	6.307790	6
SC	6.060625	7
TX	6.036503	8
CA	5.900229	9
FL	5.846099	10
MS	5.642500	11
MA	5.416648	12
CT	4.994600	13
MD	4.984809	14
CO	4.971955	15
TN	4.950366	16
VA	4.877350	17
AK	4.805000	18
IN	4.788261	19
AZ	4.616700	20
NJ	4.445380	21
NY	4.371158	22
WA	4.020698	23
MN	3.821477	24
IL	3.751213	25
KS	3.628684	26
OH	3.557527	27
GA	3.522607	28
NC	3.393630	29
OR	3.148367	30
OK	3.097174	31
WI	2.739351	32
ID	2.645294	33
PA	2.578159	34
MO	2.497288	35
DE	2.420000	36
AL	2.407451	37
NV	2.330769	38
MI	2.238571	39
NE	2.078889	40
KY	2.064000	41
LA	1.944595	42
IA	1.761071	43
PR	1.730000	44
AR	1.670000	45
NH	1.512917	46
SD	1.406667	47
NM	1.364000	48
VT	1.296667	49
ND	1.227000	50
MT	0.762500	51
WV	0.620000	52

The ranking of states by mean company growth rate shows that although CA has the largest number of growing companies, it ranks only 9th in average growth rate. WY has the highest average growth rate but contributes only 2 companies to the data set.
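Because a mean over very few companies can be driven by a single extreme value, it may help to report the company count and the median next to the mean. The sketch below assumes invented data and hypothetical state codes (`AA`, `BB`); it is not the author's analysis, only an illustration of the idea.

```r
library(dplyr)

# Hypothetical data: "AA" is a small state with one extreme company,
# "BB" is a larger state with similar growth rates throughout
toy <- data.frame(
  State       = c("AA", "AA", "AA", "BB", "BB", "BB", "BB", "BB"),
  Growth_Rate = c(50, 1, 2, 6, 5, 4, 5, 5),
  stringsAsFactors = FALSE
)

growth_by_state <- toy %>%
  group_by(State) %>%
  summarise(
    companies          = n(),                  # sample size behind each average
    Growth_Rate.mean   = mean(Growth_Rate),
    Growth_Rate.median = median(Growth_Rate)   # less sensitive to one extreme value
  ) %>%
  arrange(desc(Growth_Rate.mean))

print(growth_by_state)
```

In this toy example the small state ranks first by mean but last by median, which is the kind of pattern the count column helps to flag.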

# Industry Revenue Share
industryRev <- inc %>% group_by(Industry) %>% summarise(TotalRevenue = sum(Revenue)) %>% mutate(share = TotalRevenue/sum(TotalRevenue)) %>% mutate(rank = rank(-share)) %>% arrange(rank)


industryRev %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")

Industry	TotalRevenue	share	rank
Business Products & Services	26345900000	0.1094390	1
IT Services	20525000000	0.0852594	2
Health	17860100000	0.0741896	3
Consumer Products & Services	14956400000	0.0621278	4
Logistics & Transportation	14837800000	0.0616352	5
Energy	13771600000	0.0572062	6
Construction	13174300000	0.0547251	7
Financial Services	13150900000	0.0546279	8
Food & Beverage	12812500000	0.0532222	9
Manufacturing	12603600000	0.0523544	10
Computer Hardware	11885700000	0.0493723	11
Retail	10257400000	0.0426085	12
Human Resources	9246100000	0.0384076	13
Software	8134600000	0.0337905	14
Advertising & Marketing	7785000000	0.0323383	15
Telecommunications	7287900000	0.0302734	16
Government Services	6009100000	0.0249614	17
Security	3812800000	0.0158381	18
Real Estate	2956800000	0.0122823	19
Travel & Hospitality	2931600000	0.0121777	20
Environmental Services	2638800000	0.0109614	21
Engineering	2532500000	0.0105198	22
Insurance	2337900000	0.0097115	23
Media	1742400000	0.0072378	24
Education	1139300000	0.0047326	25

The table above shows that Business Products & Services has the highest share of revenue among the fastest growing companies, followed by IT Services and Health.

# Industry Employment Share
industryEmployment <- inc %>% group_by(Industry) %>% summarise(TotalEmployees = sum(Employees)) %>% mutate(share = TotalEmployees/sum(TotalEmployees)) %>% mutate(rank = rank(-share)) %>% arrange(rank)


industryEmployment %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")

Industry	TotalEmployees	share	rank
Human Resources	226980	0.1954988	1
Business Products & Services	117357	0.1010801	2
IT Services	102788	0.0885317	3
Health	82430	0.0709973	4
Food & Beverage	65911	0.0567694	5
Software	51262	0.0441522	6
Financial Services	47693	0.0410782	7
Consumer Products & Services	45464	0.0391583	8
Manufacturing	43942	0.0378474	9
Security	41059	0.0353643	10
Logistics & Transportation	39994	0.0344470	11
Advertising & Marketing	39731	0.0342205	12
Retail	37068	0.0319268	13
Telecommunications	30842	0.0265643	14
Construction	29099	0.0250631	15
Energy	26437	0.0227703	16
Government Services	26185	0.0225533	17
Travel & Hospitality	23035	0.0198401	18
Engineering	20435	0.0176008	19
Real Estate	18893	0.0162726	20
Environmental Services	10155	0.0087465	21
Computer Hardware	9714	0.0083667	22
Media	9532	0.0082100	23
Education	7685	0.0066191	24
Insurance	7339	0.0063211	25

From the above table of employee share by industry, we can infer:

Human Resources has the highest share of employees (about 19.5%). This is probably due to a large number of fast-growing recruitment and leadership development companies.
Business Products & Services, IT Services and Health are consistent across both measures: each ranks in the top four for revenue share and for employment share.
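Revenue share and employment share are easier to compare when both are computed in one table. The following is a minimal sketch under the assumption of a tiny invented data frame (`toy`); the column names follow the data set, the numbers do not.

```r
library(dplyr)

# Hypothetical stand-in for `inc`; the values are invented for illustration only
toy <- data.frame(
  Industry  = c("Human Resources", "Human Resources", "IT Services", "Energy"),
  Revenue   = c(2e6, 2e6, 10e6, 6e6),
  Employees = c(500, 300, 150, 50),
  stringsAsFactors = FALSE
)

industry_shares <- toy %>%
  group_by(Industry) %>%
  summarise(
    TotalRevenue   = sum(Revenue),
    TotalEmployees = sum(Employees)
  ) %>%
  mutate(
    revenue_share  = TotalRevenue / sum(TotalRevenue),     # share of all revenue
    employee_share = TotalEmployees / sum(TotalEmployees)  # share of all employees
  ) %>%
  arrange(desc(employee_share))

print(industry_shares)
```

An industry whose employee share is much larger than its revenue share, as Human Resources is in this toy data, is labor-intensive relative to the revenue it generates.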

Question 1

Create a graph that shows the distribution of companies in the dataset by State (i.e., how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (i.e., taller than wide), which should further guide your layout choices.

# Answer Question 1 here
topStates <- inc %>% group_by(State) %>% summarise(compCount = n()) %>% mutate(rank = rank(-compCount)) %>% arrange(rank)

ggplot(topStates, aes(x = reorder(State,compCount), y = compCount)) +
  geom_bar(stat = "identity", position = "dodge", fill = "orange") +
  geom_text(aes(label=compCount), hjust=-0.5, color="black", position = position_dodge(0.9), size=3.5) +
  scale_fill_brewer(palette="Paired") +
  theme(axis.text.x=element_text(angle = 0, vjust = 0.5)) +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("Distribution of Fastest Growing Companies By State") +
  labs(x = "State", y = "No. of Companies") +
  coord_flip()

Question 2

Let's dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function). In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

# Answer Question 2 here
top3rdState <- inc %>% group_by(State) %>% summarise(compCount = n()) %>% mutate(rank = rank(-compCount)) %>% filter(rank == 3)

print(paste0("Top 3rd state in no. of companies: ",top3rdState$State, " with ", top3rdState$compCount," companies."))

## [1] "Top 3rd state in no. of companies: NY with 311 companies."

# Filter data for 3rd state in the rank for highest no. of companies
incTop3rdState <- inc %>% filter(State == toString(top3rdState$State)) %>% filter(Employees < 5000)

ggplot(incTop3rdState, aes(x = factor(Industry), y = Employees)) + 
  geom_boxplot(aes(colour = Industry), width = 0.7)+
  stat_boxplot(geom ='errorbar') +
  ggtitle("Distribution of Companies By Employee Count in NY") +
  ylab("No. of Employees") +
  xlab("Industry") +
  theme(legend.position="bottom") +
  coord_flip()

To deal with outliers, I removed companies with 5,000 or more employees from the New York data before plotting.
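A fixed cutoff is one option; a data-driven alternative is the common 1.5 x IQR rule applied within each industry. The sketch below is an illustration of that alternative on invented data (`toy`), not the method used above, and it flags only unusually large values because employee counts are skewed to the right.

```r
library(dplyr)

# Hypothetical employee counts; the values are invented for illustration only
toy <- data.frame(
  Industry  = c(rep("IT Services", 6), rep("Media", 5)),
  Employees = c(10, 20, 30, 40, 50, 3000, 5, 8, 12, 15, 20),
  stringsAsFactors = FALSE
)

flagged <- toy %>%
  group_by(Industry) %>%
  mutate(
    # Upper fence: third quartile plus 1.5 times the interquartile range
    upper_fence = quantile(Employees, 0.75, names = FALSE) + 1.5 * IQR(Employees),
    is_outlier  = Employees > upper_fence
  ) %>%
  ungroup()

# Keep only the rows at or below their industry's upper fence
trimmed <- flagged %>% filter(!is_outlier)

print(flagged)
```

With this rule the threshold adapts to each industry, so a company that is typical for one industry is not dropped just because it is large compared with another.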

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

# Answer Question 3 here

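# Derive revenue per employee for each company, in millions of dollars
# (note: round(..., 0) rounds each company's value to the nearest whole million)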
revenueEmplRatio <- inc %>% mutate(revEmplRatio = round((Revenue/Employees)/1000000,0))

# Calculate mean of the ratio by industry
revenueEmplRatioMean <- revenueEmplRatio %>% group_by(Industry) %>% summarise(ratio.mean = round(mean(revEmplRatio),3))

ggplot(revenueEmplRatioMean, aes(x = reorder(Industry,ratio.mean), y = ratio.mean)) +
  geom_bar(stat = "identity", position = "dodge", fill = "blue") +
  geom_text(aes(label=paste0("$",ratio.mean,"M")), hjust=-0.1, color="black", position = position_dodge(0.9), size=3.5) +
  scale_fill_brewer(palette="Paired") +
  theme(axis.text.x=element_text(angle = 0, vjust = 0.1)) +
  theme(plot.title = element_text(hjust = 0.1)) +
  ggtitle("National Ranking of Revenue Per Employee Ratio By Industry") +
  labs(x = "Industry", y = "Mean Revenue Per Employee Ratio (in Millions)") +
  coord_flip()

The plot above suggests that the Energy industry has by far the highest mean revenue per employee nationally. However, the histograms below, which mark each industry's mean ratio with a red dashed line, show that the facet for the Energy industry contains quite a few outliers.

# Distribution of the Ratio by Industry
ggplot(revenueEmplRatio, aes(x=revEmplRatio)) + geom_histogram(binwidth=.5, colour="black", fill="white") + 
   # facet_grid(Industry ~.,scales = "free") +
    facet_wrap(Industry ~., scales = "free", ncol = 3) +
    geom_vline(data=revenueEmplRatioMean, aes(xintercept=ratio.mean),
               linetype="dashed", size=1, colour="red")+
    ggtitle("Distribution of Revenue Per Employee Ratio By Industry") +
    xlab("Revenue Per Employee Ratio (in Millions)")
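Since outliers pull the mean upward, the median is a useful companion statistic when ranking industries by revenue per employee. The following is a minimal sketch on invented data (`toy`), assuming the same column names as the data set; it is offered as a cross-check rather than a replacement for the plots above.

```r
library(dplyr)

# Hypothetical stand-in for `inc`; the values are invented for illustration only
toy <- data.frame(
  Industry  = c("Energy", "Energy", "Energy", "Software", "Software", "Software"),
  Revenue   = c(2e6, 3e6, 400e6, 4e6, 6e6, 8e6),
  Employees = c(10, 10, 20, 20, 20, 20),
  stringsAsFactors = FALSE
)

rev_per_employee <- toy %>%
  mutate(revEmplRatio = Revenue / Employees / 1e6) %>%  # millions of dollars per employee
  group_by(Industry) %>%
  summarise(
    ratio.mean   = mean(revEmplRatio),    # sensitive to a single very large company
    ratio.median = median(revEmplRatio)   # reflects the typical company
  )

print(rev_per_employee)
```

In this toy data one very large Energy company lifts the Energy mean far above the Software mean, while the two medians are the same.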