Jagdish_Chhabria_DS608

Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:

inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)

And lets preview this data:

#class(inc)
head(inc)

##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX

summary(inc)

##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2502   110 Consulting        :   1   Mean   :  4.612  
##  3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4995                    
##     Revenue                                  Industry      Employees      
##  Min.   :2.000e+06   IT Services                 : 733   Min.   :    1.0  
##  1st Qu.:5.100e+06   Business Products & Services: 482   1st Qu.:   25.0  
##  Median :1.090e+07   Advertising & Marketing     : 471   Median :   53.0  
##  Mean   :4.822e+07   Health                      : 355   Mean   :  232.7  
##  3rd Qu.:2.860e+07   Software                    : 342   3rd Qu.:  132.0  
##  Max.   :1.010e+10   Financial Services          : 260   Max.   :66803.0  
##                      (Other)                     :2358   NA's   :12       
##             City          State     
##  New York     : 160   CA     : 701  
##  Chicago      :  90   TX     : 387  
##  Austin       :  88   NY     : 311  
##  Houston      :  76   VA     : 283  
##  San Francisco:  75   FL     : 282  
##  Atlanta      :  74   IL     : 273  
##  (Other)      :4438   (Other):2764

As can be seen from the above, there are 12 records with missing values for employees, which then impacts the revenue per employee column as well.

#inc=inc[complete.cases(inc),]
#head(inc)

# View the structure of the different data columns in this dataset
str(inc)

## 'data.frame':    5001 obs. of  8 variables:
##  $ Rank       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name       : Factor w/ 5001 levels "(Add)ventures",..: 1770 1633 4423 690 1198 2839 4733 1468 1869 4968 ...
##  $ Growth_Rate: num  421 248 245 233 213 ...
##  $ Revenue    : num  1.18e+08 4.96e+07 2.55e+07 1.90e+09 8.70e+07 ...
##  $ Industry   : Factor w/ 25 levels "Advertising & Marketing",..: 5 12 13 7 1 20 10 1 5 21 ...
##  $ Employees  : int  104 51 132 50 220 63 27 75 97 15 ...
##  $ City       : Factor w/ 1519 levels "Acton","Addison",..: 391 365 635 2 139 66 912 1179 131 1418 ...
##  $ State      : Factor w/ 52 levels "AK","AL","AR",..: 5 47 10 45 20 45 44 5 46 41 ...

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

# Import libraries

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

#library(tidyverse)

options(scipen=999)

# Check the number of unique values for Industry, State and City
cat('The number of unique states in this dataset is:',inc$State%>%nlevels(),'\n')

## The number of unique states in this dataset is: 52

cat('The number of unique industries in this dataset is:',inc$Industry%>%nlevels(),'\n')

## The number of unique industries in this dataset is: 25

cat('The number of unique cities in this dataset is:',inc$City%>%nlevels(),'\n')

## The number of unique cities in this dataset is: 1519

The growth rate for the top 5000 companies seems to have a very wide range from 0.34% to 421%. On the other hand, the revenue ranges from $2 million to $10.1 billion. It’s easier for smaller companies (by revenue) to grow quickly. The sweet spot would be companies that are not too small and still growing aggressively. Let’s filter the companies that have an annual revenue of $50 million and grew by atleast 100% year-over-year.

kable(inc%>%filter(Growth_Rate>=100&Revenue>=50000000))%>%kable_styling()

Rank	Name	Growth_Rate	Revenue	Industry	Employees	City	State
1	Fuhu	421.48	117900000	Consumer Products & Services	104	El Segundo	CA
4	Bridger	233.08	1900000000	Energy	50	Addison	TX
5	DataXu	213.37	87000000	Advertising & Marketing	220	Boston	MA
15	LivingSocial	123.33	536000000	Consumer Products & Services	4100	Washington	DC
17	Intelligent Audit	105.73	145000000	Logistics & Transportation	15	Rochelle Park	NJ

This results in only 5 companies that meet the criteria.

Now let’s look at the top 10 cities by number of companies. This should give us an indication of the relative business-friendliness of the cities in this dataset.

kable(inc%>%group_by(City)%>%summarise(n=n())%>%arrange(desc(n))%>%top_n(10))%>%kable_styling()

## `summarise()` ungrouping output (override with `.groups` argument)

## Selecting by n

City	n
New York	160
Chicago	90
Austin	88
Houston	76
San Francisco	75
Atlanta	74
San Diego	67
Seattle	52
Boston	43
Dallas	42
Denver	42

So NY City has the maximum number of companies followed by Chicago and Austin. But if we want to find which cities are the hub for growth, we need to bring in growth rate into the mix. Let’s examine which are the top 10 cities for the 500 fastest growing companies.

top10_cities<-inc%>%arrange(Growth_Rate)%>%head(500)%>%count(City,sort=TRUE)%>%head(10)
kable(top10_cities)

City	n
New York	18
Houston	15
Austin	11
Chicago	9
Salt Lake City	6
Seattle	6
Atlanta	5
Cleveland	5
Dallas	5
Boston	4

So NY retains the top spot - not only is it home to most companies, it is also home to the fastest growing ones.

# Look at the proportion of each industry in the top 5000 companies sorted in descending order
industry.freq<-(as.data.frame(round(100*prop.table(table(inc$Industry)),2)))
industry.freq%>%arrange(desc(industry.freq$Freq))

##                            Var1  Freq
## 1                   IT Services 14.66
## 2  Business Products & Services  9.64
## 3       Advertising & Marketing  9.42
## 4                        Health  7.10
## 5                      Software  6.84
## 6            Financial Services  5.20
## 7                 Manufacturing  5.12
## 8  Consumer Products & Services  4.06
## 9                        Retail  4.06
## 10          Government Services  4.04
## 11              Human Resources  3.92
## 12                 Construction  3.74
## 13   Logistics & Transportation  3.10
## 14              Food & Beverage  2.62
## 15           Telecommunications  2.58
## 16                       Energy  2.18
## 17                  Real Estate  1.92
## 18                    Education  1.66
## 19                  Engineering  1.48
## 20                     Security  1.46
## 21         Travel & Hospitality  1.24
## 22                        Media  1.08
## 23       Environmental Services  1.02
## 24                    Insurance  1.00
## 25            Computer Hardware  0.88

#head(industry.freq)

From the above, it can be seen that IT Services comprises the biggest share of the top 5000 based on number of companies, followed by Business Products & Services. Let’s dig deeper into the industry-specific metrics below.

# Look at Industry-specific metrics such as Average Revenue, Total Employees etc
inc1<-inc%>%group_by(Industry)%>%summarise(NumCompanies=sum(!is.na(Revenue)),Prop_NumCompanies=round(NumCompanies/nrow(.),2), TotalRevenueMln=round(sum(Revenue)/1000000,2), Prop_TotalRev=round(sum(Revenue)/sum(inc$Revenue),2),TotalEmployees=sum(Employees,na.rm=TRUE),AvgRevenueMln=round(mean(Revenue)/1000000,2),AvgNumEmployees=round(mean(Employees,na.rm=TRUE),0),WtdAvgGrowthRate=round(mean((Growth_Rate*Revenue)/sum(Revenue)),4), AvgRevPerEmp=sum(Revenue)/sum(Employees,na.rm=TRUE))

## `summarise()` ungrouping output (override with `.groups` argument)

kable(inc1%>%arrange(desc(NumCompanies)))

Industry	NumCompanies	Prop_NumCompanies	TotalRevenueMln	Prop_TotalRev	TotalEmployees	AvgRevenueMln	AvgNumEmployees	WtdAvgGrowthRate	AvgRevPerEmp
IT Services	733	0.15	20681.3	0.09	102788	28.21	140	0.0027	201203.45
Business Products & Services	482	0.10	26367.9	0.11	117357	54.71	244	0.0035	224681.10
Advertising & Marketing	471	0.09	7785.0	0.03	39731	16.53	84	0.0196	195942.71
Health	355	0.07	17863.4	0.07	82430	50.32	233	0.0080	216709.94
Software	342	0.07	8140.6	0.03	51262	23.80	150	0.0116	158803.79
Financial Services	260	0.05	13150.9	0.05	47693	50.58	183	0.0150	275740.67
Manufacturing	256	0.05	12684.0	0.05	43942	49.55	172	0.0066	288653.22
Consumer Products & Services	203	0.04	14956.4	0.06	45464	73.68	224	0.0515	328972.37
Retail	203	0.04	10257.4	0.04	37068	50.53	183	0.0144	276718.46
Government Services	202	0.04	6009.1	0.02	26185	29.75	130	0.0285	229486.35
Human Resources	196	0.04	9246.1	0.04	226980	47.17	1158	0.0100	40735.31
Construction	187	0.04	13174.3	0.05	29099	70.45	156	0.0089	452740.64
Logistics & Transportation	155	0.03	14840.5	0.06	39994	95.75	260	0.0185	371068.16
Food & Beverage	131	0.03	12911.3	0.05	65911	98.56	511	0.0148	195889.91
Telecommunications	129	0.03	7334.4	0.03	30842	56.86	243	0.0138	237805.59
Energy	109	0.02	13771.6	0.06	26437	126.34	243	0.3315	520921.44
Real Estate	96	0.02	2965.7	0.01	18893	30.89	199	0.0773	156973.48
Education	83	0.02	1139.3	0.00	7685	13.73	93	0.0409	148249.84
Engineering	74	0.01	2532.5	0.01	20435	34.22	276	0.0262	123929.53
Security	73	0.01	3812.8	0.02	41059	52.23	562	0.0247	92861.49
Travel & Hospitality	62	0.01	2931.6	0.01	23035	47.28	372	0.0193	127267.20
Media	54	0.01	1742.4	0.01	9532	32.27	177	0.0525	182794.80
Environmental Services	51	0.01	2638.8	0.01	10155	51.74	199	0.0650	259852.29
Insurance	50	0.01	2337.9	0.01	7339	46.76	147	0.0315	318558.39
Computer Hardware	44	0.01	11885.7	0.05	9714	270.13	221	0.0153	1223563.93

While IT services drops to second place in terms of proportion of total revenue across all industries, another interesting aspect that come to light in the table above is the much higher ‘Revenue per Employee’ for Computer Hardware. Instead of taking this at face value, I think this needs more investigation and additional data - it is possible that this number is so high because this industry relies heavily on imports, and therefore the contribution of its supply chain to the revenue is not reflected in the number of employees.

# Look at averages for numerical metrics by State and Industry
inc2<-inc%>%group_by(State,Industry)%>%summarize(AvgRevenueMln=round(mean(Revenue)/1000000,2), AvgEmployees=round(mean(Employees),0), AvgGrowthRate=round(mean(Growth_Rate),2))

## `summarise()` regrouping output by 'State' (override with `.groups` argument)

head(inc2)

## # A tibble: 6 x 5
## # Groups:   State [2]
##   State Industry                     AvgRevenueMln AvgEmployees AvgGrowthRate
##   <fct> <fct>                                <dbl>        <dbl>         <dbl>
## 1 AK    Construction                          4.7            27          9.19
## 2 AK    Energy                              338.           2501          0.42
## 3 AL    Advertising & Marketing              11.7           105          1.33
## 4 AL    Business Products & Services         43.3           118          1.95
## 5 AL    Construction                         19.4            11          0.63
## 6 AL    Consumer Products & Services          2.85           79          2.17

Unlike the growth rate in the previous table, the growth rate shown in the table above is a simple average so it may be less informative than the weighted average growth rate shown in the previous table.

# Look at averages for numerical metrics by City
inc3<-inc%>%group_by(City)%>%summarize(AvgRevenueMln=round(mean(Revenue)/1000000,2), AvgEmployees=round(mean(Employees),0), AvgGrowthRate=round(mean(Growth_Rate),2))

## `summarise()` ungrouping output (override with `.groups` argument)

head(inc3)

## # A tibble: 6 x 4
##   City         AvgRevenueMln AvgEmployees AvgGrowthRate
##   <fct>                <dbl>        <dbl>         <dbl>
## 1 Acton                 6.1            17          0.43
## 2 Addison             183.            153         24   
## 3 Adrian               26.5            31          2.81
## 4 Agoura Hills          7.67          166          2.62
## 5 Aiea                 45.1            80         16.1 
## 6 Akron                66.6           198          2.96

Question 1

Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.

library(ggplot2)
#Group number of companies by state and sort in descending order
state.count<-inc%>%group_by(State)%>%summarize(NumCompanies=sum(!is.na(Revenue)))%>%arrange(desc(NumCompanies))

## `summarise()` ungrouping output (override with `.groups` argument)

state.count

## # A tibble: 52 x 2
##    State NumCompanies
##    <fct>        <int>
##  1 CA             701
##  2 TX             387
##  3 NY             311
##  4 VA             283
##  5 FL             282
##  6 IL             273
##  7 GA             212
##  8 OH             186
##  9 MA             182
## 10 PA             164
## # ... with 42 more rows

The table above shows that California has the highest number of top 5000 companies, which is likely on account of Silicon valley as well as it’s large size compared to the other states.

#Create a horizontal bar chart of number of companies by state
chart<-ggplot(state.count)+geom_bar(aes(reorder(State,NumCompanies),NumCompanies,fill=State),stat='identity')+coord_flip()

chart+scale_y_continuous("Number of Companies")+ggtitle("Distribution of Companies by State")+theme(axis.text.x = element_text(size=8, angle = 90, hjust = 1), axis.text.y = element_text(size=6), axis.title=element_text(size=14,face="bold")) + labs(title = 'Distribution of Companies by State', x = 'State')

Question 2

Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

From the table in question 1, we can see that NY is the state with the 3rd most companies.

# Filter companies to those in NY State
incNY<-inc%>%filter(State=='NY')

# Restrict the rows to those with no missing values
incNY<-incNY[complete.cases(incNY),]

# Calculate mean and median number of employees by industry for those companies in NY State
incNY_Emp_By_Industry<-incNY%>%group_by(Industry)%>%summarise(MeanEmployees=round(mean(Employees),0),MedEmployees=round(median(Employees),0))

## `summarise()` ungrouping output (override with `.groups` argument)

head(incNY_Emp_By_Industry)

## # A tibble: 6 x 3
##   Industry                     MeanEmployees MedEmployees
##   <fct>                                <dbl>        <dbl>
## 1 Advertising & Marketing                 58           38
## 2 Business Products & Services          1492           70
## 3 Computer Hardware                       44           44
## 4 Construction                            61           24
## 5 Consumer Products & Services           626           25
## 6 Education                               60           50

# Graph the boxplot for Employees by Industry in NY State
ggplot(incNY, aes(x=Industry, y=Employees))+geom_boxplot()+scale_y_continuous("Average Employees", trans='log2')+coord_flip()+ggtitle("Employment by Industry in NYS")

We remove the data points with minimum and maximum number of employees to reduce the impact of the skew. Another alternative would have been to remove additional outliers that fall outside the whiskers of the bokplots.

# Remove outliers by excluding the minimum and maximum value for Employees for each industry in New York State
incNY_sub<-incNY%>%group_by(Industry)%>%filter(Employees<max(Employees)&Employees>min(Employees))

# Graph the boxplot without the minimum and maximum value for Employees
ggplot(incNY_sub, aes(x=Industry, y=Employees))+geom_boxplot()+scale_y_continuous("Average Employees", trans='log2')+coord_flip()+ggtitle("Employment by Industry in NYS without outliers")

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

#Calculate Revenue per employee and create new dataframe with this column in descending order
inc$RevenuePerEmployee<-inc$Revenue/inc$Employees
incMod<-inc[complete.cases(inc),]%>%arrange(desc(RevenuePerEmployee))
incMod<-subset(incMod, select = c('Industry','RevenuePerEmployee'))
#head(incMod)

#Calculate mean, median, standard deviation and quartile of Revenue per Employee, grouped by Industry
inc_Rev_Per_Emp<-incMod%>%group_by(Industry)%>%summarise(MeanRevPerEmp=round(mean(RevenuePerEmployee),0),StdDevRevPerEmp=round(sd(RevenuePerEmployee),0),MedRevPerEmp=round(median(RevenuePerEmployee),0),Q25_RevPerEmp=round(quantile(RevenuePerEmployee)[2],0))

## `summarise()` ungrouping output (override with `.groups` argument)

inc_Rev_Per_Emp%>%arrange(desc(inc_Rev_Per_Emp,MeanRevPerEmp))

## # A tibble: 25 x 5
##    Industry             MeanRevPerEmp StdDevRevPerEmp MedRevPerEmp Q25_RevPerEmp
##    <fct>                        <dbl>           <dbl>        <dbl>         <dbl>
##  1 Travel & Hospitality        414788          528632       224405        111976
##  2 Telecommunications          449260          733984       284000        174519
##  3 Software                    225989          731013       155319        110000
##  4 Security                    283391          414090       158744         72143
##  5 Retail                      412555          381430       312755        185908
##  6 Real Estate                 434516          453724       253571        128583
##  7 Media                       307144          246008       261458        168065
##  8 Manufacturing               453524         1058512       231250        157619
##  9 Logistics & Transpo~        794811         1212270       425024        189584
## 10 IT Services                 270494          357580       163915        112308
## # ... with 15 more rows

#head(inc_Rev_Per_Emp)

#Create box plots of revenue per employee by industry 
ggplot(incMod, aes(x=Industry, y=RevenuePerEmployee))+ scale_y_continuous("Revenue per Employee", trans='log2') + geom_boxplot()+coord_flip()+ggtitle("Revenue per Employee by Industry")

chart1<-ggplot(incMod)+geom_bar(aes(reorder(Industry,RevenuePerEmployee),RevenuePerEmployee),stat='identity')+coord_flip()

chart1+scale_y_continuous("Industry")+ggtitle("Revenue Efficiency by Industry")+theme(aspect.ratio = 2.1, axis.text.x = element_text(size=8, angle = 90, hjust = 1), axis.text.y = element_text(size=6), axis.title=element_text(size=8,face="bold")) + labs(title = 'Revenue Efficiency by Industry', x = 'Revenue Per Employee')

Jagdish_Chhabria_DS608_HW1

Question 1

Question 2

Question 3