Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:

inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)

And lets preview this data:

#class(inc)
head(inc)
##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX
summary(inc)
##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2502   110 Consulting        :   1   Mean   :  4.612  
##  3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4995                    
##     Revenue                                  Industry      Employees      
##  Min.   :2.000e+06   IT Services                 : 733   Min.   :    1.0  
##  1st Qu.:5.100e+06   Business Products & Services: 482   1st Qu.:   25.0  
##  Median :1.090e+07   Advertising & Marketing     : 471   Median :   53.0  
##  Mean   :4.822e+07   Health                      : 355   Mean   :  232.7  
##  3rd Qu.:2.860e+07   Software                    : 342   3rd Qu.:  132.0  
##  Max.   :1.010e+10   Financial Services          : 260   Max.   :66803.0  
##                      (Other)                     :2358   NA's   :12       
##             City          State     
##  New York     : 160   CA     : 701  
##  Chicago      :  90   TX     : 387  
##  Austin       :  88   NY     : 311  
##  Houston      :  76   VA     : 283  
##  San Francisco:  75   FL     : 282  
##  Atlanta      :  74   IL     : 273  
##  (Other)      :4438   (Other):2764

As can be seen from the above, there are 12 records with missing values for employees, which then impacts the revenue per employee column as well.

#inc=inc[complete.cases(inc),]
#head(inc)
# View the structure of the different data columns in this dataset
str(inc)
## 'data.frame':    5001 obs. of  8 variables:
##  $ Rank       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name       : Factor w/ 5001 levels "(Add)ventures",..: 1770 1633 4423 690 1198 2839 4733 1468 1869 4968 ...
##  $ Growth_Rate: num  421 248 245 233 213 ...
##  $ Revenue    : num  1.18e+08 4.96e+07 2.55e+07 1.90e+09 8.70e+07 ...
##  $ Industry   : Factor w/ 25 levels "Advertising & Marketing",..: 5 12 13 7 1 20 10 1 5 21 ...
##  $ Employees  : int  104 51 132 50 220 63 27 75 97 15 ...
##  $ City       : Factor w/ 1519 levels "Acton","Addison",..: 391 365 635 2 139 66 912 1179 131 1418 ...
##  $ State      : Factor w/ 52 levels "AK","AL","AR",..: 5 47 10 45 20 45 44 5 46 41 ...

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

# Import libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
#library(tidyverse)

options(scipen=999)
# Check the number of unique values for Industry, State and City
cat('The number of unique states in this dataset is:',inc$State%>%nlevels(),'\n')   
## The number of unique states in this dataset is: 52
cat('The number of unique industries in this dataset is:',inc$Industry%>%nlevels(),'\n')
## The number of unique industries in this dataset is: 25
cat('The number of unique cities in this dataset is:',inc$City%>%nlevels(),'\n')
## The number of unique cities in this dataset is: 1519

The growth rate for the top 5000 companies seems to have a very wide range from 0.34% to 421%. On the other hand, the revenue ranges from $2 million to $10.1 billion. It’s easier for smaller companies (by revenue) to grow quickly. The sweet spot would be companies that are not too small and still growing aggressively. Let’s filter the companies that have an annual revenue of $50 million and grew by atleast 100% year-over-year.

kable(inc%>%filter(Growth_Rate>=100&Revenue>=50000000))%>%kable_styling()
Rank Name Growth_Rate Revenue Industry Employees City State
1 Fuhu 421.48 117900000 Consumer Products & Services 104 El Segundo CA
4 Bridger 233.08 1900000000 Energy 50 Addison TX
5 DataXu 213.37 87000000 Advertising & Marketing 220 Boston MA
15 LivingSocial 123.33 536000000 Consumer Products & Services 4100 Washington DC
17 Intelligent Audit 105.73 145000000 Logistics & Transportation 15 Rochelle Park NJ

This results in only 5 companies that meet the criteria.

Now let’s look at the top 10 cities by number of companies. This should give us an indication of the relative business-friendliness of the cities in this dataset.

kable(inc%>%group_by(City)%>%summarise(n=n())%>%arrange(desc(n))%>%top_n(10))%>%kable_styling()
## `summarise()` ungrouping output (override with `.groups` argument)
## Selecting by n
City n
New York 160
Chicago 90
Austin 88
Houston 76
San Francisco 75
Atlanta 74
San Diego 67
Seattle 52
Boston 43
Dallas 42
Denver 42

So NY City has the maximum number of companies followed by Chicago and Austin. But if we want to find which cities are the hub for growth, we need to bring in growth rate into the mix. Let’s examine which are the top 10 cities for the 500 fastest growing companies.

top10_cities<-inc%>%arrange(Growth_Rate)%>%head(500)%>%count(City,sort=TRUE)%>%head(10)
kable(top10_cities)
City n
New York 18
Houston 15
Austin 11
Chicago 9
Salt Lake City 6
Seattle 6
Atlanta 5
Cleveland 5
Dallas 5
Boston 4

So NY retains the top spot - not only is it home to most companies, it is also home to the fastest growing ones.

# Look at the proportion of each industry in the top 5000 companies sorted in descending order
industry.freq<-(as.data.frame(round(100*prop.table(table(inc$Industry)),2)))
industry.freq%>%arrange(desc(industry.freq$Freq))
##                            Var1  Freq
## 1                   IT Services 14.66
## 2  Business Products & Services  9.64
## 3       Advertising & Marketing  9.42
## 4                        Health  7.10
## 5                      Software  6.84
## 6            Financial Services  5.20
## 7                 Manufacturing  5.12
## 8  Consumer Products & Services  4.06
## 9                        Retail  4.06
## 10          Government Services  4.04
## 11              Human Resources  3.92
## 12                 Construction  3.74
## 13   Logistics & Transportation  3.10
## 14              Food & Beverage  2.62
## 15           Telecommunications  2.58
## 16                       Energy  2.18
## 17                  Real Estate  1.92
## 18                    Education  1.66
## 19                  Engineering  1.48
## 20                     Security  1.46
## 21         Travel & Hospitality  1.24
## 22                        Media  1.08
## 23       Environmental Services  1.02
## 24                    Insurance  1.00
## 25            Computer Hardware  0.88
#head(industry.freq)

From the above, it can be seen that IT Services comprises the biggest share of the top 5000 based on number of companies, followed by Business Products & Services. Let’s dig deeper into the industry-specific metrics below.

# Look at Industry-specific metrics such as Average Revenue, Total Employees etc
inc1<-inc%>%group_by(Industry)%>%summarise(NumCompanies=sum(!is.na(Revenue)),Prop_NumCompanies=round(NumCompanies/nrow(.),2), TotalRevenueMln=round(sum(Revenue)/1000000,2), Prop_TotalRev=round(sum(Revenue)/sum(inc$Revenue),2),TotalEmployees=sum(Employees,na.rm=TRUE),AvgRevenueMln=round(mean(Revenue)/1000000,2),AvgNumEmployees=round(mean(Employees,na.rm=TRUE),0),WtdAvgGrowthRate=round(mean((Growth_Rate*Revenue)/sum(Revenue)),4), AvgRevPerEmp=sum(Revenue)/sum(Employees,na.rm=TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
kable(inc1%>%arrange(desc(NumCompanies)))
Industry NumCompanies Prop_NumCompanies TotalRevenueMln Prop_TotalRev TotalEmployees AvgRevenueMln AvgNumEmployees WtdAvgGrowthRate AvgRevPerEmp
IT Services 733 0.15 20681.3 0.09 102788 28.21 140 0.0027 201203.45
Business Products & Services 482 0.10 26367.9 0.11 117357 54.71 244 0.0035 224681.10
Advertising & Marketing 471 0.09 7785.0 0.03 39731 16.53 84 0.0196 195942.71
Health 355 0.07 17863.4 0.07 82430 50.32 233 0.0080 216709.94
Software 342 0.07 8140.6 0.03 51262 23.80 150 0.0116 158803.79
Financial Services 260 0.05 13150.9 0.05 47693 50.58 183 0.0150 275740.67
Manufacturing 256 0.05 12684.0 0.05 43942 49.55 172 0.0066 288653.22
Consumer Products & Services 203 0.04 14956.4 0.06 45464 73.68 224 0.0515 328972.37
Retail 203 0.04 10257.4 0.04 37068 50.53 183 0.0144 276718.46
Government Services 202 0.04 6009.1 0.02 26185 29.75 130 0.0285 229486.35
Human Resources 196 0.04 9246.1 0.04 226980 47.17 1158 0.0100 40735.31
Construction 187 0.04 13174.3 0.05 29099 70.45 156 0.0089 452740.64
Logistics & Transportation 155 0.03 14840.5 0.06 39994 95.75 260 0.0185 371068.16
Food & Beverage 131 0.03 12911.3 0.05 65911 98.56 511 0.0148 195889.91
Telecommunications 129 0.03 7334.4 0.03 30842 56.86 243 0.0138 237805.59
Energy 109 0.02 13771.6 0.06 26437 126.34 243 0.3315 520921.44
Real Estate 96 0.02 2965.7 0.01 18893 30.89 199 0.0773 156973.48
Education 83 0.02 1139.3 0.00 7685 13.73 93 0.0409 148249.84
Engineering 74 0.01 2532.5 0.01 20435 34.22 276 0.0262 123929.53
Security 73 0.01 3812.8 0.02 41059 52.23 562 0.0247 92861.49
Travel & Hospitality 62 0.01 2931.6 0.01 23035 47.28 372 0.0193 127267.20
Media 54 0.01 1742.4 0.01 9532 32.27 177 0.0525 182794.80
Environmental Services 51 0.01 2638.8 0.01 10155 51.74 199 0.0650 259852.29
Insurance 50 0.01 2337.9 0.01 7339 46.76 147 0.0315 318558.39
Computer Hardware 44 0.01 11885.7 0.05 9714 270.13 221 0.0153 1223563.93

While IT services drops to second place in terms of proportion of total revenue across all industries, another interesting aspect that come to light in the table above is the much higher ‘Revenue per Employee’ for Computer Hardware. Instead of taking this at face value, I think this needs more investigation and additional data - it is possible that this number is so high because this industry relies heavily on imports, and therefore the contribution of its supply chain to the revenue is not reflected in the number of employees.

# Look at averages for numerical metrics by State and Industry
inc2<-inc%>%group_by(State,Industry)%>%summarize(AvgRevenueMln=round(mean(Revenue)/1000000,2), AvgEmployees=round(mean(Employees),0), AvgGrowthRate=round(mean(Growth_Rate),2))
## `summarise()` regrouping output by 'State' (override with `.groups` argument)
head(inc2)
## # A tibble: 6 x 5
## # Groups:   State [2]
##   State Industry                     AvgRevenueMln AvgEmployees AvgGrowthRate
##   <fct> <fct>                                <dbl>        <dbl>         <dbl>
## 1 AK    Construction                          4.7            27          9.19
## 2 AK    Energy                              338.           2501          0.42
## 3 AL    Advertising & Marketing              11.7           105          1.33
## 4 AL    Business Products & Services         43.3           118          1.95
## 5 AL    Construction                         19.4            11          0.63
## 6 AL    Consumer Products & Services          2.85           79          2.17

Unlike the growth rate in the previous table, the growth rate shown in the table above is a simple average so it may be less informative than the weighted average growth rate shown in the previous table.

# Look at averages for numerical metrics by City
inc3<-inc%>%group_by(City)%>%summarize(AvgRevenueMln=round(mean(Revenue)/1000000,2), AvgEmployees=round(mean(Employees),0), AvgGrowthRate=round(mean(Growth_Rate),2))
## `summarise()` ungrouping output (override with `.groups` argument)
head(inc3)
## # A tibble: 6 x 4
##   City         AvgRevenueMln AvgEmployees AvgGrowthRate
##   <fct>                <dbl>        <dbl>         <dbl>
## 1 Acton                 6.1            17          0.43
## 2 Addison             183.            153         24   
## 3 Adrian               26.5            31          2.81
## 4 Agoura Hills          7.67          166          2.62
## 5 Aiea                 45.1            80         16.1 
## 6 Akron                66.6           198          2.96

Question 1

Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.

library(ggplot2)
#Group number of companies by state and sort in descending order
state.count<-inc%>%group_by(State)%>%summarize(NumCompanies=sum(!is.na(Revenue)))%>%arrange(desc(NumCompanies))
## `summarise()` ungrouping output (override with `.groups` argument)
state.count
## # A tibble: 52 x 2
##    State NumCompanies
##    <fct>        <int>
##  1 CA             701
##  2 TX             387
##  3 NY             311
##  4 VA             283
##  5 FL             282
##  6 IL             273
##  7 GA             212
##  8 OH             186
##  9 MA             182
## 10 PA             164
## # ... with 42 more rows

The table above shows that California has the highest number of top 5000 companies, which is likely on account of Silicon valley as well as it’s large size compared to the other states.

#Create a horizontal bar chart of number of companies by state
chart<-ggplot(state.count)+geom_bar(aes(reorder(State,NumCompanies),NumCompanies,fill=State),stat='identity')+coord_flip()

chart+scale_y_continuous("Number of Companies")+ggtitle("Distribution of Companies by State")+theme(axis.text.x = element_text(size=8, angle = 90, hjust = 1), axis.text.y = element_text(size=6), axis.title=element_text(size=14,face="bold")) + labs(title = 'Distribution of Companies by State', x = 'State')

Question 2

Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

From the table in question 1, we can see that NY is the state with the 3rd most companies.

# Filter companies to those in NY State
incNY<-inc%>%filter(State=='NY')

# Restrict the rows to those with no missing values
incNY<-incNY[complete.cases(incNY),]

# Calculate mean and median number of employees by industry for those companies in NY State
incNY_Emp_By_Industry<-incNY%>%group_by(Industry)%>%summarise(MeanEmployees=round(mean(Employees),0),MedEmployees=round(median(Employees),0))
## `summarise()` ungrouping output (override with `.groups` argument)
head(incNY_Emp_By_Industry)
## # A tibble: 6 x 3
##   Industry                     MeanEmployees MedEmployees
##   <fct>                                <dbl>        <dbl>
## 1 Advertising & Marketing                 58           38
## 2 Business Products & Services          1492           70
## 3 Computer Hardware                       44           44
## 4 Construction                            61           24
## 5 Consumer Products & Services           626           25
## 6 Education                               60           50
# Graph the boxplot for Employees by Industry in NY State
ggplot(incNY, aes(x=Industry, y=Employees))+geom_boxplot()+scale_y_continuous("Average Employees", trans='log2')+coord_flip()+ggtitle("Employment by Industry in NYS")

We remove the data points with minimum and maximum number of employees to reduce the impact of the skew. Another alternative would have been to remove additional outliers that fall outside the whiskers of the bokplots.

# Remove outliers by excluding the minimum and maximum value for Employees for each industry in New York State
incNY_sub<-incNY%>%group_by(Industry)%>%filter(Employees<max(Employees)&Employees>min(Employees))

# Graph the boxplot without the minimum and maximum value for Employees
ggplot(incNY_sub, aes(x=Industry, y=Employees))+geom_boxplot()+scale_y_continuous("Average Employees", trans='log2')+coord_flip()+ggtitle("Employment by Industry in NYS without outliers")                                       

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

#Calculate Revenue per employee and create new dataframe with this column in descending order
inc$RevenuePerEmployee<-inc$Revenue/inc$Employees
incMod<-inc[complete.cases(inc),]%>%arrange(desc(RevenuePerEmployee))
incMod<-subset(incMod, select = c('Industry','RevenuePerEmployee'))
#head(incMod)

#Calculate mean, median, standard deviation and quartile of Revenue per Employee, grouped by Industry
inc_Rev_Per_Emp<-incMod%>%group_by(Industry)%>%summarise(MeanRevPerEmp=round(mean(RevenuePerEmployee),0),StdDevRevPerEmp=round(sd(RevenuePerEmployee),0),MedRevPerEmp=round(median(RevenuePerEmployee),0),Q25_RevPerEmp=round(quantile(RevenuePerEmployee)[2],0))
## `summarise()` ungrouping output (override with `.groups` argument)
inc_Rev_Per_Emp%>%arrange(desc(inc_Rev_Per_Emp,MeanRevPerEmp))
## # A tibble: 25 x 5
##    Industry             MeanRevPerEmp StdDevRevPerEmp MedRevPerEmp Q25_RevPerEmp
##    <fct>                        <dbl>           <dbl>        <dbl>         <dbl>
##  1 Travel & Hospitality        414788          528632       224405        111976
##  2 Telecommunications          449260          733984       284000        174519
##  3 Software                    225989          731013       155319        110000
##  4 Security                    283391          414090       158744         72143
##  5 Retail                      412555          381430       312755        185908
##  6 Real Estate                 434516          453724       253571        128583
##  7 Media                       307144          246008       261458        168065
##  8 Manufacturing               453524         1058512       231250        157619
##  9 Logistics & Transpo~        794811         1212270       425024        189584
## 10 IT Services                 270494          357580       163915        112308
## # ... with 15 more rows
#head(inc_Rev_Per_Emp)
#Create box plots of revenue per employee by industry 
ggplot(incMod, aes(x=Industry, y=RevenuePerEmployee))+ scale_y_continuous("Revenue per Employee", trans='log2') + geom_boxplot()+coord_flip()+ggtitle("Revenue per Employee by Industry")

chart1<-ggplot(incMod)+geom_bar(aes(reorder(Industry,RevenuePerEmployee),RevenuePerEmployee),stat='identity')+coord_flip()

chart1+scale_y_continuous("Industry")+ggtitle("Revenue Efficiency by Industry")+theme(aspect.ratio = 2.1, axis.text.x = element_text(size=8, angle = 90, hjust = 1), axis.text.y = element_text(size=6), axis.title=element_text(size=8,face="bold")) + labs(title = 'Revenue Efficiency by Industry', x = 'Revenue Per Employee')