Principles of Data Visualization and Introduction to ggplot2
I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:
inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)
And lets preview this data:
#class(inc)
head(inc)
## Rank Name Growth_Rate Revenue
## 1 1 Fuhu 421.48 1.179e+08
## 2 2 FederalConference.com 248.31 4.960e+07
## 3 3 The HCI Group 245.45 2.550e+07
## 4 4 Bridger 233.08 1.900e+09
## 5 5 DataXu 213.37 8.700e+07
## 6 6 MileStone Community Builders 179.38 4.570e+07
## Industry Employees City State
## 1 Consumer Products & Services 104 El Segundo CA
## 2 Government Services 51 Dumfries VA
## 3 Health 132 Jacksonville FL
## 4 Energy 50 Addison TX
## 5 Advertising & Marketing 220 Boston MA
## 6 Real Estate 63 Austin TX
summary(inc)
## Rank Name Growth_Rate
## Min. : 1 (Add)ventures : 1 Min. : 0.340
## 1st Qu.:1252 @Properties : 1 1st Qu.: 0.770
## Median :2502 1-Stop Translation USA: 1 Median : 1.420
## Mean :2502 110 Consulting : 1 Mean : 4.612
## 3rd Qu.:3751 11thStreetCoffee.com : 1 3rd Qu.: 3.290
## Max. :5000 123 Exteriors : 1 Max. :421.480
## (Other) :4995
## Revenue Industry Employees
## Min. :2.000e+06 IT Services : 733 Min. : 1.0
## 1st Qu.:5.100e+06 Business Products & Services: 482 1st Qu.: 25.0
## Median :1.090e+07 Advertising & Marketing : 471 Median : 53.0
## Mean :4.822e+07 Health : 355 Mean : 232.7
## 3rd Qu.:2.860e+07 Software : 342 3rd Qu.: 132.0
## Max. :1.010e+10 Financial Services : 260 Max. :66803.0
## (Other) :2358 NA's :12
## City State
## New York : 160 CA : 701
## Chicago : 90 TX : 387
## Austin : 88 NY : 311
## Houston : 76 VA : 283
## San Francisco: 75 FL : 282
## Atlanta : 74 IL : 273
## (Other) :4438 (Other):2764
As can be seen from the above, there are 12 records with missing values for employees, which then impacts the revenue per employee column as well.
#inc=inc[complete.cases(inc),]
#head(inc)
# View the structure of the different data columns in this dataset
str(inc)
## 'data.frame': 5001 obs. of 8 variables:
## $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : Factor w/ 5001 levels "(Add)ventures",..: 1770 1633 4423 690 1198 2839 4733 1468 1869 4968 ...
## $ Growth_Rate: num 421 248 245 233 213 ...
## $ Revenue : num 1.18e+08 4.96e+07 2.55e+07 1.90e+09 8.70e+07 ...
## $ Industry : Factor w/ 25 levels "Advertising & Marketing",..: 5 12 13 7 1 20 10 1 5 21 ...
## $ Employees : int 104 51 132 50 220 63 27 75 97 15 ...
## $ City : Factor w/ 1519 levels "Acton","Addison",..: 391 365 635 2 139 66 912 1179 131 1418 ...
## $ State : Factor w/ 52 levels "AK","AL","AR",..: 5 47 10 45 20 45 44 5 46 41 ...
Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:
# Import libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
#library(tidyverse)
options(scipen=999)
# Check the number of unique values for Industry, State and City
cat('The number of unique states in this dataset is:',inc$State%>%nlevels(),'\n')
## The number of unique states in this dataset is: 52
cat('The number of unique industries in this dataset is:',inc$Industry%>%nlevels(),'\n')
## The number of unique industries in this dataset is: 25
cat('The number of unique cities in this dataset is:',inc$City%>%nlevels(),'\n')
## The number of unique cities in this dataset is: 1519
The growth rate for the top 5000 companies seems to have a very wide range from 0.34% to 421%. On the other hand, the revenue ranges from $2 million to $10.1 billion. It’s easier for smaller companies (by revenue) to grow quickly. The sweet spot would be companies that are not too small and still growing aggressively. Let’s filter the companies that have an annual revenue of $50 million and grew by atleast 100% year-over-year.
kable(inc%>%filter(Growth_Rate>=100&Revenue>=50000000))%>%kable_styling()
| Rank | Name | Growth_Rate | Revenue | Industry | Employees | City | State |
|---|---|---|---|---|---|---|---|
| 1 | Fuhu | 421.48 | 117900000 | Consumer Products & Services | 104 | El Segundo | CA |
| 4 | Bridger | 233.08 | 1900000000 | Energy | 50 | Addison | TX |
| 5 | DataXu | 213.37 | 87000000 | Advertising & Marketing | 220 | Boston | MA |
| 15 | LivingSocial | 123.33 | 536000000 | Consumer Products & Services | 4100 | Washington | DC |
| 17 | Intelligent Audit | 105.73 | 145000000 | Logistics & Transportation | 15 | Rochelle Park | NJ |
This results in only 5 companies that meet the criteria.
Now let’s look at the top 10 cities by number of companies. This should give us an indication of the relative business-friendliness of the cities in this dataset.
kable(inc%>%group_by(City)%>%summarise(n=n())%>%arrange(desc(n))%>%top_n(10))%>%kable_styling()
## `summarise()` ungrouping output (override with `.groups` argument)
## Selecting by n
| City | n |
|---|---|
| New York | 160 |
| Chicago | 90 |
| Austin | 88 |
| Houston | 76 |
| San Francisco | 75 |
| Atlanta | 74 |
| San Diego | 67 |
| Seattle | 52 |
| Boston | 43 |
| Dallas | 42 |
| Denver | 42 |
So NY City has the maximum number of companies followed by Chicago and Austin. But if we want to find which cities are the hub for growth, we need to bring in growth rate into the mix. Let’s examine which are the top 10 cities for the 500 fastest growing companies.
top10_cities<-inc%>%arrange(Growth_Rate)%>%head(500)%>%count(City,sort=TRUE)%>%head(10)
kable(top10_cities)
| City | n |
|---|---|
| New York | 18 |
| Houston | 15 |
| Austin | 11 |
| Chicago | 9 |
| Salt Lake City | 6 |
| Seattle | 6 |
| Atlanta | 5 |
| Cleveland | 5 |
| Dallas | 5 |
| Boston | 4 |
So NY retains the top spot - not only is it home to most companies, it is also home to the fastest growing ones.
# Look at the proportion of each industry in the top 5000 companies sorted in descending order
industry.freq<-(as.data.frame(round(100*prop.table(table(inc$Industry)),2)))
industry.freq%>%arrange(desc(industry.freq$Freq))
## Var1 Freq
## 1 IT Services 14.66
## 2 Business Products & Services 9.64
## 3 Advertising & Marketing 9.42
## 4 Health 7.10
## 5 Software 6.84
## 6 Financial Services 5.20
## 7 Manufacturing 5.12
## 8 Consumer Products & Services 4.06
## 9 Retail 4.06
## 10 Government Services 4.04
## 11 Human Resources 3.92
## 12 Construction 3.74
## 13 Logistics & Transportation 3.10
## 14 Food & Beverage 2.62
## 15 Telecommunications 2.58
## 16 Energy 2.18
## 17 Real Estate 1.92
## 18 Education 1.66
## 19 Engineering 1.48
## 20 Security 1.46
## 21 Travel & Hospitality 1.24
## 22 Media 1.08
## 23 Environmental Services 1.02
## 24 Insurance 1.00
## 25 Computer Hardware 0.88
#head(industry.freq)
From the above, it can be seen that IT Services comprises the biggest share of the top 5000 based on number of companies, followed by Business Products & Services. Let’s dig deeper into the industry-specific metrics below.
# Look at Industry-specific metrics such as Average Revenue, Total Employees etc
inc1<-inc%>%group_by(Industry)%>%summarise(NumCompanies=sum(!is.na(Revenue)),Prop_NumCompanies=round(NumCompanies/nrow(.),2), TotalRevenueMln=round(sum(Revenue)/1000000,2), Prop_TotalRev=round(sum(Revenue)/sum(inc$Revenue),2),TotalEmployees=sum(Employees,na.rm=TRUE),AvgRevenueMln=round(mean(Revenue)/1000000,2),AvgNumEmployees=round(mean(Employees,na.rm=TRUE),0),WtdAvgGrowthRate=round(mean((Growth_Rate*Revenue)/sum(Revenue)),4), AvgRevPerEmp=sum(Revenue)/sum(Employees,na.rm=TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
kable(inc1%>%arrange(desc(NumCompanies)))
| Industry | NumCompanies | Prop_NumCompanies | TotalRevenueMln | Prop_TotalRev | TotalEmployees | AvgRevenueMln | AvgNumEmployees | WtdAvgGrowthRate | AvgRevPerEmp |
|---|---|---|---|---|---|---|---|---|---|
| IT Services | 733 | 0.15 | 20681.3 | 0.09 | 102788 | 28.21 | 140 | 0.0027 | 201203.45 |
| Business Products & Services | 482 | 0.10 | 26367.9 | 0.11 | 117357 | 54.71 | 244 | 0.0035 | 224681.10 |
| Advertising & Marketing | 471 | 0.09 | 7785.0 | 0.03 | 39731 | 16.53 | 84 | 0.0196 | 195942.71 |
| Health | 355 | 0.07 | 17863.4 | 0.07 | 82430 | 50.32 | 233 | 0.0080 | 216709.94 |
| Software | 342 | 0.07 | 8140.6 | 0.03 | 51262 | 23.80 | 150 | 0.0116 | 158803.79 |
| Financial Services | 260 | 0.05 | 13150.9 | 0.05 | 47693 | 50.58 | 183 | 0.0150 | 275740.67 |
| Manufacturing | 256 | 0.05 | 12684.0 | 0.05 | 43942 | 49.55 | 172 | 0.0066 | 288653.22 |
| Consumer Products & Services | 203 | 0.04 | 14956.4 | 0.06 | 45464 | 73.68 | 224 | 0.0515 | 328972.37 |
| Retail | 203 | 0.04 | 10257.4 | 0.04 | 37068 | 50.53 | 183 | 0.0144 | 276718.46 |
| Government Services | 202 | 0.04 | 6009.1 | 0.02 | 26185 | 29.75 | 130 | 0.0285 | 229486.35 |
| Human Resources | 196 | 0.04 | 9246.1 | 0.04 | 226980 | 47.17 | 1158 | 0.0100 | 40735.31 |
| Construction | 187 | 0.04 | 13174.3 | 0.05 | 29099 | 70.45 | 156 | 0.0089 | 452740.64 |
| Logistics & Transportation | 155 | 0.03 | 14840.5 | 0.06 | 39994 | 95.75 | 260 | 0.0185 | 371068.16 |
| Food & Beverage | 131 | 0.03 | 12911.3 | 0.05 | 65911 | 98.56 | 511 | 0.0148 | 195889.91 |
| Telecommunications | 129 | 0.03 | 7334.4 | 0.03 | 30842 | 56.86 | 243 | 0.0138 | 237805.59 |
| Energy | 109 | 0.02 | 13771.6 | 0.06 | 26437 | 126.34 | 243 | 0.3315 | 520921.44 |
| Real Estate | 96 | 0.02 | 2965.7 | 0.01 | 18893 | 30.89 | 199 | 0.0773 | 156973.48 |
| Education | 83 | 0.02 | 1139.3 | 0.00 | 7685 | 13.73 | 93 | 0.0409 | 148249.84 |
| Engineering | 74 | 0.01 | 2532.5 | 0.01 | 20435 | 34.22 | 276 | 0.0262 | 123929.53 |
| Security | 73 | 0.01 | 3812.8 | 0.02 | 41059 | 52.23 | 562 | 0.0247 | 92861.49 |
| Travel & Hospitality | 62 | 0.01 | 2931.6 | 0.01 | 23035 | 47.28 | 372 | 0.0193 | 127267.20 |
| Media | 54 | 0.01 | 1742.4 | 0.01 | 9532 | 32.27 | 177 | 0.0525 | 182794.80 |
| Environmental Services | 51 | 0.01 | 2638.8 | 0.01 | 10155 | 51.74 | 199 | 0.0650 | 259852.29 |
| Insurance | 50 | 0.01 | 2337.9 | 0.01 | 7339 | 46.76 | 147 | 0.0315 | 318558.39 |
| Computer Hardware | 44 | 0.01 | 11885.7 | 0.05 | 9714 | 270.13 | 221 | 0.0153 | 1223563.93 |
While IT services drops to second place in terms of proportion of total revenue across all industries, another interesting aspect that come to light in the table above is the much higher ‘Revenue per Employee’ for Computer Hardware. Instead of taking this at face value, I think this needs more investigation and additional data - it is possible that this number is so high because this industry relies heavily on imports, and therefore the contribution of its supply chain to the revenue is not reflected in the number of employees.
# Look at averages for numerical metrics by State and Industry
inc2<-inc%>%group_by(State,Industry)%>%summarize(AvgRevenueMln=round(mean(Revenue)/1000000,2), AvgEmployees=round(mean(Employees),0), AvgGrowthRate=round(mean(Growth_Rate),2))
## `summarise()` regrouping output by 'State' (override with `.groups` argument)
head(inc2)
## # A tibble: 6 x 5
## # Groups: State [2]
## State Industry AvgRevenueMln AvgEmployees AvgGrowthRate
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 AK Construction 4.7 27 9.19
## 2 AK Energy 338. 2501 0.42
## 3 AL Advertising & Marketing 11.7 105 1.33
## 4 AL Business Products & Services 43.3 118 1.95
## 5 AL Construction 19.4 11 0.63
## 6 AL Consumer Products & Services 2.85 79 2.17
Unlike the growth rate in the previous table, the growth rate shown in the table above is a simple average so it may be less informative than the weighted average growth rate shown in the previous table.
# Look at averages for numerical metrics by City
inc3<-inc%>%group_by(City)%>%summarize(AvgRevenueMln=round(mean(Revenue)/1000000,2), AvgEmployees=round(mean(Employees),0), AvgGrowthRate=round(mean(Growth_Rate),2))
## `summarise()` ungrouping output (override with `.groups` argument)
head(inc3)
## # A tibble: 6 x 4
## City AvgRevenueMln AvgEmployees AvgGrowthRate
## <fct> <dbl> <dbl> <dbl>
## 1 Acton 6.1 17 0.43
## 2 Addison 183. 153 24
## 3 Adrian 26.5 31 2.81
## 4 Agoura Hills 7.67 166 2.62
## 5 Aiea 45.1 80 16.1
## 6 Akron 66.6 198 2.96
Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.
library(ggplot2)
#Group number of companies by state and sort in descending order
state.count<-inc%>%group_by(State)%>%summarize(NumCompanies=sum(!is.na(Revenue)))%>%arrange(desc(NumCompanies))
## `summarise()` ungrouping output (override with `.groups` argument)
state.count
## # A tibble: 52 x 2
## State NumCompanies
## <fct> <int>
## 1 CA 701
## 2 TX 387
## 3 NY 311
## 4 VA 283
## 5 FL 282
## 6 IL 273
## 7 GA 212
## 8 OH 186
## 9 MA 182
## 10 PA 164
## # ... with 42 more rows
The table above shows that California has the highest number of top 5000 companies, which is likely on account of Silicon valley as well as it’s large size compared to the other states.
#Create a horizontal bar chart of number of companies by state
chart<-ggplot(state.count)+geom_bar(aes(reorder(State,NumCompanies),NumCompanies,fill=State),stat='identity')+coord_flip()
chart+scale_y_continuous("Number of Companies")+ggtitle("Distribution of Companies by State")+theme(axis.text.x = element_text(size=8, angle = 90, hjust = 1), axis.text.y = element_text(size=6), axis.title=element_text(size=14,face="bold")) + labs(title = 'Distribution of Companies by State', x = 'State')
Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.
From the table in question 1, we can see that NY is the state with the 3rd most companies.
# Filter companies to those in NY State
incNY<-inc%>%filter(State=='NY')
# Restrict the rows to those with no missing values
incNY<-incNY[complete.cases(incNY),]
# Calculate mean and median number of employees by industry for those companies in NY State
incNY_Emp_By_Industry<-incNY%>%group_by(Industry)%>%summarise(MeanEmployees=round(mean(Employees),0),MedEmployees=round(median(Employees),0))
## `summarise()` ungrouping output (override with `.groups` argument)
head(incNY_Emp_By_Industry)
## # A tibble: 6 x 3
## Industry MeanEmployees MedEmployees
## <fct> <dbl> <dbl>
## 1 Advertising & Marketing 58 38
## 2 Business Products & Services 1492 70
## 3 Computer Hardware 44 44
## 4 Construction 61 24
## 5 Consumer Products & Services 626 25
## 6 Education 60 50
# Graph the boxplot for Employees by Industry in NY State
ggplot(incNY, aes(x=Industry, y=Employees))+geom_boxplot()+scale_y_continuous("Average Employees", trans='log2')+coord_flip()+ggtitle("Employment by Industry in NYS")
We remove the data points with minimum and maximum number of employees to reduce the impact of the skew. Another alternative would have been to remove additional outliers that fall outside the whiskers of the bokplots.
# Remove outliers by excluding the minimum and maximum value for Employees for each industry in New York State
incNY_sub<-incNY%>%group_by(Industry)%>%filter(Employees<max(Employees)&Employees>min(Employees))
# Graph the boxplot without the minimum and maximum value for Employees
ggplot(incNY_sub, aes(x=Industry, y=Employees))+geom_boxplot()+scale_y_continuous("Average Employees", trans='log2')+coord_flip()+ggtitle("Employment by Industry in NYS without outliers")
Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.
#Calculate Revenue per employee and create new dataframe with this column in descending order
inc$RevenuePerEmployee<-inc$Revenue/inc$Employees
incMod<-inc[complete.cases(inc),]%>%arrange(desc(RevenuePerEmployee))
incMod<-subset(incMod, select = c('Industry','RevenuePerEmployee'))
#head(incMod)
#Calculate mean, median, standard deviation and quartile of Revenue per Employee, grouped by Industry
inc_Rev_Per_Emp<-incMod%>%group_by(Industry)%>%summarise(MeanRevPerEmp=round(mean(RevenuePerEmployee),0),StdDevRevPerEmp=round(sd(RevenuePerEmployee),0),MedRevPerEmp=round(median(RevenuePerEmployee),0),Q25_RevPerEmp=round(quantile(RevenuePerEmployee)[2],0))
## `summarise()` ungrouping output (override with `.groups` argument)
inc_Rev_Per_Emp%>%arrange(desc(inc_Rev_Per_Emp,MeanRevPerEmp))
## # A tibble: 25 x 5
## Industry MeanRevPerEmp StdDevRevPerEmp MedRevPerEmp Q25_RevPerEmp
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Travel & Hospitality 414788 528632 224405 111976
## 2 Telecommunications 449260 733984 284000 174519
## 3 Software 225989 731013 155319 110000
## 4 Security 283391 414090 158744 72143
## 5 Retail 412555 381430 312755 185908
## 6 Real Estate 434516 453724 253571 128583
## 7 Media 307144 246008 261458 168065
## 8 Manufacturing 453524 1058512 231250 157619
## 9 Logistics & Transpo~ 794811 1212270 425024 189584
## 10 IT Services 270494 357580 163915 112308
## # ... with 15 more rows
#head(inc_Rev_Per_Emp)
#Create box plots of revenue per employee by industry
ggplot(incMod, aes(x=Industry, y=RevenuePerEmployee))+ scale_y_continuous("Revenue per Employee", trans='log2') + geom_boxplot()+coord_flip()+ggtitle("Revenue per Employee by Industry")
chart1<-ggplot(incMod)+geom_bar(aes(reorder(Industry,RevenuePerEmployee),RevenuePerEmployee),stat='identity')+coord_flip()
chart1+scale_y_continuous("Industry")+ggtitle("Revenue Efficiency by Industry")+theme(aspect.ratio = 2.1, axis.text.x = element_text(size=8, angle = 90, hjust = 1), axis.text.y = element_text(size=6), axis.title=element_text(size=8,face="bold")) + labs(title = 'Revenue Efficiency by Industry', x = 'Revenue Per Employee')