Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:

inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)

And lets preview this data:

head(inc)
##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX
summary(inc)
##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2502   110 Consulting        :   1   Mean   :  4.612  
##  3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4995                    
##     Revenue                                  Industry      Employees      
##  Min.   :2.000e+06   IT Services                 : 733   Min.   :    1.0  
##  1st Qu.:5.100e+06   Business Products & Services: 482   1st Qu.:   25.0  
##  Median :1.090e+07   Advertising & Marketing     : 471   Median :   53.0  
##  Mean   :4.822e+07   Health                      : 355   Mean   :  232.7  
##  3rd Qu.:2.860e+07   Software                    : 342   3rd Qu.:  132.0  
##  Max.   :1.010e+10   Financial Services          : 260   Max.   :66803.0  
##                      (Other)                     :2358   NA's   :12       
##             City          State     
##  New York     : 160   CA     : 701  
##  Chicago      :  90   TX     : 387  
##  Austin       :  88   NY     : 311  
##  Houston      :  76   VA     : 283  
##  San Francisco:  75   FL     : 282  
##  Atlanta      :  74   IL     : 273  
##  (Other)      :4438   (Other):2764

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(kableExtra)
library(ggplot2)

We can start by quickly looking at what industries have most of the fast growing companies. As we can see below, by far that is the IT sector, with several categories of service companies coming second but fairly far behind. We also see there are 25 Industry sectors in the dataset.

ind<-data.frame(inc %>% group_by(Industry) %>% tally()) %>% arrange(desc(n))
ind %>% kable() %>% kable_styling() %>% scroll_box(height = "500px")
Industry n
IT Services 733
Business Products & Services 482
Advertising & Marketing 471
Health 355
Software 342
Financial Services 260
Manufacturing 256
Consumer Products & Services 203
Retail 203
Government Services 202
Human Resources 196
Construction 187
Logistics & Transportation 155
Food & Beverage 131
Telecommunications 129
Energy 109
Real Estate 96
Education 83
Engineering 74
Security 73
Travel & Hospitality 62
Media 54
Environmental Services 51
Insurance 50
Computer Hardware 44

But out of the 5,000 companies IT Services is a somewhat low percentage. So no sector really dominating growth.

ind[which(ind$Industry=="IT Services"),]$n/nrow(inc)*100
## [1] 14.65707

We can do the same analysis for States, to see if a State in particular is where most of the growth is happening.

sta<-data.frame(inc %>% group_by(State) %>% tally()) %>% arrange(desc(n))
sta %>% kable() %>% kable_styling() %>% scroll_box(height = "500px")
State n
CA 701
TX 387
NY 311
VA 283
FL 282
IL 273
GA 212
OH 186
MA 182
PA 164
NJ 158
NC 137
CO 134
MD 131
WA 130
MI 126
AZ 100
UT 95
MN 88
TN 82
WI 79
IN 69
MO 59
AL 51
CT 50
OR 49
SC 48
OK 46
DC 43
KY 40
KS 38
LA 37
IA 28
NE 27
NV 26
NH 24
ID 17
DE 16
RI 16
ME 13
MS 12
ND 10
AR 9
HI 7
VT 6
NM 5
MT 4
SD 3
AK 2
WV 2
WY 2
PR 1

Here California comes on top a fair bit. Glad to see my home state of Texas coming second, although a fair bit behind.

sta[which(sta$State=="CA"),]$n/nrow(inc)*100
## [1] 14.0172
sta[which(sta$State=="TX"),]$n/nrow(inc)*100
## [1] 7.738452

We can actually quickly look at the proportion of all states.

-sort(desc(prop.table(table(inc$State)))) %>% kable() %>% kable_styling() %>% scroll_box(height = "500px")
x
CA 0.14017197
TX 0.07738452
NY 0.06218756
VA 0.05658868
FL 0.05638872
IL 0.05458908
GA 0.04239152
OH 0.03719256
MA 0.03639272
PA 0.03279344
NJ 0.03159368
NC 0.02739452
CO 0.02679464
MD 0.02619476
WA 0.02599480
MI 0.02519496
AZ 0.01999600
UT 0.01899620
MN 0.01759648
TN 0.01639672
WI 0.01579684
IN 0.01379724
MO 0.01179764
AL 0.01019796
CT 0.00999800
OR 0.00979804
SC 0.00959808
OK 0.00919816
DC 0.00859828
KY 0.00799840
KS 0.00759848
LA 0.00739852
IA 0.00559888
NE 0.00539892
NV 0.00519896
NH 0.00479904
ID 0.00339932
DE 0.00319936
RI 0.00319936
ME 0.00259948
MS 0.00239952
ND 0.00199960
AR 0.00179964
HI 0.00139972
VT 0.00119976
NM 0.00099980
MT 0.00079984
SD 0.00059988
AK 0.00039992
WV 0.00039992
WY 0.00039992
PR 0.00019996

We can also ask specifically what are the top 20 faster growing companies.

inc %>% arrange(desc(Growth_Rate)) %>% head(20) %>% kable() %>% kable_styling() %>% scroll_box(height = "500px")
Rank Name Growth_Rate Revenue Industry Employees City State
1 Fuhu 421.48 1.179e+08 Consumer Products & Services 104 El Segundo CA
2 FederalConference.com 248.31 4.960e+07 Government Services 51 Dumfries VA
3 The HCI Group 245.45 2.550e+07 Health 132 Jacksonville FL
4 Bridger 233.08 1.900e+09 Energy 50 Addison TX
5 DataXu 213.37 8.700e+07 Advertising & Marketing 220 Boston MA
6 MileStone Community Builders 179.38 4.570e+07 Real Estate 63 Austin TX
7 Value Payment Systems 174.04 2.550e+07 Financial Services 27 Nashville TN
8 Emerge Digital Group 170.64 2.390e+07 Advertising & Marketing 75 San Francisco CA
9 Goal Zero 169.81 3.310e+07 Consumer Products & Services 97 Bluffdale UT
10 Yagoozon 166.89 1.860e+07 Retail 15 Warwick RI
11 OBXtek 164.33 2.960e+07 Government Services 149 Tysons Corner VA
12 AdRoll 150.65 3.410e+07 Advertising & Marketing 165 San Francisco CA
13 uBreakiFix 141.02 1.700e+07 Retail 250 Orlando FL
14 Sparc 128.63 2.110e+07 Software 160 Charleston SC
15 LivingSocial 123.33 5.360e+08 Consumer Products & Services 4100 Washington DC
16 Amped Wireless 110.68 1.430e+07 Computer Hardware 26 Chino CA
17 Intelligent Audit 105.73 1.450e+08 Logistics & Transportation 15 Rochelle Park NJ
18 Integrity Funding 104.62 1.110e+07 Financial Services 11 Sarasota FL
19 Vertex Body Sciences 100.10 1.180e+07 Food & Beverage 51 columbus OH
20 BlueKai 92.45 2.680e+07 Advertising & Marketing 107 Cupertino CA

Finally we can ask which companies are growing quicker, small or large. If we define a small company as having less than 100 employees, medium size between 100 and 1000, and large more than 1000, we find most fast growing companies are small.

b <- c(-Inf, 100, 1000, Inf)
names <- c("Small", "Medium", "Large")
inc$bin <- cut(inc$Employees, breaks = b, labels = names)
inc %>% group_by(bin) %>% tally() %>% kable() %>% kable_styling() 
bin n
Small 3456
Medium 1365
Large 168
NA 12

Interestingly we stumble upon some companies which do not report number of employees, something that was shown in the summary of the data frame.

inc[is.na(inc$Employees),] %>% kable() %>% kable_styling()
Rank Name Growth_Rate Revenue Industry Employees City State bin
183 183 First Flight Solutions 22.32 2700000 Logistics & Transportation NA Emerald Isle NC NA
1063 1064 Popchips 3.98 93300000 Food & Beverage NA San Francisco CA NA
1123 1124 Vocalocity 3.72 42900000 Telecommunications NA Atlanta GA NA
1652 1653 Higher Logic 2.36 6000000 Software NA Washington DC NA
1685 1686 Global Communications Group 2.30 3600000 Telecommunications NA Englewood CO NA
2196 2197 JeffreyM Consulting 1.68 12100000 Business Products & Services NA Bellevue WA NA
2742 2743 Excalibur Exhibits 1.27 9900000 Business Products & Services NA houston TX NA
3000 3001 Heartland Business Systems 1.12 156300000 IT Services NA Little Chute WI NA
3978 3978 SSEC 0.68 80400000 Manufacturing NA Horsham PA NA
4112 4112 Carolinas Home Medical Equipment 0.64 3300000 Health NA Matthews NC NA
4566 4566 Oakbrook 0.48 8900000 Real Estate NA Madison WI NA
4968 4968 Popcorn Palace 0.35 5500000 Food & Beverage NA Schiller Park IL NA
nrow(inc[is.na(inc$Employees),])
## [1] 12

Question 1

Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.

sta$State <- factor(sta$State, levels = sta$State[order(sta$n)])

ggplot(sta, aes(State,n)) + geom_bar(stat="identity", fill='grey52') + coord_flip() + ylab("Number of Companies") + theme_minimal() + ggtitle("Distribution of Fastest Growing Companies Per State") + theme(plot.title = element_text(hjust = 0.5)) + theme(
  panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "orchid"), 
  panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray100"),
  panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray95")
  )

This plot is hard to read, other than to know overall which states have the most fast growing companies. For a more detailed look we can plot just the top 5 states.

ggplot(head(sta,5), aes(State,n)) + geom_bar(stat="identity", fill='grey52') + coord_flip() + ylab("Number of Companies") + theme_minimal() + ggtitle("Distribution of Fastest Growing Companies Per State") + theme(plot.title = element_text(hjust = 0.5)) + geom_text(aes(label=n), vjust=0.3, hjust=1.6, color="white", size=3.5) + theme(
  panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "orchid"), 
  panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray100"),
  panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray95") 
  )

Quesiton 2

Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

ny <- inc %>% filter(State == 'NY') %>% filter(complete.cases(.)) %>% group_by(Industry) %>% summarize(Average=mean(Employees), Median=median(Employees)) 
ny %>% kable() %>% kable_styling() %>% scroll_box(height = "500px")
Industry Average Median
Advertising & Marketing 58.43860 38.0
Business Products & Services 1492.46154 70.5
Computer Hardware 44.00000 44.0
Construction 61.00000 24.5
Consumer Products & Services 626.29412 25.0
Education 59.85714 50.5
Energy 129.20000 120.0
Engineering 53.50000 54.5
Environmental Services 155.00000 155.0
Financial Services 144.30769 81.0
Food & Beverage 76.44444 41.0
Government Services 17.00000 17.0
Health 81.84615 45.0
Human Resources 437.54545 56.0
Insurance 32.50000 32.5
IT Services 204.09302 54.0
Logistics & Transportation 29.50000 23.5
Manufacturing 73.30769 30.0
Media 108.00000 45.0
Real Estate 18.25000 18.0
Retail 24.78571 13.5
Security 135.00000 32.5
Software 245.92308 80.0
Telecommunications 95.35294 31.0
Travel & Hospitality 547.71429 61.0
library(reshape2)
ggplot(melt(ny, id.vars='Industry'), aes(x=Industry, y=value, fill=variable)) + geom_bar(stat='identity', position='dodge') + coord_flip() + theme_minimal() + theme(
  panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "orchid"), 
  panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray100"),
  panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray95")) +
  ggtitle("Employees per Industry in New York") + ylab("Employees")

A large discrepancy between mean and median is noticeable for several industries. This discrepancy usually means the dataset has outliners which affect the average more than the median calculations. This is still valuable information, if the data is accurate. We now know in Travel & Hospitality, HR, Consumer Products & Services and especially in Business Products & Services, there are companies that employ a disproportionate number of people compared to their peers. Outliers can be visualized with a simple boxplot, which results in a not so pretty graph.

ny_all <- inc %>% filter(State == 'NY') %>% filter(complete.cases(.)) %>% group_by(Industry)
ggplot(ny_all, aes(x=Industry, y=Employees)) + geom_boxplot(fill='gray90',outlier.colour="red", outlier.shape=8,
                outlier.size=4) + coord_flip() + theme_minimal() + theme(
  panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "orchid"), 
  panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray100"),
  panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray95")) +
  ggtitle("Employees per Industry in New York") + ylab("Employees")

We can eliminate the outliers and re-plot ordering by the median. Outliers are defined for the entire dataset, not for individual groups.

quantiles <- quantile(ny_all$Employees, probs = c(.25, .75))
quantiles
##   25%   75% 
##  21.0 105.5
range <- 1.5 * IQR(ny_all$Employees)
range
## [1] 126.75
ny_clean<-subset.data.frame(ny_all,(quantiles[1]-range < ny_all$Employees & quantiles[1]+range > ny_all$Employees ))


ggplot(ny_clean, aes(x=reorder(Industry,Employees,FUN=median), y=Employees)) + geom_boxplot(fill='gray90') + coord_flip() + theme_minimal() + theme(
  panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "orchid"), 
  panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray100"),
  panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray95")) +
  ggtitle("Employees per Industry in New York") + ylab("Employees") + xlab("Industry")

Now we can go back to our original plot and make better sense of the data now that there aren’t large discrepancies between mean and median.

ny <- ny_clean %>% summarize(Average=mean(Employees), Median=median(Employees)) 
ny %>% kable() %>% kable_styling() %>% scroll_box(height = "500px")
Industry Average Median
Advertising & Marketing 41.84615 35.0
Business Products & Services 47.47059 32.0
Computer Hardware 44.00000 44.0
Construction 29.40000 24.0
Consumer Products & Services 40.43750 24.0
Education 49.07692 50.0
Energy 88.00000 105.0
Engineering 53.50000 54.5
Environmental Services 60.00000 60.0
Financial Services 51.25000 39.0
Food & Beverage 38.12500 35.5
Government Services 17.00000 17.0
Health 41.90000 38.5
Human Resources 39.00000 30.0
Insurance 32.50000 32.5
IT Services 53.77419 40.0
Logistics & Transportation 29.50000 23.5
Manufacturing 40.54545 26.0
Media 42.88889 16.0
Real Estate 18.25000 18.0
Retail 24.78571 13.5
Security 30.00000 25.0
Software 48.50000 37.0
Telecommunications 44.15385 25.0
Travel & Hospitality 34.75000 36.0
ggplot(melt(ny, id.vars='Industry'), aes(x=Industry, y=value, fill=variable)) + geom_bar(stat='identity', position='dodge') + coord_flip() + theme_minimal() + theme(
  panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "orchid"), 
  panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray100"),
  panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray95")) +
  ggtitle("Employees per Industry in New York") + ylab("Employees")

Some groups still show outliers, but as stated before these still provide valuable information about the number of employees per industry sector. The Energy industry is surely the largest employer, with Environmental Services and Engineering coming behind as distant second and third. Overall Government Services shows having the least number of employees.

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

ny_rev_ee <- ny_all %>%  mutate(Revenue_Per_EE = Revenue/Employees) %>%  group_by(Industry) %>% summarize(Average_Rev_Per_EE = mean(Revenue_Per_EE), Median_Rev_Per_EE = median(Revenue_Per_EE))

ny_rev_ee %>% kable() %>% kable_styling() %>% scroll_box(height = "500px")
Industry Average_Rev_Per_EE Median_Rev_Per_EE
Advertising & Marketing 373403.5 255555.6
Business Products & Services 527816.9 203148.1
Computer Hardware 520454.5 520454.5
Construction 238694.5 238613.0
Consumer Products & Services 382942.6 222222.2
Education 112060.6 114848.5
Energy 8472533.5 283211.7
Engineering 215744.7 202998.1
Environmental Services 134366.7 134366.7
Financial Services 400174.4 219607.8
Food & Beverage 174630.9 120238.1
Government Services 158823.5 158823.5
Health 532491.0 155000.0
Human Resources 337366.3 175000.0
Insurance 371000.0 371000.0
IT Services 228816.1 164285.7
Logistics & Transportation 1245870.1 996285.7
Manufacturing 665818.6 217500.0
Media 333549.6 262500.0
Real Estate 383809.5 330952.4
Retail 520790.3 305000.0
Security 153277.8 149000.0
Software 143749.0 133333.3
Telecommunications 408143.4 410714.3
Travel & Hospitality 282089.8 223333.3
ggplot(melt(ny_rev_ee, id.vars='Industry'), aes(x=Industry, y=value, fill=variable)) + geom_bar(stat='identity', position='dodge') + coord_flip() + theme_minimal() + theme(
  panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "orchid"), 
  panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray100"),
  panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray95")) +
  ggtitle("Revenue per Employee per Industry in New York") + ylab("Revenue per Employee")

As in the previous case, we see at least one industry with a large discrepancy between average and median. We proceed with the same analysis.

ny_all <- ny_all %>%  mutate(Revenue_Per_EE = Revenue/Employees)

ggplot(ny_all, aes(x=Industry, y=Revenue_Per_EE)) + geom_boxplot(fill='gray90',outlier.colour="red", outlier.shape=8,
                outlier.size=4) + coord_flip() + theme_minimal() + theme(
  panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "orchid"), 
  panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray100"),
  panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray95")) +
  ggtitle("Employees per Industry in New York") + ylab("Employees")

quantiles <- quantile(ny_all$Revenue_Per_EE, probs = c(.25, .75))
quantiles
##      25%      75% 
## 138653.7 373333.3
range <- 1.5 * IQR(ny_all$Revenue_Per_EE)
range
## [1] 352019.4
ny_clean<-subset.data.frame(ny_all,(quantiles[1]-range < ny_all$Revenue_Per_EE & quantiles[1]+range > ny_all$Revenue_Per_EE))


ggplot(ny_clean, aes(x=reorder(Industry,Revenue_Per_EE,FUN=median), y=Revenue_Per_EE)) + geom_boxplot(fill='gray90') + coord_flip() + theme_minimal() + theme(
  panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "orchid"), 
  panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray100"),
  panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray95")) +
  ggtitle("Revenue per Employee per Industry in New York") + ylab("Revenue per Employee") + xlab("Industry")

ny <- ny_clean %>% summarize(Average=mean(Revenue_Per_EE), Median=median(Revenue_Per_EE)) 
ny %>% kable() %>% kable_styling() %>% scroll_box(height = "500px")
Industry Average Median
Advertising & Marketing 229838.8 200000.0
Business Products & Services 184010.4 180000.0
Construction 238694.5 238613.0
Consumer Products & Services 187158.4 157777.8
Education 112060.6 114848.5
Energy 149778.0 106122.4
Engineering 215744.7 202998.1
Environmental Services 134366.7 134366.7
Financial Services 209618.4 195169.1
Food & Beverage 174630.9 120238.1
Government Services 158823.5 158823.5
Health 114565.5 124000.0
Human Resources 117477.8 103267.7
Insurance 180000.0 180000.0
IT Services 176233.7 160502.9
Logistics & Transportation 299740.3 299740.3
Manufacturing 224302.8 186363.6
Media 254404.6 252844.7
Real Estate 267301.6 300000.0
Retail 259547.4 260000.0
Security 153277.8 149000.0
Software 143749.0 133333.3
Telecommunications 296756.2 373333.3
Travel & Hospitality 209660.3 184617.5
ggplot(melt(ny, id.vars='Industry'), aes(x=Industry, y=value, fill=variable)) + geom_bar(stat='identity', position='dodge') + coord_flip() + theme_minimal() + theme(
  panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "orchid"), 
  panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray100"),
  panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray95")) +
  ggtitle("Renevue Employee per Industry in New York") + ylab("Revenue per Employee")

Now looking at the data, Telecommunications stands out as the most efficient industry (highest revenue per employee), with Logistics and transportation coming second. The one to avoid is Human Resources which comes last, with Health and Education also close to the bottom.