Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. Let’s read this in:

inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)

And let’s preview this data:

head(inc)
##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX
summary(inc)
##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2502   110 Consulting        :   1   Mean   :  4.612  
##  3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4995                    
##     Revenue                                  Industry      Employees      
##  Min.   :2.000e+06   IT Services                 : 733   Min.   :    1.0  
##  1st Qu.:5.100e+06   Business Products & Services: 482   1st Qu.:   25.0  
##  Median :1.090e+07   Advertising & Marketing     : 471   Median :   53.0  
##  Mean   :4.822e+07   Health                      : 355   Mean   :  232.7  
##  3rd Qu.:2.860e+07   Software                    : 342   3rd Qu.:  132.0  
##  Max.   :1.010e+10   Financial Services          : 260   Max.   :66803.0  
##                      (Other)                     :2358   NA's   :12       
##             City          State     
##  New York     : 160   CA     : 701  
##  Chicago      :  90   TX     : 387  
##  Austin       :  88   NY     : 311  
##  Houston      :  76   VA     : 283  
##  San Francisco:  75   FL     : 282  
##  Atlanta      :  74   IL     : 273  
##  (Other)      :4438   (Other):2764
str(inc)
## 'data.frame':    5001 obs. of  8 variables:
##  $ Rank       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name       : Factor w/ 5001 levels "(Add)ventures",..: 1770 1633 4423 690 1198 2839 4733 1468 1869 4968 ...
##  $ Growth_Rate: num  421 248 245 233 213 ...
##  $ Revenue    : num  1.18e+08 4.96e+07 2.55e+07 1.90e+09 8.70e+07 ...
##  $ Industry   : Factor w/ 25 levels "Advertising & Marketing",..: 5 12 13 7 1 20 10 1 5 21 ...
##  $ Employees  : int  104 51 132 50 220 63 27 75 97 15 ...
##  $ City       : Factor w/ 1519 levels "Acton","Addison",..: 391 365 635 2 139 66 912 1179 131 1418 ...
##  $ State      : Factor w/ 52 levels "AK","AL","AR",..: 5 47 10 45 20 45 44 5 46 41 ...

Think a bit about what these summaries mean. Use the space below to add more non-visual exploratory information that you think helps you understand this data:

library(ggplot2)
library(dplyr)
library(kableExtra)

Growth Rate

The growth rate ranges from 0.34 to 421.48, but the median is only 1.42, so the distribution is heavily right-skewed. As shown below, 19 companies had growth rates of 100 or higher.

inc %>% dplyr::filter(Growth_Rate >= 100) %>% summarise(n = n())
##    n
## 1 19

Below is the list of these 19 companies with growth rates of 100 or higher.

kable(inc %>% dplyr::filter(Growth_Rate >= 100)) %>% kable_styling()
Rank  Name                          Growth_Rate  Revenue    Industry                      Employees  City           State
   1  Fuhu                               421.48  1.179e+08  Consumer Products & Services        104  El Segundo     CA
   2  FederalConference.com              248.31  4.960e+07  Government Services                  51  Dumfries       VA
   3  The HCI Group                      245.45  2.550e+07  Health                              132  Jacksonville   FL
   4  Bridger                            233.08  1.900e+09  Energy                               50  Addison        TX
   5  DataXu                             213.37  8.700e+07  Advertising & Marketing             220  Boston         MA
   6  MileStone Community Builders       179.38  4.570e+07  Real Estate                          63  Austin         TX
   7  Value Payment Systems              174.04  2.550e+07  Financial Services                   27  Nashville      TN
   8  Emerge Digital Group               170.64  2.390e+07  Advertising & Marketing              75  San Francisco  CA
   9  Goal Zero                          169.81  3.310e+07  Consumer Products & Services         97  Bluffdale      UT
  10  Yagoozon                           166.89  1.860e+07  Retail                               15  Warwick        RI
  11  OBXtek                             164.33  2.960e+07  Government Services                 149  Tysons Corner  VA
  12  AdRoll                             150.65  3.410e+07  Advertising & Marketing             165  San Francisco  CA
  13  uBreakiFix                         141.02  1.700e+07  Retail                              250  Orlando        FL
  14  Sparc                              128.63  2.110e+07  Software                            160  Charleston     SC
  15  LivingSocial                       123.33  5.360e+08  Consumer Products & Services       4100  Washington     DC
  16  Amped Wireless                     110.68  1.430e+07  Computer Hardware                    26  Chino          CA
  17  Intelligent Audit                  105.73  1.450e+08  Logistics & Transportation           15  Rochelle Park  NJ
  18  Integrity Funding                  104.62  1.110e+07  Financial Services                   11  Sarasota       FL
  19  Vertex Body Sciences               100.10  1.180e+07  Food & Beverage                      51  columbus       OH

Revenue

Revenue ranges from 2 million to about 10.1 billion. The median is about 10.9 million, well below the mean of about 48.2 million, which points to a strongly right-skewed distribution.

inc %>% dplyr::summarise(min=min(Revenue), median=median(Revenue), max=max(Revenue))
##     min   median      max
## 1 2e+06 10900000 1.01e+10

Industry

There are 25 distinct industries.

kable(inc %>% dplyr::group_by(Industry) %>% dplyr::summarise(n=n()) %>% arrange(desc(n))) %>% kable_styling()
Industry n
IT Services 733
Business Products & Services 482
Advertising & Marketing 471
Health 355
Software 342
Financial Services 260
Manufacturing 256
Consumer Products & Services 203
Retail 203
Government Services 202
Human Resources 196
Construction 187
Logistics & Transportation 155
Food & Beverage 131
Telecommunications 129
Energy 109
Real Estate 96
Education 83
Engineering 74
Security 73
Travel & Hospitality 62
Media 54
Environmental Services 51
Insurance 50
Computer Hardware 44

Employees

Twelve companies have no value for Employees (the NA's row in the summary above). Excluding those, the number of employees ranges from 1 to 66,803, with a median of 53.
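
The missing values can also be counted directly. The sketch below uses a made-up vector standing in for inc$Employees (the values are hypothetical), and shows why the summary functions need na.rm = TRUE.

```r
# Toy stand-in for inc$Employees (values are made up for illustration)
employees <- c(104L, 51L, NA, 132L, NA)

# is.na() flags missing entries; summing the logical vector counts them
n_missing <- sum(is.na(employees))

# Without na.rm = TRUE, median() would return NA here
emp_median <- median(employees, na.rm = TRUE)
```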

kable(inc %>% dplyr::summarise(min=min(Employees, na.rm = TRUE), median=median(Employees, na.rm = TRUE), max=max(Employees, na.rm = TRUE))) %>% kable_styling()
min median max
1 53 66803

City

There are 1,519 distinct cities.

cities <- inc %>% group_by(City) %>% summarise(n=n())
nrow(cities)
## [1] 1519

These are the top 10 cities by number of companies located there (ties included).

kable(inc %>% group_by(City) %>% summarise(n=n()) %>% arrange(desc(n)) %>% top_n(10)) %>% kable_styling()
## Selecting by n
City n
New York 160
Chicago 90
Austin 88
Houston 76
San Francisco 75
Atlanta 74
San Diego 67
Seattle 52
Boston 43
Dallas 42
Denver 42
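
The table above has 11 rows rather than 10 because Dallas and Denver tie at 42 and top_n() keeps ties. If exactly 10 rows were wanted, one option is dplyr's slice_max() with with_ties = FALSE. A minimal sketch on hypothetical counts:

```r
library(dplyr)

# Toy counts (hypothetical), with a tie for last place
city_counts <- data.frame(
  City = c("A", "B", "C", "D"),
  n    = c(10, 7, 5, 5)
)

# top_n() keeps ties, so asking for 3 rows returns 4 here
with_ties <- city_counts %>% top_n(3, n)

# slice_max() can drop ties and return exactly 3 rows
exact <- city_counts %>% slice_max(order_by = n, n = 3, with_ties = FALSE)
```

Which of the tied rows survives is then arbitrary, so keeping the ties (as the table above does) is arguably the more honest choice.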

State

There are 52 distinct values in the State column. That is more than the 50 states because the column also includes non-state codes such as DC.

states <- inc %>% group_by(State) %>% summarise(n=n())
nrow(states)
## [1] 52
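
To see which codes fall outside the 50 states, one option is to compare the column against R's built-in state.abb vector. The sketch below runs on a hypothetical sample of codes rather than the real inc$State column:

```r
# Toy stand-in for inc$State (hypothetical sample of codes)
state_codes <- c("CA", "TX", "NY", "DC", "VA")

# state.abb is R's built-in vector of the 50 US state abbreviations
non_states <- setdiff(unique(state_codes), state.abb)
```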

These are the top 10 states by number of companies located there.

kable(inc %>% group_by(State) %>% summarise(n=n()) %>% arrange(desc(n)) %>% top_n(10)) %>% kable_styling()
## Selecting by n
State n
CA 701
TX 387
NY 311
VA 283
FL 282
IL 273
GA 212
OH 186
MA 182
PA 164

Question 1

Create a graph that shows the distribution of companies in the data set by State (i.e., how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (i.e., taller than wide), which should further guide your layout choices.

# Answer Question 1 here

ordered <- inc %>% group_by(State) %>% summarise(n=n()) %>% arrange(desc(n))

# Horizontal bars (via coord_flip) give each state its own row, which suits a portrait screen
plt1 <- 
  ggplot(data = ordered, aes(x=reorder(State,n), y=n)) + 
  geom_bar(stat="identity", width=0.5, color="#1F3552", fill="steelblue") +
  coord_flip() + 
  scale_y_continuous(breaks=seq(0,700,100)) + 
  ggtitle("Distribution by State") +
  xlab("") + ylab("") + 
  theme_minimal()

I couldn’t find a way to increase the plot canvas size. The chart would look better with more space between states and slightly thicker bars.
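
One possible way around the canvas-size limitation mentioned above is to set the figure dimensions explicitly. In R Markdown this is usually done with the knitr chunk options fig.width and fig.height; outside knitr, ggsave() accepts width and height directly. The sketch below uses hypothetical toy counts and a temporary output file, so the names toy, p, and out_file are illustrative only.

```r
library(ggplot2)

# Toy counts (hypothetical) standing in for the per-state table
toy <- data.frame(State = c("CA", "TX", "NY"), n = c(3, 2, 1))

p <- ggplot(toy, aes(x = reorder(State, n), y = n)) +
  geom_col() +
  coord_flip()

# In an R Markdown chunk header the canvas can be set with knitr options, e.g.
#   {r, fig.width = 6, fig.height = 10}
# Outside knitr, ggsave() takes the size (inches by default) directly.
out_file <- file.path(tempdir(), "states_portrait.pdf")
ggsave(out_file, plot = p, width = 6, height = 10)
```

A tall canvas such as 6 x 10 inches leaves room for one readable row per state on a portrait screen.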

The graph below orders the states from most to fewest companies.

plt1


Question 2

Let’s dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

As the table below shows, the state with the 3rd most companies in the data set is New York.

kable(inc %>% group_by(State) %>% summarise(n=n()) %>% arrange(desc(n)) %>% top_n(3)) %>% kable_styling()
## Selecting by n
State n
CA 701
TX 387
NY 311

inc_cc holds complete cases only.

inc_cc <- inc[complete.cases(inc),]
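
To illustrate what this filter does, here is a minimal sketch on a made-up data frame (toy and its values are hypothetical). complete.cases() returns one logical per row, TRUE when no column in that row is NA.

```r
# Toy data (hypothetical values); one row is missing Employees
toy <- data.frame(
  Name      = c("A", "B", "C"),
  Employees = c(10, NA, 30)
)

# TRUE for rows with no NA in any column
keep   <- complete.cases(toy)
toy_cc <- toy[keep, ]

# Number of rows dropped by the filter
n_dropped <- nrow(toy) - nrow(toy_cc)
```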

Below is a breakdown of the number of employees per company in each industry in New York state. It shows the minimum, median, maximum, and variance for each industry, sorted from highest to lowest variance.

kable(inc_cc %>% filter(State=='NY') %>% group_by(Industry) %>% summarise(min=min(Employees),median=median(Employees), max=max(Employees), var=var(Employees)) %>% arrange(desc(var))) %>% kable_styling()
Industry                      min  median    max           var
Business Products & Services    4    70.5  32000  3.894641e+07
Consumer Products & Services    5    25.0  10000  5.835802e+06
Travel & Hospitality            6    61.0   2280  6.974669e+05
Human Resources                 7    56.0   2081  4.634787e+05
IT Services                     8    54.0   3000  2.241769e+05
Software                       15    80.0   1271  1.404907e+05
Security                       25    32.5    450  4.415000e+04
Media                           4    45.0    602  3.099560e+04
Financial Services             14    81.0    483  2.299190e+04
Environmental Services         60   155.0    250  1.805000e+04
Food & Beverage                 5    41.0    383  1.390028e+04
Energy                          5   120.0    294  1.106670e+04
Telecommunications              6    31.0    316  1.064462e+04
Manufacturing                  11    30.0    307  8.048231e+03
Health                          2    45.0    298  7.505141e+03
Construction                   10    24.5    219  6.392000e+03
Advertising & Marketing         2    38.0    270  3.872536e+03
Education                      19    50.5    200  2.359516e+03
Engineering                    11    54.5     94  1.583000e+03
Logistics & Transportation      1    23.5     70  8.430000e+02
Retail                          3    13.5     75  6.378736e+02
Insurance                      15    32.5     50  6.125000e+02
Real Estate                     7    18.0     30  9.425000e+01
Computer Hardware              44    44.0     44            NA
Government Services            17    17.0     17            NA
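
The NA values in the var column arise because the sample variance is undefined for a single observation, and the last two industries have identical min, median, and max, which is what a single company would produce. A small sketch with hypothetical numbers shows the behaviour, along with the interquartile range as a spread measure that stays defined in that case:

```r
library(dplyr)

# Toy data (hypothetical): industry "B" has a single company
toy <- data.frame(
  Industry  = c("A", "A", "A", "B"),
  Employees = c(10, 20, 60, 44)
)

spread <- toy %>%
  group_by(Industry) %>%
  summarise(
    n_companies = n(),
    variance    = var(Employees),  # NA when the group has one row
    iqr         = IQR(Employees)   # 0 when the group has one row
  )
```

Including the per-group count alongside the variance makes it obvious which rows rest on too little data to say anything about variability.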

A box plot shows the median number of employees (the dark line inside the box), the spread of the data, and outliers (marked here with red asterisks).

There are 25 different industries. Plotting them all in a single box plot produced boxes too small to read, the same spacing problem I had in Question 1. As a workaround, I created vectors that group industries by variability, using the table above. In this data, industries with higher variance in employee count are also the ones with the largest maximum employee counts.

The code below groups industries with similar variability. I limit each group to at most 5 industries so that the plot doesn’t get too small.

g1a <- c('Business Products & Services')
g1b <- c('Consumer Products & Services')
g2 <- c('Travel & Hospitality', 'Human Resources', 'IT Services', 'Software')
g3 <- c('Security', 'Media', 'Financial Services',  'Environmental Services', 'Food & Beverage')
g4 <- c('Energy', 'Telecommunications', 'Manufacturing', 'Health', 'Construction')
g5 <- c('Advertising & Marketing', 'Education', 'Engineering', 'Logistics & Transportation', 'Retail')
g6 <- c('Insurance', 'Real Estate', 'Computer Hardware', 'Government Services')

Below is the code for creating the box plots for each grouping.

Please note that each plot for each group has a different x-axis scale, which depends on the range of number of employees for each respective group.

The industries ‘Computer Hardware’ and ‘Government Services’ do not have enough data to generate a meaningful box plot: their min, median, and max are identical, so each box collapses to a single line.

# Helper: horizontal box plot of Employees for the given NY industries.
# Outliers are drawn as small red asterisks (shape 8).
ny_boxplot <- function(industries) {
  ggplot(inc_cc %>% filter(State=='NY' & Industry %in% industries), aes(x = Industry, y = Employees)) + 
    coord_flip() + 
    geom_boxplot(outlier.colour="red", outlier.shape=8,
                 outlier.size=1, notch=FALSE)
}

plt_g1a <- ny_boxplot(g1a)
plt_g1b <- ny_boxplot(g1b)
plt_g2  <- ny_boxplot(g2)
plt_g3  <- ny_boxplot(g3)
plt_g4  <- ny_boxplot(g4)
plt_g5  <- ny_boxplot(g5)
plt_g6  <- ny_boxplot(g6)

I gave ‘Business Products & Services’ and ‘Consumer Products & Services’ their own groups because their box plots came out tiny. Their extreme outliers (maximums of 32,000 and 10,000 employees) stretch the axis so far that the boxes flatten out.

plt_g1a

plt_g1b

Below are the box plots for the remaining industries.

Please be mindful that the x-axis scale for each grouping is different.
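
An alternative to splitting the industries into groups, offered here only as a sketch, is to put Employees on a log scale so that every industry fits on one shared axis despite the outliers. The data below is hypothetical, and note that with scale_y_log10() the box statistics are computed on the log-transformed values.

```r
library(ggplot2)

# Toy data (hypothetical) with one extreme outlier in industry "A"
toy <- data.frame(
  Industry  = rep(c("A", "B"), each = 5),
  Employees = c(4, 30, 70, 150, 32000, 5, 20, 25, 60, 400)
)

p_log <- ggplot(toy, aes(x = Industry, y = Employees)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 8) +
  scale_y_log10() +  # compresses the long right tail
  coord_flip()
```

The trade-off is that distances on a log axis are ratios rather than differences, so the reader has to be told about the scale.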


Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

The table below shows the number of companies in each industry and the revenue per employee for each industry, computed as the industry’s total revenue divided by its total number of employees.

revenue_per_employee <- inc_cc %>%
  group_by(Industry) %>%
  summarise(count=n(), total_revenue=sum(Revenue), total_employees=sum(Employees),
            revenue_per_employee=total_revenue/total_employees) %>%
  arrange(desc(revenue_per_employee))

kable(revenue_per_employee) %>% kable_styling()
Industry                      count  total_revenue  total_employees  revenue_per_employee
Computer Hardware                44    11885700000             9714            1223563.93
Energy                          109    13771600000            26437             520921.44
Construction                    187    13174300000            29099             452740.64
Logistics & Transportation      154    14837800000            39994             371000.65
Consumer Products & Services    203    14956400000            45464             328972.37
Insurance                        50     2337900000             7339             318558.39
Manufacturing                   255    12603600000            43942             286823.54
Retail                          203    10257400000            37068             276718.46
Financial Services              260    13150900000            47693             275740.67
Environmental Services           51     2638800000            10155             259852.29
Telecommunications              127     7287900000            30842             236297.91
Government Services             202     6009100000            26185             229486.35
Business Products & Services    480    26345900000           117357             224493.64
Health                          354    17860100000            82430             216669.90
IT Services                     732    20525000000           102788             199682.84
Advertising & Marketing         471     7785000000            39731             195942.71
Food & Beverage                 129    12812500000            65911             194390.92
Media                            54     1742400000             9532             182794.80
Software                        341     8134600000            51262             158686.75
Real Estate                      95     2956800000            18893             156502.41
Education                        83     1139300000             7685             148249.84
Travel & Hospitality             62     2931600000            23035             127267.20
Engineering                      74     2532500000            20435             123929.53
Security                         73     3812800000            41059              92861.49
Human Resources                 196     9246100000           226980              40735.31
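
The revenue_per_employee column above divides industry totals, which weights large employers heavily. A typical company's ratio can look quite different, as this illustration with made-up numbers suggests (the vectors are hypothetical, not taken from the data set):

```r
# Three hypothetical companies with equal revenue but very different headcounts
revenue   <- c(1e6, 1e6, 1e6)
employees <- c(10, 10, 980)

# Ratio of totals: the large employer dominates the denominator
ratio_of_totals <- sum(revenue) / sum(employees)

# Median of per-company ratios: reflects the typical company
median_of_ratios <- median(revenue / employees)
```

Here the ratio of totals is 3,000 while the median per-company ratio is 100,000, so which measure to report depends on whether the investor cares about the industry as a whole or a typical firm within it.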

The code below plots revenue per employee as a bar chart, sorted from highest to lowest. A second bar chart shows the number of companies in each industry, in the same industry order as the first plot.

plt3_1 <- ggplot(data=revenue_per_employee, aes(x=reorder(Industry,-revenue_per_employee), y=revenue_per_employee)) +
     geom_bar(stat="identity", fill="steelblue") +
     theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
     ggtitle("Revenue Per Employee by Industry") + 
     ylab("Revenue Per Employee") + 
     xlab("")

# Same industry order as plt3_1, but the bar height is the number of companies
plt3_2 <- ggplot(data=revenue_per_employee, aes(x=reorder(Industry,-revenue_per_employee), y=count)) +
     geom_bar(stat="identity", fill="steelblue") +
     theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
     ggtitle("Distribution of Companies by Industry") + 
     ylab("Number of Companies") + 
     xlab("")
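
The question also asks for the distribution within each industry. One way to show that, sketched here under the assumption that a per-company ratio is an acceptable measure, is a box plot of each company's revenue per employee, ordered by industry median and drawn on a log scale. The data frame toy and the column rev_per_emp are hypothetical names with made-up values.

```r
library(ggplot2)
library(dplyr)

# Toy data (hypothetical values) standing in for inc_cc
toy <- data.frame(
  Industry  = rep(c("Energy", "Software"), each = 4),
  Revenue   = c(2e6, 5e6, 4e7, 1.9e9, 3e6, 8e6, 1e7, 2e7),
  Employees = c(10, 20, 50, 50, 15, 40, 80, 160)
)

# Revenue per employee for each company, not each industry
per_company <- toy %>%
  mutate(rev_per_emp = Revenue / Employees)

# Industries ordered by their median ratio; log scale tames the outliers
p_dist <- ggplot(per_company,
                 aes(x = reorder(Industry, rev_per_emp, FUN = median),
                     y = rev_per_emp)) +
  geom_boxplot() +
  scale_y_log10() +
  coord_flip() +
  labs(x = "", y = "Revenue per employee (log scale)")
```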