CUNY 608 - Joel Park

Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:

inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)

And lets preview this data:

head(inc)
##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX
summary(inc)
##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2502   110 Consulting        :   1   Mean   :  4.612  
##  3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4995                    
##     Revenue                                  Industry      Employees      
##  Min.   :2.000e+06   IT Services                 : 733   Min.   :    1.0  
##  1st Qu.:5.100e+06   Business Products & Services: 482   1st Qu.:   25.0  
##  Median :1.090e+07   Advertising & Marketing     : 471   Median :   53.0  
##  Mean   :4.822e+07   Health                      : 355   Mean   :  232.7  
##  3rd Qu.:2.860e+07   Software                    : 342   3rd Qu.:  132.0  
##  Max.   :1.010e+10   Financial Services          : 260   Max.   :66803.0  
##                      (Other)                     :2358   NA's   :12       
##             City          State     
##  New York     : 160   CA     : 701  
##  Chicago      :  90   TX     : 387  
##  Austin       :  88   NY     : 311  
##  Houston      :  76   VA     : 283  
##  San Francisco:  75   FL     : 282  
##  Atlanta      :  74   IL     : 273  
##  (Other)      :4438   (Other):2764

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

The summary provides an overall glimpse via descriptive statistics for the columns in the dataset. Some of the information is helpful, others not helpful (for example, descriptive statistics on ranking). However, perhaps we may be able to gain more insight if we looked at the data from another perspective. Let’s group look at the fastest growing companies and see which cities they are coming from?

# Importing the tidyverse library
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.7.2     ✔ stringr 1.2.0
## ✔ readr   1.1.1     ✔ forcats 0.2.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
# What are the top 10 fastest growing companies in the US?

# Sort by 'Growth_Rate'
top_10 <- inc %>% arrange(desc(Growth_Rate)) %>% head(10) %>% select(c(Rank, Name, Growth_Rate, Revenue, City, State))
print("What are the top 10 fastest growing companies in the US?")
## [1] "What are the top 10 fastest growing companies in the US?"
top_10
##    Rank                         Name Growth_Rate   Revenue          City
## 1     1                         Fuhu      421.48 1.179e+08    El Segundo
## 2     2        FederalConference.com      248.31 4.960e+07      Dumfries
## 3     3                The HCI Group      245.45 2.550e+07  Jacksonville
## 4     4                      Bridger      233.08 1.900e+09       Addison
## 5     5                       DataXu      213.37 8.700e+07        Boston
## 6     6 MileStone Community Builders      179.38 4.570e+07        Austin
## 7     7        Value Payment Systems      174.04 2.550e+07     Nashville
## 8     8         Emerge Digital Group      170.64 2.390e+07 San Francisco
## 9     9                    Goal Zero      169.81 3.310e+07     Bluffdale
## 10   10                     Yagoozon      166.89 1.860e+07       Warwick
##    State
## 1     CA
## 2     VA
## 3     FL
## 4     TX
## 5     MA
## 6     TX
## 7     TN
## 8     CA
## 9     UT
## 10    RI

Within the top 10 fastest growing companies, there does not appear to be a single geographical location that dominates. Let’s take a look at the top 1000 fastest growing companies. What are the top 10 cities?

# What are the top ten cities in America with the 1000 fastest growing companies?
top_10_cities <- inc %>% arrange(desc(Growth_Rate)) %>% head(1000) %>% count(City, sort = TRUE) %>% head(10)
print("What are the top ten cities in America with the 1000 fastest growing companies?")
## [1] "What are the top ten cities in America with the 1000 fastest growing companies?"
top_10_cities
## # A tibble: 10 x 2
##    City              n
##    <fct>         <int>
##  1 New York         35
##  2 San Francisco    26
##  3 Chicago          20
##  4 Austin           19
##  5 Atlanta          17
##  6 Houston          15
##  7 Irvine           15
##  8 Boston           14
##  9 San Diego        11
## 10 Washington       11

It’s not surprising to see that New York, San Francisco, Chicago, Austin, and Atlanta contain the fastest growing companies (likely startups) in the country.

Let’s evaluate what type of industries contain the fastest growing companies in the US? How many different unique industries postings are there in this dataset?

# How many unique industries in this dataset?
print(paste0("There are ", length(unique(inc$Industry)), " unique industries in this dataset."))
## [1] "There are 25 unique industries in this dataset."
print("The industries in this dataset are: ")
## [1] "The industries in this dataset are: "
unique(inc$Industry)
##  [1] Consumer Products & Services Government Services         
##  [3] Health                       Energy                      
##  [5] Advertising & Marketing      Real Estate                 
##  [7] Financial Services           Retail                      
##  [9] Software                     Computer Hardware           
## [11] Logistics & Transportation   Food & Beverage             
## [13] IT Services                  Business Products & Services
## [15] Education                    Construction                
## [17] Manufacturing                Telecommunications          
## [19] Security                     Human Resources             
## [21] Travel & Hospitality         Media                       
## [23] Environmental Services       Engineering                 
## [25] Insurance                   
## 25 Levels: Advertising & Marketing ... Travel & Hospitality

How many different industries are there in the top 1000 companies (in descending order)?

# Sorting by industries with the 1000 fastest growing companies in the US
print("Sorting by industries with the 1000 fastest growing companies in the US")
## [1] "Sorting by industries with the 1000 fastest growing companies in the US"
inc %>% arrange(desc(Growth_Rate)) %>% head(1000) %>% count(Industry, sort=TRUE) %>% head(10)
## # A tibble: 10 x 2
##    Industry                         n
##    <fct>                        <int>
##  1 IT Services                    120
##  2 Advertising & Marketing        112
##  3 Software                        83
##  4 Health                          77
##  5 Business Products & Services    67
##  6 Financial Services              65
##  7 Consumer Products & Services    61
##  8 Retail                          59
##  9 Government Services             56
## 10 Energy                          38

IT Services and Advertising/Marketing are by far the fastest growing industries in the top 1000 companies.

Does the fastest growing companies also bring in the highest revenue?

print("Top 10 Fastest Growing Companies")
## [1] "Top 10 Fastest Growing Companies"
top_10
##    Rank                         Name Growth_Rate   Revenue          City
## 1     1                         Fuhu      421.48 1.179e+08    El Segundo
## 2     2        FederalConference.com      248.31 4.960e+07      Dumfries
## 3     3                The HCI Group      245.45 2.550e+07  Jacksonville
## 4     4                      Bridger      233.08 1.900e+09       Addison
## 5     5                       DataXu      213.37 8.700e+07        Boston
## 6     6 MileStone Community Builders      179.38 4.570e+07        Austin
## 7     7        Value Payment Systems      174.04 2.550e+07     Nashville
## 8     8         Emerge Digital Group      170.64 2.390e+07 San Francisco
## 9     9                    Goal Zero      169.81 3.310e+07     Bluffdale
## 10   10                     Yagoozon      166.89 1.860e+07       Warwick
##    State
## 1     CA
## 2     VA
## 3     FL
## 4     TX
## 5     MA
## 6     TX
## 7     TN
## 8     CA
## 9     UT
## 10    RI
print("Top 10 Highest Revenue Companies")
## [1] "Top 10 Highest Revenue Companies"
inc %>% arrange(desc(Revenue)) %>% head(10) %>% select(c(Rank, Name, Growth_Rate, Revenue, City, State))
##    Rank                       Name Growth_Rate  Revenue            City
## 1  4788                        CDW        0.41 1.01e+10    Vernon Hills
## 2  3853                 ABC Supply        0.73 4.70e+09          Beloit
## 3  4936                       Coty        0.36 4.60e+09        New York
## 4  4997                  Dot Foods        0.34 4.50e+09    Mt. Sterling
## 5  4716              Westcon Group        0.44 3.80e+09       Tarrytown
## 6  4246 American Tire Distributors        0.59 3.50e+09    Huntersville
## 7  4052                   Kum & Go        0.65 2.80e+09 West Des Moines
## 8  4802              Boise Cascade        0.41 2.80e+09           Boise
## 9  1397          EnvisionRxOptions        2.88 2.70e+09       Twinsburg
## 10 2522                  DLA Piper        1.41 2.40e+09         Chicago
##    State
## 1     IL
## 2     WI
## 3     NY
## 4     IL
## 5     NY
## 6     NC
## 7     IA
## 8     ID
## 9     OH
## 10    IL

I suspect that these fast growing companies are likely startups that have received seed or Series A funding, whereas the highest revenue companies are stable companies that have established themselves in the industry.

Assuming my above statement is true, more established companies in theory should have more employees than start up companies.

print("Top 10 Fastest Growing Companies - Employee Count")
## [1] "Top 10 Fastest Growing Companies - Employee Count"
inc %>% arrange(desc(Growth_Rate)) %>% head(10) %>% select(c(Rank, Name, Employees))
##    Rank                         Name Employees
## 1     1                         Fuhu       104
## 2     2        FederalConference.com        51
## 3     3                The HCI Group       132
## 4     4                      Bridger        50
## 5     5                       DataXu       220
## 6     6 MileStone Community Builders        63
## 7     7        Value Payment Systems        27
## 8     8         Emerge Digital Group        75
## 9     9                    Goal Zero        97
## 10   10                     Yagoozon        15
print("Top 10 Highest Revenue Companies - Employee Count")
## [1] "Top 10 Highest Revenue Companies - Employee Count"
inc %>% arrange(desc(Revenue)) %>% head(10) %>% select(c(Rank, Name, Employees))
##    Rank                       Name Employees
## 1  4788                        CDW      6800
## 2  3853                 ABC Supply      6549
## 3  4936                       Coty     10000
## 4  4997                  Dot Foods      3919
## 5  4716              Westcon Group      3000
## 6  4246 American Tire Distributors      3341
## 7  4052                   Kum & Go      4589
## 8  4802              Boise Cascade      4470
## 9  1397          EnvisionRxOptions       625
## 10 2522                  DLA Piper      4036

It appears that the above statement is true.

Question 1

Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.

# Focusing on the number of companies per state
companies <- inc %>% count(State, sort = TRUE)

# So that the data is sorted in descending order
companies$State <- factor(companies$State, levels = companies$State[order(companies$n)])

# Creating the ggplot for all states
ggplot(companies, aes(State,n)) + geom_bar(stat="identity", fill='steelblue') + coord_flip() + ylab("Count") + theme_minimal() + ggtitle("Number of Fastest Growing Companies Per State")

# Creating the ggplot for top 10 states
companies_10 <- inc %>% count(State, sort = TRUE) %>% head(10)
companies_10$State <- factor(companies_10$State, levels = companies_10$State[order(companies_10$n)])
ggplot(companies_10, aes(State,n)) + geom_bar(stat="identity", fill='steelblue') + coord_flip() + ylab("Count") + theme_minimal() + ggtitle("Number of Fastest Growing Companies Per State - Top 10") + geom_text(aes(label=n), vjust=0.3, hjust=1.6, color="white", size=3.5)

California, Texas, and New York appear to have the 3 fastest growing companies in the US.

Question 2

Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

# State with the 3rd most companies in the data set: New York
NY <- inc %>% filter(State == 'NY')
NY <- NY[complete.cases(NY),] 
NY_summary <- NY %>%
    group_by(Industry) %>%
    summarize(Median_Employment=median(Employees),
              Mean_Employment=mean(Employees))

# Create a plot showing average employment by industry for companies in the state
ggplot(NY_summary  %>%  mutate(Industry = fct_reorder(Industry, Mean_Employment)) , aes(x=Industry,y=Mean_Employment, fill=Industry)) + geom_bar(stat="identity") + ylab('Mean Employment by Industry') + coord_flip() +scale_fill_grey(start=0.8, end=0.1) + theme(legend.position='none')

# Create a plot showing median employment by industry for companies in the state
ggplot(NY_summary  %>%  mutate(Industry = fct_reorder(Industry, Median_Employment)), aes(x=Industry,y=Median_Employment, fill=Industry)) + geom_bar(stat="identity") + ylab('Median Employment by Industry') + coord_flip() + scale_fill_grey(start=0.8, end=0.1) + theme(legend.position='none')

Side by Side Comparisons (Mean Employment vs. Median Employment by Industry)

# Side by Side differences
par(mfrow=c(1,2))
ggplot(NY_summary, aes(x=Industry,y=Mean_Employment)) + geom_bar(stat="identity", fill='steelblue') + ylab('Mean Employment by Industry') + coord_flip() 

ggplot(NY_summary, aes(x=Industry,y=Median_Employment)) + geom_bar(stat="identity", fill='steelblue') + ylab('Median Employment by Industry') + coord_flip()

Interestingly, many of the industries means and medians do not appear to match up. Typically, such drastic differences in mean median indicate skew in the distribution of employees within each industry.

NY_boxplot_df <- NY %>%
    group_by(Industry)
ggplot(NY_boxplot_df, aes(x=Industry,y=Employees)) + geom_boxplot() + coord_flip()

As we see, what’s the most obvious from this graph is the outliers with Consumer Products $ Services and with Business Products & Services. Though the information can give us a general sense of how extreme these outliers are, the graph unfortunately does not provide much information beyond that. Let us remove these two outliers, and take a closer look at the rest of the boxplots.

NY_boxplot_df_adj <- NY %>%
  filter(Employees < 900) %>% 
  group_by(Industry)

ggplot(NY_boxplot_df_adj, aes(x=Industry, y=Employees, fill=Industry)) + geom_boxplot() + coord_flip() + theme(legend.position='none')

The median number is most likely a better reflection of which industry (with the fastest growing companies in the US) as there are a significant of right skew distribution plots. Here, Environmental Services and Energy appears to have the highest median number of employees.

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

We will be looking at the mean and median distribution.

revenue <- NY %>% 
  mutate(Revenue_Per_Employee = Revenue/Employees) %>% 
  group_by(Industry) %>% 
  summarize(Mean_Rev_Per_Emp = mean(Revenue_Per_Employee),
            Median_Rev_Per_Emp = median(Revenue_Per_Employee))

# Mean Boxplots of Revenue Per Employee
ggplot(revenue %>%  mutate(Industry = fct_reorder(Industry, Mean_Rev_Per_Emp)), aes(x=Industry,y=Mean_Rev_Per_Emp, fill=Industry)) + geom_bar(stat="identity", fill='steelblue') + ylab('Mean Revenue per Employee by Industry') + coord_flip() + theme(legend.position = 'none')

# Mean Boxplots of Revenue Per Employee
ggplot(revenue %>%  mutate(Industry = fct_reorder(Industry, Median_Rev_Per_Emp)), aes(x=Industry,y=Median_Rev_Per_Emp, fill=Industry)) + geom_bar(stat="identity", fill='steelblue') + ylab('Median Revenue per Employee by Industry') + coord_flip() + theme(legend.position = 'none')

The most notable difference in the mean and median is in the energy industry. Let’s check for outliers.

#Boxplot
revenue_1 <- NY %>% 
  mutate(Revenue_Per_Employee = Revenue/Employees) %>% 
  group_by(Industry)

ggplot(revenue_1, aes(x=Industry, y=Revenue_Per_Employee, fill=Industry)) + geom_boxplot() + coord_flip() + theme(legend.position='none')

Energy appears to have a company that is quite productive and generating a significant amount of revenue. This likely explains why the mean differs so much from the median.

Let’s see what other differences between the mean and median exists within all of the industries.

rev_2 <- revenue_1 %>% group_by(Industry) %>% select(Name, Revenue_Per_Employee) %>% summarise(Revenue_Per_Employee_Mean = mean(Revenue_Per_Employee), Revenue_Per_Employee_Median = median(Revenue_Per_Employee)) %>% mutate(Difference_Mean_Median = abs(Revenue_Per_Employee_Mean - Revenue_Per_Employee_Median))
## Adding missing grouping variables: `Industry`
rev_2
## # A tibble: 25 x 4
##    Industry       Revenue_Per_Employ… Revenue_Per_Emplo… Difference_Mean_…
##    <fct>                        <dbl>              <dbl>             <dbl>
##  1 Advertising &…              373403             255556          117848  
##  2 Business Prod…              527817             203148          324669  
##  3 Computer Hard…              520455             520455               0  
##  4 Construction                238695             238613              81.5
##  5 Consumer Prod…              382943             222222          160720  
##  6 Education                   112061             114848            2788  
##  7 Energy                     8472533             283212         8189322  
##  8 Engineering                 215745             202998           12747  
##  9 Environmental…              134367             134367               0  
## 10 Financial Ser…              400174             219608          180567  
## # ... with 15 more rows
ggplot(rev_2 %>% mutate(Industry = fct_reorder(Industry, Difference_Mean_Median)) , aes(x=Industry, y=Difference_Mean_Median, fill=Industry)) + geom_bar(stat="identity") + coord_flip() + theme(legend.position = 'none') + scale_fill_grey() + ylab("Absolute Difference Between Mean and Median")

For the most part, other than Energy, there isn’t a significant difference between the mean and the median. Let’s remove the one energy company in the outlier and examine the rest of the data.

revenue_2 <- revenue_1 %>% 
  filter(Revenue_Per_Employee < 10^7)

ggplot(revenue_2, aes(x=Industry, y=Revenue_Per_Employee, fill=Industry)) + geom_boxplot() + coord_flip() + theme(legend.position='none') + ylab("Revenue Per Employee")

Let’s replace the X scale with a logarithmic scale and see if the differences become more apparent.

ggplot(revenue_2, aes(x=Industry, y=Revenue_Per_Employee, fill=Industry)) + geom_boxplot() + scale_y_continuous(trans='log10') + coord_flip() + theme(legend.position='none') + ylab("Revenue Per Employee - Logarithmic Transformation Base 10")

Another way to visualize this data is via Violin Plot. Again, the scale has had a logarithmic transformation.

ggplot(revenue_2, aes(x=Industry, y=Revenue_Per_Employee, fill=Industry)) + geom_violin() + scale_y_continuous(trans='log10') + coord_flip() + theme(legend.position='none') + ylab("Revenue Per Employee - Logarithmic Transformation Base 10") + ggtitle("Violin Plot of Revenue Per Employee by Industry")

It appears that Logistics & Transportation provides generally the most revenue per employee in the NY State.