library(tidyverse)
library(dlookr)

Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. Let's read this in:

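# stringsAsFactors = TRUE keeps text columns as factors, matching the <fct> columns
# in the output below (read.csv stopped doing this by default in R 4.0)
inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header = TRUE, stringsAsFactors = TRUE)

inc <- as_tibble(inc)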

And let's preview this data:

head(inc)
## # A tibble: 6 x 8
##    Rank Name       Growth_Rate  Revenue Industry     Employees City   State
##   <int> <fct>            <dbl>    <dbl> <fct>            <int> <fct>  <fct>
## 1     1 Fuhu              421.   1.18e8 Consumer Pr~       104 El Se~ CA   
## 2     2 FederalCo~        248.   4.96e7 Government ~        51 Dumfr~ VA   
## 3     3 The HCI G~        245.   2.55e7 Health             132 Jacks~ FL   
## 4     4 Bridger           233.   1.90e9 Energy              50 Addis~ TX   
## 5     5 DataXu            213.   8.70e7 Advertising~       220 Boston MA   
## 6     6 MileStone~        179.   4.57e7 Real Estate         63 Austin TX
summary(inc)
##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2502   110 Consulting        :   1   Mean   :  4.612  
##  3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4995                    
##     Revenue                                  Industry      Employees      
##  Min.   :2.000e+06   IT Services                 : 733   Min.   :    1.0  
##  1st Qu.:5.100e+06   Business Products & Services: 482   1st Qu.:   25.0  
##  Median :1.090e+07   Advertising & Marketing     : 471   Median :   53.0  
##  Mean   :4.822e+07   Health                      : 355   Mean   :  232.7  
##  3rd Qu.:2.860e+07   Software                    : 342   3rd Qu.:  132.0  
##  Max.   :1.010e+10   Financial Services          : 260   Max.   :66803.0  
##                      (Other)                     :2358   NA's   :12       
##             City          State     
##  New York     : 160   CA     : 701  
##  Chicago      :  90   TX     : 387  
##  Austin       :  88   NY     : 311  
##  Houston      :  76   VA     : 283  
##  San Francisco:  75   FL     : 282  
##  Atlanta      :  74   IL     : 273  
##  (Other)      :4438   (Other):2764

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

My Approach

We have looked at a summary and the top of the data, but not the bottom. I’ll use tail() to review the bottom of the data, and then some dlookr functions to develop a better understanding of it.

tail(inc)
## # A tibble: 6 x 8
##    Rank Name      Growth_Rate  Revenue Industry      Employees City   State
##   <int> <fct>           <dbl>    <dbl> <fct>             <int> <fct>  <fct>
## 1  4996 cSubs            0.34   1.34e7 Business Pro~        19 Montv~ NJ   
## 2  4997 Dot Foods        0.34   4.50e9 Food & Bever~      3919 Mt. S~ IL   
## 3  4998 Lethal P~        0.34   6.80e6 Retail                8 Welli~ FL   
## 4  4999 ArcaTech~        0.34   3.26e7 Financial Se~        63 Mebane NC   
## 5  5000 INE              0.34   6.80e6 IT Services          35 Belle~ WA   
## 6  5000 ALL4             0.34   4.70e6 Environmenta~        34 Kimbe~ PA
The summary and tail output suggest that some cleaning is required in the Name variable. The tail output also shows two companies sharing Rank 5000.
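
The tail() output above lists two companies at Rank 5000. A minimal sketch of how such ties could be found is shown here. It runs on a small hypothetical tibble (`toy`) built from four of the rows printed above rather than on the full `inc` data, and the names `toy`, `n_rank`, and `tied_ranks` are illustrative only.

```r
library(dplyr)

# Hypothetical stand-in for `inc`: four rows copied from the tail() output
toy <- tibble(
  Rank = c(4996L, 4997L, 5000L, 5000L),
  Name = c("cSubs", "Dot Foods", "INE", "ALL4")
)

# Count how often each Rank occurs and keep only the ranks that repeat
tied_ranks <- toy %>%
  add_count(Rank, name = "n_rank") %>%
  filter(n_rank > 1)

tied_ranks
```

Replacing `toy` with `inc` should list every tied rank in the full data, assuming the column names shown in the preview.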

dlookr Package

The dlookr package provides a variety of functions that make it easier to understand your data and its challenges.

Its diagnose() function reports the type, missing values, and unique values of each variable in a data frame.
diagnose(inc)
## # A tibble: 8 x 6
##   variables   types  missing_count missing_percent unique_count unique_rate
##   <chr>       <chr>          <int>           <dbl>        <int>       <dbl>
## 1 Rank        integ~             0           0             4999     1.000  
## 2 Name        factor             0           0             5001     1      
## 3 Growth_Rate numer~             0           0             1147     0.229  
## 4 Revenue     numer~             0           0             1069     0.214  
## 5 Industry    factor             0           0               25     0.00500
## 6 Employees   integ~            12           0.240          692     0.138  
## 7 City        factor             0           0             1519     0.304  
## 8 State       factor             0           0               52     0.0104
diagnose() clearly shows the variable types and reveals 12 missing values (0.24%) for Employees.
diagnose_numeric(inc)
## # A tibble: 4 x 10
##   variables     min      Q1   mean median     Q3     max  zero minus
##   <chr>       <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl> <int> <int>
## 1 Rank      1.00e+0 1.25e+3 2.50e3 2.50e3 3.75e3 5.00e 3     0     0
## 2 Growth_R~ 3.40e-1 7.70e-1 4.61e0 1.42e0 3.29e0 4.21e 2     0     0
## 3 Revenue   2.00e+6 5.10e+6 4.82e7 1.09e7 2.86e7 1.01e10     0     0
## 4 Employees 1.00e+0 2.50e+1 2.33e2 5.30e1 1.32e2 6.68e 4     0     0
## # ... with 1 more variable: outlier <int>
diagnose_category(inc)
## # A tibble: 5,032 x 6
##    variables levels                                N  freq  ratio  rank
##    <chr>     <fct>                             <int> <int>  <dbl> <int>
##  1 Name      (Add)ventures                      5001     1 0.0200     1
##  2 Name      @Properties                        5001     1 0.0200     2
##  3 Name      1-Stop Translation USA             5001     1 0.0200     3
##  4 Name      110 Consulting                     5001     1 0.0200     4
##  5 Name      11thStreetCoffee.com               5001     1 0.0200     5
##  6 Name      123 Exteriors                      5001     1 0.0200     6
##  7 Name      1st American Systems and Services  5001     1 0.0200     7
##  8 Name      1st Equity                         5001     1 0.0200     8
##  9 Name      2020 Exhibits                      5001     1 0.0200     9
## 10 Name      206inc                             5001     1 0.0200    10
## # ... with 5,022 more rows
diagnose_numeric() provides descriptive statistics and outlier information for the numeric variables. diagnose_category() returns diagnostic information for the non-numeric variables.
describe(inc)
## # A tibble: 4 x 26
##   variable     n    na   mean     sd se_mean    IQR skewness kurtosis
##   <chr>    <int> <int>  <dbl>  <dbl>   <dbl>  <dbl>    <dbl>    <dbl>
## 1 Rank      5001     0 2.50e3 1.44e3 2.04e+1 2.50e3 -4.90e-4    -1.20
## 2 Growth_~  5001     0 4.61e0 1.41e1 2.00e-1 2.52e0  1.26e+1   243.  
## 3 Revenue   5001     0 4.82e7 2.41e8 3.40e+6 2.35e7  2.22e+1   724.  
## 4 Employe~  4989    12 2.33e2 1.35e3 1.92e+1 1.07e2  2.98e+1  1270.  
## # ... with 17 more variables: p00 <dbl>, p01 <dbl>, p05 <dbl>, p10 <dbl>,
## #   p20 <dbl>, p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>, p60 <dbl>,
## #   p70 <dbl>, p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>, p99 <dbl>,
## #   p100 <dbl>
normality(inc)
## # A tibble: 4 x 4
##   vars        statistic  p_value sample
##   <chr>           <dbl>    <dbl>  <dbl>
## 1 Rank            0.955 9.48e-37   5000
## 2 Growth_Rate     0.252 4.18e-89   5000
## 3 Revenue         0.135 1.91e-92   5000
## 4 Employees       0.106 3.69e-93   5000

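describe() and normality() provide additional information on skewness and normality. Growth_Rate, Revenue, and Employees are strongly right-skewed (skewness of about 12.6, 22.2, and 29.8), and every p-value from normality() is far below 0.05, so none of the numeric variables looks normally distributed.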
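
To make the skewness and normality numbers more concrete, here is a small base R sketch. It assumes that normality() is built on the Shapiro-Wilk test, which base R exposes as shapiro.test(). The data is a hypothetical log-normal sample (`x`), not the Inc. data, the `skewness()` helper is an illustrative moment-based definition that may differ slightly from what dlookr computes, and the log transform is only an illustration of how skew can be reduced.

```r
# Hypothetical right-skewed sample standing in for a column such as Revenue
set.seed(608)
x <- rlnorm(500, meanlog = 16, sdlog = 1.5)

# Moment-based sample skewness (one common definition; packages may differ)
skewness <- function(v) {
  v <- v[!is.na(v)]
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

# Shapiro-Wilk test on the raw and the log10-transformed values
raw_test <- shapiro.test(x)
log_test <- shapiro.test(log10(x))

c(skew_raw = skewness(x), skew_log = skewness(log10(x)))
c(p_raw = raw_test$p.value, p_log = log_test$p.value)
```

A W statistic close to 1 is consistent with normality, while a small W with a tiny p-value, as in the normality() output above, points away from it.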

Question 1

Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.

library(hrbrthemes)
library(ggthemes)
library(kableExtra)

# Count companies per state, largest first
state <- inc %>%
  count(State, sort = TRUE)

# Horizontal bars suit a portrait screen: the states run down the vertical axis
p <- ggplot(state, aes(x = reorder(State, n), y = n, fill = n)) +
  geom_col() +
  # nudge_y is in data units; the original value of 2000 pushed the labels
  # far beyond the 0-800 axis, so a small offset is used instead
  geom_text(aes(label = scales::comma(n)), hjust = 0, nudge_y = 5) +
  scale_y_comma(limits = c(0, 800)) +
  coord_flip() +
  labs(x = "", y = "Companies per state (n)",
       title = "Fastest Growing Companies",
       subtitle = "Number of high growth companies by state.",
       caption = "Source: Inc. Magazine (2016)") +
  theme_ipsum(grid = "X") +
  theme(legend.title = element_blank(),
        axis.text.y = element_text(size = 7))

p

Question 2

Let's dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

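# Mean and median employment per industry for New York, complete cases only
inc2 <- inc %>% 
  filter(State == "NY") %>% 
  filter(complete.cases(.)) %>% 
  group_by(Industry) %>% 
  summarise(Mean = mean(Employees),
            Median = median(Employees)) %>% 
  # Reshape to long form: one row per industry and statistic
  gather(statType, Amount, Mean, Median)

kable(inc2, format = "markdown")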
| Industry | statType | Amount |
|:---|:---|---:|
| Advertising & Marketing | Mean | 58.43860 |
| Business Products & Services | Mean | 1492.46154 |
| Computer Hardware | Mean | 44.00000 |
| Construction | Mean | 61.00000 |
| Consumer Products & Services | Mean | 626.29412 |
| Education | Mean | 59.85714 |
| Energy | Mean | 129.20000 |
| Engineering | Mean | 53.50000 |
| Environmental Services | Mean | 155.00000 |
| Financial Services | Mean | 144.30769 |
| Food & Beverage | Mean | 76.44444 |
| Government Services | Mean | 17.00000 |
| Health | Mean | 81.84615 |
| Human Resources | Mean | 437.54545 |
| Insurance | Mean | 32.50000 |
| IT Services | Mean | 204.09302 |
| Logistics & Transportation | Mean | 29.50000 |
| Manufacturing | Mean | 73.30769 |
| Media | Mean | 108.00000 |
| Real Estate | Mean | 18.25000 |
| Retail | Mean | 24.78571 |
| Security | Mean | 135.00000 |
| Software | Mean | 245.92308 |
| Telecommunications | Mean | 95.35294 |
| Travel & Hospitality | Mean | 547.71429 |
| Advertising & Marketing | Median | 38.00000 |
| Business Products & Services | Median | 70.50000 |
| Computer Hardware | Median | 44.00000 |
| Construction | Median | 24.50000 |
| Consumer Products & Services | Median | 25.00000 |
| Education | Median | 50.50000 |
| Energy | Median | 120.00000 |
| Engineering | Median | 54.50000 |
| Environmental Services | Median | 155.00000 |
| Financial Services | Median | 81.00000 |
| Food & Beverage | Median | 41.00000 |
| Government Services | Median | 17.00000 |
| Health | Median | 45.00000 |
| Human Resources | Median | 56.00000 |
| Insurance | Median | 32.50000 |
| IT Services | Median | 54.00000 |
| Logistics & Transportation | Median | 23.50000 |
| Manufacturing | Median | 30.00000 |
| Media | Median | 45.00000 |
| Real Estate | Median | 18.00000 |
| Retail | Median | 13.50000 |
| Security | Median | 32.50000 |
| Software | Median | 80.00000 |
| Telecommunications | Median | 31.00000 |
| Travel & Hospitality | Median | 61.00000 |
(p <- ggplot(inc2, aes(x = reorder(Industry, Amount), y = Amount)) +
  geom_col(aes(fill = statType), position = "dodge") +
  coord_flip() +
  labs(y = "Employees (n)", x = "",
       title = "New York State Employment",
       subtitle = "Employment segmented by Industry",
       caption = "Source: Inc. Magazine (2016)") +
  theme_ipsum_rc(grid = "X") +
  theme(axis.text.y = element_text(size = 8),
        legend.title = element_blank()))
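
The question also asks for the variability of the ranges and for some handling of outliers. One possible approach, sketched below on a small hypothetical data set (`ny_toy`, not the real New York subset), is to summarise each industry by its median and quartiles and to draw a box plot on a log scale with the outlier points hidden. Hiding outliers and using a log axis are choices made for this sketch, not the only way to handle extreme values.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the New York subset of `inc`
ny_toy <- tibble(
  Industry  = rep(c("Software", "Retail", "Health"), each = 5),
  Employees = c(20, 45, 80, 120, 3000,
                 5, 10, 14,  30,   60,
                25, 40, 45,  90,  400)
)

# Median and quartiles per industry; these are robust to extreme values
ny_spread <- ny_toy %>%
  group_by(Industry) %>%
  summarise(Median = median(Employees),
            Q1 = quantile(Employees, 0.25),
            Q3 = quantile(Employees, 0.75),
            .groups = "drop")

# Box plot on a log10 axis: boxes show the spread, outlier points are hidden
p_box <- ggplot(ny_toy, aes(x = reorder(Industry, Employees, FUN = median),
                            y = Employees)) +
  geom_boxplot(outlier.shape = NA) +
  scale_y_log10() +
  coord_flip() +
  labs(x = "", y = "Employees (log scale)")

p_box
```

In the toy data, the single 3,000-employee company pulls the Software mean far above its median, which is the same pattern visible for Business Products & Services in the table above.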

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

# Revenue per employee in thousands of dollars, averaged by industry (New York only)
inc3 <- inc %>% 
  filter(State == "NY") %>% 
  filter(complete.cases(.)) %>% 
  mutate(RevPerEmployee = (Revenue / Employees) / 1000) %>% 
  group_by(Industry) %>% 
  summarise(Mean = mean(RevPerEmployee))

kable(inc3, format = "markdown")
| Industry | Mean |
|:---|---:|
| Advertising & Marketing | 373.4035 |
| Business Products & Services | 527.8169 |
| Computer Hardware | 520.4545 |
| Construction | 238.6945 |
| Consumer Products & Services | 382.9426 |
| Education | 112.0606 |
| Energy | 8472.5335 |
| Engineering | 215.7447 |
| Environmental Services | 134.3667 |
| Financial Services | 400.1744 |
| Food & Beverage | 174.6309 |
| Government Services | 158.8235 |
| Health | 532.4910 |
| Human Resources | 337.3663 |
| Insurance | 371.0000 |
| IT Services | 228.8161 |
| Logistics & Transportation | 1245.8701 |
| Manufacturing | 665.8186 |
| Media | 333.5496 |
| Real Estate | 383.8095 |
| Retail | 520.7903 |
| Security | 153.2778 |
| Software | 143.7490 |
| Telecommunications | 408.1434 |
| Travel & Hospitality | 282.0898 |
(p <- ggplot(inc3, aes(x = reorder(Industry, Mean), y = Mean)) +
  # A constant colour belongs outside aes(); inside aes() the string "Blue"
  # is treated as a category and drawn in the default palette colour
  geom_col(fill = "blue") +
  coord_flip() +
  labs(y = "Revenue Per Employee", x = "",
       title = "NY Revenue Per Employee by Industry",
       subtitle = "$000",
       caption = "Source: Inc. Magazine (2016)") +
  theme_ipsum_rc(grid = "X") +
  theme(axis.text.y = element_text(size = 8),
        legend.position = "none"))
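
The mean of per-company ratios used above can be pulled upward by one unusually productive company, which may be part of why Energy stands out in the table. The sketch below compares three summaries on a small hypothetical industry (`toy`, invented numbers): the mean of the ratios, the median of the ratios, and total revenue divided by total employees. The column names `RevPerEmp`, `MeanOfRatios`, `MedianOfRatios`, and `RatioOfTotals` are illustrative.

```r
library(dplyr)

# Hypothetical companies in one industry; revenue in dollars
toy <- tibble(
  Industry  = c("Energy", "Energy", "Energy"),
  Revenue   = c(1e6, 2e6, 90e6),
  Employees = c(10, 20, 10)
)

rev_summary <- toy %>%
  mutate(RevPerEmp = Revenue / Employees / 1000) %>%  # $000 per employee
  group_by(Industry) %>%
  summarise(MeanOfRatios   = mean(RevPerEmp),
            MedianOfRatios = median(RevPerEmp),
            RatioOfTotals  = sum(Revenue) / sum(Employees) / 1000,
            .groups = "drop")

rev_summary
```

Two of the three toy companies earn $100K per employee, yet the mean of the ratios is roughly thirty times that, so reporting the median or the ratio of totals alongside the mean gives an investor a fuller picture.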