DATA 608 - Module 1

library(tidyverse)
library(dlookr)

Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. Let’s read this in:

inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)

inc <- as_tibble(inc)
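The output below shows the text columns as factors (`<fct>`). Since R 4.0.0, read.csv() returns character columns unless stringsAsFactors = TRUE is passed, so the column types you see may differ. A minimal sketch of the difference, using a small inline CSV (the variable `csv_text` and its two rows are a hypothetical stand-in for the real file):

```r
# Two rows in the same layout as the Inc. 5000 file (hypothetical subset)
csv_text <- "Rank,Name,State\n1,Fuhu,CA\n4,Bridger,TX"

# Strings kept as character vectors
as_chr <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

# Strings converted to factors, matching the <fct> columns shown in this document
as_fct <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)

str(as_chr)
str(as_fct)
```

Functions that treat factors and characters differently, such as summary(), will report different things depending on which form is used.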

And let’s preview this data:

head(inc)

## # A tibble: 6 x 8
##    Rank Name       Growth_Rate  Revenue Industry     Employees City   State
##   <int> <fct>            <dbl>    <dbl> <fct>            <int> <fct>  <fct>
## 1     1 Fuhu              421.   1.18e8 Consumer Pr~       104 El Se~ CA   
## 2     2 FederalCo~        248.   4.96e7 Government ~        51 Dumfr~ VA   
## 3     3 The HCI G~        245.   2.55e7 Health             132 Jacks~ FL   
## 4     4 Bridger           233.   1.90e9 Energy              50 Addis~ TX   
## 5     5 DataXu            213.   8.70e7 Advertising~       220 Boston MA   
## 6     6 MileStone~        179.   4.57e7 Real Estate         63 Austin TX

summary(inc)

##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2502   110 Consulting        :   1   Mean   :  4.612  
##  3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4995                    
##     Revenue                                  Industry      Employees      
##  Min.   :2.000e+06   IT Services                 : 733   Min.   :    1.0  
##  1st Qu.:5.100e+06   Business Products & Services: 482   1st Qu.:   25.0  
##  Median :1.090e+07   Advertising & Marketing     : 471   Median :   53.0  
##  Mean   :4.822e+07   Health                      : 355   Mean   :  232.7  
##  3rd Qu.:2.860e+07   Software                    : 342   3rd Qu.:  132.0  
##  Max.   :1.010e+10   Financial Services          : 260   Max.   :66803.0  
##                      (Other)                     :2358   NA's   :12       
##             City          State     
##  New York     : 160   CA     : 701  
##  Chicago      :  90   TX     : 387  
##  Austin       :  88   NY     : 311  
##  Houston      :  76   VA     : 283  
##  San Francisco:  75   FL     : 282  
##  Atlanta      :  74   IL     : 273  
##  (Other)      :4438   (Other):2764

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

My Approach

We have looked at a summary and the top of the data, but not the bottom. I’ll use tail() to review the last rows, and then use some dlookr functions to develop a better understanding of the data.

tail(inc)

## # A tibble: 6 x 8
##    Rank Name      Growth_Rate  Revenue Industry      Employees City   State
##   <int> <fct>           <dbl>    <dbl> <fct>             <int> <fct>  <fct>
## 1  4996 cSubs            0.34   1.34e7 Business Pro~        19 Montv~ NJ   
## 2  4997 Dot Foods        0.34   4.50e9 Food & Bever~      3919 Mt. S~ IL   
## 3  4998 Lethal P~        0.34   6.80e6 Retail                8 Welli~ FL   
## 4  4999 ArcaTech~        0.34   3.26e7 Financial Se~        63 Mebane NC   
## 5  5000 INE              0.34   6.80e6 IT Services          35 Belle~ WA   
## 6  5000 ALL4             0.34   4.70e6 Environmenta~        34 Kimbe~ PA

The summary() and tail() functions reveal that some cleaning is required in the Name variable. The tail output also shows two companies sharing rank 5000.
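The tail output lists rank 5000 twice, which is worth confirming programmatically. A minimal sketch in base R, using a hypothetical data frame `toy` built from four of the rows printed above in place of the full `inc`:

```r
# Hypothetical stand-in for `inc`, copied from the tail output above
toy <- data.frame(
  Rank = c(4996, 4997, 5000, 5000),
  Name = c("cSubs", "Dot Foods", "INE", "ALL4")
)

# Ranks that occur more than once
rank_counts <- table(toy$Rank)
dup_ranks <- as.integer(names(rank_counts[rank_counts > 1]))

# Rows that share a rank with another row
dup_rows <- toy[toy$Rank %in% dup_ranks, ]
print(dup_rows)
```

Running the same check on `inc` would list every tied rank, not just the one visible in the last six rows.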

dlookr Package

The dlookr diagnose() function diagnoses the variables in a data frame.

The package provides a variety of functions that make it easier to understand your data and its challenges.

diagnose(inc)

## # A tibble: 8 x 6
##   variables   types  missing_count missing_percent unique_count unique_rate
##   <chr>       <chr>          <int>           <dbl>        <int>       <dbl>
## 1 Rank        integ~             0           0             4999     1.000  
## 2 Name        factor             0           0             5001     1      
## 3 Growth_Rate numer~             0           0             1147     0.229  
## 4 Revenue     numer~             0           0             1069     0.214  
## 5 Industry    factor             0           0               25     0.00500
## 6 Employees   integ~            12           0.240          692     0.138  
## 7 City        factor             0           0             1519     0.304  
## 8 State       factor             0           0               52     0.0104

This output clearly shows the variable types and reveals some missing data for Employees (12 rows, about 0.24%).
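The missing counts reported by diagnose() can be cross-checked with base R. The sketch below uses a hypothetical three-row data frame `toy` rather than `inc`, and is not a description of how dlookr computes its columns:

```r
# Hypothetical stand-in for `inc` with one missing Employees value
toy <- data.frame(
  Revenue   = c(1.18e8, 4.96e7, 2.55e7),
  Employees = c(104, NA, 132)
)

# Count and percentage of missing values per column
missing_count   <- colSums(is.na(toy))
missing_percent <- 100 * missing_count / nrow(toy)
print(data.frame(missing_count, missing_percent))

# Keep only rows with no missing values in any column
toy_complete <- toy[complete.cases(toy), ]
```

The last line is the same complete.cases() filter that Question 2 asks for.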

diagnose_numeric(inc)

## # A tibble: 4 x 10
##   variables     min      Q1   mean median     Q3     max  zero minus
##   <chr>       <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl> <int> <int>
## 1 Rank      1.00e+0 1.25e+3 2.50e3 2.50e3 3.75e3 5.00e 3     0     0
## 2 Growth_R~ 3.40e-1 7.70e-1 4.61e0 1.42e0 3.29e0 4.21e 2     0     0
## 3 Revenue   2.00e+6 5.10e+6 4.82e7 1.09e7 2.86e7 1.01e10     0     0
## 4 Employees 1.00e+0 2.50e+1 2.33e2 5.30e1 1.32e2 6.68e 4     0     0
## # ... with 1 more variable: outlier <int>

diagnose_category(inc)

## # A tibble: 5,032 x 6
##    variables levels                                N  freq  ratio  rank
##    <chr>     <fct>                             <int> <int>  <dbl> <int>
##  1 Name      (Add)ventures                      5001     1 0.0200     1
##  2 Name      @Properties                        5001     1 0.0200     2
##  3 Name      1-Stop Translation USA             5001     1 0.0200     3
##  4 Name      110 Consulting                     5001     1 0.0200     4
##  5 Name      11thStreetCoffee.com               5001     1 0.0200     5
##  6 Name      123 Exteriors                      5001     1 0.0200     6
##  7 Name      1st American Systems and Services  5001     1 0.0200     7
##  8 Name      1st Equity                         5001     1 0.0200     8
##  9 Name      2020 Exhibits                      5001     1 0.0200     9
## 10 Name      206inc                             5001     1 0.0200    10
## # ... with 5,022 more rows

diagnose_numeric() provides some descriptive statistics and outlier information on the numeric variables. diagnose_category() returns diagnostic information for the non-numeric variables.
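To make "outlier information" concrete: one common definition is the 1.5 × IQR rule used by base R's boxplot.stats(). The sketch below applies that rule to a hypothetical vector of employee counts. It illustrates the rule only and is not a claim about how dlookr derives its outlier column.

```r
# Hypothetical employee counts, with one very large company
employees <- c(8, 19, 34, 35, 50, 63, 104, 132, 220, 3919)

# boxplot.stats() flags points more than 1.5 * IQR beyond the hinges
stats <- boxplot.stats(employees)
outliers <- stats$out
print(outliers)

# Compare the mean with and without the flagged values
mean_all  <- mean(employees)
mean_trim <- mean(employees[!employees %in% outliers])
print(c(mean_all = mean_all, mean_trim = mean_trim))
```

A single large company pulls the mean far above the typical value, which is why the mean and median of Employees differ so much in the summaries above.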

describe(inc)

## # A tibble: 4 x 26
##   variable     n    na   mean     sd se_mean    IQR skewness kurtosis
##   <chr>    <int> <int>  <dbl>  <dbl>   <dbl>  <dbl>    <dbl>    <dbl>
## 1 Rank      5001     0 2.50e3 1.44e3 2.04e+1 2.50e3 -4.90e-4    -1.20
## 2 Growth_~  5001     0 4.61e0 1.41e1 2.00e-1 2.52e0  1.26e+1   243.  
## 3 Revenue   5001     0 4.82e7 2.41e8 3.40e+6 2.35e7  2.22e+1   724.  
## 4 Employe~  4989    12 2.33e2 1.35e3 1.92e+1 1.07e2  2.98e+1  1270.  
## # ... with 17 more variables: p00 <dbl>, p01 <dbl>, p05 <dbl>, p10 <dbl>,
## #   p20 <dbl>, p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>, p60 <dbl>,
## #   p70 <dbl>, p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>, p99 <dbl>,
## #   p100 <dbl>

normality(inc)

## # A tibble: 4 x 4
##   vars        statistic  p_value sample
##   <chr>           <dbl>    <dbl>  <dbl>
## 1 Rank            0.955 9.48e-37   5000
## 2 Growth_Rate     0.252 4.18e-89   5000
## 3 Revenue         0.135 1.91e-92   5000
## 4 Employees       0.106 3.69e-93   5000

describe() and normality() provide some additional information on skewness and on how far each variable departs from normality.
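The sketch below shows how skewness and a normality test behave on strongly right-skewed data. It assumes the statistic reported above is the Shapiro-Wilk W, and it uses simulated log-normal values (hypothetical data, not the Inc. figures) with base R's shapiro.test(), which accepts between 3 and 5000 observations.

```r
set.seed(608)

# Simulated right-skewed values standing in for Revenue (hypothetical data)
x <- rlnorm(500, meanlog = 16, sdlog = 1.5)

# Sample skewness: third central moment over the cubed standard deviation
skewness <- mean((x - mean(x))^3) / (mean((x - mean(x))^2))^1.5

# Shapiro-Wilk test on the raw values and on their logarithm
sw_raw <- shapiro.test(x)
sw_log <- shapiro.test(log(x))

print(c(skewness = skewness,
        W_raw = unname(sw_raw$statistic),
        W_log = unname(sw_log$statistic)))
```

A W statistic close to 1 is consistent with normality. The log transform moves W much closer to 1 here, which suggests a log scale may suit variables such as Revenue and Employees.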

Question 1

Create a graph that shows the distribution of companies in the dataset by State (i.e., how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (i.e., taller than wide), which should further guide your layout choices.

library(hrbrthemes)
library(ggthemes)
library(tidyverse)
library(kableExtra)


# Count companies per state, largest first
state <- inc %>%
  count(State) %>%
  arrange(desc(n))

# Horizontal bars suit a tall, portrait-oriented screen
p <- ggplot(state, aes(x = reorder(State, n), y = n, fill = n)) +
  geom_col() +
  # nudge_y is in data units; keep it small so labels stay inside the axis limits
  geom_text(aes(label = scales::comma(n)), hjust = 0, nudge_y = 5) +
  scale_y_comma(limits = c(0, 800)) +
  coord_flip() +
  labs(x = "", y = "Companies per state (n)",
       title = "Fastest Growing Companies",
       subtitle = "Number of high growth companies by state.",
       caption = "Source: Inc. Magazine (2016)") +
  theme_ipsum(grid = "X") +
  theme(legend.title = element_blank(),
        axis.text.y = element_text(size = 7))

p

Question 2

Let’s dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data; use R’s complete.cases() function). In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.
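The state with the 3rd most companies can be read from the State counts in the summary() output near the top. As a small sketch, the counts below are copied from that output into a named vector (`state_counts` is an illustrative name, not part of the original code):

```r
# State counts copied from the summary() output above
state_counts <- c(CA = 701, TX = 387, NY = 311, VA = 283, FL = 282, IL = 273)

# Sort from largest to smallest and take the third name
third_state <- names(sort(state_counts, decreasing = TRUE))[3]
print(third_state)
```

This is why the code that follows filters on State == "NY".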

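# Mean and median employment per industry for complete New York rows
inc2 <- inc %>% 
  filter(State == "NY") %>% 
  filter(complete.cases(.)) %>% 
  group_by(Industry) %>% 
  summarise(Mean = mean(Employees),
            Median = median(Employees)) %>% 
  # Reshape to long form: one row per industry and statistic
  gather(statType, Amount, Mean, Median)

kable(inc2, format = "markdown")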

Industry	statType	Amount
Advertising & Marketing	Mean	58.43860
Business Products & Services	Mean	1492.46154
Computer Hardware	Mean	44.00000
Construction	Mean	61.00000
Consumer Products & Services	Mean	626.29412
Education	Mean	59.85714
Energy	Mean	129.20000
Engineering	Mean	53.50000
Environmental Services	Mean	155.00000
Financial Services	Mean	144.30769
Food & Beverage	Mean	76.44444
Government Services	Mean	17.00000
Health	Mean	81.84615
Human Resources	Mean	437.54545
Insurance	Mean	32.50000
IT Services	Mean	204.09302
Logistics & Transportation	Mean	29.50000
Manufacturing	Mean	73.30769
Media	Mean	108.00000
Real Estate	Mean	18.25000
Retail	Mean	24.78571
Security	Mean	135.00000
Software	Mean	245.92308
Telecommunications	Mean	95.35294
Travel & Hospitality	Mean	547.71429
Advertising & Marketing	Median	38.00000
Business Products & Services	Median	70.50000
Computer Hardware	Median	44.00000
Construction	Median	24.50000
Consumer Products & Services	Median	25.00000
Education	Median	50.50000
Energy	Median	120.00000
Engineering	Median	54.50000
Environmental Services	Median	155.00000
Financial Services	Median	81.00000
Food & Beverage	Median	41.00000
Government Services	Median	17.00000
Health	Median	45.00000
Human Resources	Median	56.00000
Insurance	Median	32.50000
IT Services	Median	54.00000
Logistics & Transportation	Median	23.50000
Manufacturing	Median	30.00000
Media	Median	45.00000
Real Estate	Median	18.00000
Retail	Median	13.50000
Security	Median	32.50000
Software	Median	80.00000
Telecommunications	Median	31.00000
Travel & Hospitality	Median	61.00000

# Dodged bars compare the mean and median for each industry
(p <- ggplot(inc2, aes(x = reorder(Industry, Amount), y = Amount)) +
  geom_bar(stat = 'identity', aes(fill = statType), position = 'dodge') +
  coord_flip() + 
  labs(y = "Employees (n)", x = "",
       title = "New York State Employment",
       subtitle = "Employment segmented by Industry",
       caption = "Source: Inc. Magazine (2016)") + 
  theme_ipsum_rc(grid = "X") +
  theme(axis.text.y = element_text(size = 8),
        legend.title = element_blank()))
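The bar chart compares means and medians but does not show the spread within each industry. One possible way to show variability while keeping extreme companies readable is a boxplot on a log scale. The sketch below is an alternative under stated assumptions, not the approach used above: `toy` is a hypothetical stand-in for the New York rows of `inc`, and the numbers are invented for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the New York rows of `inc`
toy <- data.frame(
  Industry  = rep(c("Software", "Retail", "Media"), each = 6),
  Employees = c(20, 45, 80, 95, 150, 2500,
                5, 9, 13, 14, 30, 60,
                12, 30, 45, 50, 90, 400)
)

# Interquartile range per industry as a simple numeric measure of spread
spread <- aggregate(Employees ~ Industry, data = toy, FUN = IQR)
print(spread)

# Boxplots show the median and spread; the log scale compresses very large firms
p_box <- ggplot(toy, aes(x = reorder(Industry, Employees, FUN = median),
                         y = Employees)) +
  geom_boxplot() +
  scale_y_log10() +
  coord_flip() +
  labs(x = "", y = "Employees (log scale)")

p_box
```

With a log scale the box statistics are computed on the transformed values, so the points drawn as outliers can differ from those flagged on the raw scale.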

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

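# Mean revenue per employee by industry for complete New York rows,
# with Revenue / Employees scaled to thousands
inc3 <- inc %>% 
  filter(State == "NY") %>% 
  filter(complete.cases(.)) %>% 
  mutate(RevPerEmployee = (Revenue / Employees) / 1000) %>% 
  group_by(Industry) %>% 
  summarise(Mean = mean(RevPerEmployee))

kable(inc3, format = "markdown")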

Industry	Mean
Advertising & Marketing	373.4035
Business Products & Services	527.8169
Computer Hardware	520.4545
Construction	238.6945
Consumer Products & Services	382.9426
Education	112.0606
Energy	8472.5335
Engineering	215.7447
Environmental Services	134.3667
Financial Services	400.1744
Food & Beverage	174.6309
Government Services	158.8235
Health	532.4910
Human Resources	337.3663
Insurance	371.0000
IT Services	228.8161
Logistics & Transportation	1245.8701
Manufacturing	665.8186
Media	333.5496
Real Estate	383.8095
Retail	520.7903
Security	153.2778
Software	143.7490
Telecommunications	408.1434
Travel & Hospitality	282.0898
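The Energy figure in the table above is many times larger than any other industry, which suggests the mean may be driven by a few companies. A minimal sketch of comparing the mean with the median of revenue per employee, using a hypothetical data frame `toy` (invented values, not rows from the New York subset):

```r
# Hypothetical stand-in for `inc`
toy <- data.frame(
  Industry  = c("Energy", "Energy", "Energy", "Retail", "Retail", "Retail"),
  Revenue   = c(1.9e9, 2.0e7, 1.5e7, 6.8e6, 9.0e6, 1.2e7),
  Employees = c(50, 100, 60, 8, 30, 40)
)

# Revenue per employee, scaled to thousands
toy$RevPerEmp <- toy$Revenue / toy$Employees / 1000

# Mean and median per industry; the median resists a single extreme company
by_industry <- aggregate(
  RevPerEmp ~ Industry, data = toy,
  FUN = function(v) c(Mean = mean(v), Median = median(v))
)
# Flatten the matrix column into RevPerEmp.Mean and RevPerEmp.Median
by_industry <- do.call(data.frame, by_industry)
print(by_industry)
```

Reporting the median alongside the mean, or plotting the per-company values by industry, would show the distribution that the question asks for.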