Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:

inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)

And lets preview this data:

head(inc)
##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX
summary(inc)
##       Rank          Name            Growth_Rate         Revenue         
##  Min.   :   1   Length:5001        Min.   :  0.340   Min.   :2.000e+06  
##  1st Qu.:1252   Class :character   1st Qu.:  0.770   1st Qu.:5.100e+06  
##  Median :2502   Mode  :character   Median :  1.420   Median :1.090e+07  
##  Mean   :2502                      Mean   :  4.612   Mean   :4.822e+07  
##  3rd Qu.:3751                      3rd Qu.:  3.290   3rd Qu.:2.860e+07  
##  Max.   :5000                      Max.   :421.480   Max.   :1.010e+10  
##                                                                         
##    Industry           Employees           City              State          
##  Length:5001        Min.   :    1.0   Length:5001        Length:5001       
##  Class :character   1st Qu.:   25.0   Class :character   Class :character  
##  Mode  :character   Median :   53.0   Mode  :character   Mode  :character  
##                     Mean   :  232.7                                        
##                     3rd Qu.:  132.0                                        
##                     Max.   :66803.0                                        
##                     NA's   :12

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

Environmental Dependancies

library( dplyr )
library( kableExtra )
library( ggplot2 )
library( forcats )
library( scales )
# Insert your code here, create more chunks as necessary
# I find it helpful use dplyr's glimpse() function to view a dataframe
glimpse( inc )
## Rows: 5,001
## Columns: 8
## $ Rank        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Name        <chr> "Fuhu", "FederalConference.com", "The HCI Group", "Bridger…
## $ Growth_Rate <dbl> 421.48, 248.31, 245.45, 233.08, 213.37, 179.38, 174.04, 17…
## $ Revenue     <dbl> 1.179e+08, 4.960e+07, 2.550e+07, 1.900e+09, 8.700e+07, 4.5…
## $ Industry    <chr> "Consumer Products & Services", "Government Services", "He…
## $ Employees   <int> 104, 51, 132, 50, 220, 63, 27, 75, 97, 15, 149, 165, 250, …
## $ City        <chr> "El Segundo", "Dumfries", "Jacksonville", "Addison", "Bost…
## $ State       <chr> "CA", "VA", "FL", "TX", "MA", "TX", "TN", "CA", "UT", "RI"…

glimpse organizes the dataframe in a more succinct way than the head function and, in combination with the statistical output of the summary function hints at several ways that we might explore the data further to understand it further:

  1. Select a Categorical variable to group by and view the summary of other features
  2. Select a Numeric variable to rank other features

Aggregating by Categorical variables:

State-Level Data The following code will group data features by State, although a similar analysis can be performed by city

#Geographic grouping (State) & summarize several features
inc_state <- inc %>%
  group_by( State ) %>%
  summarise( num_companies = n(), 
             num_city = n_distinct( City ), 
             mean_Rev = mean( Revenue ), 
             mean_Growth = mean( Growth_Rate ),
             mean_Rank = mean( Rank ),
             mean_Size = mean( Employees, na.rm=TRUE ),
             num_Industry = n_distinct( Industry ) )

A full view of state level data sorted in descending order by the number of companies:

inc_state %>% 
  arrange( desc( num_companies ) ) %>%
  kbl() %>%
  kable_classic(bootstrap_options = "striped",full_width = F)
State num_companies num_city mean_Rev mean_Growth mean_Rank mean_Size num_Industry
CA 701 204 33463481 5.897489 2273.867 230.31286 25
TX 387 67 57271835 6.024186 2468.930 235.14249 24
NY 311 90 58715113 4.371158 2611.688 271.28617 25
VA 283 56 30627915 4.877350 2437.286 126.03180 24
FL 282 102 37625177 5.846099 2366.695 217.09574 25
IL 273 104 121773993 3.738755 2530.011 379.65441 25
GA 212 45 30510849 3.523538 2499.193 163.72512 22
OH 186 79 68745161 3.557527 2739.651 204.31183 24
MA 182 72 33174725 5.416648 2522.874 135.61538 24
PA 164 80 34568902 2.566585 2755.872 186.45399 24
NJ 158 97 29574684 4.445380 2453.924 190.89873 22
NC 137 46 67580292 3.511679 2535.482 271.74074 22
CO 134 28 31270149 4.952015 2481.709 198.78195 23
MD 131 44 25193893 4.984809 2255.901 308.69466 22
WA 130 30 27156923 4.002692 2447.946 135.00775 19
MI 126 52 61950794 2.238571 2721.484 292.89683 22
AZ 100 13 55015000 4.616700 2488.230 342.81000 22
UT 95 26 36038947 6.307790 2234.516 200.29474 20
MN 88 37 57256818 3.821477 2496.966 210.61364 19
TN 82 21 35870732 4.950366 2789.878 177.87805 20
WI 79 43 92362025 2.690253 2968.519 201.92208 20
IN 69 30 50105797 4.788261 2150.290 184.01449 21
MO 59 27 45164407 2.497288 2811.085 293.15254 18
AL 51 13 25907843 2.407451 2728.137 125.35294 18
CT 50 34 49486000 4.994600 2495.440 139.78000 18
OR 49 12 28589796 3.148367 2726.918 89.77551 16
SC 48 22 30993750 6.060625 2397.167 111.41667 18
OK 46 12 41015217 3.097174 2582.065 151.65217 15
DC 43 3 76344186 8.298139 2161.047 219.54762 13
KY 40 13 33232500 2.064000 2693.725 138.60000 14
KS 38 16 40752632 3.628684 2737.842 229.60526 17
LA 37 15 56648649 1.944595 2700.973 288.35135 13
IA 28 18 123142857 1.761071 3195.964 405.14286 13
NE 27 7 40696296 2.078889 3343.333 141.59259 13
NV 26 6 19915385 2.330769 2778.308 66.34615 11
NH 24 15 41691667 1.512917 2870.708 120.41667 13
ID 17 12 231523529 2.645294 2561.824 342.17647 10
DE 16 7 42300000 2.420000 2424.562 4284.00000 11
RI 16 9 46981250 16.031250 1513.500 185.25000 12
ME 13 5 12476923 16.210000 1951.615 67.61538 8
MS 12 8 43766667 5.642500 2641.417 460.91667 9
ND 10 5 18240000 1.227000 2965.200 96.30000 8
AR 9 7 8333333 1.670000 3449.222 55.11111 7
HI 7 4 99485714 6.792857 1453.571 88.71429 5
VT 6 4 46200000 1.296667 3089.167 178.16667 6
NM 5 2 9640000 1.364000 2965.000 123.40000 5
MT 4 2 6150000 0.762500 3835.500 418.25000 3
SD 3 3 5900000 1.406667 2719.333 253.66667 3
AK 2 2 171500000 4.805000 2634.500 1264.00000 2
WV 2 2 15650000 0.620000 4164.500 120.00000 2
WY 2 2 34750000 19.145000 1998.500 53.50000 2
PR 1 1 2300000 1.730000 2140.000 29.00000 1

The above table organizes State level data summaries by the number of top 5000 ranked companies (num_companies). Predictably, we see that the larger states (California, CA; Texas, TX; New York, NY) have more companies that made the ranked listing compared to smaller states (e.g. West Virginia, WV; Wyoming, WY). Additionally, we can see that states with more ranked companies also have more represented industries (num_Industry); this can be interpreted as a measure of a state’s economic diversity.

The table above is very dense with information. We can subset the aggregated data to pull out details of interest. For example, the following table organizes the top 10 states by mean Growth Rate

#sort by a feature to rearrange the data
inc_state %>% 
  select( State, mean_Growth ) %>%
  arrange( desc( mean_Growth ) ) %>%
  top_n(10) %>%
  kbl() %>%
  kable_classic(bootstrap_options = "striped",full_width = F)
## Selecting by mean_Growth
State mean_Growth
WY 19.145000
ME 16.210000
RI 16.031250
DC 8.298139
HI 6.792857
UT 6.307790
SC 6.060625
TX 6.024186
CA 5.897489
FL 5.846099

Industry-Level Data Industry is another interesting categorical variable organize the data by. The following lists the unique Industry categories:

unique( inc$Industry )
##  [1] "Consumer Products & Services" "Government Services"         
##  [3] "Health"                       "Energy"                      
##  [5] "Advertising & Marketing"      "Real Estate"                 
##  [7] "Financial Services"           "Retail"                      
##  [9] "Software"                     "Computer Hardware"           
## [11] "Logistics & Transportation"   "Food & Beverage"             
## [13] "IT Services"                  "Business Products & Services"
## [15] "Education"                    "Construction"                
## [17] "Manufacturing"                "Telecommunications"          
## [19] "Security"                     "Human Resources"             
## [21] "Travel & Hospitality"         "Media"                       
## [23] "Environmental Services"       "Engineering"                 
## [25] "Insurance"

Here, the data is aggregated by Industry and several summary metrics are tabulated

#Aggregating by Industry & summarize several features
inc_industry <- inc %>%
  group_by( Industry ) %>%
  summarise( num_companies = n(), 
             num_city = n_distinct( City ), 
             mean_Rev = mean( Revenue ), 
             mean_Growth = mean( Growth_Rate ),
             mean_Rank = mean( Rank ),
             mean_Size = mean( Employees, na.rm=TRUE ),
             num_State = n_distinct( State ) )
#A full view of industry level data sorted in descending order by the number of companies:
inc_industry %>% 
  arrange( desc( num_companies ) ) %>%
  kbl() %>%
  kable_classic(bootstrap_options = "striped",full_width = F)
Industry num_companies num_city mean_Rev mean_Growth mean_Rank mean_Size num_State
IT Services 733 389 28214598 3.331815 2540.729 140.42076 44
Business Products & Services 482 293 54705187 3.518486 2878.041 244.49375 42
Advertising & Marketing 471 252 16528662 6.225478 2358.718 84.35456 43
Health 355 258 50319437 4.856394 2415.746 232.85311 44
Software 342 202 23802924 5.020643 2328.196 150.32845 44
Financial Services 260 191 50580385 5.435308 2352.785 183.43462 39
Manufacturing 256 224 49546875 2.295391 3000.590 172.32157 38
Consumer Products & Services 203 152 73676847 8.776108 2193.768 223.96059 34
Retail 203 164 50529064 6.184729 2233.729 182.60099 37
Government Services 202 102 29748020 7.238168 2087.866 129.62871 30
Human Resources 196 130 47173980 3.300459 2426.398 1158.06122 33
Construction 187 157 70450802 3.366684 2642.241 155.60963 36
Logistics & Transportation 155 124 95745161 4.339226 2580.284 259.70130 32
Food & Beverage 131 108 98559542 3.636565 2649.435 510.93798 32
Telecommunications 129 103 56855814 2.883721 2596.504 242.85039 28
Energy 109 83 126344954 9.603303 2038.220 242.54128 31
Real Estate 96 75 30892708 7.746667 2017.740 198.87368 27
Education 83 68 13726506 3.642651 2758.000 92.59036 25
Engineering 74 65 34222973 1.984324 2733.095 276.14865 26
Security 73 63 52230137 3.388904 2494.041 562.45205 30
Travel & Hospitality 62 52 47283871 2.353064 2906.242 371.53226 20
Media 54 39 32266667 4.374074 2222.870 176.51852 21
Environmental Services 51 50 51741176 2.068039 2699.118 199.11765 22
Insurance 50 49 46758000 2.008400 2899.400 146.78000 23
Computer Hardware 44 39 270129545 4.089773 3008.068 220.77273 17

This organized view of Industry listed in descending order of num_companies informs on what industries have a dominate presence in the ranking of fastest growing companies. However, we can draw subsets from the aggregate to highlight details of the dataset. The following ranks the top 10 Industry by mean Growth Rate (mean_growth)

inc_industry %>% 
  select( Industry, mean_Growth ) %>%
  arrange( desc( mean_Growth ) ) %>%
  top_n(10) %>%
  kbl() %>%
  kable_classic(bootstrap_options = "striped",full_width = F)
## Selecting by mean_Growth
Industry mean_Growth
Energy 9.603303
Consumer Products & Services 8.776108
Real Estate 7.746667
Government Services 7.238168
Advertising & Marketing 6.225478
Retail 6.184729
Financial Services 5.435308
Software 5.020643
Health 4.856394
Media 4.374074

Ranking the data by Numeric variables:

Aggregating the data by categorical features provides some informative summaries of the data. However, we may also be interested in drawing information based on one of the numeric features. For example, the following code will rank the 10 top companies by highest Growth Rate:

inc %>%
  arrange( Rank ) %>%
  top_n( 10, Growth_Rate ) %>%
  kbl() %>%
  kable_classic(bootstrap_options = "striped",full_width = F)
Rank Name Growth_Rate Revenue Industry Employees City State
1 Fuhu 421.48 1.179e+08 Consumer Products & Services 104 El Segundo CA
2 FederalConference.com 248.31 4.960e+07 Government Services 51 Dumfries VA
3 The HCI Group 245.45 2.550e+07 Health 132 Jacksonville FL
4 Bridger 233.08 1.900e+09 Energy 50 Addison TX
5 DataXu 213.37 8.700e+07 Advertising & Marketing 220 Boston MA
6 MileStone Community Builders 179.38 4.570e+07 Real Estate 63 Austin TX
7 Value Payment Systems 174.04 2.550e+07 Financial Services 27 Nashville TN
8 Emerge Digital Group 170.64 2.390e+07 Advertising & Marketing 75 San Francisco CA
9 Goal Zero 169.81 3.310e+07 Consumer Products & Services 97 Bluffdale UT
10 Yagoozon 166.89 1.860e+07 Retail 15 Warwick RI

This view organizes the top 10 companies ranked by Revenue

inc %>%
  arrange( Rank ) %>%
  top_n( 10, Revenue ) %>%
  kbl() %>%
  kable_classic(bootstrap_options = "striped",full_width = F)
Rank Name Growth_Rate Revenue Industry Employees City State
1397 EnvisionRxOptions 2.88 2.70e+09 Health 625 Twinsburg OH
2522 DLA Piper 1.41 2.40e+09 Business Products & Services 4036 Chicago IL
3853 ABC Supply 0.73 4.70e+09 Construction 6549 Beloit WI
4052 Kum & Go 0.65 2.80e+09 Retail 4589 West Des Moines IA
4246 American Tire Distributors 0.59 3.50e+09 Consumer Products & Services 3341 Huntersville NC
4716 Westcon Group 0.44 3.80e+09 IT Services 3000 Tarrytown NY
4788 CDW 0.41 1.01e+10 Computer Hardware 6800 Vernon Hills IL
4802 Boise Cascade 0.41 2.80e+09 Construction 4470 Boise ID
4936 Coty 0.36 4.60e+09 Consumer Products & Services 10000 New York NY
4997 Dot Foods 0.34 4.50e+09 Food & Beverage 3919 Mt. Sterling IL

It is interesting that none of the top 10 in Growth_Rate are present in the top 10 companies listed by Revenue. Out of curiosity, let’s investigate to see how many of the top 100 Growth_Rate companies rank in the top 500 of Revenue:

num = 500
top_GR <- inc %>%
  arrange( Rank ) %>%
  top_n( num, Growth_Rate ) %>%
  select( Name )

top_Rev <- inc %>%
  arrange( Rank ) %>%
  top_n( num, Revenue ) %>%
  select( Name )

both <- intersect( top_GR, top_Rev )
per_both <- nrow( both )/num*100
paste0( 'Only ', per_both, '% of the top ', num, ' Growth Rate ranked companies feature among the top 100 in Revenue (on the list)' )
## [1] "Only 6.6% of the top 500 Growth Rate ranked companies feature among the top 100 in Revenue (on the list)"

That only 6.6% intersect in the two orderings of the data is an interesting outcome.

Question 1

Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.

y_lines = c( 200, 400, 600 )
# Answer Question 1 here
inc %>%
  group_by( State ) %>% #aggregate by state
  count() %>% #count the number of records for each state
  ggplot( aes( x= reorder(State,n), y= n ) ) +
  geom_col( ) +
  coord_flip() +
  labs(
    title = 'Number of top 5000 companies by State',
    x = 'State',
    y = '# Companies'
  ) + 
  theme_classic() +
  geom_hline(yintercept=y_lines, color="white", size=1)

Quesiton 2

Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

From visual inspection of the figure above, we see that the state with the 3rd most companies in the data set if New York (NY).

# Answer Question 2 here
backtrans <- function(y){log10(mean(10 ^ y)) }
inc %>%
  filter( State == 'NY' ) %>%  #filter records for NY state
  filter( complete.cases(.) ) %>% #filter only complete cases
  select( Industry, Employees ) %>% #select the features Industry & Employees
  mutate( Industry = as_factor( Industry ) ) %>% #recast Idustry as a factor
  mutate(zRT = scale(Employees)) %>% 
  filter(between(zRT,-2.5,+2.5)) %>% #filter outliers based on Z-score
  ggplot( aes(x = reorder( Industry, Employees, FUN = median ), y = Employees ) ) +
  geom_boxplot( alpha= 0.7, outlier.shape = NA ) +
  stat_summary(fun.y = backtrans, geom = "point", aes(group=1), colour = "red") +
  scale_y_log10() +
  coord_flip() +
  labs(
    title = 'Employment by Industry for New York State',
    subtitle = 'descending order by median, mean given in red',
    x = 'Industry',
    y = 'Employees'
  ) +
  theme_linedraw()

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

# Answer Question 3 here
inc %>%
  filter( complete.cases(.) ) %>% #filter incomplete cases
  mutate( rev_per_emp = Revenue/Employees ) %>% #create a new feature that quantifies revenue per employee
  ggplot( aes(x = reorder( Industry, rev_per_emp, FUN = median ), y = rev_per_emp )) +
  geom_boxplot( alpha= 0.7, outlier.shape = NA ) +
  stat_summary(fun.y = backtrans, geom = "point", aes(group=1), colour = "red") +
  scale_y_log10( labels = trans_format("log10", math_format(10^.x)) ) +
  coord_flip() +
  labs(
    title = 'Revenue per Employee by Industry',
    subtitle = 'descending order by Industry median, mean given in red',
    x = 'Industry',
    y = 'Revenue per Employee'
  ) +
  theme_linedraw()