Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. Let's read this in:

## Warning: package 'ggplot2' was built under R version 3.6.3

And let's preview this data:

head(inc)
##   Rank                         Name Growth_Rate    Revenue
## 1    1                         Fuhu      421.48  117900000
## 2    2        FederalConference.com      248.31   49600000
## 3    3                The HCI Group      245.45   25500000
## 4    4                      Bridger      233.08 1900000000
## 5    5                       DataXu      213.37   87000000
## 6    6 MileStone Community Builders      179.38   45700000
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX
summary(inc)
##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2502   110 Consulting        :   1   Mean   :  4.612  
##  3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4995                    
##     Revenue                                    Industry   
##  Min.   :    2000000   IT Services                 : 733  
##  1st Qu.:    5100000   Business Products & Services: 482  
##  Median :   10900000   Advertising & Marketing     : 471  
##  Mean   :   48222535   Health                      : 355  
##  3rd Qu.:   28600000   Software                    : 342  
##  Max.   :10100000000   Financial Services          : 260  
##                        (Other)                     :2358  
##    Employees                  City          State     
##  Min.   :    1.0   New York     : 160   CA     : 701  
##  1st Qu.:   25.0   Chicago      :  90   TX     : 387  
##  Median :   53.0   Austin       :  88   NY     : 311  
##  Mean   :  232.7   Houston      :  76   VA     : 283  
##  3rd Qu.:  132.0   San Francisco:  75   FL     : 282  
##  Max.   :66803.0   Atlanta      :  74   IL     : 273  
##  NA's   :12        (Other)      :4438   (Other):2764

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

First off, how many states are included in this data? If fewer than 50, which states are missing; if more than 50, which territories are included?

inc$State %>% unique() %>% length()
## [1] 52

52 is more than the 50 US states. Let's see which entries are extra.

inc$State %>% unique()
##  [1] CA VA FL TX MA TN UT RI SC DC NJ OH WA ME NY CO GA IL AZ NC MD MN OK
## [24] PA CT IN MS WI WY MI MO KS OR NE AL HI NV IA KY ID AK LA DE AR NH VT
## [47] NM SD ND PR MT WV
## 52 Levels: AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA ... WY

The two extra entries are DC and PR. The District of Columbia is a federal district that does not belong to any state, and Puerto Rico is a US territory.
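The extra codes can also be found programmatically instead of by reading the list. The following is a minimal sketch using base R's built-in `state.abb` vector of the 50 state abbreviations; `codes` is a small made-up stand-in for the unique values of `inc$State`, not the real column.

```r
# Hypothetical stand-in for unique(as.character(inc$State))
codes <- c("CA", "VA", "DC", "NY", "PR", "TX")

# state.abb ships with base R and holds the 50 state abbreviations
extra <- setdiff(codes, state.abb)
print(extra)

# The reverse comparison lists states that are absent from the data
missing_states <- setdiff(state.abb, codes)
print(length(missing_states))
```

On the real data the same two calls would take `unique(as.character(inc$State))` in place of `codes`.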

filter(inc, State == "PR")
##   Rank      Name Growth_Rate Revenue Industry Employees     City State
## 1 2140 Wovenware        1.73 2300000 Software        29 San Juan    PR

Puerto Rico has only one company in this data, so no useful insights about the territory can be drawn from it.

dc_stats<- filter(inc, State == "DC")
summary(dc_stats)
##       Rank                           Name     Growth_Rate     
##  Min.   :  15   Apprio                 : 1   Min.   :  0.400  
##  1st Qu.: 710   Arnold & Porter        : 1   1st Qu.:  0.930  
##  Median :1914   Atlas Research         : 1   Median :  1.980  
##  Mean   :2161   Barbaricum             : 1   Mean   :  8.298  
##  3rd Qu.:3367   Brailsford & Dunlavey  : 1   3rd Qu.:  6.525  
##  Max.   :4854   CaseDriven Technologies: 1   Max.   :123.330  
##                 (Other)                :37                    
##     Revenue                                   Industry    Employees      
##  Min.   :   2400000   Business Products & Services:10   Min.   :   7.00  
##  1st Qu.:   5400000   Government Services         :10   1st Qu.:  20.00  
##  Median :   8700000   IT Services                 : 9   Median :  41.00  
##  Mean   :  76344186   Construction                : 2   Mean   : 219.55  
##  3rd Qu.:  16050000   Health                      : 2   3rd Qu.:  84.25  
##  Max.   :1600000000   Real Estate                 : 2   Max.   :4100.00  
##                       (Other)                     : 8   NA's   :1        
##              City        State   
##  Washington    :41   DC     :43  
##  Washington, DC: 1   AK     : 0  
##  Washingtonton : 1   AL     : 0  
##  Acton         : 0   AR     : 0  
##  Addison       : 0   AZ     : 0  
##  Adrian        : 0   CA     : 0  
##  (Other)       : 0   (Other): 0

Business Products & Services and Government Services make sense for DC businesses, since a large number of them serve federal and state governments. Growth rate is heavily right-skewed: the 3rd quartile is 6.525, but the maximum is 123.33, and the mean (8.298) sits above the 3rd quartile.

Another item of note: the data is not completely clean, most likely because of human error at entry. The City column spells Washington, DC three different ways ("Washington", "Washington, DC", and "Washingtonton"). Any analysis by city will need a cleaning step first.
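A cleaning step of this kind could look like the sketch below. It is only an illustration: `city` is a hand-made sample of the three spellings seen in the DC summary, and the lookup table `fixes` is a hypothetical helper, not part of the original analysis.

```r
# Hypothetical sample of the City values seen in the DC summary
city <- c("Washington", "Washington, DC", "Washingtonton", "Washington")

# Lookup table mapping each known variant to one canonical spelling
fixes <- c("Washington, DC" = "Washington", "Washingtonton" = "Washington")

# Replace known variants and leave every other value unchanged
is_variant <- city %in% names(fixes)
city_clean <- city
city_clean[is_variant] <- unname(fixes[city[is_variant]])

print(table(city_clean))
```

A lookup table keeps every correction explicit and reviewable, which is safer than fuzzy matching when only a handful of variants exist.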

Returning to the full dataset, the same pattern shows up in the overall summary numbers.

summary(inc$Growth_Rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.340   0.770   1.420   4.612   3.290 421.480

Even without a plot, Growth_Rate shows heavy right skew, much like DC. The 3rd quartile is 3.290, while the maximum is 421.480, and the mean (4.612) exceeds the 3rd quartile.

inc %>% count(Growth_Rate > 3.291)
## # A tibble: 2 x 2
##   `Growth_Rate > 3.291`     n
##   <lgl>                 <int>
## 1 FALSE                  3752
## 2 TRUE                   1249

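About 25% of the businesses (1,249 of 5,001) are above the 3rd quartile, which is what the definition of a quartile implies. The skew comes from how far above that cutoff the largest values sit.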
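The relationship between the quartile cutoff, the share of rows above it, and the skew can be checked on a small vector. This is a sketch with made-up growth rates (a few values are borrowed from the summary for flavor); it is not the real `Growth_Rate` column.

```r
# Illustrative growth rates with one extreme value
growth <- c(0.34, 0.77, 1.00, 1.42, 2.00, 3.29, 10.00, 421.48)

# Third quartile using R's default quantile method
q3_cut <- unname(quantile(growth, 0.75))

# Share of values strictly above the cutoff
share_above <- mean(growth > q3_cut)
print(share_above)

# Right skew: the mean is dragged far above the median
print(c(mean = mean(growth), median = median(growth)))
```

The share above the cutoff stays near one quarter no matter how large the maximum is; the extreme value shows up in the gap between mean and median instead.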

Question 1

Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.

# count companies per state, largest first
q1 <- inc %>%
  count(State) %>%
  arrange(desc(n)) %>%
  as_tibble()
  
ggplot(q1, aes(x = reorder(State, n), y = n)) +
  geom_col(width = 0.6) +
  # white rules every 100 companies act as gridlines drawn over the bars
  geom_hline(yintercept = seq(100, 700, 100), col = "white", lwd = 1) +
  theme_classic() +
  # states on the vertical axis suit a portrait-oriented screen
  coord_flip() +
  xlab("State") +
  ylab("Number of Fastest Growing Companies")

A bar chart was chosen for its clean representation of the data: it is easily read and quickly understood. I chose not to use color on most of these plots, because I am not comparing one group of data points against another, so there is no category that would need a consistent color throughout the report.

Question 2

Let's dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R's complete.cases() function). In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.
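As a quick reminder of what `complete.cases()` does, here is a minimal sketch on a made-up data frame (the column names mirror the data, but the rows are hypothetical). It returns one logical value per row, `TRUE` only when no column in that row is `NA`.

```r
# Hypothetical rows; the second one is missing its employee count
df <- data.frame(
  Industry  = c("Health", "Energy", "Software"),
  Employees = c(132, NA, 29)
)

# One TRUE/FALSE per row: TRUE when the row has no missing values
keep <- complete.cases(df)
print(keep)

# Keep only the fully populated rows
df_full <- df[keep, ]
print(nrow(df_full))
```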

# finding the state in question
q1[3,"State"]
## # A tibble: 1 x 1
##   State
##   <fct>
## 1 NY

How does NY compare to the country as a whole?

summary(inc)
##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2502   110 Consulting        :   1   Mean   :  4.612  
##  3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4995                    
##     Revenue                                    Industry   
##  Min.   :    2000000   IT Services                 : 733  
##  1st Qu.:    5100000   Business Products & Services: 482  
##  Median :   10900000   Advertising & Marketing     : 471  
##  Mean   :   48222535   Health                      : 355  
##  3rd Qu.:   28600000   Software                    : 342  
##  Max.   :10100000000   Financial Services          : 260  
##                        (Other)                     :2358  
##    Employees                  City          State     
##  Min.   :    1.0   New York     : 160   CA     : 701  
##  1st Qu.:   25.0   Chicago      :  90   TX     : 387  
##  Median :   53.0   Austin       :  88   NY     : 311  
##  Mean   :  232.7   Houston      :  76   VA     : 283  
##  3rd Qu.:  132.0   San Francisco:  75   FL     : 282  
##  Max.   :66803.0   Atlanta      :  74   IL     : 273  
##  NA's   :12        (Other)      :4438   (Other):2764
#comparing NY to the country
ny_stats<- filter(inc, State == "NY")
summary(ny_stats)
##       Rank                        Name      Growth_Rate    
##  Min.   :  26   1st Equity          :  1   Min.   : 0.350  
##  1st Qu.:1186   33Across            :  1   1st Qu.: 0.670  
##  Median :2702   5Linx Enterprises   :  1   Median : 1.310  
##  Mean   :2612   Access Display Group:  1   Mean   : 4.371  
##  3rd Qu.:4005   Adafruit            :  1   3rd Qu.: 3.580  
##  Max.   :4981   AdCorp Media Group  :  1   Max.   :84.430  
##                 (Other)             :305                   
##     Revenue                                   Industry     Employees      
##  Min.   :   2000000   Advertising & Marketing     : 57   Min.   :    1.0  
##  1st Qu.:   4300000   IT Services                 : 43   1st Qu.:   21.0  
##  Median :   8800000   Business Products & Services: 26   Median :   45.0  
##  Mean   :  58715113   Consumer Products & Services: 17   Mean   :  271.3  
##  3rd Qu.:  25700000   Telecommunications          : 17   3rd Qu.:  105.5  
##  Max.   :4600000000   Education                   : 14   Max.   :32000.0  
##                       (Other)                     :137                    
##         City         State    
##  New York :160   NY     :311  
##  Brooklyn : 15   AK     :  0  
##  Rochester:  9   AL     :  0  
##  Buffalo  :  5   AR     :  0  
##  Fairport :  5   AZ     :  0  
##  new york :  5   CA     :  0  
##  (Other)  :112   (Other):  0
q2 <- ny_stats%>%
  filter(complete.cases(.))%>%
  group_by(Industry)%>%
  summarise(Industry_Mean = mean(Employees),
            Industry_Median = median(Employees))%>%
  gather(statistic, Employees, Industry_Mean, Industry_Median)

ggplot(q2, aes(x = Industry, y = Employees)) +
  # dodge places mean and median side by side; the default would stack them
  geom_col(aes(fill = statistic), position = "dodge") +
  geom_hline(yintercept = seq(100, 1600, 100), col = "white", lwd = 1) +
  theme_classic() +
  coord_flip() +
  ggtitle("NY Mean/Median Employees by Industry")

q2_bar <- ny_stats %>%
  filter(complete.cases(.))%>%
  select(Industry, Employees)


ggplot(q2_bar, aes(x=Employees, y=reorder(Industry, Employees, median, order = TRUE)))+
  geom_boxplot(fill="slateblue", alpha=0.2)+
  scale_x_log10()+
  theme_classic()+
  ylab("Industry (by median)")+
  ggtitle("NY Median by Industry")

I wanted to present the data in two different ways and give a comparison between the requested value and the data as a whole.

The bar chart again gives a clean representation of the data. I used color here because there are two statistics on the same chart, and I wanted a clear delineation between them in both black-and-white and color display.

The box plot displays the same data, but drawing the median as a line inside each box, alongside the spread and outliers, tells the story of the data more easily.
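To see what a box plot treats as an outlier, base R's `boxplot.stats()` exposes the same 1.5 × IQR rule numerically. The sketch below uses made-up employee counts for a single hypothetical industry, not values filtered from `ny_stats`.

```r
# Made-up employee counts for one industry, with one very large company
employees <- c(1, 21, 45, 105, 32000)

# $stats holds whisker ends, hinges and median;
# $out holds points beyond 1.5 * IQR from the hinges
bs <- boxplot.stats(employees)
print(bs$stats)
print(bs$out)

# The mean is pulled up by the outlier; the median is not
print(c(mean = mean(employees), median = median(employees)))
```

This is why the median is the more robust summary for these industries, and why a log scale helps keep a single very large company from flattening the rest of the plot.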

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

q3 <- ny_stats%>%
  filter(complete.cases(.))%>%
  group_by(Industry)%>%
  summarise(Revenue_total = sum(Revenue), Employees_Total= sum(Employees))%>%
  mutate(Revenue_per_employee = Revenue_total/Employees_Total)

ggplot(q3, aes(x = reorder(Industry, Revenue_per_employee), y = Revenue_per_employee)) +
  geom_col() +
  # white rules every 100,000 act as gridlines drawn over the bars
  geom_hline(yintercept = seq(100000, 700000, 100000), col = "white", lwd = 1) +
  theme_classic() +
  coord_flip() +
  xlab("Industry") +
  ylab("Revenue per Employee") +
  ggtitle("NY Industry Revenue per Employee")
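The calculation behind this chart divides each industry's total revenue by its total employees. The sketch below reproduces that on a made-up data frame in base R, and also computes a per-company ratio, which is one possible way to keep the within-industry spread that the question asks for. The rows and industry labels are hypothetical.

```r
# Made-up companies; not taken from the inc data
df <- data.frame(
  Industry  = c("A", "A", "B"),
  Revenue   = c(100, 300, 500),
  Employees = c(10, 10, 50)
)

# Industry totals, then one ratio per industry (ratio of sums)
tot <- aggregate(cbind(Revenue, Employees) ~ Industry, data = df, FUN = sum)
tot$Revenue_per_employee <- tot$Revenue / tot$Employees
print(tot)

# Per-company ratios keep the spread within each industry
df$Revenue_per_employee <- df$Revenue / df$Employees
print(df)
```

The ratio of sums weights large companies more heavily, while the per-company ratios could feed a box plot like the one used in Question 2.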