Principles of Data Visualization and Introduction to ggplot2

library(dplyr)
library(ggplot2)

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. Let's read this in:

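# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0),
# which is what the summary() and glimpse() output below reflects
inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header = TRUE, stringsAsFactors = TRUE)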

And let's preview this data:

head(inc)
##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX
summary(inc)
##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2502   110 Consulting        :   1   Mean   :  4.612  
##  3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4995                    
##     Revenue                                  Industry      Employees      
##  Min.   :2.000e+06   IT Services                 : 733   Min.   :    1.0  
##  1st Qu.:5.100e+06   Business Products & Services: 482   1st Qu.:   25.0  
##  Median :1.090e+07   Advertising & Marketing     : 471   Median :   53.0  
##  Mean   :4.822e+07   Health                      : 355   Mean   :  232.7  
##  3rd Qu.:2.860e+07   Software                    : 342   3rd Qu.:  132.0  
##  Max.   :1.010e+10   Financial Services          : 260   Max.   :66803.0  
##                      (Other)                     :2358   NA's   :12       
##             City          State     
##  New York     : 160   CA     : 701  
##  Chicago      :  90   TX     : 387  
##  Austin       :  88   NY     : 311  
##  Houston      :  76   VA     : 283  
##  San Francisco:  75   FL     : 282  
##  Atlanta      :  74   IL     : 273  
##  (Other)      :4438   (Other):2764

Think a bit about what these summaries mean. Use the space below to add any further non-visual exploratory information that you think helps you understand this data:

# Insert your code here, create more chunks as necessary
glimpse(inc)
## Observations: 5,001
## Variables: 8
## $ Rank        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ Name        <fct> Fuhu, FederalConference.com, The HCI Group, Bridger, Data…
## $ Growth_Rate <dbl> 421.48, 248.31, 245.45, 233.08, 213.37, 179.38, 174.04, 1…
## $ Revenue     <dbl> 1.179e+08, 4.960e+07, 2.550e+07, 1.900e+09, 8.700e+07, 4.…
## $ Industry    <fct> Consumer Products & Services, Government Services, Health…
## $ Employees   <int> 104, 51, 132, 50, 220, 63, 27, 75, 97, 15, 149, 165, 250,…
## $ City        <fct> El Segundo, Dumfries, Jacksonville, Addison, Boston, Aust…
## $ State       <fct> CA, VA, FL, TX, MA, TX, TN, CA, UT, RI, VA, CA, FL, SC, D…
# Count by industries
industry <- inc %>% 
  group_by(Industry) %>% 
  count(Industry) %>% 
  arrange(desc(n))

industry
## # A tibble: 25 x 2
## # Groups:   Industry [25]
##    Industry                         n
##    <fct>                        <int>
##  1 IT Services                    733
##  2 Business Products & Services   482
##  3 Advertising & Marketing        471
##  4 Health                         355
##  5 Software                       342
##  6 Financial Services             260
##  7 Manufacturing                  256
##  8 Consumer Products & Services   203
##  9 Retail                         203
## 10 Government Services            202
## # … with 15 more rows
# select top 5 industries by revenue
ind_revenue <- inc %>% 
  group_by(Industry) %>% 
  summarise(tot_rev_ind = sum(Revenue)) %>% 
  mutate(total_revenue_billions = round((tot_rev_ind / 1e9), 1)) %>% 
  select(-tot_rev_ind) %>%
  arrange(desc(total_revenue_billions)) %>% 
  top_n(n = 5)
## Selecting by total_revenue_billions
ind_revenue
## # A tibble: 5 x 2
##   Industry                     total_revenue_billions
##   <fct>                                         <dbl>
## 1 Business Products & Services                   26.4
## 2 IT Services                                    20.7
## 3 Health                                         17.9
## 4 Consumer Products & Services                   15  
## 5 Logistics & Transportation                     14.8
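# select the top 5 industries by total number of employees
# note: sum() returns NA for any industry with a missing Employees value,
# and top_n() drops those rows; sum(Employees, na.rm = TRUE) would keep them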
ind_employs <- inc %>% 
  group_by(Industry) %>% 
  summarise(tot_ind_emp = sum(Employees)) %>% 
  arrange(desc(tot_ind_emp)) %>% 
  top_n(n = 5)
## Selecting by tot_ind_emp
ind_employs
## # A tibble: 5 x 2
##   Industry                     tot_ind_emp
##   <fct>                              <int>
## 1 Human Resources                   226980
## 2 Financial Services                 47693
## 3 Consumer Products & Services       45464
## 4 Security                           41059
## 5 Advertising & Marketing            39731
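
The summary() output above reports 12 missing values in Employees, so it is worth checking how they affect grouped totals. The following is a minimal sketch on a small made-up data frame (the `toy` data and its values are hypothetical, not taken from `inc`), showing that a single NA turns a group's sum into NA unless na.rm = TRUE is passed:

```r
library(dplyr)

# Hypothetical toy data standing in for `inc`; the values are made up
toy <- data.frame(
  Industry  = c("Health", "Health", "Energy", "Energy"),
  Employees = c(100L, NA, 50L, 25L)
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))

# Without na.rm, one NA makes the whole group total NA
totals_raw <- toy %>%
  group_by(Industry) %>%
  summarise(tot_emp = sum(Employees))

# With na.rm = TRUE, the known values are still summed
totals_clean <- toy %>%
  group_by(Industry) %>%
  summarise(tot_emp = sum(Employees, na.rm = TRUE))

print(na_counts)
print(totals_raw)
print(totals_clean)
```

Whether to drop incomplete rows or to sum around the gaps is a judgment call; either way, it helps to state which one a ranking is based on.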

Question 1

Create a graph that shows the distribution of companies in the dataset by State (i.e., how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (i.e., taller than wide), which should further guide your layout choices.

# Answer Question 1 here

# the distribution of companies by State
inc %>% count(State) %>% 
  ggplot(aes(x=reorder(State, n), y=n, fill=n)) + 
  geom_col() + 
  coord_flip() + 
  xlab("State") +
  ylab("Number of Companies") +
  ggtitle("Number of Companies by State")
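
Because the chart is meant for a portrait screen, the saved dimensions matter as much as the flipped axes. As a rough sketch, the plot could be written out taller than wide with ggsave(); the data frame below holds only the six largest state counts from the summary above, and the 5 x 9 inch size and the temporary file name are arbitrary assumptions:

```r
library(ggplot2)

# The six largest state counts, copied from the summary(inc) output above
state_counts <- data.frame(
  State = c("CA", "TX", "NY", "VA", "FL", "IL"),
  n     = c(701, 387, 311, 283, 282, 273)
)

# Horizontal bars, ordered by count, so state labels read left to right
p <- ggplot(state_counts, aes(x = reorder(State, n), y = n)) +
  geom_col() +
  coord_flip() +
  xlab("State") +
  ylab("Number of Companies")

# Portrait dimensions (taller than wide); 5 x 9 inches is an arbitrary choice
out_file <- tempfile(fileext = ".pdf")
ggsave(out_file, plot = p, width = 5, height = 9, units = "in")
```

With all states included, the extra height gives each bar label enough room to stay legible.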

Question 2

Let's dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data; use R’s complete.cases() function). In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

# Answer Question 2 here

# consider complete cases
inc_complete <- inc[complete.cases(inc),]

# NY has the 3rd most companies (311), so filter to it
inc_complete %>% 
  filter(State == "NY") %>% 
  ggplot(aes(x = Industry, y = Employees)) + 
  # hide outlier points; coord_flip(ylim = ...) below zooms without dropping data
  geom_boxplot(width = .5, fill = "grey", outlier.colour = NA) +
  # map colour inside aes() so the legend distinguishes mean from median;
  # fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
  stat_summary(aes(colour = "mean"), fun = mean, geom = "point", size = 2) +
  stat_summary(aes(colour = "median"), fun = median, geom = "point", size = 2) +
  scale_colour_manual(name = NULL, values = c(mean = "red", median = "blue")) +
  coord_flip(ylim = c(0, 1500), expand = TRUE) +
  scale_y_continuous(labels = scales::comma, breaks = seq(0, 1500, by = 100)) +
  xlab("Industry") +
  ylab("Employees per company") +
  ggtitle("Mean and Median Employment by Industry in NY State") + 
  theme(panel.background = element_blank(), legend.position = "top")
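
To make the outlier handling concrete, here is a small base R sketch of the 1.5 x IQR rule that box plots use to flag outliers, and of why the mean and median can diverge. The `employees` vector is made up for illustration. Note that boxplot.stats() works from hinges, while geom_boxplot() computes quartiles with quantile(), so borderline points can be classified differently:

```r
# Hypothetical employee counts for one industry; the values are made up
employees <- c(10, 12, 14, 15, 16, 18, 20, 500)

# boxplot.stats() flags points lying more than 1.5 * IQR beyond the hinges
outliers <- boxplot.stats(employees)$out

# The mean is pulled toward the extreme value; the median is not
emp_mean   <- mean(employees)
emp_median <- median(employees)

print(outliers)
print(c(mean = emp_mean, median = emp_median))
```

A single very large company is enough to pull the mean far above the median, which is one reason to show both statistics on the plot.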

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

# Answer Question 3 here

# group by industry and compute total revenue divided by total employees
ind_rev_per_emp <- inc[complete.cases(inc),] %>%
  group_by(Industry) %>%
  summarise(Rev_ind_per_emp=sum(Revenue) / sum(Employees))  %>%
  arrange(desc(Rev_ind_per_emp))

ind_rev_per_emp
## # A tibble: 25 x 2
##    Industry                     Rev_ind_per_emp
##    <fct>                                  <dbl>
##  1 Computer Hardware                   1223564.
##  2 Energy                               520921.
##  3 Construction                         452741.
##  4 Logistics & Transportation           371001.
##  5 Consumer Products & Services         328972.
##  6 Insurance                            318558.
##  7 Manufacturing                        286824.
##  8 Retail                               276718.
##  9 Financial Services                   275741.
## 10 Environmental Services               259852.
## # … with 15 more rows
# plot industries that generate the most revenue per employee
ggplot(ind_rev_per_emp, aes(x=reorder(Industry, Rev_ind_per_emp), y=Rev_ind_per_emp)) + 
  geom_col() +
  coord_flip() +
  xlab("Industry") +
  ylab("Revenue per employee") +
  ggtitle("Revenue per Employee by Industry") +
  scale_y_continuous(labels = scales::comma)
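
The question also asks for the distribution within each industry, which a single ratio per industry cannot show. One possible approach, sketched here on a small made-up data frame (the `toy` data, its values, and the `rev_per_emp` column name are all hypothetical), is to compute revenue per employee for each company and draw a box plot per industry on a log scale:

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for `inc`; the values are made up
toy <- data.frame(
  Industry  = rep(c("Energy", "Health", "Retail"), each = 4),
  Revenue   = c(5e6, 2e7, 1e8, 4e7,
                3e6, 6e6, 9e6, 1.2e7,
                2e6, 8e6, 1e7, 5e7),
  Employees = c(10, 40, 100, NA,
                30, 40, 90, 60,
                20, 40, 25, 100)
)

# Keep complete rows only, then compute the ratio per company
per_company <- toy[complete.cases(toy), ] %>%
  mutate(rev_per_emp = Revenue / Employees)

# One box per industry, ordered by median; the log scale tames the long right tail
p <- ggplot(per_company,
            aes(x = reorder(Industry, rev_per_emp, FUN = median), y = rev_per_emp)) +
  geom_boxplot() +
  coord_flip() +
  scale_y_log10(labels = scales::comma) +
  xlab("Industry") +
  ylab("Revenue per employee (log scale)")

print(p)
```

This answers a slightly different question from the ratio of totals above: each company counts equally here, whereas the ratio of sums is weighted toward the largest employers.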