Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. Let's read this in:

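# stringsAsFactors = TRUE reproduces the factor counts in the summary below (R >= 4.0 defaults to FALSE)
inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header = TRUE, stringsAsFactors = TRUE)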

And let's preview this data:

head(inc)
##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX
summary(inc)
##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2502   110 Consulting        :   1   Mean   :  4.612  
##  3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4995                    
##     Revenue                                  Industry      Employees      
##  Min.   :2.000e+06   IT Services                 : 733   Min.   :    1.0  
##  1st Qu.:5.100e+06   Business Products & Services: 482   1st Qu.:   25.0  
##  Median :1.090e+07   Advertising & Marketing     : 471   Median :   53.0  
##  Mean   :4.822e+07   Health                      : 355   Mean   :  232.7  
##  3rd Qu.:2.860e+07   Software                    : 342   3rd Qu.:  132.0  
##  Max.   :1.010e+10   Financial Services          : 260   Max.   :66803.0  
##                      (Other)                     :2358   NA's   :12       
##             City          State     
##  New York     : 160   CA     : 701  
##  Chicago      :  90   TX     : 387  
##  Austin       :  88   NY     : 311  
##  Houston      :  76   VA     : 283  
##  San Francisco:  75   FL     : 282  
##  Atlanta      :  74   IL     : 273  
##  (Other)      :4438   (Other):2764

Think a bit about what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

Number of unique values for each nominal variable

library(dplyr)
library(ggplot2)
# tibble() replaces the deprecated data_frame()
df <- tibble(`Nominal Variables` = c('Industries', 'Cities', 'States'),
             `Number of Unique Values` = c(n_distinct(inc$Industry),
                                           n_distinct(inc$City),
                                           n_distinct(inc$State)))
knitr::kable(df)
Nominal Variables    Number of Unique Values
Industries           25
Cities               1519
States               52
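
Another non-visual check that fits here: summary(inc) above reports 12 NA's in Employees, so counting missing values per column shows what complete.cases() would later drop. A minimal base R sketch, assuming a small made-up data frame (`toy` and its values are hypothetical, not taken from inc):

```r
# Hypothetical stand-in for `inc`; the values are invented for illustration
toy <- data.frame(
  State     = c("NY", "CA", "TX"),
  Employees = c(10, NA, 30),
  Revenue   = c(1e6, 2e6, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(toy))

print(na_counts)
print(n_complete)
```

Running the same two calls on inc would show which columns drive the row loss before any filtering is applied.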

Number of unique cities, industries, and companies, and total employees, by state

df_city <- inc %>%
  group_by(State) %>%
  summarize(city_cnt = n_distinct(City))
df_ind <- inc %>%
  group_by(State) %>%
  summarize(ind_cnt = n_distinct(Industry))
df_co <- inc %>%
  group_by(State) %>%
  summarize(company_cnt = n_distinct(Name))
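# sum() returns NA for any state with a missing Employees value,
# which is why some emp_cnt entries below are NA
df_emp <- inc %>%
  group_by(State) %>%
  summarize(emp_cnt = sum(Employees))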
df_by_state <- full_join(df_city, df_ind, by = "State")
df_by_state <- full_join(df_by_state, df_co, by="State")
df_by_state <- full_join(df_by_state, df_emp, by="State")
knitr::kable(df_by_state)
State city_cnt ind_cnt company_cnt emp_cnt
AK 2 2 2 2528
AL 13 18 51 6393
AR 7 7 9 496
AZ 13 22 100 34281
CA 204 25 701 NA
CO 28 23 134 NA
CT 34 18 50 6989
DC 3 13 43 NA
DE 7 11 16 68544
FL 102 25 282 61221
GA 45 22 212 NA
HI 4 5 7 621
IA 18 13 28 11344
ID 12 10 17 5817
IL 104 25 273 NA
IN 30 21 69 12697
KS 16 17 38 8725
KY 13 14 40 5544
LA 15 13 37 10669
MA 72 24 182 24682
MD 44 22 131 40439
ME 5 8 13 879
MI 52 22 126 36905
MN 37 19 88 18534
MO 27 18 59 17296
MS 8 9 12 5531
MT 2 3 4 1673
NC 46 22 137 NA
ND 5 8 10 963
NE 7 13 27 3823
NH 15 13 24 2890
NJ 97 22 158 30162
NM 2 5 5 617
NV 6 11 26 1725
NY 90 25 311 84370
OH 79 24 186 38002
OK 12 15 46 6976
OR 12 16 49 4399
PA 80 24 164 NA
PR 1 1 1 29
RI 9 12 16 2964
SC 22 18 48 5348
SD 3 3 3 761
TN 21 20 82 14586
TX 67 24 387 NA
UT 26 20 95 19028
VA 56 24 283 35667
VT 4 6 6 1069
WA 30 19 130 NA
WI 43 20 79 NA
WV 2 2 2 240
WY 2 2 2 107
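
The NA entries in emp_cnt arise because sum() returns NA when any input is missing. The sketch below, on a made-up data frame (`toy` is hypothetical), shows one way to compute the same four per-state columns in a single summarize() call instead of four grouped summaries and three joins, and to pass na.rm = TRUE so that missing headcounts are skipped. Whether skipping them is appropriate is a judgment call, since the resulting totals undercount the affected states.

```r
library(dplyr)

# Hypothetical stand-in for `inc`; the values are invented for illustration
toy <- data.frame(
  Name      = c("Alpha", "Beta", "Gamma", "Delta"),
  State     = c("NY", "NY", "CA", "CA"),
  City      = c("New York", "New York", "San Diego", "Fresno"),
  Industry  = c("Health", "Software", "Health", "Health"),
  Employees = c(10, 20, NA, 40),
  stringsAsFactors = FALSE
)

# One grouped pass produces all four columns
by_state <- toy %>%
  group_by(State) %>%
  summarize(
    city_cnt    = n_distinct(City),
    ind_cnt     = n_distinct(Industry),
    company_cnt = n_distinct(Name),
    emp_cnt     = sum(Employees, na.rm = TRUE)  # ignore missing headcounts
  )

print(by_state)
```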

Question 1

Create a graph that shows the distribution of companies in the dataset by State (i.e., how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (i.e., taller than wide), which should further guide your layout choices.

ggplot(data = df_by_state,
       aes(x = reorder(State, company_cnt),
           y = company_cnt, fill = company_cnt)) +
  geom_col() +  # bar length is the precomputed company count
  geom_text(aes(label = company_cnt), hjust = -0.5) +
  scale_fill_gradient(low = "blue", high = "green") +
  coord_flip() +  # states on the vertical axis suit a portrait screen
  xlab("State") + ylab("Number of Companies") +
  labs(fill = "Number of Companies") +
  ggtitle("Distribution of Companies by State")
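
Since the chart is meant for a portrait screen, the output dimensions matter as much as the axis choice. The following is a rough sketch, assuming ggplot2 3.3.0 or later (where bar geoms accept the category on the y axis, so coord_flip() is not needed) and using made-up counts; `toy`, `out_file`, and the 6 x 10 inch size are illustrative choices, not part of the assignment:

```r
library(ggplot2)

# Hypothetical counts for illustration only
toy <- data.frame(State = c("CA", "TX", "NY"), company_cnt = c(7, 4, 3))

# Map the category to y directly so the bars run horizontally
p <- ggplot(toy, aes(x = company_cnt, y = reorder(State, company_cnt))) +
  geom_col() +
  labs(x = "Number of Companies", y = "State")

# Save taller than wide; a temporary file keeps the sketch side-effect free
out_file <- tempfile(fileext = ".pdf")
ggsave(out_file, plot = p, width = 6, height = 10, units = "in")
```

With around 50 states, a height well above the width leaves enough room for each state label to stay legible.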

Question 2

Let's dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data; use R’s complete.cases() function). In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

inc_comp <- inc %>% filter(complete.cases(.))
ny <- inc_comp %>%
  filter(State=='NY') 

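# Order industries by median headcount; each box shows the median and IQR
ggplot(data = ny, aes(x = reorder(Industry, Employees, FUN = median),
                      y = Employees)) +
  xlab("Industry") +
  geom_boxplot() +
  scale_y_log10() +  # log scale keeps large outliers from compressing the boxes
  coord_flip() +
  ggtitle("New York State Number of Employees by Industry")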
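
As a numeric companion to the boxplot, the median, mean, and IQR per industry can be tabulated, and outliers can be flagged with the 1.5 x IQR rule that boxplot whiskers are based on. This is a base R sketch on made-up numbers; `toy` and the helper `is_outlier()` are hypothetical names introduced here for illustration:

```r
# Hypothetical employee counts for two made-up industries
toy <- data.frame(
  Industry  = rep(c("Health", "Software"), each = 5),
  Employees = c(10, 12, 14, 16, 500, 20, 22, 24, 26, 28)
)

# Per-industry centre and spread
groups <- split(toy$Employees, toy$Industry)
summary_stats <- data.frame(
  median = sapply(groups, median),
  mean   = sapply(groups, mean),
  iqr    = sapply(groups, IQR)
)

# Flag values more than 1.5 * IQR beyond the first or third quartile
is_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# ave() applies the rule within each industry; it returns 0/1, hence the == 1
toy$outlier <- ave(toy$Employees, toy$Industry, FUN = is_outlier) == 1

print(summary_stats)
print(toy[toy$outlier, ])
```

The gap between mean and median in the table is a quick indicator of how strongly a few large companies pull an industry's average. Note that with scale_y_log10() ggplot2 transforms the data before computing the boxplot statistics, so the points drawn as outliers in the plot are determined on log10 values and need not match flags computed on the raw scale.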

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

This assumes we’re working with the national data again (not NY-specific).

Judging by the graph below, Computer Hardware appears to be the strongest industry to invest in on a revenue-per-employee basis, and probably the safest as well.

inc_comp$rev_emp <- inc_comp$Revenue / inc_comp$Employees

# Order industries by median revenue per employee
ggplot(data = inc_comp, aes(x = reorder(Industry, rev_emp, FUN = median),
                            y = rev_emp)) +
  xlab("Industry") +
  ylab("Revenue per Employee") +
  geom_boxplot() +
  scale_y_log10() +  # log scale keeps extreme ratios from dominating the axis
  coord_flip() +
  ggtitle("Revenue per Employee by Industry")