Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. Let's read this in:

# library
library(tidyverse) # load in ggplot2, amongst others I commonly use.

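# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0), which summary() and levels() below rely on
inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header = TRUE, stringsAsFactors = TRUE)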

And let's preview this data:

head(inc)
##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX
summary(inc)
##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2502   110 Consulting        :   1   Mean   :  4.612  
##  3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4995                    
##     Revenue                                  Industry      Employees      
##  Min.   :2.000e+06   IT Services                 : 733   Min.   :    1.0  
##  1st Qu.:5.100e+06   Business Products & Services: 482   1st Qu.:   25.0  
##  Median :1.090e+07   Advertising & Marketing     : 471   Median :   53.0  
##  Mean   :4.822e+07   Health                      : 355   Mean   :  232.7  
##  3rd Qu.:2.860e+07   Software                    : 342   3rd Qu.:  132.0  
##  Max.   :1.010e+10   Financial Services          : 260   Max.   :66803.0  
##                      (Other)                     :2358   NA's   :12       
##             City          State     
##  New York     : 160   CA     : 701  
##  Chicago      :  90   TX     : 387  
##  Austin       :  88   NY     : 311  
##  Houston      :  76   VA     : 283  
##  San Francisco:  75   FL     : 282  
##  Atlanta      :  74   IL     : 273  
##  (Other)      :4438   (Other):2764

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

#1 Let's examine the dataset itself
dim(inc) # get its dimension
## [1] 5001    8
str(inc) # compactly display the features of the data
## 'data.frame':    5001 obs. of  8 variables:
##  $ Rank       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name       : Factor w/ 5001 levels "(Add)ventures",..: 1770 1633 4423 690 1198 2839 4733 1468 1869 4968 ...
##  $ Growth_Rate: num  421 248 245 233 213 ...
##  $ Revenue    : num  1.18e+08 4.96e+07 2.55e+07 1.90e+09 8.70e+07 ...
##  $ Industry   : Factor w/ 25 levels "Advertising & Marketing",..: 5 12 13 7 1 20 10 1 5 21 ...
##  $ Employees  : int  104 51 132 50 220 63 27 75 97 15 ...
##  $ City       : Factor w/ 1519 levels "Acton","Addison",..: 391 365 635 2 139 66 912 1179 131 1418 ...
##  $ State      : Factor w/ 52 levels "AK","AL","AR",..: 5 47 10 45 20 45 44 5 46 41 ...
#2 Top 5 fastest growing companies
inc %>% arrange(Rank) %>% head(5) # Arrange Rank and print top 5 cases
##   Rank                  Name Growth_Rate   Revenue                     Industry
## 1    1                  Fuhu      421.48 1.179e+08 Consumer Products & Services
## 2    2 FederalConference.com      248.31 4.960e+07          Government Services
## 3    3         The HCI Group      245.45 2.550e+07                       Health
## 4    4               Bridger      233.08 1.900e+09                       Energy
## 5    5                DataXu      213.37 8.700e+07      Advertising & Marketing
##   Employees         City State
## 1       104   El Segundo    CA
## 2        51     Dumfries    VA
## 3       132 Jacksonville    FL
## 4        50      Addison    TX
## 5       220       Boston    MA
#3 Is there at least one fast growing company listed in each state?
levels(inc$State) # list the levels of State, and note that each represents an individual state, in addition to District of Columbia (DC) and Puerto Rico (PR).
##  [1] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DC" "DE" "FL" "GA" "HI" "IA" "ID" "IL"
## [16] "IN" "KS" "KY" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NC" "ND" "NE"
## [31] "NH" "NJ" "NM" "NV" "NY" "OH" "OK" "OR" "PA" "PR" "RI" "SC" "SD" "TN" "TX"
## [46] "UT" "VA" "VT" "WA" "WI" "WV" "WY"
length(levels(inc$State)) # 52 levels (50 states plus DC and PR), which further indicates that no state is missing.
## [1] 52
#4 Top 5 States with the most listed companies among the fastest growing companies
inc %>% count(State, sort = TRUE) %>% head(5)
## # A tibble: 5 x 2
##   State     n
##   <fct> <int>
## 1 CA      701
## 2 TX      387
## 3 NY      311
## 4 VA      283
## 5 FL      282
#5 Top 5 Cities with the most listed companies among the fastest growing companies
inc %>% count(City, sort = TRUE) %>% head(5)
## # A tibble: 5 x 2
##   City              n
##   <fct>         <int>
## 1 New York        160
## 2 Chicago          90
## 3 Austin           88
## 4 Houston          76
## 5 San Francisco    75
#6 Top 5 Industries by Mean Revenue
inc %>% group_by(Industry) %>% summarise(Avg_Revenue = mean(Revenue)) %>%  arrange(desc(Avg_Revenue)) %>% head(5)
## # A tibble: 5 x 2
##   Industry                     Avg_Revenue
##   <fct>                              <dbl>
## 1 Computer Hardware             270129545.
## 2 Energy                        126344954.
## 3 Food & Beverage                98559542.
## 4 Logistics & Transportation     95745161.
## 5 Consumer Products & Services   73676847.
#7 Top 5 Industries by Mean Employees
# Note: Employees has 12 NA's, so any industry containing one gets a mean of NA and sorts last
inc %>% group_by(Industry) %>% summarise(Avg_Employees = mean(Employees)) %>%  arrange(desc(Avg_Employees)) %>% head(5)
## # A tibble: 5 x 2
##   Industry             Avg_Employees
##   <fct>                        <dbl>
## 1 Human Resources              1158.
## 2 Security                      562.
## 3 Travel & Hospitality          372.
## 4 Engineering                   276.
## 5 Energy                        243.
#8 Mean Revenue by State
inc %>% group_by(State) %>%  summarise(Avg_Revenue = mean(Revenue)) %>%  arrange(desc(Avg_Revenue)) 
## # A tibble: 52 x 2
##    State Avg_Revenue
##    <fct>       <dbl>
##  1 ID     231523529.
##  2 AK     171500000 
##  3 IA     123142857.
##  4 IL     121773993.
##  5 HI      99485714.
##  6 WI      92362025.
##  7 DC      76344186.
##  8 OH      68745161.
##  9 NC      67580292.
## 10 MI      61950794.
## # ... with 42 more rows
#9 Correlation of rank with revenue and growth rate
cor.test(inc$Rank, inc$Revenue) # With r = 0.08 (p < 0.05), there is a very weak correlation.
## 
##  Pearson's product-moment correlation
## 
## data:  inc$Rank and inc$Revenue
## t = 5.8249, df = 4999, p-value = 6.071e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.05451435 0.10957397
## sample estimates:
##        cor 
## 0.08210681
cor.test(inc$Rank, inc$Growth_Rate) # With r = -0.40 (p < 0.05), there is a moderate negative correlation.
## 
##  Pearson's product-moment correlation
## 
## data:  inc$Rank and inc$Growth_Rate
## t = -30.644, df = 4999, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4207488 -0.3740763
## sample estimates:
##        cor 
## -0.3976698
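
Pearson's r measures linear association, and Growth_Rate is heavily right-skewed (median 1.42, maximum 421.48), so a rank-based measure such as Spearman's rho may describe its relationship with Rank better. The sketch below uses synthetic values, not the Inc. data, purely to illustrate the difference; the names toy_rank and toy_growth are hypothetical.

```r
# Synthetic, illustrative data: growth falls steeply as rank increases
toy_rank   <- 1:100
toy_growth <- 400 / toy_rank   # strictly decreasing and heavily right-skewed

pearson  <- cor(toy_rank, toy_growth, method = "pearson")
spearman <- cor(toy_rank, toy_growth, method = "spearman")

# Spearman's rho is -1 for any strictly decreasing relationship;
# Pearson's r is closer to zero because the relationship is not linear
print(c(pearson = pearson, spearman = spearman))
```

On the real data the equivalent call would be cor.test(inc$Rank, inc$Growth_Rate, method = "spearman"); if either variable contains ties, R may warn that it cannot compute an exact p-value.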

Question 1

Create a graph that shows the distribution of companies in the dataset by State (i.e., how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (i.e., taller than wide), which should further guide your layout choices.

Answer 1

A bar graph was selected because it is a good way to represent frequency and compare across levels. The axes were flipped so that the states run down the page, which makes the chart more readable in a portrait layout. A single neutral fill color was used so that the chart does not rely on color distinctions and remains readable for colorblind viewers.

# Summarizing the data by State
compBYstate = inc %>%
  group_by(State) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

# Plotting a bargraph
p1 = ggplot(compBYstate, aes(x = reorder(State, count), y = count)) + 
  geom_bar(stat = "identity", fill = "honeydew4") + coord_flip() + 
  labs(title = "Distribution of the Fastest Growing Companies by State*", 
       caption = "*District of Columbia (DC) and Puerto Rico (PR) included.", 
       x = "State", y = "Company Count") +  
  geom_text(aes(label = count, y = count + 20), vjust = 0.5, size = 3.5)
p1

The graph above shows the count of fastest growing companies by state, with California at the top of the list.
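
The bar order in this chart comes from reorder(State, count). As a minimal sketch of what that call does, the example below applies it to a hypothetical three-row data frame (toy) holding the counts reported earlier for CA, TX, and NY.

```r
# Counts for the three largest states, copied from the top-5 table above
toy <- data.frame(State = factor(c("CA", "TX", "NY")),
                  count = c(701, 387, 311))

# reorder() sorts the factor levels by the supplied values, in ascending order
ordered_state <- reorder(toy$State, toy$count)
print(levels(ordered_state))
```

With coord_flip(), the first level is drawn at the bottom of the axis, so ascending order places the state with the largest count at the top of the chart.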

Question 2

Let's dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data; use R’s complete.cases() function). In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

Answer 2

The state with the 3rd most companies was identified first, and only complete cases were kept. Outliers were then detected and removed: any value lying beyond the whiskers of its boxplot (more than 1.5 times the interquartile range from the box) was treated as an outlier. A boxplot was chosen because it shows the median, the spread, and signs of skewness for each industry at a glance, which covers both the typical employment level and the variability of the ranges. In addition, ggplot2's stat_summary() allows a simple statistic to be computed and drawn on the graph; here the dot marks the mean of the plotted (log-transformed) values with outliers removed. Lastly, because the boxplots are difficult to compare on the original scale, the number of employees is shown on a logarithmic scale.
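
The whisker rule described above can be sketched on a small vector. The employee counts here are made up for illustration; fivenum() and boxplot.stats() are the base R functions that boxplot() relies on.

```r
# Illustrative employee counts (made up), with one extreme value
employees <- c(10, 12, 15, 18, 20, 25, 500)

# Hinges (the quartiles used by boxplot) and the 1.5 * IQR fences
five <- fivenum(employees)   # min, lower hinge, median, upper hinge, max
iqr  <- five[4] - five[2]
lower_fence <- five[2] - 1.5 * iqr
upper_fence <- five[4] + 1.5 * iqr

# Values outside the fences are what boxplot() draws as outlier points
manual_out  <- employees[employees < lower_fence | employees > upper_fence]
builtin_out <- boxplot.stats(employees)$out

print(manual_out)
print(builtin_out)
```

Both approaches flag only the value 500 in this example. Applying the rule separately within each industry avoids dropping a value that is extreme in one industry but ordinary in another.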

# Get data on the third state with the most companies
sprintf("The state with the 3rd most companies is %s.", compBYstate$State[3])
## [1] "The state with the 3rd most companies is NY."
third.state = inc[complete.cases(inc),] %>% filter(State == compBYstate$State[3])

# Detecting and removing outliers within each industry
# (values more than 1.5 * IQR beyond the hinges, i.e. past the boxplot whiskers)
third.state.out = third.state %>%
  group_by(Industry) %>%
  filter(!Employees %in% boxplot.stats(Employees)$out) %>%
  ungroup()

# Plotting a boxplot 
p2 = ggplot(third.state.out, aes(x = reorder(Industry, Employees), y = log(Employees))) +
  geom_boxplot(outlier.shape = NA, show.legend = FALSE) + coord_flip() +
  stat_summary(fun = mean, color = "honeydew4", geom = "point") + # fun replaces fun.y, deprecated in ggplot2 3.3.0
  labs(x = "Industry", y = "Number of Employees (natural log)", title = "Employee Count by Industry", subtitle = "Dot marks the mean of log(Employees)")
p2

Even with outliers removed, the skewness within most industries can be clearly seen, and there are likely to be differences among them.
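
One caveat when reading the dots: because the y aesthetic is log(Employees), the mean is taken over the logged values, which corresponds to the geometric mean rather than the arithmetic mean of the employee counts. A minimal sketch with made-up numbers:

```r
# Illustrative values spanning several orders of magnitude (made up)
employees <- c(10, 100, 1000)

arithmetic_mean <- mean(employees)        # mean on the original scale
mean_of_logs    <- mean(log(employees))   # what a mean of log(Employees) computes
geometric_mean  <- exp(mean_of_logs)      # back-transformed to employee counts

print(c(arithmetic = arithmetic_mean, geometric = geometric_mean))
```

The geometric mean is never larger than the arithmetic mean, so on skewed data the dot sits lower than the log of the ordinary average would.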

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

Answer 3

The full dataset (complete cases only) is used again and summarized to see which industries generate the most revenue per employee, computed as total revenue divided by total employees within each industry. Below are two ways of presenting the result. The first bar graph shows revenue per employee by industry, with the fill color indicating the total number of employees in each industry. The second shows the same revenue per employee, with the fill color indicating the number of companies in each industry.
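
Dividing an industry's total revenue by its total employees is not the same as averaging each company's own revenue per employee, and the two can rank industries differently. The sketch below contrasts them on a hypothetical data frame (toy) with made-up values; it is not drawn from the Inc. data.

```r
# Made-up companies in two hypothetical industries
toy <- data.frame(Industry  = c("A", "A", "B", "B"),
                  Revenue   = c(1e6, 9e6, 2e6, 2e6),
                  Employees = c(10, 10, 10, 40))

by_industry <- split(toy, toy$Industry)

# Industry-level ratio: total revenue over total employees
ratio_of_sums <- sapply(by_industry,
                        function(d) sum(d$Revenue) / sum(d$Employees))

# Company-level view: average of each company's revenue per employee
mean_of_ratios <- sapply(by_industry,
                         function(d) mean(d$Revenue / d$Employees))

print(rbind(ratio_of_sums, mean_of_ratios))
```

The ratio of sums weights large employers more heavily, while the per-company ratios also support a distribution view (for example, a boxplot of Revenue / Employees by industry).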

# Revenue per employee by industry
revenue = inc[complete.cases(inc),] %>%
  group_by(Industry) %>%
  summarise(count = n(), Revenue = sum(Revenue), Employees = sum(Employees)) %>%
  mutate(rev.per.employee = Revenue / Employees )

# Plotting a bargraph, number of employees per industry shown
p3 = ggplot(revenue, aes(x = reorder(Industry, rev.per.employee), y = rev.per.employee)) + 
  geom_bar(stat = "identity", aes(fill = Employees)) + coord_flip() +
  scale_y_continuous(labels = scales::dollar_format(scale = .001, suffix = "K")) +
  labs(title = "Revenue per Employee by Industry", x = "Industry", 
       y = "Revenue per Employee, in US$", fill = "# of Employees") +
  geom_text(data = filter(revenue, rev.per.employee > 10^6),
            aes(x = Industry, y = rev.per.employee, label=scales::dollar_format()(rev.per.employee)), 
            hjust = 1.1, vjust = 0.4, color = "white", size = 3.5) +
  geom_text(data = filter(revenue, rev.per.employee < 10^6),
            aes(x = Industry, y = rev.per.employee, label=scales::dollar_format()(rev.per.employee)), 
            hjust = -0.1, vjust = 0.4, color = "black", size = 3.5) 
p3

Computer hardware generates the most revenue per employee, at $1,223,564. The graph above also shows that this industry has fewer than 50,000 employees among the fastest growing companies in the United States. Human resources, by contrast, generates $40,735 per employee and has more than 200,000 employees among these companies.

# Plotting a bargraph, number of companies per industry shown

p4 = ggplot(revenue, aes(x = reorder(Industry, rev.per.employee), y = rev.per.employee)) + 
  geom_bar(stat = "identity", aes(fill = count)) + coord_flip() +
  scale_y_continuous(labels = scales::dollar_format(scale = .001, suffix = "K")) +
  labs(title = "Revenue per Employee by Industry", x = "Industry", 
       y = "Revenue per Employee, in US$", fill = "# of Companies") +
  geom_text(data = filter(revenue, rev.per.employee > 10^6),
            aes(x = Industry, y = rev.per.employee, label=scales::dollar_format()(rev.per.employee)), 
            hjust = 1.1, vjust = 0.4, color = "white", size = 3.5) +
  geom_text(data = filter(revenue, rev.per.employee < 10^6),
            aes(x = Industry, y = rev.per.employee, label=scales::dollar_format()(rev.per.employee)), 
            hjust = -0.1, vjust = 0.4, color = "black", size = 3.5) 
p4

This graph again shows that computer hardware generates the most revenue per employee, and it also shows that the industry has fewer than 200 companies among the fastest growing companies in the United States. Human resources, which generates $40,735 per employee, has between 200 and 400 companies. The graph further reveals that IT services has the most companies among the fastest growing companies in the United States and generates $199,683 per employee.