#libraries
library(dplyr)
library(ggplot2)

Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. Let's read this in:

inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)

And let's preview this data:

tail(inc)
##      Rank               Name Growth_Rate  Revenue                     Industry
## 4996 4996              cSubs        0.34 1.34e+07 Business Products & Services
## 4997 4997          Dot Foods        0.34 4.50e+09              Food & Beverage
## 4998 4998 Lethal Performance        0.34 6.80e+06                       Retail
## 4999 4999   ArcaTech Systems        0.34 3.26e+07           Financial Services
## 5000 5000                INE        0.34 6.80e+06                  IT Services
## 5001 5000               ALL4        0.34 4.70e+06       Environmental Services
##      Employees         City State
## 4996        19     Montvale    NJ
## 4997      3919 Mt. Sterling    IL
## 4998         8   Wellington    FL
## 4999        63       Mebane    NC
## 5000        35     Bellevue    WA
## 5001        34    Kimberton    PA
summary(inc)
##       Rank          Name            Growth_Rate         Revenue         
##  Min.   :   1   Length:5001        Min.   :  0.340   Min.   :2.000e+06  
##  1st Qu.:1252   Class :character   1st Qu.:  0.770   1st Qu.:5.100e+06  
##  Median :2502   Mode  :character   Median :  1.420   Median :1.090e+07  
##  Mean   :2502                      Mean   :  4.612   Mean   :4.822e+07  
##  3rd Qu.:3751                      3rd Qu.:  3.290   3rd Qu.:2.860e+07  
##  Max.   :5000                      Max.   :421.480   Max.   :1.010e+10  
##                                                                         
##    Industry           Employees           City              State          
##  Length:5001        Min.   :    1.0   Length:5001        Length:5001       
##  Class :character   1st Qu.:   25.0   Class :character   Class :character  
##  Mode  :character   Median :   53.0   Mode  :character   Mode  :character  
##                     Mean   :  232.7                                        
##                     3rd Qu.:  132.0                                        
##                     Max.   :66803.0                                        
##                     NA's   :12

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

After looking over the summary data, I want to dig into the maximum values and the records around them. The easiest to explore is the largest company, by number of employees, within each industry. Ranked by their largest single employer, the top three industries are Human Resources, Business Products & Services, and Security. The largest companies within these industries are Integrity Staffing Solutions, Sutherland Global Services, and Universal Services of America, respectively.

inc %>%
  select(Name, Industry, Employees) %>%
  filter(!is.na(Employees)) %>%
  group_by(Industry) %>%
  arrange(desc(Employees)) %>%
  top_n(1)
## Selecting by Employees
## # A tibble: 25 x 3
## # Groups:   Industry [25]
##    Name                                  Industry                     Employees
##    <chr>                                 <chr>                            <int>
##  1 Integrity staffing Solutions          Human Resources                  66803
##  2 Sutherland Global Services            Business Products & Services     32000
##  3 Universal Services of America         Security                         20000
##  4 Sprouts Farmers Market                Consumer Products & Services     13200
##  5 Genco                                 Logistics & Transportation       10800
##  6 VXI Global Solutions                  Telecommunications               10000
##  7 Belcan                                Engineering                      10000
##  8 KSS                                   Manufacturing                     8500
##  9 Bojangles' Famous Chicken 'n Biscuits Food & Beverage                   7681
## 10 Collabera                             IT Services                       7000
## # ... with 15 more rows
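
The same per-industry maximum can also be written with `slice_max()`, which dplyr (1.0.0 and later) provides as the successor to the superseded `top_n()`. The sketch below is only illustrative: `toy` is a made-up data frame, not rows from the Inc. file, so that the example runs on its own.

```r
library(dplyr)

# Toy data for illustration only; these are NOT rows from the Inc. 5000 file.
toy <- data.frame(
  Name      = c("A", "B", "C", "D", "E"),
  Industry  = c("Retail", "Retail", "Security", "Security", "Security"),
  Employees = c(10, 40, NA, 25, 5)
)

largest_by_industry <- toy %>%
  filter(!is.na(Employees)) %>%     # drop companies with no headcount
  group_by(Industry) %>%
  slice_max(Employees, n = 1) %>%   # keep the largest employer in each industry
  ungroup() %>%
  arrange(desc(Employees))          # sort the result after ungrouping

largest_by_industry
```

Sorting after `ungroup()` makes it explicit that the final ordering runs across industries rather than within them.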

Question 1

Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.

Since there are so many states to view, I split the data into two charts: the 20 states with the most companies and the 20 states with the fewest.

#get top 20 states
state_count_top <- inc %>%
  count(State) %>%
  arrange(desc(n)) %>%
  slice(1:20)

state_count_top
##    State   n
## 1     CA 701
## 2     TX 387
## 3     NY 311
## 4     VA 283
## 5     FL 282
## 6     IL 273
## 7     GA 212
## 8     OH 186
## 9     MA 182
## 10    PA 164
## 11    NJ 158
## 12    NC 137
## 13    CO 134
## 14    MD 131
## 15    WA 130
## 16    MI 126
## 17    AZ 100
## 18    UT  95
## 19    MN  88
## 20    TN  82
ggplot(state_count_top, aes(x = reorder(State, n), y = n)) +
  geom_col() +
  labs(x = "State", y = "Company Count", title = "Top 20 States with Most Companies") +
  theme_bw()

The 20 states with the fewest companies:

#get bottom 20 states
state_count_low <- inc %>%
  count(State) %>%
  arrange(n) %>%
  slice(1:20)

state_count_low
##    State  n
## 1     PR  1
## 2     AK  2
## 3     WV  2
## 4     WY  2
## 5     SD  3
## 6     MT  4
## 7     NM  5
## 8     VT  6
## 9     HI  7
## 10    AR  9
## 11    ND 10
## 12    MS 12
## 13    ME 13
## 14    DE 16
## 15    RI 16
## 16    ID 17
## 17    NH 24
## 18    NV 26
## 19    NE 27
## 20    IA 28
ggplot(state_count_low, aes(x = reorder(State, n), y = n)) +
  geom_col() +
  labs(x = "State", y = "Company Count", title = "20 States with Least Companies") +
  theme_bw()
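
Because the chart is meant for a portrait screen, one alternative worth considering is a single horizontal bar chart with the states on the vertical axis. The sketch below is an assumption about how that could look, not the layout used above; `state_counts` holds just four counts copied from the output above so that the example runs on its own, and in practice it would be replaced by `count(inc, State)`.

```r
library(ggplot2)

# Four counts copied from the tables above, for illustration only.
# With the full data, use: state_counts <- dplyr::count(inc, State)
state_counts <- data.frame(
  State = c("CA", "TX", "NY", "PR"),
  n     = c(701, 387, 311, 1)
)

p_states <- ggplot(state_counts, aes(x = reorder(State, n), y = n)) +
  geom_col() +
  coord_flip() +   # put states on the vertical axis for a tall, narrow layout
  labs(x = "State", y = "Company Count", title = "Companies by State") +
  theme_bw()

p_states
```

With `reorder(State, n)` and `coord_flip()`, the state with the most companies ends up at the top of the chart.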

Question 2

Let's dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

inc %>%  
  count(State) %>%
  arrange(desc(n)) %>%
  slice(1:3)
##   State   n
## 1    CA 701
## 2    TX 387
## 3    NY 311
ny_inc <- inc %>%
  filter(State == 'NY')

head(ny_inc)
##   Rank                      Name Growth_Rate  Revenue
## 1   26              BeenVerified       84.43 13700000
## 2   30                  Sailthru       73.22  8100000
## 3   37              YellowHammer       67.40 18000000
## 4   38                 Conductor       67.02  7100000
## 5   48 Cinium Financial Services       53.65  5900000
## 6   70                  33Across       44.99 27900000
##                       Industry Employees      City State
## 1 Consumer Products & Services        17  New York    NY
## 2      Advertising & Marketing        79  New York    NY
## 3      Advertising & Marketing        27  New York    NY
## 4      Advertising & Marketing        89  New York    NY
## 5           Financial Services        32 Rock Hill    NY
## 6      Advertising & Marketing        75  New York    NY
# Average and median employees per industry, using only complete cases
ny_summary <- ny_inc %>%
  filter(complete.cases(.)) %>%
  group_by(Industry) %>%
  summarise(Avg = mean(Employees), Median = median(Employees))

ggplot(ny_summary, aes(x = Industry, y = Median)) + 
  geom_col() +
  coord_flip() +
  labs(y = "Median Employees", title = "Median Number of Employees by Industry in New York")

ggplot(ny_summary, aes(x = Industry, y = Avg)) + 
  geom_col() +
  coord_flip() +
  labs(y = "Average Employees", title = "Average Number of Employees by Industry in New York")
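
The bar charts above show the centre of each industry but not its spread. One possible way to show variability and handle outliers, offered here only as a sketch, is a boxplot on a log scale with the mean marked on top. `toy_ny` is a made-up data frame (not the New York rows) so that the example runs on its own; the same plotting code could be pointed at `ny_inc`.

```r
library(ggplot2)

# Toy data for illustration only; these are NOT the New York records.
toy_ny <- data.frame(
  Industry  = rep(c("Media", "Software"), each = 5),
  Employees = c(12, 20, 25, 31, 900, 8, 15, 22, 40, NA)
)

# Keep only rows with no missing values, as the question asks.
toy_complete <- toy_ny[complete.cases(toy_ny), ]

p_spread <- ggplot(toy_complete, aes(x = Industry, y = Employees)) +
  geom_boxplot() +                                        # box = IQR, line = median
  stat_summary(fun = mean, geom = "point", shape = 4) +   # cross marks the mean
  scale_y_log10() +                                       # compress extreme values
  coord_flip() +
  labs(y = "Employees (log scale)",
       title = "Employees by Industry (toy data)")

p_spread
```

The log scale keeps a single very large company from flattening every other box. An alternative assumption would be to zoom in with `coord_cartesian()` instead, which leaves the summary statistics unchanged.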

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

rev_per_emp <- inc %>%
  select(Industry, Employees, Revenue) %>%
  filter(!is.na(Employees)) %>%
  group_by(Industry) %>%
  summarise(Emp = sum(Employees), Rev = sum(Revenue))

ggplot(rev_per_emp, aes(x = Emp, y = Rev, colour = Industry)) +
  geom_point() + 
  labs(y = "Total Revenue", x = "Total Employees", title = "Revenue per Employee by Industry")
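
The scatter plot compares industry totals. To look at revenue per employee directly, along with its distribution within each industry, one option is to compute the ratio per company and draw a boxplot. This is a sketch under stated assumptions: `toy_rev` is made-up data, and `RevPerEmp` is a column name introduced here for illustration.

```r
library(dplyr)
library(ggplot2)

# Toy data for illustration only; these are NOT rows from the Inc. 5000 file.
toy_rev <- data.frame(
  Industry  = c("Retail", "Retail", "Energy", "Energy", "Energy"),
  Employees = c(10, 20, 5, NA, 50),
  Revenue   = c(1e6, 4e6, 2e6, 3e6, 1e7)
)

# Revenue per employee for each company with a known headcount.
rev_ratio <- toy_rev %>%
  filter(!is.na(Employees)) %>%
  mutate(RevPerEmp = Revenue / Employees)

# Industry-level figure: total revenue divided by total employees.
industry_ratio <- rev_ratio %>%
  group_by(Industry) %>%
  summarise(RevPerEmp = sum(Revenue) / sum(Employees))

p_ratio <- ggplot(rev_ratio, aes(x = reorder(Industry, RevPerEmp, FUN = median),
                                 y = RevPerEmp)) +
  geom_boxplot() +      # distribution of company-level ratios per industry
  scale_y_log10() +     # ratios can span several orders of magnitude
  coord_flip() +
  labs(x = "Industry", y = "Revenue per Employee (log scale)",
       title = "Revenue per Employee by Industry (toy data)")

p_ratio
```

Ordering the industries by their median ratio makes the ranking readable at a glance, while the boxes show how widely companies within an industry differ.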