Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. Let's read this in:

library(dplyr); library(ggplot2); library(knitr); library(kableExtra)  # packages used below
inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header = TRUE)

And let's preview this data:

head(inc) %>% kable(caption = "Preview") %>% kable_styling("striped", full_width = TRUE)
Preview

| Rank | Name | Growth_Rate | Revenue | Industry | Employees | City | State |
|---|---|---|---|---|---|---|---|
| 1 | Fuhu | 421.48 | 1.179e+08 | Consumer Products & Services | 104 | El Segundo | CA |
| 2 | FederalConference.com | 248.31 | 4.960e+07 | Government Services | 51 | Dumfries | VA |
| 3 | The HCI Group | 245.45 | 2.550e+07 | Health | 132 | Jacksonville | FL |
| 4 | Bridger | 233.08 | 1.900e+09 | Energy | 50 | Addison | TX |
| 5 | DataXu | 213.37 | 8.700e+07 | Advertising & Marketing | 220 | Boston | MA |
| 6 | MileStone Community Builders | 179.38 | 4.570e+07 | Real Estate | 63 | Austin | TX |

summary(inc) %>% kable(caption = "Summary") %>% kable_styling("striped", "condensed", full_width = TRUE, font_size = 12)
Summary

| Rank | Name | Growth_Rate | Revenue | Industry | Employees | City | State |
|---|---|---|---|---|---|---|---|
| Min. : 1 | (Add)ventures : 1 | Min. : 0.340 | Min. :2.000e+06 | IT Services : 733 | Min. : 1.0 | New York : 160 | CA : 701 |
| 1st Qu.:1252 | @Properties : 1 | 1st Qu.: 0.770 | 1st Qu.:5.100e+06 | Business Products & Services: 482 | 1st Qu.: 25.0 | Chicago : 90 | TX : 387 |
| Median :2502 | 1-Stop Translation USA: 1 | Median : 1.420 | Median :1.090e+07 | Advertising & Marketing : 471 | Median : 53.0 | Austin : 88 | NY : 311 |
| Mean :2502 | 110 Consulting : 1 | Mean : 4.612 | Mean :4.822e+07 | Health : 355 | Mean : 232.7 | Houston : 76 | VA : 283 |
| 3rd Qu.:3751 | 11thStreetCoffee.com : 1 | 3rd Qu.: 3.290 | 3rd Qu.:2.860e+07 | Software : 342 | 3rd Qu.: 132.0 | San Francisco: 75 | FL : 282 |
| Max. :5000 | 123 Exteriors : 1 | Max. :421.480 | Max. :1.010e+10 | Financial Services : 260 | Max. :66803.0 | Atlanta : 74 | IL : 273 |
| NA | (Other) :4995 | NA | NA | (Other) :2358 | NA’s :12 | (Other) :4438 | (Other):2764 |

Think a bit about what these summaries mean. Use the space below to add some more relevant non-visual exploratory information that you think helps you understand this data:

# Top 10 industries by average growth rate
inc1 <- inc %>%
  group_by(Industry) %>%
  summarize(Growth_Rate = mean(Growth_Rate))
arrange(inc1, desc(Growth_Rate)) %>%
  top_n(10) %>%
  kable(caption = "Top 10 Industries by Average Growth Rate") %>%
  kable_styling("striped", full_width = FALSE, position = "center")
## Selecting by Growth_Rate
Top 10 Industries by Average Growth Rate

| Industry | Growth_Rate |
|---|---|
| Energy | 9.603303 |
| Consumer Products & Services | 8.776108 |
| Real Estate | 7.746667 |
| Government Services | 7.238168 |
| Advertising & Marketing | 6.225478 |
| Retail | 6.184729 |
| Financial Services | 5.435308 |
| Software | 5.020643 |
| Health | 4.856394 |
| Media | 4.374074 |

# Top 10 industries by average revenue
inc2 <- inc %>%
  group_by(Industry) %>%
  summarize(Revenue = mean(Revenue))
arrange(inc2, desc(Revenue)) %>%
  top_n(10) %>%
  kable(caption = "Top 10 Industries by Average Revenue") %>%
  kable_styling("striped", full_width = FALSE, position = "center")
## Selecting by Revenue
Top 10 Industries by Average Revenue

| Industry | Revenue |
|---|---|
| Computer Hardware | 270129545 |
| Energy | 126344954 |
| Food & Beverage | 98559542 |
| Logistics & Transportation | 95745161 |
| Consumer Products & Services | 73676847 |
| Construction | 70450802 |
| Telecommunications | 56855814 |
| Business Products & Services | 54705187 |
| Security | 52230137 |
| Environmental Services | 51741176 |

# Top 10 states by average revenue
inc3 <- inc %>%
  group_by(State) %>%
  summarize(Revenue = mean(Revenue))
arrange(inc3, desc(Revenue)) %>%
  top_n(10) %>%
  kable(caption = "Top 10 States by Average Revenue") %>%
  kable_styling("striped", full_width = FALSE, position = "center")
## Selecting by Revenue
Top 10 States by Average Revenue

| State | Revenue |
|---|---|
| ID | 231523529 |
| AK | 171500000 |
| IA | 123142857 |
| IL | 121773993 |
| HI | 99485714 |
| WI | 92362025 |
| DC | 76344186 |
| OH | 68745161 |
| NC | 67580292 |
| MI | 61950794 |
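
The summary shows means well above medians (Growth_Rate has a mean of 4.612 against a median of 1.420), so group means can be driven by a few extreme companies. Below is a minimal sketch of two further non-visual checks: counting missing values per column, and comparing mean with median per industry. The small `toy` data frame is hypothetical and only mimics a few columns of `inc`; the same calls should carry over to `inc` itself.

```r
library(dplyr)

# Hypothetical stand-in for `inc`, reusing its column names
toy <- data.frame(
  Industry    = c("Energy", "Energy", "Energy", "Health", "Health"),
  Growth_Rate = c(1.2, 0.8, 233.08, 2.0, 4.0),
  Employees   = c(50, NA, 120, 30, 70)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Mean vs. median growth per industry, with group size for context
growth_summary <- toy %>%
  group_by(Industry) %>%
  summarize(
    companies     = n(),
    mean_growth   = mean(Growth_Rate),
    median_growth = median(Growth_Rate)
  ) %>%
  arrange(desc(median_growth))

print(na_counts)
print(growth_summary)
```

In this toy data a single fast-growing company gives Energy the higher mean, while Health ranks first by median. The same contrast is worth checking before reading too much into the rankings above.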

Question 1

Create a graph that shows the distribution of companies in the dataset by State (i.e., how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (i.e., taller than wide), which should further guide your layout choices.

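# Count companies per state
inc4 <- inc %>% count(State)

# States go on the vertical axis (via coord_flip) to suit a portrait screen
ggplot(inc4, aes(x = reorder(State, n), y = n)) +
  geom_col(fill = "lightblue") +   # geom_col() is equivalent to geom_bar(stat = "identity")
  labs(title = "Company Count by State", x = "", y = "") +
  coord_flip() +
  geom_text(aes(label = n), size = 2, color = "black") +
  theme(axis.text = element_text(size = 5, face = "bold"))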
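
Since the chart is meant for a portrait-oriented screen, the output dimensions matter as much as the choice of axis. As a sketch, one way to fix a taller-than-wide canvas is `ggsave()` with explicit `width` and `height`. The four counts below are taken from the State column of the summary above, while the file name and the 4 x 7 inch size are illustrative assumptions rather than values from the assignment.

```r
library(ggplot2)

# A few state counts from the summary table, standing in for `inc4`
toy_counts <- data.frame(State = c("CA", "TX", "NY", "VA"),
                         n     = c(701, 387, 311, 283))

p <- ggplot(toy_counts, aes(x = reorder(State, n), y = n)) +
  geom_col(fill = "lightblue") +
  coord_flip() +
  labs(title = "Company Count by State", x = NULL, y = NULL)

# Save on a portrait canvas (taller than wide); the size is an assumed example
out_file <- file.path(tempdir(), "companies_by_state.pdf")
ggsave(out_file, plot = p, width = 4, height = 7, units = "in")
```

In an R Markdown document, the chunk options `fig.width` and `fig.height` serve the same purpose for inline figures.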

Question 2

Let's dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data; use R’s complete.cases() function). In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

Based on the plot above, NY is the state with the 3rd most companies in our dataset (311, behind CA with 701 and TX with 387).

inc5 <- inc[complete.cases(inc),]
inc_NY <- subset(inc5,State=="NY")
dim(inc_NY)
## [1] 311   8
ggplot(inc_NY, aes(Industry, Employees)) +
  geom_boxplot(outlier.shape = NA) +   # hide individual outlier points
  # Blue squares mark the mean; `fun` replaces the deprecated `fun.y` (ggplot2 >= 3.3.0)
  stat_summary(fun = mean, color = "blue", geom = "point", shape = 15, size = 2) +
  coord_cartesian(ylim = c(0, 1200)) +   # zoom in without dropping any rows
  labs(title = "Number of NY Employees per Industry", x = "", y = "") +
  theme(axis.text.x = element_text(angle = 90, size = 8, hjust = 1))
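
The boxplot above hides outlier points and zooms the axis, so it helps to recall what counts as an outlier. `geom_boxplot()` draws whiskers to the most extreme values within 1.5 × IQR of the hinges. This is essentially the Tukey rule that base R's `boxplot.stats()` applies, although the two compute the hinges slightly differently. A small sketch with made-up employee counts (hypothetical, not taken from `inc_NY`) shows how one very large company pulls the mean far from the median:

```r
# Hypothetical employee counts for the companies in one industry
employees <- c(10, 20, 25, 30, 40, 55, 60, 3000)

stats <- boxplot.stats(employees)   # Tukey rule, coef = 1.5 by default

mean_emp   <- mean(employees)       # pulled up by the extreme value
median_emp <- median(employees)     # barely affected by it
flagged    <- stats$out             # values lying beyond the whiskers

cat("mean:", mean_emp, " median:", median_emp, " outliers:", flagged, "\n")
# mean: 405  median: 35  outliers: 3000
```

This is why the plot shows the mean as a separate marker on top of the box. Note also that `coord_cartesian(ylim = ...)` only zooms the view, so the box statistics are still computed from every company.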

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

# Aggregate revenue per employee: total revenue divided by total employees, per industry
inc5_rev <- inc5 %>%
  group_by(Industry) %>%
  summarise(SumEmployees = sum(Employees), SumRevenue = sum(Revenue)) %>%
  mutate(revperemp = SumRevenue / SumEmployees)

inc5_rev <- arrange(inc5_rev, desc(revperemp))

# Horizontal bars, ordered by revenue per employee, with the value printed at each bar end
ggplot(inc5_rev, aes(x = reorder(Industry, revperemp), y = revperemp)) +
  geom_col(fill = "lightblue") +   # geom_col() is equivalent to geom_bar(stat = "identity")
  labs(title = "Revenue per Employee", x = "", y = "") +
  coord_flip() +
  geom_text(aes(label = round(revperemp, 1)), size = 2, color = "black") +
  theme(axis.text = element_text(size = 6, face = "bold"))
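
The bar chart above shows one aggregate ratio per industry. To also show the distribution the prompt asks for, one option (a sketch, not the only reasonable design) is to compute revenue per employee for each company and draw one boxplot per industry on a log scale. The `toy` rows and the `rev_per_emp` column name below are hypothetical; with the real data, `inc5` would take the place of `toy`.

```r
library(dplyr)
library(ggplot2)

# Hypothetical companies standing in for `inc5`
toy <- data.frame(
  Industry  = rep(c("Energy", "Health", "Software"), each = 4),
  Revenue   = c(9e7, 4e7, 2e9, 6e6,  3e6, 8e6, 2e7, 5e6,  4e6, 1e7, 6e7, 9e6),
  Employees = c(100, 80, 50, 20,     40, 60, 150, 25,     30, 45, 200, 60)
)

# Revenue per employee for each company, not per industry
toy <- toy %>% mutate(rev_per_emp = Revenue / Employees)

# One box per industry, ordered by the industry median
p <- ggplot(toy, aes(x = reorder(Industry, rev_per_emp, FUN = median),
                     y = rev_per_emp)) +
  geom_boxplot() +
  scale_y_log10() +   # the ratios are right-skewed, so use a log axis
  coord_flip() +
  labs(title = "Revenue per Employee by Industry",
       x = NULL, y = "Revenue per employee (log scale)")
print(p)
```

Ordering by the median keeps the ranking robust to single very large companies, which the aggregate ratio is not.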