Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:

inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)

And lets preview this data:

head(inc)
##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX
#disable scientific notation 
options(scipen=99999)
summary(inc)
##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2502   110 Consulting        :   1   Mean   :  4.612  
##  3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4995                    
##     Revenue                                    Industry   
##  Min.   :    2000000   IT Services                 : 733  
##  1st Qu.:    5100000   Business Products & Services: 482  
##  Median :   10900000   Advertising & Marketing     : 471  
##  Mean   :   48222535   Health                      : 355  
##  3rd Qu.:   28600000   Software                    : 342  
##  Max.   :10100000000   Financial Services          : 260  
##                        (Other)                     :2358  
##    Employees                  City          State     
##  Min.   :    1.0   New York     : 160   CA     : 701  
##  1st Qu.:   25.0   Chicago      :  90   TX     : 387  
##  Median :   53.0   Austin       :  88   NY     : 311  
##  Mean   :  232.7   Houston      :  76   VA     : 283  
##  3rd Qu.:  132.0   San Francisco:  75   FL     : 282  
##  Max.   :66803.0   Atlanta      :  74   IL     : 273  
##  NA's   :12        (Other)      :4438   (Other):2764

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

Calculate percent missing data to learn how complete/incomplete the data is. Granted we can see the number of NA’s in the summary but sometimes that detail is easily missed.

# Insert your code here, create more chunks as necessary
library(dplyr)
inc %>% summarize_all(funs(sum(is.na(.)) / length(.)))
##   Rank Name Growth_Rate Revenue Industry  Employees City State
## 1    0    0           0       0        0 0.00239952    0     0

Growth rate, Employees, and revenue have some very large maximums, consideraly larger than their respective means. Lets identify those rows to get a better picture.

Growth Rate (Where does the max growth rate occur)

##   Rank Name Growth_Rate   Revenue                     Industry Employees
## 1    1 Fuhu      421.48 117900000 Consumer Products & Services       104
##         City State
## 1 El Segundo    CA

Fuhu inc has the largest growth rate. We can do some additional research into this CA based company. It’s a company that focuses on android tablet learning for childen. https://www.bloomberg.com/research/stocks/private/snapshot.asp?privcapId=58913811

Revenue (Where does the max revenue occur?)

##      Rank Name Growth_Rate     Revenue          Industry Employees
## 4788 4788  CDW        0.41 10100000000 Computer Hardware      6800
##              City State
## 4788 Vernon Hills    IL

CDW is an IL based tech company that provides hardware for education, business, and government. https://en.wikipedia.org/wiki/CDW

Employees (Where does the max number of Employees occur?)

##      Rank                         Name Growth_Rate   Revenue
## 2344 2345 Integrity staffing Solutions        1.55 278200000
##             Industry Employees       City State
## 2344 Human Resources     66803 Wilmington    DE

Integrity Staffing solutions based in DE has the maximum number of employees. This company allocates contract resources to other companies. https://www.bloomberg.com/research/stocks/private/snapshot.asp?privcapId=4442533

Question 1

Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.

# Answer Question 1 here
library(sqldf)
library(ggplot2)

#sqldf is my go to for quick data manipulation 
q1<-sqldf("select 
          State, count(distinct Name) as num_companies
          from inc 
          group by State")



ggplot(q1, aes(x=reorder(State,num_companies),round(num_companies, digits=2)))+ 
  geom_bar(stat="identity", fill="deepskyblue3")+
  geom_text(aes(label=round(num_companies, digits=2)), vjust=0.5, size=5, position=position_dodge(width = 1), hjust=1.5)+
  theme_minimal()+
  theme(axis.text.x=element_text(size=12, vjust=0.5))+
  theme(axis.text.y=element_text(size=12, vjust=0.5))+
  theme(plot.title = element_text(size=18))+
  labs( x="State", y="Number of Unique Companies")+
  coord_flip()+
  labs(caption="Inc. Magazine Data")+  
  ggtitle("Distribution of Unique Companies by State")

Quesiton 2

Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

# we can see that NY has the third most number of companies. Lets subset NY avg employees by industry, complete cases only
ny<-sqldf("select
          Industry, avg(Employees) as avg_emp
          from inc 
          where State in ('NY') 
          and Name is not null 
          and Employees is not null
          group by Industry")

#the extreme values observed in the summary lead me to believe that there are outliers in our data. Lets check
library(outliers)

#returns true is a value is flagged as an outlier. Save this to some temp variable to then remove from the data
 remove<-outlier(ny$avg_emp,logical=TRUE)
 
 #identify any rows that returned true for number of employees that are outliers 
 remove_rows = which(remove==TRUE,arr.ind=TRUE)
 
 #remove those rows 
ny2 = ny[-remove_rows,]

#delete temp table 
rm(ny)

#visualize 
#we also round to the nearest intger since you can't have a fractional employee 
ggplot(ny2, aes(x=reorder(Industry,avg_emp),round(avg_emp, digits=0)))+ 
  geom_bar(stat="identity", fill="deepskyblue3")+
  geom_text(aes(label=round(avg_emp, digits=0)), vjust=0.5, size=4, position=position_dodge(width = 1), hjust=1.5)+
  theme_minimal()+
  theme(axis.text.x=element_text(size=12, vjust=0.5))+
  theme(axis.text.y=element_text(size=12, vjust=0.5))+
  theme(plot.title = element_text(size=18))+
  labs( x="Industry", y="Avg Employees")+
  coord_flip()+
  labs(caption="Inc. Magazine Data")+  
  ggtitle("Distribution of Avg Number of Employees by Industry in NY")

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

# Answer Question 3 here

#we need to divide revenue by employees to get revenue per employee

rev<-sqldf("
           select 
           Industry, rev/emp as per_emp
           from
           (
           select 
           Industry, sum(Revenue) as rev, sum(Employees) as emp
           from 
           inc
           where Industry is not null
           group by Industry
           )
           group by Industry")


#visualize 
ggplot(rev, aes(x=reorder(Industry,per_emp),round(per_emp, digits=2)))+ 
  geom_bar(stat="identity", fill="deepskyblue3")+
  geom_text(aes(label=round(per_emp, digits=2)), vjust=1, size=5, position=position_dodge(width = 2), hjust=1)+
  theme_minimal()+
  theme(axis.text.x=element_text(size=12, vjust=0.5))+
  theme(axis.text.y=element_text(size=12, vjust=0.5))+
  theme(plot.title = element_text(size=18))+
  labs( x="Industry", y="Revenue per Employee")+
  coord_flip()+
  labs(caption="Inc. Magazine Data")+  
  ggtitle("Revenue per Employee by Industry")