suppressWarnings(suppressMessages(library(data.table)))
suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(ggplot2)))
suppressWarnings(suppressMessages(library(knitr)))
suppressWarnings(suppressMessages(library(psych)))
suppressWarnings(suppressMessages(library(scales)))

Read data

inc <- fread("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv")

Data Exploration

dim(inc)
## [1] 5001    8
kable(head(inc))
| Rank | Name | Growth_Rate | Revenue | Industry | Employees | City | State |
|---:|:---|---:|---:|:---|---:|:---|:---|
| 1 | Fuhu | 421.48 | 1.179e+08 | Consumer Products & Services | 104 | El Segundo | CA |
| 2 | FederalConference.com | 248.31 | 4.960e+07 | Government Services | 51 | Dumfries | VA |
| 3 | The HCI Group | 245.45 | 2.550e+07 | Health | 132 | Jacksonville | FL |
| 4 | Bridger | 233.08 | 1.900e+09 | Energy | 50 | Addison | TX |
| 5 | DataXu | 213.37 | 8.700e+07 | Advertising & Marketing | 220 | Boston | MA |
| 6 | MileStone Community Builders | 179.38 | 4.570e+07 | Real Estate | 63 | Austin | TX |
str(inc)
## Classes 'data.table' and 'data.frame':   5001 obs. of  8 variables:
##  $ Rank       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name       : chr  "Fuhu" "FederalConference.com" "The HCI Group" "Bridger" ...
##  $ Growth_Rate: num  421 248 245 233 213 ...
##  $ Revenue    : num  1.18e+08 4.96e+07 2.55e+07 1.90e+09 8.70e+07 ...
##  $ Industry   : chr  "Consumer Products & Services" "Government Services" "Health" "Energy" ...
##  $ Employees  : int  104 51 132 50 220 63 27 75 97 15 ...
##  $ City       : chr  "El Segundo" "Dumfries" "Jacksonville" "Addison" ...
##  $ State      : chr  "CA" "VA" "FL" "TX" ...
##  - attr(*, ".internal.selfref")=<externalptr>
summary(inc)
##       Rank          Name            Growth_Rate         Revenue         
##  Min.   :   1   Length:5001        Min.   :  0.340   Min.   :2.000e+06  
##  1st Qu.:1252   Class :character   1st Qu.:  0.770   1st Qu.:5.100e+06  
##  Median :2502   Mode  :character   Median :  1.420   Median :1.090e+07  
##  Mean   :2502                      Mean   :  4.612   Mean   :4.822e+07  
##  3rd Qu.:3751                      3rd Qu.:  3.290   3rd Qu.:2.860e+07  
##  Max.   :5000                      Max.   :421.480   Max.   :1.010e+10  
##                                                                         
##    Industry           Employees           City          
##  Length:5001        Min.   :    1.0   Length:5001       
##  Class :character   1st Qu.:   25.0   Class :character  
##  Mode  :character   Median :   53.0   Mode  :character  
##                     Mean   :  232.7                     
##                     3rd Qu.:  132.0                     
##                     Max.   :66803.0                     
##                     NA's   :12                          
##     State          
##  Length:5001       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
colSums(is.na(inc))
##        Rank        Name Growth_Rate     Revenue    Industry   Employees 
##           0           0           0           0           0          12 
##        City       State 
##           0           0

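The dataset has 8 variables and 5,001 records. The Employees column has 12 missing values; no other column has any. Each record is one company, so the summary statistics are per company rather than per industry: the average company has about 233 employees (median 53), the mean Growth_Rate is about 4.6 (median 1.42) in the units of that column, and mean revenue is about $48.2 million (median $10.9 million). In all three columns the mean sits well above the median, which points to strongly right-skewed distributions.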
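Because Employees contains missing values, any statistic on that column has to handle NA explicitly. The following is a minimal sketch on a made-up data frame (`toy` and its values are illustrative assumptions, not rows from the dataset) showing how NA counts are obtained, how `mean()` behaves with and without `na.rm`, and that `complete.cases()` and `na.omit()` keep the same rows.

```r
# Toy data, made up for illustration (not rows from inc5000_data.csv)
toy <- data.frame(
  Name      = c("A", "B", "C", "D"),
  Employees = c(10L, NA, 30L, 50L),
  Revenue   = c(1e6, 2e6, 3e6, 4e6)
)

# Count missing values per column, in the same way as colSums(is.na(inc))
na_counts <- colSums(is.na(toy))

# mean() returns NA when the input contains NA, unless na.rm = TRUE
mean_raw   <- mean(toy$Employees)
mean_clean <- mean(toy$Employees, na.rm = TRUE)

# complete.cases() gives a logical row mask; na.omit() drops the same rows
kept_cc <- toy[complete.cases(toy), ]
kept_no <- na.omit(toy)

print(na_counts)
print(c(raw = mean_raw, clean = mean_clean))
```

`summary()` drops NA values before computing its statistics and reports the NA count separately, which is why the Employees mean above is still defined.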

1. Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.

# Count companies per state
q1 <- inc %>% group_by(State) %>% summarise(Count = n())

# Horizontal bars (coord_flip) suit a portrait-oriented screen
ggplot(q1, aes(x = State, y = Count)) +
  geom_bar(stat = "identity", width = 0.5, fill = "grey") +
  coord_flip() +
  geom_text(aes(label = Count), size = 2.5) +
  ylab("Count of Companies") + xlab("State") +
  ggtitle("Distribution of Companies by State") +
  theme(text = element_text(size = 8),
        panel.background = element_rect(fill = "white", colour = "white"))

California has the largest number of companies in the dataset.

2. Let's dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

q2 <- q1 %>% arrange(-Count) 
top_n(q2,3)
## Selecting by Count
## # A tibble: 3 x 2
##   State Count
##   <chr> <int>
## 1 CA      701
## 2 TX      387
## 3 NY      311
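# Keep NY rows with complete data (na.omit() drops any row containing an NA)
q2 <- inc %>% filter(State == "NY") %>% na.omit()

# Box plot per industry; coord_cartesian() zooms the y-axis without dropping data
ggplot(q2, aes(x = Industry, y = Employees)) +
  stat_boxplot(geom = "errorbar") +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 1000)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.3)) +
  ggtitle("NY Employee Count by Industry")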

It is clear from the graph that there are outliers in Human Resources, IT Services, Advertising & Marketing, and a few other industries. By default, geom_boxplot() in ggplot2 uses coef = 1.5: each whisker extends to the most extreme observation that lies within 1.5 times the interquartile range of the first or third quartile, and any point beyond that is drawn as an outlier. Because coord_cartesian() limits the y-axis to the range 0 to 1,000, the most extreme outliers fall outside the visible area but still contribute to the box statistics. A box plot shows the median rather than the mean, and the ranges differ widely across industries, so a bar plot of the mean is presented below.
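The 1.5 × IQR rule can be worked through by hand. This is a small sketch on made-up employee counts (the vector `x` is an assumption for illustration), using `quantile()` with its default settings to compute the quartiles, fences, whisker ends, and outliers.

```r
# Made-up employee counts with one extreme value (illustrative only)
x <- c(10, 20, 30, 40, 50, 60, 70, 80, 1000)

# First and third quartiles, and the interquartile range
q   <- quantile(x, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Fences sit 1.5 * IQR beyond the quartiles
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

# Anything outside the fences is flagged as an outlier
outliers <- x[x < lower_fence | x > upper_fence]

print(c(Q1 = q[1], Q3 = q[2], IQR = iqr))
print(whiskers)
print(outliers)
```

Here the quartiles are 30 and 70, so the fences are at -30 and 130; the whiskers stop at 10 and 80, and only the value 1000 is flagged.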

# Mean employees per industry for NY companies with complete data
q2a <- inc %>% filter(State == "NY") %>% na.omit() %>%
  group_by(Industry) %>% summarize(avg = round(mean(Employees), 1))

ggplot(q2a, aes(x = reorder(Industry, avg), y = avg)) +
  geom_bar(stat = "identity", width = 0.3, fill = "grey") +
  coord_flip() +
  geom_text(aes(label = avg), size = 2.5) +
  ylab("Average Employees") + xlab("Industry") +
  ggtitle("Average Employees by Industry NY") +
  theme(text = element_text(size = 8),
        panel.background = element_rect(fill = "white", colour = "white"))
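Since the mean is sensitive to the outliers noted earlier, it can help to compare it with the median for each industry. Here is a rough sketch with made-up numbers (the industries "A" and "B" and their counts are assumptions, not values from the dataset) that shows how far the two can diverge.

```r
# Made-up employee counts for two hypothetical industries
toy <- data.frame(
  Industry  = c("A", "A", "A", "A", "B", "B", "B"),
  Employees = c(10, 20, 30, 5000, 40, 50, 60)
)

# Mean and median per industry using base R
avg <- tapply(toy$Employees, toy$Industry, mean)
med <- tapply(toy$Employees, toy$Industry, median)

print(rbind(mean = avg, median = med))
```

In industry "A" a single large company lifts the mean to 1265 while the median stays at 25; in industry "B", which has no outlier, both equal 50.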

3. Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

# Revenue per employee, pooled within each industry (total revenue / total employees)
q3 <- inc %>% na.omit() %>% group_by(Industry) %>%
  summarise(Revenue_Gen = sum(Revenue) / sum(Employees))

ggplot(q3, aes(x = reorder(Industry, Revenue_Gen), y = Revenue_Gen)) +
  geom_bar(stat = "identity", fill = "grey") +
  coord_flip() +
  ggtitle("Revenue per Employee by Industry") +
  geom_text(aes(label = dollar_format()(Revenue_Gen)), size = 2.5) +
  ylab("") + xlab("") + theme_minimal()

The graph shows that the Computer Hardware industry generates the most revenue per employee, followed by Energy and Construction. As an investor, I would next review these figures for the region of my investment; the graph could also be broken down by State and Industry to reveal additional trends.
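The figure charted above is a pooled ratio, `sum(Revenue) / sum(Employees)`, which weights large companies more heavily than small ones. A per-company ratio averaged within each industry can give a different answer. The sketch below uses made-up numbers (industries "X" and "Y" and all values are assumptions for illustration) to show the difference.

```r
# Made-up companies in two hypothetical industries
toy <- data.frame(
  Industry  = c("X", "X", "Y", "Y"),
  Revenue   = c(1e6, 9e6, 4e6, 4e6),
  Employees = c(10, 30, 20, 80)
)

# Pooled ratio: total revenue divided by total employees per industry
pooled <- tapply(toy$Revenue, toy$Industry, sum) /
  tapply(toy$Employees, toy$Industry, sum)

# Per-company ratio, then averaged within each industry
per_company <- toy$Revenue / toy$Employees
mean_ratio  <- tapply(per_company, toy$Industry, mean)

print(rbind(pooled = pooled, mean_of_ratios = mean_ratio))
```

For "X" the pooled figure is 250,000 against a mean of ratios of 200,000; for "Y" it is 80,000 against 125,000. The per-company ratios are also what a box plot would need in order to show the distribution within each industry.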