# Libraries used
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. Let's read this in:

inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)

And let's preview this data:

head(inc)
##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX
summary(inc)
##       Rank          Name            Growth_Rate         Revenue         
##  Min.   :   1   Length:5001        Min.   :  0.340   Min.   :2.000e+06  
##  1st Qu.:1252   Class :character   1st Qu.:  0.770   1st Qu.:5.100e+06  
##  Median :2502   Mode  :character   Median :  1.420   Median :1.090e+07  
##  Mean   :2502                      Mean   :  4.612   Mean   :4.822e+07  
##  3rd Qu.:3751                      3rd Qu.:  3.290   3rd Qu.:2.860e+07  
##  Max.   :5000                      Max.   :421.480   Max.   :1.010e+10  
##                                                                         
##    Industry           Employees           City              State          
##  Length:5001        Min.   :    1.0   Length:5001        Length:5001       
##  Class :character   1st Qu.:   25.0   Class :character   Class :character  
##  Mode  :character   Median :   53.0   Mode  :character   Mode  :character  
##                     Mean   :  232.7                                        
##                     3rd Qu.:  132.0                                        
##                     Max.   :66803.0                                        
##                     NA's   :12

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

#Hi! I might be overthinking this question, but let's explore some non-visual information

#First, let's list the unique industries in the data set. There are twenty-five industries in total
unique(inc$Industry)
##  [1] "Consumer Products & Services" "Government Services"         
##  [3] "Health"                       "Energy"                      
##  [5] "Advertising & Marketing"      "Real Estate"                 
##  [7] "Financial Services"           "Retail"                      
##  [9] "Software"                     "Computer Hardware"           
## [11] "Logistics & Transportation"   "Food & Beverage"             
## [13] "IT Services"                  "Business Products & Services"
## [15] "Education"                    "Construction"                
## [17] "Manufacturing"                "Telecommunications"          
## [19] "Security"                     "Human Resources"             
## [21] "Travel & Hospitality"         "Media"                       
## [23] "Environmental Services"       "Engineering"                 
## [25] "Insurance"
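#Next, find the largest single-company headcount in each industry. This pairs the maximum Employees value from summary() above (66,803) with its industry, Human Resources
#Note: max() returns NA for any industry that contains a missing Employees value, and arrange() sorts those rows last
inc %>% group_by(Industry) %>% summarise(max_employees = max(Employees)) %>% arrange(desc(max_employees))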
## # A tibble: 25 x 2
##    Industry                     max_employees
##    <chr>                                <int>
##  1 Human Resources                      66803
##  2 Security                             20000
##  3 Consumer Products & Services         13200
##  4 Engineering                          10000
##  5 Computer Hardware                     6800
##  6 Construction                          6549
##  7 Retail                                5821
##  8 Advertising & Marketing               5637
##  9 Environmental Services                5347
## 10 Travel & Hospitality                  4878
## # ... with 15 more rows
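
Since summary() reports 12 missing Employees values, it may also help to count missing values per column and to pass na.rm = TRUE when taking a maximum. The following is a minimal sketch on a toy data frame (`toy` and its values are hypothetical stand-ins for `inc`, not the real data):

```r
library(dplyr)

# Toy stand-in for `inc`; the values are hypothetical
toy <- data.frame(
  Industry  = c("Health", "Health", "Energy", "Energy", "Retail"),
  Employees = c(132, NA, 50, 400, 63),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Largest single-company headcount per industry, ignoring missing values
max_by_industry <- toy %>%
  group_by(Industry) %>%
  summarise(max_employees = max(Employees, na.rm = TRUE)) %>%
  arrange(desc(max_employees))

na_counts
max_by_industry
```

With na.rm = TRUE, an industry that has a missing headcount still reports the maximum of its known values instead of NA.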

Question 1

Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.

#We want to find out how many companies are in each state
#A bar plot shows the number of entries per state, and coord_flip() puts the states on the vertical axis to suit a "portrait" oriented screen
st_count <- inc %>% group_by(State) %>% summarise(count = n())
ggplot(inc, aes(x = State)) + geom_bar() + theme_classic() + coord_flip()
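
One possible refinement is to sort the bars by count so the ranking of states is readable at a glance. This is only a sketch under assumptions: `toy` is a hypothetical stand-in for `inc`, and reorder() is used to order the State factor by its count.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for `inc`; the values are hypothetical
toy <- data.frame(
  State = c("CA", "CA", "CA", "TX", "TX", "NY"),
  stringsAsFactors = FALSE
)

# One row per state with its number of companies
st_count <- toy %>% count(State, name = "count")

# reorder() sorts the states by count; after coord_flip() the largest bar is on top
p <- ggplot(st_count, aes(x = reorder(State, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "State", y = "Number of companies") +
  theme_classic()

p
```

Plotting the pre-computed counts with geom_col() should give the same bar heights as geom_bar() on the raw rows; only the ordering differs.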

Question 2

Let's dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

# Let's break this question up, as there are multiple parts

#Finding the state with the third most companies
#We can use dplyr to group by state and find its total company count with n()
st_count<-inc%>%group_by(State)%>%summarize(count=n())%>%arrange(desc(count))
head(st_count)
## # A tibble: 6 x 2
##   State count
##   <chr> <int>
## 1 CA      701
## 2 TX      387
## 3 NY      311
## 4 VA      283
## 5 FL      282
## 6 IL      273
#NY has the third-most companies (Knew we were in the top three!)
#Now, let's find the average employment by industry
#We only want complete cases, so let's create a dataset that has only NY entries with complete cases
NY_ind<-inc%>%filter(State=="NY")
NY_ind<-NY_ind%>%filter(complete.cases(.))
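
To see what the complete.cases() filter does, here is a small sketch on a toy data frame (`toy` and its values are hypothetical, not rows from `inc`). A row is dropped if any of its columns is NA.

```r
library(dplyr)

# Toy stand-in for `inc`; the values are hypothetical
toy <- data.frame(
  State     = c("NY", "NY", "NY", "CA"),
  Industry  = c("Media", "Software", "Media", "Retail"),
  Employees = c(40, NA, 15, 200),
  stringsAsFactors = FALSE
)

# Keep NY rows, then drop any row with a missing value in any column
toy_ny <- toy %>%
  filter(State == "NY") %>%
  filter(complete.cases(.))

toy_ny
```

The Software row is removed because its Employees value is missing, leaving the two Media rows.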


#Boxplot of all industries in NY: geom_boxplot() shows the median employee count per industry, its interquartile range, and the outliers. The x axis is limited to 3,000 because IT has the largest outlier, which isn't relevant to typical employment. Note that xlim() drops rows beyond the limit before the boxplot statistics are computed (see the warning below)
ggplot(NY_ind, aes(y = Industry, x = Employees)) + geom_boxplot() + theme_classic() + xlim(0, 3000) + ggtitle("Employee count by industry in NY")
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
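
The warning above comes from xlim(), which discards out-of-range rows before the boxplot statistics are computed. An alternative worth considering is coord_cartesian(), which only zooms the view and keeps every row in the calculation. The sketch below uses a hypothetical toy data frame (`toy_ny`) and also tabulates mean, median, and IQR per industry, since a boxplot marks the median and not the mean.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the NY subset; the values are hypothetical
toy_ny <- data.frame(
  Industry  = rep(c("Media", "Software"), each = 5),
  Employees = c(10, 20, 30, 40, 5000, 15, 25, 35, 45, 55),
  stringsAsFactors = FALSE
)

# Mean, median, and interquartile range per industry
emp_summary <- toy_ny %>%
  group_by(Industry) %>%
  summarise(
    mean_emp   = mean(Employees),
    median_emp = median(Employees),
    iqr_emp    = IQR(Employees)
  )

# coord_cartesian() zooms without removing the 5,000-employee outlier from the stats
p <- ggplot(toy_ny, aes(y = Industry, x = Employees)) +
  geom_boxplot() +
  coord_cartesian(xlim = c(0, 100)) +
  theme_classic()

emp_summary
p
```

In this toy example the single large company pulls the Media mean far above its median, which is why the median is the more robust figure to report when outliers are present.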

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

#We will assume this question covers all states and not just NY, as it does not specify

#First, let's only keep complete cases
inc<-inc%>%filter(complete.cases(.))

#Let's find revenue per employee by industry: group by Industry, divide total revenue by total employees, and sort from highest to lowest
top_rev <- inc %>% group_by(Industry) %>% summarise(avg_revEmp = sum(Revenue) / sum(Employees)) %>% arrange(desc(avg_revEmp))

#The three industries with the highest revenue per employee are Computer Hardware, Energy, and Construction
head(top_rev)
## # A tibble: 6 x 2
##   Industry                     avg_revEmp
##   <chr>                             <dbl>
## 1 Computer Hardware              1223564.
## 2 Energy                          520921.
## 3 Construction                    452741.
## 4 Logistics & Transportation      371001.
## 5 Consumer Products & Services    328972.
## 6 Insurance                       318558.
#We can also visualize revenue per employee by industry with a bar plot
ggplot(top_rev, aes(x = Industry, y = avg_revEmp)) + geom_col() + theme_classic() + coord_flip() + ggtitle("Revenue per employee by Industry")
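
The bar plot shows one pooled figure per industry. Since the question also asks for the distribution per industry, one option is to compute revenue per employee for each company and draw a boxplot on a log scale. This is a sketch on hypothetical toy data (`toy` and the column `rev_per_emp` are illustrative names), not the original analysis.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for `inc`; the values are hypothetical
toy <- data.frame(
  Industry  = rep(c("Energy", "Software"), each = 3),
  Revenue   = c(1e6, 4e6, 9e6, 2e6, 2e6, 6e6),
  Employees = c(10, 20, 30, 40, 20, 40),
  stringsAsFactors = FALSE
)

# Revenue per employee for each individual company
toy <- toy %>% mutate(rev_per_emp = Revenue / Employees)

# Per-industry median of the company-level ratio, next to the pooled ratio
rev_summary <- toy %>%
  group_by(Industry) %>%
  summarise(
    median_rev_per_emp = median(rev_per_emp),
    pooled_rev_per_emp = sum(Revenue) / sum(Employees)
  )

# Boxplot per industry, ordered by median, on a log scale to tame the skew
p <- ggplot(toy, aes(x = rev_per_emp,
                     y = reorder(Industry, rev_per_emp, FUN = median))) +
  geom_boxplot() +
  scale_x_log10() +
  labs(x = "Revenue per employee (log scale)", y = "Industry") +
  theme_classic()

rev_summary
p
```

The pooled ratio weights large companies more heavily, while the median of company-level ratios describes a typical company, so the two figures can rank industries differently.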