#Any libraries used
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Principles of Data Visualization and Introduction to ggplot2
I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. Let's read this in:
inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header = TRUE)
And let's preview this data:
head(inc)
## Rank Name Growth_Rate Revenue
## 1 1 Fuhu 421.48 1.179e+08
## 2 2 FederalConference.com 248.31 4.960e+07
## 3 3 The HCI Group 245.45 2.550e+07
## 4 4 Bridger 233.08 1.900e+09
## 5 5 DataXu 213.37 8.700e+07
## 6 6 MileStone Community Builders 179.38 4.570e+07
## Industry Employees City State
## 1 Consumer Products & Services 104 El Segundo CA
## 2 Government Services 51 Dumfries VA
## 3 Health 132 Jacksonville FL
## 4 Energy 50 Addison TX
## 5 Advertising & Marketing 220 Boston MA
## 6 Real Estate 63 Austin TX
summary(inc)
## Rank Name Growth_Rate Revenue
## Min. : 1 Length:5001 Min. : 0.340 Min. :2.000e+06
## 1st Qu.:1252 Class :character 1st Qu.: 0.770 1st Qu.:5.100e+06
## Median :2502 Mode :character Median : 1.420 Median :1.090e+07
## Mean :2502 Mean : 4.612 Mean :4.822e+07
## 3rd Qu.:3751 3rd Qu.: 3.290 3rd Qu.:2.860e+07
## Max. :5000 Max. :421.480 Max. :1.010e+10
##
## Industry Employees City State
## Length:5001 Min. : 1.0 Length:5001 Length:5001
## Class :character 1st Qu.: 25.0 Class :character Class :character
## Mode :character Median : 53.0 Mode :character Mode :character
## Mean : 232.7
## 3rd Qu.: 132.0
## Max. :66803.0
## NA's :12
Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:
# I might be overthinking this question, but let's explore some non-visual information.
# First, let's list the unique industries in the data set. There are 25 industries in total.
unique(inc$Industry)
## [1] "Consumer Products & Services" "Government Services"
## [3] "Health" "Energy"
## [5] "Advertising & Marketing" "Real Estate"
## [7] "Financial Services" "Retail"
## [9] "Software" "Computer Hardware"
## [11] "Logistics & Transportation" "Food & Beverage"
## [13] "IT Services" "Business Products & Services"
## [15] "Education" "Construction"
## [17] "Manufacturing" "Telecommunications"
## [19] "Security" "Human Resources"
## [21] "Travel & Hospitality" "Media"
## [23] "Environmental Services" "Engineering"
## [25] "Insurance"
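# Next, find the largest employee count reported by any single company in each industry.
# This pairs the maximum of 66,803 employees from the summary above with its industry, Human Resources.
# Industries that contain a missing Employees value return NA here and sort to the end.
inc %>% group_by(Industry) %>% summarise(max_employees = max(Employees)) %>% arrange(desc(max_employees))
## # A tibble: 25 x 2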
## Industry max_employees
## <chr> <int>
## 1 Human Resources 66803
## 2 Security 20000
## 3 Consumer Products & Services 13200
## 4 Engineering 10000
## 5 Computer Hardware 6800
## 6 Construction 6549
## 7 Retail 5821
## 8 Advertising & Marketing 5637
## 9 Environmental Services 5347
## 10 Travel & Hospitality 4878
## # ... with 15 more rows
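Two other quick non-visual checks are counting missing values per column and counting companies per industry. The sketch below uses a small made-up data frame named `toy` (hypothetical, not the real `inc` data) so it runs on its own; the same calls should work on `inc`.

```r
library(dplyr)

# Hypothetical stand-in for `inc`; the values are made up for illustration.
toy <- data.frame(
  Industry  = c("Health", "Health", "Energy", "Retail"),
  Employees = c(120, NA, 50, 300),
  Revenue   = c(2.5e7, 4.0e6, 1.9e9, 8.0e6)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Number of companies per industry, largest first
industry_counts <- toy %>%
  count(Industry, sort = TRUE, name = "companies")

print(na_counts)
print(industry_counts)
```

On the real data, the first check would be expected to report the 12 missing `Employees` values that `summary(inc)` already shows.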
Question 1
Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.
# We want to find out how many companies are in each state.
# A bar plot shows the number of entries per state, and coord_flip() puts the states on the vertical axis, which suits a "portrait"-oriented screen.
st_count <- inc %>% group_by(State) %>% summarise(count = n())
ggplot(inc, aes(x = State)) + geom_bar() + theme_classic() + coord_flip()
Question 2
Let's dig in on the state with the 3rd most companies in the data set.
Imagine you work for the state and are interested in how many people are
employed by companies in different industries. Create a plot that shows
the average and/or median employment by industry for companies in this
state (only use cases with full data, use R’s
complete.cases() function.) In addition to this, your graph
should show how variable the ranges are, and you should deal with
outliers.
# Let's break this question up, as there are multiple parts.
# First, find the state with the third most companies.
# We can use dplyr to group by state and count its companies with n().
st_count <- inc %>% group_by(State) %>% summarize(count = n()) %>% arrange(desc(count))
head(st_count)
## # A tibble: 6 x 2
## State count
## <chr> <int>
## 1 CA 701
## 2 TX 387
## 3 NY 311
## 4 VA 283
## 5 FL 282
## 6 IL 273
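Rather than reading the third row off the table by eye, the state can also be pulled out in code. This is a minimal sketch: `toy_counts` is a hypothetical data frame that copies the first four rows of the table above, and it assumes there are no ties in the counts.

```r
library(dplyr)

# Hypothetical copy of the first four rows of the st_count table above
toy_counts <- data.frame(
  State = c("CA", "TX", "NY", "VA"),
  count = c(701, 387, 311, 283)
)

# Sort by count (largest first), keep the third row, and extract the state code
third_state <- toy_counts %>%
  arrange(desc(count)) %>%
  slice(3) %>%
  pull(State)

print(third_state)
```

The resulting value could then be used in the filter step in place of a hard-coded state code.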
# NY has the third most companies (knew we were in the top three!)
# Now, let's find the average employment by industry.
# We only want complete cases, so let's create a data set that has only NY entries with complete cases.
NY_ind<-inc%>%filter(State=="NY")
NY_ind<-NY_ind%>%filter(complete.cases(.))
# All NY industries are plotted below. geom_boxplot() shows the median employee count per industry along with its spread (quartiles and outliers).
# The x-axis is limited to 3,000 because IT has the largest outlier, which isn't relevant to typical employment. Note that xlim() drops the rows outside the limit, which is what the warning below reports.
ggplot(NY_ind, aes(y = Industry, x = Employees)) + geom_boxplot() + theme_classic() + xlim(0, 3000) + ggtitle("Employee count by industry in NY")
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
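One alternative worth considering for the outliers: `xlim()` removes out-of-range rows before the boxplot statistics are computed, while `coord_cartesian()` only zooms the view and leaves the statistics based on all rows. The sketch below shows the zoom approach on a made-up data frame named `toy_ny` (hypothetical values, not the real NY data).

```r
library(ggplot2)

# Hypothetical stand-in for NY_ind; values are made up, with one extreme outlier
toy_ny <- data.frame(
  Industry  = rep(c("IT Services", "Media"), each = 5),
  Employees = c(20, 45, 60, 150, 32000, 10, 25, 40, 80, 200)
)

# Medians computed on all rows, including the outlier
medians <- tapply(toy_ny$Employees, toy_ny$Industry, median)
print(medians)

# coord_cartesian() zooms the x-axis without dropping any rows,
# so the boxes are still computed from the full data
p_zoom <- ggplot(toy_ny, aes(x = Employees, y = Industry)) +
  geom_boxplot() +
  coord_cartesian(xlim = c(0, 3000)) +
  theme_classic() +
  ggtitle("Employee count by industry (zoomed)")

print(p_zoom)
```

Which approach is better depends on whether the extreme values should still influence the quartiles; the median is robust to them either way.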
Question 3
Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.
# We will assume this question covers the whole data set and not just NY, as it is not specified.
# First, let's keep only complete cases.
inc <- inc %>% filter(complete.cases(.))
# Let's compute revenue per employee for each industry as total revenue divided by total employees, then sort from highest to lowest.
top_rev <- inc %>% group_by(Industry) %>% summarise(avg_revEmp = sum(Revenue) / sum(Employees)) %>% arrange(desc(avg_revEmp))
# The three industries with the highest revenue per employee are Computer Hardware, Energy, and Construction.
head(top_rev)
## # A tibble: 6 x 2
## Industry avg_revEmp
## <chr> <dbl>
## 1 Computer Hardware 1223564.
## 2 Energy 520921.
## 3 Construction 452741.
## 4 Logistics & Transportation 371001.
## 5 Consumer Products & Services 328972.
## 6 Insurance 318558.
# We can also visualize revenue per employee by industry with a bar plot.
ggplot(top_rev, aes(x = Industry, y = avg_revEmp)) + geom_bar(stat = "identity") + theme_classic() + coord_flip() + ggtitle("Revenue per employee by Industry")
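Since the question also asks for the distribution per industry, one possible complement to the bar plot is a boxplot of revenue per employee computed per company. This is only a sketch under assumptions: `toy`, `per_company`, and `rev_per_emp` are hypothetical names with made-up values, and the log scale is a design choice, not something the assignment requires.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for `inc`; values are made up for illustration
toy <- data.frame(
  Industry  = rep(c("Energy", "Software"), each = 4),
  Revenue   = c(1.9e9, 5.0e7, 2.0e7, 8.0e6, 1.2e7, 9.0e6, 3.0e7, 4.0e6),
  Employees = c(50, 100, 80, 40, 60, 90, 100, 20)
)

# Revenue per employee for each company
per_company <- toy %>%
  mutate(rev_per_emp = Revenue / Employees)

# Median revenue per employee by industry, highest first
industry_median <- per_company %>%
  group_by(Industry) %>%
  summarise(median_rev_per_emp = median(rev_per_emp)) %>%
  arrange(desc(median_rev_per_emp))

print(industry_median)

# Boxplot per industry, ordered by median; the log scale keeps
# very large companies from flattening the other boxes
p_dist <- ggplot(per_company,
                 aes(x = rev_per_emp,
                     y = reorder(Industry, rev_per_emp, FUN = median))) +
  geom_boxplot() +
  scale_x_log10() +
  theme_classic() +
  labs(x = "Revenue per employee (log scale)", y = "Industry")

print(p_dist)
```

Note that the median of per-company ratios can rank industries differently from total revenue divided by total employees, because the latter is dominated by the largest companies.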