Principles of Data Visualization and Introduction to ggplot2
I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:
inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)
And lets preview this data:
head(inc)
## Rank Name Growth_Rate Revenue
## 1 1 Fuhu 421.48 1.179e+08
## 2 2 FederalConference.com 248.31 4.960e+07
## 3 3 The HCI Group 245.45 2.550e+07
## 4 4 Bridger 233.08 1.900e+09
## 5 5 DataXu 213.37 8.700e+07
## 6 6 MileStone Community Builders 179.38 4.570e+07
## Industry Employees City State
## 1 Consumer Products & Services 104 El Segundo CA
## 2 Government Services 51 Dumfries VA
## 3 Health 132 Jacksonville FL
## 4 Energy 50 Addison TX
## 5 Advertising & Marketing 220 Boston MA
## 6 Real Estate 63 Austin TX
#disable scientific notation
options(scipen=99999)
summary(inc)
## Rank Name Growth_Rate
## Min. : 1 (Add)ventures : 1 Min. : 0.340
## 1st Qu.:1252 @Properties : 1 1st Qu.: 0.770
## Median :2502 1-Stop Translation USA: 1 Median : 1.420
## Mean :2502 110 Consulting : 1 Mean : 4.612
## 3rd Qu.:3751 11thStreetCoffee.com : 1 3rd Qu.: 3.290
## Max. :5000 123 Exteriors : 1 Max. :421.480
## (Other) :4995
## Revenue Industry
## Min. : 2000000 IT Services : 733
## 1st Qu.: 5100000 Business Products & Services: 482
## Median : 10900000 Advertising & Marketing : 471
## Mean : 48222535 Health : 355
## 3rd Qu.: 28600000 Software : 342
## Max. :10100000000 Financial Services : 260
## (Other) :2358
## Employees City State
## Min. : 1.0 New York : 160 CA : 701
## 1st Qu.: 25.0 Chicago : 90 TX : 387
## Median : 53.0 Austin : 88 NY : 311
## Mean : 232.7 Houston : 76 VA : 283
## 3rd Qu.: 132.0 San Francisco: 75 FL : 282
## Max. :66803.0 Atlanta : 74 IL : 273
## NA's :12 (Other) :4438 (Other):2764
Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:
Calculate percent missing data to learn how complete/incomplete the data is. Granted we can see the number of NA’s in the summary but sometimes that detail is easily missed.
# Insert your code here, create more chunks as necessary
library(dplyr)
inc %>% summarize_all(funs(sum(is.na(.)) / length(.)))
## Rank Name Growth_Rate Revenue Industry Employees City State
## 1 0 0 0 0 0 0.00239952 0 0
Growth rate, Employees, and revenue have some very large maximums, consideraly larger than their respective means. Lets identify those rows to get a better picture.
Growth Rate (Where does the max growth rate occur)
## Rank Name Growth_Rate Revenue Industry Employees
## 1 1 Fuhu 421.48 117900000 Consumer Products & Services 104
## City State
## 1 El Segundo CA
Fuhu inc has the largest growth rate. We can do some additional research into this CA based company. It’s a company that focuses on android tablet learning for childen. https://www.bloomberg.com/research/stocks/private/snapshot.asp?privcapId=58913811
Revenue (Where does the max revenue occur?)
## Rank Name Growth_Rate Revenue Industry Employees
## 4788 4788 CDW 0.41 10100000000 Computer Hardware 6800
## City State
## 4788 Vernon Hills IL
CDW is an IL based tech company that provides hardware for education, business, and government. https://en.wikipedia.org/wiki/CDW
Employees (Where does the max number of Employees occur?)
## Rank Name Growth_Rate Revenue
## 2344 2345 Integrity staffing Solutions 1.55 278200000
## Industry Employees City State
## 2344 Human Resources 66803 Wilmington DE
Integrity Staffing solutions based in DE has the maximum number of employees. This company allocates contract resources to other companies. https://www.bloomberg.com/research/stocks/private/snapshot.asp?privcapId=4442533
Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.
# Answer Question 1 here
library(sqldf)
library(ggplot2)
#sqldf is my go to for quick data manipulation
q1<-sqldf("select
State, count(distinct Name) as num_companies
from inc
group by State")
ggplot(q1, aes(x=reorder(State,num_companies),round(num_companies, digits=2)))+
geom_bar(stat="identity", fill="deepskyblue3")+
geom_text(aes(label=round(num_companies, digits=2)), vjust=0.5, size=5, position=position_dodge(width = 1), hjust=1.5)+
theme_minimal()+
theme(axis.text.x=element_text(size=12, vjust=0.5))+
theme(axis.text.y=element_text(size=12, vjust=0.5))+
theme(plot.title = element_text(size=18))+
labs( x="State", y="Number of Unique Companies")+
coord_flip()+
labs(caption="Inc. Magazine Data")+
ggtitle("Distribution of Unique Companies by State")
Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.
# we can see that NY has the third most number of companies. Lets subset NY avg employees by industry, complete cases only
ny<-sqldf("select
Industry, avg(Employees) as avg_emp
from inc
where State in ('NY')
and Name is not null
and Employees is not null
group by Industry")
#the extreme values observed in the summary lead me to believe that there are outliers in our data. Lets check
library(outliers)
#returns true is a value is flagged as an outlier. Save this to some temp variable to then remove from the data
remove<-outlier(ny$avg_emp,logical=TRUE)
#identify any rows that returned true for number of employees that are outliers
remove_rows = which(remove==TRUE,arr.ind=TRUE)
#remove those rows
ny2 = ny[-remove_rows,]
#delete temp table
rm(ny)
#visualize
#we also round to the nearest intger since you can't have a fractional employee
ggplot(ny2, aes(x=reorder(Industry,avg_emp),round(avg_emp, digits=0)))+
geom_bar(stat="identity", fill="deepskyblue3")+
geom_text(aes(label=round(avg_emp, digits=0)), vjust=0.5, size=4, position=position_dodge(width = 1), hjust=1.5)+
theme_minimal()+
theme(axis.text.x=element_text(size=12, vjust=0.5))+
theme(axis.text.y=element_text(size=12, vjust=0.5))+
theme(plot.title = element_text(size=18))+
labs( x="Industry", y="Avg Employees")+
coord_flip()+
labs(caption="Inc. Magazine Data")+
ggtitle("Distribution of Avg Number of Employees by Industry in NY")
Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.
# Answer Question 3 here
#we need to divide revenue by employees to get revenue per employee
rev<-sqldf("
select
Industry, rev/emp as per_emp
from
(
select
Industry, sum(Revenue) as rev, sum(Employees) as emp
from
inc
where Industry is not null
group by Industry
)
group by Industry")
#visualize
ggplot(rev, aes(x=reorder(Industry,per_emp),round(per_emp, digits=2)))+
geom_bar(stat="identity", fill="deepskyblue3")+
geom_text(aes(label=round(per_emp, digits=2)), vjust=1, size=5, position=position_dodge(width = 2), hjust=1)+
theme_minimal()+
theme(axis.text.x=element_text(size=12, vjust=0.5))+
theme(axis.text.y=element_text(size=12, vjust=0.5))+
theme(plot.title = element_text(size=18))+
labs( x="Industry", y="Revenue per Employee")+
coord_flip()+
labs(caption="Inc. Magazine Data")+
ggtitle("Revenue per Employee by Industry")