# Required libraries
library(dplyr)
library(ggplot2)
library(scales)
library(plotly)
# Turn off scientific notation
options(scipen=999)
I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:
inc <- read.csv(paste0("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/",
"master/module1/Data/inc5000_data.csv"), header= TRUE)
And lets preview this data:
knitr::kable(head(inc))
Rank | Name | Growth_Rate | Revenue | Industry | Employees | City | State |
---|---|---|---|---|---|---|---|
1 | Fuhu | 421.48 | 117900000 | Consumer Products & Services | 104 | El Segundo | CA |
2 | FederalConference.com | 248.31 | 49600000 | Government Services | 51 | Dumfries | VA |
3 | The HCI Group | 245.45 | 25500000 | Health | 132 | Jacksonville | FL |
4 | Bridger | 233.08 | 1900000000 | Energy | 50 | Addison | TX |
5 | DataXu | 213.37 | 87000000 | Advertising & Marketing | 220 | Boston | MA |
6 | MileStone Community Builders | 179.38 | 45700000 | Real Estate | 63 | Austin | TX |
summary(inc)
## Rank Name Growth_Rate
## Min. : 1 110 Consulting : 1 Min. : 0.340
## 1st Qu.:1252 11thStreetCoffee.com : 1 1st Qu.: 0.770
## Median :2502 123 Exteriors : 1 Median : 1.420
## Mean :2502 1st American Systems and Services: 1 Mean : 4.612
## 3rd Qu.:3751 1st Equity : 1 3rd Qu.: 3.290
## Max. :5000 1-Stop Translation USA : 1 Max. :421.480
## (Other) :4995
## Revenue Industry
## Min. : 2000000 IT Services : 733
## 1st Qu.: 5100000 Business Products & Services: 482
## Median : 10900000 Advertising & Marketing : 471
## Mean : 48222535 Health : 355
## 3rd Qu.: 28600000 Software : 342
## Max. :10100000000 Financial Services : 260
## (Other) :2358
## Employees City State
## Min. : 1.0 New York : 160 CA : 701
## 1st Qu.: 25.0 Chicago : 90 TX : 387
## Median : 53.0 Austin : 88 NY : 311
## Mean : 232.7 Houston : 76 VA : 283
## 3rd Qu.: 132.0 San Francisco: 75 FL : 282
## Max. :66803.0 Atlanta : 74 IL : 273
## NA's :12 (Other) :4438 (Other):2764
Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data.
Maximum number of employees seems high. Consider top ten companies based on employee count.
knitr::kable(head(inc[order(-inc$Employees),c(2,5:8)],10))
Name | Industry | Employees | City | State | |
---|---|---|---|---|---|
2344 | Integrity staffing Solutions | Human Resources | 66803 | Wilmington | DE |
4577 | Sutherland Global Services | Business Products & Services | 32000 | Pittsford | NY |
1868 | Universal Services of America | Security | 20000 | Santa Ana | CA |
3456 | The Seaton Companies | Human Resources | 18887 | Chicago | IL |
2870 | PrideStaff | Human Resources | 17057 | Fresno | CA |
2313 | Infiniti HR | Human Resources | 17000 | Olney | MD |
4655 | CareersUSA | Human Resources | 14451 | Boca Raton | FL |
1487 | Sprouts Farmers Market | Consumer Products & Services | 13200 | Phoenix | AZ |
4140 | Cornerstone Staffing Solutions | Human Resources | 13071 | Pleasanton | CA |
3650 | Genco | Logistics & Transportation | 10800 | Pittsburgh | PA |
Check for unique company names (that companies are not duplicated in the data).
dupNames <- group_by(inc, Name) %>%
summarize(Count=n()) %>%
filter(Count>1)
cat("Number of duplicate company names:",nrow(dupNames))
## Number of duplicate company names: 0
Consider all industries.
knitr::kable(group_by(inc, Industry) %>% summarize(Count=n()) %>% arrange(desc(Count)))
Industry | Count |
---|---|
IT Services | 733 |
Business Products & Services | 482 |
Advertising & Marketing | 471 |
Health | 355 |
Software | 342 |
Financial Services | 260 |
Manufacturing | 256 |
Consumer Products & Services | 203 |
Retail | 203 |
Government Services | 202 |
Human Resources | 196 |
Construction | 187 |
Logistics & Transportation | 155 |
Food & Beverage | 131 |
Telecommunications | 129 |
Energy | 109 |
Real Estate | 96 |
Education | 83 |
Engineering | 74 |
Security | 73 |
Travel & Hospitality | 62 |
Media | 54 |
Environmental Services | 51 |
Insurance | 50 |
Computer Hardware | 44 |
Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.
# Get a list of counts by state
stateCount <- group_by(inc, State) %>%
summarize(Count=n())
# Plot results
p1 <- ggplot(data = stateCount, aes(x = State, y = Count)) +
geom_bar(stat="identity", fill="#CCCCCC") +
geom_text(aes(label=Count), hjust=-0.2, vjust=0.4, color="#555555") +
scale_x_discrete(limits = rev(levels(stateCount$State))) +
coord_flip() +
ggtitle("No of Companies per State") + labs(x = "", y = "") +
theme(panel.background = element_blank(),
axis.ticks = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_text(margin = margin(r=-30)))
ggplotly(p1)
Decision points:
Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases()
function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.
# Top 3 states
knitr::kable(arrange(stateCount, desc(Count)) %>% top_n(3))
State | Count |
---|---|
CA | 701 |
TX | 387 |
NY | 311 |
# Get NY industry employee counts
nyInd <- filter(inc, State=="NY") %>%
select(Industry, Name, Employees)
# Check if any NAs
cat("Number of NAs:",sum(is.na(nyInd$Employees)))
## Number of NAs: 0
There are no NAs for NY data. complete.cases()
is not necessary.
Rather than discarding a few large outliers, which skew averages, below plots display median values.
# Plot
p2 <- ggplot(aes(x=Industry, y=Employees), data = nyInd) +
stat_boxplot(geom ='errorbar') +
geom_boxplot() +
coord_cartesian(ylim = c(0,1200)) +
scale_y_continuous(breaks=c(0,250,500,750,1000), expand = c(0,.05)) +
ggtitle("NY State: Employee Count per Industry*") + labs(x = "", y = "") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.3),
axis.ticks.x = element_blank(),
panel.grid.minor.y = element_blank())
ggplotly(p2)
*The following companies are not displayed on above plot, but are included in industry representation:
Industry | Name | Employees |
---|---|---|
Business Products & Services | Sutherland Global Services | 32000 |
Consumer Products & Services | Coty | 10000 |
IT Services | Westcon Group | 3000 |
Travel & Hospitality | Denihan Hospitality Group | 2280 |
Business Products & Services | TransPerfect | 2218 |
Human Resources | Sterling Infosystems | 2081 |
Software | OpenLink | 1271 |
Decision points:
One problem with boxplots is that they do not give a sense of how many data points there are for each industry. 25 industries range from 1 to 57 companies each and employee count ranges from 17 to 38,804. The plot below tries to address it. It represents median values. Size is relative to number of companies per industry and color is relative to number of employees per idustry. Consider Human Resources industry. A smaller point indicates few companies (in fact 11), but lighter color indicates relatively large number of employees (4,813). Because the plot is meant to be illutrative, legends are omitted. Perhaps, hover functionality to display actual values would be a good addition.
# Summarize NY data
nyIndSum <- group_by(nyInd, Industry) %>%
summarise(Median = median(Employees), TotalEmp = sum(Employees), Count = n())
# Plot
p3 <- ggplot(aes(x = Industry, y = Median, size = Count, color = log(TotalEmp)),
data = nyIndSum) +
geom_point(show.legend = FALSE) +
scale_size(range = c(0, 30)) +
scale_x_discrete(limits = rev(levels(nyIndSum$Industry))) +
coord_flip() +
ggtitle("NY State: Median Employee Count per Industry") + labs(x = "", y = "") +
theme(axis.ticks = element_blank(),
axis.text = element_text(size = 14),
panel.grid.major = element_line(color = "#CCCCCC"),
panel.background = element_blank(),
panel.grid.minor.x = element_blank())
ggplotly(p3)
Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.
Based on the summary statistics above there are no missing, negative or zero values in the Revenue
column.
# Get data
revenue <- select(inc, Industry, Revenue, Employees) %>%
na.omit() %>%
group_by(Industry) %>%
summarise(TotalRev = sum(Revenue), TotalEmp = sum(Employees)) %>%
mutate(RevEmployee = TotalRev / TotalEmp)
# Plot results
p4 <- ggplot(data = revenue, aes(x = reorder(Industry, RevEmployee), y = RevEmployee)) +
geom_bar(stat="identity", fill="#184489") +
geom_text(data = filter(revenue, RevEmployee>150000),
aes(x = Industry, y = RevEmployee, label=dollar_format()(RevEmployee)),
hjust=1.1, vjust=0.4, color="#FFFFFF") +
geom_text(data = filter(revenue, RevEmployee<150000),
aes(x = Industry, y = RevEmployee, label=dollar_format()(RevEmployee)),
hjust=-0.1, vjust=0.4, color="#000000") +
coord_flip() +
ggtitle("Revenue per Employee per Industry") + labs(x = "", y = "") +
theme(panel.background = element_blank(),
axis.ticks = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_text(size = 14, margin = margin(r=-20)))
ggplotly(p4)
Decision points: