Data-Ink Ratio Exercise with ggplot2
Principles of Data Visualization and Introduction to ggplot2
I have provided you with data about the 5,000 fastest-growing companies in the US, as compiled by Inc. magazine. Let's read this in:
inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header = TRUE)
And let's preview this data:
inc <- dplyr::as_tibble(inc)
head(inc)
## # A tibble: 6 x 8
##    Rank Name         Growth_Rate Revenue Industry       Employees City    State
##   <int> <chr>              <dbl>   <dbl> <chr>              <int> <chr>   <chr>
## 1     1 Fuhu                421.  1.18e8 Consumer Prod~       104 El Seg~ CA
## 2     2 FederalConf~        248.  4.96e7 Government Se~        51 Dumfri~ VA
## 3     3 The HCI Gro~        245.  2.55e7 Health               132 Jackso~ FL
## 4     4 Bridger             233.  1.90e9 Energy                50 Addison TX
## 5     5 DataXu              213.  8.70e7 Advertising &~       220 Boston  MA
## 6     6 MileStone C~        179.  4.57e7 Real Estate           63 Austin  TX
summary(inc)
##       Rank          Name            Growth_Rate         Revenue
##  Min.   :   1   Length:5001        Min.   :  0.340   Min.   :2.000e+06
##  1st Qu.:1252   Class :character   1st Qu.:  0.770   1st Qu.:5.100e+06
##  Median :2502   Mode  :character   Median :  1.420   Median :1.090e+07
##  Mean   :2502                      Mean   :  4.612   Mean   :4.822e+07
##  3rd Qu.:3751                      3rd Qu.:  3.290   3rd Qu.:2.860e+07
##  Max.   :5000                      Max.   :421.480   Max.   :1.010e+10
##
##    Industry           Employees           City              State
##  Length:5001        Min.   :    1.0   Length:5001        Length:5001
##  Class :character   1st Qu.:   25.0   Class :character   Class :character
##  Mode  :character   Median :   53.0   Mode  :character   Mode  :character
##                     Mean   :  232.7
##                     3rd Qu.:  132.0
##                     Max.   :66803.0
##                     NA's   :12
Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:
I’ll keep this section relatively short, as experience has taught me that numeric summaries, like the ones above, can be misleading without visualizing the data as well.
I am curious to see how many companies are in each industry, the industries with the highest median growth rates, as well as the aggregated revenue from each industry.
# Insert your code here, create more chunks as necessary
library(dplyr)  # provides %>%, n() and desc() used below

inc %>%
  dplyr::select(Industry, Revenue, Growth_Rate) %>%
  dplyr::group_by(Industry) %>%
  dplyr::summarize(
    Count = n(),
    # share of all companies, stored as a proportion (0.04 means 4%)
    '%' = round(n() / nrow(.), 2),
    Total_Revenue = sum(Revenue),
    Mdn_Growth_Rate = median(Growth_Rate)
  ) %>%
  # highest median growth rate first
  dplyr::arrange(desc(Mdn_Growth_Rate))
## # A tibble: 25 x 5
##    Industry                     Count   `%` Total_Revenue Mdn_Growth_Rate
##    <chr>                        <int> <dbl>         <dbl>           <dbl>
##  1 Government Services            202  0.04    6009100000            2.11
##  2 Energy                         109  0.02   13771600000            2.08
##  3 Real Estate                     96  0.02    2965700000            2.07
##  4 Media                           54  0.01    1742400000            1.94
##  5 Consumer Products & Services   203  0.04   14956400000            1.82
##  6 Retail                         203  0.04   10257400000            1.76
##  7 Software                       342  0.07    8140600000            1.72
##  8 Advertising & Marketing        471  0.09    7785000000            1.61
##  9 Health                         355  0.07   17863400000            1.57
## 10 Security                        73  0.01    3812800000            1.54
## # ... with 15 more rows
Interestingly, Government Services has the highest median growth rate, followed by Energy and Real Estate. It is also notable that several of the industries with the highest median growth rates, such as Energy, Real Estate, and Media, each make up only 1 to 2 percent of the companies on this list. None of these industries are new, so something happening at the time this data was gathered presumably fueled that growth.
Question 1
Create a graph that shows the distribution of companies in the dataset by State (i.e., how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (i.e., taller than wide), which should further guide your layout choices.
library(ggplot2)

inc %>%
  dplyr::select(State) %>%
  dplyr::group_by(State) %>%
  dplyr::count() %>%
  ggplot() +
  aes(x = reorder(State, n), y = n, fill = n) +
  geom_col(position = 'dodge') +
  # print each state's count just past the end of its bar
  geom_text(aes(label = n), size = 3, hjust = -0.10) +
  labs(title = 'Number of Fastest Growing Companies by State') +
  xlab('') +
  ylab('') +
  scale_fill_gradient(low = "indianred1", high = "indianred4") +
  # horizontal bars suit a portrait-oriented screen
  coord_flip() +
  theme(
    panel.background = element_rect(fill = "white", color = NA),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    legend.position = "none",
    plot.margin = unit(c(.2, .5, .2, .2), "cm")
  )
We can see from the above visual that a substantial number of the fastest-growing companies come from California.
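The counts behind a chart like this can also be ranked directly, which is one way to look up, say, the state with the third most companies. The following is a minimal base R sketch on a made-up vector of state codes (the values are hypothetical and not taken from the Inc. 5000 file):

```r
# Hypothetical state codes, one per company (toy data for illustration)
states <- c("CA", "CA", "CA", "CA", "TX", "TX", "TX", "NY", "NY", "FL")

# Count companies per state and order from most to fewest
counts <- sort(table(states), decreasing = TRUE)

# Name of the state in third place
third <- names(counts)[3]

print(counts)
print(third)
```

Applied to the real data, the same idea would use the State column of inc in place of the toy vector. Ties would need separate handling.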
Question 2
Let's dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data; use R’s complete.cases() function). In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.
Looking at the data, there are 12 rows with missing values in the Employees column. We’ll filter these out using tidyr::drop_na(), which drops any row containing a missing value. Outliers are an issue with this visual. However, as an employee of the state, we are more interested in the typical case than in the extremes, so we’ll exclude companies with more than 1,250 employees to keep the plot readable.
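The prompt mentions complete.cases(), so here is a minimal base R sketch of the same filtering idea, assuming a small made-up data frame (the names and values are hypothetical):

```r
# Toy data frame (hypothetical values, not from the Inc. 5000 file)
toy <- data.frame(
  Name      = c("A", "B", "C", "D"),
  Employees = c(10L, NA, 25L, 40L),
  State     = c("NY", "NY", "CA", "NY"),
  stringsAsFactors = FALSE
)

# complete.cases() is TRUE for rows that contain no missing values
keep <- complete.cases(toy)

# Keep complete rows only, then restrict to a single state
toy_ny <- toy[keep & toy$State == "NY", ]

# Number of rows that were dropped for having a missing value
n_dropped <- sum(!keep)

print(toy_ny)
print(n_dropped)
```

For a data frame, this row filter keeps the same rows that tidyr::drop_na() keeps when it is called without column arguments.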
# New York companies with no missing values
for_box <- inc %>%
  dplyr::filter(State == 'NY') %>%
  tidyr::drop_na()

for_box %>%
  ggplot() +
  aes(x = reorder(Industry, Employees, FUN = median), y = Employees) +
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.45) +
  # ylim() removes companies above 1,250 employees before the box statistics are computed
  ylim(0, 1250) +
  xlab('') +
  ylab('') +
  coord_flip() +
  labs(title = "Number of Employees by Industry", caption = "*Outliers above 1,250 employees were excluded") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.15),
    panel.grid.major.x = element_line(color = "gray90", linetype = "dashed"),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey"),
    plot.margin = unit(c(.2, .5, .2, .2), "cm")
  )
The box plots above show that Environmental Services has the highest median number of employees of any industry.
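Excluding large companies changes the statistics a box plot reports, not just the visible range. The base R sketch below illustrates this on a made-up vector of employee counts (hypothetical numbers). It also shows which points the default 1.5 × IQR rule in boxplot.stats() would flag as outliers:

```r
# Toy employee counts for one hypothetical industry (made-up numbers)
employees <- c(20, 35, 50, 80, 120, 300, 900, 1500, 3000)

# Median using every company
median_all <- median(employees)

# Median after removing companies above a cutoff, which mirrors what
# setting axis limits with ylim() does before the boxes are computed
cutoff <- 1250
median_trimmed <- median(employees[employees <= cutoff])

# Points beyond 1.5 * IQR from the hinges are reported as outliers
flagged <- boxplot.stats(employees)$out

print(c(all = median_all, trimmed = median_trimmed))
print(flagged)
```

On this toy vector the median falls once the largest values are removed. If the goal is only to zoom in, ggplot2's coord_cartesian() limits the visible range without dropping rows from the calculation.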
Question 3
Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.
Here again we’ll have an issue with outliers. However, as an investor we’re interested in the overall health of the industry, not a one-off outlier, so excluding extreme values here will help us visualize the data better.
# for_box holds the New York companies with complete data from Question 2
for_box %>%
  dplyr::mutate(rev_per_emp = Revenue / Employees) %>%
  # drop extreme revenue-per-employee values so the boxes stay readable
  dplyr::filter(rev_per_emp <= 1750000) %>%
  ggplot() +
  aes(x = reorder(Industry, rev_per_emp, FUN = median), y = rev_per_emp) +
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.45) +
  xlab('') +
  ylab('') +
  coord_flip() +
  labs(title = "Revenue per Employee by Industry", caption = "*Outliers above $1,750,000 were excluded") +
  scale_y_continuous(labels = scales::dollar_format()) +
  theme_minimal() +
  theme(
    panel.grid.major.x = element_line(color = "gray90", linetype = "dashed"),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey"),
    plot.margin = unit(c(.2, .5, .2, .2), "cm")
  )
Looking at the above plot as an investor, we can see that Logistics & Transportation is probably the best bet for our money. While it does have a wide range, the minimum revenue per employee is fairly high, the median value is the 3rd highest (almost tied for 2nd), and the higher end of the range is the highest of any industry, meaning there is a lot of potential upside.
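The ranking read off the plot can be cross-checked numerically. This is a minimal base R sketch of the revenue-per-employee median by industry, using a made-up data frame (the industries and figures are hypothetical):

```r
# Toy data (hypothetical figures, not from the Inc. 5000 file)
toy <- data.frame(
  Industry  = c("Software", "Software", "Software", "Retail", "Retail", "Retail"),
  Revenue   = c(2e6, 6e6, 9e6, 4e6, 1e7, 3e6),
  Employees = c(10, 20, 30, 40, 50, 60),
  stringsAsFactors = FALSE
)

# Revenue generated per employee for each company
toy$rev_per_emp <- toy$Revenue / toy$Employees

# Median revenue per employee within each industry
med <- vapply(split(toy$rev_per_emp, toy$Industry), median, numeric(1))

# Order industries from highest to lowest median
ranked <- sort(med, decreasing = TRUE)

print(ranked)
```

A table like this complements the box plot: the plot shows the spread, while the sorted medians make the ordering explicit.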