Libraries Used
require(dplyr)
require(tidyr)
require(knitr)
require(kableExtra)
require(kable)
require(ggplot2)
Principles of Data Visualization and Introduction to ggplot2
I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:
inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)
And lets preview this data:
head(inc)
## Rank Name Growth_Rate Revenue
## 1 1 Fuhu 421.48 1.179e+08
## 2 2 FederalConference.com 248.31 4.960e+07
## 3 3 The HCI Group 245.45 2.550e+07
## 4 4 Bridger 233.08 1.900e+09
## 5 5 DataXu 213.37 8.700e+07
## 6 6 MileStone Community Builders 179.38 4.570e+07
## Industry Employees City State
## 1 Consumer Products & Services 104 El Segundo CA
## 2 Government Services 51 Dumfries VA
## 3 Health 132 Jacksonville FL
## 4 Energy 50 Addison TX
## 5 Advertising & Marketing 220 Boston MA
## 6 Real Estate 63 Austin TX
summary(inc)
## Rank Name Growth_Rate Revenue
## Min. : 1 Length:5001 Min. : 0.340 Min. :2.000e+06
## 1st Qu.:1252 Class :character 1st Qu.: 0.770 1st Qu.:5.100e+06
## Median :2502 Mode :character Median : 1.420 Median :1.090e+07
## Mean :2502 Mean : 4.612 Mean :4.822e+07
## 3rd Qu.:3751 3rd Qu.: 3.290 3rd Qu.:2.860e+07
## Max. :5000 Max. :421.480 Max. :1.010e+10
##
## Industry Employees City State
## Length:5001 Min. : 1.0 Length:5001 Length:5001
## Class :character 1st Qu.: 25.0 Class :character Class :character
## Mode :character Median : 53.0 Mode :character Mode :character
## Mean : 232.7
## 3rd Qu.: 132.0
## Max. :66803.0
## NA's :12
Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:
The initial summaries above are helpful in allowing us to conclude the following:
The growth rate of the 5,000 fastest growing companies typically is around 1-4%, however there may be a few outliers in the dataset that we’ll need to investigate.
The median amount of revenue generated by these companies is roughly $10,000,000, and the median number of employees per company is 53 individuals.
There appear to be a few outliers that are skewing the mean values for many of our continuous variables, so it’ll be important to take this into account in our future analyses.
We can also investigate other aspects of this data by using the group_by function to create more specific breakdowns of some of our variables.
Average Revenue By Industry
If we were to break down the amount of revenue per industry, we can see the following:
inc %>%
group_by(Industry) %>%
summarise(Mean_Revenue = mean(Revenue), Number_of_Companies = n()) %>%
arrange(desc(Mean_Revenue)) %>%
kable('html') %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| Industry | Mean_Revenue | Number_of_Companies |
|---|---|---|
| Computer Hardware | 270129545 | 44 |
| Energy | 126344954 | 109 |
| Food & Beverage | 98559542 | 131 |
| Logistics & Transportation | 95745161 | 155 |
| Consumer Products & Services | 73676847 | 203 |
| Construction | 70450802 | 187 |
| Telecommunications | 56855814 | 129 |
| Business Products & Services | 54705187 | 482 |
| Security | 52230137 | 73 |
| Environmental Services | 51741176 | 51 |
| Financial Services | 50580385 | 260 |
| Retail | 50529064 | 203 |
| Health | 50319437 | 355 |
| Manufacturing | 49546875 | 256 |
| Travel & Hospitality | 47283871 | 62 |
| Human Resources | 47173980 | 196 |
| Insurance | 46758000 | 50 |
| Engineering | 34222973 | 74 |
| Media | 32266667 | 54 |
| Real Estate | 30892708 | 96 |
| Government Services | 29748020 | 202 |
| IT Services | 28214598 | 733 |
| Software | 23802924 | 342 |
| Advertising & Marketing | 16528662 | 471 |
| Education | 13726506 | 83 |
It looks like, on average, the Computer Hardware industry generates the most revenue out of the 25 industries in this dataset.
Average Revenue Per Employee
Next, we will factor in the number of employees into this calculation to see how this impacts the average revenue calculation.
inc %>%
group_by(Industry) %>%
summarise(Mean_Revenue_per_employee = mean((Revenue / Employees), na.rm = TRUE), Number_of_companies = n(), Mean_Employees = mean(Employees, na.rm = TRUE)) %>%
arrange(desc(Mean_Revenue_per_employee)) %>%
kable('html') %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| Industry | Mean_Revenue_per_employee | Number_of_companies | Mean_Employees |
|---|---|---|---|
| Energy | 1554655.8 | 109 | 242.54128 |
| Computer Hardware | 817702.2 | 44 | 220.77273 |
| Logistics & Transportation | 794810.9 | 155 | 259.70130 |
| Food & Beverage | 618382.8 | 131 | 510.93798 |
| Insurance | 474966.4 | 50 | 146.78000 |
| Consumer Products & Services | 466068.1 | 203 | 223.96059 |
| Construction | 465682.4 | 187 | 155.60963 |
| Manufacturing | 453524.0 | 256 | 172.32157 |
| Telecommunications | 449259.6 | 129 | 242.85039 |
| Real Estate | 434515.6 | 96 | 198.87368 |
| Travel & Hospitality | 414788.1 | 62 | 371.53226 |
| Retail | 412554.9 | 203 | 182.60099 |
| Human Resources | 395972.2 | 196 | 1158.06122 |
| Financial Services | 394230.9 | 260 | 183.43462 |
| Business Products & Services | 359096.9 | 482 | 244.49375 |
| Health | 325198.9 | 355 | 232.85311 |
| Media | 307143.8 | 54 | 176.51852 |
| Advertising & Marketing | 306036.3 | 471 | 84.35456 |
| Education | 296453.5 | 83 | 92.59036 |
| Environmental Services | 283607.3 | 51 | 199.11765 |
| Security | 283391.4 | 73 | 562.45205 |
| IT Services | 270494.2 | 733 | 140.42076 |
| Government Services | 243596.0 | 202 | 129.62871 |
| Software | 225989.2 | 342 | 150.32845 |
| Engineering | 201119.9 | 74 | 276.14865 |
As we can see above, although the Computer Hardware industry appears to generate, on average, the most revenue, when factoring in the number of employees, we see the Energy industry supersedes Computer Hardware. Additionally, it’s interesting to see Insurance jump up into the top 5 industries when factoring in the average number of employees, and to see Food & Beverage fall below Logistics & Transportation. I believe this is due to the number of employees required for each industry. As we can see from the Mean_Employees column, on average, Food & Beverage companies require over 500 employees, which is far greater than many others on the list. Additionally, Insurance companies require about 150 employees on average, which is far lower than many others on this list. Therefore, these movements seem to be impacted by this phenomenon.
Top 10 Companies with Highest/Lowest Growth Rate
We can also take a look at the companies that had the highest and lowest growth rates. Below, are the top 10 companies with the highest growth rates:
inc %>%
arrange(desc(Growth_Rate)) %>%
head(n = 10L) %>%
kable('html') %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| Rank | Name | Growth_Rate | Revenue | Industry | Employees | City | State |
|---|---|---|---|---|---|---|---|
| 1 | Fuhu | 421.48 | 1.179e+08 | Consumer Products & Services | 104 | El Segundo | CA |
| 2 | FederalConference.com | 248.31 | 4.960e+07 | Government Services | 51 | Dumfries | VA |
| 3 | The HCI Group | 245.45 | 2.550e+07 | Health | 132 | Jacksonville | FL |
| 4 | Bridger | 233.08 | 1.900e+09 | Energy | 50 | Addison | TX |
| 5 | DataXu | 213.37 | 8.700e+07 | Advertising & Marketing | 220 | Boston | MA |
| 6 | MileStone Community Builders | 179.38 | 4.570e+07 | Real Estate | 63 | Austin | TX |
| 7 | Value Payment Systems | 174.04 | 2.550e+07 | Financial Services | 27 | Nashville | TN |
| 8 | Emerge Digital Group | 170.64 | 2.390e+07 | Advertising & Marketing | 75 | San Francisco | CA |
| 9 | Goal Zero | 169.81 | 3.310e+07 | Consumer Products & Services | 97 | Bluffdale | UT |
| 10 | Yagoozon | 166.89 | 1.860e+07 | Retail | 15 | Warwick | RI |
As we can see, it looks like Fuhu, based in El Segundo, CA had an extremely high growth rate (almost double the next highest company on the list).
And here are the 10 companies with the lowest growth rates:
inc %>%
arrange(desc(Growth_Rate)) %>%
tail(n = 10L) %>%
kable('html') %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| Rank | Name | Growth_Rate | Revenue | Industry | Employees | City | State | |
|---|---|---|---|---|---|---|---|---|
| 4992 | 4992 | Salem Metal Fabricators | 0.35 | 7.40e+06 | Manufacturing | 50 | Middleton | MA |
| 4993 | 4993 | The PI Company | 0.35 | 2.00e+06 | Business Products & Services | 6 | North Little Rock | AR |
| 4994 | 4994 | RFB Holdings | 0.35 | 7.20e+06 | Human Resources | 27 | Downer Grove | IL |
| 4995 | 4995 | Sterling Computers | 0.35 | 1.66e+08 | Government Services | 98 | Norfolk | NE |
| 4996 | 4996 | cSubs | 0.34 | 1.34e+07 | Business Products & Services | 19 | Montvale | NJ |
| 4997 | 4997 | Dot Foods | 0.34 | 4.50e+09 | Food & Beverage | 3919 | Mt. Sterling | IL |
| 4998 | 4998 | Lethal Performance | 0.34 | 6.80e+06 | Retail | 8 | Wellington | FL |
| 4999 | 4999 | ArcaTech Systems | 0.34 | 3.26e+07 | Financial Services | 63 | Mebane | NC |
| 5000 | 5000 | INE | 0.34 | 6.80e+06 | IT Services | 35 | Bellevue | WA |
| 5001 | 5000 | ALL4 | 0.34 | 4.70e+06 | Environmental Services | 34 | Kimberton | PA |
Above, we can see that ALL4, located in Kimberton, PA had the slowest growth rate among the top 5,000 companies in the dataset.
Growth Rates by State (states with at least 50 companies)
We can also break down growth rates by state:
inc %>%
group_by(State) %>%
summarise(Mean_Growth_Rate = mean(Growth_Rate), Number_of_Companies = n()) %>%
arrange(desc(Mean_Growth_Rate)) %>%
filter(Number_of_Companies >= 50) %>%
kable('html') %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| State | Mean_Growth_Rate | Number_of_Companies |
|---|---|---|
| UT | 6.307790 | 95 |
| TX | 6.024186 | 387 |
| CA | 5.897489 | 701 |
| FL | 5.846099 | 282 |
| MA | 5.416648 | 182 |
| CT | 4.994600 | 50 |
| MD | 4.984809 | 131 |
| CO | 4.952015 | 134 |
| TN | 4.950366 | 82 |
| VA | 4.877350 | 283 |
| IN | 4.788261 | 69 |
| AZ | 4.616700 | 100 |
| NJ | 4.445380 | 158 |
| NY | 4.371158 | 311 |
| WA | 4.002692 | 130 |
| MN | 3.821477 | 88 |
| IL | 3.738755 | 273 |
| OH | 3.557527 | 186 |
| GA | 3.523538 | 212 |
| NC | 3.511679 | 137 |
| WI | 2.690253 | 79 |
| PA | 2.566585 | 164 |
| MO | 2.497288 | 59 |
| AL | 2.407451 | 51 |
| MI | 2.238571 | 126 |
Although it’s difficult to glean too much here given the wide variety of the number of companies per state in this dataset, we can see that states with at least 50 companies in this dataset have mean growth rates ranging from ~6.3 percent to ~2.2 percent.
Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.
In order to do this, we’ll first need to isolate and group the data by state, counting the number of companies per state using the group_by and summarise functions.
comp_state <- inc %>%
group_by(State) %>%
summarise(Number_of_companies = n()) %>%
arrange(desc(Number_of_companies))
Then, using ggplot, we can create a bar graph:
g <- ggplot(comp_state, aes(x = reorder(State, Number_of_companies), y = Number_of_companies))
g <- g + geom_bar(stat = "identity", fill = "#009a9c") + coord_flip()
g <- g + scale_y_continuous(expand = c(0, 0), limits = c(0, 900))
g <- g + ggtitle("Distribution of the 5,000 Fastest Growing \nCompanies in the U.S. (by State)")
g <- g + geom_hline(yintercept=seq(1,800,250), col="white", lwd=1)
g <- g + ylab("Number of Companies")
g <- g + xlab("State")
g <- g + geom_text(aes(label = Number_of_companies, size = 1), vjust = 0.20, hjust = -0.25, size=2.5, fontface='bold', color="#009a9c")
g <- g + theme(plot.title = element_text(size=12, face="bold", hjust = 0.25),
axis.text=element_text(size=9, face = "bold"),
axis.title=element_text(size=9,face="bold", color = "#555555"),
panel.background = element_blank(),
axis.line = element_line(color = "#cccccc",
size = 0.5, linetype = "solid"),
axis.ticks.y = element_blank(),
axis.ticks.x = element_line(color="#cccccc"))
g
As mentioned in the question text, I decided to display the bar graph in portrait format, with the bars extending horizontally rather than vertically since there were a large number of states. I also attempted to be mindful of the data-ink ratio here, by removing extra gridlines in the background, and any extraneous borders and ink that could distract from the overall message of the graph. I did add some white guidelines to the bars in order to show deliniations between our major gridlines on the x-axis (0, 250, 500, and 750).
Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.
First, we’ll have to isolate the state with the 3rd most companies in the dataset. This appears to be the state of New York. In order to do this, we can use the filter function in R:
ny_companies <- inc %>%
filter(State == 'NY') %>%
filter(complete.cases(.))
With the New York State companies saved in a separate data frame, checking to ensure that all companies in the list are complete.cases(), we can now create another data frame that holds the specific mean employee calculations for each industry in the state. Also, we’ll find the maximum average mean employee value for all of the industries in the data set – in order to use later for our scaling on our chart. Similar to before, we can use the group_by and summarise functions again:
mean_industry <- ny_companies %>%
group_by(Industry) %>%
summarise(Mean_employees = mean(Employees))
mean_industry_max <- max(mean_industry$Mean_employees)
ny_companies_max <- max(ny_companies$Employees)
With these data frames and values ready, we can now create our graph that shows both the mean and median employment by industry for companies in NY. To do this, we’ll set up a ggplot boxplot:
g <- ggplot(ny_companies, aes(x= reorder(Industry, Employees, mean), y=Employees))
g <- g + geom_boxplot(width = 0.5, fill = "aliceblue", fatten = 2, color="#333333", alpha=0.2) + coord_flip()
g <- g + geom_point(data = mean_industry, aes(x=Industry, y=Mean_employees, fill = "Mean Employees"), color="maroon", size = 2, show.legend = TRUE)
g <- g + scale_y_continuous(expand = c(0,0), limits = c(0, mean_industry_max + 50))
g <- g + ggtitle("Distribution of Employees per Industry (New York State)")
g <- g + ylab("Employees")
g <- g + xlab("Industry")
g <- g + theme_bw()
g <- g + theme(plot.title = element_text(size=12, face="bold", hjust = 0.5, color = "#555555"),
axis.text=element_text(size=9, face = "bold"),
axis.title=element_text(size=9,face="bold", color = "#555555"),
panel.background = element_blank(),
panel.border = element_blank(),
axis.line = element_line(color = "#cccccc",
size = 0.5, linetype = "solid"),
axis.ticks.y = element_blank(),
axis.ticks.x = element_line(color="#cccccc"),
legend.position = "top")
g <- g + labs(fill = "")
g
As we can see above, it looks like our mean employee calculations are a bit skewed. When examining the data, for the Business Products & Services industry, we can see one company really pulls the mean employee value much higher (Sutherland Global Services), and given our scaling, it isn’t even pictured on this first iteration of the graph. This is also happening with a few other industries. Therefore, to account for this, we can do a log transformation on our visualization to account for these outliers:
g <- ggplot(ny_companies, aes(x= reorder(Industry, Employees, mean), y=Employees))
g <- g + geom_boxplot(width = 0.5, fill = "aliceblue", fatten = 2, color="#333333", alpha = 0.2) + coord_flip()
g <- g + geom_point(data = mean_industry, aes(x=Industry, y=Mean_employees, fill = "Mean Employees"), color="maroon", size = 2, show.legend = TRUE)
g <- g + scale_y_continuous(expand = c(0,0), limits = c(0, mean_industry_max))
g <- g + ggtitle("Distribution of Employees per Industry (New York State) | Log Transformation")
g <- g + ylab("Employees")
g <- g + xlab("Industry")
g <- g + theme_bw()
g <- g + theme(plot.title = element_text(size=11, face="bold", color = "#555555"),
axis.text=element_text(size=9, face = "bold"),
axis.title=element_text(size=9,face="bold", color = "#555555"),
panel.background = element_blank(),
panel.border = element_blank(),
axis.line = element_line(color = "#cccccc",
size = 0.5, linetype = "solid"),
axis.ticks.y = element_blank(),
axis.ticks.x = element_line(color="#cccccc"),
legend.position = "top")
g <- g + labs(fill = "")
g <- g + scale_y_log10(limits = c(1, mean_industry_max))
g <- g + labs(caption = "(log transformation of employees | x-axis on log scale)")
g <- g + theme(plot.caption = element_text(size = 8, hjust = 0.5, color = "#333333"))
g
Above, we can see that the visualization now accounts for these outliers. As with the first graph, I did my best to minimize the data-ink ratio by removing excess borders, only including important gridlines, and keeping the non-data ink to a minimum.
Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.
Similar to our work in the exploratory analysis section above, I’ll use the group_by and summarise functions again to group the data based on Industry, and calculate the mean revenue per employee for each industry in the dataset:
rev_emp <- inc %>%
filter(complete.cases(.)) %>%
group_by(Industry) %>%
summarise(Rev_total = sum(Revenue), Employee_total = sum(Employees)) %>%
mutate(Revenue_per_employee = Rev_total / Employee_total) %>%
arrange(desc(Revenue_per_employee))
With the data grouped accordingly, I then decided to create a similar visualization (bar chart), as I did in the first question:
g <- ggplot(rev_emp, aes(x = reorder(Industry, Revenue_per_employee), y = Revenue_per_employee))
g <- g + geom_bar(stat = "identity", fill = "#009a9c") + coord_flip()
g <- g + scale_y_continuous(expand = c(0, 0), limits = c(0, 1500000), breaks = c(0, 500000, 1000000), labels = scales::comma)
g <- g + ggtitle("Revenue per Employee by Industry of the 5,000 Fastest Growing \nCompanies in the U.S.")
g <- g + geom_hline(yintercept=seq(0,1500000,250000), col="white", lwd=1)
g <- g + ylab("Revenue per Employee")
g <- g + xlab("Industry")
g <- g + geom_text(aes(label = scales::comma(round(Revenue_per_employee, 0)), size = 1), vjust = 0.20, hjust = -0.15, size=2.5, fontface='bold', color="#009a9c")
g <- g + theme(plot.title = element_text(size=11, face="bold", color = "#555555"),
axis.text=element_text(size=9, face = "bold"),
axis.title=element_text(size=9,face="bold", color = "#555555"),
panel.background = element_blank(),
axis.line = element_line(color = "#cccccc",
size = 0.5, linetype = "solid"),
axis.ticks.y = element_blank(),
axis.ticks.x = element_line(color="#cccccc"))
g
Above, I used the same format as Question 1, trying my best to minimize my data-ink ratio, by reducing my gridlines and extra non-data ink. As we can see above, the Computer Hardware industry far surpasses all other industries for the amount of revenue generated per employee amongst the 5,000 fastest growing companies in the United States.