Principles of Data Visualization and Introduction to ggplot2
I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:
inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)
And lets preview this data:
head(inc)
## Rank Name Growth_Rate Revenue
## 1 1 Fuhu 421.48 1.179e+08
## 2 2 FederalConference.com 248.31 4.960e+07
## 3 3 The HCI Group 245.45 2.550e+07
## 4 4 Bridger 233.08 1.900e+09
## 5 5 DataXu 213.37 8.700e+07
## 6 6 MileStone Community Builders 179.38 4.570e+07
## Industry Employees City State
## 1 Consumer Products & Services 104 El Segundo CA
## 2 Government Services 51 Dumfries VA
## 3 Health 132 Jacksonville FL
## 4 Energy 50 Addison TX
## 5 Advertising & Marketing 220 Boston MA
## 6 Real Estate 63 Austin TX
summary(inc)
## Rank Name Growth_Rate
## Min. : 1 (Add)ventures : 1 Min. : 0.340
## 1st Qu.:1252 @Properties : 1 1st Qu.: 0.770
## Median :2502 1-Stop Translation USA: 1 Median : 1.420
## Mean :2502 110 Consulting : 1 Mean : 4.612
## 3rd Qu.:3751 11thStreetCoffee.com : 1 3rd Qu.: 3.290
## Max. :5000 123 Exteriors : 1 Max. :421.480
## (Other) :4995
## Revenue Industry Employees
## Min. :2.000e+06 IT Services : 733 Min. : 1.0
## 1st Qu.:5.100e+06 Business Products & Services: 482 1st Qu.: 25.0
## Median :1.090e+07 Advertising & Marketing : 471 Median : 53.0
## Mean :4.822e+07 Health : 355 Mean : 232.7
## 3rd Qu.:2.860e+07 Software : 342 3rd Qu.: 132.0
## Max. :1.010e+10 Financial Services : 260 Max. :66803.0
## (Other) :2358 NA's :12
## City State
## New York : 160 CA : 701
## Chicago : 90 TX : 387
## Austin : 88 NY : 311
## Houston : 76 VA : 283
## San Francisco: 75 FL : 282
## Atlanta : 74 IL : 273
## (Other) :4438 (Other):2764
Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:
The summary provides an overall glimpse via descriptive statistics for the columns in the dataset. Some of the information is helpful, others not helpful (for example, descriptive statistics on ranking). However, perhaps we may be able to gain more insight if we looked at the data from another perspective. Let’s group look at the fastest growing companies and see which cities they are coming from?
# Importing the tidyverse library
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.7.2 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
# What are the top 10 fastest growing companies in the US?
# Sort by 'Growth_Rate'
top_10 <- inc %>% arrange(desc(Growth_Rate)) %>% head(10) %>% select(c(Rank, Name, Growth_Rate, Revenue, City, State))
print("What are the top 10 fastest growing companies in the US?")
## [1] "What are the top 10 fastest growing companies in the US?"
top_10
## Rank Name Growth_Rate Revenue City
## 1 1 Fuhu 421.48 1.179e+08 El Segundo
## 2 2 FederalConference.com 248.31 4.960e+07 Dumfries
## 3 3 The HCI Group 245.45 2.550e+07 Jacksonville
## 4 4 Bridger 233.08 1.900e+09 Addison
## 5 5 DataXu 213.37 8.700e+07 Boston
## 6 6 MileStone Community Builders 179.38 4.570e+07 Austin
## 7 7 Value Payment Systems 174.04 2.550e+07 Nashville
## 8 8 Emerge Digital Group 170.64 2.390e+07 San Francisco
## 9 9 Goal Zero 169.81 3.310e+07 Bluffdale
## 10 10 Yagoozon 166.89 1.860e+07 Warwick
## State
## 1 CA
## 2 VA
## 3 FL
## 4 TX
## 5 MA
## 6 TX
## 7 TN
## 8 CA
## 9 UT
## 10 RI
Within the top 10 fastest growing companies, there does not appear to be a single geographical location that dominates. Let’s take a look at the top 1000 fastest growing companies. What are the top 10 cities?
# What are the top ten cities in America with the 1000 fastest growing companies?
top_10_cities <- inc %>% arrange(desc(Growth_Rate)) %>% head(1000) %>% count(City, sort = TRUE) %>% head(10)
print("What are the top ten cities in America with the 1000 fastest growing companies?")
## [1] "What are the top ten cities in America with the 1000 fastest growing companies?"
top_10_cities
## # A tibble: 10 x 2
## City n
## <fct> <int>
## 1 New York 35
## 2 San Francisco 26
## 3 Chicago 20
## 4 Austin 19
## 5 Atlanta 17
## 6 Houston 15
## 7 Irvine 15
## 8 Boston 14
## 9 San Diego 11
## 10 Washington 11
It’s not surprising to see that New York, San Francisco, Chicago, Austin, and Atlanta contain the fastest growing companies (likely startups) in the country.
Let’s evaluate what type of industries contain the fastest growing companies in the US? How many different unique industries postings are there in this dataset?
# How many unique industries in this dataset?
print(paste0("There are ", length(unique(inc$Industry)), " unique industries in this dataset."))
## [1] "There are 25 unique industries in this dataset."
print("The industries in this dataset are: ")
## [1] "The industries in this dataset are: "
unique(inc$Industry)
## [1] Consumer Products & Services Government Services
## [3] Health Energy
## [5] Advertising & Marketing Real Estate
## [7] Financial Services Retail
## [9] Software Computer Hardware
## [11] Logistics & Transportation Food & Beverage
## [13] IT Services Business Products & Services
## [15] Education Construction
## [17] Manufacturing Telecommunications
## [19] Security Human Resources
## [21] Travel & Hospitality Media
## [23] Environmental Services Engineering
## [25] Insurance
## 25 Levels: Advertising & Marketing ... Travel & Hospitality
How many different industries are there in the top 1000 companies (in descending order)?
# Sorting by industries with the 1000 fastest growing companies in the US
print("Sorting by industries with the 1000 fastest growing companies in the US")
## [1] "Sorting by industries with the 1000 fastest growing companies in the US"
inc %>% arrange(desc(Growth_Rate)) %>% head(1000) %>% count(Industry, sort=TRUE) %>% head(10)
## # A tibble: 10 x 2
## Industry n
## <fct> <int>
## 1 IT Services 120
## 2 Advertising & Marketing 112
## 3 Software 83
## 4 Health 77
## 5 Business Products & Services 67
## 6 Financial Services 65
## 7 Consumer Products & Services 61
## 8 Retail 59
## 9 Government Services 56
## 10 Energy 38
IT Services and Advertising/Marketing are by far the fastest growing industries in the top 1000 companies.
Does the fastest growing companies also bring in the highest revenue?
print("Top 10 Fastest Growing Companies")
## [1] "Top 10 Fastest Growing Companies"
top_10
## Rank Name Growth_Rate Revenue City
## 1 1 Fuhu 421.48 1.179e+08 El Segundo
## 2 2 FederalConference.com 248.31 4.960e+07 Dumfries
## 3 3 The HCI Group 245.45 2.550e+07 Jacksonville
## 4 4 Bridger 233.08 1.900e+09 Addison
## 5 5 DataXu 213.37 8.700e+07 Boston
## 6 6 MileStone Community Builders 179.38 4.570e+07 Austin
## 7 7 Value Payment Systems 174.04 2.550e+07 Nashville
## 8 8 Emerge Digital Group 170.64 2.390e+07 San Francisco
## 9 9 Goal Zero 169.81 3.310e+07 Bluffdale
## 10 10 Yagoozon 166.89 1.860e+07 Warwick
## State
## 1 CA
## 2 VA
## 3 FL
## 4 TX
## 5 MA
## 6 TX
## 7 TN
## 8 CA
## 9 UT
## 10 RI
print("Top 10 Highest Revenue Companies")
## [1] "Top 10 Highest Revenue Companies"
inc %>% arrange(desc(Revenue)) %>% head(10) %>% select(c(Rank, Name, Growth_Rate, Revenue, City, State))
## Rank Name Growth_Rate Revenue City
## 1 4788 CDW 0.41 1.01e+10 Vernon Hills
## 2 3853 ABC Supply 0.73 4.70e+09 Beloit
## 3 4936 Coty 0.36 4.60e+09 New York
## 4 4997 Dot Foods 0.34 4.50e+09 Mt. Sterling
## 5 4716 Westcon Group 0.44 3.80e+09 Tarrytown
## 6 4246 American Tire Distributors 0.59 3.50e+09 Huntersville
## 7 4052 Kum & Go 0.65 2.80e+09 West Des Moines
## 8 4802 Boise Cascade 0.41 2.80e+09 Boise
## 9 1397 EnvisionRxOptions 2.88 2.70e+09 Twinsburg
## 10 2522 DLA Piper 1.41 2.40e+09 Chicago
## State
## 1 IL
## 2 WI
## 3 NY
## 4 IL
## 5 NY
## 6 NC
## 7 IA
## 8 ID
## 9 OH
## 10 IL
I suspect that these fast growing companies are likely startups that have received seed or Series A funding, whereas the highest revenue companies are stable companies that have established themselves in the industry.
Assuming my above statement is true, more established companies in theory should have more employees than start up companies.
print("Top 10 Fastest Growing Companies - Employee Count")
## [1] "Top 10 Fastest Growing Companies - Employee Count"
inc %>% arrange(desc(Growth_Rate)) %>% head(10) %>% select(c(Rank, Name, Employees))
## Rank Name Employees
## 1 1 Fuhu 104
## 2 2 FederalConference.com 51
## 3 3 The HCI Group 132
## 4 4 Bridger 50
## 5 5 DataXu 220
## 6 6 MileStone Community Builders 63
## 7 7 Value Payment Systems 27
## 8 8 Emerge Digital Group 75
## 9 9 Goal Zero 97
## 10 10 Yagoozon 15
print("Top 10 Highest Revenue Companies - Employee Count")
## [1] "Top 10 Highest Revenue Companies - Employee Count"
inc %>% arrange(desc(Revenue)) %>% head(10) %>% select(c(Rank, Name, Employees))
## Rank Name Employees
## 1 4788 CDW 6800
## 2 3853 ABC Supply 6549
## 3 4936 Coty 10000
## 4 4997 Dot Foods 3919
## 5 4716 Westcon Group 3000
## 6 4246 American Tire Distributors 3341
## 7 4052 Kum & Go 4589
## 8 4802 Boise Cascade 4470
## 9 1397 EnvisionRxOptions 625
## 10 2522 DLA Piper 4036
It appears that the above statement is true.
Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.
# Focusing on the number of companies per state
companies <- inc %>% count(State, sort = TRUE)
# So that the data is sorted in descending order
companies$State <- factor(companies$State, levels = companies$State[order(companies$n)])
# Creating the ggplot for all states
ggplot(companies, aes(State,n)) + geom_bar(stat="identity", fill='steelblue') + coord_flip() + ylab("Count") + theme_minimal() + ggtitle("Number of Fastest Growing Companies Per State")
# Creating the ggplot for top 10 states
companies_10 <- inc %>% count(State, sort = TRUE) %>% head(10)
companies_10$State <- factor(companies_10$State, levels = companies_10$State[order(companies_10$n)])
ggplot(companies_10, aes(State,n)) + geom_bar(stat="identity", fill='steelblue') + coord_flip() + ylab("Count") + theme_minimal() + ggtitle("Number of Fastest Growing Companies Per State - Top 10") + geom_text(aes(label=n), vjust=0.3, hjust=1.6, color="white", size=3.5)
California, Texas, and New York appear to have the 3 fastest growing companies in the US.
Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases()
function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.
# State with the 3rd most companies in the data set: New York
NY <- inc %>% filter(State == 'NY')
NY <- NY[complete.cases(NY),]
NY_summary <- NY %>%
group_by(Industry) %>%
summarize(Median_Employment=median(Employees),
Mean_Employment=mean(Employees))
# Create a plot showing average employment by industry for companies in the state
ggplot(NY_summary %>% mutate(Industry = fct_reorder(Industry, Mean_Employment)) , aes(x=Industry,y=Mean_Employment, fill=Industry)) + geom_bar(stat="identity") + ylab('Mean Employment by Industry') + coord_flip() +scale_fill_grey(start=0.8, end=0.1) + theme(legend.position='none')
# Create a plot showing median employment by industry for companies in the state
ggplot(NY_summary %>% mutate(Industry = fct_reorder(Industry, Median_Employment)), aes(x=Industry,y=Median_Employment, fill=Industry)) + geom_bar(stat="identity") + ylab('Median Employment by Industry') + coord_flip() + scale_fill_grey(start=0.8, end=0.1) + theme(legend.position='none')
Side by Side Comparisons (Mean Employment vs. Median Employment by Industry)
# Side by Side differences
par(mfrow=c(1,2))
ggplot(NY_summary, aes(x=Industry,y=Mean_Employment)) + geom_bar(stat="identity", fill='steelblue') + ylab('Mean Employment by Industry') + coord_flip()
ggplot(NY_summary, aes(x=Industry,y=Median_Employment)) + geom_bar(stat="identity", fill='steelblue') + ylab('Median Employment by Industry') + coord_flip()
Interestingly, many of the industries means and medians do not appear to match up. Typically, such drastic differences in mean median indicate skew in the distribution of employees within each industry.
NY_boxplot_df <- NY %>%
group_by(Industry)
ggplot(NY_boxplot_df, aes(x=Industry,y=Employees)) + geom_boxplot() + coord_flip()
As we see, what’s the most obvious from this graph is the outliers with Consumer Products $ Services
and with Business Products & Services
. Though the information can give us a general sense of how extreme these outliers are, the graph unfortunately does not provide much information beyond that. Let us remove these two outliers, and take a closer look at the rest of the boxplots.
NY_boxplot_df_adj <- NY %>%
filter(Employees < 900) %>%
group_by(Industry)
ggplot(NY_boxplot_df_adj, aes(x=Industry, y=Employees, fill=Industry)) + geom_boxplot() + coord_flip() + theme(legend.position='none')
The median number is most likely a better reflection of which industry (with the fastest growing companies in the US) as there are a significant of right skew distribution plots. Here, Environmental Services
and Energy
appears to have the highest median number of employees.
Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.
We will be looking at the mean and median distribution.
revenue <- NY %>%
mutate(Revenue_Per_Employee = Revenue/Employees) %>%
group_by(Industry) %>%
summarize(Mean_Rev_Per_Emp = mean(Revenue_Per_Employee),
Median_Rev_Per_Emp = median(Revenue_Per_Employee))
# Mean Boxplots of Revenue Per Employee
ggplot(revenue %>% mutate(Industry = fct_reorder(Industry, Mean_Rev_Per_Emp)), aes(x=Industry,y=Mean_Rev_Per_Emp, fill=Industry)) + geom_bar(stat="identity", fill='steelblue') + ylab('Mean Revenue per Employee by Industry') + coord_flip() + theme(legend.position = 'none')
# Mean Boxplots of Revenue Per Employee
ggplot(revenue %>% mutate(Industry = fct_reorder(Industry, Median_Rev_Per_Emp)), aes(x=Industry,y=Median_Rev_Per_Emp, fill=Industry)) + geom_bar(stat="identity", fill='steelblue') + ylab('Median Revenue per Employee by Industry') + coord_flip() + theme(legend.position = 'none')
The most notable difference in the mean and median is in the energy industry. Let’s check for outliers.
#Boxplot
revenue_1 <- NY %>%
mutate(Revenue_Per_Employee = Revenue/Employees) %>%
group_by(Industry)
ggplot(revenue_1, aes(x=Industry, y=Revenue_Per_Employee, fill=Industry)) + geom_boxplot() + coord_flip() + theme(legend.position='none')
Energy appears to have a company that is quite productive and generating a significant amount of revenue. This likely explains why the mean differs so much from the median.
Let’s see what other differences between the mean and median exists within all of the industries.
rev_2 <- revenue_1 %>% group_by(Industry) %>% select(Name, Revenue_Per_Employee) %>% summarise(Revenue_Per_Employee_Mean = mean(Revenue_Per_Employee), Revenue_Per_Employee_Median = median(Revenue_Per_Employee)) %>% mutate(Difference_Mean_Median = abs(Revenue_Per_Employee_Mean - Revenue_Per_Employee_Median))
## Adding missing grouping variables: `Industry`
rev_2
## # A tibble: 25 x 4
## Industry Revenue_Per_Employ… Revenue_Per_Emplo… Difference_Mean_…
## <fct> <dbl> <dbl> <dbl>
## 1 Advertising &… 373403 255556 117848
## 2 Business Prod… 527817 203148 324669
## 3 Computer Hard… 520455 520455 0
## 4 Construction 238695 238613 81.5
## 5 Consumer Prod… 382943 222222 160720
## 6 Education 112061 114848 2788
## 7 Energy 8472533 283212 8189322
## 8 Engineering 215745 202998 12747
## 9 Environmental… 134367 134367 0
## 10 Financial Ser… 400174 219608 180567
## # ... with 15 more rows
ggplot(rev_2 %>% mutate(Industry = fct_reorder(Industry, Difference_Mean_Median)) , aes(x=Industry, y=Difference_Mean_Median, fill=Industry)) + geom_bar(stat="identity") + coord_flip() + theme(legend.position = 'none') + scale_fill_grey() + ylab("Absolute Difference Between Mean and Median")
For the most part, other than Energy
, there isn’t a significant difference between the mean and the median. Let’s remove the one energy company in the outlier and examine the rest of the data.
revenue_2 <- revenue_1 %>%
filter(Revenue_Per_Employee < 10^7)
ggplot(revenue_2, aes(x=Industry, y=Revenue_Per_Employee, fill=Industry)) + geom_boxplot() + coord_flip() + theme(legend.position='none') + ylab("Revenue Per Employee")
Let’s replace the X scale with a logarithmic scale and see if the differences become more apparent.
ggplot(revenue_2, aes(x=Industry, y=Revenue_Per_Employee, fill=Industry)) + geom_boxplot() + scale_y_continuous(trans='log10') + coord_flip() + theme(legend.position='none') + ylab("Revenue Per Employee - Logarithmic Transformation Base 10")
Another way to visualize this data is via Violin Plot. Again, the scale has had a logarithmic transformation.
ggplot(revenue_2, aes(x=Industry, y=Revenue_Per_Employee, fill=Industry)) + geom_violin() + scale_y_continuous(trans='log10') + coord_flip() + theme(legend.position='none') + ylab("Revenue Per Employee - Logarithmic Transformation Base 10") + ggtitle("Violin Plot of Revenue Per Employee by Industry")
It appears that Logistics & Transportation
provides generally the most revenue per employee in the NY State.