Principles of Data Visualization and Introduction to ggplot2
I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:
library(dplyr)
library(ggplot2)
library(forcats)
inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)
And lets preview this data:
head(inc)
## Rank Name Growth_Rate Revenue
## 1 1 Fuhu 421.48 1.179e+08
## 2 2 FederalConference.com 248.31 4.960e+07
## 3 3 The HCI Group 245.45 2.550e+07
## 4 4 Bridger 233.08 1.900e+09
## 5 5 DataXu 213.37 8.700e+07
## 6 6 MileStone Community Builders 179.38 4.570e+07
## Industry Employees City State
## 1 Consumer Products & Services 104 El Segundo CA
## 2 Government Services 51 Dumfries VA
## 3 Health 132 Jacksonville FL
## 4 Energy 50 Addison TX
## 5 Advertising & Marketing 220 Boston MA
## 6 Real Estate 63 Austin TX
Looking at the top companies by growth, it apears that the top 5000 companies come from inc.com’s 2013 list.
summary(inc)
## Rank Name Growth_Rate Revenue
## Min. : 1 Length:5001 Min. : 0.340 Min. :2.000e+06
## 1st Qu.:1252 Class :character 1st Qu.: 0.770 1st Qu.:5.100e+06
## Median :2502 Mode :character Median : 1.420 Median :1.090e+07
## Mean :2502 Mean : 4.612 Mean :4.822e+07
## 3rd Qu.:3751 3rd Qu.: 3.290 3rd Qu.:2.860e+07
## Max. :5000 Max. :421.480 Max. :1.010e+10
##
## Industry Employees City State
## Length:5001 Min. : 1.0 Length:5001 Length:5001
## Class :character 1st Qu.: 25.0 Class :character Class :character
## Mode :character Median : 53.0 Mode :character Mode :character
## Mean : 232.7
## 3rd Qu.: 132.0
## Max. :66803.0
## NA's :12
Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:
The dataset provides eight variables to consider to describe the patterns of growth. Feature ‘Name’ is an important label for each firm, while ‘Industry’ may be used to draw useful comparisons. ‘City’ and ‘State’ variables offer the possibilty of identifying geographical trends or high growth clusters. ‘Rank’ and ‘Growth_Rate’ both describe the same underlying data - the rank given provides a rank of the highest growth companies. Growth rate is in terms of percent, i.e., the top company Fuhu grew 42,148% in 2013 or grew over 400 times its original size. ‘Revenue’ and ‘Employees’ provide a measure of each company’s size and income.
There are two features I’m most interested in adding. The first is simply revenue divided by employee as a rough estimate of company and sector productivity. Next, I’ll consider what I’m calling ‘revenue change’, or the absolute year-on-year growth for a company. This will allow for direct comparisons of change in revenue irrespective of the company’s size, and will consider the growth rate paired with the revenue. My reasoning behind this is because many of the growth rates are exaggerated owing ot the small size of the company. Consistent but mediocre growth in a firm with high capitalization could still be more lucrative in the long run than speculating on smaller, private companies. Revenue growth will be useful in steering away from untested firms.
These variables can be further explored by grouping companies by industry. This will allow identification of fastest growing industries and provide useful rankings Additionally, I’ll consider missing values for employees to see if this is significant. Finally, revenue and change in revenue can be expressed in milliions of dollars (MM) for readability reasons.
inc$Revenue <- (inc$Revenue)/10e5 #change to millions of dollars
inc$City <- toupper(inc$City) #keep city names a consistent case
inc$Industry <- as.factor(inc$Industry) #change to factor
inc$State <- as.factor(inc$State)#change to factor
inc$City <- as.factor(inc$City) #change to factor
inc <- inc %>%
mutate(revenue_change = round(Revenue*(1-(1+Growth_Rate)^-1),2)) %>% # absolute dollar change in revenue over the past year, in millions
mutate(revenue_employee = round((Revenue/Employees)*10e2,2)) %>% #dollars revenue per employee, thousands
mutate(revenue_change_employee = round(10e2*Revenue*(1-(1+Growth_Rate)^-1)/Employees,2)) #change in revenue per employee, thousands
summary(inc)
## Rank Name Growth_Rate Revenue
## Min. : 1 Length:5001 Min. : 0.340 Min. : 2.00
## 1st Qu.:1252 Class :character 1st Qu.: 0.770 1st Qu.: 5.10
## Median :2502 Mode :character Median : 1.420 Median : 10.90
## Mean :2502 Mean : 4.612 Mean : 48.22
## 3rd Qu.:3751 3rd Qu.: 3.290 3rd Qu.: 28.60
## Max. :5000 Max. :421.480 Max. :10100.00
##
## Industry Employees City
## IT Services : 733 Min. : 1.0 NEW YORK : 166
## Business Products & Services: 482 1st Qu.: 25.0 CHICAGO : 93
## Advertising & Marketing : 471 Median : 53.0 AUSTIN : 89
## Health : 355 Mean : 232.7 HOUSTON : 77
## Software : 342 3rd Qu.: 132.0 ATLANTA : 75
## Financial Services : 260 Max. :66803.0 SAN FRANCISCO: 75
## (Other) :2358 NA's :12 (Other) :4426
## State revenue_change revenue_employee revenue_change_employee
## CA : 701 Min. : 0.52 Min. : 1.8 Min. : 0.68
## TX : 387 1st Qu.: 2.89 1st Qu.: 125.0 1st Qu.: 65.87
## NY : 311 Median : 6.46 Median : 198.7 Median : 112.51
## VA : 283 Mean : 25.16 Mean : 393.6 Mean : 246.82
## FL : 282 3rd Qu.: 16.50 3rd Qu.: 375.0 3rd Qu.: 222.93
## IL : 273 Max. :2936.88 Max. :40740.0 Max. :37837.66
## (Other):2764 NA's :12 NA's :12
nlevels(inc$Industry)
## [1] 25
nlevels(inc$City) #unique city names, double-counts commmon names (e.g,. Portland, Springfield)
## [1] 1425
nlevels(inc$State) #uniqure state names, 50 states + DC + PR
## [1] 52
Recoding several variables as features offers a little more insight into which companies are growing. The industry with the most representation in the top 5000 is Information Technology, encompassing almost 15% of the firms. There are 25 different industries represented on the list. Likewise, 14% of the companies are located in one state: California. Beyond this, there is at least one fast-growing company in each state, including Washingotn DC and Puerto Rico.
For my engineered features, revenue_change appears to be skewed to the right, with mean 50% higher than median. Considering revenue and its change relative to number of employees provides more question than answers, as there appear to be some companies generating massive revenue with few employees.
inc[is.na(inc$Employees),]
## Rank Name Growth_Rate Revenue
## 183 183 First Flight Solutions 22.32 2.7
## 1063 1064 Popchips 3.98 93.3
## 1123 1124 Vocalocity 3.72 42.9
## 1652 1653 Higher Logic 2.36 6.0
## 1685 1686 Global Communications Group 2.30 3.6
## 2196 2197 JeffreyM Consulting 1.68 12.1
## 2742 2743 Excalibur Exhibits 1.27 9.9
## 3000 3001 Heartland Business Systems 1.12 156.3
## 3978 3978 SSEC 0.68 80.4
## 4112 4112 Carolinas Home Medical Equipment 0.64 3.3
## 4566 4566 Oakbrook 0.48 8.9
## 4968 4968 Popcorn Palace 0.35 5.5
## Industry Employees City State revenue_change
## 183 Logistics & Transportation NA EMERALD ISLE NC 2.58
## 1063 Food & Beverage NA SAN FRANCISCO CA 74.57
## 1123 Telecommunications NA ATLANTA GA 33.81
## 1652 Software NA WASHINGTON DC 4.21
## 1685 Telecommunications NA ENGLEWOOD CO 2.51
## 2196 Business Products & Services NA BELLEVUE WA 7.59
## 2742 Business Products & Services NA HOUSTON TX 5.54
## 3000 IT Services NA LITTLE CHUTE WI 82.57
## 3978 Manufacturing NA HORSHAM PA 32.54
## 4112 Health NA MATTHEWS NC 1.29
## 4566 Real Estate NA MADISON WI 2.89
## 4968 Food & Beverage NA SCHILLER PARK IL 1.43
## revenue_employee revenue_change_employee
## 183 NA NA
## 1063 NA NA
## 1123 NA NA
## 1652 NA NA
## 1685 NA NA
## 2196 NA NA
## 2742 NA NA
## 3000 NA NA
## 3978 NA NA
## 4112 NA NA
## 4566 NA NA
## 4968 NA NA
summary(inc[inc$Employees < 24,])
## Rank Name Growth_Rate Revenue
## Min. : 10 Length:1153 Min. : 0.340 Min. : 2.000
## 1st Qu.: 925 Class :character 1st Qu.: 0.880 1st Qu.: 2.800
## Median :2089 Mode :character Median : 1.780 Median : 4.300
## Mean :2244 Mean : 5.124 Mean : 8.024
## 3rd Qu.:3479 3rd Qu.: 4.720 3rd Qu.: 7.600
## Max. :4998 Max. :166.890 Max. :303.000
## NA's :12 NA's :12 NA's :12
## Industry Employees City
## IT Services :141 Min. : 1.00 NEW YORK : 45
## Advertising & Marketing :136 1st Qu.:10.00 SAN DIEGO : 20
## Business Products & Services:111 Median :15.00 SAN FRANCISCO: 19
## Retail : 87 Mean :14.27 AUSTIN : 18
## Consumer Products & Services: 74 3rd Qu.:19.00 ATLANTA : 16
## (Other) :592 Max. :23.00 (Other) :1023
## NA's : 12 NA's :12 NA's : 12
## State revenue_change revenue_employee revenue_change_employee
## CA :180 Min. : 0.520 Min. : 86.96 Min. : 35.29
## NY : 90 1st Qu.: 1.680 1st Qu.: 205.00 1st Qu.: 117.82
## FL : 81 Median : 2.680 Median : 344.44 Median : 210.02
## TX : 76 Mean : 5.344 Mean : 681.48 Mean : 458.68
## OH : 57 3rd Qu.: 4.730 3rd Qu.: 631.25 3rd Qu.: 421.93
## (Other):657 Max. :198.520 Max. :40740.00 Max. :23103.64
## NA's : 12 NA's :12 NA's :12 NA's :12
For missing values, no immediate patterns can be identified. Looking at the smallest 25% of companies, these companies appear to have lower revenue but conversely much higher revenue per employee. Absolute change in revenue is also lower than the top 5000 but higher on a per-emplmoyee basis. The bottom 25% also may represent more retail and fewer health-related companies.
Create a graph that shows the distribution of companies in the data set by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.
companies_by_state <- inc %>%
group_by(State) %>%
summarize(n_companies = n()) %>% #get number of companies per state
arrange(desc(n_companies)) %>%
mutate(State = fct_reorder(State,n_companies)) %>% #rearrange by state in descending order
mutate(Region = recode_factor(State, VA = 'South', WV = 'South', AR = 'South', DE = 'South', DC = 'South', FL = 'South', GA = 'South', MD = 'South', NC = 'South', SC = 'South', AL = 'South', KY = 'South', MS = 'South', TN = 'South', AK = 'South', LA = 'South', OK = 'South' , TX = 'South', PR = 'South', #south
AZ = 'West', CO = 'West', ID = 'West', NM = 'West', MT = 'West', UT = 'West', NV = 'West', WY = 'West', AK = 'West', CA = 'West', HI = 'West', OR = 'West', WA = 'West',
IA = 'Midwest', IL = 'Midwest', MI = 'Midwest', OH = 'Midwest', WI = 'Midwest', IA = 'Midwest', KS = 'Midwest', MN = 'Midwest', MS = 'Midwest', NE = 'Midwest', SD = 'Midwest', ND = 'Midwest', IN = 'Midwest', MO = 'Midwest',#midwest
CT = 'Northeast', ME = 'Northeast', MA = 'Northeast', NH = 'Northeast', RI = 'Northeast', VT = 'Northeast', NJ = 'Northeast', NY = 'Northeast', PA = 'Northeast' #northeast
)) %>%
mutate(top_states = recode_factor(State, CA= 'CA', TX = 'TX' , NY = 'NY', VA = 'VA', FL = 'FL' , IL = 'IL', GA = 'GA', OH = 'OH', .default = "Other")) %>%
mutate(top_states = fct_reorder(top_states, n_companies))#rearrange by state in descending order
ggplot(companies_by_state[companies_by_state$top_states != 'Other',], aes(top_states,n_companies)) + #omit states listed as 'other'
geom_bar(stat = 'identity') +
coord_flip() + #change axis
labs(title = 'States with the Most Inc.com 5000 Fastest Growing Private Companies') +
ylab(label = 'Number of Companies') +
xlab(label = 'State Name')
ggplot(companies_by_state, aes(fill = top_states, Region,n_companies)) + #all data included
geom_bar(stat = 'identity', color = 'grey') + #use grey to distinguish between states within the 'other' column
scale_fill_brewer(palette='PuRd') +
coord_flip() + #change axis
scale_x_discrete(limits = rev) + #largest regions on top, largest states closest to y-axis
labs(title = 'Regional Distribution of Inc.com\'s 5000 Fastest Growing Private Companies', color = 'State Name') +
ylab(label = 'Number of Companies') +
xlab(label = 'Region Name')
To visualize companies by state in the most concise way possible, I provided two bar graphs. The first graph provides the top 8 states in terms of number of high-growht companies. This graph, complete with a coord flip to show bar graphs horizontally, only provides ~50% of the top 5,000 companies. This a large loss of information and context.
To compensate for this, the second graph also utilizes color as a third element of data to distinguish the top 8 states. These states are part of four larger regions of the United States, as defined by the Census Bureau: Northeast, South, Midwest, and West. While these regions are very diverse, this provides a way of comparing states with their geographical counterparts. To interpret the second graph, the largest comntributors are the most saturated, while the ‘Other’ column is white in each region. This allows for comparisons between geographical regions, as well as showing the relative contribution of a state to its region. For example, California makes up over half of the West’s fastest growing companies. This second graph also illustrates an important takeaway that’s not immediately obvious from the top cities and states: the South is not only the largest contributor to fastest growing compnaies, but also contains four of the top eight. I also added grey borders in each state to add an intuition about how many states are in each region - the South for instance has many states but only some of them contribute appreciably to the largest growth companies tally.
Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.
companies_by_state %>%
arrange(desc(n_companies)) %>%
head(3)
## # A tibble: 3 x 4
## State n_companies Region top_states
## <fct> <int> <fct> <fct>
## 1 CA 701 West CA
## 2 TX 387 South TX
## 3 NY 311 Northeast NY
# Answer Question 2 here
The state with the third largest companies is New York.
ny_inc <- inc[complete.cases(inc),] %>%
filter(State == 'NY')
ny_industry <- ny_inc %>%
group_by(Industry) %>%
mutate(iqr = 1.5*(quantile(Employees,.75) - quantile(Employees,.25))) %>% #define each other's IQR
ungroup() %>% #ungroup to filter outliers
filter(((Employees < quantile(Employees,0.75) + 1.5*iqr) & (Employees > quantile(Employees,.25) - 1.5*iqr)) | iqr == 0)%>% #filter outliers using (Q1 + 1.5IQR, Q3 + 1.5IQR) criteria
mutate(mean_employee = mean(Employees)) %>% #get mean for residual calculation
mutate(sqr_resid = (Employees - mean_employee)^2) %>% #get squared residuals for standard deviation while ungrouped
group_by(Industry) %>% #regroup to generate industry statistics
summarize(n_company = n(), n_employee = sum(Employees), avg_employee = mean(Employees), med_employee = median(Employees), sd_employee = sd(Employees)) %>% #summarize industry statistics
arrange(desc(avg_employee)) %>%
mutate(Industry = fct_reorder(Industry, avg_employee)) #arrange so that largest average industry is on top
ggplot(ny_industry, aes(fill=n_employee, Industry, avg_employee)) + #this graph will show each industry versus average employees, colored by total employees in the industry in NY state
geom_bar(aes(Industry, avg_employee), stat = 'identity') + #bar length defined by average employee in industry
geom_errorbar(aes(x = Industry, ymin=ifelse(avg_employee-sd_employee<0,0,avg_employee-sd_employee), ymax = avg_employee+sd_employee), color = 'orange') + #error bars +/- standard deviation
coord_flip() + #show industry on y-axis
labs(title = 'Employment in Fastest Growing Companies, NY State', fill = 'Total Employment \n per Industry') +
ylab(label = 'Average Employment per Company') +
theme(legend.position = c(0.8,0.25))
Payrolls are largest in the Human Resources and Travel industries. Surprisingly, real estate is not well representated. However, variance and thereby error bars are large, owing to vast differences in staffing at each company. This remains true even after removal of outliers. Finally, IT services is the largest industry represented in the fastest growing companies in NY state.
Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.
ny_revenue <- ny_inc %>%
group_by(Industry) %>%
summarize(total_revenue = sum(Revenue))
ny_emp_revenue <- ny_inc %>%
group_by(Industry) %>%
mutate(iqr = 1.5*(quantile(revenue_employee,.75) - quantile(revenue_employee,.25))) %>% #define each Industry's IQR
ungroup() %>% #ungroup to filter outliers
filter(((revenue_employee < quantile(revenue_employee, 0.75) + 1.5*iqr) & (revenue_employee > quantile(revenue_employee, 0.25) - 1.5*iqr) )| (iqr == 0))%>% #filter outliers using (Q1 + 1.5IQR, Q3 + 1.5IQR) criteria
group_by(Industry) %>%
mutate(mean_revenue_employee = mean(revenue_employee)) %>% #get mean for residual calculation
summarize(n_company = n(), avg_revenue = mean(revenue_employee), med_revenue = median(revenue_employee), sd_revenue = sd(revenue_employee)) %>% #summarize industry statistics
cbind(ny_revenue$total_revenue) %>%
mutate(Industry = fct_reorder(Industry, avg_revenue)) #arrange so that largest average industry is on top
ggplot(ny_emp_revenue, aes(fill=ny_revenue$total_revenue,Industry, avg_revenue)) + #this graph will show each industry versus average employees, colored by total employees in the industry in NY state
geom_bar(aes(Industry, avg_revenue), stat = 'identity') + #bar length defined by average employee in industry
geom_errorbar(aes(x = Industry, ymin=ifelse(avg_revenue-sd_revenue <0, 0, avg_revenue-sd_revenue), ymax = avg_revenue+sd_revenue), color = 'orange') + #error bars +/- standard deviation
coord_flip() + #show industry on y-axis
labs(title = 'Revenue per Emplmoyee, NY State', fill = 'Total Revenue \nper Industry, \n$MM ') +
ylab(label = 'Average Revenue per Employee, $MM') +
theme(legend.position = c(0.8,0.25))
When considering revenue per employee, the must lucrative industry is Logistics & Transportation, with all other industries trailing significantly. Even after removing one large outlier, a second smaller outlier remains. Variance for Logistics in particular is large owing to small payrolls in some of the most lucrative companies. While not evident in our dataset, a brief search on the wbe suggests that these compoanies do consulting work and act as a broker between shipping agencies and prospective customers. While not the most lucrative industries identified, Consumer Products and IT are the largest industries by revenue.