Data 608 - Module 1 Assignment

Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. Let's read this in:

inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)

And let's preview this data:

head(inc)

##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX

summary(inc)

##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2502   110 Consulting        :   1   Mean   :  4.612  
##  3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4995                    
##     Revenue                                  Industry      Employees      
##  Min.   :2.000e+06   IT Services                 : 733   Min.   :    1.0  
##  1st Qu.:5.100e+06   Business Products & Services: 482   1st Qu.:   25.0  
##  Median :1.090e+07   Advertising & Marketing     : 471   Median :   53.0  
##  Mean   :4.822e+07   Health                      : 355   Mean   :  232.7  
##  3rd Qu.:2.860e+07   Software                    : 342   3rd Qu.:  132.0  
##  Max.   :1.010e+10   Financial Services          : 260   Max.   :66803.0  
##                      (Other)                     :2358   NA's   :12       
##             City          State     
##  New York     : 160   CA     : 701  
##  Chicago      :  90   TX     : 387  
##  Austin       :  88   NY     : 311  
##  Houston      :  76   VA     : 283  
##  San Francisco:  75   FL     : 282  
##  Atlanta      :  74   IL     : 273  
##  (Other)      :4438   (Other):2764

str(inc)

## 'data.frame':    5001 obs. of  8 variables:
##  $ Rank       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name       : Factor w/ 5001 levels "(Add)ventures",..: 1770 1633 4423 690 1198 2839 4733 1468 1869 4968 ...
##  $ Growth_Rate: num  421 248 245 233 213 ...
##  $ Revenue    : num  1.18e+08 4.96e+07 2.55e+07 1.90e+09 8.70e+07 ...
##  $ Industry   : Factor w/ 25 levels "Advertising & Marketing",..: 5 12 13 7 1 20 10 1 5 21 ...
##  $ Employees  : int  104 51 132 50 220 63 27 75 97 15 ...
##  $ City       : Factor w/ 1519 levels "Acton","Addison",..: 391 365 635 2 139 66 912 1179 131 1418 ...
##  $ State      : Factor w/ 52 levels "AK","AL","AR",..: 5 47 10 45 20 45 44 5 46 41 ...

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

library(ggplot2)
library(dplyr)
library(kableExtra)

Growth Rate

The growth rate ranges from 0.340 to 421.480, with a median of only 1.420, so the distribution is heavily right-skewed. Below, you will see that 19 companies experienced growth rates of 100 or higher.

inc %>% dplyr::filter(Growth_Rate >= 100) %>% summarise(n = n())

##    n
## 1 19

Below is the list of these 19 companies with growth rates of 100 or higher.

kable(inc %>% dplyr::filter(Growth_Rate >= 100)) %>% kable_styling()

Rank	Name	Growth_Rate	Revenue	Industry	Employees	City	State
1	Fuhu	421.48	1.179e+08	Consumer Products & Services	104	El Segundo	CA
2	FederalConference.com	248.31	4.960e+07	Government Services	51	Dumfries	VA
3	The HCI Group	245.45	2.550e+07	Health	132	Jacksonville	FL
4	Bridger	233.08	1.900e+09	Energy	50	Addison	TX
5	DataXu	213.37	8.700e+07	Advertising & Marketing	220	Boston	MA
6	MileStone Community Builders	179.38	4.570e+07	Real Estate	63	Austin	TX
7	Value Payment Systems	174.04	2.550e+07	Financial Services	27	Nashville	TN
8	Emerge Digital Group	170.64	2.390e+07	Advertising & Marketing	75	San Francisco	CA
9	Goal Zero	169.81	3.310e+07	Consumer Products & Services	97	Bluffdale	UT
10	Yagoozon	166.89	1.860e+07	Retail	15	Warwick	RI
11	OBXtek	164.33	2.960e+07	Government Services	149	Tysons Corner	VA
12	AdRoll	150.65	3.410e+07	Advertising & Marketing	165	San Francisco	CA
13	uBreakiFix	141.02	1.700e+07	Retail	250	Orlando	FL
14	Sparc	128.63	2.110e+07	Software	160	Charleston	SC
15	LivingSocial	123.33	5.360e+08	Consumer Products & Services	4100	Washington	DC
16	Amped Wireless	110.68	1.430e+07	Computer Hardware	26	Chino	CA
17	Intelligent Audit	105.73	1.450e+08	Logistics & Transportation	15	Rochelle Park	NJ
18	Integrity Funding	104.62	1.110e+07	Financial Services	11	Sarasota	FL
19	Vertex Body Sciences	100.10	1.180e+07	Food & Beverage	51	columbus	OH

Revenue

The revenue ranges from 2 million to about 10 billion. The median revenue is about 11 million.

inc %>% dplyr::summarise(min=min(Revenue), median=median(Revenue), max=max(Revenue))

##     min   median      max
## 1 2e+06 10900000 1.01e+10

Industry

There are 25 distinct industries.

kable(inc %>% dplyr::group_by(Industry) %>% dplyr::summarise(n=n()) %>% arrange(desc(n))) %>% kable_styling()

Industry	n
IT Services	733
Business Products & Services	482
Advertising & Marketing	471
Health	355
Software	342
Financial Services	260
Manufacturing	256
Consumer Products & Services	203
Retail	203
Government Services	202
Human Resources	196
Construction	187
Logistics & Transportation	155
Food & Beverage	131
Telecommunications	129
Energy	109
Real Estate	96
Education	83
Engineering	74
Security	73
Travel & Hospitality	62
Media	54
Environmental Services	51
Insurance	50
Computer Hardware	44

Employees

Twelve companies have no value for Employees (the summary above reports 12 NA's). The number of employees ranges from 1 to 66,803, and the median is 53.

kable(inc %>% dplyr::summarise(min=min(Employees, na.rm = TRUE), median=median(Employees, na.rm = TRUE), max=max(Employees, na.rm = TRUE))) %>% kable_styling()

min	median	max
1	53	66803
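A quick way to check which columns contain missing values is to count the NAs per column. The sketch below uses a small made-up data frame (`toy` is a hypothetical stand-in, not the Inc. data); on the real data the equivalent call would be `colSums(is.na(inc))`.

```r
# Hypothetical toy data standing in for `inc`; the rows are made up.
toy <- data.frame(
  Name      = c("A", "B", "C", "D"),
  Revenue   = c(2.0e6, 5.1e6, 1.09e7, 2.86e7),
  Employees = c(25L, NA, 53L, NA)
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column.
na_counts <- colSums(is.na(toy))
print(na_counts)
```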

City

There are 1,519 distinct cities.

cities <- inc %>% group_by(City) %>% summarise(n=n())
nrow(cities)

## [1] 1519

These are the top 10 cities by number of companies located there. The table has 11 rows because Dallas and Denver tie for tenth place with 42 companies each, and top_n() keeps ties.

kable(inc %>% group_by(City) %>% summarise(n=n()) %>% arrange(desc(n)) %>% top_n(10)) %>% kable_styling()

## Selecting by n

City	n
New York	160
Chicago	90
Austin	88
Houston	76
San Francisco	75
Atlanta	74
San Diego	67
Seattle	52
Boston	43
Dallas	42
Denver	42
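Dallas and Denver both have 42 companies, which is why a "top 10" request can return 11 rows. The sketch below illustrates the tie behaviour on made-up counts (`city_counts` is hypothetical). It assumes dplyr 1.0.0 or later, where slice_max() is available and has a `with_ties` argument.

```r
library(dplyr)

# Hypothetical counts; cities C and D tie for third place.
city_counts <- data.frame(
  City = c("A", "B", "C", "D"),
  n    = c(10L, 7L, 5L, 5L)
)

# Default behaviour keeps ties, so asking for 3 rows returns 4.
with_ties <- city_counts %>% slice_max(order_by = n, n = 3)

# with_ties = FALSE returns exactly 3 rows.
without_ties <- city_counts %>% slice_max(order_by = n, n = 3, with_ties = FALSE)
```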

State

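There are 52 distinct values in the State column. This is more than 50 because the column also holds non-state codes such as DC (see LivingSocial in the growth-rate table above).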

states <- inc %>% group_by(State) %>% summarise(n=n())
nrow(states)

## [1] 52
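One way to see which codes are not among the 50 states is to compare them against base R's built-in `state.abb` vector. The sketch below runs on a made-up vector (`codes` is hypothetical). DC does appear in the data, but "PR" is included here only as an illustration, since the output above does not show which other code is present.

```r
# Hypothetical codes; on the real data this would be inc$State.
codes <- c("CA", "TX", "NY", "DC", "PR", "CA")

# state.abb holds the 50 two-letter state abbreviations (no DC or territories).
non_states <- setdiff(unique(codes), state.abb)
print(non_states)
```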

These are the top 10 States (based on the number of companies that are located in the State).

kable(inc %>% group_by(State) %>% summarise(n=n()) %>% arrange(desc(n)) %>% top_n(10)) %>% kable_styling()

## Selecting by n

State	n
CA	701
TX	387
NY	311
VA	283
FL	282
IL	273
GA	212
OH	186
MA	182
PA	164

Question 1

Create a graph that shows the distribution of companies in the data set by State (i.e. how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (i.e. taller than wide), which should further guide your layout choices.

# Answer Question 1 here

ordered <- inc %>% group_by(State) %>% summarise(n=n()) %>% arrange(desc(n))

plt1 <- 
  ggplot(data = ordered[1:52,], aes(x=reorder(State,n), y=n)) + 
  geom_bar(stat="identity", width=0.5, color="#1F3552", fill="steelblue", 
           position=position_dodge()) +
    #geom_text(aes(label=round(n, digits=2)), hjust=1.3, size=3.0, color="white") + 
    coord_flip() + 
    scale_y_continuous(breaks=seq(0,700,100)) + 
    ggtitle("Distribution by State") +
    xlab("") + ylab("") + 
    theme_minimal()

I couldn’t find a way to increase the plot canvas size. The plot would look better with more space between the states and slightly thicker bars.
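For reference, figure size in R Markdown is usually set with the knitr chunk options `fig.width` and `fig.height` (in inches) in the chunk header. Outside knitr, ggsave() takes `width` and `height` arguments. The sketch below shows the ggsave() route on three of the state counts from the table above. The file location and the 6 by 10 inch size are arbitrary choices for illustration.

```r
library(ggplot2)

# Three rows taken from the state counts above, standing in for `ordered`.
toy <- data.frame(State = c("CA", "TX", "NY"), n = c(701L, 387L, 311L))

p <- ggplot(toy, aes(x = reorder(State, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip()

# Save a portrait-shaped image (taller than wide); sizes are in inches.
out_file <- tempfile(fileext = ".png")
ggsave(out_file, plot = p, width = 6, height = 10, units = "in", dpi = 100)
```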

The graph below orders the states from the highest to the lowest number of companies.

plt1

Question 2

Let's dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

As you can see, the state with the 3rd most companies in the data set is New York.

kable(inc %>% group_by(State) %>% summarise(n=n()) %>% arrange(desc(n)) %>% top_n(3)) %>% kable_styling()

## Selecting by n

State	n
CA	701
TX	387
NY	311

inc_cc holds complete cases only.

inc_cc <- inc[complete.cases(inc),]
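complete.cases() returns one logical value per row, TRUE when no column in that row is NA. Since the summary above shows NAs only in Employees (12 of them), inc_cc should have 12 fewer rows than inc. A minimal sketch on made-up data (`toy` is hypothetical):

```r
# Hypothetical rows; the second one is missing its employee count.
toy <- data.frame(
  Industry  = c("Retail", "Media", "Energy"),
  Employees = c(15L, NA, 50L)
)

keep   <- complete.cases(toy)  # FALSE for any row containing an NA
toy_cc <- toy[keep, ]          # same subsetting pattern as inc_cc
```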

Below is a breakdown of the number of employees per company in each industry for New York state. It shows the minimum, median, maximum, and variance for each industry, ordered from highest to lowest variance. The variance is NA for industries with a single NY company.

kable(inc_cc %>% filter(State=='NY') %>% group_by(Industry) %>% summarise(min=min(Employees),median=median(Employees), max=max(Employees), var=var(Employees)) %>% arrange(desc(var))) %>% kable_styling()

Industry	min	median	max	var
Business Products & Services	4	70.5	32000	3.894641e+07
Consumer Products & Services	5	25.0	10000	5.835802e+06
Travel & Hospitality	6	61.0	2280	6.974669e+05
Human Resources	7	56.0	2081	4.634787e+05
IT Services	8	54.0	3000	2.241769e+05
Software	15	80.0	1271	1.404907e+05
Security	25	32.5	450	4.415000e+04
Media	4	45.0	602	3.099560e+04
Financial Services	14	81.0	483	2.299190e+04
Environmental Services	60	155.0	250	1.805000e+04
Food & Beverage	5	41.0	383	1.390028e+04
Energy	5	120.0	294	1.106670e+04
Telecommunications	6	31.0	316	1.064462e+04
Manufacturing	11	30.0	307	8.048231e+03
Health	2	45.0	298	7.505141e+03
Construction	10	24.5	219	6.392000e+03
Advertising & Marketing	2	38.0	270	3.872536e+03
Education	19	50.5	200	2.359516e+03
Engineering	11	54.5	94	1.583000e+03
Logistics & Transportation	1	23.5	70	8.430000e+02
Retail	3	13.5	75	6.378736e+02
Insurance	15	32.5	50	6.125000e+02
Real Estate	7	18.0	30	9.425000e+01
Computer Hardware	44	44.0	44	NA
Government Services	17	17.0	17	NA
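The NA values in the var column appear where an industry has a single NY company (min, median, and max are identical), because the sample variance of one observation is undefined. The sketch below shows this, and also contrasts the variance with the interquartile range on made-up employee counts. The IQR is offered only as an alternative spread measure that is less sensitive to outliers, not as something the table above used.

```r
# One observation, as for Computer Hardware in NY.
single     <- c(44)
var_single <- var(single)   # NA: sample variance needs at least two values

# Made-up employee counts with one extreme value.
emp        <- c(4, 25, 40, 56, 70, 81, 120, 32000)
spread_var <- var(emp)      # dominated by the single extreme value
spread_iqr <- IQR(emp)      # based only on the middle 50% of the data
```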

A box plot shows the median number of employees (the dark line inside the box). It also shows the spread of the data and the outliers (drawn here as red asterisks).
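By default, geom_boxplot() extends the whiskers to the most extreme points within 1.5 times the interquartile range of the box and draws anything beyond that as an outlier. Base R's boxplot.stats() applies a closely related rule (it uses hinges rather than quantiles, so the numbers can differ slightly). A sketch on made-up employee counts:

```r
# Made-up employee counts with one extreme value.
emp <- c(4, 25, 40, 56, 70, 81, 120, 32000)

# coef = 1.5 is the conventional whisker multiplier.
stats <- boxplot.stats(emp, coef = 1.5)

stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # values flagged as outliers
```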

There are 25 different industries. I tried plotting them all in a single box plot call, and the result was too small to read. In Question 1, I had a similar problem spacing out the data elements on the screen. As a workaround, I created vectors that group industries by variability, using the table above. In this case, industries with higher variability in the number of employees are also the ones with a higher maximum number of employees.

The code below groups industries with similar variability. I limit each group to at most 5 industries so that the plot doesn’t get too small.

g1a <- c('Business Products & Services')
g1b <- c('Consumer Products & Services')
g2 <- c('Travel & Hospitality', 'Human Resources', 'IT Services', 'Software')
g3 <- c('Security', 'Media', 'Financial Services',  'Environmental Services', 'Food & Beverage')
g4 <- c('Energy', 'Telecommunications', 'Manufacturing', 'Health', 'Construction')
g5 <- c('Advertising & Marketing', 'Education', 'Engineering', 'Logistics & Transportation', 'Retail')
g6 <- c('Insurance', 'Real Estate', 'Computer Hardware', 'Government Services')

Below is the code for creating the box plots for each grouping.

Please note that each plot for each group has a different x-axis scale, which depends on the range of number of employees for each respective group.

The industries ‘Computer Hardware’ and ‘Government Services’ each have only one company in NY, which is not enough data to generate a meaningful box plot.

plt_g1a <- ggplot(inc_cc %>% filter(State=='NY' & Industry %in% g1a), aes(x = Industry, y = Employees)) + 
        coord_flip() + 
        geom_boxplot(outlier.colour="red", outlier.shape=8,
             outlier.size=1, notch=FALSE)

plt_g1b <- ggplot(inc_cc %>% filter(State=='NY' & Industry %in% g1b), aes(x = Industry, y = Employees)) + 
        coord_flip() + 
        geom_boxplot(outlier.colour="red", outlier.shape=8,
             outlier.size=1, notch=FALSE)

plt_g2 <- ggplot(inc_cc %>% filter(State=='NY' & Industry %in% g2), aes(x = Industry, y = Employees)) + 
        coord_flip() + 
        geom_boxplot(outlier.colour="red", outlier.shape=8,
             outlier.size=1, notch=FALSE)

plt_g3 <- ggplot(inc_cc %>% filter(State=='NY' & Industry %in% g3), aes(x = Industry, y = Employees)) + 
        coord_flip() + 
        geom_boxplot(outlier.colour="red", outlier.shape=8,
             outlier.size=1, notch=FALSE)

plt_g4 <- ggplot(inc_cc %>% filter(State=='NY' & Industry %in% g4), aes(x = Industry, y = Employees)) + 
        coord_flip() + 
        geom_boxplot(outlier.colour="red", outlier.shape=8,
             outlier.size=1, notch=FALSE)

plt_g5 <- ggplot(inc_cc %>% filter(State=='NY' & Industry %in% g5), aes(x = Industry, y = Employees)) + 
        coord_flip() + 
        geom_boxplot(outlier.colour="red", outlier.shape=8,
             outlier.size=1, notch=FALSE) 

plt_g6 <- ggplot(inc_cc %>% filter(State=='NY' & Industry %in% g6), aes(x = Industry, y = Employees)) + 
        coord_flip() + 
        geom_boxplot(outlier.colour="red", outlier.shape=8,
             outlier.size=1, notch=FALSE)
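The seven plot definitions above differ only in the industry vector they filter on. One way to reduce the repetition is a small helper function. In the sketch below, `plot_group` is a hypothetical name and `toy` is a made-up data frame standing in for inc_cc.

```r
library(ggplot2)

# Hypothetical helper: box plot of Employees for one group of industries in one state.
plot_group <- function(data, industries, state = "NY") {
  subset_df <- data[data$State == state & data$Industry %in% industries, ]
  ggplot(subset_df, aes(x = Industry, y = Employees)) +
    geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 1) +
    coord_flip()
}

# Made-up rows standing in for inc_cc.
toy <- data.frame(
  State     = c("NY", "NY", "NY", "NY", "CA"),
  Industry  = c("Security", "Security", "Media", "Media", "Media"),
  Employees = c(25L, 450L, 4L, 602L, 30L)
)

p <- plot_group(toy, c("Security", "Media"))
```

With the real data, each group would then be one call, for example `plot_group(inc_cc, g2)`.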

I put ‘Business Products & Services’ and ‘Consumer Products & Services’ in their own groups because their box plots came out too small to read. The extreme outliers (maximums of 32,000 and 10,000 employees) stretch the axis and flatten the boxes for these 2 industries.
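An alternative to splitting the industries into groups is to put Employees on a log scale, which compresses the extreme values so the boxes stay readable on one shared axis. The sketch below uses made-up data (`toy` is hypothetical). Whether a log axis suits the audience is a judgment call, and with scale_y_log10() the box statistics are computed on the transformed values.

```r
library(ggplot2)

# Made-up employee counts; one industry has an extreme value.
toy <- data.frame(
  Industry  = rep(c("Business Products & Services", "Retail"), each = 5),
  Employees = c(4, 50, 70, 90, 32000, 3, 10, 13, 20, 75)
)

# scale_y_log10() transforms Employees before the box statistics are computed.
p_log <- ggplot(toy, aes(x = Industry, y = Employees)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 1) +
  scale_y_log10() +
  coord_flip()
```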

plt_g1a

plt_g1b

Below are the box plots for the remaining industries.

Please be mindful that the x-axis scale for each grouping is different.

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

The table below shows the number of companies in each industry and the revenue per employee for each industry, computed as the industry's total revenue divided by its total number of employees.

revenue_per_employee <- 
inc_cc %>% group_by(Industry) %>% summarise(count=n(), total_revenue=sum(Revenue), total_employees=sum(Employees), revenue_per_employee=total_revenue/total_employees) %>% arrange(desc(revenue_per_employee))

kable(revenue_per_employee) %>% kable_styling()

Industry	count	total_revenue	total_employees	revenue_per_employee
Computer Hardware	44	11885700000	9714	1223563.93
Energy	109	13771600000	26437	520921.44
Construction	187	13174300000	29099	452740.64
Logistics & Transportation	154	14837800000	39994	371000.65
Consumer Products & Services	203	14956400000	45464	328972.37
Insurance	50	2337900000	7339	318558.39
Manufacturing	255	12603600000	43942	286823.54
Retail	203	10257400000	37068	276718.46
Financial Services	260	13150900000	47693	275740.67
Environmental Services	51	2638800000	10155	259852.29
Telecommunications	127	7287900000	30842	236297.91
Government Services	202	6009100000	26185	229486.35
Business Products & Services	480	26345900000	117357	224493.64
Health	354	17860100000	82430	216669.90
IT Services	732	20525000000	102788	199682.84
Advertising & Marketing	471	7785000000	39731	195942.71
Food & Beverage	129	12812500000	65911	194390.92
Media	54	1742400000	9532	182794.80
Software	341	8134600000	51262	158686.75
Real Estate	95	2956800000	18893	156502.41
Education	83	1139300000	7685	148249.84
Travel & Hospitality	62	2931600000	23035	127267.20
Engineering	74	2532500000	20435	123929.53
Security	73	3812800000	41059	92861.49
Human Resources	196	9246100000	226980	40735.31
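Note that total_revenue / total_employees is a ratio of sums, so large companies dominate it. It generally differs from the average of each company's own revenue per employee. A worked example on two made-up companies:

```r
# Two made-up companies: one tiny, one large.
revenue   <- c(1e6, 100e6)
employees <- c(2, 1000)

# Ratio of sums: the approach used in the table above.
ratio_of_sums <- sum(revenue) / sum(employees)   # 101e6 / 1002

# Mean of per-company ratios: each company counts equally.
mean_of_ratios <- mean(revenue / employees)      # mean of 500000 and 100000
```

Here the ratio of sums is about 100,798 while the mean of the ratios is 300,000, so the two measures can tell quite different stories.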

The code below plots revenue per employee as a bar chart, sorted from highest to lowest. A second bar chart shows the number of companies in each industry, in the same industry order as the first plot.

plt3_1 <- ggplot(data=revenue_per_employee, aes(x=reorder(Industry,-revenue_per_employee), y=revenue_per_employee)) +
     geom_bar(stat="identity", fill="steelblue") +
     theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
     ggtitle("Revenue Per Employee by Industry") + 
     ylab("Revenue Per Employee") + 
     xlab("")

plt3_2 <- ggplot(data=revenue_per_employee, aes(x=reorder(Industry,-revenue_per_employee), y=count)) +
     geom_bar(stat="identity", fill="steelblue") +
     theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
     ggtitle("Distribution of Companies by Industry") + 
     ylab("Number of Companies") + 
     xlab("")
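The prompt also asks for the distribution per industry. One possible sketch, which is not part of the answer above, computes revenue per employee for each company and draws one box per industry on a log scale. The names `toy`, `rev_per_emp`, and `plt3_dist` are hypothetical and the values are made up. With the real data, inc_cc would take the place of `toy`.

```r
library(ggplot2)

# Made-up companies in two industries.
toy <- data.frame(
  Industry  = rep(c("Energy", "Human Resources"), each = 4),
  Revenue   = c(5e6, 2e7, 9e7, 1.9e9, 3e6, 8e6, 2e7, 5e7),
  Employees = c(10, 40, 150, 50, 60, 200, 900, 2500)
)

# Per-company revenue per employee.
toy$rev_per_emp <- toy$Revenue / toy$Employees

# One box per industry, ordered by median, on a log scale.
plt3_dist <- ggplot(toy, aes(x = reorder(Industry, rev_per_emp, FUN = median),
                             y = rev_per_emp)) +
  geom_boxplot(fill = "steelblue") +
  scale_y_log10() +
  coord_flip() +
  xlab("") +
  ylab("Revenue per employee (log scale)")
```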