R Notebook

Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:

inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)

And lets preview this data:

head(inc)

##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX

summary(inc)

##       Rank          Name            Growth_Rate         Revenue         
##  Min.   :   1   Length:5001        Min.   :  0.340   Min.   :2.000e+06  
##  1st Qu.:1252   Class :character   1st Qu.:  0.770   1st Qu.:5.100e+06  
##  Median :2502   Mode  :character   Median :  1.420   Median :1.090e+07  
##  Mean   :2502                      Mean   :  4.612   Mean   :4.822e+07  
##  3rd Qu.:3751                      3rd Qu.:  3.290   3rd Qu.:2.860e+07  
##  Max.   :5000                      Max.   :421.480   Max.   :1.010e+10  
##                                                                         
##    Industry           Employees           City              State          
##  Length:5001        Min.   :    1.0   Length:5001        Length:5001       
##  Class :character   1st Qu.:   25.0   Class :character   Class :character  
##  Mode  :character   Median :   53.0   Mode  :character   Mode  :character  
##                     Mean   :  232.7                                        
##                     3rd Qu.:  132.0                                        
##                     Max.   :66803.0                                        
##                     NA's   :12

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

Environmental Dependancies

library( dplyr )
library( kableExtra )
library( ggplot2 )
library( forcats )
library( scales )

# Insert your code here, create more chunks as necessary
# I find it helpful use dplyr's glimpse() function to view a dataframe
glimpse( inc )

## Rows: 5,001
## Columns: 8
## $ Rank        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Name        <chr> "Fuhu", "FederalConference.com", "The HCI Group", "Bridger…
## $ Growth_Rate <dbl> 421.48, 248.31, 245.45, 233.08, 213.37, 179.38, 174.04, 17…
## $ Revenue     <dbl> 1.179e+08, 4.960e+07, 2.550e+07, 1.900e+09, 8.700e+07, 4.5…
## $ Industry    <chr> "Consumer Products & Services", "Government Services", "He…
## $ Employees   <int> 104, 51, 132, 50, 220, 63, 27, 75, 97, 15, 149, 165, 250, …
## $ City        <chr> "El Segundo", "Dumfries", "Jacksonville", "Addison", "Bost…
## $ State       <chr> "CA", "VA", "FL", "TX", "MA", "TX", "TN", "CA", "UT", "RI"…

glimpse organizes the dataframe in a more succinct way than the head function and, in combination with the statistical output of the summary function hints at several ways that we might explore the data further to understand it further:

Select a Categorical variable to group by and view the summary of other features
Select a Numeric variable to rank other features

Aggregating by Categorical variables:

State-Level Data The following code will group data features by State, although a similar analysis can be performed by city

#Geographic grouping (State) & summarize several features
inc_state <- inc %>%
  group_by( State ) %>%
  summarise( num_companies = n(), 
             num_city = n_distinct( City ), 
             mean_Rev = mean( Revenue ), 
             mean_Growth = mean( Growth_Rate ),
             mean_Rank = mean( Rank ),
             mean_Size = mean( Employees, na.rm=TRUE ),
             num_Industry = n_distinct( Industry ) )

A full view of state level data sorted in descending order by the number of companies:

inc_state %>% 
  arrange( desc( num_companies ) ) %>%
  kbl() %>%
  kable_classic(bootstrap_options = "striped",full_width = F)

State	num_companies	num_city	mean_Rev	mean_Growth	mean_Rank	mean_Size	num_Industry
CA	701	204	33463481	5.897489	2273.867	230.31286	25
TX	387	67	57271835	6.024186	2468.930	235.14249	24
NY	311	90	58715113	4.371158	2611.688	271.28617	25
VA	283	56	30627915	4.877350	2437.286	126.03180	24
FL	282	102	37625177	5.846099	2366.695	217.09574	25
IL	273	104	121773993	3.738755	2530.011	379.65441	25
GA	212	45	30510849	3.523538	2499.193	163.72512	22
OH	186	79	68745161	3.557527	2739.651	204.31183	24
MA	182	72	33174725	5.416648	2522.874	135.61538	24
PA	164	80	34568902	2.566585	2755.872	186.45399	24
NJ	158	97	29574684	4.445380	2453.924	190.89873	22
NC	137	46	67580292	3.511679	2535.482	271.74074	22
CO	134	28	31270149	4.952015	2481.709	198.78195	23
MD	131	44	25193893	4.984809	2255.901	308.69466	22
WA	130	30	27156923	4.002692	2447.946	135.00775	19
MI	126	52	61950794	2.238571	2721.484	292.89683	22
AZ	100	13	55015000	4.616700	2488.230	342.81000	22
UT	95	26	36038947	6.307790	2234.516	200.29474	20
MN	88	37	57256818	3.821477	2496.966	210.61364	19
TN	82	21	35870732	4.950366	2789.878	177.87805	20
WI	79	43	92362025	2.690253	2968.519	201.92208	20
IN	69	30	50105797	4.788261	2150.290	184.01449	21
MO	59	27	45164407	2.497288	2811.085	293.15254	18
AL	51	13	25907843	2.407451	2728.137	125.35294	18
CT	50	34	49486000	4.994600	2495.440	139.78000	18
OR	49	12	28589796	3.148367	2726.918	89.77551	16
SC	48	22	30993750	6.060625	2397.167	111.41667	18
OK	46	12	41015217	3.097174	2582.065	151.65217	15
DC	43	3	76344186	8.298139	2161.047	219.54762	13
KY	40	13	33232500	2.064000	2693.725	138.60000	14
KS	38	16	40752632	3.628684	2737.842	229.60526	17
LA	37	15	56648649	1.944595	2700.973	288.35135	13
IA	28	18	123142857	1.761071	3195.964	405.14286	13
NE	27	7	40696296	2.078889	3343.333	141.59259	13
NV	26	6	19915385	2.330769	2778.308	66.34615	11
NH	24	15	41691667	1.512917	2870.708	120.41667	13
ID	17	12	231523529	2.645294	2561.824	342.17647	10
DE	16	7	42300000	2.420000	2424.562	4284.00000	11
RI	16	9	46981250	16.031250	1513.500	185.25000	12
ME	13	5	12476923	16.210000	1951.615	67.61538	8
MS	12	8	43766667	5.642500	2641.417	460.91667	9
ND	10	5	18240000	1.227000	2965.200	96.30000	8
AR	9	7	8333333	1.670000	3449.222	55.11111	7
HI	7	4	99485714	6.792857	1453.571	88.71429	5
VT	6	4	46200000	1.296667	3089.167	178.16667	6
NM	5	2	9640000	1.364000	2965.000	123.40000	5
MT	4	2	6150000	0.762500	3835.500	418.25000	3
SD	3	3	5900000	1.406667	2719.333	253.66667	3
AK	2	2	171500000	4.805000	2634.500	1264.00000	2
WV	2	2	15650000	0.620000	4164.500	120.00000	2
WY	2	2	34750000	19.145000	1998.500	53.50000	2
PR	1	1	2300000	1.730000	2140.000	29.00000	1

The above table organizes State level data summaries by the number of top 5000 ranked companies (num_companies). Predictably, we see that the larger states (California, CA; Texas, TX; New York, NY) have more companies that made the ranked listing compared to smaller states (e.g. West Virginia, WV; Wyoming, WY). Additionally, we can see that states with more ranked companies also have more represented industries (num_Industry); this can be interpreted as a measure of a state’s economic diversity.

The table above is very dense with information. We can subset the aggregated data to pull out details of interest. For example, the following table organizes the top 10 states by mean Growth Rate

#sort by a feature to rearrange the data
inc_state %>% 
  select( State, mean_Growth ) %>%
  arrange( desc( mean_Growth ) ) %>%
  top_n(10) %>%
  kbl() %>%
  kable_classic(bootstrap_options = "striped",full_width = F)

## Selecting by mean_Growth

State	mean_Growth
WY	19.145000
ME	16.210000
RI	16.031250
DC	8.298139
HI	6.792857
UT	6.307790
SC	6.060625
TX	6.024186
CA	5.897489
FL	5.846099

Industry-Level Data Industry is another interesting categorical variable organize the data by. The following lists the unique Industry categories:

unique( inc$Industry )

##  [1] "Consumer Products & Services" "Government Services"         
##  [3] "Health"                       "Energy"                      
##  [5] "Advertising & Marketing"      "Real Estate"                 
##  [7] "Financial Services"           "Retail"                      
##  [9] "Software"                     "Computer Hardware"           
## [11] "Logistics & Transportation"   "Food & Beverage"             
## [13] "IT Services"                  "Business Products & Services"
## [15] "Education"                    "Construction"                
## [17] "Manufacturing"                "Telecommunications"          
## [19] "Security"                     "Human Resources"             
## [21] "Travel & Hospitality"         "Media"                       
## [23] "Environmental Services"       "Engineering"                 
## [25] "Insurance"

Here, the data is aggregated by Industry and several summary metrics are tabulated

#Aggregating by Industry & summarize several features
inc_industry <- inc %>%
  group_by( Industry ) %>%
  summarise( num_companies = n(), 
             num_city = n_distinct( City ), 
             mean_Rev = mean( Revenue ), 
             mean_Growth = mean( Growth_Rate ),
             mean_Rank = mean( Rank ),
             mean_Size = mean( Employees, na.rm=TRUE ),
             num_State = n_distinct( State ) )
#A full view of industry level data sorted in descending order by the number of companies:
inc_industry %>% 
  arrange( desc( num_companies ) ) %>%
  kbl() %>%
  kable_classic(bootstrap_options = "striped",full_width = F)

Industry	num_companies	num_city	mean_Rev	mean_Growth	mean_Rank	mean_Size	num_State
IT Services	733	389	28214598	3.331815	2540.729	140.42076	44
Business Products & Services	482	293	54705187	3.518486	2878.041	244.49375	42
Advertising & Marketing	471	252	16528662	6.225478	2358.718	84.35456	43
Health	355	258	50319437	4.856394	2415.746	232.85311	44
Software	342	202	23802924	5.020643	2328.196	150.32845	44
Financial Services	260	191	50580385	5.435308	2352.785	183.43462	39
Manufacturing	256	224	49546875	2.295391	3000.590	172.32157	38
Consumer Products & Services	203	152	73676847	8.776108	2193.768	223.96059	34
Retail	203	164	50529064	6.184729	2233.729	182.60099	37
Government Services	202	102	29748020	7.238168	2087.866	129.62871	30
Human Resources	196	130	47173980	3.300459	2426.398	1158.06122	33
Construction	187	157	70450802	3.366684	2642.241	155.60963	36
Logistics & Transportation	155	124	95745161	4.339226	2580.284	259.70130	32
Food & Beverage	131	108	98559542	3.636565	2649.435	510.93798	32
Telecommunications	129	103	56855814	2.883721	2596.504	242.85039	28
Energy	109	83	126344954	9.603303	2038.220	242.54128	31
Real Estate	96	75	30892708	7.746667	2017.740	198.87368	27
Education	83	68	13726506	3.642651	2758.000	92.59036	25
Engineering	74	65	34222973	1.984324	2733.095	276.14865	26
Security	73	63	52230137	3.388904	2494.041	562.45205	30
Travel & Hospitality	62	52	47283871	2.353064	2906.242	371.53226	20
Media	54	39	32266667	4.374074	2222.870	176.51852	21
Environmental Services	51	50	51741176	2.068039	2699.118	199.11765	22
Insurance	50	49	46758000	2.008400	2899.400	146.78000	23
Computer Hardware	44	39	270129545	4.089773	3008.068	220.77273	17

This organized view of Industry listed in descending order of num_companies informs on what industries have a dominate presence in the ranking of fastest growing companies. However, we can draw subsets from the aggregate to highlight details of the dataset. The following ranks the top 10 Industry by mean Growth Rate (mean_growth)

inc_industry %>% 
  select( Industry, mean_Growth ) %>%
  arrange( desc( mean_Growth ) ) %>%
  top_n(10) %>%
  kbl() %>%
  kable_classic(bootstrap_options = "striped",full_width = F)

## Selecting by mean_Growth

Industry	mean_Growth
Energy	9.603303
Consumer Products & Services	8.776108
Real Estate	7.746667
Government Services	7.238168
Advertising & Marketing	6.225478
Retail	6.184729
Financial Services	5.435308
Software	5.020643
Health	4.856394
Media	4.374074

Ranking the data by Numeric variables:

Aggregating the data by categorical features provides some informative summaries of the data. However, we may also be interested in drawing information based on one of the numeric features. For example, the following code will rank the 10 top companies by highest Growth Rate:

inc %>%
  arrange( Rank ) %>%
  top_n( 10, Growth_Rate ) %>%
  kbl() %>%
  kable_classic(bootstrap_options = "striped",full_width = F)

Rank	Name	Growth_Rate	Revenue	Industry	Employees	City	State
1	Fuhu	421.48	1.179e+08	Consumer Products & Services	104	El Segundo	CA
2	FederalConference.com	248.31	4.960e+07	Government Services	51	Dumfries	VA
3	The HCI Group	245.45	2.550e+07	Health	132	Jacksonville	FL
4	Bridger	233.08	1.900e+09	Energy	50	Addison	TX
5	DataXu	213.37	8.700e+07	Advertising & Marketing	220	Boston	MA
6	MileStone Community Builders	179.38	4.570e+07	Real Estate	63	Austin	TX
7	Value Payment Systems	174.04	2.550e+07	Financial Services	27	Nashville	TN
8	Emerge Digital Group	170.64	2.390e+07	Advertising & Marketing	75	San Francisco	CA
9	Goal Zero	169.81	3.310e+07	Consumer Products & Services	97	Bluffdale	UT
10	Yagoozon	166.89	1.860e+07	Retail	15	Warwick	RI

This view organizes the top 10 companies ranked by Revenue

inc %>%
  arrange( Rank ) %>%
  top_n( 10, Revenue ) %>%
  kbl() %>%
  kable_classic(bootstrap_options = "striped",full_width = F)

Rank	Name	Growth_Rate	Revenue	Industry	Employees	City	State
1397	EnvisionRxOptions	2.88	2.70e+09	Health	625	Twinsburg	OH
2522	DLA Piper	1.41	2.40e+09	Business Products & Services	4036	Chicago	IL
3853	ABC Supply	0.73	4.70e+09	Construction	6549	Beloit	WI
4052	Kum & Go	0.65	2.80e+09	Retail	4589	West Des Moines	IA
4246	American Tire Distributors	0.59	3.50e+09	Consumer Products & Services	3341	Huntersville	NC
4716	Westcon Group	0.44	3.80e+09	IT Services	3000	Tarrytown	NY
4788	CDW	0.41	1.01e+10	Computer Hardware	6800	Vernon Hills	IL
4802	Boise Cascade	0.41	2.80e+09	Construction	4470	Boise	ID
4936	Coty	0.36	4.60e+09	Consumer Products & Services	10000	New York	NY
4997	Dot Foods	0.34	4.50e+09	Food & Beverage	3919	Mt. Sterling	IL

It is interesting that none of the top 10 in Growth_Rate are present in the top 10 companies listed by Revenue. Out of curiosity, let’s investigate to see how many of the top 100 Growth_Rate companies rank in the top 500 of Revenue:

num = 500
top_GR <- inc %>%
  arrange( Rank ) %>%
  top_n( num, Growth_Rate ) %>%
  select( Name )

top_Rev <- inc %>%
  arrange( Rank ) %>%
  top_n( num, Revenue ) %>%
  select( Name )

both <- intersect( top_GR, top_Rev )
per_both <- nrow( both )/num*100
paste0( 'Only ', per_both, '% of the top ', num, ' Growth Rate ranked companies feature among the top 100 in Revenue (on the list)' )

## [1] "Only 6.6% of the top 500 Growth Rate ranked companies feature among the top 100 in Revenue (on the list)"

That only 6.6% intersect in the two orderings of the data is an interesting outcome.

Question 1

Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.

y_lines = c( 200, 400, 600 )
# Answer Question 1 here
inc %>%
  group_by( State ) %>% #aggregate by state
  count() %>% #count the number of records for each state
  ggplot( aes( x= reorder(State,n), y= n ) ) +
  geom_col( ) +
  coord_flip() +
  labs(
    title = 'Number of top 5000 companies by State',
    x = 'State',
    y = '# Companies'
  ) + 
  theme_classic() +
  geom_hline(yintercept=y_lines, color="white", size=1)

Quesiton 2

Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

From visual inspection of the figure above, we see that the state with the 3rd most companies in the data set if New York (NY).

# Answer Question 2 here
backtrans <- function(y){log10(mean(10 ^ y)) }
inc %>%
  filter( State == 'NY' ) %>%  #filter records for NY state
  filter( complete.cases(.) ) %>% #filter only complete cases
  select( Industry, Employees ) %>% #select the features Industry & Employees
  mutate( Industry = as_factor( Industry ) ) %>% #recast Idustry as a factor
  mutate(zRT = scale(Employees)) %>% 
  filter(between(zRT,-2.5,+2.5)) %>% #filter outliers based on Z-score
  ggplot( aes(x = reorder( Industry, Employees, FUN = median ), y = Employees ) ) +
  geom_boxplot( alpha= 0.7, outlier.shape = NA ) +
  stat_summary(fun.y = backtrans, geom = "point", aes(group=1), colour = "red") +
  scale_y_log10() +
  coord_flip() +
  labs(
    title = 'Employment by Industry for New York State',
    subtitle = 'descending order by median, mean given in red',
    x = 'Industry',
    y = 'Employees'
  ) +
  theme_linedraw()

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

# Answer Question 3 here
inc %>%
  filter( complete.cases(.) ) %>% #filter incomplete cases
  mutate( rev_per_emp = Revenue/Employees ) %>% #create a new feature that quantifies revenue per employee
  ggplot( aes(x = reorder( Industry, rev_per_emp, FUN = median ), y = rev_per_emp )) +
  geom_boxplot( alpha= 0.7, outlier.shape = NA ) +
  stat_summary(fun.y = backtrans, geom = "point", aes(group=1), colour = "red") +
  scale_y_log10( labels = trans_format("log10", math_format(10^.x)) ) +
  coord_flip() +
  labs(
    title = 'Revenue per Employee by Industry',
    subtitle = 'descending order by Industry median, mean given in red',
    x = 'Industry',
    y = 'Revenue per Employee'
  ) +
  theme_linedraw()