Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:

inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)

And lets preview this data:

library(kableExtra)

kable(head(inc))

Rank	Name	Growth_Rate	Revenue	Industry	Employees	City	State
1	Fuhu	421.48	1.179e+08	Consumer Products & Services	104	El Segundo	CA
2	FederalConference.com	248.31	4.960e+07	Government Services	51	Dumfries	VA
3	The HCI Group	245.45	2.550e+07	Health	132	Jacksonville	FL
4	Bridger	233.08	1.900e+09	Energy	50	Addison	TX
5	DataXu	213.37	8.700e+07	Advertising & Marketing	220	Boston	MA
6	MileStone Community Builders	179.38	4.570e+07	Real Estate	63	Austin	TX

summary(inc)

##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2502   110 Consulting        :   1   Mean   :  4.612  
##  3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4995                    
##     Revenue                                  Industry      Employees      
##  Min.   :2.000e+06   IT Services                 : 733   Min.   :    1.0  
##  1st Qu.:5.100e+06   Business Products & Services: 482   1st Qu.:   25.0  
##  Median :1.090e+07   Advertising & Marketing     : 471   Median :   53.0  
##  Mean   :4.822e+07   Health                      : 355   Mean   :  232.7  
##  3rd Qu.:2.860e+07   Software                    : 342   3rd Qu.:  132.0  
##  Max.   :1.010e+10   Financial Services          : 260   Max.   :66803.0  
##                      (Other)                     :2358   NA's   :12       
##             City          State     
##  New York     : 160   CA     : 701  
##  Chicago      :  90   TX     : 387  
##  Austin       :  88   NY     : 311  
##  Houston      :  76   VA     : 283  
##  San Francisco:  75   FL     : 282  
##  Atlanta      :  74   IL     : 273  
##  (Other)      :4438   (Other):2764

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

#add the tidyverse library
library(tidyverse)
# Count the number of obervations for each variable
inc %>%
  count()

## # A tibble: 1 x 1
##       n
##   <int>
## 1  5001

#take a look at the structure of the data
kable(str(inc))

## 'data.frame':    5001 obs. of  8 variables:
##  $ Rank       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name       : Factor w/ 5001 levels "(Add)ventures",..: 1770 1633 4423 690 1198 2839 4733 1468 1869 4968 ...
##  $ Growth_Rate: num  421 248 245 233 213 ...
##  $ Revenue    : num  1.18e+08 4.96e+07 2.55e+07 1.90e+09 8.70e+07 ...
##  $ Industry   : Factor w/ 25 levels "Advertising & Marketing",..: 5 12 13 7 1 20 10 1 5 21 ...
##  $ Employees  : int  104 51 132 50 220 63 27 75 97 15 ...
##  $ City       : Factor w/ 1519 levels "Acton","Addison",..: 391 365 635 2 139 66 912 1179 131 1418 ...
##  $ State      : Factor w/ 52 levels "AK","AL","AR",..: 5 47 10 45 20 45 44 5 46 41 ...

#count all the NA values in the dataframe
cInc <- inc %>%
  is.na() %>%
  sum()

kable(cInc)

x
12

#Count NAs in each column
kable(sapply(inc, function(x) sum(is.na(x))))

	x
Rank	0
Name	0
Growth_Rate	0
Revenue	0
Industry	0
Employees	12
City	0
State	0

#count the number of unique types of responses in categorical variables
kable(sapply(inc, function(x) length(unique(x))))

	x
Rank	4999
Name	5001
Growth_Rate	1147
Revenue	1069
Industry	25
Employees	692
City	1519
State	52

#Peaking at rows with the same ranking

#values of Rank that are duplicated as a list
dups <- unlist(as.list(inc$Rank[duplicated(inc$Rank)]))

#filters the rows with duplicated rankings
fildups <- inc %>% 
  filter(Rank %in% dups)

#display
kable(head(fildups))

Rank	Name	Growth_Rate	Revenue	Industry	Employees	City	State
3424	Stemp Systems Group	19.37	6800000	IT Services	39	Long Island City	NY
3424	Total Beverage Solution	0.90	41500000	Food & Beverage	35	Mt. Pleasant	SC
5000	INE	0.34	6800000	IT Services	35	Bellevue	WA
5000	ALL4	0.34	4700000	Environmental Services	34	Kimberton	PA

#Little bit of a deeper dive into Growth Rates
kable(head(inc %>% arrange(desc(Growth_Rate))))

Rank	Name	Growth_Rate	Revenue	Industry	Employees	City	State
1	Fuhu	421.48	1.179e+08	Consumer Products & Services	104	El Segundo	CA
2	FederalConference.com	248.31	4.960e+07	Government Services	51	Dumfries	VA
3	The HCI Group	245.45	2.550e+07	Health	132	Jacksonville	FL
4	Bridger	233.08	1.900e+09	Energy	50	Addison	TX
5	DataXu	213.37	8.700e+07	Advertising & Marketing	220	Boston	MA
6	MileStone Community Builders	179.38	4.570e+07	Real Estate	63	Austin	TX

#take a better look at correlated variables
corInc <- inc %>%
  select_if(is.numeric) %>%
  cor()

kable(corInc)

	Rank	Growth_Rate	Revenue	Employees
Rank	1.0000000	-0.3976698	0.0821068	NA
Growth_Rate	-0.3976698	1.0000000	0.0063041	NA
Revenue	0.0821068	0.0063041	1.0000000	NA
Employees	NA	NA	NA	1

library(psych)

#Some quick statistical information not included in summary() function
kable(describe(inc))

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
Rank	1	5001	2.501641e+03	1.443506e+03	2.502e+03	2.501731e+03	1.853250e+03	1.0e+00	5.0000e+03	4.9990e+03	-0.0004896	-1.200432	2.041222e+01
Name*	2	5001	2.501000e+03	1.443809e+03	2.501e+03	2.501000e+03	1.853250e+03	1.0e+00	5.0010e+03	5.0000e+03	0.0000000	-1.200720	2.041650e+01
Growth_Rate	3	5001	4.611826e+00	1.412369e+01	1.420e+00	2.136756e+00	1.215732e+00	3.4e-01	4.2148e+02	4.2114e+02	12.5495059	242.336616	1.997192e-01
Revenue	4	5001	4.822254e+07	2.405423e+08	1.090e+07	1.733497e+07	1.067472e+07	2.0e+06	1.0100e+10	1.0098e+10	22.1744453	722.656318	3.401441e+06
Industry*	5	5001	1.210038e+01	7.328351e+00	1.300e+01	1.205499e+01	8.895600e+00	1.0e+00	2.5000e+01	2.4000e+01	-0.1012333	-1.184515	1.036282e-01
Employees	6	4989	2.327180e+02	1.353128e+03	5.300e+01	8.177511e+01	5.337360e+01	1.0e+00	6.6803e+04	6.6802e+04	29.8104167	1268.671130	1.915720e+01
City*	7	5001	7.320010e+02	4.411164e+02	7.610e+02	7.317383e+02	6.049008e+02	1.0e+00	1.5190e+03	1.5180e+03	-0.0420954	-1.264827	6.237704e+00
State*	8	5001	2.480324e+01	1.563706e+01	2.300e+01	2.444364e+01	1.927380e+01	1.0e+00	5.2000e+01	5.1000e+01	0.1190507	-1.461030	2.211193e-01

Notes:

We can see from the additional EDA above that out of the 5001 observations in this data set with 8 variables, half of which are numerical and the other are categorical variables. We can also show that 12 RevperEmp data points are of NA value, and each column contains a varying number of unique values. It is interesting to note that there are 52 unique data points in the State column, when we know that there are 50 states, which leads me to believe that terratories other than states are included in this dataset. We also see that there are 2 duplicated rankings - which may be interesting to explore the reason why that is in more detail because the data seemes really different for the companies that have the same ranking in this dataset. Upon deeper insection of the variable Growth_Rate we can see that it is clearly correlated to the Rank variable. Upon further inspection of the correlation between numeric variables we can also see that Growth_Rate and Rank are highly correlated (negative correlation), and Revenue and Rank also seem to be significantly correlated (positive correlation). Finally by using the describe() function from the psych library, we can uncover some more helpful information about the variables in this dataset (such as kurtosis, skew)

Question 1

Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.

library(ggplot2)

q1 <- inc%>%
  group_by(State)%>%
  count(State)%>% #counts the number of company in each state
  arrange(desc(n))%>% #arranges them in decending order
  as_tibble(q1) #changes the format of the data for ggplot()


#graph with count of companies in each state
ggplot(q1, aes(x=reorder(State,n), y=n, fill = State))+
  geom_bar(stat="identity", width=.6, show.legend = FALSE)+
  geom_hline(yintercept=seq(1,800,100), col="white", lwd=1)+
  theme(panel.grid.major.x = element_blank())+
  coord_flip()+
  geom_text(aes(x = State, label = State, y = n +20), colour="blue", size = 2)+
  labs(y = "Count", x ="State", color="legend", title = "Fastest Growing Companies by State") + 
  theme(axis.text.y=element_blank()
        ,axis.title.y=element_blank(),
        axis.ticks.y=element_blank(),
        plot.title = element_text(hjust = 0.5))

Approach

We are essentially creating a frequency chart of the number of companies in each state therefore it is appropriate to group the data by state and count the number of companies in each state. Then I use a bar graph to display the counts for each state and arrange them in deecending order so that the state with the greatest number of growing companies is at the top. Since the visualization was meant to be viewed in a portate format I decided to flip the axis of this graph so that more space can be placed between the state columns and therefore make the width less than the height of this vizualization. I then changed the color of each bar to make it easier to compare adjacent bars and included the abreviation for each state next to the bar (rather than on the axis).

Quesiton 2

Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

#initial group
NY <- inc %>% 
  filter(State== "NY") %>%
  filter(complete.cases(.)) %>%
  group_by(Industry)

#Initial boxplot with outliers
ggplot(NY, aes(Industry, Employees, fill = Industry)) + 
  geom_boxplot(outlier.colour="blue", 
               outlier.shape=4, 
               outlier.size=2, 
               show.legend=FALSE) + 
   theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
   coord_flip()+
  labs(title="Employees by Industry in NY - GRAPH A", y="Number of Employees")

#Boxplot without outliers (ignoring all points greeater than 1500)
ggplot(NY, aes(Industry, Employees, fill = Industry)) + 
  geom_boxplot(outlier.colour="blue", 
               outlier.shape=4, 
               outlier.size=2, 
               show.legend=FALSE) + 
  ylim(0,1500) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  coord_flip()+
  labs(title="Employees by Industry in NY - GRAPH B", y="Number of Employees")

#Specifically compare mean and median
NY2 <- NY%>%
  summarise(IndMean = mean(Employees),
            IndMedian = median(Employees))%>%
  gather(statistic, Employees, IndMean, IndMedian)

#Plot
ggplot(NY2, aes(x=Industry, y=Employees))+
  geom_bar(stat = "identity", aes(fill=statistic))+
  geom_hline(yintercept=seq(1,1700,100), col="white", lwd=1)+
  theme_classic()+
  coord_flip()+
  scale_fill_discrete(name = "Metric", labels = c("Mean", "Median")) +
  labs(title="Mean & Median Employees in NY - Graph C", y="Number of Employees")

#Biggest Skew 
NYskew <- NY2 %>%
  spread(statistic, Employees) 

NYskew <- NYskew %>%
  mutate(diff = IndMean - IndMedian)

ggplot(NYskew, aes(x=fct_reorder(Industry, diff), y=diff, fill = Industry))+
  geom_bar(stat="identity", show.legend = FALSE)+
  coord_flip() +
  geom_text(aes(x = Industry, label = round(diff, 2), y = diff +70), colour="blue", size = 3)+
  labs(title="(Mean - Median) Employees by Industry in NY - Graph D", y="Difference") +
  theme(axis.title.y=element_blank(),
        axis.ticks.y=element_blank(),
        plot.title = element_text(hjust = 0.5))

#Mean Graph
NYmean <- NY2 %>%
  filter(statistic == "IndMean")

ggplot(NYmean, aes(x=fct_reorder(Industry, Employees), y=Employees, fill = Industry))+
  geom_bar(stat="identity", show.legend = FALSE)+
  coord_flip() +
  geom_text(aes(x = Industry, label = round(Employees, 2), y = Employees +70), colour="blue", size = 3)+
  labs(title="Mean Employees by Industry in NY - Graph E", y="Mean Number of Employees") +
  theme(axis.title.y=element_blank(),
        axis.ticks.y=element_blank(),
        plot.title = element_text(hjust = 0.5))

#Median Graph
NYmedian <- NY2 %>%
  filter(statistic == "IndMedian")

ggplot(NYmedian, aes(x=fct_reorder(Industry, Employees), y=Employees, fill = Industry))+
  geom_bar(stat="identity", show.legend = FALSE)+
  coord_flip() +
  geom_text(aes(x = Industry, label = round(Employees, 2), y = Employees +10), colour="blue", size = 3)+
  labs(title="Median Employees by Industry in NY - Graph F", y="Median Number of Employees") +
  theme(axis.title.y=element_blank(),
        axis.ticks.y=element_blank(),
        plot.title = element_text(hjust = 0.5))

Approach:

According to the graph in question 1, the state with the 3rd most companies in the dataset is NY. To prep the data I filtered out NY then used complete.cases() to deal with cases with incomplete data. The graphs I chose were boxplots and bar graphs which seemed most appropriate because we are comparing measures of central tendency for various categories. Initially when we plot the data we see that in Graph A there are two outliers that are not allowing us to really see the boxplots for the other categories; this graph gives insight into the range and distribution of each Industry. In Graph B we can see that if we change the scale and set a limit for the y scale at about 1500 (which essentially means to ignore all outliers outside that range), we can clearly see the distribution of Employees in each Industry is highly skewed. We confirm this in Graph C and D where we compare mean and median number of employees for each Industry; Business Products and Services has the biggest skew in number of employees whereas Real Estate, Insurance, Government Services, Enviernmental Services, Computer Hardware anmd Engineering are normally distributed (mean = median).

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

#group by Industry and calculate the reveue per employee
revEmp <- inc %>%
  mutate(RevperEmp = Revenue/Employees) %>%
  filter(complete.cases(.)) %>%
  group_by(Industry)

#Boxplot to show distribution
ggplot(revEmp, aes(Industry, RevperEmp, fill = Industry)) + 
  geom_boxplot(outlier.colour="blue", 
               outlier.shape=4, 
               outlier.size=2, 
               show.legend=FALSE) + 
   theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
   coord_flip()+
  labs(title="Fastest Growing Companies in  USA", y="Revenue per Employee")

#Boxplot without outliers (ignoring all points greeater than 1500)
ggplot(revEmp, aes(Industry, RevperEmp, fill = Industry)) + 
  geom_boxplot(outlier.colour="blue", 
               outlier.shape=4, 
               outlier.size=2, 
               show.legend=FALSE) + 
  ylim(0, 10000000) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  coord_flip()+
  labs(title="Fastest Growing Companies in  USA", y="Revenue per Employee")

#Specifically compare mean and median
revEmp2 <- revEmp%>%
  summarise(IndMean = mean(RevperEmp),
            IndMedian = median(RevperEmp))%>%
  gather(statistic, RevperEmp, IndMean, IndMedian)

#Plot
ggplot(revEmp2, aes(x=Industry, y=RevperEmp))+
  geom_bar(stat = "identity", aes(fill=statistic))+
  geom_hline(yintercept=seq(1,1700,100), col="white", lwd=1)+
  theme_classic()+
  coord_flip()+
  scale_fill_discrete(name = "Metric", labels = c("Mean", "Median")) +
  labs(title="Fastest Growing Companies in  USA", y="Revenue per Employee")

revEmp2

## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?

## # A tibble: 50 x 3
##    Industry                     statistic RevperEmp
##    <fct>                        <chr>         <dbl>
##  1 Advertising & Marketing      IndMean     306036.
##  2 Business Products & Services IndMean     359097.
##  3 Computer Hardware            IndMean     817702.
##  4 Construction                 IndMean     465682.
##  5 Consumer Products & Services IndMean     466068.
##  6 Education                    IndMean     296454.
##  7 Energy                       IndMean    1554656.
##  8 Engineering                  IndMean     201120.
##  9 Environmental Services       IndMean     283607.
## 10 Financial Services           IndMean     394231.
## # ... with 40 more rows

#Biggest Skew 
revEmpSkew <- revEmp2 %>%
  spread(statistic, RevperEmp) 

revEmpSkew  <- revEmpSkew  %>%
  mutate(diff = IndMean - IndMedian)

ggplot(revEmpSkew , aes(x=fct_reorder(Industry, diff), y=diff, fill = Industry))+
  geom_bar(stat="identity", show.legend = FALSE)+
  coord_flip() +
  geom_text(aes(x = Industry, label = round(diff, 2), y = diff +80000), colour="blue", size = 3)+
  labs(title="Fastest Growing Companies in  USA (mean - median)", y="Difference") +
  theme(axis.title.y=element_blank(),
        axis.ticks.y=element_blank(),
        plot.title = element_text(hjust = 0.5))

#Median Graph
revEmpMedian<- revEmp2  %>%
  filter(statistic == "IndMedian")

ggplot(revEmpMedian, aes(x=fct_reorder(Industry, RevperEmp), y=RevperEmp, fill = Industry))+
  geom_bar(stat="identity", show.legend = FALSE)+
  coord_flip() +
  geom_text(aes(x = Industry, label = round(RevperEmp, 2), y = RevperEmp +40000), colour="blue", size = 3)+
  labs(title="Revenue per Employee by Industry ", y="Median") +
  theme(axis.title.y=element_blank(),
        axis.ticks.y=element_blank(),
        plot.title = element_text(hjust = 0.5))

#Mean Graph
revEmpMean <- revEmp2  %>%
  filter(statistic == "IndMean")

ggplot(revEmpMean , aes(x=fct_reorder(Industry, RevperEmp), y=RevperEmp, fill = Industry))+
  geom_bar(stat="identity", show.legend = FALSE)+
  coord_flip() +
  geom_text(aes(x = Industry, label = round(RevperEmp, 2), y = RevperEmp +100000), colour="blue", size = 3)+
  labs(title="Revenue per Employee by Industry ", y="Mean") +
  theme(axis.title.y=element_blank(),
        axis.ticks.y=element_blank(),
        plot.title = element_text(hjust = 0.5))

Approach:

The same approach taken in Question 2 is the same approach taken here except, the data from all states were used. First the ratio of revenue per employee was calculated and added as a new column in the dataframe. Then two boxplots were chosen to display the distribution and range of each industry’s revenue per employee, one with outliers the original scale and one with a scale that had a limit so that the visualization was more informative. A deeper dive into comparing means and medians was represented in the bar charts that follow. We can see that the distribution of revenue per employee is highly skewed right for all industries (mean > median). This means that there are many companies that have a higher average revenue per employee than median revenue per employee. We can see that the Computer Hardware industry has the most revenue per employee whereas Human Resources has the least. This makes sense to me - most people spend lots of money on technology such as smart phones, smart tvs, computers etc; Human Resources on the other hand is not an industry particularly looking to make money. It is interesting to note that Energy has the highest median revenue per employee.

Data 608 HW 1

Data 608 HW 1

Principles of Data Visualization and Introduction to ggplot2

Notes:

Question 1

Approach

Quesiton 2

Approach:

Question 3

Approach: