Load the Packages

library(MASS)
library(dplyr)
library(ggplot2)
library(tidyr)

Problem Sets

Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. Lets read this in:

inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)

And lets preview this data:

head(inc)

##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX

summary(inc)

##       Rank          Name            Growth_Rate         Revenue         
##  Min.   :   1   Length:5001        Min.   :  0.340   Min.   :2.000e+06  
##  1st Qu.:1252   Class :character   1st Qu.:  0.770   1st Qu.:5.100e+06  
##  Median :2502   Mode  :character   Median :  1.420   Median :1.090e+07  
##  Mean   :2502                      Mean   :  4.612   Mean   :4.822e+07  
##  3rd Qu.:3751                      3rd Qu.:  3.290   3rd Qu.:2.860e+07  
##  Max.   :5000                      Max.   :421.480   Max.   :1.010e+10  
##                                                                         
##    Industry           Employees           City              State          
##  Length:5001        Min.   :    1.0   Length:5001        Length:5001       
##  Class :character   1st Qu.:   25.0   Class :character   Class :character  
##  Mode  :character   Median :   53.0   Mode  :character   Mode  :character  
##                     Mean   :  232.7                                        
##                     3rd Qu.:  132.0                                        
##                     Max.   :66803.0                                        
##                     NA's   :12

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

Answer:

The summary result given above is the summary statistics, including mean, median, interquartile range, minimum, and maximum, for each numeric variable of the 5,000 fastest growing companies dataset. These summaries tell me about the central tendency and spread of data. It gives me quick overview of such as where the mean of a variable lies and whether it’s skewed. For instance, the growth_rate variable which maximum value is hundred times higher than the mean. This indicates there are outliers which need to be taken care of. Also, if the mean is lower or higher than the median it means that a variable could be right or left skewed. The degree of skewness depending on how large is mean deviated from median. Below charts are just for proving the interpretation above. The figure 1 shows there are outliers. Because the left chart (original) in the figure 2 does not clearly show the skewness of a distribution, I use the fitdistr() function to find out the lambda (mean) of exponential probability density function and use the rexp() function to generate 5000 samples of the lambda to plot another histogram to show the right-skewed distribution.

# A simple box plot to show outliers
boxplot(inc$Growth_Rate)

# Histogram to show skewness of a distribution
par(mfrow = c(1, 2))
hist(inc$Revenue, breaks=50, col="royalblue", ylab = '', xlab = '', main="Original - Revenue")
epdf <- fitdistr(inc$Revenue, densfun="exponential")
epdf_samples <- rexp(5000, epdf$estimate)
hist(epdf_samples, breaks=50, col="hotpink", ylab = '', xlab = '', main="Exponential - Revenue")

Question 1

Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.

Answer:

When I do the states abbreviation to states full name conversion, I found ‘PR’ and ‘DC’ are not in the 50 states. So, I filter out these two non-50 states abbreviation.

# Filter out Washington DC and Puerto Rico which are not a State
inc_new <- inc %>% filter(State != "PR" & State != "DC")

# Group the state by industry and count the industry total
state_ind_bus <- inc_new %>% select(c(State, Industry)) %>% mutate(State_Name = state.name[match(State, state.abb)]) %>% group_by(State_Name) %>% mutate(Total_Bus_Count = n()) %>% group_by(State_Name, Industry, Total_Bus_Count, State) %>% summarise(Total_Ind_Count = n())

# Plot a bar chart for the above aggregation result
ggplot(state_ind_bus, aes(x=reorder(State_Name, Total_Bus_Count), y=Total_Ind_Count, fill=Industry)) +
  geom_bar(stat='identity') + labs(title='Distribution of Companies Count by Industry in 50 States', 
                                   x='', y='', 
                                   caption='Source: www.inc.com/inc5000') + 
  theme(
    plot.title = element_text(hjust=0.5),
    legend.title=element_text(size=8), 
    legend.text=element_text(size=8), 
    panel.background=element_blank(), 
    axis.text.y = element_text(size=7), 
    axis.ticks.y = element_blank(), 
    plot.caption = element_text(hjust=0.5), 
    legend.position = c(.96, .96),
    legend.justification = c("right", "top"),
    legend.box.just = "right",
    legend.margin = margin(5, 5, 5, 5)
    ) +
  theme(axis.text.y = element_text(margin = margin(0,-0.7,0,0,"cm"), face= "bold")) + 
  coord_flip()

From the figure above, we can just pick the top 10 companies count states to show a more detailed work.

# Get the top 10 states with most companies
top_10_counts <- state_ind_bus[!duplicated(state_ind_bus$State_Name), ] %>%  arrange(desc(Total_Bus_Count))
top_10_counts <- top_10_counts[1:10,]
top_10_states <- state_ind_bus %>% filter(Total_Bus_Count %in% top_10_counts$Total_Bus_Count)

# Bar plot to show the distribution of companies count by industry in top 10 states
ggplot(top_10_states, aes(x=reorder(State_Name, Total_Bus_Count), y=Total_Ind_Count, fill=Industry)) +
  geom_bar(stat='identity') + labs(title='Top 10 States of Most Companies by Industry', 
                                   x='', y='Companies Count Per Industry', 
                                   caption='Source: www.inc.com/inc5000') + 
  theme(
    plot.title = element_text(hjust=-0.5),
    legend.title=element_text(size=8), 
    legend.text=element_text(size=8), 
    panel.background=element_blank(), 
    axis.ticks.y = element_blank(),
    plot.caption = element_text(hjust=5), 
    ) +
  guides(x = "none") +
  geom_text(aes(y=Total_Bus_Count-100, label=Total_Bus_Count), color='white', size=3) +
  scale_fill_hue(l=45) +
  coord_flip()

Quesiton 2

Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

Answer:

From the bar chart in the question 1, we can see that the 3rd most companies is the New York state. Let’s now select only the New York state data and do the visualization.

# Select the New York state data
ny_bus_1 <- inc %>% filter(State=='NY')

# Keep only the complete rows
ny_bus_1 <- ny_bus_1[complete.cases(ny_bus_1), ]

# Calculate the mean and median of employees for each industry
ny_bus_1 <- ny_bus_1 %>% group_by(Industry) %>% summarise(Mean=mean(Employees), Median=median(Employees)) %>% gather("Mean_Median",'value', Mean:Median)

# Plot a bar chart to show the mean and median of employees for each industry
ggplot(ny_bus_1, aes(x=reorder(factor(Industry),value, fun=median),y=value,fill=factor(Mean_Median))) +
  geom_bar(stat='identity', position='dodge') + labs(title='Mean & Median of Employees Count Per Industry in NY', x='', y='', caption='Source: www.inc.com/inc5000') + 
  theme(
    plot.title = element_text(hjust=0.5),
    legend.title=element_blank(), 
    legend.text=element_text(size=8), 
    panel.background=element_blank(), 
    axis.ticks.y = element_blank(),
    plot.caption = element_text(hjust=0.5)) + 
  theme(
  panel.grid.major.x = element_line(size = 0.15, linetype = 'dashed',
                                colour = "red"), 
  panel.grid.major.y = element_line(size = 0.15, linetype = 'dashed',
                                colour = "white")) +
  coord_flip()

We can see that there are few industries has much higher of mean. This could be due to the outliers exist in where few companies within the industries have much higher number of employees than others. We can further to investigate this situation by plotting a box plot for the number of employees for all industries.

# Select the New York state data
ny_bus_2 <- inc %>% filter(State=='NY')

# Keep only the complete rows
ny_bus_2 <- ny_bus_2[complete.cases(ny_bus_2), ]

# Plot a box chart to investigate outliers for each industry 
ggplot(ny_bus_2, aes(x=reorder(factor(Industry),Employees, fun=median),y=Employees,fill=factor(Industry))) + geom_boxplot() + labs(title="Distribution of Employees Count for Each Industry", y='', x='', caption='Source: www.inc.com/inc5000') + 
  theme(
    legend.position = "none", 
    plot.title = element_text(hjust=0.5), 
    panel.background=element_blank(),
    plot.caption = element_text(hjust=0.3)) + 
  theme(
  panel.grid.major.x = element_line(size = 0.15, linetype = 'dashed',
                                colour = "red"), 
  panel.grid.major.y = element_line(size = 0.15, linetype = 'dashed',
                                colour = "white")) + 
  coord_flip()

From the above box plot, we can see there are some outliers on few industries. We can remove these outliers by using the Interquantile Range (IQR) method. A point is an outlier if it’s above the 75th or below the 25th percentile by a factor of 1.5 times the IQR. An outlier would be a point below [Q1- (1.5)IQR] or above [Q3+(1.5)IQR]. Let’s now find the outliers, remove them, and re-plot with a new dataset.

# Select New York state data
ny_bus_3 <- inc %>% filter(State=='NY')

# Keep only the complete rows
ny_bus_3 <- ny_bus_3[complete.cases(ny_bus_2), ]

# Find the 25th and 75th percentile of employees
quantile <- quantile(ny_bus_3$Employees, probs=c(0.25, 0.75))

# Find the IQR of employees
iqr <- IQR(ny_bus_3$Employees)

# Eliminating outliers from the dataset
ny_bus_3<- subset(ny_bus_3, ny_bus_3$Employees > (quantile[1]-1.5*iqr) & ny_bus_3$Employees < (quantile[2]+1.5*iqr))

# Calculate the employees mean and median for each industry
ny_bus_3 <- ny_bus_3 %>% group_by(Industry) %>% summarise(Mean=mean(Employees), Median=median(Employees)) %>% gather("Mean_Median",'value', Mean:Median)

# Plot a bar chart to show the mean and media for each industry
ggplot(ny_bus_3, aes(x=reorder(factor(Industry),value, fun=median),y=value,fill=factor(Mean_Median))) +
  geom_bar(stat='identity', position='dodge') + labs(title='Mean & Median of Employees Count Per Industry in NY', x='', y='', caption='Source: www.inc.com/inc5000') + 
  theme(
    plot.title = element_text(hjust=0.5),
    legend.title=element_blank(), 
    legend.text=element_text(size=8), 
    panel.background=element_blank(), 
    axis.ticks.y = element_blank(),
    plot.caption = element_text(hjust=0.5), 
    ) +
  coord_flip()

Above chart is without outliers and looks much better.

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

# Calculate the revenue per employee for each industry
state_emp_rev <- inc %>% filter(complete.cases(.)) %>% group_by(Industry) %>% summarise(RevPerEmp=sum(Revenue, na.rm=TRUE)/sum(Employees, na.rm=TRUE)) %>% mutate(RevPerEmp_rounded = paste("$", round(RevPerEmp, -4)/1000, sep="")) %>% mutate(RevPerEmp_rounded=paste(RevPerEmp_rounded, "k", sep=""))

# Plot a bar chart to show the revenue per employee for each industry
ggplot(state_emp_rev, aes(x=reorder(Industry,RevPerEmp,fun=median), y=RevPerEmp)) + 
  geom_bar(stat='identity', fill='Orange') + labs(title='Revenue Per Employee for Each Industry', x='', y='', caption='Source: www.inc.com/inc5000') + 
  theme(
    plot.title = element_text(hjust=0.5),
    legend.title=element_blank(), 
    legend.text=element_text(size=8), 
    panel.background=element_blank(), 
    axis.ticks.y = element_blank(),
    plot.caption = element_text(hjust=0.5), 
    ) +
  guides(x = "none") +
  geom_text(aes(label=RevPerEmp_rounded), 
            color='black', size=3, 
            nudge_x=0, hjust = 1.1, 
            fontface = "bold") +
  coord_flip()

From the above chart, it’s obvious that the Computer Hardware industry generates the most and also far more revenue per employee than others. There could be outliers exist as well in the Revenue column data. We can use the same method used in the question 2 to investigate and remove outliers.

# Filter to keep only the complete rows
state_emp_rev_2 <- inc %>% filter(complete.cases(.))

# Plot a box chart to investigate outliers for each industry 
ggplot(state_emp_rev_2, aes(x=reorder(factor(Industry),Revenue,fun=median),y=Revenue,fill=factor(Industry))) + 
  geom_boxplot() + labs(title="Distribution of Revenue Per Employee for Each Industry", y='', x='', caption='Source: www.inc.com/inc5000') +  
  theme(
    legend.position = "none", 
    plot.title = element_text(hjust=0.5), 
    panel.background=element_blank(), 
    plot.caption = element_text(hjust=0.3)) + 
  theme(
  panel.grid.major.x = element_line(size = 0.15, linetype = 'dashed',
                                colour = "red"), 
  panel.grid.major.y = element_line(size = 0.15, linetype = 'dashed',
                                colour = "white")) + 
  coord_flip()

From the above box plot, we also see that there are some outliers on few industries. We can remove these outliers by using the same method, the Interquantile Range (IQR) method.

# Find the 25th and 75th percentile of revenue
quantile <- quantile(state_emp_rev_2$Revenue, probs=c(0.25, 0.75))

# Find the IQR of revenue
iqr <- IQR(state_emp_rev_2$Revenue)

# Eliminating outliers from the dataset
state_emp_rev_3<- subset(state_emp_rev_2, state_emp_rev_2$Revenue > (quantile[1]-1.5*iqr) & state_emp_rev_2$Revenue < (quantile[2]+1.5*iqr))

# Calculate the revenue per employee for each industry
state_emp_rev_3 <- state_emp_rev_3 %>% group_by(Industry) %>% summarise(RevPerEmp=sum(Revenue, na.rm=TRUE)/sum(Employees, na.rm=TRUE)) %>% mutate(RevPerEmp_rounded = paste("$", round(RevPerEmp, -4)/1000, sep="")) %>% mutate(RevPerEmp_rounded=paste(RevPerEmp_rounded, "k", sep=""))

# Plot a bar chart to show the revenue per employee for each industry
ggplot(state_emp_rev_3, aes(x=reorder(Industry,RevPerEmp,fun=median), y=RevPerEmp)) + 
  geom_bar(stat='identity', fill='Orange') + labs(title='Revenue Per Employee for Each Industry', x='', y='', caption='Source: www.inc.com/inc5000') + 
  theme(
    plot.title = element_text(hjust=0.5),
    legend.title=element_blank(), 
    legend.text=element_text(size=8), 
    panel.background=element_blank(), 
    axis.ticks.y = element_blank(),
    plot.caption = element_text(hjust=0.5), 
    ) +
  guides(x = "none") +
  geom_text(aes(label=RevPerEmp_rounded), 
            color='black', size=3, 
            nudge_x=0, hjust = 1.1, 
            fontface = "bold") +
  coord_flip()

After eliminating the outliers, the industry that generates the most revenue per employee remains the Computer Hardware industry followed by Energy and Manufacturing.

DATA 608 - Assignment 1

Sie Siong Wong

02/07/2021

Load the Packages

Problem Sets

Question 1

Quesiton 2

Question 3