Data 608 HW 1
Principles of Data Visualization and Introduction to ggplot2
I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:
And lets preview this data:
| Rank | Name | Growth_Rate | Revenue | Industry | Employees | City | State |
|---|---|---|---|---|---|---|---|
| 1 | Fuhu | 421.48 | 1.179e+08 | Consumer Products & Services | 104 | El Segundo | CA |
| 2 | FederalConference.com | 248.31 | 4.960e+07 | Government Services | 51 | Dumfries | VA |
| 3 | The HCI Group | 245.45 | 2.550e+07 | Health | 132 | Jacksonville | FL |
| 4 | Bridger | 233.08 | 1.900e+09 | Energy | 50 | Addison | TX |
| 5 | DataXu | 213.37 | 8.700e+07 | Advertising & Marketing | 220 | Boston | MA |
| 6 | MileStone Community Builders | 179.38 | 4.570e+07 | Real Estate | 63 | Austin | TX |
## Rank Name Growth_Rate
## Min. : 1 (Add)ventures : 1 Min. : 0.340
## 1st Qu.:1252 @Properties : 1 1st Qu.: 0.770
## Median :2502 1-Stop Translation USA: 1 Median : 1.420
## Mean :2502 110 Consulting : 1 Mean : 4.612
## 3rd Qu.:3751 11thStreetCoffee.com : 1 3rd Qu.: 3.290
## Max. :5000 123 Exteriors : 1 Max. :421.480
## (Other) :4995
## Revenue Industry Employees
## Min. :2.000e+06 IT Services : 733 Min. : 1.0
## 1st Qu.:5.100e+06 Business Products & Services: 482 1st Qu.: 25.0
## Median :1.090e+07 Advertising & Marketing : 471 Median : 53.0
## Mean :4.822e+07 Health : 355 Mean : 232.7
## 3rd Qu.:2.860e+07 Software : 342 3rd Qu.: 132.0
## Max. :1.010e+10 Financial Services : 260 Max. :66803.0
## (Other) :2358 NA's :12
## City State
## New York : 160 CA : 701
## Chicago : 90 TX : 387
## Austin : 88 NY : 311
## Houston : 76 VA : 283
## San Francisco: 75 FL : 282
## Atlanta : 74 IL : 273
## (Other) :4438 (Other):2764
Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:
#add the tidyverse library
library(tidyverse)
# Count the number of obervations for each variable
inc %>%
count()## # A tibble: 1 x 1
## n
## <int>
## 1 5001
## 'data.frame': 5001 obs. of 8 variables:
## $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : Factor w/ 5001 levels "(Add)ventures",..: 1770 1633 4423 690 1198 2839 4733 1468 1869 4968 ...
## $ Growth_Rate: num 421 248 245 233 213 ...
## $ Revenue : num 1.18e+08 4.96e+07 2.55e+07 1.90e+09 8.70e+07 ...
## $ Industry : Factor w/ 25 levels "Advertising & Marketing",..: 5 12 13 7 1 20 10 1 5 21 ...
## $ Employees : int 104 51 132 50 220 63 27 75 97 15 ...
## $ City : Factor w/ 1519 levels "Acton","Addison",..: 391 365 635 2 139 66 912 1179 131 1418 ...
## $ State : Factor w/ 52 levels "AK","AL","AR",..: 5 47 10 45 20 45 44 5 46 41 ...
| x |
|---|
| 12 |
| x | |
|---|---|
| Rank | 0 |
| Name | 0 |
| Growth_Rate | 0 |
| Revenue | 0 |
| Industry | 0 |
| Employees | 12 |
| City | 0 |
| State | 0 |
#count the number of unique types of responses in categorical variables
kable(sapply(inc, function(x) length(unique(x))))| x | |
|---|---|
| Rank | 4999 |
| Name | 5001 |
| Growth_Rate | 1147 |
| Revenue | 1069 |
| Industry | 25 |
| Employees | 692 |
| City | 1519 |
| State | 52 |
#Peaking at rows with the same ranking
#values of Rank that are duplicated as a list
dups <- unlist(as.list(inc$Rank[duplicated(inc$Rank)]))
#filters the rows with duplicated rankings
fildups <- inc %>%
filter(Rank %in% dups)
#display
kable(head(fildups))| Rank | Name | Growth_Rate | Revenue | Industry | Employees | City | State |
|---|---|---|---|---|---|---|---|
| 3424 | Stemp Systems Group | 19.37 | 6800000 | IT Services | 39 | Long Island City | NY |
| 3424 | Total Beverage Solution | 0.90 | 41500000 | Food & Beverage | 35 | Mt. Pleasant | SC |
| 5000 | INE | 0.34 | 6800000 | IT Services | 35 | Bellevue | WA |
| 5000 | ALL4 | 0.34 | 4700000 | Environmental Services | 34 | Kimberton | PA |
| Rank | Name | Growth_Rate | Revenue | Industry | Employees | City | State |
|---|---|---|---|---|---|---|---|
| 1 | Fuhu | 421.48 | 1.179e+08 | Consumer Products & Services | 104 | El Segundo | CA |
| 2 | FederalConference.com | 248.31 | 4.960e+07 | Government Services | 51 | Dumfries | VA |
| 3 | The HCI Group | 245.45 | 2.550e+07 | Health | 132 | Jacksonville | FL |
| 4 | Bridger | 233.08 | 1.900e+09 | Energy | 50 | Addison | TX |
| 5 | DataXu | 213.37 | 8.700e+07 | Advertising & Marketing | 220 | Boston | MA |
| 6 | MileStone Community Builders | 179.38 | 4.570e+07 | Real Estate | 63 | Austin | TX |
#take a better look at correlated variables
corInc <- inc %>%
select_if(is.numeric) %>%
cor()
kable(corInc)| Rank | Growth_Rate | Revenue | Employees | |
|---|---|---|---|---|
| Rank | 1.0000000 | -0.3976698 | 0.0821068 | NA |
| Growth_Rate | -0.3976698 | 1.0000000 | 0.0063041 | NA |
| Revenue | 0.0821068 | 0.0063041 | 1.0000000 | NA |
| Employees | NA | NA | NA | 1 |
library(psych)
#Some quick statistical information not included in summary() function
kable(describe(inc))| vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Rank | 1 | 5001 | 2.501641e+03 | 1.443506e+03 | 2.502e+03 | 2.501731e+03 | 1.853250e+03 | 1.0e+00 | 5.0000e+03 | 4.9990e+03 | -0.0004896 | -1.200432 | 2.041222e+01 |
| Name* | 2 | 5001 | 2.501000e+03 | 1.443809e+03 | 2.501e+03 | 2.501000e+03 | 1.853250e+03 | 1.0e+00 | 5.0010e+03 | 5.0000e+03 | 0.0000000 | -1.200720 | 2.041650e+01 |
| Growth_Rate | 3 | 5001 | 4.611826e+00 | 1.412369e+01 | 1.420e+00 | 2.136756e+00 | 1.215732e+00 | 3.4e-01 | 4.2148e+02 | 4.2114e+02 | 12.5495059 | 242.336616 | 1.997192e-01 |
| Revenue | 4 | 5001 | 4.822254e+07 | 2.405423e+08 | 1.090e+07 | 1.733497e+07 | 1.067472e+07 | 2.0e+06 | 1.0100e+10 | 1.0098e+10 | 22.1744453 | 722.656318 | 3.401441e+06 |
| Industry* | 5 | 5001 | 1.210038e+01 | 7.328351e+00 | 1.300e+01 | 1.205499e+01 | 8.895600e+00 | 1.0e+00 | 2.5000e+01 | 2.4000e+01 | -0.1012333 | -1.184515 | 1.036282e-01 |
| Employees | 6 | 4989 | 2.327180e+02 | 1.353128e+03 | 5.300e+01 | 8.177511e+01 | 5.337360e+01 | 1.0e+00 | 6.6803e+04 | 6.6802e+04 | 29.8104167 | 1268.671130 | 1.915720e+01 |
| City* | 7 | 5001 | 7.320010e+02 | 4.411164e+02 | 7.610e+02 | 7.317383e+02 | 6.049008e+02 | 1.0e+00 | 1.5190e+03 | 1.5180e+03 | -0.0420954 | -1.264827 | 6.237704e+00 |
| State* | 8 | 5001 | 2.480324e+01 | 1.563706e+01 | 2.300e+01 | 2.444364e+01 | 1.927380e+01 | 1.0e+00 | 5.2000e+01 | 5.1000e+01 | 0.1190507 | -1.461030 | 2.211193e-01 |
Notes:
We can see from the additional EDA above that out of the 5001 observations in this data set with 8 variables, half of which are numerical and the other are categorical variables. We can also show that 12 RevperEmp data points are of NA value, and each column contains a varying number of unique values. It is interesting to note that there are 52 unique data points in the State column, when we know that there are 50 states, which leads me to believe that terratories other than states are included in this dataset. We also see that there are 2 duplicated rankings - which may be interesting to explore the reason why that is in more detail because the data seemes really different for the companies that have the same ranking in this dataset. Upon deeper insection of the variable Growth_Rate we can see that it is clearly correlated to the Rank variable. Upon further inspection of the correlation between numeric variables we can also see that Growth_Rate and Rank are highly correlated (negative correlation), and Revenue and Rank also seem to be significantly correlated (positive correlation). Finally by using the describe() function from the psych library, we can uncover some more helpful information about the variables in this dataset (such as kurtosis, skew)
Question 1
Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.
library(ggplot2)
q1 <- inc%>%
group_by(State)%>%
count(State)%>% #counts the number of company in each state
arrange(desc(n))%>% #arranges them in decending order
as_tibble(q1) #changes the format of the data for ggplot()
#graph with count of companies in each state
ggplot(q1, aes(x=reorder(State,n), y=n, fill = State))+
geom_bar(stat="identity", width=.6, show.legend = FALSE)+
geom_hline(yintercept=seq(1,800,100), col="white", lwd=1)+
theme(panel.grid.major.x = element_blank())+
coord_flip()+
geom_text(aes(x = State, label = State, y = n +20), colour="blue", size = 2)+
labs(y = "Count", x ="State", color="legend", title = "Fastest Growing Companies by State") +
theme(axis.text.y=element_blank()
,axis.title.y=element_blank(),
axis.ticks.y=element_blank(),
plot.title = element_text(hjust = 0.5))Approach
We are essentially creating a frequency chart of the number of companies in each state therefore it is appropriate to group the data by state and count the number of companies in each state. Then I use a bar graph to display the counts for each state and arrange them in deecending order so that the state with the greatest number of growing companies is at the top. Since the visualization was meant to be viewed in a portate format I decided to flip the axis of this graph so that more space can be placed between the state columns and therefore make the width less than the height of this vizualization. I then changed the color of each bar to make it easier to compare adjacent bars and included the abreviation for each state next to the bar (rather than on the axis).
Quesiton 2
Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.
#initial group
NY <- inc %>%
filter(State== "NY") %>%
filter(complete.cases(.)) %>%
group_by(Industry)
#Initial boxplot with outliers
ggplot(NY, aes(Industry, Employees, fill = Industry)) +
geom_boxplot(outlier.colour="blue",
outlier.shape=4,
outlier.size=2,
show.legend=FALSE) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
coord_flip()+
labs(title="Employees by Industry in NY - GRAPH A", y="Number of Employees")#Boxplot without outliers (ignoring all points greeater than 1500)
ggplot(NY, aes(Industry, Employees, fill = Industry)) +
geom_boxplot(outlier.colour="blue",
outlier.shape=4,
outlier.size=2,
show.legend=FALSE) +
ylim(0,1500) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
coord_flip()+
labs(title="Employees by Industry in NY - GRAPH B", y="Number of Employees")#Specifically compare mean and median
NY2 <- NY%>%
summarise(IndMean = mean(Employees),
IndMedian = median(Employees))%>%
gather(statistic, Employees, IndMean, IndMedian)
#Plot
ggplot(NY2, aes(x=Industry, y=Employees))+
geom_bar(stat = "identity", aes(fill=statistic))+
geom_hline(yintercept=seq(1,1700,100), col="white", lwd=1)+
theme_classic()+
coord_flip()+
scale_fill_discrete(name = "Metric", labels = c("Mean", "Median")) +
labs(title="Mean & Median Employees in NY - Graph C", y="Number of Employees")#Biggest Skew
NYskew <- NY2 %>%
spread(statistic, Employees)
NYskew <- NYskew %>%
mutate(diff = IndMean - IndMedian)
ggplot(NYskew, aes(x=fct_reorder(Industry, diff), y=diff, fill = Industry))+
geom_bar(stat="identity", show.legend = FALSE)+
coord_flip() +
geom_text(aes(x = Industry, label = round(diff, 2), y = diff +70), colour="blue", size = 3)+
labs(title="(Mean - Median) Employees by Industry in NY - Graph D", y="Difference") +
theme(axis.title.y=element_blank(),
axis.ticks.y=element_blank(),
plot.title = element_text(hjust = 0.5))#Mean Graph
NYmean <- NY2 %>%
filter(statistic == "IndMean")
ggplot(NYmean, aes(x=fct_reorder(Industry, Employees), y=Employees, fill = Industry))+
geom_bar(stat="identity", show.legend = FALSE)+
coord_flip() +
geom_text(aes(x = Industry, label = round(Employees, 2), y = Employees +70), colour="blue", size = 3)+
labs(title="Mean Employees by Industry in NY - Graph E", y="Mean Number of Employees") +
theme(axis.title.y=element_blank(),
axis.ticks.y=element_blank(),
plot.title = element_text(hjust = 0.5))#Median Graph
NYmedian <- NY2 %>%
filter(statistic == "IndMedian")
ggplot(NYmedian, aes(x=fct_reorder(Industry, Employees), y=Employees, fill = Industry))+
geom_bar(stat="identity", show.legend = FALSE)+
coord_flip() +
geom_text(aes(x = Industry, label = round(Employees, 2), y = Employees +10), colour="blue", size = 3)+
labs(title="Median Employees by Industry in NY - Graph F", y="Median Number of Employees") +
theme(axis.title.y=element_blank(),
axis.ticks.y=element_blank(),
plot.title = element_text(hjust = 0.5))Approach:
According to the graph in question 1, the state with the 3rd most companies in the dataset is NY. To prep the data I filtered out NY then used complete.cases() to deal with cases with incomplete data. The graphs I chose were boxplots and bar graphs which seemed most appropriate because we are comparing measures of central tendency for various categories. Initially when we plot the data we see that in Graph A there are two outliers that are not allowing us to really see the boxplots for the other categories; this graph gives insight into the range and distribution of each Industry. In Graph B we can see that if we change the scale and set a limit for the y scale at about 1500 (which essentially means to ignore all outliers outside that range), we can clearly see the distribution of Employees in each Industry is highly skewed. We confirm this in Graph C and D where we compare mean and median number of employees for each Industry; Business Products and Services has the biggest skew in number of employees whereas Real Estate, Insurance, Government Services, Enviernmental Services, Computer Hardware anmd Engineering are normally distributed (mean = median).
Question 3
Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.
#group by Industry and calculate the reveue per employee
revEmp <- inc %>%
mutate(RevperEmp = Revenue/Employees) %>%
filter(complete.cases(.)) %>%
group_by(Industry)
#Boxplot to show distribution
ggplot(revEmp, aes(Industry, RevperEmp, fill = Industry)) +
geom_boxplot(outlier.colour="blue",
outlier.shape=4,
outlier.size=2,
show.legend=FALSE) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
coord_flip()+
labs(title="Fastest Growing Companies in USA", y="Revenue per Employee")#Boxplot without outliers (ignoring all points greeater than 1500)
ggplot(revEmp, aes(Industry, RevperEmp, fill = Industry)) +
geom_boxplot(outlier.colour="blue",
outlier.shape=4,
outlier.size=2,
show.legend=FALSE) +
ylim(0, 10000000) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
coord_flip()+
labs(title="Fastest Growing Companies in USA", y="Revenue per Employee")#Specifically compare mean and median
revEmp2 <- revEmp%>%
summarise(IndMean = mean(RevperEmp),
IndMedian = median(RevperEmp))%>%
gather(statistic, RevperEmp, IndMean, IndMedian)
#Plot
ggplot(revEmp2, aes(x=Industry, y=RevperEmp))+
geom_bar(stat = "identity", aes(fill=statistic))+
geom_hline(yintercept=seq(1,1700,100), col="white", lwd=1)+
theme_classic()+
coord_flip()+
scale_fill_discrete(name = "Metric", labels = c("Mean", "Median")) +
labs(title="Fastest Growing Companies in USA", y="Revenue per Employee")## Warning: `...` is not empty.
##
## We detected these problematic arguments:
## * `needs_dots`
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 50 x 3
## Industry statistic RevperEmp
## <fct> <chr> <dbl>
## 1 Advertising & Marketing IndMean 306036.
## 2 Business Products & Services IndMean 359097.
## 3 Computer Hardware IndMean 817702.
## 4 Construction IndMean 465682.
## 5 Consumer Products & Services IndMean 466068.
## 6 Education IndMean 296454.
## 7 Energy IndMean 1554656.
## 8 Engineering IndMean 201120.
## 9 Environmental Services IndMean 283607.
## 10 Financial Services IndMean 394231.
## # ... with 40 more rows
#Biggest Skew
revEmpSkew <- revEmp2 %>%
spread(statistic, RevperEmp)
revEmpSkew <- revEmpSkew %>%
mutate(diff = IndMean - IndMedian)
ggplot(revEmpSkew , aes(x=fct_reorder(Industry, diff), y=diff, fill = Industry))+
geom_bar(stat="identity", show.legend = FALSE)+
coord_flip() +
geom_text(aes(x = Industry, label = round(diff, 2), y = diff +80000), colour="blue", size = 3)+
labs(title="Fastest Growing Companies in USA (mean - median)", y="Difference") +
theme(axis.title.y=element_blank(),
axis.ticks.y=element_blank(),
plot.title = element_text(hjust = 0.5))#Median Graph
revEmpMedian<- revEmp2 %>%
filter(statistic == "IndMedian")
ggplot(revEmpMedian, aes(x=fct_reorder(Industry, RevperEmp), y=RevperEmp, fill = Industry))+
geom_bar(stat="identity", show.legend = FALSE)+
coord_flip() +
geom_text(aes(x = Industry, label = round(RevperEmp, 2), y = RevperEmp +40000), colour="blue", size = 3)+
labs(title="Revenue per Employee by Industry ", y="Median") +
theme(axis.title.y=element_blank(),
axis.ticks.y=element_blank(),
plot.title = element_text(hjust = 0.5))#Mean Graph
revEmpMean <- revEmp2 %>%
filter(statistic == "IndMean")
ggplot(revEmpMean , aes(x=fct_reorder(Industry, RevperEmp), y=RevperEmp, fill = Industry))+
geom_bar(stat="identity", show.legend = FALSE)+
coord_flip() +
geom_text(aes(x = Industry, label = round(RevperEmp, 2), y = RevperEmp +100000), colour="blue", size = 3)+
labs(title="Revenue per Employee by Industry ", y="Mean") +
theme(axis.title.y=element_blank(),
axis.ticks.y=element_blank(),
plot.title = element_text(hjust = 0.5))Approach:
The same approach taken in Question 2 is the same approach taken here except, the data from all states were used. First the ratio of revenue per employee was calculated and added as a new column in the dataframe. Then two boxplots were chosen to display the distribution and range of each industry’s revenue per employee, one with outliers the original scale and one with a scale that had a limit so that the visualization was more informative. A deeper dive into comparing means and medians was represented in the bar charts that follow. We can see that the distribution of revenue per employee is highly skewed right for all industries (mean > median). This means that there are many companies that have a higher average revenue per employee than median revenue per employee. We can see that the Computer Hardware industry has the most revenue per employee whereas Human Resources has the least. This makes sense to me - most people spend lots of money on technology such as smart phones, smart tvs, computers etc; Human Resources on the other hand is not an industry particularly looking to make money. It is interesting to note that Energy has the highest median revenue per employee.