Principles of Data Visualization and Introduction to ggplot2
I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. Let's read this in:
# library
library(tidyverse) # load in ggplot2, amongst others I commonly use.
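# stringsAsFactors = TRUE reproduces the Factor columns shown below (it was the default before R 4.0.0)
inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header = TRUE, stringsAsFactors = TRUE)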
And let's preview this data:
head(inc)
## Rank Name Growth_Rate Revenue
## 1 1 Fuhu 421.48 1.179e+08
## 2 2 FederalConference.com 248.31 4.960e+07
## 3 3 The HCI Group 245.45 2.550e+07
## 4 4 Bridger 233.08 1.900e+09
## 5 5 DataXu 213.37 8.700e+07
## 6 6 MileStone Community Builders 179.38 4.570e+07
## Industry Employees City State
## 1 Consumer Products & Services 104 El Segundo CA
## 2 Government Services 51 Dumfries VA
## 3 Health 132 Jacksonville FL
## 4 Energy 50 Addison TX
## 5 Advertising & Marketing 220 Boston MA
## 6 Real Estate 63 Austin TX
summary(inc)
## Rank Name Growth_Rate
## Min. : 1 (Add)ventures : 1 Min. : 0.340
## 1st Qu.:1252 @Properties : 1 1st Qu.: 0.770
## Median :2502 1-Stop Translation USA: 1 Median : 1.420
## Mean :2502 110 Consulting : 1 Mean : 4.612
## 3rd Qu.:3751 11thStreetCoffee.com : 1 3rd Qu.: 3.290
## Max. :5000 123 Exteriors : 1 Max. :421.480
## (Other) :4995
## Revenue Industry Employees
## Min. :2.000e+06 IT Services : 733 Min. : 1.0
## 1st Qu.:5.100e+06 Business Products & Services: 482 1st Qu.: 25.0
## Median :1.090e+07 Advertising & Marketing : 471 Median : 53.0
## Mean :4.822e+07 Health : 355 Mean : 232.7
## 3rd Qu.:2.860e+07 Software : 342 3rd Qu.: 132.0
## Max. :1.010e+10 Financial Services : 260 Max. :66803.0
## (Other) :2358 NA's :12
## City State
## New York : 160 CA : 701
## Chicago : 90 TX : 387
## Austin : 88 NY : 311
## Houston : 76 VA : 283
## San Francisco: 75 FL : 282
## Atlanta : 74 IL : 273
## (Other) :4438 (Other):2764
Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:
#1 Let's examine the dataset itself
dim(inc) # get its dimension
## [1] 5001 8
str(inc) # compactly display the features of the data
## 'data.frame': 5001 obs. of 8 variables:
## $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : Factor w/ 5001 levels "(Add)ventures",..: 1770 1633 4423 690 1198 2839 4733 1468 1869 4968 ...
## $ Growth_Rate: num 421 248 245 233 213 ...
## $ Revenue : num 1.18e+08 4.96e+07 2.55e+07 1.90e+09 8.70e+07 ...
## $ Industry : Factor w/ 25 levels "Advertising & Marketing",..: 5 12 13 7 1 20 10 1 5 21 ...
## $ Employees : int 104 51 132 50 220 63 27 75 97 15 ...
## $ City : Factor w/ 1519 levels "Acton","Addison",..: 391 365 635 2 139 66 912 1179 131 1418 ...
## $ State : Factor w/ 52 levels "AK","AL","AR",..: 5 47 10 45 20 45 44 5 46 41 ...
#2 Top 5 fastest growing companies
inc %>% arrange(Rank) %>% head(5) # Arrange Rank and print top 5 cases
## Rank Name Growth_Rate Revenue Industry
## 1 1 Fuhu 421.48 1.179e+08 Consumer Products & Services
## 2 2 FederalConference.com 248.31 4.960e+07 Government Services
## 3 3 The HCI Group 245.45 2.550e+07 Health
## 4 4 Bridger 233.08 1.900e+09 Energy
## 5 5 DataXu 213.37 8.700e+07 Advertising & Marketing
## Employees City State
## 1 104 El Segundo CA
## 2 51 Dumfries VA
## 3 132 Jacksonville FL
## 4 50 Addison TX
## 5 220 Boston MA
#3 Is there at least one fast growing company listed in each state?
levels(inc$State) # list the levels of State: each is a state abbreviation, plus the District of Columbia (DC) and Puerto Rico (PR)
## [1] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DC" "DE" "FL" "GA" "HI" "IA" "ID" "IL"
## [16] "IN" "KS" "KY" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NC" "ND" "NE"
## [31] "NH" "NJ" "NM" "NV" "NY" "OH" "OK" "OR" "PA" "PR" "RI" "SC" "SD" "TN" "TX"
## [46] "UT" "VA" "VT" "WA" "WI" "WV" "WY"
length(levels(inc$State)) # 52 levels = 50 states + DC + PR, so no state is missing
## [1] 52
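Counting 52 levels suggests that no state is missing, but it does not prove it. As a cross-check, the levels can be compared with R's built-in `state.abb` vector (the 50 state abbreviations from the datasets package). The sketch below hard-codes the 52 levels printed above as `observed` (an illustrative name) so that it runs without downloading the data; with the real data, `levels(inc$State)` would take its place.

```r
# The 52 levels printed above, copied by hand (assumption: they match the data)
observed <- c("AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DC", "DE", "FL", "GA",
              "HI", "IA", "ID", "IL", "IN", "KS", "KY", "LA", "MA", "MD", "ME",
              "MI", "MN", "MO", "MS", "MT", "NC", "ND", "NE", "NH", "NJ", "NM",
              "NV", "NY", "OH", "OK", "OR", "PA", "PR", "RI", "SC", "SD", "TN",
              "TX", "UT", "VA", "VT", "WA", "WI", "WV", "WY")

# Expected set: the 50 states from the built-in state.abb, plus DC and PR
expected <- c(state.abb, "DC", "PR")

missing_states <- setdiff(expected, observed)  # expected but absent from the data
unexpected     <- setdiff(observed, expected)  # present in the data but not expected

length(missing_states)
length(unexpected)
```

Both lengths being zero means the observed levels are exactly the 50 states plus DC and PR.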
#4 Top 5 States with the most listed companies among the fastest growing companies
inc %>% count(State, sort = TRUE) %>% head(5)
## # A tibble: 5 x 2
## State n
## <fct> <int>
## 1 CA 701
## 2 TX 387
## 3 NY 311
## 4 VA 283
## 5 FL 282
#5 Top 5 Cities with the most listed companies among the fastest growing companies
inc %>% count(City, sort = TRUE) %>% head(5)
## # A tibble: 5 x 2
## City n
## <fct> <int>
## 1 New York 160
## 2 Chicago 90
## 3 Austin 88
## 4 Houston 76
## 5 San Francisco 75
#6 Top 5 Industries by Mean Revenue
inc %>% group_by(Industry) %>% summarise(Avg_Revenue = mean(Revenue)) %>% arrange(desc(Avg_Revenue)) %>% head(5)
## # A tibble: 5 x 2
## Industry Avg_Revenue
## <fct> <dbl>
## 1 Computer Hardware 270129545.
## 2 Energy 126344954.
## 3 Food & Beverage 98559542.
## 4 Logistics & Transportation 95745161.
## 5 Consumer Products & Services 73676847.
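#7 Top 5 Industries by Mean Employees
# Note: Employees has 12 NAs, so any industry containing one gets a mean of NA and sorts last; mean(Employees, na.rm = TRUE) would include it.
inc %>% group_by(Industry) %>% summarise(Avg_Employees = mean(Employees)) %>% arrange(desc(Avg_Employees)) %>% head(5)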
## # A tibble: 5 x 2
## Industry Avg_Employees
## <fct> <dbl>
## 1 Human Resources 1158.
## 2 Security 562.
## 3 Travel & Hospitality 372.
## 4 Engineering 276.
## 5 Energy 243.
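Because `Employees` contains missing values, group means computed without `na.rm = TRUE` come back as `NA` for any group that holds a missing value. The sketch below illustrates this on a small made-up data frame (`toy` and its values are hypothetical, not taken from the Inc. data) using base R's `tapply()`.

```r
# Hypothetical toy data: industry A has one missing employee count
toy <- data.frame(
  Industry  = c("A", "A", "B", "B"),
  Employees = c(10, NA, 20, 40)
)

# Default mean(): a single NA makes the whole group mean NA
with_na <- tapply(toy$Employees, toy$Industry, mean)

# na.rm = TRUE drops missing values before averaging
without_na <- tapply(toy$Employees, toy$Industry, mean, na.rm = TRUE)

with_na
without_na
```

In a ranking, a group whose mean is `NA` is sorted to the end by `arrange(desc(...))`, so it can silently drop out of a top-5 list.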
#8 Mean Revenue by State
inc %>% group_by(State) %>% summarise(Avg_Revenue = mean(Revenue)) %>% arrange(desc(Avg_Revenue))
## # A tibble: 52 x 2
## State Avg_Revenue
## <fct> <dbl>
## 1 ID 231523529.
## 2 AK 171500000
## 3 IA 123142857.
## 4 IL 121773993.
## 5 HI 99485714.
## 6 WI 92362025.
## 7 DC 76344186.
## 8 OH 68745161.
## 9 NC 67580292.
## 10 MI 61950794.
## # ... with 42 more rows
#9 Correlation of rank with revenue and growth rate
cor.test(inc$Rank, inc$Revenue) # With r = 0.08 (p < 0.05), there is a very weak correlation.
##
## Pearson's product-moment correlation
##
## data: inc$Rank and inc$Revenue
## t = 5.8249, df = 4999, p-value = 6.071e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.05451435 0.10957397
## sample estimates:
## cor
## 0.08210681
cor.test(inc$Rank, inc$Growth_Rate) # With r = -0.40 (p < 0.05), there is a weak-to-moderate negative correlation.
##
## Pearson's product-moment correlation
##
## data: inc$Rank and inc$Growth_Rate
## t = -30.644, df = 4999, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4207488 -0.3740763
## sample estimates:
## cor
## -0.3976698
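Pearson's r measures linear association, and the growth rates are heavily skewed (a median of 1.42 against a maximum of 421.48), so a monotonic relationship can still produce a modest r. The sketch below uses synthetic data, not the Inc. data: `rank` and `growth` are made-up vectors in which growth falls steeply as rank increases. It compares Pearson's r with Spearman's rank correlation, which only considers ordering.

```r
# Synthetic data (assumption): growth decreases strictly, but not linearly, with rank
rank   <- 1:100
growth <- 400 / rank^1.2

pearson  <- cor(rank, growth, method = "pearson")   # sensitive to the curved shape
spearman <- cor(rank, growth, method = "spearman")  # depends only on the ordering

pearson
spearman
```

On data like this, Spearman's coefficient reaches -1 because the ordering is perfectly reversed, while Pearson's r stays well short of -1. Whether the same holds for the real data would need to be checked with `cor.test(inc$Rank, inc$Growth_Rate, method = "spearman")`.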
Create a graph that shows the distribution of companies in the dataset by State (i.e., how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (i.e., taller than wide), which should further guide your layout choices.
A bar graph was selected because it is a good way to represent frequencies and compare them across levels. The axes were flipped so that the 52 state labels run down the page, which makes the chart easier to read in a portrait layout. A single neutral fill color was used so that the chart does not depend on color distinctions and stays readable for colorblind viewers.
# Summarizing the data by State
compBYstate = inc %>%
  group_by(State) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
# Plotting a bargraph
p1 = ggplot(compBYstate, aes(x = reorder(State, count), y = count)) +
  geom_bar(stat = "identity", fill = "honeydew4") + coord_flip() +
  labs(title = "Distribution of the Fastest Growing Companies by State*",
       caption = "*District of Columbia (DC) and Puerto Rico (PR) included.",
       x = "State", y = "Company Count") +
  geom_text(aes(label = count, y = count + 20), vjust = 0.5, size = 3.5)
p1
The graph above shows the count of the fastest growing companies by state. California tops the list with 701 companies.
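The sorted bars come from `reorder(State, count)`, which reorders the factor levels by a second variable. A minimal base R sketch of that behavior, using three of the counts shown earlier (the object names `state`, `count`, and `ordered_state` are illustrative):

```r
# Three states and their company counts, taken from the output above
state <- factor(c("NY", "CA", "TX"))
count <- c(311, 701, 387)

# Default factor levels are alphabetical
levels(state)

# reorder() sorts the levels by count, smallest first
ordered_state <- reorder(state, count)
levels(ordered_state)
```

ggplot2 draws discrete levels in level order, so after `coord_flip()` the last level (the largest count) ends up at the top of the chart.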
Let's dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.
Once the state with the 3rd most companies was identified, only complete cases were kept. Next, outliers were detected and removed. One common method is to flag any data points that lie beyond the ends of a boxplot's whiskers. A boxplot suits this task because it summarizes each industry's median, dispersion, and skewness at a glance, which covers both the typical employment level and the variability of the ranges. In addition, ggplot2's stat_summary() computes a simple statistic and draws it on the graph. Here, the dot marks the mean of the log-transformed employee counts, with outliers removed. Lastly, because the boxplots are difficult to compare on the original scale, the data were transformed to a logarithmic scale.
# Get data on the third state with the most companies
sprintf("The state with the 3rd most companies is %s.", compBYstate$State[3])
## [1] "The state with the 3rd most companies is NY."
third.state = inc[complete.cases(inc),] %>% filter(State == compBYstate$State[3])
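# Detecting and removing outliers (points beyond the boxplot whiskers within each industry)
outliers = boxplot(third.state$Employees ~ third.state$Industry, plot=FALSE)$out
# Note: matching on value drops every row whose Employees equals a flagged value, in any industry
third.state.out = third.state[!(third.state$Employees %in% outliers),] # logical indexing is safe even when nothing is flagged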
# Plotting a boxplot
p2 = ggplot(third.state.out, aes(x = reorder(Industry, Employees), y = log(Employees))) +
  geom_boxplot(outlier.shape = NA, show.legend = FALSE) + coord_flip() +
  # fun replaces fun.y, which was deprecated in ggplot2 3.3.0; the dot is the mean of log(Employees)
  stat_summary(fun = mean, col = "honeydew4", geom = 'point') +
  labs(x = "Industry", y = "Number of Employees (natural log scale)",
       title = "Average Employee Size by Industry", subtitle = "dot = mean of log(Employees)")
p2
Even with outliers removed from the data, the skewness within most industries can be clearly seen, and there are likely to be differences among them.
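One caveat about the outlier step: `boxplot(...)$out` pools the flagged values from every industry, and filtering with `%in%` then removes any row that shares one of those values, even in an industry where that value is ordinary. The sketch below contrasts this with a per-industry filter on made-up data (`toy`, `is_outlier`, and all values are hypothetical). It applies the 1.5 x IQR rule with `quantile()`, which can differ slightly from the hinges that `boxplot()` uses.

```r
# Hypothetical toy data: 500 is extreme in industry A but typical in industry B
toy <- data.frame(
  Industry  = rep(c("A", "B"), each = 5),
  Employees = c(10, 12, 11, 13, 500, 480, 500, 520, 510, 490)
)

# Pooled approach, as in the code above: match on value across all industries
pooled_out  <- boxplot(Employees ~ Industry, data = toy, plot = FALSE)$out
pooled_kept <- toy[!(toy$Employees %in% pooled_out), ]

# Per-industry approach: flag values beyond 1.5 * IQR within each group
is_outlier <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}
flag <- ave(toy$Employees, toy$Industry, FUN = is_outlier)  # returns 0/1 per row
group_kept <- toy[!as.logical(flag), ]

nrow(pooled_kept)  # B's legitimate 500 is removed along with A's outlier
nrow(group_kept)   # only A's 500 is removed
```

The per-industry version keeps values that are only extreme relative to a different industry, which matters when industries differ widely in typical company size.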
Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.
Returning to the full dataset (complete cases only), the data were summarized to see which industries generate the most revenue per employee. Below are two ways of presenting the result. Both bar graphs show revenue per employee by industry, computed as each industry's total revenue divided by its total number of employees. The first shades each bar by the industry's total number of employees, and the second shades each bar by the number of companies in the industry.
# Revenue per employee by industry
revenue = inc[complete.cases(inc),] %>%
  group_by(Industry) %>%
  summarise(count = n(), Revenue = sum(Revenue), Employees = sum(Employees)) %>%
  mutate(rev.per.employee = Revenue / Employees )
# Plotting a bargraph, number of employees per industry shown
p3 = ggplot(revenue, aes(x = reorder(Industry, rev.per.employee), y = rev.per.employee)) +
  geom_bar(stat = "identity", aes(fill = Employees)) + coord_flip() +
  scale_y_continuous(labels = scales::dollar_format(scale = .001, suffix = "K")) +
  labs(title = "Revenue per Employee by Industry", x = "Industry",
       y = "Revenue per Employee, in US$", fill = "# of Employees") +
  geom_text(data = filter(revenue, rev.per.employee > 10^6),
            aes(x = Industry, y = rev.per.employee, label=scales::dollar_format()(rev.per.employee)),
            hjust = 1.1, vjust = 0.4, color = "white", size = 3.5) +
  geom_text(data = filter(revenue, rev.per.employee < 10^6),
            aes(x = Industry, y = rev.per.employee, label=scales::dollar_format()(rev.per.employee)),
            hjust = -0.1, vjust = 0.4, color = "black", size = 3.5)
p3
Computer hardware clearly earns the most revenue per employee of any industry, at $1,223,564. The graph above also shows that this industry has fewer than 50,000 employees among the fastest growing companies in the United States. Human resources, by contrast, earns $40,735 per employee and has more than 200,000 employees among the fastest growing companies in the United States.
# Plotting a bargraph, number of organization per industry shown
p4 = ggplot(revenue, aes(x = reorder(Industry, rev.per.employee), y = rev.per.employee)) +
  geom_bar(stat = "identity", aes(fill = count)) + coord_flip() +
  scale_y_continuous(labels = scales::dollar_format(scale = .001, suffix = "K")) +
  labs(title = "Revenue per Employee by Industry", x = "Industry",
       y = "Revenue per Employee, in US$", fill = "# of Companies") +
  geom_text(data = filter(revenue, rev.per.employee > 10^6),
            aes(x = Industry, y = rev.per.employee, label=scales::dollar_format()(rev.per.employee)),
            hjust = 1.1, vjust = 0.4, color = "white", size = 3.5) +
  geom_text(data = filter(revenue, rev.per.employee < 10^6),
            aes(x = Industry, y = rev.per.employee, label=scales::dollar_format()(rev.per.employee)),
            hjust = -0.1, vjust = 0.4, color = "black", size = 3.5)
p4
This graph likewise shows that computer hardware earns the most revenue per employee of any industry. It also shows that this industry has fewer than 200 companies among the fastest growing companies in the United States. Human resources, which earns $40,735 per employee, has between 200 and 400 companies. The graph further shows that the IT services industry has the most companies among the fastest growing companies in the United States and earns $199,683 per employee.
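The figures above are ratios of industry totals (total revenue divided by total employees), so a few very large employers can dominate an industry's value. The sketch below, on made-up numbers (`toy` and its values are hypothetical), shows how that pooled ratio can differ from the median of per-company ratios, which is one way to describe the distribution within an industry.

```r
# Hypothetical toy data: industry X contains one very large, low-ratio company
toy <- data.frame(
  Industry  = c("X", "X", "Y", "Y"),
  Revenue   = c(1e6, 1e7, 2e6, 4e6),
  Employees = c(10, 1000, 20, 40)
)

# Pooled ratio, as in the summary above: sum(Revenue) / sum(Employees) per industry
pooled <- tapply(toy$Revenue, toy$Industry, sum) /
  tapply(toy$Employees, toy$Industry, sum)

# Per-company ratio, then the median within each industry
toy$rev_per_emp <- toy$Revenue / toy$Employees
typical <- tapply(toy$rev_per_emp, toy$Industry, median)

pooled
typical
```

In industry Y every company has the same ratio, so the two measures agree. In industry X the large employer pulls the pooled ratio far below the median company's ratio. Plotting the per-company ratios, for example with a boxplot per industry, would show that spread directly.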