Principles of Data Visualization and Introduction to ggplot2
I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:
inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)
And lets preview this data:
head(inc)
## Rank Name Growth_Rate Revenue
## 1 1 Fuhu 421.48 1.179e+08
## 2 2 FederalConference.com 248.31 4.960e+07
## 3 3 The HCI Group 245.45 2.550e+07
## 4 4 Bridger 233.08 1.900e+09
## 5 5 DataXu 213.37 8.700e+07
## 6 6 MileStone Community Builders 179.38 4.570e+07
## Industry Employees City State
## 1 Consumer Products & Services 104 El Segundo CA
## 2 Government Services 51 Dumfries VA
## 3 Health 132 Jacksonville FL
## 4 Energy 50 Addison TX
## 5 Advertising & Marketing 220 Boston MA
## 6 Real Estate 63 Austin TX
summary(inc)
## Rank Name Growth_Rate
## Min. : 1 (Add)ventures : 1 Min. : 0.340
## 1st Qu.:1252 @Properties : 1 1st Qu.: 0.770
## Median :2502 1-Stop Translation USA: 1 Median : 1.420
## Mean :2502 110 Consulting : 1 Mean : 4.612
## 3rd Qu.:3751 11thStreetCoffee.com : 1 3rd Qu.: 3.290
## Max. :5000 123 Exteriors : 1 Max. :421.480
## (Other) :4995
## Revenue Industry Employees
## Min. :2.000e+06 IT Services : 733 Min. : 1.0
## 1st Qu.:5.100e+06 Business Products & Services: 482 1st Qu.: 25.0
## Median :1.090e+07 Advertising & Marketing : 471 Median : 53.0
## Mean :4.822e+07 Health : 355 Mean : 232.7
## 3rd Qu.:2.860e+07 Software : 342 3rd Qu.: 132.0
## Max. :1.010e+10 Financial Services : 260 Max. :66803.0
## (Other) :2358 NA's :12
## City State
## New York : 160 CA : 701
## Chicago : 90 TX : 387
## Austin : 88 NY : 311
## Houston : 76 VA : 283
## San Francisco: 75 FL : 282
## Atlanta : 74 IL : 273
## (Other) :4438 (Other):2764
Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:
The rank appears to be 1 through 5000. So let’s see if there are exactly 5000 rows.
nrow(inc)
## [1] 5001
inc_comp <- inc[complete.cases(inc), ]
nrow(inc_comp)
## [1] 4989
The dataframe contains 5001 rows. Interesting given the highest numeric Rank is 5000. Also, the number of rows with at least one missing element is 12 (5001-4989).
Turns out the dataframe contains 5001 rows, so let’s investigate the number of ranks included and the number of companies identified, and the number of states included.
# Ouput number of distinct values for each column
sapply(inc, function(x) length(unique(x)))
## Rank Name Growth_Rate Revenue Industry Employees
## 4999 5001 1147 1069 25 692
## City State
## 1519 52
The number of ranks being 4999 indicates duplicate rank numbers exist. We should be able to identify two ranks duplicated or one rank used three times.
The name column returns 5001, a comforting result which indicates each name is unique, and thus should represent a different company.
The state count results in 52. Given 50 states in the United States, let’s see if Washington, D.C. is included or perhaps a couple of US territories. If not true, then perhaps a company exists that is not in the United States.
# Identify duplicated values for Rank
inc$dupe <- duplicated(inc$Rank)
subset(inc, dupe == TRUE)
## Rank Name Growth_Rate Revenue
## 3424 3424 Total Beverage Solution 0.90 41500000
## 5001 5000 ALL4 0.34 4700000
## Industry Employees City State dupe
## 3424 Food & Beverage 35 Mt. Pleasant SC TRUE
## 5001 Environmental Services 34 Kimberton PA TRUE
subset(inc, Rank == 3424 | Rank == 5000)
## Rank Name Growth_Rate Revenue
## 3423 3424 Stemp Systems Group 19.37 6800000
## 3424 3424 Total Beverage Solution 0.90 41500000
## 5000 5000 INE 0.34 6800000
## 5001 5000 ALL4 0.34 4700000
## Industry Employees City State dupe
## 3423 IT Services 39 Long Island City NY FALSE
## 3424 Food & Beverage 35 Mt. Pleasant SC TRUE
## 5000 IT Services 35 Bellevue WA FALSE
## 5001 Environmental Services 34 Kimberton PA TRUE
Based on above output, the ranks of 3424 and 5000 are each used twice. Further investigation of Stemp Systems Group finds the Growth Rate value indicates the company should be ranked higher. After visual examination of the dataset, I find rank 218 is missing and appears Stemp Systems Group should be there.
As for rank 5000, based on the pattern of duplicate growth rates, yet unique ranks, the final rank should be 5001.
unique(inc$State)
## [1] CA VA FL TX MA TN UT RI SC DC NJ OH WA ME NY CO GA IL AZ NC MD MN OK
## [24] PA CT IN MS WI WY MI MO KS OR NE AL HI NV IA KY ID AK LA DE AR NH VT
## [47] NM SD ND PR MT WV
## 52 Levels: AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA ... WY
Not too surprisingly, the two additional entries in the State column are DC (Washington, D.C.) and PR (Puerto Rico). This discovery is certainly reasonable given the territorial make-up of the United States.
Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.
library(ggplot2)
# Bar graph of counts
# This uses the mtcars data frame, with the "cyl" column for
# x position. The y position is calculated by counting the number of rows for
# each value of cyl.
ggplot(inc, aes(x = State)) +
geom_bar() +
coord_flip() +
scale_x_discrete(limits = rev(levels(as.factor(inc$State)))) +
ggtitle("Companies per State") +
xlab("State") +
ylab("Companies") +
geom_text(stat='count', aes(label=..count..), hjust=-.3) +
theme(axis.text.x=element_blank()) # Following the gif from the week 2 discussion
I chose a bar graph as the chart style allows for clear comparison of the different bars. Each state is an individual count so a line graph would not have made any sense. I also chose not to use a histogram or distribution chart as that would force the states to be out of alphabetical order. Also, the histogram or distribution plot often implies a relationship for the given distribution. Seeing as the states are vastly different in size and population, not each state is equal in that regard.
As the number of states is 52 in the chart, I decided to list the states along the Y axis. This allows the chart to be presented in portrait orientation. I also had to reorder the list to ensure the state abbreviations are alphabetical starting at the top. I also believe the choice of Y axis for states is good because the abbreviations aren’t overlapping and aren’t tilted at 30 degrees which often happens when many categories are identified along an X axis.
I chose alphabetical ordering over numeric ordering because I believe the alphabetizing of the state abbreviations allows for easier comprehension by the viewer.
Following some principles of Edward Tufte, and also with the default ggplot2 theme, the bars are one solid color, no reason to present each state with a different color. I removed the Y axis tick labels in favor of displaying the actual count for each bar on the graph itself.
Let’s dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases()
function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.
Based on Question 1 output, New York is the state with the 3rd most companies.
# Subset the intial dataframe to just the rows with State equal to 'NY'
ny <- subset(inc, State == 'NY')
# Output data for visual confirmation
head(ny)
## Rank Name Growth_Rate Revenue
## 26 26 BeenVerified 84.43 13700000
## 30 30 Sailthru 73.22 8100000
## 37 37 YellowHammer 67.40 18000000
## 38 38 Conductor 67.02 7100000
## 48 48 Cinium Financial Services 53.65 5900000
## 70 70 33Across 44.99 27900000
## Industry Employees City State dupe
## 26 Consumer Products & Services 17 New York NY FALSE
## 30 Advertising & Marketing 79 New York NY FALSE
## 37 Advertising & Marketing 27 New York NY FALSE
## 38 Advertising & Marketing 89 New York NY FALSE
## 48 Financial Services 32 Rock Hill NY FALSE
## 70 Advertising & Marketing 75 New York NY FALSE
nrow(ny)
## [1] 311
# Remove any non-complete rows from the NY dataframe
ny_comp <- ny[complete.cases(ny), ]
nrow(ny_comp)
## [1] 311
Confirming the 311 rows of New York have complete data
library(dplyr)
# Reorder based on the employee average instead of industry name
ny$Industry = with(ny, reorder(Industry, Employees, mean))
# Calculate the mean, min, and max of employees for each company and group by industry
ny.summary <- ny %>%
group_by(Industry) %>%
summarise(
emp_mean = mean(Employees),
emp_min = min(Employees),
emp_max = max(Employees)
)
# Output the resulting dataframe for visual inspection
ny.summary
## # A tibble: 25 x 4
## Industry emp_mean emp_min emp_max
## <fct> <dbl> <int> <int>
## 1 Government Services 17 17 17
## 2 Real Estate 18.2 7 30
## 3 Retail 24.8 3 75
## 4 Logistics & Transportation 29.5 1 70
## 5 Insurance 32.5 15 50
## 6 Computer Hardware 44 44 44
## 7 Engineering 53.5 11 94
## 8 Advertising & Marketing 58.4 2 270
## 9 Education 59.9 19 200
## 10 Construction 61 10 219
## # … with 15 more rows
# Trying to identify outliers
maxval <- 3200
dd <- ny.summary %>% filter(ny.summary$emp_max > maxval) %>%
group_by(Industry) %>%
summarise(outlier_txt=paste(emp_max,collapse=","))
# Output resulting df
dd
## # A tibble: 2 x 2
## Industry outlier_txt
## <fct> <chr>
## 1 Consumer Products & Services 10000
## 2 Business Products & Services 32000
# Round the mean values for display purposes
ny.summary$emp_mean <- round(ny.summary$emp_mean, digits=0)
# Plot the chart for Question 2
ggplot() +
geom_linerange(ny.summary, mapping=aes(x=Industry, ymin=emp_min, ymax=emp_max), width=0.2, size=1) +
geom_point(ny.summary, mapping=aes(x=Industry, y=emp_mean), size=2, shape=23, fill="white") +
geom_text(ny.summary, mapping=aes(x=Industry, y=emp_mean, label=emp_mean), size=3, vjust=-.6, nudge_x = 0.05) +
coord_flip(ylim=c(0, maxval)) +
ggtitle("Average Employees with Range for New York") +
xlab("Industry") +
ylab("Employees") +
geom_text(data=dd,aes(x=Industry, y=maxval,label=outlier_txt),size=3,vjust=-.6,hjust=.5) +
geom_segment(data=dd,aes(x=Industry, y=maxval*0.95,yend=3365, xend=factor(Industry)),
arrow = arrow(length = unit(0.3,"cm")))
Based on the question requiring an identification of the ranges, initially I generated a boxplot using geom_boxplot(). But then the more I thought about the prompt, I realized the boxplot captures the entire range as long as all the outliers are intact along with median value and the quartiles of distribution. As the prompt left the design open-ended (on purpose), I chose the geom_linerage approach with an identifier for the average as the final approach. I wanted to portray just the data and information requested in the prompt, and thus not include additional information. Yes, the Edward Tufte approach is really starting to weigh on my design decisions. I was imagining this chart in a newspaper or on a news website, not as a submission for a data science analysis project, thus the less-is-more approach to the presented chart.
I ordered the industries by average employees. Given this statistic as the focus of the chart, ordering on the statistic brings awareness to its importance. Also, I figured ordering by alphabetized industry names would provide little value. Sure, the alphabetized list provides a mechanism of order to a viewer, but the names aren’t household names in the way the state abbreviations are in Question 1. I wouldn’t expect a casual viewer to look for “Transportation” as “Logistics & Transportation”.
Using geom_linerange, I’m able to create the range of data using an indiscreet black line, a la Tufte. The average for each industry is portrayed as a diamond on the range line along with the true numeric value. I believe the specific number provides more clarity to a viewer instead of requiring the viewer to estimate the number based on the X axis values. I left the X axis labels on the chart, as many averages are below 1000, I felt the higher numbers would be harder to estimate because the chart doesn’t stop at an exact multiple of 1000.
I presented the industry names on the Y axis, following the approach in Question 1. This allows for a long list of industry names and also allows the names to be written horizontally (not tilted).
As for the outliers, I’m not completely satisfied with this approach, as I added arrows to the two lines that would extend past the graph’s right edge. With those arrows, I also identify the specific maximum values of the range. The size of the chart seemed a good compromise to capture all the numbers below 1000 along with capturing the extent to which the two outliers extend farther beyond the presented chart.
http://sape.inf.usi.ch/quick-reference/ggplot2/geom_linerange
https://www.r-graph-gallery.com/267-reorder-a-variable-in-ggplot2.html
https://stackoverflow.com/questions/29463067/include-indication-of-extreme-outliers-in-ggplot
https://stackoverflow.com/questions/5677885/ignore-outliers-in-ggplot2-boxplot
Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.
library(ggthemes)
# Set theme from package ggthemes
theme_set(theme_fivethirtyeight())
# Remove incomplete rows
inc <- inc[complete.cases(inc), ]
nrow(inc)
## [1] 4989
# Calculate the revenue by the number of employees for each company
inc$Rev_Per_Emp <- inc$Revenue / inc$Employees
# Reorder by industry using the mean of the revenue per employee
inc$Industry = with(inc, reorder(Industry, Rev_Per_Emp, mean))
# Graph the boxplot of the data using ggplot
ggplot(inc, aes(x=Industry, y=Rev_Per_Emp)) +
geom_boxplot(outlier.shape = NA) +
stat_summary(fun.y = "mean", geom = "point", shape = 23, size = 2, fill = "white") +
ggtitle("Revenue Per Employee") +
xlab("Industry") +
ylab("Revenue") +
coord_flip(ylim = c(0, 1800000))
As mentioned in Question 2, I first attempted a boxplot in order to capture the range of values. The more I thought about the prompt, the less I considered the boxplot the correct the approach. Nonetheless, I left the boxplot here as evidence of my thought process. The boxplot does capture the revenue per employee by industry, but doesn’t exactly capture the full range of values. The outliers are not captured on this graph, and thus the graph doesn’t identify the full range as requested in the question prompt.
I understood the prompt to imply a presentation in a print publication or news website, thus the use of boxplots would be overkill to a typical consumer of news graphics. I also used the theme of the website fivethirtyeight.com from the ggthemes package for experimentation purposes.
# Set theme to Tufte, an homage
theme_set(theme_tufte())
# Reorder the data based on the calculation of revenue per employee
inc$Industry = with(inc, reorder(Industry, Rev_Per_Emp, mean))
# Calculate the mean, min, and max of the revenue per employee and group by industry
inc.summary <- inc %>%
group_by(Industry) %>%
summarise(
rev_emp_mean = mean(Rev_Per_Emp),
rev_emp_min = min(Rev_Per_Emp),
rev_emp_max = max(Rev_Per_Emp)
)
# Output for visual inspection
inc.summary
## # A tibble: 25 x 4
## Industry rev_emp_mean rev_emp_min rev_emp_max
## <fct> <dbl> <dbl> <dbl>
## 1 Engineering 201120. 37067. 725424.
## 2 Software 225989. 14062. 12600000
## 3 Government Services 243596. 18000 3835616.
## 4 IT Services 270494. 8369. 3506349.
## 5 Security 283391. 14865. 3075000
## 6 Environmental Services 283607. 41623. 2009231.
## 7 Education 296454. 4706. 5100000
## 8 Advertising & Marketing 306036. 2076. 5663492.
## 9 Media 307144. 6273. 1550000
## 10 Health 325199. 9778. 6835484.
## # … with 15 more rows
# Round the mean and max values for presentation purposes
inc.summary$rev_emp_mean <- round(inc.summary$rev_emp_mean, digits=0)
inc.summary$rev_emp_max <- round(inc.summary$rev_emp_max, digits=0)
# Trying to identify outliers
maxval <- 8000000
dd <- inc.summary %>% filter(inc.summary$rev_emp_max > maxval) %>%
group_by(Industry) %>%
summarise(outlier_txt=paste(rev_emp_max,collapse=","))
# Output resulting df of outliers
dd
## # A tibble: 4 x 2
## Industry outlier_txt
## <fct> <chr>
## 1 Software 12600000
## 2 Manufacturing 12800000
## 3 Logistics & Transportation 9666667
## 4 Energy 40740000
# Plot the chart for Question 3
ggplot() +
geom_linerange(inc.summary, mapping=aes(x=Industry, ymin=rev_emp_min, ymax=rev_emp_max), width=0.2, size=1) +
geom_point(inc.summary, mapping=aes(x=Industry, y=rev_emp_mean), size=2, shape=23, fill="white") +
geom_text(inc.summary, mapping=aes(x=Industry, y=rev_emp_mean, label=rev_emp_mean), size=3, vjust=-.6, nudge_x = 0.05) +
coord_flip(ylim=c(0, maxval)) +
ggtitle("Average Revenue Per Employee with Range") +
xlab("Industry") +
ylab("Revenue Per Employee") +
geom_text(data=dd,aes(x=Industry, y=maxval,label=outlier_txt),size=3,vjust=-.6,hjust=.5) +
geom_segment(data=dd,aes(x=Industry, y=maxval*0.95,yend=8400000, xend=factor(Industry)),
arrow = arrow(length = unit(0.3,"cm"))) +
scale_y_continuous(labels = scales::number)
Overall, I followed the display approach used for Question 2. The boxplot was considered first, but instead I used the linerange to clearly denote the range of values for each industry. The boxplot gives more information but doesn’t capture the full range. I wanted the chart to capture the exact data requested in the question prompt, and no more.
The industries are presented on the Y axis to allow for easy readability of the industry names instead of presenting the names tilted on the X axis.
The industries are ordered by the average of revenue per employee. As the prompt requests the presentation of this statistic primarily, I chose to order by this statistic to bring more attention specifically to that information.
In an effort to indicate the exact values, the averages are presented as white diamonds along with the exact number rounded to a whole number. I did keep the X axis numeric values to help readability, but I presented the exact averages to provide the information clearly and not require visual estimating by the viewer.
The outliers are treated the same as in Question 2. Arrow heads are presented for those line ranges that extend beyond the right edge of the chart with the maximum value. To me, not the ideal presentation of the outliers but does include the information.
The selection of just over 8 million as the upper boundary was based on trial an error in trying to allow enough space for the line ranges and averages below 1 million while also having as few ranges extend beyond the chart. At a lower maximum value on the X axis, say 2 million, the averages would be much easier to focus on and compare, but then most line ranges would extend beyond the chart and thus provide no real value. Again, a compromise to capture the averages and full line ranges with just a few identified outliers.
I did have to manipulate the X axis labels to show the full numbers, multiples of 1 million. By default the numbers were presented in scientific notation, but I felt that would be too obtuse of a casual viewer.