Principles of Data Visualization and Introduction to ggplot2
I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. Let's read this in:
inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header = TRUE)
And let's preview this data:
head(inc)
## Rank Name Growth_Rate Revenue
## 1 1 Fuhu 421.48 1.179e+08
## 2 2 FederalConference.com 248.31 4.960e+07
## 3 3 The HCI Group 245.45 2.550e+07
## 4 4 Bridger 233.08 1.900e+09
## 5 5 DataXu 213.37 8.700e+07
## 6 6 MileStone Community Builders 179.38 4.570e+07
## Industry Employees City State
## 1 Consumer Products & Services 104 El Segundo CA
## 2 Government Services 51 Dumfries VA
## 3 Health 132 Jacksonville FL
## 4 Energy 50 Addison TX
## 5 Advertising & Marketing 220 Boston MA
## 6 Real Estate 63 Austin TX
summary(inc)
## Rank Name Growth_Rate Revenue
## Min. : 1 Length:5001 Min. : 0.340 Min. :2.000e+06
## 1st Qu.:1252 Class :character 1st Qu.: 0.770 1st Qu.:5.100e+06
## Median :2502 Mode :character Median : 1.420 Median :1.090e+07
## Mean :2502 Mean : 4.612 Mean :4.822e+07
## 3rd Qu.:3751 3rd Qu.: 3.290 3rd Qu.:2.860e+07
## Max. :5000 Max. :421.480 Max. :1.010e+10
##
## Industry Employees City State
## Length:5001 Min. : 1.0 Length:5001 Length:5001
## Class :character 1st Qu.: 25.0 Class :character Class :character
## Mode :character Median : 53.0 Mode :character Mode :character
## Mean : 232.7
## 3rd Qu.: 132.0
## Max. :66803.0
## NA's :12
Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:
library(dplyr)
library(tidyverse)
library(gghighlight)
The dataset does not say which year it covers, so I did a little research on the top-ranked company, Fuhu. If the data is from around 2013, its growth rate is plausible according to Wikipedia.
# The summary suggests Growth_Rate has outliers, so look at the top five rows
inc[1:5,]
## Rank Name Growth_Rate Revenue Industry
## 1 1 Fuhu 421.48 1.179e+08 Consumer Products & Services
## 2 2 FederalConference.com 248.31 4.960e+07 Government Services
## 3 3 The HCI Group 245.45 2.550e+07 Health
## 4 4 Bridger 233.08 1.900e+09 Energy
## 5 5 DataXu 213.37 8.700e+07 Advertising & Marketing
## Employees City State
## 1 104 El Segundo CA
## 2 51 Dumfries VA
## 3 132 Jacksonville FL
## 4 50 Addison TX
## 5 220 Boston MA
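One rough way to back up the impression that Growth_Rate has outliers is the common 1.5 × IQR rule, applied to the quartiles printed by summary(inc). This is only a sketch: the quartiles and the top five growth rates are copied from the output above, and the 1.5 × IQR cutoff is a convention assumed here, not something the dataset prescribes.

```r
# Quartiles of Growth_Rate, copied from the summary(inc) output
q1 <- 0.77
q3 <- 3.29

# Upper fence under the conventional 1.5 * IQR rule (an assumed cutoff)
upper_fence <- q3 + 1.5 * (q3 - q1)

# Growth rates of the top five companies, copied from inc[1:5, ]
top5 <- c(421.48, 248.31, 245.45, 233.08, 213.37)

# TRUE for every value that lies above the fence
top5 > upper_fence
```

On the full data, `sum(inc$Growth_Rate > upper_fence)` would count how many companies exceed the fence.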
Less than 1% of the rows (12 of 5,001) have a missing value (NA), all in Employees, so I drop them for the rest of the analysis.
# The summary shows that Employees has missing values; keep only rows where it is present
data <- inc %>%
filter(!is.na(Employees))
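The share of missing rows can be checked with a little arithmetic, and the same idea generalizes to counting NAs per column. The sketch below uses the counts from the summary() output; the `toy` data frame and its values are hypothetical and only illustrate the calls.

```r
# Counts taken from the summary(inc) output above
n_total <- 5001
n_missing <- 12

# Share of rows with a missing Employees value
share_missing <- n_missing / n_total

# Hypothetical toy data frame to show a per-column NA count
toy <- data.frame(
  Employees = c(104, NA, 132),
  Revenue   = c(1.179e8, 4.96e7, 2.55e7)
)
na_per_column <- colSums(is.na(toy))

# Base R equivalent of filter(!is.na(Employees))
toy_complete <- toy[!is.na(toy$Employees), ]
```

Running `colSums(is.na(inc))` on the real data would show which columns contain NAs before deciding what to drop.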
There are 25 industries in total. IT Services has the most companies (732), which makes it the most common industry in the list.
# Across all states, which industry has the most companies?
data %>%
select(Industry) %>%
group_by(Industry) %>%
count(sort = TRUE)
## # A tibble: 25 × 2
## # Groups: Industry [25]
## Industry n
## <chr> <int>
## 1 IT Services 732
## 2 Business Products & Services 480
## 3 Advertising & Marketing 471
## 4 Health 354
## 5 Software 341
## 6 Financial Services 260
## 7 Manufacturing 255
## 8 Consumer Products & Services 203
## 9 Retail 203
## 10 Government Services 202
## # … with 15 more rows
Summed across all states, Business Products & Services has the highest total revenue (about $26.3 billion).
# Which industry has the highest total revenue?
data %>%
select(Industry, Revenue) %>%
group_by(Industry) %>%
summarize(total_revenue = sum(Revenue)) %>%
arrange(desc(total_revenue))
## # A tibble: 25 × 2
## Industry total_revenue
## <chr> <dbl>
## 1 Business Products & Services 26345900000
## 2 IT Services 20525000000
## 3 Health 17860100000
## 4 Consumer Products & Services 14956400000
## 5 Logistics & Transportation 14837800000
## 6 Energy 13771600000
## 7 Construction 13174300000
## 8 Financial Services 13150900000
## 9 Food & Beverage 12812500000
## 10 Manufacturing 12603600000
## # … with 15 more rows
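Total revenue partly reflects how many companies an industry has, so it can help to divide it by the company count. The sketch below does this for three industries, using only the counts and totals printed in the two tables above; it is an illustration, not part of the original analysis.

```r
# Company counts and total revenue, copied from the two tables above
industry <- data.frame(
  Industry      = c("Business Products & Services", "IT Services", "Health"),
  n             = c(480, 732, 354),
  total_revenue = c(26345900000, 20525000000, 17860100000)
)

# Mean revenue per company in each industry
industry$mean_revenue <- industry$total_revenue / industry$n

# Sort from highest to lowest mean revenue
industry[order(-industry$mean_revenue), ]
```

With these figures, IT Services ranks second by total revenue but last of the three by mean revenue per company, because its total is spread over many more companies.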
Create a graph that shows the distribution of companies in the dataset by State (i.e., how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (i.e., taller than wide), which should further guide your layout choices.
# number of companies per state, largest first
states <- data %>%
count(State, name = 'num_comp', sort = TRUE)
# first few rows of the data
head(states)
## # A tibble: 6 × 2
## State num_comp
## <chr> <int>
## 1 CA 700
## 2 TX 386
## 3 NY 311
## 4 VA 283
## 5 FL 282
## 6 IL 272
# make a bar plot and highlight the top three states
states %>%
ggplot(aes(x = State, y = num_comp, fill = State)) +
geom_bar(stat = 'identity') +
geom_text(aes(label = num_comp), vjust = 0) +
theme(panel.background = element_blank(),
axis.text.x = element_text(angle = 90, vjust = 0.3, hjust = 0.3)) +
gghighlight(State %in% c('CA', 'TX', 'NY')) +
labs(x = 'State',
y = 'Number of companies',
title = 'Number of companies by state')
## label_key: State
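Since the chart is meant for a portrait screen, an alternative layout puts the states on the vertical axis and sorts the bars by count. The sketch below is one possible version, not the plot produced above: it uses only the six rows shown by head(states), and the data frame name `states_top` is made up for the example.

```r
library(ggplot2)

# The six largest states, copied from the head(states) output
states_top <- data.frame(
  State    = c("CA", "TX", "NY", "VA", "FL", "IL"),
  num_comp = c(700, 386, 311, 283, 282, 272)
)

# Horizontal bars: counts on x, states on y, ordered by count
p <- ggplot(states_top, aes(x = num_comp, y = reorder(State, num_comp))) +
  geom_col() +
  # place each count just to the right of its bar
  geom_text(aes(label = num_comp), hjust = -0.1, size = 3) +
  # leave room on the right so the labels are not clipped
  scale_x_continuous(expand = expansion(mult = c(0, 0.1))) +
  theme(panel.background = element_blank()) +
  labs(x = "Number of companies", y = "State", title = "Companies per state")
```

Calling `print(p)` draws the chart. With all states included, the vertical axis grows with the number of states, which suits a screen that is taller than it is wide.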
Let's dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.
# get the complete cases in NY state
ny <- data %>%
filter(State == 'NY') %>%
filter(complete.cases(.))
# box plot showing the median and spread; the log scale compresses extreme outliers
ny %>%
select(Industry, Employees) %>%
ggplot(aes(x = reorder(Industry, Employees, FUN = median), y = Employees)) +
geom_boxplot() +
scale_y_log10() +
coord_flip() +
theme(panel.background = element_blank()) +
labs(x = 'Industry (ordered by median)',
y = 'Employees (log10 scale)',
title = 'NY employees by industry')
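A numeric companion to the box plot can make the effect of outliers explicit by putting the median, mean, and IQR side by side. The sketch below uses a hypothetical toy data frame, `toy_ny`, with made-up values; on the real data the input would be `ny`.

```r
# Hypothetical toy data; the real input would be the `ny` data frame
toy_ny <- data.frame(
  Industry  = c("Software", "Software", "Software", "Retail", "Retail", "Retail"),
  Employees = c(20, 35, 50, 10, 40, 3000)
)

# Median, mean, and interquartile range of Employees per industry
stats <- aggregate(
  Employees ~ Industry,
  data = toy_ny,
  FUN = function(x) c(median = median(x), mean = mean(x), iqr = IQR(x))
)

# Flatten the matrix column into ordinary columns
stats <- do.call(data.frame, stats)
stats
```

In the toy data a single 3,000-employee company pulls the Retail mean far above its median, which is why the median is the safer summary when outliers are present.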
Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.
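# revenue per employee by industry (total revenue divided by total employees)
inv <- ny %>%
select(Industry, Employees, Revenue) %>%
group_by(Industry) %>%
summarize(rev_per_emp = sum(Revenue) / sum(Employees)) %>%
arrange(desc(rev_per_emp)) %>%
# store the ordering in the factor levels so the plot keeps it
mutate(Industry = reorder(Industry, rev_per_emp))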
# make a bar plot and highlight the top industry
inv %>%
ggplot(aes(x = Industry, y = rev_per_emp, fill = Industry)) +
geom_bar(stat = 'identity') +
theme(panel.background = element_blank()) +
coord_flip() +
gghighlight(rev_per_emp == max(inv$rev_per_emp)) +
labs(x = 'Industry in NY',
y = 'Revenue per employee',
title = 'Revenue per employee by industry in NY')
## label_key: Industry
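The bar chart shows one aggregate ratio per industry, while the task also asks for the distribution. One way to show it is to compute revenue per employee for each company and draw a box plot per industry. The sketch below is an assumption about how that could look, not the plot produced above; `toy_ny` and its values are hypothetical stand-ins for `ny`.

```r
library(ggplot2)

# Hypothetical toy data; the real input would be the `ny` data frame
toy_ny <- data.frame(
  Industry  = rep(c("Software", "Retail"), each = 3),
  Revenue   = c(2.0e6, 5.0e6, 9.0e6, 3.0e6, 8.0e6, 4.0e7),
  Employees = c(20, 25, 30, 10, 40, 100)
)

# Revenue per employee for each company, not for the industry as a whole
toy_ny$rev_per_emp <- toy_ny$Revenue / toy_ny$Employees

# Box plot per industry, ordered by the median ratio
p <- ggplot(toy_ny, aes(x = reorder(Industry, rev_per_emp, FUN = median),
                        y = rev_per_emp)) +
  geom_boxplot() +
  scale_y_log10() +
  coord_flip() +
  labs(x = "Industry", y = "Revenue per employee (log10 scale)",
       title = "Revenue per employee by industry (toy data)")
```

Calling `print(p)` draws the chart. Note that the per-company ratio weights every company equally, whereas `sum(Revenue) / sum(Employees)` weights large employers more heavily, so the two views can rank industries differently.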