Principles of Data Visualization and Introduction to ggplot2
I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:
inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)
And lets preview this data:
head(inc)
## Rank Name Growth_Rate Revenue
## 1 1 Fuhu 421.48 1.179e+08
## 2 2 FederalConference.com 248.31 4.960e+07
## 3 3 The HCI Group 245.45 2.550e+07
## 4 4 Bridger 233.08 1.900e+09
## 5 5 DataXu 213.37 8.700e+07
## 6 6 MileStone Community Builders 179.38 4.570e+07
## Industry Employees City State
## 1 Consumer Products & Services 104 El Segundo CA
## 2 Government Services 51 Dumfries VA
## 3 Health 132 Jacksonville FL
## 4 Energy 50 Addison TX
## 5 Advertising & Marketing 220 Boston MA
## 6 Real Estate 63 Austin TX
summary(inc)
## Rank Name Growth_Rate
## Min. : 1 (Add)ventures : 1 Min. : 0.340
## 1st Qu.:1252 @Properties : 1 1st Qu.: 0.770
## Median :2502 1-Stop Translation USA: 1 Median : 1.420
## Mean :2502 110 Consulting : 1 Mean : 4.612
## 3rd Qu.:3751 11thStreetCoffee.com : 1 3rd Qu.: 3.290
## Max. :5000 123 Exteriors : 1 Max. :421.480
## (Other) :4995
## Revenue Industry Employees
## Min. :2.000e+06 IT Services : 733 Min. : 1.0
## 1st Qu.:5.100e+06 Business Products & Services: 482 1st Qu.: 25.0
## Median :1.090e+07 Advertising & Marketing : 471 Median : 53.0
## Mean :4.822e+07 Health : 355 Mean : 232.7
## 3rd Qu.:2.860e+07 Software : 342 3rd Qu.: 132.0
## Max. :1.010e+10 Financial Services : 260 Max. :66803.0
## (Other) :2358 NA's :12
## City State
## New York : 160 CA : 701
## Chicago : 90 TX : 387
## Austin : 88 NY : 311
## Houston : 76 VA : 283
## San Francisco: 75 FL : 282
## Atlanta : 74 IL : 273
## (Other) :4438 (Other):2764
Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:
ANSWER (Part One): By using the dim function, we see that there are 5001 observations over 8 features. The str function shows that there are 4 factor variables, two integer variables, and two numeric variables.
# Insert your code here, create more chunks as necessary
dim(inc)
## [1] 5001 8
str(inc)
## 'data.frame': 5001 obs. of 8 variables:
## $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : Factor w/ 5001 levels "(Add)ventures",..: 1770 1633 4423 690 1198 2839 4733 1468 1869 4968 ...
## $ Growth_Rate: num 421 248 245 233 213 ...
## $ Revenue : num 1.18e+08 4.96e+07 2.55e+07 1.90e+09 8.70e+07 ...
## $ Industry : Factor w/ 25 levels "Advertising & Marketing",..: 5 12 13 7 1 20 10 1 5 21 ...
## $ Employees : int 104 51 132 50 220 63 27 75 97 15 ...
## $ City : Factor w/ 1519 levels "Acton","Addison",..: 391 365 635 2 139 66 912 1179 131 1418 ...
## $ State : Factor w/ 52 levels "AK","AL","AR",..: 5 47 10 45 20 45 44 5 46 41 ...
ANSWER, Part Two: Checking for missing or incomplete data, we see 12 cases where data for the Employees feature is missing. Missing data needs to be accounted for when performing aggregate functions.
inc[!complete.cases(inc),]
## Rank Name Growth_Rate Revenue
## 183 183 First Flight Solutions 22.32 2700000
## 1063 1064 Popchips 3.98 93300000
## 1123 1124 Vocalocity 3.72 42900000
## 1652 1653 Higher Logic 2.36 6000000
## 1685 1686 Global Communications Group 2.30 3600000
## 2196 2197 JeffreyM Consulting 1.68 12100000
## 2742 2743 Excalibur Exhibits 1.27 9900000
## 3000 3001 Heartland Business Systems 1.12 156300000
## 3978 3978 SSEC 0.68 80400000
## 4112 4112 Carolinas Home Medical Equipment 0.64 3300000
## 4566 4566 Oakbrook 0.48 8900000
## 4968 4968 Popcorn Palace 0.35 5500000
## Industry Employees City State
## 183 Logistics & Transportation NA Emerald Isle NC
## 1063 Food & Beverage NA San Francisco CA
## 1123 Telecommunications NA Atlanta GA
## 1652 Software NA Washington DC
## 1685 Telecommunications NA Englewood CO
## 2196 Business Products & Services NA Bellevue WA
## 2742 Business Products & Services NA houston TX
## 3000 IT Services NA Little Chute WI
## 3978 Manufacturing NA Horsham PA
## 4112 Health NA Matthews NC
## 4566 Real Estate NA Madison WI
## 4968 Food & Beverage NA Schiller Park IL
Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.
# Answer Question 1 here
suppressWarnings(suppressMessages(library("dplyr", quietly = T)))
library(dplyr)
Companies_per_State <- inc %>%
group_by(State) %>%
summarise(Companies_in_State = n()) %>%
arrange(Companies_in_State)
## Warning: package 'bindrcpp' was built under R version 3.5.1
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.1
#Plot saved as a separte pdf file
#png(filename="DATA608_Assignment_One_Question_One.png", width=900, height=1200)
ggplot(Companies_per_State, aes(x=reorder(State, Companies_in_State), y=Companies_in_State)) +
geom_bar(stat = "identity", fill="blue") +
coord_flip() +
ggtitle("Distribution of Growth Companies by State") +
labs(y="No. of Growth COmpanies", x="State") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5), panel.border = element_blank())
Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.
ANSWER 2 (Part One): First, remove the incomplete cases from the dataset as identified earlier. Doing so, reduces the total number of observations down to 4,989. Doing so increases the accuracy of the aggregate data.
# Answer Question 2 here
inc <- inc[complete.cases(inc),]
dim(inc)
## [1] 4989 8
ANSWER 2 (Part Two): Next,identifying the outliers in the dataset. We can see from this dot plot that there are a few huge outliers in the Employees data.
ggplot(inc, aes(x = State, y = Employees)) +
geom_point() +
scale_x_discrete(name = "State") +
scale_y_continuous(name = "Employees") +
geom_jitter() +
theme(axis.text.x = element_text(angle = 75, hjust = 1))
ANSWER 2 (Part Two): Next,identifying the outliers in the dataset. We can see from this dot plot that there are a few huge outliers in the Employees data. The maximum number of employees is 66,803.
max(inc$Employees)
## [1] 66803
min(inc$Employees)
## [1] 1
ANSWER 2 (Part Three):Applying Turkey’s method for identifying outliers, there are 610 data points that could be considered “Outliers” for the Employees feature. Most of these seem to be employee placement/temporary agencies. Removing these outliers shows that the data is more dispersed.
identify_outlier <- function(datapoint) {
q1 = 25
q3 = 132
interquartile_range = q3-q1
if (datapoint < q1-(1.5*interquartile_range) | datapoint > q3+(1.5*interquartile_range)){
return("Outlier")
}
else{
return("Acceptable")
}
}
inc$Outlier <- sapply(inc$Employees, identify_outlier, simplify=T)
head(inc[inc$Outlier=="Outlier",])
## Rank Name Growth_Rate Revenue
## 15 15 LivingSocial 123.33 536000000
## 23 23 Wingspan Portfolio Advisors 87.69 77000000
## 28 28 Kony 77.86 51100000
## 39 39 Intellect Resources 65.54 30000000
## 45 45 Silver Spring Networks 58.67 196700000
## 64 64 GSC Packaging 48.13 112600000
## Industry Employees City State Outlier
## 15 Consumer Products & Services 4100 Washington DC Outlier
## 23 Financial Services 1016 Carrollton TX Outlier
## 28 Software 1100 Orlando FL Outlier
## 39 Health 675 Greensboro NC Outlier
## 45 Energy 566 Redwood City CA Outlier
## 64 Manufacturing 300 Atlanta GA Outlier
inc <- inc %>%
filter(Outlier != 'Outlier')
dim(inc)
## [1] 4379 9
After removing the Outliers, the dataset is reduced to 4,379 observations.
ggplot(inc, aes(x = State, y = Employees)) +
geom_point() +
scale_x_discrete(name = "State") +
scale_y_continuous(name = "Employees") +
geom_jitter() +
theme(axis.text.x = element_text(angle = 75, hjust = 1))
ANSWER (Part Four): The code below confirms the visualization from Question 1 that the State with the 3rd most growth companies is NY.
Cos_in_State <- inc %>%
group_by(State) %>%
summarise(Companies_in_State = n()) %>%
arrange(desc(Companies_in_State)) %>%
top_n(3) %>%
slice(3L)
## Selecting by Companies_in_State
Cos_in_State
## # A tibble: 1 x 2
## State Companies_in_State
## <fct> <int>
## 1 NY 275
State_3rd_Highest <- inc %>%
filter(State == 'NY') %>%
group_by(Industry) %>%
summarise(Name=n(),total_emp=sum(Employees), mean_emp = mean(Employees), median_emp = median(Employees)) %>%
filter(Name > 4) %>%
arrange(desc(Name))
State_3rd_Highest
## # A tibble: 15 x 5
## Industry Name total_emp mean_emp median_emp
## <fct> <int> <int> <dbl> <dbl>
## 1 Advertising & Marketing 57 3331 58.4 38
## 2 IT Services 35 2545 72.7 47
## 3 Business Products & Services 18 1095 60.8 33
## 4 Consumer Products & Services 16 647 40.4 24
## 5 Telecommunications 16 1305 81.6 29.5
## 6 Education 14 838 59.9 50.5
## 7 Retail 14 347 24.8 13.5
## 8 Health 12 766 63.8 42.5
## 9 Manufacturing 12 646 53.8 28
## 10 Financial Services 11 968 88 70
## 11 Media 10 586 58.6 30.5
## 12 Software 10 806 80.6 54.5
## 13 Food & Beverage 8 305 38.1 35.5
## 14 Human Resources 7 273 39 30
## 15 Construction 6 366 61 24.5
#Plot saved as a separte pdf file
#png(filename="DATA608_Assignment_One_Question_Two(1).png", width=600, height=900)
ggplot(State_3rd_Highest, aes(y = mean_emp, x = reorder(Industry, mean_emp))) +
geom_bar(stat="identity", fill="#40b8d0") +
theme(axis.text.x = element_text(angle = 75, hjust = 1)) +
coord_flip() +
ggtitle("NY: Average Number of Employees per Industry") +
labs(y="(minimum of 5 companies per industry)", x="") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5), panel.border = element_blank())
#Plot saved as a separte pdf file
png(filename="DATA608_Assignment_One_Question_Two(2).png", width=600, height=900)
ggplot(State_3rd_Highest, aes(y = median_emp, x = reorder(Industry, median_emp))) +
geom_bar(stat="identity", fill="#b2d183") +
theme(axis.text.x = element_text(angle = 75, hjust = 1)) +
coord_flip() +
ggtitle("NY: Median Number of Employees per Industry") +
labs(y="(minimum of 5 companies per industry)", x="") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5), panel.border = element_blank())
Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.
ANSWER: First, create a new variable for each company showing revenue per employee.
# Answer Question 3 here
inc$Rev_per_Employee <- inc$Revenue / inc$Employees
head(inc)
## Rank Name Growth_Rate Revenue
## 1 1 Fuhu 421.48 1.179e+08
## 2 2 FederalConference.com 248.31 4.960e+07
## 3 3 The HCI Group 245.45 2.550e+07
## 4 4 Bridger 233.08 1.900e+09
## 5 5 DataXu 213.37 8.700e+07
## 6 6 MileStone Community Builders 179.38 4.570e+07
## Industry Employees City State Outlier
## 1 Consumer Products & Services 104 El Segundo CA Acceptable
## 2 Government Services 51 Dumfries VA Acceptable
## 3 Health 132 Jacksonville FL Acceptable
## 4 Energy 50 Addison TX Acceptable
## 5 Advertising & Marketing 220 Boston MA Acceptable
## 6 Real Estate 63 Austin TX Acceptable
## Rev_per_Employee
## 1 1133653.8
## 2 972549.0
## 3 193181.8
## 4 38000000.0
## 5 395454.5
## 6 725396.8
Industry_Rev_Employee <- inc %>%
group_by(Industry) %>%
summarise(Total_Revenue=sum(Revenue), Total_Employee=sum(Employees), Industry_Revenue_perEmployee = sum(Revenue) / sum(Employees)) %>%
arrange(desc(Industry_Revenue_perEmployee))
Industry_Rev_Employee
## # A tibble: 25 x 4
## Industry Total_Revenue Total_Employee Industry_Revenue_perE~
## <fct> <dbl> <int> <dbl>
## 1 Energy 6403000000 5974 1071811.
## 2 Computer Hardware 1785700000 2914 612800.
## 3 Logistics & Transp~ 4498700000 7968 564596.
## 4 Food & Beverage 3124300000 6623 471735.
## 5 Insurance 1226400000 3049 402230.
## 6 Real Estate 1645300000 4346 378578.
## 7 Consumer Products ~ 3725100000 10050 370657.
## 8 Construction 4401400000 12026 365990.
## 9 Financial Services 5768000000 16237 355238.
## 10 Human Resources 3796900000 10834 350462.
## # ... with 15 more rows
labs <- c("$0", "$300K", "$600K", "$900K", "$1M")
#Plot saved as a separte pdf file
#png(filename="DATA608_Assignment_One_Question_Three.png", width=600, height=900)
ggplot(Industry_Rev_Employee, aes(y = Industry_Revenue_perEmployee, x = reorder(Industry, Industry_Revenue_perEmployee))) +
geom_bar(stat="identity", fill="#40b8d0") +
theme(axis.text.x = element_text(angle = 75, hjust = 1)) +
scale_y_continuous(labels = labs) +
coord_flip() +
ggtitle("Industry Revenue per Employee") +
labs(y="", x="") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5), panel.border = element_blank())
REFLECTION: To answer these questions, I imagined that I was talking to a group of state legislators or a Chamber of Commerce–an audience with a vested interest in the data in order to make policy. Given that I wanted the visualizations to clearly show disparities between the states when it comes to growth businesses, to show disparities between mean and median, and generated revenues per employees. I chose bar charts to communicate these ideas and sorted them in descending orders which further communicates the information that I wanted to demonstrate. I put less emphasis on the actual hard numbers and more emphasis on showing disparties.