library (tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.3
## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.1 v dplyr 0.7.4
## v tidyr 0.7.2 v stringr 1.2.0
## v readr 1.1.1 v forcats 0.2.0
## Warning: package 'tibble' was built under R version 3.4.3
## Warning: package 'tidyr' was built under R version 3.4.3
## Warning: package 'readr' was built under R version 3.4.3
## Warning: package 'purrr' was built under R version 3.4.3
## Warning: package 'dplyr' was built under R version 3.4.2
## Warning: package 'forcats' was built under R version 3.4.3
## -- Conflicts ------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Principles of Data Visualization and Introduction to ggplot2
I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:
inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)
And lets preview this data:
head(inc)
## Rank Name Growth_Rate Revenue
## 1 1 Fuhu 421.48 1.179e+08
## 2 2 FederalConference.com 248.31 4.960e+07
## 3 3 The HCI Group 245.45 2.550e+07
## 4 4 Bridger 233.08 1.900e+09
## 5 5 DataXu 213.37 8.700e+07
## 6 6 MileStone Community Builders 179.38 4.570e+07
## Industry Employees City State
## 1 Consumer Products & Services 104 El Segundo CA
## 2 Government Services 51 Dumfries VA
## 3 Health 132 Jacksonville FL
## 4 Energy 50 Addison TX
## 5 Advertising & Marketing 220 Boston MA
## 6 Real Estate 63 Austin TX
summary(inc)
## Rank Name Growth_Rate
## Min. : 1 (Add)ventures : 1 Min. : 0.340
## 1st Qu.:1252 @Properties : 1 1st Qu.: 0.770
## Median :2502 1-Stop Translation USA: 1 Median : 1.420
## Mean :2502 110 Consulting : 1 Mean : 4.612
## 3rd Qu.:3751 11thStreetCoffee.com : 1 3rd Qu.: 3.290
## Max. :5000 123 Exteriors : 1 Max. :421.480
## (Other) :4995
## Revenue Industry Employees
## Min. :2.000e+06 IT Services : 733 Min. : 1.0
## 1st Qu.:5.100e+06 Business Products & Services: 482 1st Qu.: 25.0
## Median :1.090e+07 Advertising & Marketing : 471 Median : 53.0
## Mean :4.822e+07 Health : 355 Mean : 232.7
## 3rd Qu.:2.860e+07 Software : 342 3rd Qu.: 132.0
## Max. :1.010e+10 Financial Services : 260 Max. :66803.0
## (Other) :2358 NA's :12
## City State
## New York : 160 CA : 701
## Chicago : 90 TX : 387
## Austin : 88 NY : 311
## Houston : 76 VA : 283
## San Francisco: 75 FL : 282
## Atlanta : 74 IL : 273
## (Other) :4438 (Other):2764
Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:
From the tidyverse package we will use the glimpse function which provides a cleaner look to what is under the hood of the data set.
glimpse(inc)
## Observations: 5,001
## Variables: 8
## $ Rank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Name <fctr> Fuhu, FederalConference.com, The HCI Group, Bridg...
## $ Growth_Rate <dbl> 421.48, 248.31, 245.45, 233.08, 213.37, 179.38, 17...
## $ Revenue <dbl> 1.179e+08, 4.960e+07, 2.550e+07, 1.900e+09, 8.700e...
## $ Industry <fctr> Consumer Products & Services, Government Services...
## $ Employees <int> 104, 51, 132, 50, 220, 63, 27, 75, 97, 15, 149, 16...
## $ City <fctr> El Segundo, Dumfries, Jacksonville, Addison, Bost...
## $ State <fctr> CA, VA, FL, TX, MA, TX, TN, CA, UT, RI, VA, CA, F...
Rank: an integer variable displaying the position of the fatest growing companies in the U.S.
Name: a factor variable displaying the name of each company
Growth_Rate: a double variable displaying the rate of growth for each company
Revenue: a double variable dispaying the total revenue of each company
Industry: a factor variable displaying the type of industry for each company
Employees: an integer variable displaying the number of employees for each company
City: a factor variable displaying the city where the company resides
State: a factor variable displaying the state where the company resides
Check the variance and standard deviation on applicable variables
cat("\n","Variance & Standard Deviation of the Growth_Rate Variable","\n")
##
## Variance & Standard Deviation of the Growth_Rate Variable
var(inc$Growth_Rate)
## [1] 199.4787
sd(inc$Growth_Rate)
## [1] 14.12369
cat("\n","Variance & Standard Deviation of the Revenue Variable","\n")
##
## Variance & Standard Deviation of the Revenue Variable
var(inc$Revenue)
## [1] 5.786059e+16
sd(inc$Revenue)
## [1] 240542281
Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.
To gain some insight on how many states there are we utilize the table function
table(inc$State)
##
## AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY
## 2 51 9 100 701 134 50 43 16 282 212 7 28 17 273 69 38 40
## LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH
## 37 182 131 13 126 88 59 12 4 137 10 27 24 158 5 26 311 186
## OK OR PA PR RI SC SD TN TX UT VA VT WA WI WV WY
## 46 49 164 1 16 48 3 82 387 95 283 6 130 79 2 2
The best graph to display the data will be a bar chart with flipped coordinates.
In order to achieve this graph we first created a subset entitled Q1_States which groups the data based on the State variable. Data is then counted based on the number of industries per state; the default variable n is created highlighting the count.
Q1_States <- inc %>%
group_by(State) %>%
count(State, sort = TRUE)
head(Q1_States)
## # A tibble: 6 x 2
## # Groups: State [6]
## State n
## <fctr> <int>
## 1 CA 701
## 2 TX 387
## 3 NY 311
## 4 VA 283
## 5 FL 282
## 6 IL 273
ggplot(data = Q1_States, aes(x = reorder(State, n), y = n)) +
geom_bar(stat = "identity", fill = "purple", width = .5) +
coord_flip() +
xlab("States") +
ylab("Number of Companies") +
ggtitle("Number of Companies Based on State") +
theme(axis.text=element_text(size=6),
axis.title=element_text(size=12,face="bold",color = "navy"))
Even in portrait mode this data seems cluttered, a cleaner graph could be generated if we showcased states with one hundred or more companies.
inc %>%
group_by(State) %>%
count(State, sort = TRUE) %>%
filter(n >= 100) %>%
ggplot(aes(x = reorder(State, n), y = n)) +
geom_bar(stat = "identity", fill = "purple", width = .5) +
coord_flip() +
xlab("States") +
ylab("Number of Companies") +
ggtitle("Number of Companies Based on State", subtitle ="With 100 Or More Companies") +
theme(axis.text=element_text(size=10),
axis.title=element_text(size=12,face="bold",color = "navy"))
Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.
complete.cases Return a logical vector indicating which cases are complete, i.e., have no missing values.
Create a data set entitled Q2_identify which will count and sort the data on the State with the most companies on the list. Next, create a variable named Q2_Third based on the criteria in the question.
Q2_identify <- inc %>%
count(State, sort = TRUE) %>% #counts and sorts data based on the State variable
mutate(State = reorder(State,n)) %>%
rename(Count = n)
(Q2_Third <- Q2_identify[3,1]) #filter on the 3rd row and the 1st column
## # A tibble: 1 x 1
## State
## <fctr>
## 1 NY
Using the complete.cases function we are able to identify complete data and then create a subset based on the findings from Q2_identify Next we use the filter command to only extract data from New York. Lastly, the data is grouped by the Industry variable.
Q2_NY <- inc[complete.cases(inc),] %>%
filter(State == "NY")
After creating the subset, the mean is calculated based on the aggregate feature
(Q2_M <- aggregate(Employees ~ Industry, Q2_NY, mean))
## Industry Employees
## 1 Advertising & Marketing 58.43860
## 2 Business Products & Services 1492.46154
## 3 Computer Hardware 44.00000
## 4 Construction 61.00000
## 5 Consumer Products & Services 626.29412
## 6 Education 59.85714
## 7 Energy 129.20000
## 8 Engineering 53.50000
## 9 Environmental Services 155.00000
## 10 Financial Services 144.30769
## 11 Food & Beverage 76.44444
## 12 Government Services 17.00000
## 13 Health 81.84615
## 14 Human Resources 437.54545
## 15 Insurance 32.50000
## 16 IT Services 204.09302
## 17 Logistics & Transportation 29.50000
## 18 Manufacturing 73.30769
## 19 Media 108.00000
## 20 Real Estate 18.25000
## 21 Retail 24.78571
## 22 Security 135.00000
## 23 Software 245.92308
## 24 Telecommunications 95.35294
## 25 Travel & Hospitality 547.71429
ggplot(data = Q2_NY, aes(x = reorder(Industry, Employees,mean), y = Employees)) +
geom_boxplot(outlier.shape = NA) +
coord_flip() +
xlab("New York Based Industries") +
ylab("Employees") +
ggtitle("The Average of New York Employee Size Based on Industry") +
theme(axis.text=element_text(size=10),
axis.title=element_text(size=12,face="bold",color = "navy")) +
scale_y_continuous(limits = c(0,1500))
## Warning: Removed 6 rows containing non-finite values (stat_boxplot).
Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.
In order to achieve this graph we first created a subset entitled Q3_Revenue which groups the data based on the Industry variable. Data is then summarized by counting the number of industries per group, calculating the sum of the Revenue & Employee variables. Lastly, using the mutate function, a variable entitled RevenueEmp will calculate the Revenue divided by the number of Employees.
At first glance, several key pieces of vital information was missing so it was necessary to add complete.cases to the code.
Q3_Revenue <- inc[complete.cases(inc),] %>%
group_by(Industry) %>%
summarise(IndustryCount = n(),
Revenue = sum(Revenue),
Employees = sum(Employees)) %>%
mutate(RevenueEmp = Revenue / Employees)
Q3_Revenue
## # A tibble: 25 x 5
## Industry IndustryCount Revenue Employees Reven~
## <fctr> <int> <dbl> <int> <dbl>
## 1 Advertising & Marketing 471 7785000000 39731 1.96e5
## 2 Business Products & Services 480 26345900000 117357 2.24e5
## 3 Computer Hardware 44 11885700000 9714 1.22e6
## 4 Construction 187 13174300000 29099 4.53e5
## 5 Consumer Products & Services 203 14956400000 45464 3.29e5
## 6 Education 83 1139300000 7685 1.48e5
## 7 Energy 109 13771600000 26437 5.21e5
## 8 Engineering 74 2532500000 20435 1.24e5
## 9 Environmental Services 51 2638800000 10155 2.60e5
## 10 Financial Services 260 13150900000 47693 2.76e5
## # ... with 15 more rows
ggplot(data = Q3_Revenue, aes(x = reorder(Industry, RevenueEmp), y = RevenueEmp)) +
geom_bar(stat = "identity", fill = "blue") +
coord_flip() +
scale_y_continuous("RevenueEmp") +
xlab("Industry") +
ylab("Revenue Per Employee") +
ggtitle("Revenue Per Employee Based on Industry") +
theme(axis.text=element_text(size=11),
axis.title=element_text(size=14,face="bold",color = "Red"))