Principles of Data Visualization and Introduction to ggplot2
I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:
inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)
And lets preview this data:
head(inc)
## Rank Name Growth_Rate Revenue
## 1 1 Fuhu 421.48 1.179e+08
## 2 2 FederalConference.com 248.31 4.960e+07
## 3 3 The HCI Group 245.45 2.550e+07
## 4 4 Bridger 233.08 1.900e+09
## 5 5 DataXu 213.37 8.700e+07
## 6 6 MileStone Community Builders 179.38 4.570e+07
## Industry Employees City State
## 1 Consumer Products & Services 104 El Segundo CA
## 2 Government Services 51 Dumfries VA
## 3 Health 132 Jacksonville FL
## 4 Energy 50 Addison TX
## 5 Advertising & Marketing 220 Boston MA
## 6 Real Estate 63 Austin TX
summary(inc)
## Rank Name Growth_Rate
## Min. : 1 (Add)ventures : 1 Min. : 0.340
## 1st Qu.:1252 @Properties : 1 1st Qu.: 0.770
## Median :2502 1-Stop Translation USA: 1 Median : 1.420
## Mean :2502 110 Consulting : 1 Mean : 4.612
## 3rd Qu.:3751 11thStreetCoffee.com : 1 3rd Qu.: 3.290
## Max. :5000 123 Exteriors : 1 Max. :421.480
## (Other) :4995
## Revenue Industry Employees
## Min. :2.000e+06 IT Services : 733 Min. : 1.0
## 1st Qu.:5.100e+06 Business Products & Services: 482 1st Qu.: 25.0
## Median :1.090e+07 Advertising & Marketing : 471 Median : 53.0
## Mean :4.822e+07 Health : 355 Mean : 232.7
## 3rd Qu.:2.860e+07 Software : 342 3rd Qu.: 132.0
## Max. :1.010e+10 Financial Services : 260 Max. :66803.0
## (Other) :2358 NA's :12
## City State
## New York : 160 CA : 701
## Chicago : 90 TX : 387
## Austin : 88 NY : 311
## Houston : 76 VA : 283
## San Francisco: 75 FL : 282
## Atlanta : 74 IL : 273
## (Other) :4438 (Other):2764
Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(kableExtra)
library(ggplot2)
We can start by quickly looking at what industries have most of the fast growing companies. As we can see below, by far that is the IT sector, with several categories of service companies coming second but fairly far behind. We also see there are 25 Industry sectors in the dataset.
ind<-data.frame(inc %>% group_by(Industry) %>% tally()) %>% arrange(desc(n))
ind %>% kable() %>% kable_styling() %>% scroll_box(height = "500px")
| Industry | n |
|---|---|
| IT Services | 733 |
| Business Products & Services | 482 |
| Advertising & Marketing | 471 |
| Health | 355 |
| Software | 342 |
| Financial Services | 260 |
| Manufacturing | 256 |
| Consumer Products & Services | 203 |
| Retail | 203 |
| Government Services | 202 |
| Human Resources | 196 |
| Construction | 187 |
| Logistics & Transportation | 155 |
| Food & Beverage | 131 |
| Telecommunications | 129 |
| Energy | 109 |
| Real Estate | 96 |
| Education | 83 |
| Engineering | 74 |
| Security | 73 |
| Travel & Hospitality | 62 |
| Media | 54 |
| Environmental Services | 51 |
| Insurance | 50 |
| Computer Hardware | 44 |
But out of the 5,000 companies IT Services is a somewhat low percentage. So no sector really dominating growth.
ind[which(ind$Industry=="IT Services"),]$n/nrow(inc)*100
## [1] 14.65707
We can do the same analysis for States, to see if a State in particular is where most of the growth is happening.
sta<-data.frame(inc %>% group_by(State) %>% tally()) %>% arrange(desc(n))
sta %>% kable() %>% kable_styling() %>% scroll_box(height = "500px")
| State | n |
|---|---|
| CA | 701 |
| TX | 387 |
| NY | 311 |
| VA | 283 |
| FL | 282 |
| IL | 273 |
| GA | 212 |
| OH | 186 |
| MA | 182 |
| PA | 164 |
| NJ | 158 |
| NC | 137 |
| CO | 134 |
| MD | 131 |
| WA | 130 |
| MI | 126 |
| AZ | 100 |
| UT | 95 |
| MN | 88 |
| TN | 82 |
| WI | 79 |
| IN | 69 |
| MO | 59 |
| AL | 51 |
| CT | 50 |
| OR | 49 |
| SC | 48 |
| OK | 46 |
| DC | 43 |
| KY | 40 |
| KS | 38 |
| LA | 37 |
| IA | 28 |
| NE | 27 |
| NV | 26 |
| NH | 24 |
| ID | 17 |
| DE | 16 |
| RI | 16 |
| ME | 13 |
| MS | 12 |
| ND | 10 |
| AR | 9 |
| HI | 7 |
| VT | 6 |
| NM | 5 |
| MT | 4 |
| SD | 3 |
| AK | 2 |
| WV | 2 |
| WY | 2 |
| PR | 1 |
Here California comes on top a fair bit. Glad to see my home state of Texas coming second, although a fair bit behind.
sta[which(sta$State=="CA"),]$n/nrow(inc)*100
## [1] 14.0172
sta[which(sta$State=="TX"),]$n/nrow(inc)*100
## [1] 7.738452
We can actually quickly look at the proportion of all states.
-sort(desc(prop.table(table(inc$State)))) %>% kable() %>% kable_styling() %>% scroll_box(height = "500px")
| x | |
|---|---|
| CA | 0.14017197 |
| TX | 0.07738452 |
| NY | 0.06218756 |
| VA | 0.05658868 |
| FL | 0.05638872 |
| IL | 0.05458908 |
| GA | 0.04239152 |
| OH | 0.03719256 |
| MA | 0.03639272 |
| PA | 0.03279344 |
| NJ | 0.03159368 |
| NC | 0.02739452 |
| CO | 0.02679464 |
| MD | 0.02619476 |
| WA | 0.02599480 |
| MI | 0.02519496 |
| AZ | 0.01999600 |
| UT | 0.01899620 |
| MN | 0.01759648 |
| TN | 0.01639672 |
| WI | 0.01579684 |
| IN | 0.01379724 |
| MO | 0.01179764 |
| AL | 0.01019796 |
| CT | 0.00999800 |
| OR | 0.00979804 |
| SC | 0.00959808 |
| OK | 0.00919816 |
| DC | 0.00859828 |
| KY | 0.00799840 |
| KS | 0.00759848 |
| LA | 0.00739852 |
| IA | 0.00559888 |
| NE | 0.00539892 |
| NV | 0.00519896 |
| NH | 0.00479904 |
| ID | 0.00339932 |
| DE | 0.00319936 |
| RI | 0.00319936 |
| ME | 0.00259948 |
| MS | 0.00239952 |
| ND | 0.00199960 |
| AR | 0.00179964 |
| HI | 0.00139972 |
| VT | 0.00119976 |
| NM | 0.00099980 |
| MT | 0.00079984 |
| SD | 0.00059988 |
| AK | 0.00039992 |
| WV | 0.00039992 |
| WY | 0.00039992 |
| PR | 0.00019996 |
We can also ask specifically what are the top 20 faster growing companies.
inc %>% arrange(desc(Growth_Rate)) %>% head(20) %>% kable() %>% kable_styling() %>% scroll_box(height = "500px")
| Rank | Name | Growth_Rate | Revenue | Industry | Employees | City | State |
|---|---|---|---|---|---|---|---|
| 1 | Fuhu | 421.48 | 1.179e+08 | Consumer Products & Services | 104 | El Segundo | CA |
| 2 | FederalConference.com | 248.31 | 4.960e+07 | Government Services | 51 | Dumfries | VA |
| 3 | The HCI Group | 245.45 | 2.550e+07 | Health | 132 | Jacksonville | FL |
| 4 | Bridger | 233.08 | 1.900e+09 | Energy | 50 | Addison | TX |
| 5 | DataXu | 213.37 | 8.700e+07 | Advertising & Marketing | 220 | Boston | MA |
| 6 | MileStone Community Builders | 179.38 | 4.570e+07 | Real Estate | 63 | Austin | TX |
| 7 | Value Payment Systems | 174.04 | 2.550e+07 | Financial Services | 27 | Nashville | TN |
| 8 | Emerge Digital Group | 170.64 | 2.390e+07 | Advertising & Marketing | 75 | San Francisco | CA |
| 9 | Goal Zero | 169.81 | 3.310e+07 | Consumer Products & Services | 97 | Bluffdale | UT |
| 10 | Yagoozon | 166.89 | 1.860e+07 | Retail | 15 | Warwick | RI |
| 11 | OBXtek | 164.33 | 2.960e+07 | Government Services | 149 | Tysons Corner | VA |
| 12 | AdRoll | 150.65 | 3.410e+07 | Advertising & Marketing | 165 | San Francisco | CA |
| 13 | uBreakiFix | 141.02 | 1.700e+07 | Retail | 250 | Orlando | FL |
| 14 | Sparc | 128.63 | 2.110e+07 | Software | 160 | Charleston | SC |
| 15 | LivingSocial | 123.33 | 5.360e+08 | Consumer Products & Services | 4100 | Washington | DC |
| 16 | Amped Wireless | 110.68 | 1.430e+07 | Computer Hardware | 26 | Chino | CA |
| 17 | Intelligent Audit | 105.73 | 1.450e+08 | Logistics & Transportation | 15 | Rochelle Park | NJ |
| 18 | Integrity Funding | 104.62 | 1.110e+07 | Financial Services | 11 | Sarasota | FL |
| 19 | Vertex Body Sciences | 100.10 | 1.180e+07 | Food & Beverage | 51 | columbus | OH |
| 20 | BlueKai | 92.45 | 2.680e+07 | Advertising & Marketing | 107 | Cupertino | CA |
Finally we can ask which companies are growing quicker, small or large. If we define a small company as having less than 100 employees, medium size between 100 and 1000, and large more than 1000, we find most fast growing companies are small.
b <- c(-Inf, 100, 1000, Inf)
names <- c("Small", "Medium", "Large")
inc$bin <- cut(inc$Employees, breaks = b, labels = names)
inc %>% group_by(bin) %>% tally() %>% kable() %>% kable_styling()
| bin | n |
|---|---|
| Small | 3456 |
| Medium | 1365 |
| Large | 168 |
| NA | 12 |
Interestingly we stumble upon some companies which do not report number of employees, something that was shown in the summary of the data frame.
inc[is.na(inc$Employees),] %>% kable() %>% kable_styling()
| Rank | Name | Growth_Rate | Revenue | Industry | Employees | City | State | bin | |
|---|---|---|---|---|---|---|---|---|---|
| 183 | 183 | First Flight Solutions | 22.32 | 2700000 | Logistics & Transportation | NA | Emerald Isle | NC | NA |
| 1063 | 1064 | Popchips | 3.98 | 93300000 | Food & Beverage | NA | San Francisco | CA | NA |
| 1123 | 1124 | Vocalocity | 3.72 | 42900000 | Telecommunications | NA | Atlanta | GA | NA |
| 1652 | 1653 | Higher Logic | 2.36 | 6000000 | Software | NA | Washington | DC | NA |
| 1685 | 1686 | Global Communications Group | 2.30 | 3600000 | Telecommunications | NA | Englewood | CO | NA |
| 2196 | 2197 | JeffreyM Consulting | 1.68 | 12100000 | Business Products & Services | NA | Bellevue | WA | NA |
| 2742 | 2743 | Excalibur Exhibits | 1.27 | 9900000 | Business Products & Services | NA | houston | TX | NA |
| 3000 | 3001 | Heartland Business Systems | 1.12 | 156300000 | IT Services | NA | Little Chute | WI | NA |
| 3978 | 3978 | SSEC | 0.68 | 80400000 | Manufacturing | NA | Horsham | PA | NA |
| 4112 | 4112 | Carolinas Home Medical Equipment | 0.64 | 3300000 | Health | NA | Matthews | NC | NA |
| 4566 | 4566 | Oakbrook | 0.48 | 8900000 | Real Estate | NA | Madison | WI | NA |
| 4968 | 4968 | Popcorn Palace | 0.35 | 5500000 | Food & Beverage | NA | Schiller Park | IL | NA |
nrow(inc[is.na(inc$Employees),])
## [1] 12
Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.
sta$State <- factor(sta$State, levels = sta$State[order(sta$n)])
ggplot(sta, aes(State,n)) + geom_bar(stat="identity", fill='grey52') + coord_flip() + ylab("Number of Companies") + theme_minimal() + ggtitle("Distribution of Fastest Growing Companies Per State") + theme(plot.title = element_text(hjust = 0.5)) + theme(
panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
colour = "orchid"),
panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
colour = "gray100"),
panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
colour = "gray95")
)
This plot is hard to read, other than to know overall which states have the most fast growing companies. For a more detailed look we can plot just the top 5 states.
ggplot(head(sta,5), aes(State,n)) + geom_bar(stat="identity", fill='grey52') + coord_flip() + ylab("Number of Companies") + theme_minimal() + ggtitle("Distribution of Fastest Growing Companies Per State") + theme(plot.title = element_text(hjust = 0.5)) + geom_text(aes(label=n), vjust=0.3, hjust=1.6, color="white", size=3.5) + theme(
panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
colour = "orchid"),
panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
colour = "gray100"),
panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
colour = "gray95")
)
Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.
ny <- inc %>% filter(State == 'NY') %>% filter(complete.cases(.)) %>% group_by(Industry) %>% summarize(Average=mean(Employees), Median=median(Employees))
ny %>% kable() %>% kable_styling() %>% scroll_box(height = "500px")
| Industry | Average | Median |
|---|---|---|
| Advertising & Marketing | 58.43860 | 38.0 |
| Business Products & Services | 1492.46154 | 70.5 |
| Computer Hardware | 44.00000 | 44.0 |
| Construction | 61.00000 | 24.5 |
| Consumer Products & Services | 626.29412 | 25.0 |
| Education | 59.85714 | 50.5 |
| Energy | 129.20000 | 120.0 |
| Engineering | 53.50000 | 54.5 |
| Environmental Services | 155.00000 | 155.0 |
| Financial Services | 144.30769 | 81.0 |
| Food & Beverage | 76.44444 | 41.0 |
| Government Services | 17.00000 | 17.0 |
| Health | 81.84615 | 45.0 |
| Human Resources | 437.54545 | 56.0 |
| Insurance | 32.50000 | 32.5 |
| IT Services | 204.09302 | 54.0 |
| Logistics & Transportation | 29.50000 | 23.5 |
| Manufacturing | 73.30769 | 30.0 |
| Media | 108.00000 | 45.0 |
| Real Estate | 18.25000 | 18.0 |
| Retail | 24.78571 | 13.5 |
| Security | 135.00000 | 32.5 |
| Software | 245.92308 | 80.0 |
| Telecommunications | 95.35294 | 31.0 |
| Travel & Hospitality | 547.71429 | 61.0 |
library(reshape2)
ggplot(melt(ny, id.vars='Industry'), aes(x=Industry, y=value, fill=variable)) + geom_bar(stat='identity', position='dodge') + coord_flip() + theme_minimal() + theme(
panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
colour = "orchid"),
panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
colour = "gray100"),
panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
colour = "gray95")) +
ggtitle("Employees per Industry in New York") + ylab("Employees")
A large discrepancy between mean and median is noticeable for several industries. This discrepancy usually means the dataset has outliners which affect the average more than the median calculations. This is still valuable information, if the data is accurate. We now know in Travel & Hospitality, HR, Consumer Products & Services and especially in Business Products & Services, there are companies that employ a disproportionate number of people compared to their peers. Outliers can be visualized with a simple boxplot, which results in a not so pretty graph.
ny_all <- inc %>% filter(State == 'NY') %>% filter(complete.cases(.)) %>% group_by(Industry)
ggplot(ny_all, aes(x=Industry, y=Employees)) + geom_boxplot(fill='gray90',outlier.colour="red", outlier.shape=8,
outlier.size=4) + coord_flip() + theme_minimal() + theme(
panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
colour = "orchid"),
panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
colour = "gray100"),
panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
colour = "gray95")) +
ggtitle("Employees per Industry in New York") + ylab("Employees")
We can eliminate the outliers and re-plot ordering by the median. Outliers are defined for the entire dataset, not for individual groups.
quantiles <- quantile(ny_all$Employees, probs = c(.25, .75))
quantiles
## 25% 75%
## 21.0 105.5
range <- 1.5 * IQR(ny_all$Employees)
range
## [1] 126.75
ny_clean<-subset.data.frame(ny_all,(quantiles[1]-range < ny_all$Employees & quantiles[1]+range > ny_all$Employees ))
ggplot(ny_clean, aes(x=reorder(Industry,Employees,FUN=median), y=Employees)) + geom_boxplot(fill='gray90') + coord_flip() + theme_minimal() + theme(
panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
colour = "orchid"),
panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
colour = "gray100"),
panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
colour = "gray95")) +
ggtitle("Employees per Industry in New York") + ylab("Employees") + xlab("Industry")
Now we can go back to our original plot and make better sense of the data now that there aren’t large discrepancies between mean and median.
ny <- ny_clean %>% summarize(Average=mean(Employees), Median=median(Employees))
ny %>% kable() %>% kable_styling() %>% scroll_box(height = "500px")
| Industry | Average | Median |
|---|---|---|
| Advertising & Marketing | 41.84615 | 35.0 |
| Business Products & Services | 47.47059 | 32.0 |
| Computer Hardware | 44.00000 | 44.0 |
| Construction | 29.40000 | 24.0 |
| Consumer Products & Services | 40.43750 | 24.0 |
| Education | 49.07692 | 50.0 |
| Energy | 88.00000 | 105.0 |
| Engineering | 53.50000 | 54.5 |
| Environmental Services | 60.00000 | 60.0 |
| Financial Services | 51.25000 | 39.0 |
| Food & Beverage | 38.12500 | 35.5 |
| Government Services | 17.00000 | 17.0 |
| Health | 41.90000 | 38.5 |
| Human Resources | 39.00000 | 30.0 |
| Insurance | 32.50000 | 32.5 |
| IT Services | 53.77419 | 40.0 |
| Logistics & Transportation | 29.50000 | 23.5 |
| Manufacturing | 40.54545 | 26.0 |
| Media | 42.88889 | 16.0 |
| Real Estate | 18.25000 | 18.0 |
| Retail | 24.78571 | 13.5 |
| Security | 30.00000 | 25.0 |
| Software | 48.50000 | 37.0 |
| Telecommunications | 44.15385 | 25.0 |
| Travel & Hospitality | 34.75000 | 36.0 |
ggplot(melt(ny, id.vars='Industry'), aes(x=Industry, y=value, fill=variable)) + geom_bar(stat='identity', position='dodge') + coord_flip() + theme_minimal() + theme(
panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
colour = "orchid"),
panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
colour = "gray100"),
panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
colour = "gray95")) +
ggtitle("Employees per Industry in New York") + ylab("Employees")
Some groups still show outliers, but as stated before these still provide valuable information about the number of employees per industry sector. The Energy industry is surely the largest employer, with Environmental Services and Engineering coming behind as distant second and third. Overall Government Services shows having the least number of employees.
Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.
ny_rev_ee <- ny_all %>% mutate(Revenue_Per_EE = Revenue/Employees) %>% group_by(Industry) %>% summarize(Average_Rev_Per_EE = mean(Revenue_Per_EE), Median_Rev_Per_EE = median(Revenue_Per_EE))
ny_rev_ee %>% kable() %>% kable_styling() %>% scroll_box(height = "500px")
| Industry | Average_Rev_Per_EE | Median_Rev_Per_EE |
|---|---|---|
| Advertising & Marketing | 373403.5 | 255555.6 |
| Business Products & Services | 527816.9 | 203148.1 |
| Computer Hardware | 520454.5 | 520454.5 |
| Construction | 238694.5 | 238613.0 |
| Consumer Products & Services | 382942.6 | 222222.2 |
| Education | 112060.6 | 114848.5 |
| Energy | 8472533.5 | 283211.7 |
| Engineering | 215744.7 | 202998.1 |
| Environmental Services | 134366.7 | 134366.7 |
| Financial Services | 400174.4 | 219607.8 |
| Food & Beverage | 174630.9 | 120238.1 |
| Government Services | 158823.5 | 158823.5 |
| Health | 532491.0 | 155000.0 |
| Human Resources | 337366.3 | 175000.0 |
| Insurance | 371000.0 | 371000.0 |
| IT Services | 228816.1 | 164285.7 |
| Logistics & Transportation | 1245870.1 | 996285.7 |
| Manufacturing | 665818.6 | 217500.0 |
| Media | 333549.6 | 262500.0 |
| Real Estate | 383809.5 | 330952.4 |
| Retail | 520790.3 | 305000.0 |
| Security | 153277.8 | 149000.0 |
| Software | 143749.0 | 133333.3 |
| Telecommunications | 408143.4 | 410714.3 |
| Travel & Hospitality | 282089.8 | 223333.3 |
ggplot(melt(ny_rev_ee, id.vars='Industry'), aes(x=Industry, y=value, fill=variable)) + geom_bar(stat='identity', position='dodge') + coord_flip() + theme_minimal() + theme(
panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
colour = "orchid"),
panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
colour = "gray100"),
panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
colour = "gray95")) +
ggtitle("Revenue per Employee per Industry in New York") + ylab("Revenue per Employee")
As in the previous case, we see at least one industry with a large discrepancy between average and median. We proceed with the same analysis.
ny_all <- ny_all %>% mutate(Revenue_Per_EE = Revenue/Employees)
ggplot(ny_all, aes(x=Industry, y=Revenue_Per_EE)) + geom_boxplot(fill='gray90',outlier.colour="red", outlier.shape=8,
outlier.size=4) + coord_flip() + theme_minimal() + theme(
panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
colour = "orchid"),
panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
colour = "gray100"),
panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
colour = "gray95")) +
ggtitle("Employees per Industry in New York") + ylab("Employees")
quantiles <- quantile(ny_all$Revenue_Per_EE, probs = c(.25, .75))
quantiles
## 25% 75%
## 138653.7 373333.3
range <- 1.5 * IQR(ny_all$Revenue_Per_EE)
range
## [1] 352019.4
ny_clean<-subset.data.frame(ny_all,(quantiles[1]-range < ny_all$Revenue_Per_EE & quantiles[1]+range > ny_all$Revenue_Per_EE))
ggplot(ny_clean, aes(x=reorder(Industry,Revenue_Per_EE,FUN=median), y=Revenue_Per_EE)) + geom_boxplot(fill='gray90') + coord_flip() + theme_minimal() + theme(
panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
colour = "orchid"),
panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
colour = "gray100"),
panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
colour = "gray95")) +
ggtitle("Revenue per Employee per Industry in New York") + ylab("Revenue per Employee") + xlab("Industry")
ny <- ny_clean %>% summarize(Average=mean(Revenue_Per_EE), Median=median(Revenue_Per_EE))
ny %>% kable() %>% kable_styling() %>% scroll_box(height = "500px")
| Industry | Average | Median |
|---|---|---|
| Advertising & Marketing | 229838.8 | 200000.0 |
| Business Products & Services | 184010.4 | 180000.0 |
| Construction | 238694.5 | 238613.0 |
| Consumer Products & Services | 187158.4 | 157777.8 |
| Education | 112060.6 | 114848.5 |
| Energy | 149778.0 | 106122.4 |
| Engineering | 215744.7 | 202998.1 |
| Environmental Services | 134366.7 | 134366.7 |
| Financial Services | 209618.4 | 195169.1 |
| Food & Beverage | 174630.9 | 120238.1 |
| Government Services | 158823.5 | 158823.5 |
| Health | 114565.5 | 124000.0 |
| Human Resources | 117477.8 | 103267.7 |
| Insurance | 180000.0 | 180000.0 |
| IT Services | 176233.7 | 160502.9 |
| Logistics & Transportation | 299740.3 | 299740.3 |
| Manufacturing | 224302.8 | 186363.6 |
| Media | 254404.6 | 252844.7 |
| Real Estate | 267301.6 | 300000.0 |
| Retail | 259547.4 | 260000.0 |
| Security | 153277.8 | 149000.0 |
| Software | 143749.0 | 133333.3 |
| Telecommunications | 296756.2 | 373333.3 |
| Travel & Hospitality | 209660.3 | 184617.5 |
ggplot(melt(ny, id.vars='Industry'), aes(x=Industry, y=value, fill=variable)) + geom_bar(stat='identity', position='dodge') + coord_flip() + theme_minimal() + theme(
panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
colour = "orchid"),
panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
colour = "gray100"),
panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
colour = "gray95")) +
ggtitle("Renevue Employee per Industry in New York") + ylab("Revenue per Employee")
Now looking at the data, Telecommunications stands out as the most efficient industry (highest revenue per employee), with Logistics and transportation coming second. The one to avoid is Human Resources which comes last, with Health and Education also close to the bottom.