Principles of Data Visualization and Introduction to ggplot2
I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:
inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)
And lets preview this data:
head(inc)
## Rank Name Growth_Rate Revenue
## 1 1 Fuhu 421.48 1.179e+08
## 2 2 FederalConference.com 248.31 4.960e+07
## 3 3 The HCI Group 245.45 2.550e+07
## 4 4 Bridger 233.08 1.900e+09
## 5 5 DataXu 213.37 8.700e+07
## 6 6 MileStone Community Builders 179.38 4.570e+07
## Industry Employees City State
## 1 Consumer Products & Services 104 El Segundo CA
## 2 Government Services 51 Dumfries VA
## 3 Health 132 Jacksonville FL
## 4 Energy 50 Addison TX
## 5 Advertising & Marketing 220 Boston MA
## 6 Real Estate 63 Austin TX
summary(inc)
## Rank Name Growth_Rate
## Min. : 1 (Add)ventures : 1 Min. : 0.340
## 1st Qu.:1252 @Properties : 1 1st Qu.: 0.770
## Median :2502 1-Stop Translation USA: 1 Median : 1.420
## Mean :2502 110 Consulting : 1 Mean : 4.612
## 3rd Qu.:3751 11thStreetCoffee.com : 1 3rd Qu.: 3.290
## Max. :5000 123 Exteriors : 1 Max. :421.480
## (Other) :4995
## Revenue Industry Employees
## Min. :2.000e+06 IT Services : 733 Min. : 1.0
## 1st Qu.:5.100e+06 Business Products & Services: 482 1st Qu.: 25.0
## Median :1.090e+07 Advertising & Marketing : 471 Median : 53.0
## Mean :4.822e+07 Health : 355 Mean : 232.7
## 3rd Qu.:2.860e+07 Software : 342 3rd Qu.: 132.0
## Max. :1.010e+10 Financial Services : 260 Max. :66803.0
## (Other) :2358 NA's :12
## City State
## New York : 160 CA : 701
## Chicago : 90 TX : 387
## Austin : 88 NY : 311
## Houston : 76 VA : 283
## San Francisco: 75 FL : 282
## Atlanta : 74 IL : 273
## (Other) :4438 (Other):2764
Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:
tail(inc)
## Rank Name Growth_Rate Revenue
## 4996 4996 cSubs 0.34 1.34e+07
## 4997 4997 Dot Foods 0.34 4.50e+09
## 4998 4998 Lethal Performance 0.34 6.80e+06
## 4999 4999 ArcaTech Systems 0.34 3.26e+07
## 5000 5000 INE 0.34 6.80e+06
## 5001 5000 ALL4 0.34 4.70e+06
## Industry Employees City State
## 4996 Business Products & Services 19 Montvale NJ
## 4997 Food & Beverage 3919 Mt. Sterling IL
## 4998 Retail 8 Wellington FL
## 4999 Financial Services 63 Mebane NC
## 5000 IT Services 35 Bellevue WA
## 5001 Environmental Services 34 Kimberton PA
#filter(inc,Name=='Cedar Petrochemcials')
I will look at some frequencies
head(as.data.frame(prop.table(table(inc$Industry))))
## Var1 Freq
## 1 Advertising & Marketing 0.09418116
## 2 Business Products & Services 0.09638072
## 3 Computer Hardware 0.00879824
## 4 Construction 0.03739252
## 5 Consumer Products & Services 0.04059188
## 6 Education 0.01659668
head(as.data.frame(prop.table(table(inc$State))))
## Var1 Freq
## 1 AK 0.00039992
## 2 AL 0.01019796
## 3 AR 0.00179964
## 4 AZ 0.01999600
## 5 CA 0.14017197
## 6 CO 0.02679464
Let’s look at the companies with lowest and highest revenue
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
inc%>%arrange(Revenue)%>%mutate(RevenueF = sprintf('$%.0f', Revenue))%>%head(10)
## Rank Name Growth_Rate Revenue
## 1 246 Cardinal Point Captains 17.65 2e+06
## 2 252 Benzara 17.02 2e+06
## 3 439 SPICA Computers 10.41 2e+06
## 4 459 Bows-N-Ties.com 10.03 2e+06
## 5 500 Mpact Financial Group 9.18 2e+06
## 6 588 ADA Collection 7.72 2e+06
## 7 602 The Rocket Company 7.62 2e+06
## 8 635 Kyzen Consulting Services 7.25 2e+06
## 9 959 Punchkick Interactive 4.52 2e+06
## 10 1150 FindMyCompany.com 3.63 2e+06
## Industry Employees City State RevenueF
## 1 Government Services 30 Carlsbad CA $2000000
## 2 Consumer Products & Services 5 Naperville IL $2000000
## 3 IT Services 30 Jersey City NJ $2000000
## 4 Retail 4 San Francisco CA $2000000
## 5 Financial Services 11 Dallas TX $2000000
## 6 Consumer Products & Services 14 folsom CA $2000000
## 7 Business Products & Services 8 Cumming GA $2000000
## 8 Government Services 7 West Palm Beach FL $2000000
## 9 Advertising & Marketing 27 Chicago IL $2000000
## 10 Advertising & Marketing 30 Austin TX $2000000
inc%>%arrange(Revenue)%>%mutate(RevenueF = sprintf('$%.0f', Revenue))%>%tail(10)
## Rank Name Growth_Rate Revenue
## 4992 2522 DLA Piper 1.41 2.40e+09
## 4993 1397 EnvisionRxOptions 2.88 2.70e+09
## 4994 4052 Kum & Go 0.65 2.80e+09
## 4995 4802 Boise Cascade 0.41 2.80e+09
## 4996 4246 American Tire Distributors 0.59 3.50e+09
## 4997 4716 Westcon Group 0.44 3.80e+09
## 4998 4997 Dot Foods 0.34 4.50e+09
## 4999 4936 Coty 0.36 4.60e+09
## 5000 3853 ABC Supply 0.73 4.70e+09
## 5001 4788 CDW 0.41 1.01e+10
## Industry Employees City State
## 4992 Business Products & Services 4036 Chicago IL
## 4993 Health 625 Twinsburg OH
## 4994 Retail 4589 West Des Moines IA
## 4995 Construction 4470 Boise ID
## 4996 Consumer Products & Services 3341 Huntersville NC
## 4997 IT Services 3000 Tarrytown NY
## 4998 Food & Beverage 3919 Mt. Sterling IL
## 4999 Consumer Products & Services 10000 New York NY
## 5000 Construction 6549 Beloit WI
## 5001 Computer Hardware 6800 Vernon Hills IL
## RevenueF
## 4992 $2400000000
## 4993 $2700000000
## 4994 $2800000000
## 4995 $2800000000
## 4996 $3500000000
## 4997 $3800000000
## 4998 $4500000000
## 4999 $4600000000
## 5000 $4700000000
## 5001 $10100000000
Let’s see revenue and employees by industry
inc1<-inc%>%group_by(Industry)%>%summarise(ncompanies=sum(!is.na(Revenue)),trevenue = sum(Revenue),temployees=sum(Employees, na.rm=TRUE),arevenue=mean(Revenue),aemployee=mean(Employees,na.rm=TRUE),agrowth=mean(Growth_Rate*Revenue)/sum(Revenue),arev_per_emp=sum(Revenue)/sum(Employees,na.rm=TRUE))
inc1%>%arrange(desc(ncompanies))
## # A tibble: 25 x 8
## Industry ncompanies trevenue temployees arevenue aemployee agrowth
## <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 IT Serv~ 733 2.07e10 102788 2.82e7 140. 0.00274
## 2 Busines~ 482 2.64e10 117357 5.47e7 244. 0.00349
## 3 Adverti~ 471 7.78e 9 39731 1.65e7 84.4 0.0196
## 4 Health 355 1.79e10 82430 5.03e7 233. 0.00802
## 5 Software 342 8.14e 9 51262 2.38e7 150. 0.0116
## 6 Financi~ 260 1.32e10 47693 5.06e7 183. 0.0150
## 7 Manufac~ 256 1.27e10 43942 4.95e7 172. 0.00655
## 8 Consume~ 203 1.50e10 45464 7.37e7 224. 0.0515
## 9 Retail 203 1.03e10 37068 5.05e7 183. 0.0144
## 10 Governm~ 202 6.01e 9 26185 2.97e7 130. 0.0285
## # ... with 15 more rows, and 1 more variable: arev_per_emp <dbl>
inc1%>%arrange(desc(arevenue))
## # A tibble: 25 x 8
## Industry ncompanies trevenue temployees arevenue aemployee agrowth
## <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Compute~ 44 1.19e10 9714 2.70e8 221. 0.0153
## 2 Energy 109 1.38e10 26437 1.26e8 243. 0.332
## 3 Food & ~ 131 1.29e10 65911 9.86e7 511. 0.0148
## 4 Logisti~ 155 1.48e10 39994 9.57e7 260. 0.0185
## 5 Consume~ 203 1.50e10 45464 7.37e7 224. 0.0515
## 6 Constru~ 187 1.32e10 29099 7.05e7 156. 0.00894
## 7 Telecom~ 129 7.33e 9 30842 5.69e7 243. 0.0138
## 8 Busines~ 482 2.64e10 117357 5.47e7 244. 0.00349
## 9 Security 73 3.81e 9 41059 5.22e7 562. 0.0247
## 10 Environ~ 51 2.64e 9 10155 5.17e7 199. 0.0650
## # ... with 15 more rows, and 1 more variable: arev_per_emp <dbl>
inc1%>%arrange(desc(aemployee))
## # A tibble: 25 x 8
## Industry ncompanies trevenue temployees arevenue aemployee agrowth
## <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Human R~ 196 9.25e 9 226980 4.72e7 1158. 0.0100
## 2 Security 73 3.81e 9 41059 5.22e7 562. 0.0247
## 3 Food & ~ 131 1.29e10 65911 9.86e7 511. 0.0148
## 4 Travel ~ 62 2.93e 9 23035 4.73e7 372. 0.0193
## 5 Enginee~ 74 2.53e 9 20435 3.42e7 276. 0.0262
## 6 Logisti~ 155 1.48e10 39994 9.57e7 260. 0.0185
## 7 Busines~ 482 2.64e10 117357 5.47e7 244. 0.00349
## 8 Telecom~ 129 7.33e 9 30842 5.69e7 243. 0.0138
## 9 Energy 109 1.38e10 26437 1.26e8 243. 0.332
## 10 Health 355 1.79e10 82430 5.03e7 233. 0.00802
## # ... with 15 more rows, and 1 more variable: arev_per_emp <dbl>
inc1%>%arrange(desc(agrowth))
## # A tibble: 25 x 8
## Industry ncompanies trevenue temployees arevenue aemployee agrowth
## <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Energy 109 1.38e10 26437 1.26e8 243. 0.332
## 2 Real Es~ 96 2.97e 9 18893 3.09e7 199. 0.0773
## 3 Environ~ 51 2.64e 9 10155 5.17e7 199. 0.0650
## 4 Media 54 1.74e 9 9532 3.23e7 177. 0.0525
## 5 Consume~ 203 1.50e10 45464 7.37e7 224. 0.0515
## 6 Educati~ 83 1.14e 9 7685 1.37e7 92.6 0.0409
## 7 Insuran~ 50 2.34e 9 7339 4.68e7 147. 0.0315
## 8 Governm~ 202 6.01e 9 26185 2.97e7 130. 0.0285
## 9 Enginee~ 74 2.53e 9 20435 3.42e7 276. 0.0262
## 10 Security 73 3.81e 9 41059 5.22e7 562. 0.0247
## # ... with 15 more rows, and 1 more variable: arev_per_emp <dbl>
inc1%>%arrange(desc(arev_per_emp))
## # A tibble: 25 x 8
## Industry ncompanies trevenue temployees arevenue aemployee agrowth
## <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Compute~ 44 1.19e10 9714 2.70e8 221. 0.0153
## 2 Energy 109 1.38e10 26437 1.26e8 243. 0.332
## 3 Constru~ 187 1.32e10 29099 7.05e7 156. 0.00894
## 4 Logisti~ 155 1.48e10 39994 9.57e7 260. 0.0185
## 5 Consume~ 203 1.50e10 45464 7.37e7 224. 0.0515
## 6 Insuran~ 50 2.34e 9 7339 4.68e7 147. 0.0315
## 7 Manufac~ 256 1.27e10 43942 4.95e7 172. 0.00655
## 8 Retail 203 1.03e10 37068 5.05e7 183. 0.0144
## 9 Financi~ 260 1.32e10 47693 5.06e7 183. 0.0150
## 10 Environ~ 51 2.64e 9 10155 5.17e7 199. 0.0650
## # ... with 15 more rows, and 1 more variable: arev_per_emp <dbl>
Let’s see revenue and employees by State
inc2<-inc%>%group_by(State)%>%summarise(ncompanies=sum(!is.na(Revenue)),trevenue = sum(Revenue),temployees=sum(Employees, na.rm=TRUE),arevenue=mean(Revenue),aemployee=mean(Employees,na.rm=TRUE),agrowth=mean(Growth_Rate*Revenue)/sum(Revenue),arev_per_emp=sum(Revenue)/sum(Employees,na.rm=TRUE))
inc2%>%arrange(desc(ncompanies))
## # A tibble: 52 x 8
## State ncompanies trevenue temployees arevenue aemployee agrowth
## <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 CA 701 2.35e10 161219 3.35e7 230. 0.00966
## 2 TX 387 2.22e10 90765 5.73e7 235. 0.0596
## 3 NY 311 1.83e10 84370 5.87e7 271. 0.00482
## 4 VA 283 8.67e 9 35667 3.06e7 126. 0.0145
## 5 FL 282 1.06e10 61221 3.76e7 217. 0.0143
## 6 IL 273 3.32e10 103266 1.22e8 380. 0.00480
## 7 GA 212 6.47e 9 34546 3.05e7 164. 0.0217
## 8 OH 186 1.28e10 38002 6.87e7 204. 0.0117
## 9 MA 182 6.04e 9 24682 3.32e7 136. 0.0334
## 10 PA 164 5.67e 9 30392 3.46e7 186. 0.0138
## # ... with 42 more rows, and 1 more variable: arev_per_emp <dbl>
inc2%>%arrange(desc(arevenue))
## # A tibble: 52 x 8
## State ncompanies trevenue temployees arevenue aemployee agrowth
## <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 ID 17 3.94e 9 5817 2.32e8 342. 0.0336
## 2 AK 2 3.43e 8 2528 1.72e8 1264 0.270
## 3 IA 28 3.45e 9 11344 1.23e8 405. 0.0255
## 4 IL 273 3.32e10 103266 1.22e8 380. 0.00480
## 5 HI 7 6.96e 8 621 9.95e7 88.7 0.325
## 6 WI 79 7.30e 9 15548 9.24e7 202. 0.0123
## 7 DC 43 3.28e 9 9221 7.63e7 220. 0.495
## 8 OH 186 1.28e10 38002 6.87e7 204. 0.0117
## 9 NC 137 9.26e 9 36685 6.76e7 272. 0.0114
## 10 MI 126 7.81e 9 36905 6.20e7 293. 0.0109
## # ... with 42 more rows, and 1 more variable: arev_per_emp <dbl>
inc2%>%arrange(desc(aemployee))
## # A tibble: 52 x 8
## State ncompanies trevenue temployees arevenue aemployee agrowth
## <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 DE 16 6.77e 8 68544 4.23e7 4284 0.118
## 2 AK 2 3.43e 8 2528 1.72e8 1264 0.270
## 3 MS 12 5.25e 8 5531 4.38e7 461. 0.139
## 4 MT 4 2.46e 7 1673 6.15e6 418. 0.173
## 5 IA 28 3.45e 9 11344 1.23e8 405. 0.0255
## 6 IL 273 3.32e10 103266 1.22e8 380. 0.00480
## 7 AZ 100 5.50e 9 34281 5.50e7 343. 0.0243
## 8 ID 17 3.94e 9 5817 2.32e8 342. 0.0336
## 9 MD 131 3.30e 9 40439 2.52e7 309. 0.0326
## 10 MO 59 2.66e 9 17296 4.52e7 293. 0.0197
## # ... with 42 more rows, and 1 more variable: arev_per_emp <dbl>
inc2%>%arrange(desc(agrowth))
## # A tibble: 52 x 8
## State ncompanies trevenue temployees arevenue aemployee agrowth
## <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 ME 13 1.62e8 879 1.25e7 67.6 2.77
## 2 WY 2 6.95e7 107 3.48e7 53.5 2.48
## 3 PR 1 2.30e6 29 2.30e6 29 1.73
## 4 RI 16 7.52e8 2964 4.70e7 185. 0.581
## 5 DC 43 3.28e9 9221 7.63e7 220. 0.495
## 6 SD 3 1.77e7 761 5.90e6 254. 0.392
## 7 VT 6 2.77e8 1069 4.62e7 178. 0.355
## 8 NM 5 4.82e7 617 9.64e6 123. 0.334
## 9 WV 2 3.13e7 240 1.56e7 120 0.332
## 10 HI 7 6.96e8 621 9.95e7 88.7 0.325
## # ... with 42 more rows, and 1 more variable: arev_per_emp <dbl>
inc2%>%arrange(desc(arev_per_emp))
## # A tibble: 52 x 8
## State ncompanies trevenue temployees arevenue aemployee agrowth
## <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 HI 7 6.96e 8 621 9.95e7 88.7 0.325
## 2 ID 17 3.94e 9 5817 2.32e8 342. 0.0336
## 3 WY 2 6.95e 7 107 3.48e7 53.5 2.48
## 4 WI 79 7.30e 9 15548 9.24e7 202. 0.0123
## 5 DC 43 3.28e 9 9221 7.63e7 220. 0.495
## 6 CT 50 2.47e 9 6989 4.95e7 140. 0.107
## 7 NH 24 1.00e 9 2890 4.17e7 120. 0.0618
## 8 OH 186 1.28e10 38002 6.87e7 204. 0.0117
## 9 IL 273 3.32e10 103266 1.22e8 380. 0.00480
## 10 OR 49 1.40e 9 4399 2.86e7 89.8 0.0419
## # ... with 42 more rows, and 1 more variable: arev_per_emp <dbl>
Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.
library(ggplot2)
a<-ggplot(inc, aes(x=State)) + geom_bar(color="black", fill="white")
a+coord_flip() +scale_y_continuous("Number of Companies")+ggtitle("Distribution of Companies by State")
#a+coord_fixed(ratio = 0.05)+ theme(axis.text.x = element_text(angle = 90, hjust = 1))
Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.
inc%>%count(State)%>%arrange(desc(n))%>%slice(3)
## # A tibble: 1 x 2
## State n
## <fct> <int>
## 1 NY 311
incNY<-inc%>%filter(State=='NY')
incNY<-incNY[complete.cases(incNY),]
incNY1<-incNY%>%group_by(Industry)%>%summarise(ncompanies=sum(!is.na(Revenue)),trevenue = sum(Revenue),temployees=sum(Employees, na.rm=TRUE),arevenue=mean(Revenue),aemployee=mean(Employees,na.rm=TRUE),agrowth=mean(Growth_Rate*Revenue)/sum(Revenue),arev_per_emp=sum(Revenue)/sum(Employees,na.rm=TRUE))
head(incNY1)
## # A tibble: 6 x 8
## Industry ncompanies trevenue temployees arevenue aemployee agrowth
## <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Adverti~ 57 9.49e8 3331 1.66e7 58.4 0.110
## 2 Busines~ 26 2.55e9 38804 9.81e7 1492. 0.0233
## 3 Compute~ 1 2.29e7 44 2.29e7 44 1.19
## 4 Constru~ 6 8.23e7 366 1.37e7 61 0.183
## 5 Consume~ 17 4.80e9 10647 2.82e8 626. 0.0462
## 6 Educati~ 14 7.08e7 838 5.06e6 59.9 0.167
## # ... with 1 more variable: arev_per_emp <dbl>
ggplot(incNY, aes(x=Industry, y=Employees)) + geom_boxplot()+ scale_y_continuous("Average Employees", trans='log2')+coord_flip()+ggtitle("Boxplot of Employment by Industry in NY State")
We will try to remove outliers. We will do it by industry. We will use 1.5IQR approach
incNYwoOut <- incNY %>%
group_by(Industry) %>%
filter((Employees <= quantile(Employees,0.75)+1.5*IQR(Employees))&Employees >= quantile(Employees,0.25)-1.5*IQR(Employees))
ggplot(incNYwoOut, aes(x=Industry, y=Employees)) + geom_boxplot()+ scale_y_continuous("Average Employees",trans='log2')+coord_flip()+ggtitle("Boxplot of Employment by Industry in NY State - No outliers")
Looks better.
incNY1<-incNYwoOut%>%group_by(Industry)%>%summarise(ncompanies=sum(!is.na(Revenue)),trevenue = sum(Revenue),temployees=sum(Employees, na.rm=TRUE),arevenue=mean(Revenue),aemployee=mean(Employees,na.rm=TRUE),agrowth=mean(Growth_Rate*Revenue)/sum(Revenue),arev_per_emp=sum(Revenue)/sum(Employees,na.rm=TRUE),minemp=min(Employees,na.rm=TRUE),maxemp=max(Employees,na.rm=TRUE))
ggplot(data = incNY1, mapping = aes(x = Industry, y = aemployee))+geom_pointrange(mapping = aes(ymin = minemp, ymax = maxemp))+ scale_y_continuous("Average Employees",trans='log2')+coord_flip()+ggtitle("Plot of Average Employment per Company by Industry in NY State - No outliers")
Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.
inc$RevEmp<-inc$Revenue/inc$Employees
incCC<-inc[complete.cases(inc),]%>%arrange(desc(RevEmp))
head(incCC)
## Rank Name Growth_Rate Revenue
## 1 2675 Cedar Petrochemcials 1.31 2.037e+08
## 2 4 Bridger 233.08 1.900e+09
## 3 1978 Hightowers Petroleum Company 1.90 3.030e+08
## 4 564 Fast Fusion 7.98 1.280e+07
## 5 4371 NeoGov 0.55 1.260e+07
## 6 17 Intelligent Audit 105.73 1.450e+08
## Industry Employees City State RevEmp
## 1 Energy 5 New York NY 40740000
## 2 Energy 50 Addison TX 38000000
## 3 Energy 19 Franklin OH 15947368
## 4 Manufacturing 1 Palisade CO 12800000
## 5 Software 1 El Segunod CA 12600000
## 6 Logistics & Transportation 15 Rochelle Park NJ 9666667
ggplot(incCC, aes(x=Industry, y=RevEmp))+ scale_y_continuous("Revenue per Employee", trans='log2') + geom_boxplot()+coord_flip()+ggtitle("Boxplot of Ave Revenue per Employee by Industry")
Taking out outliers.
incCCwoOut <- incCC %>%
group_by(Industry) %>%
filter((RevEmp <= quantile(RevEmp,0.75)+1.5*IQR(RevEmp))&(RevEmp>= quantile(RevEmp,0.25)-1.5*IQR(RevEmp)))
ggplot(incCCwoOut, aes(x=Industry, y=RevEmp)) + geom_boxplot()+ scale_y_continuous("Revenue Per Employee",trans='log2')+coord_flip()+ggtitle("Boxplot of Revenue per Employee by Industry - No outliers")
incRevPerEmp<-incCCwoOut%>%group_by(Industry)%>%summarise(ncompanies=sum(!is.na(Revenue)),trevenue = sum(Revenue),temployees=sum(Employees, na.rm=TRUE),arevenue=mean(Revenue),aemployee=mean(Employees,na.rm=TRUE),agrowth=mean(Growth_Rate*Revenue)/sum(Revenue),arev_per_emp=sum(Revenue)/sum(Employees,na.rm=TRUE),minrpe=min(RevEmp,na.rm=TRUE),maxrpe=max(RevEmp,na.rm=TRUE))%>%arrange(desc(arev_per_emp))
head(incRevPerEmp)
## # A tibble: 6 x 10
## Industry ncompanies trevenue temployees arevenue aemployee agrowth
## <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Compute~ 39 1.13e10 9518 2.90e8 244. 0.0166
## 2 Constru~ 169 1.19e10 28400 7.04e7 168. 0.00836
## 3 Energy 96 8.26e 9 25150 8.61e7 262. 0.0614
## 4 Consume~ 187 1.43e10 45102 7.65e7 241. 0.0560
## 5 Logisti~ 138 1.12e10 39124 8.15e7 284. 0.0124
## 6 Retail 193 1.01e10 36947 5.21e7 191. 0.0134
## # ... with 3 more variables: arev_per_emp <dbl>, minrpe <dbl>,
## # maxrpe <dbl>
ggplot(data = incRevPerEmp, mapping = aes(x = reorder(Industry, arev_per_emp), y = arev_per_emp))+geom_pointrange(mapping = aes(ymin = minrpe, ymax = maxrpe))+ scale_y_continuous("Revenue Per Employee",trans='log2')+ggtitle("Plot of Average Revenue per Employee by Industry - No outliers")+theme(axis.text.x = element_text(angle = 90, hjust = 1))+ scale_x_discrete("Industry")