R Notebook - Data 608 2019 Spring Knowledge and Visual Analytics 01

Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:

inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)

And lets preview this data:

head(inc)

##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX

summary(inc)

##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2502   110 Consulting        :   1   Mean   :  4.612  
##  3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4995                    
##     Revenue                                  Industry      Employees      
##  Min.   :2.000e+06   IT Services                 : 733   Min.   :    1.0  
##  1st Qu.:5.100e+06   Business Products & Services: 482   1st Qu.:   25.0  
##  Median :1.090e+07   Advertising & Marketing     : 471   Median :   53.0  
##  Mean   :4.822e+07   Health                      : 355   Mean   :  232.7  
##  3rd Qu.:2.860e+07   Software                    : 342   3rd Qu.:  132.0  
##  Max.   :1.010e+10   Financial Services          : 260   Max.   :66803.0  
##                      (Other)                     :2358   NA's   :12       
##             City          State     
##  New York     : 160   CA     : 701  
##  Chicago      :  90   TX     : 387  
##  Austin       :  88   NY     : 311  
##  Houston      :  76   VA     : 283  
##  San Francisco:  75   FL     : 282  
##  Atlanta      :  74   IL     : 273  
##  (Other)      :4438   (Other):2764

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

tail(inc)

##      Rank               Name Growth_Rate  Revenue
## 4996 4996              cSubs        0.34 1.34e+07
## 4997 4997          Dot Foods        0.34 4.50e+09
## 4998 4998 Lethal Performance        0.34 6.80e+06
## 4999 4999   ArcaTech Systems        0.34 3.26e+07
## 5000 5000                INE        0.34 6.80e+06
## 5001 5000               ALL4        0.34 4.70e+06
##                          Industry Employees         City State
## 4996 Business Products & Services        19     Montvale    NJ
## 4997              Food & Beverage      3919 Mt. Sterling    IL
## 4998                       Retail         8   Wellington    FL
## 4999           Financial Services        63       Mebane    NC
## 5000                  IT Services        35     Bellevue    WA
## 5001       Environmental Services        34    Kimberton    PA

#filter(inc,Name=='Cedar Petrochemcials')

I will look at some frequencies

head(as.data.frame(prop.table(table(inc$Industry))))

##                           Var1       Freq
## 1      Advertising & Marketing 0.09418116
## 2 Business Products & Services 0.09638072
## 3            Computer Hardware 0.00879824
## 4                 Construction 0.03739252
## 5 Consumer Products & Services 0.04059188
## 6                    Education 0.01659668

head(as.data.frame(prop.table(table(inc$State))))

##   Var1       Freq
## 1   AK 0.00039992
## 2   AL 0.01019796
## 3   AR 0.00179964
## 4   AZ 0.01999600
## 5   CA 0.14017197
## 6   CO 0.02679464

Let’s look at the companies with lowest and highest revenue

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.5.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

inc%>%arrange(Revenue)%>%mutate(RevenueF = sprintf('$%.0f', Revenue))%>%head(10)

##    Rank                      Name Growth_Rate Revenue
## 1   246   Cardinal Point Captains       17.65   2e+06
## 2   252                   Benzara       17.02   2e+06
## 3   439           SPICA Computers       10.41   2e+06
## 4   459           Bows-N-Ties.com       10.03   2e+06
## 5   500     Mpact Financial Group        9.18   2e+06
## 6   588            ADA Collection        7.72   2e+06
## 7   602        The Rocket Company        7.62   2e+06
## 8   635 Kyzen Consulting Services        7.25   2e+06
## 9   959     Punchkick Interactive        4.52   2e+06
## 10 1150         FindMyCompany.com        3.63   2e+06
##                        Industry Employees            City State RevenueF
## 1           Government Services        30        Carlsbad    CA $2000000
## 2  Consumer Products & Services         5      Naperville    IL $2000000
## 3                   IT Services        30     Jersey City    NJ $2000000
## 4                        Retail         4   San Francisco    CA $2000000
## 5            Financial Services        11          Dallas    TX $2000000
## 6  Consumer Products & Services        14          folsom    CA $2000000
## 7  Business Products & Services         8         Cumming    GA $2000000
## 8           Government Services         7 West Palm Beach    FL $2000000
## 9       Advertising & Marketing        27         Chicago    IL $2000000
## 10      Advertising & Marketing        30          Austin    TX $2000000

inc%>%arrange(Revenue)%>%mutate(RevenueF = sprintf('$%.0f', Revenue))%>%tail(10)

##      Rank                       Name Growth_Rate  Revenue
## 4992 2522                  DLA Piper        1.41 2.40e+09
## 4993 1397          EnvisionRxOptions        2.88 2.70e+09
## 4994 4052                   Kum & Go        0.65 2.80e+09
## 4995 4802              Boise Cascade        0.41 2.80e+09
## 4996 4246 American Tire Distributors        0.59 3.50e+09
## 4997 4716              Westcon Group        0.44 3.80e+09
## 4998 4997                  Dot Foods        0.34 4.50e+09
## 4999 4936                       Coty        0.36 4.60e+09
## 5000 3853                 ABC Supply        0.73 4.70e+09
## 5001 4788                        CDW        0.41 1.01e+10
##                          Industry Employees            City State
## 4992 Business Products & Services      4036         Chicago    IL
## 4993                       Health       625       Twinsburg    OH
## 4994                       Retail      4589 West Des Moines    IA
## 4995                 Construction      4470           Boise    ID
## 4996 Consumer Products & Services      3341    Huntersville    NC
## 4997                  IT Services      3000       Tarrytown    NY
## 4998              Food & Beverage      3919    Mt. Sterling    IL
## 4999 Consumer Products & Services     10000        New York    NY
## 5000                 Construction      6549          Beloit    WI
## 5001            Computer Hardware      6800    Vernon Hills    IL
##          RevenueF
## 4992  $2400000000
## 4993  $2700000000
## 4994  $2800000000
## 4995  $2800000000
## 4996  $3500000000
## 4997  $3800000000
## 4998  $4500000000
## 4999  $4600000000
## 5000  $4700000000
## 5001 $10100000000

Let’s see revenue and employees by industry

inc1<-inc%>%group_by(Industry)%>%summarise(ncompanies=sum(!is.na(Revenue)),trevenue = sum(Revenue),temployees=sum(Employees, na.rm=TRUE),arevenue=mean(Revenue),aemployee=mean(Employees,na.rm=TRUE),agrowth=mean(Growth_Rate*Revenue)/sum(Revenue),arev_per_emp=sum(Revenue)/sum(Employees,na.rm=TRUE))

inc1%>%arrange(desc(ncompanies))

## # A tibble: 25 x 8
##    Industry ncompanies trevenue temployees arevenue aemployee agrowth
##    <fct>         <int>    <dbl>      <int>    <dbl>     <dbl>   <dbl>
##  1 IT Serv~        733  2.07e10     102788   2.82e7     140.  0.00274
##  2 Busines~        482  2.64e10     117357   5.47e7     244.  0.00349
##  3 Adverti~        471  7.78e 9      39731   1.65e7      84.4 0.0196 
##  4 Health          355  1.79e10      82430   5.03e7     233.  0.00802
##  5 Software        342  8.14e 9      51262   2.38e7     150.  0.0116 
##  6 Financi~        260  1.32e10      47693   5.06e7     183.  0.0150 
##  7 Manufac~        256  1.27e10      43942   4.95e7     172.  0.00655
##  8 Consume~        203  1.50e10      45464   7.37e7     224.  0.0515 
##  9 Retail          203  1.03e10      37068   5.05e7     183.  0.0144 
## 10 Governm~        202  6.01e 9      26185   2.97e7     130.  0.0285 
## # ... with 15 more rows, and 1 more variable: arev_per_emp <dbl>

inc1%>%arrange(desc(arevenue))

## # A tibble: 25 x 8
##    Industry ncompanies trevenue temployees arevenue aemployee agrowth
##    <fct>         <int>    <dbl>      <int>    <dbl>     <dbl>   <dbl>
##  1 Compute~         44  1.19e10       9714   2.70e8      221. 0.0153 
##  2 Energy          109  1.38e10      26437   1.26e8      243. 0.332  
##  3 Food & ~        131  1.29e10      65911   9.86e7      511. 0.0148 
##  4 Logisti~        155  1.48e10      39994   9.57e7      260. 0.0185 
##  5 Consume~        203  1.50e10      45464   7.37e7      224. 0.0515 
##  6 Constru~        187  1.32e10      29099   7.05e7      156. 0.00894
##  7 Telecom~        129  7.33e 9      30842   5.69e7      243. 0.0138 
##  8 Busines~        482  2.64e10     117357   5.47e7      244. 0.00349
##  9 Security         73  3.81e 9      41059   5.22e7      562. 0.0247 
## 10 Environ~         51  2.64e 9      10155   5.17e7      199. 0.0650 
## # ... with 15 more rows, and 1 more variable: arev_per_emp <dbl>

inc1%>%arrange(desc(aemployee))

## # A tibble: 25 x 8
##    Industry ncompanies trevenue temployees arevenue aemployee agrowth
##    <fct>         <int>    <dbl>      <int>    <dbl>     <dbl>   <dbl>
##  1 Human R~        196  9.25e 9     226980   4.72e7     1158. 0.0100 
##  2 Security         73  3.81e 9      41059   5.22e7      562. 0.0247 
##  3 Food & ~        131  1.29e10      65911   9.86e7      511. 0.0148 
##  4 Travel ~         62  2.93e 9      23035   4.73e7      372. 0.0193 
##  5 Enginee~         74  2.53e 9      20435   3.42e7      276. 0.0262 
##  6 Logisti~        155  1.48e10      39994   9.57e7      260. 0.0185 
##  7 Busines~        482  2.64e10     117357   5.47e7      244. 0.00349
##  8 Telecom~        129  7.33e 9      30842   5.69e7      243. 0.0138 
##  9 Energy          109  1.38e10      26437   1.26e8      243. 0.332  
## 10 Health          355  1.79e10      82430   5.03e7      233. 0.00802
## # ... with 15 more rows, and 1 more variable: arev_per_emp <dbl>

inc1%>%arrange(desc(agrowth))

## # A tibble: 25 x 8
##    Industry ncompanies trevenue temployees arevenue aemployee agrowth
##    <fct>         <int>    <dbl>      <int>    <dbl>     <dbl>   <dbl>
##  1 Energy          109  1.38e10      26437   1.26e8     243.   0.332 
##  2 Real Es~         96  2.97e 9      18893   3.09e7     199.   0.0773
##  3 Environ~         51  2.64e 9      10155   5.17e7     199.   0.0650
##  4 Media            54  1.74e 9       9532   3.23e7     177.   0.0525
##  5 Consume~        203  1.50e10      45464   7.37e7     224.   0.0515
##  6 Educati~         83  1.14e 9       7685   1.37e7      92.6  0.0409
##  7 Insuran~         50  2.34e 9       7339   4.68e7     147.   0.0315
##  8 Governm~        202  6.01e 9      26185   2.97e7     130.   0.0285
##  9 Enginee~         74  2.53e 9      20435   3.42e7     276.   0.0262
## 10 Security         73  3.81e 9      41059   5.22e7     562.   0.0247
## # ... with 15 more rows, and 1 more variable: arev_per_emp <dbl>

inc1%>%arrange(desc(arev_per_emp))

## # A tibble: 25 x 8
##    Industry ncompanies trevenue temployees arevenue aemployee agrowth
##    <fct>         <int>    <dbl>      <int>    <dbl>     <dbl>   <dbl>
##  1 Compute~         44  1.19e10       9714   2.70e8      221. 0.0153 
##  2 Energy          109  1.38e10      26437   1.26e8      243. 0.332  
##  3 Constru~        187  1.32e10      29099   7.05e7      156. 0.00894
##  4 Logisti~        155  1.48e10      39994   9.57e7      260. 0.0185 
##  5 Consume~        203  1.50e10      45464   7.37e7      224. 0.0515 
##  6 Insuran~         50  2.34e 9       7339   4.68e7      147. 0.0315 
##  7 Manufac~        256  1.27e10      43942   4.95e7      172. 0.00655
##  8 Retail          203  1.03e10      37068   5.05e7      183. 0.0144 
##  9 Financi~        260  1.32e10      47693   5.06e7      183. 0.0150 
## 10 Environ~         51  2.64e 9      10155   5.17e7      199. 0.0650 
## # ... with 15 more rows, and 1 more variable: arev_per_emp <dbl>

Let’s see revenue and employees by State

inc2<-inc%>%group_by(State)%>%summarise(ncompanies=sum(!is.na(Revenue)),trevenue = sum(Revenue),temployees=sum(Employees, na.rm=TRUE),arevenue=mean(Revenue),aemployee=mean(Employees,na.rm=TRUE),agrowth=mean(Growth_Rate*Revenue)/sum(Revenue),arev_per_emp=sum(Revenue)/sum(Employees,na.rm=TRUE))

inc2%>%arrange(desc(ncompanies))

## # A tibble: 52 x 8
##    State ncompanies trevenue temployees arevenue aemployee agrowth
##    <fct>      <int>    <dbl>      <int>    <dbl>     <dbl>   <dbl>
##  1 CA           701  2.35e10     161219   3.35e7      230. 0.00966
##  2 TX           387  2.22e10      90765   5.73e7      235. 0.0596 
##  3 NY           311  1.83e10      84370   5.87e7      271. 0.00482
##  4 VA           283  8.67e 9      35667   3.06e7      126. 0.0145 
##  5 FL           282  1.06e10      61221   3.76e7      217. 0.0143 
##  6 IL           273  3.32e10     103266   1.22e8      380. 0.00480
##  7 GA           212  6.47e 9      34546   3.05e7      164. 0.0217 
##  8 OH           186  1.28e10      38002   6.87e7      204. 0.0117 
##  9 MA           182  6.04e 9      24682   3.32e7      136. 0.0334 
## 10 PA           164  5.67e 9      30392   3.46e7      186. 0.0138 
## # ... with 42 more rows, and 1 more variable: arev_per_emp <dbl>

inc2%>%arrange(desc(arevenue))

## # A tibble: 52 x 8
##    State ncompanies trevenue temployees arevenue aemployee agrowth
##    <fct>      <int>    <dbl>      <int>    <dbl>     <dbl>   <dbl>
##  1 ID            17  3.94e 9       5817   2.32e8     342.  0.0336 
##  2 AK             2  3.43e 8       2528   1.72e8    1264   0.270  
##  3 IA            28  3.45e 9      11344   1.23e8     405.  0.0255 
##  4 IL           273  3.32e10     103266   1.22e8     380.  0.00480
##  5 HI             7  6.96e 8        621   9.95e7      88.7 0.325  
##  6 WI            79  7.30e 9      15548   9.24e7     202.  0.0123 
##  7 DC            43  3.28e 9       9221   7.63e7     220.  0.495  
##  8 OH           186  1.28e10      38002   6.87e7     204.  0.0117 
##  9 NC           137  9.26e 9      36685   6.76e7     272.  0.0114 
## 10 MI           126  7.81e 9      36905   6.20e7     293.  0.0109 
## # ... with 42 more rows, and 1 more variable: arev_per_emp <dbl>

inc2%>%arrange(desc(aemployee))

## # A tibble: 52 x 8
##    State ncompanies trevenue temployees arevenue aemployee agrowth
##    <fct>      <int>    <dbl>      <int>    <dbl>     <dbl>   <dbl>
##  1 DE            16  6.77e 8      68544   4.23e7     4284  0.118  
##  2 AK             2  3.43e 8       2528   1.72e8     1264  0.270  
##  3 MS            12  5.25e 8       5531   4.38e7      461. 0.139  
##  4 MT             4  2.46e 7       1673   6.15e6      418. 0.173  
##  5 IA            28  3.45e 9      11344   1.23e8      405. 0.0255 
##  6 IL           273  3.32e10     103266   1.22e8      380. 0.00480
##  7 AZ           100  5.50e 9      34281   5.50e7      343. 0.0243 
##  8 ID            17  3.94e 9       5817   2.32e8      342. 0.0336 
##  9 MD           131  3.30e 9      40439   2.52e7      309. 0.0326 
## 10 MO            59  2.66e 9      17296   4.52e7      293. 0.0197 
## # ... with 42 more rows, and 1 more variable: arev_per_emp <dbl>

inc2%>%arrange(desc(agrowth))

## # A tibble: 52 x 8
##    State ncompanies trevenue temployees arevenue aemployee agrowth
##    <fct>      <int>    <dbl>      <int>    <dbl>     <dbl>   <dbl>
##  1 ME            13   1.62e8        879   1.25e7      67.6   2.77 
##  2 WY             2   6.95e7        107   3.48e7      53.5   2.48 
##  3 PR             1   2.30e6         29   2.30e6      29     1.73 
##  4 RI            16   7.52e8       2964   4.70e7     185.    0.581
##  5 DC            43   3.28e9       9221   7.63e7     220.    0.495
##  6 SD             3   1.77e7        761   5.90e6     254.    0.392
##  7 VT             6   2.77e8       1069   4.62e7     178.    0.355
##  8 NM             5   4.82e7        617   9.64e6     123.    0.334
##  9 WV             2   3.13e7        240   1.56e7     120     0.332
## 10 HI             7   6.96e8        621   9.95e7      88.7   0.325
## # ... with 42 more rows, and 1 more variable: arev_per_emp <dbl>

inc2%>%arrange(desc(arev_per_emp))

## # A tibble: 52 x 8
##    State ncompanies trevenue temployees arevenue aemployee agrowth
##    <fct>      <int>    <dbl>      <int>    <dbl>     <dbl>   <dbl>
##  1 HI             7  6.96e 8        621   9.95e7      88.7 0.325  
##  2 ID            17  3.94e 9       5817   2.32e8     342.  0.0336 
##  3 WY             2  6.95e 7        107   3.48e7      53.5 2.48   
##  4 WI            79  7.30e 9      15548   9.24e7     202.  0.0123 
##  5 DC            43  3.28e 9       9221   7.63e7     220.  0.495  
##  6 CT            50  2.47e 9       6989   4.95e7     140.  0.107  
##  7 NH            24  1.00e 9       2890   4.17e7     120.  0.0618 
##  8 OH           186  1.28e10      38002   6.87e7     204.  0.0117 
##  9 IL           273  3.32e10     103266   1.22e8     380.  0.00480
## 10 OR            49  1.40e 9       4399   2.86e7      89.8 0.0419 
## # ... with 42 more rows, and 1 more variable: arev_per_emp <dbl>

Question 1

Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.

library(ggplot2)

a<-ggplot(inc, aes(x=State)) + geom_bar(color="black", fill="white")

a+coord_flip() +scale_y_continuous("Number of Companies")+ggtitle("Distribution of Companies by State")

#a+coord_fixed(ratio = 0.05)+ theme(axis.text.x = element_text(angle = 90, hjust = 1))

Quesiton 2

Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

inc%>%count(State)%>%arrange(desc(n))%>%slice(3)

## # A tibble: 1 x 2
##   State     n
##   <fct> <int>
## 1 NY      311

incNY<-inc%>%filter(State=='NY')

incNY<-incNY[complete.cases(incNY),]

incNY1<-incNY%>%group_by(Industry)%>%summarise(ncompanies=sum(!is.na(Revenue)),trevenue = sum(Revenue),temployees=sum(Employees, na.rm=TRUE),arevenue=mean(Revenue),aemployee=mean(Employees,na.rm=TRUE),agrowth=mean(Growth_Rate*Revenue)/sum(Revenue),arev_per_emp=sum(Revenue)/sum(Employees,na.rm=TRUE))

head(incNY1)

## # A tibble: 6 x 8
##   Industry ncompanies trevenue temployees arevenue aemployee agrowth
##   <fct>         <int>    <dbl>      <int>    <dbl>     <dbl>   <dbl>
## 1 Adverti~         57   9.49e8       3331   1.66e7      58.4  0.110 
## 2 Busines~         26   2.55e9      38804   9.81e7    1492.   0.0233
## 3 Compute~          1   2.29e7         44   2.29e7      44    1.19  
## 4 Constru~          6   8.23e7        366   1.37e7      61    0.183 
## 5 Consume~         17   4.80e9      10647   2.82e8     626.   0.0462
## 6 Educati~         14   7.08e7        838   5.06e6      59.9  0.167 
## # ... with 1 more variable: arev_per_emp <dbl>

ggplot(incNY, aes(x=Industry, y=Employees)) + geom_boxplot()+ scale_y_continuous("Average Employees", trans='log2')+coord_flip()+ggtitle("Boxplot of Employment by Industry in NY State")

We will try to remove outliers. We will do it by industry. We will use 1.5IQR approach

incNYwoOut <- incNY %>%
  group_by(Industry) %>%
  filter((Employees <= quantile(Employees,0.75)+1.5*IQR(Employees))&Employees >= quantile(Employees,0.25)-1.5*IQR(Employees))

ggplot(incNYwoOut, aes(x=Industry, y=Employees)) + geom_boxplot()+ scale_y_continuous("Average Employees",trans='log2')+coord_flip()+ggtitle("Boxplot of Employment by Industry in NY State - No outliers")

Looks better.

incNY1<-incNYwoOut%>%group_by(Industry)%>%summarise(ncompanies=sum(!is.na(Revenue)),trevenue = sum(Revenue),temployees=sum(Employees, na.rm=TRUE),arevenue=mean(Revenue),aemployee=mean(Employees,na.rm=TRUE),agrowth=mean(Growth_Rate*Revenue)/sum(Revenue),arev_per_emp=sum(Revenue)/sum(Employees,na.rm=TRUE),minemp=min(Employees,na.rm=TRUE),maxemp=max(Employees,na.rm=TRUE))

ggplot(data = incNY1, mapping = aes(x = Industry, y = aemployee))+geom_pointrange(mapping = aes(ymin = minemp, ymax = maxemp))+ scale_y_continuous("Average Employees",trans='log2')+coord_flip()+ggtitle("Plot of Average Employment per Company by Industry in NY State - No outliers")

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

inc$RevEmp<-inc$Revenue/inc$Employees

incCC<-inc[complete.cases(inc),]%>%arrange(desc(RevEmp))

head(incCC)

##   Rank                         Name Growth_Rate   Revenue
## 1 2675         Cedar Petrochemcials        1.31 2.037e+08
## 2    4                      Bridger      233.08 1.900e+09
## 3 1978 Hightowers Petroleum Company        1.90 3.030e+08
## 4  564                  Fast Fusion        7.98 1.280e+07
## 5 4371                       NeoGov        0.55 1.260e+07
## 6   17            Intelligent Audit      105.73 1.450e+08
##                     Industry Employees          City State   RevEmp
## 1                     Energy         5      New York    NY 40740000
## 2                     Energy        50       Addison    TX 38000000
## 3                     Energy        19      Franklin    OH 15947368
## 4              Manufacturing         1      Palisade    CO 12800000
## 5                   Software         1    El Segunod    CA 12600000
## 6 Logistics & Transportation        15 Rochelle Park    NJ  9666667

ggplot(incCC, aes(x=Industry, y=RevEmp))+ scale_y_continuous("Revenue per Employee", trans='log2') + geom_boxplot()+coord_flip()+ggtitle("Boxplot of Ave Revenue per Employee by Industry")

Taking out outliers.

incCCwoOut <- incCC %>%
  group_by(Industry) %>%
  filter((RevEmp <= quantile(RevEmp,0.75)+1.5*IQR(RevEmp))&(RevEmp>= quantile(RevEmp,0.25)-1.5*IQR(RevEmp)))

ggplot(incCCwoOut, aes(x=Industry, y=RevEmp)) + geom_boxplot()+ scale_y_continuous("Revenue Per Employee",trans='log2')+coord_flip()+ggtitle("Boxplot of Revenue per Employee by Industry - No outliers")

incRevPerEmp<-incCCwoOut%>%group_by(Industry)%>%summarise(ncompanies=sum(!is.na(Revenue)),trevenue = sum(Revenue),temployees=sum(Employees, na.rm=TRUE),arevenue=mean(Revenue),aemployee=mean(Employees,na.rm=TRUE),agrowth=mean(Growth_Rate*Revenue)/sum(Revenue),arev_per_emp=sum(Revenue)/sum(Employees,na.rm=TRUE),minrpe=min(RevEmp,na.rm=TRUE),maxrpe=max(RevEmp,na.rm=TRUE))%>%arrange(desc(arev_per_emp))

head(incRevPerEmp)

## # A tibble: 6 x 10
##   Industry ncompanies trevenue temployees arevenue aemployee agrowth
##   <fct>         <int>    <dbl>      <int>    <dbl>     <dbl>   <dbl>
## 1 Compute~         39  1.13e10       9518   2.90e8      244. 0.0166 
## 2 Constru~        169  1.19e10      28400   7.04e7      168. 0.00836
## 3 Energy           96  8.26e 9      25150   8.61e7      262. 0.0614 
## 4 Consume~        187  1.43e10      45102   7.65e7      241. 0.0560 
## 5 Logisti~        138  1.12e10      39124   8.15e7      284. 0.0124 
## 6 Retail          193  1.01e10      36947   5.21e7      191. 0.0134 
## # ... with 3 more variables: arev_per_emp <dbl>, minrpe <dbl>,
## #   maxrpe <dbl>

ggplot(data = incRevPerEmp, mapping = aes(x = reorder(Industry, arev_per_emp), y = arev_per_emp))+geom_pointrange(mapping = aes(ymin = minrpe, ymax = maxrpe))+ scale_y_continuous("Revenue Per Employee",trans='log2')+ggtitle("Plot of Average Revenue per Employee by Industry - No outliers")+theme(axis.text.x = element_text(angle = 90, hjust = 1))+ scale_x_discrete("Industry")

R Notebook - Data 608 2019 Spring Knowledge and Visual Analytics 01

Mikhail Groysman

February 9, 2019

Question 1

Quesiton 2

Question 3