Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest-growing companies in the US, as compiled by Inc. magazine. Let's read this in:

inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)

And let's preview this data:

head(inc)
##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX
summary(inc)
##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2502   110 Consulting        :   1   Mean   :  4.612  
##  3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4995                    
##     Revenue                                  Industry      Employees      
##  Min.   :2.000e+06   IT Services                 : 733   Min.   :    1.0  
##  1st Qu.:5.100e+06   Business Products & Services: 482   1st Qu.:   25.0  
##  Median :1.090e+07   Advertising & Marketing     : 471   Median :   53.0  
##  Mean   :4.822e+07   Health                      : 355   Mean   :  232.7  
##  3rd Qu.:2.860e+07   Software                    : 342   3rd Qu.:  132.0  
##  Max.   :1.010e+10   Financial Services          : 260   Max.   :66803.0  
##                      (Other)                     :2358   NA's   :12       
##             City          State     
##  New York     : 160   CA     : 701  
##  Chicago      :  90   TX     : 387  
##  Austin       :  88   NY     : 311  
##  Houston      :  76   VA     : 283  
##  San Francisco:  75   FL     : 282  
##  Atlanta      :  74   IL     : 273  
##  (Other)      :4438   (Other):2764

Think a bit about what these summaries mean. Use the space below to add more non-visual exploratory information that you think helps you understand this data:

# Insert your code here, create more chunks as necessary
library(plyr)

head(arrange(inc,desc(Revenue)), n = 50)
##    Rank                             Name Growth_Rate   Revenue
## 1  4788                              CDW        0.41 1.010e+10
## 2  3853                       ABC Supply        0.73 4.700e+09
## 3  4936                             Coty        0.36 4.600e+09
## 4  4997                        Dot Foods        0.34 4.500e+09
## 5  4716                    Westcon Group        0.44 3.800e+09
## 6  4246       American Tire Distributors        0.59 3.500e+09
## 7  4052                         Kum & Go        0.65 2.800e+09
## 8  4802                    Boise Cascade        0.41 2.800e+09
## 9  1397                EnvisionRxOptions        2.88 2.700e+09
## 10 2522                        DLA Piper        1.41 2.400e+09
## 11 4629               Prime Therapeutics        0.47 2.000e+09
## 12    4                          Bridger      233.08 1.900e+09
## 13 1843              Sun Coast Resources        2.08 1.900e+09
## 14 3844                Atlas Oil Company        0.74 1.900e+09
## 15 4961                 Kirkland & Ellis        0.36 1.900e+09
## 16 1488           Sprouts Farmers Market        2.68 1.800e+09
## 17 4689 Global Brass and Copper Holdings        0.45 1.700e+09
## 18 3463                    Hogan Lovells        0.89 1.600e+09
## 19 2145              AdvancePierre Foods        1.73 1.500e+09
## 20 3650                            Genco        0.82 1.500e+09
## 21  960                Advanced Disposal        4.51 1.400e+09
## 22 2236          Total Quality Logistics        1.65 1.400e+09
## 23 2496             Carahsoft Technology        1.42 1.400e+09
## 24 3414             Restoration Hardware        0.91 1.200e+09
## 25 1908      Diplomat Specialty Pharmacy        1.99 1.100e+09
## 26 3734                              KSS        0.78 1.000e+09
## 27 3425       Blackhawk Network Holdings        0.90 9.591e+08
## 28 1996                     Ambit Energy        1.88 9.303e+08
## 29 4532                      GoDaddy.com        0.49 9.109e+08
## 30 2958                       ImmixGroup        1.15 8.836e+08
## 31 4793                 LORD Corporation        0.41 8.615e+08
## 32 3163                    Quinn Emanuel        1.03 8.525e+08
## 33 2961                      Genesis-ATC        1.15 8.462e+08
## 34 1284        Hearthside Food Solutions        3.17 8.396e+08
## 35 1480                 Coyote Logistics        2.70 7.864e+08
## 36 4765                   Squire Sanders        0.42 7.745e+08
## 37 3921       Granite Telecommunications        0.70 7.362e+08
## 38 4854                  Arnold & Porter        0.40 7.310e+08
## 39 1869    Universal Services of America        2.04 7.181e+08
## 40 2274                    Liberty Power        1.61 7.108e+08
## 41 4427                 Sunshine Minting        0.53 7.061e+08
## 42 4459                           Belcan        0.52 6.887e+08
## 43 2806                 Goodman Networks        1.23 6.509e+08
## 44 4009                 Schumacher Group        0.67 6.130e+08
## 45 4815                     Perkins Coie        0.40 6.080e+08
## 46 1768          The Cellular Connection        2.17 6.065e+08
## 47 2537                      Wayfair.com        1.40 6.020e+08
## 48 4577       Sutherland Global Services        0.48 5.976e+08
## 49 4093               Advanced BioEnergy        0.64 5.848e+08
## 50 4857                          AVI-SPL        0.39 5.807e+08
##                        Industry Employees             City State
## 1             Computer Hardware      6800     Vernon Hills    IL
## 2                  Construction      6549           Beloit    WI
## 3  Consumer Products & Services     10000         New York    NY
## 4               Food & Beverage      3919     Mt. Sterling    IL
## 5                   IT Services      3000        Tarrytown    NY
## 6  Consumer Products & Services      3341     Huntersville    NC
## 7                        Retail      4589  West Des Moines    IA
## 8                  Construction      4470            Boise    ID
## 9                        Health       625        Twinsburg    OH
## 10 Business Products & Services      4036          Chicago    IL
## 11                       Health      2549            Eagan    MN
## 12                       Energy        50          Addison    TX
## 13                       Energy      1640          Houston    TX
## 14   Logistics & Transportation       374           Taylor    MI
## 15 Business Products & Services      1517          Chicago    IL
## 16 Consumer Products & Services     13200          Phoenix    AZ
## 17                Manufacturing      1986       Schaumburg    IL
## 18 Business Products & Services      2280       Washington    DC
## 19              Food & Beverage      4000       Cincinnati    OH
## 20   Logistics & Transportation     10800       Pittsburgh    PA
## 21       Environmental Services      5347      Ponte Vedra    FL
## 22   Logistics & Transportation      2116       Cincinnati    OH
## 23          Government Services       365           Reston    VA
## 24                       Retail      2900     Corte Madera    CA
## 25                       Health       761            Flint    MI
## 26                Manufacturing      8500 Sterling Heights    MI
## 27           Financial Services       725       Pleasanton    CA
## 28                       Energy       492           Dallas    TX
## 29                  IT Services      3369       Scottsdale    AZ
## 30                  IT Services       252           McLean    VA
## 31                Manufacturing      2959             Cary    NC
## 32 Business Products & Services       697      Los Angeles    CA
## 33           Telecommunications       347      San Antonio    TX
## 34              Food & Beverage      5000    Downers Grove    IL
## 35   Logistics & Transportation      1219          Chicago    IL
## 36 Business Products & Services      1257        Cleveland    OH
## 37           Telecommunications      1000           Quincy    MA
## 38 Business Products & Services       748       Washington    DC
## 39                     Security     20000        Santa Ana    CA
## 40                       Energy       277  Fort Lauderdale    FL
## 41                Manufacturing       280   Coeur D' Alene    ID
## 42                  Engineering     10000       Cincinnati    OH
## 43           Telecommunications      1693            Plano    TX
## 44                       Health      1945        Lafayette    LA
## 45 Business Products & Services       823          Seattle    WA
## 46           Telecommunications      1428           Marion    IN
## 47                       Retail      1300           Boston    MA
## 48 Business Products & Services     32000        Pittsford    NY
## 49                       Energy        75      Bloomington    MN
## 50 Business Products & Services      1800            Tampa    FL
head(arrange(inc, desc(Employees)), n = 50)
##    Rank                                  Name Growth_Rate   Revenue
## 1  2345          Integrity staffing Solutions        1.55 2.782e+08
## 2  4577            Sutherland Global Services        0.48 5.976e+08
## 3  1869         Universal Services of America        2.04 7.181e+08
## 4  3456                  The Seaton Companies        0.89 4.815e+08
## 5  2871                            PrideStaff        1.19 1.143e+08
## 6  2314                           Infiniti HR        1.57 1.464e+08
## 7  4655                            CareersUSA        0.46 3.130e+07
## 8  1488                Sprouts Farmers Market        2.68 1.800e+09
## 9  4140        Cornerstone Staffing Solutions        0.63 9.600e+07
## 10 3650                                 Genco        0.82 1.500e+09
## 11 2876                           BG Staffing        1.19 7.680e+07
## 12 3452                  VXI Global Solutions        0.89 1.839e+08
## 13 4459                                Belcan        0.52 6.887e+08
## 14 4936                                  Coty        0.36 4.600e+09
## 15 3734                                   KSS        0.78 1.000e+09
## 16 4508 Bojangles' Famous Chicken 'n Biscuits        0.50 3.488e+08
## 17 2229                             Tandem HR        1.66 3.942e+08
## 18 4305                            Towne Park        0.57 1.789e+08
## 19 4134                             Collabera        0.63 4.561e+08
## 20 4329                     Noodles & Company        0.56 3.004e+08
## 21 4788                                   CDW        0.41 1.010e+10
## 22 3853                            ABC Supply        0.73 4.700e+09
## 23  745                      Charming Charlie        6.15 3.710e+08
## 24 3687                   EventPro Strategies        0.80 1.170e+07
## 25 1431               MAU Workforce Solutions        2.81 2.153e+08
## 26  960                     Advanced Disposal        4.51 1.400e+09
## 27 1284             Hearthside Food Solutions        3.17 8.396e+08
## 28 3490   Benchmark Hospitality International        0.88 5.066e+08
## 29 4850                       Whelan Security        0.40 1.449e+08
## 30 4286                           RuffaloCODY        0.57 8.130e+07
## 31 4052                              Kum & Go        0.65 2.800e+09
## 32 2180          Security Industry Specialist        1.69 9.530e+07
## 33 4802                         Boise Cascade        0.41 2.800e+09
## 34 3717                 Heartland Dental Care        0.79 5.540e+08
## 35 3898               Pacific Dental Services        0.71 5.266e+08
## 36 4811                     Flying Food Group        0.41 4.097e+08
## 37 4542                      Orion Associates        0.49 5.590e+07
## 38   15                          LivingSocial      123.33 5.360e+08
## 39 2522                             DLA Piper        1.41 2.400e+09
## 40 2145                   AdvancePierre Foods        1.73 1.500e+09
## 41 4108               First Hospitality Group        0.64 2.435e+08
## 42 4997                             Dot Foods        0.34 4.500e+09
## 43 3451                             TempStaff        0.89 2.410e+07
## 44 4541                     Acadian Companies        0.49 3.515e+08
## 45 2034          Digital Intelligence Systems        1.84 3.280e+08
## 46 3613                         Pacific Bells        0.83 1.992e+08
## 47 4532                           GoDaddy.com        0.49 9.109e+08
## 48 4871          Pinnacle Technical Resources        0.39 2.388e+08
## 49 4246            American Tire Distributors        0.59 3.500e+09
## 50 3573                   Program Productions        0.85 2.070e+07
##                        Industry Employees             City State
## 1               Human Resources     66803       Wilmington    DE
## 2  Business Products & Services     32000        Pittsford    NY
## 3                      Security     20000        Santa Ana    CA
## 4               Human Resources     18887          Chicago    IL
## 5               Human Resources     17057           Fresno    CA
## 6               Human Resources     17000            Olney    MD
## 7               Human Resources     14451       Boca Raton    FL
## 8  Consumer Products & Services     13200          Phoenix    AZ
## 9               Human Resources     13071       Pleasanton    CA
## 10   Logistics & Transportation     10800       Pittsburgh    PA
## 11              Human Resources     10611           Dallas    TX
## 12           Telecommunications     10000      Los Angeles    CA
## 13                  Engineering     10000       Cincinnati    OH
## 14 Consumer Products & Services     10000         New York    NY
## 15                Manufacturing      8500 Sterling Heights    MI
## 16              Food & Beverage      7681        Charlotte    NC
## 17              Human Resources      7612        Oak Brook    IL
## 18              Human Resources      7052        Annapolis    MD
## 19                  IT Services      7000       Morristown    NJ
## 20              Food & Beverage      7000       Broomfield    CO
## 21            Computer Hardware      6800     Vernon Hills    IL
## 22                 Construction      6549           Beloit    WI
## 23                       Retail      5821          Houston    TX
## 24      Advertising & Marketing      5637       Scottsdale    AZ
## 25              Human Resources      5400          Augusta    GA
## 26       Environmental Services      5347      Ponte Vedra    FL
## 27              Food & Beverage      5000    Downers Grove    IL
## 28         Travel & Hospitality      4878    The Woodlands    TX
## 29                     Security      4720        St. Louis    MO
## 30 Business Products & Services      4600     Cedar Rapids    IA
## 31                       Retail      4589  West Des Moines    IA
## 32                     Security      4510      Culver City    CA
## 33                 Construction      4470            Boise    ID
## 34 Business Products & Services      4392        Effingham    IL
## 35                       Health      4390           Irvine    CA
## 36              Food & Beverage      4223          chicago    IL
## 37              Human Resources      4129    Golden Valley    MN
## 38 Consumer Products & Services      4100       Washington    DC
## 39 Business Products & Services      4036          Chicago    IL
## 40              Food & Beverage      4000       Cincinnati    OH
## 41         Travel & Hospitality      3921         Rosemont    IL
## 42              Food & Beverage      3919     Mt. Sterling    IL
## 43              Human Resources      3892          Jackson    MS
## 44                       Health      3725        Lafayette    LA
## 45                  IT Services      3500           McLean    VA
## 46              Food & Beverage      3400        Vancouver    WA
## 47                  IT Services      3369       Scottsdale    AZ
## 48              Human Resources      3355           Dallas    TX
## 49 Consumer Products & Services      3341     Huntersville    NC
## 50                        Media      3300          Lombard    IL
sd(inc$Growth_Rate)
## [1] 14.12369
sd(inc$Revenue)
## [1] 240542281
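
As a further non-visual check, the sketch below counts missing values, tallies companies per industry, and compares robust statistics (median, IQR) with the mean. It runs on a small made-up data frame called `toy` that only borrows the column names of `inc`; the values are hypothetical, so substitute `inc` to get real figures.

```r
# Toy stand-in for `inc`: hypothetical values, same column names as the real data
toy <- data.frame(
  Industry  = c("Health", "Health", "Health", "Energy", "Energy", "Software"),
  Employees = c(132, NA, 40, 50, 1640, 25),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Companies per industry, largest group first
industry_counts <- sort(table(toy$Industry), decreasing = TRUE)

# The median and IQR are far less sensitive to one very large company than the mean
emp_mean   <- mean(toy$Employees, na.rm = TRUE)
emp_median <- median(toy$Employees, na.rm = TRUE)
emp_iqr    <- IQR(toy$Employees, na.rm = TRUE)

print(na_counts)
print(industry_counts)
print(c(mean = emp_mean, median = emp_median, IQR = emp_iqr))
```

A large gap between the mean and the median, as in the `summary(inc)` output for Employees, is a hint that a few very large companies dominate the average.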

Question 1

Create a graph that shows the distribution of companies in the dataset by state (i.e., how many are in each state). There are a lot of states, so consider which axis you should use. This visualization is ultimately going to be consumed on a portrait-oriented screen (i.e., taller than wide), which should further guide your layout choices.

# Answer Question 1 here
# Horizontal bar chart of company counts per state (legend hidden)
library(ggplot2)

# Count companies per state; as.data.frame() yields columns Var1 (state) and Freq (count)
counts <- table(inc$State)
df_counts <- as.data.frame(counts)

bp <- ggplot(data = df_counts, aes(x = Var1, y = Freq, fill = Var1)) +
  geom_bar(stat = "identity") +
  coord_flip() +  # states on the vertical axis suit a portrait screen
  ylim(0, 705) +
  theme(legend.position = "none")

bp + labs(title = "Total Distribution of Companies by U.S. State", y = "Number of companies (Inc. 5000)", x = "State")
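
One possible refinement is to sort the bars by count so the ranking can be read at a glance. The sketch below uses only the six states listed in the `summary(inc)` output above; the data frame name `state_counts` is an assumption made for illustration.

```r
library(ggplot2)

# Counts for the six largest states, copied from the summary(inc) output
state_counts <- data.frame(
  State = c("CA", "TX", "NY", "VA", "FL", "IL"),
  Freq  = c(701, 387, 311, 283, 282, 273)
)

# reorder() sorts the factor levels by Freq in ascending order;
# after coord_flip() the last level is drawn at the top, so the largest state leads
ordered_states <- levels(reorder(state_counts$State, state_counts$Freq))

p <- ggplot(state_counts, aes(x = reorder(State, Freq), y = Freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "State", y = "Number of companies")

# print(p) draws the chart
```

Dropping the per-state fill colours is a deliberate choice here: the state is already encoded on the axis, so colour adds no information.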

Question 2

Let's dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data; use R's complete.cases() function). In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

# Answer Question 2 here

# Keep only New York companies and the two columns needed for this question
ny_inc <- subset(inc, State == "NY", select = c(Industry, Employees))
head(ny_inc)
ny_inc <- ny_inc[complete.cases(ny_inc), ]

# Mean number of employees per industry
aggny <- aggregate(Employees ~ Industry, data = ny_inc, FUN = mean)
head(aggny)

nydata <- ggplot(data = aggny, aes(x = Industry, y = Employees, fill = Industry)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme(legend.position = "none")

nydata + labs(title = "Average Number of Employees per Industry in New York", y = "Employees", x = "Industry")
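
Question 2 also asks for the spread of the data and for outliers to be handled, which a bar chart of means does not show. One option, sketched below under assumptions, is a box plot on a log scale with the median as the summary statistic. The data frame `toy` and its values are hypothetical and are not taken from the data set.

```r
library(ggplot2)

# Hypothetical employee counts for two industries; one extreme value in Health
toy <- data.frame(
  Industry  = rep(c("Health", "Software"), each = 5),
  Employees = c(20, 35, 50, 80, 3000, 10, 15, 25, 40, 60)
)
toy <- toy[complete.cases(toy), ]

# The mean is pulled up by the 3000-employee company; the median is not
avg <- aggregate(Employees ~ Industry, data = toy, FUN = mean)
med <- aggregate(Employees ~ Industry, data = toy, FUN = median)

# Box plots show the median, the quartiles and outlying points;
# the log scale keeps the extreme value from flattening the boxes
p <- ggplot(toy, aes(x = Industry, y = Employees)) +
  geom_boxplot() +
  scale_y_log10() +
  coord_flip() +
  labs(x = "Industry", y = "Employees (log scale)")

# print(p) draws the chart
```

With the real data, the same code would be applied to the New York subset in place of `toy`.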

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

# Answer Question 3 here
us_inc = inc[c(4:6)]
head(us_inc)
##     Revenue                     Industry Employees
## 1 1.179e+08 Consumer Products & Services       104
## 2 4.960e+07          Government Services        51
## 3 2.550e+07                       Health       132
## 4 1.900e+09                       Energy        50
## 5 8.700e+07      Advertising & Marketing       220
## 6 4.570e+07                  Real Estate        63
# Total revenue and total employees per industry; missing values are skipped by sum()
aggus <- aggregate(. ~ Industry, FUN = sum, data = us_inc,
                   na.rm = TRUE, na.action = na.pass)
head(aggus)
##                       Industry     Revenue Employees
## 1      Advertising & Marketing  7785000000     39731
## 2 Business Products & Services 26367900000    117357
## 3            Computer Hardware 11885700000      9714
## 4                 Construction 13174300000     29099
## 5 Consumer Products & Services 14956400000     45464
## 6                    Education  1139300000      7685
# Revenue per employee in units of $10,000, rounded to the nearest whole unit
aggus$RevEmp <- round(aggus$Revenue / aggus$Employees / 10000, 0)

head(aggus)
##                       Industry     Revenue Employees RevEmp
## 1      Advertising & Marketing  7785000000     39731     20
## 2 Business Products & Services 26367900000    117357     22
## 3            Computer Hardware 11885700000      9714    122
## 4                 Construction 13174300000     29099     45
## 5 Consumer Products & Services 14956400000     45464     33
## 6                    Education  1139300000      7685     15
usdata <- ggplot(data = aggus, aes(x = Industry, y = RevEmp, fill = Industry)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ylim(0, 123) +
  theme(legend.position = "none")

usdata + labs(title = "Revenue per Employee by Industry in the U.S.A.", x = "Industry", y = "Revenue per employee (tens of thousands of USD)")
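
Question 3 also asks for the distribution within each industry, which a single ratio of totals per industry hides. A possible approach, sketched here on hypothetical values, is to compute revenue per employee for each company and draw one box per industry. The names `toy` and `RevPerEmp` are assumptions made for illustration.

```r
library(ggplot2)

# Hypothetical companies; one row lacks an employee count
toy <- data.frame(
  Industry  = c("Energy", "Energy", "Energy", "Retail", "Retail", "Retail"),
  Revenue   = c(1.9e9, 5.0e7, 2.0e7, 6.0e8, 3.0e7, 1.0e7),
  Employees = c(50, 100, 200, 1300, 150, NA)
)

# Drop incomplete rows, then compute the ratio per company rather than per industry total
toy <- toy[complete.cases(toy), ]
toy$RevPerEmp <- toy$Revenue / toy$Employees

# Median revenue per employee in each industry
med <- aggregate(RevPerEmp ~ Industry, data = toy, FUN = median)

# One box per industry; the log scale copes with ratios spanning orders of magnitude
p <- ggplot(toy, aes(x = Industry, y = RevPerEmp)) +
  geom_boxplot() +
  scale_y_log10() +
  coord_flip() +
  labs(x = "Industry", y = "Revenue per employee (USD, log scale)")

# print(p) draws the chart
```

Computing the ratio per company also keeps a single very large firm from dominating its industry's figure, which can happen when totals are divided.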