Principles of Data Visualization and Introduction to ggplot2
I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. Let's read this in:
inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)
And let's preview this data:
head(inc)
##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX
summary(inc)
##       Rank                  Name                Growth_Rate
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420
##  Mean   :2502   110 Consulting        :   1   Mean   :  4.612
##  3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480
##                 (Other)               :4995
##     Revenue                      Industry                  Employees
##  Min.   :2.000e+06   IT Services                 : 733   Min.   :    1.0
##  1st Qu.:5.100e+06   Business Products & Services: 482   1st Qu.:   25.0
##  Median :1.090e+07   Advertising & Marketing     : 471   Median :   53.0
##  Mean   :4.822e+07   Health                      : 355   Mean   :  232.7
##  3rd Qu.:2.860e+07   Software                    : 342   3rd Qu.:  132.0
##  Max.   :1.010e+10   Financial Services          : 260   Max.   :66803.0
##                      (Other)                     :2358   NA's   :12
##         City               State
##  New York     : 160   CA     : 701
##  Chicago      :  90   TX     : 387
##  Austin       :  88   NY     : 311
##  Houston      :  76   VA     : 283
##  San Francisco:  75   FL     : 282
##  Atlanta      :  74   IL     : 273
##  (Other)      :4438   (Other):2764
Think a bit about what these summaries mean. Use the space below to add any further non-visual exploratory information that you think helps you understand this data:
# Insert your code here, create more chunks as necessary
library(plyr)
head(arrange(inc,desc(Revenue)), n = 50)
## Rank Name Growth_Rate Revenue
## 1 4788 CDW 0.41 1.010e+10
## 2 3853 ABC Supply 0.73 4.700e+09
## 3 4936 Coty 0.36 4.600e+09
## 4 4997 Dot Foods 0.34 4.500e+09
## 5 4716 Westcon Group 0.44 3.800e+09
## 6 4246 American Tire Distributors 0.59 3.500e+09
## 7 4052 Kum & Go 0.65 2.800e+09
## 8 4802 Boise Cascade 0.41 2.800e+09
## 9 1397 EnvisionRxOptions 2.88 2.700e+09
## 10 2522 DLA Piper 1.41 2.400e+09
## 11 4629 Prime Therapeutics 0.47 2.000e+09
## 12 4 Bridger 233.08 1.900e+09
## 13 1843 Sun Coast Resources 2.08 1.900e+09
## 14 3844 Atlas Oil Company 0.74 1.900e+09
## 15 4961 Kirkland & Ellis 0.36 1.900e+09
## 16 1488 Sprouts Farmers Market 2.68 1.800e+09
## 17 4689 Global Brass and Copper Holdings 0.45 1.700e+09
## 18 3463 Hogan Lovells 0.89 1.600e+09
## 19 2145 AdvancePierre Foods 1.73 1.500e+09
## 20 3650 Genco 0.82 1.500e+09
## 21 960 Advanced Disposal 4.51 1.400e+09
## 22 2236 Total Quality Logistics 1.65 1.400e+09
## 23 2496 Carahsoft Technology 1.42 1.400e+09
## 24 3414 Restoration Hardware 0.91 1.200e+09
## 25 1908 Diplomat Specialty Pharmacy 1.99 1.100e+09
## 26 3734 KSS 0.78 1.000e+09
## 27 3425 Blackhawk Network Holdings 0.90 9.591e+08
## 28 1996 Ambit Energy 1.88 9.303e+08
## 29 4532 GoDaddy.com 0.49 9.109e+08
## 30 2958 ImmixGroup 1.15 8.836e+08
## 31 4793 LORD Corporation 0.41 8.615e+08
## 32 3163 Quinn Emanuel 1.03 8.525e+08
## 33 2961 Genesis-ATC 1.15 8.462e+08
## 34 1284 Hearthside Food Solutions 3.17 8.396e+08
## 35 1480 Coyote Logistics 2.70 7.864e+08
## 36 4765 Squire Sanders 0.42 7.745e+08
## 37 3921 Granite Telecommunications 0.70 7.362e+08
## 38 4854 Arnold & Porter 0.40 7.310e+08
## 39 1869 Universal Services of America 2.04 7.181e+08
## 40 2274 Liberty Power 1.61 7.108e+08
## 41 4427 Sunshine Minting 0.53 7.061e+08
## 42 4459 Belcan 0.52 6.887e+08
## 43 2806 Goodman Networks 1.23 6.509e+08
## 44 4009 Schumacher Group 0.67 6.130e+08
## 45 4815 Perkins Coie 0.40 6.080e+08
## 46 1768 The Cellular Connection 2.17 6.065e+08
## 47 2537 Wayfair.com 1.40 6.020e+08
## 48 4577 Sutherland Global Services 0.48 5.976e+08
## 49 4093 Advanced BioEnergy 0.64 5.848e+08
## 50 4857 AVI-SPL 0.39 5.807e+08
## Industry Employees City State
## 1 Computer Hardware 6800 Vernon Hills IL
## 2 Construction 6549 Beloit WI
## 3 Consumer Products & Services 10000 New York NY
## 4 Food & Beverage 3919 Mt. Sterling IL
## 5 IT Services 3000 Tarrytown NY
## 6 Consumer Products & Services 3341 Huntersville NC
## 7 Retail 4589 West Des Moines IA
## 8 Construction 4470 Boise ID
## 9 Health 625 Twinsburg OH
## 10 Business Products & Services 4036 Chicago IL
## 11 Health 2549 Eagan MN
## 12 Energy 50 Addison TX
## 13 Energy 1640 Houston TX
## 14 Logistics & Transportation 374 Taylor MI
## 15 Business Products & Services 1517 Chicago IL
## 16 Consumer Products & Services 13200 Phoenix AZ
## 17 Manufacturing 1986 Schaumburg IL
## 18 Business Products & Services 2280 Washington DC
## 19 Food & Beverage 4000 Cincinnati OH
## 20 Logistics & Transportation 10800 Pittsburgh PA
## 21 Environmental Services 5347 Ponte Vedra FL
## 22 Logistics & Transportation 2116 Cincinnati OH
## 23 Government Services 365 Reston VA
## 24 Retail 2900 Corte Madera CA
## 25 Health 761 Flint MI
## 26 Manufacturing 8500 Sterling Heights MI
## 27 Financial Services 725 Pleasanton CA
## 28 Energy 492 Dallas TX
## 29 IT Services 3369 Scottsdale AZ
## 30 IT Services 252 McLean VA
## 31 Manufacturing 2959 Cary NC
## 32 Business Products & Services 697 Los Angeles CA
## 33 Telecommunications 347 San Antonio TX
## 34 Food & Beverage 5000 Downers Grove IL
## 35 Logistics & Transportation 1219 Chicago IL
## 36 Business Products & Services 1257 Cleveland OH
## 37 Telecommunications 1000 Quincy MA
## 38 Business Products & Services 748 Washington DC
## 39 Security 20000 Santa Ana CA
## 40 Energy 277 Fort Lauderdale FL
## 41 Manufacturing 280 Coeur D' Alene ID
## 42 Engineering 10000 Cincinnati OH
## 43 Telecommunications 1693 Plano TX
## 44 Health 1945 Lafayette LA
## 45 Business Products & Services 823 Seattle WA
## 46 Telecommunications 1428 Marion IN
## 47 Retail 1300 Boston MA
## 48 Business Products & Services 32000 Pittsford NY
## 49 Energy 75 Bloomington MN
## 50 Business Products & Services 1800 Tampa FL
head(arrange(inc, desc(Employees)), n = 50)
## Rank Name Growth_Rate Revenue
## 1 2345 Integrity staffing Solutions 1.55 2.782e+08
## 2 4577 Sutherland Global Services 0.48 5.976e+08
## 3 1869 Universal Services of America 2.04 7.181e+08
## 4 3456 The Seaton Companies 0.89 4.815e+08
## 5 2871 PrideStaff 1.19 1.143e+08
## 6 2314 Infiniti HR 1.57 1.464e+08
## 7 4655 CareersUSA 0.46 3.130e+07
## 8 1488 Sprouts Farmers Market 2.68 1.800e+09
## 9 4140 Cornerstone Staffing Solutions 0.63 9.600e+07
## 10 3650 Genco 0.82 1.500e+09
## 11 2876 BG Staffing 1.19 7.680e+07
## 12 3452 VXI Global Solutions 0.89 1.839e+08
## 13 4459 Belcan 0.52 6.887e+08
## 14 4936 Coty 0.36 4.600e+09
## 15 3734 KSS 0.78 1.000e+09
## 16 4508 Bojangles' Famous Chicken 'n Biscuits 0.50 3.488e+08
## 17 2229 Tandem HR 1.66 3.942e+08
## 18 4305 Towne Park 0.57 1.789e+08
## 19 4134 Collabera 0.63 4.561e+08
## 20 4329 Noodles & Company 0.56 3.004e+08
## 21 4788 CDW 0.41 1.010e+10
## 22 3853 ABC Supply 0.73 4.700e+09
## 23 745 Charming Charlie 6.15 3.710e+08
## 24 3687 EventPro Strategies 0.80 1.170e+07
## 25 1431 MAU Workforce Solutions 2.81 2.153e+08
## 26 960 Advanced Disposal 4.51 1.400e+09
## 27 1284 Hearthside Food Solutions 3.17 8.396e+08
## 28 3490 Benchmark Hospitality International 0.88 5.066e+08
## 29 4850 Whelan Security 0.40 1.449e+08
## 30 4286 RuffaloCODY 0.57 8.130e+07
## 31 4052 Kum & Go 0.65 2.800e+09
## 32 2180 Security Industry Specialist 1.69 9.530e+07
## 33 4802 Boise Cascade 0.41 2.800e+09
## 34 3717 Heartland Dental Care 0.79 5.540e+08
## 35 3898 Pacific Dental Services 0.71 5.266e+08
## 36 4811 Flying Food Group 0.41 4.097e+08
## 37 4542 Orion Associates 0.49 5.590e+07
## 38 15 LivingSocial 123.33 5.360e+08
## 39 2522 DLA Piper 1.41 2.400e+09
## 40 2145 AdvancePierre Foods 1.73 1.500e+09
## 41 4108 First Hospitality Group 0.64 2.435e+08
## 42 4997 Dot Foods 0.34 4.500e+09
## 43 3451 TempStaff 0.89 2.410e+07
## 44 4541 Acadian Companies 0.49 3.515e+08
## 45 2034 Digital Intelligence Systems 1.84 3.280e+08
## 46 3613 Pacific Bells 0.83 1.992e+08
## 47 4532 GoDaddy.com 0.49 9.109e+08
## 48 4871 Pinnacle Technical Resources 0.39 2.388e+08
## 49 4246 American Tire Distributors 0.59 3.500e+09
## 50 3573 Program Productions 0.85 2.070e+07
## Industry Employees City State
## 1 Human Resources 66803 Wilmington DE
## 2 Business Products & Services 32000 Pittsford NY
## 3 Security 20000 Santa Ana CA
## 4 Human Resources 18887 Chicago IL
## 5 Human Resources 17057 Fresno CA
## 6 Human Resources 17000 Olney MD
## 7 Human Resources 14451 Boca Raton FL
## 8 Consumer Products & Services 13200 Phoenix AZ
## 9 Human Resources 13071 Pleasanton CA
## 10 Logistics & Transportation 10800 Pittsburgh PA
## 11 Human Resources 10611 Dallas TX
## 12 Telecommunications 10000 Los Angeles CA
## 13 Engineering 10000 Cincinnati OH
## 14 Consumer Products & Services 10000 New York NY
## 15 Manufacturing 8500 Sterling Heights MI
## 16 Food & Beverage 7681 Charlotte NC
## 17 Human Resources 7612 Oak Brook IL
## 18 Human Resources 7052 Annapolis MD
## 19 IT Services 7000 Morristown NJ
## 20 Food & Beverage 7000 Broomfield CO
## 21 Computer Hardware 6800 Vernon Hills IL
## 22 Construction 6549 Beloit WI
## 23 Retail 5821 Houston TX
## 24 Advertising & Marketing 5637 Scottsdale AZ
## 25 Human Resources 5400 Augusta GA
## 26 Environmental Services 5347 Ponte Vedra FL
## 27 Food & Beverage 5000 Downers Grove IL
## 28 Travel & Hospitality 4878 The Woodlands TX
## 29 Security 4720 St. Louis MO
## 30 Business Products & Services 4600 Cedar Rapids IA
## 31 Retail 4589 West Des Moines IA
## 32 Security 4510 Culver City CA
## 33 Construction 4470 Boise ID
## 34 Business Products & Services 4392 Effingham IL
## 35 Health 4390 Irvine CA
## 36 Food & Beverage 4223 chicago IL
## 37 Human Resources 4129 Golden Valley MN
## 38 Consumer Products & Services 4100 Washington DC
## 39 Business Products & Services 4036 Chicago IL
## 40 Food & Beverage 4000 Cincinnati OH
## 41 Travel & Hospitality 3921 Rosemont IL
## 42 Food & Beverage 3919 Mt. Sterling IL
## 43 Human Resources 3892 Jackson MS
## 44 Health 3725 Lafayette LA
## 45 IT Services 3500 McLean VA
## 46 Food & Beverage 3400 Vancouver WA
## 47 IT Services 3369 Scottsdale AZ
## 48 Human Resources 3355 Dallas TX
## 49 Consumer Products & Services 3341 Huntersville NC
## 50 Media 3300 Lombard IL
sd(inc$Growth_Rate)
## [1] 14.12369
sd(inc$Revenue)
## [1] 240542281
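The summary above shows a mean growth rate (4.612) well above the median (1.420), which points to strong right skew, so robust measures and per-group summaries are worth a look. The following is a minimal sketch of a few such checks. It uses a small made-up data frame, `toy` (hypothetical values, not taken from the Inc. data), so that it runs on its own; the same calls should carry over to `inc`.

```r
# Hypothetical toy data with the same column names as `inc` (values are made up)
toy <- data.frame(
  Industry    = c("Health", "Health", "Energy", "Energy", "Software"),
  Growth_Rate = c(0.5, 1.5, 2.0, 40.0, 1.0),
  Employees   = c(10, NA, 50, 20, 5),
  stringsAsFactors = FALSE
)

# Count missing values in each column
na_counts <- colSums(is.na(toy))

# Spread measures that are less sensitive to extreme values than sd()
growth_iqr <- IQR(toy$Growth_Rate)
growth_mad <- mad(toy$Growth_Rate)

# Number of companies and median growth rate per industry
industry_n      <- table(toy$Industry)
industry_median <- aggregate(Growth_Rate ~ Industry, data = toy, FUN = median)

print(na_counts)
print(industry_n)
print(industry_median)
```

Comparing `sd()` with `IQR()` or `mad()` on the real columns gives a quick sense of how much a handful of extreme companies drives the spread.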
Create a graph that shows the distribution of companies in the dataset by State (i.e. how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (i.e. taller than wide), which should further guide your layout choices.
# Answer Question 1 here
# Horizontal bar plot of company counts per state (one colour per state, legend hidden)
library(ggplot2)
counts <- table(inc$State)
df_counts <- as.data.frame(counts)
bp <- ggplot(data = df_counts, aes(x = Var1, y = Freq, fill = Var1)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ylim(0, 705) +
  scale_x_discrete(limits = df_counts$Var1) +
  theme(legend.position = "none")
bp + labs(title = "Total Distribution of Companies by U.S. State", y = "Companies in the Inc. 5000", x = "States")
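With this many bars, ordering them by count rather than alphabetically usually makes a tall chart easier to scan. The sketch below shows one possible way to do that. It is an illustration, not the original answer: `toy_states` is a made-up vector standing in for `inc$State`, and the plot is only drawn if ggplot2 is installed.

```r
# Hypothetical state labels standing in for inc$State (values are made up)
toy_states <- c("CA", "NY", "CA", "TX", "CA", "NY")

# Tabulate, convert to a data frame, and sort by count (ascending)
df_states <- as.data.frame(table(State = toy_states), stringsAsFactors = FALSE)
df_states <- df_states[order(df_states$Freq), ]

# Fix the factor level order so the plot keeps the sorted order;
# after coord_flip() the last level (largest count) is drawn at the top
df_states$State <- factor(df_states$State, levels = df_states$State)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p_states <- ggplot(df_states, aes(x = State, y = Freq)) +
    geom_col() +
    coord_flip() +
    labs(x = "State", y = "Number of companies")
  print(p_states)
}
```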
Let's dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data; use R’s complete.cases() function). In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.
# Answer Question 2 here
# Keep only New York companies (`==` tests equality) and the two columns of interest
ny_inc <- subset(inc, State == "NY", select = c(Industry, Employees))
head(ny_inc)
# Use only complete cases, as the question requires
ny_inc <- ny_inc[complete.cases(ny_inc), ]
# Mean number of employees per industry; the formula interface
# averages only the numeric Employees column
aggny <- aggregate(Employees ~ Industry, data = ny_inc, FUN = mean)
head(aggny)
nydata <- ggplot(data = aggny, aes(x = Industry, y = Employees, fill = Industry)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme(legend.position = "none")
nydata + labs(title = "Average Number of Employees per Industry in New York", y = "Employees", x = "Industry")
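A bar chart of means does not show spread or outliers, which the question also asks about. One possible way to cover that, sketched here on made-up data rather than the real New York subset, is to summarise each industry by its median and quartiles and to flag outliers with the usual 1.5 × IQR rule through `boxplot.stats()`. The data frame `toy_ny` and all of its values are hypothetical.

```r
# Hypothetical stand-in for the New York subset (values are made up)
toy_ny <- data.frame(
  Industry  = c("Health", "Health", "Health", "Health", "Health",
                "Energy", "Energy", "Energy"),
  Employees = c(10, 12, 14, 16, 500, 40, 60, NA),
  stringsAsFactors = FALSE
)

# Keep only rows with no missing values
toy_ny <- toy_ny[complete.cases(toy_ny), ]

# Quartiles (25%, 50%, 75%) of Employees within each industry
emp_summary <- aggregate(Employees ~ Industry, data = toy_ny,
                         FUN = function(x) quantile(x, c(0.25, 0.5, 0.75)))

# Flag values lying more than 1.5 * IQR beyond the hinges, per industry
is_outlier <- function(x) x %in% boxplot.stats(x)$out
toy_ny$Outlier <- ave(toy_ny$Employees, toy_ny$Industry, FUN = is_outlier) == 1

print(emp_summary)
print(toy_ny[toy_ny$Outlier, ])

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # A log scale keeps a few very large employers from flattening the boxes
  p_emp <- ggplot(toy_ny, aes(x = Industry, y = Employees)) +
    geom_boxplot() +
    scale_y_log10() +
    coord_flip()
  print(p_emp)
}
```

A box plot shows the median, the interquartile range, and individual outliers in one figure, and the log scale is one way of dealing with outliers without dropping them.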
Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.
# Answer Question 3 here
us_inc = inc[c(4:6)]
head(us_inc)
##     Revenue                     Industry Employees
## 1 1.179e+08 Consumer Products & Services       104
## 2 4.960e+07          Government Services        51
## 3 2.550e+07                       Health       132
## 4 1.900e+09                       Energy        50
## 5 8.700e+07      Advertising & Marketing       220
## 6 4.570e+07                  Real Estate        63
aggus = aggregate(. ~ Industry, FUN = sum, data = us_inc,
na.rm = TRUE, na.action = na.pass)
head(aggus)
##                       Industry     Revenue Employees
## 1      Advertising & Marketing  7785000000     39731
## 2 Business Products & Services 26367900000    117357
## 3            Computer Hardware 11885700000      9714
## 4                 Construction 13174300000     29099
## 5 Consumer Products & Services 14956400000     45464
## 6                    Education  1139300000      7685
# Revenue per employee, in units of $10,000 and rounded to a whole number
aggus$RevEmp = round(aggus$Revenue/aggus$Employees/10000,0)
head(aggus)
##                       Industry     Revenue Employees RevEmp
## 1      Advertising & Marketing  7785000000     39731     20
## 2 Business Products & Services 26367900000    117357     22
## 3            Computer Hardware 11885700000      9714    122
## 4                 Construction 13174300000     29099     45
## 5 Consumer Products & Services 14956400000     45464     33
## 6                    Education  1139300000      7685     15
usdata = ggplot(data=aggus, aes(x=Industry, y=RevEmp, fill = Industry))+
  geom_bar(stat="identity")+
  coord_flip()+
  ylim(0, 123)+
  scale_x_discrete(limits = aggus$Industry)+
  theme(legend.position = "none")
usdata + labs(title = "Revenue per Employee per Industry in the U.S.A.", x = "Industry", y = "Revenue per employee (tens of thousands of USD)")
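Dividing total revenue by total employees gives a single number per industry, so it cannot show the distribution the question asks for. A possible alternative, sketched below on made-up data, is to compute revenue per employee for each company first and then summarise or plot those values by industry. The data frame `toy_us` and its values are hypothetical.

```r
# Hypothetical stand-in for the Revenue, Industry and Employees columns of `inc`
toy_us <- data.frame(
  Industry  = c("Energy", "Energy", "Health", "Health", "Health"),
  Revenue   = c(2e6, 9e6, 1e6, 3e6, 5e6),
  Employees = c(10, 30, 10, NA, 25),
  stringsAsFactors = FALSE
)

# Drop rows with missing values so revenue and headcount
# always refer to the same set of companies
toy_us <- toy_us[complete.cases(toy_us), ]

# Revenue per employee for each company
toy_us$RevPerEmp <- toy_us$Revenue / toy_us$Employees

# Median revenue per employee by industry
rev_median <- aggregate(RevPerEmp ~ Industry, data = toy_us, FUN = median)
print(rev_median)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # One box per industry shows the spread across companies
  p_rev <- ggplot(toy_us, aes(x = Industry, y = RevPerEmp)) +
    geom_boxplot() +
    scale_y_log10() +
    coord_flip() +
    labs(x = "Industry", y = "Revenue per employee (USD, log scale)")
  print(p_rev)
}
```

Working at the company level also avoids mixing revenue from companies whose employee count is missing into the industry totals.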