Principles of Data Visualization and Introduction to ggplot2
And let's preview this data:
## Rank Name Growth_Rate Revenue
## 1 1 Fuhu 421.48 1.179e+08
## 2 2 FederalConference.com 248.31 4.960e+07
## 3 3 The HCI Group 245.45 2.550e+07
## 4 4 Bridger 233.08 1.900e+09
## 5 5 DataXu 213.37 8.700e+07
## 6 6 MileStone Community Builders 179.38 4.570e+07
## Industry Employees City State
## 1 Consumer Products & Services 104 El Segundo CA
## 2 Government Services 51 Dumfries VA
## 3 Health 132 Jacksonville FL
## 4 Energy 50 Addison TX
## 5 Advertising & Marketing 220 Boston MA
## 6 Real Estate 63 Austin TX
## Rank Name Growth_Rate
## Min. : 1 (Add)ventures : 1 Min. : 0.340
## 1st Qu.:1252 @Properties : 1 1st Qu.: 0.770
## Median :2502 1-Stop Translation USA: 1 Median : 1.420
## Mean :2502 110 Consulting : 1 Mean : 4.612
## 3rd Qu.:3751 11thStreetCoffee.com : 1 3rd Qu.: 3.290
## Max. :5000 123 Exteriors : 1 Max. :421.480
## (Other) :4995
## Revenue Industry Employees
## Min. :2.000e+06 IT Services : 733 Min. : 1.0
## 1st Qu.:5.100e+06 Business Products & Services: 482 1st Qu.: 25.0
## Median :1.090e+07 Advertising & Marketing : 471 Median : 53.0
## Mean :4.822e+07 Health : 355 Mean : 232.7
## 3rd Qu.:2.860e+07 Software : 342 3rd Qu.: 132.0
## Max. :1.010e+10 Financial Services : 260 Max. :66803.0
## (Other) :2358 NA's :12
## City State
## New York : 160 CA : 701
## Chicago : 90 TX : 387
## Austin : 88 NY : 311
## Houston : 76 VA : 283
## San Francisco: 75 FL : 282
## Atlanta : 74 IL : 273
## (Other) :4438 (Other):2764
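The preview and summary above are the kind of output that `head()` and `summary()` produce. Here is a minimal sketch, assuming the data frame is called `inc` (a hypothetical name). It rebuilds only the six previewed rows by hand, because the source of the full file is not shown here.

```r
# Hypothetical data frame `inc`, holding only the six rows previewed above.
inc <- data.frame(
  Rank        = 1:6,
  Name        = c("Fuhu", "FederalConference.com", "The HCI Group",
                  "Bridger", "DataXu", "MileStone Community Builders"),
  Growth_Rate = c(421.48, 248.31, 245.45, 233.08, 213.37, 179.38),
  Revenue     = c(1.179e8, 4.960e7, 2.550e7, 1.900e9, 8.700e7, 4.570e7),
  Industry    = c("Consumer Products & Services", "Government Services",
                  "Health", "Energy", "Advertising & Marketing", "Real Estate"),
  Employees   = c(104, 51, 132, 50, 220, 63),
  City        = c("El Segundo", "Dumfries", "Jacksonville",
                  "Addison", "Boston", "Austin"),
  State       = c("CA", "VA", "FL", "TX", "MA", "TX"),
  stringsAsFactors = TRUE  # factors give the level counts seen in the summary
)

head(inc)     # first six rows
summary(inc)  # quantiles for numeric columns, level counts for factors
```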
The summaries below help in understanding the skew of the data via the total, average, and standard deviation of employees, and they also aggregate revenue (in billions) per state and per industry.
| State | CompanyCount | TotEmployee | AvgEmployee | StdEmployee | SumRevenue_b |
|---|---|---|---|---|---|
| CA | 700 | 161219 | 230.31 | 1213.67 | 23.3646 |
| TX | 386 | 90765 | 235.14 | 739.43 | 22.1543 |
| NY | 311 | 84370 | 271.29 | 1916.18 | 18.2604 |
| VA | 283 | 35667 | 126.03 | 263.66 | 8.6677 |
| FL | 282 | 61221 | 217.10 | 960.70 | 10.6103 |
| IL | 272 | 103266 | 379.65 | 1463.61 | 33.2388 |
| OH | 186 | 38002 | 204.31 | 818.26 | 12.7866 |
| NC | 135 | 36685 | 271.74 | 819.37 | 9.2525 |
| MI | 126 | 36905 | 292.90 | 850.47 | 7.8058 |
| WI | 77 | 15548 | 201.92 | 757.54 | 7.1314 |
| Industry | CompanyCount | TotEmployee | AvgEmployee | StdEmployee | SumRevenue_b |
|---|---|---|---|---|---|
| IT Services | 732 | 102788 | 140.42 | 392.37 | 20.5250 |
| Business Products & Services | 480 | 117357 | 244.49 | 1519.52 | 26.3459 |
| Advertising & Marketing | 471 | 39731 | 84.35 | 287.40 | 7.7850 |
| Health | 354 | 82430 | 232.85 | 490.96 | 17.8601 |
| Software | 341 | 51262 | 150.33 | 267.57 | 8.1346 |
| Financial Services | 260 | 47693 | 183.43 | 302.83 | 13.1509 |
| Manufacturing | 255 | 43942 | 172.32 | 617.10 | 12.6036 |
| Consumer Products & Services | 203 | 45464 | 223.96 | 1214.94 | 14.9564 |
| Retail | 203 | 37068 | 182.60 | 594.90 | 10.2574 |
| Government Services | 202 | 26185 | 129.63 | 182.64 | 6.0091 |
| Human Resources | 196 | 226980 | 1158.06 | 5474.04 | 9.2461 |
| Construction | 187 | 29099 | 155.61 | 589.23 | 13.1743 |
| Logistics & Transportation | 154 | 39994 | 259.70 | 928.82 | 14.8378 |
| Food & Beverage | 129 | 65911 | 510.94 | 1250.18 | 12.8125 |
| Telecommunications | 127 | 30842 | 242.85 | 919.95 | 7.2879 |
| Energy | 109 | 26437 | 242.54 | 454.36 | 13.7716 |
| Real Estate | 95 | 18893 | 198.87 | 412.58 | 2.9568 |
| Education | 83 | 7685 | 92.59 | 136.42 | 1.1393 |
| Engineering | 74 | 20435 | 276.15 | 1166.04 | 2.5325 |
| Security | 73 | 41059 | 562.45 | 2433.98 | 3.8128 |
| Travel & Hospitality | 62 | 23035 | 371.53 | 900.79 | 2.9316 |
| Media | 54 | 9532 | 176.52 | 502.23 | 1.7424 |
| Environmental Services | 51 | 10155 | 199.12 | 742.18 | 2.6388 |
| Insurance | 50 | 7339 | 146.78 | 412.92 | 2.3379 |
| Computer Hardware | 44 | 9714 | 220.77 | 1016.74 | 11.8857 |
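Tables of this shape can be produced with a grouped summary. The sketch below assumes `dplyr` and a data frame named `inc` (a hypothetical name), reduced here to the six previewed rows. Two details are assumptions, not confirmed by the text: dropping rows with missing `Employees`, and dividing revenue by 1e9, which is read off the `_b` suffix.

```r
library(dplyr)

# Hypothetical stand-in for the full data: the six previewed rows only.
inc <- data.frame(
  State     = c("CA", "VA", "FL", "TX", "MA", "TX"),
  Employees = c(104, 51, 132, 50, 220, 63),
  Revenue   = c(1.179e8, 4.960e7, 2.550e7, 1.900e9, 8.700e7, 4.570e7)
)

by_state <- inc %>%
  filter(!is.na(Employees)) %>%   # assumption: rows with missing Employees are dropped
  group_by(State) %>%
  summarise(
    CompanyCount = n(),
    TotEmployee  = sum(Employees),
    AvgEmployee  = round(mean(Employees), 2),
    StdEmployee  = round(sd(Employees), 2),      # NA when a group has one company
    SumRevenue_b = round(sum(Revenue) / 1e9, 4), # assumption: `_b` means billions
    .groups = "drop"
  ) %>%
  arrange(desc(CompanyCount))

by_state
```

Grouping by `Industry` instead of `State` would give the second table in the same way.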
I chose to use a sorted bar graph. The large number of states justifies flipping the axes. A bar graph encodes values as lengths, which are visually easy to interpret, while the sorting eliminates having to visually compare non-adjacent bars.
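A sorted, flipped bar graph of this kind can be sketched in `ggplot2` with `reorder()` and `coord_flip()`. The example uses only the ten state counts listed in the table above. The object names `states` and `p_states` are illustrative, and this is one possible construction, not necessarily the code behind the original figure.

```r
library(ggplot2)

# Company counts for the ten states listed in the state table above.
states <- data.frame(
  State        = c("CA", "TX", "NY", "VA", "FL", "IL", "OH", "NC", "MI", "WI"),
  CompanyCount = c(700, 386, 311, 283, 282, 272, 186, 135, 126, 77)
)

p_states <- ggplot(states,
                   aes(x = reorder(State, CompanyCount), y = CompanyCount)) +
  geom_col() +
  coord_flip() +  # states on the vertical axis so the labels stay readable
  labs(x = "State", y = "Number of companies")

p_states
```

`reorder()` orders the factor levels by count, so after the flip the largest state appears at the top and neighbouring bars are always the closest in value.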
complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.
I chose the boxplot and the log transform because it had the nicest spread of data over the plot and kept the larger industries from visually swamping the smaller ones. Sorting the boxplot by median also helps to make the pattern in the mean more discernible. While the data may have a nice spread, the reader may have difficulty interpreting the mean and median values relative to the log scale.
## Warning: `fun.y` is deprecated. Use `fun` instead.
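The approach described above (complete cases only, a boxplot sorted by median, a log scale, and a marker for the mean) can be sketched as follows. The data are synthetic and purely illustrative; they are not the values behind the original plot, and the names `demo` and `p_box` are hypothetical. Passing `fun = mean` instead of `fun.y = mean` avoids the deprecation warning in ggplot2 3.3.0 and later.

```r
library(ggplot2)

# Synthetic, illustrative data only: NOT the values behind the original plot.
set.seed(1)
demo <- data.frame(
  Industry  = rep(c("Health", "Software", "Retail"), each = 30),
  Employees = round(rlnorm(90, meanlog = rep(c(4.5, 4, 5), each = 30),
                           sdlog = 1)) + 1
)
demo$Employees[5] <- NA               # inject one missing value
demo <- demo[complete.cases(demo), ]  # keep only fully observed rows

p_box <- ggplot(demo, aes(x = reorder(Industry, Employees, FUN = median),
                          y = Employees)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 4) +  # `fun`, not `fun.y`
  scale_y_log10() +  # log scale so large firms do not swamp the rest
  labs(x = "Industry", y = "Employees (log scale)")

p_box
```

One caveat with this construction: `scale_y_log10()` transforms the data before the statistics are computed, so the marked point is the mean of the log values (a geometric mean), not the arithmetic mean of `Employees`.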
Here, a rotated and sorted bar graph again seemed to be the most appropriate choice. The bar labels were divided by 10^3 and rounded to make the numbers more digestible for the reader.
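The labelling step can be sketched as below. The text does not say which quantity the original chart plotted, so `TotEmployee` for five industries from the table above is used purely for illustration. The `"k"` suffix and the names `ind`, `Label`, and `p_ind` are assumptions of this sketch.

```r
library(ggplot2)

# Total employees for five industries, taken from the industry table above.
ind <- data.frame(
  Industry    = c("Human Resources", "Business Products & Services",
                  "IT Services", "Health", "Food & Beverage"),
  TotEmployee = c(226980, 117357, 102788, 82430, 65911)
)

# Divide by 10^3 and round; the "k" suffix is an illustrative choice.
ind$Label <- paste0(round(ind$TotEmployee / 1e3), "k")

p_ind <- ggplot(ind, aes(x = reorder(Industry, TotEmployee), y = TotEmployee)) +
  geom_col() +
  geom_text(aes(label = Label), hjust = -0.1) +  # place label just past the bar end
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +  # room for labels
  coord_flip() +
  labs(x = "Industry", y = "Total employees")

p_ind
```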