Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. Let's read this in:
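The loading code is not shown in this write-up, so here is a minimal sketch of how the read could look, assuming the data is a CSV file. To keep it self-contained, it parses an inline sample built from the six preview rows rather than the real file; `csv_text` is a hypothetical name, and in practice you would pass the path or URL of the full file to `read.csv()`.

```r
# Inline stand-in for the real file: the six rows shown in the preview.
# In practice, pass the path or URL of the full CSV to read.csv() instead.
csv_text <- 'Rank,Name,Growth_Rate,Revenue,Industry,Employees,City,State
1,Fuhu,421.48,1.179e+08,Consumer Products & Services,104,El Segundo,CA
2,FederalConference.com,248.31,4.960e+07,Government Services,51,Dumfries,VA
3,The HCI Group,245.45,2.550e+07,Health,132,Jacksonville,FL
4,Bridger,233.08,1.900e+09,Energy,50,Addison,TX
5,DataXu,213.37,8.700e+07,Advertising & Marketing,220,Boston,MA
6,MileStone Community Builders,179.38,4.570e+07,Real Estate,63,Austin,TX'

# stringsAsFactors = TRUE reproduces the factor-style summary() output
# (counts per level) seen in this document.
inc <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)
```

Calling `head(inc)` and `summary(inc)` on the result gives previews of the kind shown next.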

And let's preview this data:

##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX
##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2502   110 Consulting        :   1   Mean   :  4.612  
##  3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4995                    
##     Revenue                                  Industry      Employees      
##  Min.   :2.000e+06   IT Services                 : 733   Min.   :    1.0  
##  1st Qu.:5.100e+06   Business Products & Services: 482   1st Qu.:   25.0  
##  Median :1.090e+07   Advertising & Marketing     : 471   Median :   53.0  
##  Mean   :4.822e+07   Health                      : 355   Mean   :  232.7  
##  3rd Qu.:2.860e+07   Software                    : 342   3rd Qu.:  132.0  
##  Max.   :1.010e+10   Financial Services          : 260   Max.   :66803.0  
##                      (Other)                     :2358   NA's   :12       
##             City          State     
##  New York     : 160   CA     : 701  
##  Chicago      :  90   TX     : 387  
##  Austin       :  88   NY     : 311  
##  Houston      :  76   VA     : 283  
##  San Francisco:  75   FL     : 282  
##  Atlanta      :  74   IL     : 273  
##  (Other)      :4438   (Other):2764
Think a bit about what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

The summaries below help show the skew of the data via the total, average, and standard deviation of employee counts, and they also aggregate revenue per state and per industry (SumRevenue_b, in billions of dollars).

Summary: Top 10 States by Company Count

State  CompanyCount  TotEmployee  AvgEmployee  StdEmployee  SumRevenue_b
CA              700       161219       230.31      1213.67       23.3646
TX              386        90765       235.14       739.43       22.1543
NY              311        84370       271.29      1916.18       18.2604
VA              283        35667       126.03       263.66        8.6677
FL              282        61221       217.10       960.70       10.6103
IL              272       103266       379.65      1463.61       33.2388
OH              186        38002       204.31       818.26       12.7866
NC              135        36685       271.74       819.37        9.2525
MI              126        36905       292.90       850.47        7.8058
WI               77        15548       201.92       757.54        7.1314

Summary: Industries

Industry                      CompanyCount  TotEmployee  AvgEmployee  StdEmployee  SumRevenue_b
IT Services                            732       102788       140.42       392.37       20.5250
Business Products & Services           480       117357       244.49      1519.52       26.3459
Advertising & Marketing                471        39731        84.35       287.40        7.7850
Health                                 354        82430       232.85       490.96       17.8601
Software                               341        51262       150.33       267.57        8.1346
Financial Services                     260        47693       183.43       302.83       13.1509
Manufacturing                          255        43942       172.32       617.10       12.6036
Consumer Products & Services           203        45464       223.96      1214.94       14.9564
Retail                                 203        37068       182.60       594.90       10.2574
Government Services                    202        26185       129.63       182.64        6.0091
Human Resources                        196       226980      1158.06      5474.04        9.2461
Construction                           187        29099       155.61       589.23       13.1743
Logistics & Transportation             154        39994       259.70       928.82       14.8378
Food & Beverage                        129        65911       510.94      1250.18       12.8125
Telecommunications                     127        30842       242.85       919.95        7.2879
Energy                                 109        26437       242.54       454.36       13.7716
Real Estate                             95        18893       198.87       412.58        2.9568
Education                               83         7685        92.59       136.42        1.1393
Engineering                             74        20435       276.15      1166.04        2.5325
Security                                73        41059       562.45      2433.98        3.8128
Travel & Hospitality                    62        23035       371.53       900.79        2.9316
Media                                   54         9532       176.52       502.23        1.7424
Environmental Services                  51        10155       199.12       742.18        2.6388
Insurance                               50         7339       146.78       412.92        2.3379
Computer Hardware                       44         9714       220.77      1016.74       11.8857
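The code behind these tables is not shown, so the following is only a sketch of one way to produce summaries of this shape, assuming dplyr and assuming rows with missing values were dropped first (the table counts are slightly lower than the `summary()` counts, which would be consistent with that). The toy `inc` data frame reuses a few preview rows plus one made-up row with a missing `Employees` value; it is not the real data.

```r
library(dplyr)

# Toy stand-in for the full data set: three preview rows plus one
# made-up row ("Hypothetical Co") with a missing Employees value.
inc <- data.frame(
  Name      = c("Fuhu", "Bridger", "MileStone Community Builders", "Hypothetical Co"),
  State     = c("CA", "TX", "TX", "CA"),
  Industry  = c("Consumer Products & Services", "Energy", "Real Estate", "Energy"),
  Employees = c(104, 50, 63, NA),
  Revenue   = c(1.179e8, 1.9e9, 4.57e7, 5e6)
)

# Keep only rows with no missing values
inc_complete <- inc[complete.cases(inc), ]

by_state <- inc_complete %>%
  group_by(State) %>%
  summarise(
    CompanyCount = n(),
    TotEmployee  = sum(Employees),
    AvgEmployee  = round(mean(Employees), 2),
    StdEmployee  = round(sd(Employees), 2),      # NA when a group has one row
    SumRevenue_b = round(sum(Revenue) / 1e9, 4)  # revenue in billions
  ) %>%
  arrange(desc(CompanyCount)) %>%
  head(10)

by_state
```

Swapping `group_by(State)` for `group_by(Industry)` and dropping the `head(10)` step would give the per-industry version.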

Question 1

Create a graph that shows the distribution of companies in the dataset by State (i.e., how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (i.e., taller than wide), which should further guide your layout choices.

I chose to use a sorted bar graph. The large number of states justifies flipping the axes so that the states run down the vertical axis, which also suits a portrait screen. A bar graph encodes values as lengths, which are visually easy to compare, while sorting removes the need to compare non-adjacent bars.
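The plotting code is not included here, so this is a minimal sketch of a sorted, flipped bar chart in ggplot2. The `state_counts` data frame is an illustrative stand-in built from the six state counts in the `summary()` output; with the full data, the counts would come from tabulating the `State` column.

```r
library(ggplot2)

# Stand-in counts taken from the State column of the summary() output
state_counts <- data.frame(
  State = c("CA", "TX", "NY", "VA", "FL", "IL"),
  n     = c(701, 387, 311, 283, 282, 273)
)

# reorder() sorts the states by count; coord_flip() puts states on the
# vertical axis so the chart grows taller rather than wider.
p_states <- ggplot(state_counts, aes(x = reorder(State, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "State", y = "Number of companies", title = "Companies by state")

p_states
```

Because `reorder()` sorts in ascending order and the axes are flipped, the state with the most companies ends up at the top of the chart.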

Question 2

Let's dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data; use R’s complete.cases() function). In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

I chose a boxplot with a log-transformed axis because it gave the best spread of data across the plot and kept the larger industries from visually swamping the smaller ones. Sorting the boxplots by median also makes the pattern in the means easier to discern. The trade-off is that, although the data is well spread, the reader may have difficulty interpreting the mean and median values on a log scale.

## Warning: `fun.y` is deprecated. Use `fun` instead.
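The deprecation warning above comes from passing `fun.y` to `stat_summary()`; ggplot2 3.3.0 and later expect `fun` instead. The original plotting code is not shown, so the sketch below is only an approximation of the approach described: boxplots sorted by median, a log-scaled axis, and a marker for the mean. The `ny` rows are made-up values for illustration, not figures from the real data.

```r
library(ggplot2)

# Made-up rows for illustration only (not values from the real data set)
ny <- data.frame(
  Industry  = rep(c("IT Services", "Health", "Media"), each = 5),
  Employees = c(10, 25, 40, 80, 3000,
                15, 60, 120, 250, NA,
                5, 8, 12, 30, 45)
)

# Drop incomplete rows, as the question asks
ny <- ny[complete.cases(ny), ]

p_ny <- ggplot(ny, aes(x = reorder(Industry, Employees, FUN = median),
                       y = Employees)) +
  geom_boxplot() +
  # `fun` replaces the deprecated `fun.y`. Scale transformations are applied
  # before statistics are computed, so with scale_y_log10() this point marks
  # the mean of the log10 values (the geometric mean), not the arithmetic mean.
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3,
               colour = "red") +
  scale_y_log10() +
  coord_flip() +
  labs(x = "Industry", y = "Employees (log scale)")

p_ny
```

The comment on `stat_summary()` is one concrete reason the means can be hard to read on this plot: on a log-scaled axis the marker sits at the geometric mean, which is generally lower than the arithmetic mean for skewed data like this.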

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

Here, a rotated and sorted bar chart again seemed the most appropriate choice. The bar labels were divided by 10^3 and rounded to make the numbers more digestible for the reader.
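As a rough sketch of this chart (the original code is not shown), the example below computes revenue per employee for each industry and labels each bar in thousands. The `rev_emp` totals are made up for illustration, and defining the metric as total revenue divided by total employees is an assumption; a per-company average would be another reasonable definition.

```r
library(ggplot2)

# Made-up industry totals for illustration only
rev_emp <- data.frame(
  Industry  = c("Energy", "IT Services", "Education"),
  Revenue   = c(1.2e9, 9.0e8, 1.0e8),
  Employees = c(2000, 4500, 800)
)

# Assumed definition: total revenue divided by total employees
rev_emp$RevPerEmp <- rev_emp$Revenue / rev_emp$Employees

p_rev <- ggplot(rev_emp, aes(x = reorder(Industry, RevPerEmp), y = RevPerEmp)) +
  geom_col() +
  # Bar labels in thousands of dollars, rounded
  geom_text(aes(label = round(RevPerEmp / 1e3)), hjust = -0.2) +
  # Extra room past the longest bar so its label is not clipped
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  coord_flip() +
  labs(x = "Industry", y = "Revenue per employee (USD; labels in thousands)")

p_rev
```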