Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:

And lets preview this data:

Rank Name Growth_Rate Revenue Industry Employees City State
1 Fuhu 421.48 1.179e+08 Consumer Products & Services 104 El Segundo CA
2 FederalConference.com 248.31 4.960e+07 Government Services 51 Dumfries VA
3 The HCI Group 245.45 2.550e+07 Health 132 Jacksonville FL
4 Bridger 233.08 1.900e+09 Energy 50 Addison TX
5 DataXu 213.37 8.700e+07 Advertising & Marketing 220 Boston MA
6 MileStone Community Builders 179.38 4.570e+07 Real Estate 63 Austin TX
##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2502   110 Consulting        :   1   Mean   :  4.612  
##  3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4995                    
##     Revenue                                  Industry      Employees      
##  Min.   :2.000e+06   IT Services                 : 733   Min.   :    1.0  
##  1st Qu.:5.100e+06   Business Products & Services: 482   1st Qu.:   25.0  
##  Median :1.090e+07   Advertising & Marketing     : 471   Median :   53.0  
##  Mean   :4.822e+07   Health                      : 355   Mean   :  232.7  
##  3rd Qu.:2.860e+07   Software                    : 342   3rd Qu.:  132.0  
##  Max.   :1.010e+10   Financial Services          : 260   Max.   :66803.0  
##                      (Other)                     :2358   NA's   :12       
##             City          State     
##  New York     : 160   CA     : 701  
##  Chicago      :  90   TX     : 387  
##  Austin       :  88   NY     : 311  
##  Houston      :  76   VA     : 283  
##  San Francisco:  75   FL     : 282  
##  Atlanta      :  74   IL     : 273  
##  (Other)      :4438   (Other):2764

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

## # A tibble: 1 x 1
##       n
##   <int>
## 1  5001
## 'data.frame':    5001 obs. of  8 variables:
##  $ Rank       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name       : Factor w/ 5001 levels "(Add)ventures",..: 1770 1633 4423 690 1198 2839 4733 1468 1869 4968 ...
##  $ Growth_Rate: num  421 248 245 233 213 ...
##  $ Revenue    : num  1.18e+08 4.96e+07 2.55e+07 1.90e+09 8.70e+07 ...
##  $ Industry   : Factor w/ 25 levels "Advertising & Marketing",..: 5 12 13 7 1 20 10 1 5 21 ...
##  $ Employees  : int  104 51 132 50 220 63 27 75 97 15 ...
##  $ City       : Factor w/ 1519 levels "Acton","Addison",..: 391 365 635 2 139 66 912 1179 131 1418 ...
##  $ State      : Factor w/ 52 levels "AK","AL","AR",..: 5 47 10 45 20 45 44 5 46 41 ...
x
12
x
Rank 0
Name 0
Growth_Rate 0
Revenue 0
Industry 0
Employees 12
City 0
State 0
x
Rank 4999
Name 5001
Growth_Rate 1147
Revenue 1069
Industry 25
Employees 692
City 1519
State 52
Rank Name Growth_Rate Revenue Industry Employees City State
3424 Stemp Systems Group 19.37 6800000 IT Services 39 Long Island City NY
3424 Total Beverage Solution 0.90 41500000 Food & Beverage 35 Mt. Pleasant SC
5000 INE 0.34 6800000 IT Services 35 Bellevue WA
5000 ALL4 0.34 4700000 Environmental Services 34 Kimberton PA
Rank Name Growth_Rate Revenue Industry Employees City State
1 Fuhu 421.48 1.179e+08 Consumer Products & Services 104 El Segundo CA
2 FederalConference.com 248.31 4.960e+07 Government Services 51 Dumfries VA
3 The HCI Group 245.45 2.550e+07 Health 132 Jacksonville FL
4 Bridger 233.08 1.900e+09 Energy 50 Addison TX
5 DataXu 213.37 8.700e+07 Advertising & Marketing 220 Boston MA
6 MileStone Community Builders 179.38 4.570e+07 Real Estate 63 Austin TX
Rank Growth_Rate Revenue Employees
Rank 1.0000000 -0.3976698 0.0821068 NA
Growth_Rate -0.3976698 1.0000000 0.0063041 NA
Revenue 0.0821068 0.0063041 1.0000000 NA
Employees NA NA NA 1
vars n mean sd median trimmed mad min max range skew kurtosis se
Rank 1 5001 2.501641e+03 1.443506e+03 2.502e+03 2.501731e+03 1.853250e+03 1.0e+00 5.0000e+03 4.9990e+03 -0.0004896 -1.200432 2.041222e+01
Name* 2 5001 2.501000e+03 1.443809e+03 2.501e+03 2.501000e+03 1.853250e+03 1.0e+00 5.0010e+03 5.0000e+03 0.0000000 -1.200720 2.041650e+01
Growth_Rate 3 5001 4.611826e+00 1.412369e+01 1.420e+00 2.136756e+00 1.215732e+00 3.4e-01 4.2148e+02 4.2114e+02 12.5495059 242.336616 1.997192e-01
Revenue 4 5001 4.822254e+07 2.405423e+08 1.090e+07 1.733497e+07 1.067472e+07 2.0e+06 1.0100e+10 1.0098e+10 22.1744453 722.656318 3.401441e+06
Industry* 5 5001 1.210038e+01 7.328351e+00 1.300e+01 1.205499e+01 8.895600e+00 1.0e+00 2.5000e+01 2.4000e+01 -0.1012333 -1.184515 1.036282e-01
Employees 6 4989 2.327180e+02 1.353128e+03 5.300e+01 8.177511e+01 5.337360e+01 1.0e+00 6.6803e+04 6.6802e+04 29.8104167 1268.671130 1.915720e+01
City* 7 5001 7.320010e+02 4.411164e+02 7.610e+02 7.317383e+02 6.049008e+02 1.0e+00 1.5190e+03 1.5180e+03 -0.0420954 -1.264827 6.237704e+00
State* 8 5001 2.480324e+01 1.563706e+01 2.300e+01 2.444364e+01 1.927380e+01 1.0e+00 5.2000e+01 5.1000e+01 0.1190507 -1.461030 2.211193e-01

Notes:

We can see from the additional EDA above that out of the 5001 observations in this data set with 8 variables, half of which are numerical and the other are categorical variables. We can also show that 12 RevperEmp data points are of NA value, and each column contains a varying number of unique values. It is interesting to note that there are 52 unique data points in the State column, when we know that there are 50 states, which leads me to believe that terratories other than states are included in this dataset. We also see that there are 2 duplicated rankings - which may be interesting to explore the reason why that is in more detail because the data seemes really different for the companies that have the same ranking in this dataset. Upon deeper insection of the variable Growth_Rate we can see that it is clearly correlated to the Rank variable. Upon further inspection of the correlation between numeric variables we can also see that Growth_Rate and Rank are highly correlated (negative correlation), and Revenue and Rank also seem to be significantly correlated (positive correlation). Finally by using the describe() function from the psych library, we can uncover some more helpful information about the variables in this dataset (such as kurtosis, skew)

Question 1

Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.

Approach

We are essentially creating a frequency chart of the number of companies in each state therefore it is appropriate to group the data by state and count the number of companies in each state. Then I use a bar graph to display the counts for each state and arrange them in deecending order so that the state with the greatest number of growing companies is at the top. Since the visualization was meant to be viewed in a portate format I decided to flip the axis of this graph so that more space can be placed between the state columns and therefore make the width less than the height of this vizualization. I then changed the color of each bar to make it easier to compare adjacent bars and included the abreviation for each state next to the bar (rather than on the axis).

Quesiton 2

Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

Approach:

According to the graph in question 1, the state with the 3rd most companies in the dataset is NY. To prep the data I filtered out NY then used complete.cases() to deal with cases with incomplete data. The graphs I chose were boxplots and bar graphs which seemed most appropriate because we are comparing measures of central tendency for various categories. Initially when we plot the data we see that in Graph A there are two outliers that are not allowing us to really see the boxplots for the other categories; this graph gives insight into the range and distribution of each Industry. In Graph B we can see that if we change the scale and set a limit for the y scale at about 1500 (which essentially means to ignore all outliers outside that range), we can clearly see the distribution of Employees in each Industry is highly skewed. We confirm this in Graph C and D where we compare mean and median number of employees for each Industry; Business Products and Services has the biggest skew in number of employees whereas Real Estate, Insurance, Government Services, Enviernmental Services, Computer Hardware anmd Engineering are normally distributed (mean = median).

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 50 x 3
##    Industry                     statistic RevperEmp
##    <fct>                        <chr>         <dbl>
##  1 Advertising & Marketing      IndMean     306036.
##  2 Business Products & Services IndMean     359097.
##  3 Computer Hardware            IndMean     817702.
##  4 Construction                 IndMean     465682.
##  5 Consumer Products & Services IndMean     466068.
##  6 Education                    IndMean     296454.
##  7 Energy                       IndMean    1554656.
##  8 Engineering                  IndMean     201120.
##  9 Environmental Services       IndMean     283607.
## 10 Financial Services           IndMean     394231.
## # ... with 40 more rows

Approach:

The same approach taken in Question 2 is the same approach taken here except, the data from all states were used. First the ratio of revenue per employee was calculated and added as a new column in the dataframe. Then two boxplots were chosen to display the distribution and range of each industry’s revenue per employee, one with outliers the original scale and one with a scale that had a limit so that the visualization was more informative. A deeper dive into comparing means and medians was represented in the bar charts that follow. We can see that the distribution of revenue per employee is highly skewed right for all industries (mean > median). This means that there are many companies that have a higher average revenue per employee than median revenue per employee. We can see that the Computer Hardware industry has the most revenue per employee whereas Human Resources has the least. This makes sense to me - most people spend lots of money on technology such as smart phones, smart tvs, computers etc; Human Resources on the other hand is not an industry particularly looking to make money. It is interesting to note that Energy has the highest median revenue per employee.