Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:

inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)

And lets preview this data:

head(inc)
##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX
summary(inc)
##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2502   110 Consulting        :   1   Mean   :  4.612  
##  3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4995                    
##     Revenue                                  Industry      Employees      
##  Min.   :2.000e+06   IT Services                 : 733   Min.   :    1.0  
##  1st Qu.:5.100e+06   Business Products & Services: 482   1st Qu.:   25.0  
##  Median :1.090e+07   Advertising & Marketing     : 471   Median :   53.0  
##  Mean   :4.822e+07   Health                      : 355   Mean   :  232.7  
##  3rd Qu.:2.860e+07   Software                    : 342   3rd Qu.:  132.0  
##  Max.   :1.010e+10   Financial Services          : 260   Max.   :66803.0  
##                      (Other)                     :2358   NA's   :12       
##             City          State     
##  New York     : 160   CA     : 701  
##  Chicago      :  90   TX     : 387  
##  Austin       :  88   NY     : 311  
##  Houston      :  76   VA     : 283  
##  San Francisco:  75   FL     : 282  
##  Atlanta      :  74   IL     : 273  
##  (Other)      :4438   (Other):2764

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

ANSWER (Part One): By using the dim function, we see that there are 5001 observations over 8 features. The str function shows that there are 4 factor variables, two integer variables, and two numeric variables.

# Insert your code here, create more chunks as necessary
dim(inc)
## [1] 5001    8
str(inc)
## 'data.frame':    5001 obs. of  8 variables:
##  $ Rank       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name       : Factor w/ 5001 levels "(Add)ventures",..: 1770 1633 4423 690 1198 2839 4733 1468 1869 4968 ...
##  $ Growth_Rate: num  421 248 245 233 213 ...
##  $ Revenue    : num  1.18e+08 4.96e+07 2.55e+07 1.90e+09 8.70e+07 ...
##  $ Industry   : Factor w/ 25 levels "Advertising & Marketing",..: 5 12 13 7 1 20 10 1 5 21 ...
##  $ Employees  : int  104 51 132 50 220 63 27 75 97 15 ...
##  $ City       : Factor w/ 1519 levels "Acton","Addison",..: 391 365 635 2 139 66 912 1179 131 1418 ...
##  $ State      : Factor w/ 52 levels "AK","AL","AR",..: 5 47 10 45 20 45 44 5 46 41 ...

ANSWER, Part Two: Checking for missing or incomplete data, we see 12 cases where data for the Employees feature is missing. Missing data needs to be accounted for when performing aggregate functions.

inc[!complete.cases(inc),]
##      Rank                             Name Growth_Rate   Revenue
## 183   183           First Flight Solutions       22.32   2700000
## 1063 1064                         Popchips        3.98  93300000
## 1123 1124                       Vocalocity        3.72  42900000
## 1652 1653                     Higher Logic        2.36   6000000
## 1685 1686      Global Communications Group        2.30   3600000
## 2196 2197              JeffreyM Consulting        1.68  12100000
## 2742 2743               Excalibur Exhibits        1.27   9900000
## 3000 3001       Heartland Business Systems        1.12 156300000
## 3978 3978                             SSEC        0.68  80400000
## 4112 4112 Carolinas Home Medical Equipment        0.64   3300000
## 4566 4566                         Oakbrook        0.48   8900000
## 4968 4968                   Popcorn Palace        0.35   5500000
##                          Industry Employees          City State
## 183    Logistics & Transportation        NA  Emerald Isle    NC
## 1063              Food & Beverage        NA San Francisco    CA
## 1123           Telecommunications        NA       Atlanta    GA
## 1652                     Software        NA    Washington    DC
## 1685           Telecommunications        NA     Englewood    CO
## 2196 Business Products & Services        NA      Bellevue    WA
## 2742 Business Products & Services        NA       houston    TX
## 3000                  IT Services        NA  Little Chute    WI
## 3978                Manufacturing        NA       Horsham    PA
## 4112                       Health        NA      Matthews    NC
## 4566                  Real Estate        NA       Madison    WI
## 4968              Food & Beverage        NA Schiller Park    IL

Question 1

Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.

# Answer Question 1 here
suppressWarnings(suppressMessages(library("dplyr", quietly = T)))
library(dplyr)
Companies_per_State <- inc %>% 
    group_by(State) %>% 
    summarise(Companies_in_State = n()) %>% 
    arrange(Companies_in_State)
## Warning: package 'bindrcpp' was built under R version 3.5.1
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.1
#Plot saved as a separte pdf file
#png(filename="DATA608_Assignment_One_Question_One.png", width=900, height=1200)
ggplot(Companies_per_State, aes(x=reorder(State, Companies_in_State), y=Companies_in_State)) +
  geom_bar(stat = "identity", fill="blue") +
  coord_flip() +
    ggtitle("Distribution of Growth Companies by State") +
    labs(y="No. of Growth COmpanies", x="State") +
    theme_classic() +
    theme(plot.title = element_text(hjust = 0.5), panel.border = element_blank()) 

Quesiton 2

Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

ANSWER 2 (Part One): First, remove the incomplete cases from the dataset as identified earlier. Doing so, reduces the total number of observations down to 4,989. Doing so increases the accuracy of the aggregate data.

# Answer Question 2 here

inc <- inc[complete.cases(inc),]
dim(inc)
## [1] 4989    8

ANSWER 2 (Part Two): Next,identifying the outliers in the dataset. We can see from this dot plot that there are a few huge outliers in the Employees data.

ggplot(inc, aes(x = State, y = Employees)) +
        geom_point() +
        scale_x_discrete(name = "State") +
        scale_y_continuous(name = "Employees") +
        geom_jitter() +
        theme(axis.text.x = element_text(angle = 75, hjust = 1))

ANSWER 2 (Part Two): Next,identifying the outliers in the dataset. We can see from this dot plot that there are a few huge outliers in the Employees data. The maximum number of employees is 66,803.

max(inc$Employees)
## [1] 66803
min(inc$Employees)
## [1] 1

ANSWER 2 (Part Three):Applying Turkey’s method for identifying outliers, there are 610 data points that could be considered “Outliers” for the Employees feature. Most of these seem to be employee placement/temporary agencies. Removing these outliers shows that the data is more dispersed.

identify_outlier <- function(datapoint) {
    q1 = 25
    q3 = 132 
    interquartile_range = q3-q1
    if (datapoint < q1-(1.5*interquartile_range) | datapoint > q3+(1.5*interquartile_range)){
        return("Outlier")
        }
    else{
          return("Acceptable")
        }
}

inc$Outlier <- sapply(inc$Employees, identify_outlier, simplify=T)
head(inc[inc$Outlier=="Outlier",])
##    Rank                        Name Growth_Rate   Revenue
## 15   15                LivingSocial      123.33 536000000
## 23   23 Wingspan Portfolio Advisors       87.69  77000000
## 28   28                        Kony       77.86  51100000
## 39   39         Intellect Resources       65.54  30000000
## 45   45      Silver Spring Networks       58.67 196700000
## 64   64               GSC Packaging       48.13 112600000
##                        Industry Employees         City State Outlier
## 15 Consumer Products & Services      4100   Washington    DC Outlier
## 23           Financial Services      1016   Carrollton    TX Outlier
## 28                     Software      1100      Orlando    FL Outlier
## 39                       Health       675   Greensboro    NC Outlier
## 45                       Energy       566 Redwood City    CA Outlier
## 64                Manufacturing       300      Atlanta    GA Outlier
inc <- inc %>% 
        filter(Outlier != 'Outlier')
dim(inc)
## [1] 4379    9

After removing the Outliers, the dataset is reduced to 4,379 observations.

ggplot(inc, aes(x = State, y = Employees)) +
        geom_point() +
        scale_x_discrete(name = "State") +
        scale_y_continuous(name = "Employees") +
        geom_jitter() +
        theme(axis.text.x = element_text(angle = 75, hjust = 1))

ANSWER (Part Four): The code below confirms the visualization from Question 1 that the State with the 3rd most growth companies is NY.

Cos_in_State <- inc %>% 
    group_by(State) %>% 
    summarise(Companies_in_State = n()) %>% 
    arrange(desc(Companies_in_State))  %>% 
    top_n(3) %>% 
    slice(3L)
## Selecting by Companies_in_State
Cos_in_State
## # A tibble: 1 x 2
##   State Companies_in_State
##   <fct>              <int>
## 1 NY                   275
State_3rd_Highest <- inc %>% 
    filter(State == 'NY') %>% 
    group_by(Industry) %>% 
    summarise(Name=n(),total_emp=sum(Employees), mean_emp = mean(Employees), median_emp = median(Employees)) %>% 
    filter(Name > 4) %>% 
    arrange(desc(Name))
State_3rd_Highest
## # A tibble: 15 x 5
##    Industry                      Name total_emp mean_emp median_emp
##    <fct>                        <int>     <int>    <dbl>      <dbl>
##  1 Advertising & Marketing         57      3331     58.4       38  
##  2 IT Services                     35      2545     72.7       47  
##  3 Business Products & Services    18      1095     60.8       33  
##  4 Consumer Products & Services    16       647     40.4       24  
##  5 Telecommunications              16      1305     81.6       29.5
##  6 Education                       14       838     59.9       50.5
##  7 Retail                          14       347     24.8       13.5
##  8 Health                          12       766     63.8       42.5
##  9 Manufacturing                   12       646     53.8       28  
## 10 Financial Services              11       968     88         70  
## 11 Media                           10       586     58.6       30.5
## 12 Software                        10       806     80.6       54.5
## 13 Food & Beverage                  8       305     38.1       35.5
## 14 Human Resources                  7       273     39         30  
## 15 Construction                     6       366     61         24.5
#Plot saved as a separte pdf file
#png(filename="DATA608_Assignment_One_Question_Two(1).png", width=600, height=900)
ggplot(State_3rd_Highest, aes(y = mean_emp, x = reorder(Industry, mean_emp))) + 
           geom_bar(stat="identity", fill="#40b8d0") +
           theme(axis.text.x = element_text(angle = 75, hjust = 1)) +
            coord_flip() +
            ggtitle("NY: Average Number of Employees per Industry") +
            labs(y="(minimum of 5 companies per industry)", x="") +
            theme_classic() +
            theme(plot.title = element_text(hjust = 0.5), panel.border = element_blank()) 

#Plot saved as a separte pdf file
png(filename="DATA608_Assignment_One_Question_Two(2).png", width=600, height=900)
ggplot(State_3rd_Highest, aes(y = median_emp, x = reorder(Industry, median_emp))) + 
           geom_bar(stat="identity", fill="#b2d183") +
            theme(axis.text.x = element_text(angle = 75, hjust = 1)) +
            coord_flip() +
            ggtitle("NY: Median Number of Employees per Industry") +
            labs(y="(minimum of 5 companies per industry)", x="") +
            theme_classic() +
            theme(plot.title = element_text(hjust = 0.5), panel.border = element_blank()) 

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

ANSWER: First, create a new variable for each company showing revenue per employee.

# Answer Question 3 here

inc$Rev_per_Employee <- inc$Revenue / inc$Employees
head(inc)
##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State    Outlier
## 1 Consumer Products & Services       104   El Segundo    CA Acceptable
## 2          Government Services        51     Dumfries    VA Acceptable
## 3                       Health       132 Jacksonville    FL Acceptable
## 4                       Energy        50      Addison    TX Acceptable
## 5      Advertising & Marketing       220       Boston    MA Acceptable
## 6                  Real Estate        63       Austin    TX Acceptable
##   Rev_per_Employee
## 1        1133653.8
## 2         972549.0
## 3         193181.8
## 4       38000000.0
## 5         395454.5
## 6         725396.8
Industry_Rev_Employee <- inc %>% 
    group_by(Industry) %>% 
    summarise(Total_Revenue=sum(Revenue), Total_Employee=sum(Employees), Industry_Revenue_perEmployee = sum(Revenue) / sum(Employees)) %>% 
    arrange(desc(Industry_Revenue_perEmployee))
Industry_Rev_Employee 
## # A tibble: 25 x 4
##    Industry            Total_Revenue Total_Employee Industry_Revenue_perE~
##    <fct>                       <dbl>          <int>                  <dbl>
##  1 Energy                 6403000000           5974               1071811.
##  2 Computer Hardware      1785700000           2914                612800.
##  3 Logistics & Transp~    4498700000           7968                564596.
##  4 Food & Beverage        3124300000           6623                471735.
##  5 Insurance              1226400000           3049                402230.
##  6 Real Estate            1645300000           4346                378578.
##  7 Consumer Products ~    3725100000          10050                370657.
##  8 Construction           4401400000          12026                365990.
##  9 Financial Services     5768000000          16237                355238.
## 10 Human Resources        3796900000          10834                350462.
## # ... with 15 more rows
labs <- c("$0", "$300K", "$600K", "$900K", "$1M")
#Plot saved as a separte pdf file
#png(filename="DATA608_Assignment_One_Question_Three.png", width=600, height=900)
ggplot(Industry_Rev_Employee, aes(y = Industry_Revenue_perEmployee, x = reorder(Industry, Industry_Revenue_perEmployee))) + 
           geom_bar(stat="identity", fill="#40b8d0") +
           theme(axis.text.x = element_text(angle = 75, hjust = 1)) +
           scale_y_continuous(labels = labs) +
           coord_flip() +
           ggtitle("Industry Revenue per Employee") +
           labs(y="", x="") +
           theme_classic() +
           theme(plot.title = element_text(hjust = 0.5), panel.border = element_blank()) 

REFLECTION: To answer these questions, I imagined that I was talking to a group of state legislators or a Chamber of Commerce–an audience with a vested interest in the data in order to make policy. Given that I wanted the visualizations to clearly show disparities between the states when it comes to growth businesses, to show disparities between mean and median, and generated revenues per employees. I chose bar charts to communicate these ideas and sorted them in descending orders which further communicates the information that I wanted to demonstrate. I put less emphasis on the actual hard numbers and more emphasis on showing disparties.