Libraries

library (tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.3
## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.1     v dplyr   0.7.4
## v tidyr   0.7.2     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0
## Warning: package 'tibble' was built under R version 3.4.3
## Warning: package 'tidyr' was built under R version 3.4.3
## Warning: package 'readr' was built under R version 3.4.3
## Warning: package 'purrr' was built under R version 3.4.3
## Warning: package 'dplyr' was built under R version 3.4.2
## Warning: package 'forcats' was built under R version 3.4.3
## -- Conflicts ------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Assignment

Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:

inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)

And lets preview this data:

head(inc)
##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX
summary(inc)
##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2502   110 Consulting        :   1   Mean   :  4.612  
##  3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4995                    
##     Revenue                                  Industry      Employees      
##  Min.   :2.000e+06   IT Services                 : 733   Min.   :    1.0  
##  1st Qu.:5.100e+06   Business Products & Services: 482   1st Qu.:   25.0  
##  Median :1.090e+07   Advertising & Marketing     : 471   Median :   53.0  
##  Mean   :4.822e+07   Health                      : 355   Mean   :  232.7  
##  3rd Qu.:2.860e+07   Software                    : 342   3rd Qu.:  132.0  
##  Max.   :1.010e+10   Financial Services          : 260   Max.   :66803.0  
##                      (Other)                     :2358   NA's   :12       
##             City          State     
##  New York     : 160   CA     : 701  
##  Chicago      :  90   TX     : 387  
##  Austin       :  88   NY     : 311  
##  Houston      :  76   VA     : 283  
##  San Francisco:  75   FL     : 282  
##  Atlanta      :  74   IL     : 273  
##  (Other)      :4438   (Other):2764

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

From the tidyverse package we will use the glimpse function which provides a cleaner look to what is under the hood of the data set.

glimpse(inc)
## Observations: 5,001
## Variables: 8
## $ Rank        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Name        <fctr> Fuhu, FederalConference.com, The HCI Group, Bridg...
## $ Growth_Rate <dbl> 421.48, 248.31, 245.45, 233.08, 213.37, 179.38, 17...
## $ Revenue     <dbl> 1.179e+08, 4.960e+07, 2.550e+07, 1.900e+09, 8.700e...
## $ Industry    <fctr> Consumer Products & Services, Government Services...
## $ Employees   <int> 104, 51, 132, 50, 220, 63, 27, 75, 97, 15, 149, 16...
## $ City        <fctr> El Segundo, Dumfries, Jacksonville, Addison, Bost...
## $ State       <fctr> CA, VA, FL, TX, MA, TX, TN, CA, UT, RI, VA, CA, F...

Rank: an integer variable displaying the position of the fatest growing companies in the U.S.

Name: a factor variable displaying the name of each company

Growth_Rate: a double variable displaying the rate of growth for each company

Revenue: a double variable dispaying the total revenue of each company

Industry: a factor variable displaying the type of industry for each company

Employees: an integer variable displaying the number of employees for each company

City: a factor variable displaying the city where the company resides

State: a factor variable displaying the state where the company resides

Measures of Spread

Check the variance and standard deviation on applicable variables

cat("\n","Variance & Standard Deviation of the Growth_Rate Variable","\n")
## 
##  Variance & Standard Deviation of the Growth_Rate Variable
var(inc$Growth_Rate)
## [1] 199.4787
sd(inc$Growth_Rate)
## [1] 14.12369
cat("\n","Variance & Standard Deviation of the Revenue Variable","\n")
## 
##  Variance & Standard Deviation of the Revenue Variable
var(inc$Revenue)
## [1] 5.786059e+16
sd(inc$Revenue)
## [1] 240542281

Question 1

Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.

To gain some insight on how many states there are we utilize the table function

table(inc$State)
## 
##  AK  AL  AR  AZ  CA  CO  CT  DC  DE  FL  GA  HI  IA  ID  IL  IN  KS  KY 
##   2  51   9 100 701 134  50  43  16 282 212   7  28  17 273  69  38  40 
##  LA  MA  MD  ME  MI  MN  MO  MS  MT  NC  ND  NE  NH  NJ  NM  NV  NY  OH 
##  37 182 131  13 126  88  59  12   4 137  10  27  24 158   5  26 311 186 
##  OK  OR  PA  PR  RI  SC  SD  TN  TX  UT  VA  VT  WA  WI  WV  WY 
##  46  49 164   1  16  48   3  82 387  95 283   6 130  79   2   2

The best graph to display the data will be a bar chart with flipped coordinates.

In order to achieve this graph we first created a subset entitled Q1_States which groups the data based on the State variable. Data is then counted based on the number of industries per state; the default variable n is created highlighting the count.

Q1_States <- inc %>%
    group_by(State) %>%
    count(State, sort = TRUE)
head(Q1_States)
## # A tibble: 6 x 2
## # Groups: State [6]
##   State      n
##   <fctr> <int>
## 1 CA       701
## 2 TX       387
## 3 NY       311
## 4 VA       283
## 5 FL       282
## 6 IL       273
ggplot(data = Q1_States, aes(x = reorder(State, n), y = n)) +
    geom_bar(stat = "identity", fill = "purple", width = .5) +
    coord_flip() +
    xlab("States") +
    ylab("Number of Companies") +
    ggtitle("Number of Companies Based on State") +
    theme(axis.text=element_text(size=6),
        axis.title=element_text(size=12,face="bold",color = "navy"))

Even in portrait mode this data seems cluttered, a cleaner graph could be generated if we showcased states with one hundred or more companies.

inc %>%
    group_by(State) %>%
    count(State, sort = TRUE) %>%
    filter(n >= 100) %>%
    ggplot(aes(x = reorder(State, n), y = n)) +
    geom_bar(stat = "identity", fill = "purple", width = .5) +
    coord_flip() +
    xlab("States") +
    ylab("Number of Companies") +
    ggtitle("Number of Companies Based on State", subtitle ="With 100 Or More Companies") +
    theme(axis.text=element_text(size=10),
        axis.title=element_text(size=12,face="bold",color = "navy"))

Question 2

Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

complete.cases Return a logical vector indicating which cases are complete, i.e., have no missing values.

Create a data set entitled Q2_identify which will count and sort the data on the State with the most companies on the list. Next, create a variable named Q2_Third based on the criteria in the question.

Q2_identify <- inc %>%
    count(State, sort = TRUE) %>% #counts and sorts data based on the State variable
    mutate(State = reorder(State,n)) %>%
    rename(Count = n)
(Q2_Third <- Q2_identify[3,1]) #filter on the 3rd row and the 1st column
## # A tibble: 1 x 1
##   State 
##   <fctr>
## 1 NY

Using the complete.cases function we are able to identify complete data and then create a subset based on the findings from Q2_identify Next we use the filter command to only extract data from New York. Lastly, the data is grouped by the Industry variable.

Q2_NY <- inc[complete.cases(inc),] %>%
    filter(State == "NY")

After creating the subset, the mean is calculated based on the aggregate feature

(Q2_M <- aggregate(Employees ~ Industry, Q2_NY, mean))
##                        Industry  Employees
## 1       Advertising & Marketing   58.43860
## 2  Business Products & Services 1492.46154
## 3             Computer Hardware   44.00000
## 4                  Construction   61.00000
## 5  Consumer Products & Services  626.29412
## 6                     Education   59.85714
## 7                        Energy  129.20000
## 8                   Engineering   53.50000
## 9        Environmental Services  155.00000
## 10           Financial Services  144.30769
## 11              Food & Beverage   76.44444
## 12          Government Services   17.00000
## 13                       Health   81.84615
## 14              Human Resources  437.54545
## 15                    Insurance   32.50000
## 16                  IT Services  204.09302
## 17   Logistics & Transportation   29.50000
## 18                Manufacturing   73.30769
## 19                        Media  108.00000
## 20                  Real Estate   18.25000
## 21                       Retail   24.78571
## 22                     Security  135.00000
## 23                     Software  245.92308
## 24           Telecommunications   95.35294
## 25         Travel & Hospitality  547.71429
ggplot(data = Q2_NY, aes(x = reorder(Industry, Employees,mean), y = Employees)) +
    geom_boxplot(outlier.shape = NA) +
    coord_flip() +
    xlab("New York Based Industries") +
    ylab("Employees") +
    ggtitle("The Average of New York Employee Size Based on Industry") +
    theme(axis.text=element_text(size=10),
        axis.title=element_text(size=12,face="bold",color = "navy")) +
    scale_y_continuous(limits = c(0,1500))
## Warning: Removed 6 rows containing non-finite values (stat_boxplot).

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

In order to achieve this graph we first created a subset entitled Q3_Revenue which groups the data based on the Industry variable. Data is then summarized by counting the number of industries per group, calculating the sum of the Revenue & Employee variables. Lastly, using the mutate function, a variable entitled RevenueEmp will calculate the Revenue divided by the number of Employees.

At first glance, several key pieces of vital information was missing so it was necessary to add complete.cases to the code.

Q3_Revenue <- inc[complete.cases(inc),] %>%
    group_by(Industry) %>%
    summarise(IndustryCount = n(),
              Revenue = sum(Revenue),
              Employees = sum(Employees)) %>%
    mutate(RevenueEmp = Revenue / Employees)
Q3_Revenue
## # A tibble: 25 x 5
##    Industry                     IndustryCount     Revenue Employees Reven~
##    <fctr>                               <int>       <dbl>     <int>  <dbl>
##  1 Advertising & Marketing                471  7785000000     39731 1.96e5
##  2 Business Products & Services           480 26345900000    117357 2.24e5
##  3 Computer Hardware                       44 11885700000      9714 1.22e6
##  4 Construction                           187 13174300000     29099 4.53e5
##  5 Consumer Products & Services           203 14956400000     45464 3.29e5
##  6 Education                               83  1139300000      7685 1.48e5
##  7 Energy                                 109 13771600000     26437 5.21e5
##  8 Engineering                             74  2532500000     20435 1.24e5
##  9 Environmental Services                  51  2638800000     10155 2.60e5
## 10 Financial Services                     260 13150900000     47693 2.76e5
## # ... with 15 more rows
ggplot(data = Q3_Revenue, aes(x = reorder(Industry, RevenueEmp), y = RevenueEmp)) +
    geom_bar(stat = "identity", fill = "blue") +
    coord_flip() +
    scale_y_continuous("RevenueEmp") +
    xlab("Industry") +
    ylab("Revenue Per Employee") +
    ggtitle("Revenue Per Employee Based on Industry") +
    theme(axis.text=element_text(size=11),
        axis.title=element_text(size=14,face="bold",color = "Red"))