R Notebook

Principles of Data Visualization and Introduction to ggplot2

#References:

##https://www.rdocumentation.org/packages/psych/versions/1.9.12.31/topics/describe

##https://github.com/AjayArora35/Data-607-Final-Project

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.5.3

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.5.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(kableExtra)

## Warning: package 'kableExtra' was built under R version 3.5.3

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(Hmisc)

## Warning: package 'Hmisc' was built under R version 3.5.3

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:

inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)

And lets preview this data:

head(inc)

##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX

summary(inc)

##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2502   110 Consulting        :   1   Mean   :  4.612  
##  3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4995                    
##     Revenue                                  Industry      Employees      
##  Min.   :2.000e+06   IT Services                 : 733   Min.   :    1.0  
##  1st Qu.:5.100e+06   Business Products & Services: 482   1st Qu.:   25.0  
##  Median :1.090e+07   Advertising & Marketing     : 471   Median :   53.0  
##  Mean   :4.822e+07   Health                      : 355   Mean   :  232.7  
##  3rd Qu.:2.860e+07   Software                    : 342   3rd Qu.:  132.0  
##  Max.   :1.010e+10   Financial Services          : 260   Max.   :66803.0  
##                      (Other)                     :2358   NA's   :12       
##             City          State     
##  New York     : 160   CA     : 701  
##  Chicago      :  90   TX     : 387  
##  Austin       :  88   NY     : 311  
##  Houston      :  76   VA     : 283  
##  San Francisco:  75   FL     : 282  
##  Atlanta      :  74   IL     : 273  
##  (Other)      :4438   (Other):2764

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

inc <- na.omit(inc)

# Insert your code here, create more chunks as necessary
# n, nmiss, unique, mean, 5,10,25,50,75,90,95th percentiles
# 5 lowest and 5 highest scores
# Lastly, in addition to the percentiles, the following command provides lower and upper boundaries of data.  
describe(inc)

## inc 
## 
##  8  Variables      4989  Observations
## ---------------------------------------------------------------------------
## Rank 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4989        0     4987        1     2501     1667    252.4    501.8 
##      .25      .50      .75      .90      .95 
##   1252.0   2502.0   3750.0   4500.2   4750.6 
## 
## lowest :    1    2    3    4    5, highest: 4996 4997 4998 4999 5000
## ---------------------------------------------------------------------------
## Name 
##        n  missing distinct 
##     4989        0     4989 
## 
## lowest : (Add)ventures                          @Properties                            1-Stop Translation USA                 110 Consulting                         11thStreetCoffee.com                  
## highest: Zoup!                                  ZT Wealth and Altus Group of Companies Zumasys                                Zurple                                 ZweigWhite                            
## ---------------------------------------------------------------------------
## Growth_Rate 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4989        0     1145        1    4.615    6.498    0.430    0.500 
##      .25      .50      .75      .90      .95 
##    0.770    1.420    3.290    9.132   17.104 
## 
## lowest :   0.34   0.35   0.36   0.37   0.38, highest: 213.37 233.08 245.45 248.31 421.48
## ---------------------------------------------------------------------------
## Revenue 
##         n   missing  distinct      Info      Mean       Gmd       .05 
##      4989         0      1066         1  48253357  75177089   2400000 
##       .10       .25       .50       .75       .90       .95 
##   3000000   5100000  10900000  28600000  76800000 155200000 
## 
## lowest : 2.00e+06 2.10e+06 2.20e+06 2.30e+06 2.40e+06
## highest: 3.80e+09 4.50e+09 4.60e+09 4.70e+09 1.01e+10
## ---------------------------------------------------------------------------
## Industry 
##        n  missing distinct 
##     4989        0       25 
## 
## lowest : Advertising & Marketing      Business Products & Services Computer Hardware            Construction                 Consumer Products & Services
## highest: Retail                       Security                     Software                     Telecommunications           Travel & Hospitality        
## ---------------------------------------------------------------------------
## Employees 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4989        0      691        1    232.7    365.6     10.0     14.0 
##      .25      .50      .75      .90      .95 
##     25.0     53.0    132.0    351.2    688.0 
## 
## lowest :     1     2     3     4     5, highest: 17057 18887 20000 32000 66803
## ---------------------------------------------------------------------------
## City 
##        n  missing distinct 
##     4989        0     1517 
## 
## lowest : Acton        Addison      Adrian       Agoura Hills Aiea        
## highest: Worthington  Wyomissing   Yonkers      Youngsville  Zumbrota    
## ---------------------------------------------------------------------------
## State 
##        n  missing distinct 
##     4989        0       52 
## 
## lowest : AK AL AR AZ CA, highest: VT WA WI WV WY
## ---------------------------------------------------------------------------

#What is the growth rate by Industry?
(inc %>% dplyr::filter(Growth_Rate >= 100) %>%
  select(Rank, Name,Growth_Rate,Industry ,Employees ,City, State) %>%
   group_by (Industry) %>%
   mutate(mean_growth_rate = mean(Growth_Rate)) %>%
   mutate(min_growth_rate = min(Growth_Rate)) %>%
   mutate(max_growth_rate = max(Growth_Rate)))

## # A tibble: 19 x 10
## # Groups:   Industry [12]
##     Rank Name  Growth_Rate Industry Employees City  State mean_growth_rate
##    <int> <fct>       <dbl> <fct>        <int> <fct> <fct>            <dbl>
##  1     1 Fuhu         421. Consume~       104 El S~ CA                238.
##  2     2 Fede~        248. Governm~        51 Dumf~ VA                206.
##  3     3 The ~        245. Health         132 Jack~ FL                245.
##  4     4 Brid~        233. Energy          50 Addi~ TX                233.
##  5     5 Data~        213. Adverti~       220 Bost~ MA                178.
##  6     6 Mile~        179. Real Es~        63 Aust~ TX                179.
##  7     7 Valu~        174. Financi~        27 Nash~ TN                139.
##  8     8 Emer~        171. Adverti~        75 San ~ CA                178.
##  9     9 Goal~        170. Consume~        97 Bluf~ UT                238.
## 10    10 Yago~        167. Retail          15 Warw~ RI                154.
## 11    11 OBXt~        164. Governm~       149 Tyso~ VA                206.
## 12    12 AdRo~        151. Adverti~       165 San ~ CA                178.
## 13    13 uBre~        141. Retail         250 Orla~ FL                154.
## 14    14 Sparc        129. Software       160 Char~ SC                129.
## 15    15 Livi~        123. Consume~      4100 Wash~ DC                238.
## 16    16 Ampe~        111. Compute~        26 Chino CA                111.
## 17    17 Inte~        106. Logisti~        15 Roch~ NJ                106.
## 18    18 Inte~        105. Financi~        11 Sara~ FL                139.
## 19    19 Vert~        100. Food & ~        51 colu~ OH                100.
## # ... with 2 more variables: min_growth_rate <dbl>, max_growth_rate <dbl>

#Standard Deviation of Growth_Rate
sd(inc$Growth_Rate)

## [1] 14.13767

#What companies exceed the Growth Rate of 100?

(inc %>% dplyr::filter(Growth_Rate >= 100) %>%
  select(Rank, Name,Growth_Rate,Revenue,Industry ,Employees ,City, State))

##    Rank                         Name Growth_Rate   Revenue
## 1     1                         Fuhu      421.48 1.179e+08
## 2     2        FederalConference.com      248.31 4.960e+07
## 3     3                The HCI Group      245.45 2.550e+07
## 4     4                      Bridger      233.08 1.900e+09
## 5     5                       DataXu      213.37 8.700e+07
## 6     6 MileStone Community Builders      179.38 4.570e+07
## 7     7        Value Payment Systems      174.04 2.550e+07
## 8     8         Emerge Digital Group      170.64 2.390e+07
## 9     9                    Goal Zero      169.81 3.310e+07
## 10   10                     Yagoozon      166.89 1.860e+07
## 11   11                       OBXtek      164.33 2.960e+07
## 12   12                       AdRoll      150.65 3.410e+07
## 13   13                   uBreakiFix      141.02 1.700e+07
## 14   14                        Sparc      128.63 2.110e+07
## 15   15                 LivingSocial      123.33 5.360e+08
## 16   16               Amped Wireless      110.68 1.430e+07
## 17   17            Intelligent Audit      105.73 1.450e+08
## 18   18            Integrity Funding      104.62 1.110e+07
## 19   19         Vertex Body Sciences      100.10 1.180e+07
##                        Industry Employees          City State
## 1  Consumer Products & Services       104    El Segundo    CA
## 2           Government Services        51      Dumfries    VA
## 3                        Health       132  Jacksonville    FL
## 4                        Energy        50       Addison    TX
## 5       Advertising & Marketing       220        Boston    MA
## 6                   Real Estate        63        Austin    TX
## 7            Financial Services        27     Nashville    TN
## 8       Advertising & Marketing        75 San Francisco    CA
## 9  Consumer Products & Services        97     Bluffdale    UT
## 10                       Retail        15       Warwick    RI
## 11          Government Services       149 Tysons Corner    VA
## 12      Advertising & Marketing       165 San Francisco    CA
## 13                       Retail       250       Orlando    FL
## 14                     Software       160    Charleston    SC
## 15 Consumer Products & Services      4100    Washington    DC
## 16            Computer Hardware        26         Chino    CA
## 17   Logistics & Transportation        15 Rochelle Park    NJ
## 18           Financial Services        11      Sarasota    FL
## 19              Food & Beverage        51      columbus    OH

#What is the Revenue by Industry?
(inc %>% 
  select(Rank, Name,Industry ,Revenue, Employees ,City, State) %>%
   group_by (Industry) %>%
   mutate(mean_rev = mean(Revenue)) %>%
   mutate(median_rev = median(Revenue)) %>%
   mutate(min_rev = min(Revenue)) %>%
   mutate(max_rev = max(Revenue)))

## # A tibble: 4,989 x 11
## # Groups:   Industry [25]
##     Rank Name  Industry Revenue Employees City  State mean_rev median_rev
##    <int> <fct> <fct>      <dbl>     <int> <fct> <fct>    <dbl>      <dbl>
##  1     1 Fuhu  Consume~  1.18e8       104 El S~ CA      7.37e7    9400000
##  2     2 Fede~ Governm~  4.96e7        51 Dumf~ VA      2.97e7   11450000
##  3     3 The ~ Health    2.55e7       132 Jack~ FL      5.05e7   11400000
##  4     4 Brid~ Energy    1.90e9        50 Addi~ TX      1.26e8   29400000
##  5     5 Data~ Adverti~  8.70e7       220 Bost~ MA      1.65e7    7900000
##  6     6 Mile~ Real Es~  4.57e7        63 Aust~ TX      3.11e7   13300000
##  7     7 Valu~ Financi~  2.55e7        27 Nash~ TN      5.06e7   15550000
##  8     8 Emer~ Adverti~  2.39e7        75 San ~ CA      1.65e7    7900000
##  9     9 Goal~ Consume~  3.31e7        97 Bluf~ UT      7.37e7    9400000
## 10    10 Yago~ Retail    1.86e7        15 Warw~ RI      5.05e7    8200000
## # ... with 4,979 more rows, and 2 more variables: min_rev <dbl>,
## #   max_rev <dbl>

#Standard Deviation of Revenue
sd(inc$Revenue)

## [1] 240819469

#What is the count of distinct cities?
result2 <- inc %>% 
           group_by(City) %>% 
           summarise(n())
nrow(result2)

## [1] 1517

#What are the distinct industries in the data?
result3 <- inc %>% 
           dplyr::group_by(Industry) %>% 
           dplyr::summarise(n()) 
result3

## # A tibble: 25 x 2
##    Industry                     `n()`
##    <fct>                        <int>
##  1 Advertising & Marketing        471
##  2 Business Products & Services   480
##  3 Computer Hardware               44
##  4 Construction                   187
##  5 Consumer Products & Services   203
##  6 Education                       83
##  7 Energy                         109
##  8 Engineering                     74
##  9 Environmental Services          51
## 10 Financial Services             260
## # ... with 15 more rows

#What are median, mean, etc. for employees?
(inc %>% 
  select(Rank, Name,Industry ,Revenue, Employees ,City, State) %>%
   group_by (Industry) %>%
   mutate(mean_employee = mean(Employees)) %>%
   mutate(min_employee = min(Employees)) %>%
   mutate(max_employee = max(Employees)) %>%
   mutate(sum_employee = sum(Employees)))

## # A tibble: 4,989 x 11
## # Groups:   Industry [25]
##     Rank Name  Industry Revenue Employees City  State mean_employee
##    <int> <fct> <fct>      <dbl>     <int> <fct> <fct>         <dbl>
##  1     1 Fuhu  Consume~  1.18e8       104 El S~ CA            224. 
##  2     2 Fede~ Governm~  4.96e7        51 Dumf~ VA            130. 
##  3     3 The ~ Health    2.55e7       132 Jack~ FL            233. 
##  4     4 Brid~ Energy    1.90e9        50 Addi~ TX            243. 
##  5     5 Data~ Adverti~  8.70e7       220 Bost~ MA             84.4
##  6     6 Mile~ Real Es~  4.57e7        63 Aust~ TX            199. 
##  7     7 Valu~ Financi~  2.55e7        27 Nash~ TN            183. 
##  8     8 Emer~ Adverti~  2.39e7        75 San ~ CA             84.4
##  9     9 Goal~ Consume~  3.31e7        97 Bluf~ UT            224. 
## 10    10 Yago~ Retail    1.86e7        15 Warw~ RI            183. 
## # ... with 4,979 more rows, and 3 more variables: min_employee <int>,
## #   max_employee <int>, sum_employee <int>

Question 1

Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.

# Answer Question 1 here

result4 = inc %>%
  group_by(State) %>%
  count(State)%>%
  arrange(desc(n))


graph1 <- ggplot(data = result4,aes(x=reorder(State, n), y=n, fill = "lightblue", )) + 
  theme(legend.position = "none", axis.text.y = element_text(size=8), axis.text.x = element_text(size=8), panel.background = element_blank()) + 
  geom_bar(stat = "identity") + 
  #geom_label(aes(label=(result4$n)), position = position_dodge(width = 0.5), size = 3.0,   label.padding = unit(0.08, "lines"), label.size = 0.15, inherit.aes = TRUE)+
  labs(title = "Distribution By States", x = "States", y = "Number of Companies")+ 
  coord_flip() 
graph1

Quesiton 2

Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

# Answer Question 2 here
#Retrieve the top 3 states
result5 <- inc %>% 
           group_by(State) %>% 
           summarise(n=n()) %>% 
           arrange(desc(n)) %>% 
           top_n(3)

## Selecting by n

result5

## # A tibble: 3 x 2
##   State     n
##   <fct> <int>
## 1 CA      700
## 2 TX      386
## 3 NY      311

#Isolate 3rd state.
result6 <- inc[complete.cases(inc),] %>% 
           filter(State=='NY') %>% 
           group_by(Industry) %>% 
           summarise(median=median(Employees)) %>% 
           arrange(desc(Industry))


result6

## # A tibble: 25 x 2
##    Industry                   median
##    <fct>                       <dbl>
##  1 Travel & Hospitality         61  
##  2 Telecommunications           31  
##  3 Software                     80  
##  4 Security                     32.5
##  5 Retail                       13.5
##  6 Real Estate                  18  
##  7 Media                        45  
##  8 Manufacturing                30  
##  9 Logistics & Transportation   23.5
## 10 IT Services                  54  
## # ... with 15 more rows

graph2 <- ggplot(result6, aes(x=reorder(result6$Industry, result6$median), y=result6$median, fill = "lightblue", )) + 
  theme(legend.position = "none", axis.text.y = element_text(size=8), axis.text.x = element_text(size=8),  panel.background = element_blank()) +
  geom_bar(stat = "identity") + 
  #geom_label(aes(label=(result6$median)), position = position_dodge(width = 0.5), size = 3.0,   label.padding = unit(0.08, "lines"), label.size = 0.15, inherit.aes = TRUE)+
  labs(title = "Distribution By Industries for NY", x = "Industries", y = "Average number of employees for NY")+
  coord_flip()
graph2

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

# Answer Question 3 here

result7 = inc[complete.cases(inc),] %>%
  group_by(Industry) %>%
  summarise(Revenue=sum(Revenue), Employees=sum(Employees)) %>%
  mutate(Revenue_per_Employee = Revenue/Employees)

result7

## # A tibble: 25 x 4
##    Industry                         Revenue Employees Revenue_per_Employee
##    <fct>                              <dbl>     <int>                <dbl>
##  1 Advertising & Marketing       7785000000     39731              195943.
##  2 Business Products & Services 26345900000    117357              224494.
##  3 Computer Hardware            11885700000      9714             1223564.
##  4 Construction                 13174300000     29099              452741.
##  5 Consumer Products & Services 14956400000     45464              328972.
##  6 Education                     1139300000      7685              148250.
##  7 Energy                       13771600000     26437              520921.
##  8 Engineering                   2532500000     20435              123930.
##  9 Environmental Services        2638800000     10155              259852.
## 10 Financial Services           13150900000     47693              275741.
## # ... with 15 more rows

graph3 <- ggplot(result7, aes(x=reorder(result7$Industry, result7$Revenue_per_Employee), y=result7$Revenue_per_Employee, fill = "lightblue", )) + 
  theme(legend.position = "none", axis.text.y = element_text(size=8), axis.text.x = element_text(size=8),  panel.background = element_blank()) +
  geom_bar(stat = "identity") + 
  #geom_label(aes(label=paste((result7$Employees), " Total EEs", sep = "")), position = position_dodge(width = 0.5), size = 3.0,   label.padding = unit(0.08, "lines"), label.size = 0.15, inherit.aes = TRUE)+
  labs(title = "Distribution By Industries -- Revenue Per Employee", x = "Industries", y = "Revenue Per Employee")+
  coord_flip()
graph3

R Notebook

Ajay Arora

Question 1

Quesiton 2

Question 3