CUNY DATA608 - Project 1

Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. Let's read this in:

inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)

And let's preview this data:

head(inc)

##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX

summary(inc)

##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2502   110 Consulting        :   1   Mean   :  4.612  
##  3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4995                    
##     Revenue                                  Industry      Employees      
##  Min.   :2.000e+06   IT Services                 : 733   Min.   :    1.0  
##  1st Qu.:5.100e+06   Business Products & Services: 482   1st Qu.:   25.0  
##  Median :1.090e+07   Advertising & Marketing     : 471   Median :   53.0  
##  Mean   :4.822e+07   Health                      : 355   Mean   :  232.7  
##  3rd Qu.:2.860e+07   Software                    : 342   3rd Qu.:  132.0  
##  Max.   :1.010e+10   Financial Services          : 260   Max.   :66803.0  
##                      (Other)                     :2358   NA's   :12       
##             City          State     
##  New York     : 160   CA     : 701  
##  Chicago      :  90   TX     : 387  
##  Austin       :  88   NY     : 311  
##  Houston      :  76   VA     : 283  
##  San Francisco:  75   FL     : 282  
##  Atlanta      :  74   IL     : 273  
##  (Other)      :4438   (Other):2764

Think a bit about what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

Number of unique values for each nominal variable

library(dplyr)
library(ggplot2)
# tibble() replaces the deprecated data_frame(); dplyr re-exports it
df <- tibble('Nominal Variables' = c('Industries', 'Cities', 'States'),
             'Number of Unique Values' = c(n_distinct(inc$Industry),
                                           n_distinct(inc$City),
                                           n_distinct(inc$State)))
knitr::kable(df)

Nominal Variables	Number of Unique Values
Industries	25
Cities	1519
States	52

Number of unique cities, industries, and companies, plus total employees, by state

# One grouped summary yields the same table as four summaries joined on State
df_by_state <- inc %>%
  group_by(State) %>%
  summarize(city_cnt = n_distinct(City),
            ind_cnt = n_distinct(Industry),
            company_cnt = n_distinct(Name),
            # sum() returns NA for any state with a missing Employees value
            emp_cnt = sum(Employees))
knitr::kable(df_by_state)

State	city_cnt	ind_cnt	company_cnt	emp_cnt
AK	2	2	2	2528
AL	13	18	51	6393
AR	7	7	9	496
AZ	13	22	100	34281
CA	204	25	701	NA
CO	28	23	134	NA
CT	34	18	50	6989
DC	3	13	43	NA
DE	7	11	16	68544
FL	102	25	282	61221
GA	45	22	212	NA
HI	4	5	7	621
IA	18	13	28	11344
ID	12	10	17	5817
IL	104	25	273	NA
IN	30	21	69	12697
KS	16	17	38	8725
KY	13	14	40	5544
LA	15	13	37	10669
MA	72	24	182	24682
MD	44	22	131	40439
ME	5	8	13	879
MI	52	22	126	36905
MN	37	19	88	18534
MO	27	18	59	17296
MS	8	9	12	5531
MT	2	3	4	1673
NC	46	22	137	NA
ND	5	8	10	963
NE	7	13	27	3823
NH	15	13	24	2890
NJ	97	22	158	30162
NM	2	5	5	617
NV	6	11	26	1725
NY	90	25	311	84370
OH	79	24	186	38002
OK	12	15	46	6976
OR	12	16	49	4399
PA	80	24	164	NA
PR	1	1	1	29
RI	9	12	16	2964
SC	22	18	48	5348
SD	3	3	3	761
TN	21	20	82	14586
TX	67	24	387	NA
UT	26	20	95	19028
VA	56	24	283	35667
VT	4	6	6	1069
WA	30	19	130	NA
WI	43	20	79	NA
WV	2	2	2	240
WY	2	2	2	107
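
The NA entries in the emp_cnt column above come from states that have at least one company with a missing Employees value, because sum() returns NA whenever its input contains an NA. The following is a minimal sketch of that behaviour in base R; the toy data frame is made up for illustration and is not part of the Inc. data.

```r
# Toy data (made up for illustration): one missing Employees value in CA
toy <- data.frame(
  State     = c("CA", "CA", "NY", "NY"),
  Employees = c(100, NA, 50, 25)
)

# Default behaviour: any NA in a group makes that group's total NA
emp_default <- tapply(toy$Employees, toy$State, sum)

# With na.rm = TRUE the missing values are dropped before summing
emp_no_na <- tapply(toy$Employees, toy$State, sum, na.rm = TRUE)

print(emp_default)
print(emp_no_na)
```

The same argument can be passed inside summarize(), as in sum(Employees, na.rm = TRUE), with the caveat that the resulting totals leave out companies whose headcount is unknown.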

Question 1

Create a graph that shows the distribution of companies in the dataset by State (i.e., how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (i.e., taller than wide), which should further guide your layout choices.

ggplot(data = df_by_state,
       aes(x = reorder(State, company_cnt),
           y = company_cnt, fill = company_cnt)) +
  # geom_col() draws bars whose length is the value itself (stat = "identity")
  geom_col() +
  geom_text(aes(label = company_cnt), hjust = -0.5) +
  scale_fill_gradient(low = "blue", high = "green") +
  # Flip so the states run down the vertical axis, which suits a portrait screen
  coord_flip() +
  xlab("State") + ylab("Number of Companies") +
  labs(fill = "Number of Companies") +
  ggtitle("Distribution of Companies by State")

Question 2

Let's dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data; use R’s complete.cases() function). In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.
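
The summary(inc) output earlier lists CA (701), TX (387), and NY (311) as the three largest states by company count, which is why New York is used here. As a hedged sketch, the third-ranked state can also be looked up rather than hard-coded; the vector toy_states is made up for illustration.

```r
# Toy data (made up): state codes for ten hypothetical companies
toy_states <- c("CA", "CA", "CA", "CA", "TX", "TX", "TX", "NY", "NY", "VA")

# Count companies per state and sort from most to fewest
state_counts <- sort(table(toy_states), decreasing = TRUE)

# Name of the state in third place
third_state <- names(state_counts)[3]
print(third_state)
```

On the real data the same idea would apply to inc$State; how ties are ordered is not addressed in this sketch.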

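# Keep only rows with no missing values in any column
inc_comp <- inc %>% filter(complete.cases(.))
# NY has the third most companies (311) after CA and TX
ny <- inc_comp %>%
  filter(State == 'NY')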

ggplot(data=ny,aes(x=reorder(Industry,Employees, FUN = median),
                   y=Employees)) +
  xlab("Industry") +
  geom_boxplot() +
  scale_y_log10() +
  coord_flip() +
  ggtitle("New York State Number of Employees by Industry")
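
The log scale in the plot above compresses outliers rather than removing them. As a complementary non-visual check, the values a boxplot would draw as outlier points can be listed with boxplot.stats(), which by default uses the same 1.5 times the interquartile range whisker rule as a standard boxplot. The vector below is made up for illustration.

```r
# Toy employee counts (made up) with one extreme value
employees <- c(10, 12, 15, 18, 20, 22, 25, 3000)

# Five-number summary plus the points beyond the whiskers (default coef = 1.5)
stats <- boxplot.stats(employees)

med_emp  <- median(employees)  # robust to the extreme value
outliers <- stats$out          # values flagged as outliers

print(med_emp)
print(outliers)
```

Applied per industry, this kind of check would show which companies drive the long upper tails seen in the plot.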

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

This assumes we are working with the national data again, not just New York.

From the graph below, Computer Hardware stands out as the best, and probably the safest, industry to invest in when judged by revenue per employee.

inc_comp$rev_emp <- inc_comp$Revenue / inc_comp$Employees

ggplot(data=inc_comp,aes(x=reorder(Industry,rev_emp, FUN = median),
                   y=rev_emp)) +
  xlab("Industry") +
  ylab("Revenue per Employee") +
  geom_boxplot() +
  scale_y_log10() +
  coord_flip() +
  ggtitle("Revenue per Single Employee by Industry")
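
The ordering of industries in the chart comes from the median of rev_emp within each industry. A minimal sketch of that ranking as a plain table, using base R and a made-up data frame (the industries "A" and "B" and their values are hypothetical):

```r
# Toy data (made up): revenue and headcount for six hypothetical companies
toy <- data.frame(
  Industry  = c("A", "A", "A", "B", "B", "B"),
  Revenue   = c(1e6, 2e6, 9e6, 4e6, 5e6, 6e6),
  Employees = c(10, 10, 10, 10, 10, 10)
)
toy$rev_emp <- toy$Revenue / toy$Employees

# Median revenue per employee for each industry, highest first
med_rev_emp <- sort(tapply(toy$rev_emp, toy$Industry, median),
                    decreasing = TRUE)
print(med_rev_emp)
```

The median is used here, as in the chart, because a single very large company (such as the 9e6 row in industry "A") shifts it far less than it would shift a mean.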