R Notebook - Data 608

Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:

inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)

And lets preview this data:

head(inc)

##   Rank                         Name Growth_Rate   Revenue
## 1    1                         Fuhu      421.48 1.179e+08
## 2    2        FederalConference.com      248.31 4.960e+07
## 3    3                The HCI Group      245.45 2.550e+07
## 4    4                      Bridger      233.08 1.900e+09
## 5    5                       DataXu      213.37 8.700e+07
## 6    6 MileStone Community Builders      179.38 4.570e+07
##                       Industry Employees         City State
## 1 Consumer Products & Services       104   El Segundo    CA
## 2          Government Services        51     Dumfries    VA
## 3                       Health       132 Jacksonville    FL
## 4                       Energy        50      Addison    TX
## 5      Advertising & Marketing       220       Boston    MA
## 6                  Real Estate        63       Austin    TX

summary(inc)

##       Rank                          Name       Growth_Rate     
##  Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
##  1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
##  Median :2502   1-Stop Translation USA:   1   Median :  1.420  
##  Mean   :2502   110 Consulting        :   1   Mean   :  4.612  
##  3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
##  Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
##                 (Other)               :4995                    
##     Revenue                                  Industry      Employees      
##  Min.   :2.000e+06   IT Services                 : 733   Min.   :    1.0  
##  1st Qu.:5.100e+06   Business Products & Services: 482   1st Qu.:   25.0  
##  Median :1.090e+07   Advertising & Marketing     : 471   Median :   53.0  
##  Mean   :4.822e+07   Health                      : 355   Mean   :  232.7  
##  3rd Qu.:2.860e+07   Software                    : 342   3rd Qu.:  132.0  
##  Max.   :1.010e+10   Financial Services          : 260   Max.   :66803.0  
##                      (Other)                     :2358   NA's   :12       
##             City          State     
##  New York     : 160   CA     : 701  
##  Chicago      :  90   TX     : 387  
##  Austin       :  88   NY     : 311  
##  Houston      :  76   VA     : 283  
##  San Francisco:  75   FL     : 282  
##  Atlanta      :  74   IL     : 273  
##  (Other)      :4438   (Other):2764

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(kableExtra)
library(ggplot2)

We can start by quickly looking at what industries have most of the fast growing companies. As we can see below, by far that is the IT sector, with several categories of service companies coming second but fairly far behind. We also see there are 25 Industry sectors in the dataset.

ind<-data.frame(inc %>% group_by(Industry) %>% tally()) %>% arrange(desc(n))
ind %>% kable() %>% kable_styling() %>% scroll_box(height = "500px")

Industry	n
IT Services	733
Business Products & Services	482
Advertising & Marketing	471
Health	355
Software	342
Financial Services	260
Manufacturing	256
Consumer Products & Services	203
Retail	203
Government Services	202
Human Resources	196
Construction	187
Logistics & Transportation	155
Food & Beverage	131
Telecommunications	129
Energy	109
Real Estate	96
Education	83
Engineering	74
Security	73
Travel & Hospitality	62
Media	54
Environmental Services	51
Insurance	50
Computer Hardware	44

But out of the 5,000 companies IT Services is a somewhat low percentage. So no sector really dominating growth.

ind[which(ind$Industry=="IT Services"),]$n/nrow(inc)*100

## [1] 14.65707

We can do the same analysis for States, to see if a State in particular is where most of the growth is happening.

sta<-data.frame(inc %>% group_by(State) %>% tally()) %>% arrange(desc(n))
sta %>% kable() %>% kable_styling() %>% scroll_box(height = "500px")

State	n
CA	701
TX	387
NY	311
VA	283
FL	282
IL	273
GA	212
OH	186
MA	182
PA	164
NJ	158
NC	137
CO	134
MD	131
WA	130
MI	126
AZ	100
UT	95
MN	88
TN	82
WI	79
IN	69
MO	59
AL	51
CT	50
OR	49
SC	48
OK	46
DC	43
KY	40
KS	38
LA	37
IA	28
NE	27
NV	26
NH	24
ID	17
DE	16
RI	16
ME	13
MS	12
ND	10
AR	9
HI	7
VT	6
NM	5
MT	4
SD	3
AK	2
WV	2
WY	2
PR	1

Here California comes on top a fair bit. Glad to see my home state of Texas coming second, although a fair bit behind.

sta[which(sta$State=="CA"),]$n/nrow(inc)*100

## [1] 14.0172

sta[which(sta$State=="TX"),]$n/nrow(inc)*100

## [1] 7.738452

We can actually quickly look at the proportion of all states.

-sort(desc(prop.table(table(inc$State)))) %>% kable() %>% kable_styling() %>% scroll_box(height = "500px")

	x
CA	0.14017197
TX	0.07738452
NY	0.06218756
VA	0.05658868
FL	0.05638872
IL	0.05458908
GA	0.04239152
OH	0.03719256
MA	0.03639272
PA	0.03279344
NJ	0.03159368
NC	0.02739452
CO	0.02679464
MD	0.02619476
WA	0.02599480
MI	0.02519496
AZ	0.01999600
UT	0.01899620
MN	0.01759648
TN	0.01639672
WI	0.01579684
IN	0.01379724
MO	0.01179764
AL	0.01019796
CT	0.00999800
OR	0.00979804
SC	0.00959808
OK	0.00919816
DC	0.00859828
KY	0.00799840
KS	0.00759848
LA	0.00739852
IA	0.00559888
NE	0.00539892
NV	0.00519896
NH	0.00479904
ID	0.00339932
DE	0.00319936
RI	0.00319936
ME	0.00259948
MS	0.00239952
ND	0.00199960
AR	0.00179964
HI	0.00139972
VT	0.00119976
NM	0.00099980
MT	0.00079984
SD	0.00059988
AK	0.00039992
WV	0.00039992
WY	0.00039992
PR	0.00019996

We can also ask specifically what are the top 20 faster growing companies.

inc %>% arrange(desc(Growth_Rate)) %>% head(20) %>% kable() %>% kable_styling() %>% scroll_box(height = "500px")

Rank	Name	Growth_Rate	Revenue	Industry	Employees	City	State
1	Fuhu	421.48	1.179e+08	Consumer Products & Services	104	El Segundo	CA
2	FederalConference.com	248.31	4.960e+07	Government Services	51	Dumfries	VA
3	The HCI Group	245.45	2.550e+07	Health	132	Jacksonville	FL
4	Bridger	233.08	1.900e+09	Energy	50	Addison	TX
5	DataXu	213.37	8.700e+07	Advertising & Marketing	220	Boston	MA
6	MileStone Community Builders	179.38	4.570e+07	Real Estate	63	Austin	TX
7	Value Payment Systems	174.04	2.550e+07	Financial Services	27	Nashville	TN
8	Emerge Digital Group	170.64	2.390e+07	Advertising & Marketing	75	San Francisco	CA
9	Goal Zero	169.81	3.310e+07	Consumer Products & Services	97	Bluffdale	UT
10	Yagoozon	166.89	1.860e+07	Retail	15	Warwick	RI
11	OBXtek	164.33	2.960e+07	Government Services	149	Tysons Corner	VA
12	AdRoll	150.65	3.410e+07	Advertising & Marketing	165	San Francisco	CA
13	uBreakiFix	141.02	1.700e+07	Retail	250	Orlando	FL
14	Sparc	128.63	2.110e+07	Software	160	Charleston	SC
15	LivingSocial	123.33	5.360e+08	Consumer Products & Services	4100	Washington	DC
16	Amped Wireless	110.68	1.430e+07	Computer Hardware	26	Chino	CA
17	Intelligent Audit	105.73	1.450e+08	Logistics & Transportation	15	Rochelle Park	NJ
18	Integrity Funding	104.62	1.110e+07	Financial Services	11	Sarasota	FL
19	Vertex Body Sciences	100.10	1.180e+07	Food & Beverage	51	columbus	OH
20	BlueKai	92.45	2.680e+07	Advertising & Marketing	107	Cupertino	CA

Finally we can ask which companies are growing quicker, small or large. If we define a small company as having less than 100 employees, medium size between 100 and 1000, and large more than 1000, we find most fast growing companies are small.

b <- c(-Inf, 100, 1000, Inf)
names <- c("Small", "Medium", "Large")
inc$bin <- cut(inc$Employees, breaks = b, labels = names)
inc %>% group_by(bin) %>% tally() %>% kable() %>% kable_styling()

bin	n
Small	3456
Medium	1365
Large	168
NA	12

Interestingly we stumble upon some companies which do not report number of employees, something that was shown in the summary of the data frame.

inc[is.na(inc$Employees),] %>% kable() %>% kable_styling()

	Rank	Name	Growth_Rate	Revenue	Industry	Employees	City	State	bin
183	183	First Flight Solutions	22.32	2700000	Logistics & Transportation	NA	Emerald Isle	NC	NA
1063	1064	Popchips	3.98	93300000	Food & Beverage	NA	San Francisco	CA	NA
1123	1124	Vocalocity	3.72	42900000	Telecommunications	NA	Atlanta	GA	NA
1652	1653	Higher Logic	2.36	6000000	Software	NA	Washington	DC	NA
1685	1686	Global Communications Group	2.30	3600000	Telecommunications	NA	Englewood	CO	NA
2196	2197	JeffreyM Consulting	1.68	12100000	Business Products & Services	NA	Bellevue	WA	NA
2742	2743	Excalibur Exhibits	1.27	9900000	Business Products & Services	NA	houston	TX	NA
3000	3001	Heartland Business Systems	1.12	156300000	IT Services	NA	Little Chute	WI	NA
3978	3978	SSEC	0.68	80400000	Manufacturing	NA	Horsham	PA	NA
4112	4112	Carolinas Home Medical Equipment	0.64	3300000	Health	NA	Matthews	NC	NA
4566	4566	Oakbrook	0.48	8900000	Real Estate	NA	Madison	WI	NA
4968	4968	Popcorn Palace	0.35	5500000	Food & Beverage	NA	Schiller Park	IL	NA

nrow(inc[is.na(inc$Employees),])

## [1] 12

Question 1

Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.

sta$State <- factor(sta$State, levels = sta$State[order(sta$n)])

ggplot(sta, aes(State,n)) + geom_bar(stat="identity", fill='grey52') + coord_flip() + ylab("Number of Companies") + theme_minimal() + ggtitle("Distribution of Fastest Growing Companies Per State") + theme(plot.title = element_text(hjust = 0.5)) + theme(
  panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "orchid"), 
  panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray100"),
  panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray95")
  )

This plot is hard to read, other than to know overall which states have the most fast growing companies. For a more detailed look we can plot just the top 5 states.

ggplot(head(sta,5), aes(State,n)) + geom_bar(stat="identity", fill='grey52') + coord_flip() + ylab("Number of Companies") + theme_minimal() + ggtitle("Distribution of Fastest Growing Companies Per State") + theme(plot.title = element_text(hjust = 0.5)) + geom_text(aes(label=n), vjust=0.3, hjust=1.6, color="white", size=3.5) + theme(
  panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "orchid"), 
  panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray100"),
  panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray95") 
  )

Quesiton 2

Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

ny <- inc %>% filter(State == 'NY') %>% filter(complete.cases(.)) %>% group_by(Industry) %>% summarize(Average=mean(Employees), Median=median(Employees)) 
ny %>% kable() %>% kable_styling() %>% scroll_box(height = "500px")

Industry	Average	Median
Advertising & Marketing	58.43860	38.0
Business Products & Services	1492.46154	70.5
Computer Hardware	44.00000	44.0
Construction	61.00000	24.5
Consumer Products & Services	626.29412	25.0
Education	59.85714	50.5
Energy	129.20000	120.0
Engineering	53.50000	54.5
Environmental Services	155.00000	155.0
Financial Services	144.30769	81.0
Food & Beverage	76.44444	41.0
Government Services	17.00000	17.0
Health	81.84615	45.0
Human Resources	437.54545	56.0
Insurance	32.50000	32.5
IT Services	204.09302	54.0
Logistics & Transportation	29.50000	23.5
Manufacturing	73.30769	30.0
Media	108.00000	45.0
Real Estate	18.25000	18.0
Retail	24.78571	13.5
Security	135.00000	32.5
Software	245.92308	80.0
Telecommunications	95.35294	31.0
Travel & Hospitality	547.71429	61.0

library(reshape2)

ggplot(melt(ny, id.vars='Industry'), aes(x=Industry, y=value, fill=variable)) + geom_bar(stat='identity', position='dodge') + coord_flip() + theme_minimal() + theme(
  panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "orchid"), 
  panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray100"),
  panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray95")) +
  ggtitle("Employees per Industry in New York") + ylab("Employees")

A large discrepancy between mean and median is noticeable for several industries. This discrepancy usually means the dataset has outliners which affect the average more than the median calculations. This is still valuable information, if the data is accurate. We now know in Travel & Hospitality, HR, Consumer Products & Services and especially in Business Products & Services, there are companies that employ a disproportionate number of people compared to their peers. Outliers can be visualized with a simple boxplot, which results in a not so pretty graph.

ny_all <- inc %>% filter(State == 'NY') %>% filter(complete.cases(.)) %>% group_by(Industry)
ggplot(ny_all, aes(x=Industry, y=Employees)) + geom_boxplot(fill='gray90',outlier.colour="red", outlier.shape=8,
                outlier.size=4) + coord_flip() + theme_minimal() + theme(
  panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "orchid"), 
  panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray100"),
  panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray95")) +
  ggtitle("Employees per Industry in New York") + ylab("Employees")

We can eliminate the outliers and re-plot ordering by the median. Outliers are defined for the entire dataset, not for individual groups.

quantiles <- quantile(ny_all$Employees, probs = c(.25, .75))
quantiles

##   25%   75% 
##  21.0 105.5

range <- 1.5 * IQR(ny_all$Employees)
range

## [1] 126.75

ny_clean<-subset.data.frame(ny_all,(quantiles[1]-range < ny_all$Employees & quantiles[1]+range > ny_all$Employees ))


ggplot(ny_clean, aes(x=reorder(Industry,Employees,FUN=median), y=Employees)) + geom_boxplot(fill='gray90') + coord_flip() + theme_minimal() + theme(
  panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "orchid"), 
  panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray100"),
  panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray95")) +
  ggtitle("Employees per Industry in New York") + ylab("Employees") + xlab("Industry")

Now we can go back to our original plot and make better sense of the data now that there aren’t large discrepancies between mean and median.

ny <- ny_clean %>% summarize(Average=mean(Employees), Median=median(Employees)) 
ny %>% kable() %>% kable_styling() %>% scroll_box(height = "500px")

Industry	Average	Median
Advertising & Marketing	41.84615	35.0
Business Products & Services	47.47059	32.0
Computer Hardware	44.00000	44.0
Construction	29.40000	24.0
Consumer Products & Services	40.43750	24.0
Education	49.07692	50.0
Energy	88.00000	105.0
Engineering	53.50000	54.5
Environmental Services	60.00000	60.0
Financial Services	51.25000	39.0
Food & Beverage	38.12500	35.5
Government Services	17.00000	17.0
Health	41.90000	38.5
Human Resources	39.00000	30.0
Insurance	32.50000	32.5
IT Services	53.77419	40.0
Logistics & Transportation	29.50000	23.5
Manufacturing	40.54545	26.0
Media	42.88889	16.0
Real Estate	18.25000	18.0
Retail	24.78571	13.5
Security	30.00000	25.0
Software	48.50000	37.0
Telecommunications	44.15385	25.0
Travel & Hospitality	34.75000	36.0

ggplot(melt(ny, id.vars='Industry'), aes(x=Industry, y=value, fill=variable)) + geom_bar(stat='identity', position='dodge') + coord_flip() + theme_minimal() + theme(
  panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "orchid"), 
  panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray100"),
  panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray95")) +
  ggtitle("Employees per Industry in New York") + ylab("Employees")

Some groups still show outliers, but as stated before these still provide valuable information about the number of employees per industry sector. The Energy industry is surely the largest employer, with Environmental Services and Engineering coming behind as distant second and third. Overall Government Services shows having the least number of employees.

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

ny_rev_ee <- ny_all %>%  mutate(Revenue_Per_EE = Revenue/Employees) %>%  group_by(Industry) %>% summarize(Average_Rev_Per_EE = mean(Revenue_Per_EE), Median_Rev_Per_EE = median(Revenue_Per_EE))

ny_rev_ee %>% kable() %>% kable_styling() %>% scroll_box(height = "500px")

Industry	Average_Rev_Per_EE	Median_Rev_Per_EE
Advertising & Marketing	373403.5	255555.6
Business Products & Services	527816.9	203148.1
Computer Hardware	520454.5	520454.5
Construction	238694.5	238613.0
Consumer Products & Services	382942.6	222222.2
Education	112060.6	114848.5
Energy	8472533.5	283211.7
Engineering	215744.7	202998.1
Environmental Services	134366.7	134366.7
Financial Services	400174.4	219607.8
Food & Beverage	174630.9	120238.1
Government Services	158823.5	158823.5
Health	532491.0	155000.0
Human Resources	337366.3	175000.0
Insurance	371000.0	371000.0
IT Services	228816.1	164285.7
Logistics & Transportation	1245870.1	996285.7
Manufacturing	665818.6	217500.0
Media	333549.6	262500.0
Real Estate	383809.5	330952.4
Retail	520790.3	305000.0
Security	153277.8	149000.0
Software	143749.0	133333.3
Telecommunications	408143.4	410714.3
Travel & Hospitality	282089.8	223333.3

ggplot(melt(ny_rev_ee, id.vars='Industry'), aes(x=Industry, y=value, fill=variable)) + geom_bar(stat='identity', position='dodge') + coord_flip() + theme_minimal() + theme(
  panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "orchid"), 
  panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray100"),
  panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray95")) +
  ggtitle("Revenue per Employee per Industry in New York") + ylab("Revenue per Employee")

As in the previous case, we see at least one industry with a large discrepancy between average and median. We proceed with the same analysis.

ny_all <- ny_all %>%  mutate(Revenue_Per_EE = Revenue/Employees)

ggplot(ny_all, aes(x=Industry, y=Revenue_Per_EE)) + geom_boxplot(fill='gray90',outlier.colour="red", outlier.shape=8,
                outlier.size=4) + coord_flip() + theme_minimal() + theme(
  panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "orchid"), 
  panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray100"),
  panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray95")) +
  ggtitle("Employees per Industry in New York") + ylab("Employees")

quantiles <- quantile(ny_all$Revenue_Per_EE, probs = c(.25, .75))
quantiles

##      25%      75% 
## 138653.7 373333.3

range <- 1.5 * IQR(ny_all$Revenue_Per_EE)
range

## [1] 352019.4

ny_clean<-subset.data.frame(ny_all,(quantiles[1]-range < ny_all$Revenue_Per_EE & quantiles[1]+range > ny_all$Revenue_Per_EE))


ggplot(ny_clean, aes(x=reorder(Industry,Revenue_Per_EE,FUN=median), y=Revenue_Per_EE)) + geom_boxplot(fill='gray90') + coord_flip() + theme_minimal() + theme(
  panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "orchid"), 
  panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray100"),
  panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray95")) +
  ggtitle("Revenue per Employee per Industry in New York") + ylab("Revenue per Employee") + xlab("Industry")

ny <- ny_clean %>% summarize(Average=mean(Revenue_Per_EE), Median=median(Revenue_Per_EE)) 
ny %>% kable() %>% kable_styling() %>% scroll_box(height = "500px")

Industry	Average	Median
Advertising & Marketing	229838.8	200000.0
Business Products & Services	184010.4	180000.0
Construction	238694.5	238613.0
Consumer Products & Services	187158.4	157777.8
Education	112060.6	114848.5
Energy	149778.0	106122.4
Engineering	215744.7	202998.1
Environmental Services	134366.7	134366.7
Financial Services	209618.4	195169.1
Food & Beverage	174630.9	120238.1
Government Services	158823.5	158823.5
Health	114565.5	124000.0
Human Resources	117477.8	103267.7
Insurance	180000.0	180000.0
IT Services	176233.7	160502.9
Logistics & Transportation	299740.3	299740.3
Manufacturing	224302.8	186363.6
Media	254404.6	252844.7
Real Estate	267301.6	300000.0
Retail	259547.4	260000.0
Security	153277.8	149000.0
Software	143749.0	133333.3
Telecommunications	296756.2	373333.3
Travel & Hospitality	209660.3	184617.5

ggplot(melt(ny, id.vars='Industry'), aes(x=Industry, y=value, fill=variable)) + geom_bar(stat='identity', position='dodge') + coord_flip() + theme_minimal() + theme(
  panel.grid.major.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "orchid"), 
  panel.grid.minor.x = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray100"),
  panel.grid.major.y = element_line(size = 0.15, linetype = 'solid',
                                colour = "gray95")) +
  ggtitle("Renevue Employee per Industry in New York") + ylab("Revenue per Employee")

Now looking at the data, Telecommunications stands out as the most efficient industry (highest revenue per employee), with Logistics and transportation coming second. The one to avoid is Human Resources which comes last, with Health and Education also close to the bottom.

R Notebook - Data 608 - Homework 1

Peter Kowalchuk

September 6, 2019

Question 1

Quesiton 2

Question 3

State	n
CA	701
TX	387
NY	311
VA	283
FL	282
IL	273
GA	212
OH	186
MA	182
PA	164
NJ	158
NC	137
CO	134
MD	131
WA	130
MI	126
AZ	100
UT	95
MN	88
TN	82
WI	79
IN	69
MO	59
AL	51
CT	50
OR	49
SC	48
OK	46
DC	43
KY	40
KS	38
LA	37
IA	28
NE	27
NV	26
NH	24
ID	17
DE	16
RI	16
ME	13
MS	12
ND	10
AR	9
HI	7
VT	6
NM	5
MT	4
SD	3
AK	2
WV	2
WY	2
PR	1

State	n
CA	701
TX	387
NY	311
VA	283
FL	282
IL	273
GA	212
OH	186
MA	182
PA	164
NJ	158
NC	137
CO	134
MD	131
WA	130
MI	126
AZ	100
UT	95
MN	88
TN	82
WI	79
IN	69
MO	59
AL	51
CT	50
OR	49
SC	48
OK	46
DC	43
KY	40
KS	38
LA	37
IA	28
NE	27
NV	26
NH	24
ID	17
DE	16
RI	16
ME	13
MS	12
ND	10
AR	9
HI	7
VT	6
NM	5
MT	4
SD	3
AK	2
WV	2
WY	2
PR	1

State	n
CA	701
TX	387
NY	311
VA	283
FL	282
IL	273
GA	212
OH	186
MA	182
PA	164
NJ	158
NC	137
CO	134
MD	131
WA	130
MI	126
AZ	100
UT	95
MN	88
TN	82
WI	79
IN	69
MO	59
AL	51
CT	50
OR	49
SC	48
OK	46
DC	43
KY	40
KS	38
LA	37
IA	28
NE	27
NV	26
NH	24
ID	17
DE	16
RI	16
ME	13
MS	12
ND	10
AR	9
HI	7
VT	6
NM	5
MT	4
SD	3
AK	2
WV	2
WY	2
PR	1