#Objective
Using a dataset of the 5,000 fastest-growing companies in the US, as compiled by Inc. magazine, answer the questions below (plotting limited to ggplot2):
##Question One
Create a graph that shows the distribution of companies in the dataset by State (i.e. how many are in each state). There are a lot of States, so consider which axis you should use assuming I am using a 'portrait' oriented screen (i.e. taller than wide).
###Variety of Charts/Graphics
####Map of the United States
####Scatterplots
I did not think a scatterplot was appropriate here.
####Bar Charts
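For a portrait screen, a bar chart reads best with the states on the vertical axis. The block below is a minimal sketch of that idea, not necessarily the chart originally shown here. The data frame `toy` and its counts are made up for illustration and stand in for the real per-state summary.

```r
library(ggplot2)

# Hypothetical toy counts standing in for the real per-state summary
toy <- data.frame(
  State = c("CA", "TX", "NY", "VA", "FL"),
  count = c(70, 39, 31, 28, 28)
)

# reorder() sorts the states by count so the longest bar ends up at the top;
# coord_flip() moves the states onto the vertical axis for a tall screen
p_bar <- ggplot(toy, aes(x = reorder(State, count), y = count)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "State", y = "Number of companies",
       title = "Companies per State (toy data)")
p_bar
```

With the full data set the same call would take the per-state count table in place of `toy`.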
####Circular Charts
Note: I agree with the sentiment that pie charts are not always the best visual aids. I try to avoid them, but I do find them useful while searching for the best visual aid. This section is not limited to pie charts; it covers circular charts in general (such as radar charts).
###Recommended/Favorite Graphic
My personal favorite graphic was the first bar chart, but it does not fit the display very well, so perhaps the best chart is the continuous bar chart. The chart I would have liked to use is one I could not figure out how to create: a single line on the y-axis stretching from 0 to the maximum state count, with a label at each state's count.
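One possible way to approximate the single-line chart described above, using only ggplot2, is sketched here. It is an untested idea against the real data: the data frame `toy` is hypothetical, and states that share a count are merged into one label (an assumption made to stop the text from overprinting).

```r
library(ggplot2)

# Hypothetical toy counts; the real input would be the per-state summary
toy <- data.frame(
  State = c("CA", "TX", "NY", "VA", "FL", "AK"),
  count = c(70, 39, 31, 28, 28, 2)
)

# Merge states that share a count into a single label, e.g. "VA, FL"
labels <- aggregate(State ~ count, data = toy, FUN = paste, collapse = ", ")

p_line <- ggplot(labels) +
  # the single vertical line from 0 to the maximum state count
  annotate("segment", x = 0, xend = 0, y = 0, yend = max(labels$count)) +
  # a marker and a text label at each observed count
  geom_point(aes(x = 0, y = count)) +
  geom_text(aes(x = 0.02, y = count,
                label = paste0(State, " (", count, ")")), hjust = 0) +
  scale_x_continuous(limits = c(-0.1, 1)) +
  labs(x = NULL, y = "Number of companies") +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
p_line
```

With roughly fifty states, labels at nearby counts would still crowd each other, so a tall figure height would likely be needed.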
##Question Two
Let's dig in on the State with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot of average employment by industry for companies in this state (only use cases with full data; use R's complete.cases() function). Your graph should show how variable the ranges are, and exclude outliers.
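The question asks for an average, while a box plot draws the median. A minimal sketch of one way to show both, together with the `complete.cases()` filter, follows. The data frame `toy` is hypothetical, and marking the mean with `stat_summary()` is an illustrative choice rather than the method used in this write-up.

```r
library(ggplot2)

# Hypothetical toy rows; one Employees value is NA on purpose
toy <- data.frame(
  Industry  = c("Software", "Software", "Software", "Health", "Health", "Health"),
  Employees = c(10, 25, NA, 40, 55, 300)
)

# complete.cases() is TRUE only for rows with no missing value in any column
toy_full <- toy[complete.cases(toy), ]

p_emp <- ggplot(toy_full, aes(x = Industry, y = Employees)) +
  geom_boxplot(outlier.shape = NA) +  # draw the boxes but hide outlier points
  # mark the mean of each industry; the argument is named `fun` in
  # ggplot2 >= 3.3.0 (older versions used `fun.y`)
  stat_summary(fun = mean, geom = "point", shape = 4, size = 3) +
  coord_flip()                        # industry names on the vertical axis
p_emp
```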
##Question Three
Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear.
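A complementary view to per-company box plots is a single ranked figure per industry: total revenue divided by total employees. The sketch below assumes that aggregate definition, which differs from averaging each company's own ratio, and uses a hypothetical data frame `toy` with made-up numbers.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy rows standing in for the complete cases of the real data
toy <- data.frame(
  Industry  = c("Software", "Software", "Health", "Health", "Retail"),
  Revenue   = c(2e6, 4e6, 9e6, 3e6, 5e6),
  Employees = c(10, 20, 30, 50, 50)
)

# Industry-level revenue per employee: sum of revenue over sum of employees
rev_by_ind <- toy %>%
  group_by(Industry) %>%
  summarise(RevperEmp = sum(Revenue) / sum(Employees))

# Ranked horizontal bars, highest revenue per employee at the top
p_rev <- ggplot(rev_by_ind, aes(x = reorder(Industry, RevperEmp), y = RevperEmp)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Industry", y = "Revenue per employee")
p_rev
```

A ranked bar answers "which industry is highest" at a glance, while the box plots show how much companies vary within each industry.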
##All Together (Code Displayed)
#load packages and data
library(dplyr)
library(ggplot2)
inc500 <- read.csv("https://raw.githubusercontent.com/chrisgmartin/DATA608/master/lecture1/inc5000_data.csv", stringsAsFactors=FALSE)
#QUESTION ONE
#Count number of companies per state
statecount <- inc500 %>%
  group_by(State) %>%
  dplyr::summarise(count = n())
#using a second df for data quality
statecount2 <- statecount
#continuous count bar chart
#for US state counts
#ggplot2 (>= 2.2.0) stacks the first State at the top, so label midpoints are accumulated from the last State upward
y.breaks <- rev(cumsum(rev(statecount2$count)) - rev(statecount2$count)/2)
ggplot(statecount2, aes(x=1, y=count, fill=State)) +
  geom_bar(stat="identity", colour='black') +
  ggtitle("Continuous Count of Fastest Growing Companies") +
  guides(fill=guide_legend(override.aes=list(colour=NA))) +
  theme(axis.text.x=element_text(color='black')) +
  scale_y_continuous(breaks=y.breaks, labels=statecount2$State)
#QUESTION TWO
#filter for third highest state count and remove incomplete cases
statecount3 <- statecount
statecount3 <- statecount3[order(-statecount3$count),]
statecount3 <- statecount3$State[3]
indcount <- inc500[complete.cases(inc500),]
indcount <- filter(indcount, State == statecount3)
#plot
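#note: the y limits drop rows outside the 10th-90th percentile of Employees before the box statistics are computed
ggplot(indcount, aes(x=Industry, y=Employees)) +
  geom_boxplot(outlier.shape = NA) +
  scale_y_continuous(limits=quantile(indcount$Employees, c(0.1, 0.9))) +
  theme(axis.text.x = element_text(angle=90, hjust=1))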
#QUESTION THREE
#all of the United States
revemp <- inc500
revemp <- revemp[complete.cases(revemp),]
revemp$RevperEmp <- revemp$Revenue/revemp$Employees
ggplot(revemp, aes(x=Industry, y=RevperEmp)) +
  geom_boxplot(outlier.shape = NA) +
  scale_y_continuous(limits=quantile(revemp$RevperEmp, c(0.1, 0.9))) +
  theme(axis.text.x = element_text(angle=90, hjust=1)) +
  ggtitle("Revenue per Employee by Industry (all of the United States)")
#for third highest state count
revemp <- inc500[complete.cases(inc500),]
revemp <- filter(revemp, State == statecount3)
revemp$RevperEmp <- revemp$Revenue/revemp$Employees
ggplot(revemp, aes(x=Industry, y=RevperEmp)) +
  geom_boxplot(outlier.shape = NA) +
  scale_y_continuous(limits=quantile(revemp$RevperEmp, c(0.1, 0.9))) +
  theme(axis.text.x = element_text(angle=90, hjust=1)) +
  ggtitle("Revenue per Employee by Industry (Third Highest State with FGC)")