NOTE: See the github repository for the .rmd and input data: https://github.com/msekhar12/Data_Visualization_Course
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df <- read.csv('inc5000_data.csv', header=TRUE)
#Creating another data frame with group by
df_g <- group_by(df,State)
df_g <- summarise(df_g,n=n())
df_g <- arrange(df_g,(n))
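The grouping steps above can be tried on a small example. This is a minimal sketch on hypothetical toy data (the `toy` data frame is not taken from inc5000_data.csv); it also shows `count()`, a dplyr shorthand for `group_by()` followed by `summarise(n = n())`.

```r
library(dplyr)

# Toy data (hypothetical, not from inc5000_data.csv)
toy <- data.frame(State = c("CA", "NY", "CA", "TX", "CA", "NY"))

# The same three steps as above, written as one pipeline
by_state <- toy %>%
  group_by(State) %>%
  summarise(n = n()) %>%
  arrange(n)

# count() collapses group_by() + summarise(n = n()) into one call
by_state_short <- toy %>%
  count(State) %>%
  arrange(n)

print(by_state)
```

Both pipelines should give one row per state, sorted from the fewest to the most rows.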
We will plot a bar chart showing the number of companies in each state of the USA.
ggplot(df_g, aes(x = reorder(State, n), y = n,label=n,ymax=n+10)) +
geom_bar(stat = "identity",fill="blue") +
geom_text( position = position_dodge(1), vjust = 0.5,hjust=0) +
labs(title="Distribution of Companies across all the states",y="Number of companies",x="State") +
theme(legend.position="none") +
coord_flip()
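The `reorder(State, n)` call in the plot above controls the order of the bars. As a small illustration on hypothetical toy values (the names `state` and `n` here are made up for the example), `reorder()` returns a factor whose levels are sorted by the second argument in ascending order:

```r
# Toy values (hypothetical): three categories and their counts
state <- c("a", "b", "c")
n <- c(3, 1, 2)

# reorder() returns a factor whose levels are sorted by n, ascending
ordered_state <- reorder(state, n)
print(levels(ordered_state))
```

Because `coord_flip()` draws the first factor level at the bottom, ascending levels place the largest bar at the top of the chart.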
The states in the graph are ordered by number of companies, in descending order. California (CA) has the most companies, with a total of 701.
From the bar chart, we can see that New York (NY) has 311 companies and ranks third.
Let us create a data frame containing the data related to NY state only.
df_NY <- filter(df, State =='NY')
df_NY <- df_NY[complete.cases(df_NY),]
Now we will group the data by industry, eliminate any outliers based on the employee count within each industry, and create a data set that contains each industry along with the minimum and maximum of its accepted range of employee counts. The minimum is calculated as \(Employee_{0.25} - 1.5 IQR\), where \(Employee_{0.25}\) is the \(25^{th}\) percentile of the number of employees and IQR is the inter-quartile range. Note that this formula can produce a negative minimum, even though an employee count cannot be negative. The maximum is calculated as \(Employee_{0.75} + 1.5 IQR\), where \(Employee_{0.75}\) is the \(75^{th}\) percentile of the employee count.
df_NY_grp <- df_NY %>%
filter(!is.na(Employees)) %>%
group_by(Industry) %>%
summarise(min=quantile(Employees,0.25) - IQR(Employees) * 1.5,max=quantile(Employees,0.75) + 1.5* IQR(Employees))
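To see how these bounds behave, here is a minimal sketch on a hypothetical vector of employee counts (the numbers are invented for illustration). It uses R's default quantile method, as the code above does, and shows a case where the lower bound is negative.

```r
# Toy employee counts (hypothetical), with one extreme value
employees <- c(10, 20, 30, 40, 500)

q1  <- quantile(employees, 0.25, names = FALSE)  # 25th percentile: 20
q3  <- quantile(employees, 0.75, names = FALSE)  # 75th percentile: 40
iqr <- IQR(employees)                            # 40 - 20 = 20

lower <- q1 - 1.5 * iqr  # 20 - 30 = -10 (negative lower bound)
upper <- q3 + 1.5 * iqr  # 40 + 30 = 70

# Keep only the values inside [lower, upper]; 500 is dropped
kept <- employees[employees >= lower & employees <= upper]
print(kept)
```

A negative lower bound is harmless here: it simply means that no small value is treated as an outlier.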
Now we will join the data frame obtained in the previous step with the NY data, keep only the companies whose employee count lies within the range for their industry, and create box plots of the number of employees in the various industries in NY.
df_NY_no_outliers <- select(df_NY,Industry,Employees) %>%
inner_join(df_NY_grp) %>%
filter(Employees >= min & Employees <= max)
## Joining by: "Industry"
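The join-then-filter step can be sketched on hypothetical toy data (the `companies` and `bounds` data frames below are invented for the example). Passing `by = "Industry"` names the join key explicitly, which also avoids the "Joining by" message.

```r
library(dplyr)

# Toy data (hypothetical): per-company rows and per-industry bounds
companies <- data.frame(
  Industry  = c("Media", "Media", "Media", "Retail", "Retail"),
  Employees = c(12, 30, 900, 45, 60)
)
bounds <- data.frame(
  Industry = c("Media", "Retail"),
  min = c(-10, 20),
  max = c(100, 80)
)

# Attach each industry's bounds to its companies, then keep only the
# rows whose employee count falls inside those bounds
kept <- companies %>%
  inner_join(bounds, by = "Industry") %>%
  filter(Employees >= min, Employees <= max) %>%
  select(Industry, Employees)

print(kept)
```

In this toy case the Media company with 900 employees is removed, and the other four rows remain.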
ggplot(df_NY_no_outliers,aes(Industry,Employees))+
geom_boxplot(width=1)+
#scale_y_continuous(limits = c(0, 900)) +
coord_flip()+
labs(title="Box-plots of number of employees in various industries in NY State",y="Number of Employees")
In the above plot, the “Travel & Hospitality” industry shows the widest spread, and its data is skewed to the right. The “Environmental Services” industry has the highest median number of employees among all industries in NY state.
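The following R code computes the revenue per employee for each industry, as total revenue divided by total employees, and plots a bar chart. Note that df_comp is built from the full data frame df rather than df_NY, so as written the figures cover companies in all states; replace df with df_NY to restrict the calculation to NY.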
#nrow(df)
df_comp <- df[complete.cases(df),]
#nrow(df_comp)
Rev_per_employee <- df_comp %>%
group_by(Industry) %>%
summarise(Rev_per_employee=sum(Revenue)/sum(Employees))
nrow(Rev_per_employee)
## [1] 25
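The summary above divides total revenue by total employees within each industry, which is not the same as averaging each company's own revenue per employee. A minimal sketch on hypothetical figures (two invented companies in one industry) shows the difference:

```r
# Toy figures (hypothetical): two companies in one industry
revenue   <- c(1000, 1800)
employees <- c(10, 90)

# What the summarise() call above computes: ratio of the totals
ratio_of_sums <- sum(revenue) / sum(employees)  # 2800 / 100 = 28

# Alternative: average of the per-company ratios
mean_of_ratios <- mean(revenue / employees)     # (100 + 20) / 2 = 60

print(c(ratio_of_sums = ratio_of_sums, mean_of_ratios = mean_of_ratios))
```

The ratio of totals weights each company by its headcount, so large employers dominate the result; the mean of ratios treats every company equally.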
#Rev_per_employee$Rev_per_emp_percent <- (Rev_per_employee$Rev_per_employee/sum(Rev_per_employee$Rev_per_employee)) * 100
ggplot(Rev_per_employee, aes(x = reorder(Industry, Rev_per_employee), y = Rev_per_employee,
label=paste0('$', round(Rev_per_employee)),
ymax=Rev_per_employee+10000
,fill=Industry)) +
geom_bar(stat = "identity",fill="blue") +
geom_text( position = position_dodge(1), vjust = 0.5,hjust=0) +
labs(title="Revenue per employee in various industries of NY State",y="Revenue per employee in US Dollars($)",x="Industry") +
theme(legend.position="none") +
coord_flip()+
theme(axis.text=element_text(size=12),
axis.title=element_text(size=14,face="bold"))
As the bar chart above shows, the Computer Hardware industry has the highest revenue per employee.