suppressWarnings(suppressMessages(library(data.table)))
suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(ggplot2)))
suppressWarnings(suppressMessages(library(knitr)))
suppressWarnings(suppressMessages(library(psych)))
suppressWarnings(suppressMessages(library(scales)))
inc <- fread("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv")
dim(inc)
## [1] 5001 8
kable(head(inc))
| Rank | Name | Growth_Rate | Revenue | Industry | Employees | City | State |
|---|---|---|---|---|---|---|---|
| 1 | Fuhu | 421.48 | 1.179e+08 | Consumer Products & Services | 104 | El Segundo | CA |
| 2 | FederalConference.com | 248.31 | 4.960e+07 | Government Services | 51 | Dumfries | VA |
| 3 | The HCI Group | 245.45 | 2.550e+07 | Health | 132 | Jacksonville | FL |
| 4 | Bridger | 233.08 | 1.900e+09 | Energy | 50 | Addison | TX |
| 5 | DataXu | 213.37 | 8.700e+07 | Advertising & Marketing | 220 | Boston | MA |
| 6 | MileStone Community Builders | 179.38 | 4.570e+07 | Real Estate | 63 | Austin | TX |
str(inc)
## Classes 'data.table' and 'data.frame': 5001 obs. of 8 variables:
## $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : chr "Fuhu" "FederalConference.com" "The HCI Group" "Bridger" ...
## $ Growth_Rate: num 421 248 245 233 213 ...
## $ Revenue : num 1.18e+08 4.96e+07 2.55e+07 1.90e+09 8.70e+07 ...
## $ Industry : chr "Consumer Products & Services" "Government Services" "Health" "Energy" ...
## $ Employees : int 104 51 132 50 220 63 27 75 97 15 ...
## $ City : chr "El Segundo" "Dumfries" "Jacksonville" "Addison" ...
## $ State : chr "CA" "VA" "FL" "TX" ...
## - attr(*, ".internal.selfref")=<externalptr>
summary(inc)
## Rank Name Growth_Rate Revenue
## Min. : 1 Length:5001 Min. : 0.340 Min. :2.000e+06
## 1st Qu.:1252 Class :character 1st Qu.: 0.770 1st Qu.:5.100e+06
## Median :2502 Mode :character Median : 1.420 Median :1.090e+07
## Mean :2502 Mean : 4.612 Mean :4.822e+07
## 3rd Qu.:3751 3rd Qu.: 3.290 3rd Qu.:2.860e+07
## Max. :5000 Max. :421.480 Max. :1.010e+10
##
## Industry Employees City
## Length:5001 Min. : 1.0 Length:5001
## Class :character 1st Qu.: 25.0 Class :character
## Mode :character Median : 53.0 Mode :character
## Mean : 232.7
## 3rd Qu.: 132.0
## Max. :66803.0
## NA's :12
## State
## Length:5001
## Class :character
## Mode :character
##
##
##
##
colSums(is.na(inc))
## Rank Name Growth_Rate Revenue Industry Employees
## 0 0 0 0 0 12
## City State
## 0 0
The dataset has 8 variables and 5,001 records. The Employees column has 12 missing values; no other column has any. The summary shows that companies average about 233 employees, while the median is only 53, so the employee counts are strongly right-skewed. The mean Growth_Rate is about 4.6 (median 1.42), and the mean revenue per company is $48,222,535 (median about $10.9 million).
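As a minimal sketch of how the missing Employees values affect later summaries, the base R snippet below counts NAs and drops incomplete rows. The `toy` data frame and its values are made up for illustration; they are not taken from `inc`.

```r
# Toy stand-in for `inc`; names and values are hypothetical.
toy <- data.frame(
  Name      = c("A", "B", "C", "D"),
  Employees = c(10L, NA, 30L, 50L),
  Revenue   = c(1e6, 2e6, 3e6, 4e6)
)

# Count missing values per column, as colSums(is.na(inc)) does above.
na_counts <- colSums(is.na(toy))

# Keep only the rows that have no missing value in any column.
toy_complete <- toy[complete.cases(toy), ]

# mean() returns NA when missing values are present unless na.rm = TRUE.
mean_with_na <- mean(toy$Employees)
mean_no_na   <- mean(toy$Employees, na.rm = TRUE)
```

Dropping rows with `complete.cases()` (or `na.omit()`) before grouping avoids NA results in per-group means and sums.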
q1 <- inc %>% group_by(State) %>% summarise(Count = n())
ggplot(q1, aes(x=State, y=Count)) +
geom_bar(stat="identity", width=0.5,fill= "grey") + coord_flip() +
geom_text(aes(label=Count), size=2.5) +
ylab("Count of Companies") + xlab("State") +
ggtitle("Distribution of Companies by State")+
theme(text = element_text(size = 8),panel.background = element_rect(fill='white', colour="white"))
California (CA) has the most companies.
complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.
q2 <- q1 %>% arrange(-Count)
top_n(q2,3)
## Selecting by Count
## # A tibble: 3 x 2
## State Count
## <chr> <int>
## 1 CA 701
## 2 TX 387
## 3 NY 311
q2 <- inc %>%filter(State=="NY") %>% na.omit()
ggplot(q2, aes(x=Industry, y=Employees)) +
stat_boxplot(geom ='errorbar') +
geom_boxplot() +
coord_cartesian(ylim = c(0,1000)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.3))+
ggtitle("NY Employee Count by Industry")
The box plot shows outliers in Human Resources, IT Services, Advertising & Marketing, and a few other industries. By default, the whiskers extend to the most extreme observations that lie within 1.5 times the interquartile range above the third quartile and below the first quartile (`range = 1.5` in base R's boxplot(), `coef = 1.5` in geom_boxplot()); points beyond the whiskers are drawn as outliers. Because coord_cartesian() limits the y-axis to 0 to 1,000, outliers above 1,000 employees are not visible in the plot. A box plot marks the median rather than the mean, and the ranges differ widely across industries, so a bar plot of the mean employee count per industry is presented below.
q2a <- inc %>% filter(State=="NY")%>%na.omit() %>% group_by(Industry)%>% summarize(avg =round(mean(Employees),1))
ggplot(q2a, aes(x=reorder(Industry, avg), y=avg)) +
geom_bar(stat="identity",width=0.3,fill= "grey") + coord_flip() +
geom_text(aes(label=avg), size=2.5) +
ylab("Average Employees") + xlab("Industry") +
ggtitle("Average Employees by Industry NY") +
theme(text = element_text(size = 8),panel.background = element_rect(fill='white', colour="white"))
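The 1.5 × IQR rule described for the NY box plot can be checked by hand. The sketch below uses base R only; the `employees` vector is invented for illustration and is not a sample from `inc`.

```r
# Hypothetical employee counts for one industry, with one extreme value.
employees <- c(5, 12, 20, 25, 40, 60, 75, 90, 3000)

# First and third quartiles (R's default quantile type).
q   <- quantile(employees, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Fences: observations outside these bounds are flagged as outliers.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr
outliers <- employees[employees < lower_fence | employees > upper_fence]

# The upper whisker ends at the largest observation inside the fence,
# not at the fence value itself.
whisker_hi <- max(employees[employees <= upper_fence])
```

Here the single extreme value pulls the mean far above the median, which is why the mean and the box plot tell different stories for skewed industries.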
q3 <- inc %>% na.omit() %>% group_by(Industry) %>%
summarise(Revenue_Gen = sum(Revenue)/ sum(Employees))
ggplot(q3, aes(x = reorder(Industry, Revenue_Gen), y = Revenue_Gen)) +
geom_bar(stat="identity", fill="grey") + coord_flip() +
ggtitle("Revenue per Employee by Industry") +
geom_text( aes(label=dollar_format()(Revenue_Gen)), size=2.5) +
ylab("") + xlab("") + theme_minimal()
The graph shows that Computer Hardware generates the most revenue per employee, followed by Energy and Construction. As an investor, I would review these numbers for the region I plan to invest in; breaking the graph down by State and Industry could reveal additional trends.
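One way to sketch the State-by-Industry breakdown suggested above is to aggregate revenue and employees per group before dividing. The example uses base R's aggregate() on a made-up `toy` data frame; the column names mirror `inc`, but the values are hypothetical.

```r
# Toy data with the same column names as `inc`; values are invented.
toy <- data.frame(
  State     = c("NY", "NY", "CA", "CA"),
  Industry  = c("Energy", "Energy", "Energy", "Health"),
  Revenue   = c(1e6, 3e6, 2e6, 5e5),
  Employees = c(10, 10, 40, 5)
)

# Sum revenue and employees within each State/Industry pair.
sums <- aggregate(cbind(Revenue, Employees) ~ State + Industry,
                  data = toy, FUN = sum)

# Revenue per employee as total revenue over total employees,
# matching the sum(Revenue) / sum(Employees) definition used for q3.
sums$Revenue_Gen <- sums$Revenue / sums$Employees

# Highest revenue per employee first.
sums <- sums[order(-sums$Revenue_Gen), ]
```

Dividing the group totals weights each company by its size; averaging per-company ratios instead would give small companies the same influence as large ones.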
https://stackoverflow.com/questions/13297995/changing-font-size-and-direction-of-axes-text-in-ggplot2
https://stackoverflow.com/questions/25061822/ggplot-geom-text-font-size-control?rq=1
https://stackoverflow.com/questions/37662119/custom-ggplot2-axis-and-label-formatting/37662563#37662563
https://stats.stackexchange.com/questions/211327/r-box-plot-on-log-scale-vs-log-transforming-then-creating-box-plot-dont-ge