Principles of Data Visualization and Introduction to ggplot2
head(inc)
## Rank Name Growth_Rate Revenue
## 1 1 Fuhu 421.48 117900000
## 2 2 FederalConference.com 248.31 49600000
## 3 3 The HCI Group 245.45 25500000
## 4 4 Bridger 233.08 1900000000
## 5 5 DataXu 213.37 87000000
## 6 6 MileStone Community Builders 179.38 45700000
## Industry Employees City State
## 1 Consumer Products & Services 104 El Segundo CA
## 2 Government Services 51 Dumfries VA
## 3 Health 132 Jacksonville FL
## 4 Energy 50 Addison TX
## 5 Advertising & Marketing 220 Boston MA
## 6 Real Estate 63 Austin TX
summary(inc)
## Rank Name Growth_Rate
## Min. : 1 (Add)ventures : 1 Min. : 0.340
## 1st Qu.:1252 @Properties : 1 1st Qu.: 0.770
## Median :2502 1-Stop Translation USA: 1 Median : 1.420
## Mean :2502 110 Consulting : 1 Mean : 4.612
## 3rd Qu.:3751 11thStreetCoffee.com : 1 3rd Qu.: 3.290
## Max. :5000 123 Exteriors : 1 Max. :421.480
## (Other) :4995
## Revenue Industry
## Min. : 2000000 IT Services : 733
## 1st Qu.: 5100000 Business Products & Services: 482
## Median : 10900000 Advertising & Marketing : 471
## Mean : 48222535 Health : 355
## 3rd Qu.: 28600000 Software : 342
## Max. :10100000000 Financial Services : 260
## (Other) :2358
## Employees City State
## Min. : 1.0 New York : 160 CA : 701
## 1st Qu.: 25.0 Chicago : 90 TX : 387
## Median : 53.0 Austin : 88 NY : 311
## Mean : 232.7 Houston : 76 VA : 283
## 3rd Qu.: 132.0 San Francisco: 75 FL : 282
## Max. :66803.0 Atlanta : 74 IL : 273
## NA's :12 (Other) :4438 (Other):2764
First off, how many states are included in this data? If there are fewer than 50, which states are missing; if more than 50, which territories are included?
inc$State %>% unique() %>% length()
## [1] 52
52 is more than the 50 US states. Let's see which entries are extra.
inc$State %>% unique()
## [1] CA VA FL TX MA TN UT RI SC DC NJ OH WA ME NY CO GA IL AZ NC MD MN OK
## [24] PA CT IN MS WI WY MI MO KS OR NE AL HI NV IA KY ID AK LA DE AR NH VT
## [47] NM SD ND PR MT WV
## 52 Levels: AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA ... WY
Many people do not realize that the District of Columbia is a federal district and is not part of any state. The other extra entry, PR, is Puerto Rico, a US territory.
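One way to pick out the non-state codes without reading the list by eye is to compare against R's built-in `state.abb` vector of the 50 state abbreviations. The sketch below uses a small made-up vector, `codes`, as a stand-in for `inc$State`; on the real data the same `setdiff()` call would be expected to return DC and PR.

```r
# Hypothetical stand-in for as.character(inc$State)
codes <- c("CA", "VA", "DC", "NY", "PR", "TX", "CA")

# state.abb ships with base R and holds the 50 state abbreviations
extra <- setdiff(unique(codes), state.abb)
print(extra)
```

Converting the factor with `as.character()` first avoids comparing factor codes instead of labels.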
filter(inc, State == "PR")
## Rank Name Growth_Rate Revenue Industry Employees City State
## 1 2140 Wovenware 1.73 2300000 Software 29 San Juan PR
Puerto Rico has only one company in this data, so no useful insight can be gained from it.
dc_stats<- filter(inc, State == "DC")
summary(dc_stats)
## Rank Name Growth_Rate
## Min. : 15 Apprio : 1 Min. : 0.400
## 1st Qu.: 710 Arnold & Porter : 1 1st Qu.: 0.930
## Median :1914 Atlas Research : 1 Median : 1.980
## Mean :2161 Barbaricum : 1 Mean : 8.298
## 3rd Qu.:3367 Brailsford & Dunlavey : 1 3rd Qu.: 6.525
## Max. :4854 CaseDriven Technologies: 1 Max. :123.330
## (Other) :37
## Revenue Industry Employees
## Min. : 2400000 Business Products & Services:10 Min. : 7.00
## 1st Qu.: 5400000 Government Services :10 1st Qu.: 20.00
## Median : 8700000 IT Services : 9 Median : 41.00
## Mean : 76344186 Construction : 2 Mean : 219.55
## 3rd Qu.: 16050000 Health : 2 3rd Qu.: 84.25
## Max. :1600000000 Real Estate : 2 Max. :4100.00
## (Other) : 8 NA's :1
## City State
## Washington :41 DC :43
## Washington, DC: 1 AK : 0
## Washingtonton : 1 AL : 0
## Acton : 0 AR : 0
## Addison : 0 AZ : 0
## Adrian : 0 CA : 0
## (Other) : 0 (Other): 0
Business Products & Services and Government Services make sense for DC businesses, since a large number of them focus on federal and state governments. The Growth Rate stands out: the 3rd quartile is 6.525, but the maximum is 123.33. A gap that large points to a heavily right-skewed distribution, which is also why the mean (8.298) sits well above the median (1.980).
Another item of note: the data is not completely clean, most likely due to human error. The City column spells Washington, DC three different ways (Washington; Washington, DC; Washingtonton). Any analysis by city will need these cleaned first.
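A minimal sketch of what such a cleaning step might look like, using only base R. The vector `city` is a hypothetical stand-in for the three spellings seen above, and the rules are assumptions chosen to cover just those cases, not a general-purpose city normalizer.

```r
# Hypothetical stand-in for the City values seen in the DC subset
city <- c("Washington", "Washington, DC", "Washingtonton")

clean <- trimws(city)                             # harmless if there is no stray whitespace
clean <- sub(",\\s*DC$", "", clean)               # remove a trailing ", DC"
clean[clean == "Washingtonton"] <- "Washington"   # fix the one known typo

table(clean)
```

A full cleanup would also have to decide how to treat differences in capitalization, which this sketch leaves alone.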
Returning to the full dataset, the same pattern shows up in the overall summary numbers.
summary(inc$Growth_Rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.340 0.770 1.420 4.612 3.290 421.480
Growth Rate, even without plotting, shows heavy skew, much like DC.
The 3rd quartile is 3.290, but the maximum is 421.480.
inc %>% count(Growth_Rate > 3.291)
## # A tibble: 2 x 2
## `Growth_Rate > 3.291` n
## <lgl> <int>
## 1 FALSE 3752
## 2 TRUE 1249
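About 25% of the businesses (1,249 of 5,001) are above the 3rd quartile, which is what the definition of a quartile implies; the skew shows in how far those values stretch, not in how many there are.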
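The relationship between the third quartile and the share of values above it can be checked on simulated data. The sketch below draws a right-skewed sample as a stand-in for `inc$Growth_Rate` (the distribution and its parameters are assumptions made for illustration only) and compares the share above the quartile with the mean and median.

```r
set.seed(42)  # simulated, right-skewed stand-in for inc$Growth_Rate
growth <- rlnorm(5000, meanlog = 0.5, sdlog = 1.2)

third_q <- quantile(growth, 0.75)
share_above <- mean(growth > third_q)   # fraction of values above the 3rd quartile

# In right-skewed data the long tail pulls the mean above the median
c(share_above = share_above, mean = mean(growth), median = median(growth))
```

The share stays near one quarter however skewed the sample is; the skew shows up in the gap between mean and median instead.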
q1 <- inc %>%
  count(State) %>%        # one row per state with its company count n
  arrange(desc(n)) %>%    # largest states first
  as_tibble()
ggplot(q1, aes(x = reorder(State, n), y = n)) +
  geom_col(width = .6) +  # same as geom_bar(stat = "identity")
  # white reference lines drawn over the bars in place of grid lines
  geom_hline(yintercept = seq(1, 800, 100), col = "white", lwd = 1) +
  # theme_classic() is a complete theme; theme() tweaks placed before it are discarded
  theme_classic() +
  coord_flip() +
  xlab("State") +
  ylab("Number of Fastest Growing Companies")
The bar chart was chosen for its clean representation of the data: it is easily read and quickly understood. I chose not to use color on most of these plots, because I am not comparing one group of data points against another, which is where a consistent color would be needed throughout the report.
complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.
# finding the state in question
q1[3,"State"]
## # A tibble: 1 x 1
## State
## <fct>
## 1 NY
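The same lookup can be sketched in base R without relying on row position in a tibble. The vector `state` below is a made-up stand-in for `inc$State`, sized so that there are no ties; real data with tied counts would need an explicit tie-breaking rule.

```r
# Hypothetical stand-in for inc$State
state <- c(rep("CA", 5), rep("TX", 4), rep("NY", 3), rep("VA", 2), "FL")

counts <- sort(table(state), decreasing = TRUE)  # named counts, largest first
third <- names(counts)[3]
print(third)
```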
How does NY compare to the country as a whole?
summary(inc)
## Rank Name Growth_Rate
## Min. : 1 (Add)ventures : 1 Min. : 0.340
## 1st Qu.:1252 @Properties : 1 1st Qu.: 0.770
## Median :2502 1-Stop Translation USA: 1 Median : 1.420
## Mean :2502 110 Consulting : 1 Mean : 4.612
## 3rd Qu.:3751 11thStreetCoffee.com : 1 3rd Qu.: 3.290
## Max. :5000 123 Exteriors : 1 Max. :421.480
## (Other) :4995
## Revenue Industry
## Min. : 2000000 IT Services : 733
## 1st Qu.: 5100000 Business Products & Services: 482
## Median : 10900000 Advertising & Marketing : 471
## Mean : 48222535 Health : 355
## 3rd Qu.: 28600000 Software : 342
## Max. :10100000000 Financial Services : 260
## (Other) :2358
## Employees City State
## Min. : 1.0 New York : 160 CA : 701
## 1st Qu.: 25.0 Chicago : 90 TX : 387
## Median : 53.0 Austin : 88 NY : 311
## Mean : 232.7 Houston : 76 VA : 283
## 3rd Qu.: 132.0 San Francisco: 75 FL : 282
## Max. :66803.0 Atlanta : 74 IL : 273
## NA's :12 (Other) :4438 (Other):2764
#comparing NY to the country
ny_stats<- filter(inc, State == "NY")
summary(ny_stats)
## Rank Name Growth_Rate
## Min. : 26 1st Equity : 1 Min. : 0.350
## 1st Qu.:1186 33Across : 1 1st Qu.: 0.670
## Median :2702 5Linx Enterprises : 1 Median : 1.310
## Mean :2612 Access Display Group: 1 Mean : 4.371
## 3rd Qu.:4005 Adafruit : 1 3rd Qu.: 3.580
## Max. :4981 AdCorp Media Group : 1 Max. :84.430
## (Other) :305
## Revenue Industry Employees
## Min. : 2000000 Advertising & Marketing : 57 Min. : 1.0
## 1st Qu.: 4300000 IT Services : 43 1st Qu.: 21.0
## Median : 8800000 Business Products & Services: 26 Median : 45.0
## Mean : 58715113 Consumer Products & Services: 17 Mean : 271.3
## 3rd Qu.: 25700000 Telecommunications : 17 3rd Qu.: 105.5
## Max. :4600000000 Education : 14 Max. :32000.0
## (Other) :137
## City State
## New York :160 NY :311
## Brooklyn : 15 AK : 0
## Rochester: 9 AL : 0
## Buffalo : 5 AR : 0
## Fairport : 5 AZ : 0
## new york : 5 CA : 0
## (Other) :112 (Other): 0
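The code that follows filters with `complete.cases()` before computing means and medians. As a small illustration of why, the sketch below builds a hypothetical three-row data frame (`ny_toy`, not part of the real data) with one missing `Employees` value and shows what the filter keeps.

```r
# Hypothetical miniature of the NY subset, with one missing Employees value
ny_toy <- data.frame(
  Industry  = c("Media", "Media", "Software"),
  Employees = c(10, NA, 30)
)

keep <- complete.cases(ny_toy)   # TRUE only for rows with no NA in any column
ny_full <- ny_toy[keep, ]

mean(ny_toy$Employees)    # NA, because of the missing value
mean(ny_full$Employees)   # computed from the two complete rows
```

Note that `complete.cases()` drops a row if any column is missing, not just the column being summarized.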
q2 <- ny_stats%>%
filter(complete.cases(.))%>%
group_by(Industry)%>%
summarise(Industry_Mean = mean(Employees),
Industry_Median = median(Employees))%>%
gather(statistic, Employees, Industry_Mean, Industry_Median)
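ggplot(q2, aes(x = Industry, y = Employees)) +
  # dodge the bars so mean and median sit side by side instead of stacking
  geom_bar(stat = "identity", aes(fill = statistic), position = "dodge") +
  geom_hline(yintercept = seq(1, 1600, 100), col = "white", lwd = 1) +
  theme_classic() +
  coord_flip() +
  ggtitle("NY Mean/Median by Industry")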
q2_bar <- ny_stats %>%
filter(complete.cases(.))%>%
select(Industry, Employees)
ggplot(q2_bar, aes(x=Employees, y=reorder(Industry, Employees, median, order = TRUE)))+
geom_boxplot(fill="slateblue", alpha=0.2)+
scale_x_log10()+
theme_classic()+
ylab("Industry (by median)")+
ggtitle("NY Median by Industry")
I wanted to present the data in two different ways and give a comparison between the requested value and the data as a whole.
The bar chart again gives a clean representation of the data. I allowed color here because there are two statistics on the same chart, and I wanted a clear delineation between them in both black-and-white and color display.
The box plot displays the same data, but drawing the median as a line inside each box tells the story of the data more easily. It also shows the spread within each industry and marks outliers, and the log scale keeps the largest employers from flattening everything else.
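To make the outlier handling concrete, the sketch below applies the 1.5 x IQR rule with base R's `boxplot.stats()` to a made-up vector of employee counts (`employees` is hypothetical). `geom_boxplot()` follows the same 1.5 x IQR convention by default, although it computes the hinges slightly differently, so treat this as an approximation of what the plot flags.

```r
# Hypothetical employee counts for one industry, with one very large firm
employees <- c(10, 12, 15, 20, 25, 30, 32000)

stats <- boxplot.stats(employees)  # default coef = 1.5
stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out     # points that would be drawn individually as outliers

# A single large firm moves the mean far more than the median
c(mean = mean(employees), median = median(employees))
```

This is also why the median is the safer summary for a bar chart of this data: one firm with tens of thousands of employees can dominate an industry's mean.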
q3 <- ny_stats%>%
filter(complete.cases(.))%>%
group_by(Industry)%>%
summarise(Revenue_total = sum(Revenue), Employees_Total= sum(Employees))%>%
mutate(Revenue_per_employee = Revenue_total/Employees_Total)
ggplot(q3, aes(x=reorder(Industry, Revenue_per_employee), y=Revenue_per_employee))+
geom_bar(stat = "identity")+
geom_hline(yintercept=seq(1,700000,100000), col="white", lwd=1)+
theme_classic()+
coord_flip()+
xlab("Industry")+
ggtitle("NY Industry Revenue per Employee")
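The calculation above divides total revenue by total employees within each industry, which is a ratio of sums and so gives large companies more weight than small ones. The sketch below, on a hypothetical four-company data frame (`firms`), shows how that can differ from averaging each company's own revenue per employee.

```r
# Hypothetical companies; not taken from the inc data
firms <- data.frame(
  Industry  = c("A", "A", "B", "B"),
  Revenue   = c(1e6, 9e6, 2e6, 6e6),
  Employees = c(10, 30, 5, 15)
)

# Ratio of sums: total revenue / total employees per industry
totals <- aggregate(cbind(Revenue, Employees) ~ Industry, data = firms, FUN = sum)
totals$Revenue_per_employee <- totals$Revenue / totals$Employees

# Mean of ratios: average of each company's own revenue per employee
firms$ratio <- firms$Revenue / firms$Employees
mean_of_ratios <- tapply(firms$ratio, firms$Industry, mean)

totals
mean_of_ratios
```

Which of the two is appropriate depends on the question; the ratio of sums describes the industry as a whole, while the mean of ratios describes a typical company in it.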