Principles of Data Visualization and Introduction to ggplot2
I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. lets read this in:
inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", header= TRUE)
And lets preview this data:
head(inc)
## Rank Name Growth_Rate Revenue
## 1 1 Fuhu 421.48 1.179e+08
## 2 2 FederalConference.com 248.31 4.960e+07
## 3 3 The HCI Group 245.45 2.550e+07
## 4 4 Bridger 233.08 1.900e+09
## 5 5 DataXu 213.37 8.700e+07
## 6 6 MileStone Community Builders 179.38 4.570e+07
## Industry Employees City State
## 1 Consumer Products & Services 104 El Segundo CA
## 2 Government Services 51 Dumfries VA
## 3 Health 132 Jacksonville FL
## 4 Energy 50 Addison TX
## 5 Advertising & Marketing 220 Boston MA
## 6 Real Estate 63 Austin TX
summary(inc)
## Rank Name Growth_Rate Revenue
## Min. : 1 Length:5001 Min. : 0.340 Min. :2.000e+06
## 1st Qu.:1252 Class :character 1st Qu.: 0.770 1st Qu.:5.100e+06
## Median :2502 Mode :character Median : 1.420 Median :1.090e+07
## Mean :2502 Mean : 4.612 Mean :4.822e+07
## 3rd Qu.:3751 3rd Qu.: 3.290 3rd Qu.:2.860e+07
## Max. :5000 Max. :421.480 Max. :1.010e+10
##
## Industry Employees City State
## Length:5001 Min. : 1.0 Length:5001 Length:5001
## Class :character 1st Qu.: 25.0 Class :character Class :character
## Mode :character Median : 53.0 Mode :character Mode :character
## Mean : 232.7
## 3rd Qu.: 132.0
## Max. :66803.0
## NA's :12
Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:
##Answer tried to understand by summarizing to understand he data.
#install.packages("kableExtra")
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.1.3
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
##
## group_rows
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#install.packages("ggthemes")
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.1.3
str(inc)
## 'data.frame': 5001 obs. of 8 variables:
## $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : chr "Fuhu" "FederalConference.com" "The HCI Group" "Bridger" ...
## $ Growth_Rate: num 421 248 245 233 213 ...
## $ Revenue : num 1.18e+08 4.96e+07 2.55e+07 1.90e+09 8.70e+07 ...
## $ Industry : chr "Consumer Products & Services" "Government Services" "Health" "Energy" ...
## $ Employees : int 104 51 132 50 220 63 27 75 97 15 ...
## $ City : chr "El Segundo" "Dumfries" "Jacksonville" "Addison" ...
## $ State : chr "CA" "VA" "FL" "TX" ...
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.2
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.5 v stringr 1.4.0
## v tidyr 1.1.4 v forcats 0.5.1
## v readr 2.1.1
## Warning: package 'readr' was built under R version 4.1.2
## Warning: package 'forcats' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::group_rows() masks kableExtra::group_rows()
## x dplyr::lag() masks stats::lag()
# Tufe theme. Quick graph showing industries with greater than 1000 employees along with revenue
ggplot(filter(inc,Employees > 1000 ), aes(x = Industry, y = Revenue)) +
geom_point() +
labs(x = "Industry", y = "Revenue") +
theme_tufte() + coord_flip()
# Tufe theme. Quick graph showing industries with greater than 100 employees along with revenue
ggplot(filter(inc,Employees < 100 ), aes(x = Industry, y = Revenue)) +
geom_point(fill = "indianred") +
labs(x = "Industry", y = "Revenue") +
theme_tufte() + coord_flip()
#Top 50 records by revenue
revenuetop50 = inc %>% arrange(desc(Revenue)) %>% head(50)
#revenuetop50
revenuetop50$Revenue1 = sapply(revenuetop50$Revenue, function(x) paste(round((x / 1e9), 1), " Billion"))
revenuetop50 %>% kable() %>% kable_styling()
| Rank | Name | Growth_Rate | Revenue | Industry | Employees | City | State | Revenue1 |
|---|---|---|---|---|---|---|---|---|
| 4788 | CDW | 0.41 | 1.010e+10 | Computer Hardware | 6800 | Vernon Hills | IL | 10.1 Billion |
| 3853 | ABC Supply | 0.73 | 4.700e+09 | Construction | 6549 | Beloit | WI | 4.7 Billion |
| 4936 | Coty | 0.36 | 4.600e+09 | Consumer Products & Services | 10000 | New York | NY | 4.6 Billion |
| 4997 | Dot Foods | 0.34 | 4.500e+09 | Food & Beverage | 3919 | Mt. Sterling | IL | 4.5 Billion |
| 4716 | Westcon Group | 0.44 | 3.800e+09 | IT Services | 3000 | Tarrytown | NY | 3.8 Billion |
| 4246 | American Tire Distributors | 0.59 | 3.500e+09 | Consumer Products & Services | 3341 | Huntersville | NC | 3.5 Billion |
| 4052 | Kum & Go | 0.65 | 2.800e+09 | Retail | 4589 | West Des Moines | IA | 2.8 Billion |
| 4802 | Boise Cascade | 0.41 | 2.800e+09 | Construction | 4470 | Boise | ID | 2.8 Billion |
| 1397 | EnvisionRxOptions | 2.88 | 2.700e+09 | Health | 625 | Twinsburg | OH | 2.7 Billion |
| 2522 | DLA Piper | 1.41 | 2.400e+09 | Business Products & Services | 4036 | Chicago | IL | 2.4 Billion |
| 4629 | Prime Therapeutics | 0.47 | 2.000e+09 | Health | 2549 | Eagan | MN | 2 Billion |
| 4 | Bridger | 233.08 | 1.900e+09 | Energy | 50 | Addison | TX | 1.9 Billion |
| 1843 | Sun Coast Resources | 2.08 | 1.900e+09 | Energy | 1640 | Houston | TX | 1.9 Billion |
| 3844 | Atlas Oil Company | 0.74 | 1.900e+09 | Logistics & Transportation | 374 | Taylor | MI | 1.9 Billion |
| 4961 | Kirkland & Ellis | 0.36 | 1.900e+09 | Business Products & Services | 1517 | Chicago | IL | 1.9 Billion |
| 1488 | Sprouts Farmers Market | 2.68 | 1.800e+09 | Consumer Products & Services | 13200 | Phoenix | AZ | 1.8 Billion |
| 4689 | Global Brass and Copper Holdings | 0.45 | 1.700e+09 | Manufacturing | 1986 | Schaumburg | IL | 1.7 Billion |
| 3463 | Hogan Lovells | 0.89 | 1.600e+09 | Business Products & Services | 2280 | Washington | DC | 1.6 Billion |
| 2145 | AdvancePierre Foods | 1.73 | 1.500e+09 | Food & Beverage | 4000 | Cincinnati | OH | 1.5 Billion |
| 3650 | Genco | 0.82 | 1.500e+09 | Logistics & Transportation | 10800 | Pittsburgh | PA | 1.5 Billion |
| 960 | Advanced Disposal | 4.51 | 1.400e+09 | Environmental Services | 5347 | Ponte Vedra | FL | 1.4 Billion |
| 2236 | Total Quality Logistics | 1.65 | 1.400e+09 | Logistics & Transportation | 2116 | Cincinnati | OH | 1.4 Billion |
| 2496 | Carahsoft Technology | 1.42 | 1.400e+09 | Government Services | 365 | Reston | VA | 1.4 Billion |
| 3414 | Restoration Hardware | 0.91 | 1.200e+09 | Retail | 2900 | Corte Madera | CA | 1.2 Billion |
| 1908 | Diplomat Specialty Pharmacy | 1.99 | 1.100e+09 | Health | 761 | Flint | MI | 1.1 Billion |
| 3734 | KSS | 0.78 | 1.000e+09 | Manufacturing | 8500 | Sterling Heights | MI | 1 Billion |
| 3425 | Blackhawk Network Holdings | 0.90 | 9.591e+08 | Financial Services | 725 | Pleasanton | CA | 1 Billion |
| 1996 | Ambit Energy | 1.88 | 9.303e+08 | Energy | 492 | Dallas | TX | 0.9 Billion |
| 4532 | GoDaddy.com | 0.49 | 9.109e+08 | IT Services | 3369 | Scottsdale | AZ | 0.9 Billion |
| 2958 | ImmixGroup | 1.15 | 8.836e+08 | IT Services | 252 | McLean | VA | 0.9 Billion |
| 4793 | LORD Corporation | 0.41 | 8.615e+08 | Manufacturing | 2959 | Cary | NC | 0.9 Billion |
| 3163 | Quinn Emanuel | 1.03 | 8.525e+08 | Business Products & Services | 697 | Los Angeles | CA | 0.9 Billion |
| 2961 | Genesis-ATC | 1.15 | 8.462e+08 | Telecommunications | 347 | San Antonio | TX | 0.8 Billion |
| 1284 | Hearthside Food Solutions | 3.17 | 8.396e+08 | Food & Beverage | 5000 | Downers Grove | IL | 0.8 Billion |
| 1480 | Coyote Logistics | 2.70 | 7.864e+08 | Logistics & Transportation | 1219 | Chicago | IL | 0.8 Billion |
| 4765 | Squire Sanders | 0.42 | 7.745e+08 | Business Products & Services | 1257 | Cleveland | OH | 0.8 Billion |
| 3921 | Granite Telecommunications | 0.70 | 7.362e+08 | Telecommunications | 1000 | Quincy | MA | 0.7 Billion |
| 4854 | Arnold & Porter | 0.40 | 7.310e+08 | Business Products & Services | 748 | Washington | DC | 0.7 Billion |
| 1869 | Universal Services of America | 2.04 | 7.181e+08 | Security | 20000 | Santa Ana | CA | 0.7 Billion |
| 2274 | Liberty Power | 1.61 | 7.108e+08 | Energy | 277 | Fort Lauderdale | FL | 0.7 Billion |
| 4427 | Sunshine Minting | 0.53 | 7.061e+08 | Manufacturing | 280 | Coeur D’ Alene | ID | 0.7 Billion |
| 4459 | Belcan | 0.52 | 6.887e+08 | Engineering | 10000 | Cincinnati | OH | 0.7 Billion |
| 2806 | Goodman Networks | 1.23 | 6.509e+08 | Telecommunications | 1693 | Plano | TX | 0.7 Billion |
| 4009 | Schumacher Group | 0.67 | 6.130e+08 | Health | 1945 | Lafayette | LA | 0.6 Billion |
| 4815 | Perkins Coie | 0.40 | 6.080e+08 | Business Products & Services | 823 | Seattle | WA | 0.6 Billion |
| 1768 | The Cellular Connection | 2.17 | 6.065e+08 | Telecommunications | 1428 | Marion | IN | 0.6 Billion |
| 2537 | Wayfair.com | 1.40 | 6.020e+08 | Retail | 1300 | Boston | MA | 0.6 Billion |
| 4577 | Sutherland Global Services | 0.48 | 5.976e+08 | Business Products & Services | 32000 | Pittsford | NY | 0.6 Billion |
| 4093 | Advanced BioEnergy | 0.64 | 5.848e+08 | Energy | 75 | Bloomington | MN | 0.6 Billion |
| 4857 | AVI-SPL | 0.39 | 5.807e+08 | Business Products & Services | 1800 | Tampa | FL | 0.6 Billion |
#print
sample <- revenuetop50 %>% arrange(desc(Revenue)) %>% head(10)
#describe unique value to find unique industries in the sample data
sample
## Rank Name Growth_Rate Revenue
## 1 4788 CDW 0.41 1.01e+10
## 2 3853 ABC Supply 0.73 4.70e+09
## 3 4936 Coty 0.36 4.60e+09
## 4 4997 Dot Foods 0.34 4.50e+09
## 5 4716 Westcon Group 0.44 3.80e+09
## 6 4246 American Tire Distributors 0.59 3.50e+09
## 7 4052 Kum & Go 0.65 2.80e+09
## 8 4802 Boise Cascade 0.41 2.80e+09
## 9 1397 EnvisionRxOptions 2.88 2.70e+09
## 10 2522 DLA Piper 1.41 2.40e+09
## Industry Employees City State Revenue1
## 1 Computer Hardware 6800 Vernon Hills IL 10.1 Billion
## 2 Construction 6549 Beloit WI 4.7 Billion
## 3 Consumer Products & Services 10000 New York NY 4.6 Billion
## 4 Food & Beverage 3919 Mt. Sterling IL 4.5 Billion
## 5 IT Services 3000 Tarrytown NY 3.8 Billion
## 6 Consumer Products & Services 3341 Huntersville NC 3.5 Billion
## 7 Retail 4589 West Des Moines IA 2.8 Billion
## 8 Construction 4470 Boise ID 2.8 Billion
## 9 Health 625 Twinsburg OH 2.7 Billion
## 10 Business Products & Services 4036 Chicago IL 2.4 Billion
unique(inc$Industry)
## [1] "Consumer Products & Services" "Government Services"
## [3] "Health" "Energy"
## [5] "Advertising & Marketing" "Real Estate"
## [7] "Financial Services" "Retail"
## [9] "Software" "Computer Hardware"
## [11] "Logistics & Transportation" "Food & Beverage"
## [13] "IT Services" "Business Products & Services"
## [15] "Education" "Construction"
## [17] "Manufacturing" "Telecommunications"
## [19] "Security" "Human Resources"
## [21] "Travel & Hospitality" "Media"
## [23] "Environmental Services" "Engineering"
## [25] "Insurance"
unique(sample$Industry)
## [1] "Computer Hardware" "Construction"
## [3] "Consumer Products & Services" "Food & Beverage"
## [5] "IT Services" "Retail"
## [7] "Health" "Business Products & Services"
#Plot to show by State, industries with the highest revenue where employees > 1000 from Sample of 50
ggplot(filter(sample, Employees > 1000),aes(x = State, y = Revenue, color = Industry)) +
geom_point() +
geom_label(aes(label = Industry)) +
labs(x = "State", y = "Revenue") +
theme(legend.position = "none")
Create a graph that shows the distribution of companies in the dataset by State (ie how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a ‘portrait’ oriented screen (ie taller than wide), which should further guide your layout choices.
I used ggplot below to show a portrait graph by using coord_flip().Then using Theme_Tufte. Additionally then I tried to plot the numbers on the bar graph and adjusted to size 3 so I can show the numbers on the graph.
# Answer Question 1 here
library(ggplot2)
library(tidyverse)
library(dplyr)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
#graph1
inc %>% count(State) %>%
ggplot(aes(x=reorder(State, n), y=n)) +
geom_bar(stat = 'identity',width = 0.5, position = position_dodge(0.8)) +
geom_text(aes(label=n), vjust=1.6, color="white", size=3.5) +
coord_flip() +
theme_tufte() +
xlab("State") +
ylab("Count by State") +
ggtitle("Distribution of companies by state")
#FINAL GRAPH FOR QUESTION1 : 2graph2 with numbers, and make the color a light blue with blank count of companies by state
inc %>% count(State) %>%
ggplot(aes(x=reorder(State, n), y=n)) +
geom_bar(stat = 'identity',fill = "light blue", position = "dodge") +
geom_text(aes(label=n), position = position_stack(vjust= 0.5),
colour = "black", size = 3) +
coord_flip() +
theme_tufte() +
xlab("State") +
ylab("Count by State") +
ggtitle("Distribution of companies by state") + scale_color_brewer(palette = "Pastel1")
Lets dig in on the state with the 3rd most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R’s complete.cases() function.) In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.
I tried to create a few plots to understand how to show the outlier data using axis transformation for log2 and log10. Also showed triangular on outlier data. User reorder method The “default” method treats its first argument as a categorical variable (Industry), and reorders its levels based on the values of a second variable (Employees), usually numeric
# Answer Question 2 here
#1 Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data, use R's `complete.cases()` function.)
# Subset only NY records
NyState <- inc[ which(inc$State == 'NY'), ]
# Remove incomplete records
NyState <- NyState[complete.cases(NyState ), ]
# show data
head(NyState )
## Rank Name Growth_Rate Revenue
## 26 26 BeenVerified 84.43 13700000
## 30 30 Sailthru 73.22 8100000
## 37 37 YellowHammer 67.40 18000000
## 38 38 Conductor 67.02 7100000
## 48 48 Cinium Financial Services 53.65 5900000
## 70 70 33Across 44.99 27900000
## Industry Employees City State
## 26 Consumer Products & Services 17 New York NY
## 30 Advertising & Marketing 79 New York NY
## 37 Advertising & Marketing 27 New York NY
## 38 Advertising & Marketing 89 New York NY
## 48 Financial Services 32 Rock Hill NY
## 70 Advertising & Marketing 75 New York NY
#Find statistics
summary(NyState)
## Rank Name Growth_Rate Revenue
## Min. : 26 Length:311 Min. : 0.350 Min. :2.000e+06
## 1st Qu.:1186 Class :character 1st Qu.: 0.670 1st Qu.:4.300e+06
## Median :2702 Mode :character Median : 1.310 Median :8.800e+06
## Mean :2612 Mean : 4.371 Mean :5.872e+07
## 3rd Qu.:4005 3rd Qu.: 3.580 3rd Qu.:2.570e+07
## Max. :4981 Max. :84.430 Max. :4.600e+09
## Industry Employees City State
## Length:311 Min. : 1.0 Length:311 Length:311
## Class :character 1st Qu.: 21.0 Class :character Class :character
## Mode :character Median : 45.0 Mode :character Mode :character
## Mean : 271.3
## 3rd Qu.: 105.5
## Max. :32000.0
NyState %>% arrange(desc(Employees)) %>% head(50) %>% kable() %>% kable_styling()
| Rank | Name | Growth_Rate | Revenue | Industry | Employees | City | State |
|---|---|---|---|---|---|---|---|
| 4577 | Sutherland Global Services | 0.48 | 5.976e+08 | Business Products & Services | 32000 | Pittsford | NY |
| 4936 | Coty | 0.36 | 4.600e+09 | Consumer Products & Services | 10000 | New York | NY |
| 4716 | Westcon Group | 0.44 | 3.800e+09 | IT Services | 3000 | Tarrytown | NY |
| 3899 | Denihan Hospitality Group | 0.71 | 2.808e+08 | Travel & Hospitality | 2280 | New York | NY |
| 4363 | TransPerfect | 0.55 | 3.413e+08 | Business Products & Services | 2218 | New York | NY |
| 1499 | Sterling Infosystems | 2.66 | 2.149e+08 | Human Resources | 2081 | New York | NY |
| 4465 | OpenLink | 0.52 | 3.043e+08 | Software | 1271 | Uniondale | NY |
| 3136 | FSO Onsite Outsourcing | 1.05 | 5.530e+07 | Human Resources | 1134 | New York | NY |
| 2830 | ReSource Pro | 1.22 | 2.430e+07 | Business Products & Services | 1124 | New York | NY |
| 2995 | Pride Technologies | 1.13 | 2.310e+08 | Human Resources | 1000 | New York | NY |
| 3387 | Mitchell/Martin | 0.92 | 1.453e+08 | IT Services | 1000 | New York | NY |
| 4747 | TravelClick | 0.43 | 2.571e+08 | Travel & Hospitality | 906 | New York | NY |
| 2964 | DataArt | 1.14 | 3.070e+07 | Software | 800 | New York | NY |
| 4913 | Jackson Lewis | 0.37 | 3.520e+08 | Business Products & Services | 690 | White Plains | NY |
| 4646 | Mimeo.com | 0.46 | 8.770e+07 | Business Products & Services | 632 | New York | NY |
| 4153 | Everyday Health | 0.62 | 1.464e+08 | Media | 602 | New York | NY |
| 4003 | Ovation Travel Group | 0.67 | 5.860e+07 | Travel & Hospitality | 509 | New York | NY |
| 1640 | BlueWolf | 2.38 | 9.040e+07 | IT Services | 500 | New York | NY |
| 2971 | ConServe | 1.14 | 5.270e+07 | Financial Services | 483 | Fairport | NY |
| 4727 | Arrow Security | 0.44 | 1.400e+07 | Security | 450 | New York | NY |
| 3584 | Infusion | 0.84 | 6.790e+07 | IT Services | 432 | New York | NY |
| 4535 | Capital Access Network | 0.49 | 1.516e+08 | Financial Services | 425 | New York | NY |
| 4224 | Fragomen | 0.60 | 3.465e+08 | Business Products & Services | 404 | New York | NY |
| 3500 | Magnolia Bakery | 0.88 | 2.360e+07 | Food & Beverage | 383 | New York | NY |
| 1069 | Systems Made Simple | 3.94 | 1.671e+08 | IT Services | 382 | Syracuse | NY |
| 4039 | Empire Office | 0.66 | 3.496e+08 | Business Products & Services | 345 | New York | NY |
| 3704 | Infinity Consulting Solutions | 0.79 | 3.570e+07 | Human Resources | 325 | New York | NY |
| 3021 | Linium | 1.11 | 4.590e+07 | IT Services | 322 | New York | NY |
| 2218 | Globo Mobile | 1.67 | 4.500e+06 | Software | 320 | New York | NY |
| 2896 | Telx | 1.18 | 2.143e+08 | Telecommunications | 316 | New York | NY |
| 4820 | Aluf Plastics | 0.40 | 1.029e+08 | Manufacturing | 307 | Orangeburg | NY |
| 2436 | Shinetech Software | 1.48 | 1.200e+07 | IT Services | 300 | New York | NY |
| 3871 | eTransMedia Technology | 0.72 | 2.590e+07 | Health | 298 | Troy | NY |
| 4981 | SmartSource Computer & Audio Visual Rentals | 0.35 | 5.730e+07 | Business Products & Services | 296 | Hauppauge | NY |
| 1190 | Usablenet | 3.52 | 5.900e+07 | IT Services | 295 | New York | NY |
| 2556 | Precision Pipeline Solutions | 1.39 | 3.120e+07 | Energy | 294 | New Windsor | NY |
| 4944 | McElroy Deutsch | 0.36 | 1.165e+08 | Business Products & Services | 288 | New York | NY |
| 4732 | Mycroft | 0.43 | 2.640e+07 | IT Services | 275 | New York | NY |
| 4552 | Paradysz | 0.49 | 4.100e+07 | Advertising & Marketing | 270 | New York | NY |
| 4738 | McMurry/TMG | 0.43 | 9.140e+07 | Advertising & Marketing | 269 | New York City | NY |
| 3009 | 5Linx Enterprises | 1.12 | 1.036e+08 | Telecommunications | 268 | Rochester | NY |
| 1565 | Droga5 | 2.54 | 6.730e+07 | Advertising & Marketing | 250 | New York | NY |
| 2997 | Fibertech Networks | 1.13 | 1.423e+08 | Telecommunications | 250 | ROCHESTER | NY |
| 3661 | Environmental Products & Services of Vermont | 0.81 | 4.510e+07 | Environmental Services | 250 | Syracuse | NY |
| 1895 | nfrastructure | 2.00 | 5.320e+07 | IT Services | 237 | Clifton Park | NY |
| 1371 | Impelsys | 2.94 | 1.290e+07 | Software | 220 | New York | NY |
| 3054 | DDS Companies | 1.09 | 5.250e+07 | Construction | 219 | Rochester | NY |
| 275 | Technical Solutions | 15.85 | 1.630e+07 | Telecommunications | 213 | Walden | NY |
| 3262 | Sriven Systems | 0.98 | 2.530e+07 | IT Services | 210 | Melville | NY |
| 890 | Payoneer | 5.00 | 4.040e+07 | Financial Services | 207 | New YOrk | NY |
#quick graph of NY state
ggplot(NyState, aes(x = Industry, y = Employees)) +
geom_boxplot(fill = "indianred") +
labs(x = "Industry", y = "Employees") +
theme_tufte() + coord_flip()
# Graph to show boxplot with log10 transformation.
ggplot(NyState, aes(x = Industry, y = Employees)) +
geom_boxplot(outlier.colour = "Red",
outlier.shape = 2) + theme_tufte() + coord_flip()+
scale_y_continuous(trans = "log10")
#to show boxplot with log2 transformation.
ggplot(NyState, aes(x = Industry, y = Employees)) +
geom_boxplot(outlier.colour = "Red",
outlier.shape = 2) + theme_tufte() + coord_flip() +
scale_y_continuous(trans = "log2")
#FINAL GRAPH FOR QUESTION2 : Graph to show boxplot with log2 transformation. Use re-order. Shows Business Products and Services with max outlier
ggplot(NyState, aes(reorder(Industry, Employees, FUN = median), Employees)) +
geom_boxplot(outlier.colour = "Red",
outlier.shape = 2) +
scale_y_continuous(trans = "log2", limits = c(min(NyState$Employees), max(NyState$Employees))) +
coord_flip() +
geom_hline(yintercept = median(NyState$Employees),
color="Red",
linetype="dashed") +
geom_text(aes(x=2,
label="Median for Employees", y = median(NyState$Employees)+100 ),
size = 3,
colour="blue") +
xlab("Industry") +
ylab("Number of Employees") +
theme_tufte() +
ggtitle("Number of Employees by Industry in New York")
## Warning: Use of `NyState$Employees` is discouraged. Use `Employees` instead.
Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.
I checked the summary by using table and then colored the graph light blue to keep it more easy to view. Highest revenue is Computer Hardware, which generates revenue and employeers fewer total employees within this industry.
inc_nonmissing <- inc[complete.cases(inc), ]
revemployee <- group_by(inc_nonmissing, Industry) %>% summarise(sum(Revenue),sum(Employees), revperemp = sum(Revenue) / sum(Employees))
revemployee %>% arrange(desc(revperemp)) %>% head(50) %>% kable() %>% kable_styling()
| Industry | sum(Revenue) | sum(Employees) | revperemp |
|---|---|---|---|
| Computer Hardware | 11885700000 | 9714 | 1223563.93 |
| Energy | 13771600000 | 26437 | 520921.44 |
| Construction | 13174300000 | 29099 | 452740.64 |
| Logistics & Transportation | 14837800000 | 39994 | 371000.65 |
| Consumer Products & Services | 14956400000 | 45464 | 328972.37 |
| Insurance | 2337900000 | 7339 | 318558.39 |
| Manufacturing | 12603600000 | 43942 | 286823.54 |
| Retail | 10257400000 | 37068 | 276718.46 |
| Financial Services | 13150900000 | 47693 | 275740.67 |
| Environmental Services | 2638800000 | 10155 | 259852.29 |
| Telecommunications | 7287900000 | 30842 | 236297.91 |
| Government Services | 6009100000 | 26185 | 229486.35 |
| Business Products & Services | 26345900000 | 117357 | 224493.64 |
| Health | 17860100000 | 82430 | 216669.90 |
| IT Services | 20525000000 | 102788 | 199682.84 |
| Advertising & Marketing | 7785000000 | 39731 | 195942.71 |
| Food & Beverage | 12812500000 | 65911 | 194390.92 |
| Media | 1742400000 | 9532 | 182794.80 |
| Software | 8134600000 | 51262 | 158686.75 |
| Real Estate | 2956800000 | 18893 | 156502.41 |
| Education | 1139300000 | 7685 | 148249.84 |
| Travel & Hospitality | 2931600000 | 23035 | 127267.20 |
| Engineering | 2532500000 | 20435 | 123929.53 |
| Security | 3812800000 | 41059 | 92861.49 |
| Human Resources | 9246100000 | 226980 | 40735.31 |
ggplot(revemployee, aes(x=reorder(Industry, revperemp), y=revperemp)) +
geom_bar(stat = 'Identity',fill = "light blue", position = "dodge") +
coord_flip() +
xlab("Industry") +
ylab("Revenue per employee ($)") +
ggtitle("Revenue per employee for industry") +
scale_y_continuous(labels = scales::comma)+
theme_tufte()