Working with Cleveland plots - looking at the data using glimpse()
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(ggplot2)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following objects are masked from 'package:readr':
##
## col_factor, col_numeric
library(openintro)
## Please visit openintro.org for free statistics materials
##
## Attaching package: 'openintro'
## The following object is masked from 'package:datasets':
##
## cars
library(dplyr)
glimpse(countyComplete)
## Observations: 3,143
## Variables: 53
## $ state <fctr> Alabama, Alabama, A...
## $ name <fctr> Autauga County, Bal...
## $ FIPS <dbl> 1001, 1003, 1005, 10...
## $ pop2010 <dbl> 54571, 182265, 27457...
## $ pop2000 <dbl> 43671, 140415, 29038...
## $ age_under_5 <dbl> 6.6, 6.1, 6.2, 6.0, ...
## $ age_under_18 <dbl> 26.8, 23.0, 21.9, 22...
## $ age_over_65 <dbl> 12.0, 16.8, 14.2, 12...
## $ female <dbl> 51.3, 51.1, 46.9, 46...
## $ white <dbl> 78.5, 85.7, 48.0, 75...
## $ black <dbl> 17.7, 9.4, 46.9, 22....
## $ native <dbl> 0.4, 0.7, 0.4, 0.3, ...
## $ asian <dbl> 0.9, 0.7, 0.4, 0.1, ...
## $ pac_isl <dbl> NA, NA, NA, NA, NA, ...
## $ two_plus_races <dbl> 1.6, 1.5, 0.9, 0.9, ...
## $ hispanic <dbl> 2.4, 4.4, 5.1, 1.8, ...
## $ white_not_hispanic <dbl> 77.2, 83.5, 46.8, 75...
## $ no_move_in_one_plus_year <dbl> 86.3, 83.0, 83.0, 90...
## $ foreign_born <dbl> 2.0, 3.6, 2.8, 0.7, ...
## $ foreign_spoken_at_home <dbl> 3.7, 5.5, 4.7, 1.5, ...
## $ hs_grad <dbl> 85.3, 87.6, 71.9, 74...
## $ bachelors <dbl> 21.7, 26.8, 13.5, 10...
## $ veterans <dbl> 5817, 20396, 2327, 1...
## $ mean_work_travel <dbl> 25.1, 25.8, 23.8, 28...
## $ housing_units <dbl> 22135, 104061, 11829...
## $ home_ownership <dbl> 77.5, 76.7, 68.0, 82...
## $ housing_multi_unit <dbl> 7.2, 22.6, 11.1, 6.6...
## $ median_val_owner_occupied <dbl> 133900, 177200, 8820...
## $ households <dbl> 19718, 69476, 9795, ...
## $ persons_per_household <dbl> 2.70, 2.50, 2.52, 3....
## $ per_capita_income <dbl> 24568, 26469, 15875,...
## $ median_household_income <dbl> 53255, 50147, 33219,...
## $ poverty <dbl> 10.6, 12.2, 25.0, 12...
## $ private_nonfarm_establishments <dbl> 877, 4812, 522, 318,...
## $ private_nonfarm_employment <dbl> 10628, 52233, 7990, ...
## $ percent_change_private_nonfarm_employment <dbl> 16.6, 17.4, -27.0, -...
## $ nonemployment_establishments <dbl> 2971, 14175, 1527, 1...
## $ firms <dbl> 4067, 19035, 1667, 1...
## $ black_owned_firms <dbl> 15.2, 2.7, NA, 14.9,...
## $ native_owned_firms <dbl> NA, 0.4, NA, NA, NA,...
## $ asian_owned_firms <dbl> 1.3, 1.0, NA, NA, NA...
## $ pac_isl_owned_firms <dbl> NA, NA, NA, NA, NA, ...
## $ hispanic_owned_firms <dbl> 0.7, 1.3, NA, NA, NA...
## $ women_owned_firms <dbl> 31.7, 27.3, 27.0, NA...
## $ manufacturer_shipments_2007 <dbl> NA, 1410273, NA, 0, ...
## $ mercent_whole_sales_2007 <dbl> NA, NA, NA, NA, NA, ...
## $ sales <dbl> 598175, 2966489, 188...
## $ sales_per_capita <dbl> 12003, 17166, 6334, ...
## $ accommodation_food_service <dbl> 88157, 436955, NA, 1...
## $ building_permits <dbl> 191, 696, 10, 8, 18,...
## $ fed_spending <dbl> 331142, 1119082, 240...
## $ area <dbl> 594.44, 1589.78, 884...
## $ density <dbl> 91.8, 114.6, 31.0, 3...
Selecting only the data needed: state and pop2010. Cleveland plots generally work with categorical data; otherwise the plot would be too crowded.
cc <- countyComplete %>%
  select(state, pop2010)
#factoring the.
cc %>%
  group_by(state) # grouping by state
## Source: local data frame [3,143 x 2]
## Groups: state [51]
##
## state pop2010
## * <fctr> <dbl>
## 1 Alabama 54571
## 2 Alabama 182265
## 3 Alabama 27457
## 4 Alabama 22915
## 5 Alabama 57322
## 6 Alabama 10914
## 7 Alabama 20947
## 8 Alabama 118572
## 9 Alabama 34215
## 10 Alabama 25989
## # ... with 3,133 more rows
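Note that the call above only prints the grouped data; because the result is not assigned, cc itself stays ungrouped. A minimal sketch of the difference, assuming a small made-up data frame (toy and toy_grouped are hypothetical names, and the values are not taken from countyComplete):

```r
library(dplyr)

# Toy data, invented for illustration only (not from countyComplete).
toy <- data.frame(
  state   = c("A", "A", "B", "B", "B"),
  pop2010 = c(100, 300, 30, 60, 90)
)

# Piping into group_by() without assignment returns a grouped copy
# and leaves the original object untouched.
toy %>%
  group_by(state)

group_vars(toy)          # no grouping variables on the original

# Assigning the result keeps the grouping for later steps.
toy_grouped <- toy %>%
  group_by(state)

group_vars(toy_grouped)  # the grouping variable is now stored
```

To keep the grouping on the county data, the same pattern would be cc <- cc %>% group_by(state).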
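# Bar chart of the mean county population per state.
# `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0).
ggplot(cc, aes(x = factor(state), y = pop2010)) +
  stat_summary(fun = "mean", geom = "bar")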
With the states on the x-axis, all the names are jumbled together. You could rotate the labels by 90 degrees, but they would still be difficult to read. A Cleveland plot swaps the x and y axes so that many factor levels remain readable while still maintaining accuracy.
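For comparison, the 90-degree label rotation mentioned above might look like the following. This is only a sketch on a small made-up data frame (toy and p_rotated are hypothetical names, and the population values are invented); it assumes ggplot2 3.3.0 or later for the fun argument.

```r
library(ggplot2)

# Toy data, invented for illustration only (not from countyComplete).
toy <- data.frame(
  state   = c("Alabama", "Alabama", "Alaska", "Arizona"),
  pop2010 = c(100, 300, 60, 500)
)

# Same bar chart of mean population per state, with the x-axis labels
# turned 90 degrees so that long names do not overlap each other.
p_rotated <- ggplot(toy, aes(x = state, y = pop2010)) +
  stat_summary(fun = "mean", geom = "bar") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

print(p_rotated)
```

With 51 states the rotated labels fit, but the reader has to tilt their head to read them, which is the motivation for flipping the axes instead.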
#increased plot size by 10 & 11
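cc1 <- ggplot(cc, aes(x = factor(state), y = pop2010)) +
  geom_point(col = "red", size = 2) +
  # min() and max() are taken over all counties, so every state
  # gets the same dashed guide line.
  geom_segment(aes(x = state,
                   xend = state,
                   y = min(pop2010),
                   yend = max(pop2010)),
               linetype = "dashed",
               linewidth = 0.1) + # `linewidth` needs ggplot2 >= 3.4.0; older versions use `size`
  labs(title = "State Dot Plot",
       subtitle = "State Vs Pop2010") +
  coord_flip() # swap the axes so the state names read horizontally
print(cc1)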
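The plot above draws one point per county. A Cleveland dot plot more commonly shows a single value per category, sorted by that value. The sketch below is one possible variant, not the method used above: it summarises a small made-up data frame to the mean population per state and orders the states by that mean. The names toy, state_means, mean_pop and dot_plot are hypothetical, and the values are invented.

```r
library(dplyr)
library(ggplot2)

# Toy data, invented for illustration only (not from countyComplete).
toy <- data.frame(
  state   = c("A", "A", "B", "B", "C"),
  pop2010 = c(100, 300, 30, 90, 500)
)

# One row per state, with states ordered by their mean population.
state_means <- toy %>%
  group_by(state) %>%
  summarise(mean_pop = mean(pop2010)) %>%
  mutate(state = reorder(state, mean_pop))

# States go on the y-axis directly, so coord_flip() is not needed.
dot_plot <- ggplot(state_means, aes(x = mean_pop, y = state)) +
  geom_segment(aes(x = 0, xend = mean_pop, yend = state),
               linetype = "dashed", colour = "grey50") +
  geom_point(colour = "red", size = 2) +
  labs(title = "State Dot Plot",
       subtitle = "Mean county population by state",
       x = "Mean pop2010", y = NULL)

print(dot_plot)
```

Swapping toy for cc should give one sorted dot per state for the county data.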