This week’s challenge is to play around with some other aspects of visualization.
This week I’ll be working with the Population Health Institute’s County Health Rankings for Montana from 2021. The ranking system was designed by the Institute and all data collection and analysis to determine ranking are conducted by them.
All data are freely available to the public and can be found here.
I’ll be looking at the rankings for both health factors and health outcomes.
Let’s read in our data. I am going to read in one specific sheet from a file that contains many sheets. If I were presenting this data in the final publication, a datatable would be best but since I am just taking a quick look to see what I’m working with, a tibble is fine.
# set working directory
setwd("~/Documents/R Projects/Data")
# read data and assign to object
data <- read_excel("2021 County Health Rankings Montana Data - v1.xlsx", sheet = "Outcomes & Factors Rankings")
# print
data
## # A tibble: 58 × 7
## ...1 ...2 ...3 `Health Outcomes` ...5 `Health Factors` ...7
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 FIPS State County Z-Score Rank Z-Score Rank
## 2 30000 Montana <NA> <NA> <NA> <NA> <NA>
## 3 30001 Montana Beaverhead -0.58404984400000004 10 -0.5974907360000… 3
## 4 30003 Montana Big Horn 2.2480318276000002 48 1.47867168269999… 48
## 5 30005 Montana Blaine 1.4127189876999999 45 0.80676611119999… 45
## 6 30007 Montana Broadwater -0.79660183100000004 4 -0.206196512 22
## 7 30009 Montana Carbon -0.83152443399999998 3 -0.4199081949999… 10
## 8 30011 Montana Carter <NA> NR <NA> NR
## 9 30013 Montana Cascade 4.1509575999999999E-2 34 -0.1436595240000… 25
## 10 30015 Montana Chouteau -2.1442419000000001E-2 32 -2.4527642999999… 28
## # … with 48 more rows
Okay, game plan:
# rename columns
data <- data %>%
rename("County" = 3,
"Outcomes" = 5,
"Factors" = 7)
# convert variable type for columns 5,7 to integer
data$`Outcomes` <- as.integer(as.character(data$`Outcomes`))
data$`Factors` <- as.integer(as.character(data$`Factors`))
# create subset of data
countyRanking <- data %>%
subset(select = c("County", "Outcomes", "Factors"))
# slice to remove first 2 rows
countyRanking <- countyRanking %>%
slice(-c(1,2))
# pivot_longer
longCountyRanking <- countyRanking %>%
pivot_longer(
cols = c(`Outcomes`, `Factors`),
names_to = "Rank Type",
values_to = "Rank"
)
# print wide
countyRanking
## # A tibble: 56 × 3
## County Outcomes Factors
## <chr> <int> <int>
## 1 Beaverhead 10 3
## 2 Big Horn 48 48
## 3 Blaine 45 45
## 4 Broadwater 4 22
## 5 Carbon 3 10
## 6 Carter NA NA
## 7 Cascade 34 25
## 8 Chouteau 32 28
## 9 Custer 15 12
## 10 Daniels 25 15
## # … with 46 more rows
# print long
longCountyRanking
## # A tibble: 112 × 3
## County `Rank Type` Rank
## <chr> <chr> <int>
## 1 Beaverhead Outcomes 10
## 2 Beaverhead Factors 3
## 3 Big Horn Outcomes 48
## 4 Big Horn Factors 48
## 5 Blaine Outcomes 45
## 6 Blaine Factors 45
## 7 Broadwater Outcomes 4
## 8 Broadwater Factors 22
## 9 Carbon Outcomes 3
## 10 Carbon Factors 10
## # … with 102 more rows
I’ll be using the ggplot function from the ggplot2 package to create all plots.
Let’s start with a simple scatterplot. This will help us see any correlation between a county’s rank in Factors and Outcomes.
# create scatterplot
pScatter <- ggplot(data = countyRanking, aes(x = Factors, y = Outcomes, color = County)) +
geom_point() +
labs(title = "Montana County Rankings in 2021", y = "health outcomes ranking", x = "health factors ranking")
pScatter
This will suffice for revealing the correlation between our two variables but a few things come to mind:
# create scatterplot with labels
pScatterLabel <- ggplot(data = countyRanking, aes(x = Factors, y = Outcomes)) +
geom_point() +
geom_label(
aes(label = County)) +
labs(title = "Montana County Rankings in 2021", y = "health outcomes ranking", x = "health factors ranking") +
theme_minimal()
pScatterLabel
Hmm, well, that’s sort of what I was going for but not exactly. This iteration is useful for seeing relative ranking but sacrifices precision. That is, it’s easy to see that Gallatin County ranks higher than Powell County but it would be hard to give an estimate for some of the counties because the labels are large and overlap so much.
# create scatterplot with text
pScatterText <- ggplot(data = countyRanking, aes(x = Factors, y = Outcomes)) +
geom_point() +
geom_text(aes(label = County)) +
labs(title = "Montana County Rankings in 2021", y = "health outcomes ranking", x = "health factors ranking") +
theme_minimal()
pScatterText
# create scatterplot with smaller text
pScatterText2 <- ggplot(countyRanking, aes(x = Factors, y = Outcomes, label = County)) +
geom_point() +
geom_text(size = 1.5,
nudge_x = 0, nudge_y = 1.25,
check_overlap = T) +
geom_smooth(method = lm, se = FALSE) +
labs(title = "Montana County Rankings in 2021", y = "health outcomes ranking", x = "health factors ranking") +
theme_minimal()
pScatterText2
That will do for now. We can see the correlation between factors and outcomes and quickly compare counties against one another.
In order to play around with reordering the x-axis according to size, I’ll need to create a bar plot or something similar. After perusing the R Graph Gallery, I’ve decided to try a lollipop plot.
# create lollipop plot
pLolli <- ggplot(countyRanking, aes(x = County, y = Factors)) +
geom_point() +
geom_segment( aes(x = County, xend = County, y = 0, yend = Factors)) +
labs(title = "Montana County Rankings in 2021", y = "health factors ranking", x = "county") +
theme_minimal() +
theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))
pLolli
# create lollipop plot reordered by ranking
pLolli2 <- countyRanking %>%
mutate(County = fct_reorder(County, Factors)) %>%
ggplot(aes(x = County, y = Factors)) +
geom_point() +
geom_segment( aes(x = County, xend = County, y = 0, yend = Factors)) +
labs(title = "Montana County Rankings in 2021", y = "health factors ranking", x = "county") +
theme_minimal() +
theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))
pLolli2
I can only order by one variable and I chose factors. Ultimately this isn’t a very interesting plot, as I could achieve the same result by simply reordering within a table.
# calculate difference between factors and outcomes rank for each county
countyRankDiff <- countyRanking %>%
mutate(Difference = Outcomes - Factors)
countyRankDiff
## # A tibble: 56 × 4
## County Outcomes Factors Difference
## <chr> <int> <int> <int>
## 1 Beaverhead 10 3 7
## 2 Big Horn 48 48 0
## 3 Blaine 45 45 0
## 4 Broadwater 4 22 -18
## 5 Carbon 3 10 -7
## 6 Carter NA NA NA
## 7 Cascade 34 25 9
## 8 Chouteau 32 28 4
## 9 Custer 15 12 3
## 10 Daniels 25 15 10
## # … with 46 more rows
# create Cleveland dot plot
pCleve <- countyRanking %>%
ggplot() +
geom_segment( aes(x = County, xend = County, y = Factors, yend = Outcomes), color = "grey") +
geom_point( aes(x = County, y = Factors), color = rgb(0.2,0.7,0.1,0.5), size=3 ) + # create dot for health factors ranking
geom_point( aes(x = County, y = Outcomes), color = rgb(0.7,0.2,0.1,0.5), size=3 ) + # create dot for health outcomes ranking
labs(title = "Montana County Rankings in 2021", y = "health factors and outcomes ranking", x = "county") + # label plot and axes
coord_flip() +
theme_minimal()
pCleve
I’ll leave it there for now but moving forward there are a few changes to make: