the Challenge

This week’s challenge is to play around with some other aspects of visualization.

  • reorder axis elements (this post)
  • plot data on map of some kind (see Missoula Food System Map)

the Data

This week I’ll be working with the Population Health Institute’s County Health Rankings for Montana from 2021. The ranking system was designed by the Institute and all data collection and analysis to determine ranking are conducted by them.

All data are freely available to the public and can be found here.

I’ll be looking at the rankings for both health factors and health outcomes.

Let’s read in our data. I am going to read in one specific sheet from a file that contains many sheets. If I were presenting this data in the final publication, a datatable would be best but since I am just taking a quick look to see what I’m working with, a tibble is fine.

# set working directory
setwd("~/Documents/R Projects/Data")

# read data and assign to object
data <- read_excel("2021 County Health Rankings Montana Data - v1.xlsx", sheet = "Outcomes & Factors Rankings")

# print
data
## # A tibble: 58 × 7
##    ...1  ...2    ...3       `Health Outcomes`      ...5  `Health Factors`  ...7 
##    <chr> <chr>   <chr>      <chr>                  <chr> <chr>             <chr>
##  1 FIPS  State   County     Z-Score                Rank  Z-Score           Rank 
##  2 30000 Montana <NA>       <NA>                   <NA>  <NA>              <NA> 
##  3 30001 Montana Beaverhead -0.58404984400000004   10    -0.5974907360000… 3    
##  4 30003 Montana Big Horn   2.2480318276000002     48    1.47867168269999… 48   
##  5 30005 Montana Blaine     1.4127189876999999     45    0.80676611119999… 45   
##  6 30007 Montana Broadwater -0.79660183100000004   4     -0.206196512      22   
##  7 30009 Montana Carbon     -0.83152443399999998   3     -0.4199081949999… 10   
##  8 30011 Montana Carter     <NA>                   NR    <NA>              NR   
##  9 30013 Montana Cascade    4.1509575999999999E-2  34    -0.1436595240000… 25   
## 10 30015 Montana Chouteau   -2.1442419000000001E-2 32    -2.4527642999999… 28   
## # … with 48 more rows

Okay, game plan:

  • I don’t need columns one, two, four, or six so I am going to create a subset of this dataset with just the columns I’m interested in.
  • I know from examining the raw data that column five is the health outcomes rank and column seven is the health factors rank so I am going to rename those columns, as well as column three.
  • The first two rows are junk so I’m going to get rid of them.
  • Both rank columns (five and seven) are character type but should be integer so I’m going to change that.
  • I’m going to create a long version of the data–assigned to a new variable–so that I have both the wide and long version available to me.
# rename columns
data <- data %>%
  rename("County" = 3,
         "Outcomes" = 5,
         "Factors" = 7)

# convert variable type for columns 5,7 to integer
data$`Outcomes` <- as.integer(as.character(data$`Outcomes`))
data$`Factors` <- as.integer(as.character(data$`Factors`))


# create subset of data
countyRanking <- data %>%
  subset(select = c("County", "Outcomes", "Factors"))

# slice to remove first 2 rows
countyRanking <- countyRanking %>%
  slice(-c(1,2))

# pivot_longer
longCountyRanking <- countyRanking %>%
  pivot_longer(
    cols = c(`Outcomes`, `Factors`),
    names_to = "Rank Type",
    values_to = "Rank"
  )

# print wide
countyRanking
## # A tibble: 56 × 3
##    County     Outcomes Factors
##    <chr>         <int>   <int>
##  1 Beaverhead       10       3
##  2 Big Horn         48      48
##  3 Blaine           45      45
##  4 Broadwater        4      22
##  5 Carbon            3      10
##  6 Carter           NA      NA
##  7 Cascade          34      25
##  8 Chouteau         32      28
##  9 Custer           15      12
## 10 Daniels          25      15
## # … with 46 more rows
# print long
longCountyRanking
## # A tibble: 112 × 3
##    County     `Rank Type`  Rank
##    <chr>      <chr>       <int>
##  1 Beaverhead Outcomes       10
##  2 Beaverhead Factors         3
##  3 Big Horn   Outcomes       48
##  4 Big Horn   Factors        48
##  5 Blaine     Outcomes       45
##  6 Blaine     Factors        45
##  7 Broadwater Outcomes        4
##  8 Broadwater Factors        22
##  9 Carbon     Outcomes        3
## 10 Carbon     Factors        10
## # … with 102 more rows

Visualizing the Data

I’ll be using the ggplot function from the ggplot2 package to create all plots.

Scatterplots

Let’s start with a simple scatterplot. This will help us see any correlation between a county’s rank in Factors and Outcomes.

# create scatterplot
pScatter <- ggplot(data = countyRanking, aes(x = Factors, y = Outcomes, color = County)) +
  geom_point() +
  labs(title = "Montana County Rankings in 2021", y = "health outcomes ranking", x = "health factors ranking")
  
pScatter

This will suffice for revealing the correlation between our two variables but a few things come to mind:

  • It’s pretty useless if we want to see where any given county falls on the plot. There are too many points and the colors for some are too similar to distinguish between them. It’s worth noting here that in terms of accessibility, this plot is likely impossible for anyone who can’t distinguish between colors (e.g. is colorblind).
  • Because the numeric values are rankings, the lower the number, the higher the rank. Thus the county that ranks the highest for both factors and outcomes–Gallatin County–is located in the lower, left-hand corner of the plot. This is counter-intuitive to how many users interpret scatterplots, which is something to keep in mind.
# create scatterplot with labels
pScatterLabel <- ggplot(data = countyRanking, aes(x = Factors, y = Outcomes)) +
  geom_point() +
  geom_label(
    aes(label = County)) +
  labs(title = "Montana County Rankings in 2021", y = "health outcomes ranking", x = "health factors ranking") +
  theme_minimal()

pScatterLabel

Hmm, well, that’s sort of what I was going for but not exactly. This iteration is useful for seeing relative ranking but sacrifices precision. That is, it’s easy to see that Gallatin County ranks higher than Powell County but it would be hard to give an estimate for some of the counties because the labels are large and overlap so much.

# create scatterplot with text
pScatterText <- ggplot(data = countyRanking, aes(x = Factors, y = Outcomes)) +
  geom_point() +
  geom_text(aes(label = County)) +
  labs(title = "Montana County Rankings in 2021", y = "health outcomes ranking", x = "health factors ranking") +
  theme_minimal()

pScatterText

# create scatterplot with smaller text
pScatterText2 <- ggplot(countyRanking, aes(x = Factors, y = Outcomes, label = County)) +
  geom_point() +
  geom_text(size = 1.5,
    nudge_x = 0, nudge_y = 1.25, 
    check_overlap = T) +
  geom_smooth(method = lm, se = FALSE) +
  labs(title = "Montana County Rankings in 2021", y = "health outcomes ranking", x = "health factors ranking") +
  theme_minimal()

pScatterText2

That will do for now. We can see the correlation between factors and outcomes and quickly compare counties against one another.

Lollipop Plot

In order to play around with reordering the x-axis according to size, I’ll need to create a bar plot or something similar. After perusing the R Graph Gallery, I’ve decided to try a lollipop plot.

# create lollipop plot
pLolli <- ggplot(countyRanking, aes(x = County, y = Factors)) +
  geom_point() +
  geom_segment( aes(x = County, xend = County, y = 0, yend = Factors)) +
  labs(title = "Montana County Rankings in 2021", y = "health factors ranking", x = "county") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))

pLolli

# create lollipop plot reordered by ranking
pLolli2 <- countyRanking %>%
  mutate(County = fct_reorder(County, Factors)) %>%
  ggplot(aes(x = County, y = Factors)) +
  geom_point() +
  geom_segment( aes(x = County, xend = County, y = 0, yend = Factors)) +
  labs(title = "Montana County Rankings in 2021", y = "health factors ranking", x = "county") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))

pLolli2

I can only order by one variable and I chose factors. Ultimately this isn’t a very interesting plot, as I could achieve the same result by simply reordering within a table.

Cleveland Plot

# calculate difference between factors and outcomes rank for each county
countyRankDiff <- countyRanking %>%
  mutate(Difference = Outcomes - Factors)

countyRankDiff
## # A tibble: 56 × 4
##    County     Outcomes Factors Difference
##    <chr>         <int>   <int>      <int>
##  1 Beaverhead       10       3          7
##  2 Big Horn         48      48          0
##  3 Blaine           45      45          0
##  4 Broadwater        4      22        -18
##  5 Carbon            3      10         -7
##  6 Carter           NA      NA         NA
##  7 Cascade          34      25          9
##  8 Chouteau         32      28          4
##  9 Custer           15      12          3
## 10 Daniels          25      15         10
## # … with 46 more rows
# create Cleveland dot plot
pCleve <- countyRanking %>%
  ggplot() +
  geom_segment( aes(x = County, xend = County, y = Factors, yend = Outcomes), color = "grey") +
  geom_point( aes(x = County, y = Factors), color = rgb(0.2,0.7,0.1,0.5), size=3 ) + # create dot for health factors ranking
  geom_point( aes(x = County, y = Outcomes), color = rgb(0.7,0.2,0.1,0.5), size=3 ) + # create dot for health outcomes ranking
  labs(title = "Montana County Rankings in 2021", y = "health factors and outcomes ranking", x = "county") + # label plot and axes
  coord_flip() +
  theme_minimal()

pCleve

I’ll leave it there for now but moving forward there are a few changes to make:

  • Order the counties by the difference between their Factors and Outcomes ranking. This would really be a different way of viewing correlation but I’d like to see what it looks like. I.e. For any county whose Factor and Outcome ranking is the same, the difference would be zero. This would yield two points on top of one another = the most highly correlated.
  • Adjust the size and/or spacing of the county labels so that they’re readable.
  • Change the colors of the points (not digging the Christmas-color-theme).