Project Setup

This project was inspired by the Cleveland Dot Plot examples on the UC Business Analytics R Programming website at https://uc-r.github.io/cleveland-dot-plots.

For this project, I decided to look at the change in gonorrhea infection rates between 2015 and 2016. Specifically, I wanted to look at infection rates per 100,000 persons. You can find the data on the Centers for Disease Control and Prevention website at https://www.cdc.gov/std/stats16/tables/14.htm.

We begin by loading the tidyverse library. We will be using several of the individual libraries within the tidyverse, so it’s easier to load it rather than each library it contains.

library(tidyverse)

I copied the data from the CDC website into a .csv file, keeping the columns for state name and rate per 100,000 persons for 2015 and 2016. For this example, the file is named “gonorrhea_2015_2016_change_rates.csv”. The data were then loaded into memory as data1 and given appropriate column names. For our purposes, the District of Columbia was excluded.

data1 <- read.csv('gonorrhea_2015_2016_change_rates.csv', header = TRUE, sep = ',')
colnames(data1) <- c('Location','Y2015','Y2016')
head(data1)
##     Location Y2015 Y2016
## 1    Alabama 148.1 173.0
## 2     Alaska 150.7 196.9
## 3    Arizona 120.8 151.3
## 4   Arkansas 160.5 192.5
## 5 California 138.3 164.9
## 6   Colorado  80.4 109.5
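
If you do not have the .csv file at hand, the following is a minimal sketch that rebuilds the six rows printed above as a small data frame, so that the later steps can be tried out. The name data1_sample is my own placeholder and is not part of the original workflow; the values are copied from the head(data1) output.

```r
# Hypothetical stand-in for data1: the six rows shown by head(data1) above.
data1_sample <- data.frame(
  Location = c('Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado'),
  Y2015 = c(148.1, 150.7, 120.8, 160.5, 138.3, 80.4),
  Y2016 = c(173.0, 196.9, 151.3, 192.5, 164.9, 109.5)
)

# Inspect the column names and types.
str(data1_sample)
```

Substituting data1_sample for data1 in the steps that follow gives a six-state version of the plot.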

We can now reshape the data from wide to long format using the gather() function, so that each state has one row per year with its rate for that year.

data2a <- data1 %>% gather(key = Year, value = Rate, Y2015, Y2016)
# Convert to a tibble (grouping and then immediately ungrouping had this effect only)
data2a <- data2a %>% as_tibble()
head(data2a)
## # A tibble: 6 x 3
##   Location   Year   Rate
##   <fct>      <chr> <dbl>
## 1 Alabama    Y2015 148  
## 2 Alaska     Y2015 151  
## 3 Arizona    Y2015 121  
## 4 Arkansas   Y2015 160  
## 5 California Y2015 138  
## 6 Colorado   Y2015  80.4
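
In more recent versions of tidyr (1.0.0 and later), gather() is superseded by pivot_longer(). As a rough sketch of the same reshape, here it is on a two-state sample; sample_wide and sample_long are hypothetical names, and the values come from the head(data1) output above.

```r
library(tidyr)

# Hypothetical two-state sample, values copied from head(data1) above.
sample_wide <- data.frame(
  Location = c('Alabama', 'Alaska'),
  Y2015 = c(148.1, 150.7),
  Y2016 = c(173.0, 196.9)
)

# Reshape from one row per state to one row per state and year.
sample_long <- pivot_longer(
  sample_wide,
  cols = c(Y2015, Y2016),
  names_to = 'Year',
  values_to = 'Rate'
)
sample_long
```

One difference to be aware of: pivot_longer() keeps the rows for each state together, whereas gather() stacks all of the 2015 rows above the 2016 rows. The plot does not depend on row order, so either should work here.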

Next, we can set up a data frame to use for labels. The labels will indicate the percentage change in rates between 2015 and 2016. The new data frame, data2b, will include a new column for the percentage change in rates via the mutate() function.

data2b <- data1 %>% mutate(Change = Y2016 / Y2015 - 1)
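
As a quick check of the formula, here is the calculation for a single state, using the Alabama rates shown earlier (148.1 in 2015 and 173.0 in 2016). The variable names are placeholders of my own, and the sketch uses base R only.

```r
# Worked example for one state, using the Alabama rates from head(data1).
y2015 <- 148.1
y2016 <- 173.0

# Same formula as in the mutate() call: a proportion, not a percentage.
change <- y2016 / y2015 - 1

# Rounded to three digits, as is done for the plot labels.
round(change, digits = 3)

# The same value written as a percentage.
sprintf('%.1f%%', 100 * change)
```

So the Change column holds proportions (about 0.168 for Alabama), which are only turned into percentages when the labels are formatted.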

From data2b we create the data frame data2c, again using gather() to reshape the data. The data are then grouped by state, and for each state only the row with the higher rate is kept, along with its year. This places each label next to the rightmost point for that state.

data2c <- gather(data2b,key = Year, value = Rate, Y2015, Y2016) %>% group_by(Location) %>% top_n(1)
## Selecting by Rate
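
The message above appears because top_n() was not told which column to rank by, so it falls back to the last column, Rate. In dplyr 1.0.0 and later, top_n() is superseded by slice_max(), which requires the column to be named. A minimal sketch on a two-state sample follows; sample_long and sample_top are hypothetical names, and the values come from the head(data1) output.

```r
library(dplyr)

# Hypothetical long-format sample for two states (rates from head(data1)).
sample_long <- data.frame(
  Location = c('Alabama', 'Alaska', 'Alabama', 'Alaska'),
  Year = c('Y2015', 'Y2015', 'Y2016', 'Y2016'),
  Rate = c(148.1, 150.7, 173.0, 196.9)
)

# Keep, for each state, the row with the highest Rate.
sample_top <- sample_long %>%
  group_by(Location) %>%
  slice_max(Rate, n = 1) %>%
  ungroup()
sample_top
```

Like top_n(), slice_max() keeps ties by default, so a state with identical rates in both years would keep both rows.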

We can now plot the data, using the data frame data2a for the main plot, lines, and points. The data frame data2c supplies the labels showing the percentage change for each state. The percentage is formatted with the percent() function from the scales library.

The legend is then relabeled so that it shows the years as 2015 and 2016. Finally, the plot’s title and caption are added, along with labels for the x and y axes.

p1 <- ggplot(data2a, aes(Rate, reorder(Location, Rate))) +
  geom_line(aes(group = Location)) +
  geom_point(aes(color = Year), size = 2.5) +
  geom_text(data = data2c, aes(color = Year, label = scales::percent(round(Change, digits = 3))),
            size = 3, hjust = -.5)
p2 <- p1+scale_color_discrete(name='Year',breaks=c('Y2015','Y2016'),labels=c('2015','2016'))
p3 <- p2+labs(title='Reported rates of Gonorrhea per 100,000 persons by state, 2015-2016\nwith percent change between years',
x='Rate per 100,000 persons',y='State',caption='Source: Centers for Disease Control and Prevention,\nwww.cdc.gov/std/stats16/tables/14.htm')
p3