What is a Dumbbell Plot?

A dumbbell plot shows how a quantitative variable differs or changes for each case (person, place, thing, etc.) across the two levels of a binary variable (a categorical variable with 2 outcomes).

Homes Data

I spent two months house hunting before purchasing a home. The Homes Viewed.csv data set has 5 variables for each of the 29 homes:

  1. Address: “Street Name, Town”
  2. Type: House/Condo
  3. Square Feet: The area of finished space (in square feet)
  4. Asking Price: How much the sellers were asking when the house was listed.
  5. Purchase Price: How much the buyers paid for the house.

Load the tidyverse package, read in the data, and use skim() from the skimr package to look at the data:

Data summary
Name homes
Number of rows 29
Number of columns 5
_______________________
Column type frequency:
character 4
numeric 1
________________________
Group variables None

Variable type: character

skim_variable   n_missing  complete_rate  min  max  empty  n_unique  whitespace
Address                 0           1.00   13   25      0        29           0
Type                    0           1.00    5    5      0         2           0
Asking.Price            0           1.00    4    4      0        24           0
Purchase.Price          1           0.97    4    4      0        21           0

Variable type: numeric

skim_variable  n_missing  complete_rate     mean      sd    p0   p25   p50   p75  p100  hist
Square.Feet            0              1  1751.31  338.43  1270  1533  1623  1935  2530  ▅▇▅▃▂
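
A summary like the one above could be produced along the following lines. This is only a sketch: it writes a three-row stand-in file (rows rebuilt from values shown later in this handout) so that it runs on its own, and it assumes the raw file uses the column headers and “360k”-style prices described above. It uses base R's read.csv(), which turns spaces in column names into dots and so matches names such as Asking.Price; with the real data, the path would point to Homes Viewed.csv instead.

```r
library(tidyverse)  # used throughout the later steps
library(skimr)      # provides skim()

# Hypothetical stand-in for "Homes Viewed.csv": three rows in the assumed raw layout
csv_lines <- c(
  'Address,Type,Square Feet,Asking Price,Purchase Price',
  '"Venus Ave, NNE",house,1407,360k,350k',
  '"Pearl Street, Essex",house,1270,350k,360k',
  '"River St, Winooski",condo,2263,380k,412k'
)
csv_path <- file.path(tempdir(), "Homes Viewed.csv")
writeLines(csv_lines, csv_path)

# read.csv() converts "Square Feet" to "Square.Feet", and so on
homes <- read.csv(csv_path)

# Summarize every column, grouped by column type
skim(homes)
```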

A couple of issues we should fix:

  1. The column names are not user friendly (capitalized, with spaces between words)
  2. Purchase Price has a missing value
  3. The 2 price variables are stored as characters because the prices have a “k” after them

Cleaning the data

So how do we clean the data to make it usable?

  1. Use clean_names() from the janitor package to make the column names more user friendly
     • janitor is not part of the tidyverse, so it has to be installed and loaded separately with library(janitor)
  2. Use one of our main dplyr verbs to remove the house that has a missing Purchase Price
  3. Use another one of our main dplyr verbs along with parse_number() to convert Asking Price and Purchase Price from a string to a numeric column
     • parse_number() only drops the “k” (“450k” -> 450), so multiply the result by 1000: “450k” -> 450000
  4. We may want to have the name of the street and the name of the town in different columns. The separate() function in the tidyr package will separate a string into two (or more) different columns based on a separator character.
     • Use “,” to separate street and town in the Address column (a separator of “, ” also removes the space after the comma)

separate() has three main arguments:

  1. col = the column with the string to be divided - “Var1”
  2. into = a pair of names to name the new columns - c("Var2", "Var3")
  3. sep = a character used to divide the data into columns

Finally, name the new data set houses.

##                street       town  type square_feet asking purchase
## 1           Venus Ave        NNE house        1407 360000   350000
## 2        Pearl Street      Essex house        1270 350000   360000
## 3           James Ave        NNE house        1528 325000   360000
## 4            Maple St       Bton house        1533 425000   400000
## 5            River St   Winooski condo        2263 380000   412000
## 6       Logwod Circle      Essex house        1620 370000   420000
## 7       Iroquis Drive      Essex house        2158 390000   425000
## 8           North Ave        NNE condo        1852 405000   427000
## 9           Arthur Ct       Bton condo        1579 390000   430000
## 10      Victory Drive         SB house        1796 396000   450000
## 11  Laurel Hill Drive         SB house        1935 432000   465000
## 12        Baldwin Ave         SB house        1555 430000   465000
## 13     Whitewater cir  Williston condo        2241 430000   475000
## 14         Thorton St   Winooski house        1554 440000   480000
## 15         Rudgate Rd Colchester house        2312 450000   511000
## 16           North St   Winooski house        1759 485000   527000
## 17 University Terrace       Bton house        1272 489000   550000
## 18      Devon Hill Ct      Essex house        2530 485000   550000
## 19   Juniper Ridge Rd      Essex house        1863 495000   552000
## 20        Colonial Dr Colchester house        1656 475000   590000
## 21          Foster St       Bton house        1489 519000   620000
## 22         Meridan St        NNE house        1824 410000   430000
## 23         Sundown Dr  Williston house        2270 435000   485000
## 24         Cascade St      Essex house        1623 418000   450000
## 25          Sherry Rd         SB house        1613 395000   441000
## 26        Lapointe St   Winooski house        1956 445000   480000
## 27      Lafountain St   Winooski house        1600 360000   360000
## 28       Saratoga Ave        NNE house        1428 425000   448000
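
One possible way to get from the raw data to a table shaped like the one above is sketched below. It is not necessarily the intended solution. The stand-in homes data frame holds two rows rebuilt from the table plus one hypothetical row (“Example Rd”) with a missing purchase price, and the shorter column names asking and purchase are created inside mutate() as an assumption about how the printed names came about.

```r
library(tidyverse)  # dplyr verbs, readr::parse_number(), tidyr::separate()
library(janitor)    # clean_names()

# Stand-in for the raw data; the third row is hypothetical
homes <- data.frame(
  Address        = c("Venus Ave, NNE", "Pearl Street, Essex", "Example Rd, Essex"),
  Type           = c("house", "house", "condo"),
  Square.Feet    = c(1407L, 1270L, 1500L),
  Asking.Price   = c("360k", "350k", "400k"),
  Purchase.Price = c("350k", "360k", NA)
)

houses <- homes %>%
  # Address -> address, Square.Feet -> square_feet, ...
  clean_names() %>%
  # drop the home with no purchase price
  filter(!is.na(purchase_price)) %>%
  # "360k" -> 360 -> 360000; .keep = "unused" drops the original text columns
  mutate(
    asking   = parse_number(asking_price) * 1000,
    purchase = parse_number(purchase_price) * 1000,
    .keep    = "unused"
  ) %>%
  # split on comma plus space so town has no leading blank
  separate(col = address, into = c("street", "town"), sep = ", ")

houses
```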

Dumbbell Plot: Method 1

There is a direct way to create a dumbbell plot, but it has one major downside that we’ll see once the plot is finished.

Looking at the example dumbbell plot in Brightspace, for each house we need:

  1. 1 point for asking price

  2. 1 point for purchase price

  3. A line to connect the two points

For the two points, we can use two separate geom_point() layers, one for each column.

What about the line that connects them? How do we draw a line that has a starting and stopping point?

We can use geom_segment(), which has 4 required aesthetics that give the two sets of coordinates where the line segment starts and stops:

  1. starting coordinates: (x, y)

  2. ending coordinates: (xend, yend)

Since we are drawing a horizontal line, the y and yend coordinates should be the same, while x and xend will be the asking and purchase prices. Try creating the basic graph:
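
A minimal sketch of the basic graph is shown here for reference. The four rows are copied from the houses table above so the example runs on its own, and the two point colors are arbitrary choices made for illustration.

```r
library(ggplot2)

# First four homes from the houses table
houses <- data.frame(
  street   = c("Venus Ave", "Pearl Street", "James Ave", "Maple St"),
  asking   = c(360000, 350000, 325000, 425000),
  purchase = c(350000, 360000, 360000, 400000)
)

method1 <- ggplot(data = houses, mapping = aes(y = street)) +
  # segment from (asking, street) to (purchase, street); y is inherited from ggplot()
  geom_segment(mapping = aes(x = asking, xend = purchase, yend = street)) +
  # one point layer per price column, each with a fixed (unmapped) color
  geom_point(mapping = aes(x = asking), color = "steelblue", size = 3) +
  geom_point(mapping = aes(x = purchase), color = "tomato", size = 3)

method1
```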

What is the major problem?

There isn’t a legend! The reader doesn’t know which color dot is the asking price and which is the purchase price. So what can we do?

For ggplot() to create a legend automatically, the aesthetic shown in the legend needs to be mapped to a single column. So we need one column that has the price type (asking/purchase) and another that just has the prices, regardless of whether each one is an asking or purchase price. So how do we do that?

Pivot verbs

If we want to create a dumbbell plot using ggplot, we need each of the variables in their own column so we can map them to their specific aesthetics:

  • y = case - house denoted by street
  • x = quantitative variable - house price
  • color = binary variable - asking or purchase price

Unfortunately, our data isn’t in the right format. The quantitative variable is in 2 different columns and the binary variable (price type) is represented by the column names.

To get the data into the right format, we need:

  • Both asking and purchase to be in the same column - price
  • An additional column that gives the type of price (asking vs purchase) - price_type

This is called the long format of the data.

We can use pivot_longer() to wrangle the data into a long format for price. The arguments are:

  1. cols = The columns we want to “stack” on top of one another
  2. names_to = the column name for the additional column (that contains the old column names)
  3. values_to = the column name for the column that stores the stacked columns

Both the names_to and values_to arguments need the new names to be in quotes - “var1”

Use pivot_longer() to reshape the data into the long format and name the result houses_long.

## # A tibble: 56 × 6
##    street       town     type  square_feet price_type  price
##    <chr>        <chr>    <chr>       <int> <chr>       <dbl>
##  1 Venus Ave    NNE      house        1407 asking     360000
##  2 Venus Ave    NNE      house        1407 purchase   350000
##  3 Pearl Street Essex    house        1270 asking     350000
##  4 Pearl Street Essex    house        1270 purchase   360000
##  5 James Ave    NNE      house        1528 asking     325000
##  6 James Ave    NNE      house        1528 purchase   360000
##  7 Maple St     Bton     house        1533 asking     425000
##  8 Maple St     Bton     house        1533 purchase   400000
##  9 River St     Winooski condo        2263 asking     380000
## 10 River St     Winooski condo        2263 purchase   412000
## # ℹ 46 more rows
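
The reshaping step that produces a table like the one above could look as follows. This is a sketch using only the first two homes from the houses table so that it is self-contained; the argument values follow the column names shown in the output.

```r
library(tidyverse)

# First two homes from the houses table
houses <- data.frame(
  street      = c("Venus Ave", "Pearl Street"),
  town        = c("NNE", "Essex"),
  type        = c("house", "house"),
  square_feet = c(1407L, 1270L),
  asking      = c(360000, 350000),
  purchase    = c(350000, 360000)
)

houses_long <- houses %>%
  pivot_longer(
    cols      = c(asking, purchase),  # columns to stack
    names_to  = "price_type",         # holds the old column names
    values_to = "price"               # holds the stacked prices
  )

houses_long
```

Each home now takes up two rows, one per price type, which is why 28 homes become 56 rows in the full data.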

Now we have the data in the format we need!

Dumbbell plot

Now use ggplot(), geom_point(), and geom_line() to create a dumbbell plot!

To match the graph in Blackboard, set size to 3 in geom_point() and 1 in geom_line(). (In ggplot2 3.4.0 and later, line thickness is set with linewidth rather than size.)
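
As a starting point, the plot might be sketched like this. The data frame holds the long-format rows for the first three homes so the example runs on its own, and it uses linewidth, which assumes ggplot2 3.4.0 or later (older versions use size for the line as well).

```r
library(ggplot2)

# Long-format rows for the first three homes
houses_long <- data.frame(
  street     = rep(c("Venus Ave", "Pearl Street", "James Ave"), each = 2),
  price_type = rep(c("asking", "purchase"), times = 3),
  price      = c(360000, 350000, 350000, 360000, 325000, 360000)
)

dumbbell <- ggplot(data = houses_long, mapping = aes(x = price, y = street)) +
  # one line per home, joining its asking and purchase prices
  geom_line(mapping = aes(group = street), linewidth = 1) +
  # color is mapped to a column, so ggplot2 builds the legend automatically
  geom_point(mapping = aes(color = price_type), size = 3)

dumbbell
```

Drawing the line layer before the point layer keeps the dots on top of the line ends.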

Let’s go through a couple of ways we can improve the graph as a class!