A dumbbell plot shows the difference or change in a quantitative variable for each case (person, place, thing, etc.) across a binary variable (a categorical variable with 2 outcomes).
I spent two months house hunting before purchasing a home. The Homes Viewed.csv data set has 5 variables for each of the 29 homes:
Load the tidyverse package, read in the data, and skim() it to look at the data:
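A minimal sketch of this step follows. It assumes the file is saved in the working directory as Homes Viewed.csv, that skim() comes from the skimr package (which library(tidyverse) does not attach), and that the data were read with read.csv(), which would explain dotted names such as Asking.Price. The guard around the read is only there so the sketch runs even when the file is absent.

```r
library(tidyverse)

# ASSUMPTION: the CSV sits in the working directory under this name
homes_path <- "Homes Viewed.csv"

if (file.exists(homes_path)) {
  # read.csv() converts spaces in headers to dots ("Asking Price" -> Asking.Price)
  homes <- read.csv(homes_path)

  # skim() is from skimr, not the tidyverse, so call it with its namespace
  print(skimr::skim(homes))
}
```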
| Name | homes |
|---|---|
| Number of rows | 29 |
| Number of columns | 5 |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Address | 0 | 1.00 | 13 | 25 | 0 | 29 | 0 |
| Type | 0 | 1.00 | 5 | 5 | 0 | 2 | 0 |
| Asking.Price | 0 | 1.00 | 4 | 4 | 0 | 24 | 0 |
| Purchase.Price | 1 | 0.97 | 4 | 4 | 0 | 21 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Square.Feet | 0 | 1 | 1751.31 | 338.43 | 1270 | 1533 | 1623 | 1935 | 2530 | ▅▇▅▃▂ |
A couple of issues we should fix:
So how do we clean the data to make it usable?
- Use clean_names() from the janitor package to make the column names more user friendly
  - Loading the tidyverse package does not load janitor, so load it separately with library(janitor)!
- Use one of our main dplyr verbs to remove the house that has a missing Purchase Price
- Use another one of our main dplyr verbs along with parse_number() to convert Asking Price and Purchase Price from a string to a numeric column.
- The separate() function in the tidyr package will separate a string into two (or more) different columns based on a separator character. separate() has three main arguments:
  - col = the column with the string to be divided - "Var1"
  - into = a pair of names to name the new columns - c("Var2", "Var3")
  - sep = a character used to divide the data into columns
- Finally, name the new data set houses
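The cleaning steps above can be sketched as one pipeline. This is an illustration on a toy data frame, not the original solution: the raw formats are assumptions (prices stored as strings like "360K", and Address using ", " between street and town), and the fourth row is a made-up placeholder standing in for the home with the missing purchase price. The first three rows reuse values from the cleaned output.

```r
library(tidyverse)
library(janitor)

# Toy stand-in for the raw data.
# ASSUMPTIONS: "360K"-style price strings, ", " as the address separator,
# and a hypothetical fourth row with a missing purchase price.
homes <- data.frame(
  Address        = c("Venus Ave, NNE", "Pearl Street, Essex",
                     "James Ave, NNE", "Placeholder Rd, SB"),
  Type           = c("house", "house", "house", "condo"),
  Square.Feet    = c(1407L, 1270L, 1528L, 1500L),
  Asking.Price   = c("360K", "350K", "325K", "400K"),
  Purchase.Price = c("350K", "360K", "360K", NA)
)

houses <- homes %>%
  clean_names() %>%                          # Asking.Price -> asking_price, etc.
  filter(!is.na(purchase_price)) %>%         # drop the home with no purchase price
  mutate(
    # parse_number() keeps only the numeric part: "360K" -> 360
    asking   = parse_number(asking_price) * 1000,
    purchase = parse_number(purchase_price) * 1000
  ) %>%
  separate(col = address, into = c("street", "town"), sep = ", ") %>%
  select(street, town, type, square_feet, asking, purchase)

houses
```

If the real file stores prices or addresses differently, only the multiplier in mutate() and the sep value in separate() should need to change.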
## street town type square_feet asking purchase
## 1 Venus Ave NNE house 1407 360000 350000
## 2 Pearl Street Essex house 1270 350000 360000
## 3 James Ave NNE house 1528 325000 360000
## 4 Maple St Bton house 1533 425000 400000
## 5 River St Winooski condo 2263 380000 412000
## 6 Logwod Circle Essex house 1620 370000 420000
## 7 Iroquis Drive Essex house 2158 390000 425000
## 8 North Ave NNE condo 1852 405000 427000
## 9 Arthur Ct Bton condo 1579 390000 430000
## 10 Victory Drive SB house 1796 396000 450000
## 11 Laurel Hill Drive SB house 1935 432000 465000
## 12 Baldwin Ave SB house 1555 430000 465000
## 13 Whitewater cir Williston condo 2241 430000 475000
## 14 Thorton St Winooski house 1554 440000 480000
## 15 Rudgate Rd Colchester house 2312 450000 511000
## 16 North St Winooski house 1759 485000 527000
## 17 University Terrace Bton house 1272 489000 550000
## 18 Devon Hill Ct Essex house 2530 485000 550000
## 19 Juniper Ridge Rd Essex house 1863 495000 552000
## 20 Colonial Dr Colchester house 1656 475000 590000
## 21 Foster St Bton house 1489 519000 620000
## 22 Meridan St NNE house 1824 410000 430000
## 23 Sundown Dr Williston house 2270 435000 485000
## 24 Cascade St Essex house 1623 418000 450000
## 25 Sherry Rd SB house 1613 395000 441000
## 26 Lapointe St Winooski house 1956 445000 480000
## 27 Lafountain St Winooski house 1600 360000 360000
## 28 Saratoga Ave NNE house 1428 425000 448000
There is a direct way to create a dumbbell plot, but it has 1 major downside that we’ll see once the plot is finished.
Looking at the example dumbbell plot in Brightspace, for each house we need:
- 1 point for asking price
- 1 point for purchase price
- A line to connect the two points
For the two points, we can use two separate geom_point() layers, one for each column.
What about the line that connects them? How do we draw a line that has a stopping and starting point?
With geom_segment(), which requires 4 aesthetics that give the two sets of coordinates where the line segment starts and stops:
- starting coordinates: (x, y)
- ending coordinates: (xend, yend)
Since we are drawing a horizontal line, the y and yend coordinates should be the same, while x and xend will be the asking and purchase price. Try creating the basic graph:
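One possible version of the basic graph is sketched below. It rebuilds the first four homes from the cleaned output so that it runs on its own; the two colors are arbitrary choices, not taken from the example plot.

```r
library(tidyverse)

# First four homes from the cleaned houses data
houses <- tibble(
  street   = c("Venus Ave", "Pearl Street", "James Ave", "Maple St"),
  asking   = c(360000, 350000, 325000, 425000),
  purchase = c(350000, 360000, 360000, 400000)
)

dumbbell_direct <- ggplot(houses, aes(y = street)) +
  # horizontal segment: y and yend are both the street
  geom_segment(aes(x = asking, xend = purchase, yend = street)) +
  # one geom_point() per price column; colors are set, not mapped
  geom_point(aes(x = asking), color = "steelblue", size = 3) +
  geom_point(aes(x = purchase), color = "tomato", size = 3)

dumbbell_direct
```

Because each color is set as a constant outside aes(), ggplot2 has nothing to build a guide from.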
What is the major problem?
There isn’t a legend! The reader can’t tell which color dot is the asking price and which is the purchase price. So what can we do?
For ggplot() to create a legend automatically, the aesthetic that needs a guide must be mapped to a single column. So we need one column that holds the price type (asking/purchase) and another that holds just the prices, regardless of whether each is an asking or purchase price. So how do we do that?
If we want to create a dumbbell plot using ggplot, we need each of the variables in their own column so we can map them to their specific aesthetics:
- y = case - house denoted by street
- x = quantitative variable - house price
- color = binary variable - asking or purchase price

Unfortunately, our data isn’t in the right format. The quantitative variable is in 2 different columns and the binary variable (price type) is represented by the column names.
If we want to use ggplot() to create a dumbbell plot, we need one row per house per price type, with the price type in one column and the price in another. This is called the long format of the data.
We can use pivot_longer() to wrangle the data into a long format for price. The arguments are:
- cols = The columns we want to “stack” on top of one another
- names_to = the column name for the additional column (that contains the old column names)
- values_to = the column name for the column that stores the stacked columns

Both the names_to and values_to arguments need the new names to be in quotes - "var1"
Use pivot_longer() to manipulate the data to be in the long format and call it houses_long
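A sketch of the reshaping step, using the first two homes from the cleaned output so that it runs on its own. The new column names price_type and price are taken from the printed result.

```r
library(tidyverse)

# First two homes from the cleaned houses data
houses <- tibble(
  street      = c("Venus Ave", "Pearl Street"),
  town        = c("NNE", "Essex"),
  type        = c("house", "house"),
  square_feet = c(1407L, 1270L),
  asking      = c(360000, 350000),
  purchase    = c(350000, 360000)
)

houses_long <- houses %>%
  pivot_longer(
    cols      = c(asking, purchase),  # the two columns to stack
    names_to  = "price_type",         # holds the old column names
    values_to = "price"               # holds the stacked prices
  )

houses_long
```

Each home now takes two rows, one per price type, so the row count doubles while the other columns repeat.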
## # A tibble: 56 × 6
## street town type square_feet price_type price
## <chr> <chr> <chr> <int> <chr> <dbl>
## 1 Venus Ave NNE house 1407 asking 360000
## 2 Venus Ave NNE house 1407 purchase 350000
## 3 Pearl Street Essex house 1270 asking 350000
## 4 Pearl Street Essex house 1270 purchase 360000
## 5 James Ave NNE house 1528 asking 325000
## 6 James Ave NNE house 1528 purchase 360000
## 7 Maple St Bton house 1533 asking 425000
## 8 Maple St Bton house 1533 purchase 400000
## 9 River St Winooski condo 2263 asking 380000
## 10 River St Winooski condo 2263 purchase 412000
## # ℹ 46 more rows
Now we have the data in the format we need!
Now use ggplot(), geom_point(), and geom_line() to create a dumbbell plot! To match the graph in Blackboard, set size to 3 in geom_point() and 1 in geom_line()
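One way this plot might be written is sketched below, on a small long-format table built from the first rows of houses_long. Two details are assumptions rather than part of the exercise: the line width is set with linewidth = 1, which recent ggplot2 releases (3.4.0 and later) use in place of size for lines, and group = street is given explicitly so each line joins only the two points of one home.

```r
library(tidyverse)

# First three homes in long format (values from the houses_long output)
houses_long <- tibble(
  street     = rep(c("Venus Ave", "Pearl Street", "James Ave"), each = 2),
  price_type = rep(c("asking", "purchase"), times = 3),
  price      = c(360000, 350000, 350000, 360000, 325000, 360000)
)

dumbbell_long <- ggplot(houses_long, aes(x = price, y = street)) +
  # draw the line first so the points sit on top of it
  geom_line(aes(group = street), linewidth = 1) +
  # color is mapped to a column, so a legend is created automatically
  geom_point(aes(color = price_type), size = 3)

dumbbell_long
```

With color mapped inside aes(), the legend distinguishing asking from purchase price appears without extra work.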
Let’s go through a couple of ways we can improve the graph as a class!