A dumbbell plot shows how a quantitative variable differs or changes for each case (person, place, thing, etc.) across a binary variable (a categorical variable with 2 outcomes).
I spent two months house hunting before purchasing a home. The Homes Viewed.csv data set has 5 variables for each of the 29 homes:
Load the tidyverse package, read in the data, and skim() it to look at the data:
Name | homes |
Number of rows | 29 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
character | 4 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Address | 0 | 1.00 | 13 | 25 | 0 | 29 | 0 |
Type | 0 | 1.00 | 5 | 5 | 0 | 2 | 0 |
Asking.Price | 0 | 1.00 | 4 | 4 | 0 | 24 | 0 |
Purchase.Price | 1 | 0.97 | 4 | 4 | 0 | 21 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Square.Feet | 0 | 1 | 1751.31 | 338.43 | 1270 | 1533 | 1623 | 1935 | 2530 | ▅▇▅▃▂ |
A couple of issues we should fix:
So how do we clean the data to make it usable?
Use clean_names() from the janitor package to make the column names more user friendly. Note that loading the tidyverse package doesn’t load janitor, so it has to be loaded separately!
Use one of our main dplyr verbs to remove the house that has a missing Purchase Price.
Use another one of our main dplyr verbs along with parse_number() to convert Asking Price and Purchase Price from a string to a numeric column.
The separate() function in the tidyr package will separate a string into two (or more) different columns based on a separator character. separate() has three main arguments:
col = the column with the string to be divided - "Var1"
into = a pair of names to name the new columns - c("Var2", "Var3")
sep = a character used to divide the data into columns
Finally, name the new data set houses.
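The cleaning steps above can be sketched on a small made-up data frame. This is only one possible approach, not necessarily the intended solution. The three toy rows, the ", " separator, the "350K" price format, and the multiplication by 1000 are all assumptions chosen to mimic the skim() summary; check the real file before reusing them.

```r
library(dplyr)
library(tidyr)
library(readr)    # parse_number()
library(janitor)  # clean_names(); must be loaded on its own

# Toy stand-in for the raw data (made-up rows, NOT the real file).
# Assumed formats: "street, town" addresses and prices such as "350K".
homes_toy <- data.frame(
  Address        = c("Oak St, Essex", "Elm Ave, NNE", "Pine Rd, SB"),
  Type           = c("house", "condo", "house"),
  Square.Feet    = c(1500L, 1800L, 1650L),
  Asking.Price   = c("350K", "400K", "375K"),
  Purchase.Price = c("360K", NA, "370K")
)

houses_toy <- homes_toy %>%
  clean_names() %>%                      # Asking.Price -> asking_price, etc.
  filter(!is.na(purchase_price)) %>%     # drop the home with no purchase price
  mutate(
    # parse_number() keeps only the numeric part; x1000 assumes prices in thousands
    asking   = parse_number(asking_price) * 1000,
    purchase = parse_number(purchase_price) * 1000
  ) %>%
  separate(col = address, into = c("street", "town"), sep = ", ") %>%
  select(street, town, type, square_feet, asking, purchase)

houses_toy
```

Here filter() and mutate() play the role of the two "main dplyr verbs"; the final select() only drops the original string price columns.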
## street town type square_feet asking purchase
## 1 Venus Ave NNE house 1407 360000 350000
## 2 Pearl Street Essex house 1270 350000 360000
## 3 James Ave NNE house 1528 325000 360000
## 4 Maple St Bton house 1533 425000 400000
## 5 River St Winooski condo 2263 380000 412000
## 6 Logwod Circle Essex house 1620 370000 420000
## 7 Iroquis Drive Essex house 2158 390000 425000
## 8 North Ave NNE condo 1852 405000 427000
## 9 Arthur Ct Bton condo 1579 390000 430000
## 10 Victory Drive SB house 1796 396000 450000
## 11 Laurel Hill Drive SB house 1935 432000 465000
## 12 Baldwin Ave SB house 1555 430000 465000
## 13 Whitewater cir Williston condo 2241 430000 475000
## 14 Thorton St Winooski house 1554 440000 480000
## 15 Rudgate Rd Colchester house 2312 450000 511000
## 16 North St Winooski house 1759 485000 527000
## 17 University Terrace Bton house 1272 489000 550000
## 18 Devon Hill Ct Essex house 2530 485000 550000
## 19 Juniper Ridge Rd Essex house 1863 495000 552000
## 20 Colonial Dr Colchester house 1656 475000 590000
## 21 Foster St Bton house 1489 519000 620000
## 22 Meridan St NNE house 1824 410000 430000
## 23 Sundown Dr Williston house 2270 435000 485000
## 24 Cascade St Essex house 1623 418000 450000
## 25 Sherry Rd SB house 1613 395000 441000
## 26 Lapointe St Winooski house 1956 445000 480000
## 27 Lafountain St Winooski house 1600 360000 360000
## 28 Saratoga Ave NNE house 1428 425000 448000
There is a direct way to create a dumbbell plot, but it has one major downside that we’ll see once the plot is finished.
Looking at the example dumbbell plot in Brightspace, for each house we need:
1 point for asking price
1 point for purchase price
A line to connect the two points
For the two points, we can use two separate geom_point() layers, one for each column.
What about the line that connects them? How do we draw a line that has a stopping and starting point?
With geom_segment(), which has 4 required aesthetics that give the two sets of coordinates where the line segment starts and stops:
starting coordinates: (x, y)
ending coordinates: (xend, yend)
Since we are drawing a flat line, the y and yend coordinates should be the same, while x and xend will be the asking and purchase price. Try creating the basic graph:
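A minimal sketch of this direct approach, assuming a wide data frame with street, asking, and purchase columns. The toy rows and the two colors are made up for illustration and are not taken from the original graph.

```r
library(ggplot2)

# Made-up wide data: one row per house, one column per price type.
houses_toy <- data.frame(
  street   = c("Oak St", "Elm Ave", "Pine Rd"),
  asking   = c(350000, 400000, 375000),
  purchase = c(360000, 390000, 375000)
)

p_wide <- ggplot(houses_toy, aes(y = street)) +
  # flat segment: y and yend are both the street, x runs from asking to purchase
  geom_segment(aes(x = asking, xend = purchase, yend = street)) +
  # one point layer per price column; colors are set by hand, not mapped
  geom_point(aes(x = asking), color = "steelblue", size = 3) +
  geom_point(aes(x = purchase), color = "darkorange", size = 3)

p_wide
```

Because the colors are fixed constants rather than mapped inside aes(), ggplot2 has nothing to build a guide from.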
What is the major problem?
That there isn’t a legend! The reader doesn’t know which color dot is the asking price vs the purchase price. So what can we do?
For ggplot() to create a legend automatically, the aesthetic that needs a guide has to be mapped to a single column. So we need one column that has the price type (asking/purchase) and another that just has the prices, regardless of whether it is the asking or purchase price. So how do we do that?
If we want to create a dumbbell plot using ggplot, we need each of the variables in its own column so we can map them to their specific aesthetics:
y = case - house denoted by street
x = quantitative variable - house price
color = binary variable - asking or purchase price
Unfortunately, our data isn’t in the right format. The quantitative variable is in 2 different columns and the binary variable (price type) is represented by the column names.
If we want to use ggplot() to create a dumbbell plot, we need:
This is called the long format of the data.
We can use pivot_longer() to wrangle the data into a long format for price. The arguments are:
cols = the columns we want to “stack” on top of one another
names_to = the column name for the additional column (that contains the old column names)
values_to = the column name for the column that stores the stacked columns
Both the names_to and values_to arguments need the new names to be in quotes - "var1"
Use pivot_longer() to manipulate the data to be in the long format and call it houses_long
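One way this could look, shown on a made-up three-row stand-in for houses (the toy rows are an assumption). The new column names price_type and price follow the output shown for houses_long.

```r
library(tidyr)

# Made-up wide data: asking and purchase prices in separate columns.
houses_toy <- data.frame(
  street   = c("Oak St", "Elm Ave", "Pine Rd"),
  town     = c("Essex", "NNE", "SB"),
  asking   = c(350000, 400000, 375000),
  purchase = c(360000, 390000, 375000)
)

houses_long_toy <- pivot_longer(
  houses_toy,
  cols      = c(asking, purchase),  # the columns to stack
  names_to  = "price_type",         # holds the old column names
  values_to = "price"               # holds the stacked values
)

houses_long_toy
```

Each house now takes up two rows, one per price type, so the row count doubles from 3 to 6.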
## # A tibble: 56 × 6
## street town type square_feet price_type price
## <chr> <chr> <chr> <int> <chr> <dbl>
## 1 Venus Ave NNE house 1407 asking 360000
## 2 Venus Ave NNE house 1407 purchase 350000
## 3 Pearl Street Essex house 1270 asking 350000
## 4 Pearl Street Essex house 1270 purchase 360000
## 5 James Ave NNE house 1528 asking 325000
## 6 James Ave NNE house 1528 purchase 360000
## 7 Maple St Bton house 1533 asking 425000
## 8 Maple St Bton house 1533 purchase 400000
## 9 River St Winooski condo 2263 asking 380000
## 10 River St Winooski condo 2263 purchase 412000
## # ℹ 46 more rows
Now we have the data in the format we need!
Now use ggplot(), geom_point(), and geom_line() to create a dumbbell plot!
To match the graph in Blackboard, set size to 3 in geom_point() and 1 in geom_line() (in ggplot2 3.4.0 and later, line thickness is set with linewidth rather than size).
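A minimal sketch of the long-format version, again on made-up rows rather than the real houses_long. It assumes ggplot2 3.4.0 or later for linewidth; on older versions size = 1 would be used in geom_line() instead.

```r
library(ggplot2)

# Made-up long data: two rows per house, one per price type.
houses_long_toy <- data.frame(
  street     = rep(c("Oak St", "Elm Ave", "Pine Rd"), each = 2),
  price_type = rep(c("asking", "purchase"), times = 3),
  price      = c(350000, 360000, 400000, 390000, 375000, 375000)
)

p_long <- ggplot(houses_long_toy, aes(x = price, y = street)) +
  # group by street so each line joins that house's two prices
  geom_line(aes(group = street), linewidth = 1) +
  # color is mapped to a single column, so a legend is drawn automatically
  geom_point(aes(color = price_type), size = 3)

p_long
```

Color is mapped only inside geom_point(). If it were mapped in ggplot(), each street and price type pair would form its own group of one point and geom_line() would have nothing to connect.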
Let’s go through a couple of ways we can improve the graph as a class!