These are my notes for GRD 610A: Data Visualization II in Winter 2021 at the College for Creative Studies. These notes are for my work in the book Data Visualization by Kieran Healy (Princeton University Press, 2019).
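Objects in R are created and referred to by their names. Certain names are not allowed because they are reserved words, such as TRUE, if, and NA. Names of existing functions, such as mean(), are not reserved, but reusing them for your own objects invites confusion and is best avoided. Names also cannot start with a number or contain spaces. There are different naming conventions.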
Snake Case
my_data
this_is_snake_case
Camel Case
myData
thisIsCamelCase
Pascal Case
MyData
ThisIsPascalCase
Pick one naming convention and stick with it. Be consistent; don’t switch between conventions. I recommend snake case.
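The naming rules above can be checked in the console. This is a minimal sketch, not part of the book: is_valid_name() is a hypothetical helper that relies on base R's make.names(), which rewrites a string into a syntactically valid name. If nothing had to be rewritten, the name was already valid.

```r
# Hypothetical helper: TRUE when a string is already a syntactically valid name
is_valid_name <- function(x) {
  make.names(x) == x
}

is_valid_name("my_data") # snake case is fine
## [1] TRUE
is_valid_name("2things") # starts with a number
## [1] FALSE
is_valid_name("my data") # contains a space
## [1] FALSE
is_valid_name("TRUE") # reserved word
## [1] FALSE
```

A name like mean passes this check because it is syntactically valid, even though reusing it is not a good idea.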
# This is a comment (it starts with #)
my_data <- c(1, 2, 3, 4) # Assign using <- ; use ALT + - or OPTION + -
My_Data
## Error in eval(expr, envir, enclos): object 'My_Data' not found
# Cannot be found because we called it my_data (lowercase)
# Now we can see it
my_data
## [1] 1 2 3 4
Think of functions like a recipe. The arguments of the function are the ingredients and what happens within the function is the sequence of cooking steps.
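To make the recipe idea concrete, here is a small sketch of a hand-written function. The name my_mean() is made up for illustration; it mirrors what the built-in mean() does for a plain numeric vector.

```r
# Hypothetical function: the argument x is the ingredient,
# and the body is the sequence of cooking steps
my_mean <- function(x) {
  total <- sum(x) # step 1: add everything up
  count <- length(x) # step 2: count the values
  total / count # step 3: the last value is what the function returns
}

my_mean(x = c(1, 2, 3, 1, 3, 5, 25))
## [1] 5.714286
```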
c(1, 2, 3, 1, 3, 5, 25) # c() is the combine function; it puts things together into a vector
## [1] 1 2 3 1 3 5 25
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)
your_numbers <- c(5, 31, 71, 1, 3, 21, 6)
my_numbers
## [1] 1 2 3 1 3 5 25
mean(x = my_numbers)
## [1] 5.714286
mean(my_numbers) # you don't have to specify the argument names, but order matters if you do not specify
## [1] 5.714286
mean(x = your_numbers)
## [1] 19.71429
my_summary <- summary(my_numbers)
my_summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.500 3.000 5.714 4.000 25.000
table(my_numbers)
## my_numbers
## 1 2 3 5 25
## 2 1 2 1 1
sd(my_numbers)
## [1] 8.616153
my_numbers * 5
## [1] 5 10 15 5 15 25 125
my_numbers + 1
## [1] 2 3 4 2 4 6 26
my_numbers + my_numbers # How is this different than the last line?
## [1] 2 4 6 2 6 10 50
# If you're not sure what an object is, ask for its class or type
class(my_numbers)
## [1] "numeric"
class(my_summary)
## [1] "summaryDefault" "table"
class(summary)
## [1] "function"
my_new_vector <- c(my_numbers, "Apple") # What happens if we combine a word with numbers?
my_new_vector
## [1] "1" "2" "3" "1" "3" "5" "25" "Apple"
class(my_new_vector)
## [1] "character"
# Let's look at a new dataset
titanic
## fate sex n percent
## 1 perished male 1364 62.0
## 2 perished female 126 5.7
## 3 survived male 367 16.7
## 4 survived female 344 15.6
class(titanic)
## [1] "data.frame"
# Titanic is a data frame, which is like a table
# The $ operator can be used to access a column of a data frame by name
titanic$percent
## [1] 62.0 5.7 16.7 15.6
# Tibbles are slightly different than data frames. They are also data tables, but they provide more information.
titanic_tb <- as_tibble(titanic)
titanic_tb # How does this compare to titanic above?
## # A tibble: 4 x 4
## fate sex n percent
## <fct> <fct> <dbl> <dbl>
## 1 perished male 1364 62
## 2 perished female 126 5.7
## 3 survived male 367 16.7
## 4 survived female 344 15.6
# To see inside an object, ask for its structure
str(my_numbers)
## num [1:7] 1 2 3 1 3 5 25
str(my_summary)
## 'summaryDefault' Named num [1:6] 1 1.5 3 5.71 4 ...
## - attr(*, "names")= chr [1:6] "Min." "1st Qu." "Median" "Mean" ...
Programming in R can be challenging, and it takes time to get used to. Be patient and take a break if you get stuck. Make sure parentheses are opened and closed. Complete your commands (look out for the + in the console). Take your time and look out for typos.
In this section, we will get data from a URL and make a quick figure.
# Data source
url <- "https://cdn.rawgit.com/kjhealy/viz-organdata/master/organdonation.csv"
# Read the CSV from the URL
organs <- read_csv(file = url)
# Take a quick look at the data
glimpse(organs)
## Rows: 238
## Columns: 21
## $ country <chr> "Australia", "Australia", "Australia", "Australia"...
## $ year <dbl> NA, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998...
## $ donors <dbl> NA, 12.09, 12.35, 12.51, 10.25, 10.18, 10.59, 10.2...
## $ pop <dbl> 17065, 17284, 17495, 17667, 17855, 18072, 18311, 1...
## $ pop.dens <dbl> 0.2204433, 0.2232723, 0.2259980, 0.2282198, 0.2306...
## $ gdp <dbl> 16774, 17171, 17914, 18883, 19849, 21079, 21923, 2...
## $ gdp.lag <dbl> 16591, 16774, 17171, 17914, 18883, 19849, 21079, 2...
## $ health <dbl> 1300, 1379, 1455, 1540, 1626, 1737, 1846, 1948, 20...
## $ health.lag <dbl> 1224, 1300, 1379, 1455, 1540, 1626, 1737, 1846, 19...
## $ pubhealth <dbl> 4.8, 5.4, 5.4, 5.4, 5.4, 5.5, 5.6, 5.7, 5.9, 6.1, ...
## $ roads <dbl> 136.59537, 122.25179, 112.83224, 110.54508, 107.98...
## $ cerebvas <dbl> 682, 647, 630, 611, 631, 592, 576, 525, 516, 493, ...
## $ assault <dbl> 21, 19, 17, 18, 17, 16, 17, 17, 16, 15, 16, 15, 14...
## $ external <dbl> 444, 425, 406, 376, 387, 371, 395, 385, 410, 409, ...
## $ txp.pop <dbl> 0.9375916, 0.9257116, 0.9145470, 0.9056433, 0.8961...
## $ world <chr> "Liberal", "Liberal", "Liberal", "Liberal", "Liber...
## $ opt <chr> "In", "In", "In", "In", "In", "In", "In", "In", "I...
## $ consent.law <chr> "Informed", "Informed", "Informed", "Informed", "I...
## $ consent.practice <chr> "Informed", "Informed", "Informed", "Informed", "I...
## $ consistent <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "...
## $ ccode <chr> "Oz", "Oz", "Oz", "Oz", "Oz", "Oz", "Oz", "Oz", "O...
# View(organs) # Run in RStudio
# Another way to view data
gapminder
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ... with 1,694 more rows
# Make a plot object
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp))
# Create a scatterplot
p + geom_point()
ggplot2 is an R package that lets us map data to visual elements. With it we can control how the data appears in the plot and how each element of the plot is displayed. Aesthetic mappings make the connection between the data and how it is displayed on the plot (location, size, color, shape, etc.). Geoms define the type of plot (scatterplot, line plot, box plot, bar chart, etc.). The pieces of code are added together with the + operator to make the plot. More pieces can be added to define the scales, legend, labels, axes, and style or theme of the plot. Each part is added using a function whose arguments specify the look of the plot; the plot is built up piece by piece.
In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
From Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10).
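As an illustration of these three rules, the sketch below reshapes a small made-up table from a "wide" layout (one column per year) into a tidy "long" layout (one row per country-year). The data values and object names are hypothetical, and the reshaping is done by hand in base R to show what happens; in the tidyverse this is typically done with tidyr::pivot_longer().

```r
# Made-up wide data: the year values are stored in the column names
wide <- data.frame(country = c("A", "B"),
                   `1999` = c(10, 20),
                   `2000` = c(11, 22),
                   check.names = FALSE,
                   stringsAsFactors = FALSE)

# Tidy version: year becomes its own variable, one observation per row
long <- data.frame(country = rep(wide$country, times = 2),
                   year = rep(c(1999, 2000), each = nrow(wide)),
                   cases = c(wide[["1999"]], wide[["2000"]]),
                   stringsAsFactors = FALSE)
long
```

The long version has four rows, one for each combination of country and year, which is the shape ggplot expects when mapping year and cases to aesthetics.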
Build a plot layer by layer, starting with telling ggplot what data to use and how to map or link it to parts of the plot, like the x and y axes. Then add on the type of geom.
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp))
p + geom_point()
Trying different geom_ functions.
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp))
p + geom_smooth()
p + geom_point() + # add the points back into the plot
geom_smooth()
p + geom_point() +
geom_smooth(method = "lm") # use a linear model
p + geom_point() +
geom_smooth(method = "gam") + # generalized additive model
scale_x_log10() # transform x-axis to log-10 scale
p + geom_point() +
geom_smooth(method = "gam") +
scale_x_log10(labels = scales::dollar) # format x-axis in dollars
Using the aesthetics mapping, different parts of the data can be encoded in different ways.
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp,
color = "purple")) # ggplot adds the value "purple" to all rows
p + geom_point() +
geom_smooth(method = "loess") +
scale_x_log10()
# To actually turn all of the points purple, we need to set the color property of the geom_ function
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp))
p + geom_point(color = "purple") + # set point color to purple
geom_smooth(method = "loess") +
scale_x_log10()
p + geom_point(alpha = 0.3) + # make points more transparent
geom_smooth(color = "orange", # make line orange
se = FALSE, # remove standard error band
size = 8, # increase thickness of the line
method = "lm") +
scale_x_log10()
p + geom_point(alpha = 0.3) + # make points more transparent
geom_smooth(method = "gam") +
scale_x_log10(labels = scales::dollar) +
# Add title and labels
labs(x = "GDP per Capita",
y = "Life Expectancy in Years",
title = "Economic Growth and Life Expectancy",
subtitle = "Data points are country-years",
caption = "Source: Gapminder")
# Map data by continent
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp,
color = continent))
p + geom_point() +
geom_smooth(method = "loess") +
scale_x_log10()
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp,
color = continent,
fill = continent)) # now the error bands will also have the color
p + geom_point() +
geom_smooth(method = "loess") +
scale_x_log10()
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp))
p + geom_point(mapping = aes(color = continent)) + # points will be colored by continent
geom_smooth(method = "loess") + # the smoothed line will be for all data
scale_x_log10()
p + geom_point(mapping = aes(color = log(pop))) + # points will be colored by population
scale_x_log10()
Use here() to build file paths relative to the project directory so that plots are saved in a predictable place. It can also be used to reference folders within the project directory. For this class, save plots as .svg so they are in vector format and can be embedded in Adobe Illustrator. The function to save a plot is ggsave(), which saves the last plot by default and can also be given a specific ggplot object to save.
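A minimal sketch of saving a plot this way follows. The folder name "figures", the file name, and the toy data are assumptions made for illustration, and writing .svg files with ggsave() assumes the svglite package is installed.

```r
library(ggplot2)
library(here)

# A small plot built from made-up values
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 3, 6))
p_toy <- ggplot(data = toy,
                mapping = aes(x = x, y = y)) +
  geom_point()

# Create a "figures" folder (hypothetical name) inside the project directory
dir.create(here("figures"), showWarnings = FALSE)

# Save a specific plot object as SVG; width and height are in inches by default
out_file <- here("figures", "toy_plot.svg")
ggsave(filename = out_file, plot = p_toy, width = 6, height = 4)
```

Leaving out the plot argument would save whichever plot was displayed last.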
Pick at least two of the questions presented under the Where to Go Next section and answer them.
“Code almost never works properly the first time you write it.” (p. 73)
p <- ggplot(data = gapminder,
mapping = aes(x = year,
y = gdpPercap))
p + geom_line() # Something is wrong, we didn't tell it how to group
p + geom_line(aes(group = country)) # Now there is a line per country
Facet = small multiple (i.e. a separate graph for each value of the variable)
p <- ggplot(data = gapminder,
mapping = aes(x = year,
y = gdpPercap))
p + geom_line(aes(group = country)) +
facet_wrap(~continent) # make a separate plot for each continent
# Make it look a little nicer
p + geom_line(color = "gray70",
aes(group = country)) +
geom_smooth(size = 1.1, method = "loess", se = FALSE) +
scale_y_log10(labels = scales::dollar) +
facet_wrap(~continent, ncol = 5) +
labs(x = "Year",
y = "GDP per capita",
title = "GDP per capita on Five Continents")
# New dataset 2016 General Social Survey with more categorical data
glimpse(gss_sm)
## Rows: 2,867
## Columns: 32
## $ year <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2...
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ ballot <labelled> 1, 2, 3, 1, 3, 2, 1, 3, 1, 3, 2, 1, 2, 3, 2, 3, 3,...
## $ age <dbl> 47, 61, 72, 43, 55, 53, 50, 23, 45, 71, 33, 86, 32, 60,...
## $ childs <dbl> 3, 0, 2, 4, 2, 2, 2, 3, 3, 4, 5, 4, 3, 5, 7, 2, 6, 5, 0...
## $ sibs <labelled> 2, 3, 3, 3, 2, 2, 2, 6, 5, 1, 4, 4, 3, 6, 0, 1, 3,...
## $ degree <fct> Bachelor, High School, Bachelor, High School, Graduate,...
## $ race <fct> White, White, White, White, White, White, White, Other,...
## $ sex <fct> Male, Male, Male, Female, Female, Female, Male, Female,...
## $ region <fct> New England, New England, New England, New England, New...
## $ income16 <fct> $170000 or over, $50000 to 59999, $75000 to $89999, $17...
## $ relig <fct> None, None, Catholic, Catholic, None, None, None, Catho...
## $ marital <fct> Married, Never Married, Married, Married, Married, Marr...
## $ padeg <fct> Graduate, Lt High School, High School, NA, Bachelor, NA...
## $ madeg <fct> High School, High School, Lt High School, High School, ...
## $ partyid <fct> "Independent", "Ind,near Dem", "Not Str Republican", "N...
## $ polviews <fct> Moderate, Liberal, Conservative, Moderate, Slightly Lib...
## $ happy <fct> Pretty Happy, Pretty Happy, Very Happy, Pretty Happy, V...
## $ partners <fct> NA, 1 Partner, 1 Partner, NA, 1 Partner, 1 Partner, NA,...
## $ grass <fct> NA, Legal, Not Legal, NA, Legal, Legal, NA, Not Legal, ...
## $ zodiac <fct> Aquarius, Scorpio, Pisces, Cancer, Scorpio, Scorpio, Ca...
## $ pres12 <labelled> 3, 1, 2, 2, 1, 1, NA, NA, NA, 2, NA, NA, 1, 1, 2, ...
## $ wtssall <dbl> 0.9569935, 0.4784968, 0.9569935, 1.9139870, 1.4354903, ...
## $ income_rc <fct> Gt $170000, Gt $50000, Gt $75000, Gt $170000, Gt $17000...
## $ agegrp <fct> Age 45-55, Age 55-65, Age 65+, Age 35-45, Age 45-55, Ag...
## $ ageq <fct> Age 34-49, Age 49-62, Age 62+, Age 34-49, Age 49-62, Ag...
## $ siblings <fct> 2, 3, 3, 3, 2, 2, 2, 6+, 5, 1, 4, 4, 3, 6+, 0, 1, 3, 6+...
## $ kids <fct> 3, 0, 2, 4+, 2, 2, 2, 3, 3, 4+, 4+, 4+, 3, 4+, 4+, 2, 4...
## $ religion <fct> None, None, Catholic, Catholic, None, None, None, Catho...
## $ bigregion <fct> Northeast, Northeast, Northeast, Northeast, Northeast, ...
## $ partners_rc <fct> NA, 1, 1, NA, 1, 1, NA, 1, NA, 3, 1, NA, 1, NA, 0, 1, 0...
## $ obama <dbl> 0, 1, 0, 0, 1, 1, NA, NA, NA, 0, NA, NA, 1, 1, 0, 1, 0,...
# Practice using facet_grid() to facet between multiple variables
p <- ggplot(data = gss_sm,
mapping = aes(x = age,
y = childs))
p + geom_point(alpha = 0.2) +
geom_smooth() +
facet_grid(sex ~ race)
## Warning: Removed 18 rows containing non-finite values (stat_smooth).
## Warning: Removed 18 rows containing missing values (geom_point).
Each geom_ function has an associated stat_ function that is used to plot the data. Sometimes this involves transforming the data in some way.
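To get a feel for the kind of transformation involved, the sketch below computes counts and proportions by hand in base R for a made-up vector of regions. This is only an approximation of what the counting stat behind geom_bar() works out; ggplot's internals are not shown here.

```r
# Made-up categorical data
regions <- c("West", "South", "South", "Midwest", "South", "West")

# Count the observations in each category
counts <- table(regions)
counts

# Convert the counts to proportions of the whole
props <- prop.table(counts)
props
```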
p <- ggplot(data = gss_sm,
mapping = aes(x = bigregion))
p + geom_bar() # makes a bar graph that counts the number of observations per region; count is computed for us
p + geom_bar(mapping = aes(y = ..prop..)) # the prop statistic can show us proportions
# But this is not right, each shows 100%
# So, we need to fix the automatic grouping that is occurring by region
p + geom_bar(mapping = aes(y = ..prop.., group = 1)) # using group = 1 is basically a placeholder that says all the data is in the same group
# Look at a different variable
table(gss_sm$religion)
##
## Protestant Catholic Jewish None Other
## 1371 649 51 619 159
p <- ggplot(data = gss_sm,
mapping = aes(x = religion, color = religion))
p + geom_bar() # only the outline has a color - we need to use fill
p <- ggplot(data = gss_sm,
mapping = aes(x = religion, fill = religion))
p + geom_bar()
# Remove the legend
p + geom_bar() +
guides(fill = FALSE)
# How can we look at two variables together
p <- ggplot(data = gss_sm,
mapping = aes(x = bigregion,
fill = religion))
p + geom_bar() # Stacked bar chart of counts
p + geom_bar(position = "fill") # Stacked bar chart of proportions
p + geom_bar(position = "dodge") # Bar chart of counts side by side
p + geom_bar(position = "dodge",
mapping = aes(y = ..prop..)) # Bar chart of proportions side by side
# Not quite right - all are 100%
p + geom_bar(position = "dodge",
mapping = aes(y = ..prop..,
group = religion)) # Bar chart of proportions side by side
# The proportions sum to 1 by religion across the regions
p <- ggplot(data = gss_sm,
mapping = aes(x = religion))
p + geom_bar(position = "dodge",
mapping = aes(y = ..prop..,
group = bigregion)) +
facet_wrap(~bigregion, ncol = 2)
# Now the proportions sum to 1 by region across religions
Histograms create bins of numerical data and display the distribution of the data within those bins.
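The binning step can be sketched by hand with base R's cut() and table(). The values and break points below are made up, and geom_histogram() chooses its own bins, so this only illustrates the idea.

```r
# Made-up numeric data
x <- c(1, 2, 2, 3, 5, 7, 8, 9)

# Assign each value to a bin; intervals are open on the left, closed on the right
bins <- cut(x, breaks = c(0, 2.5, 5, 7.5, 10))

# Count how many values fall in each bin; these are the bar heights
bin_counts <- table(bins)
bin_counts
```

Changing the break points changes the counts, which is why the bins argument can noticeably alter how a histogram looks.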
# A new dataset
glimpse(midwest)
## Rows: 437
## Columns: 28
## $ PID <int> 561, 562, 563, 564, 565, 566, 567, 568, 569, 5...
## $ county <chr> "ADAMS", "ALEXANDER", "BOND", "BOONE", "BROWN"...
## $ state <chr> "IL", "IL", "IL", "IL", "IL", "IL", "IL", "IL"...
## $ area <dbl> 0.052, 0.014, 0.022, 0.017, 0.018, 0.050, 0.01...
## $ poptotal <int> 66090, 10626, 14991, 30806, 5836, 35688, 5322,...
## $ popdensity <dbl> 1270.9615, 759.0000, 681.4091, 1812.1176, 324....
## $ popwhite <int> 63917, 7054, 14477, 29344, 5264, 35157, 5298, ...
## $ popblack <int> 1702, 3496, 429, 127, 547, 50, 1, 111, 16, 165...
## $ popamerindian <int> 98, 19, 35, 46, 14, 65, 8, 30, 8, 331, 51, 26,...
## $ popasian <int> 249, 48, 16, 150, 5, 195, 15, 61, 23, 8033, 89...
## $ popother <int> 124, 9, 34, 1139, 6, 221, 0, 84, 6, 1596, 20, ...
## $ percwhite <dbl> 96.71206, 66.38434, 96.57128, 95.25417, 90.198...
## $ percblack <dbl> 2.57527614, 32.90043290, 2.86171703, 0.4122573...
## $ percamerindan <dbl> 0.14828264, 0.17880670, 0.23347342, 0.14932156...
## $ percasian <dbl> 0.37675897, 0.45172219, 0.10673071, 0.48691813...
## $ percother <dbl> 0.18762294, 0.08469791, 0.22680275, 3.69733169...
## $ popadults <int> 43298, 6724, 9669, 19272, 3979, 23444, 3583, 1...
## $ perchsd <dbl> 75.10740, 59.72635, 69.33499, 75.47219, 68.861...
## $ percollege <dbl> 19.63139, 11.24331, 17.03382, 17.27895, 14.476...
## $ percprof <dbl> 4.355859, 2.870315, 4.488572, 4.197800, 3.3676...
## $ poppovertyknown <int> 63628, 10529, 14235, 30337, 4815, 35107, 5241,...
## $ percpovertyknown <dbl> 96.27478, 99.08714, 94.95697, 98.47757, 82.505...
## $ percbelowpoverty <dbl> 13.151443, 32.244278, 12.068844, 7.209019, 13....
## $ percchildbelowpovert <dbl> 18.011717, 45.826514, 14.036061, 11.179536, 13...
## $ percadultpoverty <dbl> 11.009776, 27.385647, 10.852090, 5.536013, 11....
## $ percelderlypoverty <dbl> 12.443812, 25.228976, 12.697410, 6.217047, 19....
## $ inmetro <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1...
## $ category <chr> "AAR", "LHR", "AAR", "ALU", "AAR", "AAR", "LAR...
# Show distribution of the size of the counties in the Midwest
p <- ggplot(data = midwest,
mapping = aes(x = area))
p + geom_histogram() # count is computed automatically by the default stat function
p + geom_histogram(bins = 10) # set 10 bins
# Look at only two states
oh_wi <- c("OH", "WI")
p <- ggplot(data = subset(midwest, subset = state %in% oh_wi),
mapping = aes(x = percollege,
fill = state))
p + geom_histogram(alpha = 0.4, bins = 20) # Overlapping histograms
# Density estimate of the underlying distribution - density plot
p <- ggplot(data = midwest,
mapping = aes(x = area))
p + geom_density()
# Density by state
p <- ggplot(data = midwest,
mapping = aes(x = area,
fill = state,
color = state))
p + geom_density(alpha = 0.3)
# Compare to geom_line(stat = "density")
p + geom_line(stat = "density")
# Scaled density
p <- ggplot(data = subset(midwest, subset = state %in% oh_wi),
mapping = aes(x = percollege,
fill = state,
color = state))
p + geom_density(alpha = 0.3,
mapping = aes(y = ..scaled..))
### Avoiding Transformation When Necessary
Avoiding transformations - sometimes the data is already aggregated or summarized and we do not need a transformation.
titanic # this data is already summarized
## fate sex n percent
## 1 perished male 1364 62.0
## 2 perished female 126 5.7
## 3 survived male 367 16.7
## 4 survived female 344 15.6
p <- ggplot(data = titanic,
mapping = aes(x = fate,
y = percent,
fill = sex))
p + geom_bar(position = "dodge",
stat = "identity") + # plot values as provided, do not summarize/count/etc.
theme(legend.position = "top") # this puts the legend at the top of the graph
oecd_sum # another dataset that is already summarized
## # A tibble: 57 x 5
## # Groups: year [57]
## year other usa diff hi_lo
## <int> <dbl> <dbl> <dbl> <chr>
## 1 1960 68.6 69.9 1.3 Below
## 2 1961 69.2 70.4 1.2 Below
## 3 1962 68.9 70.2 1.30 Below
## 4 1963 69.1 70 0.9 Below
## 5 1964 69.5 70.3 0.800 Below
## 6 1965 69.6 70.3 0.7 Below
## 7 1966 69.9 70.3 0.400 Below
## 8 1967 70.1 70.7 0.6 Below
## 9 1968 70.1 70.4 0.3 Below
## 10 1969 70.1 70.6 0.5 Below
## # ... with 47 more rows
p <- ggplot(data = oecd_sum,
mapping = aes(x = year,
y = diff,
fill = hi_lo))
p + geom_col() + # this is the same as geom_bar with stat = "identity"
guides(fill = FALSE) + # no legend
labs(x = NULL, # no x-axis label
y = "Difference in Years",
title = "The US Life Expectancy Gap",
subtitle = "Difference between US and OECD average life expectancies, 1960-2015",
caption = "Data: OECD. After a chart by Christopher Ingraham, Washington Post, December 27th 2017")
## Warning: Removed 1 rows containing missing values (position_stack).
Pick at least two of the questions presented under the Where to Go Next section and answer them.
It is best practice to calculate the summary statistics first and then plot them, rather than using the stat_ functions within geom_ functions. This is because it makes the code easier to understand and read and allows us to double check the data and aggregations more easily.
The pipe operator %>% (from the magrittr package and made available by dplyr) allows us to pass data from one operation or function to the next. Usually there is a sequence of steps: group, filter/select, mutate, then summarize.
Within group_by(), grouping levels (left to right) go from outermost to innermost. Functions used to create new variables (such as summarize()) will be applied to the innermost group level first.
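A small sketch of how the grouping levels behave, using a made-up table (toy and its values are hypothetical). After summarize() collapses the innermost level (religion), the result is still grouped by region, so sum(N) in mutate() is computed within each region. The .groups = "drop_last" argument (dplyr 1.0.0 or later) states that default behavior explicitly.

```r
library(dplyr)

# Made-up data: one row per respondent
toy <- tibble(region = c("N", "N", "N", "S", "S"),
              religion = c("a", "a", "b", "a", "b"))

toy_counts <- toy %>%
  group_by(region, religion) %>% # outermost to innermost
  summarize(N = n(), .groups = "drop_last") %>% # collapses religion, keeps region
  mutate(freq = N / sum(N)) # proportions within each region

# Check that the proportions sum to 1 within each region
toy_check <- toy_counts %>%
  summarize(total = sum(freq))
toy_check
```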
# Create a tibble/data table with the percent of each religion by region
rel_by_region <- gss_sm %>%
group_by(bigregion, religion) %>%
summarize(N = n()) %>%
mutate(freq = N / sum(N),
pct = round((freq * 100), 0))
# Check the percentages sum to 100 by region
rel_by_region %>%
group_by(bigregion) %>%
summarise(total = sum(pct))
## # A tibble: 4 x 2
## bigregion total
## <fct> <dbl>
## 1 Northeast 100
## 2 Midwest 101
## 3 South 100
## 4 West 101
# Make a plot (note: Healy stops using the argument name)
p <- ggplot(data = rel_by_region,
mapping = aes(x = bigregion,
y = pct,
fill = religion))
p +
geom_col(position = "dodge2") +
labs(x = "Region",
y = "Percent",
fill = "Religion") +
theme(legend.position = "top")
# Let's rearrange it a little
p <- ggplot(data = rel_by_region,
mapping = aes(x = religion,
y = pct,
fill = religion))
p +
geom_col(position = "dodge2") +
labs(x = NULL, # don't put a label on the axis
y = "Percent",
fill = "Religion") +
guides(fill = FALSE) +
coord_flip() + # switches the x and y axes after the plot is made
facet_grid(~ bigregion)
In this section, we will learn how to use geom_boxplot()
# New Dataset on Organ Donations by country and year
organdata %>% select(1:6) %>% sample_n(size = 10)
## # A tibble: 10 x 6
## country year donors pop pop_dens gdp
## <chr> <date> <dbl> <int> <dbl> <int>
## 1 Italy 2002-01-01 18.1 57994 19.2 25569
## 2 Switzerland 1997-01-01 14.3 7089 17.2 27675
## 3 France 1995-01-01 15.1 57844 10.5 21283
## 4 Switzerland 2002-01-01 10.4 7290 17.7 30725
## 5 Belgium 1996-01-01 21.2 10157 30.7 22152
## 6 Italy 1992-01-01 5.8 56859 18.9 18883
## 7 United States 1999-01-01 20.9 279040 2.90 33016
## 8 Netherlands 1992-01-01 15.1 15184 36.6 19285
## 9 Canada NA NA NA NA NA
## 10 Sweden 1995-01-01 13 8827 1.96 21290
# Graph some of the organ data without really looking at the dataset
p <- ggplot(data = organdata,
mapping = aes(x = year, y = donors))
p + geom_point() # get a warning about missing data; this graph doesn't make much sense
## Warning: Removed 34 rows containing missing values (geom_point).
# Plot the organ donations by country over time
p <- ggplot(data = organdata,
mapping = aes(x = year, y = donors))
p + geom_line(mapping = aes(group = country)) +
facet_wrap(~country) # automatically orders countries alphabetically
## Warning: Removed 34 row(s) containing missing values (geom_path).
# Make boxplots by country (using the data over the years)
p <- ggplot(data = organdata,
mapping = aes(x = country, y = donors))
p + geom_boxplot() +
coord_flip() # move country names to the y-axis
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).
# Let's reorder the boxplots by mean donation rate using the reorder function
p <- ggplot(data = organdata,
mapping = aes(x = reorder(country, donors, na.rm = TRUE),
y = donors))
p + geom_boxplot() +
labs(x = NULL) + # no x-axis title
coord_flip()
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).
# Add color to the boxplots
p <- ggplot(data = organdata,
mapping = aes(x = reorder(country, donors, na.rm = TRUE),
y = donors,
fill = world))
p + geom_boxplot() +
labs(x = NULL) + # no x-axis title
coord_flip() +
theme(legend.position = "top")
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).
# Let's look at this data in point format
p <- ggplot(data = organdata,
mapping = aes(x = reorder(country, donors, na.rm = TRUE),
y = donors,
color = world))
p + geom_point() +
labs(x = NULL) +
coord_flip() +
theme(legend.position = "top")
## Warning: Removed 34 rows containing missing values (geom_point).
# Points are on top of each other, so use geom_jitter() to move them around a little
p + geom_jitter() +
labs(x = NULL) +
coord_flip() +
theme(legend.position = "top")
## Warning: Removed 34 rows containing missing values (geom_point).
# Reduce the amount of spread in the points using position_jitter()
p + geom_jitter(position = position_jitter(width = 0.15)) +
labs(x = NULL) +
coord_flip() +
theme(legend.position = "top")
## Warning: Removed 34 rows containing missing values (geom_point).
# Get information about consent laws by country
by_country <- organdata %>%
group_by(consent_law, country) %>%
summarize(donors_mean = mean(donors, na.rm = TRUE),
donors_sd = sd(donors, na.rm = TRUE),
gdp_mean = mean(gdp, na.rm = TRUE),
health_mean = mean(health, na.rm = TRUE),
roads_mean = mean(roads, na.rm = TRUE),
cerebvas_mean = mean(cerebvas, na.rm = TRUE))
by_country
## # A tibble: 17 x 8
## # Groups: consent_law [2]
## consent_law country donors_mean donors_sd gdp_mean health_mean roads_mean
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Informed Austra~ 10.6 1.14 22179. 1958. 105.
## 2 Informed Canada 14.0 0.751 23711. 2272. 109.
## 3 Informed Denmark 13.1 1.47 23722. 2054. 102.
## 4 Informed Germany 13.0 0.611 22163. 2349. 113.
## 5 Informed Ireland 19.8 2.48 20824. 1480. 118.
## 6 Informed Nether~ 13.7 1.55 23013. 1993. 76.1
## 7 Informed United~ 13.5 0.775 21359. 1561. 67.9
## 8 Informed United~ 20.0 1.33 29212. 3988. 155.
## 9 Presumed Austria 23.5 2.42 23876. 1875. 150.
## 10 Presumed Belgium 21.9 1.94 22500. 1958. 155.
## 11 Presumed Finland 18.4 1.53 21019. 1615. 93.6
## 12 Presumed France 16.8 1.60 22603. 2160. 156.
## 13 Presumed Italy 11.1 4.28 21554. 1757 122.
## 14 Presumed Norway 15.4 1.11 26448. 2217. 70.0
## 15 Presumed Spain 28.1 4.96 16933 1289. 161.
## 16 Presumed Sweden 13.1 1.75 22415. 1951. 72.3
## 17 Presumed Switze~ 14.2 1.71 27233 2776. 96.4
## # ... with 1 more variable: cerebvas_mean <dbl>
# Another way to do this in a shorter step
by_country <- organdata %>%
group_by(consent_law, country) %>%
summarize_if(is.numeric,
list(mean = mean, sd = sd), # note funs is deprecated
na.rm = TRUE) %>%
ungroup()
# Make a simple plot of our summarized data (Cleveland dot plot)
p <- ggplot(data = by_country,
mapping = aes(x = donors_mean,
y = reorder(country, donors_mean), # this puts the countries in order by donors_mean
color = consent_law))
p +
geom_point(size = 3) +
labs(x = "Donor Procurement Rate",
y = "", # another way of putting no label on an axis
color = "Consent Law") +
theme(legend.position = "top")
# Facet into two panels for the Cleveland dot plot
p <- ggplot(data = by_country,
mapping = aes(x = donors_mean,
y = reorder(country, donors_mean)))
p +
geom_point(size = 3) +
facet_wrap(~ consent_law,
scales = "free_y", # allow the y-axis labels to be different on each facet
ncol = 1) + # orient plot in a single column
labs(x = "Donor Procurement Rate",
y = "")
# Plot the dots (which represent the mean) with the range of standard deviation using geom_pointrange()
p <- ggplot(data = by_country,
mapping = aes(x = reorder(country, donors_mean),
y = donors_mean))
p +
geom_pointrange(mapping = aes(ymin = donors_mean - donors_sd,
ymax = donors_mean + donors_sd)) +
labs(x = "",
y = "Donor Procurement Rate") +
coord_flip()
# need to use coord_flip() because geom_pointrange() uses y, ymin, and ymax and we want to show this on the x-axis
geom_text() is used to plot labels on a graph. The argument hjust can be used to left or right justify the text. hjust = 0 will left-justify; hjust = 1 will right-justify.
p <- ggplot(data = by_country,
mapping = aes(x = roads_mean,
y = donors_mean))
p +
geom_point() + # plot points
geom_text(mapping = aes(label = country)) # plot the labels
# Text is right on top of the points, use hjust to move it
p +
geom_point() +
geom_text(mapping = aes(label = country),
hjust = 0)
The ggrepel package provides two geoms that are more flexible for plotting labels.
library(ggrepel)
# Switch datasets
elections_historic %>% select(2:7)
## # A tibble: 49 x 6
## year winner win_party ec_pct popular_pct popular_margin
## <int> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1824 John Quincy Adams D.-R. 0.322 0.309 -0.104
## 2 1828 Andrew Jackson Dem. 0.682 0.559 0.122
## 3 1832 Andrew Jackson Dem. 0.766 0.547 0.178
## 4 1836 Martin Van Buren Dem. 0.578 0.508 0.142
## 5 1840 William Henry Harrison Whig 0.796 0.529 0.0605
## 6 1844 James Polk Dem. 0.618 0.495 0.0145
## 7 1848 Zachary Taylor Whig 0.562 0.473 0.0479
## 8 1852 Franklin Pierce Dem. 0.858 0.508 0.0695
## 9 1856 James Buchanan Dem. 0.588 0.453 0.122
## 10 1860 Abraham Lincoln Rep. 0.594 0.396 0.101
## # ... with 39 more rows
# Set titles and labels
p_title <- "Presidential Elections: Popular & Electoral College Margins"
p_subtitle <- "1824-2016"
p_caption <- "Data for 2016 are provisional"
x_label <- "Winner's share of Popular Vote"
y_label <- "Winner's share of Electoral College Votes"
p <- ggplot(data = elections_historic,
mapping = aes(x = popular_pct,
y = ec_pct,
label = winner_label))
p +
geom_hline(yintercept = 0.5,
size = 1.4,
color = "gray80") +
geom_vline(xintercept = 0.5,
size = 1.4,
color = "gray80") +
geom_point() +
geom_text_repel() +
scale_x_continuous(labels = scales::percent) +
scale_y_continuous(labels = scales::percent) +
labs(x = x_label,
y = y_label,
title = p_title,
subtitle = p_subtitle,
caption = p_caption)
## Warning: ggrepel: 13 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
To label specific points, we need to tell the geom which points to label using the subset() function rather than giving the geom the entire dataset.
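Before using it inside a geom, it can help to see what subset() returns on its own. The sketch below uses a made-up data frame (toy_countries and its values are hypothetical) with the same column names as by_country.

```r
# Made-up summary data
toy_countries <- data.frame(country = c("A", "B", "C", "D"),
                            gdp_mean = c(30000, 21000, 26000, 18000),
                            health_mean = c(2500, 1400, 2000, 1600),
                            stringsAsFactors = FALSE)

# Keep only the rows that meet either condition
to_label <- subset(toy_countries, gdp_mean > 25000 | health_mean < 1500)
to_label$country
## [1] "A" "B" "C"
```

Only the rows kept here would receive a label when the subset is passed to the data argument of a text geom.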
p <- ggplot(data = by_country,
mapping = aes(x = gdp_mean,
y = health_mean))
p +
geom_point() +
# Only label points with mean GDP greater than 25,000
geom_text_repel(data = subset(by_country, gdp_mean > 25000),
mapping = aes(label = country))
p +
geom_point() +
# Only label points with mean GDP greater than 25,000 OR
# mean health less than 1,500 or Belgium
geom_text_repel(data = subset(by_country,
gdp_mean > 25000 |
health_mean < 1500 |
country %in% "Belgium"),
mapping = aes(label = country))
An alternative to using subset() to filter the data is to add a variable that already has the conditions for labeling to the dataset.
# Add code/indicator variable to organ data
organdata$ind <- organdata$ccode %in% c("Ita", "Spa") & organdata$year > 1998
p <- ggplot(data = organdata,
mapping = aes(x = roads,
y = donors,
color = ind))
p +
geom_point() +
geom_text_repel(data = subset(organdata, ind),
mapping = aes(label = ccode)) +
guides(label = FALSE,
color = FALSE) # removes legend
## Warning: Removed 34 rows containing missing values (geom_point).
annotate() can be used to add annotations from different geoms to plots (text, shading, etc.)
p <- ggplot(data = organdata,
mapping = aes(x = roads,
y = donors))
p +
geom_point() +
annotate(geom = "text", # a text annotation
x = 91, # x position for the annotation
y = 33, # y position for the annotation
label = "A surprisingly high \n recovery rate", # label for the annotation
hjust = 0) # left-align
## Warning: Removed 34 rows containing missing values (geom_point).
p +
geom_point() +
annotate(geom = "rect", # a rectangular annotation
xmin = 125, xmax = 155, # x position for the annotation
ymin = 30, ymax = 35, # y position for the annotation
fill = "red", alpha = 0.2) + # color/fill properties for the annotation
annotate(geom = "text", # a text annotation
x = 91, # x position for the annotation
y = 33, # y position for the annotation
label = "A surprisingly high \n recovery rate", # label for the annotation
hjust = 0)
## Warning: Removed 34 rows containing missing values (geom_point).
scale_<MAPPING>_<KIND>() functions can be used to adjust the axes and colors used in plots. The guides() function can be used to adjust the legend. The theme() function can be used to adjust the overall look of a plot.
p <- ggplot(data = organdata,
mapping = aes(x = roads,
y = donors,
color = world))
# Adjust the x and y axes
p +
geom_point() +
scale_x_log10() +
scale_y_continuous(breaks = c(5, 15, 25),
labels = c("Five", "Fifteen", "Twenty Five"))
## Warning: Removed 34 rows containing missing values (geom_point).
# Adjust the color legend
p +
geom_point() +
scale_color_discrete(labels = c("Corporatist", "Liberal",
"Social Democrat", "Unclassified")) +
labs(x = "Road Deaths",
y = "Donor Procurement",
color = "Welfare State")
## Warning: Removed 34 rows containing missing values (geom_point).
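The code above only exercises the scale functions. As a minimal sketch of the other two pieces, guides() and theme(), here is the same idea on the built-in mtcars data (a stand-in, not organdata), assuming a ggplot2 version that accepts "none" in guides():

```r
library(ggplot2)

p_demo <- ggplot(data = mtcars,
                 mapping = aes(x = wt,
                               y = mpg,
                               color = factor(cyl)))

# scale_*: relabel the legend entries for the color mapping
p_scaled <- p_demo +
  geom_point() +
  scale_color_discrete(labels = c("Four", "Six", "Eight")) +
  labs(color = "Cylinders")

# theme(): move the legend without changing what it shows
p_scaled + theme(legend.position = "top")

# guides(): drop the color legend entirely
p_scaled + guides(color = "none")
```

A rough rule of thumb: scales change how data values are translated, guides change how that translation is displayed, and theme changes parts of the plot that are not tied to the data.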
Pick at least two of the questions presented under the Where to Go Next section and answer them.
There are many different ways to represent data on a map; the designer needs to decide how fine the resolution of the representation should be, how to convey the weight of different data points, how to accurately represent spatial data, and the map type to use.
# Election data - select columns of interest and view a sample of 5 rows
election %>%
select(state, total_vote, r_points, pct_trump, party, census) %>%
sample_n(5)
## # A tibble: 5 x 6
## state total_vote r_points pct_trump party census
## <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 Florida 9502747 1.19 48.6 Republican South
## 2 Arkansas 1130635 26.9 60.6 Republican South
## 3 Wyoming 255849 46.3 68.2 Republican West
## 4 Vermont 315067 -26.4 30.3 Democratic Northeast
## 5 Idaho 690433 31.8 59.2 Republican West
# FIPS code is a unique five-digit identifier for every U.S. county
# The first two digits of a FIPS code represent the state
# Set colors for Democratic and Republican parties
party_colors <- c("#2E74C0", "#CB454A")
# Create a plot of the elections data in a faceted dot plot
p0 <- ggplot(data = subset(election, st %nin% "DC"),
mapping = aes(x = r_points,
y = reorder(state, r_points),
color = party))
p1 <- p0 +
geom_vline(xintercept = 0, color = "gray30") +
geom_point(size = 2)
p2 <- p1 +
scale_color_manual(values = party_colors)
p3 <- p2 +
scale_x_continuous(breaks = c(-30, -20, -10, 0, 10, 20, 30, 40),
labels = c("30\n(Clinton)",
"20", "10", "0", "10", "20", "30",
"40\n(Trump)"))
p3 +
facet_wrap(~ census, ncol = 1, scales = "free_y") +
guides(color = "none") +
labs(x = "Point Margin",
y = "") +
theme(axis.text = element_text(size = 8))
The maps package provides pre-drawn map data.
library(maps)
# Get data for U.S. States
us_states <- map_data("state")
head(us_states) # it provides latitude and longitude information; region is the state name
## long lat group order region subregion
## 1 -87.46201 30.38968 1 1 alabama <NA>
## 2 -87.48493 30.37249 1 2 alabama <NA>
## 3 -87.52503 30.37249 1 3 alabama <NA>
## 4 -87.53076 30.33239 1 4 alabama <NA>
## 5 -87.57087 30.32665 1 5 alabama <NA>
## 6 -87.58806 30.32665 1 6 alabama <NA>
dim(us_states) # it's a large data frame
## [1] 15537 6
# Make a map
p <- ggplot(data = us_states,
mapping = aes(x = long, # Use latitude and longitude to plot states
y = lat,
group = group))
p +
geom_polygon(fill = "white",
color = "black") # Outline of a map of U.S. states
p <- ggplot(data = us_states,
mapping = aes(x = long,
y = lat,
group = group,
fill = region)) # add color fill to the states
p +
geom_polygon(color = "gray90",
size = 0.1) +
guides(fill = "none")
Sometimes we will want to alter the map projection so that it looks more accurate. This can be done using the coord_map() function to select an alternate coordinate system (right now it is Cartesian). To use the Albers projection, we have to provide numbers for lat0 and lat1.
p <- ggplot(data = us_states,
mapping = aes(x = long,
y = lat,
group = group,
fill = region))
p +
geom_polygon(color = "gray90",
size = 0.1) +
coord_map(projection = "albers",
lat0 = 39,
lat1 = 45) +
guides(fill = "none")
Now it is time to get our data onto the map. We have to merge the two datasets - one has the elections data and one has the data to draw the states on the map. We can use the left_join() function to do this. It is important for a column of data in each dataset to match exactly so that we can put the datasets together.
# First we need to lowercase the state names and put them in a column called "region", to match the mapping data
election$region <- tolower(election$state)
# Now we can join the datasets together using the common region column
us_states_elec <- left_join(us_states, election, by = "region")
# We are now ready to plot
p <- ggplot(data = us_states_elec,
mapping = aes(x = long,
y = lat,
group = group,
fill = party))
p +
geom_polygon(color = "gray90",
size = 0.1) +
coord_map(projection = "albers",
lat0 = 39,
lat1 = 45) +
scale_fill_manual(values = party_colors) + # adjust the colors of the map
labs(title = "Election Results 2016", fill = NULL) +
theme_map() # added in setup chunk
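Because the join depends on the key column matching exactly, it is worth checking a join on a small example first. This is a sketch with made-up tables; map_toy and votes_toy are hypothetical stand-ins for us_states and election:

```r
library(dplyr)

# Hypothetical toy tables standing in for us_states and election
map_toy <- data.frame(region = c("alabama", "alabama", "ohio"),
                      long = c(-87.5, -87.6, -82.9),
                      stringsAsFactors = FALSE)
votes_toy <- data.frame(state = c("Alabama", "Ohio"),
                        party = c("Republican", "Republican"),
                        stringsAsFactors = FALSE)

# Without lowercasing, "Alabama" would not match "alabama"
votes_toy$region <- tolower(votes_toy$state)

# left_join() keeps every row of the first table (the map points)
joined <- left_join(map_toy, votes_toy, by = "region")
nrow(joined)
## [1] 3

# anti_join() returns map rows with no match, so zero rows is what we want
nrow(anti_join(map_toy, votes_toy, by = "region"))
## [1] 0
```

If anti_join() returned rows, those regions would get NA for party and show up as unfilled (gray) areas on the map.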
Mapping the data only to states is a little deceptive because there are differences in voting by county and certain areas of the country have larger populations than others.
# Put a continuous variable on the map fill
p <- ggplot(data = us_states_elec,
mapping = aes(x = long,
y = lat,
group = group,
fill = pct_trump))
p + geom_polygon(color = "gray90",
size = 0.1) +
coord_map(projection = "albers",
lat0 = 39,
lat1 = 45) +
labs(title = "Trump vote",
fill = "Percent") +
theme_map()
# Let's change the color to red and have darker red mean higher percent
p + geom_polygon(color = "gray90",
size = 0.1) +
coord_map(projection = "albers",
lat0 = 39,
lat1 = 45) +
labs(title = "Trump vote",
fill = "Percent") +
scale_fill_gradient(low = "white",
high = "#CB454A") +
theme_map()
scale_fill_gradient2() is a function that creates a diverging fill scale that runs from a low color, through a midpoint color, to a high color.
p <- ggplot(data = us_states_elec,
mapping = aes(x = long,
y = lat,
group = group,
fill = d_points))
# Create a gradient fill for point margins
p +
geom_polygon(color = "gray90",
size = 0.1) +
coord_map(projection = "albers",
lat0 = 39,
lat1 = 45) +
scale_fill_gradient2(low = "red",
mid = scales::muted("purple"),
high = "blue",
breaks = c(-25, 0, 25, 50, 75)) +
labs(title = "Winning margins",
fill = "Percent") +
theme_map()
# Remove D.C. since it is an outlier; gradient colors are enhanced
# Note: earlier we used st to remove D.C., now we're using region
p <- ggplot(data = subset(us_states_elec,
region %nin% "district of columbia"),
mapping = aes(x = long,
y = lat,
group = group,
fill = d_points))
p +
geom_polygon(color = "gray90",
size = 0.1) +
coord_map(projection = "albers",
lat0 = 39,
lat1 = 45) +
scale_fill_gradient2(low = "red",
mid = scales::muted("purple"),
high = "blue",
breaks = c(-25, 0, 25, 50, 75)) +
labs(title = "Winning margins",
fill = "Percent") +
theme_map()
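To see how the diverging scale treats its midpoint, here is a minimal sketch that skips the map entirely and fills five tiles with made-up margins. The data frame toy is hypothetical, and white is used for the mid color only to make the midpoint easy to spot:

```r
library(ggplot2)

# Hypothetical margins: negative and positive values around zero
toy <- data.frame(x = 1:5,
                  margin = c(-40, -10, 0, 10, 40))

p_div <- ggplot(data = toy,
                mapping = aes(x = x, y = 1, fill = margin)) +
  geom_tile() +
  # midpoint is 0 by default; it is written out here to make it visible
  scale_fill_gradient2(low = "red",
                       mid = "white",
                       high = "blue",
                       midpoint = 0)

p_div
```

One extreme value stretches the scale and pushes every other value toward the mid color, which is why removing D.C. makes the differences between the remaining states easier to see.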
Data can be mapped to counties as well, but it’s important to remember how unevenly the U.S. population is distributed. County-level choropleth maps of the U.S. tend to show population density more than anything else, because people are concentrated in counties in the Northeast and on the West Coast, while much of the West is sparsely populated. Note that the previous maps did not include Alaska or Hawaii. Now, we are going to add them using a county map dataset.
# Mapping dataset for U.S. counties
county_map %>%
sample_n(5) # FIPS ID is used to identify the counties
## long lat order hole piece group id
## 1 1610704.9 -1305011.0 40822 FALSE 1 0500000US13069.1 13069
## 2 -1988976.7 -134312.1 24836 FALSE 1 0500000US06023.1 06023
## 3 576577.6 -583664.6 100848 FALSE 1 0500000US29033.1 29033
## 4 1999716.2 -296186.3 140953 FALSE 1 0500000US42071.1 42071
## 5 1403501.0 -724641.1 66577 FALSE 1 0500000US21125.1 21125
# County demographic data for U.S. Counties
county_data %>%
select(id, name, state, pop_dens, pct_black) %>%
sample_n(5) # FIPS ID is used to identify the counties
## id name state pop_dens pct_black
## 1 29139 Montgomery County MO [ 10, 50) [ 0.0, 2.0)
## 2 17013 Calhoun County IL [ 10, 50) [ 0.0, 2.0)
## 3 46127 Union County SD [ 10, 50) [ 0.0, 2.0)
## 4 42087 Mifflin County PA [ 100, 500) [ 0.0, 2.0)
## 5 38063 Nelson County ND [ 0, 10) [ 0.0, 2.0)
head(county_data) # ID 0 is the entire U.S.; an ID ending in 000 (e.g., 01000) is a whole state, identified by its first two digits
## id name state census_region pop_dens pop_dens4
## 1 0 <NA> <NA> <NA> [ 50, 100) [ 45, 118)
## 2 01000 1 AL South [ 50, 100) [ 45, 118)
## 3 01001 Autauga County AL South [ 50, 100) [ 45, 118)
## 4 01003 Baldwin County AL South [ 100, 500) [118,71672]
## 5 01005 Barbour County AL South [ 10, 50) [ 17, 45)
## 6 01007 Bibb County AL South [ 10, 50) [ 17, 45)
## pop_dens6 pct_black pop female white black travel_time land_area
## 1 [ 82, 215) [10.0,15.0) 318857056 50.8 77.7 13.2 25.5 3531905.43
## 2 [ 82, 215) [25.0,50.0) 4849377 51.5 69.8 26.6 24.2 50645.33
## 3 [ 82, 215) [15.0,25.0) 55395 51.5 78.1 18.4 26.2 594.44
## 4 [ 82, 215) [ 5.0,10.0) 200111 51.2 87.3 9.5 25.9 1589.78
## 5 [ 25, 45) [25.0,50.0) 26887 46.5 50.2 47.6 24.6 884.88
## 6 [ 25, 45) [15.0,25.0) 22506 46.0 76.3 22.1 27.6 622.58
## hh_income su_gun4 su_gun6 fips votes_dem_2016 votes_gop_2016 total_votes_2016
## 1 53046 <NA> <NA> 0 NA NA NA
## 2 43253 <NA> <NA> 1000 NA NA NA
## 3 53682 [11,54] [10,12) 1001 5908 18110 24661
## 4 50221 [11,54] [10,12) 1003 18409 72780 94090
## 5 32911 [ 5, 8) [ 7, 8) 1005 4848 5431 10390
## 6 36447 [11,54] [10,12) 1007 1874 6733 8748
## per_dem_2016 per_gop_2016 diff_2016 per_dem_2012 per_gop_2012 diff_2012
## 1 NA NA NA NA NA NA
## 2 NA NA NA NA NA NA
## 3 0.2395685 0.7343579 12202 0.2657577 0.7263374 11012
## 4 0.1956531 0.7735147 54371 0.2156657 0.7738975 47443
## 5 0.4666025 0.5227141 583 0.5125229 0.4833755 334
## 6 0.2142204 0.7696616 4859 0.2621857 0.7306638 3931
## winner partywinner16 winner12 partywinner12 flipped
## 1 <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA>
## 3 Trump Republican Romney Republican No
## 4 Trump Republican Romney Republican No
## 5 Trump Republican Obama Democrat Yes
## 6 Trump Republican Romney Republican No
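The id column in both county tables is a FIPS code stored as text, which is why it keeps its leading zero, while the fips column in the output above is numeric and does not. A minimal sketch of how the pieces of a county FIPS code fit together, using the Autauga County value from the output above:

```r
# A county FIPS code is 2 state digits followed by 3 county digits
id <- "01001"               # Autauga County, AL
substr(id, 1, 2)            # state part
## [1] "01"
substr(id, 3, 5)            # county part
## [1] "001"

# The numeric fips column loses the leading zero; sprintf() can pad it back
fips <- 1001L
sprintf("%05d", fips)
## [1] "01001"

# The unpadded number is not the same key as the text id
identical(as.character(fips), id)
## [1] FALSE
```

This matters for joins: a key of "1001" would not match "01001", so both tables need the same zero-padded text form.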
# Put the mapping data and demographic data together
county_full <- left_join(county_map, county_data, by = "id")
# Plot population density by county
p <- ggplot(data = county_full,
mapping = aes(x = long, y = lat,
fill = pop_dens,
group = group))
p +
geom_polygon(color = "gray90", size = 0.05) +
coord_equal() + # relative scale of map does not change, even if plot dimensions change
scale_fill_brewer(palette = "Blues",
labels = c("0-10", "10-50", "50-100", "100-500",
"500-1,000", "1,000-5,000", ">5,000")) +
labs(fill = "Population per\nsquare mile") +
theme_map() +
guides(fill = guide_legend(nrow = 1)) +
theme(legend.position = "bottom")
# Plot percent of Black population by county
p <- ggplot(data = county_full,
mapping = aes(x = long, y = lat,
fill = pct_black,
group = group))
p +
geom_polygon(color = "gray90",
size = 0.05) +
coord_equal() +
scale_fill_brewer(palette = "Greens") +
labs(fill = "US Population, Percent Black") +
theme_map() +
guides(fill = guide_legend(nrow = 1)) +
theme(legend.position = "bottom")
Population density and the percent of the population that is Black are confounded with many other county-level variables that we might want to examine, so it is important to keep the previous two plots in mind whenever plotting county-level data. To demonstrate this issue, we will make two more plots: one of gun-related suicides and one of binned, reverse-coded population density.
# Create the color palette
# brewer.pal() produces evenly spaced color palettes
orange_pal <- RColorBrewer::brewer.pal(n = 6, name = "Oranges")
orange_pal
## [1] "#FEEDDE" "#FDD0A2" "#FDAE6B" "#FD8D3C" "#E6550D" "#A63603"
# Reverse color palette
orange_rev <- rev(orange_pal)
orange_rev
## [1] "#A63603" "#E6550D" "#FD8D3C" "#FDAE6B" "#FDD0A2" "#FEEDDE"
# Recreate a "poorly sourced but widely circulated county map of firearm-related suicide rates" (p. 186)
gun_p <- ggplot(data = county_full,
mapping = aes(x = long, y = lat,
fill = su_gun6,
group = group))
gun_p1 <- gun_p +
geom_polygon(color = "gray90",
size = 0.05) +
coord_equal()
gun_p2 <- gun_p1 +
scale_fill_manual(values = orange_pal)
gun_p2 +
labs(title = "Gun-Related Suicides 1999-2015",
fill = "Rate per 100,000 pop.") +
theme_map() +
theme(legend.position = "bottom")
# Create the reverse-coded population density map
pop_p <- ggplot(data = county_full,
mapping = aes(x = long, y = lat,
fill = pop_dens6,
group = group))
pop_p1 <- pop_p +
geom_polygon(color = "gray90", size = 0.05) +
coord_equal()
pop_p2 <- pop_p1 +
scale_fill_manual(values = orange_rev)
pop_p2 + labs(title = "Reverse-coded Population Density",
fill = "People per square mile") +
theme_map() +
theme(legend.position = "bottom")
“Small differences in reporting, combined with coarse binning and miscoding, will produce spatially misleading and substantively mistaken results. It might seem that focusing on the details of variable coding in this particular case is a little too much in the weeds for a general introduction. But it is exactly these details that can dramatically alter the appearance of any graph, and especially maps, in a way that can be hard to detect after the fact.” (p. 189)
The statebins package is an alternative way to develop U.S. maps. The syntax is slightly different than ggplot; the statebins package has been updated since the writing of the book, so the code below is different than the code in the book.
# install.packages("statebins") # note statebins has not been previously installed
library(statebins)
# The statebins package has changed since this book was written
# Continuous Data
statebins(state_data = election,
state_col = "state",
value_col = "pct_trump",
round = FALSE) +
labs(fill = "Percent Trump") +
scale_fill_gradient(low = "#FEE5D9",
high = "#A50F15") +
theme_statebins() +
theme(legend.position = "top")
statebins(state_data = subset(election, st %nin% "DC"),
state_col = "state",
value_col = "pct_clinton",
round = FALSE) +
scale_fill_gradient(low = "#EFF3FF",
high = "#08519C") +
labs(fill = "Percent Clinton") +
theme_statebins() +
theme(legend.position = "top")
# Categorical Data
ggplot(data = election,
mapping = aes(state = st,
fill = party)) +
geom_statebins() +
scale_fill_manual(values = c("royalblue", "darkred"),
labels = c("Clinton", "Trump")) +
labs(fill = "Winner") +
coord_equal() +
theme_statebins() +
theme(legend.position = "right")
# Binned Data
ggplot(data = election,
mapping = aes(state = st,
fill = cut(pct_trump, 4))) +
geom_statebins() +
scale_fill_brewer(palette = "Reds",
labels = c("4-21", "21-37", "37-53", "53-70")) +
labs(fill = "Percent Trump") +
coord_equal() +
theme_statebins() +
theme(legend.position = "top")
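The binned map above relies on cut() to turn a continuous variable into categories, and the legend labels are typed by hand to match. A minimal sketch of what cut() does, using made-up percentages (the vector pct is hypothetical):

```r
# Hypothetical percentages, roughly spanning the range of pct_trump
pct <- c(4, 30, 45, 52, 60, 70)

# cut(x, 4) splits the observed range into 4 equal-width bins
bins <- cut(pct, 4)
nlevels(bins)
## [1] 4
table(bins)   # how many values fall into each bin

# Explicit breaks and labels make the bins easier to read and to control
bins_manual <- cut(pct,
                   breaks = c(0, 25, 50, 75, 100),
                   labels = c("0-25", "25-50", "50-75", "75-100"))
as.character(bins_manual)
## [1] "0-25"  "25-50" "25-50" "50-75" "50-75" "50-75"
```

Since cut(x, 4) derives its bin edges from the data, a single extreme value shifts every bin. Setting the breaks by hand avoids that and keeps the legend labels in sync with the bins.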
Use small-multiple maps to show geographic data over time. We will also use the viridis package to get good color palettes that are vibrant on both ends.
opiates # state-level death rate from opiate-related causes 1999-2014
## # A tibble: 800 x 11
## year state fips deaths population crude adjusted adjusted_se region abbr
## <int> <chr> <int> <int> <int> <dbl> <dbl> <dbl> <ord> <chr>
## 1 1999 Alab~ 1 37 4430141 0.8 0.8 0.1 South AL
## 2 1999 Alas~ 2 27 624779 4.3 4 0.8 West AK
## 3 1999 Ariz~ 4 229 5023823 4.6 4.7 0.3 West AZ
## 4 1999 Arka~ 5 28 2651860 1.1 1.1 0.2 South AR
## 5 1999 Cali~ 6 1474 33499204 4.4 4.5 0.1 West CA
## 6 1999 Colo~ 8 164 4226018 3.9 3.7 0.3 West CO
## 7 1999 Conn~ 9 151 3386401 4.5 4.4 0.4 North~ CT
## 8 1999 Dela~ 10 32 774990 4.1 4.1 0.7 South DE
## 9 1999 Dist~ 11 28 570213 4.9 4.9 0.9 South DC
## 10 1999 Flor~ 12 402 15759421 2.6 2.6 0.1 South FL
## # ... with 790 more rows, and 1 more variable: division_name <chr>
# lower-case the state name to match the us_states data (from the maps package)
opiates$region <- tolower(opiates$state)
# join with the us_states data
opiates_map <- left_join(us_states, opiates, by = "region")
library(viridis)
p0 <- ggplot(data = subset(opiates_map, year > 1999),
mapping = aes(x = long, y = lat,
group = group,
fill = adjusted))
p1 <- p0 +
geom_polygon(color = "gray90",
size = 0.05) +
coord_map(projection = "albers",
lat0 = 39,
lat1 = 45)
p2 <- p1 +
scale_fill_viridis_c(option = "plasma")
p2 +
theme_map() +
facet_wrap(~ year, ncol = 3) +
theme(legend.position = "bottom",
strip.background = element_blank()) +
labs(title = "Opiate Related Deaths by State, 2000-2014",
fill = "Death rate per 100,000 population")
But this might not be the best way to view this data due to the issues with population, demographics, and the difficulty in comparing things spatially.
We can make line plots of the opiates data over time, which may make it easier to make comparisons and draw conclusions.
p <- ggplot(data = opiates,
mapping = aes(x = year,
y = adjusted,
group = state))
p + geom_line(color = "gray70") # But this is a little difficult to see since there are so many lines
## Warning: Removed 17 row(s) containing missing values (geom_path).
# Divide it up by census division
p0 <- ggplot(data = drop_na(opiates, division_name), # remove rows with NA for division_name to leave out D.C.
mapping = aes(x = year,
y = adjusted))
p1 <- p0 +
geom_line(color = "gray70",
mapping = aes(group = state)) # make line chart, one line per state
p2 <- p1 +
geom_smooth(mapping = aes(group = division_name),
se = FALSE) # make trend line by census division
p3 <- p2 + # label only the end of the line with the state
geom_text_repel(data = subset(opiates, year == max(year) & abbr != "DC"),
mapping = aes(x = year,
y = adjusted,
label = abbr),
size = 1.8,
segment.color = NA, # no line segment linking text to point
nudge_x = 30) + # shift the label text to the right
# shift coordinate system so there's room for the labels
coord_cartesian(c(min(opiates$year), max(opiates$year)))
p3 +
labs(x = "",
y = "Rate per 100,000 population",
title = "State-Level Opiate Death Rates by Census Division, 1999-2014") +
facet_wrap(~ reorder(division_name, -adjusted, na.rm = TRUE),
# put the divisions with highest rates first
nrow = 3)
## Warning: Removed 27 rows containing non-finite values (stat_smooth).
## Warning: Removed 17 row(s) containing missing values (geom_path).
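The facet ordering above comes from reorder(), which sorts the levels of a factor by a summary (the mean, by default) of a second variable. A minimal sketch on made-up values shows why the rate is negated and why na.rm = TRUE is needed; division and rate here are hypothetical:

```r
# Hypothetical divisions and rates; one rate is missing
division <- factor(c("A", "A", "B", "B", "C", "C"))
rate <- c(1, 2, 10, NA, 5, 6)

# reorder() sorts levels by the mean of the second argument (ascending);
# negating the rate puts the highest mean first, and na.rm = TRUE is
# passed on to mean() so the missing value is ignored
levels(reorder(division, -rate, na.rm = TRUE))
## [1] "B" "C" "A"
```

Division B has the highest mean rate (10, once the NA is ignored), so it comes first, which is the order facet_wrap() then uses for the panels.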