These are my notes for GRD 610A: Data Visualization II, Winter 2022, at the College for Creative Studies. They follow my work through the book Data Visualization by Kieran Healy (Princeton University Press, 2019).
Objects in R are created and referred to by their names. Certain names are not allowed because they are reserved words, such as TRUE, if, and NA, and it is unwise to reuse the names of existing functions, such as mean(). Names also cannot start with a number or contain spaces. There are different naming conventions.
Snake Case
my_data
this_is_snake_case
Camel Case
myData
thisIsCamelCase
Pascal Case
MyData
ThisIsPascalCase
Pick one naming convention and stick with it. Be consistent; don’t switch between conventions. I recommend snake case.
# This is a comment (it starts with #)
my_data <- c(1, 2, 3, 4) # Assign using <- ; use ALT + - or OPTION + -
My_Data
## Error in eval(expr, envir, enclos): object 'My_Data' not found
# Cannot be found because we called it my_data (lowercase)
# Now we can see it
my_data
## [1] 1 2 3 4
Think of a function as a recipe: the arguments are the ingredients, and the body of the function is the sequence of cooking steps.
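As a quick illustration (my own sketch, not code from the book), here is a tiny recipe of our own that averages a vector:
# A hypothetical function: the argument x is the ingredient
my_mean <- function(x) {
  sum(x) / length(x) # the body is the sequence of cooking steps
}
my_mean(c(1, 2, 3, 4))
## [1] 2.5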
c(1, 2, 3, 1, 3, 5, 25) # c() is the combine function; it puts things together into a vector
## [1] 1 2 3 1 3 5 25
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)
your_numbers <- c(5, 31, 71, 1, 3, 21, 6)
my_numbers
## [1] 1 2 3 1 3 5 25
mean(x = my_numbers)
## [1] 5.714286
mean(my_numbers) # you don't have to specify the argument names, but order matters if you do not specify
## [1] 5.714286
mean(x = your_numbers)
## [1] 19.71429
my_summary <- summary(my_numbers)
my_summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.500 3.000 5.714 4.000 25.000
table(my_numbers)
## my_numbers
## 1 2 3 5 25
## 2 1 2 1 1
sd(my_numbers)
## [1] 8.616153
my_numbers * 5
## [1] 5 10 15 5 15 25 125
my_numbers + 1
## [1] 2 3 4 2 4 6 26
my_numbers + my_numbers # How is this different than the last line?
## [1] 2 4 6 2 6 10 50
# If you're not sure what an object is, ask for its class or type
class(my_numbers)
## [1] "numeric"
class(my_summary)
## [1] "summaryDefault" "table"
class(summary)
## [1] "function"
my_new_vector <- c(my_numbers, "Apple") # What happens if we combine a word with numbers?
my_new_vector
## [1] "1" "2" "3" "1" "3" "5" "25" "Apple"
class(my_new_vector)
## [1] "character"
# Let's look at a new dataset
titanic
## fate sex n percent
## 1 perished male 1364 62.0
## 2 perished female 126 5.7
## 3 survived male 367 16.7
## 4 survived female 344 15.6
class(titanic)
## [1] "data.frame"
# Titanic is a data frame, which is like a table
# The $ operator can be used to access a column of a data frame by name
titanic$percent
## [1] 62.0 5.7 16.7 15.6
# Tibbles are slightly different from data frames. They are also data tables, but they print more information, such as column types.
titanic_tb <- as_tibble(titanic)
titanic_tb # How does this compare to titanic above?
## # A tibble: 4 x 4
## fate sex n percent
## <fct> <fct> <dbl> <dbl>
## 1 perished male 1364 62
## 2 perished female 126 5.7
## 3 survived male 367 16.7
## 4 survived female 344 15.6
# To see inside an object, ask for its structure
str(my_numbers)
## num [1:7] 1 2 3 1 3 5 25
str(my_summary)
## 'summaryDefault' Named num [1:6] 1 1.5 3 5.71 4 ...
## - attr(*, "names")= chr [1:6] "Min." "1st Qu." "Median" "Mean" ...
Programming in R can be challenging, and it takes time to get used to. Be patient and take a break if you get stuck. Make sure parentheses are opened and closed. Complete your commands (look out for the + in the console). Take your time and look out for typos.
In this section, we will get data from a URL and make a quick figure.
# Data source
url <- "https://cdn.rawgit.com/kjhealy/viz-organdata/master/organdonation.csv"
# Read the CSV from the URL
organs <- read_csv(file = url)
# Take a quick look at the data
glimpse(organs)
## Rows: 238
## Columns: 21
## $ country <chr> "Australia", "Australia", "Australia", "Australia", "~
## $ year <dbl> NA, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1~
## $ donors <dbl> NA, 12.09, 12.35, 12.51, 10.25, 10.18, 10.59, 10.26, ~
## $ pop <dbl> 17065, 17284, 17495, 17667, 17855, 18072, 18311, 1851~
## $ pop.dens <dbl> 0.2204433, 0.2232723, 0.2259980, 0.2282198, 0.2306484~
## $ gdp <dbl> 16774, 17171, 17914, 18883, 19849, 21079, 21923, 2296~
## $ gdp.lag <dbl> 16591, 16774, 17171, 17914, 18883, 19849, 21079, 2192~
## $ health <dbl> 1300, 1379, 1455, 1540, 1626, 1737, 1846, 1948, 2077,~
## $ health.lag <dbl> 1224, 1300, 1379, 1455, 1540, 1626, 1737, 1846, 1948,~
## $ pubhealth <dbl> 4.8, 5.4, 5.4, 5.4, 5.4, 5.5, 5.6, 5.7, 5.9, 6.1, 6.2~
## $ roads <dbl> 136.59537, 122.25179, 112.83224, 110.54508, 107.98096~
## $ cerebvas <dbl> 682, 647, 630, 611, 631, 592, 576, 525, 516, 493, 474~
## $ assault <dbl> 21, 19, 17, 18, 17, 16, 17, 17, 16, 15, 16, 15, 14, N~
## $ external <dbl> 444, 425, 406, 376, 387, 371, 395, 385, 410, 409, 393~
## $ txp.pop <dbl> 0.9375916, 0.9257116, 0.9145470, 0.9056433, 0.8961075~
## $ world <chr> "Liberal", "Liberal", "Liberal", "Liberal", "Liberal"~
## $ opt <chr> "In", "In", "In", "In", "In", "In", "In", "In", "In",~
## $ consent.law <chr> "Informed", "Informed", "Informed", "Informed", "Info~
## $ consent.practice <chr> "Informed", "Informed", "Informed", "Informed", "Info~
## $ consistent <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes~
## $ ccode <chr> "Oz", "Oz", "Oz", "Oz", "Oz", "Oz", "Oz", "Oz", "Oz",~
# View(organs) # Run in RStudio
# Another way to view data
gapminder
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ... with 1,694 more rows
# Make a plot object
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp))
# Create a scatterplot
p + geom_point()
ggplot2 is an R library/package that allows us to map data to visual elements. Using it, we can control the way the data appears in the plot and how each element of the plot is displayed. Aesthetic mappings make the connection between the data and how it is displayed on the plot (location, size, color, shape, etc.). Geoms define the type of plot (scatterplot, line plot, box plot, bar chart, etc.). Pieces of code are added together with the plus sign (+) to build the plot. More pieces can be added that define the scales, legend, labels, axes, style or theme of the plot, and so on. Each part is added with a function whose arguments specify the look of the plot; the plot is built up piece by piece.
In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
From Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10).
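A small sketch of what tidying can look like in practice (my own example, assuming the tidyverse, including tidyr, is loaded):
# Untidy: the year values are spread across column names
untidy <- tibble(country = c("A", "B"),
                 `1999` = c(10, 20),
                 `2000` = c(12, 22))
# Tidy: each variable (country, year, cases) becomes a column,
# and each observation becomes a row
untidy %>%
  pivot_longer(cols = c(`1999`, `2000`),
               names_to = "year",
               values_to = "cases")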
Build a plot layer by layer, starting with telling ggplot what data to use and how to map or link it to parts of the plot, like the x and y axes. Then add on the type of geom.
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp))
p + geom_point()
Trying different geom_ functions.
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp))
p + geom_smooth()
p + geom_point() + # add the points back into the plot
geom_smooth()
p + geom_point() +
geom_smooth(method = "lm") # use a linear model
p + geom_point() +
geom_smooth(method = "gam") + # generalized additive model
scale_x_log10() # transform x-axis to log-10 scale
p + geom_point() +
geom_smooth(method = "gam") +
scale_x_log10(labels = scales::dollar) # format x-axis in dollars
Using the aesthetics mapping, different parts of the data can be encoded in different ways.
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp,
color = "purple")) # ggplot adds the value "purple" to all rows
p + geom_point() +
geom_smooth(method = "loess") +
scale_x_log10()
# To actually turn all of the points purple, we need to set the color property of the geom_ function
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp))
p + geom_point(color = "purple") + # set point color to purple
geom_smooth(method = "loess") +
scale_x_log10()
p + geom_point(alpha = 0.3) + # make points more transparent
geom_smooth(color = "orange", # make line orange
se = FALSE, # remove standard error band
size = 8, # increase thickness of the line
method = "lm") +
scale_x_log10()
p + geom_point(alpha = 0.3) + # make points more transparent
geom_smooth(method = "gam") +
scale_x_log10(labels = scales::dollar) +
# Add title and labels
labs(x = "GDP per Capita",
y = "Life Expectancy in Years",
title = "Economic Growth and Life Expectancy",
subtitle = "Data points are country-years",
caption = "Source: Gapminder")
# Map data by continent
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp,
color = continent))
p + geom_point() +
geom_smooth(method = "loess") +
scale_x_log10()
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp,
color = continent,
fill = continent)) # now the error bands will also have the color
p + geom_point() +
geom_smooth(method = "loess") +
scale_x_log10()
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp))
p + geom_point(mapping = aes(color = continent)) + # points will be colored by continent
geom_smooth(method = "loess") + # the smoothed line will be for all data
scale_x_log10()
p + geom_point(mapping = aes(color = log(pop))) + # points will be colored by population
scale_x_log10()
Use here() to save plots in the current project directory. This function can also be used to reference folders within that directory. For this class, save plots as .svg so they are in vector format and can be embedded in Adobe Illustrator. The function to save a plot is ggsave(), which by default saves the last plot displayed and can also be given a ggplot object to save.
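A sketch of how saving might look (my own example; it assumes the here package is loaded, a figures folder exists in the project, and the svglite package is installed for SVG output):
# Save a plot object to figures/lifeexp_vs_gdp.svg
p_out <- p + geom_point() + scale_x_log10()
ggsave(filename = here("figures", "lifeexp_vs_gdp.svg"),
       plot = p_out,
       height = 8, width = 10)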
Pick at least two of the questions presented under the Where to Go Next section and answer them.
“Code almost never works properly the first time you write it.” (p. 73)
p <- ggplot(data = gapminder,
mapping = aes(x = year,
y = gdpPercap))
p + geom_line() # Something is wrong, we didn't tell it how to group
p + geom_line(aes(group = country)) # Now there is a line per country
Facet = small multiple (i.e. a separate graph for each value of the variable)
p <- ggplot(data = gapminder,
mapping = aes(x = year,
y = gdpPercap))
p + geom_line(aes(group = country)) +
facet_wrap(~continent) # make a separate plot for each continent
# Make it look a little nicer
p + geom_line(color = "gray70",
aes(group = country)) +
geom_smooth(size = 1.1, method = "loess", se = FALSE) +
scale_y_log10(labels = scales::dollar) +
facet_wrap(~continent, ncol = 5) +
labs(x = "Year",
y = "GDP per capita",
title = "GDP per capita on Five Continents")
# New dataset 2016 General Social Survey with more categorical data
glimpse(gss_sm)
## Rows: 2,867
## Columns: 32
## $ year <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016~
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
## $ ballot <labelled> 1, 2, 3, 1, 3, 2, 1, 3, 1, 3, 2, 1, 2, 3, 2, 3, 3, 2,~
## $ age <dbl> 47, 61, 72, 43, 55, 53, 50, 23, 45, 71, 33, 86, 32, 60, 76~
## $ childs <dbl> 3, 0, 2, 4, 2, 2, 2, 3, 3, 4, 5, 4, 3, 5, 7, 2, 6, 5, 0, 2~
## $ sibs <labelled> 2, 3, 3, 3, 2, 2, 2, 6, 5, 1, 4, 4, 3, 6, 0, 1, 3, 8,~
## $ degree <fct> Bachelor, High School, Bachelor, High School, Graduate, Ju~
## $ race <fct> White, White, White, White, White, White, White, Other, Bl~
## $ sex <fct> Male, Male, Male, Female, Female, Female, Male, Female, Ma~
## $ region <fct> New England, New England, New England, New England, New En~
## $ income16 <fct> $170000 or over, $50000 to 59999, $75000 to $89999, $17000~
## $ relig <fct> None, None, Catholic, Catholic, None, None, None, Catholic~
## $ marital <fct> Married, Never Married, Married, Married, Married, Married~
## $ padeg <fct> Graduate, Lt High School, High School, NA, Bachelor, NA, H~
## $ madeg <fct> High School, High School, Lt High School, High School, Hig~
## $ partyid <fct> "Independent", "Ind,near Dem", "Not Str Republican", "Not ~
## $ polviews <fct> Moderate, Liberal, Conservative, Moderate, Slightly Libera~
## $ happy <fct> Pretty Happy, Pretty Happy, Very Happy, Pretty Happy, Very~
## $ partners <fct> NA, "1 Partner", "1 Partner", NA, "1 Partner", "1 Partner"~
## $ grass <fct> NA, Legal, Not Legal, NA, Legal, Legal, NA, Not Legal, NA,~
## $ zodiac <fct> Aquarius, Scorpio, Pisces, Cancer, Scorpio, Scorpio, Capri~
## $ pres12 <labelled> 3, 1, 2, 2, 1, 1, NA, NA, NA, 2, NA, NA, 1, 1, 2, 1, ~
## $ wtssall <dbl> 0.9569935, 0.4784968, 0.9569935, 1.9139870, 1.4354903, 0.9~
## $ income_rc <fct> Gt $170000, Gt $50000, Gt $75000, Gt $170000, Gt $170000, ~
## $ agegrp <fct> Age 45-55, Age 55-65, Age 65+, Age 35-45, Age 45-55, Age 4~
## $ ageq <fct> Age 34-49, Age 49-62, Age 62+, Age 34-49, Age 49-62, Age 4~
## $ siblings <fct> 2, 3, 3, 3, 2, 2, 2, 6+, 5, 1, 4, 4, 3, 6+, 0, 1, 3, 6+, 2~
## $ kids <fct> 3, 0, 2, 4+, 2, 2, 2, 3, 3, 4+, 4+, 4+, 3, 4+, 4+, 2, 4+, ~
## $ religion <fct> None, None, Catholic, Catholic, None, None, None, Catholic~
## $ bigregion <fct> Northeast, Northeast, Northeast, Northeast, Northeast, Nor~
## $ partners_rc <fct> NA, 1, 1, NA, 1, 1, NA, 1, NA, 3, 1, NA, 1, NA, 0, 1, 0, N~
## $ obama <dbl> 0, 1, 0, 0, 1, 1, NA, NA, NA, 0, NA, NA, 1, 1, 0, 1, 0, 1,~
# Practice using facet_grid() to facet between multiple variables
p <- ggplot(data = gss_sm,
mapping = aes(x = age,
y = childs))
p + geom_point(alpha = 0.2) +
geom_smooth() +
facet_grid(sex ~ race)
## Warning: Removed 18 rows containing non-finite values (stat_smooth).
## Warning: Removed 18 rows containing missing values (geom_point).
Each geom_ function has an associated stat_ function that is used to plot the data. Sometimes this involves transforming the data in some way.
p <- ggplot(data = gss_sm,
mapping = aes(x = bigregion))
p + geom_bar() # makes a bar graph that counts the number of observations per region; count is computed for us
p + geom_bar(mapping = aes(y = ..prop..)) # the prop statistic can show us proportions
# But this is not right, each shows 100%
# So, we need to fix the automatic grouping that is occurring by region
p + geom_bar(mapping = aes(y = ..prop.., group = 1)) # using group = 1 is basically a placeholder that says all the data is in the same group
# Look at a different variable
table(gss_sm$religion)
##
## Protestant Catholic Jewish None Other
## 1371 649 51 619 159
p <- ggplot(data = gss_sm,
mapping = aes(x = religion, color = religion))
p + geom_bar() # only the outline has a color - we need to use fill
p <- ggplot(data = gss_sm,
mapping = aes(x = religion, fill = religion))
p + geom_bar()
# Remove the legend
p + geom_bar() +
guides(fill = FALSE)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
# How can we look at two variables together
p <- ggplot(data = gss_sm,
mapping = aes(x = bigregion,
fill = religion))
p + geom_bar() # Stacked bar chart of counts
p + geom_bar(position = "fill") # Stacked bar chart of proportions
p + geom_bar(position = "dodge") # Bar chart of counts side by side
p + geom_bar(position = "dodge",
mapping = aes(y = ..prop..)) # Bar chart of proportions side by side
# Not quite right - all are 100%
p + geom_bar(position = "dodge",
mapping = aes(y = ..prop..,
group = religion)) # Bar chart of proportions side by side
# The proportions sum to 1 by religion across the regions
p <- ggplot(data = gss_sm,
mapping = aes(x = religion))
p + geom_bar(position = "dodge",
mapping = aes(y = ..prop..,
group = bigregion)) +
facet_wrap(~bigregion, ncol = 2)
# Now the proportions sum to 1 by region across religions
Histograms create bins of numerical data and display the distribution of the data within those bins.
# A new dataset
glimpse(midwest)
## Rows: 437
## Columns: 28
## $ PID <int> 561, 562, 563, 564, 565, 566, 567, 568, 569, 570,~
## $ county <chr> "ADAMS", "ALEXANDER", "BOND", "BOONE", "BROWN", "~
## $ state <chr> "IL", "IL", "IL", "IL", "IL", "IL", "IL", "IL", "~
## $ area <dbl> 0.052, 0.014, 0.022, 0.017, 0.018, 0.050, 0.017, ~
## $ poptotal <int> 66090, 10626, 14991, 30806, 5836, 35688, 5322, 16~
## $ popdensity <dbl> 1270.9615, 759.0000, 681.4091, 1812.1176, 324.222~
## $ popwhite <int> 63917, 7054, 14477, 29344, 5264, 35157, 5298, 165~
## $ popblack <int> 1702, 3496, 429, 127, 547, 50, 1, 111, 16, 16559,~
## $ popamerindian <int> 98, 19, 35, 46, 14, 65, 8, 30, 8, 331, 51, 26, 17~
## $ popasian <int> 249, 48, 16, 150, 5, 195, 15, 61, 23, 8033, 89, 3~
## $ popother <int> 124, 9, 34, 1139, 6, 221, 0, 84, 6, 1596, 20, 7, ~
## $ percwhite <dbl> 96.71206, 66.38434, 96.57128, 95.25417, 90.19877,~
## $ percblack <dbl> 2.57527614, 32.90043290, 2.86171703, 0.41225735, ~
## $ percamerindan <dbl> 0.14828264, 0.17880670, 0.23347342, 0.14932156, 0~
## $ percasian <dbl> 0.37675897, 0.45172219, 0.10673071, 0.48691813, 0~
## $ percother <dbl> 0.18762294, 0.08469791, 0.22680275, 3.69733169, 0~
## $ popadults <int> 43298, 6724, 9669, 19272, 3979, 23444, 3583, 1132~
## $ perchsd <dbl> 75.10740, 59.72635, 69.33499, 75.47219, 68.86152,~
## $ percollege <dbl> 19.63139, 11.24331, 17.03382, 17.27895, 14.47600,~
## $ percprof <dbl> 4.355859, 2.870315, 4.488572, 4.197800, 3.367680,~
## $ poppovertyknown <int> 63628, 10529, 14235, 30337, 4815, 35107, 5241, 16~
## $ percpovertyknown <dbl> 96.27478, 99.08714, 94.95697, 98.47757, 82.50514,~
## $ percbelowpoverty <dbl> 13.151443, 32.244278, 12.068844, 7.209019, 13.520~
## $ percchildbelowpovert <dbl> 18.011717, 45.826514, 14.036061, 11.179536, 13.02~
## $ percadultpoverty <dbl> 11.009776, 27.385647, 10.852090, 5.536013, 11.143~
## $ percelderlypoverty <dbl> 12.443812, 25.228976, 12.697410, 6.217047, 19.200~
## $ inmetro <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0~
## $ category <chr> "AAR", "LHR", "AAR", "ALU", "AAR", "AAR", "LAR", ~
# Show distribution of the size of the counties in the Midwest
p <- ggplot(data = midwest,
mapping = aes(x = area))
p + geom_histogram() # count is computed automatically by the default stat function
p + geom_histogram(bins = 10) # set 10 bins
# Look at only two states
oh_wi <- c("OH", "WI")
p <- ggplot(data = subset(midwest, subset = state %in% oh_wi),
mapping = aes(x = percollege,
fill = state))
p + geom_histogram(alpha = 0.4, bins = 20) # Overlapping histograms
# Density estimate of the underlying distribution - density plot
p <- ggplot(data = midwest,
mapping = aes(x = area))
p + geom_density()
# Density by state
p <- ggplot(data = midwest,
mapping = aes(x = area,
fill = state,
color = state))
p + geom_density(alpha = 0.3)
# Compare to geom_line(stat = "density")
p + geom_line(stat = "density")
# Scaled density
p <- ggplot(data = subset(midwest, subset = state %in% oh_wi),
mapping = aes(x = percollege,
fill = state,
color = state))
p + geom_density(alpha = 0.3,
mapping = aes(y = ..scaled..))
### Avoiding Transformation When Necessary
Avoiding transformations - sometimes the data is already aggregated or summarized and we do not need a transformation.
titanic # this data is already summarized
## fate sex n percent
## 1 perished male 1364 62.0
## 2 perished female 126 5.7
## 3 survived male 367 16.7
## 4 survived female 344 15.6
p <- ggplot(data = titanic,
mapping = aes(x = fate,
y = percent,
fill = sex))
p + geom_bar(position = "dodge",
stat = "identity") + # plot values as provided, do not summarize/count/etc.
theme(legend.position = "top") # this puts the legend at the top of the graph
oecd_sum # another dataset that is already summarized
## # A tibble: 57 x 5
## # Groups: year [57]
## year other usa diff hi_lo
## <int> <dbl> <dbl> <dbl> <chr>
## 1 1960 68.6 69.9 1.30 Below
## 2 1961 69.2 70.4 1.20 Below
## 3 1962 68.9 70.2 1.30 Below
## 4 1963 69.1 70 0.900 Below
## 5 1964 69.5 70.3 0.800 Below
## 6 1965 69.6 70.3 0.700 Below
## 7 1966 69.9 70.3 0.400 Below
## 8 1967 70.1 70.7 0.600 Below
## 9 1968 70.1 70.4 0.300 Below
## 10 1969 70.1 70.6 0.5 Below
## # ... with 47 more rows
p <- ggplot(data = oecd_sum,
mapping = aes(x = year,
y = diff,
fill = hi_lo))
p + geom_col() + # this is the same as geom_bar with stat = "identity"
guides(fill = FALSE) + # no legend
labs(x = NULL, # no x-axis label
y = "Difference in Years",
title = "The US Life Expectancy Gap",
subtitle = "Difference between US and OECD average life expectancies, 1960-2015",
caption = "Data: OECD. After a chart by Christopher Ingraham, Washington Post, December 27th 2017")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: Removed 1 rows containing missing values (position_stack).
Pick at least two of the questions presented under the Where to Go Next section and answer them.
It is best practice to calculate the summary statistics first and then plot them, rather than relying on the stat_ functions within geom_ functions. This makes the code easier to read and understand, and it allows us to double-check the data and aggregations more easily.
The pipe operator %>% from dplyr allows us to pass data from one operation or function to another. Usually there is a sequence of steps: group, filter/select, mutate, then summarize.
Within group_by(), grouping levels (left to right) go from outermost to innermost. Functions used to create new variables (such as summarize()) are applied to the innermost grouping level first.
# Create a tibble/data table with the percent of each religion by region
rel_by_region <- gss_sm %>%
group_by(bigregion, religion) %>%
summarize(N = n()) %>%
mutate(freq = N / sum(N),
pct = round((freq * 100), 0))
# Check the percentages sum to 100 by region
rel_by_region %>%
group_by(bigregion) %>%
summarise(total = sum(pct))
## # A tibble: 4 x 2
## bigregion total
## <fct> <dbl>
## 1 Northeast 100
## 2 Midwest 101
## 3 South 100
## 4 West 101
# Make a plot (note: from this point on, Healy stops writing out the argument names)
p <- ggplot(data = rel_by_region,
mapping = aes(x = bigregion,
y = pct,
fill = religion))
p +
geom_col(position = "dodge2") +
labs(x = "Region",
y = "Percent",
fill = "Reiligion") +
theme(legend.position = "top")
# Let's rearrange it a little
p <- ggplot(data = rel_by_region,
mapping = aes(x = religion,
y = pct,
fill = religion))
p +
geom_col(position = "dodge2") +
labs(x = NULL, # don't put a label on the axis
y = "Percent",
fill = "Religion") +
guides(fill = FALSE) +
coord_flip() + # switches the x and y axes after the plot is made
facet_grid(~ bigregion)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
In this section, we will learn how to use geom_boxplot().
# New Dataset on Organ Donations by country and year
organdata %>% select(1:6) %>% sample_n(size = 10)
## # A tibble: 10 x 6
## country year donors pop pop_dens gdp
## <chr> <date> <dbl> <int> <dbl> <int>
## 1 Australia 2000-01-01 10.2 19153 0.247 26545
## 2 Finland 2002-01-01 17.1 5201 1.54 26616
## 3 Ireland NA NA 3514 5.00 12917
## 4 Italy NA NA NA NA NA
## 5 Switzerland 1995-01-01 13 7041 17.1 26304
## 6 United States 1996-01-01 20.1 269394 2.80 28772
## 7 United States 1992-01-01 17.6 256514 2.66 24411
## 8 Austria 2001-01-01 23.9 8030 9.58 28457
## 9 Sweden NA NA 8559 1.90 18660
## 10 Germany 1994-01-01 12.3 81438 22.8 20690
# Graph some of the organ data without really looking at the dataset
p <- ggplot(data = organdata,
mapping = aes(x = year, y = donors))
p + geom_point() # get a warning about missing data; this graph doesn't make much sense
## Warning: Removed 34 rows containing missing values (geom_point).
# Plot the organ donations by country over time
p <- ggplot(data = organdata,
mapping = aes(x = year, y = donors))
p + geom_line(mapping = aes(group = country)) +
facet_wrap(~country) # automatically orders countries alphabetically
## Warning: Removed 34 row(s) containing missing values (geom_path).
# Make boxplots by country (using the data over the years)
p <- ggplot(data = organdata,
mapping = aes(x = country, y = donors))
p + geom_boxplot() +
coord_flip() # move country names to the y-axis
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).
# Let's reorder the boxplots by mean donation rate using the reorder function
p <- ggplot(data = organdata,
mapping = aes(x = reorder(country, donors, na.rm = TRUE),
y = donors))
p + geom_boxplot() +
labs(x = NULL) + # no x-axis title
coord_flip()
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).
# Add color to the boxplots
p <- ggplot(data = organdata,
mapping = aes(x = reorder(country, donors, na.rm = TRUE),
y = donors,
fill = world))
p + geom_boxplot() +
labs(x = NULL) + # no x-axis title
coord_flip() +
theme(legend.position = "top")
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).
# Let's look at this data in point format
p <- ggplot(data = organdata,
mapping = aes(x = reorder(country, donors, na.rm = TRUE),
y = donors,
color = world))
p + geom_point() +
labs(x = NULL) +
coord_flip() +
theme(legend.position = "top")
## Warning: Removed 34 rows containing missing values (geom_point).
# Points are on top of each other, so use geom_jitter() to move them around a little
p + geom_jitter() +
labs(x = NULL) +
coord_flip() +
theme(legend.position = "top")
## Warning: Removed 34 rows containing missing values (geom_point).
# Reduce the amount of spread in the points using position_jitter()
p + geom_jitter(position = position_jitter(width = 0.15)) +
labs(x = NULL) +
coord_flip() +
theme(legend.position = "top")
## Warning: Removed 34 rows containing missing values (geom_point).
# Get information about consent laws by country
by_country <- organdata %>%
group_by(consent_law, country) %>%
summarize(donors_mean = mean(donors, na.rm = TRUE),
donors_sd = sd(donors, na.rm = TRUE),
gdp_mean = mean(gdp, na.rm = TRUE),
health_mean = mean(health, na.rm = TRUE),
roads_mean = mean(roads, na.rm = TRUE),
cerebvas_mean = mean(cerebvas, na.rm = TRUE))
by_country
## # A tibble: 17 x 8
## # Groups: consent_law [2]
## consent_law country donors_mean donors_sd gdp_mean health_mean roads_mean
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Informed Australia 10.6 1.14 22179. 1958. 105.
## 2 Informed Canada 14.0 0.751 23711. 2272. 109.
## 3 Informed Denmark 13.1 1.47 23722. 2054. 102.
## 4 Informed Germany 13.0 0.611 22163. 2349. 113.
## 5 Informed Ireland 19.8 2.48 20824. 1480. 118.
## 6 Informed Netherlands 13.7 1.55 23013. 1993. 76.1
## 7 Informed United Kin~ 13.5 0.775 21359. 1561. 67.9
## 8 Informed United Sta~ 20.0 1.33 29212. 3988. 155.
## 9 Presumed Austria 23.5 2.42 23876. 1875. 150.
## 10 Presumed Belgium 21.9 1.94 22500. 1958. 155.
## 11 Presumed Finland 18.4 1.53 21019. 1615. 93.6
## 12 Presumed France 16.8 1.60 22603. 2160. 156.
## 13 Presumed Italy 11.1 4.28 21554. 1757 122.
## 14 Presumed Norway 15.4 1.11 26448. 2217. 70.0
## 15 Presumed Spain 28.1 4.96 16933 1289. 161.
## 16 Presumed Sweden 13.1 1.75 22415. 1951. 72.3
## 17 Presumed Switzerland 14.2 1.71 27233 2776. 96.4
## # ... with 1 more variable: cerebvas_mean <dbl>
# Another, shorter way to do the same thing
by_country <- organdata %>%
group_by(consent_law, country) %>%
summarize_if(is.numeric,
list(mean = mean, sd = sd), # note funs is deprecated
na.rm = TRUE) %>%
ungroup()
# Make a simple plot of our summarized data (Cleveland dot plot)
p <- ggplot(data = by_country,
mapping = aes(x = donors_mean,
y = reorder(country, donors_mean), # this puts the countries in order by donors_mean
color = consent_law))
p +
geom_point(size = 3) +
labs(x = "Donor Procurement Rate",
y = "", # another way of putting no label on an axis
color = "Consent Law") +
theme(legend.position = "top")
# Facet into two panels for the Cleveland dot plot
p <- ggplot(data = by_country,
mapping = aes(x = donors_mean,
y = reorder(country, donors_mean)))
p +
geom_point(size = 3) +
facet_wrap(~ consent_law,
scales = "free_y", # allow the y-axis labels to be different on each facet
ncol = 1) + # orient the plot in a single column; the + is needed so labs() is added to the plot instead of being printed on its own
labs(x = "Donor Procurement Rate",
y = "")
# Plot the dots (which represent the mean) with the range of standard deviation using geom_pointrange()
p <- ggplot(data = by_country,
mapping = aes(x = reorder(country, donors_mean),
y = donors_mean))
p +
geom_pointrange(mapping = aes(ymin = donors_mean - donors_sd,
ymax = donors_mean + donors_sd)) +
labs(x = "",
y = "Donor Procurement Rate") +
coord_flip()
# need to use coord_flip() because geom_pointrange() uses y, ymin, and ymax and we want to show this on the x-axis
geom_text() is used to plot labels on a graph. The argument hjust can be used to left- or right-justify the text: hjust = 0 will left-justify; hjust = 1 will right-justify.
p <- ggplot(data = by_country,
mapping = aes(x = roads_mean,
y = donors_mean))
p +
geom_point() + # plot points
geom_text(mapping = aes(label = country)) # plot the labels
# Text is right on top of the points; use hjust to move it
p +
geom_point() +
geom_text(mapping = aes(label = country),
hjust = 0)
The ggrepel package provides two geoms, geom_text_repel() and geom_label_repel(), that are more flexible for plotting labels.
library(ggrepel)
# Switch datasets
elections_historic %>% select(2:7)
## # A tibble: 49 x 6
## year winner win_party ec_pct popular_pct popular_margin
## <int> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1824 John Quincy Adams D.-R. 0.322 0.309 -0.104
## 2 1828 Andrew Jackson Dem. 0.682 0.559 0.122
## 3 1832 Andrew Jackson Dem. 0.766 0.547 0.178
## 4 1836 Martin Van Buren Dem. 0.578 0.508 0.142
## 5 1840 William Henry Harrison Whig 0.796 0.529 0.0605
## 6 1844 James Polk Dem. 0.618 0.495 0.0145
## 7 1848 Zachary Taylor Whig 0.562 0.473 0.0479
## 8 1852 Franklin Pierce Dem. 0.858 0.508 0.0695
## 9 1856 James Buchanan Dem. 0.588 0.453 0.122
## 10 1860 Abraham Lincoln Rep. 0.594 0.396 0.101
## # ... with 39 more rows
# Set titles and labels
p_title <- "Presidential Elections: Popular & Electoral College Margins"
p_subtitle <- "1824-2016"
p_caption <- "Data for 2016 are provisional"
x_label <- "Winner's share of Popular Vote"
y_label <- "Winner's share of Electoral College Votes"
p <- ggplot(data = elections_historic,
mapping = aes(x = popular_pct,
y = ec_pct,
label = winner_label))
p +
geom_hline(yintercept = 0.5,
size = 1.4,
color = "gray80") +
geom_vline(xintercept = 0.5,
size = 1.4,
color = "gray80") +
geom_point() +
geom_text_repel() +
scale_x_continuous(labels = scales::percent) +
scale_y_continuous(labels = scales::percent) +
labs(x = x_label,
y = y_label,
title = p_title,
subtitle = p_subtitle,
caption = p_caption)
## Warning: ggrepel: 13 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
To label specific points, we need to tell the geom which points to label using the subset() function rather than giving the geom the entire dataset.
p <- ggplot(data = by_country,
mapping = aes(x = gdp_mean,
y = health_mean))
p +
geom_point() +
# Only label points with mean GDP greater than 25,000
geom_text_repel(data = subset(by_country, gdp_mean > 25000),
mapping = aes(label = country))
p +
geom_point() +
# Only label points with mean GDP greater than 25,000 OR
# mean health less than 1,500 or Belgium
geom_text_repel(data = subset(by_country,
gdp_mean > 25000 |
health_mean < 1500 |
country %in% "Belgium"),
mapping = aes(label = country))
An alternative to using subset() to filter the data is to add a variable to the dataset that already encodes the conditions for labeling.
# Add code/indicator variable to organ data
organdata$ind <- organdata$ccode %in% c("Ita", "Spa") & organdata$year > 1998
p <- ggplot(data = organdata,
mapping = aes(x = roads,
y = donors,
color = ind))
p +
geom_point() +
geom_text_repel(data = subset(organdata, ind),
mapping = aes(label = ccode)) +
guides(label = FALSE,
color = FALSE) # removes legend
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: Removed 34 rows containing missing values (geom_point).
annotate() can be used to add annotations from different geoms to plots (text, shading, etc.).
p <- ggplot(data = organdata,
mapping = aes(x = roads,
y = donors))
p +
geom_point() +
annotate(geom = "text", # a text annotation
x = 91, # x position for the annotation
y = 33, # y position for the annotation
label = "A surprisingly high \n recovery rate", # label for the annotation
hjust = 0) # left-align
## Warning: Removed 34 rows containing missing values (geom_point).
p +
geom_point() +
annotate(geom = "rect", # a rectangular annotation
xmin = 125, xmax = 155, # x position for the annotation
ymin = 30, ymax = 35, # y position for the annotation
fill = "red", alpha = 0.2) + # color/fill properties for the annotation
annotate(geom = "text", # a text annotation
x = 91, # x position for the annotation
y = 33, # y position for the annotation
label = "A surprisingly high \n recovery rate", # label for the annotation
hjust = 0)
## Warning: Removed 34 rows containing missing values (geom_point).
scale_<MAPPING>_<KIND>() functions can be used to adjust the axes and colors used in plots. The guides() function can be used to adjust the legend. The theme() function can be used to adjust the overall look of a plot.
p <- ggplot(data = organdata,
mapping = aes(x = roads,
y = donors,
color = world))
# Adjust the x and y axes
p +
geom_point() +
scale_x_log10() +
scale_y_continuous(breaks = c(5, 15, 25),
labels = c("Five", "Fifteen", "Twenty Five"))
## Warning: Removed 34 rows containing missing values (geom_point).
# Adjust the color legend
p +
geom_point() +
scale_color_discrete(labels = c("Corporatist", "Liberal",
"Social Democrat", "Unclassified")) +
labs(x = "Road Deaths",
y = "Donor Procurement",
color = "Welfare State")
## Warning: Removed 34 rows containing missing values (geom_point).
Pick at least two of the questions presented under the Where to Go Next section and answer them.
There are many different ways to represent data on a map; the designer needs to decide how fine the resolution of the representation should be, how to convey the weight of different data points, how to accurately represent spatial data, and the map type to use.
# Election data - select columns of interest and view a sample of 5 rows
election %>%
select(state, total_vote, r_points, pct_trump, party, census) %>%
sample_n(5)
## # A tibble: 5 x 6
## state total_vote r_points pct_trump party census
## <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 Minnesota 2945233 -1.51 44.9 Democratic Midwest
## 2 Delaware 443814 -11.4 41.7 Democratic South
## 3 Colorado 2780247 -4.91 43.2 Democratic West
## 4 Oregon 2001336 -11.0 39.1 Democratic West
## 5 Vermont 315067 -26.4 30.3 Democratic Northeast
# A FIPS code is a unique five-digit identifier for every U.S. county
# The first two digits of a FIPS code represent the state
# Set colors for Democratic and Republican parties
party_colors <- c("#2E74C0", "#CB454A")
# Create a plot of the elections data in a faceted dot plot
p0 <- ggplot(data = subset(election, st %nin% "DC"),
mapping = aes(x = r_points,
y = reorder(state, r_points),
color = party))
p1 <- p0 +
geom_vline(xintercept = 0, color = "gray30") +
geom_point(size = 2)
p2 <- p1 +
scale_color_manual(values = party_colors)
p3 <- p2 +
scale_x_continuous(breaks = c(-30, -20, -10, 0, 10, 20, 30, 40),
labels = c("30\n(Clinton)",
"20", "10", "0", "10", "20", "30",
"40\n(Trump)"))
p3 +
facet_wrap(~ census, ncol = 1, scales = "free_y") +
guides(color = FALSE) +
labs(x = "Point Margin",
y = "") +
theme(axis.text = element_text(size = 8))
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
The maps library provides pre-drawn map data.
library(maps)
# Get data for U.S. States
us_states <- map_data("state")
head(us_states) # it provides latitude and longitude information; region is the state name
## long lat group order region subregion
## 1 -87.46201 30.38968 1 1 alabama <NA>
## 2 -87.48493 30.37249 1 2 alabama <NA>
## 3 -87.52503 30.37249 1 3 alabama <NA>
## 4 -87.53076 30.33239 1 4 alabama <NA>
## 5 -87.57087 30.32665 1 5 alabama <NA>
## 6 -87.58806 30.32665 1 6 alabama <NA>
dim(us_states) # it's a large data frame
## [1] 15537 6
# Make a map
p <- ggplot(data = us_states,
mapping = aes(x = long, # Use latitude and longitude to plot states
y = lat,
group = group))
p +
geom_polygon(fill = "white",
color = "black") # Outline of a map of U.S. states
p <- ggplot(data = us_states,
mapping = aes(x = long,
y = lat,
group = group,
fill = region)) # add color fill to the states
p +
geom_polygon(color = "gray90",
size = 0.1) +
guides(fill = FALSE)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
Sometimes we will want to alter the map projection so that it looks more accurate. This can be done using the coord_map() function to select an alternate coordinate system (right now it is Cartesian). To use the Albers projection, we have to provide numbers for lat0 and lat1.
p <- ggplot(data = us_states,
mapping = aes(x = long,
y = lat,
group = group,
fill = region))
p +
geom_polygon(color = "gray90",
size = 0.1) +
coord_map(projection = "albers",
lat0 = 39,
lat1 = 45) +
guides(fill = FALSE)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
Now it is time to get our data onto the map. We have to merge the two datasets: one has the election data and one has the data to draw the states on the map. We can use the left_join() function to do this. It is important that a column in each dataset matches exactly so that we can put the datasets together.
# First we need to lowercase the state names and put them in a column called "region", to match the mapping data
election$region <- tolower(election$state)
# Now we can join the datasets together using the common region column
us_states_elec <- left_join(us_states, election)
# We are now ready to plot
p <- ggplot(data = us_states_elec,
mapping = aes(x = long,
y = lat,
group = group,
fill = party))
p +
geom_polygon(color = "gray90",
size = 0.1) +
coord_map(projection = "albers",
lat0 = 39,
lat1 = 45) +
scale_fill_manual(values = party_colors) + # adjust the colors of the map
labs(title = "Election Results 2016", fill = NULL) +
theme_map() # added in setup chunk
Mapping the data only to states is a little deceptive because there are differences in voting by county and certain areas of the country have larger populations than others.
# Put a continuous variable on the map fill
p <- ggplot(data = us_states_elec,
mapping = aes(x = long,
y = lat,
group = group,
fill = pct_trump))
p + geom_polygon(color = "gray90",
size = 0.1) +
coord_map(projection = "albers",
lat0 = 39,
lat1 = 45) +
labs(title = "Trump vote",
fill = "Percent") +
theme_map()
# Let's change the color to red and have darker red mean higher percent
p + geom_polygon(color = "gray90",
size = 0.1) +
coord_map(projection = "albers",
lat0 = 39,
lat1 = 45) +
labs(title = "Trump vote",
fill = "Percent") +
scale_fill_gradient(low = "white",
high = "#CB454A") +
theme_map()
scale_fill_gradient2() is a function that creates a diverging fill scale from a midpoint.
p <- ggplot(data = us_states_elec,
mapping = aes(x = long,
y = lat,
group = group,
fill = d_points))
# Create a gradient fill for point margins
p +
geom_polygon(color = "gray90",
size = 0.1) +
coord_map(projection = "albers",
lat0 = 39,
lat1 = 45) +
scale_fill_gradient2(low = "red",
mid = scales::muted("purple"),
high = "blue",
breaks = c(-25, 0, 25, 50, 75)) +
labs(title = "Winning margins",
fill = "Percent") +
theme_map()
# Remove D.C. since it is an outlier; gradient colors are enhanced
# Note: earlier we used st to remove D.C., now we're using region
p <- ggplot(data = subset(us_states_elec,
region %nin% "district of columbia"),
mapping = aes(x = long,
y = lat,
group = group,
fill = d_points))
p +
geom_polygon(color = "gray90",
size = 0.1) +
coord_map(projection = "albers",
lat0 = 39,
lat1 = 45) +
scale_fill_gradient2(low = "red",
mid = scales::muted("purple"),
high = "blue",
breaks = c(-25, 0, 25, 50, 75)) +
labs(title = "Winning margins",
fill = "Percent") +
theme_map()
Data can be mapped to counties as well, but it is important to remember the population distribution of the U.S. Choropleth maps of the U.S. tend to show population density more than anything else, because population is concentrated in counties in the Northeast and on the West Coast compared with the interior West. Note that the previous maps did not include Alaska or Hawaii. Now we are going to add them using a county map dataset.
# Mapping dataset for U.S. counties
county_map %>%
sample_n(5) # FIPS ID is used to identify the counties
## long lat order hole piece group id
## 1 -1001120.2 -391757.5 166172 FALSE 1 0500000US49011.1 49011
## 2 1341299.7 -1206414.3 3304 FALSE 1 0500000US01111.1 01111
## 3 -1190908.2 -2142158.7 8678 FALSE 1 0500000US02122.1 02122
## 4 1374683.9 -884353.7 153368 FALSE 1 0500000US47129.1 47129
## 5 722217.2 -155738.4 57682 FALSE 1 0500000US19005.1 19005
# County demographic data for U.S. Counties
county_data %>%
select(id, name, state, pop_dens, pct_black) %>%
sample_n(5) # FIPS ID is used to identify the counties
## id name state pop_dens pct_black
## 1 28153 Wayne County MS [ 10, 50) [25.0,50.0)
## 2 31079 Hall County NE [ 100, 500) [ 2.0, 5.0)
## 3 18007 Benton County IN [ 10, 50) [ 0.0, 2.0)
## 4 06013 Contra Costa County CA [ 1000, 5000) [ 5.0,10.0)
## 5 40073 Kingfisher County OK [ 10, 50) [ 0.0, 2.0)
head(county_data) # ID 0 is for the entire U.S.; IDs with only the first two digits filled (e.g., 01000) are state-level totals
## id name state census_region pop_dens pop_dens4
## 1 0 <NA> <NA> <NA> [ 50, 100) [ 45, 118)
## 2 01000 1 AL South [ 50, 100) [ 45, 118)
## 3 01001 Autauga County AL South [ 50, 100) [ 45, 118)
## 4 01003 Baldwin County AL South [ 100, 500) [118,71672]
## 5 01005 Barbour County AL South [ 10, 50) [ 17, 45)
## 6 01007 Bibb County AL South [ 10, 50) [ 17, 45)
## pop_dens6 pct_black pop female white black travel_time land_area
## 1 [ 82, 215) [10.0,15.0) 318857056 50.8 77.7 13.2 25.5 3531905.43
## 2 [ 82, 215) [25.0,50.0) 4849377 51.5 69.8 26.6 24.2 50645.33
## 3 [ 82, 215) [15.0,25.0) 55395 51.5 78.1 18.4 26.2 594.44
## 4 [ 82, 215) [ 5.0,10.0) 200111 51.2 87.3 9.5 25.9 1589.78
## 5 [ 25, 45) [25.0,50.0) 26887 46.5 50.2 47.6 24.6 884.88
## 6 [ 25, 45) [15.0,25.0) 22506 46.0 76.3 22.1 27.6 622.58
## hh_income su_gun4 su_gun6 fips votes_dem_2016 votes_gop_2016 total_votes_2016
## 1 53046 <NA> <NA> 0 NA NA NA
## 2 43253 <NA> <NA> 1000 NA NA NA
## 3 53682 [11,54] [10,12) 1001 5908 18110 24661
## 4 50221 [11,54] [10,12) 1003 18409 72780 94090
## 5 32911 [ 5, 8) [ 7, 8) 1005 4848 5431 10390
## 6 36447 [11,54] [10,12) 1007 1874 6733 8748
## per_dem_2016 per_gop_2016 diff_2016 per_dem_2012 per_gop_2012 diff_2012
## 1 NA NA NA NA NA NA
## 2 NA NA NA NA NA NA
## 3 0.2395685 0.7343579 12202 0.2657577 0.7263374 11012
## 4 0.1956531 0.7735147 54371 0.2156657 0.7738975 47443
## 5 0.4666025 0.5227141 583 0.5125229 0.4833755 334
## 6 0.2142204 0.7696616 4859 0.2621857 0.7306638 3931
## winner partywinner16 winner12 partywinner12 flipped
## 1 <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA>
## 3 Trump Republican Romney Republican No
## 4 Trump Republican Romney Republican No
## 5 Trump Republican Obama Democrat Yes
## 6 Trump Republican Romney Republican No
# Put the mapping data and demographic data together
county_full <- left_join(county_map, county_data, by = "id")
# Plot population density by county
p <- ggplot(data = county_full,
mapping = aes(x = long, y = lat,
fill = pop_dens,
group = group))
p +
geom_polygon(color = "gray90", size = 0.05) +
coord_equal() + # relative scale of map does not change, even if plot dimensions change
scale_fill_brewer(palette = "Blues",
labels = c("0-10", "10-50", "50-100", "100-500",
"500-1,000", "1,000-5,000", ">5,000")) +
labs(fill = "Population per\nsquare mile") +
theme_map() +
guides(fill = guide_legend(nrow = 1)) +
theme(legend.position = "bottom")
# Plot percent of Black population by county
p <- ggplot(data = county_full,
mapping = aes(x = long, y = lat,
fill = pct_black,
group = group))
p +
geom_polygon(color = "gray90",
size = 0.05) +
coord_equal() +
scale_fill_brewer(palette = "Greens") +
labs(fill = "US Population, Percent Black") +
theme_map() +
guides(fill = guide_legend(nrow = 1)) +
theme(legend.position = "bottom")
The population density and the percent of the population that is Black are confounded with many other county-level variables we might want to examine, so it is important to keep the previous two plots in mind whenever plotting county-level data. To demonstrate this issue, we will make two more plots: one on gun-related suicides and one on binned population density.
# Create the color palette
# brewer.pal() produces evenly spaced color palettes
orange_pal <- RColorBrewer::brewer.pal(n = 6, name = "Oranges")
orange_pal
## [1] "#FEEDDE" "#FDD0A2" "#FDAE6B" "#FD8D3C" "#E6550D" "#A63603"
# Reverse color palette
orange_rev <- rev(orange_pal)
orange_rev
## [1] "#A63603" "#E6550D" "#FD8D3C" "#FDAE6B" "#FDD0A2" "#FEEDDE"
# Recreate a "poorly sourced by widely circulated county map of firearm-related suicide rates" (p. 186)
gun_p <- ggplot(data = county_full,
mapping = aes(x = long, y = lat,
fill = su_gun6,
group = group))
gun_p1 <- gun_p +
geom_polygon(color = "gray90",
size = 0.05) +
coord_equal()
gun_p2 <- gun_p1 +
scale_fill_manual(values = orange_pal)
gun_p2 +
labs(title = "Gun-Related Suicides 1999-2015",
fill = "Rate per 100,000pop.") +
theme_map() +
theme(legend.position = "bottom")
# Create the reverse-coded population density map
pop_p <- ggplot(data = county_full,
mapping = aes(x = long, y = lat,
fill = pop_dens6,
group = group))
pop_p1 <- pop_p +
geom_polygon(color = "gray90", size = 0.05) +
coord_equal()
pop_p2 <- pop_p1 +
scale_fill_manual(values = orange_rev)
pop_p2 + labs(title = "Reverse-coded Population Density",
fill = "People per square mile") +
theme_map() +
theme(legend.position = "bottom")
“Small differences in reporting, combined with coarse binning and miscoding, will produce spatially misleading and substantively mistaken results. It might seem that focusing on the details of variable coding in this particular case is a little too much in the weeds for a general introduction. But it is exactly these details that can dramatically alter the appearance of any graph, and especially maps, in a way that can be hard to detect after the fact.” (p. 189)
The statebins package is an alternative way to make U.S. maps. Its syntax is slightly different from ggplot's; the statebins package has been updated since the writing of the book, so the code below differs from the code in the book.
# install.packages("statebins") # note statebins has not been previously installed
library(statebins)
# The statebins package has changed since this book was written
# Continuous Data
statebins(state_data = election,
state_col = "state",
value_col = "pct_trump",
round = FALSE) +
labs(fill = "Percent Trump") +
scale_fill_gradient(low = "#FEE5D9",
high = "#A50F15") +
theme_statebins() +
theme(legend.position = "top")
statebins(state_data = subset(election, st %nin% "DC"),
state_col = "state",
value_col = "pct_clinton",
round = FALSE) +
scale_fill_gradient(low = "#EFF3FF",
high = "#08519C") +
labs(fill = "Percent Clinton") +
theme_statebins() +
theme(legend.position = "top")
# Categorical Data
ggplot(data = election,
mapping = aes(state = st,
fill = party)) +
geom_statebins() +
scale_fill_manual(values = c("royalblue", "darkred"),
labels = c("Clinton", "Trump")) +
labs(fill = "Winner") +
coord_equal() +
theme_statebins() +
theme(legend.position = "right")
# Binned Data
ggplot(data = election,
mapping = aes(state = st,
fill = cut(pct_trump, 4))) +
geom_statebins() +
scale_fill_brewer(palette = "Reds",
labels = c("4-21", "21-37", "37-53", "53-70")) +
labs(fill = "Percent Trump") +
coord_equal() +
theme_statebins() +
theme(legend.position = "top")
Use small-multiple maps to show geographic data over time. We will also use the viridis package, which provides perceptually uniform color palettes that stay vivid across their full range.
opiates # state-level death rate from opiate-related causes, 1999-2014
## # A tibble: 800 x 11
## year state fips deaths population crude adjusted adjusted_se region abbr
## <int> <chr> <int> <int> <int> <dbl> <dbl> <dbl> <ord> <chr>
## 1 1999 Alabama 1 37 4430141 0.8 0.8 0.1 South AL
## 2 1999 Alaska 2 27 624779 4.3 4 0.8 West AK
## 3 1999 Arizona 4 229 5023823 4.6 4.7 0.3 West AZ
## 4 1999 Arkans~ 5 28 2651860 1.1 1.1 0.2 South AR
## 5 1999 Califo~ 6 1474 33499204 4.4 4.5 0.1 West CA
## 6 1999 Colora~ 8 164 4226018 3.9 3.7 0.3 West CO
## 7 1999 Connec~ 9 151 3386401 4.5 4.4 0.4 North~ CT
## 8 1999 Delawa~ 10 32 774990 4.1 4.1 0.7 South DE
## 9 1999 Distri~ 11 28 570213 4.9 4.9 0.9 South DC
## 10 1999 Florida 12 402 15759421 2.6 2.6 0.1 South FL
## # ... with 790 more rows, and 1 more variable: division_name <chr>
# lower-case the state name to match the us_states data (from the maps package)
opiates$region <- tolower(opiates$state)
# join with the us_states data
opiates_map <- left_join(us_states, opiates)
library(viridis)
p0 <- ggplot(data = subset(opiates_map, year > 1999),
mapping = aes(x = long, y = lat,
group = group,
fill = adjusted))
p1 <- p0 +
geom_polygon(color = "gray90",
size = 0.05) +
coord_map(projection = "albers",
lat0 = 39,
lat1 = 45)
p2 <- p1 +
scale_fill_viridis_c(option = "plasma")
p2 +
theme_map() +
facet_wrap(~ year, ncol = 3) +
theme(legend.position = "bottom",
strip.background = element_blank()) +
labs(title = "Opiate Related Deaths by State, 2000-2014",
fill = "Death rate per 100,000 population")
But this might not be the best way to view this data due to the issues with population, demographics, and the difficulty in comparing things spatially.
We can make line plots of the opiates data over time, which may make it easier to make comparisons and draw conclusions.
p <- ggplot(data = opiates,
mapping = aes(x = year,
y = adjusted,
group = state))
p + geom_line(color = "gray70") # But this is a little difficult to see since there are so many lines
## Warning: Removed 17 row(s) containing missing values (geom_path).
# Divide it up by census division
p0 <- ggplot(data = drop_na(opiates, division_name), # remove rows with NA for division_name to leave out D.C.
mapping = aes(x = year,
y = adjusted))
p1 <- p0 +
geom_line(color = "gray70",
mapping = aes(group = state)) # make line chart, one line per state
p2 <- p1 +
geom_smooth(mapping = aes(group = division_name),
se = FALSE) # make trend line by census division
p3 <- p2 + # label only the end of the line with the state
geom_text_repel(data = subset(opiates, year == max(year) & abbr != "DC"),
mapping = aes(x = year,
y = adjusted,
label = abbr),
size = 1.8,
segment.color = NA, # no line segment linking text to point
nudge_x = 30) + # shift the label text to the right
# shift coordinate system so there's room for the labels
coord_cartesian(c(min(opiates$year), max(opiates$year)))
p3 +
labs(x = "",
y = "Rate per 100,000 population",
title = "State-Level Opiate Death Rates by Census Division, 1999-2014") +
facet_wrap(~ reorder(division_name, -adjusted, na.rm = TRUE),
# put the divisions with highest rates first
nrow = 3)
## Warning: Removed 27 rows containing non-finite values (stat_smooth).
## Warning: Removed 17 row(s) containing missing values (geom_path).
Pick at least two of the questions presented under the Where to Go Next section and answer them.
ggplot allows us to make many customizations and refinements to our plots once we have finalized the data we want to show. This could include annotations, highlights, theme changes, and different colors.
# New dataset - membership and income data by section of the American Sociological Association
head(asasec)
## Section Sname Beginning Revenues
## 1 Aging and the Life Course (018) Aging 12752 12104
## 2 Alcohol, Drugs and Tobacco (030) Alcohol/Drugs 11933 1144
## 3 Altruism and Social Solidarity (047) Altruism 1139 1862
## 4 Animals and Society (042) Animals 473 820
## 5 Asia/Asian America (024) Asia 9056 2116
## 6 Body and Embodiment (048) Body 3408 1618
## Expenses Ending Journal Year Members
## 1 12007 12849 No 2005 598
## 2 400 12677 No 2005 301
## 3 1875 1126 No 2005 NA
## 4 1116 177 No 2005 209
## 5 1710 9462 No 2005 365
## 6 1920 3106 No 2005 NA
# Plot membership vs. revenue for 2014
p <- ggplot(data = subset(asasec, Year == 2014),
mapping = aes(x = Members,
y = Revenues,
label = Sname))
# Use a linear function to estimate relationship
# Label the outliers
# Add titles and axis labels
# Make y-axis labels currency
# Move legend to the bottom
p4 <- p +
geom_point(mapping = aes(color = Journal)) +
geom_smooth(method = "lm", se = FALSE, color = "gray80") +
geom_text_repel(data = subset(asasec, Year == 2014 & Revenues > 7000),
size = 2) +
labs(x = "Membership",
y = "Revenues",
color = "Section has own Journal",
title = "ASA Sections",
subtitle = "2014 Calendar Year.",
caption = "Source: ASA annual report") +
scale_y_continuous(labels = scales::dollar) +
theme(legend.position = "bottom")
p4
When deciding on a color palette, think about whether the variable values are unordered or ordered, and categorical or numeric. RColorBrewer offers many types of palettes and can be accessed within ggplot using scale_color_brewer() and scale_fill_brewer(). Color palettes can also be set manually by providing hex codes to scale_color_manual() and scale_fill_manual(). R also knows some color names; these can be viewed using demo('colors').
# Test different color palettes with the organdata
p <- ggplot(data = organdata,
mapping = aes(x = roads,
y = donors,
color = world))
# Set2
p + geom_point(size = 2) +
scale_color_brewer(palette = "Set2") +
theme(legend.position = "bottom")
## Warning: Removed 46 rows containing missing values (geom_point).
# Pastel2
p + geom_point(size = 2) +
scale_color_brewer(palette = "Pastel2") +
theme(legend.position = "bottom")
## Warning: Removed 46 rows containing missing values (geom_point).
# Dark2
p + geom_point(size = 2) +
scale_color_brewer(palette = "Dark2") +
theme(legend.position = "bottom")
## Warning: Removed 46 rows containing missing values (geom_point).
# Use a manual palette
# Set the color values
cb_palette <- c("#999999", "#E69F00", "#56B4E9", "#009E73",
"#F0E442", "#0072B2", "#D55E00", "#CC79A7")
# Use scale_color_manual() to set colors to the color blind palette
p4 +
scale_color_manual(values = cb_palette)
# Note: only the first two colors are used because there are only two values in the color variable
# The dichromat package can be used to transform colors into color-blind safe colors
library(dichromat) # note: need to install
# Get list of colors - 5 colors from Set2
default <- RColorBrewer::brewer.pal(5, "Set2")
# List types of color-blindness
types <- c("deutan", "protan", "tritan")
names(types) <- c("Deuteronopia", "Protanopia", "Tritanopia")
# Create a table of colors that will transform the Set2 colors into what is seen under different color-blindness
color_table <- types %>%
purrr::map(~ dichromat(default, .x)) %>%
as_tibble() %>%
add_column(default, .before = TRUE)
color_table
## # A tibble: 5 x 4
## default Deuteronopia Protanopia Tritanopia
## <chr> <chr> <chr> <chr>
## 1 #66C2A5 #AEAEA7 #BABAA5 #82BDBD
## 2 #FC8D62 #B6B661 #9E9E63 #F29494
## 3 #8DA0CB #9C9CCB #9E9ECB #92ABAB
## 4 #E78AC3 #ACACC1 #9898C3 #DA9C9C
## 5 #A6D854 #CACA5E #D3D355 #B6C8C8
# view the colors
color_comp(color_table)
We can use color and text to highlight certain parts of a plot. county_data contains data on the 2016 U.S. general election.
# Democrat Blue and Republican Red - create color palette
party_colors <- c("#2E74C0", "#CB454A")
# Initial plot - points for each county that did not flip
# x-axis is the population
# y-axis is the percent of the population that is Black
p0 <- ggplot(data = subset(county_data,
flipped == "No"),
mapping = aes(x = pop,
y = black / 100))
p1 <- p0 +
geom_point(alpha = 0.15,
color = "gray50") +
scale_x_log10(labels = scales::comma) # log-scale for population
p1
# Add a second layer of points for the counties that did flip
# color by party that won
p2 <- p1 +
geom_point(data = subset(county_data,
flipped == "Yes"),
mapping = aes(x = pop,
y = black / 100,
color = partywinner16)) +
scale_color_manual(values = party_colors)
p2
# Set the y-axis scale and labels
p3 <- p2 +
scale_y_continuous(labels = scales::percent) +
labs(color = "County flipped to...",
x = "County Population (log scale)",
y = "Percent Black Population",
title = "Flipped counties, 2016",
caption = "Counties in gray did not flip.")
p3
# Label counties with high percentage of Black residents that flipped
p4 <- p3 +
geom_text_repel(data = subset(county_data,
flipped == "Yes" & black > 25),
mapping = aes(x = pop,
y = black / 100,
label = state),
size = 2)
# Add a minimal theme and put the legend at the top
p4 + theme_minimal() +
theme(legend.position = "top")
The theme of a plot controls its overall appearance. Calling theme_set() with a theme function as its argument sets the theme for every plot in the session, while adding a theme function to an individual plot with + applies it to that plot only. The theme() function fine-tunes specific aspects of a plot (like the legend position). ggplot comes with several built-in themes, the ggthemes package contains more, and additional pre-made themes are available in the cowplot and hrbrthemes packages. We can also set theme elements manually within theme() to build our own custom themes.
# theme_bw()
theme_set(theme_bw())
p4 + theme(legend.position = "top")
# theme_dark
theme_set(theme_dark())
p4 + theme(legend.position = "top")
library(ggthemes)
# Economist Theme
theme_set(theme_economist())
p4 + theme(legend.position="top")
# Wall Street Journal Theme
theme_set(theme_wsj())
# Text sizes are quite large on this theme, so set them in the theme() function to be smaller
p4 + theme(plot.title = element_text(size = rel(0.6)),
legend.title = element_text(size = rel(0.35)),
plot.caption = element_text(size = rel(0.35)),
legend.position = "top")
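The cowplot and hrbrthemes packages mentioned above work the same way. A minimal sketch, assuming both packages are installed:
library(cowplot)    # note: need to install
library(hrbrthemes) # note: need to install
p4 + theme_cowplot() + theme(legend.position = "top")
p4 + theme_ipsum() + theme(legend.position = "top") # may warn if its fonts are not installed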
# Clear out theme setting
theme_set(theme())
# Make a custom theme
# Warnings - font family not found in Windows font database
# Needed to update the family names
# Added hjust to left-align title
p4 + theme(legend.position = "top",
plot.title = element_text(size = rel(2),
lineheight = .5,
family = "serif",
face = "bold.italic",
color = "orange",
hjust = 0),
axis.text.x = element_text(size = rel(1.1),
family = "mono",
face = "bold",
color = "purple"),
panel.background = element_blank(),
panel.grid.major = element_line(color = "gray90"),
legend.key = element_blank())
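A set of theme tweaks like this can also be saved as an object and reused. A minimal sketch (the settings here are arbitrary):
# Save theme tweaks as a reusable object
my_theme <- theme_minimal() +
  theme(legend.position = "top",
        plot.title = element_text(face = "bold"),
        panel.grid.minor = element_blank())
p4 + my_theme # apply to one plot
# theme_set(my_theme) # or set it for the rest of the session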
The theme() function in ggplot can be used to fix design elements and create custom figures.
# Dataset: age of each GSS respondent for all years since 1972
# Aggregation - mean age of respondents for years of interest
table(gss_lon$year)
##
## 1972 1973 1974 1975 1976 1977 1978 1980 1982 1983 1984 1985 1986 1987 1988 1989
## 1613 1504 1484 1490 1499 1530 1532 1468 1860 1599 1473 1534 1470 1819 1481 1537
## 1990 1991 1993 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012 2014 2016
## 1372 1517 1606 2992 2904 2832 2817 2765 2812 4510 2023 2044 1974 2538 2867
yrs <- c(seq(1972, 1988, 4), # pick every fourth survey year, 1972 - 1988
1993, # plus the 1993 survey
seq(1996, 2016, 4) # every fourth year, 1996 - 2016
)
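# yrs is 1972, 1976, 1980, 1984, 1988, 1993, 1996, 2000, 2004, 2008, 2012, 2016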
# Compute mean age
mean_age <- gss_lon %>%
filter(age %nin% NA & year %in% yrs) %>% # use element-wise & (not &&) inside filter()
group_by(year) %>%
summarise(xbar = round(mean(age, na.rm = TRUE), 0))
# Add y-axis coordinate for text label
mean_age$y <- 0.3
# Create location and label for years
yr_labs <- data.frame(x = 85,
y = 0.8,
year = yrs)
# Make plot
p <- ggplot(data = subset(gss_lon, year %in% yrs),
mapping = aes(x = age))
p1 <- p +
geom_density(fill = "gray20",
color = FALSE,
alpha = 0.9,
mapping = aes(y = ..scaled..)) + # ..scaled.. rescales each density to a maximum of 1
geom_vline(data = subset(mean_age, year %in% yrs),
mapping = aes(xintercept = xbar),
color = "white",
size = 0.5) +
geom_text(data = subset(mean_age, year %in% yrs),
mapping = aes(x = xbar, y = y, label = xbar),
nudge_x = 7.5,
color = "white",
size = 3.5,
hjust = 1) +
geom_text(data = subset(yr_labs, year %in% yrs),
mapping = aes(x = x, y = y, label = year)) +
facet_grid(year ~ .,
switch = "y") # put the year labels for the facets on the left
p1
## Warning: Removed 83 rows containing non-finite values (stat_density).
# Add theme
# Remove theme_book()
# p1 + theme_book(base_size = 10,
# plot_title_size = 10,
# strip_text_size = 32,
# panel_spacing = unit(0.1, "lines")) +
p1 +
theme_minimal() +
theme(plot.title = element_text(size = 16),
axis.text.x= element_text(size = 12),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
strip.background = element_blank(),
strip.text.y = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
labs(x = "Age",
y = NULL,
title = "Age Distribution of\nGSS Respondents")
## Warning: Removed 83 rows containing non-finite values (stat_density).
# Make this plot with ggridges
library(ggridges)
# Plot all years; ordered by year (factor())
p <- ggplot(data = gss_lon,
mapping = aes(x = age,
y = factor(year,
levels = rev(unique(year)),
ordered = TRUE)))
p + geom_density_ridges(alpha = 0.6,
fill = "lightblue",
scale = 1.5) +
scale_x_continuous(breaks = c(25, 50, 75)) +
scale_y_discrete(expand = c(0.01, 0)) +
labs(x = "Age",
y = NULL,
title = "Age Distribution of\nGSS Respondents") +
theme_ridges() +
theme(title = element_text(size = 16, face = "bold"))
## Warning: Removed 221 rows containing non-finite values (stat_density_ridges).
Will be covered in Week 11
Pick at least two of the questions presented under the Where to Go Next section and answer them.
This section is not in the book; it is taken from Time Based Heatmaps in R. geom_tile() can be used to create heat maps.
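As a minimal sketch of the mechanics (toy data, not from the tutorial), geom_tile() draws one rectangle per x/y combination and fills it according to a third variable:
# Toy example: one tile per x/y pair, filled by value
toy <- expand.grid(x = 1:4, y = c("a", "b", "c"))
toy$value <- seq_len(nrow(toy))
ggplot(data = toy,
       mapping = aes(x = x, y = y, fill = value)) +
  geom_tile(color = "white")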
# Data - 911 calls in Seattle
incidents <- read_csv("https://raw.githubusercontent.com/lgellis/MiscTutorial/master/ggmap/i2Sample.csv")
# Assign color variables
col1 <- "#d8e1cf"
col2 <- "#438484"
# Peek at the data set
head(incidents)
## # A tibble: 6 x 20
## ...1 CAD.CDW.ID CAD.Event.Number General.Offense.Number Event.Clearance.Co~
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 450024 813558 12000168676 2012168676 064
## 2 1028516 1002946 12000438147 2012438147 100
## 3 1319344 1036985 15000080173 201580173 161
## 4 1334148 1053917 15000105067 2015105067 100
## 5 32080 48491 10000294206 2010294206 281
## 6 326478 681155 11000387576 2011387576 250
## # ... with 15 more variables: Event.Clearance.Description <chr>,
## # Event.Clearance.SubGroup <chr>, Event.Clearance.Group <chr>,
## # Event.Clearance.Date <chr>, Hundred.Block.Location <chr>,
## # District.Sector <chr>, Zone.Beat <chr>, Census.Tract <chr>,
## # Longitude <dbl>, Latitude <dbl>, Incident.Location <chr>,
## # Initial.Type.Description <chr>, Initial.Type.Subgroup <chr>,
## # Initial.Type.Group <chr>, At.Scene.Time <chr>
str(incidents)
## spec_tbl_df [50,000 x 20] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ...1 : num [1:50000] 450024 1028516 1319344 1334148 32080 ...
## $ CAD.CDW.ID : num [1:50000] 813558 1002946 1036985 1053917 48491 ...
## $ CAD.Event.Number : num [1:50000] 1.2e+10 1.2e+10 1.5e+10 1.5e+10 1.0e+10 ...
## $ General.Offense.Number : num [1:50000] 2.01e+09 2.01e+09 2.02e+08 2.02e+09 2.01e+09 ...
## $ Event.Clearance.Code : chr [1:50000] "064" "100" "161" "100" ...
## $ Event.Clearance.Description: chr [1:50000] "SHOPLIFT" "FRAUD (INCLUDING IDENTITY THEFT)" "TRESPASS" "FRAUD (INCLUDING IDENTITY THEFT)" ...
## $ Event.Clearance.SubGroup : chr [1:50000] "THEFT" "FRAUD CALLS" "TRESPASS" "FRAUD CALLS" ...
## $ Event.Clearance.Group : chr [1:50000] "SHOPLIFTING" "FRAUD CALLS" "TRESPASS" "FRAUD CALLS" ...
## $ Event.Clearance.Date : chr [1:50000] "05/31/2012 06:00:00 PM" "12/24/2012 11:14:00 AM" "03/11/2015 12:45:00 PM" "03/31/2015 04:56:00 PM" ...
## $ Hundred.Block.Location : chr [1:50000] "39XX BLOCK OF S OTHELLO ST" "27XX BLOCK OF ALKI AVE SW" "6XX BLOCK OF NW MARKET ST" "77XX BLOCK OF RAINIER AV S" ...
## $ District.Sector : chr [1:50000] "S" "W" "B" "S" ...
## $ Zone.Beat : chr [1:50000] "S1" "W1" "B2" "S2" ...
## $ Census.Tract : chr [1:50000] "11000.1011" "9701.2000" "4800.4000" "11102.4008" ...
## $ Longitude : num [1:50000] -122 -122 -122 -122 -122 ...
## $ Latitude : num [1:50000] 47.5 47.6 47.7 47.5 47.6 ...
## $ Incident.Location : chr [1:50000] "(47.537044021, -122.282344886)" "(47.579317217, -122.409989598)" "(47.668651602, -122.364558421)" "(47.533143434, -122.269986901)" ...
## $ Initial.Type.Description : chr [1:50000] NA "TRU - FORGERY/CHKS/BUNCO/SCAMS/ID THEFT" "BURG - RES (INCL UNOCC STRUCTURES ON PROP)" "FRAUD - FORGERY,BUNCO, SCAMS, ID THEFT, ETC" ...
## $ Initial.Type.Subgroup : chr [1:50000] NA "FRAUD CALLS" "BURGLARY" "FRAUD CALLS" ...
## $ Initial.Type.Group : chr [1:50000] NA "FRAUD CALLS" "RESIDENTIAL BURGLARIES" "FRAUD CALLS" ...
## $ At.Scene.Time : chr [1:50000] NA "12/24/2012 10:33:00 AM" NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. ...1 = col_double(),
## .. CAD.CDW.ID = col_double(),
## .. CAD.Event.Number = col_double(),
## .. General.Offense.Number = col_double(),
## .. Event.Clearance.Code = col_character(),
## .. Event.Clearance.Description = col_character(),
## .. Event.Clearance.SubGroup = col_character(),
## .. Event.Clearance.Group = col_character(),
## .. Event.Clearance.Date = col_character(),
## .. Hundred.Block.Location = col_character(),
## .. District.Sector = col_character(),
## .. Zone.Beat = col_character(),
## .. Census.Tract = col_character(),
## .. Longitude = col_double(),
## .. Latitude = col_double(),
## .. Incident.Location = col_character(),
## .. Initial.Type.Description = col_character(),
## .. Initial.Type.Subgroup = col_character(),
## .. Initial.Type.Group = col_character(),
## .. At.Scene.Time = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
# Prepare data - get month/year/day/etc. using lubridate package
incidents$ymd <- lubridate::mdy_hms(incidents$Event.Clearance.Date)
incidents$month <- lubridate::month(incidents$ymd, label = TRUE)
incidents$year <- lubridate::year(incidents$ymd)
incidents$wday <- lubridate::wday(incidents$ymd, label = TRUE)
incidents$hour <- lubridate::hour(incidents$ymd)
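A quick sanity check of the parsing (the string is copied from the first row shown above; the expected result is my note, not output from the tutorial):
# Spot-check: mdy_hms() should parse the month/day/year + AM/PM format
lubridate::mdy_hms("05/31/2012 06:00:00 PM") # expect 2012-05-31 18:00:00 UTC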
# Create a summary table that has the number of incidents by hour and day
dayHour <- incidents %>%
group_by(hour, wday) %>%
summarise(N = n())
# Order by weekday in reverse
dayHour$wday <- factor(dayHour$wday, levels = rev(levels(dayHour$wday)))
# Remove NAs
dayHour <- dayHour %>% na.omit()
# Use geom_tile() to create a heat map
ggplot(data = dayHour,
mapping = aes(x = hour,
y = wday)) +
geom_tile(aes(fill = N),
colour = "white") + # white outline
scale_fill_gradient(low = col1,
high = col2) +
theme_minimal() +
labs(title = "Heat Map of Seattle Incidents by Day of Week and Hour",
x = "Incidents Per Hour",
y = "Day of Week",
fill = "Total Incidents") +
# Remove the gridlines
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
# Create summary table for the number of incidents by group and hour
groupSummary <- incidents %>%
group_by(Event.Clearance.Group, hour) %>%
summarise(N = n()) %>%
filter(!is.na(Event.Clearance.Group)) # remove NA event
# Make heat map
ggplot(data = groupSummary,
mapping = aes(x = hour,
y = Event.Clearance.Group)) +
geom_tile(aes(fill = N),
colour = "white") +
scale_fill_gradient(low = col1,
high = col2) +
labs(title = "Heat Map of Seattle Incidents by Event and Hour",
x = "Hour",
y = "Event",
fill = "Total Incidents") +
theme_minimal() +
# Remove gridlines
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
## Warning: Removed 4 rows containing missing values (geom_tile).
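The same pattern extends to the other time fields created above. A sketch (not from the tutorial) using the month column:
# Sketch: incidents by month and hour, same geom_tile() pattern
monthHour <- incidents %>%
  group_by(month, hour) %>%
  summarise(N = n()) %>%
  na.omit()
ggplot(data = monthHour,
       mapping = aes(x = hour,
                     y = month)) +
  geom_tile(aes(fill = N),
            colour = "white") +
  scale_fill_gradient(low = col1,
                      high = col2) +
  labs(title = "Heat Map of Seattle Incidents by Month and Hour",
       x = "Hour",
       y = "Month",
       fill = "Total Incidents") +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())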