Students will be able to distinguish bad visualizations from good visualizations, based on aesthetic principles of design.
Students will be able to label the different pieces of visualizations: data, aesthetics (mappings) and geoms (layers).
Students will be able to use skills developed in previous weeks (and from previous coursework) to successfully install new packages/modules, access helpfiles and understand the structure and summary of datasets.
Students will be able to write scripts that produce simple plots (histograms, scatter plots and boxplots) with base R (or matplotlib) plotting, and understand the basic syntax and structure of associated functions.
Students will be able to think critically to use the grammar of graphics and ggplot to produce new visualizations using the expertise of their classmates, the basic syntax they learned in class, and the helpfiles associated with R (and with the web).
Conceptor (Manager) - In addition to managerial duties, does the initial construction of what the figure should look like. Draws things, conceptualizes how the data will be presented.
Coder/Recorder - the person writing the bulk of the code. Importing packages, annotating code throughout. Recording ideas.
Error Checker (Reflector) - In addition to reflector duties, this person is responsible for debugging code, using helpfiles, etc.
Explainer/Presenter - the person that will do the speaking out. Be able to explain the code and the rationale of each bit to the rest of the class. Should also explain the consensus of the group about what the data show us.
Some inspiring figures:
http://www.r-graph-gallery.com/19-map-leafletr/
http://www.r-graph-gallery.com/21-distribution-plot-using-ggplot2
http://www.r-graph-gallery.com/274-map-a-variable-to-ggplot2-scatterplot/
http://www.r-graph-gallery.com/123-circular-plot-circlize-package-2/
Within your group - what are all of the types of visualization you can think of? The Conceptor will draw them initially, then I’ll draw them on the board during a report out.
Matching activity in group - can you match the visualization to the data type?
File for activity is in the assets folder.
Full mini-lecture can be found in the assets folder.
Figures:
It should look something like this:
Spend 5 minutes and discuss the different flowcharts, though everyone should probably be on the same page on this.
Here is the code and figure, based on the mtcars dataset.
#Loading Tidyverse with ggplot and all the goodies
library(tidyverse)
#Scoping out what the data look like
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
#A data frame with a bunch of data on different cars.... cool. Let's plot a bunch of different variables.
#The plot they need to deconstruct
ggplot(data = mtcars, aes(x = log(mpg), y = log(hp), col = factor(cyl))) +
  geom_point(aes(size = wt), alpha = 0.6) +
  theme_classic() +
  labs(x = "log MPG",
       y = "log Horsepower",
       color = "Cylinders",
       size = "Weight")
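When debriefing the deconstruction, it can help to show that the three pieces (data, aesthetics, geoms) are stored separately inside the plot object. The following is only a sketch, assuming ggplot2 is installed; `p` is an illustrative name, and the `$data`, `$mapping` and `$layers` fields are internal details that may change between ggplot2 versions.

```r
# A sketch, assuming ggplot2 is installed; `p` is just an illustrative name.
library(ggplot2)

p <- ggplot(data = mtcars, aes(x = log(mpg), y = log(hp), col = factor(cyl))) +
  geom_point(aes(size = wt), alpha = 0.6)

# 1. Data: the data frame handed to ggplot()
dim(p$data)

# 2. Aesthetics: the plot-level mappings shared by every layer
names(p$mapping)

# 3. Geoms: one entry per layer, each with its own extra mappings
length(p$layers)
class(p$layers[[1]]$geom)[1]
names(p$layers[[1]]$mapping)
```

Here the `size = wt` mapping lives in the point layer rather than at the plot level, which is a useful talking point about where aesthetics can be declared.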
We’re going to be using a real data set for the next set of exercises. I’ll live code below on how to get access to it. It’s a big set of bike-share data from London, which shows bikeshare use (just like the Tucson bikeshare!), along with weather and other factors.
For a python version of code to generate this exercise, check out the assets folder.
# load up the tidyverse
library(tidyverse)
bikeshare = read_csv("../assets/london_merged.csv")
# quality control check using glimpse
glimpse(bikeshare)
## Observations: 17,414
## Variables: 10
## $ timestamp <dttm> 2015-01-04 00:00:00, 2015-01-04 01:00:00, 2015-01-…
## $ cnt <dbl> 182, 138, 134, 72, 47, 46, 51, 75, 131, 301, 528, 7…
## $ t1 <dbl> 3.0, 3.0, 2.5, 2.0, 2.0, 2.0, 1.0, 1.0, 1.5, 2.0, 3…
## $ t2 <dbl> 2.0, 2.5, 2.5, 2.0, 0.0, 2.0, -1.0, -1.0, -1.0, -0.…
## $ hum <dbl> 93.0, 93.0, 96.5, 100.0, 93.0, 93.0, 100.0, 100.0, …
## $ wind_speed <dbl> 6.0, 5.0, 0.0, 0.0, 6.5, 4.0, 7.0, 7.0, 8.0, 9.0, 1…
## $ weather_code <dbl> 3, 1, 1, 1, 1, 1, 4, 4, 4, 3, 3, 3, 4, 3, 3, 3, 3, …
## $ is_holiday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ is_weekend <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ season <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
We’re going to start by plotting a basic histogram in base R. We want to see the distribution of values for bikeshare use.
We’re going to use the hist() function to accomplish this. Recall that a function usually takes parameters; hist() is no different. We need to feed it the right inputs to get the right outputs.
# Call up the helpfile on hist
?hist
# Lots of parameters, but only one we really need is 'x' - feed the function
# a vector.
hist(bikeshare$cnt, main = "")
Get into your groups. Here is my challenge to you: produce a histogram using the hist() function on the bikeshare$cnt data that addresses the issues we discussed. Copy and paste your script into an email and send it to me.
In theory, their answers should look something like this:
hist(bikeshare$cnt, breaks = 10, main = NULL, xlab = "New Bikeshares per Hour")
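One point worth flagging when groups experiment with `breaks`: given a single number, hist() treats it as a suggestion and picks "pretty" break points, so the number of bars may not match what was asked for. A minimal sketch, using the built-in mtcars data (an assumption made so it runs without the bikeshare file):

```r
# plot = FALSE returns the binning without drawing anything
h <- hist(mtcars$mpg, breaks = 10, plot = FALSE)
h$breaks          # the break points R actually chose
length(h$counts)  # number of bars, which need not equal 10

# For exact control, pass a vector of break points instead of a single number
h_exact <- hist(mtcars$mpg, breaks = seq(10, 35, by = 5), plot = FALSE)
h_exact$counts    # one count per 5-mpg-wide bin
```

The same idea should carry over to bikeshare$cnt: a vector of break points gives predictable bins, while a single number leaves the final choice to R.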
Put figures on the board - among these will probably be boxplots and maybe scatter plots? We’ll talk about these next.
It’s a function. You have to feed it parameters like anything else. Check out the helpfile.
Make a boxplot of bikeshare use (bikeshare$cnt) as a function of each season.
boxplot(cnt ~ season, data = bikeshare, ylab = "Total number of new rideshares")
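The formula interface reads as "response ~ grouping variable". As a small sketch of what boxplot() computes for each group, here is the same pattern on the built-in mtcars data (chosen only so the example is self-contained; `b` is an illustrative name):

```r
# Formula interface: numeric response on the left, grouping variable on the right
b <- boxplot(mpg ~ cyl, data = mtcars,
             xlab = "Cylinders", ylab = "Miles per gallon")

b$names  # one box per level of the grouping variable
b$n      # number of observations behind each box
b$stats  # per box: lower whisker, lower hinge, median, upper hinge, upper whisker
```

Looking at the returned list is a quick way to connect the drawn boxes back to the numbers behind them.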
Same as above - different data though. Use the bikeshare data set, plot cnt versus t2, using the plot() function.
plot() is a more general function, so we have to specify some extra information, like the type of plot we want it to spit out.
See if you can figure it out from the helpfile in your groups.
Their plot should look something like this:
plot(y = bikeshare$cnt, x = bikeshare$t2, pch = 16, ylab = "Number of new bikeshares/hour",
xlab = "Temperature Perception (ºC)")
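For groups that get stuck on the helpfile, the `type` argument is the piece that controls what plot() draws. For two numeric vectors the default, `type = "p"`, already gives a scatter plot. A brief sketch on the built-in mtcars data (an assumption, so that it runs without the bikeshare file):

```r
# Sort by x first; type = "l" joins points in the order they are given
ord <- order(mtcars$wt)
x <- mtcars$wt[ord]
y <- mtcars$mpg[ord]

op <- par(mfrow = c(1, 3))  # three panels side by side
plot(x, y, type = "p", pch = 16, main = 'type = "p" (points)')
plot(x, y, type = "l", main = 'type = "l" (lines)')
plot(x, y, type = "b", pch = 16, main = 'type = "b" (both)')
par(op)                     # restore the previous layout
```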
Let’s look at some base-R code:
plot(bikeshare$t2, bikeshare$cnt, pch = 16, xlab = "Perceived temperature (ºC)",
     ylab = "New rideshare count/hour")
First things first, you’ll need to download and install the packages:
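#Install the tidyverse once per machine (uncomment the next line to run it)
#install.packages("tidyverse")
#Loading Tidyverse with ggplot and all the goodies
library(tidyverse)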
?ggplot
ggplot(bikeshare, aes(x = t2, y = cnt))
ggplot(bikeshare, aes(x = t2, y = cnt)) +
  geom_point()
Boom! We’ve got the graph we did in base R above (more or less). Here is how to change the labels, and use a theme that is a bit less cluttered.
ggplot(bikeshare, aes(x = t2, y = cnt)) +
geom_point() +
xlab("Perceived Temperature (ºC)") +
ylab("New Bikeshare Count/Hour") +
theme_classic()
We can change attributes within geoms - imagine you wanted the points to be bigger.
ggplot(bikeshare, aes(x = t2, y = cnt)) +
geom_point(size = 4) +
xlab("Perceived Temperature (ºC)") +
ylab("New Bikeshare Count/Hour") +
theme_classic()
And… we can layer on other geometries. Imagine you wanted to add a trendline with a 95% confidence interval to this figure, and make the points a little transparent to prevent overplotting.
ggplot(bikeshare, aes(x = t2, y = cnt)) +
geom_point(size = 2, alpha = 0.1) +
geom_smooth(method = "lm") +
xlab("Perceived Temperature (ºC)") +
ylab("New Bikeshare Count/Hour") +
theme_classic()
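To unpack what the trendline and its band represent: with method = "lm" and the default settings, geom_smooth() fits a straight-line model of y on x and shades a 95% confidence interval around the fitted line. A rough sketch of the same calculation done by hand, on the built-in mtcars data (an assumption, so it runs without the bikeshare file; `fit`, `grid` and `band` are illustrative names):

```r
# Fit the same kind of model geom_smooth(method = "lm") uses: y ~ x
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)  # intercept and slope of the trend line

# Predicted line and 95% confidence band on a small grid of x values
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 5))
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
cbind(grid, band)  # columns: wt, fit, lwr, upr
```

For the bikeshare figure the analogous model would be lm(cnt ~ t2, data = bikeshare); `fit` is the line and `lwr`/`upr` are the edges of the shaded band.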
I’m going to show you a figure on the board from the bikeshare data set. Your job is to replicate the figure within your group (hint: search the web for ‘faceting ggplot2’)