1 Before we get started

Having trouble remembering what exactly an R Markdown is? Want some more resources for learning R?

  • Review what an R Markdown is here.
  • Explore further resources for learning R here.

1.1 Recap: What we learned in the previous tutorial

In the last tutorial, we started our ggplot adventure. More specifically, we dived into multivariate plots with geom_point, geom_col and geom_line()

1.2 Overview: What we'll learn here

What we'll look at here:

Visualization + Univariate plots with different geoms + Non data related topics on plots


Let's start by reading in our data:

...And picking up right where we left off.

Not really, we should start with something we missed from last week: using a third variable in a bar plot. Say you want to make a bar chart for the mean level of gas mileage mpg on the y axis, and the number of gears gear in the x axis, but you want to compare these results between automatic transmission vs. manual transmission cars am. To do that, we need to do a bar chart like the ones we've seen previously, but including another aesthetic for the third variable. Here's how you do it. Try changing position = "dodge" for position = "stack" and see what happens.

d %>% 
  group_by(gear,vs) %>% 
  summarise(mpg = mean(mpg)) %>% 
  ggplot(aes(gear,mpg,fill=factor(vs)))+geom_col(position = "dodge")
## `summarise()` regrouping output by 'gear' (override with `.groups` argument)

1.2.1 Geoms for univariate or descriptive plots

Geom What It Does
Univariate or descriptive plots
geom_bar() To plot a bar graph/histogram for a single categorical variable
geom_freqpoly() To plot frequencies by class of a variable
geom_histogram() To plot distribution/frequencies of single continuous variable
geom_density() To plor distribution of single or various continuous variables
geom_violin() To plot distributions. Similar to a density plot, but on both sides. Useful for comparing distributions among factors
geom_qq() To plot residuals with a QQ plot

geom_bar() plots the count of a single factor variable, whereas geom_histogram() plots the count of a single continuous (i.e. numeric) variable.

d %>% 
ggplot() +
  geom_bar(aes(x = carb))

d %>% 
ggplot() +
  geom_histogram(aes(x = hp))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Practice Note the warning message: "stat_bin() using bins = 30. Pick better value with binwidth." Redo the plot using different values for binwidth = , inside the geom_histogram() call.

Here are some other geoms for descriptive plots:

d %>% 
  ggplot(aes(mpg)) + 
    geom_freqpoly(aes(colour = wt), binwidth = 1/4)

#Box plot
d %>% 
ggplot(aes(x = gear, y = mpg)) +
  geom_boxplot()

d %>% 
ggplot(aes(x = carb, y = mpg)) +
  geom_boxplot() +
  coord_flip()
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

#violin plots
d %>% 
  ggplot(aes(x = am,y = mpg))+
  geom_violin()

#visualize covariation between counts:
d %>% 
ggplot() +
  geom_count(aes(x = am, y = foreign))

# you can rewrite this plot more concisely:
# faithful dataframe holds the eruption times in minutes for the Old Faithful geiser in Yellowstone National Park
ggplot(data = faithful, aes(x = eruptions)) + 
  geom_freqpoly(binwidth = 0.25)

ggplot(faithful, aes(eruptions)) + 
  geom_freqpoly(binwidth = 0.25)

# How would you interpret this graph?

1.3 Common aesthetics

Aesthetic What It Does
color = Seperates groups by color of obs
shape = Seperates groups by shape of obs
size = Seperates groups by size of obs
alpha = In datasets with many overlapping data points, makes less overlapping obs more transparent and more overlapping obs more solid, in order to see where your data clusters most heavily
linetype = Separates groups with different line types: filled, dotted, etc.
fill = For bars, the color of the inside of the bar. Color would do the border

Here are asome common aesthetics you might try using. As you see below, if you apply aesthetics as global aesthetics in ggplot() they will apply to all layers. As we saw, changing features outside of aes(), though, (e.g. color), need to be specified in each layer.

d %>% 
  ggplot(aes(x = mpg, y = wt)) + 
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

d %>% 
  ggplot(aes(x = mpg, y = wt, color = gear)) + 
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

d %>% 
  ggplot(aes(x = mpg, y = wt)) + 
  geom_point(color = "red") +
  geom_smooth(method = "lm", color = "red")
## `geom_smooth()` using formula 'y ~ x'

Also notice that not all aesthetics work for all types of geoms--"shape", for example, doesn't work with geom_smooth, so the global aesthetic shape = gear ignores that layer. You can, however, add another aesthetic that does apply to geom_smooth(), such as "linetype".

d %>% 
  ggplot(aes(x = mpg, y = wt,shape=gear)) + 
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

d %>% 
  ggplot(aes(x = mpg, y = wt, shape = gear, linetype = gear)) + 
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

d %>% 
  ggplot(aes(x = mpg, y = wt)) + 
  geom_point(shape = "square") +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

1.4 Faceting

Sometimes, you want to break down a graph based on groups. When using colors or other graphical devices is not possible (they are already mapped to other variables) or the resulting graph would look to crowded, you can use faceting to repeat the graph subsetted by each level of the faceting variable.

#Makes three scatterplots, one for each gear category.
d %>% 
ggplot() + 
  geom_point(aes(x = mpg, y = wt)) + 
  facet_wrap(~ gear, nrow = 1)

#Makes six scatterplots, resulting from all the possible combinations of am (2) and cyl (3).
d %>% 
ggplot() + 
  geom_point(aes(x = mpg, y = wt)) + 
  facet_wrap(am ~ cyl)

#facet_grid organizes the information better when using two variables for faceting.
d %>% 
ggplot() + 
  geom_point(aes(x = mpg, y = wt)) + 
  facet_grid(am ~ cyl)

d %>% 
ggplot() + 
  geom_point(aes(x = mpg, y = wt)) + 
  facet_wrap(~ cyl)

1.5 Manipulating your axes and scaling

Sometimes, you will want to manipulate the way your graphs look. You might want to flip the graph 90 degrees to make stuff look better, or use a transformation in the axis to show your data. Here's how

1.5.1 Coordinate Systems

For flipping the axes you can use coord_flip. This is particularly useful when you have a lot of text on the variables in the x axis. See the following example.

mtcars %>% #Cars Dataframe
  rownames_to_column("Name") %>%  #Makes the names of the car a varaible
  sample_frac(.50) %>%  # Keeps only 50% of randomly selected cases (just to plot less cars)
  ggplot(aes(Name,mpg))+
  geom_col()+
  coord_flip()

Now take out the coord_flip() argument and see what happens.

What if you want to zoom in a plot. Inside the coord_cartesian function, you can use the xlim and ylim arguments to specify what ranges of data you want to show. Here is an example.

mtcars %>% 
  ggplot(aes(wt,mpg))+
  geom_point()

mtcars %>% 
  ggplot(aes(wt,mpg))+
  geom_point()+
  coord_cartesian(xlim =c(3,4),ylim = c(15,20))

Say, now that you are plotting a variable that isn't linear. A great example for this is the spread of the coronavirus, especially durign its first stages. Here is how the plot would look like by default.

Corona = read.csv("https://covid.ourworldindata.org/data/owid-covid-data.csv",header = T,sep=",",stringsAsFactors=F)
Corona$date = as.Date(Corona$date)
Corona %>% 
  filter(location == "United States") %>% 
  filter(date < as.Date("2020-05-13")) %>% 
  ggplot(aes(date,total_cases))+
  geom_line()

Now let's add scale_y_log10, and see how that improves the plot.

Corona %>% 
  filter(location == "United States") %>% 
  filter(date < as.Date("2020-05-13")) %>% 
  ggplot(aes(date,total_cases))+
  geom_line()+
  scale_y_log10()
## Warning: Transformation introduced infinite values in continuous y-axis

1.6 Non-data aspects of your ggplot

Now let's return to our first graph looking at mpg and wt. ggplot makes it easy to make these graphs look really nice and professional. What can we do to prep them to present to others?

d %>% 
ggplot(aes(mpg, wt)) +
  geom_point(aes(color = am)) +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

There are a few different ways to change the descriptive text on your ggplots, but one easy way is using labs() (short for "labels"). For example, we can add a title:

d %>% 
ggplot(aes(mpg, wt)) +
  geom_point(aes(color = am)) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Manual cars tend to be heavier than automatic cars"
  )
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

We can add a subtitle:

d %>% 
ggplot(aes(mpg, wt)) +
  geom_point(aes(color = am)) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Manual cars tend to be heavier than automatic cars",
    subtitle = "Car weight and transmission type significantly affect mpg"
  )
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

We can even add a caption for reference information (can you find it?):

d %>% 
ggplot(aes(mpg, wt)) +
  geom_point(aes(color = am)) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Manual cars tend to be heavier than automatic cars",
    subtitle = "Car weight and transmission type significantly affect mpg",
    caption = "Data from fueleconomy.gov"
  )
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Of course, as would be true for any good graph, we can add labels for the x and y axes:

d %>% 
ggplot(aes(mpg, wt)) +
  geom_point(aes(color = am)) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Manual cars tend to be heavier than automatic cars",
    subtitle = "Car weight and transmission type significantly affect mpg",
    caption = "Data from fueleconomy.gov",
    x = "Average Gas Mileage (mpg)",
    y = "Car Weight (in 1000s of pounds)"
  )
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

We can also change the label for legend representing the "z" variable--that is, the factor variable that divides the data into two groups.

d %>% 
ggplot(aes(mpg, wt)) +
  geom_point(aes(color = am)) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Manual cars tend to be heavier than automatic cars",
    subtitle = "Car weight and transmission type significantly affect mpg",
    caption = "Data from fueleconomy.gov",
    x = "Average Gas Mileage (mpg)",
    y = "Car Weight (in 1000s)",
    color = "Transmission Type"
  )
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Along with the label for the legend, we can, with a little effort outside of ggplot, change the names for each of its constituent levels, as well:

d %<>% mutate(
  am = factor(am, labels = c("Manual", "Automatic"))
)

d %>% 
ggplot(aes(mpg, wt)) +
  geom_point(aes(color = am)) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Manual cars tend to be heavier than automatic cars",
    subtitle = "Car weight and transmission type significantly affect mpg",
    caption = "Data from fueleconomy.gov",
    x = "Average Gas Mileage (mpg)",
    y = "Car Weight (in 1000s)",
    color = "Transmission Type"
  )
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

What if you want to change the position of the legend, or remove it completely? Easy! Just use the legend.position argument inside the theme function. Try legend.position = "none" and see what happens.

d %>% 
ggplot(aes(mpg, wt)) +
  geom_point(aes(color = am)) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Manual cars tend to be heavier than automatic cars",
    subtitle = "Car weight and transmission type significantly affect mpg",
    caption = "Data from fueleconomy.gov",
    x = "Average Gas Mileage (mpg)",
    y = "Car Weight (in 1000s)",
    color = "Transmission Type"
  )+
  theme(legend.position = "bottom")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Besides the descriptive text, the other major aspect of a graph's presentation is its background. ggplot2 has some preset themes that you can add to your graphs to change how they look. Check some of them out below.

theme() can also be used to tweak aspects of your graph's theme manually.

d %>% 
ggplot(aes(mpg, wt)) +
  geom_point(aes(color = am)) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Manual cars tend to be heavier than automatic cars",
    subtitle = "Car weight and transmission type significantly affect mpg",
    caption = "Data from fueleconomy.gov",
    x = "Average Gas Mileage (mpg)",
    y = "Car Weight (in 1000s)",
    color = "Transmission Type"
  ) +
  theme_classic()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

d %>% 
ggplot(aes(mpg, wt)) +
  geom_point(aes(color = am)) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Manual cars tend to be heavier than automatic cars",
    subtitle = "Car weight and transmission type significantly affect mpg",
    caption = "Data from fueleconomy.gov",
    x = "Average Gas Mileage (mpg)",
    y = "Car Weight (in 1000s)",
    color = "Transmission Type"
  ) +
  theme_minimal()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

d %>% 
ggplot(aes(mpg, wt)) +
  geom_point(aes(color = am)) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Manual cars tend to be heavier than automatic cars",
    subtitle = "Car weight and transmission type significantly affect mpg",
    caption = "Data from fueleconomy.gov",
    x = "Average Gas Mileage (mpg)",
    y = "Car Weight (in 1000s)",
    color = "Transmission Type"
  ) +
  theme_light()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

d %>% 
ggplot(aes(mpg, wt)) +
  geom_point(aes(color = am)) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Manual cars tend to be heavier than automatic cars",
    subtitle = "Car weight and transmission type significantly affect mpg",
    caption = "Data from fueleconomy.gov",
    x = "Average Gas Mileage (mpg)",
    y = "Car Weight (in 1000s)",
    color = "Transmission Type"
  ) +
  theme_dark()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

d %>% 
ggplot(aes(mpg, wt)) +
  geom_point(aes(color = am)) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Manual cars tend to be heavier than automatic cars",
    subtitle = "Car weight and transmission type significantly affect mpg",
    caption = "Data from fueleconomy.gov",
    x = "Average Gas Mileage (mpg)",
    y = "Car Weight (in 1000s)",
    color = "Transmission Type"
  ) +
  theme_test()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

And there are even more themes you can get from other packages, like sjPlot and ggthemes:

#install.packages("sjPlot")

d %>% 
ggplot(aes(mpg, wt)) +
  geom_point(aes(color = am)) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Manual cars tend to be heavier than automatic cars",
    subtitle = "Car weight and transmission type significantly affect mpg",
    caption = "Data from fueleconomy.gov",
    x = "Average Gas Mileage (mpg)",
    y = "Car Weight (in 1000s)",
    color = "Transmission Type"
  ) +
  sjPlot::theme_sjplot()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

2 Review: End Notes

Today, we covered:

*The basic structure of ggplots
*Common aesthetics
*Common geoms
*How to change the presentation of your ggplot

2.1 What's an R Markdown again?

This is the main kind of document that I use in RStudio, and I think its one of the primary advantage of RStudio over base R console. R Markdown allows you to create a file with a mix of R code and regular text, which is useful if you want to have explanations of your code alongside the code itself. This document, for example, is an R Markdown document. It is also useful because you can export your R Markdown file to an html page or a pdf, which comes in handy when you want to share your code or a report of your analyses to someone who doesn't have R. If you're interested in learning more about the functionality of R Markdown, you can visit this webpage

R Markdowns use chunks to run code. A chunk is designated by starting with ```{r} and ending with ``` This is where you will write your code. A new chunk can be created by pressing COMMAND + ALT + I on Mac, or CONTROL + ALT + I on PC.

You can run lines of code by highlighting them, and pressing COMMAND + ENTER on Mac, or CONTROL + ENTER on PC. If you want to run a whole chunk of code, you can press COMMAND + ALT + C on Mac, or ALT + CONTROL + ALT + C on PC. Alternatively, you can run a chunk of code by clicking the green right-facing arrow at the top-right corner of each chunk. The downward-facing arrow directly left of the green arrow will run all code up to that point.

2.2 Some useful resources to continue your learning

A useful resource, in my opinion, is the stackoverflow website. Because this is a general-purpose resource for programming help, it will be useful to use the R tag ([R]) in your queries. A related resource is the statistics stackexchange, which is like Stack Overflow but focused more on the underlying statistical issues. Add other resources