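In exploratory data analysis (EDA), you learn how to use plots as tools for exploration. These notes cover how to turn those plots into graphics that communicate your findings to others.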
library(tidyverse)
## ─ Attaching packages ──── tidyverse 1.2.1 ─
## ✔ ggplot2 3.1.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.1 ✔ dplyr 0.8.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ─ Conflicts ───── tidyverse_conflicts() ─
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggrepel)
The easiest place to start when turning an exploratory graphic into an expository graphic is with good labels.
You add labels with the labs() function.
ggplot(mpg,aes(displ,hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
labs(title = "Fuel efficiency generally decreases with engine size")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
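You can add more text with two further arguments to labs():
> subtitle
adds extra detail in a smaller font beneath the title
> caption
adds text at the bottom right of the plot, often used to describe the source of the data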
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
labs(
title = "Fuel efficiency generally decreases with engine size",
subtitle = "Two seaters (sports cars) are an exception because of their light weight",
caption = "Data from fueleconomy.gov"
)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
You can also use labs() to replace the axis and legend titles.
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
geom_smooth(se = FALSE) +
labs(
x = "Engine displacement (L)",
y = "Highway fuel economy (mpg)",
colour = "Car type"
)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
If you want to use mathematical equations instead of text strings, replace "" with quote(). The available options are described in ?plotmath.
df <- tibble(
x = runif(10),
y = runif(10)
)
ggplot(df, aes(x, y)) +
geom_point() +
labs(
x = quote(sum(x[i] ^ 2, i == 1, n)),
y = quote(alpha + beta + frac(delta, theta))
)
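A minimal sketch of why this works: quote() captures an expression without evaluating it, and ggplot2 then renders that expression using the plotmath rules. The example also uses base R's bquote(), which behaves like quote() but evaluates anything wrapped in .(), so a computed value can be spliced into a label. The names n_obs, lab, and p_math are illustrative only and do not come from the notes above.

```r
library(ggplot2)

# quote() returns an unevaluated call, not a character string
lab <- quote(alpha + beta + frac(delta, theta))
is.call(lab)

# Hypothetical sample size, used only for illustration
n_obs <- 10
df <- data.frame(x = runif(n_obs), y = runif(n_obs))

p_math <- ggplot(df, aes(x, y)) +
  geom_point() +
  labs(
    x = lab,
    # bquote() evaluates .(n_obs) and leaves the rest unevaluated
    title = bquote("Random sample, " * n == .(n_obs))
  )
p_math
```

Use quote() when the label is fixed, and bquote() when part of it depends on a value computed in your script.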
In addition to labelling the major components of your plot, it is often useful to label individual observations or groups of observations.
> geom_text()
This function is similar to geom_point(), but it has an additional aesthetic: label.
There are two possible sources of labels.
First, you might have a tibble that provides the labels. The plot below isn't terribly useful, but it illustrates a useful approach: pull out the most efficient car in each class with dplyr, then label it on the plot.
best_in_class <- mpg %>%
group_by(class) %>%
filter(row_number(desc(hwy)) == 1)
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
geom_text(aes(label = model), data = best_in_class)
This is hard to read, because the labels overlap with each other and with the points.
We can make things a little better by switching to geom_label(), which draws a rectangle behind the text.
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
geom_label(aes(label = model),
data = best_in_class,
nudge_y = 2, # nudge the labels slightly above their points
alpha = 0.5)
That helps a bit, but if you look closely at the top-left corner, you'll notice that two labels are practically on top of each other.
You can fix this with the ggrepel package, which automatically adjusts labels so that they don't overlap.
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
geom_point(size = 3,
shape = 1,
data = best_in_class) +
geom_label_repel(aes(label = model),
data = best_in_class)
Note another handy technique used here: I added a second layer of large, hollow points to highlight the points that I've labelled.
Remember that, in addition to geom_text(), ggplot2 has many other geoms available to help annotate your plot.
> geom_hline()
add horizontal reference lines
> geom_vline()
add vertical reference lines
> geom_rect()
draw a rectangle around points of interest
> geom_segment()
draw a line segment, for example to point at a feature of the plot
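As a rough sketch, the annotation geoms listed above can be combined on the mpg scatterplot as follows. The rectangle and segment are added through annotate(), which draws a single geom from fixed values rather than from a data frame. All thresholds and coordinates here are arbitrary choices made for illustration, not values from the notes.

```r
library(ggplot2)

p_annot <- ggplot(mpg, aes(displ, hwy)) +
  # Shade an arbitrary region of interest (large engines, good economy)
  annotate("rect", xmin = 5.5, xmax = 7.2, ymin = 22, ymax = 28,
           fill = "grey50", alpha = 0.2) +
  geom_point() +
  # Horizontal reference line at the mean highway mileage
  geom_hline(yintercept = mean(mpg$hwy), linetype = 2, colour = "grey40") +
  # Vertical reference line at an arbitrary engine size
  geom_vline(xintercept = 5, linetype = 2, colour = "grey40") +
  # A segment with an arrow head pointing into the shaded region
  annotate("segment", x = 4.5, y = 35, xend = 5.6, yend = 28,
           arrow = arrow(length = unit(2, "mm")))
p_annot
```

The rectangle is added before geom_point() so that it is drawn underneath the points.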
A more complicated task: how can you put a different label in each facet?
If the label data frame does not include the faceting variable, the text is drawn in every facet.
label <- tibble(
displ = Inf,
hwy = Inf,
label = "Increasing engine size is \nrelated to decreasing fuel economy."
)
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_text(aes(label = label),
data = label,
vjust = "top",
hjust = "right",
size = 2) +
facet_wrap(~class)
To draw the label in only one facet, add a column to the label data frame with the value of the faceting variable(s) in which to draw it
label <- tibble(
displ = Inf,
hwy = Inf,
class = "2seater",
label = "Increasing engine size is \nrelated to decreasing fuel economy."
)
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_text(aes(label = label),
data = label,
vjust = "top",
hjust = "right",
size = 2
) +
facet_wrap(~class)
To draw a different label in each facet, include the faceting variable(s) in the label data frame, with one row per facet.
label <- tibble(
displ = Inf,
hwy = Inf,
class = unique(mpg$class), # one row per value of the faceting variable
label = str_c("Label for ", class)
)
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
facet_wrap(~class) +
geom_text(aes(label = label),
data = label,
vjust = "top",
hjust = "right",
size = 3
)
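The same idea extends to labels computed from the data. The sketch below, which is an illustration rather than part of the original notes, counts the cars in each class and prints that count in the corner of each facet. The name label_n is made up for this example.

```r
library(tidyverse)

# One row per class, with the label text and its position
label_n <- mpg %>%
  count(class) %>%
  mutate(
    displ = Inf,               # Inf pins the label to the right edge
    hwy = Inf,                 # Inf pins the label to the top edge
    label = str_c("n = ", n)
  )

p_counts <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(~class) +
  geom_text(aes(label = label),
            data = label_n,
            vjust = "top",
            hjust = "right",
            size = 3)
p_counts
```

Because label_n contains a class column, facet_wrap() sends each row to the matching panel.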
Similarly, we can add a different vertical line to each facet.
You only need to create an auxiliary data frame, aux, that holds the x-intercepts for each value of the faceting variable.
aux <- data.frame(cyl = c(4,6,8),
l = c(3.5,4,4.5),
m = c(5,4.3,3))
ggplot(mtcars, aes(x = drat)) +
geom_line(aes(y = mpg,
colour = "mpg")) +
geom_line(aes(y = qsec,
colour = "qsec")) +
facet_wrap(~cyl) +
geom_vline(aes(xintercept = l,
colour = "xiaopang"),
data = aux,
lty=2) +
geom_vline(aes(xintercept = m,
colour = "xiaomei"),
data = aux,
lty = 2)
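Instead of typing the intercepts by hand, the auxiliary data frame can be computed from the data. The following sketch assumes you want a line at the mean of drat within each cyl group; the names aux_mean and mean_drat are hypothetical and chosen for this example only.

```r
library(tidyverse)

# One row per facet value, holding the x-intercept for that facet
aux_mean <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_drat = mean(drat))

p_vline <- ggplot(mtcars, aes(drat, mpg)) +
  geom_point() +
  facet_wrap(~cyl) +
  geom_vline(aes(xintercept = mean_drat),
             data = aux_mean,
             linetype = 2)
p_vline
```

As with the labels, the cyl column in aux_mean is what ties each line to its facet.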
The third way you can make your plot better for communication is to adjust the scales.
Normally, ggplot2 automatically adds scales for you. For example:
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class))
Behind the scenes, ggplot2 adds the default scales, so the code above is equivalent to the code below:
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
scale_x_continuous() +
scale_y_continuous() +
scale_colour_discrete()
Scale functions are named scale_, followed by the name of the aesthetic, then _, then the name of the scale. The default scales are named according to the type of variable they align with: continuous, discrete, datetime, or date.
The default scales have been carefully chosen to do a good job for a wide range of inputs. Nevertheless, you might want to override the defaults for two reasons:
First, you might want to tweak some of the parameters of the default scale. This allows you to do things like change the breaks on the axes, or the key labels on the legend.
Second, you might want to replace the scale altogether and use a completely different algorithm.
There are two primary arguments that affect the appearance of the ticks on the axes and the keys on the legend: breaks and labels.
> breaks
controls the position of the ticks, or the values associated with the keys
> labels
controls the text label associated with each tick or key
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_y_continuous(breaks = seq(15, 40, by = 5))
You can use the labels argument in the same way (a character vector the same length as breaks), but you can also set it to NULL to suppress the labels altogether.
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_x_continuous(labels = NULL) +
scale_y_continuous(labels = NULL)
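As a small illustration of passing a character vector to labels, the sketch below appends a unit to each tick label. The names y_breaks and y_labels are made up for this example; the only requirement is that both vectors have the same length.

```r
library(ggplot2)

y_breaks <- seq(15, 40, by = 5)
# One label per break, in the same order
y_labels <- paste(y_breaks, "mpg")

p_lab <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_y_continuous(breaks = y_breaks, labels = y_labels)
p_lab
```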
Collectively, axes and legends are called guides. Axes are used for the x and y aesthetics; legends are used for everything else.
You will most often use breaks and labels to tweak the axes.
To control the overall position of the legend, you need to use a theme() setting. Themes control the non-data parts of the plot.
The theme setting legend.position controls where the legend is drawn.
base <- ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class))
base + theme(legend.position = "left")
base + theme(legend.position = "right") # the default
base + theme(legend.position = "none") # suppress the display of the legend altogether
To control the display of individual legends, use guides() along with guide_legend() or guide_colourbar().
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
geom_smooth(se = FALSE) +
theme(legend.position = "bottom") +
guides(colour = guide_legend(nrow = 1, # put all the legend keys in a single row
override.aes = list(size = 4))) # override the point size so the keys are easier to see
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Instead of just tweaking the details a little, you can replace the scale altogether.
There are two types of scales you're most likely to want to switch out: continuous position scales and colour scales.
For position scales, it is often useful to plot transformations of your variables. For example, a log transformation makes the relationship between carat and price easier to see.
ggplot(diamonds, aes(carat, price)) +
geom_bin2d()
ggplot(diamonds,
aes(log10(carat), log10(price))) +
geom_bin2d()
However, the disadvantage of this transformation is that the axes are now labelled with the transformed values, which makes the plot hard to interpret.
Instead of doing the transformation in the aesthetic mapping, we can do it with the scale. The plot looks the same, but the axes are labelled on the original data scale.
ggplot(diamonds, aes(carat, price)) +
geom_bin2d() +
scale_x_log10() +
scale_y_log10()
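To check that the two approaches place the points identically, one option is to compare the data ggplot2 computes for each plot. This sketch uses a tiny made-up data frame (df_log) and ggplot2's layer_data(), which returns a layer's data after the scale transformation has been applied.

```r
library(ggplot2)

# Small illustrative data set with values spanning several orders of magnitude
df_log <- data.frame(x = c(1, 10, 100, 1000), y = c(2, 20, 200, 2000))

# Transformation in the aesthetic mapping
p_aes <- ggplot(df_log, aes(log10(x), log10(y))) +
  geom_point()

# Transformation in the scale
p_scale <- ggplot(df_log, aes(x, y)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()

# Compare the plotted x positions of the two versions
same_x <- all.equal(layer_data(p_aes)$x, layer_data(p_scale)$x)
same_x
```

Only the axis labelling differs: p_aes shows the logged values, while p_scale shows the original units.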
Another scale that is frequently customised is colour. A useful alternative to the default is the set of ColorBrewer scales, which have been hand tuned to work better for people with common types of colour blindness.
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = drv)) +
scale_colour_brewer(palette = "Set1")
RColorBrewer::display.brewer.all()
The plot above shows the complete list of palettes. The sequential (top) and diverging (bottom) palettes are particularly useful if your categorical values are ordered, or have a "middle".
When you have a predefined mapping between values and colours, use scale_colour_manual().
presidential %>%
mutate(id = 33 + row_number()) %>%
ggplot(aes(start,
id,
colour = party)) +
geom_point() +
geom_segment(aes(xend = end,
yend = id)) +
scale_colour_manual(
values = c(Republican = "red",
Democratic = "blue"))
For continuous colour, you can use the built-in scale_colour_gradient() or scale_fill_gradient().
If you have a diverging scale, you can use scale_colour_gradient2(). That allows you to give, for example, positive and negative values different colours.
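A minimal sketch of a diverging colour scale, using simulated data because the notes give no example of their own. The data frame df_div and its column z are hypothetical; z is constructed so that it takes both positive and negative values around zero.

```r
library(ggplot2)

set.seed(1)
df_div <- data.frame(x = rnorm(100), y = rnorm(100))
# z is negative in the bottom-left and positive in the top-right
df_div$z <- df_div$x + df_div$y

p_div <- ggplot(df_div, aes(x, y, colour = z)) +
  geom_point() +
  scale_colour_gradient2(
    low = "blue",      # colour for values below the midpoint
    mid = "grey90",    # colour at the midpoint
    high = "red",      # colour for values above the midpoint
    midpoint = 0
  )
p_div
```

Setting midpoint explicitly matters when the meaningful centre of the variable is not zero, for example when values are compared against an average.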
To zoom in on a region of the plot, it's generally best to use coord_cartesian(), because it changes only the visible window and leaves the underlying data untouched.
ggplot(mpg,mapping = aes(displ,hwy)) +
geom_point(aes(color = class)) +
geom_smooth() +
coord_cartesian(xlim = c(5,7),
ylim = c(10,30))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
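To see why coord_cartesian() is usually preferred, it helps to compare it with filtering the data first. In this sketch (the names mpg_subset, p_zoom, and p_subset are illustrative), the zoomed plot fits the smooth to all of mpg, while the filtered plot fits it only to the rows inside the window, so the two curves can differ.

```r
library(tidyverse)

# Keep only the rows that fall inside the region of interest
mpg_subset <- mpg %>%
  filter(displ >= 5, displ <= 7, hwy >= 10, hwy <= 30)

# Zoom: the smooth is fitted to the full data, then the view is cropped
p_zoom <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_smooth() +
  coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))

# Subset: the smooth is fitted to the filtered rows only
p_subset <- ggplot(mpg_subset, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_smooth()

# How many rows each smooth is based on
c(full = nrow(mpg), subset = nrow(mpg_subset))
```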
Finally, you can customise the non-data elements of your plot with a theme.
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
theme_bw()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
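Complete themes such as theme_bw() can be combined with individual theme() settings. The sketch below is one possible combination, not a recommendation from the notes: it starts from theme_minimal() and then moves the legend and emboldens the title. The order matters, because a complete theme replaces any theme() settings added before it.

```r
library(ggplot2)

p_theme <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_smooth(se = FALSE) +
  labs(title = "Fuel efficiency generally decreases with engine size") +
  theme_minimal() +                      # complete theme first
  theme(                                 # individual tweaks afterwards
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  )
p_theme
```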