R for Data Science | by Wickham & Grolemund
Chapter 1 | Data Visualization with ggplot2
Load the tidyverse, palmerpenguins, and ggthemes packages in your setup chunk before doing these exercises.
There are several ways to view data. The first is to print it as a tibble, which shows the rows and columns your data contain, much like a table. Just type the name of the data set and run it.
penguins
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
You can also quickly view the data type for each column using
glimpse().
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Finally, instead of using your mouse to click open your data, you can
just use view() to quickly open the data in a new
window.
view(penguins)
Create a visual display of the relationship between flipper lengths and body masses of these penguins, taking into consideration the species of the penguin.
plot <- ggplot(penguins, aes(flipper_length_mm, body_mass_g, color=species, shape=species))+
geom_point()+
labs(title="Body mass & flipper length", subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins", x="Flipper Length (mm)", y="Body Mass (g)")
plot
plot+geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
When aesthetic mappings are defined in ggplot(), at the
global level, they’re passed down to each of the subsequent geom layers
of the plot. However, each geom function in ggplot2 can also take a
mapping argument, which allows for aesthetic mappings at the local level
that are added to those inherited from the global level. Since we want
points to be colored based on species but don’t want the lines to be
separated out for them, we should specify
color = species for geom_point() only.
plot2 <- ggplot(penguins, aes(flipper_length_mm, body_mass_g))+
geom_point(aes(color=species, shape=species))+
labs(title="Body mass & flipper length", subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins", x="Flipper Length (mm)", y="Body Mass (g)")+
geom_smooth(method="lm")
plot2
## `geom_smooth()` using formula = 'y ~ x'
plot2+scale_color_colorblind()
## `geom_smooth()` using formula = 'y ~ x'
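As a rough check on the inheritance rule described above, the sketch below counts how many separate lines geom_smooth() fits under each mapping. It assumes ggplot2 and palmerpenguins are installed; p_global, p_local, and the n_lines_* names are made up for illustration, and layer_data() is used only to peek at the data computed for the smooth layer.

```r
library(ggplot2)
library(palmerpenguins)

# Global color mapping: geom_smooth() inherits it, so it fits one line per species.
p_global <- ggplot(penguins, aes(flipper_length_mm, body_mass_g, color = species)) +
  geom_point(na.rm = TRUE) +
  geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE)

# Local color mapping: only geom_point() sees species, so there is one overall line.
p_local <- ggplot(penguins, aes(flipper_length_mm, body_mass_g)) +
  geom_point(aes(color = species), na.rm = TRUE) +
  geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE)

# Layer 2 is the smooth; each distinct group value corresponds to one fitted line.
n_lines_global <- length(unique(layer_data(p_global, 2)$group))
n_lines_local  <- length(unique(layer_data(p_local, 2)$group))
```

Comparing n_lines_global with n_lines_local shows the effect of moving color = species from ggplot() into geom_point().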
dim(penguins)
## [1] 344 8
There are 344 rows and 8 columns in the penguins
data.
#?penguins
The variable bill_depth_mm is a number denoting bill depth in millimeters.
ggplot(penguins, aes(bill_length_mm, bill_depth_mm))+
geom_point(aes(color=species, shape=species))+
labs(title="Penguin bill length vs. bill depth", x="Bill Length (mm)", y="Bill Depth (mm)")+
geom_smooth(method="lm", color="gray")+
scale_color_colorblind()
## `geom_smooth()` using formula = 'y ~ x'
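Within each species, bill depth increases as bill length increases. The single gray line fitted to all penguins at once slopes the other way, because the species differ in their typical bill sizes (Gentoo bills, for example, are long but shallow).
Because species is categorical, I think a box plot or maybe a violin plot would be a better way to display these data.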
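A quick way to check the direction of the relationship between bill length and bill depth is to compare the correlation for all penguins pooled together with the correlation inside each species. This is a minimal sketch that assumes palmerpenguins is installed; complete, pooled_cor, and species_cor are illustrative names, not part of the exercise.

```r
library(palmerpenguins)

# Keep only rows where both bill measurements are present.
complete <- penguins[!is.na(penguins$bill_length_mm) &
                       !is.na(penguins$bill_depth_mm), ]

# Correlation across all penguins, ignoring species (negative).
pooled_cor <- cor(complete$bill_length_mm, complete$bill_depth_mm)

# Correlation computed separately inside each species (positive in all three).
species_cor <- sapply(
  split(complete, complete$species),
  function(d) cor(d$bill_length_mm, d$bill_depth_mm)
)
```

The sign flip between pooled_cor and species_cor is why coloring or faceting by species matters for this plot.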
plot3 <- ggplot(penguins, aes(species, bill_depth_mm))+
geom_violin(aes(color=species))+
labs(title="Penguin species vs. bill depth", x="Species", y="Bill Depth (mm)")+
scale_color_colorblind()
plot3
ggplot(data = penguins) +
geom_point()
The code above fails because geom_point() requires x and y aesthetics and
none are mapped. I would add aes(x = bill_length_mm, y = bill_depth_mm), or
whichever two variables I want to plot, inside geom_point(), or inside
ggplot() if I wanted the mapping to apply globally rather than locally.
#?geom_point()
na.rm in geom_point() controls how missing values are handled. Our data
frame has missing values (recall the warnings on the first ggplot calls),
and rows with missing values are dropped either way: with na.rm = FALSE,
the default, they are removed with a warning; with na.rm = TRUE they are
removed silently. So instead of suppressing the warnings for each code
chunk above, I could have just added na.rm = TRUE to my geom_point() call.
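To make the na.rm behavior concrete, the sketch below counts the rows that geom_point() cannot draw and builds the same scatter plot with and without na.rm = TRUE. It assumes ggplot2 and palmerpenguins are installed; n_missing, p_warn, and p_silent are illustrative names.

```r
library(ggplot2)
library(palmerpenguins)

# Rows that cannot be drawn because x or y is missing.
n_missing <- sum(is.na(penguins$flipper_length_mm) | is.na(penguins$body_mass_g))

# Default (na.rm = FALSE): the incomplete rows are dropped with a warning when drawn.
p_warn <- ggplot(penguins, aes(flipper_length_mm, body_mass_g)) +
  geom_point()

# na.rm = TRUE: the same rows are dropped, but silently.
p_silent <- ggplot(penguins, aes(flipper_length_mm, body_mass_g)) +
  geom_point(na.rm = TRUE)
```

Printing p_warn and p_silent should give the same picture; only the warning differs, and n_missing is the row count that warning reports.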
plot3+labs(caption = "Data come from the palmerpenguins package.")
ggplot(penguins, aes(flipper_length_mm, body_mass_g, color=bill_depth_mm))+
geom_point(na.rm=TRUE)+
geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)
) +
geom_point() +
geom_smooth(se = FALSE)
This code will create a scatter plot with flipper length on the x axis and body mass on the y axis. The points will be colored by island, and because the color mapping is global, a separate smooth curve will be drawn for each island. I do not know yet what se = FALSE does inside geom_smooth().
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)
) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
#?geom_smooth
The se argument of geom_smooth() controls whether the confidence
interval around the smooth is displayed. It is TRUE by default (see level
to control the confidence level), so se = FALSE hides the band.
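The effect of se and level can be inspected without looking at the picture. The sketch below is only an illustration: it assumes ggplot2 and palmerpenguins are installed, uses method = "lm" to keep the fit simple, and has_band(), base_plot, and the band_* names are hypothetical helpers.

```r
library(ggplot2)
library(palmerpenguins)

base_plot <- ggplot(penguins, aes(flipper_length_mm, body_mass_g))

# Computed smooth data for the default band, a 99% band, and no band at all.
band_95 <- layer_data(
  base_plot + geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE), 1)
band_99 <- layer_data(
  base_plot + geom_smooth(method = "lm", formula = y ~ x, level = 0.99, na.rm = TRUE), 1)
no_band <- layer_data(
  base_plot + geom_smooth(method = "lm", formula = y ~ x, se = FALSE, na.rm = TRUE), 1)

# A band exists when the layer data carries usable ymin values.
has_band <- function(d) "ymin" %in% names(d) && !all(is.na(d$ymin))

# Average vertical width of the band at each confidence level.
width_95 <- mean(band_95$ymax - band_95$ymin)
width_99 <- mean(band_99$ymax - band_99$ymin)
```

Here has_band(no_band) is FALSE while has_band(band_95) is TRUE, and width_99 is larger than width_95 because a higher confidence level needs a wider interval.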
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point() +
geom_smooth()
ggplot() +
geom_point(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_smooth(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
)
These two plots should not look any different, even though the second plot supplies the data and aes locally in each geom. Both geoms end up with the same data, the same x, and the same y in either case. Plot 2 just repeats itself; plot 1 states the mapping once and lets the geoms inherit it.
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point() +
geom_smooth()+
labs(title="Plot 1")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot() +
geom_point(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_smooth(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
)+
labs(title="Plot 2")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
You do not need to name the data argument each time by writing data = penguins; because it is the first argument, you can just write penguins. Soon we will get more into piping, and you will be able to create a plot rather quickly by doing the following:
penguins |>
ggplot(aes(flipper_length_mm, body_mass_g))+
geom_point()
How you decide to visualize your data depends on the data types you are trying to view.
Note that a density plot is kind of like dropping a piece of cooked spaghetti over the bars of a histogram.
penguins |>
ggplot(aes(y=species, fill=species))+
geom_bar()
Compared with mapping species to x, the only difference is that the plot is tipped on its side: the bars are now horizontal.
ggplot(penguins, aes(x = species)) + geom_bar(color = "red")
ggplot(penguins, aes(x = species)) + geom_bar(fill = "red")
fill changes the inside color of the bars and color changes the outline of the bars. Both are effective and you can decide which way you prefer.
The bins argument of geom_histogram() lets you set a specific number of bins. The default is 30 bins, and it is overridden by binwidth if you decide to use that argument instead.
diamonds|>
ggplot(aes(x=carat))+
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
diamonds|>
ggplot(aes(x=carat))+
geom_histogram(binwidth=20)
diamonds|>
ggplot(aes(carat))+
geom_histogram(binwidth = 0.5)
mean(diamonds$carat)
## [1] 0.7979397
diamonds|>
ggplot(aes(carat))+
geom_histogram(binwidth = 0.8)
diamonds|>
ggplot(aes(carat))+
geom_histogram(binwidth = 0.3)
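Of the values tried here, a binwidth well below 1 (such as 0.3) reveals the most interesting pattern. A binwidth of 20 is wider than the whole range of carat, so everything collapses into a single bar.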
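The relationship between bins and binwidth can be checked on the computed histogram data. This is a small sketch, assuming ggplot2 is installed (diamonds ships with it); carat_plot, by_bins, by_binwidth, and both are illustrative names, and the values 10 and 0.5 are arbitrary choices.

```r
library(ggplot2)

carat_plot <- ggplot(diamonds, aes(x = carat))

# One row of layer data per bar in the histogram.
by_bins     <- layer_data(carat_plot + geom_histogram(bins = 10), 1)
by_binwidth <- layer_data(carat_plot + geom_histogram(binwidth = 0.5), 1)
both        <- layer_data(carat_plot + geom_histogram(bins = 10, binwidth = 0.5), 1)

# Number of bars when only bins is given.
n_bars <- nrow(by_bins)

# Bar widths when binwidth is given, alone or together with bins.
widths_binwidth <- unique(round(by_binwidth$xmax - by_binwidth$xmin, 8))
widths_both     <- unique(round(both$xmax - both$xmin, 8))
```

When both arguments are supplied, widths_both matches widths_binwidth, which is the "overridden by binwidth" behavior noted in the geom_histogram() help.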
Note that any data point that falls more than 1.5 times the IQR beyond either edge of the box is shown as a separate dot above or below the whisker.
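The 1.5 times IQR rule can be worked out by hand and compared with what the box plot computes. The sketch below does this for one species; it assumes ggplot2 and palmerpenguins are installed, picks Chinstrap arbitrarily, and all variable names are illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# Body masses for one species, with missing values dropped.
mass <- penguins$body_mass_g[penguins$species == "Chinstrap"]
mass <- mass[!is.na(mass)]

# Quartiles, IQR, and the fences 1.5 * IQR beyond the box edges.
q     <- quantile(mass, c(0.25, 0.75))
iqr   <- q[[2]] - q[[1]]
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

# Points outside the fences are the ones drawn as dots.
manual_outliers <- mass[mass < lower | mass > upper]

# Outliers computed by geom_boxplot() for the same species (one row per box).
box_data    <- layer_data(
  ggplot(penguins, aes(species, body_mass_g)) + geom_boxplot(na.rm = TRUE), 1)
species_row <- which(levels(penguins$species) == "Chinstrap")
gg_outliers <- box_data$outliers[[species_row]]
```

The two sets of outliers should agree, since geom_boxplot() uses the same fences with its default coef = 1.5.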
penguins |>
ggplot(aes(species, body_mass_g))+
geom_boxplot()
You can also choose to display the above data as a density plot:
plot4 <- penguins |>
ggplot(aes(body_mass_g, color=species))+
geom_density()
plot4
And we can add a semi-transparent fill to the density curves using the alpha argument.
penguins|>
ggplot(aes(body_mass_g, color=species, fill=species))+
geom_density(alpha=0.5)
Say you have two categorical variables and two numerical variables you are trying to display.
Try facet_wrap()!
penguins|>
ggplot(aes(body_mass_g, flipper_length_mm))+
geom_point(aes(color=species, shape=species))+
facet_wrap(~island)
penguins|>
ggplot(aes(island, fill=species))+
geom_bar()
glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
mpg|>
ggplot(aes(hwy, displ))+
geom_point(aes(color=cty))
mpg|>
ggplot(aes(hwy, displ))+
geom_point(aes(color=manufacturer, shape=manufacturer))
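You can only map shape to categorical (discrete) variables such as characters; mapping a continuous variable to shape is an error. Note also that manufacturer has 15 values and the default shape palette only handles 6, so the remaining manufacturers are not drawn.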
mpg|>
ggplot(aes(hwy, displ, linewidth=year))+
geom_point(aes(linewidth=year))
mpg|>
ggplot(aes(hwy, hwy))+
geom_point(aes(color=hwy))
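Mapping hwy to x, y, and color puts every point on one perfectly straight diagonal, so the plot shows nothing beyond the values of hwy itself. The data seem too nice! :)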
penguins|>
ggplot(aes(bill_depth_mm, bill_length_mm))+
geom_point(aes(color=species))
penguins|>
ggplot(aes(bill_depth_mm, bill_length_mm))+
geom_point(aes(color=species))+
facet_wrap(~species)
It seems that Adelie penguins have a shorter bill length on average than the other two species of penguin. It also seems that bill depth is lowest in Gentoo penguins. In all penguin species, the depth of the bill increases as the length of the bill increases.
ggplot(
data = penguins,
mapping = aes(
x = bill_length_mm, y = bill_depth_mm
)
) +
geom_point(aes(color = species, shape = species))+
labs(color="Species", shape="Species")
To get a single combined legend, all you have to do is give color and
shape the same label within the labs() function.
ggplot(penguins, aes(x = island, fill = species)) +
geom_bar(position = "fill")
ggplot(penguins, aes(x = species, fill = island)) +
geom_bar(position = "fill")
You can save your plots using ggsave()!
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point()
ggsave(filename = "penguin-plot.png")
## Saving 7 x 5 in image
If you don’t specify the width and height they will be taken from the dimensions of the current plotting device. For reproducible code, you’ll want to specify them. You can learn more about ggsave() in the documentation.
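As a minimal sketch of the reproducible version, the call below fixes the size and resolution explicitly. It assumes ggplot2 and palmerpenguins are installed; the 6 x 4 inch size, the 300 dpi, and the temporary output path are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE)

# Write to a temporary folder so the example does not clutter the project.
out_file <- file.path(tempdir(), "penguin-plot.png")

# Explicit width, height, units, and dpi give the same file on any machine.
ggsave(out_file, plot = p, width = 6, height = 4, units = "in", dpi = 300)
```

With the dimensions set, the output no longer depends on the size of the plotting window that happens to be open.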
Generally, however, we recommend that you assemble your final reports using Quarto, a reproducible authoring system that allows you to interleave your code and your prose and automatically include your plots in your write-ups. You will learn more about Quarto in Chapter 28.
ggplot(mpg, aes(x = class)) +
geom_bar()
ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point()
ggsave("mpg-plot.png")
## Saving 7 x 5 in image
Only the second plot is saved by this code. That is because ggsave() defaults to the last plot displayed, which here is the scatter plot created just before the ggsave() call.
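To save both plots, one option is to store each in an object and pass it to the plot argument of ggsave(). This is a sketch, assuming ggplot2 is installed; bar_plot, scatter_plot, the file names, and the 6 x 4 inch size are illustrative choices.

```r
library(ggplot2)

# Store each plot in its own object instead of relying on the last one displayed.
bar_plot     <- ggplot(mpg, aes(x = class)) + geom_bar()
scatter_plot <- ggplot(mpg, aes(x = cty, y = hwy)) + geom_point()

# Temporary output paths for the two files.
bar_file     <- file.path(tempdir(), "mpg-bar.png")
scatter_file <- file.path(tempdir(), "mpg-scatter.png")

# Passing plot = ... tells ggsave() exactly which plot to write.
ggsave(bar_file, plot = bar_plot, width = 6, height = 4)
ggsave(scatter_file, plot = scatter_plot, width = 6, height = 4)
```

Naming the plot explicitly also makes the code safer to reorder, since the result no longer depends on which plot was drawn most recently.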
ggsave("mpg-plot.pdf")
## Saving 7 x 5 in image
If you’re still stuck, try the help. You can get help about any R
function by running ?function_name in the console, or
highlighting the function name and pressing F1 in RStudio. Don’t worry
if the help doesn’t seem that helpful - instead skip down to the
examples and look for code that matches what you’re trying to do.