1+2
## [1] 3
library(tidyverse)
mpg
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # ℹ 224 more rows
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
By size:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
## Warning: Using size for a discrete variable is not advised.
Other:
# Left
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
## Warning: Using alpha for a discrete variable is not advised.
# Right
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values because more
## than 6 becomes difficult to discriminate
## ℹ you have requested 7 values. Consider specifying shapes manually if you need
## that many of them.
## Warning: Removed 62 rows containing missing values or values outside the scale range
## (`geom_point()`).
ggplot(data = mpg)
What do I see: a blank canvas. No geom layer has been added yet, so nothing is drawn.
Rows: 234 Columns: 11
drv means: the type of drive train, where f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive (4wd)
ggplot(data = mpg) +
geom_point(mapping = aes(x = hwy, y = cyl))
ggplot(data = mpg) +
geom_point(mapping = aes(x = class, y = drv))
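What happens: a graph is created, but it does not reveal any correlation, pattern or trend between the two variables. Both class and drv are categorical, so the points fall on a small grid of combinations and many observations sit on top of each other. The plot is not useful because it does not clearly highlight a trend.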
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
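What is wrong with the code: color = "blue" sits inside aes(), so ggplot2 treats "blue" as a data value to map rather than as a colour to set. The points are drawn in the default colour for a one-level variable and a legend labelled "blue" appears. To set the colour manually, it must go outside aes():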
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
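To check the difference between mapping and setting a colour, here is a minimal sketch that inspects the colours ggplot2 actually computes. The object names (p_mapped, p_set, mapped_colours, set_colours) are mine and only for illustration; layer_data() is the ggplot2 helper that returns the computed data for a layer.

```r
library(ggplot2)

# Mapped: "blue" inside aes() is treated as a one-level categorical variable
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# Set: color outside aes() applies the same colour to every point
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# layer_data() returns the values ggplot2 computed for the first layer
mapped_colours <- unique(layer_data(p_mapped)$colour)
set_colours <- unique(layer_data(p_set)$colour)

print(mapped_colours) # a default palette colour, not blue
print(set_colours)
```

Only the second plot should report "blue" as the colour of its points.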
Categorical: manufacturer, model, trans, drv, fl, class
Continuous: displ, year, cyl, cty, hwy
We can see this information by running ?mpg: the categorical variables describe types or categories, while the continuous variables describe measurable quantities. To verify this information:
categorical_vars <- names(mpg)[sapply(mpg, function(x) class(x) %in% c("factor", "character"))]
categorical_vars
## [1] "manufacturer" "model" "trans" "drv" "fl"
## [6] "class"
continuous_vars <- names(mpg)[sapply(mpg, function(x) class(x) %in% c("numeric", "integer"))]
continuous_vars
## [1] "displ" "year" "cyl" "cty" "hwy"
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class, shape = class, size = class))
## Warning: Using size for a discrete variable is not advised.
## Warning: The shape palette can deal with a maximum of 6 discrete values because more
## than 6 becomes difficult to discriminate
## ℹ you have requested 7 values. Consider specifying shapes manually if you need
## that many of them.
## Warning: Removed 62 rows containing missing values or values outside the scale range
## (`geom_point()`).
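With a categorical variable, colour and shape take a distinct value for each level; size also works but triggers a warning, because size implies an order. With a continuous variable, colour becomes a gradient and size scales smoothly with the value, while shape cannot be mapped to a continuous variable at all (ggplot2 raises an error).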
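A small sketch to see what happens when a continuous variable is mapped to shape. I use ggplot_build() to trigger the same work that printing the plot would do, and tryCatch() to capture the failure instead of stopping; shape_result is just a name I chose.

```r
library(ggplot2)

# displ is continuous, so it has no sensible set of shapes
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = displ))

# Build the plot and record whether ggplot2 raised an error
shape_result <- tryCatch(
  {
    ggplot_build(p)
    "built without error"
  },
  error = function(e) "error"
)

print(shape_result)
```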
Mapping the same variable to multiple aesthetics works: here both the size and the colour of a point vary with displ. The graph highlights the pattern, but the two aesthetics carry the same information, so one of them is redundant.
ggplot(data = mpg) +
geom_point(aes(x = displ, y = hwy, color = displ, size = displ))
The stroke aesthetic modifies the width of a point’s border. It only works for shapes that have a border, such as shapes 21 to 25, for which you can colour the inside (fill) and the outside (colour) separately. Example:
ggplot(mtcars, aes(wt, mpg)) +
geom_point(shape = 21, colour = "black", fill = "white", size = 5, stroke = 5)
ggplot(data = mpg) +
geom_point(aes(x = displ, y = hwy, colour = displ < 5))
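The expression displ < 5 is evaluated for every row and gives a logical value, TRUE or FALSE. ggplot2 maps that logical variable to colour, so the points with displ below 5 get one colour and all the other points get another.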
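To see the temporary variable that the expression creates, this sketch computes it by hand and counts the colours in the built plot. The names is_small and n_colours are mine.

```r
library(ggplot2)

# The same test that aes() evaluates: one TRUE/FALSE per car
is_small <- mpg$displ < 5
print(table(is_small))

p <- ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy, colour = displ < 5))

# Number of distinct colours ggplot2 assigned to the points
n_colours <- length(unique(layer_data(p)$colour))
print(n_colours)
```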
How to get help
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~class, nrow = 2)
Facet with 2 variables:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~year, nrow = 2)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~displ, nrow = 2)
Faceting on a continuous variable treats it like a categorical one: ggplot2 creates a separate “box” (facet) for each distinct value of the variable and displays the matching data points in it. By doing so, you risk having too many facets, as in the case above with displ.
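A quick sketch to confirm that there is one facet per distinct value. It assumes that the object returned by ggplot_build() exposes the panel table as $layout$layout, which is an internal detail of ggplot2 and could change; n_values and n_panels are names I picked.

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~displ, nrow = 2)

# Distinct values of the continuous variable
n_values <- length(unique(mpg$displ))

# One row per panel in the built layout (internal structure, an assumption)
n_panels <- nrow(ggplot_build(p)$layout$layout)

print(c(values = n_values, panels = n_panels))
```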
ggplot(data = mpg) +
geom_point(mapping = aes(x = drv, y = cyl))
An empty cell in the plot with facet_grid(drv ~ cyl) means that this combination of drv and cyl does not exist in the data set. This relates to the plot above, which shows a point for every combination that does exist and so helps identify the combinations that are missing.
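The same information can be read from a cross table. This is only a sketch in base R; combos and n_empty are my own names.

```r
library(ggplot2) # only needed for the mpg data set

# Count the cars for every drv / cyl combination
combos <- table(drv = mpg$drv, cyl = mpg$cyl)
print(combos)

# Cells with a count of 0 are the empty facets
n_empty <- sum(combos == 0)
print(n_empty)
```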
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)
When the . is on the right (drv ~ .), the plot is faceted into rows: one panel for each value of drv, stacked vertically. When the . is on the left (. ~ cyl), the plot is faceted into columns: one panel for each value of cyl, placed side by side.
Basically, the . indicates “no facet in this dimension”. As such, when it is on the right only row facets can be created, and when it is on the left only column facets can be created.
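To check which dimension gets the facets, this sketch reads the ROW and COL positions of the panels. As an assumption, it relies on the internal $layout$layout table of the built plot; base, rows_layout and cols_layout are names I chose.

```r
library(ggplot2)

base <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Panel positions for each formula (internal structure, an assumption)
rows_layout <- ggplot_build(base + facet_grid(drv ~ .))$layout$layout
cols_layout <- ggplot_build(base + facet_grid(. ~ cyl))$layout$layout

# drv ~ . : several rows, a single column
print(rows_layout[, c("PANEL", "ROW", "COL")])

# . ~ cyl : a single row, several columns
print(cols_layout[, c("PANEL", "ROW", "COL")])
```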
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?
Using faceting instead of the colour aesthetic gives a clearer view of each distinct group. It is easier to see each group’s pattern with facets since there is less clutter and overlapping, making it easier to read each point.
However, the disadvantage of faceting is that it is more difficult to compare values across groups, because they are in different facets. The comparison is less obvious than with one panel and colours. Also, when there are a lot of facets, it can be difficult to make them fit in a document or on a page.
If I had a larger dataset, facets would become more valuable since there is a higher chance of overlapping points and clutter. Also, the more groups, the more colours, which would make the plot even more difficult to read. Thus, facets would allow a clearer view of the data for each group.
nrow: Controls the number of rows
ncol: Controls the number of columns
Other options that control the layout of the individual panels are: scales, dir, and strip.position.
facet_grid() doesn’t have nrow and ncol arguments because the number of rows and columns is already determined by the number of unique values of the two faceting variables, thus we don’t have to set them manually.
Having more levels in the columns works better on a screen, since most screens are wider than they are tall, making it a better use of space to have the data appear horizontally.
different visual object to represent data
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
not every aesthetic works with every geom
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
two geoms in the same graph!
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth()
local vs. global mappings: this makes it possible to display different aesthetics in different layers.
specify different data for each layer
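As a sketch of layer-specific data, the smooth line below is fitted only to a subset while the points use the full data set. The choice of the "subcompact" class and the name subcompact are just for illustration, and I set method and formula explicitly to avoid the message about defaults.

```r
library(ggplot2)

# Subset used only by the smoothing layer
subcompact <- subset(mpg, class == "subcompact")

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  # The local data argument overrides the global data for this layer only
  geom_smooth(data = subcompact, method = "loess", formula = y ~ x, se = FALSE)

# The point layer still uses every row of mpg
n_point_rows <- nrow(layer_data(p, 1))
print(c(points = n_point_rows, smooth_input = nrow(subcompact)))
```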
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, color = drv), show.legend = FALSE)
Line chart: geom_line
Boxplot: geom_boxplot
Area Chart: geom_area
Prediction: A graph with points and curves
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
Predictions confirmed.
show.legend = FALSE disables the display of the legend for the layer in which it is specified. I think you used it earlier in the chapter to help the students focus on the data, since it gives a clearer view, making it easier to learn in that context.
The se argument controls whether the confidence interval is displayed around the line. If it is TRUE (the default), the band is displayed; if it is FALSE, only the line is left, making the graph cleaner.
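A minimal sketch to see what se adds to the layer: with se = TRUE the computed data should contain the lower and upper limits of the band. The names with_band, without_band and has_band are mine, and method and formula are set explicitly only to silence the message about defaults.

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy))

with_band <- base +
  geom_smooth(method = "loess", formula = y ~ x, se = TRUE)
without_band <- base +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)

# ymin and ymax are the limits of the band drawn around the line
has_band <- all(c("ymin", "ymax") %in% names(layer_data(with_band)))
print(has_band)

# Columns computed when the band is switched off, for comparison
print(names(layer_data(without_band)))
```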
No, because they say the same thing: the two ways of organizing the code ask for the same result. In the first option, the data and mapping are given once in ggplot(), and the empty () of the two geoms means that each one inherits them. In the second option, the same data and mapping are repeated inside each geom.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
ggplot() +
geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
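To confirm that the two versions are equivalent, this sketch compares the data computed for each layer. The names p_global, p_local, same_points and same_curve are mine; method and formula are fixed so that both plots fit exactly the same curve without messages.

```r
library(ggplot2)

# Data and mapping declared once, inherited by both geoms
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x)

# Data and mapping repeated inside each geom
p_local <- ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy),
              method = "loess", formula = y ~ x)

# Compare the point positions (layer 1) and the fitted curve (layer 2)
same_points <- isTRUE(all.equal(
  layer_data(p_global, 1)[, c("x", "y")],
  layer_data(p_local, 1)[, c("x", "y")]
))
same_curve <- isTRUE(all.equal(
  layer_data(p_global, 2)$y,
  layer_data(p_local, 2)$y
))
print(c(points = same_points, curve = same_curve))
```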
ggplot(mpg, aes(displ, hwy)) +
geom_point(size = 5) +
geom_smooth(se = FALSE, linewidth = 2)
ggplot(mpg, aes(displ, hwy, color = drv)) +
geom_point(color = "black", size = 5) +
geom_smooth(se = FALSE, linewidth = 2)
ggplot(mpg, aes(displ, hwy, color = drv)) +
geom_point(size = 5) +
geom_smooth(se = FALSE, linewidth = 2)
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = drv), size = 5) +
geom_smooth(se = FALSE, linewidth = 2)
ggplot(mpg, aes(displ, hwy, color = drv)) +
geom_point(size = 5) +
geom_smooth(aes(color = drv), se = FALSE, linewidth = 2)
# drv only exists in mpg, and fill has to be mapped inside aes()
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(fill = drv), shape = 21, colour = "white", size = 5, stroke = 5)
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
demo <- tribble(
~cut, ~freq,
"Fair", 1610,
"Good", 4906,
"Very Good", 12082,
"Premium", 13791,
"Ideal", 21551)
ggplot(data = demo) +
geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")
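As a sketch of what stat = "identity" does, the bar heights below are compared with the values stored in the table, and with geom_col(), the ggplot2 geom that uses the values as they are. The names p_identity, p_col, heights_identity and heights_col are mine.

```r
library(ggplot2)
library(tibble)

demo <- tribble(
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551
)

# stat = "identity": use freq as the bar height instead of counting rows
p_identity <- ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")

# geom_col() takes the heights directly from the data as well
p_col <- ggplot(data = demo) +
  geom_col(mapping = aes(x = cut, y = freq))

heights_identity <- layer_data(p_identity)$y
heights_col <- layer_data(p_col)$y
print(heights_identity)
```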
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
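To see why group = 1 is needed in the proportion bar chart, this sketch compares the computed proportions with and without it. The names p_grouped, p_ungrouped, prop_grouped and prop_ungrouped are mine, and it uses after_stat(), which assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

# With group = 1, proportions are computed over the whole data set
p_grouped <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

# Without it, each cut is its own group, so each bar is 100% of its group
p_ungrouped <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

prop_grouped <- layer_data(p_grouped)$prop
prop_ungrouped <- layer_data(p_ungrouped)$prop

print(prop_grouped)   # shares that add up to 1
print(prop_ungrouped) # every bar equal to 1
```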
ggplot(data = diamonds) +
stat_summary(mapping = aes(x = cut, y = depth), fun.min = min, fun.max = max, fun = median)
adjustments for bar charts
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
Others:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, colour = cut))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
geom_bar(alpha = 1/5, position = "identity")
ggplot(data = diamonds, mapping = aes(x = cut, colour = clarity)) +
geom_bar(fill = NA, position = "identity")
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
adjustments for scatterplots
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
switch x and y
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_flip()
set the aspect ratio correctly for maps
Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut), show.legend = FALSE, width = 1) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut), show.legend = FALSE, width = 1) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL) + coord_flip()
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut), show.legend = FALSE, width = 1) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL) + coord_polar()
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut)) +
coord_polar()
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point()
The problem with this plot is that there is not much information to take from it. As such, viewers are limited in their interpretation of the graph: they can’t learn about a specific vehicle’s class or characteristics. In addition, cty and hwy are whole numbers, so many points sit on top of each other and the plot hides how many cars share the same values.
How I would improve it: I would add a line to show the trend more clearly and I would add colours to show the categories clearly.
ggplot(data = mpg, aes(x = cty, y = hwy)) +
geom_point(aes(color = class)) +
geom_smooth(method = "lm", se = FALSE)
In geom_jitter(), the width and height arguments control the amount of jitter.
geom_jitter() shows frequency (overlapping) by adding a small amount of random noise to each point. It spreads the points in width and height.
geom_count() shows frequency without moving the points; it only changes the size of each point to show how many observations overlap there.
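A short sketch of both geoms on the cty / hwy plot. The jitter amounts (0.3) are arbitrary values I picked for illustration, and count_data, total and n_positions are my own names.

```r
library(ggplot2)

base <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy))

# Jitter: width and height set how far points may move on each axis
p_jitter <- base + geom_jitter(width = 0.3, height = 0.3)

# Count: points stay in place, size shows how many cars share a position
p_count <- base + geom_count()

# n is the number of observations at each distinct position
count_data <- layer_data(p_count)
total <- sum(count_data$n)
n_positions <- nrow(count_data)

print(c(cars = total, distinct_positions = n_positions))
```

Fewer distinct positions than cars means that some points overlap in the original scatterplot.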
The default position for geom_boxplot() is “dodge2”.
ggplot(mpg, aes(x = class, y = hwy, fill = drv)) +
geom_boxplot()
ggplot(mpg, aes(x = class, y = hwy, fill = drv)) +
geom_boxplot(position = "dodge2")
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut), show.legend = FALSE,width = 1) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
bar <- ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut), show.legend = FALSE,width = 1) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
bar + coord_polar()
labs() adds titles to the plot, the axes and the legends. Basically, it labels everything you could want to label.
coord_quickmap(): creates a quick visualization of a map, as it sets the aspect ratio correctly for maps. It is approximate, but it is created faster.
coord_map(): gives a more precise map, as it projects the roughly spherical earth onto a flat 2D plane using a map projection (Mercator by default). It is slower than coord_quickmap().
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline() +
coord_fixed()
It tells me that cars generally get better mileage on highways than in the city, as the points are above the line. The coord_fixed() is important as it sets a 1:1 ratio, so one unit on the x axis is as long as one unit on the y axis and the line is drawn at 45 degrees. The geom_abline() creates the reference line, which shows where the mileage would be the same in cities and on highways.
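The reading of the plot can be checked with a quick calculation: the share of cars whose highway mileage is higher than their city mileage. This is only a sketch; above_line and share_above are names I chose.

```r
library(ggplot2) # only needed for the mpg data set

# TRUE when a car sits above the reference line y = x
above_line <- mpg$hwy > mpg$cty

# Share of cars with better mileage on the highway than in the city
share_above <- mean(above_line)
print(share_above)
```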
The grammar of graphics is based on the insight that you can uniquely describe any plot as a combination of: