library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
mpg
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # … with 224 more rows
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
The mpg data frame has 234 rows and 11 columns.
drv: “the type of drive train, where f = front-wheel drive, r = rear wheel drive, 4 = 4wd”
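The points are not blue because color = "blue" sits inside aes(). ggplot2 then treats "blue" as a categorical variable with a single value rather than as a colour, draws the points in the first colour of its default discrete palette, and creates a legend with the single entry "blue". To make the points blue, color = "blue" has to be set outside aes().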
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
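For comparison, here is a minimal sketch of the corrected version, where the colour is set as a constant outside aes(). The name p_blue is only for illustration.

```r
library(ggplot2)

# Setting (not mapping) the colour: the constant goes outside aes(),
# so every point is drawn in blue and no legend is created.
p_blue <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
p_blue
```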
* Categorical variables: manufacturer, model, trans, drv, fl, and class.
* Continuous variables: displ, year, cyl, cty, and hwy.
* Run mpg and look at the type printed under each column name: <chr> columns are categorical, while <dbl> and <int> columns are continuous.
A continuous variable cannot be mapped to the shape aesthetic (ggplot2 raises an error), but it can be mapped to color, which creates a gradient, and to size (see below). A categorical variable can be mapped to color and shape; mapping it to size works but triggers a warning.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = cyl, size= cty))
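A small sketch of the shape case, assuming the error is raised when the plot is built (so ggplot_build() is used here to trigger it without printing). The names p_bad, shape_result, and p_ok are only for illustration.

```r
library(ggplot2)

# Mapping the continuous variable cyl to shape fails when the plot is built.
p_bad <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = cyl))
shape_result <- tryCatch(ggplot_build(p_bad), error = function(e) e)
inherits(shape_result, "error")

# Treating cyl as categorical with factor() makes the mapping valid.
p_ok <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = factor(cyl)))
p_ok
```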
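Mapping the same categorical variable to multiple aesthetics is redundant, and ggplot2 merges them into a single legend. Here the shape palette supplies only six possible shapes, so the points for the other nine manufacturers are not drawn (see the warnings below).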
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = manufacturer, size= manufacturer, shape=manufacturer))
## Warning: Using size for a discrete variable is not advised.
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 15. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 112 rows containing missing values (geom_point).
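The warning suggests specifying shapes manually. One way to do that, as a sketch, is scale_shape_manual() with one shape code per manufacturer; the choice of shape codes 1 to 15 and the names n_makes and p_shapes are my own assumptions.

```r
library(ggplot2)

# Number of distinct manufacturers (15 according to the warning above).
n_makes <- length(unique(mpg$manufacturer))

# Supply one shape code per manufacturer so that no rows are dropped.
p_shapes <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = manufacturer)) +
  scale_shape_manual(values = seq_len(n_makes))
p_shapes
```

The plot is still hard to read with this many shapes, which is why the warning advises against it.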
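The stroke aesthetic controls the width of the border drawn around each point. Mapping a continuous variable such as cyl to stroke makes the border thicker as the value increases, so the points grow and overlap, which creates a Rorschach style graph.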
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, stroke =cyl))
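The border is easier to see with a shape that has a separate fill. This is a sketch using shape 21 (a filled circle); the fill and colour values and the name p_stroke are only illustrative choices.

```r
library(ggplot2)

# Shape 21 has a border (colour) and an interior (fill),
# so the effect of stroke on the border width is visible.
p_stroke <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy, stroke = cyl),
    shape = 21, fill = "white", colour = "black"
  )
p_stroke
```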
Mapping colour to a logical expression such as displ < 5 evaluates the expression for every row and colours each point by whether the result is TRUE or FALSE, with a legend showing those two values.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, colour = displ < 5))
When you facet on a continuous variable, you get one panel for every distinct value of that variable. This works for year, which has only two values here, but a variable with many distinct values produces many panels with few points each, so categorical variables are better for faceting.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ year, nrow = 2)
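If a continuous variable with many values has to be faceted, one option is to bin it first. This sketch uses cut_width() from ggplot2 with a bin width of 1 litre; the bin width and the name p_binned are my own choices.

```r
library(ggplot2)

# Number of panels that faceting on raw displ would produce
# (one per distinct value).
length(unique(mpg$displ))

# Binning displ into 1-litre intervals first gives only a handful of panels.
p_binned <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ cut_width(displ, 1), nrow = 2)
p_binned
```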
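facet_grid(drv ~ cyl) lays the panels out in a grid with one row per value of drv (the type of drive train, where f = front-wheel drive, r = rear wheel drive, 4 = 4wd) and one column per value of cyl (number of cylinders); the “~” separates the row variable from the column variable. Empty panels are combinations of drv and cyl that have no observations in mpg.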
ggplot(data = mpg) +
geom_point(mapping = aes(x = drv, y = cyl))+
facet_grid(drv ~ cyl)
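The empty panels can be checked against the data directly. This sketch cross-tabulates the two variables with base R's table(); the name combos is only for illustration.

```r
library(ggplot2)

# Cross-tabulate drive train against cylinder count. A zero marks a
# combination with no cars, which is an empty panel in facet_grid(drv ~ cyl).
combos <- table(drv = mpg$drv, cyl = mpg$cyl)
combos
```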
Code #1 plots engine displacement (in litres) against highway miles per gallon. facet_grid(drv ~ .) facets by drv in the rows, and the . means there is no faceting variable in the column dimension.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
Code #2 plots engine displacement (in litres) against highway miles per gallon. facet_grid(. ~ cyl) facets by number of cylinders in the columns, and the . means there is no faceting variable in the row dimension.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
Advantages of using faceting instead of the colour aesthetic:
1. Each group gets its own subplot, which makes the data visualization easier to comprehend.
2. Each panel shows one specific group, so groups do not hide one another.
3. It is colourblind friendly :)
4. It is easier to control how the labels are displayed on the panels and axes.
Disadvantages of using faceting instead of the colour aesthetic:
1. There is no legend.
2. There are no colours, so groups sit in separate panels and are harder to compare directly.
The balance might change with a larger dataset: overlapping points make colours harder to tell apart, which favours faceting, while colour remains useful for showing a third variable on shared axes.
nrow and ncol set the number of rows and columns of panels. Other options control the layout of the individual panels: switch (moves the strip labels to a different side of the panels), dir (lays the panels out horizontally or vertically), strip.position (controls the position of the labels), shrink (if TRUE, shrinks scales to fit the output of statistics, not the raw data), and labeller (a function that controls how the strip labels are formatted).
facet_grid() does not have nrow and ncol because it forms a matrix whose numbers of rows and columns are determined by the distinct values of the row and column faceting variables, rather than wrapping the levels of a single variable.
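As a sketch of these options, facet_wrap() can be given ncol together with strip.position. The values chosen here (4 columns, labels at the bottom) and the name p_wrap are only illustrative.

```r
library(ggplot2)

# Wrap the panels for class into 4 columns and move the
# strip labels below each panel.
p_wrap <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, ncol = 4, strip.position = "bottom")
p_wrap
```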
Multiple variables in columns can be labeled with labeller.
Line chart is geom_line, a boxplot is geom_boxplot, a histogram is geom_histogram, and an area chart is geom_area.
I predict that the geom_smooth(se = FALSE) will create smooth lines for the graph.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
show.legend = FALSE removes the legend. I think it was removed earlier in the chapter to cleanly display the 3 graphs.
It aids the eyes in seeing patterns if the graph is over plotted.
No, they will not look different: the second code redundantly repeats in each geom the same data and mapping that the first code sets once in ggplot().
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot() +
geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, group = drv)) +
geom_point()+
geom_smooth(se= FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = drv)) +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, linetype = drv)) +
geom_point(mapping = aes(color=drv))+
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(size = 4, color = "white") +
geom_point(aes(colour = drv))
The default geom associated with stat_summary() is geom_pointrange.
ggplot(data = diamonds) +
geom_pointrange(mapping = aes (x = cut, y = depth), stat= "summary", fun.min = min, fun.max = max, fun = median)
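The same plot written the other way round, with stat_summary() and its default geom, would look roughly like this. The name p_summary is only for illustration.

```r
library(ggplot2)

# stat_summary() uses geom_pointrange by default, so this draws the
# median depth per cut with a range from the minimum to the maximum.
p_summary <- ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min, fun.max = max, fun = median
  )
p_summary
```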
geom_col() draws a bar chart in which the heights of the bars represent values in the data. geom_col() is different from geom_bar() in that it uses stat_identity(), which leaves the data as is, whereas geom_bar() uses stat_count() to count the rows in each group.
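To see the difference, this sketch counts the diamonds per cut by hand and passes the counts to geom_col(), which should give the same bars as geom_bar() on the raw data. The names cut_counts, n, p_bar, and p_col are my own.

```r
library(ggplot2)

# Pre-compute the counts that geom_bar() would otherwise calculate itself.
cut_counts <- as.data.frame(table(cut = diamonds$cut), responseName = "n")

# geom_bar() counts rows via stat_count(); geom_col() plots n as given.
p_bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
p_col <- ggplot(data = cut_counts) +
  geom_col(mapping = aes(x = cut, y = n))
p_bar
p_col
```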
| geom | stat |
|---|---|
| geom_bar() | stat_count() |
| geom_bin_2d() | stat_bin_2d() |
| geom_boxplot() | stat_boxplot() |
| geom_contour_filled() | stat_contour_filled() |
| geom_contour() | stat_contour() |
| geom_count() | stat_sum() |
| geom_density_2d() | stat_density_2d() |
| geom_density() | stat_density() |
| geom_dotplot() | stat_bindot() |
| geom_function() | stat_function() |
| geom_sf() | stat_sf() |
| geom_smooth() | stat_smooth() |
| geom_violin() | stat_ydensity() |
| geom_hex() | stat_bin_hex() |
| geom_qq_line() | stat_qq_line() |
| geom_qq() | stat_qq() |
| geom_quantile() | stat_quantile() |

stat_smooth() computes these variables:

| variable | calculation |
|---|---|
| y or x | predicted value |
| ymin or xmin | lower pointwise confidence interval around the mean |
| ymax or xmax | upper pointwise confidence interval around the mean |
| se | standard error |

These parameters control its behavior:

| parameter | behavior |
|---|---|
| method | Smoothing method (function) to use, accepts either NULL or a character vector, e.g. “lm”, “glm”, “gam”, “loess” or a function, e.g. MASS::rlm or mgcv::gam, stats::lm, or stats::loess. “auto” is also accepted for backwards compatibility. It is equivalent to NULL. |
| formula | Formula to use in the smoothing function, e.g. y ~ x, y ~ poly(x, 2), y ~ log(x). NULL by default, in which case the formula is chosen to suit the method. |
| se | Display confidence interval around smooth? (TRUE by default; FALSE displays only the line.) |
| na.rm | If FALSE, the default, missing values are removed with a warning. If TRUE, missing values are silently removed. |
| method.args | List of additional arguments passed on to the modelling function defined by method. |
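As a sketch of how these parameters fit together, the smoother can be switched to a straight-line fit. Choosing "lm" with y ~ x is only an example, and p_lm is an illustrative name.

```r
library(ggplot2)

# Fit a linear model instead of the default loess and keep the
# confidence band (se = TRUE is the default, shown here explicitly).
p_lm <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE)
p_lm

# The variables computed by the stat are columns of the layer data.
names(layer_data(p_lm, 2))
```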
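Without group = 1, geom_bar() treats each x value as its own group and computes prop within it, so each cut makes up 100% of its own group and every bar has the same height of 1. Setting group = 1 makes the proportions relative to the whole dataset.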
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = after_stat(prop)))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = color, y = after_stat(prop)))
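A sketch of one way to fix both plots. The first adds group = 1; for the filled version I am assuming that computing the share by hand with after_stat(count / sum(count)) is acceptable, since group = 1 would override the grouping by fill. The names p_prop and p_prop_fill are only for illustration.

```r
library(ggplot2)

# group = 1 puts all rows in one group, so prop is the share of each cut
# in the whole dataset.
p_prop <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
p_prop

# With fill, divide each count by the total count instead.
p_prop_fill <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = color,
                         y = after_stat(count / sum(count))))
p_prop_fill
```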
The problem with this plot is overplotting: cty and hwy are rounded to whole numbers, so many observations fall on the same point and are drawn on top of each other.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point()
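As instructed in the chapter, I would use the jitter position adjustment to reduce the overplotting. Jittering adds a small amount of random noise to each point, so individual positions are slightly less exact, but the plot shows where the observations are concentrated much more clearly than the first one.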
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point(position="jitter")
| parameter | behavior |
|---|---|
| width | Amount of horizontal jitter. The jitter is added in both positive and negative directions, so the total spread is twice the value specified here. If omitted, defaults to 40% of the resolution of the data: this means the jitter values will occupy 80% of the implied bins. Categorical data is aligned on the integers, so a width of 0.5 will spread the data so it’s not possible to see the distinction between the categories. |
| height | Amount of vertical jitter. The jitter is added in both positive and negative directions, so the total spread is twice the value specified here. If omitted, defaults to 40% of the resolution of the data: this means the jitter values will occupy 80% of the implied bins. Categorical data is aligned on the integers, so a height of 0.5 will spread the data so it’s not possible to see the distinction between the categories. |

| geom_jitter() | geom_count() |
|---|---|
| reduces overplotting by adding random noise to each point | reduces overplotting by combining overlapping points into one bigger point |
| scatter plot | scatter plot |
| point locations change slightly | point locations unchanged |
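Both approaches side by side, as a sketch. The jitter amounts of 0.25 are arbitrary values chosen for illustration, as are the names p_jitter and p_count.

```r
library(ggplot2)

# geom_jitter(): move each point by at most 0.25 in each direction.
p_jitter <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.25, height = 0.25)
p_jitter

# geom_count(): keep the locations and size each point by the number of
# observations that share it.
p_count <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_count()
p_count
```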
The default position adjustment for geom_boxplot() is “dodge2”.
ggplot(data = mpg, aes( x= drv, y = hwy, colour = manufacturer))+
geom_boxplot()
ggplot(mpg, aes(x = factor(1), fill = drv))+
geom_bar(width=1) +
coord_polar(theta ="y")
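labs() sets the labels of a plot: the title, subtitle, caption, axis labels, and legend titles.
coord_map() is different from coord_quickmap() because it projects the spherical earth onto a 2D plane using a map projection (Mercator by default). coord_quickmap() is a faster approximation that ignores the curvature of the earth, so it works best for small areas close to the equator.
In the plot below, every point lies above the reference line, so hwy is always larger than cty. coord_fixed() fixes the aspect ratio so that one unit on the x axis is the same length as one unit on the y axis, which draws the reference line at 45 degrees. geom_abline() adds a reference line (sometimes called a rule) specified by slope and intercept; with no arguments it uses an intercept of 0 and a slope of 1.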
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline() +
coord_fixed()
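The i in my_varıable is not actually an i: it is a dotless ı, so R treats my_varıable as a different name from my_variable and cannot find the object.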
#my_variable <- 10
#my_varıable
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
filter(mpg, cyl == 8)
## # A tibble: 70 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a6 quattro 4.2 2008 8 auto… 4 16 23 p mids…
## 2 chevrolet c1500 sub… 5.3 2008 8 auto… r 14 20 r suv
## 3 chevrolet c1500 sub… 5.3 2008 8 auto… r 11 15 e suv
## 4 chevrolet c1500 sub… 5.3 2008 8 auto… r 14 20 r suv
## 5 chevrolet c1500 sub… 5.7 1999 8 auto… r 13 17 r suv
## 6 chevrolet c1500 sub… 6 2008 8 auto… r 12 17 r suv
## 7 chevrolet corvette 5.7 1999 8 manu… r 16 26 p 2sea…
## 8 chevrolet corvette 5.7 1999 8 auto… r 15 23 p 2sea…
## 9 chevrolet corvette 6.2 2008 8 manu… r 16 26 p 2sea…
## 10 chevrolet corvette 6.2 2008 8 auto… r 15 25 p 2sea…
## # … with 60 more rows
filter(diamonds, carat > 3)
## # A tibble: 32 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 3.01 Premium I I1 62.7 58 8040 9.1 8.97 5.67
## 2 3.11 Fair J I1 65.9 57 9823 9.15 9.02 5.98
## 3 3.01 Premium F I1 62.2 56 9925 9.24 9.13 5.73
## 4 3.05 Premium E I1 60.9 58 10453 9.26 9.25 5.66
## 5 3.02 Fair I I1 65.2 56 10577 9.11 9.02 5.91
## 6 3.01 Fair H I1 56.1 62 10761 9.54 9.38 5.31
## 7 3.65 Fair H I1 67.1 53 11668 9.53 9.48 6.38
## 8 3.24 Premium H I1 62.1 58 12300 9.44 9.4 5.85
## 9 3.22 Ideal I I1 62.6 55 12545 9.49 9.42 5.92
## 10 3.5 Ideal H I1 62.8 57 12587 9.65 9.59 6.03
## # … with 22 more rows