Part #1 Chapter.3 HW

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(ggplot2)

mpg

## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows

summary(mpg)

##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00

Part 1 Chapter 3 Due Friday midnight

3.2.4 Exercises

2. How many rows are in mpg? How many columns?

234 rows & 11 columns

3. What does the drv variable describe? Read the help for ?mpg to find out.

DRV: “the type of drive train, where f = front-wheel drive, r = rear wheel drive, 4 = 4wd”

3.3.1 Exercises

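1. What’s gone wrong with this code? Why are the points not blue?

The color is set inside aes(), so ggplot2 treats “blue” as a variable with a single value rather than as a color. It maps that value to the first color of its default palette (a salmon red) and draws a legend for it. To get blue points, the color has to be set outside aes(), as an argument of geom_point().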

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
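To make the points blue, the color can be set as a fixed value outside aes(). A minimal sketch of that fix; the object name p_blue is only for illustration:

```r
library(ggplot2)

# Setting (not mapping) the colour: the constant goes outside aes()
p_blue <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
p_blue
```

With the color set this way, no legend is drawn because nothing is mapped to a variable.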

2. Which variables in mpg are categorical?

*Categorical variables: manufacturer, model, trans, drv, fl, & class.

Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset).

*Continuous variables: displ, year, cyl, cty, & hwy.

How can you see this information when you run mpg?

*Run mpg and read the type abbreviation printed under each column name: <chr> (character), <int> (integer), or <dbl> (double). The character variables are categorical, and the integer and double variables are numeric, which is how continuous variables are stored.
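The column types can also be listed directly instead of being read off the printed tibble. A small sketch using base R's sapply(); the name col_types is only illustrative:

```r
library(ggplot2)  # provides the mpg dataset

# One class per column: "character" columns are categorical,
# "integer" and "numeric" (double) columns hold numbers
col_types <- sapply(mpg, class)
col_types
```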

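3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?

A continuous variable cannot be mapped to shape; ggplot2 stops with an error because shapes have no natural order. Mapped to color it produces a gradient, and mapped to size it produces points whose size grows with the value (see below). A categorical variable can be mapped to color (one distinct color per level) and to shape (one shape per level); it can also be mapped to size, but ggplot2 warns that this is not advised.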

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = cyl, size = cty))

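4. What happens if you map the same variable to multiple aesthetics?

Mapping the same variable to several aesthetics is redundant: every aesthetic encodes the same information, and ggplot2 merges them into a single legend. In the example below manufacturer is mapped to color, size, and shape. The shape palette only offers six shapes, so the points of the other nine manufacturers are dropped (the 112 removed rows in the last warning), and ggplot2 also warns that size should not be used for a discrete variable.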

ggplot(data = mpg) + 
  geom_point(
    mapping = aes(
      x = displ, y = hwy,
      color = manufacturer, size = manufacturer, shape = manufacturer
    )
  )

## Warning: Using size for a discrete variable is not advised.

## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 15. Consider
## specifying shapes manually if you must have them.

## Warning: Removed 112 rows containing missing values (geom_point).

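5. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)

Stroke sets the width of a point’s border. It is meant for shapes that have a border, such as the filled shapes 21 to 25, where the inside (fill) and the outline (colour) can be colored separately. In the code below stroke is mapped to cyl, so the border gets thicker as the number of cylinders grows; the thick points overlap and the plot turns into a Rorschach-style blot.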

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, stroke = cyl))
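The effect of stroke is easier to see when it is set to a fixed value on a shape with a border. A hedged sketch; the shape number, sizes, and colors are arbitrary choices for illustration:

```r
library(ggplot2)

# shape = 21 is a filled circle with a separate border;
# stroke sets the border width, fill and colour are set separately
p_stroke <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    shape = 21, colour = "black", fill = "white", size = 3, stroke = 2
  )
p_stroke
```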

6. What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Note, you’ll also need to specify x and y.

The expression displ < 5 is evaluated for every row and gives a logical value. ggplot2 maps this temporary variable to colour, so the points are drawn in one of two colors and the legend shows FALSE and TRUE.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, colour = displ < 5))

3.5.1 Exercises

1. What happens if you facet on a continuous variable?

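When you facet on a continuous variable, ggplot2 converts it to a categorical variable and draws one panel for every distinct value. With many distinct values this gives a large number of nearly empty panels that are hard to read, so categorical variables are better for faceting. The example below stays readable only because year has just two distinct values (1999 and 2008).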

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ year, nrow = 2)
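For comparison, faceting on a variable with more distinct values shows the one-panel-per-value behavior more clearly. A sketch that facets on cty and counts the resulting panels; the names p_facet and n_panels are illustrative only:

```r
library(ggplot2)

# cty is numeric; facet_wrap() treats each distinct value as its own panel
p_facet <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ cty)

# Number of panels in the built plot versus number of distinct cty values
n_panels <- nrow(ggplot_build(p_facet)$layout$layout)
n_panels == length(unique(mpg$cty))
```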

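2. What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?

facet_grid(drv ~ cyl) puts drv (the type of drive train, where f = front-wheel drive, r = rear wheel drive, 4 = 4wd) in the rows and cyl (number of cylinders) in the columns; the “~” separates the row variable from the column variable. The empty cells are combinations of drv and cyl with no observations in mpg: there are no rear-wheel-drive cars with 4 or 5 cylinders and no 4wd cars with 5 cylinders. In a plot of drv against cyl these are the same combinations where no point is drawn.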

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = drv, y = cyl)) +
  facet_grid(drv ~ cyl)
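One way to check which combinations exist is to count them. A minimal sketch with dplyr::count(); the name combos is only for illustration:

```r
library(ggplot2)  # mpg dataset
library(dplyr)

# Every drv/cyl combination that actually occurs in the data;
# combinations missing from this table are the empty facets
combos <- count(mpg, drv, cyl)
combos
```

Any drv and cyl pair that does not appear as a row in combos corresponds to an empty cell in the facet grid.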

3. What plots does the following code make? What does . do?

Code #1 plots engine displacement (displ, in litres) against highway miles per gallon (hwy) and facets by drv in the rows. The . is a placeholder meaning “no variable for this dimension”, so the plot is not split into columns.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

Code #2 plots the same two variables and facets by number of cylinders (cyl) in the columns. Here the . is on the row side of the formula, so the plot is not split into rows.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

4. Take the first faceted plot in this section:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

Advantages of faceting instead of the colour aesthetic: 1. Each group gets its own subplot, so the groups do not overlap and the patterns within a group are easier to see. 2. Faceting still works when there are more categories than can be told apart by color. 3. Colorblind friendly :) Disadvantages of faceting instead of the colour aesthetic: 1. The groups sit in separate panels, so it is harder to compare values between groups. 2. No colours, and each panel holds fewer points. With a larger dataset the balance shifts towards faceting, because heavy overplotting makes colours hard to distinguish when all points share one panel.

3.6.1 Exercises

1. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?

Line chart is geom_line, a boxplot is geom_boxplot, a histogram is geom_histogram, and an area chart is geom_area.

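2. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

I predict a scatter plot of hwy against displ with the points colored by drv, plus one smooth line per drv value drawn without a confidence band, because geom_smooth(se = FALSE) inherits the color mapping.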

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

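3. What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter?

It removes the legend for that layer. If you remove the argument, the default is used and a legend is drawn whenever an aesthetic is mapped to a variable. I think you removed the legend earlier in the chapter so that the three graphs shown side by side were cleanly displayed at the same size.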

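4. What does the se argument to geom_smooth() do?

It controls whether the confidence interval is drawn as a shaded band around the smooth line. With se = TRUE (the default) the band is shown; with se = FALSE only the line is drawn.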
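The two settings of se can be compared side by side. A small sketch; the names p_band and p_line are only for illustration:

```r
library(ggplot2)

# se = TRUE (the default) draws the confidence band around the line
p_band <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(se = TRUE)

# se = FALSE draws only the fitted line
p_line <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(se = FALSE)

p_band
p_line
```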

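5. Will these two graphs look different? Why/why not?

No, they will not look different. The first code sets data and mapping once in ggplot(), and both geoms inherit them; the second code repeats the same data and mapping in each geom, which is redundant.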

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot() + 
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

6. Recreate the R code necessary to generate the following graphs.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, group = drv)) +
  geom_point() +
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = drv)) + 
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, linetype = drv)) + 
  geom_point(mapping = aes(color = drv)) +
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(size = 4, color = "white") +
  geom_point(aes(colour = drv))

3.7.1 Exercises

1. What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?

The default geom associated with stat_summary() is geom_pointrange.

ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun.min = min,
    fun.max = max,
    fun = median
  )

2. What does geom_col() do? How is it different to geom_bar()?

geom_col() draws a bar chart in which the heights of the bars represent values that are already in the data. It differs from geom_bar() in its default stat: geom_col() uses stat_identity(), which leaves the data as is, while geom_bar() uses stat_count(), which counts the rows in each group.
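The difference shows up when the same bar chart is built both ways. A hedged sketch on mpg; the names p_bar, drv_counts, and p_col are illustrative only:

```r
library(ggplot2)
library(dplyr)

# geom_bar() counts the rows itself (stat_count)
p_bar <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = drv))

# geom_col() expects the heights to be in the data already (stat_identity)
drv_counts <- count(mpg, drv)   # columns: drv, n
p_col <- ggplot(data = drv_counts) +
  geom_col(mapping = aes(x = drv, y = n))

p_bar
p_col
```

Both plots show the same bars; only the place where the counting happens differs.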

3. Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?

geom	stat
geom_bar()	stat_count()
geom_bin_2d()	stat_bin_2d()
geom_boxplot()	stat_boxplot()
geom_contour_filled()	stat_contour_filled()
geom_contour()	stat_contour()
geom_count()	stat_sum()
geom_density_2d()	stat_density_2d()
geom_density()	stat_density()
geom_dotplot()	stat_bindot()
geom_function()	stat_function()
geom_sf()	stat_sf()
geom_smooth()	stat_smooth()
geom_violin()	stat_ydensity()
geom_hex()	stat_bin_hex()
geom_qq_line()	stat_qq_line()
geom_qq()	stat_qq()
geom_quantile()	stat_quantile()

What the pairs have in common: the names usually match (for example geom_boxplot() and stat_boxplot()), and each geom uses its paired stat by default, and vice versa.

4. What variables does stat_smooth() compute? What parameters control its behaviour?

variable	calculation
y or x	predicted value
ymin or xmin	lower pointwise confidence interval around the mean
ymax or xmax	upper pointwise confidence interval around the mean
se	standard error

Parameter	Behavior
method	Smoothing method (function) to use, accepts either NULL or a character vector, e.g. “lm”, “glm”, “gam”, “loess” or a function, e.g. MASS::rlm or mgcv::gam, stats::lm, or stats::loess. “auto” is also accepted for backwards compatibility. It is equivalent to NULL.
formula	Formula to use in the smoothing function, e.g. y ~ x, y ~ poly(x, 2), y ~ log(x).
se	Display confidence interval around smooth? (TRUE by default; if FALSE, only the line is displayed.)
na.rm	If FALSE, the default, missing values are removed with a warning. If TRUE, missing values are silently removed.
method.args	List of additional arguments passed on to the modelling function defined by method.

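5. In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?

Without group = 1, geom_bar() treats each value of x as its own group and computes prop within that group. Every cut then makes up 100% of its own group, so all the bars have the same height of 1. In the second graph the same happens for every combination of cut and color, and the stacked pieces each have height 1. Setting group = 1 puts all rows into one group, so prop is the share of each cut in the whole dataset.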

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color, y = after_stat(prop)))
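For reference, a sketch of the first graph with group = 1 added, so that the proportions are computed over the whole dataset; the name p_prop is only illustrative:

```r
library(ggplot2)

# group = 1 puts all rows in one group, so prop is each cut's share of the total
p_prop <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
p_prop
```

The bar heights now differ by cut and add up to 1.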

3.8.1 Exercises

1. What is the problem with this plot? How could you improve it?

The problem with this plot is overplotting: cty and hwy are rounded to whole numbers, so many observations share the same position and are drawn on top of each other.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point()

As instructed in the chapter, I would use the jitter position adjustment to reduce the overplotting. Jitter adds a small amount of random noise to each point, which makes the exact positions slightly less accurate but shows how many observations lie in each area.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point(position = "jitter")

2. What parameters to geom_jitter() control the amount of jittering?

parameter	behavior
width	Amount of horizontal jitter. The jitter is added in both positive and negative directions, so the total spread is twice the value specified here. If omitted, defaults to 40% of the resolution of the data: this means the jitter values will occupy 80% of the implied bins. Categorical data is aligned on the integers, so a width or height of 0.5 will spread the data so it’s not possible to see the distinction between the categories.
height	Amount of vertical jitter. The same rules and the same default as for width apply, in the vertical direction.
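The two parameters can be set independently. A minimal sketch that jitters only horizontally; the value 0.25 is an arbitrary choice for illustration:

```r
library(ggplot2)

# Jitter horizontally only: width is in x-axis units,
# height = 0 leaves the y values untouched
p_jitter <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.25, height = 0)
p_jitter
```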

3. Compare and contrast geom_jitter() with geom_count().

geom_jitter()	geom_count()
reduces overplotting by adding random noise to each point	reduces overplotting by combining overlapping points into one bigger point
scatter plot	scatter plot
locations slightly changed	locations unchanged
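The geom_count() side of the comparison can be sketched on the same cty and hwy data used above; the name p_count is only for illustration:

```r
library(ggplot2)

# geom_count() keeps each point at its true location and sizes it by
# the number of observations at that location
p_count <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_count()
p_count
```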

4. What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.

The default position adjustment for geom_boxplot() is “dodge2”.

ggplot(data = mpg, aes(x = drv, y = hwy, colour = manufacturer)) +
  geom_boxplot()

3.9.1 Exercises

1. Turn a stacked bar chart into a pie chart using coord_polar().

ggplot(mpg, aes(x = factor(1), fill = drv)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y")

2. What does labs() do? Read the documentation.

labs() sets the labels of a plot: the title, subtitle, caption, and tag, the axis labels, and the titles of the legends.
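A short sketch of labs() in use; the label texts and the name p_labs are made up for illustration:

```r
library(ggplot2)

# labs() names the title, the axes, and the legend (via the colour aesthetic)
p_labs <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  labs(
    title = "Highway mileage by engine size",
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon",
    colour = "Drive train"
  )
p_labs
```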

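3. What’s the difference between coord_quickmap() and coord_map()?

coord_map() projects a portion of the earth, which is approximately spherical, onto a flat 2D plane using a map projection (Mercator by default). coord_quickmap() is a faster approximation that keeps straight lines straight and ignores the curvature of the earth, so it works best for small areas close to the equator.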

4. What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

In the plot below every point lies above the reference line, so hwy is always larger than cty. coord_fixed() fixes the ratio between the two axes so that one unit on the x axis is as long as one unit on the y axis, and the reference line is drawn at 45 degrees. geom_abline() adds a diagonal reference line specified by slope and intercept; called without arguments, as here, it draws the line with intercept 0 and slope 1.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()

4.4 Exercises

1. Why does this code not work?

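The second line does not work because the name is spelled differently: the i in my_varıable is not actually an i but a dotless ı, so R looks for an object that was never created.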

#my_variable <- 10
#my_varıable
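The difference between the two names is hard to see but easy to test. A small sketch that compares them as strings; good, bad, and same_name are illustrative names, and \u0131 is the escape for the dotless ı:

```r
good <- "my_variable"
bad  <- "my_var\u0131able"   # \u0131 is the dotless i from the exercise

# FALSE: the two names differ by one character
same_name <- identical(good, bad)
same_name

# The two characters have different Unicode code points
utf8ToInt("i")
utf8ToInt("\u0131")
```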

2. Tweak each of the following R commands so that they run correctly:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

filter(mpg, cyl == 8)

## # A tibble: 70 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a6 quattro   4.2  2008     8 auto… 4        16    23 p     mids…
##  2 chevrolet    c1500 sub…   5.3  2008     8 auto… r        14    20 r     suv  
##  3 chevrolet    c1500 sub…   5.3  2008     8 auto… r        11    15 e     suv  
##  4 chevrolet    c1500 sub…   5.3  2008     8 auto… r        14    20 r     suv  
##  5 chevrolet    c1500 sub…   5.7  1999     8 auto… r        13    17 r     suv  
##  6 chevrolet    c1500 sub…   6    2008     8 auto… r        12    17 r     suv  
##  7 chevrolet    corvette     5.7  1999     8 manu… r        16    26 p     2sea…
##  8 chevrolet    corvette     5.7  1999     8 auto… r        15    23 p     2sea…
##  9 chevrolet    corvette     6.2  2008     8 manu… r        16    26 p     2sea…
## 10 chevrolet    corvette     6.2  2008     8 auto… r        15    25 p     2sea…
## # … with 60 more rows

filter(diamonds, carat > 3)

## # A tibble: 32 × 10
##    carat cut     color clarity depth table price     x     y     z
##    <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  3.01 Premium I     I1       62.7    58  8040  9.1   8.97  5.67
##  2  3.11 Fair    J     I1       65.9    57  9823  9.15  9.02  5.98
##  3  3.01 Premium F     I1       62.2    56  9925  9.24  9.13  5.73
##  4  3.05 Premium E     I1       60.9    58 10453  9.26  9.25  5.66
##  5  3.02 Fair    I     I1       65.2    56 10577  9.11  9.02  5.91
##  6  3.01 Fair    H     I1       56.1    62 10761  9.54  9.38  5.31
##  7  3.65 Fair    H     I1       67.1    53 11668  9.53  9.48  6.38
##  8  3.24 Premium H     I1       62.1    58 12300  9.44  9.4   5.85
##  9  3.22 Ideal   I     I1       62.6    55 12545  9.49  9.42  5.92
## 10  3.5  Ideal   H     I1       62.8    57 12587  9.65  9.59  6.03
## # … with 22 more rows

3. Press Alt + Shift + K. What happens? How can you get to the same place using the menus?

It opens a quick reference of the keyboard shortcuts in RStudio. To get to this from the menus, go to Tools and select Keyboard Shortcuts Help.