install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(mgcv)
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## 
## The following object is masked from 'package:dplyr':
## 
##     collapse
## 
## This is mgcv 1.9-3. For overview type 'help("mgcv-package")'.

2.2.1

Question 1: List five functions that you could use to get more information about the mpg dataset

Five functions that can be used to get more information about the mpg dataset are View(), head(), str(), summary(), and names().
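As a minimal sketch of the inspection functions listed above (assuming ggplot2 is installed, since it provides mpg), each call can be run directly on the dataset. The name mpg_dim is introduced here only for illustration.

```r
library(ggplot2)  # provides the mpg dataset

head(mpg)      # first six rows
str(mpg)       # column types and a preview of the values
summary(mpg)   # per-column summary statistics
names(mpg)     # column names

# dim() is one more option: number of rows and columns
mpg_dim <- dim(mpg)
print(mpg_dim)

# ?mpg opens the help page; View(mpg) opens a spreadsheet viewer in RStudio
```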

Question 2: How can you find out what other datasets are included with ggplot2?

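To view all of the datasets included with ggplot2, use data(package = "ggplot2"). Calling data() with no arguments lists the datasets of every attached package instead.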
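A short sketch of listing the datasets programmatically. It relies on the object returned by data() having a results matrix with an "Item" column; the variable names are illustrative.

```r
library(ggplot2)

# data() returns an object whose $results matrix has one row per dataset
ggplot2_data <- data(package = "ggplot2")
dataset_names <- ggplot2_data$results[, "Item"]

print(dataset_names)
```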

Question 3: Apart from the US, most countries use fuel consumption (fuel consumed over fixed distance) rather than fuel economy (distance travelled with fixed amount of fuel). How could you convert cty and hwy into the European standard of l/100km?

mpg$cty_l100km = 235.2 / mpg$cty
mpg$hwy_l100km = 235.2 / mpg$hwy

This converts cty and hwy from miles per gallon to l/100km and adds the results to the dataset as two new columns.
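The constant 235.2 can be derived rather than memorised. The sketch below assumes US gallons (3.785411784 litres) and statute miles (1.609344 km), and writes the result to a copy named mpg_metric (a hypothetical name) so the original data is left untouched.

```r
library(ggplot2)
library(dplyr)

# l/100km = 100 * (litres per gallon) / (km per mile) / mpg
litres_per_gallon <- 3.785411784
km_per_mile <- 1.609344
mpg_to_l100km <- 100 * litres_per_gallon / km_per_mile

# Add the converted columns to a copy of mpg
mpg_metric <- mpg %>%
  mutate(
    cty_l100km = mpg_to_l100km / cty,
    hwy_l100km = mpg_to_l100km / hwy
  )

print(mpg_to_l100km)
```

Note that the relationship is reciprocal, so a high mpg value becomes a low l/100km value.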

Question 4: Which manufacturer has the most models in this dataset? Which model has the most variations? Does your answer change if you remove the redundant specification of drive train (e.g. “pathfinder 4wd”, “a4 quattro”) from the model name?

In the dplyr package, count() tallies the number of rows for each model, and sort = TRUE orders the result from most to least frequent.

mpg %>%
  count(model, sort = TRUE)
## # A tibble: 38 × 2
##    model                   n
##    <chr>               <int>
##  1 caravan 2wd            11
##  2 ram 1500 pickup 4wd    10
##  3 civic                   9
##  4 dakota pickup 4wd       9
##  5 jetta                   9
##  6 mustang                 9
##  7 a4 quattro              8
##  8 grand cherokee 4wd      8
##  9 impreza awd             8
## 10 a4                      7
## # ℹ 28 more rows

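This output shows that the caravan 2wd (11 rows) and the ram 1500 pickup 4wd (10 rows) have the most variations. Both are Dodge models, which suggests that Dodge has the most entries in the dataset. Note that this counts rows per model; counting distinct models per manufacturer is the more direct way to answer the first part of the question.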
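To answer the manufacturer part of the question directly, one option is to count distinct model names per manufacturer. This is a sketch, not part of the original answer; models_per_manufacturer, n_models and n_rows are illustrative names.

```r
library(ggplot2)
library(dplyr)

# One row per manufacturer: number of distinct models and number of rows
models_per_manufacturer <- mpg %>%
  group_by(manufacturer) %>%
  summarise(
    n_models = n_distinct(model),
    n_rows = n()
  ) %>%
  arrange(desc(n_models))

print(models_per_manufacturer)
```

Comparing n_models with n_rows shows whether the manufacturer with the most rows is also the one with the most models.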

The following code removes the redundant drive train specification from the model name and then counts the models again.

mpg %>%
  mutate(model_clean = str_remove_all(model,"4wd| quattro| 2wd")) %>%
  count(model_clean, sort = TRUE)
## # A tibble: 37 × 2
##    model_clean            n
##    <chr>              <int>
##  1 "a4"                  15
##  2 "caravan"             11
##  3 "ram 1500 pickup "    10
##  4 "civic"                9
##  5 "dakota pickup "       9
##  6 "jetta"                9
##  7 "mustang"              9
##  8 "grand cherokee "      8
##  9 "impreza awd"          8
## 10 "camry"                7
## # ℹ 27 more rows

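Yes, the answer changes: once the drive train is removed, the a4 has the most variations (15). The pattern leaves a trailing space on some names (for example "ram 1500 pickup ") and does not strip "awd", but neither affects this result.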
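The trailing spaces visible in the output above come from the pattern removing "4wd" but not the space before it. A possible alternative, offered as a sketch, anchors the pattern at the end of the name and also removes the preceding whitespace. The regular expression is an assumption about which suffixes count as drive train labels.

```r
library(ggplot2)
library(dplyr)
library(stringr)

# Remove a trailing drive train label together with the space before it
model_counts <- mpg %>%
  mutate(model_clean = str_remove(model, "\\s*(4wd|2wd|awd|quattro)$")) %>%
  count(model_clean, sort = TRUE)

print(model_counts)
```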

2.3.1

ggplot(mpg, aes(displ,hwy))+
  geom_point()

Question 1: How would you describe the relationship between cty and hwy? Do you have any concerns about drawing conclusions from that plot?

The plot above shows displ against hwy rather than cty against hwy. From it we can conclude that the smaller the engine, the higher the highway mpg, and that highway mpg decreases as the engine gets larger. Conclusions should be drawn with care because correlation does not imply causation.
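For the two variables named in the question, cty and hwy, a quick numerical check can complement the plot. The sketch below computes their correlation and counts how many cars share each (cty, hwy) pair, which indicates how much overplotting a scatterplot of the two would hide. The variable names are illustrative.

```r
library(ggplot2)
library(dplyr)

# Strength of the linear relationship between city and highway mpg
cty_hwy_cor <- cor(mpg$cty, mpg$hwy)
print(cty_hwy_cor)

# Both variables are integers, so many cars fall on the same coordinates
pair_counts <- count(mpg, cty, hwy, sort = TRUE)
print(pair_counts)
```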

Question 2: What does ggplot(mpg, aes(model, manufacturer)) + geom_point() show? Is it useful? How could you modify the data to make it more informative?

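It shows one point for each combination of model and manufacturer that occurs in the data. The graph is not useful because both variables are categorical, repeated observations are drawn on top of each other, and the x axis labels are unreadable. To make it more informative, you could replace one of the variables with a numerical one, for example a count per manufacturer.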

ggplot(mpg, aes(model, manufacturer)) +
  geom_point()

Question 3: Describe the data, aesthetic mappings and layers used for each of the following plots. You’ll need to guess a little because you haven’t seen all the datasets and functions yet, but use your common sense! See if you can predict what the plot will look like before running the code.

  1. ggplot(mpg, aes(cty, hwy)) + geom_point()

data = mpg dataset

aesthetic mapping = x = cty and y = hwy

layers = scatter plot

Predicting what it would look like: cty will be on the x axis and hwy on the y axis. There will be one point per car because geom_point() creates a scatterplot.

  2. ggplot(diamonds, aes(carat, price)) + geom_point()

data = diamonds

aesthetic mapping = x = carat and y = price

layers = scatterplot

Predicting what it would look like: carat will be on the x axis and price on the y axis. There will be one point per diamond because geom_point() creates a scatterplot.

  3. ggplot(economics, aes(date, unemploy)) + geom_line()

data = economics

aesthetic mapping = x = date and y = unemploy

layers = line graph

Predicting what it would look like: date will be on the x axis and unemploy on the y axis. A line will connect the observations because geom_line() creates a line graph.

  4. ggplot(mpg, aes(cty)) + geom_histogram()

data = mpg

aesthetic mapping = x = cty, no y value since it is a histogram

layers = histogram

Predicting what it would look like: cty will be on the x axis and the count of cars in each bin on the y axis. geom_histogram() bins the values of mpg$cty to create the histogram.

2.4.1

Question 1: Experiment with the color, shape and size aesthetics. What happens when you map them to continuous values? What about categorical values? What happens when you use more than one aesthetic in a plot?

When you map a continuous variable, ggplot2 uses a colour gradient or scales the point size with the value, and adds a legend. When you map a categorical variable, ggplot2 assigns a distinct colour or shape to each category. When you use more than one aesthetic in a plot, each aesthetic can encode a different variable, and ggplot2 adds a legend for each.
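The behaviour described above can be tried with a few small plots. This is a sketch using the mpg columns cty, class and cyl; the plot names are illustrative, and layer_data() is used only to inspect the colours ggplot2 assigned.

```r
library(ggplot2)

# Continuous variable mapped to colour: gradient scale with a colour bar
p_continuous <- ggplot(mpg, aes(displ, hwy, colour = cty)) + geom_point()

# Categorical variable mapped to colour: one distinct colour per class
p_discrete <- ggplot(mpg, aes(displ, hwy, colour = class)) + geom_point()

# Two aesthetics at once: colour for class and size for cyl
p_both <- ggplot(mpg, aes(displ, hwy, colour = class, size = cyl)) + geom_point()

# Number of distinct colours used for the categorical mapping
n_class_colours <- length(unique(layer_data(p_discrete)$colour))
print(n_class_colours)
```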

Question 2: What happens if you map a continuous variable to shape? Why? What happens if you map trans to shape? Why?

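When you try to map a continuous variable to shape you get an error, because shapes have no natural order and ggplot2 cannot assign a separate shape to every numeric value. When you map trans to shape the plot works, because trans is categorical: each transmission type is given its own shape and a legend appears. However, ggplot2's default shape palette has only six shapes and trans has ten values, so ggplot2 issues a warning and the remaining transmission types are not plotted.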
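A sketch for checking both cases without drawing anything. It assumes the error is raised when the plot is built, so ggplot_build() is wrapped in tryCatch(). The exact wording of the message depends on the ggplot2 version.

```r
library(ggplot2)

# Continuous variable mapped to shape: building the plot is expected to fail
p_cont_shape <- ggplot(mpg, aes(displ, hwy, shape = cty)) + geom_point()
shape_result <- tryCatch(
  {
    ggplot_build(p_cont_shape)
    "built without error"
  },
  error = function(e) conditionMessage(e)
)
print(shape_result)

# trans is categorical; count how many shapes it would need
n_trans <- length(unique(mpg$trans))
print(n_trans)
```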

Question 3: How is drive train related to fuel economy? How is drive train related to engine size and class?

Drive train is related to fuel economy partly because it influences the weight and complexity of the car. 4WD vehicles tend to have larger engines and lower mpg, while FWD vehicles tend to have smaller engines and higher mpg.
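The claims above can be checked with a grouped summary. This sketch averages fuel economy and engine size per drive train and tabulates class by drive train; drv_summary and class_by_drv are illustrative names.

```r
library(ggplot2)
library(dplyr)

# Average fuel economy and engine size for each drive train
drv_summary <- mpg %>%
  group_by(drv) %>%
  summarise(
    mean_cty = mean(cty),
    mean_hwy = mean(hwy),
    mean_displ = mean(displ),
    n = n()
  )
print(drv_summary)

# How the vehicle classes are distributed across drive trains
class_by_drv <- count(mpg, drv, class)
print(class_by_drv)
```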

2.6.1

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() + 
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(mpg, aes(displ, hwy)) +
  geom_point()+
  geom_smooth(span = 0.2)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(mpg, aes(displ, hwy)) +
  geom_point()+
  geom_smooth(span = 1)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(mpg, aes(displ, hwy))+
  geom_point()+
  geom_smooth(method = "gam", formula = y ~ s(x))

ggplot(mpg, aes(displ, hwy))+
  geom_point()+
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

2.6.6

Question 1: What’s the problem with the plot created by ggplot(mpg, aes(cty, hwy)) + geom_point()? Which of the geoms described above is most effective at remedying the problem?

The problem with ggplot(mpg, aes(cty, hwy)) + geom_point() is overplotting: cty and hwy are whole numbers, so many cars share the same coordinates and are drawn on top of each other, which hides how many observations lie at each location and makes it hard to draw an accurate conclusion. A jitter plot remedies this by adding a small amount of random noise to each point so that overlapping points spread out.

ggplot(mpg, aes(cty, hwy)) + geom_jitter()
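Two variations that may also help with overlapping points, shown as a sketch: geom_count() sizes each point by the number of observations at that location, and geom_jitter() accepts width, height and alpha arguments to control the noise and transparency. The specific values 0.3 and 0.5 are arbitrary choices.

```r
library(ggplot2)

# Point size encodes how many cars share each (cty, hwy) location
p_count <- ggplot(mpg, aes(cty, hwy)) + geom_count()

# Jitter with explicit noise and transparency
p_jitter <- ggplot(mpg, aes(cty, hwy)) +
  geom_jitter(width = 0.3, height = 0.3, alpha = 0.5)

# The computed counts behind geom_count()
count_data <- layer_data(p_count)
print(head(count_data[, c("x", "y", "n")]))
```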

Question 2: One challenge with ggplot(mpg, aes(class, hwy)) + geom_boxplot() is that the ordering of class is alphabetical, which is not terribly useful. How could you change the factor levels to be more informative?

ggplot(mpg, aes(class, hwy)) + geom_boxplot()

ggplot(mpg, aes(reorder(class, hwy),hwy)) +
  geom_boxplot()

reorder() sorts the levels of class by their hwy values (by the mean, by default), so the boxes appear in increasing order and it is easier to see the relationship between class and highway mpg.
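Since a boxplot displays medians, ordering by the median may match the picture more closely than the default mean. A minimal sketch using the FUN argument of reorder(); class_by_median and p_median are illustrative names.

```r
library(ggplot2)

# Order the class levels by median highway mpg instead of the mean
class_by_median <- reorder(mpg$class, mpg$hwy, FUN = median)
print(levels(class_by_median))

# The same ordering used directly inside the plot
p_median <- ggplot(mpg, aes(reorder(class, hwy, FUN = median), hwy)) +
  geom_boxplot() +
  labs(x = "class")
```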

Question 3: Explore the distribution of the carat variable in the diamonds dataset. What binwidth reveals the most interesting patterns?

Most carat values are below 2, so the distribution is right-skewed, with many small diamonds and very few large ones. The smaller binwidth (0.01) is the most revealing because it shows clear spikes at common carat values.

ggplot(diamonds, aes(carat)) +
  geom_freqpoly(binwidth = 0.5)

ggplot(diamonds, aes(carat)) +
  geom_freqpoly(binwidth = 0.01)
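The spikes seen with the narrow binwidth can be cross-checked by tabulating the most frequent carat values. This is a sketch; carat_counts is an illustrative name.

```r
library(ggplot2)
library(dplyr)

# Most common carat values, which should line up with the spikes
carat_counts <- count(diamonds, carat, sort = TRUE)
print(head(carat_counts, 10))
```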

Question 4: Explore the distribution of the price variable in the diamonds data. How does the distribution vary by cut?

head(diamonds)
## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
ggplot(diamonds, aes(reorder(cut, price), price))+
  geom_boxplot()

Because the x axis is reordered by mean price, the boxes are sorted by price rather than by cut quality. Price does not increase steadily with cut quality: Ideal, the best cut, has the lowest mean price, while Fair, the worst cut, has one of the highest. Every cut has a right-skewed price distribution with many high-priced outliers, so it is hard to conclude that cut on its own determines the price distribution.
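A grouped summary makes the comparison between cuts explicit. The sketch below reports the median and mean price per cut in quality order; price_by_cut is an illustrative name.

```r
library(ggplot2)
library(dplyr)

# Median and mean price for each cut, in the factor's quality order
price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(
    median_price = median(price),
    mean_price = mean(price),
    n = n()
  )
print(price_by_cut)
```

One possible next step is to compare price within a narrow carat range, since carat also varies between cuts.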

Question 5: You now know (at least) three ways to compare the distributions of subgroups: geom_violin(), geom_freqpoly() and the colour aesthetic, or geom_histogram() and faceting. What are the strengths and weaknesses of each approach? What other approaches could you try?

Boxplots summarise the bulk of the distribution with only five numbers, while jittered plots show every point but only work with relatively small datasets. Violin plots give the richest display, but rely on the calculation of a density estimate, which can be hard to interpret. Faceting makes comparisons between groups a little harder, but makes it easier to see the distribution of each group.

Question 6: Read the documentation for geom_bar(). What does the weight aesthetic do?

The weight aesthetic changes what the bars represent. By default the height of each bar is the number of rows in the group, but when weight is mapped to a variable the height becomes the sum of that variable within the group.
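A small sketch of the weight aesthetic, using displ as the weight purely for illustration. The bar heights computed by ggplot2 are compared with per-class sums computed by hand.

```r
library(ggplot2)

# Bars show the total displ per class instead of the number of rows
p_weighted <- ggplot(mpg, aes(class, weight = displ)) + geom_bar()

# Heights computed by ggplot2 for the bar layer
bar_heights <- layer_data(p_weighted)$count

# The same totals computed directly
displ_totals <- tapply(mpg$displ, mpg$class, sum)

print(bar_heights)
print(displ_totals)
```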

Question 7: Using the techniques already discussed in this chapter, come up with three ways to visualise a 2d categorical distribution. Try them out by visualising the distribution of model and manufacturer, trans and class, and cyl and trans.

model and manufacturer

ggplot(mpg, aes(manufacturer, fill = model))+
  geom_bar()

trans and class

ggplot(mpg, aes(class, trans)) + 
  geom_count()

cyl and trans

ggplot(mpg, aes(cyl, fill = trans)) + 
  geom_bar(position = "dodge")

3.1.1

Question 1: What geoms would you use to draw each of the following named plots?

Scatterplot

geom_point()

Line chart

geom_line()

Histogram

geom_histogram()

Bar chart

geom_bar() (or geom_bar(stat = "identity") when the bar heights are already in the data)

Pie chart

geom_bar(stat = "identity") + coord_polar(theta = "y")
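As a sketch of the pie chart recipe, a single stacked bar can be bent into a circle with coord_polar(). This version lets geom_bar() count the rows itself instead of using stat = "identity"; mapping x to an empty string is a common trick to get one bar, and p_pie is an illustrative name.

```r
library(ggplot2)

# One stacked bar of class counts, wrapped into a circle
p_pie <- ggplot(mpg, aes(x = "", fill = class)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y")

# One slice per class
n_slices <- nrow(layer_data(p_pie))
print(n_slices)
```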

Question 2: What’s the difference between geom_path() and geom_polygon()? What’s the difference between geom_path() and geom_line()?

geom_path() connects the observations with a line in the order in which they appear in the data. geom_polygon() draws the same path but closes it and fills the inside, creating a polygon.

geom_path() connects the observations in data order, whereas geom_line() connects them in order of their x values, from lowest to greatest.
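The difference is easiest to see on a tiny dataset whose rows are not sorted by x. In this sketch zigzag is a made-up data frame, and layer_data() is used to show the order in which each geom will connect the points.

```r
library(ggplot2)

# Four made-up points whose x values are deliberately out of order
zigzag <- data.frame(x = c(1, 3, 2, 4), y = c(1, 2, 3, 4))

p_path <- ggplot(zigzag, aes(x, y)) + geom_path()        # data order
p_line <- ggplot(zigzag, aes(x, y)) + geom_line()        # sorted by x
p_polygon <- ggplot(zigzag, aes(x, y)) + geom_polygon()  # closed and filled

# Order of the x values as each geom will draw them
path_x <- layer_data(p_path)$x
line_x <- layer_data(p_line)$x
print(path_x)
print(line_x)
```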

Question 3: What low-level geoms are used to draw geom_smooth()? What about geom_boxplot() and geom_violin()?

geom_smooth() is drawn with geom_line() for the fitted curve and geom_ribbon() for the confidence band. For geom_boxplot(), the low-level geoms would be geom_rect(), geom_segment(), and geom_point(). For geom_violin(), the low-level geom would be geom_polygon(), with line segments for any quantile lines.
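To illustrate the first claim, a smoother can be assembled by hand from geom_ribbon() and geom_line(). This is a sketch that fits loess() directly with its default settings, so the curve may differ slightly from what geom_smooth() draws. The names fit, grid and p_manual are illustrative.

```r
library(ggplot2)

# Fit a loess curve and predict on an evenly spaced grid of displ values
fit <- loess(hwy ~ displ, data = mpg)
grid <- data.frame(displ = seq(min(mpg$displ), max(mpg$displ), length.out = 80))
pred <- predict(fit, newdata = grid, se = TRUE)

# Approximate 95% confidence band around the fitted values
half_width <- qt(0.975, pred$df) * pred$se.fit
grid$hwy <- pred$fit
grid$ymin <- pred$fit - half_width
grid$ymax <- pred$fit + half_width

# Ribbon for the band, line for the fit, points for the raw data
p_manual <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_ribbon(aes(ymin = ymin, ymax = ymax), data = grid, alpha = 0.2) +
  geom_line(data = grid, colour = "blue")
```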

head(mpg)
## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…
mpg$cty_l100km = 235.2 / mpg$cty
mpg$hwy_l100km = 235.2 / mpg$hwy
