Homework

In this assignment you will practice visualizing data distributions using several common ggplot2 tools:

  1. Histograms
  2. Density plots
  3. Frequency polygons
  4. 2D density plots

Understanding how data are distributed is an essential first step in exploratory data analysis.

Task 1

Create a histogram showing the distribution of highway fuel efficiency (hwy) from the mpg dataset.

Requirements:

  1. Use geom_histogram()
  2. Set the bin width to 2
  3. Use a minimal theme
  4. Include appropriate axis labels and a title

# library 
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
head(mpg, 10) # mpg ships with ggplot2
## # A tibble: 10 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
mpg %>%
  ggplot(aes(hwy)) + 
  geom_histogram(binwidth = 2) + 
  theme_minimal() + 
  labs(title = "Highway Miles per Gallon", 
       x = "Highway Mileage (mpg)", 
       y = "Frequency of Occurrence")
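
The bin counts behind the histogram can also be inspected directly, which helps when identifying the tallest bar. The following is a minimal sketch, assuming ggplot2's layer_data() helper is available in the installed version; the names p and bins are illustrative and not part of the assignment.

```r
library(ggplot2)

# Rebuild the Task 1 histogram and keep it as an object (illustrative name)
p <- ggplot(mpg, aes(hwy)) +
  geom_histogram(binwidth = 2)

# layer_data() returns the data computed for a layer, including the
# bin edges (xmin, xmax) and the number of observations in each bin
bins <- layer_data(p, 1)[, c("xmin", "xmax", "count")]

# Row describing the bin with the highest count
bins[which.max(bins$count), ]
```

The row that is printed gives the edges of the most common bin, which can be compared with the range read off the plot.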

  1. What range of highway fuel efficiency values is most common? 25-27 mpg

  2. Does the distribution appear symmetric, skewed, or multimodal? Multimodal

Task 2

Create a histogram showing the distribution of engine displacement (displ).

Requirements:

  1. Use geom_histogram()
  2. Choose a reasonable bin width
  3. Fill the bars with a color of your choice

mpg %>% 
  ggplot(aes(displ)) + 
  geom_histogram(fill = "lightblue", binwidth = 0.3) + 
  theme_minimal() + 
  labs(title = "Engine Displacement Distribution", 
       x = "Engine Displacement (litres)", 
       y = "Frequency")

Question

  1. What does the histogram suggest about the types of engines represented in the dataset? There are many more engines with low displacement than with high displacement, and relatively few engines have a displacement around 4.0 litres.
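
One way to sanity-check a "reasonable" bin width is a rule of thumb such as the Freedman-Diaconis rule, width = 2 * IQR / n^(1/3). The sketch below is only a cross-check and is not necessarily how the 0.3 used above was chosen; fd_width is an illustrative name.

```r
library(ggplot2)

x <- mpg$displ

# Freedman-Diaconis rule of thumb: 2 * IQR / n^(1/3)
fd_width <- 2 * IQR(x) / length(x)^(1/3)
fd_width

# Histogram drawn with the rule-of-thumb bin width for comparison
ggplot(mpg, aes(displ)) +
  geom_histogram(binwidth = fd_width, fill = "lightblue") +
  theme_minimal()
```

If the suggested width and the hand-picked width give histograms with the same overall shape, the choice is probably not driving the conclusions.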

Task 3

Density plots provide a smoothed estimate of the distribution of a variable.

Create a density plot of engine displacement (displ).

Requirements:

  1. Use geom_density()
  2. Fill the density curve with a color
  3. Set transparency (alpha) so the shape is visible

mpg %>%
  ggplot(aes(displ)) + 
  geom_density(fill = "lightblue", alpha = 0.5) + 
  theme_minimal() + 
  labs(title = "Engine Displacement Density", 
       x = "Engine Displacement (litres)", 
       y = "Density")
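
To compare the density curve with the Task 2 histogram on a common scale, the histogram can be drawn in density units with after_stat(density) and the curve layered on top. This is a minimal sketch rather than a required part of the task; the object names overlay and bars are illustrative.

```r
library(ggplot2)

# Histogram rescaled so the bar areas sum to 1, with the density curve on top
overlay <- ggplot(mpg, aes(displ)) +
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = 0.3, fill = "grey80") +
  geom_density(fill = "lightblue", alpha = 0.5) +
  theme_minimal() +
  labs(x = "Engine Displacement (litres)", y = "Density")

overlay

# Check: total bar area (density * bin width) should be 1
bars <- layer_data(overlay, 1)
sum(bars$density * (bars$xmax - bars$xmin))
```

On this scale the two layers are directly comparable, because both the bars and the curve enclose a total area of 1.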

Question

  1. How does the density plot compare to the histogram you created earlier? Both plots show the same underlying distribution. The density plot is slightly easier to read because of the smoothed curve. The y-axis is on a different scale: the histogram shows the count in each bin, while the density plot shows an estimated density whose total area is 1.

Task 4

Create density plots comparing engine displacement across vehicle classes.

Requirements:

  1. Map fill = class
  2. Use transparency so the curves overlap
  3. Include a legend

mpg %>%
  ggplot(aes(displ, fill = class)) + 
  geom_density(alpha = 0.5) + 
  theme_minimal() + 
  labs(title = "Engine Displacement by Vehicle Class", 
       x = "Engine Displacement (litres)", 
       y = "Density")

Question

  1. Which vehicle class tends to have the largest engines? The 2seater class
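
The visual impression from the overlapping curves can be backed up with a per-class summary. The sketch below uses dplyr to compute the median and maximum displacement for each class; displ_by_class and its column names are illustrative.

```r
library(ggplot2)
library(dplyr)

# Median and maximum engine displacement per vehicle class,
# sorted so the class with the largest typical engine comes first
displ_by_class <- mpg %>%
  group_by(class) %>%
  summarise(median_displ = median(displ),
            max_displ    = max(displ)) %>%
  arrange(desc(median_displ))

displ_by_class
```

The first row of the table is the class with the largest median displacement.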

Task 5

Create a frequency polygon comparing highway fuel efficiency across vehicle classes.

Requirements:

  1. Map color = class
  2. Use the same bin width as before

mpg %>%
  ggplot(aes(hwy, colour = class)) + 
  geom_freqpoly(binwidth = 2) + 
  theme_minimal() + 
  labs(title = "Highway Fuel Efficiency by Vehicle Class", 
       x = "Fuel Efficiency on Highway Roads", 
       y = "Frequency")

Question

  1. Why might frequency polygons be easier to interpret than overlapping histograms when comparing groups? They reduce visual clutter: overlapping histogram bars hide one another, whereas the lines of a frequency polygon stay visible where they cross, which makes it easier to distinguish the distribution of each category.
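
Because the vehicle classes contain different numbers of cars, counts can make small classes hard to see. A possible variation, shown here only as a sketch, puts each polygon on the density scale with after_stat(density) so the shapes are comparable regardless of group size; freq_density and lines_data are illustrative names.

```r
library(ggplot2)

# Frequency polygons on the density scale: each class integrates to 1,
# so differences in group size no longer dominate the picture
freq_density <- ggplot(mpg, aes(hwy, colour = class)) +
  geom_freqpoly(aes(y = after_stat(density)), binwidth = 2) +
  theme_minimal() +
  labs(x = "Fuel Efficiency on Highway Roads", y = "Density")

freq_density

# One polygon is computed per class
lines_data <- layer_data(freq_density, 1)
length(unique(lines_data$group))
```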

Task 6

Create a filled 2D density plot.

Requirements:

  1. Use stat_density_2d()
  2. Set geom = "polygon"
  3. Map fill to density level

mpg %>%
  ggplot(aes(x = displ, y = hwy)) +
  stat_density_2d(
    aes(fill = after_stat(level)),
    geom = "polygon"
  ) +
  theme_minimal() 

Question

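  1. What additional information does the filled density plot provide compared to the contour plot? Filled density plots use color shading to encode the density level, so regions of high and low density can be told apart at a glance instead of being inferred from the spacing of the contour lines.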
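
The contour plot mentioned in the question is not drawn above. For reference, a minimal sketch of an unfilled version using geom_density_2d() is given below, so the two can be viewed side by side; contour_plot and contour_data are illustrative names, and the 2D density estimate relies on the MASS package that ships with standard R installations.

```r
library(ggplot2)

# Unfilled contour lines of the 2D density estimate of displ vs hwy
contour_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_density_2d() +
  theme_minimal()

contour_plot

# The computed contour paths, for inspection
contour_data <- layer_data(contour_plot, 1)
head(contour_data)
```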

Task 7

Create a scatterplot of carat vs price using the diamonds dataset.

Requirements:

  1. Use geom_point()
  2. Set alpha = 0.2 so overlapping points are visible
  3. Use theme_minimal()
  4. Use the ggExtra package to add marginal density plots to the scatterplot.

# library
#install.packages("ggExtra")
library(ggExtra)
library(ggplot2)
library(dplyr)

p1 <- diamonds %>%
  ggplot(aes(carat, price)) + 
  geom_point(alpha = 0.2) + 
  theme_minimal()

ggMarginal(p1, type = "density", fill = "lightblue")

Question

  1. Why might marginal plots be useful when analyzing relationships between variables? Marginal plots show the distribution of each variable on its own alongside the joint relationship shown in the scatterplot, so skew or clustering in either variable can be seen in the same figure.