In this assignment you will practice visualizing data distributions using several common ggplot2 tools: histograms, density plots, frequency polygons, filled 2D density plots, and marginal plots.
Understanding how data are distributed is an essential first step in exploratory data analysis.
## Task 1
Create a histogram showing the distribution of highway fuel efficiency (hwy) from the mpg dataset.
Requirements:
1. Use geom_histogram()
2. Set the bin width to 2
3. Use a minimal theme
4. Include appropriate axis labels and a title
# library
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
head(mpg, 10) # the mpg dataset ships with ggplot2
## # A tibble: 10 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
# Histogram of highway fuel efficiency with 2-mpg bins
mpg %>%
  ggplot(aes(hwy)) +
  geom_histogram(binwidth = 2) +
  theme_minimal() +
  labs(title = "Distribution of Highway Fuel Efficiency",
       x = "Highway Fuel Efficiency (mpg)",
       y = "Frequency of Occurrence")
What range of highway fuel efficiency values is most common?
25 to 27 mpg.
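The most common range can also be read from a table of bin counts rather than from the plot alone. The following is a minimal sketch, assuming that ggplot2's `cut_width()` helper with its default settings produces the same width-2 bins as `geom_histogram(binwidth = 2)`; the object name `hwy_bins` is introduced here for illustration only.

```r
library(ggplot2)
library(dplyr)

# Assign each vehicle to a width-2 bin of hwy, then count vehicles per bin.
# Assumption: cut_width() defaults align with the histogram's bins.
hwy_bins <- mpg %>%
  count(bin = cut_width(hwy, width = 2), name = "n") %>%
  arrange(desc(n))

# The first rows are the bins containing the most vehicles.
head(hwy_bins, 3)
```

The top row of the table gives the interval with the highest count, which can be compared against the tallest bar in the histogram.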
Does the distribution appear symmetric, skewed, or multimodal?
Multimodal.
## Task 2
Create a histogram showing the distribution of engine displacement (displ).
Requirements:
1. Use geom_histogram()
2. Choose a reasonable bin width
3. Fill the bars with a color of your choice
mpg %>%
  ggplot(aes(displ)) +
  geom_histogram(fill = "lightblue", binwidth = 0.3) +
  theme_minimal() +
  labs(title = "Engine Displacement Distribution",
       x = "Engine Displacement (litres)",
       y = "Frequency")
Question
There are many more engines with a lower displacement than a higher one, and relatively few engines have a displacement around 4.0 litres.
Density plots provide a smoothed estimate of the distribution of a variable.
## Task 3
Create a density plot of engine displacement (displ).
Requirements:
1. Use geom_density()
2. Fill the density curve with a color
3. Set transparency (alpha) so the shape is visible
mpg %>%
  ggplot(aes(displ)) +
  geom_density(fill = "lightblue", alpha = 0.5) +
  theme_minimal() +
  labs(title = "Engine Displacement Density",
       x = "Engine Displacement (litres)",
       y = "Density") # geom_density() plots estimated density, not counts
Question
The two plots show the same underlying distribution. The density plot is slightly easier to read because of its smoothed curve. The vertical scales differ: the histogram shows counts per bin, while the density plot shows an estimated density scaled so that the area under the curve is 1.
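The difference in vertical scale between a histogram and a density plot can be checked by drawing both on one set of axes. The following is a minimal sketch, assuming a ggplot2 version that provides `after_stat()` (3.3.0 or later); the object name `p_overlay` and the colors are illustrative choices, not part of the assignment.

```r
library(ggplot2)

# Rescale the histogram from counts to density so that its bars
# and the smoothed density curve share the same y-axis.
p_overlay <- ggplot(mpg, aes(displ)) +
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = 0.3, fill = "lightblue", colour = "white") +
  geom_density(colour = "steelblue") +
  theme_minimal() +
  labs(title = "Engine Displacement: Histogram and Density",
       x = "Engine Displacement (litres)",
       y = "Density")

p_overlay
```

On the density scale, the bar areas of the histogram sum to 1, which is why the curve and the bars become directly comparable.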
## Task 4
Create density plots comparing engine displacement across vehicle classes.
Requirements:
1. Map fill = class
2. Use transparency so the curves overlap
3. Include a legend
mpg %>%
  ggplot(aes(displ, fill = class)) +
  geom_density(alpha = 0.5) +
  theme_minimal() +
  labs(title = "Engine Displacement by Vehicle Class",
       x = "Engine Displacement (litres)",
       y = "Density") # one density curve per class; the fill legend is added automatically
Question
2-seater.
## Task 5
Create a frequency polygon comparing highway fuel efficiency across vehicle classes.
Requirements:
1. Map color = class
2. Use the same bin width as before
mpg %>%
  ggplot(aes(hwy, colour = class)) +
  geom_freqpoly(binwidth = 2) + # the parameter is binwidth; bin_width is ignored with a warning
  theme_minimal() +
  labs(title = "Highway Fuel Efficiency by Vehicle Class",
       x = "Highway Fuel Efficiency (mpg)",
       y = "Frequency")
Question
Frequency polygons reduce visual clutter in the plot. Their lines make it easier to distinguish the distributions of different categories than overlapping bars would.
## Task 6
Create a filled 2D density plot.
Requirements:
1. Use stat_density_2d()
2. Set geom = "polygon"
3. Map fill to density level
mpg %>%
  ggplot(aes(x = displ, y = hwy)) +
  stat_density_2d(
    aes(fill = after_stat(level)),
    geom = "polygon"
  ) +
  theme_minimal()
Question
Filled density plots use color shading to show differences in density.
## Task 7
Create a scatterplot of carat vs price using the diamonds dataset.
Requirements:
1. Use geom_point()
2. Set alpha = 0.2 so overlapping points are visible
3. Use theme_minimal()
4. Use the ggExtra package to add marginal density plots to the scatterplot.
# library
# install.packages("ggExtra")
library(ggExtra)
library(ggplot2)
library(dplyr)
p1 <- diamonds %>%
  ggplot(aes(carat, price)) +
  geom_point(alpha = 0.2) +
  theme_minimal()
ggMarginal(p1, type = "density", fill = "lightblue")
Question
Marginal plots show the distribution of each variable on its own, alongside the joint relationship shown in the scatterplot.
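Marginal plots do not have to be density curves. The following is a minimal sketch, assuming the ggExtra package is installed, that swaps the marginal densities for histograms; the object names `p_scatter` and `p_marg` and the bin widths (0.1 carat, 500 price units) are illustrative assumptions, not values from the assignment.

```r
library(ggplot2)
library(ggExtra)

# Same scatterplot as in the task: carat vs price with transparent points.
p_scatter <- ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.2) +
  theme_minimal()

# Marginal histograms instead of densities. xparams and yparams pass
# separate settings to each margin, since carat and price are on
# very different scales.
p_marg <- ggMarginal(p_scatter, type = "histogram", fill = "lightblue",
                     xparams = list(binwidth = 0.1),
                     yparams = list(binwidth = 500))

p_marg
```

Histograms in the margins show counts per bin, so spikes at common carat values stay visible instead of being smoothed away.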