I picked Moiya Josephs’ TidyVerse CREATE project to extend. I removed a second dataset from the geom_point() layer in the final map graphic and edited the original segment for flow, language and formatting. See the ‘TidyVerse EXTEND’ section for additional code.
title: “Tidyverse CREATE”
author: “Moiya Josephs”
date: “4/9/2022”
output: html_document
ggplot2 is a tidyverse package that lets us visualize our data and make our findings clearer when we present them. In this document I plan to show how to use ggplot features that go beyond bar plots and line charts.
library(ggplot2)
library(tidyverse)
The dataset I will use to demo the different features of ggplot contains ramen ratings. It lets me show how ggplot graphs discrete and continuous variables.
filelocation <- "https://raw.githubusercontent.com/moiyajosephs/Data607-Project2/main/ramen-ratings.csv"
ramen_ratings <- read.csv(filelocation)
When plotting two discrete variables, geom_count is recommended. geom_count is a variation of geom_point that counts the observations at each location and maps that count to the size of the dot. The legend explains the dot sizes: the larger the dot, the more often that combination of values occurs in the data.
Both discrete
ggplot(ramen_ratings, aes(Brand, Style, color = Style)) + geom_count()
ggplot(ramen_ratings, aes(Brand, Style, color = Style)) + geom_count() + theme(axis.text.x = element_blank() )
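To make the dot sizes concrete, here is a minimal sketch on toy data. The `toy` data frame and its values are hypothetical and are not part of the ramen dataset. It uses `dplyr::count()` to tabulate the same brand and style pairs that geom_count encodes as dot size.

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data with two discrete variables
toy <- data.frame(
  brand = c("A", "A", "A", "B", "B", "C"),
  style = c("Cup", "Cup", "Pack", "Pack", "Pack", "Bowl")
)

# Tabulate how often each brand/style pair occurs;
# geom_count() maps this frequency to the size of each dot
pair_counts <- count(toy, brand, style)
pair_counts

# One dot per distinct pair, sized by its frequency
p_count <- ggplot(toy, aes(brand, style)) + geom_count()
p_count
```

Comparing the `n` column of `pair_counts` with the legend of the plot is a quick way to check that the dots mean what you expect.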
geom_col is similar to geom_bar in that both draw bar charts. The difference is that geom_col takes the bar heights from values in the data, whereas geom_bar sets them to the number of times each x value occurs.
Below I plot ramen brands against the ratings they received.
ramen_ratings$Stars <- as.numeric(ramen_ratings$Stars)
ggplot(head(ramen_ratings,10), aes(Brand, Stars, fill = Style) ) + geom_col() + theme(axis.text.x = element_text(angle = -30, vjust = 1, hjust = 0))
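The difference between the two geoms is easiest to see side by side. The sketch below uses a small hypothetical data frame (`toy`, not the ramen data) and reads the computed bar heights back with `layer_data()`.

```r
library(ggplot2)

# Hypothetical toy data: brand A appears twice, brand B once
toy <- data.frame(
  brand = c("A", "A", "B"),
  stars = c(4, 5, 3)
)

# geom_bar(): bar height is the number of rows per brand
p_bar <- ggplot(toy, aes(brand)) + geom_bar()

# geom_col(): bar height comes from the stars column;
# rows that share a brand are stacked on top of each other
p_col <- ggplot(toy, aes(brand, stars)) + geom_col()

bar_heights <- layer_data(p_bar)$count      # counts per brand
col_top     <- max(layer_data(p_col)$ymax)  # top of the tallest stacked bar
```

Note that with geom_col, several rows for the same brand stack, so the bar shows the sum of their ratings rather than an average.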
This is the most interesting feature: if the data has regions specified, you can map where each point is. The function map_data gives you the longitude and latitude of a specified region, such as a state or country. The ramen data is international; luckily, we can set map_data to "world" so that it collects the coordinates of all countries in the world.
ramen_ratings$Stars <- as.numeric(ramen_ratings$Stars)
map <- map_data("world")
# select region in ramen ratings, get set
regions <- unique(ramen_ratings$Country)
# select country in map, select
m1 <- map %>% filter(region %in% regions)
m1 <- distinct(m1, region, .keep_all = TRUE)
The map data is very large, however, and we do not need every coordinate for the same country. So I used the distinct function to keep one row per country in map_data and called the result map.regions.set.
life.exp.map <- left_join(ramen_ratings, m1, by = c("Country"="region"))
map.regions.set <- distinct(map, region, .keep_all = TRUE)
Now that I have the unique regions, I can left join the ramen data with them where the country equals the region. That way we have the ramen information from the original dataset joined with one set of coordinates for each country.
ramen.map <- left_join(ramen_ratings, map.regions.set, by = c("Country"="region"))
ggplot() +
  geom_map(
    data = map, map = map,
    aes(long, lat, map_id = region),
    color = "white", fill = "lightgray", size = 0.1
  ) +
  geom_point(
    data = life.exp.map,
    aes(long, lat, color = Style),
    alpha = 0.7
  )
The map above shows where each style of ramen is found around the world. At a glance, a reader can see that pack is a popular ramen style.
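The distinct-then-join step used for the map can be sketched on toy data. The data frames `coords` and `ratings` and all their values are hypothetical stand-ins for map_data("world") and the ramen ratings. The sketch shows two things worth knowing: distinct keeps the first row it meets for each region (one boundary point, not a centroid), and left_join fills the coordinates with NA for countries that have no match.

```r
library(dplyr)

# Hypothetical stand-in for map_data("world"): several points per region
coords <- data.frame(
  region = c("Japan", "Japan", "Japan", "Peru", "Peru"),
  long   = c(139.7, 135.5, 141.3, -77.0, -71.5),
  lat    = c(35.7, 34.7, 43.1, -12.0, -16.4)
)

# Hypothetical stand-in for the ratings; "Atlantis" has no match in coords
ratings <- data.frame(
  Country = c("Japan", "Peru", "Japan", "Atlantis"),
  Stars   = c(5, 3.5, 4, 2)
)

# Keep only the first row for each region
one_point <- distinct(coords, region, .keep_all = TRUE)

# Attach one coordinate pair to every rating; unmatched countries get NA
joined <- left_join(ratings, one_point, by = c("Country" = "region"))
joined
```

Rows with NA coordinates cannot be drawn by geom_point, so country names that are spelled differently in the two sources will silently drop off the map.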
ggplot2 is a very powerful library within tidyverse that lets you build a wide range of visualizations from your data. With visualizations, data scientists can present key findings in a way that is easy to understand.
Here we extend the project with additional things you can do with ggplot. I adapted a number of examples from a 2021 Medium article by Keith McNulty, towardsdatascience.com/ten-random-but-useful-things-to-know-about-ggplot2
With ggplot you can chart a regular math function. Below is an example with the sin function.
ggplot() +
xlim(-10, 10) +
geom_function(fun = function(x) sin(x))
**This can be useful if you want to overlay the function on your data to look for a pattern. We can add the function as a layer to the ggplot of the data’s longitude (x) and latitude (y) and see that the points are not described by the function 5*sin(x/5).**
# Create a plot of the longitude and latitude of the countries
exp1 <- ggplot(data = life.exp.map,
               aes(x = long, y = lat, color = as.factor(Style))) +
  geom_point()
# Overlay a sin function on the above plot
exp1 + geom_function(fun = function(x) 5*sin(x/5))
## Warning: Removed 148 rows containing missing values (geom_point).
## Warning: Multiple drawing groups in `geom_function()`. Did you use the correct
## `group`, `colour`, or `fill` aesthetics?
In the above example I assigned a simpler graph to a variable, exp1. This allows me to reuse the base graph without having to repeat the code. It also touches on the topic of inheritance, because I passed the aesthetic information into ggplot instead of into the geom_point() layer. In this way my additional layer inherited the aesthetics from the ggplot that it was layered onto. This is something to consider as you build out projects so that your code is clean and intentional.
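Note that in this case, with a generic math function, inheritance has little visible effect on the curve (although the inherited colour aesthetic is likely what triggers the "Multiple drawing groups" warning above). See the final example for a case where inheritance creates an error and how to disinherit.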
# Repeat of previous code chunk with additional notation
# Here the data and aes information is passed to the ggplot function, so it is used by the geom_point() layer.
exp1 <- ggplot(data = life.exp.map,
               aes(x = long, y = lat, color = as.factor(Style))) +
  geom_point()
# When we add the geom_function() layer, it inherits the data and aes information used in exp1.
exp1 + geom_function(fun = function(x) 5*sin(x/5))
# It's the same as if we had written
exp1 <- ggplot(data = life.exp.map,
               aes(x = long, y = lat, color = as.factor(Style))) +
  geom_point() +
  geom_function(fun = function(x) 5*sin(x/5))
# But if the data and aes were passed to the geom_point() layer, then they wouldn't be inherited by the second layer. The difference isn't evident in this example, but it's something to keep in mind.
exp1 <- ggplot() +
  geom_point(data = life.exp.map,
             aes(x = long, y = lat, color = as.factor(Style))) +
  geom_function(fun = function(x) 5*sin(x/5))
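For a case where inheritance visibly changes the result, here is a minimal sketch using the built-in mtcars data rather than the ramen data. The choice of geom_smooth as the second layer is my own for illustration. When the colour aesthetic is set in ggplot(), the smoothing layer inherits it and fits one line per group; when it is set only in geom_point(), the smoothing layer sees a single group.

```r
library(ggplot2)

# Colour set globally: geom_smooth() inherits it and fits one line per cyl
p_global <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x)

# Colour set only in geom_point(): geom_smooth() fits a single line
p_local <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point(aes(color = factor(cyl))) +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x)

# Count the groups the smoothing layer actually drew
n_global <- length(unique(layer_data(p_global, 2)$group))
n_local  <- length(unique(layer_data(p_local, 2)$group))
```

Which version you want depends on the question: per-group trends call for the global mapping, an overall trend for the local one.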
Here we plot a density function over a histogram. Note, however, that we set y to ..density..; otherwise the histogram would default to count and be on a different scale than the density layer. Note also that I had to cast the Stars column as numeric for it to be plotted.
# Overlay a density layer over a histogram layer with the same y-scale
exp2 <- ggplot(data = life.exp.map, aes(x = as.numeric(Stars), y = ..density..)) +
  geom_histogram(fill = "lightblue", bins = 30) +
  geom_density(fill = "pink", alpha = 0.4)
exp2
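In ggplot2 3.4.0 and later, the ..density.. notation is deprecated in favour of after_stat(density). The sketch below shows the same histogram-plus-density idea with the newer notation, assuming a recent ggplot2 and using the built-in faithful dataset instead of the ramen data. Mapping y inside geom_histogram() rather than in ggplot() is my own variation; it keeps the computed aesthetic local to the layer that needs it.

```r
library(ggplot2)

# Histogram on the density scale, with a kernel density curve on top
p_density <- ggplot(faithful, aes(x = eruptions)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 30, fill = "lightblue") +
  geom_density(fill = "pink", alpha = 0.4)
p_density

# On the density scale the bar areas sum to 1,
# which is why the two layers share a y-axis
hist_data <- layer_data(p_density, 1)
area <- sum(hist_data$density * (hist_data$xmax - hist_data$xmin))
```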
Here we overlay the example above with a normal distribution.
First we remove the unrated records, recast the values as numbers, and calculate the normal distribution's mean and standard deviation.
When we add the layer with the normal density function, we have to disinherit the aesthetic information, which we can do with inherit.aes = FALSE.
# Tidy data to be easier to work with
life.exp.map2 <- filter(life.exp.map, Stars != "Unrated")
life.exp.map2$Stars <- as.numeric(life.exp.map2$Stars)
# Calculate the mean and standard deviation for the normal distribution
stars_mean <- mean(life.exp.map2$Stars)
stars_sd <- sd(life.exp.map2$Stars)
# plot the prior example with the additional normal density layer - Note the disinheritance
exp2 + geom_function(
  fun = function(x) dnorm(x, mean = stars_mean, sd = stars_sd),
  linetype = "dashed",
  inherit.aes = FALSE
)
Looking at the plot above, notice how people are inordinately inclined to give a brand five stars or even zero stars, which to me indicates a problem with relying on survey data to judge quality.
Here is the above plot repeated without the disinheritance to demonstrate the error.
# Repeated plot without the disinheritance
exp2 + geom_function(
  fun = function(x) dnorm(x, mean = stars_mean, sd = stars_sd),
  linetype = "dashed")
Error in ‘f()’:
! Aesthetics must be valid computed stats. Problematic aesthetic(s): y = ..density...
Did you map your stat in the wrong layer?
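The disinheritance pattern can also be tried on built-in data. This is a minimal sketch using the faithful dataset and after_stat(density), both my own substitutions for the ramen data and the ..density.. notation. The y mapping belongs to the histogram's stat, so the function layer opts out of it with inherit.aes = FALSE.

```r
library(ggplot2)

# Parameters for the reference normal curve
x_mean <- mean(faithful$waiting)
x_sd   <- sd(faithful$waiting)

# y is mapped globally to a statistic that only the histogram computes,
# so the function layer must not inherit it
p_norm <- ggplot(faithful, aes(x = waiting, y = after_stat(density))) +
  geom_histogram(bins = 30, fill = "lightblue") +
  geom_function(
    fun = function(x) dnorm(x, mean = x_mean, sd = x_sd),
    linetype = "dashed",
    inherit.aes = FALSE
  )
p_norm

# The curve is evaluated across the x range set by the histogram
curve_data <- layer_data(p_norm, 2)
```

An alternative that avoids the issue altogether is to map y inside geom_histogram() instead of ggplot(), so that there is nothing problematic to inherit.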
Check out the examples as they were originally described by the author, Keith McNulty. He goes on to describe an extended final example that is very detailed, beautiful and instructive.
Keith McNulty (2021) towardsdatascience.com/ten-random-but-useful-things-to-know-about-ggplot2