install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# ggplot2 and dplyr are part of the core tidyverse attached above,
# so separate install.packages() and library() calls are not needed.
The diamonds dataset contains information about ~54,000 diamonds, including price, carat, cut, clarity, and color.
Create a scatter plot with:
x-axis: carat
y-axis: price
ggplot(diamonds, aes(carat, price)) +
  geom_point()
Question: What type of relationship appears between carat and price?
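There appears to be a strong positive relationship between carat and price: price rises as carat increases, although the increase looks steeper than a straight line and the spread of prices widens at larger carat values.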
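One way to check the visual impression is to compute the correlation on the raw scale and on a log-log scale. This is only a sketch: the names cor_raw, cor_log, and p_log are illustrative, and the log transformation is an assumption about what might straighten the pattern, not part of the exercise.

```r
library(ggplot2)

# Pearson correlation measures linear association only
cor_raw <- cor(diamonds$carat, diamonds$price)

# Correlation after log-transforming both variables
cor_log <- cor(log(diamonds$carat), log(diamonds$price))

# The same scatter plot drawn on log-log axes
p_log <- ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.1) +
  scale_x_log10() +
  scale_y_log10()

c(raw = cor_raw, log_log = cor_log)
```

If the log-log correlation is the larger of the two, that suggests the relationship is closer to linear after transformation than on the original scale.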
Modify your plot:
Color points by cut
Add a meaningful title and axis labels
Apply theme_minimal()
# Map colour to cut (not carat) so the points are colored by cut quality
ggplot(diamonds, aes(carat, price, colour = cut)) +
  geom_point() +
  labs(
    title = "Diamond Price vs. Carat by Cut Quality",
    x = "Carat Size",
    y = "Price (Dollars)"
  ) +
  theme_minimal()
Question: Which cut appears to have higher prices at similar carat values?
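At similar carat values, most visibly between 1 and 2 carats, the higher-quality cuts (Ideal and Premium) seem to reach higher prices, although the points overlap heavily and the difference is hard to read from the plot alone.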
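Because the points overlap so heavily, a numeric summary can support the visual comparison. The sketch below compares median price by cut within narrow carat bins between 1 and 2 carats. The bin width of 0.25 and the names price_by_cut and carat_bin are illustrative choices, not part of the exercise.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

price_by_cut <- diamonds %>%
  filter(carat >= 1, carat < 2) %>%               # keep the 1 to 2 carat range
  mutate(carat_bin = floor(carat * 4) / 4) %>%    # lower edge of each 0.25-carat bin
  group_by(carat_bin, cut) %>%
  summarise(
    median_price = median(price),
    n = n(),
    .groups = "drop"
  )

price_by_cut
```

Reading across the cuts within one carat_bin shows which cut tends to cost more at a similar size; bins with a small n should be interpreted cautiously.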
Add a regression line:
# colour is mapped to cut, so geom_smooth() draws one linear fit per cut
ggplot(diamonds, aes(carat, price, colour = cut)) +
  geom_point() +
  labs(
    title = "Diamond Price vs. Carat by Cut Quality",
    x = "Carat Size",
    y = "Price (Dollars)"
  ) +
  theme_minimal() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
Question: Does the relationship between carat and price appear linear?
The relationship between carat and price does not appear to be linear: the straight line fits the point cloud poorly, and some 2-carat diamonds cost as much as a 5-carat diamond.
Question: What does the “lm” option do in the geom_smooth command? What are the other options and what do they do?
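In geom_smooth(), method = "lm" fits a linear model to the data and draws the fitted straight line. Other methods include "loess", which fits a flexible local regression curve; "gam", which fits a smooth curve using a generalized additive model; and "glm", which fits a generalized linear model whose shape depends on the family used. If no method is given, geom_smooth() chooses automatically: loess for fewer than 1,000 observations and gam otherwise.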
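To see how two of these methods differ in practice, the sketch below overlays a linear fit and a loess fit on the same points. It uses a random subsample of 2,000 rows because loess is slow on the full dataset; the sample size, the seed, and the names diamonds_small and p_methods are arbitrary choices for illustration.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed so the subsample is reproducible
diamonds_small <- diamonds[sample(nrow(diamonds), 2000), ]

p_methods <- ggplot(diamonds_small, aes(carat, price)) +
  geom_point(alpha = 0.2) +
  # straight line from a linear model
  geom_smooth(method = "lm", se = FALSE, colour = "red") +
  # flexible local regression curve
  geom_smooth(method = "loess", se = FALSE, colour = "blue")

p_methods
```

Where the blue curve bends away from the red line, the data depart from a straight-line relationship.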
Because the dataset is large, reduce overplotting by:
Adjusting alpha
Changing point size
Trying geom_jitter()
# Smaller, semi-transparent, jittered points reduce overplotting
ggplot(diamonds, aes(carat, price, colour = cut)) +
  geom_jitter(size = 0.7, alpha = 0.3) +
  labs(
    title = "Diamond Price vs. Carat by Cut Quality",
    x = "Carat Size",
    y = "Price (Dollars)"
  ) +
  theme_minimal()
Question: Why is overplotting a concern with large datasets?
Overplotting is a concern with large datasets because many points are drawn on top of one another, which hides how dense the data is in each region and makes the plot busy and hard to read.
Question: What does the alpha command do and how does it help with overplotting?
The alpha argument sets the transparency of the points. With semi-transparent points, dense regions where many points overlap appear darker and sparse regions appear lighter, so the density of the data stays visible.
Question: Based on what you see, what are the risks and benefits of using geom_jitter?
Compared with the plain scatter plot, the jitter plot looks less cluttered because points that share the same coordinates are spread apart. The risk is that jitter adds random noise, so points no longer sit at their true values; this can misrepresent a small dataset, while for a large one it mainly cleans up the picture.
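One way to limit the risk of jitter is to control how much noise is added and to fix the random seed. The sketch below jitters only horizontally, by at most 0.005 carat in each direction, and leaves price untouched. The width is an illustrative choice based on the observation that carat is recorded to two decimal places in this dataset; the seed and the name p_jitter are arbitrary.

```r
library(ggplot2)

p_jitter <- ggplot(diamonds, aes(carat, price)) +
  geom_point(
    # width = 0.005 moves each point by at most 0.005 carat left or right;
    # height = 0 keeps every price at its true value;
    # seed makes the random noise reproducible
    position = position_jitter(width = 0.005, height = 0, seed = 42),
    size = 0.7,
    alpha = 0.3
  )

p_jitter
```

Keeping the jitter smaller than the measurement resolution spreads stacked points apart without moving any point far from its recorded value.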
Create a scatter plot:
table vs price
Points colored by clarity
Facet by cut (we learn a lot more about this later, but just give it a try!)
ggplot(diamonds, aes(table, price, colour = clarity)) +
  geom_point(size = 0.7, alpha = 0.5) +
  labs(
    title = "Diamond Price vs. Table",
    x = "Table",
    y = "Price (Dollars)"
  ) +
  theme_minimal() +
  facet_wrap(~cut)
Question: Does the relationship differ by cut?
The relationship between table and price does not seem to differ by cut.
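A per-cut correlation gives a rough numeric check on this reading of the facets. This is a sketch only: correlation captures linear association and nothing more, and the names table_price_by_cut and cor_table_price are illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

table_price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(
    cor_table_price = cor(table, price),  # correlation within each cut
    n = n()
  )

table_price_by_cut
```

Similar, small correlations across all five cuts would be consistent with the relationship not differing much by cut.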
The economics dataset contains monthly US economic data over time.
Create a line plot:
x-axis: date
y-axis: unemploy
ggplot(economics, aes(date, unemploy))+
geom_line()
Question: Describe the overall trend over time.
The overall trend is upward with repeated cycles: unemployment generally increases over time, but after each peak it falls for a while before climbing again.
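Part of the long-run rise in the raw count may simply reflect population growth. As a sketch, the unemployed count can be divided by the pop column (both are recorded in thousands in economics) to plot the unemployed share of the total population. The names economics_rate, unemploy_share, and p_share are illustrative, and this share is not the official unemployment rate, which is based on the labor force.

```r
library(dplyr)
library(ggplot2)  # provides the economics dataset

economics_rate <- economics %>%
  mutate(unemploy_share = unemploy / pop)  # both columns are in thousands

p_share <- ggplot(economics_rate, aes(date, unemploy_share)) +
  geom_line() +
  scale_y_continuous(labels = scales::percent)

p_share
```

If the cycles remain but the upward drift flattens, the drift in the raw count is largely a population effect.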
Reshape the data using pivot_longer() to plot:
uempmed
psavert
Then create a multi-line plot with:
economics_pivot <- economics %>%
  pivot_longer(
    cols = c(uempmed, psavert),
    names_to = "variable",
    values_to = "value"
  )
ggplot(economics_pivot, aes(date, value, color = variable)) +
  geom_line()
Question: Do these variables appear to move together over time?
These variables do not seem to move together over time.
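The two series are measured in different units (uempmed in weeks, psavert in percent), so a shared y-axis can hide how each one moves. The sketch below computes their correlation and draws each series in its own panel with a free y-axis. The names economics_long, cor_series, and p_facets are illustrative.

```r
library(ggplot2)  # provides the economics dataset
library(tidyr)

economics_long <- pivot_longer(
  economics,
  cols = c(uempmed, psavert),
  names_to = "variable",
  values_to = "value"
)

# Linear association between the two monthly series
cor_series <- cor(economics$uempmed, economics$psavert)

# One panel per variable, each with its own y-axis scale
p_facets <- ggplot(economics_long, aes(date, value)) +
  geom_line() +
  facet_wrap(~variable, ncol = 1, scales = "free_y")

cor_series
p_facets
```

A correlation near zero would support the answer above, while a clearly negative value would suggest the series tend to move in opposite directions.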
Enhance your plot by:
Changing line width
Customizing colors
Formatting the date axis
Adding title, subtitle, and caption
Applying a theme (theme_bw() or theme_classic())
ggplot(economics_pivot, aes(date, value, color = variable)) +
  geom_line(linewidth = 0.3) +
  scale_color_brewer(palette = "Set1") +
  # Label the date axis every 5 years, showing the year only
  scale_x_date(
    date_breaks = "5 years",
    date_labels = "%Y"
  ) +
  labs(
    title = "Unemployment Duration and Personal Savings Rate",
    subtitle = "United States",
    x = "Year",
    y = "Value",
    caption = "Median unemployment duration (weeks) and personal savings rate (%) over time"
  ) +
  theme_bw() +
  # Rotate the year labels so they do not overlap
  theme(axis.text.x = element_text(angle = 50, hjust = 1))