The diamonds dataset contains information about ~54,000 diamonds, including price, carat, cut, clarity, and color.
Create a scatter plot with:
x-axis: carat
y-axis: price
library(ggplot2)
ggplot(diamonds, aes(x = carat, y = price)) + geom_point()
Question: What type of relationship appears between carat and price?
There is a positive relationship between carat and price: as carat increases, price tends to increase as well.
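One way to back up the visual impression with numbers is a correlation coefficient. The sketch below is only a quick check, not part of the assigned task; the variable names `r_pearson` and `r_spearman` are made up for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

# Pearson correlation: measures linear association only
r_pearson <- cor(diamonds$carat, diamonds$price)

# Spearman rank correlation: measures any monotonic trend,
# so it does not assume the relationship is a straight line
r_spearman <- cor(diamonds$carat, diamonds$price, method = "spearman")

r_pearson
r_spearman
```

Values close to 1 would support the positive relationship seen in the scatter plot.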
Modify your plot:
Color points by cut
Add a meaningful title and axis labels
Apply theme_minimal()
ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point() +
  labs(title = "Diamond Price vs Carat by Cut",
       x = "Carat",
       y = "Price") +
  theme_minimal()
Question: Which cut appears to have higher prices at similar carat values?
The Ideal and Premium cuts appear to have higher prices at similar carat values.
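Because the colored points overlap heavily, this answer is hard to read off the plot alone. A rough check is to compare median prices by cut within a narrow carat range. The 0.95 to 1.05 carat window below is an arbitrary assumption chosen for illustration, and the object names are hypothetical.

```r
library(ggplot2)  # provides the diamonds dataset

# Keep only diamonds close to 1 carat (the window is an arbitrary choice)
near_one <- subset(diamonds, carat >= 0.95 & carat <= 1.05)

# Median price and number of diamonds for each cut within that window
median_by_cut <- tapply(near_one$price, near_one$cut, median)
count_by_cut  <- table(near_one$cut)

median_by_cut
count_by_cut
```

Other variables such as clarity and color also differ between cuts, so this comparison is only suggestive.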
Add a regression line:
ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Diamond Price vs Carat by Cut",
       x = "Carat",
       y = "Price") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Question: Does the relationship between carat and price appear linear?
The relationship looks mostly positive but not perfectly linear.
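A simple way to probe the "not perfectly linear" impression is to fit a straight-line model on the raw scale and another on the log-log scale, then look at how well each fits. This is a sketch under the assumption that a log transformation is a reasonable alternative to try; the model names are hypothetical.

```r
library(ggplot2)  # provides the diamonds dataset

# Straight-line fit on the original scale
fit_linear <- lm(price ~ carat, data = diamonds)

# Straight-line fit after log-transforming both variables
# (carat and price are strictly positive, so log() is defined)
fit_loglog <- lm(log(price) ~ log(carat), data = diamonds)

r2_linear <- summary(fit_linear)$r.squared
r2_loglog <- summary(fit_loglog)$r.squared

c(linear = r2_linear, log_log = r2_loglog)
```

The two R-squared values describe different response scales, so they are a rough guide rather than a formal comparison.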
Question: What does the “lm” option do in the geom_smooth command? What are the other options and what do they do?
The “lm” option tells ggplot to fit a linear model, which draws a straight best-fit regression line through the data. Other options include “loess”, which fits a smooth curve that follows the local pattern of the data; “gam”, which fits a generalized additive model and allows more flexible curves; and “glm”, which fits a generalized linear model for different types of response data.
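To see the difference between methods, the sketch below overlays an "lm" line and a "loess" curve on the same points. It uses a random sample of 2,000 diamonds, which is an assumption made here because loess can be slow and memory-hungry on the full dataset; the sample size and seed are arbitrary.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed so the sample is reproducible
diamonds_small <- diamonds[sample(nrow(diamonds), 2000), ]

p_methods <- ggplot(diamonds_small, aes(x = carat, y = price)) +
  geom_point(alpha = 0.2) +
  # straight regression line
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "blue") +
  # locally weighted smooth curve
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "red") +
  theme_minimal()

p_methods
```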
Because the dataset is large, reduce overplotting by:
Adjusting alpha
Changing point size
Trying geom_jitter()
ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_jitter(alpha = 0.3, size = 1) +
  geom_smooth(method = "lm") +
  labs(title = "Diamond Price vs Carat by Cut",
       x = "Carat",
       y = "Price") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Question: Why is overplotting a concern with large datasets?
Many data points overlap, which hides how dense the data are and makes the true pattern hard to see.
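The amount of overlap can be estimated directly from the data. The short check below counts how few distinct carat values there are compared with the number of rows, and how many points land on exactly the same (carat, price) position as an earlier point. Object names are hypothetical.

```r
library(ggplot2)  # provides the diamonds dataset

n_points <- nrow(diamonds)

# Many diamonds share the same recorded carat value
n_unique_carat <- length(unique(diamonds$carat))

# Points drawn exactly on top of an earlier point with the same carat and price
n_duplicate_xy <- sum(duplicated(diamonds[, c("carat", "price")]))

c(points = n_points, unique_carat = n_unique_carat, duplicate_xy = n_duplicate_xy)
```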
Question: What does the alpha command do and how does it help with overplotting?
It makes the points more transparent, so areas where many points overlap appear darker and dense regions become easier to see.
Question: Based on what you see, what are the risks and benefits of using geom_jitter?
The benefit is that it spreads points out so overlapping points are easier to see, but the risk is that it slightly moves the points and may make the data look a little less exact.
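The risk can be limited by controlling how far points are allowed to move. The sketch below uses `position_jitter()` inside `geom_point()` so that the jitter is small, horizontal only, and reproducible. The width of 0.005 and the seed of 42 are arbitrary values picked for illustration.

```r
library(ggplot2)

p_jitter <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point(
    alpha = 0.3, size = 1,
    # width: maximum horizontal shift; height = 0 leaves price untouched;
    # seed makes the random shifts the same on every run
    position = position_jitter(width = 0.005, height = 0, seed = 42)
  ) +
  theme_minimal()

p_jitter
```

Keeping `height = 0` means the plotted prices stay exact and only the carat positions are nudged.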
Create a scatter plot:
table vs price
Points colored by clarity
Facet by cut (we learn a lot more about this later, but just give it a try!)
ggplot(diamonds, aes(x = table, y = price, color = clarity)) +
  geom_point(alpha = 0.3, size = 1) +
  facet_wrap(~cut) +
  theme_minimal()
Question: Does the relationship differ by cut?
Yes, the relationship differs by cut. Each category shows a slightly different pattern and spread of prices.
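A numeric summary can complement the faceted plot. The sketch below computes the correlation between table and price separately for each cut; it is only a rough check, since correlation captures linear association and ignores clarity. The name `cor_by_cut` is hypothetical.

```r
library(ggplot2)  # provides the diamonds dataset

# Split the data by cut, then compute the table-price correlation in each group
cor_by_cut <- sapply(
  split(diamonds, diamonds$cut),
  function(d) cor(d$table, d$price)
)

round(cor_by_cut, 2)
```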
The economics dataset contains monthly US economic data over time.
Create a line plot:
x-axis: date
y-axis: unemploy
ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()
Question: Describe the overall trend over time.
The overall trend shows that unemployment changes over time, with several increases and decreases.
## Task 7: Multiple Lines on One Plot
Reshape the data using pivot_longer() to plot:
uempmed
psavert
Then create a multi-line plot with:
library(tidyr)
economics_long <- pivot_longer(economics, cols = c(uempmed, psavert),
                               names_to = "variable", values_to = "value")
ggplot(economics_long, aes(x = date, y = value, color = variable)) +
  geom_line()
Question: Do these variables appear to move together over time?
No, the variables do not move together consistently over time. Sometimes they increase or decrease at the same time, but they often follow different trends.
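The visual answer can be checked with a correlation between the two series. The sketch below looks at both the raw levels and the month-to-month changes, on the assumption that changes are a fairer test of "moving together" when both series have long-run trends. Object names are hypothetical.

```r
library(ggplot2)  # provides the economics dataset

# Correlation between the two series as recorded
cor_levels <- cor(economics$uempmed, economics$psavert,
                  use = "complete.obs")

# Correlation between month-to-month changes, which removes long-run trends
cor_changes <- cor(diff(economics$uempmed), diff(economics$psavert),
                   use = "complete.obs")

c(levels = cor_levels, changes = cor_changes)
```

Values near zero would be consistent with the two variables not moving together consistently.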
Enhance your plot by:
Changing line width
Customizing colors
Formatting the date axis
Adding title, subtitle, and caption
Applying a theme (theme_bw() or theme_classic())
ggplot(economics_long, aes(x = date, y = value, color = variable)) +
  geom_line(linewidth = 1.2) +
  scale_color_manual(values = c("pink", "yellow")) +
  scale_x_date(date_labels = "%Y", date_breaks = "5 years") +
  labs(
    title = "Economic Trends Over Time",
    subtitle = "U.S. Unemployment Duration vs Personal Savings Rate",
    caption = "Source: ggplot2 economics dataset",
    x = "Year",
    y = "Value"
  ) +
  theme_classic()
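When colors are given as an unnamed vector, they are assigned in the order of the variable's levels, which can be easy to misread. A possible variation, sketched below, names each color by the variable it belongs to so the mapping is explicit. The color choices simply mirror the ones above, and `p_named` is a hypothetical name.

```r
library(ggplot2)
library(tidyr)

# Same reshaping step as in the task above
economics_long <- pivot_longer(economics, cols = c(uempmed, psavert),
                               names_to = "variable", values_to = "value")

p_named <- ggplot(economics_long, aes(x = date, y = value, color = variable)) +
  geom_line(linewidth = 1.2) +
  # Named vector: each color is tied to a specific variable
  scale_color_manual(values = c(psavert = "pink", uempmed = "yellow")) +
  scale_x_date(date_labels = "%Y", date_breaks = "5 years") +
  theme_classic()

p_named
```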