The diamonds dataset contains information about ~54,000 diamonds, including price, carat, cut, clarity, and color.
Create a scatter plot with:
x-axis: carat
y-axis: price
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
diamonds %>%
  ggplot(aes(x = carat, y = price)) +
  geom_point()
Question: What type of relationship appears between carat and price?
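A positive relationship: price increases as carat increases. It is not strictly linear, because price rises more steeply and varies more widely at higher carat values.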
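One rough way to check how linear the relationship is, offered here as an illustration and not as part of the original exercise, is to compare the correlation on the raw scale with the correlation after log-transforming both variables. The names r_raw and r_log are made up for this sketch.

```r
library(ggplot2)  # provides the diamonds dataset

# Pearson correlation on the original scale
r_raw <- cor(diamonds$carat, diamonds$price)

# Pearson correlation after log-transforming both variables;
# a higher value here suggests the raw relationship is curved
r_log <- cor(log(diamonds$carat), log(diamonds$price))

c(raw = r_raw, log = r_log)
```

A correlation coefficient only measures how well a straight line fits, so this is a quick check and not a replacement for looking at the plot.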
Modify your plot:
Color points by cut
Add a meaningful title and axis labels
Apply theme_minimal()
diamonds %>%
  ggplot(aes(x = carat, y = price, colour = cut)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Diamond Price by Carat",
       x = "Carats",
       y = "Price")
Question: Which cut appears to have higher prices at similar carat values?
The Ideal cut.
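Reading this off a crowded scatter plot is difficult, so a numeric summary can support the visual impression. The sketch below compares median price by cut for diamonds close to one carat. The 0.9 to 1.1 carat window and the name price_by_cut are arbitrary choices made for this example.

```r
library(ggplot2)
library(dplyr)

# Keep only diamonds close to one carat (the window is an arbitrary choice)
price_by_cut <- diamonds %>%
  filter(carat >= 0.9, carat <= 1.1) %>%
  group_by(cut) %>%
  summarise(n = n(), median_price = median(price))

price_by_cut
```

Changing the carat window may change the ranking, so it is worth trying more than one range before drawing a conclusion.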
Add a regression line:
diamonds %>%
  ggplot(aes(x = carat, y = price, colour = cut)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Diamond Price by Carat",
       x = "Carats",
       y = "Price") +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
Question: Does the relationship between carat and price appear linear?
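The fitted lines are straight because the “lm” method always draws a linear line of best fit, so the lines alone cannot show that the relationship is linear. There are also so many data points that the line would probably not shift very far even where the points depart from a straight line.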
Question: What does the “lm” option do in the geom_smooth command? What are the other options and what do they do?
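It tells geom_smooth() to fit a linear model, a straight line of best fit, to the relationship between the two variables. Other options include “loess”, a local smoother suited to smaller datasets, “gam”, a generalized additive model that geom_smooth() uses by default for larger datasets, “glm”, and a fitting function such as MASS::rlm, which is a more robust version of “lm”.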
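The difference between the methods is easiest to see when two of them are drawn on the same plot. The sketch below overlays an “lm” line and a “loess” curve on a random sample of 2,000 diamonds, because loess is slow on the full dataset. The sample size, the seed, and the names diamonds_sample and p_smooth are assumptions made for this example.

```r
library(ggplot2)
library(dplyr)

set.seed(1)  # make the random sample reproducible
diamonds_sample <- diamonds %>% slice_sample(n = 2000)

p_smooth <- ggplot(diamonds_sample, aes(x = carat, y = price)) +
  geom_point(alpha = 0.3, size = 0.7) +
  # straight line of best fit
  geom_smooth(method = "lm", formula = y ~ x, colour = "blue", se = FALSE) +
  # local smoother that is free to bend with the data
  geom_smooth(method = "loess", formula = y ~ x, colour = "red", se = FALSE)

p_smooth
```

Where the red curve pulls away from the blue line, a straight line is a poor description of the data in that range.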
Because the dataset is large, reduce overplotting by:
Adjusting alpha
Changing point size
Trying geom_jitter()
diamonds %>%
  ggplot(aes(x = carat, y = price, colour = cut)) +
  #geom_point(alpha = 0.3, size = 0.7) +
  geom_jitter(alpha = 0.3, size = 0.7, position = "jitter") +
  theme_minimal() +
  labs(title = "Diamond Price by Carat",
       x = "Carats",
       y = "Price") +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
? jitter
## starting httpd help server ... done
Question: Why is overplotting a concern with large datasets? - The larger the dataset, the more points are drawn on the figure. Points pile on top of one another, so the figure becomes cluttered and hides how dense the data are, and the viewer cannot judge the correlation without supplemental information.
Question: What does the alpha command do and how does it help with overplotting? - alpha sets the transparency of the data points or boxes. Overlapping points build up into darker areas, which makes overlapping data points visible.
Question: Based on what you see, what are the risks and benefits of using geom_jitter? - geom_jitter can help spread the data points along the x or y axis, which makes clustered points easier to see. The risk is that the added random noise moves points away from their true values, so exact positions can no longer be read from the plot.
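The amount of overplotting can be counted directly, and the jitter can be restricted so that only one axis is disturbed. The sketch below counts the positions shared by more than one diamond, then jitters horizontally by a small fixed amount with a seed so the plot is reproducible. The width of 0.01, the seed, and the names overlaps and p_jitter are assumptions made for this example.

```r
library(ggplot2)
library(dplyr)

# Count how many rows share each exact (carat, price) position
overlaps <- diamonds %>%
  count(carat, price, name = "n_points") %>%
  filter(n_points > 1)

nrow(overlaps)  # number of positions holding more than one diamond

# Jitter horizontally only, by a small fixed amount, so prices stay exact
p_jitter <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.3, size = 0.7,
             position = position_jitter(width = 0.01, height = 0, seed = 1))

p_jitter
```

Setting height = 0 keeps every price at its true value, which limits the main risk of jittering to the carat axis.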
Create a scatter plot:
table vs price
Points colored by clarity
Facet by cut (we learn a lot more about this later, but just give it a try!)
diamonds %>%
  ggplot(aes(x = table, y = price, colour = clarity)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~cut)
Question: Does the relationship differ by cut? - Yes, it looks like the mix of clarity grades changes from one cut to another.
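The faceted plot can be backed up with a per-cut summary of how strongly table and price are related. This is a sketch and not part of the original exercise, and the name table_price_by_cut is made up for the example.

```r
library(ggplot2)
library(dplyr)

# Correlation between table and price within each cut
table_price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(n = n(), correlation = cor(table, price))

table_price_by_cut
```

Correlations that differ noticeably between cuts would support the view that the relationship depends on cut.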
The economics dataset contains monthly US economic data over time.
Create a line plot:
x-axis: date
y-axis: unemploy
economics %>%
  ggplot(aes(x = date, y = unemploy)) +
  geom_line()
Question: Describe the overall trend over time.
Reshape the data using pivot_longer() to plot:
uempmed
psavert
Then create a multi-line plot with:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ lubridate 1.9.4 ✔ tibble 3.3.0
## ✔ purrr 1.2.0 ✔ tidyr 1.3.1
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
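economics %>%
  # pivot_longer() names its new columns "name" and "value" by default,
  # so the column names used in aes() are set explicitly here
  pivot_longer(cols = c(uempmed, psavert),
               names_to = "variable", values_to = "value") %>%
  # plot the reshaped values, not unemploy, so each variable gets its own line
  ggplot(aes(x = date, y = value, colour = variable)) +
  geom_line()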
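The behaviour of pivot_longer() is easier to follow on a very small table. By default it names its output columns name and value, so a column called variable exists only when names_to is set. The tibble toy below is hypothetical, with made-up values that reuse the column names from economics.

```r
library(dplyr)
library(tibble)
library(tidyr)

# Hypothetical two-month table with the same column names as economics
toy <- tibble(
  date    = as.Date(c("2000-01-01", "2000-02-01")),
  uempmed = c(5.8, 6.1),
  psavert = c(5.4, 4.8)
)

# Default output columns are called name and value
long_default <- toy %>% pivot_longer(cols = c(uempmed, psavert))

# names_to and values_to choose the output column names explicitly
long_named <- toy %>%
  pivot_longer(cols = c(uempmed, psavert),
               names_to = "variable", values_to = "value")

names(long_default)
names(long_named)
```

Each original row becomes two rows, one per pivoted column, which is the shape ggplot() needs to draw one line per variable.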
Question: Do these variables appear to move together over time?
Enhance your plot by:
Changing line width
Customizing colors
Formatting the date axis
Adding title, subtitle, and caption
Applying a theme (theme_bw() or theme_classic())
head(economics)
## # A tibble: 6 × 6
## date pce pop psavert uempmed unemploy
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1967-07-01 507. 198712 12.6 4.5 2944
## 2 1967-08-01 510. 198911 12.6 4.7 2945
## 3 1967-09-01 516. 199113 11.9 4.6 2958
## 4 1967-10-01 512. 199311 12.9 4.9 3143
## 5 1967-11-01 517. 199498 12.8 4.7 3066
## 6 1967-12-01 525. 199657 11.8 4.8 3018
economics %>%
  # unemploy is already a single column, so no pivot_longer() is needed here
  ggplot(aes(x = date, y = unemploy)) +
  # a fixed colour goes outside aes(); inside aes() it is treated as data
  geom_line(colour = "red", linewidth = 1.5) +
  theme_minimal() +
  labs(title = "Unemployment Over Time",
       x = "Year",
       y = "Unemployed (thousands)",
       subtitle = "Monthly Number of Unemployed People in the US, in Thousands") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        title = element_text(face = "bold"))
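The enhancement steps listed above can also be applied to the two reshaped series, uempmed and psavert. The following is a sketch under assumptions and not the expected solution: the colours, the ten-year breaks, the label text, and the names econ_long and p_econ are all choices made for this example.

```r
library(ggplot2)
library(dplyr)
library(tidyr)

# Reshape so each series is identified by a "variable" column
econ_long <- economics %>%
  pivot_longer(cols = c(uempmed, psavert),
               names_to = "variable", values_to = "value")

p_econ <- ggplot(econ_long, aes(x = date, y = value, colour = variable)) +
  geom_line(linewidth = 1) +
  # one manually chosen colour and readable label per series
  scale_colour_manual(
    values = c(uempmed = "firebrick", psavert = "steelblue"),
    labels = c(uempmed = "Median unemployment duration (weeks)",
               psavert = "Personal savings rate (%)")
  ) +
  # a tick every ten years, labelled with the four-digit year
  scale_x_date(date_breaks = "10 years", date_labels = "%Y") +
  labs(title = "Unemployment Duration and Savings Rate Over Time",
       subtitle = "Monthly US data",
       caption = "Source: economics dataset in ggplot2",
       x = "Year", y = "Value", colour = NULL) +
  theme_bw() +
  theme(legend.position = "bottom")

p_econ
```

The two series are measured in different units, weeks and percent, so the shared y-axis is only suitable for comparing their general movement over time.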