Part 1: Scatter Plots (Using diamonds)

The diamonds dataset contains information about ~54,000 diamonds, including price, carat, cut, clarity, and color.

Task 1: Basic Scatter Plot

Create a scatter plot with:

  • x-axis: carat

  • y-axis: price

library(ggplot2)
library(dplyr)
diamonds %>%
  ggplot(aes(x = carat, y = price)) + 
  geom_point()

Question: What type of relationship appears between carat and price?

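A positive correlation: price increases as carat increases. The trend is only roughly linear, since the points curve upward and spread out at larger carat values.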
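One way to probe the shape of this relationship is to compute the correlation and redraw the plot on log-log axes. The sketch below assumes that a curved pattern on the raw scale may straighten after a log transform; the object names `r` and `p_log` are illustrative and not part of the assignment.

```r
library(ggplot2)

# Pearson correlation measures the strength of the *linear* association only
r <- cor(diamonds$carat, diamonds$price)
r

# Same scatter plot on log10 axes; a curved cloud that straightens here
# suggests a multiplicative relationship rather than a strictly linear one
p_log <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +
  scale_x_log10() +
  scale_y_log10()
p_log
```

A high correlation alone does not prove linearity, which is why the plot is worth checking alongside the number.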

Task 2: Add Aesthetic Mappings

Modify your plot:

  • Color points by cut

  • Add a meaningful title and axis labels

  • Apply theme_minimal()

diamonds %>%
  ggplot(aes(x = carat, y = price, colour = cut)) + 
  geom_point() + 
  theme_minimal() + 
  labs(title = "Diamond Price by Carat", 
       x = "Carats", 
       y = "Price")

Question: Which cut appears to have higher prices at similar carat values?

The Ideal cut.

Task 3: Add a Trend Line

Add a regression line:

  • geom_smooth(method = "lm")

diamonds %>%
  ggplot(aes(x = carat, y = price, colour = cut)) + 
  geom_point() + 
  theme_minimal() + 
  labs(title = "Diamond Price by Carat", 
       x = "Carats", 
       y = "Price") + 
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

Question: Does the relationship between carat and price appear linear?

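Only approximately. The fitted lines are straight because method = "lm" always draws a linear line of best fit, so the line by itself does not show that the relationship is linear. The points curve upward away from the lines at larger carat values. There are also so many data points that the line would probably not shift far even if the relationship were not linear.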

Question: What does the “lm” option do in the geom_smooth command? What are the other options and what do they do?

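It tells geom_smooth() to fit a linear model (stats::lm()) and draw the fitted straight line for the relationship between the two variables. Other options include "loess", a local regression smoother that ggplot2 uses by default for fewer than 1,000 observations; "gam", a generalized additive model that is the default for larger datasets; "glm", a generalized linear model; and a function such as MASS::rlm, a robust version of lm that is less sensitive to outliers.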
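To see how the choice of method changes the trend line, a straight-line fit and a flexible smooth can be overlaid on the same plot. This is a minimal sketch, assuming the mgcv package (normally installed alongside R) is available for method = "gam"; the colours and the name `p_smooth` are arbitrary choices.

```r
library(ggplot2)

# Assumption: mgcv is installed, because method = "gam" relies on it
p_smooth <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.05, size = 0.5) +
  # Straight-line fit from stats::lm()
  geom_smooth(method = "lm", formula = y ~ x, colour = "blue", se = FALSE) +
  # Flexible smooth from mgcv::gam(), ggplot2's default choice for large data
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"),
              colour = "red", se = FALSE)
p_smooth
```

Where the two lines diverge, the straight-line model is a poorer description of the data.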

Task 4: Improve Visualization

Because the dataset is large, reduce overplotting by:

  • Adjusting alpha

  • Changing point size

  • Trying geom_jitter()

diamonds %>%
  ggplot(aes(x = carat, y = price, colour = cut)) + 
  # Alternative without jitter: geom_point(alpha = 0.3, size = 0.7)
  geom_jitter(alpha = 0.3, size = 0.7) +  # position = "jitter" is already the default
  theme_minimal() + 
  labs(title = "Diamond Price by Carat", 
       x = "Carats", 
       y = "Price") + 
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

?geom_jitter

Question: Why is overplotting a concern with large datasets?

The larger the dataset, the more points overlap on the figure. Dense regions become cluttered, so the viewer cannot tell how many observations sit in a given area or understand the correlation without supplemental information.

Question: What does the alpha command do and how does it help with overplotting?

alpha sets the transparency of the data points or boxes. With partly transparent points, overlapping observations build up into darker areas, so dense regions stay visible.

Question: Based on what you see, what are the risks and benefits of using geom_jitter?

Benefit: geom_jitter spreads the data points along the x or y axis, which makes clustered points easier to see. Risk: it does this by adding random noise, so the plotted positions no longer match the exact recorded values, and the plot changes slightly each time it is drawn.
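The amount of jitter can be set explicitly and made repeatable, which limits the risk of distorting the data. The sketch below assumes a horizontal nudge of at most 0.02 carat and no vertical noise; these values and the names `p_jitter`, `built`, and `max_shift` are illustrative.

```r
library(ggplot2)

# position_jitter() adds uniform random noise; width and height are in data units
p_jitter <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(
    alpha = 0.3, size = 0.7,
    # seed makes the random noise the same on every run
    position = position_jitter(width = 0.02, height = 0, seed = 1)
  )
p_jitter

# Compare the plotted x positions with the recorded carat values
built <- layer_data(p_jitter, 1)
max_shift <- max(abs(built$x - diamonds$carat))
max_shift
```

Keeping the noise small relative to the axis range preserves the overall pattern while still separating stacked points.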

Task 5: Challenge Scatter Plot

Create a scatter plot:

  • table vs price

  • Points colored by clarity

  • Facet by cut (we learn a lot more about this later, but just give it a try!)

diamonds %>%
  ggplot(aes(x = table, y = price, colour = clarity)) + 
  geom_point(alpha = 0.5) +
  facet_wrap(~cut)

Question: Does the relationship differ by cut?

Yes, it looks like the clarity changes for different cuts.
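A numeric summary can back up what the facets show. The following sketch computes the correlation between table and price within each cut; the column names `n` and `r_table_price` and the object name `table_price_by_cut` are made up for this example.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Correlation between table and price within each cut;
# values near 0 indicate little linear association
table_price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(
    n = n(),
    r_table_price = cor(table, price),
    .groups = "drop"
  )
table_price_by_cut
```

If the correlations are similar across cuts, the table-price relationship itself does not change much from panel to panel.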

Part 2: Line Plots (Using economics Dataset)

The economics dataset contains monthly US economic data over time.

Task 6: Basic Line Plot

Create a line plot:

  • x-axis: date

  • y-axis: unemploy

economics %>%
  ggplot(aes(x = date, y = unemploy)) + 
  geom_line()

Question: Describe the overall trend over time.

Task 7: Multiple Lines on One Plot

Reshape the data using pivot_longer() to plot:

  • uempmed

  • psavert

Then create a multi-line plot with:

  • color = variable

library(tidyverse)
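# By default pivot_longer() calls its new columns `name` and `value`, so a
# column called `variable` does not exist unless it is named explicitly.
economics %>%
  pivot_longer(cols = c(uempmed, psavert),
               names_to = "variable",
               values_to = "value") %>%
  ggplot(aes(x = date, y = value, colour = variable)) +
  geom_line()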
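It can help to inspect the reshaped data before plotting it. This sketch shows what pivot_longer() produces for the two columns named in the task; the output column names `variable` and `value` and the object name `long` are choices made here, not defaults.

```r
library(ggplot2)  # provides the economics dataset
library(dplyr)
library(tidyr)

long <- economics %>%
  select(date, uempmed, psavert) %>%
  pivot_longer(cols = c(uempmed, psavert),
               names_to = "variable",  # holds the old column names
               values_to = "value")    # holds the old cell values

# Each date now appears twice: once per original column
head(long, 4)
nrow(long) == 2 * nrow(economics)
```

Once the data is in this long form, `value` goes on the y-axis and `variable` drives the colour.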

Question: Do these variables appear to move together over time?

Task 8: Customize Your Line Plot

Enhance your plot by:

  • Changing line width

  • Customizing colors

  • Formatting the date axis

  • Adding title, subtitle, and caption

  • Applying a theme (theme_bw() or theme_classic())

head(economics)
## # A tibble: 6 × 6
##   date         pce    pop psavert uempmed unemploy
##   <date>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
## 1 1967-07-01  507. 198712    12.6     4.5     2944
## 2 1967-08-01  510. 198911    12.6     4.7     2945
## 3 1967-09-01  516. 199113    11.9     4.6     2958
## 4 1967-10-01  512. 199311    12.9     4.9     3143
## 5 1967-11-01  517. 199498    12.8     4.7     3066
## 6 1967-12-01  525. 199657    11.8     4.8     3018
economics %>%
  ggplot(aes(x = date, y = unemploy)) +
  # A fixed colour belongs outside aes(); inside aes() "red" is treated as data
  geom_line(colour = "red", linewidth = 1.5) +
  theme_minimal() +
  labs(title = "US Unemployment Over Time",
       x = "Year",
       y = "Unemployed (thousands)",
       subtitle = "Monthly number of unemployed people in the US, in thousands") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        title = element_text(face = "bold"))
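The task list also asks for a formatted date axis, a caption, and theme_bw() or theme_classic(). The sketch below shows one possible way to cover those items; the break spacing, colour, caption wording, and the name `p_econ` are illustrative assumptions.

```r
library(ggplot2)

p_econ <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line(colour = "steelblue", linewidth = 1) +
  # One labelled break every 10 years, shown as a four-digit year
  scale_x_date(date_breaks = "10 years", date_labels = "%Y") +
  labs(title = "US Unemployment Over Time",
       subtitle = "Monthly number of unemployed, in thousands",
       caption = "Source: economics dataset shipped with ggplot2",  # illustrative text
       x = "Year",
       y = "Unemployed (thousands)") +
  theme_bw()
p_econ
```

With year-only labels at wide intervals, the axis text usually fits without rotation.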