install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# ggplot2 and dplyr are part of the core tidyverse and were already attached by
# library(tidyverse) above, so separate install.packages() calls are not needed.
library(ggplot2)
library(dplyr)
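
Calling install.packages() every time the document is knit reinstalls the package on each run. A minimal sketch of one common alternative is below; the repos URL is an assumption (the CRAN cloud mirror) and can be swapped for any mirror.

```r
# Install tidyverse only when it is missing, then attach it.
# Assumption: the CRAN cloud mirror is reachable from this machine.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse", repos = "https://cloud.r-project.org")
}
library(tidyverse)
```

With this pattern the install step runs once, and later knits only attach the packages.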

Part 1: Scatter Plots (Using diamonds)

The diamonds dataset contains information about ~54,000 diamonds, including price, carat, cut, clarity, and color.

Task 1: Basic Scatter Plot

Create a scatter plot with:

  • x-axis: carat

  • y-axis: price

ggplot(diamonds, aes(carat, price)) +
  geom_point()

Question: What type of relationship appears between carat and price?

There appears to be a strong positive relationship between carat and price: price rises as carat increases, and the spread in price widens at larger carat values.
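
One way to check this impression numerically is to compute a correlation. The sketch below compares Pearson (linear association) with Spearman (any monotonic trend); the variable names are only illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

# Pearson correlation measures linear association only
pearson <- cor(diamonds$carat, diamonds$price)

# Spearman works on ranks, so it captures any monotonic trend
spearman <- cor(diamonds$carat, diamonds$price, method = "spearman")

c(pearson = pearson, spearman = spearman)
```

If the Spearman value is noticeably higher than the Pearson value, that suggests the trend is monotonic but not a straight line.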

Task 2: Add Aesthetic Mappings

Modify your plot:

  • Color points by cut

  • Add a meaningful title and axis labels

  • Apply theme_minimal()

ggplot(diamonds, aes(carat, price, colour = cut)) +
  geom_point() +
  labs(
    title = "Diamond Price vs. Carat by Cut Quality",
    x = "Carat Size",
    y = "Price (Dollars)"
    ) +
  theme_minimal()

Question: Which cut appears to have higher prices at similar carat values?

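At similar carat values, the better cuts (Ideal, Premium, and Very Good) tend to sit at higher prices than Fair diamonds, although the groups overlap heavily. The difference is easiest to see between about 1 and 2 carats, where diamonds of the same carat cover a wide range of prices.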
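
Because the colors overlap so much, a summary table can help answer this question. The sketch below compares median price by cut within a narrow carat band; the 1.0 to 1.1 band is an arbitrary choice for illustration, not part of the assignment.

```r
library(ggplot2)  # diamonds dataset
library(dplyr)

price_by_cut <- diamonds %>%
  filter(carat >= 1.0, carat <= 1.1) %>%   # hypothetical band of similar carat values
  group_by(cut) %>%
  summarise(
    n = n(),                               # diamonds of this cut in the band
    median_price = median(price),
    .groups = "drop"
  ) %>%
  arrange(desc(median_price))              # highest median price first

price_by_cut
```

Changing the band limits shows whether the ordering of the cuts holds at other carat values.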

Task 3: Add a Trend Line

Add a regression line:

  • geom_smooth(method = "lm")

# Map colour inside geom_point() so geom_smooth() fits a single line to all diamonds
ggplot(diamonds, aes(carat, price)) +
  geom_point(aes(colour = cut)) +
  labs(
    title = "Diamond Price vs. Carat by Cut Quality",
    x = "Carat Size",
    y = "Price (Dollars)"
  ) +
  theme_minimal() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

Question: Does the relationship between carat and price appear linear?

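The relationship between carat and price does not appear to be strictly linear. Prices climb steeply at lower carat values and then level off near the top of the price range, so some 2-carat diamonds cost about as much as a 5-carat diamond.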
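
A rough way to test this is to fit a straight line on the raw scale and on the log scale and compare how well each fits. This is only a sketch; R-squared values on different response scales are not strictly comparable, so the result is an indication rather than a formal test.

```r
library(ggplot2)  # diamonds dataset

# Straight-line fit on the original scale
fit_raw <- lm(price ~ carat, data = diamonds)

# Straight-line fit after log-transforming both variables
fit_log <- lm(log(price) ~ log(carat), data = diamonds)

r2 <- c(
  raw     = summary(fit_raw)$r.squared,
  log_log = summary(fit_log)$r.squared
)
r2
```

A better fit on the log scale is consistent with a curved relationship on the original scale.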

Question: What does the “lm” option do in the geom_smooth command? What are the other options and what do they do?

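In geom_smooth(), method = "lm" fits a linear model and draws the fitted straight line with a confidence band. Other methods are "loess", which fits a flexible local regression curve; "gam", which fits a smooth curve using a generalized additive model from the mgcv package; and "glm", which fits a generalized linear model whose shape depends on the family used. If no method is given, geom_smooth() uses loess for fewer than 1,000 observations and gam otherwise.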
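
To see how the methods differ, two smoothers can be layered on the same plot. The sketch below uses a random sample of 2,000 diamonds because loess is slow on the full dataset; the sample size and seed are assumptions made for illustration.

```r
library(ggplot2)

set.seed(1)  # assumption: fixed seed so the sample is reproducible
diamonds_small <- diamonds[sample(nrow(diamonds), 2000), ]

p_methods <- ggplot(diamonds_small, aes(carat, price)) +
  geom_point(alpha = 0.2) +
  # straight line from a linear model
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "red") +
  # flexible curve from local regression
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, colour = "blue")

p_methods
```

The red line is forced to be straight, while the blue curve is free to bend where the data do.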

Task 4: Improve Visualization

Because the dataset is large, reduce overplotting by:

  • Adjusting alpha

  • Changing point size

  • Trying geom_jitter()

ggplot(diamonds, aes(carat, price, colour = cut)) +
  geom_jitter(size = 0.7, alpha = 0.3) +
  labs(
    title = "Diamond Price vs. Carat by Cut Quality",
    x = "Carat Size",
    y = "Price (Dollars)"
    ) +
  theme_minimal()

Question: Why is overplotting a concern with large datasets?

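Overplotting is a concern with large datasets because many points are drawn on top of each other. The plot looks busy, and it hides how many observations sit in each region, so dense and sparse areas can look the same.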
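
A quick count shows how much overlap there is in this dataset. The sketch below counts carat and price combinations that occur more than once; the name n_points is just an illustrative column name.

```r
library(ggplot2)  # diamonds dataset
library(dplyr)

# Combinations of carat and price shared by more than one diamond
overlap <- diamonds %>%
  count(carat, price, name = "n_points") %>%
  filter(n_points > 1)

nrow(overlap)            # number of locations with stacked points
max(overlap$n_points)    # largest stack at a single location
```

Each of these locations shows up as a single dot in a plain scatter plot, no matter how many diamonds it holds.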

Question: What does the alpha command do and how does it help with overplotting?

The alpha argument sets the transparency of the points. With a low alpha, areas where many points overlap look darker and areas with few points look lighter, which makes the density of the data easier to see.

Question: Based on what you see, what are the risks and benefits of using geom_jitter?

Compared with the plain scatter plot, the jitter plot looks less busy because overlapping points are spread apart. The risk is that geom_jitter() adds random noise, so the points no longer sit at their exact values. This can misrepresent the data, especially in smaller datasets, but in larger ones it helps clean up the plot.
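
The amount of noise can be controlled so that the risk stays small. The sketch below uses position_jitter() with a fixed seed and jitters only along the x-axis; the width of 0.01 and the seed of 42 are assumed values chosen for illustration.

```r
library(ggplot2)

p_jitter <- ggplot(diamonds, aes(carat, price)) +
  geom_point(
    # width: maximum horizontal shift; height = 0 leaves price untouched;
    # seed makes the random shifts identical on every run
    position = position_jitter(width = 0.01, height = 0, seed = 42),
    size = 0.7,
    alpha = 0.3
  )

p_jitter
```

Keeping the height at 0 means the prices shown are still the true prices, and only the carat positions move slightly.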

Task 5: Challenge Scatter Plot

Create a scatter plot:

  • table vs price

  • Points colored by clarity

  • Facet by cut (we learn a lot more about this later, but just give it a try!)

ggplot(diamonds, aes(table,price, colour = clarity)) +
  geom_point(size = 0.7, alpha = 0.5) +
  labs(
    title = "Diamond Price vs. Table",
    x = "Table",
    y = "Price (Dollars)"
    ) +
  theme_minimal()+
  facet_wrap(~cut)

Question: Does the relationship differ by cut?

The relationship between table and price does not seem to differ by cut.
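
The facets can be backed up with a number for each panel. A minimal sketch, assuming a simple Pearson correlation is a reasonable summary here:

```r
library(ggplot2)  # diamonds dataset
library(dplyr)

# Correlation between table and price within each cut
table_price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(correlation = cor(table, price), .groups = "drop")

table_price_by_cut
```

Similar values across the five cuts would support the conclusion that the relationship does not differ much by cut.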

Part 2: Line Plots (Using economics Dataset)

The economics dataset contains monthly US economic data over time.

Task 6: Basic Line Plot

Create a line plot:

  • x-axis: date

  • y-axis: unemploy

ggplot(economics, aes(date, unemploy))+ 
  geom_line()

Question: Describe the overall trend over time.

The trend over time looks like a series of waves. Unemployment generally increases as time goes on, but after each peak it falls back before climbing again.
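
The highest peak and the overall change can be read from the data directly. The sketch below is one way to do that; the object names are only illustrative.

```r
library(ggplot2)  # economics dataset
library(dplyr)

# Month with the highest number of unemployed (unemploy is in thousands)
peak <- economics %>%
  slice_max(unemploy, n = 1) %>%
  select(date, unemploy)

# First and last months, to compare the start and end of the series
endpoints <- economics %>%
  slice(c(1, n())) %>%
  select(date, unemploy)

peak
endpoints
```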

Task 7: Multiple Lines on One Plot

Reshape the data using pivot_longer() to plot:

  • uempmed

  • psavert

Then create a multi-line plot with:

  • color = variable

economics_pivot <- economics %>%
  pivot_longer(
    cols = c(uempmed, psavert),
    names_to = "variable",
    values_to = "value")

ggplot(economics_pivot, aes(date, value, color = variable))+
  geom_line()

Question: Do these variables appear to move together over time?

These variables do not seem to move together over time.
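
A correlation gives a rough check on this. The sketch below looks at the levels of the two series and at their month-to-month changes, since two series can trend in opposite directions while their short-term movements are unrelated.

```r
library(ggplot2)  # economics dataset

# Correlation between the levels of the two series
cor_levels <- cor(economics$uempmed, economics$psavert)

# Correlation between their month-to-month changes
cor_changes <- cor(diff(economics$uempmed), diff(economics$psavert))

c(levels = cor_levels, changes = cor_changes)
```

Values near 0 suggest the series do not move together, and a negative value suggests they tend to move in opposite directions.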

Task 8: Customize Your Line Plot

Enhance your plot by:

  • Changing line width

  • Customizing colors

  • Formatting the date axis

  • Adding title, subtitle, and caption

  • Applying a theme (theme_bw() or theme_classic())

ggplot(economics_pivot, aes(date, value, color = variable)) +
  geom_line(linewidth = 0.3) +
  scale_color_brewer(palette = "Set1") +
  scale_x_date(
    date_breaks = "5 years",
    date_labels = "%Y"
  ) +
  labs(
    title = "Unemployment Duration and Personal Savings Rate",
    subtitle = "United States",
    x = "Year",
    y = "Value",
    caption = "uempmed: median unemployment duration (weeks); psavert: personal savings rate (%)"
  ) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 50, hjust = 1))
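
Colors can also be chosen by hand instead of from a Brewer palette. The sketch below uses scale_color_manual() with named values and readable legend labels; the color names and the object name economics_long are assumptions made for illustration.

```r
library(ggplot2)  # economics dataset
library(tidyr)

# Rebuild the long data so this example runs on its own
economics_long <- pivot_longer(
  economics,
  cols = c(uempmed, psavert),
  names_to = "variable",
  values_to = "value"
)

p_custom <- ggplot(economics_long, aes(date, value, color = variable)) +
  geom_line(linewidth = 0.5) +
  scale_color_manual(
    # names match the values in the variable column
    values = c(psavert = "steelblue", uempmed = "firebrick"),
    labels = c(
      psavert = "Personal savings rate (%)",
      uempmed = "Median unemployment duration (weeks)"
    ),
    name = NULL  # drop the legend title
  ) +
  theme_bw()

p_custom
```

Naming the values ties each color to a specific series, so the colors stay the same even if the order of the variables changes.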