DS 2870: Module 2 Homework

Question 1: Box Plot for Olympian Ages

The data set olympics.csv (found at https://raw.githubusercontent.com/Shammalamala/DS-2870-Data-Sets/main/olympics.csv) has data on about 6000 Olympic athletes who competed in the 2024 Olympic Games in one of 10 sports:

Athletics, Swimming, Rowing, Judo, Shooting, Sailing, Volleyball, Equestrian, Fencing, Boxing, Cycling Road, Gymnastics

(Athletics is a catchall for track-and-field-style events.)

The two relevant columns are:

sport: Which of the 10 sports the athlete participated in
age: The age the athlete is at the start of the 2024 Olympic games

Using the data set, create the side-by-side box plots seen in Brightspace. The hex codes for the colors are #0081c8 and #FCB131.

To reorder the sports to match what is in Brightspace, use fct_reorder() (to see how it works, the help menu is your friend!)

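# Packages used throughout (assumed to be loaded in a setup chunk not shown here)
library(tidyverse)

ggplot(
  # Reading the data directly into ggplot (a bit of a shortcut)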
  data = read.csv("https://raw.githubusercontent.com/Shammalamala/DS-2870-Data-Sets/main/olympics.csv"),
  # Mapping age to x and a reordered sports (by age) to y
  mapping = aes(
    x = age,
    y = fct_reorder(sport, age)
  )
) + 
  # Creating the box plots of age by sport
  geom_boxplot(
    fill = "#0081c8",
    color = "#FCB131"
  ) + 
  # Changing the labels and adding a title and subtitle
  labs(
    x = NULL,    # Removing the space for an x-axis label
    y = "Sport",
    title = "Age of Olympians at the start of the Olympics by sport",
    subtitle = "10 of the most common sports only"
  ) + 
  # Changing the default theme and centering the title and subtitle
  theme_bw() + 
  theme(
    plot.title    = element_text(hjust = 0.5, size = 16, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12)
  )
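
As a quick illustration of what fct_reorder() does in the plot above, here is a minimal sketch on a made-up data frame (the `toy` data and its values are hypothetical, not taken from olympics.csv). By default, fct_reorder() sorts the levels of the first argument by the median of the second argument within each level, from smallest to largest.

```r
library(forcats)

# Hypothetical toy data (NOT from olympics.csv), two athletes per sport
toy <- data.frame(
  sport = c("Judo", "Judo", "Rowing", "Rowing", "Sailing", "Sailing"),
  age   = c(24, 26, 30, 32, 27, 29)
)

# Reorder the sport levels by the median age within each sport
# Medians: Judo = 25, Sailing = 28, Rowing = 31
reordered <- fct_reorder(toy$sport, toy$age)

levels(reordered)
# "Judo" "Sailing" "Rowing"
```

When the reordered factor is mapped to y, ggplot2 places the first level at the bottom of the axis, so the sport with the lowest median age ends up in the bottom box plot.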

Question 2: Used Cars

The used cars.csv file has info about 400 cars listed on Craigslist in 2023 (GitHub link: https://raw.githubusercontent.com/Shammalamala/DS-2870-Data-Sets/main/used%20cars.csv).

The columns are:

id: The unique ID for the car in the data set
price: The asking price of the used car on Craigslist (in dollars)
year: The year the car was manufactured
manufacturer: The maker of the car (Chevrolet, Ford, Honda, Jeep)
odometer: How many miles the car has been driven

Part 2a: Basic graph

Create the first graph for question 2 seen in Brightspace. Save it as gg_q2a and make sure to display it in the knitted document.

gg_q2a <- 
  ggplot(
    # Reading the data directly into ggplot()
    data = read.csv("https://raw.githubusercontent.com/Shammalamala/DS-2870-Data-Sets/main/used%20cars.csv"),
    # Mapping the columns to the corresponding aesthetics
    mapping = aes(
      x = odometer,
      y = price,
      color = manufacturer
    )
  ) + 
  # Adding the points and a trend line
  geom_point() +
  geom_smooth(
    method = "lm",        # Straight line
    se = FALSE,           # No shaded confidence region
    formula = y ~ x,      # Default formula
    show.legend = FALSE,  # No line in the color guide
    color = "black"       # black line (overrides color = manufacturer)
  )
gg_q2a

Part 2b: Adding context

Using gg_q2a, add the title, subtitle, and caption and change/remove the labels for x, y, and color as seen in Brightspace. Change the legend to match Brightspace and move the legend to the top right corner of the plot.

Save it as gg_q2b and display it in the knitted document.

gg_q2b <- 
  gg_q2a + 
  # Changing the labels and adding the title, subtitle, and caption
  labs(
    x = "Mileage",
    y = NULL,
    color = NULL,
    title = "Used Cars for Sale",
    subtitle = "Listed on Craigslist in 2023",
    caption = "Data: kaggle.com"
  ) + 
  # Changing the theme and moving the legend to inside the plot
  theme_classic() +
  theme(
    legend.position = "inside",
    legend.position.inside = c(0.9, 0.9)
  )

gg_q2b

Part 2c: Improving appearance

Make the final changes to the graph in gg_q2b that can be seen in Brightspace. Make sure to pay close attention to the color guide!

The colors used are:

Ford: #47a8e5
Chevrolet: #D1AD57
Honda: #CC0000
Jeep: #485F2B

gg_q2b +
  # Changing the y-axis labels to have $ and be every $10k
  scale_y_continuous(
    labels = scales::label_dollar(),
    breaks = seq(from = 0, to = 6e4, by = 1e4)
  ) + 
  # x-axis not in scientific notation
  scale_x_continuous(
    labels = scales::label_comma()
  ) +
  # Changing the colors used and labels for the manufacturer
  scale_color_manual(
    labels = c(chevrolet = "Chevrolet", ford = "Ford", 
               honda = "Honda", jeep = "Jeep"),
    values = c(chevrolet = "#D1AD57", ford = "#47a8e5", 
               honda = "#CC0000", jeep = "#485F2B")
  )
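
To see what the axis settings above produce, here is a small sketch that calls the same scales helpers directly. label_dollar() and label_comma() each return a function, which ggplot2 then applies to the axis breaks. The variable names and the mileage values below are made up for illustration.

```r
# Breaks every $10k from $0 to $60k, as in scale_y_continuous() above
price_breaks <- seq(from = 0, to = 6e4, by = 1e4)

# label_dollar() returns a labelling function; apply it to the breaks
dollar_labels <- scales::label_dollar()(price_breaks)
dollar_labels
# "$0" "$10,000" "$20,000" "$30,000" "$40,000" "$50,000" "$60,000"

# label_comma() works the same way (hypothetical mileage values)
comma_labels <- scales::label_comma()(c(0, 50000, 150000))
comma_labels
# "0" "50,000" "150,000"
```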

Question 3: Small Multiples

Create a set of 4 scatter plots with a fitted line in the same overall graph - 1 for each manufacturer.

Each individual plot should have odometer on the x-axis, price on the y-axis, and age of the car represented by color.

ggplot(
  data = read.csv("https://raw.githubusercontent.com/Shammalamala/DS-2870-Data-Sets/main/used%20cars.csv"),
  mapping = aes(
    x = odometer,
    y = price,
    color = 2023 - year  # age = 2023 - year
  )
) + 
  # Adding points and a trend line
  geom_point() +
  geom_smooth(
    method = "lm",
    se = FALSE,
    formula = y ~ x,
    show.legend = FALSE,
    color = "black"
  ) + 
  # Adding context
  labs(
    x = "Mileage",
    y = NULL,
    color = "Age",
    title = "Used Cars for Sale",
    subtitle = "Listed on Craigslist in 2023",
    caption = "Data: kaggle.com"
  ) + 
  # Changing the theme
  theme_bw() +
  # Creating 4 plots (small multiples) for each manufacturer
  facet_wrap(
    facets = vars(manufacturer)
  )

DS 2870: Module 2 Homework - Used Cars

Your Name

2024-09-23