DS Labs Homework

Author

Cody Paulay-Simmons

Load libraries and data

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dslabs)
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
data(movielens)

DS Labs Week 4: movielens

For this assignment, I picked the movielens dataset from the dslabs package.

As I looked through the data, a few ideas popped into my head, especially around the genres. I was curious to explore how average ratings changed over time by genre.

Clean the Data

I removed any missing values from the title, year, and genres columns.

movies <- movielens |>
  filter(!is.na(title),
         !is.na(year),
         !is.na(genres))

filter() keeps only the rows where every condition is TRUE, so combining the three !is.na() checks drops any row that is missing a title, year, or genres value.
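To see the filtering step in isolation, here is a minimal sketch on made-up data. The toy tibble and its values are hypothetical and only stand in for movielens; tidyr's drop_na() is shown as one possible shorthand for the same step, not as the approach used in this assignment.

```r
library(tidyverse)

# Hypothetical toy data, not part of movielens
toy <- tibble(
  title  = c("A", NA, "C", "D"),
  year   = c(1995, 1996, NA, 1998),
  genres = c("Drama", "Comedy", "Action", "Drama|Romance")
)

# filter() keeps a row only when every condition is TRUE
kept <- toy |>
  filter(!is.na(title),
         !is.na(year),
         !is.na(genres))

# drop_na() with the same columns listed removes the same rows
kept_alt <- toy |>
  drop_na(title, year, genres)

kept$title
```

Rows 2 and 3 each have one missing value, so only movies "A" and "D" should remain.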

Group by year, genre, and title to get average ratings

movies2 <- movies |>
  group_by(year,
           genres,
           title) |>
  summarize(avg_rating = mean(rating))
`summarise()` has grouped output by 'year', 'genres'. You can override using
the `.groups` argument.

This creates a new dataset with one row per movie (identified by year, genres, and title) and its avg_rating, calculated using the mean() function.
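The message about grouped output appears because summarize() removes only the last grouping variable by default. As a rough sketch of one way to handle it, the example below passes .groups = "drop" so the result comes back ungrouped. The ratings tibble is made up for illustration.

```r
library(tidyverse)

# Hypothetical ratings: movie "A" was rated twice
ratings <- tibble(
  year   = c(1995, 1995, 1995, 1996),
  genres = c("Drama", "Drama", "Comedy", "Drama"),
  title  = c("A", "A", "B", "C"),
  rating = c(4, 5, 3, 2)
)

# One row per movie; .groups = "drop" returns an ungrouped tibble
# and silences the "has grouped output by" message
by_movie <- ratings |>
  group_by(year,
           genres,
           title) |>
  summarize(avg_rating = mean(rating),
            .groups = "drop")

by_movie
```

Whether to drop or keep the groups depends on what comes next; dropping them is a reasonable default when the result goes straight into a plot.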

Set a Base Plot

p1 <- ggplot() +
  labs(title = "Average Movie Ratings Over Time by Genre",
       x = "Year",
       y = "Rating",
       caption = "Source: DSLabs Movielens Dataset") +
  theme_minimal()
p1

The base plot is like a blank canvas, just getting the structure set up for what I’ll add next.

Add the Scatterplot

p2 <- ggplot(movies2,
             aes(x = year,
                 y = avg_rating)) +
  geom_point(size = 2,
             alpha = 0.5,
             color = "purple") +
  labs(title = "Movie Ratings Over Time",
       x = "Year",
       y = "Rating",
       caption = "Source: DSLabs Movielens Dataset") +
  theme_minimal()
p2

This plot shows trends over time, but the color doesn’t give us any information because every point is purple. I wanted to add genre colors next.

Add Color Based on Genre

p3 <- ggplot(movies2,
             aes(x = year,
                 y = avg_rating,
                 color = genres)) +
  geom_point(size = 2,
             alpha = 0.5) +
  labs(title = "Average Movie Ratings Over Time by Genre",
       x = "Year",
       y = "Rating",
       color = "Genre",
       caption = "Source: DSLabs Movielens Dataset") +
  theme_minimal()
p3

This didn’t work out. There are too many combinations of genres (like “Comedy|Action|Sci-Fi”), which cluttered both the legend and the plot. So I decided to simplify.

Clean the genres

Use only the first-named genre for each movie.

movies2 <- movies |>
  mutate(main_genre = str_split(genres,
                                "\\|",
                                simplify = TRUE)[, 1])

I used ChatGPT to help me figure out how to separate the genre string by the pipe symbol | and pull out just the first one. That made each movie easier to categorize.

Note:

str_split() splits strings at the “|” symbol. Because | means “or” in a regular expression, it has to be escaped as "\\|" in the pattern.

simplify = TRUE turns the result into a matrix.

[, 1] grabs the first genre listed for each movie.
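The three notes above can be checked on a few made-up genre strings. This is only a sketch: the genres vector is hypothetical, and the fixed("|") variant is an alternative way to match a literal pipe, not something the assignment code uses.

```r
library(stringr)

# Hypothetical genre strings in the same format as movielens
genres <- c("Action|Adventure|Sci-Fi", "Comedy", "Drama|Romance")

# simplify = TRUE returns a character matrix, one row per input string;
# shorter rows are padded with empty strings
parts <- str_split(genres, "\\|", simplify = TRUE)

# The first column holds the first-named genre of each string
main_genre <- parts[, 1]

# fixed() treats the pattern as a literal string, so no escaping is needed
main_genre_alt <- str_split(genres, fixed("|"), simplify = TRUE)[, 1]

main_genre
```

The matrix has as many columns as the longest genre list, which is why indexing the first column works even for movies with a single genre.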

Plot With Cleaned Genres

p4 <- ggplot(movies2,
             aes(x = year,
                 y = rating,
                 color = main_genre)) +
  geom_point(size = 2,
             alpha = 0.5) +
  labs(title = "Movie Ratings Over Time by Genre",
       x = "Year",
       y = "Rating",
       color = "Genre",
       caption = "Source: DSLabs Movielens Dataset") +
  theme_minimal()
p4

This was a big improvement, but there were still too many genres. So next, I filtered down to the 8 most common ones.

Keep Top 8 Most Common Genres

top_genres <- movies2 |>
  count(main_genre) |>
  arrange(desc(n)) |>
  slice_head(n = 8) |>
  pull(main_genre)
top_genres
[1] "Action"    "Comedy"    "Drama"     "Adventure" "Crime"     "Horror"   
[7] "Animation" "Children" 

Explanation:

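count() tallies how many rows fall in each genre. Because each row of movielens is one user’s rating, this ranks genres by number of ratings rather than by number of distinct movies.

arrange(desc(n)) sorts the genres from most to least common.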

slice_head(n = 8) grabs the top 8.

pull(main_genre) turns the result into a vector.
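Here is the same four-step pipeline on a tiny made-up dataset, as a sketch of what each step contributes. The toy tibble is hypothetical, and count(sort = TRUE) is shown as a shorter way to get the same ordering.

```r
library(tidyverse)

# Hypothetical data: Drama appears 3 times, Comedy 2, Action 1
toy <- tibble(
  main_genre = c("Drama", "Comedy", "Drama", "Action", "Drama", "Comedy")
)

# Same steps as the pipeline above, keeping the top 2 instead of the top 8
top2 <- toy |>
  count(main_genre) |>
  arrange(desc(n)) |>
  slice_head(n = 2) |>
  pull(main_genre)

# count(sort = TRUE) combines the count() and arrange(desc(n)) steps
top2_alt <- toy |>
  count(main_genre, sort = TRUE) |>
  slice_head(n = 2) |>
  pull(main_genre)

top2
```

With real data, genres tied on n could land on either side of the cutoff, so it is worth looking at the counts before trusting the top 8.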

Filter Movies to Keep Only Top Genres

movies3 <- movies2 |>
  filter(main_genre %in% top_genres)

This gives me a cleaner dataset with only the top 8 genres.

Group Again By Year and Genre

Now we average the ratings by year and genre again.

movies4 <- movies3 |>
  group_by(year,
           main_genre) |>
  summarize(avg_rating = mean(rating))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.

This gives an updated, minimal dataset with just year, main genre, and average rating.

Final Plot

p5 <- ggplot(movies4,
             aes(x = year,
                 y = avg_rating,
                 color = main_genre)) +
  geom_point(size = 2,
             alpha = 0.5) +
  labs(title = "Average Movie Ratings Over Time by Genre",
    x = "Year",
    y = "Rating",
    color = "Genre",
    caption = "Source: DSLabs Movielens Dataset") +
  theme_minimal()
p5

Make it Interactive

# p5 already holds the final static plot, so convert it directly
p6 <- ggplotly(p5)
p6

This was exciting to build, and it was cool to finally see the interactive version come to life.
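One possible next step for the interactive version is a custom hover label. The sketch below assumes the common pattern of adding a text aesthetic and passing tooltip = "text" to ggplotly(); the toy tibble and the label format are made up for illustration.

```r
library(tidyverse)
library(plotly)

# Hypothetical yearly averages for two genres
toy <- tibble(
  year       = c(1990, 1991, 1990, 1991),
  main_genre = c("Drama", "Drama", "Comedy", "Comedy"),
  avg_rating = c(3.5, 3.7, 3.2, 3.4)
)

# The text aesthetic is not drawn by ggplot2 itself;
# ggplotly() picks it up for the hover label
p <- ggplot(toy,
            aes(x = year,
                y = avg_rating,
                color = main_genre,
                text = paste0(main_genre, " (", year, "): ", avg_rating))) +
  geom_point(size = 2,
             alpha = 0.5) +
  theme_minimal()

# tooltip = "text" shows only the custom label on hover
p_int <- ggplotly(p, tooltip = "text")
p_int
```

Leaving out the tooltip argument falls back to the default hover text, which lists every mapped aesthetic.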

Final Reflection

This project helped me realize how important data cleaning is when working with real-world datasets. At first, I wasn’t sure what to do with the genres column because many movies had multiple genres. I decided to use only the first-named genre and then filtered the data to include only the top 8 genres. That made the final plot much easier to read and understand.

I also tried out different plot types before landing on the scatterplot with Plotly, which was the best fit for this kind of analysis. It helped me see how ratings changed over time by genre. I learned that picking what to include, and what to leave out, really changes how clearly the story comes across in a visualization.

The class notes were a huge help along the way, especially for getting the interactive version working. It made me realize I could someday add cool things like hoverable movie posters or deeper interactivity. I’m happy with how this turned out… it was definitely a challenge at times, but seeing the final chart pop up made it worth it.

Challenge: Genres

One challenge I ran into was that the majority of movies had multiple genres listed, for example “Action|Adventure|Sci-Fi”. To simplify the genre list, I decided to use only the first-named genre for each movie, which in this example keeps “Action” and ignores “Adventure|Sci-Fi”. This reduced the number of categories and made the plot cleaner and easier to read.

Even after that, there were still too many genres to show clearly, so I filtered the dataset to include only the eight most common genres. This allowed me to create a color legend that was easier to interpret while still keeping the diversity in the data.

At one point I considered reaching out to the professor for help with reducing the genres to only the first-named one, but I was on a roll, so I turned to ChatGPT instead. I asked it to suggest which function would do what I was looking for and to explain what each part of the function call does, so I could stay on track and learn a new function. I didn’t need a full solution, just a nudge in the right direction, and it worked out. In the end, I was glad I kept going and solved it on my own. Please do offer any feedback or insights!