library(tidyverse)
library(dslabs)
library(plotly)
data(movielens)
DS Labs Week 4: movielens
For this assignment, I picked the movielens dataset from the dslabs package.
As I looked through the data, a few ideas popped into my head, especially around the genres. I was curious to explore how average ratings changed over time by genre.
Clean the Data
I removed any missing values from the title, year, and genres columns.
movies <- movielens |>
  filter(!is.na(title), !is.na(year), !is.na(genres))
is.na() flags missing values and ! negates it, so filter() keeps only the rows where title, year, and genres are all present.
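To see the filtering step in isolation, here is a minimal sketch on a made-up data frame. The `toy` data and its values are hypothetical stand-ins for movielens, not rows from the real dataset.

```r
library(dplyr)

# Hypothetical toy data standing in for movielens
toy <- data.frame(
  title  = c("A", NA, "C", "D"),
  year   = c(1995, 2000, NA, 2010),
  genres = c("Comedy", "Drama", "Action", "Drama|Romance")
)

# is.na() marks missing values, ! flips the marks,
# and filter() keeps rows where every condition is TRUE
toy_clean <- toy |>
  filter(!is.na(title), !is.na(year), !is.na(genres))

nrow(toy_clean)
```

Only the rows with no missing value in any of the three columns survive. If tidyr is loaded, `drop_na(title, year, genres)` should give the same result in one call.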
Group by year, genre, and title to get average ratings
movies2 <- movies |>
  group_by(year, genres, title) |>
  summarize(avg_rating = mean(rating))
`summarise()` has grouped output by 'year', 'genres'. You can override using
the `.groups` argument.
This creates a new dataset with one row per combination of year, genres, and title, plus avg_rating, which is the mean() of that movie's ratings.
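The grouping message printed by summarise() can be handled explicitly with the `.groups` argument. The sketch below uses a hypothetical `toy` data frame to show one option, `.groups = "drop"`, which returns an ungrouped result. This is an illustration, not the code used for the plots in this write-up.

```r
library(dplyr)

# Hypothetical toy ratings: movie "A" was rated twice
toy <- data.frame(
  year   = c(1995, 1995, 1995, 2000),
  genres = c("Comedy", "Comedy", "Drama", "Drama"),
  title  = c("A", "A", "B", "C"),
  rating = c(4, 3, 5, 2)
)

# One row per year/genres/title, with the mean rating;
# .groups = "drop" removes the leftover grouping and silences the message
toy_avg <- toy |>
  group_by(year, genres, title) |>
  summarize(avg_rating = mean(rating), .groups = "drop")

toy_avg
```

Dropping the groups is usually the safer default when the result goes straight into a plot, since later steps will not be affected by leftover grouping.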
Set a Base Plot
p1 <- ggplot() +
  labs(
    title = "Average Movie Ratings Over Time by Genre",
    x = "Year",
    y = "Rating",
    caption = "Source: DSLabs Movielens Dataset"
  ) +
  theme_minimal()
p1
The base plot is like a blank canvas, just getting the structure set up for what I’ll add next.
This plot shows trends over time, but the color doesn’t give us much information because every point is purple. I wanted to add genre colors next.
Add Color Based on Genre
p3 <- ggplot(movies2, aes(x = year, y = avg_rating, color = genres)) +
  geom_point(size = 2, alpha = 0.5) +
  labs(
    title = "Average Movie Ratings Over Time by Genre",
    x = "Year",
    y = "Rating",
    color = "Genre",
    caption = "Source: DSLabs Movies Dataset"
  ) +
  theme_minimal()
p3
This didn’t work out. There are too many combinations of genres (like “Comedy|Action|Sci-Fi”), and they cluttered the legend and plot. So I decided to simplify.
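One way to check how many legend entries a column will produce before plotting is to count its distinct values. The sketch below does this on a few made-up genre strings. The `toy` data is hypothetical and far smaller than the real column.

```r
library(dplyr)

# Hypothetical sample of genre strings, not real movielens rows
toy <- data.frame(
  genres = c("Comedy|Action|Sci-Fi", "Comedy", "Comedy|Romance",
             "Action|Adventure|Sci-Fi", "Comedy")
)

# Each unique combination becomes its own legend entry when mapped to color
n_combos <- n_distinct(toy$genres)
n_combos

# Most frequent combinations first
toy |> count(genres, sort = TRUE)
```

Running the same `n_distinct()` call on the genres column of the real data would show how many colors the plot above was trying to draw.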
Clean the genres
Use only the first-named genre for each movie.
movies2 <- movies |>
  mutate(main_genre = str_split(genres, "\\|", simplify = TRUE)[, 1])
I used ChatGPT to help me figure out how to separate the genre string by the pipe symbol | and pull out just the first one. That made each movie easier to categorize.
Note:
str_split() splits strings by the “|” symbol.
simplify = TRUE turns the result into a matrix.
[, 1] grabs the first genre listed for each movie.
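The three pieces in the note can be tried one at a time on a small character vector. This is a minimal sketch with made-up genre strings. The names `genres`, `parts`, and `main_genre` here are standalone variables for illustration.

```r
library(stringr)

# Made-up genre strings for illustration
genres <- c("Comedy|Action|Sci-Fi", "Drama", "Action|Adventure")

# "|" means "or" in a regular expression, so it is escaped as "\\|"
# simplify = TRUE returns a character matrix, one row per input string;
# shorter rows are padded with empty strings
parts <- str_split(genres, "\\|", simplify = TRUE)
parts

# Column 1 of the matrix holds the first-listed genre of each string
main_genre <- parts[, 1]
main_genre
```

In stringr 1.5.0 and later, `str_split_i(genres, "\\|", 1)` should return the same first pieces without building the matrix.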
Plot With Cleaned Genres
p4 <- ggplot(movies2, aes(x = year, y = rating, color = main_genre)) +
  geom_point(size = 2, alpha = 0.5) +
  labs(
    title = "Average Movie Ratings Over Time by Genre",
    x = "Year",
    y = "Rating",
    color = "Genre",
    caption = "Source: DSLabs Movies Dataset"
  ) +
  theme_minimal()
p4
This was a big improvement, but there were still too many genres, so next I filtered down to the 8 most common ones.
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
Updated minimal dataset focusing on year, main genre and average rating
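The code for this filtering step is not shown above, so the following is only a sketch of one way it could be done, not the original code. It assumes a column named main_genre already exists, uses a hypothetical `toy` data frame, and keeps the top 2 genres instead of 8 so the result is easy to check by eye. Grouping by year and main_genre is a guess based on the summarise() message.

```r
library(dplyr)

# Hypothetical toy data: one row per rating, main_genre already extracted
toy <- data.frame(
  year       = c(1990, 1990, 1990, 1995, 1995, 2000, 2000),
  main_genre = c("Comedy", "Comedy", "Drama", "Comedy", "Action", "Drama", "Horror"),
  rating     = c(4, 3, 5, 2, 4, 3, 1)
)

n_top <- 2  # the write-up uses 8; 2 keeps the toy example small

# Count rows per genre, sort by frequency, keep the names of the top n
top_genres <- toy |>
  count(main_genre, sort = TRUE) |>
  slice_head(n = n_top) |>
  pull(main_genre)

# Keep only those genres, then average ratings per year and genre
toy_top <- toy |>
  filter(main_genre %in% top_genres) |>
  group_by(year, main_genre) |>
  summarize(avg_rating = mean(rating), .groups = "drop")

toy_top
```

Counting first and filtering with `%in%` keeps the list of kept genres in its own variable, which makes it easy to change the cutoff or print the names for a sanity check.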
Final Plot
p5 <- ggplot(movies4, aes(x = year, y = avg_rating, color = main_genre)) +
  geom_point(size = 2, alpha = 0.5) +
  labs(
    title = "Average Movie Ratings Over Time by Genre",
    x = "Year",
    y = "Rating",
    color = "Genre",
    caption = "Source: DSLabs Movielens Dataset"
  ) +
  theme_minimal()
p5
Make it Interactive
p6 <- ggplot(movies4, aes(x = year, y = avg_rating, color = main_genre)) +
  geom_point(size = 2, alpha = 0.5) +
  labs(
    title = "Average Movie Ratings Over Time by Genre",
    x = "Year",
    y = "Rating",
    color = "Genre",
    caption = "Source: DSLabs Movielens Dataset"
  ) +
  theme_minimal()
p6 <- ggplotly(p6)
p6
This was exciting to build—it was cool to finally see the interactive version come to life.
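A possible next step for the interactive version is to control what the hover box says. The sketch below is a small, hedged example on a hypothetical `toy` data frame. It maps a custom label to a `text` aesthetic and asks `ggplotly()` to show only that label through its `tooltip` argument.

```r
library(ggplot2)
library(plotly)

# Hypothetical averaged data with the same column names as the final plot
toy <- data.frame(
  year       = c(1990, 1995, 2000),
  avg_rating = c(3.5, 4.0, 2.5),
  main_genre = c("Comedy", "Drama", "Comedy")
)

# The text aesthetic is not drawn by ggplot2; it only carries the hover label
p <- ggplot(
  toy,
  aes(
    x = year, y = avg_rating, color = main_genre,
    text = paste0(main_genre, " (", year, "): ", avg_rating)
  )
) +
  geom_point(size = 2, alpha = 0.5) +
  theme_minimal()

# tooltip = "text" limits the hover box to the custom label
p_int <- ggplotly(p, tooltip = "text")
p_int
```

Leaving out the `tooltip` argument shows every mapped aesthetic in the hover box, which is the default behavior seen in the plot above.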
Final Reflection
This project helped me realize how important data cleaning is when working with real-world datasets. At first, I wasn’t sure what to do with the genres column because many movies had multiple genres. I decided to use only the first-named genre and then filtered the data to include only the top 8 genres. That made the final plot much easier to read and understand.
I also tried out different plot types before landing on the scatterplot with Plotly, which was the best fit for this kind of analysis. It helped me see how ratings changed over time by genre. I learned that picking what to include, and what to leave out, really changes how clearly the story comes across in a visualization.
The class notes were a huge help along the way, especially for getting the interactive version working. It made me realize I could someday add cool things like hoverable movie posters or deeper interactivity. I’m happy with how this turned out… it was definitely a challenge at times, but seeing the final chart pop up made it worth it.
Challenge: Genres
One challenge I ran into was that the majority of movies had multiple genres listed, for example “Action|Adventure|Sci-Fi”. To simplify the genre list, I decided to use only the first-named genre for each movie, which let me keep “Action” and ignore “Adventure|Sci-Fi”. This reduced the number of categories and made the plot cleaner and easier to read.
Even after that, there were still too many genres to show clearly, so I filtered the dataset to include only the eight most common genres. This allowed me to create a color legend that was easier to interpret while still keeping the diversity in the data.
There was a point when I considered reaching out to the professor for help with reducing the genres to only the first-named one, but I was on a roll and decided to use ChatGPT instead. I asked it to suggest which function fit what I was looking for and to explain what each part of the function does, so I could stay on track and learn a new function. I didn’t need a full solution, just a nudge in the right direction, and it worked out. In the end, I was glad I kept going and solved it on my own. Please do offer any feedback or insights!