DS Labs HW

Author

Aline Mayrink

Introduction

The dataset used in this analysis is the movielens dataset from the ‘dslabs’ package. It contains user ratings of movies along with metadata such as title, genres, and release year. The objective of this visualization is to explore the relationship between a movie's release year, its average rating, and the number of ratings it has received.

Load libraries

library(tidyverse) # 'tidyverse' for data manipulation and visualization

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dslabs) # 'dslabs' for the movielens dataset
library(highcharter) # 'highcharter' for creating interactive visualizations

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
Highcharts (www.highcharts.com) is a Highsoft software product which is
not free for commercial and Governmental use

Attaching package: 'highcharter'

The following object is masked from 'package:dslabs':

    stars

Load data

data("movielens") # This dataset contains movie ratings data along with metadata like genre, year of release, etc.
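
Before summarizing, it can help to confirm which columns are available. The following is a minimal check, assuming the 'dslabs' package is installed; the column names in the comment are the ones documented for movielens.

```r
library(dslabs)
data("movielens")

# Documented columns: movieId, title, year, genres, userId, rating, timestamp
names(movielens)

# Each row is one rating given by one user to one movie
nrow(movielens)
```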

Group by movie and summarize

movie_data <- movielens %>%
  group_by(title, year) %>% # Group the ratings by movie title and release year
  summarize(avg_rating = mean(rating, na.rm = TRUE), # Average rating for each movie
            num_ratings = n(),                       # Number of ratings for each movie
            .groups = 'drop')
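
To see what the grouping and summarizing step does, here is a small sketch on a hypothetical toy data frame (the titles and ratings are made up, not taken from movielens). It also illustrates one detail worth keeping in mind: mean(..., na.rm = TRUE) skips missing ratings, while n() counts every row in the group, including rows with a missing rating.

```r
library(dplyr)

# Hypothetical toy ratings, for illustration only
toy <- data.frame(
  title  = c("A", "A", "A", "B", "B"),
  year   = c(1990, 1990, 1990, 2005, 2005),
  rating = c(4, 5, NA, 3, 2)
)

toy_summary <- toy %>%
  group_by(title, year) %>%
  summarize(
    avg_rating  = mean(rating, na.rm = TRUE), # NA ratings are dropped from the mean
    num_ratings = n(),                        # n() counts all rows, including the NA row
    .groups = "drop"
  )

toy_summary
```

If only non-missing ratings should be counted, sum(!is.na(rating)) could be used in place of n().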

Filter movies with a minimum number of ratings

filtered_data <- movie_data %>%
  filter(num_ratings > 50 & !is.na(year)) # Only include movies that have more than 50 ratings and exclude movies with missing year data.
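
The cutoff of 50 ratings is a judgment call, so it may be worth checking how many movies survive under other thresholds. The sketch below rebuilds the per-movie summary and counts the movies kept at a few cutoffs; the values 10 and 100 are hypothetical alternatives chosen only for comparison.

```r
library(dplyr)
library(dslabs)
data("movielens")

# Rebuild the per-movie summary so this block runs on its own
movie_data <- movielens %>%
  group_by(title, year) %>%
  summarize(avg_rating = mean(rating, na.rm = TRUE),
            num_ratings = n(),
            .groups = "drop")

# Hypothetical alternative cutoffs next to the 50 used in this analysis
cutoffs <- c(10, 50, 100)

# Number of movies with a known year and more than k ratings
kept <- sapply(cutoffs, function(k) {
  sum(movie_data$num_ratings > k & !is.na(movie_data$year))
})

data.frame(cutoff = cutoffs, movies_kept = kept)
```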

Create a Scatterplot with a third variable (Number of Ratings)

# This plot visualizes the relationship between the release year and average rating, with color indicating the number of ratings.
ds_theme_set()
ggplot(filtered_data, aes(x = year, y = avg_rating, color = num_ratings)) +
  geom_point(alpha = 0.6, size = 3) +  # Adjust transparency and point size for better readability.
  scale_color_gradient(low = "blue", high = "red", name = "Number of Ratings") +
  labs(title = "Movie Ratings Over the Years",
       x = "Year of Release",
       y = "Average Rating") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))  # Center the title.

Interactive Visualization

# This plot is similar to the previous one but adds interactive features such as tooltips, along with a color gradient for the number of ratings.
hchart(filtered_data, "scatter", hcaes(x = year, y = avg_rating, color = num_ratings)) %>%
  hc_title(text = "Interactive Movie Ratings") %>%
  hc_xAxis(title = list(text = "Year of Release")) %>%
  hc_yAxis(title = list(text = "Average Rating")) %>%
  hc_colorAxis(stops = color_stops(n = 5, colors = c("blue", "green"))) %>%
  # The movie name is stored in the 'title' column; ':.2f' rounds the average to two decimals.
  hc_tooltip(pointFormat = "Movie: {point.title}<br>Year: {point.year}<br>Avg Rating: {point.avg_rating:.2f}<br>Num Ratings: {point.num_ratings}") %>%
  hc_legend(align = "right", 
            verticalAlign = "top")

Insights and Conclusion

This visualization shows how average movie ratings vary with release year. Older movies tend to have higher average ratings, possibly due to survivorship bias (only well-regarded older films continue to be watched and rated). Additionally, movies with a large number of ratings tend to cluster around more recent release years, possibly because newer films are more accessible, for example via streaming platforms.
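
The observation that older movies tend to have higher average ratings is read off the scatterplot, so a numeric summary can help check it. The following is a rough sketch, not a formal test: it groups the filtered movies by decade (a hypothetical binning chosen for illustration) and also computes a Spearman rank correlation between release year and average rating.

```r
library(dplyr)
library(dslabs)
data("movielens")

# Rebuild the filtered per-movie data so this block runs on its own
filtered_data <- movielens %>%
  group_by(title, year) %>%
  summarize(avg_rating = mean(rating, na.rm = TRUE),
            num_ratings = n(),
            .groups = "drop") %>%
  filter(num_ratings > 50, !is.na(year))

# Summarize by decade of release (binning is an assumption for illustration)
decade_summary <- filtered_data %>%
  mutate(decade = floor(year / 10) * 10) %>%
  group_by(decade) %>%
  summarize(movies = n(),
            mean_avg_rating = mean(avg_rating),
            median_num_ratings = median(num_ratings),
            .groups = "drop") %>%
  arrange(decade)

decade_summary

# A negative value would be consistent with older movies being rated higher
year_rating_cor <- cor(filtered_data$year, filtered_data$avg_rating,
                       method = "spearman")
year_rating_cor
```

Decades with only a handful of movies should be read with caution, which is why the movies column is included alongside the means.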

Using the number of ratings as a third variable shows how much rating activity lies behind each movie's average. The interactive visualization lets users hover over data points to inspect individual movies. Further analysis could investigate which factors influence the distribution of ratings.