The dataset used in this analysis is the movielens dataset from the ‘dslabs’ package. It contains individual user ratings of movies, along with metadata such as genre and release year. The objective of this visualization is to explore the relationship between a movie's release year, its average rating, and the number of ratings it has received.
Load libraries
library(tidyverse) # 'tidyverse' for data manipulation and visualization
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dslabs) # 'dslabs' for the movielens dataset
library(highcharter) # 'highcharter' for creating interactive visualizations
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
Highcharts (www.highcharts.com) is a Highsoft software product which is
not free for commercial and Governmental use
Attaching package: 'highcharter'
The following object is masked from 'package:dslabs':
stars
Load data
data("movielens") # This dataset contains movie ratings data along with metadata like genre, year of release, etc.
Select columns and summarize
movie_data <- movielens %>%
  group_by(title, year) %>% # Group the data by movie and year
  summarize(avg_rating = mean(rating, na.rm = TRUE),
            num_ratings = n(),
            .groups = 'drop') # Calculate the average rating and the number of ratings for each movie.
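To make the grouping step concrete, here is a minimal sketch on a small made-up table. The `toy_ratings` data and its values are invented for illustration and are not taken from movielens; only the `group_by()` / `summarize()` pattern mirrors the step above.

```r
library(dplyr)

# Hypothetical toy data: one row per individual rating, as in movielens
toy_ratings <- tibble(
  title  = c("A", "A", "A", "B", "B", "C"),
  year   = c(1990, 1990, 1990, 2005, 2005, NA),
  rating = c(4, 5, 3, 2, NA, 4)
)

toy_summary <- toy_ratings %>%
  group_by(title, year) %>%
  summarize(
    avg_rating  = mean(rating, na.rm = TRUE), # NA ratings are ignored in the mean
    num_ratings = n(),                        # n() counts rows, including rows with an NA rating
    .groups = "drop"                          # return an ungrouped tibble
  )

print(toy_summary)
```

Note that `n()` counts rows rather than non-missing ratings. If the ratings column could contain missing values, `sum(!is.na(rating))` would be one alternative way to count.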
Filter movies with a minimum number of ratings
filtered_data <- movie_data %>%
  filter(num_ratings > 50 & !is.na(year)) # Only include movies that have more than 50 ratings and exclude movies with missing year data.
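The filter condition has two edge cases worth checking: the comparison is strict, so a movie with exactly 50 ratings is dropped, and a missing year removes a movie no matter how many ratings it has. The sketch below illustrates both on a hypothetical `toy_movies` table with invented values.

```r
library(dplyr)

# Hypothetical per-movie summary with the same columns as movie_data
toy_movies <- tibble(
  title       = c("A", "B", "C", "D"),
  year        = c(1994, 2001, NA, 1977),
  avg_rating  = c(3.9, 3.2, 4.1, 4.0),
  num_ratings = c(120, 50, 300, 51)
)

toy_filtered <- toy_movies %>%
  filter(num_ratings > 50 & !is.na(year)) # strict ">": exactly 50 ratings is excluded

# "B" is dropped (50 is not > 50) and "C" is dropped (missing year)
print(toy_filtered$title)
```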
Create a Scatterplot with a third variable (Number of Ratings)
# This plot visualizes the relationship between the release year and average rating, with color indicating the number of ratings.
ds_theme_set()
ggplot(filtered_data, aes(x = year, y = avg_rating, color = num_ratings)) +
  geom_point(alpha = 0.6, size = 3) + # Adjust transparency and point size for better readability.
  scale_color_gradient(low = "blue", high = "red", name = "Number of Ratings") +
  labs(title = "Movie Ratings Over the Years",
       x = "Year of Release",
       y = "Average Rating") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) # Center the title.
Interactive Visualization
# The plot is similar to the previous one but offers interactive features such as tooltips and a color gradient for the number of ratings.
hchart(filtered_data, "scatter", hcaes(x = year, y = avg_rating, color = num_ratings)) %>%
  hc_title(text = "Interactive Movie Ratings") %>%
  hc_xAxis(title = list(text = "Year of Release")) %>%
  hc_yAxis(title = list(text = "Average Rating")) %>%
  hc_colorAxis(stops = color_stops(n = 5, colors = c("blue", "green"))) %>%
  # The movie name is stored in the 'title' column, so the tooltip uses {point.title}.
  hc_tooltip(pointFormat = "Movie: {point.title}<br>Year: {point.year}<br>Avg Rating: {point.avg_rating}<br>Num Ratings: {point.num_ratings}") %>%
  hc_legend(align = "right", verticalAlign = "top")
Insights and Conclusion
This visualization provides insights into how average movie ratings vary with release year among movies that have more than 50 ratings. Older movies tend to have higher average ratings, possibly due to survivorship bias: only well-regarded older films are still watched and rated often enough to pass the filter. Additionally, we observe that movies with a large number of ratings tend to cluster around more recent years, possibly because recent films are more accessible to the users who rate them, for example via streaming platforms.
By using the number of ratings as a third variable, we can see how both average ratings and rating volume change with release year. The interactive visualization allows users to hover over data points and explore specific movies dynamically. Further analysis could investigate factors influencing rating distributions, such as genre.
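One way to check the observation that older movies tend to have higher average ratings is to summarize the per-movie averages by release decade. The sketch below is only an illustration: `rating_by_decade()` is a hypothetical helper that is not part of the original analysis, and the `toy_movies` values are invented. With the real data, the call would be `rating_by_decade(filtered_data)`.

```r
library(dplyr)

# Hypothetical helper: summarize per-movie results within each release decade
rating_by_decade <- function(movies) {
  movies %>%
    mutate(decade = floor(year / 10) * 10) %>% # e.g. 1994 -> 1990
    group_by(decade) %>%
    summarize(
      n_movies           = n(),
      mean_avg_rating    = mean(avg_rating),
      median_num_ratings = median(num_ratings),
      .groups = "drop"
    ) %>%
    arrange(decade)
}

# Invented example values with the same columns as filtered_data
toy_movies <- tibble(
  title       = c("A", "B", "C", "D"),
  year        = c(1941, 1948, 1994, 1999),
  avg_rating  = c(4.2, 4.0, 3.6, 3.4),
  num_ratings = c(60, 80, 250, 300)
)

decade_summary <- rating_by_decade(toy_movies)
print(decade_summary)
```

A decade-level table like this makes it easier to see whether the pattern in the scatterplot holds across the whole range or is driven by a handful of movies in sparsely populated decades.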