Analyzing and visualizing data effectively is essential for extracting meaningful insights. This project uses the MovieLens dataset from the dslabs package, which contains user ratings for a wide range of films. We focus on popular movies, defined as those with more than 100 ratings, and investigate the relationship between the number of ratings and the average rating across genres. The goal is to reveal patterns in audience preferences and to present the findings clearly, using data manipulation and visualization techniques.
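The setup code is not shown in this report. A minimal sketch of what it might look like follows; the package list is an assumption inferred from the functions used later (the pipe and grouping verbs from dplyr, separate_rows() from tidyr, the hc_* functions from highcharter), and the summary() call is assumed to be what produced the overview below.

```r
# Assumed setup: packages inferred from the functions used in this report.
library(dslabs)       # provides the movielens data frame
library(dplyr)        # %>%, group_by(), summarize(), filter()
library(tidyr)        # separate_rows()
library(highcharter)  # highchart(), hc_add_series(), hc_title(), ...

# Load the dataset into the session
data("movielens")

# Per-column overview: quantiles for numeric columns, level counts for factors
summary(movielens)
```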
    movieId          title                year
 Min.   :     1   Length:100004      Min.   :1902
 1st Qu.:  1028   Class :character   1st Qu.:1987
 Median :  2406   Mode  :character   Median :1995
 Mean   : 12549                      Mean   :1992
 3rd Qu.:  5418                      3rd Qu.:2001
 Max.   :163949                      Max.   :2016
                                     NA's   :7
                  genres          userId        rating        timestamp
 Drama               : 7757   Min.   :  1   Min.   :0.500   Min.   :7.897e+08
 Comedy              : 6748   1st Qu.:182   1st Qu.:3.000   1st Qu.:9.658e+08
 Comedy|Romance      : 3973   Median :367   Median :4.000   Median :1.110e+09
 Drama|Romance       : 3462   Mean   :347   Mean   :3.544   Mean   :1.130e+09
 Comedy|Drama        : 3272   3rd Qu.:520   3rd Qu.:4.000   3rd Qu.:1.296e+09
 Comedy|Drama|Romance: 3204   Max.   :671   Max.   :5.000   Max.   :1.477e+09
 (Other)             :71588
# Check for missing values in the dataset without removing rows
colSums(is.na(movielens))
  movieId     title      year    genres    userId    rating timestamp
        0         7         7         0         0         0         0
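The counts above show that title and year each have 7 missing values, but not which rows or movies are affected. A short sketch for inspecting them, assuming the dslabs and dplyr packages are installed (the name missing_rows is introduced here for illustration only):

```r
library(dslabs)
library(dplyr)

data("movielens")

# Keep only the ratings whose title or year is missing
missing_rows <- movielens %>%
  filter(is.na(title) | is.na(year))

# How many ratings are affected
nrow(missing_rows)

# Which movies they belong to, and how many ratings each one has
missing_rows %>%
  count(movieId, name = "n_ratings")
```

Looking at n_ratings indicates whether any of these movies could reach the 100-rating threshold used later in the analysis.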
# Group by movieId and calculate rating count and average rating,
# then filter for movies with more than 100 ratings
popular_movies <- movielens %>%
  group_by(movieId, title, genres) %>%
  summarize(
    rating_count = n(),                      # Count the number of ratings per movie
    avg_rating = mean(rating, na.rm = TRUE)  # Calculate average rating
  ) %>%
  filter(rating_count > 100)                 # Filter for popular movies
`summarise()` has grouped output by 'movieId', 'title'. You can override using
the `.groups` argument.
# Group by movieId and calculate rating count and average rating,
# then filter for movies with more than 100 ratings
popular_movies <- movielens %>%
  group_by(movieId, title, genres) %>%
  summarize(
    rating_count = n(),                       # Count the number of ratings per movie
    avg_rating = mean(rating, na.rm = TRUE),  # Calculate average rating
    .groups = "drop"                          # Drop grouping to silence the message above
  ) %>%
  filter(rating_count > 100)                  # Filter for popular movies

# Prepare genre data by separating genres into individual rows
popular_movies_genres <- popular_movies %>%
  separate_rows(genres, sep = "\\|")          # Split multiple genres into separate rows
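To make the effect of separate_rows() concrete, here is a small sketch on made-up data. The tibble toy and its values are hypothetical and are not taken from the MovieLens data.

```r
library(dplyr)
library(tidyr)

# Hypothetical two-movie example; titles and counts are invented
toy <- tibble(
  title = c("Movie A", "Movie B"),
  genres = c("Comedy|Romance", "Drama"),
  rating_count = c(150L, 120L)
)

# Each pipe-separated genre becomes its own row;
# the other columns are repeated for every genre of a movie
toy_long <- toy %>%
  separate_rows(genres, sep = "\\|")

toy_long
```

Because a movie's rating_count is repeated once per genre, any per-genre total computed afterwards counts a multi-genre movie in every genre it belongs to.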
# Define colors for genres.
# There are more than 12 genres (length(unique(popular_movies_genres$genres)) > 12),
# so we define our own palette with a custom set of colors.
colors <- c("#E41A1C", "#377EB8", "#4DAF4A", "#FF7F00", "#984EA3", "#FFFF33",
            "#A65628", "#999999", "#FF00FF", "#00FFFF", "#FF4500", "#ADFF2F",
            "#FFD700", "#00FA9A", "#8A2BE2", "#20B2AA")

# Prepare genre data: separate genres into individual rows, then aggregate per genre
popular_movies_genres <- popular_movies %>%
  separate_rows(genres, sep = "\\|") %>%
  group_by(genres) %>%
  summarize(
    rating_count = sum(rating_count),             # Total count of ratings for each genre
    avg_rating = mean(avg_rating, na.rm = TRUE),  # Unweighted mean of per-movie averages
    .groups = "drop"
  )
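The per-genre avg_rating above is the plain mean of each movie's average, so a movie with 101 ratings counts as much as one with 300. One possible alternative, not used in this report, is to weight each movie by its number of ratings. A sketch on hypothetical data (the tibble toy_movies and the column names total_ratings, unweighted_avg and weighted_avg are invented for illustration):

```r
library(dplyr)

# Hypothetical per-movie summaries, already split into one row per genre
toy_movies <- tibble(
  genres = c("Drama", "Drama", "Comedy"),
  rating_count = c(300, 100, 200),
  avg_rating = c(4.0, 3.0, 3.5)
)

genre_summary <- toy_movies %>%
  group_by(genres) %>%
  summarize(
    # Plain mean of movie averages (what the report computes)
    unweighted_avg = mean(avg_rating),
    # Mean weighted by how many ratings each movie received
    weighted_avg = weighted.mean(avg_rating, rating_count),
    # Stored under a new name so rating_count is not overwritten
    # before weighted.mean() has used it
    total_ratings = sum(rating_count),
    .groups = "drop"
  )

genre_summary
```

For the toy Drama rows, the unweighted mean is 3.5, while the weighted mean is (300 × 4.0 + 100 × 3.0) / 400 = 3.75, since the more frequently rated movie dominates.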
# Create the highcharter plot with rating count on the x-axis
highchart() %>%
  hc_add_series(
    data = popular_movies_genres,
    type = "scatter",
    hcaes(x = rating_count, y = avg_rating, group = genres),
    marker = list(symbol = "circle", radius = 5)
  ) %>%
  hc_chart(zoomType = "xy") %>%
  hc_title(text = "Average Ratings of Popular Movies by Rating Count") %>%
  hc_xAxis(title = list(text = "Count of Ratings")) %>%
  hc_yAxis(title = list(text = "Average Rating")) %>%
  hc_plotOptions(scatter = list(marker = list(lineWidth = 1))) %>%
  hc_legend(enabled = FALSE) %>%
  hc_colors(colors)
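The scatter plot shows the relationship between rating count and average rating visually. A numeric summary could complement it; the sketch below uses base R's cor() with Spearman's rank correlation, which is a choice made here for illustration and not part of the original analysis. The helper rating_count_correlation and the toy data frame are hypothetical.

```r
# Hypothetical helper: correlation between rating_count and avg_rating
# for any data frame that has those two columns
rating_count_correlation <- function(df, method = "spearman") {
  cor(df$rating_count, df$avg_rating, method = method)
}

# Invented values, used only to show the call
toy <- data.frame(
  rating_count = c(100, 200, 300, 400),
  avg_rating = c(3.2, 3.4, 3.3, 3.9)
)

toy_rho <- rating_count_correlation(toy)
toy_rho
```

With the objects from this report, the call would be rating_count_correlation(popular_movies_genres). With only one point per genre, the coefficient rests on very few observations and should be read with caution.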
CONCLUSION
This analysis examined movie ratings across genres using the MovieLens dataset. By restricting attention to popular movies with more than 100 ratings, we compared the total number of ratings with the average rating for each genre. The scatter plot shows which genres attract the most ratings and how average ratings differ between genres, which gives a view of audience preferences. The workflow combined a check for missing values, per-movie and per-genre summarization, and an interactive visualization. The project illustrates how data visualization helps in understanding a large ratings dataset and provides a starting point for further exploration of trends in movie ratings and viewer preferences.