dslabs MovieLens

Author

Latifah Traore

Introduction

In data science, analyzing and visualizing data well is essential for extracting meaningful insights. This project uses the MovieLens dataset from the dslabs package, which contains user ratings for a wide range of films. We focus on popular movies, defined as those with more than 100 ratings, and investigate the relationship between the number of ratings and the average rating across genres. The goal is to reveal patterns in audience preferences and to present the findings clearly, using data manipulation and visualization techniques.

# Load necessary libraries
library(dslabs)
library(tidyr)
library(highcharter)
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
Highcharts (www.highcharts.com) is a Highsoft software product which is
not free for commercial and Governmental use

Attaching package: 'highcharter'
The following object is masked from 'package:dslabs':

    stars
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
# Load the dataset
data("movielens")
# Inspect the structure and summary of the dataset
str(movielens)
'data.frame':   100004 obs. of  7 variables:
 $ movieId  : int  31 1029 1061 1129 1172 1263 1287 1293 1339 1343 ...
 $ title    : chr  "Dangerous Minds" "Dumbo" "Sleepers" "Escape from New York" ...
 $ year     : int  1995 1941 1996 1981 1989 1978 1959 1982 1992 1991 ...
 $ genres   : Factor w/ 901 levels "(no genres listed)",..: 762 510 899 120 762 836 81 762 844 899 ...
 $ userId   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ rating   : num  2.5 3 3 2 4 2 2 2 3.5 2 ...
 $ timestamp: int  1260759144 1260759179 1260759182 1260759185 1260759205 1260759151 1260759187 1260759148 1260759125 1260759131 ...
summary(movielens)
    movieId          title                year     
 Min.   :     1   Length:100004      Min.   :1902  
 1st Qu.:  1028   Class :character   1st Qu.:1987  
 Median :  2406   Mode  :character   Median :1995  
 Mean   : 12549                      Mean   :1992  
 3rd Qu.:  5418                      3rd Qu.:2001  
 Max.   :163949                      Max.   :2016  
                                     NA's   :7     
                  genres          userId        rating        timestamp        
 Drama               : 7757   Min.   :  1   Min.   :0.500   Min.   :7.897e+08  
 Comedy              : 6748   1st Qu.:182   1st Qu.:3.000   1st Qu.:9.658e+08  
 Comedy|Romance      : 3973   Median :367   Median :4.000   Median :1.110e+09  
 Drama|Romance       : 3462   Mean   :347   Mean   :3.544   Mean   :1.130e+09  
 Comedy|Drama        : 3272   3rd Qu.:520   3rd Qu.:4.000   3rd Qu.:1.296e+09  
 Comedy|Drama|Romance: 3204   Max.   :671   Max.   :5.000   Max.   :1.477e+09  
 (Other)             :71588                                                    
# Check for missing values in the dataset without removing rows
colSums(is.na(movielens))
  movieId     title      year    genres    userId    rating timestamp 
        0         7         7         0         0         0         0 
# Group by movieId and calculate rating count and average rating, then filter for movies with more than 100 ratings
popular_movies <- movielens %>%
  group_by(movieId, title, genres) %>%
  summarize(rating_count = n(),               # Count the number of ratings per movie
            avg_rating = mean(rating, na.rm = TRUE)) %>%  # Calculate average rating
  filter(rating_count > 100)  # Filter for popular movies
`summarise()` has grouped output by 'movieId', 'title'. You can override using
the `.groups` argument.
# Inspect the filtered data
head(popular_movies)
# A tibble: 6 × 5
# Groups:   movieId, title [6]
  movieId title                              genres      rating_count avg_rating
    <int> <chr>                              <fct>              <int>      <dbl>
1       1 Toy Story                          Adventure|…          247       3.87
2       2 Jumanji                            Adventure|…          107       3.40
3       6 Heat                               Action|Cri…          104       3.88
4      10 GoldenEye                          Action|Adv…          122       3.45
5      25 Leaving Las Vegas                  Drama|Roma…          101       3.74
6      32 Twelve Monkeys (a.k.a. 12 Monkeys) Mystery|Sc…          196       3.92
# Recompute the same summary with .groups = "drop" so the result is ungrouped and the summarise() message above no longer appears
popular_movies <- movielens %>%
  group_by(movieId, title, genres) %>%
  summarize(rating_count = n(),               # Count the number of ratings per movie
            avg_rating = mean(rating, na.rm = TRUE), .groups = "drop") %>%  # Calculate average rating
  filter(rating_count > 100)  # Filter for popular movies

# Prepare genre data by separating genres into individual rows
popular_movies_genres <- popular_movies %>%
  separate_rows(genres, sep = "\\|")  # Split multiple genres into separate rows
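
The splitting step can be illustrated on a tiny example. The sketch below uses only base R (strsplit() rather than separate_rows()) so that it runs without any packages, and the two movies and their counts are invented for illustration. It shows that a movie tagged with several genres contributes its full rating count to each of them, so genre totals can add up to more than the number of ratings in the data.

```r
# Toy data: titles and counts are made up for illustration only
toy <- data.frame(
  title        = c("Movie A", "Movie B"),
  genres       = c("Comedy|Romance", "Drama"),
  rating_count = c(150L, 120L),
  stringsAsFactors = FALSE
)

# Split each genre string on the literal "|" character
genre_list <- strsplit(toy$genres, "|", fixed = TRUE)

# Repeat each row once per genre, then attach the individual genre
long <- toy[rep(seq_len(nrow(toy)), lengths(genre_list)), ]
long$genres <- unlist(genre_list)
rownames(long) <- NULL

print(long)

# "Movie A" is now counted under both Comedy and Romance
sum(toy$rating_count)   # total before splitting
sum(long$rating_count)  # total after splitting (larger)
```

In the pipeline above, sep = "\\|" plays the same role as fixed = TRUE here: the pipe is a regular-expression metacharacter and has to be escaped to be matched literally.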
# Define colors for genres
# There are more than 12 genres (length(unique(popular_movies_genres$genres)) > 12),
# so we define our own palette of 16 colors
colors <- c("#E41A1C", "#377EB8", "#4DAF4A", "#FF7F00", "#984EA3", "#FFFF33", "#A65628", "#999999",
            "#FF00FF", "#00FFFF", "#FF4500", "#ADFF2F", "#FFD700", "#00FA9A", "#8A2BE2", "#20B2AA")


# Summarize by genre: split the genre strings into rows, then aggregate per genre
popular_movies_genres <- popular_movies %>%
  separate_rows(genres, sep = "\\|") %>%
  group_by(genres) %>%
  summarize(rating_count = sum(rating_count),  # Total count of ratings for each genre
            avg_rating = mean(avg_rating, na.rm = TRUE), .groups = "drop")  # Average rating per genre
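
Note that mean(avg_rating) gives every movie equal weight, whatever its number of ratings. If the average over all individual ratings in a genre is wanted instead, one possible alternative (not what the code above does) is to weight each movie's average by its rating count. A minimal base R sketch with invented numbers:

```r
# Toy per-movie summaries for a single genre; values are invented for illustration
toy <- data.frame(
  genres       = c("Drama", "Drama"),
  rating_count = c(300L, 100L),
  avg_rating   = c(4.0, 3.0)
)

# Equal weight per movie, as in the summary above
unweighted <- mean(toy$avg_rating)

# Weight each movie by how many ratings it received
weighted <- weighted.mean(toy$avg_rating, w = toy$rating_count)

print(c(unweighted = unweighted, weighted = weighted))
```

If this were moved into the summarize() call, the weighted mean would probably need to be computed before rating_count is replaced by its sum, since later expressions in summarize() see the newly created columns.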
# Create the highcharter plot with rating count on the x-axis
highchart() %>%
  hc_add_series(data = popular_movies_genres,
                type = "scatter",
                hcaes(x = rating_count, y = avg_rating, group = genres),
                marker = list(symbol = "circle", radius = 5)) %>%
  hc_chart(zoomType = "xy") %>%
  hc_title(text = "Average Ratings of Popular Movies by Rating Count") %>%
  hc_xAxis(title = list(text = "Count of Ratings")) %>%
  hc_yAxis(title = list(text = "Average Rating")) %>%
  hc_plotOptions(scatter = list(marker = list(lineWidth = 1))) %>%
  hc_legend(enabled = FALSE) %>%
  hc_colors(colors)
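
The scatter plot shows the relationship visually. To put a number on it, a correlation coefficient could be computed on the genre-level summary. The sketch below is only an illustration on an invented four-row table (toy_genres and its values are hypothetical, not results from the MovieLens data). The same cor() calls could be applied to the rating_count and avg_rating columns of popular_movies_genres.

```r
# Toy genre-level summary; genre names and numbers are invented for illustration
toy_genres <- data.frame(
  genres       = c("A", "B", "C", "D"),
  rating_count = c(1000, 2000, 3000, 4000),
  avg_rating   = c(3.9, 3.7, 3.8, 3.6)
)

# Pearson correlation measures linear association
r_pearson <- cor(toy_genres$rating_count, toy_genres$avg_rating)

# Spearman correlation uses ranks and is less sensitive to outlying genres
r_spearman <- cor(toy_genres$rating_count, toy_genres$avg_rating,
                  method = "spearman")

print(c(pearson = r_pearson, spearman = r_spearman))
```

With only one point per genre, any such coefficient rests on a small number of observations and should be read cautiously.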

Conclusion

This analysis examined movie ratings across genres using the MovieLens dataset. Focusing on popular movies with more than 100 ratings, we plotted the total number of ratings against the average rating for each genre to explore audience preferences. The results highlight that some genres attract far more ratings than others and that average ratings vary by genre. By checking for missing values, summarizing the data, and visualizing it interactively, we presented the results in an informative and engaging way. The project shows how data visualization helps make sense of a large dataset and sets the stage for further exploration of trends in movie ratings and viewer preferences.