DS Labs Assignment

Author

Kittim

DS Labs Datasets

# install.packages("dslabs")  # these are data science labs
library("dslabs")
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(RColorBrewer)
data(movielens)
head(movielens)
  movieId                                   title year
1      31                         Dangerous Minds 1995
2    1029                                   Dumbo 1941
3    1061                                Sleepers 1996
4    1129                    Escape from New York 1981
5    1172 Cinema Paradiso (Nuovo cinema Paradiso) 1989
6    1263                        Deer Hunter, The 1978
                            genres userId rating  timestamp
1                            Drama      1    2.5 1260759144
2 Animation|Children|Drama|Musical      1    3.0 1260759179
3                         Thriller      1    3.0 1260759182
4 Action|Adventure|Sci-Fi|Thriller      1    2.0 1260759185
5                            Drama      1    4.0 1260759205
6                        Drama|War      1    2.0 1260759151
# Filter for the years 2007 to 2016 and create a genres_updated column
movielens_filtered <- movielens |>
  filter(year >= 2007 & year <= 2016) |>
  mutate(genres_updated = word(genres, 1, 1, sep = "\\|"))

# Compute the mean rating for each release year and updated genre (input for the heatmap)
heatmap_chart <- movielens_filtered |>
  group_by(year, genres_updated) |>
  summarize(mean_ratings = mean(rating))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# Define a color palette from Set1 with 5 distinct colors
my_palette <- brewer.pal(5, "Set1")

# Ensure that mean_ratings is treated as a continuous variable
heatmap_chart$mean_ratings <- as.numeric(heatmap_chart$mean_ratings)

heatmap <- ggplot(heatmap_chart, aes(x = year, y = genres_updated, fill = mean_ratings)) +
  geom_tile() +
  scale_fill_gradientn(colors = my_palette, name = "Mean Ratings") +  # Use the Set1 palette
  labs(
    x = "Release Year",
    y = "Movie Genre",
    title = "Average Genre Ratings for Movies Released from 2007 to 2016",
    caption = "Data source: MovieLens dataset"
  ) +
  theme_minimal()

# Display the heatmap
print(heatmap)

In this analysis, I used the MovieLens dataset to explore and visualize average ratings by genre for movies released between 2007 and 2016. Restricting the data to this ten-year window focuses on relatively recent releases, which are more likely to reflect current audience preferences. To create the heatmap, I first filtered the dataset to include only movies released from 2007 to 2016. I also added a new column, “genres_updated,” which holds the first genre listed in each movie’s pipe-separated genres string; I treat this as the movie’s primary genre.
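To make the filtering and genre-extraction step concrete, here is a minimal sketch on a tiny hand-made data frame. The `toy` object, its titles, and its years are made up for illustration; only the genre string format mirrors MovieLens. It shows that `word()` with `|` as the separator returns the text before the first `|`, or the whole string when there is no `|`.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data in the same shape as movielens (values are illustrative only)
toy <- tibble(
  title  = c("Movie A", "Movie B", "Movie C"),
  year   = c(2005, 2008, 2016),
  genres = c("Drama", "Action|Adventure|Sci-Fi", "Drama|War")
)

toy_filtered <- toy |>
  filter(between(year, 2007, 2016)) |>                        # keep 2007 to 2016, inclusive
  mutate(genres_updated = word(genres, 1, sep = fixed("|")))  # text before the first "|"

toy_filtered$genres_updated
```

Movie A is dropped by the year filter, and the two remaining rows are labeled "Action" and "Drama". Note that a movie tagged with several genres is counted only under the first one, so the heatmap's genre labels depend on the order in which genres are listed.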

Next, I grouped the filtered data by release year and updated genre. For each combination of year and genre, I calculated the mean of all user ratings, which gives the average rating for that genre among movies released in that year.
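The grouping step can be sketched on toy data as follows. The `toy_ratings` table and the extra `n_ratings` column are assumptions added for illustration; the count is not part of the analysis above, but it makes visible how many ratings stand behind each mean.

```r
library(dplyr)

# Hypothetical toy ratings: one row per user rating, as in movielens
toy_ratings <- tibble(
  year           = c(2008, 2008, 2008, 2009),
  genres_updated = c("Action", "Action", "Drama", "Action"),
  rating         = c(3.0, 4.0, 2.5, 5.0)
)

toy_summary <- toy_ratings |>
  group_by(year, genres_updated) |>
  summarize(
    mean_ratings = mean(rating),  # average over individual ratings in the cell
    n_ratings    = n(),           # number of ratings behind each mean
    .groups      = "drop"         # return an ungrouped tibble
  )

toy_summary
```

Here the two Action ratings for 2008 average to 3.5, while the other two cells each rest on a single rating. Setting `.groups = "drop"` also avoids the informational message that `summarise()` prints about grouped output.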

To color the heatmap, I took five colors from the Set1 palette in the RColorBrewer package and passed them to a continuous gradient scale, so the fill color of each cell is interpolated between those five colors according to the mean rating. Because Set1 is a qualitative palette, differences in rating appear as changes in hue rather than as a light-to-dark progression. The heatmap shows the relationship between movie genres and their average ratings over the ten-year period: the x-axis represents the release year, the y-axis displays movie genres, and the fill color of each cell corresponds to the mean rating for that genre in that year.
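As a possible alternative (not the approach used above), a sequential Brewer palette such as "YlOrRd" would make color intensity track the mean rating directly. The sketch below uses a hypothetical `toy_cells` table with made-up values; only the palette choice differs from the plotting code in this document.

```r
library(ggplot2)
library(RColorBrewer)

# Hypothetical toy summary table (values are illustrative only)
toy_cells <- data.frame(
  year           = c(2008, 2008, 2009, 2009),
  genres_updated = c("Action", "Drama", "Action", "Drama"),
  mean_ratings   = c(3.5, 2.5, 4.0, 3.0)
)

# "YlOrRd" is a sequential palette: light colors for low values, dark for high
seq_palette <- brewer.pal(5, "YlOrRd")

toy_heatmap <- ggplot(toy_cells, aes(x = factor(year), y = genres_updated, fill = mean_ratings)) +
  geom_tile() +
  scale_fill_gradientn(colors = seq_palette, name = "Mean Ratings") +
  theme_minimal()

print(toy_heatmap)
```

With a sequential palette, a reader can rank cells by shade alone, whereas a qualitative palette such as Set1 requires checking the legend to see which hue stands for a higher rating.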

This heatmap helps us see how average user ratings for different genres vary with release year, and which genres consistently received high or low ratings during the period. It’s a useful tool for exploring trends and patterns in movie ratings for specific genres.

The gaps in the heatmap, where a cell has no color, mean that the filtered data contain no ratings for that combination of genre and release year. One possible reason is that no movies of that genre were released in that year. This is more likely for niche or less common genres that may not have regular annual releases.

Alternatively, movies of that genre may have been released in that year but have no recorded ratings in the dataset. This could be due to limited viewership, an absence of user ratings, or data collection issues.
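One way to list the gaps explicitly, rather than reading them off the plot, is to fill in every year and genre combination with `tidyr::complete()` and keep the rows where the mean is missing. This is a minimal sketch on a hypothetical `toy_summary` table with made-up values, not output from the analysis above.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy summary with one missing combination (2009, Drama)
toy_summary <- tibble(
  year           = c(2008, 2008, 2009),
  genres_updated = c("Action", "Drama", "Action"),
  mean_ratings   = c(3.5, 2.5, 5.0)
)

# complete() adds a row for every year x genre combination; new rows get NA
toy_complete <- toy_summary |>
  complete(year, genres_updated)

# Rows that would appear as blank tiles in the heatmap
toy_gaps <- toy_complete |>
  filter(is.na(mean_ratings))

toy_gaps
```

The single row returned is the 2009 Drama cell. If the summary table is still grouped by year, as the `summarise()` message in this document indicates, an `ungroup()` call would be needed before `complete()`, because `complete()` works within each group. This check only shows which combinations are absent; it cannot distinguish between the two explanations given above.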