library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dslabs)
data(package = "dslabs")
data("movielens")
head(movielens)
## movieId title year
## 1 31 Dangerous Minds 1995
## 2 1029 Dumbo 1941
## 3 1061 Sleepers 1996
## 4 1129 Escape from New York 1981
## 5 1172 Cinema Paradiso (Nuovo cinema Paradiso) 1989
## 6 1263 Deer Hunter, The 1978
## genres userId rating timestamp
## 1 Drama 1 2.5 1260759144
## 2 Animation|Children|Drama|Musical 1 3.0 1260759179
## 3 Thriller 1 3.0 1260759182
## 4 Action|Adventure|Sci-Fi|Thriller 1 2.0 1260759185
## 5 Drama 1 4.0 1260759205
## 6 Drama|War 1 2.0 1260759151
movielens |>
  group_by(genres) |>
  summarise(entry = n()) |>
  arrange(desc(entry))
## # A tibble: 901 × 2
## genres entry
## <fct> <int>
## 1 Drama 7757
## 2 Comedy 6748
## 3 Comedy|Romance 3973
## 4 Drama|Romance 3462
## 5 Comedy|Drama 3272
## 6 Comedy|Drama|Romance 3204
## 7 Crime|Drama 2367
## 8 Action|Adventure|Sci-Fi 2146
## 9 Action|Adventure|Sci-Fi|Thriller 1453
## 10 Action|Crime|Thriller 1441
## # ℹ 891 more rows
I chose Action|Adventure|Thriller, Comedy, Drama, Horror, and Documentary because they had substantial numbers of entries (600+) and don’t categorically overlap.
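The entry counts behind this choice can be checked directly. The following is a minimal sketch, assuming the same movielens data from dslabs; `chosen_genres` and `chosen_counts` are names introduced here for illustration and are not part of the original analysis.

```r
library(dplyr)
library(dslabs)
data("movielens")

# The five genre labels selected for the comparison (illustrative name)
chosen_genres <- c("Action|Adventure|Thriller", "Comedy", "Drama",
                   "Horror", "Documentary")

# Count ratings per selected genre, largest first
chosen_counts <- movielens |>
  filter(genres %in% chosen_genres) |>
  count(genres, name = "entry", sort = TRUE)

chosen_counts
```

Because each value of `genres` is a single combined label, a rating belongs to exactly one of these five groups, so the groups cannot share rows.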
genre_ratings <- movielens |>
  # Keep only the five selected genres ("(no genres listed)" is excluded by this test)
  filter(genres %in% c("Action|Adventure|Thriller", "Comedy", "Drama", "Horror", "Documentary")) |>
  group_by(genres, year) |>
  summarise(mean_rating = mean(rating, na.rm = TRUE))
## `summarise()` has grouped output by 'genres'. You can override using the
## `.groups` argument.
genre_ratings
## # A tibble: 318 × 3
## # Groups: genres [5]
## genres year mean_rating
## <fct> <int> <dbl>
## 1 Action|Adventure|Thriller 1933 4
## 2 Action|Adventure|Thriller 1962 3.65
## 3 Action|Adventure|Thriller 1963 3.64
## 4 Action|Adventure|Thriller 1964 3.72
## 5 Action|Adventure|Thriller 1965 3.56
## 6 Action|Adventure|Thriller 1971 3.43
## 7 Action|Adventure|Thriller 1973 3.62
## 8 Action|Adventure|Thriller 1974 3.5
## 9 Action|Adventure|Thriller 1977 3.75
## 10 Action|Adventure|Thriller 1981 3.53
## # ℹ 308 more rows
genre_ratings |>
  ggplot(aes(x = year, y = mean_rating, color = genres)) +
  geom_point() +
  facet_wrap(~genres, nrow = 5) +
  theme_minimal() +
  labs(
    x = "Year",
    y = "Mean Rating (0-5)",
    title = "Ratings of Movie Genres Over the Last 100 Years",
    color = "Genre"
  )
## Warning: Removed 2 rows containing missing values (`geom_point()`).
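The warning says that two summarised rows could not be plotted because `year` or `mean_rating` was missing. One way to see which rows those are is sketched below; it rebuilds the summary so it runs on its own, and `missing_rows` is a name introduced here for illustration.

```r
library(dplyr)
library(dslabs)
data("movielens")

# Rebuild the per-genre, per-year summary used for the plot
genre_ratings <- movielens |>
  filter(genres %in% c("Action|Adventure|Thriller", "Comedy", "Drama",
                       "Horror", "Documentary")) |>
  group_by(genres, year) |>
  summarise(mean_rating = mean(rating, na.rm = TRUE), .groups = "drop")

# Rows that geom_point() cannot place: missing x (year) or missing y (mean_rating)
missing_rows <- genre_ratings |>
  filter(is.na(year) | is.na(mean_rating))

missing_rows
```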
I used the movielens dataset, which contains data on movies from 1902 to 2016. The variables in the dataset are the movie ID number, the movie's title, the year of its release, its genre(s), the user ID of the person who gave each rating, the rating (0-5), and the timestamp (date and time) at which the rating was given.
I explored the average ratings for 5 distinct genres of movies
(Action|Adventure|Thriller, Comedy, Drama, Documentary, and Horror) over
the timeframe of the data. I used the geom_point and
facet_wrap functions to create a faceted scatterplot
showing the mean rating for each genre in each release year. In
facet_wrap, I set nrow = 5 so that each scatterplot has
its own row, which makes it easy to compare the genres' ratings over
the years.
Based on the plots, Comedy, Documentary, and Drama movies appear to have been rated consistently high over the years, while Horror ratings fluctuated and Action|Adventure|Thriller ratings stayed in the middle (2-4) range. I also notice that many years have no ratings for some genres, which may affect the interpretation of the data. If I were to redo this visualization, I would select genres with similar year distributions or filter the data to a specific time period, such as 1970-2010, which would represent the selected genres better.
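The time-window idea could look something like the sketch below. It is not part of the original analysis: the 1970-2010 bounds come from the paragraph above, while `chosen_genres`, `genre_ratings_window`, `n_ratings`, and `year_coverage` are names introduced here for illustration.

```r
library(dplyr)
library(dslabs)
data("movielens")

chosen_genres <- c("Action|Adventure|Thriller", "Comedy", "Drama",
                   "Horror", "Documentary")

# Restrict to the selected genres and to release years 1970-2010;
# rows with a missing year are dropped because between() returns NA for them
genre_ratings_window <- movielens |>
  filter(genres %in% chosen_genres, between(year, 1970, 2010)) |>
  group_by(genres, year) |>
  summarise(
    mean_rating = mean(rating, na.rm = TRUE),
    n_ratings   = n(),
    .groups     = "drop"
  )

# Number of years (out of 41 in the window) for which each genre has ratings
year_coverage <- genre_ratings_window |>
  count(genres, name = "years_with_ratings")

year_coverage
```

Comparing `years_with_ratings` across genres gives a quick sense of whether the five genres have similar year distributions within the window, and `n_ratings` shows how many ratings stand behind each plotted mean.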