For this assignment, the “movielens” dataset from the “dslabs” package was used. The dataset contains movie ratings and information from the MovieLens website.
# install.packages("dslabs") # these are data science labs
library("dslabs")
## Warning: package 'dslabs' was built under R version 4.2.3
data(package="dslabs")
list.files(system.file("script", package = "dslabs"))
## [1] "make-admissions.R"
## [2] "make-brca.R"
## [3] "make-brexit_polls.R"
## [4] "make-death_prob.R"
## [5] "make-divorce_margarine.R"
## [6] "make-gapminder-rdas.R"
## [7] "make-greenhouse_gases.R"
## [8] "make-historic_co2.R"
## [9] "make-mnist_27.R"
## [10] "make-movielens.R"
## [11] "make-murders-rda.R"
## [12] "make-na_example-rda.R"
## [13] "make-nyc_regents_scores.R"
## [14] "make-olive.R"
## [15] "make-outlier_example.R"
## [16] "make-polls_2008.R"
## [17] "make-polls_us_election_2016.R"
## [18] "make-reported_heights-rda.R"
## [19] "make-research_funding_rates.R"
## [20] "make-stars.R"
## [21] "make-temp_carbon.R"
## [22] "make-tissue-gene-expression.R"
## [23] "make-trump_tweets.R"
## [24] "make-weekly_us_contagious_diseases.R"
## [25] "save-gapminder-example-csv.R"
Loading datasets and packages
data(movielens)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.1.8
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.2.3
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.2.3
#view(movielens)
write_csv(movielens, "movielens.csv", na="")
Defining a custom color palette with a unique color for each genre
genre_colors <- c("Action" = "red", "Adventure" = "blue", "Animation" = "green",
"Children's" = "purple", "Comedy" = "orange", "Crime" = "black",
"Documentary" = "grey", "Drama" = "pink", "Fantasy" = "brown",
"Film-Noir" = "navy", "Horror" = "darkred", "Musical" = "gold",
"Mystery" = "darkgreen", "Romance" = "maroon", "Sci-Fi" = "#008080",
"Thriller" = "darkblue", "War" = "#377E22", "Western" = "sienna")
Creating the movie_ratings data frame, which holds the average rating and the number of ratings for each movie ID
movie_ratings <- movielens %>%
group_by(movieId) %>%
summarize(avg_rating = mean(rating), num_ratings = n())
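To make the shape of this summary concrete, here is a minimal sketch of the same `group_by()` and `summarize()` pattern on a tiny invented data frame. The name `toy_ratings` and its values are hypothetical and only stand in for `movielens`.

```r
library(dplyr)

# Invented ratings: three for movie 1, two for movie 2.
toy_ratings <- data.frame(
  movieId = c(1, 1, 1, 2, 2),
  rating  = c(4, 5, 3, 2, 3)
)

# One output row per movieId, with the mean rating and the row count.
toy_summary <- toy_ratings %>%
  group_by(movieId) %>%
  summarize(avg_rating = mean(rating), num_ratings = n())

print(toy_summary)
```

The result has one row per movie, so `num_ratings` is simply the number of rows that movie contributed to the input.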
Creating the movie_info data frame by joining the movie details to movie_ratings with inner_join, which keeps only movies that have rating information
movie_info <- movielens %>%
select(movieId, title, genres, timestamp) %>%
inner_join(movie_ratings, by = "movieId")
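Because `movielens` stores one row per rating, selecting columns from it and joining the per-movie summary returns one row per rating, so the same movie appears many times. The sketch below compares that shape with a one-row-per-movie variant that drops the per-rating `timestamp` and uses `distinct()`. The `toy_ratings` data and the names `per_rating` and `per_movie` are hypothetical, and the variant is a possible alternative rather than the approach used above.

```r
library(dplyr)

# Invented stand-in for movielens: one row per rating.
toy_ratings <- data.frame(
  movieId   = c(1, 1, 1, 2, 2),
  title     = c("A", "A", "A", "B", "B"),
  genres    = c("Comedy", "Comedy", "Comedy", "Drama", "Drama"),
  rating    = c(4, 5, 3, 2, 3),
  timestamp = c(101, 102, 103, 104, 105)
)

toy_summary <- toy_ratings %>%
  group_by(movieId) %>%
  summarize(avg_rating = mean(rating), num_ratings = n())

# Same pattern as movie_info: the join keeps one row per rating.
per_rating <- toy_ratings %>%
  select(movieId, title, genres, timestamp) %>%
  inner_join(toy_summary, by = "movieId")

# Variant: drop the timestamp and keep one row per movie before joining.
per_movie <- toy_ratings %>%
  distinct(movieId, title, genres) %>%
  inner_join(toy_summary, by = "movieId")

print(nrow(per_rating))
print(nrow(per_movie))
```

Both versions place the points at the same coordinates in a scatter plot. The per-movie version only avoids drawing identical points on top of each other.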
Creating the scatter plot. The x-axis represents the average rating for each movie, and the y-axis represents the number of ratings for each movie. Each data point is colored by its genre using the custom color palette. The “theme_light()” function applies one of ggplot2's built-in themes, which uses light grey grid lines and axes on a white background.
ggplot(movie_info %>% filter(!is.na(num_ratings)), aes(x = avg_rating, y = num_ratings, color = genres)) +
geom_point(size=1) +
scale_color_manual(values = genre_colors) +
labs(x = "Average Rating", y = "Number of Ratings", title = "Number of Ratings vs. Average Rating by Genre") +
theme_light()
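The `genres` column in `movielens` can hold several genres joined by `|` (for example `Action|Adventure`), and a palette keyed by single genre names will not match such combined labels. A minimal sketch of one workaround is to color by the first listed genre. The column name `primary_genre`, the `toy_info` data, and the shortened `toy_colors` palette are all hypothetical, and taking the first genre is a simplifying assumption rather than a property of the dataset.

```r
library(dplyr)
library(ggplot2)

# Invented stand-in for movie_info.
toy_info <- data.frame(
  genres      = c("Comedy", "Action|Adventure", "Drama|Romance"),
  avg_rating  = c(3.5, 4.0, 3.0),
  num_ratings = c(120, 250, 40)
)

# Shortened palette in the same named-vector form as genre_colors.
toy_colors <- c("Action" = "red", "Comedy" = "orange", "Drama" = "pink")

# Keep only the text before the first "|" as the primary genre.
toy_info <- toy_info %>%
  mutate(primary_genre = sub("\\|.*$", "", as.character(genres)))

p <- ggplot(toy_info, aes(x = avg_rating, y = num_ratings, color = primary_genre)) +
  geom_point(size = 1) +
  scale_color_manual(values = toy_colors) +
  labs(x = "Average Rating", y = "Number of Ratings", color = "Primary genre") +
  theme_light()

print(toy_info$primary_genre)
```

Printing `p` draws the plot. With the real data, the same `mutate()` step could be applied to `movie_info` before plotting with `genre_colors`.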
In this plot, we can see that movies with higher average ratings tend to have more ratings, which is not surprising. There are also some interesting patterns by genre. Documentary points are by far the most numerous, and apart from Thriller, Documentary appears to be the only genre with movies that have more than 200 ratings. I would conclude that the data from the MovieLens website contains many ratings and a lot of information on documentaries. Musicals and dramas are also noticeable: musicals sit toward the lower end of the average ratings, while dramas mostly cluster at average ratings between 3 and 4. The rest of the genres are barely noticeable.
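One way to sanity-check a genre reading like this is to see how many `genres` values actually match a name in the palette. `scale_color_manual()` draws any value without a palette entry in its `na.value` color, which defaults to a grey, so unmatched labels can look similar to a genre that was assigned grey. The sketch below uses invented data. `toy_movies` and `toy_colors` are hypothetical stand-ins for `movie_info` and `genre_colors`.

```r
# Invented stand-in for the plotted data.
toy_movies <- data.frame(
  title  = c("A", "B", "C", "D"),
  genres = c("Comedy", "Action|Adventure", "Drama", "Comedy|Romance"),
  stringsAsFactors = FALSE
)

# Shortened palette in the same named-vector form as genre_colors.
toy_colors <- c("Action" = "red", "Comedy" = "orange", "Drama" = "pink")

# Genre labels present in the data but absent from the palette names.
unmatched <- setdiff(unique(toy_movies$genres), names(toy_colors))
print(unmatched)

# Share of rows that would fall back to the NA color in the plot.
share_unmatched <- mean(!toy_movies$genres %in% names(toy_colors))
print(share_unmatched)
```

On the real data the same check would be `setdiff(unique(as.character(movielens$genres)), names(genre_colors))`. It would also flag any palette name whose spelling differs from the label used in the data.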