DSLabs Dataset Assignment

Load packages

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dslabs)
data(package = "dslabs")

data("movielens")
head(movielens)

##   movieId                                   title year
## 1      31                         Dangerous Minds 1995
## 2    1029                                   Dumbo 1941
## 3    1061                                Sleepers 1996
## 4    1129                    Escape from New York 1981
## 5    1172 Cinema Paradiso (Nuovo cinema Paradiso) 1989
## 6    1263                        Deer Hunter, The 1978
##                             genres userId rating  timestamp
## 1                            Drama      1    2.5 1260759144
## 2 Animation|Children|Drama|Musical      1    3.0 1260759179
## 3                         Thriller      1    3.0 1260759182
## 4 Action|Adventure|Sci-Fi|Thriller      1    2.0 1260759185
## 5                            Drama      1    4.0 1260759205
## 6                        Drama|War      1    2.0 1260759151

Group by genres and arrange in descending order of total entries.

movielens |>
  group_by(genres) |>
  summarise(entry = n()) |>
  arrange(desc(entry)) |>
  select(genres, entry)

## # A tibble: 901 × 2
##    genres                           entry
##    <fct>                            <int>
##  1 Drama                             7757
##  2 Comedy                            6748
##  3 Comedy|Romance                    3973
##  4 Drama|Romance                     3462
##  5 Comedy|Drama                      3272
##  6 Comedy|Drama|Romance              3204
##  7 Crime|Drama                       2367
##  8 Action|Adventure|Sci-Fi           2146
##  9 Action|Adventure|Sci-Fi|Thriller  1453
## 10 Action|Crime|Thriller             1441
## # ℹ 891 more rows

Group by genres and year, filter out movies with no listed genres, and keep only 5 distinct movie genres. Then summarize the mean rating for each genre in each year.

I chose Action|Adventure|Thriller, Comedy, Drama, Horror, and Documentary because they had substantial numbers of entries (600+) and don’t categorically overlap.

genre_ratings <- movielens |>
  group_by(genres, year) |>
  filter(
    genres != "(no genres listed)",
    genres %in% c("Action|Adventure|Thriller", "Comedy", "Drama", "Horror", "Documentary")
  ) |>
  summarise(mean_rating = mean(rating, na.rm = TRUE))

## `summarise()` has grouped output by 'genres'. You can override using the
## `.groups` argument.

genre_ratings

## # A tibble: 318 × 3
## # Groups:   genres [5]
##    genres                     year mean_rating
##    <fct>                     <int>       <dbl>
##  1 Action|Adventure|Thriller  1933        4   
##  2 Action|Adventure|Thriller  1962        3.65
##  3 Action|Adventure|Thriller  1963        3.64
##  4 Action|Adventure|Thriller  1964        3.72
##  5 Action|Adventure|Thriller  1965        3.56
##  6 Action|Adventure|Thriller  1971        3.43
##  7 Action|Adventure|Thriller  1973        3.62
##  8 Action|Adventure|Thriller  1974        3.5 
##  9 Action|Adventure|Thriller  1977        3.75
## 10 Action|Adventure|Thriller  1981        3.53
## # ℹ 308 more rows
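The `summarise()` message above appears because the result stays grouped by `genres`. A minimal sketch of one way to silence it, assuming an ungrouped result is acceptable for the later steps: pass `.groups = "drop"`. The names `chosen_genres` and `genre_ratings_flat` are illustrative and not part of the original analysis.

```r
library(dplyr)
library(dslabs)

# Same five genres as in the analysis above (helper name is illustrative)
chosen_genres <- c("Action|Adventure|Thriller", "Comedy", "Drama", "Horror", "Documentary")

genre_ratings_flat <- movielens |>
  filter(genres %in% chosen_genres) |>
  group_by(genres, year) |>
  # .groups = "drop" returns an ungrouped tibble and suppresses the message
  summarise(mean_rating = mean(rating, na.rm = TRUE), .groups = "drop")

# Check whether any grouping remains on the result
is_grouped_df(genre_ratings_flat)
```

Filtering before grouping selects the same rows as the original pipeline, since the filter conditions do not depend on the groups.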

Create a faceted scatterplot with the x-axis representing the year, the y-axis representing the mean rating, and the points color-coded by genre. Arrange the scatterplots in 5 rows, one for each of the 5 genres.

genre_ratings |> ggplot(aes(x = year, y = mean_rating, color = genres)) + 
  geom_point() +
  facet_wrap(~genres, nrow = 5) +
  theme_minimal() +
  labs(
    x = "Year",
    y = "Mean Rating (0-5)",
    title = "Ratings of Movie Genres Over the Last 100 Years",
    color = "Genre"
  )

## Warning: Removed 2 rows containing missing values (`geom_point()`).
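The warning above most likely comes from summary rows whose `year` is `NA`, which `geom_point()` cannot place on the x-axis. That cause is an assumption and is worth confirming by inspecting `genre_ratings |> filter(is.na(year))`. A minimal sketch of dropping those rows before summarizing, using the illustrative names `chosen_genres` and `plot_data`:

```r
library(dplyr)
library(dslabs)

# Same five genres as in the analysis above (helper name is illustrative)
chosen_genres <- c("Action|Adventure|Thriller", "Comedy", "Drama", "Horror", "Documentary")

plot_data <- movielens |>
  # Keep only the chosen genres and discard entries with no release year
  filter(genres %in% chosen_genres, !is.na(year)) |>
  group_by(genres, year) |>
  summarise(mean_rating = mean(rating, na.rm = TRUE), .groups = "drop")

# Number of summary rows still missing a year (should be 0 after the filter)
sum(is.na(plot_data$year))
```

Passing `plot_data` to the same `ggplot()` call should then draw every remaining row, provided missing years were indeed the cause of the warning.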

I used the movielens dataset, which contains data on movies from 1902 to 2016. The variables in the dataset are the movie ID number, the title of the movie, the year of the movie’s release, the genre(s) of the movie, the user ID number of the person who gave each rating, the rating of the movie (0-5), and the timestamp (date and time) at which the rating was given.

I explored the average ratings for 5 distinct genres of movies (Action|Adventure|Thriller, Comedy, Drama, Documentary, and Horror) over the timeframe of the data. I used the geom_point and facet_wrap functions to create a faceted scatterplot showing the mean rating each year for each genre. In facet_wrap, I set nrow = 5 so that each scatterplot had its own row, which makes it easier to compare the genres’ ratings over the years.

Based on the plots, it seems that Comedy, Documentary, and Drama movies were consistently rated highly over the years, while Horror movie ratings fluctuated and Action|Adventure|Thriller movies stayed in the middle (2-4) range. Looking at the plots, I notice that many years have no ratings for some genres, which may affect the interpretation of the data. If I were to redo this visualization, I would select genres that all have similar year distributions, or filter the data to a specific time period, like 1970-2010, which would have better represented the genres selected for this plot.
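The time-period restriction suggested above could be sketched as follows. This is an illustration under the assumption that 1970-2010 is the window of interest. The names `chosen_genres` and `recent_ratings` and the column `n_ratings` are illustrative and not part of the original analysis.

```r
library(dplyr)
library(dslabs)

# Same five genres as in the analysis above (helper name is illustrative)
chosen_genres <- c("Action|Adventure|Thriller", "Comedy", "Drama", "Horror", "Documentary")

recent_ratings <- movielens |>
  # Restrict to the chosen genres and to release years 1970 through 2010;
  # rows with a missing year are dropped by the comparison
  filter(genres %in% chosen_genres, year >= 1970, year <= 2010) |>
  group_by(genres, year) |>
  summarise(
    mean_rating = mean(rating, na.rm = TRUE),
    n_ratings = n(),  # how many ratings each yearly mean is based on
    .groups = "drop"
  )

# Number of years with at least one rating, per genre
recent_ratings |> count(genres, name = "years_with_ratings")
```

Comparing `years_with_ratings` across genres gives a quick check of whether the year distributions are similar enough to compare, and `n_ratings` shows which yearly means rest on only a few ratings.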

Yoseph Habtu

2023-10-24
