Does gender, age, genre and a movie’s release year impact how surveyees rate movies?

Overview

The data below was collected from five individuals. Specifically, three males, two females, with ages ranging from 28 to 60. It represents how they rate six recent movies with varying genres (for the most part). Note, movies may have multiple genres – the data only captures the “main” genre, as interpreted by myself, the surveyor, as the first listed under the movie’s IMDB profile. If an individual did not see a movie in the survey, a null value is denoted in the “Rating” column.

1. Load packages into R.

library(readr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ dplyr   1.0.7
## ✓ tibble  3.1.6     ✓ stringr 1.4.0
## ✓ tidyr   1.1.4     ✓ forcats 0.5.1
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(curl)
## Using libcurl 7.64.1 with LibreSSL/2.8.3
## 
## Attaching package: 'curl'
## The following object is masked from 'package:readr':
## 
##     parse_date
library(ggplot2)
library(dplyr)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(DBI)
library(odbc)

2. Read the survey data into an R data frame.

movie_ratings <- read.csv(curl("https://raw.githubusercontent.com/brsingh7/DATA607/main/Brian_Singh_movie_ratings"))
head(movie_ratings,10)
##    person age       movie_title release_year   genre rating
## 1   Tasha  32     Black Panther         2018  Action      4
## 2   Tasha  32             Joker         2019   Drama      2
## 3   Tasha  32 The Invisible Man         2020  Horror     NA
## 4   Tasha  32    A Star is Born         2018 Romance      5
## 5   Tasha  32     The Nice Guys         2016  Comedy      3
## 6   Tasha  32 Avengers: Endgame         2019  Action      4
## 7  Dereck  38     Black Panther         2018  Action      5
## 8  Dereck  38             Joker         2019   Drama      3
## 9  Dereck  38 The Invisible Man         2020  Horror      4
## 10 Dereck  38    A Star is Born         2018 Romance     NA

3. Summarize data.

print(summary(movie_ratings))
##     person               age       movie_title         release_year 
##  Length:30          Min.   :28.0   Length:30          Min.   :2016  
##  Class :character   1st Qu.:32.0   Class :character   1st Qu.:2018  
##  Mode  :character   Median :38.0   Mode  :character   Median :2018  
##                     Mean   :43.2                      Mean   :2018  
##                     3rd Qu.:59.0                      3rd Qu.:2019  
##                     Max.   :60.0                      Max.   :2020  
##                                                                     
##     genre               rating    
##  Length:30          Min.   :1.00  
##  Class :character   1st Qu.:3.00  
##  Mode  :character   Median :4.00  
##                     Mean   :3.56  
##                     3rd Qu.:4.00  
##                     Max.   :5.00  
##                     NA's   :5

4. Convert release year from integer to a date num field.

movie_ratings$release_year <- as.Date(as.character(movie_ratings$release_year), format = "%Y")
movie_ratings$release_year <- year(movie_ratings$release_year)

5. Find the average rating by genre, removing null values.

movie_ratings%>%
    group_by(genre) %>%
    summarise(mean_rtg = mean(rating, na.rm = TRUE,
              count_rtg = n())) %>%
    arrange(desc(mean_rtg))
## # A tibble: 5 × 2
##   genre   mean_rtg
##   <chr>      <dbl>
## 1 Action      4.2 
## 2 Horror      3.67
## 3 Comedy      3.5 
## 4 Romance     3   
## 5 Drama       2.5
ggplot(data=subset(movie_ratings,!is.na(rating)),aes(x=reorder(genre,-rating),y=rating)) +
    geom_bar(position = "dodge",
        stat = "summary",
        fun = "mean") +
    ggtitle("Average Movie Rating by Genre") + xlab("Genre") + ylab("Avg Rating")

Based on the survey, it appears movies with an Action genre are rated regardless of age, with Drama performing the worst of the five included in the survey.

6. Does recency bias have an impact on the movie’s rating?

movie_ratings%>%
    group_by(release_year) %>%
    summarise(mean_rtg = mean(rating, na.rm = TRUE),
              count_rtg = n()) %>%
    arrange(desc(mean_rtg))
## # A tibble: 4 × 3
##   release_year mean_rtg count_rtg
##          <dbl>    <dbl>     <int>
## 1         2018     3.78        10
## 2         2020     3.67         5
## 3         2016     3.5          5
## 4         2019     3.33        10
ggplot(data=subset(movie_ratings,!is.na(rating)),aes(x=release_year,y=rating)) +
    geom_smooth(position = "dodge",
        stat = "summary",
        fun = "mean") +
    ggtitle("Average Movie Rating by Release Year") + xlab("Release Year") + ylab("Avg Rating") +
    ylim(1,5)
## Warning: Width not defined. Set with `position_dodge(width = ?)`
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

From the survey, no visible conclusion can be made based on in terms of recency bias. The range in differences of average rating year-over-year does not exceed 0.44.

7. Does age have an impact on movie ratings, based on genre?

#Create a column to assign a group to "age" as follows: 20-30, 30-40, 40-50,50-60
movie_ratings$age_group <- cut(movie_ratings$age,c(20,30,40,50,60))

genre_age <- movie_ratings%>%
    group_by(genre,age_group) %>%
    summarise(mean_rtg = mean(rating, na.rm = TRUE),
              count_rtg = n()) %>%
    arrange(genre,age_group,desc(mean_rtg))
## `summarise()` has grouped output by 'genre'. You can override using the
## `.groups` argument.
genre_age
## # A tibble: 15 × 4
## # Groups:   genre [5]
##    genre   age_group mean_rtg count_rtg
##    <chr>   <fct>        <dbl>     <int>
##  1 Action  (20,30]       5            2
##  2 Action  (30,40]       4.25         4
##  3 Action  (50,60]       3.75         4
##  4 Comedy  (20,30]     NaN            1
##  5 Comedy  (30,40]       4            2
##  6 Comedy  (50,60]       3            2
##  7 Drama   (20,30]       1            1
##  8 Drama   (30,40]       2.5          2
##  9 Drama   (50,60]       4            2
## 10 Horror  (20,30]       3            1
## 11 Horror  (30,40]       4            2
## 12 Horror  (50,60]       4            2
## 13 Romance (20,30]       2            1
## 14 Romance (30,40]       5            2
## 15 Romance (50,60]       2.5          2
ggplot(data=subset(genre_age,!is.na(mean_rtg)),aes(x=age_group,y=mean_rtg)) +
    geom_bar(position = "dodge",
        stat = "summary",
        fun = "mean") +
    facet_wrap(~genre)+
    ggtitle("Average Movie Rating by Genre, Age") + xlab("Genre") + ylab("Avg Rating")

From the results, some conclusions may be inferred, though it leaves open questions that may require further investigation. Romance movies do not appear popular outside of the 30-40 year old age group. The 50-60 year old age group appear to favor Drama and Horror films, while the 20-30 year old group does not. 20-30 year olds appear to enjoy Action movies the most, with apparently no interest in Comedy. Because only one individual was surveyed in this age group, is this an outlier? The sample size is not large enough to draw a reasonable conclusion.

8. Does interest in movie genres vary by gender?

movie_ratings$gender <- c("F","F","F","F","F","F","M","M","M","M","M","M","M","M","M","M","M","M","F","F","F","F","F","F","M","M","M","M","M","M") #I attempted an ifelse based on Tasha and Shoma to return F, else return M, but it kept returning all F

movie_ratings%>%
    group_by(gender, genre) %>%
    summarise(mean_rtg = mean(rating, na.rm = TRUE,
              count_rtg = n())) %>%
    arrange(genre,gender,desc(mean_rtg))
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
## # A tibble: 10 × 3
## # Groups:   gender [2]
##    gender genre   mean_rtg
##    <chr>  <chr>      <dbl>
##  1 F      Action      3.5 
##  2 M      Action      4.67
##  3 F      Comedy      2.5 
##  4 M      Comedy      4.5 
##  5 F      Drama       2   
##  6 M      Drama       2.67
##  7 F      Horror    NaN   
##  8 M      Horror      3.67
##  9 F      Romance     4.5 
## 10 M      Romance     1.5
ggplot(data=subset(movie_ratings,!is.na(rating)),aes(x=gender,y=rating, color=gender)) +
    geom_bar(position = "dodge",
        stat = "summary",
        fun = "mean") +
    facet_wrap(~genre) +
    ggtitle("Average Movie Rating by Genre, Gender") + xlab("Gender") + ylab("Avg Rating")

Based on the results, females tend to prefer Romance over males, while males tend to rate Action and Comedy higher than females. The average rating for Drama is relatively equal across male & female genders. It does not appear that the females in this survey cared to watch Horror films, at least not The Invisible Man.

# ### 9. Does replacing null values with the average by genre and gender tell a better story?
# 
# #I attempted to replace null values with the average by genre and gender as calculated above to "predict" what the rating would be if the person, in fact, saw the movie. I came across some errors in attempting to do so, so I commented out the code. Any input would be great.
# 
# #replace nan with 0
# movie_ratings$rating <- ifelse(is.nan(movie_ratings$rating),0,movie_ratings$rating)
# #average genre rating by gender
# avg_genre_gender <- movie_ratings %>%
#     group_by(genre,gender) %>%
#     summarise_at(vars(rating), list(avg=mean),na.rm=TRUE)
# 
# avg_genre_gender
# 
# #copy movie rating table & replace na's with avg by gender/genre
# movie_ratings2 <- movie_ratings
# movie_ratings2$rating <- as.double(movie_ratings2$rating) #convert int to double for matching columns
# #replace nan with 0
# movie_ratings2$rating <- ifelse(is.nan(movie_ratings2$rating)==TRUE,avg_genre_gender$avg[avg_genre_gender$genre %in% movie_ratings2$genre && avg_genre_gender$gender %in% movie_ratings2$gender],movie_ratings2$rating)
# movie_ratings2
# 
# movie_ratings2 %>%
#     group_by(gender, genre) %>%
#     summarise(mean_rtg = mean(rating,
#               count_rtg = n())) %>%
#     arrange(genre,gender,desc(mean_rtg))
# 
# ggplot(data=movie_ratings2,aes(x=gender,y=rating)) +
#     geom_bar(position = "dodge",
#         stat = "summary",
#         fun = "mean") +
#     facet_wrap(~genre) +
#     ggtitle("Average Movie Rating by Genre, Gender") + xlab("Gender") + ylab("Avg Rating")

Conclusion

Per the survey results, it appears that gender and age in combination with a movie’s genre affect how it’s rated, however, recency bias is not apparent in the results. As seen above, Action movies are the highest rated without factoring in any other criteria. How could you not love Black Panther and Avengers: Endgame, anyway? When taking into account age groups (20-30, 30-40, and 50-60), you can see that ratings for Action and Drama movies are actually reversed, while the 30-40 year old age range favors Romance movies more so than other age groups. Finally, based on gender, the Male population is more into Action & Comedy movie, while Females prefer Romance, while apparently avoiding Horror films. With a small sample of 5 individuals, these conclusions are far from proven, however, it can serve as a basis for a hypothesis to be proven with in-depth research.