The data below was collected from five individuals. Specifically, three males, two females, with ages ranging from 28 to 60. It represents how they rate six recent movies with varying genres (for the most part). Note, movies may have multiple genres – the data only captures the “main” genre, as interpreted by myself, the surveyor, as the first listed under the movie’s IMDB profile. If an individual did not see a movie in the survey, a null value is denoted in the “Rating” column.
library(readr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ dplyr 1.0.7
## ✓ tibble 3.1.6 ✓ stringr 1.4.0
## ✓ tidyr 1.1.4 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(curl)
## Using libcurl 7.64.1 with LibreSSL/2.8.3
##
## Attaching package: 'curl'
## The following object is masked from 'package:readr':
##
## parse_date
library(ggplot2)
library(dplyr)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(DBI)
library(odbc)
movie_ratings <- read.csv(curl("https://raw.githubusercontent.com/brsingh7/DATA607/main/Brian_Singh_movie_ratings"))
head(movie_ratings,10)
## person age movie_title release_year genre rating
## 1 Tasha 32 Black Panther 2018 Action 4
## 2 Tasha 32 Joker 2019 Drama 2
## 3 Tasha 32 The Invisible Man 2020 Horror NA
## 4 Tasha 32 A Star is Born 2018 Romance 5
## 5 Tasha 32 The Nice Guys 2016 Comedy 3
## 6 Tasha 32 Avengers: Endgame 2019 Action 4
## 7 Dereck 38 Black Panther 2018 Action 5
## 8 Dereck 38 Joker 2019 Drama 3
## 9 Dereck 38 The Invisible Man 2020 Horror 4
## 10 Dereck 38 A Star is Born 2018 Romance NA
print(summary(movie_ratings))
## person age movie_title release_year
## Length:30 Min. :28.0 Length:30 Min. :2016
## Class :character 1st Qu.:32.0 Class :character 1st Qu.:2018
## Mode :character Median :38.0 Mode :character Median :2018
## Mean :43.2 Mean :2018
## 3rd Qu.:59.0 3rd Qu.:2019
## Max. :60.0 Max. :2020
##
## genre rating
## Length:30 Min. :1.00
## Class :character 1st Qu.:3.00
## Mode :character Median :4.00
## Mean :3.56
## 3rd Qu.:4.00
## Max. :5.00
## NA's :5
movie_ratings$release_year <- as.Date(as.character(movie_ratings$release_year), format = "%Y")
movie_ratings$release_year <- year(movie_ratings$release_year)
movie_ratings%>%
group_by(genre) %>%
summarise(mean_rtg = mean(rating, na.rm = TRUE,
count_rtg = n())) %>%
arrange(desc(mean_rtg))
## # A tibble: 5 × 2
## genre mean_rtg
## <chr> <dbl>
## 1 Action 4.2
## 2 Horror 3.67
## 3 Comedy 3.5
## 4 Romance 3
## 5 Drama 2.5
ggplot(data=subset(movie_ratings,!is.na(rating)),aes(x=reorder(genre,-rating),y=rating)) +
geom_bar(position = "dodge",
stat = "summary",
fun = "mean") +
ggtitle("Average Movie Rating by Genre") + xlab("Genre") + ylab("Avg Rating")
Based on the survey, it appears movies with an Action genre are rated regardless of age, with Drama performing the worst of the five included in the survey.
movie_ratings%>%
group_by(release_year) %>%
summarise(mean_rtg = mean(rating, na.rm = TRUE),
count_rtg = n()) %>%
arrange(desc(mean_rtg))
## # A tibble: 4 × 3
## release_year mean_rtg count_rtg
## <dbl> <dbl> <int>
## 1 2018 3.78 10
## 2 2020 3.67 5
## 3 2016 3.5 5
## 4 2019 3.33 10
ggplot(data=subset(movie_ratings,!is.na(rating)),aes(x=release_year,y=rating)) +
geom_smooth(position = "dodge",
stat = "summary",
fun = "mean") +
ggtitle("Average Movie Rating by Release Year") + xlab("Release Year") + ylab("Avg Rating") +
ylim(1,5)
## Warning: Width not defined. Set with `position_dodge(width = ?)`
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
From the survey, no visible conclusion can be made based on in terms of recency bias. The range in differences of average rating year-over-year does not exceed 0.44.
#Create a column to assign a group to "age" as follows: 20-30, 30-40, 40-50,50-60
movie_ratings$age_group <- cut(movie_ratings$age,c(20,30,40,50,60))
genre_age <- movie_ratings%>%
group_by(genre,age_group) %>%
summarise(mean_rtg = mean(rating, na.rm = TRUE),
count_rtg = n()) %>%
arrange(genre,age_group,desc(mean_rtg))
## `summarise()` has grouped output by 'genre'. You can override using the
## `.groups` argument.
genre_age
## # A tibble: 15 × 4
## # Groups: genre [5]
## genre age_group mean_rtg count_rtg
## <chr> <fct> <dbl> <int>
## 1 Action (20,30] 5 2
## 2 Action (30,40] 4.25 4
## 3 Action (50,60] 3.75 4
## 4 Comedy (20,30] NaN 1
## 5 Comedy (30,40] 4 2
## 6 Comedy (50,60] 3 2
## 7 Drama (20,30] 1 1
## 8 Drama (30,40] 2.5 2
## 9 Drama (50,60] 4 2
## 10 Horror (20,30] 3 1
## 11 Horror (30,40] 4 2
## 12 Horror (50,60] 4 2
## 13 Romance (20,30] 2 1
## 14 Romance (30,40] 5 2
## 15 Romance (50,60] 2.5 2
ggplot(data=subset(genre_age,!is.na(mean_rtg)),aes(x=age_group,y=mean_rtg)) +
geom_bar(position = "dodge",
stat = "summary",
fun = "mean") +
facet_wrap(~genre)+
ggtitle("Average Movie Rating by Genre, Age") + xlab("Genre") + ylab("Avg Rating")
From the results, some conclusions may be inferred, though it leaves open questions that may require further investigation. Romance movies do not appear popular outside of the 30-40 year old age group. The 50-60 year old age group appear to favor Drama and Horror films, while the 20-30 year old group does not. 20-30 year olds appear to enjoy Action movies the most, with apparently no interest in Comedy. Because only one individual was surveyed in this age group, is this an outlier? The sample size is not large enough to draw a reasonable conclusion.
movie_ratings$gender <- c("F","F","F","F","F","F","M","M","M","M","M","M","M","M","M","M","M","M","F","F","F","F","F","F","M","M","M","M","M","M") #I attempted an ifelse based on Tasha and Shoma to return F, else return M, but it kept returning all F
movie_ratings%>%
group_by(gender, genre) %>%
summarise(mean_rtg = mean(rating, na.rm = TRUE,
count_rtg = n())) %>%
arrange(genre,gender,desc(mean_rtg))
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
## # A tibble: 10 × 3
## # Groups: gender [2]
## gender genre mean_rtg
## <chr> <chr> <dbl>
## 1 F Action 3.5
## 2 M Action 4.67
## 3 F Comedy 2.5
## 4 M Comedy 4.5
## 5 F Drama 2
## 6 M Drama 2.67
## 7 F Horror NaN
## 8 M Horror 3.67
## 9 F Romance 4.5
## 10 M Romance 1.5
ggplot(data=subset(movie_ratings,!is.na(rating)),aes(x=gender,y=rating, color=gender)) +
geom_bar(position = "dodge",
stat = "summary",
fun = "mean") +
facet_wrap(~genre) +
ggtitle("Average Movie Rating by Genre, Gender") + xlab("Gender") + ylab("Avg Rating")
Based on the results, females tend to prefer Romance over males, while males tend to rate Action and Comedy higher than females. The average rating for Drama is relatively equal across male & female genders. It does not appear that the females in this survey cared to watch Horror films, at least not The Invisible Man.
# ### 9. Does replacing null values with the average by genre and gender tell a better story?
#
# #I attempted to replace null values with the average by genre and gender as calculated above to "predict" what the rating would be if the person, in fact, saw the movie. I came across some errors in attempting to do so, so I commented out the code. Any input would be great.
#
# #replace nan with 0
# movie_ratings$rating <- ifelse(is.nan(movie_ratings$rating),0,movie_ratings$rating)
# #average genre rating by gender
# avg_genre_gender <- movie_ratings %>%
# group_by(genre,gender) %>%
# summarise_at(vars(rating), list(avg=mean),na.rm=TRUE)
#
# avg_genre_gender
#
# #copy movie rating table & replace na's with avg by gender/genre
# movie_ratings2 <- movie_ratings
# movie_ratings2$rating <- as.double(movie_ratings2$rating) #convert int to double for matching columns
# #replace nan with 0
# movie_ratings2$rating <- ifelse(is.nan(movie_ratings2$rating)==TRUE,avg_genre_gender$avg[avg_genre_gender$genre %in% movie_ratings2$genre && avg_genre_gender$gender %in% movie_ratings2$gender],movie_ratings2$rating)
# movie_ratings2
#
# movie_ratings2 %>%
# group_by(gender, genre) %>%
# summarise(mean_rtg = mean(rating,
# count_rtg = n())) %>%
# arrange(genre,gender,desc(mean_rtg))
#
# ggplot(data=movie_ratings2,aes(x=gender,y=rating)) +
# geom_bar(position = "dodge",
# stat = "summary",
# fun = "mean") +
# facet_wrap(~genre) +
# ggtitle("Average Movie Rating by Genre, Gender") + xlab("Gender") + ylab("Avg Rating")
Per the survey results, it appears that gender and age in combination with a movie’s genre affect how it’s rated, however, recency bias is not apparent in the results. As seen above, Action movies are the highest rated without factoring in any other criteria. How could you not love Black Panther and Avengers: Endgame, anyway? When taking into account age groups (20-30, 30-40, and 50-60), you can see that ratings for Action and Drama movies are actually reversed, while the 30-40 year old age range favors Romance movies more so than other age groups. Finally, based on gender, the Male population is more into Action & Comedy movie, while Females prefer Romance, while apparently avoiding Horror films. With a small sample of 5 individuals, these conclusions are far from proven, however, it can serve as a basis for a hypothesis to be proven with in-depth research.