Importing Data
urlfile=("https://raw.githubusercontent.com/cassie-boylan/DATA-607/main/movie_ratings_resultset.csv")
movie_ratings <- read.csv(urlfile)Introduction of the Data
The data reported is from a survey of 10 respondents, 5 male, 5 female, who were asked to rate 6 different movies from the last 2 years. Three of these movies were categorized as dramas, two as comedies, and one as action.
Those surveyed were asked to score these movies on a scale of 1 to 5 where 1 serves as the lowest possible rating, and 5 the highest.
It should be noted there are not enough observations or samples in this survey for any reported analysis to have statistical significance.
Imputing the Data
On reveiwing the structure of the dataframe pulled from the resultset of my sql movie_ratings query, I coerced column data types to the most accurate data type I wanted for my analysis.
Column names have been renamed to follow r naming convention.
str(movie_ratings)## 'data.frame': 60 obs. of 7 variables:
## $ Movie_Title : chr "Parasite" "Parasite" "Parasite" "Parasite" ...
## $ Viewer_Rating : chr "5" "5" "5" "NULL" ...
## $ Movie_Genre : chr "drama" "drama" "drama" "drama" ...
## $ Viewer : chr "Sam" "Jack" "Bonnie" "Santa" ...
## $ Gender : chr "M" "M" "F" "F" ...
## $ Age : int 27 23 59 65 29 28 31 33 42 63 ...
## $ Viewer_Favorite_Genre: chr "action" "drama" "drama" "comedy" ...
sum(movie_ratings$Viewer_Rating=="NULL")## [1] 14
names(movie_ratings) <- tolower(names(movie_ratings))To handle missing data in movie_ratings dataframe, I am replacing every entry of ‘NULL’ from the sql resultset as NA.
Where NULL is representative of missing data entry in SQL, NA serves the same meaning in R language.
Assigning NA to entries ‘NULL’ allows me to use the built-in r functionality for missing data
movie_ratings[movie_ratings=="NULL"] <- NANA values in the particular case of this dataframe indicate where viewer surveyed did not watch the movie and did not give a rating.
movie_ratings <- transform(movie_ratings, viewer_rating = as.numeric(viewer_rating),
movie_genre = as.factor(movie_genre),
viewer_favorite_genre = as.factor(viewer_favorite_genre),
gender = as.factor(gender)
)
str(movie_ratings)## 'data.frame': 60 obs. of 7 variables:
## $ movie_title : chr "Parasite" "Parasite" "Parasite" "Parasite" ...
## $ viewer_rating : num 5 5 5 NA 5 5 4 4 3 5 ...
## $ movie_genre : Factor w/ 3 levels "action","comedy",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ viewer : chr "Sam" "Jack" "Bonnie" "Santa" ...
## $ gender : Factor w/ 2 levels "F","M": 2 2 1 1 1 1 1 2 2 2 ...
## $ age : int 27 23 59 65 29 28 31 33 42 63 ...
## $ viewer_favorite_genre: Factor w/ 3 levels "action","comedy",..: 1 3 3 2 2 3 1 1 3 2 ...
I am interested in seeing from the survey data how many movies were not seen and what percentage that makes up of the total possible rating count.
How many records are left once movies with rating NA are omitted?
In 23% of responses, or 14 observations, no rating was given.
nrow(na.omit(movie_ratings))## [1] 46
round(sum(is.na(movie_ratings))/nrow(movie_ratings),2)## [1] 0.23
To see that visually:
vis_miss(movie_ratings)Handling NULL “non-response” values
Those I surveyed indicated that of the movies they had not seen, they did not plan to see in future - indicating a lack of interest or possible distaste for the genre.
People tend to see movies of genres they prefer and favorably rate those movies that speak to their taste. This sample survey likely had non-participation bias to it, where only the movies people felt interest in and perceived that they would like did they see and have a strong opinion after viewing.
I considered dropping the rating entries where NA exist for a particular viewer and reduce my survey responses (seen below)
movie_ratings_clean <- movie_ratings %>%
# Filter to remove rows with missing rating
filter(!is.na(movie_ratings$viewer_rating))
ggplot(movie_ratings_clean, aes(movie_genre)) + geom_bar()However, because there are so few values in this sample and to account for the probable skew removing 23% of observations would create, I decided to create multiple datasets for comparison, imputing missing ratings as 3 [neutral] and 1 [hated] to see how much ratings changed depending on held assumptions that either people who avoided watching a movie would be neutral, or that people who avoided watching a movie did so because they correctly predicted they would dislike it.
#where rating is NA for 'not seen', pass new value as list to function replace_na to impute NA values
movie_ratings_one<- movie_ratings %>% replace_na(list(viewer_rating = 1))
movie_ratings_three <- movie_ratings %>% replace_na(list(viewer_rating = 3))Viewership - Popularity
What movie enjoyed the highest views?
If we analyze only on response rate, we can see which movie is the most viewed and which is the least viewed.
Knives Out was seen by all 10 respondents and had the highest number of reviews submitted. Sadly, Little Women and Booksmart, 2 movies featuring female leads, were least popular of viewers sampled with only 40 and 60% respectively reported having seen.
movie_ratings_clean %>%
count(movie_title, movie_genre) %>%
arrange(desc(n))## movie_title movie_genre n
## 1 Knives Out comedy 10
## 2 Parasite drama 9
## 3 Us drama 9
## 4 Avengers: End Game action 8
## 5 Booksmart comedy 6
## 6 Little Women drama 4
Viewership by Gender
If we broke viewership down by gender, how do movie genre and specific movies compare between those who identify as female and male?
We can see that by a slight lift of +2 (but significant in a sample of 10) that more men than women watched action, and more women than men watched drama and comedy.
movie_ratings_clean %>%
group_by(gender, movie_genre) %>%
summarize(views = n())## `summarise()` has grouped output by 'gender'. You can override using the `.groups` argument.
## # A tibble: 6 x 3
## # Groups: gender [2]
## gender movie_genre views
## <fct> <fct> <int>
## 1 F action 3
## 2 F comedy 9
## 3 F drama 12
## 4 M action 5
## 5 M comedy 7
## 6 M drama 10
Only 60% of women surveyed had seen the single action movie in this survey “Avengers: End Game”, compared to 100% of men. Viewership of drama was poorest among men surveyed whereas viewership of comedy was highest among women.
movie_ratings %>%
group_by(gender, movie_genre,) %>%
summarize(views =percent(sum(!is.na(viewer_rating))/n()))## `summarise()` has grouped output by 'gender'. You can override using the `.groups` argument.
## # A tibble: 6 x 3
## # Groups: gender [2]
## gender movie_genre views
## <fct> <fct> <chr>
## 1 F action 60%
## 2 F comedy 90%
## 3 F drama 80%
## 4 M action 100%
## 5 M comedy 70%
## 6 M drama 67%
Of the two movies featuring women as leads, only 40% of men surveyed had seen Booksmart, and no men reported seeing Little Women.
movie_ratings %>%
group_by(gender, movie_title) %>%
summarize(views =percent(sum(!is.na(viewer_rating))/n())) ## `summarise()` has grouped output by 'gender'. You can override using the `.groups` argument.
## # A tibble: 12 x 3
## # Groups: gender [2]
## gender movie_title views
## <fct> <chr> <chr>
## 1 F Avengers: End Game 60%
## 2 F Booksmart 80%
## 3 F Knives Out 100%
## 4 F Little Women 80%
## 5 F Parasite 80%
## 6 F Us 80%
## 7 M Avengers: End Game 100%
## 8 M Booksmart 40%
## 9 M Knives Out 100%
## 10 M Little Women 0%
## 11 M Parasite 100%
## 12 M Us 100%
Ratings
What movie enjoyed the highest ratings?
movie_ratings_clean %>%
group_by(movie_title, movie_genre) %>%
summarize(
avg_rating = round(mean(viewer_rating),2)) %>%
arrange(desc(avg_rating))## # A tibble: 6 x 3
## # Groups: movie_title [6]
## movie_title movie_genre avg_rating
## <chr> <fct> <dbl>
## 1 Avengers: End Game action 4.75
## 2 Parasite drama 4.56
## 3 Booksmart comedy 4.5
## 4 Knives Out comedy 4.4
## 5 Little Women drama 4.25
## 6 Us drama 3.44
What was reported the favorite movie broken down by gender?
movie_ratings_clean %>%
group_by(gender,movie_title) %>%
summarize(
avg_rating = round(mean(viewer_rating),2)) %>%
top_n(1,avg_rating)## `summarise()` has grouped output by 'gender'. You can override using the `.groups` argument.
## # A tibble: 3 x 3
## # Groups: gender [2]
## gender movie_title avg_rating
## <fct> <chr> <dbl>
## 1 F Booksmart 4.75
## 2 F Parasite 4.75
## 3 M Avengers: End Game 4.8
I also experimented with seeing how ratings are effected if we assume that movies not seen would be given the least favorable rating or a neutral rating by those who avoided or chose not to see them.
Imputing “not seen” responses as ratings of 1
by_one <- movie_ratings_one %>%
group_by(movie_title) %>%
summarise(
avg_rating = round(mean(viewer_rating),2),
median_rating = median(viewer_rating)
)
by_one## # A tibble: 6 x 3
## movie_title avg_rating median_rating
## <chr> <dbl> <dbl>
## 1 Avengers: End Game 4 5
## 2 Booksmart 3.1 3.5
## 3 Knives Out 4.4 5
## 4 Little Women 2.3 1
## 5 Parasite 4.2 5
## 6 Us 3.2 3.5
Imputing “not seen” responses as ratings of 3
by_three <- movie_ratings_three %>%
group_by(movie_title) %>%
summarise(
avg_rating = round(mean(viewer_rating),2),
median_rating = median(viewer_rating)
)
by_three## # A tibble: 6 x 3
## movie_title avg_rating median_rating
## <chr> <dbl> <dbl>
## 1 Avengers: End Game 4.4 5
## 2 Booksmart 3.9 3.5
## 3 Knives Out 4.4 5
## 4 Little Women 3.5 3
## 5 Parasite 4.4 5
## 6 Us 3.4 3.5
Which genre enjoyed the highest ratings?
Continuing this experiment, out of the sample surveyed, which genre of movies was rated most favorably overall?
movie_ratings_clean %>%
group_by(movie_genre) %>%
summarise(
avg_rating = round(mean(viewer_rating, na.rm=TRUE),3),
median_rating = median(viewer_rating, na.rm=TRUE)
)## # A tibble: 3 x 3
## movie_genre avg_rating median_rating
## <fct> <dbl> <dbl>
## 1 action 4.75 5
## 2 comedy 4.44 5
## 3 drama 4.04 4
Imputing “not seen” responses as ratings of 1
movie_ratings_one %>%
group_by(movie_genre) %>%
summarise(
avg_rating = round(mean(viewer_rating, na.rm=TRUE),3),
median_rating = median(viewer_rating, na.rm=TRUE)
)## # A tibble: 3 x 3
## movie_genre avg_rating median_rating
## <fct> <dbl> <dbl>
## 1 action 4 5
## 2 comedy 3.75 4.5
## 3 drama 3.23 4
Imputing “not seen” responses as ratings of 3
movie_ratings_three %>%
group_by(movie_genre) %>%
summarise(
avg_rating = round(mean(viewer_rating, na.rm=TRUE),3),
median_rating = median(viewer_rating, na.rm=TRUE)
)## # A tibble: 3 x 3
## movie_genre avg_rating median_rating
## <fct> <dbl> <dbl>
## 1 action 4.4 5
## 2 comedy 4.15 4.5
## 3 drama 3.77 4
Rating by Gender
If we broke ratings down by gender, how do movie genre and specific movies compare between those who identify as female and male?
ratings_gender_clean <- movie_ratings_clean %>%
group_by(movie_genre, gender) %>%
summarise(
avg_rating = round(mean(viewer_rating),3),
median_rating = median(viewer_rating),
.groups = 'drop'
)
ratings_gender_clean## # A tibble: 6 x 4
## movie_genre gender avg_rating median_rating
## <fct> <fct> <dbl> <dbl>
## 1 action F 4.67 5
## 2 action M 4.8 5
## 3 comedy F 4.56 5
## 4 comedy M 4.29 5
## 5 drama F 4.25 4.5
## 6 drama M 3.8 4
Imputing “not seen” responses as ratings of 3
ratings_gender_three <- movie_ratings_three %>%
group_by(movie_genre, gender) %>%
summarise(
avg_rating = round(mean(viewer_rating),3),
median_rating = median(viewer_rating),
.groups = 'drop'
)
ratings_gender_three## # A tibble: 6 x 4
## movie_genre gender avg_rating median_rating
## <fct> <fct> <dbl> <dbl>
## 1 action F 4 4
## 2 action M 4.8 5
## 3 comedy F 4.4 5
## 4 comedy M 3.9 3.5
## 5 drama F 4 4
## 6 drama M 3.53 3
The rating of drama genre falls precipitously among men when operating under the assumption that movies not viewed by men can be assumed to have ratings 3 or below.
Imputing “not seen” responses as ratings of 1
ratings_gender_one<- movie_ratings_one %>%
group_by(movie_genre, gender) %>%
summarise(
avg_rating = round(mean(viewer_rating),3),
median_rating = median(viewer_rating),
.groups = 'drop'
)
ratings_gender_one## # A tibble: 6 x 4
## movie_genre gender avg_rating median_rating
## <fct> <fct> <dbl> <dbl>
## 1 action F 3.2 4
## 2 action M 4.8 5
## 3 comedy F 4.2 5
## 4 comedy M 3.3 3.5
## 5 drama F 3.6 4
## 6 drama M 2.87 3
Plotting this, we can see that men rated movies categorized as drama as their lowest rated genre and action as their highest. Women rated action as lowest rated genre, and also action as their highest rated.
ratings_gender_clean %>%
ggplot(aes(x=movie_genre, y=avg_rating, color=gender)) +
geom_point()ratings_gender_one %>%
ggplot(aes(x=movie_genre, y=avg_rating, color=gender)) +
geom_point()ratings_gender_three %>%
ggplot(aes(x=movie_genre, y=avg_rating, color=gender)) +
geom_point()Rating By Favored Genre
One last thing to look at, how did the assumption that people more favorably rate movies that speak to their taste bear out.
Specifically, how did those who reported their favorite genre actually rate movies in that genre they were asked to review?
From the data, we can see that those who reported their favorite genre to be comedy rated movies that were dramas least favorably and rated action and comedy movies roughly the same. Interesting enough, of those who reported dramas to be their favorite genre still rated movies they had seen under comedy and action as higher rated than the movies under drama. This group did rate drama movies of the survey most favorably of the 3 groups however.
movie_ratings %>%
select(viewer,viewer_favorite_genre) %>%
distinct(viewer,viewer_favorite_genre) %>%
count(viewer_favorite_genre, name="favored by")## viewer_favorite_genre favored by
## 1 action 3
## 2 comedy 3
## 3 drama 4
viewer_fav <- movie_ratings_clean %>%
group_by(viewer_favorite_genre, movie_genre) %>%
summarize(avg_rating = round(mean(viewer_rating),2))## `summarise()` has grouped output by 'viewer_favorite_genre'. You can override using the `.groups` argument.
viewer_fav %>%
ggplot(aes(x=movie_genre,y=avg_rating,color=viewer_favorite_genre)) +
geom_point()