Importing Data

urlfile=("https://raw.githubusercontent.com/cassie-boylan/DATA-607/main/movie_ratings_resultset.csv")

movie_ratings <- read.csv(urlfile)

Introduction of the Data

The data reported is from a survey of 10 respondents, 5 male, 5 female, who were asked to rate 6 different movies from the last 2 years. Three of these movies were categorized as dramas, two as comedies, and one as action.
Those surveyed were asked to score these movies on a scale of 1 to 5 where 1 serves as the lowest possible rating, and 5 the highest.

It should be noted there are not enough observations or samples in this survey for any reported analysis to have statistical significance.

Imputing the Data

On reveiwing the structure of the dataframe pulled from the resultset of my sql movie_ratings query, I coerced column data types to the most accurate data type I wanted for my analysis.

Column names have been renamed to follow r naming convention.

str(movie_ratings)
## 'data.frame':    60 obs. of  7 variables:
##  $ Movie_Title          : chr  "Parasite" "Parasite" "Parasite" "Parasite" ...
##  $ Viewer_Rating        : chr  "5" "5" "5" "NULL" ...
##  $ Movie_Genre          : chr  "drama" "drama" "drama" "drama" ...
##  $ Viewer               : chr  "Sam" "Jack" "Bonnie" "Santa" ...
##  $ Gender               : chr  "M" "M" "F" "F" ...
##  $ Age                  : int  27 23 59 65 29 28 31 33 42 63 ...
##  $ Viewer_Favorite_Genre: chr  "action" "drama" "drama" "comedy" ...
sum(movie_ratings$Viewer_Rating=="NULL")
## [1] 14
names(movie_ratings) <- tolower(names(movie_ratings))

To handle missing data in movie_ratings dataframe, I am replacing every entry of ‘NULL’ from the sql resultset as NA.

Where NULL is representative of missing data entry in SQL, NA serves the same meaning in R language.

Assigning NA to entries ‘NULL’ allows me to use the built-in r functionality for missing data

movie_ratings[movie_ratings=="NULL"] <- NA

NA values in the particular case of this dataframe indicate where viewer surveyed did not watch the movie and did not give a rating.

movie_ratings <- transform(movie_ratings, viewer_rating = as.numeric(viewer_rating),
movie_genre = as.factor(movie_genre),
viewer_favorite_genre = as.factor(viewer_favorite_genre),
gender = as.factor(gender)
)
str(movie_ratings)
## 'data.frame':    60 obs. of  7 variables:
##  $ movie_title          : chr  "Parasite" "Parasite" "Parasite" "Parasite" ...
##  $ viewer_rating        : num  5 5 5 NA 5 5 4 4 3 5 ...
##  $ movie_genre          : Factor w/ 3 levels "action","comedy",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ viewer               : chr  "Sam" "Jack" "Bonnie" "Santa" ...
##  $ gender               : Factor w/ 2 levels "F","M": 2 2 1 1 1 1 1 2 2 2 ...
##  $ age                  : int  27 23 59 65 29 28 31 33 42 63 ...
##  $ viewer_favorite_genre: Factor w/ 3 levels "action","comedy",..: 1 3 3 2 2 3 1 1 3 2 ...

I am interested in seeing from the survey data how many movies were not seen and what percentage that makes up of the total possible rating count.
How many records are left once movies with rating NA are omitted?
In 23% of responses, or 14 observations, no rating was given.

nrow(na.omit(movie_ratings))
## [1] 46
round(sum(is.na(movie_ratings))/nrow(movie_ratings),2)
## [1] 0.23

To see that visually:

vis_miss(movie_ratings)

Handling NULL “non-response” values

Those I surveyed indicated that of the movies they had not seen, they did not plan to see in future - indicating a lack of interest or possible distaste for the genre.

People tend to see movies of genres they prefer and favorably rate those movies that speak to their taste. This sample survey likely had non-participation bias to it, where only the movies people felt interest in and perceived that they would like did they see and have a strong opinion after viewing.

I considered dropping the rating entries where NA exist for a particular viewer and reduce my survey responses (seen below)

movie_ratings_clean <- movie_ratings %>%
  # Filter to remove rows with missing rating
  filter(!is.na(movie_ratings$viewer_rating))

ggplot(movie_ratings_clean, aes(movie_genre)) + geom_bar()

However, because there are so few values in this sample and to account for the probable skew removing 23% of observations would create, I decided to create multiple datasets for comparison, imputing missing ratings as 3 [neutral] and 1 [hated] to see how much ratings changed depending on held assumptions that either people who avoided watching a movie would be neutral, or that people who avoided watching a movie did so because they correctly predicted they would dislike it.

#where rating is NA for 'not seen', pass new value as list to function replace_na to impute NA values
movie_ratings_one<- movie_ratings %>% replace_na(list(viewer_rating = 1))

movie_ratings_three <- movie_ratings  %>% replace_na(list(viewer_rating = 3))

Viewership - Popularity

What movie enjoyed the highest views?

If we analyze only on response rate, we can see which movie is the most viewed and which is the least viewed.

Knives Out was seen by all 10 respondents and had the highest number of reviews submitted. Sadly, Little Women and Booksmart, 2 movies featuring female leads, were least popular of viewers sampled with only 40 and 60% respectively reported having seen.

movie_ratings_clean %>%
  count(movie_title, movie_genre) %>%
  arrange(desc(n))
##          movie_title movie_genre  n
## 1         Knives Out      comedy 10
## 2           Parasite       drama  9
## 3                 Us       drama  9
## 4 Avengers: End Game      action  8
## 5          Booksmart      comedy  6
## 6       Little Women       drama  4

Viewership by Gender

If we broke viewership down by gender, how do movie genre and specific movies compare between those who identify as female and male?

We can see that by a slight lift of +2 (but significant in a sample of 10) that more men than women watched action, and more women than men watched drama and comedy.

movie_ratings_clean %>%
  group_by(gender, movie_genre) %>%
  summarize(views = n())
## `summarise()` has grouped output by 'gender'. You can override using the `.groups` argument.
## # A tibble: 6 x 3
## # Groups:   gender [2]
##   gender movie_genre views
##   <fct>  <fct>       <int>
## 1 F      action          3
## 2 F      comedy          9
## 3 F      drama          12
## 4 M      action          5
## 5 M      comedy          7
## 6 M      drama          10

Only 60% of women surveyed had seen the single action movie in this survey “Avengers: End Game”, compared to 100% of men. Viewership of drama was poorest among men surveyed whereas viewership of comedy was highest among women.

movie_ratings %>%
  group_by(gender, movie_genre,) %>%
  summarize(views =percent(sum(!is.na(viewer_rating))/n()))
## `summarise()` has grouped output by 'gender'. You can override using the `.groups` argument.
## # A tibble: 6 x 3
## # Groups:   gender [2]
##   gender movie_genre views
##   <fct>  <fct>       <chr>
## 1 F      action      60%  
## 2 F      comedy      90%  
## 3 F      drama       80%  
## 4 M      action      100% 
## 5 M      comedy      70%  
## 6 M      drama       67%

Of the two movies featuring women as leads, only 40% of men surveyed had seen Booksmart, and no men reported seeing Little Women.

movie_ratings %>%
  group_by(gender, movie_title) %>%
  summarize(views =percent(sum(!is.na(viewer_rating))/n()))  
## `summarise()` has grouped output by 'gender'. You can override using the `.groups` argument.
## # A tibble: 12 x 3
## # Groups:   gender [2]
##    gender movie_title        views
##    <fct>  <chr>              <chr>
##  1 F      Avengers: End Game 60%  
##  2 F      Booksmart          80%  
##  3 F      Knives Out         100% 
##  4 F      Little Women       80%  
##  5 F      Parasite           80%  
##  6 F      Us                 80%  
##  7 M      Avengers: End Game 100% 
##  8 M      Booksmart          40%  
##  9 M      Knives Out         100% 
## 10 M      Little Women       0%   
## 11 M      Parasite           100% 
## 12 M      Us                 100%

Ratings

What movie enjoyed the highest ratings?

movie_ratings_clean %>%
  group_by(movie_title, movie_genre) %>%
  summarize(
    avg_rating = round(mean(viewer_rating),2)) %>%
  arrange(desc(avg_rating))
## # A tibble: 6 x 3
## # Groups:   movie_title [6]
##   movie_title        movie_genre avg_rating
##   <chr>              <fct>            <dbl>
## 1 Avengers: End Game action            4.75
## 2 Parasite           drama             4.56
## 3 Booksmart          comedy            4.5 
## 4 Knives Out         comedy            4.4 
## 5 Little Women       drama             4.25
## 6 Us                 drama             3.44

What was reported the favorite movie broken down by gender?

movie_ratings_clean %>%
  group_by(gender,movie_title) %>%
  summarize(
    avg_rating = round(mean(viewer_rating),2)) %>%
  top_n(1,avg_rating)
## `summarise()` has grouped output by 'gender'. You can override using the `.groups` argument.
## # A tibble: 3 x 3
## # Groups:   gender [2]
##   gender movie_title        avg_rating
##   <fct>  <chr>                   <dbl>
## 1 F      Booksmart                4.75
## 2 F      Parasite                 4.75
## 3 M      Avengers: End Game       4.8

I also experimented with seeing how ratings are effected if we assume that movies not seen would be given the least favorable rating or a neutral rating by those who avoided or chose not to see them.

Imputing “not seen” responses as ratings of 1

by_one <- movie_ratings_one %>% 
  group_by(movie_title) %>%
  summarise(
  avg_rating = round(mean(viewer_rating),2),
  median_rating = median(viewer_rating)
)
by_one
## # A tibble: 6 x 3
##   movie_title        avg_rating median_rating
##   <chr>                   <dbl>         <dbl>
## 1 Avengers: End Game        4             5  
## 2 Booksmart                 3.1           3.5
## 3 Knives Out                4.4           5  
## 4 Little Women              2.3           1  
## 5 Parasite                  4.2           5  
## 6 Us                        3.2           3.5

Imputing “not seen” responses as ratings of 3

by_three <- movie_ratings_three %>% 
  group_by(movie_title) %>%
  summarise(
  avg_rating = round(mean(viewer_rating),2),
  median_rating = median(viewer_rating)
)
by_three
## # A tibble: 6 x 3
##   movie_title        avg_rating median_rating
##   <chr>                   <dbl>         <dbl>
## 1 Avengers: End Game        4.4           5  
## 2 Booksmart                 3.9           3.5
## 3 Knives Out                4.4           5  
## 4 Little Women              3.5           3  
## 5 Parasite                  4.4           5  
## 6 Us                        3.4           3.5

Which genre enjoyed the highest ratings?

Continuing this experiment, out of the sample surveyed, which genre of movies was rated most favorably overall?

movie_ratings_clean %>%
  group_by(movie_genre) %>%
  summarise(
  avg_rating = round(mean(viewer_rating, na.rm=TRUE),3),
  median_rating = median(viewer_rating, na.rm=TRUE)
  )
## # A tibble: 3 x 3
##   movie_genre avg_rating median_rating
##   <fct>            <dbl>         <dbl>
## 1 action            4.75             5
## 2 comedy            4.44             5
## 3 drama             4.04             4

Imputing “not seen” responses as ratings of 1

movie_ratings_one %>%
  group_by(movie_genre) %>%
  summarise(
  avg_rating = round(mean(viewer_rating, na.rm=TRUE),3),
  median_rating = median(viewer_rating, na.rm=TRUE)
  )
## # A tibble: 3 x 3
##   movie_genre avg_rating median_rating
##   <fct>            <dbl>         <dbl>
## 1 action            4              5  
## 2 comedy            3.75           4.5
## 3 drama             3.23           4

Imputing “not seen” responses as ratings of 3

movie_ratings_three %>%
  group_by(movie_genre) %>%
  summarise(
  avg_rating = round(mean(viewer_rating, na.rm=TRUE),3),
  median_rating = median(viewer_rating, na.rm=TRUE)
  )
## # A tibble: 3 x 3
##   movie_genre avg_rating median_rating
##   <fct>            <dbl>         <dbl>
## 1 action            4.4            5  
## 2 comedy            4.15           4.5
## 3 drama             3.77           4

Rating by Gender

If we broke ratings down by gender, how do movie genre and specific movies compare between those who identify as female and male?

ratings_gender_clean <- movie_ratings_clean %>%
  group_by(movie_genre, gender) %>%
  summarise(
  avg_rating = round(mean(viewer_rating),3),
  median_rating = median(viewer_rating),
  .groups = 'drop'
  )
ratings_gender_clean
## # A tibble: 6 x 4
##   movie_genre gender avg_rating median_rating
##   <fct>       <fct>       <dbl>         <dbl>
## 1 action      F            4.67           5  
## 2 action      M            4.8            5  
## 3 comedy      F            4.56           5  
## 4 comedy      M            4.29           5  
## 5 drama       F            4.25           4.5
## 6 drama       M            3.8            4

Imputing “not seen” responses as ratings of 3

ratings_gender_three <- movie_ratings_three %>%
  group_by(movie_genre, gender) %>%
  summarise(
  avg_rating = round(mean(viewer_rating),3),
  median_rating = median(viewer_rating),
  .groups = 'drop'
  )
ratings_gender_three
## # A tibble: 6 x 4
##   movie_genre gender avg_rating median_rating
##   <fct>       <fct>       <dbl>         <dbl>
## 1 action      F            4              4  
## 2 action      M            4.8            5  
## 3 comedy      F            4.4            5  
## 4 comedy      M            3.9            3.5
## 5 drama       F            4              4  
## 6 drama       M            3.53           3

The rating of drama genre falls precipitously among men when operating under the assumption that movies not viewed by men can be assumed to have ratings 3 or below.

Imputing “not seen” responses as ratings of 1

ratings_gender_one<- movie_ratings_one %>%
  group_by(movie_genre, gender) %>%
  summarise(
  avg_rating = round(mean(viewer_rating),3),
  median_rating = median(viewer_rating),
  .groups = 'drop'
  )
ratings_gender_one
## # A tibble: 6 x 4
##   movie_genre gender avg_rating median_rating
##   <fct>       <fct>       <dbl>         <dbl>
## 1 action      F            3.2            4  
## 2 action      M            4.8            5  
## 3 comedy      F            4.2            5  
## 4 comedy      M            3.3            3.5
## 5 drama       F            3.6            4  
## 6 drama       M            2.87           3

Plotting this, we can see that men rated movies categorized as drama as their lowest rated genre and action as their highest. Women rated action as lowest rated genre, and also action as their highest rated.

ratings_gender_clean %>%
  ggplot(aes(x=movie_genre, y=avg_rating, color=gender)) +
  geom_point()

ratings_gender_one %>%
  ggplot(aes(x=movie_genre, y=avg_rating, color=gender)) +
  geom_point()

ratings_gender_three %>%
  ggplot(aes(x=movie_genre, y=avg_rating, color=gender)) +
  geom_point()

Rating By Favored Genre

One last thing to look at, how did the assumption that people more favorably rate movies that speak to their taste bear out.

Specifically, how did those who reported their favorite genre actually rate movies in that genre they were asked to review?

From the data, we can see that those who reported their favorite genre to be comedy rated movies that were dramas least favorably and rated action and comedy movies roughly the same. Interesting enough, of those who reported dramas to be their favorite genre still rated movies they had seen under comedy and action as higher rated than the movies under drama. This group did rate drama movies of the survey most favorably of the 3 groups however.

movie_ratings %>%
  select(viewer,viewer_favorite_genre) %>%
  distinct(viewer,viewer_favorite_genre) %>%
  count(viewer_favorite_genre, name="favored by")
##   viewer_favorite_genre favored by
## 1                action          3
## 2                comedy          3
## 3                 drama          4
viewer_fav <- movie_ratings_clean %>%
  group_by(viewer_favorite_genre, movie_genre) %>%
  summarize(avg_rating = round(mean(viewer_rating),2))
## `summarise()` has grouped output by 'viewer_favorite_genre'. You can override using the `.groups` argument.
viewer_fav %>%
  ggplot(aes(x=movie_genre,y=avg_rating,color=viewer_favorite_genre)) +
  geom_point()