Movie Survey

This assignment is a survey of 5 individuals. Each individual was asked to rate the following movies on a scale of 1 to 5:

* Joker
* It 2
* Parasite
* Ready or Not
* Avengers Endgame
* Star Wars: Return of the Jedi

The survey results were stored in a PostgreSQL table, created through pgAdmin 4. You can see the SQL code here.

The table was then exported as a CSV file, which was also uploaded to GitHub. We’ll grab the raw results from there:

raw <- read.csv(url('https://raw.githubusercontent.com/dataconsumer101/data607/master/movies.csv'))
str(raw)
## 'data.frame':    5 obs. of  7 variables:
##  $ respondent      : Factor w/ 5 levels "Doug","Frank",..: 1 3 2 4 5
##  $ joker           : int  5 2 4 NA 5
##  $ it_2            : int  4 NA NA 3 4
##  $ parasite        : int  4 5 3 4 NA
##  $ ready_or_not    : int  NA 5 NA 5 4
##  $ avengers_endgame: int  5 4 5 4 5
##  $ star_wars_9     : int  NA 5 4 3 3
head(raw)
##   respondent joker it_2 parasite ready_or_not avengers_endgame star_wars_9
## 1       Doug     5    4        4           NA                5          NA
## 2        Jen     2   NA        5            5                4           5
## 3      Frank     4   NA        3           NA                5           4
## 4       Jess    NA    3        4            5                4           3
## 5      Steve     5    4       NA            4                5           3

There are NA values in the dataset because not every respondent is expected to have seen every movie.
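To see where the gaps are, the missing values can be tallied per movie and per respondent. The following is a minimal sketch, assuming the table is rebuilt by hand from the printed output above instead of downloaded; the names `raw_local`, `na_per_movie`, and `na_per_person` are hypothetical and not part of the original analysis.

```r
# Ratings copied from the printed survey output (NA = movie not seen).
# raw_local is a hypothetical stand-in for the downloaded data frame.
raw_local <- data.frame(
  respondent       = c('Doug', 'Jen', 'Frank', 'Jess', 'Steve'),
  joker            = c(5, 2, 4, NA, 5),
  it_2             = c(4, NA, NA, 3, 4),
  parasite         = c(4, 5, 3, 4, NA),
  ready_or_not     = c(NA, 5, NA, 5, 4),
  avengers_endgame = c(5, 4, 5, 4, 5),
  star_wars_9      = c(NA, 5, 4, 3, 3)
)

# Missing ratings per movie (drop the respondent column first).
na_per_movie <- colSums(is.na(raw_local[-1]))
print(na_per_movie)

# Movies skipped by each respondent, labelled by name.
na_per_person <- setNames(rowSums(is.na(raw_local[-1])), raw_local$respondent)
print(na_per_person)
```

In this data, 7 of the 30 possible ratings are missing, and only Avengers Endgame was rated by everyone.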

Let’s rearrange the data with tidyr so it’s easier to analyze:

library(tidyr)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
df <- gather(raw, movie, rating, -respondent)
head(df)
##   respondent movie rating
## 1       Doug joker      5
## 2        Jen joker      2
## 3      Frank joker      4
## 4       Jess joker     NA
## 5      Steve joker      5
## 6       Doug  it_2      4
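In tidyr 1.0.0 and later, `gather()` is superseded by `pivot_longer()`. The sketch below shows what the same reshape might look like with the newer function. It assumes tidyr 1.0.0 or later is installed, and it uses a hand-typed two-respondent slice of the survey (the names `raw_slice` and `long` are hypothetical) so that it runs without the download.

```r
library(tidyr)

# Two respondents copied from the printed output; a slice, not the full survey.
raw_slice <- data.frame(
  respondent       = c('Doug', 'Jen'),
  joker            = c(5, 2),
  it_2             = c(4, NA),
  parasite         = c(4, 5),
  ready_or_not     = c(NA, 5),
  avengers_endgame = c(5, 4),
  star_wars_9      = c(NA, 5)
)

# One row per respondent/movie pair, like gather(raw, movie, rating, -respondent).
long <- pivot_longer(
  raw_slice,
  cols      = -respondent,
  names_to  = 'movie',
  values_to = 'rating'
)
print(long)
```

One difference to be aware of: `pivot_longer()` returns the rows grouped by respondent, while `gather()` returns them grouped by movie. The grouped summaries that follow do not depend on row order.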

Let’s look at the average rating for each movie:

(g <- group_by(df, movie) %>%
  summarize(avg_rating = mean(rating, na.rm = TRUE)) %>%
  arrange(desc(avg_rating)))
## # A tibble: 6 x 2
##   movie            avg_rating
##   <chr>                 <dbl>
## 1 ready_or_not           4.67
## 2 avengers_endgame       4.6 
## 3 joker                  4   
## 4 parasite               4   
## 5 star_wars_9            3.75
## 6 it_2                   3.67
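The `na.rm` argument matters here because a single NA otherwise makes the whole mean NA. A small illustration using the Joker ratings from the table above (the variable names are hypothetical):

```r
# Joker ratings in respondent order; Jess did not see the movie.
joker <- c(5, 2, 4, NA, 5)

# Without na.rm, the missing value propagates to the result.
mean_with_na <- mean(joker)
print(mean_with_na)

# With na.rm = TRUE, the mean is taken over the 4 people who rated it.
mean_without_na <- mean(joker, na.rm = TRUE)
print(mean_without_na)
```

Each average is therefore computed only over the respondents who actually saw that movie, so the averages rest on different numbers of ratings.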

I’m also interested in which movies were most popular, so let’s count how many respondents saw each one:

# Flag each rating as seen (1) or not seen (0), then total the flags per movie
df$counter <- ifelse(is.na(df$rating), 0, 1)
(c <- group_by(df, movie) %>%
  summarize(views = sum(counter)) %>%
  arrange(desc(views)))
## # A tibble: 6 x 2
##   movie            views
##   <chr>            <dbl>
## 1 avengers_endgame     5
## 2 joker                4
## 3 parasite             4
## 4 star_wars_9          4
## 5 it_2                 3
## 6 ready_or_not         3
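The view count and the average can also be produced in one pass, without a helper column, by summing `!is.na(rating)`. This is a sketch of an alternative and not the approach used above. It assumes dplyr is installed and rebuilds the long-format ratings by hand from the printed survey output (`ratings` and `summary_tbl` are hypothetical names).

```r
library(dplyr)

# Long-format ratings typed in from the survey output, five respondents per movie.
ratings <- data.frame(
  movie = rep(c('joker', 'it_2', 'parasite', 'ready_or_not',
                'avengers_endgame', 'star_wars_9'), each = 5),
  rating = c(5, 2, 4, NA, 5,
             4, NA, NA, 3, 4,
             4, 5, 3, 4, NA,
             NA, 5, NA, 5, 4,
             5, 4, 5, 4, 5,
             NA, 5, 4, 3, 3)
)

# Count non-missing ratings and average them in a single summarize() call.
summary_tbl <- ratings %>%
  group_by(movie) %>%
  summarize(
    views      = sum(!is.na(rating)),
    avg_rating = mean(rating, na.rm = TRUE)
  ) %>%
  arrange(desc(views), desc(avg_rating))
print(summary_tbl)
```

Putting both columns side by side makes it easier to see that the top-rated movie, ready_or_not, is also one of the two least-watched.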

Finally, let’s visualize the results so that they’re a bit easier to digest.

ggplot(c, aes(x = movie, y = views)) +
  geom_col() +
  ggtitle('Movies by Popularity') +
  theme_classic()  

ggplot(g, aes(x = movie, y = avg_rating)) +
  geom_col() +
  ggtitle('Average Rating of Movie Survey') +
  theme_classic()
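By default ggplot2 orders a character x axis alphabetically, so the bars do not appear in ranked order. One possible refinement is to turn `movie` into a factor ordered by the plotted value with `reorder()`. The sketch below assumes ggplot2 is installed and types in the view counts from the table above (`views_tbl` and `p` are hypothetical names).

```r
library(ggplot2)

# View counts copied from the popularity table.
views_tbl <- data.frame(
  movie = c('avengers_endgame', 'joker', 'parasite',
            'star_wars_9', 'it_2', 'ready_or_not'),
  views = c(5, 4, 4, 4, 3, 3)
)

# Order the factor levels by descending views so the tallest bar comes first.
views_tbl$movie <- reorder(views_tbl$movie, -views_tbl$views)

p <- ggplot(views_tbl, aes(x = movie, y = views)) +
  geom_col() +
  ggtitle('Movies by Popularity') +
  theme_classic()
```

Calling `print(p)` draws the chart. The same pattern would apply to the average-rating plot by reordering on `-avg_rating`.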

With only 5 respondents, and as few as 3 ratings for some movies, this survey is unlikely to be a reliable representation of the wider population.
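One rough way to see how little five ratings pin down is a confidence interval for a single movie's mean. The sketch below uses the Avengers Endgame ratings from the table and base R's `t.test()`. Treating 1 to 5 ratings as approximately normal numeric data is an assumption made only for illustration, and the variable names are hypothetical.

```r
# Avengers Endgame ratings, the only movie rated by all 5 respondents.
avengers <- c(5, 4, 5, 4, 5)

# 95% t-based confidence interval for the mean rating.
ci <- t.test(avengers)$conf.int
print(round(ci, 2))
```

The interval runs from about 3.92 to 5.28, so it extends past the top of the rating scale. Movies with only 3 ratings would be estimated even less precisely.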