This assignment is a survey of 5 individuals. Each individual was asked to rate the following movies on a scale of 1 to 5:
* Joker
* It 2
* Parasite
* Ready or Not
* Avengers Endgame
* Star Wars: Return of the Jedi
The survey results were stored in a SQL table in PostgreSQL, managed through pgAdmin 4. You can see the SQL code here.
The table was then exported as a CSV file and uploaded to GitHub. We’ll grab the raw results from there:
raw <- read.csv(url('https://raw.githubusercontent.com/dataconsumer101/data607/master/movies.csv'))
str(raw)
## 'data.frame': 5 obs. of 7 variables:
## $ respondent : Factor w/ 5 levels "Doug","Frank",..: 1 3 2 4 5
## $ joker : int 5 2 4 NA 5
## $ it_2 : int 4 NA NA 3 4
## $ parasite : int 4 5 3 4 NA
## $ ready_or_not : int NA 5 NA 5 4
## $ avengers_endgame: int 5 4 5 4 5
## $ star_wars_9 : int NA 5 4 3 3
head(raw)
##   respondent joker it_2 parasite ready_or_not avengers_endgame star_wars_9
## 1       Doug     5    4        4           NA                5          NA
## 2        Jen     2   NA        5            5                4           5
## 3      Frank     4   NA        3           NA                5           4
## 4       Jess    NA    3        4            5                4           3
## 5      Steve     5    4       NA            4                5           3
There are NA values in the dataset because not every respondent has seen every movie; an NA means that person did not rate that movie.
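To see how the missing ratings are spread across movies, a quick base R check can help. The sketch below rebuilds the survey table by hand from the printed output above, so it runs without the GitHub URL, and counts the NA values in each column. The names `raw_local` and `missing_per_movie` are illustrative and not part of the original analysis.

```r
# Survey table rebuilt by hand from the printed output (a stand-in for `raw`)
raw_local <- data.frame(
  respondent       = c("Doug", "Jen", "Frank", "Jess", "Steve"),
  joker            = c(5, 2, 4, NA, 5),
  it_2             = c(4, NA, NA, 3, 4),
  parasite         = c(4, 5, 3, 4, NA),
  ready_or_not     = c(NA, 5, NA, 5, 4),
  avengers_endgame = c(5, 4, 5, 4, 5),
  star_wars_9      = c(NA, 5, 4, 3, 3)
)

# Count missing ratings per movie (drop the respondent column first)
missing_per_movie <- colSums(is.na(raw_local[-1]))
print(missing_per_movie)
```

Every respondent rated avengers_endgame, so it is the only column with no missing values.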
Let’s rearrange the data using tidyr so it’s easier to analyze:
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df <- gather(raw, movie, rating, -respondent)
head(df)
##   respondent movie rating
## 1       Doug joker      5
## 2        Jen joker      2
## 3      Frank joker      4
## 4       Jess joker     NA
## 5      Steve joker      5
## 6       Doug  it_2      4
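In recent tidyr releases, `gather()` is marked as superseded in favor of `pivot_longer()`. As a sketch of the same wide-to-long reshaping with the newer function, assuming tidyr 1.0.0 or later is installed, the snippet below rebuilds the survey table by hand (`raw_local` and `df_long` are illustrative names) and reshapes it. The contents match the `gather()` result, but the row order differs: `pivot_longer()` keeps each respondent's rows together instead of grouping by movie.

```r
library(tidyr)

# Survey table rebuilt by hand from the printed output (a stand-in for `raw`)
raw_local <- data.frame(
  respondent       = c("Doug", "Jen", "Frank", "Jess", "Steve"),
  joker            = c(5, 2, 4, NA, 5),
  it_2             = c(4, NA, NA, 3, 4),
  parasite         = c(4, 5, 3, 4, NA),
  ready_or_not     = c(NA, 5, NA, 5, 4),
  avengers_endgame = c(5, 4, 5, 4, 5),
  star_wars_9      = c(NA, 5, 4, 3, 3)
)

# One row per respondent/movie pair; every column except respondent is a movie
df_long <- pivot_longer(
  raw_local,
  cols      = -respondent,
  names_to  = "movie",
  values_to = "rating"
)
head(df_long)
```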
Let’s look at the average rating for each movie, ignoring missing ratings:
(g <- group_by(df, movie) %>%
  summarize(avg_rating = mean(rating, na.rm = TRUE)) %>%
  arrange(desc(avg_rating)))
## # A tibble: 6 x 2
##   movie            avg_rating
##   <chr>                 <dbl>
## 1 ready_or_not           4.67
## 2 avengers_endgame       4.6
## 3 joker                  4
## 4 parasite               4
## 5 star_wars_9            3.75
## 6 it_2                   3.67
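Because `na.rm = TRUE` drops missing ratings, each average is taken over only the people who rated that movie. For example, ready_or_not has three ratings (5, 5, 4), so its average is 14 / 3, or about 4.67. As a cross-check, the same averages can be computed in base R from the wide table. This is a minimal sketch using a hand-built copy of the data; `raw_local` and `avg_check` are illustrative names.

```r
# Survey table rebuilt by hand from the printed output (a stand-in for `raw`)
raw_local <- data.frame(
  respondent       = c("Doug", "Jen", "Frank", "Jess", "Steve"),
  joker            = c(5, 2, 4, NA, 5),
  it_2             = c(4, NA, NA, 3, 4),
  parasite         = c(4, 5, 3, 4, NA),
  ready_or_not     = c(NA, 5, NA, 5, 4),
  avengers_endgame = c(5, 4, 5, 4, 5),
  star_wars_9      = c(NA, 5, 4, 3, 3)
)

# Column means over the rating columns, skipping NA values, highest first
avg_check <- sort(colMeans(raw_local[-1], na.rm = TRUE), decreasing = TRUE)
print(round(avg_check, 2))
```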
I’m also interested in which movies were most popular, so let’s count how many respondents rated each one:
df$counter <- ifelse(is.na(df$rating), 0, 1)
(c <- group_by(df, movie) %>%
  summarize(views = sum(counter)) %>%
  arrange(desc(views)))
## # A tibble: 6 x 2
##   movie            views
##   <chr>            <dbl>
## 1 avengers_endgame     5
## 2 joker                4
## 3 parasite             4
## 4 star_wars_9          4
## 5 it_2                 3
## 6 ready_or_not         3
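The helper `counter` column is not strictly needed: summing `!is.na(rating)` inside `summarize()` counts the non-missing ratings directly and returns an integer count. The sketch below shows that variant on a hand-built long table; `df_local` and `views_local` are illustrative names, and the data is copied from the output above.

```r
library(dplyr)

# Long-format ratings rebuilt by hand (a stand-in for `df`)
df_local <- data.frame(
  respondent = rep(c("Doug", "Jen", "Frank", "Jess", "Steve"), times = 6),
  movie      = rep(c("joker", "it_2", "parasite", "ready_or_not",
                     "avengers_endgame", "star_wars_9"), each = 5),
  rating     = c(5, 2, 4, NA, 5,    # joker
                 4, NA, NA, 3, 4,   # it_2
                 4, 5, 3, 4, NA,    # parasite
                 NA, 5, NA, 5, 4,   # ready_or_not
                 5, 4, 5, 4, 5,     # avengers_endgame
                 NA, 5, 4, 3, 3)    # star_wars_9
)

# Count non-missing ratings per movie without a helper column
views_local <- df_local %>%
  group_by(movie) %>%
  summarize(views = sum(!is.na(rating))) %>%
  arrange(desc(views))
print(views_local)
```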
Finally, let’s visualize the results so that they’re a bit easier to digest:
library(ggplot2)
ggplot(c, aes(x = movie, y = views)) +
  geom_col() +
  ggtitle('Movies by Popularity') +
  theme_classic()
ggplot(g, aes(x = movie, y = avg_rating)) +
  geom_col() +
  ggtitle('Average Rating of Movie Survey') +
  theme_classic()
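One thing to keep in mind with these charts: ggplot2 places a character x variable in alphabetical order, so the `arrange()` calls earlier do not affect the order of the bars. If sorted bars are preferred, one option is to wrap the x variable in `reorder()`. The sketch below does this for the popularity chart, using a small hand-built copy of the counts; `views_local` and `p` are illustrative names.

```r
library(ggplot2)

# View counts copied from the summary table above (a stand-in for `c`)
views_local <- data.frame(
  movie = c("avengers_endgame", "joker", "parasite",
            "star_wars_9", "it_2", "ready_or_not"),
  views = c(5, 4, 4, 4, 3, 3)
)

# reorder(movie, -views) sorts the bars from most to least viewed
p <- ggplot(views_local, aes(x = reorder(movie, -views), y = views)) +
  geom_col() +
  labs(x = "movie", title = "Movies by Popularity") +
  theme_classic()
print(p)
```

The same pattern, with `reorder(movie, -avg_rating)`, would sort the average rating chart.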