Introduction
I provided a list of 6 movies to 5 friends to investigate which ones they recognize and have watched, and to rate movives accordingly. The purpose of this survey is to explore the following questions:
Which movie is recognized by the most friends? Which friend watches movies most frequently? Which friend watches movies the least? Which movie receives the highest total rating? Which movie receives the highest rating from each active reviewer / the highest average rating?
Establish connection with MySQL
## 载入需要的程辑包:DBI
Load the dataset
data <- dbGetQuery(con, "SELECT * FROM movies")
data
## movie_name reviewer rating
## 1 Oppenheimer Tommy 5
## 2 Oppenheimer Jackie NA
## 3 Oppenheimer Adrian 4
## 4 Oppenheimer Emma NA
## 5 Oppenheimer Muhammad 5
## 6 The Marvels Tommy 2
## 7 The Marvels Jackie 4
## 8 The Marvels Adrian 3
## 9 The Marvels Emma NA
## 10 The Marvels Muhammad 3
## 11 Aquaman and the Lost Kingdom Tommy 4
## 12 Aquaman and the Lost Kingdom Jackie 4
## 13 Aquaman and the Lost Kingdom Adrian 2
## 14 Aquaman and the Lost Kingdom Emma 3
## 15 Aquaman and the Lost Kingdom Muhammad NA
## 16 Ghosted Tommy NA
## 17 Ghosted Jackie 4
## 18 Ghosted Adrian 3
## 19 Ghosted Emma NA
## 20 Ghosted Muhammad 2
## 21 John Wick 4 Tommy 4
## 22 John Wick 4 Jackie NA
## 23 John Wick 4 Adrian 4
## 24 John Wick 4 Emma NA
## 25 John Wick 4 Muhammad 5
## 26 Fast and Furious 10 Tommy 4
## 27 Fast and Furious 10 Jackie 2
## 28 Fast and Furious 10 Adrian 4
## 29 Fast and Furious 10 Emma 3
## 30 Fast and Furious 10 Muhammad 5
First, I checked data to find out which movie is the most widely recognized.
library(dplyr)
##
## 载入程辑包:'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
counts <- data %>%
group_by(movie_name) %>%
summarize(Never_Seen = sum(is.na(rating)),
Seen = sum(!is.na(rating))) %>%
arrange(desc(Seen))
counts
## # A tibble: 6 × 3
## movie_name Never_Seen Seen
## <chr> <int> <int>
## 1 Fast and Furious 10 0 5
## 2 Aquaman and the Lost Kingdom 1 4
## 3 The Marvels 1 4
## 4 Ghosted 2 3
## 5 John Wick 4 2 3
## 6 Oppenheimer 2 3
ggplot(counts, aes(x = reorder(movie_name, -Seen), y = Seen)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Count of Non-NA Ratings by Movie", x = "Movie Name", y = "Seen") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Next, I want to check which of my friends has watched the most movies
and which has watched the least.
Reviewer_counts <- data %>%
group_by(reviewer) %>%
summarize(Never_Seen = sum(is.na(rating)),
Seen = sum(!is.na(rating))) %>%
arrange(desc(Seen))
Reviewer_counts
## # A tibble: 5 × 3
## reviewer Never_Seen Seen
## <chr> <int> <int>
## 1 Adrian 0 6
## 2 Muhammad 1 5
## 3 Tommy 1 5
## 4 Jackie 2 4
## 5 Emma 4 2
data_long <- tidyr::pivot_longer(Reviewer_counts, cols = c(Seen, Never_Seen ), names_to = "status", values_to = "count")
ggplot(data_long, aes(x = reviewer, y = count, fill = status)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Count of Never Seen and Seen by Reviewer", x = "Reviewer", y = "Count") +
scale_fill_manual(values = c("Never_Seen" = "skyblue", "Seen" = "salmon")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Then, let’s see which movie has the highest rating.
total_rating <- data %>%
filter(!is.na(rating)) %>%
group_by(movie_name) %>%
summarize(total_score = sum(rating, na.rm = TRUE)) %>%
arrange(desc(total_score))
total_rating
## # A tibble: 6 × 2
## movie_name total_score
## <chr> <int>
## 1 Fast and Furious 10 18
## 2 Oppenheimer 14
## 3 Aquaman and the Lost Kingdom 13
## 4 John Wick 4 13
## 5 The Marvels 12
## 6 Ghosted 9
ggplot(total_rating, aes(x = movie_name, y = total_score, fill = movie_name)) +
geom_bar(stat = "identity") +
labs(title = "Total Rating by Movie", x = "Movie Name", y = "Total Score") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Finally, let’s see which movie receives the highest individual rating.
average_ratings <- data %>%
group_by(movie_name) %>%
summarize(average_rating = sum(rating, na.rm = TRUE) / sum(!is.na(rating)),
non_na_count = sum(!is.na(rating))) %>%
arrange(desc(average_rating))
average_ratings
## # A tibble: 6 × 3
## movie_name average_rating non_na_count
## <chr> <dbl> <int>
## 1 Oppenheimer 4.67 3
## 2 John Wick 4 4.33 3
## 3 Fast and Furious 10 3.6 5
## 4 Aquaman and the Lost Kingdom 3.25 4
## 5 Ghosted 3 3
## 6 The Marvels 3 4
ggplot(average_ratings, aes(x = movie_name, y = average_rating)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Average Rating by Active Viewers", x = "Movie Name", y = "Average Rating") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Conclusion,
Among my friends, Adrian has watched the most movies and has seen every movie on the list; Emma watches the fewest movies, having only seen two. The most widely recognized movie is “Fast and Furious 10,” by its Series of ten movies, resulting in the highest total score. However, this doesn’t mean it’s the best movie. Upon calculation, “Oppenheimer” has the highest average rating by active reviewers, making it the truly best recent popular movie.