Assignment – SQL and R

Introduction

I provided a list of 6 movies to 5 friends to investigate which ones they recognize and have watched, and to rate movives accordingly. The purpose of this survey is to explore the following questions:

Which movie is recognized by the most friends? Which friend watches movies most frequently? Which friend watches movies the least? Which movie receives the highest total rating? Which movie receives the highest rating from each active reviewer / the highest average rating?

Establish connection with MySQL

## 载入需要的程辑包：DBI

Load the dataset

data <- dbGetQuery(con, "SELECT * FROM movies")
data

##                      movie_name reviewer rating
## 1                   Oppenheimer    Tommy      5
## 2                   Oppenheimer   Jackie     NA
## 3                   Oppenheimer   Adrian      4
## 4                   Oppenheimer     Emma     NA
## 5                   Oppenheimer Muhammad      5
## 6                   The Marvels    Tommy      2
## 7                   The Marvels   Jackie      4
## 8                   The Marvels   Adrian      3
## 9                   The Marvels     Emma     NA
## 10                  The Marvels Muhammad      3
## 11 Aquaman and the Lost Kingdom    Tommy      4
## 12 Aquaman and the Lost Kingdom   Jackie      4
## 13 Aquaman and the Lost Kingdom   Adrian      2
## 14 Aquaman and the Lost Kingdom     Emma      3
## 15 Aquaman and the Lost Kingdom Muhammad     NA
## 16                      Ghosted    Tommy     NA
## 17                      Ghosted   Jackie      4
## 18                      Ghosted   Adrian      3
## 19                      Ghosted     Emma     NA
## 20                      Ghosted Muhammad      2
## 21                  John Wick 4    Tommy      4
## 22                  John Wick 4   Jackie     NA
## 23                  John Wick 4   Adrian      4
## 24                  John Wick 4     Emma     NA
## 25                  John Wick 4 Muhammad      5
## 26          Fast and Furious 10    Tommy      4
## 27          Fast and Furious 10   Jackie      2
## 28          Fast and Furious 10   Adrian      4
## 29          Fast and Furious 10     Emma      3
## 30          Fast and Furious 10 Muhammad      5

First, I checked data to find out which movie is the most widely recognized.

library(dplyr)

## 
## 载入程辑包：'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

counts <- data %>%
  group_by(movie_name) %>%
  summarize(Never_Seen = sum(is.na(rating)),
            Seen = sum(!is.na(rating))) %>%
arrange(desc(Seen))

counts

## # A tibble: 6 × 3
##   movie_name                   Never_Seen  Seen
##   <chr>                             <int> <int>
## 1 Fast and Furious 10                   0     5
## 2 Aquaman and the Lost Kingdom          1     4
## 3 The Marvels                           1     4
## 4 Ghosted                               2     3
## 5 John Wick 4                           2     3
## 6 Oppenheimer                           2     3

ggplot(counts, aes(x = reorder(movie_name, -Seen), y = Seen)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Count of Non-NA Ratings by Movie", x = "Movie Name", y = "Seen") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Next, I want to check which of my friends has watched the most movies and which has watched the least.

Reviewer_counts <- data %>%
  group_by(reviewer) %>%
  summarize(Never_Seen = sum(is.na(rating)),
            Seen = sum(!is.na(rating))) %>%
arrange(desc(Seen))

Reviewer_counts

## # A tibble: 5 × 3
##   reviewer Never_Seen  Seen
##   <chr>         <int> <int>
## 1 Adrian            0     6
## 2 Muhammad          1     5
## 3 Tommy             1     5
## 4 Jackie            2     4
## 5 Emma              4     2

data_long <- tidyr::pivot_longer(Reviewer_counts, cols = c(Seen, Never_Seen ), names_to = "status", values_to = "count")

ggplot(data_long, aes(x = reviewer, y = count, fill = status)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Count of Never Seen and Seen by Reviewer", x = "Reviewer", y = "Count") +
  scale_fill_manual(values = c("Never_Seen" = "skyblue", "Seen" = "salmon")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Then, let’s see which movie has the highest rating.

total_rating <- data %>%
  filter(!is.na(rating)) %>%
  group_by(movie_name) %>%
    summarize(total_score = sum(rating, na.rm = TRUE)) %>%
  arrange(desc(total_score))
total_rating

## # A tibble: 6 × 2
##   movie_name                   total_score
##   <chr>                              <int>
## 1 Fast and Furious 10                   18
## 2 Oppenheimer                           14
## 3 Aquaman and the Lost Kingdom          13
## 4 John Wick 4                           13
## 5 The Marvels                           12
## 6 Ghosted                                9

ggplot(total_rating, aes(x = movie_name, y = total_score, fill = movie_name)) +
  geom_bar(stat = "identity") +
  labs(title = "Total Rating by Movie", x = "Movie Name", y = "Total Score") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Finally, let’s see which movie receives the highest individual rating.

average_ratings <- data %>%
  group_by(movie_name) %>%
summarize(average_rating = sum(rating, na.rm = TRUE) / sum(!is.na(rating)),
            non_na_count = sum(!is.na(rating))) %>%
  arrange(desc(average_rating))
average_ratings

## # A tibble: 6 × 3
##   movie_name                   average_rating non_na_count
##   <chr>                                 <dbl>        <int>
## 1 Oppenheimer                            4.67            3
## 2 John Wick 4                            4.33            3
## 3 Fast and Furious 10                    3.6             5
## 4 Aquaman and the Lost Kingdom           3.25            4
## 5 Ghosted                                3               3
## 6 The Marvels                            3               4

ggplot(average_ratings, aes(x = movie_name, y = average_rating)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Average Rating by Active Viewers", x = "Movie Name", y = "Average Rating") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Conclusion,

Among my friends, Adrian has watched the most movies and has seen every movie on the list; Emma watches the fewest movies, having only seen two. The most widely recognized movie is “Fast and Furious 10,” by its Series of ten movies, resulting in the highest total score. However, this doesn’t mean it’s the best movie. Upon calculation, “Oppenheimer” has the highest average rating by active reviewers, making it the truly best recent popular movie.

Assignment – SQL and R

ZIXIAN LIANG

2024-02-04