Introduction

To gather insights on recent popular movies, I will choose six recently released films that have gained significant attention. Then, I will ask at least five people I know—friends, family, classmates, or even imaginary friends if needed—to rate each movie they have watched on a scale from 1 to 5.

install.packages("RMySQL", repos = "https://cran.r-project.org")
## 
## The downloaded binary packages are in
##  /var/folders/s3/v3s06grs4td_hm4kgmt3q2ww0000gn/T//RtmptAq5It/downloaded_packages
library(RMySQL)
## Loading required package: DBI
user <- Sys.getenv("MYSQL_USER")
password <- Sys.getenv("MYSQL_PASSWORD")
host <- Sys.getenv("MYSQL_HOST")
dbname <- Sys.getenv("MYSQL_DBNAME")

conn <- dbConnect(MySQL(), user = user, password = password, host = host, dbname = dbname)

query <- "select * from movie_rankings"
movies_ranking_df <- dbGetQuery(conn, query)

query <- "select * from movies"
movies_df <- dbGetQuery(conn, query)
## Warning in dbSendQuery(conn, statement, ...): Decimal MySQL column 2 imported
## as numeric
dbDisconnect(conn)
## [1] TRUE

The first six rows of my friends’ ranking table records.

library(knitr)
kable(head(movies_ranking_df))
id movie_name movie_ranking friend_name
1 Beetlejuice Beetlejuice 5 Halyna
2 Joker: Folie à Deux 3 Halyna
3 It Ends With Us 3 Halyna
4 The Watchers 2 Halyna
5 Back to Black 4 Halyna
6 Nowhere 1 Halyna

The first six rows of my movies table records.

kable(head(movies_df))
movie_id movie_name imdb_ranking genre release_date
1 Beetlejuice Beetlejuice 7.1 Dark Comedy 2024
2 Joker: Folie à Deux 7.1 Dark Comedy 2024
3 It Ends With Us 6.7 Drama, Romance 2024
4 The Watchers 5.7 Fantasy, Horror 2024
5 Back to Black 6.3 Docudrama 2024
6 Nowhere 6.3 Survival, Drama 2024

The average rating of my frinds for the following 6 movies is:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mean_ratings_df <- movies_ranking_df %>% group_by(movie_name) %>% 
  summarise(mean_rating = mean(movie_ranking, na.rm = TRUE))

library(ggplot2)
ggplot(mean_ratings_df, aes(x = movie_name, y = mean_rating, fill = movie_name)) + 
  geom_bar(stat = "identity") + 
  labs(x = "Movie Name", y = "Friends' Average Rating") +
  theme_minimal() 

Note: Missing Data in the Data Frame Example

Code example: “na.rm = true”is an argument used in functions to specify that missing values should be removed before performing the operation. This is particularly useful in functions like mean(), sum(), sd(), etc.

Example vector with missing values: data <- myNumbers(1, 2, NA, 4, 5)

Calculate the mean, removing NA values: mean_value <- mean(data, na.rm = TRUE)

Its output will be 3

***I believe there are some other ways to do it, but this is the only one I was able to actually use in this use case.

Back to our movies…

Normalized ratings representation up to 5 for the same movies according to IMDb.

normalized_imbd_rating_df = movies_df %>% group_by(movie_name) %>%
  summarise(mean_rating = mean(round((imdb_ranking / 10) * 5), na.rm = TRUE))

ggplot(normalized_imbd_rating_df, aes(x = movie_name, y = mean_rating, fill = movie_name)) + 
  geom_bar(stat = "identity") + 
  labs(x = "Movie Name", y = "imdb Average Rating") +
  theme_minimal() + 
  theme(axis.text.x = element_blank()) #make it look clean

There is a representation of now my frends’ ranking is different from imdb ranking:

library(tidyr)

#only common values
joined_movies <- inner_join(mean_ratings_df, normalized_imbd_rating_df, 
                            by = "movie_name", 
                            suffix = c(".mean_ratings_df", ".normalized_imbd_rating_df")) # which column from which table


# long format
long_movies <- joined_movies %>%
  pivot_longer(cols = starts_with("mean_rating"), 
               names_to = "source", 
               values_to = "rating") %>%
  mutate(source = recode(source, 
                         "mean_rating.mean_ratings_df" = "Friends", 
                         "mean_rating.normalized_imbd_rating_df" = "IMDb"))

#data plot
ggplot(long_movies, aes(x = movie_name, y = rating, fill = source)) + 
  geom_bar(stat = "identity", position = position_dodge()) +
  theme_minimal() +
  labs(title = "Comparison of Friends' vs IMDb Rankings",
       x = "Movie Name",
       y = "Average Rating") +
  scale_fill_manual(name = "Rating Source", values = c('Friends' = '#E75480', 'IMDb' = 'blue'))

Colclusion:

Based on two graph graph “Friends’ Ranking vs IMDb Rankings,” we can conclude that: