To gather insights on recent popular movies, I will choose six recently released films that have gained significant attention. Then, I will ask at least five people I know—friends, family, classmates, or even imaginary friends if needed—to rate each movie they have watched on a scale from 1 to 5.
install.packages("RMySQL", repos = "https://cran.r-project.org")
##
## The downloaded binary packages are in
## /var/folders/s3/v3s06grs4td_hm4kgmt3q2ww0000gn/T//RtmptAq5It/downloaded_packages
library(RMySQL)
## Loading required package: DBI
user <- Sys.getenv("MYSQL_USER")
password <- Sys.getenv("MYSQL_PASSWORD")
host <- Sys.getenv("MYSQL_HOST")
dbname <- Sys.getenv("MYSQL_DBNAME")
conn <- dbConnect(MySQL(), user = user, password = password, host = host, dbname = dbname)
query <- "select * from movie_rankings"
movies_ranking_df <- dbGetQuery(conn, query)
query <- "select * from movies"
movies_df <- dbGetQuery(conn, query)
## Warning in dbSendQuery(conn, statement, ...): Decimal MySQL column 2 imported
## as numeric
dbDisconnect(conn)
## [1] TRUE
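The connection above reads its credentials from environment variables, and `Sys.getenv()` returns an empty string when a variable is unset, so a missing setting only shows up later as a connection error. A minimal sketch of checking the variables first is shown here; `require_env` and the `DEMO_DB_USER` variable are hypothetical names used for illustration and are not part of the original analysis.

```r
# Hypothetical helper: stop early with a clear message if a variable is unset or empty.
require_env <- function(name) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value) || !nzchar(value)) {
    stop("Environment variable not set: ", name, call. = FALSE)
  }
  value
}

# Set a demo variable only so that this sketch runs on its own.
Sys.setenv(DEMO_DB_USER = "demo")
demo_user <- require_env("DEMO_DB_USER")
```

The same helper could wrap each of the four `Sys.getenv()` calls before `dbConnect()` is reached.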
The first six rows of my friends’ rankings table:
library(knitr)
kable(head(movies_ranking_df))
id | movie_name | movie_ranking | friend_name |
---|---|---|---|
1 | Beetlejuice Beetlejuice | 5 | Halyna |
2 | Joker: Folie à Deux | 3 | Halyna |
3 | It Ends With Us | 3 | Halyna |
4 | The Watchers | 2 | Halyna |
5 | Back to Black | 4 | Halyna |
6 | Nowhere | 1 | Halyna |
The first six rows of the movies table:
kable(head(movies_df))
movie_id | movie_name | imdb_ranking | genre | release_date |
---|---|---|---|---|
1 | Beetlejuice Beetlejuice | 7.1 | Dark Comedy | 2024 |
2 | Joker: Folie à Deux | 7.1 | Dark Comedy | 2024 |
3 | It Ends With Us | 6.7 | Drama, Romance | 2024 |
4 | The Watchers | 5.7 | Fantasy, Horror | 2024 |
5 | Back to Black | 6.3 | Docudrama | 2024 |
6 | Nowhere | 6.3 | Survival, Drama | 2024 |
My friends’ average rating for each of the six movies:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mean_ratings_df <- movies_ranking_df %>% group_by(movie_name) %>%
summarise(mean_rating = mean(movie_ranking, na.rm = TRUE))
library(ggplot2)
ggplot(mean_ratings_df, aes(x = movie_name, y = mean_rating, fill = movie_name)) +
geom_bar(stat = "identity") +
labs(x = "Movie Name", y = "Friends' Average Rating") +
theme_minimal()
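To make the grouping step easier to follow, here is a small base R sketch of the same idea: one mean per movie, ignoring missing ratings. The toy data frame and its values are made up for illustration; only the column names mirror the table above.

```r
# Hypothetical toy data: two movies, three ratings each, one rating missing.
ratings <- data.frame(
  movie_name    = c("A", "A", "A", "B", "B", "B"),
  movie_ranking = c(5, 3, NA, 2, 4, 3)
)

# Base R counterpart of group_by(movie_name) + summarise(mean(..., na.rm = TRUE)):
# tapply() splits the ratings by movie and applies mean() to each group.
mean_by_movie <- tapply(
  ratings$movie_ranking,
  ratings$movie_name,
  mean,
  na.rm = TRUE
)

print(mean_by_movie)  # A is 4 (mean of 5 and 3), B is 3
```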
Code example: “na.rm = TRUE” is an argument that tells a function to remove missing values (NA) before performing the operation. It is particularly useful in functions such as mean(), sum(), and sd().
Example vector with missing values: data <- c(1, 2, NA, 4, 5)
Calculate the mean, removing NA values: mean_value <- mean(data, na.rm = TRUE)
Its output will be 3, the mean of the four non-missing values (1, 2, 4, and 5).
Note: I believe there are other ways to do this, but this is the only one I was able to get working for this use case.
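For comparison, a few other ways to skip missing values in base R are sketched below. They are alternatives for illustration, not the approach used in this analysis, and all of them give the same mean for this vector.

```r
data <- c(1, 2, NA, 4, 5)

# Without na.rm, a single missing value makes the result NA.
mean_with_na <- mean(data)

# Three equivalent ways to ignore the missing value:
mean_na_rm   <- mean(data, na.rm = TRUE)   # 3
mean_filter  <- mean(data[!is.na(data)])   # 3, filtering by hand
mean_na_omit <- mean(na.omit(data))        # 3, dropping NAs with na.omit()

# The same argument works for other summaries.
sum_na_rm <- sum(data, na.rm = TRUE)       # 12
```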
IMDb ratings for the same movies, normalized from IMDb’s 10-point scale to the 5-point scale my friends used:
normalized_imbd_rating_df = movies_df %>% group_by(movie_name) %>%
summarise(mean_rating = mean(round((imdb_ranking / 10) * 5), na.rm = TRUE))
ggplot(normalized_imbd_rating_df, aes(x = movie_name, y = mean_rating, fill = movie_name)) +
geom_bar(stat = "identity") +
labs(x = "Movie Name", y = "IMDb Rating (scaled to 5)") +
theme_minimal() +
theme(axis.text.x = element_blank()) # hide x-axis labels; the legend identifies each movie
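The normalization above divides each IMDb rating by 10, multiplies by 5, and rounds to a whole number. A short sketch with four of the IMDb values from the movies table shows what the rounding does; the variable names are chosen for illustration only.

```r
# IMDb ratings taken from the movies table (10-point scale).
imdb <- c(7.1, 6.7, 5.7, 6.3)

# Rescale to a 5-point scale.
scaled <- imdb / 10 * 5     # 3.55 3.35 2.85 3.15

# round() with no digits argument rounds to whole numbers, as in the analysis.
rounded <- round(scaled)    # 4 3 3 3

print(data.frame(imdb, scaled, rounded))
```

After rounding, three of these four movies share the value 3, so the IMDb bars show less variation than the unrounded values would.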
The following chart shows how my friends’ rankings differ from the IMDb rankings:
library(tidyr)
# keep only the movies that appear in both tables
joined_movies <- inner_join(mean_ratings_df, normalized_imbd_rating_df,
by = "movie_name",
suffix = c(".mean_ratings_df", ".normalized_imbd_rating_df")) # suffixes show which table each mean_rating column came from
# reshape to long format: one row per movie and rating source
long_movies <- joined_movies %>%
pivot_longer(cols = starts_with("mean_rating"),
names_to = "source",
values_to = "rating") %>%
mutate(source = recode(source,
"mean_rating.mean_ratings_df" = "Friends",
"mean_rating.normalized_imbd_rating_df" = "IMDb"))
# plot the two rating sources side by side
ggplot(long_movies, aes(x = movie_name, y = rating, fill = source)) +
geom_bar(stat = "identity", position = position_dodge()) +
theme_minimal() +
labs(title = "Comparison of Friends' vs IMDb Rankings",
x = "Movie Name",
y = "Average Rating") +
scale_fill_manual(name = "Rating Source", values = c('Friends' = '#E75480', 'IMDb' = 'blue'))
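The join and reshape steps can also be sketched in base R, which may help show what `inner_join()` and `pivot_longer()` produce. The two small data frames below are made-up examples, and `merge()` plus a hand-built long table stand in for the dplyr and tidyr calls; this is an illustration, not the code used for the chart.

```r
# Hypothetical toy tables: movie "C" is rated only by friends, "D" only on IMDb.
friends <- data.frame(movie_name = c("A", "B", "C"), mean_rating = c(4.2, 3.0, 2.5))
imdb    <- data.frame(movie_name = c("A", "B", "D"), mean_rating = c(4, 3, 3))

# Inner join: only movies present in both tables survive (A and B).
# The suffixes disambiguate the two mean_rating columns.
joined <- merge(friends, imdb, by = "movie_name", suffixes = c(".friends", ".imdb"))

# Long format: one row per movie and rating source.
long <- data.frame(
  movie_name = rep(joined$movie_name, times = 2),
  source     = rep(c("Friends", "IMDb"), each = nrow(joined)),
  rating     = c(joined$mean_rating.friends, joined$mean_rating.imdb)
)

print(long)
```

The long table has one `rating` column and one `source` column, which is the shape `ggplot()` needs to draw dodged bars with `fill = source`.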
Based on the graph “Comparison of Friends’ vs IMDb Rankings,” we can conclude that: