As October approaches, it’s time to curate the perfect horror movie watchlist to set the mood for spooky season. Not all horror films are created equal, and with subcategories like Supernatural, Psychological, and Mystery/Thriller, each brings its own unique brand of fright. To help narrow down the must-watch titles, I asked six family members to rate some of 2024’s most popular horror movies on a scale of 1 to 5, where 1 means they weren’t impressed and 5 signifies a must-see scare.
knitr::opts_chunk$set(echo = TRUE)
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Imported SQL CSV file into markdown
horror_rating <- read.csv("C:/Users/tiffh/Assignment#2/Results.csv")
head(horror_rating)
## ranking_id participant_id movie_id ranking
## 1 1 1 1 5
## 2 2 1 2 4
## 3 3 1 3 0
## 4 4 1 4 3
## 5 5 1 5 0
## 6 6 1 6 4
The selected horror movies included The Exorcism, Tarot, Immaculate, The Watchers, Oddity, and The Deliverance. When I created the dataset in SQL, the movies were initially assigned numbers, so I recoded them to associate each number with its corresponding title. However, not all family members watched every movie, so some ratings are missing and are stored as 0. My first step in the analysis was to explore the watched vs. not watched status for each film.
library(dplyr)
#Checking for missing data in dataset
which(horror_rating$ranking == 0)
## [1] 3 5 8 10 12 13 17 19 21 24 26 28
# Of the 36 rankings, 12 are zero, meaning the movie was not watched.
# Recode into a watched / not watched status and attach the full movie
# titles so the results are easier to read when graphed.
# Movie titles, indexed by movie_id
movie_names <- c("The Exorcism", "Tarot", "Immaculate", "The Watchers", "Oddity", "The Deliverance")
# Helper (not used below): movie title if watched, otherwise "Not Watched"
recode_movies <- function(movie_id, ranking) {
  ifelse(ranking == 0, "Not Watched", movie_names[movie_id])
}
# Create df with watch status
movie_watched <- horror_rating %>%
  mutate(MovieName = movie_names[movie_id],
         WatchedStatus = ifelse(ranking == 0, "Not Watched", "Watched")
  ) %>%
  select(participant_id, MovieName, WatchedStatus, ranking)
print(movie_watched)
## participant_id MovieName WatchedStatus ranking
## 1 1 The Exorcism Watched 5
## 2 1 Tarot Watched 4
## 3 1 Immaculate Not Watched 0
## 4 1 The Watchers Watched 3
## 5 1 Oddity Not Watched 0
## 6 1 The Deliverance Watched 4
## 7 2 The Exorcism Watched 3
## 8 2 Tarot Not Watched 0
## 9 2 Immaculate Watched 5
## 10 2 The Watchers Not Watched 0
## 11 2 Oddity Watched 4
## 12 2 The Deliverance Not Watched 0
## 13 3 The Exorcism Not Watched 0
## 14 3 Tarot Watched 3
## 15 3 Immaculate Watched 4
## 16 3 The Watchers Watched 2
## 17 3 Oddity Not Watched 0
## 18 3 The Deliverance Watched 5
## 19 4 The Exorcism Not Watched 0
## 20 4 Tarot Watched 4
## 21 4 Immaculate Not Watched 0
## 22 4 The Watchers Watched 3
## 23 4 Oddity Watched 2
## 24 4 The Deliverance Not Watched 0
## 25 5 The Exorcism Watched 4
## 26 5 Tarot Not Watched 0
## 27 5 Immaculate Watched 3
## 28 5 The Watchers Not Watched 0
## 29 5 Oddity Watched 5
## 30 5 The Deliverance Watched 2
## 31 6 The Exorcism Watched 4
## 32 6 Tarot Watched 5
## 33 6 Immaculate Watched 3
## 34 6 The Watchers Watched 2
## 35 6 Oddity Watched 1
## 36 6 The Deliverance Watched 4
# Make visualization of watched status
#install.packages("ggplot2")
library(ggplot2)
library(dplyr)
watched_counts <- movie_watched %>%
  group_by(MovieName, WatchedStatus) %>%
  summarise(Count = n(), .groups = 'drop')
# Create the bar plot (named watched_plot so it does not mask base::plot)
watched_plot <- ggplot(watched_counts, aes(x = MovieName, y = Count, fill = WatchedStatus)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("Watched" = "purple", "Not Watched" = "pink")) +
  labs(title = "Watched vs Not Watched Movies",
       x = "Movie Name",
       y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Print the plot
print(watched_plot)
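As a cross-check on the plot, the watched and not watched counts can also be tabulated directly. The sketch below rebuilds the 36 ratings from the table printed earlier and uses base R's `table()`. The names `ratings`, `check_df`, and `watch_table` are illustrative and not part of the original analysis.

```r
# Ratings copied from the printed table: six participants x six movies.
# A rating of 0 means the participant did not watch the movie.
movie_names <- c("The Exorcism", "Tarot", "Immaculate",
                 "The Watchers", "Oddity", "The Deliverance")
ratings <- c(5, 4, 0, 3, 0, 4,
             3, 0, 5, 0, 4, 0,
             0, 3, 4, 2, 0, 5,
             0, 4, 0, 3, 2, 0,
             4, 0, 3, 0, 5, 2,
             4, 5, 3, 2, 1, 4)

check_df <- data.frame(
  MovieName     = rep(movie_names, times = 6),
  WatchedStatus = ifelse(ratings == 0, "Not Watched", "Watched")
)

# Cross-tabulate movie title by watched status
watch_table <- table(check_df$MovieName, check_df$WatchedStatus)
print(watch_table)
```

Every row of the table shows 4 under Watched and 2 under Not Watched, which matches the 12 zero ratings found earlier.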
The bar plot displays the count of respondents who watched versus those who didn’t for each of the six movies. Each movie was watched by exactly four of the six respondents and missed by the other two. To investigate further, I will calculate the average rating for each movie, excluding non-watchers. This approach ensures the ratings reflect the opinions of those who actually viewed the films, providing a clearer picture of each movie’s reception.
library(dplyr)
#Create df with watched only
watched_only <- subset(movie_watched, ranking > 0)
print(watched_only)
## participant_id MovieName WatchedStatus ranking
## 1 1 The Exorcism Watched 5
## 2 1 Tarot Watched 4
## 4 1 The Watchers Watched 3
## 6 1 The Deliverance Watched 4
## 7 2 The Exorcism Watched 3
## 9 2 Immaculate Watched 5
## 11 2 Oddity Watched 4
## 14 3 Tarot Watched 3
## 15 3 Immaculate Watched 4
## 16 3 The Watchers Watched 2
## 18 3 The Deliverance Watched 5
## 20 4 Tarot Watched 4
## 22 4 The Watchers Watched 3
## 23 4 Oddity Watched 2
## 25 5 The Exorcism Watched 4
## 27 5 Immaculate Watched 3
## 29 5 Oddity Watched 5
## 30 5 The Deliverance Watched 2
## 31 6 The Exorcism Watched 4
## 32 6 Tarot Watched 5
## 33 6 Immaculate Watched 3
## 34 6 The Watchers Watched 2
## 35 6 Oddity Watched 1
## 36 6 The Deliverance Watched 4
# Average movie rank
average_rank <- watched_only %>%
  group_by(MovieName) %>%
  summarise(AverageRank = mean(ranking, na.rm = TRUE), .groups = 'drop')
print(average_rank)
## # A tibble: 6 × 2
## MovieName AverageRank
## <chr> <dbl>
## 1 Immaculate 3.75
## 2 Oddity 3
## 3 Tarot 4
## 4 The Deliverance 3.75
## 5 The Exorcism 4
## 6 The Watchers 2.5
# df with avg rank, built from the summary above instead of retyping the values
average_rank_df <- as.data.frame(average_rank)
print(average_rank_df)
## MovieName AverageRank
## 1 Immaculate 3.75
## 2 Oddity 3.00
## 3 Tarot 4.00
## 4 The Deliverance 3.75
## 5 The Exorcism 4.00
## 6 The Watchers 2.50
# Visualize
ggplot(average_rank_df, aes(x = MovieName, y = AverageRank, fill = MovieName)) +
  geom_bar(stat = "identity") +
  scale_fill_brewer(palette = "Paired") +
  theme_minimal() +
  labs(title = "Average Rank Scores of Movies", x = "Movies", y = "Average Rank") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Based on the average rankings, Tarot and The Exorcism both received the highest rating of 4.00, indicating they were rated favorably by viewers. There is little variance among Immaculate, Tarot, The Deliverance, and The Exorcism, with scores ranging from 3.75 to 4.00. This small difference suggests that these movies were similarly well received, though with only four viewers per film the gap between 3.75 and 4.00 amounts to a single rating point from one viewer. Oddity received a more average rating of 3.00, reflecting a more mixed response, while The Watchers had the lowest rating at 2.50. This suggests that The Watchers did not connect with this audience and may not be a suitable choice for those seeking high-quality horror films.
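To see how much viewers agreed on each film, it helps to look at the spread of ratings next to the mean. The sketch below is an illustrative addition rather than part of the original analysis: it rebuilds the ratings from the printed table, treats 0 as a missing value (`NA`) instead of filtering rows, and reports the number of viewers, the mean, and the standard deviation per movie. The names `spread_df`, `Viewers`, and `SDRank` are hypothetical.

```r
# Ratings copied from the printed table: six participants x six movies.
movie_names <- c("The Exorcism", "Tarot", "Immaculate",
                 "The Watchers", "Oddity", "The Deliverance")
ratings <- c(5, 4, 0, 3, 0, 4,
             3, 0, 5, 0, 4, 0,
             0, 3, 4, 2, 0, 5,
             0, 4, 0, 3, 2, 0,
             4, 0, 3, 0, 5, 2,
             4, 5, 3, 2, 1, 4)
movie <- rep(movie_names, times = 6)

# 0 marks "not watched", so treat it as a missing rating rather than a score
ratings[ratings == 0] <- NA

# Per-movie summaries; na.rm = TRUE drops the unwatched entries
avg <- tapply(ratings, movie, mean, na.rm = TRUE)
spread_df <- data.frame(
  MovieName   = names(avg),
  Viewers     = as.vector(tapply(!is.na(ratings), movie, sum)),
  AverageRank = as.vector(avg),
  SDRank      = as.vector(tapply(ratings, movie, sd, na.rm = TRUE))
)
print(spread_df)
```

Oddity has the largest standard deviation of the six films, which supports reading its 3.00 average as a mixed response rather than uniform indifference.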
One possible explanation for the differing ratings is the movies’ subcategories. To categorize the films, I assigned the subcategories as follows: The Exorcism, Immaculate, and The Deliverance fall under “Supernatural Horror,” while Oddity and Tarot are classified as “Psychological Horror.” Lastly, The Watchers is categorized as “Mystery/Thriller.” This classification may help explain the variations in average ratings among the films.
# Create subcategory of the movies
average_rank_df$Subcategory <- ifelse(
  average_rank_df$MovieName %in% c("The Exorcism", "Immaculate", "The Deliverance"),
  "Supernatural Horror",
  ifelse(average_rank_df$MovieName %in% c("Oddity", "Tarot"),
         "Psychological Horror",
         ifelse(average_rank_df$MovieName == "The Watchers",
                "Mystery/Thriller",
                NA)))
print(average_rank_df)
## MovieName AverageRank Subcategory
## 1 Immaculate 3.75 Supernatural Horror
## 2 Oddity 3.00 Psychological Horror
## 3 Tarot 4.00 Psychological Horror
## 4 The Deliverance 3.75 Supernatural Horror
## 5 The Exorcism 4.00 Supernatural Horror
## 6 The Watchers 2.50 Mystery/Thriller
#Aggregate average rank by subcategory
subcategory_avg <- aggregate(AverageRank ~ Subcategory, data = average_rank_df, FUN = mean)
print(subcategory_avg)
## Subcategory AverageRank
## 1 Mystery/Thriller 2.500000
## 2 Psychological Horror 3.500000
## 3 Supernatural Horror 3.833333
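The subcategory assignment can also be written as a named lookup vector, which keeps each title next to its label and avoids nesting `ifelse()` calls. This is a minimal sketch of an alternative, using the same assignments described in the text. The name `subcategory_lookup` is hypothetical.

```r
# Average ranks as reported in the output above
average_rank_df <- data.frame(
  MovieName   = c("Immaculate", "Oddity", "Tarot",
                  "The Deliverance", "The Exorcism", "The Watchers"),
  AverageRank = c(3.75, 3.00, 4.00, 3.75, 4.00, 2.50)
)

# Named vector mapping movie title -> subcategory
subcategory_lookup <- c(
  "The Exorcism"    = "Supernatural Horror",
  "Immaculate"      = "Supernatural Horror",
  "The Deliverance" = "Supernatural Horror",
  "Oddity"          = "Psychological Horror",
  "Tarot"           = "Psychological Horror",
  "The Watchers"    = "Mystery/Thriller"
)

# Index the lookup by title; a title missing from the lookup would yield NA
average_rank_df$Subcategory <- unname(subcategory_lookup[average_rank_df$MovieName])

# Same aggregation as above: mean of the movie averages within each subcategory
subcategory_avg <- aggregate(AverageRank ~ Subcategory, data = average_rank_df, FUN = mean)
print(subcategory_avg)
```

Because every movie here has the same number of viewers, averaging the movie averages gives the same result as averaging the individual ratings within each subcategory.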
# linear regression
model <- lm(AverageRank ~ Subcategory, data = average_rank_df)
summary(model)
##
## Call:
## lm(formula = AverageRank ~ Subcategory, data = average_rank_df)
##
## Residuals:
## 1 2 3 4 5 6
## -8.333e-02 -5.000e-01 5.000e-01 -8.333e-02 1.667e-01 -4.857e-17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.5000 0.4249 5.883 0.0098 **
## SubcategoryPsychological Horror 1.0000 0.5204 1.922 0.1504
## SubcategorySupernatural Horror 1.3333 0.4907 2.717 0.0727 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4249 on 3 degrees of freedom
## Multiple R-squared: 0.7111, Adjusted R-squared: 0.5185
## F-statistic: 3.692 on 2 and 3 DF, p-value: 0.1553
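# Df with the fitted mean rank and its standard error for each subcategory.
# predict() returns intercept + coefficient, so Mystery/Thriller is plotted at
# its fitted mean of 2.50 rather than as a difference from itself.
subcategories <- data.frame(
  Subcategory = c("Mystery/Thriller", "Psychological Horror", "Supernatural Horror")
)
fitted_means <- predict(model, newdata = subcategories, se.fit = TRUE)
coef_df <- data.frame(
  Subcategory = subcategories$Subcategory,
  Estimate = as.vector(fitted_means$fit),
  StdError = as.vector(fitted_means$se.fit)
)
# Visualization (error bars show +/- 1 standard error)
ggplot(coef_df, aes(x = Subcategory, y = Estimate, fill = Subcategory)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_errorbar(aes(ymin = Estimate - StdError, ymax = Estimate + StdError), width = 0.2) +
  theme_minimal() +
  labs(title = "Estimated Average Rank by Subcategory", x = "Subcategory", y = "Estimated Average Rank") +
  scale_fill_manual(values = c("Mystery/Thriller" = "lightblue",
                               "Psychological Horror" = "pink",
                               "Supernatural Horror" = "purple"))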
The average ratings for the different subcategories of horror films reveal interesting trends. Mystery/Thriller has the lowest average rating at 2.50, which is simply the rating of The Watchers, the only film in this category. In contrast, Psychological Horror received a more favorable average of 3.50, while Supernatural Horror leads with an average rating of 3.83. This suggests that the supernatural horror category is the most favored among viewers: it combines the higher ratings of The Exorcism, Immaculate, and The Deliverance, resulting in the highest overall ranking among the subcategories.
I used linear regression to examine the relationship between the average rankings of the horror movies and their subcategories. The intercept of 2.50 is the baseline average ranking for the reference category, Mystery/Thriller. The coefficient for Psychological Horror is 1.00, meaning movies in this category have an average ranking 1.00 higher than the baseline. However, the p-value of 0.1504 indicates that this result is not statistically significant at alpha = 0.05, so there isn’t strong evidence that Psychological Horror films are ranked higher than the reference category. The coefficient for Supernatural Horror is 1.3333, meaning films in this category have an average ranking 1.3333 higher than the baseline. Its p-value of 0.0727 is above the 0.05 threshold but below 0.10, which points to a possible trend toward higher ratings for supernatural horror movies compared to Mystery/Thriller and may still warrant attention.
Overall, the analysis indicates that while both Psychological Horror and Supernatural Horror have higher average rankings than Mystery/Thriller, only Supernatural Horror shows a trend that might be significant, hinting that these viewers prefer supernatural elements in horror films over the mystery/thriller genre. The residuals are small (none larger than 0.5 in absolute value), but the model is fit to only six movie averages and the overall F-test (p = 0.1553) is not significant, so the fit should be read with caution.
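The link between the regression coefficients and the subcategory means can be checked directly. The sketch below refits the same model on the six movie averages reported above, compares the coefficients with differences in group means, and prints 95% confidence intervals using `confint()`. The names `group_means`, `coefs`, and `ci` are illustrative.

```r
# Movie averages and subcategories as reported in the output above
average_rank_df <- data.frame(
  AverageRank = c(3.75, 3.00, 4.00, 3.75, 4.00, 2.50),
  Subcategory = c("Supernatural Horror", "Psychological Horror", "Psychological Horror",
                  "Supernatural Horror", "Supernatural Horror", "Mystery/Thriller")
)

model <- lm(AverageRank ~ Subcategory, data = average_rank_df)

# Group means computed directly from the data
group_means <- tapply(average_rank_df$AverageRank, average_rank_df$Subcategory, mean)

# Intercept = mean of the reference level (first alphabetically: Mystery/Thriller).
# Each other coefficient = that group's mean minus the reference mean.
coefs <- unname(coef(model))
print(coefs)
print(unname(group_means - group_means["Mystery/Thriller"]))

# 95% confidence intervals for the intercept and the two differences
ci <- confint(model)
print(ci)
```

Both intervals for the subcategory differences include zero, which is consistent with p-values above 0.05.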
In conclusion, the analysis of horror movie rankings by subcategory shows that both Psychological Horror and Supernatural Horror tend to yield higher average rankings than Mystery/Thriller. Only Supernatural Horror exhibits a potentially significant trend, suggesting a viewer preference for supernatural elements in horror films. The residuals are small, but with only six movies and six raters these findings are tentative. To gain a deeper understanding of participants’ rankings, it is crucial to explore the reasons behind their scores, as relying solely on subcategories may not provide a comprehensive assessment. The presence of well-known actors like Avantika Vandanapu, Jacob Batalon, and Sydney Sweeney in the higher-ranking films may have influenced viewer ratings, alongside factors such as cinematography. Moving forward, expanding the sample size and gathering more detailed insights into participant rankings will strengthen this study of horror movie evaluations and provide a clearer picture of audience preferences in the genre.