CS 614 Final Report: Letterboxd User Analysis

Introduction

For the majority of its history, cinema has been a medium dominated by American and European films and has not been representative of the broader cultural landscape. That, alongside the limited spread of information during the 20th century, has left enclaves of film gems scattered across the globe. However, in today’s globalized world, the popularity of foreign films continues to grow. With the rise of streaming services and online platforms such as Letterboxd, it has become easier than ever for people to discover and watch films from around the world and, as a byproduct, to take part in a cultural exchange that was once impossible.

In this research paper, we will analyze data from Letterboxd to explore trends in how users rate and watch foreign films. We will focus on a range of factors, including the most popular countries of origin for foreign films and the factors that influence a user’s rating of a foreign film. Through our analysis, we hope to gain a better understanding of the role that foreign films play in the modern film landscape and the ways in which they are received by audiences.

Furthermore, we will also analyze methods for predicting user ratings for popular films, through data imputation. Utilizing users’ literal film ratings rather than their viewing habits can potentially lead to a lower error with film predictions.

Data Background and Preprocessing

The data for this study has been extracted from a Kaggle dataset by Sam Learner. The dataset consists of scraped Letterboxd user ratings from the top 7000 most active Letterboxd users. Letterboxd is a social media site where users can write movie reviews and rate movies on a 5-star rating scale, with half stars allowed. There are approximately 11 million ratings in this dataset, as well as 280,000 movies.

The movie dataset contained far more features than the user dataset, which only contained a number_of_reviews column that had to be recomputed due to inaccurate values. As a result, the majority of the analysis focuses on the demographics of movies rather than the demographics of users. The dataset also originally contained genres, production_countries, and spoken_languages columns, each of which consisted of arrays. These were split into individual columns, such as production_countries1. Furthermore, in order to group movies by geographic region, a dataset from Our World in Data was used to assign each film a region based on its production_countries1 column.

Several other derived columns were also used in this dataset in order to clarify trends, such as, but not limited to:

ratings.user_diff: Calculated by subtracting the user’s user.avg_rating from their rating of a film, so that positive values mean the user rated the film above their own average. This normalizes user ratings in order to account for an inconsistent enjoyment benchmark. This metric can be interpreted as a user’s enjoyment of a movie relative to their other rated movies.
ratings.movie_diff: Calculated by subtracting the movie’s avg_rating from a user’s rating of that film, so that positive values mean the user rated the film above the consensus. This normalizes user ratings to contextualize their opinion within the general consensus on the film. This metric can be interpreted as how much a user “underrates” or “overrates” a movie.
percentage_of_consumption: Calculated for each user at the region and/or genre level by dividing the number of movies the user rated from a given region (or genre) by the user’s total number of ratings. This metric is meant to contextualize how often users are exposed to films from a given region.
movies.freq: The number of ratings for a given movie.
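
The derived columns above can be sketched in a few lines. The following is a minimal Python illustration on hypothetical toy data, not the code used for this report; the tuple layout, the function names, and the user and film names are all assumptions. The sign convention (rating minus average, so that positive values mean above-average enjoyment) is also an assumption, chosen to match how the Results section interprets these metrics.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy ratings: (user, movie, region, rating).
ratings = [
    ("ana", "film_a", "Europe", 8),
    ("ana", "film_b", "East Asia", 10),
    ("ana", "film_c", "Europe", 6),
    ("ben", "film_a", "Europe", 6),
    ("ben", "film_b", "East Asia", 8),
]

def derive_columns(ratings):
    """Attach user_diff, movie_diff and freq to every rating."""
    by_user, by_movie = defaultdict(list), defaultdict(list)
    for user, movie, _region, rating in ratings:
        by_user[user].append(rating)
        by_movie[movie].append(rating)
    user_avg = {u: mean(v) for u, v in by_user.items()}
    movie_avg = {m: mean(v) for m, v in by_movie.items()}
    rows = []
    for user, movie, region, rating in ratings:
        rows.append({
            "user": user,
            "movie": movie,
            "region": region,
            "rating": rating,
            # Assumed sign: positive = above the user's own average.
            "user_diff": rating - user_avg[user],
            # Assumed sign: positive = user "overrates" the film.
            "movie_diff": rating - movie_avg[movie],
            # Number of ratings the movie received.
            "freq": len(by_movie[movie]),
        })
    return rows

def percentage_of_consumption(ratings):
    """Share of each user's ratings that fall in each region."""
    totals, counts = defaultdict(int), defaultdict(int)
    for user, _movie, region, _rating in ratings:
        totals[user] += 1
        counts[(user, region)] += 1
    return {key: n / totals[key[0]] for key, n in counts.items()}

rows = derive_columns(ratings)
consumption = percentage_of_consumption(ratings)
```

In this toy example, two of ana's three ratings are European films, so her percentage_of_consumption for Europe is 2/3.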

Methods

The rating prediction will be done using two imputation methods. First, the missForest package will be used for its MissForest algorithm, which iteratively imputes each variable with a Random Forest, in a chained fashion similar to MICE. Second, a MICE imputation with Predictive Mean Matching (PMM) will be used. The imputation will be performed on the 30 most-rated movies, in order to include as many user ratings in the imputation as possible. Before imputation, a portion of those scores will be removed (amputed) using the mice package under a Missing at Random (MAR) amputation strategy. Both imputation methods will be tested on the user_diff and rating metrics.
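
To make the ampute-then-impute procedure concrete, here is a minimal Python sketch on hypothetical toy data. It is not the mice or missForest code used in this report: `ampute_mar` and `pmm_impute` are illustrative names, the missingness probabilities are arbitrary, and the PMM step is simplified to a single predictor with a fixed least-squares fit (fuller implementations typically also perturb the regression coefficients between imputations, which is omitted here).

```python
import random
from statistics import mean

def ampute_mar(rows, target, driver, base_rate=0.3, seed=0):
    """Blank out `target` with a probability that depends only on the
    fully observed `driver` column, i.e. Missing at Random."""
    rng = random.Random(seed)
    cutoff = sorted(r[driver] for r in rows)[len(rows) // 2]
    amputed = []
    for r in rows:
        # Rows with a high driver value are more likely to lose `target`.
        p_missing = base_rate * (1.5 if r[driver] > cutoff else 0.5)
        row = dict(r)
        if rng.random() < p_missing:
            row[target] = None
        amputed.append(row)
    return amputed

def pmm_impute(rows, target, predictor, k=3, seed=0):
    """Simplified predictive mean matching with one predictor."""
    rng = random.Random(seed)
    observed = [r for r in rows if r[target] is not None]
    mx = mean(r[predictor] for r in observed)
    my = mean(r[target] for r in observed)
    sxx = sum((r[predictor] - mx) ** 2 for r in observed)
    sxy = sum((r[predictor] - mx) * (r[target] - my) for r in observed)
    slope = sxy / sxx if sxx else 0.0

    def predict(r):
        return my + slope * (r[predictor] - mx)

    completed = []
    for r in rows:
        row = dict(r)
        if row[target] is None:
            # Donors: the k observed rows whose predicted value is closest.
            donors = sorted(observed,
                            key=lambda d: abs(predict(d) - predict(r)))[:k]
            # Impute a real observed rating borrowed from a random donor.
            row[target] = rng.choice(donors)[target]
        completed.append(row)
    return completed

# Hypothetical ratings of two popular films by 40 users.
gen = random.Random(1)
toy = []
for _ in range(40):
    a = gen.randint(1, 10)
    toy.append({"film_a": a,
                "film_b": min(10, max(1, a + gen.choice([-1, 0, 1])))})

amputed = ampute_mar(toy, target="film_b", driver="film_a")
imputed = pmm_impute(amputed, target="film_b", predictor="film_a")
```

Because PMM copies values from donors, every imputed rating is a value that some user actually gave, which keeps imputations on the original rating scale.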

For the foreign film viewing analysis, a variety of visualization and statistical inference methods will be used. For statistical inference, the distributions of the user_diff and avg_rating features will be analyzed using Analysis of Variance (ANOVA), the Shapiro test for normality, and the Wilcoxon test. The ANOVA and Wilcoxon tests both compare the distributions of data subsets grouped by varying parameters; however, the ANOVA test assumes normality, while the Wilcoxon test does not. The Wilcoxon test is therefore useful for non-Gaussian distributions as well as for groups of unequal size.

The visualizations will analyze the language, genre, and regional breakdown of movie viewing habits. The main aim of the visualizations is to uncover trends among Letterboxd users.

Results

Imputation Methods

Results of Imputation
	Metric	Method	MAE result	NRMSE result
V1	avg_rating	pmm	0.91	0.601
V2	avg_rating	MissForest	0.866	0.601
V3	user_diff	pmm	0.0869	0.024
V4	user_diff	MissForest	0.0869	0.0224

Generally, the MissForest imputation matched or improved on PMM: its MAE was lower for avg_rating (0.866 vs. 0.91) and its NRMSE was lower for user_diff (0.0224 vs. 0.024), while the remaining values were equal. The imputation results for the user_diff metric have a much lower NRMSE than the avg_rating results. This may be due to the fact that only the most popular films were used, and as a result, all the scores tended to be higher than the average score. This would have a greater effect on user_diff, as its values would rarely be negative. With that caveat, these results suggest that imputation on user_diff is a far superior method to imputation on avg_rating.
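
For reference, the two error measures can be computed as in the Python sketch below. The report does not state which NRMSE normalization was used; this sketch assumes the root mean squared error divided by the sample standard deviation of the true values, and the function names and toy numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def mae(true_vals, imputed_vals):
    """Mean absolute error over the amputed entries."""
    return mean(abs(t - i) for t, i in zip(true_vals, imputed_vals))

def nrmse(true_vals, imputed_vals):
    """RMSE normalized by the sample variance of the true values
    (an assumed convention, not confirmed by the report)."""
    mse = mean((t - i) ** 2 for t, i in zip(true_vals, imputed_vals))
    return sqrt(mse / variance(true_vals))

# Hypothetical true ratings and their imputed counterparts.
true_vals = [2, 4, 6, 8]
imputed_vals = [2, 5, 6, 7]
example_mae = mae(true_vals, imputed_vals)
example_nrmse = nrmse(true_vals, imputed_vals)
```

Under this convention, an NRMSE near 0 indicates near-perfect imputation, while a value near 1 is roughly what imputing the column mean everywhere would achieve.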

Statistical Inference with Varying Parameters

This section deals exclusively with non-English and non-North American films, unless stated otherwise.

First, several ANOVA models were fit and compared by AICc to test the efficacy of this method on the data. The feature being predicted was user_diff, and a variety of parameters were used, such as genre, language, region, and number of reviews. Below is a comparison of those models.

## 
## Model selection based on AICc:
## 
##        K     AICc Delta_AICc AICcWt Cum.Wt       LL
## aov_5 88 11670.15       0.00      1      1 -5745.57
## aov_4 27 11704.21      34.06      0      1 -5824.96
## aov_6 21 11767.29      97.14      0      1 -5862.56
## aov_7 21 11767.29      97.14      0      1 -5862.56
## aov_3 15 12482.01     811.86      0      1 -6225.96
## aov_2  9 12482.11     811.96      0      1 -6232.04
## aov_1  3 12502.61     832.46      0      1 -6248.30
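
The Delta_AICc and AICcWt columns in this table follow directly from the AICc values. The Python sketch below reproduces them; the dictionary and function names are illustrative only. The uncorrected AIC is 2K - 2LL, and AICc adds the small-sample term 2K(K+1)/(n-K-1), which cannot be recomputed here because the sample size n is not reported.

```python
from math import exp

# AICc values copied from the model selection table.
aicc = {
    "aov_5": 11670.15, "aov_4": 11704.21, "aov_6": 11767.29,
    "aov_7": 11767.29, "aov_3": 12482.01, "aov_2": 12482.11,
    "aov_1": 12502.61,
}

def aic(log_likelihood, k):
    """Uncorrected AIC from the log-likelihood and parameter count K."""
    return 2 * k - 2 * log_likelihood

def akaike_weights(scores):
    """Return (delta, weight) dictionaries for a set of AICc scores."""
    best = min(scores.values())
    delta = {m: s - best for m, s in scores.items()}
    # Relative likelihood of each model given the best one.
    rel = {m: exp(-d / 2) for m, d in delta.items()}
    total = sum(rel.values())
    weights = {m: r / total for m, r in rel.items()}
    return delta, weights

delta, weights = akaike_weights(aicc)
```

A gap of 34 AICc points already corresponds to a relative likelihood of about exp(-17), which is why aov_5 carries essentially all of the weight.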

Most of these models appeared to perform poorly, judging by the high AICc values and the strongly negative log-likelihood (LL) values, although both quantities grow with sample size and are most informative when compared between models. To check whether the data met the normality assumption of ANOVA, a Shapiro test was run on various subsets of the average_vals data. Using a significance level of 0.01, these are the results.

As these results show, a majority of the region and genre groups are not normally distributed, with a few notable exceptions (Romance, Mystery, North Africa and West Asia, Sub-Saharan Africa). Languages, on the other hand, are almost all normally distributed, with a few exceptions. Many of these normal results can be attributed to smaller sample sizes, as those groups all tend to be on the lower end of that spectrum, whereas the groups with more representation tend to have a more left-skewed distribution, as demonstrated in the plot below.

As an alternative to the failed ANOVA tests, Wilcoxon rank sum tests were then run on these groups in order to compare their distributions. Wilcoxon tests are less powerful statistically than ANOVA tests, but they work well for non-Gaussian data and unequal sample sizes. A significance level of 0.01 was again used.
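
For readers unfamiliar with the test, the following Python sketch shows a two-sample Wilcoxon rank sum (Mann-Whitney U) test using the large-sample normal approximation with a tie correction. It is a hypothetical illustration rather than the routine used for this report, and statistical packages may additionally apply exact p-values for small samples or a continuity correction. The returned p-value is two-sided, for the null hypothesis that both groups are drawn from the same distribution.

```python
from math import erfc, sqrt

def rank_sum_test(x, y):
    """Return (U statistic for x, two-sided p-value)."""
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        # Tied values share the average of ranks i+1 .. j.
        for pos in range(i, j):
            ranks[pos] = (i + 1 + j) / 2
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined)
                     if group == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0
    z = (u - mean_u) / sqrt(var_u)
    # Two-sided tail probability of a standard normal.
    return u, erfc(abs(z) / sqrt(2))

# Hypothetical groups: fully separated vs. identical samples.
u_sep, p_sep = rank_sum_test([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
u_same, p_same = rank_sum_test([1, 2, 3, 4], [1, 2, 3, 4])
```

The two toy calls illustrate the extremes: fully separated samples give a small p-value, while identical samples give a p-value of 1.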

The tests returned significant results when comparing the distribution of European films with those of East and South Asia, with p-values of 0.00082 and 0.00038, respectively. There were several notable results among the genre groups: Horror, Family, and Science Fiction movies each had a p-value < 0.01 against every other genre, with the exception of each other. Furthermore, Drama, the most popular genre among Letterboxd users, has a similar distribution to Action, Adventure, Animation, and Romance. With regard to language, Bengali returned significant results against Arabic, Danish, and German; while most of those languages have a normal distribution in the dataset, German does not.

Visualizations

Firstly, here are some simple plots of regional and language statistics:

[Figures: movie count by language; rating count by language; avg rating by language]

While Spanish-language movies are the most frequent in the database, they have one of the lowest average ratings, just below Italian, which had the third-highest number of ratings. In contrast, although Chinese- and Korean-language movies have relatively low rating counts, they have some of the highest average ratings. From the beginning, the trend of East Asian films being rated higher and consumed less than European movies begins to emerge.

[Figures: movie count by region; rating count by region; avg rating by language]

Overall, European movies are far more consumed and have far more ratings than non-European movies, maintaining that plurality even when accounting for the number of films. However, East Asian movies are rated significantly higher than films from other regions. Digging further, movies from Sub-Saharan Africa and SWANA (South West Asia, North Africa) tend to be the least common in the film selection and to have the lowest average scores, indicating a lack of popularity from both a critical and a demographic perspective.

Plotting popular (>7), well known (>100 reviews) films from each of the 4 most common languages

As assumed, English is dominant among the film selections of Letterboxd users. Furthermore, the French and Japanese films have a similar distribution of average ratings and number of scores, as well as a similar quantity. However, there are disproportionately fewer highly rated Spanish-language films than in the other languages. This can be seen clearly in the following violin plot, with the width denoting the density of ratings:

[Figure: violin plot]

In the Spanish-language movies, there is a large density of movie ratings in the 5-6 range, as opposed to the rest of the movies, which tend to be concentrated at approximately 7. This corresponds to the earlier trends for Spanish-language movies, indicating that perhaps Letterboxd users do not rate Spanish-language films highly, despite rating them more often than most other language groups. This can be clarified through further examination of regional trends, as Spanish-language films typically originate from North America (which is not included in these visualizations), Latin America, or Europe.

[Figure: user diff by consumption]

The inclusion criterion for these plots is at least 4 reviews from a given region. As demonstrated by the Wilcoxon tests, the distributions of user_diff for European, South Asian, and East Asian (EA) movies are very similar. However, as shown by this graph, the percentage_of_consumption is much higher for European films than for the other two. Furthermore, even though EA is the second most common non-English movie region and the highest-rated group of films, its viewership is heavily concentrated toward the low end, with a mean below 10%, as shown below in contrast with Europe’s mean of approximately 25%. This distribution indicates that despite EA films’ high ratings and relatively high consumption, the majority of users do not watch these movies nearly as frequently as European-made films:

[Figures: Asian distribution; Europe distribution]

Furthermore, East Asian and Latin American movies have the highest average user_diff, with European and South Asian movies hovering around 0, and SWANA and Sub-Saharan African movies below average. South Asian and SWANA movies also show a distinctive negative correlation between percentage_of_consumption and user_diff, indicating that the more users consume these films relative to their regular viewing, the lower they rate them.
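
The relationship described here can be checked per region by aggregating each user's ratings and correlating the two metrics. The Python sketch below does this on hypothetical toy data; the function names are illustrative, and the use of the Pearson coefficient is an assumption, since the report does not name the correlation measure. The threshold of 4 ratings mirrors the inclusion criterion of the plots.

```python
from collections import defaultdict
from math import sqrt

def user_region_points(ratings, min_ratings=4):
    """ratings: iterable of (user, region, user_diff).
    Returns {region: [(percentage_of_consumption, avg_user_diff)]},
    one point per user with at least `min_ratings` in that region."""
    totals = defaultdict(int)
    diffs = defaultdict(list)
    for user, region, user_diff in ratings:
        totals[user] += 1
        diffs[(user, region)].append(user_diff)
    points = defaultdict(list)
    for (user, region), vals in diffs.items():
        if len(vals) >= min_ratings:
            points[region].append((len(vals) / totals[user],
                                   sum(vals) / len(vals)))
    return points

def pearson(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)

# Hypothetical users: (name, SWANA ratings, Europe ratings, SWANA user_diff).
toy = []
for user, n_swana, n_europe, swana_diff in [
    ("u1", 4, 12, 0.5), ("u2", 6, 10, 0.0),
    ("u3", 8, 8, -0.5), ("u4", 2, 14, 1.0),
]:
    toy += [(user, "SWANA", swana_diff)] * n_swana
    toy += [(user, "Europe", 0.0)] * n_europe

points = user_region_points(toy)
swana_r = pearson(points["SWANA"])
```

In the toy data, user u4 is dropped from the SWANA group for having only 2 ratings there, and the remaining three users produce a perfectly negative correlation.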

Overall, this plot further reinforces the bias against SWANA and Sub-Saharan African films by Letterboxd users, as both are near zero in user consumption and negative relative to users’ typical scores. However, this trend disappears when controlling for Dramas, where the average user_diff for all regions shoots up by about half a rating point universally, with similar distributions otherwise. All films have an avg_userdiff score above 0. Comedy films cause a drop in all means, with the exception of EA films, which received a significant increase. Finally, Action movies had a negative mean avg_userdiff score universally, with the exception of Sub-Saharan Africa. Action movies also show very similar distributions of both parameters for European, East Asian, and South Asian films, essentially eliminating the higher percentage_of_consumption values seen for European films in the previously tested subgroups. In all three plots, the distribution shape of the points is generally the same, both in avg_userdiff and percentage_of_consumption.

[Figures: drama plot; comedy plot; action plot]

Conclusion

The imputation testing found that imputing the user_diff column was highly effective, with an NRMSE of only 0.0224 as well as a very low MAE. This suggests that the method is effective at predicting user interests with regard to film. While there were some homogeneity issues with regard to the film selection, the significance of an error value that low should not be understated.

With regard to foreign films, despite filtering out all English-language and North American films, there was still a strong preference towards European films. East Asian films are secondarily preferred and, despite receiving on average the highest critical praise, elicit a trend of “novelty viewing”, where Letterboxd users do not consume that media regularly but still rate it higher than their regular consumption. Additionally, Letterboxd users consistently display bias against Sub-Saharan African and SWANA films, as shown in their low percentage of average user consumption, as well as consistently low and/or negative ratings even when controlling for most genres. South Asian films tend to fluctuate greatly; while they do have similar distributions to East Asian and European films (as indicated by the Wilcoxon tests), they tend to be slanted towards less user enjoyment, and share a negative correlation between consumption percentage and average user enjoyment with SWANA films.

Finally, Latin American films tended to have positive avg_userdiff scores overall, despite a plurality of their ratings being clumped between 5 and 7 rating points and a low concentration of Latin American films with both a rating greater than 7 and over 100 ratings. This trend persists across movie genres, and may indicate that a large number of Latin American films have only a handful of medium ratings, whereas higher-rated films may tend to have fewer than 100 ratings apiece. Alternatively, this may indicate that Latin American films tend to be lower rated because the users who consume those films tend to have low average movie_diff scores, indicating underrating of films.

The clear bias against Sub-Saharan African and SWANA films, the fluctuations in the consumption and enjoyment of South Asian films, and the unclear distribution of Latin American films will guide future work in this field. These findings suggest that further research is needed to better understand the global preferences and biases in film consumption, and hopefully to enable the worldwide film community to enrich itself in each other’s cultures through this medium.