Two datasets exported from Letterboxd were loaded into R: a log of every movie I have watched with my rating of each (on a scale of 0-5, with 5 the highest), and a list of my 194 favorite movies ranked from most to least favorite. The structure of each dataset was also examined.
The “toplist” and “ratings” datasets were combined with a left join, and only four columns were kept: position (my ranking within the top 194), name, release year, and my rating (0-5).
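The join itself is not shown in this report, so here is a minimal sketch of what it might look like. The toy data frames, their values, and the choice to join on `Name` alone are all assumptions. Joining on `Name` alone would be consistent with the `Year.x` column used in the plots, since dplyr adds `.x` and `.y` suffixes to same-named columns that are not join keys.

```r
library(dplyr)

# Toy stand-ins for the two Letterboxd exports (hypothetical values)
toplist <- data.frame(
  Position = 1:3,
  Name = c("Film A", "Film B", "Film C"),
  Year = c(1999, 2014, 2007)
)
ratings <- data.frame(
  Name = c("Film A", "Film B", "Film C", "Film D"),
  Year = c(1999, 2014, 2007, 1984),
  Rating = c(5, 4.5, 4, 2)
)

# Keep every ranked film and attach its rating. Year exists in both
# tables and is not a join key, so it comes back as Year.x and Year.y.
full <- toplist %>%
  left_join(ratings, by = "Name") %>%
  select(Position, Name, Year.x, Rating)

print(full)
```

One caveat with this sketch: two different films that share a title (a remake, for example) would be matched to each other, so joining on both name and year may be safer if both columns are reliable.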
Overall, my favorite movies tend to be recent (released after 2000), and higher-ranked movies tend to be more recent as well.
#plot the number of movies in each roughly 5-year increment from 1970-present
ggplot(full, aes(x = Year.x)) +
  geom_histogram(bins = 11, color = "red", fill = "dark blue") +
  theme_bw() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  labs(x = "Year", y = "Number of movies")
#plot the relationship between year movie was made and my ranking
ggplot(full, aes(x = Year.x, y = Position)) +
  geom_point(color = "dark red") +
  geom_smooth(method = "lm") +
  theme_bw() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  labs(x = "Year", y = "Ranking (from 1-194)")
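One way to put a number on the trend in the scatterplot above is a rank correlation between release year and position. This is only a sketch on made-up data: `full_toy` and its values are hypothetical, and Spearman's method is my choice here (it suits `Position`, which is an ordinal ranking), not something the original analysis used.

```r
# Toy stand-in for `full` (hypothetical values)
full_toy <- data.frame(
  Position = 1:8,
  Year.x = c(2017, 2006, 2015, 1994, 2004, 1983, 1999, 1976)
)

# Spearman's rank correlation between release year and ranking
rho <- cor(full_toy$Year.x, full_toy$Position, method = "spearman")
print(rho)
```

Because position 1 is the favorite, a negative value means that more recent films tend to sit nearer the top of the list.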
Average ratings by release year are calculated for the top movies and displayed in a table. Two visualizations then show my average rating for movies released in each year.
#calculate and show the average rating and number of movies for each year using only years that have movies on the list
avg_year <- full %>%
  group_by(Year.x) %>%
  summarise(avg_rating = mean(Rating), count = n())
kable(avg_year, align = "ccc")
| Year.x | avg_rating | count |
|---|---|---|
| 1975 | 4.000000 | 1 |
| 1976 | 4.000000 | 1 |
| 1977 | 4.000000 | 1 |
| 1979 | 4.000000 | 1 |
| 1980 | 4.333333 | 3 |
| 1981 | 4.000000 | 1 |
| 1983 | 4.500000 | 2 |
| 1984 | 3.750000 | 2 |
| 1985 | 4.000000 | 1 |
| 1986 | 3.833333 | 3 |
| 1987 | 3.833333 | 3 |
| 1988 | 4.000000 | 2 |
| 1989 | 4.000000 | 3 |
| 1990 | 4.000000 | 3 |
| 1991 | 4.000000 | 3 |
| 1992 | 3.750000 | 2 |
| 1993 | 4.100000 | 5 |
| 1994 | 4.250000 | 4 |
| 1995 | 4.000000 | 5 |
| 1996 | 4.166667 | 3 |
| 1997 | 4.083333 | 6 |
| 1998 | 4.000000 | 1 |
| 1999 | 4.071429 | 7 |
| 2000 | 3.833333 | 3 |
| 2001 | 4.000000 | 2 |
| 2002 | 3.833333 | 3 |
| 2003 | 4.000000 | 6 |
| 2004 | 4.200000 | 5 |
| 2005 | 3.916667 | 6 |
| 2006 | 4.625000 | 4 |
| 2007 | 4.166667 | 6 |
| 2008 | 4.000000 | 4 |
| 2009 | 4.000000 | 3 |
| 2010 | 3.916667 | 6 |
| 2011 | 3.750000 | 2 |
| 2012 | 3.916667 | 6 |
| 2013 | 4.055556 | 9 |
| 2014 | 3.958333 | 12 |
| 2015 | 4.200000 | 10 |
| 2016 | 3.958333 | 12 |
| 2017 | 4.041667 | 12 |
| 2018 | 4.000000 | 6 |
| 2019 | 3.958333 | 12 |
| 2020 | 4.500000 | 1 |
| 2021 | 3.500000 | 1 |
#Show a visualization of the average rating per year for movies in the top 194
ggplot(avg_year, aes(x = Year.x, y = avg_rating)) +
  geom_line(color = "dark blue") +
  theme_bw() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  labs(x = "Year", y = "Average rating")
#Show a different visualization of the average rating per year for movies in the top 194 as a scatterplot with the count for each year included
ggplot(avg_year, aes(x = Year.x, y = avg_rating, color = count)) +
  geom_point() +
  scale_colour_gradient(low = "orange1", high = "dark red") +
  theme_bw() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  labs(x = "Year", y = "Average rating")
As of this report, I have watched 793 movies, and the average rating by release year was calculated across all of them. The number of films watched from each release year is then shown as a scatterplot. Most of the films I have watched are recent: 480 of the 793 (about 61%) were released in 2001 or later.
#calculate and list average rating and number of movies per year for all movies watched
avg_year_all <- ratings %>%
  group_by(Year) %>%
  summarise(avg_rating = mean(Rating), count = n())
kable(avg_year_all, align = "ccc")
| Year | avg_rating | count |
|---|---|---|
| 1934 | 1.000000 | 1 |
| 1939 | 1.000000 | 1 |
| 1946 | 1.500000 | 1 |
| 1971 | 0.500000 | 2 |
| 1972 | 3.500000 | 1 |
| 1973 | 2.000000 | 2 |
| 1974 | 2.500000 | 5 |
| 1975 | 3.666667 | 3 |
| 1976 | 3.500000 | 2 |
| 1977 | 3.250000 | 2 |
| 1978 | 2.000000 | 3 |
| 1979 | 2.600000 | 5 |
| 1980 | 2.833333 | 6 |
| 1981 | 2.600000 | 5 |
| 1982 | 1.857143 | 7 |
| 1983 | 3.071429 | 7 |
| 1984 | 2.176471 | 17 |
| 1985 | 2.375000 | 12 |
| 1986 | 2.545454 | 11 |
| 1987 | 2.500000 | 10 |
| 1988 | 2.357143 | 14 |
| 1989 | 2.425000 | 20 |
| 1990 | 2.138889 | 18 |
| 1991 | 2.818182 | 11 |
| 1992 | 2.312500 | 8 |
| 1993 | 2.809524 | 21 |
| 1994 | 3.055556 | 9 |
| 1995 | 2.900000 | 15 |
| 1996 | 2.692308 | 13 |
| 1997 | 2.617647 | 17 |
| 1998 | 2.404762 | 21 |
| 1999 | 2.666667 | 27 |
| 2000 | 2.375000 | 16 |
| 2001 | 1.882353 | 17 |
| 2002 | 2.735294 | 17 |
| 2003 | 2.545454 | 22 |
| 2004 | 2.717391 | 23 |
| 2005 | 2.800000 | 20 |
| 2006 | 2.480769 | 26 |
| 2007 | 3.195652 | 23 |
| 2008 | 2.384615 | 26 |
| 2009 | 2.583333 | 24 |
| 2010 | 2.611111 | 18 |
| 2011 | 2.404762 | 21 |
| 2012 | 2.928571 | 21 |
| 2013 | 2.712121 | 33 |
| 2014 | 3.282609 | 23 |
| 2015 | 3.192308 | 26 |
| 2016 | 2.864865 | 37 |
| 2017 | 2.821429 | 42 |
| 2018 | 3.000000 | 25 |
| 2019 | 3.500000 | 20 |
| 2020 | 2.444444 | 9 |
| 2021 | 2.357143 | 7 |
#plot the number of films watched by release year
ggplot(avg_year_all, aes(x = Year, y = count)) +
  geom_point() +
  theme_bw() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  labs(x = "Year", y = "Number of films watched")
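The share of films from recent years can also be computed directly from the per-year counts instead of being read off the plot. The sketch below uses a hypothetical helper, `share_since()`, and a toy table, `counts_toy`. Neither appears in the original analysis.

```r
# Toy stand-in for avg_year_all (hypothetical values)
counts_toy <- data.frame(
  Year = c(1985, 1995, 2005, 2015),
  count = c(10, 20, 30, 40)
)

# Fraction of watched films released in or after a cutoff year
share_since <- function(df, cutoff) {
  sum(df$count[df$Year >= cutoff]) / sum(df$count)
}

recent_share <- share_since(counts_toy, 2001)
print(recent_share)
```

Calling the same helper on `avg_year_all` with a cutoff of 2001 would give the proportion behind the "last 20 years" statement.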
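Linear regression was used to assess the relationship between release year and my average rating for that year. The relationship is positive and highly significant (slope = 0.0157 rating points per year, or about 0.16 per decade; p = 0.0001; R² = 0.25), so my ratings tend to be higher for more recent films. Because the model is fit to yearly averages, each year carries equal weight regardless of how many films it contains.
#run a linear regression of average rating (0-5) on release year, using one row per year for all films watched
z <- lm(avg_rating ~ Year, avg_year_all)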
summary(z)
##
## Call:
## lm(formula = avg_rating ~ Year, data = avg_year_all)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.71275 -0.27130 -0.03269 0.25944 1.39124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -28.66927 7.47306 -3.836 0.000340 ***
## Year 0.01567 0.00375 4.178 0.000113 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5318 on 52 degrees of freedom
## Multiple R-squared: 0.2514, Adjusted R-squared: 0.237
## F-statistic: 17.46 on 1 and 52 DF, p-value: 0.0001125
#plot the relationship between average rating and year for all movies watched
#color is mapped only on the points so the fitted line does not inherit (and drop) the count aesthetic
ggplot(avg_year_all, aes(x = Year, y = avg_rating)) +
  geom_point(aes(color = count)) +
  scale_colour_gradient(low = "orange1", high = "dark red") +
  geom_smooth(method = "lm") +
  theme_bw() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  labs(x = "Year", y = "Average rating")
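The regression above treats a year with one film the same as a year with forty. A possible robustness check, sketched here on made-up data, is to weight each year by its film count using the `weights` argument of `lm()`. The `toy` data frame and its values are hypothetical, and this weighted fit is a suggestion, not part of the original analysis.

```r
# Toy stand-in for avg_year_all (hypothetical values)
toy <- data.frame(
  Year = c(1940, 1980, 1990, 2000, 2010, 2020),
  avg_rating = c(1.0, 2.4, 2.5, 2.6, 2.9, 3.0),
  count = c(1, 10, 18, 16, 18, 9)
)

# Unweighted: every year counts equally, as in the model above
fit_unweighted <- lm(avg_rating ~ Year, data = toy)

# Weighted: years with more films contribute more to the fit
fit_weighted <- lm(avg_rating ~ Year, data = toy, weights = count)

# Compare the estimated change in rating per release year
print(coef(fit_unweighted)["Year"])
print(coef(fit_weighted)["Year"])
```

If the two slopes are similar, the trend is not being driven by a handful of sparsely populated early years. If they differ noticeably, the weighted estimate is probably the more trustworthy summary of how individual films were rated.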