Datasets

Two datasets were loaded into R: one with every movie I have watched and my rating of each (on a scale of 0-5, with 5 being the highest), and one with my 194 favorite movies ranked from most to least favorite. The structure of each dataset was also examined. The data were logged on and exported from Letterboxd.

Joining datasets

The “toplist” and “ratings” datasets were left-joined, and only the position (my ranking out of my top 194 movies), name, release year, and my rating (0-5) were kept.
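A minimal sketch of such a join, using toy rows rather than the real Letterboxd exports. The column names `Position`, `Name`, `Year`, and `Rating` and the join key `Name` are assumptions; the `Year.x` column used later in this report suggests that `Year` was present in both tables and was not part of the join key.

```r
library(dplyr)

# Toy stand-ins for the two exports (hypothetical rows and column names)
toplist <- data.frame(
  Position = 1:3,
  Name = c("Movie A", "Movie B", "Movie C"),
  Year = c(1999, 2006, 2015)
)
ratings <- data.frame(
  Name = c("Movie A", "Movie B", "Movie C", "Movie D"),
  Year = c(1999, 2006, 2015, 1984),
  Rating = c(4.5, 5, 4, 2)
)

# A left join keeps every row of toplist. Joining on Name alone leaves Year in
# both tables, so dplyr suffixes them as Year.x (toplist) and Year.y (ratings).
full <- toplist %>%
  left_join(ratings, by = "Name") %>%
  select(Position, Name, Year.x, Rating)
```

Movies that appear only in `ratings` (here "Movie D") are dropped, which is why the joined data has one row per top-list movie.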

Rankings and Years of Movies Watched

Overall, the movies on my top list tend to be more recent (after 2000), and higher-ranked movies tend to be more recent as well.

#plot the number of movies in roughly 5-year bins, from the 1970s to the present
ggplot(full, aes(x = Year.x))+
  geom_histogram(bins = 11, color = "red", fill = "dark blue")+
  theme_bw()+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
  labs(x = "Year", y = "Number of movies")
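With `bins = 11`, ggplot2 chooses the bin edges itself, so the bars need not line up with calendar half-decades. The sketch below shows one way to count movies in explicit 5-year bins using base R. The `years` vector is hypothetical and only stands in for `full$Year.x`.

```r
# Hypothetical release years standing in for full$Year.x
years <- c(1975, 1980, 1983, 1994, 1999, 2004, 2006, 2014, 2015, 2019)

# Bin edges every 5 years; right = FALSE makes bins like [1975,1980)
breaks <- seq(1970, 2025, by = 5)
bins <- cut(years, breaks = breaks, right = FALSE, dig.lab = 4)

# Number of movies in each 5-year bin
bin_counts <- table(bins)
```

In the plot itself, `geom_histogram(binwidth = 5, boundary = 1970)` should produce the same alignment.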

#plot the relationship between year movie was made and my ranking
ggplot(full, aes(x = Year.x, y = Position))+
  geom_point(color = "dark red")+
  geom_smooth(method = "lm")+
  theme_bw()+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
  labs(x = "Year", y = "Ranking (from 1-194)")
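Because position 1 is my favorite movie, a downward-sloping line in this plot means that better-ranked movies are more recent. A rank correlation is one way to put a number on that direction. It is not part of the original analysis, and the sketch below uses made-up positions and years rather than the real `full` data.

```r
# Made-up positions and release years (not the real top list)
toy_full <- data.frame(
  Position = 1:6,
  Year.x = c(2015, 2019, 2006, 1994, 1999, 1980)
)

# Spearman correlation between rank position and release year.
# A negative value means better-ranked (lower-numbered) movies are newer.
rho <- cor(toy_full$Position, toy_full$Year.x, method = "spearman")
```

For these toy values `rho` is about -0.89. On the real data, `cor.test()` with `method = "spearman"` would also report a p-value.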

Average Ratings over the years of favorites

Average ratings by year of top movies are calculated and displayed. Two different visualizations show my average ratings for movies made in different years.

#calculate and show the average rating and number of movies for each year using only years that have movies on the list
avg_year <- full %>%
  group_by(Year.x) %>%
  summarise(avg_rating = mean(Rating), count = n())

kable(avg_year, align = "ccc")

Year.x avg_rating count
1975 4.000000 1
1976 4.000000 1
1977 4.000000 1
1979 4.000000 1
1980 4.333333 3
1981 4.000000 1
1983 4.500000 2
1984 3.750000 2
1985 4.000000 1
1986 3.833333 3
1987 3.833333 3
1988 4.000000 2
1989 4.000000 3
1990 4.000000 3
1991 4.000000 3
1992 3.750000 2
1993 4.100000 5
1994 4.250000 4
1995 4.000000 5
1996 4.166667 3
1997 4.083333 6
1998 4.000000 1
1999 4.071429 7
2000 3.833333 3
2001 4.000000 2
2002 3.833333 3
2003 4.000000 6
2004 4.200000 5
2005 3.916667 6
2006 4.625000 4
2007 4.166667 6
2008 4.000000 4
2009 4.000000 3
2010 3.916667 6
2011 3.750000 2
2012 3.916667 6
2013 4.055556 9
2014 3.958333 12
2015 4.200000 10
2016 3.958333 12
2017 4.041667 12
2018 4.000000 6
2019 3.958333 12
2020 4.500000 1
2021 3.500000 1

#Show a visualization of the average rating per year for movies in the top 194
ggplot(avg_year, aes(x = Year.x, y = avg_rating))+
  geom_line(color = "dark blue")+
  theme_bw()+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
  labs(x = "Year", y = "Average rating")

#Show a different visualization of the average rating per year for movies in the top 194 as a scatterplot with the count for each year included
ggplot(avg_year, aes(x = Year.x, y = avg_rating, color = count))+
  geom_point()+
  scale_colour_gradient(low = "orange1", high = "dark red")+
  theme_bw()+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
  labs(x = "Year", y = "Average rating")

Ratings of all movies per year

As of this report, I have watched 793 movies, and I calculated the average rating by release year for all of them. The number of films watched from each release year is plotted here as a scatterplot, which shows that most of the films I have watched are recent (i.e., made in the last 20 years).
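The two figures in this paragraph can be checked directly from the per-year counts. The sketch below copies the `Year` and `count` columns from the table in this section; treating "the last 20 years" as 2002-2021 is an assumption based on the latest year in the table.

```r
# Years and counts copied from the table of all movies watched
year <- c(1934, 1939, 1946, 1971:2021)
count <- c(1, 1, 1, 2, 1, 2, 5, 3, 2, 2, 3, 5, 6, 5, 7, 7, 17, 12, 11, 10,
           14, 20, 18, 11, 8, 21, 9, 15, 13, 17, 21, 27, 16, 17, 17, 22,
           23, 20, 26, 23, 26, 24, 18, 21, 21, 33, 23, 26, 37, 42, 25, 20,
           9, 7)

# Total number of movies watched
total <- sum(count)

# Movies released in 2002-2021 (assumed meaning of "last 20 years")
recent <- sum(count[year >= 2002])

# Share of all movies watched that are recent
recent_share <- recent / total
```

This gives 463 of 793 movies, or about 58%, from 2002 onward.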

#calculate and list average rating and number of movies per year for all movies watched
avg_year_all <- ratings %>%
  group_by(Year) %>%
  summarise(avg_rating = mean(Rating), count = n())

kable(avg_year_all, align = "ccc")

Year avg_rating count
1934 1.000000 1
1939 1.000000 1
1946 1.500000 1
1971 0.500000 2
1972 3.500000 1
1973 2.000000 2
1974 2.500000 5
1975 3.666667 3
1976 3.500000 2
1977 3.250000 2
1978 2.000000 3
1979 2.600000 5
1980 2.833333 6
1981 2.600000 5
1982 1.857143 7
1983 3.071429 7
1984 2.176471 17
1985 2.375000 12
1986 2.545454 11
1987 2.500000 10
1988 2.357143 14
1989 2.425000 20
1990 2.138889 18
1991 2.818182 11
1992 2.312500 8
1993 2.809524 21
1994 3.055556 9
1995 2.900000 15
1996 2.692308 13
1997 2.617647 17
1998 2.404762 21
1999 2.666667 27
2000 2.375000 16
2001 1.882353 17
2002 2.735294 17
2003 2.545454 22
2004 2.717391 23
2005 2.800000 20
2006 2.480769 26
2007 3.195652 23
2008 2.384615 26
2009 2.583333 24
2010 2.611111 18
2011 2.404762 21
2012 2.928571 21
2013 2.712121 33
2014 3.282609 23
2015 3.192308 26
2016 2.864865 37
2017 2.821429 42
2018 3.000000 25
2019 3.500000 20
2020 2.444444 9
2021 2.357143 7

#plot the number of films watched from each release year
ggplot(avg_year_all, aes(x = Year, y = count))+
  geom_point()+
  theme_bw()+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
  labs(x = "Year", y = "Number of films watched")

Quantifying the relationship between year and rating

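Linear regression was used to assess the relationship between the year a movie was made and my average rating for that year. A highly significant positive relationship was found (slope = 0.016 rating points per year, p = 0.0001, R² = 0.25), showing that my average ratings are higher for more recent films.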

#run a linear regression of the average rating (0-5) on the year the movie was made, for all films watched
z <- lm(avg_rating ~ Year, avg_year_all)
summary(z)
## 
## Call:
## lm(formula = avg_rating ~ Year, data = avg_year_all)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.71275 -0.27130 -0.03269  0.25944  1.39124 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -28.66927    7.47306  -3.836 0.000340 ***
## Year          0.01567    0.00375   4.178 0.000113 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5318 on 52 degrees of freedom
## Multiple R-squared:  0.2514, Adjusted R-squared:  0.237 
## F-statistic: 17.46 on 1 and 52 DF,  p-value: 0.0001125
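
To make the size of the effect easier to read, the fitted coefficients can be converted into a change per decade and into fitted values for specific years. The sketch below copies the rounded estimates from the summary output above, so the results are approximate.

```r
# Coefficients copied (rounded) from the summary output above
intercept <- -28.66927
slope <- 0.01567

# Expected change in average rating over ten years
per_decade <- slope * 10

# Fitted average rating for films made in 1980 and in 2020
fitted_1980 <- intercept + slope * 1980
fitted_2020 <- intercept + slope * 2020
```

This works out to roughly 0.16 rating points per decade, with fitted averages of about 2.36 for 1980 and 2.98 for 2020.
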
#plot the relationship between average rating and year for all movies watched
ggplot(avg_year_all, aes(x = Year, y = avg_rating, color = count))+
  geom_point()+
  scale_colour_gradient(low = "orange1", high = "dark red")+
  geom_smooth(method = "lm")+
  theme_bw()+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
  labs(x = "Year", y = "Average rating")
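
The regression above is fit to yearly averages, so a year with one film carries as much weight as a year with 42. One common alternative, which this report does not use, is to weight each year by its count. The sketch below uses hypothetical per-movie data to show that the weighted fit on yearly averages gives the same coefficients as a fit on the individual ratings.

```r
# Toy per-movie data (hypothetical values standing in for `ratings`)
toy <- data.frame(
  Year = c(1980, 1980, 1995, 2005, 2005, 2005, 2015, 2015),
  Rating = c(2, 3, 2.5, 3, 3.5, 2.5, 4, 3.5)
)

# Per-year averages and counts, mirroring the layout of avg_year_all
by_year <- aggregate(Rating ~ Year, data = toy, FUN = mean)
names(by_year)[2] <- "avg_rating"
by_year$count <- as.vector(table(toy$Year))

# Unweighted fit on yearly averages: every year counts equally
fit_avg <- lm(avg_rating ~ Year, data = by_year)

# Weighted fit: years with more movies get proportionally more influence
fit_wtd <- lm(avg_rating ~ Year, data = by_year, weights = count)

# Fit on the individual movies, for comparison
fit_raw <- lm(Rating ~ Year, data = toy)
```

Comparing `coef(fit_wtd)` with `coef(fit_raw)` shows matching estimates, while `coef(fit_avg)` can differ when counts vary widely across years.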