This report is an analysis on the dataset movielens which can be found in full here. The code, data and a description of the variables used in this report can be found in the original repository
readr::read_csv(here::here("data/movies.csv"),
progress = FALSE,
col_types = cols(
movieId = col_integer(),
title = col_character(),
genres = col_character()
)) %>%
group_by(movieId) %>%
mutate(year = as.numeric(sub("\\).*", "",sub(".*\\(", "", title))),
num_genres = length(as.list(strsplit(genres,'|',fixed = TRUE))[[1]]),
homogeneous = num_genres == 1, # Deriving homogeneity
xx_century = year <= 2000
) %>%
na.omit() %>%
ungroup() -> movies
readr::read_csv(here::here("data/ratings.csv"),
progress = FALSE,
col_types = cols(
userId = col_integer(),
movieId = col_integer(),
rating = col_double(),
timestamp = col_integer()
)) %>%
na.omit() -> ratings
dplyr::inner_join(
movies,
ratings,
by="movieId") -> data
data %>%
group_by(movieId) %>%
summarise(median_rating = median(rating), # Deriving whether a movies is well rated
well_rated = median_rating > 3.5) -> summarised
dplyr::inner_join(
summarised,
data,
by="movieId") -> data
data %>%
glimpse()
## Observations: 99,997
## Variables: 12
## $ movieId <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ median_rating <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ...
## $ well_rated <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, ...
## $ title <chr> "Toy Story (1995)", "Toy Story (1995)", "Toy Sto...
## $ genres <chr> "Adventure|Animation|Children|Comedy|Fantasy", "...
## $ year <dbl> 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, ...
## $ num_genres <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ homogeneous <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ xx_century <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, ...
## $ userId <int> 7, 9, 13, 15, 19, 20, 23, 26, 30, 37, 43, 44, 47...
## $ rating <dbl> 3.0, 4.0, 5.0, 2.0, 3.0, 3.5, 3.0, 5.0, 4.0, 4.0...
## $ timestamp <int> 851866703, 938629179, 1331380058, 997938310, 855...
movies %>%
filter(title == "Hamlet (2000)")
data %>%
filter(movieId != 3598) -> data
data %>%
group_by(movieId) %>%
slice(1) %>%
ggplot(aes(year)) +
geom_bar() +
labs(x="Movie Year",
y="Absolute Frequency")
data %>%
ggplot(aes(num_genres)) +
geom_bar() +
labs(x="Number of Genres",
y="Absolute Frequency")
data %>%
ggplot(aes(rating,y=..prop..)) +
geom_bar() +
labs(x="Movie Rating",
y="Relative Frequency")
Throughout this report we will employ Confidence Intervals of the difference of two distinct values, let’s call them A and B for instance.This gives us three possibilities:
data %>%
ggplot(aes(homogeneous,rating)) +
geom_violin() +
labs(x="Homogeneous in Genre ",
y="Movie Rating")
data %>%
ggplot(aes(homogeneous,year)) +
geom_violin() +
labs(x="Homogeneous in Genre ",
y="Year of the movie")
meu_theta <- function(x, i) {
x %>%
slice(i) %>%
group_by(homogeneous) %>%
summarise(average = mean(rating)) -> y
result <- y[y$homogeneous == TRUE,]$average -
y[y$homogeneous == FALSE,]$average
return(result)
}
mean.boot <- boot(data = data,
statistic = meu_theta,
R = 5000)
plot(mean.boot)
result_mean <- boot.ci(boot.out = mean.boot,
conf = 0.95,
type = "basic")
result_mean
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 5000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = mean.boot, conf = 0.95, type = "basic")
##
## Intervals :
## Level Basic
## 95% (-0.0597, -0.0250 )
## Calculations and Intervals on Original Scale
X2.5. = c(result_mean$basic[4])
X97.5. = c(result_mean$basic[5])
mean.diff = data.frame(X2.5.,X97.5.)
mean.diff %>%
ggplot(aes(x = "Homogeneous - Heterogeneous",ymin = X2.5., ymax = X97.5.)) +
geom_errorbar(width = .2) +
geom_hline(yintercept = 0, colour = "darkorange") +
labs(x="") +
ggtitle("Confidence Interval for difference of means")
There’s evidence at 95% of confidence that heterogeneous movies are significantly better rated than the Homogeneous movies.
The difference in means although significant in statistical terms is very small in practical terms and therefore is not relevant (doesn’t amount to much in terms of magnitude).
Aparently staying faithful to the same concept all the time is counter productive. One could argue that to be truly complete a movie should dab into more than one aspect of the human psyche. This makes sense when we think about Chaplin, whose best movies not only made us laugh but also dared from time to time to make us cry or make us think.
meu_theta <- function(x, i) {
x %>%
slice(i) %>%
group_by(homogeneous) %>%
summarise(deviation = sd(rating)) -> y
result <- y[y$homogeneous == TRUE,]$deviation -
y[y$homogeneous == FALSE,]$deviation
return(result)
}
deviation.boot <- boot(data = data,
statistic = meu_theta,
R = 5000)
plot(deviation.boot)
result_deviation <- boot.ci(boot.out = deviation.boot,
conf = 0.95,
type = "basic")
result_deviation
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 5000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = deviation.boot, conf = 0.95, type = "basic")
##
## Intervals :
## Level Basic
## 95% (-0.0016, 0.0231 )
## Calculations and Intervals on Original Scale
X2.5. = c(result_deviation$basic[4])
X97.5. = c(result_deviation$basic[5])
sd.diff = data.frame(X2.5.,X97.5.)
sd.diff %>%
ggplot(aes(x = "Homogeneous - Heterogeneous",ymin = X2.5., ymax = X97.5.)) +
geom_errorbar(width = .2) +
geom_hline(yintercept = 0, colour = "darkorange") +
labs(title="Confidence Interval for difference of standard deviation", x="")
Across homogeneous and heterogeneous movies we could see the same central tendency around 4 stars. We must assume that this has more to do with the nature of how the people rate (extremely positive and extremely bad rates rare, positive ratings more common), rather than to do with whether a movie is homogeneneous in genre.
data %>%
filter(homogeneous) %>%
mutate(genres = as.factor(genres)) -> data_homogen
data_homogen %>%
ggplot(aes("",rating,
group=well_rated,
color=well_rated)) +
geom_count(position = position_dodge(width = 0.5)) +
coord_flip() +
labs(y="Movie Rating",x="") +
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank()) +
scale_size(range = c(1,15))
data_homogen %>%
ggplot(aes(genres,rating,
group=genres,
color=well_rated)) +
geom_count() +
coord_flip() +
facet_grid(~ well_rated) +
theme(strip.background = element_blank(),
strip.text.x = element_blank()) +
labs(y="Movie Rating", x="Movie Genre")
data %>%
group_by(movieId) %>%
slice(1) %>%
ggplot(aes(x=well_rated,
fill=xx_century)) +
geom_bar() +
labs(x="Is Well Rated",
y="Absolute Frequency")
meu_theta <- function(x, i) {
x %>%
slice(i) %>%
filter(well_rated) -> d1
x %>%
slice(i) %>%
filter(!well_rated) -> d2
fit1 = lm(rating ~ genres, data=d1)
fit2 = lm(rating ~ genres, data=d2)
return(summary(fit1)$r.square - summary(fit2)$r.square)
}
r2.boot <- boot(data = data_homogen,
statistic = meu_theta,
R = 5000)
plot(r2.boot)
result_r2 <- boot.ci(boot.out = r2.boot,
conf = 0.95,
type = "basic")
result_r2
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 5000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = r2.boot, conf = 0.95, type = "basic")
##
## Intervals :
## Level Basic
## 95% (-0.0113, 0.0012 )
## Calculations and Intervals on Original Scale
X2.5. = c(result_r2$basic[4])
X97.5. = c(result_r2$basic[5])
r2.diff = data.frame(X2.5.,X97.5.)
r2.diff %>%
ggplot(aes(x = "(Well Rated) - (NOT Well Rated)",ymin = X2.5., ymax = X97.5.)) +
geom_errorbar(width = .2) +
geom_hline(yintercept = 0, colour = "darkorange") +
labs(x="") +
ggtitle("Confidence Interval for Impact of Genre on Ratings")
As one would expect whether the movie is objectively bad or not should not have to do with the movie genre unless people are biased towards a certain genre. It seems that people’s bias towards certain genres didn’t play much of a role, a happy surprise to be honest.
meu_theta <- function(x, i) {
x %>%
slice(i) -> d
d %>%
group_by(xx_century) %>%
filter(well_rated) %>%
summarise(n = n()) -> y
d %>%
group_by(xx_century) %>%
summarise(n = n()) -> total
prop1 <- y[y$xx_century == TRUE,]$n / total[total$xx_century == TRUE,]$n
prop2 <- y[y$xx_century == FALSE,]$n / total[total$xx_century == FALSE,]$n
return(prop1 - prop2)
}
proport.boot <- boot(data = data,
statistic = meu_theta,
R = 5000)
plot(proport.boot)
result_proport <- boot.ci(boot.out = proport.boot,
conf = 0.95,
type = "basic")
result_proport
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 5000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = proport.boot, conf = 0.95, type = "basic")
##
## Intervals :
## Level Basic
## 95% ( 0.0369, 0.0510 )
## Calculations and Intervals on Original Scale
X2.5. = c(result_proport$basic[4])
X97.5. = c(result_proport$basic[5])
prop.diff = data.frame(X2.5.,X97.5.)
prop.diff %>%
ggplot(aes(x = "(XX century) - (XXI century)",ymin = X2.5., ymax = X97.5.)) +
geom_errorbar(width = .2) +
geom_hline(yintercept = 0, colour = "darkorange") +
labs(x="") +
ggtitle("Confidence Interval for proportion of well rated")
There’s significant evidence at 95% of confidence, which indicates that movies from the XX century have a higher proportion of well rated movies in terms of inference as well.
The difference in proportions although significant in statistical terms is very small in practical terms and therefore is not relevant (doesn’t amount to much in terms of magnitude).
The XX century has had its fair share of awful movies, however it’s natural that most of the forgettable movies don’t make their way into our minds and datasets. The bad movies from the XXI century are still fresh on our minds and our datasets. For this reason it’s understandable that there are less well rated movies in the group of the XXI century.