The MovieLens data set was collected by GroupLens Research. Can we predict movie ratings based on user preferance, age of a movie? Using the MovieLens data set and penalized least squares, the following R script calculates the RMSE based on user ratings, movieId and the age of the movie.
The MovieLens data set contains 10000054 rows, 10677 movies, 797 genres and 69878 users.
The steps performed for analysis of the data - Created an age of movie column - Graphic displays of movie, users and ratings in order to find a pattern or insight to the
behavior of the data. - Explored Genres to determine if ratings could be predicted by genre. - Explored the Coefficient of Determination R-Squared - Graphically explored the linear correlation coefficient, r-value - Calculate RMSE based on movieId, userId, and age of the movie.
After exploring the movies through graphical representations and calculating RMSE, I found the best predictor for ratings was movieId, userId. The age of the movie didn’t change the rmse.
The final RMSE is 0.8252
The following are the libraries I used to explore the data. Explorations that didn’t seem to lead to an insight were taken out of the script.
dl <- tempfile()
download.file("http://files.grouplens.org/datasets/movielens/ml-10m.zip", dl)
ratings <- read.table(text = gsub("::", "\t", readLines(unzip(dl, "ml-10M100K/ratings.dat"))),
col.names = c("userId", "movieId", "rating", "timestamp"))
movies <- str_split_fixed(readLines(unzip(dl, "ml-10M100K/movies.dat")), "\\::", 3)
colnames(movies) <- c("movieId", "title", "genres")
movies <- as.data.frame(movies) %>% mutate(movieId = as.numeric(levels(movieId))[movieId],
title = as.character(title),
genres = as.character(genres))
#Explore the size of the data set
movielens <- left_join(ratings, movies, by = "movieId")
nrow(movielens)
## [1] 10000054
n_distinct(movielens$movieId)
## [1] 10677
n_distinct(movielens$genres)
## [1] 797
n_distinct(movielens$userId)
## [1] 69878
#Validation set will be 10% of the movieLens data
set.seed(1)
test_index <- createDataPartition(y = movielens$rating, times = 1, p = 0.1, list = FALSE)
edx <- movielens[-test_index,]
temp <- movielens[test_index,]
validation <- temp %>%
semi_join(edx, by = "movieId") %>%
semi_join(edx, by = "userId")
removed <- anti_join(temp, validation)
## Joining, by = c("userId", "movieId", "rating", "timestamp", "title", "genres")
edx <- rbind(edx, removed)
##Data Cleaning and, data exploration and Data Visulization #In order to determine if age of the movie is a factor for predicting rating, I extracted the premier date of the movie, and then calculated the age of the movie. I will also looked at individual genres for genre effect, as well as, effects of user ratings.
head(edx)
## userId movieId rating timestamp title
## 1 1 122 5 838985046 Boomerang (1992)
## 2 1 185 5 838983525 Net, The (1995)
## 4 1 292 5 838983421 Outbreak (1995)
## 5 1 316 5 838983392 Stargate (1994)
## 6 1 329 5 838983392 Star Trek: Generations (1994)
## 7 1 355 5 838984474 Flintstones, The (1994)
## genres
## 1 Comedy|Romance
## 2 Action|Crime|Thriller
## 4 Action|Drama|Sci-Fi|Thriller
## 5 Action|Adventure|Sci-Fi
## 6 Action|Adventure|Drama|Sci-Fi
## 7 Children|Comedy|Fantasy
glimpse(edx)
## Observations: 9,000,055
## Variables: 6
## $ userId <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ movieId <dbl> 122, 185, 292, 316, 329, 355, 356, 362, 364, 370, 37...
## $ rating <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ timestamp <int> 838985046, 838983525, 838983421, 838983392, 83898339...
## $ title <chr> "Boomerang (1992)", "Net, The (1995)", "Outbreak (19...
## $ genres <chr> "Comedy|Romance", "Action|Crime|Thriller", "Action|D...
#How many distinct movie, users and genres
n_distinct(edx$movieId)
## [1] 10677
n_distinct(edx$genres)
## [1] 797
n_distinct(edx$userId)
## [1] 69878
nrow(edx)
## [1] 9000055
#Convert Timestamp to year
edx <- mutate(edx, year_rated = year(as_datetime(timestamp)))
head(edx)
## userId movieId rating timestamp title
## 1 1 122 5 838985046 Boomerang (1992)
## 2 1 185 5 838983525 Net, The (1995)
## 3 1 292 5 838983421 Outbreak (1995)
## 4 1 316 5 838983392 Stargate (1994)
## 5 1 329 5 838983392 Star Trek: Generations (1994)
## 6 1 355 5 838984474 Flintstones, The (1994)
## genres year_rated
## 1 Comedy|Romance 1996
## 2 Action|Crime|Thriller 1996
## 3 Action|Drama|Sci-Fi|Thriller 1996
## 4 Action|Adventure|Sci-Fi 1996
## 5 Action|Adventure|Drama|Sci-Fi 1996
## 6 Children|Comedy|Fantasy 1996
#extracting the premier date
premier <- stringi::stri_extract(edx$title, regex = "(\\d{4})", comments = TRUE ) %>% as.numeric()
#Add the premier date
edx_with_title_dates <- edx %>% mutate(premier_date = premier)
head(edx_with_title_dates)
## userId movieId rating timestamp title
## 1 1 122 5 838985046 Boomerang (1992)
## 2 1 185 5 838983525 Net, The (1995)
## 3 1 292 5 838983421 Outbreak (1995)
## 4 1 316 5 838983392 Stargate (1994)
## 5 1 329 5 838983392 Star Trek: Generations (1994)
## 6 1 355 5 838984474 Flintstones, The (1994)
## genres year_rated premier_date
## 1 Comedy|Romance 1996 1992
## 2 Action|Crime|Thriller 1996 1995
## 3 Action|Drama|Sci-Fi|Thriller 1996 1995
## 4 Action|Adventure|Sci-Fi 1996 1994
## 5 Action|Adventure|Drama|Sci-Fi 1996 1994
## 6 Children|Comedy|Fantasy 1996 1994
#After extracting the premier date from the title, check for accuracy
#drop the timestamp
edx_with_title_dates <- edx_with_title_dates %>% select(-timestamp)
head(edx_with_title_dates)
## userId movieId rating title
## 1 1 122 5 Boomerang (1992)
## 2 1 185 5 Net, The (1995)
## 3 1 292 5 Outbreak (1995)
## 4 1 316 5 Stargate (1994)
## 5 1 329 5 Star Trek: Generations (1994)
## 6 1 355 5 Flintstones, The (1994)
## genres year_rated premier_date
## 1 Comedy|Romance 1996 1992
## 2 Action|Crime|Thriller 1996 1995
## 3 Action|Drama|Sci-Fi|Thriller 1996 1995
## 4 Action|Adventure|Sci-Fi 1996 1994
## 5 Action|Adventure|Drama|Sci-Fi 1996 1994
## 6 Children|Comedy|Fantasy 1996 1994
#looking at the dates - are they correct?
edx_with_title_dates %>% filter(premier_date > 2018) %>% group_by(movieId, title, premier_date) %>% summarize(n = n())
## # A tibble: 6 x 4
## # Groups: movieId, title [?]
## movieId title premier_date n
## <dbl> <chr> <dbl> <int>
## 1 671 Mystery Science Theater 3000: The Movie (1996) 3000 3280
## 2 2308 Detroit 9000 (1973) 9000 22
## 3 4159 3000 Miles to Graceland (2001) 3000 714
## 4 5310 Transylvania 6-5000 (1985) 5000 195
## 5 8864 Mr. 3000 (2004) 3000 146
## 6 27266 2046 (2004) 2046 426
edx_with_title_dates %>% filter(premier_date < 1900) %>% group_by(movieId, title, premier_date) %>% summarize(n = n())
## # A tibble: 8 x 4
## # Groups: movieId, title [?]
## movieId title premier_date n
## <dbl> <chr> <dbl> <int>
## 1 1422 Murder at 1600 (1997) 1600 1566
## 2 4311 Bloody Angels (1732 Høtten: Marerittet Har e… 1732 9
## 3 5472 1776 (1972) 1776 185
## 4 6290 House of 1000 Corpses (2003) 1000 367
## 5 6645 THX 1138 (1971) 1138 464
## 6 8198 1000 Eyes of Dr. Mabuse, The (Tausend Augen … 1000 24
## 7 8905 1492: Conquest of Paradise (1992) 1492 134
## 8 53953 1408 (2007) 1408 466
#Fix the incorrect dates
edx_with_title_dates[edx_with_title_dates$movieId == "27266", "premier_date"] <- 2004
edx_with_title_dates[edx_with_title_dates$movieId == "671", "premier_date"] <- 1996
edx_with_title_dates[edx_with_title_dates$movieId == "2308", "premier_date"] <- 1973
edx_with_title_dates[edx_with_title_dates$movieId == "4159", "premier_date"] <- 2001
edx_with_title_dates[edx_with_title_dates$movieId == "5310", "premier_date"] <- 1985
edx_with_title_dates[edx_with_title_dates$movieId == "8864", "premier_date"] <- 2004
edx_with_title_dates[edx_with_title_dates$movieId == "1422", "premier_date"] <- 1997
edx_with_title_dates[edx_with_title_dates$movieId == "4311", "premier_date"] <- 1998
edx_with_title_dates[edx_with_title_dates$movieId == "5472", "premier_date"] <- 1972
edx_with_title_dates[edx_with_title_dates$movieId == "6290", "premier_date"] <- 2003
edx_with_title_dates[edx_with_title_dates$movieId == "6645", "premier_date"] <- 1971
edx_with_title_dates[edx_with_title_dates$movieId == "8198", "premier_date"] <- 1960
edx_with_title_dates[edx_with_title_dates$movieId == "8905", "premier_date"] <- 1992
edx_with_title_dates[edx_with_title_dates$movieId == "53953", "premier_date"] <- 2007
#Calculate the age of the movie
#Calculate the age of a movie
edx_with_title_dates <- edx_with_title_dates %>% mutate(age_of_movie = 2018 - premier_date,
rating_date_range = year_rated - premier_date)
head(edx_with_title_dates)
## userId movieId rating title
## 1 1 122 5 Boomerang (1992)
## 2 1 185 5 Net, The (1995)
## 3 1 292 5 Outbreak (1995)
## 4 1 316 5 Stargate (1994)
## 5 1 329 5 Star Trek: Generations (1994)
## 6 1 355 5 Flintstones, The (1994)
## genres year_rated premier_date age_of_movie
## 1 Comedy|Romance 1996 1992 26
## 2 Action|Crime|Thriller 1996 1995 23
## 3 Action|Drama|Sci-Fi|Thriller 1996 1995 23
## 4 Action|Adventure|Sci-Fi 1996 1994 24
## 5 Action|Adventure|Drama|Sci-Fi 1996 1994 24
## 6 Children|Comedy|Fantasy 1996 1994 24
## rating_date_range
## 1 4
## 2 1
## 3 1
## 4 2
## 5 2
## 6 2
#Distribution of Movie Ratings
edx %>% group_by(movieId) %>% summarize(n = n()) %>%
ggplot(aes(n)) + geom_histogram(fill = "cadetblue3", color = "grey20", bins = 10) +
scale_x_log10() +
ggtitle("Number of Movies Ratings")
#Number of Ratings by userId
#Distribution of Users
edx %>% group_by(userId) %>% summarize(n = n()) %>%
ggplot(aes(n)) + geom_histogram(fill = "cadetblue3", color = "grey20", bins = 10) +
scale_x_log10() +
ggtitle("Number of Users Ratings")
#Movie rating averages
movie_avgs <- edx_with_title_dates %>% group_by(movieId) %>% summarize(avg_movie_rating = mean(rating))
user_avgs <- edx_with_title_dates %>% group_by(userId) %>% summarize(avg_user_rating = mean(rating))
year_avgs <- edx_with_title_dates%>% group_by(year_rated) %>% summarize(avg_rating_by_year = mean(rating)) #year the movie was rated
age_avgs <- edx_with_title_dates %>% group_by(age_of_movie) %>% summarize(avg_rating_by_age = mean(rating)) #age of movie
head(age_avgs)
## # A tibble: 6 x 2
## age_of_movie avg_rating_by_age
## <dbl> <dbl>
## 1 8 3.37
## 2 10 3.46
## 3 11 3.53
## 4 12 3.53
## 5 13 3.48
## 6 14 3.53
head(user_avgs)
## # A tibble: 6 x 2
## userId avg_user_rating
## <int> <dbl>
## 1 1 5
## 2 2 3.29
## 3 3 3.94
## 4 4 4.06
## 5 5 3.92
## 6 6 3.95
#What is the relationship to the age of a movie and the movies average rating? #Graph age of movie vs average movie rating
# age of movie vs average movie rating
age_avgs %>%
ggplot(aes(age_of_movie, avg_rating_by_age)) +
geom_point() +
ggtitle("Age of a Movie vs Average Movie Rating")
# The above plot shows more variability as movies age. The plot, also, shows higher ratings the older a movies is up to 90 years old, then the ratings drop.
# userId vs average movie rating
user_avgs %>%
ggplot(aes(userId, avg_user_rating)) +
geom_point(alpha = 1/20, colour = "blue") +
ggtitle("User vs Average User Rating")
#From the above graph, we can see average ratings by user are pretty consistent between 2.5 and 4.5
#Calculating the lm of the age of a movie vs average rating
summary(lm(avg_rating_by_age ~ age_of_movie, data = age_avgs))
##
## Call:
## lm(formula = avg_rating_by_age ~ age_of_movie, data = age_avgs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.61684 -0.10389 0.00276 0.12759 0.28508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.4809443 0.0409983 84.905 < 2e-16 ***
## age_of_movie 0.0041241 0.0006489 6.356 7.38e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1781 on 94 degrees of freedom
## Multiple R-squared: 0.3006, Adjusted R-squared: 0.2931
## F-statistic: 40.4 on 1 and 94 DF, p-value: 7.377e-09
#We can see that R-square is small at 0.30
#Plot the Residuals
avg_rating.lm <- lm(avg_rating_by_age ~ age_of_movie, data = age_avgs)
avg_rating.res <- resid(avg_rating.lm)
plot(age_avgs$age_of_movie, avg_rating.res,
ylab='Residuals', xlab='age_of_movie',
main = 'Average Rating by Age of Movie') + abline(0,0)
## integer(0)
#The R-squared is fairly small at 0.30; 30% of the variation in movie ratings can be prdicted by # explore the data graphically to see if age of the movie and rating are coorelated
#Movies less than 75 years old
age_of_movie_less_than75 <- age_avgs %>% filter(age_of_movie <75)
# age of movie less than 75 years old vs average movie rating
age_of_movie_less_than75 %>%
ggplot(aes(age_of_movie, avg_rating_by_age)) +
geom_point() +
ggtitle("Age of a Movie vs Average Movie Rating")
#Calculate the R-squared value
age_lessthan75_rating.lm <- lm(avg_rating_by_age ~ age_of_movie, data = age_of_movie_less_than75)
summary(age_lessthan75_rating.lm)
##
## Call:
## lm(formula = avg_rating_by_age ~ age_of_movie, data = age_of_movie_less_than75)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.21323 -0.07992 0.00663 0.06785 0.23721
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2946266 0.0307644 107.09 <2e-16 ***
## age_of_movie 0.0092153 0.0006738 13.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1044 on 64 degrees of freedom
## Multiple R-squared: 0.7451, Adjusted R-squared: 0.7411
## F-statistic: 187.1 on 1 and 64 DF, p-value: < 2.2e-16
#The R-squared increased to 0.745
#Plot the residuals
head(age_of_movie_less_than75)
## # A tibble: 6 x 2
## age_of_movie avg_rating_by_age
## <dbl> <dbl>
## 1 8 3.37
## 2 10 3.46
## 3 11 3.53
## 4 12 3.53
## 5 13 3.48
## 6 14 3.53
age_lessthan75.res <- resid(age_lessthan75_rating.lm)
plot(age_of_movie_less_than75$age_of_movie, age_lessthan75.res,
ylab='Residuals', xlab='age_of_movie',
main = 'Average Rating by Age of Movie') + abline(0,0)
## integer(0)
#Let’s look at moveies between 20 and 75 years old as the graph looks more linear in that time frame
#Movies between 20 and 75 years old
age_between20_and_75 <- age_avgs %>% filter((age_of_movie > 20) & (age_of_movie < 75))
# graph the age of movie between 30 and 75 years old
age_between20_and_75 %>%
ggplot(aes(age_of_movie, avg_rating_by_age)) +
geom_point() + ggtitle("Movies between 30 and 75 years old vs average movie rating")
#The plot above appears to be a linear trend; however, the r-square is 0.69
summary(lm(avg_rating_by_age ~ age_of_movie, data = age_between20_and_75))
##
## Call:
## lm(formula = avg_rating_by_age ~ age_of_movie, data = age_between20_and_75)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.235567 -0.077940 -0.009169 0.068137 0.246532
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2313562 0.0473472 68.25 < 2e-16 ***
## age_of_movie 0.0103880 0.0009471 10.97 3.88e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1085 on 52 degrees of freedom
## Multiple R-squared: 0.6982, Adjusted R-squared: 0.6924
## F-statistic: 120.3 on 1 and 52 DF, p-value: 3.882e-15
#The R-squared value is lower at 0.6981
# graph the age of movie between 20 and 40 years old
age_between20_and_40 <- age_avgs %>% filter((age_of_movie > 20) & (age_of_movie < 40))
age_between20_and_40 %>%
ggplot(aes(age_of_movie, avg_rating_by_age)) +
geom_point() + ggtitle('Age of Movie between 20 and 40 years old')
#The above graph is displying a linear trend with older movies having higher ratings
#calculate a linear model
summary(lm(avg_rating_by_age ~ age_of_movie, data = age_between20_and_40))
##
## Call:
## lm(formula = avg_rating_by_age ~ age_of_movie, data = age_between20_and_40)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.09444 -0.02806 -0.01751 0.05332 0.10592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.031416 0.075493 40.155 < 2e-16 ***
## age_of_movie 0.016103 0.002476 6.505 5.39e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0591 on 17 degrees of freedom
## Multiple R-squared: 0.7134, Adjusted R-squared: 0.6965
## F-statistic: 42.31 on 1 and 17 DF, p-value: 5.393e-06
#Movies between 0 and 30 years old
age_less_than30 <- age_avgs %>% filter((age_of_movie < 30))
#Graph movies less than 30 years old and average movie rating
age_less_than30 %>%
ggplot(aes(age_of_movie, avg_rating_by_age)) +
geom_point() + ggtitle('Age of Movie less then 30 years old')
#For movies less than 30 years old there appears to be quite a bit of variation. We can see from the linear model that r-squared is nearly zero.
summary(lm(avg_rating_by_age ~ age_of_movie, data = age_less_than30))
##
## Call:
## lm(formula = avg_rating_by_age ~ age_of_movie, data = age_less_than30)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.091058 -0.034589 -0.000233 0.021613 0.123826
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.4688469 0.0420239 82.545 <2e-16 ***
## age_of_movie -0.0007611 0.0021095 -0.361 0.722
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05933 on 19 degrees of freedom
## Multiple R-squared: 0.006805, Adjusted R-squared: -0.04547
## F-statistic: 0.1302 on 1 and 19 DF, p-value: 0.7222
#The age of a movie did seem to effect the outcome of the average rating. This is possibly due to a higher number of ratings for older movies.
#Do Genres have an effect on ratings? ##I extracted the genres from the data with the idea to do an analysis on each genre. Some of the exploration I did here was removed as it didn’t appear to effect the RMSE and this analysis keep growing! But I did get some nice graphs pertaining to genres.
#Genres split the data into single genres
dat <- edx_with_title_dates %>% separate_rows(genres, sep ="\\|")
head(dat)
## userId movieId rating title genres year_rated premier_date
## 1 1 122 5 Boomerang (1992) Comedy 1996 1992
## 2 1 122 5 Boomerang (1992) Romance 1996 1992
## 3 1 185 5 Net, The (1995) Action 1996 1995
## 4 1 185 5 Net, The (1995) Crime 1996 1995
## 5 1 185 5 Net, The (1995) Thriller 1996 1995
## 6 1 292 5 Outbreak (1995) Action 1996 1995
## age_of_movie rating_date_range
## 1 26 4
## 2 26 4
## 3 23 1
## 4 23 1
## 5 23 1
## 6 23 1
#Count the number of movies using movieId in each genre
genre_count_by_movieId <- dat %>% group_by(movieId, genres) %>% summarize(n = n())
head(genre_count_by_movieId)
## # A tibble: 6 x 3
## # Groups: movieId [2]
## movieId genres n
## <dbl> <chr> <int>
## 1 1 Adventure 23790
## 2 1 Animation 23790
## 3 1 Children 23790
## 4 1 Comedy 23790
## 5 1 Fantasy 23790
## 6 2 Adventure 10779
#Total number of movies in each genre
number_of_genres <- dat %>% group_by(genres) %>% summarize(n = n())
number_of_genres
## # A tibble: 20 x 2
## genres n
## <chr> <int>
## 1 (no genres listed) 7
## 2 Action 2560545
## 3 Adventure 1908892
## 4 Animation 467168
## 5 Children 737994
## 6 Comedy 3540930
## 7 Crime 1327715
## 8 Documentary 93066
## 9 Drama 3910127
## 10 Fantasy 925637
## 11 Film-Noir 118541
## 12 Horror 691485
## 13 IMAX 8181
## 14 Musical 433080
## 15 Mystery 568332
## 16 Romance 1712100
## 17 Sci-Fi 1341183
## 18 Thriller 2325899
## 19 War 511147
## 20 Western 189394
#List the genres. Movies are either in one genre or multiple genres
genre_list <- number_of_genres$genres
genre_list
## [1] "(no genres listed)" "Action" "Adventure"
## [4] "Animation" "Children" "Comedy"
## [7] "Crime" "Documentary" "Drama"
## [10] "Fantasy" "Film-Noir" "Horror"
## [13] "IMAX" "Musical" "Mystery"
## [16] "Romance" "Sci-Fi" "Thriller"
## [19] "War" "Western"
#Explore the distribution of ratings by genre
#Distribution of Ratings per Genre
temp <- dat %>%
group_by(genres) %>%
summarize(n=n()) %>%
ungroup() %>%
mutate(sumN = sum(n), percentage = n/sumN) %>%
arrange(-percentage)
#Bar Graph of Genre’s
temp %>%
ggplot(aes(reorder(genres, percentage), percentage, fill= percentage)) +
geom_bar(stat = "identity") + coord_flip() +
scale_fill_distiller(palette = "YlOrRd") + labs(y = "Percentage", x = "Genre") +
ggtitle("Distribution of Genres by Percent Rated")
#From the above graph, we can see Drama had the highest percentage of ratings.
#Genre’s Mean rating
temp <- dat %>%
group_by(genres) %>%
summarize(mean_rating_by_genre=mean(rating)) %>%
arrange(-mean_rating_by_genre)
temp %>%
ggplot(aes(reorder(genres, mean_rating_by_genre), mean_rating_by_genre, fill= mean_rating_by_genre)) +
geom_bar(stat = "identity") + coord_flip() +
scale_fill_distiller(palette = "YlOrRd") + labs(y = "Mean Rating", x = "Genre") +
ggtitle("Average Rating of Genres")
#Film Noir had the highest average rating, while Horror had the lowest average rating.
#Explore movie ratings based on number of ratings and value of the rating
#Graph of movies with more than 10000 ratings and a mean rating greater than 4.
avg_rating_greater_than_4 <- edx %>% group_by(title) %>%
summarize(mean_rating= mean(rating), n = n()) %>% filter(mean_rating >=4) %>% arrange(desc(n, mean_rating))
avg_rating_greater_than_4 %>% filter(n >=10000) %>%
ggplot(aes(reorder(title, n), n, fill = n)) +
geom_bar(stat = "identity") + coord_flip() + scale_fill_distiller(palette = "PuBuGn") + xlab("Movie") +ylab('Number of Ratings') +
ggtitle("Movies with an average rating\ngreater than or equal to 4\nand Number of Ratings > 10000")
# Examine Movies with ratings between 3 and 4 and more than 10000 ratings
avg_between3_4 <- edx %>% group_by(title) %>%
summarize(mean_rating= mean(rating), n = n()) %>% filter(n > 10000, (mean_rating >= 3 & mean_rating < 4)) %>% arrange(desc(n, mean_rating))
p <- avg_between3_4 %>% slice(1:40)
p %>%
ggplot(aes(reorder(title, n), n, fill = n)) +
geom_bar(stat = "identity") + coord_flip() + scale_fill_distiller(palette = "PuBuGn") +
ggtitle("Average ratings 3<= r < 4 and n > 10000") + xlab('Movie') + ylab('Number of Ratings') +
theme_classic()
#Movies with an average rating between 2 and 3 lets look at number of ratings greater than 5000
avg_between2_3 <- edx %>% group_by(title) %>%
summarize(mean_rating= mean(rating), n = n()) %>% filter(n > 5000, (mean_rating >= 2 & mean_rating < 3)) %>% arrange(desc(n, mean_rating))
avg_between2_3 %>%
ggplot(aes(reorder(title, n), n, fill = n)) +
geom_bar(stat = "identity") + coord_flip() + scale_fill_distiller(palette = "PuBuGn") +
ggtitle("Average ratings 2<= r < 3 and n > 5000") + xlab('Movie') + ylab('Number of Ratings') +
theme_classic()
#Less than 10000 ratings and a rating less than 2 and number of ratings greater than 500
avg_rating_less_than_2 <- edx %>% group_by(title) %>%
summarize(mean_rating= mean(rating), n = n()) %>% filter(n > 500, mean_rating < 2) %>% arrange(desc(n, mean_rating))
avg_rating_less_than_2 %>%
ggplot(aes(reorder(title, n), n, fill = n)) +
geom_bar(stat = "identity") + coord_flip() + scale_fill_distiller(palette = "PuBuGn") +
ggtitle("Average ratings < 2") + xlab('Movie') + ylab('Number of Ratings') +
theme_classic()
#Compute the least squares for movieId
#Which movies have a large number of ratings and a rating larger than the average mu
mu <- mean(edx$rating)
edx %>% group_by(title) %>%
summarize(b_i = mean(rating - mu), n = n()) %>% filter(b_i > 0.5, n > 10000) %>%
ggplot(aes(reorder(title, b_i), b_i, fill = n)) +
geom_bar(stat = "identity") + coord_flip() + scale_fill_distiller(palette = "PuBuGn") +
ggtitle("") + xlab("Movie Title") +
ggtitle("Movie rating - mu,\nfor Number of ratings > 10000") +
theme_classic()
#Regularized Movie Averages
movie_avgs <- edx %>% group_by(movieId) %>% summarize(b_i = mean(rating - mu))
movie_reg_avgs <- edx %>%
group_by(movieId) %>%
summarize(b_i = sum(rating - mu)/(n()+1), n_i = n())
movie_titles <- edx %>% select(movieId, title) %>% distinct()
edx_with_avgs <- edx %>% group_by(title, movieId) %>% summarize(n = n()) %>%
left_join(movie_reg_avgs, by = "movieId") %>%
arrange(desc(b_i, n))
edx_with_avgs %>% filter(n > 15000) %>%
ggplot(aes(reorder(title, b_i), b_i, fill = n)) +
geom_bar(stat = "identity") + coord_flip() + scale_fill_distiller(palette = "PuBuGn") +
ggtitle("") + xlab("Movie Title") + ggtitle('Regularized Movie Averages\nfor Number of Ratings > 20000') +
theme_classic()
#Regularized Movie Averages for the movies with regularized ratings less than 2
head(edx_with_avgs)
## # A tibble: 6 x 5
## # Groups: title [6]
## title movieId n b_i n_i
## <chr> <dbl> <int> <dbl> <int>
## 1 More (1998) 4454 7 1.05 7
## 2 Satan's Tango (Sátántangó) (1994) 33264 2 0.992 2
## 3 Human Condition II, The (Ningen no joken II) … 26048 4 0.990 4
## 4 Human Condition III, The (Ningen no joken III… 26073 4 0.990 4
## 5 Who's Singin' Over There? (a.k.a. Who Sings O… 5194 4 0.990 4
## 6 Shawshank Redemption, The (1994) 318 28015 0.943 28015
p <- edx_with_avgs %>% arrange(b_i) %>% filter(b_i < -2) %>% arrange((b_i))
p
## # A tibble: 37 x 5
## # Groups: title [37]
## title movieId n b_i n_i
## <chr> <dbl> <int> <dbl> <int>
## 1 SuperBabies: Baby Geniuses 2 (2004) 8859 56 -2.67 56
## 2 From Justin to Kelly (2003) 6483 199 -2.60 199
## 3 Disaster Movie (2008) 61348 32 -2.57 32
## 4 Hip Hop Witch, Da (2000) 7282 14 -2.51 14
## 5 Pokémon Heroes (2003) 6371 137 -2.47 137
## 6 Carnosaur 3: Primal Species (1996) 3574 68 -2.39 68
## 7 Glitter (2001) 4775 339 -2.33 339
## 8 Roller Boogie (1979) 8856 15 -2.32 15
## 9 Pokemon 4 Ever (a.k.a. Pokémon 4: The Movie)… 5672 202 -2.32 202
## 10 Barney's Great Adventure (1998) 1826 208 -2.31 208
## # ... with 27 more rows
p %>%
ggplot(aes(reorder(title, b_i), b_i, fill = n)) +
geom_bar(stat = "identity") + coord_flip() + scale_fill_distiller(palette = "PuBuGn") +
ggtitle("") + xlab("Movie Title") + ggtitle('Regularized Movie Averages b_i < -2') +
theme_classic()
#Movies with number of ratings larger than 1000 and regularized average less than 0.
edx_with_avgs %>% filter(n > 10000, b_i < 0.0) %>%
ggplot(aes(reorder(title, b_i), b_i, fill = n)) +
geom_bar(stat = "identity") + coord_flip() + scale_fill_distiller(palette = "PuBuGn") +
ggtitle("") + xlab("Movie Title") + ggtitle('Regularized Movie Averages\nfor Number of Ratings > 20000') +
theme_classic()
#Explore correlation between ratings, users, movieId age of movie and number of ratings
#Is there a correlation
#Number of movie ratings per movie
n_movies_ratings <- edx_with_title_dates %>% group_by(movieId) %>% summarize(n = n())
#Average Movie Rating for each movie
avg_movie_rat <- edx_with_title_dates %>% group_by(movieId) %>% summarize(avg_m_r = mean(rating))
#Create correlation data
cor_dat <- edx_with_title_dates %>% select(rating, movieId, userId, year_rated, age_of_movie, rating_date_range, premier_date) %>%
left_join(n_movies_ratings, by = "movieId") %>%
left_join(avg_movie_rat, by = 'movieId')
head(cor_dat)
## rating movieId userId year_rated age_of_movie rating_date_range
## 1 5 122 1 1996 26 4
## 2 5 185 1 1996 23 1
## 3 5 292 1 1996 23 1
## 4 5 316 1 1996 24 2
## 5 5 329 1 1996 24 2
## 6 5 355 1 1996 24 2
## premier_date n avg_m_r
## 1 1992 2178 2.858586
## 2 1995 13469 3.129334
## 3 1995 14447 3.418011
## 4 1994 17030 3.349677
## 5 1994 14550 3.337457
## 6 1994 4831 2.487787
#Graph the correlation
temp <- cor_dat %>% select(one_of("rating", "movieId", "userId", "year_rated", "age_of_movie",
"rating_date_range", "premier_date", "n", "avg_m_r")) %>% as.matrix()
M <- cor(temp, use = "pairwise.complete.obs")
corrplot(M, order = "hclust", addrect = 2, type = "lower", col = brewer.pal(n = 8, name = "RdBu"))
#What is the effect of the age of the movie
corr_by_age_of_movie <- cor_dat %>% filter((age_of_movie >20) & (age_of_movie < 70))
temp <- corr_by_age_of_movie %>% select(one_of("rating", "movieId", "userId", "year_rated", "age_of_movie",
"rating_date_range", "n", "premier_date", "avg_m_r")) %>% as.matrix()
M <- cor(temp, use = "pairwise.complete.obs")
corrplot(M, order = "hclust", addrect = 2, type = "lower", col = brewer.pal(n = 8, name = "RdBu"))
#Is there a relationship between number of ratings and the average rating
get_cor <- function(df){
m <- cor(df$x, df$y, use="pairwise.complete.obs");
eq <- substitute(italic(r) == cor, list(cor = format(m, digits = 2)))
as.character(as.expression(eq));
}
#Number of ratings vs avg movie ratings
cor_dat %>%
ggplot(aes(n, avg_m_r)) + stat_bin_hex(bins = 50) + scale_fill_distiller(palette = "Spectral") +
stat_smooth(method = "lm", color = "orchid", size = 1) +
annotate("text", x = 20000, y = 2.5, label = get_cor(data.frame(x = cor_dat$n, y = cor_dat$avg_m_r)),
parse = TRUE, color = "orchid", size = 7) + ylab("Average Movie Rating") + xlab("Number of Ratings")
#Is there an Age Effect on Movie Ratings?
cor_dat %>%
ggplot(aes(age_of_movie, avg_m_r)) + stat_bin_hex(bins = 50) + scale_fill_distiller(palette = "Spectral") +
stat_smooth(method = "lm", color = "orchid", size = 1) +
annotate("text", x = 75, y = 0.9, label = get_cor(data.frame(x = corr_by_age_of_movie$age_of_movie, y = corr_by_age_of_movie$avg_m_r)),
parse = TRUE, color = "orchid", size = 7) + ylab("Average Movie Rating") + xlab('Age of Movie')
#Calculate the RMSE
#RMSE function
RMSE <- function(true_ratings, predicted_ratings){
sqrt(mean((true_ratings - predicted_ratings)^2))
}
#Choose the tuning value
lambdas <- seq(0,5,.5)
rmses <- sapply(lambdas, function(l){
mu <- mean(edx_with_title_dates$rating)
b_i <- edx_with_title_dates %>%
group_by(movieId) %>%
summarize(b_i = sum(rating - mu)/(n() + l))
b_u <- edx_with_title_dates %>%
left_join(b_i, by='movieId') %>%
group_by(userId) %>%
summarize(b_u = sum(rating - b_i - mu)/(n() +l))
predicted_ratings <- edx_with_title_dates %>%
left_join(b_i, by = "movieId") %>%
left_join(b_u, by = "userId") %>%
mutate(pred = mu + b_i + b_u) %>% .$pred
return(RMSE(predicted_ratings, edx_with_title_dates$rating))
})
qplot(lambdas, rmses)
lambdas[which.min(rmses)]
## [1] 0.5
mu <- mean(validation$rating)
l <- 0.15
b_i <- validation %>%
group_by(movieId) %>%
summarize(b_i = sum(rating - mu)/(n() + l))
b_u <- validation %>%
left_join(b_i, by='movieId') %>%
group_by(userId) %>%
summarize(b_u = sum(rating - b_i - mu)/(n() +l))
predicted_ratings <- validation %>%
left_join(b_i, by = "movieId") %>%
left_join(b_u, by = "userId") %>%
mutate(pred = mu + b_i + b_u) %>% .$pred
RMSE(predicted_ratings, validation$rating)
## [1] 0.8252108
#I used movieId and userId to calculate the RMSE and was able to achieve an RMSE = 0.826
#The code below utilizes the package “Metrics”, which resulted in the same RMSE. I included this as a check for my calculations.
library(Metrics)
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:modelr':
##
## mae, mape, mse, rmse
## The following objects are masked from 'package:caret':
##
## precision, recall
rmse(validation$rating, predicted_ratings)
## [1] 0.8252108
#This was an amazing assignment; I’m sure I did way more than I needed to as this script is quite long. However, I wanted to use this data as a learning tool to explore as much as possible from what I’ve learned in this R adventure.