library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Getting Started

We are going to study the MovieLens 100K (ML-100K) dataset, one of the first major public ratings datasets, released by GroupLens in 1998.

ratings <- read_tsv('http://files.grouplens.org/datasets/movielens/ml-100k/u.data', 
                    col_names=c('user', 'item', 'rating', 'timestamp'))
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   user = col_double(),
##   item = col_double(),
##   rating = col_double(),
##   timestamp = col_double()
## )
ratings %>% head()
## # A tibble: 6 x 4
##    user  item rating timestamp
##   <dbl> <dbl>  <dbl>     <dbl>
## 1   196   242      3 881250949
## 2   186   302      3 891717742
## 3    22   377      1 878887116
## 4   244    51      2 880606923
## 5   166   346      1 886397596
## 6   298   474      4 884182806
movies <- 
  read_delim('http://files.grouplens.org/datasets/movielens/ml-100k/u.item', 
             delim = '|', col_names = FALSE) %>%
  rename(item='X1', title='X2') %>%
  select('item', 'title')
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   X2 = col_character(),
##   X3 = col_character(),
##   X4 = col_logical(),
##   X5 = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
movies %>% head()
## # A tibble: 6 x 2
##    item title                                               
##   <dbl> <chr>                                               
## 1     1 Toy Story (1995)                                    
## 2     2 GoldenEye (1995)                                    
## 3     3 Four Rooms (1995)                                   
## 4     4 Get Shorty (1995)                                   
## 5     5 Copycat (1995)                                      
## 6     6 Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)

Data wrangling

  • Create a new data frame that adds movie titles to the ratings data frame.
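
One possible approach to the step above is a join on the shared item column. The sketch below uses two tiny made-up tables (`toy_ratings` and `toy_movies`, hypothetical stand-ins for `ratings` and `movies`) so that it runs without downloading anything; the same call should carry over to the real tibbles.

```r
library(dplyr)

# Toy stand-ins for the `ratings` and `movies` tibbles (made-up values).
toy_ratings <- tibble(
  user   = c(1, 1, 2, 3),
  item   = c(10, 20, 10, 30),
  rating = c(4, 2, 5, 3)
)
toy_movies <- tibble(
  item  = c(10, 20, 30),
  title = c("Movie A", "Movie B", "Movie C")
)

# left_join keeps every rating row and attaches the title with the same item id.
toy_rated <- toy_ratings %>%
  left_join(toy_movies, by = "item")

toy_rated
```

A left join is used here so that a rating is never dropped even if its movie were missing from the title table; in that case the title would simply be NA.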

Rating activities

  • Calculate the total number of ratings and the average rating.
  • Create the frequency table for rating values. How often is each rating used?
  • Find the most often rated movies.
  • Calculate the sparsity of the matrix. This is the percentage of all possible ratings (the number there would be if each user rated each movie) that actually exist.
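
The four tasks above could be sketched roughly as follows. This is a minimal illustration on a made-up table (`toy_ratings` is hypothetical, as are the result names), not the only way to do it; `count()` and `summarize()` are standard dplyr verbs.

```r
library(dplyr)

# Toy stand-in for the `ratings` tibble (made-up values).
toy_ratings <- tibble(
  user   = c(1, 1, 2, 3, 3),
  item   = c(10, 20, 10, 30, 10),
  rating = c(4, 2, 5, 3, 4)
)

# Total number of ratings and the average rating.
overall <- toy_ratings %>%
  summarize(n_ratings = n(), mean_rating = mean(rating))

# Frequency table: how often each rating value is used.
rating_freq <- toy_ratings %>% count(rating)

# Items ordered from most to least often rated.
most_rated <- toy_ratings %>% count(item, sort = TRUE)

# Percentage of all possible (user, item) pairs that actually have a rating.
n_users <- n_distinct(toy_ratings$user)
n_items <- n_distinct(toy_ratings$item)
pct_filled <- 100 * nrow(toy_ratings) / (n_users * n_items)
```

On the real data, joining `most_rated` with the movie titles makes the result easier to read.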

Calculating simple means

  • Calculate the overall global mean rating.
  • Calculate the mean rating for every movie. Hint: Group by movie and summarize within each movie.
  • Find the movies with highest and lowest means, and show the number of ratings for each. Do they match your intuition?
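
Following the hint, a group-and-summarize sketch might look like this. The table `toy_ratings` and the column names `n_ratings` and `mean_rating` are assumptions made for illustration.

```r
library(dplyr)

# Toy stand-in for the `ratings` tibble (made-up values).
toy_ratings <- tibble(
  item   = c(10, 10, 10, 20, 30, 30),
  rating = c(4, 5, 3, 5, 1, 2)
)

# Global mean over all ratings.
global_mean <- mean(toy_ratings$rating)

# Per-movie mean and rating count, highest mean first.
movie_means <- toy_ratings %>%
  group_by(item) %>%
  summarize(n_ratings = n(), mean_rating = mean(rating)) %>%
  arrange(desc(mean_rating))

head(movie_means, 1)  # highest mean
tail(movie_means, 1)  # lowest mean
```

In this toy table the top spot goes to an item with a single rating, which is the kind of result the intuition question is probing.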

Calculating “thresholded” means

  • Only consider movies with at least some minimum number of ratings (e.g. 20).
  • Find the movies with highest and lowest thresholded means, and show the number of ratings for each. Do they match your intuition?
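
A thresholded mean can be sketched by filtering on the per-movie count before ranking. Here `toy_ratings` and `min_ratings` are hypothetical names, and the threshold is set to 2 only because the toy table is tiny; on the real data a value such as 20 would be used.

```r
library(dplyr)

# Toy stand-in for the `ratings` tibble (made-up values).
toy_ratings <- tibble(
  item   = c(10, 10, 10, 20, 30, 30),
  rating = c(4, 5, 3, 5, 1, 2)
)

min_ratings <- 2  # assumed threshold for the toy data

# Per-movie means, keeping only movies with enough ratings.
thresholded_means <- toy_ratings %>%
  group_by(item) %>%
  summarize(n_ratings = n(), mean_rating = mean(rating)) %>%
  filter(n_ratings >= min_ratings) %>%
  arrange(desc(mean_rating))

thresholded_means
```

The item with only one rating drops out entirely, so it can no longer top the ranking.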

Calculating “smoothed” means

Like any mean, the smoothed mean is the sum of the ratings divided by the number of ratings:

smoothed_mean = sum_ratings / num_ratings

We pretend to add some “fake” ratings, which contribute to both the sum and the count:

smoothed_mean = (real_sum + fake_sum) / (real_count + fake_count)

More specifically, we pretend that we have 5 fake ratings, each equal to the global mean, so fake_sum = 5 * global_mean and fake_count = 5. This number 5 can be considered a “smoothing constant.”

smoothed_mean = (real_sum + 5 * global_mean) / (real_count + 5)

  • Find the movies with highest and lowest smoothed means, and show the number of ratings for each. Do they match your intuition? What happens when you play around with the smoothing constant?
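
The smoothed-mean formula above translates fairly directly into dplyr. The sketch below is one way to write it, using a made-up table (`toy_ratings`) and a hypothetical variable `k` for the smoothing constant so that it is easy to vary.

```r
library(dplyr)

# Toy stand-in for the `ratings` tibble (made-up values).
toy_ratings <- tibble(
  item   = c(10, 10, 10, 20, 30, 30),
  rating = c(5, 5, 4, 5, 1, 2)
)

global_mean <- mean(toy_ratings$rating)
k <- 5  # smoothing constant: number of fake ratings at the global mean

smoothed_means <- toy_ratings %>%
  group_by(item) %>%
  summarize(real_count = n(), real_sum = sum(rating)) %>%
  mutate(
    raw_mean      = real_sum / real_count,
    smoothed_mean = (real_sum + k * global_mean) / (real_count + k)
  ) %>%
  arrange(desc(smoothed_mean))

smoothed_means
```

In this toy table item 20 has the highest raw mean from a single rating, but after smoothing it ranks below item 10, which has three ratings. Setting `k` to 0 recovers the raw means, and larger values pull every movie closer to the global mean.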