library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
We are going to study the ML-100K (MovieLens 100K) dataset. This was the first major public ratings dataset, released in 1998.
ratings <- read_tsv('http://files.grouplens.org/datasets/movielens/ml-100k/u.data',
                    col_names=c('user', 'item', 'rating', 'timestamp'))
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## user = col_double(),
## item = col_double(),
## rating = col_double(),
## timestamp = col_double()
## )
ratings %>% head()
## # A tibble: 6 x 4
## user item rating timestamp
## <dbl> <dbl> <dbl> <dbl>
## 1 196 242 3 881250949
## 2 186 302 3 891717742
## 3 22 377 1 878887116
## 4 244 51 2 880606923
## 5 166 346 1 886397596
## 6 298 474 4 884182806
movies <-
  read_delim('http://files.grouplens.org/datasets/movielens/ml-100k/u.item',
             delim = '|', col_names = FALSE) %>%
  rename(item='X1', title='X2') %>%
  select('item', 'title')
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## X2 = col_character(),
## X3 = col_character(),
## X4 = col_logical(),
## X5 = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
movies %>% head()
## # A tibble: 6 x 2
## item title
## <dbl> <chr>
## 1 1 Toy Story (1995)
## 2 2 GoldenEye (1995)
## 3 3 Four Rooms (1995)
## 4 4 Get Shorty (1995)
## 5 5 Copycat (1995)
## 6 6 Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)
Like any mean, the smoothed mean is the sum of the ratings divided by the number of ratings:
smoothed_mean = sum_ratings / num_ratings
We then pretend to add some “fake” ratings, which contribute to both the sum and the count:
smoothed_mean = (real_sum + fake_sum) / (real_count + fake_count)
More specifically, we pretend that we have 5 fake ratings, each equal to the global mean, so fake_sum = 5 * global_mean and fake_count = 5. The number 5 can be considered a “smoothing constant.”
smoothed_mean = (real_sum + 5 * global_mean) / (real_count + 5)
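The formula can be sketched in R as follows. This is a minimal illustration, not necessarily the exact computation used later: it assumes the smoothed mean is computed per item and that the global mean is the mean of all ratings in the table. The helper name `smoothed_item_means`, its parameter `k`, and the `toy` data are hypothetical and made up for this example.

```r
library(dplyr)

# Hypothetical helper: per-item smoothed mean, pretending each item has
# `k` extra fake ratings equal to the global mean (k = 5 as in the text).
smoothed_item_means <- function(ratings, k = 5) {
  # Assumption: the global mean is taken over every rating in the table.
  global_mean <- mean(ratings$rating)
  ratings %>%
    group_by(item) %>%
    summarize(
      real_count = n(),
      real_sum = sum(rating),
      .groups = "drop"
    ) %>%
    mutate(
      raw_mean = real_sum / real_count,
      smoothed_mean = (real_sum + k * global_mean) / (real_count + k)
    )
}

# Toy data, invented for illustration: item 1 has a single rating,
# item 2 has four.
toy <- tibble(
  user   = c(1, 1, 2, 3, 4),
  item   = c(1, 2, 2, 2, 2),
  rating = c(5, 3, 3, 3, 1)
)

toy_means <- smoothed_item_means(toy)
toy_means
```

In the toy data the global mean is 3, so the single 5-star rating for item 1 is pulled down to 20/6 (about 3.33), while item 2, with four real ratings, moves only from 2.5 to 25/9 (about 2.78). Items with few ratings are pulled more strongly toward the global mean. Under the same assumptions, `smoothed_item_means(ratings)` would apply the calculation to the `ratings` table loaded above.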