Recommender Systems: Netflix & Chill

This group project was done by Anastasia Vlasenko, Veronika Volkova, Ekaterina Goncharova, Vadim Laktionov and Kirill Feofilov (group No. 5).

We start by loading the movie dataset and a reduced dataset of ratings for those movies. In the first part we want to see how the movies are connected through the companies that produce them and through their genres, that is, to build two separate networks.

Clusters of similar movies will let us offer a user something they are likely to enjoy. Suppose User 1 liked a movie in the genres Action, Drama and Comedy. We will try to offer them another movie from the cluster that the first movie fell into, on the assumption that they will like it too.

First network: by genre

Let's start by building the simplest possible graph, which we will then transform.

The first graph looks chaotic and is practically unreadable, so we need to work on it: we filter it and keep only the stronger ties.
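The filtering step can be sketched in base R. This is a minimal illustration on a made-up movie-by-genre matrix, not the project's own code: the names `inc`, `proj` and `min_shared` are assumptions, and the cutoff of 2 shared genres is arbitrary. In igraph the same idea would typically be expressed with `bipartite_projection()` followed by `delete_edges()`.

```r
# Toy movie-by-genre incidence matrix (hypothetical data, not the project's dataset)
inc <- matrix(
  c(1, 1, 0, 0,
    1, 1, 1, 0,
    0, 0, 1, 1,
    0, 1, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("m1", "m2", "m3", "m4"),
                  c("Action", "Drama", "Comedy", "Horror"))
)

# One-mode projection: entry [i, j] counts the genres shared by movies i and j
proj <- inc %*% t(inc)
diag(proj) <- 0

# Keep only strong ties: drop edges with fewer than `min_shared` shared genres
min_shared <- 2
filtered <- proj
filtered[filtered < min_shared] <- 0

# Edge list of the surviving ties (upper triangle only, the graph is undirected)
idx <- which(filtered > 0 & upper.tri(filtered), arr.ind = TRUE)
edges <- data.frame(
  from   = rownames(filtered)[idx[, "row"]],
  to     = colnames(filtered)[idx[, "col"]],
  weight = filtered[idx],
  stringsAsFactors = FALSE
)
print(edges)
```

Raising `min_shared` thins the graph further; the right value depends on how dense the real projection is.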

This already looks better! The graph shows that genres do tend to form clusters; moreover, some of them gather into a tight "tangle" that does not interact with other nodes (there are no brokers).

To make any analysis possible, we color the clusters differently and label them.

We ended up with 9 clusters that group movies by genre. For the most part, movies within a cluster share the same set of genres, and the ties are very strong. For example, cluster 9 contains 2 movies with exactly the same set of genres. Extra genres may be added, but most of the set is shared. We also computed assortativity and found that nodes do prefer to connect to similar nodes: the coefficient is close to one (0.8381166).
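To make the assortativity coefficient more concrete, here is a small base R sketch of Newman's assortativity for a categorical attribute such as cluster membership. The edge list, the `cluster` vector and the helper `nominal_assortativity()` are hypothetical and only illustrate the formula r = (sum of e_ii - sum of a_i * b_i) / (1 - sum of a_i * b_i); igraph offers `assortativity_nominal()` for the same purpose.

```r
# Toy undirected graph: two triangles joined by a single bridge c-d (hypothetical data)
edges <- data.frame(
  from = c("a", "a", "b", "d", "e", "c"),
  to   = c("b", "c", "c", "e", "f", "d"),
  stringsAsFactors = FALSE
)
cluster <- c(a = 1, b = 1, c = 1, d = 2, e = 2, f = 2)

nominal_assortativity <- function(edges, membership) {
  # Count every undirected edge in both directions so the mixing matrix is symmetric
  from <- membership[c(edges$from, edges$to)]
  to   <- membership[c(edges$to, edges$from)]
  lev  <- sort(unique(membership))
  e <- unclass(table(factor(from, lev), factor(to, lev)))
  e <- e / sum(e)                 # fraction of edge ends joining each pair of clusters
  a <- rowSums(e)
  b <- colSums(e)
  (sum(diag(e)) - sum(a * b)) / (1 - sum(a * b))
}

r <- nominal_assortativity(edges, cluster)
print(r)
```

A value near 1 means that almost all ties stay inside a cluster; the single bridge between the two triangles is what pulls this toy value below 1.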

Second network: by production company

We build the initial network on which the rest of the analysis is based.

As you can see, the network is too dense. To identify clusters more deliberately, we weight the plain projection (Newman's approach, i.e. the strength of a tie between two movies is weighted by the popularity of the production company). After weighting we remove the weak ties, which lets us see the structure of the network more clearly.

We can try removing all ties whose strength falls below a cutoff chosen slightly under or over the mean weight (by trial and error, 0.13 worked best).
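The weighting and thresholding can be sketched as follows, assuming that "Newman's approach" refers to Newman's collaboration weighting, where a company shared by n movies contributes 1 / (n - 1) to each tie it creates. The toy matrix, the company names and the cutoff of 0.5 are made up for illustration; the report's own cutoff was 0.13 on the real data.

```r
# Toy movie-by-company incidence matrix (hypothetical data)
inc <- matrix(
  c(1, 1, 0,
    1, 1, 0,
    1, 0, 1,
    1, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("m1", "m2", "m3", "m4"), c("Big", "Small", "Solo"))
)

# Number of movies per company; companies with a single movie create no ties
n_k  <- colSums(inc)
keep <- n_k > 1
inc_k <- inc[, keep, drop = FALSE]

# Newman weighting: each shared company adds 1 / (n_k - 1) to the tie,
# so a tie through a very prolific company counts for less
scaled <- sweep(inc_k, 2, n_k[keep] - 1, "/")
w <- scaled %*% t(inc_k)
diag(w) <- 0

# Remove weak ties below a cutoff chosen by inspection
cutoff <- 0.5
strong <- w
strong[strong < cutoff] <- 0
print(round(w, 3))
```

Here `m1` and `m2` keep their tie because they also share the small company, while ties that exist only through the large company fall below the cutoff.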

The structure is now much easier to see. It remains to identify the clusters.

The next step is to find out what each cluster corresponds to.

The code below is only needed to look up the movie titles, and their number, for a given cluster.

a_6 = mov_names %>% 
  filter(clust == 6)

For example, in cluster 6 the movies were produced by several companies at once, but what unites them is Universal Pictures.

Let's compute assortativity, i.e. the extent to which nodes tend to be connected to nodes with similar properties.

assortativity(net1, V(net1), directed = T)
## [1] 0.7394885

The result is almost 0.74, which is close to one. Nodes therefore prefer to connect to similar nodes.

So we obtained a network with clearly visible clusters of movies that are similar in terms of production companies, which means that this network will later let us recommend movies from the same cluster to a given user.
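The idea of recommending from the same cluster can be illustrated with a few lines of base R. The membership vector, the movie names and the helper `recommend_from_cluster()` are hypothetical; in the project the membership would come from the clustering of the network.

```r
# Hypothetical cluster membership: names are movies, values are cluster ids
membership <- c("Movie A" = 6, "Movie B" = 6, "Movie C" = 6, "Movie D" = 2)

# Return up to n other movies from the cluster of the movie the user liked
recommend_from_cluster <- function(liked, membership, n = 3) {
  same <- names(membership)[membership == membership[[liked]]]
  head(setdiff(same, liked), n)
}

print(recommend_from_cluster("Movie A", membership))
```

A real system would also need a rule for ordering the candidates inside a cluster, for example by tie weight to the liked movie.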

Recommender system

In the first steps we do not need any additional data, because we build the recommender system from ratings (collaborative filtering) using the ready-made ratings_cut dataset.
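As a rough illustration of what user-based collaborative filtering does, the sketch below predicts one missing rating from the ratings of similar users. It is a simplified version written for this example: the toy matrix and the helpers `cosine_sim()` and `predict_rating()` are assumptions, and recommenderlab's own implementation differs in details such as neighbourhood size and normalization options.

```r
# Toy user-by-item rating matrix; NA marks an unrated item (hypothetical data)
toy_ratings <- matrix(
  c(5, 4, NA, 1,
    4, 5,  5, 2,
    1, 2,  1, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("u1", "u2", "u3"), c("i1", "i2", "i3", "i4"))
)

# Center each user's ratings on that user's mean
centered <- toy_ratings - rowMeans(toy_ratings, na.rm = TRUE)

# Cosine similarity over the items both users have rated
cosine_sim <- function(x, y) {
  both <- !is.na(x) & !is.na(y)
  sum(x[both] * y[both]) / sqrt(sum(x[both]^2) * sum(y[both]^2))
}

# Predict a rating as the user's mean plus the similarity-weighted
# average of the centered ratings of positively similar users
predict_rating <- function(user, item) {
  user_mean <- mean(toy_ratings[user, ], na.rm = TRUE)
  others <- setdiff(rownames(toy_ratings)[!is.na(toy_ratings[, item])], user)
  sims <- sapply(others, function(o) cosine_sim(centered[user, ], centered[o, ]))
  sims <- sims[sims > 0]
  if (length(sims) == 0) return(user_mean)
  user_mean + sum(sims * centered[names(sims), item]) / sum(sims)
}

print(predict_rating("u1", "i3"))
```

In this toy case only `u2` has tastes similar to `u1`, so the prediction is the mean of `u1` shifted by how far above their own mean `u2` rated the item.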

Data preparation

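library(dplyr)           # select()
library(tidyr)           # spread()
library(recommenderlab)  # realRatingMatrix, Recommender()

# Reshape the long ratings table into a wide user-by-movie matrix
ratings_1 = select(ratings, customer_id, movie_id, rating)
sp_ratings = spread(ratings_1, key = movie_id, value = rating)
rownames(sp_ratings) = sp_ratings$customer_id
sp_ratings = select(sp_ratings, -customer_id)
sp_ratings = as.matrix(sp_ratings)
sp_ratings = as(sp_ratings, "realRatingMatrix")
# Keep users with more than 5 ratings and movies with more than 10 ratings
sp_ratings = sp_ratings[rowCounts(sp_ratings) > 5, colCounts(sp_ratings) > 10]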

set.seed(321)
test_ratings = sample(1:nrow(sp_ratings), size = nrow(sp_ratings)*0.1)
ratings_train = sp_ratings[-test_ratings, ]
ratings_test = sp_ratings[test_ratings, ]

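We use the UBCF method (user-based collaborative filtering).

# Note: UBCF has no parameter named k (the neighbourhood size is nn). recommenderlab
# prints the valid parameters shown below and appears to fall back to the defaults,
# so this model is most likely fitted with nn = 25 rather than 30.
recc_model = Recommender(data = ratings_train, method = "UBCF", parameter = list(k = 30))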
## Available parameter (with default values):
## method    =  cosine
## nn    =  25
## sample    =  FALSE
## normalize     =  center
## verbose   =  FALSE
recc_predict = predict(object = recc_model, newdata = ratings_test, n = 5)

We can now recommend movies to any user who has rated something and predict the ratings they would give to those movies.

recc_films = function(user_id){
  # Column indices of the items recommended to this user
  recc_user = recc_predict@items[[user_id]]
  # Map the indices to movie ids, then to titles
  movies_user = recc_predict@itemLabels[recc_user]
  ratings$title[match(movies_user, ratings$movie_id)]
}
# Predicted ratings for the same recommended items
recc_ratings = function(user_id){recc_predict@ratings[[user_id]]}

Let's check this for the user with id 2554698.

as.data.frame(recc_films("2554698")) %>% 
  knitr::kable(caption = "Recommended movies")

Table: Recommended movies

|recc_films("2554698")               |
|:-----------------------------------|
|Bowling for Columbine               |
|Being John Malkovich                |
|Seven Samurai                       |
|Lock, Stock and Two Smoking Barrels |
|Super Size Me                       |

as.data.frame(recc_ratings("2554698")) %>% 
  knitr::kable(caption = "Predicted ratings")

Table: Predicted ratings

| recc_ratings("2554698")|
|-----------------------:|
|                4.246181|
|                4.242280|
|                4.218941|
|                4.212714|
|                4.200664|

These tables contain the recommendations for a specific user together with the predicted rating (how much the user will like the movie and how they will rate it). Later we plan to make the recommendations more accurate, to avoid cases where the user is disappointed by a movie.
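One common way to check whether recommendations are getting more accurate is to compare predicted ratings with ratings that were held out. The sketch below is only an illustration with made-up numbers; the helpers `rmse()` and `mae()` are written here for the example and are not part of the report.

```r
# Hypothetical predicted ratings and the ratings the user actually gave
predicted <- c(4.2, 3.1, 2.5)
actual    <- c(5.0, 3.0, 2.0)

# Root mean squared error: penalizes large misses more strongly
rmse <- function(predicted, actual) sqrt(mean((predicted - actual)^2))
# Mean absolute error: the average size of a miss, in rating points
mae  <- function(predicted, actual) mean(abs(predicted - actual))

print(c(RMSE = rmse(predicted, actual), MAE = mae(predicted, actual)))
```

Lower values of both measures mean that predicted ratings are closer to what users actually gave.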

Additional data

We found two datasets on Kaggle. From the first (tmdb_5000_credits) it would be interesting to take information about the movies' actors; from the second (IMDB_2280_Most_Voted_Movies) we can get a separate rating (IMDB Rating) and the directors, and simply add more movies to broaden what the system can do.

Let's see which actors appear in movies most often. After that we can analyze the ratings of the movies featuring the top-10 actors by number of roles played.
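Counting appearances per actor can be sketched in base R as below. The toy `credits` table and the assumption that the cast is stored as a single pipe-separated string are hypothetical; the real dataset stores the cast differently and would need its own parsing step first. The column names `cast_sep` and `count` mirror the output shown in this report.

```r
# Toy credits table; the pipe-separated cast column is an assumption for this example
credits <- data.frame(
  title = c("Film 1", "Film 2", "Film 3"),
  cast  = c("Actor A|Actor B", "Actor A|Actor C", "Actor A|Actor B|Actor D"),
  stringsAsFactors = FALSE
)

# Split every cast string into individual names and pool them
cast_sep <- unlist(strsplit(credits$cast, "|", fixed = TRUE))

# Count appearances per actor and keep the most frequent ones
counts <- sort(table(cast_sep), decreasing = TRUE)
top <- head(as.data.frame(counts, stringsAsFactors = FALSE), 2)
names(top) <- c("cast_sep", "count")
print(top)
```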

## # A tibble: 9 x 2
##   cast_sep          count
##   <fct>             <int>
## 1 Samuel L. Jackson    68
## 2 Robert De Niro       57
## 3 Bruce Willis         51
## 4 Matt Damon           48
## 5 Morgan Freeman       46
## 6 Steve Buscemi        43
## 7 Johnny Depp          42
## 8 Liam Neeson          41
## 9 Owen Wilson          40

As the table shows, some actors, the ones who tend to bring films big box-office returns, appear in dozens of titles. We will need to pay attention to them, since they will come up very often. Later it would be great to use each user's preferences for actors, since that should make the recommendations more accurate.

## 1) Prepare the data
library(readr)  # read_csv()
movies <- read_csv("~/shared/minor2_2018/data/movies.csv")
ratings <- read_csv("~/shared/minor2_2018/data/ratings_cut.csv")
## Warning: Missing column names filled in: 'X1' [1]
IMDB_2280_Most_Voted_Movies <- read_csv("/students/avvlasenko_1/IMDB_2280_Most_Voted_Movies.csv")
imdb = IMDB_2280_Most_Voted_Movies %>% 
  select(title, director, actors, rating)

b = left_join(movies, imdb)
# Manually set movie_id for selected rows
b$movie_id[3] = 9856
b$movie_id[23] = 9863
b$movie_id[27] = 9462
b$movie_id[32] = 7861
b$movie_id[43] = 4529
b$movie_id[189] = 5896
b$movie_id[223] = 5212
## 2) Build the similarity matrix (by genre, director, production country and user rating)
kino = b %>% 
  select(title, movie_id, genres, production_countries, director, rating)
kino = mutate(kino, e = 1)
kino = extract_json(df = kino, col = "genres")
kino = extract_json2(df = kino, col = "production_countries")
## Warning in data.frame(..., check.names = FALSE): row names were found from
## a short variable and have been discarded

kino = spread(kino, key = production_countries_sep, value = production_countries_v, fill = 0)
kino = spread(kino, key = director, value = e, fill = 0)
kino = kino %>% 
  select(-genres, -title, -production_countries)
rownames(kino) = kino$movie_id
kino = kino %>% 
  select(-movie_id)
mod = lsa::cosine(t(as.matrix(kino)))
diag(mod) = 0
## 3) Build the recommender system
getFilms = function(userId){
  # Movies this user rated 5
  client = ratings %>% 
    filter(customer_id == userId & rating == 5)
  
  # nrow(), not length(): length() of a data frame counts columns and is never 0 here
  if (nrow(client) == 0) {
    recommend = "Harry Potter"} else {
    mostSimilar = head(sort(mod[,as.character(client$movie_id)], decreasing = TRUE), n = 3)
    a = which(mod[,as.character(client$movie_id)] %in% mostSimilar, arr.ind = TRUE)
    # Convert linear indices to row numbers; a plain %% would give 0 for the last row
    rows = (a - 1) %% dim(mod)[1] + 1
    result = rownames(mod)[rows]
    recommend = filter(movies,movie_id %in% result) %>% dplyr::select(title)
  }
  
  recommend
}
getFilms(111343)
## # A tibble: 3 x 1
##   title                   
##   <chr>                   
## 1 Network                 
## 2 Midnight Cowboy         
## 3 A Streetcar Named Desire
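
The content-based step above relies on cosine similarity between movie feature vectors. The base R sketch below shows the same idea on a made-up feature matrix; the movie ids, feature columns and the helper `most_similar()` are hypothetical and only illustrate the computation, not the project's data.

```r
# Toy movie-by-feature matrix with binary genre and country indicators (hypothetical data)
features <- matrix(
  c(1, 1, 0, 1,
    1, 1, 0, 0,
    0, 0, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("101", "102", "103"), c("Action", "Drama", "Comedy", "USA"))
)

# Cosine similarity between rows: dot products divided by the product of vector lengths
norms <- sqrt(rowSums(features^2))
sim <- (features %*% t(features)) / (norms %o% norms)
diag(sim) <- 0   # a movie should not be recommended as similar to itself

# Ids of the n movies most similar to a given movie
most_similar <- function(movie_id, n = 1) {
  names(head(sort(sim[, movie_id], decreasing = TRUE), n))
}

print(most_similar("101"))
```

`lsa::cosine()` computes the cosine between the columns of a matrix, which is why the report transposes the movie-by-feature matrix before calling it.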