According to Wikipedia, a recommender system (or recommendation system) is an information filtering system that tries to predict the "rating" or "preference" a user would give to an item. In his article at https://towardsdatascience.com/introduction-to-recommender-systems-6c66cf15ada, Baptiste Rocca explains that the goal of a recommender system is to give relevant recommendations to an application's users, usually through one of two methods: collaborative filtering and content-based filtering. In this post I walk through a simple recommender system in R, which I previously built in Python in a DQLab project with Karl Christian, Business Intelligence at Traveloka. According to Karl Christian, a recommender system has three uses:
Preventing fraud: Fraud is cheating that commonly occurs in e-commerce, for example customers abusing promotions for personal gain. A recommender system can indirectly indicate which users are likely to commit fraud and which are not.
Recommending content: YouTube, for example, can recommend which videos a user should watch by evaluating that user's viewing history.
Supporting search engines: Collecting and organizing information on the internet according to what users need.
As usual, before we start we need to prepare the data and look at its structure.
Inspecting the Data Structure
## Observations: 9,025
## Variables: 9
## $ tconst <chr> "tt0221078", "tt8862466", "tt7157720", "tt29749...
## $ titleType <chr> "short", "tvEpisode", "tvEpisode", "tvEpisode",...
## $ primaryTitle <chr> "Circle Dance, Ute Indians", "¡El #TeamOsos va ...
## $ originalTitle <chr> "Circle Dance, Ute Indians", "¡El #TeamOsos va ...
## $ isAdult <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ startYear <chr> "1898", "2018", "2016", "1987", "1973", "1951",...
## $ endYear <chr> "\\N", "\\N", "\\N", "\\N", "\\N", "\\N", "\\N"...
## $ runtimeMinutes <chr> "\\N", "\\N", "29", "\\N", "\\N", "7", "23", "2...
## $ genres <chr> "Documentary,Short", "Comedy,Drama", "Comedy,Ga...
Checking Column Names
## [1] "tconst" "titleType" "primaryTitle" "originalTitle"
## [5] "isAdult" "startYear" "endYear" "runtimeMinutes"
## [9] "genres"
Checking Missing Values
## # A tibble: 1 x 9
## tconst titleType primaryTitle originalTitle isAdult startYear endYear
## <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 14 14 0 0 0
## # ... with 2 more variables: runtimeMinutes <int>, genres <int>
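The code behind this count is not shown in the post. A minimal sketch of one way to get a per-column count, assuming a small hypothetical data frame `toy_df` that stands in for the movie table, uses `summarise()` with `across()`:

```r
library(dplyr)

# Toy stand-in for the movie table (hypothetical values, not the real data)
toy_df <- data.frame(
  tconst       = c("tt001", "tt002", "tt003"),
  primaryTitle = c("A", NA, "C"),
  genres       = c("Drama", "Comedy", NA),
  stringsAsFactors = FALSE
)

# Count NA values in every column; the result is a one-row data frame
na_counts <- toy_df %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

print(na_counts)
```

Each column of the result holds the number of missing values found in the column of the same name.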
Dropping Missing Values and Checking the Data Structure
## Observations: 9,000
## Variables: 9
## $ tconst <chr> "tt0221078", "tt8862466", "tt7157720", "tt29749...
## $ titleType <chr> "short", "tvEpisode", "tvEpisode", "tvEpisode",...
## $ primaryTitle <chr> "Circle Dance, Ute Indians", "¡El #TeamOsos va ...
## $ originalTitle <chr> "Circle Dance, Ute Indians", "¡El #TeamOsos va ...
## $ isAdult <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ startYear <chr> "1898", "2018", "2016", "1987", "1973", "1951",...
## $ endYear <chr> "\\N", "\\N", "\\N", "\\N", "\\N", "\\N", "\\N"...
## $ runtimeMinutes <chr> "\\N", "\\N", "29", "\\N", "\\N", "7", "23", "2...
## $ genres <chr> "Documentary,Short", "Comedy,Drama", "Comedy,Ga...
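The drop step itself is not shown. One possible sketch, assuming a hypothetical `toy_df` in place of the real table, keeps only the rows that have no `NA` in any column:

```r
library(dplyr)

# Toy stand-in for the movie table (hypothetical values)
toy_df <- data.frame(
  tconst       = c("tt001", "tt002", "tt003"),
  primaryTitle = c("A", NA, "C"),
  stringsAsFactors = FALSE
)

# complete.cases() is TRUE for rows without any NA
toy_no_na <- toy_df %>%
  filter(complete.cases(.))

nrow(toy_no_na)
```

The name `toy_no_na` only mirrors the `movie_df_NoNa` object used later in the post; how that object was actually created is an assumption here.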
Checking Missing Values and Converting Data Types
## # A tibble: 1 x 9
## tconst titleType primaryTitle originalTitle isAdult startYear endYear
## <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 0 0 0
## # ... with 2 more variables: runtimeMinutes <int>, genres <int>
movie_df_fix <- movie_df_NoNa %>% # convert data types
  # replace the "\\N" placeholder with "nan" so as.numeric() parses it as NaN
  mutate_all(~replace(., . == "\\N", "nan")) %>%
  mutate(titleType = as.factor(titleType),
         isAdult = as.factor(isAdult),
         startYear = as.numeric(startYear),
         endYear = as.numeric(endYear),
         runtimeMinutes = as.numeric(runtimeMinutes))
glimpse(movie_df_fix)
## Observations: 9,000
## Variables: 9
## $ tconst <chr> "tt0221078", "tt8862466", "tt7157720", "tt29749...
## $ titleType <fct> short, tvEpisode, tvEpisode, tvEpisode, tvEpiso...
## $ primaryTitle <chr> "Circle Dance, Ute Indians", "¡El #TeamOsos va ...
## $ originalTitle <chr> "Circle Dance, Ute Indians", "¡El #TeamOsos va ...
## $ isAdult <fct> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ startYear <dbl> 1898, 2018, 2016, 1987, 1973, 1951, 2006, 2015,...
## $ endYear <dbl> NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, Na...
## $ runtimeMinutes <dbl> NaN, NaN, 29, NaN, NaN, 7, 23, 23, 85, 45, NaN,...
## $ genres <chr> "Documentary,Short", "Comedy,Drama", "Comedy,Ga...
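The conversion above turns the `\N` placeholder into `NaN`. As an alternative sketch, not the approach used in this post, `na_if()` from dplyr can map the placeholder to `NA` before the numeric conversion. The vector `runtime_raw` is a hypothetical example:

```r
library(dplyr)

# Hypothetical raw values, with "\\N" marking a missing runtime
runtime_raw <- c("\\N", "29", "7")

# na_if() replaces the marker with NA, so as.numeric() yields NA rather than NaN
runtime_num <- as.numeric(na_if(runtime_raw, "\\N"))

runtime_num
```

Both `NA` and `NaN` are counted by `is.na()`, so the missing-value checks in this post behave the same either way.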
Displaying the First 6 Rows
datatable(head(movie_df_fix),
          extensions = 'Scroller', options = list(
            deferRender = TRUE,
            scrollY = 200,
            scroller = TRUE
          ))
Inner Joining the Datasets
# join on the shared title ID column
movie_rating_df <- inner_join(movie_df_fix, rating_df, by = "tconst")
datatable(print(movie_rating_df),
          extensions = 'Scroller', options = list(
            deferRender = TRUE,
            scrollY = 200,
            scroller = TRUE
          ))
## # A tibble: 1,376 x 11
## tconst titleType primaryTitle originalTitle isAdult startYear endYear
## <chr> <fct> <chr> <chr> <fct> <dbl> <dbl>
## 1 tt004~ short Lion Down Lion Down 0 1951 NaN
## 2 tt016~ video Wicked Cove~ Wicked Cover~ 1 1998 NaN
## 3 tt657~ tvEpisode Shadow Play~ Shadow Play ~ 0 2017 NaN
## 4 tt694~ tvEpisode RuPaul Roast RuPaul Roast 0 2017 NaN
## 5 tt730~ video UCLA Track ~ UCLA Track &~ 0 2017 NaN
## 6 tt226~ movie The Pin The Pin 0 2013 NaN
## 7 tt087~ tvEpisode Episode #32~ Episode #32.9 0 2006 NaN
## 8 tt281~ tvEpisode Coldest Road Coldest Road 0 2013 NaN
## 9 tt052~ tvEpisode The New Mar~ The New Mars~ 0 1981 NaN
## 10 tt098~ tvSeries Favouritism Favouritism 0 2005 2005
## # ... with 1,366 more rows, and 4 more variables: runtimeMinutes <dbl>,
## # genres <chr>, averageRating <dbl>, numVotes <dbl>
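The join keeps only titles that appear in both tables, which is why the result has 1,376 rows rather than 9,000. A small sketch with two hypothetical tables, `movies` and `ratings`, shows the effect:

```r
library(dplyr)

# Hypothetical title table
movies <- data.frame(
  tconst       = c("tt001", "tt002", "tt003"),
  primaryTitle = c("A", "B", "C"),
  stringsAsFactors = FALSE
)

# Hypothetical rating table; "tt001" has no rating and "tt999" has no title
ratings <- data.frame(
  tconst        = c("tt002", "tt003", "tt999"),
  averageRating = c(7.5, 6.0, 9.0),
  numVotes      = c(120, 15, 40),
  stringsAsFactors = FALSE
)

# Only tconst values present in both tables survive the join
joined <- inner_join(movies, ratings, by = "tconst")

joined$tconst
```

Here only `tt002` and `tt003` remain, each carrying the rating columns alongside the title columns.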
Checking Missing Values
## # A tibble: 1 x 11
## tconst titleType primaryTitle originalTitle isAdult startYear endYear
## <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 0 0 1350
## # ... with 4 more variables: runtimeMinutes <int>, genres <int>,
## # averageRating <int>, numVotes <int>
Dropping Missing Values
Ordering by averageRating
## # A tibble: 10 x 11
## tconst titleType primaryTitle originalTitle isAdult startYear endYear
## <chr> <fct> <chr> <chr> <fct> <dbl> <dbl>
## 1 tt111~ short Coordinated~ Coordinated ~ 0 2019 NaN
## 2 tt726~ tvEpisode Sicilian De~ Sicilian Def~ 0 2016 NaN
## 3 tt906~ tvEpisode Black Hole ~ Black Hole (~ 0 2014 NaN
## 4 tt333~ short Opsporing Opsporing 0 2014 NaN
## 5 tt111~ tvEpisode Agar Tum Sa~ Agar Tum Saa~ 0 2019 NaN
## 6 tt411~ tvEpisode S.O.S. Part~ S.O.S. Part 2 0 2015 NaN
## 7 tt765~ short Clear Skies Under klar h~ 0 2018 NaN
## 8 tt726~ tvEpisode The Case of~ The Case of ~ 0 2016 NaN
## 9 tt220~ video Attack of t~ Attack of th~ 0 2010 NaN
## 10 tt114~ short El-Eicha El-Eicha 0 2010 NaN
## # ... with 4 more variables: runtimeMinutes <dbl>, genres <chr>,
## # averageRating <dbl>, numVotes <dbl>
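The ordering code is not shown in the post. A minimal sketch, assuming a hypothetical `toy_ratings` table, sorts from the highest to the lowest rating and keeps the first rows:

```r
library(dplyr)

# Hypothetical ratings (not the real data)
toy_ratings <- data.frame(
  primaryTitle  = c("A", "B", "C"),
  averageRating = c(6.1, 9.4, 7.8),
  stringsAsFactors = FALSE
)

# Sort by rating in descending order, then keep the top 2 rows
top_rated <- toy_ratings %>%
  arrange(desc(averageRating)) %>%
  head(2)

top_rated$primaryTitle
```

Ranking on the raw average alone ignores how many votes stand behind each rating, which is the gap the score in the next step addresses.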
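v <- movie_rating_df$numVotes                 # votes per title
R <- movie_rating_df$averageRating            # average rating per title
C <- mean(movie_rating_df$averageRating)      # mean rating across all titles
m <- quantile(movie_rating_df$numVotes, 0.8)  # vote threshold: 80th percentile of numVotes
movie_rating_df <- movie_rating_df %>%
  mutate(score = (v/(m+v))*R + (m/(m+v))*C)   # weighted rating
Ordering by score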
## # A tibble: 100 x 12
## tconst titleType primaryTitle originalTitle isAdult startYear endYear
## <chr> <fct> <chr> <chr> <fct> <dbl> <dbl>
## 1 tt411~ tvEpisode S.O.S. Part~ S.O.S. Part 2 0 2015 NaN
## 2 tt220~ video Attack of t~ Attack of th~ 0 2010 NaN
## 3 tt769~ tvEpisode Chapter Sev~ Chapter Seve~ 0 2019 NaN
## 4 tt712~ tvEpisode Chapter Thi~ Chapter Thir~ 0 2018 NaN
## 5 tt053~ tvEpisode The Prom The Prom 0 1999 NaN
## 6 tt839~ tvEpisode Savages Savages 0 2018 NaN
## 7 tt284~ tvEpisode VIII. VIII. 0 2014 NaN
## 8 tt429~ tvSeries Chef's Table Chef's Table 0 2015 NaN
## 9 tt250~ tvEpisode Trial and E~ Trial and Er~ 0 2013 NaN
## 10 tt033~ video AC/DC: Live~ AC/DC: Live ~ 0 1992 NaN
## # ... with 90 more rows, and 5 more variables: runtimeMinutes <dbl>,
## # genres <chr>, averageRating <dbl>, numVotes <dbl>, score <dbl>
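The score computed above is a weighted rating: it blends each title's own rating `R` with the overall mean `C`, and the vote count `v` decides how much weight the title's own rating gets relative to the threshold `m`. A worked sketch with made-up numbers (the values of `C`, `m`, `v`, and `R` below are assumptions, and `weighted_score` is a hypothetical helper name) shows the effect:

```r
# Weighted rating: blend a title's own rating R with the global mean C,
# weighted by its vote count v against a vote threshold m
weighted_score <- function(v, R, C, m) {
  (v / (m + v)) * R + (m / (m + v)) * C
}

C <- 6.0   # assumed global mean rating
m <- 100   # assumed vote threshold

# A perfect rating backed by only 5 votes
few_votes <- weighted_score(v = 5, R = 10, C = C, m = m)

# A slightly lower rating backed by 9,900 votes
many_votes <- weighted_score(v = 9900, R = 9, C = C, m = m)

few_votes   # about 6.19, pulled toward C
many_votes  # 8.97, stays close to R
```

A title with few votes lands near the global mean, while a heavily voted title keeps a score close to its own rating. This is why the ranking by score differs from the ranking by averageRating.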
Recommendations Based on User Preference
Finally, we display the data according to two preferences: whether the title is adult content (1 for adult, 0 for non-adult) and a filter on the start year.
df <- movie_rating_df
# Filter by adult flag and minimum start year, then return the first `top` rows
recom <- function(df, ask_adult, ask_start_year, top = 200) {
  if (ask_adult == 'yes') {
    df <- df %>%
      filter(isAdult == 1)
  } else {
    df <- df %>%
      filter(isAdult == 0)
  }
  df <- df %>%
    filter(startYear >= ask_start_year)
  head(df, top)
}
# extensions and options belong to datatable(), not print()
datatable(print(recom(df,
                      ask_adult = 'no',
                      ask_start_year = 2000,
                      top = 200)),
          extensions = 'Scroller', options = list(
            deferRender = TRUE,
            scrollY = 200,
            scroller = TRUE
          ))
## # A tibble: 200 x 12
## tconst titleType primaryTitle originalTitle isAdult startYear endYear
## <chr> <fct> <chr> <chr> <fct> <dbl> <dbl>
## 1 tt657~ tvEpisode Shadow Play~ Shadow Play ~ 0 2017 NaN
## 2 tt226~ movie The Pin The Pin 0 2013 NaN
## 3 tt087~ tvEpisode Episode #32~ Episode #32.9 0 2006 NaN
## 4 tt098~ tvSeries Favouritism Favouritism 0 2005 2005
## 5 tt244~ movie Point B Point B 0 2013 NaN
## 6 tt310~ movie Lascados Lascados 0 2014 NaN
## 7 tt345~ movie The Gospel ~ The Gospel o~ 0 2014 NaN
## 8 tt028~ movie Far from Ch~ Far from Chi~ 0 2001 NaN
## 9 tt034~ tvSeries Korkeajänni~ Korkeajännit~ 0 2001 NaN
## 10 tt639~ tvEpisode Episode #1.1 Episode #1.1 0 2017 NaN
## # ... with 190 more rows, and 5 more variables: runtimeMinutes <dbl>,
## # genres <chr>, averageRating <dbl>, numVotes <dbl>, score <dbl>
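The rows returned here follow the table's existing order rather than the score. If ranked recommendations are wanted, one possible variant sorts by `score` before taking the top rows. This is a sketch under assumptions: `recom_sorted` and `toy_scored` are hypothetical names, and the data is made up.

```r
library(dplyr)

# Hypothetical scored titles (not the real data)
toy_scored <- data.frame(
  primaryTitle = c("A", "B", "C", "D"),
  isAdult      = factor(c(0, 0, 1, 0)),
  startYear    = c(1995, 2005, 2010, 2018),
  score        = c(8.0, 6.5, 7.0, 7.9),
  stringsAsFactors = FALSE
)

# Hypothetical variant: filter by preference, then rank by score
recom_sorted <- function(df, ask_adult, ask_start_year, top = 200) {
  adult_flag <- if (ask_adult == "yes") 1 else 0
  df %>%
    filter(isAdult == adult_flag, startYear >= ask_start_year) %>%
    arrange(desc(score)) %>%
    head(top)
}

picks <- recom_sorted(toy_scored, ask_adult = "no", ask_start_year = 2000, top = 2)
picks$primaryTitle
```

With this toy data, the non-adult titles from 2000 onward are B and D, and D comes first because its score is higher.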