Movie recommendation

Source: Analytics Edge Unit 6 Clustering

Get the data and pre-process it

setwd("C:/Users/jzchen/Documents/Courses/Analytics Edge/Unit_6_Clustering")
movies <- read.table("movieLens.txt", header = FALSE, sep = "|", quote = "\"")
str(movies)

## 'data.frame':    1682 obs. of  24 variables:
##  $ V1 : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ V2 : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1525 618 555 594 344 1318 1545 111 391 1240 ...
##  $ V3 : Factor w/ 241 levels "","01-Aug-1997",..: 71 71 71 71 71 71 71 71 71 182 ...
##  $ V4 : logi  NA NA NA NA NA NA ...
##  $ V5 : Factor w/ 1661 levels "","http://us.imdb.com/M/title-exact/Independence%20(1997)",..: 1431 565 505 543 310 1661 1453 103 357 1183 ...
##  $ V6 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V7 : int  0 1 0 1 0 0 0 0 0 0 ...
##  $ V8 : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ V9 : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ V10: int  1 0 0 0 0 0 0 1 0 0 ...
##  $ V11: int  1 0 0 1 0 0 0 1 0 0 ...
##  $ V12: int  0 0 0 0 1 0 0 0 0 0 ...
##  $ V13: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V14: int  0 0 0 1 1 1 1 1 1 1 ...
##  $ V15: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V16: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V17: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V18: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V19: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V20: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V21: int  0 0 0 0 0 0 1 0 0 0 ...
##  $ V22: int  0 1 1 0 1 0 0 0 0 0 ...
##  $ V23: int  0 0 0 0 0 0 0 0 0 1 ...
##  $ V24: int  0 0 0 0 0 0 0 0 0 0 ...

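The last argument, quote = "\"", restricts quoting to double quotes. By default read.table also treats the apostrophe as a quote character, which can garble titles such as 'Til There Was You (1997).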
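To see what the quote argument does, here is a minimal sketch on a hypothetical two-line sample in the same pipe-separated layout. The sample lines and the name sample_movies are made up for illustration and are not taken from movieLens.txt.

```r
# Hypothetical sample: the first title starts with an apostrophe.
sample_lines <- c("1|'Til There Was You (1997)|0",
                  "2|Toy Story (1995)|1")

# With quote = "\"" only double quotes delimit quoted strings,
# so the leading apostrophe is kept as part of the title.
sample_movies <- read.table(text = sample_lines, header = FALSE, sep = "|",
                            quote = "\"", stringsAsFactors = FALSE)

print(sample_movies$V2)
```

Both rows are read, and the first title keeps its leading apostrophe.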

The text file has no header row, so we add the column names ourselves.

colnames(movies) <- c("ID", "Title", "ReleaseDate", "VideoReleaseDate", "IMDB",
                      "Unknown", "Action", "Adventure", "Animatioin", "Childeren",
                      "Commedy", "Crime", "Documentary", "Drama", "Fantasy",
                      "FilmNoir", "Horror", "Musical", "Mystery", "Romance",
                      "SciFi", "Thriller", "War", "Western")
str(movies)

## 'data.frame':    1682 obs. of  24 variables:
##  $ ID              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Title           : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1525 618 555 594 344 1318 1545 111 391 1240 ...
##  $ ReleaseDate     : Factor w/ 241 levels "","01-Aug-1997",..: 71 71 71 71 71 71 71 71 71 182 ...
##  $ VideoReleaseDate: logi  NA NA NA NA NA NA ...
##  $ IMDB            : Factor w/ 1661 levels "","http://us.imdb.com/M/title-exact/Independence%20(1997)",..: 1431 565 505 543 310 1661 1453 103 357 1183 ...
##  $ Unknown         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Action          : int  0 1 0 1 0 0 0 0 0 0 ...
##  $ Adventure       : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ Animatioin      : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ Childeren       : int  1 0 0 0 0 0 0 1 0 0 ...
##  $ Commedy         : int  1 0 0 1 0 0 0 1 0 0 ...
##  $ Crime           : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Documentary     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Drama           : int  0 0 0 1 1 1 1 1 1 1 ...
##  $ Fantasy         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FilmNoir        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Horror          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Musical         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Mystery         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Romance         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SciFi           : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Thriller        : int  0 1 1 0 1 0 0 0 0 0 ...
##  $ War             : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ Western         : int  0 0 0 0 0 0 0 0 0 0 ...
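As an alternative sketch, read.table can also assign the names while reading, through its col.names argument. The three-column sample and the names in col_names below are assumptions for illustration only; the real file has 24 columns.

```r
# Hypothetical three-column sample (the real file has 24 columns).
sample_lines <- c("1|Toy Story (1995)|0",
                  "2|GoldenEye (1995)|1")
col_names <- c("ID", "Title", "Action")

# col.names sets the column names at read time, so no separate
# colnames() call is needed afterwards.
named_sample <- read.table(text = sample_lines, header = FALSE, sep = "|",
                           quote = "\"", col.names = col_names,
                           stringsAsFactors = FALSE)

# Guard against a mismatch between the names and the columns read.
stopifnot(ncol(named_sample) == length(col_names))
str(named_sample)
```

Either approach gives the same result; setting the names afterwards, as done above, makes it easier to inspect the raw columns first.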

Remove the variables we will not use for clustering: the ID, the two release dates, and the IMDB link.

movies$ID <- NULL
movies$ReleaseDate <- NULL
movies$VideoReleaseDate <- NULL
movies$IMDB <- NULL
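The same removal can be sketched in one step by filtering on column names. The data frame toy and the vector drop_cols are hypothetical stand-ins, not part of the original analysis.

```r
# Hypothetical toy data frame with two columns we want to drop.
toy <- data.frame(ID = 1:2,
                  Title = c("A", "B"),
                  ReleaseDate = c("01-Jan-1995", "01-Jan-1995"),
                  Action = c(0L, 1L),
                  stringsAsFactors = FALSE)

# Keep every column whose name is not listed in drop_cols.
drop_cols <- c("ID", "ReleaseDate")
toy <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

print(names(toy))
```

Assigning NULL column by column is just as valid; the name-based filter is mainly convenient when the list of columns to drop is long.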

There are also a few duplicate entries in the data set, which we remove with the unique function. This drops 18 rows, taking the data from 1682 to 1664 observations.

movies <- unique(movies)
str(movies)

## 'data.frame':    1664 obs. of  20 variables:
##  $ Title      : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1525 618 555 594 344 1318 1545 111 391 1240 ...
##  $ Unknown    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Action     : int  0 1 0 1 0 0 0 0 0 0 ...
##  $ Adventure  : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ Animatioin : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ Childeren  : int  1 0 0 0 0 0 0 1 0 0 ...
##  $ Commedy    : int  1 0 0 1 0 0 0 1 0 0 ...
##  $ Crime      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Documentary: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Drama      : int  0 0 0 1 1 1 1 1 1 1 ...
##  $ Fantasy    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FilmNoir   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Horror     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Musical    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Mystery    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Romance    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SciFi      : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Thriller   : int  0 1 1 0 1 0 0 0 0 0 ...
##  $ War        : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ Western    : int  0 0 0 0 0 0 0 0 0 0 ...
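Before calling unique, it can help to count how many rows would be dropped. A minimal sketch with duplicated on a hypothetical toy data frame (toy_movies, n_dupes and deduped are illustrative names):

```r
# Hypothetical toy data: the third row repeats the first one exactly.
toy_movies <- data.frame(Title = c("A", "B", "A"),
                         Action = c(1L, 0L, 1L),
                         stringsAsFactors = FALSE)

# duplicated() flags rows that repeat an earlier row in every column.
n_dupes <- sum(duplicated(toy_movies))

# unique() keeps the first occurrence of each distinct row.
deduped <- unique(toy_movies)

print(n_dupes)
print(nrow(deduped))
```

Note that both functions compare whole rows, so two rows with the same title but different genre flags would both be kept.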