Data Set description

This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100004 ratings and 1296 tag applications across 9125 movies. These data were created by 671 users between January 09, 1995 and October 16, 2016. This dataset was generated on October 17, 2016. Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

Objective

This report analyzes movie ratings and tags drawn from the MovieLens database.

Data Source

GroupLens Research has collected and made available rating data sets from the MovieLens web site. The data sets were collected over various periods of time, depending on the size of the set.

Data Variables

The dataset files are written as comma-separated values files with a single header row. Columns that contain commas (,) are escaped using double-quotes ("). The dataset consists of the following four files, with the attributes listed below:

  1. Links: Identifiers that can be used to link to other sources of movie data are contained in the file links.csv. It contains the following columns:
    • movieId
    • imdbId
    • tmdbId
  2. Movies: Movie titles are entered manually or imported from (https://www.themoviedb.org/), and include the year of release in parentheses.
    • movieId
    • title
    • genres
  3. Ratings: Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).
    • movieId
    • userId
    • rating
    • timestamp
  4. Tags: Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.
    • movieId
    • userId
    • tag
    • timestamp
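
The quoting rule described above can be illustrated with a small sketch. The inline CSV text below is an illustrative stand-in for movies.csv, and base R's read.csv is used so that the sketch has no package dependencies; the names csv_text and sample_movies are hypothetical.

```r
# Illustrative stand-in for a few lines of movies.csv (not read from disk).
# The second title contains a comma, so it is wrapped in double quotes.
csv_text <- 'movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
11,"American President, The (1995)",Comedy|Drama|Romance
'

# read.csv treats double quotes as the quoting character by default,
# so the quoted comma stays inside the title field.
sample_movies <- read.csv(text = csv_text, stringsAsFactors = FALSE)

print(ncol(sample_movies))     # three columns, despite the extra comma
print(sample_movies$title[2])
```

Because the quoted field is parsed as a single value, the title keeps its comma and the row still has exactly three columns.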

Packages Required

For this project, the majority of packages used are the standard ones for collecting, tidying, and analyzing data.

## Load Required Packages ##

* data.table ## for importing data
* tidyverse ## for data wrangling (for dplyr and ggplot2)
* stringr ## for string manipulation
* plotly ## for interactive visualization
* DT ## extra styling to the R Markdown table
package_list <-
  c(
    'data.table',
    'tidyverse',
    'stringr',
    'plotly',
    'DT'
  )
## checks whether each package is installed, installs it if missing, and then loads it
for (package in package_list) {
  if (!require(package, character.only = TRUE, quietly = TRUE)) {
    install.packages(package, repos = "http://cran.us.r-project.org")
    library(package, character.only = TRUE)
  }
}

Data Cleaning and Preparation

Data were imported from all four files, and the structure of each table was examined for column names and data types.

## importing data 

links <- fread("links.csv")
movies <- fread("movies.csv")
ratings <- fread("ratings.csv")
tags <- fread("tags.csv")
## taking a glimpse at the data structures
str(links)
## Classes 'data.table' and 'data.frame':   9125 obs. of  3 variables:
##  $ movieId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ imdbId : int  114709 113497 113228 114885 113041 113277 114319 112302 114576 113189 ...
##  $ tmdbId : int  862 8844 15602 31357 11862 949 11860 45325 9091 710 ...
##  - attr(*, ".internal.selfref")=<externalptr>
str(movies)
## Classes 'data.table' and 'data.frame':   9125 obs. of  3 variables:
##  $ movieId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ title  : chr  "Toy Story (1995)" "Jumanji (1995)" "Grumpier Old Men (1995)" "Waiting to Exhale (1995)" ...
##  $ genres : chr  "Adventure|Animation|Children|Comedy|Fantasy" "Adventure|Children|Fantasy" "Comedy|Romance" "Comedy|Drama|Romance" ...
##  - attr(*, ".internal.selfref")=<externalptr>
str(ratings)
## Classes 'data.table' and 'data.frame':   100004 obs. of  4 variables:
##  $ userId   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ movieId  : int  31 1029 1061 1129 1172 1263 1287 1293 1339 1343 ...
##  $ rating   : num  2.5 3 3 2 4 2 2 2 3.5 2 ...
##  $ timestamp: int  1260759144 1260759179 1260759182 1260759185 1260759205 1260759151 1260759187 1260759148 1260759125 1260759131 ...
##  - attr(*, ".internal.selfref")=<externalptr>
str(tags)
## Classes 'data.table' and 'data.frame':   1296 obs. of  4 variables:
##  $ userId   : int  15 15 15 15 15 15 15 15 15 15 ...
##  $ movieId  : int  339 1955 7478 32892 34162 35957 37729 45950 100365 100365 ...
##  $ tag      : chr  "sandra 'boring' bullock" "dentist" "Cambodia" "Russian" ...
##  $ timestamp: int  1138537770 1193435061 1170560997 1170626366 1141391765 1141391873 1141391806 1169616291 1425876220 1425876220 ...
##  - attr(*, ".internal.selfref")=<externalptr>

A few titles belonged to TV series and therefore carried a range of years rather than a single release year.

## few values where there are multiple years
movies[unlist(regexpr("\\([0-9]{4}\\-",movies$title)) != -1,"title"]
##                           title
## 1: Big Bang Theory, The (2007-)
## 2:    Fawlty Towers (1975-1979)
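
As a minimal sketch of the same check, the pattern can be tried on a handful of titles with grepl, which returns a logical vector directly instead of match positions. The vector sample_titles is a hypothetical sample built from titles shown in this report.

```r
# Hypothetical sample of titles; two of them carry a year range.
sample_titles <- c(
  "Toy Story (1995)",
  "Big Bang Theory, The (2007-)",
  "Fawlty Towers (1975-1979)"
)

# An opening parenthesis, four digits, then a hyphen marks a year range.
has_year_range <- grepl("\\([0-9]{4}-", sample_titles)

print(sample_titles[has_year_range])
```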

Movies Data set

Genres were pipe separated values in a single column. For the purpose of analysing various genres, each genre associated with a movie was separated into an individual row.

## convert genres into rows
movies <- movies %>%
          mutate(genre = strsplit(genres,"\\|")
                 ,movie_title =  str_trim(substr(title, start = 1 , stop= unlist(regexpr("\\(([0-9\\-]*)\\)$",title)) -1))
                 ,movie_year =  as.numeric(substr(title, start = unlist(regexpr("\\(([0-9\\-]*)\\)$",title))+1 , stop= unlist(regexpr("\\(([0-9\\-]*)\\)$",title))+4))) %>%
          unnest(genre) %>%
          select(movieId,title,genre,movie_title,movie_year,genres)
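
The reshaping of genres into rows can be illustrated on a toy table. This is a sketch in base R, not the strsplit and unnest pipeline used above; the data frame sample_movies and the result movies_long are hypothetical names.

```r
# Toy input: one row per movie, genres pipe-separated in a single column.
sample_movies <- data.frame(
  movieId = c(1L, 3L),
  genres  = c("Adventure|Animation|Children", "Comedy|Romance"),
  stringsAsFactors = FALSE
)

# Split on the literal pipe; fixed = TRUE avoids having to escape it.
genre_list <- strsplit(sample_movies$genres, "|", fixed = TRUE)

# Repeat each movieId once per genre so that every genre gets its own row.
movies_long <- data.frame(
  movieId = rep(sample_movies$movieId, lengths(genre_list)),
  genre   = unlist(genre_list),
  stringsAsFactors = FALSE
)

print(movies_long)
```

The long table has one row per movie and genre pair, which is why the movies table grows from 9125 rows to 20340 rows after this step.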

The mutation above produced some NA values in movie_year because a few movies had no release year in their title.

## checking which titles were coerced to NA
movies[is.na(as.numeric(substr(movies$title, start = unlist(regexpr("\\(([0-9\\-]*)\\)$",movies$title))+1 , stop= unlist(regexpr("\\(([0-9\\-]*)\\)$",movies$title))+4))),"title"]
## [1] "Hyena Road"                "The Lovers and the Despot"
## [3] "Stranger Things"           "Women of '69, Unboxed"
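
One possible way to handle titles without a year explicitly is sketched below. It assumes the year, when present, is always the last parenthesized group in the title; the names sample_titles, year_pattern, movie_year, and movie_title are illustrative. Unlike the position-based substr approach, it leaves the title intact when no year is found.

```r
# Hypothetical sample: a plain year, a year range, and a title with no year.
sample_titles <- c("Toy Story (1995)", "Fawlty Towers (1975-1979)", "Stranger Things")

# Trailing "(YYYY)", "(YYYY-)" or "(YYYY-YYYY)"; the first four digits are captured.
year_pattern <- "\\(([0-9]{4})[0-9-]*\\)$"
has_year <- grepl(year_pattern, sample_titles)

# Fill the year only where the pattern matched, so nothing is coerced to NA.
movie_year <- rep(NA_real_, length(sample_titles))
movie_year[has_year] <- as.numeric(
  sub(paste0("^.*", year_pattern), "\\1", sample_titles[has_year])
)

# Drop the trailing year group; titles without one pass through unchanged.
movie_title <- trimws(sub(year_pattern, "", sample_titles))

print(data.frame(movie_title, movie_year))
```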

Ratings and Tags data set

The timestamp in the Ratings and Tags data sets represents seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970. It was converted to a date for further analysis.

## extracting date from timestamp for ratings and tags
ratings <- ratings %>%
          mutate(date = as.Date(as.POSIXct(timestamp,origin="1970-01-01 00:00:00",tz = "GMT"))
                 ,year = as.numeric(substr(date,1,4))) %>%
          select(-timestamp)

tags <- tags %>%
  mutate(date = as.Date(as.POSIXct(timestamp,origin="1970-01-01 00:00:00",tz = "GMT"))
         ,year = as.numeric(substr(date,1,4))) %>%
  select(-timestamp)
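
As a quick check of the conversion, the sketch below applies the same idea to a single value, the first rating timestamp shown in the str(ratings) output. The variable names are illustrative, and format with "%Y" is used here as an alternative to substr for extracting the year.

```r
# First rating timestamp from the str(ratings) output, in seconds since the epoch.
sample_timestamp <- 1260759144

# Interpret the seconds as a UTC instant, then keep only the calendar date.
sample_datetime <- as.POSIXct(sample_timestamp, origin = "1970-01-01", tz = "UTC")
sample_date <- as.Date(sample_datetime)

# Extract the year from the date.
sample_year <- as.numeric(format(sample_date, "%Y"))

print(sample_date)
print(sample_year)
```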

Summary of all four data sets

## summary of all the datasets

summary(movies) ## few NA's in movie_year
##     movieId          title              genre           movie_title       
##  Min.   :     1   Length:20340       Length:20340       Length:20340      
##  1st Qu.:  2906   Class :character   Class :character   Class :character  
##  Median :  6754   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 32101                                                           
##  3rd Qu.: 58839                                                           
##  Max.   :164979                                                           
##                                                                           
##    movie_year      genres         
##  Min.   :1902   Length:20340      
##  1st Qu.:1985   Class :character  
##  Median :1998   Mode  :character  
##  Mean   :1992                     
##  3rd Qu.:2006                     
##  Max.   :2016                     
##  NA's   :4
summary(links) ## few NA's in tmdbId
##     movieId           imdbId            tmdbId      
##  Min.   :     1   Min.   :    417   Min.   :     2  
##  1st Qu.:  2850   1st Qu.:  88846   1st Qu.:  9452  
##  Median :  6290   Median : 119778   Median : 15852  
##  Mean   : 31123   Mean   : 479824   Mean   : 39105  
##  3rd Qu.: 56274   3rd Qu.: 428441   3rd Qu.: 39161  
##  Max.   :164979   Max.   :5794766   Max.   :416437  
##                                     NA's   :13
summary(tags)
##      userId       movieId           tag                 date           
##  Min.   : 15   Min.   :     1   Length:1296        Min.   :2006-01-14  
##  1st Qu.:346   1st Qu.:  2988   Class :character   1st Qu.:2009-05-27  
##  Median :431   Median : 26959   Mode  :character   Median :2012-07-21  
##  Mean   :417   Mean   : 42279                      Mean   :2011-12-19  
##  3rd Qu.:547   3rd Qu.: 72268                      3rd Qu.:2015-08-24  
##  Max.   :663   Max.   :164979                      Max.   :2016-10-16  
##       year     
##  Min.   :2006  
##  1st Qu.:2009  
##  Median :2012  
##  Mean   :2011  
##  3rd Qu.:2015  
##  Max.   :2016
summary(ratings)
##      userId       movieId           rating           date           
##  Min.   :  1   Min.   :     1   Min.   :0.500   Min.   :1995-01-09  
##  1st Qu.:182   1st Qu.:  1028   1st Qu.:3.000   1st Qu.:2000-08-09  
##  Median :367   Median :  2406   Median :4.000   Median :2005-03-10  
##  Mean   :347   Mean   : 12549   Mean   :3.544   Mean   :2005-10-17  
##  3rd Qu.:520   3rd Qu.:  5418   3rd Qu.:4.000   3rd Qu.:2011-01-28  
##  Max.   :671   Max.   :163949   Max.   :5.000   Max.   :2016-10-16  
##       year     
##  Min.   :1995  
##  1st Qu.:2000  
##  Median :2005  
##  Mean   :2005  
##  3rd Qu.:2011  
##  Max.   :2016

Data Wrangling

1. Average rating for each movie released in or after 1996

movies %>%
  filter(movie_year > 1995) %>%
  distinct(movieId,movie_title) %>%
  merge(ratings, by="movieId", all.x = TRUE) %>%
  select(movieId,movie_title,rating) %>%
  group_by(movieId,movie_title) %>%
  summarize(avg_ratings =round(mean(rating),2)) %>%
  datatable(options = list(searching = FALSE))
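
The behaviour of the left join in this query can be sketched on toy data. The tables below are hypothetical, and base R's aggregate and merge stand in for the dplyr verbs. The point of the sketch is that a movie with no ratings stays in the result with an NA average.

```r
# Hypothetical toy tables: movie 3 has no ratings at all.
sample_movies <- data.frame(
  movieId     = c(1L, 2L, 3L),
  movie_title = c("Movie A", "Movie B", "Movie C"),
  movie_year  = c(1995, 1996, 2001),
  stringsAsFactors = FALSE
)
sample_ratings <- data.frame(
  userId  = c(1L, 2L, 1L),
  movieId = c(1L, 2L, 2L),
  rating  = c(4, 3, 5)
)

# Keep movies released after 1995, then compute the mean rating per movie.
recent_movies <- sample_movies[sample_movies$movie_year > 1995, ]
avg_by_movie <- aggregate(rating ~ movieId, data = sample_ratings, FUN = mean)
names(avg_by_movie)[2] <- "avg_rating"

# all.x = TRUE keeps unrated movies, which get NA as their average.
avg_ratings <- merge(recent_movies, avg_by_movie, by = "movieId", all.x = TRUE)

print(avg_ratings)
```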

2. Top 5 most reviewed movies every year after 1994

movies %>%
  distinct(movieId,movie_title) %>%
  merge(ratings, by="movieId") %>%
  select(movieId,movie_title,year) %>%
  filter(year>1994) %>%
  group_by(movieId,movie_title,year) %>%
  summarize(no_reviews =n()) %>%
  group_by(year) %>%
  mutate(rn =row_number(desc(no_reviews))) %>%
  filter(rn<6) %>%
  arrange(year,desc(no_reviews)) %>%
  datatable(options = list(searching = FALSE))
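
The ranking step can be sketched in base R on a few hypothetical counts. Like row_number, rank with ties.method = "first" breaks ties by row order, so when two movies are tied at the cutoff only one of them is kept. The data frame review_counts and the cutoff of 2 are illustrative.

```r
# Hypothetical review counts per movie and rating year.
review_counts <- data.frame(
  year       = c(1996, 1996, 1996, 1997, 1997),
  movieId    = c(590, 380, 592, 1, 2),
  no_reviews = c(82, 80, 80, 10, 20)
)

# Rank within each year, highest count first; ties go to the earlier row.
review_counts$rn <- ave(
  -review_counts$no_reviews,
  review_counts$year,
  FUN = function(x) rank(x, ties.method = "first")
)

# Keep the top 2 per year (the report uses the top 5) and sort for display.
top_reviewed <- review_counts[review_counts$rn <= 2, ]
top_reviewed <- top_reviewed[order(top_reviewed$year, -top_reviewed$no_reviews), ]

print(top_reviewed)
```

In this toy data, movies 380 and 592 are tied at 80 reviews in 1996, and only 380 survives the cutoff because it appears first.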

3. Average rating for “Horror”, “Drama” and “Horror and Drama” movies

a <- ratings%>%
  group_by(movieId) %>%
  summarize(avg_rating = mean(rating)) %>%
  merge( distinct(movies,movieId,movie_title,genres),by="movieId", all.y = TRUE) %>%
  select(movieId,avg_rating,movie_title,genres) %>%
  mutate(horror = ifelse(grepl("^(.*)(Horror)(.*)$",genres),1,0),
         drama = ifelse(grepl("^(.*)(Drama)(.*)$",genres),1,0),
         horror_drama = ifelse(grepl("^(.*)(Horror)(.*)$",genres) & grepl("^(.*)(Drama)(.*)$",genres),1,0))
## horror
  mean(filter(a,horror == 1)$avg_rating, na.rm = TRUE)
## [1] 2.991933
## drama
  mean(filter(a,drama == 1)$avg_rating, na.rm = TRUE)
## [1] 3.447417
## horror and drama
  mean(filter(a,horror_drama == 1)$avg_rating, na.rm = TRUE)
## [1] 3.190673
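
The flagging logic can be sketched on toy data as follows. The table sample_avg is hypothetical, and fixed-string matching with grepl replaces the anchored regular expressions used above, which match the same rows. Note that each figure is a mean of per-movie averages, so every movie carries equal weight regardless of how many ratings it received.

```r
# Hypothetical per-movie averages with pipe-separated genres.
sample_avg <- data.frame(
  genres     = c("Horror|Thriller", "Drama", "Drama|Horror", "Comedy"),
  avg_rating = c(3, 4, 3.5, NA),
  stringsAsFactors = FALSE
)

# A movie belongs to a genre if the genre name appears anywhere in the string.
is_horror <- grepl("Horror", sample_avg$genres, fixed = TRUE)
is_drama  <- grepl("Drama", sample_avg$genres, fixed = TRUE)

# Mean of per-movie averages for each group, ignoring unrated movies.
horror_mean       <- mean(sample_avg$avg_rating[is_horror], na.rm = TRUE)
drama_mean        <- mean(sample_avg$avg_rating[is_drama], na.rm = TRUE)
horror_drama_mean <- mean(sample_avg$avg_rating[is_horror & is_drama], na.rm = TRUE)

print(c(horror = horror_mean, drama = drama_mean, horror_drama = horror_drama_mean))
```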

4. Number of users who rated a movie tagged as “horror”, by year

  tags %>%
    filter(regexpr("[hH][oO][rR][rR][oO][rR]",tag) != -1) %>%
    distinct(movieId) %>% 
    merge(ratings, by="movieId", all.x = TRUE) %>%
    group_by(movieId,year) %>%
    summarize(no_users = n_distinct(userId)) %>%
    merge( distinct(movies,movieId,movie_title),by="movieId",all.x = TRUE) %>%
    select(movieId,no_users,movie_title,year) %>%
    datatable(options = list(searching = FALSE))
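
The tag filter and the distinct-user count can be sketched on toy data. The tables below are hypothetical, and grepl with ignore.case = TRUE is used as a shorter equivalent of the character-class pattern above.

```r
# Hypothetical tags and ratings.
sample_tags <- data.frame(
  movieId = c(1L, 2L, 3L, 1L),
  tag     = c("Horror", "funny", "horror comedy", "HORROR"),
  stringsAsFactors = FALSE
)
sample_ratings <- data.frame(
  userId  = c(1L, 2L, 2L, 3L, 1L),
  movieId = c(1L, 1L, 1L, 3L, 2L),
  year    = c(2010, 2010, 2011, 2010, 2010)
)

# Movies with at least one tag containing "horror", in any letter case.
horror_ids <- unique(sample_tags$movieId[grepl("horror", sample_tags$tag, ignore.case = TRUE)])

# Count distinct users per movie and rating year among those movies.
horror_ratings <- sample_ratings[sample_ratings$movieId %in% horror_ids, ]
users_by_year <- aggregate(
  userId ~ movieId + year,
  data = horror_ratings,
  FUN = function(u) length(unique(u))
)
names(users_by_year)[3] <- "no_users"

print(users_by_year)
```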

Data Plots

1. Trend of movie genres by release year

The plot shows the number of movies released each year in each genre. A movie that belongs to multiple genres is counted once in every genre it belongs to.

data <- movies %>%
    filter(genre != "(no genres listed)") %>%
    group_by(genre,movie_year) %>%
    summarize(count_movies= n()) %>%
    ggplot(aes(x=movie_year,y=count_movies,color=genre))+
    geom_line()+ ggtitle("Movie Genres by Release Years") +
  labs(x="Year",y="No. of Movies")

## to make it interactive used plotly    
ggplotly(data)  

The number of releases in genres such as Drama, Comedy, Action, Romance and Thriller increased over time until the year 2000 and declined thereafter.

2. Top 5 most reviewed movies every year after 1994

data <- movies %>%
  distinct(movieId,movie_title) %>%
  merge(ratings, by="movieId") %>%
  select(movieId,movie_title,year) %>%
  filter(year>1994) %>%
  group_by(movieId,movie_title,year) %>%
  summarize(no_reviews =n()) %>%
  group_by(year) %>%
  mutate(rn =row_number(desc(no_reviews))) %>%
  filter(rn<6) %>%
  arrange(year,desc(no_reviews)) 
head(data)
## # A tibble: 6 x 5
## # Groups:   year [2]
##   movieId          movie_title  year no_reviews    rn
##     <int>                <chr> <dbl>      <int> <int>
## 1      21           Get Shorty  1995          1     1
## 2      47 Seven (a.k.a. Se7en)  1995          1     2
## 3    1079 Fish Called Wanda, A  1995          1     3
## 4     590   Dances with Wolves  1996         82     1
## 5     380            True Lies  1996         80     2
## 6     592               Batman  1996         80     3
plot <- ggplot(data, aes(x = year, y = no_reviews, text = paste("Movie: ",movie_title) )) +
  geom_point() + ggtitle("Most Reviewed Movies by Year") +
  labs(x="Year",y="No. of Reviews")

ggplotly(plot)