This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100004 ratings and 1296 tag applications across 9125 movies. These data were created by 671 users between January 09, 1995 and October 16, 2016. This dataset was generated on October 17, 2016. Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.
Objective
This report analyzes movie ratings and tags drawn from the MovieLens database.
Data Source
GroupLens Research has collected and made available rating data sets from the MovieLens web site. The data sets were collected over various periods of time, depending on the size of the set.
Data Variables
The dataset files are written as comma-separated values files with a single header row. Columns that contain commas (,) are escaped using double quotes ("). The data are split across four files: links.csv, movies.csv, ratings.csv and tags.csv.
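To make the quoting rule concrete, here is a minimal sketch that parses a few CSV lines with `fread`. The rows are illustrative: the movieId 99999 and the genre of the second row are made up, and only the title strings are taken from this data set.

```r
library(data.table)

# Illustrative CSV lines. The second data row has a comma inside the title,
# so the whole field is wrapped in double quotes.
# movieId 99999 and its genre are made-up values for this sketch.
csv_lines <- c(
  'movieId,title,genres',
  '1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy',
  '99999,"Big Bang Theory, The (2007-)",Comedy'
)

quoted_demo <- fread(text = csv_lines)

# The quoted comma stays inside the title instead of starting a new column
print(ncol(quoted_demo))
print(quoted_demo$title)
```

The parsed table should have three columns, with the comma preserved inside the second title.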
For this project, the majority of packages used are the standard ones for collecting, tidying, and analyzing data.
## Load Required Packages ##
* data.table ## for importing data
* tidyverse ## for data wrangling (for dplyr and ggplot2)
* stringr ## for string manipulation
* plotly ## for interactive visualization
* DT ## extra styling to the R Markdown table
package_list <- c(
  'data.table',
  'tidyverse',
  'stringr',
  'plotly',
  'DT'
)
## check whether each package is installed; install it if missing, then load it
for (package in package_list) {
  if (!require(package, character.only = TRUE, quietly = TRUE)) {
    install.packages(package, repos = "http://cran.us.r-project.org")
    library(package, character.only = TRUE)
  }
}
Data were imported from all four files, and the structure of each table was examined for column names and data types.
## importing data
links <- fread("links.csv")
movies <- fread("movies.csv")
ratings <- fread("ratings.csv")
tags <- fread("tags.csv")
## taking a glimpse at the data structures
str(links)
## Classes 'data.table' and 'data.frame': 9125 obs. of 3 variables:
## $ movieId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ imdbId : int 114709 113497 113228 114885 113041 113277 114319 112302 114576 113189 ...
## $ tmdbId : int 862 8844 15602 31357 11862 949 11860 45325 9091 710 ...
## - attr(*, ".internal.selfref")=<externalptr>
str(movies)
## Classes 'data.table' and 'data.frame': 9125 obs. of 3 variables:
## $ movieId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ title : chr "Toy Story (1995)" "Jumanji (1995)" "Grumpier Old Men (1995)" "Waiting to Exhale (1995)" ...
## $ genres : chr "Adventure|Animation|Children|Comedy|Fantasy" "Adventure|Children|Fantasy" "Comedy|Romance" "Comedy|Drama|Romance" ...
## - attr(*, ".internal.selfref")=<externalptr>
str(ratings)
## Classes 'data.table' and 'data.frame': 100004 obs. of 4 variables:
## $ userId : int 1 1 1 1 1 1 1 1 1 1 ...
## $ movieId : int 31 1029 1061 1129 1172 1263 1287 1293 1339 1343 ...
## $ rating : num 2.5 3 3 2 4 2 2 2 3.5 2 ...
## $ timestamp: int 1260759144 1260759179 1260759182 1260759185 1260759205 1260759151 1260759187 1260759148 1260759125 1260759131 ...
## - attr(*, ".internal.selfref")=<externalptr>
str(tags)
## Classes 'data.table' and 'data.frame': 1296 obs. of 4 variables:
## $ userId : int 15 15 15 15 15 15 15 15 15 15 ...
## $ movieId : int 339 1955 7478 32892 34162 35957 37729 45950 100365 100365 ...
## $ tag : chr "sandra 'boring' bullock" "dentist" "Cambodia" "Russian" ...
## $ timestamp: int 1138537770 1193435061 1170560997 1170626366 1141391765 1141391873 1141391806 1169616291 1425876220 1425876220 ...
## - attr(*, ".internal.selfref")=<externalptr>
A few titles belong to TV series and therefore carry a range of years rather than a single release year.
## titles that carry a range of years instead of a single year
movies[unlist(regexpr("\\([0-9]{4}\\-",movies$title)) != -1,"title"]
## title
## 1: Big Bang Theory, The (2007-)
## 2: Fawlty Towers (1975-1979)
Movies Data set
Genres were pipe separated values in a single column. For the purpose of analysing various genres, each genre associated with a movie was separated into an individual row.
## convert genres into rows
movies <- movies %>%
  mutate(genre = strsplit(genres, "\\|"),
         movie_title = str_trim(substr(title, start = 1, stop = unlist(regexpr("\\(([0-9\\-]*)\\)$", title)) - 1)),
         movie_year = as.numeric(substr(title, start = unlist(regexpr("\\(([0-9\\-]*)\\)$", title)) + 1, stop = unlist(regexpr("\\(([0-9\\-]*)\\)$", title)) + 4))) %>%
  unnest(genre) %>%
  select(movieId, title, genre, movie_title, movie_year, genres)
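As a small illustration of the reshaping step, the sketch below splits the genres of two movies from the data into one row per genre. It uses base R only, and the names `movies_demo` and `genres_long` are hypothetical; it is not the pipeline used in this report, just a way to see what the long format looks like.

```r
# Two movies taken from the str(movies) output
movies_demo <- data.frame(
  movieId = c(1L, 3L),
  title = c("Toy Story (1995)", "Grumpier Old Men (1995)"),
  genres = c("Adventure|Animation|Children|Comedy|Fantasy", "Comedy|Romance"),
  stringsAsFactors = FALSE
)

# Split on the literal pipe character; one character vector per movie
genre_list <- strsplit(movies_demo$genres, "|", fixed = TRUE)

# Repeat each movieId once per genre so the two columns line up
genres_long <- data.frame(
  movieId = rep(movies_demo$movieId, lengths(genre_list)),
  genre = unlist(genre_list),
  stringsAsFactors = FALSE
)

print(genres_long)
```

Each movie contributes as many rows as it has genres, which is why the row count of `movies` grows after this step.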
The movie_year mutation produced NA values for a few movies whose titles do not include a release year.
## checking which titles were coerced to NA
movies[is.na(as.numeric(substr(movies$title, start = unlist(regexpr("\\(([0-9\\-]*)\\)$",movies$title))+1 , stop= unlist(regexpr("\\(([0-9\\-]*)\\)$",movies$title))+4))),"title"]
## [1] "Hyena Road" "The Lovers and the Despot"
## [3] "Stranger Things" "Women of '69, Unboxed"
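One way to handle single years, year ranges and missing years in a single pass is a regular expression with capture groups. The sketch below is an alternative to the position-based `substr` approach, not the method used in this report; the pattern and the variable names are assumptions made for illustration.

```r
# Titles seen in this report: a single year, two year ranges, and no year
titles <- c(
  "Toy Story (1995)",
  "Fawlty Towers (1975-1979)",
  "Big Bang Theory, The (2007-)",
  "Hyena Road"
)

# Group 1: the title text; group 2: the first four-digit year.
# An optional "-" plus up to four digits covers ranges such as (2007-).
pattern <- "^(.*?)\\s*\\((\\d{4})(?:-\\d{0,4})?\\)\\s*$"

has_year <- grepl(pattern, titles, perl = TRUE)

# Titles without a year are kept unchanged
movie_title <- ifelse(has_year, sub(pattern, "\\1", titles, perl = TRUE), titles)

# Fill the year only where the pattern matched; the rest stay NA
movie_year <- rep(NA_integer_, length(titles))
movie_year[has_year] <- as.integer(sub(pattern, "\\2", titles[has_year], perl = TRUE))

print(data.frame(movie_title, movie_year))
```

For a year range, this sketch keeps the first year, which matches what the four-character `substr` extraction returns.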
Ratings and Tags data set
The timestamps in the Ratings and Tags data sets represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970. They were converted to dates for further analysis.
## extracting date from timestamp for ratings and tags
ratings <- ratings %>%
  mutate(date = as.Date(as.POSIXct(timestamp, origin = "1970-01-01 00:00:00", tz = "GMT")),
         year = as.numeric(substr(date, 1, 4))) %>%
  select(-timestamp)
tags <- tags %>%
  mutate(date = as.Date(as.POSIXct(timestamp, origin = "1970-01-01 00:00:00", tz = "GMT")),
         year = as.numeric(substr(date, 1, 4))) %>%
  select(-timestamp)
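As a worked example of the conversion, the sketch below converts the first rating timestamp from the `str(ratings)` output. It uses `format(..., "%Y")` to pull out the year, which is an alternative to the `substr` call used in this report; the variable names are hypothetical.

```r
# First rating timestamp shown in the str(ratings) output (seconds since the epoch)
ts <- 1260759144

# Interpret the number as seconds since 1970-01-01 00:00:00 UTC
rated_at <- as.POSIXct(ts, origin = "1970-01-01", tz = "UTC")

# Drop the time of day, then extract the year as an integer
rated_on <- as.Date(rated_at, tz = "UTC")
rated_year <- as.integer(format(rated_on, "%Y"))

print(rated_at)
print(rated_on)
print(rated_year)
```

This timestamp falls on 2009-12-14 in UTC, so the derived year is 2009.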
Summary of all four data sets
## summary of all the datasets
summary(movies) ## few NA's in movie_year
## movieId title genre movie_title
## Min. : 1 Length:20340 Length:20340 Length:20340
## 1st Qu.: 2906 Class :character Class :character Class :character
## Median : 6754 Mode :character Mode :character Mode :character
## Mean : 32101
## 3rd Qu.: 58839
## Max. :164979
##
## movie_year genres
## Min. :1902 Length:20340
## 1st Qu.:1985 Class :character
## Median :1998 Mode :character
## Mean :1992
## 3rd Qu.:2006
## Max. :2016
## NA's :4
summary(links) ## few NA's in tmdbId
## movieId imdbId tmdbId
## Min. : 1 Min. : 417 Min. : 2
## 1st Qu.: 2850 1st Qu.: 88846 1st Qu.: 9452
## Median : 6290 Median : 119778 Median : 15852
## Mean : 31123 Mean : 479824 Mean : 39105
## 3rd Qu.: 56274 3rd Qu.: 428441 3rd Qu.: 39161
## Max. :164979 Max. :5794766 Max. :416437
## NA's :13
summary(tags)
## userId movieId tag date
## Min. : 15 Min. : 1 Length:1296 Min. :2006-01-14
## 1st Qu.:346 1st Qu.: 2988 Class :character 1st Qu.:2009-05-27
## Median :431 Median : 26959 Mode :character Median :2012-07-21
## Mean :417 Mean : 42279 Mean :2011-12-19
## 3rd Qu.:547 3rd Qu.: 72268 3rd Qu.:2015-08-24
## Max. :663 Max. :164979 Max. :2016-10-16
## year
## Min. :2006
## 1st Qu.:2009
## Median :2012
## Mean :2011
## 3rd Qu.:2015
## Max. :2016
summary(ratings)
## userId movieId rating date
## Min. : 1 Min. : 1 Min. :0.500 Min. :1995-01-09
## 1st Qu.:182 1st Qu.: 1028 1st Qu.:3.000 1st Qu.:2000-08-09
## Median :367 Median : 2406 Median :4.000 Median :2005-03-10
## Mean :347 Mean : 12549 Mean :3.544 Mean :2005-10-17
## 3rd Qu.:520 3rd Qu.: 5418 3rd Qu.:4.000 3rd Qu.:2011-01-28
## Max. :671 Max. :163949 Max. :5.000 Max. :2016-10-16
## year
## Min. :1995
## 1st Qu.:2000
## Median :2005
## Mean :2005
## 3rd Qu.:2011
## Max. :2016
1. Average rating for each movie released in or after 1996
movies %>%
  filter(movie_year > 1995) %>%
  distinct(movieId, movie_title) %>%
  merge(ratings, by = "movieId", all.x = TRUE) %>%
  select(movieId, movie_title, rating) %>%
  group_by(movieId, movie_title) %>%
  summarize(avg_ratings = round(mean(rating), 2)) %>%
  datatable(options = list(searching = FALSE))
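Because the join keeps every movie (`all.x = TRUE`), a movie without any rating still appears in the result, with an NA average. The base R sketch below shows this on made-up rows; the titles, ids and ratings are hypothetical.

```r
# Made-up data: movie 2 has no ratings at all
movies_demo <- data.frame(
  movieId = c(1L, 2L),
  movie_title = c("Movie A", "Movie B"),
  stringsAsFactors = FALSE
)
ratings_demo <- data.frame(
  movieId = c(1L, 1L),
  rating = c(4, 3)
)

# Left join: all rows of movies_demo are kept, unmatched ratings become NA
joined <- merge(movies_demo, ratings_demo, by = "movieId", all.x = TRUE)

# Mean rating per movie; the unrated movie yields NA
avg_rating <- tapply(joined$rating, joined$movieId, mean)

print(joined)
print(avg_rating)
```

Dropping `all.x = TRUE` would instead remove unrated movies from the table altogether.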
2. Top 5 most reviewed movies every year after 1994
movies %>%
  distinct(movieId, movie_title) %>%
  merge(ratings, by = "movieId") %>%
  select(movieId, movie_title, year) %>%
  filter(year > 1994) %>%
  group_by(movieId, movie_title, year) %>%
  summarize(no_reviews = n()) %>%
  group_by(year) %>%
  mutate(rn = row_number(desc(no_reviews))) %>%
  filter(rn < 6) %>%
  arrange(year, desc(no_reviews)) %>%
  datatable(options = list(searching = FALSE))
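The ranking step uses `row_number`, which assigns distinct ranks even when review counts tie. The sketch below compares it with `min_rank` on the three 1996 review counts that appear in this report's sample output (82, 80, 80); the variable names are hypothetical.

```r
library(dplyr)

# Review counts of three 1996 movies from the report's sample output
no_reviews <- c(82, 80, 80)

# row_number: ties are broken by position, so every row gets a unique rank
rn_first <- row_number(desc(no_reviews))

# min_rank: tied values share the same (lowest) rank
rn_min <- min_rank(desc(no_reviews))

print(rn_first)
print(rn_min)
```

With `filter(rn < 6)`, `row_number` keeps at most five movies per year, while `min_rank` could keep more when several movies tie at the cutoff.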
3. Average rating for “Horror”, “Drama” and “Horror and Drama” movies
a <- ratings %>%
  group_by(movieId) %>%
  summarize(avg_rating = mean(rating)) %>%
  merge(distinct(movies, movieId, movie_title, genres), by = "movieId", all.y = TRUE) %>%
  select(movieId, avg_rating, movie_title, genres) %>%
  mutate(horror = ifelse(grepl("^(.*)(Horror)(.*)$", genres), 1, 0),
         drama = ifelse(grepl("^(.*)(Drama)(.*)$", genres), 1, 0),
         horror_drama = ifelse(grepl("^(.*)(Horror)(.*)$", genres) & grepl("^(.*)(Drama)(.*)$", genres), 1, 0))
## horror
mean(filter(a,horror == 1)$avg_rating, na.rm = TRUE)
## [1] 2.991933
## drama
mean(filter(a,drama == 1)$avg_rating, na.rm = TRUE)
## [1] 3.447417
## horror and drama
mean(filter(a,horror_drama == 1)$avg_rating, na.rm = TRUE)
## [1] 3.190673
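The three flags are not mutually exclusive: a movie flagged as both Horror and Drama also counts toward each single-genre average. The sketch below shows the overlap on three genre strings, using a plain substring match as a simpler equivalent of the anchored patterns. The first string comes from the data; the other two are made up.

```r
# First string is from the movies data; the other two are hypothetical
genres <- c("Comedy|Drama|Romance", "Horror", "Drama|Horror|Thriller")

# fixed = TRUE treats the pattern as a literal substring
horror <- as.integer(grepl("Horror", genres, fixed = TRUE))
drama <- as.integer(grepl("Drama", genres, fixed = TRUE))
horror_drama <- as.integer(horror == 1L & drama == 1L)

print(data.frame(genres, horror, drama, horror_drama))
```

Only the third string sets all three flags, so it would contribute to all three averages.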
4. Number of customers who rated a movie tagged as “horror” by year
tags %>%
  filter(regexpr("[hH][oO][rR][rR][oO][rR]", tag) != -1) %>%
  distinct(movieId) %>%
  merge(ratings, by = "movieId", all.x = TRUE) %>%
  group_by(movieId, year) %>%
  summarize(no_users = n_distinct(userId)) %>%
  merge(distinct(movies, movieId, movie_title), by = "movieId", all.x = TRUE) %>%
  select(movieId, no_users, movie_title, year) %>%
  datatable(options = list(searching = FALSE))
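The bracketed pattern matches "horror" in any letter case. A shorter way to express the same test is `grepl` with `ignore.case = TRUE`; the sketch below checks that both approaches agree on a few tags. The first two tags come from the data and the last two are made up.

```r
# First two tags are from the tags data; the last two are hypothetical
tags_demo <- c("sandra 'boring' bullock", "dentist", "Horror", "psychological HORROR")

# Approach used in this report: spell out both cases for every letter
bracket_match <- regexpr("[hH][oO][rR][rR][oO][rR]", tags_demo) != -1

# Shorter alternative: let grepl ignore the letter case
ignore_case_match <- grepl("horror", tags_demo, ignore.case = TRUE)

print(bracket_match)
print(ignore_case_match)
```

Both approaches are substring matches, so a tag that merely contains the word also counts.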
Trend of movie genres by the release years
The plot shows the number of movies released each year in each genre. A movie that spans multiple genres is counted once in every genre it belongs to.
data <- movies %>%
  filter(genre != "(no genres listed)") %>%
  group_by(genre, movie_year) %>%
  summarize(count_movies = n()) %>%
  ggplot(aes(x = movie_year, y = count_movies, color = genre)) +
  geom_line() +
  ggtitle("Movie Genres by Release Years") +
  labs(x = "Year", y = "No. of Movies")
## to make it interactive used plotly
ggplotly(data)
Genres such as Drama, Comedy, Action, Romance and Thriller grew in popularity, measured by the number of movies released, until around the year 2000, after which they declined.
2. Top 5 most reviewed movies every year after 1994
data <- movies %>%
  distinct(movieId, movie_title) %>%
  merge(ratings, by = "movieId") %>%
  select(movieId, movie_title, year) %>%
  filter(year > 1994) %>%
  group_by(movieId, movie_title, year) %>%
  summarize(no_reviews = n()) %>%
  group_by(year) %>%
  mutate(rn = row_number(desc(no_reviews))) %>%
  filter(rn < 6) %>%
  arrange(year, desc(no_reviews))
head(data)
## # A tibble: 6 x 5
## # Groups: year [2]
## movieId movie_title year no_reviews rn
## <int> <chr> <dbl> <int> <int>
## 1 21 Get Shorty 1995 1 1
## 2 47 Seven (a.k.a. Se7en) 1995 1 2
## 3 1079 Fish Called Wanda, A 1995 1 3
## 4 590 Dances with Wolves 1996 82 1
## 5 380 True Lies 1996 80 2
## 6 592 Batman 1996 80 3
plot <- ggplot(data, aes(x = year, y = no_reviews, text = paste("Movie: ", movie_title))) +
  geom_point() +
  ggtitle("Most Reviewed Movies by Year") +
  labs(x = "Year", y = "No. of Reviews")
ggplotly(plot)