Data Analysis on Amazon Movie Reviews

1 Introduction

In this project, we explore the Amazon movie review data and build a movie recommender Shiny app based on collaborative filtering. First, we read the raw data and conduct a basic descriptive analysis. Then we apply sentiment analysis and a topic model to the processed data and report the results. Finally, we build a recommender system that suggests movies to users according to their preferences.

2 Data Processing

We use Python to read the raw data and filter it down to the 250 most popular movies (we define ‘popular’ as having a large number of reviews). The raw data gives an overall picture of people’s reviews, while the filtered data is more useful for recommending movies. We therefore focus mainly on the filtered data set and use the IMDB API in R to retrieve more information about the chosen movies.
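The filtering step described above can be sketched as follows. This is only an illustration, not the project's actual script: the record layout (dictionaries with a `product_id` key) and the function names `most_reviewed` and `filter_reviews` are assumptions made for the example.

```python
from collections import Counter


def most_reviewed(reviews, n=250):
    """Return the ids of the n products with the most reviews."""
    counts = Counter(r["product_id"] for r in reviews)
    # most_common sorts by review count, largest first
    return [pid for pid, _ in counts.most_common(n)]


def filter_reviews(reviews, keep_ids):
    """Keep only the reviews that belong to the selected products."""
    keep = set(keep_ids)
    return [r for r in reviews if r["product_id"] in keep]


# Toy records standing in for the raw data (hypothetical values)
reviews = [
    {"product_id": "A", "score": 5.0},
    {"product_id": "A", "score": 4.0},
    {"product_id": "B", "score": 3.0},
    {"product_id": "C", "score": 5.0},
    {"product_id": "C", "score": 1.0},
    {"product_id": "C", "score": 4.0},
]

top = most_reviewed(reviews, n=2)
filtered = filter_reviews(reviews, top)
```

With the full data set, calling `most_reviewed(reviews)` with the default `n=250` would give the list of popular movies under this definition.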

3 Descriptive Statistics

First, we investigate the raw data, which contains more than 700,000 observations, to obtain an overall picture.

The distributions above show that customer review scores are concentrated at 5, while helpfulness values are concentrated at 0, 0.5 and 1.
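The report does not say how helpfulness is computed. A minimal sketch, assuming the raw field is a string of the form "helpful votes/total votes" (an assumption about the data layout, with `helpfulness_ratio` as a hypothetical helper name), would turn it into a ratio between 0 and 1:

```python
def helpfulness_ratio(raw):
    """Convert a 'helpful/total' string into a ratio in [0, 1].

    Returns None when nobody voted, since the ratio is undefined.
    The 'helpful/total' format is an assumption about the raw data.
    """
    helpful, total = (int(part) for part in raw.split("/"))
    if total == 0:
        return None
    return helpful / total


# Toy values (hypothetical)
ratios = [helpfulness_ratio(s) for s in ["0/1", "1/2", "9/9", "0/0"]]
```

Under this assumption, reviews with only one or two votes can only produce the values 0, 0.5 and 1, which may be one reason those values stand out in the distribution.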

This scatter plot shows the relationship between the number of reviews and the review score for each movie (the review score is the mean over that movie’s reviews). Movies with a large number of reviews tend to have high, but not extremely high, scores; movies with few reviews have scores spread fairly evenly across the range.
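The two quantities plotted against each other, the number of reviews and the mean review score per movie, could be computed along these lines. The record layout and the name `per_movie_summary` are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import mean


def per_movie_summary(reviews):
    """Map each product id to (number of reviews, mean review score)."""
    scores = defaultdict(list)
    for r in reviews:
        scores[r["product_id"]].append(r["score"])
    return {pid: (len(s), mean(s)) for pid, s in scores.items()}


# Toy records (hypothetical values)
reviews = [
    {"product_id": "A", "score": 5.0},
    {"product_id": "A", "score": 4.0},
    {"product_id": "B", "score": 3.0},
]

summary = per_movie_summary(reviews)
```

Each value in `summary` is one point of the scatter plot: the review count on one axis and the mean score on the other.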

The most popular movies are below:

Next, we explore the chosen movies and look at their countries of production and genres.

All 203 of the chosen movies were produced in the USA, and some were co-produced with other countries such as the UK, Spain and Mexico.

Then we create a bar plot showing the number of movies in each genre. There are 18 genres in all, and one movie may belong to several genres. For example, ‘Sin City’ belongs to both ‘Crime’ and ‘Thriller’. The bar plot shows that, among the chosen movies, ‘Drama’ is the most common genre and ‘Adventure’ also accounts for a large share, while ‘Animation’ and ‘History’ contain the fewest movies.
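Because one movie can carry several genres, the bar heights count genre memberships rather than movies, so they add up to more than the number of movies. A small sketch of that counting, with hypothetical movie entries apart from ‘Sin City’ and an assumed helper name `genre_counts`:

```python
from collections import Counter


def genre_counts(movie_genres):
    """Count how many movies fall into each genre.

    movie_genres maps a movie title to its list of genres; a movie
    with several genres contributes one count to each of them.
    """
    counts = Counter()
    for genres in movie_genres.values():
        counts.update(set(genres))  # set() guards against repeated labels
    return counts


# 'Sin City' is from the text; the other two entries are made up
movie_genres = {
    "Sin City": ["Crime", "Thriller"],
    "Hypothetical Movie A": ["Drama"],
    "Hypothetical Movie B": ["Drama", "Adventure"],
}

counts = genre_counts(movie_genres)
```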

4 Models

4.1 Sentiment Analysis

This plot depicts the distribution of the mean sentiment values of the top 250 movies. The selected movies have high sentiment values overall, with a median of 0.8390107. The density has two peaks, which we believe indicate two different tiers of movies. Movies with sentiment values above 0.9 are widely liked and popular among all customers, while movies with sentiment values between 0.7 and 0.8 form a second type that appeals only to a particular group of customers.

In the scatter plot above, the x-axis is the sentiment score of each movie and the y-axis is the corresponding review score. The two scores have a positive relationship, since the points scatter around an approximately straight line. In other words, the sentiment scores are consistent with the review scores, which supports our sentiment analysis, and we could ‘guess’ the review score by applying sentiment analysis to the review text.
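One way to quantify the positive relationship seen in the scatter plot is the Pearson correlation together with a least-squares line, which is also what ‘guessing’ the review score from the sentiment score would amount to. The report does not state that these were computed, so the sketch below is an illustration on made-up numbers, with `pearson` and `fit_line` as hypothetical helper names.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


# Made-up per-movie values, not taken from the report
sentiment = [0.6, 0.7, 0.8, 0.9]
review_score = [3.0, 3.5, 4.0, 4.5]

r = pearson(sentiment, review_score)
slope, intercept = fit_line(sentiment, review_score)
predicted = slope * 0.85 + intercept  # guessed review score for sentiment 0.85
```

A correlation close to 1 would correspond to points lying near a rising straight line, as described for the plot.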

In terms of sentiment score, the top 3 movies are Jeff Dunham: Spark of Insanity, Tangled and The Notebook:

The bottom 3 movies are Cloverfield, Prometheus and Bridesmaids:

Below are word clouds of the review texts with the highest and the lowest sentiment scores, respectively.

4.2 Topic Model

As most of the reviews (almost 90%) belong to cluster 0 and cluster 1, we explore how these two clusters differ. This plot compares the review score distributions of cluster 0 and cluster 1; unfortunately, we can hardly tell the two distributions apart.

We also create a bipartite graph connecting every cluster to the words it contains.

We then create word clouds for cluster 0 and cluster 1 based on each word’s score.

4.3 Recommender System

For new users

For old users

old user 1 http://www.amazon.com/gp/pdp/profile/A1G69BQLIUMWPN

old user 2 http://www.amazon.com/gp/pdp/profile/A16CZRQL23NOIW/ref=cm_cr_tr_tbl_1732_name

5 Shiny App

Based on the data analysis and models above, we create a smart movie recommender system. It recommends movies to users according to their preferences.
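The report does not describe the recommender's internals. As a rough sketch of what a collaborative-filtering recommender can look like, the example below assumes an item-based variant with cosine similarity; the data layout, the toy ratings, and the names `cosine` and `recommend` are all assumptions for illustration and are not taken from the app.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two {user: rating} dicts.

    The dot product runs over users who rated both movies; the norms
    use all ratings of each movie.
    """
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, k=3):
    """Rank unseen movies by a similarity-weighted mean of the user's ratings.

    ratings maps movie -> {user: score}.
    """
    seen = [m for m, r in ratings.items() if user in r]
    predicted = {}
    for cand, cand_ratings in ratings.items():
        if cand in seen:
            continue
        num = den = 0.0
        for m in seen:
            sim = cosine(cand_ratings, ratings[m])
            num += sim * ratings[m][user]
            den += sim
        if den > 0:  # skip movies with no overlap with anything the user rated
            predicted[cand] = num / den
    return sorted(predicted, key=predicted.get, reverse=True)[:k]


# Toy ratings (hypothetical movies and users)
ratings = {
    "M1": {"u1": 5, "u2": 5, "u3": 1, "u5": 5},
    "M2": {"u1": 5, "u2": 4},
    "M3": {"u2": 1, "u3": 5, "u5": 1},
    "M4": {"u1": 4, "u3": 2},
}

suggestions = recommend(ratings, "u5", k=2)
```

In this toy example user `u5` liked M1 and disliked M3, so M2, which is rated most similarly to M1, is ranked ahead of M4. A user with no ratings has nothing to compare against in this scheme, so new users would need a different rule, such as suggesting the most popular movies.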

Haoyang Chen, Yicheng Wang, Xinghao Gu, Hexiu Ye

04/13/2016

Contents:

1. Introduction

2. Data Processing

3. Descriptive Statistics

4. Models

5. Shiny App