Movie Ratings Database from Tweets

Heather Geiger

February 11,2018

Introduction

Here I will use the MovieTweetings database from Github user sidooms.

Link here: https://github.com/sidooms/MovieTweetings

This is a dataset compiling tweets that gave clear reviews out of 10 for movies.

Since this database is updated regularly, I have also posted to my Github the exact data I am using from this dataset for reproducibility.

Link here: https://github.com/hmgeiger/MovieDatabase

Lastly, I used rounding to map these reviews out of 10 onto a 5-star system instead.

Selecting movies from MovieTweetings dataset

First, we load the ratings.dat and movies.dat files into R.

Ratings.dat is formatted like this: user_id::movie_id::rating::rating_timestamp

Since ratings.dat has an ID for each movie rather than the actual name of the movie, we will need to use movies.dat to get the IDs for our movies of interest.

Movies.dat is formatted like this: movie_id::movie_title (movie_year)::genre|genre|genre

#Since these files are separated by "::", which will cause problems in R, let's use a quick system call to switch to tab-separated.
#Already ran this so commenting these lines out now.
#system("awk -F "::" '{ OFS="\t"}{print $1,$2,$3,$4}' ratings.dat > ratings.tsv")
#system("awk -F "::" '{ OFS="\t"}{print $1,$2,$3}' movies.dat > movies.tsv")

ratings <- read.table("ratings.tsv",sep="\t",stringsAsFactors=FALSE,
    colClasses = c("character","character","integer","character"),
    col.names = c("user","movie","rating","timestamp"))

movies <- read.table("movies.tsv",sep="\t",stringsAsFactors=FALSE,quote="",comment.char="",
    colClasses = c("character","character","character"),
    col.names=c("id","movie","genre"))

Now, we pull out just the 10 movies we are interested in. To do this, first we use movies data frame to get the code, then select each movie code from ratings.

movies_of_interest <- c("Star Wars: The Last Jedi","Beauty and the Beast","Wonder Woman","Guardians of the Galaxy Vol. 2","Jumanji: Welcome to the Jungle","Spider-Man: Homecoming","It","Thor: Ragnarok","Despicable Me 3","Justice League")

movies_of_interest <- paste0(movies_of_interest," (2017)")

movies[match(movies_of_interest,movies$movie),]
##            id                                 movie
## 23151 2527336       Star Wars: The Last Jedi (2017)
## 23804 2771200           Beauty and the Beast (2017)
## 13390 0451279                   Wonder Woman (2017)
## 26716 3896198 Guardians of the Galaxy Vol. 2 (2017)
## 21924 2283362 Jumanji: Welcome to the Jungle (2017)
## 21773 2250912         Spider-Man: Homecoming (2017)
## 17164 1396484                             It (2017)
## 25780 3501632                 Thor: Ragnarok (2017)
## 25672 3469046                Despicable Me 3 (2017)
## 15103 0974015                 Justice League (2017)
##                          genre
## 23151                         
## 23804   Family|Fantasy|Musical
## 13390                         
## 26716  Action|Adventure|Sci-Fi
## 21924                         
## 21773 Action|Adventure|Fantasy
## 17164                         
## 25780                         
## 25672                         
## 15103
movies_of_interest_codes <- movies$id[match(movies_of_interest,movies$movie)]

ratings_movies_of_interest <- c()

for(movie in movies_of_interest_codes)
{
ratings_movies_of_interest <- rbind(ratings_movies_of_interest,ratings[which(ratings$movie == movie),])
}

Right now the data is in long format.

Let’s convert to wide, with each movie as a column and each user as a row.

ratings_movies_of_interest <- reshape(ratings_movies_of_interest[,c("user","movie","rating")],idvar = "user",timevar = "movie",direction="wide")

head(ratings_movies_of_interest)
##      user rating.2527336 rating.2771200 rating.0451279 rating.3896198
## 2390  155              7             NA              9             NA
## 3475  252              8             NA             NA             NA
## 3921  286             10             NA             NA             NA
## 4446  385              7             NA             NA              8
## 4697  427              8             NA             NA             NA
## 4848  433              8              7             NA             NA
##      rating.2283362 rating.2250912 rating.1396484 rating.3501632
## 2390             NA             10              9              7
## 3475             NA             NA             NA             NA
## 3921             NA             NA             NA             NA
## 4446             NA             NA             NA              7
## 4697             NA             NA             NA             NA
## 4848             NA             NA             NA             NA
##      rating.3469046 rating.0974015
## 2390             NA              9
## 3475             NA             NA
## 3921             NA             NA
## 4446              6             NA
## 4697             NA             NA
## 4848             NA              5
#Let's make "user" be the rownames, adding "user" as a prefix to each user ID.
#And instead of e.g. rating.2527336, replace with movie2527336.

rownames(ratings_movies_of_interest) <- paste0("user.",ratings_movies_of_interest$user)
ratings_movies_of_interest <- ratings_movies_of_interest[,2:ncol(ratings_movies_of_interest)]
colnames(ratings_movies_of_interest) <- paste0("movie",movies_of_interest_codes)

head(ratings_movies_of_interest)
##          movie2527336 movie2771200 movie0451279 movie3896198 movie2283362
## user.155            7           NA            9           NA           NA
## user.252            8           NA           NA           NA           NA
## user.286           10           NA           NA           NA           NA
## user.385            7           NA           NA            8           NA
## user.427            8           NA           NA           NA           NA
## user.433            8            7           NA           NA           NA
##          movie2250912 movie1396484 movie3501632 movie3469046 movie0974015
## user.155           10            9            7           NA            9
## user.252           NA           NA           NA           NA           NA
## user.286           NA           NA           NA           NA           NA
## user.385           NA           NA            7            6           NA
## user.427           NA           NA           NA           NA           NA
## user.433           NA           NA           NA           NA            5

Let’s now using rounding to convert from a 10-point to 5-point system.

We’ll divide by 2, then round up for odd numbers.

So 1/10 and 2/10 = 1/5, 3/10 and 4/10 = 2/5, etc.

ratings_movies_of_interest <- ceiling(ratings_movies_of_interest/2)

#Now, convert back to integers.

for(i in 1:ncol(ratings_movies_of_interest))
{
ratings_movies_of_interest[,i] <- as.integer(ratings_movies_of_interest[,i])
}

Finally, let’s save objects for ratings_movies_of_interest, plus the subset of movies that is just the code translation for our 10 movies of interest.

Let’s also reformat this subset of movies to better match ratings_movies_of_interest.

We’ll use these saved objects to compare to the ones we will later create from the SQL database of this information.

movie_codes_vs_names_movies_of_interest <- movies[match(movies_of_interest,movies$movie),1:2]
colnames(movie_codes_vs_names_movies_of_interest) <- c("id","movie")
movie_codes_vs_names_movies_of_interest$id <- paste0("movie",movie_codes_vs_names_movies_of_interest$id)
rownames(movie_codes_vs_names_movies_of_interest) <- 1:10
movie_codes_vs_names_movies_of_interest
##              id                                 movie
## 1  movie2527336       Star Wars: The Last Jedi (2017)
## 2  movie2771200           Beauty and the Beast (2017)
## 3  movie0451279                   Wonder Woman (2017)
## 4  movie3896198 Guardians of the Galaxy Vol. 2 (2017)
## 5  movie2283362 Jumanji: Welcome to the Jungle (2017)
## 6  movie2250912         Spider-Man: Homecoming (2017)
## 7  movie1396484                             It (2017)
## 8  movie3501632                 Thor: Ragnarok (2017)
## 9  movie3469046                Despicable Me 3 (2017)
## 10 movie0974015                 Justice League (2017)
save(list=c("ratings_movies_of_interest","movie_codes_vs_names_movies_of_interest"),
file="ratings_movies_of_interest.Rdata")

We will also output these two objects into tsv files.

write.table(data.frame(user = rownames(ratings_movies_of_interest),ratings_movies_of_interest),
file="ratings_movies_of_interest.tsv",
row.names=FALSE,col.names=TRUE,quote=FALSE,sep="\t")

write.table(movie_codes_vs_names_movies_of_interest,
file="movie_codes_vs_names_movies_of_interest.tsv",
row.names=FALSE,col.names=TRUE,quote=FALSE,sep="\t")

Creating the database

Now, we will run the following code to create the SQL database.

#This is the code to open up a MySQL shell. Subsequent commands (up to right before mysqldump) were entered within this shell.

/usr/local/mysql/bin/mysql -u root -p

CREATE DATABASE ratings_of_recent_popular_movies;
USE ratings_of_recent_popular_movies;
CREATE TABLE movienames(id varchar(255),movie varchar(255));
LOAD DATA LOCAL INFILE 'movie_codes_vs_names_movies_of_interest.tsv' INTO TABLE movienames FIELDS TERMINATED BY "\t" IGNORE 1 LINES;
CREATE TABLE ratings(user varchar(255),movie2527336 integer,movie2771200 integer,movie0451279 integer,movie3896198 integer,movie2283362 integer,movie2250912 integer,movie1396484 integer,movie3501632 integer,movie3469046 integer,movie0974015 integer);
LOAD DATA LOCAL INFILE 'ratings_movies_of_interest.tsv' INTO TABLE ratings FIELDS TERMINATED BY "\t" IGNORE 1 LINES;

#After this, quit the MySQL shell, then entered the following.

/usr/local/mysql/bin/mysqldump -u root -p ratings_of_recent_popular_movies > ratings_of_recent_popular_movies.sql

Loading the database back into R and checking for correctness

Now, back to R, where I will read in the data from MySQL.

#Clear all objects in the current environment so we can get a fresh start.

rm(list=ls())

#Now, ready to start reading in from MySql database.

library(RMySQL)
## Loading required package: DBI
#Need to enter password to use the database.
#Here I have my password in a text file that I will read in.

mypassword <- as.vector(read.table("password.txt")$V1)

myconnection <- dbConnect(MySQL(),dbname = "ratings_of_recent_popular_movies",user='root',password=mypassword,host='localhost')

movienames_request <- dbSendQuery(myconnection,"SELECT * from movienames")
movie_codes_vs_names_movies_of_interest_from_sql <- fetch(movienames_request,n=-1)

movie_codes_vs_names_movies_of_interest_from_sql
##              id                                 movie
## 1  movie2527336       Star Wars: The Last Jedi (2017)
## 2  movie2771200           Beauty and the Beast (2017)
## 3  movie0451279                   Wonder Woman (2017)
## 4  movie3896198 Guardians of the Galaxy Vol. 2 (2017)
## 5  movie2283362 Jumanji: Welcome to the Jungle (2017)
## 6  movie2250912         Spider-Man: Homecoming (2017)
## 7  movie1396484                             It (2017)
## 8  movie3501632                 Thor: Ragnarok (2017)
## 9  movie3469046                Despicable Me 3 (2017)
## 10 movie0974015                 Justice League (2017)
ratings_request <- dbSendQuery(myconnection,"SELECT * from ratings")
ratings_movies_of_interest_from_sql <- fetch(ratings_request,n=-1)

head(ratings_movies_of_interest_from_sql)
##       user movie2527336 movie2771200 movie0451279 movie3896198
## 1 user.155            4            0            5            0
## 2 user.252            4            0            0            0
## 3 user.286            5            0            0            0
## 4 user.385            4            0            0            4
## 5 user.427            4            0            0            0
## 6 user.433            4            4            0            0
##   movie2283362 movie2250912 movie1396484 movie3501632 movie3469046
## 1            0            5            5            4            0
## 2            0            0            0            0            0
## 3            0            0            0            0            0
## 4            0            0            0            4            3
## 5            0            0            0            0            0
## 6            0            0            0            0            0
##   movie0974015
## 1            5
## 2            0
## 3            0
## 4            0
## 5            0
## 6            3
#Format to be more like our original data frame.

rownames(ratings_movies_of_interest_from_sql) <- ratings_movies_of_interest_from_sql[,1]
ratings_movies_of_interest_from_sql <- ratings_movies_of_interest_from_sql[,2:ncol(ratings_movies_of_interest_from_sql)]
ratings_movies_of_interest_from_sql[ratings_movies_of_interest_from_sql == 0] <- NA

#Save objects.

save(list=c("movie_codes_vs_names_movies_of_interest_from_sql","ratings_movies_of_interest_from_sql"),
file="ratings_movies_of_interest_created_from_database.Rdata")

Load in our original data frames, and compare to those created using SQL.

load("ratings_movies_of_interest.Rdata")

identical(movie_codes_vs_names_movies_of_interest,movie_codes_vs_names_movies_of_interest_from_sql)
## [1] TRUE
identical(ratings_movies_of_interest,ratings_movies_of_interest_from_sql)
## [1] FALSE

Data frame with the actual ratings is not identical between the one made in R and the one from the database. Why?

#Let's look at number of rows and columns and a head for each object.

dim(ratings_movies_of_interest)
## [1] 3024   10
dim(ratings_movies_of_interest_from_sql)
## [1] 3024   10
head(ratings_movies_of_interest)
##          movie2527336 movie2771200 movie0451279 movie3896198 movie2283362
## user.155            4           NA            5           NA           NA
## user.252            4           NA           NA           NA           NA
## user.286            5           NA           NA           NA           NA
## user.385            4           NA           NA            4           NA
## user.427            4           NA           NA           NA           NA
## user.433            4            4           NA           NA           NA
##          movie2250912 movie1396484 movie3501632 movie3469046 movie0974015
## user.155            5            5            4           NA            5
## user.252           NA           NA           NA           NA           NA
## user.286           NA           NA           NA           NA           NA
## user.385           NA           NA            4            3           NA
## user.427           NA           NA           NA           NA           NA
## user.433           NA           NA           NA           NA            3
head(ratings_movies_of_interest_from_sql)
##          movie2527336 movie2771200 movie0451279 movie3896198 movie2283362
## user.155            4           NA            5           NA           NA
## user.252            4           NA           NA           NA           NA
## user.286            5           NA           NA           NA           NA
## user.385            4           NA           NA            4           NA
## user.427            4           NA           NA           NA           NA
## user.433            4            4           NA           NA           NA
##          movie2250912 movie1396484 movie3501632 movie3469046 movie0974015
## user.155            5            5            4           NA            5
## user.252           NA           NA           NA           NA           NA
## user.286           NA           NA           NA           NA           NA
## user.385           NA           NA            4            3           NA
## user.427           NA           NA           NA           NA           NA
## user.433           NA           NA           NA           NA            3
#Maybe it's a class issue?

for(i in 1:ncol(ratings_movies_of_interest))
{
print(class(ratings_movies_of_interest[,i]))
}
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
for(i in 1:ncol(ratings_movies_of_interest_from_sql))
{
print(class(ratings_movies_of_interest_from_sql[,i]))
}
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
#That doesn't seem to be the issue either.
#Let's look at one column at a time now. Let's start with the first column.

length(which(ratings_movies_of_interest[,1] == ratings_movies_of_interest_from_sql[,1]))
## [1] 580
#Interesting! Number of rows listed as equal is less than the total rows in both data frames.

head(setdiff(1:nrow(ratings_movies_of_interest),which(ratings_movies_of_interest[,1] == ratings_movies_of_interest_from_sql[,1])))
## [1] 581 582 583 584 585 586
ratings_movies_of_interest[581:582,]
##          movie2527336 movie2771200 movie0451279 movie3896198 movie2283362
## user.116           NA            3           NA            4           NA
## user.157           NA            3            4            4           NA
##          movie2250912 movie1396484 movie3501632 movie3469046 movie0974015
## user.116           NA           NA           NA            4            4
## user.157            4            4            4            3           NA
ratings_movies_of_interest_from_sql[581:582,]
##          movie2527336 movie2771200 movie0451279 movie3896198 movie2283362
## user.116           NA            3           NA            4           NA
## user.157           NA            3            4            4           NA
##          movie2250912 movie1396484 movie3501632 movie3469046 movie0974015
## user.116           NA           NA           NA            4            4
## user.157            4            4            4            3           NA
#Very strange. These look identical, but R is perceiving them as not identical.
#What if we tried replacing all NAs in both data frames with 0?

ratings_movies_of_interest_with_zeroes <- ratings_movies_of_interest
ratings_movies_of_interest_with_zeroes[is.na(ratings_movies_of_interest_with_zeroes)] <- 0

ratings_movies_of_interest_from_sql_with_zeroes <- ratings_movies_of_interest_from_sql
ratings_movies_of_interest_from_sql_with_zeroes[is.na(ratings_movies_of_interest_from_sql_with_zeroes)] <- 0

identical(ratings_movies_of_interest_with_zeroes,ratings_movies_of_interest_from_sql_with_zeroes)
## [1] TRUE
#Maybe there were some true zeroes in ratings_movies_of_interest that we switched to NA when converting back from SQL?
#Let's check.

reviews_per_user_from_R_direct_dat <- apply(ratings_movies_of_interest,1,function(x)length(which(is.na(x) == FALSE)))
reviews_per_user_from_sql <- apply(ratings_movies_of_interest_from_sql,1,function(x)length(which(is.na(x) == FALSE)))

length(reviews_per_user_from_R_direct_dat)
## [1] 3024
length(which(reviews_per_user_from_R_direct_dat == reviews_per_user_from_sql))
## [1] 3023
#One user is not the same in these data frames.

ratings_movies_of_interest[which(reviews_per_user_from_R_direct_dat != reviews_per_user_from_sql),]
##            movie2527336 movie2771200 movie0451279 movie3896198
## user.26393           NA           NA            5           NA
##            movie2283362 movie2250912 movie1396484 movie3501632
## user.26393           NA            0            2           NA
##            movie3469046 movie0974015
## user.26393           NA           NA
ratings_movies_of_interest_from_sql[which(reviews_per_user_from_R_direct_dat != reviews_per_user_from_sql),]
##            movie2527336 movie2771200 movie0451279 movie3896198
## user.26393           NA           NA            5           NA
##            movie2283362 movie2250912 movie1396484 movie3501632
## user.26393           NA           NA            2           NA
##            movie3469046 movie0974015
## user.26393           NA           NA
#Looks like that's what happened! I hadn't considered the possibility that there might be at least a few real zeroes in the data, apart from the ones MySQL made to replace the NAs when I made the database.
#This really should not be allowed, as the scale is supposed to be from 1 to 5.
#Let's replace with an NA.

ratings_movies_of_interest[which(reviews_per_user_from_R_direct_dat != reviews_per_user_from_sql),"movie2250912"] <- NA

#Now the two data frames should be equal.

identical(ratings_movies_of_interest,ratings_movies_of_interest_from_sql)
## [1] TRUE

Finally exploring the data!

Now that we’ve done all this, let’s start to actually look at the data!

Question 1 - Of these 10 movies, how many of them did the typical Twitter user give a structured review?

reviews_per_user <- apply(ratings_movies_of_interest,1,function(x)length(which(is.na(x) == FALSE)))

table(reviews_per_user)
## reviews_per_user
##    1    2    3    4    5    6    7    8    9   10 
## 1643  539  312  198  121   98   64   33   11    5

Looks like most people only reviewed 1 or 2 of these movies in a structured review.

Which movies did people give structured reviews most frequently?

reviews_per_movie <- apply(ratings_movies_of_interest,2,function(x)length(which(is.na(x) == FALSE)))

reviews_per_movie <- data.frame(Movie = movie_codes_vs_names_movies_of_interest$movie,Number.reviews = reviews_per_movie,row.names=NULL)

reviews_per_movie[order(reviews_per_movie$Number.reviews,decreasing=TRUE),]
##                                    Movie Number.reviews
## 3                    Wonder Woman (2017)           1088
## 6          Spider-Man: Homecoming (2017)            925
## 7                              It (2017)            862
## 4  Guardians of the Galaxy Vol. 2 (2017)            824
## 8                  Thor: Ragnarok (2017)            633
## 2            Beauty and the Beast (2017)            591
## 1        Star Wars: The Last Jedi (2017)            580
## 10                 Justice League (2017)            526
## 5  Jumanji: Welcome to the Jungle (2017)            273
## 9                 Despicable Me 3 (2017)            201

Most of these movies have around 500-1100 reviews, with the exception of “Despicable Me 3” and “Jumanji: Welcome to the Jungle” which had substantially fewer.

Finally, let’s get the proportion of 1 to 5 star reviews for each movie and compare them.

percent_1_to_5_star_reviews <- data.frame(One.star = c(),Two.star = c(),Three.star = c(),Four.star = c(),Five.star = c(),stringsAsFactors=FALSE)

for(i in 1:ncol(ratings_movies_of_interest))
{
num_reviews <- reviews_per_movie$Number.reviews[i]
num_1_to_5_star_reviews <- table(ratings_movies_of_interest[,i])
percent_1_to_5_star_reviews_this_movie <- num_1_to_5_star_reviews*100/num_reviews
percent_1_to_5_star_reviews <- rbind(percent_1_to_5_star_reviews,
data.frame(One.star = percent_1_to_5_star_reviews_this_movie[1],
Two.star = percent_1_to_5_star_reviews_this_movie[2],
Three.star = percent_1_to_5_star_reviews_this_movie[3],
Four.star = percent_1_to_5_star_reviews_this_movie[4],
Five.star = percent_1_to_5_star_reviews_this_movie[5]))
}

#Make row names be names minus the year.

for(i in 1:nrow(percent_1_to_5_star_reviews))
{
rownames(percent_1_to_5_star_reviews)[i] <- strsplit(movie_codes_vs_names_movies_of_interest$movie[i],"\\(")[[1]][1]
}

round(t(percent_1_to_5_star_reviews),digits=2)
##            Star Wars: The Last Jedi  Beauty and the Beast  Wonder Woman 
## One.star                        0.69                  1.18          1.38
## Two.star                        1.90                  3.05          3.40
## Three.star                     14.31                 18.27         18.75
## Four.star                      44.83                 52.28         46.78
## Five.star                      38.28                 25.21         29.69
##            Guardians of the Galaxy Vol. 2  Jumanji: Welcome to the Jungle 
## One.star                              0.24                            1.10
## Two.star                              1.46                            2.93
## Three.star                           10.92                           22.34
## Four.star                            55.83                           61.54
## Five.star                            31.55                           12.09
##            Spider-Man: Homecoming    It  Thor: Ragnarok  Despicable Me 3 
## One.star                      1.30  6.38            0.95             1.00
## Two.star                      1.62  5.34            1.74             4.48
## Three.star                   16.32 20.30            9.48            50.25
## Four.star                    59.46 47.80           50.24            37.81
## Five.star                    21.30 20.19           37.60             6.47
##            Justice League 
## One.star              2.28
## Two.star              5.13
## Three.star           22.24
## Four.star            50.76
## Five.star            19.58
barplot(t(as.matrix(percent_1_to_5_star_reviews)),cex.names=0.5,las=2,col=c("#E69F00", "#56B4E9", "#009E73","#F0E442", "#0072B2"))

Interesting! The number of very poor (1 or 2-star, or 4/10 or less in original data) reviews is actually a lot lower than I expected it to be.

Star Wars: The Last Jedi and Thor:Ragnarok have higher proportions of 5-star reviews (or 9/10 to 10/10 in original data).

Despicable Me 3 seems to have a much higher proportion of 3-star (5/10 or 6/10 in original data) reviews than the other movies.