Here I will use the MovieTweetings database from Github user sidooms.
Link here: https://github.com/sidooms/MovieTweetings
This is a dataset compiling tweets that gave clear reviews out of 10 for movies.
Since this database is updated regularly, I have also posted to my Github the exact data I am using from this dataset for reproducibility.
Link here: https://github.com/hmgeiger/MovieDatabase
Lastly, I used rounding to map these reviews out of 10 onto a 5-star system instead.
For this database, I chose the top 10 highest total grossing movies from the past 365 days according to Box Office Mojo.
Link here: http://www.boxofficemojo.com/yearly/chart/past365.htm
At the time of writing, these movies were:
First, we load the ratings.dat and movies.dat files into R.
Ratings.dat is formatted like this: user_id::movie_id::rating::rating_timestamp
Since ratings.dat has an ID for each movie rather than the actual name of the movie, we will need to use movies.dat to get the IDs for our movies of interest.
Movies.dat is formatted like this: movie_id::movie_title (movie_year)::genre|genre|genre
#Since these files are separated by "::", which will cause problems in R, let's use a quick system call to switch to tab-separated.
#Already ran this so commenting these lines out now.
#system("awk -F "::" '{ OFS="\t"}{print $1,$2,$3,$4}' ratings.dat > ratings.tsv")
#system("awk -F "::" '{ OFS="\t"}{print $1,$2,$3}' movies.dat > movies.tsv")
ratings <- read.table("ratings.tsv",sep="\t",stringsAsFactors=FALSE,
colClasses = c("character","character","integer","character"),
col.names = c("user","movie","rating","timestamp"))
movies <- read.table("movies.tsv",sep="\t",stringsAsFactors=FALSE,quote="",comment.char="",
colClasses = c("character","character","character"),
col.names=c("id","movie","genre"))
Now, we pull out just the 10 movies we are interested in. To do this, first we use movies data frame to get the code, then select each movie code from ratings.
movies_of_interest <- c("Star Wars: The Last Jedi","Beauty and the Beast","Wonder Woman","Guardians of the Galaxy Vol. 2","Jumanji: Welcome to the Jungle","Spider-Man: Homecoming","It","Thor: Ragnarok","Despicable Me 3","Justice League")
movies_of_interest <- paste0(movies_of_interest," (2017)")
movies[match(movies_of_interest,movies$movie),]
## id movie
## 23151 2527336 Star Wars: The Last Jedi (2017)
## 23804 2771200 Beauty and the Beast (2017)
## 13390 0451279 Wonder Woman (2017)
## 26716 3896198 Guardians of the Galaxy Vol. 2 (2017)
## 21924 2283362 Jumanji: Welcome to the Jungle (2017)
## 21773 2250912 Spider-Man: Homecoming (2017)
## 17164 1396484 It (2017)
## 25780 3501632 Thor: Ragnarok (2017)
## 25672 3469046 Despicable Me 3 (2017)
## 15103 0974015 Justice League (2017)
## genre
## 23151
## 23804 Family|Fantasy|Musical
## 13390
## 26716 Action|Adventure|Sci-Fi
## 21924
## 21773 Action|Adventure|Fantasy
## 17164
## 25780
## 25672
## 15103
movies_of_interest_codes <- movies$id[match(movies_of_interest,movies$movie)]
ratings_movies_of_interest <- c()
for(movie in movies_of_interest_codes)
{
ratings_movies_of_interest <- rbind(ratings_movies_of_interest,ratings[which(ratings$movie == movie),])
}
Right now the data is in long format.
Let’s convert to wide, with each movie as a column and each user as a row.
ratings_movies_of_interest <- reshape(ratings_movies_of_interest[,c("user","movie","rating")],idvar = "user",timevar = "movie",direction="wide")
head(ratings_movies_of_interest)
## user rating.2527336 rating.2771200 rating.0451279 rating.3896198
## 2390 155 7 NA 9 NA
## 3475 252 8 NA NA NA
## 3921 286 10 NA NA NA
## 4446 385 7 NA NA 8
## 4697 427 8 NA NA NA
## 4848 433 8 7 NA NA
## rating.2283362 rating.2250912 rating.1396484 rating.3501632
## 2390 NA 10 9 7
## 3475 NA NA NA NA
## 3921 NA NA NA NA
## 4446 NA NA NA 7
## 4697 NA NA NA NA
## 4848 NA NA NA NA
## rating.3469046 rating.0974015
## 2390 NA 9
## 3475 NA NA
## 3921 NA NA
## 4446 6 NA
## 4697 NA NA
## 4848 NA 5
#Let's make "user" be the rownames, adding "user" as a prefix to each user ID.
#And instead of e.g. rating.2527336, replace with movie2527336.
rownames(ratings_movies_of_interest) <- paste0("user.",ratings_movies_of_interest$user)
ratings_movies_of_interest <- ratings_movies_of_interest[,2:ncol(ratings_movies_of_interest)]
colnames(ratings_movies_of_interest) <- paste0("movie",movies_of_interest_codes)
head(ratings_movies_of_interest)
## movie2527336 movie2771200 movie0451279 movie3896198 movie2283362
## user.155 7 NA 9 NA NA
## user.252 8 NA NA NA NA
## user.286 10 NA NA NA NA
## user.385 7 NA NA 8 NA
## user.427 8 NA NA NA NA
## user.433 8 7 NA NA NA
## movie2250912 movie1396484 movie3501632 movie3469046 movie0974015
## user.155 10 9 7 NA 9
## user.252 NA NA NA NA NA
## user.286 NA NA NA NA NA
## user.385 NA NA 7 6 NA
## user.427 NA NA NA NA NA
## user.433 NA NA NA NA 5
Let’s now using rounding to convert from a 10-point to 5-point system.
We’ll divide by 2, then round up for odd numbers.
So 1/10 and 2/10 = 1/5, 3/10 and 4/10 = 2/5, etc.
ratings_movies_of_interest <- ceiling(ratings_movies_of_interest/2)
#Now, convert back to integers.
for(i in 1:ncol(ratings_movies_of_interest))
{
ratings_movies_of_interest[,i] <- as.integer(ratings_movies_of_interest[,i])
}
Finally, let’s save objects for ratings_movies_of_interest, plus the subset of movies that is just the code translation for our 10 movies of interest.
Let’s also reformat this subset of movies to better match ratings_movies_of_interest.
We’ll use these saved objects to compare to the ones we will later create from the SQL database of this information.
movie_codes_vs_names_movies_of_interest <- movies[match(movies_of_interest,movies$movie),1:2]
colnames(movie_codes_vs_names_movies_of_interest) <- c("id","movie")
movie_codes_vs_names_movies_of_interest$id <- paste0("movie",movie_codes_vs_names_movies_of_interest$id)
rownames(movie_codes_vs_names_movies_of_interest) <- 1:10
movie_codes_vs_names_movies_of_interest
## id movie
## 1 movie2527336 Star Wars: The Last Jedi (2017)
## 2 movie2771200 Beauty and the Beast (2017)
## 3 movie0451279 Wonder Woman (2017)
## 4 movie3896198 Guardians of the Galaxy Vol. 2 (2017)
## 5 movie2283362 Jumanji: Welcome to the Jungle (2017)
## 6 movie2250912 Spider-Man: Homecoming (2017)
## 7 movie1396484 It (2017)
## 8 movie3501632 Thor: Ragnarok (2017)
## 9 movie3469046 Despicable Me 3 (2017)
## 10 movie0974015 Justice League (2017)
save(list=c("ratings_movies_of_interest","movie_codes_vs_names_movies_of_interest"),
file="ratings_movies_of_interest.Rdata")
We will also output these two objects into tsv files.
write.table(data.frame(user = rownames(ratings_movies_of_interest),ratings_movies_of_interest),
file="ratings_movies_of_interest.tsv",
row.names=FALSE,col.names=TRUE,quote=FALSE,sep="\t")
write.table(movie_codes_vs_names_movies_of_interest,
file="movie_codes_vs_names_movies_of_interest.tsv",
row.names=FALSE,col.names=TRUE,quote=FALSE,sep="\t")
Now, we will run the following code to create the SQL database.
#This is the code to open up a MySQL shell. Subsequent commands (up to right before mysqldump) were entered within this shell.
/usr/local/mysql/bin/mysql -u root -p
CREATE DATABASE ratings_of_recent_popular_movies;
USE ratings_of_recent_popular_movies;
CREATE TABLE movienames(id varchar(255),movie varchar(255));
LOAD DATA LOCAL INFILE 'movie_codes_vs_names_movies_of_interest.tsv' INTO TABLE movienames FIELDS TERMINATED BY "\t" IGNORE 1 LINES;
CREATE TABLE ratings(user varchar(255),movie2527336 integer,movie2771200 integer,movie0451279 integer,movie3896198 integer,movie2283362 integer,movie2250912 integer,movie1396484 integer,movie3501632 integer,movie3469046 integer,movie0974015 integer);
LOAD DATA LOCAL INFILE 'ratings_movies_of_interest.tsv' INTO TABLE ratings FIELDS TERMINATED BY "\t" IGNORE 1 LINES;
#After this, quit the MySQL shell, then entered the following.
/usr/local/mysql/bin/mysqldump -u root -p ratings_of_recent_popular_movies > ratings_of_recent_popular_movies.sql
Now, back to R, where I will read in the data from MySQL.
#Clear all objects in the current environment so we can get a fresh start.
rm(list=ls())
#Now, ready to start reading in from MySql database.
library(RMySQL)
## Loading required package: DBI
#Need to enter password to use the database.
#Here I have my password in a text file that I will read in.
mypassword <- as.vector(read.table("password.txt")$V1)
myconnection <- dbConnect(MySQL(),dbname = "ratings_of_recent_popular_movies",user='root',password=mypassword,host='localhost')
movienames_request <- dbSendQuery(myconnection,"SELECT * from movienames")
movie_codes_vs_names_movies_of_interest_from_sql <- fetch(movienames_request,n=-1)
movie_codes_vs_names_movies_of_interest_from_sql
## id movie
## 1 movie2527336 Star Wars: The Last Jedi (2017)
## 2 movie2771200 Beauty and the Beast (2017)
## 3 movie0451279 Wonder Woman (2017)
## 4 movie3896198 Guardians of the Galaxy Vol. 2 (2017)
## 5 movie2283362 Jumanji: Welcome to the Jungle (2017)
## 6 movie2250912 Spider-Man: Homecoming (2017)
## 7 movie1396484 It (2017)
## 8 movie3501632 Thor: Ragnarok (2017)
## 9 movie3469046 Despicable Me 3 (2017)
## 10 movie0974015 Justice League (2017)
ratings_request <- dbSendQuery(myconnection,"SELECT * from ratings")
ratings_movies_of_interest_from_sql <- fetch(ratings_request,n=-1)
head(ratings_movies_of_interest_from_sql)
## user movie2527336 movie2771200 movie0451279 movie3896198
## 1 user.155 4 0 5 0
## 2 user.252 4 0 0 0
## 3 user.286 5 0 0 0
## 4 user.385 4 0 0 4
## 5 user.427 4 0 0 0
## 6 user.433 4 4 0 0
## movie2283362 movie2250912 movie1396484 movie3501632 movie3469046
## 1 0 5 5 4 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 4 3
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## movie0974015
## 1 5
## 2 0
## 3 0
## 4 0
## 5 0
## 6 3
#Format to be more like our original data frame.
rownames(ratings_movies_of_interest_from_sql) <- ratings_movies_of_interest_from_sql[,1]
ratings_movies_of_interest_from_sql <- ratings_movies_of_interest_from_sql[,2:ncol(ratings_movies_of_interest_from_sql)]
ratings_movies_of_interest_from_sql[ratings_movies_of_interest_from_sql == 0] <- NA
#Save objects.
save(list=c("movie_codes_vs_names_movies_of_interest_from_sql","ratings_movies_of_interest_from_sql"),
file="ratings_movies_of_interest_created_from_database.Rdata")
Load in our original data frames, and compare to those created using SQL.
load("ratings_movies_of_interest.Rdata")
identical(movie_codes_vs_names_movies_of_interest,movie_codes_vs_names_movies_of_interest_from_sql)
## [1] TRUE
identical(ratings_movies_of_interest,ratings_movies_of_interest_from_sql)
## [1] FALSE
Data frame with the actual ratings is not identical between the one made in R and the one from the database. Why?
#Let's look at number of rows and columns and a head for each object.
dim(ratings_movies_of_interest)
## [1] 3024 10
dim(ratings_movies_of_interest_from_sql)
## [1] 3024 10
head(ratings_movies_of_interest)
## movie2527336 movie2771200 movie0451279 movie3896198 movie2283362
## user.155 4 NA 5 NA NA
## user.252 4 NA NA NA NA
## user.286 5 NA NA NA NA
## user.385 4 NA NA 4 NA
## user.427 4 NA NA NA NA
## user.433 4 4 NA NA NA
## movie2250912 movie1396484 movie3501632 movie3469046 movie0974015
## user.155 5 5 4 NA 5
## user.252 NA NA NA NA NA
## user.286 NA NA NA NA NA
## user.385 NA NA 4 3 NA
## user.427 NA NA NA NA NA
## user.433 NA NA NA NA 3
head(ratings_movies_of_interest_from_sql)
## movie2527336 movie2771200 movie0451279 movie3896198 movie2283362
## user.155 4 NA 5 NA NA
## user.252 4 NA NA NA NA
## user.286 5 NA NA NA NA
## user.385 4 NA NA 4 NA
## user.427 4 NA NA NA NA
## user.433 4 4 NA NA NA
## movie2250912 movie1396484 movie3501632 movie3469046 movie0974015
## user.155 5 5 4 NA 5
## user.252 NA NA NA NA NA
## user.286 NA NA NA NA NA
## user.385 NA NA 4 3 NA
## user.427 NA NA NA NA NA
## user.433 NA NA NA NA 3
#Maybe it's a class issue?
for(i in 1:ncol(ratings_movies_of_interest))
{
print(class(ratings_movies_of_interest[,i]))
}
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
for(i in 1:ncol(ratings_movies_of_interest_from_sql))
{
print(class(ratings_movies_of_interest_from_sql[,i]))
}
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
## [1] "integer"
#That doesn't seem to be the issue either.
#Let's look at one column at a time now. Let's start with the first column.
length(which(ratings_movies_of_interest[,1] == ratings_movies_of_interest_from_sql[,1]))
## [1] 580
#Interesting! Number of rows listed as equal is less than the total rows in both data frames.
head(setdiff(1:nrow(ratings_movies_of_interest),which(ratings_movies_of_interest[,1] == ratings_movies_of_interest_from_sql[,1])))
## [1] 581 582 583 584 585 586
ratings_movies_of_interest[581:582,]
## movie2527336 movie2771200 movie0451279 movie3896198 movie2283362
## user.116 NA 3 NA 4 NA
## user.157 NA 3 4 4 NA
## movie2250912 movie1396484 movie3501632 movie3469046 movie0974015
## user.116 NA NA NA 4 4
## user.157 4 4 4 3 NA
ratings_movies_of_interest_from_sql[581:582,]
## movie2527336 movie2771200 movie0451279 movie3896198 movie2283362
## user.116 NA 3 NA 4 NA
## user.157 NA 3 4 4 NA
## movie2250912 movie1396484 movie3501632 movie3469046 movie0974015
## user.116 NA NA NA 4 4
## user.157 4 4 4 3 NA
#Very strange. These look identical, but R is perceiving them as not identical.
#What if we tried replacing all NAs in both data frames with 0?
ratings_movies_of_interest_with_zeroes <- ratings_movies_of_interest
ratings_movies_of_interest_with_zeroes[is.na(ratings_movies_of_interest_with_zeroes)] <- 0
ratings_movies_of_interest_from_sql_with_zeroes <- ratings_movies_of_interest_from_sql
ratings_movies_of_interest_from_sql_with_zeroes[is.na(ratings_movies_of_interest_from_sql_with_zeroes)] <- 0
identical(ratings_movies_of_interest_with_zeroes,ratings_movies_of_interest_from_sql_with_zeroes)
## [1] TRUE
#Maybe there were some true zeroes in ratings_movies_of_interest that we switched to NA when converting back from SQL?
#Let's check.
reviews_per_user_from_R_direct_dat <- apply(ratings_movies_of_interest,1,function(x)length(which(is.na(x) == FALSE)))
reviews_per_user_from_sql <- apply(ratings_movies_of_interest_from_sql,1,function(x)length(which(is.na(x) == FALSE)))
length(reviews_per_user_from_R_direct_dat)
## [1] 3024
length(which(reviews_per_user_from_R_direct_dat == reviews_per_user_from_sql))
## [1] 3023
#One user is not the same in these data frames.
ratings_movies_of_interest[which(reviews_per_user_from_R_direct_dat != reviews_per_user_from_sql),]
## movie2527336 movie2771200 movie0451279 movie3896198
## user.26393 NA NA 5 NA
## movie2283362 movie2250912 movie1396484 movie3501632
## user.26393 NA 0 2 NA
## movie3469046 movie0974015
## user.26393 NA NA
ratings_movies_of_interest_from_sql[which(reviews_per_user_from_R_direct_dat != reviews_per_user_from_sql),]
## movie2527336 movie2771200 movie0451279 movie3896198
## user.26393 NA NA 5 NA
## movie2283362 movie2250912 movie1396484 movie3501632
## user.26393 NA NA 2 NA
## movie3469046 movie0974015
## user.26393 NA NA
#Looks like that's what happened! I hadn't considered the possibility that there might be at least a few real zeroes in the data, apart from the ones MySQL made to replace the NAs when I made the database.
#This really should not be allowed, as the scale is supposed to be from 1 to 5.
#Let's replace with an NA.
ratings_movies_of_interest[which(reviews_per_user_from_R_direct_dat != reviews_per_user_from_sql),"movie2250912"] <- NA
#Now the two data frames should be equal.
identical(ratings_movies_of_interest,ratings_movies_of_interest_from_sql)
## [1] TRUE
Now that we’ve done all this, let’s start to actually look at the data!
Question 1 - Of these 10 movies, how many of them did the typical Twitter user give a structured review?
reviews_per_user <- apply(ratings_movies_of_interest,1,function(x)length(which(is.na(x) == FALSE)))
table(reviews_per_user)
## reviews_per_user
## 1 2 3 4 5 6 7 8 9 10
## 1643 539 312 198 121 98 64 33 11 5
Looks like most people only reviewed 1 or 2 of these movies in a structured review.
Which movies did people give structured reviews most frequently?
reviews_per_movie <- apply(ratings_movies_of_interest,2,function(x)length(which(is.na(x) == FALSE)))
reviews_per_movie <- data.frame(Movie = movie_codes_vs_names_movies_of_interest$movie,Number.reviews = reviews_per_movie,row.names=NULL)
reviews_per_movie[order(reviews_per_movie$Number.reviews,decreasing=TRUE),]
## Movie Number.reviews
## 3 Wonder Woman (2017) 1088
## 6 Spider-Man: Homecoming (2017) 925
## 7 It (2017) 862
## 4 Guardians of the Galaxy Vol. 2 (2017) 824
## 8 Thor: Ragnarok (2017) 633
## 2 Beauty and the Beast (2017) 591
## 1 Star Wars: The Last Jedi (2017) 580
## 10 Justice League (2017) 526
## 5 Jumanji: Welcome to the Jungle (2017) 273
## 9 Despicable Me 3 (2017) 201
Most of these movies have around 500-1100 reviews, with the exception of “Despicable Me 3” and “Jumanji: Welcome to the Jungle” which had substantially fewer.
Finally, let’s get the proportion of 1 to 5 star reviews for each movie and compare them.
percent_1_to_5_star_reviews <- data.frame(One.star = c(),Two.star = c(),Three.star = c(),Four.star = c(),Five.star = c(),stringsAsFactors=FALSE)
for(i in 1:ncol(ratings_movies_of_interest))
{
num_reviews <- reviews_per_movie$Number.reviews[i]
num_1_to_5_star_reviews <- table(ratings_movies_of_interest[,i])
percent_1_to_5_star_reviews_this_movie <- num_1_to_5_star_reviews*100/num_reviews
percent_1_to_5_star_reviews <- rbind(percent_1_to_5_star_reviews,
data.frame(One.star = percent_1_to_5_star_reviews_this_movie[1],
Two.star = percent_1_to_5_star_reviews_this_movie[2],
Three.star = percent_1_to_5_star_reviews_this_movie[3],
Four.star = percent_1_to_5_star_reviews_this_movie[4],
Five.star = percent_1_to_5_star_reviews_this_movie[5]))
}
#Make row names be names minus the year.
for(i in 1:nrow(percent_1_to_5_star_reviews))
{
rownames(percent_1_to_5_star_reviews)[i] <- strsplit(movie_codes_vs_names_movies_of_interest$movie[i],"\\(")[[1]][1]
}
round(t(percent_1_to_5_star_reviews),digits=2)
## Star Wars: The Last Jedi Beauty and the Beast Wonder Woman
## One.star 0.69 1.18 1.38
## Two.star 1.90 3.05 3.40
## Three.star 14.31 18.27 18.75
## Four.star 44.83 52.28 46.78
## Five.star 38.28 25.21 29.69
## Guardians of the Galaxy Vol. 2 Jumanji: Welcome to the Jungle
## One.star 0.24 1.10
## Two.star 1.46 2.93
## Three.star 10.92 22.34
## Four.star 55.83 61.54
## Five.star 31.55 12.09
## Spider-Man: Homecoming It Thor: Ragnarok Despicable Me 3
## One.star 1.30 6.38 0.95 1.00
## Two.star 1.62 5.34 1.74 4.48
## Three.star 16.32 20.30 9.48 50.25
## Four.star 59.46 47.80 50.24 37.81
## Five.star 21.30 20.19 37.60 6.47
## Justice League
## One.star 2.28
## Two.star 5.13
## Three.star 22.24
## Four.star 50.76
## Five.star 19.58
barplot(t(as.matrix(percent_1_to_5_star_reviews)),cex.names=0.5,las=2,col=c("#E69F00", "#56B4E9", "#009E73","#F0E442", "#0072B2"))
Interesting! The number of very poor (1 or 2-star, or 4/10 or less in original data) reviews is actually a lot lower than I expected it to be.
Star Wars: The Last Jedi and Thor:Ragnarok have higher proportions of 5-star reviews (or 9/10 to 10/10 in original data).
Despicable Me 3 seems to have a much higher proportion of 3-star (5/10 or 6/10 in original data) reviews than the other movies.