We set out to compare 2016 New York Times movie reviews and tweets related to a given set of movies.
Our primary motivation was to explore data sources and techniques that were new to us but widely used in data science, specifically the Twitter API and sentiment analysis.
We envisioned a sentiment analysis comparison of New York Times movie reviews and tweets about the same movies. We hoped to answer how favorably the New York Times reviewed a given movie compared to the average sentiment on Twitter.
Our goal was to gain insight into market research across social media and traditional media sources. We wanted the resulting analysis to help moviegoers and the movie industry understand how reviews are perceived.
We began by identifying our project goals and scope, assessing potential data sources, and retrieving the data.
We first performed an API pull of a list of recent New York Times movie reviews.
We then performed an API pull of recent tweets. We learned that Twitter provides only a limited search history for publicly available tweets. This resulted in an adjustment of our project scope.
We then adjusted the date range of our New York Times movie list to contain only the movies with available Twitter data.
We ran a sentiment analysis on the text of the tweets and the text of the movie reviews for the given set of movies.
We created data frames for our data sources, cleaned and transformed the data, and performed analysis on the resulting datasets.
The following libraries were required.
library(indicoio)
library(DT)
library(ggplot2)
library(httr)
library(jsonlite)
library(knitr)
library(leaflet)
library(plotly)
library(plyr)
library(dplyr)
library(RColorBrewer)
library(RCurl)
library(rmarkdown)
library(ROAuth)
library(stringr)
library(tidyr)
library(twitteR)
library(XML)
library(scales)
We registered for a Twitter API key to perform the search of public tweets related to movies.
This code chunk (Twitter Setup) was run first, as it requires a PIN issued by Twitter.
library(twitteR)
library(httr)
library(ROAuth)
consumer_key <- "consumer_key"
consumer_secret <- "consumer_secret"
access_token <- "access_token"
access_secret <- "access_secret"
# Download the CA certificate bundle needed for the SSL handshake
download.file(url='http://curl.haxx.se/ca/cacert.pem', destfile='cacert.pem')
reqURL <- 'https://api.twitter.com/oauth/request_token'
accessURL <- 'https://api.twitter.com/oauth/access_token'
authURL <- 'https://api.twitter.com/oauth/authorize'
Cred <- OAuthFactory$new(consumerKey = consumer_key,
                         consumerSecret = consumer_secret,
                         requestURL = reqURL,
                         accessURL = accessURL,
                         authURL = authURL)
# Opens a browser for authorization; Twitter displays the PIN to paste back into R
Cred$handshake(cainfo = system.file('CurlSSL', 'cacert.pem', package = 'RCurl'))
We then ran the following chunks after entering the PIN for Twitter Authorization.
save(Cred, file='twitter authentication.Rdata')
load('twitter authentication.Rdata')
setup_twitter_oauth(consumer_key,
consumer_secret,
access_token,
access_secret)
We registered for a New York Times Movie API key to pull the movie review data.
We used the opening date range 2016-04-21 through 2016-05-07.
#required libraries
library(jsonlite)
library(knitr)
#RESULTS 0-20 AS OF 2016-05-07; YOURAPIKEYHERE should hold your NYT API key as a string
NYTREVIEWS_JSON_URL1 <- paste0('http://api.nytimes.com/svc/movies/v2/reviews/search.json?opening-date=2016-04-21;2016-05-07&api-key=', YOURAPIKEYHERE, '&order=by-title') #url for api results 0-20
json_file1 <- fromJSON(URLencode(NYTREVIEWS_JSON_URL1))# get json
df1 <- as.data.frame(json_file1$results) #dataframe for results
#str(df1) #view structure
#colnames(df1) #view column names
df1s<-df1[,c(1:8)] #subset needed columns
#kable(df1s) #test
#RESULTS 21-40 AS OF 2016-05-07
NYTREVIEWS_JSON_URL2 <- paste0('http://api.nytimes.com/svc/movies/v2/reviews/search.json?opening-date=2016-04-21;2016-05-07&api-key=', YOURAPIKEYHERE, '&offset=20&order=by-title') #url for api results 20-40
json_file2 <- fromJSON(NYTREVIEWS_JSON_URL2) #get json
df2 <- as.data.frame(json_file2$results) #dataframe for results
df2s<-df2[,c(1:8)] #subset needed columns
#kable(df2s) #test
#test
#NYTREVIEWS_JSON_URL3 = 'http://api.nytimes.com/svc/movies/v2/reviews/search.json?opening-date=2016-04-21;2016-05-07&api-key=YOURAPIKEYHERE&offset=40'
#json_file3 <- fromJSON(NYTREVIEWS_JSON_URL3) #get json
#df3 <- as.data.frame(json_file3$results) #dataframe for results
#df3s<-df3[,c(1:8)] #subset needed columns
#kable(df3s)
#ALL RESULTS 0-40 COMBINED
df<- rbind(df1s,df2s) #combine all results
#knitr::kable(df, row.names =TRUE) #test
#ADD URL COLUMN TO RESULTS
url1 <- json_file1$results$link #collect links part 1
url1<-url1[,c(1:2)] #subset links
url2 <- json_file2$results$link #collect links part 2
url2<-url2[,c(1:2)] #subset links
#str(url1) #test
urls<- rbind(url1,url2) #combine all results for links
#knitr::kable(urls, row.names = TRUE) #test
combo<-cbind(df,urls) #combine reviews and links
combo<-combo[,c(1:3,7,8,10)] #subset needed columns
combo<-combo[,c(1,5,4,6,2,3)] #reorder needed columns
kable(combo, row.names = TRUE) #display with row numbers
write.csv(combo, "nytmovies.csv", row.names = FALSE) #export to CSV
The resulting CSV is available here:
For each URL pointing to a movie review, we then scraped the text of the review. Only the actual text of the review was considered, indicated by class="story-body-text story-content" under the paragraph tag <p>. In addition, we retrieved the genre of the movie listed in the individual review, indicated by the tag <span itemprop="genre" class="genre">. For movies categorized as more than one genre, we retrieved only the first genre tag.
We then ran the review text through sentiment analysis with Indico, resulting in a score from 0 to 1, with 1 being most positive.
For reproducibility of results, we read the review list and URLs from GitHub.
# Retrieve list of movie from Github
movie_list_url <- getURL("https://raw.githubusercontent.com/spsstudent15/Data607FinalProject/master/nytmovies2.csv")
movie_list <- read.csv(textConnection(movie_list_url), header = TRUE, sep = ",")
movie_list[1:5,]
str(movie_list)
# initialize resulting data frame (rows are appended inside the loop below)
movie_score <- data.frame()
# initialize variables:
# INDICO API Key (replace with your own key)
my_indicoio_API_Key <- "YOUR_INDICO_API_KEY"
# Set Up xpath to retrieve review and genre
xpath_review <- "//p[@class = 'story-body-text story-content']"
xpath_genre <- "(//span[@itemprop='genre'][@class = 'genre'])[1]" # will only retrieve first one
agent <- paste(R.version$version.string, R.version$platform, sep = ", ") # single user-agent string
# Loop through movie list, retrieve review for each movie review url
loop_counter <- nrow(movie_list)
for (i in 1:loop_counter){
#Set RCurl pars
curl <- getCurlHandle()
curlSetOpt(cookiejar="cookies.txt", useragent = agent, followlocation = TRUE, curl=curl)
movie_url <- movie_list$url[i]
# Get Review corresponding the url
my_movie <- getURL(as.character(movie_url), curl = curl, verbose = FALSE)
content_review = htmlTreeParse(my_movie, asText=TRUE, useInternalNodes = TRUE, encoding = 'UTF-8')
plain_text <- xpathSApply(xmlRoot(content_review), xpath_review, xmlValue)
review_df <- as.data.frame(as.character(paste(plain_text, collapse = " ")))
colnames(review_df) <- "review"
review_df$movie <- movie_list$movie[i]
review_df$Sentiment_Score_nyt <- unlist(sentiment(as.character(review_df$review), api_key = my_indicoio_API_Key))
genre <- xpathSApply(xmlRoot(content_review), xpath_genre, xmlValue)
review_df$genre <-genre
movie_score <- rbind(movie_score, review_df)
}
movie_score_nyt <- inner_join(movie_list, movie_score, by = "movie")
write.csv(movie_score_nyt, file = "movie_df.csv", row.names = FALSE)
The resulting CSV is available here:
Once we established the list of available New York Times movie reviews, we ran the Twitter search using the movie title as a search term. To get better Twitter search results for one-word movies, we added the word “movie” to our search term.
This chunk retrieved the Twitter search results for a term. The searchTwitter() function has additional settings that could likely improve results; a sketch follows below.
We used the indico package to perform sentiment analysis, resulting in a score of 0-1 for each tweet, 1 being extremely positive.
We decided to use unique tweets only, therefore eliminating any retweet data.
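A minimal sketch (not our exact production pull) of those options, using an example search term; the lang and resultType values here are settings we did not tune, and strip_retweets() is twitteR's alternative to the unique() de-duplication we applied.
library(twitteR)
raw_tweets <- searchTwitter("Viva+movie",
                            n = 100,
                            lang = "en",                          # restrict to English tweets
                            resultType = "recent",                # newest tweets rather than "popular"
                            geocode = "40.72541,-73.99292,30mi")  # New York City, 30-mile radius
clean_tweets <- strip_retweets(raw_tweets)  # drop retweets before de-duplicating text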
For the benefit of our analysis, we filtered the Twitter search results to pull only the tweets which were coded with geographical location.
We chose the following cities at random and pulled tweets for only these cities.
library(knitr)
library(RCurl)
citiesurl <- getURL("https://raw.githubusercontent.com/spsstudent15/Data607FinalProject/master/locations.csv")
citiesxy<- data.frame(read.csv(text=citiesurl))
kable(citiesxy)
City | Lat | Long |
---|---|---|
Boston | 42.35997 | -71.06009 |
New York City | 40.72541 | -73.99292 |
Chicago | 41.87621 | -87.62833 |
Lincoln | 40.82550 | -96.67557 |
New Orleans | 29.94048 | -90.05229 |
Los Angeles | 34.05223 | -118.24368 |
Seattle | 47.61000 | -122.33000 |
Houston | 29.71892 | -95.33916 |
St. Louis | 38.64699 | -90.22497 |
Denver | 39.73915 | -104.98470 |
We mapped these cities with Leaflet.
url <- "https://raw.githubusercontent.com/spsstudent15/Data607FinalProject/master/"
data.url <- file(paste(url, "locations.csv", sep = ""), open = "r")
locations <- read.csv(data.url, sep=",", header=TRUE, stringsAsFactors = FALSE)
close(data.url)
library(leaflet)
leaflet(locations) %>%
addProviderTiles("Stamen.Toner") %>%
addMarkers(lng = ~Long, lat = ~Lat, popup = paste(as.character(str_trim(locations$City))))
We retrieved tweets based on our date range and geographical locations, then used Indico to generate a sentiment score for each individual tweet.
library(indicoio)
rm_special <- function(x) iconv(x, "UTF-8", "UTF-8", sub = '') # drop characters that cannot be converted to UTF-8
search_term <- function(term){
require(magrittr)
search_term <- gsub(" ", "+", term) %>%
paste("+movie", sep = "")
return(search_term)
}
df <- NULL
for (i in movies$display_titles){
SearchTerm <- search_term(i)
for (j in seq_along(locations$City)){
list <- suppressWarnings(searchTwitter(SearchTerm,
since = format(Sys.Date()-1,format="%Y-%m-%d"),
until = format(Sys.Date(),format="%Y-%m-%d"), n = 5000,
geocode = paste(locations$Lat[j], ",", locations$Long[j], ",30mi", sep = "" )
))
if (length(list) == 0){
NULL
} else {
tweetdf <- twListToDF(list)
tweetdf <- as.data.frame(unique(tweetdf$text))
colnames(tweetdf) <- c("tweet")
tweetdf$city <- locations$City[j]
tweetdf$movie <- as.character(i)
tweetdf$day <- format(Sys.Date() ,format="%m.%d.%Y")
df <- rbind(df,tweetdf)
}
}
}
df$Sentiment_Score <- unlist(sentiment(as.character(rm_special(df$tweet)), api_key = my_indicoio_API_Key))
write.csv(df, file = paste("Tweets for ", as.character(format(Sys.Date(), format="%m.%d.%Y")), ".csv", sep = ""))
The resulting Twitter CSVs for individual days are available here:
We performed a sentiment analysis litmus test for the New York Times reviews. The Times does not assign "stars" to its movie reviews, but favorable reviews are tagged as a "Critics' Pick." As a test of the sentiment analysis accuracy, we analyzed the results as follows:
We considered any score >= 0.75 a "good" score and expected such a movie to be tagged as a "Critics' Pick" (value = 1).
We looked at Pick/Good (True Positive), Pick/Bad (False Negative), NotPick/Good (False Positive), and NotPick/Bad (True Negative).
url <- "https://raw.githubusercontent.com/spsstudent15/Data607FinalProject/master/"
data.url <- file(paste(url,"movie_score_nyt.csv", sep = ""), open="r" )
movie_score_nyt <- read.csv(data.url, sep=",", header=TRUE, stringsAsFactors = FALSE)
nyt_false_negative <- movie_score_nyt %>% filter(Sentiment_Score_nyt < 0.75 & critics_pick == 1) %>%
select(movie, mpaa_rating, genre, Sentiment_Score_nyt, critics_pick)
nyt_false_positive <- movie_score_nyt %>% filter(Sentiment_Score_nyt >= 0.75 & critics_pick == 0) %>%
select(movie, mpaa_rating, genre, Sentiment_Score_nyt, critics_pick)
nyt_true_negative <- movie_score_nyt %>% filter(Sentiment_Score_nyt < 0.75 & critics_pick == 0) %>%
select(movie, mpaa_rating, genre, Sentiment_Score_nyt, critics_pick)
nyt_true_positive <- movie_score_nyt %>% filter(Sentiment_Score_nyt >= 0.75 & critics_pick == 1) %>%
select(movie, mpaa_rating, genre, Sentiment_Score_nyt, critics_pick)
close(data.url)
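As a minimal sketch (not part of the original chunk), the percentages below can be derived by dividing each subset's row count by the total number of reviewed movies:
total_reviews <- nrow(movie_score_nyt)
round(100 * c(False_Positive = nrow(nyt_false_positive),
              False_Negative = nrow(nyt_false_negative),
              True_Positive  = nrow(nyt_true_positive),
              True_Negative  = nrow(nyt_true_negative)) / total_reviews, 1)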
Percentage of False Positive: 35%
Percentage of False Negative: 5%
Percentage of True Positive: 15%
Percentage of True Negative: 45%
The key question when conducting sentiment analysis was how much confidence to place in the results. Based on this quick analysis, we saw that the sentiment analysis score did not always match whether the movie was recommended.
The false negatives illustrated how difficult it is to accurately measure the tonality and sentiment in a document. Two movies fell into this category: both were recommended by the New York Times critic, yet their sentiment analysis scores were below our threshold. Upon reading the reviews for these movies, "Viva" and "A Hologram for the King," the sentiment analysis results were understandable. Both were dramas whose reviews used difficult, dark, and negative terms.
The false positive results were more easily explained. The New York Times gave neutral-to-positive reviews without recommending the movies. Also, a threshold of 0.75 may have been too low to expect a recommendation. We therefore reclassified scores below .50 as negative, scores between .50 and .80 as neutral, and scores above .80 as positive.
We reran the analysis with these new limits. False negatives were scores < .50 with a recommendation; false positives were scores >= .80 without a recommendation; and false neutrals were scores between .50 and .80 with a recommendation.
nyt_false_negative <- movie_score_nyt %>% filter(Sentiment_Score_nyt < 0.50 & critics_pick == 1) %>%
select(movie, mpaa_rating, genre, Sentiment_Score_nyt, critics_pick)
nyt_false_positive <- movie_score_nyt %>% filter(Sentiment_Score_nyt >= 0.80 & critics_pick == 0) %>%
select(movie, mpaa_rating, genre, Sentiment_Score_nyt, critics_pick)
nyt_true_negative <- movie_score_nyt %>% filter(Sentiment_Score_nyt < 0.50 & critics_pick == 0) %>%
select(movie, mpaa_rating, genre, Sentiment_Score_nyt, critics_pick)
nyt_true_positive <- movie_score_nyt %>% filter(Sentiment_Score_nyt >= 0.80 & critics_pick == 1) %>%
select(movie, mpaa_rating, genre, Sentiment_Score_nyt, critics_pick)
nyt_false_neutral <- movie_score_nyt %>% filter(Sentiment_Score_nyt >= 0.50 & Sentiment_Score_nyt < 0.80
& critics_pick == 1) %>%
select(movie, mpaa_rating, genre, Sentiment_Score_nyt, critics_pick)
nyt_true_neutral <- movie_score_nyt %>% filter(Sentiment_Score_nyt >= 0.50 & Sentiment_Score_nyt < 0.80
& critics_pick == 0) %>%
select(movie, mpaa_rating, genre, Sentiment_Score_nyt, critics_pick)
Percentage of False Positive: 27.5%
Percentage of False Negative: 2.5%
Percentage of False Neutral : 2.5%
Percentage of True Positive : 15%
Percentage of True Neutral : 12.5%
Percentage of True Negative : 40%
For the New York Times data, we created a sentiment score summary table without the full review text for easier analysis.
url <- "https://raw.githubusercontent.com/spsstudent15/Data607FinalProject/master/"
data.url <- file(paste(url,"movie_score_nyt.csv", sep = ""), open="r" )
reviewscore <- read.csv(data.url, sep=",", header=TRUE, stringsAsFactors = FALSE)
reviewscore<-reviewscore[,c(2:5,7:8,10:11)] #subset needed columns
#kable(reviewscore)
close(data.url)
nytsent<-getURL("https://raw.githubusercontent.com/spsstudent15/Data607FinalProject/master/sentimentsummarynyt.csv")
df<- data.frame(read.csv(text=nytsent))
datanyt<-df[,c(3:4,7:9)]
datanyt$Sentiment_Score_nyt=round(datanyt$Sentiment_Score_nyt,3) #round decimals to 3 places
datatable(datanyt, options = list( pageLength = 5, lengthMenu = c(5, 10, 40)), rownames= TRUE)
write.csv(reviewscore, file = "sentimentsummarynyt.csv", row.names = FALSE)
The resulting CSV is available here:
We analyzed each day’s worth of tweets separately to identify outliers and possible trends. The complete daily analysis is available here.
We combined all seven days of tweets into one dataframe of 5,445 records. This served as our primary Twitter dataset.
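A sketch of the combining step, assuming the daily pulls were saved with the "Tweets for MM.DD.YYYY.csv" naming used in the retrieval chunk above:
daily_files <- list.files(pattern = "^Tweets for .*\\.csv$")  # one file per day of pulls
tweets_all <- do.call(rbind, lapply(daily_files, read.csv, stringsAsFactors = FALSE))
nrow(tweets_all)  # 5,445 records across the seven days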
We looked at the average Twitter Sentiment by movie.
tweets1 <- getURL("https://raw.githubusercontent.com/spsstudent15/Data607FinalProject/master/tweets503to509.csv")
tweets1<- data.frame(read.csv(text=tweets1))
tweets1$city[tweets1$city == 'los Angeles'] <- 'Los Angeles'
tweets1$movie <- encodeString(as.character(tweets1$movie))
tweets1$movie[tweets1$movie == 'Mother<U+0092>s Day'] <- "Mother's Day"
tweets1$movie[tweets1$movie == "L<U+0092>Attesa (The Wait)"] <- "L'Attesa (The Wait)"
moviesum1<- ddply(tweets1, .(movie), summarize, Sentiment_Score=mean(Sentiment_Score), Count_of_Tweets=length(tweet)) #summarize by movie
moviesum1$movie<-as.character(moviesum1$movie) #convert levels to character
moviesum1<- arrange(moviesum1, movie) #sort alphabetically by movie name
#moviesum1$num<-seq.int(nrow(moviesum1)) # add counter row
names(moviesum1)<-c("Movie","Sentiment_Score","Count_of_Tweets")
twittersent<-getURL("https://raw.githubusercontent.com/spsstudent15/Data607FinalProject/master/sentimentsummarytwitter.csv")
df<- data.frame(read.csv(text=twittersent))
datatw<-df[,c(2:4)]
datatw$Sentiment_Score=round(datatw$Sentiment_Score,3) #round decimals to 3 places
datatable(datatw, options = list(columnDefs = list(list(className = 'dt-center', targets = 3)), pageLength = 5, lengthMenu = c(5, 10, 40)), rownames= TRUE) #create data table for display
write.csv(moviesum1, file = "sentimentsummarytwitter.csv", row.names = FALSE)
The resulting CSV is available here:
We looked at the average Twitter sentiment by city.
citysum1<- ddply(tweets1, .(city), summarize, Sentiment_Score=mean(Sentiment_Score), Count_of_Tweets=length(tweet)) #summarize by city
kable(citysum1, row.names = TRUE)
city | Sentiment_Score | Count_of_Tweets |
---|---|---|
Boston | 0.6685495 | 207 |
Chicago | 0.6869049 | 526 |
Denver | 0.7415370 | 165 |
Houston | 0.7692286 | 284 |
Lincoln | 0.7168697 | 12 |
Los Angeles | 0.6937290 | 1724 |
New Orleans | 0.6528819 | 99 |
New York City | 0.7133538 | 2076 |
Seattle | 0.7025967 | 218 |
St. Louis | 0.8095269 | 134 |
datanyt<- datanyt %>% left_join(datatw, by = "movie")
datacombo<- datanyt[,c(1,4,6,3,7,2,5)]
names(datacombo)<-c("Movie","NYT_Sentiment","Twitter_Sentiment","NYT_CriticsPick","Count_of_Tweets","Opening_Date","Genre")
#kable(datacombo)
datatable(datacombo, options = list( pageLength = 10, lengthMenu = c(5, 10, 40)), rownames= TRUE)
write.csv(datacombo, file = "sentimentsummarycombo.csv", row.names = FALSE)
The resulting CSV is available here:
We compared New York Times review sentiment score and Twitter average sentiment score for the top 10 movies by tweet count.
library(plotly)
top_movies <- ddply(tweets1, ~movie, summarize, tweet_count = length(tweet))
top_movies <- top_movies[order(-top_movies$tweet_count),]
n <- 10
top_movies <- head(top_movies$movie, n = n)
data.url <- file(paste(url,"movie_score.csv", sep = ""), open="r" )
movies_score_NYT <- read.csv(data.url, sep=",", header=TRUE, stringsAsFactors = FALSE)
movies_score_NYT$movie[movies_score_NYT$X == 22] <- "Mother's Day"
movies_score_NYT_subset <- subset(movies_score_NYT, movies_score_NYT$movie %in% top_movies)
a <- list()
for (i in seq_len(nrow(movies_score_NYT_subset))) {
a[[i]] <- list(
x = movies_score_NYT_subset$movie[i],
y = movies_score_NYT_subset$Sentiment_score[i],
text = paste("NYT Score: <br>", signif(movies_score_NYT_subset$Sentiment_score[i], 5)),
xref = "x", yref = "y", showarrow = TRUE, arrowhead = 7, ax = 10, ay = -20
)
}
m = list( l = 50, r = 50, b = 150, t = 100, pad = 5)
xlabel <- list(title = "Movie")
ylabel <- list(title = "Sentiment Score")
tweets1_subset <- subset(tweets1, tweets1$movie %in% top_movies) %>%
arrange(movie)
plot_ly(tweets1_subset,
y = Sentiment_Score,
color = movie,
type = "box",
boxpoints = "all", jitter = 0.3) %>%
layout(title = sprintf("Box Plot of Movie Tweets for Top %s Movies", n),
xaxis = xlabel,
yaxis = ylabel,
annotations = a,
autosize = F, width = 1000, height = 800, margin = m)
We wanted to determine how well the New York Times review predicted the sentiment of the tweets. To attempt to answer this question, we compared the New York Times score against the average Twitter sentiment score for each movie. We filtered the dataset to consider only movies that generated at least 15 tweets.
####################################################################
# Build Data Set for Analysis
####################################################################
# Merge the 2 streams
tweets_url <- getURL("https://raw.githubusercontent.com/spsstudent15/Data607FinalProject/master/tweets503to509.csv")
tweets_list <- read.csv(text = tweets_url, header = TRUE, sep = ",")
#movie_score_nyt <- read.csv("movie_score_nyt.csv", header = TRUE, sep = ",")
nyt_url <- getURL("https://raw.githubusercontent.com/spsstudent15/Data607FinalProject/master/sentimentsummarynyt.csv")
movie_score_nyt <- read.csv(text = nyt_url, header = TRUE, sep = ",")
#str(movie_score_nyt)
#str(tweets_list)
my_data_raw <- inner_join(tweets_list, movie_score_nyt, by = "movie")
###################################################################
# Transform data
###################################################################
# Fix Los Angeles
my_data_raw$city[my_data_raw$city == 'los Angeles'] <- 'Los Angeles'
###################################################################
# Select/Filter/Arrange Data
### Selecting columns
my_data_analysis <- select (my_data_raw, movie, day, opening_date, publication_date,
mpaa_rating, genre, critics_pick, city, Sentiment_Score, Sentiment_Score_nyt)
#### Filter out any rows where count < 15
data_movie <- my_data_analysis %>% group_by (movie) %>% summarise (count=n())
data_movie_15 <- data_movie %>% filter(count>= 15) %>% select(movie, count)
data_movie_15 <- arrange(data_movie_15, desc(count), movie)
mydf15 <- inner_join(data_movie_15, my_data_analysis)
mydf15 <- arrange(mydf15, desc(count), movie, Sentiment_Score)
###################################################################
# Tweet Sentiment Analysis Confidence Interval
##################################################################
movies <- unique(mydf15$movie)
n_movie <- length(movies)
# initial storage variables
movie_mean <- rep(NA, n_movie)
movie_cis <- matrix(nrow = n_movie,ncol = 2)
for (i in 1:n_movie){
# Extract data for calculation
rows <- which(mydf15$movie == movies[i])
observations <- mydf15$Sentiment_Score[rows]
# Store mean for Sentiment Analysis
movie_mean[i] <- mean(observations)
movie_sd <- sd(observations)
n <- length(observations)
movie_ster <- movie_sd / sqrt(n)
# Calculate CIs
# We use 2 instead of 1.96 (the 95% z-value) since we do not know the underlying distribution
movie_cis[i,1] <- movie_mean[i] - 2 * movie_ster
movie_cis[i,2] <- movie_mean[i] + 2 * movie_ster
}
movie_mean_df <- bind_cols(data_movie_15,as.data.frame(movie_mean))
movie_analysis <- inner_join(movie_mean_df, movie_score_nyt)
movie_analysis <- select(movie_analysis, movie, count, movie_mean, mpaa_rating, critics_pick, Sentiment_Score_nyt, genre)
movie_analysis <- arrange(movie_analysis, desc(count), movie)
#############################################################
We plotted the New York Times Review Score versus the average Twitter sentiment score.
ggplot(movie_analysis, aes(x=Sentiment_Score_nyt, y=movie_mean )) + geom_point()
From the scatter plot, it did not seem probable that there was a relationship between these two variables. We then fit a line through the graph and calculated the linear regression.
ggplot(movie_analysis, aes(x=Sentiment_Score_nyt, y=movie_mean )) + geom_point() + stat_smooth(method = lm)
m1 <- lm(movie_mean ~ Sentiment_Score_nyt, data = movie_analysis)
summary(m1)
##
## Call:
## lm(formula = movie_mean ~ Sentiment_Score_nyt, data = movie_analysis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.149683 -0.046395 -0.009227 0.033324 0.178899
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.64689 0.03952 16.369 5.62e-11 ***
## Sentiment_Score_nyt 0.03521 0.05518 0.638 0.533
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08733 on 15 degrees of freedom
## Multiple R-squared: 0.02643, Adjusted R-squared: -0.03847
## F-statistic: 0.4072 on 1 and 15 DF, p-value: 0.533
From these results, it did not appear that there was a relationship between these two variables.
To get a better picture of the variations between the average of the Twitter sentiment analysis and the New York Times review score, we plotted the confidence interval (95%) for the tweets for all movies concurrently. We also plotted the NY Times review score.
We displayed the confidence intervals for each movie based on its Twitter sentiment scores. For 9 of the 10 movies, the New York Times sentiment score falls outside the confidence interval. Since we are 95% confident that the true mean of Twitter sentiment is contained in each interval, we can conclude that Twitter sentiment scores are not similar to the sentiment scores of the New York Times reviews.
tweets <- merge(tweets1, movies_score_NYT_subset[2:4]) %>%
rename(Twitter_score = Sentiment_Score, NYT_score = Sentiment_score) %>%
select(movie, Twitter_score, NYT_score)
tweets_confidence <- NULL
for (i in unique(tweets$movie)){
conf <- t.test(tweets$Twitter_score[tweets$movie == i], conf.level = .95)
results <- NULL
results$movie <- i
results$lower_conf <- conf$conf.int[1]
results$upper_conf <- conf$conf.int[2]
tweets_confidence <- rbind(tweets_confidence, as.data.frame(results))
}
tweets <- merge(tweets, tweets_confidence)
ggplot(tweets, aes(x=Twitter_score)) +
geom_histogram(binwidth=.1, colour="black", fill="white") +
theme_minimal() +
theme(legend.position="top") +
facet_wrap(~movie) +
xlab("Twitter Sentiment Score") +
ylab("Count") +
ggtitle("Confidence Intervals for Top 10 Twitter Movies and NYTimes Sentiment Score") +
geom_vline(data=unique(tweets[c("movie","NYT_score")]),
aes(xintercept = NYT_score, color="NYT Sentiment Score"), linetype= 1, size=1) +
geom_vline(data = tweets[c("movie", "lower_conf")],
aes(xintercept = lower_conf , color = "Lower Confidence Interval"), linetype="dashed", size=1) +
geom_vline(data = tweets[c("movie", "upper_conf")],
aes(xintercept = upper_conf , color = "Upper Confidence Interval"), linetype="dashed", size=1) +
scale_colour_manual(name="", values=c("NYT Sentiment Score" ="salmon",
"Lower Confidence Interval" = "darkolivegreen2",
"Upper Confidence Interval" = "lightskyblue3"))
# Building Confidence Interval Graph in ggplot layer by layer
# base plot; each point is a tweet, graded for transparency (alpha setting)
p <- ggplot(mydf15, aes(x=reorder(movie, -count), y=jitter(Sentiment_Score, .4))) + geom_point(colour = "darkgrey", size = 3, alpha=.1)
# Added segment for Confidence Interval
p <- p + annotate("segment", x = movies, xend=movies, y=movie_cis[,1], yend = movie_cis[,2], colour = "blue", size = 2)
# Added mean for tweets and NYT score (color coded for critics_pick)
p <- p + geom_point(data=movie_mean_df, mapping=aes(y=movie_mean, colour = "red", size = 4))
p <- p + geom_point(data=mydf15, mapping=aes( y=Sentiment_Score_nyt, colour = as.factor(critics_pick), size = 4))
# change text for axis labels
p <- p + xlab("Movies") + ylab("Tweets Sentiment Analysis Score")
# change appearance of text for axis label
p <- p + theme (axis.title.x = element_text(face="italic", colour = "darkblue", size = 14), axis.title.y = element_text(face="italic", colour = "darkblue", size = 14))
# change appearance of text for x-axis tick text
p <- p + theme (axis.text.x = element_text(size = 10, angle = 90, hjust = 1, vjust = 1))
# Added title for graph and add text formatting with theme element
p <- p + ggtitle("Tweets Sentiment Analysis Confidence Intervals") + theme(plot.title = element_text(vjust = 3, colour = "darkblue", face = "bold", size = 14))
# Format Legends
# Remove Legends for size
p <- p + guides(size = FALSE)
# Format Legend box for Points
p <- p + labs(colour = "Points") + guides(colour = guide_legend (reverse = TRUE)) + scale_colour_discrete(labels=c("NYT score not picked", "NYT Score picked", "mean(Tweet Scores)"))
p
We sorted the movies by number of overall tweets in descending order.
From this graph, only one movie had a New York Times score that fell inside the confidence interval around the average of its tweets: "Viva." We recalled that this was a New York Times Critics' Pick whose review nonetheless received a low sentiment score.
We concluded that there was no correlation between the New York Times review and the average Twitter sentiment for a given movie.
We looked at average Twitter sentiment score by movie and by city.
ggplot(
moviesum1, aes(x = reorder(Movie,Sentiment_Score), y = Sentiment_Score, fill=Sentiment_Score)) +
geom_bar(stat="identity") +
ggtitle("Average Sentiment Score from Twitter, by Movie")+
theme(axis.text=element_text(angle=90))+
labs(x="Movies",y="Score")
ggplot(
citysum1, aes(x = reorder(city,Sentiment_Score), y = Sentiment_Score, fill=Sentiment_Score)) +
geom_bar(stat="identity") +
ggtitle("Average Sentiment Score from Twitter, by City")+
theme(axis.text=element_text(angle=90))+
labs(x="City",y="Score")
We looked at average Twitter sentiment scores by movie for the top 10 movies. Larger circles indicate more positive sentiment. Different colors indicate different movies.
library(RColorBrewer)
mapped_tweets <- merge(tweets1, locations, all.x = TRUE, by.y = "City", by.x = "city") %>%
group_by(movie, city, Long, Lat) %>%
summarise(avg_twitter_score = mean(Sentiment_Score)) %>%
merge(movies_score_NYT[,c("movie", "Sentiment_score")], all.x = TRUE, by = "movie") %>%
rename(NYT_sentiment_score = Sentiment_score)
mapped_tweets_top <- subset(mapped_tweets, mapped_tweets$movie %in% top_movies)
factpal <- colorFactor(brewer.pal(10, "Spectral"), mapped_tweets_top$movie)
leaflet(mapped_tweets_top) %>%
addProviderTiles("Stamen.Toner") %>%
addCircles(lng = ~jitter(Long, factor = 20), lat = ~jitter(Lat, factor = 20), weight = 2,
radius = ~avg_twitter_score*100000,
popup =paste("<font color=", factpal(mapped_tweets_top$movie), ">" ,
"Movie:", as.character(mapped_tweets_top$movie), "<br>",
"Avg Twitter Sentiment Score:", as.character(signif(mapped_tweets_top$avg_twitter_score,3)),"<br>",
"NYT Sentiment Score:", as.character(signif(mapped_tweets_top$NYT_sentiment_score,3)),
"</font>"),
color = ~factpal(mapped_tweets_top$movie),
fillOpacity = 0.4)
We utilized the Twitter API to build our collection of tweets. Using the twitteR package, we were able to pull tweets by location by setting the geocode coordinates and radius. OAuth was necessary because twitteR uses the Developer API to search tweets; authorization also required manual interaction, granting the application access by entering a PIN provided by Twitter.
We tested a number of different sentiment analysis approaches.
We started by using the bag-of-words method, counting positive and negative words within each tweet to produce a positivity score. The problem with this method was that many of the words in our lexicon did not appear in tweets, and the context of a tweet was frequently misclassified.
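For illustration, a minimal bag-of-words scorer of the kind we first tried; pos_words and neg_words are small stand-ins for the opinion lexicon we used. The last line shows the context problem: the tweet is positive, but word counting scores it as negative.
pos_words <- c("good", "great", "love", "amazing")  # stand-ins for a full opinion lexicon
neg_words <- c("bad", "awful", "hate", "boring")
bow_score <- function(text) {
words <- unlist(strsplit(tolower(text), "[^a-z']+"))  # crude tokenizer
sum(words %in% pos_words) - sum(words %in% neg_words)
}
bow_score("Not bad at all, I loved it")  # returns -1: "bad" is counted despite the positive meaning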
We realized that we needed help getting a more accurate sentiment analysis. We looked into several APIs, such as IBM's AlchemyAPI, and settled on Indico since it offered a free tier and an easy-to-use R package.
We used the Indico Sentiment Analysis API to analyze our text, returning a positivity score on a scale from 0 to 1. We tested several obviously negative and obviously positive tweets, and the responses were very accurate.
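A sanity check of this kind (the example tweets here are invented) looked like the following, using the indicoio package loaded earlier:
library(indicoio)
sentiment(c("This movie was an absolute masterpiece!",
            "Two hours of my life I will never get back."),
          api_key = my_indicoio_API_Key)
# expect a score near 1 for the first tweet and near 0 for the second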
Indico Benchmark: The API performs with 93% accuracy on the IMDB dataset (state of the art).
The Leaflet package was simple to use with our given list of locations with latitude and longitude. One drawback is the lack of a built-in zoom reset button. The popup option is a clean way to allow for label display without clutter.
We found Slack easy to use with a pleasant user interface and file sharing options. Slack may be more useful for larger groups. For our three-person group, we found that email was mostly manageable. Since we were already checking in on email, RPubs, and GitHub, Slack seemed like one more thing to check.
This was our first experience contributing to a repository using GitHub. It made sharing the code and version control of each change much easier than previous attempts. GitHub versioning control is powerful and can handle most merges seamlessly. However, if there is a conflict that GitHub cannot handle, the resolution must be handled through the Git command line and can be challenging.
Special characters and encoding
Movie titles with accents, as in many foreign films, and titles with apostrophes caused encoding issues that had to be resolved with character substitution. Several movies would have been unintentionally omitted from the analysis had we not resolved these characters.
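A sketch of the substitution, assuming the stray byte is the Windows-1252 right single quote (\u0092) that appeared in titles such as "Mother's Day":
fix_quotes <- function(x) gsub("\u0092", "'", x)  # swap the stray control character for a plain apostrophe
fix_quotes("Mother\u0092s Day")  # "Mother's Day"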
Scope Limitations - Further Research
We had hoped to include an analysis by movie genre, but we had to reduce this portion of the scope to meet the project deadline. With more time and data points, we would have explored other possible correlations, including a time series analysis.
Twitter limitations
Twitter's free service for a directed search limited us to recent history and at most 5,000 tweets per query. We had to revise our approach and coordinate the timing of our Twitter pulls with the New York Times reviews to have any hope of retrieving relevant data.
Indico limitations
Indico's free service allows 10,000 pulls, and each tweet counts as one pull. For a project with a contained scope, this may be feasible; for long-term research studies, however, a paid account would be needed.
New York Times API data
The syntax documentation for the New York Times API was easy to follow. However, the pulls seemed to omit some results. The API returned 20 results per pull, so offsets were required to retrieve subsequent results. Even with offsets, it was necessary to combine two weeks of pulls and eliminate duplicate records to assemble the full dataset for our desired date range.
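A sketch of the offset paging, assuming YOURAPIKEYHERE holds a valid key as in the retrieval chunk above; pages are stacked and then de-duplicated:
library(jsonlite)
base_url <- paste0('http://api.nytimes.com/svc/movies/v2/reviews/search.json',
                   '?opening-date=2016-04-21;2016-05-07&api-key=', YOURAPIKEYHERE)
pages <- lapply(seq(0, 40, by = 20), function(off) {
as.data.frame(fromJSON(paste0(base_url, '&offset=', off))$results)[, 1:8]  # 20 results per pull
})
all_reviews <- do.call(rbind, pages)
all_reviews <- all_reviews[!duplicated(all_reviews), ]  # drop records repeated across pulls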
Twitter Data Reliability
Twitter has a limitation on interpretability. It is useful as a proof of concept but cannot be the sole source of information. Many tweets were commercial advertisements or contained unhelpful links. Twitter is a useful tool to explore, but we would not build a data product exclusively on it. Unless the search criteria can be narrowed to specific keywords, we run the risk of pulling tweets unrelated to the intended search. The short length of tweets makes it difficult to filter unwanted results.
Outliers
The Twitter data captured only 34 of our 40 New York Times reviewed movies. More obscure or independent movies, or movies with limited releases, were less likely to have mentions on Twitter. The New York Times does more in-depth cultural and artistic analysis than Twitter.
Lincoln, Nebraska had very few data points, which skewed the results for that city. A few movies also had very few tweets, which skewed their movie-level scores, so we dropped certain results.
Noisy Data
We encountered issues with one-word movie titles that pulled many unrelated tweets. Our solution was to append "movie" to the search term to narrow results to films. Another issue was that the movie "Mother's Day" came out right before the actual Mother's Day holiday. Many people appeared to go to the movies on Mother's Day to celebrate, but the film they actually saw may have been a different one. There were many nuances involved in pulling the tweets.
Interpretation of results and further research needed
The results indicated that the New York Times reviews did not reflect the tweet scores, which we used as a proxy for public opinion of the movies.
Further research is needed to stratify the tweets because several movies are intended for different demographics and a more focused approach may yield different results.
One challenge of this type of analysis was our inability to determine whether the people tweeting about a film had also read the New York Times review. An experiment would be needed to establish a causal relationship between traditional media sources and sentiment on newer social media platforms.
Detailed Approach