This is an analysis of movie reviews taken from five imaginary individuals. Each “reviewer” was asked to share their opinion of up to six films on a scale of 1 – 5 with 1 being “Below Average” and 5 being “Excellent”. There is also a response of N/A – Did not watch film which is meant to catch any reviewers who left the question blank.
Packages needed
#install.packages("RPostgres")
library(DBI)
library(RCurl)
library(ggplot2)
I created a database using PostgreSQL with 3 tables. One to hold all responses, one with reviewer information, and the third with just the movie names.
Connect to PostgreSQL DB by replacing all variables with your own credentials.
dbName <- 'data607'
host <- 'localhost'
port <- 5432
user <- 'postgres'
password <- '10Pancakes'
# Create connection and save it to con
con <- dbConnect(RPostgres::Postgres(), dbname = dbName,
host = host,
port = port,
user = user,
password = password)
You will need to create 3 tables using csv data saved in GitHub.
#Movies data
x1 <- getURL("https://raw.githubusercontent.com/ltcancel/HWK2_Data607_F20/master/movies.csv")
df1 <- read.csv(text = x1)
#Reviewers
x2 <- getURL("https://raw.githubusercontent.com/ltcancel/HWK2_Data607_F20/master/reviewers.csv")
df2 <- read.csv(text = x2)
#Review results
x3 <- getURL("https://raw.githubusercontent.com/ltcancel/HWK2_Data607_F20/master/Movie%20Review%20Results.csv")
df3 <- read.csv(text = x3)
Create the tables and load the information at the same time using dbWriteTable(). Then use dbListTables(con) to verify that the tables were created. You should have the tables movies, reviewers, and results.
#Movies table
dbWriteTable(con, "movies", df1, row.names = FALSE)
#Reviewers table
dbWriteTable(con, "reviewers", df2, row.names = FALSE)
#Results table
dbWriteTable(con, "results", df3, row.names = FALSE)
#Confirm table creation
dbListTables(con)
## [1] "movies" "reviewers" "results"
Query the data needed for your analysis. Since the main results table only has the reviewer’s ID we will join with the reviewers table to get the reviewer’s name.
res <- dbGetQuery(con, 'SELECT b."last_name", b."first_name",
a."Movie_Desc", a."Opinion_Desc"
FROM results a
INNER JOIN reviewers b ON a."Respondent" = b."X.U.FEFF.reviewer_id"')
#preview results of query
head(res)
## last_name first_name Movie_Desc
## 1 Baggins Bilbo Sonic The Hedghog
## 2 Baggins Bilbo My Spy
## 3 Baggins Bilbo Star Wars: The Rise of Skywalker
## 4 Baggins Bilbo Frozen II
## 5 Baggins Bilbo Parasite
## 6 Baggins Bilbo 1917
## Opinion_Desc
## 1 Average
## 2 N/A - Did not watch film
## 3 Excellent
## 4 Average
## 5 Average
## 6 Average
I used ggplot to create a barplot showing the ratings for each movie. I flipped the table, so the x-axis is on the left because some of the ratings name are long and get bunched up making it difficult to read. I kept the N/A data for this table because I thought it was important to know which movie had the least number of views out of all the reviewers. Parasite had the most N/A responses and Frozen II is not part of this bar at all, so I know that each reviewer has seen this film. Five out of the six films have an Average rating with 1917 having the greatest number of Average ratings.
ggplot(data = res) +
geom_bar(
mapping = aes(x = Opinion_Desc, fill = Movie_Desc), position = "fill"
) +
coord_flip() +
labs(x = "Rating", title = "Movie Ratings", fill = "Movie Titles")
I queried the same data again but this time without the N/A responses. The coord_polar setting was the best for these results. This confirms findings from the first table where most of the films have an Average rating. Although the Excellent and Below Average rating had the least number of movies, I am able see which films had the most responses for these ratings. Frozen II is rated the worse out of all films because it has the greatest number of Below Average ratings.
res2 <- dbGetQuery(con, 'SELECT b."last_name", b."first_name",
a."Movie_Desc", a."Opinion_Desc"
FROM results a
INNER JOIN reviewers b ON a."Respondent" = b."X.U.FEFF.reviewer_id"
WHERE a."Opinion_Cd" != 5')
ggplot(data = res2) +
geom_bar(
mapping = aes(x = Opinion_Desc, fill = Movie_Desc), position = "fill"
) +
coord_polar() +
labs(x = "Rating", title = "Movie Ratings", fill = "Movie Titles")
There were some challenges with setting up the database. I was going to originally use MySQL but had issues with the password and was not able to connect. PostgrSQL was the backup option and it was not difficult to setup. Creating tables and importing values at the same time was very easy and convenient however, the column names were created with quotes even though the original data does not have quotes. I would try to find a way to create the table without quotes around the column names or alter the column names to remove the quotes. I would also gather more reviews and more information about my reviewers so I can create a more interesting analysis.
Disconnect from db when you are done.
dbDisconnect(con)