CUNY MSDS DATA 607

Task:

Choose six recent popular movies. Ask at least five people that you know (friends, family, classmates, imaginary friends) to rate each of these movie that they have seen on a scale of 1 to 5. Take the results (observations) and store them in a SQL database. Load the information into an R dataframe.

Libraries:

library("RPostgreSQL")
library("tidyverse")
library("knitr")
library("kableExtra")

R code to connect to PostgreSQL Database:

drv <- dbDriver("PostgreSQL")
# creates a connection to the postgres database
con <- dbConnect(drv, dbname = "data606hw2",
                 host = "localhost", port = 5432,
                 user = "postgres", password = pw)

# dbListTables shows available tables in our dataset
dbListTables(con)

## [1] "reviewers" "movies"    "ratings"

SQL Code to create and fill tables in our Database:

This code doesn’t run, since I am connected to my SQL DB. The code is provided as an example of how DB was created.

CREATE TABLE was used to add new tables to the database.

INSERT INTO was used to add values into those tables.

JOIN is used to connect multiple tables in the DB together.

CREATE TABLE reviewers (
user_id INT PRIMARY KEY,
user_name VARCHAR(55) NOT NULL
);

INSERT INTO reviewers (user_id, user_name)
VALUES
(1, 'Nick'),
(2, 'Amanda'),
(3, 'Saverio'),
(4, 'Anthony'),
(5, 'Haley');

CREATE TABLE movies (
movie_id INT PRIMARY KEY,
movie_name VARCHAR(255) NOT NULL
);

INSERT INTO movies (movie_id, movie_name)
VALUES
(1, 'Star Wars: A New Hope'),
(2, 'Star Wars: Empire Strikes Back'),
(3, 'Star Wars: Return of the Jedi'),
(4, 'Star Wars: The Force Awakens'),
(5, 'Star Wars: Rogue One'),
(6, 'Star Wars: The Last Jedi');

CREATE TABLE ratings (
user_id INT,
movie_id INT,
rating FLOAT
);

INSERT INTO ratings (user_id, movie_id, raiting)
VALUES
(1, 1, '5'),
(1, 2, '5'),
(1, 3, '4.5'),
(1, 4, '4.5'),
(1, 5, '4'),
(1, 6, '3'),
-- I removed move of the values for ratings to save space.  Full SQL on github

Query to pull out some data:

This query accesses the SQL DB via dbGetQuery and passes SQL Statements

query2 <- dbGetQuery(con, "SELECT * FROM movies")
query3 <- dbGetQuery(con, "SELECT * FROM reviewers")

query2

##   movie_id                     movie_name
## 1        1          Star Wars: A New Hope
## 2        2 Star Wars: Empire Strikes Back
## 3        3  Star Wars: Return of the Jedi
## 4        4   Star Wars: The Force Awakens
## 5        5           Star Wars: Rogue One
## 6        6       Star Wars: The Last Jedi

query3

##   user_id user_name
## 1       1      Nick
## 2       2    Amanda
## 3       3   Saverio
## 4       4      Shea
## 5       5     Haley

Query to join the movies and rating tables together, and pull out useful information:

query1 <- dbGetQuery(con, "SELECT movie_name, user_id, ratings.movie_id, rating
                    FROM movies
                     LEFT JOIN ratings
                     on movies.movie_id = ratings.movie_id")
query1

##                        movie_name user_id movie_id rating
## 1           Star Wars: A New Hope       1        1    5.0
## 2  Star Wars: Empire Strikes Back       1        2    5.0
## 3   Star Wars: Return of the Jedi       1        3    4.5
## 4    Star Wars: The Force Awakens       1        4    4.5
## 5            Star Wars: Rogue One       1        5    4.0
## 6        Star Wars: The Last Jedi       1        6    3.0
## 7           Star Wars: A New Hope       2        1    5.0
## 8  Star Wars: Empire Strikes Back       2        2    5.0
## 9   Star Wars: Return of the Jedi       2        3    5.0
## 10   Star Wars: The Force Awakens       2        4    4.0
## 11           Star Wars: Rogue One       2        5    3.5
## 12       Star Wars: The Last Jedi       2        6    3.5
## 13          Star Wars: A New Hope       3        1    4.5
## 14 Star Wars: Empire Strikes Back       3        2    4.5
## 15  Star Wars: Return of the Jedi       3        3    4.0
## 16   Star Wars: The Force Awakens       3        4    3.0
## 17           Star Wars: Rogue One       3        5    4.0
## 18       Star Wars: The Last Jedi       3        6    3.0
## 19          Star Wars: A New Hope       4        1    4.0
## 20 Star Wars: Empire Strikes Back       4        2    4.0
## 21  Star Wars: Return of the Jedi       4        3    4.5
## 22   Star Wars: The Force Awakens       4        4    5.0
## 23           Star Wars: Rogue One       4        5    4.0
## 24       Star Wars: The Last Jedi       4        6    3.0
## 25          Star Wars: A New Hope       5        1    4.5
## 26 Star Wars: Empire Strikes Back       5        2    4.5
## 27  Star Wars: Return of the Jedi       5        3    4.0
## 28   Star Wars: The Force Awakens       5        4    4.5
## 29           Star Wars: Rogue One       5        5    3.5
## 30       Star Wars: The Last Jedi       5        6    3.5

Code to show movie name vs. the mean ratings:

It seems that the older Star Wars movies have a higher rating on average than the newer movies.

data1 <- group_by(query1, movie_name) %>% summarise(mean_rating = mean(rating))

#exploring with kable package to make fancy looking tables
kable(data1, "html", escape = F) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(1:2, bold = T) %>%
  row_spec(1:2, bold = T, background = "#b3ffb3") %>%
  row_spec(6, bold = T, background = "#ff8080") %>%
  row_spec(0, bold = T, color = "grey", background = "#e5ffcc")

movie_name	mean_rating
Star Wars: A New Hope	4.6
Star Wars: Empire Strikes Back	4.6
Star Wars: Return of the Jedi	4.4
Star Wars: Rogue One	3.8
Star Wars: The Force Awakens	4.2
Star Wars: The Last Jedi	3.2

cat("The mean rating from all Star Wars movies in the data set is: ", 
    mean(query1$rating), "\n")

## The mean rating from all Star Wars movies in the data set is:  4.133333

cat("The median rating from all Star Wars movies in the data set is: ", 
    median(query1$rating))

## The median rating from all Star Wars movies in the data set is:  4

*see below visualization

Data Visualization:

Line/scatter Plot:

ggplot(query1, aes(query1$movie_id, query1$rating)) + geom_smooth(se = F, method = loess) +
  geom_vline(xintercept=median(query1$rating), col='orange') +
  geom_vline(xintercept=mean(query1$rating), col='red') +
  xlab("Movie ID") +
  ylab("Raiting") +
  ggtitle("Star Wars: Movie Raitings") +
  labs(subtitle = "Orange = median, red = mean") +
  theme_dark() +
  geom_point(aes(color = query1$movie_name))

Mean = red Median = orange

*The mean and median shown here match with our R code above

Box Plot:

ggplot(query1, aes(query1$movie_name, query1$rating)) + 
  geom_boxplot(aes(fill = query1$movie_name)) +
  coord_flip() + theme_dark() + 
  xlab("Movie Title") + ylab("Rating") + 
  ggtitle("Movie vs. Rating") + 
  labs(fill="Movie Name by color")

Density Plot:

ggplot(query1, aes(query1$movie_id)) + 
  geom_density(aes(color = query1$movie_name, fill = query1$movie_name, alpha = 0.1)) +
  xlab("Movie ID") +
  ylab("Density") +
  ggtitle("Star Wars: Movies -  Density Plot")  +
  theme_dark()

Conclusions:

In this assignment we accessed a local SQL Database and connected that database to R. The database was created from scratch, and real individuals were interviewed for the data. From there, we pulled in data, manipulated that data, and transformed the data into visualizations. The graphics, and other code, allowed us to see that on average, the rating for all Star Wars movies in this database was about 4.1. It also showed us that the original Star Wars movies out ranked the newer movies.

CUNY MSDS DATA 607 - HW 2

Nicholas Schettini

February 3rd, 2018