Task:
Choose six recent popular movies. Ask at least five people that you know (friends, family, classmates, imaginary friends) to rate each of these movie that they have seen on a scale of 1 to 5. Take the results (observations) and store them in a SQL database. Load the information into an R dataframe.
Libraries:
library("RPostgreSQL")
library("tidyverse")
library("knitr")
library("kableExtra")
R code to connect to PostgreSQL Database:
drv <- dbDriver("PostgreSQL")
# creates a connection to the postgres database
con <- dbConnect(drv, dbname = "data606hw2",
host = "localhost", port = 5432,
user = "postgres", password = pw)
# dbListTables shows available tables in our dataset
dbListTables(con)
## [1] "reviewers" "movies" "ratings"
SQL Code to create and fill tables in our Database:
This code doesn’t run, since I am connected to my SQL DB. The code is provided as an example of how DB was created.
CREATE TABLE was used to add new tables to the database.
INSERT INTO was used to add values into those tables.
JOIN is used to connect multiple tables in the DB together.
CREATE TABLE reviewers (
user_id INT PRIMARY KEY,
user_name VARCHAR(55) NOT NULL
);
INSERT INTO reviewers (user_id, user_name)
VALUES
(1, 'Nick'),
(2, 'Amanda'),
(3, 'Saverio'),
(4, 'Anthony'),
(5, 'Haley');
CREATE TABLE movies (
movie_id INT PRIMARY KEY,
movie_name VARCHAR(255) NOT NULL
);
INSERT INTO movies (movie_id, movie_name)
VALUES
(1, 'Star Wars: A New Hope'),
(2, 'Star Wars: Empire Strikes Back'),
(3, 'Star Wars: Return of the Jedi'),
(4, 'Star Wars: The Force Awakens'),
(5, 'Star Wars: Rogue One'),
(6, 'Star Wars: The Last Jedi');
CREATE TABLE ratings (
user_id INT,
movie_id INT,
rating FLOAT
);
INSERT INTO ratings (user_id, movie_id, raiting)
VALUES
(1, 1, '5'),
(1, 2, '5'),
(1, 3, '4.5'),
(1, 4, '4.5'),
(1, 5, '4'),
(1, 6, '3'),
-- I removed move of the values for ratings to save space. Full SQL on github
Query to pull out some data:
This query accesses the SQL DB via dbGetQuery and passes SQL Statements
query2 <- dbGetQuery(con, "SELECT * FROM movies")
query3 <- dbGetQuery(con, "SELECT * FROM reviewers")
query2
## movie_id movie_name
## 1 1 Star Wars: A New Hope
## 2 2 Star Wars: Empire Strikes Back
## 3 3 Star Wars: Return of the Jedi
## 4 4 Star Wars: The Force Awakens
## 5 5 Star Wars: Rogue One
## 6 6 Star Wars: The Last Jedi
query3
## user_id user_name
## 1 1 Nick
## 2 2 Amanda
## 3 3 Saverio
## 4 4 Shea
## 5 5 Haley
Query to join the movies and rating tables together, and pull out useful information:
query1 <- dbGetQuery(con, "SELECT movie_name, user_id, ratings.movie_id, rating
FROM movies
LEFT JOIN ratings
on movies.movie_id = ratings.movie_id")
query1
## movie_name user_id movie_id rating
## 1 Star Wars: A New Hope 1 1 5.0
## 2 Star Wars: Empire Strikes Back 1 2 5.0
## 3 Star Wars: Return of the Jedi 1 3 4.5
## 4 Star Wars: The Force Awakens 1 4 4.5
## 5 Star Wars: Rogue One 1 5 4.0
## 6 Star Wars: The Last Jedi 1 6 3.0
## 7 Star Wars: A New Hope 2 1 5.0
## 8 Star Wars: Empire Strikes Back 2 2 5.0
## 9 Star Wars: Return of the Jedi 2 3 5.0
## 10 Star Wars: The Force Awakens 2 4 4.0
## 11 Star Wars: Rogue One 2 5 3.5
## 12 Star Wars: The Last Jedi 2 6 3.5
## 13 Star Wars: A New Hope 3 1 4.5
## 14 Star Wars: Empire Strikes Back 3 2 4.5
## 15 Star Wars: Return of the Jedi 3 3 4.0
## 16 Star Wars: The Force Awakens 3 4 3.0
## 17 Star Wars: Rogue One 3 5 4.0
## 18 Star Wars: The Last Jedi 3 6 3.0
## 19 Star Wars: A New Hope 4 1 4.0
## 20 Star Wars: Empire Strikes Back 4 2 4.0
## 21 Star Wars: Return of the Jedi 4 3 4.5
## 22 Star Wars: The Force Awakens 4 4 5.0
## 23 Star Wars: Rogue One 4 5 4.0
## 24 Star Wars: The Last Jedi 4 6 3.0
## 25 Star Wars: A New Hope 5 1 4.5
## 26 Star Wars: Empire Strikes Back 5 2 4.5
## 27 Star Wars: Return of the Jedi 5 3 4.0
## 28 Star Wars: The Force Awakens 5 4 4.5
## 29 Star Wars: Rogue One 5 5 3.5
## 30 Star Wars: The Last Jedi 5 6 3.5
Code to show movie name vs. the mean ratings:
It seems that the older Star Wars movies have a higher rating on average than the newer movies.
data1 <- group_by(query1, movie_name) %>% summarise(mean_rating = mean(rating))
#exploring with kable package to make fancy looking tables
kable(data1, "html", escape = F) %>%
kable_styling("striped", full_width = F) %>%
column_spec(1:2, bold = T) %>%
row_spec(1:2, bold = T, background = "#b3ffb3") %>%
row_spec(6, bold = T, background = "#ff8080") %>%
row_spec(0, bold = T, color = "grey", background = "#e5ffcc")
movie_name | mean_rating |
---|---|
Star Wars: A New Hope | 4.6 |
Star Wars: Empire Strikes Back | 4.6 |
Star Wars: Return of the Jedi | 4.4 |
Star Wars: Rogue One | 3.8 |
Star Wars: The Force Awakens | 4.2 |
Star Wars: The Last Jedi | 3.2 |
cat("The mean rating from all Star Wars movies in the data set is: ",
mean(query1$rating), "\n")
## The mean rating from all Star Wars movies in the data set is: 4.133333
cat("The median rating from all Star Wars movies in the data set is: ",
median(query1$rating))
## The median rating from all Star Wars movies in the data set is: 4
*see below visualization
Data Visualization:
Line/scatter Plot:
ggplot(query1, aes(query1$movie_id, query1$rating)) + geom_smooth(se = F, method = loess) +
geom_vline(xintercept=median(query1$rating), col='orange') +
geom_vline(xintercept=mean(query1$rating), col='red') +
xlab("Movie ID") +
ylab("Raiting") +
ggtitle("Star Wars: Movie Raitings") +
labs(subtitle = "Orange = median, red = mean") +
theme_dark() +
geom_point(aes(color = query1$movie_name))
Mean = red Median = orange
*The mean and median shown here match with our R code above
Box Plot:
ggplot(query1, aes(query1$movie_name, query1$rating)) +
geom_boxplot(aes(fill = query1$movie_name)) +
coord_flip() + theme_dark() +
xlab("Movie Title") + ylab("Rating") +
ggtitle("Movie vs. Rating") +
labs(fill="Movie Name by color")
Density Plot:
ggplot(query1, aes(query1$movie_id)) +
geom_density(aes(color = query1$movie_name, fill = query1$movie_name, alpha = 0.1)) +
xlab("Movie ID") +
ylab("Density") +
ggtitle("Star Wars: Movies - Density Plot") +
theme_dark()