Choose six recent popular movies. Ask at least five people that you know (friends, family, classmates, imaginary friends if necessary) to rate each of these movies that they have seen on a scale of 1 to 5. Take the results (observations) and store them in a SQL database of your choosing. Load the information from the SQL database into an R dataframe.

1.Generate database using SQL

The database to store the table with names, movies and ratings is to be created using SQL.


CREATE DATABASE rating;

USE rating;

CREATE TABLE rating (
 name_response varchar(100),
 movie_name varchar(100),
 rating int
);

INSERT INTO rating 
VALUES 
-- ('Anna', 'MINIONS. THE RISE OF GRU', 3),
('Alex', 'MINIONS. THE RISE OF GRU', 2),
('Inna', 'MINIONS. THE RISE OF GRU', 4),
('Dan', 'MINIONS. THE RISE OF GRU', 3),
('Lucy', 'MINIONS. THE RISE OF GRU', NULL),

('Anna', 'JURASSIC WORLD DOMINION', 4),
('Alex', 'JURASSIC WORLD DOMINION', 3),
('Inna', 'JURASSIC WORLD DOMINION', NULL),
('Dan', 'JURASSIC WORLD DOMINION', 3),
('Lucy', 'JURASSIC WORLD DOMINION', 4),

('Anna', 'ROGUE ONE: A STAR WARS STORY', NULL),
('Alex', 'ROGUE ONE: A STAR WARS STORY', 5),
('Inna', 'ROGUE ONE: A STAR WARS STORY', 1),
('Dan', 'ROGUE ONE: A STAR WARS STORY', 3),
('Lucy', 'ROGUE ONE: A STAR WARS STORY', NULL),

('Anna', 'SPIDER-MAN: NO WAY HOME', 5),
('Alex', 'SPIDER-MAN: NO WAY HOME', 5),
('Inna', 'SPIDER-MAN: NO WAY HOME', 4),
('Dan', 'SPIDER-MAN: NO WAY HOME', 5),
('Lucy', 'SPIDER-MAN: NO WAY HOME', 4),

('Anna', 'ORPHAN: FIRST KILL', NULL),
('Alex', 'ORPHAN: FIRST KILL', 1),
('Inna', 'ORPHAN: FIRST KILL', NULL),
('Dan', 'ORPHAN: FIRST KILL', 2),
('Lucy', 'ORPHAN: FIRST KILL', NULL),

 ('Anna', 'ELVIS', 4),
('Alex', 'ELVIS', 3),
('Inna', 'ELVIS', 3),
('Dan', 'ELVIS', 4),
('Lucy', 'ELVIS', NULL);

2. Load database using R

To load SQL database using R, RMySQL library is used.

db = dbConnect(MySQL(), user='root', password = '336261', dbname='rating', host='localhost')

Once R is connected to the database, we can view and select necessary table from the database.

ratings <- dbGetQuery(db, 'SELECT * FROM rating')

Let’s see some information about the table and what’s inside to check if data was imported correctly.

summary(ratings)
##  name_response       movie_name            rating     
##  Length:60          Length:60          Min.   :1.000  
##  Class :character   Class :character   1st Qu.:3.000  
##  Mode  :character   Mode  :character   Median :3.500  
##                                        Mean   :3.409  
##                                        3rd Qu.:4.000  
##                                        Max.   :5.000  
##                                        NA's   :16
head(ratings,n=5)
##   name_response               movie_name rating
## 1          Anna MINIONS. THE RISE OF GRU      3
## 2          Anna MINIONS. THE RISE OF GRU      3
## 3          Alex MINIONS. THE RISE OF GRU      2
## 4          Inna MINIONS. THE RISE OF GRU      4
## 5           Dan MINIONS. THE RISE OF GRU      3

3. Data overview

From the data, we can gather some information about or movies and respondents.

Let’s check who didn’t watch watch which movie and didn’t provide rating as well as which movie has the least number of views. As we see, Lucy was the one who didn’t see almost all the movies. ORPHAN: FIRST KILL was the most unseen movie, Alex watched all six movies. SQL:

SELECT *
FROM rating
WHERE rating IS NULL;

R:

na_rating <- subset(ratings, is.na(ratings$rating))
na_rating 
##    name_response                   movie_name rating
## 6           Lucy     MINIONS. THE RISE OF GRU     NA
## 9           Inna      JURASSIC WORLD DOMINION     NA
## 12          Anna ROGUE ONE: A STAR WARS STORY     NA
## 16          Lucy ROGUE ONE: A STAR WARS STORY     NA
## 22          Anna           ORPHAN: FIRST KILL     NA
## 24          Inna           ORPHAN: FIRST KILL     NA
## 26          Lucy           ORPHAN: FIRST KILL     NA
## 31          Lucy                        ELVIS     NA
## 35          Lucy     MINIONS. THE RISE OF GRU     NA
## 38          Inna      JURASSIC WORLD DOMINION     NA
## 41          Anna ROGUE ONE: A STAR WARS STORY     NA
## 45          Lucy ROGUE ONE: A STAR WARS STORY     NA
## 51          Anna           ORPHAN: FIRST KILL     NA
## 53          Inna           ORPHAN: FIRST KILL     NA
## 55          Lucy           ORPHAN: FIRST KILL     NA
## 60          Lucy                        ELVIS     NA

Find average rating for each movie, put in descending order to see the best rated movie. We also need to ignore NULL values to make R calculate the average. It looks like SPIDER-MAN: NO WAY HOME is the best movie to watch at the current moment. SQL:

SELECT movie_name, AVG(rating) AS rating_avg
FROM rating 
GROUP BY movie_name
ORDER BY rating DESC;

R:

ratings %>%
  group_by(movie_name) %>%
  summarise_at(vars(rating), list(avg_rating = mean), na.rm=TRUE)  %>%
  arrange(desc(avg_rating))
## # A tibble: 6 x 2
##   movie_name                   avg_rating
##   <chr>                             <dbl>
## 1 SPIDER-MAN: NO WAY HOME             4.6
## 2 ELVIS                               3.5
## 3 JURASSIC WORLD DOMINION             3.5
## 4 MINIONS. THE RISE OF GRU            3  
## 5 ROGUE ONE: A STAR WARS STORY        3  
## 6 ORPHAN: FIRST KILL                  1.5

Disconnect from database:

on.exit(dbDisconnect(db))

Conclusion

Using above steps, we have connected to the SQL database and imported it to R data frame. Using both, SQL and R, we did some research on the data: the best rated movie is Spider-Man: No Way Home, don’t watch Orphan: First Kill, Lucy doesn’t like to watch movies, Alex is in trend and watched all latest films.