Assignment Requirements

DATA607 Week 2 assignment:

  1. Choose six recent popular movies. Ask at least five people that you know (friends, family, classmates, imaginary friends if necessary) to rate each of these movies that they have seen on a scale of 1 to 5.
  2. Take the results (observations) and store them in a SQL database of your choosing. Load the information from the SQL database into an R dataframe.
  3. This assignment does NOT need to be 100% reproducible. You can (and should) blank out your SQL password if your solution requires it; otherwise, full credit requires that your code is “reproducible,” with the assumption that I have the same database server and R software.
  4. Handling missing data is a foundational skill when working with SQL or R. To receive full credit, you should demonstrate a reasonable approach for handling missing data. After all, how likely is it that all five of your friends have seen all six movies?
  5. You’re encouraged to optionally find other ways to make your solution better. For example, consider incorporating one or more of the following suggestions into your solution:
    • Use survey software to gather the information.
    • Are you able to use a password without having to share the password with people who are viewing your code? There are a lot of interesting approaches that you can uncover with a little bit of research.
    • While it’s acceptable to create a single SQL table, can you create a normalized set of tables that corresponds to the relationship between your movie viewing friends and the movies being rated?
    • Is there any benefit in standardizing ratings? How might you approach this?

You may work in a small group on this assignment. If you work in a group, each group member should indicate who they worked with, and all group members should individually submit their week 2 assignment.

(Optional) Reading related to this assignment:

Overview

IMDB has a chart for there "Top Box Office (US)" movies. The rankings for the weekend of February 5-7, 2021 were as follows:

Top 5 IMBD.com

The created survey reflected ratings for the movies noted on IMDB.

Data Dictionary

The following lists the variables related to the .csv files and databases created using the SQL scripts on github.com/gcampos100/.CSV & SQL files :

movieRating Database
Attributes Description
SurveyID Unique ID of Survey taken
MovieID Unique ID of Movie (Foreign key to movieName Database)
Rating Rating from 1-5
Details Definition of Rating as per survey descriptions
movieName Database
Attributes Description
MovieID Unique ID of Movie (Primary key to movieName Database)
Name Movie Name

Load Data

NULL Values

Count

A count of the amount of NULL values can be done directly from the database using an SQL query or the obtained by subsetting the loaded data frame as show below. NOTE: Once again, NULL values are converted to NA using the dataframe, but will display as NULL when using dbGetQuery with not importing

dbGetQuery(con,
           "SELECT surveyID,name AS movie, rating, details
                              FROM movieRatings
                              JOIN movieName ON movieRatings.movieID = movieName.movieID
                              WHERE rating IS NULL;")
subset(movieRatingsQuery,is.na(movieRatingsQuery$rating))
surveyID movie rating details
2 The Little Things NA Have not seen this movie
2 The Marksman NA Have not seen this movie
6 The Marksman NA Have not seen this movie
2 Wonder Woman 1984 NA Have not seen this movie
5 Wonder Woman 1984 NA Have not seen this movie
6 Wonder Woman 1984 NA Have not seen this movie
6 Monster Hunter NA Have not seen this movie
1 News of the World NA Have not seen this movie
2 News of the World NA Have not seen this movie
4 News of the World NA Have not seen this movie
Note:
The count for observations on both methods is 10, indicating no data loss for either method

Modify data

The loaded data frame movieRatings obtained with a join query of movieRating and movieName databases, accounts for all data include 10 NULL values. In order to aggregate this, we again can chose to eliminate the NULL values from the SQL query used to import or from the already loaded data frame as shown below

dbGetQuery(con,
             "SELECT name AS movie, rating, details, count(rating)
              FROM movieRatings
              JOIN movieName ON movieRatings.movieID = movieName.movieID
              where rating IS NOT NULL 
              group by movie, rating, details 
              ORDER by movie ASC, rating DESC;"
            )
subset(movieRatingsQuery,!is.na(movieRatingsQuery$rating))
movie rating details count(rating)
Monster Hunter 5 Loved It 1
Monster Hunter 2 Didn’t hate it 1
Monster Hunter 1 Hated it 3
News of the World 4 Liked it 1
News of the World 3 Neutral 1
News of the World 1 Hated it 1
The Croods: A New Age 5 Loved It 1
The Croods: A New Age 4 Liked it 1
The Croods: A New Age 3 Neutral 1
The Croods: A New Age 2 Didn’t hate it 1
The Croods: A New Age 1 Hated it 2
The Little Things 5 Loved It 3
The Little Things 3 Neutral 1
The Little Things 1 Hated it 1
The Marksman 3 Neutral 3
The Marksman 1 Hated it 1
Wonder Woman 1984 4 Liked it 1
Wonder Woman 1984 2 Didn’t hate it 1
Wonder Woman 1984 1 Hated it 1
* The Total sum of our count is 26, accounting for 26 total observations
surveyID movie rating details
1 The Little Things 5 Loved It
3 The Little Things 1 Hated it
4 The Little Things 5 Loved It
5 The Little Things 3 Neutral
6 The Little Things 5 Loved It
1 The Croods: A New Age 2 Didn’t hate it
2 The Croods: A New Age 5 Loved It
3 The Croods: A New Age 1 Hated it
4 The Croods: A New Age 4 Liked it
5 The Croods: A New Age 3 Neutral
6 The Croods: A New Age 1 Hated it
1 The Marksman 3 Neutral
3 The Marksman 1 Hated it
4 The Marksman 3 Neutral
5 The Marksman 3 Neutral
1 Wonder Woman 1984 4 Liked it
3 Wonder Woman 1984 1 Hated it
4 Wonder Woman 1984 2 Didn’t hate it
1 Monster Hunter 5 Loved It
2 Monster Hunter 1 Hated it
3 Monster Hunter 1 Hated it
4 Monster Hunter 1 Hated it
5 Monster Hunter 2 Didn’t hate it
3 News of the World 1 Hated it
5 News of the World 3 Neutral
6 News of the World 4 Liked it

Conclusions

## # A tibble: 6 x 2
##   movie                     MeanRating
## * <chr>                          <dbl>
## 1 "Monster Hunter\r"              2   
## 2 "News of the World"             2.67
## 3 "The Croods: A New Age\r"       2.67
## 4 "The Little Things\r"           3.8 
## 5 "The Marksman\r"                2.5 
## 6 "Wonder Woman 1984\r"           2.33

Deciding on how to import the data can greatly affect what visuals are used to represent them. Using the charts with dbQuery() function can help with small sets like this, when few options are present. I could have concluded that Little Things was most liked in that format and even if hundreds of surveys were to come in, I feel the chart is sufficient to digest and draw conclusions. Importing fully into a data frame however, allows for more manipulation, allowing me to suggest with full confidence Little Things as it’s average rating is well above the rest.