Assignment Requirements

DATA607 Week 2 assignment:

Choose six recent popular movies. Ask at least five people that you know (friends, family, classmates, imaginary friends if necessary) to rate each of these movies that they have seen on a scale of 1 to 5.
Take the results (observations) and store them in a SQL database of your choosing. Load the information from the SQL database into an R dataframe.
This assignment does NOT need to be 100% reproducible. You can (and should) blank out your SQL password if your solution requires it; otherwise, full credit requires that your code is “reproducible,” with the assumption that I have the same database server and R software.
Handling missing data is a foundational skill when working with SQL or R. To receive full credit, you should demonstrate a reasonable approach for handling missing data. After all, how likely is it that all five of your friends have seen all six movies?
You’re encouraged to optionally find other ways to make your solution better. For example, consider incorporating one or more of the following suggestions into your solution:
- Use survey software to gather the information.
- Are you able to use a password without having to share the password with people who are viewing your code? There are a lot of interesting approaches that you can uncover with a little bit of research.
- While it’s acceptable to create a single SQL table, can you create a normalized set of tables that corresponds to the relationship between your movie viewing friends and the movies being rated?
- Is there any benefit in standardizing ratings? How might you approach this?

You may work in a small group on this assignment. If you work in a group, each group member should indicate who they worked with, and all group members should individually submit their week 2 assignment.

(Optional) Reading related to this assignment:

James Le, “The 4 Recommendation Engines That Can Predict Your Movie Tastes”, May 1, 2018. This a nice backgrounder on movie recommendation engines. We’ll learn more about recommender systems later in the course.
Steve Blank, “The Customer Development Process. 2 Minutes to See Why”, Jul 29, 2014. In this [<3 minute] YouTube video “lean startup” founder Steve Blank talks about the importance of getting out of the building to talk to customers. I’d encourage you to adopt this “builder mentality” in your own data science work whenever it’s practical, by collecting data yourself, whether it’s related to a “business experiment” or a “scientific experiment.”

Overview

IMDB has a chart for there "Top Box Office (US)" movies. The rankings for the weekend of February 5-7, 2021 were as follows:

Top 5 IMBD.com

The created survey reflected ratings for the movies noted on IMDB.

Data Dictionary

The following lists the variables related to the .csv files and databases created using the SQL scripts on github.com/gcampos100/.CSV & SQL files :

movieRating Database
Attributes	Description
SurveyID	Unique ID of Survey taken
MovieID	Unique ID of Movie (Foreign key to movieName Database)
Rating	Rating from 1-5
Details	Definition of Rating as per survey descriptions

movieName Database
Attributes	Description
MovieID	Unique ID of Movie (Primary key to movieName Database)
Name	Movie Name

Load Data

Data Set Title: 6 Recent Popular movies

Source Information:

Creator: Gabriel Campos
Personal Survey: surveymonkey.com NOTE: To avoid $25 monthly subscription with SurveyMonkey, I created .csv for my long-time imaginary friends’ survey results, to simulate what I imagine the data export would look like.
Github: github.com/gcampos100/ .CSV & SQL files

Connecting

Connection is established using the Open Database Connectivity (ODBC) configured on my local machine to connect to MySQL with a user that has read privileges on my specified table. The askForPassword function prompts the user to specify the DSN via a GUI. In this way the DSN name and connection string are both masked, while the database directly is protected via the Principle of Least Privilege.

DSN <- rstudioapi::askForPassword("DSN Name")

con <- dbConnect(odbc(),DSN)

## [1] "Tables available on this db are "

## [1] "moviename"    "movieratings" "tb"

Importing the Data

Using the dbGetQuery() function which is a part of the DBI package I was able to run a SQL query directly and import the results onto a dataframe for further use.

NOTE: Importing of NULL values in database, converts to NA in the dataframe.

movieRatingsQuery <- dbGetQuery(con,
                        "SELECT surveyID,name AS movie, rating, details
                        FROM movieRatings
                        JOIN movieName ON movieRatings.movieID = movieName.movieID;")

NULL Values

Count

A count of the amount of NULL values can be done directly from the database using an SQL query or the obtained by subsetting the loaded data frame as show below. NOTE: Once again, NULL values are converted to NA using the dataframe, but will display as NULL when using dbGetQuery with not importing

dbGetQuery(con,
           "SELECT surveyID,name AS movie, rating, details
                              FROM movieRatings
                              JOIN movieName ON movieRatings.movieID = movieName.movieID
                              WHERE rating IS NULL;")

subset(movieRatingsQuery,is.na(movieRatingsQuery$rating))

surveyID	movie	rating	details
2	The Little Things	NA	Have not seen this movie
2	The Marksman	NA	Have not seen this movie
6	The Marksman	NA	Have not seen this movie
2	Wonder Woman 1984	NA	Have not seen this movie
5	Wonder Woman 1984	NA	Have not seen this movie
6	Wonder Woman 1984	NA	Have not seen this movie
6	Monster Hunter	NA	Have not seen this movie
1	News of the World	NA	Have not seen this movie
2	News of the World	NA	Have not seen this movie
4	News of the World	NA	Have not seen this movie
Note:
The count for observations on both methods is 10, indicating no data loss for either method

Modify data

The loaded data frame movieRatings obtained with a join query of movieRating and movieName databases, accounts for all data include 10 NULL values. In order to aggregate this, we again can chose to eliminate the NULL values from the SQL query used to import or from the already loaded data frame as shown below

dbGetQuery(con,
             "SELECT name AS movie, rating, details, count(rating)
              FROM movieRatings
              JOIN movieName ON movieRatings.movieID = movieName.movieID
              where rating IS NOT NULL 
              group by movie, rating, details 
              ORDER by movie ASC, rating DESC;"
            )

subset(movieRatingsQuery,!is.na(movieRatingsQuery$rating))

movie	rating	details	count(rating)
Monster Hunter	5	Loved It	1
Monster Hunter	2	Didn’t hate it	1
Monster Hunter	1	Hated it	3
News of the World	4	Liked it	1
News of the World	3	Neutral	1
News of the World	1	Hated it	1
The Croods: A New Age	5	Loved It	1
The Croods: A New Age	4	Liked it	1
The Croods: A New Age	3	Neutral	1
The Croods: A New Age	2	Didn’t hate it	1
The Croods: A New Age	1	Hated it	2
The Little Things	5	Loved It	3
The Little Things	3	Neutral	1
The Little Things	1	Hated it	1
The Marksman	3	Neutral	3
The Marksman	1	Hated it	1
Wonder Woman 1984	4	Liked it	1
Wonder Woman 1984	2	Didn’t hate it	1
Wonder Woman 1984	1	Hated it	1
^* The Total sum of our count is 26, accounting for 26 total observations

surveyID	movie	rating	details
1	The Little Things	5	Loved It
3	The Little Things	1	Hated it
4	The Little Things	5	Loved It
5	The Little Things	3	Neutral
6	The Little Things	5	Loved It
1	The Croods: A New Age	2	Didn’t hate it
2	The Croods: A New Age	5	Loved It
3	The Croods: A New Age	1	Hated it
4	The Croods: A New Age	4	Liked it
5	The Croods: A New Age	3	Neutral
6	The Croods: A New Age	1	Hated it
1	The Marksman	3	Neutral
3	The Marksman	1	Hated it
4	The Marksman	3	Neutral
5	The Marksman	3	Neutral
1	Wonder Woman 1984	4	Liked it
3	Wonder Woman 1984	1	Hated it
4	Wonder Woman 1984	2	Didn’t hate it
1	Monster Hunter	5	Loved It
2	Monster Hunter	1	Hated it
3	Monster Hunter	1	Hated it
4	Monster Hunter	1	Hated it
5	Monster Hunter	2	Didn’t hate it
3	News of the World	1	Hated it
5	News of the World	3	Neutral
6	News of the World	4	Liked it

Conclusions

## # A tibble: 6 x 2
##   movie                     MeanRating
## * <chr>                          <dbl>
## 1 "Monster Hunter\r"              2   
## 2 "News of the World"             2.67
## 3 "The Croods: A New Age\r"       2.67
## 4 "The Little Things\r"           3.8 
## 5 "The Marksman\r"                2.5 
## 6 "Wonder Woman 1984\r"           2.33

Deciding on how to import the data can greatly affect what visuals are used to represent them. Using the charts with dbQuery() function can help with small sets like this, when few options are present. I could have concluded that Little Things was most liked in that format and even if hundreds of surveys were to come in, I feel the chart is sufficient to digest and draw conclusions. Importing fully into a data frame however, allows for more manipulation, allowing me to suggest with full confidence Little Things as it’s average rating is well above the rest.

DATA607 Week 2 Assignment - Loading Data into a Data Frame

Gabriel Campos

2/14/2021