What did my friends think of several films I saw recently?

all images from imbd.com

all images from imbd.com


Introduction & Data Collection

This assignment demonstrates the process of loading information from a SQL database into an R dataframe. I chose six movies and asked friends on facebook to rate the movies on a scale of 1 to 5 (where 1=Strongly Dislikes, …, 3=Indifference, …, & 5=Strongly Liked the movies). The movies were recent & relatively popular movies that I had seen in the theater. It was difficult for me to chose movies, because I have rather eccentric taste and wanted to suggest titles that at least several of my friends had hopefully seen. There was one exception, however, I had not seen BlacKkKlansman. I just got too busy and never made it to the theater. Perhaps from surveying my friends on thier thoughts of the movie, I can make an estimate of how I would rate BlacKkKlansman?…..

The movies I chose were:

  1. Black Panther (2018)
  2. Parasite (2019)
  3. Us (2019)
  4. BlacKkKlansman (2018)
  5. Get Out (2017)
  6. Star Wars: The Last Jedi (2017)

34 responses were received, however, of these, only 22 where considered for further analysis because ratings were given for at least half (>=3) of the movies.

With MySQL Workbench, the data was stored in a SQL database, ‘movie_ratings’, as a table, ‘FB_friend_ratings’.
The SQL script to load the data can be found here: DATA607: load_FB_friend_ratings.sql

Here we will look at two methods to load the data into RStudio:

  1. Access the SQL data directly from R
  2. Access the data through an intermediary .csv file

1. Accessing the SQL data directly from R

First we load the necessary R libraries.

Next, we establish a connection to a MySQL database. The block of R code below shows the syntax for the connection. However, the actual line that is executed (and that contains my password) is hidden.

## <MySQLConnection:0,0>

Great!, we have established a connection and can now list the tables in the database ‘movie_ratings’

## [1] "FB_friend_ratings"

Now to create an R dataframe from the SQL table. We will do this by using a SELECT statement to select all the data in the table. If we were accessing a very large database, this would be an eccellent point to filter the data using a SELECT statement with more constraints.

##      name Black_Panther Parasite Us BlacKkKlansman Get_Out Star_Wars_Last_Jedi
## 1    Adam             4       NA  4              4       5                   5
## 2    Alex             3        4  2              4       5                  NA
## 3     Ana             5       NA NA              5      NA                   4
## 4 Ayanthi            NA       NA NA              4       5                  NA
## 5     Ben             4       NA NA             NA       4                   3
## 6  Bonnie             4        5  4             NA       5                   4
## [1] "name"                "Black_Panther"       "Parasite"           
## [4] "Us"                  "BlacKkKlansman"      "Get_Out"            
## [7] "Star_Wars_Last_Jedi"
## [1] 22  7


We can see from the output of the above blocks, that the resulting dataframe has the expected features (columns) and dimensions.

2. Access the data through an intermediary .csv file

MySQL Workbench provides some very user friendly features to export a data table to a .csv file. Through directions to do so are given in this MySQLTutorial The following code walks through the process of accessing the same data from the ‘FB_friend_ratings’ table that has been previously exported as a .csv file and uploaded to githut

##      name Black_Panther Parasite Us BlacKkKlansman Get_Out Star_Wars_Last_Jedi
## 1    Adam             4       NA  4              4       5                   5
## 2    Alex             3        4  2              4       5                  NA
## 3     Ana             5       NA NA              5      NA                   4
## 4 Ayanthi            NA       NA NA              4       5                  NA
## 5     Ben             4       NA NA             NA       4                   3
## 6  Bonnie             4        5  4             NA       5                   4
## [1] "name"                "Black_Panther"       "Parasite"           
## [4] "Us"                  "BlacKkKlansman"      "Get_Out"            
## [7] "Star_Wars_Last_Jedi"
## [1] 22  7

We just generated two R dataframes using different methods; let’s test to see if they are equivalent. We will do this using the comparedf() function from the arsenal library

## Loading required package: arsenal
## 
## 
## Table: Summary of data.frames
## 
## version   arg           ncol   nrow
## --------  -----------  -----  -----
## x         gitRatings       7     22
## y         ratings_df       7     22
## 
## 
## 
## Table: Summary of overall comparison
## 
## statistic                                                      value
## ------------------------------------------------------------  ------
## Number of by-variables                                             0
## Number of non-by variables in common                               7
## Number of variables compared                                       6
## Number of variables in x but not y                                 0
## Number of variables in y but not x                                 0
## Number of variables compared with some values unequal              0
## Number of variables compared with all values equal                 6
## Number of observations in common                                  22
## Number of observations in x but not y                              0
## Number of observations in y but not x                              0
## Number of observations with some compared variables unequal        0
## Number of observations with all compared variables equal          22
## Number of values unequal                                           0
## 
## 
## 
## Table: Variables not shared
## 
## |                        |
## |:-----------------------|
## |No variables not shared |
## 
## 
## 
## Table: Other variables not compared
## 
## var.x    pos.x  class.x   var.y    pos.y  class.y   
## ------  ------  --------  ------  ------  ----------
## name         1  factor    name         1  character 
## 
## 
## 
## Table: Observations not shared
## 
## |                           |
## |:--------------------------|
## |No observations not shared |
## 
## 
## 
## Table: Differences detected by variable
## 
## var.x                 var.y                   n   NAs
## --------------------  --------------------  ---  ----
## Black_Panther         Black_Panther           0     0
## Parasite              Parasite                0     0
## Us                    Us                      0     0
## BlacKkKlansman        BlacKkKlansman          0     0
## Get_Out               Get_Out                 0     0
## Star_Wars_Last_Jedi   Star_Wars_Last_Jedi     0     0
## 
## 
## 
## Table: Differences detected
## 
## |                        |
## |:-----------------------|
## |No differences detected |
## 
## 
## 
## Table: Non-identical attributes
## 
## var.x   var.y   name   
## ------  ------  -------
## name    name    class  
## name    name    levels

The output above definitively shows that the two methods we used to load the SQL data to dataframes in RStudio yielded identical data structures.

Handling missing data

To avoid confusion, the rest of this demo will make use of ‘rating_df’ Use the head() function to preview the dataframe again

##      name Black_Panther Parasite Us BlacKkKlansman Get_Out Star_Wars_Last_Jedi
## 1    Adam             4       NA  4              4       5                   5
## 2    Alex             3        4  2              4       5                  NA
## 3     Ana             5       NA NA              5      NA                   4
## 4 Ayanthi            NA       NA NA              4       5                  NA
## 5     Ben             4       NA NA             NA       4                   3
## 6  Bonnie             4        5  4             NA       5                   4

From the output above, we can see that there are a lot of missing data entries given by ‘NA’. This indicates that a person gave no rating for the movie because they have not seen it. There are many ways to deal with missing data points. An excellent overview is given in this ‘Dealing with Missing Data using R’ article.

A simple way to deal with the missing data is imputation with the mean. Basically, any ‘NA’ in a given column is replaced by the column’s mean. We will execute this in the following code:

## $Black_Panther.means
## [1] 4
## 
## $Parasite.means
## [1] 4.818182
## 
## $Us.means
## [1] 4
## 
## $BlacKkKlansman.means
## [1] 4.272727
## 
## $Get_Out.means
## [1] 4.666667
## 
## $Star_Wars_Last_Jedi.means
## [1] 3.588235
##      name Black_Panther Parasite Us BlacKkKlansman  Get_Out Star_Wars_Last_Jedi
## 1    Adam             4 4.818182  4       4.000000 5.000000            5.000000
## 2    Alex             3 4.000000  2       4.000000 5.000000            3.588235
## 3     Ana             5 4.818182  4       5.000000 4.666667            4.000000
## 4 Ayanthi             4 4.818182  4       4.000000 5.000000            3.588235
## 5     Ben             4 4.818182  4       4.272727 4.000000            3.000000
## 6  Bonnie             4 5.000000  4       4.272727 5.000000            4.000000

If we compare the output for head( ratings_df ) with the output for head( ratings_NA2Mean ), we observe that the ‘NA’ values in ratings_df have been replaced with the column appropriate values that are held in the variable ratingsMeans.

This resulting dataframe, ‘ratings_NA2Mean’, gives simple ratings estimates for each missing value. For instance, I had not seen the movie ‘BlacKkKlansman’. The mean rating from my friends is 4.27, so perhaps I should get around to watching it.

Imputation of missing data values with an estimate such as the mean is a simple and straight-forward way to deal with ‘NA’ values. There are certainly many much more sophisticated modeling and prediction techniques that can be used deal with missing values (e.g. K Nearest Neighbor ). However, they are typically best when deployed on much larger datasets & are out of the scope for this assignment.