Homework 2

Initial Setup

I selected 6 recent popular movies and configured a Google form to capture people responses. The form is available at: https://docs.google.com/forms/d/e/1FAIpQLSc7RBE8IycRWMCkxoVtQ86lcrRu2H-RJDahN24TMkPFOg4zdw/viewform

I also setup a Google Sheet document which would receive the reponses.

The initial Idea was to write a Google App Script that would take the data from the Google Sheet and populate an AWS RDS Database. To that effect, I setup a script following this tutorial:

https://www.linkedin.com/pulse/two-way-synchronization-between-google-spreadsheet-aws-deshpande/

I also configured a free tier AWS RDS Database and whitelisted the ip addresses suggested in the tutorial.

Something did not work as I donโ€™t have write privileges into the RDS DB which prevented me from using my initial setup.

I had to revert to a backup configuration with a local MySQL Installation.

I could not populate the local table using LOAD DATA INFILE because MySql was complaining that security settings prevented it from running.

So, I had to settled to 1. Converting the Google Sheet Document into a csv file 2. Uploading the csv file to github 3. Populating the table by reading the csv into R and connecting to the local Db

knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)
## -- Attaching packages ------------------------------------------------------------ tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.0
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts --------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
theURL <- "https://raw.githubusercontent.com/georg4re/DS607/master/data/Movie%20Ratings%20(Responses).csv"
ratings <- read.csv(file=theURL, fileEncoding="UTF-8-BOM")

library(RMariaDB)

# The connection method below uses a password stored in a settings file.

# R needs a full path to find the settings file.
rmariadb.settingsfile<-"C:\\Users\\georg\\Documents\\George\\Data Science MS\\DBConnection.cnf"

rmariadb.db<-"d607-hw2"
moviesDb<-dbConnect(RMariaDB::MariaDB(),default.file=rmariadb.settingsfile,group=rmariadb.db)

#Clear previous data from the table
query<-paste("TRUNCATE movie_ratings")
rs = dbSendQuery(moviesDb,query)
#Store the movies rating data in the sql table 
dbWriteTable(moviesDb, value = ratings, row.names = FALSE, name = "movie_ratings", append = TRUE )
## Warning in result_create(conn@ptr, statement, is_statement): Cancelling previous
## query

Getting the data from MySQL

Once the data is populated into the DB, I read it by using a SQL Command:

query<-paste("SELECT * from movie_ratings")
rs = dbSendQuery(moviesDb, query)
dbRows<-dbFetch(rs)
dbRows
##   id         timestamp star_wars ford_ferrari little_women joker endgame
## 1  1 9/5/2020 18:20:03         4            3            4     5       5
## 2  2 9/5/2020 18:21:14         4            3            4     5       5
## 3  3 9/5/2020 18:22:01         3            3            5     3       5
## 4  4 9/5/2020 18:22:16        NA           NA           NA    NA       5
## 5  5 9/5/2020 18:23:56         5            4            4     5       5
## 6  6 9/5/2020 18:24:16         5            4            3     5       5
## 7  7 9/5/2020 18:24:48        NA            5            5     4       3
## 8  8 9/5/2020 18:34:04         1            3            1     5       5
## 9  9 9/5/2020 18:39:44         5            3            4     5       5
##   when_they
## 1         3
## 2         3
## 3         2
## 4        NA
## 5         5
## 6         4
## 7         5
## 8         4
## 9         5

Data Analysis

I will take the average rating for each movie and look for records where no answers were provided:

summary(dbRows)
##        id     timestamp           star_wars      ford_ferrari  little_women 
##  Min.   :1   Length:9           Min.   :1.000   Min.   :3.0   Min.   :1.00  
##  1st Qu.:3   Class :character   1st Qu.:3.500   1st Qu.:3.0   1st Qu.:3.75  
##  Median :5   Mode  :character   Median :4.000   Median :3.0   Median :4.00  
##  Mean   :5                      Mean   :3.857   Mean   :3.5   Mean   :3.75  
##  3rd Qu.:7                      3rd Qu.:5.000   3rd Qu.:4.0   3rd Qu.:4.25  
##  Max.   :9                      Max.   :5.000   Max.   :5.0   Max.   :5.00  
##                                 NA's   :2       NA's   :1     NA's   :1     
##      joker          endgame        when_they    
##  Min.   :3.000   Min.   :3.000   Min.   :2.000  
##  1st Qu.:4.750   1st Qu.:5.000   1st Qu.:3.000  
##  Median :5.000   Median :5.000   Median :4.000  
##  Mean   :4.625   Mean   :4.778   Mean   :3.875  
##  3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000  
##  NA's   :1                       NA's   :1

Based on the mean values, It would appear Avengers: Endgame was the preferred movie of my respondents.

new_data <- dbRows %>% filter_all(any_vars(is.na(.)))
new_data
##   id         timestamp star_wars ford_ferrari little_women joker endgame
## 1  4 9/5/2020 18:22:16        NA           NA           NA    NA       5
## 2  7 9/5/2020 18:24:48        NA            5            5     4       3
##   when_they
## 1        NA
## 2         5

Suggesting movies for the people that have not watched them

Based on the data, one person responded they only watched Endgame, and another person did not watch Star Wars, the Rise of Skywalker. One way to suggest a movie for the person who only watched endgame would be to find others that watched and enjoyed endgame as much as they did, and suggest the best movie for them based on the others ratings

dbRows
##   id         timestamp star_wars ford_ferrari little_women joker endgame
## 1  1 9/5/2020 18:20:03         4            3            4     5       5
## 2  2 9/5/2020 18:21:14         4            3            4     5       5
## 3  3 9/5/2020 18:22:01         3            3            5     3       5
## 4  4 9/5/2020 18:22:16        NA           NA           NA    NA       5
## 5  5 9/5/2020 18:23:56         5            4            4     5       5
## 6  6 9/5/2020 18:24:16         5            4            3     5       5
## 7  7 9/5/2020 18:24:48        NA            5            5     4       3
## 8  8 9/5/2020 18:34:04         1            3            1     5       5
## 9  9 9/5/2020 18:39:44         5            3            4     5       5
##   when_they
## 1         3
## 2         3
## 3         2
## 4        NA
## 5         5
## 6         4
## 7         5
## 8         4
## 9         5
enjoyedEG <- dbRows[dbRows$endgame == 5,]
enjoyedEG %>%
  summarize(star_wars = mean(star_wars, na.rm = TRUE), 
            ford_ferrari = mean(ford_ferrari, na.rm = TRUE),
            little_women = mean(little_women, na.rm = TRUE), 
            joker = mean(joker, na.rm = TRUE),
            when_they = mean(when_they, na.rm = TRUE))
##   star_wars ford_ferrari little_women    joker when_they
## 1  3.857143     3.285714     3.571429 4.714286  3.714286

Based on this findings, it would seem that the movie we should recommend to this person is Joker, as they would probably enjoy it.

Would The person that did not watch Star Wars enjoy it?

In order to answer that question we would look for others that share the same ratings and assume they would like the same movies.

tosuggest <- filter(dbRows, little_women == 5 )
tosuggest %>%
  summarize(star_wars = mean(star_wars, na.rm = TRUE))
##   star_wars
## 1         3

Based on these results, we can determine this viewer will probably not enjoy Star Wars too much.

Conclusions

We can make recommendations by finding other viewers who rated the movies as high as another viewer and probably determine if the would like a certain movie.

I would have loved to be able to get the whole initial workflow functioning and I will keep at it to make it work.

# Disconnect to clean up the connection to the database.
dbDisconnect(moviesDb)
## Warning in connection_release(conn@ptr): There is a result object still in use.
## The connection will be automatically released when it is closed