I selected 6 recent popular movies and configured a Google form to capture people responses. The form is available at: https://docs.google.com/forms/d/e/1FAIpQLSc7RBE8IycRWMCkxoVtQ86lcrRu2H-RJDahN24TMkPFOg4zdw/viewform
I also setup a Google Sheet document which would receive the reponses.
The initial Idea was to write a Google App Script that would take the data from the Google Sheet and populate an AWS RDS Database. To that effect, I setup a script following this tutorial:
https://www.linkedin.com/pulse/two-way-synchronization-between-google-spreadsheet-aws-deshpande/
I also configured a free tier AWS RDS Database and whitelisted the ip addresses suggested in the tutorial.
Something did not work as I donโt have write privileges into the RDS DB which prevented me from using my initial setup.
I had to revert to a backup configuration with a local MySQL Installation.
I could not populate the local table using LOAD DATA INFILE because MySql was complaining that security settings prevented it from running.
So, I had to settled to 1. Converting the Google Sheet Document into a csv file 2. Uploading the csv file to github 3. Populating the table by reading the csv into R and connecting to the local Db
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------ tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.0
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts --------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
theURL <- "https://raw.githubusercontent.com/georg4re/DS607/master/data/Movie%20Ratings%20(Responses).csv"
ratings <- read.csv(file=theURL, fileEncoding="UTF-8-BOM")
library(RMariaDB)
# The connection method below uses a password stored in a settings file.
# R needs a full path to find the settings file.
rmariadb.settingsfile<-"C:\\Users\\georg\\Documents\\George\\Data Science MS\\DBConnection.cnf"
rmariadb.db<-"d607-hw2"
moviesDb<-dbConnect(RMariaDB::MariaDB(),default.file=rmariadb.settingsfile,group=rmariadb.db)
#Clear previous data from the table
query<-paste("TRUNCATE movie_ratings")
rs = dbSendQuery(moviesDb,query)
#Store the movies rating data in the sql table
dbWriteTable(moviesDb, value = ratings, row.names = FALSE, name = "movie_ratings", append = TRUE )
## Warning in result_create(conn@ptr, statement, is_statement): Cancelling previous
## query
Once the data is populated into the DB, I read it by using a SQL Command:
query<-paste("SELECT * from movie_ratings")
rs = dbSendQuery(moviesDb, query)
dbRows<-dbFetch(rs)
dbRows
## id timestamp star_wars ford_ferrari little_women joker endgame
## 1 1 9/5/2020 18:20:03 4 3 4 5 5
## 2 2 9/5/2020 18:21:14 4 3 4 5 5
## 3 3 9/5/2020 18:22:01 3 3 5 3 5
## 4 4 9/5/2020 18:22:16 NA NA NA NA 5
## 5 5 9/5/2020 18:23:56 5 4 4 5 5
## 6 6 9/5/2020 18:24:16 5 4 3 5 5
## 7 7 9/5/2020 18:24:48 NA 5 5 4 3
## 8 8 9/5/2020 18:34:04 1 3 1 5 5
## 9 9 9/5/2020 18:39:44 5 3 4 5 5
## when_they
## 1 3
## 2 3
## 3 2
## 4 NA
## 5 5
## 6 4
## 7 5
## 8 4
## 9 5
I will take the average rating for each movie and look for records where no answers were provided:
summary(dbRows)
## id timestamp star_wars ford_ferrari little_women
## Min. :1 Length:9 Min. :1.000 Min. :3.0 Min. :1.00
## 1st Qu.:3 Class :character 1st Qu.:3.500 1st Qu.:3.0 1st Qu.:3.75
## Median :5 Mode :character Median :4.000 Median :3.0 Median :4.00
## Mean :5 Mean :3.857 Mean :3.5 Mean :3.75
## 3rd Qu.:7 3rd Qu.:5.000 3rd Qu.:4.0 3rd Qu.:4.25
## Max. :9 Max. :5.000 Max. :5.0 Max. :5.00
## NA's :2 NA's :1 NA's :1
## joker endgame when_they
## Min. :3.000 Min. :3.000 Min. :2.000
## 1st Qu.:4.750 1st Qu.:5.000 1st Qu.:3.000
## Median :5.000 Median :5.000 Median :4.000
## Mean :4.625 Mean :4.778 Mean :3.875
## 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000
## NA's :1 NA's :1
Based on the mean values, It would appear Avengers: Endgame was the preferred movie of my respondents.
new_data <- dbRows %>% filter_all(any_vars(is.na(.)))
new_data
## id timestamp star_wars ford_ferrari little_women joker endgame
## 1 4 9/5/2020 18:22:16 NA NA NA NA 5
## 2 7 9/5/2020 18:24:48 NA 5 5 4 3
## when_they
## 1 NA
## 2 5
Based on the data, one person responded they only watched Endgame, and another person did not watch Star Wars, the Rise of Skywalker. One way to suggest a movie for the person who only watched endgame would be to find others that watched and enjoyed endgame as much as they did, and suggest the best movie for them based on the others ratings
dbRows
## id timestamp star_wars ford_ferrari little_women joker endgame
## 1 1 9/5/2020 18:20:03 4 3 4 5 5
## 2 2 9/5/2020 18:21:14 4 3 4 5 5
## 3 3 9/5/2020 18:22:01 3 3 5 3 5
## 4 4 9/5/2020 18:22:16 NA NA NA NA 5
## 5 5 9/5/2020 18:23:56 5 4 4 5 5
## 6 6 9/5/2020 18:24:16 5 4 3 5 5
## 7 7 9/5/2020 18:24:48 NA 5 5 4 3
## 8 8 9/5/2020 18:34:04 1 3 1 5 5
## 9 9 9/5/2020 18:39:44 5 3 4 5 5
## when_they
## 1 3
## 2 3
## 3 2
## 4 NA
## 5 5
## 6 4
## 7 5
## 8 4
## 9 5
enjoyedEG <- dbRows[dbRows$endgame == 5,]
enjoyedEG %>%
summarize(star_wars = mean(star_wars, na.rm = TRUE),
ford_ferrari = mean(ford_ferrari, na.rm = TRUE),
little_women = mean(little_women, na.rm = TRUE),
joker = mean(joker, na.rm = TRUE),
when_they = mean(when_they, na.rm = TRUE))
## star_wars ford_ferrari little_women joker when_they
## 1 3.857143 3.285714 3.571429 4.714286 3.714286
Based on this findings, it would seem that the movie we should recommend to this person is Joker, as they would probably enjoy it.
In order to answer that question we would look for others that share the same ratings and assume they would like the same movies.
tosuggest <- filter(dbRows, little_women == 5 )
tosuggest %>%
summarize(star_wars = mean(star_wars, na.rm = TRUE))
## star_wars
## 1 3
Based on these results, we can determine this viewer will probably not enjoy Star Wars too much.
We can make recommendations by finding other viewers who rated the movies as high as another viewer and probably determine if the would like a certain movie.
I would have loved to be able to get the whole initial workflow functioning and I will keep at it to make it work.
# Disconnect to clean up the connection to the database.
dbDisconnect(moviesDb)
## Warning in connection_release(conn@ptr): There is a result object still in use.
## The connection will be automatically released when it is closed