Part 1: Build Table
• Choose six recent popular movies. • Ask at least five people that you know (friends, family, classmates, imaginary friends if necessary) to rate each of these movies that they have seen on a scale of 1 to 5.
Part 2: Store data in SQL database
• Take the results (observations) and store them in the class MySQL database:
Note: it is good practice to change your password. To do so, use this
SQL command: SET PASSWORD = ‘
Part 3: Transfer data from SQL database to R dataframe
• Load the information from the SQL database into an R dataframe. Part 4: Missing data strategy
• Implement an approach to missing data
• Explain why you decided to take the chosen approach
Note: consider that later in the course you will revisit this information you have collected and will use it to implement a Recommender.
A table has been created and stored in the cunydata607sql.mysql.database.azure.com called “popular_movie_poll”.
#load the data
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(RMySQL)
## Warning: package 'RMySQL' was built under R version 4.2.3
## Loading required package: DBI
mysqlconnection = dbConnect(RMySQL::MySQL(),
dbname= 'natalie.kalukeerthie02',
host= 'cunydata607sql.mysql.database.azure.com',
port=3306,
user='natalie.kalukeerthie02',
password='natalie.kalukeerthie02')
dbListTables(mysqlconnection)
## [1] "popular_movie_poll"
result = dbSendQuery(mysqlconnection, "select * from popular_movie_poll")
popular_movie_poll <- fetch(result)
print(popular_movie_poll)
## RaterID RaterName Barbie Wonka Aquaman Oppenheimer Saltburn Wish
## 1 1 Natalie 4 0 3 4 0 0
## 2 2 Seth 5 2 0 3 0 2
## 3 3 Priya 0 0 0 0 4 4
## 4 4 Juls 4 3 4 5 0 0
## 5 5 Mell 0 2 3 0 0 0
## 6 6 Kojo 0 3 4 5 5 0
## 7 7 Morrigan 4 3 4 4 0 2
For missing data in this exercise, I used the “0” as the NULL when a rater did not watch the movie, this is because ratings were on a scale of 1 to 5, so zero was not in use, and because in SQL, I could not use “N/A” as the datatype of the ratings was INT. For calculations I would omit zeros when necessary and in client reports that require views of the data I might just add a caveat that 0 = has not seen and that the scaling is 1 to 5 to avoid confusion. I would rather not use a blank as there is a potential to cause an issue if using the data for calculations.