Assignment – R and SQL

Part 1: Build Table

• Choose six recent popular movies. • Ask at least five people that you know (friends, family, classmates, imaginary friends if necessary) to rate each of these movies that they have seen on a scale of 1 to 5.

Part 2: Store data in SQL database

• Take the results (observations) and store them in the class MySQL database:

Note: it is good practice to change your password. To do so, use this SQL command: SET PASSWORD = ‘’;

Part 3: Transfer data from SQL database to R dataframe

• Load the information from the SQL database into an R dataframe. Part 4: Missing data strategy

• Implement an approach to missing data

• Explain why you decided to take the chosen approach

Note: consider that later in the course you will revisit this information you have collected and will use it to implement a Recommender.

Build Table and Store Data in SQL Database

A table has been created and stored in the cunydata607sql.mysql.database.azure.com called “popular_movie_poll”.

Transfer data from SQL database to R dataframe

#load the data
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(RMySQL)
## Warning: package 'RMySQL' was built under R version 4.2.3
## Loading required package: DBI
mysqlconnection = dbConnect(RMySQL::MySQL(),
                            dbname= 'natalie.kalukeerthie02',
                            host= 'cunydata607sql.mysql.database.azure.com',
                            port=3306,
                            user='natalie.kalukeerthie02',
                            password='natalie.kalukeerthie02')
                            
dbListTables(mysqlconnection)
## [1] "popular_movie_poll"
result = dbSendQuery(mysqlconnection, "select * from popular_movie_poll")

popular_movie_poll <- fetch(result)

print(popular_movie_poll)
##   RaterID RaterName Barbie Wonka Aquaman Oppenheimer Saltburn Wish
## 1       1   Natalie      4     0       3           4        0    0
## 2       2      Seth      5     2       0           3        0    2
## 3       3     Priya      0     0       0           0        4    4
## 4       4      Juls      4     3       4           5        0    0
## 5       5      Mell      0     2       3           0        0    0
## 6       6      Kojo      0     3       4           5        5    0
## 7       7  Morrigan      4     3       4           4        0    2

Missing Data Strategy

For missing data in this exercise, I used the “0” as the NULL when a rater did not watch the movie, this is because ratings were on a scale of 1 to 5, so zero was not in use, and because in SQL, I could not use “N/A” as the datatype of the ratings was INT. For calculations I would omit zeros when necessary and in client reports that require views of the data I might just add a caveat that 0 = has not seen and that the scaling is 1 to 5 to avoid confusion. I would rather not use a blank as there is a potential to cause an issue if using the data for calculations.