loading packages

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(RMySQL)
## Warning: package 'RMySQL' was built under R version 4.4.1
## Loading required package: DBI
library(DBI)
library(knitr)
library(rstudioapi)

Movie Ratings

Part 1: Build Table • Choose six recent popular movies. • Ask at least five people that you know (friends, family, classmates, imaginary friends if necessary) to rate each of these movies that they have seen on a scale of 1 to 5.

In the data collected 6 popular movies from 2024 were rated from scale of 1-5, the movies were “Despicable Me 4, ‘’Twisters’‘,’‘Dead Pool’‘,’‘Inside Out’‘,’‘BeetleJuice BeetleJuice’‘, and’‘Borderlands’’. The participants in my data collection were my friends and family, Dejay, Johana, Nanie, Nathan, and Stephanie. The data was stored into an SQL database.

Part 2: Store data in SQL database • Take the results (observations) and store them in the class MySQL database: - Server name: cunydata607sql.mysql.database.azure.com - Username / password: will be given to you in an email Note: it is good practice to change your password. To do so, use this SQL command: SET PASSWORD = ‘’;

A Data table was created on SQL with the ratings collected for six movies, from 5 individuals

Part 3: Transfer data from SQL database to R dataframe • Load the information from the SQL database into an R dataframe.

##Loading data from SQL to R

mydb = dbConnect(MySQL(), user='andreina.arias80', password=rstudioapi::askForPassword(prompt = 'Password: '), dbname='andreina.arias80', host='cunydata607sql.mysql.database.azure.com')

Accessing tables from My database in SQL

dbListTables(mydb)
## [1] "movie_ratings"

#Loading and display data

Ratings=dbSendQuery(mydb,"select * from movie_ratings")
Rating_Of_Movies<- fetch(Ratings)
print(Rating_Of_Movies)
##   idmovie_ratings                  Movies Nathan_ratings Nanie_ratings
## 1               1         Despicable Me 4              3             4
## 2               2                Twisters              4             3
## 3               3               Dead Pool              4             3
## 4               4              Inside Out              2             3
## 5               5 BeetleJuice BeetleJuice              1             2
## 6               6             Borderlands              5             3
##   Stephanie_ratings Dejay_ratings Johana_ratings
## 1                 5             3              4
## 2                 3             2              2
## 3                 3             3              2
## 4                 2             3              3
## 5                 2             2              1
## 6                NA             1              1
glimpse(Rating_Of_Movies)
## Rows: 6
## Columns: 7
## $ idmovie_ratings   <int> 1, 2, 3, 4, 5, 6
## $ Movies            <chr> "Despicable Me 4", "Twisters", "Dead Pool", "Inside …
## $ Nathan_ratings    <int> 3, 4, 4, 2, 1, 5
## $ Nanie_ratings     <int> 4, 3, 3, 3, 2, 3
## $ Stephanie_ratings <int> 5, 3, 3, 2, 2, NA
## $ Dejay_ratings     <int> 3, 2, 3, 3, 2, 1
## $ Johana_ratings    <int> 4, 2, 2, 3, 1, 1

Part 4: Missing data strategy

• Implement an approach to missing data • Explain why you decided to take the chosen approach Note: consider that later in the course you will revisit this information you have collected and will use it to implement a Recommender.

For missing data in SQL uses ‘NULL’, while R uses ‘NA’ as the value. Below I created a data frame with the calculated average of the ratings and added the new values as a column on the data frame on R only, to not affect the data in SQL. It is seen that the average was still calculated for the movie “Borderlands” even with a value of ‘NA’, ‘NA’ was ignored. I would rather have the value as ‘NA’ then to place a zero as it would cause confusion and a false average, it is better to remove the value completely since the rating was not zero instead the value is missing since Stephanie didn’t watch the movie.

Rating_Average<- Rating_Of_Movies |>
  dplyr:::mutate.data.frame(avg_rating=rowMeans(Rating_Of_Movies[, c('Nathan_ratings', 'Nanie_ratings','Dejay_ratings','Stephanie_ratings','Johana_ratings')], na.rm=TRUE))
print(Rating_Average)
##   idmovie_ratings                  Movies Nathan_ratings Nanie_ratings
## 1               1         Despicable Me 4              3             4
## 2               2                Twisters              4             3
## 3               3               Dead Pool              4             3
## 4               4              Inside Out              2             3
## 5               5 BeetleJuice BeetleJuice              1             2
## 6               6             Borderlands              5             3
##   Stephanie_ratings Dejay_ratings Johana_ratings avg_rating
## 1                 5             3              4        3.8
## 2                 3             2              2        2.8
## 3                 3             3              2        3.0
## 4                 2             3              3        2.6
## 5                 2             2              1        1.6
## 6                NA             1              1        2.5