Connect to a MySQL Database

#libraries
library(RMySQL)
## Loading required package: DBI
library(RMariaDB)
library(DBI)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#database connection 
con <- dbConnect(RMySQL::MySQL(),
                 dbname = "leslie.tavarez85",
                 host = "cunydata607sql.mysql.database.azure.com",
                 user = "leslie.tavarez85",
                 password = "a2c0b934dba8a1ef")


#query data from pupularmovies table
query <- "SELECT * FROM `leslie.tavarez85`.`pupularmovies`;"
pupularmovies <- dbGetQuery(con, query)
## Warning in dbSendQuery(conn, statement, ...): Decimal MySQL column 1 imported
## as numeric
## Warning in dbSendQuery(conn, statement, ...): Decimal MySQL column 2 imported
## as numeric
## Warning in dbSendQuery(conn, statement, ...): Decimal MySQL column 3 imported
## as numeric
## Warning in dbSendQuery(conn, statement, ...): Decimal MySQL column 4 imported
## as numeric
## Warning in dbSendQuery(conn, statement, ...): Decimal MySQL column 5 imported
## as numeric
## Warning in dbSendQuery(conn, statement, ...): Decimal MySQL column 6 imported
## as numeric
## Warning in dbSendQuery(conn, statement, ...): Decimal MySQL column 7 imported
## as numeric
#show the imported dataframe from MySQL
print(pupularmovies)
##                  movie lyra perla julie garlis martin taisa fernando
## 1               Barbie  3.0     5     4      1    4.0     5        1
## 2    BadBoys:RideOrDie  2.0     4    NA      4    3.0     3        5
## 3 StarWars:TheLastJedi  2.0     5     4      3     NA     4        3
## 4   AQuietPlace:DayOne   NA     5    NA      2    4.5     3        2
## 5        FordvsFerrari  3.0     5     4      5    5.0    NA        5
## 6             DunePtII  4.5     3     2      5     NA     2       NA
#close connection when done 
dbDisconnect(con)
## [1] TRUE

#Part 4: Missing data strategy There are a few ways to handle missing data but I do not want to imput a zero for NA values, replace it with the mean or make up information. Instead, I will just use the information that I have and analysis that data instead. I will do this by creating new tables for each movie omitting the NA/null values.

#Let's use a visual to show our missing data 
library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
md.pattern(pupularmovies)

##   movie perla garlis lyra taisa fernando julie martin  
## 1     1     1      1    1     1        1     1      1 0
## 1     1     1      1    1     1        1     1      0 1
## 1     1     1      1    1     1        1     0      1 1
## 1     1     1      1    1     1        0     1      0 2
## 1     1     1      1    1     0        1     1      1 1
## 1     1     1      1    0     1        1     0      1 2
##       0     0      0    1     1        1     2      2 7

#First, I need to transpose my data (flip rows and columns)

movieRatings <- data.frame(t(pupularmovies[-1]))
colnames(movieRatings) <- pupularmovies[, 1]

print(movieRatings)
##          Barbie BadBoys:RideOrDie StarWars:TheLastJedi AQuietPlace:DayOne
## lyra          3                 2                    2                 NA
## perla         5                 4                    5                5.0
## julie         4                NA                    4                 NA
## garlis        1                 4                    3                2.0
## martin        4                 3                   NA                4.5
## taisa         5                 3                    4                3.0
## fernando      1                 5                    3                2.0
##          FordvsFerrari DunePtII
## lyra                 3      4.5
## perla                5      3.0
## julie                4      2.0
## garlis               5      5.0
## martin               5       NA
## taisa               NA      2.0
## fernando             5       NA

#Find mean of each movie omitting null values

mean(movieRatings$Barbie, na.rm = TRUE)
## [1] 3.285714
mean(movieRatings$`BadBoys:RideOrDie`, na.rm = TRUE)
## [1] 3.5
mean(movieRatings$`StarWars:TheLastJedi`, na.rm = TRUE)
## [1] 3.5
mean(movieRatings$`AQuietPlace:DayOne`, na.rm = TRUE)
## [1] 3.3
mean(movieRatings$FordvsFerrari, na.rm = TRUE)
## [1] 4.5
mean(movieRatings$DunePtII, na.rm = TRUE)
## [1] 3.3

Ford vs. Ferrari had the highest average rating of 4.5/5. Bad Boys and Star Wars are tied for second best with the same average rating of 3.5/5. A Quiet Place scored lower than I thought it would (I watched it in theaters and loved it). It averaged at a 3.3/5.

In conclusion, I enjoyed using data I was able to obtain from students and fellow teachers. I did not use a software to gather information since I wanted it to be more interactive but I normally have my students use smartsheet, google or survey monkey to make surveys!

Standariziation may increase the reliabilty and the consistency of surveys. I think it makes it easier to compare results. But, as a teacher who has taken way too many surveys, sometimes I think it may result in response bias and it is difficult to capture complex information. Respondents may not be able to explain their thoughts or experience with a simple rating scale. While doing this lab, I was able to hear their perspective on each movie and their reason why they voted the way they did. If I were to standardize the rating for this lab specifically, I can have people rate from 1-5 to the nearest integer instead of accepting decimal values.