#libraries
library(RMySQL)
## Loading required package: DBI
library(RMariaDB)
library(DBI)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#database connection
con <- dbConnect(RMySQL::MySQL(),
dbname = "leslie.tavarez85",
host = "cunydata607sql.mysql.database.azure.com",
user = "leslie.tavarez85",
password = "a2c0b934dba8a1ef")
#query data from pupularmovies table
query <- "SELECT * FROM `leslie.tavarez85`.`pupularmovies`;"
pupularmovies <- dbGetQuery(con, query)
## Warning in dbSendQuery(conn, statement, ...): Decimal MySQL column 1 imported
## as numeric
## Warning in dbSendQuery(conn, statement, ...): Decimal MySQL column 2 imported
## as numeric
## Warning in dbSendQuery(conn, statement, ...): Decimal MySQL column 3 imported
## as numeric
## Warning in dbSendQuery(conn, statement, ...): Decimal MySQL column 4 imported
## as numeric
## Warning in dbSendQuery(conn, statement, ...): Decimal MySQL column 5 imported
## as numeric
## Warning in dbSendQuery(conn, statement, ...): Decimal MySQL column 6 imported
## as numeric
## Warning in dbSendQuery(conn, statement, ...): Decimal MySQL column 7 imported
## as numeric
#show the imported dataframe from MySQL
print(pupularmovies)
## movie lyra perla julie garlis martin taisa fernando
## 1 Barbie 3.0 5 4 1 4.0 5 1
## 2 BadBoys:RideOrDie 2.0 4 NA 4 3.0 3 5
## 3 StarWars:TheLastJedi 2.0 5 4 3 NA 4 3
## 4 AQuietPlace:DayOne NA 5 NA 2 4.5 3 2
## 5 FordvsFerrari 3.0 5 4 5 5.0 NA 5
## 6 DunePtII 4.5 3 2 5 NA 2 NA
#close connection when done
dbDisconnect(con)
## [1] TRUE
#Part 4: Missing data strategy There are a few ways to handle missing data but I do not want to imput a zero for NA values, replace it with the mean or make up information. Instead, I will just use the information that I have and analysis that data instead. I will do this by creating new tables for each movie omitting the NA/null values.
#Let's use a visual to show our missing data
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
md.pattern(pupularmovies)
## movie perla garlis lyra taisa fernando julie martin
## 1 1 1 1 1 1 1 1 1 0
## 1 1 1 1 1 1 1 1 0 1
## 1 1 1 1 1 1 1 0 1 1
## 1 1 1 1 1 1 0 1 0 2
## 1 1 1 1 1 0 1 1 1 1
## 1 1 1 1 0 1 1 0 1 2
## 0 0 0 1 1 1 2 2 7
#First, I need to transpose my data (flip rows and columns)
movieRatings <- data.frame(t(pupularmovies[-1]))
colnames(movieRatings) <- pupularmovies[, 1]
print(movieRatings)
## Barbie BadBoys:RideOrDie StarWars:TheLastJedi AQuietPlace:DayOne
## lyra 3 2 2 NA
## perla 5 4 5 5.0
## julie 4 NA 4 NA
## garlis 1 4 3 2.0
## martin 4 3 NA 4.5
## taisa 5 3 4 3.0
## fernando 1 5 3 2.0
## FordvsFerrari DunePtII
## lyra 3 4.5
## perla 5 3.0
## julie 4 2.0
## garlis 5 5.0
## martin 5 NA
## taisa NA 2.0
## fernando 5 NA
#Find mean of each movie omitting null values
mean(movieRatings$Barbie, na.rm = TRUE)
## [1] 3.285714
mean(movieRatings$`BadBoys:RideOrDie`, na.rm = TRUE)
## [1] 3.5
mean(movieRatings$`StarWars:TheLastJedi`, na.rm = TRUE)
## [1] 3.5
mean(movieRatings$`AQuietPlace:DayOne`, na.rm = TRUE)
## [1] 3.3
mean(movieRatings$FordvsFerrari, na.rm = TRUE)
## [1] 4.5
mean(movieRatings$DunePtII, na.rm = TRUE)
## [1] 3.3
Ford vs. Ferrari had the highest average rating of 4.5/5. Bad Boys and Star Wars are tied for second best with the same average rating of 3.5/5. A Quiet Place scored lower than I thought it would (I watched it in theaters and loved it). It averaged at a 3.3/5.
In conclusion, I enjoyed using data I was able to obtain from students and fellow teachers. I did not use a software to gather information since I wanted it to be more interactive but I normally have my students use smartsheet, google or survey monkey to make surveys!
Standariziation may increase the reliabilty and the consistency of surveys. I think it makes it easier to compare results. But, as a teacher who has taken way too many surveys, sometimes I think it may result in response bias and it is difficult to capture complex information. Respondents may not be able to explain their thoughts or experience with a simple rating scale. While doing this lab, I was able to hear their perspective on each movie and their reason why they voted the way they did. If I were to standardize the rating for this lab specifically, I can have people rate from 1-5 to the nearest integer instead of accepting decimal values.