library(RMySQL)
## Loading required package: DBI
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Making the connection through the MySQL package to the azure database. Saving the connection as mydb for reference later.
mydb <- dbConnect(MySQL(), user='kaylie.evans64',password='PasswordDATA607',dbname='kaylie.evans64',host='cunydata607sql.mysql.database.azure.com')
Also opening up the movies df to see what the table transfer looks like
movies <- dbReadTable(mydb,'movie_data')
## Warning in dbSendQuery(conn, statement, ...): Decimal MySQL column 1 imported
## as numeric
## Warning in dbSendQuery(conn, statement, ...): Decimal MySQL column 2 imported
## as numeric
## Warning in dbSendQuery(conn, statement, ...): Decimal MySQL column 3 imported
## as numeric
## Warning in dbSendQuery(conn, statement, ...): Decimal MySQL column 4 imported
## as numeric
## Warning in dbSendQuery(conn, statement, ...): Decimal MySQL column 5 imported
## as numeric
movies
## MOVIE BROCK NATE KIRSTEN LANCE JEN
## 1 Barbie (2023) 2.0 4.5 4.0 4.5 5.0
## 2 The Super Mario Bros. Movie (2023) 4.0 4.0 3.5 1.5 3.0
## 3 Avatar: The Way of Water (2022) 5.0 4.5 NA NA 3.0
## 4 Moana (2016) 3.5 3.5 4.0 3.0 4.0
## 5 Encanto (2021) 3.5 3.0 3.5 4.0 4.5
## 6 Spider-Man: Across the Spider-Verse (2023) 3.5 5.0 4.0 3.5 3.0
## 7 Spider-Man: Into the Spider-Verse (2018) 3.0 5.0 4.5 4.0 5.0
## 8 Hamilton (2020) 1.0 4.0 5.0 2.5 5.0
There are 2 null values for Avatar: The Way of Water (2022), and it looks like R has properly read them as null.
There are a few missing data strategies that would work for this situation, all with their own pros and cons. Here, I am going to use the method for filtering out the entire row that has null values. This will get rid of the only movie with missing data.
#where exactly are any null values?
is.na(movies)
## MOVIE BROCK NATE KIRSTEN LANCE JEN
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE TRUE TRUE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE FALSE
## [7,] FALSE FALSE FALSE FALSE FALSE FALSE
## [8,] FALSE FALSE FALSE FALSE FALSE FALSE
#remove those rows
movies |>
filter(!is.na(LANCE))
## MOVIE BROCK NATE KIRSTEN LANCE JEN
## 1 Barbie (2023) 2.0 4.5 4.0 4.5 5.0
## 2 The Super Mario Bros. Movie (2023) 4.0 4.0 3.5 1.5 3.0
## 3 Moana (2016) 3.5 3.5 4.0 3.0 4.0
## 4 Encanto (2021) 3.5 3.0 3.5 4.0 4.5
## 5 Spider-Man: Across the Spider-Verse (2023) 3.5 5.0 4.0 3.5 3.0
## 6 Spider-Man: Into the Spider-Verse (2018) 3.0 5.0 4.5 4.0 5.0
## 7 Hamilton (2020) 1.0 4.0 5.0 2.5 5.0
I chose to remove the entire movie data for a few reasons. In case of missing data, I chose more than 6 movies. This means that after filtering out the data, the number of movies still reaches the guidelines. During the data collection, the 2 missing values were noted as missing not because they had an especially good or bad expectation for the movie, just that they hadn’t seen it. The major issue is with the removed set of numbers that were not null. In the view, we can see that BROCK’s highest rating was for the removed row so that data is now not able to be part of the analysis.