Assignment02

Load the library

library(RMySQL)

## Loading required package: DBI

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Connect the database

Making the connection through the MySQL package to the azure database. Saving the connection as mydb for reference later.

mydb <- dbConnect(MySQL(), user='kaylie.evans64',password='PasswordDATA607',dbname='kaylie.evans64',host='cunydata607sql.mysql.database.azure.com')

Add the table with my data to R as data frame

Also opening up the movies df to see what the table transfer looks like

movies <- dbReadTable(mydb,'movie_data')

## Warning in dbSendQuery(conn, statement, ...): Decimal MySQL column 1 imported
## as numeric

## Warning in dbSendQuery(conn, statement, ...): Decimal MySQL column 2 imported
## as numeric

## Warning in dbSendQuery(conn, statement, ...): Decimal MySQL column 3 imported
## as numeric

## Warning in dbSendQuery(conn, statement, ...): Decimal MySQL column 4 imported
## as numeric

## Warning in dbSendQuery(conn, statement, ...): Decimal MySQL column 5 imported
## as numeric

movies

##                                        MOVIE BROCK NATE KIRSTEN LANCE JEN
## 1                              Barbie (2023)   2.0  4.5     4.0   4.5 5.0
## 2         The Super Mario Bros. Movie (2023)   4.0  4.0     3.5   1.5 3.0
## 3            Avatar: The Way of Water (2022)   5.0  4.5      NA    NA 3.0
## 4                               Moana (2016)   3.5  3.5     4.0   3.0 4.0
## 5                             Encanto (2021)   3.5  3.0     3.5   4.0 4.5
## 6 Spider-Man: Across the Spider-Verse (2023)   3.5  5.0     4.0   3.5 3.0
## 7   Spider-Man: Into the Spider-Verse (2018)   3.0  5.0     4.5   4.0 5.0
## 8                            Hamilton (2020)   1.0  4.0     5.0   2.5 5.0

There are 2 null values for Avatar: The Way of Water (2022), and it looks like R has properly read them as null.

Missing Data Strategy

There are a few missing data strategies that would work for this situation, all with their own pros and cons. Here, I am going to use the method for filtering out the entire row that has null values. This will get rid of the only movie with missing data.

#where exactly are any null values?
is.na(movies)

##      MOVIE BROCK  NATE KIRSTEN LANCE   JEN
## [1,] FALSE FALSE FALSE   FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE   FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE    TRUE  TRUE FALSE
## [4,] FALSE FALSE FALSE   FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE   FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE   FALSE FALSE FALSE
## [7,] FALSE FALSE FALSE   FALSE FALSE FALSE
## [8,] FALSE FALSE FALSE   FALSE FALSE FALSE

#remove those rows 
movies |>
  filter(!is.na(LANCE))

##                                        MOVIE BROCK NATE KIRSTEN LANCE JEN
## 1                              Barbie (2023)   2.0  4.5     4.0   4.5 5.0
## 2         The Super Mario Bros. Movie (2023)   4.0  4.0     3.5   1.5 3.0
## 3                               Moana (2016)   3.5  3.5     4.0   3.0 4.0
## 4                             Encanto (2021)   3.5  3.0     3.5   4.0 4.5
## 5 Spider-Man: Across the Spider-Verse (2023)   3.5  5.0     4.0   3.5 3.0
## 6   Spider-Man: Into the Spider-Verse (2018)   3.0  5.0     4.5   4.0 5.0
## 7                            Hamilton (2020)   1.0  4.0     5.0   2.5 5.0

Missing Data Strategy Reasoning: Removing the Movie

I chose to remove the entire movie data for a few reasons. In case of missing data, I chose more than 6 movies. This means that after filtering out the data, the number of movies still reaches the guidelines. During the data collection, the 2 missing values were noted as missing not because they had an especially good or bad expectation for the movie, just that they hadn’t seen it. The major issue is with the removed set of numbers that were not null. In the view, we can see that BROCK’s highest rating was for the removed row so that data is now not able to be part of the analysis.