Introduction

In this file, I chose six recent popular movies–All of Us Strangers, Anatomy of a Fall, Past Lives, Poor Things, Saltburn, and Zone of Interest–and asked 5 friends for their ratings. Later in this file, I will store these data on a SQL database, and handle missings.

Part 1: Build a table

I build a table of friends’ ratings below (their initials are the column names).

movieratings <- data.frame(  
  Title = c("All of Us Strangers", "Anatomy of a Fall", "Past Lives", "Poor Things", "Saltburn", "Zone of Interest"),
  AH = c(NA, NA, 5, NA, 2, 5), 
  SC = c(4, 4.5, 4.5, 4.5, 2, 4),
  MB = c(4.5, NA, 4, 5, 4, NA),
  PT = c(NA, NA, NA, 4, 3.5, NA),
  BH = c(NA, NA, 5, NA, 3, NA)
)

Part 2: Store data in SQL database

In order to store the results in the class MySQL database, I connect to MySQL in R using the RMySQL library (code is not shown in R-markdown since the password is included).

Next I use the dbWriteTable method in this library to copy my existing data frame to SQL.

# Use dbWriteTable to copy the "movieratings" dataframe to MySQL.
dbWriteTable(conn = con, name = "movieratings", value = movieratings, overwrite = TRUE)

## [1] TRUE

Part 3: Transfer data from SQL database to R dataframe

The prior R dataframe now exists as a table in the SQL database called “movieratings” and a query can be used to view it in R.

# View the table data with dbGetQuery command.
head(dbGetQuery(con, "SELECT * FROM movieratings"))

##   row_names               Title AH  SC  MB  PT BH
## 1         1 All of Us Strangers NA 4.0 4.5  NA NA
## 2         2   Anatomy of a Fall NA 4.5  NA  NA NA
## 3         3          Past Lives  5 4.5 4.0  NA  5
## 4         4         Poor Things NA 4.5 5.0 4.0 NA
## 5         5            Saltburn  2 2.0 4.0 3.5  3
## 6         6    Zone of Interest  5 4.0  NA  NA NA

Part 4: Missing data strategy

Regarding the missing data, as not all 5 friends had seen each movie, I coded these observations as “NA” in the data frame. I will keep these rows with NAs in my data, as removing them would leave me with only one movie. Instead, I plot the values using box plots below to show the distribution of ratings for each movie, allowing us to compare movie ratings despite the missing observations.

Data Visualization

# Boxplot
boxplot(t(movieratings[, -1]), main = "Ratings for Movies", ylab = "Rating", names = movieratings$Title, cex.axis = .55)

Week Two - R and SQL

Naomi Buell

2024-02-02