This week’s assignment required creating a small dataset of movie ratings by surveying one’s acquaintances. As my family and friends are not movie buffs, I researched a set of six popular current movies on Rotten Tomatoes (https://editorial.rottentomatoes.com/guide/popular-movies/) and followed the suggestion to survey five imaginary friends, who were very accommodating with their opinions.
# Connect to the class database (RMySQL also attaches the DBI interface it depends on)
library(RMySQL)
classdb_connection <- dbConnect(MySQL(), user = 'amanda.fox02', password = 'FT4QT_yy',
                                dbname = 'amanda.fox02', host = 'cunydata607sql.mysql.database.azure.com')
# Create a new dataframe by fetching the result of a query against the "ratings" table
df <- fetch(dbSendQuery(classdb_connection, "SELECT * FROM `amanda.fox02`.ratings;"))
# Profile data and verify
str(df)
## 'data.frame': 20 obs. of 3 variables:
## $ Movie Title: chr "Poor Things" "The Beekeeper" "Mean Girls 2024" "Migration" ...
## $ Reviewer : chr "Amy" "Amy" "Amy" "Amy" ...
## $ Rating : num 4 3 1 5 2 2 2 1 1 5 ...
summary(df)
## Movie Title Reviewer Rating
## Length:20 Length:20 Min. :1.00
## Class :character Class :character 1st Qu.:2.00
## Mode :character Mode :character Median :3.00
## Mean :3.15
## 3rd Qu.:4.00
## Max. :5.00
Records with blank scores were excluded from the MySQL table and the R data frame.
In the course of this exercise, I found that MySQL very much wanted to reject records with blank scores, and I tried several ways to force it to accept blanks ('') as NULL values in that field. I also researched ways to use SQL to handle importing a value of '' (e.g., setting such values to 0 or NULL during the import process).
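For reference, one way to handle blanks at the database end (not the approach I ultimately used) is MySQL's NULLIF() function, which returns NULL when its two arguments are equal. The sketch below assumes a hypothetical staging table `ratings_staging` with a text-typed `rating_raw` column; nothing like this was actually created for this assignment.

# Hypothetical sketch: convert empty-string ratings to NULL in a staging table
dbExecute(classdb_connection,
          "UPDATE `amanda.fox02`.ratings_staging
           SET rating_raw = NULLIF(rating_raw, '');")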
Ultimately I decided to reject null records for this use case:
- Nulls should not be treated as ratings of zero
- Excluding nulls from every calculation would be cumbersome
- Null records do not add value to this analysis, and we could back into them later if needed (a short R sketch of filtering them out follows this list)
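Had the blank scores been imported anyway, they would arrive in R as NA values, and excluding them is a one-liner. This is only an illustrative sketch of that alternative, not something run above; `df_rated` is a hypothetical name.

# Hypothetical sketch: if blank scores had been imported as NA, drop them in R
df_rated <- df[!is.na(df$Rating), ]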
While I did not break out the collected data into a normalized structure in this exercise, ideally we would create three tables: one for movies, one for reviewers, and a ratings table that references both by ID, for example:
| Movie | Person | Rating |
|-------|--------|--------|
| 123   | A      | 1      |
| 234   | B      | 4      |
| 345   | B      | 2      |
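Expressed as SQL sent from R, that schema might look roughly like the sketch below. The table and column names are hypothetical, and these statements were not run as part of this exercise.

# Hypothetical sketch of a normalized schema (not created for this assignment)
dbExecute(classdb_connection, "CREATE TABLE movie  (movie_id  INT PRIMARY KEY, title VARCHAR(100));")
dbExecute(classdb_connection, "CREATE TABLE person (person_id INT PRIMARY KEY, name  VARCHAR(100));")
dbExecute(classdb_connection, "CREATE TABLE rating (movie_id INT, person_id INT, rating INT,
                                PRIMARY KEY (movie_id, person_id),
                                FOREIGN KEY (movie_id)  REFERENCES movie(movie_id),
                                FOREIGN KEY (person_id) REFERENCES person(person_id));")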
In practice, the juice is probably not worth the squeeze for small datasets: designing and creating multiple tables means writing joins in every query. In very large databases, however, normalization is a best practice for several very good reasons, including efficiency and accuracy.
For example, one can change a person's name by updating a single record in the “person” table, rather than hunting down every record with the person's old name across potentially many tables in a large database. This also reduces the opportunity for errors and improves accuracy in queries and reports. Plus, in the past when storage was expensive, not duplicating data unnecessarily was in itself good management.
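To make that trade-off concrete, here is a hedged sketch of how the hypothetical normalized tables above would be queried and maintained from R; neither statement was run for this assignment.

# Hypothetical sketch: a join is needed to rebuild the flat view of ratings...
dbGetQuery(classdb_connection,
           "SELECT m.title, p.name, r.rating
            FROM rating r
            JOIN movie  m ON m.movie_id  = r.movie_id
            JOIN person p ON p.person_id = r.person_id;")

# ...but correcting a reviewer's name touches exactly one row
dbExecute(classdb_connection, "UPDATE person SET name = 'Amy R.' WHERE person_id = 1;")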