In this assignment, a group of friends and family of the author were asked to rate 5 recent movies on a scale from 1 to 5, with 1 being one of the worst to 5 being one of the best. A Google form was sent out and 30 people had responded to the form. Their responses were then stored onto a .csv file. The data from this .csv file was then uploaded onto a MySQL database server. The data from the MySQL database server was then imported into R. Average imputation was then used in order to assign numerical values to NA responses (where the person who was surveyed had not seen the film, and thus, unable to give a score).
The data was then imported and stored onto a MySQL database. Instead of going through the trouble of having to store the .csv file onto your local machine then importing it into the MySQL server, and to account for reproducibility, The Convert CSV to SQL[1] online tool was used to convert the data from the .csv file into SQL code, because typing in 30 INSERT INTO statements by hand is not fun. The MySQL code to generate the table, movie_survey, containing the movie survey data is shown below.
Figure 1: MySQL Code used to generate movie survey data.
On MySQL Workbench, a new user was added who has permission to connect to the database. The credentials for this new user, which include the username and password for this user, were then stored on a configuration file (movie_survey.cnf). These credentials were stored on a configuration file for security reasons. In R, the location of the configuration file was used in order to connect to the MySQL database using the dbConnect function.
library(RMariaDB)
library(knitr)
rmariadb.settingsfile <- "/home/peter/Downloads/movie_survey.cnf"
rmariadb.db <- "movie_survey"
movie_survey_db <- dbConnect(RMariaDB::MariaDB(),
default.file=rmariadb.settingsfile,
group=rmariadb.db)
dbListTables(movie_survey_db)
## [1] "movie_survey"
The dbListTables function shows the movie_survey table that was generated in MySQL using the movie survey data. The movie_survey table was then imported into R as a data frame and stored in the movie_survey_df variable.
query<-paste(
"SELECT * FROM movie_survey;",sep=""
)
rs = dbSendQuery(movie_survey_db, query)
movie_survey_df <- dbFetch(rs)
kable(movie_survey_df)
| Survey ID | Joker | Whiplash | Us | Once Upon a Time in Hollywood | The Rise of Skywalker |
|---|---|---|---|---|---|
| 0 | 5 | 4 | 2 | 1 | 3 |
| 1 | NA | 4 | NA | 3 | 3 |
| 2 | 5 | 1 | NA | 1 | 1 |
| 3 | 1 | 5 | NA | 2 | 5 |
| 4 | 4 | 4 | 5 | 1 | 3 |
| 5 | 1 | NA | 5 | 1 | 1 |
| 6 | 3 | 2 | 2 | NA | 5 |
| 7 | 3 | 5 | 4 | 2 | 3 |
| 8 | 4 | 5 | 2 | 1 | 5 |
| 9 | NA | NA | 5 | 2 | NA |
| 10 | 5 | 4 | 1 | 3 | 1 |
| 11 | 1 | NA | 3 | 2 | 2 |
| 12 | 4 | 4 | 4 | 2 | 1 |
| 13 | NA | 4 | 5 | 4 | 1 |
| 14 | 1 | 4 | NA | 2 | 3 |
| 15 | 1 | 3 | 4 | 4 | 3 |
| 16 | 4 | NA | 1 | 3 | 4 |
| 17 | 5 | 1 | 5 | 5 | 2 |
| 18 | NA | 3 | 3 | 1 | 4 |
| 19 | 3 | 5 | 2 | 4 | 2 |
| 20 | NA | NA | 4 | NA | NA |
| 21 | 4 | 4 | 5 | 5 | 3 |
| 22 | 5 | 3 | 3 | 1 | 1 |
| 23 | 2 | 3 | 5 | NA | 3 |
| 24 | NA | 1 | 2 | 5 | 2 |
| 25 | 4 | 4 | NA | 5 | 5 |
| 26 | 4 | NA | 5 | 3 | 5 |
| 27 | 4 | 1 | NA | 5 | 3 |
| 28 | 2 | 2 | 2 | 2 | 4 |
| 29 | NA | NA | 1 | NA | 2 |
In the data, there are several instances of NA values. These values indicate that the person that was being surveyed had not seen the film. In order to fill in these values with some numerical value, it was decided to compute the average value for each column, then fill in the NA values with this average for each column.
apply(movie_survey_df, 2, mean, na.rm=TRUE)
## Survey ID Joker
## 14.500000 3.260870
## Whiplash Us
## 3.304348 3.333333
## Once Upon a Time in Hollywood The Rise of Skywalker
## 2.692308 2.857143
The code block above shows the averages for each of the columns. The code block below applies all of these averages to each of the NA values for each column.
for(i in 1:ncol(movie_survey_df)){
movie_survey_df[is.na(movie_survey_df[,i]), i] <- mean(movie_survey_df[,i], na.rm = TRUE)
}
kable(movie_survey_df)
| Survey ID | Joker | Whiplash | Us | Once Upon a Time in Hollywood | The Rise of Skywalker |
|---|---|---|---|---|---|
| 0 | 5.00000 | 4.000000 | 2.000000 | 1.000000 | 3.000000 |
| 1 | 3.26087 | 4.000000 | 3.333333 | 3.000000 | 3.000000 |
| 2 | 5.00000 | 1.000000 | 3.333333 | 1.000000 | 1.000000 |
| 3 | 1.00000 | 5.000000 | 3.333333 | 2.000000 | 5.000000 |
| 4 | 4.00000 | 4.000000 | 5.000000 | 1.000000 | 3.000000 |
| 5 | 1.00000 | 3.304348 | 5.000000 | 1.000000 | 1.000000 |
| 6 | 3.00000 | 2.000000 | 2.000000 | 2.692308 | 5.000000 |
| 7 | 3.00000 | 5.000000 | 4.000000 | 2.000000 | 3.000000 |
| 8 | 4.00000 | 5.000000 | 2.000000 | 1.000000 | 5.000000 |
| 9 | 3.26087 | 3.304348 | 5.000000 | 2.000000 | 2.857143 |
| 10 | 5.00000 | 4.000000 | 1.000000 | 3.000000 | 1.000000 |
| 11 | 1.00000 | 3.304348 | 3.000000 | 2.000000 | 2.000000 |
| 12 | 4.00000 | 4.000000 | 4.000000 | 2.000000 | 1.000000 |
| 13 | 3.26087 | 4.000000 | 5.000000 | 4.000000 | 1.000000 |
| 14 | 1.00000 | 4.000000 | 3.333333 | 2.000000 | 3.000000 |
| 15 | 1.00000 | 3.000000 | 4.000000 | 4.000000 | 3.000000 |
| 16 | 4.00000 | 3.304348 | 1.000000 | 3.000000 | 4.000000 |
| 17 | 5.00000 | 1.000000 | 5.000000 | 5.000000 | 2.000000 |
| 18 | 3.26087 | 3.000000 | 3.000000 | 1.000000 | 4.000000 |
| 19 | 3.00000 | 5.000000 | 2.000000 | 4.000000 | 2.000000 |
| 20 | 3.26087 | 3.304348 | 4.000000 | 2.692308 | 2.857143 |
| 21 | 4.00000 | 4.000000 | 5.000000 | 5.000000 | 3.000000 |
| 22 | 5.00000 | 3.000000 | 3.000000 | 1.000000 | 1.000000 |
| 23 | 2.00000 | 3.000000 | 5.000000 | 2.692308 | 3.000000 |
| 24 | 3.26087 | 1.000000 | 2.000000 | 5.000000 | 2.000000 |
| 25 | 4.00000 | 4.000000 | 3.333333 | 5.000000 | 5.000000 |
| 26 | 4.00000 | 3.304348 | 5.000000 | 3.000000 | 5.000000 |
| 27 | 4.00000 | 1.000000 | 3.333333 | 5.000000 | 3.000000 |
| 28 | 2.00000 | 2.000000 | 2.000000 | 2.000000 | 4.000000 |
| 29 | 3.26087 | 3.304348 | 1.000000 | 2.692308 | 2.000000 |
A future experiment would be to reconduct the survey, but add an additional question to the survey, which asks for those that did not see the movie in question, what would be the likelihood that you would see the movie in the future. The scale that would be set for this question would also be from 1 to 5, with 1 being “Very Unlikely” and 5 being “Very Likely”. We could have also standardized the ratings for each of the films, as the occasional low rating given by a user would impact the average rating for that movie. Calculating the median would probably benefit, also subtracting the mean and dividing by the standard deviation would normalize the ratings.
[1] Convert CSV to SQL. CSV To SQL Converter. (n.d.). https://www.convertcsv.com/csv-to-sql.htm.