Assignment - SQL and R

Introduction

In this assignment, a group of friends and family of the author were asked to rate 5 recent movies on a scale from 1 to 5, with 1 being one of the worst to 5 being one of the best. A Google form was sent out and 30 people had responded to the form. Their responses were then stored onto a .csv file. The data from this .csv file was then uploaded onto a MySQL database server. The data from the MySQL database server was then imported into R. Average imputation was then used in order to assign numerical values to NA responses (where the person who was surveyed had not seen the film, and thus, unable to give a score).

MySQL Importing of Movie Survey Data

The data was then imported and stored onto a MySQL database. Instead of going through the trouble of having to store the .csv file onto your local machine then importing it into the MySQL server, and to account for reproducibility, The Convert CSV to SQL[1] online tool was used to convert the data from the .csv file into SQL code, because typing in 30 INSERT INTO statements by hand is not fun. The MySQL code to generate the table, movie_survey, containing the movie survey data is shown below.

Figure 1: MySQL Code used to generate movie survey data.

Importing of Movie Survey Data from MySQL to R.

On MySQL Workbench, a new user was added who has permission to connect to the database. The credentials for this new user, which include the username and password for this user, were then stored on a configuration file (movie_survey.cnf). These credentials were stored on a configuration file for security reasons. In R, the location of the configuration file was used in order to connect to the MySQL database using the dbConnect function.

library(RMariaDB)
library(knitr)

rmariadb.settingsfile <- "/home/peter/Downloads/movie_survey.cnf"

rmariadb.db <- "movie_survey"
movie_survey_db <- dbConnect(RMariaDB::MariaDB(),
                             default.file=rmariadb.settingsfile,
                             group=rmariadb.db)

dbListTables(movie_survey_db)

## [1] "movie_survey"

The dbListTables function shows the movie_survey table that was generated in MySQL using the movie survey data. The movie_survey table was then imported into R as a data frame and stored in the movie_survey_df variable.

query<-paste(
"SELECT * FROM movie_survey;",sep=""
)

rs = dbSendQuery(movie_survey_db, query)
movie_survey_df <- dbFetch(rs)
kable(movie_survey_df)

Survey ID	Joker	Whiplash	Us	Once Upon a Time in Hollywood	The Rise of Skywalker
0	5	4	2	1	3
1	NA	4	NA	3	3
2	5	1	NA	1	1
3	1	5	NA	2	5
4	4	4	5	1	3
5	1	NA	5	1	1
6	3	2	2	NA	5
7	3	5	4	2	3
8	4	5	2	1	5
9	NA	NA	5	2	NA
10	5	4	1	3	1
11	1	NA	3	2	2
12	4	4	4	2	1
13	NA	4	5	4	1
14	1	4	NA	2	3
15	1	3	4	4	3
16	4	NA	1	3	4
17	5	1	5	5	2
18	NA	3	3	1	4
19	3	5	2	4	2
20	NA	NA	4	NA	NA
21	4	4	5	5	3
22	5	3	3	1	1
23	2	3	5	NA	3
24	NA	1	2	5	2
25	4	4	NA	5	5
26	4	NA	5	3	5
27	4	1	NA	5	3
28	2	2	2	2	4
29	NA	NA	1	NA	2

Handling Missing Values

In the data, there are several instances of NA values. These values indicate that the person that was being surveyed had not seen the film. In order to fill in these values with some numerical value, it was decided to compute the average value for each column, then fill in the NA values with this average for each column.

apply(movie_survey_df, 2, mean, na.rm=TRUE)

##                     Survey ID                         Joker 
##                     14.500000                      3.260870 
##                      Whiplash                            Us 
##                      3.304348                      3.333333 
## Once Upon a Time in Hollywood         The Rise of Skywalker 
##                      2.692308                      2.857143

The code block above shows the averages for each of the columns. The code block below applies all of these averages to each of the NA values for each column.

for(i in 1:ncol(movie_survey_df)){
  movie_survey_df[is.na(movie_survey_df[,i]), i] <- mean(movie_survey_df[,i], na.rm = TRUE)
}
kable(movie_survey_df)

Survey ID	Joker	Whiplash	Us	Once Upon a Time in Hollywood	The Rise of Skywalker
0	5.00000	4.000000	2.000000	1.000000	3.000000
1	3.26087	4.000000	3.333333	3.000000	3.000000
2	5.00000	1.000000	3.333333	1.000000	1.000000
3	1.00000	5.000000	3.333333	2.000000	5.000000
4	4.00000	4.000000	5.000000	1.000000	3.000000
5	1.00000	3.304348	5.000000	1.000000	1.000000
6	3.00000	2.000000	2.000000	2.692308	5.000000
7	3.00000	5.000000	4.000000	2.000000	3.000000
8	4.00000	5.000000	2.000000	1.000000	5.000000
9	3.26087	3.304348	5.000000	2.000000	2.857143
10	5.00000	4.000000	1.000000	3.000000	1.000000
11	1.00000	3.304348	3.000000	2.000000	2.000000
12	4.00000	4.000000	4.000000	2.000000	1.000000
13	3.26087	4.000000	5.000000	4.000000	1.000000
14	1.00000	4.000000	3.333333	2.000000	3.000000
15	1.00000	3.000000	4.000000	4.000000	3.000000
16	4.00000	3.304348	1.000000	3.000000	4.000000
17	5.00000	1.000000	5.000000	5.000000	2.000000
18	3.26087	3.000000	3.000000	1.000000	4.000000
19	3.00000	5.000000	2.000000	4.000000	2.000000
20	3.26087	3.304348	4.000000	2.692308	2.857143
21	4.00000	4.000000	5.000000	5.000000	3.000000
22	5.00000	3.000000	3.000000	1.000000	1.000000
23	2.00000	3.000000	5.000000	2.692308	3.000000
24	3.26087	1.000000	2.000000	5.000000	2.000000
25	4.00000	4.000000	3.333333	5.000000	5.000000
26	4.00000	3.304348	5.000000	3.000000	5.000000
27	4.00000	1.000000	3.333333	5.000000	3.000000
28	2.00000	2.000000	2.000000	2.000000	4.000000
29	3.26087	3.304348	1.000000	2.692308	2.000000

Conclusions

A future experiment would be to reconduct the survey, but add an additional question to the survey, which asks for those that did not see the movie in question, what would be the likelihood that you would see the movie in the future. The scale that would be set for this question would also be from 1 to 5, with 1 being “Very Unlikely” and 5 being “Very Likely”. We could have also standardized the ratings for each of the films, as the occasional low rating given by a user would impact the average rating for that movie. Calculating the median would probably benefit, also subtracting the mean and dividing by the standard deviation would normalize the ratings.

References

[1] Convert CSV to SQL. CSV To SQL Converter. (n.d.). https://www.convertcsv.com/csv-to-sql.htm.

Survey ID	Joker	Whiplash	Us	Once Upon a Time in Hollywood	The Rise of Skywalker
0	5	4	2	1	3
1	NA	4	NA	3	3
2	5	1	NA	1	1
3	1	5	NA	2	5
4	4	4	5	1	3
5	1	NA	5	1	1
6	3	2	2	NA	5
7	3	5	4	2	3
8	4	5	2	1	5
9	NA	NA	5	2	NA
10	5	4	1	3	1
11	1	NA	3	2	2
12	4	4	4	2	1
13	NA	4	5	4	1
14	1	4	NA	2	3
15	1	3	4	4	3
16	4	NA	1	3	4
17	5	1	5	5	2
18	NA	3	3	1	4
19	3	5	2	4	2
20	NA	NA	4	NA	NA
21	4	4	5	5	3
22	5	3	3	1	1
23	2	3	5	NA	3
24	NA	1	2	5	2
25	4	4	NA	5	5
26	4	NA	5	3	5
27	4	1	NA	5	3
28	2	2	2	2	4
29	NA	NA	1	NA	2

Survey ID	Joker	Whiplash	Us	Once Upon a Time in Hollywood	The Rise of Skywalker
0	5	4	2	1	3
1	NA	4	NA	3	3
2	5	1	NA	1	1
3	1	5	NA	2	5
4	4	4	5	1	3
5	1	NA	5	1	1
6	3	2	2	NA	5
7	3	5	4	2	3
8	4	5	2	1	5
9	NA	NA	5	2	NA
10	5	4	1	3	1
11	1	NA	3	2	2
12	4	4	4	2	1
13	NA	4	5	4	1
14	1	4	NA	2	3
15	1	3	4	4	3
16	4	NA	1	3	4
17	5	1	5	5	2
18	NA	3	3	1	4
19	3	5	2	4	2
20	NA	NA	4	NA	NA
21	4	4	5	5	3
22	5	3	3	1	1
23	2	3	5	NA	3
24	NA	1	2	5	2
25	4	4	NA	5	5
26	4	NA	5	3	5
27	4	1	NA	5	3
28	2	2	2	2	4
29	NA	NA	1	NA	2

Survey ID	Joker	Whiplash	Us	Once Upon a Time in Hollywood	The Rise of Skywalker
0	5	4	2	1	3
1	NA	4	NA	3	3
2	5	1	NA	1	1
3	1	5	NA	2	5
4	4	4	5	1	3
5	1	NA	5	1	1
6	3	2	2	NA	5
7	3	5	4	2	3
8	4	5	2	1	5
9	NA	NA	5	2	NA
10	5	4	1	3	1
11	1	NA	3	2	2
12	4	4	4	2	1
13	NA	4	5	4	1
14	1	4	NA	2	3
15	1	3	4	4	3
16	4	NA	1	3	4
17	5	1	5	5	2
18	NA	3	3	1	4
19	3	5	2	4	2
20	NA	NA	4	NA	NA
21	4	4	5	5	3
22	5	3	3	1	1
23	2	3	5	NA	3
24	NA	1	2	5	2
25	4	4	NA	5	5
26	4	NA	5	3	5
27	4	1	NA	5	3
28	2	2	2	2	4
29	NA	NA	1	NA	2