This assignment uses the following additional software and packages: MySQL Server MySQL Workbench RMySQL plyr
In order to create the tables create the movie_ratings data base and set it as the default schema (sql command for this is commented out). Run the rest movie_ratings.sql script to create and populate the tables (https://raw.githubusercontent.com/Patel-Krutika/Data_607/main/movie_ratings.sql). Create the connection to MySQL server with the appropriate hostname and credentials.
The csv file representing the final data frame can be found at (https://raw.githubusercontent.com/Patel-Krutika/Data_607/main/MOVIE_RATINGS.csv)
library("RMySQL")
## Loading required package: DBI
mydb = dbConnect(MySQL(), user='root', password='root', dbname='movie_ratings', host='localhost')
The movie_ratings database has two two tables: movies and ratings. The movies table stores each movie with a unique id, and the ratings table holds each rating for a movie by a participant as an individual row (There will be 6 rows with the same participant id, each representing the participant’s rating for one movie).
dbListTables(mydb)
## [1] "movies" "ratings"
The following code queries the names of all the movie names to later us as the column names for the data frame.
library("plyr")
rs = dbSendQuery(mydb, "select * from MOVIES")
data = fetch(rs, n=-1)
m_count <- length(data$M_ID)
cols <- data$M_NAME
cols
## [1] "Black Panther" "Luca" "Black Widow"
## [4] "Respect" "The Imitation Game" "Hidden Figures"
The following code queries all the ratings for each movie. The 6 results are added to the final data frame, and the column names set the the movie names extracted above.
str <- paste("select rating from ratings where m_id =", toString(1))
rs = dbSendQuery(mydb, str)
m1 = fetch(rs, n=-1)
str <- paste("select rating from ratings where m_id =", toString(2))
rs = dbSendQuery(mydb, str)
m2 = fetch(rs, n=-1)
str <- paste("select rating from ratings where m_id =", toString(3))
rs = dbSendQuery(mydb, str)
m3 = fetch(rs, n=-1)
str <- paste("select rating from ratings where m_id =", toString(4))
rs = dbSendQuery(mydb, str)
m4 = fetch(rs, n=-1)
str <- paste("select rating from ratings where m_id =", toString(5))
rs = dbSendQuery(mydb, str)
m5 = fetch(rs, n=-1)
str <- paste("select rating from ratings where m_id =", toString(6))
rs = dbSendQuery(mydb, str)
m6 = fetch(rs, n=-1)
dataframe <- data.frame(m1,m2,m3,m4,m5,m6)
colnames(dataframe) <- cols
dataframe
## Black Panther Luca Black Widow Respect The Imitation Game Hidden Figures
## 1 1 0 4 NA NA 3
## 2 0 2 4 NA 1 1
## 3 0 5 4 3 3 1
## 4 0 0 2 0 5 5
## 5 3 4 0 0 0 2
## 6 NA 2 NA 1 3 NA
## 7 NA 4 1 0 0 3
## 8 1 3 0 5 4 1
## 9 2 3 3 5 5 0
In order to account for missing values for each movie, I utilized the na.rm function to omit them from the calculations. The data set is really samll for any significant information, thus it would not be reasonable to just omit all of a participant’s reviews simply because one is missing.
summary(dataframe, na.rm = TRUE)
## Black Panther Luca Black Widow Respect The Imitation Game
## Min. :0.0 Min. :0.000 Min. :0.00 Min. :0 Min. :0.000
## 1st Qu.:0.0 1st Qu.:2.000 1st Qu.:0.75 1st Qu.:0 1st Qu.:0.750
## Median :1.0 Median :3.000 Median :2.50 Median :1 Median :3.000
## Mean :1.0 Mean :2.556 Mean :2.25 Mean :2 Mean :2.625
## 3rd Qu.:1.5 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:4 3rd Qu.:4.250
## Max. :3.0 Max. :5.000 Max. :4.00 Max. :5 Max. :5.000
## NA's :2 NA's :1 NA's :2 NA's :1
## Hidden Figures
## Min. :0.0
## 1st Qu.:1.0
## Median :1.5
## Mean :2.0
## 3rd Qu.:3.0
## Max. :5.0
## NA's :1
Using the csv file to create the data frame.
df_csv = data.frame(read.csv("https://raw.githubusercontent.com/Patel-Krutika/Data_607/main/MOVIE_RATINGS.csv"))
df_csv
## Black.Panther Luca Black.Widow Respect The.Imitation.Game Hidden.Figures
## 1 1 0 4 NA NA 3
## 2 0 2 4 NA 1 1
## 3 0 5 4 3 3 1
## 4 0 0 2 0 5 5
## 5 3 4 0 0 0 2
## 6 NA 2 NA 1 3 NA
## 7 NA 4 1 0 0 3
## 8 1 3 0 5 4 1
## 9 2 3 3 5 5 0
summary(df_csv)
## Black.Panther Luca Black.Widow Respect The.Imitation.Game
## Min. :0.0 Min. :0.000 Min. :0.00 Min. :0 Min. :0.000
## 1st Qu.:0.0 1st Qu.:2.000 1st Qu.:0.75 1st Qu.:0 1st Qu.:0.750
## Median :1.0 Median :3.000 Median :2.50 Median :1 Median :3.000
## Mean :1.0 Mean :2.556 Mean :2.25 Mean :2 Mean :2.625
## 3rd Qu.:1.5 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:4 3rd Qu.:4.250
## Max. :3.0 Max. :5.000 Max. :4.00 Max. :5 Max. :5.000
## NA's :2 NA's :1 NA's :2 NA's :1
## Hidden.Figures
## Min. :0.0
## 1st Qu.:1.0
## Median :1.5
## Mean :2.0
## 3rd Qu.:3.0
## Max. :5.0
## NA's :1
colMeans(dataframe, na.rm=TRUE)
## Black Panther Luca Black Widow Respect
## 1.000000 2.555556 2.250000 2.000000
## The Imitation Game Hidden Figures
## 2.625000 2.000000
barplot(colMeans(dataframe, na.rm=TRUE), las=3, space = .5, col= c("Grey", "Yellow", "Red", "Lightblue", "Brown", "Green"))