Data_607_Week2

This assignment uses the following additional software and packages: MySQL Server MySQL Workbench RMySQL plyr

In order to create the tables create the movie_ratings data base and set it as the default schema (sql command for this is commented out). Run the rest movie_ratings.sql script to create and populate the tables (https://raw.githubusercontent.com/Patel-Krutika/Data_607/main/movie_ratings.sql). Create the connection to MySQL server with the appropriate hostname and credentials.

The csv file representing the final data frame can be found at (https://raw.githubusercontent.com/Patel-Krutika/Data_607/main/MOVIE_RATINGS.csv)

library("RMySQL")

## Loading required package: DBI

mydb = dbConnect(MySQL(), user='root', password='root', dbname='movie_ratings', host='localhost')

The movie_ratings database has two two tables: movies and ratings. The movies table stores each movie with a unique id, and the ratings table holds each rating for a movie by a participant as an individual row (There will be 6 rows with the same participant id, each representing the participant’s rating for one movie).

dbListTables(mydb)

## [1] "movies"  "ratings"

The following code queries the names of all the movie names to later us as the column names for the data frame.

library("plyr")
rs = dbSendQuery(mydb, "select  * from MOVIES")
data = fetch(rs, n=-1)

m_count <- length(data$M_ID)
cols <- data$M_NAME

cols

## [1] "Black Panther"      "Luca"               "Black Widow"       
## [4] "Respect"            "The Imitation Game" "Hidden Figures"

The following code queries all the ratings for each movie. The 6 results are added to the final data frame, and the column names set the the movie names extracted above.

str <- paste("select rating from ratings where m_id =", toString(1))
rs = dbSendQuery(mydb, str)
m1 = fetch(rs, n=-1)

str <- paste("select rating from ratings where m_id =", toString(2))
rs = dbSendQuery(mydb, str)
m2 = fetch(rs, n=-1)

str <- paste("select rating from ratings where m_id =", toString(3))
rs = dbSendQuery(mydb, str)
m3 = fetch(rs, n=-1)

str <- paste("select rating from ratings where m_id =", toString(4))
rs = dbSendQuery(mydb, str)
m4 = fetch(rs, n=-1)

str <- paste("select rating from ratings where m_id =", toString(5))
rs = dbSendQuery(mydb, str)
m5 = fetch(rs, n=-1)

str <- paste("select rating from ratings where m_id =", toString(6))
rs = dbSendQuery(mydb, str)
m6 = fetch(rs, n=-1)


dataframe <- data.frame(m1,m2,m3,m4,m5,m6)

colnames(dataframe) <- cols

dataframe

##   Black Panther Luca Black Widow Respect The Imitation Game Hidden Figures
## 1             1    0           4      NA                 NA              3
## 2             0    2           4      NA                  1              1
## 3             0    5           4       3                  3              1
## 4             0    0           2       0                  5              5
## 5             3    4           0       0                  0              2
## 6            NA    2          NA       1                  3             NA
## 7            NA    4           1       0                  0              3
## 8             1    3           0       5                  4              1
## 9             2    3           3       5                  5              0

In order to account for missing values for each movie, I utilized the na.rm function to omit them from the calculations. The data set is really samll for any significant information, thus it would not be reasonable to just omit all of a participant’s reviews simply because one is missing.

summary(dataframe, na.rm = TRUE)

##  Black Panther      Luca        Black Widow      Respect  The Imitation Game
##  Min.   :0.0   Min.   :0.000   Min.   :0.00   Min.   :0   Min.   :0.000     
##  1st Qu.:0.0   1st Qu.:2.000   1st Qu.:0.75   1st Qu.:0   1st Qu.:0.750     
##  Median :1.0   Median :3.000   Median :2.50   Median :1   Median :3.000     
##  Mean   :1.0   Mean   :2.556   Mean   :2.25   Mean   :2   Mean   :2.625     
##  3rd Qu.:1.5   3rd Qu.:4.000   3rd Qu.:4.00   3rd Qu.:4   3rd Qu.:4.250     
##  Max.   :3.0   Max.   :5.000   Max.   :4.00   Max.   :5   Max.   :5.000     
##  NA's   :2                     NA's   :1      NA's   :2   NA's   :1         
##  Hidden Figures
##  Min.   :0.0   
##  1st Qu.:1.0   
##  Median :1.5   
##  Mean   :2.0   
##  3rd Qu.:3.0   
##  Max.   :5.0   
##  NA's   :1

Using the csv file to create the data frame.

df_csv = data.frame(read.csv("https://raw.githubusercontent.com/Patel-Krutika/Data_607/main/MOVIE_RATINGS.csv"))
df_csv

##   Black.Panther Luca Black.Widow Respect The.Imitation.Game Hidden.Figures
## 1             1    0           4      NA                 NA              3
## 2             0    2           4      NA                  1              1
## 3             0    5           4       3                  3              1
## 4             0    0           2       0                  5              5
## 5             3    4           0       0                  0              2
## 6            NA    2          NA       1                  3             NA
## 7            NA    4           1       0                  0              3
## 8             1    3           0       5                  4              1
## 9             2    3           3       5                  5              0

summary(df_csv)

##  Black.Panther      Luca        Black.Widow      Respect  The.Imitation.Game
##  Min.   :0.0   Min.   :0.000   Min.   :0.00   Min.   :0   Min.   :0.000     
##  1st Qu.:0.0   1st Qu.:2.000   1st Qu.:0.75   1st Qu.:0   1st Qu.:0.750     
##  Median :1.0   Median :3.000   Median :2.50   Median :1   Median :3.000     
##  Mean   :1.0   Mean   :2.556   Mean   :2.25   Mean   :2   Mean   :2.625     
##  3rd Qu.:1.5   3rd Qu.:4.000   3rd Qu.:4.00   3rd Qu.:4   3rd Qu.:4.250     
##  Max.   :3.0   Max.   :5.000   Max.   :4.00   Max.   :5   Max.   :5.000     
##  NA's   :2                     NA's   :1      NA's   :2   NA's   :1         
##  Hidden.Figures
##  Min.   :0.0   
##  1st Qu.:1.0   
##  Median :1.5   
##  Mean   :2.0   
##  3rd Qu.:3.0   
##  Max.   :5.0   
##  NA's   :1

colMeans(dataframe, na.rm=TRUE)

##      Black Panther               Luca        Black Widow            Respect 
##           1.000000           2.555556           2.250000           2.000000 
## The Imitation Game     Hidden Figures 
##           2.625000           2.000000

barplot(colMeans(dataframe, na.rm=TRUE), las=3, space = .5, col= c("Grey", "Yellow", "Red", "Lightblue", "Brown", "Green"))