Choose six recent popular movies. Ask at least five people that you know (friends, family, classmates, imaginary friends) to rate each of these movie that they have seen on a scale of 1 to 5. Take the results (observations) and store them in a SQL database. Load the information into an R dataframe.

I have prepared 2 tables demo(personal information) and ratings(rating of the movies)

Read the DB password for from the file

password = read.table("C:/ProgramData/MySQL/MySQL Server 5.7/Uploads/flights/pass.txt")

## Warning in read.table("C:/ProgramData/MySQL/MySQL Server 5.7/Uploads/
## flights/pass.txt"): incomplete final line found by readTableHeader on 'C:/
## ProgramData/MySQL/MySQL Server 5.7/Uploads/flights/pass.txt'

install.packages(‘RMySQL’,type=‘source’)

library(RMySQL)

## Loading required package: DBI

library(DBI)
library(ggplot2)

Connection to the MySQL:

moviesdb = dbConnect(MySQL(), user='root', password=toString(password[1,1]), dbname='movies', host='localhost')

List the fields in demo table:

dbListFields(moviesdb, 'demo')

## [1] "name"    "name_id" "state"   "age"

List the fields in ratings table:

dbListFields(moviesdb, 'ratings')

## [1] "name_id"    "movie_name" "rating"

Now getting the frequency of the ratings:

freq_df<-dbGetQuery(moviesdb,'select movie_name, rating, count(rating) as count
from ratings
group by movie_name,rating
order by movie_name, rating;')

View the frequency distribution:

freq_df

##           movie_name rating count
## 1        Baby Driver      1     5
## 2        Baby Driver      2     5
## 3        Baby Driver      3     9
## 4        Baby Driver      4     1
## 5        Baby Driver      5     4
## 6  Blade Runner 2049      1     4
## 7  Blade Runner 2049      2     7
## 8  Blade Runner 2049      3     4
## 9  Blade Runner 2049      4     7
## 10 Blade Runner 2049      5     2
## 11           Dunkirk      1     4
## 12           Dunkirk      2     1
## 13           Dunkirk      3     4
## 14           Dunkirk      4     6
## 15           Dunkirk      5     9
## 16              Lion      1     2
## 17              Lion      2     4
## 18              Lion      3     4
## 19              Lion      4     8
## 20              Lion      5     6
## 21            Wonder      1     9
## 22            Wonder      2     9
## 23            Wonder      3     3
## 24            Wonder      4     2
## 25            Wonder      5     1

Joining the 2 tables and prepare the dataframes:

movies_df<-dbGetQuery(moviesdb, 'select 
a.name, 
a.name_id, 
a.state, 
a.age,
b.movie_name,
b.rating
from demo a
left join ratings b
on a.name_id=b.name_id
order by b.rating
;')

Contents of the dataframe

head(movies_df)

##          name name_id state age        movie_name rating
## 1 Curt Russel      A2    AZ  33           Dunkirk      1
## 2       Brady     A11    NC  11            Wonder      1
## 3     Keshava      A4    AZ  31            Wonder      1
## 4   Venugopal     A20    FL  50            Wonder      1
## 5     Pradeep     A12    NC  10       Baby Driver      1
## 6     Johnson      A9    DE  18 Blade Runner 2049      1

To check if there are any missing values for the ratings:

Contents of the dataframe:

colSums(is.na(movies_df))

##       name    name_id      state        age movie_name     rating 
##          0          0          0          0          0          0

So no missing values.

Frequency by table:

table(movies_df$movie_name, movies_df$rating)

##                    
##                     1 2 3 4 5
##   Baby Driver       5 5 9 1 4
##   Blade Runner 2049 4 7 4 7 2
##   Dunkirk           4 1 4 6 9
##   Lion              2 4 4 8 6
##   Wonder            9 9 3 2 1

Ritesh’s rating view:

Ritesh_Rating <- subset(movies_df, name_id == "A1", select = c(movie_name, 
    rating))

ggplot(Ritesh_Rating, aes(x = movie_name, y = rating, fill = movie_name)) + 
    geom_bar(stat = "identity") + ggtitle("Ritesh Reviews") + xlab("Movie_Name") + 
    ylab("Rating") + guides(fill = FALSE) + theme(axis.text.x = element_text(angle = 45, 
    hjust = 1))

Average movie scores:

avgscore <- aggregate(movies_df$rating ~ movies_df$movie_name, 
    movies_df, mean)
colnames(avgscore) <- c("movie_name", "mean_score")

ggplot(avgscore, aes(x = movie_name, y = mean_score, fill = movie_name)) + 
    geom_bar(stat = "identity") + ggtitle("Average Scores by movie_name") + 
    xlab("Name") + ylab("Mean Score") + guides(fill = FALSE)

Ratings by age:

summary(movies_df$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00   19.50   30.50   31.04   39.50   65.00

Histogram for ratings’s age:

hist(movies_df$age, xlab = "Age", main = "Histogram for distribution of Age", breaks = 20)

Data607 HW2 V2.0

Ritesh Lohiya

February 10, 2018

Read the DB password for from the file

install.packages(‘RMySQL’,type=‘source’)