Choose six recent popular movies. Ask at least five people that you know (friends, family, classmates, imaginary friends) to rate each of these movie that they have seen on a scale of 1 to 5. Take the results (observations) and store them in a SQL database. Load the information into an R dataframe.
Data has been loaded to the MYSQL Workbench using the below sql query placed at GitHub āhttps://github.com/Riteshlohiya/Data607-Week-2-Assignment-Movies/blob/master/movies.sqlā
I have prepared 2 tables demo(personal information) and ratings(rating of the movies)
password = read.table("C:/ProgramData/MySQL/MySQL Server 5.7/Uploads/flights/pass.txt")
## Warning in read.table("C:/ProgramData/MySQL/MySQL Server 5.7/Uploads/
## flights/pass.txt"): incomplete final line found by readTableHeader on 'C:/
## ProgramData/MySQL/MySQL Server 5.7/Uploads/flights/pass.txt'
library(RMySQL)
## Loading required package: DBI
library(DBI)
library(ggplot2)
Connection to the MySQL:
moviesdb = dbConnect(MySQL(), user='root', password=toString(password[1,1]), dbname='movies', host='localhost')
List the fields in demo table:
dbListFields(moviesdb, 'demo')
## [1] "name" "name_id" "state" "age"
List the fields in ratings table:
dbListFields(moviesdb, 'ratings')
## [1] "name_id" "movie_name" "rating"
Now getting the frequency of the ratings:
freq_df<-dbGetQuery(moviesdb,'select movie_name, rating, count(rating) as count
from ratings
group by movie_name,rating
order by movie_name, rating;')
View the frequency distribution:
freq_df
## movie_name rating count
## 1 Baby Driver 1 5
## 2 Baby Driver 2 5
## 3 Baby Driver 3 9
## 4 Baby Driver 4 1
## 5 Baby Driver 5 4
## 6 Blade Runner 2049 1 4
## 7 Blade Runner 2049 2 7
## 8 Blade Runner 2049 3 4
## 9 Blade Runner 2049 4 7
## 10 Blade Runner 2049 5 2
## 11 Dunkirk 1 4
## 12 Dunkirk 2 1
## 13 Dunkirk 3 4
## 14 Dunkirk 4 6
## 15 Dunkirk 5 9
## 16 Lion 1 2
## 17 Lion 2 4
## 18 Lion 3 4
## 19 Lion 4 8
## 20 Lion 5 6
## 21 Wonder 1 9
## 22 Wonder 2 9
## 23 Wonder 3 3
## 24 Wonder 4 2
## 25 Wonder 5 1
Joining the 2 tables and prepare the dataframes:
movies_df<-dbGetQuery(moviesdb, 'select
a.name,
a.name_id,
a.state,
a.age,
b.movie_name,
b.rating
from demo a
left join ratings b
on a.name_id=b.name_id
order by b.rating
;')
Contents of the dataframe
head(movies_df)
## name name_id state age movie_name rating
## 1 Curt Russel A2 AZ 33 Dunkirk 1
## 2 Brady A11 NC 11 Wonder 1
## 3 Keshava A4 AZ 31 Wonder 1
## 4 Venugopal A20 FL 50 Wonder 1
## 5 Pradeep A12 NC 10 Baby Driver 1
## 6 Johnson A9 DE 18 Blade Runner 2049 1
To check if there are any missing values for the ratings:
Contents of the dataframe:
colSums(is.na(movies_df))
## name name_id state age movie_name rating
## 0 0 0 0 0 0
So no missing values.
Frequency by table:
table(movies_df$movie_name, movies_df$rating)
##
## 1 2 3 4 5
## Baby Driver 5 5 9 1 4
## Blade Runner 2049 4 7 4 7 2
## Dunkirk 4 1 4 6 9
## Lion 2 4 4 8 6
## Wonder 9 9 3 2 1
Riteshās rating view:
Ritesh_Rating <- subset(movies_df, name_id == "A1", select = c(movie_name,
rating))
ggplot(Ritesh_Rating, aes(x = movie_name, y = rating, fill = movie_name)) +
geom_bar(stat = "identity") + ggtitle("Ritesh Reviews") + xlab("Movie_Name") +
ylab("Rating") + guides(fill = FALSE) + theme(axis.text.x = element_text(angle = 45,
hjust = 1))
Average movie scores:
avgscore <- aggregate(movies_df$rating ~ movies_df$movie_name,
movies_df, mean)
colnames(avgscore) <- c("movie_name", "mean_score")
ggplot(avgscore, aes(x = movie_name, y = mean_score, fill = movie_name)) +
geom_bar(stat = "identity") + ggtitle("Average Scores by movie_name") +
xlab("Name") + ylab("Mean Score") + guides(fill = FALSE)
Ratings by age:
summary(movies_df$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 19.50 30.50 31.04 39.50 65.00
Histogram for ratingsās age:
hist(movies_df$age, xlab = "Age", main = "Histogram for distribution of Age", breaks = 20)