In this assignment I have chosen six of the most popular Hollywood movies, and I have asked 6 people to rate each of the movies they’ve seen on a scale from 1 (being terrible) to 5 (being great).
I have stored the observations in a SQL database, so I will first need to load this data into an R dataframe.
We’ll need to load the following libraries:
library(RMySQL)
## Loading required package: DBI
library(dbConnect)
## Loading required package: gWidgets
And connect to MySQL server:
con = dbConnect (MySQL(), user = 'root', password = getOption('database_password'), dbname = 'movie_ratings', Host = 'localhost')
Now I can see a list of the tables in my “movie_ratings” database:
dbListTables(con)
## [1] "audience" "movies" "ratings"
As you can see, I have created 3 different tables, each stores information regarding audience, movies and ratings.
I will create 3 queries within R using these tables:
myquery1 <- "select * from audience;"
myquery2 <- "select * from movies;"
myquery3 <- "select * from ratings;"
I will save them as dataframes in this next step and display their content:
audience <- dbGetQuery(con, myquery1)
audience
## nameid firstname lastname
## 1 1 Amel H
## 2 2 Dina H
## 3 3 Mario P
## 4 4 Ana J
## 5 5 Teresa C
## 6 6 Alba P
movies <- dbGetQuery(con, myquery2)
movies
## movieid title genre
## 1 1 The Lion King (Original) Family
## 2 2 The Dark Night Action
## 3 3 Titanic Drama
## 4 4 Forrest Gump Drama
## 5 5 Jaws Horror
## 6 6 E.T. Science Fiction
ratings <- dbGetQuery(con, myquery3)
ratings
## ratingid nameid movieid rating
## 1 1 1 1 5
## 2 2 1 2 5
## 3 3 1 3 5
## 4 4 1 4 NA
## 5 5 1 5 3
## 6 6 1 6 4
## 7 7 2 1 5
## 8 8 2 2 5
## 9 9 2 3 5
## 10 10 2 4 5
## 11 11 2 5 2
## 12 12 2 6 4
## 13 13 3 1 4
## 14 14 3 2 5
## 15 15 3 3 4
## 16 16 3 4 5
## 17 17 3 5 1
## 18 18 3 6 3
## 19 19 4 1 5
## 20 20 4 2 4
## 21 21 4 3 5
## 22 22 4 4 5
## 23 23 4 5 4
## 24 24 4 6 5
## 25 25 5 1 5
## 26 26 5 2 3
## 27 27 5 3 4
## 28 28 5 4 3
## 29 29 5 5 2
## 30 30 5 6 4
## 31 31 6 1 5
## 32 32 6 2 NA
## 33 33 6 3 5
## 34 34 6 4 NA
## 35 35 6 5 5
## 36 36 6 6 5
I will merge ‘movies’ and ‘ratings’ so we can have a clearer picture of how each movie was rated:
movie_ratings <- merge(movies, ratings, by.x="movieid")
movie_ratings
## movieid title genre ratingid nameid rating
## 1 1 The Lion King (Original) Family 1 1 5
## 2 1 The Lion King (Original) Family 19 4 5
## 3 1 The Lion King (Original) Family 7 2 5
## 4 1 The Lion King (Original) Family 25 5 5
## 5 1 The Lion King (Original) Family 31 6 5
## 6 1 The Lion King (Original) Family 13 3 4
## 7 2 The Dark Night Action 2 1 5
## 8 2 The Dark Night Action 32 6 NA
## 9 2 The Dark Night Action 8 2 5
## 10 2 The Dark Night Action 14 3 5
## 11 2 The Dark Night Action 20 4 4
## 12 2 The Dark Night Action 26 5 3
## 13 3 Titanic Drama 3 1 5
## 14 3 Titanic Drama 9 2 5
## 15 3 Titanic Drama 15 3 4
## 16 3 Titanic Drama 21 4 5
## 17 3 Titanic Drama 27 5 4
## 18 3 Titanic Drama 33 6 5
## 19 4 Forrest Gump Drama 4 1 NA
## 20 4 Forrest Gump Drama 10 2 5
## 21 4 Forrest Gump Drama 16 3 5
## 22 4 Forrest Gump Drama 22 4 5
## 23 4 Forrest Gump Drama 28 5 3
## 24 4 Forrest Gump Drama 34 6 NA
## 25 5 Jaws Horror 11 2 2
## 26 5 Jaws Horror 17 3 1
## 27 5 Jaws Horror 23 4 4
## 28 5 Jaws Horror 29 5 2
## 29 5 Jaws Horror 5 1 3
## 30 5 Jaws Horror 35 6 5
## 31 6 E.T. Science Fiction 24 4 5
## 32 6 E.T. Science Fiction 6 1 4
## 33 6 E.T. Science Fiction 12 2 4
## 34 6 E.T. Science Fiction 30 5 4
## 35 6 E.T. Science Fiction 18 3 3
## 36 6 E.T. Science Fiction 36 6 5
Let’s take a look at what the summmary of this new table looks like:
summary(movie_ratings)
## movieid title genre ratingid
## Min. :1.0 Length:36 Length:36 Min. : 1.00
## 1st Qu.:2.0 Class :character Class :character 1st Qu.: 9.75
## Median :3.5 Mode :character Mode :character Median :18.50
## Mean :3.5 Mean :18.50
## 3rd Qu.:5.0 3rd Qu.:27.25
## Max. :6.0 Max. :36.00
##
## nameid rating
## Min. :1.0 Min. :1.000
## 1st Qu.:2.0 1st Qu.:4.000
## Median :3.5 Median :5.000
## Mean :3.5 Mean :4.212
## 3rd Qu.:5.0 3rd Qu.:5.000
## Max. :6.0 Max. :5.000
## NA's :3
We can see that the average rating for all the movies in this database is 4.212. This tells us these movies were very well received by the audience who watched them.
Now let’s see what is the average movie rating per movie:
avg_movie_rating <- aggregate(x=movie_ratings["rating"], by = list(movie=movie_ratings$title), FUN = mean, na.rm=TRUE)
avg_movie_rating
## movie rating
## 1 E.T. 4.166667
## 2 Forrest Gump 4.500000
## 3 Jaws 2.833333
## 4 The Dark Night 4.400000
## 5 The Lion King (Original) 4.833333
## 6 Titanic 4.666667
We can see that “The Lion King (Original)” has the highest average rating in this list, while “Jaws” has the lowest average rating as also shown in the barplot below:
library(ggplot2)
ggplot(aes(x = movie, y = rating), data = avg_movie_rating) + geom_bar(stat = "identity")