In this assignment, I have chosen six of the most popular Hollywood movies, and I have asked seven of my coworkers to rate each of the movies they have watched on a scale from 1 (being not good) to 5 (being great). From the responses of my survey, I have created a SQL database. Here, I will first load these data into an R dataframe. For our purposes, we will be comparing which movies has higher average ratings than others.
library(RMySQL)
## Loading required package: DBI
library(DBI)
library(ggplot2)
con = dbConnect (MySQL(), user = 'root', password = getOption('database_password'), dbname = 'movie_ratings', Host = 'localhost')
dbListTables(con)
## [1] "Audience" "Movies" "Ratings"
As you can see, 3 different tables have been created with information regarding audience, movies and ratings.
Q1 <- "select * from audience;"
Q2 <- "select * from movies;"
Q3 <- "select * from ratings;"
audience <- dbGetQuery(con, Q1)
audience
## Nameid Firstname Lastname
## 1 1 Mario P
## 2 2 Teresa C
## 3 3 Tyleah C
## 4 4 Marisol L
## 5 5 Leslie B
## 6 6 Elizabeth F
## 7 7 Denzil R
movies <- dbGetQuery(con, Q2)
movies
## Movieid Title Genre
## 1 1 Jurassic Park Science Fiction
## 2 2 Avatar Sicence Fiction
## 3 3 Titanic Drama
## 4 4 The Pursuit of Happyness Drama
## 5 5 The Lion King (Oreginal) Family
## 6 6 The Dark Knight Action
ratings <- dbGetQuery(con, Q3)
ratings
## Ratingid Nameid Movieid Rating
## 1 1 1 1 5
## 2 2 1 2 5
## 3 3 1 3 4
## 4 4 1 4 4
## 5 5 1 5 3
## 6 6 1 6 5
## 7 7 2 1 2
## 8 8 2 2 NA
## 9 9 2 3 4
## 10 10 2 4 5
## 11 11 2 5 5
## 12 12 2 6 3
## 13 13 3 1 4
## 14 14 3 2 5
## 15 15 3 3 5
## 16 16 3 4 5
## 17 17 3 5 3
## 18 18 3 6 1
## 19 19 4 1 4
## 20 20 4 2 5
## 21 21 4 3 4
## 22 22 4 4 5
## 23 23 4 5 5
## 24 24 4 6 4
## 25 25 5 1 3
## 26 26 5 2 4
## 27 27 5 3 4
## 28 28 5 4 5
## 29 29 5 5 5
## 30 30 5 6 NA
## 31 31 6 1 5
## 32 32 6 2 3
## 33 33 6 3 4
## 34 34 6 4 NA
## 35 35 6 5 5
## 36 36 6 6 5
## 37 37 7 1 5
## 38 38 7 2 5
## 39 39 7 3 4
## 40 40 7 4 5
## 41 41 7 5 4
## 42 42 7 6 5
movie_ratings <- merge(movies, ratings, by.x="Movieid")
movie_ratings2 <- subset(movie_ratings, select=c(Title, Genre, Rating))
movie_ratings2
## Title Genre Rating
## 1 Jurassic Park Science Fiction 5
## 2 Jurassic Park Science Fiction 4
## 3 Jurassic Park Science Fiction 2
## 4 Jurassic Park Science Fiction 3
## 5 Jurassic Park Science Fiction 5
## 6 Jurassic Park Science Fiction 5
## 7 Jurassic Park Science Fiction 4
## 8 Avatar Sicence Fiction 5
## 9 Avatar Sicence Fiction NA
## 10 Avatar Sicence Fiction 5
## 11 Avatar Sicence Fiction 3
## 12 Avatar Sicence Fiction 5
## 13 Avatar Sicence Fiction 5
## 14 Avatar Sicence Fiction 4
## 15 Titanic Drama 4
## 16 Titanic Drama 4
## 17 Titanic Drama 5
## 18 Titanic Drama 4
## 19 Titanic Drama 4
## 20 Titanic Drama 4
## 21 Titanic Drama 4
## 22 The Pursuit of Happyness Drama 4
## 23 The Pursuit of Happyness Drama 5
## 24 The Pursuit of Happyness Drama 5
## 25 The Pursuit of Happyness Drama 5
## 26 The Pursuit of Happyness Drama 5
## 27 The Pursuit of Happyness Drama NA
## 28 The Pursuit of Happyness Drama 5
## 29 The Lion King (Oreginal) Family 3
## 30 The Lion King (Oreginal) Family 3
## 31 The Lion King (Oreginal) Family 5
## 32 The Lion King (Oreginal) Family 5
## 33 The Lion King (Oreginal) Family 5
## 34 The Lion King (Oreginal) Family 5
## 35 The Lion King (Oreginal) Family 4
## 36 The Dark Knight Action NA
## 37 The Dark Knight Action 5
## 38 The Dark Knight Action 3
## 39 The Dark Knight Action 1
## 40 The Dark Knight Action 5
## 41 The Dark Knight Action 4
## 42 The Dark Knight Action 5
summary(movie_ratings2)
## Title Genre Rating
## Length:42 Length:42 Min. :1.000
## Class :character Class :character 1st Qu.:4.000
## Mode :character Mode :character Median :5.000
## Mean :4.256
## 3rd Qu.:5.000
## Max. :5.000
## NA's :3
In summary above we can see that the average rating for all the movies in this database is 4.256. This tells us that these movies were well received by the audience who watched them.
avg_movie_rating <- aggregate(x=movie_ratings2["Rating"], by = list(movie=movie_ratings2$Title), FUN = mean, , na.rm=TRUE)
avg_movie_rating
## movie Rating
## 1 Avatar 4.500000
## 2 Jurassic Park 4.000000
## 3 The Dark Knight 3.833333
## 4 The Lion King (Oreginal) 4.285714
## 5 The Pursuit of Happyness 4.833333
## 6 Titanic 4.142857
From the table above, we can see that “The Pursuit of Happyness” has the highest average rating in this list, while “The Dark Knight” has the lowest average rating.
# Draw plot
theme_set(theme_bw())
ggplot(avg_movie_rating, aes(x= reorder(movie, -Rating), y=Rating)) +
geom_bar(stat="identity", width=.5, fill="tomato3") +
labs(title="Movie Ratings",
subtitle="Movie vs Avg. Rating") +
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
scale_x_discrete(name="Movies")
Based on my survey analysis, and the data visualization, The Pursuit of Happyness has the highest average rating of 4.8 compared to The Dark Knight which received a 3.8 out of 5. In the future, what I would be interested in exploring is the profit margins of each movie and compare them to my survey to analyze the relationship between them. To do so, I would research online and go through each movies budget amount vs. the profit as well as compared to my survey vs general popularity. ```