Choose six recent popular movies. Ask at least five people that you know (friends, family, classmates, imaginary friends) to rate each of these movie that they have seen on a scale of 1 to 5. Take the results (observations) and store them in a SQL database. Load the information into an R dataframe. Your deliverables should include your SQL scripts and your R Markdown code, posted to GitHub.
library(RMySQL)
library(sqldf)
library(ggplot2)
library(dplyr)
library(plyr)
For this project, I created a google form and posted it on social media for people to fill out. I did a bit of manipulation on the final output so that each record in the dataset represents 1 movie rating for 1 person. The output of the RATING is text, so I will use a case statement in my query to convert this field to a numeric data type. Let’s load it in now:
mydb = dbConnect(MySQL(), user='root', password='password', dbname='dbName', host='localhost')
movieRatings <- dbGetQuery(mydb, "
SELECT
USER_ID,
AGE,
GENDER,
MOVIE,
CASE WHEN RATING LIKE '5%' THEN 5
WHEN RATING LIKE '4%' THEN 4
WHEN RATING LIKE '3%' THEN 3
WHEN RATING LIKE '2%' THEN 2
WHEN RATING LIKE '1%' THEN 1
WHEN RATING LIKE 'Not%' THEN 0
ELSE 0 END AS RATING
FROM hw.DATA607_01_MOVIES")
Now that we have the data loaded, let’s take a look at it.
dim(movieRatings)
## [1] 504 5
colnames(movieRatings)
## [1] "USER_ID" "AGE" "GENDER" "MOVIE" "RATING"
str(movieRatings)
## 'data.frame': 504 obs. of 5 variables:
## $ USER_ID: int 1 2 3 4 5 6 7 8 9 10 ...
## $ AGE : int NA NA NA NA NA NA NA NA NA NA ...
## $ GENDER : chr NA NA NA NA ...
## $ MOVIE : chr "The Lion King" "The Lion King" "The Lion King" "The Lion King" ...
## $ RATING : num 4 0 0 5 3 5 0 4 4 4 ...
We have 504 records in the data set with a total of 5 columns.
Note that there are also nulls in the Age and Gender columns – some of these are from individuals not answering while others are a result of how I collected the data; I originally posted the survey without the Age and Gender questions and added them later on.
Now that we’ve taken a look at the structure, let’s dig a little deeper into the actual data. First, let’s look at the demographics of the people responding to the questions. In order to do so, we need to get a distinct list of individuals with their age and gender. We’ll also go ahead and change the null values in our data set to Unknown so that we can take them into account as well.
demoInfo <- unique(movieRatings[,1:3])
demoInfo[is.na(demoInfo)] <- "UNKNOWN"
demoPlot <- barplot(table(demoInfo$GENDER), main = "Survey Participants by Gender",
xlab= "Gender", col = c("pink", "lightblue", "lightyellow"),
legend = rownames(table(demoInfo$GENDER)), ylim=c(0,45))
text(x=demoPlot, y= table(demoInfo$GENDER)+2, labels=as.character(table(demoInfo$GENDER)))
For known survey participants, about 58% were female, and 42% were male.
Now let’s take a look at the age distribution. We will remove the unknowns from this graph
barplot(table(demoInfo$AGE[which(demoInfo$AGE != "UNKNOWN")]), main = "Number of Survey Participants by Age", xlab= "Age",ylim=c(0,9))
When we exclude the unknown ages, we can see that the majority of the survey participants fall between the ages of 23 and 30, with the most frequent age being 27. Now let’s switch gears a bit and focus on the movie ratings. First, let’s look at the average rating per movie, excluding the records that weren’t seen.
movieSubset <- movieRatings[which(movieRatings$RATING !=0),]
avgRatings <- as.data.frame(aggregate(movieSubset$RATING, list(movieSubset$MOVIE), FUN=mean))
colnames(avgRatings) <- c("MOVIE", "AVG_RATING")
avgRatings$AVG_RATING <- round(avgRatings$AVG_RATING,2)
avgRatings$MOVIE <- with(avgRatings, reorder(MOVIE, -AVG_RATING))
g <- ggplot(avgRatings,
aes(x = MOVIE, y = AVG_RATING, color = MOVIE, fill = MOVIE, label = AVG_RATING)) +
geom_col() +
geom_text(nudge_y = 0.5) +
ggtitle("Average Rating by Movie") +
xlab("Movie") + ylab("Average Movie Rating") +
theme_bw() + theme(legend.position = "none") +
theme(plot.title = element_text(hjust = 0.5))
g + theme(axis.text.x = element_text(angle = 90, hjust = 1))
Looks like Avengers had the highest movie rating from people that had seen it! What about the gender split for each of these movies?
movieSubset2 <- movieSubset[which(movieSubset$GENDER=='Female'),]
movieSubset3 <- movieSubset[which(movieSubset$GENDER=='Male'),]
femaleRatings <- as.data.frame(aggregate(movieSubset2$RATING, list(movieSubset2$MOVIE), FUN=mean))
colnames(femaleRatings) <- c("MOVIE", "AVG_RATING")
femaleRatings$GENDER <- 'Female'
maleRatings <- as.data.frame(aggregate(movieSubset3$RATING, list(movieSubset3$MOVIE), FUN=mean))
colnames(maleRatings) <- c("MOVIE", "AVG_RATING")
maleRatings$GENDER <- 'Male'
finalRatings <- rbind(femaleRatings, maleRatings)
finalRatings$AVG_RATING <- round(finalRatings$AVG_RATING,2)
finalRatings$MOVIE <- with(finalRatings, reorder(MOVIE, -AVG_RATING))
h <- ggplot(finalRatings,
aes(x = MOVIE, y = AVG_RATING, fill = GENDER, label = AVG_RATING)) +
geom_bar(stat="identity", position = "dodge") +
ggtitle("Average Rating by Movie") +
xlab("Movie") + ylab("Average Movie Rating") +
theme_bw() + theme(legend.position = "none") +
theme(plot.title = element_text(hjust = 0.5))
h + theme(axis.text.x = element_text(angle = 90, hjust = 1))
We can see that on average, females tended to rank the movies a bit higher than males (with the exception of Avengers).
Lastly, let’s take a look at the movies that people did not see.
notSeen <- movieRatings[which(movieRatings$RATING== 0),]
notSeen2 <- as.data.frame(table(notSeen$MOVIE, notSeen$GENDER))
colnames(notSeen2) <- c('MOVIE', 'GENDER', 'FREQUENCY')
notSeen2$MOVIE <- with(notSeen2, reorder(MOVIE, -FREQUENCY))
i <- ggplot(notSeen2,
aes(x = MOVIE, y = FREQUENCY, fill = GENDER, label = FREQUENCY)) +
geom_bar(stat="identity", position = "dodge") +
ggtitle("Number of Indivdiduals that have not seen each Movie") +
xlab("Movie") + ylab("Number of Individuals") +
theme_bw() + theme(legend.position = "none") +
theme(plot.title = element_text(hjust = 0.5))
i + theme(axis.text.x = element_text(angle = 90, hjust = 1))
We can see that the top 2 movies that people haven’t seen are Isn’t It Romantic and Fighting with my Family. I would expect more females than males to have not seen these movies simply due to the fact that there are more females than males in the sample, but it is curious to note that despite this, more males than females had not seen Aladdin and The Lion King (both Disney movies).
Altogether, this data shows us a few of things: