Data 110 “dslabs” homework assignment using the “movielens” dataset.
Variable details:
movieId: Unique ID for the movie.
title: Movie title (not unique).
year: Year the movie was released.
genres: Genres associated with the movie.
userId: Unique ID for the user.
rating: A rating between 0 and 5 for the movie.
timestamp: Date and time the rating was given.
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872
# install.packages("dslabs")
library("dslabs")
## Warning: package 'dslabs' was built under R version 4.0.5
## add dplyr library to get the filter code
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## filter for Horror, Comedy and Drama
threegenres <- movielens %>%
  filter(genres %in% c("Horror", "Drama", "Comedy"))
## select(genres, year, rating)
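Filter for three genres: Horror, Drama, and Comedy. The genres column stores combined labels separated by “|” (for example “Comedy|Drama”), so %in% keeps only movies tagged with exactly one of these three genres.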
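Because %in% compares whole values, the result depends on how the genres column stores its labels. The following is a minimal sketch on made-up rows, not the assignment's data: the `toy` data frame, its titles, and the `grepl()` alternative are assumptions added for illustration.

```r
# Toy data (hypothetical rows) mimicking a pipe-delimited genres column.
toy <- data.frame(
  title  = c("A", "B", "C", "D"),
  genres = c("Comedy", "Comedy|Drama", "Horror", "Action|Horror"),
  stringsAsFactors = FALSE
)

# %in% compares whole values, so only the single-genre rows match.
exact <- toy[toy$genres %in% c("Horror", "Drama", "Comedy"), ]

# A pattern match also picks up combined entries that mention a target genre.
partial <- toy[grepl("Horror|Drama|Comedy", toy$genres), ]

exact$title    # "A" "C"
partial$title  # "A" "B" "C" "D"
```

The exact match keeps each movie in a single genre group, which suits a per-genre average. A pattern match would keep more rows, but each multi-genre movie would then need to be assigned to one or more groups.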
## remove unused columns
threegenres <- threegenres %>%
  select(genres, year, rating)
Select the three columns genres, year, and rating, and drop the rest. Next, aggregate the remaining data to get the average rating of each genre for each release year, so the trend can be shown over multiple decades.
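## aggregate rating for each year per genre
## pass only FUN = mean: any extra unnamed argument is forwarded to mean() as
## `trim`, and a trim of 0.5 or more returns the median instead of the mean
threegenresavg <- aggregate(x = threegenres$rating,
                            by = list(threegenres$genres, threegenres$year),
                            FUN = mean)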
head(threegenresavg)
## Group.1 Group.2 x
## 1 Comedy 1917 4.25
## 2 Comedy 1918 4.25
## 3 Comedy 1921 4.50
## 4 Comedy 1922 4.25
## 5 Horror 1922 4.00
## 6 Comedy 1923 4.25
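As a small illustration of grouped averaging with aggregate(), the sketch below uses made-up ratings; the `toy` data frame and the formula interface are assumptions for the example, not the call used in this assignment. It also shows why extra arguments after FUN need care: they are forwarded to FUN, and for mean() the first one is treated as `trim`.

```r
# Toy ratings (hypothetical values) for two genres and two years.
toy <- data.frame(
  genres = c("Comedy", "Comedy", "Comedy", "Drama", "Drama"),
  year   = c(1990, 1990, 1990, 1990, 1991),
  rating = c(1, 1, 4, 3, 5)
)

# Formula interface: one mean per genre/year pair, with readable column names.
avg <- aggregate(rating ~ genres + year, data = toy, FUN = mean)

# An extra unnamed argument is forwarded to FUN; for mean() it lands in `trim`.
# A trim of 0.5 or more makes mean() return the median.
med <- aggregate(rating ~ genres + year, data = toy, FUN = mean, 4)

comedy_mean   <- avg$rating[avg$genres == "Comedy" & avg$year == 1990]  # (1 + 1 + 4) / 3 = 2
comedy_median <- med$rating[med$genres == "Comedy" & med$year == 1990]  # median of 1, 1, 4 = 1
```

The formula interface keeps the original column names (genres, year, rating), so no renaming step is needed afterwards.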
##renaming columns
colnames(threegenresavg) <- c("Genre","Year", "Rating")
I want to round the ratings to two decimal places.
## rounding rating to two decimals
threegenresavg$Rating <- round(threegenresavg$Rating, digits = 2)
threegenresavg$Year <- round(threegenresavg$Year, digits = 0)
head(threegenresavg)
## Genre Year Rating
## 1 Comedy 1917 4.25
## 2 Comedy 1918 4.25
## 3 Comedy 1921 4.50
## 4 Comedy 1922 4.25
## 5 Horror 1922 4.00
## 6 Comedy 1923 4.25
Make a graph using ggplot2 with a LOESS smoother.
##bring in ggplot
library(ggplot2)
threegenresavg %>% ggplot(aes(Year, Rating, color = Genre)) +
  geom_point(show.legend = FALSE, alpha = 0.4) +
  geom_smooth(method = "loess", span = 0.15) +
  scale_y_continuous(limits = c(0, 6)) +
  scale_x_continuous(limits = c(1922, 2016))
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).
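To see what the span argument controls, here is a minimal sketch using base R's loess(), the function behind geom_smooth(method = "loess"). The synthetic sine data, the noise level, and the variable names are assumptions for illustration only.

```r
set.seed(1)  # reproducible synthetic data (an assumption for illustration)
x <- seq(0, 4 * pi, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.2)

# span is the fraction of points used in each local fit.
fit_tight  <- loess(y ~ x, span = 0.15)  # follows local wiggles closely
fit_smooth <- loess(y ~ x, span = 0.75)  # loess() default, much smoother

# Residual sum of squares: the small-span fit tracks the data more closely.
rss_tight  <- sum(residuals(fit_tight)^2)
rss_smooth <- sum(residuals(fit_smooth)^2)
c(tight = rss_tight, smooth = rss_smooth)
```

A small span such as 0.15 lets the curve react to year-to-year changes, at the cost of being more sensitive to years that have only a few ratings.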
Drama doesn’t start until 1924, which causes a gap at the beginning, and there’s a gap in Horror from 1945 to 1956. To compensate, I’m going to adjust the graph to start at 1957. I’m also changing the line colors and the theme, and removing the grey confidence band.
library(withr)
## Warning: package 'withr' was built under R version 4.0.5
# set the x-axis to start at 1957, change the line colors, and add the minimal theme
p <- threegenresavg %>% ggplot(aes(Year, Rating, color = Genre)) +
  labs(title = "AVG RATING FOR MOVIE GENRES \n OVER THE YEARS",
       caption = "ACM Digital Library") +
  geom_point(show.legend = TRUE, alpha = 0.2) +
  geom_smooth(method = "loess", span = 0.15, se = FALSE) +
  scale_y_continuous(limits = c(1, 5)) +
  # limits takes exactly two values (min, max); tick spacing is set by breaks
  scale_x_continuous(limits = c(1957, 2016), breaks = seq(1955, 2015, 5)) +
  theme_minimal() +
  scale_color_manual(values = c("#009E73", "#9999CC", "#F0E442")) +
  ylab("Avg Rating Out of 5")
p
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 62 rows containing non-finite values (stat_smooth).
## Warning: Removed 62 rows containing missing values (geom_point).
Next, convert the plot into an interactive plotly widget that shows values on hover.
## bring in plotly
library(plotly)
## Warning: package 'plotly' was built under R version 4.0.5
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
ggplotly(p)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 62 rows containing non-finite values (stat_smooth).
Apart from brief periods around 1970-1973 and 1977-1979, and a tie with Horror in 2016, Drama has had the highest average rating of the three genres since 1957.