Getting information about the dataset make-movielens
#?movielens
For this assignment I’m going to work with the movielens dataset from the dslabs package. I will work with three variables: genres, year, and rating.
head(movielens)
  movieId                                   title year
1      31                         Dangerous Minds 1995
2    1029                                   Dumbo 1941
3    1061                                Sleepers 1996
4    1129                    Escape from New York 1981
5    1172 Cinema Paradiso (Nuovo cinema Paradiso) 1989
6    1263                        Deer Hunter, The 1978
                            genres userId rating  timestamp
1                            Drama      1    2.5 1260759144
2 Animation|Children|Drama|Musical      1    3.0 1260759179
3                         Thriller      1    3.0 1260759182
4 Action|Adventure|Sci-Fi|Thriller      1    2.0 1260759185
5                            Drama      1    4.0 1260759205
6                        Drama|War      1    2.0 1260759151
Check my dataset for missing values
#is.na(movielens)
My dataset has no missing values, so now let’s do our analysis.
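Printing is.na(movielens) returns one TRUE/FALSE value per cell, which is hard to scan for a dataset of this size. A minimal sketch of a more compact check is to count the missing values in each column. The helper name count_missing and the toy data frame below are assumptions for illustration only; they are not part of the original analysis.

```r
# Hypothetical helper (not from the original analysis):
# count the NA values in each column of a data frame.
count_missing <- function(df) {
  colSums(is.na(df))
}

# Toy data frame standing in for movielens so the sketch runs on its own.
toy <- data.frame(
  year   = c(1995L, NA, 1996L),
  genres = c("Drama", "Comedy", NA),
  rating = c(2.5, 3.0, 4.0)
)

missing_counts <- count_missing(toy)
print(missing_counts)

# TRUE only when no column contains a missing value.
no_missing <- all(missing_counts == 0)
print(no_missing)
```

With dslabs loaded, calling count_missing(movielens) would give the same kind of per-column summary for the real data, and any column with a count above zero would need attention before plotting.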
Scatterplot of genres vs year by rating
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_point()`).
This plot is somewhat overwhelming because it contains too much data. The only thing I can take from it is that there have been more ratings in recent years than in past years. We can filter the data and create another plot.
I’m going to consider 4 genres and filter for the years 1925 to 2010.
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# In this chunk I choose my 3 variables and 4 genres from the genres column, and I filter so the results fall between 1925 and 2010.
movielens2
# A tibble: 183 × 3
# Groups: year [85]
    year genres avrg_rating
   <int> <fct>        <dbl>
 1  1925 Comedy        4
 2  1926 Drama         3.5
 3  1927 Comedy        2.67
 4  1928 Comedy        4.38
 5  1928 Drama         4.3
 6  1929 Drama         2
 7  1930 Drama         3.72
 8  1931 Comedy        3.67
 9  1931 Drama         4
10  1932 Comedy        4.11
# ℹ 173 more rows
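The chunk that built movielens2 is not shown in the output above, so the following is only a sketch of how a table with this shape could be produced with dplyr. It runs on a small toy data frame rather than on movielens itself. The names toy_movies, toy_summary, and chosen_genres are assumptions, and only Comedy and Drama are listed because they are the only genres visible in the printed rows; the other two genres would need to be added.

```r
library(dplyr)

# Toy stand-in for movielens with the three columns used in the analysis.
toy_movies <- data.frame(
  year   = c(1920L, 1930L, 1930L, 1930L, 2000L, 2015L),
  genres = factor(c("Drama", "Drama", "Drama", "Comedy", "Horror", "Comedy")),
  rating = c(5, 3, 4, 2, 1, 4)
)

# Assumption: only the genres visible in the printed output are listed here;
# replace this with the four genres the analysis actually uses.
chosen_genres <- c("Comedy", "Drama")

toy_summary <- toy_movies %>%
  select(year, genres, rating) %>%                       # keep the 3 variables
  filter(genres %in% chosen_genres,                      # keep the chosen genres
         year >= 1925, year <= 2010) %>%                 # keep 1925 to 2010
  group_by(year, genres) %>%
  summarise(avrg_rating = mean(rating), .groups = "drop")  # mean rating per year and genre

print(toy_summary)
```

Setting .groups = "drop" returns an ungrouped table and silences the summarise() message; using .groups = "drop_last" instead would keep the result grouped by year, as in the output above.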
For this plot I’m going to use geom_line and geom_point so we can see how the ratings move over the years.
1925-2010 Average Movie Genres Ratings by Years
ggplot(movielens2, aes(x = year, y = avrg_rating, color = genres)) +
  geom_line() +
  geom_point() +
  ylim(1, 5) +
  xlab("Years") +
  ylab("Ratings") +
  ggtitle("1925-2010 Average Movie Genres Ratings") +
  labs(color = "Genres") +
  theme_minimal()
In this plot, it looks like Drama has become the genre with the highest average rating.
Genres vs Year by avrg_rating
For this assignment I chose the movielens dataset, which has variables such as userId, genres, year, movieId, title, rating, and timestamp. I used this dataset to see how movie ratings have changed over the years. For my chart I used geom_line and geom_point to create a visualization of genres vs. year by avrg_rating. I chose four genres and limited the years to 1925 through 2010. In that plot I observe that Drama is generally the genre with the highest average rating; it changes somewhat over the years but still remains the highest. The other genres show more variation in their ratings. Between 1960 and 2000, we can see that the changes in the ratings are very noticeable.