Assignment 7

Author

R Josue

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dslabs)
data("movielens")
str(movielens)
'data.frame':   100004 obs. of  7 variables:
 $ movieId  : int  31 1029 1061 1129 1172 1263 1287 1293 1339 1343 ...
 $ title    : chr  "Dangerous Minds" "Dumbo" "Sleepers" "Escape from New York" ...
 $ year     : int  1995 1941 1996 1981 1989 1978 1959 1982 1992 1991 ...
 $ genres   : Factor w/ 901 levels "(no genres listed)",..: 762 510 899 120 762 836 81 762 844 899 ...
 $ userId   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ rating   : num  2.5 3 3 2 4 2 2 2 3.5 2 ...
 $ timestamp: int  1260759144 1260759179 1260759182 1260759185 1260759205 1260759151 1260759187 1260759148 1260759125 1260759131 ...
glimpse(movielens)
Rows: 100,004
Columns: 7
$ movieId   <int> 31, 1029, 1061, 1129, 1172, 1263, 1287, 1293, 1339, 1343, 13…
$ title     <chr> "Dangerous Minds", "Dumbo", "Sleepers", "Escape from New Yor…
$ year      <int> 1995, 1941, 1996, 1981, 1989, 1978, 1959, 1982, 1992, 1991, …
$ genres    <fct> Drama, Animation|Children|Drama|Musical, Thriller, Action|Ad…
$ userId    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ rating    <dbl> 2.5, 3.0, 3.0, 2.0, 4.0, 2.0, 2.0, 2.0, 3.5, 2.0, 2.5, 1.0, …
$ timestamp <int> 1260759144, 1260759179, 1260759182, 1260759185, 1260759205, …
summary(movielens)
    movieId             title             year     
 Min.   :     1   Length   :100004   Min.   :1902  
 1st Qu.:  1028   N.unique :  8831   1st Qu.:1987  
 Median :  2406   N.blank  :     0   Median :1995  
 Mean   : 12549   Min.nchar:     1   Mean   :1992  
 3rd Qu.:  5418   Max.nchar:   151   3rd Qu.:2001  
 Max.   :163949   NAs      :     7   Max.   :2016  
                                     NAs    :7     
                  genres          userId        rating        timestamp        
 Drama               : 7757   Min.   :  1   Min.   :0.500   Min.   :7.897e+08  
 Comedy              : 6748   1st Qu.:182   1st Qu.:3.000   1st Qu.:9.658e+08  
 Comedy|Romance      : 3973   Median :367   Median :4.000   Median :1.110e+09  
 Drama|Romance       : 3462   Mean   :347   Mean   :3.544   Mean   :1.130e+09  
 Comedy|Drama        : 3272   3rd Qu.:520   3rd Qu.:4.000   3rd Qu.:1.296e+09  
 Comedy|Drama|Romance: 3204   Max.   :671   Max.   :5.000   Max.   :1.477e+09  
 (Other)             :71588                                                    
movie_summary <- movielens %>%
  group_by(title, year, genres) %>%
  summarize(average_rating = mean(rating), number_of_ratings = n(), .groups = "drop")
movie_summary <- movie_summary %>% filter(number_of_ratings >= 50)
movie_plot <- ggplot(movie_summary,
  aes(
    x = year,
    y = average_rating,
    color = genres,
    size = number_of_ratings))
movie_plot <- movie_plot +
  geom_point(alpha = 0.7)
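# Only these six single-genre labels are named here. Any other level of
# `genres` (for example "Comedy|Romance") is not matched and falls back to
# the scale's na.value.
movie_plot <- movie_plot +
  scale_color_manual(values = c(
    "Action" = "firebrick",
    "Adventure" = "dodgerblue4",
    "Comedy" = "darkorange3",
    "Drama" = "purple4",
    "Romance" = "deeppink4",
    "Thriller" = "darkgreen"
  ))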
movie_plot <- movie_plot +
  labs(
    title = "The Average Movie Ratings Over Time",
    subtitle = "The Movies with at least 50 ratings, colored by genre",
    x = "The Year of the Movie Release",
    y = "The Average Movie Rating",
    color = "The Movie Genre",
    size = "The Number of Ratings",
    caption = "Source: DS Labs MovieLens Dataset")
movie_plot <- movie_plot +
  theme_minimal()
movie_plot

For this assignment, I used the “movielens” data set from the dslabs package. I built the graph by following the slides and tutorials from recent classes. I chose this data set because it is large (100,004 ratings of movies released between 1902 and 2016) and offers several variables to focus on. My focus was on genre: I computed the average rating of each movie, kept only the movies with at least 50 ratings, and colored the points by genre. Over the years, comedies and dramas appear to have earned the better ratings. The x axis shows the year each movie was released and the y axis shows its average rating. The legends explain the color (genre) and the point size (number of ratings).
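The `genres` column stores combined labels such as "Comedy|Romance", so a palette keyed on single genres only matches movies that have exactly one genre. One possible way to color every movie by a single genre is to keep only the first genre listed. The sketch below is an illustration, not part of the original assignment: `primary_genre()` is a hypothetical helper, and treating the first-listed genre as the "main" one is an assumption.

```r
# Hypothetical helper (not in the original code): keep only the first
# genre from a "|"-separated label. Labels without "|" are unchanged.
primary_genre <- function(genres) {
  sub("\\|.*$", "", as.character(genres))
}

# Toy labels in the same format as movielens$genres
toy_genres <- factor(c("Drama", "Comedy|Romance", "Action|Adventure|Thriller"))
primary <- primary_genre(toy_genres)
print(primary)
```

In the pipeline above, this could be applied before grouping, for example with `mutate(genres = primary_genre(genres))`. First-listed genres that are not in the palette (for example "Animation") would still need their own entries in `scale_color_manual()`.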