library(dslabs)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
data("movielens")
head(movielens)
## movieId title year
## 1 31 Dangerous Minds 1995
## 2 1029 Dumbo 1941
## 3 1061 Sleepers 1996
## 4 1129 Escape from New York 1981
## 5 1172 Cinema Paradiso (Nuovo cinema Paradiso) 1989
## 6 1263 Deer Hunter, The 1978
## genres userId rating timestamp
## 1 Drama 1 2.5 1260759144
## 2 Animation|Children|Drama|Musical 1 3.0 1260759179
## 3 Thriller 1 3.0 1260759182
## 4 Action|Adventure|Sci-Fi|Thriller 1 2.0 1260759185
## 5 Drama 1 4.0 1260759205
## 6 Drama|War 1 2.0 1260759151
str(movielens)
## 'data.frame': 100004 obs. of 7 variables:
## $ movieId : int 31 1029 1061 1129 1172 1263 1287 1293 1339 1343 ...
## $ title : chr "Dangerous Minds" "Dumbo" "Sleepers" "Escape from New York" ...
## $ year : int 1995 1941 1996 1981 1989 1978 1959 1982 1992 1991 ...
## $ genres : Factor w/ 901 levels "(no genres listed)",..: 762 510 899 120 762 836 81 762 844 899 ...
## $ userId : int 1 1 1 1 1 1 1 1 1 1 ...
## $ rating : num 2.5 3 3 2 4 2 2 2 3.5 2 ...
## $ timestamp: int 1260759144 1260759179 1260759182 1260759185 1260759205 1260759151 1260759187 1260759148 1260759125 1260759131 ...
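# Histogram of release year: each bar counts the ratings given to movies from those years.
# The fill = rating mapping was dropped because the constant fill = "black" overrode it.
ggplot(movielens, aes(x = year)) +
  geom_histogram(color = "red", fill = "black") +
  labs(title = "Number of Ratings by Movie Release Year", x = "Year", y = "# of Ratings", caption = "Source: movielens (DSLABS)") +
  theme_minimal()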
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## Warning: Removed 7 rows containing non-finite outside the scale range
## (`stat_bin()`).
From the dslabs library, I chose the “movielens” dataset and thought it would be interesting to look at the year and rating variables together. Each row in this dataset is one user's rating of one movie, along with the movie's title, release year, and genres. At first, I was going to create a scatter plot of these two variables, but I did not like how it looked. I then made this histogram instead. It shows how many ratings in the dataset belong to movies released in each year; it does not display the rating values themselves, which range from 0.5 to 5. The distribution is left-skewed: there is a long tail of older movies with few ratings, and most ratings are for movies released in recent decades. This suggests that newer movies were rated far more often than older ones, possibly because viewers had easier access to them and to ways of rating them. The histogram alone cannot confirm that explanation.
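The two messages printed under the plot (the default `bins = 30` and the 7 removed rows) can be handled explicitly. The sketch below is one possible way to do that, not the only one: it drops the rows with a missing `year` before plotting and picks a bin width by hand. The 5-year `binwidth` and the names `ratings_with_year`, `ratings_per_year`, and `p` are my own assumptions for illustration.

```r
library(dslabs)
library(dplyr)
library(ggplot2)

data("movielens")

# Keep only rows with a known release year, so ggplot2 has nothing to drop.
ratings_with_year <- movielens %>%
  filter(!is.na(year))

# Count of ratings per release year, as a table to read alongside the plot.
ratings_per_year <- ratings_with_year %>%
  count(year, name = "n_ratings")

# Same histogram, with an explicit bin width (5 years is an assumed choice).
p <- ggplot(ratings_with_year, aes(x = year)) +
  geom_histogram(binwidth = 5, color = "red", fill = "black") +
  labs(title = "Number of Ratings by Movie Release Year",
       x = "Year", y = "# of Ratings",
       caption = "Source: movielens (DSLABS)") +
  theme_minimal()

print(p)
```

Setting `binwidth` in years makes each bar easier to interpret than the default 30 bins, and `ratings_per_year` gives the exact counts behind the skew described above.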