Data 110 Homework Week 8

Author

Leika Joseph

Dslabs packages

Load my libraries

# Load my libraries and list the datasets in the dslabs package
library("tidyverse")
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#install.packages("dslabs")
library("dslabs")
data(package = "dslabs")

Getting information about the movielens dataset

#?movielens

For this assignment I’m going to work with the movielens dataset from the dslabs package. I will work with three variables: genres, year, and rating.

head(movielens)
  movieId                                   title year
1      31                         Dangerous Minds 1995
2    1029                                   Dumbo 1941
3    1061                                Sleepers 1996
4    1129                    Escape from New York 1981
5    1172 Cinema Paradiso (Nuovo cinema Paradiso) 1989
6    1263                        Deer Hunter, The 1978
                            genres userId rating  timestamp
1                            Drama      1    2.5 1260759144
2 Animation|Children|Drama|Musical      1    3.0 1260759179
3                         Thriller      1    3.0 1260759182
4 Action|Adventure|Sci-Fi|Thriller      1    2.0 1260759185
5                            Drama      1    4.0 1260759205
6                        Drama|War      1    2.0 1260759151

Check my dataset for missing values

#is.na(movielens)

My dataset has no missing values, so now let’s do our analysis.
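Since is.na() on a whole data frame prints one TRUE/FALSE per cell, a per-column count is easier to read. Here is a minimal sketch of that idea on a small made-up data frame; the `toy` values and the helper name `count_missing` are assumptions for illustration, not part of movielens or dslabs.

```r
# Toy data frame standing in for movielens (values are made up for illustration)
toy <- data.frame(
  year   = c(1995L, NA, 1981L),
  genres = c("Drama", "Comedy", NA),
  rating = c(2.5, 3.0, 4.0)
)

# Hypothetical helper: number of NA values in each column
count_missing <- function(df) colSums(is.na(df))

missing_counts <- count_missing(toy)
print(missing_counts)
```

With dslabs loaded, the same call would be count_missing(movielens).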

Scatterplot of genres vs year by rating

`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_point()`).

This plot is somewhat overwhelming, and there is too much data in it. The only thing I can tell from it is that there have been more ratings in recent years than in past years. We can filter the data and create another plot.

I’m going to consider 4 genres and filter for years between 1925 and 2010

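movielens2 <- movielens |>
  # Keep rows whose genres value is exactly one of the four chosen genres
  filter(genres %in% c("Drama", "Comedy", "Romance", "Action")) |>
  # Restrict to movies released from 1925 through 2010
  filter(year >= 1925, year <= 2010) |>
  group_by(year, genres) |>
  # Average rating for each year and genre
  summarise(avrg_rating = mean(rating, na.rm = TRUE))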
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
  # In this chunk I choose my 3 variables and keep 4 genres from the genres column. Then I filter so my results fall between 1925 and 2010.
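
The message above appears because summarise() drops only the last grouping level, so the result stays grouped by year. Here is a minimal sketch of making that choice explicit with `.groups = "drop"`. The `toy` data frame is made up for illustration and is not taken from movielens.

```r
library(dplyr)

# Toy ratings (made-up values, not movielens data)
toy <- data.frame(
  year   = c(1928L, 1928L, 1928L, 1929L),
  genres = c("Comedy", "Comedy", "Drama", "Drama"),
  rating = c(4, 5, 4, 2)
)

toy_avg <- toy |>
  group_by(year, genres) |>
  # .groups = "drop" returns an ungrouped tibble and silences the message
  summarise(avrg_rating = mean(rating, na.rm = TRUE), .groups = "drop")

print(toy_avg)
```

Dropping the groups would also remove the "# Groups:" line from the printed tibble. The averages themselves do not change.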

movielens2
# A tibble: 183 × 3
# Groups:   year [85]
    year genres avrg_rating
   <int> <fct>        <dbl>
 1  1925 Comedy        4   
 2  1926 Drama         3.5 
 3  1927 Comedy        2.67
 4  1928 Comedy        4.38
 5  1928 Drama         4.3 
 6  1929 Drama         2   
 7  1930 Drama         3.72
 8  1931 Comedy        3.67
 9  1931 Drama         4   
10  1932 Comedy        4.11
# ℹ 173 more rows
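
One thing to keep in mind: %in% only keeps rows whose genres value is exactly "Drama", "Comedy", "Romance", or "Action", so a movie labelled "Drama|War" is left out. A minimal sketch of the difference, using the six genres values shown in the head(movielens) output above; the grepl() alternative is only one possible way to match combined genres, not what this analysis does.

```r
# The six genres values from the head(movielens) output
genres <- c(
  "Drama",
  "Animation|Children|Drama|Musical",
  "Thriller",
  "Action|Adventure|Sci-Fi|Thriller",
  "Drama",
  "Drama|War"
)

# Exact match: only rows where the whole string is "Drama"
exact_drama <- sum(genres %in% "Drama")

# Substring match: any row whose genres string contains "Drama"
any_drama <- sum(grepl("Drama", genres, fixed = TRUE))

print(c(exact = exact_drama, contains = any_drama))
```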

For this plot I’m going to use geom_line and geom_point so we can see how the ratings move over the years.

1925-2010 Average Movie Genres Ratings by Years

ggplot(movielens2, aes(x = year, y = avrg_rating, color = genres)) +
  geom_line() +
  geom_point() +
  ylim(1,5) +
  xlab("Years") +
  ylab("Ratings") +
  ggtitle("1925-2010 Average Movie Genres Ratings") + 
  labs(color = "Genres") + 
  theme_minimal()

In this plot it looks like Drama has become the genre with the highest rating.

Genres vs Year by avrg_rating

For this assignment I chose the movielens dataset, which has the variables userId, genres, year, movieId, title, rating, and timestamp. I used this dataset to see how movie ratings have changed over the years. For my chart I used geom_line and geom_point to create a visualization of genres vs year by avrg_rating. I chose four genres and limited the years to 1925-2010. In that plot I observe that Drama is generally the genre with the highest rating; it changes somewhat over the years but still remains the highest. The other genres have more variation in their ratings. Between 1960 and 2000, the changes in the ratings are especially noticeable.