DS Labs Homework

Project Breakdown

I wanted to examine three movie genres (action, drama, and comedy) and determine how well each was rated, on average, from 2000 to 2010.

Loading libraries

library(tidyverse)
library(dslabs)
data(package="dslabs")

Loading in dataset

The dataset I am using is called “movielens”. It covers thousands of movies, and each row is one user’s rating of a movie, along with the movie’s title, release year, and genres. I will use this dataset for my visualization.

data("movielens")

Let’s take a look at all the movie years

First, I wanted to examine all the years that were included in this dataset.

unique(movielens$year)
##   [1] 1995 1941 1996 1981 1989 1978 1959 1982 1992 1991 1979 1971 1980 1988 1998
##  [16] 1986 1974 1994 1993 1990 1970 1987 1983 1997 1999 1984 2000 2002 2003 2004
##  [31] 2006 2008 2009 1977 1937 1940 1972 1958 1939 1950 1964 1951 1975 1960 1985
##  [46] 1962 1976 1942 1967 1955 1961 1953 1928 1973 1965 2001 2005 1957 1954 1968
##  [61] 1966 2007 2010 2011 2012 2013 1952 1963 1945 1946 1949 1948 1931 1969 1927
##  [76] 1933 1956 1944 1936 1925 1929 1935 2014 2015 2016 1922 1947 1926 1920 1938
##  [91] 1934 1930 1943 1921 1932 1924   NA 1915 1902 1923 1918 1917 1916 1919
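
The output above is unsorted and contains an NA, which makes the span of years hard to read at a glance. Below is a minimal sketch of how the years could be summarized. The `years` vector is a small made-up stand-in for `movielens$year`, not the real data.

```r
# Hypothetical stand-in for movielens$year: unsorted, with one missing value.
years <- c(1995, 1941, NA, 2016, 1902, 1995)

# sort() drops NA by default, so this gives the distinct known years in order.
sorted_years <- sort(unique(years))

# range() needs na.rm = TRUE; otherwise a single NA makes the result NA.
year_range <- range(years, na.rm = TRUE)

# Count how many entries have no year recorded.
n_missing <- sum(is.na(years))

print(sorted_years)
print(year_range)
print(n_missing)
```

Rows with a missing year are also dropped by any later `filter()` on `year`, because `filter()` only keeps rows where the condition is TRUE.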

Now it’s time to filter for the information needed

The first genre I wanted to examine was action. To do this, I created a new variable that holds all the action movies released between 2000 and 2010. I used the grepl function to keep any movie whose genres field contains the word “Action”. Lastly, I added a single_genre column and set it to “Action” for all of those movies.

action_movies <- movielens %>%
  filter(grepl('Action', genres)) %>%
  filter(year >= 2000, year <= 2010)
action_movies$single_genre <- "Action"
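
To make the grepl step concrete, here is a small sketch on made-up genre strings. The `genres` vector below is invented for illustration. It assumes a movie's genres are stored together in one string separated by `|`, which is why a pattern match is used instead of `==`.

```r
# Made-up genre strings in the "A|B" style; not taken from the real dataset.
genres <- c("Action|Adventure", "Comedy|Drama", "Drama", "Action|Comedy")

# grepl() returns TRUE wherever the pattern appears anywhere in the string.
# fixed = TRUE treats "Action" as plain text rather than a regular expression.
is_action <- grepl("Action", genres, fixed = TRUE)
is_comedy <- grepl("Comedy", genres, fixed = TRUE)

# A movie listed under several genres matches more than one filter.
both <- genres[is_action & is_comedy]

print(is_action)
print(both)
```

One consequence of this approach is that a movie tagged with more than one of the three genres is counted once in each matching group.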

Now let’s repeat those same steps to filter for the drama movies

drama_movies <- movielens %>%
  filter(grepl('Drama', genres)) %>%
  filter(year >= 2000, year <= 2010)
drama_movies$single_genre <- "Drama"

Same steps to filter for the comedy movies

comedy_movies <- movielens %>%
  filter(grepl('Comedy', genres)) %>%
  filter(year >= 2000, year <= 2010)
comedy_movies$single_genre <- "Comedy"
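
Since the three filters differ only in the genre name, they could be wrapped in one function. The sketch below is one possible way to do that, not the code I used above. `filter_genre` and the `toy` data frame are hypothetical names invented for illustration, and the helper assumes the data has `genres` and `year` columns.

```r
library(dplyr)

# Hypothetical helper (not in the original analysis): keep rows whose `genres`
# string mentions `genre` and whose `year` lies in [from, to], then tag them.
filter_genre <- function(data, genre, from = 2000, to = 2010) {
  data %>%
    filter(grepl(genre, genres, fixed = TRUE), year >= from, year <= to) %>%
    mutate(single_genre = genre)
}

# Made-up rows standing in for movielens.
toy <- tibble(
  title  = c("A", "B", "C", "D"),
  year   = c(1999, 2003, 2005, 2011),
  genres = c("Action", "Action|Comedy", "Drama", "Comedy"),
  rating = c(3.0, 4.0, 3.5, 2.5)
)

toy_action <- filter_genre(toy, "Action")
toy_comedy <- filter_genre(toy, "Comedy")

print(toy_action)
print(toy_comedy)
```

With the real data, the same call would look like `filter_genre(movielens, "Drama")`.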

Now we calculate

This was by far the hardest step and took a lot of trial and error. In this code I created a new variable, “genre_movies”, and used rbind to combine the three data frames: action_movies, comedy_movies, and drama_movies. Next, I piped that variable into a new variable, real_movies, where I grouped by year and by the single_genre column from my previous code. Lastly, I calculated the mean rating for each genre in each year.

genre_movies <- rbind(action_movies, comedy_movies, drama_movies)
real_movies <- genre_movies %>%
  group_by(year, single_genre) %>%
  summarize(avg_rating = mean(rating))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
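
To check my understanding of this step, here is the same combine, group, and average pattern on a few made-up rows where the means are easy to verify by hand. The `action` and `drama` tibbles are invented for illustration. Passing `.groups = "drop"` is an optional addition that is not in my code above. It returns an ungrouped result and avoids the message shown in the output.

```r
library(dplyr)

# Made-up ratings; each row is one rating of one movie.
action <- tibble(year = c(2000, 2000, 2001), rating = c(3, 4, 5),
                 single_genre = "Action")
drama  <- tibble(year = c(2000, 2001, 2001), rating = c(4, 3, 4),
                 single_genre = "Drama")

# Stack the data frames; they must have the same columns.
combined <- rbind(action, drama)

# One row per (year, genre) pair, holding the mean of that pair's ratings.
toy_avg <- combined %>%
  group_by(year, single_genre) %>%
  summarize(avg_rating = mean(rating), .groups = "drop")

print(toy_avg)
```

For example, the two Action ratings in 2000 (3 and 4) average to 3.5, and the single Action rating in 2001 stays at 5.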

Now it’s time to plot!

In this first plot I used geom_line and geom_point to track how the average rating of each genre moved over time. The x-axis shows the year, the y-axis shows the average movie rating, and the legend shows the genre. I also added a new theme and set ylim so the y-axis starts at 3 and ends at 5.

ggplot(real_movies, aes(x = year, y = avg_rating, color = single_genre)) +
  geom_line() +
  geom_point() +
  ylim(3, 5) +
  xlab("Years") +
  ylab("Movie Ratings") +
  ggtitle("2000-2010 Average Movie Genre Ratings") +
  theme_bw() +
  labs(color = "Movie Genres") +
  scale_x_continuous(breaks = seq(2000, 2010, 5))
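
One thing worth knowing about ylim is that it sets the limits of the scale, so any value outside 3 to 5 is treated as missing and dropped with a warning. The sketch below compares it with coord_cartesian, which only zooms the view. The `toy` data frame is made up for illustration, and this is an alternative to consider, not what the plot above uses.

```r
library(ggplot2)

# Made-up averages; one value (2.8) falls below the lower limit of 3.
toy <- data.frame(
  year = c(2000, 2005, 2010),
  avg_rating = c(3.4, 2.8, 3.6)
)

base_plot <- ggplot(toy, aes(x = year, y = avg_rating)) +
  geom_line() +
  geom_point()

# ylim() sets scale limits: values outside 3 to 5 are treated as missing.
p_limits <- base_plot + ylim(3, 5)

# coord_cartesian() only zooms the view; every value stays in the data.
p_zoom <- base_plot + coord_cartesian(ylim = c(3, 5))
```

If every yearly average already lies between 3 and 5, the two versions draw the same picture.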

Plot 2.0

The first plot was good, but I can do better: I wanted to add interactivity. Using the same approach as in my previous plot, I added a tooltip that shows the year and the average rating for each point. I also loaded plotly, which gave me the tools needed to make this possible.

library(plotly)

g2 <- ggplot(real_movies, aes(x = year, y = avg_rating, color = single_genre,
                              text = paste("Year: ", year, "<br>Avg Rating: ", round(avg_rating, 2)))) +
  geom_point() +
  geom_line(aes(group = single_genre)) +
  ylim(3, 5) +
  xlab("Years") +
  ylab("Movie Ratings") +
  ggtitle("2000-2010 Average Movie Genre Ratings") +
  theme_bw() +
  labs(color = "Movie Genres") +
  scale_x_continuous(breaks = seq(2000, 2010, 5))

ggplotly(g2, tooltip = "text")

Closing Thoughts

This assignment challenged me. I knew from the beginning what I wanted to visualize, but at first my coding skills were not at the level needed to achieve it. After two days of working on this project, I finally completed my visualization. As shown above, drama movies were rated higher on average than action and comedy movies. Comedy movies were rated lower than the other two genres throughout the decade, except in 2001 and 2007. I would like to revisit this assignment when I have improved my coding skills, but I am very proud that I was able to work past my challenges and achieve my visualization!

Thank you!