DS 2870 - Homework 3

Data Description

The movies data set has 5684 rows describing the amount of explicit content (drugs, language, sex & nudity, and violence) found in 1421 movies released since 1985. Each movie is represented by 4 rows (1 row = movie & content type combo).

The relevant variables in the data set are:

imdb_id: The identifier used by IMDB to uniquely specify the movie
name: The name of the movie
year: The year the movie was released
rating: The MPAA rating of the movie (PG/PG-13/R)
category: The type of explicit content
occurred: Whether the movie had at least 1 scene with that category of explicit content (yes/no)
occurrences: The number of scenes with that type of explicit content in the movie

tibble(movies)

## # A tibble: 5,684 × 8
##    imdb_id   name             year rating run_time category occurred occurrences
##    <chr>     <chr>           <int> <chr>     <int> <chr>    <chr>          <int>
##  1 tt0087231 The Falcon and…  1985 R          7920 violence yes                1
##  2 tt0087231 The Falcon and…  1985 R          7920 sex_nud… yes               18
##  3 tt0087231 The Falcon and…  1985 R          7920 language yes               46
##  4 tt0087231 The Falcon and…  1985 R          7920 drugs    yes                9
##  5 tt0088763 Back to the Fu…  1985 PG         6960 violence yes                2
##  6 tt0088763 Back to the Fu…  1985 PG         6960 sex_nud… yes                3
##  7 tt0088763 Back to the Fu…  1985 PG         6960 language yes               60
##  8 tt0088763 Back to the Fu…  1985 PG         6960 drugs    yes                6
##  9 tt0088846 Brazil           1985 R          8580 violence yes               22
## 10 tt0088846 Brazil           1985 R          8580 sex_nud… yes                8
## # ℹ 5,674 more rows

When using color to represent the category variable, use the colors in the vector below:

cat_col <- c("drugs" = "#355A20", 
             "language" = "steelblue", 
             "sex_nudity" = "pink", 
             "violence" = "#880808")

Question 1: Bar Charts

Part 1A) Bar chart of whether the explicit content occurred

Create the bar chart seen in Brightspace.

ggplot(
  data = movies,
  mapping = aes(x = fct_rev(occurred),
                fill = category)) + 
  
  # Creating the bar chart
  geom_bar(color = "black",
           show.legend = FALSE) +
  
  # Separate bar charts for each content type
  facet_wrap(facets = ~category) +
  
  theme_test() + 
  
  # Removing the buffer space between the bottom of the bars and the x-axis
  scale_y_continuous(
    expand = c(0, 0, 0.05, 0),
    breaks = seq(0, 1500, 250)
  ) + 
  
  # Changing the colors used for the diff categories
  scale_fill_manual(values = cat_col) + 
  
  # Changing the labels and title
  labs(title = "Number of Movies with Explicit Content",
       x = "Did the movie contain the type of explicit content?",
       y = NULL) + 
  
  # Centering and increasing the size of the title
  theme(
    plot.title = element_text(hjust = 0.5,
                              size = 16),
    text = element_text(size = 14)
  )

Which category or categories of explicit content are in fewer than half of the movies?

Drugs is the only category for which fewer than half of the movies have at least 1 scene.

Part 1B) Explicit Content by Rating

Again, create the graph in Brightspace.

ggplot(
  data = movies,
  mapping = aes(x = category,
                fill = occurred)) + 
  
  geom_bar(position = "fill") +
  
  facet_wrap(facets = ~ rating,
             ncol = 3) +
  
  theme_test() + 
  
  scale_x_discrete(guide = guide_axis(angle = 90)) +
  
  scale_y_continuous(expand = c(0, 0, 0.05, 0)) + 
  
  labs(title = "At Least One Explicit Scene in the Movie",
       fill = NULL,
       x = NULL,
       y = NULL) + 
  
  theme(
    plot.title = element_text(hjust = 0.5,
                              size = 16),
    legend.position = "bottom"
  ) + 
  
  scale_fill_manual(values = c("yes" = "tomato", "no" = "steelblue"))

Which type of explicit content has the largest increase across PG, PG-13, and R-rated movies?

Either drugs or sex_nudity has the largest change across PG, PG-13, and R-rated movies.

Part 1C) Average number of scenes per movie

Create a bar chart that displays the average number of occurrences of the explicit content in movies. HINT: To use the methods we learned so far in class, you can divide the appropriate column in aes() by 1421, the number of unique movies in the data.

ggplot(
  data = movies,
  mapping = aes(x = category,
                y = occurrences/n_distinct(imdb_id),
                fill = category)) + 
  
  # Need to use geom_col() since we are mapping the y aesthetic
  # And hiding the legend since category is on the x-axis
  geom_col(show.legend = FALSE) +
  
  theme_bw() + 
  
  # Removing the buffer space
  scale_y_continuous(expand = c(0, 0, 0.05, 0)) + 
  
  # Matching the colors correctly
  scale_fill_manual(values = cat_col) + 
  
  # Adding a title and changing the labels
  labs(title = "Occurrences in Movies of Explicit Content",
       # subtitle = "PG, PG-13, and R-rated movies: 1985 - 2023",
       x = "Explicit Content Type",
       y = "Average count per movie"
  ) + 
  
  # Centering the title
  theme(
    text = element_text(size = 14),
    plot.title = element_text(hjust = 0.5,
                              size = 16)
  )
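
The divide-by-count trick from the hint works because geom_col() stacks every row that shares the same x value, so the bar height is the sum of occurrences / (number of movies) within each category, which equals the category mean as long as every movie has exactly one row per category. A minimal sketch of that arithmetic, using a small made-up data set (the names toy, n_movies, stacked, and direct are hypothetical and not part of the assignment):

```r
library(dplyr)

# Hypothetical toy data (NOT the real movies data): 3 movies x 2 categories
toy <- tibble(
  imdb_id     = rep(c("m1", "m2", "m3"), each = 2),
  category    = rep(c("drugs", "violence"), times = 3),
  occurrences = c(0, 6, 3, 9, 3, 0)
)

n_movies <- n_distinct(toy$imdb_id)  # 3 here; 1421 in the real data

# What the stacked bars add up to: summed slices per category
stacked <- toy |>
  group_by(category) |>
  summarize(bar_height = sum(occurrences / n_movies))

# The direct per-category average
direct <- toy |>
  group_by(category) |>
  summarize(avg = mean(occurrences))

# TRUE when the stacked bar heights match the per-category means
all.equal(stacked$bar_height, direct$avg)
```

If some movies were missing a row for a category, the two numbers would no longer agree, so the shortcut relies on the 4-rows-per-movie layout described at the top.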

Question 2: Density Plots

Part 2A) Density plot of occurrences in movies

Create the graph in Brightspace and save it. Make sure the graph appears in your solutions!

ggplot(
  data = movies,
  mapping = aes(x = occurrences,
                fill = category)) + 
  
  geom_density(show.legend = FALSE) +
  
  labs(title = "Occurrences in Movies of Explicit Content by Movie",
       fill = "Content\nType",
       x = "Number of Occurrences in a Movie",
       y = NULL
  ) + 
  
  theme_bw() + 
  
  theme(
    plot.title = element_text(hjust = 0.5,
                              size = 16)
  ) + 
  scale_y_continuous(expand = c(0, 0, 0.05, 0)) + 
  
  scale_fill_manual(values = cat_col) + 
  
  #scale_x_continuous(trans = "log10") + 
  
  facet_wrap(
    facets = ~ category,
    #scales = "free",
    ncol = 1
  ) ->
  
  gg_occurrences

gg_occurrences

What issue do the graphs above have?

At least one of the categories is extremely right skewed, making it impossible to see the shape of any of them.

Part 2B) Log 10 Scale

Add the appropriate function to the graph created in part 2A to create the graph in Brightspace.

gg_occurrences +
  scale_x_log10()

When making the changes to the graph, the warning states that 1750 rows were removed. Why did applying a log10 transformation cause the need for 1750 rows to be removed?

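1750 rows had an occurrences value of 0, and you can’t take the log of 0: log10(0) is -Inf, which is not a finite value, so those rows were dropped from the plot.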
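
The effect of zeros under a log10 transformation can be checked directly in base R. The vector below is made up for illustration (occ, log_occ, and n_dropped are hypothetical names, not objects from the assignment):

```r
# Hypothetical occurrence counts (NOT the real data)
occ <- c(0, 0, 1, 10, 100)

# log10() of zero is -Inf; positive values transform normally
log_occ <- log10(occ)
log_occ

# Non-finite values are the ones a log10 scale has to drop
is.finite(log_occ)

# Count of values that would be removed
n_dropped <- sum(!is.finite(log_occ))
n_dropped
```

On the real data, sum(movies$occurrences == 0) would be the analogous check, and it should presumably match the row count in the warning.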

Question 3: Amount of explicit content in movies by year

To answer the last question, you’ll need to use the data set created below:

movies |> 
  group_by(year, category) |> 
  summarize(movies = n(),
            occur_avg = mean(occurrences),
            occur_per = mean(occurred == "yes")) |> 
  ungroup() ->
  avg_by_year

Using the avg_by_year data set, create a line graph that shows the percentage of movies with each type of explicit content by year. See the graphs in Brightspace for what it should look like.

ggplot(
  data = avg_by_year,
  mapping = aes(x = year,
                y = occur_avg,
                color = category)) + 
  
  geom_line(linewidth = 1) + 
  
  # Using the named color vector so each category always gets its assigned color
  scale_color_manual(values = cat_col) + 
  
  labs(
    title = "Occurrences of Explicit Content by Year",
    color = NULL
  ) +
  
  theme_bw() + 
  
  theme(
    legend.position = "top",
    plot.title = element_text(hjust = 0.5,
                              size = 14))
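
The prompt asks for the percentage of movies with each type of content, while the graph above maps occur_avg (the average number of occurrences). If the percentage view is the one wanted, a possible variant maps occur_per instead and formats the axis as a percent. This is only a sketch under that assumption: toy_by_year is a made-up stand-in for avg_by_year with invented values, gg_percent is a hypothetical name, and the axis label is a guess rather than the one in Brightspace.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for avg_by_year (made-up values, NOT the real summary)
toy_by_year <- tibble(
  year      = rep(c(1985, 1986, 1987), each = 2),
  category  = rep(c("drugs", "violence"), times = 3),
  occur_per = c(0.40, 0.90, 0.45, 0.95, 0.50, 0.92)
)

gg_percent <- ggplot(
  data = toy_by_year,
  mapping = aes(x = year,
                y = occur_per,
                color = category)) +

  geom_line(linewidth = 1) +

  # occur_per is a proportion between 0 and 1; label it as a percentage
  scale_y_continuous(labels = scales::label_percent()) +

  # Only the two categories present in the toy data
  scale_color_manual(values = c("drugs" = "#355A20",
                                "violence" = "#880808")) +

  # Axis label is an assumption, not taken from the assignment
  labs(y = "Movies with at least one scene",
       color = NULL) +

  theme_bw()

gg_percent
```

With the real data, swapping toy_by_year for avg_by_year and the two-color vector for cat_col should give the four-line version.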

DS 2870 - Homework 3 - Movies

Your Name

2023-10-02