The movies data set has 5684 rows about the amount of explicit content (drugs, language, sex & nudity, and violence) found in 1421 movies released since 1985. Each movie is represented by 4 rows (1 row = one movie and content type combination).
The relevant variables in the data set are:
tibble(movies)
## # A tibble: 5,684 × 8
## imdb_id name year rating run_time category occurred occurrences
## <chr> <chr> <int> <chr> <int> <chr> <chr> <int>
## 1 tt0087231 The Falcon and… 1985 R 7920 violence yes 1
## 2 tt0087231 The Falcon and… 1985 R 7920 sex_nud… yes 18
## 3 tt0087231 The Falcon and… 1985 R 7920 language yes 46
## 4 tt0087231 The Falcon and… 1985 R 7920 drugs yes 9
## 5 tt0088763 Back to the Fu… 1985 PG 6960 violence yes 2
## 6 tt0088763 Back to the Fu… 1985 PG 6960 sex_nud… yes 3
## 7 tt0088763 Back to the Fu… 1985 PG 6960 language yes 60
## 8 tt0088763 Back to the Fu… 1985 PG 6960 drugs yes 6
## 9 tt0088846 Brazil 1985 R 8580 violence yes 22
## 10 tt0088846 Brazil 1985 R 8580 sex_nud… yes 8
## # ℹ 5,674 more rows
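One way to confirm the four-rows-per-movie layout is to count rows per imdb_id. The sketch below uses a small made-up tibble, toy_movies, with hypothetical IDs and values, as a stand-in for movies; the same calls should work on the real data, assuming dplyr is loaded.

```r
library(dplyr)

# Hypothetical toy data with the same layout as movies:
# one row per movie and content type combination
toy_movies <- tibble(
  imdb_id     = rep(c("tt0000001", "tt0000002"), each = 4),
  category    = rep(c("violence", "sex_nudity", "language", "drugs"), times = 2),
  occurrences = c(1L, 18L, 46L, 9L, 2L, 3L, 60L, 0L)
)

# Rows per movie: every movie should have exactly 4
rows_per_movie <- count(toy_movies, imdb_id)
rows_per_movie

# Unique movies times 4 should equal the total number of rows
n_movies <- n_distinct(toy_movies$imdb_id)
n_movies * 4 == nrow(toy_movies)
```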
When using color to represent the category variable, use the colors in the vector below:
cat_col <- c("drugs" = "#355A20",
"language" = "steelblue",
"sex_nudity" = "pink",
"violence" = "#880808")
Create the bar chart seen in Brightspace.
ggplot(
data = movies,
mapping = aes(x = fct_rev(occurred),
fill = category)) +
# Creating the bar chart
geom_bar(color = "black",
show.legend = FALSE) +
# Separate bar charts for each content type
facet_wrap(facets = ~category) +
theme_test() +
# Removing the buffer space at the bottom of the y-axis
scale_y_continuous(
expand = c(0, 0, 0.05, 0),
breaks = seq(0, 1500, 250)
) +
# Changing the colors used for the diff categories
scale_fill_manual(values = cat_col) +
# Changing the labels and title
labs(title = "Number of Movies with Explicit Content",
x = "Did the movie contain the type of explicit content?",
y = NULL) +
# Centering and increasing the size of the title
theme(
plot.title = element_text(hjust = 0.5,
size = 16),
text = element_text(size = 14)
)
Which category or categories of explicit content appear in fewer than half of the movies?
Drugs is the only category for which fewer than half of the movies contain at least one scene.
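The bar chart answer can be cross-checked numerically by computing the share of movies with occurred == "yes" in each category. This is a sketch on made-up data (toy_movies and its values are hypothetical), assuming dplyr is loaded:

```r
library(dplyr)

# Hypothetical toy data: 4 movies, 2 content types
toy_movies <- tibble(
  imdb_id  = rep(c("m1", "m2", "m3", "m4"), each = 2),
  category = rep(c("drugs", "language"), times = 4),
  occurred = c("no", "yes", "no", "yes", "yes", "yes", "no", "no")
)

# The mean of a logical vector is the proportion of TRUE values,
# so this is the share of movies with at least one occurrence
prop_by_category <- toy_movies |>
  group_by(category) |>
  summarize(prop_yes = mean(occurred == "yes"))

prop_by_category
```

Any category with prop_yes below 0.5 appears in fewer than half of the movies.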
Again, create the graph in Brightspace.
ggplot(
data = movies,
mapping = aes(x = category,
fill = occurred)) +
geom_bar(position = "fill") +
facet_wrap(facets = ~ rating,
ncol = 3) +
theme_test() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
scale_y_continuous(expand = c(0, 0, 0.05, 0)) +
labs(title = "At Least One Explicit Scene in the Movie",
fill = NULL,
x = NULL,
y = NULL) +
theme(
plot.title = element_text(hjust = 0.5,
size = 16),
legend.position = "bottom"
) +
scale_fill_manual(values = c("yes" = "tomato", "no" = "steelblue"))
Which type of explicit content has the largest increase across PG, PG-13, and R-rated movies?
Either drugs or sex_nudity has the largest change across PG, PG-13, and R-rated movies.
Create a bar chart that displays the average number of occurrences of the explicit content in movies. HINT: To use the methods we learned so far in class, you can divide the appropriate column in aes() by 1421, the number of unique movies in the data.
ggplot(
data = movies,
mapping = aes(x = category,
y = occurrences/n_distinct(imdb_id),
fill = category)) +
# Need to use geom_col() since we are mapping the y aesthetic
# And hiding the legend since category is on the x-axis
geom_col(show.legend = FALSE) +
theme_bw() +
# Removing the buffer space
scale_y_continuous(expand = c(0, 0, 0.05, 0)) +
# Matching the colors correctly
scale_fill_manual(values = cat_col) +
# Adding a title and changing the labels
labs(title = "Occurrences in Movies of Explicit Content",
x = "Explicit Content Type",
y = "Average count per movie",
# subtitle = "PG, PG-13, and R-rated movies: 1985 - 2023"
) +
# Centering the title
theme(
text = element_text(size = 14),
plot.title = element_text(hjust = 0.5,
size = 16)
)
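Dividing inside aes() yields an average because geom_col() stacks the values that share an x position, so each bar's height is the sum of occurrences / n over all movies, which equals the mean. A small base R sketch with made-up counts (the vector and its values are hypothetical):

```r
# Hypothetical counts of one content type in four movies
occurrences <- c(1, 18, 46, 9)
n_movies <- length(occurrences)

# Height of the stacked bar: the sum of each movie's share
bar_height <- sum(occurrences / n_movies)

# Same value as the ordinary mean
bar_height
mean(occurrences)
```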
Create the graph in Brightspace and save it. Make sure the graph appears in your solutions!
ggplot(
data = movies,
mapping = aes(x = occurrences,
fill = category)) +
geom_density(show.legend = FALSE) +
labs(title = "Occurrences in Movies of Explicit Content by Movie",
fill = "Content\nType",
x = "Number of Occurrences in a Movie",
y = NULL
) +
theme_bw() +
theme(
plot.title = element_text(hjust = 0.5,
size = 16)
) +
scale_y_continuous(expand = c(0, 0, 0.05, 0)) +
scale_fill_manual(values = cat_col) +
#scale_x_continuous(trans = "log10") +
facet_wrap(
facets = ~ category,
#scales = "free",
ncol = 1
) ->
gg_occurrences
gg_occurrences
What issue do the graphs above have?
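At least one of the categories is extremely right-skewed. Because the facets share one x-axis, the long tail stretches the axis and squeezes every distribution toward 0, making it impossible to see the shape of any of them.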
Add the appropriate function to the graph created in part 2A to create the graph in Brightspace
gg_occurrences +
scale_x_log10()
When making the changes to the graph, the warning states that 1750 rows were removed. Why did applying a log10 transformation cause 1750 rows to be removed?
1750 rows had an occurrences value of 0, and you can’t take the log of 0, so those rows could not be placed on the log10 scale.
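The removed rows can be traced to how log10() treats zeros. A minimal base R sketch with made-up values (the vector is hypothetical):

```r
# Hypothetical occurrence counts, including a zero
occurrences <- c(0, 1, 10, 100)

# log10(0) is -Inf; the positive values transform normally
logged <- log10(occurrences)
logged

# Non-finite values cannot be positioned on the axis
n_non_finite <- sum(!is.finite(logged))
n_non_finite
```

Every zero in occurrences becomes a non-finite value after the transformation, which is why the number of removed rows matches the number of rows with 0 occurrences.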
To answer the last question, you’ll need to use the data set created below:
movies |>
group_by(year, category) |>
summarize(movies = n(),
occur_avg = mean(occurrences),
occur_per = mean(occurred == "yes")) |>
ungroup()->
avg_by_year
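The two summary columns measure different things: occur_avg is the mean count per movie, while occur_per is the share of movies with at least one occurrence. The sketch below repeats the same summary on a made-up tibble (toy_movies and its values are hypothetical), assuming dplyr is loaded. It uses .groups = "drop" in place of the separate ungroup() step, which should give the same ungrouped result.

```r
library(dplyr)

# Hypothetical toy data: two movies per year, one content type
toy_movies <- tibble(
  year        = c(1985L, 1985L, 1986L, 1986L),
  category    = "drugs",
  occurred    = c("yes", "no", "yes", "yes"),
  occurrences = c(4L, 0L, 2L, 6L)
)

toy_by_year <- toy_movies |>
  group_by(year, category) |>
  summarize(
    movies    = n(),
    occur_avg = mean(occurrences),        # average count per movie
    occur_per = mean(occurred == "yes"),  # proportion of movies with any
    .groups   = "drop"
  )

toy_by_year
```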
Using the avg_by_year data set, create a line graph that shows the percentage of movies with each type of explicit content by year. See the graphs in Brightspace for what it should look like.
ggplot(
data = avg_by_year,
mapping = aes(x = year,
# occur_per is the proportion of movies with the content type
y = occur_per,
color = category)) +
geom_line(linewidth = 1) +
scale_color_manual(values = cat_col) +
labs(
title = "Occurrences of Explicit Content by Year",
color = NULL
) +
theme_bw() +
theme(
legend.position = "top",
plot.title = element_text(hjust = 0.5,
size = 14))