Use this file as the template for your submission. Be sure to write your name at the top of this page in the author section.
For your lab submission, generate an .html file and an .Rmd file (named as: [AndrewID]-lab01.Rmd – e.g. “ryurko-lab01.Rmd”). When you’re done, submit it to Gradescope (the link to the course’s Gradescope can be found on Canvas). Gradescope only accepts PDFs, so knit to PDF before submitting (see the Knit to PDF instructions in the Lab 00 assignment).
Your file should contain the code to answer each question in its own code block. Your code should produce plots/output that will be automatically embedded in the output (.html) file. Your lab and homework files will include the template code chunks like the following:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
[WRITE YOUR ANSWER HERE]
As discussed in the class material, bar charts are usually easier to read than spine charts and pie charts. There are several ways we can add even more useful information to bar charts, as we’ll explore throughout this lab.
In this assignment, we’ll be using a dataset of IMDb-rated films, TV series, etc. The data is available on GitHub and can be loaded with the code below:
library(tidyverse)
# Read the IMDb-rated movies, TV series, etc.
imdb_data <- read.csv("https://raw.githubusercontent.com/ryurko/DataViz-Class-Data/main/imdb_ratings.csv")
# Filter the data to feature films only
imdb_movies <- imdb_data %>%
  filter(Title.type == "Feature Film")
For a description of the variables in the dataset, please read here.
imdb_movies %>%
group_by(Directors) %>%
summarize(count = n()) %>%
filter(count == max(count))
## # A tibble: 3 × 2
## Directors count
## <chr> <int>
## 1 Christopher Nolan 9
## 2 David Fincher 9
## 3 Martin Scorsese 9
In the above code, %>% is called the “pipe” operator.
x %>% f(y) means the same thing as f(x, y).
The pipe operator is useful in making code easily readable and
understandable. Note that you need to load the
tidyverse library before you can use the pipe
operator.
Given this, what is the above code doing, and in particular what is
the purpose of the filter(count == max(count))?
[WRITE YOUR ANSWER HERE]
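As a small illustration of the pipe equivalence described earlier, the sketch below applies the same two steps to a made-up numeric vector, once with %>% and once with nested calls. The vector and the object names are hypothetical and are not part of the lab data; loading dplyr alone is enough here because it re-exports %>%.

```r
library(dplyr)

# Hypothetical toy vector (not from the IMDb data)
x <- c(1, 4, 9)

# Piped version: read left to right, "take x, then sqrt, then sum"
piped <- x %>% sqrt() %>% sum()

# Nested version: the same computation, read inside out
nested <- sum(sqrt(x))

# x %>% f(y) is f(x, y): here round(x / 3, 1)
piped_round <- (x / 3) %>% round(1)
nested_round <- round(x / 3, 1)

identical(piped, nested)
identical(piped_round, nested_round)
```

Both comparisons should return TRUE, since the pipe only changes how the calls are written, not what they compute.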
imdb_movies <- imdb_movies %>%
  # Refer to columns by name inside mutate(); no need for imdb_movies$
  mutate(vote_date = as.Date(created,
                             format = "%a %b %d %H:%M:%S %Y"),
         day_of_week = weekdays(vote_date),
         weekend = ifelse(day_of_week %in% c("Saturday", "Sunday"),
                          "Weekend", "Workday"),
         duration = cut(Runtime..mins., c(0, 90, 120, Inf),
                        labels = c("Short", "Medium", "Long")),
         ratings = cut(You.rated, c(0, 4, 7, Inf),
                       labels = c("Low", "Med", "High")),
         movie_period = cut(Year, c(0, 1980, 2000, 2018),
                            labels = c("Old", "Recent", "New"))
  )
vote_date is the date the movie was rated,
day_of_week is the day of the week the movie was rated, and
duration is an ordinal categorical variable indicating how
long the movie was.
In course material, we discussed marginal
distributions for categorical variables; here we’ll examine the
marginal distributions of the movie_period and
duration variables. Do this by calculating the counts,
proportions, and percentages of each variable. The following code is
provided for duration:
# Get counts, proportions, and percentages for duration
duration_marginal <- imdb_movies %>%
group_by(duration) %>%
summarize(count = n(),
total = nrow(imdb_movies),
proportion = round(count / total, 4),
percentage = proportion * 100)
duration_marginal
## # A tibble: 3 × 5
## duration count total proportion percentage
## <fct> <int> <int> <dbl> <dbl>
## 1 Short 53 737 0.0719 7.19
## 2 Medium 395 737 0.536 53.6
## 3 Long 289 737 0.392 39.2
Notice how, within the summarize() function, we can
sequentially define a new variable and then refer to that new variable
later on. For example, within summarize(), we first define
a variable total, and then we refer to total
when defining proportion.
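To see this sequential behavior in isolation, here is a minimal sketch on a made-up three-row data frame (the toy object and its columns are hypothetical, not part of the lab data). The column total is defined first and then reused on the next line of the same summarize() call.

```r
library(dplyr)

# Hypothetical toy data: two rows in group "a", one row in group "b"
toy <- tibble(group = c("a", "a", "b"))

toy_summary <- toy %>%
  group_by(group) %>%
  summarize(count = n(),
            total = nrow(toy),           # defined here ...
            proportion = count / total)  # ... and reused here

toy_summary
```

Swapping the order so that proportion comes before total would fail, because total would not exist yet when proportion is evaluated.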
Now write the same code for movie_period:
movie_period_marginal <- imdb_movies %>%
group_by(movie_period) %>%
summarize(
count = n(),
total = nrow(imdb_movies),
proportion = round(count / total, 4),
percentage = proportion * 100
)
Report your general observations in one sentence for each variable.
[Duration: Medium films are most common, followed by Long, and Short films are least common. Movie period: most titles are Recent (1980–2000) or New (2000–2018), and Old is the smallest category.]
The goal of this question is to make a bar chart of
duration on the percentage scale instead
of the count scale. In other words, we’d like to make a barplot where
duration is on the x-axis and the percentage
of each duration is on the y-axis. Note that
duration_marginal has exactly this information - so, let’s
use this dataset to make this bar chart. Alter the template code below
such that you’ve created a bar chart with one bar for each
duration whose height corresponds to the
percentage of each duration. After you’ve made the plot,
add appropriate titles, labels, and a non-default color.
duration_marginal %>%
ggplot(aes(x = duration, y = percentage)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Distribution of Movie Duration",
x = "Duration",
y = "Percentage of films") +
theme_minimal()
Note that, in this case, we needed to specify
stat = "identity" within geom_bar(), because
we want the height of the bars to correspond to whatever we put into
y. This differs from the default
stat = "count" within geom_bar().
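As a side note, ggplot2 also provides geom_col(), which uses the y values as bar heights directly. The sketch below rebuilds the percentages shown in the duration_marginal output as a small stand-alone data frame (toy_pct is a hypothetical name) and checks that the two approaches give the same bar heights.

```r
library(ggplot2)

# Percentages copied from the duration_marginal output above
toy_pct <- data.frame(
  duration = factor(c("Short", "Medium", "Long"),
                    levels = c("Short", "Medium", "Long")),
  percentage = c(7.19, 53.6, 39.2)
)

# Bar heights taken from y via stat = "identity"
p_identity <- ggplot(toy_pct, aes(x = duration, y = percentage)) +
  geom_bar(stat = "identity")

# geom_col() does the same thing without the stat argument
p_col <- ggplot(toy_pct, aes(x = duration, y = percentage)) +
  geom_col()

# Compare the computed bar heights of the two plots
all.equal(layer_data(p_identity)$y, layer_data(p_col)$y)
```

Either form is fine for this lab; geom_col() is simply a shorter way to say that the heights are already computed.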
[No, the proportions are clearly not equal: Medium-length films dominate, and Short films are least common. Visually, the differences are large enough to suggest a real difference in proportions.]
chisq.test(duration_marginal$count) will check to see if
the proportions in each category are the same.
chisq_out <- chisq.test(duration_marginal$count)
chisq_out
##
## Chi-squared test for given probabilities
##
## data: duration_marginal$count
## X-squared = 249.52, df = 2, p-value < 2.2e-16
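The statistic reported above can be reproduced by hand, which may help clarify what the test is comparing. This sketch assumes the counts shown in the duration_marginal output (53, 395, 289) and the default null hypothesis of equal proportions; the object names are illustrative only.

```r
# Observed counts copied from the duration_marginal output
observed <- c(Short = 53, Medium = 395, Long = 289)

# Under the null of equal proportions, each category expects n / 3 films
expected <- rep(sum(observed) / length(observed), length(observed))

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected
x_squared <- sum((observed - expected)^2 / expected)

# Upper-tail p-value with (number of categories - 1) degrees of freedom
p_value <- pchisq(x_squared, df = length(observed) - 1, lower.tail = FALSE)

round(x_squared, 2)
p_value
```

The rounded statistic should agree with the X-squared value printed by chisq.test(), and the p-value falls far below the 2.2e-16 threshold that R prints.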
After you run the test, answer the following questions:
[p-value < 2.2 × 10⁻¹⁶]
[The p-value is extremely small (< 2.2 × 10⁻¹⁶), so we reject the null hypothesis of equal proportions. There are statistically significant differences between the Short, Medium, and Long categories.]
+ geom_text(x = ?, y = ?, label = ?). Here, you should
specify x and y as the x- and y-coordinates
where you want your text, and label as the text itself in
quotes. (Note: Because the x-axis is categorical, you can specify
x as one of the categories or as a number.)
Specify label as your p-value from your
chi-squared test in Part E (e.g.,
label = "chi-squared test p-value = ?"), with ?
appropriately specified. If you got an extremely small p-value, it’s
fine if you just state “chi-squared test p-value approximately zero.” If
you need to put your text on two lines, you can use \n
within your label to create a line break.
For this part, you just need to have the exact same plot as Part C, but with the desired text added.
pval_text <- paste0("Chi-square p = ", formatC(chisq_out$p.value, format = "e", digits = 2))
y_top <- max(duration_marginal$percentage) * 1.05
duration_marginal %>%
ggplot(aes(x = duration, y = percentage)) +
geom_bar(stat = "identity", fill = "steelblue") +
annotate("text", x = 2, y = y_top, label = pval_text) +
labs(title = "Distribution of Movie Duration (with χ² p-value)",
x = "Duration", y = "Percentage of films") +
theme_minimal() +
coord_cartesian(ylim = c(0, max(duration_marginal$percentage) * 1.12))
duration category. We can also use confidence intervals to
assess statistical significance.
Notice that the data frame duration_marginal already has
the proportions in each duration category and the total
number of movies, which is exactly what you need to compute the
confidence intervals for each proportion. Fill in the code below to
display the lower and upper bounds for the 95% confidence intervals for
the proportions:
# Write code here for the CI lower bound
# Change 0 to something else
duration_marginal$proportion - qnorm(0.975) * sqrt(duration_marginal$proportion * (1 - duration_marginal$proportion) / duration_marginal$total)
## [1] 0.05325011 0.49999559 0.35685246
# Write code here for the CI upper bound
# Change 0 to something else
duration_marginal$proportion + qnorm(0.975) * sqrt(duration_marginal$proportion * (1 - duration_marginal$proportion) / duration_marginal$total)
## [1] 0.09054989 0.57200441 0.42734754
Hint: We talked about how to compute the confidence interval for a proportion in Concept 3, so it may be helpful to revisit your notes. Remember that the confidence interval involves the 1 - alpha/2 quantile from a standard Normal distribution. In this case, alpha = 0.05, so we need the 0.975 quantile. The quantile can be computed as:
z <- qnorm(0.975)
duration_marginal <- duration_marginal %>%
  mutate(
    se = sqrt(proportion * (1 - proportion) / total),
    ci_low = proportion - z * se,
    ci_high = proportion + z * se
  )
duration_marginal %>% select(duration, proportion, ci_low, ci_high)
## # A tibble: 3 × 4
## duration proportion ci_low ci_high
## <fct> <dbl> <dbl> <dbl>
## 1 Short 0.0719 0.0533 0.0905
## 2 Medium 0.536 0.500 0.572
## 3 Long 0.392 0.357 0.427
Here, qnorm(0.975) is the “classic” 1.96 (sometimes rounded to 2).
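As a check on the arithmetic, the interval for a single category can be worked out by hand. This sketch uses the Medium proportion (0.536) and the total (737) from the duration_marginal output and the usual Normal-approximation interval, proportion ± z × sqrt(proportion × (1 − proportion) / n); the object names are illustrative only.

```r
# Values copied from the duration_marginal output for the Medium category
p_hat <- 0.536
n <- 737

# 0.975 quantile of the standard Normal, since alpha = 0.05
z <- qnorm(1 - 0.05 / 2)

# Standard error of a sample proportion
se <- sqrt(p_hat * (1 - p_hat) / n)

# Lower and upper bounds of the 95% confidence interval
ci_medium <- c(lower = p_hat - z * se, upper = p_hat + z * se)
round(ci_medium, 4)
```

The rounded bounds should match the Medium row of the confidence interval table above.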
After you’ve successfully displayed your confidence intervals: Which confidence intervals don’t overlap? What does this suggest, and how does this differ from the conclusion we could make from the chi-squared test?
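[None of the three confidence intervals overlap: Short (0.05 – 0.09), Long (0.36 – 0.43), and Medium (0.50 – 0.57) are all separated from one another. This suggests that each pair of categories has a different proportion, whereas the chi-squared test only tells us that the proportions are not all equal, without indicating which categories differ.]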
imdb_movies dataset from
Problem 1. Follow these steps:
First, make a bar graph of weekend. Make the color
of the bars something other than gray by specifying the
fill variable.
Now we want to color the bars by the ratings
variable; in other words, we want to “fill” the bars according to the
ratings variable. So, set fill = ratings in
your code.
At this point, you probably get an error if you run your code.
Notice that you probably have fill = ratings within
geom_bar() (which is fine!). However, ggplot
requires all variable names to be within the aes()
function; the reason is that aes() “describe how variables
in the data are mapped to visual properties” (to quote the help
documentation for that function). So, if you have
fill = ratings outside of aes(),
ggplot is saying, “What is ratings? It must
not be a variable in the imdb_movies dataset, because
ratings isn’t inside aes()…so I’m confused.”
To fix this issue, either (1) write aes(fill = ratings)
within geom_bar(), or (2) put fill = ratings
within the aes() already inside ggplot().
Either should work fine!
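The two fixes can be sketched side by side on a tiny made-up data frame. Here toy_movies and its values are hypothetical stand-ins for imdb_movies, chosen only so the example runs on its own; a third plot shows the contrasting case where fill is a constant color outside aes().

```r
library(ggplot2)

# Hypothetical toy data standing in for imdb_movies (values are made up)
toy_movies <- data.frame(
  weekend = c("Weekend", "Weekend", "Workday", "Workday", "Workday"),
  ratings = c("High", "Low", "High", "High", "Med")
)

# Option 1: map fill to a variable inside aes() within geom_bar()
p_option1 <- ggplot(toy_movies, aes(x = weekend)) +
  geom_bar(aes(fill = ratings))

# Option 2: map fill inside the aes() already in ggplot()
p_option2 <- ggplot(toy_movies, aes(x = weekend, fill = ratings)) +
  geom_bar()

# Contrast: outside aes(), fill must be a constant color, not a variable
p_constant <- ggplot(toy_movies, aes(x = weekend)) +
  geom_bar(fill = "steelblue")

# Both options compute the same stacked segments
all.equal(sort(layer_data(p_option1)$count),
          sort(layer_data(p_option2)$count))
```

One practical difference: a mapping placed in ggplot() is inherited by every layer added later, while a mapping placed in geom_bar() applies to that layer only.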
The goal of this problem is to make a stacked bar plot, with bars for
weekend and different colors for ratings. If
you get stuck here, please ask your peers and/or TAs for help.
imdb_movies %>%
ggplot(aes(x = weekend, fill = ratings)) +
geom_bar() +
labs(
title = "Ratings by Weekend vs Workday",
x = "Day type",
y = "Number of ratings",
fill = "Ratings"
) +
theme_minimal()
weekend and the marginal distribution for
ratings.
[The marginal distribution of weekend (total bar heights) is fairly similar for Weekend and Workday. The marginal distribution of ratings shows High ratings make up the largest share overall (largest stacked segments across bars), followed by Med, with Low the smallest.]
movie_period and day_of_week such that you
think it’s easiest to describe the marginal distribution of
movie_period. After you’ve made your graph, explain in 1-2
sentences why you made the choices you made - i.e., how the choices you
made allow you to most easily describe the marginal distribution of
movie_period. (For this part, there’s no need to actually
describe the marginal distribution.)
imdb_movies %>%
ggplot(aes(x = movie_period, fill = day_of_week)) +
geom_bar() +
labs(
title = "Day of Week within Movie Period",
x = "Movie period",
y = "Count",
fill = "Day of week"
) +
theme_minimal()
[Placing movie_period on the x-axis makes each bar’s height equal its marginal count, so the marginal distribution of movie_period is immediately readable from bar heights. Coloring by day_of_week adds the second variable without obscuring those totals.]
In addition to visualizing a single categorical variable, we can also create the same visualization for different categories, thereby allowing for flexible multivariate graphics. This is known as “facetting,” where the data is grouped according to some categorical variable, and then we create the same graphic for each group. The resulting graphs are typically displayed in a grid, where each graph is a single “facet” of the full graphic.
This is a popular way to show how the features of the variable(s) being displayed in a particular graphic can change depending on some other variable (usually a categorical variable).
In Problem 2 you made a stacked bar plot for weekend and
ratings. Alternatively, you could make two different bar
plots for ratings, corresponding to the two different
values for weekend. We’ll demonstrate how to do this using
facetting.
ratings facetted by
weekend, follow these steps:
ratings the standard way
(where fill is specified as a color within
geom_bar()).
+ facet_grid(~ weekend) to your existing line
of code.
imdb_movies %>%
ggplot(aes(x = ratings)) +
geom_bar(fill = "gray60") +
facet_grid(~ weekend) +
labs(
title = "Ratings facetted by Weekend",
x = "Ratings", y = "Count"
) +
theme_minimal()
You should see two bar plots graphed together, one for each value of
weekend. After you’ve successfully made your plot, consider
the following. When we have two categorical variables, there are four
distributions we should consider: The marginal distribution of
ratings, the marginal distribution of weekend,
the conditional distribution of ratings given
weekend, and the conditional distribution of
weekend given ratings. (We’re focusing on
marginal distributions this week - we’ll discuss conditional
distributions next week.) Which plot makes it easier to see both
marginal distributions: The facetted bar plot here, or the stacked bar
plot in 2A? Explain your reasoning in 1-3 sentences.
[The facetted version makes both marginals easier to see than the stacked plot: each panel shows the ratings distribution, and comparing total heights across panels reveals the weekend marginal.]
margins = TRUE within facet_grid(). So,
copy-and-paste your code from Part A and then add
margins = TRUE within facet_grid(). Your
submission should display a third bar plot, labeled as “(all)”. This
displays a marginal distribution - the marginal distribution of
what?
imdb_movies %>%
ggplot(aes(x = ratings)) +
geom_bar(fill = "gray60") +
facet_grid(~ weekend, margins = TRUE) + # the facet_grid() argument is named 'margins'
labs(
title = "Ratings facetted by Weekend with Marginal Panel",
x = "Ratings", y = "Count"
) +
theme_minimal()
[The “(all)” panel shows the marginal distribution of ratings aggregated over both weekend categories (i.e., ignoring weekend).]
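To see how the extra panel arises, the sketch below facets a made-up data frame with margins = TRUE and counts the resulting panels. The toy_movies object and its values are hypothetical, used only so the example is self-contained.

```r
library(ggplot2)

# Hypothetical toy data standing in for imdb_movies (values are made up)
toy_movies <- data.frame(
  weekend = c("Weekend", "Weekend", "Workday", "Workday", "Workday"),
  ratings = c("High", "Low", "High", "High", "Med")
)

# margins = TRUE adds an "(all)" panel that pools every weekend category
p_margins <- ggplot(toy_movies, aes(x = ratings)) +
  geom_bar() +
  facet_grid(~ weekend, margins = TRUE)

# Two weekend values plus the pooled panel
panel_count <- length(unique(layer_data(p_margins)$PANEL))
panel_count
```

With two values of weekend, the plot should have three panels, the last of which ignores weekend entirely.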
weekend (don’t include
margins = TRUE). For this part, all you have to do is make
the desired graph. (You should end up with multiple facets of stacked
bar plots, where each stacked bar plot involves
movie_period and day_of_week.)
imdb_movies %>%
ggplot(aes(x = movie_period, fill = day_of_week)) +
geom_bar() +
facet_grid(~ weekend) +
labs(
title = "Movie Period × Day of Week, facetted by Weekend",
x = "Movie period", y = "Count", fill = "Day of week"
) +
theme_minimal()
weekend and ratings by adding
facet_grid(weekend ~ ratings). We won’t ask you to
interpret this graph here, since there are four dimensions being plotted
in this visual. The purpose of this problem is just to demonstrate that
it’s pretty easy to make extensions to bar plots that allow for 3D and
4D graphs!
imdb_movies %>%
ggplot(aes(x = movie_period, fill = day_of_week)) +
geom_bar() +
facet_grid(weekend ~ ratings) +
labs(
title = "Movie Period × Day of Week facetted by Weekend and Ratings",
x = "Movie period", y = "Count", fill = "Day of week"
) +
theme_minimal()