General instructions for all assignments:

Use this file as the template for your submission. Be sure to write your name at the top of this page in the author section.
For your lab submission, generate an .html file and an .Rmd file (named [AndrewID]-lab01.Rmd, e.g. “ryurko-lab01.Rmd”). When you’re done, submit it to Gradescope (the link to the course’s Gradescope can be found on Canvas). Gradescope only accepts PDFs, so knit to PDF before submitting (see the Knit to PDF instructions in the Lab 00 assignment).
Your file should contain the code to answer each question in its own code block. Your code should produce plots/output that will be automatically embedded in the output (.html) file. Your lab and homework files will include the template code chunks like the following:

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Each answer must be supported by written statements (unless otherwise specified) in spaces marked by:

[WRITE YOUR ANSWER HERE]

Questions about the lab assignment will not be answered after the lab session has ended. You should engage during lab, instead of scrambling at the last minute to complete lab assignments.

Problem 1: Chi-Square Tests and Confidence Intervals (45pts)

As discussed in the class material, bar charts are usually easier to read than spine charts and pie charts. There are several ways we can add even more useful information to bar charts, as we’ll explore throughout this lab.

In this assignment, we’ll be using a dataset of IMDb-rated films, TV series, etc. The data is available on GitHub and can be loaded with the code below:

library(tidyverse)

# Read the IMDb ratings data (movies, TV series, etc.)
imdb_data <- read.csv("https://raw.githubusercontent.com/ryurko/DataViz-Class-Data/main/imdb_ratings.csv")

# Keep only the feature films
imdb_movies <- imdb_data %>%
  filter(Title.type == "Feature Film")

For a description of the variables in the dataset, please read here.

(5pts) First, let us find the directors that appear most frequently in this dataset. Code is provided below.

imdb_movies %>% 
  group_by(Directors) %>% 
  summarize(count = n()) %>% 
  filter(count == max(count))

## # A tibble: 3 × 2
##   Directors         count
##   <chr>             <int>
## 1 Christopher Nolan     9
## 2 David Fincher         9
## 3 Martin Scorsese       9

In the above code, %>% is called the “pipe” operator. x %>% f(y) means the same thing as f(x, y). The pipe operator is useful in making code easily readable and understandable. Note that you need to load the tidyverse library before you can use the pipe operator.
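As a minimal sketch of the equivalence described above, the snippet below compares a nested call with its piped form. The vector x is a made-up example, and dplyr is loaded only because it re-exports the pipe (loading tidyverse works as well).

```r
library(dplyr)  # re-exports %>%; library(tidyverse) attaches it as well

# Hypothetical values, used only to illustrate the pipe
x <- c(1.234, 5.678)

nested <- round(x, 1)      # ordinary call: f(x, y)
piped  <- x %>% round(1)   # piped call:    x %>% f(y)

identical(nested, piped)

# Pipes chain left to right: "take x, round it, then sum it"
total_piped  <- x %>% round(1) %>% sum()
total_nested <- sum(round(x, 1))
```

Reading a long pipeline step by step in this way is usually easier than unpacking the same calls from the inside out.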

Given this, what is the above code doing, and in particular what is the purpose of the filter(count == max(count))?

[WRITE YOUR ANSWER HERE]

(5pts) The following code adds some additional variables to the dataset.

imdb_movies <- imdb_movies %>%
  mutate(vote_date = as.Date(created, 
                             format = "%a %b %d %H:%M:%S %Y"),
         day_of_week = weekdays(vote_date),
         weekend = ifelse(day_of_week %in% c("Saturday", "Sunday"), 
                          "Weekend", "Workday"),
         duration = cut(Runtime..mins., c(0, 90, 120, Inf), 
                        labels = c("Short", "Medium", "Long")),
         ratings = cut(You.rated, c(0, 4, 7, Inf),
                       labels = c("Low", "Med", "High")),
         movie_period = cut(Year, c(0, 1980, 2000, 2018),
                            labels = c("Old", "Recent", "New"))
)

vote_date is the date the movie was rated, day_of_week is the day of the week the movie was rated, and duration is an ordinal categorical variable indicating how long the movie was.
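To see how cut() assigns the categories, here is a small sketch on made-up values (the runtimes and years below are hypothetical, not taken from the IMDb data). By default each bin includes its right endpoint, so a 90-minute film counts as Short, and values beyond the last break are left as NA.

```r
# Hypothetical runtimes in minutes
runtime <- c(45, 90, 91, 120, 121, 200)

# Bins are (0, 90], (90, 120], (120, Inf]
duration <- cut(runtime, c(0, 90, 120, Inf),
                labels = c("Short", "Medium", "Long"))
data.frame(runtime, duration)

# Hypothetical release years; 2019 lies above the last break (2018)
year <- c(1975, 1980, 2000, 2018, 2019)
period <- cut(year, c(0, 1980, 2000, 2018),
              labels = c("Old", "Recent", "New"))
data.frame(year, period)
```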

In course material, we discussed marginal distributions for categorical variables; here we’ll examine the marginal distributions of the movie_period and duration variables. Do this by calculating the counts, proportions, and percentages of each variable. The following code is provided for duration:

#  Get counts, proportions, and percentages for duration
duration_marginal <- imdb_movies %>%
  group_by(duration) %>%
  summarize(count = n(), 
            total = nrow(imdb_movies),
            proportion = round(count / total, 4),
            percentage = proportion * 100)
duration_marginal

## # A tibble: 3 × 5
##   duration count total proportion percentage
##   <fct>    <int> <int>      <dbl>      <dbl>
## 1 Short       53   737     0.0719       7.19
## 2 Medium     395   737     0.536       53.6 
## 3 Long       289   737     0.392       39.2

Notice how, within the summarize() function, we can sequentially define a new variable and then refer to that new variable later on. For example, within summarize(), we first define a variable total, and then we refer to total when defining proportion.
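For comparison, the same three quantities can be sketched in base R with table() and prop.table(). The duration vector below is a hypothetical stand-in for imdb_movies$duration, so the numbers are illustrative only.

```r
# Hypothetical categories standing in for imdb_movies$duration
duration <- factor(c("Short", "Long", "Medium", "Medium", "Long", "Medium"),
                   levels = c("Short", "Medium", "Long"))

counts      <- table(duration)               # counts per category
proportions <- prop.table(counts)            # counts divided by the total
percentages <- round(100 * proportions, 2)   # proportions on the 0-100 scale

counts
proportions
percentages
```

The summarize() version has the advantage of returning a data frame, which can be passed directly to ggplot().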

Now write the same code for movie_period:

#  Get counts, proportions, and percentages for movie_period
movie_period_marginal <- imdb_movies %>%
  group_by(movie_period) %>%
  summarize(count = n(),
            total = nrow(imdb_movies),
            proportion = round(count / total, 4),
            percentage = proportion * 100)
movie_period_marginal

Report your general observations in one sentence for each variable.

[Medium films are most common, followed by Long, with Short the least common. Most titles are Recent (1980–2000) or New (2000–2018), and Old is the smallest category.]

(5pts) Marginal distributions are usually communicated with proportions or percentages; however, we’ve only discussed how to make bar plots in terms of counts instead of percentages. Now we’ll learn how to make bar plots in terms of percentages.

The goal of this question is to make a bar chart of duration on the percentage scale instead of the count scale. In other words, we’d like to make a barplot where duration is on the x-axis and the percentage of each duration is on the y-axis. Note that duration_marginal has exactly this information - so, let’s use this dataset to make this bar chart. Alter the template code below such that you’ve created a bar chart with one bar for each duration whose height corresponds to the percentage of each duration. After you’ve made the plot, add appropriate titles, labels, and a non-default color.

duration_marginal %>%
  ggplot(aes(x = duration, y = percentage)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Distribution of Movie Duration",
       x = "Duration",
       y = "Percentage of films") +
  theme_minimal()

Note that, in this case, we needed to specify stat = "identity" within geom_bar(), because we want the height of the bars to correspond to whatever we put into y. This differs from the default stat = "count" within geom_bar().
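As a side note, ggplot2 also provides geom_col(), which uses the y values as bar heights without needing stat = "identity". The sketch below rebuilds a small data frame from the percentages printed earlier (the name pct is introduced here only for illustration) and draws the bars both ways.

```r
library(ggplot2)

# Percentages copied from the duration_marginal output shown earlier
pct <- data.frame(
  duration   = factor(c("Short", "Medium", "Long"),
                      levels = c("Short", "Medium", "Long")),
  percentage = c(7.19, 53.6, 39.2)
)

# Bar heights taken directly from y, written two equivalent ways
p_identity <- ggplot(pct, aes(x = duration, y = percentage)) +
  geom_bar(stat = "identity")
p_col <- ggplot(pct, aes(x = duration, y = percentage)) +
  geom_col()

p_col
```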

(5pts) Your friend thinks there are equal proportions of short, medium, and long movies in this dataset. Using your bar chart from Part C, do you think your friend is right? (This time, just use your common statistical sense. Given the sample size and the difference in the bars, do you think there is probably a significant difference?)

[No, the proportions are clearly not equal: Medium-length films dominate (about 54%), while Short films are the least common (about 7%). With 737 films, differences this large suggest a real difference in proportions.]

(10pts) Let’s test this statistically. Run a chi-square test to check your friend’s assertion: chisq.test(duration_marginal$count) will check to see if the proportions in each category are the same.

chisq_out <- chisq.test(duration_marginal$count)
chisq_out

## 
##  Chi-squared test for given probabilities
## 
## data:  duration_marginal$count
## X-squared = 249.52, df = 2, p-value < 2.2e-16
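To make the test less of a black box, the statistic can be reproduced by hand. The sketch below hard-codes the counts from the duration_marginal output and assumes the null hypothesis of equal proportions (one third of the films in each category); the variable names are illustrative.

```r
# Observed counts copied from the duration_marginal output
observed <- c(Short = 53, Medium = 395, Long = 289)

# Under the null hypothesis, every category has the same expected count
expected <- rep(sum(observed) / length(observed), length(observed))

# Chi-squared statistic: sum of (observed - expected)^2 / expected
x_squared <- sum((observed - expected)^2 / expected)

# Degrees of freedom: number of categories minus one
df <- length(observed) - 1

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

round(x_squared, 2)
p_value
```

The statistic should agree with the X-squared value reported by chisq.test(), and the p-value falls far below the 2.2e-16 threshold that R prints.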

After you run the test, answer the following questions:

What is the p-value for this test?

[p < 2.2 × 10⁻¹⁶]

What is your formal conclusion from this test?

[The p-value is extremely small (p < 2.2 × 10⁻¹⁶), so we reject the null hypothesis of equal proportions. There are statistically significant differences among the proportions of Short, Medium, and Long films.]

(5pts) It can be helpful to report p-values on graphs, so that it’s clear whether what we’re looking at is “statistically significant” in some sense. Now we will take the same plot you made in Part C, and report the p-value on top of the graph. For this part, follow these steps:

First, copy-and-paste your code from Part C and place it here.
Now add the following line of code: + geom_text(x = ?, y = ?, label = ?). Here, you should specify x and y as the x- and y-coordinates where you want your text, and label as the text itself in quotes. (Note: Because the x-axis is categorical, you can specify x as one of the categories or as a number.)
In particular, specify label as your p-value from your chi-squared test in Part E (e.g., label = "chi-squared test p-value = ?"), with ? appropriately specified. If you got an extremely small p-value, it’s fine if you just state “chi-squared test p-value approximately zero.” If you need to put your text on two lines, you can use \n within your label to create a line break.

For this part, you just need to have the exact same plot as Part C, but with the desired text added.

pval_text <- paste0("Chi-square p = ", formatC(chisq_out$p.value, format = "e", digits = 2))
y_top <- max(duration_marginal$percentage) * 1.05  

duration_marginal %>%
  ggplot(aes(x = duration, y = percentage)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  annotate("text", x = 2, y = y_top, label = pval_text) +  
  labs(title = "Distribution of Movie Duration (with χ² p-value)",
       x = "Duration", y = "Percentage of films") +
  theme_minimal() +
  coord_cartesian(ylim = c(0, max(duration_marginal$percentage) * 1.12))

(10pts) In the previous part you used the chi-squared test to assess if there is a significantly different number of movies in each duration category. We can also use confidence intervals to assess statistical significance.

Notice that the data frame duration_marginal already has the proportions in each duration category and the total number of movies, which is exactly what you need to compute the confidence intervals for each proportion. Fill in the code below to display the lower and upper bounds for the 95% confidence intervals for the proportions:

# Lower bound of the 95% CI for each proportion
duration_marginal$proportion - qnorm(0.975) * sqrt(duration_marginal$proportion * (1 - duration_marginal$proportion) / duration_marginal$total)

## [1] 0.05325011 0.49999559 0.35685246

# Upper bound of the 95% CI for each proportion
duration_marginal$proportion + qnorm(0.975) * sqrt(duration_marginal$proportion * (1 - duration_marginal$proportion) / duration_marginal$total)

## [1] 0.09054989 0.57200441 0.42734754

Hint: We talked about how to compute the confidence interval for a proportion in Concept 3, so it may be helpful to revisit your notes. Remember that the confidence interval involves the 1 − alpha/2 quantile from a standard Normal distribution. In this case, alpha = 0.05, so this is the 0.975 quantile. The quantile can be computed, and then used to build the intervals, as:

z <- qnorm(0.975)
duration_marginal <- duration_marginal %>%
  mutate(
    se     = sqrt(proportion * (1 - proportion) / total),
    ci_low = proportion - z * se,
    ci_high= proportion + z * se
  )
duration_marginal %>% select(duration, proportion, ci_low, ci_high)

## # A tibble: 3 × 4
##   duration proportion ci_low ci_high
##   <fct>         <dbl>  <dbl>   <dbl>
## 1 Short        0.0719 0.0533  0.0905
## 2 Medium       0.536  0.500   0.572 
## 3 Long         0.392  0.357   0.427

The quantile qnorm(0.975) is the “classic” 1.96 (sometimes rounded to 2).
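Whether two intervals overlap can also be checked programmatically rather than by eye. The sketch below recomputes the intervals from the counts in the duration_marginal output (using unrounded proportions, so the bounds may differ slightly in the last digit) and tests every pair; the helper overlaps() is a hypothetical name introduced for illustration.

```r
# Counts copied from the duration_marginal output
counts <- c(Short = 53, Medium = 395, Long = 289)
n      <- sum(counts)
p_hat  <- counts / n

# Normal-approximation 95% interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
z  <- qnorm(0.975)
se <- sqrt(p_hat * (1 - p_hat) / n)
ci <- cbind(lower = p_hat - z * se, upper = p_hat + z * se)
round(ci, 3)

# Two intervals overlap when each one starts before the other one ends
overlaps <- function(a, b) {
  ci[a, "lower"] <= ci[b, "upper"] && ci[b, "lower"] <= ci[a, "upper"]
}

# Check all pairs of categories
pairs <- combn(rownames(ci), 2)
overlap_flags <- apply(pairs, 2, function(p) overlaps(p[1], p[2]))
data.frame(first = pairs[1, ], second = pairs[2, ], overlap = overlap_flags)
```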

After you’ve successfully displayed your confidence intervals: Which confidence intervals don’t overlap? What does this suggest, and how does this differ from the conclusion we could make from the chi-squared test?

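[None of the three confidence intervals overlap: Short (0.05 – 0.09), Long (0.36 – 0.43), and Medium (0.50 – 0.57) are all clearly separated. This suggests that each pair of proportions differs significantly, whereas the chi-squared test only tells us that the three proportions are not all equal, without indicating which categories differ.]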

Problem 2: 2D Bar Charts and Identifying Marginal Distributions (25 points)

(10pts) Here we’ll walk through how to make a “stacked bar plot.” In short, a “stacked bar plot” is simply a bar plot, where each bar is in turn a spine chart; as a result, it visualizes two categorical variables. We’ll again use the imdb_movies dataset from Problem 1. Follow these steps:

First, make a bar graph of weekend. Make the color of the bars something other than gray by specifying the fill variable.
Now we want to color the bars by the ratings variable; in other words, we want to “fill” the bars according to the ratings variable. So, set fill = ratings in your code.
At this point, you probably get an error if you run your code. Notice that you probably have fill = ratings within geom_bar() (which is fine!). However, ggplot requires all variable names to be within the aes() function; the reason is that aesthetic mappings “describe how variables in the data are mapped to visual properties” (to quote the help documentation for that function). So, if you have fill = ratings outside of aes(), ggplot is saying, “What is ratings? It must not be a variable in the imdb_movies dataset, because ratings isn’t inside aes()…so I’m confused.” To fix this issue, either (1) write aes(fill = ratings) within geom_bar(), or (2) put fill = ratings within the aes() already inside ggplot(). Either should work fine!
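The difference between setting a constant color and mapping a variable can be sketched on the mpg dataset that ships with ggplot2. It is used here only as a stand-in for imdb_movies, with class playing the role of weekend and drv the role of ratings.

```r
library(ggplot2)

# Constant color: fill is set outside aes(), so every bar gets the same color
p_constant <- ggplot(mpg, aes(x = class)) +
  geom_bar(fill = "steelblue")

# Option (1): map the variable inside aes() within geom_bar()
p_layer <- ggplot(mpg, aes(x = class)) +
  geom_bar(aes(fill = drv))

# Option (2): map the variable inside the aes() already in ggplot()
p_global <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar()

p_global
```

Options (1) and (2) draw the same stacked bars. The practical difference is that a mapping placed in ggplot() is inherited by every layer, while one placed in geom_bar() applies to that layer only.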

The goal of this problem is to make a stacked bar plot, with bars for weekend and different colors for ratings. If you get stuck here, please ask your peers and/or TAs for help.

imdb_movies %>%
  ggplot(aes(x = weekend, fill = ratings)) +
  geom_bar() +
  labs(
    title = "Ratings by Weekend vs Workday",
    x = "Day type",
    y = "Number of ratings",
    fill = "Ratings"
  ) +
  theme_minimal()

(5pts) Use your graph from Part A to discuss the marginal distribution of weekend and the marginal distribution for ratings.

[The marginal distribution of weekend (total bar heights) is fairly similar for Weekend and Workday. The marginal distribution of ratings shows High ratings make up the largest share overall (largest stacked segments across bars), followed by Med, with Low the smallest.]

(10pts) Now make a stacked bar chart that involves movie_period and day_of_week such that you think it’s easiest to describe the marginal distribution of movie_period. After you’ve made your graph, explain in 1-2 sentences why you made the choices you made - i.e., how the choices you made allow you to most easily describe the marginal distribution of movie_period. (For this part, there’s no need to actually describe the marginal distribution.)

imdb_movies %>%
  ggplot(aes(x = movie_period, fill = day_of_week)) +
  geom_bar() +
  labs(
    title = "Day of Week within Movie Period",
    x = "Movie period",
    y = "Count",
    fill = "Day of week"
  ) +
  theme_minimal()

[Placing movie_period on the x-axis makes each bar’s height equal its marginal count, so the marginal distribution of movie_period is immediately readable from bar heights. Coloring by day_of_week adds the second variable without obscuring those totals.]

Problem 3: Adding More Categories with Facetting (30 points)

In addition to visualizing a single categorical variable, we can also create the same visualization for different categories, thereby allowing for flexible multivariate graphics. This is known as “facetting,” where the data is grouped according to some categorical variable, and then we create the same graphic for each group. The resulting graphs are typically displayed in a grid, where each graph is a single “facet” of the full graphic.

This is a popular way to show how the features of the variable(s) being displayed in a particular graphic can change depending on some other variable (usually a categorical variable).

In Problem 2 you made a stacked bar plot for weekend and ratings. Alternatively, you could make two different bar plots for ratings, corresponding to the two different values for weekend. We’ll demonstrate how to do this using facetting.

(10pts) To make a bar plot of ratings facetted by weekend, follow these steps:

First, make a bar plot of ratings the standard way (where fill is specified as a color within geom_bar()).
Then, add + facet_grid(~ weekend) to your existing line of code.

imdb_movies %>%
  ggplot(aes(x = ratings)) +
  geom_bar(fill = "gray60") +
  facet_grid(~ weekend) +
  labs(
    title = "Ratings facetted by Weekend",
    x = "Ratings", y = "Count"
  ) +
  theme_minimal()

You should see two bar plots graphed together, one for each value of weekend. After you’ve successfully made your plot, consider the following. When we have two categorical variables, there are four distributions we should consider: The marginal distribution of ratings, the marginal distribution of weekend, the conditional distribution of ratings given weekend, and the conditional distribution of weekend given ratings. (We’re focusing on marginal distributions this week - we’ll discuss conditional distributions next week.) Which plot makes it easier to see both marginal distributions: The facetted bar plot here, or the stacked bar plot in 2A? Explain your reasoning in 1-3 sentences.

[The facetted version makes both marginals easier to see than the stacked plot: each panel shows the ratings distribution, and comparing total heights across panels reveals the weekend marginal.]

(10pts) You can also display marginal distributions by adding margins = TRUE within facet_grid(). So, copy-and-paste your code from Part A and then add margins = TRUE within facet_grid(). Your submission should display a third bar plot, labeled as “(all)”. This displays a marginal distribution - the marginal distribution of what?

imdb_movies %>%
  ggplot(aes(x = ratings)) +
  geom_bar(fill = "gray60") +
  facet_grid(~ weekend, margins = TRUE) +  # note: 'margins', not 'margin'
  labs(
    title = "Ratings facetted by Weekend with Marginal Panel",
    x = "Ratings", y = "Count"
  ) +
  theme_minimal()

[The “(all)” panel shows the marginal distribution of ratings aggregated over both weekend categories (i.e., ignoring weekend).]
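One way to convince yourself of what the “(all)” panel contains is to inspect the data behind the plot. The sketch below again uses the mpg dataset from ggplot2 as a stand-in (class in place of ratings, drv in place of weekend) and relies on layer_data() to extract the computed bar heights.

```r
library(ggplot2)

# Bar plot of class, facetted by drv, with an extra "(all)" panel
p <- ggplot(mpg, aes(x = class)) +
  geom_bar() +
  facet_grid(~ drv, margins = TRUE)

# One row per bar, with the panel it belongs to and its height (count)
bars <- layer_data(p)

# Three drv panels plus the "(all)" panel
length(unique(bars$PANEL))

# Every car is counted once in its drv panel and once more in "(all)",
# so the bar heights add up to twice the number of rows
sum(bars$count)
nrow(mpg)
```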

(5pts) We can also facet stacked bar plots, thereby allowing for even more dimensions. Recreate the graph in Problem 2C, but this time, facet on weekend (don’t include margins = TRUE). For this part, all you have to do is make the desired graph. (You should end up with multiple facets of stacked bar plots, where each stacked bar plot involves movie_period and day_of_week.)

imdb_movies %>%
  ggplot(aes(x = movie_period, fill = day_of_week)) +
  geom_bar() +
  facet_grid(~ weekend) +
  labs(
    title = "Movie Period × Day of Week, facetted by Weekend",
    x = "Movie period", y = "Count", fill = "Day of week"
  ) +
  theme_minimal()

(5pts) We can actually facet on multiple categorical variables at once. Recreate the graph in Problem 2C again, but this time facet by both weekend and ratings by adding facet_grid(weekend ~ ratings). We won’t ask you to interpret this graph here, since there are four dimensions being plotted in this visual. The purpose of this problem is just to demonstrate that it’s pretty easy to make extensions to bar plots that allow for 3D and 4D graphs!

imdb_movies %>%
  ggplot(aes(x = movie_period, fill = day_of_week)) +
  geom_bar() +
  facet_grid(weekend ~ ratings) +
  labs(
    title = "Movie Period × Day of Week facetted by Weekend and Ratings",
    x = "Movie period", y = "Count", fill = "Day of week"
  ) +
  theme_minimal()

Lab 1: Visualizations for Categorical Data

Humera Inayat
