1 Grading visualisations

Using the Data Visualisation checklist (under session1 files), please ‘grade’ Figure 3 of the JPMorgan Chase Institute on Year-over-year percent change in credit card spending by essential category. If you wanted the actual paper on which this publication was based, you can find it here.

1.1 Guideline

Dimension Score Comments
G1 0 Not left-justified, and also title “year over year” doesn’t make sense as the graph is in weeks.
G2 1 information about “national emergency”
G3 2 Sizing is corresponding
G4 2
G5 1 Data is represented in a separable legend, but also difficult to incorporate
G6 2 labels are clear
TOTAL 8

1.2 Arrangement

Dimension Score Comments
A1 2 Data is proportionate in visual representation
A2 2
A3 2 Clear on both axes
A4 2
A5 2 Clear
TOTAL 10

1.3 Colour

Dimension Score Comments
C1 0 Colors of the lines do not appear to have any meaningful connection
C2 0 e.g. groceries, which spark, are represented in a light faded color, instead of solid
C3 0 Cannot be told apart in greyscale
C4 0 Would not work for colorblind people
C5 2
TOTAL 2

1.4 Lines

Dimension Score Comments
L1 1 Gridlines are muted
L2 2
L3 2
L4 2
TOTAL 7

1.5 Overall

Dimension Score Comments
O1 1 Doesn’t really argue “so what”, plus given the color issues, it’s hard to quickly understand the point the graph is trying to make
O2 2 Yes, because its time series
O3 1 confusing, “year over year” change, but is presented in months.
O4 1 The main message is not very clear from the actual graph, due to different issues pointed out above. Only when investigating the graph, and reading the metadata, you can “hint” what its trying to tell, the main message however, is in the text..
TOTAL 5

2 Prepare the worst possible visualisation

# Load the gapminder package
library(gapminder)
# Load the tidyverse package
library(tidyverse)
# Load the skimr package
library(skimr)
# Load the viridis package
library(viridis)
glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp   <dbl> 28.8, 30.3, 32.0, 34.0, 36.1, 38.4, 39.9, 40.8, 41.7, 41.8, …
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779, 821, 853, 836, 740, 786, 978, 852, 649, 635, 727, 975, …
ggplot(data = filter(gapminder), 
       aes(x = gdpPercap, y = lifeExp, color=year)) +
  geom_point() +
  facet_wrap(~continent) +
  labs(title = "GDP per Capita vs Life Expectancy (2007)",
       x = "GDP",
       y = "Life Expectancy")

3 Rents in San Francsisco Bay Area 2000-2018

Kate Pennington created a panel of historic Craigslist rents by scraping posts archived by the Wayback Machine. You can read more about her work here

What impact does new housing have on rents, displacement, and gentrification in the surrounding neighborhood? Read our interview with economist Kate Pennington about her article, “Does Building New Housing Cause Displacement?:The Supply and Demand Effects of Construction in San Francisco.”

In our case, we have a clean(ish) dataset with about 200K rows that correspond to Craigslist listings for renting properties in the greater SF area. The data dictionary is as follows

variable class description
post_id character Unique ID
date double date
year double year
nhood character neighborhood
city character city
county character county
price double price in USD
beds double n of beds
baths double n of baths
sqft double square feet of rental
room_in_apt double room in apartment
address character address
lat double latitude
lon double longitude
title character title of listing
descr character description
details character additional details

The dataset was used in a recent tidyTuesday project.

# download directly off tidytuesdaygithub repo

rent <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-07-05/rent.csv')

What are the variable types? Do they all correspond to what they really are? Which variables have most missing values?

rent %>% summarise_all(class)
## # A tibble: 1 × 17
##   post_id   date    year  nhood city  county price beds  baths sqft  room_in_apt
##   <chr>     <chr>   <chr> <chr> <chr> <chr>  <chr> <chr> <chr> <chr> <chr>      
## 1 character numeric nume… char… char… chara… nume… nume… nume… nume… numeric    
## # ℹ 6 more variables: address <chr>, lat <chr>, lon <chr>, title <chr>,
## #   descr <chr>, details <chr>
head(rent)
## # A tibble: 6 × 17
##   post_id      date  year nhood city  county price  beds baths  sqft room_in_apt
##   <chr>       <dbl> <dbl> <chr> <chr> <chr>  <dbl> <dbl> <dbl> <dbl>       <dbl>
## 1 pre2013_1… 2.01e7  2005 alam… alam… alame…  1250     2     2    NA           0
## 2 pre2013_1… 2.01e7  2005 alam… alam… alame…  1295     2    NA    NA           0
## 3 pre2013_1… 2.00e7  2004 alam… alam… alame…  1100     2    NA    NA           0
## 4 pre2013_6… 2.01e7  2012 alam… alam… alame…  1425     1    NA   735           0
## 5 pre2013_1… 2.00e7  2004 alam… alam… alame…   890     1    NA    NA           0
## 6 pre2013_1… 2.01e7  2006 alam… alam… alame…   825     1    NA    NA           0
## # ℹ 6 more variables: address <chr>, lat <dbl>, lon <dbl>, title <chr>,
## #   descr <chr>, details <chr>
missing_counts <- sapply(rent, function(x) sum(is.na(x)))
print(missing_counts)
##     post_id        date        year       nhood        city      county 
##           0           0           0           0           0        1394 
##       price        beds       baths        sqft room_in_apt     address 
##           0        6608      158121      136117           0      196888 
##         lat         lon       title       descr     details 
##      193145      196484        2517      197542      192780

when inspecting the data using summarise_all, we can see that the variable types are correct; also that they correspond to what they are. By using “head”, I can see that the data corresponds. E.g. rent is in dollars, sqft also corresponds etc. By inspecting missing_counts, which I created, I can tell that there’s a lot of missing values in baths, sqft, details, address, and longitude/latitude.

Make a plot that shows the top 20 cities in terms of % of classifieds between 2000-2018. You need to calculate the number of listings by city, and then convert that number to a %.

The final graph should look like this

# YOUR CODE GOES HERE

glimpse(rent)
## Rows: 200,796
## Columns: 17
## $ post_id     <chr> "pre2013_134138", "pre2013_135669", "pre2013_127127", "pre…
## $ date        <dbl> 20050111, 20050126, 20041017, 20120601, 20041021, 20060411…
## $ year        <dbl> 2005, 2005, 2004, 2012, 2004, 2006, 2007, 2017, 2009, 2006…
## $ nhood       <chr> "alameda", "alameda", "alameda", "alameda", "alameda", "al…
## $ city        <chr> "alameda", "alameda", "alameda", "alameda", "alameda", "al…
## $ county      <chr> "alameda", "alameda", "alameda", "alameda", "alameda", "al…
## $ price       <dbl> 1250, 1295, 1100, 1425, 890, 825, 1500, 2925, 450, 1395, 1…
## $ beds        <dbl> 2, 2, 2, 1, 1, 1, 1, 3, NA, 2, 2, 5, 4, 0, 4, 1, 3, 3, 1, …
## $ baths       <dbl> 2, NA, NA, NA, NA, NA, 1, NA, 1, NA, NA, NA, 3, NA, NA, NA…
## $ sqft        <dbl> NA, NA, NA, 735, NA, NA, NA, NA, NA, NA, NA, 2581, 1756, N…
## $ room_in_apt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ address     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ lat         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 37.5, NA, …
## $ lon         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ title       <chr> "$1250 / 2br - 2BR/2BA   1145 ALAMEDA DE LAS PULGAS", "$12…
## $ descr       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ details     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "<p class=…
top_craigslist_cities <- rent %>%
  count(city, sort = TRUE) %>%              
  mutate(
    percentage = (n / sum(n)) * 100,        
    rank = row_number()                     
  ) %>%
  slice_head(n = 20)                       

glimpse(top_craigslist_cities)
## Rows: 20
## Columns: 4
## $ city       <chr> "san francisco", "san jose", "oakland", "santa rosa", "sant…
## $ n          <int> 55730, 13733, 9443, 6230, 5464, 5127, 4526, 4414, 4201, 417…
## $ percentage <dbl> 27.75, 6.84, 4.70, 3.10, 2.72, 2.55, 2.25, 2.20, 2.09, 2.08…
## $ rank       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
top_craigslist_cities %>%
  slice_max(order_by = percentage, n = 20) %>%  
  ggplot(aes(x = reorder(city, percentage), y = percentage)) + 
  geom_col() +
  coord_flip() +
  theme_minimal(base_size = 6) +
  scale_y_continuous(labels = function(x) paste0(x, "%")) +  
  labs(
    title = "San Francisco accounts for more than a quarter of all rent classifieds",
    subtitle= "% of Craigslist listings, 2000-2018",
    x= "",
    y=""
  ) + 
  theme(
    plot.title.position = "plot",       
    plot.title = element_text(hjust = 0), 
    plot.subtitle = element_text(hjust = 0) 
  )

Make a plot that shows the evolution of median prices in San Francisco for 0, 1, 2, and 3 bedrooms listings. The final graph should look like this

# YOUR CODE GOES HERE

sf_rentals <- rent %>%
  filter(city == "san francisco", 
         beds %in% c(0, 1, 2, 3),
         !is.na(price), 
         !is.na(year)) %>%
  group_by(year, beds) %>%
  summarise(median_price = median(price), .groups = "drop")


sf_rentals %>%
  ggplot(aes(x = year, y = median_price)) +
  geom_line(aes(color = factor(beds)), size = 0.8) +
  facet_wrap(~ beds, ncol = 4) +
  theme_gray() +
  theme(
    legend.position = "none",
    strip.text = element_text(size = 8, color = "black"),
    strip.background = element_rect(fill = "gray80", color = "gray60"),
    panel.grid.major = element_line(color = "white", size = 0.6),
    panel.grid.minor = element_line(color = "white", size = 0.3),
    panel.background = element_rect(fill = "gray96"),
    axis.text = element_text(size = 9),
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.background = element_rect(fill = "white"),
    plot.title = element_text(size = 10, face = "bold"),
    plot.subtitle = element_text(size = 7, color = "gray30")
  ) +
  scale_color_manual(values = c("0" = "#FF6B6B", "1" = "#8FBC8F", "2" = "#4DD0E1", "3" = "#BA68C8")) +
  labs(
    title = "San Francisco rents have been steadily increasing",
    subtitle = "0 to 3-bed listings, 2000-2018",
    x = "",
    y = "",
    caption = "Source:"
  ) +
  theme(
    plot.title.position = "plot",       
    plot.title = element_text(hjust = 0), 
    plot.subtitle = element_text(hjust = 0) 
  )

I tried to match the colors from your example, but i didn’t manage to succeed.

Finally, make a plot that shows median rental prices for the top 12 cities in the Bay area. Your final graph should look like this

# YOUR CODE GOES HERE

bay_area <- rent %>%
  filter(city %in% c("berkeley", "mountain view", "oakland", "palo alto", 
                      "san francisco", "san jose", "san mateo", "santa clara", 
                      "santa cruz", "santa rosa", "sunnyvale", "union city"),
         beds %in% c(1),
         !is.na(price), 
         !is.na(year)) %>%
  group_by(year, city) %>%
  summarise(mean_price = mean(price), .groups = "drop")


bay_area %>%
  ggplot(aes(x = year, y = mean_price)) +
  geom_line(aes(color = factor(city)), size = 0.5) +
  facet_wrap(~ city, ncol = 4) +
  theme_gray() +
  theme(
    legend.position = "none",
    strip.text = element_text(size = 8, color = "black"),
    strip.background = element_rect(fill = "gray80", color = "gray60"),
    panel.grid.major = element_line(color = "white", size = 0.6),
    panel.grid.minor = element_line(color = "white", size = 0.3),
    panel.background = element_rect(fill = "gray96"),
    axis.text = element_text(size = 9),
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.background = element_rect(fill = "white"),
    plot.title = element_text(size = 10, face = "bold")
  ) +
  labs(
    title = "Rental prices for 1-bedroom flats in the Bay Area",
    x = "",
    y = "",
    caption = "Source:"
  ) +
  theme(
    plot.title.position = "plot",       
    plot.title = element_text(hjust = 0), 
    plot.subtitle = element_text(hjust = 0) 
  )

What can you infer from these plots? Don’t just explain what’s in the graph, but speculate or tell a short story (1-2 paragraphs max).

**looking at the graphs, it’s evident that specific areas have experienced significant rent price increases. However, there’s a general trend across all towards rental prices increases from 2010 onwards. What’s intersting too, is that a range of the cities also experience a dip in rental prices in 2018. There can be multiple explanations to this, one would be insufficient data from 2018.

4 Hollywood Age Gap

The website https://hollywoodagegap.com is a record of THE AGE DIFFERENCE IN YEARS BETWEEN MOVIE LOVE INTERESTS. This is an informational site showing the age gap between movie love interests and the data follows certain rules:

  • The two (or more) actors play actual love interests (not just friends, coworkers, or some other non-romantic type of relationship)
  • The youngest of the two actors is at least 17 years old
  • No animated characters

The age gaps dataset includes “gender” columns, which always contain the values “man” or “woman”. These values appear to indicate how the characters in each film identify and some of these values do not match how the actor identifies. We apologize if any characters are misgendered in the data!

The following is a data dictionary of the variables used

variable class description
movie_name character Name of the film
release_year integer Release year
director character Director of the film
age_difference integer Age difference between the characters in whole years
couple_number integer An identifier for the couple in case multiple couples are listed for this film
actor_1_name character The name of the older actor in this couple
actor_2_name character The name of the younger actor in this couple
character_1_gender character The gender of the older character, as identified by the person who submitted the data for this couple
character_2_gender character The gender of the younger character, as identified by the person who submitted the data for this couple
actor_1_birthdate date The birthdate of the older member of the couple
actor_2_birthdate date The birthdate of the younger member of the couple
actor_1_age integer The age of the older actor when the film was released
actor_2_age integer The age of the younger actor when the film was released
age_gaps <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-02-14/age_gaps.csv')

How would you explore this data set? Here are some ideas of tables/ graphs to help you with your analysis

  • How is age_difference distributed? What’s the ‘typical’ age_difference in movies?
age_gap_summary <- age_gaps %>%
  summarise(
    min_diff = min(age_difference, na.rm = TRUE),
    max_diff = max(age_difference, na.rm = TRUE),
    mean_diff = mean(age_difference, na.rm = TRUE),
    median_diff = median(age_difference, na.rm = TRUE),
    sd_diff = sd(age_difference, na.rm = TRUE),
    q25 = quantile(age_difference, 0.25, na.rm = TRUE),
    q75 = quantile(age_difference, 0.75, na.rm = TRUE)
  )
print(age_gap_summary)
## # A tibble: 1 × 7
##   min_diff max_diff mean_diff median_diff sd_diff   q25   q75
##      <dbl>    <dbl>     <dbl>       <dbl>   <dbl> <dbl> <dbl>
## 1        0       52      10.4           8    8.51     4    15

Based on my analysis above, I can see that the median age gap, which I would argue is the most typical as the median represents that better, is 8 years, naturally with more extreme cases.

  • The half plus seven\ rule. Large age disparities in relationships carry certain stigmas. One popular rule of thumb is the half-your-age-plus-seven rule. This rule states you should never date anyone under half your age plus seven, establishing a minimum boundary on whom one can date. In order for a dating relationship to be acceptable under this rule, your partner’s age must be:

\[\frac{\text{Your age}}{2} + 7 < \text{Partner Age} < (\text{Your age} - 7) * 2\] How frequently does this rule apply in this dataset?

#to calculate the number of relationships that follow the half plus seven rule, I begin by calculating the min and max age for both actors, then check if they fall within the range. I use the mutate function to transfer the calculations into new variables. I then use the filter to keep the rows in which the requirement is satisfied.
seven_year_rule_couples <- age_gaps %>%
  mutate(
    actor_1_min_age = (actor_1_age / 2) + 7,
    actor_1_max_age = (actor_1_age - 7) * 2,
    actor_2_min_age = (actor_2_age / 2) + 7,
    actor_2_max_age = (actor_2_age - 7) * 2,
    actor_1_within_rule = actor_2_age >= actor_1_min_age & actor_2_age <= actor_1_max_age,
    actor_2_within_rule = actor_1_age >= actor_2_min_age & actor_1_age <= actor_2_max_age
  ) %>%
  filter(actor_1_within_rule & actor_2_within_rule) %>%
  summarise(n_couples = n())

seven_year_rule_couples
## # A tibble: 1 × 1
##   n_couples
##       <int>
## 1       829

The amount of relationships that follow the half plus seven rule is 829

  • Which movie has the greatest number of love interests?
most_romantic_movies <- age_gaps %>%
  count(movie_name, sort = TRUE)

head(most_romantic_movies)
## # A tibble: 6 × 2
##   movie_name                      n
##   <chr>                       <int>
## 1 Love Actually                   7
## 2 The Family Stone                6
## 3 A View to a Kill                5
## 4 He's Just Not That Into You     5
## 5 Mona Lisa Smile                 5
## 6 A Star Is Born                  4

As visible above, the movie with the greatest number of love interests is Love Actually, perhaps hinted by the name - Which actors/ actresses have the greatest number of love interests in this dataset?

#Below I begin by selecting the names of the two actors. I then pivot the data longer, so that I have a single column with all the actors names. Finally I count the occurences of each actor, and sort them in descending order.
number_of_love_interests <- age_gaps %>%
  select(actor_1_name, actor_2_name) %>%
  tidyr::pivot_longer(cols = c(actor_1_name, actor_2_name),
                      names_to = "role", values_to = "actor") %>%
  count(actor, sort = TRUE)

head(number_of_love_interests)
## # A tibble: 6 × 2
##   actor                 n
##   <chr>             <int>
## 1 Keanu Reeves         27
## 2 Adam Sandler         20
## 3 Leonardo DiCaprio    17
## 4 Roger Moore          17
## 5 Sean Connery         17
## 6 Keira Knightley      14

The number of love interests are now ranked in the dataset above, and show that Keanu Reeves is the actor with the most

    
-   Is the mean/median age difference staying constant over the years
    (1935 - 2022)?
  
  




``` r
age_diff_trend <- age_gaps %>%
  group_by(release_year) %>%
  summarise(
    mean_diff = mean(age_difference, na.rm = TRUE),
    median_diff = median(age_difference, na.rm = TRUE),
    n = n()
  )

age_diff_trend
## # A tibble: 82 × 4
##    release_year mean_diff median_diff     n
##           <dbl>     <dbl>       <dbl> <int>
##  1         1935     13           13       2
##  2         1936     21           21       1
##  3         1937      7.33         9       3
##  4         1939     12           12       1
##  5         1940     11.3         10       3
##  6         1942     20.5         20.5     2
##  7         1944     25           25       1
##  8         1946     25           25       1
##  9         1947     25           25       1
## 10         1948     23.2         25       4
## # ℹ 72 more rows
ggplot(age_diff_trend, aes(x = release_year)) +
  geom_line(aes(y = mean_diff, color = "Mean"), size = 1) +
  geom_line(aes(y = median_diff, color = "Median"), size = 1, linetype = "dashed") +
  labs(
    title = "Trend of Age Differences in Hollywood (1935–2022)",
    x = "Release Year",
    y = "Age Difference (years)",
    color = ""
  ) +
  theme_minimal()

As seen in the plot above, there’s no trend - nor is the age different constant across the years

  • How frequently does Hollywood depict same-gender love interests?
same_gender_stats <- age_gaps %>%
  mutate(same_gender = character_1_gender == character_2_gender) %>%
  summarise(
    total_couples = n(),
    same_gender = sum(same_gender, na.rm = TRUE),
    pct_same_gender = (same_gender/total_couples) * 100
  )

same_gender_stats
## # A tibble: 1 × 3
##   total_couples same_gender pct_same_gender
##           <int>       <int>           <dbl>
## 1          1155          23            1.99
same_gender_pie <- age_gaps %>%
  mutate(same_gender = ifelse(character_1_gender == character_2_gender,
                              "Same-gender", "Different-gender")) %>%
  count(same_gender)


ggplot(same_gender_pie, aes(x = "", y = n, fill = same_gender)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  labs(
    title = "Same-Gender vs Different-Gender Couples",
    fill = "Couple type"
  ) +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5))

Above are the results of my calculations. About 2% of the whole dataset are same-gender, i.e. 23 love interests. For ONCE perhaps, I thought that a pie chart was illustrative - at least if you, for whatever reason, wish to display, that the amount of same gender love interests is predominant

5 Challenge 1: Replicating a chart

The purpose of this exercise is to reproduce a plot using your dplyr and ggplot2 skills. It builds on exercise 1, the San Francisco rentals data.

You have to create a graph that calculates the cumulative % change in median rental prices for 0-, 1-, and 2-bed flats between 2000 and 2018 for the top twelve cities in Bay Area, by number of ads that appeared in Craigslist. Your final graph should look like below. You may find dplyr::first() a useful function

I couldn’t manage to get the spacing right, i.e. make the smaller charts wider, however it should resemble the data correctly.

Deliverable - a knitted HTML, displaying code and output.