This week I’ve been practicing my data viz with some Tidy Tuesday data. If you are looking for interesting datasets to try out your data viz skills on, the TidyTuesday repo is a really good place to start. Each week Tom Mock posts a dataset to the GitHub repo, #rstats people from all over have a go at making a visualisation and then they post their plots to Twitter with the hashtag #TidyTuesday.
I will try and model the kind of commenting we want to see you doing as you document the code in your learning logs and verification reports.
note: this Rmd theme comes from the prettydoc package
load packages
Here I am using the tidyverse package, which includes ggplot2 for data vis and dplyr for data wrangling. I’m also using the janitor package, which is useful for counting frequencies, and the gt package, which makes nice tables.
I really dislike the default grey background with gridlines theme (theme_grey) that ggplot uses, so I am using theme_set(theme_classic()) here to change the default theme for all of my plots in this Rmd doc.
library(tidyverse)
library(janitor)
library(gt)
theme_set(theme_classic())
read in the data
The nice thing about TidyTuesday data is that you can read the data straight from the TidyTuesday repo using the URL. This dataset is about movies and whether they pass the Bechdel test. The Bechdel test is a test of gender representation in film. Movies get a score of 0, 1, 2, or 3, depending on how many of three criteria they meet: there are two named female characters in the film, they have a conversation with each other, and that conversation is not about a man.
Here I am reading the data using the read_csv() function and using glimpse() to get an idea of what variables are in the dataset and what kind of data R thinks each variable is.
bechdel <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-09/raw_bechdel.csv')
glimpse(bechdel)
## Rows: 8,839
## Columns: 5
## $ year <dbl> 1888, 1892, 1895, 1895, 1896, 1896, 1896, 1896, 1897, 1898, 18…
## $ id <dbl> 8040, 5433, 6200, 5444, 5406, 5445, 6199, 4982, 9328, 4978, 54…
## $ imdb_id <chr> "0392728", "0000003", "0132134", "0000014", "0000131", "022334…
## $ title <chr> "Roundhay Garden Scene", "Pauvre Pierrot", "The Execution of M…
## $ rating <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0,…
The summary() function is useful when you want to get an idea of the range of years in the dataset (all the way back to 1888). It also tells me that the mean rating across the 8839 films is about 2.2 (not great). I wonder whether Bechdel scores have changed over time?
summary(bechdel)
## year id imdb_id title
## Min. :1888 Min. : 1 Length:8839 Length:8839
## 1st Qu.:1989 1st Qu.:2314 Class :character Class :character
## Median :2006 Median :4686 Mode :character Mode :character
## Mean :1997 Mean :4696
## 3rd Qu.:2013 3rd Qu.:7020
## Max. :2021 Max. :9506
## rating
## Min. :0.000
## 1st Qu.:1.000
## Median :3.000
## Mean :2.156
## 3rd Qu.:3.000
## Max. :3.000
Has the number of movies getting Bechdel scores of 3 improved over time?
Here I am plotting the ratings (0,1,2,3) as a function of year using geom_point(). It is hard to see whether there are more ratings of 3 on the right side of the graph though, because there are 8000+ movies on the plot and all the points are on top of each other.
bechdel %>%
ggplot(aes(x = year, y = rating)) +
geom_point()
Here I am adding a tiny bit of noise to the points so they don’t all end up on top of each other using geom_jitter(). Still a bit hard to tell what is going on due to the density of the points. I really love violin plots, maybe geom_violin() would help?
bechdel %>%
ggplot(aes(x = year, y = rating)) +
geom_jitter()
bechdel %>%
ggplot(aes(x = year, y = rating)) +
geom_violin()
Hahaha oops no, violin is not what I need…
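One more option for overplotting, which I did not use in this post, is to make the points semi-transparent so that dense regions look darker. This is only a sketch on made-up data: the toy data frame, the alpha value and the jitter height are all assumptions to tune, not values from the analysis.

```r
library(ggplot2)

# Toy stand-in for the bechdel data: made-up years and ratings
set.seed(1)
toy <- data.frame(
  year   = sample(1970:2020, 500, replace = TRUE),
  rating = sample(0:3, 500, replace = TRUE)
)

# alpha makes each point mostly transparent, so overlapping points build up
# into darker patches; width = 0 keeps the years exact and height adds a
# little vertical noise around each rating
p <- ggplot(toy, aes(x = year, y = rating)) +
  geom_jitter(alpha = 0.2, width = 0, height = 0.2)

p
```

With 8000+ films this may still be too dense to read, which is part of why summarising the data first is appealing.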
Changing tack… I think I need to summarise the data a little bit. Rather than plotting scores for every movie in the dataset, I want to know how many movies each year got a rating of 0, 1, 2 or 3.
The janitor package contains lots of useful cleaning related functions, like clean_names(), but it is also really helpful when you are looking to count how many different kinds of observations there are (i.e. constructing frequency tables).
Here I am using the tabyl() function and asking it to count how many movies get each of the different ratings, each year. I am assigning the output to a new dataframe called rating_year, so that I can use it in my plots later on. And then displaying the dataframe as a table using gt() so it appears in my document. The table ends up being VERY wide.
rating_year <- bechdel %>%
tabyl(rating, year)
rating_year %>%
gt()
| rating | 1888 | 1892 | 1895 | 1896 | 1897 | 1898 | 1899 | 1900 | 1901 | 1902 | 1903 | 1904 | 1905 | 1906 | 1907 | 1908 | 1909 | 1910 | 1912 | 1913 | 1914 | 1915 | 1916 | 1917 | 1918 | 1919 | 1920 | 1921 | 1922 | 1923 | 1924 | 1925 | 1926 | 1927 | 1928 | 1929 | 1930 | 1931 | 1932 | 1933 | 1934 | 1935 | 1936 | 1937 | 1938 | 1939 | 1940 | 1941 | 1942 | 1943 | 1944 | 1945 | 1946 | 1947 | 1948 | 1949 | 1950 | 1951 | 1952 | 1953 | 1954 | 1955 | 1956 | 1957 | 1958 | 1959 | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | 1969 | 1970 | 1971 | 1972 | 1973 | 1974 | 1975 | 1976 | 1977 | 1978 | 1979 | 1980 | 1981 | 1982 | 1983 | 1984 | 1985 | 1986 | 1987 | 1988 | 1989 | 1990 | 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2 | 4 | 1 | 3 | 1 | 11 | 3 | 4 | 5 | 4 | 2 | 6 | 2 | 1 | 1 | 2 | 2 | 2 | 0 | 0 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 7 | 6 | 2 | 3 | 8 | 4 | 2 | 3 | 3 | 2 | 3 | 0 | 3 | 1 | 2 | 3 | 5 | 2 | 4 | 4 | 2 | 3 | 3 | 1 | 6 | 4 | 1 | 1 | 1 | 6 | 6 | 3 | 4 | 9 | 5 | 3 | 4 | 7 | 5 | 7 | 6 | 6 | 9 | 6 | 11 | 4 | 4 | 5 | 7 | 6 | 5 | 3 | 3 | 5 | 7 | 6 | 7 | 7 | 8 | 3 | 7 | 14 | 12 | 6 | 8 | 6 | 9 | 8 | 12 | 12 | 12 | 7 | 9 | 6 | 5 | 6 | 12 | 15 | 16 | 17 | 10 | 17 | 21 | 21 | 19 | 29 | 39 | 23 | 22 | 33 | 25 | 27 | 22 | 17 | 16 | 15 | 4 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 3 | 4 | 2 | 0 | 2 | 6 | 8 | 4 | 5 | 2 | 4 | 4 | 3 | 2 | 5 | 6 | 6 | 6 | 2 | 2 | 8 | 2 | 6 | 2 | 3 | 3 | 4 | 9 | 6 | 5 | 7 | 8 | 6 | 9 | 3 | 3 | 7 | 10 | 14 | 4 | 18 | 11 | 6 | 13 | 10 | 18 | 7 | 12 | 9 | 6 | 17 | 13 | 8 | 13 | 8 | 16 | 11 | 18 | 17 | 14 | 14 | 15 | 17 | 15 | 15 | 19 | 21 | 25 | 28 | 23 | 26 | 28 | 45 | 33 | 29 | 36 | 49 | 33 | 45 | 43 | 61 | 55 | 69 | 85 | 80 | 89 | 72 | 70 | 82 | 68 | 65 | 49 | 39 | 41 | 12 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 3 | 2 | 2 | 3 | 0 | 2 | 4 | 2 | 1 | 3 | 4 | 0 | 1 | 5 | 4 | 7 | 5 | 12 | 4 | 5 | 2 | 1 | 2 | 3 | 5 | 7 | 4 | 2 | 3 | 2 | 3 | 3 | 5 | 6 | 6 | 6 | 4 | 1 | 8 | 2 | 4 | 4 | 5 | 3 | 2 | 2 | 5 | 2 | 3 | 5 | 7 | 6 | 7 | 3 | 2 | 3 | 7 | 4 | 4 | 6 | 9 | 4 | 7 | 10 | 10 | 5 | 9 | 9 | 8 | 8 | 11 | 8 | 7 | 16 | 10 | 15 | 12 | 13 | 18 | 13 | 16 | 10 | 21 | 20 | 19 | 24 | 20 | 35 | 24 | 24 | 32 | 36 | 49 | 30 | 32 | 21 | 23 | 18 | 4 | 2 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 3 | 3 | 1 | 2 | 2 | 3 | 1 | 2 | 0 | 1 | 2 | 1 | 3 | 6 | 14 | 8 | 13 | 9 | 5 | 16 | 8 | 5 | 14 | 15 | 12 | 8 | 13 | 13 | 14 | 9 | 9 | 7 | 10 | 11 | 12 | 5 | 16 | 14 | 17 | 15 | 13 | 13 | 19 | 19 | 19 | 17 | 17 | 16 | 14 | 12 | 18 | 15 | 9 | 19 | 23 | 21 | 19 | 17 | 22 | 12 | 23 | 14 | 16 | 21 | 26 | 29 | 22 | 35 | 40 | 32 | 47 | 41 | 39 | 41 | 44 | 37 | 55 | 57 | 67 | 77 | 66 | 75 | 84 | 93 | 110 | 100 | 100 | 126 | 130 | 159 | 159 | 158 | 194 | 210 | 231 | 224 | 270 | 231 | 206 | 200 | 180 | 153 | 144 | 78 | 4 |
OK I can see that my new rating_year dataframe has 4 observations of 129 variables. It has counts of how many 0, 1, 2, and 3s there were for all 128 years and each year is in a separate column.
Now I want to plot those counts by year, but to do that I need to make the data long. I want the data in 3 columns, one with the ratings (0,1,2,3), one with the years, and one with the count of how many movies got that rating in that year. Luckily the pivot_longer() function makes this transformation really easy.
Here I am taking my rating_year dataframe (which is wide) and piping it into pivot_longer(). I’m telling R that I want the names of all the wide variables to go into a new column called “year” and I want all the values to go into a new column called “count”, and the range of variables that are currently wide that I want to be long are everything from the 2nd to the 129th variable.
I am creating a new thing in my environment called rating_year_long and using glimpse() to show that it now has 512 observations of 3 variables.
rating_year_long <- rating_year %>%
pivot_longer(cols = 2:129, names_to = "year", values_to = "count")
glimpse(rating_year_long)
## Rows: 512
## Columns: 3
## $ rating <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ year <chr> "1888", "1892", "1895", "1896", "1897", "1898", "1899", "1900",…
## $ count <dbl> 1, 1, 2, 4, 1, 3, 1, 11, 3, 4, 5, 4, 2, 6, 2, 1, 1, 2, 2, 2, 0,…
I can also see that R thinks my rating variable is double (aka numeric) and my year variable is character, which might cause me struggles down the track, but let’s see…
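As an aside, the wide-then-long round trip could probably be skipped by counting straight into long format with dplyr's count(), which also leaves year as a number because it never becomes a column name. This is a sketch on a made-up toy data frame (toy and toy_long are hypothetical names), not what I did in this post.

```r
library(dplyr)

# Toy stand-in for bechdel: one row per film (made-up values)
toy <- tibble(
  year   = c(2000, 2000, 2000, 2001, 2001),
  rating = c(3, 3, 1, 0, 3)
)

# count() returns one row per year/rating combination that occurs in the data,
# already in long format, with the tally stored in a column called "count"
toy_long <- toy %>%
  count(year, rating, name = "count")

toy_long
```

One difference to be aware of: count() only returns combinations that actually occur, whereas the tabyl() + pivot_longer() route keeps a row with a count of 0 for every rating in every year.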
Here I want to plot how ratings have changed over time, so I am plotting year on the x axis and count on the y axis. I am using colour to differentiate between 0, 1, 2, and 3 ratings.
rating_year_long %>%
ggplot(aes(x = year, y = count, colour = rating)) +
geom_point()
OK there are a few problems with this plot that I would like to work out how to fix:
- the year on the x axis is UNREADABLE, I think that is because R thinks year is a character variable and it is trying to print every value
- the colour scale ranges from dark blue to light blue. This is what R does when it is trying to map a continuous numeric variable to a colour palette. I think I need to make the rating a factor to get 4 distinct colours for 0, 1, 2, and 3 ratings.
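Both problems in the list above come down to data types, so both fixes could be written as a single mutate() step. Here is a sketch on a made-up toy version of the long data (toy_long and toy_fixed are hypothetical names); in the post itself I make the two changes one at a time so I can see what each one does to the plot.

```r
library(dplyr)

# Toy stand-in for rating_year_long (made-up counts)
toy_long <- tibble(
  rating = c(0, 3, 0, 3),
  year   = c("2000", "2000", "2001", "2001"),
  count  = c(2, 5, 1, 7)
)

toy_fixed <- toy_long %>%
  mutate(
    year   = as.numeric(year),   # character -> numeric, so the x axis is continuous
    rating = as.factor(rating)   # numeric -> factor, so colour gets discrete levels
  )

glimpse(toy_fixed)
```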
First, let’s make year numeric and see what that does to the x axis. Here I am using as.numeric() to change my year variable from characters to numeric, then glimpse() to check that it did what I want.
rating_year_long$year <- as.numeric(rating_year_long$year)
glimpse(rating_year_long)
## Rows: 512
## Columns: 3
## $ rating <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ year <dbl> 1888, 1892, 1895, 1896, 1897, 1898, 1899, 1900, 1901, 1902, 190…
## $ count <dbl> 1, 1, 2, 4, 1, 3, 1, 11, 3, 4, 5, 4, 2, 6, 2, 1, 1, 2, 2, 2, 0,…
Better, year is now double. OK, what does that plot look like now?
rating_year_long %>%
ggplot(aes(x = year, y = count, colour = rating)) +
geom_point()
Yay, fewer years on the x axis is definitely better, but I’d like a few more ticks so I can see the range of years in the dataset more clearly. To change the number of “ticks” you can use scale_x_continuous() and specify the breaks manually. Here I am telling it I want ticks every 10 years, starting at 1888 and stopping before 2021 (so the last tick is 2018).
rating_year_long %>%
ggplot(aes(x = year, y = count, colour = rating)) +
geom_point() +
scale_x_continuous(breaks=seq(1888,2021,10))
Awesome… OK, let’s deal with the colour scale. At the moment R is treating the ratings as numeric and trying to map them to a continuous colour scale. Let’s convert the ratings to a factor (there are only 4 values in the data) and see if we can get distinct colours.
Here I am using as.factor() to change the kind of data that R thinks the ratings are from numeric to factor and then glimpse() to check it has done what I want.
rating_year_long$rating <- as.factor(rating_year_long$rating)
glimpse(rating_year_long)
## Rows: 512
## Columns: 3
## $ rating <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ year <dbl> 1888, 1892, 1895, 1896, 1897, 1898, 1899, 1900, 1901, 1902, 190…
## $ count <dbl> 1, 1, 2, 4, 1, 3, 1, 11, 3, 4, 5, 4, 2, 6, 2, 1, 1, 2, 2, 2, 0,…
Excellent, now trying the plot again.
rating_year_long %>%
ggplot(aes(x = year, y = count, colour = rating)) +
geom_point() +
scale_x_continuous(breaks=seq(1888,2021,10))
Woot! It seems that there has definitely been an increase in the number of movies getting a score of 3 on the Bechdel test, but it is also clear that there has just been a MASSIVE increase since the 1970s in the number of movies being produced. In this case, raw counts make it hard to see whether we have made much of an improvement in the representation of women in film.
What if we turned this data into proportions? Has the proportion of films that get a 3 score improved over time?
OK how to do this??? maybe group_by() + summarise()??
Here I am taking my long data (always easier to get R to do stuff when your data is long) and piping it into a group_by(). I want the total number of films and the proportion of films getting a 3, separately for each year, so I am going to start by summarising the number of films per year as the sum of the counts.
OK now I can use my new total dataframe to plot how the number of films produced has changed.
total <- rating_year_long %>%
group_by(year) %>%
summarise(total = sum(count))
total %>%
ggplot(aes(x = year, y = total)) +
geom_point()
Now I want to count the number of films that get a score of 3 each year. Here I am filtering for just the counts of films that get a score of 3 and then doing the same group_by() + summarise() as above.
rating3 <- rating_year_long %>%
filter(rating == 3) %>%
group_by(year) %>%
summarise(total3 = sum(count))
This method has created 2 separate dataframes, one with the total number of films each year (total) and one with the number of films that score a 3 each year (rating3). There is probably a way to do that without making separate dataframes (but I can’t work that out right now) so now I need to join them back together.
Because my two dataframes have the same number of observations (and the rows are in the same year order), I can use the cbind() function (which binds by columns). I end up with 2 year variables, but that is probably ok.
note: if there weren’t the same number of obs in each df, I would need to use one of the join() functions from dplyr to join the dataframes by the values in one of the variables.
rating_total <- cbind(total, rating3)
glimpse(rating_total)
## Rows: 128
## Columns: 4
## $ year <dbl> 1888, 1892, 1895, 1896, 1897, 1898, 1899, 1900, 1901, 1902, 190…
## $ total <dbl> 1, 1, 2, 4, 1, 3, 2, 11, 3, 4, 5, 4, 2, 7, 2, 1, 3, 3, 3, 2, 1,…
## $ year <dbl> 1888, 1892, 1895, 1896, 1897, 1898, 1899, 1900, 1901, 1902, 190…
## $ total3 <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
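The duplicated year column in the output above is a side effect of cbind(), which just glues the columns side by side. A dplyr join matches rows on year instead and keeps a single year column. Here is a sketch with made-up toy versions of total and rating3 (the values are invented, and the missing 2002 row in rating3 is there on purpose to show what happens when the two dataframes have different lengths):

```r
library(dplyr)

# Toy stand-ins for the two summary dataframes (made-up values)
total   <- tibble(year = c(2000, 2001, 2002), total  = c(10, 12, 8))
rating3 <- tibble(year = c(2000, 2001),       total3 = c(6, 9))

# left_join() keeps every row of total and matches rating3 on year;
# years with no match in rating3 get NA for total3
joined <- left_join(total, rating3, by = "year")

joined
```

Because the rows are matched on year rather than on position, this would also be safe if the two dataframes were sorted differently.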
Now that I have joined the totals and the rating3 counts together, I can calculate proportions by using mutate() to make a new column.
proportion <- rating_total %>%
mutate(proportion = total3 / total) ## not ok to have 2 x year columns
OK, the chunk above threw an error because apparently it is not OK to have two year columns, so I am trying again below after dropping one of the year columns using select(-1), i.e. the 1st variable.
proportion <- rating_total %>%
select(-1) %>%
mutate(proportion = total3 / total)
Now to plot!
proportion %>%
ggplot(aes(x = year, y = proportion)) +
geom_point() +
scale_x_continuous(breaks=seq(1888,2021,10)) +
scale_y_continuous(limits = c(0.0,1.0))
Well that is a disappointing rate of change!
Filtering for just films released after 1980 and adding geom_smooth() below. It seems that 2020 was a particularly good year, with close to 80% of films scoring 3 (78 of 98). However, it is important to bear in mind that the dataset has fewer than half as many films for 2020 as for 2019 (98 vs 218).
proportion %>%
filter(year > 1980) %>%
ggplot(aes(x = year, y = proportion)) +
geom_point() +
geom_smooth() +
scale_x_continuous(breaks=seq(1980,2021,10)) +
scale_y_continuous(limits = c(0.0,1.0))
Challenges
- as always R makes assumptions about what kind of data you have and those assumptions have implications for what your plot ends up looking like.
- NOTE Dani has some useful resources re data types and how to “coerce” data into being the type you need here https://psyr.djnavarro.net/data-types.html
- I tried for a long time to get both totals and proportions in a single group_by() + summarise() (thus avoiding the need to make two dataframes and then join them back together) but no luck… maybe this is a situation where making the data wide again (with separate columns for each rating) so that you can calculate proportions across columns might work better.
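One way the single group_by() + summarise() from the last point might be written is to subset count inside sum(), so the total and the number of 3s are calculated in the same step. This is a sketch on a made-up toy version of the long data (the values are invented), and I have not run it against the full dataset.

```r
library(dplyr)

# Toy stand-in for rating_year_long: counts per rating per year (made-up values)
toy_long <- tibble(
  year   = c(2000, 2000, 2000, 2000, 2001, 2001, 2001, 2001),
  rating = factor(c(0, 1, 2, 3, 0, 1, 2, 3)),
  count  = c(2, 3, 1, 4, 1, 2, 2, 15)
)

proportion <- toy_long %>%
  group_by(year) %>%
  summarise(
    total      = sum(count),               # all films that year
    total3     = sum(count[rating == 3]),  # only the films scoring 3
    proportion = total3 / total            # summarise() can reuse the columns just made
  )

proportion
```

This keeps one year column throughout, so there is nothing to join or drop afterwards.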