Term Project

Description of the data and definition of variables

David examined Bob Ross paintings and found out that trees show up the most throughout all the episodes. In the second graph he tried to find correlation between the shows seasons progressing and various elements within Bobs paintings. But the only one that showed any negative correlation was cumulous clouds, but david decided that none of the elements showed significant trends over the seasons. In the third analysis hes looking to see what elements appear with rivers most often, that being waterfalls, and the least likely being lakes and oceans. Towards the end he goes on to find the four principle components in Bob Ross paintings, those being: Mountains/Conifer vs Ocean/Beach and deciduous trees Trees, especially deciduous, vs Ocean Spring/Summer vs Winter Lake vs River

Visualize data

Hint: One graph of your choice.

This is an extension of the tidytuesday assignment you have already done. Complete the questions below, using the screencast you chose for the tidytuesday assigment.

library(tidyverse)
theme_set(theme_light())
bob_ross <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-08-06/bob-ross.csv")

### Most Elements

bob_ross_gathered <- bob_ross %>%
  janitor::clean_names() %>%
  gather(element, present, -episode, -title) %>%
  filter(present == 1) %>%
  mutate(title = str_to_title(str_remove_all(title, '"')),
         element = str_to_title(str_replace(element, "_", " "))) %>%
  select(-present) %>%
  extract(episode, c("season", "episode_number"), "S(.*)E(.*)", convert = TRUE, remove = FALSE) %>%
  arrange(season, episode_number)

bob_ross_gathered %>%
  count(element, sort = TRUE) %>%
  head(25) %>%
  mutate(element = fct_reorder(element, n)) %>%
  ggplot(aes(element, n)) +
  geom_col() +
  coord_flip()

How have Ross’s paintings been changing over time?

by_season_element <- bob_ross_gathered %>%
  filter(!element %in% c("Tree", "Trees")) %>%
  group_by(season) %>%
  mutate(number_episodes = n_distinct(episode)) %>%
  count(season, element, number_episodes, sort = TRUE) %>%
  mutate(percent_included = n / number_episodes) %>%
  group_by(element) %>%
  mutate(element_total = sum(n)) %>%
  ungroup()
by_season_element %>%
  filter(element_total >= 50) %>%
  ggplot(aes(season, percent_included, color = element)) +
  geom_line() +
  scale_y_continuous(labels = scales::percent_format()) +
  expand_limits(y = 0) +
  facet_wrap(~ element)

Could have used: many models with broom

Clustering

What tends to appear together?

library(widyr)
correlations <- bob_ross_gathered %>%
  add_count(element) %>%
  filter(n >= 5) %>%
  pairwise_cor(element, episode, sort = TRUE)
correlations %>%
  filter(item1 == "River") %>%
  mutate(item2 = fct_reorder(item2, correlation)) %>%
  ggplot(aes(item2, correlation)) +
  geom_col() +
  coord_flip() +
  labs(title = "What tends to appear with a river?",
       subtitle = "Among elements that appeared in at least 10 paintings")

Principal Component Analysis

What dimensions drive a lot of the variation among paintings?

library(reshape2)
library(broom)
library(tidytext)
binary_matrix <- bob_ross_gathered %>%
  acast(title ~ element)
# Center the columns
centered_matrix <- t(t(binary_matrix) - colMeans(binary_matrix))
svd_result <- svd(centered_matrix)
element_weights <- tidy(svd_result, matrix = "v") %>%
  mutate(element = colnames(binary_matrix)[column])
element_weights %>%
  filter(PC <= 4) %>%
  group_by(PC) %>%
  top_n(16, abs(value)) %>%
  ungroup() %>%
  mutate(element = reorder_within(element, value, PC)) %>%
  ggplot(aes(element, value, fill = factor(PC))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ PC, scales = "free") +
  scale_x_reordered() +
  coord_flip() +
  labs(title = "First four principal components of elements in Bob Ross paintings")

## What is the story behind the graph?

Throughout the exploration of this dataset David has mostly looked to find out what elements are most likely to be painted in conjunction with one another. These graphs do a particularly good job of visualizaion for me.

Term Project

Kyle Dunnell

Description of the data and definition of variables

Visualize data

Clustering

Principal Component Analysis

Hide the messages, but display the code and its results on the webpage.

Write your name for the author at the top.

Use the correct slug.