JLok Tidyverse Create

Overview

In this assignment, you’ll practice collaborating around a code project with GitHub. You could consider our collective work as building out a book of examples on how to use TidyVerse functions.

GitHub repository: https://github.com/acatlin/SPRING2023TIDYVERSE

Your task here is to Create an Example. Using one or more TidyVerse packages, and any data set from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected data set.

Topic

To illustrate how to use TidyVerse packages, I will be using a data set on the works of Bob Ross. This data set records 67 different elements that were found in the Bob Ross paintings featured in “The Joy of Painting”. The analysis looks at which elements recur most often over the entire run of his show.

Reference: https://github.com/fivethirtyeight/data/tree/master/bob-ross

Setup

To begin the analysis, we’ll simply load the tidyverse package and the fivethirtyeight package. This gives us the data set we want and the packages we need to manipulate and display the data.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(fivethirtyeight)
## Some larger datasets need to be installed separately, like senators and
## house_district_forecast. To install these, we recommend you install the
## fivethirtyeightdata package by running:
## install.packages('fivethirtyeightdata', repos =
## 'https://fivethirtyeightdata.github.io/drat/', type = 'source')

Gathering the Data

The “utils” package provides functions to read table-formatted data from local files or web repositories such as GitHub. The read.csv() function below reads a .CSV file into a data frame. Alternatively, the Bob Ross data set already ships with the “fivethirtyeight” package, so we can use data() to load it.

library(utils)
Bob_ross_alt <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/bob-ross/elements-by-episode.csv", sep = ',')

data("bob_ross")
str(bob_ross)
## tibble [403 × 71] (S3: tbl_df/tbl/data.frame)
##  $ episode           : chr [1:403] "S01E01" "S01E02" "S01E03" "S01E04" ...
##  $ season            : num [1:403] 1 1 1 1 1 1 1 1 1 1 ...
##  $ episode_num       : num [1:403] 1 2 3 4 5 6 7 8 9 10 ...
##  $ title             : chr [1:403] "A WALK IN THE WOODS" "MT. MCKINLEY" "EBONY SUNSET" "WINTER MIST" ...
##  $ apple_frame       : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ aurora_borealis   : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ barn              : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ beach             : int [1:403] 0 0 0 0 0 0 0 0 1 0 ...
##  $ boat              : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ bridge            : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ building          : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ bushes            : int [1:403] 1 0 0 1 0 0 0 1 0 1 ...
##  $ cabin             : int [1:403] 0 1 1 0 0 1 0 0 0 0 ...
##  $ cactus            : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ circle_frame      : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ cirrus            : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ cliff             : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ clouds            : int [1:403] 0 1 0 1 0 0 0 0 1 0 ...
##  $ conifer           : int [1:403] 0 1 1 1 0 1 0 1 0 1 ...
##  $ cumulus           : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ deciduous         : int [1:403] 1 0 0 0 1 0 1 0 0 1 ...
##  $ diane_andre       : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ dock              : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ double_oval_frame : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ farm              : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ fence             : int [1:403] 0 0 1 0 0 0 0 0 1 0 ...
##  $ fire              : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ florida_frame     : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ flowers           : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ fog               : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ framed            : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ grass             : int [1:403] 1 0 0 0 0 0 0 0 0 0 ...
##  $ guest             : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ half_circle_frame : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ half_oval_frame   : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ hills             : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ lake              : int [1:403] 0 0 0 1 0 1 1 1 0 1 ...
##  $ lakes             : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ lighthouse        : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ mill              : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ moon              : int [1:403] 0 0 0 0 0 1 0 0 0 0 ...
##  $ mountain          : int [1:403] 0 1 1 1 0 1 1 1 0 1 ...
##  $ mountains         : int [1:403] 0 0 1 0 0 1 1 1 0 0 ...
##  $ night             : int [1:403] 0 0 0 0 0 1 0 0 0 0 ...
##  $ ocean             : int [1:403] 0 0 0 0 0 0 0 0 1 0 ...
##  $ oval_frame        : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ palm_trees        : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ path              : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ person            : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ portrait          : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ rectangle_3d_frame: int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ rectangular_frame : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ river             : int [1:403] 1 0 0 0 1 0 0 0 0 0 ...
##  $ rocks             : int [1:403] 0 0 0 0 1 0 0 0 0 0 ...
##  $ seashell_frame    : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ snow              : int [1:403] 0 1 0 0 0 1 0 0 0 0 ...
##  $ snowy_mountain    : int [1:403] 0 1 0 1 0 1 1 0 0 0 ...
##  $ split_frame       : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ steve_ross        : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ structure         : int [1:403] 0 0 1 0 0 1 0 0 0 0 ...
##  $ sun               : int [1:403] 0 0 1 0 0 0 0 0 0 0 ...
##  $ tomb_frame        : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ tree              : int [1:403] 1 1 1 1 1 1 1 1 0 1 ...
##  $ trees             : int [1:403] 1 1 1 1 1 1 1 1 0 1 ...
##  $ triple_frame      : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ waterfall         : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ waves             : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ windmill          : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ window_frame      : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ winter            : int [1:403] 0 1 1 0 0 1 0 0 0 0 ...
##  $ wood_framed       : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
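
Since this vignette is about the tidyverse, it may be worth noting that readr offers read_csv() as a counterpart to read.csv(). The sketch below is only an illustration: the inline CSV text and its three columns are made up so that the example runs without a network connection, and the column types are an assumption about how one might want the 0/1 element columns parsed. For the real file, the URL used above could be passed in place of I(csv_text).

```r
library(readr)

# Made-up CSV text standing in for the real file (hypothetical columns and values)
csv_text <- "episode,tree,clouds\nS01E01,1,0\nS01E02,1,1\n"

# I() tells read_csv() to treat the string as literal data rather than a file path
toy_csv <- read_csv(
  I(csv_text),
  col_types = cols(episode = col_character(), .default = col_integer())
)

toy_csv
```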

Dplyr Functions

Dplyr helps us manipulate and organize data frames based on various conditions. The dplyr functions used in this block are arrange(), slice(), and mutate(); the base R functions colSums(), sort(), and rbind() do the rest.

#We are using colSums() in order to get the sums of the elements or columns that we want
sums_ <- colSums(bob_ross[,5:71])
#Next we are using sort() in order to set the order from Max to Min
sums_ <- sort(sums_, decreasing = TRUE)
#Quickly Converting this into a data frame.
sums_1 <- as.data.frame(sums_)

#Alternatively you can use the function arrange()
sums_2 <- arrange(sums_1, desc(sums_))

#Below I decided to keep only the top 30 elements and collapse the rest into one element called "Other_elements". slice() takes the first 30 rows of the sorted data frame.
top30 <- sums_1 %>% slice(1:30)
#To get the total for the remaining elements, slice() the rest of the rows and then mutate() a new column holding their sum, which ended up being 132.
other_sum <- sums_1 %>% 
  slice(31:nrow(sums_1)) %>% 
  mutate(Number = sum(sums_))

#Here we are creating a one-row data frame to make it easier to add the element "Other_elements" back
Other_unit <- data.frame(
  row.names = "Other_elements", 
  sums_ = other_sum$Number[1]
  )

#Finally re-adding the Other_sum to the top 30 list with rbind()
top30 <- rbind(top30, Other_unit)
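
The same top-N-plus-other summary can also be reached without row names, by reshaping to long format and letting forcats lump the small categories. The following is a sketch under stated assumptions rather than the method used above: the toy tibble, its values, the column names present, appearances, and total, and the cut-off of 3 are all made up for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(forcats)

# Toy stand-in for bob_ross: one row per episode, one 0/1 column per element (made-up values)
toy <- tibble(
  episode = c("E1", "E2", "E3", "E4"),
  tree    = c(1, 1, 1, 1),
  clouds  = c(1, 0, 1, 1),
  lake    = c(0, 1, 1, 0),
  barn    = c(0, 0, 0, 1),
  boat    = c(1, 0, 0, 0)
)

# Long format, then one total per element, sorted from most to least frequent
element_totals <- toy %>%
  pivot_longer(cols = -episode, names_to = "element", values_to = "present") %>%
  group_by(element) %>%
  summarise(appearances = sum(present), .groups = "drop") %>%
  arrange(desc(appearances))

# Keep the 3 most frequent elements (weighted by appearances) and lump the rest together
lumped <- element_totals %>%
  mutate(element = fct_lump_n(element, n = 3, w = appearances,
                              other_level = "Other_elements")) %>%
  count(element, wt = appearances, name = "total")

lumped
```

With the real data, the toy tibble would be replaced by bob_ross with the non-element columns excluded from pivot_longer(), and n would be set to 30.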

ggplot2 Functions

For visual graphics and presentation we will be using the library “ggplot2”. This package is an alternative to the base R plotting functions and gives us finer control over the graph. One of the main advantages of using “ggplot2” is being able to add aspects of the graph one by one without having to call functions with many parameters. The functions used: ggplot(); geom_point(); aes(); theme(); labs().

# Use the ggplot function to display our results
ggplot(top30, aes(x = rownames(top30), y = sums_)) +
  #We want to use geom_point in order to have a scatterplot
  geom_point() +
  #labs() sets the text of the graph, such as the axis labels and the title
  labs(x = "Elements", y = "Number of Appearances", title = "Elements found in Bob Ross's Paintings") +
  #theme() allows us to change the look of parts of the graph; in this case we are rotating the x-axis labels in order to clearly see each element's name
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
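
To make the layer-by-layer idea concrete, here is a small sketch that builds the same kind of plot one piece at a time. It assumes a made-up three-row data frame in place of top30, and it first moves the row names into a regular column with tibble::rownames_to_column(). The column name element is a hypothetical choice, and this step is an optional convenience, not something the code above requires.

```r
library(ggplot2)
library(tibble)

# Made-up counts standing in for top30 (illustrative values only)
toy_counts <- data.frame(
  sums_ = c(12, 7, 5),
  row.names = c("tree", "clouds", "lake")
)

# Move the row names into an ordinary column so aes() can refer to it by name
toy_tbl <- rownames_to_column(toy_counts, var = "element")

# Start with the data and the aesthetic mapping, then add one piece at a time
p <- ggplot(toy_tbl, aes(x = element, y = sums_))
p <- p + geom_point()
p <- p + labs(x = "Elements", y = "Number of Appearances")
p <- p + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

p
```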

Other Potential Analysis

Another form of analysis can be done through the library “wordcloud”. Here we will be using the function wordcloud() to give a different representation of which elements Bob Ross used most frequently. Another tool that won’t be a part of this vignette is sentiment analysis on the elements he commonly used, to see whether he was changing his style throughout his show.

library(wordcloud)
## Loading required package: RColorBrewer
wordcloud(row.names(top30), top30$sums_, max.words = 100, random.order = FALSE)

Susanna Tidyverse Extend

Extend the Visualization of JLok17’s Example Above

In order to visualize the elements Bob Ross used the most in his show, “The Joy of Painting”, we should order the elements by number of occurrences from greatest to least.

Option 1:

The elements are on the x-axis. However, the element names are easier to read when the labels are rotated 45° instead of 90°. In addition, center the title using + theme(plot.title=element_text(hjust=0.5))

ggplot(top30, aes(x = rownames(top30), y = sums_)) +
  geom_point() +
  labs(x = "Elements", y = "Number of Appearances", title = "Elements found in Bob Ross's Paintings") +
 # Instead of rotating the labels on the x-axis 90°, rotate the labels 45°
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1)) + theme(plot.title=element_text(hjust=0.5)) #Center the title

To see which element occurs the most, we should order the elements by number of occurrences from greatest to least.

# Reorder the elements based on the number of occurrences from greatest to least. 
ggplot(top30, aes(x = reorder(rownames(top30),-sums_ ), y = sums_)) +
  geom_point() +
  labs(x = "Elements", y = "Number of Appearances", title = "Elements found in Bob Ross's Paintings") +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1)) + theme(plot.title=element_text(hjust=0.5)) #Center the title
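
reorder() comes from base R; forcats, which is part of the tidyverse, offers fct_reorder() for the same job. The sketch below uses made-up counts (not the real totals) to show that both calls produce the same level order, which is what decides the order along the axis.

```r
library(forcats)

# Made-up element counts (illustrative values only)
element <- c("clouds", "tree", "lake", "barn")
count   <- c(7, 12, 5, 9)

# Base R: sort levels by the negated count, so the largest count comes first
by_reorder <- reorder(element, -count)

# forcats: sort levels by count, largest first
by_fct <- fct_reorder(element, count, .desc = TRUE)

levels(by_reorder)
levels(by_fct)
```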

Option 2:

The elements are on the y-axis. Swap the x and y to flip the axes. Usually, we would use coord_flip(). However, that produced an error here.

We will not need the following portion of the original code: + theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))

# Swap the x and y to flip the axes. Usually, we would use coord_flip(). However, there is an error.
# Reorder the elements based on the number of occurrences from greatest to least. 
ggplot(top30, aes(y = reorder(rownames(top30),sums_ ), x = sums_)) +
  geom_point() +
  labs(y = "Elements", x = "Number of Appearances", title = "Elements found in Bob Ross's Paintings") + # Swap the x and y to change the labels
  theme(plot.title=element_text(hjust=0.5)) #Center the title

# We don't need the following portion of the original code 
# + theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))

Option 3

We can also visualize it using a bar plot.

# Swap the x and y to flip the axes. Usually, we would use coord_flip(). However, there is an error.
# Reorder the elements based on the number of occurrences from greatest to least. 
ggplot(top30, aes(y = reorder(rownames(top30),sums_ ), x = sums_)) +
  geom_bar(stat = "identity",fill = "seagreen", color = "black") +
  labs(y = "Elements", x = "Number of Appearances", title = "Elements found in Bob Ross's Paintings") + # Elements are on the y-axis, so the labels are swapped too
  theme(plot.title=element_text(hjust=0.5)) #Center the title

# We do not need the following portion of the original code 
# + theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))
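
Two shorthand forms are relevant to the bar plot above: geom_col() is the ggplot2 shortcut for geom_bar(stat = "identity"), and coord_flip() is the usual way to flip axes. The sketch below shows both on a made-up data frame with the element names in a regular column. It does not reproduce or diagnose the coord_flip() error mentioned above, whose cause is not stated; it only illustrates the common pattern.

```r
library(ggplot2)

# Made-up counts with element names in a regular column (illustrative values only)
toy_counts <- data.frame(
  element = c("tree", "clouds", "lake"),
  sums_   = c(12, 7, 5)
)

# Elements are mapped to x as in Option 1, then coord_flip() turns the bars sideways
p_flip <- ggplot(toy_counts, aes(x = reorder(element, sums_), y = sums_)) +
  geom_col(fill = "seagreen", color = "black") +
  coord_flip() +
  labs(x = "Elements", y = "Number of Appearances")

p_flip
```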

Dplyr Functions Extend

Besides trees, which element did Bob Ross use most often in each season?

# Convert from wide to long format
long_table <- bob_ross %>% 
  pivot_longer(cols = !c("episode", "season", "episode_num", "title"), names_to = "element", values_to = "count")

# Keep only the season, element, and count columns (drops episode, episode_num, and title)
season <- long_table %>% 
  select(season, element, count)

# Count the number of times each element occurs in each season
season <- season %>% 
  group_by(season, element) %>% 
  mutate(count= sum(count))

season <- season %>%
  group_by(season) %>%
  filter(!element %in% c('tree', 'trees')) %>% # Without this filter, the top element for every season would be "tree" or "trees"
  arrange(desc(count)) %>% 
  slice(1)

library(DT)
datatable(season)
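
The per-season question can also be answered with summarise() and slice_max(), which collapses the data to one row per season and element before picking the winner. This is an alternative sketch, not the code used above; the toy table and its values are made up, and with_ties = FALSE is an assumption that a single element per season is wanted even when counts tie.

```r
library(dplyr)
library(tibble)

# Made-up long-format data: one row per episode and element (illustrative values only)
toy_long <- tribble(
  ~season, ~episode, ~element, ~count,
  1, "S01E01", "tree",   1,
  1, "S01E01", "clouds", 1,
  1, "S01E01", "lake",   0,
  1, "S01E02", "tree",   1,
  1, "S01E02", "clouds", 1,
  1, "S01E02", "lake",   1,
  2, "S02E01", "tree",   1,
  2, "S02E01", "clouds", 0,
  2, "S02E01", "lake",   1,
  2, "S02E02", "tree",   1,
  2, "S02E02", "clouds", 1,
  2, "S02E02", "lake",   1
)

top_per_season <- toy_long %>%
  filter(!element %in% c("tree", "trees")) %>%      # leave the tree columns out
  group_by(season, element) %>%
  summarise(count = sum(count), .groups = "drop") %>% # one total per season and element
  group_by(season) %>%
  slice_max(count, n = 1, with_ties = FALSE) %>%      # keep the top element in each season
  ungroup()

top_per_season
```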