Overview

In this assignment, you’ll practice collaborating around a code project with GitHub. You could consider our collective work as building out a book of examples on how to use TidyVerse functions.

GitHub repository: https://github.com/acatlin/SPRING2023TIDYVERSE

Your task here is to Create an Example. Using one or more TidyVerse packages, and any data set from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected data set.

Topic

To illustrate how to use TidyVerse packages, I will use a data set describing the works of Bob Ross. It records 67 different elements found in the paintings featured in “The Joy of Painting”. The analysis looks at which elements recur most often across the entire show.

Reference: https://github.com/fivethirtyeight/data/tree/master/bob-ross

Setup

To begin the analysis, we’ll load the tidyverse package and the fivethirtyeight package. Together they provide the data set we want and the functions we need to manipulate and display it.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(fivethirtyeight)
## Some larger datasets need to be installed separately, like senators and
## house_district_forecast. To install these, we recommend you install the
## fivethirtyeightdata package by running:
## install.packages('fivethirtyeightdata', repos =
## 'https://fivethirtyeightdata.github.io/drat/', type = 'source')

Gathering the Data

The “utils” package provides functions to read table-formatted data from local files or web repositories such as GitHub. The read.csv() function below reads a .csv file into a data frame. Alternatively, the Bob Ross data set already ships with the “fivethirtyeight” package, so we can load it with data().

library(utils)
Bob_ross_alt <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/bob-ross/elements-by-episode.csv", sep = ',')

data("bob_ross")
str(bob_ross)
## tibble [403 × 71] (S3: tbl_df/tbl/data.frame)
##  $ episode           : chr [1:403] "S01E01" "S01E02" "S01E03" "S01E04" ...
##  $ season            : num [1:403] 1 1 1 1 1 1 1 1 1 1 ...
##  $ episode_num       : num [1:403] 1 2 3 4 5 6 7 8 9 10 ...
##  $ title             : chr [1:403] "A WALK IN THE WOODS" "MT. MCKINLEY" "EBONY SUNSET" "WINTER MIST" ...
##  $ apple_frame       : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ aurora_borealis   : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ barn              : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ beach             : int [1:403] 0 0 0 0 0 0 0 0 1 0 ...
##  $ boat              : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ bridge            : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ building          : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ bushes            : int [1:403] 1 0 0 1 0 0 0 1 0 1 ...
##  $ cabin             : int [1:403] 0 1 1 0 0 1 0 0 0 0 ...
##  $ cactus            : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ circle_frame      : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ cirrus            : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ cliff             : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ clouds            : int [1:403] 0 1 0 1 0 0 0 0 1 0 ...
##  $ conifer           : int [1:403] 0 1 1 1 0 1 0 1 0 1 ...
##  $ cumulus           : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ deciduous         : int [1:403] 1 0 0 0 1 0 1 0 0 1 ...
##  $ diane_andre       : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ dock              : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ double_oval_frame : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ farm              : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ fence             : int [1:403] 0 0 1 0 0 0 0 0 1 0 ...
##  $ fire              : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ florida_frame     : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ flowers           : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ fog               : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ framed            : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ grass             : int [1:403] 1 0 0 0 0 0 0 0 0 0 ...
##  $ guest             : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ half_circle_frame : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ half_oval_frame   : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ hills             : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ lake              : int [1:403] 0 0 0 1 0 1 1 1 0 1 ...
##  $ lakes             : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ lighthouse        : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ mill              : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ moon              : int [1:403] 0 0 0 0 0 1 0 0 0 0 ...
##  $ mountain          : int [1:403] 0 1 1 1 0 1 1 1 0 1 ...
##  $ mountains         : int [1:403] 0 0 1 0 0 1 1 1 0 0 ...
##  $ night             : int [1:403] 0 0 0 0 0 1 0 0 0 0 ...
##  $ ocean             : int [1:403] 0 0 0 0 0 0 0 0 1 0 ...
##  $ oval_frame        : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ palm_trees        : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ path              : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ person            : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ portrait          : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ rectangle_3d_frame: int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ rectangular_frame : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ river             : int [1:403] 1 0 0 0 1 0 0 0 0 0 ...
##  $ rocks             : int [1:403] 0 0 0 0 1 0 0 0 0 0 ...
##  $ seashell_frame    : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ snow              : int [1:403] 0 1 0 0 0 1 0 0 0 0 ...
##  $ snowy_mountain    : int [1:403] 0 1 0 1 0 1 1 0 0 0 ...
##  $ split_frame       : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ steve_ross        : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ structure         : int [1:403] 0 0 1 0 0 1 0 0 0 0 ...
##  $ sun               : int [1:403] 0 0 1 0 0 0 0 0 0 0 ...
##  $ tomb_frame        : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ tree              : int [1:403] 1 1 1 1 1 1 1 1 0 1 ...
##  $ trees             : int [1:403] 1 1 1 1 1 1 1 1 0 1 ...
##  $ triple_frame      : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ waterfall         : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ waves             : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ windmill          : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ window_frame      : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
##  $ winter            : int [1:403] 0 1 1 0 0 1 0 0 0 0 ...
##  $ wood_framed       : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
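Since this vignette is about the tidyverse, it may be worth noting that readr offers read_csv() as a counterpart to read.csv(); it returns a tibble rather than a plain data frame. The sketch below is only an illustration: `csv_text` is a hypothetical two-row extract typed inline so the example runs without a network connection, and its upper-case column names are an assumption about the raw file.

```r
library(readr)

# Hypothetical two-episode extract; the column names are assumed, not taken from the raw file
csv_text <- "EPISODE,TITLE,TREE,LAKE\nS01E01,A WALK IN THE WOODS,1,0\nS01E02,MT. MCKINLEY,1,0\n"

# I() tells read_csv() to treat the string as literal data rather than a file path
toy_csv <- read_csv(I(csv_text), show_col_types = FALSE)

# The result is a tibble with one row per episode
print(toy_csv)
```

Passing the GitHub URL from the read.csv() call in place of `I(csv_text)` should read the full file the same way.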

Dplyr Functions

Dplyr helps manipulate and organize data frames based on various conditions. The dplyr functions used in this block are arrange(), slice(), and mutate(); the base R functions colSums(), sort(), and rbind() handle the rest.

# Use colSums() to total each element column (columns 5 through 71)
sums_ <- colSums(bob_ross[, 5:71])
# Use sort() to order the totals from largest to smallest
sums_ <- sort(sums_, decreasing = TRUE)
# Convert the named vector into a data frame; the element names become row names
sums_1 <- as.data.frame(sums_)

# Alternatively, arrange() with desc() sorts the data frame in descending order
sums_2 <- arrange(sums_1, desc(sums_))

# Keep only the top 30 elements with slice(); the rest are combined into a single "Other_elements" entry
top30 <- sums_1 %>% slice(1:30)
# slice() the remaining elements, then mutate() a new column holding their total, which came to 132
other_sum <- sums_1 %>% 
  slice(31:nrow(sums_1)) 
other_sum <- other_sum %>% mutate(Number = sum(sums_))
# Build a one-row data frame so the "Other_elements" total can be added back
Other_unit <- data.frame(
  row.names = "Other_elements", 
  sums_ = other_sum$Number[1]
  )

# Finally, append the "Other_elements" row to the top 30 list with rbind()
top30 <- rbind(top30, Other_unit)
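The same totals can also be computed without leaving the tidyverse. The sketch below is one possible alternative, not the approach used above. It runs on `toy`, a small made-up tibble standing in for bob_ross, and the names `element_counts`, `top_n_elements`, and `result` are chosen here for illustration. It uses tidyr's pivot_longer() to reshape the 0/1 columns, then dplyr's group_by(), summarise(), arrange(), slice(), and bind_rows().

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for bob_ross: one row per episode, one 0/1 column per element
toy <- tibble(
  episode = c("S01E01", "S01E02", "S01E03", "S01E04"),
  tree    = c(1L, 1L, 1L, 0L),
  lake    = c(0L, 1L, 1L, 0L),
  cabin   = c(0L, 0L, 1L, 0L)
)

# Reshape to one row per episode/element pair, then total each element
element_counts <- toy %>%
  pivot_longer(cols = -episode, names_to = "element", values_to = "present") %>%
  group_by(element) %>%
  summarise(appearances = sum(present), .groups = "drop") %>%
  arrange(desc(appearances))

# Keep the top 2 elements and collapse the rest into one "Other_elements" row
top_n_elements <- 2
top <- element_counts %>% slice(1:top_n_elements)
other <- element_counts %>%
  slice(-(1:top_n_elements)) %>%
  summarise(element = "Other_elements", appearances = sum(appearances))

result <- bind_rows(top, other)
print(result)
```

Because the element names live in a regular column instead of row names, the result stays a tibble throughout. On the real data, the pipeline would presumably start from pivot_longer(cols = 5:71, ...) and use 30 for `top_n_elements`.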

ggplot2 Functions

For visual graphics and presentation we will use the “ggplot2” library. This package implements a layered grammar of graphics, which gives more flexibility than base R plotting. One of its main advantages is being able to add aspects of the graph one by one without having to call functions with many parameters. The functions used: ggplot(), geom_point(), aes(), theme(), labs().

# Use the ggplot() function to display our results
ggplot(top30, aes(x = rownames(top30), y = sums_)) +
  # geom_point() draws one point per element
  geom_point() +
  # labs() sets the axis labels and the title
  labs(x = "Elements", y = "Number of Appearances", title = "Elements found in Bob Ross's Paintings") +
  # theme() adjusts non-data parts of the plot; here it rotates the x-axis labels so each element name is readable
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
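Because the x-axis values are character strings, ggplot2 places them in alphabetical order rather than by count. One way to show the ranking directly is to wrap the x variable in reorder(). The sketch below illustrates this on `toy_counts`, a hypothetical data frame with made-up numbers standing in for top30; it is not the plot produced above.

```r
library(ggplot2)

# Hypothetical counts standing in for the top30 data frame
toy_counts <- data.frame(
  element = c("tree", "lake", "cabin", "Other_elements"),
  appearances = c(3, 2, 1, 4)
)

# reorder() sorts the discrete x-axis by appearances; negating gives descending order
p <- ggplot(toy_counts, aes(x = reorder(element, -appearances), y = appearances)) +
  geom_point() +
  labs(x = "Elements", y = "Number of Appearances") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

print(p)
```

With top30, the equivalent mapping would presumably be aes(x = reorder(rownames(top30), -sums_), y = sums_).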

Other Potential Analysis

Another form of analysis can be done with the “wordcloud” library. Here we use the wordcloud() function to get a different representation of which elements Bob Ross used most frequently. Another tool, which won’t be a part of this vignette, is sentiment analysis on the elements he commonly used, to see whether his style changed over the course of the show.

library(wordcloud)
## Loading required package: RColorBrewer
wordcloud(row.names(top30), top30$sums_, max.words = 100, random.order = FALSE)