The overarching purpose of this assignment was the utilization of github as a collaborative coding tool to explore push, pull, clone and forking capabilities. In this assignment a github repository was cloned, another student’s vignette .rmd file modified and then the .rmd file was pushed back to the original github repository to demonstrate github’s collaboration capabilities.
The vignette which was modified was created by Jlok17. The vignette was modified to improve ggplot readability. More specifically the plot was reordered and observation value labels were introduced to a scatter plot.
TITLE: “TidyVerse Create”
AUTHOR: “Jlok17”
DATE: “2023-04-11”
OUTPUT: html_document
OVERVIEW
In this assignment, you’ll practice collaborating around a code project with GitHub. You could consider our collective work as building out a book of examples on how to use TidyVerse functions.
GitHub repository: https://github.com/acatlin/SPRING2023TIDYVERSE
Your task here is to Create an Example. Using one or more TidyVerse packages, and any data set from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected data set.
TOPIC
To illustrate how to use a TidyVerse packages, I will be using a Data Set with the works of Bob Ross. Here this data set contains 67 different elements that were found in Bob Ross paintings featured in “The Joy of Painting”. The analysis will be which types of elements that are reoccurring within the entirety of his show.
Reference: https://github.com/fivethirtyeight/data/tree/master/bob-ross
SETUP
To begin analysis, we’ll simply load the tidyverse package and the fivethrityeight package. This will allow us to pull the data set we want and the packages we need to broadcast the data.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.7 ✔ dplyr 1.1.0
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.2.2
## Warning: package 'purrr' was built under R version 4.2.2
## Warning: package 'dplyr' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(fivethirtyeight)
## Some larger datasets need to be installed separately, like senators and
## house_district_forecast. To install these, we recommend you install the
## fivethirtyeightdata package by running:
## install.packages('fivethirtyeightdata', repos =
## 'https://fivethirtyeightdata.github.io/drat/', type = 'source')
GATHERING THE DATA
The “utils” package provides functions to read table formatted data from local files or web depositories such as github. The read.csv() function below reads a .CSV file into a data frame. Alternatively through the “fivethirtyeight” package, the Bob Ross data set is already in the package. So we are able to use the data() in order to load the data set.
library(utils)
Bob_ross_alt <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/bob-ross/elements-by-episode.csv", sep = ',')
data("bob_ross")
str(bob_ross)
## tibble [403 × 71] (S3: tbl_df/tbl/data.frame)
## $ episode : chr [1:403] "S01E01" "S01E02" "S01E03" "S01E04" ...
## $ season : num [1:403] 1 1 1 1 1 1 1 1 1 1 ...
## $ episode_num : num [1:403] 1 2 3 4 5 6 7 8 9 10 ...
## $ title : chr [1:403] "A WALK IN THE WOODS" "MT. MCKINLEY" "EBONY SUNSET" "WINTER MIST" ...
## $ apple_frame : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ aurora_borealis : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ barn : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ beach : int [1:403] 0 0 0 0 0 0 0 0 1 0 ...
## $ boat : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ bridge : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ building : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ bushes : int [1:403] 1 0 0 1 0 0 0 1 0 1 ...
## $ cabin : int [1:403] 0 1 1 0 0 1 0 0 0 0 ...
## $ cactus : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ circle_frame : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ cirrus : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ cliff : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ clouds : int [1:403] 0 1 0 1 0 0 0 0 1 0 ...
## $ conifer : int [1:403] 0 1 1 1 0 1 0 1 0 1 ...
## $ cumulus : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ deciduous : int [1:403] 1 0 0 0 1 0 1 0 0 1 ...
## $ diane_andre : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ dock : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ double_oval_frame : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ farm : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ fence : int [1:403] 0 0 1 0 0 0 0 0 1 0 ...
## $ fire : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ florida_frame : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ flowers : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ fog : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ framed : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ grass : int [1:403] 1 0 0 0 0 0 0 0 0 0 ...
## $ guest : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ half_circle_frame : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ half_oval_frame : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ hills : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ lake : int [1:403] 0 0 0 1 0 1 1 1 0 1 ...
## $ lakes : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ lighthouse : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ mill : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ moon : int [1:403] 0 0 0 0 0 1 0 0 0 0 ...
## $ mountain : int [1:403] 0 1 1 1 0 1 1 1 0 1 ...
## $ mountains : int [1:403] 0 0 1 0 0 1 1 1 0 0 ...
## $ night : int [1:403] 0 0 0 0 0 1 0 0 0 0 ...
## $ ocean : int [1:403] 0 0 0 0 0 0 0 0 1 0 ...
## $ oval_frame : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ palm_trees : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ path : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ person : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ portrait : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ rectangle_3d_frame: int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ rectangular_frame : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ river : int [1:403] 1 0 0 0 1 0 0 0 0 0 ...
## $ rocks : int [1:403] 0 0 0 0 1 0 0 0 0 0 ...
## $ seashell_frame : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ snow : int [1:403] 0 1 0 0 0 1 0 0 0 0 ...
## $ snowy_mountain : int [1:403] 0 1 0 1 0 1 1 0 0 0 ...
## $ split_frame : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ steve_ross : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ structure : int [1:403] 0 0 1 0 0 1 0 0 0 0 ...
## $ sun : int [1:403] 0 0 1 0 0 0 0 0 0 0 ...
## $ tomb_frame : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ tree : int [1:403] 1 1 1 1 1 1 1 1 0 1 ...
## $ trees : int [1:403] 1 1 1 1 1 1 1 1 0 1 ...
## $ triple_frame : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ waterfall : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ waves : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ windmill : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ window_frame : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
## $ winter : int [1:403] 0 1 1 0 0 1 0 0 0 0 ...
## $ wood_framed : int [1:403] 0 0 0 0 0 0 0 0 0 0 ...
DPLYR FUNCTIONS
Dplyr will help easily manipulate and organize data frames based upon various conditions. The functions used in this block: mutate(); filter(); arrange(); slice().
#We are using colSums() in order to get the sums of the elements or columns that we want
sums_ <- colSums(bob_ross[,5:71])
#Next we are using sort() in order to set the order from Max to Min
sums_ <- sort(sums_, decreasing = TRUE)
#Quickly Converting this into a data frame.
sums_1 <- as.data.frame(sums_)
#Alternatively you can use the function arrange()
sums_2 <- arrange(sums_1, desc(sums_))
#Below I decided to only really care about the top 30 elements used and made the rest of them into one element called "Other_sum". This can be accomplished with the slice() to take the top 30 first.
top30 <- sums_1 %>% slice(1:30)
#In order to get the sum, I decided to slice the rest of the elements used and then mutate a new column that would be the sum of all of the remaining elements which ended up being 132.
other_sum <- sums_1 %>%
slice(31:nrow(sums_1))
other_sum <- other_sum %>% mutate(Number = sum(other_sum))
#Here we are creating a data frame entry to make it easier to add the element "Other_elements" back
Other_unit <- data.frame(
row.names = "Other_elements",
sums_ = other_sum$Number[1]
)
#Finally re-adding the Other_sum to the top 30 list with rbind()
top30 <- rbind(top30, Other_unit)
GGPLOT2 FUNCTIONS
For Visual graphics and presentation we will be using the library “ggplot2”. This package allows us to extend the basic R ggplot package and have more opportunistic changes for the graph. One of the main advantage using “ggplot2” is being able to add aspects of the graph one by one without having to call functions with many parameters. The function used: ggplot; geom_polint; aes; theme; labs.
# Use the ggplot function to display our results
ggplot(top30, aes(x = rownames(top30), y = sums_)) +
#We want to use geom_point in order to have a scatterplot
geom_point() +
#labs() are the elements of the graph such as Variable names and Keys
labs(x = "Elements", y = "Number of Appearances", title = "Elements found in Bob Ross's Paintings") +
#theme() allows us to change parts of the element such as in this case we are rotating the x-variable's name in order clearly see each elements name
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
OTHER POTENTIAL ANALYSIS
Another form analysis can be done through the library “wordcloud”, here we will be using the function wordcloud() to allow us to have a different representation of which elements are frequently used by Bob Ross. Other tools that we can use that won’t be apart of this Vignette is using sentiment analysis on each of the elements commonly used by Bob Ross to see if he was changing his style throughout his show.
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 4.2.3
## Loading required package: RColorBrewer
wordcloud(row.names(top30), top30$sums_, max.words = 100, random.order = FALSE)
As previously stated, part of the extend assignment was the modification of another student’s vignette. Below I include the original code and ggplot of student Jlok17’s vignette. As you can see, the x-axis is ordered alphabetically which makes it relatively more difficult to compare x-axis values.
ggplot(top30, aes(x = rownames(top30), y = sums_)) +
geom_point() +
labs(x = "Elements", y = "Number of Appearances", title = "Elements Found in Bob Ross's Paintings") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
To improve plot readability, the code was modified to reorder the plot based on y-axis values. The end result is a scatter plot ordered by observation values instead of labels.
# Reordered ggplot by y-values instead of alphabetically to improve readability
ggplot(top30, aes(x = reorder(rownames(top30),sums_), y = sums_)) +
geom_point() +
labs(x = "Elements", y = "Number of Appearances", title = "Reordered Scatter Plot: Elements Found in Bob Ross's Paintings") +
# Observation value labels were also added to improve readability
geom_text(aes(label=sums_), size=2, vjust =- .8) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Utilization of github facilitates coding collaboration, while also assisting with both version and quality control.