The goal of Project 2 is to choose wide/untidy datasets from the Week 5 Discussion, read them from CSV into R, tidy and transform them as needed, and perform the analysis requested in the discussion. Here I use David Moste's Week 5 Discussion.

Read the data from the original repo and write it to a local CSV.

library(tidyverse)  # read_csv, pivot_longer, str_extract_all, ggplot
library(knitr)      # kable
df = read_csv('https://raw.githubusercontent.com/jmhsi/DATA_607/master/Project%202/bob_ross.csv')
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   EPISODE = col_character(),
##   TITLE = col_character()
## )
## See spec(...) for full column specifications.
write.csv(df, 'bob_ross.csv', row.names=FALSE)

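Load the data and tidy it into long form. David recommends using pivot_longer to turn each element into an observation, and also splitting out season and title. Here the EPISODE column is split into separate season and episode columns by extracting its two numeric fields.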

df = read_csv('bob_ross.csv')
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   EPISODE = col_character(),
##   TITLE = col_character()
## )
## See spec(...) for full column specifications.
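# pivot all the subject columns (everything after EPISODE and TITLE) to long form
df = df %>% pivot_longer(cols = 3:ncol(df), names_to = 'subject', values_to = 'present', values_drop_na = TRUE)
# EPISODE holds two runs of digits (season, then episode); collect them into a two-column matrix
season_ep = matrix(unlist(str_extract_all(df$EPISODE, '\\d+')), nrow = nrow(df), byrow = TRUE)
colnames(season_ep) = c('season', 'episode')
# prepend the new columns and drop the original EPISODE column
df = cbind(season_ep, df)
df = subset(df, select = -c(EPISODE))
kable(head(df), caption = 'Tidied data')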
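Tidied data

| season | episode | TITLE                 | subject         | present |
|--------|---------|-----------------------|-----------------|---------|
| 01     | 01      | “A WALK IN THE WOODS” | APPLE_FRAME     | 0       |
| 01     | 01      | “A WALK IN THE WOODS” | AURORA_BOREALIS | 0       |
| 01     | 01      | “A WALK IN THE WOODS” | BARN            | 0       |
| 01     | 01      | “A WALK IN THE WOODS” | BEACH           | 0       |
| 01     | 01      | “A WALK IN THE WOODS” | BOAT            | 0       |
| 01     | 01      | “A WALK IN THE WOODS” | BRIDGE          | 0       |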
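
The same tidy step can also be written as a single pipeline. The following is a minimal sketch on made-up toy data, not the code used above: the names `wide` and `tidy` are hypothetical, and the `S01E01` format of EPISODE is an assumption based on the two numeric fields extracted earlier. It uses `tidyr::extract` with `convert = TRUE`, so season and episode come out as integers rather than the zero-padded strings shown in the table.

```r
library(dplyr)
library(tidyr)

# Toy wide data in the assumed shape of bob_ross.csv (values are made up)
wide <- tibble(
  EPISODE = c("S01E01", "S01E02"),
  TITLE   = c("A", "B"),
  BARN    = c(0, 1),
  TREE    = c(1, 1)
)

tidy <- wide %>%
  # every column except the two identifiers becomes a (subject, present) pair
  pivot_longer(cols = -c(EPISODE, TITLE), names_to = "subject", values_to = "present") %>%
  # capture the two runs of digits in EPISODE as season and episode
  tidyr::extract(EPISODE, into = c("season", "episode"),
                 regex = "(\\d+)\\D+(\\d+)", convert = TRUE)

print(tidy)
```

Keeping season as a number makes it easier to sort or plot seasons in chronological order later on.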

Discussion questions: no analysis was explicitly requested, so I pose my own questions. Which subjects appear most and least often?

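# keep only the subjects that actually appear in an episode
sub_df = df %>% filter(present == 1)
# perc is each subject's share of all subject appearances (a proportion between 0 and 1)
sub_df = sub_df %>% count(subject) %>% mutate(perc = n / nrow(sub_df))
ggplot(sub_df, aes(x = reorder(subject, perc), y = perc)) + geom_col(color = 'blue', fill = 'white') + coord_flip()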

We see that trees are by far the most common subject of the paintings. The least common subject is an apple frame.
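
The proportions plotted above are shares of all subject appearances. A complementary view is the share of episodes in which each subject appears, with the extremes pulled out programmatically instead of read off the chart. This is only a sketch on made-up toy data: the names `long`, `subject_freq`, `most`, and `least` are hypothetical.

```r
library(dplyr)

# Toy long-form data (made-up values): 3 episodes x 3 subjects
long <- tibble(
  episode = rep(1:3, each = 3),
  subject = rep(c("TREE", "BARN", "BOAT"), times = 3),
  present = c(1, 0, 0,
              1, 1, 0,
              1, 0, 0)
)

subject_freq <- long %>%
  group_by(subject) %>%
  summarise(
    appearances       = sum(present),   # number of episodes featuring the subject
    share_of_episodes = mean(present),  # fraction of all episodes featuring it
    .groups = "drop"
  )

# most and least frequent subjects (ties, if any, are all returned)
most  <- slice_max(subject_freq, order_by = appearances, n = 1)
least <- slice_min(subject_freq, order_by = appearances, n = 1)

print(most)
print(least)
```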

Which season had the most and the fewest subjects?

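# keep only the subjects that actually appear in an episode
sub_df = df %>% filter(present == 1)
# perc is each season's share of all subject appearances (a proportion between 0 and 1)
sub_df = sub_df %>% count(season) %>% mutate(perc = n / nrow(sub_df))
ggplot(sub_df, aes(x = reorder(season, perc), y = perc)) + geom_col(color = 'blue', fill = 'white') + coord_flip()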

Season 10 appears to have the most subjects. Season 25 has the fewest.
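
Season totals are only directly comparable if every season has the same number of episodes, which is not checked here. A hedged sketch of one way to guard against that is to report subjects per episode alongside the raw total. The toy data and the names `long` and `per_season` are made up for illustration.

```r
library(dplyr)

# Toy long-form data (made-up values): season 01 has two episodes, season 02 has one
long <- tibble(
  season  = c("01", "01", "01", "01", "02", "02"),
  episode = c("01", "01", "02", "02", "01", "01"),
  subject = rep(c("TREE", "BARN"), times = 3),
  present = c(1, 1, 1, 0, 1, 1)
)

per_season <- long %>%
  group_by(season) %>%
  summarise(
    subjects = sum(present),          # total subject appearances in the season
    episodes = n_distinct(episode),   # number of episodes in the season
    subjects_per_episode = subjects / episodes,
    .groups = "drop"
  )

print(per_season)
```

In this toy example season 01 has the larger total but the smaller per-episode average, which is the kind of reversal the normalization is meant to expose.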

It would be interesting to see how the number of subjects changes from season to season. One could investigate whether the variation arises because Bob Ross teaches different techniques and therefore spends varying amounts of time demonstrating a technique on a particular subject. One might reason that seasons 10, 12, 13, and 14 are technique-dense, since they contain more subjects.
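
As a starting point for that follow-up, the per-season counts can be plotted in chronological order rather than sorted by size, so that any trend over time is visible. This is a minimal sketch on made-up toy data: `long`, `season_counts`, and `p` are hypothetical names, and season is assumed to be a zero-padded string as in the tidied table.

```r
library(dplyr)
library(ggplot2)

# Toy long-form data (made-up values), one row per subject appearance
long <- tibble(
  season  = c("01", "01", "02", "02", "02", "03"),
  present = c(1, 1, 1, 1, 1, 1)
)

season_counts <- long %>%
  filter(present == 1) %>%
  count(season) %>%
  mutate(season = as.integer(season))  # numeric so the x axis is in season order

# line chart of subject appearances by season
p <- ggplot(season_counts, aes(x = season, y = n)) +
  geom_line() +
  geom_point() +
  labs(x = "Season", y = "Subject appearances")

print(p)
```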