607_Project_2

The goal of Project 2 is to choose wide/untidy datasets from the Week 5 Discussion, read them from CSV into R, tidy and transform them as needed, and perform the analysis requested in the discussion. Here I use David Moste's Week 5 Discussion.

Read the data from the original repo and write it to a local CSV.

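# packages used below: tidyverse (readr, dplyr, tidyr, stringr, ggplot2) and knitr for kable()
library(tidyverse)
library(knitr)

df = read_csv('https://raw.githubusercontent.com/jmhsi/DATA_607/master/Project%202/bob_ross.csv')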

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   EPISODE = col_character(),
##   TITLE = col_character()
## )

## See spec(...) for full column specifications.

write.csv(df, 'bob_ross.csv', row.names=FALSE)

Load the data and tidy it into long form. David recommends using pivot_longer to turn each subject into an observation, and also splitting the season and episode out of EPISODE.

df = read_csv('bob_ross.csv')

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   EPISODE = col_character(),
##   TITLE = col_character()
## )

## See spec(...) for full column specifications.

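# pivot all the subject columns (everything after EPISODE and TITLE) into long form
df = df %>% pivot_longer(cols = 3:ncol(df), names_to = 'subject', values_to = 'present', values_drop_na = TRUE)
# pull the digits out of EPISODE; this assumes every value contains exactly two numbers (season, then episode)
season_ep = matrix(unlist(str_extract_all(df$EPISODE, '\\d+')), nrow = nrow(df), byrow = TRUE)
colnames(season_ep) = c('season', 'episode')
# prepend the new columns and drop the original EPISODE column
df = cbind(season_ep, df)
df = subset(df, select = -c(EPISODE))
kable(head(df), caption = 'Tidied data')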

Tidied data
season	episode	TITLE	subject
01	01	“A WALK IN THE WOODS”	APPLE_FRAME
01	01	“A WALK IN THE WOODS”	AURORA_BOREALIS
01	01	“A WALK IN THE WOODS”	BARN
01	01	“A WALK IN THE WOODS”	BEACH
01	01	“A WALK IN THE WOODS”	BOAT
01	01	“A WALK IN THE WOODS”	BRIDGE
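
As a minimal sketch of the reshaping step, the toy table below mimics the wide layout, with one 0/1 column per subject. The toy rows and the `S01E01` episode format are assumptions made for illustration, not values taken from the actual file. It uses `tidyr::extract()` as an alternative to the matrix approach above, so season and episode are pulled out in a single step.

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: one row per episode, one 0/1 column per subject
wide <- tibble(
  EPISODE = c("S01E01", "S01E02"),
  TITLE   = c("EPISODE A", "EPISODE B"),
  BARN    = c(0, 1),
  TREE    = c(1, 1)
)

# One row per (episode, subject) pair
long <- pivot_longer(wide, cols = -c(EPISODE, TITLE),
                     names_to = "subject", values_to = "present")

# Split EPISODE into season and episode; convert = TRUE turns "01" into 1
long <- tidyr::extract(long, EPISODE, into = c("season", "episode"),
                       regex = "S(\\d+)E(\\d+)", convert = TRUE)

print(long)
```

Converting season to an integer here is a design choice: it lets later plots order seasons numerically rather than as text.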

Discussion questions: no specific analysis was requested, so I pose my own questions. Which subjects appear the most and the least often?

sub_df = df %>% filter(present == 1)
sub_df = sub_df %>% count(subject) %>% mutate(perc = n / nrow(sub_df))
ggplot(sub_df, aes(x = reorder(subject, perc), y = perc))+ geom_bar(stat = 'identity', color="blue", fill = 'white') + coord_flip()

We see that trees are by far the most common subject of the paintings. The least common subject is an apple frame.
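
To read the extremes off directly rather than from the plot, the shares can be sorted. This is a minimal sketch on made-up long-format data that reuses the `subject` and `present` column names from above; the counts are illustrative only and say nothing about the real dataset.

```r
library(dplyr)

# Hypothetical long-format data (same column names as the tidied data)
toy <- tibble(
  subject = c("TREE", "TREE", "TREE", "BARN", "BARN", "BOAT", "BOAT"),
  present = c(1, 1, 1, 1, 1, 1, 0)
)

ranked <- toy %>%
  filter(present == 1) %>%          # keep only subjects that appear
  count(subject, name = "n") %>%    # appearances per subject
  mutate(perc = n / sum(n)) %>%     # share of all appearances
  arrange(desc(perc))

head(ranked, 1)  # most common subject
tail(ranked, 1)  # least common subject
```

Dividing by `sum(n)` gives the same share as dividing by the number of filtered rows, since each filtered row is one appearance.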

Which season had the most and the fewest subjects?

sub_df = df %>% filter(present == 1)
sub_df = sub_df %>% count(season) %>% mutate(perc = n / nrow(sub_df))
ggplot(sub_df, aes(x = reorder(season, perc), y = perc))+ geom_bar(stat = 'identity', color="blue", fill = 'white') + coord_flip()

Season 10 appears to have the most subjects, and Season 25 the fewest.

It would be interesting to see how the number of subjects changes from season to season. One could investigate whether this is because Bob Ross teaches different techniques and therefore spends varying amounts of time demonstrating a technique on a particular subject. One might reason that seasons 10, 12, 13, and 14 are technique-dense, as these seasons contain more subjects.
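
One possible starting point is to plot subjects per season in chronological order and normalise by episode count, since a season with more episodes would show more subjects regardless of technique. The sketch below uses invented data with the same column names as the tidied data frame; whether episode counts actually differ across seasons in the real data is not checked here.

```r
library(dplyr)
library(ggplot2)

# Hypothetical tidied data: season/episode stored as zero-padded strings
toy <- tibble(
  season  = c("01", "01", "01", "01", "02", "02", "02", "02"),
  episode = c("01", "01", "02", "02", "01", "01", "02", "02"),
  subject = rep(c("TREE", "BARN"), 4),
  present = c(1, 0, 1, 1, 1, 1, 1, 1)
)

per_season <- toy %>%
  filter(present == 1) %>%
  group_by(season) %>%
  # episodes counts only episodes with at least one subject present
  summarise(subjects = n(),
            episodes = n_distinct(episode)) %>%
  mutate(season = as.integer(season),
         subjects_per_episode = subjects / episodes)

# Seasons on the x axis in numeric order, so a trend over time is visible
p <- ggplot(per_season, aes(x = season, y = subjects_per_episode)) +
  geom_line() +
  geom_point()
print(p)
```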

607_Project_2_2

Justin Hsi

3/5/2020