Let’s pick up where we left off from the first two steps, with one big difference: we’ll be using R, a statistical programming language used in other chapters of this book, instead of Python.

You can find instructions for downloading R and RStudio in a chapter of the book Data Science in Education Using R (Estrellado et al., 2020): https://datascienceineducation.com/c05

We’ll assume some basic knowledge of R here.

First, let’s load the tidyverse package, a set of R packages that work together for common analytic tasks.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Then, let’s find the .srt files we created in the last step:

file_paths <- list.files(path = ".", pattern = "\\.srt$", recursive = TRUE, full.names = TRUE) # List all .srt files recursively

And then read them in, saving them to the object l:

l <- map(file_paths, read_lines)
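
We can quickly confirm that one transcript was read in per file:

length(l) # number of transcripts; should equal length(file_paths)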

We have now read in the transcripts! They should look like this (here is the first one, accessed by indexing the first list item; we show the first ten lines). Notice the repeating four-line structure of a counter, a timestamp, the text, and a blank line; we will rely on this structure when parsing the files:

l[[1]] %>% 
  head(10)
##  [1] "1"                                                                                     
##  [2] "00:00:00,000 --> 00:00:07,160"                                                         
##  [3] "Hey folks, Phil Plait here, and for the past few episodes I've been going over what we"
##  [4] ""                                                                                      
##  [5] "2"                                                                                     
##  [6] "00:00:07,160 --> 00:00:11,880"                                                         
##  [7] "know about the structure, history, and evolution of the universe and how we know it."  
##  [8] ""                                                                                      
##  [9] "3"                                                                                     
## [10] "00:00:11,880 --> 00:00:14,160"

Before proceeding with those, we will do some work on the file names. Backing up a step: for use in analyses, we often need metadata, or information about the files (in this case, which channel, playlist, and video each transcript is for). That metadata can be entered manually, or it can be cleverly recovered from the files themselves.

Here, when we look at the file names, we can see that each path contains three pieces of information that may be useful as metadata. Let’s take a look at one:

file_paths[[1]]
## [1] "./crash course/astronomy/audio_A Brief History of the Universe Crash Course Astronomy 44.srt"

Let’s create a data frame with these, and then split the path into its different elements:

df <- tibble(path = file_paths) # Create a data frame

# Split the paths into separate columns
df <- df %>%
  # Separate the path into multiple columns based on the "/" delimiter
  separate(col = path, 
           into = c("Folder1", "channel", "playlist", "video"), # Add more columns if needed
           sep = "/", 
           remove = FALSE, 
           extra = "merge") %>% 
  select(-Folder1)

df
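
Based on the example path above, the first row of df should have “crash course” in the channel column, “astronomy” in playlist, and the .srt file name in video; a quick check:

df %>% 
  slice(1) %>% 
  select(channel, playlist, video)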

Now, we have our transcripts loaded and our metadata ready. The next step is a big one that we’ll introduce primarily through comments: this is code to create a custom function that we will use to process each of the transcript files:

process_transcripts <- function(d) {
  
  my_nrow <- length(d) # this is to find out how long each transcript file is
  
  d %>% 
    as_tibble() %>% 
    rename(X1 = 1) %>% # this is to rename the first column to the name X1, as 1 is difficult to type in R!
    mutate(id = rep(c("i", "time", "transcript", "blank"), my_nrow/4), # label each line by its role; assumes each entry spans exactly four lines (counter, time, text, blank)
           index = rep(1:(my_nrow/4), times = 4) %>% sort()) %>% # create a row index
    spread(id, X1) %>% # change the data from long to wide format
    select(-i) %>% 
    separate(time, into = c("start", "end"), sep = "-->") %>% # process the time stamps
    mutate(start = str_trim(start), 
           end = str_trim(end)) # trim the time stamps so they are easier to read and use
}
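
Before mapping this function over all of the transcripts, it can be worth trying it on a single one as a sanity check:

process_transcripts(l[[1]]) # should return a tibble with one row per subtitle entry, including start, end, and transcript columns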

Whereas with Python we used a for loop, in R, for loops are less common than apply-style functions. Both approaches are used for iteration. Given the kinds of data R is chiefly intended to work with and how R works most efficiently as a programming language, apply-style functions are generally the better way to go. Here, we will use the map() function that is part of the tidyverse (specifically, the purrr package).
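
For comparison, here is a rough sketch of the same iteration written as a for loop (without the error handling that possibly() adds below):

ll <- vector("list", length(l)) # pre-allocate the output list
for (i in seq_along(l)) {
  ll[[i]] <- process_transcripts(l[[i]]) # process each transcript in turn
}

The map() version that follows is more concise and pairs naturally with purrr’s error-handling helpers: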

ll <- l %>% 
  map(possibly(process_transcripts, NULL)) # here, we use map(); possibly() returns NULL for any transcript that errors instead of stopping

ll <- compact(ll) # this removes any NULL list items that resulted from errors

ll <- imap(ll, ~ mutate(.x, group = .y)) # this adds a group column with each transcript's position in the list

bound_rows <- ll %>% 
  map_df(~.) # this changes the list of data frames into a single data frame
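
At this point, we can check how many subtitle entries we have across all of the transcripts:

nrow(bound_rows) # the total number of subtitle entries across all transcripts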

We are getting close to being ready for analyses. Next, we process the transcripts to create several variables that will be useful for our analysis. Most important among these are two variables: a count of terms indicating psychological uncertainty and a count of terms indicating statistical uncertainty.

For these purposes, we employ Kirch’s (2011) typology of uncertainty as both a psychological condition and a statistical object. Kirch writes, “uncertainty refers to a psychological condition of being in doubt (e.g., I am uncertain about something or someone…). It also refers to a statistical (or mathematical) object (e.g., a statistical estimation of uncertainty)”. As a psychological condition, uncertainty can be procedural, sociocultural, epistemological, and ontological. As a statistical object, it refers to measurement, sampling, repeatability, and predictive value. We use this framework later in our analysis.

Our computational approach is a dictionary-based approach (see Nelson et al., 2021 for a definition). This is a relatively straightforward text-analysis technique: it involves searching for key words in text. The dictionary is provided by the analyst, but we can use R to conduct the search automatically. We acknowledge that more complex approaches could be helpful, but we chose this approach given our aim of triangulating evidence: qualitative analyses can complement this approach by providing context and depth to what the computational approach reveals.

Our two dictionaries, corresponding to the two facets of our conception of uncertainty (psychological and statistical), follow.

Note: We queried ChatGPT several times to generate starting points for these lists by providing the above definition and asking for a list of relevant terms, which we then edited. We subsequently queried ChatGPT to ask for different forms of words (e.g., “unsure” and “not sure”) and to ensure the words were appropriate for middle and high school students.

psychological_uncertainty <- c(
  "unsure", "not sure", "maybe", "kind of", "sort of", "don't know", 
  "doubt", "doubtful", "no clue", "unclear", "confused", "confusing", 
  "hesitant", "don't get", "don't understand", "ambivalent", 
  "can't decide", "questioning", "question", "wondering", "wonder", 
  "weird", "strange", "odd", "weirded out", "puzzled", "puzzling", 
  "don't get it", "weird feeling", "weirdly", "skeptical", "skeptic", 
  "guessing", "guess", "vague", "ambiguous", "indefinite", 
  "uncertain", "iffy", "on the fence", "mixed up", "unsure what to do"
)


statistical_uncertainty <- c(
  "mistake", "mistakes", "error", "errors", "different every time", "changes", 
  "change", "guess", "guessing", "rough idea", "about", "around", 
  "estimate", "estimates", "estimating", "guesswork", "ballpark", 
  "unsure how many", "don't know how many", "probably", "might be", 
  "could be", "possibly", "chance", "luck", "random", "weird result", 
  "unexpected", "surprised", "surprising", "weird answer", "weird number", 
  "doesn't add up", "doesn't seem right", "doesn't make sense", "not exact", 
  "not precise", "approximate", "roughly", "about right", "just about", 
  "kind of close", "more or less", "almost", "nearly", "not quite", 
  "sort of like", "kind of like", "like", "similar to", "close to", 
  "in the ballpark", "round about", "give or take", "plus or minus", 
  "so-so", "thereabouts", "ish", "something like", "can't be sure", 
  "unclear number", "unclear amount", "funny number", "funny result"
)

First, let’s process the transcripts a bit further to create some useful variables and select columns.

out <- bound_rows %>% 
  mutate(start = str_sub(start, 1, 8), # keep only the HH:MM:SS portion of each time stamp
         end = str_sub(end, 1, 8)) %>% 
  mutate(start = chron::chron(times = start), # convert the time stamps into chron times objects
         end = chron::chron(times = end)) %>% 
  mutate(duration = end - start) %>% # compute each entry's duration
  select(index, start, end, duration, everything())

out
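
If you are curious what str_sub() is doing above, here is a quick check using the first timestamp we saw earlier:

str_sub("00:00:07,160", 1, 8)
## [1] "00:00:07"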

Next, we can apply these lists.

# function to count words from a dictionary in a text
count_words <- function(text, dictionary) {
  sum(str_count(text, paste0("\\b", dictionary, "\\b"))) # word boundaries (\\b) ensure terms match only as whole words or phrases
}
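
For example, applying the function to a made-up line of text matches “not sure” and “maybe” from the psychological dictionary:

count_words("i'm not sure, but maybe", psychological_uncertainty)
## [1] 2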

# apply the function to each row of your dataframe
out <- out %>%
  mutate(transcript = tolower(transcript)) %>%
  mutate(
    count_psychological_uncertainty = map_dbl(transcript, ~count_words(.x, psychological_uncertainty)),
    count_statistical_uncertainty = map_dbl(transcript, ~count_words(.x, statistical_uncertainty))
  )

We’re almost done! Here’s our data frame at this point.

out

Last, we can join in the metadata we created earlier. Recall that the group column in out is each transcript’s position in the list (added with imap()); giving df the same position index lets us join the two:

df <- mutate(df, group = 1:nrow(df)) # add a position index matching the group column in out

out <- out %>% 
  left_join(df)
## Joining with `by = join_by(group)`
out <- out %>% 
  select(group, channel, playlist, video, everything(), -index)

out
out %>% 
  write_csv("processed-youtube-videos-3.csv")

@book{estrellado2020data, title={Data science in education using {R}}, author={Estrellado, Ryan A and Freer, Emily and Mostipak, Jesse and Rosenberg, Joshua M and Vel{\'a}squez, Isabella C}, year={2020}, publisher={Routledge} }

@book{wickham2019advanced, title={Advanced {R}}, author={Wickham, Hadley}, year={2019}, publisher={CRC Press} }

@article{nelson2021future, title={The future of coding: A comparison of hand-coding and three types of computer-assisted text analysis methods}, author={Nelson, Laura K and Burk, Derek and Knudsen, Marcel and McCall, Leslie}, journal={Sociological Methods \& Research}, volume={50}, number={1}, pages={202--237}, year={2021}, publisher={SAGE Publications} }