Here I hope to extract the titles from the episodes of Frasier into an easy-to-use data frame.
Because the text file is relatively small, we can simply use readLines to load the data.
raw_data <- readLines("frasierEpisodes.txt")
head(raw_data)
## [1] "1. 1-1 16 Sep 93 <a target=\"_blank\" href=\"http://www.tvmaze.com/episodes/49441/frasier-1x01-the-good-son\">The Good Son</a>"
## [2] "2. 1-2 23 Sep 93 <a target=\"_blank\" href=\"http://www.tvmaze.com/episodes/49442/frasier-1x02-space-quest\">Space Quest</a>"
## [3] "3. 1-3 30 Sep 93 <a target=\"_blank\" href=\"http://www.tvmaze.com/episodes/49443/frasier-1x03-dinner-at-eight\">Dinner at Eight</a>"
## [4] "4. 1-4 07 Oct 93 <a target=\"_blank\" href=\"http://www.tvmaze.com/episodes/49444/frasier-1x04-i-hate-frasier-crane\">I Hate Frasier Crane</a>"
## [5] "5. 1-5 14 Oct 93 <a target=\"_blank\" href=\"http://www.tvmaze.com/episodes/49445/frasier-1x05-heres-looking-at-you\">Here's Looking at You</a>"
## [6] "6. 1-6 21 Oct 93 <a target=\"_blank\" href=\"http://www.tvmaze.com/episodes/49446/frasier-1x06-the-crucible\">The Crucible</a>"
Now comes the fun part. I want to extract the season numbers, the episode numbers, the titles of each episode, and maybe the air date.
library("stringr")
The pattern for the season and episode numbers is one or more digits, a hyphen, and one or more digits. Since str_extract returns only the first match in each line, this picks up the season-episode pair near the start of the line.
seasonEpisode_pattern <- "[0-9]+-[0-9]+"
seasonAndEpisode <- str_extract(raw_data, seasonEpisode_pattern)
The pattern for the dates is two digits, a space, three letters, a space, and two digits.
airDate_pattern <- "[0-9]{2} [A-Za-z]{3} [0-9]{2}"
airDate <- str_extract(raw_data, airDate_pattern)
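The extracted air dates are still plain strings. If real dates are wanted later (for sorting, say), a minimal sketch using base R's as.Date could look like the following. The vector name airDate_example is made up for illustration, and the example assumes an English locale, because %b matches abbreviated month names in the current locale.

```r
# A few air dates in the same "dd Mon yy" form as the extracted strings
airDate_example <- c("16 Sep 93", "07 Oct 93", "17 Sep 96")

# %d = day of month, %b = abbreviated month name, %y = two-digit year
# (R reads two-digit years 69-99 as 19xx and 00-68 as 20xx)
parsed <- as.Date(airDate_example, format = "%d %b %y")

format(parsed, "%Y-%m-%d")
```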
To extract the titles, I will start by finding any characters between the “>” and the “<” of the HTML link tags, and then I can simply trim off those brackets.
title_pattern <- ">(.+?)<"
raw_title <- str_extract(raw_data, title_pattern)
title <- str_sub(raw_title, 2, str_length(raw_title) - 1)
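As an alternative to trimming the brackets by hand, stringr's str_match returns the contents of a capture group directly. The sketch below runs on two lines copied from the output above; the names example_lines and example_title are only for illustration.

```r
library(stringr)

# Two lines copied from the head(raw_data) output above
example_lines <- c(
  '1. 1-1 16 Sep 93 <a target="_blank" href="http://www.tvmaze.com/episodes/49441/frasier-1x01-the-good-son">The Good Son</a>',
  '6. 1-6 21 Oct 93 <a target="_blank" href="http://www.tvmaze.com/episodes/49446/frasier-1x06-the-crucible">The Crucible</a>'
)

# str_match returns a character matrix: column 1 is the full match
# (brackets included), column 2 is the first capture group (title only)
example_title <- str_match(example_lines, ">(.+?)<")[, 2]

example_title
```

The lazy quantifier in .+? matters here: it stops at the first "<" after the opening tag, so the closing tag is never swallowed into the title.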
Now we can combine the extracted strings into a nice data frame.
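# stringsAsFactors = FALSE keeps the columns as character vectors on R versions before 4.0
Frasier <- data.frame(seasonAndEpisode, airDate, title, stringsAsFactors = FALSE)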
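Because str_extract returns NA for any line that does not match a pattern, it can be worth checking for rows where an extraction failed. Here is a small sketch of such a check; the two input lines are made up (the second is deliberately malformed), and example_df and bad_rows are illustrative names.

```r
library(stringr)

# Two made-up lines: the first follows the file's format, the second is
# deliberately malformed to show what a failed extraction looks like
example_lines <- c(
  '1. 1-1 16 Sep 93 <a href="http://example.com/1">The Good Son</a>',
  'Season 1'
)

example_df <- data.frame(
  seasonAndEpisode = str_extract(example_lines, "[0-9]+-[0-9]+"),
  airDate = str_extract(example_lines, "[0-9]{2} [A-Za-z]{3} [0-9]{2}"),
  stringsAsFactors = FALSE
)

# complete.cases() is FALSE for any row that contains an NA
bad_rows <- which(!complete.cases(example_df))
bad_rows
```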
Let us split that seasonAndEpisode variable.
library(tidyverse)
# convert = TRUE stores season and episode as integers rather than strings
Frasier <- Frasier %>%
  separate(seasonAndEpisode, c("season", "episode"), sep = "-", convert = TRUE)
head(Frasier)
##   season episode   airDate                 title
## 1      1       1 16 Sep 93          The Good Son
## 2      1       2 23 Sep 93           Space Quest
## 3      1       3 30 Sep 93       Dinner at Eight
## 4      1       4 07 Oct 93  I Hate Frasier Crane
## 5      1       5 14 Oct 93 Here's Looking at You
## 6      1       6 21 Oct 93          The Crucible
Now we can write some code that randomly selects an episode! Since each of the 11 seasons had exactly 24 episodes, this part is easy to code.
seasonPicker <- sample(1:11, 1)
episodePicker <- sample(1:24, 1)
Frasier %>%
  filter(season == seasonPicker, episode == episodePicker)
##   season episode   airDate               title
## 1      4       1 17 Sep 96 The Two Mrs. Cranes
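For a show whose seasons are not all the same length, one option is to sample a row of the data frame directly instead of sampling a season and an episode separately. The sketch below uses a tiny stand-in data frame (Frasier_example, a made-up name) so that it runs on its own.

```r
# A tiny stand-in for the full data frame, built from rows shown above
Frasier_example <- data.frame(
  season  = c(1L, 1L, 4L),
  episode = c(1L, 2L, 1L),
  title   = c("The Good Son", "Space Quest", "The Two Mrs. Cranes"),
  stringsAsFactors = FALSE
)

# Pick one row index uniformly at random from 1..nrow
random_row <- sample(nrow(Frasier_example), 1)

Frasier_example[random_row, ]
```

This gives every episode the same chance of being chosen regardless of how many episodes each season has.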
The data came from http://epguides.com/Frasier/: I viewed the page's HTML source and copied and pasted what I needed into a text document.