Project 2

Overview


Discussion posts chosen for this project:

1. My own: literary agents from pw.org
2. Brad’s: New York voter registration from elections.ny.gov
3. Daniel’s: epidemics listing from wikipedia.org

In each of these cases, the following steps are performed:
Figure 1. Overview flowchart

Literary agents


The task is to scrape the pw.org website for a listing of literary agents and store the results in a searchable format.

* Scrape, parse, and store the agent listings.
* Filter out the agents who are most likely to accept my book.
* Display a list of candidate agents, along with submission guidelines.

Scrape website


I used a python script (data607-1.py) to scrape the dat from the pw.org website. The script does the following:

1. Connects to pw.org and saves the html output to a file.
2. Parses the file using a series of regular expressions.
3. Saves the main agent listing in a csv file.
4. Using the URLs scraped in step 2 above, connects to each agency’s web page on pw.org.
5. Scrapes the agency’s web page and saves the html output to a file.
6. Parses the files using regular expressions.
7. Saves the agency info in a csv file.

Import from Github


# Import the main pw.org listing CSV from Github
csvfile <- getURL("https://raw.githubusercontent.com/mmippolito/cuny/main/data607/project2/data607-main.csv")
pwmain <- read.csv(text = csvfile)
kable(pwmain[0:5,] %>% 
  mutate(genres = str_c(substr(genres, 0, 30), "..."), authors = str_c(substr(authors, 0, 30), "...")), caption = 
  "<i><font color=#000000><b>Table 1.</b> pw.org agent listing (main table) read from CSV</font></i>") %>% 
  kable_styling(latex_options = "striped")
Table 1. pw.org agent listing (main table) read from CSV
agent_id name link agency genres authors
1 Laurie Abkemeier /literary_agents/laurie_abkemeier DeFiore and Company BIPOC Voices|Journalism/Invest… Jennifer Keishin Armstrong|Joh…
2 Miriam Altshuler /literary_agents/miriam_altshuler DeFiore and Company Autobiography/Memoir|Commercia… Harriet Brown|Andrew Carroll|R…
3 Betsy Amster /literary_agents/betsy_amster Betsy Amster Literary Enterprises Fiction|Graphic/Illustrated|Li… Dr. Elaine Aron|Sandi Ault|Kim…
4 Claire Anderson-Wheeler /literary_agents/claire_andersonwheeler Regal Hoffmann & Associates Autobiography/Memoir|Fiction|L… Dan Cluchey|Patrick Dacey|Eldo…
5 Nicole Aragi /literary_agents/nicole_aragi Aragi Inc.  Feminist|Graphic/Illustrated|H… Jonathan Safran Foer|Colson Wh…
# Import the agency listing CSV from Github
csvfile <- getURL("https://raw.githubusercontent.com/mmippolito/cuny/main/data607/project2/data607-agency.csv")
pwagencies <- read.csv(text = csvfile)
kable(pwagencies[0:5,] %>% 
  mutate(guidelines = str_c(substr(guidelines, 0, 30), "..."), tips = str_c(substr(tips, 0, 30), "...")), caption = 
  "<i><font color=#000000><b>Table 2.</b> pw.org agent listing (agents table) read from CSV</font></i>") %>% 
  kable_styling(latex_options = "striped")
Table 2. pw.org agent listing (agents table) read from CSV
agent_fk address ext_addr locality region postal phone web email twitter elec_sub guidelines tips
1 47 East 19th Street Third Floor New York NY 10003
  1. 925-7744
http://www.defliterary.com @LaurieAbkemeier Yes … …
2 47 East 19th Street Third Floor New York NY 10003
  1. 925-7744
http://www.defliterary.com @MiriamAltshuler Yes … Authors should visit our websi…
3 607 Foothill Boulevard #1061 Third Floor La Canada Flintridge CA 91012
  1. 529-5667
http://www.amsterlit.com Yes … Please consult my website for …
4 242 West 38th Street Suite 901 New York NY 10001
  1. 684-7900
http://rhaliterary.com @claireawheeler Yes Please send an overview/synops… A thorough synopsis is always …
5 143 West 27th Street #4F New York NY 10001
  1. 675-8353
http://www.aragi.net @AragiAuthors Yes … …

Tidy


Tidying tasks:

* Separate genres by the pipe symbol and put into a separate dataframe.
* Separate authors by the pipe symbol and put into a separate dataframe.

# separate genres
agent_genres_df <- pwmain %>% select(agent_id, genres) %>% rename(agent_fk = agent_id)
agent_genres_df <- separate_rows(agent_genres_df, genres, sep = "\\|", convert = TRUE)
agent_genres_df <- mutate(agent_genres_df, agent_genre_id = row_number()) %>% 
  relocate(agent_genre_id, .before = agent_fk)

# separate clients (authors)
agent_authors_df <- pwmain %>% select(agent_id, authors) %>% rename(agent_fk = agent_id)
agent_authors_df <- separate_rows(agent_authors_df, authors, sep = "\\|", convert = TRUE)
agent_authors_df <- mutate(agent_authors_df, agent_author_id = row_number()) %>% 
  relocate(agent_author_id, .before = agent_fk)

Normalize


Normalize the dataframes into third normal form. Entity relationship diagram:

Image of an entity relationship diagram

# genres table
genres_df <- agent_genres_df %>% filter(genres != "") %>% group_by(genres) %>% summarize(genre = unique(genres)) %>%
  select(-genres) %>% mutate(genre_id = row_number()) %>% relocate(genre_id, .before = "genre")

# authors table
authors_df <- agent_authors_df %>% filter(str_length(authors) > 2) %>% group_by(authors) %>% summarize(author = unique(authors)) %>%
  select(-authors) %>% mutate(author_id = row_number()) %>% relocate(author_id, .before = "author")

# normalize agent_genres_df
agent_genres_df <- agent_genres_df %>% merge(genres_df, by.x = "genres", by.y = "genre") %>%
  select(-genres) %>% rename(genre_fk = genre_id)

# nNrmalize agent_authors_df
agent_authors_df <- agent_authors_df %>% merge(authors_df, by.x = "authors", by.y = "author") %>%
  select(-authors) %>% rename(author_fk = author_id)

# Remove genres and authors from pwmain since we normalized these fields
pwmain <- select(pwmain, -genres, -authors)

# Add primary key to pwagencies
pwagencies <- pwagencies %>% mutate(pwagency_id = row_number()) %>% relocate(pwagency_id, .before = agent_fk)

# Display tables
kable(genres_df[0:5,], 
  caption = "<i><font color=#000000><b>Table 3.</b> genres_df child tablef</font></i>") %>% 
  kable_styling(latex_options = "striped")
Table 3. genres_df child tablef
genre_id genre
1 Autobiography/Memoir
2 BIPOC Voices
3 Commercial Fiction
4 Cross-genre
5 Experimental
kable(authors_df[0:5,], 
  caption = "<i><font color=#000000><b>Table 4.</b> authors_df child tablef</font></i>") %>% 
  kable_styling(latex_options = "striped")
Table 4. authors_df child tablef
author_id author
1 A. Brad Schwartz
2 A.E. Hotchner
3 Abbigail Rosewood
4 Abby Wambach
5 Abigail Thomas
kable(agent_genres_df[0:5,], 
  caption = "<i><font color=#000000><b>Table 5.</b> agent_genres_df child tablef</font></i>") %>% 
  kable_styling(latex_options = "striped")
Table 5. agent_genres_df child tablef
agent_genre_id agent_fk genre_fk
583 109 1
147 30 1
404 77 1
254 47 1
753 140 1
kable(agent_authors_df[0:5,], 
  caption = "<i><font color=#000000><b>Table 5.</b> agent_authors_df child tablef</font></i>") %>% 
  kable_styling(latex_options = "striped")
Table 5. agent_authors_df child tablef
agent_author_id agent_fk author_fk
555 77 1
130 20 2
1016 149 3
834 122 4
277 40 5

Analysis


# Generate agent listing by select genres
selected_agents <- filter(genres_df, genre == "Speculative Fiction" | genre == "Literary Fiction") %>%
  merge(agent_genres_df, by.x = "genre_id", by.y = "genre_fk") %>%
  merge(pwmain, by.x = "agent_fk", by.y = "agent_id") %>%
  merge(pwagencies, by = "agent_fk") %>%
  spread(key = genre, value = genre) %>%
  mutate(`Literary Fiction` = ifelse(is.na(`Literary Fiction`), "no", "yes")) %>%
  mutate(`Speculative Fiction` = ifelse(is.na(`Speculative Fiction`), "no", "yes")) %>%
  select(-genre_id, -agent_genre_id, -link, -pwagency_id)

# Search agent tips for keywords
selected_agents <- selected_agents %>% 
  filter(str_detect(tips, regex("new|expand|suspense|characterization|voice", ignore_case = TRUE))) %>%
  select(agent_fk) %>% mutate(focus = "yes") %>%
  merge(selected_agents, by = "agent_fk", all.y = TRUE)
selected_agents <- arrange(selected_agents, focus, name)
kable(selected_agents[0:5,] %>% 
  mutate(guidelines = str_c(substr(guidelines, 1, 30), "..."), tips = str_c(substr(tips, 1, 30), "...")), 
  caption = "<i><font color=#000000><b>Table 6.</b> Selected agents listing</font></i>") %>% 
  kable_styling(latex_options = "striped")
Table 6. Selected agents listing
agent_fk focus name agency address ext_addr locality region postal phone web email twitter elec_sub guidelines tips Literary Fiction Speculative Fiction
133 yes Adam Schear DeFiore and Company 47 East 19th Street Third Floor New York NY 10003
  1. 925-7744
http://defliterary.com @adamschear Yes … Dont be afraid to let your voi… yes no
14 yes Amy Berkower Writers House LLC 21 West 26th Street Suite 305 New York NY 10010
  1. 685-2400
http://www.writershouse.com @AmyBerkower Yes … I am currently looking to expa… yes no
16 yes Amy Elizabeth Bishop Dystel, Goderich & Bourret 1 Union Square West Suite 904 New York NY 10003 212-627-9100 10 http://www.dystel.com @amylizbishop Yes For fiction, please send query… Im always looking for work by … yes no
139 yes Anjali Singh Ayesha Pande Literary 132 West 128th Street 18th Floor New York NY 10027 212-283-5825 http://www.pandeliterary.com @agent_anjali Yes Please follow the directions o… I have a particular interest i… yes no
30 yes Farley Chase Chase Literary Agency 11 Broadway Suite 1010 New York NY 10004
  1. 477-5100
http://chaseliterary.com @farleychase Yes Please visit my <a href=’http:… In fiction Im looking for lite… yes no
# Which genres attract the most agents
agent_genres_df %>% group_by(genre_fk) %>% summarize(n = n()) %>%
  merge(genres_df, by.x = 'genre_fk', by.y = 'genre_id') %>%
  ggplot((aes(x = reorder(genre, n), y = n))) +
  geom_bar(stat = 'identity', fill = 'lightblue') + coord_flip() +
  xlab('genre') + ylab('count') +
  ggtitle('Figure 1. Literary agents by genre')

Conclusion


While this wasn’t a dataset I’d intended to perform an analysis on, it did allow for some tidying, and it creates a repeatable process that can be used periodically when the pw.org database changes.

In terms of deliverables, the above process generates a concise listing of agents prioritized by those most likely to accept queries based on how similar they match my criteria.

It was also interesting to note that there is a breadth of agents representing works of literary fiction, while a scant few who deal with speculative fiction. This tells me my chances of success could be limited and that it might be time to start writing pure literary fiction rather than speculating in speculative fiction.

New York Voters


The task is to retrieve New York state voter registration data from https://www.elections.ny.gov/FoilRequestVoterRegDataPrint.html, tidy the data, and analyze the data over time. I limited my analysis to the past four datasets, since those were the ones provided in Excel format. The rest were in PDF, which would require considerable effort to export to CSV.

I downloaded the Excel versions of the four files, exported them to CSV, then did some preliminary tidying using a text editor (e.g. removing blank lines and removing the title sections). These were the datasets I used:

* Nov 2019
* Feb 2020
* Nov 2020
* Feb 2021

Import from Github


# Import the CSVs from Github
csvfile <- getURL("https://raw.githubusercontent.com/mmippolito/cuny/main/data607/project2/county_nov19.csv")
df_nov19 <- read.csv(text = csvfile)
csvfile <- getURL("https://raw.githubusercontent.com/mmippolito/cuny/main/data607/project2/county_feb20.csv")
df_feb20 <- read.csv(text = csvfile)
csvfile <- getURL("https://raw.githubusercontent.com/mmippolito/cuny/main/data607/project2/county_nov20.csv")
df_nov20 <- read.csv(text = csvfile)
csvfile <- getURL("https://raw.githubusercontent.com/mmippolito/cuny/main/data607/project2/county_feb21.csv")
df_feb21 <- read.csv(text = csvfile)

# Display sample data
kable(df_nov19[0:5,], caption = 
  "<i><font color=#000000><b>Table 7.</b> Sample raw data read from CSV</font></i>") %>% 
  kable_styling(latex_options = "striped")
Table 7. Sample raw data read from CSV
REGION COUNTY STATUS DEM REP CON WOR GRE LBT IND SAM OTH BLANK TOTAL
Outside NYC Albany Active 94,240 34,375 3,020 586 517 231 9,339 6 170 42,205 184,689
Outside NYC Albany Inactive 11,861 3,466 304 131 116 53 1,247 0 23 6,021 23,222
Outside NYC Albany Total 106,101 37,841 3,324 717 633 284 10,586 6 193 48,226 207,911
Outside NYC Allegany Active 5,667 12,599 447 136 94 47 1,256 0 13 4,910 25,169
Outside NYC Allegany Inactive 300 403 14 6 7 1 59 0 2 331 1,123

Tidy


Tidy data by gathering keyed off party.

# Gather by party, add month and year fields
df_nov19 <- df_nov19 %>% gather(key = "party", value = "voters", 
  'CON', 'DEM', 'REP', 'WOR', 'GRE', 'LBT', 'IND', 'SAM', 'OTH', 'BLANK', 'TOTAL') %>%
  mutate(year = 2019, month = 11)
df_feb20 <- df_feb20 %>% gather(key = "party", value = "voters", 
  'CON', 'DEM', 'REP', 'WOR', 'GRE', 'LBT', 'IND', 'SAM', 'OTH', 'BLANK', 'TOTAL') %>%
  mutate(year = 2020, month = 2)
df_nov20 <- df_nov20 %>% gather(key = "party", value = "voters", 
  'CON', 'DEM', 'REP', 'WOR', 'GRE', 'LBT', 'IND', 'SAM', 'OTH', 'BLANK', 'TOTAL') %>%
  mutate(year = 2020, month = 11)
df_feb21 <- df_feb21 %>% gather(key = "party", value = "voters", 
  'CON', 'DEM', 'REP', 'WOR', 'OTH', 'BLANK', 'TOTAL') %>%
  mutate(year = 2021, month = 2)

# Cram all tables together into one dataframe, remove 'total' rows
nydf <- df_nov19 %>% union(df_feb20) %>% union(df_nov20) %>% union(df_feb21) %>%
  subset(STATUS != 'Total' & party != 'TOTAL' & REGION != 'Within NYC Total' & 
    REGION != 'Statewide Total' & REGION != 'Outside NYC Grand Tot' & REGION != 'Outside NYC Grand Total')

# Remove commas from numbers, convert to numeric, display table
nydf <- nydf %>% mutate(voters = gsub(',', '', nydf$voters)) %>% mutate(voters = as.numeric(voters))
kable(nydf[0:5,], caption = 
  "<i><font color=#000000><b>Table 8.</b> Tidied dataframe</font></i>") %>% 
  kable_styling(latex_options = "striped")
Table 8. Tidied dataframe
REGION COUNTY STATUS party voters year month
1 Outside NYC Albany Active CON 3020 2019 11
2 Outside NYC Albany Inactive CON 304 2019 11
4 Outside NYC Allegany Active CON 447 2019 11
5 Outside NYC Allegany Inactive CON 14 2019 11
7 Outside NYC Broome Active CON 1595 2019 11

Normalize


Normalize the dataframes into third normal form. Entity relationship diagram:

Image of an entity relationship diagram

# Parties
parties <- data.frame(party_id = c(1:10), short_party = c('CON', 'DEM', 'REP', 'WOR', 'GRE', 'LBT', 'IND', 'SAM', 'OTH', 'BLANK'),
  party_name = c('Constitutaion', 'Democratic', 'Republican', 'Working Families', 'Green', 'Libertarian', 'Independent', 
  'Serve America Movement', 'Other', 'Blank'))

# Counties
counties <- nydf %>% group_by(COUNTY) %>% summarize(n = n()) %>% 
  mutate(county_id = row_number()) %>% select(-n) %>% relocate(county_id, .before = COUNTY) %>%
  rename(county = COUNTY)

# Regions
regions <- nydf %>% group_by(REGION) %>% summarize(n = n()) %>% 
  mutate(region_id = row_number()) %>% select(-n) %>% relocate(region_id, .before = REGION) %>%
  rename(region = REGION)

# Status
statuses <- data.frame(status_id = c(0, 1), status = c("Inactive", "Active"))

# Normalize main df
nydf <- nydf %>% merge(parties, by.x = 'party', by.y = 'short_party') %>%
  select(-party_name, -party) %>% rename(party_fk = party_id) %>%
  merge(counties, by.x = 'COUNTY', by.y = 'county') %>%
  select(-COUNTY) %>% rename(county_fk = county_id) %>%
  merge(regions, by.x = 'REGION', by.y = 'region') %>%
  select(-REGION) %>% rename(region_fk = region_id) %>%
  merge(statuses, by.x = 'STATUS', by.y = 'status') %>%
  select(-STATUS) %>% rename(status_fk = status_id) %>%
  arrange(year, month, region_fk, county_fk, party_fk, status_fk) %>%
  mutate(nydf_id = row_number()) %>% relocate(nydf_id, .before = voters)

# Display table
kable(nydf[0:10,], 
  caption = "<i><font color=#000000><b>Table 9.</b> Normalized data frame</font></i>") %>% 
  kable_styling(latex_options = "striped")
Table 9. Normalized data frame
nydf_id voters year month party_fk county_fk region_fk status_fk
1 304 2019 11 1 1 1 0
2 3020 2019 11 1 1 1 1
3 11861 2019 11 2 1 1 0
4 94240 2019 11 2 1 1 1
5 3466 2019 11 3 1 1 0
6 34375 2019 11 3 1 1 1
7 131 2019 11 4 1 1 0
8 586 2019 11 4 1 1 1
9 116 2019 11 5 1 1 0
10 517 2019 11 5 1 1 1

Analysis


# Graph party registration over time
nydf %>% mutate(period = str_c(year, ' ', ifelse(month == 2, 'Feb', 'Nov'))) %>%
  filter(status_fk == 1) %>%
  group_by(party_fk, period) %>% summarize(voters = sum(voters)) %>%
  merge(parties, by.x = 'party_fk', by.y = 'party_id') %>%
  ggplot() + geom_bar(aes(x = short_party, y = voters, fill = period), stat = 'identity', position = 'dodge') +
  ggtitle("Figure 2. Party registration over time (active registrations only)") + theme_light()

# Party registration percent change between Nov 2019 and Feb 2021
nydf %>% mutate(period = str_c(year, ' ', ifelse(month == 2, 'Feb', 'Nov'))) %>%
  filter(status_fk == 1) %>%
  group_by(party_fk, period) %>% summarize(voters = sum(voters)) %>%
  filter(period == '2019 Nov' | period == '2021 Feb') %>%
  merge(parties, by.x = 'party_fk', by.y = 'party_id') %>%
  filter(short_party != 'OTH') %>%
  spread(key = period, value = voters) %>%
  mutate(`2021 Feb` = ifelse(is.na(`2021 Feb`), 0, `2021 Feb`)) %>%
  mutate(pct_change = 100 * (`2021 Feb` - `2019 Nov`) / `2019 Nov`) %>%
  filter(`2021 Feb` != 0) %>%
  ggplot(aes(x = reorder(short_party, -pct_change), y = pct_change)) + 
  geom_bar(stat = 'identity', fill = 'lightblue') +
  ggtitle("Figure 3. Party registration percent change between Nov 2019 and Feb 2021\n(exluding 'other' party and parties not present in Feb 2021)") + xlab("Party") + theme_light() + geom_text(aes(label = round(pct_change, 1)), vjust = 1.5)

# Voter percent change by count between Nov 2019 and Feb 2021
tmpdf <- nydf %>% mutate(period = str_c(year, ' ', ifelse(month == 2, 'Feb', 'Nov'))) %>%
  filter(status_fk == 1) %>%
  group_by(county_fk, period) %>% summarize(voters = sum(voters)) %>%
  filter(period == '2019 Nov' | period == '2021 Feb') %>%
  merge(counties, by.x = 'county_fk', by.y = 'county_id') %>%
  spread(key = period, value = voters) %>%
  mutate(pct_change = 100 * (`2021 Feb` - `2019 Nov`) / `2019 Nov`) %>%
  arrange(desc(pct_change))

# top 20
tmpdf %>% head(20) %>%
  ggplot(aes(x = reorder(county, pct_change), y = pct_change)) + geom_bar(stat = 'identity', fill = 'lightblue') +
  ggtitle("Figure 3. The 20 counties seeing the most growth in registrations\nbetween Nov 2019 and Feb 2021") +
  coord_flip() + xlab("County") + theme_light() + geom_text(aes(label = round(pct_change, 1)), hjust = 1.2)

# bottom 20
tmpdf %>% tail(20) %>%
  ggplot(aes(x = reorder(county, pct_change), y = pct_change)) + geom_bar(stat = 'identity', fill = 'lightblue') +
  ggtitle("Figure 4. The 20 counties seeing the least growth in registrations\nbetween Nov 2019 and Feb 2021") +
  coord_flip() + xlab("County") + theme_light() + geom_text(aes(label = round(pct_change, 1)), hjust = 1.2)

Conclusions


Based on the above, the following conclusions can be drawn:

* Between Nov 2019 and Feb 2021, the biggest increase was seen in unaffilated voters (assumed by blank registration).
* Orange County saw the biggest increase at 8.8%.
* Livingston County was the only county to see a decline in registrations (-0.3%).

Epidemics


The task is to retrieve world and regional epidemic data from https://en.wikipedia.org/wiki/List_of_epidemics and tidy the data. Analysis includes evaluating:

* which diseases cause the most pandemics
* which diseases have been the most disastrous to humans
* regions of the world most prone to outbreaks

Scrape the data from wikipedia


# Scrape the data from wikipedia
parsed_page <- read_html("https://en.wikipedia.org/wiki/List_of_epidemics", encoding = "UTF-8")

# Parse out the tables
parsed_tables <- html_table(parsed_page, fill = TRUE)
epidemics <- data.frame(parsed_tables[1])
chron <- data.frame(parsed_tables[2])

# Display sample data
kable(epidemics[0:5,], caption = 
  "<i><font color=#000000><b>Table 10.</b> Sample epidemic data</font></i>") %>% 
  kable_styling(latex_options = "striped")
Table 10. Sample epidemic data
Rank Epidemics.pandemics Death.toll Global.population.lost Regional.population.lost Date Location
1 Black Death 75–200 million [Note 1] 30–60% of European population[8] 1346–1353 Europe, Asia, and North Africa
2 Spanish flu 17–100 million 1–5.4%[9][10] – 1918–1920 Worldwide
3 Plague of Justinian 15–100 million [Note 1] 25–60% of European population[11] 541–549 Europe and West Asia
4 HIV/AIDS pandemic 35 million+ (as of 2020) [Note 2] – 1981–present Worldwide
5 Third plague pandemic 12–15 million [Note 2] – 1855–1960 Worldwide
kable(chron[0:5,], caption = 
  "<i><font color=#000000><b>Table 11.</b> Sample chronology data</font></i>") %>% 
  kable_styling(latex_options = "striped")
Table 11. Sample chronology data
Event Date Location Disease Death.toll..estimate. Ref.
1200 BC Babylon influenza epidemic 1200 BC Babylon, or Babirus of the Persians, Central Asia, Mesopotamia and Southern Asia Sanskrit scholars found records of a disease resembling the Flu Unknown [17]
Plague of Athens 429–426 BC Greece, Libya, Egypt, Ethiopia Unknown, possibly typhus, typhoid fever or viral hemorrhagic fever 75,000–100,000 [18][19][20][21]
412 BC epidemic 412 BC Greece (Northern Greece, Roman Republic) Unknown, possibly influenza Unknown [22]
Antonine Plague 165–180 (possibly up to 190) Roman Empire Unknown, possibly smallpox 5–10 million [23][24]
Jian’an Plague 217 Han Dynasty Unknown, possibly typhoid fever or viral hemorrhagic fever Unknown [25][26]
# Save to CSV
write.csv(epidemics, "epidemics.csv")
write.csv(chron, "chron.csv")

Tidy


Tidy data:\
* Remove bracketed notes.
* Split ranges into two fields (upper and lower).
* Remove extraneous words.
* Make numeric fields numeric.

# Add disease column to epidemics table
epidemics <- mutate(epidemics, disease = c("bubonic plague", "influenza", "bubonic plague", "HIV", "bubonic plague", 
  "cocoliztli", "smallpox", "smallpox", "coronavirus", "typhus", "influenza", "influenza", "cocoliztli", 
  "smallpox", "bubonic plague", "bubonic plague", "cholera", "bubonic plague", "influenza"))

# Death toll: remove extraneous stuff, split to lower and upper ranges
epidemics <- epidemics %>% mutate(death_toll = gsub(' million', '', epidemics$`Death.toll`)) %>%
  separate(death_toll, sep = '–', into = c('death_toll_lower', 'death_toll_upper')) %>%
  mutate(death_toll_lower = gsub('^(\\d+)[^\\d]?.*', '\\1', death_toll_lower)) %>%
  mutate(death_toll_upper = ifelse(is.na(death_toll_upper), death_toll_lower, death_toll_upper)) %>%
  select(-`Death.toll`) %>% mutate(death_toll_lower = as.numeric(death_toll_lower)) %>%
  mutate(death_toll_upper = as.numeric(death_toll_upper))

# Global population lost
epidemics <- epidemics %>% mutate(global_pop_lost = gsub('\\[.*\\]', '', `Global.population.lost`)) %>%
  mutate(global_pop_lost = gsub('%', '', global_pop_lost)) %>%
  separate(global_pop_lost, sep = '–', into = c('global_pop_lost_lower', 'global_pop_lost_upper')) %>%
  mutate(global_pop_lost_lower = ifelse(global_pop_lost_lower == '', NA, global_pop_lost_lower)) %>%
  mutate(global_pop_lost_upper = ifelse(is.na(global_pop_lost_upper), global_pop_lost_lower, global_pop_lost_upper)) %>%
  select(-`Global.population.lost`) %>% mutate(global_pop_lost_lower = as.numeric(global_pop_lost_lower)) %>%
  mutate(global_pop_lost_upper = as.numeric(global_pop_lost_upper))
  
# Regional population lost
epidemics <- epidemics %>% mutate(regional_pop_lost = gsub('\\[.*\\]', '', `Regional.population.lost`)) %>%
  mutate(regional_pop_lost = gsub('^–$', '', regional_pop_lost)) %>% 
  separate(regional_pop_lost, sep = " of ", into = c('regional_pop_lost', 'regional_pop')) %>%
  mutate(regional_pop_lost = gsub('%', '', regional_pop_lost)) %>% 
  separate(regional_pop_lost, sep = '–', into = c('regional_pop_lost_lower', 'regional_pop_lost_upper')) %>%
  mutate(regional_pop_lost_lower = ifelse(regional_pop_lost_lower == '', NA, regional_pop_lost_lower)) %>%
  mutate(regional_pop_lost_upper = ifelse(is.na(regional_pop_lost_upper), regional_pop_lost_lower, regional_pop_lost_upper)) %>%
  select(-`Regional.population.lost`) %>% mutate(regional_pop_lost_lower = as.numeric(regional_pop_lost_lower)) %>%
  mutate(regional_pop_lost_upper = as.numeric(regional_pop_lost_upper))

# Date
epidemics <- epidemics %>% mutate(dates = gsub('^(.+?) .+$', '\\1', `Date`)) %>%
  mutate(dates = gsub('present', format(Sys.Date(), "%Y"), dates)) %>%
  separate(dates, sep = '–', into = c('date_from', 'date_to')) %>%
  select (-`Date`) %>%
  mutate(date_from = as.numeric(date_from)) %>% mutate(date_to = as.numeric(date_to))

# Rename columns for consistency
epidemics <- epidemics %>% rename(epidemic = `Epidemics.pandemics`) %>%
  rename(rank = Rank)

Normalize


Normalize the location field.

Image of an entity relationship diagram

# Tidy the location field
epidemics <- epidemics %>% mutate(location = gsub(', and', ',', Location)) %>%
  mutate(location = gsub(' and ', ', ', location)) %>%
  separate_rows(location, sep = ', ', convert=TRUE) %>%
  select(-Location)

# Create location_list child table
location_list <- epidemics %>% group_by(location) %>% summarize(n = n()) %>%
  mutate(location_id = row_number()) %>% select(-n) %>%
  relocate(location_id, .before = location)

# Create epidemic_list child table
epidemic_list <- epidemics %>% group_by(epidemic) %>% summarize(n = n()) %>%
  mutate(epidemic_id = row_number()) %>% select(-n) %>%
  relocate(epidemic_id, .before = epidemic)

# Create diseases child table
disease_list <- epidemics %>% group_by(disease) %>% summarize(n = n()) %>%
  mutate(disease_id = row_number()) %>% select(-n) %>%
  relocate(disease_id, .before = disease)

# Normalize main df
epidemics <- epidemics %>% merge(location_list, by = 'location') %>%
  select(-location) %>% rename(location_fk = location_id) %>%
  merge(epidemic_list, by = 'epidemic') %>%
  select(-epidemic) %>% rename(epidemic_fk = epidemic_id) %>%
  merge(disease_list, by = 'disease') %>%
  select(-disease) %>% rename(disease_fk = disease_id) %>%
  arrange(date_from) %>%
  mutate(epidemics_id = row_number()) %>% relocate(epidemics_id, .before = rank)

# Display table
kable(epidemics[0:10,], 
  caption = "<i><font color=#000000><b>Table 12.</b> Normalized data frame</font></i>") %>% 
  kable_styling(latex_options = "striped")
Table 12. Normalized data frame
epidemics_id rank death_toll_lower death_toll_upper global_pop_lost_lower global_pop_lost_upper regional_pop_lost_lower regional_pop_lost_upper regional_pop date_from date_to location_fk epidemic_fk disease_fk
1 7 5 10.0 3.0 6.0 25 33 Roman population 165 180 8 7 7
2 3 15 100.0 NA NA 25 60 European population 541 549 2 16 1
3 3 15 100.0 NA NA 25 60 European population 541 549 11 16 1
4 14 2 2.0 1.0 1.0 33 33 Japanese population 735 737 4 6 7
5 1 75 200.0 NA NA 30 60 European population 1346 1353 1 9 1
6 1 75 200.0 NA NA 30 60 European population 1346 1353 2 9 1
7 1 75 200.0 NA NA 30 60 European population 1346 1353 6 9 1
8 8 5 8.0 NA NA 23 37 Mexican population 1519 1520 5 1 7
9 6 5 15.0 NA NA 27 80 Mexican population 1545 1548 5 10 3
10 13 2 2.5 0.4 0.5 50 50 Mexican population 1576 1580 5 11 3

Analysis


# Graph count of epidemics by disease
epidemics %>% group_by(disease_fk) %>% summarize(n = n()) %>%
  merge(disease_list, by.x = 'disease_fk', by.y = 'disease_id') %>%
  ggplot(aes(x = reorder(disease, -n), y = n)) + geom_bar(stat = 'identity', fill = 'lightblue') +
  theme_light() + geom_text(aes(label = n), vjust = -0.3) + xlab('disease') +
  ggtitle("Figure 5. Epidemics by disease")

# Graph death toll by disease
epidemics %>% group_by(disease_fk) %>% summarize(toll_hi = sum(death_toll_upper)) %>%
  merge(disease_list, by.x = 'disease_fk', by.y = 'disease_id') %>%
  ggplot(aes(x = reorder(disease, -toll_hi), y = toll_hi)) + geom_bar(stat = 'identity', fill = 'lightblue') +
  theme_light() + geom_text(aes(label = toll_hi), vjust = -0.3) + xlab('disease') +
  ggtitle("Figure 6. Deadliest epidemics by disease (by upper limit of death toll)") +
  ylab("Death toll in millions")

# Graph epidemics by region
epidemics %>% group_by(location_fk) %>% summarize(n = n()) %>%
  merge(location_list, by.x = 'location_fk', by.y = 'location_id') %>%
  ggplot(aes(x = reorder(location, n), y = n)) + geom_bar(stat = 'identity', fill = 'lightblue') +
  theme_light() + geom_text(aes(label = n), hjust = -1.2) + xlab('region') +
  ggtitle("Figure 7. Epidemics by region") + coord_flip() + ylab("# of epidemics")

Conclusions


Based on the above, the following conclusions can be drawn:

* Bubonic plague has been by far the most prevalent source of epidemic throughtout history.
* Likewise, bubonic plague has been the most deadly in terms of lives lost.
* Of the areas survey, most epidemics have been worldwide phenomena. Mexico has seen the highest number of regional-specific epidemics.