Discussion posts chosen for this project:
1. My own: literary agents from pw.org
2. Brad’s: New York voter registration from elections.ny.gov
3. Daniel’s: epidemics listing from wikipedia.org
In each of these cases, the same overall steps are performed: scrape or import the raw data, tidy it, normalize it into third normal form, then analyze and visualize it.
(Figure: overview flowchart)
The task is to scrape the pw.org website for a listing of literary agents and store the results in a searchable format.
* Scrape, parse, and store the agent listings.
* Filter out the agents who are most likely to accept my book.
* Display a list of candidate agents, along with submission guidelines.
I used a Python script (data607-1.py) to scrape the data from the pw.org website. The script does the following (a rough R sketch of the same flow appears after the list):
1. Connects to pw.org and saves the html output to a file.
2. Parses the file using a series of regular expressions.
3. Saves the main agent listing in a csv file.
4. Using the URLs scraped in step 2 above, connects to each agency’s web page on pw.org.
5. Scrapes the agency’s web page and saves the html output to a file.
6. Parses the files using regular expressions.
7. Saves the agency info in a csv file.
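The Python source isn't reproduced here; as a minimal sketch, the fetch-save-parse flow of steps 1-3 might look like the following in R. The URL path and the link-matching pattern are illustrative guesses, not the ones used in data607-1.py.

# A minimal R sketch of the scraper's fetch-save-parse flow (steps 1-3).
# The URL and regex below are illustrative, not those used in data607-1.py.
library(httr)

resp <- GET("https://www.pw.org/literary_agents")        # step 1: connect...
html <- content(resp, as = "text", encoding = "UTF-8")
writeLines(html, "pw_main.html")                         # ...and save the HTML

# step 2: parse agent profile links out of the HTML with a regular expression
links <- unique(unlist(
  regmatches(html, gregexpr("/literary_agents/[a-z_]+", html))
))

# step 3: save the skeleton of the main agent listing to CSV
write.csv(data.frame(link = links), "main-listing.csv", row.names = FALSE)

Steps 4-7 repeat the same pattern once per agency page.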
# Import the main pw.org listing CSV from Github
csvfile <- getURL("https://raw.githubusercontent.com/mmippolito/cuny/main/data607/project2/data607-main.csv")
pwmain <- read.csv(text = csvfile)
kable(pwmain[1:5,] %>%
mutate(genres = str_c(substr(genres, 1, 30), "..."), authors = str_c(substr(authors, 1, 30), "...")), caption =
"<i><font color=#000000><b>Table 1.</b> pw.org agent listing (main table) read from CSV</font></i>") %>%
kable_styling(latex_options = "striped")
| agent_id | name | link | agency | genres | authors |
|---|---|---|---|---|---|
| 1 | Laurie Abkemeier | /literary_agents/laurie_abkemeier | DeFiore and Company | BIPOC Voices|Journalism/Invest… | Jennifer Keishin Armstrong|Joh… |
| 2 | Miriam Altshuler | /literary_agents/miriam_altshuler | DeFiore and Company | Autobiography/Memoir|Commercia… | Harriet Brown|Andrew Carroll|R… |
| 3 | Betsy Amster | /literary_agents/betsy_amster | Betsy Amster Literary Enterprises | Fiction|Graphic/Illustrated|Li… | Dr. Elaine Aron|Sandi Ault|Kim… |
| 4 | Claire Anderson-Wheeler | /literary_agents/claire_andersonwheeler | Regal Hoffmann & Associates | Autobiography/Memoir|Fiction|L… | Dan Cluchey|Patrick Dacey|Eldo… |
| 5 | Nicole Aragi | /literary_agents/nicole_aragi | Aragi Inc. | Feminist|Graphic/Illustrated|H… | Jonathan Safran Foer|Colson Wh… |
# Import the agency listing CSV from Github
csvfile <- getURL("https://raw.githubusercontent.com/mmippolito/cuny/main/data607/project2/data607-agency.csv")
pwagencies <- read.csv(text = csvfile)
kable(pwagencies[1:5,] %>%
mutate(guidelines = str_c(substr(guidelines, 1, 30), "..."), tips = str_c(substr(tips, 1, 30), "...")), caption =
"<i><font color=#000000><b>Table 2.</b> pw.org agent listing (agencies table) read from CSV</font></i>") %>%
kable_styling(latex_options = "striped")
| agent_fk | address | ext_addr | locality | region | postal | phone | web | email | twitter | elec_sub | guidelines | tips |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 47 East 19th Street | Third Floor | New York | NY | 10003 | | http://www.defliterary.com | laurie@defliterary.com | @LaurieAbkemeier | Yes | … | … |
| 2 | 47 East 19th Street | Third Floor | New York | NY | 10003 | | http://www.defliterary.com | miriam@defliterary.com | @MiriamAltshuler | Yes | … | Authors should visit our websi… |
| 3 | 607 Foothill Boulevard #1061 | Third Floor | La Canada Flintridge | CA | 91012 | | http://www.amsterlit.com | b.amster.assistant@gmail.com | | Yes | … | Please consult my website for … |
| 4 | 242 West 38th Street | Suite 901 | New York | NY | 10001 | | http://rhaliterary.com | claire@rhaliterary.com | @claireawheeler | Yes | Please send an overview/synops… | A thorough synopsis is always … |
| 5 | 143 West 27th Street | #4F | New York | NY | 10001 | | http://www.aragi.net | queries@aragi.net | @AragiAuthors | Yes | … | … |
Tidying tasks:
* Separate genres by the pipe symbol and put into a separate dataframe.
* Separate authors by the pipe symbol and put into a separate dataframe.
# separate genres
agent_genres_df <- pwmain %>% select(agent_id, genres) %>% rename(agent_fk = agent_id)
agent_genres_df <- separate_rows(agent_genres_df, genres, sep = "\\|", convert = TRUE)
agent_genres_df <- mutate(agent_genres_df, agent_genre_id = row_number()) %>%
relocate(agent_genre_id, .before = agent_fk)
# separate clients (authors)
agent_authors_df <- pwmain %>% select(agent_id, authors) %>% rename(agent_fk = agent_id)
agent_authors_df <- separate_rows(agent_authors_df, authors, sep = "\\|", convert = TRUE)
agent_authors_df <- mutate(agent_authors_df, agent_author_id = row_number()) %>%
relocate(agent_author_id, .before = agent_fk)
Normalize the dataframes into third normal form. Entity relationship diagram, in outline:
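* pwmain (agent_id, name, link, agency): parent agent table
* pwagencies (pwagency_id, agent_fk, address, ...): per-agent contact details
* genres_df (genre_id, genre) and authors_df (author_id, author): lookup tables
* agent_genres_df (agent_genre_id, agent_fk, genre_fk) and agent_authors_df (agent_author_id, agent_fk, author_fk): bridge tables resolving the many-to-many relationships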
# genres table
genres_df <- agent_genres_df %>% filter(genres != "") %>% group_by(genres) %>% summarize(genre = unique(genres)) %>%
select(-genres) %>% mutate(genre_id = row_number()) %>% relocate(genre_id, .before = "genre")
# authors table
authors_df <- agent_authors_df %>% filter(str_length(authors) > 2) %>% group_by(authors) %>% summarize(author = unique(authors)) %>%
select(-authors) %>% mutate(author_id = row_number()) %>% relocate(author_id, .before = "author")
# normalize agent_genres_df
agent_genres_df <- agent_genres_df %>% merge(genres_df, by.x = "genres", by.y = "genre") %>%
select(-genres) %>% rename(genre_fk = genre_id)
# Normalize agent_authors_df
agent_authors_df <- agent_authors_df %>% merge(authors_df, by.x = "authors", by.y = "author") %>%
select(-authors) %>% rename(author_fk = author_id)
# Remove genres and authors from pwmain since we normalized these fields
pwmain <- select(pwmain, -genres, -authors)
# Add primary key to pwagencies
pwagencies <- pwagencies %>% mutate(pwagency_id = row_number()) %>% relocate(pwagency_id, .before = agent_fk)
# Display tables
kable(genres_df[1:5,],
caption = "<i><font color=#000000><b>Table 3.</b> genres_df child table</font></i>") %>%
kable_styling(latex_options = "striped")
| genre_id | genre |
|---|---|
| 1 | Autobiography/Memoir |
| 2 | BIPOC Voices |
| 3 | Commercial Fiction |
| 4 | Cross-genre |
| 5 | Experimental |
kable(authors_df[1:5,],
caption = "<i><font color=#000000><b>Table 4.</b> authors_df child table</font></i>") %>%
kable_styling(latex_options = "striped")
| author_id | author |
|---|---|
| 1 | A. Brad Schwartz |
| 2 | A.E. Hotchner |
| 3 | Abbigail Rosewood |
| 4 | Abby Wambach |
| 5 | Abigail Thomas |
kable(agent_genres_df[1:5,],
caption = "<i><font color=#000000><b>Table 5.</b> agent_genres_df child table</font></i>") %>%
kable_styling(latex_options = "striped")
| agent_genre_id | agent_fk | genre_fk |
|---|---|---|
| 583 | 109 | 1 |
| 147 | 30 | 1 |
| 404 | 77 | 1 |
| 254 | 47 | 1 |
| 753 | 140 | 1 |
kable(agent_authors_df[1:5,],
caption = "<i><font color=#000000><b>Table 6.</b> agent_authors_df child table</font></i>") %>%
kable_styling(latex_options = "striped")
| agent_author_id | agent_fk | author_fk |
|---|---|---|
| 555 | 77 | 1 |
| 130 | 20 | 2 |
| 1016 | 149 | 3 |
| 834 | 122 | 4 |
| 277 | 40 | 5 |
# Generate agent listing by selected genres
selected_agents <- filter(genres_df, genre == "Speculative Fiction" | genre == "Literary Fiction") %>%
merge(agent_genres_df, by.x = "genre_id", by.y = "genre_fk") %>%
merge(pwmain, by.x = "agent_fk", by.y = "agent_id") %>%
merge(pwagencies, by = "agent_fk") %>%
spread(key = genre, value = genre) %>%
mutate(`Literary Fiction` = ifelse(is.na(`Literary Fiction`), "no", "yes")) %>%
mutate(`Speculative Fiction` = ifelse(is.na(`Speculative Fiction`), "no", "yes")) %>%
select(-genre_id, -agent_genre_id, -link, -pwagency_id)
# Search agent tips for keywords
selected_agents <- selected_agents %>%
filter(str_detect(tips, regex("new|expand|suspense|characterization|voice", ignore_case = TRUE))) %>%
select(agent_fk) %>% mutate(focus = "yes") %>%
merge(selected_agents, by = "agent_fk", all.y = TRUE)
selected_agents <- arrange(selected_agents, focus, name)
kable(selected_agents[1:5,] %>%
mutate(guidelines = str_c(substr(guidelines, 1, 30), "..."), tips = str_c(substr(tips, 1, 30), "...")),
caption = "<i><font color=#000000><b>Table 7.</b> Selected agents listing</font></i>") %>%
kable_styling(latex_options = "striped")
| agent_fk | focus | name | agency | address | ext_addr | locality | region | postal | phone | web | email | twitter | elec_sub | guidelines | tips | Literary Fiction | Speculative Fiction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 133 | yes | Adam Schear | DeFiore and Company | 47 East 19th Street | Third Floor | New York | NY | 10003 | | http://defliterary.com | adam@defliterary.com | @adamschear | Yes | … | Don't be afraid to let your voi… | yes | no |
| 14 | yes | Amy Berkower | Writers House LLC | 21 West 26th Street | Suite 305 | New York | NY | 10010 | | http://www.writershouse.com | aberkower@writershouse.com | @AmyBerkower | Yes | … | I am currently looking to expa… | yes | no |
| 16 | yes | Amy Elizabeth Bishop | Dystel, Goderich & Bourret | 1 Union Square West | Suite 904 | New York | NY | 10003 | 212-627-9100 | http://www.dystel.com | abishop@dystel.com | @amylizbishop | Yes | For fiction, please send query… | I'm always looking for work by … | yes | no |
| 139 | yes | Anjali Singh | Ayesha Pande Literary | 132 West 128th Street | 18th Floor | New York | NY | 10027 | 212-283-5825 | http://www.pandeliterary.com | | @agent_anjali | Yes | Please follow the directions o… | I have a particular interest i… | yes | no |
| 30 | yes | Farley Chase | Chase Literary Agency | 11 Broadway | Suite 1010 | New York | NY | 10004 | | http://chaseliterary.com | farley@chaseliterary.com | @farleychase | Yes | Please visit my <a href=’http:… | In fiction I'm looking for lite… | yes | no |
# Which genres attract the most agents
agent_genres_df %>% group_by(genre_fk) %>% summarize(n = n()) %>%
merge(genres_df, by.x = 'genre_fk', by.y = 'genre_id') %>%
ggplot(aes(x = reorder(genre, n), y = n)) +
geom_bar(stat = 'identity', fill = 'lightblue') + coord_flip() +
xlab('genre') + ylab('count') +
ggtitle('Figure 1. Literary agents by genre')
While this wasn’t a dataset I’d originally intended to analyze, it did allow for some tidying, and the result is a repeatable process that can be rerun whenever the pw.org database changes.
In terms of deliverables, the process above generates a concise listing of agents, prioritized by those most likely to accept queries based on how closely they match my criteria.
It was also interesting to note that a breadth of agents represent works of literary fiction, while scant few deal in speculative fiction. This tells me my chances of success could be limited, and that it might be time to start writing pure literary fiction rather than speculating in speculative fiction.
The task is to retrieve New York state voter registration data from https://www.elections.ny.gov/FoilRequestVoterRegDataPrint.html, tidy the data, and analyze the data over time. I limited my analysis to the past four datasets, since those were the ones provided in Excel format. The rest were in PDF, which would require considerable effort to export to CSV.
I downloaded the Excel versions of the four files, exported them to CSV, then did some preliminary tidying in a text editor (e.g., removing blank lines and the title sections). These were the datasets I used (a sketch of scripting the conversion follows the list):
* Nov 2019
* Feb 2020
* Nov 2020
* Feb 2021
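The Excel-to-CSV conversion was done by hand, but it could be scripted for repeatability. A minimal sketch, assuming the readxl package, hypothetical local file names, and a guessed size for the title block:

# Sketch only: convert the four Excel files to CSV. The file names are
# hypothetical and 'skip = 4' for the title block is a guess.
library(readxl)
xlsx_files <- c("county_nov19.xlsx", "county_feb20.xlsx",
                "county_nov20.xlsx", "county_feb21.xlsx")
for (f in xlsx_files) {
  df <- read_excel(f, skip = 4)          # drop the title section
  df <- df[rowSums(!is.na(df)) > 0, ]    # drop all-blank rows
  write.csv(df, sub("\\.xlsx$", ".csv", f), row.names = FALSE)
}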
# Import the CSVs from Github
csvfile <- getURL("https://raw.githubusercontent.com/mmippolito/cuny/main/data607/project2/county_nov19.csv")
df_nov19 <- read.csv(text = csvfile)
csvfile <- getURL("https://raw.githubusercontent.com/mmippolito/cuny/main/data607/project2/county_feb20.csv")
df_feb20 <- read.csv(text = csvfile)
csvfile <- getURL("https://raw.githubusercontent.com/mmippolito/cuny/main/data607/project2/county_nov20.csv")
df_nov20 <- read.csv(text = csvfile)
csvfile <- getURL("https://raw.githubusercontent.com/mmippolito/cuny/main/data607/project2/county_feb21.csv")
df_feb21 <- read.csv(text = csvfile)
# Display sample data
kable(df_nov19[1:5,], caption =
"<i><font color=#000000><b>Table 8.</b> Sample raw data read from CSV</font></i>") %>%
kable_styling(latex_options = "striped")
| REGION | COUNTY | STATUS | DEM | REP | CON | WOR | GRE | LBT | IND | SAM | OTH | BLANK | TOTAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Outside NYC | Albany | Active | 94,240 | 34,375 | 3,020 | 586 | 517 | 231 | 9,339 | 6 | 170 | 42,205 | 184,689 |
| Outside NYC | Albany | Inactive | 11,861 | 3,466 | 304 | 131 | 116 | 53 | 1,247 | 0 | 23 | 6,021 | 23,222 |
| Outside NYC | Albany | Total | 106,101 | 37,841 | 3,324 | 717 | 633 | 284 | 10,586 | 6 | 193 | 48,226 | 207,911 |
| Outside NYC | Allegany | Active | 5,667 | 12,599 | 447 | 136 | 94 | 47 | 1,256 | 0 | 13 | 4,910 | 25,169 |
| Outside NYC | Allegany | Inactive | 300 | 403 | 14 | 6 | 7 | 1 | 59 | 0 | 2 | 331 | 1,123 |
Tidy the data by gathering the wide party columns into long (party, voters) pairs, keyed on party.
# Gather by party, add month and year fields
df_nov19 <- df_nov19 %>% gather(key = "party", value = "voters",
'CON', 'DEM', 'REP', 'WOR', 'GRE', 'LBT', 'IND', 'SAM', 'OTH', 'BLANK', 'TOTAL') %>%
mutate(year = 2019, month = 11)
df_feb20 <- df_feb20 %>% gather(key = "party", value = "voters",
'CON', 'DEM', 'REP', 'WOR', 'GRE', 'LBT', 'IND', 'SAM', 'OTH', 'BLANK', 'TOTAL') %>%
mutate(year = 2020, month = 2)
df_nov20 <- df_nov20 %>% gather(key = "party", value = "voters",
'CON', 'DEM', 'REP', 'WOR', 'GRE', 'LBT', 'IND', 'SAM', 'OTH', 'BLANK', 'TOTAL') %>%
mutate(year = 2020, month = 11)
df_feb21 <- df_feb21 %>% gather(key = "party", value = "voters",
'CON', 'DEM', 'REP', 'WOR', 'OTH', 'BLANK', 'TOTAL') %>%
mutate(year = 2021, month = 2)
# Cram all tables together into one dataframe, remove 'total' rows
nydf <- df_nov19 %>% union(df_feb20) %>% union(df_nov20) %>% union(df_feb21) %>%
subset(STATUS != 'Total' & party != 'TOTAL' & REGION != 'Within NYC Total' &
REGION != 'Statewide Total' & REGION != 'Outside NYC Grand Tot' & REGION != 'Outside NYC Grand Total')
# Remove commas from numbers, convert to numeric, display table
nydf <- nydf %>% mutate(voters = as.numeric(gsub(',', '', voters)))
kable(nydf[1:5,], caption =
"<i><font color=#000000><b>Table 9.</b> Tidied dataframe</font></i>") %>%
kable_styling(latex_options = "striped")
| | REGION | COUNTY | STATUS | party | voters | year | month |
|---|---|---|---|---|---|---|---|
| 1 | Outside NYC | Albany | Active | CON | 3020 | 2019 | 11 |
| 2 | Outside NYC | Albany | Inactive | CON | 304 | 2019 | 11 |
| 4 | Outside NYC | Allegany | Active | CON | 447 | 2019 | 11 |
| 5 | Outside NYC | Allegany | Inactive | CON | 14 | 2019 | 11 |
| 7 | Outside NYC | Broome | Active | CON | 1595 | 2019 | 11 |
Normalize the dataframes into third normal form. Entity relationship diagram, in outline:
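* parties (party_id, short_party, party_name), counties (county_id, county), regions (region_id, region), and statuses (status_id, status): lookup tables
* nydf (nydf_id, voters, year, month, party_fk, county_fk, region_fk, status_fk): the fact table holding registration counts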
# Parties
parties <- data.frame(party_id = c(1:10), short_party = c('CON', 'DEM', 'REP', 'WOR', 'GRE', 'LBT', 'IND', 'SAM', 'OTH', 'BLANK'),
party_name = c('Conservative', 'Democratic', 'Republican', 'Working Families', 'Green', 'Libertarian', 'Independence',
'Serve America Movement', 'Other', 'Blank'))
# Counties
counties <- nydf %>% group_by(COUNTY) %>% summarize(n = n()) %>%
mutate(county_id = row_number()) %>% select(-n) %>% relocate(county_id, .before = COUNTY) %>%
rename(county = COUNTY)
# Regions
regions <- nydf %>% group_by(REGION) %>% summarize(n = n()) %>%
mutate(region_id = row_number()) %>% select(-n) %>% relocate(region_id, .before = REGION) %>%
rename(region = REGION)
# Status
statuses <- data.frame(status_id = c(0, 1), status = c("Inactive", "Active"))
# Normalize main df
nydf <- nydf %>% merge(parties, by.x = 'party', by.y = 'short_party') %>%
select(-party_name, -party) %>% rename(party_fk = party_id) %>%
merge(counties, by.x = 'COUNTY', by.y = 'county') %>%
select(-COUNTY) %>% rename(county_fk = county_id) %>%
merge(regions, by.x = 'REGION', by.y = 'region') %>%
select(-REGION) %>% rename(region_fk = region_id) %>%
merge(statuses, by.x = 'STATUS', by.y = 'status') %>%
select(-STATUS) %>% rename(status_fk = status_id) %>%
arrange(year, month, region_fk, county_fk, party_fk, status_fk) %>%
mutate(nydf_id = row_number()) %>% relocate(nydf_id, .before = voters)
# Display table
kable(nydf[1:10,],
caption = "<i><font color=#000000><b>Table 10.</b> Normalized data frame</font></i>") %>%
kable_styling(latex_options = "striped")
| nydf_id | voters | year | month | party_fk | county_fk | region_fk | status_fk |
|---|---|---|---|---|---|---|---|
| 1 | 304 | 2019 | 11 | 1 | 1 | 1 | 0 |
| 2 | 3020 | 2019 | 11 | 1 | 1 | 1 | 1 |
| 3 | 11861 | 2019 | 11 | 2 | 1 | 1 | 0 |
| 4 | 94240 | 2019 | 11 | 2 | 1 | 1 | 1 |
| 5 | 3466 | 2019 | 11 | 3 | 1 | 1 | 0 |
| 6 | 34375 | 2019 | 11 | 3 | 1 | 1 | 1 |
| 7 | 131 | 2019 | 11 | 4 | 1 | 1 | 0 |
| 8 | 586 | 2019 | 11 | 4 | 1 | 1 | 1 |
| 9 | 116 | 2019 | 11 | 5 | 1 | 1 | 0 |
| 10 | 517 | 2019 | 11 | 5 | 1 | 1 | 1 |
# Graph party registration over time
nydf %>% mutate(period = str_c(year, ' ', ifelse(month == 2, 'Feb', 'Nov'))) %>%
filter(status_fk == 1) %>%
group_by(party_fk, period) %>% summarize(voters = sum(voters)) %>%
merge(parties, by.x = 'party_fk', by.y = 'party_id') %>%
ggplot() + geom_bar(aes(x = short_party, y = voters, fill = period), stat = 'identity', position = 'dodge') +
ggtitle("Figure 2. Party registration over time (active registrations only)") + theme_light()
# Party registration percent change between Nov 2019 and Feb 2021
nydf %>% mutate(period = str_c(year, ' ', ifelse(month == 2, 'Feb', 'Nov'))) %>%
filter(status_fk == 1) %>%
group_by(party_fk, period) %>% summarize(voters = sum(voters)) %>%
filter(period == '2019 Nov' | period == '2021 Feb') %>%
merge(parties, by.x = 'party_fk', by.y = 'party_id') %>%
filter(short_party != 'OTH') %>%
spread(key = period, value = voters) %>%
mutate(`2021 Feb` = ifelse(is.na(`2021 Feb`), 0, `2021 Feb`)) %>%
mutate(pct_change = 100 * (`2021 Feb` - `2019 Nov`) / `2019 Nov`) %>%
filter(`2021 Feb` != 0) %>%
ggplot(aes(x = reorder(short_party, -pct_change), y = pct_change)) +
geom_bar(stat = 'identity', fill = 'lightblue') +
ggtitle("Figure 3. Party registration percent change between Nov 2019 and Feb 2021\n(exluding 'other' party and parties not present in Feb 2021)") + xlab("Party") + theme_light() + geom_text(aes(label = round(pct_change, 1)), vjust = 1.5)
# Voter percent change by county between Nov 2019 and Feb 2021
tmpdf <- nydf %>% mutate(period = str_c(year, ' ', ifelse(month == 2, 'Feb', 'Nov'))) %>%
filter(status_fk == 1) %>%
group_by(county_fk, period) %>% summarize(voters = sum(voters)) %>%
filter(period == '2019 Nov' | period == '2021 Feb') %>%
merge(counties, by.x = 'county_fk', by.y = 'county_id') %>%
spread(key = period, value = voters) %>%
mutate(pct_change = 100 * (`2021 Feb` - `2019 Nov`) / `2019 Nov`) %>%
arrange(desc(pct_change))
# top 20
tmpdf %>% head(20) %>%
ggplot(aes(x = reorder(county, pct_change), y = pct_change)) + geom_bar(stat = 'identity', fill = 'lightblue') +
ggtitle("Figure 3. The 20 counties seeing the most growth in registrations\nbetween Nov 2019 and Feb 2021") +
coord_flip() + xlab("County") + theme_light() + geom_text(aes(label = round(pct_change, 1)), hjust = 1.2)
# bottom 20
tmpdf %>% tail(20) %>%
ggplot(aes(x = reorder(county, pct_change), y = pct_change)) + geom_bar(stat = 'identity', fill = 'lightblue') +
ggtitle("Figure 4. The 20 counties seeing the least growth in registrations\nbetween Nov 2019 and Feb 2021") +
coord_flip() + xlab("County") + theme_light() + geom_text(aes(label = round(pct_change, 1)), hjust = 1.2)
Based on the above, the following conclusions can be drawn:
* Between Nov 2019 and Feb 2021, the biggest increase was seen in unaffiliated voters (inferred from blank party registrations).
* Orange County saw the biggest increase at 8.8%.
* Livingston County was the only county to see a decline in registrations (-0.3%).
The task is to retrieve world and regional epidemic data from https://en.wikipedia.org/wiki/List_of_epidemics and tidy the data. Analysis includes evaluating:
* which diseases cause the most pandemics
* which diseases have been the most disastrous to humans
* regions of the world most prone to outbreaks
# Scrape the data from wikipedia
parsed_page <- read_html("https://en.wikipedia.org/wiki/List_of_epidemics", encoding = "UTF-8")
# Parse out the tables
parsed_tables <- html_table(parsed_page, fill = TRUE)
epidemics <- data.frame(parsed_tables[1])
chron <- data.frame(parsed_tables[2])
# Display sample data
kable(epidemics[1:5,], caption =
"<i><font color=#000000><b>Table 11.</b> Sample epidemic data</font></i>") %>%
kable_styling(latex_options = "striped")
| Rank | Epidemics.pandemics | Death.toll | Global.population.lost | Regional.population.lost | Date | Location |
|---|---|---|---|---|---|---|
| 1 | Black Death | 75–200 million | [Note 1] | 30–60% of European population[8] | 1346–1353 | Europe, Asia, and North Africa |
| 2 | Spanish flu | 17–100 million | 1–5.4%[9][10] | – | 1918–1920 | Worldwide |
| 3 | Plague of Justinian | 15–100 million | [Note 1] | 25–60% of European population[11] | 541–549 | Europe and West Asia |
| 4 | HIV/AIDS pandemic | 35 million+ (as of 2020) | [Note 2] | – | 1981–present | Worldwide |
| 5 | Third plague pandemic | 12–15 million | [Note 2] | – | 1855–1960 | Worldwide |
kable(chron[1:5,], caption =
"<i><font color=#000000><b>Table 12.</b> Sample chronology data</font></i>") %>%
kable_styling(latex_options = "striped")
| Event | Date | Location | Disease | Death.toll..estimate. | Ref. |
|---|---|---|---|---|---|
| 1200 BC Babylon influenza epidemic | 1200 BC | Babylon, or Babirus of the Persians, Central Asia, Mesopotamia and Southern Asia | Sanskrit scholars found records of a disease resembling the Flu | Unknown | [17] |
| Plague of Athens | 429–426 BC | Greece, Libya, Egypt, Ethiopia | Unknown, possibly typhus, typhoid fever or viral hemorrhagic fever | 75,000–100,000 | [18][19][20][21] |
| 412 BC epidemic | 412 BC | Greece (Northern Greece, Roman Republic) | Unknown, possibly influenza | Unknown | [22] |
| Antonine Plague | 165–180 (possibly up to 190) | Roman Empire | Unknown, possibly smallpox | 5–10 million | [23][24] |
| Jian’an Plague | 217 | Han Dynasty | Unknown, possibly typhoid fever or viral hemorrhagic fever | Unknown | [25][26] |
# Save to CSV
write.csv(epidemics, "epidemics.csv")
write.csv(chron, "chron.csv")
Tidying tasks:
* Remove bracketed notes.
* Split ranges into two fields (upper and lower); see the toy example after this list.
* Remove extraneous words.
* Make numeric fields numeric.
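As a toy illustration of the range-splitting step on a single hypothetical value (the real pipeline below also backfills missing upper bounds):

# Toy example of splitting a "lower–upper" range into two fields
library(dplyr)
library(tidyr)
toy <- data.frame(death_toll = "75–200 million")
toy %>%
  mutate(death_toll = gsub(" million", "", death_toll)) %>%
  separate(death_toll, sep = "–", into = c("lower", "upper"))
#>   lower upper
#> 1    75   200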
# Add disease column to epidemics table
epidemics <- mutate(epidemics, disease = c("bubonic plague", "influenza", "bubonic plague", "HIV", "bubonic plague",
"cocoliztli", "smallpox", "smallpox", "coronavirus", "typhus", "influenza", "influenza", "cocoliztli",
"smallpox", "bubonic plague", "bubonic plague", "cholera", "bubonic plague", "influenza"))
# Death toll: remove extraneous stuff, split to lower and upper ranges
epidemics <- epidemics %>% mutate(death_toll = gsub(' million', '', epidemics$`Death.toll`)) %>%
separate(death_toll, sep = '–', into = c('death_toll_lower', 'death_toll_upper')) %>%
mutate(death_toll_lower = gsub('^(\\d+)[^\\d]?.*', '\\1', death_toll_lower)) %>%
mutate(death_toll_upper = ifelse(is.na(death_toll_upper), death_toll_lower, death_toll_upper)) %>%
select(-`Death.toll`) %>% mutate(death_toll_lower = as.numeric(death_toll_lower)) %>%
mutate(death_toll_upper = as.numeric(death_toll_upper))
# Global population lost
epidemics <- epidemics %>% mutate(global_pop_lost = gsub('\\[.*\\]', '', `Global.population.lost`)) %>%
mutate(global_pop_lost = gsub('%', '', global_pop_lost)) %>%
separate(global_pop_lost, sep = '–', into = c('global_pop_lost_lower', 'global_pop_lost_upper')) %>%
mutate(global_pop_lost_lower = ifelse(global_pop_lost_lower == '', NA, global_pop_lost_lower)) %>%
mutate(global_pop_lost_upper = ifelse(is.na(global_pop_lost_upper), global_pop_lost_lower, global_pop_lost_upper)) %>%
select(-`Global.population.lost`) %>% mutate(global_pop_lost_lower = as.numeric(global_pop_lost_lower)) %>%
mutate(global_pop_lost_upper = as.numeric(global_pop_lost_upper))
# Regional population lost
epidemics <- epidemics %>% mutate(regional_pop_lost = gsub('\\[.*\\]', '', `Regional.population.lost`)) %>%
mutate(regional_pop_lost = gsub('^–$', '', regional_pop_lost)) %>%
separate(regional_pop_lost, sep = " of ", into = c('regional_pop_lost', 'regional_pop')) %>%
mutate(regional_pop_lost = gsub('%', '', regional_pop_lost)) %>%
separate(regional_pop_lost, sep = '–', into = c('regional_pop_lost_lower', 'regional_pop_lost_upper')) %>%
mutate(regional_pop_lost_lower = ifelse(regional_pop_lost_lower == '', NA, regional_pop_lost_lower)) %>%
mutate(regional_pop_lost_upper = ifelse(is.na(regional_pop_lost_upper), regional_pop_lost_lower, regional_pop_lost_upper)) %>%
select(-`Regional.population.lost`) %>% mutate(regional_pop_lost_lower = as.numeric(regional_pop_lost_lower)) %>%
mutate(regional_pop_lost_upper = as.numeric(regional_pop_lost_upper))
# Date
epidemics <- epidemics %>% mutate(dates = gsub('^(.+?) .+$', '\\1', `Date`)) %>%
mutate(dates = gsub('present', format(Sys.Date(), "%Y"), dates)) %>%
separate(dates, sep = '–', into = c('date_from', 'date_to')) %>%
select (-`Date`) %>%
mutate(date_from = as.numeric(date_from)) %>% mutate(date_to = as.numeric(date_to))
# Rename columns for consistency
epidemics <- epidemics %>% rename(epidemic = `Epidemics.pandemics`) %>%
rename(rank = Rank)
Normalize the location field.
# Tidy the location field
epidemics <- epidemics %>% mutate(location = gsub(', and', ',', Location)) %>%
mutate(location = gsub(' and ', ', ', location)) %>%
separate_rows(location, sep = ', ', convert=TRUE) %>%
select(-Location)
# Create location_list child table
location_list <- epidemics %>% group_by(location) %>% summarize(n = n()) %>%
mutate(location_id = row_number()) %>% select(-n) %>%
relocate(location_id, .before = location)
# Create epidemic_list child table
epidemic_list <- epidemics %>% group_by(epidemic) %>% summarize(n = n()) %>%
mutate(epidemic_id = row_number()) %>% select(-n) %>%
relocate(epidemic_id, .before = epidemic)
# Create diseases child table
disease_list <- epidemics %>% group_by(disease) %>% summarize(n = n()) %>%
mutate(disease_id = row_number()) %>% select(-n) %>%
relocate(disease_id, .before = disease)
# Normalize main df
epidemics <- epidemics %>% merge(location_list, by = 'location') %>%
select(-location) %>% rename(location_fk = location_id) %>%
merge(epidemic_list, by = 'epidemic') %>%
select(-epidemic) %>% rename(epidemic_fk = epidemic_id) %>%
merge(disease_list, by = 'disease') %>%
select(-disease) %>% rename(disease_fk = disease_id) %>%
arrange(date_from) %>%
mutate(epidemics_id = row_number()) %>% relocate(epidemics_id, .before = rank)
# Display table
kable(epidemics[1:10,],
caption = "<i><font color=#000000><b>Table 13.</b> Normalized data frame</font></i>") %>%
kable_styling(latex_options = "striped")
| epidemics_id | rank | death_toll_lower | death_toll_upper | global_pop_lost_lower | global_pop_lost_upper | regional_pop_lost_lower | regional_pop_lost_upper | regional_pop | date_from | date_to | location_fk | epidemic_fk | disease_fk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 7 | 5 | 10.0 | 3.0 | 6.0 | 25 | 33 | Roman population | 165 | 180 | 8 | 7 | 7 |
| 2 | 3 | 15 | 100.0 | NA | NA | 25 | 60 | European population | 541 | 549 | 2 | 16 | 1 |
| 3 | 3 | 15 | 100.0 | NA | NA | 25 | 60 | European population | 541 | 549 | 11 | 16 | 1 |
| 4 | 14 | 2 | 2.0 | 1.0 | 1.0 | 33 | 33 | Japanese population | 735 | 737 | 4 | 6 | 7 |
| 5 | 1 | 75 | 200.0 | NA | NA | 30 | 60 | European population | 1346 | 1353 | 1 | 9 | 1 |
| 6 | 1 | 75 | 200.0 | NA | NA | 30 | 60 | European population | 1346 | 1353 | 2 | 9 | 1 |
| 7 | 1 | 75 | 200.0 | NA | NA | 30 | 60 | European population | 1346 | 1353 | 6 | 9 | 1 |
| 8 | 8 | 5 | 8.0 | NA | NA | 23 | 37 | Mexican population | 1519 | 1520 | 5 | 1 | 7 |
| 9 | 6 | 5 | 15.0 | NA | NA | 27 | 80 | Mexican population | 1545 | 1548 | 5 | 10 | 3 |
| 10 | 13 | 2 | 2.5 | 0.4 | 0.5 | 50 | 50 | Mexican population | 1576 | 1580 | 5 | 11 | 3 |
# Graph count of epidemics by disease
epidemics %>% group_by(disease_fk) %>% summarize(n = n()) %>%
merge(disease_list, by.x = 'disease_fk', by.y = 'disease_id') %>%
ggplot(aes(x = reorder(disease, -n), y = n)) + geom_bar(stat = 'identity', fill = 'lightblue') +
theme_light() + geom_text(aes(label = n), vjust = -0.3) + xlab('disease') +
ggtitle("Figure 5. Epidemics by disease")
# Graph death toll by disease
epidemics %>% group_by(disease_fk) %>% summarize(toll_hi = sum(death_toll_upper)) %>%
merge(disease_list, by.x = 'disease_fk', by.y = 'disease_id') %>%
ggplot(aes(x = reorder(disease, -toll_hi), y = toll_hi)) + geom_bar(stat = 'identity', fill = 'lightblue') +
theme_light() + geom_text(aes(label = toll_hi), vjust = -0.3) + xlab('disease') +
ggtitle("Figure 6. Deadliest epidemics by disease (by upper limit of death toll)") +
ylab("Death toll in millions")
# Graph epidemics by region
epidemics %>% group_by(location_fk) %>% summarize(n = n()) %>%
merge(location_list, by.x = 'location_fk', by.y = 'location_id') %>%
ggplot(aes(x = reorder(location, n), y = n)) + geom_bar(stat = 'identity', fill = 'lightblue') +
theme_light() + geom_text(aes(label = n), hjust = -1.2) + xlab('region') +
ggtitle("Figure 7. Epidemics by region") + coord_flip() + ylab("# of epidemics")
Based on the above, the following conclusions can be drawn:
* Bubonic plague has been by far the most prevalent source of epidemics throughout history.
* Likewise, bubonic plague has been the deadliest in terms of lives lost.
* Of the areas surveyed, most epidemics have been worldwide phenomena; Mexico has seen the highest number of region-specific epidemics.