Copyright JISC/British Library
I’ve written elsewhere about how, until the end of the seventeenth century, news was written in paragraphs. In the sample page above you can see that the paragraphs start with a place of dispatch, and some of them contain news from other places. If you look at the paragraph ‘From Hamburg, April 12 stilo novo’ on the second page, you’ll see that it contains news ‘from Denmark’: news from Denmark was being sent to Hamburg. I spent much of my PhD research looking for these connections, because I think they tell you a lot about information flow.
Here I describe a bipartite network, in which places are connected if they appear in the same paragraph. To build it we need a) a list of places to work with and b) some way of filtering the text so that we only pick up the headers and any place preceded by ‘from’.
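These are the steps: convert the newsbook .xml files to plain text, run the text through the Edinburgh geoparser to get a list of places, look for those places in the original paragraphs (as headers or after ‘from’), and build a network from the places that share a paragraph.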
I’ve done all this in a markdown document which contains both Python and R, but to be completely reproducible it requires downloading a piece of software called the Edinburgh geoparser and running a shell script. I’ve included this as text instructions rather than code.
Download the Lancaster newsbook corpus from here: https://ota.ox.ac.uk/id/2531
It’s a set of newsbooks transcribed using the TEI standard.
First we need to turn all the .xml files into plain text (python):
We’ll use the BeautifulSoup library and Python to process the .xml:
This lets us run Python in R markdown:
library(reticulate)
## Warning: package 'reticulate' was built under R version 3.5.2
library(knitr)
This makes sure that we run python3 and not 2 (reticulate’s use_python() is one way to point R at a particular interpreter).
Use beautifulSoup to process the .xml into plain text:
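from bs4 import BeautifulSoup
import glob

all_files = glob.glob("/Data/1654_newsbooks/*.xml")

for n, file in enumerate(all_files):
    # Read the newsbook and strip the markup; the parser is named explicitly
    # so the result does not depend on which parsers happen to be installed
    with open(file, encoding='latin1') as infile:
        soup = BeautifulSoup(infile.read(), 'html.parser')
    # One plain-text file per newsbook; 'w' so that re-running does not append duplicates
    with open("file{}.txt".format(n), "w") as f:
        print(soup.text, file=f)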
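To see what this conversion does on a small scale, here is a minimal sketch that strips the markup from a made-up TEI-like fragment. It swaps in the standard library's xml.etree.ElementTree for BeautifulSoup so that it runs without extra packages. The fragment and the function name `xml_to_text` are invented for illustration, and real newsbook files may well need the more forgiving parser used above.

```python
import xml.etree.ElementTree as ET

# A made-up fragment in the spirit of the TEI files (not taken from the corpus)
SAMPLE = """<text>
  <head>From Hamburg, April 12 stilo novo</head>
  <p>The letters from Denmark say that <hi>the King</hi> is well.</p>
</text>"""


def xml_to_text(xml_string):
    """Return the text content of an XML string with all tags removed."""
    root = ET.fromstring(xml_string)
    # itertext() yields every text fragment in document order
    pieces = (piece.strip() for piece in root.itertext())
    return " ".join(piece for piece in pieces if piece)


print(xml_to_text(SAMPLE))
```

Only the words survive, which is all the geoparser needs. The paragraph structure is lost at this point, which is why the original .xml files are used again later.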
Download the Edinburgh Geoparser: https://www.ltg.ed.ac.uk/software/geoparser/
Put all the finished text files from the code above in the ‘in’ folder.
To bulk geoparse the text files, follow the instructions here: https://programminghistorian.org/en/lessons/geoparsing-text-with-edinburgh#geo-parsing-multiple-text-files
Download the shell script: http://groups.inf.ed.ac.uk/geoparser/scripts/run-multiple-files.sh
Open the script and change settings to the following - this uses the geonames dataset, and it gives a bounding box around Europe, within which the geoparser will try to find the places first.
#cat $i | ./run -t plain -g geonames -o $outputdirname $prefix -lb 27.3 68.5 52.2 34.3 10
Put the shell script in the scripts folder, open a terminal window, browse to that folder, and run ./run-multiple-files.sh -i ../in -o ../out
This gives us a lot of files. One ends in .gaz.xml, and just lists each of the places in very simple XML. We can go back to the files at a later point and get coordinates, but just the place names will do for now.
We want a list of places to then feed back into the Lancaster .xml, which has the paragraph information.
This code below creates the list of places, including coordinates, from the Edinburgh geoparser:
from bs4 import BeautifulSoup
import glob

read_gaz_files = glob.glob("/Users/coding/Documents/r-projects/lcc_workshop/geoparser-v1.1/out/file*.gaz.xml")

with open('geoparsedfiles.tsv', 'w') as outputfile:
    for file in read_gaz_files:
        with open(file) as gaz:
            soup = BeautifulSoup(gaz.read(), 'html.parser')
        # Keep only the top-ranked candidate for each place name
        for place in soup.find_all('place'):
            if place.get('clusteriness_rank') == "1":
                print(place['name'] + '\t' + place['lat'] + '\t' + place['long'], file=outputfile)
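As a small illustration of the filtering step, the sketch below keeps only the top-ranked candidate for each place in a tiny gazetteer snippet. The snippet is hypothetical: it models only the attributes the code above reads (name, lat, long, clusteriness_rank) and the values are invented. It uses the standard library's xml.etree.ElementTree so it runs on its own, and `top_ranked_places` is just an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical gazetteer snippet: only the attributes read by the code above
# are modelled, and the values are invented for illustration.
SAMPLE_GAZ = """<placenames>
  <placename name="Paris">
    <place name="Paris" lat="48.85" long="2.35" clusteriness_rank="1"/>
    <place name="Paris" lat="33.66" long="-95.55" clusteriness_rank="2"/>
  </placename>
  <placename name="Leith">
    <place name="Leith" lat="55.98" long="-3.17" clusteriness_rank="1"/>
  </placename>
</placenames>"""


def top_ranked_places(gaz_xml):
    """Return (name, lat, long) for every candidate ranked first."""
    root = ET.fromstring(gaz_xml)
    return [
        (place.get("name"), place.get("lat"), place.get("long"))
        for place in root.iter("place")
        if place.get("clusteriness_rank") == "1"
    ]


for row in top_ranked_places(SAMPLE_GAZ):
    print("\t".join(row))
```

The second Paris candidate is dropped because it is ranked second, so each place name ends up with a single set of coordinates.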
This next bit is in R, because I haven’t got around to learning proper data wrangling in python yet.
Load some R libraries we’ll use:
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.1 ✔ dplyr 0.8.0.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Load the file in R and change the column names. Ok this bit might be a bit pointless.
geoparsedfiles <- read.delim("~/Documents/GitHub/lancaster_files/edinburgh-parsed-places/geoparsedfiles.tsv", header=FALSE, encoding = 'Latin1')
colnames(geoparsedfiles) = c('place', 'lat', 'lng')
Write it back to a .csv for python (sorry)
geoparsedfiles %>% write.csv("places_foreign.csv")
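Using the list of places from the Edinburgh geoparser, we go through all the original Lancaster newsbook files, look for those places in each paragraph, and give each paragraph a unique number.
The code uses the BeautifulSoup library to parse the XML.
Basically, it marks a place as a header if it is the first or second word of a paragraph, and records any place that directly follows the word ‘from’: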
from bs4 import BeautifulSoup
import glob
import os
from csv import DictReader

# Make a lower-cased set of every place found by the Edinburgh geoparser
with open("places_foreign.csv") as f:
    cityList = {row["place"].lower() for row in DictReader(f)}

index_int = 0  # running paragraph number across all newsbooks
all_files = glob.glob("/Users/coding/Documents/GitHub/lancaster_files/Data/1654_newsbooks/*.xml")

with open('paragraphs_foreign.tsv', 'w') as log:
    for file in all_files:
        # Newsbook title: the file name without its folder or '.xml'
        title = os.path.splitext(os.path.basename(file))[0]
        with open(file, encoding='latin1') as infile:
            soup = BeautifulSoup(infile.read(), 'html.parser')
        for link in soup.find_all('p'):
            index_int += 1
            lower_list = [x.lower() for x in link.text.split(' ')]
            # Header: the first or second word of the paragraph is a known place
            if lower_list[0] in cityList:
                print(lower_list[0] + '\t' + str(index_int) + '\t' + title + '\t' + 'x' + '\t' + 'header', file=log)
            elif len(lower_list) > 1 and lower_list[1] in cityList:
                print(lower_list[1] + '\t' + str(index_int) + '\t' + title + '\t' + 'x' + '\t' + 'header', file=log)
            # Inner place: any known place directly preceded by 'from'
            for index, item in enumerate(lower_list[:-1]):
                if item == 'from' and lower_list[index + 1] in cityList:
                    print(lower_list[index + 1] + '\t' + str(index_int) + '\t' + title + '\t' + str(index + 1), file=log)
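To make the matching rules concrete, here is a minimal sketch of the same logic applied to a single made-up paragraph. The function name `tag_places`, the place list and the sample sentence are all invented for illustration. It labels a place 'header' when it is the first or second word and 'inner' when it directly follows 'from'.

```python
def tag_places(paragraph, places):
    """Return (place, kind) pairs found in one paragraph.

    kind is 'header' when the place is the first or second word,
    and 'inner' when it directly follows the word 'from'.
    """
    words = [w.lower() for w in paragraph.split(' ')]
    found = []
    if words[0] in places:
        found.append((words[0], 'header'))
    elif len(words) > 1 and words[1] in places:
        found.append((words[1], 'header'))
    for i, word in enumerate(words[:-1]):
        if word == 'from' and words[i + 1] in places:
            found.append((words[i + 1], 'inner'))
    return found


places = {'hamburg', 'denmark', 'paris'}  # invented place list
sample = "From Hamburg April 12 the letters from Denmark bring little news"
print(tag_places(sample, places))
# [('hamburg', 'header'), ('hamburg', 'inner'), ('denmark', 'inner')]
```

Two things are worth noticing. Hamburg is found twice, once as a header and once after 'from', so duplicates have to be dropped afterwards. And because the text is split on spaces only, a word with punctuation attached, such as 'Hamburg,', would not match the place list.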
Load libraries for making the bipartite network:
library(tidygraph)
library(igraph)
library(ggraph)
library(ggthemes)
Load the file we’ve made with Python, get rid of the unnecessary fourth column, and have a look:
paragraphs_foreign <- read.delim(
"paragraphs_foreign.tsv",
header=FALSE,
stringsAsFactors=FALSE) %>% distinct(V1,V2, V3, .keep_all = TRUE) %>% select(V1, V2, V3, V5)
print(head(paragraphs_foreign))
## V1 V2 V3 V5
## 1 holland 22 LIntell73 header
## 2 gravesend 52 FScout168
## 3 flanders 53 FScout168 header
## 4 paris 54 FScout168 header
## 5 whitechapel 58 FScout168 header
## 6 scotland 61 FScout168
Change the column names and keep only distinct rows. This is needed because a header such as ‘From Hamburg’ is also picked up by the second part of the code above, which looks for places after ‘from’. Then add ‘inner’ to the blank fields, so that we know whether each location was a header (a place of dispatch) or an inner (relay) place. Very useful.
colnames(paragraphs_foreign) = c('place', 'paragraph', 'title', 'type')
paragraphs_foreign = paragraphs_foreign %>% distinct(place, paragraph, title, .keep_all = TRUE)
paragraphs_foreign = paragraphs_foreign %>% mutate(type = replace(type, type == '', 'inner'))
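For anyone following along in Python rather than R, the same clean-up can be sketched with plain lists. The rows below are invented (only the title is borrowed from the output above), and `clean_rows` is an illustrative name. Like dplyr's distinct(), it keeps the first row for each place, paragraph and title, then fills blank types with 'inner'.

```python
# Invented rows in the shape (place, paragraph, title, type)
rows = [
    ('hamburg', 7, 'FScout168', 'header'),
    ('hamburg', 7, 'FScout168', ''),   # the header picked up again after 'from'
    ('denmark', 7, 'FScout168', ''),
]


def clean_rows(rows):
    """Drop repeated (place, paragraph, title) rows and label blanks as 'inner'."""
    seen = set()
    cleaned = []
    for place, paragraph, title, kind in rows:
        key = (place, paragraph, title)
        if key in seen:
            continue  # keep only the first occurrence
        seen.add(key)
        cleaned.append((place, paragraph, title, kind or 'inner'))
    return cleaned


for row in clean_rows(rows):
    print(row)
```

This only works because the header row is written before the 'from' row for the same place, so the first occurrence is the one that carries the 'header' label.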
First we can map the places by volume of mentions, colour by type - whether they were a relay or first point of dispatch.
geoparsedfiles$place = tolower(geoparsedfiles$place)
paragraphs_foreign %>% left_join(geoparsedfiles %>% arrange(desc(lng)) %>%distinct(place, .keep_all = TRUE))
## Joining, by = "place"
Though it’s good at finding locations, the coordinates from the Edinburgh software are pretty terrible. We can do better with an off-the-shelf geocoder (here Bing, through the geopy library).
from functools import partial

import pandas as pd
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Bing
from geopy.point import Point
from tqdm.notebook import tqdm_notebook

# Count how often each place occurs, so that every name is geocoded only once
df = pd.read_csv("places_foreign.csv")
df = df['place'].value_counts().rename_axis('name').reset_index(name='amount')

geolocator = Bing(api_key="insert-your-bing-api-key-here")
# Wait at least 0.1 seconds between requests
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=.1)

tqdm_notebook.pandas(desc="Geocoding locations:")
df['location'] = df['name'].progress_apply(partial(geocode, user_location=Point(latitude=53.5, longitude=2.4), culture='EN'))
# loc.point holds (latitude, longitude, altitude); failed lookups become None
df['point'] = df['location'].apply(lambda loc: tuple(loc.point) if loc else None)
df.to_csv('out.csv')
Load the geocoded points and do some processing:
points = read_csv('out.csv')
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_double(),
## name = col_character(),
## amount = col_double(),
## location = col_character(),
## point = col_character()
## )
colnames(points)[2] = 'place'
points$place = tolower(points$place)
points = points %>% mutate(point = str_replace(point,'\\(','')) %>% separate(col = point, into = c('lat','lng'), sep = ',')
## Warning: Expected 2 pieces. Additional pieces discarded in 655 rows [1, 2,
## 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
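The warning above most likely appears because a geopy point has three parts (latitude, longitude and altitude), so each stored value has one more piece than the two columns asked for. As a minimal sketch, this is how the same string could be parsed in Python. The function name `parse_point` and the sample coordinates are invented for illustration.

```python
import ast


def parse_point(point_string):
    """Turn a stored point such as '(53.48, -2.24, 0.0)' into (lat, lng)."""
    if not point_string:
        return None  # the place could not be geocoded
    values = ast.literal_eval(point_string)  # safely parse the tuple literal
    lat, lng = values[0], values[1]          # a third value (altitude) is dropped
    return lat, lng


print(parse_point('(53.48, -2.24, 0.0)'))
# (53.48, -2.24)
```

Dropping the third value is harmless here, since only latitude and longitude are needed for the map.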
paragraphs_foreign %>% group_by(place, type) %>% tally() %>% left_join(points)
## Joining, by = "place"
library(mapdata)
Draw a blank map:
map = map_data('world')
ggplot() +
geom_polygon(data = map, aes(x = long, y = lat, group = group), fill = 'gray50') +
coord_fixed(1.3, xlim = c(-12, 25), ylim = c(35, 60))
ggplot() +
geom_polygon(data = map, aes(x = long, y = lat, group = group), fill = 'gray60') +
coord_fixed(1.3, xlim = c(-12, 25), ylim = c(35, 60)) + geom_point(data = paragraphs_foreign %>% group_by(place, type) %>%
tally() %>%
left_join(points), aes(x = as.numeric(lng), y = as.numeric(lat), size = n, fill = type), alpha = .6, color = 'black', pch = 21) + theme_map() +
theme(legend.position = 'bottom')
## Joining, by = "place"
## Warning: Removed 1 rows containing missing values (geom_point).
Now on to the network.
Make a unique code by concatenating the paragraph and title strings:
paragraphs_foreign$code = paste0(paragraphs_foreign$paragraph, paragraphs_foreign$title)
Now we have a dataframe with geographic terms, each with a unique paragraph code, and information on whether they were in the header. Basically I’ve just invalidated the first six months of my PhD research.
Make our final file, which is just the placename and the unique paragraph code. Filter out ‘York’ and ‘Conde’ because, although these are places, in this set of newsbooks they are used much more often to refer to people and throw off the data:
networkdf = paragraphs_foreign %>%
select(place, code) %>%
filter(!place %in% c('york', 'conde')) # place names were lower-cased in the Python step
head(networkdf)
Time to make a bipartite network as described here, using igraph: https://rpubs.com/pjmurphy/317838
A bipartite network has two kinds of node, here places and paragraphs, and edges only run between the two kinds: a place is linked to every paragraph it appears in. Projecting that network onto the places connects two cities whenever they share a paragraph.
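The idea can be sketched in a few lines of plain Python, without igraph. The edge list below is invented, in the same shape as networkdf (a place and a paragraph code), and `project_places` is an illustrative name. It groups places by paragraph and counts every pair of places that share one.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Invented edge list in the shape of networkdf: (place, paragraph code)
edges = [
    ('paris', '54FScout168'),
    ('flanders', '54FScout168'),
    ('scotland', '61FScout168'),
    ('paris', '61FScout168'),
    ('holland', '22LIntell73'),
]


def project_places(edges):
    """Count, for every pair of places, the paragraphs they share."""
    places_by_paragraph = defaultdict(set)
    for place, code in edges:
        places_by_paragraph[code].add(place)
    shared = Counter()
    for places in places_by_paragraph.values():
        # sorted() gives each pair a single, stable orientation
        for a, b in combinations(sorted(places), 2):
            shared[(a, b)] += 1
    return shared


print(dict(project_places(edges)))
# {('flanders', 'paris'): 1, ('paris', 'scotland'): 1}
```

Holland appears alone in its paragraph, so it gets no connections at all. In the real data, places like this end up as isolated nodes.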
First make a regular igraph network:
g = graph.data.frame(networkdf, directed=FALSE)
Next we make the bipartite matrix, which has a row for each place and a column for each paragraph:
V(g)$type <- bipartite_mapping(g)$type
bipartite_matrix <- as_incidence_matrix(g)
title_matrix_prod <- t(bipartite_matrix) %*% bipartite_matrix
Multiplying the bipartite matrix by its transpose creates a matrix with each place in each row and each column. The numbers represent the times two places were mentioned in the same paragraph. We can see, for example, that Paris and Flanders were mentioned 4 times together, but Paris and Scotland only once. When we go to draw the network, Paris will be placed closer to Flanders than to Scotland.
place_matrix_prod <- bipartite_matrix %*% t(bipartite_matrix)
diag(place_matrix_prod) <- 0
place_matrix_prod[order(place_matrix_prod[,1], decreasing = TRUE),][1:10,1:10]
## holland gravesend flanders paris whitechapel scotland brest
## scotland 6 0 1 1 0 0 0
## france 3 0 2 0 0 2 0
## leith 2 0 0 0 0 0 0
## portsmouth 2 0 0 0 0 0 0
## inverness 1 0 0 0 0 0 0
## zealand 1 0 0 0 0 0 0
## holland 0 0 0 0 0 6 0
## gravesend 0 0 0 0 0 0 0
## flanders 0 0 0 4 0 1 0
## paris 0 0 4 0 0 1 0
## newcastle uppsala southwark
## scotland 0 0 0
## france 0 0 1
## leith 1 0 0
## portsmouth 0 0 0
## inverness 0 0 0
## zealand 0 0 0
## holland 0 0 0
## gravesend 0 0 0
## flanders 0 0 0
## paris 0 0 0
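The matrix step can also be worked through by hand on a toy example. The sketch below multiplies a small place-by-paragraph matrix of 0s and 1s by its transpose and sets the diagonal to zero, in plain Python. The matrix values and the function name `place_overlap` are invented. It assumes places are the rows; with places as the columns, the transpose would go on the other side.

```python
def place_overlap(incidence):
    """Multiply a place-by-paragraph 0/1 matrix by its transpose, zeroing the diagonal."""
    n = len(incidence)
    overlap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # Dot product of two rows = number of paragraphs both places appear in
                overlap[i][j] = sum(a * b for a, b in zip(incidence[i], incidence[j]))
    return overlap


places = ['paris', 'flanders', 'scotland']
# Rows are places, columns are four invented paragraphs
incidence = [
    [1, 1, 1, 0],  # paris
    [1, 1, 0, 0],  # flanders
    [0, 0, 1, 1],  # scotland
]

for place, row in zip(places, place_overlap(incidence)):
    print(place, row)
# paris [0, 2, 1]
# flanders [2, 0, 0]
# scotland [1, 0, 0]
```

Each cell is the number of paragraphs two places have in common. The diagonal is zeroed because a place always shares every one of its paragraphs with itself.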
This turns the matrix into a network graph object:
places_overlap <- graph_from_adjacency_matrix(place_matrix_prod,
mode = "undirected",
weighted = TRUE)
This turns the network graph object into a tidygraph object, which I find easier to work with.
tg = as_tbl_graph(places_overlap)
tg
## # A tbl_graph: 194 nodes and 113 edges
## #
## # An undirected simple graph with 120 components
## #
## # Node Data: 194 x 1 (active)
## name
## <chr>
## 1 holland
## 2 gravesend
## 3 flanders
## 4 paris
## 5 whitechapel
## 6 scotland
## # … with 188 more rows
## #
## # Edge Data: 113 x 3
## from to weight
## <int> <int> <dbl>
## 1 1 6 6
## 2 1 18 3
## 3 1 22 2
## # … with 110 more rows
Draw the network diagram:
Unsurprisingly, places associated with each other are placed close together. Stockholm, Gothenburg, and Frankfurt form a separate group, for example. Points are sized by their overall connections: Paris and London have the most, and are at the centre.
tg %>%
activate(nodes) %>%
mutate(degree = centrality_degree()) %>%
filter(degree>0) %>%
ggraph(layout = 'nicely') +
geom_edge_link(alpha = .1, edge_width = .1) +
geom_node_point(alpha = .5,
pch = 21,
fill = 'lightblue',
color = 'black',
aes(size = degree)) +
geom_node_text(alpha = .8,
aes(size = degree,
label =ifelse(degree > 0,
as.character(name),
NA_character_)),
repel = FALSE,
show.legend = FALSE) +
theme_map() +
scale_size_area()
I think this has actually done an amazing job - it’s really similar to the network I made with manual data. Important places are in the centre, and some connecting cities (Madrid, Genoa, Venice, for example) are really obvious in the network.
Tidygraph makes it easy to calculate network metrics. Here I’ve plotted the total connections (degree) of each place, counting only pairs that share more than one paragraph. Here’s the top 20:
tg %>%
activate(edges) %>%
filter(weight > 1) %>%
activate(nodes) %>%
mutate(degree = centrality_degree()) %>%
filter(degree>1) %>%
as.data.frame() %>%
arrange(desc(degree)) %>% head(20) %>%
ggplot() +
geom_bar(aes(x = reorder(name, degree), y = degree), stat = 'identity', fill = 'lightblue', color = 'black', alpha = .8) + coord_flip() + theme_minimal() + ggtitle("1654 newsbook locations, ranked by connections (degree):") + theme(axis.title.y = element_blank())
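As a rough sketch of what the degree calculation is doing, here is the same idea in plain Python: drop the edges with a weight of 1, then count how many remaining edges touch each place. The four edges and their weights are copied from the co-occurrence matrix printed earlier. The function name `degree_after_filter` and its `threshold` parameter are invented for illustration.

```python
from collections import Counter

# (place, place, number of shared paragraphs), taken from the matrix printed earlier
weighted_edges = [
    ('holland', 'scotland', 6),
    ('flanders', 'paris', 4),
    ('holland', 'france', 3),
    ('paris', 'scotland', 1),
]


def degree_after_filter(weighted_edges, threshold=1):
    """Degree of each node, counting only edges whose weight exceeds threshold."""
    degree = Counter()
    for a, b, weight in weighted_edges:
        if weight > threshold:
            degree[a] += 1
            degree[b] += 1
    return degree


for place, d in degree_after_filter(weighted_edges).most_common():
    print(place, d)
```

Paris and Scotland share only one paragraph, so that edge is ignored and both places end up with a degree of 1 rather than 2.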
Here’s a plot of another measure of centrality, betweenness, which measures how often a node lies on the shortest path between other pairs of nodes.
tg %>%
activate(edges) %>%
filter(weight > 1) %>%
activate(nodes) %>%
mutate(between = centrality_betweenness()) %>%
as.data.frame() %>%
arrange(desc(between)) %>% head(20) %>%
ggplot() +
geom_bar(aes(x = reorder(name, between), y = between), stat = 'identity', fill = 'lightblue', color = 'black', alpha = .8) + coord_flip() + theme_minimal() + ggtitle("1654 newsbook locations, ranked by betweenness:") + theme(axis.title.y = element_blank())
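To show what betweenness is counting, here is a minimal pure-Python sketch of Brandes' algorithm for an unweighted, undirected graph. It is only an illustration under those assumptions: it ignores edge weights, which igraph may take into account, and the toy network and the function name `betweenness` are invented.

```python
from collections import deque


def betweenness(adjacency):
    """Unweighted betweenness centrality (Brandes' algorithm), undirected graph."""
    score = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search recording shortest-path counts and predecessors
        order = []
        preds = {node: [] for node in adjacency}
        paths = {node: 0 for node in adjacency}
        dist = {node: -1 for node in adjacency}
        paths[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, sharing out the path dependencies
        delta = {node: 0.0 for node in adjacency}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every undirected pair has been counted once from each end
    return {node: value / 2 for node, value in score.items()}


# Invented toy network: Paris is the only link between two other places
toy = {
    'paris': ['flanders', 'scotland'],
    'flanders': ['paris'],
    'scotland': ['paris'],
}
print(betweenness(toy))
# {'paris': 1.0, 'flanders': 0.0, 'scotland': 0.0}
```

Paris scores 1 because the only route between Flanders and Scotland passes through it, while the two end points lie on no one else's path. This is what makes relay cities stand out in the plot.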