Overall idea is using matches in the User_MatchList table of the DNAGedcom database to create a heatmap of those matches as well as other visualizations.
This provides a way of using the heatmap approach (amd more) with a
variety of lists of matches. Some examples.
- ICWs for a single person
- Matches for descendants of one or more ancestors
- … and all of their ICW matches
- Match lists exported from Gephi (e.g. all matches in a Gephi
cluster)
The goal is to facilitate seeing relationships within these groups to
help working out their relationships and trees.
One example is using the tree sizes in the key to facilitate finding the
good trees within a family group.
v2 of this file adds the following capabilities.
- Specify profile as row number in Ancestry_Profiles table.
- Network graphs.
I originally intended to add cross highlighting of selections in the multiple views, but have been unable to make that work.
This file provides multiple useful visualizations for looking at the ICWs for a list of matches. They have different strengths.
The Heatmap with Jaccard Similarity is useful for being able to see how all of the matches fit together and examine clusters. The Jaccard Similarity is useful for attempting to identify Triangulation Groups (TGs) when the segment chromosome location information is not available.
The Phylogenetic Dendrogram is useful for dividing the matches into distinct groups. But note that these divisions are not always intuitive. For example, I see 2Cs of mine appearing in separate subgroups of 3Cs (and more distant) because of the different specific segments they have.
The Graph Network view is useful for providing another view of how all of the matches fit together and form clusters.
Two other visualizations which are helpful but cannot be included in this file are:
Charting Companion DNA Matrix provides a way to view (in a family tree format) all ICWs for descendants of a single ancestor. I find it a helpful adjunct for the visualizations here. In particular, I find the visualizations in this file more useful for discovering how matches fit together in the family tree (and placing them) and CC DNA Matrix more useful for understanding the genealogical relationship between the matches and validating the correctness of their placement in the tree.
Gephi provides a much more powerful
environment for looking at the network graph visualization for matches.
In particular, for large collections of matches.
One way of using this file is to select a subset of matches in Gephi
then export that as a CSV to populate User_MatchList then run this file
to gain multiple views of that subset of matches.
The only variable which MUST be specified is the location of the DNAGedcom database. Everything needed will be read from the tables there. The primary interface is specifying the matches to be examined in the User_MatchList table which is simply a list of guids in a column named Match_Guid.
#DNAGedcom.db.location <- "F:/Documents/Genealogy/DNA/August 2024 DNA Matches/DNAGedcom4_20240919/DNAGedcom_20240827.db"
DNAGedcom.db.location <- "DNAGedcom_20240827.db"
This is a general purpose file, but allow specifying a more descriptive title for the plots in each run.
#plot.title <- "ICWs for Matches in User_MatchList Table" # Generic title for what this file does
#plot.title <- "ThruLines Descendants of Mathias Boll and His Ancestors Plus Their ICWs"
plot.title <- "Descendants of Mathias Boll and His Parents Plus Their ICWs"
Specify the profile as a row number in the Ancestry_Profiles table. Default to the first.
profilerow <- 1
For public sharing information displayed is limited.
is.public <- TRUE
Query MatchList, ICWs, and Match Information from DNAGedcom.
Note possible issues with duplicates in User_MatchList.
It is an open question for me whether we should define the match list
as being a unique list or whether it is best to allow duplicates and let
the user deal with it with SELECT DISTINCT.
Currently doing the latter.
con <- dbConnect(RSQLite::SQLite(), DNAGedcom.db.location)
#dbListTables(con)
# first get the profile
profileq <- dbSendQuery(con, paste("SELECT guid, name FROM Ancestry_Profiles
WHERE Id = ", profilerow, ";", sep=""))
profile.info <- dbFetch(profileq)
profile.id <- profile.info$guid
profile.name <- profile.info$name
dbClearResult(profileq)
matchq <- dbSendQuery(con, "SELECT DISTINCT * FROM User_MatchList;") # Remove duplicates
matches <- dbFetch(matchq)
nrow(matches)
## [1] 136
dbClearResult(matchq)
# Extract ICWs for match list
icwq <- dbSendQuery(con, "SELECT matchid, icwid, sharedCentimorgans FROM Ancestry_ICW
WHERE matchid IN (SELECT DISTINCT MatchGuid FROM User_MatchList) AND
icwid IN (SELECT DISTINCT MatchGuid FROM User_MatchList);")
icws <- dbFetch(icwq)
nrow(icws)
## [1] 2296
dbClearResult(icwq)
icws$weight <- icws$sharedCentimorgans # No longer needed since we are using different weights?
# Get information for these matches in db for use in keys etc.
#allmatchq <- dbSendQuery(con, "SELECT * FROM Ancestry_matchGroups WHERE matchGuid IN (SELECT MatchGuid FROM User_MatchList);")
allmatchq <- dbSendQuery(con, paste("SELECT * FROM Ancestry_matchGroups
WHERE testGuid = \"", profile.id, "\" AND
matchGuid IN (SELECT DISTINCT MatchGuid FROM User_MatchList);", sep=""))
all.matches <- dbFetch(allmatchq)
nrow(all.matches)
## [1] 136
dbClearResult(allmatchq)
# Get colored dot group information for inclusion in match key
dotgroupq <- dbSendQuery(con, "SELECT DNA_Tag.matchguid, GROUP_CONCAT(DNA_Tag.name) AS DotGroups FROM DNA_Tag
INNER JOIN (SELECT DISTINCT * FROM User_MatchList) AS uml ON DNA_Tag.matchguid = uml.MatchGuid
WHERE DNA_Tag.name IS NOT NULL
GROUP BY DNA_Tag.matchguid;")
dotgroups <- dbFetch(dotgroupq)
nrow(dotgroups)
## [1] 28
dbClearResult(dotgroupq)
# # Check for missing and duplicated matches
# nrow(matches)
# length(intersect(matches$MatchGuid, all.matches$matchGuid))
# setdiff(matches$MatchGuid, all.matches$matchGuid)
# duplicated(matches$MatchGuid)
# duplicated(all.matches$matchGuid)
dbDisconnect(con)
Note use of pseudo_log (base 2) transform for cM data (needed to
handle 0s for non-matches).
I use 4096 cM for self matches to compress the range (7000 would make
more sense) a little and give an integer log2 for the max.
The output of this process is a matrix with log2 transformed cM values and A* kit ids as row/column names.
icw.g <- graph.data.frame(icws)
#is_weighted(icw.g)
icw.mat <- as.matrix(get.adjacency(icw.g, attr='weight'))
#str(icw.mat) # Note this prints Ancestry ids
# Compare how many matches were in initial list and how many appear in adjacency matrix
nrow(matches)
## [1] 136
nrow(icw.mat)
## [1] 136
icw.mat.cM <- icw.mat # Save this before or after using 4096 for self?
icw.mat <- icw.mat.cM
diag(icw.mat) <- 4096 # Try changing to 4096 to reduce gap in data values and give integer log2
# See functions here
# https://github.com/r-lib/scales/issues/219
# https://scales.r-lib.org/reference/trans_new.html
pseudo_log <- function (val, sigma = 1, base = exp(1)) {
xval <- asinh(val/(2*sigma))/log(base)
}
pseudo_log2 <- function(x) pseudo_log(x, base=2)
icw.mat <- pseudo_log(icw.mat, base=2)
# Provide inverse functions so we can recover cM values easily
# https://stackoverflow.com/questions/33612312/how-to-reverse-inverse-hyperbolic-sine-transformation-in-r
inverse_pseudo_log <- function (val, sigma = 1, base = exp(1)) {
xval <- 2 * sigma * sinh(val*log(base))
}
inverse_pseudo_log2 <- function(x) inverse_pseudo_log(x, base=2)
# # Check accuracy of the transform/inverse process
# icw.mat.cM.transformed <- inverse_pseudo_log(icw.mat, base=2)
# diag(icw.mat.cM.transformed) <- 0 # Conform with original format having no self matches
#
# icw.mat.diff <- icw.mat.cM - icw.mat.cM.transformed
# max(icw.mat.diff) # 3e-12 difference good enough for me
#
# # Look at the result
# icw.mat[1:5, 1:5]
icw.mat <- round(icw.mat, 2) # Only present log2(cM) to 2 decimal places
We use Ancestry_matchGroups row Ids as identifiers because they are
short anonymous integers.
A key is provided for non-public versions.
rowNAMES <- all.matches[match(rownames(icw.mat), all.matches$matchGuid), "Id"]
kitNAMES <- all.matches[match(rownames(icw.mat), all.matches$matchGuid), "matchTestDisplayName"]
# Save mapping between Ids and Guids to use for generating key (easier and clearer to reuse than to do a lookup there)
icw.mat.row.keys <- data.frame(Id = rowNAMES, matchGuid = rownames(icw.mat), matchTestDisplayName = kitNAMES)
rownames(icw.mat) <- rowNAMES
colnames(icw.mat) <- rowNAMES
Color palette is designed to create a gap between the smallest value and 0 to make the small values easier to distinguish from 0 while leaving more color range for the values seen.
Key will be displayed below in the order used in the pheatmap.
Pheatmap is limited in display size. And specifying larger sizes
shrinks the fonts.
Should set figure sizes based on dimensions of icw.mat
# Create a different color palette which reduces the gap (0 to 6/8 cM for matches, 20 cM for ICWs, or specified threshold) in the data.
gap.colors <- colorRampPalette(rev(brewer.pal(n = 7, name = "RdYlBu")))(256)
data.min <- min(icw.mat[icw.mat != 0])
data.max <- max(icw.mat)
# Add additional colors so we use full display range for actual data range
# Leaving some gap helps see the min values in the heatmap. Trying different values.
gap <- 1 # Good compromise for seeing minimum values as distinct from 0 values
num.added <- round(256 * ((data.min-gap) / (data.max - (data.min-gap))))
gap.fill <- gap.colors[1]
gap.filled.colors <- c(rep(gap.fill, num.added), gap.colors)
# Save pheatmap to support key generation below
hmj <- pheatmap(icw.mat, color = gap.filled.colors, main=plot.title)
hmj
Done after pheatmap because I am using that ordering for now.
The all.matches data frame (Ancestry_MatchGroups table data for these
matches) has everything we might want except for the icw.mat row #, the
dotgroups, and the match hyperlink.
# Create key with numbers, kitids, names, and hyperlinks to match with profile. And more...
heatmap.key <- all.matches[all.matches$matchGuid %in% icw.mat.row.keys$matchGuid, c("Id", "matchGuid", "matchTestDisplayName", "sharedCentimorgans", "sharedSegment", "matchTreeNodeCount")]
heatmap.key$log2cMs <- round(pseudo_log2(heatmap.key$sharedCentimorgans), 2)
heatmap.key$sharedCentimorgans <- round(heatmap.key$sharedCentimorgans, 0)
heatmap.key$DotGroups <- dotgroups$DotGroups[match(heatmap.key$matchGuid, dotgroups$matchguid)]
heatmap.key$DotGroups[is.na(heatmap.key$DotGroups)] = ""
heatmap.key$Links <- paste("https://www.ancestry.com/discoveryui-matches/compare/", profile.id, "/with/", heatmap.key$matchGuid, sep="")
heatmap.key$Notes <- all.matches$note[match(heatmap.key$matchGuid, all.matches$matchGuid)]
Only display key if not public version. Reordered to match pheatmap order.
Try both kable static version and DT active version.
One advantage to static version is hyperlinks work. Another is
searching is easier when all entries are visible at once.
Some advantages to DT version are: more compact display, searching,
possible suport for cross highlighting.
Kable static key.
#rownames(icw.mat)[hmj$tree_row$order]
reordered.heatmap.key <- heatmap.key[match(rownames(icw.mat)[hmj$tree_row$order], heatmap.key$Id),]
kable(reordered.heatmap.key, row.names=FALSE)
DT active key.
DT::datatable(reordered.heatmap.key, escape=FALSE)
This version is preferred because it has several desirable
features.
- Supports larger plot sizes
- Allows displaying number in cell (e.g. see Jaccard Similarity
below)
- Allows zooming and panning
- Can display additional data in hover text boxes (unable to get this to
work)
Heatmaply is a powerful package. Documentation available at
https://talgalili.github.io/heatmaply/articles/heatmaply.html
Also described in this paper
Heatmaply:
An R package for creating interactive cluster heatmaps for online
publishing
A useful feature is the ability to zoom and pan within the heatmap. Brief overview of how to do that.
Zoom - click, drag, and hold to select a focal area (rectangle) to
zoom to
Zoom - zooming can be done in either the heatmap itself or the
dendrogram
Zoom - to reset the zoom double click, be patient, rendering can take a
while
Pan - use the toolbar at the upper right (scroll browser window all the
way right first, sometimes it goes missing and hovering over the title
seems to restore)
Pan - click the crossed arrows then drag the heatmap to pan
This documentation for the base plotly package has more on zoom, pan,
and hover.
https://plotly.com/chart-studio-help/zoom-pan-hover-controls/
Note that at RPubs it may be necessary to hide the toolbars at the bottom to gain access to the browser horizontal scroll bar!
Disabled in favor of version below with Jaccard Similarity as well.
heatmaply(icw.mat, color = gap.filled.colors, hclust_method="complete",
key.title = "log2(cM)",
grid_gap=0.3, # grid_gap recommended even though it does not work well with light colored cells
branches_lwd = 0.2, subplot_widths = c(0.95, 0.05), subplot_heights = c(0.05, 0.95),
main=plot.title
)
Heatmaply version with Jaccard similarity in cell and additional information in hover box.
I think this will become the primary heatmap display.
Jaccard Similarity is multiplied by 10 to save display space (leading
zero and decimal point).
Have been unable to get hover text box to work.
The Jaccard Similarity is a way to compare the similarity of two sets
(here the ICW lists for two matches).
In short, it is computed as the size of the intersection of the sets
divided by the size of the union of the sets.
size(intersect(A,B)) / size(union(A,B))
This gives a number between 0 and 1. More details here.
https://en.wikipedia.org/wiki/Jaccard_index
One use of Jaccard Similarity is to attempt to detect triangulation groups (TGs). Fully populated (or almost, exact segment sizes and locations may vary slightly) clusters with high Jaccard Similarity may represent a TG. This is easiest to see with distant relations who just happen to match on the same segment.
A simple rule of thumb is to focus on cells with discordant match
size and Jaccard similarity:
- large match, small JS probably means one of a pair of close relatives
did not get segment responsible for most of the matches here
- small match, large JS probably means distant relatives who share a TG
segment
As an aside, the best way to assess the utility of Jaccard Similarity for locating TGs is to create heatmaps like this on a platform with segment information and then see how the heatmap clusters compare to the true TGs there.
# Compute the Jaccard similarity
icw.mat.jaccard.dist <- dist(icw.mat, method="binary", diag=TRUE, upper=TRUE)
icw.mat.jaccard <- round((1 - as.matrix(icw.mat.jaccard.dist)) * 10, 0) # Multiply by 10 to save display space (leading zero and decimal point)
# Create hover text containing: row, column, cM, log2(cM), Jaccard
# If not public add A ids and names.
# https://www.datanovia.com/en/blog/how-to-create-a-beautiful-interactive-heatmap-in-r/#add-custom-hover-text
# Was unable to get this to work. See saved code in older version of this file.
heatmaply(icw.mat, hclust_method="complete",
color = gap.filled.colors, # I think this does better job of highlighting clusters than default
key.title = "log2(cM)",
grid_gap=0.3, # grid_gap recommended even though it does not work well with light colored cells
branches_lwd = 0.2, subplot_widths = c(0.95, 0.05), subplot_heights = c(0.05, 0.95),
cellnote = icw.mat.jaccard,
cellnote_textposition = "middle center",
main=plot.title)
Another heatmap option which is probably overkill for the sizes of
heatmaps this is creating.
shinyheatmap:
ultra fast low memory heatmap web interface for big data
genomics
I find the Phylogenic dendrogram most useful so focus on that. Display this for public use.
icw.dist <- dist(icw.mat)
icw.hc <- hclust(icw.dist)
# Phylogenic (see other versions considered in earlier version of this file)
Phylo = fviz_dend(icw.hc,
main=plot.title,
# cex = 0.8, lwd = 0.8, # Looks good in plot window
cex = 1.6, lwd = 1.6, # Looks good when knitted
k = 10,
rect = TRUE,
k_colors = "jco",
rect_border = "jco",
rect_fill = TRUE,
type = "phylogenic",
repel = TRUE)
Phylo
Match name version is easier to interpret. Display this for non-public use.
icw.dist <- dist(icw.mat)
icw.hc <- hclust(icw.dist)
icw.hc.names <- icw.hc
icw.hc.names$labels <- all.matches[match(icw.hc.names$labels, all.matches$Id), "matchTestDisplayName"]
# Phylogenic (see other versions considered in earlier version of this file)
Phylo = fviz_dend(icw.hc.names,
main=plot.title,
# cex = 0.8, lwd = 0.8, # Looks good in plot window
cex = 1.6, lwd = 1.6, # Looks good when knitted
k = 10,
rect = TRUE,
k_colors = "jco",
rect_border = "jco",
rect_fill = TRUE,
type = "phylogenic",
repel = TRUE)
Phylo
https://datastorm-open.github.io/visNetwork/
https://cran.r-project.org/web/packages/visNetwork/vignettes/Introduction-to-visNetwork.html
Use the key and icw data as nodes and edges adding named columns to meet visNetwork requirements.
iGraph layouts documented here:
https://r.igraph.org/articles/igraph.html#layouts-and-plotting
Working on ways to highlight groups.
https://stackoverflow.com/questions/50532361/visnetwork-highlightnearest-show-only-connected-edges-to-a-selected-node
An example https://visjs.github.io/vis-network/examples/network/exampleApplications/neighbourhoodHighlight.html
Hover or click on a node to highlight connected edges and
nodes.
Hover over a node or edge to see more information about it.
Or use the Select Id dropdown menu at the upper left to select a node by
numerical id.
# Use rowids as node ids for consistency with graphics above
nodes <- heatmap.key[order(heatmap.key$Id),] # Order by id to make it easier to select value from list
nodes$rowid <- nodes$Id
nodes$id <- nodes$rowid
nodes$size <- nodes$log2cMs * 3
nodes$label <- nodes$rowid
nodes$title <- paste(nodes$rowid, # Newlines don't work, but HTML <br> does
nodes$matchTestDisplayName, "<br>Match:", nodes$sharedCentimorgans, "cM over", nodes$sharedSegment, "segment(s)",
"<br>Tree size:", nodes$matchTreeNodeCount,
"<br>Dot groups:", nodes$DotGroups)
edges <- icws
edges$from <- nodes$rowid[match(edges$matchid, nodes$matchGuid)]
edges$to <- nodes$rowid[match(edges$icwid, nodes$matchGuid)]
edges$weight <- pseudo_log2(edges$sharedCentimorgans)
edges$value <- edges$weight # visNetwork uses the value field for edge weights
edges$title <- paste(edges$from, "to", edges$to, "size", round(edges$sharedCentimorgans,0), "cM")
# Default layout is too slow to use, trying others
visNetwork(nodes, edges, main=plot.title) %>%
visIgraphLayout(randomSeed=123, layout = "layout_with_fr", type = "full") %>% # FR looks best to me
visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE) %>%
# The next two add ID selection dropdown menu and enable interactive highlighting of connected edges
# See https://stackoverflow.com/questions/50532361/visnetwork-highlightnearest-show-only-connected-edges-to-a-selected-node
visEdges(color = list(highlight = "blue", hover = "blue")) %>% # explicit edge options
visOptions(highlightNearest = list(enabled = TRUE, degree = 1,
labelOnly = FALSE, hover = TRUE),
nodesIdSelection = TRUE)
# Other layouts considered
#visIgraphLayout(randomSeed=123, layout = "layout_in_circle", type = "full") %>% # Not terribly useful, but maybe for debugging
#visLayout(randomSeed = 123, clusterThreshold=50) %>% # Too slow!
#visIgraphLayout(randomSeed=123, layout = "layout_nicely", type = "full") %>% # Good default: set seed, layout_nicely is normal default, full scales to rectangle vs. square
#visIgraphLayout(randomSeed=123, layout = "layout_with_kk", type = "full") %>% # I like FR more, but not a bad alternative
# Misc other
#visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE) %>% # Adds ID selection dropdown menu
File Initially created: Friday, November 22, 2024
File knitted: Sat Nov 23 22:00:33 2024