1 Overview

Overall idea is using matches in the User_MatchList table of the DNAGedcom database to create a heatmap of those matches as well as other visualizations.

This provides a way of using the heatmap approach with a variety of lists of matches. Some examples.
- ICWs for a single person
- Matches for descendants of one or more ancestors
- … and all of their ICW matches
- Match lists exported from Gephi (e.g. all matches in a Gephi cluster)

The goal is to facilitate seeing relationships within these groups to help working out their relationships and trees.
One example is using the tree sizes in the key to facilitate finding the good trees within a family group.

This is a general purpose file, but allow specifying a more descriptive title for the plots in each run.

#plot.title <- "ICWs for Matches in User_MatchList Table" # Generic title for what this file does
plot.title <- "ThruLines Descendants of Mathias Boll and His Ancestors Plus Their ICWs"

2 Query DNGedcom DB for the Match List and Associated Information

For public sharing information displayed is limited.

is.public <- TRUE

Hard coding the profile seems like the easiest way to handle wanting to have it for adding additional information from Ancestry_MathGroups to the key.
A better approach might be to specify a row number in Ancestry_Profiles and default to the first.

Here is an example.

profileid <- "<A* profile id>" # Describe profile here

Query MatchList, ICWs, and Match Information from DNAGedcom.
Note possible issues with duplicates in User_MatchList.

It is an open question for me whether we should define the match list as being a unique list or whether it is best to allow duplicates and let the user deal with it with SELECT DISTINCT.
Currently doing the latter.

#con <- dbConnect(RSQLite::SQLite(), "F:/Documents/Genealogy/DNA/August 2024 DNA Matches/DNAGedcom4_20240919/DNAGedcom_20240827.db")
con <- dbConnect(RSQLite::SQLite(), "DNAGedcom_20240827.db")
dbListTables(con)
##  [1] "AncestryAncestorCouple"   "Ancestry_Ethnicity"      
##  [3] "Ancestry_ICW"             "Ancestry_Kit"            
##  [5] "Ancestry_Profiles"        "Ancestry_TreeData"       
##  [7] "Ancestry_matchDetail"     "Ancestry_matchEthnicity" 
##  [9] "Ancestry_matchGroups"     "Ancestry_matchTrees"     
## [11] "DGIndividual"             "DGTree"                  
## [13] "DNA_Kits"                 "DNA_Login"               
## [15] "DNA_Process"              "DNA_Schedule"            
## [17] "DNA_Tag"                  "EWS_Ancestry_ICW"        
## [19] "EWS_Ancestry_matchGroups" "FTDNA_Chromo2"           
## [21] "FTDNA_ICW2"               "FTDNA_Kit"               
## [23] "FTDNA_Matches2"           "FTDNA_Tree"              
## [25] "GM_ICW"                   "GM_Kit"                  
## [27] "GM_Match"                 "GM_Segment"              
## [29] "Gather_History"           "MH_Ancestors"            
## [31] "MH_Chromo"                "MH_ICW"                  
## [33] "MH_Match"                 "MH_Tree"                 
## [35] "MH_Triangulate"           "Match_List"              
## [37] "MeAnd23W_FIANew"          "MeAnd23W_Haplo"          
## [39] "MeAnd23W_ICW"             "MeAnd23W_Kit"            
## [41] "MeAnd23W_Matches"         "MeAnd23W_Profiles"       
## [43] "MeAnd23W_Relatives"       "MeAnd23W_Shares"         
## [45] "MeAnd23W_Triag"           "Settings"                
## [47] "User_AncestryPeople"      "User_MatchList"          
## [49] "sqlite_sequence"
matchq <- dbSendQuery(con, "SELECT DISTINCT * FROM User_MatchList;") # Remove duplicates
matches <- dbFetch(matchq)
nrow(matches)
## [1] 115
dbClearResult(matchq)

# Extract ICWs for match list
icwq <- dbSendQuery(con, "SELECT matchid, icwid, sharedCentimorgans FROM Ancestry_ICW
WHERE matchid IN (SELECT MatchGuid FROM User_MatchList) AND
      icwid IN (SELECT MatchGuid FROM User_MatchList);")
icws <- dbFetch(icwq)
nrow(icws)
## [1] 2172
dbClearResult(icwq)

icws$weight <- icws$sharedCentimorgans  # No longer needed since we are using different weights?

# Get information for these matches in db for use in keys etc.
#allmatchq <- dbSendQuery(con, "SELECT * FROM Ancestry_matchGroups WHERE matchGuid IN (SELECT MatchGuid FROM User_MatchList);")
allmatchq <- dbSendQuery(con, paste("SELECT * FROM Ancestry_matchGroups
WHERE testGuid = \"", profileid, "\" AND
      matchGuid IN (SELECT MatchGuid FROM User_MatchList);", sep=""))
all.matches <- dbFetch(allmatchq)
nrow(all.matches)
## [1] 115
dbClearResult(allmatchq)

# Get colored dot group information for inclusion in match key
dotgroupq <- dbSendQuery(con, "SELECT DNA_Tag.matchguid, GROUP_CONCAT(DNA_Tag.name) AS DotGroups FROM DNA_Tag
INNER JOIN (SELECT DISTINCT * FROM User_MatchList) AS uml ON DNA_Tag.matchguid = uml.MatchGuid
WHERE   DNA_Tag.name IS NOT NULL
GROUP BY DNA_Tag.matchguid;")
dotgroups <- dbFetch(dotgroupq)
nrow(dotgroups)
## [1] 27
dbClearResult(dotgroupq)

# # Check for missing and duplicated matches
# nrow(matches)
# length(intersect(matches$MatchGuid, all.matches$matchGuid))
# setdiff(matches$MatchGuid, all.matches$matchGuid)
# duplicated(matches$MatchGuid)
# duplicated(all.matches$matchGuid)

dbDisconnect(con)

3 Prepare Data for Heatmaps

Note use of pseudo_log (base 2) transform for cM data (needed to handle 0s for non-matches).
I use 4096 cM for self matches to compress the range (7000 would make more sense) a little and give an integer log2 for the max.

The output of this process is a matrix with log2 transformed cM values and A* kit ids as row/column names.

icw.g <- graph.data.frame(icws)
#is_weighted(icw.g)
icw.mat <- as.matrix(get.adjacency(icw.g, attr='weight'))
#str(icw.mat) # Note this prints Ancestry ids

# Compare how many matches were in initial list and how many appear in adjacency matrix
nrow(matches)
## [1] 115
nrow(icw.mat)
## [1] 115
icw.mat.cM <- icw.mat # Save this before or after using 4096 for self?
icw.mat <- icw.mat.cM

diag(icw.mat) <- 4096 # Try changing to 4096 to reduce gap in data values and give integer log2

# See functions here
# https://github.com/r-lib/scales/issues/219
# https://scales.r-lib.org/reference/trans_new.html
pseudo_log <- function (val, sigma = 1, base = exp(1)) {
  xval <- asinh(val/(2*sigma))/log(base)
}
pseudo_log2 <- function(x) pseudo_log(x, base=2)
icw.mat <- pseudo_log(icw.mat, base=2)

# Provide inverse functions so we can recover cM values easily
# https://stackoverflow.com/questions/33612312/how-to-reverse-inverse-hyperbolic-sine-transformation-in-r
inverse_pseudo_log <- function (val, sigma = 1, base = exp(1)) {
  xval <- 2 * sigma * sinh(val*log(base))
}
inverse_pseudo_log2 <- function(x) inverse_pseudo_log(x, base=2)

# # Check accuracy of the transform/inverse process
# icw.mat.cM.transformed <- inverse_pseudo_log(icw.mat, base=2)
# diag(icw.mat.cM.transformed) <- 0 # Conform with original format having no self matches
# 
# icw.mat.diff <- icw.mat.cM - icw.mat.cM.transformed
# max(icw.mat.diff) # 3e-12 difference good enough for me
# 
# # Look at the result
# icw.mat[1:5, 1:5]

4 Display Heatmaps

4.1 Anonymize Data for Display

We use Ancestry_matchGroups row Ids as identifiers because they are short anonymous integers.
A key is provided for non-public versions.

rowNAMES <- all.matches[match(rownames(icw.mat), all.matches$matchGuid), "Id"]
kitNAMES <- all.matches[match(rownames(icw.mat), all.matches$matchGuid), "matchTestDisplayName"]
# Save mapping between Ids and Guids to use for generating key (easier and clearer to reuse than to do a lookup there)
icw.mat.row.keys <- data.frame(Id = rowNAMES, matchGuid = rownames(icw.mat), matchTestDisplayName = kitNAMES)

rownames(icw.mat) <- rowNAMES
colnames(icw.mat) <- rowNAMES

4.2 Display Pheatmap Version

Color palette is designed to create a gap between the smallest value and 0 to make the small values easier to distinguish from 0 while leaving more color range for the values seen.

Key will be displayed below in the order used in the pheatmap.

Pheatmap is limited in display size. And specifying larger sizes shrinks the fonts.
Should set figure sizes based on dimensions of icw.mat

# Create a different color palette which reduces the gap (0 to 6/8 cM for matches, 20 cM for ICWs, or specified threshold) in the data.
gap.colors <- colorRampPalette(rev(brewer.pal(n = 7, name = "RdYlBu")))(256)
data.min <- min(icw.mat[icw.mat != 0])
data.max <- max(icw.mat)

# Add additional colors so we use full display range for actual data range
# Leaving some gap helps see the min values in the heatmap.  Trying different values.
gap <- 1 # Good compromise for seeing minimum values as distinct from 0 values
num.added <- round(256 * ((data.min-gap) / (data.max - (data.min-gap))))
gap.fill <- gap.colors[1]
gap.filled.colors <- c(rep(gap.fill, num.added), gap.colors)

# Save pheatmap to support key generation below
hmj <- pheatmap(icw.mat, color = gap.filled.colors, main=plot.title)
hmj

4.3 Create Key

Done after pheatmap because I am using that ordering for now.
The all.matches data frame (Ancestry_MatchGroups table data for these matches) has everything we might want except for the icw.mat row #, the dotgroups, and the match hyperlink.

# Create key with numbers, kitids, names, and hyperlinks to match with profile.  And more...
heatmap.key <- all.matches[all.matches$matchGuid %in% icw.mat.row.keys$matchGuid, c("Id", "matchGuid", "matchTestDisplayName", "sharedCentimorgans", "sharedSegment", "matchTreeNodeCount")]
heatmap.key$log2cMs <- round(pseudo_log2(heatmap.key$sharedCentimorgans), 2)
heatmap.key$sharedCentimorgans <- round(heatmap.key$sharedCentimorgans, 0)
heatmap.key$DotGroups <- dotgroups$DotGroups[match(heatmap.key$matchGuid, dotgroups$matchguid)]
heatmap.key$DotGroups[is.na(heatmap.key$DotGroups)] = ""
heatmap.key$Links <- paste("https://www.ancestry.com/discoveryui-matches/compare/", profileid, "/with/", heatmap.key$matchGuid, sep="")
heatmap.key$Notes <- all.matches$note[match(heatmap.key$matchGuid, all.matches$matchGuid)]

Only display key if not public version. Reordered to match pheatmap order.

Try both kable static version and DT active version.

One advantage to static version is hyperlinks work. Another is searching is easier when all entries are visible at once.

Kable static key.

rownames(icw.mat)[hmj$tree_row$order]
#reordered.heatmap.key <- heatmap.key[rownames(icw.mat)[hmj$tree_row$order],]

# Why is reordering breaking?  Perhaps better to ask why did above ever work?
reordered.heatmap.key <- heatmap.key[match(rownames(icw.mat)[hmj$tree_row$order], heatmap.key$Id),]

kable(reordered.heatmap.key, row.names=FALSE)

DT active key.

DT::datatable(reordered.heatmap.key, escape=FALSE)

4.4 Display Heatmaply Version

This version is preferred because it has several desirable features.
- Supports larger plot sizes
- Allows displaying number in cell (e.g. see Jaccard Similarity below)
- Allows zooming and panning
- Can display additional data in hover text boxes (unable to get this to work)

Heatmaply is a powerful package. Documentation available at
https://talgalili.github.io/heatmaply/articles/heatmaply.html

Also described in this paper
Heatmaply: An R package for creating interactive cluster heatmaps for online publishing

A useful feature is the ability to zoom and pan within the heatmap. Brief overview of how to do that.

Zoom - click, drag, and hold to select a focal area (rectangle) to zoom to
Zoom - zooming can be done in either the heatmap itself or the dendrogram
Zoom - to reset the zoom double click, be patient, rendering can take a while
Pan - use the toolbar at the upper right (scroll browser window all the way right first, sometimes it goes missing and hovering over the title seems to restore)
Pan - click the crossed arrows then drag the heatmap to pan

This documentation for the base plotly package has more on zoom, pan, and hover.
https://plotly.com/chart-studio-help/zoom-pan-hover-controls/

Note that at RPubs it may be necessary to hide the toolbars at the bottom to gain access to the browser horizontal scroll bar!

Disabled in favor of version below with Jaccard Similarity as well.

heatmaply(icw.mat, color = gap.filled.colors, hclust_method="complete",
          key.title = "log2(cM)",
          grid_gap=0.3, # grid_gap recommended even though it does not work well with light colored cells
          branches_lwd = 0.2, subplot_widths = c(0.95, 0.05), subplot_heights = c(0.05, 0.95),
          main=plot.title          
)

4.5 Display Heatmaply Version with Jaccard Similarity

Heatmaply version with Jaccard similarity in cell and additional information in hover box.

I think this will become the primary heatmap display.

Jaccard Similarity is multiplied by 10 to save display space (leading zero and decimal point).
Have been unable to get hover text box to work.

The Jaccard Similarity is a way to compare the similarity of two sets (here the ICW lists for two matches).
In short, it is computed as the size of the intersection of the sets divided by the size of the union of the sets.
size(intersect(A,B)) / size(union(A,B))
This gives a number between 0 and 1. More details here.
https://en.wikipedia.org/wiki/Jaccard_index

One use of Jaccard Similarity is to attempt to detect triangulation groups (TGs). Fully populated (or almost, exact segment sizes and locations may vary slightly) clusters with high Jaccard Similarity may represent a TG. This is easiest to see with distant relations who just happen to match on the same segment.

A simple rule of thumb is to focus on cells with discordant match size and Jaccard similarity:
- large match, small JS probably means one of a pair of close relatives did not get segment responsible for most of the matches here
- small match, large JS probably means distant relatives who share a TG segment

As an aside, the best way to assess the utility of Jaccard Similarity for locating TGs is to create heatmaps like this on a platform with segment information and then see how the heatmap clusters compare to the true TGs there.

# Compute the Jaccard similarity
icw.mat.jaccard.dist <- dist(icw.mat, method="binary", diag=TRUE, upper=TRUE)
icw.mat.jaccard <- round((1 - as.matrix(icw.mat.jaccard.dist)) * 10, 0) # Multiply by 10 to save display space (leading zero and decimal point)

# Create hover text containing: row, column, cM, log2(cM), Jaccard
# If not public add A ids and names.
# https://www.datanovia.com/en/blog/how-to-create-a-beautiful-interactive-heatmap-in-r/#add-custom-hover-text
# Was unable to get this to work.  See saved code in older version of this file.  

heatmaply(icw.mat, hclust_method="complete",
          color = gap.filled.colors, # I think this does better job of highlighting clusters than default
          key.title = "log2(cM)",
          grid_gap=0.3, # grid_gap recommended even though it does not work well with light colored cells
          branches_lwd = 0.2, subplot_widths = c(0.95, 0.05), subplot_heights = c(0.05, 0.95),
          cellnote = round(icw.mat.jaccard, digits=1),
          cellnote_textposition = "middle center",
          main=plot.title)

Another heatmap option which is probably overkill for the sizes of heatmaps this is creating.
shinyheatmap: ultra fast low memory heatmap web interface for big data genomics

5 Display Dendrograms

I find the Phylogenic dendrogram most useful so focus on that. Display this for public use.

icw.dist <- dist(icw.mat)
icw.hc <- hclust(icw.dist)

# Phylogenic (see other versions considered in earlier version of this file)
Phylo = fviz_dend(icw.hc,
                  main=plot.title,
                  # cex = 0.8, lwd = 0.8, # Looks good in plot window
                  cex = 1.6, lwd = 1.6, # Looks good when knitted
                  k = 10,
                  rect = TRUE,
                  k_colors = "jco",
                  rect_border = "jco",
                  rect_fill = TRUE,
                  type = "phylogenic",
                  repel = TRUE)
Phylo

Match name version is easier to interpret. Display this for non-public use.

icw.dist <- dist(icw.mat)
icw.hc <- hclust(icw.dist)

icw.hc.names <- icw.hc
icw.hc.names$labels <- all.matches[match(icw.hc.names$labels, all.matches$Id), "matchTestDisplayName"]

# Phylogenic (see other versions considered in earlier version of this file)
Phylo = fviz_dend(icw.hc.names,
                  main=plot.title,
                  # cex = 0.8, lwd = 0.8, # Looks good in plot window
                  cex = 1.6, lwd = 1.6, # Looks good when knitted
                  k = 10,
                  rect = TRUE,
                  k_colors = "jco",
                  rect_border = "jco",
                  rect_fill = TRUE,
                  type = "phylogenic",
                  repel = TRUE)
Phylo

5.1 File History

File Initially created: Wednesday, November 20, 2024

File knitted: Fri Nov 22 09:57:17 2024