Final_Project

Getting Started

Loading Packages

As always, I make a chunk at the top of the markdown document that I use as a list of the packages I will use for the project. As I come across packages I need to construct a figure, or a function I find usefiul, I add the inclusive package to this chunk

library(igraph)

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

library(plotly)

## Loading required package: ggplot2

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:igraph':
## 
##     groups

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:igraph':
## 
##     as_data_frame, groups, union

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Organizing Data

Here I set the working drive so that the program will look there by default. A working drive is a good strategy to keep all of your data sources and outputs grouped into the same folder

setwd("~/Desktop/ANT191WD")

The main data set I will be working with is the Frantz et al. data on ancient and modern dog haplotypes by country. Looking at the haplogroup frequency by country is too fine-grained and approach, so prior to loading the data into the environment, I opened the table using Microsoft Excel and added a column titled “Region.” Additionally, I have trimmed the data to remove any individuals not classifiable by region, or outside of the haplogroups A,B,C,D,E, and F.

dna <- read.csv("~/Desktop/ANT191WD/mtDNA_info.csv")

Subsetting the Data

The data can be broken into any number of subsets to facilitate interpretation and visualization. I have broken the data into modern and archaeological subsets below

Arch <- dna[dna$Age > 0,]
Modern <- dna[dna$Age == 0,]

Here I have broken them into subsets by region. These data subsets are not necessary to produce the following plots, but were useful for determining the frequency of haplogroup by region

Africa <- dna[dna$Region == "Africa",]
Caribbean <- dna[dna$Region == "Caribbean",]
C.America <- dna[dna$Region == "Central America",]
E.Asia <- dna[dna$Region == "East Asia",]
Europe <- dna[dna$Region == "Europe",]
Middle.East <- dna[dna$Region == "Middle East",]
N.America <- dna[dna$Region == "North America",]
Oceania <- dna[dna$Region == "Oceania",]
SCA <- dna[dna$Region == "SCA",]
SEA <- dna[dna$Region == "SEA",]
Siberia <- dna[dna$Region == "Siberia",]
S.America <- dna[dna$Region == "South America",]
S.Asia <- dna[dna$Region == "South Asia",]

Network Analysis

To show the relationship between haplogroups and region, I have made an edgelist that shows the representation of haplogroups by region.

The first step is importing the edgelist from a csv file

dnet <- read.csv("~/Desktop/ANT191WD/dna_network.csv", header=T, row.names=NULL)

Then I convert the object from a data frame to a matrix

dnetmat <- as.matrix(dnet)

I assign an object name to the graph I created from the function graph_from_edgelist.

Note: This function is part of the iGraph package, which must be loaded at the beginning of the session (see the chunk titled “Loading Packages”)

net <- graph_from_edgelist(dnetmat, directed=FALSE)

This object may now be plotted

plot.igraph(net, main="Haplogroup links by region" )

The resulting plot is a jumbled mess. The function tkplot allows you to manipulate the nodes in a GUI. You can manually drag nodes to different parts of the plot space and change parameters, like the color and size of the nodes, or the collor and thickness of the edges. In the following plot, I have changed the nodes representitive of the regions to areas that roughly approximate their geographic relation to each other and have used the color red to distinguish them from the haplogroup nodes. I have given each haplogroup node and its corresponding edges a unique color to demonstrate the ties of each haplogroup to a region.

tkplot(net)

Network created by “tkplot” function

Dot Charts and Pie Charts Using Frequency Tables

I have made a seperate data frame of the frequency of each haplogroup by region. Because the various regions have unequal sample sizes, frequencies are more telling than raw counts

dna_freq <- read.csv("~/Desktop/ANT191WD/dna_freq.csv", row.names = 1)

Creating the Dot Charts

In the chunk below, I create the dot charts. These dot charts show the frequency of each haplogroup by region. I used the dotchart function and assigned each haplogroup (row on the chart) a color using and object I created from a concatenation of six color values (mycolors)

mycolors <- c("royalblue", "orange2", "gold1", "darkorchid4", "cyan", "darkgoldenrod4")

AfrDot <- dotchart(dna_freq$Africa, color = mycolors, labels = row.names(dna_freq), pch=16, xlim=c(0,1.1), main="Africa")

CarDot <- dotchart(dna_freq$Caribbean, color = mycolors, labels = row.names(dna_freq), pch=16, xlim = c(0,1.1), main = "Caribbean")

CamDot <- dotchart(dna_freq$CentAmerica, color = mycolors, labels = row.names(dna_freq), pch=16, xlim = c(0,1.1), main = "Central America")

EasDot <- dotchart(dna_freq$EAsia, color = mycolors, labels = row.names(dna_freq), pch=16, xlim=c(0,1.1), main = "East Asia")

EurDot <- dotchart(dna_freq$Europe, color = mycolors, labels = row.names(dna_freq), pch=16, xlim = c(0,1.1), main = "Europe")

MeDot <- dotchart(dna_freq$MiddleEast, color = mycolors, labels = row.names(dna_freq), pch=16, xlim = c(0,1.1), main = "Middle East")

NamDot <- dotchart(dna_freq$Namerica, color = mycolors, labels = row.names(dna_freq), pch=16, xlim = c(0,1.1), main = "North America")

OceDot <- dotchart(dna_freq$Oceania, color = mycolors, labels = row.names(dna_freq), pch=16, xlim = c(0,1.1), main = "Oceania")

SibDot <- dotchart(dna_freq$Siberia, color = mycolors, labels = row.names(dna_freq), pch=16, xlim = c(0,1.1), main = "Siberia")

SamDot <- dotchart(dna_freq$Samerica, color = mycolors, labels = row.names(dna_freq), pch=16, xlim = c(0,1.1), main = "South America")

SasDot <- dotchart(dna_freq$SAsia, color = mycolors, labels = row.names(dna_freq), pch=16, xlim = c(0,1.1), main = "South Asia")

SeaDot <- dotchart(dna_freq$SEA, color = mycolors, labels = row.names(dna_freq), pch=16, xlim = c(0,1.1), main = "South East Asia")

ScaDot <- dotchart(dna_freq$SCA, color = mycolors, labels = row.names(dna_freq), pch=16, xlim = c(0,1.1), main = "South Central Asia")

par(mfrow= c(4,4))
NamDot

## NULL

CamDot

## NULL

CarDot

## NULL

SamDot

## NULL

AfrDot

## NULL

MeDot

## NULL

EurDot

## NULL

ScaDot

## NULL

SasDot

## NULL

SeaDot

## NULL

EasDot

## NULL

SibDot

## NULL

OceDot

## NULL

I tried to combine the plots into a single plot using the par(mfrow =) function, but the objects I created showed up as “NULL” in the environment. Sources online said I should reset my graphics by using the dev.off function, but this did not remedy the problem

Pie charts

If you look at the help file for pie charts, there is a note that says any information a pie chart can convey can be better communicated by a dot chart. However, I would like to take the pie charts I produce by region and put them into my map. First, I have to make another data farme. Whereas to exclude haplogroups not represented in each sample, “NA”s were left in the data frame, the pie function will not work with them. They must be converted to zeros, and that is what I have done in this data frame

dnaf <- read.csv("~/Desktop/ANT191WD/dnaf.csv", row.names=1)

Below is the code used for producing each pie chart. Notice, I use the same object mycolors for the colorscheme in this function as well

AfrPie <- pie(dnaf$Africa, col = mycolors, labels = c("A","B","C",NULL,NULL,NULL))

CarPie <- pie(dnaf$Caribbean, col = mycolors, labels = c("A",NULL,NULL,NULL,NULL,NULL))

CamPie <- pie(dnaf$Africa, col = mycolors, labels = c("A","B","C",NULL,NULL,NULL))

EasPie <- pie(dnaf$EAsia, col = mycolors, labels = c("A","B","C",NULL,"E","F"))

EurPie <- pie(dnaf$Europe, col = mycolors, labels = c("A","B","C","D",NULL,NULL))

MePie <- pie(dnaf$MiddleEast, col = mycolors, labels = c("A","B","C","D",NULL,NULL))

NamPie <- pie(dnaf$Namerica, col = mycolors, labels = c("A","B","C",NULL,NULL,NULL))

OcePie <- pie(dnaf$Oceania, col = mycolors, labels = c("A","B","C",NULL,NULL,NULL))

ScaPie <- pie(dnaf$SCA, col = mycolors, labels = c("A","B",NULL,NULL,NULL,NULL))

SeaPie <- pie(dnaf$SEA, col = mycolors, labels = c("A","B","C",NULL,"E",NULL))

SibPie <- pie(dnaf$Siberia, col = mycolors, labels = c("A","B","C","D",NULL,NULL))

SamPie <- pie(dnaf$Samerica, col = mycolors, labels = c("A","B","C","D",NULL,NULL))

SasPie <- pie(dnaf$SAsia, col = mycolors, labels = c("A","B","C",NULL,NULL,NULL))

The pie charts were saved by manually clicking on them, as the export function would not work. As with the dot charts, the objects I created registered as “NULL” in my environment. I saved the pie charts and put them in Adobe Illustrator to crop them, and now they are ready for inclusion in my map

Map

I produced the map in QGIS3, using two basemaps that were available in the program. The points on the map are a layer produced by two columns in the main dataset and represent the latitude and longitude of the sites from which the DNA was collected, both modern and archaeological. I added the pie charts by using the “add a new picture to layout” button in the layout GUI

mtDNA representation by region

A couple of issues: The scale bar seems to be broken. If it were correct, the circumference of the world would only be slightly greater than 200 km. I tried setting its reference to various map layers, but that didn’t fix it. I have to keep working on this issue. Another is that some of the points do not display the correct color for the region they represent. This only seems to happen when there are multiple individuals represented by the same lat/long data. When you use the identify tool and mouse over them, they display the correct region designator, but the color does not match

Interactive Pie Charts

For the interactive plot, I have chosen to show the difference in the diversity of the archaeological and modern sample. This is not a robust analysis, but demonstrates the process of producing a side by side plot with the package plotly

dna_count <- read.csv("~/Desktop/ANT191WD/dna_count.csv", row.names=1)

The key here is loading the package dplyr so the function %>% will operate correctly. Here it serves to add elements to the code. The function plot_ly() with the empty parentheses is there simply to plot the elements that are added on after that

plyc <- plot_ly() %>% 
add_pie(data = dna_count, labels = ~rownames(dna_count), values = ~Modern, name = "Modern", domain = list(x = c(0,.5), y = c(.4,1))) %>%
add_pie(data = dna_count, labels = ~rownames(dna_count), values = ~Archaeological, name = "Archaeological", domain = list(x = c(.6,1), y = c(0.4,1))) %>%
layout(title = "Global Genetic Diversity: Modern vs Archaeological Samples", showlegend=T,xaxis = list(showgrid = FALSE, zeroline=FALSE,showticklabels=FALSE), yaxis = list(showgrid = FALSE, zeroline=FALSE,showticklabels=FALSE))
plyc

Admittedly, all of these plots could do with some improvements before they are ready for publication. The production of good plots is time consuming and meticulous work, but the code above is a good base for producing good maps and images

Final_Project_ANT191

Kirk J Schmitz

6/3/2019