allele and genotype counter

A breakdown of how the allele and genotype counters work, and a demo workflow showing how to use them.

Read in a file. Each column should hold the genotypes for one locus and each row should be one individual.
Then convert "ND" entries to NA.

dat <- read.delim("~/Documents/workflows/Joe/tinytest.txt", header = FALSE, stringsAsFactors = FALSE)
# for a comma-separated file, use read.csv() instead

# change ND to NAs
dat[dat == "ND"] <- NA


dat

##      V1   V2   V3
## 1    AA   CC   AA
## 2    CC   CC   AB
## 3    AA   AC   BD
## 4    AD   DD   AA
## 5  <NA> <NA>   AB
## 6  <NA>   DD   BB
## 7  <NA>   CD <NA>
## 8    CC   CD   BB
## 9    AC   CD   BB
## 10   AD   CC   BB
## 11   AA   CD   AB
## 12 <NA>   BC   AA
## 13   AC   DD <NA>
## 14   CC <NA>   BB
## 15   AC   CD   BB
## 16   AC   CD   BB
## 17   AC   CD   BB
## 18   AD <NA> <NA>
## 19   AC   CD <NA>
## 20   AC <NA> <NA>
## 21   AC   CD   BB
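As an alternative to converting "ND" after reading, `read.delim()` can treat it as missing at read time through its `na.strings` argument. The sketch below assumes a small tab-delimited input supplied inline via `text =` (hypothetical data, not the contents of tinytest.txt):

```r
# hypothetical tab-delimited input, standing in for a genotype file
txt <- "AA\tCC\tAA\nND\tCD\tAB\nAC\tND\tBB"

# na.strings = "ND" turns every "ND" field into NA while reading
demo <- read.delim(text = txt, header = FALSE,
                   stringsAsFactors = FALSE, na.strings = "ND")

# number of missing genotypes in each column
colSums(is.na(demo))
```

Either route should leave you with the same data.frame; doing it at read time just saves a step.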

names(dat) # gives you the names of the columns in a data.frame

## [1] "V1" "V2" "V3"

These are R's default column names. Set them to the column numbers instead:

names(dat) <- 1:ncol(dat)
dat

##       1    2    3
## 1    AA   CC   AA
## 2    CC   CC   AB
## 3    AA   AC   BD
## 4    AD   DD   AA
## 5  <NA> <NA>   AB
## 6  <NA>   DD   BB
## 7  <NA>   CD <NA>
## 8    CC   CD   BB
## 9    AC   CD   BB
## 10   AD   CC   BB
## 11   AA   CD   AB
## 12 <NA>   BC   AA
## 13   AC   DD <NA>
## 14   CC <NA>   BB
## 15   AC   CD   BB
## 16   AC   CD   BB
## 17   AC   CD   BB
## 18   AD <NA> <NA>
## 19   AC   CD <NA>
## 20   AC <NA> <NA>
## 21   AC   CD   BB

Load the two functions:

function genoNo

Given a vector of genotypes (one column of the data.frame), counts the unique genotypes, ignoring NAs. Use identity = TRUE if you want the actual genotypes rather than the count (the default is identity = FALSE). With identity = TRUE, apply will usually return a list rather than a vector, so the result cannot be put straight into a data.frame.

# FUNCTIONS ..............................................................................

genoNo <- function(x, identity = FALSE){
  # unique non-missing genotypes in x
  genos <- unique(x[!is.na(x)])
  # return the genotypes themselves, or just their count
  if(identity){genos}else{length(genos)}
}
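To see what genoNo does on its own, here is a minimal sketch run on a single made-up column. The vector `locus` is hypothetical, and the function is repeated (in an equivalent form) only so the example runs by itself:

```r
# equivalent definition, repeated so the example is self-contained
genoNo <- function(x, identity = FALSE){
  if(identity){unique(x[!is.na(x)])}else{
  length(unique(x[!is.na(x)]))}
}

# hypothetical genotypes for one locus, including a missing value
locus <- c("AA", "AC", NA, "AA", "CC")

n.genos <- genoNo(locus)                   # 3 unique genotypes, NA ignored
genos   <- genoNo(locus, identity = TRUE)  # "AA" "AC" "CC"
```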

function alleleNo

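Given a vector of genotypes (one column of the data.frame), splits each genotype into single characters and counts the unique alleles, ignoring NAs. Use identity = TRUE if you want the actual alleles rather than the count (the default is identity = FALSE). With identity = TRUE, apply will usually return a list rather than a vector, so the result cannot be put straight into a data.frame.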

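alleleNo <- function(x, identity = FALSE){
  # split each non-missing genotype into single characters (alleles)
  alleles <- unique(unlist(strsplit(x[!is.na(x)], "")))
  # return the alleles themselves, or just their count
  if(identity){alleles}else{length(alleles)}
}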
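The key step in alleleNo is `strsplit(x, "")`, which breaks each genotype string into single characters. That means it assumes every allele is coded as exactly one character. A small sketch of the intermediate steps, using a hypothetical vector `genotypes`:

```r
# hypothetical genotypes for one locus
genotypes <- c("AC", "AD", NA, "CC")

# drop NAs, then split each genotype into its characters
chars <- strsplit(genotypes[!is.na(genotypes)], "")  # list of character vectors

# flatten the list and keep each allele once
alleles <- unique(unlist(chars))  # "A" "C" "D"
n.alleles <- length(alleles)      # 3
```

If your alleles use multi-character codes, this split would break them apart and the counts would not be meaningful.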

Workflow

GET ALLELE/GENO COUNTS

Apply each function over each column and compile the results into a `data.frame`:

dd <- data.frame(geno = apply(dat, 2, FUN = genoNo), allele = apply(dat, 2, FUN = alleleNo))
dd

##   geno allele
## 1    4      3
## 2    5      4
## 3    4      3

Add a column identifier:

dd$ID <- names(dat)
dd

##   geno allele ID
## 1    4      3  1
## 2    5      4  2
## 3    4      3  3
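If you would rather have R check that every column really returns a single count, `vapply` is one option. This is only a sketch of an alternative to `apply`, not a replacement for the workflow above; the data.frame `toy` is made up, and the two functions are repeated so the block runs alone:

```r
# equivalent definitions, repeated so the example is self-contained
genoNo <- function(x, identity = FALSE){
  if(identity){unique(x[!is.na(x)])}else{
  length(unique(x[!is.na(x)]))}
}
alleleNo <- function(x, identity = FALSE){
  if(identity){unique(unlist(strsplit(x[!is.na(x)],"")))}else{
  length(unique(unlist(strsplit(x[!is.na(x)],""))))}
}

# hypothetical two-locus dataset
toy <- data.frame(`1` = c("AA", "AC", NA),
                  `2` = c("CC", "CD", "DD"),
                  check.names = FALSE, stringsAsFactors = FALSE)

# integer(1) makes vapply fail loudly if a result is not a single integer
dd.toy <- data.frame(ID = names(toy),
                     geno = vapply(toy, genoNo, integer(1)),
                     allele = vapply(toy, alleleNo, integer(1)))
dd.toy
```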

This gives you a data.frame dd containing genotype and allele counts for each column. You can use these to subset the original data.frame.

Identify the rows in dd (i.e. the columns in the original dat) that have more than 2 alleles. More info on indexing data.frames here

dat.id <- dd[dd$allele > 2,"ID"]

dat.id

## [1] "1" "2" "3"

Get the columns of the original data that were identified as deviant (more than 2 alleles):

dat.sub <- dat[, dat.id, drop = FALSE] # drop = FALSE keeps a data.frame even if only one column is selected
dat.sub

##       1    2    3
## 1    AA   CC   AA
## 2    CC   CC   AB
## 3    AA   AC   BD
## 4    AD   DD   AA
## 5  <NA> <NA>   AB
## 6  <NA>   DD   BB
## 7  <NA>   CD <NA>
## 8    CC   CD   BB
## 9    AC   CD   BB
## 10   AD   CC   BB
## 11   AA   CD   AB
## 12 <NA>   BC   AA
## 13   AC   DD <NA>
## 14   CC <NA>   BB
## 15   AC   CD   BB
## 16   AC   CD   BB
## 17   AC   CD   BB
## 18   AD <NA> <NA>
## 19   AC   CD <NA>
## 20   AC <NA> <NA>
## 21   AC   CD   BB
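One thing to watch when subsetting columns with `[`: if only one column ends up selected, R simplifies the result to a plain vector unless you pass `drop = FALSE`. A minimal sketch with a hypothetical two-column data.frame `toy`:

```r
# hypothetical two-locus dataset
toy <- data.frame(`1` = c("AA", "AC"),
                  `2` = c("CC", "CD"),
                  check.names = FALSE, stringsAsFactors = FALSE)

one.id <- "2"  # pretend only column "2" was flagged

vec <- toy[, one.id]                # simplified to a character vector
sub <- toy[, one.id, drop = FALSE]  # stays a one-column data.frame

is.data.frame(vec)
is.data.frame(sub)
```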

Now you've got the subset of the original dataset in which you identified more than 2 alleles. Also, to see a nice interactive data table in another window, either click on the name of the data.frame in the Environment tab (upper right) or run View() on it, e.g. View(dat.sub). Note the capital V.

Looking further into the data

Just a note about using these functions: I would probably not recommend using them on a very large data.frame. They are just a quick tool.

You might like to look at the distribution of genotypes across each column:

geno.tables <- lapply(dat, FUN = table)
geno.tables

## $`1`
## 
## AA AC AD CC 
##  3  8  3  3 
## 
## $`2`
## 
## AC BC CC CD DD 
##  1  1  3  9  3 
## 
## $`3`
## 
## AA AB BB BD 
##  3  3  9  1
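Note that `table` drops NAs by default, so the counts above only cover individuals with a genotype call. If you also want to see how many are missing, `table` has a `useNA` argument. A small sketch on a hypothetical vector `locus`:

```r
# hypothetical genotypes for one locus, with two missing calls
locus <- c("AA", "AC", NA, "AA", NA)

table(locus)                             # NAs left out
counts <- table(locus, useNA = "ifany")  # adds an <NA> category when NAs exist
counts
```

The same argument can be passed through `lapply`, e.g. `lapply(dat, FUN = table, useNA = "ifany")`.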

Or GET IDENTITIES OF UNIQUE GENOTYPES / ALLELES FOR EACH COLUMN.

Using the functions with identity = T

alleles <- apply(dat, 2, FUN = alleleNo, identity = T)
alleles

## $`1`
## [1] "A" "C" "D"
## 
## $`2`
## [1] "C" "A" "D" "B"
## 
## $`3`
## [1] "A" "B" "D"

genos <- apply(dat, 2, FUN = genoNo, identity = T)
genos

## $`1`
## [1] "AA" "CC" "AD" "AC"
## 
## $`2`
## [1] "CC" "AC" "DD" "CD" "BC"
## 
## $`3`
## [1] "AA" "AB" "BD" "BB"
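A caveat on `apply` with identity = T: it only returns a list when the columns give results of different lengths. If every column happens to have the same number of unique values, `apply` simplifies the result to a matrix. Using `lapply` on the data.frame avoids this and always gives a list. A sketch with a made-up data.frame `toy` (genoNo is repeated so the block runs alone):

```r
# equivalent definition, repeated so the example is self-contained
genoNo <- function(x, identity = FALSE){
  if(identity){unique(x[!is.na(x)])}else{
  length(unique(x[!is.na(x)]))}
}

# hypothetical dataset where both loci have exactly 2 unique genotypes
toy <- data.frame(`1` = c("AA", "AC", "AA"),
                  `2` = c("CC", "CD", "CC"),
                  check.names = FALSE, stringsAsFactors = FALSE)

via.apply  <- apply(toy, 2, FUN = genoNo, identity = TRUE)  # character matrix
via.lapply <- lapply(toy, FUN = genoNo, identity = TRUE)    # list, one element per column

is.matrix(via.apply)
is.list(via.lapply)
```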

Anna Krystalli

2 November 2015
