Change "ND" to NA
dat <- read.delim("~/Documents/workflows/Joe/tinytest.txt", header=FALSE, stringsAsFactors = F)
#read.csv()
# change ND to NAs
dat[dat == "ND"] <- NA
dat
## V1 V2 V3
## 1 AA CC AA
## 2 CC CC AB
## 3 AA AC BD
## 4 AD DD AA
## 5 <NA> <NA> AB
## 6 <NA> DD BB
## 7 <NA> CD <NA>
## 8 CC CD BB
## 9 AC CD BB
## 10 AD CC BB
## 11 AA CD AB
## 12 <NA> BC AA
## 13 AC DD <NA>
## 14 CC <NA> BB
## 15 AC CD BB
## 16 AC CD BB
## 17 AC CD BB
## 18 AD <NA> <NA>
## 19 AC CD <NA>
## 20 AC <NA> <NA>
## 21 AC CD BB
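The same replacement can also happen at read time. This is only a minimal sketch, assuming the file is tab-delimited and missing genotypes are always coded as "ND"; the inline text `toy` and the name `dat.toy` are made up here to stand in for the real file:

```r
# Hypothetical three-row stand-in for the real file (tab-delimited, no header)
toy <- "AA\tCC\tAA\nND\tCC\tAB\nAC\tND\tND"

# na.strings tells read.delim which strings to treat as missing on import
dat.toy <- read.delim(text = toy, header = FALSE,
                      stringsAsFactors = FALSE, na.strings = "ND")

# Number of missing values in each column
colSums(is.na(dat.toy))
```

Doing it this way skips the separate `dat[dat == "ND"] <- NA` step, but either approach should give the same data.frame.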
names(dat) # gives you the names of the columns in a data.frame
## [1] "V1" "V2" "V3"
R gives the columns default names (V1, V2, ...). Set them to the column number instead
names(dat) <- 1:ncol(dat)
dat
## 1 2 3
## 1 AA CC AA
## 2 CC CC AB
## 3 AA AC BD
## 4 AD DD AA
## 5 <NA> <NA> AB
## 6 <NA> DD BB
## 7 <NA> CD <NA>
## 8 CC CD BB
## 9 AC CD BB
## 10 AD CC BB
## 11 AA CD AB
## 12 <NA> BC AA
## 13 AC DD <NA>
## 14 CC <NA> BB
## 15 AC CD BB
## 16 AC CD BB
## 17 AC CD BB
## 18 AD <NA> <NA>
## 19 AC CD <NA>
## 20 AC <NA> <NA>
## 21 AC CD BB
genoNo
Given a data.frame, counts unique genotypes across each column. Use identity = T if you want to get the actual genotypes rather than the counts. With identity = T, apply will return a list rather than a vector, so the result cannot be put in a data.frame. Defaults to identity = F.
# FUNCTIONS ..............................................................................
genoNo <- function(x, identity = F){
  obs <- unique(x[!is.na(x)])   # distinct genotypes, NAs dropped
  if(identity){obs}else{length(obs)}
}
alleleNo
Given a data.frame, counts unique alleles across each column. Use identity = T if you want to get the actual alleles rather than the counts. With identity = T, apply will return a list rather than a vector, so the result cannot be put in a data.frame. Defaults to identity = F.
alleleNo <- function(x, identity = F){
  obs <- unique(unlist(strsplit(x[!is.na(x)], "")))   # distinct single-character alleles, NAs dropped
  if(identity){obs}else{length(obs)}
}
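To see what the two functions do under the hood, here are the same steps run by hand on a single column. The vector `x` is made-up data for illustration only, and the sketch assumes each allele is a single character, as in the data above:

```r
# Hypothetical single column of genotypes with one missing value
x <- c("AA", "AC", NA, "CC", "AC")

# Drop NAs, then keep the distinct genotypes (what genoNo works with)
genotypes <- unique(x[!is.na(x)])
genotypes          # "AA" "AC" "CC"
length(genotypes)  # 3

# Split each genotype into single characters, flatten, keep distinct alleles
alleles <- unique(unlist(strsplit(x[!is.na(x)], "")))
alleles            # "A" "C"
length(alleles)    # 2
```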
data.frame
dd <- data.frame(geno = apply(dat, 2, FUN = genoNo), allele = apply(dat, 2, FUN = alleleNo))
dd
## geno allele
## 1 4 3
## 2 5 4
## 3 4 3
Add a column identifier
dd$ID <- names(dat)
dd
## geno allele ID
## 1 4 3 1
## 2 5 4 2
## 3 4 3 3
This gives you a data.frame dd containing genotype and allele counts for each column. You can use these to subset the original data.frame.
Identify the rows in dd (i.e. the columns in the original dat) which have allele counts > 2.
dat.id <- dd[dd$allele > 2,"ID"]
dat.id
## [1] "1" "2" "3"
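The indexing step is really two steps in one: build a logical vector, then use it to pick rows. A small sketch with a made-up summary table (`dd.toy` and `keep` are hypothetical names, not part of the workflow above):

```r
# Hypothetical summary table in the same shape as dd
dd.toy <- data.frame(geno = c(3, 5, 2), allele = c(2, 4, 2),
                     ID = c("1", "2", "3"), stringsAsFactors = FALSE)

# Step 1: a logical vector with one element per row of dd.toy
keep <- dd.toy$allele > 2
keep                   # FALSE TRUE FALSE

# Step 2: use it to select rows, and name the column to return
dd.toy[keep, "ID"]     # "2"
```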
Get the columns of the original data that were identified as deviant
dat.sub <- dat[, dat.id]
dat.sub
## 1 2 3
## 1 AA CC AA
## 2 CC CC AB
## 3 AA AC BD
## 4 AD DD AA
## 5 <NA> <NA> AB
## 6 <NA> DD BB
## 7 <NA> CD <NA>
## 8 CC CD BB
## 9 AC CD BB
## 10 AD CC BB
## 11 AA CD AB
## 12 <NA> BC AA
## 13 AC DD <NA>
## 14 CC <NA> BB
## 15 AC CD BB
## 16 AC CD BB
## 17 AC CD BB
## 18 AD <NA> <NA>
## 19 AC CD <NA>
## 20 AC <NA> <NA>
## 21 AC CD BB
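One caveat with `dat[, dat.id]`: if only a single column is selected, R drops the result to a plain vector unless you ask it not to. A minimal sketch on made-up data (`toy` and `one` are hypothetical names):

```r
# Hypothetical two-column data.frame
toy <- data.frame(a = c("AA", "AC"), b = c("CC", "CD"),
                  stringsAsFactors = FALSE)

one <- "a"                        # only a single column selected

class(toy[, one])                 # "character": dropped to a vector
class(toy[, one, drop = FALSE])   # "data.frame": structure kept
```

Adding `drop = FALSE` is probably the safer habit if later code expects a data.frame.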
Now you’ve got the subset of the original dataset in which you identified more than 2 alleles. To see a nice interactive data table in another window, either click on the name of the data.frame in the Environment tab (upper right) or run View() on it, e.g. View(dat.sub). Note the capital V.
Just a note about using these: I would probably not recommend using them on a very large data.frame. They’re just a tool.
geno.tables <- lapply(dat, FUN = table)
geno.tables
## $`1`
##
## AA AC AD CC
## 3 8 3 3
##
## $`2`
##
## AC BC CC CD DD
## 1 1 3 9 3
##
## $`3`
##
## AA AB BB BD
## 3 3 9 1
alleles <- apply(dat, 2, FUN = alleleNo, identity = T)
alleles
## $`1`
## [1] "A" "C" "D"
##
## $`2`
## [1] "C" "A" "D" "B"
##
## $`3`
## [1] "A" "B" "D"
genos <- apply(dat, 2, FUN = genoNo, identity = T)
genos
## $`1`
## [1] "AA" "CC" "AD" "AC"
##
## $`2`
## [1] "CC" "AC" "DD" "CD" "BC"
##
## $`3`
## [1] "AA" "AB" "BD" "BB"
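A note on the list output: apply only returns a list when the columns give results of different lengths, as they do above. If every column happens to have the same number of distinct values, apply simplifies the result to a matrix instead, while lapply always gives a list. A small sketch on made-up data (`toy` and the helper `distinct` are hypothetical, not part of the workflow above):

```r
# Hypothetical data where every column has exactly two distinct genotypes
toy <- data.frame(a = c("AA", "AC", "AA"), b = c("CC", "CD", "CD"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: distinct non-missing values of a vector
distinct <- function(x) unique(x[!is.na(x)])

# Equal-length results: apply() simplifies them into a matrix
m <- apply(toy, 2, FUN = distinct)
is.matrix(m)    # TRUE

# lapply() always returns a list, one element per column
l <- lapply(toy, FUN = distinct)
is.list(l)      # TRUE
```

So if later code relies on getting a list back, lapply is probably the more predictable choice.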