n=1,184 (589 males, 595 females) There are 1,440,616 SNPs total across 23 chromosomes in this dataset. Here, we will only consider 508 SNPs in the first 1,000,000 base pairs of Chromosome 17. To open genotype data, use:
data<-read.table('Chr17_1to1000kb_notrecoded.ped', header=F, sep=' ')
#Note that there is a space between ‘ and ‘ for the “sep=” argument.
#1. How many columns and rows are there in this dataset?
ncol(data)
## [1] 1022
#2. Display the first 5 rows and first 7 columns of this dataset.
data[1:5, 1:7]
## V1 V2 V3 V4 V5 V6 V7
## 1 2427 NA19919 NA19908 NA19909 1 -9 G
## 2 2431 NA19916 0 0 1 -9 G
## 3 2424 NA19835 0 0 2 -9 A
## 4 2469 NA20282 0 0 2 -9 A
## 5 2368 NA19703 0 0 1 -9 G
The first 6 columns of this dataset are: Family ID Individual ID Paternal ID Maternal ID Sex (1=male; 2=female; other=unknown) Phenotype
#3. What type of variable is the second column (Individual ID)? (Hint: class())
class(data$V2)
## [1] "factor"
V2 was a factor before the conversion below.
#4. Make this variable into a character variable and rename the variable as “IID”. (individual ID, which is the same name found in the following file).
#reclass V2 from factor to character
data$V2 <-as.character(data$V2)
class(data$V2)
## [1] "character"
#rename data$V2 to data$IID
names(data)[2]<-"IID"
Next, open up the file that displays the population affiliation for each individual in the genotype data. “HapMap3_IDs.txt” – Name this dataset ‘Pop’ – Note that there are headers in this dataset
Pop <- read.table("HapMap3_IDs.txt", header = T)
#5. How many individuals (rows) are in this dataset?
nrow(Pop)
## [1] 1301
#6. How many of the IDs in this file (“IID”) overlap with the genotype data IDs (2nd column)? You should first transform this second variable into a character variable.
#Transform "IID" from factor to character vairable class.
Pop$IID<-as.character(Pop$IID)
#Display nuber of IIDs that overlap between data and Pop
length(intersect(data$IID, Pop$IID))
## [1] 1184
#7. Merge the file with the population dataset to the genotype data (2nd column of the genotype data is the individual ID) into a new dataset. Hint use the merge () function, and put the population dataset first, and genotype dataset second).
popdata<-merge(Pop, data, by="IID")
head(names(data))
## [1] "V1" "IID" "V3" "V4" "V5" "V6"
#8. Using this new merged dataset, make a table of the number of individuals by population. (Column name: ‘population’). You can refer to the “phase_3_samples” file to see what these populations are.
table(popdata$population)
##
## ASW CEU CHB CHD GIH JPT LWK MEX MKK TSI YRI
## 83 165 84 85 88 86 90 77 171 88 167
Next, open the .map file which has information about each SNP in same order as the genotype file. There are NO headers in the file, so use header=F. Use sep=”, to specify that tabs separate the column entries.
map<-read.table("Chr17_1to1000kb_notrecoded.map", header = F, sep = "\t")
head(map)
## V1 V2 V3 V4
## 1 17 rs8069278 0 3587
## 2 17 rs6565733 0 6689
## 3 17 rs1106175 0 6888
## 4 17 rs6420495 0 7242
## 5 17 rs2396789 0 8547
## 6 17 rs6565703 0 12344
The first column is the chromosome number, the second is the SNP name, and the fourth is the base position along that chromosome. #9. Remove the third column, and rename the remaining columns of this dataset as c(“Chr”, “SNP”, “kb”)
#remove column 3 (V3)
map$V3 <- NULL
names(map)[1]<-"Chr"
names(map)[2]<-"SNP"
names(map)[3]<-"kb"
head(map)
## Chr SNP kb
## 1 17 rs8069278 3587
## 2 17 rs6565733 6689
## 3 17 rs1106175 6888
## 4 17 rs6420495 7242
## 5 17 rs2396789 8547
## 6 17 rs6565703 12344
#10. How many SNPs are there?
nrow(map)
## [1] 508
#11. How many columns do we have genotype information on? Again, the genotype file has 6 columns at the beginning which are not genotypes. However, after merging the IDs file at the beginning, there are now 6+7-1 (we subtract 1 because we merged on a common column), so there are 12 columns before the genotypes begin.
ncol(popdata) - 12
## [1] 1016
#12. What are the alleles of SNP rs6565733? Remember that the order of the SNPs in the genotype file is the same as the order in the map file. This SNP is the second SNP listed.
popdata[1:5, 15:16 ]
## V9 V10
## 1 G A
## 2 A A
## 3 A A
## 4 G A
## 5 A A
#the alleles are G and A
grep("rs2291778", map$SNP)
## [1] 287
#rs2291778 is the 287th variable, so its position would be
((287*2)+12-1)
## [1] 585
((287*2)+12)
## [1] 586
#The alleles should be in column 585 and 586
popdata[1:10, ((287*2)+12-1):((287*2)+12)]
## V579 V580
## 1 G G
## 2 G G
## 3 G G
## 4 G G
## 5 G G
## 6 G G
## 7 T G
## 8 G G
## 9 G G
## 10 T G
#The alleles are G and T
#14. What is the frequency (number between 0 and 1) of each allele in the entire sample for SNP rs2291778?
table(popdata$V579, popdata$V580)
##
## 0 G T
## 0 9 0 0
## G 0 912 0
## T 0 238 25
# G allele frequency
((912*2)+238)/((912*2)+(25*2)+(238*2))
## [1] 0.8774468
## G allele frequency is 0.88
# Calc T allele frequency
1-.88
## [1] 0.12
#or
((25*2)+238)/((912*2)+(25*2)+(238*2))
## [1] 0.1225532
## T allele frequency is 0.12
#15. What is the frequency of the each allele of SNP rs2291778 among ‘MEX’ (Mexicans)?
table(popdata$V579, popdata$V580, popdata$population)
## , , = ASW
##
##
## 0 G T
## 0 3 0 0
## G 0 72 0
## T 0 8 0
##
## , , = CEU
##
##
## 0 G T
## 0 1 0 0
## G 0 134 0
## T 0 29 1
##
## , , = CHB
##
##
## 0 G T
## 0 1 0 0
## G 0 44 0
## T 0 36 3
##
## , , = CHD
##
##
## 0 G T
## 0 0 0 0
## G 0 49 0
## T 0 28 8
##
## , , = GIH
##
##
## 0 G T
## 0 0 0 0
## G 0 84 0
## T 0 4 0
##
## , , = JPT
##
##
## 0 G T
## 0 2 0 0
## G 0 39 0
## T 0 38 7
##
## , , = LWK
##
##
## 0 G T
## 0 1 0 0
## G 0 84 0
## T 0 5 0
##
## , , = MEX
##
##
## 0 G T
## 0 1 0 0
## G 0 63 0
## T 0 11 2
##
## , , = MKK
##
##
## 0 G T
## 0 0 0 0
## G 0 123 0
## T 0 44 4
##
## , , = TSI
##
##
## 0 G T
## 0 0 0 0
## G 0 73 0
## T 0 15 0
##
## , , = YRI
##
##
## 0 G T
## 0 0 0 0
## G 0 147 0
## T 0 20 0
## , , = MEX
## 0 G T
## 0 1 0 0
## G 0 63 0
## T 0 11 2
# MEX G allele
((63*2)+11)/((63*2)+(2*2)+(11*2))
## [1] 0.9013158
## G allele frequency is 0.90 in MEX population
# MEX T allele
1-0.90
## [1] 0.1
((2*2)+11)/((63*2)+(2*2)+(11*2))
## [1] 0.09868421
## T allele frequency is 0.10 in MEX population