For this assignment we had to use a dataset. The one I went with was labeled “fsnps”.
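# The factor-style counts in the summary below need string columns read as factors (not the default in R 4.0+)
fsnps <- read.csv("https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/gap/fsnps.csv", stringsAsFactors = TRUE)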
I stumbled a good deal on the first question, mostly because the dataset’s documentation left me unsure what some of its values mean.
The summary function doesn’t seem particularly useful here, but the homework asks for it, so here it is.
summary(fsnps)
##        X               id              y          site1.a1  site1.a2
##  Min.   :  1.0   Min.   :  1.0   Min.   :0.0000   A:351     A:184
##  1st Qu.:108.8   1st Qu.:108.8   1st Qu.:1.0000   C: 35     C:202
##  Median :216.5   Median :216.5   Median :1.0000   Z: 46     Z: 46
##  Mean   :216.5   Mean   :216.5   Mean   :0.9745
##  3rd Qu.:324.2   3rd Qu.:324.2   3rd Qu.:1.0000
##  Max.   :432.0   Max.   :432.0   Max.   :2.0000
##  site2.a1  site2.a2  site3.a1  site3.a2  site4.a1  site4.a2
##  C:364     C:217     G:277     G: 80     A:366     A:207
##  T: 15     T:162     T:103     T:300     G: 34     G:193
##  Z: 53     Z: 53     Z: 52     Z: 52     Z: 32     Z: 32
##
##
##
The HTML documentation for this dataset doesn’t seem to explain what “Z” stands for. It could mean that no allele was reported for a given record, or it could mark a deletion or an insertion. The documentation leaves this open.
fsnps_z_snp1test <- length(which(fsnps[4] == "Z"))
fsnps_z_snp1 <- length(which(fsnps[4] == "Z" & fsnps[5] == "Z"))
fsnps_z_snp1test == fsnps_z_snp1
## [1] TRUE
The above test shows that, at site 1, every record with Z as its first allele also has Z as its second. Based on my own (self-taught) knowledge of DNA, I take this to mean that Z marks an SNP that the DNA test failed to record. Here’s why: of the 432 records, 46 have a Z at this site, and in every one of them the Z appears in both allele columns. That makes each one either a double deletion, a double insertion… or, far more likely, a call the test simply did not pick up. Deletions are typically recorded with a minus (-) symbol and insertions by the string of bases that were inserted, so given the lack of clarification in the dataset’s documentation, the safest assumption is that Z stands for missing data, which can happen even when two people take the exact same DNA test. I only tested site 1 directly, but the matching Z counts in the summary for the other sites (53, 52, and 32 in both allele columns) are consistent with the same pattern.
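The same pairing check can be run over every site rather than just the first. What follows is only a minimal sketch on a made-up stand-in data frame (`toy`, with hypothetical values and only two sites), so the real fsnps columns would need to be substituted in. It checks both directions, that is, that Z never shows up in only one of a site's two allele columns.

```r
# Toy stand-in for fsnps (hypothetical values, two sites only)
toy <- data.frame(
  site1.a1 = c("A", "Z", "C", "A"),
  site1.a2 = c("C", "Z", "C", "A"),
  site2.a1 = c("C", "C", "Z", "T"),
  site2.a2 = c("T", "C", "Z", "T"),
  stringsAsFactors = TRUE
)

# For each site: TRUE when "Z" appears in both allele columns or in neither,
# for every row
sites <- c("site1", "site2")
z_paired <- sapply(sites, function(s) {
  a1_is_z <- toy[[paste0(s, ".a1")]] == "Z"
  a2_is_z <- toy[[paste0(s, ".a2")]] == "Z"
  all(a1_is_z == a2_is_z)
})
print(z_paired)
```

For the real data, `sites` would presumably be `paste0("site", 1:4)` and `toy` would be `fsnps`.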
Next comes the part of the assignment that uses mean. Here, we calculate the average of the column “y” in the fsnps dataset.
mean(fsnps$y)
## [1] 0.974537
The document describing the dataset says that 0 is the control and 1 is the case, but there are instances of the number 2 in this column.
length(which(fsnps$y == 2))
## [1] 35
There are 35 of them, in fact. Are these cases where two risk alleles were present, or something else entirely? The documentation does not go into much depth.
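Counting one value at a time gets tedious, and `table()` gives all of the counts at once. This is a small sketch on a made-up vector (`y` here is hypothetical, standing in for `fsnps$y`). It also shows why the mean of a column coded 0/1/2 should not be read as a proportion of cases: the 2s pull it upward.

```r
# Hypothetical status vector standing in for fsnps$y
y <- c(0, 1, 1, 2, 1, 0, 2, 1)

# Count how often each coded value occurs
y_counts <- table(y)
print(y_counts)

# Share of records at each coded value
print(prop.table(y_counts))

# The mean mixes all three codes together
print(mean(y))
```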
More telling may be the mean of the two alleles for each site. Looking at the first SNP, we have the following:
# Code every site1 allele as 0 (A) or 1 (C); Z becomes NA
snp_site1 <- c(
  rep(NA, length(which(fsnps$site1.a1 == "Z"))),
  rep(0, length(which(fsnps$site1.a1 == "A"))),
  rep(1, length(which(fsnps$site1.a1 == "C"))),
  rep(NA, length(which(fsnps$site1.a2 == "Z"))),
  rep(0, length(which(fsnps$site1.a2 == "A"))),
  rep(1, length(which(fsnps$site1.a2 == "C")))
)
mean(snp_site1, na.rm=TRUE)
## [1] 0.3069948
Because A is coded as 0 and C as 1, this mean is the proportion of C alleles among the recorded alleles. About 31% of them are C, so the alleles lean clearly towards A.
median(snp_site1, na.rm=TRUE)
## [1] 0
This is further confirmed by the median of 0, which corresponds to the A allele. With only two coded values, a median of 0 means that more than half of the recorded alleles are A, so A is the most common allele at the SNP labeled site1 in the dataset.
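The 0/1 coding can be skipped by pooling the two allele columns and asking directly what share of the recorded alleles is C. This is a sketch on made-up vectors (`a1` and `a2` are hypothetical stand-ins for `fsnps$site1.a1` and `fsnps$site1.a2`). The `as.character()` calls are there because pooling factor columns with `c()` behaves differently across R versions.

```r
# Hypothetical allele columns standing in for site1.a1 and site1.a2
a1 <- c("A", "A", "C", "Z", "A")
a2 <- c("C", "A", "C", "Z", "A")

# Pool both columns into one vector and treat "Z" as missing
alleles <- c(as.character(a1), as.character(a2))
alleles[alleles == "Z"] <- NA

# Proportion of C among the recorded (non-missing) alleles
c_freq <- mean(alleles == "C", na.rm = TRUE)
print(c_freq)

# Full breakdown, including the missing calls
print(table(alleles, useNA = "ifany"))
```

The proportion computed this way is the same quantity as the mean of the 0/1 vector, since the mean of a 0/1 vector is the share of 1s.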
new_df <- data.frame(
fsnps$id[25:75],
fsnps$site1.a1[25:75],
fsnps$site1.a2[25:75]
)
names(new_df) <- c("ID", "First Allele", "Second Allele")
summary(new_df)
##        ID        First Allele  Second Allele
##  Min.   :25.0    A:42          A:21
##  1st Qu.:37.5    C: 5          C:26
##  Median :50.0    Z: 4          Z: 4
##  Mean   :50.0
##  3rd Qu.:62.5
##  Max.   :75.0
# Same 0 (A) / 1 (C) coding as before, applied to the subset; Z becomes NA
new_df_snp <- c(
  rep(NA, length(which(new_df$"First Allele" == "Z"))),
  rep(0, length(which(new_df$"First Allele" == "A"))),
  rep(1, length(which(new_df$"First Allele" == "C"))),
  rep(NA, length(which(new_df$"Second Allele" == "Z"))),
  rep(0, length(which(new_df$"Second Allele" == "A"))),
  rep(1, length(which(new_df$"Second Allele" == "C")))
)
As can be seen, only 51 records (IDs 25 through 75) from the fsnps dataset were used this time. Previously, the mean was approximately 0.307. This time, the mean was
mean(new_df_snp, na.rm=TRUE)
## [1] 0.3297872
which indicates a slightly higher frequency of the C allele in this subset than in the full dataset, but not enough for it to be the majority of alleles. As for the median, it was 0 in the overall dataset and
median(new_df_snp, na.rm=TRUE)
## [1] 0
0 in this smaller set, indicating that the A allele is still the more plentiful of the two at this particular SNP.
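The subset itself can also be built with a single bracket index instead of assembling a data frame column by column. This is a sketch on a made-up 100-row stand-in (`toy`, with hypothetical values). Note that `25:75` is inclusive at both ends, so it picks up 51 rows.

```r
# Toy stand-in for fsnps with 100 rows (hypothetical values)
toy <- data.frame(
  id = 1:100,
  site1.a1 = rep(c("A", "C", "Z", "A"), 25),
  site1.a2 = rep(c("C", "C", "Z", "A"), 25),
  stringsAsFactors = TRUE
)

# Select rows 25 through 75 and the three columns of interest in one step
rows <- 25:75
sub <- toy[rows, c("id", "site1.a1", "site1.a2")]
names(sub) <- c("ID", "First Allele", "Second Allele")

print(nrow(sub))      # number of rows selected
print(range(sub$ID))  # first and last ID in the subset
```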
levels(new_df$"First Allele") <- c(levels(new_df$"First Allele"), "Cytosine")
new_df$"First Allele"[new_df$"First Allele" == "C"] <- "Cytosine"
levels(new_df$"First Allele") <- c(levels(new_df$"First Allele"), "Adenine")
new_df$"First Allele"[new_df$"First Allele" == "A"] <- "Adenine"
levels(new_df$"Second Allele") <- c(levels(new_df$"Second Allele"), "Cytosine")
new_df$"Second Allele"[new_df$"Second Allele" == "C"] <- "Cytosine"
levels(new_df$"Second Allele") <- c(levels(new_df$"Second Allele"), "Adenine")
new_df$"Second Allele"[new_df$"Second Allele" == "A"] <- "Adenine"
new_df[new_df == "Z"] <- NA
print.data.frame(new_df)
## ID First Allele Second Allele
## 1 25 Cytosine Cytosine
## 2 26 Cytosine Cytosine
## 3 27 <NA> <NA>
## 4 28 Adenine Adenine
## 5 29 Adenine Cytosine
## 6 30 Adenine Cytosine
## 7 31 Adenine Adenine
## 8 32 Adenine Adenine
## 9 33 Adenine Cytosine
## 10 34 Adenine Adenine
## 11 35 Cytosine Cytosine
## 12 36 Adenine Cytosine
## 13 37 Adenine Cytosine
## 14 38 Adenine Adenine
## 15 39 Adenine Cytosine
## 16 40 Adenine Adenine
## 17 41 Adenine Cytosine
## 18 42 Adenine Adenine
## 19 43 Adenine Adenine
## 20 44 Adenine Cytosine
## 21 45 Adenine Cytosine
## 22 46 Adenine Cytosine
## 23 47 Adenine Cytosine
## 24 48 Adenine Adenine
## 25 49 Adenine Adenine
## 26 50 <NA> <NA>
## 27 51 Adenine Cytosine
## 28 52 Adenine Adenine
## 29 53 Adenine Adenine
## 30 54 Adenine Adenine
## 31 55 Adenine Adenine
## 32 56 <NA> <NA>
## 33 57 Cytosine Cytosine
## 34 58 Adenine Adenine
## 35 59 Adenine Cytosine
## 36 60 Adenine Adenine
## 37 61 Adenine Adenine
## 38 62 Adenine Adenine
## 39 63 Adenine Adenine
## 40 64 <NA> <NA>
## 41 65 Adenine Cytosine
## 42 66 Adenine Adenine
## 43 67 Adenine Cytosine
## 44 68 Adenine Cytosine
## 45 69 Cytosine Cytosine
## 46 70 Adenine Cytosine
## 47 71 Adenine Adenine
## 48 72 Adenine Cytosine
## 49 73 Adenine Cytosine
## 50 74 Adenine Cytosine
## 51 75 Adenine Cytosine
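The relabeling above works, but it leaves the old levels (A, C, Z) attached to the factor as unused levels. One alternative, sketched here on a made-up factor (`first` is hypothetical, standing in for one allele column), is to rebuild the factor with `factor()`, naming only the levels to keep and the labels to give them. Any value not listed in `levels`, here Z, becomes NA in the same step.

```r
# Hypothetical allele column standing in for new_df$"First Allele"
first <- factor(c("A", "C", "Z", "A"))

# Keep only A and C, relabel them, and let the unlisted "Z" fall to NA
recoded <- factor(first,
                  levels = c("A", "C"),
                  labels = c("Adenine", "Cytosine"))
print(recoded)
print(levels(recoded))
```

If the step-by-step version is kept instead, `droplevels()` should clear out the leftover unused levels afterwards.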