Note on the assignment

For this assignment we had to use a dataset. The one I went with was labeled “fsnps”.

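# stringsAsFactors = TRUE keeps the allele columns as factors on R 4.0 and later, which the output below relies on.
fsnps <- read.csv("https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/gap/fsnps.csv", stringsAsFactors = TRUE)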

My first question involved a good deal of stumbling, because the dataset’s documentation left me unsure how to interpret it.

Use the summary function to gain an overview of the data set. Then display the mean and median for at least two attributes.

The summary function doesn’t seem particularly useful here, but the homework says to, so I shall.

summary(fsnps)
##        X               id              y          site1.a1 site1.a2
##  Min.   :  1.0   Min.   :  1.0   Min.   :0.0000   A:351    A:184   
##  1st Qu.:108.8   1st Qu.:108.8   1st Qu.:1.0000   C: 35    C:202   
##  Median :216.5   Median :216.5   Median :1.0000   Z: 46    Z: 46   
##  Mean   :216.5   Mean   :216.5   Mean   :0.9745                    
##  3rd Qu.:324.2   3rd Qu.:324.2   3rd Qu.:1.0000                    
##  Max.   :432.0   Max.   :432.0   Max.   :2.0000                    
##  site2.a1 site2.a2 site3.a1 site3.a2 site4.a1 site4.a2
##  C:364    C:217    G:277    G: 80    A:366    A:207   
##  T: 15    T:162    T:103    T:300    G: 34    G:193   
##  Z: 53    Z: 53    Z: 52    Z: 52    Z: 32    Z: 32   
##                                                       
##                                                       
## 

The HTML documentation associated with this dataset doesn’t seem to explain what “Z” stands for. It could mean that no allele was reported for a given record, or that there was a deletion or an insertion. The documentation’s lack of clarity leaves me guessing.

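# Records with Z in the first allele column of site 1 (column 4)
fsnps_z_snp1test <- length(which(fsnps[4] == "Z"))
# Records with Z in both allele columns of site 1 (columns 4 and 5)
fsnps_z_snp1 <- length(which(fsnps[4] == "Z" & fsnps[5] == "Z"))
# Equal counts mean every Z in the first column is paired with a Z in the second
fsnps_z_snp1test == fsnps_z_snp1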
## [1] TRUE

The test above tells me that, at site 1, every record with a Z in the first allele column also has a Z in the second. Based on my own (self-taught) knowledge about DNA, I take this to mean that Z marks an SNP that the DNA test did not record. Here’s why: of the 432 records, 46 have a Z allele at this site, and the Z always occurs in both allele columns at once. That would mean either a double deletion, a double insertion… or, far more likely, that the test did not pick it up. Deletions are typically recorded with a minus (-) symbol and insertions by the string of alleles that were inserted. Given the lack of clarification in the documentation for this file, it is safest to assume that Z stands for data the DNA test did not provide, which can happen even if two people take the exact same DNA test.
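The same pairing check can be run for every site rather than only site 1. Below is a minimal sketch on a made-up stand-in data frame (`toy` and its values are hypothetical, as is the helper name `z_paired`); it assumes the `siteN.a1` / `siteN.a2` column naming seen in the summary output.

```r
# Toy stand-in for fsnps: two SNP sites, two allele columns each (hypothetical values).
toy <- data.frame(
  site1.a1 = c("A", "Z", "C", "A"),
  site1.a2 = c("C", "Z", "C", "A"),
  site2.a1 = c("C", "C", "Z", "T"),
  site2.a2 = c("T", "C", "Z", "T"),
  stringsAsFactors = TRUE
)

# For each site, check that "Z" appears in the first allele column
# exactly where it appears in the second.
sites <- paste0("site", 1:2)
z_paired <- sapply(sites, function(s) {
  a1 <- toy[[paste0(s, ".a1")]]
  a2 <- toy[[paste0(s, ".a2")]]
  all((a1 == "Z") == (a2 == "Z"))
})
z_paired

# A cross-tabulation shows the same pairing at a glance for one site.
table(toy$site1.a1, toy$site1.a2)
```

On the real data, `sites` would presumably run over `paste0("site", 1:4)` and `toy` would be replaced by `fsnps`.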

Moving on to the rest of the assignment: the mean. Here, we’re going to calculate the average of the column “y” in the fsnps dataset.

mean(fsnps$y)
## [1] 0.974537

The document describing the dataset says that 0 is the control and 1 is the case, but there are instances of the number 2 in this column.

length(which(fsnps$y == 2))
## [1] 35

There are 35 such records, in fact. Are these instances where two risk alleles were present, or something else entirely? The documentation provided does not go into much depth.
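A quick way to see every distinct value of y at once is `table()`. The sketch below does not read the file. It rebuilds a vector from the figures already reported: 432 records, 35 of them equal to 2, and a mean of 0.974537, which together imply 351 ones and 46 zeros. The name `y_reconstructed` is hypothetical, and the order of the values is made up.

```r
# Counts reconstructed from the reported figures, not read from the file:
# 432 records, 35 equal to 2, and a mean of 0.974537 (a sum of 421).
y_reconstructed <- rep(c(0, 1, 2), times = c(46, 351, 35))

# Frequency of each distinct value, then the mean as a consistency check.
table(y_reconstructed)
mean(y_reconstructed)
```

On the real data the equivalent call would be `table(fsnps$y)`.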

More telling may be the mean of the two alleles for each site. Looking at the first SNP, with A coded as 0, C coded as 1 and Z treated as missing, we have the following:

snp_site1 <- c(
  rep(NA, length(which(fsnps$site1.a1 == "Z"))),
  rep(0, length(which(fsnps$site1.a1 == "A"))),
  rep(1, length(which(fsnps$site1.a1 == "C"))),
  rep(NA, length(which(fsnps$site1.a2 == "Z"))),
  rep(0, length(which(fsnps$site1.a2 == "A"))),
  rep(1, length(which(fsnps$site1.a2 == "C")))
)
mean(snp_site1, na.rm=TRUE)
## [1] 0.3069948

Because A is coded as 0 and C as 1, this mean is the share of C among the recorded alleles, about 0.307. Since the value is closer to 0 than to 1, A is the more common allele at this site.

median(snp_site1, na.rm=TRUE)
## [1] 0

This is further confirmed by the median of 0, which corresponds to the A allele. It is safe to conclude that A is the most common allele at the SNP designated site1 in this dataset.
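The 0/1/NA coding can also be written as a lookup on the pooled allele columns instead of rebuilding the vector with `rep()`. This is only a sketch: the two input vectors are reconstructed from the counts in the summary output (their order is made up), and the names `code`, `pooled` and `snp_coded` are hypothetical.

```r
# Allele columns rebuilt from the counts in summary(fsnps); row order is made up.
site1_a1 <- rep(c("A", "C", "Z"), times = c(351, 35, 46))
site1_a2 <- rep(c("A", "C", "Z"), times = c(184, 202, 46))

# Pool both columns, then map A -> 0, C -> 1, Z -> NA with a named lookup vector.
# With factor columns, wrap each in as.character() first so the lookup
# goes by label rather than by integer code.
code <- c(A = 0, C = 1, Z = NA)
pooled <- c(site1_a1, site1_a2)
snp_coded <- unname(code[pooled])

mean(snp_coded, na.rm = TRUE)
median(snp_coded, na.rm = TRUE)
```

On these reconstructed counts the mean works out to 237 C alleles out of 772 recorded alleles, which matches the value reported above.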

Create a new data frame with a subset of the columns and rows. Make sure to rename it.

new_df <- data.frame(
  fsnps$id[25:75],
  fsnps$site1.a1[25:75],
  fsnps$site1.a2[25:75]
)

Create new column names for the new data frame.

names(new_df) <- c("ID", "First Allele", "Second Allele")
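For comparison, rows and columns can also be picked in one step with bracket indexing, which keeps the original column names until they are renamed. A minimal sketch on a made-up data frame (`toy`, `toy_subset` and all values are hypothetical):

```r
# Toy stand-in for fsnps (hypothetical values).
toy <- data.frame(
  id = 1:6,
  y = c(0, 1, 1, 2, 1, 0),
  site1.a1 = c("A", "A", "Z", "C", "A", "A"),
  site1.a2 = c("A", "C", "Z", "C", "C", "A")
)

# Pick rows 2 to 4 and three columns in a single step, then rename.
toy_subset <- toy[2:4, c("id", "site1.a1", "site1.a2")]
names(toy_subset) <- c("ID", "First Allele", "Second Allele")
rownames(toy_subset) <- NULL  # restart the row labels at 1
toy_subset
```

A range such as `2:4` includes both endpoints, so it selects three rows. By the same count, `25:75` selects 51.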

Use the summary function to create an overview of your new data frame. Then print the mean and median for the same two attributes. Please compare.

summary(new_df)
##        ID       First Allele Second Allele
##  Min.   :25.0   A:42         A:21         
##  1st Qu.:37.5   C: 5         C:26         
##  Median :50.0   Z: 4         Z: 4         
##  Mean   :50.0                             
##  3rd Qu.:62.5                             
##  Max.   :75.0

new_df_snp <- c(
  rep(NA, length(which(new_df$"First Allele" == "Z"))),
  rep(0, length(which(new_df$"First Allele" == "A"))),
  rep(1, length(which(new_df$"First Allele" == "C"))),
  rep(NA, length(which(new_df$"Second Allele" == "Z"))),
  rep(0, length(which(new_df$"Second Allele" == "A"))),
  rep(1, length(which(new_df$"Second Allele" == "C")))
)

As can be seen, only 51 records (rows 25 through 75) from the fsnps dataset were used this time. Previously, the mean was approximately 0.307. This time, the mean was

mean(new_df_snp, na.rm=TRUE)
## [1] 0.3297872

which indicates a slightly higher frequency of the C allele in this subset than in the full dataset, but not enough to make it the majority allele. As for the median, it was 0 in the overall dataset and

median(new_df_snp, na.rm=TRUE)
## [1] 0

0 in this smaller set, indicating that A is still the more plentiful of the two alleles at this particular SNP.
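To put the two means side by side, the calculation can be wrapped in a small helper. This is a sketch under assumptions: `c_allele_share` is a hypothetical name, and the inputs are rebuilt from the counts in the two summary tables rather than read from the data, so their order is made up.

```r
# Share of C among the recorded (non-Z) alleles, pooled over both allele columns.
c_allele_share <- function(a1, a2) {
  pooled <- c(as.character(a1), as.character(a2))
  pooled <- pooled[pooled != "Z"]
  mean(pooled == "C")
}

# Inputs rebuilt from the two summary() tables; row order is made up.
full_share <- c_allele_share(
  rep(c("A", "C", "Z"), times = c(351, 35, 46)),
  rep(c("A", "C", "Z"), times = c(184, 202, 46))
)
subset_share <- c_allele_share(
  rep(c("A", "C", "Z"), times = c(42, 5, 4)),
  rep(c("A", "C", "Z"), times = c(21, 26, 4))
)

round(c(full = full_share, subset = subset_share), 3)
```

Taking the mean of a logical vector gives the proportion of TRUE values, so no separate 0/1 coding step is needed here.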

For at least 3 values in a column please rename so that every value in that column is renamed.

# The allele columns are factors, so each new label is added as a level before it is assigned.
levels(new_df$"First Allele") <- c(levels(new_df$"First Allele"), "Cytosine")
new_df$"First Allele"[new_df$"First Allele" == "C"] <- "Cytosine"
levels(new_df$"First Allele") <- c(levels(new_df$"First Allele"), "Adenine")
new_df$"First Allele"[new_df$"First Allele" == "A"] <- "Adenine"

levels(new_df$"Second Allele") <- c(levels(new_df$"Second Allele"), "Cytosine")
new_df$"Second Allele"[new_df$"Second Allele" == "C"] <- "Cytosine"
levels(new_df$"Second Allele") <- c(levels(new_df$"Second Allele"), "Adenine")
new_df$"Second Allele"[new_df$"Second Allele" == "A"] <- "Adenine"

# Treat the unexplained Z allele as missing data.
new_df[new_df == "Z"] <- NA
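An alternative is to rename the factor levels themselves, which relabels every matching value at once and leaves no unused levels behind. The sketch below uses a made-up factor (`allele` and `lookup` are hypothetical names); assigning NA as a level's new name drops that level and turns its entries into missing values.

```r
# Toy factor standing in for one allele column (hypothetical values).
allele <- factor(c("A", "C", "Z", "A", "C"))

# Rename levels in place; a level mapped to NA is dropped and its
# entries become missing.
lookup <- c(A = "Adenine", C = "Cytosine", Z = NA)
levels(allele) <- lookup[levels(allele)]

allele
levels(allele)
```

The step-by-step approach above keeps the old "A", "C" and "Z" levels around as empty levels, which is harmless for printing but would show up as zero counts in `summary()`.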

Display enough rows to see examples of all of steps 1-5 above.

print.data.frame(new_df)
##    ID First Allele Second Allele
## 1  25     Cytosine      Cytosine
## 2  26     Cytosine      Cytosine
## 3  27         <NA>          <NA>
## 4  28      Adenine       Adenine
## 5  29      Adenine      Cytosine
## 6  30      Adenine      Cytosine
## 7  31      Adenine       Adenine
## 8  32      Adenine       Adenine
## 9  33      Adenine      Cytosine
## 10 34      Adenine       Adenine
## 11 35     Cytosine      Cytosine
## 12 36      Adenine      Cytosine
## 13 37      Adenine      Cytosine
## 14 38      Adenine       Adenine
## 15 39      Adenine      Cytosine
## 16 40      Adenine       Adenine
## 17 41      Adenine      Cytosine
## 18 42      Adenine       Adenine
## 19 43      Adenine       Adenine
## 20 44      Adenine      Cytosine
## 21 45      Adenine      Cytosine
## 22 46      Adenine      Cytosine
## 23 47      Adenine      Cytosine
## 24 48      Adenine       Adenine
## 25 49      Adenine       Adenine
## 26 50         <NA>          <NA>
## 27 51      Adenine      Cytosine
## 28 52      Adenine       Adenine
## 29 53      Adenine       Adenine
## 30 54      Adenine       Adenine
## 31 55      Adenine       Adenine
## 32 56         <NA>          <NA>
## 33 57     Cytosine      Cytosine
## 34 58      Adenine       Adenine
## 35 59      Adenine      Cytosine
## 36 60      Adenine       Adenine
## 37 61      Adenine       Adenine
## 38 62      Adenine       Adenine
## 39 63      Adenine       Adenine
## 40 64         <NA>          <NA>
## 41 65      Adenine      Cytosine
## 42 66      Adenine       Adenine
## 43 67      Adenine      Cytosine
## 44 68      Adenine      Cytosine
## 45 69     Cytosine      Cytosine
## 46 70      Adenine      Cytosine
## 47 71      Adenine       Adenine
## 48 72      Adenine      Cytosine
## 49 73      Adenine      Cytosine
## 50 74      Adenine      Cytosine
## 51 75      Adenine      Cytosine