We load the data into R, convert the text to lower case, and remove punctuation.
library(rjson)
library(ggplot2)
library(tm)
library(SnowballC)
library(wordcloud)
knitr::opts_chunk$set(echo=FALSE, warning=FALSE, message=FALSE, fig.align='center')
stopwords <- readLines("stopwords_new.txt")
raw_data <- fromJSON(file = "set2_notannotated.json")
split_essays <- lapply(raw_data[[1]], function(x) x$sentences)
whole_essays <- sapply(split_essays, function(x) paste(x, collapse = " "))
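clean_text <- function(x) {
  # Converts to lower case and collapses runs of punctuation and blanks into a single space
  result <- tolower(x)
  result <- gsub("[[:punct:][:blank:]]+", " ", result)
  result  # return explicitly; a trailing assignment would only return invisibly
}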
whole_essays <- clean_text(whole_essays)
version <- sapply(raw_data[[1]], function(x) x$version)
#whole_essays <- whole_essays[version == "Final"]
cat("We have",length(whole_essays),"essays", "\n")
## We have 128 essays
cat("Below is an example of an essay\n\n")
## Below is an example of an essay
cat(whole_essays[5])
## because of mutation and inheritance over the generation across time due to mutation in each genome after each replication process the sequences generated will be different from their ancestors therefore after each process of replication across time more mutation is introduce thus causing a greater genetic diversity from the previous generation as each of the new genome will be different from one another phylogenetic analysis will shows us the genetic distance between and evolution relationship among the organism genome the branch length shows the time line of the evolution between the different genomes and the vertical line shows the splitting of the evolution natural selection is just a secondary factors and not a primary driving force for the molecular evolution in generating genetic diversity
Now we construct the document-term matrix. This will give us the vocabulary for the draft and the final versions.
Below is a sample of the term vector for one essay:
##     allelic    analysis       based      caused     disease   diversity
##           0           1           0           0           0           2
##    division  estimating      famous        gene   generated  generating
##           0           0           0           0           1           1
##  generation     greater intermixing   introduce      length    mutation
##           2           1           0           1           1           3
##    organism polypeptide     process    profound replication responsible
##           1           0           2           0           2           0
##   secondary   sequences   splitting      taking        time   variation
##           1           1           1           0           3           0
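The construction of term vectors can be sketched in base R. This is only an illustration under stated assumptions, not the code used for the report (which loads the tm package): the two toy essays, the short stop-word list, and the helper `tokenize` are all hypothetical.

```r
# Hypothetical toy inputs; the report's essays and stop-word file are not reproduced here.
essays <- c("mutation causes diversity and mutation is random",
            "selection reduces the diversity of sequences")
stop_words <- c("and", "is", "the", "of")

# Split a cleaned essay on single spaces, then drop empty tokens and stop words.
tokenize <- function(text, stop_words) {
  tokens <- strsplit(text, " ", fixed = TRUE)[[1]]
  tokens[nzchar(tokens) & !(tokens %in% stop_words)]
}

tokens <- lapply(essays, tokenize, stop_words = stop_words)
vocabulary <- sort(unique(unlist(tokens)))

# One row per essay, one column per vocabulary term; entries are term counts.
dtm <- t(vapply(tokens,
                function(tok) as.integer(table(factor(tok, levels = vocabulary))),
                integer(length(vocabulary))))
colnames(dtm) <- vocabulary
```

Each row of `dtm` is then a term vector of the kind sampled above, with zeros for vocabulary terms that the essay does not use.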
Word cloud for the draft version
Word cloud for the final version
We calculate a non-symmetric distance between term vectors as \[ d(t_1,t_2)=1-\frac{t_1\cdot t_2}{\|t_1\|^2}, \] where \(t_1\) and \(t_2\) are vectors of 0’s and 1’s, i.e., we record whether each term appears rather than how many times it appears. For such binary vectors \(\|t_1\|^2\) is the number of distinct terms in \(t_1\), so \(d(t_1,t_2)\) is the fraction of terms that appear in \(t_1\) but not in \(t_2\).
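A minimal sketch of this distance in R follows. The function name `term_distance` and the toy vectors are hypothetical; the denominator is taken to be the number of distinct terms in the first vector, which is what makes the result the fraction described above.

```r
# Asymmetric term distance: share of the terms present in t1 that are absent from t2.
# t1 and t2 are count vectors over the same vocabulary; both are binarized first.
term_distance <- function(t1, t2) {
  b1 <- as.numeric(t1 > 0)
  b2 <- as.numeric(t2 > 0)
  1 - sum(b1 * b2) / sum(b1)
}

# Hypothetical term vectors: t1 uses 3 terms, t2 uses 2 terms, 1 term is shared.
t1 <- c(mutation = 2, diversity = 1, selection = 0, drift = 1)
t2 <- c(mutation = 1, diversity = 0, selection = 3, drift = 0)

d12 <- term_distance(t1, t2)  # 1 - 1/3
d21 <- term_distance(t2, t1)  # 1 - 1/2
```

The two directions differ because each is normalized by the vocabulary size of its own first argument, so a short essay contained in a long one is close to it, but not the other way round.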
## First essay is
## [1] "molecular evolution itself is the process of cellular composition change across generations and time the fact that the composition is not static but dynamic could be because of various factors such as the external environment these cellular composition change can occur due to random mutations genetic drifts and many more it is in random which will then be selected for fitness to be an adapted trait there will be a lot of mutations and changes that occur randomly across species and organisms these high probability of variety in mutations that occur is the reason for genetic diversity in time other contributing factors such as the environment changes and unexpected phenomena may lead to more chances of mutations to arise contributing further to generate more diversity "
## Second essay is
## [1] "evolution in the absence of selection can produce great diversity due to the basic concepts of inheritable mutations mutations in germline cells will be eventually passed to the next generation and mutations accumulate across multiple generations resulting in increasing diversity over time a large range of the different types of mutations would also contribute to diversity where a particular sequence may experience deletions another may experience chromosomal translocations the theoretically calculated mutation rate is 10 9 mutations replication which translates into 3 mutations per replication of the human genome this would contribute to a high genetic diversity when taking into account our long evolutionary history additionally considering that the human dna polymerase has proofreading abilities this diversity is much more expanded in other genomes that lack proofreading functions during replication such as the genomes of rna viruses furthermore random homologous recombination events during meiosis of germ cells also increases variation in the genome on the macro scale random fertilization events random mating and intermixing between different species with successful offspring can also increase genetic variation phylogenetic analysis of diverse genetic sequences provides evolutionary relationships since these differences in sequences due to mutations can be used to distinguish between species that evolved differently due to unique mutations substitutions and indels can be used as markers to differentiate species when performing sequence analysis to generate phylogenetic trees that allows a better understanding of how human evolution came about phylogenetic analysis of diverse genetic sequences also provide information into how certain genetic diseases came about allowing the development of therapeutic treatments that better target these diseases "
##
## The term distance from t1 to t2 is
## 0.7560976
##
## The term distance from t2 to t1 is
## 0.9082569
Document-term distances. We set the diagonal to infinity so that an essay is never reported as its own nearest neighbour. Below are the distances between the first five draft essays:
##           1         2         3         4         5
## 1       Inf 0.7560976 0.7073171 0.7560976 0.8292683
## 2 0.9082569       Inf 0.8532110 0.8440367 0.8715596
## 3 0.8333333 0.7777778       Inf 0.8194444 0.7777778
## 4 0.8550725 0.7536232 0.8115942       Inf 0.8550725
## 5 0.8250000 0.6500000 0.6000000 0.7500000       Inf
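A matrix of this kind can be computed for all pairs at once. The sketch below assumes a binary document-term matrix; `binary_dtm` and its values are hypothetical, and the report's own implementation may differ.

```r
# Hypothetical binary document-term matrix: 3 essays (rows), 4 terms (columns).
binary_dtm <- rbind(c(1, 1, 0, 1),
                    c(1, 0, 1, 0),
                    c(1, 1, 0, 0))

# shared[i, j] = number of terms that essays i and j have in common.
shared <- binary_dtm %*% t(binary_dtm)

# Number of distinct terms used by each essay.
n_terms <- rowSums(binary_dtm)

# Recycling divides row i by n_terms[i], giving the distance from essay i to essay j.
dist_matrix <- 1 - shared / n_terms

# Exclude self-comparisons.
diag(dist_matrix) <- Inf
```

Row i of `dist_matrix` holds the distances from essay i to every other essay, so the matrix is not symmetric in general.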
Here we try to find essays that may be directly copied from each other.
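One way to locate the closest pair is to take the smallest entry of the distance matrix and convert its position back to a row and a column. The matrix below is a hypothetical example, not data from the report.

```r
# Hypothetical asymmetric distance matrix with an infinite diagonal.
dist_matrix <- rbind(c(Inf, 0.8, 0.7),
                     c(0.9, Inf, 0.4),
                     c(0.6, 0.5, Inf))

# Position of the smallest entry; ties resolve to the first one in column-major order.
nearest <- arrayInd(which.min(dist_matrix), dim(dist_matrix))
from_essay <- nearest[1, 1]
to_essay <- nearest[1, 2]
min_distance <- dist_matrix[from_essay, to_essay]
```

Because the diagonal is infinite, the minimum can never be an essay paired with itself.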
Nearest entries among the draft essays
## The distance between nearest essays is 0.4516129
## Nearest essays are
## [1] "mutations can cause the nucleotide sequences to differ from generation to generation as mutations are random and are not the same for all of the genomic sequences this can lead to a great diversity of nucleotide sequences selection will actually result in similar mutations and nucleotide sequences between species and thus selection can reduce the diversity of nucleotide sequences the phylogenetic analysis of the diverse genetic sequences can estimate the evolutionary relationships in phylogenetic analysis the sequence of a common gene or protein can be used to assess the evolutionary relationship of species "
## [2] "evolution of genomic sequences in the absence of selection can produce great diversity in the nucleotide sequences due to silent mutations that do accumulate as the generation time increases this is because the silent mutations can accumulate and be passed on to gametes and to offspring causing an accumulation of mutations leading to missense mutations that are selective for advantageous phenotypic traits inheritance from parent to offspring random mutations can accumulate in the absence of selection which will lead to diversity in nucleotide sequences as well as phenotypic traits this does not mean that there are any form of selection pressures however naturally organisms will select for phenotypes that are advantageous the phylogenetic analysis of the diverse genetic sequences provided will give an ancestral sequence analysis which can help determine an estimate of how the organism evolved and how closely related each species is to another without the need for phenotypic trait analysis one can determine its closeness and genetic relatedness from the analysis to gain a better understanding of evolution processes with without the presence of selection moreover phylogenetic analysis using different types of analysis will use different parameters and analysis to gain different trees and therefore a wide variety of trees to study to find the one with the best fit to the hypothesis of the organism s evolution from its root "
Nearest entries among the final essays
## The distance between nearest essays is 0.4054054
## Nearest essays are
## [1] "the processes of molecular evolution discussed here are the random copying errors by dna polymerase during dna replication prior to cell division which go unrepaired the result of the unrepaired copying errors in dna is dna mutations which are inherited from one generation to another the mutations can be substitution insertion and deletions in the dna which create genetic diversity such as in single nucleotide polymorphisms snps differences in snps between similar looking dna sequences can be analysed to determine the genetic distance or how closely related the sequences are "
## [2] "first off we must think about the main force that drive molecular evolution mutations the effect of advantageous and deleterious deletions are obvious however neutral mutations would be the main source of genetic diversity within us and amongst us we know that during cell division passing down our genetic material is not a foolproof process there will be mistakes made these mistakes or mutations that are introduced due to the shortcomings of our dna polymerase are then passed down and accumulated within our daughter cells these random mutations will result in an altered genotype although it may not always manifest itself as a distinct phenotype however these alterations in our genes will inadvertently alter our proteins over time and generations indeed from the doublet to the triplet code and even looking at our single nucleotide polymorphisms we realise now that the genetic diversity we know originates from the fact that we pass down our genetic material during cell division inheritance and that random mutations occur due to errors incurred during this process there are various kinds of mutations yet the vast majority are neutral mutations this is why each of us has a unique signature of polymorphic single nucleotide polymorphisms snps these snps itself give rise to the huge diversity that we see even amongst ourselves today if we exclude selection from this discussion indeed we would be able to see that genetic diversity itself can be easily generated as we are our father s and mother s children and their unique polymorphic snp signatures are passed down to us in addition we are exposed to the external environment which may further alter our unique genetic signature when taken together these single nucleotide mutations from multiple sources result in the generation of an even more unique signatures amongst us this kind of variation would also result in molecular evolution variations we now understand that mutations may result in a different genetic 
sequence but with the same phenotype or even similar but not the same this allows us to apply this view to not only the genetic sequence but even epigenetics some of us may have those minute sequence differences which may change how certain proteins work and these neutral changes by itself then gives rise to the phenotypical molecular variation that we now see even amongst ourselves epigenetics long non coding rna microrna biology and even splicing are affected due to the presence of certain snps or mutations within the genome and these contribute to the molecular variation that we might see today even though there might be little genetic distance with a high relatedness of sequence each protein may have a slightly altered property even with just one nucleotide change in its sequence "
Here we try to find essays whose final version does not differ from the draft version. To do that, we use a different metric. Instead of counting terms that appear in essay 1 but not in essay 2, we simply take the square of the Euclidean distance \[ (w_1-w_2)\cdot (w_1-w_2), \] where \(w_1\) and \(w_2\) are raw term vectors (with counts of each term, not merely indicators of whether that term is present).
We calculate distances from each draft essay to each final essay (note that the mapping to authors is different for draft and final essays, and hence we do not expect the diagonal to be zero here):
##    65  66  67   68  69
## 1 465 542 361 1446 600
## 2 510  93 490 1309 611
## 3 495 568 365 1258 592
## 4 538 563 258 1529 533
## 5 409 470 375 1242 452
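The draft-to-final distances can be sketched as follows, assuming both sets of essays are encoded as count matrices over one shared vocabulary. The matrices `draft_counts` and `final_counts` are hypothetical toy data, and the expansion of the squared norm is just one convenient way to avoid an explicit double loop.

```r
# Hypothetical count matrices over a shared vocabulary (rows = essays, columns = terms).
draft_counts <- rbind(c(2, 0, 1),
                      c(0, 3, 1))
final_counts <- rbind(c(0, 3, 1),
                      c(2, 1, 1),
                      c(5, 0, 0))

# cross_dist[i, j] = squared Euclidean distance between draft i and final j,
# using (a - b).(a - b) = a.a + b.b - 2 a.b
cross_dist <- outer(rowSums(draft_counts^2), rowSums(final_counts^2), "+") -
  2 * draft_counts %*% t(final_counts)

# For each draft essay: index of the closest final essay and the distance to it.
closest_final <- apply(cross_dist, 1, which.min)
closest_dist <- apply(cross_dist, 1, min)
```

A distance of zero means the two essays have identical term counts, which is the signature of an essay that was not revised.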
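The least changed essay is shown below. Its distance is 0 even though the final version inserts the word "of" ("the passing on of these differences"), presumably because "of" is dropped when the term vectors are built.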
## The distance between nearest essays is 0
## Nearest essays are
## [1] "genetic diversity can be loosely defined as the generation of differences of the genetic code in essence the generation of genetic code differences comes from changes in genetic code arising from mutation mutation results in changes in the sequence of the genetic code that takes the form of nucleotide sequences with different bases adenine a thymine t cytosine c and guanine g changes in such sequences could result in no change silent mutation in the amino acid sequence they code for 3 codons for 1 amino acid or result in both amino acid sequence and the protein product resulting from mutations like frameshift mutation nonsense and missense mutation genetic diversity is attributed to the accumulation of such mutations over time and then the passing on these differences to the next generation through inheritance "
## [2] "genetic diversity can be loosely defined as the generation of differences of the genetic code in essence the generation of genetic code differences comes from changes in genetic code arising from mutation mutation results in changes in the sequence of the genetic code that takes the form of nucleotide sequences with different bases adenine a thymine t cytosine c and guanine g changes in such sequences could result in no change silent mutation in the amino acid sequence they code for 3 codons for 1 amino acid or result in both amino acid sequence and the protein product resulting from mutations like frameshift mutation nonsense and missense mutation genetic diversity is attributed to the accumulation of such mutations over time and then the passing on of these differences to the next generation through inheritance "