DNA Analysis

We import two data sets from Kaggle: the training and test sets of a DNA data set. The x and y variables appear to be strongly correlated, so we expect \(R^2\) to be close to 1.

library(ggplot2) 
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
test <- read.csv('https://raw.githubusercontent.com/joewarner89/DATA-605-Computational-Mathematics/main/Assignment/week%2012/test.csv')

train <- read.csv('https://raw.githubusercontent.com/joewarner89/DATA-605-Computational-Mathematics/main/Assignment/week%2012/train.csv')
# Check for NA and missing values
# is.na() returns a logical matrix that is TRUE wherever a value is missing.
numberOfNA = sum(is.na(train))
if(numberOfNA > 0) {
  cat('Number of missing values found: ', numberOfNA)
  cat('\nRemoving missing values...')
  train = train[complete.cases(train), ]
}
## Number of missing values found:  1
## Removing missing values...
cor(train$x,train$y)
## [1] 0.9953399
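
For a simple linear regression of y on x with an intercept, \(R^2\) equals the squared sample correlation between the two variables. Under that assumption (one predictor, ordinary least squares), the correlation printed above already tells us roughly what \(R^2\) to expect:

\[
R^2 = r_{xy}^2 = 0.9953399^2 \approx 0.9907
\]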

Simple Linear Regression

The DNA variables x and y are strongly correlated (correlation of about 0.995). The pairs plot from kdepairs() is used to inspect how each variable is distributed.

require(ResourceSelection)
## Loading required package: ResourceSelection
## ResourceSelection 0.3-6   2023-06-27
#summarize the data set
kdepairs(train) 
## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

#creating the model 
dna_lm <- lm(y~., data = train) 
summary(dna_lm)
## 
## Call:
## lm(formula = y ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.1523 -2.0179  0.0325  1.8573  8.9132 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.107265   0.212170  -0.506    0.613    
## x            1.000656   0.003672 272.510   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.809 on 697 degrees of freedom
## Multiple R-squared:  0.9907, Adjusted R-squared:  0.9907 
## F-statistic: 7.426e+04 on 1 and 697 DF,  p-value: < 2.2e-16
par(ask=F)
par(mfrow=c(2,2))
plot(dna_lm)

hist(dna_lm$residuals)

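The model achieves \(R^2 = 0.9907\), and the slope for x has a very small p-value (< 2e-16), so x is a statistically significant predictor of y. The intercept is not significantly different from zero (p = 0.613) and the slope is close to 1. We can use this model to predict y from x in the DNA sample. The four diagnostic plots produced by plot(dna_lm) suggest a well-behaved fit.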
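
Since a test set was also loaded, one way to check the claim that the model can predict y from x is to score it on held-out data. The following is a minimal sketch, not part of the original analysis: `evaluate_holdout` is a hypothetical helper, and the `demo_train` / `demo_test` data frames are synthetic stand-ins generated only so the example runs on its own. It assumes the held-out data has the same `x` and `y` columns as the training data.

```r
# Hypothetical helper (not from the original analysis): out-of-sample
# RMSE and R^2 for a fitted lm object on a held-out data frame.
evaluate_holdout <- function(model, newdata, response = "y") {
  # Drop incomplete rows, mirroring the cleaning applied to the training set
  newdata <- newdata[complete.cases(newdata), , drop = FALSE]
  pred <- predict(model, newdata = newdata)
  obs <- newdata[[response]]
  rmse <- sqrt(mean((obs - pred)^2))
  r2 <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
  c(rmse = rmse, r2 = r2)
}

# Synthetic stand-in data (an assumption for illustration only):
# y is roughly equal to x plus noise.
set.seed(1)
demo_train <- data.frame(x = runif(200, 0, 100))
demo_train$y <- demo_train$x + rnorm(200, sd = 3)
demo_test <- data.frame(x = runif(100, 0, 100))
demo_test$y <- demo_test$x + rnorm(100, sd = 3)

demo_lm <- lm(y ~ x, data = demo_train)
metrics <- evaluate_holdout(demo_lm, demo_test)
print(metrics)
```

With the objects from this analysis, the equivalent call would be `evaluate_holdout(dna_lm, test)`, provided `test` really does contain both `x` and `y`. A held-out \(R^2\) close to the training value would support using the model for prediction.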