We import two data sets from Kaggle: the training and test sets of a DNA data set. The x and y variables appear to be strongly correlated, so we expect the \(R^2\) of a linear model to be close to 1.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
test <- read.csv('https://raw.githubusercontent.com/joewarner89/DATA-605-Computational-Mathematics/main/Assignment/week%2012/test.csv')
train <- read.csv('https://raw.githubusercontent.com/joewarner89/DATA-605-Computational-Mathematics/main/Assignment/week%2012/train.csv')
# Check for missing values
# is.na() returns TRUE for each missing value, so sum() counts them.
numberOfNA <- sum(is.na(train))
if (numberOfNA > 0) {
  cat('Number of missing values found: ', numberOfNA)
  cat('\nRemoving missing values...')
  # Keep only the rows that have no missing value in any column
  train <- train[complete.cases(train), ]
}
## Number of missing values found: 1
## Removing missing values...
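The check above counts missing cells across the whole data frame. As a small illustration of the same idea, the sketch below uses a made-up data frame (`toy` is hypothetical and not part of the Kaggle data) to show how missing values can be counted per column and how many rows `complete.cases()` would drop.

```r
# Hypothetical toy data frame, used only to illustrate the NA checks
toy <- data.frame(x = c(1, 2, NA, 4),
                  y = c(1.1, NA, 3.2, 4.0))

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Number of rows that contain at least one missing value
n_incomplete <- sum(!complete.cases(toy))

# Keep only the fully observed rows
toy_clean <- toy[complete.cases(toy), ]
```

Counting per column makes it easier to see which variable is responsible for the rows being removed.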
cor(train$x,train$y)
## [1] 0.9953399
Both DNA strands, x and y, are strongly correlated, with a correlation coefficient of about 0.995. Both variables also appear to be well distributed.
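For a linear regression with a single predictor, \(R^2\) equals the square of the Pearson correlation, so the correlation above already tells us roughly what \(R^2\) to expect. The sketch below squares the reported correlation and then checks the identity on simulated data (`sim_x` and `sim_y` are made-up variables, not the DNA data).

```r
# Correlation reported by cor(train$x, train$y) above
r <- 0.9953399
expected_r2 <- round(r^2, 4)
expected_r2
# [1] 0.9907

# Check the identity R^2 = r^2 on simulated data (assumption: illustrative only)
set.seed(1)
sim_x <- runif(50)
sim_y <- 2 * sim_x + rnorm(50, sd = 0.1)
identity_holds <- isTRUE(all.equal(cor(sim_x, sim_y)^2,
                                   summary(lm(sim_y ~ sim_x))$r.squared))
```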
require(ResourceSelection)
## Loading required package: ResourceSelection
## ResourceSelection 0.3-6 2023-06-27
#summarize the data set
kdepairs(train)
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
#creating the model
dna_lm <- lm(y~., data = train)
summary(dna_lm)
##
## Call:
## lm(formula = y ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1523 -2.0179 0.0325 1.8573 8.9132
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.107265 0.212170 -0.506 0.613
## x 1.000656 0.003672 272.510 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.809 on 697 degrees of freedom
## Multiple R-squared: 0.9907, Adjusted R-squared: 0.9907
## F-statistic: 7.426e+04 on 1 and 697 DF, p-value: < 2.2e-16
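# Show the four standard diagnostic plots in a 2 x 2 grid without prompting
par(ask = FALSE)
par(mfrow = c(2, 2))
plot(dna_lm)
# Histogram of the residuals to check for approximate normality
hist(dna_lm$residuals)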
The model achieves \(R^2\) = 0.9907, and the coefficient on x is highly significant (p-value < 2e-16). We can therefore use this model to predict chromosome y from x in the DNA sample. All four diagnostic plots of the model suggest a good fit.
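One way to check how well such a model predicts is to score it on held-out data. The sketch below is only an illustration under stated assumptions: it uses simulated stand-ins (`sim_train`, `sim_test`, and `sim_lm` are hypothetical names) with columns x and y, rather than the Kaggle files, so that it runs on its own.

```r
# Simulated stand-ins for the training and test sets (assumption: columns x and y)
set.seed(605)
sim_train <- data.frame(x = runif(700, 0, 100))
sim_train$y <- sim_train$x + rnorm(700, sd = 3)
sim_test <- data.frame(x = runif(300, 0, 100))
sim_test$y <- sim_test$x + rnorm(300, sd = 3)

# Fit the simple linear model on the training data
sim_lm <- lm(y ~ x, data = sim_train)

# Predict on the held-out data
pred <- predict(sim_lm, newdata = sim_test)

# Root mean squared error and out-of-sample R^2
rmse <- sqrt(mean((sim_test$y - pred)^2))
r2_test <- 1 - sum((sim_test$y - pred)^2) /
  sum((sim_test$y - mean(sim_test$y))^2)
```

With the real data, the same pattern would presumably be `predict(dna_lm, newdata = test)`, assuming the test set has the same x and y columns as the training set.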