Our analysis focuses on the development of an objective, automated method to extract clinically useful information from in the context of Parkinson’s disease (PD).
Parkinson’s disease is a progressive disorder of the nervous system that affects movement. It develops gradually, sometimes starting with a barely noticeable tremor in just one hand. But while a tremor may be the most well-known sign of Parkinson’s disease, the disorder also commonly causes stiffness or slowing of movement. In the early stages of Parkinson’s disease, your face may show little or no expression, or your arms may not swing when you walk. Your speech may become soft or slurred. Parkinson’s disease symptoms worsen as your condition progresses over time. Parkinson’s disease continues to be a serious problem today, with Approximately 60,000 Americans are diagnosed with Parkinson’s disease each year, and this number does not reflect the thousands of cases that go undetected.
More than 10 million people worldwide are living with Parkinson’s disease.
Incidence of Parkinson’s increases with age, but an estimated four percent of people with PD are diagnosed before the age of 50.
Men are one and a half times more likely to have Parkinson’s than women.
Although Parkinson’s disease can’t be cured, medications may markedly improve your symptoms. In occasional cases, the doctor may suggest surgery to regulate certain regions of your brain and improve your symptoms.
We have a dataset for 5875 patients in various stages of Parkinson’s disease and will proceed to analyze the data in order to identify useful relations in the dataset.
Let’s first load all the required packages required to analyze the dataset
To read the data into R, we will use the “read.csv” command.
#setwd("D:\UC MSIS\Fall Flex 2")
park <- read.csv("Park.csv")
attach(park)
Let’s remove the ‘NA’ values from the dataset:
park <- na.omit(park)
Now, let’s inspect and summarize the data:
head(park,10)
## Subject.no age sex test_time motor_UPDRS total_UPDRS JitterPerc
## 1 1 72 0 5.6431 28.199 34.398 0.00662
## 2 1 72 0 12.6660 28.447 34.894 0.00300
## 3 1 72 0 19.6810 28.695 35.389 0.00481
## 4 1 72 0 25.6470 28.905 35.810 0.00528
## 5 1 72 0 33.6420 29.187 36.375 0.00335
## 6 1 72 0 40.6520 29.435 36.870 0.00353
## 7 1 72 0 47.6490 29.682 37.363 0.00422
## 8 1 72 0 54.6400 29.928 37.857 0.00476
## 9 1 72 0 61.6690 30.177 38.353 0.00432
## 10 1 72 0 68.6880 30.424 38.849 0.00496
## Jitter.Abs. Jitter.RAP Jitter.PPQ5 Jitter.DDP Shimmer Shimmer.dB.
## 1 3.38e-05 0.00401 0.00317 0.01204 0.02565 0.230
## 2 1.68e-05 0.00132 0.00150 0.00395 0.02024 0.179
## 3 2.46e-05 0.00205 0.00208 0.00616 0.01675 0.181
## 4 2.66e-05 0.00191 0.00264 0.00573 0.02309 0.327
## 5 2.01e-05 0.00093 0.00130 0.00278 0.01703 0.176
## 6 2.29e-05 0.00119 0.00159 0.00357 0.02227 0.214
## 7 2.40e-05 0.00212 0.00221 0.00637 0.04352 0.445
## 8 2.47e-05 0.00226 0.00259 0.00678 0.02191 0.212
## 9 2.85e-05 0.00156 0.00207 0.00468 0.04296 0.371
## 10 2.70e-05 0.00258 0.00253 0.00773 0.03610 0.310
## Shimmer.APQ3 Shimmer.APQ5 Shimmer.APQ11 Shimmer.DDA NHR HNR
## 1 0.01438 0.01309 0.01662 0.04314 0.014290 21.640
## 2 0.00994 0.01072 0.01689 0.02982 0.011112 27.183
## 3 0.00734 0.00844 0.01458 0.02202 0.020220 23.047
## 4 0.01106 0.01265 0.01963 0.03317 0.027837 24.445
## 5 0.00679 0.00929 0.01819 0.02036 0.011625 26.126
## 6 0.01006 0.01337 0.02263 0.03019 0.009438 22.946
## 7 0.02376 0.02621 0.03488 0.07128 0.013260 22.506
## 8 0.00979 0.01462 0.01911 0.02937 0.027969 22.929
## 9 0.01774 0.02134 0.03451 0.05323 0.013381 22.078
## 10 0.02030 0.01970 0.02569 0.06089 0.018021 22.606
## RPDE DFA PPE
## 1 0.41888 0.54842 0.16006
## 2 0.43493 0.56477 0.10810
## 3 0.46222 0.54405 0.21014
## 4 0.48730 0.57794 0.33277
## 5 0.47188 0.56122 0.19361
## 6 0.53949 0.57243 0.19500
## 7 0.49250 0.54779 0.17563
## 8 0.47712 0.54234 0.23844
## 9 0.51563 0.61864 0.20037
## 10 0.50032 0.58673 0.20117
## To have a look at the structure of the dataset 'park'
str(park)
## 'data.frame': 5875 obs. of 22 variables:
## $ Subject.no : int 1 1 1 1 1 1 1 1 1 1 ...
## $ age : int 72 72 72 72 72 72 72 72 72 72 ...
## $ sex : int 0 0 0 0 0 0 0 0 0 0 ...
## $ test_time : num 5.64 12.67 19.68 25.65 33.64 ...
## $ motor_UPDRS : num 28.2 28.4 28.7 28.9 29.2 ...
## $ total_UPDRS : num 34.4 34.9 35.4 35.8 36.4 ...
## $ JitterPerc : num 0.00662 0.003 0.00481 0.00528 0.00335 0.00353 0.00422 0.00476 0.00432 0.00496 ...
## $ Jitter.Abs. : num 3.38e-05 1.68e-05 2.46e-05 2.66e-05 2.01e-05 2.29e-05 2.40e-05 2.47e-05 2.85e-05 2.70e-05 ...
## $ Jitter.RAP : num 0.00401 0.00132 0.00205 0.00191 0.00093 0.00119 0.00212 0.00226 0.00156 0.00258 ...
## $ Jitter.PPQ5 : num 0.00317 0.0015 0.00208 0.00264 0.0013 0.00159 0.00221 0.00259 0.00207 0.00253 ...
## $ Jitter.DDP : num 0.01204 0.00395 0.00616 0.00573 0.00278 ...
## $ Shimmer : num 0.0256 0.0202 0.0168 0.0231 0.017 ...
## $ Shimmer.dB. : num 0.23 0.179 0.181 0.327 0.176 0.214 0.445 0.212 0.371 0.31 ...
## $ Shimmer.APQ3 : num 0.01438 0.00994 0.00734 0.01106 0.00679 ...
## $ Shimmer.APQ5 : num 0.01309 0.01072 0.00844 0.01265 0.00929 ...
## $ Shimmer.APQ11: num 0.0166 0.0169 0.0146 0.0196 0.0182 ...
## $ Shimmer.DDA : num 0.0431 0.0298 0.022 0.0332 0.0204 ...
## $ NHR : num 0.0143 0.0111 0.0202 0.0278 0.0116 ...
## $ HNR : num 21.6 27.2 23 24.4 26.1 ...
## $ RPDE : num 0.419 0.435 0.462 0.487 0.472 ...
## $ DFA : num 0.548 0.565 0.544 0.578 0.561 ...
## $ PPE : num 0.16 0.108 0.21 0.333 0.194 ...
## Summary
summary(park)
## Subject.no age sex test_time
## Min. : 1.00 Min. :36.0 Min. :0.0000 Min. : -4.263
## 1st Qu.:10.00 1st Qu.:58.0 1st Qu.:0.0000 1st Qu.: 46.847
## Median :22.00 Median :65.0 Median :0.0000 Median : 91.523
## Mean :21.49 Mean :64.8 Mean :0.3178 Mean : 92.864
## 3rd Qu.:33.00 3rd Qu.:72.0 3rd Qu.:1.0000 3rd Qu.:138.445
## Max. :42.00 Max. :85.0 Max. :1.0000 Max. :215.490
## motor_UPDRS total_UPDRS JitterPerc Jitter.Abs.
## Min. : 5.038 Min. : 7.00 Min. :0.000830 Min. :2.250e-06
## 1st Qu.:15.000 1st Qu.:21.37 1st Qu.:0.003580 1st Qu.:2.240e-05
## Median :20.871 Median :27.58 Median :0.004900 Median :3.450e-05
## Mean :21.296 Mean :29.02 Mean :0.006154 Mean :4.403e-05
## 3rd Qu.:27.596 3rd Qu.:36.40 3rd Qu.:0.006800 3rd Qu.:5.330e-05
## Max. :39.511 Max. :54.99 Max. :0.099990 Max. :4.456e-04
## Jitter.RAP Jitter.PPQ5 Jitter.DDP
## Min. :0.000330 Min. :0.000430 Min. :0.000980
## 1st Qu.:0.001580 1st Qu.:0.001820 1st Qu.:0.004730
## Median :0.002250 Median :0.002490 Median :0.006750
## Mean :0.002987 Mean :0.003277 Mean :0.008962
## 3rd Qu.:0.003290 3rd Qu.:0.003460 3rd Qu.:0.009870
## Max. :0.057540 Max. :0.069560 Max. :0.172630
## Shimmer Shimmer.dB. Shimmer.APQ3 Shimmer.APQ5
## Min. :0.00306 Min. :0.026 Min. :0.00161 Min. :0.00194
## 1st Qu.:0.01912 1st Qu.:0.175 1st Qu.:0.00928 1st Qu.:0.01079
## Median :0.02751 Median :0.253 Median :0.01370 Median :0.01594
## Mean :0.03404 Mean :0.311 Mean :0.01716 Mean :0.02014
## 3rd Qu.:0.03975 3rd Qu.:0.365 3rd Qu.:0.02057 3rd Qu.:0.02375
## Max. :0.26863 Max. :2.107 Max. :0.16267 Max. :0.16702
## Shimmer.APQ11 Shimmer.DDA NHR HNR
## Min. :0.00249 Min. :0.00484 Min. :0.000286 Min. : 1.659
## 1st Qu.:0.01566 1st Qu.:0.02783 1st Qu.:0.010955 1st Qu.:19.406
## Median :0.02271 Median :0.04111 Median :0.018448 Median :21.920
## Mean :0.02748 Mean :0.05147 Mean :0.032120 Mean :21.680
## 3rd Qu.:0.03272 3rd Qu.:0.06173 3rd Qu.:0.031463 3rd Qu.:24.444
## Max. :0.27546 Max. :0.48802 Max. :0.748260 Max. :37.875
## RPDE DFA PPE
## Min. :0.1510 Min. :0.5140 Min. :0.02198
## 1st Qu.:0.4698 1st Qu.:0.5962 1st Qu.:0.15634
## Median :0.5423 Median :0.6436 Median :0.20550
## Mean :0.5415 Mean :0.6532 Mean :0.21959
## 3rd Qu.:0.6140 3rd Qu.:0.7113 3rd Qu.:0.26449
## Max. :0.9661 Max. :0.8656 Max. :0.73173
So, ‘park’ dataset has 22 variables and 5875 observations. As of now, we have the summary of all the variables.
We will now list down all the attributes in our dataset and work towards identifying a response variable.
Subject number: Integer that uniquely identifies each subject
Subject Age
Subject gender ‘0’ - male, ‘1’ - female
Test time - Time since recruitment into the trial. The integer part is the number of days since recruitment.
UPDRS: This is a clinician’s scale for recording symptoms related to Parkinson’s disease. The UPDRS metric consists of 44 sections, where each section addresses different symptoms in different parts of the body. Summing up these 44 sections gives rise to the total-UPDRS score, which spans the range 0-176, with 0 representing perfectly healthy individual and 176 total disability.
Motor UPDRS - Clinician’s motor UPDRS score, linearly interpolated - this forms sections 18-44 from the UPDRS sections
Total_UPDRS - Clinician’s total UPDRS score, linearly interpolated - this includes all 44 sections
Jitter Percentage - measure of variation in fundamental frequency
Jitter (Absolute) - measure of variation in fundamental frequency
Jitter (RAP) - measure of variation in fundamental frequency
Jitter (PPQ5) - measure of variation in fundamental frequency
Jitter (DDP) - measure of variation in fundamental frequency
Shimmer - measures of variation in amplitude
Shimmer(dB)- measure of variation in amplitude
Shimmer:APQ3- measure of variation in amplitude
Shimmer:APQ5- measure of variation in amplitude
Shimmer:APQ11- measure of variation in amplitude
Shimmer:DDA- measure of variation in amplitude
NHR: measures of ratio of noise to tonal components in the voice
HNR: measures of ratio of noise to tonal components in the voice
RPDE - A nonlinear dynamical complexity measure
DFA - Signal fractal scaling exponent
PPE - A nonlinear measure of fundamental frequency variation
We will aim to use UDPRS as our response variable.
The aim is twofold: (a) this attribute helps us differentiate between healthy (non-affected) people and people with Parkinsons. (b) replicating the Unified Parkinson’s Disease Rating Scale (UPDRS) metric which provides a clinical impression of PD symptom severity will help enforce the findings in the actual physical interview.
This metric (UPDRS) spans the range 0 to 176 , where 0 denotes a healthy person and 176 total disability. Currently, UPDRS assessment requires the physical presence of the subject in the clinic, is subjective relying on the clinical rater’s expertise, and costly (in terms of time, money, energy and resources). Hence, the practical frequency of symptom tracking is typically confined to once every several months, hindering recruitment for large-scale clinical trials and under-representing the true time scale of PD fluctuations
We will aim to build and refine a model that will enable us to accurately estimate a measure for UPDRS (motor or total).
As we see that ‘sex’ has ‘0’ and ‘1’ values. We could factor it and it will not affect the dataset. Let’s do the factoring using the is.factor function:
sex <- is.factor(sex)
Let’s see the Histrogram plot of ‘total_UPDRS’:
hist(total_UPDRS, xlab="total_UPDRS",main="Histogram of total_UPDRS", col = blues9)
It seems to be normally distributed and that is a good thing for the model. Otherwise, transformations or other changes would have been required to normalize the dependent variable ‘total_UPDRS’.
Let’s look at the histogram of other variables:
par(mfrow = c(3,3))
hist(age, xlab="age",main="Histogram of age", col = blues9)
hist(test_time, xlab="test_time",main="Histogram of test_time", col = blues9)
hist(motor_UPDRS, xlab="motor_UPDRS",main="Histogram of motor_UPDRS", col = blues9)
hist(JitterPerc, xlab="JitterPerc",main="Histogram of JitterPerc", col = blues9)
hist(Shimmer, xlab="Shimmer",main="Histogram of Shimmer", col = blues9)
hist(NHR, xlab="NHR",main="Histogram of NHR", col = blues9)
hist(RPDE, xlab="RPDE",main="Histogram of RPDE", col = blues9)
hist(DFA, xlab="DFA",main="Histogram of DFA", col = blues9)
hist(PPE, xlab="PPE",main="Histogram of PPE", col = blues9)
From all the observed histograms, it can be concluded that only ‘RPDE’ and ‘PPE’ have the normalized data. ‘JitterPerc’, ‘Shimmer’ and ‘NHR’ have very unnormalized data. It can be improved by transforming these varaibles which will be used in later sections when we have a model giving linear regression.
We plotted Histogram to get the sense of the response variable and the covariates but Histograms can not be a very relaible method for determining the shape of a distribution because it is so strongly affected by the number of bins used. Thus we use the kernel density plots instead which are more effective ways to view the distribution of a variable
Let’s plot the density curves for the variables:
# Dependent variable
par(mfrow = c(1,1))
d <- density(park$total_UPDRS)
plot(d)
polygon(d, col="light green", border="blue")
Density plot of ‘total_UPDRS’ is showing that it is almost normalized variable but not fully normalized. Normalization of the response variable is improved by various methods used in the report later.
Let’s plot the density plots for the covariates to get more sense of the variables:
par(mfrow = c(3,3))
d <- density(park$age)
plot(d)
polygon(d, col="light green", border="blue")
d <- density(park$test_time)
plot(d)
polygon(d, col="light green", border="blue")
d <- density(park$motor_UPDRS)
plot(d)
polygon(d, col="light green", border="blue")
d <- density(park$JitterPerc)
plot(d)
polygon(d, col="light green", border="blue")
d <- density(park$Shimmer)
plot(d)
polygon(d, col="light green", border="blue")
d <- density(park$NHR)
plot(d)
polygon(d, col="light green", border="blue")
d <- density(park$RPDE)
plot(d)
polygon(d, col="light green", border="blue")
d <- density(park$DFA)
plot(d)
polygon(d, col="light green", border="blue")
d <- density(park$PPE)
plot(d)
polygon(d, col="light green", border="blue")
As density plot is also showing that ‘RPDE’, ‘PPE’ have normalized data but not for ‘motor_UPDRS’ which it couldn’t be visible accurately in the Histogram plot.
‘JitterPerc’, ‘Shimmer’ and ‘NHR’ have very unnormalized data. It can be improved by transforming these varaibles which will be used in later sections when we have a model giving linear regression. We will show all these transformed variables after normalizing them.
To find a good linear model, let’s first try to see the correlation between the variables:
library(corrplot)
corrplot(cor(park), method = "color")
library(car)
library(visreg)
library(rgl)
library(knitr)
library(scatterplot3d)
library(MASS)
library(genridge)
##
## Attaching package: 'genridge'
## The following object is masked from 'package:rgl':
##
## plot3d
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:corrplot':
##
## corrplot
## The following object is masked from 'package:stats':
##
## loadings
library(lattice)
As it can be observed that ‘total UPDRS’ has correlation with only ‘motor UPDRS’ which is 95%. If we build a model using these two variables, then our model will have an inherent bias induced into the estimates of coefficients and variance of predicted response and it is not the intention. So, other variables are required to correlate with the response variable and that can be done by improving the model created by these covariates. We will proceed by creating a model using ‘total_UPDRS’ as our response variable and other varaibles as predictor variables.
We will proceed to build the linear model using lm() function taking total_UPDRS as the response varaible.
model <- lm(park$total_UPDRS~park$age+park$sex+park$test_time+park$motor_UPDRS+park$Jitter.PPQ5+park$JitterPerc+park$Jitter.RAP+park$Jitter.DDP+park$Shimmer+park$Shimmer.dB.+park$Shimmer.APQ3+park$Shimmer.APQ5+park$Shimmer.APQ11+park$Shimmer.DDA+park$NHR+park$HNR+park$RPDE+park$DFA+park$PPE)
summary(model)
##
## Call:
## lm(formula = park$total_UPDRS ~ park$age + park$sex + park$test_time +
## park$motor_UPDRS + park$Jitter.PPQ5 + park$JitterPerc + park$Jitter.RAP +
## park$Jitter.DDP + park$Shimmer + park$Shimmer.dB. + park$Shimmer.APQ3 +
## park$Shimmer.APQ5 + park$Shimmer.APQ11 + park$Shimmer.DDA +
## park$NHR + park$HNR + park$RPDE + park$DFA + park$PPE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9283 -2.0790 -0.5001 1.4599 11.4573
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.879e+00 1.082e+00 1.737 0.08249 .
## park$age 6.843e-02 5.167e-03 13.243 < 2e-16 ***
## park$sex -1.557e+00 9.886e-02 -15.748 < 2e-16 ***
## park$test_time 2.616e-03 8.029e-04 3.258 0.00113 **
## park$motor_UPDRS 1.223e+00 5.666e-03 215.827 < 2e-16 ***
## park$Jitter.PPQ5 -3.789e+01 6.013e+01 -0.630 0.52866
## park$JitterPerc -2.158e+02 6.857e+01 -3.147 0.00166 **
## park$Jitter.RAP 6.302e+03 1.568e+04 0.402 0.68768
## park$Jitter.DDP -1.958e+03 5.226e+03 -0.375 0.70796
## park$Shimmer -2.757e+01 2.171e+01 -1.270 0.20411
## park$Shimmer.dB. -1.108e+00 1.618e+00 -0.685 0.49366
## park$Shimmer.APQ3 -1.305e+04 1.575e+04 -0.829 0.40720
## park$Shimmer.APQ5 8.851e+01 1.853e+01 4.777 1.82e-06 ***
## park$Shimmer.APQ11 -3.911e+01 8.308e+00 -4.707 2.57e-06 ***
## park$Shimmer.DDA 4.347e+03 5.249e+03 0.828 0.40756
## park$NHR -1.368e-01 2.025e+00 -0.068 0.94616
## park$HNR -1.095e-01 2.320e-02 -4.722 2.39e-06 ***
## park$RPDE 3.696e+00 6.039e-01 6.121 9.90e-10 ***
## park$DFA -1.855e+00 7.678e-01 -2.416 0.01573 *
## park$PPE -2.625e+00 9.446e-01 -2.779 0.00547 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.262 on 5855 degrees of freedom
## Multiple R-squared: 0.9074, Adjusted R-squared: 0.9071
## F-statistic: 3019 on 19 and 5855 DF, p-value: < 2.2e-16
R-square of the above model is 90.74% and adjusted R-square is 90.71% which shows good correlation in the model. But as, we can see that there are many insignificant predictor varaibles in this model, we will try to take best subset of the above model for best fit. Step AIC is used to get the subset of the model. Step AIC method will evaluate the model by checking AIC values for each varaible and try to reduce AIC of the model by removing the variables of high AIC.
step <- stepAIC(model, direction = "both")
## Start: AIC=13910.77
## park$total_UPDRS ~ park$age + park$sex + park$test_time + park$motor_UPDRS +
## park$Jitter.PPQ5 + park$JitterPerc + park$Jitter.RAP + park$Jitter.DDP +
## park$Shimmer + park$Shimmer.dB. + park$Shimmer.APQ3 + park$Shimmer.APQ5 +
## park$Shimmer.APQ11 + park$Shimmer.DDA + park$NHR + park$HNR +
## park$RPDE + park$DFA + park$PPE
##
## Df Sum of Sq RSS AIC
## - park$NHR 1 0 62283 13909
## - park$Jitter.DDP 1 1 62284 13909
## - park$Jitter.RAP 1 2 62285 13909
## - park$Jitter.PPQ5 1 4 62287 13909
## - park$Shimmer.dB. 1 5 62288 13909
## - park$Shimmer.DDA 1 7 62290 13910
## - park$Shimmer.APQ3 1 7 62290 13910
## - park$Shimmer 1 17 62300 13910
## <none> 62283 13911
## - park$DFA 1 62 62345 13915
## - park$PPE 1 82 62365 13916
## - park$JitterPerc 1 105 62388 13919
## - park$test_time 1 113 62396 13919
## - park$Shimmer.APQ11 1 236 62519 13931
## - park$HNR 1 237 62520 13931
## - park$Shimmer.APQ5 1 243 62526 13932
## - park$RPDE 1 399 62682 13946
## - park$age 1 1866 64149 14082
## - park$sex 1 2638 64921 14152
## - park$motor_UPDRS 1 495511 557794 26789
##
## Step: AIC=13908.77
## park$total_UPDRS ~ park$age + park$sex + park$test_time + park$motor_UPDRS +
## park$Jitter.PPQ5 + park$JitterPerc + park$Jitter.RAP + park$Jitter.DDP +
## park$Shimmer + park$Shimmer.dB. + park$Shimmer.APQ3 + park$Shimmer.APQ5 +
## park$Shimmer.APQ11 + park$Shimmer.DDA + park$HNR + park$RPDE +
## park$DFA + park$PPE
##
## Df Sum of Sq RSS AIC
## - park$Jitter.DDP 1 1 62285 13907
## - park$Jitter.RAP 1 2 62285 13907
## - park$Jitter.PPQ5 1 5 62288 13907
## - park$Shimmer.dB. 1 5 62288 13907
## - park$Shimmer.DDA 1 7 62290 13908
## - park$Shimmer.APQ3 1 7 62290 13908
## - park$Shimmer 1 18 62301 13908
## <none> 62283 13909
## + park$NHR 1 0 62283 13911
## - park$DFA 1 70 62353 13913
## - park$PPE 1 83 62366 13915
## - park$JitterPerc 1 108 62391 13917
## - park$test_time 1 113 62396 13917
## - park$HNR 1 243 62526 13930
## - park$Shimmer.APQ5 1 250 62534 13930
## - park$Shimmer.APQ11 1 258 62541 13931
## - park$RPDE 1 399 62682 13944
## - park$age 1 1885 64168 14082
## - park$sex 1 2697 64980 14156
## - park$motor_UPDRS 1 497188 559471 26804
##
## Step: AIC=13906.92
## park$total_UPDRS ~ park$age + park$sex + park$test_time + park$motor_UPDRS +
## park$Jitter.PPQ5 + park$JitterPerc + park$Jitter.RAP + park$Shimmer +
## park$Shimmer.dB. + park$Shimmer.APQ3 + park$Shimmer.APQ5 +
## park$Shimmer.APQ11 + park$Shimmer.DDA + park$HNR + park$RPDE +
## park$DFA + park$PPE
##
## Df Sum of Sq RSS AIC
## - park$Jitter.PPQ5 1 5 62289 13905
## - park$Shimmer.dB. 1 5 62290 13905
## - park$Shimmer.DDA 1 7 62292 13906
## - park$Shimmer.APQ3 1 7 62292 13906
## - park$Shimmer 1 18 62303 13907
## <none> 62285 13907
## + park$Jitter.DDP 1 1 62283 13909
## + park$NHR 1 0 62284 13909
## - park$DFA 1 70 62355 13912
## - park$PPE 1 82 62367 13913
## - park$JitterPerc 1 108 62392 13915
## - park$test_time 1 113 62397 13916
## - park$Jitter.RAP 1 223 62507 13926
## - park$HNR 1 243 62527 13928
## - park$Shimmer.APQ5 1 252 62537 13929
## - park$Shimmer.APQ11 1 258 62543 13929
## - park$RPDE 1 401 62685 13943
## - park$age 1 1886 64170 14080
## - park$sex 1 2696 64980 14154
## - park$motor_UPDRS 1 497249 559533 26803
##
## Step: AIC=13905.36
## park$total_UPDRS ~ park$age + park$sex + park$test_time + park$motor_UPDRS +
## park$JitterPerc + park$Jitter.RAP + park$Shimmer + park$Shimmer.dB. +
## park$Shimmer.APQ3 + park$Shimmer.APQ5 + park$Shimmer.APQ11 +
## park$Shimmer.DDA + park$HNR + park$RPDE + park$DFA + park$PPE
##
## Df Sum of Sq RSS AIC
## - park$Shimmer.dB. 1 4 62293 13904
## - park$Shimmer.DDA 1 7 62297 13904
## - park$Shimmer.APQ3 1 7 62297 13904
## <none> 62289 13905
## - park$Shimmer 1 21 62310 13905
## + park$Jitter.PPQ5 1 5 62285 13907
## + park$Jitter.DDP 1 2 62288 13907
## + park$NHR 1 0 62289 13907
## - park$DFA 1 68 62357 13910
## - park$PPE 1 79 62368 13911
## - park$test_time 1 112 62401 13914
## - park$JitterPerc 1 185 62474 13921
## - park$Jitter.RAP 1 223 62512 13924
## - park$HNR 1 239 62528 13926
## - park$Shimmer.APQ11 1 255 62544 13927
## - park$Shimmer.APQ5 1 272 62561 13929
## - park$RPDE 1 428 62717 13944
## - park$age 1 1886 64175 14079
## - park$sex 1 2701 64990 14153
## - park$motor_UPDRS 1 497258 559547 26801
##
## Step: AIC=13903.75
## park$total_UPDRS ~ park$age + park$sex + park$test_time + park$motor_UPDRS +
## park$JitterPerc + park$Jitter.RAP + park$Shimmer + park$Shimmer.APQ3 +
## park$Shimmer.APQ5 + park$Shimmer.APQ11 + park$Shimmer.DDA +
## park$HNR + park$RPDE + park$DFA + park$PPE
##
## Df Sum of Sq RSS AIC
## - park$Shimmer.DDA 1 7 62301 13902
## - park$Shimmer.APQ3 1 7 62301 13902
## <none> 62293 13904
## + park$Shimmer.dB. 1 4 62289 13905
## + park$Jitter.PPQ5 1 4 62290 13905
## + park$Jitter.DDP 1 2 62292 13906
## + park$NHR 1 1 62293 13906
## - park$DFA 1 64 62358 13908
## - park$Shimmer 1 65 62358 13908
## - park$PPE 1 88 62381 13910
## - park$test_time 1 111 62404 13912
## - park$JitterPerc 1 191 62484 13920
## - park$Jitter.RAP 1 228 62521 13923
## - park$HNR 1 239 62533 13924
## - park$Shimmer.APQ11 1 272 62565 13927
## - park$Shimmer.APQ5 1 281 62574 13928
## - park$RPDE 1 433 62726 13942
## - park$age 1 1882 64176 14077
## - park$sex 1 2705 64999 14152
## - park$motor_UPDRS 1 497558 559852 26802
##
## Step: AIC=13902.43
## park$total_UPDRS ~ park$age + park$sex + park$test_time + park$motor_UPDRS +
## park$JitterPerc + park$Jitter.RAP + park$Shimmer + park$Shimmer.APQ3 +
## park$Shimmer.APQ5 + park$Shimmer.APQ11 + park$HNR + park$RPDE +
## park$DFA + park$PPE
##
## Df Sum of Sq RSS AIC
## - park$Shimmer.APQ3 1 0 62301 13900
## <none> 62301 13902
## + park$Shimmer.DDA 1 7 62293 13904
## + park$Shimmer.dB. 1 4 62297 13904
## + park$Jitter.PPQ5 1 4 62297 13904
## + park$Jitter.DDP 1 2 62299 13904
## + park$NHR 1 1 62300 13904
## - park$DFA 1 64 62365 13906
## - park$Shimmer 1 65 62366 13907
## - park$PPE 1 88 62389 13909
## - park$test_time 1 112 62412 13911
## - park$JitterPerc 1 190 62490 13918
## - park$Jitter.RAP 1 227 62528 13922
## - park$HNR 1 241 62541 13923
## - park$Shimmer.APQ11 1 272 62573 13926
## - park$Shimmer.APQ5 1 282 62583 13927
## - park$RPDE 1 430 62730 13941
## - park$age 1 1884 64185 14076
## - park$sex 1 2706 65006 14150
## - park$motor_UPDRS 1 497568 559869 26800
##
## Step: AIC=13900.47
## park$total_UPDRS ~ park$age + park$sex + park$test_time + park$motor_UPDRS +
## park$JitterPerc + park$Jitter.RAP + park$Shimmer + park$Shimmer.APQ5 +
## park$Shimmer.APQ11 + park$HNR + park$RPDE + park$DFA + park$PPE
##
## Df Sum of Sq RSS AIC
## <none> 62301 13900
## + park$Shimmer.dB. 1 4 62297 13902
## + park$Jitter.PPQ5 1 3 62298 13902
## + park$Jitter.DDP 1 2 62299 13902
## + park$Shimmer.APQ3 1 0 62301 13902
## + park$Shimmer.DDA 1 0 62301 13902
## + park$NHR 1 0 62301 13902
## - park$DFA 1 65 62366 13905
## - park$PPE 1 90 62391 13907
## - park$test_time 1 111 62412 13909
## - park$Shimmer 1 167 62468 13914
## - park$JitterPerc 1 223 62524 13919
## - park$HNR 1 241 62542 13921
## - park$Jitter.RAP 1 257 62558 13923
## - park$Shimmer.APQ5 1 287 62588 13925
## - park$Shimmer.APQ11 1 335 62636 13930
## - park$RPDE 1 432 62733 13939
## - park$age 1 1889 64190 14074
## - park$sex 1 2766 65067 14154
## - park$motor_UPDRS 1 498032 560333 26803
step$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## park$total_UPDRS ~ park$age + park$sex + park$test_time + park$motor_UPDRS +
## park$Jitter.PPQ5 + park$JitterPerc + park$Jitter.RAP + park$Jitter.DDP +
## park$Shimmer + park$Shimmer.dB. + park$Shimmer.APQ3 + park$Shimmer.APQ5 +
## park$Shimmer.APQ11 + park$Shimmer.DDA + park$NHR + park$HNR +
## park$RPDE + park$DFA + park$PPE
##
## Final Model:
## park$total_UPDRS ~ park$age + park$sex + park$test_time + park$motor_UPDRS +
## park$JitterPerc + park$Jitter.RAP + park$Shimmer + park$Shimmer.APQ5 +
## park$Shimmer.APQ11 + park$HNR + park$RPDE + park$DFA + park$PPE
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 5855 62282.98 13910.77
## 2 - park$NHR 1 0.04850983 5856 62283.03 13908.77
## 3 - park$Jitter.DDP 1 1.49780653 5857 62284.53 13906.92
## 4 - park$Jitter.PPQ5 1 4.65695110 5858 62289.19 13905.36
## 5 - park$Shimmer.dB. 1 4.15599272 5859 62293.34 13903.75
## 6 - park$Shimmer.DDA 1 7.26853396 5860 62300.61 13902.43
## 7 - park$Shimmer.APQ3 1 0.36606694 5861 62300.98 13900.47
We come out with the best subset for our model, let’s create another model with this subset in the next section:
model1 <- lm(park$total_UPDRS ~ park$age + park$sex + park$test_time + park$motor_UPDRS + park$JitterPerc + park$Jitter.RAP + park$Shimmer + park$Shimmer.APQ5 + park$Shimmer.APQ11 + park$HNR + park$RPDE + park$DFA + park$PPE)
summary(model1)
##
## Call:
## lm(formula = park$total_UPDRS ~ park$age + park$sex + park$test_time +
## park$motor_UPDRS + park$JitterPerc + park$Jitter.RAP + park$Shimmer +
## park$Shimmer.APQ5 + park$Shimmer.APQ11 + park$HNR + park$RPDE +
## park$DFA + park$PPE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9924 -2.0833 -0.5047 1.4742 11.4537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.754e+00 1.043e+00 1.682 0.09262 .
## park$age 6.835e-02 5.127e-03 13.331 < 2e-16 ***
## park$sex -1.557e+00 9.652e-02 -16.130 < 2e-16 ***
## park$test_time 2.593e-03 8.015e-04 3.236 0.00122 **
## park$motor_UPDRS 1.223e+00 5.650e-03 216.455 < 2e-16 ***
## park$JitterPerc -2.379e+02 5.199e+01 -4.576 4.84e-06 ***
## park$Jitter.RAP 4.271e+02 8.688e+01 4.917 9.05e-07 ***
## park$Shimmer -4.110e+01 1.038e+01 -3.958 7.64e-05 ***
## park$Shimmer.APQ5 8.421e+01 1.622e+01 5.192 2.15e-07 ***
## park$Shimmer.APQ11 -3.798e+01 6.769e+00 -5.610 2.11e-08 ***
## park$HNR -1.082e-01 2.273e-02 -4.759 1.99e-06 ***
## park$RPDE 3.775e+00 5.919e-01 6.378 1.94e-10 ***
## park$DFA -1.739e+00 7.014e-01 -2.479 0.01321 *
## park$PPE -2.655e+00 9.148e-01 -2.902 0.00372 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.26 on 5861 degrees of freedom
## Multiple R-squared: 0.9074, Adjusted R-squared: 0.9072
## F-statistic: 4416 on 13 and 5861 DF, p-value: < 2.2e-16
It looks like all the variables are significant. This model seems to be fittest model as R-square is 90.74% and Adjusted R-square is also high equal to 90.72% which show a good correlation. Now let’s try to plot this model.
par(mfrow = c(2,2))
plot(model1, col = "green")
Residual vs. Fitted values Plot - This plot is almost uniform which explains a good linearlity of the model. There are some outliers that cannot be removed because they contain important information. Another thing that this is graph is depicting is some constant variance of the model which is good for linear regression.
Normal QQ - Since it gives the normality of the response variable, after looking at its plot, we can say that our response variable is almost normally distributed as standardized residuals and theoretical quantile are coming from the same distribution as both meet almost in a straight line. This is subjective analysis just to check that this model has normalized dependent variable.
Checking for Leverage point and DFFITS: Leverage point , p = 13, n= 5875, so to check for leverage point, 2p/n is calculated and compared with hii. There are many points in the dataset whose hii > 2p/n. Similarly, for DFFITS, 2/sqrt(n) is calculated and compared with |DFFITS| for every point, if it is greater than 2/sqrt(n), then that point is considered to be influential.
Our model is best fitted but its variables are not normalized as we observed in the upper section when we checked their histogram and density plots. Lets transform the variables or data to get them normalized and look for a better model.
model2 <- lm(park$total_UPDRS ~ log10(park$age) + park$sex + park$test_time + park$motor_UPDRS +log10(park$JitterPerc) + log10(park$Jitter.RAP) + log10(park$Shimmer) + log10(park$Shimmer.APQ5) + log10(park$Shimmer.APQ11) + park$HNR + park$RPDE + park$DFA + park$PPE)
summary(model2)
##
## Call:
## lm(formula = park$total_UPDRS ~ log10(park$age) + park$sex +
## park$test_time + park$motor_UPDRS + log10(park$JitterPerc) +
## log10(park$Jitter.RAP) + log10(park$Shimmer) + log10(park$Shimmer.APQ5) +
## log10(park$Shimmer.APQ11) + park$HNR + park$RPDE + park$DFA +
## park$PPE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4338 -2.0648 -0.4393 1.3923 12.0127
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.240e+01 2.053e+00 -6.040 1.64e-09 ***
## log10(park$age) 9.636e+00 7.405e-01 13.012 < 2e-16 ***
## park$sex -1.657e+00 9.969e-02 -16.626 < 2e-16 ***
## park$test_time 2.369e-03 8.022e-04 2.953 0.003161 **
## park$motor_UPDRS 1.228e+00 5.691e-03 215.731 < 2e-16 ***
## log10(park$JitterPerc) -1.740e+00 9.573e-01 -1.817 0.069207 .
## log10(park$Jitter.RAP) 2.538e+00 7.635e-01 3.324 0.000893 ***
## log10(park$Shimmer) -6.651e+00 1.397e+00 -4.760 1.98e-06 ***
## log10(park$Shimmer.APQ5) 8.750e+00 1.279e+00 6.840 8.73e-12 ***
## log10(park$Shimmer.APQ11) -4.184e+00 7.822e-01 -5.349 9.19e-08 ***
## park$HNR -5.694e-02 2.435e-02 -2.338 0.019428 *
## park$RPDE 4.830e+00 6.040e-01 7.997 1.53e-15 ***
## park$DFA -1.706e+00 7.131e-01 -2.393 0.016741 *
## park$PPE -4.317e+00 1.016e+00 -4.247 2.20e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.261 on 5861 degrees of freedom
## Multiple R-squared: 0.9073, Adjusted R-squared: 0.9071
## F-statistic: 4414 on 13 and 5861 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model2, col = "green")
After looking at the summary of the model2, there are two variables observed that are insignificant. R-square and adjusted R-square values are not changed. Only significant change if observed is in the Residual vs Leverage plot which seems to be uniform now as compare to the same plot of previous model.
Let’s try to remove the insignificant variables and create another model:
model3 <- lm(park$total_UPDRS ~ log10(park$age) + park$sex + park$test_time + park$motor_UPDRS + log10(park$Jitter.RAP) + log10(park$Shimmer) + log10(park$Shimmer.APQ5) + log10(park$Shimmer.APQ11)+ park$HNR + park$RPDE + park$DFA + park$PPE)
summary(model3)
##
## Call:
## lm(formula = park$total_UPDRS ~ log10(park$age) + park$sex +
## park$test_time + park$motor_UPDRS + log10(park$Jitter.RAP) +
## log10(park$Shimmer) + log10(park$Shimmer.APQ5) + log10(park$Shimmer.APQ11) +
## park$HNR + park$RPDE + park$DFA + park$PPE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.438 -2.054 -0.435 1.372 12.220
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.141e+01 1.980e+00 -5.764 8.65e-09 ***
## log10(park$age) 9.588e+00 7.402e-01 12.953 < 2e-16 ***
## park$sex -1.621e+00 9.766e-02 -16.597 < 2e-16 ***
## park$test_time 2.314e-03 8.018e-04 2.886 0.003915 **
## park$motor_UPDRS 1.228e+00 5.684e-03 216.090 < 2e-16 ***
## log10(park$Jitter.RAP) 1.293e+00 3.369e-01 3.836 0.000126 ***
## log10(park$Shimmer) -6.771e+00 1.396e+00 -4.851 1.26e-06 ***
## log10(park$Shimmer.APQ5) 9.173e+00 1.258e+00 7.291 3.48e-13 ***
## log10(park$Shimmer.APQ11) -4.481e+00 7.652e-01 -5.856 5.01e-09 ***
## park$HNR -5.154e-02 2.418e-02 -2.132 0.033065 *
## park$RPDE 4.660e+00 5.968e-01 7.808 6.83e-15 ***
## park$DFA -1.662e+00 7.128e-01 -2.332 0.019727 *
## park$PPE -5.071e+00 9.281e-01 -5.464 4.86e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.262 on 5862 degrees of freedom
## Multiple R-squared: 0.9073, Adjusted R-squared: 0.9071
## F-statistic: 4780 on 12 and 5862 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model3, col = "green")
Again, R-square and adjusted R-square values are not changed. Only significant change if observed is in the Residual vs Leverage plot which seems to be uniform now as compare to the same plot of previous model.
So, moving with this model, let’s visualize the density plots of the transformed variables.
par(mfrow = c(2,2))
d <- density(log10(park$age))
plot(d)
polygon(d, col="light green", border="blue")
d <- density(log10(park$JitterPerc))
plot(d)
polygon(d, col="light green", border="blue")
d <- density(log10(park$Shimmer))
plot(d)
polygon(d, col="light green", border="blue")
d <- density(log10(park$NHR))
plot(d)
polygon(d, col="light green", border="blue")
As it can be observed that our transformed variables are normalized now.
Lets try Box-Cox power transformation to make our plot more uniform, we will use Box-Cox method to change the power of our dependent variable to get better linear model:
boxcox(model3)
We see that the value of /lmbda/ is around 0.8, let’s try to apply boxcox again to get the more sense of it.
boxcox(model3, seq(0, 1, 0.1))
This plot now gives value of /lmbda/ around 0.72. Let’s build a new model taking boxcox power result into consideration.
model4 <- lm(park$total_UPDRS^0.72 ~ log10(park$age) + park$sex + park$test_time + park$motor_UPDRS + log10(park$Jitter.RAP) + log10(park$Shimmer) + log10(park$Shimmer.APQ5) + log10(park$Shimmer.APQ11) + park$HNR + park$RPDE + park$DFA + park$PPE)
summary(model4)
##
## Call:
## lm(formula = park$total_UPDRS^0.72 ~ log10(park$age) + park$sex +
## park$test_time + park$motor_UPDRS + log10(park$Jitter.RAP) +
## log10(park$Shimmer) + log10(park$Shimmer.APQ5) + log10(park$Shimmer.APQ11) +
## park$HNR + park$RPDE + park$DFA + park$PPE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2830 -0.5733 -0.0877 0.4341 3.2686
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5303543 0.5473027 -0.969 0.332569
## log10(park$age) 2.9884630 0.2045584 14.609 < 2e-16 ***
## park$sex -0.4398947 0.0269883 -16.299 < 2e-16 ***
## park$test_time 0.0008363 0.0002216 3.775 0.000162 ***
## park$motor_UPDRS 0.3469606 0.0015708 220.876 < 2e-16 ***
## log10(park$Jitter.RAP) 0.4758264 0.0931148 5.110 3.32e-07 ***
## log10(park$Shimmer) -2.3007951 0.3857329 -5.965 2.59e-09 ***
## log10(park$Shimmer.APQ5) 2.5956145 0.3476729 7.466 9.50e-14 ***
## log10(park$Shimmer.APQ11) -1.0708185 0.2114486 -5.064 4.23e-07 ***
## park$HNR -0.0244980 0.0066812 -3.667 0.000248 ***
## park$RPDE 1.7280453 0.1649334 10.477 < 2e-16 ***
## park$DFA -0.8323850 0.1969754 -4.226 2.42e-05 ***
## park$PPE -1.5941375 0.2564721 -6.216 5.46e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9014 on 5862 degrees of freedom
## Multiple R-squared: 0.912, Adjusted R-squared: 0.9118
## F-statistic: 5061 on 12 and 5862 DF, p-value: < 2.2e-16
Now, we can say after analyzing the summary that this model has a good correlation compared to our previous moodel as adjusted R-square value is 91.18%. Additionally, all the covariates are significant in this model.
Let’s try to check the plot of this model:
par(mfrow = c(2,2))
plot(model4, col = "green")
So, Residual vs. Fittedvalues plot of this model is looking more uniform than the previous model. Normal QQ Plot - Since it gives the normality of the response variable, after looking at its plot, we can say that our response variable is normally distributed as standardized residuals and theoretical quantile are coming from the same distribution as both meet almost in a straight line. This is subjective analysis just to check that this model has normalized dependent variable.
Let’s check the multicollinearity:
vif(model4)
## log10(park$age) park$sex
## 1.195083 1.141852
## park$test_time park$motor_UPDRS
## 1.013852 1.178946
## log10(park$Jitter.RAP) log10(park$Shimmer)
## 4.404146 67.166605
## log10(park$Shimmer.APQ5) log10(park$Shimmer.APQ11)
## 61.796667 19.930120
## park$HNR park$RPDE
## 5.942632 2.005697
## park$DFA park$PPE
## 1.410161 3.981401
We see that,‘Shimmer’, ‘Shimmer.APQ5’ and ‘Shimmer.APQ11’ have vif values more than 5. We need to reduce or minimize the correlation between the covariates if there is any, as in fact, we try to move towards orthogonal model where all the covariates are orthogonal to each other or independent of each other.
We will use ridge expression to check the centering of any covariate around ‘0’.
rreg <- lm.ridge(park$total_UPDRS^0.72 ~ log10(park$age) + park$sex + park$test_time + park$motor_UPDRS + log10(park$Jitter.RAP) + log10(park$Shimmer) + log10(park$Shimmer.APQ5) + log10(park$Shimmer.APQ11) + park$HNR + park$RPDE + park$DFA + park$PPE, lambda=seq(0,1,.05))
select(rreg)
## modified HKB estimator is 0.903449
## modified L-W estimator is 0.9673795
## smallest value of GCV at 1
temp <- lm(park$total_UPDRS^0.72 ~ log10(park$age) + park$sex + park$test_time + park$motor_UPDRS + log10(park$Jitter.RAP) + log10(park$Shimmer) + log10(park$Shimmer.APQ5) + log10(park$Shimmer.APQ11)+ park$HNR + park$RPDE + park$DFA + park$PPE, data=park)
As, we see that amongst all the values of estimators, GCV value is the smallest = 1. Now let’s try to plot ridge plot with the values of \(\lambda\) = 0.9.
y <- park[,6]
X0 <- model.matrix(temp)[,-1]
lambda <- seq(0, 1, 0.90)
aridge <- ridge(y, X0, lambda=lambda)
traceplot(aridge)
From the above graph, we are not able to conclude anything about the stability of covariates. Now, to remove multicollinearity, alternate options are to remove those variables in pairs from the model and check the multicollinearity at each step.
model5 <- lm(park$total_UPDRS^0.72 ~ log10(park$age) + park$sex + park$test_time + park$motor_UPDRS + log10(park$Jitter.RAP) + log10(park$Shimmer) + park$RPDE+ park$HNR + park$DFA + park$PPE)
vif(model5)
## log10(park$age) park$sex park$test_time
## 1.164472 1.134534 1.010799
## park$motor_UPDRS log10(park$Jitter.RAP) log10(park$Shimmer)
## 1.149221 3.703187 3.588583
## park$RPDE park$HNR park$DFA
## 1.904503 5.624520 1.319643
## park$PPE
## 3.690454
Multicollinearity issue seems to be resolved but we need to check the summary of this model to check the significancy as we removed some variables.
summary(model5)
##
## Call:
## lm(formula = park$total_UPDRS^0.72 ~ log10(park$age) + park$sex +
## park$test_time + park$motor_UPDRS + log10(park$Jitter.RAP) +
## log10(park$Shimmer) + park$RPDE + park$HNR + park$DFA + park$PPE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.30896 -0.57655 -0.09295 0.42896 3.12476
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1432035 0.4942674 0.290 0.772
## log10(park$age) 2.8550623 0.2030578 14.060 < 2e-16 ***
## park$sex -0.4330832 0.0270531 -16.009 < 2e-16 ***
## park$test_time 0.0009119 0.0002225 4.099 4.21e-05 ***
## park$motor_UPDRS 0.3450480 0.0015596 221.237 < 2e-16 ***
## log10(park$Jitter.RAP) 0.5107844 0.0858643 5.949 2.86e-09 ***
## log10(park$Shimmer) -0.5642509 0.0896620 -6.293 3.34e-10 ***
## park$RPDE 1.4221568 0.1616232 8.799 < 2e-16 ***
## park$HNR -0.0337237 0.0065365 -5.159 2.56e-07 ***
## park$DFA -0.9074851 0.1916210 -4.736 2.23e-06 ***
## park$PPE -1.9998655 0.2483128 -8.054 9.64e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9064 on 5864 degrees of freedom
## Multiple R-squared: 0.9109, Adjusted R-squared: 0.9108
## F-statistic: 5999 on 10 and 5864 DF, p-value: < 2.2e-16
We are compromising the reduction in the value of adjusted R-square from 91.18% to 91.08% to resolve the issue of multicollinearity. Now, multicollinearity from this model is resolved and we see from the summary that all the variables are also significant. So, this model is way better linear than previous models.
Let’s also check the plot of the model:
par(mfrow = c(2,2))
plot(model5, col = "green")
Our model seems to be linearly fit as seen from the residual vs fitted values plot. Norm QQ plot also telling that quantiles and standardized residual are equally normally distributed. Residual vs Leverage plot is also uniform.
Now, let’s test the T-stat and F-stat and the p-values.
summary(model5)
##
## Call:
## lm(formula = park$total_UPDRS^0.72 ~ log10(park$age) + park$sex +
## park$test_time + park$motor_UPDRS + log10(park$Jitter.RAP) +
## log10(park$Shimmer) + park$RPDE + park$HNR + park$DFA + park$PPE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.30896 -0.57655 -0.09295 0.42896 3.12476
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1432035 0.4942674 0.290 0.772
## log10(park$age) 2.8550623 0.2030578 14.060 < 2e-16 ***
## park$sex -0.4330832 0.0270531 -16.009 < 2e-16 ***
## park$test_time 0.0009119 0.0002225 4.099 4.21e-05 ***
## park$motor_UPDRS 0.3450480 0.0015596 221.237 < 2e-16 ***
## log10(park$Jitter.RAP) 0.5107844 0.0858643 5.949 2.86e-09 ***
## log10(park$Shimmer) -0.5642509 0.0896620 -6.293 3.34e-10 ***
## park$RPDE 1.4221568 0.1616232 8.799 < 2e-16 ***
## park$HNR -0.0337237 0.0065365 -5.159 2.56e-07 ***
## park$DFA -0.9074851 0.1916210 -4.736 2.23e-06 ***
## park$PPE -1.9998655 0.2483128 -8.054 9.64e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9064 on 5864 degrees of freedom
## Multiple R-squared: 0.9109, Adjusted R-squared: 0.9108
## F-statistic: 5999 on 10 and 5864 DF, p-value: < 2.2e-16
T test is used to test the slope of the linear regression model, assuming signifance level as 5%: T at significance level = 2.228. If T statistics of the model comes out to be between -2.228 and +2.228, then we can say that the response variable and a regressor varaiable does not have linear relationship between them.
To test the linear relationship between the response variable and other regressor variables: Null Hypothesis is used: The hypothesis are:
H0: \(\beta_0\) = \(\beta_1\) = 0 H1: \(\beta_j\) not equal to 0 for at least one j
1st case is the null Hypothesis. Null hypothesis is a statistical hypothesis that assumes that the observation is due to a chance factor. Thus, the null hypothesis is true if the observed data (in the sample) do not differ from what would be expected based on chance alone. Reject H0 if |t0| > t at significance level
From the observed summary, it can be concluded that null hypothesis can be rejected.
There is another way to check the linear relationship between the response variable and other regressor variables i.e P value.
P-value explains how significant is the linear relationship. if p-value of the regressor variable is less than 0.05 then also we can reject the null hypothesis. From the summary as observed, p-values of all the variables is less < 0.05, so we can reject the null Hypothesis and can conclude that there are linear relationships between ‘total_UPDRS’ and other covariate variables.
As observed from the above summary, we find that F-statistics of the above regression is 5864 and p-value is less than 2.2e-16. Assuming significance level as 5%, if we compare p-value of this regression with the assumed significant value, we can reject the null hypothesis for this model as 2.2e-16<0.05.
For comparing the models, lets use anova method:
anova(model5, model4)
## Analysis of Variance Table
##
## Model 1: park$total_UPDRS^0.72 ~ log10(park$age) + park$sex + park$test_time +
## park$motor_UPDRS + log10(park$Jitter.RAP) + log10(park$Shimmer) +
## park$RPDE + park$HNR + park$DFA + park$PPE
## Model 2: park$total_UPDRS^0.72 ~ log10(park$age) + park$sex + park$test_time +
## park$motor_UPDRS + log10(park$Jitter.RAP) + log10(park$Shimmer) +
## log10(park$Shimmer.APQ5) + log10(park$Shimmer.APQ11) + park$HNR +
## park$RPDE + park$DFA + park$PPE
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5864 4818.1
## 2 5862 4762.7 2 55.397 34.092 1.903e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As it is observed that there is significant change in the new model from the previous model with a p-value of 2.2e-16.
So our final linear model equation:
total_UPDRS^0.72 = 0.1432 + 0.285log10(age) -0.433(sex) + 0.0009(test_time) + 0.345(motor_UPDRS)+ 0.510log10(Jitter.RAP) - 0.564log10(Shimmer) + 1.422(RPDE) -0.033(HNR) - 0.907(DFA) - 1.99(PPE) + \(\epsilon\) $
.