Data Analysis Methods: Final Exam

Team Members:

Mohit Kumar

M-number: M10618998

Divyay Gupta

M-number: M10825479

Sushant Karnik

M-number: M10629662

Aakanksha Joshi

M-number: M10619165

An Introduction to our analysis

Our analysis focuses on the development of an objective, automated method to extract clinically useful information from in the context of Parkinson’s disease (PD).

Parkinson’s disease is a progressive disorder of the nervous system that affects movement. It develops gradually, sometimes starting with a barely noticeable tremor in just one hand. But while a tremor may be the most well-known sign of Parkinson’s disease, the disorder also commonly causes stiffness or slowing of movement. In the early stages of Parkinson’s disease, your face may show little or no expression, or your arms may not swing when you walk. Your speech may become soft or slurred. Parkinson’s disease symptoms worsen as your condition progresses over time. Parkinson’s disease continues to be a serious problem today, with Approximately 60,000 Americans are diagnosed with Parkinson’s disease each year, and this number does not reflect the thousands of cases that go undetected.

More than 10 million people worldwide are living with Parkinson’s disease.

Incidence of Parkinson’s increases with age, but an estimated four percent of people with PD are diagnosed before the age of 50.

Men are one and a half times more likely to have Parkinson’s than women.

Although Parkinson’s disease can’t be cured, medications may markedly improve your symptoms. In occasional cases, the doctor may suggest surgery to regulate certain regions of your brain and improve your symptoms.

We have a dataset for 5875 patients in various stages of Parkinson’s disease and will proceed to analyze the data in order to identify useful relations in the dataset.

SECTION 1: LOADING DATASET AND DATASET SUMMARY

Let’s first load all the required packages required to analyze the dataset

To read the data into R, we will use the “read.csv” command.

#setwd("D:\UC MSIS\Fall Flex 2")
park <- read.csv("Park.csv")
attach(park)

Let’s remove the ‘NA’ values from the dataset:

park <- na.omit(park)

Now, let’s inspect and summarize the data:

head(park,10)

##    Subject.no age sex test_time motor_UPDRS total_UPDRS JitterPerc
## 1           1  72   0    5.6431      28.199      34.398    0.00662
## 2           1  72   0   12.6660      28.447      34.894    0.00300
## 3           1  72   0   19.6810      28.695      35.389    0.00481
## 4           1  72   0   25.6470      28.905      35.810    0.00528
## 5           1  72   0   33.6420      29.187      36.375    0.00335
## 6           1  72   0   40.6520      29.435      36.870    0.00353
## 7           1  72   0   47.6490      29.682      37.363    0.00422
## 8           1  72   0   54.6400      29.928      37.857    0.00476
## 9           1  72   0   61.6690      30.177      38.353    0.00432
## 10          1  72   0   68.6880      30.424      38.849    0.00496
##    Jitter.Abs. Jitter.RAP Jitter.PPQ5 Jitter.DDP Shimmer Shimmer.dB.
## 1     3.38e-05    0.00401     0.00317    0.01204 0.02565       0.230
## 2     1.68e-05    0.00132     0.00150    0.00395 0.02024       0.179
## 3     2.46e-05    0.00205     0.00208    0.00616 0.01675       0.181
## 4     2.66e-05    0.00191     0.00264    0.00573 0.02309       0.327
## 5     2.01e-05    0.00093     0.00130    0.00278 0.01703       0.176
## 6     2.29e-05    0.00119     0.00159    0.00357 0.02227       0.214
## 7     2.40e-05    0.00212     0.00221    0.00637 0.04352       0.445
## 8     2.47e-05    0.00226     0.00259    0.00678 0.02191       0.212
## 9     2.85e-05    0.00156     0.00207    0.00468 0.04296       0.371
## 10    2.70e-05    0.00258     0.00253    0.00773 0.03610       0.310
##    Shimmer.APQ3 Shimmer.APQ5 Shimmer.APQ11 Shimmer.DDA      NHR    HNR
## 1       0.01438      0.01309       0.01662     0.04314 0.014290 21.640
## 2       0.00994      0.01072       0.01689     0.02982 0.011112 27.183
## 3       0.00734      0.00844       0.01458     0.02202 0.020220 23.047
## 4       0.01106      0.01265       0.01963     0.03317 0.027837 24.445
## 5       0.00679      0.00929       0.01819     0.02036 0.011625 26.126
## 6       0.01006      0.01337       0.02263     0.03019 0.009438 22.946
## 7       0.02376      0.02621       0.03488     0.07128 0.013260 22.506
## 8       0.00979      0.01462       0.01911     0.02937 0.027969 22.929
## 9       0.01774      0.02134       0.03451     0.05323 0.013381 22.078
## 10      0.02030      0.01970       0.02569     0.06089 0.018021 22.606
##       RPDE     DFA     PPE
## 1  0.41888 0.54842 0.16006
## 2  0.43493 0.56477 0.10810
## 3  0.46222 0.54405 0.21014
## 4  0.48730 0.57794 0.33277
## 5  0.47188 0.56122 0.19361
## 6  0.53949 0.57243 0.19500
## 7  0.49250 0.54779 0.17563
## 8  0.47712 0.54234 0.23844
## 9  0.51563 0.61864 0.20037
## 10 0.50032 0.58673 0.20117

## To have a look at the structure of the dataset 'park'
str(park)

## 'data.frame':    5875 obs. of  22 variables:
##  $ Subject.no   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ age          : int  72 72 72 72 72 72 72 72 72 72 ...
##  $ sex          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ test_time    : num  5.64 12.67 19.68 25.65 33.64 ...
##  $ motor_UPDRS  : num  28.2 28.4 28.7 28.9 29.2 ...
##  $ total_UPDRS  : num  34.4 34.9 35.4 35.8 36.4 ...
##  $ JitterPerc   : num  0.00662 0.003 0.00481 0.00528 0.00335 0.00353 0.00422 0.00476 0.00432 0.00496 ...
##  $ Jitter.Abs.  : num  3.38e-05 1.68e-05 2.46e-05 2.66e-05 2.01e-05 2.29e-05 2.40e-05 2.47e-05 2.85e-05 2.70e-05 ...
##  $ Jitter.RAP   : num  0.00401 0.00132 0.00205 0.00191 0.00093 0.00119 0.00212 0.00226 0.00156 0.00258 ...
##  $ Jitter.PPQ5  : num  0.00317 0.0015 0.00208 0.00264 0.0013 0.00159 0.00221 0.00259 0.00207 0.00253 ...
##  $ Jitter.DDP   : num  0.01204 0.00395 0.00616 0.00573 0.00278 ...
##  $ Shimmer      : num  0.0256 0.0202 0.0168 0.0231 0.017 ...
##  $ Shimmer.dB.  : num  0.23 0.179 0.181 0.327 0.176 0.214 0.445 0.212 0.371 0.31 ...
##  $ Shimmer.APQ3 : num  0.01438 0.00994 0.00734 0.01106 0.00679 ...
##  $ Shimmer.APQ5 : num  0.01309 0.01072 0.00844 0.01265 0.00929 ...
##  $ Shimmer.APQ11: num  0.0166 0.0169 0.0146 0.0196 0.0182 ...
##  $ Shimmer.DDA  : num  0.0431 0.0298 0.022 0.0332 0.0204 ...
##  $ NHR          : num  0.0143 0.0111 0.0202 0.0278 0.0116 ...
##  $ HNR          : num  21.6 27.2 23 24.4 26.1 ...
##  $ RPDE         : num  0.419 0.435 0.462 0.487 0.472 ...
##  $ DFA          : num  0.548 0.565 0.544 0.578 0.561 ...
##  $ PPE          : num  0.16 0.108 0.21 0.333 0.194 ...

## Summary
summary(park)

##    Subject.no         age            sex           test_time      
##  Min.   : 1.00   Min.   :36.0   Min.   :0.0000   Min.   : -4.263  
##  1st Qu.:10.00   1st Qu.:58.0   1st Qu.:0.0000   1st Qu.: 46.847  
##  Median :22.00   Median :65.0   Median :0.0000   Median : 91.523  
##  Mean   :21.49   Mean   :64.8   Mean   :0.3178   Mean   : 92.864  
##  3rd Qu.:33.00   3rd Qu.:72.0   3rd Qu.:1.0000   3rd Qu.:138.445  
##  Max.   :42.00   Max.   :85.0   Max.   :1.0000   Max.   :215.490  
##   motor_UPDRS      total_UPDRS      JitterPerc        Jitter.Abs.       
##  Min.   : 5.038   Min.   : 7.00   Min.   :0.000830   Min.   :2.250e-06  
##  1st Qu.:15.000   1st Qu.:21.37   1st Qu.:0.003580   1st Qu.:2.240e-05  
##  Median :20.871   Median :27.58   Median :0.004900   Median :3.450e-05  
##  Mean   :21.296   Mean   :29.02   Mean   :0.006154   Mean   :4.403e-05  
##  3rd Qu.:27.596   3rd Qu.:36.40   3rd Qu.:0.006800   3rd Qu.:5.330e-05  
##  Max.   :39.511   Max.   :54.99   Max.   :0.099990   Max.   :4.456e-04  
##    Jitter.RAP        Jitter.PPQ5         Jitter.DDP      
##  Min.   :0.000330   Min.   :0.000430   Min.   :0.000980  
##  1st Qu.:0.001580   1st Qu.:0.001820   1st Qu.:0.004730  
##  Median :0.002250   Median :0.002490   Median :0.006750  
##  Mean   :0.002987   Mean   :0.003277   Mean   :0.008962  
##  3rd Qu.:0.003290   3rd Qu.:0.003460   3rd Qu.:0.009870  
##  Max.   :0.057540   Max.   :0.069560   Max.   :0.172630  
##     Shimmer         Shimmer.dB.     Shimmer.APQ3      Shimmer.APQ5    
##  Min.   :0.00306   Min.   :0.026   Min.   :0.00161   Min.   :0.00194  
##  1st Qu.:0.01912   1st Qu.:0.175   1st Qu.:0.00928   1st Qu.:0.01079  
##  Median :0.02751   Median :0.253   Median :0.01370   Median :0.01594  
##  Mean   :0.03404   Mean   :0.311   Mean   :0.01716   Mean   :0.02014  
##  3rd Qu.:0.03975   3rd Qu.:0.365   3rd Qu.:0.02057   3rd Qu.:0.02375  
##  Max.   :0.26863   Max.   :2.107   Max.   :0.16267   Max.   :0.16702  
##  Shimmer.APQ11      Shimmer.DDA           NHR                HNR        
##  Min.   :0.00249   Min.   :0.00484   Min.   :0.000286   Min.   : 1.659  
##  1st Qu.:0.01566   1st Qu.:0.02783   1st Qu.:0.010955   1st Qu.:19.406  
##  Median :0.02271   Median :0.04111   Median :0.018448   Median :21.920  
##  Mean   :0.02748   Mean   :0.05147   Mean   :0.032120   Mean   :21.680  
##  3rd Qu.:0.03272   3rd Qu.:0.06173   3rd Qu.:0.031463   3rd Qu.:24.444  
##  Max.   :0.27546   Max.   :0.48802   Max.   :0.748260   Max.   :37.875  
##       RPDE             DFA              PPE         
##  Min.   :0.1510   Min.   :0.5140   Min.   :0.02198  
##  1st Qu.:0.4698   1st Qu.:0.5962   1st Qu.:0.15634  
##  Median :0.5423   Median :0.6436   Median :0.20550  
##  Mean   :0.5415   Mean   :0.6532   Mean   :0.21959  
##  3rd Qu.:0.6140   3rd Qu.:0.7113   3rd Qu.:0.26449  
##  Max.   :0.9661   Max.   :0.8656   Max.   :0.73173

So, ‘park’ dataset has 22 variables and 5875 observations. As of now, we have the summary of all the variables.

SECTION 2: EXPLANATION OF ALL ATTRIBUTES IN THE DATASET

We will now list down all the attributes in our dataset and work towards identifying a response variable.

Subject number: Integer that uniquely identifies each subject
Subject Age
Subject gender ‘0’ - male, ‘1’ - female
Test time - Time since recruitment into the trial. The integer part is the number of days since recruitment.

UPDRS: This is a clinician’s scale for recording symptoms related to Parkinson’s disease. The UPDRS metric consists of 44 sections, where each section addresses different symptoms in different parts of the body. Summing up these 44 sections gives rise to the total-UPDRS score, which spans the range 0-176, with 0 representing perfectly healthy individual and 176 total disability.

Motor UPDRS - Clinician’s motor UPDRS score, linearly interpolated - this forms sections 18-44 from the UPDRS sections
Total_UPDRS - Clinician’s total UPDRS score, linearly interpolated - this includes all 44 sections
Jitter Percentage - measure of variation in fundamental frequency
Jitter (Absolute) - measure of variation in fundamental frequency
Jitter (RAP) - measure of variation in fundamental frequency
Jitter (PPQ5) - measure of variation in fundamental frequency
Jitter (DDP) - measure of variation in fundamental frequency

Shimmer - measures of variation in amplitude
Shimmer(dB)- measure of variation in amplitude
Shimmer:APQ3- measure of variation in amplitude
Shimmer:APQ5- measure of variation in amplitude
Shimmer:APQ11- measure of variation in amplitude
Shimmer:DDA- measure of variation in amplitude

NHR: measures of ratio of noise to tonal components in the voice
HNR: measures of ratio of noise to tonal components in the voice

RPDE - A nonlinear dynamical complexity measure
DFA - Signal fractal scaling exponent
PPE - A nonlinear measure of fundamental frequency variation

We will aim to use UDPRS as our response variable.

The aim is twofold: (a) this attribute helps us differentiate between healthy (non-affected) people and people with Parkinsons. (b) replicating the Unified Parkinson’s Disease Rating Scale (UPDRS) metric which provides a clinical impression of PD symptom severity will help enforce the findings in the actual physical interview.

This metric (UPDRS) spans the range 0 to 176 , where 0 denotes a healthy person and 176 total disability. Currently, UPDRS assessment requires the physical presence of the subject in the clinic, is subjective relying on the clinical rater’s expertise, and costly (in terms of time, money, energy and resources). Hence, the practical frequency of symptom tracking is typically confined to once every several months, hindering recruitment for large-scale clinical trials and under-representing the true time scale of PD fluctuations

We will aim to build and refine a model that will enable us to accurately estimate a measure for UPDRS (motor or total).

SECTION 3: INITIAL ANALYSIS AND MODEL BUILDING

As we see that ‘sex’ has ‘0’ and ‘1’ values. We could factor it and it will not affect the dataset. Let’s do the factoring using the is.factor function:

sex <- is.factor(sex)

Let’s see the Histrogram plot of ‘total_UPDRS’:

hist(total_UPDRS, xlab="total_UPDRS",main="Histogram of total_UPDRS", col = blues9)

It seems to be normally distributed and that is a good thing for the model. Otherwise, transformations or other changes would have been required to normalize the dependent variable ‘total_UPDRS’.

Let’s look at the histogram of other variables:

par(mfrow = c(3,3))
hist(age, xlab="age",main="Histogram of age", col = blues9)
hist(test_time, xlab="test_time",main="Histogram of test_time", col = blues9)
hist(motor_UPDRS, xlab="motor_UPDRS",main="Histogram of motor_UPDRS", col = blues9)
hist(JitterPerc, xlab="JitterPerc",main="Histogram of JitterPerc", col = blues9)
hist(Shimmer, xlab="Shimmer",main="Histogram of Shimmer", col = blues9)
hist(NHR, xlab="NHR",main="Histogram of NHR", col = blues9)
hist(RPDE, xlab="RPDE",main="Histogram of RPDE", col = blues9)
hist(DFA, xlab="DFA",main="Histogram of DFA", col = blues9)
hist(PPE, xlab="PPE",main="Histogram of PPE", col = blues9)

From all the observed histograms, it can be concluded that only ‘RPDE’ and ‘PPE’ have the normalized data. ‘JitterPerc’, ‘Shimmer’ and ‘NHR’ have very unnormalized data. It can be improved by transforming these varaibles which will be used in later sections when we have a model giving linear regression.

We plotted Histogram to get the sense of the response variable and the covariates but Histograms can not be a very relaible method for determining the shape of a distribution because it is so strongly affected by the number of bins used. Thus we use the kernel density plots instead which are more effective ways to view the distribution of a variable

Let’s plot the density curves for the variables:

# Dependent variable

par(mfrow = c(1,1))
d <- density(park$total_UPDRS)
plot(d)
polygon(d, col="light green", border="blue")

Density plot of ‘total_UPDRS’ is showing that it is almost normalized variable but not fully normalized. Normalization of the response variable is improved by various methods used in the report later.

Let’s plot the density plots for the covariates to get more sense of the variables:

par(mfrow = c(3,3))
d <- density(park$age)
plot(d)
polygon(d, col="light green", border="blue")

d <- density(park$test_time)
plot(d)
polygon(d, col="light green", border="blue")

d <- density(park$motor_UPDRS)
plot(d)
polygon(d, col="light green", border="blue")

d <- density(park$JitterPerc)
plot(d)
polygon(d, col="light green", border="blue")

d <- density(park$Shimmer)
plot(d)
polygon(d, col="light green", border="blue")

d <- density(park$NHR)
plot(d)
polygon(d, col="light green", border="blue")


d <- density(park$RPDE)
plot(d)
polygon(d, col="light green", border="blue")

d <- density(park$DFA)
plot(d)
polygon(d, col="light green", border="blue")

d <- density(park$PPE)
plot(d)
polygon(d, col="light green", border="blue")

As density plot is also showing that ‘RPDE’, ‘PPE’ have normalized data but not for ‘motor_UPDRS’ which it couldn’t be visible accurately in the Histogram plot.

‘JitterPerc’, ‘Shimmer’ and ‘NHR’ have very unnormalized data. It can be improved by transforming these varaibles which will be used in later sections when we have a model giving linear regression. We will show all these transformed variables after normalizing them.

To find a good linear model, let’s first try to see the correlation between the variables:

library(corrplot)
corrplot(cor(park), method = "color")

library(car)

library(visreg)
library(rgl)
library(knitr)
library(scatterplot3d)
library(MASS)
library(genridge)

## 
## Attaching package: 'genridge'

## The following object is masked from 'package:rgl':
## 
##     plot3d

library(pls)

## 
## Attaching package: 'pls'

## The following object is masked from 'package:corrplot':
## 
##     corrplot

## The following object is masked from 'package:stats':
## 
##     loadings

library(lattice)

As it can be observed that ‘total UPDRS’ has correlation with only ‘motor UPDRS’ which is 95%. If we build a model using these two variables, then our model will have an inherent bias induced into the estimates of coefficients and variance of predicted response and it is not the intention. So, other variables are required to correlate with the response variable and that can be done by improving the model created by these covariates. We will proceed by creating a model using ‘total_UPDRS’ as our response variable and other varaibles as predictor variables.

SECTION 4: APPLYING REGRESSION MODEL

We will proceed to build the linear model using lm() function taking total_UPDRS as the response varaible.

model <- lm(park$total_UPDRS~park$age+park$sex+park$test_time+park$motor_UPDRS+park$Jitter.PPQ5+park$JitterPerc+park$Jitter.RAP+park$Jitter.DDP+park$Shimmer+park$Shimmer.dB.+park$Shimmer.APQ3+park$Shimmer.APQ5+park$Shimmer.APQ11+park$Shimmer.DDA+park$NHR+park$HNR+park$RPDE+park$DFA+park$PPE)
summary(model)

## 
## Call:
## lm(formula = park$total_UPDRS ~ park$age + park$sex + park$test_time + 
##     park$motor_UPDRS + park$Jitter.PPQ5 + park$JitterPerc + park$Jitter.RAP + 
##     park$Jitter.DDP + park$Shimmer + park$Shimmer.dB. + park$Shimmer.APQ3 + 
##     park$Shimmer.APQ5 + park$Shimmer.APQ11 + park$Shimmer.DDA + 
##     park$NHR + park$HNR + park$RPDE + park$DFA + park$PPE)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9283 -2.0790 -0.5001  1.4599 11.4573 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.879e+00  1.082e+00   1.737  0.08249 .  
## park$age            6.843e-02  5.167e-03  13.243  < 2e-16 ***
## park$sex           -1.557e+00  9.886e-02 -15.748  < 2e-16 ***
## park$test_time      2.616e-03  8.029e-04   3.258  0.00113 ** 
## park$motor_UPDRS    1.223e+00  5.666e-03 215.827  < 2e-16 ***
## park$Jitter.PPQ5   -3.789e+01  6.013e+01  -0.630  0.52866    
## park$JitterPerc    -2.158e+02  6.857e+01  -3.147  0.00166 ** 
## park$Jitter.RAP     6.302e+03  1.568e+04   0.402  0.68768    
## park$Jitter.DDP    -1.958e+03  5.226e+03  -0.375  0.70796    
## park$Shimmer       -2.757e+01  2.171e+01  -1.270  0.20411    
## park$Shimmer.dB.   -1.108e+00  1.618e+00  -0.685  0.49366    
## park$Shimmer.APQ3  -1.305e+04  1.575e+04  -0.829  0.40720    
## park$Shimmer.APQ5   8.851e+01  1.853e+01   4.777 1.82e-06 ***
## park$Shimmer.APQ11 -3.911e+01  8.308e+00  -4.707 2.57e-06 ***
## park$Shimmer.DDA    4.347e+03  5.249e+03   0.828  0.40756    
## park$NHR           -1.368e-01  2.025e+00  -0.068  0.94616    
## park$HNR           -1.095e-01  2.320e-02  -4.722 2.39e-06 ***
## park$RPDE           3.696e+00  6.039e-01   6.121 9.90e-10 ***
## park$DFA           -1.855e+00  7.678e-01  -2.416  0.01573 *  
## park$PPE           -2.625e+00  9.446e-01  -2.779  0.00547 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.262 on 5855 degrees of freedom
## Multiple R-squared:  0.9074, Adjusted R-squared:  0.9071 
## F-statistic:  3019 on 19 and 5855 DF,  p-value: < 2.2e-16

R-square of the above model is 90.74% and adjusted R-square is 90.71% which shows good correlation in the model. But as, we can see that there are many insignificant predictor varaibles in this model, we will try to take best subset of the above model for best fit. Step AIC is used to get the subset of the model. Step AIC method will evaluate the model by checking AIC values for each varaible and try to reduce AIC of the model by removing the variables of high AIC.

step <- stepAIC(model, direction = "both")

## Start:  AIC=13910.77
## park$total_UPDRS ~ park$age + park$sex + park$test_time + park$motor_UPDRS + 
##     park$Jitter.PPQ5 + park$JitterPerc + park$Jitter.RAP + park$Jitter.DDP + 
##     park$Shimmer + park$Shimmer.dB. + park$Shimmer.APQ3 + park$Shimmer.APQ5 + 
##     park$Shimmer.APQ11 + park$Shimmer.DDA + park$NHR + park$HNR + 
##     park$RPDE + park$DFA + park$PPE
## 
##                      Df Sum of Sq    RSS   AIC
## - park$NHR            1         0  62283 13909
## - park$Jitter.DDP     1         1  62284 13909
## - park$Jitter.RAP     1         2  62285 13909
## - park$Jitter.PPQ5    1         4  62287 13909
## - park$Shimmer.dB.    1         5  62288 13909
## - park$Shimmer.DDA    1         7  62290 13910
## - park$Shimmer.APQ3   1         7  62290 13910
## - park$Shimmer        1        17  62300 13910
## <none>                             62283 13911
## - park$DFA            1        62  62345 13915
## - park$PPE            1        82  62365 13916
## - park$JitterPerc     1       105  62388 13919
## - park$test_time      1       113  62396 13919
## - park$Shimmer.APQ11  1       236  62519 13931
## - park$HNR            1       237  62520 13931
## - park$Shimmer.APQ5   1       243  62526 13932
## - park$RPDE           1       399  62682 13946
## - park$age            1      1866  64149 14082
## - park$sex            1      2638  64921 14152
## - park$motor_UPDRS    1    495511 557794 26789
## 
## Step:  AIC=13908.77
## park$total_UPDRS ~ park$age + park$sex + park$test_time + park$motor_UPDRS + 
##     park$Jitter.PPQ5 + park$JitterPerc + park$Jitter.RAP + park$Jitter.DDP + 
##     park$Shimmer + park$Shimmer.dB. + park$Shimmer.APQ3 + park$Shimmer.APQ5 + 
##     park$Shimmer.APQ11 + park$Shimmer.DDA + park$HNR + park$RPDE + 
##     park$DFA + park$PPE
## 
##                      Df Sum of Sq    RSS   AIC
## - park$Jitter.DDP     1         1  62285 13907
## - park$Jitter.RAP     1         2  62285 13907
## - park$Jitter.PPQ5    1         5  62288 13907
## - park$Shimmer.dB.    1         5  62288 13907
## - park$Shimmer.DDA    1         7  62290 13908
## - park$Shimmer.APQ3   1         7  62290 13908
## - park$Shimmer        1        18  62301 13908
## <none>                             62283 13909
## + park$NHR            1         0  62283 13911
## - park$DFA            1        70  62353 13913
## - park$PPE            1        83  62366 13915
## - park$JitterPerc     1       108  62391 13917
## - park$test_time      1       113  62396 13917
## - park$HNR            1       243  62526 13930
## - park$Shimmer.APQ5   1       250  62534 13930
## - park$Shimmer.APQ11  1       258  62541 13931
## - park$RPDE           1       399  62682 13944
## - park$age            1      1885  64168 14082
## - park$sex            1      2697  64980 14156
## - park$motor_UPDRS    1    497188 559471 26804
## 
## Step:  AIC=13906.92
## park$total_UPDRS ~ park$age + park$sex + park$test_time + park$motor_UPDRS + 
##     park$Jitter.PPQ5 + park$JitterPerc + park$Jitter.RAP + park$Shimmer + 
##     park$Shimmer.dB. + park$Shimmer.APQ3 + park$Shimmer.APQ5 + 
##     park$Shimmer.APQ11 + park$Shimmer.DDA + park$HNR + park$RPDE + 
##     park$DFA + park$PPE
## 
##                      Df Sum of Sq    RSS   AIC
## - park$Jitter.PPQ5    1         5  62289 13905
## - park$Shimmer.dB.    1         5  62290 13905
## - park$Shimmer.DDA    1         7  62292 13906
## - park$Shimmer.APQ3   1         7  62292 13906
## - park$Shimmer        1        18  62303 13907
## <none>                             62285 13907
## + park$Jitter.DDP     1         1  62283 13909
## + park$NHR            1         0  62284 13909
## - park$DFA            1        70  62355 13912
## - park$PPE            1        82  62367 13913
## - park$JitterPerc     1       108  62392 13915
## - park$test_time      1       113  62397 13916
## - park$Jitter.RAP     1       223  62507 13926
## - park$HNR            1       243  62527 13928
## - park$Shimmer.APQ5   1       252  62537 13929
## - park$Shimmer.APQ11  1       258  62543 13929
## - park$RPDE           1       401  62685 13943
## - park$age            1      1886  64170 14080
## - park$sex            1      2696  64980 14154
## - park$motor_UPDRS    1    497249 559533 26803
## 
## Step:  AIC=13905.36
## park$total_UPDRS ~ park$age + park$sex + park$test_time + park$motor_UPDRS + 
##     park$JitterPerc + park$Jitter.RAP + park$Shimmer + park$Shimmer.dB. + 
##     park$Shimmer.APQ3 + park$Shimmer.APQ5 + park$Shimmer.APQ11 + 
##     park$Shimmer.DDA + park$HNR + park$RPDE + park$DFA + park$PPE
## 
##                      Df Sum of Sq    RSS   AIC
## - park$Shimmer.dB.    1         4  62293 13904
## - park$Shimmer.DDA    1         7  62297 13904
## - park$Shimmer.APQ3   1         7  62297 13904
## <none>                             62289 13905
## - park$Shimmer        1        21  62310 13905
## + park$Jitter.PPQ5    1         5  62285 13907
## + park$Jitter.DDP     1         2  62288 13907
## + park$NHR            1         0  62289 13907
## - park$DFA            1        68  62357 13910
## - park$PPE            1        79  62368 13911
## - park$test_time      1       112  62401 13914
## - park$JitterPerc     1       185  62474 13921
## - park$Jitter.RAP     1       223  62512 13924
## - park$HNR            1       239  62528 13926
## - park$Shimmer.APQ11  1       255  62544 13927
## - park$Shimmer.APQ5   1       272  62561 13929
## - park$RPDE           1       428  62717 13944
## - park$age            1      1886  64175 14079
## - park$sex            1      2701  64990 14153
## - park$motor_UPDRS    1    497258 559547 26801
## 
## Step:  AIC=13903.75
## park$total_UPDRS ~ park$age + park$sex + park$test_time + park$motor_UPDRS + 
##     park$JitterPerc + park$Jitter.RAP + park$Shimmer + park$Shimmer.APQ3 + 
##     park$Shimmer.APQ5 + park$Shimmer.APQ11 + park$Shimmer.DDA + 
##     park$HNR + park$RPDE + park$DFA + park$PPE
## 
##                      Df Sum of Sq    RSS   AIC
## - park$Shimmer.DDA    1         7  62301 13902
## - park$Shimmer.APQ3   1         7  62301 13902
## <none>                             62293 13904
## + park$Shimmer.dB.    1         4  62289 13905
## + park$Jitter.PPQ5    1         4  62290 13905
## + park$Jitter.DDP     1         2  62292 13906
## + park$NHR            1         1  62293 13906
## - park$DFA            1        64  62358 13908
## - park$Shimmer        1        65  62358 13908
## - park$PPE            1        88  62381 13910
## - park$test_time      1       111  62404 13912
## - park$JitterPerc     1       191  62484 13920
## - park$Jitter.RAP     1       228  62521 13923
## - park$HNR            1       239  62533 13924
## - park$Shimmer.APQ11  1       272  62565 13927
## - park$Shimmer.APQ5   1       281  62574 13928
## - park$RPDE           1       433  62726 13942
## - park$age            1      1882  64176 14077
## - park$sex            1      2705  64999 14152
## - park$motor_UPDRS    1    497558 559852 26802
## 
## Step:  AIC=13902.43
## park$total_UPDRS ~ park$age + park$sex + park$test_time + park$motor_UPDRS + 
##     park$JitterPerc + park$Jitter.RAP + park$Shimmer + park$Shimmer.APQ3 + 
##     park$Shimmer.APQ5 + park$Shimmer.APQ11 + park$HNR + park$RPDE + 
##     park$DFA + park$PPE
## 
##                      Df Sum of Sq    RSS   AIC
## - park$Shimmer.APQ3   1         0  62301 13900
## <none>                             62301 13902
## + park$Shimmer.DDA    1         7  62293 13904
## + park$Shimmer.dB.    1         4  62297 13904
## + park$Jitter.PPQ5    1         4  62297 13904
## + park$Jitter.DDP     1         2  62299 13904
## + park$NHR            1         1  62300 13904
## - park$DFA            1        64  62365 13906
## - park$Shimmer        1        65  62366 13907
## - park$PPE            1        88  62389 13909
## - park$test_time      1       112  62412 13911
## - park$JitterPerc     1       190  62490 13918
## - park$Jitter.RAP     1       227  62528 13922
## - park$HNR            1       241  62541 13923
## - park$Shimmer.APQ11  1       272  62573 13926
## - park$Shimmer.APQ5   1       282  62583 13927
## - park$RPDE           1       430  62730 13941
## - park$age            1      1884  64185 14076
## - park$sex            1      2706  65006 14150
## - park$motor_UPDRS    1    497568 559869 26800
## 
## Step:  AIC=13900.47
## park$total_UPDRS ~ park$age + park$sex + park$test_time + park$motor_UPDRS + 
##     park$JitterPerc + park$Jitter.RAP + park$Shimmer + park$Shimmer.APQ5 + 
##     park$Shimmer.APQ11 + park$HNR + park$RPDE + park$DFA + park$PPE
## 
##                      Df Sum of Sq    RSS   AIC
## <none>                             62301 13900
## + park$Shimmer.dB.    1         4  62297 13902
## + park$Jitter.PPQ5    1         3  62298 13902
## + park$Jitter.DDP     1         2  62299 13902
## + park$Shimmer.APQ3   1         0  62301 13902
## + park$Shimmer.DDA    1         0  62301 13902
## + park$NHR            1         0  62301 13902
## - park$DFA            1        65  62366 13905
## - park$PPE            1        90  62391 13907
## - park$test_time      1       111  62412 13909
## - park$Shimmer        1       167  62468 13914
## - park$JitterPerc     1       223  62524 13919
## - park$HNR            1       241  62542 13921
## - park$Jitter.RAP     1       257  62558 13923
## - park$Shimmer.APQ5   1       287  62588 13925
## - park$Shimmer.APQ11  1       335  62636 13930
## - park$RPDE           1       432  62733 13939
## - park$age            1      1889  64190 14074
## - park$sex            1      2766  65067 14154
## - park$motor_UPDRS    1    498032 560333 26803

step$anova

## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## park$total_UPDRS ~ park$age + park$sex + park$test_time + park$motor_UPDRS + 
##     park$Jitter.PPQ5 + park$JitterPerc + park$Jitter.RAP + park$Jitter.DDP + 
##     park$Shimmer + park$Shimmer.dB. + park$Shimmer.APQ3 + park$Shimmer.APQ5 + 
##     park$Shimmer.APQ11 + park$Shimmer.DDA + park$NHR + park$HNR + 
##     park$RPDE + park$DFA + park$PPE
## 
## Final Model:
## park$total_UPDRS ~ park$age + park$sex + park$test_time + park$motor_UPDRS + 
##     park$JitterPerc + park$Jitter.RAP + park$Shimmer + park$Shimmer.APQ5 + 
##     park$Shimmer.APQ11 + park$HNR + park$RPDE + park$DFA + park$PPE
## 
## 
##                  Step Df   Deviance Resid. Df Resid. Dev      AIC
## 1                                        5855   62282.98 13910.77
## 2          - park$NHR  1 0.04850983      5856   62283.03 13908.77
## 3   - park$Jitter.DDP  1 1.49780653      5857   62284.53 13906.92
## 4  - park$Jitter.PPQ5  1 4.65695110      5858   62289.19 13905.36
## 5  - park$Shimmer.dB.  1 4.15599272      5859   62293.34 13903.75
## 6  - park$Shimmer.DDA  1 7.26853396      5860   62300.61 13902.43
## 7 - park$Shimmer.APQ3  1 0.36606694      5861   62300.98 13900.47

We come out with the best subset for our model, let’s create another model with this subset in the next section:

SECTION 5: REFINING THE MODEL ITERATIVELY

model1 <- lm(park$total_UPDRS ~ park$age + park$sex + park$test_time + park$motor_UPDRS + park$JitterPerc + park$Jitter.RAP + park$Shimmer + park$Shimmer.APQ5 + park$Shimmer.APQ11 + park$HNR + park$RPDE + park$DFA + park$PPE)
summary(model1)

## 
## Call:
## lm(formula = park$total_UPDRS ~ park$age + park$sex + park$test_time + 
##     park$motor_UPDRS + park$JitterPerc + park$Jitter.RAP + park$Shimmer + 
##     park$Shimmer.APQ5 + park$Shimmer.APQ11 + park$HNR + park$RPDE + 
##     park$DFA + park$PPE)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9924 -2.0833 -0.5047  1.4742 11.4537 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.754e+00  1.043e+00   1.682  0.09262 .  
## park$age            6.835e-02  5.127e-03  13.331  < 2e-16 ***
## park$sex           -1.557e+00  9.652e-02 -16.130  < 2e-16 ***
## park$test_time      2.593e-03  8.015e-04   3.236  0.00122 ** 
## park$motor_UPDRS    1.223e+00  5.650e-03 216.455  < 2e-16 ***
## park$JitterPerc    -2.379e+02  5.199e+01  -4.576 4.84e-06 ***
## park$Jitter.RAP     4.271e+02  8.688e+01   4.917 9.05e-07 ***
## park$Shimmer       -4.110e+01  1.038e+01  -3.958 7.64e-05 ***
## park$Shimmer.APQ5   8.421e+01  1.622e+01   5.192 2.15e-07 ***
## park$Shimmer.APQ11 -3.798e+01  6.769e+00  -5.610 2.11e-08 ***
## park$HNR           -1.082e-01  2.273e-02  -4.759 1.99e-06 ***
## park$RPDE           3.775e+00  5.919e-01   6.378 1.94e-10 ***
## park$DFA           -1.739e+00  7.014e-01  -2.479  0.01321 *  
## park$PPE           -2.655e+00  9.148e-01  -2.902  0.00372 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.26 on 5861 degrees of freedom
## Multiple R-squared:  0.9074, Adjusted R-squared:  0.9072 
## F-statistic:  4416 on 13 and 5861 DF,  p-value: < 2.2e-16

It looks like all the variables are significant. This model seems to be fittest model as R-square is 90.74% and Adjusted R-square is also high equal to 90.72% which show a good correlation. Now let’s try to plot this model.

par(mfrow = c(2,2))
plot(model1, col = "green")

Residual vs. Fitted values Plot - This plot is almost uniform which explains a good linearlity of the model. There are some outliers that cannot be removed because they contain important information. Another thing that this is graph is depicting is some constant variance of the model which is good for linear regression.

Normal QQ - Since it gives the normality of the response variable, after looking at its plot, we can say that our response variable is almost normally distributed as standardized residuals and theoretical quantile are coming from the same distribution as both meet almost in a straight line. This is subjective analysis just to check that this model has normalized dependent variable.

Checking for Leverage point and DFFITS: Leverage point , p = 13, n= 5875, so to check for leverage point, 2p/n is calculated and compared with hii. There are many points in the dataset whose hii > 2p/n. Similarly, for DFFITS, 2/sqrt(n) is calculated and compared with |DFFITS| for every point, if it is greater than 2/sqrt(n), then that point is considered to be influential.

Our model is best fitted but its variables are not normalized as we observed in the upper section when we checked their histogram and density plots. Lets transform the variables or data to get them normalized and look for a better model.

model2 <- lm(park$total_UPDRS ~ log10(park$age) + park$sex + park$test_time + park$motor_UPDRS +log10(park$JitterPerc) + log10(park$Jitter.RAP) + log10(park$Shimmer) + log10(park$Shimmer.APQ5) + log10(park$Shimmer.APQ11)  + park$HNR + park$RPDE + park$DFA + park$PPE)
summary(model2)

## 
## Call:
## lm(formula = park$total_UPDRS ~ log10(park$age) + park$sex + 
##     park$test_time + park$motor_UPDRS + log10(park$JitterPerc) + 
##     log10(park$Jitter.RAP) + log10(park$Shimmer) + log10(park$Shimmer.APQ5) + 
##     log10(park$Shimmer.APQ11) + park$HNR + park$RPDE + park$DFA + 
##     park$PPE)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4338 -2.0648 -0.4393  1.3923 12.0127 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -1.240e+01  2.053e+00  -6.040 1.64e-09 ***
## log10(park$age)            9.636e+00  7.405e-01  13.012  < 2e-16 ***
## park$sex                  -1.657e+00  9.969e-02 -16.626  < 2e-16 ***
## park$test_time             2.369e-03  8.022e-04   2.953 0.003161 ** 
## park$motor_UPDRS           1.228e+00  5.691e-03 215.731  < 2e-16 ***
## log10(park$JitterPerc)    -1.740e+00  9.573e-01  -1.817 0.069207 .  
## log10(park$Jitter.RAP)     2.538e+00  7.635e-01   3.324 0.000893 ***
## log10(park$Shimmer)       -6.651e+00  1.397e+00  -4.760 1.98e-06 ***
## log10(park$Shimmer.APQ5)   8.750e+00  1.279e+00   6.840 8.73e-12 ***
## log10(park$Shimmer.APQ11) -4.184e+00  7.822e-01  -5.349 9.19e-08 ***
## park$HNR                  -5.694e-02  2.435e-02  -2.338 0.019428 *  
## park$RPDE                  4.830e+00  6.040e-01   7.997 1.53e-15 ***
## park$DFA                  -1.706e+00  7.131e-01  -2.393 0.016741 *  
## park$PPE                  -4.317e+00  1.016e+00  -4.247 2.20e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.261 on 5861 degrees of freedom
## Multiple R-squared:  0.9073, Adjusted R-squared:  0.9071 
## F-statistic:  4414 on 13 and 5861 DF,  p-value: < 2.2e-16

par(mfrow = c(2,2))
plot(model2, col = "green")

After looking at the summary of the model2, there are two variables observed that are insignificant. R-square and adjusted R-square values are not changed. Only significant change if observed is in the Residual vs Leverage plot which seems to be uniform now as compare to the same plot of previous model.

Let’s try to remove the insignificant variables and create another model:

model3 <- lm(park$total_UPDRS ~ log10(park$age) + park$sex + park$test_time + park$motor_UPDRS  + log10(park$Jitter.RAP) + log10(park$Shimmer) + log10(park$Shimmer.APQ5) + log10(park$Shimmer.APQ11)+ park$HNR + park$RPDE + park$DFA + park$PPE)
summary(model3)

## 
## Call:
## lm(formula = park$total_UPDRS ~ log10(park$age) + park$sex + 
##     park$test_time + park$motor_UPDRS + log10(park$Jitter.RAP) + 
##     log10(park$Shimmer) + log10(park$Shimmer.APQ5) + log10(park$Shimmer.APQ11) + 
##     park$HNR + park$RPDE + park$DFA + park$PPE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.438 -2.054 -0.435  1.372 12.220 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -1.141e+01  1.980e+00  -5.764 8.65e-09 ***
## log10(park$age)            9.588e+00  7.402e-01  12.953  < 2e-16 ***
## park$sex                  -1.621e+00  9.766e-02 -16.597  < 2e-16 ***
## park$test_time             2.314e-03  8.018e-04   2.886 0.003915 ** 
## park$motor_UPDRS           1.228e+00  5.684e-03 216.090  < 2e-16 ***
## log10(park$Jitter.RAP)     1.293e+00  3.369e-01   3.836 0.000126 ***
## log10(park$Shimmer)       -6.771e+00  1.396e+00  -4.851 1.26e-06 ***
## log10(park$Shimmer.APQ5)   9.173e+00  1.258e+00   7.291 3.48e-13 ***
## log10(park$Shimmer.APQ11) -4.481e+00  7.652e-01  -5.856 5.01e-09 ***
## park$HNR                  -5.154e-02  2.418e-02  -2.132 0.033065 *  
## park$RPDE                  4.660e+00  5.968e-01   7.808 6.83e-15 ***
## park$DFA                  -1.662e+00  7.128e-01  -2.332 0.019727 *  
## park$PPE                  -5.071e+00  9.281e-01  -5.464 4.86e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.262 on 5862 degrees of freedom
## Multiple R-squared:  0.9073, Adjusted R-squared:  0.9071 
## F-statistic:  4780 on 12 and 5862 DF,  p-value: < 2.2e-16

par(mfrow = c(2,2))
plot(model3, col = "green")

Again, R-square and adjusted R-square values are not changed. Only significant change if observed is in the Residual vs Leverage plot which seems to be uniform now as compare to the same plot of previous model.

So, moving with this model, let’s visualize the density plots of the transformed variables.

par(mfrow = c(2,2))
d <- density(log10(park$age))
plot(d)
polygon(d, col="light green", border="blue")

d <- density(log10(park$JitterPerc))
plot(d)
polygon(d, col="light green", border="blue")

d <- density(log10(park$Shimmer))
plot(d)
polygon(d, col="light green", border="blue")

d <- density(log10(park$NHR))
plot(d)
polygon(d, col="light green", border="blue")

As it can be observed that our transformed variables are normalized now.

SECTION 6: APPLYING TRANSFORMATIONS TO CHECK THE MODEL

Lets try Box-Cox power transformation to make our plot more uniform, we will use Box-Cox method to change the power of our dependent variable to get better linear model:

boxcox(model3)

We see that the value of /lmbda/ is around 0.8, let’s try to apply boxcox again to get the more sense of it.

boxcox(model3, seq(0, 1, 0.1))

This plot now gives value of /lmbda/ around 0.72. Let’s build a new model taking boxcox power result into consideration.

model4 <- lm(park$total_UPDRS^0.72 ~ log10(park$age) + park$sex + park$test_time + park$motor_UPDRS  + log10(park$Jitter.RAP) + log10(park$Shimmer) + log10(park$Shimmer.APQ5) + log10(park$Shimmer.APQ11) + park$HNR +  park$RPDE + park$DFA + park$PPE)
summary(model4)

## 
## Call:
## lm(formula = park$total_UPDRS^0.72 ~ log10(park$age) + park$sex + 
##     park$test_time + park$motor_UPDRS + log10(park$Jitter.RAP) + 
##     log10(park$Shimmer) + log10(park$Shimmer.APQ5) + log10(park$Shimmer.APQ11) + 
##     park$HNR + park$RPDE + park$DFA + park$PPE)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2830 -0.5733 -0.0877  0.4341  3.2686 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -0.5303543  0.5473027  -0.969 0.332569    
## log10(park$age)            2.9884630  0.2045584  14.609  < 2e-16 ***
## park$sex                  -0.4398947  0.0269883 -16.299  < 2e-16 ***
## park$test_time             0.0008363  0.0002216   3.775 0.000162 ***
## park$motor_UPDRS           0.3469606  0.0015708 220.876  < 2e-16 ***
## log10(park$Jitter.RAP)     0.4758264  0.0931148   5.110 3.32e-07 ***
## log10(park$Shimmer)       -2.3007951  0.3857329  -5.965 2.59e-09 ***
## log10(park$Shimmer.APQ5)   2.5956145  0.3476729   7.466 9.50e-14 ***
## log10(park$Shimmer.APQ11) -1.0708185  0.2114486  -5.064 4.23e-07 ***
## park$HNR                  -0.0244980  0.0066812  -3.667 0.000248 ***
## park$RPDE                  1.7280453  0.1649334  10.477  < 2e-16 ***
## park$DFA                  -0.8323850  0.1969754  -4.226 2.42e-05 ***
## park$PPE                  -1.5941375  0.2564721  -6.216 5.46e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9014 on 5862 degrees of freedom
## Multiple R-squared:  0.912,  Adjusted R-squared:  0.9118 
## F-statistic:  5061 on 12 and 5862 DF,  p-value: < 2.2e-16

Now, we can say after analyzing the summary that this model has a good correlation compared to our previous moodel as adjusted R-square value is 91.18%. Additionally, all the covariates are significant in this model.

Let’s try to check the plot of this model:

par(mfrow = c(2,2))
plot(model4, col = "green")

So, Residual vs. Fittedvalues plot of this model is looking more uniform than the previous model. Normal QQ Plot - Since it gives the normality of the response variable, after looking at its plot, we can say that our response variable is normally distributed as standardized residuals and theoretical quantile are coming from the same distribution as both meet almost in a straight line. This is subjective analysis just to check that this model has normalized dependent variable.

Let’s check the multicollinearity:

vif(model4)

##           log10(park$age)                  park$sex 
##                  1.195083                  1.141852 
##            park$test_time          park$motor_UPDRS 
##                  1.013852                  1.178946 
##    log10(park$Jitter.RAP)       log10(park$Shimmer) 
##                  4.404146                 67.166605 
##  log10(park$Shimmer.APQ5) log10(park$Shimmer.APQ11) 
##                 61.796667                 19.930120 
##                  park$HNR                 park$RPDE 
##                  5.942632                  2.005697 
##                  park$DFA                  park$PPE 
##                  1.410161                  3.981401

We see that,‘Shimmer’, ‘Shimmer.APQ5’ and ‘Shimmer.APQ11’ have vif values more than 5. We need to reduce or minimize the correlation between the covariates if there is any, as in fact, we try to move towards orthogonal model where all the covariates are orthogonal to each other or independent of each other.

We will use ridge expression to check the centering of any covariate around ‘0’.

rreg <- lm.ridge(park$total_UPDRS^0.72 ~ log10(park$age) + park$sex + park$test_time + park$motor_UPDRS  + log10(park$Jitter.RAP) + log10(park$Shimmer) + log10(park$Shimmer.APQ5) + log10(park$Shimmer.APQ11) + park$HNR +  park$RPDE + park$DFA + park$PPE, lambda=seq(0,1,.05))
select(rreg)

## modified HKB estimator is 0.903449 
## modified L-W estimator is 0.9673795 
## smallest value of GCV  at 1

temp <- lm(park$total_UPDRS^0.72 ~ log10(park$age) + park$sex + park$test_time + park$motor_UPDRS  + log10(park$Jitter.RAP) + log10(park$Shimmer) + log10(park$Shimmer.APQ5) + log10(park$Shimmer.APQ11)+ park$HNR  +  park$RPDE + park$DFA + park$PPE, data=park)

As, we see that amongst all the values of estimators, GCV value is the smallest = 1. Now let’s try to plot ridge plot with the values of $\lambda$ = 0.9.

y <- park[,6]
X0 <- model.matrix(temp)[,-1]
lambda <- seq(0, 1, 0.90)
aridge <- ridge(y, X0, lambda=lambda)
traceplot(aridge)

From the above graph, we are not able to conclude anything about the stability of covariates. Now, to remove multicollinearity, alternate options are to remove those variables in pairs from the model and check the multicollinearity at each step.

model5 <- lm(park$total_UPDRS^0.72 ~ log10(park$age) + park$sex + park$test_time + park$motor_UPDRS  + log10(park$Jitter.RAP) + log10(park$Shimmer) + park$RPDE+ park$HNR  + park$DFA + park$PPE)
vif(model5)

##        log10(park$age)               park$sex         park$test_time 
##               1.164472               1.134534               1.010799 
##       park$motor_UPDRS log10(park$Jitter.RAP)    log10(park$Shimmer) 
##               1.149221               3.703187               3.588583 
##              park$RPDE               park$HNR               park$DFA 
##               1.904503               5.624520               1.319643 
##               park$PPE 
##               3.690454

Multicollinearity issue seems to be resolved but we need to check the summary of this model to check the significancy as we removed some variables.

summary(model5)

## 
## Call:
## lm(formula = park$total_UPDRS^0.72 ~ log10(park$age) + park$sex + 
##     park$test_time + park$motor_UPDRS + log10(park$Jitter.RAP) + 
##     log10(park$Shimmer) + park$RPDE + park$HNR + park$DFA + park$PPE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.30896 -0.57655 -0.09295  0.42896  3.12476 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.1432035  0.4942674   0.290    0.772    
## log10(park$age)         2.8550623  0.2030578  14.060  < 2e-16 ***
## park$sex               -0.4330832  0.0270531 -16.009  < 2e-16 ***
## park$test_time          0.0009119  0.0002225   4.099 4.21e-05 ***
## park$motor_UPDRS        0.3450480  0.0015596 221.237  < 2e-16 ***
## log10(park$Jitter.RAP)  0.5107844  0.0858643   5.949 2.86e-09 ***
## log10(park$Shimmer)    -0.5642509  0.0896620  -6.293 3.34e-10 ***
## park$RPDE               1.4221568  0.1616232   8.799  < 2e-16 ***
## park$HNR               -0.0337237  0.0065365  -5.159 2.56e-07 ***
## park$DFA               -0.9074851  0.1916210  -4.736 2.23e-06 ***
## park$PPE               -1.9998655  0.2483128  -8.054 9.64e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9064 on 5864 degrees of freedom
## Multiple R-squared:  0.9109, Adjusted R-squared:  0.9108 
## F-statistic:  5999 on 10 and 5864 DF,  p-value: < 2.2e-16

We are compromising the reduction in the value of adjusted R-square from 91.18% to 91.08% to resolve the issue of multicollinearity. Now, multicollinearity from this model is resolved and we see from the summary that all the variables are also significant. So, this model is way better linear than previous models.

Let’s also check the plot of the model:

par(mfrow = c(2,2))
plot(model5, col = "green")

SECTION 7: MODEL VERIFICATION

Our model seems to be linearly fit as seen from the residual vs fitted values plot. Norm QQ plot also telling that quantiles and standardized residual are equally normally distributed. Residual vs Leverage plot is also uniform.

Now, let’s test the T-stat and F-stat and the p-values.

summary(model5)

## 
## Call:
## lm(formula = park$total_UPDRS^0.72 ~ log10(park$age) + park$sex + 
##     park$test_time + park$motor_UPDRS + log10(park$Jitter.RAP) + 
##     log10(park$Shimmer) + park$RPDE + park$HNR + park$DFA + park$PPE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.30896 -0.57655 -0.09295  0.42896  3.12476 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.1432035  0.4942674   0.290    0.772    
## log10(park$age)         2.8550623  0.2030578  14.060  < 2e-16 ***
## park$sex               -0.4330832  0.0270531 -16.009  < 2e-16 ***
## park$test_time          0.0009119  0.0002225   4.099 4.21e-05 ***
## park$motor_UPDRS        0.3450480  0.0015596 221.237  < 2e-16 ***
## log10(park$Jitter.RAP)  0.5107844  0.0858643   5.949 2.86e-09 ***
## log10(park$Shimmer)    -0.5642509  0.0896620  -6.293 3.34e-10 ***
## park$RPDE               1.4221568  0.1616232   8.799  < 2e-16 ***
## park$HNR               -0.0337237  0.0065365  -5.159 2.56e-07 ***
## park$DFA               -0.9074851  0.1916210  -4.736 2.23e-06 ***
## park$PPE               -1.9998655  0.2483128  -8.054 9.64e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9064 on 5864 degrees of freedom
## Multiple R-squared:  0.9109, Adjusted R-squared:  0.9108 
## F-statistic:  5999 on 10 and 5864 DF,  p-value: < 2.2e-16

T test is used to test the slope of the linear regression model, assuming signifance level as 5%: T at significance level = 2.228. If T statistics of the model comes out to be between -2.228 and +2.228, then we can say that the response variable and a regressor varaiable does not have linear relationship between them.

To test the linear relationship between the response variable and other regressor variables: Null Hypothesis is used: The hypothesis are:

H0: $\beta_0$ = $\beta_1$ = 0 H1: $\beta_j$ not equal to 0 for at least one j

1st case is the null Hypothesis. Null hypothesis is a statistical hypothesis that assumes that the observation is due to a chance factor. Thus, the null hypothesis is true if the observed data (in the sample) do not differ from what would be expected based on chance alone. Reject H0 if |t0| > t at significance level

From the observed summary, it can be concluded that null hypothesis can be rejected.

There is another way to check the linear relationship between the response variable and other regressor variables i.e P value.

P-value explains how significant is the linear relationship. if p-value of the regressor variable is less than 0.05 then also we can reject the null hypothesis. From the summary as observed, p-values of all the variables is less < 0.05, so we can reject the null Hypothesis and can conclude that there are linear relationships between ‘total_UPDRS’ and other covariate variables.

As observed from the above summary, we find that F-statistics of the above regression is 5864 and p-value is less than 2.2e-16. Assuming significance level as 5%, if we compare p-value of this regression with the assumed significant value, we can reject the null hypothesis for this model as 2.2e-16<0.05.

For comparing the models, lets use anova method:

anova(model5, model4)

## Analysis of Variance Table
## 
## Model 1: park$total_UPDRS^0.72 ~ log10(park$age) + park$sex + park$test_time + 
##     park$motor_UPDRS + log10(park$Jitter.RAP) + log10(park$Shimmer) + 
##     park$RPDE + park$HNR + park$DFA + park$PPE
## Model 2: park$total_UPDRS^0.72 ~ log10(park$age) + park$sex + park$test_time + 
##     park$motor_UPDRS + log10(park$Jitter.RAP) + log10(park$Shimmer) + 
##     log10(park$Shimmer.APQ5) + log10(park$Shimmer.APQ11) + park$HNR + 
##     park$RPDE + park$DFA + park$PPE
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   5864 4818.1                                  
## 2   5862 4762.7  2    55.397 34.092 1.903e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As it is observed that there is significant change in the new model from the previous model with a p-value of 2.2e-16.

SECTION 8: CONCLUSION

So our final linear model equation:

total_UPDRS^0.72 = 0.1432 + 0.285log10(age) -0.433(sex) + 0.0009(test_time) + 0.345(motor_UPDRS)+ 0.510log10(Jitter.RAP) - 0.564log10(Shimmer) + 1.422(RPDE) -0.033(HNR) - 0.907(DFA) - 1.99(PPE) + $\epsilon$ $

Data Analysis Methods: Final Exam - Group 1

Mohit Kumar, Divyay Gupta, Sushant Karnik, Aakanksha Joshi

November 26, 2016