Data Set: Parkinsons Telemonitoring Data Set
The data was initially a mutivariate data set.The main aim of the data was to predict the motor and total UPDRS scores (‘motor_UPDRS’ and’total_UPDRS’) from the 16 voice measures.However, since we have not gone through mutivatiate regression in the class, I removed one of the dependent variables ‘motor UPDRS’ to make it a single dependent variable.
Source: UCI Machine Respositoty. It was created by Athanasios Tsanas and Max Little of the University of Oxford, in collaboration with 10 medical centers in the US and Intel Corporation who developed the telemonitoring device to record the speech signals. The original study used a range of linear and nonlinear regression methods to predict the clinician’s Parkinson’s disease symptom score on the UPDRS scale.
Data Set Information: Data Set Characteristics: Univariate Attribute Characteristics: Integer, Real Associated Tasks: Regression Number of Instances: 5875 Number of Attributes: 26 Area: Life Date Donated: 2009-10-29
ATTRIBUTE INFORMATION: subject# - Integer that uniquely identifies each subject age - Subject age sex - Subject gender ‘0’ - male, ‘1’ - female test_time - Time since recruitment into the trial. The integer part is the number of days since recruitment. total_UPDRS - Clinician’s total UPDRS score, linearly interpolated Jitter(%),Jitter(Abs),Jitter:RAP,Jitter:PPQ5,Jitter:DDP - Several measures of variation in fundamental frequency Shimmer,Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,Shimmer:APQ11,Shimmer:DDA - Several measures of variation in amplitude NHR,HNR - Two measures of ratio of noise to tonal components in the voice RPDE - A nonlinear dynamical complexity measure DFA - Signal fractal scaling exponent PPE - A nonlinear measure of fundamental frequency variation
p<-read.csv('/Users/meierhabarexiti/Documents/biostas/class materials/statistical learning with R(6620)/project/parkinson/parkinsons_updrs.data.txt', header=TRUE)
# remove motor UPDRS to make the data sigle variate regression
p<-data.frame(p[,-5])
str(p)
'data.frame': 5875 obs. of 21 variables:
$ subject. : int 1 1 1 1 1 1 1 1 1 1 ...
$ age : int 72 72 72 72 72 72 72 72 72 72 ...
$ sex : int 0 0 0 0 0 0 0 0 0 0 ...
$ test_time : num 5.64 12.67 19.68 25.65 33.64 ...
$ total_UPDRS : num 34.4 34.9 35.4 35.8 36.4 ...
$ Jitter... : num 0.00662 0.003 0.00481 0.00528 0.00335 0.00353 0.00422 0.00476 0.00432 0.00496 ...
$ Jitter.Abs. : num 3.38e-05 1.68e-05 2.46e-05 2.66e-05 2.01e-05 ...
$ Jitter.RAP : num 0.00401 0.00132 0.00205 0.00191 0.00093 0.00119 0.00212 0.00226 0.00156 0.00258 ...
$ Jitter.PPQ5 : num 0.00317 0.0015 0.00208 0.00264 0.0013 0.00159 0.00221 0.00259 0.00207 0.00253 ...
$ Jitter.DDP : num 0.01204 0.00395 0.00616 0.00573 0.00278 ...
$ Shimmer : num 0.0256 0.0202 0.0168 0.0231 0.017 ...
$ Shimmer.dB. : num 0.23 0.179 0.181 0.327 0.176 0.214 0.445 0.212 0.371 0.31 ...
$ Shimmer.APQ3 : num 0.01438 0.00994 0.00734 0.01106 0.00679 ...
$ Shimmer.APQ5 : num 0.01309 0.01072 0.00844 0.01265 0.00929 ...
$ Shimmer.APQ11: num 0.0166 0.0169 0.0146 0.0196 0.0182 ...
$ Shimmer.DDA : num 0.0431 0.0298 0.022 0.0332 0.0204 ...
$ NHR : num 0.0143 0.0111 0.0202 0.0278 0.0116 ...
$ HNR : num 21.6 27.2 23 24.4 26.1 ...
$ RPDE : num 0.419 0.435 0.462 0.487 0.472 ...
$ DFA : num 0.548 0.565 0.544 0.578 0.561 ...
$ PPE : num 0.16 0.108 0.21 0.333 0.194 ...
summary(p$subject.)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 10.00 22.00 21.49 33.00 42.00
table(p$subject.)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
149 145 144 137 156 156 161 150 152 148 138 107 112 136 143 138 144 126 129 134 123 112
23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
138 156 144 130 129 134 168 126 130 101 135 161 165 129 140 149 143 142 165 150
Distribution of total UPDRS
# boxplot for total UPDRS by different subjects
library(ggplot2)
fill <- "green"
line <- "black"
ggplot(p, aes(x =as.factor(p$subject.), y =p$total_UPDRS)) +
geom_boxplot(fill = fill, colour = line) +
scale_y_continuous(name = "total UPDRS",
breaks = seq(5, 60, 0.5),
limits=c(5, 60)) +
scale_x_discrete(name = "subject") +
ggtitle("Boxplot of total_UPDRS and subject")
Subject numebr 35 has the highest total UPDRS while number 18 has the smallest.
library(ggplot2)
fill <- "green"
line <- "black"
ggplot(p, aes(x =as.factor(p$age), y =p$total_UPDRS)) +
geom_boxplot(fill = fill, colour = line) +
scale_y_continuous(name = "total UPDRS",
breaks = seq(5, 60, 0.5),
limits=c(5, 60)) +
scale_x_discrete(name = "age") +
ggtitle("Boxplot of total_UPDRS and age")
summary(p)
subject. age sex test_time total_UPDRS
Min. : 1.00 Min. :36.0 Min. :0.0000 Min. : -4.263 Min. : 7.00
1st Qu.:10.00 1st Qu.:58.0 1st Qu.:0.0000 1st Qu.: 46.847 1st Qu.:21.37
Median :22.00 Median :65.0 Median :0.0000 Median : 91.523 Median :27.58
Mean :21.49 Mean :64.8 Mean :0.3178 Mean : 92.864 Mean :29.02
3rd Qu.:33.00 3rd Qu.:72.0 3rd Qu.:1.0000 3rd Qu.:138.445 3rd Qu.:36.40
Max. :42.00 Max. :85.0 Max. :1.0000 Max. :215.490 Max. :54.99
Jitter... Jitter.Abs. Jitter.RAP Jitter.PPQ5
Min. :0.000830 Min. :2.250e-06 Min. :0.000330 Min. :0.000430
1st Qu.:0.003580 1st Qu.:2.244e-05 1st Qu.:0.001580 1st Qu.:0.001820
Median :0.004900 Median :3.453e-05 Median :0.002250 Median :0.002490
Mean :0.006154 Mean :4.403e-05 Mean :0.002987 Mean :0.003277
3rd Qu.:0.006800 3rd Qu.:5.333e-05 3rd Qu.:0.003290 3rd Qu.:0.003460
Max. :0.099990 Max. :4.456e-04 Max. :0.057540 Max. :0.069560
Jitter.DDP Shimmer Shimmer.dB. Shimmer.APQ3 Shimmer.APQ5
Min. :0.000980 Min. :0.00306 Min. :0.026 Min. :0.00161 Min. :0.00194
1st Qu.:0.004730 1st Qu.:0.01912 1st Qu.:0.175 1st Qu.:0.00928 1st Qu.:0.01079
Median :0.006750 Median :0.02751 Median :0.253 Median :0.01370 Median :0.01594
Mean :0.008962 Mean :0.03404 Mean :0.311 Mean :0.01716 Mean :0.02014
3rd Qu.:0.009870 3rd Qu.:0.03975 3rd Qu.:0.365 3rd Qu.:0.02057 3rd Qu.:0.02375
Max. :0.172630 Max. :0.26863 Max. :2.107 Max. :0.16267 Max. :0.16702
Shimmer.APQ11 Shimmer.DDA NHR HNR RPDE
Min. :0.00249 Min. :0.00484 Min. :0.000286 Min. : 1.659 Min. :0.1510
1st Qu.:0.01566 1st Qu.:0.02783 1st Qu.:0.010955 1st Qu.:19.406 1st Qu.:0.4698
Median :0.02271 Median :0.04111 Median :0.018448 Median :21.920 Median :0.5423
Mean :0.02748 Mean :0.05147 Mean :0.032120 Mean :21.680 Mean :0.5415
3rd Qu.:0.03272 3rd Qu.:0.06173 3rd Qu.:0.031463 3rd Qu.:24.444 3rd Qu.:0.6140
Max. :0.27546 Max. :0.48802 Max. :0.748260 Max. :37.875 Max. :0.9661
DFA PPE
Min. :0.5140 Min. :0.02198
1st Qu.:0.5962 1st Qu.:0.15634
Median :0.6436 Median :0.20550
Mean :0.6532 Mean :0.21959
3rd Qu.:0.7113 3rd Qu.:0.26449
Max. :0.8656 Max. :0.73173
Set up trainning and test data sets:
set.seed(350)
indx = sample(1:nrow(p), as.integer(0.9*nrow(p)))
indx[1:10]
[1] 584 964 4818 1785 4281 4250 1297 2369 5767 3159
p_train = p[indx,]
p_test = p[-indx,]
library(randomForest)
rf<- randomForest(total_UPDRS~ ., data =p_train )
rf
Call:
randomForest(formula = total_UPDRS ~ ., data = p_train)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 6
Mean of squared residuals: 3.523159
% Var explained: 96.94
The output notes that the random forest included 500 trees and tried 6 variables at each split, which is nearly squreroot of 21. 96.91% of the variation can be explained by our model.
Check importance of each predictor:
library(randomForest)
library(ggplot2)
importance(rf)
IncNodePurity
subject. 221925.704
age 176756.373
sex 19013.214
test_time 22636.259
Jitter... 5397.062
Jitter.Abs. 14580.451
Jitter.RAP 4434.246
Jitter.PPQ5 5550.120
Jitter.DDP 4604.473
Shimmer 5099.231
Shimmer.dB. 4694.506
Shimmer.APQ3 5727.225
Shimmer.APQ5 6042.239
Shimmer.APQ11 7198.272
Shimmer.DDA 6091.347
NHR 7661.834
HNR 17214.105
RPDE 15043.367
DFA 43116.041
PPE 13614.249
Variables importance plot
varImpPlot(rf)
pred<-predict(rf,p_test,type='response')
head(pred)
28 29 44 55 80 82
36.92143 36.82445 43.44714 36.37031 37.42590 38.89545
Compare the distribution of predict value and actual value
summary(pred)
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.804 21.910 28.784 28.943 35.871 54.390
summary(p$total_UPDRS)
Min. 1st Qu. Median Mean 3rd Qu. Max.
7.00 21.37 27.58 29.02 36.40 54.99
Compare the correlation between predicted and actual total UPDRS.
cor(pred,p_test$total_UPDRS)
[1] 0.9853146
The correlation between predicted and actual total UPDR is 98%.The correlation only measures how strongly the predictions are related to the true value; it is not a measure of how far off the predictions were from the true values.
Another way to think about the model’s performance is to consider how far, on average, its prediction was from the true value. This measurement is called themean absolute error (MAE). The equation for MAE is as follows, where n indicates the number of predictions and ei indicates the error for prediction i: Function to calculate the mean absolute error:
MAE <- function(actual, predicted) {
mean(abs(actual - predicted))
}
Mean absolute error between predicted and actual values:
MAE(pred, p_test$total_UPDRS)
[1] 1.365738
This implies that, on average, the difference between our model’s predictions and the true total UPDRS score was about 1.36. On a quality scale from zero to 10, this seems to suggest that our model is doing fairly well.
increase number of features randomly selected at each split to 10 from previous 6.
library(randomForest)
rf1<- randomForest(total_UPDRS~ ., data =p_train,mtry=10)
rf1
Call:
randomForest(formula = total_UPDRS ~ ., data = p_train, mtry = 10)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 10
Mean of squared residuals: 0.8979827
% Var explained: 99.22
This model increases the percent of variation can be explained to 99.22% from previous 96.94%. Meaning increase the number of features at each split increses the model performance.
library(randomForest)
library(ggplot2)
importance(rf1)
IncNodePurity
subject. 262890.126
age 203567.125
sex 21009.691
test_time 27744.962
Jitter... 1600.745
Jitter.Abs. 7563.505
Jitter.RAP 1489.900
Jitter.PPQ5 2012.132
Jitter.DDP 1882.344
Shimmer 1587.527
Shimmer.dB. 1704.745
Shimmer.APQ3 2120.414
Shimmer.APQ5 2722.125
Shimmer.APQ11 2907.182
Shimmer.DDA 2325.460
NHR 2246.011
HNR 11044.665
RPDE 9495.941
DFA 34901.599
PPE 6584.439
varImpPlot(rf1)
The result is the same as before.
Making predictions:
pred1<-predict(rf1,p_test,type='response')
head(pred1)
28 29 44 55 80 82
36.53166 36.42757 44.79359 36.83698 37.03186 38.18373
head(p_test$total_UPDRS)
[1] 35.810 36.375 44.861 36.870 36.870 37.857
Comparing the first five values of predicted and actual total UPDRS, we notice that 4/5 of them have been correctly predicted after rounding up.
Compare the distribution of predict value and actual value
summary(pred1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
7.312 21.738 28.188 28.910 36.455 54.676
summary(p$total_UPDRS)
Min. 1st Qu. Median Mean 3rd Qu. Max.
7.00 21.37 27.58 29.02 36.40 54.99
Compare the correlation between predicted and actual total UPDRS.
cor(pred1,p_test$total_UPDRS)
[1] 0.9957614
The correlation between predicted and actual total UPDR is 99.6%.
Mean absolute error between predicted and actual values:
MAE(pred1, p_test$total_UPDRS)
[1] 0.638016
This implies that, on average, the difference between our model’s predictions and the true total UPDRS score was decreased from 1.36 to 0.64. Indicating the model’s performance has been increased.
summary(p$age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
36.0 58.0 65.0 64.8 72.0 85.0