1 Background

Tinnitus is one of the most distressing hearing-related symptoms. This can be a significant problem that negatively impacts the quality of life. An internet cognitive behavioral therapy (referred to as the treatment here in the data set) intervention has been developed in the UK to improve access to evidence-based tinnitus treatment. The Tinnitus Functional Index (TFI_score) has been used as the primary assessment measure to quantify tinnitus distress before and after the treatment. So our main goal of this study is to get hands-on experience in applying the supervised learning methods to this real-world data set and explore the cause of Tinnitus in light of the following data.

2 Dataset

In this dataset, we have the following information on 142 cases.

Variables Description
Subject_ID Subject ID of the participant
Group Subject’s Treatment/ Control group details
HHI_Score Hearing survey- Overall score- 0-40 (higher score more severe)
Generalized Anxiety Disorder (GAD) Anxiety sum: 0-21 (higher score more severe)
Patient Health Questionnaire (PHQ) Depression sum: 0-28 (higher score more severe)
Insomnia SeverityIndex (ISI) Insomnia total: 0-28 (higher score more severe)
Satisfaction with Life Scales (SWLS) Overall score, satisfaction with life, like Quality of Life(QOF). Higher scores better QOL (opposite to all other scales)
Hyperacusis 0-42 (higher score more severe)
Cognitive Failures(CFQ) 0-100 (higher score more severe)
Gender 1-Male, 2- Female
Age In years
Duration of tinnitus In years
Pre TFI-Score TFI score at the beginning of the study: Tinnitus score out of 100, higher more severe
Post TFI Score TFI score after the completion of the study; Tinnitus scores out of 100, higher more severe.

3 Exploratory data analysis

3.1 Library and others

# Libraries and others
setwd("//Users//prosantabarai//Desktop//OrabyHW//Case_study")
library(ggcorrplot)
library(knitr)
library(dplyr)
library(leaps)
library(glmnet)
library(pls)
library(class)
library(tidyr)
library(purrr)
library(gridExtra)
library(lars)
library(kableExtra)
library(jtools)
library(huxtable)

3.2 Loading the data

# Loading and Cleaning the data
CaseStudy1 <- read.csv("CaseStudy1.csv")

CaseStudy1 <- CaseStudy1[1:142,-1]  # There was 142 available case id and rest were NA



CaseStudy1$Group[CaseStudy1$Group=="Control"]<- 0
CaseStudy1$Group[CaseStudy1$Group=="Control "]<- 0
CaseStudy1$Group[CaseStudy1$Group=="Treatment "]<- 1
CaseStudy1$Group<- as.numeric(CaseStudy1$Group)
#CaseStudy1$Gender<- as.factor(CaseStudy1$Gender)

3.3 Descriptive statistics

# 1) Descriptive statistics

str(CaseStudy1)
'data.frame':   142 obs. of  13 variables:
 $ Group                      : num  1 1 1 1 1 1 1 1 1 1 ...
 $ HHI_Score                  : int  0 10 40 12 16 10 26 2 18 10 ...
 $ GAD                        : int  11 6 16 5 8 13 9 3 5 6 ...
 $ PHQ                        : int  8 2 19 2 7 8 4 0 6 6 ...
 $ ISI                        : int  15 2 24 3 8 23 11 1 17 4 ...
 $ SWLS                       : int  15 30 9 13 32 33 26 28 16 26 ...
 $ Hyperacusis                : int  14 14 29 17 27 27 26 26 19 12 ...
 $ CFQ                        : int  34 25 30 45 53 67 62 28 41 45 ...
 $ Gender                     : int  1 2 1 1 2 2 2 2 2 1 ...
 $ Age                        : int  48 56 30 45 62 37 47 24 67 66 ...
 $ Duration_of_tinnitus.years.: num  5 20 0.4 3 3 0.8 0.7 8 6 10 ...
 $ Pre_TFI_Score              : num  69.6 32.8 82 52 58.4 74 54 37.6 66 43.6 ...
 $ Post_TFI_Score             : num  16 NA 77.2 12.8 38.4 56 23.6 4.4 22.8 39.2 ...
Summary <- summary(CaseStudy1)
kable(Summary[,2:7],format = "simple", caption = " Five number summary of the variables" ) # make comment about missing values
Table 3.1: Five number summary of the variables
HHI_Score GAD PHQ ISI SWLS Hyperacusis
Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 5.00 Min. : 1.00
1st Qu.: 8.00 1st Qu.: 3.000 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.:14.00 1st Qu.:13.00
Median :18.00 Median : 6.000 Median : 7.000 Median :13.00 Median :20.00 Median :18.50
Mean :17.79 Mean : 7.479 Mean : 8.028 Mean :12.96 Mean :20.32 Mean :19.04
3rd Qu.:26.00 3rd Qu.:11.000 3rd Qu.:11.000 3rd Qu.:18.00 3rd Qu.:26.00 3rd Qu.:25.00
Max. :40.00 Max. :21.000 Max. :27.000 Max. :27.00 Max. :35.00 Max. :42.00
NA NA NA NA NA NA
kable(Summary[,8:13] ,format = "simple") 
CFQ Gender Age Duration_of_tinnitus.years. Pre_TFI_Score Post_TFI_Score
Min. : 7.00 Min. :1.000 Min. :22.00 Min. : 0.30 Min. :24.40 Min. : 4.00
1st Qu.:29.25 1st Qu.:1.000 1st Qu.:46.25 1st Qu.: 3.00 1st Qu.:46.80 1st Qu.:18.50
Median :41.00 Median :1.000 Median :58.00 Median :10.00 Median :58.60 Median :29.60
Mean :40.59 Mean :1.437 Mean :55.45 Mean :11.99 Mean :59.37 Mean :35.41
3rd Qu.:50.00 3rd Qu.:2.000 3rd Qu.:65.00 3rd Qu.:15.00 3rd Qu.:73.60 3rd Qu.:52.80
Max. :86.00 Max. :2.000 Max. :83.00 Max. :55.00 Max. :97.20 Max. :88.40
NA NA NA NA NA NA’s :86
Cor<- round(cor(CaseStudy1[,-c(1,9)]),2)


Chisq<-chisq.test(table(CaseStudy1$Group,CaseStudy1$Gender))  # expected no ass betn grouping and gender

3.4 Comments:

According to the table 3.1, there is only one variable with missing values in the data. There are 86 missing cases in the “Post_TFI_Score.” We did not notice any other anomalies in the remaining variables.

3.5 Visual inspection

ggcorrplot(Cor)
 The correlation plot of TFI data

Figure 3.1: The correlation plot of TFI data

p1<-ggplot(CaseStudy1, aes(HHI_Score)) +
  geom_histogram(aes(y=..density..),bins = 30) +  # scale histogram y
  geom_density(col = "red")
p2<-ggplot(CaseStudy1, aes(GAD)) +
  geom_histogram(aes(y=..density..),bins = 30) +  # scale histogram y
  geom_density(col = "red")

p3<-ggplot(CaseStudy1, aes(PHQ)) +
  geom_histogram(aes(y=..density..),bins = 30) +  # scale histogram y
  geom_density(col = "red")

p4<-ggplot(CaseStudy1, aes(SWLS)) +
  geom_histogram(aes(y=..density..),bins = 30) +  # scale histogram y
  geom_density(col = "red")

p5<-ggplot(CaseStudy1, aes(Hyperacusis)) +
  geom_histogram(aes(y=..density..),bins = 30) +  # scale histogram y
  geom_density(col = "red")

p6<-ggplot(CaseStudy1, aes(CFQ)) +
  geom_histogram(aes(y=..density..),bins = 30) +  # scale histogram y
  geom_density(col = "red")

p7<-ggplot(CaseStudy1, aes(x=factor(Gender)))+
  geom_bar(stat="count", width=0.7, fill="steelblue")

p8<-ggplot(CaseStudy1, aes(Age)) +
  geom_histogram(aes(y=..density..),bins = 30) +  # scale histogram y
  geom_density(col = "red")

p9<-ggplot(CaseStudy1, aes(Duration_of_tinnitus.years.)) +
  geom_histogram(aes(y=..density..),bins = 30) +  # scale histogram y
  geom_density(col = "red")

grid.arrange(p1,p2,p3,p4,p5,p6,p7,p8,p9, ncol=3)
The imperical distribution of variables

Figure 3.2: The imperical distribution of variables

3.6 Comments:

The correlation plot 3.1 shows some degrees of multicollinearity in the data. For example, “PHQ” id correlated with “GAD,” “ISI,” and “CFQ.” The rest of the data looks fine. Also, we did the \(\chi^{2}\) association test between Gender and Group. That came insignificant with \(\chi^{2} = 0.045\), df = 1, and p-value = 0.832. And according to the figure 3.2, it seems that variable “HHI_Score,” “Hyperacausis,” “CFQ,” and “SWLS” are normally distributed. But variables “GAD”, “PHQ”, “Duration_of_tinnitus.years.” exhibit a right-skewed pattern, whereas “Age” shows the opposite.

3.7 Missing value imputation and other anomalies

# 2) Missing value imputation


#sum(is.na(CaseStudy1$Pre_TFI_Score))

#sum(is.na(CaseStudy1$Post_TFI_Score))


meanPostTFI<- mean(na.omit(CaseStudy1$Post_TFI_Score))

imputed_Data<-CaseStudy1 %>% replace_na(list(Post_TFI_Score = meanPostTFI))

is.na(CaseStudy1$Post_TFI_Score)<-meanPostTFI

completeData<- imputed_Data%>% mutate(TFI_Reduction = Post_TFI_Score-Pre_TFI_Score)

# 3) Checking other abnormalities

CS1Dat <- completeData[sample(142),-c(12,13)]  # Trashing the post and pre tfi scores and mixing up the data

We use the findings from table one to identify the variables that contain missing values. Only the variable “Post_TFI_Score” has 86 missing values. Then we used the mean of the available observation meanPostTFI to replace all the NA’s in that variable. After that, we calculated a new variable called “TFI_Reduction” by subtracting “Post_TFI_Score” by “Pre_TFI_Score” and dropped the “Post_TFI_Score” and “Pre_TFI_Score.” Now we have a complete, ready-to-use data set in our hands. Before proceeding further, we rechecked for any missing values in the entire data and created our final data named “CS1Dat.”

3.8 Partitioning the data

# 4) Partitioning the data
set.seed(123)
t<- sample(110)
training <- CS1Dat[t,]  # training data
test <-   CS1Dat[-t,]  # testing data
write.table(training,file = "MyT")
write.table(test,file = "MyTest")

MyT<- read.table(file = "MyT")
MyTest<- read.table(file = "MyT")

There are 142 observations in the data, and we used \(80\%\) of it to create a training data named “training” (contains 110 observations). And with the rest 32 observation, we have created a test data set called “test.”

4 Modeling

4.1 Fitting Stepwise Regression Model

The idea here is to use the full data set to fit a linear model then use stepwise regression to select a subset of the variables that best fit the data. The following table shows that the stepwise process kept “GAD,” “ISI,” “SWLS” as significant variables. After training the model with these three variables and testing it on test data, we obtain the MSE 251.84.

# 5) Fitting Stepwise regression

mod1<- lm(TFI_Reduction~ ., data = CS1Dat) # full model

#LR.STEP<-step(mod1,direction = "both",k=2) #k=2 for AIC

LR.cheat<- lm(TFI_Reduction ~ GAD+ISI+SWLS, data = CS1Dat)
summ(LR.cheat)
Observations 142
Dependent variable TFI_Reduction
Type OLS linear regression
F(3,138) 18.82
0.29
Adj. R² 0.27
Est. S.E. t val. p
(Intercept) -19.09 5.52 -3.46 0.00
GAD -0.72 0.27 -2.61 0.01
ISI -0.82 0.21 -3.81 0.00
SWLS 0.54 0.19 2.82 0.01
Standard errors: OLS

4.2 Best subset modeling

The idea here is to select the variables that are most influential to TFI_Scores. To achieve that, we will impose multiple criteria like Adjusted \(R^2\), Residual Sum of Square, Mallow’s Cp, BIC, and AIC and accomplish the ultimate model.

# 6) Best subset model
  
LR.Leap<-regsubsets(x=CS1Dat[,-12] , y = CS1Dat[,12],
                    intercept=TRUE,method="seqrep")
sum.LR.Leap<-summary(LR.Leap)
kable(as.data.frame(sum.LR.Leap$outmat),format = "simple", caption = "The optimal number of variables using leap")
Table 4.1: The optimal number of variables using leap
Group HHI_Score GAD PHQ ISI SWLS Hyperacusis CFQ Gender Age Duration_of_tinnitus.years.
1 ( 1 ) *
2 ( 1 ) * *
3 ( 1 ) * * *
4 ( 1 ) * * * *
5 ( 1 ) * * * * *
6 ( 1 ) * * * * * *
7 ( 1 ) * * * * * * *
8 ( 1 ) * * * * * * * *
LR.CV<-cv.lars(x=as.matrix(CS1Dat[,-c(1,12)]) , y = as.matrix(CS1Dat[,12]),
               K= 10, plot.it=TRUE,se=TRUE,type="stepwise"
) 
The optimal number of variable using lars

Figure 4.1: The optimal number of variable using lars

The above table 4.1 gives us some idea on which variable to select. If I have to choose only one variable, then it will be “PHQ.” And for two-variable, it will be the “ISI” and “SWLS.” And the cross-validation technique in the 4.1 suggests “four” variables will give the lowest cross-validation error.

nad<-which.max(sum.LR.Leap$adjr2)

nrss<-which.min(sum.LR.Leap$rss)

ncp<-which.min(sum.LR.Leap$cp)

nbic<-which.min(sum.LR.Leap$bic)  

Adj_R2<-sum.LR.Leap$which[nad,]

RSS<-sum.LR.Leap$which[nrss,]

Cp<-sum.LR.Leap$which[ncp,]

BIC<-sum.LR.Leap$which[nbic,]  
  
AIC<- c(TRUE, FALSE,FALSE,TRUE,FALSE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE, FALSE)

# Selected variable by different methods


SUBVAR<- data.frame(Adj_R2,RSS,Cp,BIC, AIC)

All_var<- row.names(SUBVAR)

table1<- cbind(All_var,SUBVAR)
row.names(table1)<- NULL

cc<- c("#of Var in subset", 5,8,3,3,3)
table2<- rbind(table1,cc)

kable(table2,format = "simple", caption = "Subset selection based on diffrent criteria") 
Table 4.2: Subset selection based on diffrent criteria
All_var Adj_R2 RSS Cp BIC AIC
(Intercept) TRUE TRUE TRUE TRUE TRUE
Group FALSE TRUE FALSE FALSE FALSE
HHI_Score TRUE TRUE FALSE FALSE FALSE
GAD TRUE TRUE TRUE TRUE TRUE
PHQ FALSE FALSE FALSE FALSE FALSE
ISI TRUE TRUE TRUE TRUE TRUE
SWLS TRUE TRUE TRUE TRUE TRUE
Hyperacusis FALSE FALSE FALSE FALSE FALSE
CFQ FALSE TRUE FALSE FALSE FALSE
Gender FALSE FALSE FALSE FALSE FALSE
Age FALSE TRUE FALSE FALSE FALSE
Duration_of_tinnitus.years. FALSE FALSE FALSE FALSE FALSE
#of Var in subset 5 8 3 3 3

\[~\]

setwd("//Users//prosantabarai//Desktop//OrabyHW//Case_study")

MyT<- read.table(file = "MyT")

LR.Sub.final<- lm(TFI_Reduction ~ GAD+ISI+SWLS, data = MyT)
summ(LR.Sub.final)
Observations 110
Dependent variable TFI_Reduction
Type OLS linear regression
F(3,106) 14.48
0.29
Adj. R² 0.27
Est. S.E. t val. p
(Intercept) -22.81 6.52 -3.50 0.00
GAD -0.62 0.33 -1.88 0.06
ISI -0.81 0.25 -3.26 0.00
SWLS 0.63 0.23 2.75 0.01
Standard errors: OLS

The table 4.2 gives us the final summary of the best subset possible based on adjusted \(R^{2}\), Residual sum of square, Mallows’s Cp, Bayesian Information Criterion, and Akaike information criterion. The last three of them suggest the same combination of variables. So our best subset model will contain GAD, ISI, SWLS. Now we will train our model using those three variables and training data and check the quality of fit on testing data.

par(mfrow=c(2,2))
plot(LR.Sub.final)
Residual analysis of the model

Figure 4.2: Residual analysis of the model

Before we proceed any further, let’s check the model diagnostic. The figure 4.2 below shows the residual vs. the fitted value plot, and it does not show any pattern. Residuals are homogeneously scattered, and the normality assumption of the residuals also holds.

4.3 Comment on the model:

The model we chose has an intercept value of -19.48. The coefficient of GAD is -0.65; that is, if we consider other regressors constant, then one unit change in GAD will result in a 0.62 unit decrease in the TFI_Reduction (in other words, increase in Post_TFI_Scores). Our model captures almost \(21\%\) variability in the data with training MSE 251.84. “ISI” and “SWLS” were found to be the most influential factor for TFI scores.

5 Performing penalized regression

5.1 Ridge

First, we have to find an optimal \(\lambda\) value. The following figure 5.1 shows the possible \(\lambda\) value list and their corresponding MSE. The \(\lambda\) value that gives the minimum error is 15.3, and \(\lambda\) value 129.7 is the closest \(\lambda\) value given the MSE is within one standard deviation that of minimum \(\lambda\)

x<-model.matrix(TFI_Reduction~.,data=MyT)[,-1]
scale.x<-scale(x)

y<-MyT$TFI_Reduction

# optimum lambda
ridge.cv<-cv.glmnet(scale.x, y ,alpha=0)
plot(ridge.cv)
MSE of several Ridge model vs log of lambda

Figure 5.1: MSE of several Ridge model vs log of lambda

We will train our ridge model using optimal \(\lambda=129.7\)

5.2 Lasso

First, we have to find an optimal \(\lambda\) value. Figure 5.2 shows the list of possible \(\lambda\) values and their corresponding MSE. The \(\lambda\) value that gives the minimum error is 1.6, and \(\lambda\) value 4.9 is the closest \(\lambda\) value given the MSE is within one standard deviation that of minimum \(\lambda\)

# optimum lambda
lasso.cv<-cv.glmnet(scale.x, y ,alpha=1)
plot(lasso.cv)
 Optimum lambda for lasso regression

Figure 5.2: Optimum lambda for lasso regression

6 Performing PCR, PLSR

6.1 PCR

pcr.model<-pcr(TFI_Reduction ~ . , data = CS1Dat, scale= TRUE,valiadtion = "CV") # 10-fold
summary(pcr.model)
Data:   X dimension: 142 11 
    Y dimension: 142 1
Fit method: svdpc
Number of components considered: 11
TRAINING: % variance explained
               1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
X                29.21    42.50    52.92    62.30    71.18    78.69    85.12
TFI_Reduction    24.13    24.15    24.34    26.11    29.45    30.12    30.12
               8 comps  9 comps  10 comps  11 comps
X                90.18    94.63     98.33    100.00
TFI_Reduction    30.71    30.72     30.76     31.01
validationplot(pcr.model,val.type = "MSEP")
MSEP vs the number of component graph

Figure 6.1: MSEP vs the number of component graph

In this approach, we chose several linear combinations of our regressor (component). According to the table above single component can explain 24 percent variability in TFI_Reduction. Trends keep improving until the 6th component. After that additional component does not capture any extra variability in the data. Figure 6.1 also corroborates our finding as the line goes almost straight after the 6th component. So we will use four components to train our PCR model.

6.2 PLSR

#optimum component
pls.model<-plsr(TFI_Reduction ~ . , data = CS1Dat, scale= TRUE,valiadtion = "CV") # 10-fold
summary(pls.model)
Data:   X dimension: 142 11 
    Y dimension: 142 1
Fit method: kernelpls
Number of components considered: 11
TRAINING: % variance explained
               1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
X                28.73    38.00    44.35    52.88    59.64    67.23    75.31
TFI_Reduction    27.54    30.77    30.94    30.98    31.01    31.01    31.01
               8 comps  9 comps  10 comps  11 comps
X                80.64    89.10     93.59    100.00
TFI_Reduction    31.01    31.01     31.01     31.01
validationplot(pls.model,val.type = "MSEP")
MSEP vs number of component

Figure 6.2: MSEP vs number of component

As we can see, when we use PSLR, the optimal number of components reduces to 2. Because after that, the line becomes almost straight. Figure 6.2 also corroborates our finding as the line goes almost straight after the 2nd component. So we will use two-component to train our PLS model.

Components one two three four five six
PCR X 29.1 43.51 53.9 63.16 72.0 79.84
TFI_Reduction 17.67 18.13 18.51 21.40 22.84 23.54
PLS X 28.40 38.30
TFI_Reduction 21.38 24.74

The table above summarizes the PCR and PLS regression. We can conclude that we are acquiring the same capture level in variability using just two-component in PLS, where PCR uses six-component.

setwd("//Users//prosantabarai//Desktop//OrabyHW//Case_study")

mysum<- read.table("MySum")

kable(mysum, format = "simple", caption = "Regression summary table of diffrent methods")
Table 6.1: Regression summary table of diffrent methods
Stepwise Best_Subset Ridge Lasso PCR PLS
(Intercept) -19.09 -19.49 -24.11 -24.11 . .
Group . . 0.00 . -0.44 -0.49
HHI_Score . . -0.32 . -0.42 -0.43
GAD -0.72 -0.55 -0.61 . -2.47 -2.05
PHQ . . -0.71 -1.29 -3.24 -2.52
ISI -0.82 -0.8 -0.80 -1.82 -3.39 -4.34
SWLS 0.54 0.48 0.57 . 2.44 2.94
Hyperacusis . . -0.37 . 0.2 -0.65
CFQ . . -0.10 . 0.99 1.65
Gender . . -0.17 . -0.92 -1
Age . . 0.12 . 0.93 0.9
Duration_of_tinnitus.years. . . -0.01 . -0.71 0.17
Training_MSE 253.93 253.93 295.61 293.76 253.53 249.55
Test_MSE 251.84 251.84 369.10 368.75 246.5 255.86

7 Final conclusion

The above table 6.1 summarizes all the methods we used for analysis. Stepwise and best subset selection exhibit simpler performance holding the second-lowest test MSE 251.84. However, another nonparametric technique, PCR, has the lowest test MSE of all(246.5). So far, PCR regression seems to have the leads.