1 Background

Tinnitus is one of the most distressing hearing-related symptoms. This can be a significant problem that negatively impacts the quality of life. An internet cognitive behavioral therapy (referred to as the treatment here in the data set) intervention has been developed in the UK to improve access to evidence-based tinnitus treatment. The Tinnitus Functional Index (TFI_score) has been used as the primary assessment measure to quantify tinnitus distress before and after the treatment. So our main goal of this study is to get hands-on experience in applying the supervised learning methods to this real-world data set and explore the cause of Tinnitus in light of the following data.

2 Dataset

In this dataset, we have the following information on 142 cases.

Variables	Description
Subject_ID	Subject ID of the participant
Group	Subject’s Treatment/ Control group details
HHI_Score	Hearing survey- Overall score- 0-40 (higher score more severe)
Generalized Anxiety Disorder (GAD)	Anxiety sum: 0-21 (higher score more severe)
Patient Health Questionnaire (PHQ)	Depression sum: 0-28 (higher score more severe)
Insomnia SeverityIndex (ISI)	Insomnia total: 0-28 (higher score more severe)
Satisfaction with Life Scales (SWLS)	Overall score, satisfaction with life, like Quality of Life(QOF). Higher scores better QOL (opposite to all other scales)
Hyperacusis	0-42 (higher score more severe)
Cognitive Failures(CFQ)	0-100 (higher score more severe)
Gender	1-Male, 2- Female
Age	In years
Duration of tinnitus	In years
Pre TFI-Score	TFI score at the beginning of the study: Tinnitus score out of 100, higher more severe
Post TFI Score	TFI score after the completion of the study; Tinnitus scores out of 100, higher more severe.

3 Exploratory data analysis

3.1 Library and others

# Libraries and others
setwd("//Users//prosantabarai//Desktop//OrabyHW//Case_study")
library(ggcorrplot)
library(knitr)
library(dplyr)
library(leaps)
library(glmnet)
library(pls)
library(class)
library(tidyr)
library(purrr)
library(gridExtra)
library(lars)
library(kableExtra)
library(jtools)
library(huxtable)

3.2 Loading the data

# Loading and Cleaning the data
CaseStudy1 <- read.csv("CaseStudy1.csv")

CaseStudy1 <- CaseStudy1[1:142,-1]  # There was 142 available case id and rest were NA



CaseStudy1$Group[CaseStudy1$Group=="Control"]<- 0
CaseStudy1$Group[CaseStudy1$Group=="Control "]<- 0
CaseStudy1$Group[CaseStudy1$Group=="Treatment "]<- 1
CaseStudy1$Group<- as.numeric(CaseStudy1$Group)
#CaseStudy1$Gender<- as.factor(CaseStudy1$Gender)

3.3 Descriptive statistics

# 1) Descriptive statistics

str(CaseStudy1)

'data.frame':   142 obs. of  13 variables:
 $ Group                      : num  1 1 1 1 1 1 1 1 1 1 ...
 $ HHI_Score                  : int  0 10 40 12 16 10 26 2 18 10 ...
 $ GAD                        : int  11 6 16 5 8 13 9 3 5 6 ...
 $ PHQ                        : int  8 2 19 2 7 8 4 0 6 6 ...
 $ ISI                        : int  15 2 24 3 8 23 11 1 17 4 ...
 $ SWLS                       : int  15 30 9 13 32 33 26 28 16 26 ...
 $ Hyperacusis                : int  14 14 29 17 27 27 26 26 19 12 ...
 $ CFQ                        : int  34 25 30 45 53 67 62 28 41 45 ...
 $ Gender                     : int  1 2 1 1 2 2 2 2 2 1 ...
 $ Age                        : int  48 56 30 45 62 37 47 24 67 66 ...
 $ Duration_of_tinnitus.years.: num  5 20 0.4 3 3 0.8 0.7 8 6 10 ...
 $ Pre_TFI_Score              : num  69.6 32.8 82 52 58.4 74 54 37.6 66 43.6 ...
 $ Post_TFI_Score             : num  16 NA 77.2 12.8 38.4 56 23.6 4.4 22.8 39.2 ...

Summary <- summary(CaseStudy1)
kable(Summary[,2:7],format = "simple", caption = " Five number summary of the variables" ) # make comment about missing values

Table 3.1: Five number summary of the variables
HHI_Score	GAD	PHQ	ISI	SWLS	Hyperacusis
Min. : 0.00	Min. : 0.000	Min. : 0.000	Min. : 0.00	Min. : 5.00	Min. : 1.00
1st Qu.: 8.00	1st Qu.: 3.000	1st Qu.: 4.000	1st Qu.: 8.00	1st Qu.:14.00	1st Qu.:13.00
Median :18.00	Median : 6.000	Median : 7.000	Median :13.00	Median :20.00	Median :18.50
Mean :17.79	Mean : 7.479	Mean : 8.028	Mean :12.96	Mean :20.32	Mean :19.04
3rd Qu.:26.00	3rd Qu.:11.000	3rd Qu.:11.000	3rd Qu.:18.00	3rd Qu.:26.00	3rd Qu.:25.00
Max. :40.00	Max. :21.000	Max. :27.000	Max. :27.00	Max. :35.00	Max. :42.00
NA	NA	NA	NA	NA	NA

kable(Summary[,8:13] ,format = "simple")

CFQ	Gender	Age	Duration_of_tinnitus.years.	Pre_TFI_Score	Post_TFI_Score
Min. : 7.00	Min. :1.000	Min. :22.00	Min. : 0.30	Min. :24.40	Min. : 4.00
1st Qu.:29.25	1st Qu.:1.000	1st Qu.:46.25	1st Qu.: 3.00	1st Qu.:46.80	1st Qu.:18.50
Median :41.00	Median :1.000	Median :58.00	Median :10.00	Median :58.60	Median :29.60
Mean :40.59	Mean :1.437	Mean :55.45	Mean :11.99	Mean :59.37	Mean :35.41
3rd Qu.:50.00	3rd Qu.:2.000	3rd Qu.:65.00	3rd Qu.:15.00	3rd Qu.:73.60	3rd Qu.:52.80
Max. :86.00	Max. :2.000	Max. :83.00	Max. :55.00	Max. :97.20	Max. :88.40
NA	NA	NA	NA	NA	NA’s :86

Cor<- round(cor(CaseStudy1[,-c(1,9)]),2)


Chisq<-chisq.test(table(CaseStudy1$Group,CaseStudy1$Gender))  # expected no ass betn grouping and gender

3.4 Comments:

According to the table 3.1, there is only one variable with missing values in the data. There are 86 missing cases in the “Post_TFI_Score.” We did not notice any other anomalies in the remaining variables.

3.5 Visual inspection

ggcorrplot(Cor)

Figure 3.1: The correlation plot of TFI data

p1<-ggplot(CaseStudy1, aes(HHI_Score)) +
  geom_histogram(aes(y=..density..),bins = 30) +  # scale histogram y
  geom_density(col = "red")
p2<-ggplot(CaseStudy1, aes(GAD)) +
  geom_histogram(aes(y=..density..),bins = 30) +  # scale histogram y
  geom_density(col = "red")

p3<-ggplot(CaseStudy1, aes(PHQ)) +
  geom_histogram(aes(y=..density..),bins = 30) +  # scale histogram y
  geom_density(col = "red")

p4<-ggplot(CaseStudy1, aes(SWLS)) +
  geom_histogram(aes(y=..density..),bins = 30) +  # scale histogram y
  geom_density(col = "red")

p5<-ggplot(CaseStudy1, aes(Hyperacusis)) +
  geom_histogram(aes(y=..density..),bins = 30) +  # scale histogram y
  geom_density(col = "red")

p6<-ggplot(CaseStudy1, aes(CFQ)) +
  geom_histogram(aes(y=..density..),bins = 30) +  # scale histogram y
  geom_density(col = "red")

p7<-ggplot(CaseStudy1, aes(x=factor(Gender)))+
  geom_bar(stat="count", width=0.7, fill="steelblue")

p8<-ggplot(CaseStudy1, aes(Age)) +
  geom_histogram(aes(y=..density..),bins = 30) +  # scale histogram y
  geom_density(col = "red")

p9<-ggplot(CaseStudy1, aes(Duration_of_tinnitus.years.)) +
  geom_histogram(aes(y=..density..),bins = 30) +  # scale histogram y
  geom_density(col = "red")

grid.arrange(p1,p2,p3,p4,p5,p6,p7,p8,p9, ncol=3)

Figure 3.2: The imperical distribution of variables

3.6 Comments:

The correlation plot 3.1 shows some degrees of multicollinearity in the data. For example, “PHQ” id correlated with “GAD,” “ISI,” and “CFQ.” The rest of the data looks fine. Also, we did the \(\chi^{2}\) association test between Gender and Group. That came insignificant with \(\chi^{2} = 0.045\), df = 1, and p-value = 0.832. And according to the figure 3.2, it seems that variable “HHI_Score,” “Hyperacausis,” “CFQ,” and “SWLS” are normally distributed. But variables “GAD”, “PHQ”, “Duration_of_tinnitus.years.” exhibit a right-skewed pattern, whereas “Age” shows the opposite.

3.7 Missing value imputation and other anomalies

# 2) Missing value imputation


#sum(is.na(CaseStudy1$Pre_TFI_Score))

#sum(is.na(CaseStudy1$Post_TFI_Score))


meanPostTFI<- mean(na.omit(CaseStudy1$Post_TFI_Score))

imputed_Data<-CaseStudy1 %>% replace_na(list(Post_TFI_Score = meanPostTFI))

is.na(CaseStudy1$Post_TFI_Score)<-meanPostTFI

completeData<- imputed_Data%>% mutate(TFI_Reduction = Post_TFI_Score-Pre_TFI_Score)

# 3) Checking other abnormalities

CS1Dat <- completeData[sample(142),-c(12,13)]  # Trashing the post and pre tfi scores and mixing up the data

We use the findings from table one to identify the variables that contain missing values. Only the variable “Post_TFI_Score” has 86 missing values. Then we used the mean of the available observation meanPostTFI to replace all the NA’s in that variable. After that, we calculated a new variable called “TFI_Reduction” by subtracting “Post_TFI_Score” by “Pre_TFI_Score” and dropped the “Post_TFI_Score” and “Pre_TFI_Score.” Now we have a complete, ready-to-use data set in our hands. Before proceeding further, we rechecked for any missing values in the entire data and created our final data named “CS1Dat.”

3.8 Partitioning the data

# 4) Partitioning the data
set.seed(123)
t<- sample(110)
training <- CS1Dat[t,]  # training data
test <-   CS1Dat[-t,]  # testing data
write.table(training,file = "MyT")
write.table(test,file = "MyTest")

MyT<- read.table(file = "MyT")
MyTest<- read.table(file = "MyT")

There are 142 observations in the data, and we used \(80\%\) of it to create a training data named “training” (contains 110 observations). And with the rest 32 observation, we have created a test data set called “test.”

4 Modeling

4.1 Fitting Stepwise Regression Model

The idea here is to use the full data set to fit a linear model then use stepwise regression to select a subset of the variables that best fit the data. The following table shows that the stepwise process kept “GAD,” “ISI,” “SWLS” as significant variables. After training the model with these three variables and testing it on test data, we obtain the MSE 251.84.

# 5) Fitting Stepwise regression

mod1<- lm(TFI_Reduction~ ., data = CS1Dat) # full model

#LR.STEP<-step(mod1,direction = "both",k=2) #k=2 for AIC

LR.cheat<- lm(TFI_Reduction ~ GAD+ISI+SWLS, data = CS1Dat)
summ(LR.cheat)

Observations	142
Dependent variable	TFI_Reduction
Type	OLS linear regression

F(3,138)	18.82
R²	0.29
Adj. R²	0.27

	Est.	S.E.	t val.	p
(Intercept)	-19.09	5.52	-3.46	0.00
GAD	-0.72	0.27	-2.61	0.01
ISI	-0.82	0.21	-3.81	0.00
SWLS	0.54	0.19	2.82	0.01
Standard errors: OLS

4.2 Best subset modeling

The idea here is to select the variables that are most influential to TFI_Scores. To achieve that, we will impose multiple criteria like Adjusted \(R^2\), Residual Sum of Square, Mallow’s Cp, BIC, and AIC and accomplish the ultimate model.

# 6) Best subset model
  
LR.Leap<-regsubsets(x=CS1Dat[,-12] , y = CS1Dat[,12],
                    intercept=TRUE,method="seqrep")
sum.LR.Leap<-summary(LR.Leap)
kable(as.data.frame(sum.LR.Leap$outmat),format = "simple", caption = "The optimal number of variables using leap")

Table 4.1: The optimal number of variables using leap
	Group	HHI_Score	GAD	PHQ	ISI	SWLS	Hyperacusis	CFQ	Age
1 ( 1 )				*
2 ( 1 )					*	*
3 ( 1 )			*		*	*
4 ( 1 )		*	*		*	*
5 ( 1 )		*	*		*	*			*
6 ( 1 )		*	*		*	*		*	*
7 ( 1 )	*	*	*		*	*		*	*
8 ( 1 )	*	*	*	*	*	*	*	*

LR.CV<-cv.lars(x=as.matrix(CS1Dat[,-c(1,12)]) , y = as.matrix(CS1Dat[,12]),
               K= 10, plot.it=TRUE,se=TRUE,type="stepwise"
)

Figure 4.1: The optimal number of variable using lars

The above table 4.1 gives us some idea on which variable to select. If I have to choose only one variable, then it will be “PHQ.” And for two-variable, it will be the “ISI” and “SWLS.” And the cross-validation technique in the 4.1 suggests “four” variables will give the lowest cross-validation error.

nad<-which.max(sum.LR.Leap$adjr2)

nrss<-which.min(sum.LR.Leap$rss)

ncp<-which.min(sum.LR.Leap$cp)

nbic<-which.min(sum.LR.Leap$bic)  

Adj_R2<-sum.LR.Leap$which[nad,]

RSS<-sum.LR.Leap$which[nrss,]

Cp<-sum.LR.Leap$which[ncp,]

BIC<-sum.LR.Leap$which[nbic,]  
  
AIC<- c(TRUE, FALSE,FALSE,TRUE,FALSE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE, FALSE)

# Selected variable by different methods


SUBVAR<- data.frame(Adj_R2,RSS,Cp,BIC, AIC)

All_var<- row.names(SUBVAR)

table1<- cbind(All_var,SUBVAR)
row.names(table1)<- NULL

cc<- c("#of Var in subset", 5,8,3,3,3)
table2<- rbind(table1,cc)

kable(table2,format = "simple", caption = "Subset selection based on diffrent criteria")

Table 4.2: Subset selection based on diffrent criteria
All_var	Adj_R2	RSS	Cp	BIC	AIC
(Intercept)	TRUE	TRUE	TRUE	TRUE	TRUE
Group	FALSE	TRUE	FALSE	FALSE	FALSE
HHI_Score	TRUE	TRUE	FALSE	FALSE	FALSE
GAD	TRUE	TRUE	TRUE	TRUE	TRUE
PHQ	FALSE	FALSE	FALSE	FALSE	FALSE
ISI	TRUE	TRUE	TRUE	TRUE	TRUE
SWLS	TRUE	TRUE	TRUE	TRUE	TRUE
Hyperacusis	FALSE	FALSE	FALSE	FALSE	FALSE
CFQ	FALSE	TRUE	FALSE	FALSE	FALSE
Gender	FALSE	FALSE	FALSE	FALSE	FALSE
Age	FALSE	TRUE	FALSE	FALSE	FALSE
Duration_of_tinnitus.years.	FALSE	FALSE	FALSE	FALSE	FALSE
#of Var in subset	5	8	3	3	3

\[~\]

setwd("//Users//prosantabarai//Desktop//OrabyHW//Case_study")

MyT<- read.table(file = "MyT")

LR.Sub.final<- lm(TFI_Reduction ~ GAD+ISI+SWLS, data = MyT)
summ(LR.Sub.final)

Observations	110
Dependent variable	TFI_Reduction
Type	OLS linear regression

F(3,106)	14.48
R²	0.29
Adj. R²	0.27

	Est.	S.E.	t val.	p
(Intercept)	-22.81	6.52	-3.50	0.00
GAD	-0.62	0.33	-1.88	0.06
ISI	-0.81	0.25	-3.26	0.00
SWLS	0.63	0.23	2.75	0.01
Standard errors: OLS

The table 4.2 gives us the final summary of the best subset possible based on adjusted \(R^{2}\), Residual sum of square, Mallows’s Cp, Bayesian Information Criterion, and Akaike information criterion. The last three of them suggest the same combination of variables. So our best subset model will contain GAD, ISI, SWLS. Now we will train our model using those three variables and training data and check the quality of fit on testing data.

par(mfrow=c(2,2))
plot(LR.Sub.final)

Figure 4.2: Residual analysis of the model

Before we proceed any further, let’s check the model diagnostic. The figure 4.2 below shows the residual vs. the fitted value plot, and it does not show any pattern. Residuals are homogeneously scattered, and the normality assumption of the residuals also holds.

4.3 Comment on the model:

The model we chose has an intercept value of -19.48. The coefficient of GAD is -0.65; that is, if we consider other regressors constant, then one unit change in GAD will result in a 0.62 unit decrease in the TFI_Reduction (in other words, increase in Post_TFI_Scores). Our model captures almost \(21\%\) variability in the data with training MSE 251.84. “ISI” and “SWLS” were found to be the most influential factor for TFI scores.

5 Performing penalized regression

5.1 Ridge

First, we have to find an optimal \(\lambda\) value. The following figure 5.1 shows the possible \(\lambda\) value list and their corresponding MSE. The \(\lambda\) value that gives the minimum error is 15.3, and \(\lambda\) value 129.7 is the closest \(\lambda\) value given the MSE is within one standard deviation that of minimum \(\lambda\)

x<-model.matrix(TFI_Reduction~.,data=MyT)[,-1]
scale.x<-scale(x)

y<-MyT$TFI_Reduction

# optimum lambda
ridge.cv<-cv.glmnet(scale.x, y ,alpha=0)
plot(ridge.cv)

Figure 5.1: MSE of several Ridge model vs log of lambda

We will train our ridge model using optimal \(\lambda=129.7\)

5.2 Lasso

First, we have to find an optimal \(\lambda\) value. Figure 5.2 shows the list of possible \(\lambda\) values and their corresponding MSE. The \(\lambda\) value that gives the minimum error is 1.6, and \(\lambda\) value 4.9 is the closest \(\lambda\) value given the MSE is within one standard deviation that of minimum \(\lambda\)

# optimum lambda
lasso.cv<-cv.glmnet(scale.x, y ,alpha=1)
plot(lasso.cv)

Figure 5.2: Optimum lambda for lasso regression

6 Performing PCR, PLSR

6.1 PCR

pcr.model<-pcr(TFI_Reduction ~ . , data = CS1Dat, scale= TRUE,valiadtion = "CV") # 10-fold
summary(pcr.model)

Data:   X dimension: 142 11 
    Y dimension: 142 1
Fit method: svdpc
Number of components considered: 11
TRAINING: % variance explained
               1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
X                29.21    42.50    52.92    62.30    71.18    78.69    85.12
TFI_Reduction    24.13    24.15    24.34    26.11    29.45    30.12    30.12
               8 comps  9 comps  10 comps  11 comps
X                90.18    94.63     98.33    100.00
TFI_Reduction    30.71    30.72     30.76     31.01

validationplot(pcr.model,val.type = "MSEP")

Figure 6.1: MSEP vs the number of component graph

In this approach, we chose several linear combinations of our regressor (component). According to the table above single component can explain 24 percent variability in TFI_Reduction. Trends keep improving until the 6th component. After that additional component does not capture any extra variability in the data. Figure 6.1 also corroborates our finding as the line goes almost straight after the 6th component. So we will use four components to train our PCR model.

6.2 PLSR

#optimum component
pls.model<-plsr(TFI_Reduction ~ . , data = CS1Dat, scale= TRUE,valiadtion = "CV") # 10-fold
summary(pls.model)

Data:   X dimension: 142 11 
    Y dimension: 142 1
Fit method: kernelpls
Number of components considered: 11
TRAINING: % variance explained
               1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
X                28.73    38.00    44.35    52.88    59.64    67.23    75.31
TFI_Reduction    27.54    30.77    30.94    30.98    31.01    31.01    31.01
               8 comps  9 comps  10 comps  11 comps
X                80.64    89.10     93.59    100.00
TFI_Reduction    31.01    31.01     31.01     31.01

validationplot(pls.model,val.type = "MSEP")

Figure 6.2: MSEP vs number of component

As we can see, when we use PSLR, the optimal number of components reduces to 2. Because after that, the line becomes almost straight. Figure 6.2 also corroborates our finding as the line goes almost straight after the 2nd component. So we will use two-component to train our PLS model.

Components		one	two	three	four	five	six
PCR	X	29.1	43.51	53.9	63.16	72.0	79.84
	TFI_Reduction	17.67	18.13	18.51	21.40	22.84	23.54
PLS	X	28.40	38.30
	TFI_Reduction	21.38	24.74

The table above summarizes the PCR and PLS regression. We can conclude that we are acquiring the same capture level in variability using just two-component in PLS, where PCR uses six-component.

setwd("//Users//prosantabarai//Desktop//OrabyHW//Case_study")

mysum<- read.table("MySum")

kable(mysum, format = "simple", caption = "Regression summary table of diffrent methods")

Table 6.1: Regression summary table of diffrent methods
	Stepwise	Best_Subset	Ridge	Lasso	PCR	PLS
(Intercept)	-19.09	-19.49	-24.11	-24.11	.	.
Group	.	.	0.00	.	-0.44	-0.49
HHI_Score	.	.	-0.32	.	-0.42	-0.43
GAD	-0.72	-0.55	-0.61	.	-2.47	-2.05
PHQ	.	.	-0.71	-1.29	-3.24	-2.52
ISI	-0.82	-0.8	-0.80	-1.82	-3.39	-4.34
SWLS	0.54	0.48	0.57	.	2.44	2.94
Hyperacusis	.	.	-0.37	.	0.2	-0.65
CFQ	.	.	-0.10	.	0.99	1.65
Gender	.	.	-0.17	.	-0.92	-1
Age	.	.	0.12	.	0.93	0.9
Duration_of_tinnitus.years.	.	.	-0.01	.	-0.71	0.17
Training_MSE	253.93	253.93	295.61	293.76	253.53	249.55
Test_MSE	251.84	251.84	369.10	368.75	246.5	255.86

7 Final conclusion

The above table 6.1 summarizes all the methods we used for analysis. Stepwise and best subset selection exhibit simpler performance holding the second-lowest test MSE 251.84. However, another nonparametric technique, PCR, has the lowest test MSE of all(246.5). So far, PCR regression seems to have the leads.

Using Supervised Learning Techniques to Analyze Tinnitus Functional Index

Prosanta Barai

10/15/2021