Introduction

Forced expiratory volume (FEV) refers to the volume of air that can be exhaled during a forced breath in t seconds. FEV1 is a significant parameter for identifying restrictive and obstructive respiratory diseases like asthma. Spirometry is a test that is used to measure the ability of a person to inhale and exhale air respective to time. Results received from a spirometry test are dependent on the effort and cooperation between patients and examiners.

FEV values greater than 80% of the predicted average value are considered to be normal. These results depend heavily on the technicality of implementation as well as personal attributes of the patient. Age and gender are the major factors that affect the average values of FEV in healthy individuals. Height, weight, and ethnicity are some of the other influencing factors.

Therefore, researches seek to investigate how lung function as measured by forced expiratory volume in liters (FEV) is related to one or more patient characteristics, be they socio-demographic, genetic, environmental, molecular, clinical, or behavioral. Regression models provide tools for explaining how an outcome variable varies for different values of explanatory variables and predict the value of FEV for subjects. Such models can also be used to predict future outcomes for new individuals based on their set of explanatory variable values.

The main objective of this study is to determine the relationship between the level of pulmonary function and individual characteristics.

Methods

Data description

In this study, 654 subjects were recruited and measured their forced expiratory volume (FEV, liters), and assessed how age, height, sex, and smoking covariates related to forced expiratory volume.

Software

R-4 statistical package was used for statistical analysis.

Exploratory Data Analysis

Measures of central tendency, measures of dispersion, frequencies, and graphs were used to explain descriptive data analysis.

Response variable

The response variable for this study was forced expiratory volume measured in liters.

Predictor variables

The predictor variables included in this study were age in years , height(inches), sex, and smoking status. The dummy variable was created for categorical variables of sex (Female=0 and Male =1)and smoking status(non-smoker=0 and smoker=1).

Statistical Analysis

Simple linear regression

The dependent variable forced expiratory volume in liters was continues variable then simple linear regression was applied for data analysis.

Model is given as:

\[Y_i=β_0+β_1 X_i+ϵ_i \]

Where \(β_0\),\(β_1\) are parameters, \(X_i\) is predicator variable for \(i^{th}\) individuals, \(_i\) the value of the FEV for \(i^{th}\) observation, \(\epsilon_i\) are independent and normally distributed with mean zero and constant variance. \((ϵ_i\sim N (0, δ^2 ))\), \(\epsilon_i\) and \(\epsilon_j\) are uncorrelated and \(cov\,(\epsilon_i,\,\epsilon_j) = 0,i≠j\)

\(β_1\) indicates the change in the mean FEV per unit increase in \(X_i\) whereas \(β_0\) is the constant parameter when \(X_i=0\).

Multple lineaar regression

Multiple linear regression is used to determine the relationship between forced expiratory volume and all predictors simultaneously.

Model:

\[Y_i=β_0+β_1 X_i1+β_2 X_{i2}+⋯+β_{k-1} X_{i,k-1}+ϵ_i\] Where \(Y_i\) the value of the FEV for \(i^{th}\) individuals, \(β_0,β_1 X_1,β_2 X_2,…,β_{k-1} X_{k-1}\) are regression coefficients which indicates the direction and magnitude of the relationship between FEV and each explanatory variables.

\(X_{i1},X_{i2},…,X_{i,k-1}\) are values of the predictor variables for the $i^{th} $ individuals and \(ϵ_i\) are independent and normally distributed with mean zero and constant variance. \(ϵ_i\sim N (0,δ^2 ),ϵ_i\) and \(ϵ_j\) are uncorrelated and \(cov\,(ϵ_i,ϵ_j) = 0,i≠j\).

Model building

Preliminary model

To begin model building process, I started from the main effect for original data set that includes all explanatory variables age, height, sex and smoke and the two way interaction effect of all other variables.

Full model: \[FEV_i=β_0+β_1 Age_i+β_2 Height_i+β_3 Sex_i+β_4 Smoking_i+β_5 Age_i.Sex_i+β_6 Ag_i .Smoking_i+\] \[ β_7 Height_i.Sex_i+β_8 Height_i.Smoking_i+ϵ_i\]

Transformed model

In the preliminar model fitting the original data did not met all the aasumptions of linear regression model. So, the data was transformed to build model by similarly fashion done in preliminary model building steps for the main effect and the two way interaction for all explantory variables.

Full model:

\[log(fev_i)=β_0+β_1 Age_i+β_2 Height_i+β_3 Sex_i+β_4 Smoking_i+β_5 Age_i.Sex_i+β_6 Ag_i .Smoking_i+\] \[ β_7 Height_i.Sex_i+β_8 Height_i.Smoking_i+ϵ_i\]

Model selection

The Adjusted R-square, Mallow’s \(C_p\) and \(AIC\) criteria was used to select the “best” model among subset models.

Final model

For both premiliminary and transformed subset model selection, the four variables age, height, sex and height were selected and for the preliminary model interaction age, height, age * smoke, Age * Heiht were selected. but for transformed mdel interaction Age,height, age* Smoke, height * Sex were selected.

Model diagnostics

The goodness fit of the best selected linear model was checked by residual analysis.

  • Test for normality of residuals: the normal probability plot was used to informally check for normality of residuals. This is a plot of ordered residuals against their expected values under normality. A formal test, Shapiro-Wilk was used to test for normality assumption at 5% level of significance.

  • Test for constant variance of the error term: Plot of residuals against fitted values was used as preliminary method of checking for constancy of variance assumption. This was then substantiated using the Breusch-Pagan test for constancy of variance at 5% level of significance. The following hypothesis was tested:

  • Test for independence of error terms: Durbin-Watson test was used to check for the assumption of independence of error terms.

  • Test for linearity/ lack of fit: F- test is used to test for linearity of the model selected. However, there were no replicates in the explanatory variables hence the plot for LogFEV against fitted values was used.

The multicollinearity problem was cheched for final selected variables using variance inflation factor (VIF) whether predictors are highly correlated or not. Since VIF < 10, then there was no multicollinearity problem in data.

Reuslts

Exploratory Analysis

This section is a summary of the exploratory data analysis for the response variable as well as predictor variables. From Table 1 below the average forced expiratory volume in liters for non-smoker was lower compared to smoker subjects.

Table 1. show summary statistics of the data

## # A tibble: 2 x 4
##   Sex        n  mean std.dev
##   <fct>  <int> <dbl>   <dbl>
## 1 Female   318  2.45   0.646
## 2 Male     336  2.81   1.00
## # A tibble: 2 x 4
##   Smoke          n  mean std.dev
##   <fct>      <int> <dbl>   <dbl>
## 1 Non-smoker   589  2.57   0.851
## 2 Smoker        65  3.28   0.750

Figure 1: Scatter plot matrix for FEV and covariates.

From scater plot, we visualize that the fev variable is highly correlated with age and heigh.

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Figure 2: The relationship between fev and smoking adjusting for age.

Figure 2 indicated that the effect smokers on FEV as compared to non-smoker subjects by adjusting age of the individuals. So, age may be considered as a modification effect for the relation ship between smoking and FEV.

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Figure 3: The effect of height on fev for Sex.

From Figure 3, analysis of raw data revealed that there is an interactive effect on the change of FEV in liters between female and male considering for height. Hence, the interaction seems to be an important.

Simple linear regression model

From the result of simple linear regression all variables age, height, sex and smoke have showed a significant effect on the forced expiratory volume of subjects. The simple correlation between age and height was 0.79 which indicates high correlation (R output Appendix 1.1). An additional one year age for individuals, on average force expiratory volume in liters changed by 22% (Table 1).

Model building

Table 2. Simple linear regression analysis
  fev
Predictors Estimates std. Error CI p
(Intercept) 0.43 0.08 0.28 – 0.58 <0.001
Age 0.22 0.01 0.21 – 0.24 <0.001
Observations 654
R2 / R2 adjusted 0.572 / 0.572
AIC 1119.030

Multiple linear regression Model

Table 3.Two-way interaction analysis
  fev
Predictors Estimates std. Error CI p
(Intercept) -0.63 0.53 -1.67 – 0.42 0.239
Age -0.35 0.06 -0.47 – -0.24 <0.001
Height 0.04 0.01 0.02 – 0.06 <0.001
Sex [Male] -0.17 0.11 -0.39 – 0.06 0.148
Smoke [Smoker] 0.60 0.31 -0.01 – 1.21 0.052
Age * Height 0.01 0.00 0.00 – 0.01 <0.001
Age * Sex [Male] 0.03 0.01 0.01 – 0.05 0.014
Age * Smoke [Smoker] -0.06 0.02 -0.10 – -0.01 0.016
Observations 654
R2 / R2 adjusted 0.799 / 0.797
AIC 636.524

From Table 3 output, the interaction effect age with height, Age with sex and Age with Smoke were significant at 5% level of significance. Therefore, interaction effcet seem to be important for the change of force expiratory volume in liters of individuals. on average FEV decreases 69% for smoker compaared to non-smoker accounting the age of individuals. Similarly, there was significant meaan differences on FEV for Female and Male by adjusting the height of the subjects.

preliminary model

The fitted model is: \[E(fev)= -4.457 +0.0655Age+0.104Height+0.157Sex-0.087Smoke\] Figure 4. The plot of residual vs fitted

The plot devaited from the refrence line at tail and head. The normality plot shows heavy tails on both ends.There is extreme observation at the top and not clearly indicated wether normality assumption met or not. Also formal test by Shapiro-Wilk at 5% level of significance indicated that the assumption of normality of error terms is violated \((p =0.0002)\) (R outpu Appendix 1.1).

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Figure 5. plot for linearity and constant variance assumption

In Figure 5, there is a linearity and homogeniety issues. The plot shows that residuals vary as the fitted values increase.I performed the Breusch Pagan test to check the assumption of constant variances at \(\alpha=0.05\) and p-valuet is small (p-value = 2.89e-13) which indicates that variance is not constant.

Transformed model

Due to lack of linearity, constant error variance and normality, I took log transformation remedy on dependent variable and did the analysis using linear model.

The fitted model is: \[E(fev)= -1.944 + 0.023Age + 0.043Height + 0.029Sex -0.046Smoke\] Linearity and Homogeniety assumption

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Figure 6. Residual versus fitted plot for equal variance assumption and linearity.

From the above plot the fitted blue line seems a linear. The points are also systematicaly distributed and variance is seems like a constant. To assure wether variance is constant, I used Preusch-pagan test and the p-values \((p-value = 0.1881)\) is larger which indicates that the error variances are equal (R output Apppendix 2).

Normality assumption

Figure 6. Density versus studentised residuals

Frm the plot we visualized that logFEV skewed to left and normality assumption may be an issue. I used the Shapiro Wilk test to demonstrated that the residuals are normally distributed or not. The test showed that still there is Normality assumption problem at 5% level of significance \((p-value = 4.895e-06).\)

Independence assumption

The test statistic for Durbin Watson test was 1.586. Since p-value = 0.00 was less than \(\alpha = 0.\) the null hypothesis is rejected and we conclude that the errors are not independent of each others. This problem may be occured during collection method.

Identification of Outliers and Influential Observations

Outliers in the response variable

##      rstudent unadjusted p-value Bonferroni p
## 140 -4.428418         1.1141e-05     0.007286
## 2   -4.121554         4.2520e-05     0.027808

Figure 7. Standardized residuals versus leverage.

From Figure 7 above, two observations were etremely close to Cook’s distance line. Thet Bonferroni test is showed that 2 and 140 observations are detected as an outlier with p-value 0.007 and 0.029 at 5% level of significance respectively. Both 2 and 140 observation have negateve residuals which may be underestimated the FEV data of the individuals.we describe that the fev in liters rate in 140 is 0.79% while the model only predicts at 0.398%. similarly observation 2 spot the FEV data about 1.724% while model predicts 1.132%. Removing these two FEV observation will have a significant impact on the values of the intercept and slope in our regression model.

Discusion and Conclusion

The aim of the study was to determine wether age height, sex and smoke have a relationship between force expiratory volume in liters of subjects in the study. For original data age ,sex and height had a significanf impact on the change of FEV in this case study before response variable transformation. The two- way interaction in the original data smokers had a significant mean diference on FEV adjusting age compared to non-smoker and similarly height had also significant effect to change mean of FEV for male compared to female in first-order regression analysis for original data.

In the original data the linear regression model was not the best method to predict the average mean of FEV for the subjects becuase all assumptions of linear regression did not met and remedian measured was made to transform daata using logarithmic transformation and data analysis was reviseted with linear regression model.

After transformation of response variable age, sex ,height and smoke factors are significant to determine the level of FEV in individuals. The final model for transformed response had also the problem of normality and independence issues even though the linearity and constant variance valid. There were no significant interaction effects in the final model of the transformed data. but the interaction effect was important in the original data.

In conclusion, age, height, sex and smoke had a significanf effect on the force expiratory volume in liters of individuals. but the linear model may not predict the value of FEV in this case study because the assumption of normality and independence of errors is not valid for data. one can discuss adjusting FEV for age or height before assessing the relationship between FEV and smoking. The study design for these data requires that the adjustment be made post-sampling, through the use of regression models, but can elicit discussion concerning how to control for factors such as age, gender or height by the use of different sampling schemes.

Appendix

R-script for original data Analyis

Appendix 1: Simple and multiple linear regression without interaction term

1.1:Fitting simple linear regression

# simple correlation
plot(FEV[,c(-4,-5)]) 
round(cor(FEV[,c(-4,-5)]),2)
#--fit fev versus Age//------------------------
m1<-lm(fev~Age, data = FEV)
summary(m1)

1.2:Multiple linear

# Produce a Scatter matrix plot for continous explanatory variables
ggpairs(FEV[,c(-4,-5)]) + theme_bw(base_size = 20)
FEV.sum=FEV%>%
  mutate(Sex=as.factor(ifelse(Sex==0,"Female", "Male")))%>%
group_by(Sex)%>%
  summarize(
    n=n(),
    mean=mean(fev),
    `std.dev`=sd(fev) )
FEV.sum
FEV.sum=FEV%>%
  mutate(Smoke=as.factor(ifelse(Smoke==0,"Non-smoker", "Smoker")))%>%
group_by(Smoke)%>%
  summarize(
    n=n(),
    mean=mean(fev),
    `std.dev`=sd(fev) )
FEV.sum
# fev versus Age,Height, Sex and Smoke
model=lm(fev~Age + Height+Sex+Smoke, data = FEV)
summary(model)

1.3: Model selection

FEV <- FEV %>%
  dplyr::select(Age:Smoke)
best <- function(model, ...)
{
  subsets <- regsubsets(formula(model), model.frame(model), ...)
  subsets <- with(summary(subsets),
                  cbind(p = as.numeric(rownames(which)),
                        which, rss, rsq, adjr2, cp, bic))
  return(subsets)
}

fit<-lm(fev~., data = FEV)
round(best(fit,nbest=3),4)

# The best model is the model with Age, Height, Sex and Smoke based on adjR-square and Cp values.
best.fit=lm(fev~Age +Height+Sex+Smoke, data = FEV)
summary(best.fit)

1.4:model Diagnostics for multiple linear regression

bes.fitfit=lm(fev~Age +Height+Sex+Smoke, data = FEV)

plot(best.fit, which = 2)

# linearity assumption
plot(best.fit,which = 1)
# or
res <-ggplot(best.fit, aes(.fitted, .resid))+geom_point()+     
  geom_smooth()+
  geom_hline(yintercept=0, col="red", linetype="dashed", size= 1)+
  ggtitle("Residual vs fitted")+
  labs(x="Fitted value",y=" Residuals")+
  theme_bw();res

#-Testing for constant variance using crplot() function
plot(best.fit,which =3)
#or
res <-ggplot(best.fit, aes(.fitted, .resid))+geom_point()+ 
  geom_smooth()+
  geom_hline(yintercept=0, col="red", linetype="dashed", size= 1)+
  ggtitle("Residual vs fitted")+
  labs(x="Fitted value",y=" Residuals")+
  theme_bw();res
#preusch-pagan test
bptest(best.fit)

# #--Test for normality assumption
plot(best.fit,which = 2)
# or use studres() function
sresid <- studres(best.fit) 
hist(sresid, freq=FALSE, 
     main="Distribution of Studentized Residuals")
xfit<-seq(min(sresid),max(sresid),length=40) 
yfit<-dnorm(xfit) 
lines(xfit, yfit, col="red")
# formal test:SHAPIRO TEST 
shapiro.test(best.fit$residuals)# p-value > 0.05 indicates normality assumption mets.

#--Test for indenpendence
dwtest(best.fit) # p < 0.05, indicates the errors are autocorrelated. 

# -------Assessing for Outliers ussing plot() function
plot(best.fit,which = 5)
# Bonferonni test
outlierTest(best.fit) #p-value < 0.05 indicates the etreme observations are outlier
# Influential Observations test using Cook's D plot or plot() function 
# identify D values > 4/(n-k-1) 
cutoff <- 4/((nrow(FEV)-length(best.fit$coefficients)-2)) 
plot(best.fit, which=4, cook.levels=cutoff)

#Testing for multicollinearity
vif(best.fit)
mean(vif(best.fit))

#------Global Test of Model Assumptions: gvlma() not applicaple for interaction
gvbest.fit <- gvlma(best.fit) 
summary(gvbest.fit)

# linear regression models are not good to fit the given original dataset because all assumptions of linear regression 
# are failed.so it is better to transform data to fit with linear regression.

1.5: model validation

# we set the seed to (1) 
set.seed(1)
# we will partition into 80% training and 20% validation
trainrows <- sample(rownames(FEV), dim(FEV)[1]*0.8)
traindata <- FEV[trainrows, ]
# set the difference of the training into the validation set i.e. 20%
validrows <- setdiff(rownames(FEV), trainrows)
validata <- FEV[validrows, ]

best.fit=lm(fev~Age +Height+Sex+Smoke, data = FEV)
summary(best.fit)

pred<-predict(best.fit,newdata = validata)
vres<-data.frame(validata$fev, pred, residuals=validata$fev-pred)
head(vres)

#the model prediction is also not good. there is a difference between predicted and actual fev observation. 
#for instance if fev is 2.118 the predicted value is 2.528 with the difference of 0.41. So it is not good prediction.

1.6 model building for interaction term

model.interaction=lm(fev~Age+Height+Age*Height+Age*Sex+Age*Smoke, data = FEV)
tab_model(model.interaction,show.se = TRUE,  show.aic = TRUE)

1.7 model selection for interaction

full.model = regsubsets(fev~.^2, FEV)
mod.summary = summary(full.model)
names(mod.summary)

#Rsq
which.max(mod.summary$rsq)
plot(mod.summary$rsq,xlab = "Number of varaibles", ylab = "Rsquare", type = "l", col="blue")
points(8,mod.summary$rsq[8],col="red", cex=2, pch=20)

#Adjusted R-square 
which.max(mod.summary$adjr2)
plot(mod.summary$adjr2,xlab = "Number of varaibles", ylab = "Adjusted Rsquare", type = "l")
points(7,mod.summary$adjr2[7],col="blue", cex=2, pch=20)

#RSS
which.min(mod.summary$rss)
plot(mod.summary$rss,xlab = "Number of varaibles", ylab = "Residual sum Square", type = "l", col="violet")
points(8,mod.summary$rss[8],col="blue", cex=2, pch=20)

#Cp
which.min(mod.summary$cp)
plot(mod.summary$cp,xlab = "Number of varaibles", ylab = " Mallow's Cp", type = "l")
points(7,mod.summary$cp[7],col="red", cex=2, pch=20)

#BIC
which.min(mod.summary$bic)
plot(mod.summary$bic,xlab = "Number of varaibles", ylab = " BIC", type = "l")
points(5,mod.summary$bic[5],col="red", cex=2, pch=20)

#selected variables
par(mfrow=c(2,2))
plot(full.model, scale = "r2")
plot(full.model, scale = "adjr2")
plot(full.model, scale = "Cp")
plot(full.model,scale = "bic")
round(coef(full.model,5),3)

1.8: model Diagnostics for interaction

###---linearity assumption
plot(Best.model.interaction,which = 1)
# or
res <-ggplot(Best.model.interaction, aes(.fitted, .resid))+geom_point()+ 
  geom_smooth()+
  geom_hline(yintercept=0, col="red", linetype="dashed", size= 1)+
  ggtitle("Residual vs fitted")+
  labs(x="Fitted value",y=" Residuals")+
  theme_bw();res

#-Testing for constant variance using crplot() function
plot(Best.model.interaction,which =3)
#or
res <-ggplot(Best.model.interaction, aes(.fitted, .resid))+geom_point()+ 
  geom_smooth()+
  geom_hline(yintercept=0, col="red", linetype="dashed", size= 1)+
  ggtitle("Residual vs fitted")+
  labs(x="Fitted value",y=" Residuals")+
  theme_bw();res
#preusch-pagan test
bptest(Best.model.interaction)

# #----------------Test for normality assumption
plot(Best.model.interaction,which = 2)
# or use studres() function
sresid <- studres(Best.model.interaction) 
hist(sresid, freq=FALSE, 
     main="Distribution of Studentized Residuals")
xfit<-seq(min(sresid),max(sresid),length=40) 
yfit<-dnorm(xfit) 
lines(xfit, yfit, col="red")
shapiro.test(Best.model.interaction$residuals)

#----Test for indenpendence
durbinWatsonTest(Best.model.interaction) # p > 0.05, so the errors are not autocorrelated. 

# ---Assessing for Outliers ussing plot() function
plot(Best.model.interaction,which = 5)
# Bonferonni test
outlierTest(Best.model.interaction) #p-value < 0.05 indicates the etreme observations are outlier
# Influential Observations test using Cook's D plot or plot() function 
# identify D values > 4/(n-k-1) 
plot(Best.model.interaction,which = 4) # or
cutoff <- 4/((nrow(FEV)-length(Best.model.interaction$coefficients)-2)) 
plot(Best.model.interaction, which=4, cook.levels=cutoff)

R-script for Transformed data analysis

Appendix 2:

2.1 Multiple linear Regression model for transformed data

fitm=lm(log(fev)~Age+Height+Sex+Smoke, data = FEV)
summary(fitm)

2.2: Model selection for transformed data

full.model = regsubsets(log(fev)~., FEV)
mod.summary = summary(full.model)
names(mod.summary)

#Rsq
which.max(mod.summary$rsq)
plot(mod.summary$rsq,xlab = "Number of varaibles", ylab = "Rsquare", type = "l", col="blue")
points(4,mod.summary$rsq[4],col="red", cex=2, pch=20)

#Adjusted R-square 
which.max(mod.summary$adjr2)
plot(mod.summary$adjr2,xlab = "Number of varaibles", ylab = "Adjusted Rsquare", type = "l")
points(4,mod.summary$adjr2[4],col="blue", cex=2, pch=20)

#RSS
plot(mod.summary$rss,xlab = "Number of varaibles", ylab = "RSS", type = "l", col="violet")

#Cp
which.min(mod.summary$cp)
plot(mod.summary$cp,xlab = "Number of varaibles", ylab = " Cp", type = "l")
points(4,mod.summary$cp[4],col="red", cex=2, pch=20)

#BIC
which.min(mod.summary$bic)
plot(mod.summary$bic,xlab = "Number of varaibles", ylab = " BIC", type = "l")
points(3,mod.summary$bic[3],col="red", cex=2, pch=20)

#selected variables
plot(full.model, scale = "r2")
plot(full.model, scale = "adjr2")
plot(full.model, scale = "Cp")
plot(full.model,scale = "bic")
round(coef(full.model,4),3)

# After transformed also the best model is the model with Age, Height, Sex
transformbest.fit=lm(log(fev)~Age +Height+Sex +Smoke, data = FEV)
summary(transformbest.fit)

3.4: model Diagnostics for transformed data

###------linearity assumption
plot(transformbest.fit,which = 1)
# or
res <-ggplot(transformbest.fit, aes(.fitted, .resid))+geom_point()+ 
  geom_smooth()+
  geom_hline(yintercept=0, col="red", linetype="dashed", size= 1)+
  ggtitle("Residual vs fitted")+
  labs(x="Fitted value",y=" Residuals")+
  theme_bw();res

#--Testing for constant variance using crplot() function
res <-ggplot(transformbest.fit, aes(.fitted, .resid))+geom_point()+ 
  geom_smooth()+
  geom_hline(yintercept=0, col="red", linetype="dashed", size= 1)+
  ggtitle("Residual vs fitted")+
  labs(x="Fitted value",y=" Residuals")+
  theme_bw();res
#preusch-pagan test
bptest(transformbest.fit)

# #-----Test for normality assumption
plot(transformbest.fit,which = 2)
# or use studres() function
sresid <- studres(transformbest.fit) 
hist(sresid, freq=FALSE, 
     main="Distribution of Studentized Residuals")
xfit<-seq(min(sresid),max(sresid),length=40) 
yfit<-dnorm(xfit) 
lines(xfit, yfit, col="red")

# formal test:SHAPIRO TEST
shapiro.test(transformbest.fit$residuals)

#-----Test for indenpendence
durbinWatsonTest(transformbest.fit) # p < 0.05, so the errors are autocorrelated. 

# --- Assessing for Outliers ussing plot() function
# identify D values > 4/(n-k-1) 
cutoff <- 4/(nrow(FEV)-length(transformbest.fit$coefficients)-2)
plot(transformbest.fit, which=4, cook.levels=cutoff)
# Bonferonni test
outlierTest(transformbest.fit) #p-value < 0.05 indicates the etreme observations are outlier
# Influential Observations test using Cook's D plot or plot() function
FEV["2",]
fitted(transformbest.fit)["2"]
FEV["140", ]
fitted(transformbest.fit)["140"]
Testing for multicollinearity
vif(transformbest.fit)
mean(vif(transformbest.fit))
#------Global Test of Model Assumptions: gvlma() 
gvtransformbest.fit <- gvlma(transformbest.fit) 
summary(gvtransformbest.fit)