Type of variables:
Age: continuous quantitative variable
FEV: continuous quantitative variable
Hgt: continuous quantitative variable
Hgt_m: continuous quantitative variable
Sex: categorical variable
Smoking status: categorical variable
We use a scatter plot of FEV versus Age in years (Figure 1) to examine the relationship between the two variables. Figure 1 shows a positive relationship: as Age increases, the FEV measurement also increases. The correlation coefficient between Age and FEV is 0.7565421, indicating a moderately strong positive linear association.
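Throughout this report, the correlation coefficient refers to the Pearson sample correlation (the default of R's cor function),
\[ r=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} \]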
We next plot FEV against height in inches (Figure 2). There is again a positive relationship: as height increases, FEV increases as well. The pattern appears either linear or slightly curved. The correlation coefficient between height in inches and FEV is 0.8681802, indicating a strong positive linear association.
We plot boxplots of FEV by Sex, using the 0/1 indicator coding of Sex. From the boxplot in Figure 4, the range of FEV for females, (0.79, 3.84), is narrower than that for males, (0.80, 5.79). Since the distribution of FEV differs between the sexes, this suggests that Sex is associated with FEV.
We plot boxplots of FEV by smoking status, using the 0/1 indicator coding of smoking status. From the boxplot in Figure 5, the range of FEV for non-smokers, (0.79, 5.79), is wider than that for smokers, (1.69, 4.87). Since the distribution of FEV differs by smoking status, this suggests that smoking status is associated with FEV.
We suspect that these two regressors are highly correlated and may cause multicollinearity, since height in inches is simply a rescaling of height in metres. The correlation coefficient between the two regressors is 0.9998013, so they are almost perfectly positively correlated. Later, we check for multicollinearity using VIFs.
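Although this correlation is not computed in the appendix, it can be obtained directly (assuming the data have been attached as in the appendix):
#correlation between the two height measurements
cor(Hgt,Hgt_m)
## [1] 0.9998013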
We plot the histogram and QQ plot of FEV to check normality. The histogram in Figure 6 is roughly bell-shaped, so FEV could plausibly follow a normal distribution. However, the QQ plot in Figure 7 suggests that FEV may not be normal, as the left and right tails are much lighter than those of a normal distribution.
I include only linear terms in the full model to be fitted, to avoid overspecifying the model while guarding against model misspecification. The full fitted model is:
\[ \begin{aligned} \widehat{y} &= -4.436160+0.065435\,Age-8.197478\,Hgt_m+0.312051\,Hgt\\ &\quad+0.160431\,I(Sex=1)-0.082226\,I(Smoke=1) \end{aligned} \] The R code for deriving the coefficients of the full model, together with the coefficients table and ANOVA table, can be found in the appendix in Section 3.1.
Controlling all other regressors: a one-unit increase in Age increases predicted FEV by 0.065435 units; a one-unit increase in Hgt_m decreases predicted FEV by 8.197478 units; a one-unit increase in Hgt increases predicted FEV by 0.312051 units; on average, the mean FEV for males is 0.160431 units higher than for females; and on average, the mean FEV for current smokers is 0.082226 units lower than for current non-smokers.
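As an illustration (the covariate values below are hypothetical, chosen only to show the Sex effect), the fitted model can be used for prediction; the two predictions differ by exactly the Sex coefficient, 0.160431:
#two hypothetical 10-year-old non-smokers, 60 in (1.524 m) tall, differing only in Sex
newkids <- data.frame(Age=10, Hgt=60, Hgt_m=1.524, Sex=c(0,1), Smoke=0)
predict(model, newdata=newkids)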
We now test for the significance of the regression as a whole, and for the significance of the individual regression coefficients.
From the R output in Section 3.1, the F statistic is 449.4 on 5 and 648 degrees of freedom, with a p-value < 2.2e-16. Since the p-value is extremely small, we conclude that the regression is significant.
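As a check, the overall F statistic can be reconstructed from the ANOVA table in Section 3.1 by pooling the sequential sums of squares of the five regressor terms:
\[ F_0 = \frac{SS_R/5}{SS_{res}/648} = \frac{(280.893+94.865+0.662+4.172+0.326)/5}{109.848/648} \approx 449.4 \]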
Using the ANOVA table in Section 3.1, smoking status may not be a significant regressor, as its p-value of 0.16580 is high; the t test in the coefficients table gives the same p-value.
Previously, we fitted the model above under the following assumptions:
1. the errors are i.i.d., following a normal distribution with mean 0 and constant variance (constant variance assumption);
2. the relationship between the regressors and the response is linear (linearity assumption);
3. the regressors are orthogonal, i.e. there is no linear relationship between the regressors.
We now check whether the model satisfies these assumptions by conducting a residual analysis. Plotting the QQ plot of the standardised residuals:
#read in the data and fit the full model
data<-read.csv("/Users/isabellasoh/Desktop/ST3k Notes/ST3131 NOTES/assignment /FEV.csv",header=TRUE)
attach(data)
model <- lm(FEV~Age+Hgt_m+Hgt+ as.factor(Sex)+as.factor(Smoke),data=data)
#Figure 8: QQ plot of the standardised residuals
qqnorm(rstandard(model),ylab = "Standardized Residuals", xlab = "Z scores",main='Figure 8',datax=TRUE,pch=20)
qqline(rstandard(model),datax=TRUE)
From Figure 8, the standardised residuals appear approximately normal. To check for any other violations of the assumptions, we plot the standardised residuals against the fitted values.
#Figure 9: standardised residuals vs fitted values
plot(model$fitted.values,rstandard(model),xlab="fitted", ylab= "Standardized Residuals",main='Figure 9',pch=20)
abline(h=0)
From Figure 9, the residuals show a funnel shape and are not contained within (-2, 2): some points lie above 4 and others below -2. The funnel shape suggests that the residuals do not have constant variance, violating the constant variance assumption made when building the model. Figures 12-14 in the appendix, which plot the standardised residuals against each regressor, show a slight funnel shape as well. To deal with this, we can consider transforming the response or the regressors.
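One way to choose a response transformation (not done in the original analysis, but a useful check) is a Box-Cox profile; a minimal sketch, assuming the full model object fitted above and the MASS package:
library(MASS)
#profile log-likelihood for the Box-Cox parameter; a peak near lambda = 0
#would support the log transformation adopted later
boxcox(model, lambda=seq(-0.5,1,0.05))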
We use stepwise regression to select which regressors to include in the model, based on the AIC values generated by the R code in Section 4.1.1 of the appendix; at each step, the candidate model with the lowest AIC is kept.
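For reference, for linear models stepAIC computes AIC (up to an additive constant, via extractAIC) as \[ AIC = n\log\!\left(\frac{SS_{res}}{n}\right) + 2p, \] where \(p\) is the number of fitted coefficients; for the chosen model below, \(654\log(110.17/654)+2(5) \approx -1154.8\), matching the trace output.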
The chosen model is \[ \begin{aligned} \widehat{y} &= -4.427282+0.061536\,Age-8.612043\,Hgt_m+0.322902\,Hgt+0.164377\,I(Sex=1) \end{aligned} \] This model has a multiple R-squared of 0.7755 and an adjusted R-squared of 0.7741, slightly lower than the original model's (multiple R-squared 0.7762, adjusted R-squared 0.7744). We now check whether this model satisfies the assumptions, using residual analysis (Figure 10, appendix).
From the plot of the standardised residuals versus the fitted values, the residuals still show a funnel shape, indicating that the constant variance assumption is still violated. We now transform the response to deal with this violation. Applying a log transformation to the response, the new transformed model is (details in Section 4.1.2)
\[ \begin{aligned} \widehat{\log y} &= -1.93008+0.02127\,Age-3.67912\,Hgt_m+0.13626\,Hgt+0.03284\,I(Sex=1) \end{aligned} \] From the summary table of this transformed model, the multiple R-squared is 0.8102 and the adjusted R-squared is 0.8091, noticeably higher than for the previous two models (bearing in mind that R-squared values for a transformed response are not directly comparable to those for the raw response), indicating a better goodness of fit. Now we check whether this model satisfies the assumptions.
#Figure 11: standardised residuals vs fitted values for the log-transformed model
trying <- lm(log(FEV)~Age + Hgt_m + Hgt + as.factor(Sex),data=data)
plot(trying$fitted.values,rstandard(trying),xlab='fitted',ylab='Standardised Residuals',main='Figure 11',pch=20)
abline(h=0)
The plot of standardised residuals versus fitted values for the transformed model is much improved: the residuals are randomly scattered about 0, suggesting that the constant variance assumption is now satisfied.
Now we check the model for multicollinearity. We compute the variance inflation factor (VIF) of each regressor to determine whether multicollinearity is present and, if so, which regressors cause it.
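The VIF of the j-th regressor is \[ VIF_j=\frac{1}{1-R_j^2}, \] where \(R_j^2\) is the R-squared from regressing the j-th regressor on all the others; a large VIF means that regressor is nearly a linear combination of the rest.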
From the VIF output in Section 4.1.4, Hgt and Hgt_m have by far the largest VIFs, 2527.887447 and 2527.138602 respectively. We conclude that these two regressors suffer from severe multicollinearity, which in turn induces multicollinearity in the model. This is expected, as they are transformations of each other; their correlation coefficient is 0.9998013. We want to address the multicollinearity because it leads to poor coefficient estimates with large variances. We choose ridge regression rather than variable elimination because eliminating a variable reduces the predictive power and goodness of fit of the model. This can be seen in Section 4.1.3, where two reduced models are fitted, one without Hgt and one without Hgt_m: both have lower R-squared values (no_hgt_model: multiple R-squared 0.8096, adjusted R-squared 0.8084; no_hgtm_model: multiple R-squared 0.8092, adjusted R-squared 0.8083) than the model containing both regressors (multiple R-squared 0.8102, adjusted R-squared 0.8091). Hence we keep both regressors in the model and reduce the multicollinearity with ridge regression instead. Ridge regression reduces the large variance of the estimator of β, although the resulting estimator is biased.
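Concretely, ridge regression adds a penalty \(\lambda\) to the diagonal of \(X'X\) (with the regressors standardised), \[ \widehat{\beta}_{ridge}=(X'X+\lambda I)^{-1}X'y, \] so the estimator has smaller variance than least squares but is biased; \(\lambda=0\) recovers the least squares estimator.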
We now plot the ridge trace to determine a suitable value of λ for the ridge regression model (Section 4.1.5). From the ridge trace, we choose λ = 0.3, the point at which the coefficient estimates stabilise. The equation of the model fitted by ridge regression is
\[ \begin{aligned} \widehat{\log y} &= -1.93445340+0.02129261\,Age-0.51816624\,Hgt_m \\ &\quad+0.05605016\,Hgt+0.03164162\,I(Sex=1) \end{aligned} \] We now obtain the SSres and R-squared of the log-transformed model and of the new ridge regression model. From the R output in Section 4.1.6, the R-squared of the ridge model is 0.8094897, slightly lower than the log-transformed model's 0.8102358. This is expected: ridge regression sacrifices some goodness of fit in order to reduce the large variance of the estimator.
The final model proposed is: \[ \begin{aligned} \widehat{\log y} &= -1.93445340+0.02129261\,Age-0.51816624\,Hgt_m \\ &\quad+0.05605016\,Hgt+0.03164162\,I(Sex=1) \end{aligned} \] This model explains 80.9% of the variability in the response. Although the log-transformed response makes the model slightly harder to interpret, this is balanced by variable selection: the model has fewer regressors and is less complex. The model is also adequate, as the residual plot shows no violation of the assumptions.
Appendix for R code:
Import Data:
#importing the data
rm(list=ls())
data<-read.csv("/Users/isabellasoh/Desktop/ST3k Notes/ST3131 NOTES/assignment /FEV.csv",header=TRUE)
attach(data)
#Figure 1, scatter plot of FEV vs Age
plot(Age,FEV,main='Figure 1',pch=20)
cor(Age,FEV)
## [1] 0.7565421
#Figure 2, scatter plot of FEV vs height in inches
plot(Hgt,FEV,main='Figure 2',pch=20)
cor(Hgt,FEV)
## [1] 0.8681802
#Figure 3, scatter plot of FEV vs height in metres
plot(Hgt_m,FEV,main='Figure 3',pch=20)
cor(Hgt_m,FEV)
## [1] 0.8675619
#Figure 4, boxplot of FEV by Sex
boxplot(FEV~Sex,main='Figure 4')
#Figure 5, boxplot of FEV by Smoking status
boxplot(FEV~Smoke,main='Figure 5')
#histogram of FEV
#Figure 6
hist(FEV,freq=F,main ='Figure 6',xlab='FEV measurement', ylab='density',axes=T,col='grey')
x<-seq(0.5,6,length.out=654)
y <- with(data,dnorm(x,mean(FEV),sd(FEV)))
lines(x,y,col='red')
#Figure 7
#qqplot of FEV
qqnorm(FEV,main='Figure 7',pch=20)
qqline(FEV)
#Figure 10 and stepwise regression code
library(MASS)
model <- lm(FEV~Age+Hgt_m+Hgt+ as.factor(Sex)+as.factor(Smoke),data=data)
sw<-stepAIC(model,direction='both',trace=FALSE) #AIC trace suppressed here; the full trace is shown in Section 4.1.1
plot(sw$fitted.values,rstandard(sw),xlab='fitted',ylab='Standardised Residuals',pch=20,main='Figure 10')
abline(h=0)
#Section 3.1
model <- lm(FEV~Age+Hgt_m+Hgt+ as.factor(Sex)+as.factor(Smoke),data=data)
summary(model)
##
## Call:
## lm(formula = FEV ~ Age + Hgt_m + Hgt + as.factor(Sex) + as.factor(Smoke),
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.41306 -0.25696 0.00108 0.26249 1.89828
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.436160 0.222961 -19.897 < 2e-16 ***
## Age 0.065435 0.009477 6.904 1.21e-11 ***
## Hgt_m -8.197478 5.605713 -1.462 0.1441
## Hgt 0.312051 0.142227 2.194 0.0286 *
## as.factor(Sex)1 0.160431 0.033255 4.824 1.75e-06 ***
## as.factor(Smoke)1 -0.082226 0.059267 -1.387 0.1658
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4117 on 648 degrees of freedom
## Multiple R-squared: 0.7762, Adjusted R-squared: 0.7744
## F-statistic: 449.4 on 5 and 648 DF, p-value: < 2.2e-16
anova(model)
## Analysis of Variance Table
##
## Response: FEV
## Df Sum Sq Mean Sq F value Pr(>F)
## Age 1 280.893 280.893 1657.0034 < 2.2e-16 ***
## Hgt_m 1 94.865 94.865 559.6121 < 2.2e-16 ***
## Hgt 1 0.662 0.662 3.9047 0.04858 *
## as.factor(Sex) 1 4.172 4.172 24.6126 8.967e-07 ***
## as.factor(Smoke) 1 0.326 0.326 1.9248 0.16580
## Residuals 648 109.848 0.170
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#plots of the standardised residuals vs each regressor (Figures 12-14)
plot(Age,rstandard(model),main='Figure 12',ylab='Standardised Residuals',xlab='Age',pch=20)
abline(h=0)
plot(Hgt,rstandard(model),main='Figure 13',ylab='Standardised Residuals',xlab='Hgt',pch=20)
abline(h=0)
plot(Hgt_m,rstandard(model),main='Figure 14',ylab='Standardised Residuals',xlab='Hgt_m',pch=20)
abline(h=0)
#Section 4.1.1
#stepwise regression for variable selection
sw<-stepAIC(model,direction='both')
## Start: AIC=-1154.74
## FEV ~ Age + Hgt_m + Hgt + as.factor(Sex) + as.factor(Smoke)
##
## Df Sum of Sq RSS AIC
## - as.factor(Smoke) 1 0.3263 110.17 -1154.8
## <none> 109.85 -1154.7
## - Hgt_m 1 0.3625 110.21 -1154.6
## - Hgt 1 0.8160 110.66 -1151.9
## - as.factor(Sex) 1 3.9453 113.79 -1133.7
## - Age 1 8.0809 117.93 -1110.3
##
## Step: AIC=-1154.8
## FEV ~ Age + Hgt_m + Hgt + as.factor(Sex)
##
## Df Sum of Sq RSS AIC
## <none> 110.17 -1154.8
## + as.factor(Smoke) 1 0.3263 109.85 -1154.7
## - Hgt_m 1 0.4012 110.58 -1154.4
## - Hgt 1 0.8764 111.05 -1151.6
## - as.factor(Sex) 1 4.1723 114.35 -1132.5
## - Age 1 7.8357 118.01 -1111.9
summary(sw)
##
## Call:
## lm(formula = FEV ~ Age + Hgt_m + Hgt + as.factor(Sex), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.41427 -0.25367 0.00314 0.25576 1.92073
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.427282 0.223027 -19.851 < 2e-16 ***
## Age 0.061536 0.009057 6.794 2.47e-11 ***
## Hgt_m -8.612043 5.601730 -1.537 0.1247
## Hgt 0.322902 0.142113 2.272 0.0234 *
## as.factor(Sex)1 0.164377 0.033157 4.958 9.12e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.412 on 649 degrees of freedom
## Multiple R-squared: 0.7755, Adjusted R-squared: 0.7741
## F-statistic: 560.5 on 4 and 649 DF, p-value: < 2.2e-16
#Section 4.1.2
#log transformation of the original model
trying <- lm(log(FEV)~Age + Hgt_m + Hgt + as.factor(Sex),data=data)
summary(trying)
##
## Call:
## lm(formula = log(FEV) ~ Age + Hgt_m + Hgt + as.factor(Sex), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.64094 -0.08390 0.01535 0.09367 0.40192
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.93008 0.07881 -24.492 < 2e-16 ***
## Age 0.02127 0.00320 6.645 6.44e-11 ***
## Hgt_m -3.67912 1.97935 -1.859 0.06352 .
## Hgt 0.13626 0.05022 2.713 0.00683 **
## as.factor(Sex)1 0.03284 0.01172 2.803 0.00522 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1456 on 649 degrees of freedom
## Multiple R-squared: 0.8102, Adjusted R-squared: 0.8091
## F-statistic: 692.8 on 4 and 649 DF, p-value: < 2.2e-16
#Section 4.1.3
no_hgt_model <- lm(log(FEV) ~ Age + Hgt_m + as.factor(Sex)+as.factor(Smoke))
summary(no_hgt_model)
##
## Call:
## lm(formula = log(FEV) ~ Age + Hgt_m + as.factor(Sex) + as.factor(Smoke))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.63305 -0.08571 0.00991 0.09277 0.40943
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.940303 0.078950 -24.576 < 2e-16 ***
## Age 0.023628 0.003356 7.041 4.86e-12 ***
## Hgt_m 1.681433 0.066346 25.344 < 2e-16 ***
## as.factor(Sex)1 0.028735 0.011755 2.445 0.0148 *
## as.factor(Smoke)1 -0.047056 0.020962 -2.245 0.0251 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1458 on 649 degrees of freedom
## Multiple R-squared: 0.8096, Adjusted R-squared: 0.8084
## F-statistic: 689.7 on 4 and 649 DF, p-value: < 2.2e-16
no_hgtm_model <- lm(log(FEV)~Age+Hgt+as.factor(Sex),data=data)
summary(no_hgtm_model)
##
## Call:
## lm(formula = log(FEV) ~ Age + Hgt + as.factor(Sex), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.63782 -0.08705 0.01191 0.09164 0.40392
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.938492 0.078824 -24.593 < 2e-16 ***
## Age 0.021201 0.003206 6.612 7.9e-11 ***
## Hgt 0.042973 0.001681 25.560 < 2e-16 ***
## as.factor(Sex)1 0.031351 0.011711 2.677 0.00761 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1459 on 650 degrees of freedom
## Multiple R-squared: 0.8092, Adjusted R-squared: 0.8083
## F-statistic: 919.1 on 3 and 650 DF, p-value: < 2.2e-16
#Section 4.1.4
#obtain the VIFs of the regressors to determine if there is multicollinearity in the model
x <- cbind(Age,Hgt_m,Hgt,Sex)
x <- cor(x) #correlation matrix of the regressors
C <- solve(x) #inverse of the correlation matrix
VIF <- diag(C) #the VIFs are the diagonal entries
VIF
## Age Hgt_m Hgt Sex
## 2.753539 2527.887447 2527.138602 1.058023
#Section 4.1.5
#plot ridge trace and get lambda value
plot(lm.ridge(log(FEV)~Age + Hgt_m + Hgt + as.factor(Sex),data=data,lambda=seq(0,0.5,0.001)),type='l',lty=1)
abline(v=0.3)
#Section 4.1.6
tryridge<-lm.ridge(log(FEV)~Age + Hgt_m + Hgt + as.factor(Sex),data=data,lambda=0.3)
coef(tryridge)
## Age Hgt_m Hgt as.factor(Sex)1
## -1.93445340 0.02129261 -0.51816624 0.05605016 0.03164162
#prediction function built from the ridge coefficients
f <- function(Age,Hgt_m,Hgt,Sex){
  pred <- coef(tryridge)[1] + coef(tryridge)[2]*Age + coef(tryridge)[3]*Hgt_m +
    coef(tryridge)[4]*Hgt + coef(tryridge)[5]*Sex
  return(pred)
}
tryridge.predict <- as.matrix(f(Age=Age,Hgt_m=Hgt_m,Hgt=Hgt,Sex=Sex), ncol = 1)
newSS_res_1<- sum((log(FEV) - trying$fitted.values)^2) #SSres of the log-transformed model
newSS_res_2<- sum((log(FEV) - tryridge.predict)^2) #SSres of the ridge model
SST<-var(log(FEV))*653 #total sum of squares: (n-1)*var with n = 654
newr2_1 <- 1 - newSS_res_1/SST
newr2_2 <- 1 - newSS_res_2/SST
newr2_1
## [1] 0.8102358
newr2_2
## [1] 0.8094897