Regression analysis

The peak flow rate of a person is the fastest rate at which the person can expel air after taking a deep breath. Peak flow rate is measured in units of liters per minute and gives an indication of the person’s respiratory health. Researchers measured peak flow rate and height for 17 men.

Getting the data ready:

height <- c(174,183,176,169,183,186,178,175,172,179,171,184,200,195,176,176,190)
pflow <- c(733,572,500,738,616,787,866,670,550,660,575,577,783,625,470,642,856)
data1 <- data.frame(height, pflow)

The scatter plot uses height as the predictor, so height goes on the x-axis. If the ggplot2 package is not installed yet, install it with install.packages("ggplot2", dependencies = TRUE), then load it with library(ggplot2).

library(ggplot2)
ggplot(data1, aes(height, pflow)) + geom_point() + geom_smooth(method = "lm", se = FALSE)+
  xlab("Height") + ylab("Peak flow rate") + ggtitle("Scatter plot")

Description of the plot: The linear relationship between height and peak flow rate looks positive and weak to moderate, with considerable scatter around the fitted line. This means that as height increases, peak flow rate tends to increase as well.

The model can be written out as: \[ PeakFlowRate_{i} = \beta_{0} + \beta_{1}(Height_{i}) + e_{i} \] The regression model assumes that the errors are independent and normally distributed with a mean of zero and a constant variance.

Fitting the simple linear regression model

model1 <- lm(pflow ~ 1 + height, data1)

Looking at the summary of the regression output

summary(model1)

## 
## Call:
## lm(formula = pflow ~ 1 + height, data = data1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -170.096  -99.188    1.904  101.789  216.881 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -153.926    607.464  -0.253    0.803
## height         4.511      3.364   1.341    0.200
## 
## Residual standard error: 115.2 on 15 degrees of freedom
## Multiple R-squared:  0.1071, Adjusted R-squared:  0.04757 
## F-statistic: 1.799 on 1 and 15 DF,  p-value: 0.1998

The results suggest that for each 1 cm increase in height, the expected peak flow rate increases by 4.51 liters per minute.

The estimated regression equation: \[ \widehat{PeakFlowRate}_{i} = -153.93 + 4.51(Height_{i})\]

The null hypothesis is that the true slope (the population parameter) is equal to zero. The test statistic is t = 1.34, and the p-value is 0.20. At the 0.05 level, we fail to reject the null hypothesis that the true slope is zero. We conclude that there is not enough evidence of a linear relationship between height and peak flow rate; this does not show that the true slope is exactly zero.
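As a rough check on the test above, the t statistic and the two-sided p-value can be recomputed by hand from the rounded estimate and standard error in the summary table. This is only a sketch: the names est, se, df, t_stat and p_value are introduced here for illustration, and small rounding differences from the summary output are to be expected.

```r
# Values copied (rounded) from the summary(model1) output
est <- 4.511   # estimated slope for height
se  <- 3.364   # standard error of the slope
df  <- 15      # residual degrees of freedom (n - 2 = 17 - 2)

# t statistic for the null hypothesis that the true slope is zero
t_stat <- est / se

# Two-sided p-value from the t distribution with 15 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

round(c(t = t_stat, p = p_value), 3)
```

The p-value is doubled because the alternative hypothesis is two-sided (the slope may be positive or negative).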
95% CI for the slope:

low <- 4.51 - 1.96*3.36
up <- 4.51 + 1.96*3.36
c(low, up)

## [1] -2.0756 11.0956

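The 95% CI for the slope is (-2.08, 11.10). This calculation uses the normal critical value 1.96; with only 15 residual degrees of freedom, the t critical value (about 2.13) is more appropriate and gives a slightly wider interval of roughly (-2.66, 11.68). Either way the interval contains zero, which agrees with the test of the slope.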
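A t-based interval for the slope can also be obtained directly from the fitted model with confint(). The sketch below refits the same model under the illustrative name fit so that it runs on its own, and rebuilds the interval by hand (ci_manual, est, se and t_crit are also names assumed here) to show where the numbers come from.

```r
# Same data as above, repeated so this block is self-contained
height <- c(174,183,176,169,183,186,178,175,172,179,171,184,200,195,176,176,190)
pflow  <- c(733,572,500,738,616,787,866,670,550,660,575,577,783,625,470,642,856)
fit <- lm(pflow ~ height)

# t-based 95% CI for the slope, using the residual degrees of freedom
confint(fit, "height", level = 0.95)

# The same interval by hand: estimate +/- t critical value * standard error
est    <- coef(summary(fit))["height", "Estimate"]
se     <- coef(summary(fit))["height", "Std. Error"]
t_crit <- qt(0.975, df = df.residual(fit))
ci_manual <- est + c(-1, 1) * t_crit * se
ci_manual
```

Using confint() avoids copying rounded values out of the summary table and picks the correct degrees of freedom automatically.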

The 95% CI for the mean peak flow rate for 170cm tall men:

new.height <- data.frame(height = c(170))
predict(model1, newdata = new.height, interval = "confidence")

##        fit      lwr      upr
## 1 613.0274 517.5522 708.5026

The 95% CI for the mean peak flow rate is (517.55, 708.50) liters per minute.

The 95% prediction interval (PI) for the peak flow rate for a 170cm tall man

predict(model1, newdata = new.height, interval = "prediction")

##        fit      lwr      upr
## 1 613.0274 349.6655 876.3893

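The 95% PI is (349.67, 876.39) liters per minute. It is wider than the confidence interval because it covers the peak flow rate of a single 170 cm tall man rather than the mean for all men of that height.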
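To see why the prediction interval is wider than the confidence interval, both can be rebuilt from the standard textbook formulas for simple linear regression. This is a sketch under that assumption; fit, x0, s, sxx, y_hat, se_mean, se_pred, ci_mean and pred_int are illustrative names, not part of the original analysis.

```r
# Same data as above, repeated so this block is self-contained
height <- c(174,183,176,169,183,186,178,175,172,179,171,184,200,195,176,176,190)
pflow  <- c(733,572,500,738,616,787,866,670,550,660,575,577,783,625,470,642,856)
fit <- lm(pflow ~ height)

n      <- length(height)
x0     <- 170                                # height at which we predict
s      <- summary(fit)$sigma                 # residual standard error
sxx    <- sum((height - mean(height))^2)     # spread of the predictor
y_hat  <- sum(coef(fit) * c(1, x0))          # fitted value at x0
t_crit <- qt(0.975, df = n - 2)

# Standard error of the estimated mean response at x0
se_mean <- s * sqrt(1 / n + (x0 - mean(height))^2 / sxx)
# Standard error for a single new observation: adds the error variance (the "1 +")
se_pred <- s * sqrt(1 + 1 / n + (x0 - mean(height))^2 / sxx)

ci_mean  <- y_hat + c(-1, 1) * t_crit * se_mean
pred_int <- y_hat + c(-1, 1) * t_crit * se_pred
rbind(confidence = ci_mean, prediction = pred_int)
```

The only difference between the two standard errors is the extra 1 under the square root, which accounts for the variability of an individual observation around the regression line.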

The correlation between height and peak flow rate

round(cor(height, pflow), 2)

## [1] 0.33

The correlation coefficient is 0.33, which indicates a weak positive linear association between height and peak flow rate.
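In simple linear regression with one predictor, the squared correlation equals the Multiple R-squared reported by summary(). The short sketch below checks this for the peak flow data; fit, r and r_squared are names introduced here for illustration.

```r
# Same data as above, repeated so this block is self-contained
height <- c(174,183,176,169,183,186,178,175,172,179,171,184,200,195,176,176,190)
pflow  <- c(733,572,500,738,616,787,866,670,550,660,575,577,783,625,470,642,856)
fit <- lm(pflow ~ height)

r         <- cor(height, pflow)       # Pearson correlation
r_squared <- summary(fit)$r.squared   # Multiple R-squared from the model

# The last two entries should agree
c(r = r, r_squared_from_r = r^2, r_squared_from_model = r_squared)
```

So a correlation of about 0.33 corresponds to height explaining only about 11% of the variation in peak flow rate.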

The “prostate” data set has 97 observations and 8 predictors. A study was conducted on 97 men with prostate cancer who were due to receive a radical prostatectomy.

prostate <- read.csv("prostate.txt", header = FALSE, sep = "")
names(prostate) <- c("obs","lcavol","lweight","age","lbph","svi","lcp",
                     "gleason","pgg45","lpsa")
head(prostate,3)

##   obs     lcavol lweight age      lbph svi      lcp gleason pgg45     lpsa
## 1   1 -0.5798185  2.7695  50 -1.386294   0 -1.38629       6     0 -0.43078
## 2   2 -0.9942523  3.3196  58 -1.386294   0 -1.38629       6     0 -0.16252
## 3   3 -0.5108256  2.6912  74 -1.386294   0 -1.38629       7    20 -0.16252

Using lpsa as the dependent variable, the multiple regression model can be written out as: \[lpsa_{i}=\beta_{0}+\beta_{1}(lcavol_{i})+\beta_{2}(lweight_{i})+\beta_{3}(age_{i})+\beta_{4}(lbph_{i})+\beta_{5}(svi_{i})+\beta_{6}(lcp_{i})+\beta_{7}(gleason_{i})+\beta_{8}(pgg45_{i})+e_{i}\] The multiple regression model assumes that the errors are independent and normally distributed with a mean of zero and a constant variance.

Fitting the model

model2 <- lm(lpsa ~ 1 + lcavol+lweight+age+lbph+svi+lcp+gleason+pgg45, prostate)

Looking at the result summary

summary(model2)

## 
## Call:
## lm(formula = lpsa ~ 1 + lcavol + lweight + age + lbph + svi + 
##     lcp + gleason + pgg45, data = prostate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7331 -0.3713 -0.0170  0.4141  1.6381 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.669337   1.296387   0.516  0.60693    
## lcavol       0.587022   0.087920   6.677 2.11e-09 ***
## lweight      0.454467   0.170012   2.673  0.00896 ** 
## age         -0.019637   0.011173  -1.758  0.08229 .  
## lbph         0.107054   0.058449   1.832  0.07040 .  
## svi          0.766157   0.244309   3.136  0.00233 ** 
## lcp         -0.105474   0.091013  -1.159  0.24964    
## gleason      0.045142   0.157465   0.287  0.77503    
## pgg45        0.004525   0.004421   1.024  0.30886    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7084 on 88 degrees of freedom
## Multiple R-squared:  0.6548, Adjusted R-squared:  0.6234 
## F-statistic: 20.86 on 8 and 88 DF,  p-value: < 2.2e-16

Interpreting a single slope: For a unit increase in svi, the lpsa value is expected to increase by 0.77, holding constant all other predictors in the model.

The estimated regression equation: \[\widehat{lpsa}_{i}=0.67+0.59(lcavol_{i})+0.45(lweight_{i})-0.02(age_{i})+0.11(lbph_{i})+0.77(svi_{i})-0.11(lcp_{i})+0.05(gleason_{i})+0.005(pgg45_{i})\]

Interpretation of the omnibus F test: The null hypothesis is that all of the slopes are zero, that is, none of the predictors is linearly related to the outcome, lpsa. The test statistic is F = 20.86 on 8 and 88 degrees of freedom, and the p-value is less than 0.05. At the 0.05 level, we reject the null hypothesis and conclude that at least one of the predictors significantly predicts the outcome.

Significance of the slope associated with gleason: The null hypothesis is that, holding constant all other predictors in the model, there is no linear relationship between gleason and lpsa (the true slope is zero). With a t value of 0.29 and a p-value of 0.78, we fail to reject the null hypothesis at the 0.05 level. We conclude that there is no evidence that gleason is a significant predictor once the other predictors in the model are held constant.

The proportion of variation in lpsa that can be explained by the predictors (the Multiple R-squared) is 0.65.
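The omnibus F statistic and the adjusted R-squared can both be recovered from the Multiple R-squared, the sample size and the number of predictors. The sketch below uses the rounded values from the summary output, so small rounding differences are expected; r2, n, k, f_stat, adj_r2 and p_value are names assumed here for illustration.

```r
# Values taken from the summary(model2) output and the data description
r2 <- 0.6548   # Multiple R-squared
n  <- 97       # number of observations
k  <- 8        # number of predictors

# Omnibus F statistic: explained variation per predictor
# divided by unexplained variation per residual degree of freedom
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Upper-tail p-value of the F test on k and n - k - 1 degrees of freedom
p_value <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

round(c(F = f_stat, adj_r2 = adj_r2), 4)
```

Both values should agree with the summary output (F of about 20.86 and adjusted R-squared of about 0.6234) up to rounding.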

Joel Messan