The peak flow rate of a person is the fastest rate at which the person can expel air after taking a deep breath. Peak flow rate is measured in units of liters per minute and gives an indication of the person’s respiratory health. Researchers measured peak flow rate and height for 17 men.
Getting the data ready:
height <- c(174,183,176,169,183,186,178,175,172,179,171,184,200,195,176,176,190)
pflow <- c(733,572,500,738,616,787,866,670,550,660,575,577,783,625,470,642,856)
data1 <- data.frame(height, pflow)
library(ggplot2)
ggplot(data1, aes(height, pflow)) + geom_point() + geom_smooth(method = "lm", se = FALSE)+
xlab("Height") + ylab("Peak flow rate") + ggtitle("Scatter plot")
Description of the plot: The linear relationship between height and peak flow rate looks positive and weak to moderate. This means that, on average, peak flow rate tends to increase as height increases, although the points scatter widely around the fitted line.
The model can be written out as: \[ PeakFlowRate_{i} = \beta_{0} + \beta_{1}(Height_{i}) + e_{i} \] The regression model assumes that the errors are independent and normally distributed with a mean of zero and a constant variance.
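The assumptions stated above can be checked informally once the model is fitted. The following is a minimal sketch, assuming the `height` and `pflow` vectors defined earlier; the name `fit` is hypothetical, and the Shapiro-Wilk test and the correlation-based spread check are illustrative choices, not part of the original analysis.

```r
# Data as defined earlier in the document
height <- c(174,183,176,169,183,186,178,175,172,179,171,184,200,195,176,176,190)
pflow <- c(733,572,500,738,616,787,866,670,550,660,575,577,783,625,470,642,856)

# `fit` is a hypothetical name used only in this sketch
fit <- lm(pflow ~ 1 + height)
res <- resid(fit)

# With an intercept in the model, least squares forces the residuals to average zero
mean_res <- mean(res)

# Shapiro-Wilk test: a small p-value would cast doubt on normality
sw <- shapiro.test(res)

# Rough check of constant variance: a strong correlation between the
# absolute residuals and the fitted values would suggest changing spread
spread_cor <- cor(abs(res), fitted(fit))

print(c(mean_residual = mean_res, shapiro_p = sw$p.value, spread_cor = spread_cor))
```

With only 17 observations these checks have little power, so they are best read alongside residual plots rather than as a formal verdict.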
Fitting the simple linear regression model
model1 <- lm(pflow ~ 1 + height, data1)
Looking at the summary of the regression output
summary(model1)
##
## Call:
## lm(formula = pflow ~ 1 + height, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -170.096 -99.188 1.904 101.789 216.881
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -153.926 607.464 -0.253 0.803
## height 4.511 3.364 1.341 0.200
##
## Residual standard error: 115.2 on 15 degrees of freedom
## Multiple R-squared: 0.1071, Adjusted R-squared: 0.04757
## F-statistic: 1.799 on 1 and 15 DF, p-value: 0.1998
The results suggest that when height increases by 1 cm, the expected peak flow rate increases by 4.51 liters per minute.
The null hypothesis is that the true slope (the population parameter) is equal to zero. The test statistic is t = 1.34 on 15 degrees of freedom, and the p-value is 0.20. At the 0.05 level, we fail to reject the null hypothesis. The data do not provide sufficient evidence that the true slope differs from zero; this is not the same as showing that the slope equals zero.
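The t statistic and p-value in the summary can be reproduced by hand, which makes the test explicit. This is a minimal sketch, assuming the same `height` and `pflow` data as above; the names `coefs`, `t_stat`, `df_resid`, and `p_value` are introduced here for illustration only.

```r
# Data as defined earlier in the document
height <- c(174,183,176,169,183,186,178,175,172,179,171,184,200,195,176,176,190)
pflow <- c(733,572,500,738,616,787,866,670,550,660,575,577,783,625,470,642,856)
model1 <- lm(pflow ~ 1 + height)

# Coefficient table: estimate, standard error, t value, p-value
coefs <- summary(model1)$coefficients
est <- coefs["height", "Estimate"]
se <- coefs["height", "Std. Error"]

# t statistic for H0: slope = 0
t_stat <- est / se

# Residual degrees of freedom: n - 2 for simple linear regression
df_resid <- df.residual(model1)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

print(c(t = t_stat, df = df_resid, p = p_value))
```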
95% CI for the slope:
low <- 4.51 - 1.96*3.36
up <- 4.51 + 1.96*3.36
c(low, up)
## [1] -2.0756 11.0956
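The approximate 95% CI for the slope, using the normal critical value 1.96, is (-2.08, 11.10). With only 15 residual degrees of freedom, the t critical value (about 2.13) is more appropriate and gives a somewhat wider interval. Either way, the interval contains zero, which agrees with the test result.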
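A t-based interval can be obtained either by hand or with `confint()`. The sketch below assumes the same data as above and uses unrounded estimates, so its output will differ slightly from an interval built from the rounded values 4.51 and 3.36; `t_crit`, `ci_t`, and `ci_builtin` are names introduced here for illustration.

```r
# Data as defined earlier in the document
height <- c(174,183,176,169,183,186,178,175,172,179,171,184,200,195,176,176,190)
pflow <- c(733,572,500,738,616,787,866,670,550,660,575,577,783,625,470,642,856)
model1 <- lm(pflow ~ 1 + height)

coefs <- summary(model1)$coefficients
est <- coefs["height", "Estimate"]
se <- coefs["height", "Std. Error"]

# Critical value from the t distribution with n - 2 degrees of freedom
t_crit <- qt(0.975, df = df.residual(model1))

# Interval computed by hand
ci_t <- c(est - t_crit * se, est + t_crit * se)

# Same interval from the built-in helper
ci_builtin <- confint(model1, "height", level = 0.95)

print(ci_t)
print(ci_builtin)
```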
new.height <- data.frame(height = c(170))
predict(model1, newdata = new.height, interval = "confidence")
## fit lwr upr
## 1 613.0274 517.5522 708.5026
The 95% CI for the mean peak flow rate of men who are 170 cm tall is (517.55, 708.50).
predict(model1, newdata = new.height, interval = "prediction")
## fit lwr upr
## 1 613.0274 349.6655 876.3893
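The 95% PI for the peak flow rate of an individual man who is 170 cm tall is (349.67, 876.39). It is wider than the confidence interval because it also accounts for the variation of individual observations around the mean.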
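The difference between the two intervals comes down to one extra term in the standard error. The sketch below rebuilds both intervals from the textbook formulas and compares them with `predict()`; it assumes the same data as above, and the names `x0`, `se_mean`, `se_pred`, `ci_manual`, and `pi_manual` are introduced here for illustration.

```r
# Data as defined earlier in the document
height <- c(174,183,176,169,183,186,178,175,172,179,171,184,200,195,176,176,190)
pflow <- c(733,572,500,738,616,787,866,670,550,660,575,577,783,625,470,642,856)
data1 <- data.frame(height, pflow)
model1 <- lm(pflow ~ 1 + height, data1)

x0 <- 170                                   # height at which we predict
n <- nrow(data1)
s <- summary(model1)$sigma                  # residual standard error
sxx <- sum((height - mean(height))^2)       # sum of squares of the predictor
fit0 <- unname(coef(model1)[1] + coef(model1)[2] * x0)
t_crit <- qt(0.975, df = n - 2)

# Standard error of the estimated mean response at x0
se_mean <- s * sqrt(1 / n + (x0 - mean(height))^2 / sxx)
# Standard error for a new individual observation at x0 (note the extra 1)
se_pred <- s * sqrt(1 + 1 / n + (x0 - mean(height))^2 / sxx)

ci_manual <- c(fit0 - t_crit * se_mean, fit0 + t_crit * se_mean)
pi_manual <- c(fit0 - t_crit * se_pred, fit0 + t_crit * se_pred)

# Built-in results for comparison
new.height <- data.frame(height = x0)
ci_pred <- predict(model1, newdata = new.height, interval = "confidence")
pi_pred <- predict(model1, newdata = new.height, interval = "prediction")

print(rbind(ci_manual, pi_manual))
```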
round(cor(height, pflow), 2)
## [1] 0.33
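The correlation coefficient between height and peak flow rate is 0.33, which indicates a weak to moderate positive linear association. Its square, about 0.11, matches the Multiple R-squared of the simple regression model.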
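In simple linear regression the correlation coefficient is tied directly to the fitted model: its square equals R-squared, and the slope equals the correlation rescaled by the ratio of standard deviations. A minimal sketch of both identities, assuming the same data as above (`r`, `r_squared`, and `slope_from_r` are illustrative names):

```r
# Data as defined earlier in the document
height <- c(174,183,176,169,183,186,178,175,172,179,171,184,200,195,176,176,190)
pflow <- c(733,572,500,738,616,787,866,670,550,660,575,577,783,625,470,642,856)
model1 <- lm(pflow ~ 1 + height)

r <- cor(height, pflow)

# Identity 1: r^2 equals the Multiple R-squared of the simple regression
r_squared <- summary(model1)$r.squared

# Identity 2: slope = r * sd(y) / sd(x)
slope_from_r <- r * sd(pflow) / sd(height)
slope_from_lm <- unname(coef(model1)["height"])

print(c(r = r, r_sq = r^2, R2 = r_squared,
        slope_from_r = slope_from_r, slope_from_lm = slope_from_lm))
```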
prostate <- read.csv("prostate.txt", header = FALSE, sep = "")
names(prostate) <- c("obs","lcavol","lweight","age","lbph","svi","lcp",
"gleason","pgg45","lpsa")
head(prostate,3)
## obs lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 1 -0.5798185 2.7695 50 -1.386294 0 -1.38629 6 0 -0.43078
## 2 2 -0.9942523 3.3196 58 -1.386294 0 -1.38629 6 0 -0.16252
## 3 3 -0.5108256 2.6912 74 -1.386294 0 -1.38629 7 20 -0.16252
Using lpsa as the dependent variable, the multiple regression model can be written out as: \[lpsa_{i}=\beta_{0}+\beta_{1}(lcavol_{i})+\beta_{2}(lweight_{i})+\beta_{3}(age_{i})+\beta_{4}(lbph_{i})+\beta_{5}(svi_{i})+\beta_{6}(lcp_{i})+\beta_{7}(gleason_{i})+\beta_{8}(pgg45_{i})+e_{i}\] The multiple regression model assumes that the errors are independent and normally distributed with a mean of zero and a constant variance.
Fitting the model
model2 <- lm(lpsa ~ 1 + lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45, data = prostate)
Looking at the result summary
summary(model2)
##
## Call:
## lm(formula = lpsa ~ 1 + lcavol + lweight + age + lbph + svi +
## lcp + gleason + pgg45, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7331 -0.3713 -0.0170 0.4141 1.6381
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.669337 1.296387 0.516 0.60693
## lcavol 0.587022 0.087920 6.677 2.11e-09 ***
## lweight 0.454467 0.170012 2.673 0.00896 **
## age -0.019637 0.011173 -1.758 0.08229 .
## lbph 0.107054 0.058449 1.832 0.07040 .
## svi 0.766157 0.244309 3.136 0.00233 **
## lcp -0.105474 0.091013 -1.159 0.24964
## gleason 0.045142 0.157465 0.287 0.77503
## pgg45 0.004525 0.004421 1.024 0.30886
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7084 on 88 degrees of freedom
## Multiple R-squared: 0.6548, Adjusted R-squared: 0.6234
## F-statistic: 20.86 on 8 and 88 DF, p-value: < 2.2e-16
Interpreting a single slope: For a unit increase in svi, the lpsa value is expected to increase by 0.77, holding constant all other predictors in the model.
Interpretation of the omnibus F test: The null hypothesis is that all eight slopes are equal to zero, that is, none of the predictors is linearly related to the outcome, lpsa. The test statistic is F = 20.86 on 8 and 88 degrees of freedom, and the p-value is less than 2.2e-16. At the 0.05 level, we reject the null hypothesis. We conclude that at least one of the slopes differs from zero, so at least one of the predictors significantly predicts the outcome.
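The omnibus F statistic can be recovered from R-squared and the degrees of freedom reported in the summary. The sketch below uses only the rounded values printed above (R-squared 0.6548, 8 predictors, 88 residual degrees of freedom), so the result agrees with the printed F only up to rounding; the variable names are illustrative.

```r
# Values copied from the summary output of model2
r2 <- 0.6548        # Multiple R-squared
p <- 8              # number of predictors (numerator df)
df_resid <- 88      # residual degrees of freedom (denominator df)

# F = (explained variation per predictor) / (unexplained variation per residual df)
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df1 = p, df2 = df_resid, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```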
Significance of the slope associated with gleason: The null hypothesis is that, holding constant all other predictors in the model, the true slope for gleason is zero (there is no linear relationship between gleason and lpsa). With a t value of 0.29 and a p-value of 0.77, we fail to reject the null hypothesis at the 0.05 level. We conclude that there is not sufficient evidence that gleason is a significant predictor of lpsa when the other predictors in the model are held constant.
The proportion of variation in lpsa that can be explained by the predictors is 0.65 (Multiple R-squared = 0.6548).
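Because R-squared never decreases when predictors are added, the summary also reports an adjusted version that penalizes the number of predictors. The sketch below recomputes it from the printed values; the sample size of 97 is inferred from 88 residual degrees of freedom plus 9 estimated coefficients, and the variable names are illustrative.

```r
# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Multiple regression on the prostate data: n = 88 + 9 = 97, p = 8 predictors
adj_multiple <- adjusted_r2(r2 = 0.6548, n = 97, p = 8)

# Simple regression on the peak flow data: n = 17, p = 1 predictor
adj_simple <- adjusted_r2(r2 = 0.1071, n = 17, p = 1)

print(c(multiple = adj_multiple, simple = adj_simple))
```

The penalty is small for the prostate model but cuts the peak flow model's R-squared by more than half, which reflects its small sample size and weak fit.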