suppressPackageStartupMessages(library(tidyverse))
library(ggplot2)
Q.1 - Error assumptions: The errors are independent, with zero expected value and constant variance and are normally distributed.
Q.2 - View data
fev <- read.table("http://people.bath.ac.uk/trs35/fev.txt", header = T)
glimpse(fev)
## Observations: 654
## Variables: 7
## $ seqnbr <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ subjid <int> 301, 451, 501, 642, 901, 1701, 1752, 1753, 1901, 1951, ...
## $ age <int> 9, 8, 7, 9, 9, 8, 6, 6, 8, 9, 6, 8, 8, 8, 8, 7, 5, 6, 9...
## $ fev <dbl> 1.708, 1.724, 1.720, 1.558, 1.895, 2.336, 1.919, 1.415,...
## $ height <dbl> 57.0, 67.5, 54.5, 53.0, 57.0, 61.0, 58.0, 56.0, 58.5, 6...
## $ sex <int> 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 1, 2, 1...
## $ smoke <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
#View(fev)
We have numerical data. Need to recode the smoke variable as it’s currently numeric but we want it to be a factor.
fev$smoke <- factor(fev$smoke,labels = c("yes", "no"))
#View(fev$smoke)
Q.3 - Plot fev vs. age coloured by smoking status.
fev %>% ggplot(aes(x = age, y = fev, colour= smoke)) +
geom_point(aes(shape = smoke, color = smoke), alpha = 0.6, position = "jitter") +
scale_shape_manual(values = c(17, 19)) +
scale_color_manual(values = c("#1380A1", "#FAAB18")) +
ggtitle("Fev vs. Age") +
theme_bw() +
theme(legend.position = "top", plot.title = element_text(face = "bold", hjust = 0.5)) + labs(x = "Age", y = "Fev")
Q.4 - Fit a linear model
fevLm <- lm(fev ~ age, data = fev)
summary(fevLm)
##
## Call:
## lm(formula = fev ~ age, data = fev)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.57539 -0.34567 -0.04989 0.32124 2.12786
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.431648 0.077895 5.541 4.36e-08 ***
## age 0.222041 0.007518 29.533 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5675 on 652 degrees of freedom
## Multiple R-squared: 0.5722, Adjusted R-squared: 0.5716
## F-statistic: 872.2 on 1 and 652 DF, p-value: < 2.2e-16
We conclude that there is a significant, positive association between age and fev. For every additional year of age, the fev score increases by 0.22 L/s.
Q.5 - Make another plot with geom_smooth line fitted.
fev %>% ggplot(aes(x = height, y = fev, colour= smoke)) +
geom_point(aes(shape = smoke, color = smoke), alpha = 0.6, position = "jitter") +
geom_smooth(method = "lm", se = F) +
scale_shape_manual(values = c(17, 19)) +
scale_color_manual(values = c("#1380A1", "#FAAB18")) +
ggtitle("Fev vs. Height") +
theme_bw() +
theme(legend.position = "top", plot.title = element_text(face = "bold", hjust = 0.5)) +
labs(x = "Height", y = "Fev")
Q.6 - Diagonstic plots
fevLm <- lm(fev ~ age, data = fev)
plot(fitted(fevLm), residuals(fevLm),
xlab = "Fitted", ylab = "Residuals")
abline(h=0)
As per the plot we can see some pattern in the error. If there was constant variance we would expect there to be no pattern. HK may be an issue.
Q.7 - Relevel the fev data
fevMod <- lm(fev ~ age+smoke, data = fev)
summary(fevMod)
##
## Call:
## lm(formula = fev ~ age + smoke, data = fev)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6653 -0.3564 -0.0508 0.3494 2.0894
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.158378 0.131003 1.209 0.22712
## age 0.230605 0.008184 28.176 < 2e-16 ***
## smokeno 0.208995 0.080745 2.588 0.00986 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5651 on 651 degrees of freedom
## Multiple R-squared: 0.5766, Adjusted R-squared: 0.5753
## F-statistic: 443.3 on 2 and 651 DF, p-value: < 2.2e-16
The relationship between FEV and age is more or less unchanged with the previous result. Non-smokers have 0.21 L/s greater lung capacity compared to smokers to smokers on average.