For the attached two datasets, fit a simple linear model and check all assumptions using at least one graphical and one statistical method.
Then, identify and implement the correct transformation (with graphical and analytic support; could be identity!), following up with appropriate checks of relevant assumptions.
The first dataset is on a company’s sales (in thousands of dollars) over a period of time (years). The second dataset is regarding the concentration of a liquid over a period of time (hours).
sale <- c(98.0,135.0,162.0,178.0,221.0,232.0,283.0,300.0,374.0,395.0)
year <- c(0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0)
mod1 <- lm(sale~year)
e1 <- resid(mod1)
stde1 <- scale(e1)
plot(sale~year)
#### Independence:
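The observations are yearly measurements taken in sequence, and I am treating them as independent. Because time-ordered data can be autocorrelated, this is an assumption rather than something the time ordering guarantees.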
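One rough way to probe the independence assumption, sketched here in base R, is to compute the Durbin-Watson statistic and the lag-1 autocorrelation of the residuals by hand. This is only an illustration, assuming the rows are in time order. The names `dw` and `r1` are introduced for this example, and with 10 observations neither number is conclusive.

```r
# Sales data from above, refit so the block runs on its own
sale <- c(98.0,135.0,162.0,178.0,221.0,232.0,283.0,300.0,374.0,395.0)
year <- c(0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0)
e <- resid(lm(sale ~ year))

# Durbin-Watson statistic: values near 2 are consistent with no lag-1 autocorrelation
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 sample autocorrelation of the residuals
r1 <- sum(e[-1] * e[-length(e)]) / sum(e^2)

print(c(durbin_watson = dw, lag1_autocorrelation = r1))
```

A statistic well below 2 would hint at positive autocorrelation, and one well above 2 at negative autocorrelation.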
# Graphical:
hist(e1, breaks=10, probability = TRUE)
x1 <- seq(-20,20,length=100)
y1 <- dnorm(x1, mean= 0, sd=sd(e1))
lines(x1,y1, col="red")
# Statistical:
ks.test(stde1,"pnorm", 0,1)
##
## One-sample Kolmogorov-Smirnov test
##
## data: stde1
## D = 0.14985, p-value = 0.9543
## alternative hypothesis: two-sided
ks.test(e1,"pnorm",0,sd(e1))
##
## One-sample Kolmogorov-Smirnov test
##
## data: e1
## D = 0.14985, p-value = 0.9543
## alternative hypothesis: two-sided
shapiro.test(e1)
##
## Shapiro-Wilk normality test
##
## data: e1
## W = 0.96123, p-value = 0.7998
All three tests give p-values greater than 0.05, so there is no evidence against normality of the residuals. The most reassuring result is the Shapiro-Wilk test, with a p-value of 0.7998. The histogram does raise some red flags because of its gaps and tall bars, but with only 10 residuals that is not surprising, and the taller bars roughly make up for the empty bins.
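One caveat with the Kolmogorov-Smirnov checks is that the mean and standard deviation are estimated from the same residuals being tested, which tends to make the reported p-value too large. A hedged alternative is to simulate the null distribution of D, in the spirit of a Lilliefors correction. The sketch below is an illustration under stated assumptions, not part of the original analysis: `ks_stat`, `d_obs`, `d_sim`, and `p_mc` are names made up here, and the seed and the 2000 replicates are arbitrary choices.

```r
# Sales data from above, refit so the block runs on its own
sale <- c(98.0,135.0,162.0,178.0,221.0,232.0,283.0,300.0,374.0,395.0)
year <- c(0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0)
e1 <- resid(lm(sale ~ year))

# KS distance between standardized values and the standard normal
ks_stat <- function(x) {
  z <- (x - mean(x)) / sd(x)
  unname(ks.test(z, "pnorm")$statistic)
}

set.seed(1)  # arbitrary seed, only for reproducibility
d_obs <- ks_stat(e1)

# Null distribution: residuals from the same design when the errors really are normal
d_sim <- replicate(2000, ks_stat(resid(lm(rnorm(length(year)) ~ year))))

# Monte Carlo p-value: share of simulated distances at least as large as the observed one
p_mc <- mean(d_sim >= d_obs)
print(c(D = d_obs, p_monte_carlo = p_mc))
```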
# Graphical:
plot(fitted(mod1),resid(mod1))
plot(mod1, which=1:2)
# Statistical:
lmtest::bptest(mod1)
##
## studentized Breusch-Pagan test
##
## data: mod1
## BP = 2.3972, df = 1, p-value = 0.1216
car::boxCox(mod1)
The HOV assumption also appears to hold: the Breusch-Pagan p-value (0.1216) is greater than 0.05, so there is no evidence against constant variance (homoscedasticity).
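To see what the studentized Breusch-Pagan test is measuring, the statistic can be sketched in base R as n times the R-squared from regressing the squared residuals on the predictor. This is an illustrative reconstruction, with `aux`, `bp`, and `p_bp` as names introduced here. It is meant to show the idea behind the `bptest` result rather than replace it.

```r
# Sales data from above, refit so the block runs on its own
sale <- c(98.0,135.0,162.0,178.0,221.0,232.0,283.0,300.0,374.0,395.0)
year <- c(0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0)
mod <- lm(sale ~ year)

# Auxiliary regression: do the squared residuals change with the predictor?
e_sq <- resid(mod)^2
aux <- lm(e_sq ~ year)

# Test statistic: n * R^2, compared to a chi-square with 1 df (one predictor)
bp <- length(e_sq) * summary(aux)$r.squared
p_bp <- pchisq(bp, df = 1, lower.tail = FALSE)
print(c(BP = bp, p_value = p_bp))
```

A small p-value would mean the spread of the residuals depends on `year`, which is what heteroscedasticity looks like in this setup.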
qqnorm(e1)
qqline(e1)
Linearity appears to hold. Just in case, I ran the Box-Cox procedure, which suggested a power of about 0.5, meaning the response (Y) would be transformed with the sqrt function.
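The Box-Cox plot works by profiling the log-likelihood over a grid of powers and picking the power that maximizes it. A minimal base R sketch of that idea follows, assuming `sale` is the response. `boxcox_loglik`, `lambdas`, and `lambda_hat` are names introduced here for illustration, and the grid from -2 to 2 is an arbitrary choice.

```r
# Sales data from above, so the block runs on its own
sale <- c(98.0,135.0,162.0,178.0,221.0,232.0,283.0,300.0,374.0,395.0)
year <- c(0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0)

# Profile log-likelihood of the Box-Cox model for a given power lambda
boxcox_loglik <- function(lambda, y, x) {
  # Box-Cox transform: log(y) at lambda = 0, (y^lambda - 1) / lambda otherwise
  y_t <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
  rss <- sum(resid(lm(y_t ~ x))^2)
  n <- length(y)
  # Gaussian log-likelihood term plus the Jacobian of the transformation
  -n / 2 * log(rss / n) + (lambda - 1) * sum(log(y))
}

lambdas <- seq(-2, 2, by = 0.05)
ll <- sapply(lambdas, boxcox_loglik, y = sale, x = year)

# Power with the highest profile log-likelihood on the grid
lambda_hat <- lambdas[which.max(ll)]
print(lambda_hat)
```

A maximizer near 0.5 would point to a square root, near 0 to a log, and near 1 to no transformation at all.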
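# Note: as written, this regresses sqrt(year) on sale; a square-root transformation of the response would be lm(sqrt(sale) ~ year).
mod1_1 <- lm(sqrt(year) ~ sale)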
e11 <- resid(mod1_1)
# Linearity
qqnorm(e11)
qqline(e11)
# Normality
shapiro.test(e11)
##
## Shapiro-Wilk normality test
##
## data: e11
## W = 0.88325, p-value = 0.1422
# HOV
lmtest::bptest(mod1_1)
##
## studentized Breusch-Pagan test
##
## data: mod1_1
## BP = 1.6013, df = 1, p-value = 0.2057
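After the transformation, the Shapiro-Wilk and Breusch-Pagan p-values are both still above 0.05, so no assumption is formally rejected. Normality is the weakest: the Shapiro-Wilk p-value drops from 0.7998 to 0.1422, and the points in the Q-Q plot stray a little further from the line than before.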
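Since the Box-Cox suggestion concerns the response, a square-root fit with `sale` as the response can be sketched as below. This is an assumption about the intended model rather than the fit whose output is shown above, and `mod_sqrt`, `e_sqrt`, and `pred_sales` are names introduced for this example. No output is quoted because it would have to be rerun.

```r
# Sales data from above, so the block runs on its own
sale <- c(98.0,135.0,162.0,178.0,221.0,232.0,283.0,300.0,374.0,395.0)
year <- c(0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0)

# Square-root transformation applied to the response, with year as the predictor
mod_sqrt <- lm(sqrt(sale) ~ year)
e_sqrt <- resid(mod_sqrt)

# Normality check on the residuals of the transformed model
print(shapiro.test(e_sqrt))

# Fitted values squared to get back to the original scale (thousands of dollars)
pred_sales <- fitted(mod_sqrt)^2
print(round(cbind(year, sale, pred_sales), 1))
```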
conc <- c(0.07,0.09,0.08,0.16,0.17,0.21,0.49,0.58,0.53,1.22,1.15,1.07,2.84,2.57,3.10)
hours <- c(9.0,9.0,9.0,7.0,7.0,7.0,5.0,5.0,5.0,3.0,3.0,3.0,1.0,1.0,1.0)
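# Note: hours is the response in this fit; the later transformed model uses conc as the response.
mod2 <- lm(hours ~ conc)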
e2 <- resid(mod2)
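The measurements are taken at fixed times (three readings at each of five hours), and I am treating them as independent; the time ordering alone does not guarantee this.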
# Graphical:
hist(e2, breaks=10, probability = TRUE)
x2 <- seq(-10,10,length=100)
y2 <- dnorm(x2, mean= 0, sd=sd(e2))
lines(x2,y2, col="red")
hist(rnorm(length(e2),0,sd(e2)), breaks=10, probability = TRUE)
x2 <- seq(-10,10,length=100)
y2 <- dnorm(x2, mean= 0, sd=sd(e2))
lines(x2,y2, col="red")
# Statistical:
shapiro.test(e2)
##
## Shapiro-Wilk normality test
##
## data: e2
## W = 0.90193, p-value = 0.1019
The Shapiro-Wilk p-value (0.1019) is above 0.05, so normality is not rejected. The second histogram, drawn from random normal values with the same standard deviation, follows the red curve pretty well, while the histogram of the actual residuals looks a little off. I'm not entirely sure whether that is enough to reject normality.
# Graphical:
plot(fitted(mod2),resid(mod2))
plot(mod2, which=1:2)
# Statistical:
lmtest::bptest(mod2)
##
## studentized Breusch-Pagan test
##
## data: mod2
## BP = 0.60246, df = 1, p-value = 0.4376
car::boxCox(mod2)
The residual plots point to a problem with this assumption, but the Breusch-Pagan test above does not confirm it: its p-value is 0.4376, which is greater than 0.05.
qqnorm(e2)
qqline(e2)
Linearity also fails: concentration falls off along a curve rather than a straight line over time.
mod2_1 <- lm(log(conc) ~ hours)
e21 <- resid(mod2_1)
# Normality
shapiro.test(e21)
##
## Shapiro-Wilk normality test
##
## data: e21
## W = 0.97357, p-value = 0.9069
# HOV
lmtest::bptest(mod2_1)
##
## studentized Breusch-Pagan test
##
## data: mod2_1
## BP = 1.3876, df = 1, p-value = 0.2388
# Linearity
qqnorm(e21)
qqline(e21)
After transforming with a log function, all assumptions appear to pass. The only one still slightly questionable is linearity, and even that is not too far off from what it should be.
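The log model says log(conc) is linear in hours, which on the original scale means concentration changes by a constant factor each hour. A short sketch of reading that off the fit follows, assuming the data above. `mod_log`, `rate_per_hour`, and `conc_hat` are names introduced here for illustration.

```r
# Concentration data from above, so the block runs on its own
conc <- c(0.07,0.09,0.08,0.16,0.17,0.21,0.49,0.58,0.53,1.22,1.15,1.07,2.84,2.57,3.10)
hours <- c(9.0,9.0,9.0,7.0,7.0,7.0,5.0,5.0,5.0,3.0,3.0,3.0,1.0,1.0,1.0)

mod_log <- lm(log(conc) ~ hours)
b <- coef(mod_log)

# Multiplicative change in concentration for each additional hour
rate_per_hour <- exp(b[["hours"]])

# Fitted values exponentiated to get back to the concentration scale
conc_hat <- exp(fitted(mod_log))

print(rate_per_hour)
print(round(cbind(hours, conc, conc_hat), 3))
```

Comparing `conc` with `conc_hat` row by row gives a quick sense of how well the exponential-decay shape fits on the original scale.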