Module: 208251 Regression Analysis and Non-Parametric Statistics
Instructor: Wisunee Puggard
Affiliation: Department of Statistics, Faculty of Science, Chiang Mai University.
Objectives: Students are able to use the R language to analyse data using multiple linear regression:
- Perform linear regression analysis
- Check the normality assumption
- Check the constant variance assumption
- Check the independence (autocorrelation) assumption
- Deal with invalid model assumptions
Note that the path to your data file will differ from mine, so change it to the folder where you saved Hollywood_Movies.csv.
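One way to avoid typing an absolute path is to keep the CSV file in the working directory and read it by name. This is a minimal sketch that assumes this layout; `csv_name` is an illustrative variable, not part of the lab code.

```r
# Assumption: Hollywood_Movies.csv has been saved in the current working directory
csv_name <- "Hollywood_Movies.csv"

# Show where R is currently looking for files
cat("Working directory:", getwd(), "\n")

if (file.exists(csv_name)) {
  # Read the file by its name, relative to the working directory
  data <- read.csv(csv_name, header = TRUE)
  str(data)
} else {
  # Nothing is read; move the file, call setwd(), or give the full path instead
  message("File not found in the working directory.")
}
```

If the message appears, `setwd()` can point R at the right folder before reading.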
# read the data
data = read.csv('/Users/wisuneepuggard/Desktop/LAB208251/Hollywood_Movies.csv',
                header=TRUE)
# name variables for convenience
y = data$Y #first year box office receipts
x1 = data$X1 #total production costs
x2 = data$X2 #total promotional costs
x3 = data$X3 #total book sales
y
## [1] 85.1 106.3 50.2 130.6 54.8 30.3 79.4 91.0 135.4 89.3
x1
## [1] 8.5 12.9 5.2 10.7 3.1 3.5 9.2 9.0 15.1 10.2
x2
## [1] 5.100000 5.800000 2.100000 8.399999 2.900000 1.200000 3.700000 7.600000
## [9] 7.700000 4.500000
x3
## [1] 4.7 8.8 15.1 12.2 10.6 3.5 9.7 5.9 20.8 7.9
fit <- lm(y~x1+x2+x3)
summary(fit)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.4384 -3.1695 0.8499 3.5134 9.6207
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6760 6.7602 1.135 0.2995
## x1 3.6616 1.1178 3.276 0.0169 *
## x2 7.6211 1.6573 4.598 0.0037 **
## x3 0.8285 0.5394 1.536 0.1754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.541 on 6 degrees of freedom
## Multiple R-squared: 0.9668, Adjusted R-squared: 0.9502
## F-statistic: 58.22 on 3 and 6 DF, p-value: 7.913e-05
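For a report, the estimates, their confidence intervals and the overall F test can be pulled out of the fitted object instead of being read off the printed summary. This is a minimal sketch that assumes the ten observations printed above are the complete data set; the vectors are retyped so the block runs without the CSV file.

```r
# Data retyped from the printed output above (assumed to be the full data set)
y  <- c(85.1, 106.3, 50.2, 130.6, 54.8, 30.3, 79.4, 91.0, 135.4, 89.3)
x1 <- c(8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2)
x2 <- c(5.1, 5.8, 2.1, 8.399999, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5)
x3 <- c(4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.8, 7.9)

fit <- lm(y ~ x1 + x2 + x3)

# Point estimates of the regression coefficients
est <- coef(fit)

# 95% confidence interval for each coefficient
ci <- confint(fit, level = 0.95)

# Overall F test of H0: beta1 = beta2 = beta3 = 0
fstat <- summary(fit)$fstatistic
f_pvalue <- pf(fstat["value"], fstat["numdf"], fstat["dendf"],
               lower.tail = FALSE)

print(round(cbind(estimate = est, ci), 4))
print(unname(f_pvalue))
```

In the summary above, x1 and x2 are significant at the 0.05 level, while x3 (p-value 0.1754) is not.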
We need the residuals of this model to check the normality assumption.
# get the residuals with resid()
res = resid(fit)
# check the histogram and the normal Q-Q plot
par(mfrow=c(1,2))
hist(res,main="Histogram of Residuals")
qqnorm(res)
qqline(res)
# Shapiro-Wilk test of normality
shapiro.test(res)
##
## Shapiro-Wilk normality test
##
## data: res
## W = 0.97047, p-value = 0.8952
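The test result can be turned into an explicit decision at the 0.05 level. The null hypothesis of the Shapiro-Wilk test is that the residuals come from a normal distribution. This sketch retypes the printed data, assumed to be the full data set; `alpha` and `decision` are illustrative names.

```r
# Data retyped from the printed output above (assumed to be the full data set)
y  <- c(85.1, 106.3, 50.2, 130.6, 54.8, 30.3, 79.4, 91.0, 135.4, 89.3)
x1 <- c(8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2)
x2 <- c(5.1, 5.8, 2.1, 8.399999, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5)
x3 <- c(4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.8, 7.9)

fit <- lm(y ~ x1 + x2 + x3)
alpha <- 0.05

# H0: the residuals are normally distributed
sw <- shapiro.test(resid(fit))

# Reject H0 only when the p-value falls below alpha
decision <- if (sw$p.value < alpha) {
  "reject H0: the residuals do not look normal"
} else {
  "do not reject H0: no evidence against normality"
}

cat("W =", round(sw$statistic, 5), " p-value =", round(sw$p.value, 4), "\n")
cat("Decision:", decision, "\n")
```

With the p-value of 0.8952 reported above, the normality assumption is not rejected.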
# check constant variance: plot the residuals against the fitted values
fitted.y = predict(fit)
plot(fitted.y,res)
# Breusch-Pagan test of constant variance
#install.packages("lmtest")
library("lmtest")
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(fit)
##
## studentized Breusch-Pagan test
##
## data: fit
## BP = 7.7416, df = 3, p-value = 0.05167
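The studentized statistic reported by bptest() can be reproduced in base R, which shows what the test measures: n times the R-squared of an auxiliary regression of the squared residuals on the same predictors. This is a sketch on the retyped data, assumed to be the full data set; `aux`, `bp` and `bp_pvalue` are illustrative names.

```r
# Data retyped from the printed output above (assumed to be the full data set)
y  <- c(85.1, 106.3, 50.2, 130.6, 54.8, 30.3, 79.4, 91.0, 135.4, 89.3)
x1 <- c(8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2)
x2 <- c(5.1, 5.8, 2.1, 8.399999, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5)
x3 <- c(4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.8, 7.9)

fit <- lm(y ~ x1 + x2 + x3)
n <- length(y)

# Auxiliary regression: squared residuals on the original predictors
sq_res <- resid(fit)^2
aux <- lm(sq_res ~ x1 + x2 + x3)

# Studentized Breusch-Pagan statistic: n times R-squared of the auxiliary fit
bp <- n * summary(aux)$r.squared

# Under H0 (constant variance) the statistic is approximately chi-squared,
# with one degree of freedom per predictor
bp_df <- 3
bp_pvalue <- pchisq(bp, df = bp_df, lower.tail = FALSE)

cat("BP =", round(bp, 4), " df =", bp_df, " p-value =", round(bp_pvalue, 5), "\n")
```

The p-value of 0.05167 is only just above 0.05, so constant variance is not rejected at that level, though the margin is small.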
# Durbin-Watson test of independence (autocorrelation)
library("lmtest")
dwtest(fit)
##
## Durbin-Watson test
##
## data: fit
## DW = 1.9713, p-value = 0.5083
## alternative hypothesis: true autocorrelation is greater than 0
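The DW statistic itself is simple to compute by hand: the sum of squared differences between successive residuals divided by the sum of squared residuals. This sketch uses the retyped data, assumed to be the full data set, and reproduces only the statistic, not the p-value that dwtest() computes.

```r
# Data retyped from the printed output above (assumed to be the full data set)
y  <- c(85.1, 106.3, 50.2, 130.6, 54.8, 30.3, 79.4, 91.0, 135.4, 89.3)
x1 <- c(8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2)
x2 <- c(5.1, 5.8, 2.1, 8.399999, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5)
x3 <- c(4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.8, 7.9)

fit <- lm(y ~ x1 + x2 + x3)
e <- resid(fit)

# Durbin-Watson statistic: squared successive differences over squared residuals
dw <- sum(diff(e)^2) / sum(e^2)

cat("DW =", round(dw, 4), "\n")
```

Values close to 2 indicate little first-order autocorrelation. Values well below 2 point to positive autocorrelation, which is the alternative tested above.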
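# the four standard diagnostic plots for the fitted model
par(mfrow = c(2,2))
plot(fit)
# a second example: simple linear regression of height on body mass
height = c(176, 154, 138, 196, 132, 176, 181, 169, 150, 250)
bodymass = c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)
fit2 <- lm(height~bodymass)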
par(mfrow = c(2,2))
plot(fit2)
shapiro.test(resid(fit2))
##
## Shapiro-Wilk normality test
##
## data: resid(fit2)
## W = 0.71587, p-value = 0.001372
dwtest(fit2)
##
## Durbin-Watson test
##
## data: fit2
## DW = 1.3237, p-value = 0.1873
## alternative hypothesis: true autocorrelation is greater than 0
bptest(fit2)
##
## studentized Breusch-Pagan test
##
## data: fit2
## BP = 0.42597, df = 1, p-value = 0.514
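Of the three tests on fit2, only the Shapiro-Wilk test rejects at the 0.05 level. The diagnostic plots label the most extreme points. One way to locate such a point numerically is to look at the standardized residuals, as in this sketch, where `std_res` and `worst` are illustrative names.

```r
height   <- c(176, 154, 138, 196, 132, 176, 181, 169, 150, 250)
bodymass <- c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)

fit2 <- lm(height ~ bodymass)

# Standardized residuals: residuals scaled by their estimated standard deviation
std_res <- rstandard(fit2)

# Index of the observation with the largest absolute standardized residual
worst <- unname(which.max(abs(std_res)))

print(round(std_res, 2))
cat("Largest |standardized residual| at observation", worst,
    "(height =", height[worst], ", bodymass =", bodymass[worst], ")\n")
```

One observation with an unusually large residual is enough to pull the residual distribution away from normality in a sample of ten, which suggests why the normality test fails here.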
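The residuals of fit2 fail the Shapiro-Wilk normality test (p-value = 0.001372 < 0.05), so we transform the response and re-fit the model.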
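# Box-Cox transformation
library(car)
## Loading required package: carData
# profile log-likelihood of lambda for the original model
b <- boxCox(lm(height~bodymass))
# choose the lambda with the highest log-likelihood
lambda <- b$x[which.max(b$y)]
# power-transform the response (use log(height) instead if lambda is 0)
t_height <- height^lambda
# re-fit the model with the transformed response
fit3 <- lm(t_height~bodymass)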
par(mfrow = c(2,2))
plot(fit3)
shapiro.test(resid(fit3))
##
## Shapiro-Wilk normality test
##
## data: resid(fit3)
## W = 0.92767, p-value = 0.4254
dwtest(fit3)
##
## Durbin-Watson test
##
## data: fit3
## DW = 1.4414, p-value = 0.246
## alternative hypothesis: true autocorrelation is greater than 0
bptest(fit3)
##
## studentized Breusch-Pagan test
##
## data: fit3
## BP = 0.31966, df = 1, p-value = 0.5718
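To see what boxCox() does, the profile log-likelihood can be computed in base R over a grid of lambda values. This is a rough sketch, not the car implementation: the grid from -2 to 2 in steps of 0.1 is an assumption and is coarser than the interpolated curve used above, so the selected lambda may differ slightly. `bc_transform` and `bc_loglik` are hypothetical helper names.

```r
height   <- c(176, 154, 138, 196, 132, 176, 181, 169, 150, 250)
bodymass <- c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)

# Box-Cox transform of a positive response; lambda = 0 corresponds to log(y)
bc_transform <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Profile log-likelihood of lambda, up to an additive constant
bc_loglik <- function(lambda, y, x) {
  n <- length(y)
  rss <- sum(resid(lm(bc_transform(y, lambda) ~ x))^2)
  -n / 2 * log(rss / n) + (lambda - 1) * sum(log(y))
}

# Evaluate the log-likelihood on a grid and keep the maximiser
lambdas <- seq(-2, 2, by = 0.1)
loglik <- sapply(lambdas, bc_loglik, y = height, x = bodymass)
lambda_hat <- lambdas[which.max(loglik)]

# Re-fit with the transformed response and re-check normality
fit_bc <- lm(bc_transform(height, lambda_hat) ~ bodymass)
cat("Selected lambda:", lambda_hat, "\n")
print(shapiro.test(resid(fit_bc)))
```

The scaled form (y^lambda - 1) / lambda is a linear function of y^lambda, so it gives the same residual diagnostics as the plain power used above, and it handles lambda = 0 through the log.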
You must submit the following on Mango (see the deadline there):
- an R file with your code, and
- a handwritten answer sheet.
A teacher of a statistics class wants to examine the relationship between her students' midterm and final scores. The midterm and final scores of 20 randomly chosen students are shown below.
| Student | Midterm | Final |
|---|---|---|
| 1 | 77 | 82 |
| 2 | 50 | 66 |
| 3 | 71 | 78 |
| 4 | 72 | 43 |
| 5 | 80 | 56 |
| 6 | 93 | 85 |
| 7 | 95 | 99 |
| 8 | 98 | 98 |
| 9 | 66 | 68 |
| 10 | 55 | 45 |
| 11 | 63 | 75 |
| 12 | 62 | 45 |
| 13 | 51 | 71 |
| 14 | 65 | 60 |
| 15 | 90 | 87 |
| 16 | 74 | 68 |
| 17 | 48 | 56 |
| 18 | 67 | 85 |
| 19 | 67 | 60 |
| 20 | 85 | 90 |
Use R to:
1. Obtain the least-squares regression line for predicting the final score from the midterm score.
2. At a significance level of 0.05, check the validity of the model using ANOVA. Perform the 4-step process (state the hypotheses, give the test statistic and P-value, and state your conclusion).
3. Check the normality assumption at a significance level of 0.05. (Perform the 4-step process.)
4. Check the constant variance assumption at a significance level of 0.05. (Perform the 4-step process.)
5. Check the independence (autocorrelation) assumption at a significance level of 0.05. (Perform the 4-step process.)
6. Is any model assumption violated? If so, fix it using a Box-Cox transformation.
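As a starting point for the tasks above, the score table can be typed in as a data frame. This is only a data-entry sketch: the names `scores`, `student`, `midterm` and `final` are illustrative choices, and the analysis itself is left to you.

```r
# Scores typed in from the table above
scores <- data.frame(
  student = 1:20,
  midterm = c(77, 50, 71, 72, 80, 93, 95, 98, 66, 55,
              63, 62, 51, 65, 90, 74, 48, 67, 67, 85),
  final   = c(82, 66, 78, 43, 56, 85, 99, 98, 68, 45,
              75, 45, 71, 60, 87, 68, 56, 85, 60, 90)
)

# Quick checks that the data were entered as intended
str(scores)
summary(scores[, c("midterm", "final")])

# Scatter plot of the relationship to be modelled
plot(scores$midterm, scores$final,
     xlab = "Midterm score", ylab = "Final score")
```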