Module: 208251 Regression Analysis and Non-Parametric
Statistics
Instructor: Parichart Pattarapanitchai
Affiliation: Department of Statistics, Faculty of Science, Chiang Mai
University.
Students will be able to use the R language to analyse data with multiple
linear regression:
1. Perform linear regression analysis
2. Check the normality assumption
3. Check the constant variance assumption
4. Check the independence (autocorrelation) assumption
5. Deal with invalid model assumptions
# data
library(RCurl) # load the 'RCurl' package
Hollywood_Movies <- read.csv(text=getURL("https://raw.githubusercontent.com/Paripai/208251/main/Hollywood_Movies.csv"))
# name the variables for convenience
y <- Hollywood_Movies$Y # first-year box office receipts
x1 <- Hollywood_Movies$X1 # total production costs
x2 <- Hollywood_Movies$X2 # total promotional costs
x3 <- Hollywood_Movies$X3 # total book sales
y
## [1] 85.1 106.3 50.2 130.6 54.8 30.3 79.4 91.0 135.4 89.3
x1
## [1] 8.5 12.9 5.2 10.7 3.1 3.5 9.2 9.0 15.1 10.2
x2
## [1] 5.100000 5.800000 2.100000 8.399999 2.900000 1.200000 3.700000 7.600000
## [9] 7.700000 4.500000
x3
## [1] 4.7 8.8 15.1 12.2 10.6 3.5 9.7 5.9 20.8 7.9
fit <- lm(y~x1+x2+x3)
summary(fit)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.4384 -3.1695 0.8499 3.5134 9.6207
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6760 6.7602 1.135 0.2995
## x1 3.6616 1.1178 3.276 0.0169 *
## x2 7.6211 1.6573 4.598 0.0037 **
## x3 0.8285 0.5394 1.536 0.1754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.541 on 6 degrees of freedom
## Multiple R-squared: 0.9668, Adjusted R-squared: 0.9502
## F-statistic: 58.22 on 3 and 6 DF, p-value: 7.913e-05
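As a minimal sketch, the same model can be fitted without a network connection by retyping the ten observations printed above into a data frame. The data frame name `movies` and the object `fit_df` are illustrative names, not part of the original script, and because the values are copied from printed output, the results may differ from the CSV-based fit in the last digits.

```r
# Observations retyped from the printed vectors above (assumption: the
# printed values match the CSV to the displayed precision).
movies <- data.frame(
  y  = c(85.1, 106.3, 50.2, 130.6, 54.8, 30.3, 79.4, 91.0, 135.4, 89.3),
  x1 = c(8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2),
  x2 = c(5.1, 5.8, 2.1, 8.399999, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5),
  x3 = c(4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.8, 7.9)
)

# Fit the multiple linear regression using the data argument
fit_df <- lm(y ~ x1 + x2 + x3, data = movies)

coef(fit_df)                   # estimated regression coefficients
confint(fit_df, level = 0.95)  # 95% confidence intervals for the coefficients
```

Passing `data = movies` keeps the variables together in one object instead of as separate vectors in the workspace.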
To check the normality assumption, we first need the residuals of this model.
# get the residuals with resid()
res <- resid(fit)
# check the histogram and the normal Q-Q plot
par(mfrow=c(1,2))
hist(res,main="Histogram of Residuals")
qqnorm(res)
qqline(res)
# Shapiro-Wilk test of normality
shapiro.test(res)
##
## Shapiro-Wilk normality test
##
## data: res
## W = 0.97047, p-value = 0.8952
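The decision step for the Shapiro-Wilk test can be sketched as below. The significance level `alpha = 0.05` and the object names `sw` and `decision` are assumptions made for illustration. The data are retyped from the printed vectors above, so the p-value may differ slightly from the output shown.

```r
# Data retyped from the printed output above (illustrative copy)
y  <- c(85.1, 106.3, 50.2, 130.6, 54.8, 30.3, 79.4, 91.0, 135.4, 89.3)
x1 <- c(8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2)
x2 <- c(5.1, 5.8, 2.1, 8.399999, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5)
x3 <- c(4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.8, 7.9)
res <- resid(lm(y ~ x1 + x2 + x3))

# H0: the residuals are normally distributed
# H1: the residuals are not normally distributed
sw    <- shapiro.test(res)
alpha <- 0.05  # assumed significance level

# Reject H0 only when the p-value is below alpha
decision <- if (sw$p.value < alpha) "reject H0" else "fail to reject H0"
cat("W =", round(sw$statistic, 4),
    "p-value =", round(sw$p.value, 4),
    "->", decision, "\n")
```

Failing to reject H0 means the data give no evidence against normality; it does not prove that the residuals are normal.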
# plot the residuals against the fitted values
fitted.y <- predict(fit)
plot(fitted.y,res)
# Breusch-Pagan test of constant variance
#install.packages("lmtest")
library("lmtest")
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'lmtest'
## The following object is masked from 'package:RCurl':
##
## reset
bptest(fit)
##
## studentized Breusch-Pagan test
##
## data: fit
## BP = 7.7416, df = 3, p-value = 0.05167
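To show what the studentized Breusch-Pagan statistic measures, here is a base-R sketch that computes it by hand as n times the R-squared of an auxiliary regression of the squared residuals on the predictors. The names `aux`, `bp_stat`, `bp_df` and `bp_p` are illustrative, and the data are retyped from the printed vectors above, so the value may differ slightly from the `bptest()` output.

```r
# Data retyped from the printed output above (illustrative copy)
y  <- c(85.1, 106.3, 50.2, 130.6, 54.8, 30.3, 79.4, 91.0, 135.4, 89.3)
x1 <- c(8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2)
x2 <- c(5.1, 5.8, 2.1, 8.399999, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5)
x3 <- c(4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.8, 7.9)
fit <- lm(y ~ x1 + x2 + x3)

n  <- length(y)
e2 <- resid(fit)^2  # squared residuals

# Auxiliary regression: do the predictors explain the squared residuals?
aux <- lm(e2 ~ x1 + x2 + x3)

# Studentized (Koenker) Breusch-Pagan statistic: n * R^2 of the auxiliary fit
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1  # number of predictors
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p.value = bp_p)
```

A large statistic (small p-value) indicates that the residual variance changes with the predictors, which is evidence against constant variance.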
# Durbin-Watson test for autocorrelation
library("lmtest")
dwtest(fit)
##
## Durbin-Watson test
##
## data: fit
## DW = 1.9713, p-value = 0.5083
## alternative hypothesis: true autocorrelation is greater than 0
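The Durbin-Watson statistic itself is simple to compute by hand: it is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below uses base R only. The name `dw_stat` is illustrative, and the data are retyped from the printed vectors above. It reproduces only the statistic, not the p-value reported by `dwtest()`.

```r
# Data retyped from the printed output above (illustrative copy)
y  <- c(85.1, 106.3, 50.2, 130.6, 54.8, 30.3, 79.4, 91.0, 135.4, 89.3)
x1 <- c(8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2)
x2 <- c(5.1, 5.8, 2.1, 8.399999, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5)
x3 <- c(4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.8, 7.9)
fit <- lm(y ~ x1 + x2 + x3)

e <- resid(fit)  # residuals in observation order

# DW = sum over t of (e_t - e_{t-1})^2, divided by sum over t of e_t^2
dw_stat <- sum(diff(e)^2) / sum(e^2)
dw_stat
```

The statistic lies between 0 and 4. Values near 2 suggest no first-order autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation.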
par(mfrow = c(2,2))
plot(fit)
height = c(176, 154, 138, 196, 132, 176, 181, 169, 150, 250)
bodymass = c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)
fit2 <- lm(height~bodymass)
par(mfrow = c(2,2))
plot(fit2)
shapiro.test(resid(fit2))
##
## Shapiro-Wilk normality test
##
## data: resid(fit2)
## W = 0.71587, p-value = 0.001372
dwtest(fit2)
##
## Durbin-Watson test
##
## data: fit2
## DW = 1.3237, p-value = 0.1873
## alternative hypothesis: true autocorrelation is greater than 0
bptest(fit2)
##
## studentized Breusch-Pagan test
##
## data: fit2
## BP = 0.42597, df = 1, p-value = 0.514
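Try transforming the response and re-fitting the model.
# Box-Cox transformation
library(car)
## Loading required package: carData
b <- boxCox(lm(height~bodymass)) # profile log-likelihood over a grid of lambda values
lambda <- b$x[which.max(b$y)] # lambda with the highest log-likelihood
t_height <- height^lambda # power transform; use log(height) instead if lambda is 0
fit3 <- lm(t_height~bodymass)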
par(mfrow = c(2,2))
plot(fit3)
shapiro.test(resid(fit3))
##
## Shapiro-Wilk normality test
##
## data: resid(fit3)
## W = 0.92767, p-value = 0.4254
dwtest(fit3)
##
## Durbin-Watson test
##
## data: fit3
## DW = 1.4414, p-value = 0.246
## alternative hypothesis: true autocorrelation is greater than 0
bptest(fit3)
##
## studentized Breusch-Pagan test
##
## data: fit3
## BP = 0.31966, df = 1, p-value = 0.5718
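To make the choice of lambda less of a black box, the following base-R sketch computes the Box-Cox profile log-likelihood directly for the height and body mass data. The helper names `box_cox` and `profile_loglik` and the grid from -2 to 2 in steps of 0.1 are assumptions for illustration. `boxCox()` from the car package may use a finer grid, so its chosen lambda can differ slightly.

```r
height   <- c(176, 154, 138, 196, 132, 176, 181, 169, 150, 250)
bodymass <- c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)

# Box-Cox transform of a positive response; the log is the limit as lambda -> 0
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Profile log-likelihood of lambda, up to an additive constant:
# -n/2 * log(RSS / n) + (lambda - 1) * sum(log(y))
profile_loglik <- function(lambda) {
  n   <- length(height)
  rss <- sum(resid(lm(box_cox(height, lambda) ~ bodymass))^2)
  -n / 2 * log(rss / n) + (lambda - 1) * sum(log(height))
}

lambdas    <- seq(-2, 2, by = 0.1)  # assumed search grid
loglik     <- sapply(lambdas, profile_loglik)
lambda_hat <- lambdas[which.max(loglik)]
lambda_hat

# Re-fit with the transformed response and re-check normality
fit_bc <- lm(box_cox(height, lambda_hat) ~ bodymass)
shapiro.test(resid(fit_bc))
```

The form `(y^lambda - 1) / lambda` is a linear rescaling of `y^lambda`, so both versions of the transformed response lead to the same conclusions in the residual checks.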
You must submit:
1. An R file with your code, and
2. A handwritten answer sheet.
Submit both on MS Teams; see the deadline there.
A teacher of a Statistics class wants to examine the relationship between the final and midterm scores of her students. The midterm and final scores of 20 randomly chosen students are shown below.
| Student | Midterm | Final |
|---|---|---|
| 1 | 77 | 82 |
| 2 | 50 | 66 |
| 3 | 71 | 78 |
| 4 | 72 | 43 |
| 5 | 80 | 56 |
| 6 | 93 | 85 |
| 7 | 95 | 99 |
| 8 | 98 | 98 |
| 9 | 66 | 68 |
| 10 | 55 | 45 |
| 11 | 63 | 75 |
| 12 | 62 | 45 |
| 13 | 51 | 71 |
| 14 | 65 | 60 |
| 15 | 90 | 87 |
| 16 | 74 | 68 |
| 17 | 48 | 56 |
| 18 | 67 | 85 |
| 19 | 67 | 60 |
| 20 | 85 | 90 |
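One possible way to enter the table into R is sketched below. The data frame name `scores` and its column names are illustrative choices, not a required format.

```r
# Midterm and final scores of the 20 students, copied from the table above
scores <- data.frame(
  student = 1:20,
  midterm = c(77, 50, 71, 72, 80, 93, 95, 98, 66, 55,
              63, 62, 51, 65, 90, 74, 48, 67, 67, 85),
  final   = c(82, 66, 78, 43, 56, 85, 99, 98, 68, 45,
              75, 45, 71, 60, 87, 68, 56, 85, 60, 90)
)

str(scores)   # check the structure: 20 rows, 3 columns
head(scores)  # inspect the first few rows
```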
Use the R language to:
1. Obtain the least-squares regression line for predicting the final score from the midterm score.
2. At a significance level of 0.05, check the validity of the model using ANOVA. Perform the 4-step process (state the hypotheses, give the test statistic and P-value, and state your conclusion).
3. Check the normality assumption at a significance level of 0.05. (Perform the 4-step process.)
4. Check the constant variance assumption at a significance level of 0.05. (Perform the 4-step process.)
5. Check the independence (autocorrelation) assumption at a significance level of 0.05. (Perform the 4-step process.)
6. Is any model assumption violated? If so, fix it using a Box-Cox transformation.