Module: 208251 Regression Analysis and Non-Parametric Statistics
Instructor: Parichart Pattarapanitchai
Affiliation: Department of Statistics, Faculty of Science, Chiang Mai University.

Objectives

Students will be able to use the R language to analyse data with multiple linear regression:
1. Perform a linear regression analysis
2. Check the normality assumption
3. Check the constant variance assumption
4. Check the independence (autocorrelation) assumption
5. Deal with invalid model assumptions

Exercise I

1. Import data into R.

# Data
library(RCurl)  # load the 'RCurl' package, which provides getURL()
Hollywood_Movies <- read.csv(text = getURL("https://raw.githubusercontent.com/Paripai/208251/main/Hollywood_Movies.csv"))

# Name the variables for convenience
y  <- Hollywood_Movies$Y   # first-year box office receipts
x1 <- Hollywood_Movies$X1  # total production costs
x2 <- Hollywood_Movies$X2  # total promotional costs
x3 <- Hollywood_Movies$X3  # total book sales
y

##  [1]  85.1 106.3  50.2 130.6  54.8  30.3  79.4  91.0 135.4  89.3

x1

##  [1]  8.5 12.9  5.2 10.7  3.1  3.5  9.2  9.0 15.1 10.2

x2

##  [1] 5.100000 5.800000 2.100000 8.399999 2.900000 1.200000 3.700000 7.600000
##  [9] 7.700000 4.500000

x3

##  [1]  4.7  8.8 15.1 12.2 10.6  3.5  9.7  5.9 20.8  7.9

2. Perform linear regression analysis

fit <- lm(y~x1+x2+x3)
summary(fit)

## 
## Call:
## lm(formula = y ~ x1 + x2 + x3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4384  -3.1695   0.8499   3.5134   9.6207 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   7.6760     6.7602   1.135   0.2995   
## x1            3.6616     1.1178   3.276   0.0169 * 
## x2            7.6211     1.6573   4.598   0.0037 **
## x3            0.8285     0.5394   1.536   0.1754   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.541 on 6 degrees of freedom
## Multiple R-squared:  0.9668, Adjusted R-squared:  0.9502 
## F-statistic: 58.22 on 3 and 6 DF,  p-value: 7.913e-05
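
In the summary above, x1 and x2 are significant at the 0.05 level, while x3 is not (p-value = 0.1754). As a minimal sketch of how the fitted model can be used further, the block below re-enters the data from the printed vectors (with the fourth x2 value typed as 8.4), then extracts the coefficients, their confidence intervals, and a prediction. The data frame `movies` and the values in `new_film` are hypothetical names and numbers chosen only for illustration.

```r
# Data copied from the printed output above (x2[4], shown as 8.399999, is entered as 8.4)
y  <- c(85.1, 106.3, 50.2, 130.6, 54.8, 30.3, 79.4, 91.0, 135.4, 89.3)
x1 <- c(8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2)
x2 <- c(5.1, 5.8, 2.1, 8.4, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5)
x3 <- c(4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.8, 7.9)
movies <- data.frame(y, x1, x2, x3)  # 'movies' is an illustrative name

fit <- lm(y ~ x1 + x2 + x3, data = movies)

coef(fit)                   # point estimates of the regression coefficients
confint(fit, level = 0.95)  # 95% confidence intervals for the coefficients

# Prediction for one new observation (hypothetical predictor values)
new_film <- data.frame(x1 = 10, x2 = 5, x3 = 8)
predict(fit, newdata = new_film, interval = "prediction")
```

Passing a data frame through `data =` keeps the variable names attached to the model, which is what allows `predict()` to accept `newdata`.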

3. Check Normality Assumptions

The normality assumption concerns the error terms of the model, so we extract the residuals of the fitted model and check whether they look normally distributed.

#get residual from function resid()
res = resid(fit)
#check histogram and normal q-q plot
par(mfrow=c(1,2))
hist(res,main="Histogram of Residuals")
qqnorm(res)
qqline(res)

#test it!
shapiro.test(res)

## 
##  Shapiro-Wilk normality test
## 
## data:  res
## W = 0.97047, p-value = 0.8952
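
The null hypothesis of the Shapiro-Wilk test is that the residuals come from a normal distribution. Here the p-value (0.8952) is larger than 0.05, so normality is not rejected. The sketch below shows one possible way to turn the test result into a decision in code; `alpha` and `decision` are illustrative names, and the data are re-entered from the printed vectors in step 1.

```r
# Data copied from the printed output in step 1 (x2[4] entered as 8.4)
y  <- c(85.1, 106.3, 50.2, 130.6, 54.8, 30.3, 79.4, 91.0, 135.4, 89.3)
x1 <- c(8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2)
x2 <- c(5.1, 5.8, 2.1, 8.4, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5)
x3 <- c(4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.8, 7.9)

fit <- lm(y ~ x1 + x2 + x3)
res <- resid(fit)

alpha <- 0.05              # significance level
sw <- shapiro.test(res)    # H0: the residuals are normally distributed

# Reject H0 only when the p-value falls below alpha
decision <- if (sw$p.value < alpha) "reject H0" else "fail to reject H0"
cat("W =", round(unname(sw$statistic), 4),
    " p-value =", round(sw$p.value, 4),
    " decision:", decision, "\n")
```

With only 10 observations the test has little power, so the histogram and Q-Q plot remain worth checking alongside the p-value.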

4. Check Constant Variance Assumptions

#check the scatter plot
fitted.y = predict(fit)
plot(fitted.y,res)

#test it!
#install.packages("lmtest")
library("lmtest")

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## 
## Attaching package: 'lmtest'

## The following object is masked from 'package:RCurl':
## 
##     reset

bptest(fit)

## 
##  studentized Breusch-Pagan test
## 
## data:  fit
## BP = 7.7416, df = 3, p-value = 0.05167
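
The null hypothesis of the Breusch-Pagan test is constant error variance. The p-value here (0.05167) is only just above 0.05, so the null hypothesis is not rejected, but the result is borderline. To show where the statistic comes from, the sketch below reproduces the studentized version in base R: regress the squared residuals on the predictors and multiply the R-squared of that auxiliary regression by n. The names `aux`, `bp_stat` and `bp_p` are illustrative, and the data are re-entered from the printed vectors in step 1.

```r
# Data copied from the printed output in step 1 (x2[4] entered as 8.4)
y  <- c(85.1, 106.3, 50.2, 130.6, 54.8, 30.3, 79.4, 91.0, 135.4, 89.3)
x1 <- c(8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2)
x2 <- c(5.1, 5.8, 2.1, 8.4, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5)
x3 <- c(4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.8, 7.9)

fit <- lm(y ~ x1 + x2 + x3)

# Auxiliary regression: squared residuals on the same predictors
res_sq <- resid(fit)^2
aux <- lm(res_sq ~ x1 + x2 + x3)

n <- length(res_sq)
bp_stat <- n * summary(aux)$r.squared          # studentized BP statistic = n * R^2
bp_df   <- length(coef(aux)) - 1               # number of predictors
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p.value = bp_p)
```

The values should agree with the `bptest(fit)` output above up to rounding.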

5. Check Independence (Autocorrelation) Assumptions

# Test it! Durbin-Watson test (dwtest() comes from the 'lmtest' package)
library("lmtest")
dwtest(fit)

## 
##  Durbin-Watson test
## 
## data:  fit
## DW = 1.9713, p-value = 0.5083
## alternative hypothesis: true autocorrelation is greater than 0
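
A Durbin-Watson statistic close to 2 indicates no first-order autocorrelation in the residuals; here DW = 1.9713 and the p-value (0.5083) is well above 0.05, so independence is not rejected. The statistic depends on the order of the observations, so it is most meaningful when the rows have a natural ordering such as time. The sketch below computes the statistic directly from its definition; `dw_stat` is an illustrative name, and the data are re-entered from the printed vectors in step 1.

```r
# Data copied from the printed output in step 1 (x2[4] entered as 8.4)
y  <- c(85.1, 106.3, 50.2, 130.6, 54.8, 30.3, 79.4, 91.0, 135.4, 89.3)
x1 <- c(8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2)
x2 <- c(5.1, 5.8, 2.1, 8.4, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5)
x3 <- c(4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.8, 7.9)

fit <- lm(y ~ x1 + x2 + x3)
res <- resid(fit)

# DW = sum of squared successive differences / sum of squared residuals
dw_stat <- sum(diff(res)^2) / sum(res^2)
dw_stat
```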

6. R diagnostic plots

par(mfrow = c(2,2))
plot(fit)
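
By default `plot()` on an `lm` object draws four panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage. As a small sketch, the `which` argument selects the panels one at a time, and `hatvalues()` and `cooks.distance()` return the numbers behind the last panel. The data are re-entered from the printed vectors in step 1.

```r
# Data copied from the printed output in step 1 (x2[4] entered as 8.4)
y  <- c(85.1, 106.3, 50.2, 130.6, 54.8, 30.3, 79.4, 91.0, 135.4, 89.3)
x1 <- c(8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2)
x2 <- c(5.1, 5.8, 2.1, 8.4, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5)
x3 <- c(4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.8, 7.9)

fit <- lm(y ~ x1 + x2 + x3)

op <- par(mfrow = c(2, 2))  # 2 x 2 grid; keep the old settings in 'op'
plot(fit, which = 1)        # Residuals vs Fitted: linearity, constant variance
plot(fit, which = 2)        # Normal Q-Q: normality of residuals
plot(fit, which = 3)        # Scale-Location: constant variance
plot(fit, which = 5)        # Residuals vs Leverage: influential observations
par(op)                     # restore the previous graphics settings

lev  <- hatvalues(fit)      # leverage of each observation
cook <- cooks.distance(fit) # Cook's distance of each observation
round(cbind(leverage = lev, cooks_d = cook), 3)
```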

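7. Dealing with Invalid Model Assumptions

This example uses a small data set of height and body mass with 10 observations, in which one height (250) lies far from the others. We fit a simple linear regression and run the same diagnostic checks as above.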

height = c(176, 154, 138, 196, 132, 176, 181, 169, 150, 250)
bodymass = c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)
fit2 <- lm(height~bodymass)
par(mfrow = c(2,2))
plot(fit2)

shapiro.test(resid(fit2))

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(fit2)
## W = 0.71587, p-value = 0.001372

dwtest(fit2)

## 
##  Durbin-Watson test
## 
## data:  fit2
## DW = 1.3237, p-value = 0.1873
## alternative hypothesis: true autocorrelation is greater than 0

bptest(fit2)

## 
##  studentized Breusch-Pagan test
## 
## data:  fit2
## BP = 0.42597, df = 1, p-value = 0.514

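The Shapiro-Wilk test rejects normality of the residuals (p-value = 0.001372 < 0.05), while the Durbin-Watson and Breusch-Pagan tests give no evidence against independence or constant variance. Try transforming the response and re-fitting the model.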

# Box-Cox transformation
library(car)

## Loading required package: carData

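b <- boxCox(lm(height~bodymass))  # profile log-likelihood over a grid of lambda values

lambda <- b$x[which.max(b$y)]     # lambda with the largest log-likelihood
t_height <- height^lambda         # power-transform the response (this form assumes lambda is not 0)
fit3 <- lm(t_height~bodymass)     # re-fit the model on the transformed response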
par(mfrow = c(2,2))
plot(fit3)

shapiro.test(resid(fit3))

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(fit3)
## W = 0.92767, p-value = 0.4254

dwtest(fit3)

## 
##  Durbin-Watson test
## 
## data:  fit3
## DW = 1.4414, p-value = 0.246
## alternative hypothesis: true autocorrelation is greater than 0

bptest(fit3)

## 
##  studentized Breusch-Pagan test
## 
## data:  fit3
## BP = 0.31966, df = 1, p-value = 0.5718
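
After the transformation, the Shapiro-Wilk p-value rises to 0.4254, so normality is no longer rejected, and the other two tests still show no violation. To illustrate what the Box-Cox step does, the sketch below computes the profile log-likelihood by hand in base R and picks the lambda that maximises it. The helper `boxcox_loglik()`, the grid step of 0.01, and the other variable names are assumptions made for illustration; `boxCox()` evaluates the likelihood on its own grid, so the selected value may differ slightly.

```r
height   <- c(176, 154, 138, 196, 132, 176, 181, 169, 150, 250)
bodymass <- c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)

# Profile log-likelihood of the Box-Cox model for one value of lambda
# (hypothetical helper, up to an additive constant)
boxcox_loglik <- function(lambda, y, x) {
  n <- length(y)
  # Box-Cox transform; the limit as lambda -> 0 is log(y)
  z <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
  rss <- sum(resid(lm(z ~ x))^2)
  -n / 2 * log(rss / n) + (lambda - 1) * sum(log(y))
}

lambdas <- seq(-2, 2, by = 0.01)  # grid of candidate values (assumed range)
ll <- sapply(lambdas, boxcox_loglik, y = height, x = bodymass)

lambda_hat <- lambdas[which.max(ll)]  # maximiser on the grid
lambda_hat

# Approximate 95% interval: lambdas whose log-likelihood is within
# qchisq(0.95, 1) / 2 of the maximum
keep <- ll > max(ll) - qchisq(0.95, df = 1) / 2
range(lambdas[keep])
```

Note that when the selected lambda is negative, `height^lambda` decreases as height increases, so the sign of the slope in the re-fitted model is reversed relative to the original scale.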

Assignment Lab3

You must submit:

1. An R file with your code, and
2. A handwritten answer sheet.

Submit both on MS Teams; see the deadline there.

A teacher of a Statistics class wants to examine the relationship between the final and midterm scores of her students. The midterm and final scores of 20 randomly chosen students are shown below.

The midterm and final scores of 20 students
Student	Midterm	Final
1	77	82
2	50	66
3	71	78
4	72	43
5	80	56
6	93	85
7	95	99
8	98	98
9	66	68
10	55	45
11	63	75
12	62	45
13	51	71
14	65	60
15	90	87
16	74	68
17	48	56
18	67	85
19	67	60
20	85	90

Use the R language to:

1. Obtain the least-squares regression line for predicting the final score from the midterm score.
2. At a significance level of 0.05, check the validity of the model using ANOVA. Perform the 4-step process (state the hypotheses, give the test statistic and p-value, and state your conclusion).
3. Check the normality assumption at a significance level of 0.05. (Perform the 4-step process.)
4. Check the constant variance assumption at a significance level of 0.05. (Perform the 4-step process.)
5. Check the independence (autocorrelation) assumption at a significance level of 0.05. (Perform the 4-step process.)
6. Is any model assumption violated? If so, fix it using a Box-Cox transformation.
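
As a possible starting point for the tasks above, the sketch below enters the score table into a data frame and draws a scatter plot. It covers data entry only; the name `scores` is an illustrative choice, and the analysis itself is left to you.

```r
# Scores copied from the table of 20 students
scores <- data.frame(
  Student = 1:20,
  Midterm = c(77, 50, 71, 72, 80, 93, 95, 98, 66, 55,
              63, 62, 51, 65, 90, 74, 48, 67, 67, 85),
  Final   = c(82, 66, 78, 43, 56, 85, 99, 98, 68, 45,
              75, 45, 71, 60, 87, 68, 56, 85, 60, 90)
)

str(scores)   # check the structure: 20 rows, 3 columns
head(scores)  # check the first rows against the table

# Scatter plot of the response (Final) against the predictor (Midterm)
plot(Final ~ Midterm, data = scores,
     main = "Final vs Midterm scores")
```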

208251_LAB3_Model diagnostics
