Module: 208251 Regression Analysis and Non-Parametric Statistics

Instructor: Wisunee Puggard

Affiliation: Department of Statistics, Faculty of Science, Chiang Mai University.

Objectives: Students will be able to use the R language to analyse data with multiple linear regression:

Perform linear regression analysis
Check Normality Assumptions
Check Constant Variance Assumptions
Check Independence (Autocorrelation) Assumptions
Deal with Invalid Model Assumptions

Exercise I:

1. Import data into R.

Note that your working directory (the folder where the data file is stored) will be different from mine, so change the file path to match your own computer.

#data
data = read.csv('/Users/wisuneepuggard/Desktop/LAB208251/Hollywood_Movies.csv'
,header=TRUE)
# name variables for convenience
y = data$Y #first year box office receipts
x1 = data$X1 #total production costs
x2 = data$X2 #total promotional costs
x3 = data$X3 #total book sales
y

##  [1]  85.1 106.3  50.2 130.6  54.8  30.3  79.4  91.0 135.4  89.3

x1

##  [1]  8.5 12.9  5.2 10.7  3.1  3.5  9.2  9.0 15.1 10.2

x2

##  [1] 5.100000 5.800000 2.100000 8.399999 2.900000 1.200000 3.700000 7.600000
##  [9] 7.700000 4.500000

x3

##  [1]  4.7  8.8 15.1 12.2 10.6  3.5  9.7  5.9 20.8  7.9
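If the CSV file is not at hand, the values printed above are enough to rebuild the data set. The following is a minimal sketch, assuming the printed values are the complete data. The data frame name `movies` is an assumption (it is not part of the original script), and the fourth value of `x2`, printed as 8.399999, is entered as 8.4.

```r
# Rebuild the Hollywood movies data from the printed output.
# Assumption: x2[4] (printed as 8.399999) is entered as 8.4.
movies <- data.frame(
  y  = c(85.1, 106.3, 50.2, 130.6, 54.8, 30.3, 79.4, 91.0, 135.4, 89.3), # first year box office receipts
  x1 = c(8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2),          # total production costs
  x2 = c(5.1, 5.8, 2.1, 8.4, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5),              # total promotional costs
  x3 = c(4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.8, 7.9)           # total book sales
)

# Quick sanity checks before modelling
str(movies)          # structure: variable names and types
summary(movies)      # ranges and means of each variable
anyNA(movies)        # TRUE would indicate missing values
```

With the data in a data frame, the regression can also be written with a `data` argument, for example `lm(y ~ x1 + x2 + x3, data = movies)`, instead of creating separate vectors.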

2. Perform linear regression analysis

fit <- lm(y~x1+x2+x3)
summary(fit)

## 
## Call:
## lm(formula = y ~ x1 + x2 + x3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4384  -3.1695   0.8499   3.5134   9.6207 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   7.6760     6.7602   1.135   0.2995   
## x1            3.6616     1.1178   3.276   0.0169 * 
## x2            7.6211     1.6573   4.598   0.0037 **
## x3            0.8285     0.5394   1.536   0.1754   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.541 on 6 degrees of freedom
## Multiple R-squared:  0.9668, Adjusted R-squared:  0.9502 
## F-statistic: 58.22 on 3 and 6 DF,  p-value: 7.913e-05
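Beyond the summary table, the fitted object can be queried directly for estimates, confidence intervals and predictions. This is a small sketch, assuming the data are rebuilt from the values printed in step 1. The `movies` data frame and the `new_movie` values are hypothetical and chosen only for illustration.

```r
# Data rebuilt from the printed output (x2[4] entered as 8.4, an assumption)
movies <- data.frame(
  y  = c(85.1, 106.3, 50.2, 130.6, 54.8, 30.3, 79.4, 91.0, 135.4, 89.3),
  x1 = c(8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2),
  x2 = c(5.1, 5.8, 2.1, 8.4, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5),
  x3 = c(4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.8, 7.9)
)
fit <- lm(y ~ x1 + x2 + x3, data = movies)

# Point estimates of the intercept and the three slopes
coef(fit)

# 95% confidence intervals for each coefficient
ci <- confint(fit, level = 0.95)
ci

# Prediction for a hypothetical new movie (values are made up for illustration)
new_movie <- data.frame(x1 = 10, x2 = 5, x3 = 8)
pred <- predict(fit, newdata = new_movie, interval = "prediction", level = 0.95)
pred   # columns: fit (point prediction), lwr and upr (prediction limits)
```

Each slope is read as the expected change in `y` for a one-unit increase in that predictor while the other two predictors are held fixed.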

3. Check Normality Assumptions.

We need the residuals of this model to check whether they are normally distributed.

#get residual from function resid()
res = resid(fit)
#check histogram and normal q-q plot
par(mfrow=c(1,2))
hist(res,main="Histogram of Residuals")
qqnorm(res)
qqline(res)

#test it!
shapiro.test(res)

## 
##  Shapiro-Wilk normality test
## 
## data:  res
## W = 0.97047, p-value = 0.8952
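The Shapiro-Wilk test has the null hypothesis that the residuals come from a normal distribution, so a large p-value means there is no evidence against normality. The decision rule can be sketched as follows, assuming the data are rebuilt from the printed values in step 1 and a significance level of 0.05 (the names `movies`, `alpha` and `decision` are illustrative).

```r
# Data rebuilt from the printed output (x2[4] entered as 8.4, an assumption)
movies <- data.frame(
  y  = c(85.1, 106.3, 50.2, 130.6, 54.8, 30.3, 79.4, 91.0, 135.4, 89.3),
  x1 = c(8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2),
  x2 = c(5.1, 5.8, 2.1, 8.4, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5),
  x3 = c(4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.8, 7.9)
)
fit <- lm(y ~ x1 + x2 + x3, data = movies)

# H0: residuals are normally distributed; H1: they are not
sw <- shapiro.test(resid(fit))
alpha <- 0.05

# Reject H0 only when the p-value falls below alpha
decision <- if (sw$p.value < alpha) {
  "Reject H0: the residuals do not look normal"
} else {
  "Do not reject H0: no evidence against normality"
}
cat("W =", round(sw$statistic, 4), " p-value =", round(sw$p.value, 4), "\n", decision, "\n")
```

With only 10 observations the test has little power, so it is worth reading it together with the histogram and the normal Q-Q plot.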

4. Check Constant Variance Assumptions

#check the scatter plot
fitted.y = predict(fit)
plot(fitted.y,res)

#test it!
#install.packages("lmtest")
library("lmtest")

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

bptest(fit)

## 
##  studentized Breusch-Pagan test
## 
## data:  fit
## BP = 7.7416, df = 3, p-value = 0.05167
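To see what the studentized Breusch-Pagan statistic measures, it can be computed by hand in base R: regress the squared residuals on the predictors and multiply the R-squared of that auxiliary regression by the sample size. This is a sketch of that idea, assuming the data are rebuilt from the printed values in step 1; the names `movies`, `aux`, `bp_stat` and `bp_p` are illustrative and are not taken from `lmtest`.

```r
# Data rebuilt from the printed output (x2[4] entered as 8.4, an assumption)
movies <- data.frame(
  y  = c(85.1, 106.3, 50.2, 130.6, 54.8, 30.3, 79.4, 91.0, 135.4, 89.3),
  x1 = c(8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2),
  x2 = c(5.1, 5.8, 2.1, 8.4, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5),
  x3 = c(4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.8, 7.9)
)
fit <- lm(y ~ x1 + x2 + x3, data = movies)

# Auxiliary regression: squared residuals on the same predictors
movies$res2 <- resid(fit)^2
aux <- lm(res2 ~ x1 + x2 + x3, data = movies)

# Studentized (Koenker) form: n times R-squared of the auxiliary regression
n       <- nrow(movies)
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1          # number of predictors
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p.value = bp_p)
```

H0 is constant error variance. The p-value reported above (0.05167) is only just above 0.05, so H0 is not rejected at the 5% level, but the result is borderline and the residual plot deserves a careful look.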

5. Check Independence (Autocorrelation) Assumptions

# Test it! Durbin-Watson test
library("lmtest")
dwtest(fit)

## 
##  Durbin-Watson test
## 
## data:  fit
## DW = 1.9713, p-value = 0.5083
## alternative hypothesis: true autocorrelation is greater than 0
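The Durbin-Watson statistic itself is simple to compute by hand: it is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. A base-R sketch, assuming the data are rebuilt from the printed values in step 1 and that the rows are in a meaningful order (the names `movies`, `dw_stat` and `r1` are illustrative):

```r
# Data rebuilt from the printed output (x2[4] entered as 8.4, an assumption)
movies <- data.frame(
  y  = c(85.1, 106.3, 50.2, 130.6, 54.8, 30.3, 79.4, 91.0, 135.4, 89.3),
  x1 = c(8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2),
  x2 = c(5.1, 5.8, 2.1, 8.4, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5),
  x3 = c(4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.8, 7.9)
)
fit <- lm(y ~ x1 + x2 + x3, data = movies)
res <- resid(fit)
n   <- length(res)

# DW = sum of squared successive differences / sum of squared residuals
dw_stat <- sum(diff(res)^2) / sum(res^2)

# Lag-1 autocorrelation of the residuals; DW is roughly 2 * (1 - r1)
r1 <- sum(res[-1] * res[-n]) / sum(res^2)

c(DW = dw_stat, r1 = r1, approx_DW = 2 * (1 - r1))
```

Values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation. The statistic depends on the order of the rows, so it is most meaningful when the observations follow a natural sequence such as time.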

6. R diagnostic plots

par(mfrow = c(2,2))
plot(fit)
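By default `plot(fit)` draws four panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage. The quantities behind the last panel can also be inspected numerically. A short sketch, assuming the data are rebuilt from the printed values in step 1; the cut-off of twice the average leverage is a common rule of thumb, not something prescribed in this lab.

```r
# Data rebuilt from the printed output (x2[4] entered as 8.4, an assumption)
movies <- data.frame(
  y  = c(85.1, 106.3, 50.2, 130.6, 54.8, 30.3, 79.4, 91.0, 135.4, 89.3),
  x1 = c(8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2),
  x2 = c(5.1, 5.8, 2.1, 8.4, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5),
  x3 = c(4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.8, 7.9)
)
fit <- lm(y ~ x1 + x2 + x3, data = movies)

h  <- hatvalues(fit)        # leverage of each observation
cd <- cooks.distance(fit)   # influence of each observation on the fit

# The leverages always sum to the number of estimated coefficients (4 here)
sum(h)

# Rule-of-thumb flag: leverage above twice the average leverage
high_leverage <- which(h > 2 * mean(h))

# Side-by-side view, most influential observations first
diag_table <- data.frame(obs = seq_along(h), leverage = h, cooks_d = cd)
diag_table[order(-diag_table$cooks_d), ]
```

Observations that combine high leverage with a large residual are the ones that appear near or beyond the Cook's distance contours in the fourth panel.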

7. Dealing with Invalid Model Assumptions

height = c(176, 154, 138, 196, 132, 176, 181, 169, 150, 250)
bodymass = c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)
fit2 <- lm(height~bodymass)
par(mfrow = c(2,2))
plot(fit2)

shapiro.test(resid(fit2))

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(fit2)
## W = 0.71587, p-value = 0.001372

dwtest(fit2)

## 
##  Durbin-Watson test
## 
## data:  fit2
## DW = 1.3237, p-value = 0.1873
## alternative hypothesis: true autocorrelation is greater than 0

bptest(fit2)

## 
##  studentized Breusch-Pagan test
## 
## data:  fit2
## BP = 0.42597, df = 1, p-value = 0.514
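The Shapiro-Wilk test rejects normality for `fit2`, while the other two tests do not signal a problem. One way to see which observation stands out is to look at the standardized residuals. This is a minimal sketch using the same `height` and `bodymass` vectors; the names `rs` and `worst` are illustrative.

```r
height   <- c(176, 154, 138, 196, 132, 176, 181, 169, 150, 250)
bodymass <- c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)
fit2 <- lm(height ~ bodymass)

# Standardized residuals: raw residuals scaled by their estimated standard deviation
rs <- rstandard(fit2)
round(rs, 2)

# Index of the observation with the largest absolute standardized residual
worst <- which.max(abs(rs))
worst
height[worst]   # the height value of that observation
```

Here the tenth observation (height 250) has by far the largest residual, which is consistent with the heavy right tail seen in the Normal Q-Q panel.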

Try transforming the response and re-fitting the model.

# BoxCox transformation
library(car)

## Loading required package: carData

b <- boxCox(lm(height~bodymass))

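# choose the lambda with the largest profile log-likelihood
lambda <- b$x[which.max(b$y)]
# apply the power transformation to the response and re-fit
t_height <- height^lambda
fit3 <- lm(t_height~bodymass)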
par(mfrow = c(2,2))
plot(fit3)

shapiro.test(resid(fit3))

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(fit3)
## W = 0.92767, p-value = 0.4254

dwtest(fit3)

## 
##  Durbin-Watson test
## 
## data:  fit3
## DW = 1.4414, p-value = 0.246
## alternative hypothesis: true autocorrelation is greater than 0

bptest(fit3)

## 
##  studentized Breusch-Pagan test
## 
## data:  fit3
## BP = 0.31966, df = 1, p-value = 0.5718
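The idea behind the Box-Cox search can be sketched in base R: for each candidate lambda, transform the response, re-fit the model, and keep the lambda with the largest profile log-likelihood. This is a hand-rolled illustration and not the internals of `car::boxCox`; the helper names `boxcox_transform` and `profile_loglik` and the grid spacing are assumptions, so the selected lambda may differ slightly from the one used above.

```r
height   <- c(176, 154, 138, 196, 132, 176, 181, 169, 150, 250)
bodymass <- c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)

# Box-Cox transform: (y^lambda - 1) / lambda, with log(y) as the limit at lambda = 0
boxcox_transform <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Profile log-likelihood (up to a constant) of the linear model for a given lambda
profile_loglik <- function(lambda, y, x) {
  n   <- length(y)
  z   <- boxcox_transform(y, lambda)
  rss <- sum(resid(lm(z ~ x))^2)
  -n / 2 * log(rss / n) + (lambda - 1) * sum(log(y))
}

# Evaluate on a grid of candidate values (grid choice is an assumption)
grid <- seq(-2, 2, by = 0.01)
ll   <- sapply(grid, profile_loglik, y = height, x = bodymass)
lambda_hat <- grid[which.max(ll)]
lambda_hat

# Re-fit with the transformed response and re-check normality of the residuals
fit_bc <- lm(boxcox_transform(height, lambda_hat) ~ bodymass)
shapiro.test(resid(fit_bc))
```

For any non-zero lambda, `height^lambda` and `(height^lambda - 1) / lambda` differ only by a shift and a rescaling, so the Shapiro-Wilk, Durbin-Watson and Breusch-Pagan statistics are the same for both versions of the transformed response. Only the scale and sign of the coefficients change.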

LAB3 ASSIGNMENT

You must submit:

An R file with your code, and
A handwritten answer sheet

Submit both on Mango; the deadline is posted there.

A teacher of a Statistics class wants to examine the relationship between the final and midterm scores of her students. The midterm and final scores of 20 randomly chosen students are shown below.

The midterm and final scores of 20 students
Student	Midterm	Final
1	77	82
2	50	66
3	71	78
4	72	43
5	80	56
6	93	85
7	95	99
8	98	98
9	66	68
10	55	45
11	63	75
12	62	45
13	51	71
14	65	60
15	90	87
16	74	68
17	48	56
18	67	85
19	67	60
20	85	90
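The table can be typed straight into R as a data frame, which avoids any file-path issues. This sketch covers only the data entry; the data frame name `scores` and the column names are assumptions.

```r
# Midterm and final scores of the 20 students, copied from the table
scores <- data.frame(
  student = 1:20,
  midterm = c(77, 50, 71, 72, 80, 93, 95, 98, 66, 55,
              63, 62, 51, 65, 90, 74, 48, 67, 67, 85),
  final   = c(82, 66, 78, 43, 56, 85, 99, 98, 68, 45,
              75, 45, 71, 60, 87, 68, 56, 85, 60, 90)
)

# Check that the entry matches the table before doing any analysis
str(scores)
head(scores)
```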

Use the R language to:

Obtain the least-squares regression line for predicting final score from midterm score.
At a significance level of 0.05, check the validity of the model using ANOVA. Perform the 4-step process (state hypotheses, give a test statistic and P-value, and state your conclusion)
Check the normality assumption at a significance level of 0.05. (Perform the 4-step process)
Check the constant variance assumption at a significance level of 0.05. (Perform the 4-step process)
Check the independence (autocorrelation) assumption at a significance level of 0.05. (Perform the 4-step process)
Is any assumption violated? If so, fix it by using a Box-Cox transformation.
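As an illustration of the 4-step format, here is how the overall F-test could be laid out for the movie model from Exercise I (not the assignment data). It is a sketch, assuming the data are rebuilt from the printed values in step 1; the names `movies`, `f_value`, `p_value` and `conclusion` are illustrative.

```r
# Data rebuilt from the printed output of Exercise I (x2[4] entered as 8.4, an assumption)
movies <- data.frame(
  y  = c(85.1, 106.3, 50.2, 130.6, 54.8, 30.3, 79.4, 91.0, 135.4, 89.3),
  x1 = c(8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2),
  x2 = c(5.1, 5.8, 2.1, 8.4, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5),
  x3 = c(4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.8, 7.9)
)
fit <- lm(y ~ x1 + x2 + x3, data = movies)

# Step 1: hypotheses
#   H0: beta1 = beta2 = beta3 = 0   (the model is not useful)
#   H1: at least one beta_j differs from 0

# Step 2: test statistic (overall F from the ANOVA decomposition)
fs      <- summary(fit)$fstatistic        # named vector: value, numdf, dendf
f_value <- unname(fs["value"])
df1     <- as.integer(fs["numdf"])
df2     <- as.integer(fs["dendf"])

# Step 3: p-value from the upper tail of the F distribution
p_value <- pf(f_value, df1, df2, lower.tail = FALSE)

# Step 4: conclusion at the chosen significance level
alpha <- 0.05
conclusion <- if (p_value < alpha) "reject H0" else "do not reject H0"
cat(sprintf("F = %.2f on %d and %d DF, p-value = %.4g -> %s\n",
            f_value, df1, df2, p_value, conclusion))
```

The same layout (hypotheses, test statistic, p-value, conclusion) applies to the normality, constant variance and independence tests, with the test statistic taken from `shapiro.test()`, `bptest()` and `dwtest()` respectively.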


208251_LAB3_Model diagnostics