Module: 208251 Regression Analysis and Non-Parametric
Statistics
Instructor: Wisunee Puggard
Affiliation: Department of Statistics, Faculty of Science,
Chiang Mai University.
Objectives:
Students are able to use the R language to analyse data using multiple
linear regression:
perform descriptive statistics
transform a qualitative independent variable into dummy
variables
select independent variables
perform linear regression analysis and inference on regression
parameters
interpret the results
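You can download the data file from Class material in MANGO Canvas.
The data file is 208251_LAB2_LABDATA.xls. The code below reads a CSV version of it (208251_LAB2_LABDATA.csv).
Note that your working directory (the folder where the data file
is stored) will be different from mine, so adjust the file path accordingly.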
data = read.csv('/Users/wisuneepuggard/Desktop/LAB208251/208251_LAB2_LABDATA.csv',
                header=TRUE)
data
##    Store NumberOfHousehold Location SalesPrice
## 1      1               161   Street     157.27
## 2      2                99   Street      93.28
## 3      3               135   Street     136.81
## 4      4               120   Street     123.79
## 5      5               164   Street     153.51
## 6      6               221     Mall     241.74
## 7      7               179     Mall     204.54
## 8      8               204     Mall     206.71
## 9      9               214     Mall     229.78
## 10    10               101     Mall     135.22
## 11    11               231 Downtown     224.71
## 12    12               206 Downtown     195.29
## 13    13               248 Downtown     242.16
## 14    14               107 Downtown     115.21
## 15    15               205 Downtown     197.82
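If the CSV file is not at hand, the same 15 rows can be typed in directly from the listing above. This is only a convenience sketch: the object name `lab` is an assumption and is not part of the original lab code.

```r
# Rebuild the lab data from the printed listing (15 stores, 5 per location).
# `lab` is a hypothetical name chosen for this sketch.
lab <- data.frame(
  Store = 1:15,
  NumberOfHousehold = c(161, 99, 135, 120, 164,
                        221, 179, 204, 214, 101,
                        231, 206, 248, 107, 205),
  Location = rep(c("Street", "Mall", "Downtown"), each = 5),
  SalesPrice = c(157.27, 93.28, 136.81, 123.79, 153.51,
                 241.74, 204.54, 206.71, 229.78, 135.22,
                 224.71, 195.29, 242.16, 115.21, 197.82)
)

str(lab)             # check the column types
table(lab$Location)  # count the stores in each location
```

Comparing `summary(lab$SalesPrice)` with the summary of `y` in the handout is a quick way to confirm that the values were typed correctly.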
# name variables for convenience
y = data$SalesPrice
x1 = data$NumberOfHousehold #quantitative independent variable
x2 = data$Location #qualitative independent variable with k=3 groups
summary(y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 93.28 136.01 195.29 177.19 215.71 242.16
summary(x1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 99.0 127.5 179.0 173.0 210.0 248.0
plot(x1,y,xlab="Number Of Household",ylab="Sale price")
cor(x1,y) # Pearson correlation coefficient r between two quantitative variables
## [1] 0.9610064
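To see what `cor()` computes, the correlation coefficient can be rebuilt from its definition, r = Sxy / sqrt(Sxx * Syy). The sketch below retypes the two columns from the data listing so that it runs on its own; the names `sxy`, `sxx`, `syy` and `r_manual` are illustrative assumptions.

```r
# Values copied from the data listing in this handout
x1 <- c(161, 99, 135, 120, 164, 221, 179, 204, 214, 101,
        231, 206, 248, 107, 205)
y  <- c(157.27, 93.28, 136.81, 123.79, 153.51,
        241.74, 204.54, 206.71, 229.78, 135.22,
        224.71, 195.29, 242.16, 115.21, 197.82)

# Sums of squares and cross-products about the means
sxy <- sum((x1 - mean(x1)) * (y - mean(y)))
sxx <- sum((x1 - mean(x1))^2)
syy <- sum((y - mean(y))^2)

r_manual <- sxy / sqrt(sxx * syy)
r_manual

# The manual value should agree with the built-in function
all.equal(r_manual, cor(x1, y))
```

A value of r close to 1 indicates a strong positive linear relationship between the number of households and the sales price.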
boxplot(y~x2,xlab="Location",ylab="Sale price")
NOTE: Plots are often easier to build and customise with ggplot2.
data_df = data.frame(data) # make sure the data is a data frame
library(ggplot2) # you might need to install the package first: install.packages("ggplot2")
ggplot(data_df,aes(x=NumberOfHousehold,y=SalesPrice))+
  geom_point(aes(color=Location))
ggplot(data_df,aes(y=SalesPrice,x=Location))+
  geom_boxplot(aes(color=Location))
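Location is a qualitative independent variable with k = 3 groups, so we
need to transform it into dummy variables.
We create k - 1 = 3 - 1 = 2 dummy variables, named x2.dummy (1 for Street, 0 otherwise) and x3.dummy (1 for Mall, 0 otherwise).
Downtown is the reference group, for which both dummy variables equal 0.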
# create x2.dummy: 1 if Location is "Street", 0 otherwise
x2.dummy = c() # start with an empty vector
for(i in 1:length(data$Location)){
  if(data$Location[i]=="Street") x2.dummy[i] = 1
  else x2.dummy[i] = 0
}
x2.dummy
## [1] 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
# create x3.dummy: 1 if Location is "Mall", 0 otherwise
x3.dummy = c() # start with an empty vector
for(i in 1:length(data$Location)){
  if(data$Location[i]=="Mall") x3.dummy[i] = 1
  else x3.dummy[i] = 0
}
x3.dummy
## [1] 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0
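The loops above can also be written as vectorised comparisons. The sketch below is an alternative under stated assumptions, not the lab's required method: `Location` is retyped from the data listing, and the `.vec` suffixes are hypothetical names used to avoid overwriting the loop results.

```r
# Location column as it appears in the data listing
Location <- rep(c("Street", "Mall", "Downtown"), each = 5)

# A logical comparison converted to 0/1 gives the same coding as the loops
x2.dummy.vec <- as.integer(Location == "Street")
x3.dummy.vec <- as.integer(Location == "Mall")
x2.dummy.vec
x3.dummy.vec

# R builds the same coding automatically from a factor.
# Levels are sorted alphabetically, so "Downtown" becomes the reference group.
mm <- model.matrix(~ factor(Location))
colnames(mm)
```

Building the dummies by hand keeps the reference group explicit, which helps when writing the fitted equation for each location.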
Select independent variables using variable selection methods. The step() function compares candidate models by AIC; a lower AIC is better.
# Forward selection
null=lm(y~1)
full=lm(y~x1+x2.dummy+x3.dummy)
fw.fit = step(null, scope=list(lower=null,upper=full),direction="forward")
## Start: AIC=117.83
## y ~ 1
##
## Df Sum of Sq RSS AIC
## + x1 1 31278.1 2590 81.269
## + x2.dummy 1 14690.3 19178 111.302
## + x3.dummy 1 5230.6 28637 117.316
## <none> 33868 117.833
##
## Step: AIC=81.27
## y ~ x1
##
## Df Sum of Sq RSS AIC
## + x3.dummy 1 2038.47 551.29 60.063
## + x2.dummy 1 931.21 1658.56 76.585
## <none> 2589.77 81.269
##
## Step: AIC=60.06
## y ~ x1 + x3.dummy
##
## Df Sum of Sq RSS AIC
## + x2.dummy 1 84.366 466.92 59.572
## <none> 551.29 60.063
##
## Step: AIC=59.57
## y ~ x1 + x3.dummy + x2.dummy
fw.fit
##
## Call:
## lm(formula = y ~ x1 + x3.dummy + x2.dummy)
##
## Coefficients:
## (Intercept) x1 x3.dummy x2.dummy
## 21.958 0.868 22.101 -6.901
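The AIC values in the forward selection trace can be checked by hand. For a linear model, step() reports n * log(RSS / n) + 2p, where p is the number of estimated coefficients including the intercept; this differs from the value returned by AIC() only by a constant. The sketch below uses the RSS values printed in the trace; the object names are illustrative assumptions.

```r
n <- 15  # number of stores

# RSS of the model chosen at each step, copied from the step() trace
rss <- c("y ~ 1"                        = 33868,
         "y ~ x1"                       = 2589.77,
         "y ~ x1 + x3.dummy"            = 551.29,
         "y ~ x1 + x3.dummy + x2.dummy" = 466.92)

# Number of estimated coefficients in each model (intercept included)
p <- c(1, 2, 3, 4)

aic <- n * log(rss / n) + 2 * p
round(aic, 2)
```

Each step adds the variable that lowers this value the most, and the procedure stops when no addition lowers it further.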
# Backward elimination
be.fit = step(full,direction="backward")
## Start: AIC=59.57
## y ~ x1 + x2.dummy + x3.dummy
##
## Df Sum of Sq RSS AIC
## <none> 466.9 59.572
## - x2.dummy 1 84.4 551.3 60.063
## - x3.dummy 1 1191.6 1658.6 76.585
## - x1 1 18527.4 18994.3 113.158
be.fit
##
## Call:
## lm(formula = y ~ x1 + x2.dummy + x3.dummy)
##
## Coefficients:
## (Intercept) x1 x2.dummy x3.dummy
## 21.958 0.868 -6.901 22.101
# Stepwise regression
sw.fit = step(full, direction="both")
## Start: AIC=59.57
## y ~ x1 + x2.dummy + x3.dummy
##
## Df Sum of Sq RSS AIC
## <none> 466.9 59.572
## - x2.dummy 1 84.4 551.3 60.063
## - x3.dummy 1 1191.6 1658.6 76.585
## - x1 1 18527.4 18994.3 113.158
sw.fit
##
## Call:
## lm(formula = y ~ x1 + x2.dummy + x3.dummy)
##
## Coefficients:
## (Intercept) x1 x2.dummy x3.dummy
## 21.958 0.868 -6.901 22.101
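Because the dummy variables only shift the intercept, the fitted model gives one line per location with a common slope. From the coefficients printed above, the intercepts are 21.958 for Downtown, 21.958 - 6.901 = 15.057 for Street, and 21.958 + 22.101 = 44.059 for Mall. The sketch below reproduces this arithmetic; it retypes the data from the listing, and the names `lab`, `fit`, `b`, `intercepts` and `slope` are assumptions made for illustration.

```r
# Data copied from the listing in this handout
lab <- data.frame(
  NumberOfHousehold = c(161, 99, 135, 120, 164, 221, 179, 204, 214, 101,
                        231, 206, 248, 107, 205),
  Location = rep(c("Street", "Mall", "Downtown"), each = 5),
  SalesPrice = c(157.27, 93.28, 136.81, 123.79, 153.51,
                 241.74, 204.54, 206.71, 229.78, 135.22,
                 224.71, 195.29, 242.16, 115.21, 197.82)
)
lab$x2.dummy <- as.integer(lab$Location == "Street")
lab$x3.dummy <- as.integer(lab$Location == "Mall")

fit <- lm(SalesPrice ~ NumberOfHousehold + x2.dummy + x3.dummy, data = lab)
b <- coef(fit)

# Intercept for each location; Downtown is the reference group
intercepts <- c(Downtown = b[["(Intercept)"]],
                Street   = b[["(Intercept)"]] + b[["x2.dummy"]],
                Mall     = b[["(Intercept)"]] + b[["x3.dummy"]])
slope <- b[["NumberOfHousehold"]]

# Print one fitted equation per location
for (g in names(intercepts)) {
  cat(sprintf("%-8s: yhat = %.3f + %.3f * x1\n", g, intercepts[[g]], slope))
}
```

The three lines are parallel: each additional household raises the predicted sales price by the same amount in every location.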
summary(sw.fit)
##
## Call:
## lm(formula = y ~ x1 + x2.dummy + x3.dummy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.422 -2.989 2.243 4.572 5.852
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.95824 8.78193 2.500 0.029486 *
## x1 0.86800 0.04155 20.892 3.34e-10 ***
## x2.dummy -6.90102 4.89503 -1.410 0.186240
## x3.dummy 22.10084 4.17123 5.298 0.000253 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.515 on 11 degrees of freedom
## Multiple R-squared: 0.9862, Adjusted R-squared: 0.9825
## F-statistic: 262.3 on 3 and 11 DF, p-value: 1.641e-10
anova(sw.fit)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x1 1 31278.1 31278.1 736.863 1.976e-11 ***
## x2.dummy 1 931.2 931.2 21.938 0.0006675 ***
## x3.dummy 1 1191.6 1191.6 28.073 0.0002530 ***
## Residuals 11 466.9 42.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
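The overall F statistic in the summary() output can be rebuilt from the anova() table above: the regression sum of squares is the sum of the three sequential sums of squares, and F = (SSR / 3) / (SSE / 11). The sketch below uses the rounded values printed in the table, so small rounding differences are expected; the object names are illustrative assumptions.

```r
# Sums of squares copied from the anova() table
ssr <- 31278.1 + 931.2 + 1191.6  # regression SS (x1, x2.dummy, x3.dummy)
sse <- 466.9                     # residual SS

df_reg <- 3   # number of independent variables in the model
df_res <- 11  # n - number of coefficients = 15 - 4

f_stat  <- (ssr / df_reg) / (sse / df_res)
f_crit  <- qf(0.95, df_reg, df_res)                        # critical value at alpha = 0.05
p_value <- pf(f_stat, df_reg, df_res, lower.tail = FALSE)  # upper-tail probability

f_stat
f_crit
p_value
```

Since the F statistic is far larger than the critical value, the hypothesis that all slope parameters are zero is rejected at the 0.05 level.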
confint(sw.fit) # 95% confidence intervals for the regression parameters
## 2.5 % 97.5 %
## (Intercept) 2.6293467 41.2871246
## x1 0.7765583 0.9594473
## x2.dummy -17.6749031 3.8728631
## x3.dummy 12.9200368 31.2816515
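Each interval above is the estimate plus or minus a t critical value times the standard error, with 11 residual degrees of freedom. The sketch below recomputes the limits from the estimates and standard errors printed by summary(sw.fit); the names `est`, `se`, `t_crit` and `ci` are illustrative assumptions, and the results match the printed intervals only up to rounding.

```r
# Estimates and standard errors copied from summary(sw.fit)
est <- c("(Intercept)" = 21.95824, x1 = 0.86800,
         x2.dummy = -6.90102, x3.dummy = 22.10084)
se  <- c(8.78193, 0.04155, 4.89503, 4.17123)

# Two-sided 95% interval uses the 97.5th percentile of t with 11 df
t_crit <- qt(0.975, df = 11)

ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
round(ci, 3)
```

The interval for x2.dummy contains 0, which agrees with its non-significant t test in the summary table.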
new = data.frame(x1=x1,x2.dummy=x2.dummy,x3.dummy=x3.dummy)
yhat = predict(sw.fit,newdata=new) #compute fitted y
yhat
## 1 2 3 4 5 6 7 8
## 154.8057 100.9895 132.2376 119.2176 157.4097 235.8877 199.4316 221.1317
## 9 10 11 12 13 14 15
## 229.8117 131.7274 222.4669 200.7668 237.2229 114.8345 199.8988
plot(x1,y,pch=1,xlab="Number Of Household",ylab="Sale price")
points(x1,yhat,type="p",pch=20)
legend("topleft",c("observed y","predicted y"),pch=c(1,20))
par(mfrow=c(2,2)) # set plot layout to 2 rows and 2 columns
plot(fw.fit) # residual diagnostic plots (fw.fit contains the same terms as sw.fit)
You must submit:
an R file with your code, and
an answer sheet in your own handwriting.
Submit on Mango; see the deadline there.
A math teacher wants to investigate the relationship between a response variable Y and three independent variables: the method of instruction, pretest performance, and student emotional intelligence (EQ). The data is on Mango canvas.
Use the R language to:
Explore the data using descriptive statistics (scatter plots of Y against pretest and Y against EQ, and a boxplot of Y by method) and write a summary of each plot.
Since Method is a qualitative independent variable with k = 4 groups, transform Method into k - 1 = 3 dummy variables.
Select independent variables using variable selection methods (forward selection, backward elimination, and stepwise regression).
Write down the fitted equation for predicting Y using the best set of independent variables obtained from stepwise regression.
Test the overall fit of the model at significance level 0.05 (write down the 4 steps of hypothesis testing).
Find the 95% confidence intervals of the regression parameters.
Write down the fitted (predicted) equation for each method A, B, C, and D.