Module: 208251 Regression Analysis and Non-Parametric
Statistics
Instructor: Parichart Pattarapanitchai
Affiliation: Department of Statistics, Faculty of Science, Chiang Mai
University.
Students will be able to use the R language to analyse data using multiple
linear regression:
1. perform descriptive statistics
2. transform a qualitative independent variable into dummy variables
3. select independent variables
4. perform linear regression analysis and inference on regression
parameters
5. interpret the results
library(RCurl) # load 'RCurl' package
data <- read.csv(text=getURL("https://raw.githubusercontent.com/Paripai/208251/main/LocationAndSalePrice.csv"))
data
## Store NumberOfHousehold Location SalesPrice
## 1 1 161 Street 157.27
## 2 2 99 Street 93.28
## 3 3 135 Street 136.81
## 4 4 120 Street 123.79
## 5 5 164 Street 153.51
## 6 6 221 Mall 241.74
## 7 7 179 Mall 204.54
## 8 8 204 Mall 206.71
## 9 9 214 Mall 229.78
## 10 10 101 Mall 135.22
## 11 11 231 Downtown 224.71
## 12 12 206 Downtown 195.29
## 13 13 248 Downtown 242.16
## 14 14 107 Downtown 115.21
## 15 15 205 Downtown 197.82
# name variables for convenience
y = data$SalesPrice
x1 = data$NumberOfHousehold #quantitative independent variable
x2 = data$Location #qualitative independent variable with k=3 groups
summary(y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 93.28 136.01 195.29 177.19 215.71 242.16
summary(x1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 99.0 127.5 179.0 173.0 210.0 248.0
plot(x1,y,xlab="Number Of Household",ylab="Sale price")
cor(x1,y) # compute the correlation coefficient r between two quantitative variables
## [1] 0.9610064
boxplot(y~x2,xlab="Location",ylab="Sale price")
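The boxplot compares sale price across locations visually; the same comparison can be put into numbers. The following is a minimal sketch, assuming the data are retyped from the table printed above so that the block runs without a network connection. The names group_n, group_mean and group_sd are introduced here for illustration and are not part of the handout.

```r
# Data retyped from the table printed above (assumption: no typing errors)
y  <- c(157.27, 93.28, 136.81, 123.79, 153.51,
        241.74, 204.54, 206.71, 229.78, 135.22,
        224.71, 195.29, 242.16, 115.21, 197.82)
x2 <- rep(c("Street", "Mall", "Downtown"), each = 5)

group_n    <- table(x2)            # number of stores per location
group_mean <- tapply(y, x2, mean)  # mean sale price per location
group_sd   <- tapply(y, x2, sd)    # standard deviation of sale price per location

print(group_n)
print(round(group_mean, 2))
print(round(group_sd, 2))
```

Reporting the group means and standard deviations next to the boxplot makes the written summary of the plot easier to justify.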
NOTE: Graphs are easier to work with when using the ggplot2 package.
data_df = data.frame(data) # make sure data is a data frame (read.csv already returns one)
library(ggplot2) # you might need to install the package "ggplot2" first!
ggplot(data_df,aes(x=NumberOfHousehold,y=SalesPrice))+
geom_point(aes(color=Location))
ggplot(data_df,aes(y=SalesPrice,x=Location))+
geom_boxplot(aes(color=Location))
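Since Location is a qualitative independent variable with k=3 groups,
we need to transform Location into dummy variables.
We create k-1=3-1=2 dummy variables, named x2.dummy (1 for Street, 0 otherwise) and x3.dummy (1 for Mall, 0 otherwise).
Downtown has no dummy variable of its own, so it is the reference group.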
#create x2.dummy
x2.dummy= c() #create null vector
for(i in 1:length(data$Location)){
if(data$Location[i]=="Street") x2.dummy[i] = 1
else x2.dummy[i] = 0
}
x2.dummy
## [1] 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
#create x3.dummy
x3.dummy= c()
for(i in 1:length(data$Location)){
if(data$Location[i]=="Mall") x3.dummy[i] = 1
else x3.dummy[i] = 0
}
x3.dummy
## [1] 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0
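The two loops above can also be written without a loop, because comparing a vector with a string returns a logical vector that converts to 0/1. The following is a sketch under the same coding (Street and Mall each get a dummy variable, Downtown is the reference group). The Location column is retyped here so the block is self-contained, and loc and mm are helper names introduced for illustration. The last lines check that R's own factor coding in model.matrix() gives the same columns when Downtown is set as the first level.

```r
# Location column retyped from the table printed above
Location <- rep(c("Street", "Mall", "Downtown"), each = 5)

# Vectorised dummy coding: TRUE/FALSE converted to 1/0
x2.dummy <- as.integer(Location == "Street")
x3.dummy <- as.integer(Location == "Mall")
print(cbind(x2.dummy, x3.dummy))

# Cross-check against R's built-in treatment coding,
# with Downtown as the first (reference) level
loc <- factor(Location, levels = c("Downtown", "Street", "Mall"))
mm  <- model.matrix(~ loc)
print(all(mm[, "locStreet"] == x2.dummy))
print(all(mm[, "locMall"] == x3.dummy))
```

Passing a factor directly to lm() would create these columns automatically; building them by hand, as this handout does, keeps the coding explicit.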
Select independent variables using variable selection methods
# Forward selection
null=lm(y~1)
full=lm(y~x1+x2.dummy+x3.dummy)
fw.fit = step(null, scope=list(lower=null,upper=full),direction="forward")
## Start: AIC=117.83
## y ~ 1
##
## Df Sum of Sq RSS AIC
## + x1 1 31278.1 2590 81.269
## + x2.dummy 1 14690.3 19178 111.302
## + x3.dummy 1 5230.6 28637 117.316
## <none> 33868 117.833
##
## Step: AIC=81.27
## y ~ x1
##
## Df Sum of Sq RSS AIC
## + x3.dummy 1 2038.47 551.29 60.063
## + x2.dummy 1 931.21 1658.56 76.585
## <none> 2589.77 81.269
##
## Step: AIC=60.06
## y ~ x1 + x3.dummy
##
## Df Sum of Sq RSS AIC
## + x2.dummy 1 84.366 466.92 59.572
## <none> 551.29 60.063
##
## Step: AIC=59.57
## y ~ x1 + x3.dummy + x2.dummy
fw.fit
##
## Call:
## lm(formula = y ~ x1 + x3.dummy + x2.dummy)
##
## Coefficients:
## (Intercept) x1 x3.dummy x2.dummy
## 21.958 0.868 22.101 -6.901
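The AIC column printed by step() is not the value returned by AIC(). For a linear model, step() relies on extractAIC(), which computes n*log(RSS/n) + 2p, where n is the sample size and p is the number of estimated regression coefficients. The following is a short check of the first and last AIC values above, assuming the data are retyped from the printed table; step_aic is a helper name introduced here, not an R function.

```r
# Data retyped from the table printed above (assumption: no typing errors)
y  <- c(157.27, 93.28, 136.81, 123.79, 153.51,
        241.74, 204.54, 206.71, 229.78, 135.22,
        224.71, 195.29, 242.16, 115.21, 197.82)
x1 <- c(161, 99, 135, 120, 164, 221, 179, 204, 214, 101,
        231, 206, 248, 107, 205)
x2.dummy <- rep(c(1, 0, 0), each = 5)  # 1 = Street
x3.dummy <- rep(c(0, 1, 0), each = 5)  # 1 = Mall

# Hypothetical helper: the AIC formula used by step() for lm fits
step_aic <- function(fit) {
  n   <- length(residuals(fit))
  rss <- sum(residuals(fit)^2)
  p   <- length(coef(fit))
  n * log(rss / n) + 2 * p
}

null <- lm(y ~ 1)
full <- lm(y ~ x1 + x2.dummy + x3.dummy)

print(c(by_hand = step_aic(null), extractAIC = extractAIC(null)[2]))
print(c(by_hand = step_aic(full), extractAIC = extractAIC(full)[2]))
```

At each step, forward selection adds the variable that gives the lowest AIC and stops when no addition lowers it further.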
# Backward elimination
be.fit = step(full,direction="backward")
## Start: AIC=59.57
## y ~ x1 + x2.dummy + x3.dummy
##
## Df Sum of Sq RSS AIC
## <none> 466.9 59.572
## - x2.dummy 1 84.4 551.3 60.063
## - x3.dummy 1 1191.6 1658.6 76.585
## - x1 1 18527.4 18994.3 113.158
be.fit
##
## Call:
## lm(formula = y ~ x1 + x2.dummy + x3.dummy)
##
## Coefficients:
## (Intercept) x1 x2.dummy x3.dummy
## 21.958 0.868 -6.901 22.101
# Stepwise regression
sw.fit = step(full, direction="both")
## Start: AIC=59.57
## y ~ x1 + x2.dummy + x3.dummy
##
## Df Sum of Sq RSS AIC
## <none> 466.9 59.572
## - x2.dummy 1 84.4 551.3 60.063
## - x3.dummy 1 1191.6 1658.6 76.585
## - x1 1 18527.4 18994.3 113.158
sw.fit
##
## Call:
## lm(formula = y ~ x1 + x2.dummy + x3.dummy)
##
## Coefficients:
## (Intercept) x1 x2.dummy x3.dummy
## 21.958 0.868 -6.901 22.101
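Because each dummy variable only switches on or off, the fitted model is a set of three parallel lines, one per location, sharing the slope of x1. The following is a sketch that derives the intercept of each line from the coefficients, assuming the data are retyped from the printed table; fit, b and intercepts are names introduced here for illustration.

```r
# Data retyped from the table printed above (assumption: no typing errors)
y  <- c(157.27, 93.28, 136.81, 123.79, 153.51,
        241.74, 204.54, 206.71, 229.78, 135.22,
        224.71, 195.29, 242.16, 115.21, 197.82)
x1 <- c(161, 99, 135, 120, 164, 221, 179, 204, 214, 101,
        231, 206, 248, 107, 205)
x2.dummy <- rep(c(1, 0, 0), each = 5)  # 1 = Street
x3.dummy <- rep(c(0, 1, 0), each = 5)  # 1 = Mall

fit <- lm(y ~ x1 + x2.dummy + x3.dummy)
b   <- coef(fit)

# Intercept of each location's line: reference intercept plus its dummy coefficient
intercepts <- c(Downtown = b[["(Intercept)"]],
                Street   = b[["(Intercept)"]] + b[["x2.dummy"]],
                Mall     = b[["(Intercept)"]] + b[["x3.dummy"]])

for (g in names(intercepts)) {
  cat(sprintf("%-8s: yhat = %.3f + %.3f * x1\n", g, intercepts[[g]], b[["x1"]]))
}
```

With the coefficients printed above, the intercepts are about 21.958 for Downtown, 15.057 for Street (21.958 - 6.901) and 44.059 for Mall (21.958 + 22.101), each with slope 0.868.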
summary(sw.fit)
##
## Call:
## lm(formula = y ~ x1 + x2.dummy + x3.dummy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.422 -2.989 2.243 4.572 5.852
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.95824 8.78193 2.500 0.029486 *
## x1 0.86800 0.04155 20.892 3.34e-10 ***
## x2.dummy -6.90102 4.89503 -1.410 0.186240
## x3.dummy 22.10084 4.17123 5.298 0.000253 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.515 on 11 degrees of freedom
## Multiple R-squared: 0.9862, Adjusted R-squared: 0.9825
## F-statistic: 262.3 on 3 and 11 DF, p-value: 1.641e-10
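The last line of the summary is the overall F test of H0: beta1 = beta2 = beta3 = 0 against H1: at least one beta_j is not 0. The following is a sketch that extracts the test statistic, the critical value at significance level 0.05 and the p-value, assuming the data are retyped from the printed table; alpha, f_stat, f_crit, p_value and reject are names introduced here for illustration.

```r
# Data retyped from the table printed above (assumption: no typing errors)
y  <- c(157.27, 93.28, 136.81, 123.79, 153.51,
        241.74, 204.54, 206.71, 229.78, 135.22,
        224.71, 195.29, 242.16, 115.21, 197.82)
x1 <- c(161, 99, 135, 120, 164, 221, 179, 204, 214, 101,
        231, 206, 248, 107, 205)
x2.dummy <- rep(c(1, 0, 0), each = 5)  # 1 = Street
x3.dummy <- rep(c(0, 1, 0), each = 5)  # 1 = Mall

fit <- lm(y ~ x1 + x2.dummy + x3.dummy)
fs  <- summary(fit)$fstatistic  # named vector: value, numdf, dendf

alpha   <- 0.05
f_stat  <- fs[["value"]]
f_crit  <- qf(1 - alpha, fs[["numdf"]], fs[["dendf"]])  # rejection region: F > f_crit
p_value <- pf(f_stat, fs[["numdf"]], fs[["dendf"]], lower.tail = FALSE)
reject  <- f_stat > f_crit

cat("F =", round(f_stat, 1), " critical value =", round(f_crit, 3),
    " p-value =", signif(p_value, 4), " reject H0:", reject, "\n")
```

Since F = 262.3 is far above the critical value, H0 is rejected: at least one independent variable is linearly related to sale price.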
anova(sw.fit)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x1 1 31278.1 31278.1 736.863 1.976e-11 ***
## x2.dummy 1 931.2 931.2 21.938 0.0006675 ***
## x3.dummy 1 1191.6 1191.6 28.073 0.0002530 ***
## Residuals 11 466.9 42.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
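anova() on a single model gives sequential sums of squares: each row is tested after the terms listed above it. That is why x2.dummy looks significant in this table (it is tested given x1 only) but not in the t test of summary() (where it is tested given both x1 and x3.dummy). To test Location as a whole, both dummy variables can be dropped together and the two nested models compared with a partial F test. The following is a sketch, assuming the data are retyped from the printed table; reduced, cmp and f_by_hand are names introduced here for illustration.

```r
# Data retyped from the table printed above (assumption: no typing errors)
y  <- c(157.27, 93.28, 136.81, 123.79, 153.51,
        241.74, 204.54, 206.71, 229.78, 135.22,
        224.71, 195.29, 242.16, 115.21, 197.82)
x1 <- c(161, 99, 135, 120, 164, 221, 179, 204, 214, 101,
        231, 206, 248, 107, 205)
x2.dummy <- rep(c(1, 0, 0), each = 5)  # 1 = Street
x3.dummy <- rep(c(0, 1, 0), each = 5)  # 1 = Mall

reduced <- lm(y ~ x1)                        # model without Location
full    <- lm(y ~ x1 + x2.dummy + x3.dummy)  # model with both Location dummies

cmp <- anova(reduced, full)  # partial F test of H0: both dummy coefficients are 0
print(cmp)

# The same statistic by hand: drop in RSS per dropped term, divided by the full-model MSE
f_by_hand <- ((deviance(reduced) - deviance(full)) / 2) /
             (deviance(full) / df.residual(full))
print(f_by_hand)
```

A small p-value in the second row of the comparison indicates that Location improves the model beyond what the number of households explains.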
confint(sw.fit) # 95% confidence intervals for the regression parameters
## 2.5 % 97.5 %
## (Intercept) 2.6293467 41.2871246
## x1 0.7765583 0.9594473
## x2.dummy -17.6749031 3.8728631
## x3.dummy 12.9200368 31.2816515
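Each interval above is the estimate plus or minus a t critical value times its standard error, with n - (number of coefficients) = 15 - 4 = 11 degrees of freedom. The following is a sketch that rebuilds the intervals from the coefficient table, assuming the data are retyped from the printed table; est, se, t_crit and ci_by_hand are names introduced here for illustration.

```r
# Data retyped from the table printed above (assumption: no typing errors)
y  <- c(157.27, 93.28, 136.81, 123.79, 153.51,
        241.74, 204.54, 206.71, 229.78, 135.22,
        224.71, 195.29, 242.16, 115.21, 197.82)
x1 <- c(161, 99, 135, 120, 164, 221, 179, 204, 214, 101,
        231, 206, 248, 107, 205)
x2.dummy <- rep(c(1, 0, 0), each = 5)  # 1 = Street
x3.dummy <- rep(c(0, 1, 0), each = 5)  # 1 = Mall

fit <- lm(y ~ x1 + x2.dummy + x3.dummy)

est    <- coef(summary(fit))[, "Estimate"]
se     <- coef(summary(fit))[, "Std. Error"]
t_crit <- qt(0.975, df.residual(fit))  # two-sided 95% interval, 11 df

ci_by_hand <- cbind(lower = est - t_crit * se,
                    upper = est + t_crit * se)
print(ci_by_hand)
print(confint(fit))  # should agree with the hand computation
```

The interval for x2.dummy contains 0, which agrees with its non-significant t test in the summary output.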
new = data.frame(x1=x1,x2.dummy=x2.dummy,x3.dummy=x3.dummy)
yhat = predict(sw.fit,newdata=new) #compute fitted y
yhat
## 1 2 3 4 5 6 7 8
## 154.8057 100.9895 132.2376 119.2176 157.4097 235.8877 199.4316 221.1317
## 9 10 11 12 13 14 15
## 229.8117 131.7274 222.4669 200.7668 237.2229 114.8345 199.8988
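The same predict() call works for stores that are not in the data set. The following is a sketch for one hypothetical store (150 households, located in a Mall; these values are made up for illustration and do not come from the data), assuming the data are retyped from the printed table. new_store and pred are names introduced here.

```r
# Data retyped from the table printed above (assumption: no typing errors)
y  <- c(157.27, 93.28, 136.81, 123.79, 153.51,
        241.74, 204.54, 206.71, 229.78, 135.22,
        224.71, 195.29, 242.16, 115.21, 197.82)
x1 <- c(161, 99, 135, 120, 164, 221, 179, 204, 214, 101,
        231, 206, 248, 107, 205)
x2.dummy <- rep(c(1, 0, 0), each = 5)  # 1 = Street
x3.dummy <- rep(c(0, 1, 0), each = 5)  # 1 = Mall

fit <- lm(y ~ x1 + x2.dummy + x3.dummy)

# Hypothetical new store: 150 households, Mall (x2.dummy = 0, x3.dummy = 1)
new_store <- data.frame(x1 = 150, x2.dummy = 0, x3.dummy = 1)

# Point prediction with a 95% prediction interval
pred <- predict(fit, newdata = new_store, interval = "prediction", level = 0.95)
print(pred)
```

The column names in new_store must match the variable names used in the model formula; otherwise predict() cannot find the new values.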
plot(x1,y,pch=1,xlab="Number Of Household",ylab="Sale price")
points(x1,yhat,type="p",pch=20)
legend("topleft",c("observed y","predicted y"),pch=c(1,20))
par(mfrow=c(2,2)) # set plot layout to 2 rows and 2 columns
plot(fw.fit) # residual diagnostic plots (fw.fit is the same model as sw.fit)
You must submit on Mango (see the deadline there):
1. an R file with your code, and
2. a handwritten answer sheet.
A math teacher wants to investigate the relationship between a response
variable Y and three independent variables: the method of instruction, pretest
performance, and student emotional intelligence (EQ). The data are on GitHub.
Use the R language to:
1. Explore the data using descriptive statistics (scatter plots of Y against pretest and of Y against EQ, and a boxplot of Y by Method), and write a summary of each plot.
2. Since Method is a qualitative independent variable with k=4 groups, transform Method into k-1=3 dummy variables.
3. Select independent variables using variable selection methods (forward selection, backward elimination, and stepwise regression).
4. Write down the fitted equation for predicting Y using the best set of independent variables obtained from stepwise regression.
5. Test the overall fit of the model at significance level 0.05 (write down the 4 steps of hypothesis testing).
6. Find the 95% confidence intervals of the regression parameters.
7. Write down the fitted (predicted) equation for each of the methods A, B, C, and D.