Module: 208251 Regression Analysis and Non-Parametric Statistics

Instructor: Wisunee Puggard

Affiliation: Department of Statistics, Faculty of Science, Chiang Mai University.

Objectives:

Students will be able to use the R language to analyse data with multiple linear regression:

  1. perform descriptive statistics

  2. transform a qualitative independent variable into dummy variables

  3. select independent variables

  4. perform linear regression analysis and inference on regression parameters

  5. interpret the results

Exercise I:

You can download the data file from Class material on MANGO Canvas.

1. Import data into R.

The data file is 208251_LAB2_LABDATA.csv.
Note that your working directory (the folder that contains the data file) will be different from mine, so change the path below to match your own.

data = read.csv('/Users/wisuneepuggard/Desktop/LAB208251/208251_LAB2_LABDATA.csv'
,header=TRUE)
data
##    Store NumberOfHousehold Location SalesPrice
## 1      1               161   Street     157.27
## 2      2                99   Street      93.28
## 3      3               135   Street     136.81
## 4      4               120   Street     123.79
## 5      5               164   Street     153.51
## 6      6               221     Mall     241.74
## 7      7               179     Mall     204.54
## 8      8               204     Mall     206.71
## 9      9               214     Mall     229.78
## 10    10               101     Mall     135.22
## 11    11               231 Downtown     224.71
## 12    12               206 Downtown     195.29
## 13    13               248 Downtown     242.16
## 14    14               107 Downtown     115.21
## 15    15               205 Downtown     197.82
# name variables for convenience
y = data$SalesPrice
x1 = data$NumberOfHousehold #quantitative independent variable
x2 = data$Location   #qualitative independent variable with k=3 groups 
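
Because the absolute path above only works on the instructor's machine, here is a minimal sketch of two portable ways to point R at the file. It assumes the CSV sits in a folder you know; `lab_dir` is a hypothetical stand-in (a temporary folder here), and the data are retyped from the listing above only so that the example runs on its own.

```r
# Lab data retyped from the listing above so the example is self-contained
lab = data.frame(
  Store = 1:15,
  NumberOfHousehold = c(161, 99, 135, 120, 164, 221, 179, 204, 214, 101,
                        231, 206, 248, 107, 205),
  Location = rep(c("Street", "Mall", "Downtown"), each = 5),
  SalesPrice = c(157.27, 93.28, 136.81, 123.79, 153.51, 241.74, 204.54,
                 206.71, 229.78, 135.22, 224.71, 195.29, 242.16, 115.21, 197.82)
)

# Hypothetical folder standing in for wherever you saved the file
lab_dir = tempdir()
write.csv(lab, file.path(lab_dir, "208251_LAB2_LABDATA.csv"), row.names = FALSE)

# Option A: build the full path with file.path() and check it before reading
csv_path = file.path(lab_dir, "208251_LAB2_LABDATA.csv")
stopifnot(file.exists(csv_path))
data = read.csv(csv_path, header = TRUE)

# Option B: move the working directory to the folder, then use the bare file name
old_wd = setwd(lab_dir)
data_b = read.csv("208251_LAB2_LABDATA.csv", header = TRUE)
setwd(old_wd)  # restore the previous working directory

str(data)  # check the column names and types after import
```

Option A leaves the working directory untouched, which tends to be less surprising when a script is shared between machines.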

2. Explore the data using descriptive statistics

summary(y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   93.28  136.01  195.29  177.19  215.71  242.16
summary(x1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    99.0   127.5   179.0   173.0   210.0   248.0
plot(x1,y,xlab="Number Of Household",ylab="Sale price")

cor(x1,y) # compute the Pearson correlation coefficient r between two quantitative variables
## [1] 0.9610064
boxplot(y~x2,xlab="Location",ylab="Sale price")

NOTE: Plots are often easier to build and customise with ggplot2.

data_df = data.frame(data) # make sure the data is a data frame (read.csv already returns one)
library(ggplot2) # you might need to install the package "ggplot2" first!
ggplot(data_df,aes(x=NumberOfHousehold,y=SalesPrice))+
  geom_point(aes(color=Location))

ggplot(data_df,aes(y=SalesPrice,x=Location))+
  geom_boxplot(aes(color=Location))
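
The boxplot compares sale price across locations visually; a numeric summary by group can back it up. The sketch below is one possible way to do this with base R (`tapply` and `aggregate`), with the data retyped from step 1 so that it runs on its own. The names `group_means`, `group_sds` and `group_table` are illustrative only.

```r
# Lab data retyped from the listing in step 1
data = data.frame(
  NumberOfHousehold = c(161, 99, 135, 120, 164, 221, 179, 204, 214, 101,
                        231, 206, 248, 107, 205),
  Location = rep(c("Street", "Mall", "Downtown"), each = 5),
  SalesPrice = c(157.27, 93.28, 136.81, 123.79, 153.51, 241.74, 204.54,
                 206.71, 229.78, 135.22, 224.71, 195.29, 242.16, 115.21, 197.82)
)

# Mean and standard deviation of sale price within each location
group_means = tapply(data$SalesPrice, data$Location, mean)
group_sds   = tapply(data$SalesPrice, data$Location, sd)
print(round(cbind(mean = group_means, sd = group_sds), 2))

# Group means of both quantitative variables in one table
group_table = aggregate(cbind(SalesPrice, NumberOfHousehold) ~ Location,
                        data = data, FUN = mean)
print(group_table)
```

Comparing the group means of NumberOfHousehold as well helps to judge whether a difference in sale price between locations might simply reflect a difference in the number of households.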

3. Dummy variables

Location is a qualitative independent variable with k=3 groups, so we need to transform it into dummy variables.

We create k-1=3-1=2 dummy variables, named x2.dummy and x3.dummy. Downtown is the baseline group: both dummy variables are 0 for Downtown stores.

#create x2.dummy
x2.dummy= c()   #create null vector
for(i in 1:length(data$Location)){
  if(data$Location[i]=="Street") x2.dummy[i] = 1
  else x2.dummy[i] = 0
}
x2.dummy
##  [1] 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
#create x3.dummy
x3.dummy= c()
for(i in 1:length(data$Location)){
  if(data$Location[i]=="Mall") x3.dummy[i] = 1
  else x3.dummy[i] = 0
}
x3.dummy
##  [1] 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0
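
The loops above can also be written without explicit iteration. The following is a sketch of a vectorised equivalent, cross-checked against R's own dummy coding from `model.matrix()`. It assumes Downtown is the baseline group, as in the loops; `loc` and `mm` are illustrative names.

```r
# Location column retyped from the listing in step 1
Location = rep(c("Street", "Mall", "Downtown"), each = 5)

# A logical comparison converted to 0/1 gives the same vectors as the loops
x2.dummy = as.numeric(Location == "Street")
x3.dummy = as.numeric(Location == "Mall")

# Cross-check: let R build the dummy columns, with Downtown as the reference level
loc = relevel(factor(Location), ref = "Downtown")
mm = model.matrix(~ loc)
print(head(mm))

# The automatically generated columns match the hand-made dummies
all(mm[, "locStreet"] == x2.dummy)
all(mm[, "locMall"] == x3.dummy)
```

In practice `lm()` performs this coding itself when a factor appears in the formula; building the dummies by hand mainly makes the baseline group explicit.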

4. Variable selection methods

Select independent variables using variable selection methods

# Forward selection
null=lm(y~1)
full=lm(y~x1+x2.dummy+x3.dummy)
fw.fit = step(null, scope=list(lower=null,upper=full),direction="forward")
## Start:  AIC=117.83
## y ~ 1
## 
##            Df Sum of Sq   RSS     AIC
## + x1        1   31278.1  2590  81.269
## + x2.dummy  1   14690.3 19178 111.302
## + x3.dummy  1    5230.6 28637 117.316
## <none>                  33868 117.833
## 
## Step:  AIC=81.27
## y ~ x1
## 
##            Df Sum of Sq     RSS    AIC
## + x3.dummy  1   2038.47  551.29 60.063
## + x2.dummy  1    931.21 1658.56 76.585
## <none>                  2589.77 81.269
## 
## Step:  AIC=60.06
## y ~ x1 + x3.dummy
## 
##            Df Sum of Sq    RSS    AIC
## + x2.dummy  1    84.366 466.92 59.572
## <none>                  551.29 60.063
## 
## Step:  AIC=59.57
## y ~ x1 + x3.dummy + x2.dummy
fw.fit
## 
## Call:
## lm(formula = y ~ x1 + x3.dummy + x2.dummy)
## 
## Coefficients:
## (Intercept)           x1     x3.dummy     x2.dummy  
##      21.958        0.868       22.101       -6.901
# Backward elimination
be.fit = step(full,direction="backward")
## Start:  AIC=59.57
## y ~ x1 + x2.dummy + x3.dummy
## 
##            Df Sum of Sq     RSS     AIC
## <none>                    466.9  59.572
## - x2.dummy  1      84.4   551.3  60.063
## - x3.dummy  1    1191.6  1658.6  76.585
## - x1        1   18527.4 18994.3 113.158
be.fit
## 
## Call:
## lm(formula = y ~ x1 + x2.dummy + x3.dummy)
## 
## Coefficients:
## (Intercept)           x1     x2.dummy     x3.dummy  
##      21.958        0.868       -6.901       22.101
# Stepwise regression
sw.fit = step(full, direction="both")
## Start:  AIC=59.57
## y ~ x1 + x2.dummy + x3.dummy
## 
##            Df Sum of Sq     RSS     AIC
## <none>                    466.9  59.572
## - x2.dummy  1      84.4   551.3  60.063
## - x3.dummy  1    1191.6  1658.6  76.585
## - x1        1   18527.4 18994.3 113.158
sw.fit
## 
## Call:
## lm(formula = y ~ x1 + x2.dummy + x3.dummy)
## 
## Coefficients:
## (Intercept)           x1     x2.dummy     x3.dummy  
##      21.958        0.868       -6.901       22.101
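
The AIC values printed by `step()` can be reproduced by hand. For `lm` fits, `step()` relies on `extractAIC()`, which computes n*log(RSS/n) + 2p, where n is the sample size, RSS the residual sum of squares and p the number of estimated coefficients. The sketch below checks this for the model y ~ x1 from the forward selection trace, with the data retyped from step 1; `aic_manual` and `aic_step` are illustrative names.

```r
# Lab data retyped from the listing in step 1
x1 = c(161, 99, 135, 120, 164, 221, 179, 204, 214, 101,
       231, 206, 248, 107, 205)
y  = c(157.27, 93.28, 136.81, 123.79, 153.51, 241.74, 204.54,
       206.71, 229.78, 135.22, 224.71, 195.29, 242.16, 115.21, 197.82)

fit1 = lm(y ~ x1)

n   = length(y)                 # sample size
rss = sum(residuals(fit1)^2)    # residual sum of squares
p   = length(coef(fit1))        # intercept + slope = 2 coefficients

aic_manual = n * log(rss / n) + 2 * p
aic_step   = extractAIC(fit1)[2]   # second element is the AIC used by step()

print(c(manual = aic_manual, step = aic_step))
```

A smaller AIC is better, so at each step `step()` makes the move that lowers AIC the most and stops when no move lowers it further. Note that `AIC()` uses a different additive constant, so its values differ from those in the `step()` trace even though the ranking of models is the same.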

5. Obtain the fitted equation and confidence intervals (CI) for the regression parameters (beta)

summary(sw.fit)
## 
## Call:
## lm(formula = y ~ x1 + x2.dummy + x3.dummy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.422  -2.989   2.243   4.572   5.852 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 21.95824    8.78193   2.500 0.029486 *  
## x1           0.86800    0.04155  20.892 3.34e-10 ***
## x2.dummy    -6.90102    4.89503  -1.410 0.186240    
## x3.dummy    22.10084    4.17123   5.298 0.000253 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.515 on 11 degrees of freedom
## Multiple R-squared:  0.9862, Adjusted R-squared:  0.9825 
## F-statistic: 262.3 on 3 and 11 DF,  p-value: 1.641e-10
anova(sw.fit)
## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## x1         1 31278.1 31278.1 736.863 1.976e-11 ***
## x2.dummy   1   931.2   931.2  21.938 0.0006675 ***
## x3.dummy   1  1191.6  1191.6  28.073 0.0002530 ***
## Residuals 11   466.9    42.4                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(sw.fit) # 95% CIs; the generic confint() dispatches to the method for lm objects
##                   2.5 %     97.5 %
## (Intercept)   2.6293467 41.2871246
## x1            0.7765583  0.9594473
## x2.dummy    -17.6749031  3.8728631
## x3.dummy     12.9200368 31.2816515
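
Reading the estimates off the coefficient table gives the fitted equation, and setting the dummy variables to 0 or 1 gives one equation per location. In the sketch below, x_1 is NumberOfHousehold and x_2, x_3 stand for x2.dummy and x3.dummy; coefficients are rounded to three decimals. The last line shows how the interval for the x1 coefficient arises, assuming the usual t-based interval with 11 residual degrees of freedom.

```latex
\begin{aligned}
\hat{y} &= 21.958 + 0.868\,x_1 - 6.901\,x_2 + 22.101\,x_3 \\[4pt]
\text{Downtown } (x_2 = 0,\ x_3 = 0):\quad \hat{y} &= 21.958 + 0.868\,x_1 \\
\text{Street } (x_2 = 1,\ x_3 = 0):\quad \hat{y} &= (21.958 - 6.901) + 0.868\,x_1 = 15.057 + 0.868\,x_1 \\
\text{Mall } (x_2 = 0,\ x_3 = 1):\quad \hat{y} &= (21.958 + 22.101) + 0.868\,x_1 = 44.059 + 0.868\,x_1 \\[4pt]
b_1 \pm t_{0.975,\,11}\,\mathrm{se}(b_1) &= 0.868 \pm 2.201 \times 0.04155 \approx (0.777,\ 0.959)
\end{aligned}
```

The three lines are parallel (same slope 0.868) and differ only in their intercepts. The interval for x2.dummy contains 0, in line with its p-value of 0.186, so the Street and Downtown intercepts are not significantly different at the 0.05 level.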

6. Plot the fitted values:

new = data.frame(x1=x1,x2.dummy=x2.dummy,x3.dummy=x3.dummy)
yhat = predict(sw.fit,newdata=new) #compute fitted y
yhat 
##        1        2        3        4        5        6        7        8 
## 154.8057 100.9895 132.2376 119.2176 157.4097 235.8877 199.4316 221.1317 
##        9       10       11       12       13       14       15 
## 229.8117 131.7274 222.4669 200.7668 237.2229 114.8345 199.8988
plot(x1,y,pch=1,xlab="Number Of Household",ylab="Sale price")
points(x1,yhat,type="p",pch=20)
legend("topleft",c("observed y","predicted y"),pch=c(1,20))
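
Because the model has one slope and a separate intercept for each location, the fitted values lie on three parallel lines. The following sketch draws those lines on the scatter plot. It refits the same model on data retyped from step 1; `locs`, `intercepts` and `slope` are illustrative names.

```r
# Lab data retyped from the listing in step 1
data = data.frame(
  NumberOfHousehold = c(161, 99, 135, 120, 164, 221, 179, 204, 214, 101,
                        231, 206, 248, 107, 205),
  Location = rep(c("Street", "Mall", "Downtown"), each = 5),
  SalesPrice = c(157.27, 93.28, 136.81, 123.79, 153.51, 241.74, 204.54,
                 206.71, 229.78, 135.22, 224.71, 195.29, 242.16, 115.21, 197.82)
)
data$x2.dummy = as.numeric(data$Location == "Street")
data$x3.dummy = as.numeric(data$Location == "Mall")

fit = lm(SalesPrice ~ NumberOfHousehold + x2.dummy + x3.dummy, data = data)
b = coef(fit)

# One intercept per location; Downtown is the baseline group
locs = c("Downtown", "Street", "Mall")
intercepts = c(Downtown = b[["(Intercept)"]],
               Street   = b[["(Intercept)"]] + b[["x2.dummy"]],
               Mall     = b[["(Intercept)"]] + b[["x3.dummy"]])
slope = b[["NumberOfHousehold"]]

# Observed points (symbol by location) with the three parallel fitted lines
plot(data$NumberOfHousehold, data$SalesPrice,
     pch = match(data$Location, locs),
     xlab = "Number Of Household", ylab = "Sale price")
for (i in seq_along(locs)) {
  abline(a = intercepts[[i]], b = slope, lty = i)
}
legend("topleft", legend = locs, pch = seq_along(locs), lty = seq_along(locs))
```

Each point's fitted value is the height of its own location's line at that store's number of households.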

7. Analyse the residuals:

par(mfrow=c(2,2))  # set the plot layout to 2 rows and 2 columns
plot(sw.fit)       # diagnostic plots for the model summarised in step 5 (fw.fit is the same model)
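
The diagnostic plots can be supplemented with a few numeric checks. The sketch below shows one possible set: a Shapiro-Wilk test for normality of the residuals and a look at the standardised residuals to flag unusual stores. It refits the same model on data retyped from step 1; the choice of these particular checks is an assumption, not part of the lab requirements.

```r
# Lab data retyped from the listing in step 1
data = data.frame(
  NumberOfHousehold = c(161, 99, 135, 120, 164, 221, 179, 204, 214, 101,
                        231, 206, 248, 107, 205),
  Location = rep(c("Street", "Mall", "Downtown"), each = 5),
  SalesPrice = c(157.27, 93.28, 136.81, 123.79, 153.51, 241.74, 204.54,
                 206.71, 229.78, 135.22, 224.71, 195.29, 242.16, 115.21, 197.82)
)
data$x2.dummy = as.numeric(data$Location == "Street")
data$x3.dummy = as.numeric(data$Location == "Mall")

fit = lm(SalesPrice ~ NumberOfHousehold + x2.dummy + x3.dummy, data = data)

res     = residuals(fit)   # raw residuals, observed minus fitted
std_res = rstandard(fit)   # standardised residuals

# Normality check: a small p-value would cast doubt on normally distributed errors
sw_test = shapiro.test(res)
print(sw_test)

# Store with the largest absolute residual, worth a closer look in the plots
worst = unname(which.max(abs(res)))
print(data[worst, c("NumberOfHousehold", "Location", "SalesPrice")])
print(round(std_res[worst], 2))
```

With only 15 observations, formal tests have little power, so the plots remain the main tool and the numbers are best read as a supplement.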

Assignment Lab 2

You must submit the following on Mango (see the deadline there):

  1. an R file with your code, and

  2. a handwritten answer sheet.

A math teacher wants to investigate how a response variable Y relates to three independent variables: the method of instruction, pretest performance, and student emotional intelligence (EQ). The data is on Mango Canvas.

Use R language to:

  1. Explore the data using descriptive statistics (scatter plots of Y against pretest and of Y against EQ, and a boxplot of Y by method) and write a summary of each plot.

  2. Method is a qualitative independent variable with k=4 groups. Transform Method into dummy variables.

  3. Select independent variables using variable selection methods (forward selection, backward elimination, and stepwise regression).

  4. Write down the fitted equation for predicting Y using the best set of independent variables obtained from stepwise regression.

  5. Test the overall fit of the model at significance level 0.05 (write down the 4 steps of hypothesis testing).

  6. Find the 95% confidence intervals of the regression parameters.

  7. Write down the fitted (predicted) equation for each method: A, B, C, and D.