1. Introduction

Housing prices have been analyzed in great detail by researchers over the past decades. Many factors can affect the value of a house, which makes the research complex. The webpage “100+ Interesting Data Sets for Statistics” introduces Ecdat, an R package that contains a large collection of econometric data sets. In this project, we study the sales prices of houses in the City of Windsor using the publicly available Housing data set from that package. A summary of the package is available at https://cran.r-project.org/web/packages/Ecdat/Ecdat.pdf.

In this study, we apply a fractional factorial design (FFD) to the Housing dataset in Ecdat. The factors studied include two 2-level factors and two 3-level factors that may influence the price.

2. Setting

System Under Test

In the first step, the data is loaded from the Ecdat package. The Housing data frame is imported and assigned to the data frame named data:

library("Ecdat")
## Loading required package: Ecfun
## 
## Attaching package: 'Ecfun'
## The following object is masked from 'package:base':
## 
##     sign
## 
## Attaching package: 'Ecdat'
## The following object is masked from 'package:datasets':
## 
##     Orange
data<- Housing

The structure, head, and tail of the data frame are shown below for an initial look at the data.

str(data)
## 'data.frame':    546 obs. of  12 variables:
##  $ price   : num  42000 38500 49500 60500 61000 66000 66000 69000 83800 88500 ...
##  $ lotsize : num  5850 4000 3060 6650 6360 4160 3880 4160 4800 5500 ...
##  $ bedrooms: num  3 2 3 3 2 3 3 3 3 3 ...
##  $ bathrms : num  1 1 1 1 1 1 2 1 1 2 ...
##  $ stories : num  2 1 1 2 1 1 2 3 1 4 ...
##  $ driveway: Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ recroom : Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 2 2 ...
##  $ fullbase: Factor w/ 2 levels "no","yes": 2 1 1 1 1 2 2 1 2 1 ...
##  $ gashw   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ airco   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 2 ...
##  $ garagepl: num  1 0 0 0 0 0 2 0 0 1 ...
##  $ prefarea: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
head(data)
##   price lotsize bedrooms bathrms stories driveway recroom fullbase gashw
## 1 42000    5850        3       1       2      yes      no      yes    no
## 2 38500    4000        2       1       1      yes      no       no    no
## 3 49500    3060        3       1       1      yes      no       no    no
## 4 60500    6650        3       1       2      yes     yes       no    no
## 5 61000    6360        2       1       1      yes      no       no    no
## 6 66000    4160        3       1       1      yes     yes      yes    no
##   airco garagepl prefarea
## 1    no        1       no
## 2    no        0       no
## 3    no        0       no
## 4    no        0       no
## 5    no        0       no
## 6   yes        0       no
tail(data)
##      price lotsize bedrooms bathrms stories driveway recroom fullbase
## 541  85000    6525        3       2       4      yes      no       no
## 542  91500    4800        3       2       4      yes     yes       no
## 543  94000    6000        3       2       4      yes      no       no
## 544 103000    6000        3       2       4      yes     yes       no
## 545 105000    6000        3       2       2      yes     yes       no
## 546 105000    6000        3       1       2      yes      no       no
##     gashw airco garagepl prefarea
## 541    no    no        1       no
## 542    no   yes        0       no
## 543    no   yes        0       no
## 544    no   yes        1       no
## 545    no   yes        1       no
## 546    no   yes        1       no

In this study, we perform a statistical analysis of the sales prices of houses in the City of Windsor. The main question is whether the price of a house is significantly affected by the four factors we consider. The sales price of the house, named price, is our response variable. Many factors may affect the price. The first factor we select is bedrooms, and we categorize the number of bedrooms into three levels (“1~2”, “3~4”, “5~6”). The second factor is the number of garage places, garagepl, which we regard as the second most important factor determining the price of a house. We categorize the number of garage places into three levels as well (“1”, “2”, “3”). Note that, as the recoded head(data) output below shows, the recoding assigns houses with no garage place (garagepl = 0) to level “3”, and the bedrooms conditions place only 3-bedroom houses in “3~4”, so 4-bedroom houses fall into “5~6”. The third and fourth factors are the binary variables prefarea and airco. prefarea captures the location of the house (whether it is in a preferred area), and airco captures whether an air conditioner is installed. A summary of the variables is listed below.

data<-na.omit(data)
summary(data)
##      price           lotsize         bedrooms        bathrms     
##  Min.   : 25000   Min.   : 1650   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 49125   1st Qu.: 3600   1st Qu.:2.000   1st Qu.:1.000  
##  Median : 62000   Median : 4600   Median :3.000   Median :1.000  
##  Mean   : 68122   Mean   : 5150   Mean   :2.965   Mean   :1.286  
##  3rd Qu.: 82000   3rd Qu.: 6360   3rd Qu.:3.000   3rd Qu.:2.000  
##  Max.   :190000   Max.   :16200   Max.   :6.000   Max.   :4.000  
##     stories      driveway  recroom   fullbase  gashw     airco    
##  Min.   :1.000   no : 77   no :449   no :355   no :521   no :373  
##  1st Qu.:1.000   yes:469   yes: 97   yes:191   yes: 25   yes:173  
##  Median :2.000                                                    
##  Mean   :1.808                                                    
##  3rd Qu.:2.000                                                    
##  Max.   :4.000                                                    
##     garagepl      prefarea 
##  Min.   :0.0000   no :418  
##  1st Qu.:0.0000   yes:128  
##  Median :0.0000            
##  Mean   :0.6923            
##  3rd Qu.:1.0000            
##  Max.   :3.0000
data$bedrooms[data$bedrooms <= 2] = "1~2"
data$bedrooms[(data$bedrooms > 2) & (data$bedrooms <= 3) & (data$bedrooms != "1~2")] = "3~4"
data$bedrooms[(data$bedrooms != "1~2") & (data$bedrooms != "3~4")] = "5~6"
data$bedrooms<-as.factor(data$bedrooms)
data$garagepl[data$garagepl == 1] = "1"
data$garagepl[(data$garagepl == 2) & (data$garagepl != "1")] = "2"
data$garagepl[(data$garagepl != "1") & (data$garagepl != "2")] = "3"
data$garagepl<-as.factor(data$garagepl)
levels(data$prefarea)<-c(-1,1)
levels(data$airco)<-c(-1,1)
head(data)
##   price lotsize bedrooms bathrms stories driveway recroom fullbase gashw
## 1 42000    5850      3~4       1       2      yes      no      yes    no
## 2 38500    4000      1~2       1       1      yes      no       no    no
## 3 49500    3060      3~4       1       1      yes      no       no    no
## 4 60500    6650      3~4       1       2      yes     yes       no    no
## 5 61000    6360      1~2       1       1      yes      no       no    no
## 6 66000    4160      3~4       1       1      yes     yes      yes    no
##   airco garagepl prefarea
## 1    -1        1       -1
## 2    -1        3       -1
## 3    -1        3       -1
## 4    -1        3       -1
## 5    -1        3       -1
## 6     1        3       -1
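
For comparison, the same kind of recoding can be sketched with base R's cut(), assuming the intended bins are 1-2, 3-4, and 5-6 bedrooms. The vectors beds and garage are made-up illustration data, not taken from Housing, and this is not the recoding used for the results in this report.

```r
# Hypothetical bedroom counts (illustration only, not the Housing data)
beds <- c(1, 2, 3, 4, 5, 6, 3, 2)

# cut() bins a numeric vector into labelled intervals:
# (0,2] -> "1~2", (2,4] -> "3~4", (4,6] -> "5~6"
beds_f <- cut(beds, breaks = c(0, 2, 4, 6), labels = c("1~2", "3~4", "5~6"))
table(beds_f)

# Hypothetical garage counts; factor() keeps 0 as its own level
garage <- c(0, 1, 2, 3, 0)
garage_f <- factor(garage, levels = 0:3)
table(garage_f)
```

Binning on the numeric column before any character labels are assigned avoids comparisons between strings and numbers.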

3. Experimental Design

The experiment examines four factors that may affect the price of houses, using both statistical and visual analysis. We do not use all the variables in the dataset, but this does not affect the purpose of the experiment, which is to see whether the four selected factors have an impact on the value of houses.

What is the rationale for this design?

Using both a full factorial design and a fractional factorial design in this experiment gives us a better understanding of the cost savings that a fractional factorial design offers. In practice, researchers often face the problem of designing an efficient experiment. With many factors, the number of runs in a full factorial design grows exponentially. A fractional factorial design addresses this problem by using only a fraction of the runs, at the price of a limited resolution.
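
A minimal sketch of the run counts involved, assuming the six two-level factors and the 8-run fraction that appear later in this report. The variable names are chosen for illustration.

```r
# Number of runs for k two-level factors:
# full factorial = 2^k, fractional factorial with p generators = 2^(k - p)
k <- 6   # six two-level (pseudo-)factors
p <- 3   # three generators, giving a one-eighth fraction

full_runs <- 2^k
frac_runs <- 2^(k - p)

c(full = full_runs, fractional = frac_runs)
```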

Is there any randomization, replication, or blocking used in the experiment?

Randomization, replication, and blocking are important in the design of experiments. Using these techniques helps to reduce the bias caused by nuisance factors and to increase precision.

Randomization

Randomization should be used in an experiment in three ways: random selection, random assignment, and random execution.

The dataset does not document how the observations were selected, so we have no direct information about randomization. We assume that it meets the assumptions required for a truly random design. In addition, by randomly sub-setting and ordering the data in our analysis, we further support the randomization requirement.
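
Random sub-setting and ordering can be sketched in base R with sample(). The data frame candidates and the seed are hypothetical and serve only to make the example self-contained.

```r
set.seed(42)  # arbitrary seed, chosen only to make the sketch reproducible

# Hypothetical pool of candidate rows for one factor combination
candidates <- data.frame(id = 1:5,
                         price = c(50000, 62000, 58000, 71000, 66000))

# Random selection: draw one row at random from the pool
picked <- candidates[sample(nrow(candidates), 1), ]

# Random execution order: shuffle the labels of 8 runs
run_order <- sample(1:8)
run_order
```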

Replication

“The repeated measures design (also known as a within-subjects design) uses the same subjects with every condition of the research, including the control, while replication reflects sources of variability both between runs and (potentially) within runs.”[4]

Repeated measures involve measuring the same cases multiple times, while replication involves running the same study on different subjects under identical conditions. In this dataset, we do not have any replication or repeated measures.

Blocking

“In the statistical theory of the design of experiments, blocking is the arranging of experimental units in groups (blocks) that are similar to one another.”[5]

A nuisance factor may have an effect on the experiment, but it is not a factor of primary interest to the researcher. In this study, blocking is not included in the experiment because we do not find nuisance factors in the dataset that may affect the results.

4. Statistical Analysis

Descriptive Analysis of Data

The data set used in this project is cross-sectional data from 1987. There are 546 observations in the dataset. From the summary statistics, house prices range from $25,000 to $190,000, and the mean price is $68,122. The price distribution is shown in the histogram below.

hist(data$price, breaks=20, main= "Housing Price", xlab="Price")

Bar plots of the level frequencies of the four input variables:

par(mfrow=c(2,2))
barplot(table(data$bedrooms), xlab="Bedrooms", ylab="Frequency", main="Histogram of Bedrooms")
barplot(table(data$garagepl), xlab="Garage Place", ylab="Frequency", main="Histogram of Garage Place")
barplot(table(data$prefarea), xlab="Preferred Area", ylab="Frequency", main="Histogram of Preferred Area")
barplot(table(data$airco), xlab="Air-Conditions", ylab="Frequency", main="Histogram of Air-Conditions")

The boxplots are shown below to indicate the price differences among different groups.

# boxplots with group means overlaid as red diamonds
par(mfrow=c(2,2))
boxplot(data$price~data$bedrooms, xlab="No. of Bedrooms", ylab="House Prices")
means1 <- by(data$price,data$bedrooms, mean)
points(1:3, means1,pch = 23, cex = 2, bg = "red")
text(1:3 - 0.1, means1,labels = format(means1, format = "f", digits = 2),pos = 3, cex = 0.9, col = "red")
boxplot(data$price~data$garagepl, xlab="No. of Garage Places", ylab="House Prices")
means2 <- by(data$price,data$garagepl, mean)
points(1:3, means2,pch = 23, cex = 2, bg = "red")
text(1:3 - 0.1, means2,labels = format(means2, format = "f", digits = 2),pos = 3, cex = 0.9, col = "red")
boxplot(data$price~data$prefarea, xlab="Preferred Community", ylab="House Prices")
means3 <- by(data$price,data$prefarea, mean)
points(1:2, means3,pch = 23, cex = 2, bg = "red")
text(1:2 - 0.1, means3,labels = format(means3, format = "f", digits = 2),pos = 3, cex = 0.9, col = "red")
boxplot(data$price~data$airco, xlab="Air Conditioner", ylab="House Prices")
means4 <- by(data$price,data$airco, mean)
points(1:2, means4,pch = 23, cex = 2, bg = "red")
text(1:2 - 0.1, means4,labels = format(means4, format = "f", digits = 2),pos = 3, cex = 0.9, col = "red")

Testing

Fractional Factorial Design

In this project, a 2^k fractional factorial design is used. Our experiment contains 4 factors, two with 2 levels and two with 3 levels. Each of the two 3-level factors is decomposed into two 2-level factors. The process is shown as follows:

require(knitr)
## Loading required package: knitr
bd1 <- c(-1, 1, -1, 1)
bd2 <- c(-1, -1, 1, 1)
bedrooms <- c("1", "2","2", "3")
bd_factor <- as.data.frame((cbind(bd1,bd2,bedrooms)))
kable(bd_factor, align = 'c')
##  bd1  bd2  bedrooms
##  -1   -1      1
##   1   -1      2
##  -1    1      2
##   1    1      3
gp1 <- c(-1, 1, -1, 1)
gp2 <- c(-1, -1, 1, 1)
garagepl <- c("1", "2","2", "3")
gp_factor <- as.data.frame((cbind(gp1,gp2,garagepl)))
kable(gp_factor, align = 'c')
##  gp1  gp2  garagepl
##  -1   -1      1
##   1   -1      2
##  -1    1      2
##   1    1      3
library(FrF2)
## Loading required package: DoE.base
## Loading required package: grid
## Loading required package: conf.design
## 
## Attaching package: 'DoE.base'
## The following objects are masked from 'package:stats':
## 
##     aov, lm
## The following object is masked from 'package:graphics':
## 
##     plot.design
## The following object is masked from 'package:base':
## 
##     lengths
design<-FrF2(8,6,estible = formula("~bd1+bd2+gp1+gp2+prefarea+airco:(bd1+bd2+gp1+gp2+prefarea+airco)"), factor.names = c('bd1','bd2','gp1','gp2','prefarea','airco'),res5=T,claear=F)
design
##   bd1 bd2 gp1 gp2 prefarea airco
## 1   1   1   1   1        1     1
## 2   1  -1  -1  -1       -1     1
## 3  -1  -1   1   1       -1    -1
## 4  -1   1  -1  -1        1    -1
## 5   1  -1   1  -1        1    -1
## 6  -1   1   1  -1       -1     1
## 7   1   1  -1   1       -1    -1
## 8  -1  -1  -1   1        1     1
## class=design, type= FrF2
set1<- subset(data,bedrooms=="1~2"&prefarea=="-1"&airco=="-1"&garagepl=="3")
set2<- subset(data,bedrooms=="3~4"&prefarea=="1"&airco=="-1"&garagepl=="1")
set3<- subset(data,bedrooms=="5~6"&prefarea=="1"&airco=="1"&garagepl=="3")
set4<- subset(data,bedrooms=="1~2"&prefarea=="1"&airco=="1"&garagepl=="2")
set5<- subset(data,bedrooms=="3~4"&prefarea=="-1"&airco=="1"&garagepl=="2")
set6<- subset(data,bedrooms=="5~6"&prefarea=="-1"&airco=="1"&garagepl=="1")
set7<- subset(data,bedrooms=="3~4"&prefarea=="-1"&airco=="-1"&garagepl=="2")
set8<- subset(data,bedrooms=="1~2"&prefarea=="1"&airco=="-1"&garagepl=="2")
set.seed(1583)
run1 <- set1[sample(1:nrow(set1), 1), ]
run2 <- set2[sample(1:nrow(set2), 1), ]
run3 <- set3[sample(1:nrow(set3), 1), ]
run4 <- set4[sample(1:nrow(set4), 1), ]
run5 <- set5[sample(1:nrow(set5), 1), ]
run6 <- set6[sample(1:nrow(set6), 1), ]
run7 <- set7[sample(1:nrow(set7), 1), ]
run8 <- set8[sample(1:nrow(set8), 1), ]
response <- c(run1$price, run2$price, run3$price, run4$price, run5$price, run6$price, run7$price, run8$price)
response
## [1]  35000  75000 174500  68000 128000 105000  53900  91700
plan <- add.response(design, response)
summary(plan)
## Call:
## FrF2(8, 6, estible = formula("~bd1+bd2+gp1+gp2+prefarea+airco:(bd1+bd2+gp1+gp2+prefarea+airco)"), 
##     factor.names = c("bd1", "bd2", "gp1", "gp2", "prefarea", 
##         "airco"), res5 = T, claear = F)
## 
## Experimental design of type  FrF2 
## 8  runs
## 
## Factor settings (scale ends):
##   bd1 bd2 gp1 gp2 prefarea airco
## 1  -1  -1  -1  -1       -1    -1
## 2   1   1   1   1        1     1
## 
## Responses:
## [1] response
## 
## Design generating information:
## $legend
## [1] A=bd1      B=bd2      C=gp1      D=gp2      E=prefarea F=airco   
## 
## $generators
## [1] D=AB E=AC F=BC
## 
## 
## Alias structure:
## $main
## [1] A=BD=CE B=AD=CF C=AE=BF D=AB=EF E=AC=DF F=BC=DE
## 
## $fi2
## [1] AF=BE=CD
## 
## 
## The design itself:
##   bd1 bd2 gp1 gp2 prefarea airco response
## 1   1   1   1   1        1     1    35000
## 2   1  -1  -1  -1       -1     1    75000
## 3  -1  -1   1   1       -1    -1   174500
## 4  -1   1  -1  -1        1    -1    68000
## 5   1  -1   1  -1        1    -1   128000
## 6  -1   1   1  -1       -1     1   105000
## 7   1   1  -1   1       -1    -1    53900
## 8  -1  -1  -1   1        1     1    91700
## class=design, type= FrF2

The results shown above list the six 2-level factors to be analyzed in this design.
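
The mapping from a pair of 2-level pseudo-factors back to one 3-level factor, as tabulated for bd1 and bd2 above, can be sketched as follows. The helper name pseudo_to_level is hypothetical and not part of FrF2 or any other package.

```r
# Map a pair of 2-level pseudo-factors (coded -1/+1) to a 3-level factor:
# both low -> level 1, exactly one high -> level 2, both high -> level 3
pseudo_to_level <- function(f1, f2) {
  1 + (f1 == 1) + (f2 == 1)
}

# All four combinations of the two pseudo-factors
grid <- expand.grid(bd1 = c(-1, 1), bd2 = c(-1, 1))
grid$bedrooms <- pseudo_to_level(grid$bd1, grid$bd2)
grid
```

Because two of the four combinations map to the middle level, that level appears twice as often as the other two in a balanced design.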

The fractional factorial design used in this experiment is a 2^(6-3) design of resolution III, which has 8 runs. In such a design, main effects are not aliased with other main effects, but they are aliased with two-factor interactions. The alias structure in the output above shows which effects are confounded.
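
The alias structure can be checked by hand in base R. The sketch below rebuilds an 8-run design from the generators reported above (D=AB, E=AC, F=BC) in standard order rather than in the randomized run order printed by FrF2, and uses the letters A to F from the legend.

```r
# Base 2^3 full factorial in A, B, C (coded -1/+1)
base <- expand.grid(A = c(-1, 1), B = c(-1, 1), C = c(-1, 1))

# Generators reported by FrF2: D = AB, E = AC, F = BC
d <- transform(base, D = A * B, E = A * C, F = B * C)

# Main effect A is aliased with BD and CE: the columns are identical
all(d$A == d$B * d$D)   # TRUE
all(d$A == d$C * d$E)   # TRUE

# Main-effect columns are mutually orthogonal:
# t(X) %*% X has 8 on the diagonal and 0 elsewhere
X <- as.matrix(d)
crossprod(X)
```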

Main Effects

MEPlot(plan, abbrev = 5, cex.xax = 1.6, cex.main = 2)

Interaction Effects

IAPlot(plan, abbrev = 5, show.alias = TRUE, lwd = 2, cex = 2, cex.xax = 1.2, cex.lab = 1.5)

ANOVA Analysis

In this part, the main effects are estimated using lm and ANOVA. We analyze both the full factorial data and the fractional factorial design to compare the results.

The test below is the initial analysis of variance (ANOVA) performed on the full factorial set of data.

model1=lm(data$price~ data$bedrooms+data$garagepl+data$prefarea+data$airco) 
anova(model1) 
## Analysis of Variance Table
## 
## Response: data$price
##                Df     Sum Sq    Mean Sq F value    Pr(>F)    
## data$bedrooms   2 5.9217e+10 2.9609e+10  72.118 < 2.2e-16 ***
## data$garagepl   2 4.0097e+10 2.0049e+10  48.833 < 2.2e-16 ***
## data$prefarea   1 2.6403e+10 2.6403e+10  64.309 6.656e-15 ***
## data$airco      1 4.1597e+10 4.1597e+10 101.320 < 2.2e-16 ***
## Residuals     539 2.2129e+11 4.1055e+08                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA results show that all four factors have statistically significant effects on the value of a house in the full data set. The observed differences are therefore unlikely to be attributable to random variation alone, so we reject the null hypothesis that the four main factors have no effect on the value of a house.
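
As a check on how the table is read, the F statistics and p-values can be recomputed from the mean squares printed above. This is a sketch with the values typed in by hand, so small rounding differences from the table are expected.

```r
# Mean squares and degrees of freedom copied from the ANOVA table
ms <- c(bedrooms = 2.9609e10, garagepl = 2.0049e10,
        prefarea = 2.6403e10, airco = 4.1597e10)
df <- c(bedrooms = 2, garagepl = 2, prefarea = 1, airco = 1)
ms_resid <- 4.1055e08
df_resid <- 539

# F statistic = factor mean square / residual mean square
f_stat <- ms / ms_resid

# Upper-tail p-value from the F distribution
p_val <- pf(f_stat, df1 = df, df2 = df_resid, lower.tail = FALSE)

round(f_stat, 2)
p_val
```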

Model Creation

summary(lm(plan))
## Number of observations used: 8 
## Formula:
## response ~ (bd1 + bd2 + gp1 + gp2 + prefarea + airco)^2
## 
## Call:
## lm.default(formula = fo, data = model.frame(fo, data = formula))
## 
## Residuals:
## ALL 8 residuals are 0: no residual degrees of freedom!
## 
## Coefficients: (14 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    91388         NA      NA       NA
## bd11          -18413         NA      NA       NA
## bd21          -25913         NA      NA       NA
## gp11           19238         NA      NA       NA
## gp21           -2612         NA      NA       NA
## prefarea1     -10712         NA      NA       NA
## airco1        -14712         NA      NA       NA
## bd11:airco1    -3262         NA      NA       NA
## 
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:    NaN 
## F-statistic:   NaN on 7 and 0 DF,  p-value: NA
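
The coefficients above can be reproduced by hand, which also shows why no standard errors are available. The sketch below copies the design and responses from the summary(plan) output and assumes the -1/+1 coding shown there.

```r
# Design matrix and responses copied from the summary(plan) output
X <- data.frame(
  bd1      = c( 1,  1, -1, -1,  1, -1,  1, -1),
  bd2      = c( 1, -1, -1,  1, -1,  1,  1, -1),
  gp1      = c( 1, -1,  1, -1,  1,  1, -1, -1),
  gp2      = c( 1, -1,  1, -1, -1, -1,  1,  1),
  prefarea = c( 1, -1, -1,  1,  1, -1, -1,  1),
  airco    = c( 1,  1, -1, -1, -1,  1, -1,  1)
)
y <- c(35000, 75000, 174500, 68000, 128000, 105000, 53900, 91700)
n <- length(y)

# With -1/+1 coding and an orthogonal design, each coefficient is
# sum(x * y) / n, and the intercept is the mean response
coefs <- c(intercept = mean(y),
           sapply(X, function(x) sum(x * y) / n),
           bd1_airco = sum(X$bd1 * X$airco * y) / n)
round(coefs, 1)

# 8 runs and 8 estimated terms leave no residual degrees of freedom
n - length(coefs)
```

With zero residual degrees of freedom the model fits the 8 responses exactly, so there is no error estimate from which to compute t or F statistics.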

Diagnostics/Model Adequacy Checking

To check the adequacy of the ANOVA model, we draw a normal quantile-quantile (Q-Q) plot of the residuals to assess whether they follow a normal distribution, together with a plot of the residuals against the fitted values.

par(mfrow=c(1,2))
qqnorm(residuals(model1)) 
qqline(residuals(model1))
plot(fitted(model1),residuals(model1))

If the residuals are normally distributed, the points in the Q-Q plot should fall along the straight reference line. In our study, the residuals may not be strictly normally distributed.
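
A numerical complement to the visual check could look like the sketch below. It uses synthetic right-skewed values rather than the residuals of model1, and the Shapiro-Wilk test is an addition for illustration that is not part of the analysis in this report.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Synthetic right-skewed "residuals" (illustration only)
res <- rexp(200) - 1

# Q-Q coordinates without drawing a plot
qq <- qqnorm(res, plot.it = FALSE)

# Correlation between theoretical and sample quantiles;
# values near 1 indicate points close to a straight line
cor(qq$x, qq$y)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(res)
sw$p.value
```

To apply the same check to the fitted model, res would be replaced by residuals(model1).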

5. Conclusions

In this project, we perform a statistical analysis of the sales prices of houses in the City of Windsor. The main question is whether the price of a house is significantly affected by the four factors we consider. In the full factorial analysis, ‘bedrooms’, ‘garagepl’, ‘prefarea’, and ‘airco’ are all statistically significant. In our fractional factorial design, the significance of the factors cannot be established: the 8-run model leaves no residual degrees of freedom, and the main effects are aliased with two-factor interactions. The aliasing problem may be addressed by increasing the resolution of the design, which requires more runs.

6. References

[1] Anglin, P.M. and R. Gencay (1996), “Semiparametric estimation of a hedonic price function”, Journal of Applied Econometrics, 11(6), 633-648.

[2] CRAN (2015), Package ‘Ecdat’, accessed 11-02-2016, https://cran.r-project.org/web/packages/Ecdat/Ecdat.pdf

[3] Montgomery, D. (2013), Design and Analysis of Experiments, 8th Edition, Wiley and Sons, 752p.

[4] Wikipedia, “Replication (statistics)”, https://en.wikipedia.org/wiki/Replication_(statistics)

[5] Wikipedia, “Blocking (statistics)”, https://en.wikipedia.org/wiki/Blocking_(statistics)