Property values have been studied in great detail over the past decades. Many factors can affect the value of a house, which makes the research complex. The webpage 100+ Interesting Data Sets for Statistics introduces Ecdat, an R package that contains a large collection of econometric data sets. In this project, we study the sales prices of houses in the City of Windsor. The data used in this recipe was found through the “100+ interesting data sets” webpage and is publicly available in the R package Ecdat. A summary of this package is available at https://cran.r-project.org/web/packages/Ecdat/Ecdat.pdf.
In this study, we apply a fractional factorial design (FFD) to the Housing dataset in Ecdat. The variables studied include two 2-level factors and two 3-level factors that may influence the price.
In the first step, the data is loaded from the Ecdat package. The Housing data frame is imported and assigned to the data frame data:
library("Ecdat")
## Loading required package: Ecfun
##
## Attaching package: 'Ecfun'
## The following object is masked from 'package:base':
##
## sign
##
## Attaching package: 'Ecdat'
## The following object is masked from 'package:datasets':
##
## Orange
data<- Housing
The structure, head, and tail of the data frame are shown below for an initial look at the data.
str(data)
## 'data.frame': 546 obs. of 12 variables:
## $ price : num 42000 38500 49500 60500 61000 66000 66000 69000 83800 88500 ...
## $ lotsize : num 5850 4000 3060 6650 6360 4160 3880 4160 4800 5500 ...
## $ bedrooms: num 3 2 3 3 2 3 3 3 3 3 ...
## $ bathrms : num 1 1 1 1 1 1 2 1 1 2 ...
## $ stories : num 2 1 1 2 1 1 2 3 1 4 ...
## $ driveway: Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ recroom : Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 2 2 ...
## $ fullbase: Factor w/ 2 levels "no","yes": 2 1 1 1 1 2 2 1 2 1 ...
## $ gashw : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ airco : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 2 ...
## $ garagepl: num 1 0 0 0 0 0 2 0 0 1 ...
## $ prefarea: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
head(data)
## price lotsize bedrooms bathrms stories driveway recroom fullbase gashw
## 1 42000 5850 3 1 2 yes no yes no
## 2 38500 4000 2 1 1 yes no no no
## 3 49500 3060 3 1 1 yes no no no
## 4 60500 6650 3 1 2 yes yes no no
## 5 61000 6360 2 1 1 yes no no no
## 6 66000 4160 3 1 1 yes yes yes no
## airco garagepl prefarea
## 1 no 1 no
## 2 no 0 no
## 3 no 0 no
## 4 no 0 no
## 5 no 0 no
## 6 yes 0 no
tail(data)
## price lotsize bedrooms bathrms stories driveway recroom fullbase
## 541 85000 6525 3 2 4 yes no no
## 542 91500 4800 3 2 4 yes yes no
## 543 94000 6000 3 2 4 yes no no
## 544 103000 6000 3 2 4 yes yes no
## 545 105000 6000 3 2 2 yes yes no
## 546 105000 6000 3 1 2 yes no no
## gashw airco garagepl prefarea
## 541 no no 1 no
## 542 no yes 0 no
## 543 no yes 0 no
## 544 no yes 1 no
## 545 no yes 1 no
## 546 no yes 1 no
In this study, we perform a statistical analysis of the sales prices of houses in the City of Windsor. The main question is whether the price of a house is significantly affected by the four factors we consider. The sales price of the house, named price, is our response variable. Many factors may affect the price. The first factor we select is bedrooms, and we categorize the number of bedrooms into three levels (“1~2”, “3~4”, “5~6”). The second factor is the number of garage places, garagepl, which we consider the second most important factor in determining the price of a house. We also categorize the number of garage places into three levels (“1”, “2”, “3”). The third and fourth factors are binary variables, prefarea and airco. prefarea indicates whether the house is located in a preferred area, and airco indicates whether the house has air conditioning; both may affect the value of the house. A summary of the variables is listed below.
data<-na.omit(data)
summary(data)
## price lotsize bedrooms bathrms
## Min. : 25000 Min. : 1650 Min. :1.000 Min. :1.000
## 1st Qu.: 49125 1st Qu.: 3600 1st Qu.:2.000 1st Qu.:1.000
## Median : 62000 Median : 4600 Median :3.000 Median :1.000
## Mean : 68122 Mean : 5150 Mean :2.965 Mean :1.286
## 3rd Qu.: 82000 3rd Qu.: 6360 3rd Qu.:3.000 3rd Qu.:2.000
## Max. :190000 Max. :16200 Max. :6.000 Max. :4.000
## stories driveway recroom fullbase gashw airco
## Min. :1.000 no : 77 no :449 no :355 no :521 no :373
## 1st Qu.:1.000 yes:469 yes: 97 yes:191 yes: 25 yes:173
## Median :2.000
## Mean :1.808
## 3rd Qu.:2.000
## Max. :4.000
## garagepl prefarea
## Min. :0.0000 no :418
## 1st Qu.:0.0000 yes:128
## Median :0.0000
## Mean :0.6923
## 3rd Qu.:1.0000
## Max. :3.0000
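# Recode bedrooms into three labelled groups. After the first assignment the
# column becomes character, so the later comparisons are string comparisons.
data$bedrooms[data$bedrooms <= 2] = "1~2"
# As written, only 3-bedroom houses receive the label "3~4" ...
data$bedrooms[(data$bedrooms > 2) & (data$bedrooms <= 3) & (data$bedrooms != "1~2")] = "3~4"
# ... and every remaining value (4 to 6 bedrooms) receives "5~6"
data$bedrooms[(data$bedrooms != "1~2") & (data$bedrooms != "3~4")] = "5~6"
data$bedrooms<-as.factor(data$bedrooms)
# Recode garage places: 1 -> "1", 2 -> "2", all other values (0 and 3) -> "3"
data$garagepl[data$garagepl == 1] = "1"
data$garagepl[(data$garagepl == 2) & (data$garagepl != "1")] = "2"
data$garagepl[(data$garagepl != "1") & (data$garagepl != "2")] = "3"
data$garagepl<-as.factor(data$garagepl)
# Code the two binary factors as -1 (no) and 1 (yes)
levels(data$prefarea)<-c(-1,1)
levels(data$airco)<-c(-1,1)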
head(data)
## price lotsize bedrooms bathrms stories driveway recroom fullbase gashw
## 1 42000 5850 3~4 1 2 yes no yes no
## 2 38500 4000 1~2 1 1 yes no no no
## 3 49500 3060 3~4 1 1 yes no no no
## 4 60500 6650 3~4 1 2 yes yes no no
## 5 61000 6360 1~2 1 1 yes no no no
## 6 66000 4160 3~4 1 1 yes yes yes no
## airco garagepl prefarea
## 1 -1 1 -1
## 2 -1 3 -1
## 3 -1 3 -1
## 4 -1 3 -1
## 5 -1 3 -1
## 6 1 3 -1
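Grouping a count into labelled intervals can also be written with `cut()`, which avoids comparing numbers against character labels. The sketch below is not the recoding used above. It uses a hypothetical vector `beds` and assumed break points only to show the technique.

```r
# Hypothetical bedroom counts, used only to illustrate cut()
beds <- c(1, 2, 3, 4, 5, 6, 3, 2)

# Intervals are right-closed: (0,2], (2,4], (4,6]
beds_grp <- cut(beds,
                breaks = c(0, 2, 4, 6),
                labels = c("1~2", "3~4", "5~6"))

# Count how many values fall into each group
table(beds_grp)
```

Because `cut()` returns a factor directly, no separate `as.factor()` step is needed.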
The experiment examines four factors that may affect the price of houses. Statistical and visual analysis can be used to assess the value of houses. We do not capture all the factors in the dataset, but this does not affect the purpose of the experiment, which is to see whether the four factors we observe have an impact on the value of houses.
Using both a full factorial design and a fractional factorial design in this experiment gives us a better understanding of the cost savings of the fractional design. In practice, researchers often face the problem of designing an efficient experiment. With many factors, the number of runs in a full factorial design grows exponentially. A fractional factorial design reduces the number of runs at a chosen resolution, at the cost of aliasing some effects.
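To make the run-count argument concrete, here is a small sketch that compares a full two-level factorial with a fraction of it. The values `k = 6` and `p = 3` are assumptions chosen for illustration.

```r
k <- 6   # number of 2-level factors (assumed for illustration)
p <- 3   # number of generators in the fraction (assumed)

full_runs <- 2^k          # runs in the full factorial
frac_runs <- 2^(k - p)    # runs in the 2^(k-p) fraction

# Enumerate the full factorial explicitly to confirm the count
full_design <- expand.grid(rep(list(c(-1, 1)), k))

c(full = full_runs, fractional = frac_runs, enumerated = nrow(full_design))
```

Each additional two-level factor doubles the size of the full factorial, while each additional generator halves the size of the fraction.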
Randomization, replication, and blocking are important in the design of experiments. These techniques help reduce the bias caused by nuisance factors and increase precision.
Randomization should be used in an experiment in three ways: random selection, random assignment, and random execution order.
In this dataset, we have no specific information about how randomization was carried out, so we assume that the data meet the requirements of a randomized design. In addition, by randomly sub-setting and ordering the data in our analysis, we can further support the randomization requirement.
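Random sub-setting and random ordering can be sketched in a few lines of base R. The data frame `toy`, the seed, and the sample size of 8 are all hypothetical and stand in for the real data.

```r
set.seed(42)  # arbitrary seed so the draw is reproducible

# Hypothetical stand-in for a data frame of observations
toy <- data.frame(id = 1:20, price = seq(30000, 125000, by = 5000))

# Random selection: draw 8 rows without replacement
picked <- toy[sample(nrow(toy), 8), ]

# Random execution order: shuffle the selected rows
run_order <- picked[sample(nrow(picked)), ]
run_order
```

Fixing the seed does not make the selection less random. It only makes the same draw repeatable when the analysis is rerun.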
“The repeated measures design (also known as a within-subjects design) uses the same subjects with every condition of the research, including the control, while replication reflects sources of variability both between runs and (potentially) within runs.”[4]
Repeated measures involve measuring the same cases multiple times, whereas replication involves running the same study on different subjects under identical conditions. In this dataset, we do not have any replication or repeated measures.
“In the statistical theory of the design of experiments, blocking is the arranging of experimental units in groups (blocks) that are similar to one another.”[5]
A nuisance factor may affect the experiment but is not a factor of primary interest to the researcher. In this study, blocking is not included because we did not find nuisance factors in the dataset that may affect the results.
The data set we use in this project is cross-sectional data from 1987. There are 546 observations in the dataset. From the summary statistics, house prices range from $25,000 to $190,000, and the mean price is $68,122. The price distribution is shown in the histogram below.
hist(data$price, breaks=20, main= "Housing Price", xlab="Price")
Plotting the distributions of all four input variables as bar plots:
par(mfrow=c(2,2))
barplot(table(data$bedrooms), xlab="Bedrooms", ylab="Frequency", main="Histogram of Bedrooms")
barplot(table(data$garagepl), xlab="Garage Place", ylab="Frequency", main="Histogram of Garage Place")
barplot(table(data$prefarea), xlab="Preferred Area", ylab="Frequency", main="Histogram of Preferred Area")
barplot(table(data$airco), xlab="Air-Conditions", ylab="Frequency", main="Histogram of Air-Conditions")
The boxplots are shown below to indicate the price differences among different groups.
# boxplots
par(mfrow=c(2,2))
boxplot(data$price~data$bedrooms, xlab="No. of Bedrooms", ylab="House Prices")
means1 <- by(data$price,data$bedrooms, mean)
points(1:3, means1,pch = 23, cex = 2, bg = "red")
text(1:3 - 0.1, means1,labels = format(means1, format = "f", digits = 2),pos = 3, cex = 0.9, col = "red")
boxplot(data$price~data$garagepl, xlab="No. of Garage Places", ylab="House Prices")
means2 <- by(data$price,data$garagepl, mean)
points(1:3, means2,pch = 23, cex = 2, bg = "red")
text(1:3 - 0.1, means2,labels = format(means2, format = "f", digits = 2),pos = 3, cex = 0.9, col = "red")
boxplot(data$price~data$prefarea, xlab="Preferred Community", ylab="House Prices")
means3 <- by(data$price,data$prefarea, mean)
points(1:2, means3,pch = 23, cex = 2, bg = "red")
text(1:2 - 0.1, means3,labels = format(means3, format = "f", digits = 2),pos = 3, cex = 0.9, col = "red")
boxplot(data$price~data$airco, xlab="Air Conditioner", ylab="House Prices")
means4 <- by(data$price,data$airco, mean)
points(1:2, means4,pch = 23, cex = 2, bg = "red")
text(1:2 - 0.1, means4,labels = format(means4, format = "f", digits = 2),pos = 3, cex = 0.9, col = "red")
In this project, a 2^(k-p) fractional factorial design is used. Our experiment contains four factors, two with 2 levels and two with 3 levels. Each of the two 3-level factors is decomposed into two 2-level factors. The process is shown as follows:
require(knitr)
## Loading required package: knitr
bd1 <- c(-1, 1, -1, 1)
bd2 <- c(-1, -1, 1, 1)
bedrooms <- c("1", "2","2", "3")
bd_factor <- as.data.frame((cbind(bd1,bd2,bedrooms)))
kable(bd_factor, align = 'c')
| bd1 | bd2 | bedrooms |
|---|---|---|
| -1 | -1 | 1 |
| 1 | -1 | 2 |
| -1 | 1 | 2 |
| 1 | 1 | 3 |
gp1 <- c(-1, 1, -1, 1)
gp2 <- c(-1, -1, 1, 1)
garagepl <- c("1", "2","2", "3")
gp_factor <- as.data.frame((cbind(gp1,gp2,garagepl)))
kable(gp_factor, align = 'c')
| gp1 | gp2 | garagepl |
|---|---|---|
| -1 | -1 | 1 |
| 1 | -1 | 2 |
| -1 | 1 | 2 |
| 1 | 1 | 3 |
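The mapping in the two tables above can be written as a single function of the two pseudo-factors. This is a minimal sketch: the names `x1`, `x2`, and `to_three_level` are hypothetical and do not appear in the analysis.

```r
# All four combinations of two hypothetical 2-level pseudo-factors
pseudo <- expand.grid(x1 = c(-1, 1), x2 = c(-1, 1))

# Map each combination to a 3-level factor:
# both low -> level 1, exactly one high -> level 2, both high -> level 3
to_three_level <- function(x1, x2) {
  (x1 + x2) / 2 + 2
}

pseudo$level <- to_three_level(pseudo$x1, pseudo$x2)
pseudo
```

The middle level is reached by two different combinations, so in a balanced design it is run twice as often as each of the outer levels.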
library(FrF2)
## Loading required package: DoE.base
## Loading required package: grid
## Loading required package: conf.design
##
## Attaching package: 'DoE.base'
## The following objects are masked from 'package:stats':
##
## aov, lm
## The following object is masked from 'package:graphics':
##
## plot.design
## The following object is masked from 'package:base':
##
## lengths
design<-FrF2(8,6,estible = formula("~bd1+bd2+gp1+gp2+prefarea+airco:(bd1+bd2+gp1+gp2+prefarea+airco)"), factor.names = c('bd1','bd2','gp1','gp2','prefarea','airco'),res5=T,claear=F)
design
## bd1 bd2 gp1 gp2 prefarea airco
## 1 1 1 1 1 1 1
## 2 1 -1 -1 -1 -1 1
## 3 -1 -1 1 1 -1 -1
## 4 -1 1 -1 -1 1 -1
## 5 1 -1 1 -1 1 -1
## 6 -1 1 1 -1 -1 1
## 7 1 1 -1 1 -1 -1
## 8 -1 -1 -1 1 1 1
## class=design, type= FrF2
set1<- subset(data,bedrooms=="1~2"&prefarea=="-1"&airco=="-1"&garagepl=="3")
set2<- subset(data,bedrooms=="3~4"&prefarea=="1"&airco=="-1"&garagepl=="1")
set3<- subset(data,bedrooms=="5~6"&prefarea=="1"&airco=="1"&garagepl=="3")
set4<- subset(data,bedrooms=="1~2"&prefarea=="1"&airco=="1"&garagepl=="2")
set5<- subset(data,bedrooms=="3~4"&prefarea=="-1"&airco=="1"&garagepl=="2")
set6<- subset(data,bedrooms=="5~6"&prefarea=="-1"&airco=="1"&garagepl=="1")
set7<- subset(data,bedrooms=="3~4"&prefarea=="-1"&airco=="-1"&garagepl=="2")
set8<- subset(data,bedrooms=="1~2"&prefarea=="1"&airco=="-1"&garagepl=="2")
set.seed(1583)
run1 <- set1[sample(1:nrow(set1), 1), ]
run2 <- set2[sample(1:nrow(set2), 1), ]
run3 <- set3[sample(1:nrow(set3), 1), ]
run4 <- set4[sample(1:nrow(set4), 1), ]
run5 <- set5[sample(1:nrow(set5), 1), ]
run6 <- set6[sample(1:nrow(set6), 1), ]
run7 <- set7[sample(1:nrow(set7), 1), ]
run8 <- set8[sample(1:nrow(set8), 1), ]
response <- c(run1$price, run2$price, run3$price, run4$price, run5$price, run6$price, run7$price, run8$price)
response
## [1] 35000 75000 174500 68000 128000 105000 53900 91700
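One caveat with drawing a single row per factor combination: `sample(1:nrow(x), 1)` does not fail when a subset is empty, because `1:0` evaluates to `c(1, 0)`. The sketch below shows a guarded alternative on toy data. The helper name `pick_one` and the data frame `toy` are hypothetical.

```r
# Hypothetical helper: draw one row at random, or fail loudly if the cell is empty
pick_one <- function(df) {
  if (nrow(df) == 0) stop("no observations for this factor combination")
  df[sample.int(nrow(df), 1), , drop = FALSE]
}

set.seed(1)  # arbitrary seed for reproducibility
toy <- data.frame(price = c(50000, 60000, 70000),
                  airco = c("-1", "1", "1"))

# A non-empty cell yields exactly one row
one_row <- pick_one(subset(toy, airco == "1"))

# An empty cell is reported instead of silently returning a bad row
empty  <- subset(toy, airco == "0")
result <- tryCatch(pick_one(empty), error = function(e) "empty cell")
```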
plan <- add.response(design, response)
summary(plan)
## Call:
## FrF2(8, 6, estible = formula("~bd1+bd2+gp1+gp2+prefarea+airco:(bd1+bd2+gp1+gp2+prefarea+airco)"),
## factor.names = c("bd1", "bd2", "gp1", "gp2", "prefarea",
## "airco"), res5 = T, claear = F)
##
## Experimental design of type FrF2
## 8 runs
##
## Factor settings (scale ends):
## bd1 bd2 gp1 gp2 prefarea airco
## 1 -1 -1 -1 -1 -1 -1
## 2 1 1 1 1 1 1
##
## Responses:
## [1] response
##
## Design generating information:
## $legend
## [1] A=bd1 B=bd2 C=gp1 D=gp2 E=prefarea F=airco
##
## $generators
## [1] D=AB E=AC F=BC
##
##
## Alias structure:
## $main
## [1] A=BD=CE B=AD=CF C=AE=BF D=AB=EF E=AC=DF F=BC=DE
##
## $fi2
## [1] AF=BE=CD
##
##
## The design itself:
## bd1 bd2 gp1 gp2 prefarea airco response
## 1 1 1 1 1 1 1 35000
## 2 1 -1 -1 -1 -1 1 75000
## 3 -1 -1 1 1 -1 -1 174500
## 4 -1 1 -1 -1 1 -1 68000
## 5 1 -1 1 -1 1 -1 128000
## 6 -1 1 1 -1 -1 1 105000
## 7 1 1 -1 1 -1 -1 53900
## 8 -1 -1 -1 1 1 1 91700
## class=design, type= FrF2
The results shown above list six 2-level factors to be analyzed in this design.
The fractional factorial design used in this experiment is a 2^(6-3) design of resolution III, which has 8 runs. In such a design, main effects are not aliased with other main effects, but they are aliased with two-factor interactions. The alias structure above shows which effects are confounded.
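The aliasing can be checked by hand in base R. The sketch below builds an 8-run design from the generators reported in the summary (D=AB, E=AC, F=BC), using the letter names from the legend. It is an illustration of the structure and does not reproduce the run order produced by FrF2.

```r
# Base factors A, B, C form a full 2^3 factorial (8 runs)
base <- expand.grid(A = c(-1, 1), B = c(-1, 1), C = c(-1, 1))

# Added factors are defined by the generators D=AB, E=AC, F=BC
frac <- transform(base, D = A * B, E = A * C, F = B * C)

# Main-effect columns are mutually orthogonal: X'X is 8 times the identity
main_cross <- crossprod(as.matrix(frac))

# D cannot be distinguished from the AB interaction, nor from EF
d_equals_ab <- all(frac$D == frac$A * frac$B)
d_equals_ef <- all(frac$D == frac$E * frac$F)
```

This matches the alias string D=AB=EF in the summary: any estimate labelled as the main effect of D is really the sum of those three effects.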
MEPlot(plan, abbrev = 5, cex.xax = 1.6, cex.main = 2)
IAPlot(plan, abbrev = 5, show.alias = TRUE, lwd = 2, cex = 2, cex.xax = 1.2, cex.lab = 1.5)
In this part, the main effects are estimated using lm and ANOVA. We analyze both the full factorial design and the fractional factorial design to compare the results.
The test below is the initial analysis of variance (ANOVA) performed on the full factorial set of data.
model1=lm(data$price~ data$bedrooms+data$garagepl+data$prefarea+data$airco)
anova(model1)
## Analysis of Variance Table
##
## Response: data$price
## Df Sum Sq Mean Sq F value Pr(>F)
## data$bedrooms 2 5.9217e+10 2.9609e+10 72.118 < 2.2e-16 ***
## data$garagepl 2 4.0097e+10 2.0049e+10 48.833 < 2.2e-16 ***
## data$prefarea 1 2.6403e+10 2.6403e+10 64.309 6.656e-15 ***
## data$airco 1 4.1597e+10 4.1597e+10 101.320 < 2.2e-16 ***
## Residuals 539 2.2129e+11 4.1055e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA results show that all four factors in the full factorial design have statistically significant effects on the value of a house. The observed differences are therefore unlikely to be due to chance alone, so we reject the null hypothesis that the four main factors have no effect on the value of a house.
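As a check on how the table is read, the F value and p-value of one row can be recomputed from its mean squares. The numbers below are copied from the prefarea and Residuals rows of the ANOVA table above, so the result agrees with the table only up to rounding.

```r
# Values copied from the ANOVA table above
ms_prefarea <- 2.6403e10   # Mean Sq for prefarea
ms_resid    <- 4.1055e08   # Mean Sq for Residuals
df_effect   <- 1           # Df for prefarea
df_resid    <- 539         # Df for Residuals

# F is the ratio of the effect mean square to the residual mean square
f_value <- ms_prefarea / ms_resid

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_effect, df_resid, lower.tail = FALSE)

c(F = f_value, p = p_value)
```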
summary(lm(plan))
## Number of observations used: 8
## Formula:
## response ~ (bd1 + bd2 + gp1 + gp2 + prefarea + airco)^2
##
## Call:
## lm.default(formula = fo, data = model.frame(fo, data = formula))
##
## Residuals:
## ALL 8 residuals are 0: no residual degrees of freedom!
##
## Coefficients: (14 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91388 NA NA NA
## bd11 -18413 NA NA NA
## bd21 -25913 NA NA NA
## gp11 19238 NA NA NA
## gp21 -2612 NA NA NA
## prefarea1 -10712 NA NA NA
## airco1 -14712 NA NA NA
## bd11:airco1 -3262 NA NA NA
##
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: NaN
## F-statistic: NaN on 7 and 0 DF, p-value: NA
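The coefficients above can be reproduced with simple contrast arithmetic, which shows why no standard errors are available: eight runs are used to estimate eight coefficients. The sketch below copies the design columns and responses from the design table above and assumes the -1/+1 coding shown there.

```r
# Design columns and responses copied from the design table above
bd1      <- c( 1,  1, -1, -1,  1, -1,  1, -1)
bd2      <- c( 1, -1, -1,  1, -1,  1,  1, -1)
gp1      <- c( 1, -1,  1, -1,  1,  1, -1, -1)
gp2      <- c( 1, -1,  1, -1, -1, -1,  1,  1)
prefarea <- c( 1, -1, -1,  1,  1, -1, -1,  1)
airco    <- c( 1,  1, -1, -1, -1,  1, -1,  1)
response <- c(35000, 75000, 174500, 68000, 128000, 105000, 53900, 91700)

# Model matrix: intercept, six main effects, one interaction column
X <- cbind(Intercept = 1, bd1, bd2, gp1, gp2, prefarea, airco,
           "bd1:airco" = bd1 * airco)

# With orthogonal -1/+1 columns, each coefficient is sum(x * y) / n
coefs <- drop(crossprod(X, response)) / length(response)
coefs
```

Each coefficient is half the difference between the mean response at the high level and the mean response at the low level of that column.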
To check the adequacy of the ANOVA model, we draw a normal quantile-quantile (Q-Q) plot of the residuals to assess whether they follow a normal distribution, together with a plot of the residuals against the fitted values.
par(mfrow=c(1,2))
qqnorm(residuals(model1))
qqline(residuals(model1))
plot(fitted(model1),residuals(model1))
If the residuals are normally distributed, the points should fall along the straight reference line. In our study, the residuals may not be strictly normally distributed.
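A Q-Q plot can be complemented by a numerical check such as the Shapiro-Wilk test. This is an optional addition, not part of the analysis above. The sketch runs the test on simulated residuals (an assumption made so that it is self-contained) and shows how a skewed sample is flagged.

```r
set.seed(123)  # arbitrary seed

# Simulated residuals: one roughly normal sample, one right-skewed sample
normal_res <- rnorm(200)
skewed_res <- rexp(200) - 1

# Small p-values indicate departure from normality
p_normal <- shapiro.test(normal_res)$p.value
p_skewed <- shapiro.test(skewed_res)$p.value

c(normal = p_normal, skewed = p_skewed)
```

The same call could be applied to `residuals(model1)`. With several hundred observations the test can be sensitive to small departures, so it is best read alongside the plot.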
In this project, we performed a statistical analysis of the sales prices of houses in the City of Windsor. The main question was whether the price of a house is significantly affected by the four factors we considered. In the full factorial design, ‘bedrooms’, ‘garagepl’, ‘prefarea’, and ‘airco’ are all statistically significant. In our fractional factorial design, the factors do not show significance: the fitted model is saturated, with 8 runs and no residual degrees of freedom, so no significance tests are available. In addition, the main effects are aliased with two-factor interactions. These problems may be addressed by adding runs or replicates and by increasing the resolution of the design.
[1] Anglin, P.M. and R. Gencay (1996) “Semiparametric estimation of a hedonic price function”, Journal of Applied Econometrics, 11(6), 633-648.
[2] Cran.r Project (2015), Package ‘Ecdat’, Accessed: 11-02-2016 https://cran.r-project.org/web/packages/Ecdat/Ecdat.pdf
[3] Montgomery, D. (2013), Design and Analysis of Experiments, Wiley and Sons, 8th Edition, 752p.