Introduction

When studying urban spaces and environment, land value is of mayor concern and is seen in great detail by researchers. Within the study of land value, housing prices have been analyzed and studied for years. These studies had been developed at an aggregate level (among regional areas) but prices within the city are also very complex and there are so many factors that affect the value. This study analyzes sales prices of houses in the city of Windsor, publicly available in r package named Ecdat, a summary of this package is available at https://cran.r-project.org/web/packages/Ecdat/Ecdat.pdf.

This exploratory analysis applied hypothetical testing among the housing dataset in Eckdat where the variable of study was price and two categorical independent variables were selected as possible factors that influence price. Hypothesis testing was used to perform this analysis and some of the alternatives, such as the Confidence Intervals and Resampling Techinques, were also developed for a more comprehensive experimental study.

Setting

As a starting point, the data was downloaded from the Ecdat package. From this, the housing dataframe was imported and assigned to data1 dataframe:

data1 <- Housing

The structure, heading and tail can be observed as part of the initial diagnostics of this dataframe:

str(data1)

## 'data.frame':    546 obs. of  12 variables:
##  $ price   : num  42000 38500 49500 60500 61000 66000 66000 69000 83800 88500 ...
##  $ lotsize : num  5850 4000 3060 6650 6360 4160 3880 4160 4800 5500 ...
##  $ bedrooms: num  3 2 3 3 2 3 3 3 3 3 ...
##  $ bathrms : num  1 1 1 1 1 1 2 1 1 2 ...
##  $ stories : num  2 1 1 2 1 1 2 3 1 4 ...
##  $ driveway: Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ recroom : Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 2 2 ...
##  $ fullbase: Factor w/ 2 levels "no","yes": 2 1 1 1 1 2 2 1 2 1 ...
##  $ gashw   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ airco   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 2 ...
##  $ garagepl: num  1 0 0 0 0 0 2 0 0 1 ...
##  $ prefarea: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

head(data1)

##   price lotsize bedrooms bathrms stories driveway recroom fullbase gashw
## 1 42000    5850        3       1       2      yes      no      yes    no
## 2 38500    4000        2       1       1      yes      no       no    no
## 3 49500    3060        3       1       1      yes      no       no    no
## 4 60500    6650        3       1       2      yes     yes       no    no
## 5 61000    6360        2       1       1      yes      no       no    no
## 6 66000    4160        3       1       1      yes     yes      yes    no
##   airco garagepl prefarea
## 1    no        1       no
## 2    no        0       no
## 3    no        0       no
## 4    no        0       no
## 5    no        0       no
## 6   yes        0       no

tail(data1)

##      price lotsize bedrooms bathrms stories driveway recroom fullbase
## 541  85000    6525        3       2       4      yes      no       no
## 542  91500    4800        3       2       4      yes     yes       no
## 543  94000    6000        3       2       4      yes      no       no
## 544 103000    6000        3       2       4      yes     yes       no
## 545 105000    6000        3       2       2      yes     yes       no
## 546 105000    6000        3       1       2      yes      no       no
##     gashw airco garagepl prefarea
## 541    no    no        1       no
## 542    no   yes        0       no
## 543    no   yes        0       no
## 544    no   yes        1       no
## 545    no   yes        1       no
## 546    no   yes        1       no

As observed, there are 12 different variables in the dataset Housing, each with 546 observations. Six of these variables are identified as factors and each of these have two levels which represent a yes or no answer. Other variables, such as the lot size, number of bedrooms, number of bathrooms, number of stories and number of garage spaces are also available.

The sales price of the house is our variable of interest, named price. In order to understand the possible factors that may influence price, two categorical variables were selected. One of the variables selected prefarea indicates if the house is located in the preferred neighborhood of the city. The second variable selected was the number of bathrooms, bathrms, as it is one of the most critical aspects that determine the value of the house, after the number of bedrooms. This variable has been categorized for the purposes of the study. In order to determine the category levels, a quick summary of the variable bathrms is presented below:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.286   2.000   4.000

From this summary it was observed that the number of levels range from 1 to 4, yet there was only one observation with 4 bedrooms so only three levels were defined for this variable. A new dataset was created data2 with the price variable and the selected categorical variables named pref and baths, which indicate the preferred neighborhood and the number of bathrooms, respectively.

bath_levels <- c(1,2,3)
data2 <- data.frame(price = data1$price, baths = factor(data1$bathrms, levels = bath_levels, exclude = NA), pref = data1$prefarea)

Experimental Design

This experimental study considers random samples of N observations from the Housing data. From the two variables selected as factors, one of them was blocked and other one was not for the analysis. On one hand, the pref variable has 2 levels and the bathsvariable has 3 levels, as observed before. The response variable is the selling price of the houses reported.

Statistical Analysis

Descriptive Analysis of Data

As observed from the summary statistics, housing prices in the city of Windsor ranges between 25000 and 190000, with an average of 68120.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   25000   49120   62000   68120   82000  190000

The distribution of sales price is shown below, which indicates a higher frequency of houses valued at 50000 and it quickly decreases after 65000. Some outliers were also observed, as there are a few cases where houses were valued at a higher price than 150000.

With respect to the differences observed among the preferred neighborhood factor described in the pref variable, there appears to be a clear shift from those houses with a preferred neighborhood (pref=“yes”) versus those without (pref=“no”).

On the other hand, there appears to be a difference in price among the groups of houses with different number of bathrooms, but is not as straightforward. Thus, a more careful attention is given to this variable.

The following boxplots also provide a visual of the differences among the different groups. In both cases, differences in price can be observed among the levels of each factor.

Hypothesis Testing

As expressed above, the purpose of this study is to analyze potential factors that influence the price of housing in the city of Windsor. To achieve this, the dataset on Housing is analysed and a hypothesis test was performed. For this analysis, our conjecture is that the housing price will differ significantly with the number of bathrooms. Thus, this test aims at having sufficient evidence to rejecting the null hypothesis.

Ho: Prices of houses with different number of bathrooms are equal.

Ha: Prices of houses with different number of bathrooms were not equal.

Level of Significance and Power

For both hypotheses tested, a level of significance of \(\alpha=0.05\) was used as it has been commonly used by many statisticians. On the other hand, a balance was needed in the case of accurately detecting a difference among the groups. Thus, the probability of not rejecting the null hypothesis when in fact the alternate hypothesis was true was chosen as \(\beta=0.1\). Thus, the Power was set in \(1-\beta=0.9\).

The effect size was also estimated through the main effect of non-blocking independent variable, pref.

data_block4 <- data2[ which(data2$pref=="no"), ]
data_block5 <- data2[ which(data2$pref=="yes"), ]
mean(data_block4$price)

## [1] 63263.49

nrow(data_block4)

## [1] 418

mean(data_block5$price)

## [1] 83986.37

nrow(data_block5)

## [1] 128

sqrt(((nrow(data_block4)*sd(data_block4$price)^2)+nrow(data_block5)*sd(data_block5$price)^2)/(nrow(data_block4)+nrow(data_block5)))

## [1] 25242.8

By using GPower based on the selected \(\alpha\) and \(\beta\), the estimated effect size of 0.3548 was calculated from the mean effects and standard deviation within the non blocking groups, as shown in the graph. As a result, our sample size was calculated, which indicates that 86 number of observations must be randomly selected for our experimental design.

GPower Results

Random Selection of Data

As stated previously 86 experimental runs were generated randomly, where aproximately 29 were selected from each one of the levels, as shown below.

set.seed(101)
seldata1 = data2[sample(which(data2$baths==1), size=29, replace=T), ]
set.seed(102)
seldata2 = data2[sample(which(data2$baths==2), size=29, replace=T), ]
set.seed(103)
seldata3 = data2[sample(which(data2$baths==3), size=28, replace=T), ]
data3 <- rbind(seldata1, seldata2, seldata3)

Hypothesis Test

To test this hypothesis, analysis of variance (ANOVA) was developed, where the P-value was observed to test significance at \(\alpha=0.05\).

##             Df    Sum Sq   Mean Sq F value   Pr(>F)    
## data3$baths  2 2.558e+10 1.279e+10   12.55 1.73e-05 ***
## Residuals   83 8.459e+10 1.019e+09                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As observed, the P-value is much smaller than our level of significance 0.05, thus we are able to reject the null hypothesis. This indicates that in fact the number of bathrooms have a significant difference in Housing Pricing.

Alternatively, the Confidence Intervals were build up for this analysis to verify the difference is significant at higher or lower level of confidence.

##                 2.5 %   97.5 %
## (Intercept)  50718.64 73957.22
## data3$baths2 13833.36 46697.67
## data3$baths3 24076.71 57233.15

##                   5 %     95 %
## (Intercept)  52586.72 72089.14
## data3$baths2 16475.22 44055.81
## data3$baths3 26742.05 54567.81

##                  0.5 %   99.5 %
## (Intercept)  47067.600 77608.26
## data3$baths2  8670.008 51861.03
## data3$baths3 18867.453 62442.40

Model Adequacy Checking

Model adequacy checking was also developed, primarily to test Normality and Homogeneity for the model.

From this normality plot of residuals, it seems that the normality assumption is satisfied. To confirm this reasoning, Shapiro-Wilk Normality Test was also performed, as it tests the NULL hypothesis that the samples came from a Normal distribution.

# Shapiro Test of Normality
shapiro.test(residuals(anova1))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(anova1)
## W = 0.98264, p-value = 0.3026

As observed from the results, the p-value > 0.05 indicates that we cannot reject the NULL hypothesis that the samples came from a Normal distribution. Thus, the model does not violate the normality assumptions.

Also, the homogeneity of variances is also tested through the Barlett Test. As explained in [3], this test is designed to test the null hypothesis that variances across groups are equal, against the alternative hypothesis that variances are unequal for at least two groups.

# Bartlett Test of Homogeneity of Variances
bartlett.test(data3$price ~ data3$baths)

## 
##  Bartlett test of homogeneity of variances
## 
## data:  data3$price by data3$baths
## Bartlett's K-squared = 8.5424, df = 2, p-value = 0.01396

As observed from the results, the p-value is greater than our level of significance. Thus, we can conclude that the variances are not statistically equal. These differences can also be observed from the homogeneity of variances plot and the Brown-Forsyth Test.

hov(data3$price ~ data3$baths)

## 
##  hov: Brown-Forsyth
## 
## data:  data3$price
## F = 6.7762, df:data3$baths = 2, df:Residuals = 83, p-value =
## 0.00188
## alternative hypothesis: variances are not identical

hovPlot(data3$price ~ data3$baths)

Resampling Techniques

Resampling techniques are used to confirm hypothetical testing of samples for non-ideal populations. For this study we were interested in the sensitivity of our ANOVA results to violations of the assumption of equal variance, in particular. Monte Carlo studies have demonstrated that when two samples are equal in size, the test for independent groups is unaffected by differences in population variance [1]. Through Monte Carlo simulation, one of the resampling techniques available, we were able to generate populations under a known distribution (this case, normal distribution) and under multiple resampling we were able to generate an empirical distribution of the F statistic. This was generated to compare the initial statistic obtained from ANOVA, fstat = 14.4892883, and verify its statistical difference.

meanstar = mean(data3$price)
sdstar = sqrt(sum((anova1$residuals)^2)/anova1$df.residual)
simbaths = data3$baths
R = 10000
Fstar = numeric(R)
for (i in 1:R) {
  # make residual normally distributed with known pooled variance
  # this is a Monte Carlo simulation since distribution is known normal
  groupA = rnorm(29, mean=meanstar, sd=sdstar)
  groupB = rnorm(29, mean=meanstar, sd=sdstar)
  groupC = rnorm(28, mean=meanstar, sd=sdstar)
  simprice = c(groupA,groupB,groupC)
  simdata2 = data.frame(simprice,simbaths)
  Fstar[i] = oneway.test(simprice~simbaths, data=simdata2)$statistic
}
maxf=max(Fstar)

As observed, 10,000 samples were generated following a normal random distribution and ANOVA was performed at each run. These 10,000 statistics were stored an an empirical distribution of F under the null hypothesis was plotted.

The p-value for the F test obtained from the initial model was calculated on the basis of this distribution. The resulting P-value, 0 that represents the area of the region superior to 14.4892883, indicates that there is 0 % possibility of type 1 error. As it is much less than the 5% level of significance, we can confidently conclude that there is a statistical significance on the prices of houses, according to the number of bathrooms.

Conclusions

As suggested from the results of this particular case study from the city of Windsor, the house price depends among other variables on the number of bathrooms each contains. It is also believed that the neighborhood preference directly influences sales price. But this study focused on the number of bathrooms, where a shift from 1 to 2 to 3 bathrooms significantly increase the price value.

This study also highlights the importance of Resampling techniques to achieve inferences from non-ideal populations as the ones we may encounter in practice (through surveys, datasets, etc) where we are unsure there was a random selection of observations and where the variables under study do not behave like a typical distribution. Monte Carlo simulation was successfully implemented in this study and the conclusions from our study were confirmed.

References

[1] Berger, D. (2001) A Gentle Introduction to Resampling Techniques, Accessed: 11-07-2016, http://wise.cgu.edu/wp-content/uploads/2015/04/Introduction-to-Resampling-Techniques-110901.pdf

[2] Cran.r Project (2015), Package ‘Ecdat’, Accessed: 11-02-2016 https://cran.r-project.org/web/packages/Ecdat/Ecdat.pdf

[3] Engineering Statistic Handbook, 1.3.5.7.Bartlett’s Test, Accessed: 11-09-2016, http://www.itl.nist.gov/div898/handbook/eda/section3/eda357.htm

[4] Montgomery, D. (2013), Design and Analysis of Experiments, Wiley and Sons, 8th Edition, 752p.

[5] Engineering Statistic Handbook, 1.3.5.7.Bartlett’s Test, Accessed: 11-09-2016, http://www.itl.nist.gov/div898/handbook/eda/section3/eda357.htm

Appendix

#Obtaining the dataset used
install.packages('Ecdat')
library(Ecdat)
data1 <- Housing

#Initial Diagnosis of Data
head(data1)
tail(data1)
str(data1)

#Exploratory Data Analysis
summary(data1$bathrms)
summary(data1$prefarea)
summary(data1$price)
hist(data1$price,  main= "Housing Price Distribution", xlab="Price")

install.packages('nortest')
library(sm)
sm.density.compare(data2$price, data2$pref, xlab="Price per Neighborhood Preference")
sm.density.compare(data2$price, data2$baths, xlab="Price per Number of Bathrooms")

#Organizing Categorical IVs
bath_levels <- c(1,2,3)
#room_levels <- c(1,2,3,4,5,6)

data2 <- data.frame(price = data1$price, baths = factor(data1$bathrms, levels = bath_levels, exclude = NA), pref = data1$prefarea) 
#data2 <- data.frame(price = data1$price, rooms = factor(data1$bedrooms, levels = room_levels, exclude = NA), baths = factor(data1$bathrms, levels = bath_levels, exclude = NA), pref = data1$prefarea) 

save(data2,file="abrev_data")

boxplot(data2$price~data2$baths, xlab="No. of Bathrooms", ylab="Housing Price")
boxplot(data2$price~data2$pref, xlab="Neighborhood Preference", ylab="Housing Price")

install.packages('nortest')
library(nortest)

install.packages('HH')
library(HH)


#Initial Anova Testing
anova1 <- aov(data2$price ~ data2$baths)
summary(anova1)
qqnorm(residuals(anova1));qqline(residuals(anova1), col = 2)
# Shapiro Test of Normality
shapiro.test(residuals(anova1))
# Bartlett Test of Homogeneity of Variances
bartlett.test(data2$price ~ data2$baths)

# Homogeneity of Variance Plot
hov(data2$price ~ data2$baths)
hovPlot(data2$price ~ data2$baths)


anova2 <- aov(data2$price ~data2$pref)
summary(anova2)
# Normality Plot of Residuals
qqnorm(residuals(anova2));qqline(residuals(anova2), col = 2)
# Shapiro Test of Normality
shapiro.test(residuals(anova2))
hov(data2$price ~ data2$pref)
hovPlot(data2$price ~ data2$pref)

#Blocking Number of Bathrooms
data_block1 <- data2[ which(data2$baths==1), ]
data_block2 <- data2[ which(data2$baths==2), ]
data_block3 <- data2[ which(data2$baths==3), ]

qqnorm(data_block1$price);qqline(data_block1$price, col = 2)
pearson.test(data_block1$price)
qqnorm(data_block2$price);qqline(data_block2$price, col = 2)
pearson.test(data_block2$price)
qqnorm(data_block3$price);qqline(data_block3$price, col = 2)
pearson.test(data_block3$price)


#Blocking Neighborhood Preference
data_block4 <- data2[ which(data2$pref=="no"), ]
data_block5 <- data2[ which(data2$pref=="yes"), ]
qqnorm(data_block4$price);qqline(data_block4$price, col = 2)
pearson.test(data_block4$price)
qqnorm(data_block5$price);qqline(data_block5$price, col = 2)
pearson.test(data_block5$price)

#calculating Main Effects
me_12 <- mean(data_block1$price) - mean(data_block2$price)
me_13 <- mean(data_block1$price) - mean(data_block3$price)
me_23 <- mean(data_block2$price) - mean(data_block3$price)
me_45 <- mean(data_block4$price) - mean(data_block5$price)

me_bath <- abs(me_13)
me_pref <- abs(me_45)

#Choosing alfa, beta
alfa <- 0.05
beta <- 0.85

#choosing N for Bath with G*Power
mean(data_block1$price)
nrow(data_block1)
mean(data_block2$price)
nrow(data_block2)
mean(data_block3$price)
nrow(data_block3)
sqrt(((nrow(data_block1)*sd(data_block1$price)^2)+(nrow(data_block1)*sd(data_block1$price)^2)+ (nrow(data_block3)*sd(data_block3$price)^2))/(nrow(data_block1)+nrow(data_block2)+nrow(data_block2)))
n_bath <- 60

#choosing N for Pref with GPower
mean(data_block4$price)
nrow(data_block4)
mean(data_block5$price)
nrow(data_block5)
sqrt(((nrow(data_block4)*sd(data_block4$price)^2)+nrow(data_block5)*sd(data_block5$price)^2)/(nrow(data_block4)+nrow(data_block5)))
n_pref <- 86

#Generate N random samples
set.seed(101)
seldata1 = data2[sample(which(data2$baths==1), size=29, replace=T), ]
set.seed(102)
seldata2 = data2[sample(which(data2$baths==2), size=29, replace=T), ]
set.seed(103)
seldata3 = data2[sample(which(data2$baths==3), size=28, replace=T), ]
data3 <- rbind(seldata1, seldata2, seldata3)

#Hypothesis Test
anova1 <- aov(data3$price ~ data3$baths)
summary(anova1)
fstat <-  oneway.test(data3$price ~ data3$baths)$statistic
confint.default(anova1, level=0.95)
confint.default(anova1, level=0.90)
confint.default(anova1, level=0.99)

#Model Adequacy Checking
#Normality Plot of Residuals
qqnorm(residuals(anova1));qqline(residuals(anova1), col = 2)
# Shapiro Test of Normality
shapiro.test(residuals(anova1))
# Bartlett Test of Homogeneity of Variances
bartlett.test(data2$price ~ data2$baths)
# Homogeneity of Variance Plot
hov(data2$price ~ data2$baths)
hovPlot(data2$price ~ data2$baths)

#Resampling Technique: Monte Carlo Simulation
meanstar = mean(data3$price)
sdstar = sqrt(sum((anova1$residuals)^2)/anova1$df.residual)
simbaths = data3$baths
R = 10000
Fstar = numeric(R)
for (i in 1:R) {
  # make residual normally distributed with known pooled variance
  # this is a Monte Carlo simulation since distribution is known normal
  groupA = rnorm(29, mean=meanstar, sd=sdstar)
  groupB = rnorm(29, mean=meanstar, sd=sdstar)
  groupC = rnorm(28, mean=meanstar, sd=sdstar)
  simprice = c(groupA,groupB,groupC)
  simdata2 = data.frame(simprice,simbaths)
  Fstar[i] = oneway.test(simprice~simbaths, data=simdata2)$statistic
}
maxf=max(Fstar)

Hypothesis Testing: Sales Prices of Houses in the City of Windsor

Diana G. Ramirez-Rios, Rensselar Polytechnic Institute

11/7/2016