1. Setting

System Under Test

The data analyzed in this example is from the ‘Housing’ data set in the ‘Ecdat package in R. This data contains information on houses sold such as number of bedrooms, bathrooms, garage spaces, square footage and selling price. In this experiment, we will be examining the effect that having a driveway or not affects the selling price of a house, blocked by whether or not the house is in a ’preferred area.’

Factors and Levels

Our independent variable is if the house has a driveway or not, a bindary variable: yes or no.

library("Ecdat")
## Loading required package: Ecfun
## Warning: package 'Ecfun' was built under R version 3.2.5
## 
## Attaching package: 'Ecfun'
## The following object is masked from 'package:base':
## 
##     sign
## 
## Attaching package: 'Ecdat'
## The following object is masked from 'package:datasets':
## 
##     Orange
house <- Housing
levels(as.factor(house$driveway))
## [1] "no"  "yes"

If it is easier to think about this in a 0 or 1 setting, you can change the values in the data frame like so:

levels(house$driveway) <- c(levels(house$driveway), 0, 1)
house$driveway[house$driveway=="yes"] <- 1
house$driveway[house$driveway=="no"] <- 0

Continuous Variables

The selling price of the house is our dependent (resonse) variable, and is continous.

summary(house$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   25000   49120   62000   68120   82000  190000

The Data: How is it organized and what does it look like?

head(house)
##   price lotsize bedrooms bathrms stories driveway recroom fullbase gashw
## 1 42000    5850        3       1       2        1      no      yes    no
## 2 38500    4000        2       1       1        1      no       no    no
## 3 49500    3060        3       1       1        1      no       no    no
## 4 60500    6650        3       1       2        1     yes       no    no
## 5 61000    6360        2       1       1        1      no       no    no
## 6 66000    4160        3       1       1        1     yes      yes    no
##   airco garagepl prefarea
## 1    no        1       no
## 2    no        0       no
## 3    no        0       no
## 4    no        0       no
## 5    no        0       no
## 6   yes        0       no

Randomization

The data can be assumed to be randomly collected, as it is part of an R package of data sets that are meant to be used in analyses like these.

Replication / Repeated Measures

There are no repeated measures in this data set.

Block

In this experiment, the data will be blocked on if the house is in a ‘preferred area’ or not. This is also a binary variable with values of yes or no. Note that it is unclear what constitutes a ‘preferred area.’

levels(as.factor(house$prefarea))
## [1] "no"  "yes"

2. Experimental Design

How will the experiment be organized and conducted to test the hypothesis?

This experiment will test whether or not having a driveway affects the selling price of a house, blocked on whether or not the house is in a ‘preferred area’ of town. First, an ANOVA is run to test this theory, and then two alternative methods to null hypothesis statistical testing are run to test and verify the results. The first alternative is the resampling, or bootstrap, method. The second method is the power analysis method.

What is the rationale for this design?

The ANOVA method is a valid first step in analyzing any data set, but it makes the (sometimes untrue) assumption of normality. The resampling method does not make this assumption, which can help confirm or invalidate the results found in the ANOVA method.

3. Statistical Analysis

Exploratory Data Analysis

price <- house$price
driveway <- house$driveway
preferred <- as.factor(house$prefarea)

summary(price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   25000   49120   62000   68120   82000  190000
summary(driveway)
##  no yes   0   1 
##   0   0  77 469
summary(preferred)
##  no yes 
## 418 128
par(mfrow = c(1,2))
boxplot(price ~ driveway, main = "Driveway vs. Selling Price", xlab = "Driveway?", ylab = "Selling Price")
boxplot(price ~ preferred, main = "Location vs. Selling Price", xlab = "Preferred Area?", ylab = "Selling Price")

Original Model Testing

fit <- aov(price ~ driveway + as.factor(preferred))
summary(fit)
##                       Df    Sum Sq   Mean Sq F value   Pr(>F)    
## driveway               1 3.432e+10 3.432e+10   57.37 1.57e-13 ***
## as.factor(preferred)   1 2.946e+10 2.946e+10   49.25 6.73e-12 ***
## Residuals            543 3.248e+11 5.982e+08                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit)
## Analysis of Variance Table
## 
## Response: price
##                       Df     Sum Sq    Mean Sq F value    Pr(>F)    
## driveway               1 3.4317e+10 3.4317e+10  57.367 1.567e-13 ***
## as.factor(preferred)   1 2.9464e+10 2.9464e+10  49.254 6.733e-12 ***
## Residuals            543 3.2482e+11 5.9820e+08                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Calculated the required sample size using G*Power: Required Sample Size in G*Power

Required sample size in 120.

Original Model Adequacy Checking

qqnorm(residuals(fit))
qqline(residuals(fit))

plot(fitted(fit), residuals(fit))

## a. Resampling Method

R = 10000
Fstar = numeric(R)
meanstar = with(house, tapply(price, driveway, mean))

levels(house$driveway) <- droplevels(house$driveway)
levels(house$driveway)
driveway <- house$driveway

grp0 = price[driveway == "0"] - meanstar[1]
grp1 = price[driveway == "1"] - meanstar[2]

simdriv = as.matrix(data.frame(driveway))

for (i in 1:R) {
     group0 = sample(grp0, size = 77, replace = T)     
     group1 = sample(grp1, size = 469, replace = T)
     simprice = as.matrix(data.frame(c(group0, group1)))
     simdata = data.frame(simprice, simdriv)
     Fstar[i] = t.test(simprice ~ simdriv, data = simdata)$statistic
 }

par(mfrow = c(1,1))
hist(Fstar, prob = T, ylim = c(0, 0.8), xlim = c(0, 8), main = "F-distribution of the empirical and analytical results")
x = seq(0.25, 8, 0.25)
points(x, y=df(x, 2, 546), type = "b", col = "red")

print(realFstar <- t.test(price ~ driveway, data = house)$statistic)

mean(Fstar >= realFstar)

b. Power Analysis

The second NHST alternative method used here was power analysis. Using a post-hoc test in G*Power, and the actual sample size of the data set (546 observations), it is possible to determine the actual power of the model.

Power Analysis in G*Power

Power Analysis in G*Power

As you can see, the actual power is very high (99.9%), meaning the probability of a beta error is quite low.

4. References to Literature

Fortune, Jenn, Alayna Gillespie, Ivana Pejakovic, and Anne Bergen. Resampling: Bootstrapping and Randomization. Psychology Statistics Department. University of Guelph, n.d. Web. 7 Nov. 2016.

5. Raw Data Source and Code

The Housing data set can be found in the ‘Ecdat’ package in R, here: https://cran.r-project.org/web/packages/Ecdat/index.html

Full Commented Code:

library("Ecdat") #loads the Ecdat package, which contains multiple datasets
house <- Housing #assigns Housing dataset to a variable
levels(as.factor(house$driveway)) #shows all values of driveway column

#converts the yes/no levels of driveway to 1/0 values
levels(house$driveway) <- c(levels(house$driveway), 0, 1)
house$driveway[house$driveway=="yes"] <- 1
house$driveway[house$driveway=="no"] <- 0


summary(house$price) # shows summary stats for price variable
head(house) #prints the first 6 rows of the house data set

levels(as.factor(house$prefarea)) #gives the levels of the preferred area column

#assigns columns to variables (easier to deal with in code)
price <- house$price
driveway <- house$driveway
preferred <- as.factor(house$prefarea) #specified as factor b/c it's the blocking factor

summary(price)
summary(driveway)
summary(preferred)

par(mfrow = c(1,2)) # displays two plots side-by-side for comparison purposes

#displays boxplots of both driveway and preferred area of house versus the selling price

boxplot(price ~ driveway, main = "Driveway vs. Selling Price", xlab = "Driveway?", ylab = "Selling Price")
boxplot(price ~ preferred, main = "Location vs. Selling Price", xlab = "Preferred Area?", ylab = "Selling Price")

#defines a linear model of price in relation to having a driveway or not, blocked on being in a preferred area
fit <- aov(price ~ driveway + preferred)
summary(fit)
anova(fit) # analysis of variance of linear model. assumes normality

#plots the residuals to check for normality
qqnorm(residuals(fit))
qqline(residuals(fit))
plot(fitted(fit), residuals(fit))

# resampling method, samples 10,000 times to create a new sample data set based on the F distribution with a sample F value (Fstar)
R = 10000
Fstar = numeric(R)
meanstar = with(house, tapply(price, driveway, mean))

levels(house$driveway) <- droplevels(house$driveway)
levels(house$driveway)

grp0 = price[driveway == "0"] - meanstar[1]
grp1 = price[driveway == "1"] - meanstar[2]

simdriv = driveway

#loops through 10,000 times to get sample data
for (i in 1:R) {
     group0 = sample(grp0, size = 77, replace = T)     
     group1 = sample(grp1, size = 469, replace = T)
     simprice = c(group0, group1)
     simdata = data.frame(simprice, simdriv)
     Fstar[i] = t.test(simprice, simdriv, data = simdata)$statistic
 }


par(mfrow = c(1,1)) # resets plots field to just show one at a time

# graphs the new data F distribution

hist(Fstar, prob = T, ylim = c(0, 0.8), xlim = c(0, 8), main = "F-distribution of the empirical and analytical results")
x = seq(0.25, 8, 0.25)
points(x, y=df(x, 2, 546), type = "b", col = "red")

# tests the actual data
print(realFstar <- oneway.test(price ~ driveway, var.equal = T, data = house)$statistic)

# compares F values
mean(Fstar >= realFstar)