The data analyzed in this example is from the ‘Housing’ data set in the ‘Ecdat package in R. This data contains information on houses sold such as number of bedrooms, bathrooms, garage spaces, square footage and selling price. In this experiment, we will be examining the effect that having a driveway or not affects the selling price of a house, blocked by whether or not the house is in a ’preferred area.’
Our independent variable is if the house has a driveway or not, a bindary variable: yes or no.
library("Ecdat")
## Loading required package: Ecfun
## Warning: package 'Ecfun' was built under R version 3.2.5
##
## Attaching package: 'Ecfun'
## The following object is masked from 'package:base':
##
## sign
##
## Attaching package: 'Ecdat'
## The following object is masked from 'package:datasets':
##
## Orange
house <- Housing
levels(as.factor(house$driveway))
## [1] "no" "yes"
If it is easier to think about this in a 0 or 1 setting, you can change the values in the data frame like so:
levels(house$driveway) <- c(levels(house$driveway), 0, 1)
house$driveway[house$driveway=="yes"] <- 1
house$driveway[house$driveway=="no"] <- 0
The selling price of the house is our dependent (resonse) variable, and is continous.
summary(house$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25000 49120 62000 68120 82000 190000
head(house)
## price lotsize bedrooms bathrms stories driveway recroom fullbase gashw
## 1 42000 5850 3 1 2 1 no yes no
## 2 38500 4000 2 1 1 1 no no no
## 3 49500 3060 3 1 1 1 no no no
## 4 60500 6650 3 1 2 1 yes no no
## 5 61000 6360 2 1 1 1 no no no
## 6 66000 4160 3 1 1 1 yes yes no
## airco garagepl prefarea
## 1 no 1 no
## 2 no 0 no
## 3 no 0 no
## 4 no 0 no
## 5 no 0 no
## 6 yes 0 no
The data can be assumed to be randomly collected, as it is part of an R package of data sets that are meant to be used in analyses like these.
There are no repeated measures in this data set.
In this experiment, the data will be blocked on if the house is in a ‘preferred area’ or not. This is also a binary variable with values of yes or no. Note that it is unclear what constitutes a ‘preferred area.’
levels(as.factor(house$prefarea))
## [1] "no" "yes"
This experiment will test whether or not having a driveway affects the selling price of a house, blocked on whether or not the house is in a ‘preferred area’ of town. First, an ANOVA is run to test this theory, and then two alternative methods to null hypothesis statistical testing are run to test and verify the results. The first alternative is the resampling, or bootstrap, method. The second method is the power analysis method.
The ANOVA method is a valid first step in analyzing any data set, but it makes the (sometimes untrue) assumption of normality. The resampling method does not make this assumption, which can help confirm or invalidate the results found in the ANOVA method.
price <- house$price
driveway <- house$driveway
preferred <- as.factor(house$prefarea)
summary(price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25000 49120 62000 68120 82000 190000
summary(driveway)
## no yes 0 1
## 0 0 77 469
summary(preferred)
## no yes
## 418 128
par(mfrow = c(1,2))
boxplot(price ~ driveway, main = "Driveway vs. Selling Price", xlab = "Driveway?", ylab = "Selling Price")
boxplot(price ~ preferred, main = "Location vs. Selling Price", xlab = "Preferred Area?", ylab = "Selling Price")
fit <- aov(price ~ driveway + as.factor(preferred))
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## driveway 1 3.432e+10 3.432e+10 57.37 1.57e-13 ***
## as.factor(preferred) 1 2.946e+10 2.946e+10 49.25 6.73e-12 ***
## Residuals 543 3.248e+11 5.982e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit)
## Analysis of Variance Table
##
## Response: price
## Df Sum Sq Mean Sq F value Pr(>F)
## driveway 1 3.4317e+10 3.4317e+10 57.367 1.567e-13 ***
## as.factor(preferred) 1 2.9464e+10 2.9464e+10 49.254 6.733e-12 ***
## Residuals 543 3.2482e+11 5.9820e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Calculated the required sample size using G*Power:
Required sample size in 120.
qqnorm(residuals(fit))
qqline(residuals(fit))
plot(fitted(fit), residuals(fit))
## a. Resampling Method
R = 10000
Fstar = numeric(R)
meanstar = with(house, tapply(price, driveway, mean))
levels(house$driveway) <- droplevels(house$driveway)
levels(house$driveway)
driveway <- house$driveway
grp0 = price[driveway == "0"] - meanstar[1]
grp1 = price[driveway == "1"] - meanstar[2]
simdriv = as.matrix(data.frame(driveway))
for (i in 1:R) {
group0 = sample(grp0, size = 77, replace = T)
group1 = sample(grp1, size = 469, replace = T)
simprice = as.matrix(data.frame(c(group0, group1)))
simdata = data.frame(simprice, simdriv)
Fstar[i] = t.test(simprice ~ simdriv, data = simdata)$statistic
}
par(mfrow = c(1,1))
hist(Fstar, prob = T, ylim = c(0, 0.8), xlim = c(0, 8), main = "F-distribution of the empirical and analytical results")
x = seq(0.25, 8, 0.25)
points(x, y=df(x, 2, 546), type = "b", col = "red")
print(realFstar <- t.test(price ~ driveway, data = house)$statistic)
mean(Fstar >= realFstar)
The second NHST alternative method used here was power analysis. Using a post-hoc test in G*Power, and the actual sample size of the data set (546 observations), it is possible to determine the actual power of the model.
Power Analysis in G*Power
As you can see, the actual power is very high (99.9%), meaning the probability of a beta error is quite low.
Fortune, Jenn, Alayna Gillespie, Ivana Pejakovic, and Anne Bergen. Resampling: Bootstrapping and Randomization. Psychology Statistics Department. University of Guelph, n.d. Web. 7 Nov. 2016.
The Housing data set can be found in the ‘Ecdat’ package in R, here: https://cran.r-project.org/web/packages/Ecdat/index.html
Full Commented Code:
library("Ecdat") #loads the Ecdat package, which contains multiple datasets
house <- Housing #assigns Housing dataset to a variable
levels(as.factor(house$driveway)) #shows all values of driveway column
#converts the yes/no levels of driveway to 1/0 values
levels(house$driveway) <- c(levels(house$driveway), 0, 1)
house$driveway[house$driveway=="yes"] <- 1
house$driveway[house$driveway=="no"] <- 0
summary(house$price) # shows summary stats for price variable
head(house) #prints the first 6 rows of the house data set
levels(as.factor(house$prefarea)) #gives the levels of the preferred area column
#assigns columns to variables (easier to deal with in code)
price <- house$price
driveway <- house$driveway
preferred <- as.factor(house$prefarea) #specified as factor b/c it's the blocking factor
summary(price)
summary(driveway)
summary(preferred)
par(mfrow = c(1,2)) # displays two plots side-by-side for comparison purposes
#displays boxplots of both driveway and preferred area of house versus the selling price
boxplot(price ~ driveway, main = "Driveway vs. Selling Price", xlab = "Driveway?", ylab = "Selling Price")
boxplot(price ~ preferred, main = "Location vs. Selling Price", xlab = "Preferred Area?", ylab = "Selling Price")
#defines a linear model of price in relation to having a driveway or not, blocked on being in a preferred area
fit <- aov(price ~ driveway + preferred)
summary(fit)
anova(fit) # analysis of variance of linear model. assumes normality
#plots the residuals to check for normality
qqnorm(residuals(fit))
qqline(residuals(fit))
plot(fitted(fit), residuals(fit))
# resampling method, samples 10,000 times to create a new sample data set based on the F distribution with a sample F value (Fstar)
R = 10000
Fstar = numeric(R)
meanstar = with(house, tapply(price, driveway, mean))
levels(house$driveway) <- droplevels(house$driveway)
levels(house$driveway)
grp0 = price[driveway == "0"] - meanstar[1]
grp1 = price[driveway == "1"] - meanstar[2]
simdriv = driveway
#loops through 10,000 times to get sample data
for (i in 1:R) {
group0 = sample(grp0, size = 77, replace = T)
group1 = sample(grp1, size = 469, replace = T)
simprice = c(group0, group1)
simdata = data.frame(simprice, simdriv)
Fstar[i] = t.test(simprice, simdriv, data = simdata)$statistic
}
par(mfrow = c(1,1)) # resets plots field to just show one at a time
# graphs the new data F distribution
hist(Fstar, prob = T, ylim = c(0, 0.8), xlim = c(0, 8), main = "F-distribution of the empirical and analytical results")
x = seq(0.25, 8, 0.25)
points(x, y=df(x, 2, 546), type = "b", col = "red")
# tests the actual data
print(realFstar <- oneway.test(price ~ driveway, var.equal = T, data = house)$statistic)
# compares F values
mean(Fstar >= realFstar)