Author: Barnabás Piller
The data set examined here contains information on homes sold in King County, Washington, United States, between May 2014 and May 2015. My aim is to answer the following research question: did the average selling price of houses differ between May 2014 and May 2015?
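The null and alternative hypotheses can be defined as follows:
\(H_0: \mu_{2014} = \mu_{2015}\)
\(H_1: \mu_{2014} \neq \mu_{2015}\)
Here \(\mu_{2014}\) and \(\mu_{2015}\) denote the average selling prices of homes sold in May 2014 and May 2015, respectively.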
To answer my research question, I need the selling prices of the houses in USD, as well as a date variable indicating when the houses were sold to identify the two groups to be examined.
I start off by importing the data and making the necessary adjustments:
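library(dplyr)      # filter() and the %>% pipe
library(psych)      # describe()
library(ggplot2)    # histograms
library(ggpubr)     # ggarrange()
library(effectsize) # effectsize(), interpret_rank_biserial()
houses <- read.csv("C:/WU/BBE/Sixth Semester/Applied Data Analysis with R/Homework 2/kc_house_data.csv")
houses <- houses[, 1:3]
# Drop the trailing 7 characters of the raw date string, then parse it as a Date
houses$date <- substr(houses$date, start = 1, stop = nchar(houses$date) - 7)
houses$date <- as.Date(houses$date, format = "%Y%m%d")
houses_2014 <- houses %>% filter(format(date, "%Y-%m") == "2014-05")
houses_2015 <- houses %>% filter(format(date, "%Y-%m") == "2015-05")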
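To make the date handling above concrete, here is a minimal sketch in base R. It assumes the raw `date` column holds strings of the form "20140512T000000"; this example value is illustrative and not copied from the file.

```r
# Illustrative raw value; the exact format in the csv file is an assumption.
raw_date <- "20140512T000000"

# Keep everything except the last 7 characters ("T000000").
trimmed <- substr(raw_date, start = 1, stop = nchar(raw_date) - 7)

# Parse the remaining "YYYYMMDD" string into a Date object.
parsed <- as.Date(trimmed, format = "%Y%m%d")

# Year-month key of the kind used to select the May 2014 group.
month_key <- format(parsed, "%Y-%m")
print(parsed)
print(month_key)
```

Comparing the formatted year-month key against "2014-05" or "2015-05" is what separates the two groups.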
Let us have a look at some entries from May 2014:
head(houses_2014)
## id date price
## 1 7237550310 2014-05-12 1225000
## 2 9212900260 2014-05-27 468000
## 3 114101516 2014-05-28 310000
## 4 6865200140 2014-05-29 485000
## 5 6300500875 2014-05-14 385000
## 6 8091400200 2014-05-16 252700
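The variables are defined as follows:
id: identification number of the home
date: date on which the home was sold
price: selling price of the home in USD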
The units of observation are the homes that were sold in King County in May 2014 and May 2015, with sample sizes of 1768 and 646, respectively.
The csv file containing the data set can be found on Kaggle using the following link: https://www.kaggle.com/datasets/harlfoxem/housesalesprediction/data.
Let us look at some descriptive statistics of the data from May 2014:
describe(houses_2014$price)
## vars n mean sd median trimmed mad min max range
## X1 1 1768 548080.3 356502.8 465000 490461.7 229803 78000 3710000 3632000
## skew kurtosis se
## X1 2.83 12.85 8478.55
The average selling price of a home in May 2014 was approximately $548,080.
table(houses_2014$date)
##
## 2014-05-02 2014-05-03 2014-05-04 2014-05-05 2014-05-06 2014-05-07 2014-05-08
## 67 4 5 84 83 93 81
## 2014-05-09 2014-05-10 2014-05-11 2014-05-12 2014-05-13 2014-05-14 2014-05-15
## 81 5 2 80 86 81 82
## 2014-05-16 2014-05-17 2014-05-18 2014-05-19 2014-05-20 2014-05-21 2014-05-22
## 73 1 7 83 116 94 91
## 2014-05-23 2014-05-24 2014-05-25 2014-05-26 2014-05-27 2014-05-28 2014-05-29
## 84 11 5 8 104 111 75
## 2014-05-30 2014-05-31
## 65 6
The highest number of sales occurred on 20 May (116 homes).
Now, we examine the data from May 2015:
describe(houses_2015$price)
## vars n mean sd median trimmed mad min max range skew
## X1 1 646 558126.8 414822 455000 485645.4 207564 95000 4208000 4113000 4.03
## kurtosis se
## X1 23.07 16320.95
The average selling price of a home in May 2015 was approximately $558,127.
table(houses_2015$date)
##
## 2015-05-01 2015-05-02 2015-05-03 2015-05-04 2015-05-05 2015-05-06 2015-05-07
## 77 6 10 102 94 88 76
## 2015-05-08 2015-05-09 2015-05-10 2015-05-11 2015-05-12 2015-05-13 2015-05-14
## 54 3 2 40 49 31 11
## 2015-05-15 2015-05-24 2015-05-27
## 1 1 1
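The highest number of sales occurred on 4 May (102 homes). Note that almost all recorded sales in May 2015 fall on or before 15 May, so the 2015 sample effectively covers only the first half of the month.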
Since I am comparing the arithmetic means of two independent samples, namely the prices of the houses sold in May 2014 and in May 2015, I will use the independent-samples t-test with Welch's correction to account for the unequal variances of the two samples. If the assumptions of this parametric test are violated, I will use the non-parametric Wilcoxon Rank Sum Test instead.
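As an illustration of what Welch's correction computes, the following sketch derives the test statistic by hand from the rounded summary statistics reported by describe() above. It is not part of the original analysis, the variable names are hypothetical, and because it works from rounded values it will differ slightly from a t.test() call on the raw data.

```r
# Summary statistics as reported by describe() (rounded values).
n_2014 <- 1768; mean_2014 <- 548080.3; se_2014 <- 8478.55
n_2015 <- 646;  mean_2015 <- 558126.8; se_2015 <- 16320.95

# Welch t statistic: difference in means over the combined standard error.
# Each se is sd / sqrt(n), so se^2 equals s^2 / n.
se_diff <- sqrt(se_2014^2 + se_2015^2)
t_stat  <- (mean_2015 - mean_2014) / se_diff

# Welch-Satterthwaite approximation of the degrees of freedom.
df_welch <- se_diff^4 /
  (se_2014^4 / (n_2014 - 1) + se_2015^4 / (n_2015 - 1))

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
print(c(t = t_stat, df = df_welch, p = p_value))
```

The unequal group variances enter through the separate standard errors and through the reduced degrees of freedom, which is what distinguishes this from the pooled-variance t-test.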
Since the independent samples t-test is a parametric test, a number of assumptions must be satisfied for the test to be valid. Firstly, the variable of interest - in our case, the selling price - must be numeric. We have already seen that this is the case. Secondly, the variable must be normally distributed in both populations. To test this assumption, I first plot both samples as histograms:
plot_2014 <- ggplot(houses_2014, aes(x = price)) +
geom_histogram(bins = 70, fill = "springgreen2") +
theme_minimal() +
labs(x = "Selling price ($)", title = "Homes sold in May 2014")
plot_2015 <- ggplot(houses_2015, aes(x = price)) +
geom_histogram(bins = 70, fill = "red3") +
theme_minimal() +
labs(x = "Selling price ($)", title = "Homes sold in May 2015")
ggarrange(plot_2014, plot_2015, ncol = 2, nrow = 1)
As we can see, both samples are heavily skewed to the right. We may also perform the Shapiro-Wilk normality test:
shapiro.test(houses_2014$price)
##
## Shapiro-Wilk normality test
##
## data: houses_2014$price
## W = 0.76469, p-value < 2.2e-16
shapiro.test(houses_2015$price)
##
## Shapiro-Wilk normality test
##
## data: houses_2015$price
## W = 0.63624, p-value < 2.2e-16
Since both p-values are below 0.001, we reject the null hypothesis of normality for each sample and conclude that the data are not normally distributed. Therefore, we must use the non-parametric Wilcoxon Rank Sum Test.
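The decision rule applied here (use the Welch t-test only if normality is not rejected in either sample, otherwise fall back to the Wilcoxon Rank Sum Test) can be sketched as a small helper. The function name `choose_two_sample_test` and the simulated log-normal data are hypothetical and serve only as an illustration; they are not the house price data.

```r
# Hypothetical helper: pick a two-sample test based on Shapiro-Wilk results.
choose_two_sample_test <- function(x, y, alpha = 0.05) {
  # Normality is treated as plausible when Shapiro-Wilk does not reject it.
  normal_x <- shapiro.test(x)$p.value >= alpha
  normal_y <- shapiro.test(y)$p.value >= alpha
  if (normal_x && normal_y) {
    t.test(x, y, var.equal = FALSE)  # Welch t-test
  } else {
    wilcox.test(x, y, paired = FALSE, exact = FALSE, correct = FALSE)
  }
}

# Simulated right-skewed samples (an assumption, for illustration only).
set.seed(42)
skewed_a <- rlnorm(300, meanlog = 13, sdlog = 0.6)
skewed_b <- rlnorm(200, meanlog = 13, sdlog = 0.7)

result <- choose_two_sample_test(skewed_a, skewed_b)
print(result$method)
```

Note that shapiro.test() only accepts between 3 and 5000 observations, which both samples in this analysis satisfy.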
Performing the appropriate test:
wilcox.test(houses_2014$price, houses_2015$price, paired = FALSE, correct = FALSE, exact = FALSE, alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: houses_2014$price and houses_2015$price
## W = 568946, p-value = 0.8889
## alternative hypothesis: true location shift is not equal to 0
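Since the p-value is greater than 0.05, we fail to reject the null hypothesis: at a significance level of 5%, we cannot conclude that selling prices differed between the two months. Note that the Wilcoxon Rank Sum Test compares the location of the two distributions rather than their means directly.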
We can calculate the effect size to measure the magnitude of the difference between the two samples:
effectsize(wilcox.test(houses_2014$price, houses_2015$price, paired = FALSE, correct = FALSE, exact = FALSE, alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ---------------------------------
## -3.71e-03 | [-0.06, 0.05]
interpret_rank_biserial(3.71e-03)
## [1] "tiny"
## (Rules: funder2019)
Unsurprisingly, the effect size is classified as ‘tiny’, and its 95% confidence interval includes zero.
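As a plausibility check, the rank-biserial correlation can be recovered by hand from the test output. The sketch below assumes that the W reported by wilcox.test() is the Mann-Whitney U statistic of the first sample, in which case \(r = 2W / (n_1 n_2) - 1\). The variable names are hypothetical; the numbers are taken from the output above.

```r
# Values taken from the wilcox.test() output and the sample sizes above.
W  <- 568946  # test statistic for the May 2014 sample
n1 <- 1768    # homes sold in May 2014
n2 <- 646     # homes sold in May 2015

# Rank-biserial correlation: share of favourable pairs minus unfavourable pairs.
r_rb <- 2 * W / (n1 * n2) - 1
print(signif(r_rb, 3))
```

A value this close to zero means that a randomly chosen May 2014 price is about equally likely to be above or below a randomly chosen May 2015 price.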
Having performed the Wilcoxon Rank Sum Test, we conclude that there is insufficient evidence to reject the null hypothesis (p-value = 0.89 at \(\alpha\) = 0.05); we cannot say that the average selling price of a home in May 2014 was different from that in May 2015.