Author: Barnabás Piller


1 Research question (2.5 points)

1.1 Clear research question that can be tested statistically (1.5 points)

The data set examined here contains information on homes sold between May 2014 and May 2015, in King County, Washington, United States. In the following analysis, my aim is to answer the following research question: were the average selling prices of houses different in May 2014 and May 2015?

The null and alternative hypotheses can be defined as follows:

  • \(H_{0}\): The mean selling price of the homes was equal in the two periods
  • \(H_{1}\): The mean selling price of the homes was different in the two periods

1.2 Which variables need to be collected to answer your research question (1 point)

To answer my research question, I need the selling prices of the houses in USD, as well as a date variable indicating when the houses were sold to identify the two groups to be examined.

2 Data (2.5 points)

2.1 Import of the data, presentation of the data using a function head, definition of the variables used (0.5 points)

I start off by importing the data and making the necessary adjustments:

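library(dplyr)       # filter() and %>%
library(psych)       # describe()
library(ggplot2)     # histograms
library(ggpubr)      # ggarrange()
library(effectsize)  # effectsize(), interpret_rank_biserial()

houses <- read.csv("C:/WU/BBE/Sixth Semester/Applied Data Analysis with R/Homework 2/kc_house_data.csv")
houses <- houses[, 1:3]  # keep only id, date and price

# Drop the trailing 7-character time suffix, then parse the remaining YYYYMMDD
houses$date <- substr(houses$date, start = 1, stop = nchar(houses$date) - 7)
houses$date <- as.Date(houses$date, format = "%Y%m%d")

houses_2014 <- houses %>% filter(format(date, "%Y-%m") == "2014-05")
houses_2015 <- houses %>% filter(format(date, "%Y-%m") == "2015-05")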
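
As a minimal sketch of the date handling above, the following applies the same two steps to a few made-up raw strings. The sample values are hypothetical and only assume that each raw date ends in a seven-character time suffix, as the `nchar(...) - 7` offset implies.

```r
# Hypothetical raw date strings (assumed format: YYYYMMDD plus a 7-character suffix)
raw_dates <- c("20140502T000000", "20150504T000000", "20141013T000000")

# Step 1: drop the last seven characters, keeping only YYYYMMDD
trimmed <- substr(raw_dates, start = 1, stop = nchar(raw_dates) - 7)

# Step 2: parse the remaining eight digits as a Date
parsed <- as.Date(trimmed, format = "%Y%m%d")

# Year-month key used to select May 2014 and May 2015
month_key <- format(parsed, "%Y-%m")
is_may_2014 <- month_key == "2014-05"
is_may_2015 <- month_key == "2015-05"

print(data.frame(raw_dates, parsed, is_may_2014, is_may_2015))
```

Only the first string is flagged as May 2014 and only the second as May 2015; the third falls in neither group.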

Let us have a look at some entries from May 2014:

head(houses_2014)
##           id       date   price
## 1 7237550310 2014-05-12 1225000
## 2 9212900260 2014-05-27  468000
## 3  114101516 2014-05-28  310000
## 4 6865200140 2014-05-29  485000
## 5 6300500875 2014-05-14  385000
## 6 8091400200 2014-05-16  252700

The variables are defined as follows:

  • id: Unique ID for each home sold
  • date: Date of the sale
  • price: Selling price of the home (in USD)

2.2 Definition of the unit of observation and the sample size (0.5 points)

The units of observation are the homes that were sold in King County in May 2014 and May 2015, with sample sizes of 1768 and 646, respectively.

2.3 Source of the data set (0.5 points)

The CSV file containing the data set can be found on Kaggle using the following link: https://www.kaggle.com/datasets/harlfoxem/housesalesprediction/data.

2.4 Basic descriptive statistics (1 point) - estimate a few parameters (e.g., functions summary, describe, etc.) and explanation

Let us look at some descriptive statistics of the data from May 2014:

describe(houses_2014$price)
##    vars    n     mean       sd median  trimmed    mad   min     max   range
## X1    1 1768 548080.3 356502.8 465000 490461.7 229803 78000 3710000 3632000
##    skew kurtosis      se
## X1 2.83    12.85 8478.55

The average selling price of a home in May 2014 was $548,080, well above the median of $465,000, which already hints at a right-skewed distribution.

table(houses_2014$date)
## 
## 2014-05-02 2014-05-03 2014-05-04 2014-05-05 2014-05-06 2014-05-07 2014-05-08 
##         67          4          5         84         83         93         81 
## 2014-05-09 2014-05-10 2014-05-11 2014-05-12 2014-05-13 2014-05-14 2014-05-15 
##         81          5          2         80         86         81         82 
## 2014-05-16 2014-05-17 2014-05-18 2014-05-19 2014-05-20 2014-05-21 2014-05-22 
##         73          1          7         83        116         94         91 
## 2014-05-23 2014-05-24 2014-05-25 2014-05-26 2014-05-27 2014-05-28 2014-05-29 
##         84         11          5          8        104        111         75 
## 2014-05-30 2014-05-31 
##         65          6

The most houses were sold on the 20th of May (116).

Now, we examine the data from May 2015:

describe(houses_2015$price)
##    vars   n     mean     sd median  trimmed    mad   min     max   range skew
## X1    1 646 558126.8 414822 455000 485645.4 207564 95000 4208000 4113000 4.03
##    kurtosis       se
## X1    23.07 16320.95

The average selling price of a home in May 2015 was $558,127, again well above the median of $455,000.

table(houses_2015$date)
## 
## 2015-05-01 2015-05-02 2015-05-03 2015-05-04 2015-05-05 2015-05-06 2015-05-07 
##         77          6         10        102         94         88         76 
## 2015-05-08 2015-05-09 2015-05-10 2015-05-11 2015-05-12 2015-05-13 2015-05-14 
##         54          3          2         40         49         31         11 
## 2015-05-15 2015-05-24 2015-05-27 
##          1          1          1

The most houses were sold on the 4th of May (102).
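
The se column reported by describe can be cross-checked by hand. The sketch below assumes that se is the usual standard error of the mean, sd / sqrt(n), and plugs in the rounded values printed above, so small rounding differences are to be expected.

```r
# Values copied from the describe() output above (already rounded)
sd_2014 <- 356502.8; n_2014 <- 1768
sd_2015 <- 414822;   n_2015 <- 646

# Standard error of the mean: sd / sqrt(n)
se_2014 <- sd_2014 / sqrt(n_2014)
se_2015 <- sd_2015 / sqrt(n_2015)

# Compare with the reported se values (8478.55 and 16320.95)
print(round(c(se_2014 = se_2014, se_2015 = se_2015), 2))
```

The smaller May 2015 sample has roughly twice the standard error, so its mean is estimated less precisely.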

3 Analysis (7.5 points)

3.1 Determine which statistical test to use and why (1 point)

Since I am comparing the arithmetic means of two independent populations, namely the prices of the houses sold in May 2014 and May 2015, I will use the independent samples t-test with Welch correction to account for the unequal variances of the two samples. If the assumptions of this parametric test are violated, I will use the non-parametric Wilcoxon Rank Sum Test instead.
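
For reference, here is a minimal sketch of how the Welch test could be run. It uses simulated right-skewed prices rather than the real data; the group sizes match the samples above, but the rlnorm parameters and the seed are arbitrary assumptions.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Simulated stand-ins for the two price samples (not the real data)
sim_2014 <- rlnorm(1768, meanlog = 13, sdlog = 0.5)
sim_2015 <- rlnorm(646,  meanlog = 13, sdlog = 0.6)

# var.equal = FALSE (the default) requests the Welch correction
welch <- t.test(sim_2014, sim_2015, var.equal = FALSE, alternative = "two.sided")

print(welch$method)     # name of the test that was run
print(welch$parameter)  # Welch-adjusted degrees of freedom
```

The adjusted degrees of freedom are generally not an integer and never exceed n1 + n2 - 2, which is how the correction accounts for unequal variances.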

3.2 Evaluate all assumptions (1.5 points)

Since the independent samples t-test is a parametric test, a number of assumptions must be satisfied for the test to be valid. Firstly, the variable of interest - in our case, the selling price - must be numeric. We have already seen that this is the case. Secondly, the variable must be normally distributed in both populations. To test this assumption, I first plot both samples as histograms:

plot_2014 <- ggplot(houses_2014, aes(x = price)) +
  geom_histogram(bins = 70, fill = "springgreen2") +
  theme_minimal() +
  labs(x = "Selling price ($)", title = "Homes sold in May 2014")

plot_2015 <- ggplot(houses_2015, aes(x = price)) +
  geom_histogram(bins = 70, fill = "red3") +
  theme_minimal() +
  labs(x = "Selling price ($)", title = "Homes sold in May 2015")

ggarrange(plot_2014, plot_2015, ncol = 2, nrow = 1)

As we can see, both samples are heavily skewed to the right. We may also perform the Shapiro-Wilk normality test:

shapiro.test(houses_2014$price)
## 
##  Shapiro-Wilk normality test
## 
## data:  houses_2014$price
## W = 0.76469, p-value < 2.2e-16
shapiro.test(houses_2015$price)
## 
##  Shapiro-Wilk normality test
## 
## data:  houses_2015$price
## W = 0.63624, p-value < 2.2e-16

Since both p-values are below 0.001, we reject the null hypothesis of normality for both samples and conclude that the prices are not normally distributed. Therefore, I use the non-parametric Wilcoxon Rank Sum Test.

3.3 Perform the appropriate statistical test based on the results of the assumption evaluation and its interpretation (2.5 points)

Performing the appropriate test:

wilcox.test(houses_2014$price, houses_2015$price, paired = FALSE, correct = FALSE, exact = FALSE, alternative = "two.sided")
## 
##  Wilcoxon rank sum test
## 
## data:  houses_2014$price and houses_2015$price
## W = 568946, p-value = 0.8889
## alternative hypothesis: true location shift is not equal to 0

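Since the p-value (0.889) is greater than 0.05, we fail to reject the null hypothesis at a significance level of 5%. Note that the Wilcoxon test compares the locations of the two price distributions rather than their means directly, so the result says that we cannot conclude that typical selling prices differed between the two periods.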
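
To make the reported W more concrete, here is a small sketch on made-up numbers without ties. It assumes the usual definition used by wilcox.test: the sum of the ranks of the first sample in the pooled data, minus n1(n1 + 1)/2.

```r
# Toy data, chosen without ties (not taken from the housing data)
x <- c(1.1, 2.3, 4.0)
y <- c(0.5, 3.2, 5.1, 6.0)

# Rank all observations together, then keep the ranks belonging to x
pooled_ranks <- rank(c(x, y))
ranks_x <- pooled_ranks[seq_along(x)]

# W = rank sum of x minus its smallest possible value n1 * (n1 + 1) / 2
n1 <- length(x)
w_manual <- sum(ranks_x) - n1 * (n1 + 1) / 2

# The same statistic as computed by wilcox.test()
w_builtin <- unname(wilcox.test(x, y)$statistic)

print(c(manual = w_manual, builtin = w_builtin))
```

Equivalently, W counts the pairs in which a value from the first sample exceeds a value from the second.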

3.4 Calculation of the effect size and its interpretation (2.5 points)

We can calculate the effect size to measure the magnitude of the difference between the two samples:

effectsize(wilcox.test(houses_2014$price, houses_2015$price, paired = FALSE, correct = FALSE, exact = FALSE, alternative = "two.sided"))
## r (rank biserial) |        95% CI
## ---------------------------------
## -3.71e-03         | [-0.06, 0.05]
interpret_rank_biserial(3.71e-03)
## [1] "tiny"
## (Rules: funder2019)

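Unsurprisingly, the effect size is classified as ‘tiny’, and its 95% confidence interval [-0.06, 0.05] includes zero, which is consistent with the non-significant test result.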
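
The reported coefficient can be reproduced by hand. The sketch assumes the common formula r = 2W / (n1 n2) - 1 for the rank-biserial correlation and uses W and the sample sizes reported above.

```r
# Values taken from the test output and sample sizes above
w  <- 568946
n1 <- 1768  # May 2014
n2 <- 646   # May 2015

# Share of (2014, 2015) pairs in which the 2014 price is higher
# (ties count as one half)
share_2014_higher <- w / (n1 * n2)

# Rank-biserial correlation: 2 * share - 1, ranging from -1 to 1
r_rb <- 2 * share_2014_higher - 1

print(signif(r_rb, 3))
```

A share very close to one half means that a randomly chosen May 2014 home was about as likely to sell for more as for less than a randomly chosen May 2015 home.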

4 Conclusion (2.5 points)

Clear answer to your research question based on the results of the statistical test performed (2.5 points)

Having performed the Wilcoxon Rank Sum Test, we conclude that there is not sufficient evidence to reject the null hypothesis (p-value = 0.89 at \(\alpha\) = 0.05), and we cannot say that the average selling price of a home in May 2014 was different from the average selling price of a home in May 2015.