knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
data <- read.table("./Housing.csv", header = TRUE, sep = ",")
head(data)
## price area bedrooms bathrooms stories mainroad guestroom basement
## 1 13300000 7420 4 2 3 yes no no
## 2 12250000 8960 4 4 4 yes no no
## 3 12250000 9960 3 2 2 yes no yes
## 4 12215000 7500 4 2 2 yes no yes
## 5 11410000 7420 4 1 2 yes yes yes
## 6 10850000 7500 3 3 1 yes no yes
## hotwaterheating airconditioning parking prefarea furnishingstatus
## 1 no yes 2 yes furnished
## 2 no yes 3 no furnished
## 3 no no 2 yes semi-furnished
## 4 no yes 3 yes furnished
## 5 no yes 2 no furnished
## 6 no yes 2 yes semi-furnished
Each row represents a unique property with its corresponding attributes and price.
The dataset contains 545 observations.
Variable | Type | Description | Unit of Measurement |
---|---|---|---|
price | Numerical (Ratio) | The price of the property. | Currency (e.g., USD, EUR) |
area | Numerical (Ratio) | The area of the property. | Square feet (ft²) |
bedrooms | Numerical (Ratio) | The number of bedrooms in the property. | Count |
bathrooms | Numerical (Ratio) | The number of bathrooms in the property. | Count |
stories | Numerical (Ratio) | The number of stories in the property. | Count |
mainroad | Categorical (Nominal) | Whether the property is located on a main road. Possible values: yes, no. | N/A |
guestroom | Categorical (Nominal) | Whether the property has a guestroom. Possible values: yes, no. | N/A |
basement | Categorical (Nominal) | Whether the property has a basement. Possible values: yes, no. | N/A |
hotwaterheating | Categorical (Nominal) | Whether the property has hot water heating. Possible values: yes, no. | N/A |
airconditioning | Categorical (Nominal) | Whether the property has air conditioning. Possible values: yes, no. | N/A |
parking | Numerical (Ratio) | The number of parking spaces available at the property. | Count |
prefarea | Categorical (Nominal) | Whether the property is located in a preferred area. Possible values: yes, no. | N/A |
furnishingstatus | Categorical (Nominal) | The furnishing status of the property. Possible values: furnished, semi-furnished, unfurnished. | N/A |
The dataset is obtained from Kaggle: Housing Prices Dataset
Since the ideal sample size for hypothesis testing here ranges from 50 to 100 observations, I randomly select 75 rows from my dataset.
The code below sets a seed and saves the selection so that the same rows are used every time the document is run or knitted, which keeps the analysis reproducible.
set.seed(3)
random_rows <- data[sample(nrow(data), 75), ]
saveRDS(random_rows, "random_sample.rds")
data <- readRDS("random_sample.rds")
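As a quick check of the reproducibility claim, the sketch below repeats a seeded draw on a toy data frame and confirms that both draws pick the same rows. The `toy` data frame and the `draw_sample()` helper are hypothetical stand-ins and not part of the original analysis.

```r
# Hypothetical stand-in for the housing data: 545 rows with made-up prices.
toy <- data.frame(id = 1:545,
                  price = seq(1e6, by = 1e4, length.out = 545))

# Hypothetical helper: set the seed, then draw n rows without replacement.
draw_sample <- function(df, n, seed) {
  set.seed(seed)
  df[sample(nrow(df), n), ]
}

first_draw  <- draw_sample(toy, 75, seed = 3)
second_draw <- draw_sample(toy, 75, seed = 3)

# Same seed, same rows: the two samples are identical.
identical(first_draw, second_draw)
```

Saving the drawn rows with `saveRDS()` adds a second layer of protection, since the stored sample no longer depends on the random number generator at all.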
library(dplyr)
library(tidyr)
data <- data %>% drop_na()
data$mainroad <- factor(data$mainroad,
                        levels = c("yes", "no"),
                        labels = c("yes", "no"))
For hypothesis testing, no additional data manipulation was necessary.
library(psych)
describeBy(data$price, data$mainroad)
##
## Descriptive statistics by group
## group: yes
## vars n mean sd median trimmed mad min max range skew
## X1 1 65 5357414 1653151 5250000 5270472 1452948 2380000 9681000 7301000 0.46
## kurtosis se
## X1 -0.17 205048.2
## ------------------------------------------------------------
## group: no
## vars n mean sd median trimmed mad min max range
## X1 1 10 3638600 1006026 3500000 3686375 908092.5 1750000 5145000 3395000
## skew kurtosis se
## X1 -0.14 -0.9 318133.2
Is there a significant difference in housing prices between properties located on a main road and those not located on a main road?
\[ H_0: \mu_{\text{mainroad}} = \mu_{\text{no_mainroad}} \]
\[ H_1: \mu_{\text{mainroad}} \neq \mu_{\text{no_mainroad}} \]
I am working with independent samples, as the data belong to two distinct groups, with each unit measured once.
Below, I will review all the necessary assumptions for the application of parametric tests.
Variable is numeric: The variable of interest, price, is numeric.
Normality - The distribution of the variable is normal in both populations:
library(ggplot2)
# Histogram of prices for houses located on a main road
Mainroad <- ggplot(data[data$mainroad == "yes", ], aes(x = price)) +
  theme_linedraw() +
  geom_histogram(binwidth = 200000, col = "black", fill = "royalblue2") +
  ylab("Frequency") +
  ggtitle("Main road")
# Histogram of prices for houses not located on a main road
No_Mainroad <- ggplot(data[data$mainroad == "no", ], aes(x = price)) +
  theme_linedraw() +
  geom_histogram(binwidth = 200000, col = "black", fill = "violetred2") +
  ylab("Frequency") +
  ggtitle("No main road")
library(ggpubr)
# Show both histograms side by side
ggarrange(Mainroad, No_Mainroad,
          ncol = 2, nrow = 1)
The histograms do not clearly show a normal distribution in either group, so I will use additional checks (Q-Q plots and the Shapiro-Wilk test) to assess normality.
library(ggpubr)
ggqqplot(data,
         "price",
         facet.by = "mainroad")
Since almost all the points lie within the grey confidence band, the Q-Q plots are consistent with normally distributed prices in both groups.
\[ H_0: \text{The price for houses located on a main road follows a normal distribution.} \]
\[ H_1: \text{The price for houses located on a main road does not follow a normal distribution.} \]
shapiro.test(data[data$mainroad == "yes", ]$price)
##
## Shapiro-Wilk normality test
##
## data: data[data$mainroad == "yes", ]$price
## W = 0.97688, p-value = 0.2631
Since the p-value is greater than the significance level of 0.05, we fail to reject the null hypothesis. Therefore, we can assume normality for the houses located on a main road.
\[ H_0: \text{The price for houses not located on a main road follows a normal distribution.} \]
\[ H_1: \text{The price for houses not located on a main road does not follow a normal distribution.} \]
shapiro.test(data[data$mainroad == "no", ]$price)
##
## Shapiro-Wilk normality test
##
## data: data[data$mainroad == "no", ]$price
## W = 0.96522, p-value = 0.8433
Since the p-value is greater than the significance level of 0.05, we fail to reject the null hypothesis. Therefore, we can assume normality for the houses not located on a main road.
In conclusion, we can assume that the distribution of the variable is normal in both populations.
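The two Shapiro-Wilk tests can also be run in one pass by splitting the variable by group. The sketch below uses simulated prices (an assumption: the group sizes 65 and 10 mirror the sample, the values do not), so its p-values are illustrative only and say nothing about the real data.

```r
set.seed(1)  # arbitrary seed for the simulated data

# Simulated stand-in for the sampled data (not the real prices)
toy <- data.frame(
  price    = c(rnorm(65, mean = 5.4e6, sd = 1.6e6),
               rnorm(10, mean = 3.6e6, sd = 1.0e6)),
  mainroad = factor(rep(c("yes", "no"), times = c(65, 10)),
                    levels = c("yes", "no"))
)

# One Shapiro-Wilk p-value per group
p_values <- sapply(split(toy$price, toy$mainroad),
                   function(x) shapiro.test(x)$p.value)
p_values
```

Each p-value is then compared with the 0.05 significance level, exactly as in the two separate tests.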
The data come from two independent populations: houses located on a main road and houses not located on a main road.
The variable has the same variance in both populations:
\[ H_0: \sigma^2_{\text{mainroad}} = \sigma^2_{\text{no_mainroad}} \]
\[ H_1: \sigma^2_{\text{mainroad}} \neq \sigma^2_{\text{no_mainroad}} \]
library(car)
leveneTest(data$price, group = data$mainroad)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 3.0408 0.08541 .
## 73
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value is greater than the significance level of 0.05, we fail to reject the null hypothesis. This suggests that there is no significant evidence to conclude that the variances are different between the houses located on a main road and those not located on a main road.
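With `center = median`, Levene's test amounts to a one-way ANOVA on the absolute deviations of each observation from its group median (the Brown-Forsythe variant). The base-R sketch below illustrates this on simulated data. The vectors are assumptions that only copy the group sizes of the sample, so the resulting F value will differ from the one reported above.

```r
set.seed(1)  # arbitrary seed for the simulated data

# Simulated stand-in for the sampled prices and groups
price    <- c(rnorm(65, mean = 5.4e6, sd = 1.6e6),
              rnorm(10, mean = 3.6e6, sd = 1.0e6))
mainroad <- factor(rep(c("yes", "no"), times = c(65, 10)),
                   levels = c("yes", "no"))

# Absolute deviation of each price from its own group median
abs_dev <- abs(price - ave(price, mainroad, FUN = median))

# One-way ANOVA on the deviations; its F test is the Levene statistic
levene_by_hand <- anova(lm(abs_dev ~ mainroad))
levene_by_hand[1, c("Df", "F value", "Pr(>F)")]
```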
I will now proceed with performing the Independent t-test.
t.test(data$price ~ data$mainroad,
       var.equal = TRUE,
       alternative = "two.sided")
##
## Two Sample t-test
##
## data: data$price by data$mainroad
## t = 3.1871, df = 73, p-value = 0.002116
## alternative hypothesis: true difference in means between group yes and group no is not equal to 0
## 95 percent confidence interval:
## 643969.7 2793659.2
## sample estimates:
## mean in group yes mean in group no
## 5357414 3638600
We reject the null hypothesis (p = 0.002). This suggests that there is a statistically significant difference in the prices between houses located on a main road and those not located on a main road, in the population from which the sample was drawn.
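To make the test statistic less of a black box, the sketch below recomputes the equal-variance t-test from the group summaries printed by `describeBy()`. The variable names are illustrative, and because the inputs are rounded as printed, the results should agree with the `t.test()` output only approximately.

```r
# Group summaries copied from the describeBy() output (rounded as printed)
n_yes <- 65; mean_yes <- 5357414; sd_yes <- 1653151
n_no  <- 10; mean_no  <- 3638600; sd_no  <- 1006026

df <- n_yes + n_no - 2

# Pooled standard deviation under the equal-variance assumption
sd_pooled <- sqrt(((n_yes - 1) * sd_yes^2 + (n_no - 1) * sd_no^2) / df)

# Standard error of the difference in means
se_diff <- sd_pooled * sqrt(1 / n_yes + 1 / n_no)

t_stat  <- (mean_yes - mean_no) / se_diff
p_value <- 2 * pt(-abs(t_stat), df)                        # two-sided p-value
ci      <- (mean_yes - mean_no) + c(-1, 1) * qt(0.975, df) * se_diff

round(t_stat, 2)     # about 3.19, in line with the reported t = 3.1871
signif(p_value, 2)   # about 0.0021
round(ci)            # close to the reported 95% confidence interval
```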
library(effectsize)
effectsize::cohens_d(data$price ~ data$mainroad,
                     pooled_sd = FALSE)
## Cohen's d | 95% CI
## ------------------------
## 1.26 | [0.56, 1.93]
##
## - Estimated using un-pooled SD.
interpret_cohens_d(1.26, rules = "sawilowsky2009")
## [1] "very large"
## (Rules: sawilowsky2009)
The difference in housing prices is very large (d = 1.26).
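The Cohen's d reported by `cohens_d()` can be reproduced from the group summaries. This sketch assumes that the un-pooled standard deviation is the square root of the average of the two group variances, which matches the reported output. The variable names are illustrative.

```r
# Group summaries copied from the describeBy() output (rounded as printed)
mean_yes <- 5357414; sd_yes <- 1653151
mean_no  <- 3638600; sd_no  <- 1006026

# Un-pooled SD: square root of the average of the two group variances
sd_unpooled <- sqrt((sd_yes^2 + sd_no^2) / 2)

# Cohen's d: difference in means expressed in standard deviation units
d <- (mean_yes - mean_no) / sd_unpooled
round(d, 2)   # about 1.26, matching the cohens_d() output
```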
wilcox.test(data$price ~ data$mainroad, alternative = "two.sided")
##
## Wilcoxon rank sum test with continuity correction
##
## data: data$price by data$mainroad
## W = 534, p-value = 0.001152
## alternative hypothesis: true location shift is not equal to 0
We reject the null hypothesis (p = 0.001). This suggests that there is a statistically significant difference in the prices between houses located on a main road and those not located on a main road, in the population from which the sample was drawn.
library(effectsize)
effectsize(wilcox.test(data$price ~ data$mainroad, alternative = "two.sided"))
## r (rank biserial) | 95% CI
## --------------------------------
## 0.64 | [0.36, 0.82]
interpret_rank_biserial(0.64)
## [1] "very large"
## (Rules: funder2019)
The difference in housing prices is very large (r = 0.64).
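The rank-biserial correlation follows directly from the Wilcoxon statistic and the two group sizes. The short sketch below takes W from the `wilcox.test()` output and the group sizes from `describeBy()`. The variable names are illustrative.

```r
W     <- 534   # Wilcoxon rank sum statistic from the output above
n_yes <- 65    # houses on a main road
n_no  <- 10    # houses not on a main road

# Share of (yes, no) pairs in which the main-road house is more expensive
# (tied pairs count as one half)
prop_yes_higher <- W / (n_yes * n_no)

# Rank-biserial r: share of pairs favouring "yes" minus share favouring "no"
r_rb <- 2 * prop_yes_higher - 1
round(r_rb, 2)   # about 0.64, matching the effectsize() output
```

Read this way, the main-road house is the more expensive one in roughly 82% of all possible pairs.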
Both the parametric and the non-parametric test indicate that location on a main road is significantly associated with housing prices in this dataset. The effect size is very large, with properties located on a main road having higher prices.
The answer to my research question is: There is a significant difference in housing prices between properties located on a main road and those not located on a main road, with properties on a main road being more expensive.
In my case, although the parametric and non-parametric tests led to the same conclusion, the parametric test is the more appropriate choice, because its assumptions (normality and homogeneity of variance) are met and it offers greater statistical power.