Homework Assignment 2: Hypothesis Testing

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

data <- read.table("./Housing.csv", header = TRUE, sep = ",")

head(data)

##      price area bedrooms bathrooms stories mainroad guestroom basement
## 1 13300000 7420        4         2       3      yes        no       no
## 2 12250000 8960        4         4       4      yes        no       no
## 3 12250000 9960        3         2       2      yes        no      yes
## 4 12215000 7500        4         2       2      yes        no      yes
## 5 11410000 7420        4         1       2      yes       yes      yes
## 6 10850000 7500        3         3       1      yes        no      yes
##   hotwaterheating airconditioning parking prefarea furnishingstatus
## 1              no             yes       2      yes        furnished
## 2              no             yes       3       no        furnished
## 3              no              no       2      yes   semi-furnished
## 4              no             yes       3      yes        furnished
## 5              no             yes       2       no        furnished
## 6              no             yes       2      yes   semi-furnished

Data Description

Unit of Observation

Each row represents a unique property with its corresponding attributes and price.

Sample Size

The dataset contains 545 observations, each corresponding to a unique property.

Definition of Variables

Variable	Type	Description	Unit of Measurement
price	Numerical (Ratio)	The price of the property.	Currency (e.g., USD, EUR)
area	Numerical (Ratio)	The area of the property (in square feet).	Square feet (ft²)
bedrooms	Numerical (Ratio)	The number of bedrooms in the property.	Count
bathrooms	Numerical (Ratio)	The number of bathrooms in the property.	Count
stories	Numerical (Ratio)	The number of stories in the property.	Count
mainroad	Categorical (Nominal)	Whether the property is located on a main road. Possible values: `yes`, `no`.	N/A
guestroom	Categorical (Nominal)	Whether the property has a guestroom. Possible values: `yes`, `no`.	N/A
basement	Categorical (Nominal)	Whether the property has a basement. Possible values: `yes`, `no`.	N/A
hotwaterheating	Categorical (Nominal)	Whether the property has hot water heating. Possible values: `yes`, `no`.	N/A
airconditioning	Categorical (Nominal)	Whether the property has air conditioning. Possible values: `yes`, `no`.	N/A
parking	Numerical (Ratio)	The number of parking spaces available at the property.	Count
prefarea	Categorical (Nominal)	Whether the property is located in a preferred area. Possible values: `yes`, `no`.	N/A
furnishingstatus	Categorical (Nominal)	The furnishing status of the property. Possible values: `furnished`, `semi-furnished`, `unfurnished`.	N/A

Source of Data

The dataset is obtained from Kaggle:
Housing Prices Dataset

Data Manipulation

Because I want to work with a sample of between 50 and 100 observations for hypothesis testing, I randomly select 75 rows from the dataset.

I used the code below to ensure that the randomly selected rows remain the same every time the document is run or knitted. By setting a seed and saving the selection, I prevent changes in the sampled data, ensuring reproducibility.

set.seed(3)
random_rows <- data[sample(nrow(data), 75), ]

saveRDS(random_rows, "random_sample.rds")

data <- readRDS("random_sample.rds")

library(dplyr)
library(tidyr)

data <- data %>% drop_na()

data$mainroad <- factor(data$mainroad,
                      levels = c("yes", "no"),
                      labels = c("yes", "no"))

No further data manipulation was necessary for hypothesis testing.
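As a quick check of the reproducibility claim above, the sketch below draws a sample twice with the same seed and confirms that the same rows come back. It uses R's built-in `mtcars` data and a sample of 10 rows purely for illustration, and `draw_sample` is a hypothetical helper that is not part of the original analysis.

```r
# Illustration only: mtcars stands in for the housing data.
# draw_sample() is a hypothetical helper, not part of the analysis above.
draw_sample <- function(df, n, seed) {
  set.seed(seed)              # fix the state of the random number generator
  df[sample(nrow(df), n), ]   # pick n distinct rows at random
}

first_draw  <- draw_sample(mtcars, 10, seed = 3)
second_draw <- draw_sample(mtcars, 10, seed = 3)

# TRUE when both draws selected exactly the same rows in the same order
identical(rownames(first_draw), rownames(second_draw))
```

Because the seed alone already fixes the selection, saving the sample to an `.rds` file is an extra safeguard rather than a requirement.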

Descriptive Statistics

library(psych)

describeBy(data$price, data$mainroad)

## 
##  Descriptive statistics by group 
## group: yes
##    vars  n    mean      sd  median trimmed     mad     min     max   range skew
## X1    1 65 5357414 1653151 5250000 5270472 1452948 2380000 9681000 7301000 0.46
##    kurtosis       se
## X1    -0.17 205048.2
## ------------------------------------------------------------ 
## group: no
##    vars  n    mean      sd  median trimmed      mad     min     max   range
## X1    1 10 3638600 1006026 3500000 3686375 908092.5 1750000 5145000 3395000
##     skew kurtosis       se
## X1 -0.14     -0.9 318133.2

Explanation of some sample statistics

n: The number of observations. The dataset includes 65 houses located on a main road and 10 that are not.
mean: The sum of all values divided by the number of observations. For example, the average price of a house located on a main road is $5357414.
sd: Measure of how spread out the data points are around the mean. In the dataset of houses not located on a main road, the standard deviation is 1006026, indicating a relatively wide spread in the values.
median: The value below which 50% of the data falls. For example, 50% of the houses located on a main road cost $5250000 or less.
min: The smallest value in each group. For example, the lowest price for a house not located on a main road is $1750000.
max: The largest value in each group. For example, the highest price for a house located on a main road is $9681000.
range: The difference between the maximum and minimum values. For example, the difference between the highest and lowest prices for houses not located on a main road is $3395000.
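To make these definitions concrete, the following sketch computes each statistic by hand for a small made-up vector and compares the results with R's built-in functions. The values in `prices` are hypothetical and are not taken from the housing data.

```r
# Hypothetical toy prices, not taken from the housing data
prices <- c(2, 4, 4, 5, 10)

n_obs   <- length(prices)                              # n
avg     <- sum(prices) / n_obs                         # mean
std_dev <- sqrt(sum((prices - avg)^2) / (n_obs - 1))   # sample standard deviation
med     <- median(prices)                              # median
rng     <- max(prices) - min(prices)                   # range

# The hand-computed values agree with the built-in functions
all.equal(avg, mean(prices))
all.equal(std_dev, sd(prices))
```

Note that the sample standard deviation divides by n - 1 rather than n, which is also what `sd()` and `describeBy()` report.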

Research Question

Is there a significant difference in housing prices between properties located on a main road and those not located on a main road?

\[ H_0: \mu_{\text{mainroad}} = \mu_{\text{no_mainroad}} \]

\[ H_1: \mu_{\text{mainroad}} \neq \mu_{\text{no_mainroad}} \]

I am working with independent samples, as the data belong to two distinct groups, with each unit measured once.

Parametric Test

Below, I will review all the necessary assumptions for the application of parametric tests.

Variable is numeric: The variable of interest, price, is numeric.
Normality - The distribution of the variable is normal in both populations:

library(ggplot2)

# Histogram of prices for houses located on a main road
mainroad_plot <- ggplot(data[data$mainroad == "yes", ], aes(x = price)) +
  theme_linedraw() +
  geom_histogram(binwidth = 200000, col = "black", fill = "royalblue2") +
  ylab("Frequency") +
  ggtitle("Main road")

# Histogram of prices for houses not located on a main road
no_mainroad_plot <- ggplot(data[data$mainroad == "no", ], aes(x = price)) +
  theme_linedraw() +
  geom_histogram(binwidth = 200000, col = "black", fill = "violetred2") +
  ylab("Frequency") +
  ggtitle("No main road")

library(ggpubr)
# Show the two histograms side by side
ggarrange(mainroad_plot, no_mainroad_plot,
          ncol = 2, nrow = 1)

The histograms do not clearly show a normal distribution in either sample, so I will use additional methods to check normality.

Q-Q Plot of Price by Main Road Location

library(ggpubr)

ggqqplot(data,
         "price",
         facet.by = "mainroad")

Since almost all the points lie within the grey confidence band, the Q-Q plots give no indication of a departure from normality in either group.

Shapiro-Wilk Normality Test

\[ H_0: \text{The price for houses located on a main road follows a normal distribution.} \]

\[ H_1: \text{The price for houses located on a main road does not follow a normal distribution.} \]

shapiro.test(data[data$mainroad == "yes", ]$price)

## 
##  Shapiro-Wilk normality test
## 
## data:  data[data$mainroad == "yes", ]$price
## W = 0.97688, p-value = 0.2631

Since the p-value is greater than the significance level of 0.05, we fail to reject the null hypothesis. Therefore, we can assume normality for the houses located on a main road.

\[ H_0: \text{The price for houses not located on a main road follows a normal distribution.} \]

\[ H_1: \text{The price for houses not located on a main road does not follow a normal distribution.} \]

shapiro.test(data[data$mainroad == "no", ]$price)

## 
##  Shapiro-Wilk normality test
## 
## data:  data[data$mainroad == "no", ]$price
## W = 0.96522, p-value = 0.8433

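Since the p-value is greater than the significance level of 0.05, we fail to reject the null hypothesis. Therefore, we can assume normality for the prices of houses not located on a main road, although with only 10 observations in this group the test has limited power to detect departures from normality.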

In conclusion, we can assume that the distribution of the variable is normal in both populations.

The data comes from two independent populations
The variable has the same variance in both populations:

Levene’s Test for Homogeneity of Variance

\[ H_0: \sigma^2_{\text{mainroad}} = \sigma^2_{\text{no_mainroad}} \]

\[ H_1: \sigma^2_{\text{mainroad}} \neq \sigma^2_{\text{no_mainroad}} \]

library(car)

leveneTest(data$price, group = data$mainroad)

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)  
## group  1  3.0408 0.08541 .
##       73                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value is greater than the significance level of 0.05, we fail to reject the null hypothesis. This suggests that there is no significant evidence to conclude that the variances are different between the houses located on a main road and those not located on a main road.
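For intuition about what the test above computes: with `center = median`, Levene's test is a one-way ANOVA on the absolute deviations of each observation from its own group median. The sketch below shows that idea in base R on simulated data; the group labels, group sizes, and distribution parameters are made up for illustration and do not reproduce the housing result.

```r
set.seed(1)  # simulated data, for illustration only
group <- factor(rep(c("a", "b"), times = c(30, 12)))
value <- c(rnorm(30, mean = 50, sd = 10),
           rnorm(12, mean = 40, sd = 5))

# Absolute deviation of each observation from its own group median
group_median <- ave(value, group, FUN = median)
abs_dev <- abs(value - group_median)

# One-way ANOVA on the absolute deviations: the F test for `group`
# is the median-centered Levene test
levene_table <- anova(lm(abs_dev ~ group))
levene_table
```

Centering on the median rather than the mean makes the test less sensitive to skewed data, which is why it is a common default.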

I will now proceed with the independent samples t-test.

t.test(data$price ~ data$mainroad, 
       var.equal = TRUE,
       alternative = "two.sided")

## 
##  Two Sample t-test
## 
## data:  data$price by data$mainroad
## t = 3.1871, df = 73, p-value = 0.002116
## alternative hypothesis: true difference in means between group yes and group no is not equal to 0
## 95 percent confidence interval:
##   643969.7 2793659.2
## sample estimates:
## mean in group yes  mean in group no 
##           5357414           3638600

We reject the null hypothesis (p = 0.002). This suggests that there is a statistically significant difference in the prices between houses located on a main road and those not located on a main road, in the population from which the sample was drawn.
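The reported test statistic can be recovered from the group summaries alone. The sketch below recomputes the pooled-variance t statistic, its two-sided p-value, and the 95% confidence interval from the n, mean, and sd values printed in the descriptive statistics. Because those inputs are rounded, the results should agree with the `t.test()` output only approximately; the variable names are my own.

```r
# Summary statistics copied from the describeBy() output
n_yes <- 65; mean_yes <- 5357414; sd_yes <- 1653151
n_no  <- 10; mean_no  <- 3638600; sd_no  <- 1006026

# Pooled variance, as used when var.equal = TRUE
deg_free   <- n_yes + n_no - 2
pooled_var <- ((n_yes - 1) * sd_yes^2 + (n_no - 1) * sd_no^2) / deg_free

# Standard error of the difference in means
std_err <- sqrt(pooled_var * (1 / n_yes + 1 / n_no))

mean_diff <- mean_yes - mean_no
t_stat    <- mean_diff / std_err
p_value   <- 2 * pt(-abs(t_stat), deg_free)   # two-sided p-value
conf_int  <- mean_diff + c(-1, 1) * qt(0.975, deg_free) * std_err

t_stat
p_value
conf_int
```

The values should land close to the reported t = 3.1871, p-value = 0.002116, and the interval from 643969.7 to 2793659.2.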

library(effectsize)

effectsize::cohens_d(data$price ~ data$mainroad, 
                     pooled_sd = FALSE)

## Cohen's d |       95% CI
## ------------------------
## 1.26      | [0.56, 1.93]
## 
## - Estimated using un-pooled SD.

interpret_cohens_d(1.26, rules = "sawilowsky2009")

## [1] "very large"
## (Rules: sawilowsky2009)

The difference in housing prices is very large (d = 1.26).
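As a rough cross-check of the effect size, Cohen's d can be computed from the group summaries. The sketch below assumes that the un-pooled standard deviation is the square root of the average of the two group variances; that assumption is mine, but it reproduces the value in the output above. The pooled version is shown for comparison, since the t-test above used a pooled variance.

```r
# Summary statistics copied from the describeBy() output
n_yes <- 65; mean_yes <- 5357414; sd_yes <- 1653151
n_no  <- 10; mean_no  <- 3638600; sd_no  <- 1006026

mean_diff <- mean_yes - mean_no

# Un-pooled SD: square root of the simple average of the two variances
# (assumed to correspond to pooled_sd = FALSE)
sd_unpooled <- sqrt((sd_yes^2 + sd_no^2) / 2)
d_unpooled  <- mean_diff / sd_unpooled

# Pooled SD: variances weighted by their degrees of freedom
sd_pooled <- sqrt(((n_yes - 1) * sd_yes^2 + (n_no - 1) * sd_no^2) /
                    (n_yes + n_no - 2))
d_pooled  <- mean_diff / sd_pooled

round(c(unpooled = d_unpooled, pooled = d_pooled), 2)
```

With group sizes as unequal as 65 and 10, the two versions differ noticeably, because the pooled SD is dominated by the larger and more variable main-road group.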

Non-Parametric Test - Wilcoxon Rank Sum Test

wilcox.test(data$price ~ data$mainroad, alternative = "two.sided")

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  data$price by data$mainroad
## W = 534, p-value = 0.001152
## alternative hypothesis: true location shift is not equal to 0

We reject the null hypothesis (p = 0.001). This suggests that there is a statistically significant difference in the prices between houses located on a main road and those not located on a main road, in the population from which the sample was drawn.

library(effectsize)

effectsize(wilcox.test(data$price ~ data$mainroad, alternative = "two.sided"))

## r (rank biserial) |       95% CI
## --------------------------------
## 0.64              | [0.36, 0.82]

interpret_rank_biserial(0.64)

## [1] "very large"
## (Rules: funder2019)

The difference in housing prices is very large (r = 0.64).
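The rank-biserial correlation follows directly from the W statistic and the two group sizes. The sketch below uses the reported W = 534 together with the group sizes of 65 and 10; the variable names are my own.

```r
# Values copied from the wilcox.test() and describeBy() output
w_stat <- 534
n_yes  <- 65
n_no   <- 10

# Share of all (main road, no main road) pairs in which the main-road
# house has the higher price, with ties counted as one half
prop_higher <- w_stat / (n_yes * n_no)

# Rank-biserial correlation: rescales that share from [0, 1] to [-1, 1]
r_rank_biserial <- 2 * prop_higher - 1

round(c(prop_higher = prop_higher, r = r_rank_biserial), 2)
```

Read this way, an r of about 0.64 means that in roughly 82% of the possible pairs the main-road house is the more expensive one.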

Both the parametric and the non-parametric test indicate that location on a main road is significantly associated with housing prices in this dataset. The effect size is very large, with properties located on a main road having higher prices.

The answer to my research question is: There is a significant difference in housing prices between properties located on a main road and those not located on a main road, with properties on a main road being more expensive.

In my case, although both the parametric and the non-parametric test led to the same conclusion, the parametric test is the more appropriate choice, as its assumptions (normality and homogeneity of variance) are met and it offers greater statistical power.