Student name: Gerelchuluun Amarsanaa Student ID: 12300075
Is there a difference in house value between younger (median housing age = 2) and older (median housing age = 50) housing units in California?
library(psych)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(ggpubr)
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
## The following object is masked from 'package:psych':
##
## phi
mydata <- read.table("housing.csv",
header=TRUE,
sep=",",
dec=".")
# Note: == recycles c(2, 50) along the rows (odd rows are compared with 2, even rows with 50)
mydata <- mydata[mydata$housing_median_age == c(2, 50), ]
head(mydata)
## longitude latitude housing_median_age total_rooms total_bedrooms population
## 16 -122.26 37.85 50 1120 283 697
## 144 -122.21 37.80 50 2833 605 1260
## 184 -122.23 37.80 50 1746 480 1149
## 202 -122.22 37.78 50 1920 530 1525
## 230 -122.19 37.79 50 968 195 462
## 288 -122.18 37.78 50 1642 322 713
## households median_income median_house_value ocean_proximity
## 16 264 2.1250 140000 NEAR BAY
## 144 552 2.8929 216700 NEAR BAY
## 184 415 2.2500 123500 NEAR BAY
## 202 477 1.4886 128800 NEAR BAY
## 230 184 2.9844 179900 NEAR BAY
## 288 284 3.2984 160700 NEAR BAY
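Filtering rows on several values is commonly written with `%in%`; a comparison with `==` against a two-element vector is recycled element-wise instead. A minimal sketch of the difference on a made-up data frame (the name `toy` and its values are illustrative, not taken from housing.csv):

```r
# Made-up data: five rows have age 2 or 50, one row has age 7
toy <- data.frame(age   = c(50, 2, 50, 2, 7, 50),
                  value = c(10, 20, 30, 40, 50, 60))

# `==` recycles c(2, 50) down the rows:
# row 1 is compared with 2, row 2 with 50, row 3 with 2, and so on
by_equals <- toy[toy$age == c(2, 50), ]

# `%in%` tests every row against the whole set {2, 50}
by_in <- toy[toy$age %in% c(2, 50), ]

nrow(by_equals)  # rows kept by the recycled comparison
nrow(by_in)      # rows kept by set membership
```

With `%in%` the group sizes no longer depend on the order of the rows in the file.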
Converting median housing age into a factor:
mydata$ageF <- factor(mydata$housing_median_age,
levels = c(2, 50),
labels = c("Young", "Old"))
describeBy(mydata$median_house_value, mydata$ageF)
##
## Descriptive statistics by group
## group: Young
## vars n mean sd median trimmed mad min max range skew
## X1 1 30 233533.4 117458.3 212300 221995.8 105857.6 47500 500001 452501 0.76
## kurtosis se
## X1 -0.31 21444.86
## ------------------------------------------------------------
## group: Old
## vars n mean sd median trimmed mad min max range skew
## X1 1 77 217137.8 134780.5 168800 203512.7 96220.74 49800 500001 450201 0.92
## kurtosis se
## X1 -0.23 15359.66
From the descriptive statistics above:
* The mean value of 2-year-old housing was 233533.4, while that of 50-year-old housing was 217137.8.
* The standard deviations of the two groups are similar: 117458.3 and 134780.5.
* The median value of younger housing (212300) is higher than that of older housing (168800), suggesting that younger housing is relatively more expensive.
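For reference, the same kind of per-group summary can be sketched in base R with `tapply`. The data frame `toy` and its values are made up for illustration; only the group labels mirror the ones used here.

```r
# Made-up values for two groups (not taken from housing.csv)
toy <- data.frame(
  value = c(100, 200, 300, 150, 250, 650),
  group = factor(c("Young", "Young", "Young", "Old", "Old", "Old"),
                 levels = c("Young", "Old"))
)

# One summary statistic per group
group_mean   <- tapply(toy$value, toy$group, mean)
group_median <- tapply(toy$value, toy$group, median)
group_sd     <- tapply(toy$value, toy$group, sd)

# Stack the summaries into a small table, one column per group
print(rbind(mean = group_mean, median = group_median, sd = group_sd))
```

A mean well above the median within a group, as in the toy "Old" group, is a quick hint of right skew.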
The data set is available on kaggle.com under the Apache 2.0 licence. Link to data: https://www.kaggle.com/datasets/hosammhmdali/house-price-dataset
Because the question concerns the difference between two independent samples, an independent-samples t-test is the most suitable test, provided its assumptions are met.
is.numeric(mydata$median_house_value)
## [1] TRUE
From the code snippet above, it is clear that the median house value variable is numeric.
youngHousing <- ggplot(mydata[mydata$ageF == "Young",], aes(x=median_house_value)) +
geom_histogram(binwidth = 100000, color="black", fill="yellow") +
labs(title="Young housing", x = "Median Value", y="Frequency") +
theme_minimal()
oldHousing <- ggplot(mydata[mydata$ageF == "Old",], aes(x=median_house_value)) +
geom_histogram(binwidth = 100000, color="black", fill="yellow") +
labs(title="Old housing", x = "Median Value", y="Frequency") +
theme_minimal()
ggarrange(youngHousing, oldHousing, ncol=2, nrow=1)
The histograms above show that the value distribution of both housing groups is right-skewed, which suggests that neither is normally distributed. To confirm, the Shapiro-Wilk normality test is used.
mydata %>%
group_by(ageF) %>%
shapiro_test(median_house_value)
## # A tibble: 2 × 4
## ageF variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 Young median_house_value 0.927 0.0418
## 2 Old median_house_value 0.874 0.00000171
In both cases, the p-value is less than 0.05 (0.0418 for Young and 0.00000171 for Old). Therefore, the null hypothesis of normality is rejected (p < 0.05), meaning the data are not normally distributed.
Because the house value variable is not normally distributed in either sample, a non-parametric test is more suitable. Thus, the Wilcoxon rank sum test will be used.
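The per-group normality check can also be sketched with base R's `shapiro.test`. The data below are simulated for illustration (one exponential and one normal sample), and the names `toy`, `alpha` and `reject_normality` are assumptions of this sketch.

```r
set.seed(1)  # reproducible simulated data

# Simulated values: one right-skewed group and one roughly normal group
toy <- data.frame(
  value = c(rexp(40), rnorm(40)),
  group = rep(c("skewed", "normal"), each = 40)
)

alpha <- 0.05  # significance level assumed for this sketch

# Shapiro-Wilk p-value for each group
p_values <- tapply(toy$value, toy$group,
                   function(x) shapiro.test(x)$p.value)

# TRUE where the null hypothesis of normality is rejected
reject_normality <- p_values < alpha

print(p_values)
print(reject_normality)
```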
wilcox.test(mydata$median_house_value ~ mydata$ageF,
paired=FALSE,
correct=FALSE,
exact=FALSE,
alternative="two.sided")
##
## Wilcoxon rank sum test
##
## data: mydata$median_house_value by mydata$ageF
## W = 1300, p-value = 0.3144
## alternative hypothesis: true location shift is not equal to 0
From the result, we can see that the p-value (0.3144) is larger than 0.05. Therefore, we cannot reject the null hypothesis (p > 0.05).
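For intuition about the W statistic, here is a minimal sketch of how it can be computed by hand and compared with `wilcox.test`. The vectors `young` and `old` hold made-up values without ties, not the housing data.

```r
# Made-up values (no ties), first group listed first
young <- c(12, 15, 19, 23)
old   <- c(10, 14, 16)

# Rank all observations together
ranks <- rank(c(young, old))
n1 <- length(young)

# W = rank sum of the first group minus its smallest possible rank sum
w_manual <- sum(ranks[seq_len(n1)]) - n1 * (n1 + 1) / 2

# The same statistic as reported by wilcox.test
w_builtin <- unname(
  wilcox.test(young, old, exact = FALSE, correct = FALSE)$statistic
)

print(c(manual = w_manual, builtin = w_builtin))
```

W counts the pairs in which a value from the first group exceeds a value from the second, so it ranges from 0 to n1 * n2.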
effectsize(wilcox.test(mydata$median_house_value ~ mydata$ageF,
paired=FALSE,
correct=FALSE,
exact=FALSE,
alternative="two.sided"))
## r (rank biserial) | 95% CI
## ---------------------------------
## 0.13 | [-0.12, 0.35]
interpret_rank_biserial(0.13)
## [1] "small"
## (Rules: funder2019)
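As a cross-check, the rank-biserial correlation can be recovered from the printed W and the group sizes, assuming the common definition r = 2W / (n1 * n2) - 1. The variable names below are illustrative, and W is taken as printed, so it may be rounded.

```r
w  <- 1300  # W from the Wilcoxon rank sum output
n1 <- 30    # number of Young observations
n2 <- 77    # number of Old observations

# Rank-biserial correlation: rescales W from [0, n1 * n2] to [-1, 1]
r_rb <- 2 * w / (n1 * n2) - 1

print(round(r_rb, 2))
```

This rounds to the same 0.13 that `effectsize()` reports.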
There is no statistically significant difference in house value between younger (housing age 2) and older (housing age 50) housing (p = 0.3144). The effect size is small (r = 0.13).