Homework assignment for the course Applied Data Analysis in Business with R

Student name: Gerelchuluun Amarsanaa
Student ID: 12300075

Research question

Is there a difference in house value between younger (median housing age = 2) and older (median housing age = 50) housing units in California?

library(psych)
library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

library(ggpubr)
library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared

## The following object is masked from 'package:psych':
## 
##     phi

Importing data

mydata <- read.table("housing.csv",
                     header=TRUE,
                     sep=",",
                     dec=".")

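# %in% keeps every row whose median age is 2 or 50. Comparing with
# `== c(2, 50)` recycles the vector element-wise and keeps only part of the
# matching rows; the outputs shown in this report come from that `==` filter.
mydata <- mydata[mydata$housing_median_age %in% c(2, 50), ]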

head(mydata)

##     longitude latitude housing_median_age total_rooms total_bedrooms population
## 16    -122.26    37.85                 50        1120            283        697
## 144   -122.21    37.80                 50        2833            605       1260
## 184   -122.23    37.80                 50        1746            480       1149
## 202   -122.22    37.78                 50        1920            530       1525
## 230   -122.19    37.79                 50         968            195        462
## 288   -122.18    37.78                 50        1642            322        713
##     households median_income median_house_value ocean_proximity
## 16         264        2.1250             140000        NEAR BAY
## 144        552        2.8929             216700        NEAR BAY
## 184        415        2.2500             123500        NEAR BAY
## 202        477        1.4886             128800        NEAR BAY
## 230        184        2.9844             179900        NEAR BAY
## 288        284        3.2984             160700        NEAR BAY
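When a filter involves several values, it is worth checking how the comparison behaves: `==` against a vector compares element by element and recycles the shorter vector, whereas `%in%` tests membership. The following is a minimal sketch on a made-up vector; the `age` values are invented for illustration and are not taken from housing.csv.

```r
# Toy ages, invented for illustration only
age <- c(2, 50, 50, 2, 7, 50)

# `==` recycles c(2, 50): positions 1, 3, 5 are compared with 2,
# positions 2, 4, 6 are compared with 50
by_equals <- age == c(2, 50)

# `%in%` asks, for every element, whether it equals 2 or 50
by_in <- age %in% c(2, 50)

sum(by_equals)  # number of rows kept by the element-wise comparison
sum(by_in)      # number of rows kept by the membership test
```

On this toy vector the two approaches keep a different number of rows, so the group sizes of a filtered data set depend on which one is used.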

Data explanation

Unit of observation:

A block group of housing units

Sample size:

107 observations in total
30 observations with median age of 2
77 observations with median age of 50

Variables:

Longitude: The longitude of the center of each block group in California.
Latitude: The latitude of the center of each block group in California.
Housing Median Age: The median age of the housing units in each block group. For this assignment, only houses with median age of 2 and 50 were used.
Total rooms: The total number of rooms in the housing units in each block group.
Total Bedrooms: The total number of bedrooms in the housing units in each block group.
Population: The total population of the block group.
Households: The total number of households in the block group.
Median Income: The median income of the block group.
Median House Value: The median value of the housing units in the block group.
Ocean Proximity: The proximity of the block group to the ocean or other bodies of water (Near Bay, Near Ocean, Inland, Island, <1 Hour to Ocean).

Parameters:

Factoring data by median age:

mydata$ageF <- factor(mydata$housing_median_age, 
                      levels = c(2, 50),
                      labels = c("Young", "Old"))

describeBy(mydata$median_house_value, mydata$ageF)

## 
##  Descriptive statistics by group 
## group: Young
##    vars  n     mean       sd median  trimmed      mad   min    max  range skew
## X1    1 30 233533.4 117458.3 212300 221995.8 105857.6 47500 500001 452501 0.76
##    kurtosis       se
## X1    -0.31 21444.86
## ------------------------------------------------------------ 
## group: Old
##    vars  n     mean       sd median  trimmed      mad   min    max  range skew
## X1    1 77 217137.8 134780.5 168800 203512.7 96220.74 49800 500001 450201 0.92
##    kurtosis       se
## X1    -0.23 15359.66

The descriptive statistics above show that:
* The mean house value was 233533.4 for 2-year-old housing and 217137.8 for 50-year-old housing.
* The standard deviations of the two groups are of similar magnitude (117458.3 and 134780.5).
* The median value of younger housing (212300) is higher than that of older housing (168800), suggesting that younger housing is relatively more expensive.
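As a quick sanity check on the describeBy() output, the standard errors can be recomputed from the reported standard deviations and group sizes (se = sd / sqrt(n)), and the raw group differences can be written out. This is only a sketch using the rounded figures printed above; the variable names are chosen here for illustration.

```r
# Values copied from the describeBy() output above
n_young <- 30
sd_young <- 117458.3
n_old <- 77
sd_old <- 134780.5

# Standard error of the mean = sd / sqrt(n)
se_young <- sd_young / sqrt(n_young)
se_old <- sd_old / sqrt(n_old)
round(c(Young = se_young, Old = se_old), 2)

# Raw differences between the groups (Young minus Old)
mean_diff <- 233533.4 - 217137.8
median_diff <- 212300 - 168800
c(mean_diff = mean_diff, median_diff = median_diff)
```

The recomputed standard errors agree with the `se` column of the output, up to rounding of the inputs.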

Source of Data

The data are available on kaggle.com under the Apache 2.0 licence. Link to data: https://www.kaggle.com/datasets/hosammhmdali/house-price-dataset

Hypothesis testing

Because the question concerns the difference between two independent samples, an independent-samples t-test is the most suitable test, provided its assumptions are met.

Assumptions

Variable (median house value) is numeric.
Variable (median house value) is normally distributed in both samples.
Variable has the same variance in both populations.

Checking if all assumptions are met:

is.numeric(mydata$median_house_value)

## [1] TRUE

From the code snippet above, it is clear that the median house value variable is numeric.

youngHousing <- ggplot(mydata[mydata$ageF == "Young",], aes(x=median_house_value)) +
  geom_histogram(binwidth = 100000, color="black", fill="yellow") +
  labs(title="Young housing", x = "Median Value", y="Frequency") +
  theme_minimal()

oldHousing <- ggplot(mydata[mydata$ageF == "Old",], aes(x=median_house_value)) +
  geom_histogram(binwidth = 100000, color="black", fill="yellow") +
  labs(title="Old housing", x = "Median Value", y="Frequency") +
  theme_minimal()

ggarrange(youngHousing, oldHousing, ncol=2, nrow=1)

The histograms above show that house values in both groups are right-skewed, which suggests that neither is normally distributed. To confirm this, the Shapiro-Wilk normality test is used.

Shapiro-Wilk normality test:

Null hypothesis: The data is normally distributed.
H1: The data is not normally distributed.

mydata %>%
  group_by(ageF) %>%
  shapiro_test(median_house_value)

## # A tibble: 2 × 4
##   ageF  variable           statistic          p
##   <fct> <chr>                  <dbl>      <dbl>
## 1 Young median_house_value     0.927 0.0418    
## 2 Old   median_house_value     0.874 0.00000171

In both groups, the p value is less than 0.05. Therefore, the null hypothesis is rejected (p<0.05), meaning the data are not normally distributed.
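The same decision rule can be illustrated with base R's `shapiro.test()` on simulated data. This is a minimal sketch: the log-normal sample, the seed and the variable names are assumptions made for the example and have nothing to do with the housing data.

```r
set.seed(1)  # arbitrary seed, used only for reproducibility

# Simulated, strongly right-skewed values (log-normal); not the housing data
skewed <- rlnorm(200, meanlog = 12, sdlog = 1)

res <- shapiro.test(skewed)
res$statistic  # W; values close to 1 indicate normal-looking data
res$p.value

# Decision at the 5% level: reject normality when p < alpha
alpha <- 0.05
reject_normality <- res$p.value < alpha
reject_normality
```

For a sample this skewed, the test rejects normality, which mirrors what the histograms and the test results show for both age groups.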

Because house value is not normally distributed in either sample, a non-parametric test is more suitable. The Wilcoxon Rank Sum Test will therefore be used.

Hypothesis testing with Wilcoxon Rank Sum Test

Null hypothesis: The location of the median house value distribution is the same in the two house age groups.
H1: The location of the median house value distribution is not the same in the two house age groups.

wilcox.test(mydata$median_house_value ~ mydata$ageF, 
            paired=FALSE,
            correct=FALSE,
            exact=FALSE,
            alternative="two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  mydata$median_house_value by mydata$ageF
## W = 1300, p-value = 0.3144
## alternative hypothesis: true location shift is not equal to 0

From the result, we can see that the p value (0.3144) is larger than 0.05. Therefore, we cannot reject the null hypothesis (p>0.05).
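To make the W statistic less of a black box, the sketch below recomputes it by hand on two small made-up samples: all observations are ranked together, the ranks of the first group are summed, and n1(n1 + 1)/2 is subtracted. The numbers and names are invented for illustration; only `rank()` and `wilcox.test()` from base R are used.

```r
# Toy values, invented for illustration only
young <- c(210, 340, 150, 500, 275)
old <- c(160, 120, 300, 180, 140, 220)

n1 <- length(young)

# Rank all observations together, then sum the ranks of the first group
r <- rank(c(young, old))
w_manual <- sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2

# Same statistic from wilcox.test(), with the arguments used in the report
res <- wilcox.test(young, old,
                   paired = FALSE,
                   correct = FALSE,
                   exact = FALSE,
                   alternative = "two.sided")

w_manual
unname(res$statistic)

# Decision at the 5% level: reject the null hypothesis when p < 0.05
res$p.value < 0.05
```

The hand-computed value and the statistic returned by `wilcox.test()` coincide, which shows that W depends only on the ranks and not on the raw house values.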

Calculating Effect size

effectsize(wilcox.test(mydata$median_house_value ~ mydata$ageF, 
            paired=FALSE,
            correct=FALSE,
            exact=FALSE,
            alternative="two.sided"))

## r (rank biserial) |        95% CI
## ---------------------------------
## 0.13              | [-0.12, 0.35]

interpret_rank_biserial(0.13)

## [1] "small"
## (Rules: funder2019)
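The reported effect size can be cross-checked from the test output. Assuming the usual formula for the rank-biserial correlation of two independent groups, r = 2W / (n1 * n2) - 1, and using the printed (rounded) W, the sketch below reproduces the value; it is a sanity check on the arithmetic, not a description of how the effectsize package computes it internally.

```r
# Values copied from the outputs above
W <- 1300   # Wilcoxon rank sum statistic for the first group (Young)
n1 <- 30    # number of Young block groups
n2 <- 77    # number of Old block groups

# Rank-biserial correlation (assumed formula: 2W / (n1 * n2) - 1)
r_rb <- 2 * W / (n1 * n2) - 1
round(r_rb, 2)
```

A positive value means that, in the pairwise comparisons between groups, Young block groups had the higher house value somewhat more often than Old ones.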

Conclusion

There is no statistically significant difference in house value between younger (median housing age 2) and older (median housing age 50) housing (p = 0.3144). The effect size is small (rank-biserial r = 0.13, 95% CI [-0.12, 0.35]).