Homework 2

Nam Anh Le
Housing Prices

mydata <- read.table("./Housing.csv", header=TRUE,sep = ",", dec=".")
head(mydata)

##      price area bedrooms bathrooms stories mainroad guestroom basement
## 1 13300000 7420        4         2       3      yes        no       no
## 2 12250000 8960        4         4       4      yes        no       no
## 3 12250000 9960        3         2       2      yes        no      yes
## 4 12215000 7500        4         2       2      yes        no      yes
## 5 11410000 7420        4         1       2      yes       yes      yes
## 6 10850000 7500        3         3       1      yes        no      yes
##   hotwaterheating airconditioning parking furnishingstatus
## 1              no             yes       2        furnished
## 2              no             yes       3        furnished
## 3              no              no       2   semi-furnished
## 4              no             yes       3        furnished
## 5              no             yes       2        furnished
## 6              no             yes       2   semi-furnished

Unit of Observation: one house
The sample size is 545

Definition of Variables:

price: House Price in $
area: House Area in square feet
bedrooms: Number of Bedrooms
bathrooms: Number of Bathrooms
stories: Number of floors
mainroad: Whether the house is connected to a main road
guestroom: Whether the house has a guest room
basement: Whether the house has a basement
hotwaterheating: Whether the house has a hot water heater
airconditioning: Whether the house has air conditioning
parking: Number of house parking spaces
furnishingstatus: How furnished the house is (furnished, semi-furnished, or unfurnished)

Data Source: https://www.kaggle.com/datasets/yasserh/housing-prices-dataset/data

mydatanew <- mydata

mydatanew$mainroad <- factor(mydata$mainroad,
                          levels = c("yes","no"),
                          labels = c("yes","no"))

mydatanew$basement <- factor(mydata$basement,
                          levels = c("yes","no"),
                          labels = c("yes","no"))

mydatanew$airconditioning <- factor(mydata$airconditioning,
                                 levels = c("yes","no"),
                                 labels = c("yes","no"))

mydatanew$hotwaterheating <- factor(mydata$hotwaterheating,
                                 levels = c("yes","no"),
                                 labels = c("yes","no"))

mydatanew$guestroom <- factor(mydata$guestroom,
                                 levels = c("yes","no"),
                                 labels = c("yes","no"))

mydatanew$furnishingstatus <- factor(mydata$furnishingstatus,
                                 levels = c("furnished","semi-furnished","unfurnished"),
                                 labels = c("furnished","semi-furnished","unfurnished"))
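
The yes/no conversions above all follow the same pattern, so they could also be written as one loop over column names. This is a minimal sketch on a toy data frame; the `toy` object, its values, and the `yes_no_cols` vector are hypothetical stand-ins for illustration, not part of the original analysis.

```r
# Toy data frame standing in for the housing data (hypothetical values)
toy <- data.frame(
  mainroad  = c("yes", "no", "yes"),
  basement  = c("no", "no", "yes"),
  guestroom = c("no", "yes", "no"),
  stringsAsFactors = FALSE
)

# Names of the columns coded as yes/no strings
yes_no_cols <- c("mainroad", "basement", "guestroom")

# Convert every listed column to a factor with the same level order
toy[yes_no_cols] <- lapply(toy[yes_no_cols], factor, levels = c("yes", "no"))

str(toy)
```

Fixing the level order in one place keeps the reference level consistent across all binary variables.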

# Rescale price to millions so the summary statistics are easier to read
mydatanew$price <- mydatanew$price / 1000000


library(psych)
describeBy(mydatanew$price,group = mydatanew$furnishingstatus)

## 
##  Descriptive statistics by group 
## group: furnished
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis   se
## X1    1 140  5.5 2.12   5.08    5.28 2.02 1.75 13.3 11.55 1.06     1.34 0.18
## ------------------------------------------------------------ 
## group: semi-furnished
##    vars   n mean  sd median trimmed  mad  min   max range skew kurtosis   se
## X1    1 227 4.91 1.6   4.58    4.71 1.19 1.77 12.25 10.48 1.42     2.85 0.11
## ------------------------------------------------------------ 
## group: unfurnished
##    vars   n mean   sd median trimmed  mad  min   max range skew kurtosis   se
## X1    1 178 4.01 1.72   3.43    3.78 1.14 1.75 10.15   8.4 1.29     1.25 0.13

Research Question: Are furnishing status and house price related?

Conditions and assumptions:

The analyzed variable is numeric.
The variable is normally distributed in the population within each group.
Homoskedasticity: the variance of the analyzed variable is the same in all groups.

Homoskedasticity test:

$H_0$: $\sigma^2_{\text{unfurnished}}$ = $\sigma^2_{\text{semi-furnished}}$ = $\sigma^2_{\text{furnished}}$
$H_1$: At least one $\sigma^2_j$ is different from others

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

leveneTest(mydatanew$price,group=mydatanew$furnishingstatus)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value   Pr(>F)    
## group   2  7.4278 0.000657 ***
##       542                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
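
With `center = median`, Levene's test is a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below reproduces that calculation in base R on simulated data; the `toy` data frame and its group labels are hypothetical, so the numbers will not match the output above.

```r
set.seed(1)

# Simulated prices for three groups with unequal spread (hypothetical data)
toy <- data.frame(
  price = c(rnorm(30, mean = 5, sd = 2),
            rnorm(30, mean = 5, sd = 1),
            rnorm(30, mean = 4, sd = 1)),
  group = factor(rep(c("a", "b", "c"), each = 30))
)

# Absolute deviation of each observation from its own group median
toy$abs_dev <- abs(toy$price - ave(toy$price, toy$group, FUN = median))

# One-way ANOVA on the deviations gives the Levene F statistic and p-value
levene_fit <- anova(lm(abs_dev ~ group, data = toy))
levene_fit
```

A large F value here means the typical distance from the group median differs between groups, which is evidence against equal variances.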

Reject $H_0$ (p < 0.001): the variance of price differs between furnishing groups, so the homoskedasticity assumption is violated.

Normality test:

$H_0$: price is normally distributed within all groups
$H_1$: price is not normally distributed within at least 1 group

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

mydatanew %>%
  group_by(furnishingstatus) %>%
  shapiro_test(price)

## # A tibble: 3 × 4
##   furnishingstatus variable statistic        p
##   <fct>            <chr>        <dbl>    <dbl>
## 1 furnished        price        0.930 1.94e- 6
## 2 semi-furnished   price        0.901 4.01e-11
## 3 unfurnished      price        0.875 5.20e-11

Reject $H_0$ for all three groups (p < 0.001 in each): price is not normally distributed within any furnishing group.
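
The same per-group check can be run without `dplyr` or `rstatix` by applying `shapiro.test()` to each group with `tapply()`. This is a minimal sketch on simulated right-skewed data; the `toy` data frame is hypothetical and only illustrates the mechanics.

```r
set.seed(2)

# Simulated right-skewed prices in three groups (hypothetical data)
toy <- data.frame(
  price = c(rexp(40), rexp(40), rexp(40)),
  group = rep(c("a", "b", "c"), each = 40)
)

# Shapiro-Wilk p-value for each group separately
p_by_group <- tapply(toy$price, toy$group,
                     function(x) shapiro.test(x)$p.value)
p_by_group
```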

Parametric Test:

$H_0$: $\mu_{\text{unfurnished}}$ = $\mu_{\text{semi-furnished}}$ = $\mu_{\text{furnished}}$
$H_1$: At least one $\mu_j$ is different from others

ANOVA_Results <- aov(price~furnishingstatus,
                     data = mydatanew)
summary(ANOVA_Results)

##                   Df Sum Sq Mean Sq F value   Pr(>F)    
## furnishingstatus   2  179.8   89.90   28.27 2.09e-12 ***
## Residuals        542 1723.4    3.18                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Reject $H_0$ (p < 0.001): at least one group mean differs from the others.

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared

## The following object is masked from 'package:psych':
## 
##     phi

eta_squared(ANOVA_Results)

## For one-way between subjects designs, partial eta squared is equivalent
##   to eta squared. Returning eta squared.

## # Effect Size for ANOVA
## 
## Parameter        | Eta2 |       95% CI
## --------------------------------------
## furnishingstatus | 0.09 | [0.06, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_eta_squared(0.09, rules = "cohen1992")

## [1] "small"
## (Rules: cohen1992)

The effect size ($\eta^2$ = 0.09) indicates small differences in mean price between the levels of furnishing.
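
The F statistic and eta squared reported above follow directly from the sums of squares in the ANOVA table. The sketch below redoes that arithmetic using the rounded values printed in the table, so small rounding differences from the exact output are expected; the variable names are chosen here for illustration.

```r
# Values copied from the printed ANOVA table (rounded)
ss_between <- 179.8
df_between <- 2
ss_within  <- 1723.4
df_within  <- 542

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F statistic and its upper-tail p-value
f_value <- ms_between / ms_within
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

# Eta squared: share of total variation explained by furnishing status
eta2 <- ss_between / (ss_between + ss_within)

round(c(F = f_value, eta2 = eta2), 2)
```

Eta squared of about 0.09 means furnishing status accounts for roughly 9% of the variation in price.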

library(onewaytests)

## 
## Attaching package: 'onewaytests'

## The following object is masked from 'package:psych':
## 
##     describe

welch.test(price ~ furnishingstatus, 
           data = mydatanew)

## 
##   Welch's Heteroscedastic F Test (alpha = 0.05) 
## ------------------------------------------------------------- 
##   data : price and furnishingstatus 
## 
##   statistic  : 25.89152 
##   num df     : 2 
##   denom df   : 311.3324 
##   p.value    : 3.965734e-11 
## 
##   Result     : Difference is statistically significant. 
## -------------------------------------------------------------

Reject $H_0$ (p < 0.001): at least one group mean differs, even without assuming equal variances.
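
Base R offers the same Welch correction through `oneway.test()` with `var.equal = FALSE`, which avoids the extra package. A minimal sketch on simulated data follows; the `toy` data frame is hypothetical, with group means and standard deviations loosely modelled on the descriptive statistics, so its results will differ from the output above.

```r
set.seed(3)

# Simulated prices with unequal group variances (hypothetical data)
toy <- data.frame(
  price = c(rnorm(30, mean = 5.5, sd = 2.1),
            rnorm(30, mean = 4.9, sd = 1.6),
            rnorm(30, mean = 4.0, sd = 1.7)),
  group = factor(rep(c("furnished", "semi-furnished", "unfurnished"),
                     each = 30))
)

# Welch's one-way test: does not assume equal variances across groups
welch_fit <- oneway.test(price ~ group, data = toy, var.equal = FALSE)

welch_fit$statistic   # F statistic
welch_fit$parameter   # numerator and (fractional) denominator df
welch_fit$p.value
```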

pairwise.t.test(x = mydatanew$price, g = mydatanew$furnishingstatus, 
                p.adj = "bonf")

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  mydatanew$price and mydatanew$furnishingstatus 
## 
##                furnished semi-furnished
## semi-furnished 0.0068    -             
## unfurnished    2.1e-12   2.3e-06       
## 
## P value adjustment method: bonferroni

Mean house price differs significantly between furnishing levels in all pairwise comparisons:

There is a significant difference between furnished and semi-furnished houses (p = 0.007).
There is a highly significant difference between semi-furnished and unfurnished houses (p < 0.001).
There is a highly significant difference between furnished and unfurnished houses (p < 0.001).

Non-Parametric Test:

$H_0$: All distribution locations of price are the same
$H_1$: At least one distribution location of price is different

kruskal.test(price ~ furnishingstatus, 
             data = mydatanew)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  price by furnishingstatus
## Kruskal-Wallis chi-squared = 69.583, df = 2, p-value = 7.767e-16

Reject $H_0$ (p < 0.001): at least one distribution location of price differs.

kruskal_effsize(price ~ furnishingstatus, 
             data = mydatanew)

## # A tibble: 1 × 5
##   .y.       n effsize method  magnitude
## * <chr> <int>   <dbl> <chr>   <ord>    
## 1 price   545   0.125 eta2[H] moderate

The effect size ($\eta^2_H$ = 0.125) suggests moderate differences between the distribution locations of price across furnishing groups.
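
The `eta2[H]` effect size can be recovered from the Kruskal-Wallis statistic as $(H - k + 1) / (n - k)$, where $H$ is the test statistic, $k$ the number of groups, and $n$ the sample size. The sketch below plugs in the values printed above; the variable names are chosen here for illustration.

```r
# Values taken from the Kruskal-Wallis output and the sample description
h_stat <- 69.583   # Kruskal-Wallis chi-squared
k      <- 3        # number of furnishing groups
n      <- 545      # sample size

# Eta squared based on the H statistic
eta2_h <- (h_stat - k + 1) / (n - k)
round(eta2_h, 3)
```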

library(rstatix)

groups_nonpar <- wilcox_test(price ~ furnishingstatus,
                             paired = FALSE,
                             p.adjust.method = "bonferroni",
                             data = mydatanew)

groups_nonpar

## # A tibble: 3 × 9
##   .y.   group1       group2    n1    n2 statistic        p    p.adj p.adj.signif
## * <chr> <chr>        <chr>  <int> <int>     <dbl>    <dbl>    <dbl> <chr>       
## 1 price furnished    semi-…   140   227    18238. 1.7 e- 2 5.2 e- 2 ns          
## 2 price furnished    unfur…   140   178    18366. 4   e-13 1.20e-12 ****        
## 3 price semi-furnis… unfur…   227   178    28260. 5.55e-12 1.67e-11 ****

There is no significant difference between furnished and semi-furnished houses after adjusting for multiple comparisons (p.adj = 0.052).
There is a highly significant difference between furnished and unfurnished houses (p.adj = 1.20e-12).
There is a highly significant difference between semi-furnished and unfurnished houses (p.adj = 1.67e-11).
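
The Bonferroni adjustment multiplies each raw p-value by the number of comparisons (three here) and caps the result at 1. The sketch below applies `p.adjust()` to the raw p-values as printed in the table; because the table rounds them, the first adjusted value comes out slightly below the reported 0.052. The vector names are chosen here for illustration.

```r
# Raw p-values as printed in the Wilcoxon table (rounded)
raw_p <- c(furnished_vs_semi        = 1.7e-2,
           furnished_vs_unfurnished = 4e-13,
           semi_vs_unfurnished      = 5.55e-12)

# Bonferroni: multiply by the number of comparisons, cap at 1
adj_p <- p.adjust(raw_p, method = "bonferroni")
adj_p

# Which comparisons stay significant at the 5% level (assumed threshold)
adj_p < 0.05
```

The furnished versus semi-furnished comparison is significant before adjustment but not after, which is why the corrected p-value is the one to report.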

Most Suitable Test:

Levene's test rejected equal variances, so the classical ANOVA should not be applied.
The Shapiro-Wilk test rejected normality in every group, so Welch's test, which still assumes normality, should not be applied either.
I therefore choose the Kruskal-Wallis rank sum test, since both the homoskedasticity and the normality assumptions are violated.
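
The selection logic above can be written as a small decision rule. This is only a sketch: `choose_test()` is a hypothetical helper, and the 5% threshold `alpha` is an assumption, since the report does not state a significance level explicitly.

```r
# Hypothetical helper: pick a test from the assumption-check p-values
choose_test <- function(levene_p, shapiro_p, alpha = 0.05) {
  normal_ok   <- all(shapiro_p >= alpha)  # normality holds in every group
  variance_ok <- levene_p >= alpha        # equal variances not rejected

  if (!normal_ok) {
    "Kruskal-Wallis"   # normality violated: use the rank-based test
  } else if (variance_ok) {
    "ANOVA"            # both assumptions hold
  } else {
    "Welch"            # normal but heteroskedastic
  }
}

# p-values reported by the Levene and Shapiro-Wilk tests above
chosen <- choose_test(levene_p  = 0.000657,
                      shapiro_p = c(1.94e-6, 4.01e-11, 5.20e-11))
chosen
```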

Conclusion:
There is strong statistical evidence (Kruskal-Wallis test, p < 0.001) that house prices differ by furnishing status. The effect size ($\eta^2_H$ = 0.125) indicates moderate differences between the distribution locations of price. Post-hoc analysis shows that the distribution location of unfurnished homes’ prices differs significantly from that of both furnished and semi-furnished homes. However, there is no significant difference between the distribution locations of furnished and semi-furnished homes’ prices (p.adj = 0.052).

2025-03-30