Import dataset

housing_prices = read.csv("C:/Users/24864833/OneDrive - UTS/Desktop/Housing_prices.csv")

Number of variables and their names

dim(housing_prices)
## [1] 372   8
names(housing_prices)
## [1] "ID"       "river"    "rooms"    "price"    "age"      "industry" "ptratio" 
## [8] "low_ses"

Describe the characteristics of the study sample by the number of rooms/house

library(table1)
## 
## Attaching package: 'table1'
## The following objects are masked from 'package:base':
## 
##     units, units<-
table1(~ price + river + age + industry + ptratio + low_ses | rooms, data = housing_prices) 
                     4-room             5-room             6-room             7-room             Overall
                     (N=15)             (N=27)             (N=254)            (N=76)             (N=372)
price
  Mean (SD)          17.3 (10.7)        14.0 (5.05)        18.9 (5.50)        28.8 (11.7)        20.5 (8.59)
  Median [Min, Max]  13.8 [7.00, 50.0]  14.4 [5.00, 23.7]  19.4 [5.00, 50.0]  27.5 [7.50, 50.0]  19.8 [5.00, 50.0]
river
  No                 15 (100%)          23 (85.2%)         239 (94.1%)        67 (88.2%)         344 (92.5%)
  Yes                0 (0%)             4 (14.8%)          15 (5.9%)          9 (11.8%)          28 (7.5%)
age
  Mean (SD)          93.5 (15.9)        89.8 (20.4)        75.6 (24.3)        77.2 (21.1)        77.7 (23.6)
  Median [Min, Max]  100 [37.8, 100]    96.2 [9.80, 100]   84.5 [6.00, 100]   82.7 [2.90, 100]   87.3 [2.90, 100]
industry
  Mean (SD)          17.8 (2.23)        17.7 (5.43)        13.6 (6.16)        11.1 (6.53)        13.5 (6.32)
  Median [Min, Max]  18.1 [9.90, 19.6]  18.1 [6.91, 27.7]  13.9 [2.18, 27.7]  9.90 [1.89, 19.6]  18.1 [1.89, 27.7]
ptratio
  Mean (SD)          19.3 (1.94)        18.7 (2.31)        19.3 (1.71)        18.4 (1.81)        19.1 (1.82)
  Median [Min, Max]  20.2 [14.7, 20.2]  20.1 [14.7, 21.2]  20.2 [14.7, 21.2]  18.0 [14.7, 21.0]  20.2 [14.7, 21.2]
low_ses
  Mean (SD)          24.4 (11.4)        23.3 (6.75)        14.5 (5.37)        9.14 (6.25)        14.4 (7.15)
  Median [Min, Max]  29.3 [3.26, 38.0]  24.0 [10.2, 34.4]  14.1 [5.08, 34.0]  6.79 [1.73, 25.8]  13.6 [1.73, 38.0]

Develop and interpret a histogram for the distribution of house prices (10%)

library(ggplot2)
library(grid)
library(gridExtra) 
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggthemes)

# Count histogram; bins = 30 is set explicitly (it is the ggplot2 default)
p = ggplot(data = housing_prices, aes(x = price)) +
  geom_histogram(bins = 30, color = "black", fill = "light blue") +
  labs(x = "Median housing prices (x USD 1,000)", y = "Count")

# Density histogram with a kernel density overlay;
# after_stat(density) replaces the deprecated ..density.. notation
p1 = ggplot(data = housing_prices, aes(x = price, y = after_stat(density))) +
  geom_histogram(bins = 30, color = "black", fill = "light blue") +
  geom_density(col = "blue") +
  labs(x = "Median housing prices (x USD 1,000)", y = "Density")

grid.arrange(p, p1, nrow = 2, top = textGrob("Distribution of house prices", gp = gpar(fontsize = 18, font = 2)))
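
The histograms use 30 bins, which is the ggplot2 default rather than a choice based on the data. As a minimal sketch, a data-driven bin width can be computed with the Freedman-Diaconis rule. The data frame `toy` and its simulated prices are hypothetical stand-ins, not the Housing_prices data, and `fd_binwidth` is an illustrative name.

```r
# Simulated right-skewed prices capped at 50 (hypothetical values,
# NOT the Housing_prices data)
set.seed(1)
toy <- data.frame(price = pmin(rlnorm(372, meanlog = 3, sdlog = 0.4), 50))

# Freedman-Diaconis rule: binwidth = 2 * IQR / n^(1/3)
fd_binwidth <- 2 * IQR(toy$price) / nrow(toy)^(1/3)
fd_binwidth
```

The resulting value could then be passed to `geom_histogram(binwidth = fd_binwidth)` in place of a fixed number of bins.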

Develop and interpret a box plot to describe the differences in housing prices among the number of rooms per household (10%)

p = ggplot(data = housing_prices, aes(x = rooms, y = price, fill = rooms, col = rooms))

# alpha raised from 0.01 so the jittered points are actually visible
p1 = p + geom_boxplot(col = "cyan") + geom_jitter(alpha = 0.3) + labs(x = "Number of rooms per house", y = "Median housing prices (x USD 1,000)") + ggtitle("Box plot of housing prices by number of rooms per house") + theme_bw()

p1

Conduct a statistical test to determine whether housing prices were different among houses with different numbers of rooms. Interpret the findings (15%)

Kruskal-Wallis test

kruskal.test(price ~ rooms, data = housing_prices)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  price by rooms
## Kruskal-Wallis chi-squared = 75.504, df = 3, p-value = 2.826e-16

ANOVA test

anova_test= aov(price ~ rooms, data = housing_prices)
summary(anova_test)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## rooms         3   7162  2387.2   43.49 <2e-16 ***
## Residuals   368  20202    54.9                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
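
The F test shows that a difference exists but not how large it is. As a rough supplement, eta-squared can be worked out from the sums of squares printed in the ANOVA table. This is an added illustration rather than part of the original analysis, and the variable names are hypothetical.

```r
# Sums of squares copied from the printed ANOVA table
ss_rooms     <- 7162
ss_residuals <- 20202

# Eta-squared: share of the total variation in price associated with rooms
eta_sq <- ss_rooms / (ss_rooms + ss_residuals)
round(eta_sq, 3)
```

This gives about 0.26, so roughly a quarter of the variation in price is associated with the number of rooms.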

Both the Kruskal-Wallis test (chi-squared = 75.504, df = 3, p = 2.826e-16) and the one-way ANOVA (F = 43.49, p < 2e-16) give p-values far below 0.05. We therefore reject the null hypothesis (H0) that housing prices do not differ across the room groups, in favor of the alternative hypothesis (Ha) that at least one group differs.

In summary, housing prices differ significantly across houses with different numbers of rooms.
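
Running both a rank-based and a parametric test is usually motivated by doubts about the ANOVA assumptions. The sketch below shows one way such checks might be done with base R. It uses simulated data in a hypothetical data frame `toy` (not the Housing_prices data), so its p-values say nothing about the real sample.

```r
# Simulated prices for four room groups (hypothetical values,
# NOT the Housing_prices data)
set.seed(42)
toy <- data.frame(
  rooms = factor(rep(c("4-room", "5-room", "6-room", "7-room"), each = 30)),
  price = c(rnorm(30, 17, 10), rnorm(30, 14, 5),
            rnorm(30, 19, 5.5), rnorm(30, 29, 12))
)

fit <- aov(price ~ rooms, data = toy)

# Shapiro-Wilk test on the residuals; H0: residuals are normally distributed
shapiro_res <- shapiro.test(residuals(fit))

# Bartlett test; H0: the variance of price is equal across room groups
bartlett_res <- bartlett.test(price ~ rooms, data = toy)

shapiro_res$p.value
bartlett_res$p.value
```

A small p-value from either check would favor reporting the Kruskal-Wallis result as the primary test.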

Determine which particular number of rooms/house had different house prices using the Tukey posthoc test. Fill in the following table and interpret the findings (20%)

tukey_result = TukeyHSD(anova_test)
tukey_result
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = price ~ rooms, data = housing_prices)
## 
## $rooms
##                    diff       lwr       upr     p adj
## 5-room-4-room -3.215556 -9.373321  2.942210 0.5331459
## 6-room-4-room  1.603780 -3.477109  6.684668 0.8475492
## 7-room-4-room 11.511053  6.108559 16.913546 0.0000004
## 6-room-5-room  4.819335  0.948716  8.689954 0.0077590
## 7-room-5-room 14.726608 10.442544 19.010672 0.0000000
## 7-room-6-room  9.907273  7.407162 12.407384 0.0000000
# Values transcribed from the TukeyHSD output, rounded to 2 decimals;
# intervals are 95% family-wise confidence intervals
tukey_results <- data.frame(
  
  Comparisons = c("5-room houses vs. 4-room houses","6-room houses vs. 4-room houses", "7-room houses vs. 4-room houses", "6-room houses vs. 5-room houses", "7-room houses vs. 5-room houses", "7-room houses vs. 6-room houses"),
  
  Mean_Differences_in_housing_price= c("-3.22 [-9.37, 2.94]","1.60 [-3.48, 6.68]","11.51 [6.11, 16.91]","4.82 [0.95, 8.69]","14.73 [10.44, 19.01]","9.91 [7.41, 12.41]"),
  
  # Adjusted p-values that print as zero are reported as "<0.0001"
  P_Value = c("0.5331", "0.8475", "<0.0001", "0.0078", "<0.0001", "<0.0001")
)

print(tukey_results, row.names = FALSE)
##                      Comparisons Mean_Differences_in_housing_price P_Value
##  5-room houses vs. 4-room houses               -3.22 [-9.37, 2.94]  0.5331
##  6-room houses vs. 4-room houses                1.60 [-3.48, 6.68]  0.8475
##  7-room houses vs. 4-room houses               11.51 [6.11, 16.91] <0.0001
##  6-room houses vs. 5-room houses                 4.82 [0.95, 8.69]  0.0078
##  7-room houses vs. 5-room houses              14.73 [10.44, 19.01] <0.0001
##  7-room houses vs. 6-room houses                9.91 [7.41, 12.41] <0.0001

The Tukey post hoc analysis indicates that 7-room houses have significantly higher prices than 4-, 5-, and 6-room houses (mean differences of 11.51, 14.73, and 9.91 in units of USD 1,000; all adjusted P-values < 0.0001). 6-room houses are also significantly more expensive than 5-room houses (mean difference 4.82, 95% CI 0.95 to 8.69; adjusted P-value = 0.0078 < 0.05).

Conversely, neither 5-room nor 6-room houses differ significantly in price from 4-room houses (adjusted P-values = 0.53 and 0.85, both > 0.05); both confidence intervals include zero.
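
Typing the Tukey estimates into a table by hand invites rounding and formatting slips. As a hedged sketch, the same kind of table can be built directly from the `TukeyHSD()` result. The example runs on simulated data in a hypothetical data frame `toy` (not the Housing_prices data), and `tukey_table` and its column names are illustrative.

```r
# Simulated prices for four room groups (hypothetical values,
# NOT the Housing_prices data)
set.seed(42)
toy <- data.frame(
  rooms = factor(rep(c("4-room", "5-room", "6-room", "7-room"), each = 30)),
  price = c(rnorm(30, 17, 10), rnorm(30, 14, 5),
            rnorm(30, 19, 5.5), rnorm(30, 29, 12))
)

fit <- aov(price ~ rooms, data = toy)

# TukeyHSD() returns one matrix per factor with columns diff, lwr, upr, p adj
tk <- TukeyHSD(fit)$rooms

tukey_table <- data.frame(
  Comparison = rownames(tk),
  # Mean difference with its 95% family-wise confidence interval
  Mean_difference_95CI = sprintf("%.2f [%.2f, %.2f]",
                                 tk[, "diff"], tk[, "lwr"], tk[, "upr"]),
  # Very small adjusted p-values are shown as "<0.0001" instead of 0.0000
  P_value = ifelse(tk[, "p adj"] < 0.0001, "<0.0001",
                   sprintf("%.4f", tk[, "p adj"])),
  row.names = NULL
)

print(tukey_table, row.names = FALSE)
```

With the real analysis, `fit` would be replaced by the fitted `anova_test` object, so the table always matches the test output.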