housing_prices = read.csv("C:/Users/24980017/Downloads/Housing_prices.csv")

Number of variables and variable names

dim(housing_prices)
## [1] 372   8
names(housing_prices)
## [1] "ID"       "river"    "rooms"    "price"    "age"      "industry" "ptratio" 
## [8] "low_ses"

Present the study design, null and alternative hypotheses (5%)

Study design: A cross-sectional study of 372 suburbs, examining whether median housing prices differ according to the number of rooms per house (4, 5, 6 or 7 rooms).

Null hypothesis (H0): Median housing prices do not differ among houses with different numbers of rooms.

Alternative hypothesis (Ha): Median housing prices differ for at least one group of houses defined by number of rooms.

Describe the characteristics of the study sample by the number of rooms/house (10%)

library(table1)
## 
## Attaching package: 'table1'
## The following objects are masked from 'package:base':
## 
##     units, units<-
table1(~ price + river + age + industry + ptratio + low_ses | rooms, data = housing_prices)
                     4-room             5-room             6-room             7-room             Overall
                     (N=15)             (N=27)             (N=254)            (N=76)             (N=372)
price
  Mean (SD)          17.3 (10.7)        14.0 (5.05)        18.9 (5.50)        28.8 (11.7)        20.5 (8.59)
  Median [Min, Max]  13.8 [7.00, 50.0]  14.4 [5.00, 23.7]  19.4 [5.00, 50.0]  27.5 [7.50, 50.0]  19.8 [5.00, 50.0]
river
  No                 15 (100%)          23 (85.2%)         239 (94.1%)        67 (88.2%)         344 (92.5%)
  Yes                0 (0%)             4 (14.8%)          15 (5.9%)          9 (11.8%)          28 (7.5%)
age
  Mean (SD)          93.5 (15.9)        89.8 (20.4)        75.6 (24.3)        77.2 (21.1)        77.7 (23.6)
  Median [Min, Max]  100 [37.8, 100]    96.2 [9.80, 100]   84.5 [6.00, 100]   82.7 [2.90, 100]   87.3 [2.90, 100]
industry
  Mean (SD)          17.8 (2.23)        17.7 (5.43)        13.6 (6.16)        11.1 (6.53)        13.5 (6.32)
  Median [Min, Max]  18.1 [9.90, 19.6]  18.1 [6.91, 27.7]  13.9 [2.18, 27.7]  9.90 [1.89, 19.6]  18.1 [1.89, 27.7]
ptratio
  Mean (SD)          19.3 (1.94)        18.7 (2.31)        19.3 (1.71)        18.4 (1.81)        19.1 (1.82)
  Median [Min, Max]  20.2 [14.7, 20.2]  20.1 [14.7, 21.2]  20.2 [14.7, 21.2]  18.0 [14.7, 21.0]  20.2 [14.7, 21.2]
low_ses
  Mean (SD)          24.4 (11.4)        23.3 (6.75)        14.5 (5.37)        9.14 (6.25)        14.4 (7.15)
  Median [Min, Max]  29.3 [3.26, 38.0]  24.0 [10.2, 34.4]  14.1 [5.08, 34.0]  6.79 [1.73, 25.8]  13.6 [1.73, 38.0]

Develop and interpret a histogram for the distribution of house prices (10%)

library(ggplot2)
library(grid)
library(gridExtra) 
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggthemes)

# Histogram of counts
p = ggplot(data = housing_prices, aes(x = price)) +
  geom_histogram(color = "black", fill = "lightpink") +
  labs(x = "Median housing prices (x USD 1,000)", y = "Count")

# Histogram on the density scale with a kernel density overlay;
# after_stat(density) replaces the deprecated ..density.. notation
p1 = ggplot(data = housing_prices, aes(x = price, y = after_stat(density))) +
  geom_histogram(color = "black", fill = "lightpink") +
  geom_density(col = "blue") +
  labs(x = "Median housing prices (x USD 1,000)", y = "Density")

grid.arrange(p, p1, nrow = 2, top = textGrob("Distribution of house prices", gp = gpar(fontsize = 18, font = 3)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
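
The `stat_bin()` message is a reminder that 30 bins is only a default. As a rough sketch of one common alternative, Sturges' rule can be applied using just the sample size and price range reported in the descriptive table. The variable names are illustrative, and Sturges' rule is one rule of thumb among several, not the choice made in this report.

```r
# Sample size and observed price range, taken from the descriptive table
# (Overall column: N = 372, price Min = 5.00, Max = 50.0).
n_suburbs <- 372
price_min <- 5.0
price_max <- 50.0

# Sturges' rule: ceiling(log2(n) + 1) bins.
n_bins <- ceiling(log2(n_suburbs) + 1)

# Bin width that spreads that many bins across the observed range.
bin_width <- (price_max - price_min) / n_bins

cat("Suggested number of bins:", n_bins, "\n")
cat("Suggested binwidth:", bin_width, "\n")
```

The resulting width could then be passed as `geom_histogram(binwidth = bin_width)` in both plots, which would also silence the message.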

Develop and interpret a box plot to describe the differences in housing prices among the number of rooms per household (10%)

# Box plot of price by room group, with the individual suburbs overlaid as jittered points
p = ggplot(data = housing_prices, aes(x = rooms, y = price, fill = rooms, col = rooms))

p1 = p +
  geom_boxplot(col = "black") +
  geom_jitter(alpha = 0.1) +
  labs(x = "Number of rooms per house", y = "Median housing prices (x USD 1,000)") +
  ggtitle("Box plot of housing prices by number of rooms per house") +
  theme_bw()

p1

Conduct a statistical test to determine whether housing prices were different among houses with different numbers of rooms. Interpret the findings (15%)

Kruskal-Wallis test

kruskal.test(price ~ rooms, data = housing_prices)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  price by rooms
## Kruskal-Wallis chi-squared = 75.504, df = 3, p-value = 2.826e-16

ANOVA test

anova_test= aov(price ~ rooms, data = housing_prices)
summary(anova_test)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## rooms         3   7162  2387.2   43.49 <2e-16 ***
## Residuals   368  20202    54.9                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

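Both the Kruskal-Wallis test (chi-squared = 75.504, df = 3, P-value = 2.826e-16) and the one-way ANOVA (F = 43.49, df = 3 and 368, P-value < 2e-16) give P-values far below 0.05. Therefore, the null hypothesis (H0) is rejected in favor of the alternative hypothesis (Ha).

In conclusion, housing prices differ significantly among houses with different numbers of rooms. These overall tests do not show which groups differ from each other; that requires a post hoc comparison.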
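
As a sanity check on the printed results, the ANOVA table can be approximately reconstructed from the group sizes, means and standard deviations in the descriptive table, and the Kruskal-Wallis P-value can be recomputed from its chi-squared statistic. This is only a sketch. The inputs are the rounded summary values, so the reconstructed F is expected to be close to, not identical with, the reported 43.49. Eta-squared is added here as an effect-size measure that the original analysis does not report, and all variable names are illustrative.

```r
# Group sizes, means and SDs of price for the 4-, 5-, 6- and 7-room groups,
# copied (already rounded) from the descriptive table.
n     <- c(15, 27, 254, 76)
means <- c(17.3, 14.0, 18.9, 28.8)
sds   <- c(10.7, 5.05, 5.50, 11.7)

grand_mean <- sum(n * means) / sum(n)

# Between-group and within-group sums of squares.
ss_between <- sum(n * (means - grand_mean)^2)
ss_within  <- sum((n - 1) * sds^2)

df_between <- length(n) - 1
df_within  <- sum(n) - length(n)

# One-way ANOVA F statistic and its upper-tail P-value.
f_stat  <- (ss_between / df_between) / (ss_within / df_within)
p_anova <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

# Eta-squared: share of the total variation in price associated with room group.
eta_sq <- ss_between / (ss_between + ss_within)

# Kruskal-Wallis P-value from the printed chi-squared statistic (df = 3).
p_kw <- pchisq(75.504, df = 3, lower.tail = FALSE)

print(round(c(F = f_stat, eta_sq = eta_sq), 3))
print(signif(c(p_anova = p_anova, p_kw = p_kw), 4))
```

The sums of squares reported by `summary(anova_test)` give the same effect size without the rounding error: 7162 / (7162 + 20202), or roughly 26% of the variation in price.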

Determine which particular number of rooms/house had different house prices using the Tukey posthoc test. Fill in the following table and interpret the findings (20%)

tukey_result = TukeyHSD(anova_test)
tukey_result
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = price ~ rooms, data = housing_prices)
## 
## $rooms
##                    diff       lwr       upr     p adj
## 5-room-4-room -3.215556 -9.373321  2.942210 0.5331459
## 6-room-4-room  1.603780 -3.477109  6.684668 0.8475492
## 7-room-4-room 11.511053  6.108559 16.913546 0.0000004
## 6-room-5-room  4.819335  0.948716  8.689954 0.0077590
## 7-room-5-room 14.726608 10.442544 19.010672 0.0000000
## 7-room-6-room  9.907273  7.407162 12.407384 0.0000000
tukey_results <- data.frame(

  Comparisons = c("5-room houses vs. 4-room houses","6-room houses vs. 4-room houses", "7-room houses vs. 4-room houses", "6-room houses vs. 5-room houses", "7-room houses vs. 5-room houses", "7-room houses vs. 6-room houses"),

  # Mean difference [95% CI], rounded to 2 decimals from the TukeyHSD output
  Mean_Differences_in_housing_price= c("-3.22 [-9.37, 2.94]","1.60 [-3.48, 6.68]","11.51 [6.11, 16.91]","4.82 [0.95, 8.69]","14.73 [10.44, 19.01]","9.91 [7.41, 12.41]"),

  # Adjusted p-values; values that round to 0.0000 are reported as <0.0001
  P_Value = c("0.5331", "0.8475", "<0.0001", "0.0078", "<0.0001", "<0.0001")
)

print(tukey_results, row.names = FALSE)
##                      Comparisons Mean_Differences_in_housing_price P_Value
##  5-room houses vs. 4-room houses               -3.22 [-9.37, 2.94]  0.5331
##  6-room houses vs. 4-room houses                1.60 [-3.48, 6.68]  0.8475
##  7-room houses vs. 4-room houses               11.51 [6.11, 16.91] <0.0001
##  6-room houses vs. 5-room houses                 4.82 [0.95, 8.69]  0.0078
##  7-room houses vs. 5-room houses              14.73 [10.44, 19.01] <0.0001
##  7-room houses vs. 6-room houses                9.91 [7.41, 12.41] <0.0001
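
Typing the table by hand invites rounding and bracket slips, so a small helper can build it directly from the Tukey matrix instead. The sketch below is not part of the original analysis. `format_tukey()` and its column names are hypothetical, and the matrix is re-entered from the printed `TukeyHSD()` output so that the example runs without the CSV file.

```r
# Tukey HSD estimates, re-entered from the printed TukeyHSD(anova_test) output.
tukey_mat <- matrix(
  c(-3.215556, -9.373321,  2.942210, 0.5331459,
     1.603780, -3.477109,  6.684668, 0.8475492,
    11.511053,  6.108559, 16.913546, 0.0000004,
     4.819335,  0.948716,  8.689954, 0.0077590,
    14.726608, 10.442544, 19.010672, 0.0000000,
     9.907273,  7.407162, 12.407384, 0.0000000),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("5-room-4-room", "6-room-4-room", "7-room-4-room",
      "6-room-5-room", "7-room-5-room", "7-room-6-room"),
    c("diff", "lwr", "upr", "p adj")
  )
)

# Hypothetical helper: turn a TukeyHSD-style matrix into a reporting table
# with "difference [lower, upper]" strings and floored p-values.
format_tukey <- function(mat) {
  p <- mat[, "p adj"]
  data.frame(
    Comparison = rownames(mat),
    Mean_difference_95CI = sprintf("%.2f [%.2f, %.2f]",
                                   mat[, "diff"], mat[, "lwr"], mat[, "upr"]),
    P_value = ifelse(p < 0.0001, "<0.0001", sprintf("%.4f", p)),
    row.names = NULL
  )
}

tukey_table <- format_tukey(tukey_mat)
print(tukey_table, row.names = FALSE)
```

With the fitted model available, the same helper should work on the live result, for example `format_tukey(tukey_result$rooms)`.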

The Tukey post hoc test shows that houses with 7 rooms had significantly higher prices than houses with 4, 5 and 6 rooms (mean differences of 11.51, 14.73 and 9.91 respectively, in units of USD 1,000; adjusted P-value < 0.0001 for all three comparisons). Similarly, houses with 6 rooms had significantly higher prices than houses with 5 rooms (mean difference 4.82, adjusted P-value = 0.0078 < 0.05).

In contrast, houses with 5 and 6 rooms did not differ significantly in price from houses with 4 rooms (adjusted P-value = 0.53 and 0.85 respectively, both > 0.05); the 95% confidence intervals for both differences include zero.