housing_prices = read.csv("C:\\Users\\User\\Documents\\UTS\\AUTUMN 2024\\TRM\\Data Analyst R Basic\\Housing_prices.csv")
dim(housing_prices)
## [1] 372 8
names(housing_prices)
## [1] "ID" "river" "rooms" "price" "age" "industry" "ptratio"
## [8] "low_ses"
head(housing_prices)
## ID river rooms price age industry ptratio low_ses
## 1 1 No 6-room 21.6 78.9 7.07 17.8 9.14
## 2 2 No 7-room 34.7 61.1 7.07 17.8 4.03
## 3 3 No 7-room 33.4 45.8 2.18 18.7 2.94
## 4 4 No 7-room 36.2 54.2 2.18 18.7 5.33
## 5 5 No 6-room 28.7 58.7 2.18 18.7 5.21
## 6 6 No 6-room 20.4 61.8 8.14 21.0 8.26
Study Design: An investigation of 372 suburbs to determine whether housing prices differ by the number of rooms in a house.
Null hypothesis: Housing prices do not differ among houses with different numbers of rooms per household.
Alternative hypothesis: Housing prices differ for at least one category of rooms per household.
library(table1)
##
## Attaching package: 'table1'
## The following objects are masked from 'package:base':
##
## units, units<-
table1(~ price + river + age + industry + ptratio + low_ses | rooms, data = housing_prices)
| | 4-room (N=15) | 5-room (N=27) | 6-room (N=254) | 7-room (N=76) | Overall (N=372) |
|---|---|---|---|---|---|
| price | | | | | |
| Mean (SD) | 17.3 (10.7) | 14.0 (5.05) | 18.9 (5.50) | 28.8 (11.7) | 20.5 (8.59) |
| Median [Min, Max] | 13.8 [7.00, 50.0] | 14.4 [5.00, 23.7] | 19.4 [5.00, 50.0] | 27.5 [7.50, 50.0] | 19.8 [5.00, 50.0] |
| river | | | | | |
| No | 15 (100%) | 23 (85.2%) | 239 (94.1%) | 67 (88.2%) | 344 (92.5%) |
| Yes | 0 (0%) | 4 (14.8%) | 15 (5.9%) | 9 (11.8%) | 28 (7.5%) |
| age | | | | | |
| Mean (SD) | 93.5 (15.9) | 89.8 (20.4) | 75.6 (24.3) | 77.2 (21.1) | 77.7 (23.6) |
| Median [Min, Max] | 100 [37.8, 100] | 96.2 [9.80, 100] | 84.5 [6.00, 100] | 82.7 [2.90, 100] | 87.3 [2.90, 100] |
| industry | | | | | |
| Mean (SD) | 17.8 (2.23) | 17.7 (5.43) | 13.6 (6.16) | 11.1 (6.53) | 13.5 (6.32) |
| Median [Min, Max] | 18.1 [9.90, 19.6] | 18.1 [6.91, 27.7] | 13.9 [2.18, 27.7] | 9.90 [1.89, 19.6] | 18.1 [1.89, 27.7] |
| ptratio | | | | | |
| Mean (SD) | 19.3 (1.94) | 18.7 (2.31) | 19.3 (1.71) | 18.4 (1.81) | 19.1 (1.82) |
| Median [Min, Max] | 20.2 [14.7, 20.2] | 20.1 [14.7, 21.2] | 20.2 [14.7, 21.2] | 18.0 [14.7, 21.0] | 20.2 [14.7, 21.2] |
| low_ses | | | | | |
| Mean (SD) | 24.4 (11.4) | 23.3 (6.75) | 14.5 (5.37) | 9.14 (6.25) | 14.4 (7.15) |
| Median [Min, Max] | 29.3 [3.26, 38.0] | 24.0 [10.2, 34.4] | 14.1 [5.08, 34.0] | 6.79 [1.73, 25.8] | 13.6 [1.73, 38.0] |
# install.packages("ggplot2")
# geom_*() functions add geometric objects (bars, lines, points) to a plot.
library(ggplot2)
library(grid)
library(gridExtra)
# First graph
p = ggplot(data = housing_prices, aes(x = price)) +
  geom_histogram(color = "white", fill = "blue") +
  labs(x = "Median housing prices (x USD 1,000)")
# Second graph: density scale with an overlaid density curve
# after_stat(density) replaces the ..density.. notation deprecated in ggplot2 3.4.0
p1 = ggplot(data = housing_prices, aes(x = price, y = after_stat(density))) +
  geom_histogram(color = "white", fill = "blue") +
  geom_density(col = "red") +
  labs(x = "Median housing prices (x USD 1,000)")
# Combine the graphs together
grid.arrange(p, p1, nrow = 2, top = textGrob("Distribution of house prices", gp = gpar(fontsize = 20, font = 1)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Interpretation: The histogram shows that house prices are positively (right) skewed: most observations fall in the lower price bins, with a long tail toward higher prices. The overlaid density curve shows the same pattern, peaking at the lower end of the range and tapering off toward the upper end. The summary table is consistent with this, since the overall mean price (20.5) exceeds the overall median (19.8).
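The skew described above can also be quantified rather than judged by eye. The following is a minimal sketch, assuming simulated right-skewed values (`sim_price` is hypothetical, not the housing data) and a hand-written moment-based `skewness()` helper that is not part of base R:

```r
# Simulated right-skewed values standing in for prices (hypothetical data)
set.seed(42)
sim_price <- rlnorm(372, meanlog = 3, sdlog = 0.5)

# Moment-based sample skewness: third central moment over the cubed SD
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3 / 2)
}

sim_skew <- skewness(sim_price)
mean_minus_median <- mean(sim_price) - median(sim_price)

sim_skew            # positive for a right-skewed sample
mean_minus_median   # positive when the right tail pulls the mean up
```

With the housing data loaded, `skewness(housing_prices$price)` would give the corresponding figure for the real prices.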
p = ggplot(data = housing_prices, aes(x = rooms, y = price, fill = rooms, col = rooms))
p1 = p + geom_boxplot(col = "black") + geom_jitter(alpha = 0.05)
p1 + labs(x = "Number of Rooms per Household", y = "Median housing prices (x USD 1,000)") +
  ggtitle("Box Plot of Housing Prices by Number of Rooms per Household") +
  theme_bw()
Interpretation: The plot shows that the median housing price generally increases with the number of rooms per household, indicating a positive association between the number of rooms and housing prices. Houses with more rooms tend to have higher prices, which is consistent with expectations.
However, there are some notable observations:

- The 6-room category spans the widest price range (5.00 to 50.0) and shows several outliers at both the lower and upper ends. Its standard deviation (5.50) is nevertheless smaller than that of the 4-room (10.7) and 7-room (11.7) categories, so the outliers may partly reflect the much larger group size (N=254) as well as differences in property characteristics or location within this category.
- There is a single outlier at 50 (USD 50,000) in the 4-room category, far above the typical price range for houses with four rooms. This outlier could represent a unique property or an error in the data.

Overall, the box plot shows both the upward trend in price with the number of rooms and the variation within each category.
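The outliers discussed above can be listed explicitly instead of being read off the plot. The sketch below assumes simulated prices for two room categories (the `sim` data frame and its planted value of 50 are hypothetical) and uses base R's `boxplot.stats()`, which flags points more than 1.5 times the box length beyond the hinges, the same convention `geom_boxplot()` uses by default for its whiskers:

```r
# Simulated prices for two room categories (hypothetical data)
set.seed(1)
sim <- data.frame(
  rooms = rep(c("4-room", "6-room"), times = c(15, 254)),
  price = c(rnorm(14, mean = 14, sd = 3), 50,   # one planted extreme value
            rnorm(254, mean = 19, sd = 5))
)

# For each category, keep the points that fall outside the whiskers
outliers_by_group <- lapply(split(sim$price, sim$rooms),
                            function(x) boxplot.stats(x)$out)

outliers_by_group[["4-room"]]
```

Applied to the real data, `split(housing_prices$price, housing_prices$rooms)` would take the place of the simulated split, which makes it easy to inspect the flagged suburbs for data-entry errors.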
kruskal.test(price ~ rooms, data = housing_prices)
##
## Kruskal-Wallis rank sum test
##
## data: price by rooms
## Kruskal-Wallis chi-squared = 75.504, df = 3, p-value = 2.826e-16
Interpretation: The Kruskal-Wallis test yielded a chi-squared statistic of 75.504 with 3 degrees of freedom (p = 2.826e-16). This extremely small p-value is strong evidence against the null hypothesis that housing prices follow the same distribution in all room categories. We therefore reject the null hypothesis and conclude that housing prices differ among houses with different numbers of rooms. The test does not identify which categories differ; that requires a post hoc comparison.
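To make the origin of the chi-squared statistic and its 3 degrees of freedom more concrete, the following sketch computes the Kruskal-Wallis H statistic by hand and compares it with `kruskal.test()`. It assumes simulated, tie-free data in three groups (`y` and `g` are hypothetical), so no tie correction is applied:

```r
# Simulated continuous outcome in three groups (hypothetical data, no ties)
set.seed(123)
y <- c(rnorm(20, mean = 10), rnorm(25, mean = 11), rnorm(30, mean = 13))
g <- factor(rep(c("a", "b", "c"), times = c(20, 25, 30)))

# Rank all observations together, then sum the ranks within each group
r <- rank(y)
n <- length(y)
rank_sums <- tapply(r, g, sum)
group_n <- tapply(r, g, length)

# H statistic (no tie correction needed because there are no ties here)
H <- 12 / (n * (n + 1)) * sum(rank_sums^2 / group_n) - 3 * (n + 1)

# p-value from the chi-squared approximation with k - 1 degrees of freedom
p_manual <- pchisq(H, df = nlevels(g) - 1, lower.tail = FALSE)

kw <- kruskal.test(y ~ g)
c(manual = H, builtin = unname(kw$statistic))
```

With four room categories, the same formula gives k - 1 = 3 degrees of freedom, matching the output above.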
library(lmboot)
boot = ANOVA.boot(price ~ rooms, B = 1000, seed = 1234, data = housing_prices)
boot$'p-values'
## [1] 0
Interpretation: The reported p-value of 0 means that none of the 1,000 bootstrap samples produced a test statistic as extreme as the one observed in the original data. Because a resampling p-value cannot be resolved below 1/B, this result is best reported as p < 0.001 rather than as exactly zero. It is strong evidence against the null hypothesis, so we reject it and conclude that housing prices differ among houses with different numbers of rooms.
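The way a resampling p-value can come out as exactly 0 is easier to see in a hand-rolled version. The sketch below is a simple label-permutation test on the one-way ANOVA F statistic, which is a related technique and not necessarily the algorithm `ANOVA.boot()` uses internally. The data (`y`, `g`) are simulated and hypothetical:

```r
# Simulated outcome in three groups with a clear shift (hypothetical data)
set.seed(2024)
y <- c(rnorm(30, mean = 10), rnorm(30, mean = 10), rnorm(30, mean = 14))
g <- factor(rep(c("a", "b", "c"), each = 30))

# F statistic from a one-way ANOVA
f_stat <- function(y, g) anova(lm(y ~ g))[1, "F value"]

f_obs <- f_stat(y, g)

# Shuffle the group labels B times to approximate the null distribution of F
B <- 1000
f_null <- replicate(B, f_stat(y, sample(g)))

# The raw proportion can be exactly 0; the +1 version is at least 1 / (B + 1)
p_raw <- mean(f_null >= f_obs)
p_adj <- (sum(f_null >= f_obs) + 1) / (B + 1)

c(raw = p_raw, adjusted = p_adj)
```

The adjusted estimate counts the observed statistic as one of the resamples, so it never reports a p-value of zero and can be quoted directly.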
tukey.price.rooms = TukeyHSD(aov(price ~ rooms, data = housing_prices))
tukey.price.rooms
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = price ~ rooms, data = housing_prices)
##
## $rooms
## diff lwr upr p adj
## 5-room-4-room -3.215556 -9.373321 2.942210 0.5331459
## 6-room-4-room 1.603780 -3.477109 6.684668 0.8475492
## 7-room-4-room 11.511053 6.108559 16.913546 0.0000004
## 6-room-5-room 4.819335 0.948716 8.689954 0.0077590
## 7-room-5-room 14.726608 10.442544 19.010672 0.0000000
## 7-room-6-room 9.907273 7.407162 12.407384 0.0000000
tukey_results <- data.frame(
  Comparisons = c("5-room houses vs. 4-room houses",
                  "6-room houses vs. 4-room houses",
                  "7-room houses vs. 4-room houses",
                  "6-room houses vs. 5-room houses",
                  "7-room houses vs. 5-room houses",
                  "7-room houses vs. 6-room houses"),
  Mean_Differences = c("-3.22 (-9.37, 2.94)",
                       "1.60 (-3.48, 6.68)",
                       "11.51 (6.11, 16.91)",
                       "4.82 (0.95, 8.69)",
                       "14.73 (10.44, 19.01)",
                       "9.91 (7.41, 12.41)"),
  P_Value = c("0.5331", "0.8475", "<0.001", "0.0078", "<0.001", "<0.001")
)
print(tukey_results, row.names = FALSE)
## Comparisons Mean_Differences P_Value
## 5-room houses vs. 4-room houses -3.22 (-9.37, 2.94) 0.5331
## 6-room houses vs. 4-room houses 1.60 (-3.48, 6.68) 0.8475
## 7-room houses vs. 4-room houses 11.51 (6.11, 16.91) <0.001
## 6-room houses vs. 5-room houses 4.82 (0.95, 8.69) 0.0078
## 7-room houses vs. 5-room houses 14.73 (10.44, 19.01) <0.001
## 7-room houses vs. 6-room houses 9.91 (7.41, 12.41) <0.001
The Tukey post hoc test was performed to determine which room categories had significantly different house prices. The results are presented in the table above, as mean differences (x USD 1,000) with 95% family-wise confidence intervals and adjusted p-values.
In summary, the Tukey post hoc test shows that houses with 7 rooms have significantly higher prices than houses with 4, 5, or 6 rooms (mean differences of 11.51, 14.73, and 9.91 respectively, all adjusted p < 0.001). Houses with 6 rooms also have significantly higher prices than those with 5 rooms (mean difference 4.82, adjusted p = 0.0078). No significant differences are found between houses with 4 rooms and those with either 5 rooms (p = 0.5331) or 6 rooms (p = 0.8475).
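A summary table like the one above can also be built directly from the `TukeyHSD()` result, which avoids retyping the numbers by hand. The sketch below uses the built-in `PlantGrowth` dataset as a stand-in for the housing data. The `fmt_p()` helper and the column names are illustrative choices:

```r
# Built-in dataset used as a stand-in (not the housing data)
fit <- aov(weight ~ group, data = PlantGrowth)
tk <- TukeyHSD(fit)$group   # matrix with columns diff, lwr, upr, p adj

# Format p-values, collapsing very small ones to "<0.001"
fmt_p <- function(p) ifelse(p < 0.001, "<0.001", sprintf("%.4f", p))

tukey_table <- data.frame(
  Comparison      = rownames(tk),
  Mean_Difference = sprintf("%.2f (%.2f, %.2f)",
                            tk[, "diff"], tk[, "lwr"], tk[, "upr"]),
  P_Value         = fmt_p(unname(tk[, "p adj"])),
  stringsAsFactors = FALSE
)

print(tukey_table, row.names = FALSE)
```

For the analysis here, the equivalent matrix would be `tukey.price.rooms$rooms`, so the table stays in sync with the model if the data change.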