Student ID: “24928614”

Name: “Wejdan Suhaiman G Alshammari”

Course: “TRM 32931 “Practical Data Analysis using R - Basic””

Task 1. Reflection:

Task 2. Analysis report:

2.1 Present the study design, null and alternative hypotheses

2.1.1. Study design: A cross-sectional investigation of house conditions in 372 suburbs was conducted to determine whether housing prices varied among houses with different numbers of rooms.
2.1.2. Null hypothesis: Housing prices did not vary among houses with different numbers of rooms.
2.1.3. Alternative hypothesis: Housing prices varied among houses with different numbers of rooms.

Import the “Housing_prices” dataset

library(readr)
Housing_prices <- read_csv("C:/Users/24928614/Downloads/Housing_prices.csv")
## Rows: 372 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): river, rooms
## dbl (6): ID, price, age, industry, ptratio, low_ses
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(Housing_prices)
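
The column specification above shows that `rooms` is read as a character column. The following is a minimal sketch of converting it to a factor with an explicit level order, so that tables and plots list the groups from fewest to most rooms. It assumes the four labels seen in this report ("4-room" to "7-room") and uses a small made-up data frame, here called `demo`, in place of the real file; both `demo` and `room_levels` are illustrative names.

```r
# Hypothetical stand-in for Housing_prices; the values are invented for illustration only.
demo <- data.frame(
  rooms = c("6-room", "4-room", "7-room", "5-room", "6-room"),
  price = c(19.4, 13.8, 27.5, 14.4, 21.0)
)

# Fix the level order explicitly instead of relying on alphabetical sorting.
room_levels <- c("4-room", "5-room", "6-room", "7-room")
demo$rooms <- factor(demo$rooms, levels = room_levels)

levels(demo$rooms)   # group labels in the chosen order
table(demo$rooms)    # count of rows per group
```

The same two lines could be applied to `Housing_prices$rooms` after import, provided the labels in the file match those assumed here.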

2.2 Describe the characteristics of the study sample by the number of rooms/house

library(table1)
## 
## Attaching package: 'table1'
## The following objects are masked from 'package:base':
## 
##     units, units<-
table1(~ river + price + age + industry + ptratio + low_ses | rooms, data = Housing_prices)
                    4-room             5-room             6-room             7-room             Overall
                    (N=15)             (N=27)             (N=254)            (N=76)             (N=372)
river
  No                15 (100%)          23 (85.2%)         239 (94.1%)        67 (88.2%)         344 (92.5%)
  Yes               0 (0%)             4 (14.8%)          15 (5.9%)          9 (11.8%)          28 (7.5%)
price
  Mean (SD)         17.3 (10.7)        14.0 (5.05)        18.9 (5.50)        28.8 (11.7)        20.5 (8.59)
  Median [Min, Max] 13.8 [7.00, 50.0]  14.4 [5.00, 23.7]  19.4 [5.00, 50.0]  27.5 [7.50, 50.0]  19.8 [5.00, 50.0]
age
  Mean (SD)         93.5 (15.9)        89.8 (20.4)        75.6 (24.3)        77.2 (21.1)        77.7 (23.6)
  Median [Min, Max] 100 [37.8, 100]    96.2 [9.80, 100]   84.5 [6.00, 100]   82.7 [2.90, 100]   87.3 [2.90, 100]
industry
  Mean (SD)         17.8 (2.23)        17.7 (5.43)        13.6 (6.16)        11.1 (6.53)        13.5 (6.32)
  Median [Min, Max] 18.1 [9.90, 19.6]  18.1 [6.91, 27.7]  13.9 [2.18, 27.7]  9.90 [1.89, 19.6]  18.1 [1.89, 27.7]
ptratio
  Mean (SD)         19.3 (1.94)        18.7 (2.31)        19.3 (1.71)        18.4 (1.81)        19.1 (1.82)
  Median [Min, Max] 20.2 [14.7, 20.2]  20.1 [14.7, 21.2]  20.2 [14.7, 21.2]  18.0 [14.7, 21.0]  20.2 [14.7, 21.2]
low_ses
  Mean (SD)         24.4 (11.4)        23.3 (6.75)        14.5 (5.37)        9.14 (6.25)        14.4 (7.15)
  Median [Min, Max] 29.3 [3.26, 38.0]  24.0 [10.2, 34.4]  14.1 [5.08, 34.0]  6.79 [1.73, 25.8]  13.6 [1.73, 38.0]

2.3 Develop and interpret a histogram for the distribution of house prices

library(ggplot2)
p = ggplot(data = Housing_prices, aes(x = price))
# after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.4.0)
p1 = p + geom_histogram(aes(y = after_stat(density)), color = "white", fill = "blue")
p2 = p1 + geom_density(col="red")
p2 + ggtitle("Distribution of House Prices") + theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

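The histogram shows that the vast majority of houses are priced between 10 and 30, with the peak around 20. The distribution is slightly right-skewed: the overall mean (20.5) is above the median (19.8), and a tail extends to the maximum price of 50.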

2.4 Develop and interpret a box plot to describe the differences in housing prices among the number of rooms per household

# Box plot of price by number of rooms, with the individual observations overlaid
p = ggplot(data = Housing_prices, aes(x = rooms, y = price, fill = rooms, col = rooms))
p1 = p + geom_boxplot(col = "black") + geom_jitter(alpha = 0.05)
p1 + labs(x = "Rooms", y = "House Price (USD)") + ggtitle("House Price by Number of Rooms") + theme_bw()

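The box plot shows that 7-room houses have the highest prices (median 27.5), followed by 6-room houses (median 19.4). The 4-room and 5-room houses have similar medians (13.8 and 14.4), although the mean price of 4-room houses (17.3) is higher than that of 5-room houses (14.0), which suggests that a few high-priced 4-room houses (maximum 50.0) pull the mean up.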

2.5 Conduct a statistical test to determine whether housing prices were different among houses with different numbers of rooms. Interpret the findings.

Price.Rooms = aov(price ~ rooms, data = Housing_prices)
summary(Price.Rooms)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## rooms         3   7162  2387.2   43.49 <2e-16 ***
## Residuals   368  20202    54.9                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA reports a p-value far below 0.05 (F = 43.49 on 3 and 368 degrees of freedom, p < 2e-16), indicating that mean house prices differ by number of rooms, so the null hypothesis is rejected. To identify which room groups differ from each other, Tukey’s test is performed.
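
The descriptive table reports price standard deviations ranging from 5.05 to 11.7 across the room groups, so the equal-variance assumption behind the classical ANOVA may be worth checking. The sketch below is not part of the original analysis. It uses simulated data (the group sizes, means and SDs from the table serve only as simulation inputs, and `sim`, `room_labels` and `welch` are illustrative names) to show how Bartlett's test and Welch's one-way test from base R could be run as a sensitivity check.

```r
set.seed(1)  # simulated data only; not the real Housing_prices file

# Group sizes, means and SDs borrowed from the descriptive table as simulation inputs.
n           <- c(15, 27, 254, 76)
mu          <- c(17.3, 14.0, 18.9, 28.8)
sdev        <- c(10.7, 5.05, 5.50, 11.7)
room_labels <- c("4-room", "5-room", "6-room", "7-room")

sim <- data.frame(
  rooms = factor(rep(room_labels, times = n), levels = room_labels),
  price = unlist(Map(function(k, m, s) rnorm(k, mean = m, sd = s), n, mu, sdev))
)

# Bartlett's test: null hypothesis of equal variances across groups.
bartlett.test(price ~ rooms, data = sim)

# Welch's one-way test: compares group means without assuming equal variances.
welch <- oneway.test(price ~ rooms, data = sim, var.equal = FALSE)
welch
```

On the real data, the same calls would take `data = Housing_prices`. If Welch's test agrees with the classical ANOVA, the conclusion does not hinge on the equal-variance assumption.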

2.6 Determine which particular number of rooms/house had different house prices using the Tukey posthoc test. Fill in the following table and interpret the findings

tukey.Price.Rooms = TukeyHSD(Price.Rooms)
tukey.Price.Rooms
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = price ~ rooms, data = Housing_prices)
## 
## $rooms
##                    diff       lwr       upr     p adj
## 5-room-4-room -3.215556 -9.373321  2.942210 0.5331459
## 6-room-4-room  1.603780 -3.477109  6.684668 0.8475492
## 7-room-4-room 11.511053  6.108559 16.913546 0.0000004
## 6-room-5-room  4.819335  0.948716  8.689954 0.0077590
## 7-room-5-room 14.726608 10.442544 19.010672 0.0000000
## 7-room-6-room  9.907273  7.407162 12.407384 0.0000000
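The output gives the difference in means, the 95% confidence intervals and the adjusted p-values for all pairwise comparisons between room groups. The confidence intervals and adjusted p-values show significant differences for four pairs: 7-room vs 4-room (difference 11.5), 6-room vs 5-room (4.8), 7-room vs 5-room (14.7) and 7-room vs 6-room (9.9). The remaining two pairs, 5-room vs 4-room (p = 0.53) and 6-room vs 4-room (p = 0.85), do not differ significantly, as their confidence intervals include zero.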
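
The task asks for the post hoc results in table form. One way to get there, sketched here with the values copied by hand from the Tukey output above, is to store the comparisons in a data frame and flag the significant pairs. The names `tukey_tab`, `significant` and `ci_excludes_zero` are illustrative; in a live session the same table could instead be built from `as.data.frame(tukey.Price.Rooms$rooms)`.

```r
# Values copied from the TukeyHSD output above.
tukey_tab <- data.frame(
  comparison = c("5-room vs 4-room", "6-room vs 4-room", "7-room vs 4-room",
                 "6-room vs 5-room", "7-room vs 5-room", "7-room vs 6-room"),
  diff  = c(-3.215556, 1.603780, 11.511053, 4.819335, 14.726608, 9.907273),
  lwr   = c(-9.373321, -3.477109, 6.108559, 0.948716, 10.442544, 7.407162),
  upr   = c(2.942210, 6.684668, 16.913546, 8.689954, 19.010672, 12.407384),
  p_adj = c(0.5331459, 0.8475492, 0.0000004, 0.0077590, 0.0000000, 0.0000000)
)

# Flag a pair when its adjusted p-value is below 0.05.
tukey_tab$significant <- tukey_tab$p_adj < 0.05

# Cross-check: a significant pair should have a 95% interval that excludes zero.
tukey_tab$ci_excludes_zero <- tukey_tab$lwr > 0 | tukey_tab$upr < 0

tukey_tab[, c("comparison", "diff", "p_adj", "significant")]
```

The two flags are computed separately so that the p-value reading and the confidence-interval reading can be compared row by row.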
2.7 Present the results in an R Markdown pdf file that includes the analysis codes, outputs and graphs