Task 1. Reflection:
When I started the module, I had no background in it; I only knew it
was related to data analysis, and I needed to learn how to apply it. In
the first workshop, I realised I needed to download RStudio, a program
built on the R programming language. After three workshops, I
understood what RStudio is, how to use it and why, especially with the
help of the module teachers.

The most interesting insight I gained from the R-Basic data analysis
module last month was data planning and management, which I found very
useful for applying data analysis in RStudio to my work: preparing and
importing data, writing code and functions, and then reading and
understanding the output, including graphs, tables and figures. I also
learned that data quality, including data preparation and cleaning, is
the first step after planning any data analysis. After that, I learned
about descriptive analysis, a type of data analysis used to summarise
the distribution and characteristics of a sample, and how to report and
visualise these in RStudio. During this period I also learned to test
hypotheses with various statistical methods, for numeric or categorical
data and for two, three or more groups.

My particular area of interest is health informatics, which involves
accessing, analysing and managing health data and applying medical
knowledge to information technology systems so that clinicians can
provide better patient care. Analysing and interpreting data using R
and RStudio is an opportunity that strongly motivates me for future use
in my field. I also discovered that healthcare data analytics plays a
vital role in finding valuable patterns, trends and insights in data,
which in turn supports clinical decision-making and ultimately
contributes to better patient outcomes through informed healthcare
interventions.

Working in R with R Markdown makes it easy to access all the imported
data and created objects, and it lets me set up projects to organise my
work and share it with collaborators more efficiently. Analysing
healthcare data this way may be extremely valuable because I deal with
health data and systems. Good decision-making begins with data: when
data are prepared, managed, analysed, interpreted and visualised
carefully and clearly, they are easier to work with and it is easier to
choose the appropriate decision. This course has contributed
significantly to my learning journey.
Task 2. Analysis report:
2.1 Present the study design, null and alternative hypotheses
2.1.1. Study design: A cross-sectional investigation of housing
conditions in 372 suburbs was conducted to determine whether housing
prices varied among houses with different numbers of rooms.
2.1.2. Null hypothesis: Housing prices did not vary among houses
with different numbers of rooms.
2.1.3. Alternative hypothesis: Housing prices varied among houses
with different numbers of rooms.
Import the “Housing_prices” dataset
library(readr)
Housing_prices <- read_csv("C:/Users/24928614/Downloads/Housing_prices.csv")
## Rows: 372 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): river, rooms
## dbl (6): ID, price, age, industry, ptratio, low_ses
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(Housing_prices)
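The reflection above names data preparation and cleaning as the first step after planning, so a quick quality check after import may be worth sketching. The block below is only an illustration: `toy` is a hypothetical stand-in with a few made-up rows that reuse some of the column names from the import message, and the same calls could be pointed at `Housing_prices` instead.

```r
# Hypothetical stand-in rows (NOT the real data); the column names follow
# the import message above. Replace `toy` with Housing_prices in practice.
toy <- data.frame(
  ID    = 1:6,
  price = c(13.8, 14.4, 19.4, NA, 27.5, 19.8),
  rooms = c("4-room", "5-room", "6-room", "6-room", "7-room", "6-room"),
  river = c("No", "No", "Yes", "No", "No", "No"),
  stringsAsFactors = FALSE
)

# Count missing values in each column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Store rooms as a factor with an explicit level order, so that tables,
# plots and model output list the groups from 4-room to 7-room
toy$rooms <- factor(toy$rooms,
                    levels = c("4-room", "5-room", "6-room", "7-room"))

# Group sizes per number of rooms
rooms_counts <- table(toy$rooms)
print(rooms_counts)

# Keep only rows with a recorded price before any modelling
toy_complete <- toy[!is.na(toy$price), ]
```

Whether rows with missing values should be dropped or handled another way depends on the study plan; dropping them here is just the simplest choice for a sketch.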
2.2 Describe the characteristics of the study sample by the number
of rooms/house
library(table1)
##
## Attaching package: 'table1'
## The following objects are masked from 'package:base':
##
## units, units<-
table1(~ river + price + age + industry + ptratio + low_ses | rooms, data = Housing_prices)
| | 4-room (N=15) | 5-room (N=27) | 6-room (N=254) | 7-room (N=76) | Overall (N=372) |
|---|---|---|---|---|---|
| river | | | | | |
| No | 15 (100%) | 23 (85.2%) | 239 (94.1%) | 67 (88.2%) | 344 (92.5%) |
| Yes | 0 (0%) | 4 (14.8%) | 15 (5.9%) | 9 (11.8%) | 28 (7.5%) |
| price | | | | | |
| Mean (SD) | 17.3 (10.7) | 14.0 (5.05) | 18.9 (5.50) | 28.8 (11.7) | 20.5 (8.59) |
| Median [Min, Max] | 13.8 [7.00, 50.0] | 14.4 [5.00, 23.7] | 19.4 [5.00, 50.0] | 27.5 [7.50, 50.0] | 19.8 [5.00, 50.0] |
| age | | | | | |
| Mean (SD) | 93.5 (15.9) | 89.8 (20.4) | 75.6 (24.3) | 77.2 (21.1) | 77.7 (23.6) |
| Median [Min, Max] | 100 [37.8, 100] | 96.2 [9.80, 100] | 84.5 [6.00, 100] | 82.7 [2.90, 100] | 87.3 [2.90, 100] |
| industry | | | | | |
| Mean (SD) | 17.8 (2.23) | 17.7 (5.43) | 13.6 (6.16) | 11.1 (6.53) | 13.5 (6.32) |
| Median [Min, Max] | 18.1 [9.90, 19.6] | 18.1 [6.91, 27.7] | 13.9 [2.18, 27.7] | 9.90 [1.89, 19.6] | 18.1 [1.89, 27.7] |
| ptratio | | | | | |
| Mean (SD) | 19.3 (1.94) | 18.7 (2.31) | 19.3 (1.71) | 18.4 (1.81) | 19.1 (1.82) |
| Median [Min, Max] | 20.2 [14.7, 20.2] | 20.1 [14.7, 21.2] | 20.2 [14.7, 21.2] | 18.0 [14.7, 21.0] | 20.2 [14.7, 21.2] |
| low_ses | | | | | |
| Mean (SD) | 24.4 (11.4) | 23.3 (6.75) | 14.5 (5.37) | 9.14 (6.25) | 14.4 (7.15) |
| Median [Min, Max] | 29.3 [3.26, 38.0] | 24.0 [10.2, 34.4] | 14.1 [5.08, 34.0] | 6.79 [1.73, 25.8] | 13.6 [1.73, 38.0] |
2.3 Develop and interpret a histogram for the distribution of house
prices
library(ggplot2)
p = ggplot(data = Housing_prices, aes(x = price))
# after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.4.0);
# bins = 30 states the default bin count explicitly
p1 = p + geom_histogram(aes(y = after_stat(density)), bins = 30, color = "white", fill = "blue")
p2 = p1 + geom_density(col="red")
p2 + ggtitle("Distribution of House Prices") + theme_bw()

As the histogram shows, most houses are priced between 10 and 30, with
the peak around 20. The distribution has a longer right tail, with
prices reaching the maximum of 50.
2.4 Develop and interpret a box plot to describe the differences in
housing prices among the number of rooms per household
p = ggplot(data = Housing_prices, aes(x = rooms, y = price, fill = rooms, col = rooms))
p1 = p + geom_boxplot(col = "black") + geom_jitter(alpha = 0.05)
p1 + labs(x = "Rooms", y = "House Price (USD)") + ggtitle("Rooms per household by Price") + theme_bw()

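The box plot shows that 7-room houses have the highest prices (median
27.5), followed by 6-room houses (median 19.4). 4-room and 5-room
houses have similar median prices (13.8 and 14.4); the mean for 4-room
houses is higher than for 5-room houses (17.3 vs 14.0) because it is
pulled up by at least one high-priced house (maximum 50.0).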
2.5 Conduct a statistical test to determine whether housing prices
were different among houses with different numbers of rooms. Interpret
the findings.
Price.Rooms = aov(price ~ rooms, data = Housing_prices)
summary(Price.Rooms)
## Df Sum Sq Mean Sq F value Pr(>F)
## rooms 3 7162 2387.2 43.49 <2e-16 ***
## Residuals 368 20202 54.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA gives F(3, 368) = 43.49 with p < 2e-16, far below 0.05, so
the null hypothesis is rejected: house prices differ by number of
rooms. To identify which room groups differ from each other, Tukey’s
test is performed.
2.6 Determine which particular number of rooms/house had different
house prices using the Tukey posthoc test. Fill in the following table
and interpret the findings
tukey.Price.Rooms = TukeyHSD(Price.Rooms)
tukey.Price.Rooms
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = price ~ rooms, data = Housing_prices)
##
## $rooms
## diff lwr upr p adj
## 5-room-4-room -3.215556 -9.373321 2.942210 0.5331459
## 6-room-4-room 1.603780 -3.477109 6.684668 0.8475492
## 7-room-4-room 11.511053 6.108559 16.913546 0.0000004
## 6-room-5-room 4.819335 0.948716 8.689954 0.0077590
## 7-room-5-room 14.726608 10.442544 19.010672 0.0000000
## 7-room-6-room 9.907273 7.407162 12.407384 0.0000000
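The output gives the difference in means, the 95% confidence interval
and the adjusted p-value for every pair of room groups. The confidence
intervals and p-values show significant differences for four pairs:
7-room vs 4-room, 6-room vs 5-room, 7-room vs 5-room and 7-room vs
6-room. The other two pairs, 5-room vs 4-room (p = 0.53) and 6-room vs
4-room (p = 0.85), do not differ significantly.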
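One way to fill in a results table from the Tukey output is to put the six comparisons in a data frame and flag the significant ones. The sketch below copies the numbers from the printed output so that it runs on its own; the column names and the 0.05 cut-off are choices made for this illustration. With the fitted object available, `as.data.frame(tukey.Price.Rooms$rooms)` should give the same numbers without retyping.

```r
# Values copied from the printed TukeyHSD output above
tukey_table <- data.frame(
  comparison = c("5-room vs 4-room", "6-room vs 4-room", "7-room vs 4-room",
                 "6-room vs 5-room", "7-room vs 5-room", "7-room vs 6-room"),
  diff  = c(-3.215556, 1.603780, 11.511053, 4.819335, 14.726608, 9.907273),
  lwr   = c(-9.373321, -3.477109, 6.108559, 0.948716, 10.442544, 7.407162),
  upr   = c(2.942210, 6.684668, 16.913546, 8.689954, 19.010672, 12.407384),
  p_adj = c(0.5331459, 0.8475492, 0.0000004, 0.0077590, 0.0000000, 0.0000000)
)

# A pair differs significantly when the adjusted p-value is below 0.05 ...
tukey_table$significant <- tukey_table$p_adj < 0.05

# ... which should agree with the 95% confidence interval excluding zero
tukey_table$ci_excludes_zero <- tukey_table$lwr > 0 | tukey_table$upr < 0

print(tukey_table)
```

The two flag columns agree for all six pairs, which is a simple consistency check between the confidence intervals and the adjusted p-values.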
2.7 Present the results in an R Markdown pdf file that includes the
analysis codes, outputs and graphs