Student ID: “24928614”

Name: “Wejdan Suhaiman G Alshammari”

Course: “TRM 32931 “Practical Data Analysis using R - Basic””

Task 1. Reflection:

When I started the module, I had no background in R; I only knew it was related to data analysis, and I needed to learn how to apply it. In the first workshop, I realised I needed to install RStudio, a program built on the R programming language. After three workshops, and with the help of the module teachers, I understood what RStudio is, how to use it and why. The most interesting insight I gained from the module over the last month was data planning and management, which I found very useful for applying data analysis in RStudio to my own work: preparing and importing data, writing code and functions, and then reading and understanding the output, including graphs, tables and figures. I also learned that data quality, including data preparation and cleaning, is the first step after planning any data analysis. After that, I learned about descriptive analysis, which summarises results to describe the distribution and characteristics of a sample, and I learned to use RStudio to report and visualise these. During this period I also learned to test hypotheses with various statistical methods, for numeric or categorical variables and for two, three or more groups.

My particular area of interest is health informatics, which involves accessing, analysing and managing health data and applying medical knowledge to information technology systems so that clinicians can provide better patient care. Analysing and interpreting data with R and RStudio is an opportunity that strongly motivates me for future use in my field. I also discovered that healthcare data analytics plays a vital role in finding valuable patterns, trends and insights in data. This, in turn, supports clinical decision-making and ultimately contributes to better patient outcomes through informed healthcare interventions. Data processing using R and its add-on R Markdown makes it easy to access all imported data and created objects, and it allows me to set up projects to organise my work and share it with collaborators more efficiently. Because I deal with health data and systems, analysing healthcare data this way may be extremely valuable. Good decision-making begins with data: if data is prepared, managed, analysed, interpreted and visualised competently and clearly, it is easier to work with and to choose the appropriate decision. This course has contributed significantly to my learning journey.

Task 2. Analysis report:

2.1 Present the study design, null and alternative hypotheses

2.1.1. Study design: A cross-sectional investigation of house conditions in 372 suburbs was conducted to determine whether housing prices varied among houses with different numbers of rooms.

2.1.2. Null hypothesis: Housing prices did not vary among houses with different numbers of rooms.

2.1.3. Alternative hypothesis: Housing prices varied among houses with different numbers of rooms.
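
Stated formally, and assuming the comparison is between mean prices (as in the one-way ANOVA used in 2.5), the two hypotheses can be sketched as follows. The symbol $\mu_k$, the mean price of houses with $k$ rooms, is notation introduced here for illustration.

```latex
H_0:\ \mu_4 = \mu_5 = \mu_6 = \mu_7
\qquad\text{versus}\qquad
H_1:\ \mu_j \neq \mu_k \ \text{for at least one pair } j \neq k
```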

Import the “Housing_prices” dataset

library(readr)
Housing_prices <- read_csv("C:/Users/24928614/Downloads/Housing_prices.csv")

## Rows: 372 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): river, rooms
## dbl (6): ID, price, age, industry, ptratio, low_ses
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

View(Housing_prices)
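
The column-specification message printed by `read_csv()` can be avoided by declaring the column types up front. The following is a minimal sketch, assuming the column names reported in that message. The three rows written to a temporary file are made up for illustration and are not taken from Housing_prices.csv, and `toy_path` and `toy` are hypothetical names.

```r
library(readr)

# Hypothetical rows for illustration only; not the assignment data.
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "ID,price,age,industry,ptratio,low_ses,river,rooms",
  "1,21.5,65.2,8.1,18.0,9.5,No,6-room",
  "2,13.0,98.0,18.1,20.2,25.1,No,4-room",
  "3,33.4,45.8,2.2,17.8,4.0,Yes,7-room"
), toy_path)

# Declare the types explicitly: rooms as a factor with its four levels,
# river as text, and every other column as numeric.
toy <- read_csv(
  toy_path,
  col_types = cols(
    river = col_character(),
    rooms = col_factor(levels = c("4-room", "5-room", "6-room", "7-room")),
    .default = col_double()
  )
)

str(toy$rooms)
```

Reading `rooms` as a factor fixes the order of the groups in later tables and plots.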

2.2 Describe the characteristics of the study sample by the number of rooms/house

library(table1)

## 
## Attaching package: 'table1'

## The following objects are masked from 'package:base':
## 
##     units, units<-

table1(~ river + price + age + industry + ptratio + low_ses | rooms, data = Housing_prices)

	4-room (N=15)	5-room (N=27)	6-room (N=254)	7-room (N=76)	Overall (N=372)
river
No	15 (100%)	23 (85.2%)	239 (94.1%)	67 (88.2%)	344 (92.5%)
Yes	0 (0%)	4 (14.8%)	15 (5.9%)	9 (11.8%)	28 (7.5%)
price
Mean (SD)	17.3 (10.7)	14.0 (5.05)	18.9 (5.50)	28.8 (11.7)	20.5 (8.59)
Median [Min, Max]	13.8 [7.00, 50.0]	14.4 [5.00, 23.7]	19.4 [5.00, 50.0]	27.5 [7.50, 50.0]	19.8 [5.00, 50.0]
age
Mean (SD)	93.5 (15.9)	89.8 (20.4)	75.6 (24.3)	77.2 (21.1)	77.7 (23.6)
Median [Min, Max]	100 [37.8, 100]	96.2 [9.80, 100]	84.5 [6.00, 100]	82.7 [2.90, 100]	87.3 [2.90, 100]
industry
Mean (SD)	17.8 (2.23)	17.7 (5.43)	13.6 (6.16)	11.1 (6.53)	13.5 (6.32)
Median [Min, Max]	18.1 [9.90, 19.6]	18.1 [6.91, 27.7]	13.9 [2.18, 27.7]	9.90 [1.89, 19.6]	18.1 [1.89, 27.7]
ptratio
Mean (SD)	19.3 (1.94)	18.7 (2.31)	19.3 (1.71)	18.4 (1.81)	19.1 (1.82)
Median [Min, Max]	20.2 [14.7, 20.2]	20.1 [14.7, 21.2]	20.2 [14.7, 21.2]	18.0 [14.7, 21.0]	20.2 [14.7, 21.2]
low_ses
Mean (SD)	24.4 (11.4)	23.3 (6.75)	14.5 (5.37)	9.14 (6.25)	14.4 (7.15)
Median [Min, Max]	29.3 [3.26, 38.0]	24.0 [10.2, 34.4]	14.1 [5.08, 34.0]	6.79 [1.73, 25.8]	13.6 [1.73, 38.0]
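
As a quick consistency check on the table, the overall mean price should equal the size-weighted average of the four group means. The sketch below uses only the group sizes and means printed above; the object names are chosen here for illustration.

```r
# Group sizes and mean prices copied from the table above.
n_rooms    <- c("4-room" = 15, "5-room" = 27, "6-room" = 254, "7-room" = 76)
mean_price <- c("4-room" = 17.3, "5-room" = 14.0, "6-room" = 18.9, "7-room" = 28.8)

# The overall mean is the size-weighted average of the group means.
overall_mean <- weighted.mean(mean_price, w = n_rooms)
round(overall_mean, 1)

# Share of the sample in each group, in percent.
round(100 * n_rooms / sum(n_rooms), 1)
```

The weighted average reproduces the overall mean of 20.5 reported in the table, and the shares show that the groups are very unequal in size, with 6-room houses making up about two thirds of the sample.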

2.3 Develop and interpret a histogram for the distribution of house prices

library(ggplot2)
p = ggplot(data = Housing_prices, aes(x = price))
p1 = p + geom_histogram(aes(y = ..density..), color = "white", fill = "blue")
p2 = p1 + geom_density(col="red")
p2 + ggtitle("Distribution of House Prices") + theme_bw()

## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As the histogram shows, the vast majority of houses are priced between 10 and 30, with the peak around 20.
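
The deprecation warning and the `stat_bin()` message in the output above can both be avoided by using `after_stat(density)` and an explicit bin width. This is a minimal sketch on simulated prices, not the assignment data; the data frame `toy` and the bin width of 2 are assumptions chosen for illustration.

```r
library(ggplot2)

# Simulated prices for illustration only; not the assignment data.
set.seed(1)
toy <- data.frame(price = pmin(pmax(rnorm(372, mean = 20, sd = 8), 5), 50))

p <- ggplot(toy, aes(x = price)) +
  # after_stat(density) replaces the deprecated ..density.. notation
  geom_histogram(aes(y = after_stat(density)), binwidth = 2,
                 color = "white", fill = "blue") +
  geom_density(color = "red") +
  ggtitle("Distribution of House Prices") +
  theme_bw()
p
```

Setting `binwidth` explicitly also makes the histogram reproducible, because the shape no longer depends on the default of 30 bins.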

2.4 Develop and interpret a box plot to describe the differences in housing prices among the number of rooms per household

p = ggplot(data = Housing_prices, aes(x = rooms,  y = price, fill = rooms, col = rooms))
p1 = p + geom_boxplot(col = "black") + geom_jitter(alpha = 0.05) 
p1 + labs(x = "Rooms", y = "House Price (USD)") + ggtitle("Rooms per household by Price") + theme_bw()

The box plot shows that the highest prices are for houses with 7 rooms. Houses with 4 rooms have a higher mean price than houses with 5 rooms (17.3 vs 14.0 in the table in 2.2), although their medians are close (13.8 vs 14.4).

2.5 Conduct a statistical test to determine whether housing prices were different among houses with different numbers of rooms. Interpret the findings.

Price.Rooms = aov(price ~ rooms, data = Housing_prices)
summary(Price.Rooms)

##              Df Sum Sq Mean Sq F value Pr(>F)    
## rooms         3   7162  2387.2   43.49 <2e-16 ***
## Residuals   368  20202    54.9                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA reports a p-value far below 0.05 (p < 2e-16; F = 43.49 on 3 and 368 degrees of freedom), indicating that house prices differ by number of rooms, so the null hypothesis is rejected. To investigate which room groups differ from each other, Tukey’s test is performed.
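
The F value in the ANOVA table can be reproduced by hand from the sums of squares and degrees of freedom. The sketch below uses only the numbers printed in the table, so the result agrees with the reported value up to rounding; the object names are chosen here for illustration.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table above.
ss_rooms <- 7162;  df_rooms <- 3
ss_resid <- 20202; df_resid <- 368

ms_rooms <- ss_rooms / df_rooms   # mean square between groups
ms_resid <- ss_resid / df_resid   # mean square within groups
f_value  <- ms_rooms / ms_resid   # F statistic

# The upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_value, df_rooms, df_resid, lower.tail = FALSE)

round(f_value, 2)
p_value < 0.05
```

The p-value is a very small positive number, not exactly zero; R prints it as `<2e-16` because it is below the precision it reports.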

2.6 Determine which particular number of rooms/house had different house prices using the Tukey posthoc test. Fill in the following table and interpret the findings

tukey.Price.Rooms = TukeyHSD(Price.Rooms)
tukey.Price.Rooms

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = price ~ rooms, data = Housing_prices)
## 
## $rooms
##                    diff       lwr       upr     p adj
## 5-room-4-room -3.215556 -9.373321  2.942210 0.5331459
## 6-room-4-room  1.603780 -3.477109  6.684668 0.8475492
## 7-room-4-room 11.511053  6.108559 16.913546 0.0000004
## 6-room-5-room  4.819335  0.948716  8.689954 0.0077590
## 7-room-5-room 14.726608 10.442544 19.010672 0.0000000
## 7-room-6-room  9.907273  7.407162 12.407384 0.0000000

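The output gives the difference in means, the 95% confidence interval and the adjusted p-value for every pair of room groups. The confidence intervals and p-values show significant between-group differences for four pairs: 7-room vs 4-room, 6-room vs 5-room, 7-room vs 5-room and 7-room vs 6-room. The remaining 2 pairs, 5-room vs 4-room (p = 0.53) and 6-room vs 4-room (p = 0.85), are not significantly different.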
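
To fill in a results table, the Tukey output can be converted to a data frame with a column flagging the significant pairs. This is a minimal sketch on simulated data with hypothetical group means, not the assignment data; `toy`, `fit` and `report` are names chosen for illustration.

```r
# Simulated data for illustration only; the group means are hypothetical.
set.seed(42)
toy <- data.frame(
  rooms = factor(rep(c("4-room", "5-room", "6-room", "7-room"), each = 30)),
  price = c(rnorm(30, 17, 6), rnorm(30, 14, 6),
            rnorm(30, 19, 6), rnorm(30, 29, 6))
)

fit   <- aov(price ~ rooms, data = toy)
tukey <- TukeyHSD(fit)

# TukeyHSD() returns one matrix per factor, with one row per pair of groups.
m <- tukey$rooms

# Build a tidy table and flag pairs with an adjusted p-value below 0.05.
report <- data.frame(
  comparison  = rownames(m),
  diff        = round(m[, "diff"], 3),
  lwr         = round(m[, "lwr"], 3),
  upr         = round(m[, "upr"], 3),
  p_adj       = signif(m[, "p adj"], 3),
  significant = m[, "p adj"] < 0.05,
  row.names   = NULL
)
report
```

With four groups there are six pairwise comparisons, so the table has six rows, matching the layout of the Tukey output above.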

Assignment

2024-04-16

2.7 Present the results in an R Markdown pdf file that includes the analysis codes, outputs and graphs