Sunny Kumar Vaishnov (S3822295)
#readr provides read_csv(), which is used to import both datasets
library(readr)
library(readxl)
library(dplyr)
library(ggplot2)
library(magrittr)
library(Hmisc)
library(car)
It is claimed that patients in South Western Sydney hospitals have an average length of stay (ALOS) of 4.5 days. The objective of this report is to test whether this claim is true. The investigation proceeds through the following steps:
• Importing the data and filtering it to the “South Western Sydney” Local Hospital Network.
• Summarising the data by investigating the mean, median, standard deviation, maximum and minimum values.
• Visualising the data with a boxplot and a histogram.
• Conducting an appropriate test.
The data come from the Applied Analytics student drive in the RMIT University data repository, in the file average-length-of-stay-multilevel-data.csv, which is imported here as Hospital.
• Sample Size 30021
• There are 19 variables.
• Average length of stay (days) and Local Hospital Network (LHN) are the important variables, as the analysis is conducted on them.
• We filter the data down to the records we want to work with and then perform a one-sample t-test on them.
#Importing the dataset
Hospital <- read_csv("average-length-of-stay-multilevel-data.csv",skip = 12)
Missing column names filled in: 'X9' [9], 'X11' [11], 'X13' [13], 'X15' [15], 'X17' [17], 'X19' [19]
Parsed with column specification:
cols(
`Reporting unit` = col_character(),
`Reporting unit type` = col_character(),
State = col_character(),
`Local Hospital Network (LHN)` = col_character(),
`Peer group` = col_character(),
`Time period` = col_character(),
Category = col_character(),
`Total number of stays` = col_character(),
X9 = col_logical(),
`Number of overnight stays` = col_character(),
X11 = col_logical(),
`Percentage of overnight stays` = col_character(),
X13 = col_logical(),
`Average length of stay (days)` = col_character(),
X15 = col_logical(),
`Peer group average (days)` = col_character(),
X17 = col_logical(),
`Total overnight patient bed days` = col_character(),
X19 = col_logical()
)
6 parsing failures.
row col expected actual file
30010 X15 1/0/T/F/TRUE/FALSE <87> 'average-length-of-stay-multilevel-data.csv'
30011 X15 1/0/T/F/TRUE/FALSE <87> 'average-length-of-stay-multilevel-data.csv'
30012 X15 1/0/T/F/TRUE/FALSE <87> 'average-length-of-stay-multilevel-data.csv'
30013 X15 1/0/T/F/TRUE/FALSE <87> 'average-length-of-stay-multilevel-data.csv'
30014 X15 1/0/T/F/TRUE/FALSE <87> 'average-length-of-stay-multilevel-data.csv'
..... ... .................. ...... ............................................
See problems(...) for more details.
#filter the dataset to the South Western Sydney Local Hospital Network
Filtered_data <- Hospital %>% filter(`Local Hospital Network (LHN)`== "South Western Sydney" )
View(Filtered_data)
#checking the dimensions of the database
dim(Filtered_data)
[1] 499 19
#checking the class
class(Filtered_data$`Average length of stay (days)`)
[1] "character"
#class was character so convert into numeric
Filtered_data$`Average length of stay (days)` <- Filtered_data$`Average length of stay (days)` %>% as.double()
NAs introduced by coercion
Filtered_data$`Local Hospital Network (LHN)`<- Filtered_data$`Local Hospital Network (LHN)` %>% as.factor()
#Checking the class
class(Filtered_data$`Average length of stay (days)`)
[1] "numeric"
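The "NAs introduced by coercion" warning above means that some entries in the column are not numeric text. The following is a minimal sketch, on a made-up vector rather than the hospital data, of how one might list the offending entries before converting. The placeholder strings "NP" and "-" are assumptions for illustration only and are not confirmed values from the source file.

```r
# Hypothetical character column; "NP" and "-" stand in for non-numeric entries
alos_raw <- c("4.5", "3.2", "NP", "10.3", "-")

# Convert, silencing the expected "NAs introduced by coercion" warning
alos_num <- suppressWarnings(as.numeric(alos_raw))

# Entries that were present as text but failed to convert
bad_values <- unique(alos_raw[is.na(alos_num) & !is.na(alos_raw)])
print(bad_values)

# Number of values that became NA during conversion
n_lost <- sum(is.na(alos_num))
print(n_lost)
```

Inspecting the failed entries first makes it clear whether the resulting NAs represent suppressed or unavailable figures rather than data-entry errors.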
Calculate descriptive statistics (i.e., mean, median, standard deviation, first and third quartile, interquartile range, minimum and maximum values) of the selected measurement.
#summarise the data
Filtered_data %>% summarise(Min=min(`Average length of stay (days)`,na.rm = TRUE),
Q1=quantile(`Average length of stay (days)`, probs= 0.25,na.rm = TRUE),
Median= median(`Average length of stay (days)`, na.rm = TRUE),
Q3 = quantile(`Average length of stay (days)`, probs= 0.75, na.rm = TRUE),
Max = max(`Average length of stay (days)`,na.rm = TRUE),
Mean = mean(`Average length of stay (days)`, na.rm = TRUE),
SD = sd(`Average length of stay (days)`, na.rm = TRUE),
n = n(),
Missing = sum(is.na(`Average length of stay (days)`)))
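The task statement lists the interquartile range, which the summary above does not compute directly. As a small sketch on made-up values (not the hospital data), the IQR can be obtained either as Q3 minus Q1 or with base R's IQR() function:

```r
# Hypothetical ALOS values in days, including one missing value
alos <- c(2.1, 3.4, 4.0, 4.6, 5.2, 6.8, NA)

# Quartiles computed the same way as in the summary (default quantile type)
q1 <- quantile(alos, probs = 0.25, na.rm = TRUE, names = FALSE)
q3 <- quantile(alos, probs = 0.75, na.rm = TRUE, names = FALSE)

# Interquartile range, by hand and with the built-in function
iqr_manual  <- q3 - q1
iqr_builtin <- IQR(alos, na.rm = TRUE)

print(c(manual = iqr_manual, builtin = iqr_builtin))
```

Inside the summarise() call, the same quantity could be added as one more argument, for example IQR = IQR(`Average length of stay (days)`, na.rm = TRUE).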
Now we will show some descriptive visualizations of the above data by plotting a histogram and a boxplot of the sample.
#histogram
Filtered_data$`Average length of stay (days)` %>% hist(col="sky blue",
xlim = c(0,15),
xlab = "Average Length of Stay",
main = "Histogram of random sample")
#Boxplot
Filtered_data$`Average length of stay (days)` %>% boxplot(ylab = "ALOS")
In the above dataset we can see in the boxplot that there are some outliers present in the dataset. We have to remove them to proceed further in our analysis.
#find outliers values
outliers <-boxplot(Filtered_data$`Average length of stay (days)`,plot = FALSE)$out
print(outliers)
[1] 10.3 11.7 10.4 11.6 10.1 10.9 10.0 10.5 9.6 11.9 11.8 10.3 10.3 9.8 9.6 11.7 10.2 10.4 11.3 9.8 9.5
#finding in which row outliers are
Filtered_data[which(Filtered_data$`Average length of stay (days)` %in% outliers),]
#removing (logical indexing, so no rows are dropped by mistake when there are no outliers)
Filtered_data <- Filtered_data[!(Filtered_data$`Average length of stay (days)` %in% outliers),]
#again checking
boxplot(Filtered_data$`Average length of stay (days)`)$out
[1] 9.0 9.1 9.4 8.9 9.2 9.0 9.0 9.2 9.3 9.4 9.2 9.1 8.9 8.9
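The boxplot flags a value as an outlier when it lies more than 1.5 times the hinge spread beyond the lower or upper hinge. Because these fences are recomputed from whatever data remain, a second boxplot after removal can flag further values, as seen above. The sketch below reproduces the rule by hand on made-up numbers (not the hospital data) and compares it with boxplot.stats():

```r
# Hypothetical sample with one extreme value
x <- c(2, 3, 3, 4, 4, 4, 5, 5, 6, 12)

# Lower and upper hinges from Tukey's five-number summary
hinges <- fivenum(x)[c(2, 4)]

# Fences sit 1.5 times the hinge spread beyond each hinge
step        <- 1.5 * diff(hinges)
lower_fence <- hinges[1] - step
upper_fence <- hinges[2] + step

# Values outside the fences, by hand and as reported by boxplot.stats()
out_manual  <- x[x < lower_fence | x > upper_fence]
out_boxplot <- boxplot.stats(x)$out

print(c(lower = lower_fence, upper = upper_fence))
print(out_manual)
```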
Now, to test whether the above claim is true, we conduct a one-sample t-test.
#one sample t-test; mu is not set in this call, so R tests against its default of mu = 0
t.test(Filtered_data$`Average length of stay (days)`, alternative="two.sided")
One Sample t-test
data: Filtered_data$`Average length of stay (days)`
t = 40.036, df = 349, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
4.077895 4.499247
sample estimates:
mean of x
4.288571
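The output above tests the mean against 0, whereas the claim concerns 4.5 days. As a rough sketch, the statistic for the claimed value can be recovered from the printed numbers alone. The variable names below are illustrative, and the result is only as precise as the rounded output it is built from.

```r
# Values copied from the printed t.test() output above
mean_x <- 4.288571   # sample mean
t_zero <- 40.036     # t statistic against mu = 0
df     <- 349        # degrees of freedom
mu0    <- 4.5        # claimed ALOS in days

# With mu = 0, t = mean / SE, so the standard error can be recovered
se <- mean_x / t_zero

# t statistic and two-sided p-value for H0: mu = 4.5
t_claim <- (mean_x - mu0) / se
p_claim <- 2 * pt(-abs(t_claim), df = df)

# 95% confidence interval, which does not depend on mu
ci <- mean_x + c(-1, 1) * qt(0.975, df = df) * se

print(round(c(t = t_claim, p = p_claim), 4))
print(round(ci, 3))
```

With the data frame in memory, the direct equivalent would be t.test(Filtered_data$`Average length of stay (days)`, mu = 4.5, alternative = "two.sided").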
A two-tailed, one-sample t-test at the 0.05 level of significance was used to test the claim that the average length of stay in South Western Sydney hospitals is 4.5 days. The data were first summarised and visualised with a histogram and a boxplot, and the outliers flagged by the boxplot were removed before testing. The resulting sample mean was 4.29 days (SD = 2.43). Because the t.test() call above does not set mu = 4.5, the printed t = 40.036 and p < .001 test the mean against 0 rather than against the claimed value. The 95% CI [4.078, 4.499] does not depend on mu, and it lies just below 4.5. Recomputing the statistic from the printed mean and its standard error gives approximately t(349) = -1.97 against 4.5. The sample mean is therefore statistically significantly lower than the claimed 4.5 days at the 0.05 level, although only narrowly.
Studies have shown that tutorials are quite important in improving student understanding. The objective of this report is to test whether this claim is true. The investigation proceeds through the following steps:
• Importing the data and subsetting it to the two variables “Score before tutorial” and “Score after tutorial”.
• Summarising the data by investigating the mean, median, standard deviation, maximum and minimum values of the scores before and after the tutorials, then computing the difference between the two scores, adding it to the dataset with mutate, and summarising it as well.
• Visualising the data with a boxplot and a Q-Q plot.
• Conducting an appropriate test at the 0.05 level of significance.
The data were obtained from the Canvas page of the RMIT University Applied Analytics course, in the file “Assignment 4b-3.csv”, which is imported here as Tutorials.
• Sample Size 1290
• There are 8 variables.
• Score before tutorial and Score after tutorial are the important variables, as the analysis is conducted on them.
#Importing the dataset
Tutorials <- read_csv("Assignment 4b-3.csv")
Parsed with column specification:
cols(
Gender = col_double(),
IQ = col_double(),
Profession = col_double(),
Advice = col_double(),
School = col_double(),
`Score before tutorial` = col_double(),
`Score after tutorial` = col_double()
)
View(Tutorials)
#Subsetting the dataset on the basis of two variables:
Subset_data <- Tutorials %>% select(`Score before tutorial`,`Score after tutorial`)
View(Subset_data)
Calculate summary statistics (i.e., mean, median, standard deviation, first and third quartile, interquartile range, minimum and maximum values).
#before tutorials:
Subset_data %>% summarise( Min = min(`Score before tutorial`, na.rm = TRUE),
Q1 = quantile(`Score before tutorial`, probs = .25, na.rm = TRUE),
Median = median(`Score before tutorial`, na.rm = TRUE),
Q3 = quantile(`Score before tutorial`, probs = .75, na.rm = TRUE),
Max = max(`Score before tutorial`, na.rm = TRUE),
Mean = mean(`Score before tutorial`, na.rm = TRUE),
SD = sd(`Score before tutorial`, na.rm = TRUE),
n = n(),
Missing = sum(is.na(`Score before tutorial`)))
#after tutorials:
Subset_data %>% summarise( Min = min(`Score after tutorial`, na.rm = TRUE),
Q1 = quantile(`Score after tutorial`, probs = .25, na.rm = TRUE),
Median = median(`Score after tutorial`, na.rm = TRUE),
Q3 = quantile(`Score after tutorial`, probs = .75, na.rm = TRUE),
Max = max(`Score after tutorial`, na.rm = TRUE),
Mean = mean(`Score after tutorial`, na.rm = TRUE),
SD = sd(`Score after tutorial`, na.rm = TRUE),
n = n(),
Missing = sum(is.na(`Score after tutorial`)))
#Diff:
Subset_data <- Subset_data %>% mutate(Diff = `Score after tutorial` - `Score before tutorial`)
#summary
Subset_data %>% summarise( Min = min(Diff, na.rm = TRUE),
Q1 = quantile(Diff, probs = .25, na.rm = TRUE),
Median = median(Diff, na.rm = TRUE),
Q3 = quantile(Diff, probs = .75, na.rm = TRUE),
Max = max(Diff, na.rm = TRUE),
Mean = mean(Diff, na.rm = TRUE),
SD = sd(Diff, na.rm = TRUE),
n = n(),
Missing = sum(is.na(Diff)))
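The three summaries above repeat the same set of statistics for different columns. One possible way to avoid the repetition is a small helper function applied to each column. The sketch below uses base R and made-up scores. The name describe_numeric and the example data are hypothetical and not part of the original analysis.

```r
# Hypothetical helper, not part of the original report: returns one row of
# descriptive statistics for a numeric vector
describe_numeric <- function(x) {
  data.frame(
    Min     = min(x, na.rm = TRUE),
    Q1      = quantile(x, probs = 0.25, na.rm = TRUE, names = FALSE),
    Median  = median(x, na.rm = TRUE),
    Q3      = quantile(x, probs = 0.75, na.rm = TRUE, names = FALSE),
    Max     = max(x, na.rm = TRUE),
    Mean    = mean(x, na.rm = TRUE),
    SD      = sd(x, na.rm = TRUE),
    n       = length(x),
    Missing = sum(is.na(x))
  )
}

# Made-up scores for illustration only
scores <- data.frame(
  before = c(50, 55, 61, 70, NA),
  after  = c(58, 57, 66, 79, 72)
)
scores$diff <- scores$after - scores$before

# One row per column, stacked into a single table
summary_table <- do.call(rbind, lapply(scores, describe_numeric))
print(summary_table)
```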
Now we will show some descriptive visualization of the above data by plotting Q-Q Plot and Boxplot.
#boxplot
boxplot(Subset_data$Diff)
#q-q plot:
qqPlot(Subset_data$Diff, dist="norm")
[1] 8 83
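Now we conduct a paired-samples t-test on the above data to check whether there is an improvement after the tutorials. Because Diff already holds the after-minus-before difference for each student, a one-sample t-test of Diff against 0 is equivalent to the paired test.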
#ttest
t.test(Subset_data$Diff,
mu=0,
alternative = "two.sided")
One Sample t-test
data: Subset_data$Diff
t = 19.144, df = 1289, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
4.802104 5.898671
sample estimates:
mean of x
5.350388
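The test above is run as a one-sample t-test on the differences. As a quick check on simulated scores (not the tutorial data), the sketch below shows that this gives the same statistic, p-value and confidence interval as calling t.test() with paired = TRUE on the two original columns:

```r
# Simulated before/after scores, for illustration only
set.seed(1)
before <- rnorm(30, mean = 60, sd = 10)
after  <- before + rnorm(30, mean = 5, sd = 8)

# One-sample t-test on the differences, as done in the report
res_diff <- t.test(after - before, mu = 0, alternative = "two.sided")

# Paired-samples t-test on the two columns
res_paired <- t.test(after, before, paired = TRUE, alternative = "two.sided")

# Compare the two results
same_t  <- isTRUE(all.equal(unname(res_diff$statistic), unname(res_paired$statistic)))
same_p  <- isTRUE(all.equal(res_diff$p.value, res_paired$p.value))
same_ci <- isTRUE(all.equal(as.numeric(res_diff$conf.int), as.numeric(res_paired$conf.int)))
print(c(same_t = same_t, same_p = same_p, same_ci = same_ci))
```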
A paired-samples t-test at the 0.05 level of significance was used to test for a significant mean difference between students' scores before and after the tutorials. The data were imported, subset to the two score variables, and summarised. The difference between the scores (after minus before) was then added to the dataset as the variable Diff and summarised as well, and its distribution was visualised with a boxplot and a Q-Q plot. The mean difference was found to be 5.35 (SD = 10.03). The paired-samples t-test found a statistically significant mean difference between students' scores before and after the tutorials, t(1289) = 19.144, p < .001, 95% CI [4.802, 5.899]. Scores were found to have increased significantly after the tutorials were provided to students.
Q1: The data were taken from: https://www.aihw.gov.au/reports-data/myhospitals/sectors/admitted-patients
Q2: The data were taken from: https://rmit.instructure.com/courses/67269/assignments/443968