setwd("~/Desktop/MastersSem_1/1. Applied Analytics/Assignment4")
library(readr)
library(tidyr)
library(dplyr)
library(Hmisc)
library(car)
library(knitr)
library(ggplot2)
library(outliers)
library(caret)
library(editrules)
Length of hospital stay (LOS), or average length of stay (ALOS), is an important indicator of the use of medical services and is used to assess the efficiency of hospital management, the quality of patient care, and hospital function. Stratified sampling was used to collect the ALOS across different Peer Groups, and hypothesis testing was then applied to gather statistical evidence from the sample, focusing here on the one-sample t-test. A sample of 30021 observations was gathered and analysed as follows.
The average is calculated as the number of bed days for overnight stays divided by the number of overnight stays, and it is reported for selected conditions and procedures. The mean average length of stay (ALOS) in a hospital was previously thought to be 4.5 days. We carry out this analysis to find out whether this claimed mean is supported by the data.
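Restating this definition as a formula, using the same terms as in the prose:
\[
\text{ALOS} = \frac{\text{number of bed days for overnight stays}}{\text{number of overnight stays}}
\]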
hospital <- read.csv("Assignment4A.csv", fileEncoding="latin1", stringsAsFactors = FALSE)
hospital_tidy <- hospital %>% select("Average.length.of.stay..days.")
colnames(hospital_tidy)[colnames(hospital_tidy)=="Average.length.of.stay..days."] <- "ALOS"
hospital_tidy$ALOStay <- as.numeric(hospital_tidy$ALOS)
knitr::kable(head(hospital_tidy))
| ALOS | ALOStay |
|---|---|
| 3.9 | 3.9 |
| 3.3 | 3.3 |
| 3.1 | 3.1 |
| 2.5 | 2.5 |
| 2.6 | 2.6 |
| 2.7 | 2.7 |
h <- hospital_tidy %>% summarise(Min = min(ALOStay, na.rm = TRUE),
                                 Q1 = quantile(ALOStay, probs = .25, na.rm = TRUE),
                                 Median = median(ALOStay, na.rm = TRUE),
                                 Max = max(ALOStay, na.rm = TRUE),
                                 Mean = mean(ALOStay, na.rm = TRUE),
                                 SD = sd(ALOStay, na.rm = TRUE),
                                 n = n(),
                                 Missing = sum(is.na(ALOStay)))
knitr::kable(head(h))
| Min | Q1 | Median | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|
| 1 | 2.5 | 3.4 | 13.9 | 3.956307 | 1.989426 | 30021 | 19770 |
(19770/30021)*100
[1] 65.8539
The amount of missing data is large relative to the size of the data set (approximately 66%), so we need to handle the missing values rather than simply drop them, which could bias the analysis. Because the data are skewed, we replace the missing values with the median of the observed values.
hospital_tidy$ALOStay[is.na(hospital_tidy$ALOStay)] <- median(hospital_tidy$ALOStay, na.rm = TRUE)
h_new <- hospital_tidy %>% summarise(Min = min(ALOStay, na.rm = TRUE),
                                     Q1 = quantile(ALOStay, probs = .25, na.rm = TRUE),
                                     Median = median(ALOStay, na.rm = TRUE),
                                     Max = max(ALOStay, na.rm = TRUE),
                                     Mean = mean(ALOStay, na.rm = TRUE),
                                     SD = sd(ALOStay, na.rm = TRUE),
                                     n = n(),
                                     Missing = sum(is.na(ALOStay)))
knitr::kable(head(h_new))
| Min | Q1 | Median | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|
| 1 | 3.4 | 3.4 | 13.9 | 3.589957 | 1.192034 | 30021 | 0 |
hist(hospital_tidy$ALOStay, xlab = "Average of Length of Stay (in days)",
main = "Histogram of Average of Length of Stay (ALOS)", breaks=50)
The distribution appears to be right-skewed. However, because the sample is large (n > 30), the sampling distribution of the mean can be treated as approximately normal by the Central Limit Theorem (CLT), so it is safe to continue with the two-tailed, one-sample t-test.
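As an optional visual check of this assumption (an addition, not part of the original output), a normal Q-Q plot of ALOS could be drawn with car::qqPlot, which is already loaded above; heavy departure in the upper tail would simply confirm the right skew seen in the histogram:
# Optional diagnostic: Q-Q plot of ALOS against a normal distribution
qqPlot(hospital_tidy$ALOStay, dist = "norm",
       ylab = "Average Length of Stay (days)")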
A two-tailed, one-sample t-test was used to determine whether the mean average length of stay (ALOS) was significantly different from the previously assumed value of 4.5 days.
Assumptions: the observations are independent and ALOS is approximately normally distributed; as discussed above, the large sample size allows the normality assumption to be relaxed by appealing to the CLT.
Statistical Hypotheses: \(H_0 : \mu = 4.5\) days versus \(H_A : \mu \neq 4.5\) days.
t.test(hospital_tidy$ALOStay, mu = 4.5, alternative="two.sided")
One Sample t-test
data: hospital_tidy$ALOStay
t = -132.28, df = 30020, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 4.5
95 percent confidence interval:
3.576472 3.603442
sample estimates:
mean of x
3.589957
Our decision is to reject \(H_0 : \mu = 4.5\) days. The analysis is summarised as follows:
The results of the one-sample t-test were therefore statistically significant. This means that the sample mean of the average length of stay (ALOS), about 3.59 days, was significantly lower than the hypothesised mean of 4.5 days.
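As a sanity check (an addition, not part of the original workflow), the reported t statistic can be recomputed by hand from the summary statistics of the imputed data; it should match the t.test() output of about -132.3:
# Recompute t = (sample mean - hypothesised mean) / (SD / sqrt(n))
x_bar <- mean(hospital_tidy$ALOStay)
s <- sd(hospital_tidy$ALOStay)
n <- length(hospital_tidy$ALOStay)
(x_bar - 4.5) / (s / sqrt(n))  # approximately -132.3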
The performance of a student in an exam depends on how well they have prepared, but other factors can also affect their grades. Knowing these factors can give a better understanding of how well a student is likely to perform.
One such factor is whether a student attends tutorial classes. We have gathered a dataset of 1290 students with their scores before and after attending tutorial classes, and we apply statistical tests to determine whether scores improved after attending the tutorials.
We want to determine whether tutorial classes help improve students' exam scores. In this investigation, a paired-samples t-test was used to test for a significant mean difference between scores before and after tutorial.
performance <- read.csv("Assignment4b.csv")
performance_new <- performance %>% select(Score.after.tutorial,Score.before.tutorial)
knitr::kable(head(performance_new))
| Score.after.tutorial | Score.before.tutorial |
|---|---|
| 42 | 50 |
| 38 | 13 |
| 43 | 27 |
| 37 | 44 |
| 35 | 35 |
| 41 | 55 |
p_before <- performance %>% summarise(Min = min(Score.before.tutorial, na.rm = TRUE),
                                      Q1 = quantile(Score.before.tutorial, probs = .25, na.rm = TRUE),
                                      Median = median(Score.before.tutorial, na.rm = TRUE),
                                      Max = max(Score.before.tutorial, na.rm = TRUE),
                                      Mean = mean(Score.before.tutorial, na.rm = TRUE),
                                      SD = sd(Score.before.tutorial, na.rm = TRUE),
                                      n = n(),
                                      Missing = sum(is.na(Score.before.tutorial)))
knitr::kable(head(p_before))
| Min | Q1 | Median | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|
| 13 | 27 | 37 | 55 | 35.82093 | 10.53833 | 1290 | 0 |
p_after <- performance %>% summarise(Min = min(Score.after.tutorial, na.rm = TRUE),
                                     Q1 = quantile(Score.after.tutorial, probs = .25, na.rm = TRUE),
                                     Median = median(Score.after.tutorial, na.rm = TRUE),
                                     Max = max(Score.after.tutorial, na.rm = TRUE),
                                     Mean = mean(Score.after.tutorial, na.rm = TRUE),
                                     SD = sd(Score.after.tutorial, na.rm = TRUE),
                                     n = n(),
                                     Missing = sum(is.na(Score.after.tutorial)))
knitr::kable(head(p_after))
| Min | Q1 | Median | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|
| 33 | 37 | 41 | 55 | 41.17132 | 4.952305 | 1290 | 0 |
boxplot(performance$Score.before.tutorial, performance$Score.after.tutorial,
        names = c("Before", "After"), ylab = "Scores", xlab = "Time")
performance <- performance %>% select(Score.after.tutorial, Score.before.tutorial) %>% mutate(Difference = Score.after.tutorial - Score.before.tutorial)
knitr::kable(head(performance,3))
| Score.after.tutorial | Score.before.tutorial | Difference |
|---|---|---|
| 42 | 50 | -8 |
| 38 | 13 | 25 |
| 43 | 27 | 16 |
We can then calculate the descriptive statistics for the differences (score after tutorial minus score before tutorial).
p_difference <- performance %>% summarise(Min = min(Difference, na.rm = TRUE),
                                          Q1 = quantile(Difference, probs = .25, na.rm = TRUE),
                                          Median = median(Difference, na.rm = TRUE),
                                          Max = max(Difference, na.rm = TRUE),
                                          Mean = mean(Difference, na.rm = TRUE),
                                          SD = sd(Difference, na.rm = TRUE),
                                          n = n(),
                                          Missing = sum(is.na(Difference)))
knitr::kable(head(p_difference))
| Min | Q1 | Median | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|
| -14 | 0 | 0 | 31 | 5.350388 | 10.03793 | 1290 | 0 |
qqPlot(performance$Difference, dist = "norm", ylab = "Difference in Scores")
[1] 8 83
The differences do not appear to be normally distributed. However, because the sample is large (n > 30), the sampling distribution of the mean difference can be treated as approximately normal by the Central Limit Theorem (CLT), so it is safe to continue with the paired-samples t-test.
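As an optional formal check (an addition, not part of the original output; with n = 1290 the sample is within shapiro.test()'s size limit), a Shapiro-Wilk test on the differences could be run. A small p-value would simply confirm the non-normality visible in the Q-Q plot, while the CLT argument above still justifies the t-test:
# Optional: Shapiro-Wilk normality test on the paired differences
shapiro.test(performance$Difference)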
A paired-samples t-test was used to test for a significant mean difference between scores before and after tutorial. The population mean of these differences is denoted as \(\mu_{\Delta}\).
The statistical hypotheses for the paired-samples t-test are \(H_0 : \mu_{\Delta} = 0\) versus \(H_A : \mu_{\Delta} \neq 0\).
t.test(performance$Score.after.tutorial, performance$Score.before.tutorial,
paired = TRUE,
alternative = "two.sided")
Paired t-test
data: performance$Score.after.tutorial and performance$Score.before.tutorial
t = 19.144, df = 1289, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
4.802104 5.898671
sample estimates:
mean of the differences
5.350388
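As a cross-check (an addition, not part of the original analysis), the paired-samples t-test is equivalent to a one-sample t-test on the Difference column; running it should reproduce the same t, df, p-value and confidence interval:
# Equivalent formulation: one-sample t-test on the paired differences
t.test(performance$Difference, mu = 0, alternative = "two.sided")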
Our decision is to reject \(H_0 : \mu_{\Delta} = 0\). The analysis is summarised as follows:
The results of the paired-samples t-test were statistically significant, indicating a statistically significant difference between scores before and after the tutorial. Students' scores were found to be significantly higher after attending tutorial classes, which suggests that tutorials play an important role in improving student scores.
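To put the size of the improvement in context (an addition, not part of the original analysis), a standardised effect size for a paired design can be estimated as the mean difference divided by the standard deviation of the differences, i.e. roughly 5.35 / 10.04 ≈ 0.53, a medium-sized effect:
# Cohen's d for paired samples: mean difference / SD of the differences
mean(performance$Difference) / sd(performance$Difference)  # approximately 0.53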