library('data.table')
library("ggpubr")
library('dplyr')
library(tidyr)
library(kableExtra)
library(ggthemes)
library(gridExtra)
library(car) 
library(Hmisc)
library(psychometric)
library(granova)

options(scipen=999)

1. Student Details

Rashbir Singh Kohli (s3810585)

2. Introduction

The purpose of this assignment is to answer two questions:
1. To verify the claim of South Western Sydney hospital that the average length of stay (ALOS) of a patient is 4.5 days.
2. To determine whether tutorials are effective in improving students' performance.
Two tests were identified for these questions:
1. One-sample two-sided t-test.
2. Paired-sample one-sided t-test.
Two data sets were used:
1. The average length of stay in hospital data set downloaded from the AIHW website [1].
2. Assignment 4b-3.csv, which contains information about the impact of tutorials on the performance of students.
The analysis was carried out in RStudio on these two data samples.
The data were explored using summary statistics, data tables, box plots, histograms, and Q-Q plots.
Each test was assessed by comparing its p-value with the significance level and by checking whether the confidence interval contains the hypothesised value.

3. Problem Statement

We have two problem statements:
1. To verify the claim of South Western Sydney hospital that the average length of stay (ALOS) of a patient is 4.5 days.
2. To determine whether tutorials are effective in improving students' performance.
For problem statement 1, the variables of interest are:
1. Local Hospital Network
2. The average length of stay in days
For problem statement 2, the variables of interest are:
1. Score before tutorial
2. Score after tutorial
We are required to perform hypothesis tests to assess the hospital's claim and to check whether there is a significant improvement in students' performance.
The t-tests assume approximately normal data, so normality needs to be checked with suitable plots.
Descriptive statistics also need to be examined.
Missing values and possible outliers in the data need to be handled as well.

Part A

4. Data (Problem Statement 1)

For problem statement 1 the data file Assignment 4A.csv is used; it is read with the fread() function from the data.table library with header = TRUE.
The initial data have 14 variables and 30021 observations.
Our variables of interest are:
1. Local Hospital Network (LHN)
2. The average length of stay (days)
To ease processing, gsub() was used to remove white space, brackets, and the text between the brackets from the column names.
Subsetting was used to drop observations recorded as NP or -.
The AveragelengthofstayInDays column was then converted to numeric using the as.numeric() function.

ALOSDf <- fread('Assignment 4A.csv', header = TRUE)

names(ALOSDf) <- gsub(" ", "", names(ALOSDf)) # Remove white space from column names
names(ALOSDf) <- gsub("LHN", "", names(ALOSDf)) # Remove the letters inside the brackets
names(ALOSDf) <- gsub("days", "InDays", names(ALOSDf)) # Substitute 'days' with 'InDays'
names(ALOSDf) <- gsub("[^A-Za-z]", "", names(ALOSDf)) # Remove everything except letters ("[^A-z]" would also keep [ \ ] ^ _ `)
ALOSDfInterest <- ALOSDf[ALOSDf$LocalHospitalNetwork == 'South Western Sydney',
                         c("LocalHospitalNetwork", "AveragelengthofstayInDays")]
## Removing NP and -
ALOSDfInterest <- ALOSDfInterest[ALOSDfInterest$AveragelengthofstayInDays != '-', ]
ALOSDfInterest <- ALOSDfInterest[ALOSDfInterest$AveragelengthofstayInDays != 'NP', ]
## Converting average length of stay (in days) to numeric
ALOSDfInterest$AveragelengthofstayInDays <- ALOSDfInterest$AveragelengthofstayInDays %>% as.numeric()
head(ALOSDfInterest)

##    LocalHospitalNetwork AveragelengthofstayInDays
## 1: South Western Sydney                       2.6
## 2: South Western Sydney                       2.6
## 3: South Western Sydney                       2.5
## 4: South Western Sydney                       2.5
## 5: South Western Sydney                       2.5
## 6: South Western Sydney                       2.4
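
The cleaning steps above can be sketched on a small toy table. This is only an illustration in base R: the raw column names are assumed from the variable names listed above, and the four rows are made-up values, not rows from Assignment 4A.csv.

```r
# Toy stand-in for the raw file; column names are assumed and values are hypothetical.
raw <- data.frame(
  "Local Hospital Network (LHN)" = c("South Western Sydney", "South Western Sydney",
                                     "South Western Sydney", "Other Network"),
  "Average length of stay (days)" = c("2.6", "NP", "-", "3.1"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Apply the same substitutions to the column names as in the report.
clean_names <- function(x) {
  x <- gsub(" ", "", x, fixed = TRUE)           # drop white space
  x <- gsub("LHN", "", x, fixed = TRUE)         # drop the letters inside the brackets
  x <- gsub("days", "InDays", x, fixed = TRUE)  # rename the unit
  gsub("[^A-Za-z]", "", x)                      # keep letters only
}
names(raw) <- clean_names(names(raw))

# Keep one network, drop the placeholder codes, then convert to numeric.
keep <- raw$LocalHospitalNetwork == "South Western Sydney" &
  !(raw$AveragelengthofstayInDays %in% c("NP", "-"))
alos <- raw[keep, , drop = FALSE]
alos$AveragelengthofstayInDays <- as.numeric(alos$AveragelengthofstayInDays)
print(alos)
```

Filtering out NP and - before calling as.numeric() avoids the coercion warnings and the NA values that the conversion would otherwise introduce.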

5. Descriptive Statistics and Visualisation (Problem Statement 1)

The median (3.8) is smaller than the mean (4.64), which suggests that the ALOS for South Western Sydney hospital is right-skewed.
There are no missing values in either variable.
The first quartile is the value below which 25% of the data lie.
The third quartile is the value below which 75% of the data lie.
The relatively low standard deviation and IQR of ALOS suggest a narrow, sharply peaked distribution.
It can be seen from the box plot that there are some outliers in the data, but these outliers cannot be removed because they are not due to wrong data entry.
Data in the medical field are critical and should not be removed based on conventional practices alone.
Removing the outliers would also shift the mean considerably, which could give misleading results when the hypothesis test is performed.
In addition, the values above the median are more spread out than those below it (Q3 - median = 1.7 versus median - Q1 = 0.9), so the high values are consistent with a long right tail rather than with data-entry mistakes.

knitr::kable(ALOSDfInterest %>% summarise(Min = min(AveragelengthofstayInDays,na.rm = TRUE), Max = max(AveragelengthofstayInDays, na.rm = TRUE), n = n(), Missing = sum(is.na(AveragelengthofstayInDays)), Q1 = quantile(AveragelengthofstayInDays ,probs = .25,na.rm = TRUE), Median = median(AveragelengthofstayInDays, na.rm = TRUE), Q3 = quantile(AveragelengthofstayInDays, probs = .75,na.rm = TRUE), Mean = mean(AveragelengthofstayInDays, na.rm = TRUE), SD = sd(AveragelengthofstayInDays, na.rm = TRUE), IQR = IQR(AveragelengthofstayInDays ,na.rm = TRUE)) , "html", caption = "Table 1: Descriptive Statistics", align = "llllllllll", col.names = c("Minimum", "Maximum", "Sample Size", "Missing Count","First Quartile", "Median", "Third Quartile", "Mean", "Standard Deviation", "IQR"), digits = 2) %>% kable_styling(latex_options = "HOLD_position") %>% column_spec(1, bold = TRUE) %>% column_spec(c(2,4,6,8,10), color = 'white', background = 'black')

Table 1: Descriptive Statistics
Minimum	Maximum	Sample Size	Missing Count	First Quartile	Median	Third Quartile	Mean	Standard Deviation	IQR
1.3	11.9	371	0	2.9	3.8	5.5	4.64	2.43	2.6

ggplot(ALOSDfInterest, aes(x=LocalHospitalNetwork, y=AveragelengthofstayInDays)) + geom_boxplot(outlier.colour="black", outlier.shape=1, outlier.size=1.5 ,fill='#4271AE', color="#1F3552") + theme_economist() + theme(plot.title = element_text(family="Tahoma", hjust = 0.5), text = element_text(family="Tahoma"), axis.title = element_text(size = 12)) + scale_x_discrete(name = "\n")+ ggtitle("Boxplot for ALOS for South Western Sydney hospital\n") + scale_y_continuous(name = 'Average Length of Stay (In Days)\n')
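
As a rough check on which points the box plot marks as outliers, the usual 1.5 x IQR fences can be computed from the rounded values in Table 1. This is a sketch using the table's summary figures, not the raw data, so the fences are approximate.

```r
# Rounded summary values copied from Table 1.
q1 <- 2.9
q3 <- 5.5
min_alos <- 1.3
max_alos <- 11.9

# Tukey fences: points beyond these are drawn as outliers by a default box plot.
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
print(c(lower = lower_fence, upper = upper_fence))

# Compare the extremes of the data with the fences.
has_upper_outliers <- max_alos > upper_fence
has_lower_outliers <- min_alos < lower_fence
print(c(upper = has_upper_outliers, lower = has_lower_outliers))
```

With these values only the upper fence is exceeded, which matches a right-skewed distribution with a long upper tail.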

The histogram confirms that the data are right-skewed.
A slight bimodal pattern is also visible in the histogram.
The sample mean (labelled μ in the plot) is 4.64 and the median is 3.80.
A normal quantile-quantile (Q-Q) plot compares the sample distribution of the data with a theoretical distribution.
Here it is used to compare our sample data with the theoretical normal distribution.
The Q-Q plot shows that the data do not fully follow the normal distribution.

ggplot(ALOSDfInterest, aes(AveragelengthofstayInDays)) + geom_histogram(fill = "#4271AE", color = "#1F3552", binwidth = 0.3, position="identity") + geom_vline(data=ALOSDfInterest, aes(xintercept=mean(ALOSDfInterest$AveragelengthofstayInDays)), colour="red", linetype = "dashed", size = 0.8) + geom_vline(data=ALOSDfInterest, aes(xintercept=median(ALOSDfInterest$AveragelengthofstayInDays) ), colour="orange", linetype = "dashed", size = 0.6) + ggtitle("Frequency histogram of South Western Sydney Hospital\n") + theme_economist() + theme(plot.title = element_text(family="Tahoma", hjust = 0.5), text = element_text(family="Tahoma"), axis.title = element_text(size = 12)) + scale_x_continuous(name = "\nAverage Length of Stay (In Days)") + geom_text(aes(x=5.4, y=33.5, label= 'μ = 4.64', group=NULL), data=ALOSDfInterest[1,], size = 3) + geom_text(aes(x=3, y=36.5, label= 'Median = 3.8', group=NULL), data=ALOSDfInterest[1,], size = 3) + scale_y_continuous(name = 'Frequency\n')

ggqqplot(ALOSDfInterest$AveragelengthofstayInDays, size = 0.5) + ggtitle('QQ Plot for South Western Sydney Hospital ALOS') + theme(plot.title = element_text(hjust = 0.5))

6. Hypothesis Testing (Problem Statement 1)

The data set is sufficiently large, i.e. \(n = 371 > 30\), so the central limit theorem (CLT) can be relied on: the sampling distribution of the mean is approximately normal even though the data themselves are skewed.
Our aim is to determine whether the mean Average Length Of Stay (ALOS) for South Western Sydney Hospital is equal to 4.5 days.
This suggests the use of the one-sample t-test.
For this test the desired confidence level is 95%, i.e. a significance level of 5% (\(\alpha = 0.05\)).
Based on these assumptions, the hypotheses are:
1. Null hypothesis (\(H_0\)): the mean equals 4.5.
2. Alternative hypothesis (\(H_A\)): the mean is not equal to 4.5.
\[H_0 : \mu = 4.5 \\ H_A : \mu \neq 4.5\]
The one-sample t-test gave \(t = 1.1346\) and \(p = 0.2573\) with \(df = 370\) degrees of freedom.
Because the p-value is greater than the significance level, i.e. \(p > \alpha\) (\(0.2573 > 0.05\)), we fail to reject the null hypothesis (\(H_0\)).
The 95% CI, \([4.394867, 4.891926]\), also contains the hypothesised mean of 4.5, which again means we fail to reject \(H_0\).
The result is therefore not statistically significant.
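
The reliance on the CLT can be illustrated with a small simulation. This is a sketch under stated assumptions: an exponential distribution stands in for a right-skewed population, and the seed and number of repetitions are arbitrary choices, none of which come from the hospital data.

```r
set.seed(1324)   # arbitrary seed, used only to make the sketch reproducible
n    <- 371      # same sample size as the ALOS data
reps <- 2000     # number of simulated samples (arbitrary)

# Draw many right-skewed samples and keep the mean of each one.
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

# The sample means cluster around the population mean (1 for rate = 1),
# with a spread close to sigma / sqrt(n) as the CLT predicts.
print(mean(sample_means))
print(sd(sample_means))
print(1 / sqrt(n))
```

Plotting hist(sample_means) for such a simulation typically shows a roughly symmetric, bell-shaped histogram even though each individual sample is strongly skewed.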

t.test(ALOSDfInterest$AveragelengthofstayInDays, mu = 4.5, alternative="two.sided", conf.level = 0.95)

## 
##  One Sample t-test
## 
## data:  ALOSDfInterest$AveragelengthofstayInDays
## t = 1.1346, df = 370, p-value = 0.2573
## alternative hypothesis: true mean is not equal to 4.5
## 95 percent confidence interval:
##  4.394867 4.891926
## sample estimates:
## mean of x 
##  4.643396
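
The reported test can be approximately reproduced by hand from the summary statistics, using t = (mean - mu0) / (s / sqrt(n)). The sketch below takes the standard deviation from Table 1, where it is rounded to 2.43, so the results agree with the t.test() output only to about two decimal places.

```r
n    <- 371        # sample size
xbar <- 4.643396   # sample mean from the t.test() output
s    <- 2.43       # sample standard deviation, rounded in Table 1
mu0  <- 4.5        # hypothesised mean

se      <- s / sqrt(n)                            # standard error of the mean
t_stat  <- (xbar - mu0) / se                      # one-sample t statistic
p_value <- 2 * pt(-abs(t_stat), df = n - 1)       # two-sided p-value
ci      <- xbar + c(-1, 1) * qt(0.975, df = n - 1) * se  # 95% confidence interval

print(c(t = t_stat, p = p_value))
print(ci)
```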

Part B

7. Data (Problem Statement 2)

For problem statement 2 the data file Assignment 4b-3.csv is used; it is read with the fread() function from the data.table library.
The initial data have 7 variables and 1290 observations.
Our variables of interest are:
1. Score before tutorial
2. Score after tutorial
To ease processing, gsub() was used to remove white space from the column names.
There were no missing values or placeholder symbols in the variables of interest.
Both variables are already numeric, so no conversion is needed.

TutorialDf <- fread("Assignment 4b-3.csv")
names(TutorialDf) <- gsub(" ", "", names(TutorialDf)) #Removing white spaces in column name
TutorialDfInterest <- TutorialDf[, c('Scorebeforetutorial', 'Scoreaftertutorial')]
head(TutorialDfInterest)

##    Scorebeforetutorial Scoreaftertutorial
## 1:                  50                 42
## 2:                  13                 38
## 3:                  27                 43
## 4:                  44                 37
## 5:                  35                 35
## 6:                  55                 41

8. Descriptive Statistics and Visualisation (Problem Statement 2)

The median score before the tutorial is greater than the mean, which suggests left-skewed data; for the score after the tutorial the median is slightly smaller than the mean, which suggests slightly right-skewed data; and for the difference in scores the median is smaller than the mean, which suggests right-skewed data.
There are no missing values in either variable.
The first quartile is the value below which 25% of the data lie.
The third quartile is the value below which 75% of the data lie.
The high standard deviation and IQR of the before-tutorial score suggest a flatter, wider distribution, whereas the low standard deviation and IQR of the after-tutorial score suggest a narrower, more peaked distribution.
The box plot of the difference in scores shows no visible outliers, a high standard deviation and IQR, and more spread above the median (0) than below it.
It can be seen from the box plot that there are some outliers in the after-tutorial scores, but these outliers cannot be removed because they are not due to wrong data entry.
Because the variance fell and scores increased on average after the tutorial, it is reasonable to assume that the outliers are actual values and not data-entry mistakes.

TutorialDfInterest <- TutorialDfInterest %>% mutate(difference = Scoreaftertutorial - Scorebeforetutorial)
GatherDf <- TutorialDfInterest %>% gather(Scorebeforetutorial, Scoreaftertutorial, difference, key = 'Parameter', value = 'value')
knitr::kable(GatherDf %>% group_by(Parameter) %>% summarise(Min = min(value,na.rm = TRUE),Max = max(value, na.rm = TRUE), n = n(), Missing = sum(is.na(value)), Q1 = quantile(value ,probs = .25,na.rm = TRUE), Median = median(value, na.rm = TRUE), Q3 = quantile(value, probs = .75,na.rm = TRUE), Mean = mean(value, na.rm = TRUE), SD = sd(value, na.rm = TRUE), IQR = IQR(value ,na.rm = TRUE)), "html", caption = "Table 2: Descriptive Statistics", align = "llllllllll", col.names = c("Score Type", "Minimum", "Maximum", "Sample Size", "Missing Count","First Quartile", "Median", "Third Quartile", "Mean", "Standard Deviation", "IQR"), digits = 2) %>% kable_styling(latex_options = "HOLD_position") %>% column_spec(1, bold = TRUE) %>% column_spec(c(2,4,6,8,10), color = 'white', background = 'black')

Table 2: Descriptive Statistics
Score Type	Minimum	Maximum	Sample Size	First Quartile	Median	Third Quartile	Mean	Standard Deviation	IQR
difference	-14	31	1290	0	0	14	5.35	10.04	14
Scoreaftertutorial	33	55	1290	37	41	44	41.17	4.95	7
Scorebeforetutorial	13	55	1290	27	37	44	35.82	10.54	17
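
The skewness reading described above can be made explicit by comparing the mean and median of each score. This sketch uses the rounded values from the descriptive statistics table; the mean-minus-median gap is only a rough indicator of skew direction, not a formal skewness measure.

```r
# Mean and median for each score, copied from the descriptive statistics table.
stats <- data.frame(
  score  = c("difference", "Scoreaftertutorial", "Scorebeforetutorial"),
  mean   = c(5.35, 41.17, 35.82),
  median = c(0, 41, 37)
)

# Mean above median hints at a right skew, mean below median at a left skew.
stats$gap <- stats$mean - stats$median
stats$skew_hint <- ifelse(stats$gap > 0, "right",
                          ifelse(stats$gap < 0, "left", "none"))
print(stats)
```

The gap for the after-tutorial score is small (0.17), which fits the description of that distribution as only slightly right-skewed.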

ggplot(GatherDf, aes(x=Parameter, y=value)) + geom_boxplot(outlier.colour="black", outlier.shape=1, outlier.size=1.5 ,fill='#4271AE', color="#1F3552") + theme_economist() + theme(plot.title = element_text(family="Tahoma", hjust = 0.5), text = element_text(family="Tahoma"), axis.title = element_text(size = 12)) + scale_x_discrete(name = "\nScore type")+ ggtitle("Boxplot for Before and After tutorial score\n") + scale_y_continuous(name = 'Score of students\n')

The skewness assumptions made above can be verified from the plot for each column.
Each distribution also appears multi-modal, which may indicate some bias in the data.
For some students the score decreased after the tutorial, and for some there was no change.
The histogram of the score differences shows that no change is the single most common outcome (the median difference is 0), while the high variance gives the distribution a lower peak and a wider spread.

hist(TutorialDfInterest$Scorebeforetutorial, breaks = 20, probability = TRUE, xlab = 'Test Score', ylab = 'Frequency', main = "Histogram for Score Before Tutorial"); abline(v = mean(TutorialDfInterest$Scorebeforetutorial), col="red", lwd=2, lty=2); abline(v = median(TutorialDfInterest$Scorebeforetutorial), col="orange", lwd=2, lty=2); text(x=31, y=0.0505, labels= 'μ = 35.82', cex = 0.72); text(x=40, y=0.055, labels= 'Median = 37', cex = 0.73); lines(density(TutorialDfInterest$Scorebeforetutorial), col = 'Blue', lwd=2); curve(dnorm(x, mean=mean(TutorialDfInterest$Scorebeforetutorial), sd=sd(TutorialDfInterest$Scorebeforetutorial)), yaxt="n", lty="dotted", col="darkgreen", lwd=4, add=TRUE); legend("topright", legend = c("Density Curve for Score Before Tutorial", "Normal Curve", 'Mean', 'Median'), bty = "n", text.col = "black", horiz = F, pch=c(15,15, 15, 15), col = c('Blue', "darkgreen", 'red', 'orange'), cex = 0.60)

hist(TutorialDfInterest$Scoreaftertutorial, breaks = 20, probability = TRUE, xlab = 'Test Score', ylab = 'Frequency', main = "Histogram for Score After Tutorial"); abline(v = mean(TutorialDfInterest$Scoreaftertutorial), col="red", lwd=2, lty=2); abline(v = median(TutorialDfInterest$Scoreaftertutorial), col="orange", lwd=2, lty=2); text(x=42.5, y=0.12, labels= 'μ = 41.17', cex = 0.72); text(x=39, y=0.09, labels= 'Median = 41', cex = 0.73); lines(density(TutorialDfInterest$Scoreaftertutorial), col = 'Blue', lwd=2); curve(dnorm(x, mean=mean(TutorialDfInterest$Scoreaftertutorial), sd=sd(TutorialDfInterest$Scoreaftertutorial)), yaxt="n", lty="dotted", col="darkgreen", lwd=4, add=TRUE); legend("topright", legend = c("Density Curve for Score After Tutorial", "Normal Curve", 'Mean', 'Median'), bty = "n", text.col = "black", horiz = F, pch=c(15,15, 15, 15), col = c('Blue', "darkgreen", 'red', 'orange'), cex = 0.60)

hist(TutorialDfInterest$difference, breaks = 20, probability = TRUE, xlab = 'Score Difference', ylab = 'Frequency', main = "Histogram for Score Difference"); abline(v = mean(TutorialDfInterest$difference), col="red", lwd=2, lty=2); abline(v = median(TutorialDfInterest$difference), col="orange", lwd=2, lty=2); text(x=8, y=0.12, labels= 'μ = 5.35', cex = 0.72); text(x=-5, y=0.15, labels= 'Median = 0', cex = 0.73); lines(density(TutorialDfInterest$difference), col = 'Blue', lwd=2); curve(dnorm(x, mean=mean(TutorialDfInterest$difference), sd=sd(TutorialDfInterest$difference)), yaxt="n", lty="dotted", col="darkgreen", lwd=4, add=TRUE); legend("topright", legend = c("Density Curve for Score Difference", "Normal Curve", 'Mean', 'Median'), bty = "n", text.col = "black", horiz = F, pch=c(15,15, 15, 15), col = c('Blue', "darkgreen", 'red', 'orange'), cex = 0.60)

A normal quantile-quantile (Q-Q) plot compares the sample distribution of the data with a theoretical distribution.
Here it is used to compare our sample data with the theoretical normal distribution.
Neither the scores before the tutorial nor the scores after it follow the normal distribution.
The difference between the before and after scores does not follow the normal distribution either.

ggqqplot(TutorialDfInterest$Scorebeforetutorial, size = 0.5) + ggtitle('QQ Plot for Score Before Tutorial') + theme(plot.title = element_text(hjust = 0.5))

ggqqplot(TutorialDfInterest$Scoreaftertutorial, size = 0.5) +  ggtitle('QQ Plot for Score After Tutorial') + theme(plot.title = element_text(hjust = 0.5))

ggqqplot(TutorialDfInterest$difference, size = 0.5) +  ggtitle('QQ Plot for Difference in Scores') + theme(plot.title = element_text(hjust = 0.5))

9. Hypothesis Testing (Problem Statement 2)

The data set is sufficiently large, i.e. \(n = 1290 > 30\), so the central limit theorem (CLT) can be relied on even though the scores and their differences are not normally distributed.
Our aim is to determine whether tutorials are effective in improving students' performance.
Because the question asks about improvement, a one-tailed test on the positive side is needed, i.e. the alternative hypothesis is that the mean difference (after minus before) is greater than 0.
This suggests the use of the paired-sample t-test, as the data were collected from the same students before and after the tutorial.
For this test the desired confidence level is 95%, i.e. a significance level of 5% (\(\alpha = 0.05\)).
Based on these assumptions, the hypotheses are:
1. Null hypothesis (\(H_0\)): the mean of the differences is equal to 0.
2. Alternative hypothesis (\(H_A\)): the mean of the differences is greater than 0.
\[H_0 : \mu_\Delta = 0 \\ H_A : \mu_\Delta > 0\]
The paired-sample t-test gave \(t = 19.144\) and \(p < 2.2 \times 10^{-16}\) with \(df = 1289\) degrees of freedom.
Because the p-value is smaller than the significance level, i.e. \(p < \alpha\) (\(p < 0.05\)), we reject the null hypothesis (\(H_0\)).
The one-sided 95% CI, \([4.890355, \infty)\), does not contain the null value of 0, which again means we reject \(H_0\).
The result is therefore statistically significant.
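
A paired t-test is the same as a one-sample t-test on the per-student differences. The sketch below illustrates this using only the six rows shown by head(TutorialDfInterest) above, so its numbers are not the results for the full data set.

```r
# The six rows shown in the head() output above.
before <- c(50, 13, 27, 44, 35, 55)
after  <- c(42, 38, 43, 37, 35, 41)

# Paired test on the two columns ...
paired <- t.test(after, before, paired = TRUE, alternative = "greater")

# ... and a one-sample test on the differences (after minus before).
one_sample <- t.test(after - before, mu = 0, alternative = "greater")

# Both approaches give the same statistic, degrees of freedom and p-value.
print(c(paired = unname(paired$statistic), one_sample = unname(one_sample$statistic)))
print(c(paired = paired$p.value, one_sample = one_sample$p.value))
```

This equivalence is why the normality check above is applied to the difference column: the test depends only on the differences.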

t.test(TutorialDfInterest$Scoreaftertutorial, TutorialDfInterest$Scorebeforetutorial,
       paired = TRUE,
       alternative = "greater")

## 
##  Paired t-test
## 
## data:  TutorialDfInterest$Scoreaftertutorial and TutorialDfInterest$Scorebeforetutorial
## t = 19.144, df = 1289, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  4.890355      Inf
## sample estimates:
## mean of the differences 
##                5.350388
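
The output above can be approximately reproduced by hand from the summary statistics of the differences. This sketch uses the standard deviation of the differences as rounded in the descriptive statistics table (10.04), so small discrepancies in the later decimals are expected.

```r
n         <- 1290       # number of students
mean_diff <- 5.350388   # mean of the differences from the t.test() output
sd_diff   <- 10.04      # standard deviation of the differences (rounded)

se      <- sd_diff / sqrt(n)                            # standard error
t_stat  <- mean_diff / se                               # t statistic for H0: mean difference = 0
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)   # one-sided (greater) p-value
lower   <- mean_diff - qt(0.95, df = n - 1) * se        # lower bound of the one-sided 95% CI

print(c(t = t_stat, lower_bound = lower))
print(p_value < 2.2e-16)
```

The one-sided interval has no upper limit, which is why t.test() reports Inf as its upper bound.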

A line plot, in which each line represents one pair of scores, is used to show the change in each student's score before and after watching the tutorial.
The X-axis represents the two categories of score, i.e.
1. Score before watching the tutorial, denoted by Before
2. Score after watching the tutorial, denoted by After
The Y-axis represents the student's score.
Each student's before and after scores are drawn as dots and joined by a line.
The plot shows that the variance of the students' scores is greatly reduced after the tutorial.
It also suggests that students with low scores benefited more from watching the tutorial than students with high scores.
For students with a high score, watching the tutorial appears to cause no major change or can even lead to a lower score, whereas it appears to help students with a low score considerably.

matplot(t(data.frame(TutorialDfInterest$Scorebeforetutorial, TutorialDfInterest$Scoreaftertutorial)), type = "b", pch = 19, col = 1, lty = 1, xlab = "Tutorial Type", ylab = "Student Score Number", xaxt = "n") 
axis(1, at = 1:2, labels = c("Before", "After"))
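
The reduction in spread visible in the plot can be quantified with the standard deviations reported in the descriptive statistics table. This is a rough sketch from rounded summary values, not a formal test of equal variances.

```r
# Standard deviations copied from the descriptive statistics table.
sd_before <- 10.54
sd_after  <- 4.95

# Ratio of the after-tutorial variance to the before-tutorial variance.
var_ratio <- sd_after^2 / sd_before^2
print(var_ratio)
```

A ratio well below 1 means the after-tutorial scores are much more tightly grouped than the before-tutorial scores.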

The following dependent-sample assessment plot gives a better understanding of the data.
It shows that the before and after scores are related and that there is a visible improvement in the students' scores after the tutorial.

granova.ds( data.frame(TutorialDfInterest$Scoreaftertutorial, TutorialDfInterest$Scorebeforetutorial), xlab = "Student's Score - After", ylab = "Student's Score - Before")

##             Summary Stats
## n                1290.000
## mean(x)            41.171
## mean(y)            35.821
## mean(D=x-y)         5.350
## SD(D)              10.038
## ES(D)               0.533
## r(x,y)              0.334
## r(x+y,d)           -0.660
## LL 95%CI            4.802
## UL 95%CI            5.899
## t(D-bar)           19.144
## df.t             1289.000
## pval.t              0.000
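
Two entries in the summary above can be checked by hand: the effect size ES(D), which is the mean difference divided by the standard deviation of the differences, and the confidence limits, which here are two-sided and therefore differ from the one-sided bound given by t.test(). The sketch below assumes this reading of the summary and uses its rounded SD(D).

```r
n         <- 1290
mean_diff <- 5.350388   # mean of the differences (t.test() output)
sd_diff   <- 10.038     # SD(D) from the summary above

# Standardised effect size of the mean difference.
effect_size <- mean_diff / sd_diff

# Two-sided 95% confidence interval for the mean difference.
se <- sd_diff / sqrt(n)
ci <- mean_diff + c(-1, 1) * qt(0.975, df = n - 1) * se

print(effect_size)
print(ci)
```

The two-sided lower limit (about 4.80) is smaller than the one-sided bound (about 4.89) because the two-sided interval leaves 2.5% rather than 5% in the lower tail.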

10. Discussion

Part A
1. After selecting the target variable and subsetting, we initially had 499 observations of ALOS; after dropping NP and (-) we were left with 371 observations, i.e. we retained about 74.35 percent of the data.
2. For this part we had only one target variable, which means we only had to deal with a single sample.
3. Based on the assumptions of the CLT we had a large enough data set, so a two-tailed one-sample t-test was performed.
4. The t-test gave a t-value of \(t = 1.1346\), \(df = 370\) degrees of freedom, a p-value of \(p = 0.2573\), a 95% confidence interval of \([4.394867, 4.891926]\), and a sample mean of 4.643396.
5. Based on these results, we fail to reject the null hypothesis (\(H_0\)) for this part.
Part B
1. After selecting the target variables and subsetting, we had 1290 observations for both the before and after tutorial scores, with no missing or null data, so we used 100% of the required data.
2. For this part we had two target variables, which means we had two samples.
3. The data were collected from the same students twice, which means the data are paired.
4. We had to find out whether there was any improvement, which calls for a one-sided (greater than) alternative hypothesis.
5. Based on the assumptions of the CLT we had a large enough data set, so a one-tailed paired-sample t-test was performed.
6. The t-test gave a t-value of \(t = 19.144\), \(df = 1289\) degrees of freedom, a p-value of \(p < 2.2 \times 10^{-16}\), a one-sided 95% confidence interval of \([4.890355, \infty)\), and a sample mean difference of 5.350388.
7. Based on these results, we reject the null hypothesis (\(H_0\)) that there is no improvement in the students' scores, and conclude that the result is statistically significant.

11. References

[1] “Admitted patients”, Australian Institute of Health and Welfare 2020. [Online]. Available: https://www.aihw.gov.au/reports-data/myhospitals/sectors/admitted-patients. [Accessed: 10-May-2020].

MATH1324 Assignment 4

One-sample and paired-sample t-tests