This is a small file that accesses and generates some statistics from the results of Masters Programs related to Public Health held at 15th June 2017. The result was declared onthe 19th June 2017 and is available on the website at this link.
We will go through the analysis in the folowing order:
The original data is available on the UHS Website. As the original data is a scanned document, I had to use an external OCR service, ABBYY FineReader Online to read the text. The file was then stripped of the headers and footers of every page, leaving only the tabulated data. This tabulated data is available at this link and the actual dataframe used to do everything in this file is available at the bottom of this page.
The dataset thus obtained looks something like this:
if (file.exists("resultData.RData")) {
load("resultData.RData")
} else {
df <- read_tsv("https://gist.githubusercontent.com/AliSajid/4cbe5a684a5b922224d119bdff1cced3/raw/a0a1697f48335272038c490a5d559493dfd1b07d/data.txt", na = c("ABSENT", ""))
}
head(df) %>%
knitr::kable(caption = "Sneak peak at the loaded data")
| Sr. # | Roll Number | Name of Candidate | Obtained Marks (out of 100) |
|---|---|---|---|
| 1 | 6601 | Bushra Liaqat D/O Malik Liaqat Hayat | 53 |
| 2 | 6602 | Hurara Khalid D/O Khalid Nawaz | 56 |
| 3 | 6603 | Shamaila Mahmood D/O Parveze Mahmood | 34 |
| 4 | 6604 | Zahida Yasmeen D/O Bashir Muhammad | 58 |
| 5 | 6605 | Talal Farrukh S/O Farrukh Suleman | 63 |
| 6 | 6606 | Abeer Qazi D/O Tariq Ismail Qazi | 57 |
As we can see, we get 4 variables with some long names. We also have one variable which is actually two different variables. Furthermore, based on the given data we can also extract the gender of the student appearing in the exam. For this, we will use dplyr.
clean.df <- df %>%
rename(`Serial #` = `Sr. #`,
`Roll #` = `Roll Number`,
Name = `Name of Candidate`,
`Marks Obtained` = `Obtained Marks (out of 100)`) %>%
mutate(Gender = as.factor(ifelse(str_detect(Name, "S/O"), "Male", "Female")),
Name = str_replace(Name, "D/0", "D/O"),
Ranking = rank(-`Marks Obtained`, ties.method = "min"),
Attendance = as.factor(ifelse(is.na(`Marks Obtained`), "Absent", "Present"))) %>%
separate(Name, c('Name', "Father's Name"), sep = "(S|D)/O") %>%
group_by(Gender) %>%
mutate(`Rank Within Gender` = rank(Ranking, ties.method = "min")) %>%
ungroup
This now gives us a nice nine variables. We can now proceed with the analysis. Of these eight, 5 were present in the dataset explicitly and four were inferred out of it.
Here are the summary statistics obtained from the data:
clean.df %>%
na.omit %>%
summarise(`Mean Score` = round(mean(`Marks Obtained`), 3),
`Standard Deviation` = round(sd(`Marks Obtained`), 3),
`Median Marks` = median(`Marks Obtained`),
`Mode Marks` = getmode(`Marks Obtained`)) %>%
knitr::kable(caption = "Descriptive Statistics")
| Mean Score | Standard Deviation | Median Marks | Mode Marks |
|---|---|---|---|
| 54.811 | 9.856 | 55 | 56 |
We can see how removal of the two outliers affects the analysis.
clean.df %>%
na.omit %>%
filter(Ranking > 2) %>%
summarise(`Mean Score` = round(mean(`Marks Obtained`), 3),
`Standard Deviation` = round(sd(`Marks Obtained`), 3),
`Median Marks` = median(`Marks Obtained`),
`Mode Marks` = getmode(`Marks Obtained`)) %>%
knitr::kable(caption = "Descriptive Statistics After Removing Outliers")
| Mean Score | Standard Deviation | Median Marks | Mode Marks |
|---|---|---|---|
| 54.416 | 9.419 | 55 | 56 |
| Descriptive | Value |
|---|---|
| Number of Students Registered | 144 |
| Number of Students Appeared | 127 |
| Appearance Ratio | 0.882 |
| Number of Males Registered | 69 |
| Number of Males Appeared | 54 |
| Male Appearance Ratio | 0.783 |
| Number of Females Registered | 75 |
| Number of Females Appeared | 73 |
| Female Appearance Ratio | 0.973 |
clean.df %>%
select(-Gender, -`Rank Within Gender`, -Attendance) %>%
arrange(Ranking, Name, `Father's Name`) %>%
head(n=5) %>%
knitr::kable(caption = "Top 5 Students in the Entry Test")
| Serial # | Roll # | Name | Father’s Name | Marks Obtained | Ranking |
|---|---|---|---|---|---|
| 144 | 6744 | Abid Wazir | Malik Wazir Muhammad | 80 | 1 |
| 62 | 6662 | Ali Sajid Imami | Sajid Hussain Imami | 79 | 2 |
| 124 | 6724 | Zahabia Khalid | Muhammad Khalid | 74 | 3 |
| 25 | 6625 | Muhammad Shoaib | Malik Ghulam Qadir | 72 | 4 |
| 143 | 6743 | Naheed Pirzada | Pirzada MA Qureshi | 71 | 5 |
clean.df %>%
filter(Gender == "Male") %>%
select(-Gender, -Attendance) %>%
arrange(`Rank Within Gender`, Ranking, Name, `Father's Name`) %>%
head(n=5) %>%
knitr::kable(caption = "Top 5 Male Students in the Entry Test")
| Serial # | Roll # | Name | Father’s Name | Marks Obtained | Ranking | Rank Within Gender |
|---|---|---|---|---|---|---|
| 144 | 6744 | Abid Wazir | Malik Wazir Muhammad | 80 | 1 | 1 |
| 62 | 6662 | Ali Sajid Imami | Sajid Hussain Imami | 79 | 2 | 2 |
| 25 | 6625 | Muhammad Shoaib | Malik Ghulam Qadir | 72 | 4 | 3 |
| 139 | 6739 | Danish Mohsin | Gulzar Ahmed | 70 | 6 | 4 |
| 34 | 6634 | Ubaid Azher | Muhammad Pervaiz Azher | 70 | 6 | 4 |
clean.df %>%
filter(Gender == "Female") %>%
select(-Gender, -Attendance) %>%
arrange(`Rank Within Gender`, Ranking, Name, `Father's Name`) %>%
head(n=5) %>%
knitr::kable(caption = "Top 5 Female Students in the Entry Test")
| Serial # | Roll # | Name | Father’s Name | Marks Obtained | Ranking | Rank Within Gender |
|---|---|---|---|---|---|---|
| 124 | 6724 | Zahabia Khalid | Muhammad Khalid | 74 | 3 | 1 |
| 143 | 6743 | Naheed Pirzada | Pirzada MA Qureshi | 71 | 5 | 2 |
| 64 | 6664 | Kiran Batool | Hassan Akhtar | 70 | 6 | 3 |
| 56 | 6656 | Maria Noor | Afzal Ali Khokhar | 70 | 6 | 3 |
| 7 | 6607 | Shamaila Hassnain | Muhammad Aslam | 70 | 6 | 3 |
clean.df %>%
group_by(Gender) %>%
summarise(`Average Score` = round(mean(`Marks Obtained`, na.rm = T),3),
`Median Score` = median(`Marks Obtained`, na.rm=T),
`Mode Score` = getmode(`Marks Obtained`),
`Standard Deviation` = round(sd(`Marks Obtained`, na.rm = T), 3)) %>%
knitr::kable(caption = "Differences in Descriptive Statistics by Gender")
| Gender | Average Score | Median Score | Mode Score | Standard Deviation |
|---|---|---|---|---|
| Female | 54.740 | 56.0 | 56 | 9.944 |
| Male | 54.907 | 54.5 | 49 | 9.827 |
clean.df %>%
group_by(Gender) %>%
filter(Ranking > 2) %>%
summarise(`Average Score` = round(mean(`Marks Obtained`, na.rm = T),3),
`Median Score` = median(`Marks Obtained`, na.rm=T),
`Mode Score` = getmode(`Marks Obtained`),
`Standard Deviation` = round(sd(`Marks Obtained`, na.rm = T), 3)) %>%
knitr::kable(caption = "Differences in Descriptive Statistics by Gender After Removing Outliers")
| Gender | Average Score | Median Score | Mode Score | Standard Deviation |
|---|---|---|---|---|
| Female | 54.740 | 56 | 56 | 9.944 |
| Male | 53.962 | 54 | 49 | 8.702 |
From a simple look at the table above, it looks like that more Males signed up for the test but didn’t appear for the test than the females. These are two categorical variables that we have extracted specifically. Let’s focus at the data.
xtabs( ~ Attendance + Gender, data = clean.df) %>%
knitr::kable(caption = "Crosstabulation of Attendance and Gender") %>%
kableExtra::kable_styling(full_width = F)
| Female | Male | |
|---|---|---|
| Absent | 2 | 15 |
| Present | 73 | 54 |
It appears that there is a significant difference in attendance between genders. To test if it is statistically significant, we can use the \(\chi^2\) test for goodness of fit.
chisq.result <- chisq.test(xtabs(~ Attendance + Gender, data = clean.df))
print(chisq.result)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: xtabs(~Attendance + Gender, data = clean.df)
## X-squared = 10.791, df = 1, p-value = 0.00102
The \(\chi^2\) statistic is 10.791 with a \(p-value\) of 0.00102. Based on these two values, we can say that in our data there was a singificant relationship between the two variable. We can not elaborate on what kind of relationship however.
It appears that the difference between the mean scores of Males and Females is nearly equal. This means that the probability of there being a statistically significant difference is low. However, we will test it with whole data and again with the two top-most outliers removed.
We will state our null and alternative hypotheses first:
The \(H_A\) makes it clear that we are looking for a two-tailed test.
tt <- t.test(`Marks Obtained` ~ Gender, data=clean.df, conf.level = 0.90)
tt
##
## Welch Two Sample t-test
##
## data: Marks Obtained by Gender
## t = -0.094581, df = 115.09, p-value = 0.9248
## alternative hypothesis: true difference in means is not equal to 0
## 90 percent confidence interval:
## -3.107472 2.772109
## sample estimates:
## mean in group Female mean in group Male
## 54.73973 54.90741
The \(t-statistic\) is a mere -0.0945814 and the \(p-value\) is 0.9248118 with the alternative hypothesis that the true difference of means is equal to \(0\). The lower and upper bounds on the 90 percent confidence interval are (-3.1074716, 2.7721089). With these results we can confidently reject the alternative hypothesis and beleive that the true difference in means in the two groups is equal to zero.
However, let’s remove the top two outliers and try it again with the same hypotheses and confidence levels.
tt.outliers <- clean.df %>%
filter(Ranking > 2) %>%
t.test(`Marks Obtained` ~ Gender, data= . , conf.level = 0.90)
tt.outliers
##
## Welch Two Sample t-test
##
## data: Marks Obtained by Gender
## t = 0.46415, df = 117.81, p-value = 0.6434
## alternative hypothesis: true difference in means is not equal to 0
## 90 percent confidence interval:
## -2.001381 3.557756
## sample estimates:
## mean in group Female mean in group Male
## 54.73973 53.96154
The \(t-statistic\) is a again low, a paltry 0.4641548 and the \(p-value\) is 0.6433934 with the alternative hypothesis that the true difference of means is equal to \(0\). The lower and upper bounds on the 90 percent confidence interval are (-2.0013809, 3.557756). With these results we can confidently reject the alternative hypothesis and beleive that the true difference in means in the two groups is equal to zero, even when high scoring outliers are excluded.
plot <- ggplot(data = clean.df, aes(x = Attendance, fill=Attendance))
plot + geom_bar() + ggtitle("Distribution of Attendance overall") + xlab("Attendance") + ylab("Count") + theme(plot.title = element_text(hjust = 0.5))
plot + geom_bar() + facet_grid(. ~ Gender) + ggtitle("Distribution of Attendance overall") + xlab("Attendance") + ylab("Count") + theme(plot.title = element_text(hjust = 0.5))
plot <- ggplot(data = clean.df, aes(x = "", y = `Marks Obtained`))
plot + geom_boxplot(aes(fill = "Green"), width = 0.5) + guides(fill = FALSE) + ggtitle("Distribution of Marks Obtained")+ xlab("Students") + ylab("Marks Obtained") + theme(plot.title = element_text(hjust = 0.5)) + stat_summary(fun.y=mean, geom="point", shape=23, size=4) + stat_boxplot(geom ='errorbar')
plot <- ggplot(data = clean.df, aes(x = Gender, y = `Marks Obtained`))
plot + geom_boxplot(aes(fill = Gender)) + guides(fill = FALSE) + ggtitle("Distribution of Marks Obtained with Gender") + xlab("Gender") + ylab("Marks Obtained") + theme(plot.title = element_text(hjust = 0.5)) + stat_summary(fun.y=mean, geom="point", shape=23, size=4) + stat_boxplot(geom ='errorbar')
The dataset used in the analysis above was processed from OCR using ABBYY® FineReader® Online and then hand edited before processing. A link to download the data is given below.