Masters in Public Health Programs Results

Introduction

This is a small file that accesses and generates some statistics from the results of Masters Programs related to Public Health held at 15th June 2017. The result was declared onthe 19th June 2017 and is available on the website at this link.

We will go through the analysis in the folowing order:

We will obtain and clean the data
We will generate summary statistics
We will generate graphical summaries

Geting and Cleaning Data

Getting Data

The original data is available on the UHS Website. As the original data is a scanned document, I had to use an external OCR service, ABBYY FineReader Online to read the text. The file was then stripped of the headers and footers of every page, leaving only the tabulated data. This tabulated data is available at this link and the actual dataframe used to do everything in this file is available at the bottom of this page.

Cleaning Data

The dataset thus obtained looks something like this:

if (file.exists("resultData.RData")) {
  load("resultData.RData")
} else {
  df <- read_tsv("https://gist.githubusercontent.com/AliSajid/4cbe5a684a5b922224d119bdff1cced3/raw/a0a1697f48335272038c490a5d559493dfd1b07d/data.txt", na = c("ABSENT", ""))
  }
head(df) %>%
  knitr::kable(caption = "Sneak peak at the loaded data")

Sneak peak at the loaded data
Sr. #	Roll Number	Name of Candidate	Obtained Marks (out of 100)
1	6601	Bushra Liaqat D/O Malik Liaqat Hayat	53
2	6602	Hurara Khalid D/O Khalid Nawaz	56
3	6603	Shamaila Mahmood D/O Parveze Mahmood	34
4	6604	Zahida Yasmeen D/O Bashir Muhammad	58
5	6605	Talal Farrukh S/O Farrukh Suleman	63
6	6606	Abeer Qazi D/O Tariq Ismail Qazi	57

As we can see, we get 4 variables with some long names. We also have one variable which is actually two different variables. Furthermore, based on the given data we can also extract the gender of the student appearing in the exam. For this, we will use dplyr.

clean.df <- df %>%
            rename(`Serial #` = `Sr. #`,
                   `Roll #` = `Roll Number`,
                   Name = `Name of Candidate`,
                   `Marks Obtained` = `Obtained Marks (out of 100)`) %>%
            mutate(Gender = as.factor(ifelse(str_detect(Name, "S/O"), "Male", "Female")),
                   Name = str_replace(Name, "D/0", "D/O"),
                   Ranking = rank(-`Marks Obtained`, ties.method = "min"),
                   Attendance = as.factor(ifelse(is.na(`Marks Obtained`), "Absent", "Present"))) %>%
            separate(Name, c('Name', "Father's Name"), sep = "(S|D)/O") %>%
            group_by(Gender) %>%
            mutate(`Rank Within Gender` = rank(Ranking, ties.method = "min")) %>%
            ungroup

This now gives us a nice nine variables. We can now proceed with the analysis. Of these eight, 5 were present in the dataset explicitly and four were inferred out of it.

Statistical Analysis

Descriptive Statistics

Here are the summary statistics obtained from the data:

clean.df %>%
  na.omit %>%
  summarise(`Mean Score` = round(mean(`Marks Obtained`), 3),
            `Standard Deviation` = round(sd(`Marks Obtained`), 3),
            `Median Marks` = median(`Marks Obtained`),
            `Mode Marks` = getmode(`Marks Obtained`)) %>%
  knitr::kable(caption = "Descriptive Statistics")

Descriptive Statistics
Mean Score	Standard Deviation	Median Marks	Mode Marks
54.811	9.856	55	56

We can see how removal of the two outliers affects the analysis.

clean.df %>%
  na.omit %>%
  filter(Ranking > 2) %>%
  summarise(`Mean Score` = round(mean(`Marks Obtained`), 3),
            `Standard Deviation` = round(sd(`Marks Obtained`), 3),
            `Median Marks` = median(`Marks Obtained`),
            `Mode Marks` = getmode(`Marks Obtained`)) %>%
  knitr::kable(caption = "Descriptive Statistics After Removing Outliers")

Descriptive Statistics After Removing Outliers
Mean Score	Standard Deviation	Median Marks	Mode Marks
54.416	9.419	55	56

Descriptive	Value
Number of Students Registered	144
Number of Students Appeared	127
Appearance Ratio	0.882
Number of Males Registered	69
Number of Males Appeared	54
Male Appearance Ratio	0.783
Number of Females Registered	75
Number of Females Appeared	73
Female Appearance Ratio	0.973

Top 5 Candidates

clean.df %>%
  select(-Gender, -`Rank Within Gender`, -Attendance) %>%
  arrange(Ranking, Name, `Father's Name`) %>%
  head(n=5) %>%
  knitr::kable(caption = "Top 5 Students in the Entry Test")

Top 5 Students in the Entry Test
Serial #	Roll #	Name	Father’s Name	Marks Obtained	Ranking
144	6744	Abid Wazir	Malik Wazir Muhammad	80	1
62	6662	Ali Sajid Imami	Sajid Hussain Imami	79	2
124	6724	Zahabia Khalid	Muhammad Khalid	74	3
25	6625	Muhammad Shoaib	Malik Ghulam Qadir	72	4
143	6743	Naheed Pirzada	Pirzada MA Qureshi	71	5

Top 5 Male Candidates

clean.df %>%
  filter(Gender == "Male") %>%
  select(-Gender, -Attendance) %>%
  arrange(`Rank Within Gender`, Ranking, Name, `Father's Name`) %>%
  head(n=5) %>%
  knitr::kable(caption = "Top 5 Male Students in the Entry Test")

Top 5 Male Students in the Entry Test
Serial #	Roll #	Name	Father’s Name	Marks Obtained	Ranking	Rank Within Gender
144	6744	Abid Wazir	Malik Wazir Muhammad	80	1	1
62	6662	Ali Sajid Imami	Sajid Hussain Imami	79	2	2
25	6625	Muhammad Shoaib	Malik Ghulam Qadir	72	4	3
139	6739	Danish Mohsin	Gulzar Ahmed	70	6	4
34	6634	Ubaid Azher	Muhammad Pervaiz Azher	70	6	4

Top 5 Female Candidates

clean.df %>%
  filter(Gender == "Female") %>%
  select(-Gender, -Attendance) %>%
  arrange(`Rank Within Gender`, Ranking, Name, `Father's Name`) %>%
  head(n=5) %>%
  knitr::kable(caption = "Top 5 Female Students in the Entry Test")

Top 5 Female Students in the Entry Test
Serial #	Roll #	Name	Father’s Name	Marks Obtained	Ranking	Rank Within Gender
124	6724	Zahabia Khalid	Muhammad Khalid	74	3	1
143	6743	Naheed Pirzada	Pirzada MA Qureshi	71	5	2
64	6664	Kiran Batool	Hassan Akhtar	70	6	3
56	6656	Maria Noor	Afzal Ali Khokhar	70	6	3
7	6607	Shamaila Hassnain	Muhammad Aslam	70	6	3

Genderwise Descriptive Statistics

clean.df %>%
  group_by(Gender) %>%
  summarise(`Average Score` = round(mean(`Marks Obtained`, na.rm = T),3),
         `Median Score` = median(`Marks Obtained`, na.rm=T),
         `Mode Score` = getmode(`Marks Obtained`),
         `Standard Deviation` = round(sd(`Marks Obtained`, na.rm = T), 3)) %>%
  knitr::kable(caption = "Differences in Descriptive Statistics by Gender")

Differences in Descriptive Statistics by Gender
Gender	Average Score	Median Score	Mode Score	Standard Deviation
Female	54.740	56.0	56	9.944
Male	54.907	54.5	49	9.827

Genderwise Descriptive Statistics Without Outliers

clean.df %>%
  group_by(Gender) %>%
  filter(Ranking > 2) %>%
  summarise(`Average Score` = round(mean(`Marks Obtained`, na.rm = T),3),
         `Median Score` = median(`Marks Obtained`, na.rm=T),
         `Mode Score` = getmode(`Marks Obtained`),
         `Standard Deviation` = round(sd(`Marks Obtained`, na.rm = T), 3)) %>%
  knitr::kable(caption = "Differences in Descriptive Statistics by Gender After Removing Outliers")

Differences in Descriptive Statistics by Gender After Removing Outliers
Gender	Average Score	Median Score	Mode Score	Standard Deviation
Female	54.740	56	56	9.944
Male	53.962	54	49	8.702

Inferential Statistics

Relationship between Attendance of Test and Gender

From a simple look at the table above, it looks like that more Males signed up for the test but didn’t appear for the test than the females. These are two categorical variables that we have extracted specifically. Let’s focus at the data.

xtabs( ~ Attendance + Gender, data = clean.df) %>%
  knitr::kable(caption = "Crosstabulation of Attendance and Gender") %>%
  kableExtra::kable_styling(full_width = F)

Crosstabulation of Attendance and Gender
	Female	Male
Absent	2	15
Present	73	54

It appears that there is a significant difference in attendance between genders. To test if it is statistically significant, we can use the \(\chi^2\) test for goodness of fit.

chisq.result <- chisq.test(xtabs(~ Attendance + Gender, data = clean.df))
print(chisq.result)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  xtabs(~Attendance + Gender, data = clean.df)
## X-squared = 10.791, df = 1, p-value = 0.00102

The \(\chi^2\) statistic is 10.791 with a \(p-value\) of 0.00102. Based on these two values, we can say that in our data there was a singificant relationship between the two variable. We can not elaborate on what kind of relationship however.

Relationship between Marks Obtained and Gender

It appears that the difference between the mean scores of Males and Females is nearly equal. This means that the probability of there being a statistically significant difference is low. However, we will test it with whole data and again with the two top-most outliers removed.

We will state our null and alternative hypotheses first:

\(H_0\): There is no difference in means of marks obtained between genders.
\(H_A\): There exists a difference in means of marks obtained between genders.

The \(H_A\) makes it clear that we are looking for a two-tailed test.

tt <- t.test(`Marks Obtained` ~ Gender, data=clean.df, conf.level = 0.90)
tt

## 
##  Welch Two Sample t-test
## 
## data:  Marks Obtained by Gender
## t = -0.094581, df = 115.09, p-value = 0.9248
## alternative hypothesis: true difference in means is not equal to 0
## 90 percent confidence interval:
##  -3.107472  2.772109
## sample estimates:
## mean in group Female   mean in group Male 
##             54.73973             54.90741

The \(t-statistic\) is a mere -0.0945814 and the \(p-value\) is 0.9248118 with the alternative hypothesis that the true difference of means is equal to \(0\). The lower and upper bounds on the 90 percent confidence interval are (-3.1074716, 2.7721089). With these results we can confidently reject the alternative hypothesis and beleive that the true difference in means in the two groups is equal to zero.

However, let’s remove the top two outliers and try it again with the same hypotheses and confidence levels.

tt.outliers <- clean.df %>%
  filter(Ranking > 2) %>%
  t.test(`Marks Obtained` ~ Gender, data= . , conf.level = 0.90)
tt.outliers

## 
##  Welch Two Sample t-test
## 
## data:  Marks Obtained by Gender
## t = 0.46415, df = 117.81, p-value = 0.6434
## alternative hypothesis: true difference in means is not equal to 0
## 90 percent confidence interval:
##  -2.001381  3.557756
## sample estimates:
## mean in group Female   mean in group Male 
##             54.73973             53.96154

The \(t-statistic\) is a again low, a paltry 0.4641548 and the \(p-value\) is 0.6433934 with the alternative hypothesis that the true difference of means is equal to \(0\). The lower and upper bounds on the 90 percent confidence interval are (-2.0013809, 3.557756). With these results we can confidently reject the alternative hypothesis and beleive that the true difference in means in the two groups is equal to zero, even when high scoring outliers are excluded.

Graphical Representation

Distribution of Attendance

plot <- ggplot(data = clean.df, aes(x = Attendance, fill=Attendance))
plot + geom_bar() + ggtitle("Distribution of Attendance overall") + xlab("Attendance") + ylab("Count") + theme(plot.title = element_text(hjust = 0.5))

Distribution of Attendance by Gender

plot + geom_bar() + facet_grid(. ~ Gender) + ggtitle("Distribution of Attendance overall") + xlab("Attendance") + ylab("Count") + theme(plot.title = element_text(hjust = 0.5))

Distribution of Marks

plot <- ggplot(data = clean.df, aes(x = "", y = `Marks Obtained`))
plot + geom_boxplot(aes(fill = "Green"), width = 0.5) + guides(fill = FALSE) + ggtitle("Distribution of Marks Obtained")+ xlab("Students") + ylab("Marks Obtained") + theme(plot.title = element_text(hjust = 0.5)) + stat_summary(fun.y=mean, geom="point", shape=23, size=4) + stat_boxplot(geom ='errorbar')

Distribution of Marks by Gender

plot <- ggplot(data = clean.df, aes(x = Gender, y = `Marks Obtained`))
plot + geom_boxplot(aes(fill = Gender)) + guides(fill = FALSE) + ggtitle("Distribution of Marks Obtained with Gender") + xlab("Gender") + ylab("Marks Obtained") + theme(plot.title = element_text(hjust = 0.5)) + stat_summary(fun.y=mean, geom="point", shape=23, size=4) + stat_boxplot(geom ='errorbar')

Dataset

The dataset used in the analysis above was processed from OCR using ABBYY® FineReader® Online and then hand edited before processing. A link to download the data is given below.