Introduction

This is a small file that accesses and generates some statistics from the results of Masters Programs related to Public Health held at 15th June 2017. The result was declared onthe 19th June 2017 and is available on the website at this link.

We will go through the analysis in the folowing order:

  1. We will obtain and clean the data
  2. We will generate summary statistics
  3. We will generate graphical summaries

Geting and Cleaning Data

Getting Data

The original data is available on the UHS Website. As the original data is a scanned document, I had to use an external OCR service, ABBYY FineReader Online to read the text. The file was then stripped of the headers and footers of every page, leaving only the tabulated data. This tabulated data is available at this link and the actual dataframe used to do everything in this file is available at the bottom of this page.

Cleaning Data

The dataset thus obtained looks something like this:

if (file.exists("resultData.RData")) {
  load("resultData.RData")
} else {
  df <- read_tsv("https://gist.githubusercontent.com/AliSajid/4cbe5a684a5b922224d119bdff1cced3/raw/a0a1697f48335272038c490a5d559493dfd1b07d/data.txt", na = c("ABSENT", ""))
  }
head(df) %>%
  knitr::kable(caption = "Sneak peak at the loaded data")
Sneak peak at the loaded data
Sr. # Roll Number Name of Candidate Obtained Marks (out of 100)
1 6601 Bushra Liaqat D/O Malik Liaqat Hayat 53
2 6602 Hurara Khalid D/O Khalid Nawaz 56
3 6603 Shamaila Mahmood D/O Parveze Mahmood 34
4 6604 Zahida Yasmeen D/O Bashir Muhammad 58
5 6605 Talal Farrukh S/O Farrukh Suleman 63
6 6606 Abeer Qazi D/O Tariq Ismail Qazi 57

As we can see, we get 4 variables with some long names. We also have one variable which is actually two different variables. Furthermore, based on the given data we can also extract the gender of the student appearing in the exam. For this, we will use dplyr.

clean.df <- df %>%
            rename(`Serial #` = `Sr. #`,
                   `Roll #` = `Roll Number`,
                   Name = `Name of Candidate`,
                   `Marks Obtained` = `Obtained Marks (out of 100)`) %>%
            mutate(Gender = as.factor(ifelse(str_detect(Name, "S/O"), "Male", "Female")),
                   Name = str_replace(Name, "D/0", "D/O"),
                   Ranking = rank(-`Marks Obtained`, ties.method = "min"),
                   Attendance = as.factor(ifelse(is.na(`Marks Obtained`), "Absent", "Present"))) %>%
            separate(Name, c('Name', "Father's Name"), sep = "(S|D)/O") %>%
            group_by(Gender) %>%
            mutate(`Rank Within Gender` = rank(Ranking, ties.method = "min")) %>%
            ungroup

This now gives us a nice nine variables. We can now proceed with the analysis. Of these eight, 5 were present in the dataset explicitly and four were inferred out of it.

Statistical Analysis

Descriptive Statistics

Here are the summary statistics obtained from the data:

clean.df %>%
  na.omit %>%
  summarise(`Mean Score` = round(mean(`Marks Obtained`), 3),
            `Standard Deviation` = round(sd(`Marks Obtained`), 3),
            `Median Marks` = median(`Marks Obtained`),
            `Mode Marks` = getmode(`Marks Obtained`)) %>%
  knitr::kable(caption = "Descriptive Statistics")
Descriptive Statistics
Mean Score Standard Deviation Median Marks Mode Marks
54.811 9.856 55 56

We can see how removal of the two outliers affects the analysis.

clean.df %>%
  na.omit %>%
  filter(Ranking > 2) %>%
  summarise(`Mean Score` = round(mean(`Marks Obtained`), 3),
            `Standard Deviation` = round(sd(`Marks Obtained`), 3),
            `Median Marks` = median(`Marks Obtained`),
            `Mode Marks` = getmode(`Marks Obtained`)) %>%
  knitr::kable(caption = "Descriptive Statistics After Removing Outliers")
Descriptive Statistics After Removing Outliers
Mean Score Standard Deviation Median Marks Mode Marks
54.416 9.419 55 56
Descriptive Value
Number of Students Registered 144
Number of Students Appeared 127
Appearance Ratio 0.882
Number of Males Registered 69
Number of Males Appeared 54
Male Appearance Ratio 0.783
Number of Females Registered 75
Number of Females Appeared 73
Female Appearance Ratio 0.973

Top 5 Candidates

clean.df %>%
  select(-Gender, -`Rank Within Gender`, -Attendance) %>%
  arrange(Ranking, Name, `Father's Name`) %>%
  head(n=5) %>%
  knitr::kable(caption = "Top 5 Students in the Entry Test")
Top 5 Students in the Entry Test
Serial # Roll # Name Father’s Name Marks Obtained Ranking
144 6744 Abid Wazir Malik Wazir Muhammad 80 1
62 6662 Ali Sajid Imami Sajid Hussain Imami 79 2
124 6724 Zahabia Khalid Muhammad Khalid 74 3
25 6625 Muhammad Shoaib Malik Ghulam Qadir 72 4
143 6743 Naheed Pirzada Pirzada MA Qureshi 71 5

Top 5 Male Candidates

clean.df %>%
  filter(Gender == "Male") %>%
  select(-Gender, -Attendance) %>%
  arrange(`Rank Within Gender`, Ranking, Name, `Father's Name`) %>%
  head(n=5) %>%
  knitr::kable(caption = "Top 5 Male Students in the Entry Test")
Top 5 Male Students in the Entry Test
Serial # Roll # Name Father’s Name Marks Obtained Ranking Rank Within Gender
144 6744 Abid Wazir Malik Wazir Muhammad 80 1 1
62 6662 Ali Sajid Imami Sajid Hussain Imami 79 2 2
25 6625 Muhammad Shoaib Malik Ghulam Qadir 72 4 3
139 6739 Danish Mohsin Gulzar Ahmed 70 6 4
34 6634 Ubaid Azher Muhammad Pervaiz Azher 70 6 4

Top 5 Female Candidates

clean.df %>%
  filter(Gender == "Female") %>%
  select(-Gender, -Attendance) %>%
  arrange(`Rank Within Gender`, Ranking, Name, `Father's Name`) %>%
  head(n=5) %>%
  knitr::kable(caption = "Top 5 Female Students in the Entry Test")
Top 5 Female Students in the Entry Test
Serial # Roll # Name Father’s Name Marks Obtained Ranking Rank Within Gender
124 6724 Zahabia Khalid Muhammad Khalid 74 3 1
143 6743 Naheed Pirzada Pirzada MA Qureshi 71 5 2
64 6664 Kiran Batool Hassan Akhtar 70 6 3
56 6656 Maria Noor Afzal Ali Khokhar 70 6 3
7 6607 Shamaila Hassnain Muhammad Aslam 70 6 3

Genderwise Descriptive Statistics

clean.df %>%
  group_by(Gender) %>%
  summarise(`Average Score` = round(mean(`Marks Obtained`, na.rm = T),3),
         `Median Score` = median(`Marks Obtained`, na.rm=T),
         `Mode Score` = getmode(`Marks Obtained`),
         `Standard Deviation` = round(sd(`Marks Obtained`, na.rm = T), 3)) %>%
  knitr::kable(caption = "Differences in Descriptive Statistics by Gender")
Differences in Descriptive Statistics by Gender
Gender Average Score Median Score Mode Score Standard Deviation
Female 54.740 56.0 56 9.944
Male 54.907 54.5 49 9.827

Genderwise Descriptive Statistics Without Outliers

clean.df %>%
  group_by(Gender) %>%
  filter(Ranking > 2) %>%
  summarise(`Average Score` = round(mean(`Marks Obtained`, na.rm = T),3),
         `Median Score` = median(`Marks Obtained`, na.rm=T),
         `Mode Score` = getmode(`Marks Obtained`),
         `Standard Deviation` = round(sd(`Marks Obtained`, na.rm = T), 3)) %>%
  knitr::kable(caption = "Differences in Descriptive Statistics by Gender After Removing Outliers")
Differences in Descriptive Statistics by Gender After Removing Outliers
Gender Average Score Median Score Mode Score Standard Deviation
Female 54.740 56 56 9.944
Male 53.962 54 49 8.702

Inferential Statistics

Relationship between Attendance of Test and Gender

From a simple look at the table above, it looks like that more Males signed up for the test but didn’t appear for the test than the females. These are two categorical variables that we have extracted specifically. Let’s focus at the data.

xtabs( ~ Attendance + Gender, data = clean.df) %>%
  knitr::kable(caption = "Crosstabulation of Attendance and Gender") %>%
  kableExtra::kable_styling(full_width = F)
Crosstabulation of Attendance and Gender
Female Male
Absent 2 15
Present 73 54

It appears that there is a significant difference in attendance between genders. To test if it is statistically significant, we can use the \(\chi^2\) test for goodness of fit.

chisq.result <- chisq.test(xtabs(~ Attendance + Gender, data = clean.df))
print(chisq.result)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  xtabs(~Attendance + Gender, data = clean.df)
## X-squared = 10.791, df = 1, p-value = 0.00102

The \(\chi^2\) statistic is 10.791 with a \(p-value\) of 0.00102. Based on these two values, we can say that in our data there was a singificant relationship between the two variable. We can not elaborate on what kind of relationship however.

Relationship between Marks Obtained and Gender

It appears that the difference between the mean scores of Males and Females is nearly equal. This means that the probability of there being a statistically significant difference is low. However, we will test it with whole data and again with the two top-most outliers removed.

We will state our null and alternative hypotheses first:

  • \(H_0\): There is no difference in means of marks obtained between genders.
  • \(H_A\): There exists a difference in means of marks obtained between genders.

The \(H_A\) makes it clear that we are looking for a two-tailed test.

tt <- t.test(`Marks Obtained` ~ Gender, data=clean.df, conf.level = 0.90)
tt
## 
##  Welch Two Sample t-test
## 
## data:  Marks Obtained by Gender
## t = -0.094581, df = 115.09, p-value = 0.9248
## alternative hypothesis: true difference in means is not equal to 0
## 90 percent confidence interval:
##  -3.107472  2.772109
## sample estimates:
## mean in group Female   mean in group Male 
##             54.73973             54.90741

The \(t-statistic\) is a mere -0.0945814 and the \(p-value\) is 0.9248118 with the alternative hypothesis that the true difference of means is equal to \(0\). The lower and upper bounds on the 90 percent confidence interval are (-3.1074716, 2.7721089). With these results we can confidently reject the alternative hypothesis and beleive that the true difference in means in the two groups is equal to zero.

However, let’s remove the top two outliers and try it again with the same hypotheses and confidence levels.

tt.outliers <- clean.df %>%
  filter(Ranking > 2) %>%
  t.test(`Marks Obtained` ~ Gender, data= . , conf.level = 0.90)
tt.outliers
## 
##  Welch Two Sample t-test
## 
## data:  Marks Obtained by Gender
## t = 0.46415, df = 117.81, p-value = 0.6434
## alternative hypothesis: true difference in means is not equal to 0
## 90 percent confidence interval:
##  -2.001381  3.557756
## sample estimates:
## mean in group Female   mean in group Male 
##             54.73973             53.96154

The \(t-statistic\) is a again low, a paltry 0.4641548 and the \(p-value\) is 0.6433934 with the alternative hypothesis that the true difference of means is equal to \(0\). The lower and upper bounds on the 90 percent confidence interval are (-2.0013809, 3.557756). With these results we can confidently reject the alternative hypothesis and beleive that the true difference in means in the two groups is equal to zero, even when high scoring outliers are excluded.

Graphical Representation

Distribution of Attendance

plot <- ggplot(data = clean.df, aes(x = Attendance, fill=Attendance))
plot + geom_bar() + ggtitle("Distribution of Attendance overall") + xlab("Attendance") + ylab("Count") + theme(plot.title = element_text(hjust = 0.5))

Distribution of Attendance by Gender

plot + geom_bar() + facet_grid(. ~ Gender) + ggtitle("Distribution of Attendance overall") + xlab("Attendance") + ylab("Count") + theme(plot.title = element_text(hjust = 0.5))

Distribution of Marks

plot <- ggplot(data = clean.df, aes(x = "", y = `Marks Obtained`))
plot + geom_boxplot(aes(fill = "Green"), width = 0.5) + guides(fill = FALSE) + ggtitle("Distribution of Marks Obtained")+ xlab("Students") + ylab("Marks Obtained") + theme(plot.title = element_text(hjust = 0.5)) + stat_summary(fun.y=mean, geom="point", shape=23, size=4) + stat_boxplot(geom ='errorbar')

Distribution of Marks by Gender

plot <- ggplot(data = clean.df, aes(x = Gender, y = `Marks Obtained`))
plot + geom_boxplot(aes(fill = Gender)) + guides(fill = FALSE) + ggtitle("Distribution of Marks Obtained with Gender") + xlab("Gender") + ylab("Marks Obtained") + theme(plot.title = element_text(hjust = 0.5)) + stat_summary(fun.y=mean, geom="point", shape=23, size=4) + stat_boxplot(geom ='errorbar')

Dataset

The dataset used in the analysis above was processed from OCR using ABBYY® FineReader® Online and then hand edited before processing. A link to download the data is given below.