In this report we explore publicly available data from the website of the Tamil Nadu government’s tnpsc.gov.in to analyze the marks obtained by candidates called for interview for the post of Assistant Statistical Investigator, for the recruitment year 2016.
The primary motivation behind this report is three-fold: (i) to perform elementary statistical analysis on the staff selection data, (ii) to document the process of getting and cleaning publicly available unstructured data into structured data in preferred formats that are ready for data science projects, and (iii) to get hands-on experience in generating publishable quality pdf/html documents using rmarkdown and knitr. The R language is used to clean the unstructured data and to perform statistical analysis. The R codes when present appear in grey boxes.
Tamil Nadu Public Service Commission (TNPSC), the Tamil Nadu state government body which is responsible for recruitment of staff for various posts in the state government, publishes results of the written examinations and interviews for staff selection in its website tnpsc.gov.in, with open access to the public. The data used in this report is the marks obtained by candidates called for interview process for the selection of Assistant Statistical Investigator in The Department of Economics and Statistics in The Tamil Nadu General Subordinate Service, 2014. The written test was conducted in July 2015 and the oral test was conducted in July 2016. The data includes the written, oral, and total marks of 534 candidates that were selected for the final interview before selection. The register number of the candidate is the identifier in the data; other variables present in the data are the community (based on caste) to which each candidate belongs and the sex of the candidates. A final comments column is also included for each candidate.
The data available in the tnpsc.gov.in website was in pdf format (http://www.tnpsc.gov.in/results/cml_asi_2014.pdf). The pdf file was downloaded and then converted into an xlsx file from the online converter http://www.zamzar.com/.
Since this is a data downloaded from the internet, it is better practice to record the time stamp of the download, which can be done with the R code date(). The time stamp output of our download is ## [1] "Mon Jan 9 23:20:47 2017".
Figure 1 shows a snapshot of the converted xlsx document. The data in the pdf were converted into a single xlsx file with 23 sheets of data in it (one sheet per pdf page).
Screen shot of the converted xlxs document. Highlighted in red is the multiple pages (sheets) inside a single xlsx file.
As can be seen from the screen shot, the converted xlxs file is unstructured with additional information other than the data, and the data spread over many sheets of an xlsx file. Comparing the pdf and the xlsx file, it is confirmed that there were no data loss or error in data conversion.
The R code given below is used for cleaning the data and converting it to usable csv format and also as R dataframe object, ready to be used in data science projects.
# load necessary package
suppressMessages(library(xlsx))
# create a vector with the name of all sheets in the .xls file
page = NULL
for (i in 1:23) {
page[i] = sprintf("Page %d", i)
}
# Create an empty data frame
df <- data.frame()
# For the first 22 sheets (uptp Page 22, read from row 12 onwards), for Page
# 23, read from row 12 to row 17. Add data from all these pages into df
# using rbind.
for (x in 1:23) {
if (x < 23) {
df <- rbind(df, read.xlsx("cml_asi_2014.xlsx", sheetName = page[x],
startRow = 12, header = FALSE))
} else {
df <- rbind(df, read.xlsx("cml_asi_2014.xlsx", sheetName = page[x],
startRow = 12, endRow = 17, header = FALSE))
}
}
# Remove rows that are 'WITHHELD'
df_clean <- df[!grepl("WITHHELD", df$X2), ]
# Add column names to the data frame.
colnames(df_clean) <- c("RegisterNo", "Community", "Sex", "WrittenMark", "OralMark",
"TotalMark", "Comments")
# Convert the class of the columns
df_clean$RegisterNo <- as.factor(df_clean$RegisterNo)
df_clean$Comments <- as.character(df_clean$Comments)
# Write dataframe df_clean to a csv file
write.csv(df_clean, file = "tnpsc.csv")
The data in the xlsx file consisted of 4 candidates for whom the results were withheld and only the register number was present. These 4 rows have been removed in the final R dataframe object.
The R function str() is a useful function to understand the structure and components of a dataframe.
str(df_clean)
## 'data.frame': 530 obs. of 7 variables:
## $ RegisterNo : Factor w/ 530 levels "10001034","10001036",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Community : Factor w/ 8 levels "BC(M)","BC(OBCM)",..: 2 3 4 4 4 2 2 3 4 5 ...
## $ Sex : Factor w/ 2 levels "FEMALE","MALE": 1 1 2 2 1 1 1 2 1 1 ...
## $ WrittenMark: num 272 284 260 255 249 ...
## $ OralMark : num 52.5 52.5 42 47.2 42 ...
## $ TotalMark : num 324 336 302 302 291 ...
## $ Comments : chr NA NA "PSTM" NA ...
From the above output, we see that the data consists of 530 observations of 7 variables. The names appearing on the left side are the column names of the data. Next to the names are the classes of the data (“factor”, “numeric”, “character” etc.). Though there are 7 columns (features) in this data, the useful features for data analysis are the “Community”, “Sex”, “WrittenMark”, “OralMark”, and “TotalMark”.
The cleaned and structured data that is ready for data analysis is shown below. The R function head() below displays the first 8 rows of the dataframe:
head(df_clean,8)
## RegisterNo Community Sex WrittenMark OralMark TotalMark Comments
## 1 10001034 BC(OBCM) FEMALE 271.5 52.50 324.00 <NA>
## 2 10001036 MBC/DC FEMALE 284.0 52.50 336.50 <NA>
## 3 10001113 SC MALE 260.5 42.00 302.50 PSTM
## 4 10001133 SC MALE 255.0 47.25 302.25 <NA>
## 5 10001158 SC FEMALE 249.0 42.00 291.00 <NA>
## 6 10001176 BC(OBCM) FEMALE 263.5 52.50 316.00 (*)
## 7 10001196 BC(OBCM) FEMALE 283.5 36.75 320.25 PSTM
## 8 10001205 MBC/DC MALE 249.0 47.25 296.25 PSTM
To view the social composition of the candidates, let us tabluate the data based on the “Community” and “Sex” and let us also tabulate based on gender alone.
table(df_clean$Community,df_clean$Sex)
##
## FEMALE MALE
## BC(M) 13 8
## BC(OBCM) 133 102
## MBC/DC 75 81
## SC 45 47
## ST 5 3
## ****WITHHELD**** 0 0
## SC(A) 8 8
## OTHERS 0 2
# Spread of candidates based on gender
table(df_clean$Sex)
##
## FEMALE MALE
## 279 251
The candidates belong to 7 communities (WITHHELD is not a community) and we see the composition of male and female candidates from each communities that are short listed for the final interview. The number of candidates called for final interview from each community group depends on the reservation policy of the government, representation, and merit among other factors. The above table gives us quantitative picture of the spread of candidates from various community groups and between the two gender. There are 279 female candidates and 251 male candidates called for the final interview.
Below is a summary of the TotalMark column, which states the total marks obtained by the candidates in the written and oral exam. We see that the mean of the total marks is 304.4 with a standard deviation of 35.31. The summary also displays other relevant quantities such as the median, min, max, and values in 1st and 3rd quartiles.
summary(df_clean$TotalMark)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 156.5 289.1 308.5 304.4 326.0 393.2
# Standard deviation of total marks
sd(df_clean$TotalMark)
## [1] 35.30859
In Figure 2, we plot a histogram of the feature ‘TotalMark’ to view the distribution of marks of the candidate pool. It shows that the distribution is bell-shaped but with some skewness to the right. It cannot be termed as a gaussian distribution. The red line denotes the average total marks.
hist(df_clean$TotalMark,30,xlab = "TotalMark",main = "")
abline(v = mean(df_clean$TotalMark), lwd = 3, col = "red")
Histogram showing the distribution of total marks by the candiate pool. The red line shows the mean of the total marks.
Figure 3 shows boxplots of total marks as a function of the community of the candidates. At first, the distribution of total marks seem to vary significantly between various community groups, but it may not be right. As we saw previously, there are too few candidates in BC(M), ST, SC(A), and OTHERS community groups compared to BC(OBCM), MBC/DC, and SC communities. Moreover, socially and economically, there are wide disparities among each group, more so with BC(M), ST, and SC(A) compared to OTHERS group.
library(ggplot2)
ggplot(data = df_clean, aes(x = Community, y = TotalMark)) +
geom_boxplot(aes(fill = factor(Community)))
Boxplot of TotalMark variable as a function of Community.
As a group, the boxplot of OTHERS community seem to be vastly different from the rest, but it is that different in reality? The mean total mark in the OTHERS group is 353, but there are only two candidates in the OTHERS group that were shortlisted for the final interview, with total marks of 357 and 349. The number of candidates selected for interview from this group may depend on the reservation policy of the state government and the backlogs in filling positions in the concerned department. But these two candidates (with average total mark of 353) may probably be the top performers in this group. Since there are just two candidates in the OTHERS group, we could ask if there are individual candidates who have scored better total marks than the average marks in the OTHERS group. The R code below answers exactly this question.
dim(df_clean[df_clean$TotalMark > 353,])
## [1] 21 7
There are 21 candidates who have performed better than the average total score of the OTHERS group (this includes one candidate from the OTHERS group who scored 357). Excluding him, there are 20 candidates (from BC(OBCM), MBC/DC, and SC categories of the Community variable). Thus the general notion that people from the OTHERS category (usually citizens of the so-called upper castes) are at the receiving end of the reservation policy despite being extremely meritorious may not be true, at least not true in this specific case.
It may be useful to answer a fundamental question: is there real difference in the average test score between two groups in the ‘Community’ variable? Performing unpaired two-sample Student t-test could address this question. It may not be meaningful to include ST, SC(A), or OTHERS group to compute the confidence intervals or perform hypothesis testing, as there are too few candidates from these groups.
The R function t.test() is a handy function to perform Student t-test. It provides, for the difference in mean between two samples, the t-score, the 95% confidence interval (default), the p-value for hypothesis testing (default: alternative hypothesis is that the true difference in means is not 0).
Below, we perform two sample t-test for the groups BC(OBCM) and MBC/DC. We make the following assumptions: (i) the outcome for each candidate is independent and random (ii) the two groups are unpaired, and (iii) the variance between the two groups is different
subset1 <- subset(df_clean, Community %in% c("BC(OBCM)", "MBC/DC"))
t.test(data = subset1, TotalMark~Community, paired = F, var.equal = F)
##
## Welch Two Sample t-test
##
## data: TotalMark by Community
## t = 0.30453, df = 345.72, p-value = 0.7609
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.969354 8.156493
## sample estimates:
## mean in group BC(OBCM) mean in group MBC/DC
## 310.7298 309.6362
We see that the t-score is 0.304 with 346 degrees of freedom. A 95% confidence interval in the true difference in means in the total score in this group is (-5.969, 8.156). The p-value is 0.7609. Hence, there is no evidence to reject the null hypothesis, which is the true difference in means is equal to 0.
Similarly, there is no evidence against the null hypothesis between the SC and BC(M) groups. But this is not the case for comparison between the MBC/DC and SC groups. There is strong evidence against the null hypothesis and in favor of the alternative hypothesis that the true difference in means (in total marks) is not equal to 0, with a p-value of 9.33e-06 and a 95% confidence interval of (9.129, 23.187). The MBC/DC community group seem to perform reasonably better than the SC community group.
The comparison in performace between the two genders in calculated for each Community group and also for the entire candidate pool. In case of the entire candidate pool male and female candidates have performed equally well. This is also true when we perform the analysis for every individual group, except the SC group. In case of the SC community group, there is strong evidence against the null hypothesis and in favor of the one-sided alternative hypothesis that the true difference in means (in total marks) is less than 0, with a p-value of 0.033 and a 95% confidence interval of (-Inf, -0.886).
subset2 <- subset(df_clean, Community %in% "SC")
t.test(data = subset2, TotalMark~Sex, paired = F, var.equal = F, alternative = "less")
##
## Welch Two Sample t-test
##
## data: TotalMark by Sex
## t = -1.8584, df = 83.974, p-value = 0.03331
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -0.8864989
## sample estimates:
## mean in group FEMALE mean in group MALE
## 289.1667 297.6064
The males in this community group seem to perform marginally better than the females. Let us plot a boxplot of TotalMark for the two genders in the SC community group.
ggplot(data = subset2, aes(x = Sex, y = TotalMark)) +
geom_boxplot(aes(fill = Sex))
Boxplot of TotalMark variable as a function of Sex for SC community group.
As stated in the motivation, the main objective in generating this report is to demonstrate the process of getting and cleaning data and to generate publishable quality reproducible report using rmarkdown and knitr and to perform elementary statistical analysis on publicly available government data. The hope is that the report serves as a candid visual and statistical display to understand the recruitment process, the gender distribution in recruitment, the effects of reservation policy, the performance of candidates as a function of gender and community in the recruitment of Assistant Statistical Investigators in the Tamil Nadu government in the year 2016.