Sometime back UPSC 2016 results were declared and amidst usual hype and hoopla, a blog post by twitter handle yugaparivatan claimed, that higher interview marks were awarded to Muslims . I was intrigued at that time and decided to analyse whole dataset later.

so I had some spare time this weekend and decided to analyse the whole thing along with other interesting factoids such as gender,religion,year-wise and surname wise variation.

Key issues :

Are girls and Muslims given higher marks in interview ? Does this trend hold true across various years?
Do the various categories differ in marks awarded ?
How many percent girls get through UPSC in total and community wise
Which surnames/castes dominate UPSC selected candidate list
Have some caste/Surnames become super-privileged in specific category?
What is the trend of interview and written marks across the years?

Data Extraction

2013,2014,2015,2016 UPSC list of candidates with categories were collected from Internet. 2015 list was pdf only,hence was extracted back to text then csv surnames were extracted with Regex The code for extraction and csv file of data can be found here and here Gender status and religion status was coded by hand after evaluation of data.

Distribution of Marks

Table of year-wise marks across categories

First we will look at average distribution of marks in Caste categories and percentage of Muslims and females in them

a=upm  %>% group_by(caste) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0),
                                       Muslims_percentage=round(100*mean(Muslim),2),
                   Females_percentage=round(100*mean(Gender),2)) 

library(knitr)
knitr::kable(a)

caste	n	marks	Written	Interview	Rank	Muslims_percentage	Females_percentage
General	2106	909	734	175	352	3.04	20.32
OBC	1342	876	709	167	669	4.25	10.28
SC	720	853	688	165	831	0.00	14.58
ST	367	842	675	167	934	3.54	13.62

Thus we see percentage of Muslims is between 3-4% in UPSC selected candidates, interestingly percentage of Muslim OBC is higher than general OBC. Expectedly Muslim percentage in Sc is zero as different religions can be given SC status , while ST classification depends upon regions and hence 3.54% Muslims there as well.

What is obvious here is inspite of protestations Muslims are under-represented in UPSC pool

Female representation is best in General group at 20%, while in OBC it is lowest 10.28% even lower than SC group which is surprising to say the least

Interview and written marks follow the general order General>OBC>SC>ST except in case of Interview where ST secure higher marks than SC(we shall examine this in a moment.)

General category candidates have better rank , but as we know that there is quota wise representation in each service and separate list for each categories

Lets see the whole data year-wise

b=upm  %>%  group_by(caste,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0),
                                       Muslims_percentage=round(100*mean(Muslim),2),
                   Females_percentage=round(100*mean(Gender),2)) 
knitr::kable(b)

caste	year	n	marks	Written	Interview	Rank	Muslims_percentage	Females_percentage
General	2013	517	809	629	180	347	3.09	20.89
General	2014	590	917	739	177	390	2.03	22.20
General	2015	499	900	729	171	328	3.01	16.63
General	2016	500	1011	841	170	338	4.20	21.20
OBC	2013	327	775	606	169	643	3.98	10.09
OBC	2014	354	879	708	171	746	2.82	10.17
OBC	2015	314	864	698	165	634	5.10	6.37
OBC	2016	347	980	816	164	646	5.19	14.12
SC	2013	187	753	586	167	831	0.00	16.58
SC	2014	194	864	697	167	884	0.00	11.86
SC	2015	176	844	681	163	805	0.00	13.64
SC	2016	163	965	801	163	795	0.00	16.56
ST	2013	91	740	572	168	934	4.40	12.09
ST	2014	98	846	676	170	1012	2.04	14.29
ST	2015	89	834	672	162	867	3.37	14.61
ST	2016	89	949	783	166	918	4.49	13.48

year-wise gender and religious representation

Let’s Plot Muslim representation community wise

ggplot(b,aes(x=year,y=Muslims_percentage,color=caste,group=caste))+geom_line()+
  geom_hline(yintercept =round(100*mean(upm$Muslim),2) ,color="red",linetype="dashed")+
  annotate("text",x=2015.5,y=round(100*mean(upm$Muslim),2),label="Average Muslim \n Representation %")+
  labs(color="Category")

Let’s Plot Female Representation Community wise

ggplot(b,aes(x=year,y=Females_percentage,color=caste,group=caste))+geom_line()+
   
   geom_hline(yintercept =round(100*mean(upm$Gender),2) ,color="red",linetype="dashed")+
  
annotate("text",x=2013.5,y=round(100*mean(upm$Gender),2),label="Average Female \n Representation %")+
    labs(color="Category")

So we saw that OBCs have poor female representation and slightly higher Muslim representation than general category

Simpson’s Paradox

Written Marks

Let’s look at the histogram of all Written marks across the years.

ggplot(upm,aes(x=Written,y=..density..))+geom_histogram(fill="orange")+
  geom_density(color="black")

Appears fairly symmetrical. Howver we know that caste wise differences exist so let’s plot caste wise histogram

ggplot(upm,aes(x=Written,y=..density..))+geom_histogram(fill="orange")+
  geom_density(color="black")+facet_wrap(~caste)

So, the fairly symmetrical curve becomes almost trimodal with three peaks, what could be the potential cause.. Let’s investigate further

ggplot(upm,aes(x=Written,y=..density..))+geom_histogram(fill="orange")+
  geom_density(color="black")+facet_wrap(caste~year)

So once we plot by year, the plots become symmetric again..How is that? a) in 2013 marks were considerably lower than 2016

In caste groups we see that general marks distribution is shifted more to right of SC/ST/OBC

Thus we saw how an innocuous looking histogram hid within itself many sub-populations, on eof the few reasons a single measure of central tendency should never be accepted till we see graphical representation of data.

It is also known as Simpson’s paradox

Interview Marks

Lets plot Interview marks

ggplot(upm,aes(x=Interview,y=..density..))+geom_histogram(fill="orange")+
  geom_density(color="black")

ggplot(upm,aes(x=Interview,y=..density..))+geom_histogram(fill="orange")+
  geom_density(color="black")+facet_wrap(~caste)

ggplot(upm,aes(x=Interview,y=..density..))+geom_histogram(fill="orange")+
  geom_density(color="black")+facet_wrap(caste~year)

The histograms and density plots give a lot of insight in the distribution of marks.

Time-series

Category-wise Variation

Let’s Plot Interview marks community wise

ggplot(b,aes(x=year,y=Interview,color=caste,group=caste))+geom_line()+
  
 geom_hline(yintercept =round(mean(upm$Interview),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Interview),2),label="Average Interview Marks")+
labs(color="Category")

We see that Interview marks have gone down in general category while they have gone up in ST category,general change is down.

ggplot(b,aes(x=year,y=Written,color=caste,group=caste))+geom_line()+
  
  
 geom_hline(yintercept =round(mean(upm$Written),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Written),2),label="Average Written Marks")+
labs(color="Category")

Average Written Marks have gone up in all categories across in usual order.

Religion-wise variation

upm  %>% mutate(Muslim=ifelse(Muslim==1,"Muslim","Non-Muslim")) %>%  group_by(Muslim,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>% 

ggplot(aes(x=year,y=Interview,color=factor(Muslim),group=factor(Muslim)))+geom_line()+
  
 geom_hline(yintercept =round(mean(upm$Interview),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Interview),2),label="Average Interview Marks")+
labs(color="Category")

We see that Interview marks in both Muslims and Non-Muslims go down, however Muslims have higher marks(around 5 marks) on average

Now some people alleged that Interview marks inflation was done deliberately in top ranks(Ranks less than 500),So lets see the trend in top rankers

upm  %>% filter(Rank<500) %>% mutate(Muslim=ifelse(Muslim==1,"Muslim","Non-Muslim")) %>%  group_by(Muslim,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>% 

ggplot(aes(x=year,y=Interview,color=factor(Muslim),group=factor(Muslim)))+geom_line()+
  
 geom_hline(yintercept =round(mean(upm$Interview),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Interview),2),label="Average Interview Marks")+
labs(color="Category")

In this graph,we can clearly see that while marks have been going down in both groups, in 2016 there is an uptrend(almost 13 marks higher in top 500 candidates) and that led to higher selection of Muslims in top ranks. This might be one off event, but in this year 2016 while interview marks of Muslim candidates on average went down, in top groups it went up against the general trend..

Lets now examine Written Marks

upm  %>% mutate(Muslim=ifelse(Muslim==1,"Muslim","Non-Muslim")) %>%  group_by(Muslim,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>% 

ggplot(aes(x=year,y=Written,color=factor(Muslim),group=factor(Muslim)))+geom_line()+
  
 geom_hline(yintercept =round(mean(upm$Written),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Written),2),label="Average Written Marks")+
labs(color="Category")

In this graph we clearly see that there is almost no difference between Muslim and Non-Muslim candidates in UPSC written testLets examine top guys under 500

upm  %>%filter(Rank<500) %>%  mutate(Muslim=ifelse(Muslim==1,"Muslim","Non-Muslim")) %>%  group_by(Muslim,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>% 

ggplot(aes(x=year,y=Written,color=factor(Muslim),group=factor(Muslim)))+geom_line()+
  
 geom_hline(yintercept =round(mean(upm$Written),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Written),2),label="Average Written Marks")+
labs(color="Category")

Again,here as well-there is no difference in Written scores(as the mark sheets are blinded) and hence the better performance in top group was driven entirely by better performance in interview

Gender-wise Variation

Lets see corresponding graphs for females in Written category

upm  %>% mutate(Gender=ifelse(Gender==1,"Female","Male")) %>%  group_by(Gender,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>% 

ggplot(aes(x=year,y=Written,color=factor(Gender),group=factor(Gender)))+geom_line()+
  
 geom_hline(yintercept =round(mean(upm$Written),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Written),2),label="Average Written Marks")+
labs(color="Category")

We see almost no difference here as opposed to Interview marks

Now in top candidates

upm  %>% filter(Rank<500) %>%  mutate(Gender=ifelse(Gender==1,"Female","Male")) %>%  group_by(Gender,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>% 

ggplot(aes(x=year,y=Written,color=factor(Gender),group=factor(Gender)))+geom_line()+
  
 geom_hline(yintercept =round(mean(upm$Written),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Written),2),label="Average Written Marks")+
labs(color="Category")

Here as well , there is no difference

Let’s see Interview Scores Now

upm  %>% mutate(Gender=ifelse(Gender==1,"Female","Male")) %>%  group_by(Gender,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>% 

ggplot(aes(x=year,y=Interview,color=factor(Gender),group=factor(Gender)))+geom_line()+
  
 geom_hline(yintercept =round(mean(upm$Interview),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Interview),2),label="Average Interview Marks")+
labs(color="Category")

Now it appears being a female confers an advantage of almost 5 points in interview..

Let’s analyse it in top rankers

upm  %>%filter(Rank<500) %>%  mutate(Gender=ifelse(Gender==1,"Female","Male")) %>%  group_by(Gender,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>% 

ggplot(aes(x=year,y=Interview,color=factor(Gender),group=factor(Gender)))+geom_line()+
  
 geom_hline(yintercept =round(mean(upm$Interview),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Interview),2),label="Average Interview Marks")+
labs(color="Category")

We see in top rankers difference is lesser and even that closed down in 2016 c.f. performance of Muslim community. Lets’ examine this association statistically.

Correlation between Interview and Written marks

First let’s look at relationship between Interview and Written marks

 upm %>% ggplot(aes(x=Interview,y=Written))+
  geom_jitter(color="red")+
   stat_smooth(method="lm",se=FALSE)

We see that there is a predominatly negative relationship. Does this relationship hold across years. Let’s see..

 upm %>% ggplot(aes(x=Interview,y=Written))+
  geom_jitter(color="red")+
   stat_smooth(method="lm",se=FALSE)+
  facet_wrap(~year)

We can see the relationship holds but degrees vary. It is expected as well as those who do very well in Written , might be introvert and perform worse in Interview..

Let’s see if this relationship holds in females as well.

 upm %>% mutate(Gender=ifelse(Gender==1,"Female","Male")) %>% ggplot(aes(x=Interview,y=Written,color=Gender))+
  geom_jitter()+
   stat_smooth(method="lm",se=FALSE)+
  facet_wrap(~year)

Well it does across years.though Females tend to score higher

Lets look at it in Muslims

 upm %>% mutate(Muslim=ifelse(Muslim==1,"Muslim","Non-Muslim")) %>% ggplot(aes(x=Interview,y=Written,color=Muslim))+
  geom_jitter()+
   stat_smooth(method="lm",se=FALSE)+
  facet_wrap(~year)

On adjusting for Written marks we see that Muslims have only minor difference with Non-Muslims,However in 2016 we can see that among higher marks there is a clear divergence indicating higher marks for Muslim candidates.

Statistical Model

Let’s now do a formal statistical test . Let’s consider Interview marks as dependent variable,predicted by indivisual’s caste,gender,religion (in which there are obvious trends)-obviously though the model is flawed as all models are as it doesnt have measure of innate ability of participant, but it helps us in controlling for gender,caste n religion difference in candidate as well as yearly variation.I have avoided taking year 2013 as there was significant ~100 point shift in written marks there..

f= lm(Interview~Written.c+Gender+caste+Muslim+factor(year),data=filter(upm,year>2013))
summary(f,digits=2)

## 
## Call:
## lm(formula = Interview ~ Written.c + Gender + caste + Muslim + 
##     factor(year), data = filter(upm, year > 2013))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -88.251 -11.024   0.173  11.299  52.918 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      179.155691   0.618955 289.449  < 2e-16 ***
## Written.c         -0.126074   0.008255 -15.272  < 2e-16 ***
## Gender             3.770246   0.794961   4.743 2.20e-06 ***
## casteOBC          -9.704373   0.718291 -13.510  < 2e-16 ***
## casteSC          -13.807427   0.914435 -15.099  < 2e-16 ***
## casteST          -14.421253   1.198486 -12.033  < 2e-16 ***
## Muslim             5.984816   1.701129   3.518  0.00044 ***
## factor(year)2015  -7.046271   0.704688  -9.999  < 2e-16 ***
## factor(year)2016   6.922696   1.108052   6.248 4.68e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.76 on 3404 degrees of freedom
## Multiple R-squared:  0.1344, Adjusted R-squared:  0.1324 
## F-statistic: 66.06 on 8 and 3404 DF,  p-value: < 2.2e-16

The analysis shows that for an averal General candidate in year 2014 as base , had 179 marks in interview, every 10 increasein written marks would decrease his interview marks by 1.2, Being of female gender would increase the mark by 4, OBC,SC and ST have around 10,13,14 lower point than general candidates,while being a Muslim adds around 6 marks to interview score. 2013 was alow score year There are yearly 6 points in subsequent years

However it must be noted that since we dont have other measures of innate ability of students the predictive perrobability of model is poor despite controlling for confounders.

Lets do the same analysis in top rankers

g= lm(Interview~Written.c+Gender+caste+Muslim+factor(year),data=filter(upm,year>2013,Rank<500))
summary(g)

## 
## Call:
## lm(formula = Interview ~ Written.c + Gender + caste + Muslim + 
##     factor(year), data = filter(upm, year > 2013, Rank < 500))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.652  -9.895  -0.884   9.960  60.085 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      190.60196    0.93964 202.845  < 2e-16 ***
## Written.c         -0.27833    0.01455 -19.126  < 2e-16 ***
## Gender             2.06934    0.98705   2.096   0.0362 *  
## casteOBC          -2.41066    1.09727  -2.197   0.0282 *  
## casteSC           -3.54604    1.96226  -1.807   0.0709 .  
## casteST           -2.57364    3.81958  -0.674   0.5005    
## Muslim             9.76134    2.37969   4.102 4.32e-05 ***
## factor(year)2015 -11.45810    0.99103 -11.562  < 2e-16 ***
## factor(year)2016  17.36783    1.70377  10.194  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.16 on 1488 degrees of freedom
## Multiple R-squared:  0.2482, Adjusted R-squared:  0.2442 
## F-statistic: 61.41 on 8 and 1488 DF,  p-value: < 2.2e-16

We see that while in top rankers there is a much lesser difference in interview marks between general and OBC/SC/ST candidates and even in females, however intrestingly the advantage of being a Muslim is amplified in these top rankers and gives an advantage of upto 10 marks even on controlling for other variables and points to some evidence of deliberate grade inflation in interview marks based on religion Lets run this analysis without controversial year 2016

g= lm(Interview~Written.c+Gender+caste+Muslim+factor(year),data=filter(upm,year<2016,Rank<500))
summary(g)

## 
## Call:
## lm(formula = Interview ~ Written.c + Gender + caste + Muslim + 
##     factor(year), data = filter(upm, year < 2016, Rank < 500))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.852 -11.569  -0.323  10.890  65.527 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      157.18403    1.38120 113.802  < 2e-16 ***
## Written.c         -0.30602    0.01471 -20.803  < 2e-16 ***
## Gender             2.90094    1.10309   2.630  0.00863 ** 
## casteOBC          -3.68990    1.19525  -3.087  0.00206 ** 
## casteSC           -4.28438    2.16018  -1.983  0.04751 *  
## casteST           -4.41615    4.23344  -1.043  0.29704    
## Muslim             8.81479    2.83858   3.105  0.00194 ** 
## factor(year)2014  34.60859    1.99323  17.363  < 2e-16 ***
## factor(year)2015  22.73994    1.79319  12.681  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.8 on 1488 degrees of freedom
## Multiple R-squared:  0.257,  Adjusted R-squared:  0.253 
## F-statistic: 64.35 on 8 and 1488 DF,  p-value: < 2.2e-16

We see that even when we exclude 2016 (as one-off random event), being a Muslim definitely provides upto 10 point advantage in interview process in top rankers and is infact the most significant of predictive factors excluding systematic yearly variations.

Note

We must also note that even though there is category wise variation in Interview marks in caste as well, this difference is seen in written marks as well, However in case of Female and Muslim candidates those having similar marks in written perform better in Interview indicating potential bias. Further, while the interview marks advantage for females diminishes at top ranks(<500),it increases for Muslim candidates. However, it should be noted that Female candidates and Muslim candidates are vastly under-represented than their population levels. One of the points of this analysis is if these candidates are being favored and this analysis shows that they are,then whole process should be open.

Lets do some fun analysis now..

Surname Analysis

I extracted last names of candidates with regex and now lets see which surnames dominate UPSC list

 upm  %>% group_by(surname) %>% 
  summarise(n=n()) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>% 
   arrange(desc(n)) %>%head(n=15)

## # A tibble: 15 x 3
##    surname     n proportion
##    <chr>   <int>      <dbl>
##  1 KUMAR     269      5.93 
##  2 SINGH     239      5.27 
##  3 MEENA     113      2.49 
##  4 SHARMA     97      2.14 
##  5 S          95      2.09 
##  6 YADAV      91      2.01 
##  7 GUPTA      69      1.52 
##  8 JAIN       58      1.28 
##  9 K          58      1.28 
## 10 R          55      1.21 
## 11 MISHRA     51      1.12 
## 12 VERMA      45      0.990
## 13 GARG       38      0.840
## 14 P          38      0.840
## 15 PANDEY     37      0.820

We see while generic surnames like KUMAR and SINGH dominate the scene, what is surprising is MEENA comminity which comes in ST has almost 113 seats higher than communities like sharma,gupta etc

Let’s plot it

 upm   %>% group_by(surname) %>% 
  summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>% 
   arrange(desc(n)) %>%
  filter(n>15) %>% 
  ggplot(aes(x=reorder(surname,n),y=n))+geom_bar(stat="identity",fill="blue")+
  
  coord_flip()+
  xlab("Surname")

Let’s look at which surnames have best ranks

 upm   %>% group_by(surname) %>% 
  summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>% 
   arrange(desc(n)) %>%
  filter(n>15) %>% 
  ggplot(aes(x=reorder(surname,desc(Rank)),y=Rank))+geom_bar(stat="identity",fill="blue")+
  coord_flip()+
  xlab("Surname")

We see that the best ranks are secured by brahmin ,bania communities event though absolute numbers may be variable

Which community/surname(except kumari) has best female representation..lets see

 upm   %>% group_by(surname) %>% 
  summarise(n=n(),Rank=mean(Rank),Female_percentage=round(100*mean(Gender),2)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>% 
   arrange(desc(n)) %>%
  filter(n>15) %>% 
  ggplot(aes(x=reorder(surname,Female_percentage),y=Female_percentage))+geom_bar(stat="identity",fill="blue")+
     geom_hline(yintercept =round(100*mean(upm$Gender),2) ,color="red",linetype="dashed")+
  annotate("text",x=2,y=round(100*mean(upm$Gender),2),label="Average Female \n Representation (15.9) %")+


  coord_flip()+
  xlab("Surname")

We see that Baniyas(guptas)and some brahmins and some south indian communities(represnted by single surnames like S,C,A,B,J,K) have high above average female representation

Lets look inside communities. First look at General community

  upm   %>% filter(caste=="General") %>% group_by(surname) %>% 
  summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>% 
   arrange(desc(n)) %>%
  filter(n>15) %>% 
  ggplot(aes(x=reorder(surname,n),y=n))+geom_bar(stat="identity",fill="orange")+
  
  coord_flip()+
  xlab("Surname")

Lets lookat ST community

  upm   %>% filter(caste=="ST") %>% group_by(surname) %>% 
  summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>% 
   arrange(desc(n)) %>%
  filter(n>4) %>% 
  ggplot(aes(x=reorder(surname,n),y=n))+geom_bar(stat="identity",fill="orange")+
  
  coord_flip()+
  xlab("Surname")

What is clearly evident here is that MEENA group is an anomaly in this population and sort of super-elite which is cornering most of benefits reserved for ST.

Let’s look at OBC community

  upm   %>% filter(caste=="OBC") %>% group_by(surname) %>% 
  summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>% 
   arrange(desc(n)) %>%
  filter(n>5) %>% 
  ggplot(aes(x=reorder(surname,n),y=n))+geom_bar(stat="identity",fill="orange")+
  
  coord_flip()+
  xlab("Surname")

We see that apart from generic kumar and singh surnames , Yadav and Patels from North India are capturing lot of OBC seats, however the story here is that many south indian candidates(with single letter surnames) are dominant in OBC community

Lets examine SC community

  upm   %>% filter(caste=="SC") %>% group_by(surname) %>% 
  summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>% 
   arrange(desc(n)) %>%
  filter(n>5) %>% 
  ggplot(aes(x=reorder(surname,n),y=n))+geom_bar(stat="identity",fill="orange")+
  
  coord_flip()+
  xlab("Surname")

We see that while most SC prefer kumar,singh and even verma titles..many south indian surname and hence candidates take these benefits.

Thus, on the whole in General category -Northern Communitites predominate,while in OBC and Sc groups South Indian surnames are well represented. In ST group MEENA community captures most of the seats.

Here is a list of summary statistics by surnames

upm %>% group_by(surname) %>% 
  summarize_at(vars(Total,Rank,Interview,Written,Gender),funs(mean,n=n())) %>% arrange(desc(Total_n)) %>% 
  rename(count=Total_n) %>% 
  select(-Rank_n,-Interview_n,-Written_n,-Gender_n) %>% 
  filter(count>4) %>% 
  arrange(Rank_mean) %>% mutate(Female_percentage = round(100*Gender_mean,2)) %>% 
  mutate(proportion=round(100*(count/sum(count)),2)) %>% 
  mutate(Female_total = round(Female_percentage*count/100,0)) %>% 
    select(-Gender_mean) %>% 
   #arrange(desc(Female_total))
 arrange(desc(proportion)) %>% 
head(n=15)

## # A tibble: 15 x 9
##    surname Total_mean Rank_mean Interview_mean Written_mean count
##    <chr>        <dbl>     <dbl>          <dbl>        <dbl> <int>
##  1 KUMAR          881       636            166          715   269
##  2 SINGH          874       600            169          706   239
##  3 MEENA          852       909            160          692   113
##  4 SHARMA         884       428            178          705    97
##  5 S              883       573            174          709    95
##  6 YADAV          871       649            164          708    91
##  7 GUPTA          908       373            171          737    69
##  8 JAIN           911       345            172          739    58
##  9 K              877       615            170          707    58
## 10 R              881       619            170          711    55
## 11 MISHRA         911       385            178          733    51
## 12 VERMA          892       607            169          723    45
## 13 GARG           911       297            174          737    38
## 14 P              865       602            174          691    38
## 15 PANDEY         902       387            175          727    37
## # ... with 3 more variables: Female_percentage <dbl>, proportion <dbl>,
## #   Female_total <dbl>

Repeater analysis

Let’s lookat names which repeat through various years(while it is true people with same names can get selected again,lesser probbaility that it happens on continuous years,in any case with limited resources-it is slight overestimation of repeater count

 repeater= upm %>% group_by(Name) %>% summarize(n=n()) %>% filter(n>1) %>%  pull(Name)

The number of repeater is 697.So the percentage of people who repeat out of total 4535 candidates is ~

round(100*(length(repeater)/length(upm$Name)),2)

## [1] 15.37

Thus proportion of repeaters is around 15%

So what percentage of repeaters eventually end up getting a top 100 rank.. Lets’s calculate

 top_ranker =upm %>% filter(Name %in% repeater) %>% filter(Rank<100) %>% pull(Name)

The number of top ranker out of these repeaters is 111.

So the percent of repeaters who end up as top ranker is 111/697

round(100*(length(top_ranker)/length(repeater)),2)

## [1] 15.93

or around 16%.

Thus around 15% selected people end up repeating and only 15% of these determined ones end up getting top 100 ranks.who said UPSC was an easy exam! It takes a lot to be a Babu..:-)

Key Takeaways

1.Being a Female or a Muslim confers advantage in Interview process which is minor(5 marks)since whole Interview is of 275 marks however muslim candidates show an increasing trend of higher marks(10) at top ranks so there is some evidence of deliberate grade inflation

2.General candidates get bettter marks than OBC,SC,ST in that order in Interview process,however this advantage doesnt hold up in top ranks

3.Baniyas(Gupta,Agarwal) in particular get the best mean ranks and have highest female representation almost double the national average, arguably one of the most developed communities in India

4.Female representation in SC is higher than OBCs

5.MEENA community captures most of the seats in ST community and is a cause of concern

6.Higher marks are inversely correlated with Interview marks.A person with 100 more written marks than average is likely to get 10 less marks in an Interview

7Muslims iand Females are under-represented in UPSC list, However in my personal opinion covert grade inflation is not the solution as will backfire, we need open and transparent policy to increase their representation(eg. opening more schools in Muslim dominated areas rather than specific quota)

8South Indian Community are better represented in OBC and SC list(understandable since despite advanced demographics on almost every parameters few people let go of the advantage of quotas), however in General list North Indian communities dominate

Analysis of sources of variation in UPSC marks

Anupam kumar Singh, MD

26 February 2018