This is an updated version of my earlier analysis of UPSC 2013-16 data, which suggested that female and Muslim candidates were being awarded higher marks in the interview even after controlling for written marks and caste, and that some sub-castes were cornering a disproportionate share of the benefits in their respective categories.

The core of the analysis remains the same.

Key issues:

  1. Are women and Muslims given higher marks in the interview? Does this trend hold across years?
  2. Do the various categories differ in the marks awarded?
  3. What percentage of selected candidates are women, in total and community-wise?
  4. Which surnames/castes dominate the list of selected candidates?
  5. Have some castes/surnames become super-privileged within a specific category?
  6. What is the trend of interview and written marks across the years?

Data Extraction

The 2013, 2014, 2015 and 2016 UPSC lists of candidates with categories were collected from the Internet. The 2015 list was available only as a PDF, so it was converted back to text and then to CSV. Surnames were extracted with regex. 2017 data was added for this update. The code for extraction and the CSV file of data can be found here and here. Gender and religion were coded by hand after evaluation of the data.
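The surname step can be sketched as below. This is only an illustration in base R, not the extraction code linked above: the sample names and the helper `extract_surname` are made up for the example, and it assumes the surname is simply the last whitespace-separated token of the name.

```r
# Hypothetical sample of candidate names (not taken from the real list)
candidate_names <- c("ANIL KUMAR", "PRIYA SHARMA ", "RAMESH CHAND MEENA", "KAVYA S")

# Assumption: the surname is the last whitespace-separated token.
# trimws() removes stray spaces; sub() drops everything up to the last space.
extract_surname <- function(x) {
  x <- trimws(x)
  toupper(sub("^.*\\s+", "", x))
}

surnames <- extract_surname(candidate_names)
print(surnames)
```

Names written as initials (for example "KAVYA S") end up with a single-letter surname under this rule, which is why surnames like S, R and K appear in the tables further down.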

Distribution of Marks

Table of year-wise marks across categories

First we will look at the average marks in each caste category, and the percentage of Muslims and females in each.

library(dplyr)   # group_by(), summarise() and %>%
library(knitr)   # kable()

# Muslim and Gender are 0/1 indicators, so 100*mean() gives a percentage
a=upm  %>% group_by(caste) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0),
                                       Muslims_percentage=round(100*mean(Muslim),2),
                   Females_percentage=round(100*mean(Gender),2)) 

knitr::kable(a)
caste       n  marks  Written  Interview  Rank  Muslims_percentage  Females_percentage
General  2582    931      757        174   343                3.14               20.80
OBC      1617    897      730        167   656                4.70               10.70
SC        885    875      710        165   818                0.00               16.84
ST        441    863      697        166   909                3.40               13.15

Thus the percentage of Muslims among selected candidates is between 3% and 5% in every category except SC. Interestingly, the share of Muslims is higher in OBC (4.70%) than in General (3.14%). As expected, the Muslim percentage in SC is zero, since SC status is not available to Muslims, while ST classification depends on region and hence 3.40% of ST candidates are Muslim.

What is obvious here is that, in spite of protestations, Muslims are under-represented in the UPSC pool.

Female representation is best in the General group at ~21%, while it is lowest in OBC at 10.7%, lower even than in the SC group (~17%) and the ST group (~13%), which is surprising to say the least.

Mean interview and written marks follow the general order General > OBC > SC > ST, except in the interview, where ST candidates secure higher marks than SC (we shall examine this in a moment).

General category candidates have better ranks, but note that there is quota-wise representation in each service and a separate list for each category.

Let’s see the whole data year-wise.

b=upm  %>%  group_by(caste,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0),
                                       Muslims_percentage=round(100*mean(Muslim),2),
                   Females_percentage=round(100*mean(Gender),2)) 
knitr::kable(b)
caste    year    n  marks  Written  Interview  Rank  Muslims_percentage  Females_percentage
General  2013  517    809      629        180   347                3.09               20.89
General  2014  590    917      739        177   390                2.03               22.20
General  2015  499    900      729        171   328                3.01               16.63
General  2016  500   1011      841        170   338                4.20               21.20
General  2017  476   1028      857        171   304                3.57               22.90
OBC      2013  327    775      606        169   643                3.98               10.09
OBC      2014  354    879      708        171   746                2.82               10.17
OBC      2015  314    864      698        165   634                5.10                6.37
OBC      2016  347    980      816        164   646                5.19               14.12
OBC      2017  275    997      833        164   591                6.91               12.73
SC       2013  187    753      586        167   831                0.00               16.58
SC       2014  194    864      697        167   884                0.00               11.86
SC       2015  176    844      681        163   805                0.00               13.64
SC       2016  163    965      801        163   795                0.00               16.56
SC       2017  165    974      810        164   762                0.00               26.67
ST       2013   91    740      572        168   934                4.40               12.09
ST       2014   98    846      676        170  1012                2.04               14.29
ST       2015   89    834      672        162   867                3.37               14.61
ST       2016   89    949      783        166   918                4.49               13.48
ST       2017   74    971      805        166   781                2.70               10.81

Year-wise gender and religious representation

Let’s Plot Muslim representation community wise

library(ggplot2)

# Dashed red line marks the overall Muslim share across all years and categories
ggplot(b,aes(x=year,y=Muslims_percentage,color=caste,group=caste))+geom_line()+
  geom_hline(yintercept =round(100*mean(upm$Muslim),2) ,color="red",linetype="dashed")+
  annotate("text",x=2015.5,y=round(100*mean(upm$Muslim),2),label="Average Muslim \n Representation %")+
  labs(color="Category")

We see that Muslims make up a larger share of selected OBC candidates than of selected General category candidates in every year from 2013 to 2017.

Let’s Plot Female Representation Community wise

ggplot(b,aes(x=year,y=Females_percentage,color=caste,group=caste))+geom_line()+
  geom_hline(yintercept =round(100*mean(upm$Gender),2) ,color="red",linetype="dashed")+
  annotate("text",x=2013.5,y=round(100*mean(upm$Gender),2),label="Average Female \n Representation %")+
  labs(color="Category")

So we saw that OBCs have poor female representation and slightly higher Muslim representation than the General category.

In 2017 there was a massive jump in the share of females in the SC category (from 16.56% in 2016 to 26.67%).

Simpson’s Paradox

Written Marks

Let’s look at the histogram of all Written marks across the years.

# after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
ggplot(upm,aes(x=Written,y=after_stat(density)))+geom_histogram(fill="orange")+
  geom_density(color="black")

It appears fairly symmetrical. However, we know that caste-wise differences exist, so let’s plot the histogram caste-wise.

# One panel per caste category
ggplot(upm,aes(x=Written,y=after_stat(density)))+geom_histogram(fill="orange")+
  geom_density(color="black")+facet_wrap(~caste)

So the fairly symmetrical curve becomes almost trimodal, with three peaks. What could be the cause? Let’s investigate further.

# One panel per caste category and year
ggplot(upm,aes(x=Written,y=after_stat(density)))+geom_histogram(fill="orange")+
  geom_density(color="black")+facet_wrap(caste~year)

So once we plot by year, the plots become symmetric again. How is that?

  1. In 2013 marks were considerably lower than in 2016, so pooling the years mixes distributions centred at different values.

  2. Across caste groups, the General marks distribution is shifted to the right of the SC/ST/OBC distributions.

Thus we saw how an innocuous-looking histogram hid many sub-populations within itself, one of the reasons a single measure of central tendency should never be accepted until we see a graphical representation of the data.

This is closely related to Simpson’s paradox, where a pattern seen in pooled data changes once the data is split into groups.
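The effect of pooling can be reproduced on simulated data. The sketch below is not based on the UPSC file: it assumes two hypothetical exam years whose written marks are normally distributed with the same spread but different means, and compares the pooled spread with the within-year spread.

```r
set.seed(42)

# Simulated written marks for two hypothetical years with different means
marks <- data.frame(
  year    = rep(c(2013, 2016), each = 500),
  Written = c(rnorm(500, mean = 620, sd = 30), rnorm(500, mean = 830, sd = 30))
)

pooled_mean <- mean(marks$Written)
pooled_sd   <- sd(marks$Written)                     # inflated by the gap between years
within_sd   <- tapply(marks$Written, marks$year, sd) # close to 30 in each year

# Share of simulated candidates who score within 30 marks of the pooled mean
near_pooled_mean <- mean(abs(marks$Written - pooled_mean) < 30)

print(round(c(pooled_sd = pooled_sd, within_sd), 1))
print(near_pooled_mean)
```

In this simulation the pooled mean describes almost nobody: very few candidates score anywhere near it, which is why the faceted plots are more informative than the single pooled summary.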

Interview Marks

Lets plot Interview marks

# Pooled, by caste category, and by caste category and year
# (after_stat(density) replaces the deprecated ..density.. notation)
ggplot(upm,aes(x=Interview,y=after_stat(density)))+geom_histogram(fill="orange")+
  geom_density(color="black")

ggplot(upm,aes(x=Interview,y=after_stat(density)))+geom_histogram(fill="orange")+
  geom_density(color="black")+facet_wrap(~caste)

ggplot(upm,aes(x=Interview,y=after_stat(density)))+geom_histogram(fill="orange")+
  geom_density(color="black")+facet_wrap(caste~year)

The histograms and density plots give a lot of insight into the distribution of marks.

Time-series

Category-wise Variation

Let’s Plot Interview marks community wise

ggplot(b,aes(x=year,y=Interview,color=caste,group=caste))+geom_line()+
  
 geom_hline(yintercept =round(mean(upm$Interview),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Interview),2),label="Average Interview Marks")+
labs(color="Category")

We see that interview marks have generally drifted down, but ST candidates score higher than SC candidates in four of the five years, and higher than OBC candidates in 2016 and 2017.

ggplot(b,aes(x=year,y=Written,color=caste,group=caste))+geom_line()+
  
  
 geom_hline(yintercept =round(mean(upm$Written),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Written),2),label="Average Written Marks")+
labs(color="Category")

Average written marks have gone up in all categories between 2013 and 2017 (with a small dip in 2015), keeping the usual order General > OBC > SC > ST in every year.

Religion-wise variation

upm  %>% mutate(Muslim=ifelse(Muslim==1,"Muslim","Non-Muslim")) %>%  group_by(Muslim,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>% 

ggplot(aes(x=year,y=Interview,color=factor(Muslim),group=factor(Muslim)))+geom_line()+
  
 geom_hline(yintercept =round(mean(upm$Interview),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Interview),2),label="Average Interview Marks")+
labs(color="Category")

We see that the interview marks of both Muslims and non-Muslims go down; however, Muslims score around 5 marks higher on average.

Some people have alleged that interview marks were deliberately inflated in the top ranks (ranks below 500), so let’s see the trend among top rankers.

upm  %>% filter(Rank<500) %>% mutate(Muslim=ifelse(Muslim==1,"Muslim","Non-Muslim")) %>%  group_by(Muslim,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>% 

ggplot(aes(x=year,y=Interview,color=factor(Muslim),group=factor(Muslim)))+geom_line()+
  
 geom_hline(yintercept =round(mean(upm$Interview),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Interview),2),label="Average Interview Marks")+
labs(color="Category")

In this graph we can clearly see that while marks have been going down in both groups, in 2016 there is an uptick for Muslim candidates (almost 13 marks higher among the top 500 candidates), which led to higher selection of Muslims in the top ranks. This might be a one-off event, but in 2016, while the interview marks of Muslim candidates went down on average, in the top group they went up, against the general trend.

In 2017 the difference has reduced but is still there. We will also do a regression analysis controlling for these factors.

Let’s now examine written marks.

upm  %>% mutate(Muslim=ifelse(Muslim==1,"Muslim","Non-Muslim")) %>%  group_by(Muslim,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>% 

ggplot(aes(x=year,y=Written,color=factor(Muslim),group=factor(Muslim)))+geom_line()+
  
 geom_hline(yintercept =round(mean(upm$Written),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Written),2),label="Average Written Marks")+
labs(color="Category")

In this graph we clearly see that there is almost no difference between Muslim and non-Muslim candidates in the UPSC written test.

Let’s examine the top candidates, with rank under 500.

upm  %>%filter(Rank<500) %>%  mutate(Muslim=ifelse(Muslim==1,"Muslim","Non-Muslim")) %>%  group_by(Muslim,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>% 

ggplot(aes(x=year,y=Written,color=factor(Muslim),group=factor(Muslim)))+geom_line()+
  
 geom_hline(yintercept =round(mean(upm$Written),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Written),2),label="Average Written Marks")+
labs(color="Category")

Again, here as well there is no difference in written scores (as the mark sheets are blinded), and hence the better performance in the top group was driven entirely by better performance in the interview.

Gender-wise Variation

Let’s see the corresponding graphs for females, starting with written marks.

upm  %>% mutate(Gender=ifelse(Gender==1,"Female","Male")) %>%  group_by(Gender,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>% 

ggplot(aes(x=year,y=Written,color=factor(Gender),group=factor(Gender)))+geom_line()+
  
 geom_hline(yintercept =round(mean(upm$Written),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Written),2),label="Average Written Marks")+
labs(color="Category")

We see almost no difference here as opposed to Interview marks

Now in top candidates

upm  %>% filter(Rank<500) %>%  mutate(Gender=ifelse(Gender==1,"Female","Male")) %>%  group_by(Gender,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>% 

ggplot(aes(x=year,y=Written,color=factor(Gender),group=factor(Gender)))+geom_line()+
  
 geom_hline(yintercept =round(mean(upm$Written),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Written),2),label="Average Written Marks")+
labs(color="Category")

Here as well, there is no difference.

Let’s see Interview Scores Now

upm  %>% mutate(Gender=ifelse(Gender==1,"Female","Male")) %>%  group_by(Gender,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>% 

ggplot(aes(x=year,y=Interview,color=factor(Gender),group=factor(Gender)))+geom_line()+
  
 geom_hline(yintercept =round(mean(upm$Interview),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Interview),2),label="Average Interview Marks")+
labs(color="Category")

It appears that being female confers an advantage of almost 5 marks in the interview.

Let’s analyse it in top rankers

upm  %>%filter(Rank<500) %>%  mutate(Gender=ifelse(Gender==1,"Female","Male")) %>%  group_by(Gender,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>% 

ggplot(aes(x=year,y=Interview,color=factor(Gender),group=factor(Gender)))+geom_line()+
  
 geom_hline(yintercept =round(mean(upm$Interview),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Interview),2),label="Average Interview Marks")+
labs(color="Category")

We see that among top rankers the difference is smaller, and it even closed in 2016 (cf. the performance of Muslim candidates in that year). However, in 2017 it shot up again.

Let’s examine this association statistically.

Correlation between Interview and Written marks

First let’s look at relationship between Interview and Written marks

 upm %>% ggplot(aes(x=Interview,y=Written))+
  geom_jitter(color="red")+
   stat_smooth(method="lm",se=FALSE)

We see that there is a predominantly negative relationship. Does this relationship hold across years? Let’s see.

 upm %>% ggplot(aes(x=Interview,y=Written))+
  geom_jitter(color="red")+
   stat_smooth(method="lm",se=FALSE)+
  facet_wrap(~year)

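We can see the relationship holds, though its strength varies. This is partly expected: those who do very well in the written exam might be introverts and perform worse in the interview. Part of it is also likely to be a selection effect, since the list contains only candidates whose total (written plus interview) cleared the cut-off, so a candidate with low written marks appears here only if the interview marks were high.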
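How much of a negative slope selection alone can produce is easy to check on simulated data. The sketch below is not fitted to the UPSC file: the means, spreads and the 5% selection rate are assumptions chosen for illustration, and written and interview marks are generated independently, so any correlation among the selected candidates comes purely from selecting on the total.

```r
set.seed(1)
n <- 20000

# Assumed, independent marks for all simulated candidates who sat both stages
written   <- rnorm(n, mean = 700, sd = 60)
interview <- rnorm(n, mean = 165, sd = 20)
total     <- written + interview

# Only the top 5% by total make it to the final list
selected <- total >= quantile(total, 0.95)

cor_all      <- cor(written, interview)                      # close to zero
cor_selected <- cor(written[selected], interview[selected])  # clearly negative

print(round(c(all = cor_all, selected = cor_selected), 2))
```

So a negative relationship among selected candidates does not by itself show that strong written performers interview worse; the regression below has the same limitation, because it is also run only on selected candidates.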

Let’s see if this relationship holds in females as well.

 upm %>% mutate(Gender=ifelse(Gender==1,"Female","Male")) %>% ggplot(aes(x=Interview,y=Written,color=Gender))+
  geom_jitter()+
   stat_smooth(method="lm",se=FALSE)+
  facet_wrap(~year)

It does, across years, though females tend to score higher in the interview.

Let’s look at it for Muslim candidates.

 upm %>% mutate(Muslim=ifelse(Muslim==1,"Muslim","Non-Muslim")) %>% ggplot(aes(x=Interview,y=Written,color=Muslim))+
  geom_jitter()+
   stat_smooth(method="lm",se=FALSE)+
  facet_wrap(~year)

On adjusting for written marks, we see only a minor difference between Muslims and non-Muslims. However, in 2016 there is a clear divergence at the higher end of marks, indicating higher interview marks for Muslim candidates.

Statistical Model

Let’s now do a formal statistical test. We take interview marks as the dependent variable, predicted by written marks and the individual’s caste, gender and religion (in which there are obvious trends), plus the year. Obviously the model is flawed, as all models are, since it has no measure of the innate ability of the candidate, but it helps us control for gender, caste and religion differences between candidates as well as for yearly variation.
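The model uses a column called Written.c whose construction is not shown in this post. A plausible reading, and it is only an assumption, is that it is the written marks centred on their overall mean, which would make the intercept the expected interview marks of a candidate with average written marks. On a toy data frame (the values are made up) that would look like this:

```r
# Toy stand-in for upm; only the Written column matters here
toy <- data.frame(Written = c(600, 700, 800, 900))

# Assumption: Written.c is Written minus its overall mean
toy$Written.c <- toy$Written - mean(toy$Written)

print(toy)
```

Centring does not change the slope on written marks; it only shifts the intercept so that it refers to an average candidate rather than to a candidate with zero written marks.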

f= lm(Interview~Written.c+Gender+caste+Muslim+factor(year),data=upm)
summary(f,digits=2)
## 
## Call:
## lm(formula = Interview ~ Written.c + Gender + caste + Muslim + 
##     factor(year), data = upm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -86.853 -11.350   0.112  11.551  57.613 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      180.161739   0.601409 299.566  < 2e-16 ***
## Written.c         -0.153148   0.006813 -22.479  < 2e-16 ***
## Gender             3.820673   0.627526   6.088 1.22e-09 ***
## casteOBC         -11.315397   0.578359 -19.565  < 2e-16 ***
## casteSC          -15.810523   0.735807 -21.487  < 2e-16 ***
## casteST          -16.431979   0.969966 -16.941  < 2e-16 ***
## Muslim             5.132072   1.338552   3.834 0.000127 ***
## factor(year)2014  -0.602161   0.708922  -0.849 0.395694    
## factor(year)2015  -6.182381   0.733325  -8.431  < 2e-16 ***
## factor(year)2016  -6.721234   0.729755  -9.210  < 2e-16 ***
## factor(year)2017  -6.423793   0.749939  -8.566  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.19 on 5514 degrees of freedom
## Multiple R-squared:  0.1579, Adjusted R-squared:  0.1564 
## F-statistic: 103.4 on 10 and 5514 DF,  p-value: < 2.2e-16

The analysis shows that an average General category candidate (male, non-Muslim) in the base year 2013 had about 180 marks in the interview. Every 10-mark increase in written marks is associated with a decrease of about 1.5 interview marks, and being female with an increase of about 4 marks. OBC, SC and ST candidates score around 11, 16 and 16 marks lower than General candidates, while being Muslim adds around 5 marks to the interview score. However, it must be noted that since we don’t have other measures of the innate ability of candidates, the predictive power of the model is poor (R-squared of about 0.16) despite controlling for confounders.

Let’s do the same analysis for top rankers (rank below 500), here using the years from 2014 onwards.
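To make the coefficients concrete, here is the arithmetic for one hypothetical candidate, using the estimates copied from the printed summary of the first model. The candidate profile (female, OBC, non-Muslim, 2016, written marks 50 above the mean) is invented for illustration, and the calculation assumes Written.c is measured in marks relative to the mean.

```r
# Estimates copied from the printed summary(f) output
coefs <- c(intercept = 180.161739, written_c = -0.153148, female = 3.820673,
           obc = -11.315397, muslim = 5.132072, year2016 = -6.721234)

# Hypothetical candidate: female, OBC, non-Muslim, 2016, Written.c = +50
predicted <- coefs[["intercept"]] + coefs[["written_c"]] * 50 +
  coefs[["female"]] + coefs[["obc"]] + coefs[["year2016"]]

# Same profile, but Muslim
predicted_muslim <- predicted + coefs[["muslim"]]

print(round(c(non_muslim = predicted, muslim = predicted_muslim), 1))
```

The first figure comes to about 158.3 marks. With a residual standard error of about 17 marks, individual candidates will scatter widely around such predictions.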

g= lm(Interview~Written.c+Gender+caste+Muslim+factor(year),data=filter(upm,year>2013,Rank<500))
summary(g)
## 
## Call:
## lm(formula = Interview ~ Written.c + Gender + caste + Muslim + 
##     factor(year), data = filter(upm, year > 2013, Rank < 500))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.140 -10.004  -0.606  10.014  59.068 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      189.44749    0.84339 224.627  < 2e-16 ***
## Written.c         -0.28364    0.01258 -22.547  < 2e-16 ***
## Gender             2.80179    0.83515   3.355 0.000809 ***
## casteOBC          -2.59404    0.95551  -2.715 0.006689 ** 
## casteSC           -2.46735    1.69117  -1.459 0.144733    
## casteST           -0.70120    3.07164  -0.228 0.819451    
## Muslim             7.78875    1.92950   4.037 5.63e-05 ***
## factor(year)2015  -8.28404    0.94761  -8.742  < 2e-16 ***
## factor(year)2016 -11.54453    0.94921 -12.162  < 2e-16 ***
## factor(year)2017 -10.83843    0.95581 -11.339  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.91 on 1986 degrees of freedom
## Multiple R-squared:  0.2478, Adjusted R-squared:  0.2444 
## F-statistic:  72.7 on 9 and 1986 DF,  p-value: < 2.2e-16

We see that among top rankers the difference in interview marks between General and OBC/SC/ST candidates is much smaller, and the female advantage shrinks as well (to about 3 marks). Interestingly, however, the advantage of being Muslim is amplified among these top rankers: it is about 8 marks even after controlling for the other variables, which points to some evidence of deliberate inflation of interview marks based on religion.

Let’s run this analysis without the controversial year 2016. Note that the filter year<2016 keeps 2013 to 2015, so 2017 is dropped as well.

g= lm(Interview~Written.c+Gender+caste+Muslim+factor(year),data=filter(upm,year<2016,Rank<500))
summary(g)
## 
## Call:
## lm(formula = Interview ~ Written.c + Gender + caste + Muslim + 
##     factor(year), data = filter(upm, year < 2016, Rank < 500))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.852 -11.569  -0.323  10.890  65.527 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      188.96596    0.93958 201.117  < 2e-16 ***
## Written.c         -0.30602    0.01471 -20.803  < 2e-16 ***
## Gender             2.90094    1.10309   2.630  0.00863 ** 
## casteOBC          -3.68990    1.19525  -3.087  0.00206 ** 
## casteSC           -4.28438    2.16018  -1.983  0.04751 *  
## casteST           -4.41615    4.23344  -1.043  0.29704    
## Muslim             8.81479    2.83858   3.105  0.00194 ** 
## factor(year)2014   1.51692    1.06957   1.418  0.15633    
## factor(year)2015  -6.88825    1.06673  -6.457 1.44e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.8 on 1488 degrees of freedom
## Multiple R-squared:  0.257,  Adjusted R-squared:  0.253 
## F-statistic: 64.35 on 8 and 1488 DF,  p-value: < 2.2e-16

We see that even when we exclude 2016 (as a one-off random event), being Muslim is associated with an advantage of about 9 marks in the interview among top rankers. Apart from written marks and the systematic yearly variation, it is the largest effect in the model.
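The point estimate comes with a fairly wide margin, because there are few Muslim candidates among the top rankers. An approximate 95% confidence interval can be worked out from the printed estimate, standard error and residual degrees of freedom of this last model; the snippet below only redoes that arithmetic and does not refit anything.

```r
# Values copied from the printed output of the model without 2016
estimate  <- 8.81479   # Muslim coefficient
std_error <- 2.83858
df_resid  <- 1488      # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error

print(round(ci, 1))
```

The interval runs from roughly 3 to 14 marks, so the data are consistent with anything from a small to a large advantage. With the fitted model object available, confint(g) would give the same interval directly.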

Note

We must also note that although interview marks vary by caste category, the same difference is seen in written marks. For female and Muslim candidates, however, those with similar written marks perform better in the interview, indicating potential bias. Further, while the interview advantage for females diminishes at top ranks (<500), it increases for Muslim candidates. It should also be noted that female and Muslim candidates are vastly under-represented relative to their share of the population. One point of this analysis is that if these candidates are being favoured, and the analysis suggests they are, then the whole process should be open about it.

Let’s do some fun analysis now.

Surname Analysis

I extracted the last names of candidates with regex; now let’s see which surnames dominate the UPSC list.

 upm  %>% group_by(surname) %>% 
  summarise(n=n()) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>% 
   arrange(desc(n)) %>%head(n=15)
## # A tibble: 15 x 3
##    surname     n proportion
##    <chr>   <int>      <dbl>
##  1 KUMAR     325      5.88 
##  2 SINGH     290      5.25 
##  3 MEENA     140      2.53 
##  4 SHARMA    117      2.12 
##  5 YADAV     114      2.06 
##  6 S         113      2.05 
##  7 GUPTA      83      1.50 
##  8 JAIN       78      1.41 
##  9 R          64      1.16 
## 10 K          63      1.14 
## 11 MISHRA     63      1.14 
## 12 VERMA      53      0.960
## 13 GARG       46      0.830
## 14 PANDEY     43      0.780
## 15 P          41      0.740

We see that while generic surnames like KUMAR and SINGH dominate the scene, what is surprising is that the MEENA community, which falls under ST, has more seats than communities with surnames like SHARMA and GUPTA.

Let’s plot it

 upm   %>% group_by(surname) %>% 
  summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>% 
   arrange(desc(n)) %>%
  filter(n>15) %>% 
  ggplot(aes(x=reorder(surname,n),y=n))+geom_bar(stat="identity",fill="blue")+
  
  coord_flip()+
  xlab("Surname")

Let’s look at which surnames have best ranks

 upm   %>% group_by(surname) %>% 
  summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>% 
   arrange(desc(n)) %>%
  filter(n>15) %>% 
  ggplot(aes(x=reorder(surname,desc(Rank)),y=Rank))+geom_bar(stat="identity",fill="blue")+
  coord_flip()+
  xlab("Surname")

We see that the best ranks are secured by Brahmin and Bania surnames, even though the absolute numbers vary.

Which community/surname (except KUMARI) has the best female representation? Let’s see.

 upm   %>% group_by(surname) %>% 
  summarise(n=n(),Rank=mean(Rank),Female_percentage=round(100*mean(Gender),2)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>% 
   arrange(desc(n)) %>%
  filter(n>15) %>% 
  ggplot(aes(x=reorder(surname,Female_percentage),y=Female_percentage))+geom_bar(stat="identity",fill="blue")+
  geom_hline(yintercept =round(100*mean(upm$Gender),2) ,color="red",linetype="dashed")+
  # label is computed from the data instead of hard-coded, so it matches the dashed line
  annotate("text",x=2,y=round(100*mean(upm$Gender),2),
           label=paste0("Average Female \n Representation (",round(100*mean(upm$Gender),1),"%)"))+
  coord_flip()+
  xlab("Surname")

We see that Baniyas (GUPTA), some Brahmin surnames and some South Indian communities (represented by single-letter surnames like S, C, A, B, J, K) have well above average female representation.

Let’s look inside communities. First, the General category.

  upm   %>% filter(caste=="General") %>% group_by(surname) %>% 
  summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>% 
   arrange(desc(n)) %>%
  filter(n>15) %>% 
  ggplot(aes(x=reorder(surname,n),y=n))+geom_bar(stat="identity",fill="orange")+
  
  coord_flip()+
  xlab("Surname")

Let’s look at the ST community.

  upm   %>% filter(caste=="ST") %>% group_by(surname) %>% 
  summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>% 
   arrange(desc(n)) %>%
  filter(n>4) %>% 
  ggplot(aes(x=reorder(surname,n),y=n))+geom_bar(stat="identity",fill="orange")+
  
  coord_flip()+
  xlab("Surname")

What is clearly evident here is that the MEENA group is an anomaly in this population, a sort of super-elite which is cornering a disproportionate share of the benefits reserved for ST.

Almost 37% of ST seats are bagged by the MEENA community, which forms only 5-6% of the ST population.
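A within-category share like this can be computed by restricting to one category and tabulating surnames. The sketch below uses a tiny made-up data frame, not the real list, so the resulting percentages are illustrative only.

```r
# Toy data: category and surname for eight hypothetical candidates
toy <- data.frame(
  caste   = c("ST", "ST", "ST", "ST", "ST", "OBC", "OBC", "OBC"),
  surname = c("MEENA", "MEENA", "NEGI", "MEENA", "BHAGAT", "YADAV", "PATEL", "YADAV")
)

# Keep one category, then turn surname counts into percentages of that category
st <- toy[toy$caste == "ST", ]
share <- round(100 * prop.table(table(st$surname)), 1)
share <- sort(share, decreasing = TRUE)

print(share)
```

Running the same steps on upm with caste == "ST" would give the share of ST seats held by each surname, which can then be set against that community's share of the ST population.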

Let’s look at OBC community

  upm   %>% filter(caste=="OBC") %>% group_by(surname) %>% 
  summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>% 
   arrange(desc(n)) %>%
  filter(n>5) %>% 
  ggplot(aes(x=reorder(surname,n),y=n))+geom_bar(stat="identity",fill="orange")+
  
  coord_flip()+
  xlab("Surname")

We see that apart from the generic KUMAR and SINGH surnames, Yadavs and Patels from North India capture a lot of OBC seats. The real story here, however, is that many South Indian candidates (with single-letter surnames) are dominant in the OBC category.

Let’s examine the SC community.

  upm   %>% filter(caste=="SC") %>% group_by(surname) %>% 
  summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>% 
   arrange(desc(n)) %>%
  filter(n>5) %>% 
  ggplot(aes(x=reorder(surname,n),y=n))+geom_bar(stat="identity",fill="orange")+
  
  coord_flip()+
  xlab("Surname")

We see that while most SC candidates use KUMAR, SINGH and even VERMA as surnames, many candidates with South Indian surnames also take these benefits.

Thus, on the whole, Northern communities predominate in the General category, while in the OBC and SC groups South Indian surnames are well represented. In the ST group the MEENA community captures a disproportionate share of the seats.

Here is a list of summary statistics by surnames

upm %>% group_by(surname) %>% 
  # funs() is deprecated in dplyr; a named list gives the same *_mean and *_n columns
  summarize_at(vars(Total,Rank,Interview,Written,Gender),list(mean=mean,n=length)) %>% arrange(desc(Total_n)) %>% 
  rename(count=Total_n) %>% 
  select(-Rank_n,-Interview_n,-Written_n,-Gender_n) %>% 
  filter(count>4) %>% 
  arrange(Rank_mean) %>% mutate(Female_percentage = round(100*Gender_mean,2)) %>% 
  mutate(proportion=round(100*(count/sum(count)),2)) %>% 
  mutate(Female_total = round(Female_percentage*count/100,0)) %>% 
    select(-Gender_mean) %>% 
   #arrange(desc(Female_total))
 arrange(desc(proportion)) %>% 
head(n=15)
## # A tibble: 15 x 9
##    surname Total_mean Rank_mean Interview_mean Written_mean count
##    <chr>        <dbl>     <dbl>          <dbl>        <dbl> <int>
##  1 KUMAR         902.      615.           166.         737.   325
##  2 SINGH         897.      588.           169.         728.   290
##  3 MEENA         875.      883.           160.         715.   140
##  4 SHARMA        908.      409.           177.         731.   117
##  5 YADAV         896.      636.           165.         731.   114
##  6 S             902.      573.           174.         728.   113
##  7 GUPTA         928.      372.           171.         757.    83
##  8 JAIN          942.      329.           171.         772.    78
##  9 R             897.      617.           169.         728.    64
## 10 MISHRA        934.      364.           176.         758.    63
## 11 K             887.      608.           168.         719.    63
## 12 VERMA         907.      615.           168.         739.    53
## 13 GARG          930.      306.           172.         758.    46
## 14 PANDEY        921.      365.           176.         745.    43
## 15 P             874.      604.           174.         700.    41
## # ... with 3 more variables: Female_percentage <dbl>, proportion <dbl>,
## #   Female_total <dbl>

Repeater analysis

Let’s look at names which repeat across years. While it is true that different people with the same name can get selected, it is less probable that this happens in consecutive years. In any case, with limited resources, this is a slight overestimate of the repeater count.

 repeater= upm %>% group_by(NAME) %>% summarize(n=n()) %>% filter(n>1) %>%  pull(NAME) 
round(100*(length(repeater)/length(upm$NAME)),2)
## [1] 16.36

Thus the proportion of repeaters is around 16.36% (distinct repeated names as a share of all entries in the list).

So what percentage of repeaters eventually end up getting a top 100 rank? Let’s calculate.

 top_ranker =upm %>% filter(NAME %in% repeater) %>% filter(Rank<100) %>% pull(NAME) 

So the percentage of repeaters who end up as top rankers is

round(100*(length(top_ranker)/length(repeater)),2)
## [1] 16.48

or around 16.5%.

Thus around 16% of selected candidates appear in the list again, and only about 16% of these determined ones end up with a top 100 rank. Who said UPSC was an easy exam! It takes a lot to be a Babu.. :-)

Thus, despite adding 2017 data, the analysis remains mostly unchanged. In 2017 females had a massive jump in interview marks, while Muslims still scored higher than non-Muslims, though the difference was not statistically significant for that year alone; in the analysis across years the trend still holds up.

The overall point remains: Muslims and females are under-represented in UPSC. But rather than inflating marks through covert processes, UPSC should be open about it, as its own data shows it and it will be found out sooner rather than later.

The MEENA super-elite is cornering a disproportionate share of the benefits in the ST community, and a correction is required from NCBC (sub-categorization of ST seats without interfering with the quota) rather than keeping quiet due to political correctness.
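The repeater percentage depends on the denominator. The calculation above divides the number of distinct repeated names by the number of rows in the list; dividing by the number of distinct names gives a somewhat larger figure. The toy example below (made-up names, base R only) shows both versions side by side.

```r
# Toy list: eight entries over five years, two names appear twice
toy <- data.frame(
  NAME = c("A KUMAR", "B SINGH", "A KUMAR", "C MEENA",
           "D JAIN", "B SINGH", "E RAO", "F GARG"),
  year = c(2013, 2013, 2014, 2014, 2015, 2015, 2016, 2017)
)

name_counts <- table(toy$NAME)
repeater <- names(name_counts[name_counts > 1])   # names seen more than once

# Same denominator as in the post: all rows of the list
share_of_rows <- 100 * length(repeater) / nrow(toy)

# Alternative denominator: distinct names, i.e. distinct people if names are unique
share_of_people <- 100 * length(repeater) / length(name_counts)

print(round(c(rows = share_of_rows, people = share_of_people), 2))
```

Here the first version gives 25% and the second 33.33%. Neither handles two different people sharing a name, which is the overestimation mentioned earlier.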

Key Takeaways

1. Being female or Muslim confers an advantage in the interview which is minor (about 5 marks), since the whole interview carries 275 marks. However, Muslim candidates show a larger advantage (about 8 to 9 marks) at top ranks, so there is some evidence of deliberate grade inflation.

2. General candidates get better interview marks than OBC, SC and ST candidates; however, this advantage largely disappears at top ranks.

3. Baniyas (Gupta, Agarwal) in particular get the best mean ranks and have the highest female representation, almost double the national average; they are arguably one of the most developed communities in India.

4. Female representation is higher in SC than in OBC.

5. The MEENA community captures a disproportionate share of the seats in the ST category, which is a cause of concern.

6. Written marks are inversely correlated with interview marks. A person with 100 more written marks than average is likely to get about 15 fewer marks in the interview.

7. Muslims and females are under-represented in the UPSC list. However, in my personal opinion covert grade inflation is not the solution, as it will backfire; we need an open and transparent policy to increase their representation (e.g. opening more schools in Muslim-dominated areas rather than a specific quota).

8. South Indian communities are better represented in the OBC and SC lists (understandable, since despite advanced demographics on almost every parameter, few people let go of the advantage of quotas); however, in the General list North Indian communities dominate.