Some time back, the UPSC 2016 results were declared and, amid the usual hype and hoopla, a blog post by the Twitter handle yugaparivatan claimed that higher interview marks were awarded to Muslims. I was intrigued at the time and decided to analyse the whole dataset later.

I had some spare time this weekend, so I analysed the whole thing along with other interesting factors such as gender, religion, year-wise and surname-wise variation.

Key issues :

  1. Are girls and Muslims given higher marks in the interview? Does this trend hold across the years?
  2. Do the various categories differ in marks awarded?
  3. What percentage of selected candidates are girls, in total and community-wise?
  4. Which surnames/castes dominate the UPSC selected candidate list?
  5. Have some castes/surnames become super-privileged within specific categories?
  6. What is the trend of interview and written marks across the years?

Data Extraction

The UPSC lists of selected candidates for 2013, 2014, 2015 and 2016, with categories, were collected from the Internet. The 2015 list was available only as a PDF, so it was converted back to text and then to CSV. Surnames were extracted with regex. The code for extraction and the CSV file of data can be found here and here. Gender and religion were coded by hand after going through the data.
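The post does not show the regex itself. As a rough sketch, assuming the surname is simply the last whitespace-separated token of the name (the helper `extract_surname` and the sample names are hypothetical, not the code linked above), the extraction could look like this:

```r
# Hypothetical helper: treat the last whitespace-separated token as the surname.
extract_surname <- function(name) {
  # Collapse repeated whitespace, trim the ends and upper-case the name
  cleaned <- toupper(trimws(gsub("\\s+", " ", name)))
  # Drop everything up to and including the last space
  sub("^.* ", "", cleaned)
}

# Made-up sample names, only for illustration
candidates <- c("Anil Kumar", "  priya   sharma ", "Ramesh S", "MEENA")
surnames <- extract_surname(candidates)
print(surnames)
```

Under this rule, names that end in an initial (common in South Indian naming) yield single-letter "surnames" such as S, K or R, which is worth keeping in mind when reading the surname tables later.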

Distribution of Marks

Table of year-wise marks across categories

First we will look at the average marks in each caste category and the percentage of Muslims and females in each.

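library(dplyr)    # %>%, group_by(), summarise()
library(ggplot2)  # used for the plots further down
library(knitr)    # kable()

# Per-category averages; Muslim and Gender are 0/1 flags (Gender: 1 = female)
a=upm  %>% group_by(caste) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0),
                                       Muslims_percentage=round(100*mean(Muslim),2),
                   Females_percentage=round(100*mean(Gender),2)) 

knitr::kable(a)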
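| caste   |    n | marks | Written | Interview | Rank | Muslims_percentage | Females_percentage |
|:--------|-----:|------:|--------:|----------:|-----:|-------------------:|-------------------:|
| General | 2106 |   909 |     734 |       175 |  352 |               3.04 |              20.32 |
| OBC     | 1342 |   876 |     709 |       167 |  669 |               4.25 |              10.28 |
| SC      |  720 |   853 |     688 |       165 |  831 |               0.00 |              14.58 |
| ST      |  367 |   842 |     675 |       167 |  934 |               3.54 |              13.62 |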

Thus the percentage of Muslims among UPSC selected candidates is between 3% and 4.25% outside the SC category. Interestingly, the share of Muslims is higher in OBC (4.25%) than in General (3.04%). As expected, the Muslim percentage in SC is zero, since SC status is not available to Muslims, while ST classification depends on region rather than religion, hence 3.54% Muslims there as well.

What is obvious here is that, despite protestations to the contrary, Muslims are under-represented in the UPSC pool.

Female representation is best in the General group at about 20%, while it is lowest in OBC at 10.28%, even lower than in the SC group (14.58%), which is surprising to say the least.

Written marks follow the order General > OBC > SC > ST. Interview marks follow the same order, except that ST candidates (167) score higher than SC candidates (165) and level with OBC (we shall examine this in a moment).

General category candidates have better ranks, but remember that there is quota-wise representation in each service and a separate list for each category.
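The category percentages above can be combined into overall figures by weighting each category by its number of candidates. The sketch below redoes that arithmetic from the rounded table values only (so it can differ slightly from a calculation on the raw data); the variable names are illustrative.

```r
# Values copied from the category table above
n          <- c(General = 2106, OBC = 1342, SC = 720, ST = 367)
female_pct <- c(General = 20.32, OBC = 10.28, SC = 14.58, ST = 13.62)
muslim_pct <- c(General = 3.04, OBC = 4.25, SC = 0.00, ST = 3.54)

total_candidates <- sum(n)

# Weighted means: each category counts in proportion to its size
overall_female <- sum(n * female_pct) / total_candidates
overall_muslim <- sum(n * muslim_pct) / total_candidates

print(total_candidates)          # 4535
print(round(overall_female, 1))  # 15.9
print(round(overall_muslim, 2))  # 2.96
```

These overall averages are what the dashed red reference lines in the representation plots further down are meant to show.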

Let’s see the whole data year-wise.

b=upm  %>%  group_by(caste,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0),
                                       Muslims_percentage=round(100*mean(Muslim),2),
                   Females_percentage=round(100*mean(Gender),2)) 
knitr::kable(b)
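| caste   | year |   n | marks | Written | Interview | Rank | Muslims_percentage | Females_percentage |
|:--------|-----:|----:|------:|--------:|----------:|-----:|-------------------:|-------------------:|
| General | 2013 | 517 |   809 |     629 |       180 |  347 |               3.09 |              20.89 |
| General | 2014 | 590 |   917 |     739 |       177 |  390 |               2.03 |              22.20 |
| General | 2015 | 499 |   900 |     729 |       171 |  328 |               3.01 |              16.63 |
| General | 2016 | 500 |  1011 |     841 |       170 |  338 |               4.20 |              21.20 |
| OBC     | 2013 | 327 |   775 |     606 |       169 |  643 |               3.98 |              10.09 |
| OBC     | 2014 | 354 |   879 |     708 |       171 |  746 |               2.82 |              10.17 |
| OBC     | 2015 | 314 |   864 |     698 |       165 |  634 |               5.10 |               6.37 |
| OBC     | 2016 | 347 |   980 |     816 |       164 |  646 |               5.19 |              14.12 |
| SC      | 2013 | 187 |   753 |     586 |       167 |  831 |               0.00 |              16.58 |
| SC      | 2014 | 194 |   864 |     697 |       167 |  884 |               0.00 |              11.86 |
| SC      | 2015 | 176 |   844 |     681 |       163 |  805 |               0.00 |              13.64 |
| SC      | 2016 | 163 |   965 |     801 |       163 |  795 |               0.00 |              16.56 |
| ST      | 2013 |  91 |   740 |     572 |       168 |  934 |               4.40 |              12.09 |
| ST      | 2014 |  98 |   846 |     676 |       170 | 1012 |               2.04 |              14.29 |
| ST      | 2015 |  89 |   834 |     672 |       162 |  867 |               3.37 |              14.61 |
| ST      | 2016 |  89 |   949 |     783 |       166 |  918 |               4.49 |              13.48 |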

year-wise gender and religious representation

Let’s Plot Muslim representation community wise

ggplot(b,aes(x=year,y=Muslims_percentage,color=caste,group=caste))+geom_line()+
  geom_hline(yintercept =round(100*mean(upm$Muslim),2) ,color="red",linetype="dashed")+
  annotate("text",x=2015.5,y=round(100*mean(upm$Muslim),2),label="Average Muslim \n Representation %")+
  labs(color="Category")

Let’s Plot Female Representation Community wise

ggplot(b,aes(x=year,y=Females_percentage,color=caste,group=caste))+geom_line()+
   
   geom_hline(yintercept =round(100*mean(upm$Gender),2) ,color="red",linetype="dashed")+
  
annotate("text",x=2013.5,y=round(100*mean(upm$Gender),2),label="Average Female \n Representation %")+
    labs(color="Category")

So we saw that OBCs have poor female representation and slightly higher Muslim representation than the General category.

Simpson’s Paradox

Written Marks

Let’s look at the histogram of all Written marks across the years.

ggplot(upm,aes(x=Written,y=..density..))+geom_histogram(fill="orange")+
  geom_density(color="black")

It appears fairly symmetrical. However, we know that caste-wise differences exist, so let’s plot caste-wise histograms.

ggplot(upm,aes(x=Written,y=..density..))+geom_histogram(fill="orange")+
  geom_density(color="black")+facet_wrap(~caste)

So the fairly symmetrical curve becomes almost trimodal, with three peaks. What could be the cause? Let’s investigate further.

ggplot(upm,aes(x=Written,y=..density..))+geom_histogram(fill="orange")+
  geom_density(color="black")+facet_wrap(caste~year)

So once we plot by year, the plots become symmetric again. How is that?

  1. Marks differ considerably by year: 2013 was lowest, 2014 and 2015 sit in the middle and 2016 was highest, so pooling the years produces the three peaks.
  2. Within caste groups, the General marks distribution is shifted to the right of SC/ST/OBC.

Thus we saw how an innocuous-looking histogram hid within itself many sub-populations, one of the reasons a single measure of central tendency should never be accepted until we see a graphical representation of the data.

This effect of pooling heterogeneous groups is closely related to Simpson’s paradox.
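To see how pooling alone can change the picture, here is a small simulation. It is only a sketch: the marks are simulated, with yearly means loosely based on the year-wise table above and an assumed within-year standard deviation of 30.

```r
set.seed(1)

# Simulated written marks: same spread every year, but the mean shifts by year
sim <- data.frame(
  year    = rep(c(2013, 2014, 2016), each = 500),
  Written = c(rnorm(500, mean = 630, sd = 30),
              rnorm(500, mean = 740, sd = 30),
              rnorm(500, mean = 840, sd = 30))
)

# Spread of the pooled data versus spread within each year
pooled_sd  <- sd(sim$Written)
within_sd  <- tapply(sim$Written, sim$year, sd)
year_means <- tapply(sim$Written, sim$year, mean)

print(round(year_means))
print(round(within_sd, 1))
print(round(pooled_sd, 1))
```

The pooled standard deviation comes out far larger than any within-year value, because most of the pooled variation is the shift between years rather than variation between candidates. A histogram of `sim$Written` shows separate peaks that vanish once the data are split by year.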

Interview Marks

Let’s plot Interview marks

ggplot(upm,aes(x=Interview,y=..density..))+geom_histogram(fill="orange")+
  geom_density(color="black")

ggplot(upm,aes(x=Interview,y=..density..))+geom_histogram(fill="orange")+
  geom_density(color="black")+facet_wrap(~caste)

ggplot(upm,aes(x=Interview,y=..density..))+geom_histogram(fill="orange")+
  geom_density(color="black")+facet_wrap(caste~year)

The histograms and density plots give a lot of insight into the distribution of marks.

Time-series

Category-wise Variation

Let’s Plot Interview marks community wise

ggplot(b,aes(x=year,y=Interview,color=caste,group=caste))+geom_line()+
  
 geom_hline(yintercept =round(mean(upm$Interview),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Interview),2),label="Average Interview Marks")+
labs(color="Category")

We see that Interview marks have fallen in the General category (from 180 in 2013 to 170 in 2016), while in the ST category they recovered in 2016 (162 to 166) after a dip in 2015. The overall change is downward.

ggplot(b,aes(x=year,y=Written,color=caste,group=caste))+geom_line()+
  
  
 geom_hline(yintercept =round(mean(upm$Written),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Written),2),label="Average Written Marks")+
labs(color="Category")

Average Written marks rose in all categories between 2013 and 2016 (with a small dip in 2015), keeping the usual order General > OBC > SC > ST.

Religion-wise variation

upm  %>% mutate(Muslim=ifelse(Muslim==1,"Muslim","Non-Muslim")) %>%  group_by(Muslim,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>% 

ggplot(aes(x=year,y=Interview,color=factor(Muslim),group=factor(Muslim)))+geom_line()+
  
 geom_hline(yintercept =round(mean(upm$Interview),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Interview),2),label="Average Interview Marks")+
labs(color="Category")

We see that Interview marks go down for both Muslims and Non-Muslims; however, Muslims score higher on average (by around 5 marks).

Some people alleged that interview marks were deliberately inflated in the top ranks (ranks below 500), so let’s see the trend among top rankers.

upm  %>% filter(Rank<500) %>% mutate(Muslim=ifelse(Muslim==1,"Muslim","Non-Muslim")) %>%  group_by(Muslim,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>% 

ggplot(aes(x=year,y=Interview,color=factor(Muslim),group=factor(Muslim)))+geom_line()+
  
 geom_hline(yintercept =round(mean(upm$Interview),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Interview),2),label="Average Interview Marks")+
labs(color="Category")

In this graph we can clearly see that while marks have been going down in both groups, in 2016 there is an uptick for Muslim candidates (almost 13 marks higher among the top 500 candidates), and that led to higher selection of Muslims in the top ranks. This might be a one-off event, but in 2016, while the interview marks of Muslim candidates went down on average, in the top group they went up against the general trend.

Let’s now examine Written marks

upm  %>% mutate(Muslim=ifelse(Muslim==1,"Muslim","Non-Muslim")) %>%  group_by(Muslim,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>% 

ggplot(aes(x=year,y=Written,color=factor(Muslim),group=factor(Muslim)))+geom_line()+
  
 geom_hline(yintercept =round(mean(upm$Written),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Written),2),label="Average Written Marks")+
labs(color="Category")

In this graph we clearly see that there is almost no difference between Muslim and Non-Muslim candidates in the UPSC written test.

Let’s examine the top candidates with rank under 500.

upm  %>%filter(Rank<500) %>%  mutate(Muslim=ifelse(Muslim==1,"Muslim","Non-Muslim")) %>%  group_by(Muslim,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>% 

ggplot(aes(x=year,y=Written,color=factor(Muslim),group=factor(Muslim)))+geom_line()+
  
 geom_hline(yintercept =round(mean(upm$Written),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Written),2),label="Average Written Marks")+
labs(color="Category")

Again, there is no difference in Written scores here (as the mark sheets are blinded), and hence the better performance in the top group was driven entirely by better performance in the interview.

Gender-wise Variation

Let’s see the corresponding graphs for females, starting with Written marks.

upm  %>% mutate(Gender=ifelse(Gender==1,"Female","Male")) %>%  group_by(Gender,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>% 

ggplot(aes(x=year,y=Written,color=factor(Gender),group=factor(Gender)))+geom_line()+
  
 geom_hline(yintercept =round(mean(upm$Written),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Written),2),label="Average Written Marks")+
labs(color="Category")

We see almost no difference here as opposed to Interview marks

Now in top candidates

upm  %>% filter(Rank<500) %>%  mutate(Gender=ifelse(Gender==1,"Female","Male")) %>%  group_by(Gender,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>% 

ggplot(aes(x=year,y=Written,color=factor(Gender),group=factor(Gender)))+geom_line()+
  
 geom_hline(yintercept =round(mean(upm$Written),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Written),2),label="Average Written Marks")+
labs(color="Category")

Here as well, there is no difference.

Let’s see Interview Scores Now

upm  %>% mutate(Gender=ifelse(Gender==1,"Female","Male")) %>%  group_by(Gender,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>% 

ggplot(aes(x=year,y=Interview,color=factor(Gender),group=factor(Gender)))+geom_line()+
  
 geom_hline(yintercept =round(mean(upm$Interview),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Interview),2),label="Average Interview Marks")+
labs(color="Category")

It appears that being female confers an advantage of almost 5 marks in the interview.

Let’s analyse it in top rankers

upm  %>%filter(Rank<500) %>%  mutate(Gender=ifelse(Gender==1,"Female","Male")) %>%  group_by(Gender,year) %>% summarise(n=n(),marks=round(mean(Total),0),Written=round(mean(Written),0),
                                       Interview=round(mean(Interview),0),Rank=round(mean(Rank),0)) %>% 

ggplot(aes(x=year,y=Interview,color=factor(Gender),group=factor(Gender)))+geom_line()+
  
 geom_hline(yintercept =round(mean(upm$Interview),2) ,color="red",linetype="dashed")+
annotate("text",x=2013.5,y=round(mean(upm$Interview),2),label="Average Interview Marks")+
labs(color="Category")

Among top rankers the difference is smaller, and even that closed in 2016 (cf. the performance of Muslim candidates). Let’s examine this association statistically.

Correlation between Interview and Written marks

First let’s look at relationship between Interview and Written marks

 upm %>% ggplot(aes(x=Interview,y=Written))+
  geom_jitter(color="red")+
   stat_smooth(method="lm",se=FALSE)

We see that there is a predominantly negative relationship. Does this relationship hold across years? Let’s see.

 upm %>% ggplot(aes(x=Interview,y=Written))+
  geom_jitter(color="red")+
   stat_smooth(method="lm",se=FALSE)+
  facet_wrap(~year)

We can see the relationship holds, though its strength varies. This is plausible, as those who do very well in the Written exam might be introverted and perform worse in the Interview.

Let’s see if this relationship holds in females as well.

 upm %>% mutate(Gender=ifelse(Gender==1,"Female","Male")) %>% ggplot(aes(x=Interview,y=Written,color=Gender))+
  geom_jitter()+
   stat_smooth(method="lm",se=FALSE)+
  facet_wrap(~year)

It does, across years, though females tend to score higher.

Let’s look at it in Muslims

 upm %>% mutate(Muslim=ifelse(Muslim==1,"Muslim","Non-Muslim")) %>% ggplot(aes(x=Interview,y=Written,color=Muslim))+
  geom_jitter()+
   stat_smooth(method="lm",se=FALSE)+
  facet_wrap(~year)

After adjusting for Written marks, Muslims differ only slightly from Non-Muslims. However, in 2016 there is a clear divergence at the higher end of the marks range, indicating higher marks for Muslim candidates.

Statistical Model

Let’s now do a formal statistical test. Let’s consider Interview marks as the dependent variable, predicted by an individual’s caste, gender and religion (in which there are obvious trends), along with written marks and year. Obviously the model is flawed, as all models are, since it has no measure of the innate ability of the participant, but it helps us control for gender, caste and religion differences between candidates as well as yearly variation. I have excluded the year 2013, as there was a significant shift of about 100 points in written marks there.
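The model below uses a variable `Written.c` whose construction is not shown in the post. The name suggests centred written marks; a minimal sketch, assuming it is `Written` minus the overall mean (it may equally have been centred within each year), on a toy data frame standing in for `upm`:

```r
# Toy stand-in for the upm data frame (made-up values)
toy <- data.frame(
  year    = c(2014, 2014, 2015, 2015),
  Written = c(700, 740, 690, 730)
)

# Assumption: Written.c is Written centred on its overall mean
toy$Written.c <- toy$Written - mean(toy$Written)

print(toy)
```

With a centred predictor, the intercept of the regression is the expected interview mark for a baseline candidate with average written marks, rather than for a candidate with zero written marks.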

f= lm(Interview~Written.c+Gender+caste+Muslim+factor(year),data=filter(upm,year>2013))
summary(f,digits=2)
## 
## Call:
## lm(formula = Interview ~ Written.c + Gender + caste + Muslim + 
##     factor(year), data = filter(upm, year > 2013))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -88.251 -11.024   0.173  11.299  52.918 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      179.155691   0.618955 289.449  < 2e-16 ***
## Written.c         -0.126074   0.008255 -15.272  < 2e-16 ***
## Gender             3.770246   0.794961   4.743 2.20e-06 ***
## casteOBC          -9.704373   0.718291 -13.510  < 2e-16 ***
## casteSC          -13.807427   0.914435 -15.099  < 2e-16 ***
## casteST          -14.421253   1.198486 -12.033  < 2e-16 ***
## Muslim             5.984816   1.701129   3.518  0.00044 ***
## factor(year)2015  -7.046271   0.704688  -9.999  < 2e-16 ***
## factor(year)2016   6.922696   1.108052   6.248 4.68e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.76 on 3404 degrees of freedom
## Multiple R-squared:  0.1344, Adjusted R-squared:  0.1324 
## F-statistic: 66.06 on 8 and 3404 DF,  p-value: < 2.2e-16

The model says that a General category, male, non-Muslim candidate in 2014 (the base year), with Written.c equal to zero, scored about 179 in the interview. Every 10-mark increase in written marks lowers the interview marks by about 1.3. Being female raises the mark by about 4. OBC, SC and ST candidates score around 10, 14 and 14 marks lower than General candidates, while being a Muslim adds around 6 marks to the interview score. Relative to 2014, interview marks were about 7 lower in 2015 and about 7 higher in 2016 after adjustment; 2013, a low-scoring year, is excluded.
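As a worked example of reading the coefficients, the sketch below plugs the estimates from the output above into the linear predictor for one hypothetical candidate: a female, Muslim, OBC candidate in 2015 whose `Written.c` is +10. The candidate profile is made up purely for illustration.

```r
# Estimates copied from the regression output above
coefs <- c(intercept = 179.155691, written_c = -0.126074, female = 3.770246,
           obc = -9.704373, muslim = 5.984816, year2015 = -7.046271)

# Linear predictor: intercept plus each applicable term
predicted <- coefs[["intercept"]] +
  coefs[["written_c"]] * 10 +   # Written.c = +10
  coefs[["female"]] +
  coefs[["obc"]] +
  coefs[["muslim"]] +
  coefs[["year2015"]]

print(round(predicted, 1))  # 170.9
```

The same candidate with `Written.c` of zero would be predicted about 1.3 marks higher, which is the "10 written marks cost about 1.3 interview marks" reading of the `Written.c` coefficient.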

However, since we don’t have other measures of the innate ability of students, the predictive power of the model is poor (R-squared of about 0.13) despite controlling for confounders.

Let’s do the same analysis for top rankers.

g= lm(Interview~Written.c+Gender+caste+Muslim+factor(year),data=filter(upm,year>2013,Rank<500))
summary(g)
## 
## Call:
## lm(formula = Interview ~ Written.c + Gender + caste + Muslim + 
##     factor(year), data = filter(upm, year > 2013, Rank < 500))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.652  -9.895  -0.884   9.960  60.085 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      190.60196    0.93964 202.845  < 2e-16 ***
## Written.c         -0.27833    0.01455 -19.126  < 2e-16 ***
## Gender             2.06934    0.98705   2.096   0.0362 *  
## casteOBC          -2.41066    1.09727  -2.197   0.0282 *  
## casteSC           -3.54604    1.96226  -1.807   0.0709 .  
## casteST           -2.57364    3.81958  -0.674   0.5005    
## Muslim             9.76134    2.37969   4.102 4.32e-05 ***
## factor(year)2015 -11.45810    0.99103 -11.562  < 2e-16 ***
## factor(year)2016  17.36783    1.70377  10.194  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.16 on 1488 degrees of freedom
## Multiple R-squared:  0.2482, Adjusted R-squared:  0.2442 
## F-statistic: 61.41 on 8 and 1488 DF,  p-value: < 2.2e-16

Among top rankers there is a much smaller difference in interview marks between General and OBC/SC/ST candidates, and the female advantage also shrinks. Interestingly, however, the advantage of being a Muslim is amplified among top rankers: it is worth up to 10 marks even after controlling for the other variables, which points to some evidence of deliberate grade inflation in interview marks based on religion.

Let’s run this analysis without the controversial year 2016.

g= lm(Interview~Written.c+Gender+caste+Muslim+factor(year),data=filter(upm,year<2016,Rank<500))
summary(g)
## 
## Call:
## lm(formula = Interview ~ Written.c + Gender + caste + Muslim + 
##     factor(year), data = filter(upm, year < 2016, Rank < 500))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.852 -11.569  -0.323  10.890  65.527 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      157.18403    1.38120 113.802  < 2e-16 ***
## Written.c         -0.30602    0.01471 -20.803  < 2e-16 ***
## Gender             2.90094    1.10309   2.630  0.00863 ** 
## casteOBC          -3.68990    1.19525  -3.087  0.00206 ** 
## casteSC           -4.28438    2.16018  -1.983  0.04751 *  
## casteST           -4.41615    4.23344  -1.043  0.29704    
## Muslim             8.81479    2.83858   3.105  0.00194 ** 
## factor(year)2014  34.60859    1.99323  17.363  < 2e-16 ***
## factor(year)2015  22.73994    1.79319  12.681  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.8 on 1488 degrees of freedom
## Multiple R-squared:  0.257,  Adjusted R-squared:  0.253 
## F-statistic: 64.35 on 8 and 1488 DF,  p-value: < 2.2e-16

Even when we exclude 2016 (as a one-off event), being a Muslim is associated with an advantage of about 9 marks in the interview among top rankers (estimate 8.8, p = 0.002). It is the largest of the caste, gender and religion coefficients, leaving aside written marks and the systematic yearly variation.

Note

We must also note that although there is category-wise (caste) variation in Interview marks, the same difference is seen in Written marks as well. In the case of female and Muslim candidates, however, those with similar written marks perform better in the Interview, indicating potential bias. Further, while the interview advantage for females diminishes at top ranks (<500), it increases for Muslim candidates. It should also be noted that female and Muslim candidates are vastly under-represented relative to their population levels. One point of this analysis is that if these candidates are being favoured, and the analysis suggests they are, then the whole process should be open.

Let’s do some fun analysis now.

Surname Analysis

I extracted the last names of candidates with regex. Now let’s see which surnames dominate the UPSC list.

 upm  %>% group_by(surname) %>% 
  summarise(n=n()) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>% 
   arrange(desc(n)) %>%head(n=15)
## # A tibble: 15 x 3
##    surname     n proportion
##    <chr>   <int>      <dbl>
##  1 KUMAR     269      5.93 
##  2 SINGH     239      5.27 
##  3 MEENA     113      2.49 
##  4 SHARMA     97      2.14 
##  5 S          95      2.09 
##  6 YADAV      91      2.01 
##  7 GUPTA      69      1.52 
##  8 JAIN       58      1.28 
##  9 K          58      1.28 
## 10 R          55      1.21 
## 11 MISHRA     51      1.12 
## 12 VERMA      45      0.990
## 13 GARG       38      0.840
## 14 P          38      0.840
## 15 PANDEY     37      0.820

While generic surnames like KUMAR and SINGH dominate, what is surprising is that the MEENA community, which falls under ST, has 113 seats, more than communities like SHARMA and GUPTA.

Let’s plot it

 upm   %>% group_by(surname) %>% 
  summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>% 
   arrange(desc(n)) %>%
  filter(n>15) %>% 
  ggplot(aes(x=reorder(surname,n),y=n))+geom_bar(stat="identity",fill="blue")+
  
  coord_flip()+
  xlab("Surname")

Let’s look at which surnames have best ranks

 upm   %>% group_by(surname) %>% 
  summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>% 
   arrange(desc(n)) %>%
  filter(n>15) %>% 
  ggplot(aes(x=reorder(surname,desc(Rank)),y=Rank))+geom_bar(stat="identity",fill="blue")+
  coord_flip()+
  xlab("Surname")

We see that the best ranks are secured by Brahmin and Bania communities, even though absolute numbers vary.

Which community/surname (except KUMARI) has the best female representation? Let’s see.

 upm   %>% group_by(surname) %>% 
  summarise(n=n(),Rank=mean(Rank),Female_percentage=round(100*mean(Gender),2)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>% 
   arrange(desc(n)) %>%
  filter(n>15) %>% 
  ggplot(aes(x=reorder(surname,Female_percentage),y=Female_percentage))+geom_bar(stat="identity",fill="blue")+
     geom_hline(yintercept =round(100*mean(upm$Gender),2) ,color="red",linetype="dashed")+
  annotate("text",x=2,y=round(100*mean(upm$Gender),2),label="Average Female \n Representation (15.9) %")+


  coord_flip()+
  xlab("Surname")

We see that Baniyas (GUPTA), some Brahmins and some South Indian communities (represented by single-letter surnames like S, C, A, B, J, K) have well above average female representation.

Let’s look inside communities, starting with the General category.

  upm   %>% filter(caste=="General") %>% group_by(surname) %>% 
  summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>% 
   arrange(desc(n)) %>%
  filter(n>15) %>% 
  ggplot(aes(x=reorder(surname,n),y=n))+geom_bar(stat="identity",fill="orange")+
  
  coord_flip()+
  xlab("Surname")

Let’s look at the ST community

  upm   %>% filter(caste=="ST") %>% group_by(surname) %>% 
  summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>% 
   arrange(desc(n)) %>%
  filter(n>4) %>% 
  ggplot(aes(x=reorder(surname,n),y=n))+geom_bar(stat="identity",fill="orange")+
  
  coord_flip()+
  xlab("Surname")

What is clearly evident here is that the MEENA group is an anomaly in this population, a sort of super-elite which is cornering most of the benefits reserved for ST.

Let’s look at OBC community

  upm   %>% filter(caste=="OBC") %>% group_by(surname) %>% 
  summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>% 
   arrange(desc(n)) %>%
  filter(n>5) %>% 
  ggplot(aes(x=reorder(surname,n),y=n))+geom_bar(stat="identity",fill="orange")+
  
  coord_flip()+
  xlab("Surname")

Apart from the generic KUMAR and SINGH surnames, Yadavs and Patels from North India capture a lot of OBC seats. However, the story here is that many South Indian candidates (with single-letter surnames) are dominant in the OBC category.

Let’s examine the SC community

  upm   %>% filter(caste=="SC") %>% group_by(surname) %>% 
  summarise(n=n(),Rank=mean(Rank)) %>% mutate(proportion=round(100*(n/sum(n)),2)) %>% 
   arrange(desc(n)) %>%
  filter(n>5) %>% 
  ggplot(aes(x=reorder(surname,n),y=n))+geom_bar(stat="identity",fill="orange")+
  
  coord_flip()+
  xlab("Surname")

We see that while most SC candidates use KUMAR, SINGH and even VERMA as surnames, many candidates with South Indian surnames also take these benefits.

Thus, on the whole, Northern communities predominate in the General category, while South Indian surnames are well represented in the OBC and SC groups. In the ST group the MEENA community captures most of the seats.

Here is a list of summary statistics by surnames

upm %>% group_by(surname) %>% 
  summarize_at(vars(Total,Rank,Interview,Written,Gender),funs(mean,n=n())) %>% arrange(desc(Total_n)) %>% 
  rename(count=Total_n) %>% 
  select(-Rank_n,-Interview_n,-Written_n,-Gender_n) %>% 
  filter(count>4) %>% 
  arrange(Rank_mean) %>% mutate(Female_percentage = round(100*Gender_mean,2)) %>% 
  mutate(proportion=round(100*(count/sum(count)),2)) %>% 
  mutate(Female_total = round(Female_percentage*count/100,0)) %>% 
    select(-Gender_mean) %>% 
   #arrange(desc(Female_total))
 arrange(desc(proportion)) %>% 
head(n=15)
## # A tibble: 15 x 9
##    surname Total_mean Rank_mean Interview_mean Written_mean count
##    <chr>        <dbl>     <dbl>          <dbl>        <dbl> <int>
##  1 KUMAR          881       636            166          715   269
##  2 SINGH          874       600            169          706   239
##  3 MEENA          852       909            160          692   113
##  4 SHARMA         884       428            178          705    97
##  5 S              883       573            174          709    95
##  6 YADAV          871       649            164          708    91
##  7 GUPTA          908       373            171          737    69
##  8 JAIN           911       345            172          739    58
##  9 K              877       615            170          707    58
## 10 R              881       619            170          711    55
## 11 MISHRA         911       385            178          733    51
## 12 VERMA          892       607            169          723    45
## 13 GARG           911       297            174          737    38
## 14 P              865       602            174          691    38
## 15 PANDEY         902       387            175          727    37
## # ... with 3 more variables: Female_percentage <dbl>, proportion <dbl>,
## #   Female_total <dbl>

Repeater analysis

Let’s look at names which repeat across years. People with the same name can of course be selected in different years, though it is less likely in consecutive years; with limited resources to separate them, this is a slight overestimate of the repeater count.

 repeater= upm %>% group_by(Name) %>% summarize(n=n()) %>% filter(n>1) %>%  pull(Name) 

The number of repeaters is 697. So the percentage of people who repeat, out of a total of 4535 candidates, is:

round(100*(length(repeater)/length(upm$Name)),2)
## [1] 15.37

Thus proportion of repeaters is around 15%

So what percentage of repeaters eventually end up getting a top 100 rank? Let’s calculate.

 top_ranker =upm %>% filter(Name %in% repeater) %>% filter(Rank<100) %>% pull(Name) 

The number of top rankers among these repeaters is 111.

So the percentage of repeaters who end up as top rankers is 111/697:

round(100*(length(top_ranker)/length(repeater)),2)
## [1] 15.93

or around 16%.

Thus around 15% of selected people end up repeating, and only 16% of these determined ones end up getting top 100 ranks. Who said UPSC was an easy exam! It takes a lot to be a Babu..:-)
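The repeater logic can be tried out on a toy table. The sketch below is a variation on the approach above rather than the same code: it counts a name as a repeater only if it appears in more than one distinct year, and it checks each repeater's best rank once so that nobody is counted twice. The names and ranks are made up; the last two lines simply redo the percentages from the counts reported above.

```r
# Made-up candidate list, only for illustration
toy <- data.frame(
  Name = c("A KUMAR", "B SINGH", "A KUMAR", "C MEENA", "B SINGH", "D JAIN"),
  year = c(2013, 2013, 2014, 2014, 2015, 2015),
  Rank = c(820, 450, 95, 900, 160, 300)
)

# A repeater is a name that appears in more than one distinct year
years_per_name <- tapply(toy$year, toy$Name, function(y) length(unique(y)))
repeaters <- names(years_per_name)[years_per_name > 1]

# Best (lowest) rank achieved by each name, then keep repeaters under rank 100
best_rank <- tapply(toy$Rank, toy$Name, min)
top_repeaters <- repeaters[best_rank[repeaters] < 100]

print(repeaters)
print(top_repeaters)

# Percentages recomputed from the counts reported in the post
pct_repeaters <- round(100 * 697 / 4535, 2)
pct_top       <- round(100 * 111 / 697, 2)
print(c(pct_repeaters, pct_top))
```

Matching on name alone still cannot tell two different people with the same name apart, so the caveat about overestimation applies to this variant as well.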

Key Takeaways

1. Being female or Muslim confers an advantage in the interview, which is minor (about 5 marks) since the whole interview carries 275 marks. However, Muslim candidates show a larger advantage (about 10 marks) at top ranks, so there is some evidence of deliberate grade inflation.

2. General candidates get better interview marks than OBC, SC and ST candidates, in that order; however, this advantage does not hold up in the top ranks.

3. Baniyas (Gupta, Agarwal) in particular get the best mean ranks and have the highest female representation, almost double the average; they are arguably one of the most developed communities in India.

4. Female representation in SC is higher than in OBC.

5. The MEENA community captures most of the seats in the ST category, which is a cause for concern.

6. Written marks are inversely correlated with Interview marks. A person with 100 more written marks than average is likely to get about 13 fewer marks in the interview.

7. Muslims and females are under-represented in the UPSC list. However, in my personal opinion covert grade inflation is not the solution, as it will backfire; we need an open and transparent policy to increase their representation (e.g. opening more schools in Muslim-dominated areas rather than a specific quota).

8. South Indian communities are better represented in the OBC and SC lists (understandable, since despite advanced demographics on almost every parameter few people let go of the advantage of quotas); however, in the General list North Indian communities dominate.