1 INTRODUCTION

As data analysts for a major university, we have been tasked by the Dean of the school with providing data-driven insights into the factors that influence whether a student passes the Uniform Bar Exam (UBE). Our study is limited to regression analysis that predicts bar exam passage from the provided variables. Whether the significant variables our model identifies are actionable by the law school is left to the discretion of the Dean and their leadership team. The dataset was provided to us in CSV format at the following link: https://raw.githubusercontent.com/tmatis12/datafiles/refs/heads/main/Updated_Bar_Data_For_Review_Final.csv

A pull of the historical data from the link above yielded 476 observations of 26 variables, including the response variable PassFail, which denotes whether the student passed the exam. The data cover 2021 to 2024. The variables are summarized and defined below:

1. LSAT: Score on the LSAT entrance examination

2. UGPA: Undergraduate GPA

3. Class: Year student entered law school

4. CivPro, LP1, LP2: Scores in 1L core courses

5. OneCum, FGPA: Cumulative GPA at end of 1L year and 3L year, respectively

6. Accom: Received accommodations from Student Disability Services (Yes/No)

7. Probation: Ever placed on academic probation (Yes/No)

8. LegalAnalysis, AdvLegalPerf, AdvLegalAnalysis: Enrollment in various elective courses (Yes/No)

9. BarPrep: Type of bar preparation course taken

10. PctBarPrepComplete: Percent of bar prep course completed

11. NumPrepWorkshops: Number of bar prep workshops attended

12. StudentSuccessInitiative: Participation in academic support program (Yes/No)

13. BarPrepMentor: Whether student had a mentor for bar prep (Yes/No)

14. MPRE, MPT, MEE, MBE: Component scores of the bar exam

15. UBE: Composite Uniform Bar Exam score

16. Pass: Final outcome – Did the student pass the bar exam? (Yes/No)

We did not use all of the variables above to build our model. We left out variables that we deemed irrelevant in practice to passing the bar exam, such as Age. The following section on assumptions explains the logic of our variable selection.

2 ASSUMPTIONS ON THE VARIABLES RETAINED FOR THE INITIAL MODEL BUILD

According to the National Conference of Bar Examiners (https://www.ncbex.org/exams/ube), the Uniform Bar Exam (UBE) is composed of the Multistate Essay Examination (MEE), two Multistate Performance Test (MPT) tasks, and the Multistate Bar Examination (MBE); the Multistate Professional Responsibility Examination (MPRE) is a separate exam. Passing or failing therefore depends on performance across these components, and the cutoff score changes every year. For this reason, we decided to regress the PassFail response on the predictor variables rather than model the UBE score:

• PassFail is our response variable and is treated as a categorical variable with two outcomes: pass or fail. Our model therefore assumes a Bernoulli distribution, which is a special case of the binomial distribution.

• CivPro, LP1 and LP2 are foundational courses in the law program, and one can reasonably assume that students who perform well in these classes will be well prepared for the Uniform Bar Exam. We have treated these as ordinal variables.

• We think that accommodations for students with disabilities help this segment of the student population prepare for the exam. Difficulties with mobility, hearing, typing, or other physical abilities related to learning can hinder students from studying properly. We are therefore interested in whether a student received accommodations. The answer is Yes or No, so the variable is treated as categorical.

• Having been placed on probation indicates learning difficulties, whatever their cause. We have therefore included Probation in our initial set of potential predictor variables.

• Another set of courses, LegalAnalysis, AdvLegalPerf and AdvLegalAnalysis, has also been considered because, like CivPro, LP1 and LP2, these courses can affect the outcome of the bar exam. Although they are electives, they are of legal relevance. We want to know whether a student took each of them (Yes or No), so we have treated them as categorical variables.

• We also assumed that BarPrepCompany, which indicates the type of bar preparation course completed, is an interesting factor to study, so we have included it in our initial set as a factor variable.

• PctBarPrepComplete, the percentage of the bar preparation course completed, may also positively influence bar exam success. The number of bar preparation workshops attended, NumPrepWorkshops, is also available and is similar to PctBarPrepComplete: one describes preparatory courses and the other preparatory workshops. To avoid potential correlation between two predictor variables, we decided to keep only the primary one.

• Participation in an academic support program has also been assumed to be potentially relevant, so we have included it as a categorical variable.

• Last but not least at this stage of our analysis, the variable BarPrepMentor indicates whether the student used a mentor in preparing for the exam. We assumed this to be relevant and have included it in our initial model as a categorical variable.
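The Bernoulli assumption in the first bullet can be written out explicitly. The notation below ($Y_i$, $p_i$, $x_{ij}$, $\beta_j$, $k$) is ours and is introduced only for illustration; it is the standard form of a logistic regression for a binary response, not a formula taken from the dataset documentation.

```latex
Y_i \sim \mathrm{Bernoulli}(p_i), \qquad
\log\!\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}
```

Here $Y_i = 1$ if student $i$ passed the bar exam and $0$ otherwise, $p_i$ is that student's probability of passing, and $x_{ij}$ is the value of predictor $j$ for student $i$. Each coefficient $\beta_j$ is a change in the log-odds of passing, so $e^{\beta_j}$ is an odds ratio.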

3 DATA PREPARATION PROCESS

3.1 Download and create new dataset

library(tidyr)
library(dplyr) 
library(stringr)
library(ggplot2)
library(ggpubr)

# Download the dataset from GitHub
df<-read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/refs/heads/main/Updated_Bar_Data_For_Review_Final.csv")

# Define a new dataset that excludes the bar exam score columns (MPRE, MPT, MEE, Writtenscalescore, MBE, UBE)
new_df <- data.frame(df$PassFail, df$Age, df$LSAT, df$UGPA, df$CivPro, df$LPI, df$LPII, df$GPA_1L,
                 df$GPA_Final, df$FinalRankPercentile, df$Accommodations, df$Probation,
                 df$LegalAnalysis_TexasPractice, df$AdvLegalPerfSkills, df$AdvLegalAnalysis, df$BarPrepCompany,
                 df$BarPrepCompletion, df$X.LawSchoolBarPrepWorkshops, df$StudentSuccessInitiative,
                 df$BarPrepMentor) 

3.2 Data cleaning

# Clean the dataset

clear_df <- na.omit(new_df)  # drop rows with missing values
colSums(is.na(clear_df))     # confirm that no missing values remain
##                    df.PassFail                         df.Age 
##                              0                              0 
##                        df.LSAT                        df.UGPA 
##                              0                              0 
##                      df.CivPro                         df.LPI 
##                              0                              0 
##                        df.LPII                      df.GPA_1L 
##                              0                              0 
##                   df.GPA_Final         df.FinalRankPercentile 
##                              0                              0 
##              df.Accommodations                   df.Probation 
##                              0                              0 
## df.LegalAnalysis_TexasPractice          df.AdvLegalPerfSkills 
##                              0                              0 
##            df.AdvLegalAnalysis              df.BarPrepCompany 
##                              0                              0 
##           df.BarPrepCompletion df.X.LawSchoolBarPrepWorkshops 
##                              0                              0 
##    df.StudentSuccessInitiative               df.BarPrepMentor 
##                              0                              0
clear_df <- na.omit(clear_df) #clean missing value in the dataset 

# Remove the 'df.' prefix from all column names (the dot is escaped so it matches literally)
names(clear_df) <- str_remove(names(clear_df), "^df\\.")

3.3 Data transformation

# Transform our variable types

clear_df$PassFail <- as.factor(clear_df$PassFail)
# Course grades are treated as ordered factors. Any value that is not listed in
# `grade_levels` becomes NA (the lowercase "c" does not match an uppercase "C").
grade_levels <- c("A", "B+", "B", "C+", "c", "D+", "D", "F")
clear_df$CivPro <- factor(clear_df$CivPro, levels = grade_levels, ordered = TRUE)
clear_df$LPI <- factor(clear_df$LPI, levels = grade_levels, ordered = TRUE)
clear_df$LPII <- factor(clear_df$LPII, levels = grade_levels, ordered = TRUE)
clear_df$Accommodations <- as.factor(clear_df$Accommodations)
clear_df$Probation <- factor(clear_df$Probation, levels = c("Y","N"))
clear_df$LegalAnalysis_TexasPractice <- as.factor(clear_df$LegalAnalysis_TexasPractice)
clear_df$AdvLegalPerfSkills <- as.factor(clear_df$AdvLegalPerfSkills)
clear_df$AdvLegalAnalysis <- as.factor(clear_df$AdvLegalAnalysis)
clear_df$BarPrepCompany <- as.factor(clear_df$BarPrepCompany)
clear_df$StudentSuccessInitiative <- as.factor(clear_df$StudentSuccessInitiative)
clear_df$BarPrepMentor <- as.factor(clear_df$BarPrepMentor)
# Take a look at our dataset
summary(clear_df)
##  PassFail      Age             LSAT            UGPA           CivPro   
##  F: 51    Min.   :23.10   Min.   :145.0   Min.   :2.010   B      :141  
##  P:398    1st Qu.:26.70   1st Qu.:153.0   1st Qu.:3.260   B+     :106  
##           Median :28.20   Median :156.0   Median :3.500   A      : 77  
##           Mean   :29.08   Mean   :155.3   Mean   :3.454   C+     : 76  
##           3rd Qu.:30.10   3rd Qu.:157.0   3rd Qu.:3.710   D+     : 11  
##           Max.   :65.70   Max.   :168.0   Max.   :4.140   (Other):  2  
##                                                           NA's   : 36  
##       LPI           LPII         GPA_1L        GPA_Final    
##  B      :159   B      :118   Min.   :2.200   Min.   :2.440  
##  B+     :101   B+     :108   1st Qu.:2.783   1st Qu.:3.060  
##  C+     : 84   C+     : 77   Median :3.086   Median :3.280  
##  A      : 65   A      : 69   Mean   :3.091   Mean   :3.288  
##  D+     :  7   D+     :  5   3rd Qu.:3.391   3rd Qu.:3.530  
##  (Other):  5   (Other):  3   Max.   :4.000   Max.   :3.990  
##  NA's   : 28   NA's   : 69                                  
##  FinalRankPercentile Accommodations Probation  LegalAnalysis_TexasPractice
##  Min.   :0.0000      N:383          Y   : 37   N:207                      
##  1st Qu.:0.2700      Y: 66          N   :408   Y:242                      
##  Median :0.5300                     NA's:  4                              
##  Mean   :0.5137                                                           
##  3rd Qu.:0.7500                                                           
##  Max.   :0.9900                                                           
##                                                                           
##  AdvLegalPerfSkills AdvLegalAnalysis BarPrepCompany BarPrepCompletion
##  N:204              N:155            Barbri:230     Min.   :0.0200   
##  Y:245              Y:294            Helix :  1     1st Qu.:0.8000   
##                                      Kaplan: 14     Median :0.8900   
##                                      Themis:204     Mean   :0.8639   
##                                                     3rd Qu.:0.9800   
##                                                     Max.   :1.0000   
##                                                                      
##  X.LawSchoolBarPrepWorkshops StudentSuccessInitiative       BarPrepMentor
##  Min.   :0.000               N          :337          N            :347  
##  1st Qu.:0.000               Keffer     : 12          ColleenPotts :  5  
##  Median :1.000               Christopher: 11          DeirdreWard  :  5  
##  Mean   :1.559               Sherwin    :  9          ClayElliott  :  4  
##  3rd Qu.:3.000               Saavedra   :  8          AshleyPirtle :  3  
##  Max.   :5.000               Baldwin    :  7          AshleySanders:  3  
##                              (Other)    : 65          (Other)      : 82
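The summary reports NA values for CivPro, LPI and LPII even though `na.omit()` was applied earlier. One possible explanation, which we have not verified against the raw file, is that `factor()` converts any value absent from `levels` to NA. The sketch below shows this behavior on a few toy grades; the values are hypothetical and are not taken from the bar dataset.

```r
# Toy grades (hypothetical values, not taken from the bar dataset)
grades <- c("A", "B+", "C", "D+")

# Same level set as in the transformation step, including the lowercase "c"
grade_levels <- c("A", "B+", "B", "C+", "c", "D+", "D", "F")
grade_factor <- factor(grades, levels = grade_levels, ordered = TRUE)

# Values that are not listed in `levels` are silently converted to NA
missing_after <- is.na(grade_factor)
print(data.frame(grade = grades, is_na = missing_after))
```

In this toy example only the uppercase "C" is lost, because the level set contains "c" instead.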

4 EXPLORATORY DATA ANALYSIS

4.1 Response variable analysis

Before modeling, we first examine the distribution of our target variable, PassFail, which indicates whether students passed (P) or failed (F) the bar exam.

4.1.1 Bar passage rate

#calculate the passage rate 
pass_rate <- clear_df %>% 
  count(PassFail) %>% 
  mutate(percentage = n / sum(n) * 100)
pass_rate
##   PassFail   n percentage
## 1        F  51   11.35857
## 2        P 398   88.64143

4.1.1.1 Interpretation

F (fail) accounts for 51 students (11.4%) and P (pass) for 398 students (88.6%), so a strong majority of students in the dataset passed the bar exam.

4.1.2 Visualize distribution

#plot PassFail rate 
ggplot(clear_df, aes(x = PassFail, fill = PassFail)) +
  geom_bar() +
  labs(title = "Bar Passage Outcomes", x = "Outcome", y = "Count")

4.1.2.1 Interpretation

88.6% of students (398 individuals) passed the bar exam. This high pass rate suggests either an academically strong cohort of students or an effective law school curriculum and bar preparation program. Only 11.4% of students (51 individuals) failed the bar exam; while relatively small, this group represents a significant opportunity for intervention.

4.2 Continuous predictor analysis

Let’s look at trends in academic performance.

4.2.1 Plot key continuous predictors

# Plot key continuous predictors
p1 <- ggplot(clear_df, aes(x = PassFail, y = LSAT, fill = factor(PassFail))) +
  geom_boxplot() +
  labs(title = "LSAT Scores by Bar Outcome")
p2 <- ggplot(clear_df, aes(x = PassFail, y = GPA_Final, fill = factor(PassFail))) +
  geom_boxplot() +
  labs(title = "Final GPA by Bar Outcome")

ggarrange(p1, p2, ncol = 2)

4.2.2 Interpretation

Passing students tend to have higher LSAT scores and final GPAs.

In the plot on the left, the median LSAT for passers (“P”) is likely around 155-157, while the median for failers (“F”) is likely 3-5 points lower (150-153). This gap suggests that LSAT scores are a moderate predictor of bar success: LSAT correlates with bar passage but is not deterministic.

In the plot on the right, the median final GPA for passers (“P”) is likely 3.3–3.5, and the median for failers (“F”) is likely 0.3–0.5 points lower. Failers may include students with very low GPAs, while passers rarely dip below 3.0. GPA appears to be a stronger predictor than LSAT, and a GPA of 3.0 or higher appears critical for success. Although GPA and LSAT explain part of the difference between passers and failers, the two alone do not fully explain outcomes; other factors may intervene.
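The medians above are read off the boxplots, so they are approximate. A minimal sketch of how group medians could be computed directly is given below. It uses a small toy data frame with hypothetical values (the name `toy_df` and its contents are ours) so that it runs on its own; the same call should work on `clear_df`, assuming the column names used in this report.

```r
# Toy data (hypothetical values, not the bar dataset)
toy_df <- data.frame(
  PassFail  = c("P", "P", "P", "F", "F", "F"),
  LSAT      = c(155, 157, 160, 150, 152, 153),
  GPA_Final = c(3.3, 3.5, 3.6, 2.9, 3.0, 3.2)
)

# Median of each continuous predictor within each outcome group
group_medians <- aggregate(cbind(LSAT, GPA_Final) ~ PassFail,
                           data = toy_df, FUN = median)
print(group_medians)
```

Base R `aggregate()` is used here only to keep the sketch free of package dependencies.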

4.3 Categorical Predictor Analysis

4.3.1 Plot key categorical predictors

p3 <- ggplot(clear_df, aes(x = Probation, fill = factor(PassFail))) +
  geom_bar(position = "fill") +
  labs(title = "Bar Passage by Probation Status")

p4 <- ggplot(clear_df, aes(x = BarPrepMentor, fill = factor(PassFail))) +
  geom_bar(position = "fill") +
  labs(title = "Bar Passage by Mentor Status")

ggarrange(p3, p4, ncol = 2)

4.3.2 Interpretation

Students on probation typically have GPAs below 2.0–2.5, indicating persistent struggles with legal coursework (probation = failure of academic integration).

Mentors who recently passed the bar offer credible, relatable advice (e.g., how to approach exam questions) and can teach candidates to identify early signs of struggle (mentorship = belonging and guidance).
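The stacked bars drawn with `position = "fill"` show the share of each outcome within a group. As a hedged illustration, the same shares can be tabulated with row proportions. The toy data frame below is hypothetical (its name and values are ours) and only demonstrates the calculation.

```r
# Toy data (hypothetical values, not the bar dataset)
toy_df <- data.frame(
  Probation = c("Y", "Y", "Y", "Y", "N", "N", "N", "N", "N", "N"),
  PassFail  = c("F", "F", "P", "P", "P", "P", "P", "P", "P", "F")
)

# Cross-tabulate counts, then convert each row to proportions
counts <- table(toy_df$Probation, toy_df$PassFail)
pass_share <- prop.table(counts, margin = 1)
print(round(pass_share, 2))
```

Each row of `pass_share` sums to 1, which matches what the filled bar chart displays for each group.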

5 MODEL BUILDING

5.1 Full model

To identify key predictors of bar exam success, we begin by fitting a full logistic regression model using all available variables. This initial model serves as a baseline for evaluating predictor significance and guiding subsequent refinement through stepwise selection.

# Convert PassFail to binary (1 for Pass, 0 for Fail)
clear_df$PassFail <- as.numeric(clear_df$PassFail == "P")
# Keep complete cases only
model_data <- na.omit(clear_df)
# Fit the full logistic regression model on all predictors
full_model <- glm(PassFail ~ ., 
                  family = binomial(link = "logit"), 
                  data = model_data)

# Model summary
summary(full_model)
## 
## Call:
## glm(formula = PassFail ~ ., family = binomial(link = "logit"), 
##     data = model_data)
## 
## Coefficients:
##                                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)                         -2.776e+03  1.962e+06  -0.001    0.999
## Age                                  1.828e+00  4.224e+03   0.000    1.000
## LSAT                                 1.281e+01  8.704e+03   0.001    0.999
## UGPA                                 6.148e+01  7.769e+04   0.001    0.999
## CivPro.L                            -2.239e+01  4.170e+05   0.000    1.000
## CivPro.Q                            -1.166e+02  3.086e+05   0.000    1.000
## CivPro.C                             9.158e+00  5.049e+05   0.000    1.000
## CivPro^4                            -2.987e+01  4.898e+05   0.000    1.000
## CivPro^5                            -9.605e-01  3.260e+05   0.000    1.000
## CivPro^6                             3.856e+01  2.094e+05   0.000    1.000
## LPI.L                                2.110e+01  3.800e+05   0.000    1.000
## LPI.Q                               -1.290e+01  2.894e+05   0.000    1.000
## LPI.C                               -3.850e+01  4.412e+05   0.000    1.000
## LPI^4                               -5.727e+01  4.314e+05   0.000    1.000
## LPI^5                               -3.740e+01  2.338e+05   0.000    1.000
## LPII.L                               1.427e+02  2.885e+05   0.000    1.000
## LPII.Q                               2.708e+01  1.866e+05   0.000    1.000
## LPII.C                               8.864e+01  2.451e+05   0.000    1.000
## LPII^4                               4.821e+01  2.347e+05   0.000    1.000
## LPII^5                               5.116e+01  1.294e+05   0.000    1.000
## GPA_1L                               1.318e+02  1.550e+05   0.001    0.999
## GPA_Final                           -7.442e+00  4.311e+05   0.000    1.000
## FinalRankPercentile                  7.955e+01  4.145e+05   0.000    1.000
## AccommodationsY                     -1.366e+01  9.136e+04   0.000    1.000
## ProbationN                          -7.968e+01  2.114e+05   0.000    1.000
## LegalAnalysis_TexasPracticeY        -3.386e+01  6.933e+04   0.000    1.000
## AdvLegalPerfSkillsY                 -5.742e-01  1.052e+05   0.000    1.000
## AdvLegalAnalysisY                    2.934e+00  4.521e+04   0.000    1.000
## BarPrepCompanyHelix                  3.862e+01  3.683e+05   0.000    1.000
## BarPrepCompanyKaplan                -7.907e+01  1.001e+05  -0.001    0.999
## BarPrepCompanyThemis                 3.791e+01  7.220e+04   0.001    1.000
## BarPrepCompletion                    2.089e+02  9.402e+04   0.002    0.998
## X.LawSchoolBarPrepWorkshops          5.895e-01  1.738e+04   0.000    1.000
## StudentSuccessInitiativeAycock       4.696e+01  4.133e+05   0.000    1.000
## StudentSuccessInitiativeBaldwin      6.588e-01  5.083e+05   0.000    1.000
## StudentSuccessInitiativeBeyer       -4.626e+01  1.239e+06   0.000    1.000
## StudentSuccessInitiativeChapman      7.559e+01  4.414e+05   0.000    1.000
## StudentSuccessInitiativeChristopher  1.324e+02  2.189e+06   0.000    1.000
## StudentSuccessInitiativeCochran      2.763e+01  5.696e+05   0.000    1.000
## StudentSuccessInitiativeCorn         5.167e+01  4.840e+05   0.000    1.000
## StudentSuccessInitiativeGonzalez    -1.642e+01  4.574e+05   0.000    1.000
## StudentSuccessInitiativeHardberger  -2.850e+01  7.522e+05   0.000    1.000
## StudentSuccessInitiativeHumphrey     5.541e+01  5.480e+07   0.000    1.000
## StudentSuccessInitiativeKeffer       6.116e+01  2.878e+05   0.000    1.000
## StudentSuccessInitiativeLauriat     -4.533e+00  3.274e+05   0.000    1.000
## StudentSuccessInitiativeLux         -8.721e+01  3.986e+05   0.000    1.000
## StudentSuccessInitiativeMcDonald     8.733e+01  5.960e+05   0.000    1.000
## StudentSuccessInitiativeN            8.072e+01  3.018e+05   0.000    1.000
## StudentSuccessInitiativeRosen        9.118e+01  3.523e+05   0.000    1.000
## StudentSuccessInitiativeRSherwin     1.143e+02  3.226e+05   0.000    1.000
## StudentSuccessInitiativeSaavedra    -1.075e+00  7.767e+05   0.000    1.000
## StudentSuccessInitiativeSherwin      2.124e+02  3.330e+05   0.001    0.999
## StudentSuccessInitiativeSmith        1.002e+02  3.614e+05   0.000    1.000
## BarPrepMentorAmberBeard              8.183e+01  5.293e+05   0.000    1.000
## BarPrepMentorAmberRich              -8.478e+01  5.304e+05   0.000    1.000
## BarPrepMentorAshleyPirtle           -1.353e+02  6.226e+05   0.000    1.000
## BarPrepMentorAshleySanders           2.865e+01  4.036e+05   0.000    1.000
## BarPrepMentorBrendaJohnson           5.660e+01  5.259e+05   0.000    1.000
## BarPrepMentorBryanGreer             -1.018e+01  5.254e+05   0.000    1.000
## BarPrepMentorCadyMello              -1.032e+02  1.574e+06   0.000    1.000
## BarPrepMentorChrisRhodes             7.792e+01  5.418e+05   0.000    1.000
## BarPrepMentorClayElliott            -7.071e+01  4.589e+05   0.000    1.000
## BarPrepMentorColeShooter            -4.321e+01  4.650e+05   0.000    1.000
## BarPrepMentorColleenByrom            1.551e+01  5.701e+05   0.000    1.000
## BarPrepMentorColleenElbe(Potts)     -1.813e+02  5.570e+05   0.000    1.000
## BarPrepMentorColleenPotts           -1.153e+02  4.430e+05   0.000    1.000
## BarPrepMentorDanielleSaavedra       -5.354e+01  4.426e+05   0.000    1.000
## BarPrepMentorDavidHutchens           3.708e+01  5.139e+05   0.000    1.000
## BarPrepMentorDavidRice              -1.973e+01  8.652e+05   0.000    1.000
## BarPrepMentorDeirdreWard            -1.701e+01  3.846e+05   0.000    1.000
## BarPrepMentorDenetteVaughn          -1.923e+02  5.399e+05   0.000    1.000
## BarPrepMentorGrantCoffey            -1.753e+02  5.246e+05   0.000    1.000
## BarPrepMentorHaleyHickey             2.370e+01  4.819e+05   0.000    1.000
## BarPrepMentorHolleyMcDaniel         -3.638e+01  4.448e+05   0.000    1.000
## BarPrepMentorHollyHaseloff          -4.210e+01  4.369e+05   0.000    1.000
## BarPrepMentorHoltonWestbrook        -8.914e+01  5.239e+05   0.000    1.000
## BarPrepMentorJacquelynnMayes        -9.132e+01  5.276e+05   0.000    1.000
## BarPrepMentorJessicaAycock          -1.864e+02  5.515e+05   0.000    1.000
## BarPrepMentorJohnMoore              -9.737e+01  5.214e+05   0.000    1.000
## BarPrepMentorJordanChavez           -1.397e+02  5.128e+05   0.000    1.000
## BarPrepMentorJosephAustin           -9.056e+01  5.480e+07   0.000    1.000
## BarPrepMentorJulieDavis             -3.598e+01  5.109e+05   0.000    1.000
## BarPrepMentorJustinPlescha          -1.033e+01  4.349e+05   0.000    1.000
## BarPrepMentorKathleenGoegel         -3.263e+02  5.417e+05  -0.001    1.000
## BarPrepMentorKatyCrocker             1.132e+02  5.835e+05   0.000    1.000
## BarPrepMentorKimberlyKelley         -3.836e+01  5.259e+05   0.000    1.000
## BarPrepMentorLauraFidelie           -1.415e+02  5.161e+05   0.000    1.000
## BarPrepMentorLauraMcDivitt           2.320e+01  5.107e+05   0.000    1.000
## BarPrepMentorLaurenWelch            -1.400e+02  5.180e+05   0.000    1.000
## BarPrepMentorLeenaAl-Souki          -4.049e+01  5.480e+07   0.000    1.000
## BarPrepMentorMadelynDeviney         -9.884e+01  4.433e+05   0.000    1.000
## BarPrepMentorMariaOviedo            -1.931e+01  6.814e+05   0.000    1.000
## BarPrepMentorMelissaWaggoner        -2.272e+01  5.509e+05   0.000    1.000
## BarPrepMentorMerylBenham             5.990e+00  4.418e+05   0.000    1.000
## BarPrepMentorMichaelEconomidis      -5.476e+01  3.877e+05   0.000    1.000
## BarPrepMentorMistyPratt             -7.841e+01  7.000e+05   0.000    1.000
## BarPrepMentorMonicaReyes             2.330e+01  5.769e+05   0.000    1.000
## BarPrepMentorN                       2.777e+01  3.788e+05   0.000    1.000
## BarPrepMentorPaulaMilan             -5.839e+01  4.745e+07   0.000    1.000
## BarPrepMentorPaulBarkhurst          -2.173e+02  7.428e+05   0.000    1.000
## BarPrepMentorQuentinWetsel          -5.852e+01  5.183e+05   0.000    1.000
## BarPrepMentorScottKeffer            -1.458e+02  5.259e+05   0.000    1.000
## BarPrepMentorScoutBlosser           -1.639e+02  5.683e+05   0.000    1.000
## BarPrepMentorTasiaEaslon            -7.416e+01  5.904e+05   0.000    1.000
## BarPrepMentorTomHall                -4.313e+01  5.206e+05   0.000    1.000
## BarPrepMentorTravisWeibold          -2.054e+02  5.446e+05   0.000    1.000
## BarPrepMentorTylynnPayne            -1.014e+02  5.342e+05   0.000    1.000
## BarPrepMentorVictoriaWhitehead      -1.929e+01  5.108e+05   0.000    1.000
## BarPrepMentorVictorMellinger        -7.847e+01  5.382e+05   0.000    1.000
## BarPrepMentorWillRaftis             -1.290e+02  4.642e+05   0.000    1.000
## BarPrepMentorY-DanielleSaavedra     -7.018e+01  5.278e+05   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.8758e+02  on 333  degrees of freedom
## Residual deviance: 2.6310e-08  on 223  degrees of freedom
## AIC: 222
## 
## Number of Fisher Scoring iterations: 25

5.1.1 Interpretation

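The full model is not usable for inference. All p-values are $\approx$ 1.000, the standard errors are several orders of magnitude larger than the estimates, and the residual deviance is essentially 0 (2.6310e-08). With 111 estimated coefficients for 334 observations, the model separates passers from failers perfectly, which is a sign of overfitting and not of good fit. The AIC of 222 is simply twice the number of coefficients (2 × 111), so it is not a meaningful measure of fit here.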
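A near-zero residual deviance combined with very large standard errors is the typical signature of complete separation in logistic regression. The sketch below reproduces that pattern on a tiny toy dataset in which `x` splits the outcome perfectly. The data and the names `toy_df` and `sep_fit` are hypothetical and serve only as an illustration; this is not a diagnosis of the bar dataset itself.

```r
# Toy data (hypothetical): y is 0 for x <= 3 and 1 for x >= 4
toy_df <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(0, 0, 0, 1, 1, 1)
)

# glm() warns here that fitted probabilities are numerically 0 or 1;
# the warnings are suppressed only to keep the output short
sep_fit <- suppressWarnings(
  glm(y ~ x, family = binomial(link = "logit"), data = toy_df)
)

print(deviance(sep_fit))               # residual deviance, effectively zero
print(AIC(sep_fit))                    # close to 2 * (number of coefficients)
print(summary(sep_fit)$coefficients)   # note the size of the standard errors
```

With two coefficients and a deviance near zero, the AIC lands close to 4, the same mechanism that yields 222 for the full model.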

5.2 Evaluation of models using Stepwise selection based on AIC

To identify the most impactful predictors while maintaining model parsimony, we perform stepwise AIC-based selection on the full model.

library(MASS) # Provides stepAIC()
# Perform stepwise selection in both directions, limited to 9 steps

step_results <- stepAIC(full_model, direction = "both", trace = FALSE, steps=9)

# Extract the first 9 rows of the stepwise path
step_path <- head(step_results$anova, 9)

step_path
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## PassFail ~ Age + LSAT + UGPA + CivPro + LPI + LPII + GPA_1L + 
##     GPA_Final + FinalRankPercentile + Accommodations + Probation + 
##     LegalAnalysis_TexasPractice + AdvLegalPerfSkills + AdvLegalAnalysis + 
##     BarPrepCompany + BarPrepCompletion + X.LawSchoolBarPrepWorkshops + 
##     StudentSuccessInitiative + BarPrepMentor
## 
## Final Model:
## PassFail ~ LSAT + UGPA + LPI + LPII + GPA_1L + GPA_Final + Probation + 
##     LegalAnalysis_TexasPractice + BarPrepCompany + BarPrepCompletion + 
##     StudentSuccessInitiative
## 
## 
##                            Step Df     Deviance Resid. Df   Resid. Dev
## 1                                                     223 2.630955e-08
## 2               - BarPrepMentor 58 3.818392e-08       281 6.449346e-08
## 3                      - CivPro  6 2.260952e-07       287 2.905887e-07
## 4 - X.LawSchoolBarPrepWorkshops  1 5.447172e-09       288 2.960359e-07
## 5          - AdvLegalPerfSkills  1 6.465908e-09       289 3.025018e-07
## 6         - FinalRankPercentile  1 6.534497e-09       290 3.090363e-07
## 7              - Accommodations  1 1.025674e-08       291 3.192930e-07
## 8            - AdvLegalAnalysis  1 2.213149e-05       292 2.245078e-05
## 9                         - Age  1 3.612132e-04       293 3.836640e-04
##         AIC
## 1 222.00000
## 2 106.00000
## 3  94.00000
## 4  92.00000
## 5  90.00000
## 6  88.00000
## 7  86.00000
## 8  84.00002
## 9  82.00038

5.2.1 Conclusion

Stepwise selection started with 19 predictors and removed 8 variables, leaving 11. AIC decreased from 222 to 82. Because the residual deviance stays essentially zero along the whole path, this decrease reflects the smaller number of estimated coefficients and not an improvement in fit. BarPrepMentor was dropped first and Age last; among the removed variables, dropping Age changed the deviance the most (3.61e-04). Academic metrics such as CivPro and FinalRankPercentile were removed earlier in the stepwise process, likely because of their negligible impact on the response (deviance changes of 2.26e-07 and 6.53e-09).

The final selected model regresses PassFail on LSAT + UGPA + LPI + LPII + GPA_1L + GPA_Final + Probation + LegalAnalysis_TexasPractice + BarPrepCompany + BarPrepCompletion + StudentSuccessInitiative

For this final model, the residual deviance rose slightly from 2.63e-08 to 3.84e-04 and remains essentially zero, the AIC is 82 (lower is better), and the model uses 11 predictors instead of the original 19.
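The AIC values in the stepwise table can be checked by hand. For a 0/1 response the saturated log-likelihood is zero, so the residual deviance $D$ equals $-2\log L$ and the AIC reduces to $D + 2k$. The symbols $D$ and $k$ are our notation; $k$, the number of estimated coefficients, is recovered from the output as the number of observations (334) minus the residual degrees of freedom.

```latex
\begin{aligned}
\mathrm{AIC} &= -2\log L + 2k = D + 2k \\
\mathrm{AIC}_{\text{initial}} &\approx 2.63\times 10^{-8} + 2\,(334 - 223) = 2.63\times 10^{-8} + 2(111) \approx 222 \\
\mathrm{AIC}_{\text{final}} &\approx 3.84\times 10^{-4} + 2\,(334 - 293) = 3.84\times 10^{-4} + 2(41) \approx 82.0004
\end{aligned}
```

Both results agree with the AIC column of the table (222.00000 and 82.00038).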

5.3 Final Model Analysis

To analyze our model, we convert LPI and LPII to numeric variables and collapse StudentSuccessInitiative into fewer categories.

# Create a new dataset for the final analysis
analysis_df <- clear_df
# Map letter grades to grade points to convert the ordinal variables to numeric.
# Grades that are absent from this map (e.g. "D+") become NA.
grade_map <- c("A" = 4, "A-" = 3.7, "B+" = 3.3, "B" = 3.0, "B-" = 2.7, 
               "C+" = 2.3, "C" = 2.0, "D" = 1.0, "F" = 0)

analysis_df$LPI <- unname(grade_map[as.character(analysis_df$LPI)])
analysis_df$LPII <- unname(grade_map[as.character(analysis_df$LPII)])

# Keep the three most frequent StudentSuccessInitiative levels and pool the rest
Initiative_group <- names(sort(table(analysis_df$StudentSuccessInitiative), decreasing = TRUE)[1:3])
analysis_df$StudentSuccessInitiative <- ifelse(
  analysis_df$StudentSuccessInitiative %in% Initiative_group,
  as.character(analysis_df$StudentSuccessInitiative),
  "Other_Programs"
) %>% factor()

5.3.1 Analysis

Having refined our predictors through stepwise selection, we now analyze the final logistic regression model to identify which factors significantly impact bar exam success. This model balances complexity and predictive power by retaining only the most influential variables.

final_model <- glm(PassFail ~ LSAT + UGPA + LPI + LPII + GPA_1L + GPA_Final + Probation + 
  LegalAnalysis_TexasPractice + BarPrepCompany + BarPrepCompletion + 
  StudentSuccessInitiative, 
  family = binomial(link = "logit"),
  data = analysis_df)

summary(final_model)
## 
## Call:
## glm(formula = PassFail ~ LSAT + UGPA + LPI + LPII + GPA_1L + 
##     GPA_Final + Probation + LegalAnalysis_TexasPractice + BarPrepCompany + 
##     BarPrepCompletion + StudentSuccessInitiative, family = binomial(link = "logit"), 
##     data = analysis_df)
## 
## Coefficients:
##                                          Estimate Std. Error z value Pr(>|z|)
## (Intercept)                             -77.89262   18.25574  -4.267 1.98e-05
## LSAT                                      0.36190    0.09740   3.716 0.000203
## UGPA                                      0.62718    0.79922   0.785 0.432610
## LPI                                      -0.12250    0.51236  -0.239 0.811036
## LPII                                     -2.40697    0.69935  -3.442 0.000578
## GPA_1L                                    5.93756    1.89619   3.131 0.001740
## GPA_Final                                 1.67400    2.04473   0.819 0.412964
## ProbationN                               -0.46980    0.86338  -0.544 0.586342
## LegalAnalysis_TexasPracticeY             -0.38584    0.60509  -0.638 0.523701
## BarPrepCompanyHelix                      15.83063 1455.39804   0.011 0.991321
## BarPrepCompanyKaplan                     -3.21891    1.12366  -2.865 0.004175
## BarPrepCompanyThemis                      1.39532    0.61631   2.264 0.023576
## BarPrepCompletion                         7.62277    1.87890   4.057 4.97e-05
## StudentSuccessInitiativeKeffer            0.01598    1.43917   0.011 0.991141
## StudentSuccessInitiativeN                 1.61866    1.17559   1.377 0.168545
## StudentSuccessInitiativeOther_Programs    0.75744    1.14088   0.664 0.506748
##                                           
## (Intercept)                            ***
## LSAT                                   ***
## UGPA                                      
## LPI                                       
## LPII                                   ***
## GPA_1L                                 ** 
## GPA_Final                                 
## ProbationN                                
## LegalAnalysis_TexasPracticeY              
## BarPrepCompanyHelix                       
## BarPrepCompanyKaplan                   ** 
## BarPrepCompanyThemis                   *  
## BarPrepCompletion                      ***
## StudentSuccessInitiativeKeffer            
## StudentSuccessInitiativeN                 
## StudentSuccessInitiativeOther_Programs    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 236.67  on 351  degrees of freedom
## Residual deviance: 110.38  on 336  degrees of freedom
##   (97 observations deleted due to missingness)
## AIC: 142.38
## 
## Number of Fisher Scoring iterations: 14

5.3.1.1 Interpretation

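Significant variables (each effect holds the other predictors fixed):

  • LSAT: p-value = 0.0002; a one-point increase in LSAT score increases the odds of passing the bar by approximately 44% (exp(0.3619) ≈ 1.44).

  • GPA_1L: p-value = 0.0017; a one-unit increase in first-year GPA multiplies the odds of passing the bar by roughly 379 (exp(5.938) ≈ 379). A full GPA point is a very large step; a 0.1-point increase corresponds to about an 81% increase in the odds (exp(0.594) ≈ 1.81).

  • LPII: p-value = 0.0006; each one-unit increase in LPII (converted from grade) decreases the odds of passing the bar by about 91% (exp(-2.407) ≈ 0.09).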

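  • BarPrepCompletion: p-value = 0.00005; a one-unit increase in BarPrepCompletion multiplies the odds of passing by roughly 2,044 (exp(7.623) ≈ 2,044). If the variable is coded as a proportion from 0 to 1, one unit is the difference between completing none of the bar prep course and all of it.

  • BarPrepCompany: relative to the reference company, Kaplan lowers the odds of passing by about 96% (p-value = 0.0042; exp(-3.219) ≈ 0.04) and Themis multiplies them by about 4 (p-value = 0.0236; exp(1.395) ≈ 4.04). The Helix estimate (15.83, standard error 1455.40) is not interpretable; an estimate this large with such a standard error usually indicates that every Helix student in the sample had the same outcome.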
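
The odds figures above come from exponentiating the logit coefficients. The following is a minimal sketch of that arithmetic, with the estimates copied by hand from the summary(final_model) output. The names est, odds_ratio, and pct_change are illustrative only, and the 0.1-unit step for BarPrepCompletion assumes the variable is coded as a proportion between 0 and 1, which the output does not confirm.

```r
# Estimates copied from the summary(final_model) output above
est <- c(LSAT = 0.36190, LPII = -2.40697, GPA_1L = 5.93756,
         BarPrepCompletion = 7.62277)

odds_ratio <- exp(est)                # multiplicative change in odds per one-unit increase
pct_change <- 100 * (odds_ratio - 1)  # the same effect expressed as a percent change

print(round(cbind(odds_ratio, pct_change), 3))

# A one-unit step can be too coarse to interpret. Rescale the coefficient before
# exponentiating, e.g. a 0.1 increase in BarPrepCompletion (10 percentage points
# if the variable is a 0-1 proportion; this coding is an assumption).
or_completion_tenth <- exp(0.1 * est[["BarPrepCompletion"]])
print(round(or_completion_tenth, 2))
```

With the fitted object in the session, exp(coef(final_model)) returns the same odds ratios for every term at once.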

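Key Model Metrics

-AIC: 142.38, lower than the intercept-only (null) model's AIC of 238.67 (null deviance + 2 for the single intercept), which indicates a better fit.

-Null Deviance: 236.67, the baseline error of the intercept-only model (higher = worse fit).

-Residual Deviance: 110.38, 53% lower than the null deviance, which indicates a good fit.

-The difference between the null deviance and the residual deviance (236.67 - 110.38 = 126.29) is greater than the critical value qchisq(.95, 15) ≈ 25.00, where 15 = 351 - 336 is the difference in degrees of freedom, so the model as a whole is statistically significant.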
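
The overall significance check is a likelihood-ratio (deviance) test whose degrees of freedom equal the difference between the null and residual degrees of freedom, 351 - 336 = 15. A minimal sketch using only the numbers printed by summary(final_model); the variable names are illustrative:

```r
# Deviances and degrees of freedom copied from summary(final_model)
null_dev  <- 236.67; null_df  <- 351
resid_dev <- 110.38; resid_df <- 336

lr_stat  <- null_dev - resid_dev      # likelihood-ratio (deviance) statistic
lr_df    <- null_df - resid_df        # number of estimated slope coefficients
critical <- qchisq(0.95, df = lr_df)  # 5% critical value of the chi-square distribution
p_value  <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

cat("LR statistic:", lr_stat, "on", lr_df, "df\n")
cat("Critical value:", round(critical, 2), "\n")
cat("p-value:", signif(p_value, 3), "\n")
cat("Deviance reduction:", round(100 * lr_stat / null_dev, 1), "%\n")
```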

library(car)
vif(final_model)
##                                 GVIF Df GVIF^(1/(2*Df))
## LSAT                        1.785294  1        1.336149
## UGPA                        1.389645  1        1.178832
## LPI                         1.579703  1        1.256862
## LPII                        2.670961  1        1.634308
## GPA_1L                      4.942028  1        2.223067
## GPA_Final                   3.550727  1        1.884337
## Probation                   1.939126  1        1.392525
## LegalAnalysis_TexasPractice 1.450698  1        1.204449
## BarPrepCompany              2.032850  3        1.125514
## BarPrepCompletion           1.947905  1        1.395674
## StudentSuccessInitiative    2.676942  3        1.178347

5.3.1.2 VIF conclusion

GPA_1L: GVIF = 4.94, a moderate concern. This variable is likely correlated with GPA_Final (GVIF = 3.55), as both measure academic performance over time. All other GVIF values are well below 5, so collinearity is not a concern for the remaining predictors.
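
For factors with more than one degree of freedom (BarPrepCompany and StudentSuccessInitiative), the raw GVIF is not on the same scale as an ordinary VIF, which is why car::vif also prints GVIF^(1/(2*Df)). A minimal sketch that reproduces that column from values copied out of the table above and squares it so that it can be read against a VIF-style cutoff; the cutoff of 5 is a common rule of thumb rather than a fixed standard, and the object names are illustrative:

```r
# GVIF and Df copied from the vif(final_model) output above (selected terms)
gvif <- c(LSAT = 1.785294, LPII = 2.670961, GPA_1L = 4.942028,
          GPA_Final = 3.550727, BarPrepCompany = 2.032850,
          StudentSuccessInitiative = 2.676942)
dfs  <- c(LSAT = 1, LPII = 1, GPA_1L = 1,
          GPA_Final = 1, BarPrepCompany = 3,
          StudentSuccessInitiative = 3)

adj       <- gvif^(1 / (2 * dfs))  # the GVIF^(1/(2*Df)) column reported by car::vif
vif_scale <- adj^2                 # equals GVIF for 1-df terms, so comparable to a VIF

print(round(cbind(gvif, dfs, adj, vif_scale), 3))

# Terms at or above the chosen cutoff (rule of thumb, assumed here to be 5)
flagged <- names(vif_scale)[vif_scale >= 5]
print(flagged)
```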

5.4 Refine model

For the refined model we drop UGPA, LPI, GPA_Final, Probation, LegalAnalysis_TexasPractice, StudentSuccessInitiative, and LPII.

refine_model <- glm(
  PassFail ~ LSAT + GPA_1L + BarPrepCompletion + BarPrepCompany,
  family = binomial,
  data = analysis_df)

summary(refine_model)
## 
## Call:
## glm(formula = PassFail ~ LSAT + GPA_1L + BarPrepCompletion + 
##     BarPrepCompany, family = binomial, data = analysis_df)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -36.67186    8.61617  -4.256 2.08e-05 ***
## LSAT                   0.15775    0.05257   3.001  0.00269 ** 
## GPA_1L                 3.09320    0.53538   5.778 7.58e-09 ***
## BarPrepCompletion      5.84549    1.22510   4.771 1.83e-06 ***
## BarPrepCompanyHelix   15.69717  882.74367   0.018  0.98581    
## BarPrepCompanyKaplan  -1.08325    0.78966  -1.372  0.17013    
## BarPrepCompanyThemis   1.26794    0.41465   3.058  0.00223 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 317.84  on 448  degrees of freedom
## Residual deviance: 220.38  on 442  degrees of freedom
## AIC: 234.38
## 
## Number of Fisher Scoring iterations: 13
vif(refine_model)
##                       GVIF Df GVIF^(1/(2*Df))
## LSAT              1.156420  1        1.075370
## GPA_1L            1.069834  1        1.034328
## BarPrepCompletion 1.352713  1        1.163062
## BarPrepCompany    1.344860  3        1.050621

5.4.1 Interpretation

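We now have a model in which LSAT, GPA_1L, BarPrepCompletion, and the Themis level of BarPrepCompany are significant, with no collinearity between predictors (all GVIF values are below 1.4). The AIC (234.38) is higher than in final_model (142.38), but the two values are not directly comparable: final_model was fit on 352 observations after 97 were deleted due to missingness, whereas refine_model uses all 449. The difference between the null deviance and the residual deviance (317.84 - 220.38 = 97.46) is greater than the critical value qchisq(.95, 6) ≈ 12.59, where 6 = 448 - 442 is the difference in degrees of freedom, which means our model is statistically significant.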
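
For a 0/1 response the reported AIC is the residual deviance plus twice the number of estimated coefficients, so it grows with the number of observations as well as with lack of fit. A minimal sketch that rebuilds both AIC values from the printed summaries and shows the sample size behind each; the data frame models and its column names are illustrative:

```r
# Residual deviance, coefficient count, and sample size from the two summaries above
models <- data.frame(
  model     = c("final_model", "refine_model"),
  resid_dev = c(110.38, 220.38),
  k         = c(16, 7),      # intercept plus slopes in each coefficient table
  n         = c(352, 449)    # null degrees of freedom + 1
)

# With a binary response the saturated log-likelihood is 0, so AIC = deviance + 2k
models$aic <- models$resid_dev + 2 * models$k
print(models)
```

Because the two models were fit to different rows, a fair AIC comparison would require refitting refine_model on the same 352 complete cases used by final_model.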

6 Conclusion

The refined model has AIC = 234.38, null deviance = 317.84, and residual deviance = 220.38. Its AIC is higher than the final model's (142.38), but the two models were fit on different numbers of observations (449 versus 352), so the AIC values are not directly comparable. The refined model simplifies the analysis to just four predictors while remaining statistically significant and reducing the null deviance by about 31%. It supports the conclusion that early academic strength, LSAT, and completing bar prep are key predictors of bar exam success. With fewer terms than the full model, it provides clarity, interpretability, and predictive insight without collinearity concerns.
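
To make the refined model concrete for the Dean's team, the coefficients can be turned into predicted pass probabilities. The sketch below is illustrative only: the two student profiles are hypothetical values, not rows from the dataset; BarPrepCompletion is assumed to be coded as a proportion from 0 to 1; and BarPrepCompany is held at its reference level, so no company term is added.

```r
# Coefficients copied from summary(refine_model); BarPrepCompany stays at its
# reference level, so no company term enters the linear predictor
b <- c(intercept = -36.67186, LSAT = 0.15775,
       GPA_1L = 3.09320, BarPrepCompletion = 5.84549)

# Hypothetical student profiles (illustrative values, not taken from the data)
students <- data.frame(LSAT = c(155, 155),
                       GPA_1L = c(3.0, 3.0),
                       BarPrepCompletion = c(0.5, 0.9))

# Linear predictor on the log-odds scale
eta <- b[["intercept"]] +
  b[["LSAT"]] * students$LSAT +
  b[["GPA_1L"]] * students$GPA_1L +
  b[["BarPrepCompletion"]] * students$BarPrepCompletion

students$p_pass <- plogis(eta)  # inverse logit: 1 / (1 + exp(-eta))
print(students)
```

With the fitted object and data available, predict(refine_model, newdata = ..., type = "response") performs the same calculation, provided newdata also supplies BarPrepCompany.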